Toit
#Getting watchdog up and running
Thread channel in help
theHuanter
theHuanter 11/06/2023 11:21 AM
Hey @floitsch and @bitphlipphar, I'm trying to run the watchdog implementation and get the following issues (the log file follows).

I guess there is some sort of race condition where the watchdog service is not yet running when my containers try to open it.
Should I use with_timeout to open the connection or is there a different approach?
theHuanter
theHuanter 11/06/2023 11:21 AM
this is the log
theHuanter
theHuanter 11/06/2023 11:27 AM
with_timeout makes it worse. It somehow can not find the service
******************************************************************************
Decoding by `jag`, device has version <2.0.0-alpha.118>
******************************************************************************
EXCEPTION error.
Cannot find service
  0: ServiceClient.open.<block>    <sdk>\system\services.toit:167:49
  1: ServiceClient.open            <sdk>\system\services.toit:176:40
  2: ServiceClient.open            <sdk>\system\services.toit:167:12
  3: main.<block>                  C:\Users\Mirko\AppData\Local\Temp\artemis-7ca7d9eb-1271-4aa9-a13a-d976cb1b8302\clone\src\services\mqtt\mqtt.toit:58:20
  4: Task_.with-deadline_.<block>  <sdk>\core\task.toit:203:16
  5: Task_.with-deadline_          <sdk>\core\task.toit:197:3
  6: with-timeout                  <sdk>\core\utils.toit:181:24
  7: with-timeout                  <sdk>\core\utils.toit:173:10
  8: main                          C:\Users\Mirko\AppData\Local\Temp\artemis-7ca7d9eb-1271-4aa9-a13a-d976cb1b8302\clone\src\services\mqtt\mqtt.toit:56:3
******************************************************************************
theHuanter
theHuanter 11/06/2023 11:28 AM
I install the provider in my ethernet container (it's the simplest one) like:

main:
  (provider.WatchdogServiceProvider).install
  logger.debug "Watchdog provider installed"
  watchdogclient := WatchdogServiceClient
  watchdogclient.open
  dog := watchdogclient.create "mqtt-dog"
  dog.start --s=60
  logger.debug "Watchdog started"
  dog.feed
  dog.stop
  dog.close

not sure if I need to call it once but I wanted to give it a try
floitsch
floitsch 11/06/2023 12:40 PM
Hmm. This looks like the watchdog provider failed to feed the system watchdog.
floitsch
floitsch 11/06/2023 12:40 PM
It should have a second to do that.
theHuanter
theHuanter 11/06/2023 12:50 PM
do I need to update anything else?
floitsch
floitsch 11/06/2023 12:50 PM
I'm testing your second program now.
theHuanter
theHuanter 11/06/2023 12:51 PM
with-timeout --ms=2000:
  watchdogclient := WatchdogServiceClient
  watchdogclient.open
  dog = watchdogclient.create "mqtt-dog"
  dog.start --s=60

I am calling it like so
theHuanter
theHuanter 11/06/2023 12:51 PM
and in my ble container and my mqtt container I do have a while true loop which simply sleeps 5 or 10s and there I am calling dog.feed (but I guess we never reach that part so far)
floitsch
floitsch 11/06/2023 12:52 PM
I did just find a bug.
floitsch
floitsch 11/06/2023 12:53 PM
Don't yet see how it could lead to the seen issue, though.
bitphlipphar
bitphlipphar 11/06/2023 12:54 PM
The default timeout for client.open is 100ms. You can specify a higher one like this watchdogclient.open --timeout=(Duration --s=1). Not sure if that is necessary.
👍1
floitsch
floitsch 11/06/2023 01:00 PM
Please update the package.
From your descriptions it doesn't fix every issue, but there is at least one less bug in it now.
theHuanter
theHuanter 11/06/2023 01:04 PM
jaguar is lying to me
theHuanter
theHuanter 11/06/2023 01:04 PM
$ jag pkg install watchdog
Info: Package 'github.com/toitware/[email protected]' installed with name 'watchdog'
theHuanter
theHuanter 11/06/2023 01:04 PM
in the package yaml and in the folder I still only see 1.0.1
bitphlipphar
bitphlipphar 11/06/2023 01:05 PM
jag pkg update?
bitphlipphar
bitphlipphar 11/06/2023 01:06 PM
(or uninstall first)
theHuanter
theHuanter 11/06/2023 01:06 PM
yea nevermind I was running it in a completely wrong folder 🙄
bitphlipphar
bitphlipphar 11/06/2023 01:06 PM
Even better 🙂
theHuanter
theHuanter 11/06/2023 01:10 PM
theHuanter
theHuanter 11/06/2023 01:10 PM
I removed the with_timeout part but this seems to be the same issue
theHuanter
theHuanter 11/06/2023 01:10 PM
I did not change the timeout
theHuanter
theHuanter 11/06/2023 01:12 PM
[eth] DEBUG: Watchdog provider installed
******************************************************************************
Decoding by `jag`, device has version <2.0.0-alpha.118>
******************************************************************************
EXCEPTION error.
Cannot find service
  0: ServiceClient.open.<block>    <sdk>\system\services.toit:167:49
  1: ServiceClient.open            <sdk>\system\services.toit:176:40
  2: ServiceClient.open            <sdk>\system\services.toit:167:12
  3: main                          C:\Users\Mirko\AppData\Local\Temp\artemis-cdea2632-796b-42c6-8473-9813f0c442d2\clone\src\ble.toit:35:18
******************************************************************************
[eth] DEBUG: Watchdog started

not sure at which part exactly it is crashing but it might be the provider install
floitsch
floitsch 11/06/2023 01:14 PM
Does it work to run the provider in the same container?
floitsch
floitsch 11/06/2023 01:14 PM
This code works for me:
import watchdog
import watchdog.provider

main:
  provider.main
  print "installed"
  client := watchdog.WatchdogServiceClient
  client.open
  dog := client.create "foo"
  dog.start --s=60
  print "started"
  dog.feed
  dog.stop
  print "stopped"
  dog.close
theHuanter
theHuanter 11/06/2023 01:15 PM
this is the client part - I think the provider install does not work already
floitsch
floitsch 11/06/2023 01:15 PM
This has the provider install in it.
floitsch
floitsch 11/06/2023 01:15 PM
(provider.main does an install)
theHuanter
theHuanter 11/06/2023 01:16 PM
ah ok
theHuanter
theHuanter 11/06/2023 01:25 PM
dogprovider.main
logger.debug "Watchdog provider installed"
watchdogclient := WatchdogServiceClient
watchdogclient.open
dog := watchdogclient.create "mqtt-dog"
dog.start --s=60
logger.debug "Watchdog started"
dog.feed
dog.stop
dog.close
theHuanter
theHuanter 11/06/2023 01:25 PM
this is working for me now - but I had to disable the clients in the other containers
theHuanter
theHuanter 11/06/2023 01:26 PM
if I try to open the watchdog service connection with the client in one of the other containers it crashes
floitsch
floitsch 11/06/2023 01:29 PM
interesting.
Let me try that. Clearly I didn't test it enough... 😦
floitsch
floitsch 11/06/2023 01:45 PM
For me things are working now.
I'm installing the provider in one container.
Then use the clients in other containers to get their dogs.
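The split described here can be sketched from the client side, using only calls that appear elsewhere in this thread. This is a minimal sketch, not a confirmed recipe: it assumes a separate container has already installed the provider (for example via `provider.main`), and the dog name "client-dog", the 2-second open timeout, and the 5-second feed interval are illustrative choices.

```toit
import watchdog

// Client container. Assumes another container has already installed the
// watchdog provider; this container only opens the service and feeds its dog.
main:
  client := watchdog.WatchdogServiceClient
  // The thread mentions a default open timeout of 100ms; a longer timeout
  // gives the provider container time to come up first (assumed value).
  client.open --timeout=(Duration --s=2)
  dog := client.create "client-dog"  // Hypothetical dog name.
  dog.start --s=60                   // At most 60s between feedings.
  while true:
    dog.feed
    sleep --ms=5000
```

Each client container would create its own dog under its own name, so that a stall in any one of them is detected separately.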
floitsch
floitsch 11/06/2023 01:45 PM
What kind of crashes do you get?
bitphlipphar
bitphlipphar 11/06/2023 01:49 PM
(just a thought: is there a risk that the ethernet container is doing other work for more than 1s thus starving the watchdog provider?)
floitsch
floitsch 11/06/2023 01:50 PM
I would hope that it yields at least once every two seconds.
floitsch
floitsch 11/06/2023 01:52 PM
Let me create a new version with more logger entries. That should help.
theHuanter
theHuanter 11/06/2023 02:05 PM
******************************************************************************
Decoding by `jag`, device has version <2.0.0-alpha.118>
******************************************************************************
EXCEPTION error.
Cannot find service
  0: ServiceClient.open.<block>    <sdk>\system\services.toit:167:49
  1: ServiceClient.open            <sdk>\system\services.toit:176:40
  2: ServiceClient.open            <sdk>\system\services.toit:167:12
  3: main                          C:\Users\Mirko\AppData\Local\Temp\artemis-616970cc-d0f6-46c7-88c7-f1db1182f3eb\clone\src\services\mqtt\mqtt.toit:57:18
******************************************************************************
theHuanter
theHuanter 11/06/2023 02:06 PM
I receive this still - but I think this is in the mqtt container
theHuanter
theHuanter 11/06/2023 02:11 PM
after the first part was working I added the watchdog open to the ble container and got this issue
theHuanter
theHuanter 11/06/2023 02:12 PM
I increased the timeout to 2s
theHuanter
theHuanter 11/06/2023 02:12 PM
with that it was not crashing - or at least it crashed now later with the watchdog triggering.. but I am not sure why tho
theHuanter
theHuanter 11/06/2023 02:16 PM
I am running this at the beginning of the main
watchdogclient := WatchdogServiceClient
watchdogclient.open --timeout=(Duration --s=2)
dog = watchdogclient.create "ble-dog"
dog.start --s=60

and then a bit later in my while loop:

while true:
  dog.feed
  sleep --ms=5000
that should actually not trigger the watchdog
floitsch
floitsch 11/06/2023 02:18 PM
Not sure it's relevant, but there is no support for uninstalling the watchdog provider.
If you kill that container (or install another one on top), then you will likely get a watchdog-trigger, since the original system-watchdog-loop isn't running anymore.
floitsch
floitsch 11/06/2023 02:19 PM
However that should only be an issue for one watchdog-reset. After that only the provider you installed should run.
floitsch
floitsch 11/06/2023 02:20 PM
This looks as if the watchdog timer was started, but then failed to run.
floitsch
floitsch 11/06/2023 02:20 PM
It has 2 seconds to do so. I don't see how a process would be delayed by that much.
theHuanter
theHuanter 11/06/2023 02:23 PM
it feeds the dog right after the debug line:
[ble] DEBUG: Advertising: 2d66 with name sbceade1c9c
theHuanter
theHuanter 11/06/2023 02:23 PM
no delay - after that I see more or less 2s of nothing and then it crashes
theHuanter
theHuanter 11/06/2023 02:24 PM
floitsch
floitsch 11/06/2023 02:25 PM
Try to upgrade to v1.2.0, and then use the following container to install the provider:
import log
import watchdog
import watchdog.provider

main:
  provider := provider.WatchdogServiceProvider --logger=((log.default.with-name "watchdog").with-level log.DEBUG-LEVEL)
  provider.install
  print "installed"
theHuanter
theHuanter 11/06/2023 02:25 PM
I marked it - it does say no connection available tho
theHuanter
theHuanter 11/06/2023 02:50 PM
that seems to work now
theHuanter
theHuanter 11/06/2023 02:51 PM
theHuanter
theHuanter 11/06/2023 02:56 PM
should the ethernet container also get the watchdog? it basically only installs the provider so I think it doesn't really need it, right
floitsch
floitsch 11/06/2023 02:58 PM
I don't think so.
floitsch
floitsch 11/06/2023 02:59 PM
If you bump the log-level (to INFO) you would drop the feeding system watchdog messages.
👍1
floitsch
floitsch 11/06/2023 02:59 PM
If you want to keep some watchdog logging.
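A sketch of that change, based on the provider container posted earlier in the thread: the only intended difference is `log.INFO-LEVEL` in place of `log.DEBUG-LEVEL`. The local names `logger` and `service` are chosen here for readability and are not from the original snippet.

```toit
import log
import watchdog.provider

// Provider container with the watchdog logger raised to INFO, so the
// DEBUG-level "feeding system watchdog" messages are no longer printed.
main:
  logger := (log.default.with-name "watchdog").with-level log.INFO-LEVEL
  // `service` is a hypothetical local name, used to avoid reusing the
  // `provider` import prefix as a variable name.
  service := provider.WatchdogServiceProvider --logger=logger
  service.install
  print "installed"
```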
theHuanter
theHuanter 11/06/2023 03:04 PM
have you tried it with a firmware update actually?
theHuanter
theHuanter 11/06/2023 03:05 PM
🥲 the watchdog triggers of course if I try to update the device
floitsch
floitsch 11/06/2023 03:09 PM
You would need to stop all the timers.
floitsch
floitsch 11/06/2023 03:11 PM
We will integrate the watchdog timers with Artemis a bit more.
1. if a container doesn't work, we don't need to reset immediately but can try to just restart the container.
2. depending on the settings, a container might allow to disable the watchdog automatically on fw updates. Critical containers should still have their watchdogs run.
3. Artemis will probably have the ability to disable watchdogs so it can make progress (in case one just keeps getting in the way).
floitsch
floitsch 11/06/2023 03:12 PM
We will have to think a bit more about how this should work best.
theHuanter
theHuanter 11/06/2023 03:12 PM
for the sake of testing right now it doesn't matter, I mostly flash anyway
floitsch
floitsch 11/06/2023 03:12 PM
Oh. And the watchdog-provider container should probably be marked as critical.
theHuanter
theHuanter 11/06/2023 03:12 PM
I would not know when to close the watchdog for a firmware update tbh
floitsch
floitsch 11/06/2023 03:13 PM
For now I would probably mark the watchdog-provider container as critical, and use very big timeouts for the watchdogs.
floitsch
floitsch 11/06/2023 03:13 PM
Just to make sure something is happening within 10 minutes or so.
floitsch
floitsch 11/06/2023 03:13 PM
At the very least the devices should never get fully stuck this way.
theHuanter
theHuanter 11/06/2023 03:14 PM
you mean the feeding should happen only once every few minutes?
floitsch
floitsch 11/06/2023 03:14 PM
The timeout should be set to 10 minutes. The feeding could still be more frequent.
floitsch
floitsch 11/06/2023 03:14 PM
In the current setup.
theHuanter
theHuanter 11/06/2023 03:14 PM
ah so the timeout is the time I set when creating the watchdog
floitsch
floitsch 11/06/2023 03:14 PM
This would give Artemis a window of 10 minutes to do its thing.
floitsch
floitsch 11/06/2023 03:15 PM
Correct. At start you tell the watchdog the max interval between feedings.
theHuanter
theHuanter 11/06/2023 03:15 PM
dog.start --s=60 <---- this
theHuanter
theHuanter 11/06/2023 03:15 PM
okay
floitsch
floitsch 11/06/2023 03:15 PM
You have 1 minute to feed it. Yes.
floitsch
floitsch 11/06/2023 03:15 PM
If you make this 10 minutes, and feed every 30 seconds, then Artemis has 9:30 to do its thing.
floitsch
floitsch 11/06/2023 03:16 PM
(when it shuts down your container for a firmware update).
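The arithmetic above can be written out as a small client sketch. The calls are the ones used earlier in this thread; the constant names, the dog name, and the 2-second open timeout are assumptions made for illustration.

```toit
import watchdog

FEED-INTERVAL-S ::= 30      // Hypothetical name: how often the dog is fed.
WATCHDOG-TIMEOUT-S ::= 600  // Hypothetical name: max interval between feedings (10 minutes).

main:
  client := watchdog.WatchdogServiceClient
  client.open --timeout=(Duration --s=2)
  dog := client.create "mqtt-dog"
  dog.start --s=WATCHDOG-TIMEOUT-S
  // If the container is stopped just before its next feed, the remaining
  // window is 600s - 30s = 570s, which is the 9:30 mentioned above.
  print "worst-case window: $(WATCHDOG-TIMEOUT-S - FEED-INTERVAL-S)s"
  while true:
    dog.feed
    sleep --ms=(FEED-INTERVAL-S * 1000)
```

Feeding more often than the timeout requires only affects how much of the window is left when the container is shut down; the timeout passed to `dog.start` is what bounds how long a stuck device can go unnoticed.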
floitsch
floitsch 11/06/2023 03:16 PM
Again: in the future we should improve this.
theHuanter
theHuanter 11/06/2023 03:16 PM
yes yes
floitsch
floitsch 11/06/2023 03:16 PM
That said: I'm actually not 100% sure this will work.
😅1
floitsch
floitsch 11/06/2023 03:17 PM
I think a deep-sleep 0 with a watchdog running triggers a watchdog error.
floitsch
floitsch 11/06/2023 03:17 PM
Not sure if we then try to update.
floitsch
floitsch 11/06/2023 03:17 PM
I will test that.
theHuanter
theHuanter 11/06/2023 03:18 PM
I am pretty sure that this will not catch my issue with the device disappearing, but I really hope it does
theHuanter
theHuanter 11/06/2023 03:26 PM
okay the watchdog is running on the device and writing the log into a file
theHuanter
theHuanter 11/06/2023 03:26 PM
I only watch ble and mqtt I guess the others do not really need a watch
floitsch
floitsch 11/06/2023 03:33 PM
Looks good. I was able to do a fw update with the watchdogs active.
👍1
theHuanter
theHuanter 11/06/2023 07:25 PM
pretty long log but at some point there was a 502 and then the synchronise job stopped printing its debug synchronise log!
theHuanter
theHuanter 11/06/2023 07:25 PM
and few minutes later the whole device stopped working
bitphlipphar
bitphlipphar 11/06/2023 07:27 PM
But the watchdog still didn't force a reboot?
theHuanter
theHuanter 11/06/2023 07:59 PM
nope the log stopped completely as well
theHuanter
theHuanter 11/06/2023 07:59 PM
I really wonder what that could be
floitsch
floitsch 11/06/2023 08:37 PM
So the synchronize stopped but the watchdogs were still fed?
floitsch
floitsch 11/06/2023 08:38 PM
It looks like we need to add a watchdog for our own Artemis code to avoid this.
bitphlipphar
bitphlipphar 11/07/2023 04:51 AM
As I understand it, the watchdog feeding code - BLE + MQTT containers - also stopped, but nothing rebooted.
theHuanter
theHuanter 11/07/2023 07:56 AM
exactly
bitphlipphar
bitphlipphar 11/07/2023 07:57 AM
Does the device have any LEDs?
theHuanter
theHuanter 11/07/2023 07:57 AM
yes
bitphlipphar
bitphlipphar 11/07/2023 07:57 AM
Anything we can blink on regular intervals?
theHuanter
theHuanter 11/07/2023 07:57 AM
let me check
theHuanter
theHuanter 11/07/2023 07:57 AM
it has a bunch of leds to show Rx and Tx of the ethernet
theHuanter
theHuanter 11/07/2023 07:58 AM
but I will check if it has some debug led on it
theHuanter
theHuanter 11/07/2023 08:01 AM
no debug led but I could maybe add one
theHuanter
theHuanter 11/07/2023 08:01 AM
the way I check if the device is still alive is, I check for the BLE device on my phone. if it is gone, the container is gone
bitphlipphar
bitphlipphar 11/07/2023 08:02 AM
You also get no serial output, right?
theHuanter
theHuanter 11/07/2023 08:03 AM
exactly, nothing is printed there
theHuanter
theHuanter 11/07/2023 08:03 AM
only if I hard reset the device using the reset button
bitphlipphar
bitphlipphar 11/07/2023 08:04 AM
So we don't see any prints from the watchdog provider, so we assume that it actually isn't resetting the low-level watchdog, which should cause that to reboot the system. It's pretty weird.
bitphlipphar
bitphlipphar 11/07/2023 08:04 AM
It would be great to rule out that we're just not getting serial output anymore. Are you hooking prints at the Toit level in any way?
theHuanter
theHuanter 11/07/2023 08:04 AM
it is also not the device, it happens on other devices as well
theHuanter
theHuanter 11/07/2023 08:06 AM
It would be great to rule out that we're just not getting serial output anymore
this is ruled out by the bluetooth device being gone. The container advertises and then stays alive. if the ble device is gone from scanning, the container must be gone
theHuanter
theHuanter 11/07/2023 08:06 AM
with the watchdog I added a log once it feeds the dog every 30sec
bitphlipphar
bitphlipphar 11/07/2023 08:07 AM
I get that, but at the same time, I'd like to know if we just stopped processing some Toit code. If you hook prints at the Toit level, then you need (more) Toit code to run to get anything printed on serial.
theHuanter
theHuanter 11/07/2023 08:07 AM
it also disappears from my router as internet device I think (I was checking it but I can not remember right now anymore)
theHuanter
theHuanter 11/07/2023 08:08 AM
If you hook prints at the Toit level, then you need (more) Toit code to run to get anything printed on serial.
I don't really get that toit level part
theHuanter
theHuanter 11/07/2023 08:08 AM
what do you mean by that
bitphlipphar
bitphlipphar 11/07/2023 08:08 AM
Program your microcontrollers in a fast and robust high-level language. - toitlang/toit
bitphlipphar
bitphlipphar 11/07/2023 08:09 AM
I think my real question is: Is there any way the watchdog provider's Toit code continues to run while everything else seems to stall?
floitsch
floitsch 11/07/2023 08:10 AM
Your board is an olimex board. Right?
Could you maybe send us the code so we can try to reproduce?
It's probably enough to get the snapshot/image for the code you don't want to share.
theHuanter
theHuanter 11/07/2023 08:11 AM
no
👍1
theHuanter
theHuanter 11/07/2023 08:11 AM
I can share all of it if you want
theHuanter
theHuanter 11/07/2023 08:12 AM
but to make it similar you might need an MQTT broker which it can connect to?
theHuanter
theHuanter 11/07/2023 08:12 AM
thats actually all
floitsch
floitsch 11/07/2023 08:12 AM
We can find one.
theHuanter
theHuanter 11/07/2023 08:14 AM
do you know any other ESP32-WROVER board?
floitsch
floitsch 11/07/2023 08:14 AM
We have a few more of those as well. Not with Ethernet, though.
theHuanter
theHuanter 11/07/2023 08:15 AM
I somehow have the suspicion it is related to hardware but the code only runs on the WROVER variant due to its size (somehow)
bitphlipphar
bitphlipphar 11/07/2023 08:15 AM
@theHuanter Do you think this can be reproduced without the BLE container?
theHuanter
theHuanter 11/07/2023 08:15 AM
I can not really tell, could be
bitphlipphar
bitphlipphar 11/07/2023 08:16 AM
It feels like what you have now is pretty reproducible, which is great. I just wonder if we can make the repro case smaller.
theHuanter
theHuanter 11/07/2023 08:17 AM
the thing is I don't know when it happens. I started the device yesterday and within a few hours it stopped. then I restarted and it is still running
theHuanter
theHuanter 11/07/2023 08:19 AM
I can send you the code incl. the BLE part, thats not an issue. But because we don't know what exactly causes it, I would also run it incl. the MQTT part which wants to connect to a broker . We can also not connect it I think but then it might be different already. not sure
theHuanter
theHuanter 11/07/2023 08:20 AM
or do you want to check if BLE is causing it and not run it?
bitphlipphar
bitphlipphar 11/07/2023 08:22 AM
I believe we have a device like yours (I could be wrong) in our test lab that just runs Artemis via ethernet and it syncs with the cloud every 20s for days and days.
theHuanter
theHuanter 11/07/2023 08:23 AM
I also installed the MQTT and BLE container onto another ESP-DevKit using Jaguar (not artemis) and it was running forever
bitphlipphar
bitphlipphar 11/07/2023 08:23 AM
We'll have to double check that it is actually a WROVER.
theHuanter
theHuanter 11/07/2023 08:23 AM
in this case it was a WROOM
theHuanter
theHuanter 11/07/2023 08:24 AM
I just don't have another WROVER board where I could run the artemis version on just to check if this also happens on other hardware to rule out the olimex board
bitphlipphar
bitphlipphar 11/07/2023 08:24 AM
Would it be super annoying to run the Jaguar variant on your WROVER board?
theHuanter
theHuanter 11/07/2023 08:25 AM
no that might work as well
theHuanter
theHuanter 11/07/2023 08:25 AM
I can install the containers just as they are in artemis then
bitphlipphar
bitphlipphar 11/07/2023 08:25 AM
I'm just hoping we can start ruling some things out.
theHuanter
theHuanter 11/07/2023 08:25 AM
okay I will do that then
theHuanter
theHuanter 11/07/2023 08:26 AM
I will run all containers: watchdog, ethernet, mqtt and ble via jag
bitphlipphar
bitphlipphar 11/07/2023 08:26 AM
Thanks! It would also be interesting to try to run Artemis with no containers (except ethernet) and max-offline 0s and see if that also stops sync'ing at some point.
theHuanter
theHuanter 11/07/2023 08:27 AM
tbh I can not recall for sure but it might be related to the IDF changes
bitphlipphar
bitphlipphar 11/07/2023 08:27 AM
In the mean time, we'll check our boards and see if we have a WROVER among the ethernet ones.
theHuanter
theHuanter 11/07/2023 08:27 AM
I think you ordered one as well
bitphlipphar
bitphlipphar 11/07/2023 08:27 AM
You mean the upgrade to ESP-IDF v5.x?
theHuanter
theHuanter 11/07/2023 08:27 AM
yes
theHuanter
theHuanter 11/07/2023 08:27 AM
but of course many things have changed since then
theHuanter
theHuanter 11/07/2023 08:37 AM
ok the device is up and running using jag, flashed with esp32-eth-clk-out0-spiram, and all containers are there. watchdog, eth, mqtt and ble
theHuanter
theHuanter 11/07/2023 08:38 AM
every 30s there is a log in the BLE container which prints [ble] DEBUG: Feed that dog.. omnomnomnom
bitphlipphar
bitphlipphar 11/07/2023 08:38 AM
Great. I've looked around a bit and I started getting worried about internal FreeRTOS stack sizes. https://esp32.com/viewtopic.php?t=30700
😬1
bitphlipphar
bitphlipphar 11/07/2023 08:40 AM
I'll dig a bit deeper, but we could try to build a variant with larger stacks. We currently run with 2KB stacks on the ESP32 for the tasks that run Toit code (edit: turns out that is wrong, it really is 8KB).
theHuanter
theHuanter 11/07/2023 08:40 AM
they also say that without WiFi being initialised it does not happen, could we try that as well somehow?
theHuanter
theHuanter 11/07/2023 08:41 AM
I have another WROVER PoE board where I could run this
theHuanter
theHuanter 11/07/2023 08:41 AM
then we can see if it is maybe that, sounds pretty much like the issue
bitphlipphar
bitphlipphar 11/07/2023 08:43 AM
Espressif IoT Development Framework. Official development framework for Espressif SoCs. - Issues · espressif/esp-idf
bitphlipphar
bitphlipphar 11/07/2023 08:45 AM
Looks like it is still an open issue (the tech lead of the ESP-IDF added a comment to it in September, 2023).
bitphlipphar
bitphlipphar 11/07/2023 08:49 AM
Just found this in our configs: CONFIG_FREERTOS_ISR_STACKSIZE=2096. That's a weird number, but it's probably not problematic. It looks like someone tried to change it from 4KB to 2KB, but got it wrong 😉
😬1
bitphlipphar
bitphlipphar 11/07/2023 08:53 AM
Actually we're using 8KB for the stacks that run Toit code.
theHuanter
theHuanter 11/07/2023 08:54 AM
so as far as I understand it the ESP gets stuck after some panics or overflows in a function of the espressif IDF running SOC_HAL_STALL_OTHER_CORES(). They also say, that the RTC is still able to reset the device, not sure if that helps in any way. Could the RTC reset the device in such moments if some watchdog is not triggered in time?
bitphlipphar
bitphlipphar 11/07/2023 08:54 AM
Not sure, but maybe.
bitphlipphar
bitphlipphar 11/07/2023 08:55 AM
I think our best bet is trying to figure out if we get panics/overflows and then avoid those.
theHuanter
theHuanter 11/07/2023 08:55 AM
true
theHuanter
theHuanter 11/07/2023 08:55 AM
that might be the core issue
bitphlipphar
bitphlipphar 11/07/2023 08:55 AM
That feels very actionable, but it starts with us being able to understand if that is actually happening.
theHuanter
theHuanter 11/07/2023 08:55 AM
but it might be hard to figure out who is causing them if the device does not tell us
theHuanter
theHuanter 11/07/2023 08:56 AM
it's the right way 😉 finding the root cause, not avoiding the symptoms
theHuanter
theHuanter 11/07/2023 08:57 AM
do they mean that the stack is overflowing? is that caused by allocations or what might cause this?
bitphlipphar
bitphlipphar 11/07/2023 08:59 AM
It is caused by the C/C++ code that needs stack space for local variables, etc. So depending on what the code does and when interrupts fire, you'll need more or less space for your stacks.
bitphlipphar
bitphlipphar 11/07/2023 09:00 AM
The stacks are allocated (essentially) at startup and they have a fixed size.
bitphlipphar
bitphlipphar 11/07/2023 09:00 AM
If some code changed and now uses a bit more recursion or more local variables, then we might occasionally need more stack space than we have.
bitphlipphar
bitphlipphar 11/07/2023 09:01 AM
It is a super unsatisfying setup. At the Toit level, we grow stacks on demand. It isn't your Toit code that contributes to the low-level stack space consumption.
theHuanter
theHuanter 11/07/2023 09:07 AM
okay - the jag version is now running on my raspberry pi writing the serial output into a file, let's see if it happens there as well
bitphlipphar
bitphlipphar 11/07/2023 09:07 AM
Thanks!
bitphlipphar
bitphlipphar 11/07/2023 09:08 AM
If it is a low-level stack overflow issue of sorts, then we should expect Jaguar/Artemis to behave the same -- modulo the fact that Artemis does a network request every 20s and Jaguar just waits for http clients to connect.
👍1
bitphlipphar
bitphlipphar 11/07/2023 09:32 AM
Just found that we're running with lower than default stack size for the lwIP task (2560 vs the default of 3072).
floitsch
floitsch 11/07/2023 09:57 AM
The watchdog is hardware based (I think). I don't see how the stack overflow could prevent the device from rebooting.
bitphlipphar
bitphlipphar 11/07/2023 09:58 AM
@floitsch Clearly you didn't read the bug report 🙂
bitphlipphar
bitphlipphar 11/07/2023 09:59 AM
To answer your question, the WDT does not kick in ever, we have kept the esp32 ON in that state for more than 8 hours. According to our observation the panic_handler function stop at line SOC_HAL_STALL_OTHER_CORES();
floitsch
floitsch 11/07/2023 10:00 AM
Didn't go deep enough. You are right.
bitphlipphar
bitphlipphar 11/07/2023 10:01 AM
@floitsch What would it take for us to produce a variant with a small esp-idf patch that comments out that line? Does the envelopes repository support patching that?
floitsch
floitsch 11/07/2023 10:19 AM
I don't think the envelope repo is already prepared for it, but I think it should be feasible.
bitphlipphar
bitphlipphar 11/07/2023 10:43 AM
I suggest we drop the non-default stack sizes as a starting point. I'm running tests on that right now.
bitphlipphar
bitphlipphar 11/07/2023 10:44 AM
I'm a little bit concerned that we may need more space for the BLE stack than the default allows for, but that is completely unproven at this point.
bitphlipphar
bitphlipphar 11/07/2023 02:08 PM
SDK v2.0.0-alpha.119 comes with adjusted native stack sizes.
bitphlipphar
bitphlipphar 11/07/2023 02:09 PM
Trying to get a version of Artemis with support for that ready.
👍1
theHuanter
theHuanter 11/07/2023 03:09 PM
My device stopped again - this time it was running entirely using jaguar
theHuanter
theHuanter 11/07/2023 03:13 PM
theHuanter
theHuanter 11/07/2023 03:13 PM
pretty boring log
bitphlipphar
bitphlipphar 11/07/2023 03:15 PM
We have no real indications that this is going to fix it, but Artemis v0.13.2 is out with support for SDK v2.0.0-alpha.119, which comes with slightly more stack space for the lwIP task.
bitphlipphar
bitphlipphar 11/07/2023 03:15 PM
@theHuanter Good to know about Jaguar!
theHuanter
theHuanter 11/07/2023 03:17 PM
I will give it a spin later - does that also apply for jaguar?
bitphlipphar
bitphlipphar 11/07/2023 03:20 PM
I can get a Jaguar build out in a few hours.
bitphlipphar
bitphlipphar 11/07/2023 03:21 PM
(maybe a bit before that)
bitphlipphar
bitphlipphar 11/07/2023 03:22 PM
You'll need Jaguar v1.19.0 (unreleased for now).
theHuanter
theHuanter 11/07/2023 03:23 PM
ok
theHuanter
theHuanter 11/07/2023 03:23 PM
I am fine with artemis as well
theHuanter
theHuanter 11/07/2023 03:23 PM
ah right the winget situation, I remember 😄
bitphlipphar
bitphlipphar 11/07/2023 03:25 PM
You should be able to download Jaguar from here: https://github.com/toitlang/jaguar/releases/tag/v1.19.0 (once the assets are built).
What's Changed

Update to SDK v2.0.0-alpha.119 by @kasperl in #434

Full Changelog: v1.18.0...v1.19.0
theHuanter
theHuanter 11/07/2023 04:36 PM
just as an idea: if this is related to some memory leakage (if thats even a possibility) an increase in the stack would simply just move the issue to a later point, right?
floitsch
floitsch 11/07/2023 04:41 PM
We don't move C stacks, and I'm guessing they are allocated at start.
bitphlipphar
bitphlipphar 11/07/2023 05:31 PM
Yes, fixed allocations.
bitphlipphar
bitphlipphar 11/07/2023 05:31 PM
Jaguar v1.19.0 is out.
floitsch
floitsch 11/07/2023 05:32 PM
Have you signed the Contributor License Agreement?
Have you checked that there aren't other open pull requests for the same manifest update/change?
This PR only modifies one (1) manifest
Hav...
bitphlipphar
bitphlipphar 11/08/2023 05:18 AM
Appears complete, so winget should give you Jaguar v1.19.0.
theHuanter
theHuanter 11/08/2023 07:33 AM
it is running now on my pi using this jaguar version
bitphlipphar
bitphlipphar 11/08/2023 08:01 AM
@theHuanter Thanks. I remain a bit sceptical about this being a fix for the issue, but I'm curious to hear what you find.
theHuanter
theHuanter 11/08/2023 08:06 AM
yea me too
theHuanter
theHuanter 11/08/2023 08:07 AM
@floitsch my device stopped working over night again how is yours doing?
floitsch
floitsch 11/08/2023 09:42 AM
Still running strong.
🧐1
bitphlipphar
bitphlipphar 11/09/2023 06:03 AM
@theHuanter Any updates from running Jaguar v1.19.0 on your device?
theHuanter
theHuanter 11/09/2023 07:21 AM
I had to reboot yesterday evening for a test but sofar it is still running
bitphlipphar
bitphlipphar 11/09/2023 07:21 AM
Interesting.
bitphlipphar
bitphlipphar 11/09/2023 07:22 AM
@floitsch pushed an update to the MQTT package with a bug fix, so at some point you may want to upgrade to [email protected].
bitphlipphar
bitphlipphar 11/09/2023 07:24 AM
I guess the frequency of the hang is low enough that we will not be able to conclude anything positive before early next week.
theHuanter
theHuanter 11/09/2023 07:46 AM
usually it happens overnight, but let's see. I hope the changes did not just "hide" the problem and make it appear at a later stage
bitphlipphar
bitphlipphar 11/09/2023 07:49 AM
So if we've solved it, it is most likely due to the extra stack space allocated to the lwIP task. Given the frequency of the issue, it would make sense if the old setting was almost enough (~2.5K), but that the ethernet code would sometimes use a tiny bit too much stack space (compared to wifi, perhaps). The theory is that a detected stack overflow would lead to a panic that would hang both cores at a very low level due to a bug in the esp-idf.
bitphlipphar
bitphlipphar 11/09/2023 07:54 AM
Going to the default lwIP task stack size seems like a good idea (2.5K -> 3K) and the esp-idf probably increased the default size for a good reason. It was increased back in 2018, but we didn't pay enough attention to that: https://github.com/espressif/esp-idf/commit/2ff3f8b0c8b14dc3e9b581d3031689e13c5530a6. The commit message also wasn't super helpful 😋
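For reference, a minimal sketch of how the lwIP task stack size discussed above is typically set in an esp-idf project configuration. The option name below is the one recent esp-idf releases use, and older releases are believed to have called it `CONFIG_TCPIP_TASK_STACK_SIZE`. Whether the Toit firmware build sets it through exactly this option is an assumption, not something stated in this thread.

```
# sdkconfig fragment (assumed option name; not confirmed for the Toit build)
# Previous setting mentioned in the thread: 2560 bytes (~2.5K)
# esp-idf default the thread moves to:      3072 bytes (3K)
CONFIG_LWIP_TCPIP_TASK_STACK_SIZE=3072
```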
theHuanter
theHuanter 11/09/2023 09:49 AM
okay.. I mean I am happy if that's the fix! that would be amazing
theHuanter
theHuanter 11/09/2023 09:49 AM
I will keep it running and see if it stays
floitsch
floitsch 11/09/2023 09:49 AM
I'm still running two devices at work with your setup.
theHuanter
theHuanter 11/09/2023 09:50 AM
but with old jag?
floitsch
floitsch 11/09/2023 09:50 AM
yes
theHuanter
theHuanter 11/09/2023 09:50 AM
and you haven't seen the hang yet?
floitsch
floitsch 11/09/2023 09:50 AM
I'm working from home today, so will only see if it "worked" tomorrow or (maybe even only) next week
theHuanter
theHuanter 11/09/2023 09:50 AM
ah ok
theHuanter
theHuanter 11/09/2023 09:50 AM
thats fine
floitsch
floitsch 11/09/2023 09:50 AM
but until yesterday evening it didn't reproduce.
bitphlipphar
bitphlipphar 11/09/2023 11:58 AM
This means that it hadn't reproduced when you checked yesterday evening, right? Not that it reproduced yesterday evening.
floitsch
floitsch 11/09/2023 12:31 PM
Correct. It hadn't reproduced when I checked at that time.
bitphlipphar
bitphlipphar 11/10/2023 06:04 AM
@theHuanter Let us know the status of your tests on alpha.119. Is it possible for you to run the tests over the weekend too so we get more data?
theHuanter
theHuanter 11/10/2023 06:39 AM
the device is still running (second night). I will keep it running over the weekend. If it still runs on Monday it might be fixed
bitphlipphar
bitphlipphar 11/10/2023 06:45 AM
It is pretty crazy to think about. You reported the hang and we found this old bug report that matched the description pretty well. We checked our stack limits and noticed they were slightly off from the defaults. I assume we believe this is the issue @Niels R. reported on October 16?
theHuanter
theHuanter 11/10/2023 07:17 AM
yes exactly - it feels a bit strange and "too easy"
theHuanter
theHuanter 11/10/2023 07:18 AM
I will switch to Artemis on Monday again and keep it running with that
theHuanter
theHuanter 11/10/2023 07:19 AM
@floitsch are you running the code with jag or artemis right now?
bitphlipphar
bitphlipphar 11/10/2023 07:20 AM
I believe Florian is using Jaguar.
floitsch
floitsch 11/10/2023 07:40 AM
I'm using Jaguar.
theHuanter
theHuanter 11/10/2023 10:14 AM
you can not check if it is still running, right? because it's in the office? but on Monday then, I guess?
floitsch
floitsch 11/10/2023 10:23 AM
I ended up working from home. I will check on Monday.
theHuanter
theHuanter 11/13/2023 09:12 AM
the jaguar device is still running
theHuanter
theHuanter 11/13/2023 09:12 AM
no hang
theHuanter
theHuanter 11/13/2023 09:13 AM
I will replace it now with the artemis version
bitphlipphar
bitphlipphar 11/13/2023 09:28 AM
You should be able to use Artemis v0.13.3 with SDK v2.0.0-alpha.120.
theHuanter
theHuanter 11/13/2023 09:49 AM
that's what I am running
theHuanter
theHuanter 11/13/2023 04:01 PM
the device from @Niels R. got an update as well on Friday with artemis and is also still running today
theHuanter
theHuanter 11/15/2023 07:49 AM
devices are still running - I think we can almost count this bug as fixed
theHuanter
theHuanter 11/15/2023 07:49 AM
but @floitsch you have not been able to reproduce it, right?
floitsch
floitsch 11/15/2023 07:59 AM
Both boards are still doing fine (as of yesterday evening)
bitphlipphar
bitphlipphar 11/15/2023 07:59 AM
@theHuanter There's a chance that the timings involved in talking to the servers play a role in this.
bitphlipphar
bitphlipphar 11/15/2023 08:00 AM
Florian's setup is different in this respect, so maybe the lwIP stack is exercised in a different way there.
floitsch
floitsch 11/16/2023 06:19 PM
My two devices are still running without issues (after a week?). So I agree with Kasper. There is a chance that my setup is just slightly different and we never hit the overflow (assuming that was the reason).
theHuanter
theHuanter 11/16/2023 06:27 PM
my device here has also been running with artemis since Monday
floitsch
floitsch 11/16/2023 06:58 PM
Mine is still running the old Jaguar.