guild icon
Toit
#Quite often my `jag run` commands hang, and I'm trying to figure out why!
Thread channel in help
addshore
addshore 03/21/2025 11:24 AM
Sometimes my jag run commands hang for quite a while, sometimes they are super quick, and sometimes they are supre slow.
Recently I got this set of traces from one that was hanging and ultimately failed
Any thoughts on the best way to help possibly identify what is going on, and make it go away?
floitsch
floitsch 03/21/2025 11:24 AM
Do you know if the issue is on the cli or device side?
floitsch
floitsch 03/21/2025 11:25 AM
Iirc the log of the device shows when it gets a request (but I'm not 100% sure right now).
addshore
addshore 03/21/2025 11:25 AM
So the most recent one failed with this
https://gist.githubusercontent.com/addshore/dbf2d2e7909f60dddae0fecef5eeae9c/raw/be68079e71a9bbc77687d2e7403d33b1e6210e7b/gistfile1.txt
And running jag run again right after worked after a few seconds
floitsch
floitsch 03/21/2025 11:25 AM
Ok. This looks like the CLI didn't send the full program.
floitsch
floitsch 03/21/2025 11:26 AM
As if it was interrupted.
addshore
addshore 03/21/2025 11:26 AM
or it got lost in the network? (I'm not sure how these comms really look under the hood)
addshore
addshore 03/21/2025 11:26 AM
But I find I can reproduce this using jag run probably 8 in 10 times to some degree (either slowly, or failing)
floitsch
floitsch 03/21/2025 11:26 AM
The device sends out UDP broadcasts to signal its presence, but once Jaguar connects to the device it goes through http with tcp/ip.
floitsch
floitsch 03/21/2025 11:27 AM
Package loss should be handled by the tcp/ip layer.
floitsch
floitsch 03/21/2025 11:27 AM
Is there any output on the CLI side?
floitsch
floitsch 03/21/2025 11:27 AM
80% of the time is a lot.
floitsch
floitsch 03/21/2025 11:27 AM
Definitely not something we have been seeing.
floitsch
floitsch 03/21/2025 11:28 AM
When you get the "UNEXPECTED_END_OF_READER" is that happening after a long time?
👍1
floitsch
floitsch 03/21/2025 11:28 AM
Could the CLI have given up at that point?
floitsch
floitsch 03/21/2025 11:29 AM
Do you know if it could be spotty WiFi? If you have laptop, you can test this, by creating a hotspot with your phone and connecting your laptop and the device to it).
addshore
addshore 03/21/2025 11:29 AM
only output on the CLI side would be like this

~/dev/lb/io/toit (main) jag run ./pkg/lightbug/examples/httpmsg-raw.toit --device 192.168.68.65 -D lb-localdocs=1 Scanning for device with address: '192.168.68.65' Error: Get "http://192.168.68.65:9000/identify": context deadline exceeded
floitsch
floitsch 03/21/2025 11:30 AM
At the moment (with the information I have), I'm guessing the device has a hard time connecting to the WiFi hotspot and, despite the TCP/IP retransmissions, doesn't manage to get the data.
Eventually the CLI gives up, and the device gets the stacktrace.
👍1
addshore
addshore 03/21/2025 11:31 AM
managed to get this this time

Scanning for device with address: '192.168.68.65' Running './pkg/lightbug/examples/httpmsg-raw.toit' on 'working-heat' ... Error: got non-OK from device: 500 Internal Server error - DEADLINE_EXCEEDED
addshore
addshore 03/21/2025 11:31 AM
so, is that 500 error actually comes from the esp?
floitsch
floitsch 03/21/2025 11:32 AM
Good question. Could be from our sources. Let me grep.
addshore
addshore 03/21/2025 11:32 AM
Got this from the ESP side
[jaguar.http] INFO: Internal Server error - DEADLINE_EXCEEDED {peer: 192.168.68.61:28407, method: PUT, path: /run} [jaguar.http] INFO: Internal Server error - request not fully read {peer: 192.168.68.61:28407, method: PUT, path: /run}
addshore
addshore 03/21/2025 11:33 AM
While doing the run my continual stream of ICMP pings to the ESP seem to get through fine, though there is some lag
Pre run it was 5-20ms, During run it is 1000-10000ms
floitsch
floitsch 03/21/2025 11:35 AM
Are you doing something else (like BLE) on the device while updating?
floitsch
floitsch 03/21/2025 11:36 AM
I have experienced similar issues during firmware updates. If there is little memory left, and the BLE is using lots of CPU the firmware update was a bad experience.
addshore
addshore 03/21/2025 11:36 AM
nope, I'm only ever running this 1 script on the devic currently, so the first thing it does before trying to run the new one is stop the old one
floitsch
floitsch 03/21/2025 11:36 AM
I can imagine the same happening for run.
floitsch
floitsch 03/21/2025 11:36 AM
good to know.
addshore
addshore 03/21/2025 11:36 AM
ble certainly wasn't turned on by the old one
The only thing it does is I2C comms
floitsch
floitsch 03/21/2025 11:36 AM
that shouldn't cost anything.
floitsch
floitsch 03/21/2025 11:37 AM
I haven't found the DEADLINE_EXCEEDED yet.
I'm pretty sure it's from our code, but I don't know which one yet.
addshore
addshore 03/21/2025 11:37 AM
so, after failing to do the run a few misn ago, http://192.168.68.61:9000/ and /identify now currenlty don't load in a browser
floitsch
floitsch 03/21/2025 11:38 AM
Package: Implement your REST server or client in Toit and run it on your ESP32. - toitlang/pkg-http
addshore
addshore 03/21/2025 11:38 AM
oooooh

Scanning for device with address: '192.168.68.65' Running './pkg/lightbug/examples/httpmsg-raw.toit' on 'working-heat' ... Success: Sent 127KB code to 'working-heat'

on the esp

Heap report @ out of memory in primitive 3:4: ┌───────────┬──────────┬─────────────────────────────────────────────────────┐ │ Bytes │ Count │ Type │ ├───────────┼──────────┼─────────────────────────────────────────────────────┤ │ 5360 │ 477 │ heap overhead │ │ 171248 │ 433 │ untagged │ └───────────┴──────────┴─────────────────────────────────────────────────────┘ Total: 176608 bytes in 433 allocations (52%), largest free 128k, total free 157k [jaguar] INFO: program 1d0bb989-0dd5-a5ed-eafb-499af7a7d41f started [lb-comms] INFO: Comms starting

But it took a very long time to flash
addshore
addshore 03/21/2025 11:39 AM
not seen that OOM report either before, thats a first I have spotted that
bitphlipphar
bitphlipphar 03/21/2025 11:39 AM
Is it a new problem?
addshoreOPaddshore
oooooh `` Scanning for device with address: '192.168.68.65' Running './pkg/lightbug/examples/httpmsg-raw.toit' on 'working-heat' ... Success: Sent 127KB code to 'working-heat' ...
bitphlipphar
bitphlipphar 03/21/2025 11:40 AM
We did increase CONFIG_LWIP_TCP_MSS from 1410 to 1440 recently.
addshore
addshore 03/21/2025 11:40 AM
This has generally been happening for some weeks, I have just chosen to mostly ignore it until today :😉:
It's one of the reasons that I was interested in the proxy command for jag things, but realized in the past day or so that also might not resolve the issue
bitphlipphar
bitphlipphar 03/21/2025 11:40 AM
I think the bump to 1440 landed last week.
addshoreOPaddshore
This has generally been happening for some weeks, I have just chosen to mostly ignore it until today :😉: It's one of the reasons that I was interested in the proxy command for jag t...
floitsch
floitsch 03/21/2025 11:41 AM
It would probably solve the issue, but the uart is generally slower, and the experience with WiFi should be better.
bitphlipphar
bitphlipphar 03/21/2025 11:41 AM
So it is unlikely to be the cause.(edited)
bitphlippharbitphlipphar
I think the bump to 1440 landed last week.
addshore
addshore 03/21/2025 11:41 AM
I'm likelly not using that bumped value then
addshore
addshore 03/21/2025 11:41 AM
jag v1.47.0 from Dec
floitsch
floitsch 03/21/2025 11:43 AM
From what I can see the program isn't super big either.
floitsch
floitsch 03/21/2025 11:44 AM
If possible please test with a cellphone hotspot.
addshore
addshore 03/21/2025 11:44 AM
yeah, I can give that a go!
floitsch
floitsch 03/21/2025 11:44 AM
It would give us some information on whether it's depending on the WiFi connection or not.
bitphlipphar
bitphlipphar 03/21/2025 11:44 AM
Maybe reboot your AP? Would be interesting to see if that helps.
bitphlipphar
bitphlipphar 03/21/2025 11:45 AM
We've had cases where the AP became really unstable after a while, but mostly when we had many devices.
bitphlipphar
bitphlipphar 03/21/2025 11:45 AM
Rebooting it made those issues go away for a while.
floitsch
floitsch 03/21/2025 11:46 AM
Another easy test:
in the beginning of your main:
if Time.monotonic-us > 0: return
Then reboot the device, and jag run this program multiple times.
floitsch
floitsch 03/21/2025 11:46 AM
If we don't see any issues anymore, then Toit doesn't clean up correctly when it stops the container before installing the next one.
floitsch
floitsch 03/21/2025 11:46 AM
If we still see the issue, then we can exclude anything from your program.
floitsch
floitsch 03/21/2025 11:47 AM
And doing it this way (with the if) ensures that the program size stays the same.
addshore
addshore 03/21/2025 11:51 AM
I will try all of the above and report back :🙂:
Thanks for the insights
floitschfloitsch
I haven't found the DEADLINE_EXCEEDED yet. I'm pretty sure it's from our code, but I don't know which one yet.
floitsch
floitsch 03/21/2025 11:57 AM
I looked through the code, and I'm pretty sure the DEADLINE_EXCEEDED comes from a lower layer; maybe the Socket layer.
floitsch
floitsch 03/21/2025 11:58 AM
Could still be Toit, but also C++.
addshore
addshore 03/21/2025 11:59 AM
Interesting, if Time.monotonic-us > 0: return at the top of the same program, and I ran it 6 times in a row no issue
bitphlipphar
bitphlipphar 03/21/2025 11:59 AM
addshore
addshore 03/21/2025 12:00 PM
Also interesting,
main: yield
also worked 5 times in a row
Now I am suspicious
floitsch
floitsch 03/21/2025 12:00 PM
hmm. That could also be the case.
I somehow excluded it because the number looked so big. But it's just 2 minutes. So maybe yes.
addshore
addshore 03/21/2025 12:01 PM
I did also just start using main of jag
bitphlipphar
bitphlipphar 03/21/2025 12:01 PM
I think that is the simplest explanation. Somehow it takes more than 2 minutes to get the data across and we time out.
bitphlipphar
bitphlipphar 03/21/2025 12:01 PM
Consequently we throw DEADLINE_EXCEEDED from somewhere deep in the http server code or something, something.
floitsch
floitsch 03/21/2025 12:02 PM
we should probably reset the timeout every time we get data.
addshore
addshore 03/21/2025 12:02 PM
I'll report back later throughout the day, but as far as I can tell, using main or jag, and thus also newer toit seems to have made it all go away?(edited)
floitsch
floitsch 03/21/2025 12:02 PM
that would be nice :🙂:
bitphlipphar
bitphlipphar 03/21/2025 12:02 PM
Latest Toit for the win.
💟1
addshore
addshore 03/21/2025 12:03 PM
it always seems to run in sub 5s now
addshore
addshore 03/21/2025 12:03 PM
I read your changelog, and I wonder if perhaps it might have been sometihng to do with I2c then? as you did tweak some things? but no idea!
floitsch
floitsch 03/21/2025 12:03 PM
I don't think so, but you could still do the if Time.monotonic-us test from above on an older jag.
addshore
addshore 03/21/2025 12:04 PM
I will do that now
bitphlipphar
bitphlipphar 03/21/2025 12:04 PM
It could also be weirder. Like reflashing the device helped because you don't have a crap ton of stuff already installed in the programs section that needs to be deleted as we install new things.
bitphlipphar
bitphlipphar 03/21/2025 12:04 PM
2 minutes is a long time though, so I can't quite make that explanation fit.
bitphlipphar
bitphlipphar 03/21/2025 12:05 PM
If that is somehow part of the explanation, going back to an older version should not make this more reproducible.
addshore
addshore 03/21/2025 12:05 PM
So, a fresh flash on the older toit and jag, same hardware, and its currently hung, and i expect it to fail
Scanning for device with address: '192.168.68.65' Running './pkg/lightbug/examples/httpmsg-raw.toit' on 'glad-ear' ...
bitphlipphar
bitphlipphar 03/21/2025 12:06 PM
Nice.
bitphlipphar
bitphlipphar 03/21/2025 12:06 PM
Would love to know what we fixed.
🤣1
addshore
addshore 03/21/2025 12:06 PM
ooh, it did run after about 20 seconds, but on the stuff i was getting under 5 always
addshore
addshore 03/21/2025 12:08 PM
If you want, I can probably bisect it at some point for you :🙂:
bitphlipphar
bitphlipphar 03/21/2025 12:09 PM
That'd be cool. For now, I guess stress testing the latest bits and making sure it's better for you also seems like a nice thing - and productive for you.
82 messages in total