Toit
#jag run crashes toit vm on ESP32 and ESP32S3
Thread channel in help
MikkelD 11/24/2022 01:43 PM
When I run the attached program with jag run repeatedly, the VM sometimes crashes. It is not consistent. It happens less often on the ESP32 than on the ESP32S3.

This is on the latest master branch of Toit and Jaguar.
MikkelD 11/24/2022 01:43 PM
MikkelD 11/24/2022 01:44 PM
bitphlipphar 11/24/2022 01:52 PM
I wonder if this is reproducible with 1fc17422b7448fdf0f605648d6ae5552e58ae71a?
MikkelD 11/24/2022 02:10 PM
So far, I haven't been able to reproduce it on that commit.
MikkelD 11/24/2022 02:18 PM
And now it happened on that commit.
MikkelD 11/24/2022 02:19 PM
Guru Meditation Error: Core 1 panic'ed (LoadProhibited). Exception was unhandled.
Core 1 register dump:
PC      : 0x403754ea  PS      : 0x00060030  A0      : 0x82024bda  A1      : 0x3fce5950
0x403754ea: toit::Interpreter::run() at ./toolchains/esp32s3/build/./src/interpreter_run.cc:262
A2      : 0x00000000  A3      : 0x3fce1018  A4      : 0x3c0e1ec8  A5      : 0x00000002
A6      : 0x00000000  A7      : 0x3fcee958  A8      : 0x803754e5  A9      : 0x3fce5930
A10     : 0x3009ff14  A11     : 0x3fca0474  A12     : 0x3fca0478  A13     : 0x0000abab
A14     : 0x00060423  A15     : 0x00060420  SAR     : 0x00000003  EXCCAUSE: 0x0000001c
EXCVADDR: 0x00000000  LBEG    : 0x420ce0ec  LEND    : 0x420ce0f2  LCOUNT  : 0x00000002
0x420ce0ec: toit::Scheduler::has_ready_processes(toit::Locker&) at ./toolchains/esp32s3/build/./src/scheduler.cc:938
0x420ce0f2: toit::Scheduler::has_ready_processes(toit::Locker&) at ./toolchains/esp32s3/build/./src/scheduler.cc:941

Backtrace: 0x403754e7:0x3fce5950 0x42024bd7:0x3fce59d0 0x42024e66:0x3fce5a20 0x42024e91:0x3fce5a50 0x4201760e:0x3fce5a70 0x42017635:0x3fce5a90
0x403754e7: toit::Interpreter::run() at ./toolchains/esp32s3/build/./src/interpreter_run.cc:262
0x42024bd7: toit::Scheduler::run_process(toit::Locker&, toit::Process*, toit::SchedulerThread*) at ./toolchains/esp32s3/build/./src/scheduler.cc:627
0x42024e66: toit::Scheduler::run(toit::SchedulerThread*) at ./toolchains/esp32s3/build/./src/scheduler.cc:404
0x42024e91: toit::SchedulerThread::entry() at ./toolchains/esp32s3/build/./src/scheduler.cc:38
0x4201760e: toit::Thread::_boot() at ./toolchains/esp32s3/build/./src/os_esp32.cc:302
0x42017635: toit::thread_start(void*) at ./toolchains/esp32s3/build/./src/os_esp32.cc:289
(inlined by) esp_thread_start at ./toolchains/esp32s3/build/./src/os_esp32.cc:294
MikkelD 11/24/2022 02:23 PM
FYI: I am using this script to run an extended test.
#!/bin/bash
while :
do
  /Users/mikkel/proj/jaguar/build/jag run service_perf_test.toit -d 192.168.1.129
  sleep 2
done
MikkelD 11/24/2022 02:26 PM
Also reproduced in 2a299286d3c1cda55afeb5a9a9fe19e1279a39f9
After quite a while
MikkelD 11/24/2022 02:39 PM
The log from a roughly 10-minute run:
MikkelD 11/24/2022 02:39 PM
On 2a299286d3c1cda55afeb5a9a9fe19e1279a39f9
MikkelD 11/24/2022 02:50 PM
I don't think a recent commit is the culprit.
MikkelD 11/24/2022 02:51 PM
I went back to a commit from a month ago and could still reproduce it.
bitphlipphar 11/24/2022 02:54 PM
Good to know!
MikkelD 11/24/2022 02:56 PM
At least we have a test case that can reproduce it within a few minutes. It is fairly infrequent, but it happens. Anything else I can do to help trace this one down?
MikkelD 11/24/2022 04:09 PM
So, more testing suggests that the frequency of crashes has increased with the latest build.
bitphlipphar 11/24/2022 04:27 PM
It's faster 🥴
MikkelD 11/25/2022 08:04 AM
My thoughts: It happens under high load. It happens when TCP/IP/process unloading is initiated. It seems to happen on both WiFi and Ethernet boards. It looks like Toit heap corruption (class tag, stack pointer).
MikkelD 11/25/2022 08:21 AM
It looks to be class tag 12 whenever I print it out...
MikkelD 11/25/2022 08:26 AM
And by the way, this also seems to happen on our product from time to time, not only in this contrived test case. (So it might not have much to do with process unloading, but more with high load?)
MikkelD 11/25/2022 09:22 AM
FREERTOS_UNICORE=y makes it less frequent, but does not eliminate the crash.
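[Editor's note: for anyone wanting to try the same workaround, this is how a single-core FreeRTOS build is usually selected in an ESP-IDF style sdkconfig. Whether this is the right place to set it for a local Toit/Jaguar firmware build is an assumption here.]

# sdkconfig fragment (assumed location; adjust for your Toit build setup)
CONFIG_FREERTOS_UNICORE=y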
bitphlipphar 11/25/2022 10:12 AM
I think I can reproduce this now. Looking into it.
bitphlipphar 11/25/2022 04:55 PM
Found one small issue. Fixed on master.
bitphlipphar 11/25/2022 04:56 PM
Still needs more work.
MikkelD 11/26/2022 03:31 AM
Happy hunting
bitphlipphar 11/28/2022 10:16 AM
I can confirm that this is reproducible without process unloading and without flash writing. So far, it looks like network load is the only thing required to make this happen.
👍1
bitphlipphar 11/29/2022 06:24 PM
@MikkelD I may have found the cause of the crashes (finally). If we preempt a process right before the current task in the process needs to grow the Toit execution stack, we forget to grow the stack when we resume the task. It then overflows the too-small stack, causing memory corruption that destroys the stack object's header and maybe a bit of the preceding object.
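[Editor's note: to make the failure mode concrete, here is a minimal, self-contained C++ sketch of the race described above. It is not the Toit VM code; every name in it (Task, ensure_headroom, PREEMPT_REQUESTED, push_frame) is invented for illustration. The point is that the preemption check sits at the same spot as the stack-overflow check, so a resume path that forgets to re-grow the stack writes past the end of the undersized stack and corrupts whatever lies beyond it.]

// Toy model of the race described above -- NOT the Toit VM code.
// All names are made up for illustration.
#include <cstdio>
#include <vector>

struct Task {
  std::vector<int> stack;   // the execution stack for this task
  size_t sp = 0;            // next free slot
};

static bool PREEMPT_REQUESTED = false;  // set asynchronously by the scheduler in the real system

// Grow the stack so that at least `needed` more slots fit.
static void ensure_headroom(Task& task, size_t needed) {
  if (task.sp + needed > task.stack.size()) {
    task.stack.resize((task.sp + needed) * 2);
  }
}

// One "bytecode" that pushes `count` values, modeling a call that needs stack space.
static void push_frame(Task& task, size_t count) {
  // The interpreter checks for preemption at the same point where it checks
  // for stack overflow. The bug: if we yield here, the resume path must not
  // forget that the stack still has to be grown before the frame is pushed.
  if (PREEMPT_REQUESTED) {
    PREEMPT_REQUESTED = false;
    // ... task is suspended and later resumed here ...
    // Buggy resume: jump straight to the pushes below -> writes past the end.
    // Fixed resume: fall through to ensure_headroom() again, as done here.
  }
  ensure_headroom(task, count);           // always re-established after a resume
  for (size_t i = 0; i < count; i++) {
    task.stack[task.sp++] = 0;            // without headroom this would corrupt the heap
  }
}

int main() {
  Task task;
  task.stack.resize(4);
  PREEMPT_REQUESTED = true;               // simulate a preemption right before growth
  push_frame(task, 16);
  std::printf("sp=%zu capacity=%zu\n", task.sp, task.stack.size());
}

[In this toy model the fix is simply that ensure_headroom runs unconditionally after the task resumes; the real fix in the VM presumably re-establishes the stack-growth condition in its own way.]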
bitphlipphar 11/29/2022 06:24 PM
I expect to land a fix tomorrow.
MikkelD 11/29/2022 06:25 PM
Yay!
bitphlipphar 11/29/2022 06:26 PM
It gets much more likely under high load with frequent preemptions for cross-process GCs and time sharing.
MikkelD 11/29/2022 06:27 PM
The find definitely fits the observations
bitphlipphar 11/30/2022 08:25 AM
Fix out for review.
bitphlipphar 11/30/2022 02:31 PM
Good progress on eliminating the crashes (first significant fix landed today). Still looking into another case where it looks like we may be slightly underestimating the stack space used by a method.