Toit
#jag run crashes toit vm on ESP32 and ESP32S3
Thread channel in help
MikkelD 11/24/2022 01:43 PM
When I run the attached program with jag run repeatedly, the VM sometimes crashes. It is not consistent. It happens less often on the ESP32 than on the ESP32S3.

This is on the latest master branch of Toit and Jaguar.
MikkelD 11/24/2022 01:43 PM
MikkelD 11/24/2022 01:44 PM
bitphlipphar 11/24/2022 01:52 PM
I wonder if this is reproducible with 1fc17422b7448fdf0f605648d6ae5552e58ae71a?
MikkelD 11/24/2022 02:10 PM
So far, I haven't been able to reproduce it on that commit.
MikkelD 11/24/2022 02:18 PM
And now it happened on that commit.
MikkelD 11/24/2022 02:19 PM
Guru Meditation Error: Core 1 panic'ed (LoadProhibited). Exception was unhandled.
Core 1 register dump:
PC      : 0x403754ea  PS      : 0x00060030  A0      : 0x82024bda  A1      : 0x3fce5950
0x403754ea: toit::Interpreter::run() at ./toolchains/esp32s3/build/./src/interpreter_run.cc:262
A2      : 0x00000000  A3      : 0x3fce1018  A4      : 0x3c0e1ec8  A5      : 0x00000002
A6      : 0x00000000  A7      : 0x3fcee958  A8      : 0x803754e5  A9      : 0x3fce5930
A10     : 0x3009ff14  A11     : 0x3fca0474  A12     : 0x3fca0478  A13     : 0x0000abab
A14     : 0x00060423  A15     : 0x00060420  SAR     : 0x00000003  EXCCAUSE: 0x0000001c
EXCVADDR: 0x00000000  LBEG    : 0x420ce0ec  LEND    : 0x420ce0f2  LCOUNT  : 0x00000002
0x420ce0ec: toit::Scheduler::has_ready_processes(toit::Locker&) at ./toolchains/esp32s3/build/./src/scheduler.cc:938
0x420ce0f2: toit::Scheduler::has_ready_processes(toit::Locker&) at ./toolchains/esp32s3/build/./src/scheduler.cc:941

Backtrace: 0x403754e7:0x3fce5950 0x42024bd7:0x3fce59d0 0x42024e66:0x3fce5a20 0x42024e91:0x3fce5a50 0x4201760e:0x3fce5a70 0x42017635:0x3fce5a90
0x403754e7: toit::Interpreter::run() at ./toolchains/esp32s3/build/./src/interpreter_run.cc:262
0x42024bd7: toit::Scheduler::run_process(toit::Locker&, toit::Process*, toit::SchedulerThread*) at ./toolchains/esp32s3/build/./src/scheduler.cc:627
0x42024e66: toit::Scheduler::run(toit::SchedulerThread*) at ./toolchains/esp32s3/build/./src/scheduler.cc:404
0x42024e91: toit::SchedulerThread::entry() at ./toolchains/esp32s3/build/./src/scheduler.cc:38
0x4201760e: toit::Thread::_boot() at ./toolchains/esp32s3/build/./src/os_esp32.cc:302
0x42017635: toit::thread_start(void*) at ./toolchains/esp32s3/build/./src/os_esp32.cc:289
(inlined by) esp_thread_start at ./toolchains/esp32s3/build/./src/os_esp32.cc:294
MikkelD 11/24/2022 02:23 PM
FYI: I am using this script to run an extended test.
#!/bin/bash
while :
do
  /Users/mikkel/proj/jaguar/build/jag run service_perf_test.toit -d 192.168.1.129
  sleep 2
done
MikkelD 11/24/2022 02:26 PM
Also reproduced in 2a299286d3c1cda55afeb5a9a9fe19e1279a39f9
After quite a while
MikkelD 11/24/2022 02:39 PM
The log from a roughly 10-minute run:
MikkelD 11/24/2022 02:39 PM
On 2a299286d3c1cda55afeb5a9a9fe19e1279a39f9
MikkelD 11/24/2022 02:50 PM
I don't think a recent commit is the culprit.
MikkelD 11/24/2022 02:51 PM
I went back to a commit from a month ago and could still reproduce it.
bitphlipphar 11/24/2022 02:54 PM
Good to know!
MikkelD 11/24/2022 02:56 PM
At least we have a test case that can reproduce it within a few minutes. It is fairly infrequent, but it happens. Anything else I can do to help trace this one down?
MikkelD 11/24/2022 04:09 PM
So, more testing suggests that the frequency of crashes has increased with the latest build.
bitphlipphar 11/24/2022 04:27 PM
It's faster 🥴
MikkelD 11/25/2022 08:04 AM
My thoughts: It happens under high load. It happens when TCP/IP/process unloading is initiated. It seems to happen on both WiFi and Ethernet boards. It looks like Toit heap corruption (class tag, stack pointer).
MikkelD 11/25/2022 08:21 AM
It looks to be class tag 12 whenever I print it out...
MikkelD 11/25/2022 08:26 AM
And by the way, this also seems to happen on our product from time to time, not only in this contrived test case. (So it might not have much to do with process unloading, but more with high load?)
MikkelD 11/25/2022 09:22 AM
FREERTOS_UNICORE=y makes it less frequent, but does not eliminate the crash.
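[Editor's note: for anyone wanting to try the same workaround, this is how a single-core FreeRTOS build is usually selected in an ESP-IDF style sdkconfig. Whether this is the right place to set it for a local Toit/Jaguar firmware build is an assumption here.]

# sdkconfig fragment (assumed location; adjust for your Toit build setup)
CONFIG_FREERTOS_UNICORE=y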
bitphlipphar 11/25/2022 10:12 AM
I think I can reproduce this now. Looking into it.
bitphlipphar 11/25/2022 04:55 PM
Found one small issue. Fixed on master.
bitphlipphar 11/25/2022 04:56 PM
Still needs more work.
MikkelD 11/26/2022 03:31 AM
Happy hunting
bitphlipphar 11/28/2022 10:16 AM
I can confirm that this is reproducible without process unloading and without flash writing. So far, it looks like network load is the only thing required to make this happen.
👍1
bitphlipphar 11/29/2022 06:24 PM
@MikkelD I may have found the cause of the crashes (finally). If we preempt a process right before the current task in the process needs to grow the Toit execution stack, we forget to grow the stack when we resume the task. It then overflows the too-small stack, causing memory corruption that destroys the stack object's header and maybe a bit of the preceding object.
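[Editor's note: to make the failure mode concrete, here is a minimal, self-contained C++ sketch of the race described above. It is not the Toit VM code; every name in it (Task, ensure_headroom, PREEMPT_REQUESTED, push_frame) is invented for illustration. The point is that the preemption check sits at the same spot as the stack-overflow check, so a resume path that forgets to re-grow the stack writes past the end of the undersized stack and corrupts whatever lies beyond it.]

// Toy model of the race described above -- NOT the Toit VM code.
// All names are made up for illustration.
#include <cstdio>
#include <vector>

struct Task {
  std::vector<int> stack;   // the execution stack for this task
  size_t sp = 0;            // next free slot
};

static bool PREEMPT_REQUESTED = false;  // set asynchronously by the scheduler in the real system

// Grow the stack so that at least `needed` more slots fit.
static void ensure_headroom(Task& task, size_t needed) {
  if (task.sp + needed > task.stack.size()) {
    task.stack.resize((task.sp + needed) * 2);
  }
}

// One "bytecode" that pushes `count` values, modeling a call that needs stack space.
static void push_frame(Task& task, size_t count) {
  // The interpreter checks for preemption at the same point where it checks
  // for stack overflow. The bug: if we yield here, the resume path must not
  // forget that the stack still has to be grown before the frame is pushed.
  if (PREEMPT_REQUESTED) {
    PREEMPT_REQUESTED = false;
    // ... task is suspended and later resumed here ...
    // Buggy resume: jump straight to the pushes below -> writes past the end.
    // Fixed resume: fall through to ensure_headroom() again, as done here.
  }
  ensure_headroom(task, count);           // always re-established after a resume
  for (size_t i = 0; i < count; i++) {
    task.stack[task.sp++] = 0;            // without headroom this would corrupt the heap
  }
}

int main() {
  Task task;
  task.stack.resize(4);
  PREEMPT_REQUESTED = true;               // simulate a preemption right before growth
  push_frame(task, 16);
  std::printf("sp=%zu capacity=%zu\n", task.sp, task.stack.size());
}

[In this toy model the fix is simply that ensure_headroom runs unconditionally after the task resumes; the real fix in the VM presumably re-establishes the stack-growth condition in its own way.]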
bitphlipphar 11/29/2022 06:24 PM
I expect to land a fix tomorrow.
MikkelD 11/29/2022 06:25 PM
Yay!
bitphlipphar 11/29/2022 06:26 PM
It gets much more likely under high load with frequent preemptions for cross-process GCs and time sharing.
MikkelD 11/29/2022 06:27 PM
The find definitely fits the observations
bitphlipphar 11/30/2022 08:25 AM
Fix out for review.
bitphlipphar 11/30/2022 02:31 PM
Good progress on eliminating the crashes (first significant fix landed today). Still looking into another case where it looks like we may be slightly underestimating the stack space used by a method.