MQTT not Authorised is disconnecting wifi connection

Toit

#MQTT not Authorised is disconnecting wifi connection

Thread channel in help

theHuanter 11/16/2023 02:39 PM

When the MQTT Client can not connect due to missing or wrong credentials it seem to close the network connection which also prevents the artemis synchronise from running.
In this setup there are 4 WiFi credentials in the Pod.

In the log you can see the it connects to wifi-3, then failes to connect to the broker (our broker in our mqtt service) and closes the wifi-3 connection in the same time the synchronise job failes. This should not happen, right? Once connected to wifi it should stay there and not close the connection because the mqtt client can not connect to its broker.

ping @floitsch

floitsch 11/16/2023 02:47 PM

It seems like any timeout on the network code is considered to be a broken network. However there are some that are caused by bad input (or a server that doesn't behave nicely).
It's a bit of a question: if one container finds that the network is broken, should it inform the other containers or should we let them figure out themselves?
It's even worse, because you might want to "reset" the network, but that only happens once all containers gave up on it.
That said: the current behavior is not optimal and inconsistent.

theHuanter 11/16/2023 06:25 PM

in our case the mqtt container starts over again 10s later and the artemis never shows this device as know or being able to update it

theHuanter 11/16/2023 06:26 PM

because it gets stuck with ethernet for some reason - I guess it is due to the quarantine thing which then also does not try this network again for a while. But also the error message is quite clear and should not lead to a closed network. because its just mqtt which does noit work, the internet itself works

floitsch 11/16/2023 06:58 PM

I hope that @bitphlipphar can provide more insight. I do think that we will need to tweak quarantining.

bitphlipphar 11/16/2023 08:37 PM

I don't think the network operations from the MQTT service should be able to quarantine or close the network used by Artemis. Something seems odd. Will take a look tomorrow morning

bitphlipphar 11/16/2023 08:40 PM

Oh, wait. You're asking Artemis to go offline, right? That probably makes it quarantine the network if it's connected even though that's silly.

bitphlipphar 11/16/2023 08:42 PM

Pretty sure that's it. It has nothing to do with MQTT not connecting and everything to do with 'artemis.run --offline'.

theHuanter 11/16/2023 08:45 PM

hm... okay what would be the solution to that?

bitphlipphar 11/17/2023 04:23 AM

Fix in Artemis coming up.

bitphlipphar 11/17/2023 06:20 AM

Sent out for review. Should be shippable today.

bitphlipphar 11/17/2023 06:51 AM

ez4ence

bitphlipphar 11/17/2023 08:13 AM

The fix is to avoid quarantining the network if the synchronization was simply interrupted (not failing).

bitphlipphar 11/17/2023 11:00 AM

Fix is rolling out to our test fleet.

theHuanter 11/17/2023 11:51 AM

@bitphlipphar

looks good, thanks

Niels R. 11/17/2023 01:45 PM

hi, everyone, the current solution looks like this: https://sedisto-my.sharepoint.com/:v:/p/niels/Ebfl0XUthhlDkIASu4z_BhUBY68MPFENuLoNmC-qkMzuFw?e=JG0uLO

From my perspective a crashing container shouldn't lead to a full network reset. I would expect, that the device is not able to get potential fixes via firmware updates in this state. So the device renders itself unusable until the problem is solved - and the solution couldn't be connected to firmware problem, because the devices can't pull them down.

If you let it run for some time and watch the logs, I saw several different errors popping up:

******************************************************************************
Decoding by `jag`, device has version <2.0.0-alpha.121>
******************************************************************************
EXCEPTION error. 
ALREADY_IN_USE
  0: watchdog-init             <sdk>/esp32/esp32.toit:165:3
  1: WatchdogServiceProvider.start-system-watchdog-if-necessary_.<block> <pkg:toit-watchdog>/provider.toit:70:7
  2: Mutex.do.<monitor-block>  <sdk>/monitor.toit:28:27
  3: __Monitor__.locked_.<block> <sdk>/core/monitor_impl_.toit:123:12
  4: __Monitor__.locked_       <sdk>/core/monitor_impl_.toit:95:3
  5: Mutex.do                  <sdk>/monitor.toit:28:3
  6: WatchdogServiceProvider.start-system-watchdog-if-necessary_ <pkg:toit-watchdog>/provider.toit:68:12
  7: WatchdogServiceProvider.handle <pkg:toit-watchdog>/provider.toit:46:7
  8: ServiceManager_.<lambda>  <sdk>/system/services.toit:601:15
  9: RpcRequest_.process.<block> <sdk>/rpc/broker.toit:98:26
 10: RpcRequest_.process       <sdk>/rpc/broker.toit:95:3
 11: RpcRequestQueue_.ensure-processing-task_.<lambda>.<block>.<block> <sdk>/rpc/broker.toit:214:20
 12: RpcRequestQueue_.ensure-processing-task_.<lambda>.<block> <sdk>/rpc/broker.toit:209:9
 13: RpcRequestQueue_.ensure-processing-task_.<lambda> <sdk>/rpc/broker.toit:204:85
******************************************************************************

Niels R. 11/17/2023 01:47 PM

Or something like the attached one - sorry stacktrace too long...

Niels R. 11/17/2023 01:47 PM

[wifi] INFO: network address dynamically assigned through dhcp {ip: 192.168.8.117}
[wifi] INFO: dns server address dynamically assigned through dhcp {ip: [192.168.8.1]}
[network] INFO: opened {connection: wifi-3}
[artemis.synchronize] INFO: connected to network
[artemis.containers.mqtt] DEBUG: starting {reason: 320002, arguments: null}
[artemis.scheduler] INFO: job started {job: container:mqtt}
[mqtt] DEBUG: mqtt.username definition is not a string or null
[mqtt] DEBUG: mqtt.username is automatically set
[mqtt] DEBUG: mqtt.password definition is not a string or null
[mqtt] DEBUG: mqtt.password is automatically set
[watchdog] INFO: created watchdog {id: mqtt-dog}
[watchdog] INFO: started watchdog {id: mqtt-dog}
[artemis.synchronize] INFO: request to run offline - start
Guru Meditation Error: Core  0 panic'ed (StoreProhibited). Exception was unhandled.

Core  0 register dump:
PC      : 0x400f88a5  PS      : 0x00060430  A0      : 0x800e2d17  A1      : 0x3fff82f      : 0x3ffefce8  A3      : 0x3fff6d28  A4      : 0x3fff6d28  A5      : 0xffff9700  
A6      : 0x00000001  A7      : 0xffffffff  A8      : 0x3fff6d2c  A9      : 0x00000000  
A10     : 0x3fff6d2c  A11     : 0x3fff6d28  A12     : 0x3ffd5178  A13     : 0x00000001  
A14     : 0x00000000  A15     : 0x3ffc3960  SAR     : 0x0000000e  EXCCAUSE: 0x0000001d  
EXCVADDR: 0x00000004  LBEG    : 0x40095f8e  LEND    : 0x40095f99  LCOUNT  : 0x00000000  



******************************************************************************
Decoding by `jag`, device has version <2.0.0-alpha.121>
******************************************************************************
Backtrace: 0x400f88a2:0x3fff82f0 0x400e2d14:0x3fff8310 0x400e2d2e:0x3fff8340 0x400ebf4a:0x3fff8360 0x400ebf71:0x3fff8380
jag: Failed to decode line.
******************************************************************************

ELF file SHA256: 8799dbdb6f4eb226

Also somes along

bitphlipphar 11/17/2023 01:49 PM

@Niels R.

It looks a bit like your container crashes and because it waits for a bit to restart, it fails to feed the watchdog.

bitphlipphar 11/17/2023 01:50 PM

@floitsch

might know why we're seeing ALREADY_IN_USE from the watchdog implementation. That looks weird and potentially wrong.(edited)

Niels R. 11/17/2023 01:50 PM

Nevertheless, I talked with @theHuanter and he'll catch connection errors to the mqtt server within the container itself, log them and retry after some grace periode. Having this exception handling within the container makes sense, but in general if others would like to spin up a container to feed some data to a mqtt broker, jumping though those pains as well, feels heavy.(edited)

Niels R. 11/17/2023 01:53 PM

In addition to those crashes, I saw that after some time the wifi-3 connection was also not working anymore, I suspect that the wifi access points decided to give this unfriendly wifi-client a pause. And also on your infrastructure with "artemis.synchronize" this causes frequent connects and disconnects.

bitphlipphar 11/17/2023 01:55 PM

@Niels R.

The last error (distrusting wifi-3) is what we have been trying to solve with Artemis v0.14.1. Which version are you currently testing?(edited)

bitphlipphar 11/17/2023 01:55 PM

Looks like you're on v0.14.1 in the video.

Niels R. 11/17/2023 01:56 PM

"artemis-version": "v0.14.1"

Niels R. 11/17/2023 01:57 PM

In those cases the error reads:

[network] WARN: opening failed {connection: wifi-3, error: IP_ASSIGN_FAILED}

bitphlipphar 11/17/2023 01:58 PM

Yeah, that would cause Artemis to distrust the network for a little while.

Niels R. 11/17/2023 01:58 PM

After that, it loops to trying eth and breaks with:

******************************************************************************
Decoding by `jag`, device has version <2.0.0-alpha.121>
******************************************************************************
EXCEPTION error. 
CONNECT_FAILED: no available networks
No such file: /Users/niels/.cache/jaguar/snapshots/4bfaea31-053d-53f8-8478-3b8456f32203.snapshot
----
Received a Toit system message. Executing the command below will
make it human readable:
----
jag decode WyNVBVVYU1UQdjIuMC4wLWFscGhhLjEyMVNVAFskVSNVEEv66jEFPVP4hHg7hFbzIgNbI1UEVUVTVQlFWENFUFRJT05TVSVDT05ORUNUX0ZBSUxFRDogbm8gYXZhaWxhYmxlIG5ldHdvcmtzWyNVAlVTWyNVF1sjbAAAAANVRlUAbAABNWxbI2wAAAADVUZVAWwAATOSWyNsAAAAA1VGVQJsAAFmNVsjbAAAAANVRlUDbAABZntbI2wAAAADVUZVBGwAAV5AWyNsAAAAA1VGVQVsAAFellsjbAAAAANVRlUGbAABZsNbI2wAAAADVUZVB2wAATNWWyNsAAAAA1VGVQhsAAExbFsjbAAAAANVRlUJSS8lWyNsAAAAA1VGVQpsAAD1OlsjbAAAAANVRlULbAAA9WRbI2wAAAADVUZVDGwAAWGFWyNsAAAAA1VGVQ1sAAFhslsjbAAAAANVRlUObAABYe5bI2wAAAADVUZVD0kW+FsjbAAAAANVRlUQSRgzWyNsAAAAA1VGVRFJGEpbI2wAAAADVUZVEmwAAUpuWyNsAAAAA1VGVRNsAAFKhlsjbAAAAANVRlUUbAABSapbI2wAAAADVUZVFWwAAUoyWyNsAAAAA1VGVRZJFww=
jag: Failed to decode line.
******************************************************************************

which is fine, since it starts over afterwards.

Niels R. 11/17/2023 02:01 PM

If a device is in this situation I suspect, that it will not be able to get firmware updates. Since it loops in around 10 seconds and since the application (mqtt) container can disable artemis.synchronize, this leads to a not updateable device.

Is it really neccessary to bring artemis in this offline state, before we connect to a local mqtt broker?

bitphlipphar 11/17/2023 02:04 PM

When the mqtt container isn't running, Artemis is free to run. Also Artemis comes with support for not running certain containers if it has hard time connecting. You could try to avoid artemis.run --offline -- I think it was introduced to lower the peak memory pressure.

floitsch 11/17/2023 02:10 PM

There are a lot of things happening here. Let's try to disentangle them one by one.
The first one, which just jumps out because it shouldn't happen:

ALREADY_IN_USE
  0: watchdog-init             <sdk>/esp32/esp32.toit:165:3

Could you check whether there is some error log of the watchdog container failing/crashing?
We purposefully don't clean up the watchdog when the container crashes, since we wouldn't be able to remember if clients had registered watchdogs. As such the best way forward is to let the system watchdog trigger. You then see this ALREADY_IN_USE, but at worst it happens only for 2 seconds before the watchdog triggers.

floitsch 11/17/2023 02:11 PM

As Kasper mentioned: even in a weird state where user-containers and Artemis fight each other for network, Artemis will eventually prevail. It simply shuts down all non-critical containers to be able to go online.
This is, why it's important not to have everything marked as 'critical'. Otherwise, Artemis can't go into a safe-mode where it can recover by getting an update through the cloud.

floitsch 11/17/2023 02:12 PM

We haven't done that yet, but Artemis will better integrate with the watchdog too:
1. use it to make sure that it gets an update itself.
2. disable it for non-critical containers so that it isn't interrupted when it does a firmware update.
The latter is a bit tricky with design options which is one of the reasons we haven't done it yet.

floitsch 11/17/2023 02:21 PM

Looking through the logs you seem to suffer from the fact that the network is continuously shut down and brought up again.
For example:
- when the mqtt starts it disallows artemis to run.
- artemis turns itself off, thus shutting the network down (in the middle of bringing it up).
- moments later mqtt opens the network again.
- it then fails, and shuts the network down again.
- Artemis is allowed to run again and opens the network ,...

Since you are not powered by battery (afaik), it would be nicer if the network stayed alive.
I guess we could add an option to the connections to ask the device to stay "online". As a work-around, a simply container that just net.opens might do the same thing for now ( @bitphlipphar how could that deal with network errors?).

bitphlipphar 11/17/2023 02:22 PM

net.open before artemis.run --offline?

bitphlipphar 11/17/2023 02:22 PM

Same container.

floitsch 11/17/2023 02:22 PM

That would help, but the mqtt container dies when it can't fully connect to the mqtt client.

floitsch 11/17/2023 02:23 PM

so it would then shut down the network again moments later.

bitphlipphar 11/17/2023 02:23 PM

Maybe fix that instead?

bitphlipphar 11/17/2023 02:23 PM

It's a foreseeable issue.

floitsch 11/17/2023 02:24 PM

Would still be nice to be able to not needing to worry that the container crashes.

bitphlipphar 11/17/2023 02:24 PM

Sure.

bitphlipphar 11/17/2023 02:25 PM

I guess it runs against the request to go offline.

bitphlipphar 11/17/2023 02:26 PM

(the option to stay sort of connected)

floitsch 11/17/2023 02:26 PM

Yeah.. The idea is more to take Artemis offline.

floitsch 11/17/2023 02:26 PM

Not the whole network.

bitphlipphar 11/17/2023 02:28 PM

It's a bit of a special case. Let me think about it.

Niels R. 11/17/2023 03:31 PM

Thanks for getting back to me! Really appreciate your effort and energy. Please don't understand my feedback as rude or offensive, I just want to give you feedback from an "enduser" perspective to make Toit more robust.

From my perspective a "normal" use-case for Toit on the ESP32 would be to read some value and push it to a MQTT broker. I'M fully with @bitphlipphar, that the error handling on the connection could be handled within the container. But in general I think a crash in a "user container" shouldn't render the device into such a reboot and reconnect loop.

I'm with @floitsch as well, that the --offline flag should only tell artemis to not do something on the network stack for a moment. And this should not lead to a full network reset if the user-container crashes with an error on the network stack. In our use-case it's MQTT, but it could be a lot of other network issues and services.

Moving forward, we'll try-catch the MQTT related connection errors (host not found, service not available, unauthenticated, connection closed, ...) and retry within the container. In addition we should remove the --offline request to artemis and see if it runs stable.

Thanks again for your effort!(edited)

Niels R.

Thanks for getting back to me! Really appreciate your effort and energy. Please don't understand my feedback as rude or offensive, I just want to give you feedback from an "enduser...(edited)

floitsch 11/17/2023 03:34 PM

No worries. That's exactly how we read it.

From my perspective it also indicates that we should add an option to allow devices to always be connected. If they are not on battery that just seems simpler.

Niels R. 11/17/2023 03:40 PM

That would be awesome for us!

theHuanter 11/17/2023 03:59 PM

some cent from my side as well: I also agree with @Niels R. that the MQTT level crash should not lead to a closing network. In other words thats not a network issue just because the MQTT Client can not connect. Of course this could be multiple reasons but especially wrong credentials shouldn't be one. Maybe this should be handled differently.

A second thought is: running an ESP32 WiFi device on battery does not really makes sense to me tbh. There might be a few use-cases when it sleeps all day, but in general it consumes a lot of battery. I worked on an open-source solar powered ESP32 device and we made it life for like 6 month waking up once a day and go to sleep. But thats just a side-note

which basically aggrees with @bitphlipphar that connecting to wifi consumes a lot of energy as well instead of simply stay online.

OPtheHuanter

some cent from my side as well: I also agree with @Niels R. that the MQTT level crash should not lead to a closing network. In other words thats not a network issue ju...

floitsch 11/17/2023 04:14 PM

I think the closing network makes complete sense. If we assume that the network is only opened when it's needed, then the crash in the mqtt removes the last user of the network and the closing is thus required. It's hard to guard against other programs (in this case Artemis itself) opening the network immediately afterwards. The network manager doesn't know about these kind of synchronizations (and probably shouldn't know about it).

We actually have more customers running on battery than not. Some of them only wake up from time to time (as you mention). Others wake up based on triggers. You need to be really careful how often you connect, though, as that drains the battery really fast. And that's one of the reasons we are so eager to turn the network off when it's not in use anymore.

52 messages in total