Toit
#MQTT Client connection issue if broker is unavailable
theHuanter 11/15/2023 03:35 PM
If the MQTT Client connects to an IP address that has no device behind it, the client crashes with "connection refused" and tries again after a few seconds.
If the MQTT Client connects to an IP address where the device is available (pingable) but the broker itself is not, the code crashes with the same "connection refused" exception but does not seem to pause between retries. This might lead to unwanted behavior of the containers. In our case a separate BLE container kept running, but BLE stopped advertising for some reason (I assume a memory overflow or something similar).

Any idea of why that is?
bitphlipphar 11/15/2023 03:40 PM
Summoning @floitsch ...
floitsch 11/15/2023 03:41 PM
Hmm. It should always pause.
floitsch 11/15/2023 03:41 PM
Could you have a look with logging enabled?
floitsch 11/15/2023 03:42 PM
Would also be good to know whether it really was an OOM.
theHuanter 11/15/2023 03:45 PM
it never showed an OOM
theHuanter 11/15/2023 03:45 PM
I try to reproduce it right now
theHuanter 11/15/2023 03:46 PM
so even tho it does not pause right now, it also does not crash anything, it just retries in a loop
theHuanter 11/15/2023 03:47 PM
but it crashes here:
client = mqtt.Client --host=host --routes=routes
when I am creating the client. So there must be a difference if the server device is available or not somehow in the code
theHuanter 11/15/2023 03:48 PM
The, let's call it, "fast looping" only happens if the server is reachable but the broker is deactivated (I killed the mosquitto broker service).
theHuanter 11/15/2023 03:49 PM
If the server is unreachable, so that even a ping would not reach it, the loop seems reasonable.
theHuanter 11/15/2023 03:52 PM
okay interesting - now the ble stopped advertising! no crash or anything. It crashed repeatedly with
******************************************************************************
Decoding by `jag`, device has version <2.0.0-alpha.120>
******************************************************************************
EXCEPTION error.
Connection refused
  0: TcpSocket.connect            <sdk>\net\modules\tcp.toit:151:40
  1: TcpSocket.connect            <sdk>\net\modules\tcp.toit:141:12
  2: Client.tcp-connect           <sdk>\net\net.toit:110:12
  3: Client.tcp-connect           <sdk>\net\net.toit:101:12
  4: ReconnectingTransport_.new-connection_ <pkg:mqtt>\tcp.toit:132:21
  5: ReconnectingTransport_.reconnect.<block> <pkg:mqtt>\tcp.toit:120:22
  6: Mutex.do.<monitor-block>     <sdk>\monitor.toit:28:27
  7: __Monitor__.locked_.<block>  <sdk>\core\monitor_impl_.toit:123:12
  8: __Monitor__.locked_          <sdk>\core\monitor_impl_.toit:95:3
  9: Mutex.do                     <sdk>\monitor.toit:28:3
 10: ReconnectingTransport_.reconnect <pkg:mqtt>\tcp.toit:112:25
 11: ReconnectingTransport_       <pkg:mqtt>\tcp.toit:94:5
 12: TcpTransport                 <pkg:mqtt>\tcp.toit:33:12
 13: Client                       <pkg:mqtt>\client.toit:54:18
 14: main                         C:\Users\Mirko\AppData\Local\Temp\artemis-464b3efa-86cf-43f1-a66e-8b58389ac9e3\clone\src\services\mqtt\mqtt.toit:71:12
******************************************************************************
for many many times and suddenly the crashes stopped being printed. But I see that the ble watchdog is still running, which tells me that the ble container is still running so the task which is keeping the advertisement died or got stuck or something.
floitsch 11/15/2023 04:00 PM
So from what I can see: when mqtt.Client --... can't connect it immediately throws an exception like the one you show.
floitsch 11/15/2023 04:00 PM
However, it is not, by itself, trying to reconnect.
floitsch 11/15/2023 04:01 PM
If you see a loop, then that's because you (or Artemis) are trying to connect again.
floitsch 11/15/2023 04:01 PM
I'm guessing you don't have a catch around that part of the code. -> The program crashes.
However, you marked the container as critical (or something similar), and the program is started immediately again and it tries to start the MQTT again.
floitsch 11/15/2023 04:03 PM
There are now three ways to avoid this:
- change the container's description to not be critical, or interval 0s. If it is interval 1s, it would wait 1s before starting the program again.
- catch the exception of mqtt.Client and retry again after an appropriate timeout.
- we change the mqtt library to go through the "normal" reconnection strategy even for the first connection.
I think I changed it away from that, because users didn't get a nice "wrong credentials, ..." when they tried to start the client. Instead the program seemed to hang.
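The second option above (catch the exception and retry after a timeout) could look roughly like the following. This is a minimal sketch, not the mqtt package's own reconnection logic: `connect-with-retry` and `RETRY-DELAY` are hypothetical names, and the 10 second delay is an arbitrary choice. The `mqtt.Client --host=... --routes=...` call is taken from the snippet earlier in this thread.

```toit
import mqtt

// Hypothetical delay between attempts; pick a value that fits the deployment.
RETRY-DELAY ::= Duration --s=10

// Hypothetical helper, not part of the mqtt package.
// Keeps trying to create the client and sleeps between failed attempts, so
// the container does not crash and get restarted immediately.
connect-with-retry --host/string --routes/Map -> mqtt.Client:
  while true:
    client/mqtt.Client? := null
    // `catch` returns the thrown exception, or null if the block completed.
    exception := catch:
      client = mqtt.Client --host=host --routes=routes
    if client: return client
    print "MQTT connection failed: $exception. Retrying in $(RETRY-DELAY)."
    sleep RETRY-DELAY
  unreachable
```

Note that `catch` swallows every exception here, not just "Connection refused", so a configuration problem such as wrong credentials would also end up in the retry loop. Printing the exception, as above, at least keeps that visible in the logs.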
theHuanter 11/15/2023 04:03 PM
true - the loop comes from artemis restarting the container
theHuanter 11/15/2023 04:04 PM
ok I will try to remove the critical part and make interval to be 1s
theHuanter 11/15/2023 04:08 PM
I made the interval 1s and removed critical=true, uploaded, flashed, but after the crash it is not restarting the container. it stopped, but it does not start again
theHuanter 11/15/2023 04:10 PM
"mqtt": {
  ... github bla bla
  "background": true,
  "interval": "1s"
}
bitphlipphar 11/15/2023 04:12 PM
Is that the right syntax?
theHuanter 11/15/2023 04:13 PM
the interval part was there already but with 0s
bitphlipphar 11/15/2023 04:13 PM
"containers": {
  "measure": {
    "entrypoint": "measure.toit",
    "triggers": [
      { "interval": "20s" }
    ]
  }
}
🤯1
bitphlipphar 11/15/2023 04:13 PM
I think that's the usual example.
theHuanter 11/15/2023 04:14 PM
okay, did you change that?
bitphlipphar 11/15/2023 04:14 PM
Maybe I am wrong and there is a shortcut.
theHuanter 11/15/2023 04:14 PM
not sure where I got my version from
floitsch 11/15/2023 04:16 PM
Could be that we changed it.
theHuanter 11/15/2023 04:17 PM
the configs are quite old tho or at least the content. it works now
theHuanter 11/15/2023 04:17 PM
I think that's an acceptable solution
theHuanter 11/15/2023 04:18 PM
made it 10s now
👍1
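For reference, combining the `triggers` syntax from the "measure" example with the 10s interval mentioned here would presumably give a container entry along these lines. This is only a sketch: the `"entrypoint"` value is a hypothetical stand-in for the elided source entry in the original config, and the thread does not show the final file.

```json
{
  "containers": {
    "mqtt": {
      "entrypoint": "mqtt.toit",
      "background": true,
      "triggers": [
        { "interval": "10s" }
      ]
    }
  }
}
```

With an interval trigger in place of the critical flag, the intent is that a crash in the MQTT container is followed by a pause before the next run rather than an immediate restart.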
floitsch 11/15/2023 04:39 PM
I'm planning on creating JSON schemas for the specification files. That should make it easier to manipulate them.
bitphlipphar 11/16/2023 08:06 AM
@floitsch Should we complain about unrecognized entries?
floitsch 11/16/2023 10:25 AM
I guess we should at least warn.