How Failures are Handled in ERS - Potato/Elixir GUSC IoT System
Description
Hardware: Sensor Failure
If the sensor node cannot connect to a sensor or retrieve data from it, the sensor hardware is deemed to have failed. The sensor node then sends a failure message to the server, which updates the status record to show the fault, and the website reflects the new status. This was validated by disconnecting each sensor in turn and checking that the website updated the status correctly each time.
Sensor failure reports could be made more informative by tracking and reporting the status of each sensor individually, but this would require more verbose code and a more elaborate website.
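The single node-level failure rule described above can be sketched as follows. This is an illustrative Python sketch, not the project's Elixir code: the function `poll_sensors`, the message shapes, and the sensor names are all assumptions made for the example.

```python
from typing import Callable, Dict


def poll_sensors(sensors: Dict[str, Callable[[], float]]) -> dict:
    """Read every sensor once; report one node-level failure if any read fails."""
    readings = {}
    for name, read in sensors.items():
        try:
            readings[name] = read()
        except Exception:
            # A single failing sensor marks the whole sensor set as failed,
            # mirroring the one shared status described in the text.
            return {"type": "sensor_failure"}
    return {"type": "measurement", "readings": readings}


def broken_sensor() -> float:
    """Hypothetical stand-in for a disconnected sensor."""
    raise IOError("sensor not connected")
```

Calling `poll_sensors({"temperature": lambda: 21.5, "humidity": broken_sensor})` yields the failure message, which is the behaviour the disconnection test exercised. Per-sensor reporting would replace the early return with a per-name status entry.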
Software: Sensor Node Software Failure
ERS on the sensor node is launched by a monitor module that watches the software using Elixir's built-in monitoring functionality. The monitor module does not launch the server node, although both nodes use the same codebase. If the sensor node software crashes, the monitor module restarts it automatically. On the server, if the sensor node does not reconnect within 10 seconds (currently twice the normal communication interval of 5 seconds), the Statuses table and the website are updated to show that the ERS process on the sensor node has failed, in other words that the software has crashed. This was validated by adding a line to the sensor node software that causes a crash and checking that the monitor module relaunches ERS.
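The restart-on-crash behaviour can be illustrated with a small loop. This is only a Python analogue of the idea, since the real system relies on Elixir's process monitoring. The function `run_with_restart` and its `max_restarts` limit are assumptions added to keep the example finite; the source does not say whether restarts are capped.

```python
from typing import Callable


def run_with_restart(task: Callable[[], None], max_restarts: int) -> int:
    """Run task, relaunching it after each crash; return the number of restarts."""
    restarts = 0
    while True:
        try:
            task()
            return restarts  # task finished without crashing
        except Exception:
            if restarts >= max_restarts:
                raise  # give up and surface the last crash (assumed policy)
            restarts += 1  # relaunch, as the monitor module does for ERS
```

A task that crashes on its first two runs and then succeeds would be relaunched twice, much like the validation test that injected a crashing line into the sensor node software.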
Network: Lost Sensor Node Data
Sensor node data can be lost when packets are dropped during the UDP-based communication between the sensor node and the server. To catch this, if the Measurements table receives no data for more than 8 seconds (3 seconds over the data reading interval), the server updates the Statuses table to record the failure, and the website then shows that the sensor reading was lost. The status returns to the working state once the next set of sensor data is received. This was validated by modifying the sensor node code so that it sent no data to the server and checking that the status on the website changed to reflect the failure. The attached image shows the result.
As an improvement, instead of only displaying the lost-data status, the server could automatically request a single reading from the sensor node.
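The lost-data timeout described in this section can be sketched as a small watchdog. This is an illustrative Python sketch rather than the project's Elixir code: the class `TimeoutWatch`, its method names, and the status strings are assumptions. Time is passed in explicitly so the logic is easy to check.

```python
class TimeoutWatch:
    """Tracks when data last arrived and flags a loss after timeout_s seconds."""

    def __init__(self, timeout_s: float, start: float = 0.0):
        self.timeout_s = timeout_s
        self.last_seen = start

    def record(self, now: float) -> None:
        """Call whenever a set of sensor data arrives."""
        self.last_seen = now

    def status(self, now: float) -> str:
        # "More than" the timeout: exactly timeout_s of silence still counts as working.
        return "lost" if now - self.last_seen > self.timeout_s else "working"


# 8 s lost-data threshold from the text (3 s over the reading interval).
measurements = TimeoutWatch(timeout_s=8.0)
```

The same structure with `timeout_s=10.0` would cover the 10-second reconnect check used for sensor node software failures. Because `record` resets the timer, the status returns to working as soon as the next data set arrives.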
Files
website_lost_data.png (17.2 kB, md5:aa3939dfa1c5b4eff2b83f673772c650)