Improving the reliability of the Internet of Things

Over-the-air (OTA) updates enable dramatic changes in the way systems in the Internet of Things (IoT) operate. This article explains the benefits.

The obvious advantage is, of course, easier updates, often downloaded and installed transparently. When this is coupled with software tracing, it becomes a powerful mechanism to improve the quality and reliability of a wide range of embedded IoT systems.

Despite the best efforts of developers, these systems are still deployed with bugs remaining in their code. A development team introduces on average about 120 bugs per 1,000 lines of code during development and about five percent, or six bugs per 1,000 lines of code, typically remain in the shipped software. When there are thousands of IoT devices deployed in the field, relying on users to report the problems caused by these bugs is neither reliable nor scalable. User reports also tend to be vague and unhelpful for solving the problem. When there are millions of devices, this matters even more.
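To put those rates in perspective, the arithmetic can be sketched in a few lines of Python. The two rates come from the figures above; the 50,000-line code base is a hypothetical example, not a measurement.

```python
from typing import Tuple

# Rates quoted in the article, per 1,000 lines of code (KLOC).
INTRODUCED_PER_KLOC = 120
REMAINING_PERCENT = 5  # share of introduced bugs that typically ship


def estimate_bugs(lines_of_code: int) -> Tuple[int, int]:
    """Return (introduced, shipped) bug estimates using integer arithmetic."""
    introduced = lines_of_code * INTRODUCED_PER_KLOC // 1000
    shipped = introduced * REMAINING_PERCENT // 100
    return introduced, shipped


# Hypothetical firmware project of 50,000 lines of code.
introduced, shipped = estimate_bugs(50_000)
print(introduced, shipped)  # 6000 300
```

Even a modest firmware project would, by these averages, ship with a few hundred latent bugs.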

These missed bugs probably won’t show up right away. They cause problems only under certain circumstances, otherwise they would have been found before the product shipped. While an OTA update can fix a problem in the field, developers first need a feedback system that identifies issues in deployed devices, and they need to know quickly. This approach has long been standard in the development of mobile and cloud applications (DevOps), and it has now become viable for embedded development as well.

The key to finding out about problems in the field, and solving them, is the combination of software tracing, cloud management and OTA updates, but this is a complex challenge. The tracing code needs to be as efficient as possible in a system that is already resource-constrained. The link back to the cloud needs to be secure and transparent, and it must transfer the right data to help developers identify any problems quickly and easily.

The cloud service has to identify what issues are new and important, and then notify the developers that there is a problem they need to fix. Once it’s fixed, the updated software must be distributed to all devices. All of this needs to scale across millions of devices.

The information flow starts in the error-handling code of the IoT device, that is, in the sanity checks and fault exception handlers that already exist. A software agent uploads firmware issues as alerts to the customer’s cloud account. An alert may include an error message and any other information relevant to the specific issue, such as software state variables and hardware registers. Depending on the severity of the issue, the alert is uploaded either directly or after a device restart, once the cloud connection has been restored.
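The alert flow described above can be modelled in a few lines. This is a minimal sketch in Python for readability, not a real agent, which would typically be written in C. The names `Severity`, `Alert` and `Agent` are hypothetical, and the rule that fatal issues wait for a restart while warnings go up immediately is an assumption made for illustration.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable, Dict, List


class Severity(Enum):
    WARNING = 1  # assumption: device keeps running, alert is uploaded directly
    FATAL = 2    # assumption: device must restart, alert is uploaded afterwards


@dataclass
class Alert:
    message: str
    severity: Severity
    state: Dict[str, int] = field(default_factory=dict)      # software state variables
    registers: Dict[str, int] = field(default_factory=dict)  # hardware registers


class Agent:
    """Hypothetical device-side agent that forwards alerts to the cloud."""

    def __init__(self, upload: Callable[[Alert], None]) -> None:
        self._upload = upload
        # Stands in for storage that survives a device restart.
        self._pending: List[Alert] = []

    def report(self, alert: Alert) -> None:
        """Called from existing sanity checks and fault handlers."""
        if alert.severity is Severity.FATAL:
            self._pending.append(alert)
        else:
            self._upload(alert)

    def on_cloud_connected(self) -> None:
        """Called after a restart, once the cloud connection is restored."""
        while self._pending:
            self._upload(self._pending.pop(0))


uploaded: List[Alert] = []
agent = Agent(uploaded.append)
agent.report(Alert("sensor value out of range", Severity.WARNING, state={"retries": 3}))
agent.report(Alert("hard fault", Severity.FATAL, registers={"PC": 0x08001234}))
agent.on_cloud_connected()  # the deferred alert is uploaded here
```

The upload function is passed in, so the agent does not depend on any particular IoT platform or transport.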

The alerts may also include a trace of the most recent software events in the device, which is recorded automatically by the agent. The trace provides both the details of the error and the context, making it easier for developers to identify the bug.
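One common way to keep "the most recent events" in a fixed amount of memory is a ring buffer, in which new entries overwrite the oldest ones. The sketch below illustrates the idea only. The class name `TraceBuffer` and the event names are hypothetical, and this is not a description of how any particular tracing product stores its data.

```python
from collections import deque
from typing import Deque, List, Tuple


class TraceBuffer:
    """Fixed-capacity event log that always holds the newest events."""

    def __init__(self, capacity: int) -> None:
        # A deque with maxlen silently drops the oldest entry when full.
        self._events: Deque[Tuple[int, str]] = deque(maxlen=capacity)

    def record(self, timestamp: int, event: str) -> None:
        self._events.append((timestamp, event))

    def snapshot(self) -> List[Tuple[int, str]]:
        """Copy of the buffered events, oldest first, to attach to an alert."""
        return list(self._events)


trace = TraceBuffer(capacity=3)
for tick, name in enumerate(["boot", "wifi_up", "read_sensor", "send", "fault"]):
    trace.record(tick, name)

print(trace.snapshot())  # [(2, 'read_sensor'), (3, 'send'), (4, 'fault')]
```

When an error occurs, the snapshot gives developers the events leading up to it without the device ever storing more than the chosen capacity.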

Encoding efficiency is key. Developers need enough context to identify the real problem, but tracing should use only a minimal amount of memory. This is important not least because it allows traces of sufficient length to be collected even from memory-constrained IoT systems.

This encoding efficiency makes it possible to use the trace technology out in the field, even in small IoT devices, bringing dramatic advantages.
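To illustrate what compact encoding can mean in practice, the sketch below packs each event into four bytes: a numeric event code, one argument byte and the time elapsed since the previous event. The layout is an assumption chosen for illustration, not the format of any specific tracing tool.

```python
import struct
from typing import List, Tuple

# Assumed layout: event id (1 byte), argument (1 byte), ticks since the
# previous event (2 bytes), little-endian with no padding.
EVENT_FORMAT = "<BBH"
EVENT_SIZE = struct.calcsize(EVENT_FORMAT)  # 4 bytes per event

Event = Tuple[int, int, int]  # (timestamp, event_id, argument)


def encode(events: List[Event]) -> bytes:
    """Pack events as fixed-size records with delta timestamps."""
    out = bytearray()
    previous = 0
    for timestamp, event_id, arg in events:
        delta = timestamp - previous
        if not 0 <= delta <= 0xFFFF:
            raise ValueError("time delta does not fit in 16 bits")
        out += struct.pack(EVENT_FORMAT, event_id, arg, delta)
        previous = timestamp
    return bytes(out)


def decode(data: bytes) -> List[Event]:
    """Rebuild absolute timestamps from the stored deltas."""
    events = []
    timestamp = 0
    for event_id, arg, delta in struct.iter_unpack(EVENT_FORMAT, data):
        timestamp += delta
        events.append((timestamp, event_id, arg))
    return events


events = [(1000, 1, 0), (1012, 2, 7), (1500, 1, 1)]
packed = encode(events)
print(len(packed))              # 12
print(decode(packed) == events)  # True
print(1024 // EVENT_SIZE)        # 256 events fit in a 1 KB buffer
```

Storing the time difference rather than a full timestamp is what keeps each record small. A real encoder would also need an escape mechanism for gaps longer than 16 bits allow, which this sketch simply rejects.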

Alerts from the firmware agent are uploaded to the customer’s cloud service, which is configured to store the alerts and to notify an engine that handles classification, statistics and notifications to the developers. The engine also offers configuration options that specify, for example, the conditions under which notifications should be sent and to whom.
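The classification and notification step can be sketched as follows. All names here (`Rule`, `AlertEngine`, the recipients) are hypothetical, and the policy of notifying only on the first occurrence of each alert signature is an assumption used to keep the example short.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Rule:
    """Configuration entry: who is notified, and from which severity."""
    min_severity: int
    recipient: str


class AlertEngine:
    def __init__(self, rules: List[Rule], notify: Callable[[str, str], None]) -> None:
        self._rules = rules
        self._notify = notify
        self.counts: Counter = Counter()  # occurrences per alert signature

    def handle(self, signature: str, severity: int) -> None:
        self.counts[signature] += 1
        if self.counts[signature] > 1:
            return  # known issue: statistics are updated, nobody is notified again
        for rule in self._rules:
            if severity >= rule.min_severity:
                self._notify(rule.recipient, signature)


notifications = []
engine = AlertEngine(
    rules=[Rule(min_severity=1, recipient="firmware-team"),
           Rule(min_severity=2, recipient="on-call")],
    notify=lambda recipient, signature: notifications.append((recipient, signature)),
)

engine.handle("sig-a", severity=2)  # new issue: both recipients are notified
engine.handle("sig-a", severity=2)  # repeat: counted only
engine.handle("sig-b", severity=1)  # new, lower severity: firmware-team only

print(notifications)
# [('firmware-team', 'sig-a'), ('on-call', 'sig-a'), ('firmware-team', 'sig-b')]
print(engine.counts["sig-a"])  # 2
```

Grouping by signature is what lets the service scale: a million devices hitting the same bug produce one notification and one counter, not a million emails.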

When developers receive notification about a new issue, they can access alerts and traces to see what the problem is.

Privacy is also key here. The software trace never needs to leave the customer’s cloud account. Only an anonymised signature of the alert is required for the cloud processing, which can be provided in an external cloud service.

This information can be made completely transparent, configurable, and meaningless on its own. Communication and storage are provided by the existing capabilities in the developer’s IoT platform, using best practices for authentication and encryption.
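One way to produce a signature that is meaningless on its own is to hash the fields that identify an issue. The sketch below assumes the signature is built from the firmware version, the error message and the fault address. That choice of fields, and the function name `alert_signature`, are illustrative assumptions.

```python
import hashlib


def alert_signature(firmware_version: str, message: str, fault_address: int) -> str:
    """Return a SHA-256 digest that identifies an issue without exposing its details.

    No device or user identifier is included, so the same bug on different
    devices yields the same signature.
    """
    material = f"{firmware_version}|{message}|{fault_address:#010x}".encode("utf-8")
    return hashlib.sha256(material).hexdigest()


first = alert_signature("1.4.2", "hard fault", 0x08001234)
again = alert_signature("1.4.2", "hard fault", 0x08001234)
other = alert_signature("1.4.2", "stack overflow", 0x08001234)

print(first == again)  # True: same issue, same signature
print(first == other)  # False: different issue
print(len(first))      # 64 hexadecimal characters
```

The external service can count and classify such signatures, while the message, trace and register contents stay in the customer's own cloud account.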

Need to Trace Your IoT Device?

Testing in the lab just isn’t enough. Today’s embedded IoT systems are complex. To eradicate all software issues you need real-time tracing and alerts that identify bugs in the field as they happen, with automatic notifications to the developers to speed up resolution.

Such a system has to be scalable, secure and transparent to the developers. Once in place, it provides immediate awareness at the very first occurrence of an issue, before many users have been affected, and lets developers take full advantage of OTA updates to rapidly improve their product.

– Johan Kraft, CEO and Founder of Percepio