Editorial: Embedded Computing and Society
This paper presents a modeling and optimization framework that enables developers to model an application's data sources, tasks, and exchanged data tokens; specify application requirements through high-level design metrics and fuzzy logic based optimization rules; and define an estimation framework to automatically optimize the application at runtime. We demonstrate the modeling and optimization process via an example application for video-based vehicle tracking and collision avoidance. We analyze the benefits of runtime optimization by comparing the performance of static point solutions to dynamic solutions over five distinct execution scenarios, showing improvements of up to 74% for dynamic over static configurations.
The rapidly evolving technologies in the mobile devices inevitably increase their power demands for the battery. However, the development of the battery can hardly keep pace with the fast growing demands, leading to short battery life which becomes the top complaints from the customers. In this paper, we investigate a novel energy supply technology, fuel cell (FC), and leverage its advantages of providing the long-term energy storage to build a hybrid FC-battery power system. Therefore, the mobile devices operation time is dramatically extended so that users are not bothered by the battery recharging anymore. We examine the real-world smart phone usage data and find that a naive hybrid power system cannot meet many users' highly diversified power demands. We thus propose the $\alpha$\% peak throttling technique that reduces the device power consumption by a% for each power peak to solve the mismatch between the power supply and demands. This technique trades the quality-of-service (QoS) for a larger FC ratio in the system, thus much longer device operation time. We further observe that the user's personality largely determines his/her satisfaction with the QoS degradation and the operation time extension. Applying a fixed a% peak throttling fails to satisfy every user. We thus propose the personality-aware peak throttling that identifies the user personality online and then adopts the best a% value during the peak throttling to achieve the optimal satisfaction score for each user. The experimental results show that our personality-aware hybrid FC-battery solution can achieve 3.4X longer operation time and 25\% higher satisfaction score comparing to the baseline (the common battery powered device) under the same size and weight limitation.
Modern and recent architectures of vision based Convolutional Neural Networks (CNN) have improved detection and prediction accuracy significantly. However, these algorithms are extremely computational intensive. To break the power and performance wall of CNN computation, we reformulate the CNN computation into an iterative process, where each iteration processes a sub-sample of input features with smaller network and ingests additional features to improve the prediction accuracy. Each smaller network could either classify based on its input set or feed computed and extracted features to the next network to enhance the accuracy. The proposed approach allows early-termination upon reaching acceptable confidence. Moreover, each iteration provides a contextual awareness that allows an intelligent resource allocation and optimization for the proceeding iterations. In this paper we propose various policies to reduce the computational complexity of CNN through the proposed iterative approach. We illustrate how the proposed policies construct a dynamic architecture suitable for a wide range of applications with varied accuracy requirements, resources, and time-budget, without further need for network re-training. Furthermore, we carry out a visualization of the detected features in each iteration through deconvolution network to gain more insight into the successive traversal of the ICNN.
Shared last level caches (LLC) of multicore systems-on-chip are subject to a significant amount of contention over a limited bandwidth, resulting in major performance bottlenecks that make the issue a first-order concern in modern multiprocessor systems-on-chip. Even though shared cache space partitioning has been extensively studied in the past, the problem of cache bandwidth partitioning has not received sufficient attention. We demonstrate the occurrence of such contention and the resulting impact on the overall system performance. To address the issue, we perform detailed simulations to study the impact of different parameters, and propose a novel cache bandwidth partitioning technique, called REAL, that arbitrates among cache access requests originating from different processor cores. It monitors the LLC access patterns to dynamically assign a priority value to each core. Experimental results on different mixes of benchmarks show up to 2.13x overall system speedup over baseline policies, with minimal impact on energy.
Executing a set of control loops over a shared multi-hop (wireless) control network (MCN) requires careful co-scheduling of the control tasks as well as the routing of sensory/actuation messages over the MCN. In this work, we establish pattern guided aperiodic execution of control loops as a resource-aware alternative to traditional fully periodic executions of a set of embedded control loops sharing a computation as well as the communication infrastructure. We provide a satisfiability modulo theories based co-design framework that synthesizes loop execution patterns having optimized control cost as the underlying scheduling scheme together with the associated routing solution over the MCN. The routing solution implements the timed movement of the sensory/actuation messages of the control loops, generated according to those loop execution patterns. From the given settling time requirement of the control loops, we compute a control theoretically sound model using matrix inequalities, that gives an upper bound to the number of loop drops within the finite length loop execution pattern. Next, we show that how the proposed framework can be useful for evaluating the fault tolerance of a resource-constrained shared MCN subject to communication link failure.
The use of a managed, type-safe language such as Standard ML, Ada Ravenscar or Java in hard real-time and embedded systems offers productivity, safety and dependability benefits at a reasonable cost. Static software systems, that is systems in which all relevant resource entities such as threads and their priorities, for instance, and the entire source code are known ahead of time, are particularly interesting for the deployment in safety-critical embedded systems: Code verification is rather maintainable in contrast to dynamic systems. Additionally, static analyses can incorporate information from all software and system layers to assist compilers in emitting code that is well-suited to an application on a particular hardware device. It was shown it the past, that a program composed in type-safe Java in combination with a static system setup can be as efficient as one that is written in C, which is still the most widely-used language in the embedded domain. Escape analysis (EA) is one of several static-analysis techniques. It supports, for instance, runtime efficiency by enabling automated stack allocation of objects. In addition, researchers have argued that EA enables further applications in safety-critical embedded systems such as the computation of memory classes stated in the Real-Time Specification for Java (RTSJ). EA can be applied to any programming language but the quality of its results greatly benefits from the properties of a type-safe language. Notably, embedded multicore devices can positively be affected by the use of EA. Thus, we explore an ahead-of-time (AOT) escape analysis in the context of the KESO JVM featuring a Java AOT compiler targeting (deeply) embedded (hard) real-time systems.
The current trend in modeling and analyzing real-time systems is toward tighter yet safe timing constraints. Many practical real-time systems can de facto sustain a bounded number of deadline misses, i.e., they have weakly-hard real-time constraints rather than hard real-time constraints. We therefore strive to provide tight deadline miss models in complement to tight response time bounds for such systems. In this work, we bound the distribution of deadline misses for task sets running on uniprocessors using the Earliest Deadline First (EDF) scheduling policy. We assume tasks miss their deadlines due to transient overload resulting from sporadic activations, e.g. interrupt service routines and we use Typical Worst-Case Analysis (TWCA) to tackle the problem in this context. TWCA relies on existing worst-case response time analyses as a foundation, so we revisit and revise in this paper the state-of-the-art worst-case response time analysis for EDF scheduling. This work is motivated by and validated on a realistic case study inspired by industrial practice (satellite on-board software) and on a set of synthetic test cases. The results show the usefulness of this approach for temporary overloaded systems when EDF scheduling is considered. The scalability has also been addressed in our experiments.
Code-reuse attack is a concrete threat to computing systems as it can evade conventional security defenses. Control flow integrity (CFI) is proposed to repel this threat. However, former implementations of CFI suffer from two major drawbacks: 1) complex offline processing on programs; 2) high overheads at runtime. Therefore, it is impractical for performance-constrained devices to adopt the technology, leaving them vulnerable to exploitation. In this paper, we develop a cross-layer approach named BBB-CFI to minimize the overheads of both offline analysis and runtime checking. Our approach employs basic block information inside the binary code and read-only data to enforce control-flow integrity. We identify a key binary level property called basic block boundary and based on it we propose the code-inspired method where short code sequences can endorse a control flow transition. Our solution enables quick application launching because it does not require control flow graph construction at the offline stage. We only demand a lightweight analysis on read-only data and a small amount of code of the application. According to the experiments, our approach incurs a negligible 0.11% runtime performance overhead with a minor processor extension, whereas it achieves an order of magnitude speedup in pre-preprocessing compared to a baseline approach. Without control flow analysis or recompilation, BBB-CFI still effectively reduces 90% of the attack surface in terms of gadget numbers. Besides this, we show that the Turing-completeness in the libc is unsustainable. Our approach also demonstrates high applicability to many programs and it is capable of protecting striped binaries.
As the adoption of Neural Networks continues to proliferate different classes of applications and systems, edge devices have been left behind. Their strict energy and storage limitations make them unable to cope with the sizes of common network models. While many compression methods such as precision reduction and sparsity have been proposed to alleviate this, they don't go quite far enough. To push size reduction to its absolute limits, we combine binarization with sparsity in Pruned-Permuted-Packed XNOR Networks (3PXNet), which can be efficiently implemented on even the smallest of embedded microcontrollers. 3PXNets can reduce model sizes by up to 38X and reduce runtime by up to 3X compared with already compact conventional binarized implementations with less than 3% accuracy reduction. We have created the first software implementation of sparse-binarized Neural Networks, released as an open-source library targeting edge devices. Our library is complete with training methodology and model generating scripts, making it easy and fast to deploy.
Co-simulation based validation of hardware controllers adjoined with plant models, with continuous dynamics, is an important step in model based design of controllers for Cyber-physical Systems (CPS). Co-simulation suffers from many problems such as timing delays, skew, race conditions, etc, making it unsuitable for checking timing properties of CPS. In our approach to verification of controllers synthesised from their models, the synthesised controller is adjoined with a synthesised hardware plant unit. The synthesised plant and controller are then executed synchronously and Metric Interval Temporal Logic properties are validated on the closed-loop system. The clock period is chosen, using the robustness estimates, such that all timing properties that hold on the controller guiding the discretized plant model also hold on the original case of the continuous time plant model guided by the controller.
Resistive crossbars have shown strong potential as the building blocks of future neural fabrics, due to their ability to natively execute vector-matrix multiplication (the dominant computational kernel in DNNs). However, a key challenge that arises in resistive crossbars is that non-idealities in the synaptic devices, interconnects, and peripheral circuits of resistive crossbars lead to errors in the computations performed. When large-scale DNNs are executed on resistive crossbar systems, these errors compound and result in unacceptable degradation in application-level accuracy. We propose CXDNN, a hardware-software methodology that enables the realization of large-scale DNNs on crossbar systems with minimal degradation in accuracy by compensating for errors due to non-idealities. CXDNN comprises of (i) an optimized mapping technique to convert floating-point weights and activations to crossbar conductances and input voltages, (ii) a fast re-training method to recover accuracy loss due to this conversion, and (iii) low-overhead compensation hardware to mitigate dynamic and hardware-instance-specific errors. Unlike previous efforts that are limited to small networks and require the training and deployment of hardware-instance-specific models, CXDNN presents a scalable compensation methodology that can address large DNNs (e.g., ResNet-50 on ImageNet), and enables a common model to be trained and deployed on many devices. We evaluated CXDNN on 6 top DNNs from the ILSVRC challenge with 0.5-13.8 million neurons and 0.5-15.5 billion connections. CXDNN achieves 16.9%-49% improvement in the top-1 classification accuracy, effectively mitigating a key challenge to the use of resistive crossbar based neural fabrics.
Indoor localization is an emerging application domain for the navigation and tracking of people and assets. Ubiquitously available Wi-Fi signals have enabled low-cost fingerprinting-based localization solutions. Further, the rapid growth in mobile hardware capability now allows high-accuracy deep learning-based frameworks to be executed locally on mobile devices in an energy-efficient manner. However, existing deep learning-based indoor localization solutions are vulnerable to access point (AP) attacks. This paper presents an analysis into the vulnerability of a convolutional neural network (CNN) based indoor localization solution to AP security compromises. Based on this analysis, we propose a novel methodology to maintain indoor localization accuracy, even in the presence of AP attacks. The proposed secured framework (called S-CNNLOC) is validated across a benchmark suite of indoor paths and is found to observe up to 10x average localization improvement on a given path with large number of malicious AP attacks, compared to its unsecured counterpart.
Many real-world edge applications including object detection, robotics, and smart health are enabled by deploying deep neural networks (DNNs) on energy-constrained mobile platforms. In this paper, we propose a novel approach to trade-off energy and accuracy of inference at runtime using a design space called Learning Energy Accuracy Tradeoff Networks (LEANets). The key idea behind LEANets is to design classifiers of increasing complexity using pre-trained DNNs to perform input-specific adaptive inference. The accuracy and energy-consumption of the adaptive inference scheme depends on a set of thresholds, one for each classifier. To determine the set of threshold vectors to achieve different energy and accuracy trade-offs, we propose a novel multi-objective optimization approach. We can select the appropriate threshold vector at runtime based on the desired trade-off. We perform experiments on multiple pre-trained DNNs including ConvNet, VGG-16, and MobileNet using diverse image classification datasets. Our results show that we get up to 50% gain in energy for negligible loss in accuracy, and optimized LEANets achieve significantly better energy and accuracy trade-off when compared to a state-of-the-art method referred as Slimmable neural networks.
With the rapid growth of connectivity and autonomy for today?s automobiles, their security vulnerabilities are becoming one of the most urgent concerns in the automotive industry. The lack of message authentication in Controller Area Network (CAN), which is the most popular in-vehicle communication protocol, makes it susceptible to cyber attack. It has been demonstrated that the remote attackers can take over the maneuver of vehicles after getting access to CAN, which poses serious safety threats to the public. To mitigate this issue, we propose a novel intrusion detection system (IDS), called BTMonitor (Bit-time based CAN Bus Monitor). It utilizes the small but measurable discrepancy of bit time in CAN frames to fingerprint their sender Electronic Control Units (ECUs). To reduce the requirement for high sampling rate, we calculate the bit time of recessive bits and dominant bits respectively and extract their statistical features as fingerprint. The generated fingerprint is then used to detect intrusion and pinpoint the attacker. BTMonitor can detect new types of masquerade attack that the state-of-the-art clock-skew based IDS is unable to identify. We implement a prototype system for BTMonitor using Xilinx Spartan 6 FPGA for data collection. We evaluate our method on both a CAN bus prototype and a real vehicle. The results show that BTMonitor can correctly identify the sender with an average probability of 99.76% on the real vehicle.
Stream programs are graph structured parallel programs, where the nodes are computational kernels that communicate by sending tokens over the edges. In this paper we present a framework for compiling stream programs that we call Tÿcho. It handles kernels of different styles and with a high degree of expressiveness using a common intermediate representation. It also provides efficient implementation, especially for but not limited to the restricted forms of stream programs, such as synchronous dataflow.
Deep neural networks (DNNs) are becoming a key enabling technique for many application domains. However, on-device inference on battery-powered, resource-constrained embedding systems is often infeasible due to prohibitively long inferencing time and resource requirements of many DNNs. Offloading computation into the cloud is often unacceptable due to privacy concerns, high latency, or the lack of connectivity. While compression algorithms often succeed in reducing inferencing times, they come at the cost of reduced accuracy. This paper presents a new, alternative approach to enable efficient execution of DNNs on embedded devices. Our approach dynamically determines which DNN to use for a given input, by considering the desired accuracy and inference time. It employs machine learning to develop a low-cost predictive model to quickly select a pre-trained DNN to use for a given input and the optimization constraint. We achieve this by first training off-line a predictive model, and then use the learnt model to select a DNN model to use for new, unseen inputs. We apply our approach to two representative DNN domains: image classification and machine translation. We evaluate our approach on a Jetson TX2 embedded deep learning platform, and consider a range of influential DNN models including convolutional and recurrent neural networks. For image classification, we achieve a 1.8x reduction in inference time with a 7.52% improvement in accuracy, over the most-capable single DNN model. For machine translation, we achieve a 1.34x reduction in inference time over the most-capable single model, with little impact on the quality of translation.
Continuous technology scaling in manycore systems leads to severe overheating issues. To guarantee system reliability, it is critical to accurately yet efficiently monitor run-time temperature distribution for effective chip thermal management. As an emerging communication architecture for new-generation manycore systems, optical network-on-chip (ONoC) satisfies the communication bandwidth and latency requirements with low power dissipation. What?s more, observation shows that it can be leveraged for run-time thermal sensing. In this paper, we propose a brand-new on-chip thermal sensing approach for ONoC-based manycore systems by utilizing the intrinsic thermal sensitivity of optical devices and the interprocessor communications in ONoCs. It requires no extra hardware but utilizes existing optical devices in ONoCs, and combines them with lightweight software computation in a hardware-software collaborative manner. The effectiveness of our approach is validated both at the device level and the system level through professional photonic simulations. Evaluation results based on synthetic communication traces and realistic benchmarks show that our approach achieves an average temperature inaccuracy of only 0.6648 K compared to ground truth values, and is scalable to be applied for large-size ONoCs.
Illegal use of memory pointers is a serious security vulnerability. A large number of malwares exploit the spatial and temporal nature of these vulnerabilities to subvert execution or glean sensitive data from an application. Recent countermeasures attach metadata to memory pointers, which define the pointer's capabilities. The metadata is used by the hardware to validate pointer-based memory accesses. However, recent works have considerable overheads. Further, the pointer validation is decoupled from the actual memory access. We show that this could open up vulnerabilities in multi-threaded applications and introduce new vulnerabilities due to speculation in out-of-order processors. In this paper, we demonstrate that the overheads can be reduced considerably by efficient metadata management. We show that the hardware can be designed in a manner that would remain safe in multi-threaded applications and immune to speculative vulnerabilities. We achieve these by ensuring that the pointer validations and the corresponding memory access is always done atomically and in-order. To evaluate our scheme, which we call ALEXIA, we enhance an OpenRISC processor to perform the memory validation at run time and also add compiler support. ALEXIA is the first hardware countermeasure scheme for memory protection that provides such an end-to-end solution. We evaluate the processor on an Altera FPGA and show that the run time overheads, on average, is 14%, with negligible impact on the processor's size and clock frequency. There is also a negligible impact on the program's code and data sizes.
The next significant step in the evolution and proliferation of artificial intelligence technology will be the integration of neural network (NN) models within embedded and mobile systems. This calls for the design of compact, energy efficient NN models in silicon. In this paper, we present a scalable ASIC design of an LSTM accelerator named ELSA, that is suitable for energy-constrained devices. It includes several architectural innovations to achieve small area and high energy efficiency. To reduce the area and power consumption of the overall design, the compute-intensive units of ELSA employ approximate multiplications and still achieve high performance and accuracy. The performance is further improved through efficient synchronization of the elastic pipeline stages to maximize the utilization. The paper also includes a performance model of ELSA, as a function of the hidden nodes and time steps, permitting its use for the evaluation of any LSTM application. ELSA was implemented in RTL and was synthesized and placed and routed in 65nm technology. Its functionality is demonstrated for language modeling ? a common application of LSTM. ELSA is compared against a baseline implementation of an LSTM accelerator with standard functional units and without any of the architectural innovations of ELSA. The paper demonstrates that ELSA can achieve significant improvements in power, area and energy-efficiency when compared to the baseline design and several ASIC implementations reported in the literature, making it suitable for use in embedded systems and real-time applications.
Heterogeneous multicore processor has recently become de facto computing platform for state-of-the-art embedded applications. Nonetheless, very little research focuses on the scheduling for real-time periodic tasks upon heterogeneous multicores under the requirements of task synchronization, which is stemmed from resource access conflicts and can greatly affect the schedulability of tasks. In view of the partitioned-EDF algorithm and Multiprocessor Stack Resource Policy (MSRP), we first discuss the blocking-aware utilization bound for uniform heterogeneous multicores and then illustrate its non-monotonicity, where the bound may reduce with more cores being exploited. Following the insights obtained from the analysis of the bound, taking the heterogeneity of computing systems into consideration, we propose an algorithm SA-TPA-HM (synchronization-aware task partitioning algorithm for heterogeneous multicores). Several blocking-guided and heterogeneity-aware mapping heuristics are incorporated to reduce the negative impacts of blocking conflicts in task system for better schedulability performance of tasks and balanced workload distributed across cores. The extensive simulation results demonstrate that the SA-TPA-HM algorithm can obtain the schedulability result approximate to an Integer Non-Linear Programming (INLP) based method and can have much better schedulability result (such as 60% more) in comparison with the current mapping heuristics targeted at homogeneous multicores. The measurements in Linux kernel further reveal the practical viability of SA-TPA-HM that can experience lower online overhead (e.g., 15% less) in contrast to other partitioning schemes.