Editorial: Embedded Computing and Society
The rapidly evolving technologies in mobile devices inevitably increase their power demands on the battery. However, battery development can hardly keep pace with these fast-growing demands, leading to short battery life, which has become the top complaint from customers. In this paper, we investigate a novel energy supply technology, the fuel cell (FC), and leverage its advantage of providing long-term energy storage to build a hybrid FC-battery power system. As a result, the operation time of mobile devices is dramatically extended, so that users are no longer bothered by battery recharging. We examine real-world smartphone usage data and find that a naive hybrid power system cannot meet many users' highly diversified power demands. We thus propose the $\alpha$\% peak throttling technique, which reduces the device power consumption by $\alpha$\% for each power peak to resolve the mismatch between power supply and demand. This technique trades quality-of-service (QoS) for a larger FC ratio in the system, and thus a much longer device operation time. We further observe that a user's personality largely determines his/her satisfaction with the QoS degradation and the operation time extension, so applying a fixed $\alpha$\% peak throttling fails to satisfy every user. We thus propose personality-aware peak throttling, which identifies the user's personality online and then adopts the best $\alpha$\% value during peak throttling to achieve the optimal satisfaction score for each user. The experimental results show that our personality-aware hybrid FC-battery solution achieves 3.4X longer operation time and a 25\% higher satisfaction score compared to the baseline (a common battery-powered device) under the same size and weight limitations.
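As a rough illustration of the peak throttling idea, the sketch below scales every sample of a power trace that exceeds a threshold down by a fixed percentage. The function name, the per-sample peak test, the threshold, and the trace values are all assumptions made for illustration; the abstract does not specify how peaks are detected or how the throttling is enforced on a device.

```python
def throttle_peaks(trace_w, peak_threshold_w, alpha_pct):
    """Reduce every sample above peak_threshold_w by alpha_pct percent.

    The per-sample peak test is an illustrative assumption; the abstract
    does not define how a power peak is detected.
    """
    if not 0 <= alpha_pct <= 100:
        raise ValueError("alpha_pct must be between 0 and 100")
    scale = 1.0 - alpha_pct / 100.0
    return [p * scale if p > peak_threshold_w else p for p in trace_w]


# Hypothetical power trace in watts.
trace = [0.5, 0.8, 3.0, 2.5, 0.6]
throttled = throttle_peaks(trace, peak_threshold_w=2.0, alpha_pct=20)
```

Here only the two samples above 2.0 W are scaled by 0.8; the remaining samples pass through unchanged.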
Recent architectures of vision-based Convolutional Neural Networks (CNNs) have improved detection and prediction accuracy significantly. However, these algorithms are extremely computationally intensive. To break the power and performance wall of CNN computation, we reformulate the CNN computation into an iterative process, where each iteration processes a sub-sample of input features with a smaller network and ingests additional features to improve the prediction accuracy. Each smaller network can either classify based on its input set or feed computed and extracted features to the next network to enhance the accuracy. The proposed approach allows early termination upon reaching acceptable confidence. Moreover, each iteration provides contextual awareness that allows intelligent resource allocation and optimization for the succeeding iterations. In this paper, we propose various policies to reduce the computational complexity of CNNs through the proposed iterative approach. We illustrate how the proposed policies construct a dynamic architecture suitable for a wide range of applications with varied accuracy requirements, resources, and time budgets, without further need for network re-training. Furthermore, we visualize the detected features in each iteration through a deconvolution network to gain more insight into the successive traversal of the iterative CNN (ICNN).
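The early-termination control flow described above can be sketched as a loop over stages that stops once a confidence threshold is reached. This is a minimal sketch under stated assumptions: the stage interface (returning a label, a confidence, and features for the next stage) and the stub stages are hypothetical stand-ins for the smaller networks, not the paper's implementation.

```python
def iterative_classify(stages, x, confidence_threshold):
    """Run stages in order and stop once confidence is acceptable.

    Each stage is a callable (x, prev_features) -> (label, confidence,
    features). This interface is an assumption made for illustration.
    """
    features = None
    label, confidence, used = None, 0.0, 0
    for stage in stages:
        label, confidence, features = stage(x, features)
        used += 1
        if confidence >= confidence_threshold:
            break  # early termination: skip the remaining stages
    return label, confidence, used


def make_stub_stage(label, confidence):
    """Hypothetical stand-in for a small network with a fixed output."""
    def stage(x, prev_features):
        return label, confidence, (prev_features or ()) + (confidence,)
    return stage


stages = [make_stub_stage("cat", 0.40),
          make_stub_stage("cat", 0.70),
          make_stub_stage("cat", 0.95)]
result = iterative_classify(stages, x=None, confidence_threshold=0.6)
```

With a threshold of 0.6 the loop stops after the second stage, so the third stage is never evaluated.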
Shared last level caches (LLC) of multicore systems-on-chip are subject to a significant amount of contention over a limited bandwidth, resulting in major performance bottlenecks that make the issue a first-order concern in modern multiprocessor systems-on-chip. Even though shared cache space partitioning has been extensively studied in the past, the problem of cache bandwidth partitioning has not received sufficient attention. We demonstrate the occurrence of such contention and the resulting impact on the overall system performance. To address the issue, we perform detailed simulations to study the impact of different parameters, and propose a novel cache bandwidth partitioning technique, called REAL, that arbitrates among cache access requests originating from different processor cores. It monitors the LLC access patterns to dynamically assign a priority value to each core. Experimental results on different mixes of benchmarks show up to 2.13x overall system speedup over baseline policies, with minimal impact on energy.
The current trend in modeling and analyzing real-time systems is toward tighter yet safe timing constraints. Many practical real-time systems can de facto sustain a bounded number of deadline misses, i.e., they have weakly-hard real-time constraints rather than hard real-time constraints. We therefore strive to provide tight deadline miss models as a complement to tight response time bounds for such systems. In this work, we bound the distribution of deadline misses for task sets running on uniprocessors using the Earliest Deadline First (EDF) scheduling policy. We assume that tasks miss their deadlines due to transient overload resulting from sporadic activations, e.g., interrupt service routines, and we use Typical Worst-Case Analysis (TWCA) to tackle the problem in this context. TWCA relies on existing worst-case response time analyses as a foundation, so in this paper we revisit and revise the state-of-the-art worst-case response time analysis for EDF scheduling. This work is motivated by and validated on a realistic case study inspired by industrial practice (satellite on-board software) and on a set of synthetic test cases. The results show the usefulness of this approach for temporarily overloaded systems when EDF scheduling is considered. Scalability is also addressed in our experiments.
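To make the notion of a weakly-hard constraint concrete, the sketch below checks an observed sequence of job outcomes against the common (m, k) formulation: at most m deadline misses in any k consecutive activations. The function name and the trace are illustrative assumptions. This is a run-time check of a recorded trace, not the TWCA analysis itself, which bounds misses statically.

```python
def satisfies_weakly_hard(missed, m, k):
    """Return True if every window of k consecutive jobs has <= m misses.

    missed is a sequence of 0/1 flags (1 = deadline miss), one per job.
    """
    window_misses = 0
    for i, miss in enumerate(missed):
        window_misses += miss
        if i >= k:
            window_misses -= missed[i - k]  # slide the window forward
        if window_misses > m:
            return False
    return True


# Hypothetical outcome trace for six consecutive jobs of one task.
outcomes = [0, 1, 0, 0, 1, 0]
ok_1_of_3 = satisfies_weakly_hard(outcomes, m=1, k=3)
ok_1_of_4 = satisfies_weakly_hard(outcomes, m=1, k=4)
```

The same trace satisfies a (1, 3) constraint but violates (1, 4), because jobs 2 through 5 contain two misses.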
Co-simulation-based validation of hardware controllers adjoined with plant models with continuous dynamics is an important step in the model-based design of controllers for Cyber-physical Systems (CPS). Co-simulation suffers from many problems, such as timing delays, skew, and race conditions, making it unsuitable for checking timing properties of CPS. In our approach to the verification of controllers synthesised from their models, the synthesised controller is adjoined with a synthesised hardware plant unit. The synthesised plant and controller are then executed synchronously, and Metric Interval Temporal Logic properties are validated on the closed-loop system. The clock period is chosen, using the robustness estimates, such that all timing properties that hold on the controller guiding the discretized plant model also hold on the original case of the continuous-time plant model guided by the controller.
Resistive crossbars have shown strong potential as the building blocks of future neural fabrics, due to their ability to natively execute vector-matrix multiplication (the dominant computational kernel in DNNs). However, a key challenge that arises in resistive crossbars is that non-idealities in the synaptic devices, interconnects, and peripheral circuits lead to errors in the computations performed. When large-scale DNNs are executed on resistive crossbar systems, these errors compound and result in unacceptable degradation in application-level accuracy. We propose CXDNN, a hardware-software methodology that enables the realization of large-scale DNNs on crossbar systems with minimal degradation in accuracy by compensating for errors due to non-idealities. CXDNN comprises (i) an optimized mapping technique to convert floating-point weights and activations to crossbar conductances and input voltages, (ii) a fast re-training method to recover accuracy loss due to this conversion, and (iii) low-overhead compensation hardware to mitigate dynamic and hardware-instance-specific errors. Unlike previous efforts that are limited to small networks and require the training and deployment of hardware-instance-specific models, CXDNN presents a scalable compensation methodology that can address large DNNs (e.g., ResNet-50 on ImageNet), and enables a common model to be trained and deployed on many devices. We evaluated CXDNN on 6 top DNNs from the ILSVRC challenge with 0.5-13.8 million neurons and 0.5-15.5 billion connections. CXDNN achieves 16.9%-49% improvement in the top-1 classification accuracy, effectively mitigating a key challenge to the use of resistive crossbar based neural fabrics.
Indoor localization is an emerging application domain for the navigation and tracking of people and assets. Ubiquitously available Wi-Fi signals have enabled low-cost fingerprinting-based localization solutions. Further, the rapid growth in mobile hardware capability now allows high-accuracy deep learning-based frameworks to be executed locally on mobile devices in an energy-efficient manner. However, existing deep learning-based indoor localization solutions are vulnerable to access point (AP) attacks. This paper presents an analysis of the vulnerability of a convolutional neural network (CNN) based indoor localization solution to AP security compromises. Based on this analysis, we propose a novel methodology to maintain indoor localization accuracy, even in the presence of AP attacks. The proposed secured framework (called S-CNNLOC) is validated across a benchmark suite of indoor paths and is found to deliver up to 10x average localization improvement on a given path with a large number of malicious AP attacks, compared to its unsecured counterpart.
Many real-world edge applications, including object detection, robotics, and smart health, are enabled by deploying deep neural networks (DNNs) on energy-constrained mobile platforms. In this paper, we propose a novel approach to trade off energy and accuracy of inference at runtime using a design space called Learning Energy Accuracy Tradeoff Networks (LEANets). The key idea behind LEANets is to design classifiers of increasing complexity using pre-trained DNNs to perform input-specific adaptive inference. The accuracy and energy consumption of the adaptive inference scheme depend on a set of thresholds, one for each classifier. To determine the set of threshold vectors that achieve different energy and accuracy trade-offs, we propose a novel multi-objective optimization approach. We can select the appropriate threshold vector at runtime based on the desired trade-off. We perform experiments on multiple pre-trained DNNs, including ConvNet, VGG-16, and MobileNet, using diverse image classification datasets. Our results show that we get up to 50% gain in energy for negligible loss in accuracy, and that optimized LEANets achieve a significantly better energy and accuracy trade-off when compared to a state-of-the-art method referred to as Slimmable neural networks.
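The idea of keeping a set of threshold vectors and choosing one at runtime can be illustrated with a plain non-dominated (Pareto) filter over candidate vectors. This is only a sketch: the brute-force filter is a generic stand-in rather than the multi-objective optimization approach the paper proposes, and the candidate tuples, energy units, and function names are hypothetical.

```python
def pareto_front(candidates):
    """Keep candidates that no other candidate dominates.

    Each candidate is (thresholds, energy, accuracy). A candidate is
    dominated if another one uses no more energy, is at least as accurate,
    and is strictly better in at least one of the two objectives.
    """
    front = []
    for c in candidates:
        dominated = any(
            o[1] <= c[1] and o[2] >= c[2] and (o[1] < c[1] or o[2] > c[2])
            for o in candidates
        )
        if not dominated:
            front.append(c)
    return sorted(front, key=lambda c: c[1])


def select_thresholds(front, energy_budget):
    """Pick the most accurate threshold vector within the energy budget."""
    feasible = [c for c in front if c[1] <= energy_budget]
    return max(feasible, key=lambda c: c[2])[0] if feasible else None


# Hypothetical (thresholds, energy in mJ, accuracy) measurements.
candidates = [
    ((0.9, 0.9), 10.0, 0.92),
    ((0.7, 0.8), 6.0, 0.91),
    ((0.5, 0.5), 4.0, 0.85),
    ((0.6, 0.9), 7.0, 0.88),  # dominated by the (0.7, 0.8) vector
]
front = pareto_front(candidates)
chosen = select_thresholds(front, energy_budget=6.5)
```

The front is computed once offline; the runtime step is then a cheap lookup against the current energy budget.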
With the rapid growth of connectivity and autonomy in today's automobiles, their security vulnerabilities are becoming one of the most urgent concerns in the automotive industry. The lack of message authentication in the Controller Area Network (CAN), the most popular in-vehicle communication protocol, makes it susceptible to cyber attacks. It has been demonstrated that remote attackers can take over the maneuvering of vehicles after gaining access to CAN, which poses serious safety threats to the public. To mitigate this issue, we propose a novel intrusion detection system (IDS), called BTMonitor (Bit-time based CAN Bus Monitor). It utilizes the small but measurable discrepancy of bit time in CAN frames to fingerprint their sender Electronic Control Units (ECUs). To reduce the requirement for a high sampling rate, we calculate the bit times of recessive bits and dominant bits separately and extract their statistical features as the fingerprint. The generated fingerprint is then used to detect intrusions and pinpoint the attacker. BTMonitor can detect new types of masquerade attacks that the state-of-the-art clock-skew based IDS is unable to identify. We implement a prototype system for BTMonitor using a Xilinx Spartan 6 FPGA for data collection. We evaluate our method on both a CAN bus prototype and a real vehicle. The results show that BTMonitor can correctly identify the sender with an average probability of 99.76% on the real vehicle.
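A minimal sketch of the fingerprinting idea follows, assuming that the statistical features are the mean and standard deviation of recessive and dominant bit times and that the sender is identified by the nearest enrolled fingerprint. Both choices, along with the ECU names and the nanosecond measurements, are assumptions for illustration; the abstract does not state which features or classifier BTMonitor uses.

```python
import math
import statistics


def fingerprint(recessive_ns, dominant_ns):
    """Summarize bit-time samples as a 4-value feature tuple.

    Mean and population standard deviation are assumed features.
    """
    return (
        statistics.mean(recessive_ns), statistics.pstdev(recessive_ns),
        statistics.mean(dominant_ns), statistics.pstdev(dominant_ns),
    )


def identify_sender(sample_fp, enrolled):
    """Return the enrolled ECU whose fingerprint is closest (Euclidean)."""
    return min(enrolled, key=lambda ecu: math.dist(sample_fp, enrolled[ecu]))


# Hypothetical enrollment data: bit times in ns around a 2000 ns nominal.
enrolled = {
    "ECU_A": fingerprint([2001, 2002, 2001], [1999, 1998, 1999]),
    "ECU_B": fingerprint([1996, 1995, 1996], [2004, 2005, 2004]),
}
observed = fingerprint([2001, 2001, 2002], [1998, 1999, 1999])
sender = identify_sender(observed, enrolled)
```

A frame that claims one ECU's identifier but matches another ECU's fingerprint would then be flagged as a possible masquerade.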
Stream programs are graph structured parallel programs, where the nodes are computational kernels that communicate by sending tokens over the edges. In this paper we present a framework for compiling stream programs that we call Tÿcho. It handles kernels of different styles and with a high degree of expressiveness using a common intermediate representation. It also provides efficient implementation, especially for but not limited to the restricted forms of stream programs, such as synchronous dataflow.
Continuous technology scaling in manycore systems leads to severe overheating issues. To guarantee system reliability, it is critical to accurately yet efficiently monitor run-time temperature distribution for effective chip thermal management. As an emerging communication architecture for new-generation manycore systems, optical network-on-chip (ONoC) satisfies the communication bandwidth and latency requirements with low power dissipation. What's more, observation shows that it can be leveraged for run-time thermal sensing. In this paper, we propose a brand-new on-chip thermal sensing approach for ONoC-based manycore systems by utilizing the intrinsic thermal sensitivity of optical devices and the interprocessor communications in ONoCs. It requires no extra hardware but utilizes existing optical devices in ONoCs, and combines them with lightweight software computation in a hardware-software collaborative manner. The effectiveness of our approach is validated both at the device level and the system level through professional photonic simulations. Evaluation results based on synthetic communication traces and realistic benchmarks show that our approach achieves an average temperature inaccuracy of only 0.6648 K compared to ground truth values, and is scalable to be applied for large-size ONoCs.
Illegal use of memory pointers is a serious security vulnerability. A large number of malware programs exploit the spatial and temporal nature of these vulnerabilities to subvert execution or glean sensitive data from an application. Recent countermeasures attach metadata to memory pointers, which defines the pointer's capabilities. The metadata is used by the hardware to validate pointer-based memory accesses. However, recent works have considerable overheads. Further, the pointer validation is decoupled from the actual memory access. We show that this could open up vulnerabilities in multi-threaded applications and introduce new vulnerabilities due to speculation in out-of-order processors. In this paper, we demonstrate that the overheads can be reduced considerably by efficient metadata management. We show that the hardware can be designed in a manner that remains safe in multi-threaded applications and immune to speculative vulnerabilities. We achieve this by ensuring that the pointer validation and the corresponding memory access are always performed atomically and in order. To evaluate our scheme, which we call ALEXIA, we enhance an OpenRISC processor to perform the memory validation at run time and also add compiler support. ALEXIA is the first hardware countermeasure scheme for memory protection that provides such an end-to-end solution. We evaluate the processor on an Altera FPGA and show that the run-time overhead is 14% on average, with negligible impact on the processor's size and clock frequency. There is also a negligible impact on the program's code and data sizes.
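The coupling of validation and access can be modeled in software as follows. This is a toy model under stated assumptions: the metadata fields (base, size, a liveness flag), the class names, and the use of a lock to stand in for atomic, in-order hardware execution are all hypothetical and are not ALEXIA's actual metadata layout or mechanism.

```python
import threading
from dataclasses import dataclass


@dataclass
class PointerMeta:
    """Hypothetical per-pointer metadata: bounds plus a liveness flag."""
    base: int
    size: int
    live: bool = True  # cleared when the object is freed


class CheckedMemory:
    """Toy memory in which the check and the access form one atomic step."""

    def __init__(self, n_bytes):
        self._mem = bytearray(n_bytes)
        self._lock = threading.Lock()  # models atomic validate-and-access

    def _validate(self, meta, addr):
        if not meta.live:
            raise MemoryError("temporal violation: use after free")
        if not (meta.base <= addr < meta.base + meta.size):
            raise MemoryError("spatial violation: out of bounds")

    def load(self, meta, addr):
        with self._lock:
            self._validate(meta, addr)
            return self._mem[addr]

    def store(self, meta, addr, value):
        with self._lock:
            self._validate(meta, addr)
            self._mem[addr] = value


mem = CheckedMemory(16)
ptr = PointerMeta(base=4, size=4)
mem.store(ptr, 5, 42)
value = mem.load(ptr, 5)
```

Because the check and the access happen under the same lock, another thread cannot free the object between validation and use, which is the window that decoupled validation leaves open.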
The next significant step in the evolution and proliferation of artificial intelligence technology will be the integration of neural network (NN) models within embedded and mobile systems. This calls for the design of compact, energy efficient NN models in silicon. In this paper, we present a scalable ASIC design of an LSTM accelerator named ELSA, that is suitable for energy-constrained devices. It includes several architectural innovations to achieve small area and high energy efficiency. To reduce the area and power consumption of the overall design, the compute-intensive units of ELSA employ approximate multiplications and still achieve high performance and accuracy. The performance is further improved through efficient synchronization of the elastic pipeline stages to maximize the utilization. The paper also includes a performance model of ELSA, as a function of the hidden nodes and time steps, permitting its use for the evaluation of any LSTM application. ELSA was implemented in RTL and was synthesized and placed and routed in 65nm technology. Its functionality is demonstrated for language modeling, a common application of LSTM. ELSA is compared against a baseline implementation of an LSTM accelerator with standard functional units and without any of the architectural innovations of ELSA. The paper demonstrates that ELSA can achieve significant improvements in power, area and energy-efficiency when compared to the baseline design and several ASIC implementations reported in the literature, making it suitable for use in embedded systems and real-time applications.
Heterogeneous multicore processors have recently become the de facto computing platform for state-of-the-art embedded applications. Nonetheless, very little research focuses on scheduling real-time periodic tasks on heterogeneous multicores under the requirements of task synchronization, which stems from resource access conflicts and can greatly affect the schedulability of tasks. Based on the partitioned-EDF algorithm and the Multiprocessor Stack Resource Policy (MSRP), we first discuss the blocking-aware utilization bound for uniform heterogeneous multicores and then illustrate its non-monotonicity, where the bound may decrease as more cores are exploited. Following the insights obtained from the analysis of the bound, and taking the heterogeneity of computing systems into consideration, we propose an algorithm called SA-TPA-HM (synchronization-aware task partitioning algorithm for heterogeneous multicores). Several blocking-guided and heterogeneity-aware mapping heuristics are incorporated to reduce the negative impact of blocking conflicts in the task system, yielding better schedulability and a balanced workload across cores. Extensive simulation results demonstrate that the SA-TPA-HM algorithm obtains schedulability results close to those of an Integer Non-Linear Programming (INLP) based method and much better schedulability results (e.g., 60% more) than current mapping heuristics targeted at homogeneous multicores. Measurements in the Linux kernel further reveal the practical viability of SA-TPA-HM, which experiences lower online overhead (e.g., 15% less) than other partitioning schemes.
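For orientation, the sketch below shows a plain first-fit-decreasing partitioner for partitioned EDF on uniform heterogeneous cores, where a core of speed s accepts implicit-deadline tasks as long as their total utilization does not exceed s. It is a generic baseline, not SA-TPA-HM: it ignores blocking from shared resources entirely, and the function name, task utilizations, and core speeds are hypothetical.

```python
def partition_first_fit(utilizations, core_speeds):
    """Assign tasks to cores, largest utilization first, fastest core first.

    utilizations are measured on a unit-speed core. A core of speed s is
    treated as EDF-schedulable while its assigned utilization is <= s.
    Returns {task_index: core_index}, or None if some task does not fit.
    Blocking due to resource sharing is NOT modeled in this sketch.
    """
    load = [0.0] * len(core_speeds)
    cores = sorted(range(len(core_speeds)),
                   key=lambda c: core_speeds[c], reverse=True)
    tasks = sorted(range(len(utilizations)),
                   key=lambda i: utilizations[i], reverse=True)
    assignment = {}
    for i in tasks:
        for c in cores:
            # Small tolerance guards against floating-point rounding.
            if load[c] + utilizations[i] <= core_speeds[c] + 1e-12:
                load[c] += utilizations[i]
                assignment[i] = c
                break
        else:
            return None  # no core can host task i
    return assignment


# Hypothetical task set and a two-core platform with speeds 1.5 and 1.0.
mapping = partition_first_fit([0.9, 0.6, 0.5, 0.3], [1.5, 1.0])
```

A synchronization-aware partitioner would additionally account for blocking times when testing whether a task fits on a core, which is where the heuristics described in the abstract come in.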