ACM Transactions on

Embedded Computing Systems (TECS)

Latest Articles

TF-Net: Deploying Sub-Byte Deep Neural Networks on Microcontrollers

Deep Neural Networks (DNNs) have become an essential component of various applications. While today’s DNNs are mainly restricted to cloud services, network connectivity, energy, and data privacy problems make it important to support efficient DNN computation on low-cost, low-power processors like microcontrollers. However, due to the... (more)

Aggressive Energy Reduction for Video Inference with Software-only Strategies

In the past years, several works have proposed custom hardware and software-based techniques for the acceleration of Convolutional Neural Networks... (more)

CompAct: On-chip <underline>Com</underline>pression of <underline>Act</underline>ivations for Low Power Systolic Array Based CNN Acceleration

This paper addresses the design of systolic array (SA) based convolutional neural network (CNN) accelerators for mobile and embedded domains. On- and off-chip memory accesses to the large activation inputs (sometimes called feature maps) of CNN layers contribute significantly to total energy consumption for such accelerators; while prior has... (more)

ECAx: Balancing Error Correction Costs in Approximate Accelerators

Approximate computing has emerged as a design paradigm amenable to error-tolerant applications. It enables trading the quality of results for efficiency improvement in terms of delay, power, and energy consumption under user-provided tolerable quality degradation. Approximate accelerators have been proposed to expedite frequently executing code... (more)

An Ultra-Low Energy Human Activity Recognition Accelerator for Wearable Health Applications

Human activity recognition (HAR) has recently received significant attention due to its wide range of applications in health and activity... (more)

CASCADE: High Throughput Data Streaming via Decoupled Access-Execute CGRA

A Coarse-Grained Reconfigurable Array (CGRA) is a promising high-performance low-power accelerator for compute-intensive loop kernels. While the mapping of the computations on the CGRA is a well-studied problem, bringing the data into the array at a high throughput remains a challenge. A conventional CGRA design involves on-array computations to... (more)

Is Your Bus Arbiter Really Fair? Restoring Fairness in AXI Interconnects for FPGA SoCs

AMBA AXI is a popular bus protocol that is widely adopted as the medium to exchange data in field-programmable gate array system-on-chips (FPGA SoCs).... (more)

Analytical Performance Models for NoCs with Multiple Priority Traffic Classes

Networks-on-chip (NoCs) have become the standard for interconnect solutions in industrial designs ranging from client CPUs to many-core... (more)

Enabling and Exploiting Partition-Level Parallelism (PALP) in Phase Change Memories

Phase-change memory (PCM) devices have multiple banks to serve memory requests in parallel. Unfortunately, if two requests go to the same bank, they... (more)

Synterface: Efficient Chip-to-World Interfacing for Flow-Based Microfluidic Biochips Using Pin-Count Minimization

Flow-based microfluidic biochips can be used to perform bioassays by manipulating a large number of on-chip valves. These biochips are increasingly used today for biomolecular recognition, single-cell screening, and point-of-care disease diagnostics, and design-automation solutions for flow-based microfluidics enable the mapping and optimization of... (more)

Output-based Intermediate Representation for Translation of Test-pattern Program

An Intermediate Representation (IR) used by compilers is normally generated statically, as a result of parsing or analyzing the source program. This... (more)


Editor-in-Chief Call for Nominations

The term of the current Editor-in-Chief (EiC) of the ACM Transactions on Embedded Computing Systems (TECS) is coming to an end, and the ACM Publications Board has set up a nominating committee to assist the Board in selecting the next EiC. The design of embedded computing systems, both the software and hardware, increasingly relies on sophisticated algorithms, analytical models, and methodologies. ACM TECS aims to present the leading work relating to the analysis, design, behavior, and experience with embedded computing systems.



Call for Nominations for ACM Transactions on Embedded Computing Systems Best Paper Award 2019

ACM TECS is seeking nominations to recognize the best paper published in ACM TECS. The best paper award will be based on the overall quality, the originality, the level of contribution, the subject matter, and the timeliness and potential impact of the research. 


Forthcoming Articles

Editorial: Embedded Computing and Society

Quality Estimation and Optimization of Adaptive Stereo Matching Algorithms for Smart Vehicles

Stereo matching is a promising approach for smart vehicles to find the depth of nearby objects. Transforming a traditional stereo matching algorithm to its adaptive version has potential advantages to achieve the maximum quality (depth accuracy) in a best-effort manner. However, it is very challenging to support this adaptive feature, since (1) the internal mechanism of adaptive stereo matching (ASM) has to be accurately modeled, and (2) scheduling ASM tasks on multiprocessors to generate the maximum quality is difficult, under strict real-time constraints of smart vehicles. In this paper, we propose a framework for constructing an ASM application and optimizing its output quality on smart vehicles. First, we empirically convert stereo matching into ASM by exploiting its inherent characteristics of disparity-cycle correspondence and introduce an exponential quality model that accurately represents the quality-cycle relationship. Second, with the explicit quality model, we propose an efficient quadratic programming-based dynamic voltage/frequency scaling (DVFS) algorithm to decide the optimal operating strategy, which maximizes the output quality under timing, energy, and temperature constraints. Third, we propose two novel methods to efficiently estimate the parameters of the quality model, namely location similarity-based feature point thresholding (L-FPT) and street scenario- confined CNN (S-CNN) prediction. Results show that our DVFS algorithm achieves at least 1.61 times quality improvement compared to the state-of-the-art techniques, and average parameter estimation for the quality model achieves 96.35% accuracy on the straight road.

Automated Model-based Optimization of Data-Adaptable Embedded Systems

This paper presents a modeling and optimization framework that enables developers to model an application's data sources, tasks, and exchanged data tokens; specify application requirements through high-level design metrics and fuzzy logic based optimization rules; and define an estimation framework to automatically optimize the application at runtime. We demonstrate the modeling and optimization process via an example application for video-based vehicle tracking and collision avoidance. We analyze the benefits of runtime optimization by comparing the performance of static point solutions to dynamic solutions over five distinct execution scenarios, showing improvements of up to 74% for dynamic over static configurations.

ICNN: The Iterative Convolutional Neural Network

Modern and recent architectures of vision based Convolutional Neural Networks (CNN) have improved detection and prediction accuracy significantly. However, these algorithms are extremely computational intensive. To break the power and performance wall of CNN computation, we reformulate the CNN computation into an iterative process, where each iteration processes a sub-sample of input features with smaller network and ingests additional features to improve the prediction accuracy. Each smaller network could either classify based on its input set or feed computed and extracted features to the next network to enhance the accuracy. The proposed approach allows early-termination upon reaching acceptable confidence. Moreover, each iteration provides a contextual awareness that allows an intelligent resource allocation and optimization for the proceeding iterations. In this paper we propose various policies to reduce the computational complexity of CNN through the proposed iterative approach. We illustrate how the proposed policies construct a dynamic architecture suitable for a wide range of applications with varied accuracy requirements, resources, and time-budget, without further need for network re-training. Furthermore, we carry out a visualization of the detected features in each iteration through deconvolution network to gain more insight into the successive traversal of the ICNN.

Pattern Guided Integrated Scheduling and Routing in Multi-hop Control Networks

Executing a set of control loops over a shared multi-hop (wireless) control network (MCN) requires careful co-scheduling of the control tasks as well as the routing of sensory/actuation messages over the MCN. In this work, we establish pattern guided aperiodic execution of control loops as a resource-aware alternative to traditional fully periodic executions of a set of embedded control loops sharing a computation as well as the communication infrastructure. We provide a satisfiability modulo theories based co-design framework that synthesizes loop execution patterns having optimized control cost as the underlying scheduling scheme together with the associated routing solution over the MCN. The routing solution implements the timed movement of the sensory/actuation messages of the control loops, generated according to those loop execution patterns. From the given settling time requirement of the control loops, we compute a control theoretically sound model using matrix inequalities, that gives an upper bound to the number of loop drops within the finite length loop execution pattern. Next, we show that how the proposed framework can be useful for evaluating the fault tolerance of a resource-constrained shared MCN subject to communication link failure.

Design and Implementation of an Escape Analysis in the Context of Safety-Critical Embedded Systems

The use of a managed, type-safe language such as Standard ML, Ada Ravenscar or Java in hard real-time and embedded systems offers productivity, safety and dependability benefits at a reasonable cost. Static software systems, that is systems in which all relevant resource entities such as threads and their priorities, for instance, and the entire source code are known ahead of time, are particularly interesting for the deployment in safety-critical embedded systems: Code verification is rather maintainable in contrast to dynamic systems. Additionally, static analyses can incorporate information from all software and system layers to assist compilers in emitting code that is well-suited to an application on a particular hardware device. It was shown it the past, that a program composed in type-safe Java in combination with a static system setup can be as efficient as one that is written in C, which is still the most widely-used language in the embedded domain. Escape analysis (EA) is one of several static-analysis techniques. It supports, for instance, runtime efficiency by enabling automated stack allocation of objects. In addition, researchers have argued that EA enables further applications in safety-critical embedded systems such as the computation of memory classes stated in the Real-Time Specification for Java (RTSJ). EA can be applied to any programming language but the quality of its results greatly benefits from the properties of a type-safe language. Notably, embedded multicore devices can positively be affected by the use of EA. Thus, we explore an ahead-of-time (AOT) escape analysis in the context of the KESO JVM featuring a Java AOT compiler targeting (deeply) embedded (hard) real-time systems.

Weakly-Hard Real-Time Guarantees for Earliest Deadline First Scheduling of Independent Tasks

The current trend in modeling and analyzing real-time systems is toward tighter yet safe timing constraints. Many practical real-time systems can de facto sustain a bounded number of deadline misses, i.e., they have weakly-hard real-time constraints rather than hard real-time constraints. We therefore strive to provide tight deadline miss models in complement to tight response time bounds for such systems. In this work, we bound the distribution of deadline misses for task sets running on uniprocessors using the Earliest Deadline First (EDF) scheduling policy. We assume tasks miss their deadlines due to transient overload resulting from sporadic activations, e.g. interrupt service routines and we use Typical Worst-Case Analysis (TWCA) to tackle the problem in this context. TWCA relies on existing worst-case response time analyses as a foundation, so we revisit and revise in this paper the state-of-the-art worst-case response time analysis for EDF scheduling. This work is motivated by and validated on a realistic case study inspired by industrial practice (satellite on-board software) and on a set of synthetic test cases. The results show the usefulness of this approach for temporary overloaded systems when EDF scheduling is considered. The scalability has also been addressed in our experiments.

BBB-CFI: Lightweight CFI Approach Against Code-Reuse Attacks Using Basic Block Information

Code-reuse attack is a concrete threat to computing systems as it can evade conventional security defenses. Control flow integrity (CFI) is proposed to repel this threat. However, former implementations of CFI suffer from two major drawbacks: 1) complex offline processing on programs; 2) high overheads at runtime. Therefore, it is impractical for performance-constrained devices to adopt the technology, leaving them vulnerable to exploitation. In this paper, we develop a cross-layer approach named BBB-CFI to minimize the overheads of both offline analysis and runtime checking. Our approach employs basic block information inside the binary code and read-only data to enforce control-flow integrity. We identify a key binary level property called basic block boundary and based on it we propose the code-inspired method where short code sequences can endorse a control flow transition. Our solution enables quick application launching because it does not require control flow graph construction at the offline stage. We only demand a lightweight analysis on read-only data and a small amount of code of the application. According to the experiments, our approach incurs a negligible 0.11% runtime performance overhead with a minor processor extension, whereas it achieves an order of magnitude speedup in pre-preprocessing compared to a baseline approach. Without control flow analysis or recompilation, BBB-CFI still effectively reduces 90% of the attack surface in terms of gadget numbers. Besides this, we show that the Turing-completeness in the libc is unsustainable. Our approach also demonstrates high applicability to many programs and it is capable of protecting striped binaries.

3PXNet: Pruned-Permuted-Packed XNOR Networks for Edge Machine Learning

As the adoption of Neural Networks continues to proliferate different classes of applications and systems, edge devices have been left behind. Their strict energy and storage limitations make them unable to cope with the sizes of common network models. While many compression methods such as precision reduction and sparsity have been proposed to alleviate this, they don't go quite far enough. To push size reduction to its absolute limits, we combine binarization with sparsity in Pruned-Permuted-Packed XNOR Networks (3PXNet), which can be efficiently implemented on even the smallest of embedded microcontrollers. 3PXNets can reduce model sizes by up to 38X and reduce runtime by up to 3X compared with already compact conventional binarized implementations with less than 3% accuracy reduction. We have created the first software implementation of sparse-binarized Neural Networks, released as an open-source library targeting edge devices. Our library is complete with training methodology and model generating scripts, making it easy and fast to deploy.

Design & Optimization of Energy-Accuracy Trade-off for Mobile platforms via Pretrained Deep Models

Many real-world edge applications including object detection, robotics, and smart health are enabled by deploying deep neural networks (DNNs) on energy-constrained mobile platforms. In this paper, we propose a novel approach to trade-off energy and accuracy of inference at runtime using a design space called Learning Energy Accuracy Tradeoff Networks (LEANets). The key idea behind LEANets is to design classifiers of increasing complexity using pre-trained DNNs to perform input-specific adaptive inference. The accuracy and energy-consumption of the adaptive inference scheme depends on a set of thresholds, one for each classifier. To determine the set of threshold vectors to achieve different energy and accuracy trade-offs, we propose a novel multi-objective optimization approach. We can select the appropriate threshold vector at runtime based on the desired trade-off. We perform experiments on multiple pre-trained DNNs including ConvNet, VGG-16, and MobileNet using diverse image classification datasets. Our results show that we get up to 50% gain in energy for negligible loss in accuracy, and optimized LEANets achieve significantly better energy and accuracy trade-off when compared to a state-of-the-art method referred as Slimmable neural networks.

Tÿcho: A Framework for Compiling Stream Programs

Stream programs are graph structured parallel programs, where the nodes are computational kernels that communicate by sending tokens over the edges. In this paper we present a framework for compiling stream programs that we call Tÿcho. It handles kernels of different styles and with a high degree of expressiveness using a common intermediate representation. It also provides efficient implementation, especially for but not limited to the restricted forms of stream programs, such as synchronous dataflow.

Optimizing Deep Learning Inference on Embedded Systems Through Adaptive Model Selection

Deep neural networks (DNNs) are becoming a key enabling technique for many application domains. However, on-device inference on battery-powered, resource-constrained embedding systems is often infeasible due to prohibitively long inferencing time and resource requirements of many DNNs. Offloading computation into the cloud is often unacceptable due to privacy concerns, high latency, or the lack of connectivity. While compression algorithms often succeed in reducing inferencing times, they come at the cost of reduced accuracy. This paper presents a new, alternative approach to enable efficient execution of DNNs on embedded devices. Our approach dynamically determines which DNN to use for a given input, by considering the desired accuracy and inference time. It employs machine learning to develop a low-cost predictive model to quickly select a pre-trained DNN to use for a given input and the optimization constraint. We achieve this by first training off-line a predictive model, and then use the learnt model to select a DNN model to use for new, unseen inputs. We apply our approach to two representative DNN domains: image classification and machine translation. We evaluate our approach on a Jetson TX2 embedded deep learning platform, and consider a range of influential DNN models including convolutional and recurrent neural networks. For image classification, we achieve a 1.8x reduction in inference time with a 7.52% improvement in accuracy, over the most-capable single DNN model. For machine translation, we achieve a 1.34x reduction in inference time over the most-capable single model, with little impact on the quality of translation.

ELSA: A Throughput-Optimized Design of an LSTM Accelerator for Energy-Constrained Devices

The next significant step in the evolution and proliferation of artificial intelligence technology will be the integration of neural network (NN) models within embedded and mobile systems. This calls for the design of compact, energy efficient NN models in silicon. In this paper, we present a scalable ASIC design of an LSTM accelerator named ELSA, that is suitable for energy-constrained devices. It includes several architectural innovations to achieve small area and high energy efficiency. To reduce the area and power consumption of the overall design, the compute-intensive units of ELSA employ approximate multiplications and still achieve high performance and accuracy. The performance is further improved through efficient synchronization of the elastic pipeline stages to maximize the utilization. The paper also includes a performance model of ELSA, as a function of the hidden nodes and time steps, permitting its use for the evaluation of any LSTM application. ELSA was implemented in RTL and was synthesized and placed and routed in 65nm technology. Its functionality is demonstrated for language modeling ? a common application of LSTM. ELSA is compared against a baseline implementation of an LSTM accelerator with standard functional units and without any of the architectural innovations of ELSA. The paper demonstrates that ELSA can achieve significant improvements in power, area and energy-efficiency when compared to the baseline design and several ASIC implementations reported in the literature, making it suitable for use in embedded systems and real-time applications.

Blocking-Aware Partitioned Real-Time Scheduling for Uniform Heterogeneous Multicore Platforms

Heterogeneous multicore processor has recently become de facto computing platform for state-of-the-art embedded applications. Nonetheless, very little research focuses on the scheduling for real-time periodic tasks upon heterogeneous multicores under the requirements of task synchronization, which is stemmed from resource access conflicts and can greatly affect the schedulability of tasks. In view of the partitioned-EDF algorithm and Multiprocessor Stack Resource Policy (MSRP), we first discuss the blocking-aware utilization bound for uniform heterogeneous multicores and then illustrate its non-monotonicity, where the bound may reduce with more cores being exploited. Following the insights obtained from the analysis of the bound, taking the heterogeneity of computing systems into consideration, we propose an algorithm SA-TPA-HM (synchronization-aware task partitioning algorithm for heterogeneous multicores). Several blocking-guided and heterogeneity-aware mapping heuristics are incorporated to reduce the negative impacts of blocking conflicts in task system for better schedulability performance of tasks and balanced workload distributed across cores. The extensive simulation results demonstrate that the SA-TPA-HM algorithm can obtain the schedulability result approximate to an Integer Non-Linear Programming (INLP) based method and can have much better schedulability result (such as 60% more) in comparison with the current mapping heuristics targeted at homogeneous multicores. The measurements in Linux kernel further reveal the practical viability of SA-TPA-HM that can experience lower online overhead (e.g., 15% less) in contrast to other partitioning schemes.

DyVEDeep: Dynamic Variable Effort Deep Neural Networks

Deep Neural Networks (DNNs) have advanced the state-of-the-art in a variety of machine learning tasks and are deployed in increas- ing numbers of products and services. However, the computational requirements of training and evaluating large-scale DNNs are grow- ing at a much faster pace than the capabilities of the underlying hardware platforms that they are executed upon. To address this challenge, one promising approach is to exploit the error resilient nature of DNNs by skipping or approximating computations that have negligible impact on classification accuracy. Almost all prior efforts in this direction propose static DNN approximations by ei- ther pruning network connections, implementing computations at lower precision, or compressing weights. In this work, we propose Dynamic Variable Effort Deep Neu- ral Networks (DyVEDeep) to reduce the computational require- ments of DNNs during inference. Complementary to the afore- mentioned static approaches, DyVEDeep is a dynamic approach that exploits heterogeneity in the DNN inputs to improve their compute efficiency with comparable classification accuracy, and without requiring any re-training. DyVEDeep equips DNNs with dynamic effort mechanisms that identify computations critical to classifying a given input and focus computational effort only on the critical computations, while skipping or approximating the rest. We propose three dynamic effort mechanisms that operate at different levels of granularity viz. neuron, feature and layer lev- els. We build DyVEDeep versions of 6 popular image recognition benchmarks (CIFAR-10, AlexNet, OverFeat, VGG-16, SqueezeNet and Deep-Compressed-AlexNet) within the Caffe deep learning framework. We evaluate DyVEDeep on two platforms ? a high performance server with a 2.7GHz Intel Xeon E5-2680 processor and 128GB memory, and a low-power Raspberry Pi board with an ARM Cortex A53 processor and 1GB memory. Across all bench- marks, DyVEDeep achieves 2.47×-5.15× reduction in the number of scalar operations, which translates to 1.94×-2.23× and 1.46×-3.46× performance improvement over well-optimized baselines on the Xeon server and the Raspberry Pi respectively, with comparable classification accuracy.

All ACM Journals | See Full Journal Index

Search TECS
enter search term and/or author name