Fast multipole method (FMM) is a promising mathematical technique that accelerates the calculation of long-ranged forces in the large-sized N-body problem. Existing implementations of the FMM on general purpose processors are energy and resource inefficient. To mitigate these issues, we propose a hardware pipeline that accelerates three key FMM steps. The pipeline improves energy efficiency by exploiting fine-granularity parallelism of the FMM. We reuse the pipeline for different FMM steps to improve resource utilization. Experiments show that our implementation outperforms the state-of-the-art implementations on CPUs and GPUs with 15\% less required energy and 66\% less required resources.
Fixed-Priority Scheduling for Two-Phase Mixed-Criticality Systems
Sample preparation plays a crucial role in biochemical applications since a predominant portion of analysis time is associated with sample collection, transportation, and preparation. However, in many biochemical applications, target mixtures with exact component-proportions may not be needed. The choice of a particular valid ratio, however, strongly impacts solution-preparation cost and time. In order to address this problem, we propose an optimal-cost, concentration-resilient ratio-selection method that is suitable for digital microfluidic biochips. Experimental results reveal that the proposed method can be used conveniently in tandem with several existing sample preparation algorithms for improving their performance.
The increasing use of heterogeneous embedded systems with multi-core CPUs and Graphics Processing Units (GPUs) presents important challenges in effectively exploiting pipeline, task and data-level parallelism to meet throughput requirements of digital signal processing (DSP) applications. Moreover, in the presence of system-level memory constraints, hand optimization of code to satisfy these requirements is inefficient and error-prone, and can therefore, greatly slow down development time or result in highly underutilized processing resources. In this paper, we present vectorization and scheduling methods to effectively exploit multiple forms of parallelism for throughput optimization on hybrid CPU-GPU platforms, while conforming to system-level ...
Radio frequency (RF) energy harvesting techniques are becoming a potential method to power battery-free wireless networks. In RF energy harvesting communications, energy cooperation enables shaping and optimization of the energy arrivals at the energy-receiving node to improve the overall system performance. In this paper, we propose an energy cooperation scheme that enables energy cooperation in battery-free wireless networks with RF harvesting. We first study the battery-free wireless network with RF energy harvesting then state the problem that optimizing the system performance through new energy cooperation protocol. We find our protocol performs better than the original battery-free wireless network solution.
Through pseudorandom permutation, tweakable enciphering schemes (TES) constitute block cipher modes of operation which perform length-preserving computations. In this paper, we propose efficient approaches for protecting such schemes against natural and malicious faults. Specifically, noting that intelligent attackers do not merely get confined to injecting multiple faults, one major benchmark for the proposed schemes is evaluation towards biased and burst fault models. In addition, we benchmark the overhead and performance degradation on ASIC platform. The results of our error injection simulations and ASIC implementations show the suitability of the proposed approaches for a wide-range of applications including deeply-embedded systems.
In this paper we propose a novel methodology to compute the parametric WCET of a program. Unlike other algorithms in the literature, our method is not based on Integer Linear Programming (ILP). Instead, we follow an approach based on the notion of symbolic computation of WCET formulae. After explaining our methodology and proving its correctness, we present a set of experiments to compare our method against the state of the art. We show that our approach dominates other parametric analyses, and produces results that are very close to those produced by non-parametric ILP-based approaches, while keeping very good computing time.
Convolutional neural networks (CNNs) are widely employed in many image recognition applications. The ever-increasing computational capability of mobile processors provides opportunities to run such applications in an energy efficient and low latency manner, and is therefore critical that they are properly optimized on these platforms. Matrix multiplication is an important operation used in CNNs. In this paper, we propose customized versions of the matrix multiplication algorithms that can help to speed up CNNs. Specifically, we propose a BCSC (Block Compressed Sparse Column) algorithm and a bitrepresentation based algorithm (BitsGEMM) that exploit sparsity to accelerate CNNs on the NVIDIA Jetson TK1.
Guest Editorial for the Special Issue of ESWEEK 2016
Network-connected embedded systems are vulnerable to increasing malware. While anomaly-based malware detection is effective, existing approaches incur significant overheads and are susceptible to mimicry attacks. We present a formal security model that defines normal system behavior including execution sequence and timing. On-chip hardware non-intrusively monitors system execution and detect anomalies at runtime. The timing distribution of control flow events is further analyzed to select subset of monitoring targets to meet hardware constraints. The approach is evaluated using a network-connected pacemaker prototype and mimicry malware.
A key step in simulation-driven algorithms is to compute the reach set over-approximations from a set of initial states through numerical simulations and sensitivity analysis. This paper addresses this problem by providing algorithms for computing discrepancy functions as the upper bound on the sensitivity. The proposed algorithms rely on computing local bounds on matrix measures under different norms such that the over-approximations can be computed fast or its conservativeness is locally minimized. The proposed algorithms enable automatic reach set computations of general nonlinear systems and have been successfully used on several challenging benchmark model.
In safety-critical systems, there are typically different applications providing functionalities with varying degrees of criticality. Consequently, high levels of assurance is required for a highly critical functionality, whereas relatively low levels of assurance is required for a less critical functionality. A theory of real-time scheduling for such multi-criticality systems has been under development in the recent past. In particular, an algorithm called Earliest Deadline First with Virtual Deadlines (EDF-VD) has shown a lot of promise for systems in terms of practical performance. In this paper we design a new schedulability test based on demand bound functions for EDF-VD.
This paper designs a new fusion algorithm from all kinds of IMU sources, namely, gyroscope, accelerometer and magnetometer. Comparing to state-of-art approaches, time-varying magnetic perturbation problem is firstly characterized in an geometric view. Using this detailed model as motivation, we propose an extend Kalman filter based (EKF-based) algorithm to eliminate the position-dependent affection of compass sensor. Experiment data demonstrates that our proposed attitude fusion algorithm has the maximum angle error of $2.74\degree$ comparing to $6.68\degree$ in gradient declining based (GD-based) algorithm even under different indoor magnetic distortion environment.
In this paper, we propose four Approximate Full Adders (AFAs) with design objective that Cout remains independent of Cin with minimal error probability. We exploit the proposed AFAs to construct an N-bit approximate adder which hereinafter is referred to as ApproxADD. For improving Error Distance (ED) and Error Rate (ER) of ApproxADD, we exploit the concept of carry-lifetime and Error Detection and Correction (EDC) logic, respectively. We evaluate efficiency of the proposed approach by comparing it with existing approximate adders. We inspect effectiveness of the proposed approach in real-life applications by demonstrating image compression and decompression using the ApproxADD.
Recently, the research community has introduced several predictable DRAM controller designs that provide improved worst-case timing guarantees for real-time embedded systems. The proposed controllers significantly differ in terms of arbitration, configuration and simulation environment, making it difficult to assess the contribution of each approach. To bridge this gap, this paper provides the first comprehensive evaluation of state-of-the-art predictable DRAM controllers. We propose a categorization of available controllers, and introduce an analytical performance model based on worst-case latency. We then conduct an extensive evaluation for all state-of-the-art controllers based on a common simulation platform, and discuss findings and recommendations.
Wireless sensor networks for rarely occurring critical events must maintain sensing coverage and low latency network connectivity to ensure event detection and subsequent rapid propagation of notification messages. Few algorithms have been proposed that address both coverage and forwarding and those that do are either unconcerned with rapid propagation or are not optimised to handle the constant changes in topology observed in duty cycled networks. This paper proposes an algorithm for Coverage Preservation with Rapid Forwarding (CPRF). The algorithm is shown to deliver perfect coverage maintenance and low latency message propagation whilst allowing stored-charge conservation via collaborative duty cycling in...
In this paper, we propose D-PUF, an intrinsically reconfigurable DRAM PUF based on refresh pausing. A key feature of the proposed DRAM PUF is reconfigurability, i.e., by varying the refresh-pause interval, the challenge-response behavior of the PUF can be altered, making it robust against various attacks. The paper is broadly divided into two parts. In the first part, we demonstrate the use of D-PUF in performing device authentication through a secure, low-overhead methodology. In the second part, we show the generation of true random numbers using D-PUF. The design is implemented and validated using several off-the-shelf DDR3 DRAM modules.
Evaluation of industrial embedded control system designs is a time-consuming and imperfect process. While an ideal process would apply a formal verification technique such as model checking, for industrial scale control systems, these techniques are often difficult to use to verify performance aspects such as convergence. For industrial designs, engineers rely on testing processes to identify unexpected behaviors. We propose a novel framework called Underminer to improve the testing process; this is an automated technique to identify non-converging behaviors in embedded control system designs. It supports various convergence-like notions, such as those based on Lyapunov analysis and temporal logic formulae.
Interrupt-driven software is difficult to test and debug, especially when interrupts can be nested and subject to priorities. We present a new formal approach to verifying interrupt-driven software based on symbolic execution. The approach leverages recent advances in the encoding of concurrent programs. We assess the performance of our method on benchmarks drawn from embedded systems code, and experimentally compare it to conventional approaches that use source-to-source transformations. Our results show that our method performs the best. To our best knowledge, our work is the first to demonstrate effective verification of low-level embedded software with nested interrupts.
In this paper we present a new algorithm that combines contextual unfoldings and dynamic symbolic execution to systematically test multithreaded programs. The approach uses symbolic execution to limit the number of input values and unfoldings to limit the number of thread interleavings that are needed to cover reachable local states of threads in the program under test. We show that the use of contextual unfoldings allows interleavings of threads to be succinctly represented. This can in some cases lead to a substantial reduction to the number of needed test executions when compared to previous approaches.
Device-free passive detection is an emerging technology to detect presense of moving entities without attaching any device to them. Despite of the prevalent RSS, most robust and reliable solutions resort to finer-grained channel descriptor at physical layer. Few existing techniques have explored full potentials of CSI. Moreover, space diversity supported by multi-antenna systems are not investigated as extensive as frequency diversity. In this paper, we propose a novel scheme for PAssive Detection of moving humans with dynamic Speed (PADS), which exploits full information of CSI and space diversity. Experiment results demonstrate PADS's great performance in spite of dynamic human movements.
Mobile platforms are increasingly using HMPSoCs with integrated GPUs. Traditionally, separate CPU and GPU governors are deployed to achieve energy efficiency through DVFS, but miss opportunities for further energy savings. We present a cooperative CPU-GPU DVFS strategy that orchestrates energy-efficient DVFS through synergistic CPU and GPU frequency capping to avoid frequency over-provisioning. Our results across wide range of multiple applications (over 200 micro-benchmarks and 40 mobile games) show that our proposal improves energy per frame by 8 % (up to 27.6%) and achieves minimal FPS loss by 0.85% in deployment set, compared to the default governors on the ODROID-XU3 platform.
Guest Editorial: Special Issue on Formal Methods and Models for System Design
Traditional approaches for managing software-programmable memories (SPMs) do not support sharing of distributed on-chip memory resources, and consequently miss the opportunity to better utilize those memory resources. Managing on-chip memory resources in many-core embedded systems (MES) with distributed SPMs requires runtime support to share memory resources between various threads with different memory demands running concurrently. This paper proposes ShaVe-ICE: an operating-system-level solution, along with hardware support, to virtualize and ultimately share SPM resources across a many-core embedded system in order to reduce the average memory latency. We present a number of simple allocation policies to improve performance and energy.
Coarse-grained reconfigurable architectures (CGRAs) have many processing elements, which is suitable for implementing spatial redundancy, as used in the design of fault-tolerant systems. This paper introduces a recovery time model for transient faults in CGRAs. The proposed fault-tolerance is based on triple modular redundancy and coding techniques for error detection and correction. To evaluate the model, several kernels from space computing are mapped onto the suggested architecture. We demonstrate the tradeoff between recovery time, performance and area. In addition, the average execution time of an application including recovery time is evaluated using area-based error rate estimates in harsh radiation environments.
In this paper we present Distributed Computing for Constrained Devices (DC4CD), a novel software architecture that supports symbolic distributed computing on Wireless Sensor Networks. DC4CD integrates the functionalities of a high-level symbolic interpreter, a compiler, and an operating system, and includes networking abstractions to exchange high-level symbolic code among peer devices. Contrarily to other architectures proposed in literature, DC4CD allows for changes at runtime, even on deployed nodes of both application and system code. Experimental results show that DC4CD is more efficient in terms of memory usage than existing architectures, with which also compares well in terms of execution efficiency.
Temporal properties define the order of occurrence and timing constraints on event occurrence. Such specifications are important for safety-critical real-time systems. We propose a framework for automatically mining (eager matching) properties that are in the form of timed regular expressions (TREs) from system traces. Using an abstract structure of the property, the framework constructs a finite state machine to serve as an acceptor. We analytically derive speedup for the fragment and confirm the speedup using empirical validation with synthetic traces. The framework is evaluated on industrial strength safety-critical real-time applications using traces with more than 1 Million entries.
This paper presents a new reliability-aware task mapping approach in a many-core platform at design time for applications with DAG-based task graphs. The main goal is to devise a task mapping which meets a predefined reliability threshold considering a minimized performance degradation. The proposed approach uses a majority-voting replication technique to fulfill error-masking capability. A quantitative reliability model is also proposed for the platform. Our iterative approach is applicable to an unlimited number of system fault types. All parts of the platform including cores, links and routers are assumed to be prone to failur
The drift phenomenon can degrade the quality of result (QoR) of applications in presence of approximate MLC-PCM. The architecture-level techniques to alleviate the effects of the drift incur considerable power overhead in embedded systems (about 100%). In this paper, we utilize the DVFS technique to speed up the execution of the application when the profiling shows the drift-based errors are more probable to decrease the probability of drift-based soft errors. But, we speed up the application to save energy when the drift is not threating. Power overhead is improved by 84% (in average) in our approach, while QoR is acceptable.
A framework for the elicitation and debugging of formal specifications for Cyber-Physical Systems is presented. The elicitation of specifications is handled through a graphical interface. Two debugging algorithms are presented. The first checks for erroneous or incomplete temporal logic specifications without considering the system. The second can be utilized for the analysis of reactive requirements with respect to system test traces. The specification debugging framework is applied on a number of formal specifications collected through a user study. The user study establishes that requirement errors are common and that the debugging framework can resolve many insidious specification errors.
We consider the model repair problem: given a finite Kripke structure M and a CTL formula ·, determine if M contains a substructure M' that satisfies ·. Thus, M can be ``repaired'' to satisfy · by deleting some transitions and states. We reduce model repair to boolean satisfiability, so that any SAT solver can be used to repair Kripke structures. We extend the basic repair method to use abstraction mappings, to repair concurrent Kripke structures and concurrent programs, and to repair hierarchical Kripke structures.