The fast multipole method (FMM) is a promising mathematical technique that accelerates the calculation of long-range forces in large N-body problems. Existing implementations of the FMM on general-purpose processors are energy- and resource-inefficient. To mitigate these issues, we propose a hardware pipeline that accelerates three key FMM steps. The pipeline improves energy efficiency by exploiting the fine-grained parallelism of the FMM. We reuse the pipeline for different FMM steps to improve resource utilization. Experiments show that our implementation outperforms state-of-the-art implementations on CPUs and GPUs, requiring 15\% less energy and 66\% fewer resources.
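For context, the computation that the FMM accelerates can be sketched as a direct pairwise summation, which costs O(N^2) operations. The sketch below is only a baseline illustration and not the pipeline described in the abstract; the function name `direct_forces`, the unit gravitational constant, and the `softening` parameter are assumptions introduced here.

```python
import math

def direct_forces(positions, masses, softening=1e-9):
    """Direct O(N^2) summation of pairwise inverse-square forces (G = 1).

    Hypothetical baseline: this is the brute-force computation that the FMM
    approximates at lower cost, not the FMM itself.
    """
    n = len(positions)
    forces = [[0.0, 0.0, 0.0] for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            # Displacement vector from body i to body j.
            d = [positions[j][k] - positions[i][k] for k in range(3)]
            # Squared distance, softened to avoid division by zero.
            r2 = sum(c * c for c in d) + softening
            inv_r3 = 1.0 / (r2 * math.sqrt(r2))
            s = masses[i] * masses[j] * inv_r3
            for k in range(3):
                forces[i][k] += s * d[k]   # body i is pulled toward j
                forces[j][k] -= s * d[k]   # equal and opposite force on j
    return forces
```

Every pair of bodies is visited once, so doubling N roughly quadruples the work; this scaling is what makes approximations such as the FMM attractive for large N.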
Sample preparation plays a crucial role in biochemical applications, since a predominant portion of analysis time is associated with sample collection, transportation, and preparation. In many biochemical applications, however, target mixtures with exact component proportions may not be needed. The choice of a particular valid ratio nevertheless strongly impacts solution-preparation cost and time. To address this problem, we propose an optimal-cost, concentration-resilient ratio-selection method that is suitable for digital microfluidic biochips. Experimental results reveal that the proposed method can be used conveniently in tandem with several existing sample preparation algorithms to improve their performance.
The increasing use of heterogeneous embedded systems with multi-core CPUs and Graphics Processing Units (GPUs) presents important challenges in effectively exploiting pipeline, task, and data-level parallelism to meet the throughput requirements of digital signal processing (DSP) applications. Moreover, in the presence of system-level memory constraints, hand optimization of code to satisfy these requirements is inefficient and error-prone, and can therefore greatly slow down development or result in highly underutilized processing resources. In this paper, we present vectorization and scheduling methods to effectively exploit multiple forms of parallelism for throughput optimization on hybrid CPU-GPU platforms, while conforming to system-level ...
Editorial: Who Could You Trust?
Radio frequency (RF) energy harvesting techniques are becoming a potential method to power battery-free wireless networks. In RF energy harvesting communications, energy cooperation enables shaping and optimization of the energy arrivals at the energy-receiving node to improve overall system performance. In this paper, we propose an energy cooperation scheme for battery-free wireless networks with RF harvesting. We first study the battery-free wireless network with RF energy harvesting and then formulate the problem of optimizing system performance through a new energy cooperation protocol. We find that our protocol performs better than the original battery-free wireless network solution.
Through pseudorandom permutation, tweakable enciphering schemes (TES) constitute block cipher modes of operation which perform length-preserving computations. In this paper, we propose efficient approaches for protecting such schemes against natural and malicious faults. Specifically, noting that intelligent attackers are not confined to injecting multiple faults, one major benchmark for the proposed schemes is evaluation under biased and burst fault models. In addition, we benchmark the overhead and performance degradation on an ASIC platform. The results of our error injection simulations and ASIC implementations show the suitability of the proposed approaches for a wide range of applications, including deeply embedded systems.
To enable long-lived indoor sensing, we report in this paper a self-sustaining sensing system that draws energy from HVAC systems. As the harvested power is tiny, an extremely low but synchronous duty-cycle has to be applied; hence we design two complementary synchronization schemes that cost virtually no energy. Finally, we exploit the feature of our harvester to sense the airflow speed in an energy-free manner and the sensed data can be used to enhance the awareness of the indoor microclimate. To our knowledge, this is the first indoor wireless sensing system encapsulating energy harvesting, network operating, and sensing all together.
Current countermeasures against cache side-channel attacks (SCA) have not been designed for many-core architectures and need to be revisited in order to be practical for these new technologies. Spatial isolation of resources for sensitive applications has been proposed, taking advantage of the large number of resources offered by these architectures. This solution avoids cache sharing with sensitive processes. Consequently, their cache activity cannot be monitored and cache SCA cannot be performed. This work focuses on the implementation of this technique in order to minimize the induced performance overhead. Different strategies for the management of isolated secure zones are implemented and compared.
In this paper, we present implementations of Addition, Rotation, and eXclusive-or (ARX)-based block ciphers, including LEA and HIGHT, on IoT devices with 8-bit AVR, 16-bit MSP, 32-bit ARM, and 32-bit ARM-NEON processors. We optimized 32/8-bit-wise ARX operations for the LEA and HIGHT block ciphers by considering variations in word size, the number of general-purpose registers, and the instruction set of the target IoT devices. As a result, we achieved the most compact implementations of the LEA and HIGHT block ciphers. The implementations were fairly evaluated through the Fair Evaluation of Lightweight Cryptographic Systems framework, and they won the competitions in the first and second rounds.
This paper presents a new algorithm for fusing data from IMU sources, namely the gyroscope, accelerometer, and magnetometer. In contrast to state-of-the-art approaches, the time-varying magnetic perturbation problem is first characterized from a geometric view. Using this detailed model as motivation, we propose an extended Kalman filter based (EKF-based) algorithm to eliminate the position-dependent effect on the compass sensor. Experimental data demonstrate that our proposed attitude fusion algorithm has a maximum angle error of $2.74\degree$, compared to $6.68\degree$ for the gradient descent based (GD-based) algorithm, even under different indoor magnetic distortion environments.
Recently, the research community has introduced several predictable DRAM controller designs that provide improved worst-case timing guarantees for real-time embedded systems. The proposed controllers significantly differ in terms of arbitration, configuration and simulation environment, making it difficult to assess the contribution of each approach. To bridge this gap, this paper provides the first comprehensive evaluation of state-of-the-art predictable DRAM controllers. We propose a categorization of available controllers, and introduce an analytical performance model based on worst-case latency. We then conduct an extensive evaluation for all state-of-the-art controllers based on a common simulation platform, and discuss findings and recommendations.
This paper examines sources of dynamic energy during the execution of software on Internet of Things (IoT) domain microprocessors. Typically, an energy model is used to find the most costly path through a program. Few models, however, adequately consider dynamic energy caused by operand data. We find that operand data's contribution to overall energy can be significant, prove that finding the worst-case input data is NP-hard, and further, that it cannot be estimated to any useful factor. Our work shows that accurate worst-case analysis of data dependent energy is infeasible, and that other energy estimation techniques should be considered.
Device-free passive detection is an emerging technology for detecting the presence of moving entities without attaching any device to them. Despite the prevalence of RSS, the most robust and reliable solutions resort to a finer-grained channel descriptor at the physical layer. Few existing techniques have explored the full potential of CSI. Moreover, the space diversity supported by multi-antenna systems has not been investigated as extensively as frequency diversity. In this paper, we propose a novel scheme for PAssive Detection of moving humans with dynamic Speed (PADS), which exploits the full information of CSI and space diversity. Experimental results demonstrate the strong performance of PADS in spite of dynamic human movements.
Mobile platforms are increasingly using HMPSoCs with integrated GPUs. Traditionally, separate CPU and GPU governors are deployed to achieve energy efficiency through DVFS, but they miss opportunities for further energy savings. We present a cooperative CPU-GPU DVFS strategy that orchestrates energy-efficient DVFS through synergistic CPU and GPU frequency capping to avoid frequency over-provisioning. Our results across a wide range of applications (over 200 micro-benchmarks and 40 mobile games) show that our proposal improves energy per frame by 8% (up to 27.6%) and incurs a minimal FPS loss of 0.85% on the deployment set, compared to the default governors on the ODROID-XU3 platform.
Guest Editorial: Special Issue on Formal Methods and Models for System Design
Traditional approaches for managing software-programmable memories (SPMs) do not support sharing of distributed on-chip memory resources, and consequently miss the opportunity to better utilize those memory resources. Managing on-chip memory resources in many-core embedded systems (MES) with distributed SPMs requires runtime support to share memory resources between various threads with different memory demands running concurrently. This paper proposes ShaVe-ICE: an operating-system-level solution, along with hardware support, to virtualize and ultimately share SPM resources across a many-core embedded system in order to reduce the average memory latency. We present a number of simple allocation policies to improve performance and energy.
Recent trends in SIMD architecture have tended toward longer vector lengths. However, legacy applications compiled with short-SIMD ISA cannot benefit from long-SIMD architectures which support improved parallelism, resulting in only a small fraction of potential performance. This paper presents a dynamic binary translation technique that enables short-SIMD binaries to exploit benefits of new SIMD architectures by rewriting short-SIMD codes. We propose a general approach which translates short-SIMD loops to machine-independent IR, conducts SIMD transformation/optimization at this IR level, and finally translates to long-SIMD instructions. Benchmark results show that average speedups of 1.59X/2.82X are achieved for NEON to AVX2/AVX-512 loop transformation.
This paper proposes an inter-procedural Loop-oriented Pointer Analysis, called LPA, for analyzing arrays and structs to support aggressive vectorization optimizations. Unlike field-insensitive solutions that pre-allocate objects for each memory allocation site, our approach uses a lazy memory model to generate access-based location sets based on how structs and arrays are accessed. LPA can precisely analyze arrays and nested aggregate structures to enable SIMD optimizations for large programs. By separating location set generation as an independent concern from the rest of the pointer analysis, LPA is designed to easily reuse existing points-to resolution algorithms.
Temporal properties define the order of occurrence and timing constraints on event occurrence. Such specifications are important for safety-critical real-time systems. We propose a framework for automatically mining (eager matching) properties that are in the form of timed regular expressions (TREs) from system traces. Using an abstract structure of the property, the framework constructs a finite state machine to serve as an acceptor. We analytically derive speedup for the fragment and confirm the speedup using empirical validation with synthetic traces. The framework is evaluated on industrial-strength safety-critical real-time applications using traces with more than 1 million entries.
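To make the idea of a finite state machine acting as an acceptor for a timed property more concrete, the sketch below checks one very simple timed pattern over a trace: an event `first` followed by an event `second` within at most `max_delay` time units. It is a minimal illustration under assumed conventions (a trace as a list of `(timestamp, event)` pairs, a two-state machine, and the hypothetical function name `count_timed_matches`), not the mining framework described in the abstract.

```python
def count_timed_matches(trace, first, second, max_delay):
    """Count occurrences of `first` followed by `second` within max_delay.

    Hypothetical two-state acceptor; `trace` is assumed to be a list of
    (timestamp, event) pairs sorted by timestamp.
    """
    state = "idle"   # acceptor state: "idle" or "armed"
    start = None     # timestamp of the pending `first` event
    matches = 0
    for timestamp, event in trace:
        if state == "armed" and timestamp - start > max_delay:
            state = "idle"          # deadline expired, drop the pending match
        if state == "idle" and event == first:
            state, start = "armed", timestamp
        elif state == "armed" and event == second:
            matches += 1            # eager: accept at the earliest `second`
            state = "idle"
    return matches
```

While the acceptor is armed, further `first` events are ignored, so each match is anchored at the earliest pending `first` and completed by the earliest `second` that meets the deadline.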