[1] H. Javaid, M. Shafique, J. Henkel, and S. Parameswaran. Energy-efficient adaptive pipelined MPSoCs for multimedia applications. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 33(5):663-676, 2014. [ bib | DOI ]
Pipelined MPSoCs provide a high-throughput implementation platform for multimedia applications. They are typically balanced at design time considering worst-case scenarios so that a given throughput can be fulfilled at all times. Such worst-case pipelined MPSoCs lack runtime adaptability, resulting in inefficient resource utilization and high power/energy consumption under a dynamic workload. In this paper, we propose a novel adaptive architecture and a distributed runtime processor manager to enable runtime adaptation in pipelined MPSoCs. The proposed architecture consists of main processors and auxiliary processors, where a main processor uses a differing number of auxiliary processors depending on runtime workload variations. The runtime processor manager combines the application's execution knowledge with offline profiling and statistical information to proactively predict the auxiliary processors that should be used by a main processor. The idle auxiliary processors are then deactivated using clock- or power-gating. Each main processor with its pool of auxiliary processors has its own runtime manager, independent of the other main processors, enabling a distributed runtime manager. Our experiments with an H.264 video encoder for HD720p resolution at 30 frames/s show that the adaptive pipelined MPSoC consumed up to 29% less energy (computed using processors and caches) than a worst-case pipelined MPSoC, while delivering a minimum of 28.75 frames/s. Our results show that adaptive pipelined MPSoCs can emerge as an energy-efficient implementation platform for advanced multimedia applications.

Keywords: multimedia communication;multiprocessing systems;system-on-chip;video coding;H.264 video encoder;auxiliary processor;distributed runtime processor management;energy-efficient adaptive pipelined MPSoC;multimedia application;offline profiling;power-energy consumption;resource utilization;runtime adaptability;statistical information;Clocks;Energy consumption;Motion estimation;Multimedia communication;Runtime;Streaming media;Throughput;Multimedia applications;multiprocessor system-on-chip (MPSoC);runtime adaptability
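The runtime decision in [1], predicting how many auxiliary processors a main processor should keep active while the rest are gated off, can be sketched as follows (a minimal illustration only; the function name, the cycles-per-iteration parameters and the ceiling-of-surplus rule are assumptions, not the paper's actual predictor):

```python
import math

def aux_processors_needed(predicted_workload, main_capacity, aux_capacity):
    """Number of auxiliary processors a main processor should keep active
    for the next iteration; the remainder can be clock- or power-gated.
    All parameters are hypothetical (cycles per iteration)."""
    surplus = predicted_workload - main_capacity
    if surplus <= 0:
        return 0
    return math.ceil(surplus / aux_capacity)

# A light workload fits on the main processor alone...
assert aux_processors_needed(900, 1000, 400) == 0
# ...while a heavy one activates just enough auxiliaries.
assert aux_processors_needed(2500, 1000, 400) == 4
```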
[2] H. Bokhari, H. Javaid, M. Shafique, J. Henkel, and S. Parameswaran. darkNoC: Designing energy-efficient network-on-chip with multi-Vt cells for dark silicon. In Proceedings of the 51st ACM/EDAC/IEEE Design Automation Conference (DAC), pages 1-6, 2014. [ bib | DOI ]
In this paper, we propose a novel NoC architecture, called darkNoC, where multiple layers of architecturally identical but physically different routers are integrated, leveraging the extra transistors available due to dark silicon. Each layer is separately optimized for a particular voltage-frequency range by the adroit use of multi-Vt circuit optimization. At a given time, only one of the network layers is illuminated while all the other network layers are dark. We provide architectural support for seamless integration of multiple network layers, and a fast inter-layer switching mechanism that does not drop in-network packets. Our experiments on a 4x4 mesh with multi-programmed real application workloads show that darkNoC improves energy-delay product by up to 56% compared to a traditional single-layer NoC with state-of-the-art DVFS. This illustrates that darkNoC can be used as an energy-efficient communication fabric in future dark silicon chips.

Keywords: circuit optimisation;elemental semiconductors;integrated circuit interconnections;low-power electronics;network-on-chip;silicon;DVFS;DarkNoC;dark silicon chips;energy-efficient communication fabric;in-network packets;interlayer switching mechanism;multiVt cells;multiVt circuit optimization;multiple network layers;network-on-chip;Computer architecture;Libraries;Microprocessors;Optimization;Silicon;Switches;Threshold voltage
[3] J. Schneider, J. Peddersen, and S. Parameswaran. A scorchingly fast FPGA-based precise L1 LRU cache simulator. In Proceedings of the 19th Asia and South Pacific Design Automation Conference (ASP-DAC), pages 412-417, 2014. [ bib | DOI ]
Judicious selection of the cache configuration is critical in embedded systems, as the cache design impacts power consumption and processor throughput. A large cache increases cache hits but requires more hardware and more power, and is slower on each access. A smaller cache is more economical and faster per access, but may incur significantly more cache misses, resulting in a slower system. For a given application, or a class of applications on a given hardware system, the designer can optimise the cache configuration through cache simulation. We present here the first multiple-cache simulator based on hardware. The FPGA implementation is characterised by a trace consumption rate of 100 MHz, making our cache simulation core up to 53x faster, for a set of benchmarks, than the fastest software-based cache simulator. Our cache simulator can determine the hit rates of 308 cache configurations, evaluating up to 44 of them simultaneously.

Keywords: cache storage;field programmable gate arrays;FPGA implementation;FPGA-based precise L1 LRU cache simulator;cache configuration optimization;cache design;cache simulation;embedded systems;fastest software-based cache simulator;hardware system;power consumption;processor throughput;trace consumption rate;Field programmable gate arrays;Hardware;Indexes;Shift registers;Software;Table lookup
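The core bookkeeping that the FPGA simulator of [3] performs for each configuration, counting L1 hits and misses under LRU replacement, can be sketched in software (a sequential analogue only; the class and parameter names are illustrative, and the hardware evaluates many configurations in parallel rather than one at a time):

```python
from collections import OrderedDict

class LRUCacheSim:
    """Hit/miss counter for one (sets, ways, line size) configuration
    of a set-associative cache with LRU replacement."""
    def __init__(self, num_sets, ways, line_size):
        self.num_sets, self.ways, self.line_size = num_sets, ways, line_size
        self.sets = [OrderedDict() for _ in range(num_sets)]
        self.hits = self.misses = 0

    def access(self, addr):
        line = addr // self.line_size
        idx, tag = line % self.num_sets, line // self.num_sets
        s = self.sets[idx]
        if tag in s:
            s.move_to_end(tag)          # promote to most-recently used
            self.hits += 1
        else:
            self.misses += 1
            if len(s) >= self.ways:
                s.popitem(last=False)   # evict the least-recently used line
            s[tag] = True

sim = LRUCacheSim(num_sets=2, ways=2, line_size=4)
for a in [0, 0, 4, 8, 0]:
    sim.access(a)
assert (sim.hits, sim.misses) == (2, 3)
```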
[4] I. Nawinne, J. Schneider, H. Javaid, and S. Parameswaran. Hardware-based fast exploration of cache hierarchies in application-specific MPSoCs. In Proceedings of the Design, Automation and Test in Europe Conference and Exhibition (DATE), pages 1-6, 2014. [ bib | DOI ]
Multi-level caches are widely used to improve the memory access speed of multiprocessor systems. Deciding on a suitable set of cache memories for an application-specific embedded system's memory hierarchy is a tedious problem, particularly in the case of MPSoCs. To accurately determine the number of hits and misses for all the configurations in the design space of an MPSoC, researchers first extract the trace using instruction set simulators and then simulate it using a software simulator. Such simulations take several hours to months. We propose a novel method based on specialized hardware which can quickly simulate the design space of cache configurations for a shared-memory multiprocessor system on an FPGA, by analyzing the memory traces and calculating the cache hits and misses simultaneously. We demonstrate that our simulator can explore the cache design space of a quad-core system with private L1 caches and a shared L2 cache, over a range of standard benchmarks, taking as little as 0.106 seconds per million memory accesses, which is up to 456 times faster than the fastest known software-based simulator. Since we emulate the program and analyze memory traces simultaneously, we eliminate the need to extract multiple memory access traces prior to simulation, which saves a significant amount of time during the design stage.

Keywords: cache storage;shared memory systems;system-on-chip;FPGA;MPSoC;application specific embedded system;cache configurations;cache design space;cache memories;instruction set simulators;memory access speed;memory hierarchy;memory traces;multilevel caches;private L1 caches;quad-core system;shared L2 cache;shared memory multiprocessor system;software simulator;Analytical models;Energy consumption;Field programmable gate arrays;Hardware;Real-time systems;Software;Space exploration
[5] H. Javaid, Y. Yachide, S.M.M. Shwe, H. Bokhari, and S. Parameswaran. FALCON: A framework for hierarchical computation of metrics for component-based parameterized SoCs. In Proceedings of the 51st ACM/EDAC/IEEE Design Automation Conference (DAC), pages 1-6, 2014. [ bib | DOI ]
In this paper, we focus on systematic and efficient computation (accurate value or an estimate) of metrics such as performance, power and energy of a component-based parameterized system-on-chip (SoC). Traditionally, given models of SoC components (such as a cycle-accurate simulator of a processor, or a trace-based simulator of a cache/memory), a designer manually determines an execution schedule of these models (such as executing the processor simulator, followed by the cache/memory simulator) to combine and propagate their individual results into an SoC metric. To reduce the designer's effort, we propose FALCON, a framework where the execution schedule of component models is generated automatically, and a minimal number of model executions is used to compute values of an SoC metric for the given component models and design space (resulting from component parameter values). FALCON is semi-automated, is applicable to a wide range of SoC platforms with ease, and works with existing design space exploration algorithms. In three case studies (a uniprocessor system, a multiprocessor pipeline system and a multiprocessor mesh network-on-chip system), FALCON reduced the designer's effort (measured in minutes) by at least two orders of magnitude.

Keywords: multiprocessing systems;network-on-chip;processor scheduling;FALCON-based parameterized SoC;SoC metric;automatic execution schedule generation;component model execution;component parameter values;component-based parameterized system-on-chip;cycle-accurate processor simulator;design space exploration algorithms;framework-for-hierarchical computation-of-metrics-for-component;multiprocessor mesh network-on-chip system;multiprocessor pipeline system;semiautomated FALCON;trace-based cache simulator;trace-based memory simulator;uniprocessor system;Analytical models;Clocks;Computational modeling;Estimation;Integrated circuit modeling;System-on-chip
[6] Darshana Jayasinghe, Roshan Ragel, Jude Angelo Ambrose, Aleksandar Ignjatovic, and Sri Parameswaran. Advanced modes in AES: Are they safe from power analysis based side channel attacks? In Proceedings of the 32nd IEEE International Conference on Computer Design (ICCD), pages 173-180, 2014. [ bib | DOI ]
Advanced Encryption Standard (AES) is arguably the most popular symmetric block cipher algorithm. The commonly used mode of operation in AES is the Electronic Codebook (ECB) mode. In the past, side channel attacks (including power analysis based attacks) have been shown to be effective in breaking the secret keys used with AES while AES is operating in the ECB mode. AES defines a number of advanced modes of operation (namely Cipher Block Chaining - CBC, Cipher Feedback - CFB, Output Feedback - OFB, and Counter - CTR) that are built on top of the ECB mode to enhance security by disassociating the encryption function from the plaintext or the secret key used. In this paper, we investigate the vulnerability of all such modes of operation to power analysis based side channel attacks, implemented on hardware circuits for low-power and high-speed embedded systems. Through such an investigation, we show that AES is vulnerable in all modes of operation to the Correlation Power Analysis (CPA) attack, one of the strongest power analysis based side channel attacks. We also quantify the level of difficulty in breaking AES in different modes by calculating the number of power traces needed to arrive at the complete secret key. We conclude that the Counter mode of operation provides a balance between area and power while offering better resistance to power analysis attacks than the other modes of operation. We show that the previous recommendations for the rate of change of the keys and vectors are grossly inadequate, and suggest that the key must be changed at least every 2^10 encryptions in CBC mode and every 2^12 encryptions in CFB, OFB and CTR modes in order to resist power analysis attacks.

Keywords: Correlation;Encryption;Equations;Mathematical model;Power demand;Registers
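The CPA attack that [6] applies across the AES modes correlates a power model against measured traces for every key-byte guess and keeps the best-correlating guess. A toy sketch (the Hamming-weight-of-XOR leakage model and the synthetic single-sample traces are deliberate simplifications; real attacks model S-box outputs over many trace samples):

```python
import numpy as np

def hamming_weight(x):
    return bin(x).count("1")

def cpa_recover_key_byte(plaintexts, traces):
    """Rank key-byte guesses by the absolute Pearson correlation between
    a Hamming-weight leakage model and the measured traces."""
    best_guess, best_corr = None, -1.0
    for guess in range(256):
        model = np.array([hamming_weight(p ^ guess) for p in plaintexts])
        corr = abs(np.corrcoef(model, traces)[0, 1])
        if corr > best_corr:
            best_guess, best_corr = guess, corr
    return best_guess

# Synthetic traces leak HW(p ^ 0x3C) plus Gaussian noise; CPA recovers 0x3C.
rng = np.random.default_rng(0)
pts = rng.integers(0, 256, 2000)
trc = np.array([hamming_weight(p ^ 0x3C) for p in pts]) + rng.normal(0, 0.5, 2000)
assert cpa_recover_key_byte(pts, trc) == 0x3C
```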
[7] P. Aluthwala, N. Weste, A. Adams, T. Lehmann, and S. Parameswaran. A simple digital architecture for a harmonic-cancelling sine-wave synthesizer. In Proceedings of the IEEE International Symposium on Circuits and Systems (ISCAS), pages 2113-2116, 2014. [ bib | DOI ]
Sine-wave synthesizers are a core requirement in many electronic applications, such as communication systems and the test and verification of analog/mixed-signal electronic systems. In sine-wave synthesizers, there exists a compromise between output spectral purity and hardware complexity. Harmonic-cancelling sine-wave synthesizers (HCSSs) allow spectrally pure signal synthesis at low hardware cost compared to conventional sine-wave synthesis approaches. In this paper, we propose a digital HCSS hardware architecture which is simpler, more hardware-efficient and more programmable than state-of-the-art HCSSs. The proposed architecture has been verified through a prototype built from an FPGA and discrete components. Prototype results demonstrate a 51.9 dBc spurious-free dynamic range (SFDR) and an output frequency range from 100 Hz to 100 kHz.

Keywords: direct digital synthesis;harmonics suppression;signal synthesis;FPGA;SFDR;analog-mixed-signal electronic systems;communication systems;digital HCSS hardware architecture;digital architecture;discrete components;frequency 100 Hz to 100 kHz;hardware complexity;harmonic-cancelling sine-wave synthesizer;low hardware cost;output spectral purity;signal synthesis;spurious free dynamic range;Clocks;Digital circuits;Hardware;Harmonic analysis;Power harmonic filters;Prototypes;Synthesizers;Sinusoids;digital-oscillator;low harmonic distortion
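The harmonic-cancelling principle behind [7] can be demonstrated numerically: summing a square wave with two copies shifted by -1/8 and +1/8 of a period, weighted 1 : sqrt(2) : 1, cancels the 3rd and 5th harmonics, leaving the 7th as the first surviving odd harmonic. This classic three-phase scheme illustrates the HCSS idea, not necessarily the paper's exact architecture:

```python
import numpy as np

N = 1024
base = np.where(np.arange(N) < N // 2, 1.0, -1.0)   # one period of a +/-1 square wave
# Weights 1 : sqrt(2) : 1 at shifts of -1/8, 0, +1/8 of a period
# (np.roll by N//8 = 128 samples gives exact circular shifts).
composite = np.roll(base, -N // 8) + np.sqrt(2) * base + np.roll(base, N // 8)

spec = np.abs(np.fft.rfft(composite)) / N
fund = spec[1]
assert spec[3] < 1e-6 * fund and spec[5] < 1e-6 * fund   # 3rd and 5th cancelled
assert spec[7] > 0.05 * fund                             # 7th harmonic survives
```

The cancellation follows because harmonic n of the sum is scaled by sqrt(2) + 2*cos(n*pi/4), which is zero for n = 3 and n = 5.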
[8] S. Parameswaran. Mapping programs for execution on pipelined MPSoCs. In Proceedings of the 12th IEEE Symposium on Embedded Systems for Real-time Multimedia (ESTIMedia), pages 11-11, 2014. [ bib | DOI ]
In this paper, we will examine hardware/software pipelines and showcase parallelization/pipelining approaches to synthesize an MPSoC for a control loop with improved throughput (critical for streaming applications). Finally, we will show state-of-the-art methods to map programs to processors.

Keywords: system-on-chip;control loop;hardware-software pipelines;multiprocessor systems-on-chips;parallelization approach;pipelined MPSoC;pipelining approach;program mapping;streaming application;Embedded systems;Multimedia communication;Multiprocessing systems;Pipeline processing;Streaming media;Throughput
[9] Hong Chinh Doan, H. Javaid, and S. Parameswaran. Flexible and scalable implementation of H.264/AVC encoder for multiple resolutions using ASIPs. In Proceedings of the Design, Automation and Test in Europe Conference and Exhibition (DATE), pages 459-466, 2014. [ bib | DOI ]
Side channel attacks are a significant threat to the deployment of secure embedded systems. Differential power analysis is one of the most powerful power analysis attacks, and can be exploited in secure devices such as smart cards, PDAs and mobile phones. Several researchers in the past have presented experiments and countermeasures for differential power analysis in AES cryptography, though none of them have described the attack in a step-by-step manner covering all aspects of the attack. Some of the important missing segments are the consideration of pipelines, the analysis of the power profile to locate the points of attack, and the correspondence between the source code, its assembly representation, and the point of attack. In this paper we give a detailed, step-wise explanation of the differential power analysis of an AES implementation, covering all of the aspects identified above.

Keywords: authorisation;cryptography;embedded systems;AES cryptography;differential power analysis;secure embedded system;side channel attack;Algorithm design and analysis;Anatomy;Cryptography;Embedded system;Mobile handsets;Performance analysis;Personal digital assistants;Power measurement;Semiconductor device measurement;Smart cards;Differential Power Analysis;Side Channel Attacks
[10] J. Schneider, J. Peddersen, and S. Parameswaran. MASHfifo: A hardware-based multiple cache simulator for rapid FIFO cache analysis. In Proceedings of the 51st ACM/EDAC/IEEE Design Automation Conference (DAC), pages 1-6, 2014. [ bib | DOI ]
Cache memories have become an essential component in modern processors. To find the cache configuration that best fits the targeted power, timing and cost criteria of the system, designers conventionally run a lengthy cache simulation in software. In this paper we present MASHfifo, the first Multiple cAche Simulator in Hardware (MASH) supporting the FIFO replacement policy. We measured a speedup of up to 11x when compared to the fastest software alternative, CIPARSim. We also investigate an in-system implementation where multiple cache simulation is performed in real time from within an embedded system.

Keywords: cache storage;integrated circuit modelling;FIFO replacement policy;MASHfifo;cache memory;embedded system;hardware based multiple cache simulator;multiple cache simulator in hardware;rapid FIFO cache analysis;Abstracts;Containers;Convolution;Data models;Digital audio players;Multi-stage noise shaping;Table lookup
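What distinguishes the FIFO policy simulated by MASHfifo [10] from the LRU case is that a hit does not promote a line: eviction always removes the oldest-filled line. A software sketch (class and parameter names are illustrative; the hardware simulator evaluates many configurations concurrently):

```python
from collections import deque

class FIFOCacheSim:
    """Hit/miss counter for one set-associative cache configuration
    with FIFO replacement."""
    def __init__(self, num_sets, ways, line_size):
        self.num_sets, self.ways, self.line_size = num_sets, ways, line_size
        self.sets = [deque() for _ in range(num_sets)]
        self.hits = self.misses = 0

    def access(self, addr):
        line = addr // self.line_size
        idx, tag = line % self.num_sets, line // self.num_sets
        q = self.sets[idx]
        if tag in q:
            self.hits += 1          # FIFO: a hit does not change fill order
        else:
            self.misses += 1
            if len(q) >= self.ways:
                q.popleft()         # evict the oldest-filled line
            q.append(tag)

# Line sequence 0,1,0,2,0 on a 1-set, 2-way cache: FIFO scores 1 hit,
# whereas LRU would score 2 on the same sequence.
fifo = FIFOCacheSim(num_sets=1, ways=2, line_size=4)
for a in [0, 4, 0, 8, 0]:
    fifo.access(a)
assert (fifo.hits, fifo.misses) == (1, 4)
```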
[11] H. Javaid, A. Ignjatovic, and S. Parameswaran. Performance estimation of pipelined multiprocessor system-on-chips (MPSoCs). IEEE Transactions on Parallel and Distributed Systems, 25(8):2159-2168, 2014. [ bib | DOI ]
The paradigm of the pipelined MPSoC (processors connected in a pipeline) is well suited to the data-flow nature of multimedia applications. Often, design space exploration is performed to optimize the execution time, latency or throughput of a pipelined MPSoC, where the variants in the system are processor configurations arising from the customizable options of each processor. Since there can be billions of combinations of processor configurations (design points), the challenge is to quickly provide estimates of the performance metrics of those design points. Hence, in this article, we propose analytical models to estimate the execution time, latency and throughput of a pipelined MPSoC's design points, avoiding slow full-system cycle-accurate simulations of all the design points. For effective use of these analytical models, the latencies of individual processor configurations should be available. We propose two estimation methods (PS and PSP) to quickly gather the latencies of processor configurations with a reduced number of simulations. The PS method simulates all the processor configurations once, while the PSP method simulates only a subset of processor configurations and then uses a processor analytical model to estimate the latencies of the remaining configurations. We experimented with several pipelined MPSoCs executing typical multimedia applications (JPEG encoder/decoder, MP3 encoder and H.264 encoder). Our results show that the analytical models with the PS and PSP methods had maximum absolute errors of 12.95 percent and 18.67 percent, and minimum fidelities of 0.93 and 0.88, respectively. The design spaces of the pipelined MPSoCs ranged from 10^12 to 10^18 design points, so simulating all design points would take years and is infeasible. Compared to the PS method, the PSP method reduced simulation time from days to several hours.

Keywords: integrated circuit design;integrated circuit modelling;multiprocessing systems;pipeline processing;system-on-chip;PS methods;PSP method;design space exploration;execution time optimization;latency estimation;multimedia applications;performance estimation;pipelined MPSoC design points;pipelined multiprocessor system-on-chips;processor analytical model;processor configurations;slow full-system cycle accurate simulations;throughput estimation;Analytical models;Clocks;Estimation;Multimedia communication;Program processors;Steady-state;Throughput;Dataflow architectures;evaluation;heterogeneous (hybrid) systems;measurement;modeling;simulation of multiple-processor systems
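The flavour of the analytical models in [11] can be conveyed by the textbook first-order pipeline model (the paper's models are more detailed, accounting among other things for inter-processor communication; this sketch uses only per-stage latencies):

```python
def pipeline_metrics(stage_latencies, num_iterations):
    """First-order pipelined-MPSoC model: the slowest stage bounds
    throughput, and total execution time is pipeline fill time plus
    steady-state iterations at the critical stage's rate."""
    critical = max(stage_latencies)          # cycles per steady-state iteration
    latency = sum(stage_latencies)           # fill time for one data item
    exec_time = latency + (num_iterations - 1) * critical
    throughput = 1.0 / critical              # iterations per cycle
    return exec_time, latency, throughput

exec_time, latency, throughput = pipeline_metrics([100, 250, 120], 1000)
assert latency == 470
assert exec_time == 470 + 999 * 250
```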
[12] J.A. Ambrose, J. Peddersen, S. Parameswaran, A. Labios, and Y. Yachide. SDG2KPN: System dependency graph to function-level KPN generation of legacy code for MPSoCs. In Proceedings of the 19th Asia and South Pacific Design Automation Conference (ASP-DAC), pages 267-273, 2014. [ bib | DOI ]
The Multiprocessor System-on-Chip (MPSoC) paradigm as a viable implementation platform for parallel processing has expanded to encompass embedded devices. The ability to execute code in parallel gives MPSoCs the potential to achieve high performance with low power consumption. In order for sequential legacy code to take advantage of the MPSoC design paradigm, it must first be partitioned into data flow graphs (such as Kahn Process Networks - KPNs) to ensure the data elements can be correctly passed between the separate processing elements that operate on them. Existing techniques are inadequate for use in complex legacy code. This paper proposes SDG2KPN, a System Dependency Graph to KPN conversion methodology targeting the conversion of legacy code. By creating KPNs at the granularity of the function-/procedure-level, SDG2KPN is the first of its kind to support shared and global variables as well as many more program patterns/application types. We also provide a design flow which allows the creation of MPSoC systems utilizing the produced KPNs. We demonstrate the applicability of our approach by retargeting several sequential applications to the Tensilica MPSoC framework. Our system parallelized AES, an application of 950 lines, in 4.8 seconds, while H.264, of 57896 lines, took 164.9 seconds to parallelize.

Keywords: data flow graphs;low-power electronics;multiprocessing systems;parallel processing;system-on-chip;H.264;Kahn process networks;MPSoC design paradigm;SDG2KPN;Tensilica MPSoC framework;data flow graphs;embedded devices;function-level KPN generation;function-procedure-level;low power consumption;multiprocessor system-on-chip paradigm;parallel processing;sequential legacy code;system dependency graph to KPN conversion methodology;time 164.9 s;time 4.8 s;Australia;Generators;Input variables;Manuals;Parallel processing;Program processors;Syntactics
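The target representation of SDG2KPN [12], a Kahn process network, is a set of deterministic processes connected by unbounded FIFOs with blocking reads. A minimal executable sketch (the two processes and the end-of-stream token convention are made up for illustration; the paper generates such networks at function granularity from legacy C code):

```python
import queue
import threading

# Two KPN processes joined by FIFO channels with blocking reads.
ch_in, ch_out = queue.Queue(), queue.Queue()

def producer():
    for x in range(5):
        ch_in.put(x)
    ch_in.put(None)                 # end-of-stream token (illustrative)

def doubler():
    while (x := ch_in.get()) is not None:   # blocking read, KPN-style
        ch_out.put(2 * x)
    ch_out.put(None)

for f in (producer, doubler):
    threading.Thread(target=f).start()

results = []
while (y := ch_out.get()) is not None:
    results.append(y)
assert results == [0, 2, 4, 6, 8]
```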
[13] Josef Schneider and Sri Parameswaran. An extremely compact JPEG encoder for adaptive embedded systems. In Proceedings of the Design, Automation and Test in Europe Conference and Exhibition (DATE), pages 1063-1064, 2013. [ bib | DOI ]
JPEG encoding is a commonly performed application that is both processor- and memory-intensive, and thus poorly suited to low-power embedded systems with narrow data buses and small amounts of memory. An embedded system may also need to adapt its application in order to meet varying system constraints such as power, energy, time or bandwidth. We present here an extremely compact JPEG encoder that uses very few system resources and is capable of dynamically changing its Quality of Service (QoS) on the fly. The application was tested on a NIOS II core and on AVR and PIC24 microcontrollers with excellent results.

Keywords: Discrete cosine transforms;Educational institutions;Image color analysis;Image resolution;Random access memory;Read only memory;Transform coding
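The on-the-fly QoS adjustment in [13] plausibly amounts to rescaling the quantization tables by a quality factor. Whether the paper's encoder uses exactly the common libjpeg-style scaling rule below is an assumption, but it is the de facto standard one:

```python
def scale_quant_table(base_table, quality):
    """libjpeg-style quality scaling of a JPEG quantization table
    (quality 1..100; higher quality -> smaller divisors -> more detail).
    Shown as an illustration of a runtime QoS knob, not as the paper's
    confirmed mechanism."""
    quality = max(1, min(100, quality))
    scale = 5000 // quality if quality < 50 else 200 - 2 * quality
    return [max(1, min(255, (q * scale + 50) // 100)) for q in base_table]

# Quality 50 leaves the base table unchanged; quality 90 shrinks divisors.
assert scale_quant_table([16, 11, 10], 50) == [16, 11, 10]
assert scale_quant_table([16, 11, 10], 90) == [3, 2, 2]
```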
[14] Tuo Li, M. Shafique, S. Rehman, J.A. Ambrose, J. Henkel, and S. Parameswaran. DHASER: Dynamic heterogeneous adaptation for soft-error resiliency in ASIP-based multi-core systems. In Proceedings of the IEEE/ACM International Conference on Computer-Aided Design (ICCAD), pages 646-653, 2013. [ bib | DOI ]
Soft errors have become a major adverse effect in CMOS-based electronic systems. Mitigating soft errors requires enhancing the underlying system with error recovery functionality, which typically leads to considerable design cost overhead in terms of performance, power and area. For embedded systems, where stringent design constraints apply, such cost must be properly bounded. In this paper, we propose a HW/SW methodology, DHASER, which enables efficient error recovery functionality for embedded ASIP-based multi-core systems. DHASER consists of three main parts: task-level correctness (TLC) analysis, TLC-based processor/core customization, and a runtime reliability-aware task management mechanism. It enables each individual ASIP-based processing core to dynamically adapt its specific error recovery functionality according to the corresponding task's characteristics (i.e., soft error vulnerability and execution time deadline). The goal is to optimize overall system reliability while considering performance/throughput. Experimental results show that DHASER can significantly improve the reliability of the system with little cost overhead, in comparison to state-of-the-art counterparts.

Keywords: embedded systems;instruction sets;multiprocessing systems;radiation hardening (electronics);CMOS-based electronic systems;DHASER;HW-SW methodology;TLC-based core customization;TLC-based processor customization;area factor;design constraints;design cost overhead;dynamic heterogeneous adaptation-for-soft-error resiliency;embedded ASIP-based multicore systems;error recovery functionality;execution time deadline;overall system reliability optimization;performance factor;power factor;runtime reliability-aware task management mechanism;soft error mitigation;soft error vulnerability;system reliability improvement;task characteristics;task level correctness analysis;throughput;Adaptation models;Multicore processing;Optimization;Redundancy;Runtime
[15] Liang Tang, J.A. Ambrose, and S. Parameswaran. MAPro: A tiny processor for reconfigurable baseband modulation mapping. In Proceedings of the 26th International Conference on VLSI Design and 12th International Conference on Embedded Systems (VLSID), pages 1-6, 2013. [ bib | DOI ]
The need to integrate multiple wireless communication protocols into a single low-cost flexible hardware platform is prompted by the increasing number of emerging communication protocols and applications in modern embedded systems. The modulation mapping scheme, one of the key components of the communication baseband, varies across communication protocols. This paper presents an efficient tiny processor, named MAPro, which is programmable at runtime for differing modulation schemes. MAPro occupies little area, consumes less power than other programmable solutions, and is sufficiently fast to satisfy even the most demanding of Software Defined Radio (SDR) applications. The proposed approach is flexible (compared to ASICs) and suitable for mobile applications (compared to FPGAs and ASIP processors). The area of MAPro is only 25% of a combined ASIC implementation of multiple individual modulation mapping circuits, while its throughput meets specification. Power consumption is 110% more than the ASIC implementation on average. MAPro significantly outperforms both FPGA and ASIP implementations in area and power consumption. In terms of throughput, MAPro is similar to the FPGA and outperforms the ASIP processor.

Keywords: field programmable gate arrays;microprocessor chips;modulation;ASIP processors;FPGA;MAPro;embedded system;emerging communication protocols;hardware platform;mobile application;modulation mapping circuit;modulation mapping scheme;multiple wireless communication protocols;power consumption;reconfigurable baseband modulation mapping;software defined radio application;tiny processor;Application specific integrated circuits;Field programmable gate arrays;Phase shift keying;Protocols;Table lookup;Throughput;modulation;multi-mode;processor;reconfigurable
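At its core, the modulation mapping that MAPro [15] makes runtime-programmable is a lookup from bit groups to constellation points. A sketch with a Gray-coded QPSK table (the table contents and function names are illustrative; MAPro's programmability amounts to swapping the active table and bit-group width per protocol):

```python
import math

QPSK = {  # 2 bits -> complex symbol (Gray mapping, unit average energy)
    (0, 0): complex( 1,  1) / math.sqrt(2),
    (0, 1): complex( 1, -1) / math.sqrt(2),
    (1, 1): complex(-1, -1) / math.sqrt(2),
    (1, 0): complex(-1,  1) / math.sqrt(2),
}

def modulate(bits, table=QPSK, k=2):
    """Map a bit stream to symbols, k bits at a time, via the active table."""
    return [table[tuple(bits[i:i + k])] for i in range(0, len(bits), k)]

syms = modulate([0, 0, 1, 1])
assert syms[0] == complex(1, 1) / math.sqrt(2)
assert syms[1] == complex(-1, -1) / math.sqrt(2)
```

Reprogramming for another scheme (BPSK, 16-QAM, ...) would just load a different table and k, which is the flexibility the paper's processor provides in hardware.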
[16] J.A. Ambrose, I. Nawinne, and S. Parameswaran. Latency-constrained binding of data flow graphs to energy-conscious GALS-based MPSoCs. In Proceedings of the IEEE International Symposium on Circuits and Systems (ISCAS), pages 1212-1215, 2013. [ bib | DOI ]
Mapping tasks to cores in a Multiprocessor System-on-Chip (MPSoC) to meet constraints is widely investigated. Thus far, the data flow graphs used for binding have been limited to acyclic or single-rate graphs. In this paper we generalize the approach by allowing DFGs to be cyclic and multi-rate. We further improve energy consumption by setting the frequency per core in a Globally Asynchronous and Locally Synchronous (GALS) architecture (through the distribution of slack). A design flow combining these two approaches is proposed to form a latency-constrained and energy-efficient binding. Compared to the state of the art, our generalized solution improves the formulation, data structures and heuristics. Eight benchmarks are evaluated on mesh and pipeline architectures. Our heuristics achieve significant simulation speedup compared to the state of the art and provide a solution within 2.5% of the optimal (26% in the worst case), obtained 40x quicker (average case). Such a speedup allows us to rapidly explore a large design space.

Keywords: data flow graphs;integrated circuit design;multiprocessing systems;system-on-chip;MPSoC;acyclic graphs;data flow graphs;data structures;design flow;design space;energy conscious GALS-based MPSoC;energy consumption;energy-efficient binding;globally asynchronous-locally synchronous architecture;latency-constrained binding;mapping task;mesh architecture;multiprocessor system-on-chip;pipeline architecture;simulation speedup;slack distribution;Convex functions;Data structures;Flow graphs;Mathematical model;Ports (Computers);Program processors
[17] Su Myat Min Shwe, H. Javaid, and S. Parameswaran. RExCache: Rapid exploration of unified last-level cache. In Proceedings of the 18th Asia and South Pacific Design Automation Conference (ASP-DAC), pages 582-587, 2013. [ bib | DOI ]
In this paper, we propose to explore the design space of a unified last-level cache to improve system performance and energy efficiency. The challenge is to quickly estimate the execution time and energy consumption of the system with distinct cache configurations using a minimal number of slow full-system cycle-accurate simulations. To this end, we propose a novel, simple yet highly accurate execution time estimator and a simple, reasonably accurate energy estimator. Our framework, RExCache, combines a cycle-accurate simulator and a trace-driven cache simulator with these estimators to avoid cycle-accurate simulations of all the last-level cache configurations. Once execution time and energy estimates are available, RExCache chooses the cache configuration with minimum execution time or minimum energy consumption. Our experiments with nine different applications from MediaBench and 330 last-level cache configurations show that the execution time and energy estimators had average absolute accuracies of at least 99.74% and 80.31%, respectively. RExCache took only a few hours (21 hours for H.264enc) to explore the last-level cache configurations, compared to several days for the traditional method (36 days for H.264enc) and cycle-accurate simulations (257 days for H.264enc), enabling quick exploration of the last-level cache. When 100 different real-time constraints on execution time and energy were used, all the cache configurations found by RExCache matched those from cycle-accurate simulations, whereas the traditional method found the correct cache configuration for only 69 of the 100 constraints. Thus, RExCache has better absolute accuracy than the traditional method while reducing simulation time by at least 97%.

Keywords: cache storage;RExCache;cache configuration;energy consumption;energy efficiency;energy estimator;execution time estimator;full-system cycle-accurate simulation;trace-driven cache simulator;unified last-level cache;Accuracy;Computational modeling;Energy consumption;Estimation;Memory management;Real-time systems;System-on-chip
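The estimator idea in RExCache [17] can be sketched with first-order models: execution time from trace-driven hit/miss counts plus fixed latencies, and energy from per-access and per-miss costs plus leakage over the estimated runtime. The paper's exact estimators differ, and every parameter value below is illustrative:

```python
def estimate_metrics(base_cycles, hits, misses, cfg):
    """First-order execution-time and energy estimators for one
    last-level cache configuration, driven by trace-based hit/miss
    counts (a sketch in the spirit of RExCache, not its actual model)."""
    exec_time = (base_cycles
                 + hits * cfg["hit_latency"]
                 + misses * cfg["miss_penalty"])
    energy = ((hits + misses) * cfg["access_energy"]   # dynamic, per access
              + misses * cfg["miss_energy"]            # off-chip refill cost
              + exec_time * cfg["leakage_per_cycle"])  # static, over runtime
    return exec_time, energy

cfg = {"hit_latency": 10, "miss_penalty": 100,
       "access_energy": 0.05, "miss_energy": 1.0, "leakage_per_cycle": 0.001}
t, e = estimate_metrics(1_000_000, 90_000, 10_000, cfg)
assert t == 1_000_000 + 900_000 + 1_000_000   # 2.9M cycles
```

Sweeping such cheap estimators over all configurations, then cycle-accurately simulating only the winners, is what replaces the exhaustive simulation the abstract describes.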
[18] H. Bokhari, H. Javaid, and S. Parameswaran. System-level optimization of on-chip communication using express links for throughput-constrained MPSoCs. In Proceedings of the 11th IEEE Symposium on Embedded Systems for Real-time Multimedia (ESTIMedia), pages 68-77, 2013. [ bib | DOI ]
Application-specific MPSoCs are often used for high-end streaming applications, which impose stringent throughput constraints and demand application-specific communication architectures. In this paper, we introduce a framework that selectively adds high-bandwidth express links between communicating processors, so that some traffic is directed via the express links rather than the baseline on-chip interconnect to improve the MPSoC's throughput. We present a novel heuristic, xLink, which exploits both processor latencies and on-chip traffic volume to efficiently prune the exponential design space and quickly reach a solution with a minimal number of express links for a given throughput constraint. Our framework is oblivious to the baseline interconnect and can therefore be applied to different interconnects. We applied our framework to two different MPSoC interconnects, crossbar NoC and mesh NoC, using 9 benchmark applications. For the crossbar NoC based MPSoC, xLink found the optimal solution in 24 of the 26 cases considered (with a maximum error of 20%), while a traditional heuristic found the optimal solution in only 17 cases (with a maximum error of 44%). For the mesh NoC based MPSoC, xLink outperformed the traditional heuristic in 3 of the 9 cases considered, with up to 11% saving in communication architecture area footprint. The xLink heuristic always took less than one hour, compared to several hours for the traditional heuristic and several days for an exhaustive algorithm. On average, xLink achieved a runtime speedup of 7.5 for the crossbar NoC topology and 16.5 for the mesh NoC topology, with respect to the traditional heuristic.

Keywords: integrated circuit interconnections;multiprocessing systems;network topology;network-on-chip;MPSoC interconnects;MPSoC throughput;NoC topology;application specific MPSoC;application specific communication architectures;bandwidth express links;baseline interconnect;baseline on-chip interconnect;communicating processors;communication architecture area footprint;crossbar NoC based MPSoC;exponential design space;mesh NoC;on-chip communication;on-chip traffic volume;processor latency;stringent throughput constraints;system-level optimization;xLink;Algorithm design and analysis;Bandwidth;Bismuth;Computer architecture;Program processors;System-on-chip;Throughput;Network-on-Chip;On-chip Interconnect;Streaming Applications;System level Design
[19] Liang Tang, J.A. Ambrose, and S. Parameswaran. Reconfigurable pipelined coprocessor for multi-mode communication transmission. In Design Automation Conference (DAC), 2013 50th ACM / EDAC / IEEE, pages 1-8, 2013. [ bib ]
The need to integrate multiple wireless communication protocols into a single low-cost, low-power hardware platform is prompted by the increasing number of emerging communication protocols and applications. This paper presents a novel application-specific platform that integrates multiple wireless baseband transmission protocols into a pipelined coprocessor, which can be programmed to support various baseband protocols. The coprocessor dynamically selects the suitable pipeline stages for each baseband protocol, and each carefully designed stage can perform a particular signal processing function in a reconfigurable fashion. The proposed platform is flexible (compared to ASICs) and suitable for mobile applications (compared to FPGAs and processors). The area footprint of the coprocessor is smaller than an ASIC or FPGA implementation of multiple individual protocols, while its throughput is 34% worse than ASICs and 32% better than FPGAs. Its power consumption is 2.7X worse than ASICs but 40X better than FPGAs on average. The proposed platform outperforms a processor implementation in area, throughput and power consumption, and also supports fast protocol switching. Wireless LAN (WLAN) 802.11a, WLAN 802.11b and Ultra Wide Band (UWB) transmission circuits were developed and mapped to the pipelined coprocessor to prove the efficacy of our proposal.

Keywords: coprocessors;protocols;radio equipment;wireless LAN;IEEE 802.11 a;WLAN 802.11 b;Wireless LAN;baseband protocol;fast protocol switching;low-cost hardware platform;low-power hardware platform;multimode communication transmission;multiple individual protocol;multiple wireless communication protocol;processor implementation;reconfigurable pipelined coprocessor;signal processing function;ultrawide band transmission circuits;Abstracts;Application specific integrated circuits;Logic gates;Multiaccess communication;Random access memory;Servers;Spread spectrum communication
[20] S.M. Min, H. Javaid, and S. Parameswaran. Xdra: Exploration and optimization of last-level cache for energy reduction in ddr drams. In Design Automation Conference (DAC), 2013 50th ACM / EDAC / IEEE, pages 1-10, 2013. [ bib ]
Embedded systems often exploit the idleness of DDR-DRAM to reduce their energy consumption by putting the DRAM into its deepest low-power mode (self-refresh power-down mode) during idle periods. DDR-DRAM idle periods depend heavily on the last-level cache, and exhaustive search using processor-memory simulators can take several months. This paper, for the first time, proposes a fast framework, XDRA, for exploring last-level cache configurations to improve DDR-DRAM energy efficiency. XDRA combines a processor-memory simulator, a cache simulator and novel analysis techniques to produce a Kriging-based estimator which predicts the energy savings of differing cache configurations for a given main memory size and application. Estimator errors were less than 4.4% on average for 11 applications from the mediabench and SPEC2000 suites and two DRAM sizes (Micron DDR3-DRAM 256MB and 4GB). Cache configurations selected by XDRA were on average 3.6 and 4 times more energy-efficient (cache and DRAM energy) than a common cache configuration. XDRA selected the optimal cache configuration in 20 of 22 cases; the two suboptimal configurations were at most 3.9% from their optimal counterparts. XDRA took a few days to explore the 330 cache configurations, compared to several hundred days of cycle-accurate simulation, saving at least 85% of the exploration time.

Keywords: DRAM chips;cache storage;statistical analysis;DDR DRAM;Kriging based estimator;SPEC2000 suite;XDRA;cache simulator;embedded system;energy consumption;energy reduction;last-level cache;low-power mode;mediabench;processor-memory simulator;self-refresh power down mode;Correlation;Embedded systems;Energy consumption;Mathematical model;Program processors;Random access memory;Training
[21] Tuo Li, M. Shafique, J.A. Ambrose, S. Rehman, J. Henkel, and S. Parameswaran. Raster: Runtime adaptive spatial/temporal error resiliency for embedded processors. In Design Automation Conference (DAC), 2013 50th ACM / EDAC / IEEE, pages 1-7, 2013. [ bib ]
Applying error recovery uniformly can either compromise the real-time constraint or worsen the power/energy envelope. Neither of these violations is acceptable in embedded system design, which expects an ultra-efficient realization of a given application. In this paper, we propose a HW/SW methodology that exploits both application-specific characteristics and spatial/temporal redundancy. Our methodology combines design-time and runtime optimizations to enable the resultant embedded processor to perform runtime-adaptive error recovery, precisely targeting the reliability-critical instruction executions. The proposed error recovery functionality can dynamically 1) evaluate the reliability cost economy (in terms of execution time and dynamic power), 2) determine the most profitable scheme, and 3) adapt to the corresponding error recovery scheme, composed of spatial and temporal redundancy based error recovery operations. The experimental results show that our methodology can achieve up to fifty times greater reliability while maintaining the execution time and power deadlines, compared to the state of the art.

Keywords: embedded systems;hardware-software codesign;integrated circuit design;integrated circuit reliability;microprocessor chips;RASTER;corresponding error recovery scheme;design time optimization;dynamic power;embedded processor;execution time;hardware-software methodology;reliability cost economy;runtime adaptive spatial-temporal error resiliency;runtime optimization;Abstracts;Adaptation models;Redundancy;ASIP;Checkpoint Recovery;Redundancy;Runtime Adaptation;Soft Error
[22] Liang Tang, J.A. Ambrose, and S. Parameswaran. Variable increment step based reconfigurable interleaver for multimode communication application. In Circuits and Systems (ISCAS), 2013 IEEE International Symposium on, pages 73-76, 2013. [ bib | DOI ]
The need to integrate multiple wireless communication protocols into a single low-cost, flexible hardware platform is prompted by the increasing number of emerging communication protocols and applications in modern embedded systems. Interleaving, one of the key components of the communication baseband, varies across communication protocols. A novel reconfigurable variable increment step (VIS) based interleaver is proposed for efficient multi-mode communication. The proposed reconfigurable interleaver supports both block and convolutional interleaving, costs little area, consumes less power than some other programmable solutions, and is fast enough to satisfy even the most demanding Software Defined Radio (SDR) applications. The proposed method is flexible (compared to ASICs and some reconfigurable solutions) and suitable for mobile applications in terms of area, power consumption and throughput (compared to FPGAs, processors and some other ASIC-based reconfigurable proposals).

Keywords: embedded systems;mobile radio;protocols;software radio;SDR applications;block interleaving;convolutional interleaving;embedded systems;low-cost flexible hardware platform;mobile applications;multimode communication application;power consumption;reconfigurable VIS based interleaver;software defined radio applications;variable increment step based reconfigurable interleaver;wireless communication protocols;Application specific integrated circuits;Delays;Field programmable gate arrays;Power demand;Protocols;Random access memory;Throughput
[23] J.A. Ambrose, V. Cassisi, D. Murphy, Tuo Li, D. Jayasinghe, and S. Parameswaran. Scalable performance monitoring of application specific multiprocessor systems-on-chip. In Industrial and Information Systems (ICIIS), 2013 8th IEEE International Conference on, pages 315-320, 2013. [ bib | DOI ]
It is widely known that Multiprocessor Systems-on-Chip (MPSoCs) are the driving force behind many embedded devices; state-of-the-art mobile phones and gaming consoles contain more than four processors in their MPSoCs. Performance counters have become a recent trend in these devices, enabling runtime adaptations to match power and performance budgets. In this paper, we propose a scalable performance monitoring unit (referred to as an Event Monitoring Unit, EMU) for application-specific MPSoCs. Our monitors can easily be attached to an application-specific processor as a hardware IP to monitor any given event. The monitors are programmable, allowing customizable event handling and performance modeling for runtime adaptation and verification, and the programmable EMU incurs very little hardware cost.

Keywords: application specific integrated circuits;multiprocessing systems;system-on-chip;MPSoC;application specific multiprocessor systems-on-chip;embedded devices;event monitoring unit;gaming consoles;hardware IP;mobile phones;performance budgets;performance counters;programmable EMU costs;scalable performance monitoring;Clocks;Hardware;Monitoring;Performance evaluation;Pipelines;Program processors;Radiation detectors
[24] H. Javaid, D. Witono, and S. Parameswaran. Multi-mode pipelined mpsocs for streaming applications. In Design Automation Conference (ASP-DAC), 2013 18th Asia and South Pacific, pages 231-236, 2013. [ bib | DOI ]
In this paper, we propose a design flow for the pipelined paradigm of Multiprocessor Systems-on-Chip (MPSoCs) targeting multiple streaming applications. A multi-mode pipelined MPSoC, used as a streaming accelerator, executes multiple, mutually exclusive applications through modes, where each mode refers to the execution of one application. We model each application as a directed graph. The challenge is to merge the application graphs into a single graph so that the multi-mode pipelined MPSoC derived from the merged graph contains minimal resources. We solve this problem by finding the maximal overlap between application graphs. Three heuristics are proposed: two greedily merge application graphs, while the third finds an optimal merging at the cost of higher running time. The results indicate significant area savings (up to 62% in processor area, 57% in FIFO area and 44% in processor/FIFO ports) with minuscule degradation of system throughput (up to 2%) and latency (up to 2%) and a small increase in energy (up to 3%), compared to the widely used approach of designing distinct pipelined MPSoCs for individual applications. Our work is a first step towards multi-mode pipelined MPSoCs, and the results demonstrate the usefulness of resource sharing among pipelined-MPSoC-based streaming accelerators in a multimedia platform.

Keywords: directed graphs;media streaming;multiprocessing systems;pipeline processing;system-on-chip;FIFO area;FIFO port;directed graph;energy value;maximal overlap;multimedia platform;multimode pipelined MPSoC;multiprocessor system on chips;resource sharing;streaming accelerator;streaming application;Benchmark testing;Electromyography;Hardware;Merging;Multimedia communication;Streaming media;Transform coding
[25] Su Myat Min, H. Javaid, A. Ignjatovic, and S. Parameswaran. A case study on exploration of last-level cache for energy reduction in ddr3 dram. In Embedded Computing (MECO), 2013 2nd Mediterranean Conference on, pages 42-45, 2013. [ bib | DOI ]
This paper studies the effects of the last-level cache on DRAM energy consumption. In particular, we explore how different last-level cache configurations affect the idle periods of DRAM, and whether those idle periods can be exploited through the self-refresh power-down mode to enable maximum energy reduction in both the last-level cache and the DRAM. A suitable last-level cache configuration reduces the active power consumption of DRAM by reducing read/write accesses to it, while the self-refresh power-down mode reduces the background power of DRAM, creating the possibility of significant energy reduction. We propose a power mode controller that adaptively transitions the DRAM to self-refresh power-down mode when a memory request hits in the last-level cache, and activates the DRAM when a memory request misses the last-level cache. We experimented with eight applications from mediabench and found that an optimal last-level cache configuration with self-refresh power-down mode can save up to 89% energy compared to a standard memory controller. Additionally, the use of self-refresh power-down mode degraded performance by a maximum of only 2%. Thus, we conclude that exploration and optimization of the last-level cache can yield significant energy savings for the memory subsystem with little performance degradation.

Keywords: DRAM chips;cache storage;power aware computing;DDR3 DRAM energy consumption;DRAM idle periods;active power consumption reduction;background power reduction;energy savings;maximum energy reduction;mediabench;memory request;memory subsystem;optimal last-level cache configuration;performance degradation;power mode controller;read-write access reduction;self-refresh power down mode;Abstracts;Local area networks;RNA;Random access memory;Switches
[26] A. Arora, J.A. Ambrose, J. Peddersen, and S. Parameswaran. A double-width algorithmic balancing to prevent power analysis side channel attacks in aes. In VLSI (ISVLSI), 2013 IEEE Computer Society Annual Symposium on, pages 76-83, 2013. [ bib | DOI ]
The Advanced Encryption Standard (AES) is one of the most widely used cryptographic algorithms in embedded systems, deployed in smart cards, mobile phones and wireless applications. Researchers have devised various techniques to attack the encrypted data or the secret key using side channel information (execution time, power variations, electromagnetic emissions, sound, etc.). The power analysis attack is the most prevalent of all Side Channel Attacks (SCAs), the most popular being Differential Power Analysis (DPA). Balancing of signal transitions is one method by which a countermeasure can be implemented. Existing balancing solutions to counter power analysis attacks are either costly in terms of power and area or highly complex, and hence lack practicality. This paper, for the first time, proposes a double-width, single-core processor algorithmic balancing (earlier methods used two separate cores) to obfuscate power variations, resulting in a DPA-resistant system. The countermeasure involves only code/algorithmic modifications and can therefore be easily deployed in any embedded system with a 16-bit (or wider) processor. A DPA attack is demonstrated on the Double Width Single Core (DWSC) solution; the attack proved unsuccessful in finding the correct secret key. The instruction memory size overhead is only 16.6%, while data memory increases by 15.8%.

Keywords: cryptography;embedded systems;AES;DPA attack;DWSC solution;SCA;advanced encryption standard;cryptographic algorithm;data memory;differential power analysis;double-width algorithmic balancing;double-width single core;embedded system;instruction memory size overhead;power analysis side channel attack;side channel information;signal transition balancing;Algorithm design and analysis;Embedded systems;Encryption;Hardware;Standards
[27] I. Nawinne and S. Parameswaran. A survey on exact cache design space exploration methodologies for application specific soc memory hierarchies. In Industrial and Information Systems (ICIIS), 2013 8th IEEE International Conference on, pages 332-337, 2013. [ bib | DOI ]
Caching is the most widely used solution to improve the memory access speed of a processor. The behaviour of a cache memory is characterized by several parameters, such as the set size, associativity, block size and replacement policy, which compose the configuration of the cache. The cache hits and misses encountered by an application are determined by this configuration. While a cache improves memory performance, it also imposes additional costs in power consumption and chip area, which vary with the configuration. Deciding a suitable set of cache memories for an application-specific embedded system's memory hierarchy is a tedious problem: it involves design space exploration of how different configurations behave for a given application, in order to accurately or exactly determine the number of hits and misses for each configuration. The literature contains several different approaches to perform such explorations efficiently while reducing design time. This paper presents a critical analysis of a representative set of such methods.

Keywords: cache storage;embedded systems;system-on-chip;application specific SoC memory hierarchies;associativity;block size;cache design space exploration methodologies;cache hits;cache memory;cache misses;embedded system;memory access speed improvement;replacement policy;set size;Computational modeling;Correlation;Field programmable gate arrays;Hardware;Memory management;Space exploration;Tuning
[28] Tuo Li, Muhammad Shafique, Semeen Rehman, Swarnalatha Radhakrishnan, Roshan Ragel, Jude Angelo Ambrose, Jorg Henkel, and Sri Parameswaran. Cser: Hw/sw configurable soft-error resiliency for application specific instruction-set processors. In Design, Automation Test in Europe Conference Exhibition (DATE), 2013, pages 707-712, 2013. [ bib | DOI ]
Soft errors have been identified as one of the major challenges to CMOS technology based computing systems. To mitigate this problem, error recovery is a key component, which usually accounts for a substantial cost, since it must introduce redundancy in either time or space. Consequently, using state-of-the-art recovery techniques can severely strain the design constraints, which are already stringent in embedded system design. In this paper, we propose a HW/SW methodology that generates a processor performing finely configured error recovery targeting the given design constraints (e.g., performance, area and power). Our methodology employs three application-specific optimization heuristics, which generate an optimized composition and configuration based on two primitive error recovery techniques. The resultant processor is composed of the selected primitive techniques at the corresponding instruction executions, and is configured to perform error recovery at runtime according to the scheme determined at design time. The experimental results show that our methodology can achieve up to nine times greater reliability while maintaining the given constraints, in comparison to the state of the art.

Keywords: Integrated circuits;Programming;Redundancy;Runtime;Time factors
[29] Mei Hong, Hui Guo, and S. Parameswaran. Dynamic encryption key design and management for memory data encryption in embedded systems. In VLSI (ISVLSI), 2013 IEEE Computer Society Annual Symposium on, pages 70-75, 2013. [ bib | DOI ]
To effectively encrypt the data memory contents of an embedded processor, multiple keys that are dynamically changed are necessary. However, the resources required to store and manage these keys on-chip (so that they remain secure) can be extensive. This paper presents a design where each dynamic key is determined by a random number, a counter value, and a memory address, and is unique to the data in a memory location. The counter value, dedicated to a given memory location, controls the duration of the random number for the key associated with that location. A counter table and a random number table are used for key storage. We reduce on-chip resources by customizing the counter table and allowing a pool of random numbers to be shared amongst the keys; the random numbers are dynamically updated during application execution. We propose a key generation and management scheme such that the random number pool is extremely small (hence low memory consumption) yet sufficient for the uniqueness and randomness of each dynamic key. Experiments on a set of applications show that, on average, large overheads (90% in chip area and 92% in power consumption) can be saved for the same security level, compared to the state-of-the-art approach.

Keywords: cryptography;embedded systems;integrated memory circuits;logic design;microprocessor chips;power consumption;counter table;counter value;data memory;dynamic encryption key design;embedded processor;embedded systems;keys on-chip;memory address;memory data encryption management;on-chip resources;power consumption;random number;Cryptography;Educational institutions;Radiation detectors;Very large scale integration
[30] Tuo Li, R. Ragel, and S. Parameswaran. Reli: Hardware/software checkpoint and recovery scheme for embedded processors. In Design, Automation Test in Europe Conference Exhibition (DATE), 2012, pages 875-880, 2012. [ bib | DOI ]
Checkpoint and Recovery (CR) allows computer systems to operate correctly even when compromised by transient faults. While many software and hardware systems for CR exist, they are usually too large, too slow, or require major modifications to the software or extensive modifications to the caching schemes. In this paper, we propose a novel error-recovery management scheme based upon re-engineering the instruction set. We take the native instruction set of the processor and enhance the microinstructions with additional micro-operations that enable checkpointing. The recovery mechanism is implemented by three custom instructions, which restore the registers, the data memory values, and the special registers (PC, status registers, etc.) that were changed. Our checkpointing storage is sized according to the benchmark executed. Results show that our method degrades performance by just 1.45% under fault-free conditions, and incurs an area overhead of 45% on average and 79% in the worst case. Recovery takes just 62 clock cycles (worst case) in the examples we examined.

Keywords: cache storage;caching schemes;checkpointing;checkpointing storage;Clocks;computer systems;custom instructions;data memory values;embedded processors;embedded systems;error-recovery management scheme;fault diagnosis;fault free conditions;Hardware;hardware-software checkpoint and recovery scheme;instruction set;instruction sets;microinstructions;microoperations;Optimization;Program processors;Registers;Reli;transient faults
[31] J.A. Ambrose, A. Ignjatovic, and S. Parameswaran. CoRaS: A multiprocessor key corruption and random round swapping for power analysis side channel attacks: A DES case study. In 2012 IEEE International Symposium on Circuits and Systems (ISCAS), pages 253-256, 2012. [ bib | DOI ]
Multiprocessor System-on-Chip (MPSoC) is an integral element of state-of-the-art embedded devices, ranging from low-end mobile phones, PDAs and handheld medical devices up to high-end cars, avionics and robotics. Proper and safe functionality of such embedded systems is mandatory to avoid severe consequences, and security is absolutely necessary, with "Cashless Wallets" forecast to be the only means of financial transactions in the near future. Such a scenario places immense onus on security experts: secure transactions using credit cards, mobile phones or any other embedded devices should not reveal any footprint to the adversary. Side Channel Attacks (SCA) are considered among the most effective attacks on these embedded systems because of their effectiveness in extracting the secret information without physically disassembling the device. We propose an MPSoC architecture to prevent power analysis SCA, where a dual-core algorithmic balancing is enforced by corrupting the balanced key and swapping the encryption rounds of a block cipher at random places, a random number of times. A case study using DES cryptography is performed. Our approach, CoRaS, costs only 0.1% in performance and 3.6% in area compared to the state-of-the-art MPSoC solution, while enhancing security and practicality by eliminating its weaknesses.

Keywords: balanced key;block cipher;cashless wallets;CoRaS;credit cards;cryptography;DES cryptography;dual core algorithmic balancing;embedded devices;embedded systems;encryption rounds;mobile phones;MPSoC;multiprocessing systems;multiprocessor key corruption;multiprocessor system-on-chip;power analysis;random round swapping;secure transactions;side channel attacks;Switches;system-on-chip
[32] Tuo Li, J.A. Ambrose, and S. Parameswaran. Fine-grained hardware/software methodology for process migration in MPSoCs. In 2012 IEEE/ACM International Conference on Computer-Aided Design (ICCAD), pages 508-515, 2012. [ bib ]
Process migration (PM) is a method used in Multi-Processor Systems-on-Chip (MPSoCs) to improve reliability, reduce thermal hotspots and balance loads. However, existing PM approaches are limited by coarse granularity (i.e., they can only switch at application or operating system boundaries), and thus respond slowly. Such slow response allows neither fine control over temperature nor the frequent migration that is necessary in certain systems. In this paper, we propose Thor, a fine-grained, reliable PM scheme for embedded MPSoCs that overcomes the limitations of existing PM approaches. Our approach leverages custom instructions to integrate PM functionality into a base processor architecture. We propose three schemes: Thor-BM (migration at basic block boundaries), Thor-BM/CR (migration at basic block boundaries with checkpoint and recovery), and Thor-IM/CR (migration at the instruction level with checkpoint and recovery). Our experiments show that the execution time overhead is less than 2%, while the additional area and power consumption costs are approximately 50% (excluding main memories, which if taken into account would substantially decrease this overhead). The average migration time cost is 289 cycles.

Keywords: application specific instruction set processors;ASIP;base processor architecture;basic block boundary;checkpoint;checkpointing;coarse granularity;Computer architecture;Educational institutions;embedded MPSoC;embedded systems;execution time overhead;fine-grained hardware-software methodology;Ground penetrating radar;hardware-software codesign;Indexes;instruction level;instruction sets;Integrated circuits;load balancing;multiprocessing systems;multiprocessor system on chip;operating system boundary;operating systems (computers);power consumption cost;process migration;recovery;Registers;reliability;resource allocation;system-on-chip;thermal hotspot reduction;Thor-BM/CR;Thor-IM/CR
[33] Liang Tang, J. Peddersen, and S. Parameswaran. A rapid methodology for multi-mode communication circuit generation. In VLSI Design (VLSID), 2012 25th International Conference on, pages 203-208, 2012. [ bib | DOI ]
The need to integrate multiple wireless communication protocols into a single low-cost, low-power hardware platform is prompted by the increasing number of emerging communication protocols and applications. This paper presents an efficient methodology for integrating multiple wireless protocols in an ASIC while minimizing resource occupation. A hierarchical data path merging algorithm is developed to find common shareable components in two different communication circuits. The data path merging approach builds a combined generic circuit with inserted multiplexers (MUXes), which provides the same functionality as each individual circuit. The proposed method is orders of magnitude faster (well over 1000 times for realistic circuits) than the existing data path merging algorithm (with an overhead of 3% additional area) and can switch communication protocols on the fly (i.e., in a single clock cycle), a desirable feature for seemingly simultaneous multi-mode wireless communication. Wireless LAN (WLAN) 802.11a, WLAN 802.11b and Ultra Wide Band (UWB) transmission circuits were merged to prove the efficacy of our proposal.

Keywords: application specific integrated circuits;multiplexing equipment;protocols;ASIC;MUX;UWB transmission circuits;WLAN 802.11a transmission circuits;hierarchical data path merging algorithm;multimode communication circuit generation;multiple wireless communication protocols;multiplexers;ultrawide band transmission circuits;wireless LAN 802.11a transmission circuits;Algorithm design and analysis;Baseband;Merging;Protocols;Switching circuits;Wireless LAN;Wires;Datapath merging;multi-mode;wireless baseband
[34] Tuo Li, R. Ragel, and S. Parameswaran. Reli: Hardware/software checkpoint and recovery scheme for embedded processors. In Design, Automation Test in Europe Conference Exhibition (DATE), 2012, pages 875-880, 2012. [ bib | DOI ]
Checkpoint and Recovery (CR) allows computer systems to operate correctly even when compromised by transient faults. While many software systems and hardware systems for CR do exist, they are usually either too large, require major modifications to the software, too slow, or require extensive modifications to the caching schemes. In this paper, we propose a novel error-recovery management scheme, which is based upon re-engineering the instruction set. We take the native instruction set of the processor and enhance the microinstructions with additional micro-operations which enable checkpointing. The recovery mechanism is implemented by three custom instructions, which recover the registers which were changed, the data memory values which were changed and the special registers (PC, status registers etc.) which were changed. Our checkpointing storage is changed according to the benchmark executed. Results show that our method degrades performance by just 1.45% under fault free conditions, and incurs area overhead of 45% on average and 79% in the worst case. The recovery takes just 62 clock cycles (worst case) in the examples which we examined.

Keywords: cache storage;checkpointing;embedded systems;fault diagnosis;instruction sets;Reli;caching schemes;checkpointing storage;computer systems;custom instructions;data memory values;embedded processors;error-recovery management scheme;fault free conditions;hardware-software checkpoint and recovery scheme;instruction set;microinstructions;microoperations;transient faults;Checkpointing;Clocks;Hardware;Optimization;Program processors;Registers
[35] J.A. Ambrose, A. Ignjatovic, and S. Parameswaran. CoRaS: A multiprocessor key corruption and random round swapping for power analysis side channel attacks: A DES case study. In Circuits and Systems (ISCAS), 2012 IEEE International Symposium on, pages 253-256, 2012. [ bib | DOI ]
Multiprocessor System-on-Chip (MPSoC) is an integral element in state-of-the-art embedded devices, ranging from low-end mobile phones, PDAs and handheld medical devices up to high-end cars, avionics and robotics. Proper and safe functionality of such embedded systems is mandatory to avoid severe consequences, and security is absolutely necessary, with "Cashless Wallets" forecast to be the only means of financial transactions in the near future. Such a scenario places immense onus on security experts: secure transactions using credit cards, mobile phones or any other embedded devices must not reveal any footprint to the adversary. Side Channel Attacks (SCA) are considered among the most effective attacks on these embedded systems because of their effectiveness in realizing the secret information without physically disassembling the device. We propose an MPSoC architecture to prevent power analysis SCA, where dual-core algorithmic balancing is enforced by corrupting the balanced key and swapping the encryption rounds of a block cipher at random places, a random number of times. A case study using DES cryptography is performed. Our approach, CoRaS, costs just 0.1% in performance and 3.6% in area compared to the state-of-the-art MPSoC solution, while enhancing security and practicality by eliminating its weaknesses.

Keywords: cryptography;embedded systems;multiprocessing systems;system-on-chip;CoRaS;DES cryptography;MPSoC;balanced key;block cipher;cashless wallets;credit cards;dual core algorithmic balancing;embedded devices;encryption rounds;mobile phones;multiprocessor key corruption;multiprocessor system-on-chip;power analysis;random round swapping;secure transactions;side channel attacks;Cryptography;Switches
[36] Tuo Li, J.A. Ambrose, and S. Parameswaran. Fine-grained hardware/software methodology for process migration in MPSoCs. In Computer-Aided Design (ICCAD), 2012 IEEE/ACM International Conference on, pages 508-515, 2012. [ bib ]
Process migration (PM) is a method used in Multi-Processor System on Chips (MPSoCs) to improve reliability, reduce thermal hotspots and balance loads. However, existing PM approaches are limited by coarse granularity (i.e., they can only switch at application or operating system boundaries), and thus respond slowly. Such slow response does not allow for fine control over temperature, nor does it allow frequent migration, which is necessary in certain systems. In this paper, we propose Thor, a fine-grained reliable PM scheme for embedded MPSoCs, to overcome the limitations of existing PM approaches. Our approach leverages custom instructions to integrate PM functionality into a base processor architecture. We propose three schemes: Thor-BM (migration at basic block boundaries), Thor-BM/CR (migration at basic block boundaries with checkpoint and recovery), and Thor-IM/CR (migration at instruction level with checkpoint and recovery). Our experiments show that the execution time overhead is less than 2%, while the additional area and power consumption costs are approximately 50% (excluding main memories, which if taken into account would substantially decrease this overhead). The average migration time cost is 289 cycles.

Keywords: checkpointing;embedded systems;hardware-software codesign;instruction sets;multiprocessing systems;operating systems (computers);resource allocation;system-on-chip;ASIP;Thor-BM/CR;ThorIM/CR;application specific instruction set processors;base processor architecture;basic block boundary;checkpoint;coarse granularity;embedded MPSoC;execution time overhead;fine-grained hardware-software methodology;instruction level;load balancing;multiprocessor system on chip;operating system boundary;power consumption cost;process migration;recovery;reliability;thermal hotspot reduction;Computer architecture;Educational institutions;Ground penetrating radar;Indexes;Integrated circuits;Registers;Reliability
[37] M.S. Haque, J. Peddersen, and S. Parameswaran. CIPARSim: Cache intersection property assisted rapid single-pass FIFO cache simulation technique. In 2011 IEEE/ACM International Conference on Computer-Aided Design (ICCAD), pages 126-133, 2011. [ bib | DOI ]
An application's cache miss rate is used in timing analysis, system performance prediction and in deciding the best cache memory for an embedded system to meet tight constraints. Single-pass simulation allows a designer to find the number of cache misses quickly and accurately on various cache memories. Such single-pass simulation systems have previously relied heavily on cache inclusion properties, which allowed rapid simulation of cache configurations for different applications. Thus far the only inclusion properties discovered were applicable to Least Recently Used (LRU) replacement policy based caches. However, LRU based caches are rarely implemented in real life due to their circuit complexity at larger cache associativities. Embedded processors typically use a FIFO replacement policy in their caches instead, for which there are no full inclusion properties to exploit. In this paper, for the first time, we introduce a cache property called the "Intersection Property" that helps to reduce single-pass simulation time in a manner similar to inclusion properties. An intersection property defines conditions that, if met, prove a particular element exists in larger caches, thus avoiding further search time. We discuss three such intersection properties for caches using the FIFO replacement policy in this paper. A rapid single-pass FIFO cache simulator, "CIPARSim", has also been proposed. CIPARSim is the first single-pass simulator dependent on the FIFO cache properties to reduce simulation time significantly. CIPARSim's simulation time was up to 5 times faster (on average 3 times faster) compared to the state-of-the-art single-pass FIFO cache simulator for the cache configurations tested. CIPARSim produces the cache hit and miss rates of an application accurately on various cache configurations. During simulation, CIPARSim's intersection properties alone predict up to 90% (on average 65%) of the total hits, reducing simulation time immensely.

Keywords: application cache miss rate, cache inclusion properties, cache intersection property, cache memories, Cache memory, cache storage, CIPARSim, circuit complexity, circuit simulation, Data models, embedded processors, FIFO replacement policy, first-in-first-out cache, least recently used replacement policy based caches, Magnetic resonance imaging, Predictive models, Proposals, single-pass FIFO cache simulation technique, system performance prediction, Table lookup, timing analysis
[38] J.A. Ambrose, R.G. Ragel, S. Parameswaran, and A. Ignjatovic. Multiprocessor information concealment architecture to prevent power analysis-based side channel attacks. IET Computers & Digital Techniques, 5(1):1-15, 2011. [ bib | DOI ]
Side channel attackers observe external manifestations of internal computations in an embedded system to predict the encryption key employed. The ability to examine such external manifestations (power dissipation or electromagnetic emissions) is a major threat to secure embedded systems. This study proposes a secure multiprocessor architecture to prevent side channel attacks, based on a dual-core algorithmic balancing technique, where two identical cores are used. Both cores use a single clock and encrypt simultaneously, with one core executing the original encryption, whereas the second executes the complementary encryption. This effectively balances the crucial information from the power profile (note that it is the information and not the power profile itself), hiding the actual key from the adversary attempting an attack based on differential power analysis (DPA). The two cores normally execute different tasks, but will encrypt together to foil a side channel attack. The authors show that, when our technique is applied, DPA fails on the most common block ciphers, data encryption standard (DES) and advanced encryption standard (AES), leaving the attacker with little useful information with which to perpetrate an attack.

Keywords: advanced encryption standard, data encryption standard, differential power analysis, dual-core algorithmic balancing technique, electromagnetic emissions, embedded system, embedded systems, encryption key, multiprocessing systems, multiprocessor information concealment architecture, power analysis-based side channel attacks, power dissipation, public key cryptography
[39] K. Patel, S. Parameswaran, and R.G. Ragel. Architectural frameworks for security and reliability of MPSoCs. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 19(9):1641-1654, 2011. [ bib | DOI ]
Multiprocessor system-on-chip (MPSoC) architectures are increasingly used in modern embedded systems. MPSoCs are used for confidential and critical applications and hence need strong security and reliability features. Software attacks exploit vulnerabilities in the software on MPSoCs. In this paper we propose two MPSoC architectural frameworks, tCUFFS and iCUFFS, for an Application Specific Instruction set Processor (ASIP) design. Both tCUFFS and iCUFFS employ a dedicated security processor for detecting software attacks. iCUFFS relies on the exact number of instructions in the basic block to determine an attack, and tCUFFS relies on time-frame based measures. In addition to software attacks, reliability concerns of bit flip errors in the control flow instructions (CFIs) are also addressed. An additional method is proposed for the iCUFFS framework to ensure reliable inter-processor communication. The results for the implementation on the Xtensa processor from Tensilica showed a worst-case runtime penalty of 38% for tCUFFS and 44% for iCUFFS, and a worst-case area overhead of 33% for tCUFFS and 40% for iCUFFS. The existing iCUFFS framework was able to detect approximately 70% of bit flip errors in the CFIs. The modified iCUFFS framework proposed for reliable inter-processor communication was at most 4% slower than the existing iCUFFS framework.

Keywords: Application software, application specific instruction set processor design, Application specific processors, Architecture, bit flip errors, code injection, Communication system control, Computer architecture, control flow instructions, embedded system, embedded systems, Error correction, iCUFFS, instruction count, instruction sets, inter-processor communication, MPSoC reliability, MPSoC security, multiprocessing systems, multiprocessor system-on-chip architectures, multiprocessor system-on-chip (MPSoC), Process design, reliability, Security, security of data, software attack detection, Software measurement, software reliability, system-on-chip, tCUFFS, Tensilica, Xtensa processor
[40] S.M. Min, J. Peddersen, and S. Parameswaran. Realizing cycle accurate processor memory simulation via interface abstraction. In 2011 24th International Conference on VLSI Design (VLSI Design), pages 141-146, 2011. [ bib | DOI ]
SoC designers typically use a processor simulator to generate a memory trace and apply the generated trace to a memory simulator in order to collect the performance statistics of a complete system. This is an inaccurate process for most applications, making it difficult to optimize the processor and memory configurations. In this paper, we study the problems encountered in the typical simulation approach and propose a methodology which utilizes an interface layer component to link the processor simulator and memory simulator seamlessly. The interface layer component presented in this paper can be used as the connector between the processor module and memory module in building an execution-driven approach, which processes run-time memory requests rather than following traditional trace-driven simulation approaches. By applying the proposed interface layer component to link the processor simulator and memory simulator, the estimated performance statistics of the system and the average power consumption of the memory system can be collected with high accuracy. We prove the necessity of our approach by evaluating six benchmarks. Over these benchmarks, there is an 80% variation in the choice of memory latency to achieve the most accurate power consumption and a 16% variation in the choice of memory latency to achieve the most accurate execution time. The increase in accuracy comes at an average increase in simulation time of 13.5%.

Keywords: Benchmark testing, Clocks, cycle accurate processor memory simulation, DRAM chips, embedded systems, Indexes, instruction sets, interface abstraction, interface layer component, low-power electronics, memory latency, Memory management, power consumption, Power demand, processor simulator, Random access memory, simulation, system-on-chip, Timing
[41] S. Parameswaran and S. Selvin. Fish model for underwater robots. In 2011 Annual IEEE India Conference (INDICON), pages 1-4, 2011. [ bib | DOI ]
Underwater robotics is one of the most challenging fields of robotics, primarily due to gravity and the resistance that water offers to motion. Fish are among the best swimmers in nature. Inspired by the fish, we create a design for underwater robots based on the mechanism of fish motion. The advantages and applications of this model are explained. The proposed model was built and tested underwater, and the results are described in this paper.

Keywords: DC motors, fish model, Marine animals, mobile robots, Propulsion, robotic fish, Robot sensing systems, Servomotors, underwater experiment, underwater robots, underwater vehicles
[42] H. Javaid, M. Shafique, J. Henkel, and S. Parameswaran. System-level application-aware dynamic power management in adaptive pipelined MPSoCs for multimedia. In 2011 IEEE/ACM International Conference on Computer-Aided Design (ICCAD), pages 616-623, 2011. [ bib | DOI ]
System-level dynamic power management (DPM) schemes in Multiprocessor System on Chips (MPSoCs) exploit the idleness of processors to reduce energy consumption by putting idle processors into low-power states. In the presence of multiple low-power states, the challenge is to predict the duration of the idle period with high accuracy so that the most beneficial power state can be selected for the idle processor. In this work, we propose a novel dynamic power management scheme for adaptive pipelined MPSoCs, suitable for multimedia applications. We leverage application knowledge in the form of future workload prediction to forecast the duration of idle periods. The predicted duration is then used to select an appropriate power state for the idle processor. We propose five heuristics as part of the DPM and compare their effectiveness using an MPSoC implementation of the H.264 video encoder supporting HD720p at 30 fps. The results show that one of the application-prediction-based heuristics (MAMAPBH) predicted the most beneficial power states for idle processors with less than 3% error when compared to an optimal solution. In terms of energy savings, MAMAPBH was always within 1% of the energy savings of the optimal solution. When compared with a naive approach (where only one of the possible power states is used for all the idle processors), MAMAPBH achieved up to 40% more energy savings with only 0.5% degradation in throughput. These results signify the importance of leveraging application knowledge at system level for dynamic power management schemes.

Keywords: adaptive pipelined MPSoC, Clocks, energy conservation, Energy consumption, energy consumption reduction, energy saving, H.264 video encoder, HD720p, History, low-power electronics, MAMAPBH, Motion estimation, multimedia applications, Multimedia communication, multimedia systems, multiprocessing systems, multiprocessor system-on-chip, pipeline processing, power aware computing, Prediction algorithms, Streaming media, system-level application-aware dynamic power management, system-on-chip, video coding
[43] H. Javaid, M. Shafique, S. Parameswaran, and J. Henkel. Low-power adaptive pipelined MPSoCs for multimedia: An H.264 video encoder case study. In 2011 48th ACM/EDAC/IEEE Design Automation Conference (DAC), pages 1032-1037, 2011. [ bib ]
Pipelined MPSoCs provide a high throughput implementation platform for multimedia applications, with reduced design time and improved flexibility. Typically a pipelined MPSoC is balanced at design-time using worst-case parameters. Where there is a widely varying workload, such designs consume an exorbitant amount of power. In this paper, we propose a novel adaptive pipelined MPSoC architecture that adapts itself to varying workloads. Our architecture consists of Main Processors and Auxiliary Processors with a distributed run-time balancing approach, where each Main Processor, independent of other Main Processors, decides for itself the number of required Auxiliary Processors at run-time depending on its varying workload. The proposed run-time balancing approach is based on off-line statistical information along with workload prediction and run-time monitoring of current and previous workloads' execution times. We exploited the adaptability of our architecture through a case study on an H.264 video encoder supporting HD720p at 30 fps, where clock- and power-gating were used to deactivate idle Auxiliary Processors during low workload periods. The results show that an adaptive pipelined MPSoC provides energy savings of up to 34% and 40% for clock- and power-gating based deactivation of Auxiliary Processors respectively, with a minimum throughput of 29 fps, when compared to a design-time balanced pipelined MPSoC.

Keywords: Adaptive MPSoCs, auxiliary processors, Clocks, distributed run-time balancing approach, Energy consumption, H.264 video encoder, low-power adaptive pipelined MPSoC architecture, Low-Power Design, main processors, Motion estimation, multimedia applications, Multimedia communication, multimedia systems, multiprocessing systems, off-line statistical information, Program processors, run-time monitoring, Streaming media, system-on-chip, Throughput, video coding, workload prediction
[44] A. Arora, S. Parameswaran, R. Ragel, and D. Jayasinghe. A hardware/software countermeasure and a testing framework for cache based side channel attacks. In 2011 IEEE 10th International Conference on Trust, Security and Privacy in Computing and Communications (TrustCom), pages 1005-1014, 2011. [ bib | DOI ]
Cache attacks have been described in the literature for over a decade now. Cache attacks are performed remotely by the use of time differences observed due to cache misses and hits, or by the use of power traces, either by measuring power or by monitoring the bus between the processor and the memory to observe cache activity. In this paper, for the first time, we have implemented a fast trace-driven cache attack, and incorporated this attack into a flexible framework containing extensible processor(s). This simulator is modifiable and incorporates both Tensilica's [9] processor simulator environment and DRAMsim, a DRAM simulator. Thus we are able to make changes to the processor's instruction set and its cache architecture, and add additional hardware units. On this framework we have implemented a hardware/software countermeasure and shown that it is difficult to differentiate the cache misses for differing encryptions. The processor with the countermeasure is 30% more energy efficient, 17% more power efficient and 15% faster when compared to the processor without the countermeasure. The area of the processor with the countermeasure increases by 7.6%.

Keywords: cache architecture, Cache Attacks, cache based side channel attacks, cache storage, cryptography, DRAM chips, DRAM simulator, Electromagnetic radiation, extensible processor, Hardware, hardware/software countermeasurement, power aware computing, power measurement, processor simulator environment, program testing, security of data, side channel attacks, Software, Table lookup, testing framework, Timing
[45] Hong Chinh Doan, H. Javaid, and S. Parameswaran. Multi-ASIP based parallel and scalable implementation of motion estimation kernel for high definition videos. In 2011 9th IEEE Symposium on Embedded Systems for Real-Time Multimedia (ESTIMedia), pages 56-65, 2011. [ bib | DOI ]
Parallel implementations of motion estimation for high definition videos typically exploit various forms of parallelism (GOP-, frame-, slice- and macroblock-level) to deliver real-time throughput. Although parallel implementations deliver real-time throughput, they often suffer from limited flexibility and scalability due to the form of parallelism and architecture used. In this work, we use Group Of MacroBlocks (GOMB) and Intra-MB (IMB) parallelism with a multi-ASIP (Application Specific Instruction set Processor) architecture to provide a flexible and scalable platform for motion estimation of high definition videos. Multiple GOMBs are processed by the ASIPs in parallel (GOMB-level), where each ASIP is equipped with custom instructions to process the pixels of an MB in parallel (IMB-level). The system is flexible and scalable as the number of ASIPs (number of GOMBs) and custom instructions are not fixed, and are determined through design space exploration. We evaluated the multi-ASIP architecture in Tensilica's commercial design environment with a varying number of ASIPs (up to nine), and compared hand-coded and automatically generated custom instructions. The results illustrate that systems with three and seven ASIPs delivered real-time throughput of 30 and 60 fps respectively for the "pedestrian", "rush hour" and "tractor" HD1080p video sequences. In addition, the results indicate that the multi-ASIP platform can be extended to even higher resolutions such as Ultra High Definition (UHD) due to its flexibility and scalability.

Keywords: application specific instruction set processor, ASIP architecture, Computer architecture, GOMB, group of macroblocks, high definition video, high definition videos, instruction sets, Kernel, Motion estimation, motion estimation kernel, multi-ASIP based parallel implementation, multi-ASIP based scalable implementation, Parallel processing, Throughput, video coding, Videos
[46] Yee Jern Chong and S. Parameswaran. Configurable multimode embedded floating-point units for FPGAs. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 19(11):2033-2044, 2011. [ bib | DOI ]
Performance of field-programmable gate arrays (FPGAs) used for floating-point applications is poor due to the complexity of floating-point arithmetic. Implementing floating-point units (FPUs) on FPGAs consumes a large amount of resources. This makes FPGAs less attractive for use in floating-point intensive applications. Therefore, there is a need for embedded FPUs in FPGAs. However, if unutilized, embedded FPUs waste space on the FPGA die. To overcome this issue, we propose a flexible multimode embedded FPU for FPGAs that can be configured to perform a wide range of operations. The floating-point adder and multiplier in our embedded FPU can each be configured to perform one double-precision operation or two single-precision operations in parallel. To increase flexibility further, access to the large integer multiplier, adder and shifters in the FPU is provided. Benchmark circuits were implemented on both a standard Xilinx Virtex-II FPGA and on our FPGA with embedded FPU blocks. The results using our embedded FPUs showed a mean area improvement of 5.5 times and a mean delay improvement of 5.8 times for the double-precision benchmarks, and a mean area improvement of 3.8 times and a mean delay improvement of 4.2 times for the single-precision benchmarks. The embedded FPUs were also shown to provide significant area and delay benefits for fixed-point and integer circuits.

Keywords: adders, Benchmark testing, Computer architecture, configurable multimode embedded floating-point unit, Delay, double-precision benchmark, double-precision operation, Dual-precision, embedded block, field-programmable gate array, field-programmable gate array (FPGA), field programmable gate arrays, fixed-point circuit, flexible multimode embedded FPU, floating-point, floating-point adder, floating point arithmetic, floating-point arithmetic, floating-point intensive application, floating-point multiplier, floating-point unit (FPU), FPGA architecture, Hardware, integer circuit, Routing, single-precision operation, Xilinx Virtex-II FPGA
[47] S.M. Min, J. Peddersen, and S. Parameswaran. Realizing cycle accurate processor memory simulation via interface abstraction. In VLSI Design (VLSI Design), 2011 24th International Conference on, pages 141-146, 2011. [ bib | DOI ]
SoC designers typically use a processor simulator to generate a memory trace and apply the generated trace to a memory simulator in order to collect the performance statistics of a complete system. This is an inaccurate process for most applications, making it difficult to optimize the processor and memory configurations. In this paper, we study the problems encountered in the typical simulation approach and propose a methodology which utilizes an interface layer component to link the processor simulator and memory simulator seamlessly. The interface layer component presented in this paper can be used as the connector between the processor module and memory module in building an execution-driven approach which can be applied to process run-time memory requests rather than the traditional trace driven simulation approaches. By applying the proposed interface layer component to link the processor simulator and memory simulator, the estimated performance statistics of the system and the average power consumption of the memory system can be collected with high accuracy. We prove the necessity of our approach by evaluating six benchmarks. Over these benchmarks, there is an 80% variation in the choice of memory latency to achieve the most accurate power consumption and a 16% variation in the choice of memory latency to achieve the most accurate execution time. The increase in accuracy comes at an average increase in simulation time of 13.5%.

Keywords: DRAM chips;embedded systems;instruction sets;low-power electronics;system-on-chip;cycle accurate processor memory simulation;interface abstraction;interface layer component;processor simulator;Benchmark testing;Clocks;Indexes;Memory management;Power demand;Random access memory;Timing;interface abstraction;memory latency;power consumption;simulation
[48] H. Javaid, M. Shafique, S. Parameswaran, and J. Henkel. Low-power adaptive pipelined mpsocs for multimedia: An h.264 video encoder case study. In Design Automation Conference (DAC), 2011 48th ACM/EDAC/IEEE, pages 1032-1037, 2011. [ bib ]
Pipelined MPSoCs provide a high throughput implementation platform for multimedia applications, with reduced design time and improved flexibility. Typically a pipelined MPSoC is balanced at design-time using worst-case parameters. Where there is a widely varying workload, such designs consume exorbitant amount of power. In this paper, we propose a novel adaptive pipelined MPSoC architecture that adapts itself to varying workloads. Our architecture consists of Main Processors and Auxiliary Processors with a distributed run-time balancing approach, where each Main Processor, independent of other Main Processors, decides for itself the number of required Auxiliary Processors at run-time depending on its varying workload. The proposed run-time balancing approach is based on off-line statistical information along with workload prediction and run-time monitoring of current and previous workloads' execution times. We exploited the adaptability of our architecture through a case study on an H.264 video encoder supporting HD720p at 30 fps, where clock- and power-gating were used to deactivate idle Auxiliary Processors during low workload periods. The results show that an adaptive pipelined MPSoC provides energy savings of up to 34% and 40% for clock- and power-gating based deactivation of Auxiliary Processors respectively with a minimum throughput of 29 fps when compared to a design-time balanced pipelined MPSoC.

Keywords: multimedia systems;multiprocessing systems;system-on-chip;video coding;H.264 video encoder;auxiliary processors;distributed run-time balancing approach;low-power adaptive pipelined MPSoC architecture;main processors;multimedia applications;off-line statistical information;run-time monitoring;workload prediction;Clocks;Energy consumption;Motion estimation;Multimedia communication;Program processors;Streaming media;Throughput;Adaptive MPSoCs;Low-Power Design
[49] J.A. Ambrose, R.G. Ragel, S. Parameswaran, and A. Ignjatovic. Multiprocessor information concealment architecture to prevent power analysis-based side channel attacks. Computers Digital Techniques, IET, 5(1):1-15, 2011. [ bib | DOI ]
Side channel attackers observe external manifestations of internal computations in an embedded system to predict the encryption key employed. The ability to examine such external manifestations (power dissipation or electromagnetic emissions) is a major threat to secure embedded systems. This study proposes a secure multiprocessor architecture to prevent side channel attacks, based on a dual-core algorithmic balancing technique, where two identical cores are used. Both cores use a single clock and encrypt simultaneously, with one core executing the original encryption, whereas the second executes the complementary encryption. This effectively balances the crucial information from the power profile (note that it is the information and not the power profile itself), hiding the actual key from the adversary attempting an attack based on differential power analysis (DPA). The two cores normally execute different tasks, but will encrypt together to foil a side channel attack. The authors show that, when our technique is applied, DPA fails on the most common block ciphers, data encryption standard (DES) and advanced encryption standard (AES), leaving the attacker with little useful information with which to perpetrate an attack.

Keywords: embedded systems;multiprocessing systems;public key cryptography;advanced encryption standard;data encryption standard;differential power analysis;dual-core algorithmic balancing technique;electromagnetic emissions;embedded system;encryption key;multiprocessor information concealment architecture;power analysis-based side channel attacks;power dissipation
[50] M.S. Haque, J. Peddersen, and S. Parameswaran. Ciparsim: Cache intersection property assisted rapid single-pass fifo cache simulation technique. In Computer-Aided Design (ICCAD), 2011 IEEE/ACM International Conference on, pages 126-133, 2011. [ bib | DOI ]
An application's cache miss rate is used in timing analysis, system performance prediction and in deciding the best cache memory for an embedded system to meet tighter constraints. Single-pass simulation allows a designer to find the number of cache misses quickly and accurately on various cache memories. Such single-pass simulation systems have previously relied heavily on cache inclusion properties, which allowed rapid simulation of cache configurations for different applications. Thus far the only inclusion properties discovered were applicable to the Least Recently Used (LRU) replacement policy based caches. However, LRU based caches are rarely implemented in real life due to their circuit complexity at larger cache associativities. Embedded processors typically use a FIFO replacement policy in their caches instead, for which there are no full inclusion properties to exploit. In this paper, for the first time, we introduce a cache property called the .Intersection Property. that helps to reduce single-pass simulation time in a manner similar to inclusion property. An intersection property defines conditions that if met, prove a particular element exists in larger caches, thus avoiding further search time. We have discussed three such intersection properties for caches using the FIFO replacement policy in this paper. A rapid single-pass FIFO cache simulator .CIPARSim. has also been proposed. CIPARSim is the first single-pass simulator dependent on the FIFO cache properties to reduce simulation time significantly. CIPARSim's simulation time was up to 5 times faster (on average 3 times faster) compared to the state of the art single-pass FIFO cache simulator for the cache configurations tested. CIPARSim produces the cache hit and miss rates of an application accurately on various cache configurations. During simulation, CIPARSim's intersection properties alone predict up to 90% (on average 65%) of the total hits, reducing simulat- on time immensely.

Keywords: cache storage;circuit complexity;circuit simulation;CIPARSim;FIFO replacement policy;application cache miss rate;cache inclusion properties;cache intersection property;cache memories;circuit complexity;embedded processors;first-in-first-out cache;least recently used replacement policy based caches;single-pass FIFO cache simulation technique;system performance prediction;timing analysis;Cache memory;Data models;Magnetic resonance imaging;Predictive models;Proposals;Table lookup
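The single-pass motivation above is easiest to see against the naive alternative. The sketch below is not CIPARSim or its intersection properties; it is a baseline FIFO set-associative simulator that needs one full pass of the trace per configuration, which is exactly the per-configuration cost the paper's properties amortize:

```python
from collections import deque

def fifo_cache_misses(trace, sets, ways, block_size):
    """Count misses for one FIFO set-associative configuration.

    A naive simulator needs one full pass like this per configuration;
    a single-pass simulator covers many configurations in one pass.
    """
    cache = [deque() for _ in range(sets)]  # each set holds tags in FIFO order
    misses = 0
    for addr in trace:
        block = addr // block_size
        idx, tag = block % sets, block // sets
        s = cache[idx]
        if tag not in s:
            misses += 1
            if len(s) == ways:
                s.popleft()          # FIFO: evict the oldest, regardless of use
            s.append(tag)
        # on a hit, FIFO leaves the queue untouched (unlike LRU)
    return misses
```

Note the hit path does no bookkeeping at all, which is why FIFO is cheaper to build than LRU at high associativity.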
[51] K. Patel, S. Parameswaran, and R.G. Ragel. Architectural frameworks for security and reliability of mpsocs. Very Large Scale Integration (VLSI) Systems, IEEE Transactions on, 19(9):1641-1654, 2011. [ bib | DOI ]
Multiprocessor system-on-chip (MPSoC) architectures are increasingly used in modern embedded systems. MPSoCs are used for confidential and critical applications and hence need strong security and reliability features. Software attacks exploit vulnerabilities in the software on MPSoCs. In this paper, we propose two MPSoC architectural frameworks, tCUFFS and iCUFFS, for an Application Specific Instruction set Processor (ASIP) design. Both tCUFFS and iCUFFS employ a dedicated security processor for detecting software attacks. iCUFFS relies on the exact number of instructions in a basic block to determine an attack, and tCUFFS relies on time-frame based measures. In addition to software attacks, reliability concerns of bit flip errors in the control flow instructions (CFIs) are also addressed. An additional method is proposed for the iCUFFS framework to ensure reliable inter-processor communication. The results for the implementation on the Xtensa processor from Tensilica showed a worst-case runtime penalty of 38% for tCUFFS and 44% for iCUFFS, and a worst-case area overhead of 33% for tCUFFS and 40% for iCUFFS. The existing iCUFFS framework was able to detect approximately 70% of bit flip errors in the CFIs. The modified iCUFFS framework proposed for reliable inter-processor communication was at most 4% slower than the existing iCUFFS framework.

Keywords: embedded systems;instruction sets;multiprocessing systems;security of data;software reliability;system-on-chip;MPSoC reliability;MPSoC security;Tensilica;Xtensa processor;application specific instruction set processor design;bit flip errors;control flow instructions;embedded systems;iCUFFS;inter-processor communication;multiprocessor system-on-chip architectures;software attack detection;tCUFFS;Application software;Application specific processors;Communication system control;Computer architecture;Embedded system;Error correction;Multiprocessing systems;Process design;Security;Software measurement;Architecture;code injection;instruction count;multiprocessor system-on-chip (MPSoC);reliability;tensilica
[52] A. Arora, S. Parameswaran, R. Ragel, and D. Jayasinghe. A hardware/software countermeasure and a testing framework for cache based side channel attacks. In Trust, Security and Privacy in Computing and Communications (TrustCom), 2011 IEEE 10th International Conference on, pages 1005-1014, 2011. [ bib | DOI ]
Cache attacks have been described in the literature for over a decade now. Cache attacks are performed remotely by the use of time differences observed due to cache misses and hits, or by the use of power traces, either by measuring power or by monitoring the bus between the processor and the memory to observe cache activity. In this paper, for the first time, we have implemented a fast trace driven cache attack and incorporated this attack into a flexible framework containing extensible processor(s). This simulator is modifiable and incorporates Tensilica's [9] processor simulator environment along with DRAMsim, a DRAM simulator. Thus we are able to make changes to the processor's instruction set and its cache architecture, and to add additional hardware units. On this framework we have implemented a hardware/software countermeasure and shown that it is difficult to differentiate the cache misses for differing encryptions. The processor with the countermeasure is 30% more energy efficient, 17% more power efficient and 15% faster when compared to the processor without the countermeasure. The area of the processor with the countermeasure increases by 7.6%.

Keywords: DRAM chips;cache storage;power aware computing;program testing;security of data;DRAM simulator;cache architecture;cache based side channel attacks;extensible processor;hardware/software countermeasurement;power measurement;processor simulator environment;testing framework;Cryptography;Electromagnetic radiation;Hardware;Power measurement;Software;Table lookup;Timing;Cache Attacks;Side Channel Attacks
[53] Hong Chinh Doan, H. Javaid, and S. Parameswaran. Multi-asip based parallel and scalable implementation of motion estimation kernel for high definition videos. In Embedded Systems for Real-Time Multimedia (ESTIMedia), 2011 9th IEEE Symposium on, pages 56-65, 2011. [ bib | DOI ]
Parallel implementations of motion estimation for high definition videos typically exploit various forms of parallelism (GOP-, frame-, slice- and macroblock-level) to deliver real-time throughput. Although parallel implementations deliver real-time throughput, they often suffer from limited flexibility and scalability due to the form of parallelism and architecture used. In this work, we use Group Of MacroBlocks (GOMB) and Intra-MB (IMB) parallelism with a multi-ASIP (Application Specific Instruction set Processor) architecture to provide a flexible and scalable platform for motion estimation of high definition videos. Multiple GOMBs are processed by the ASIPs in parallel (GOMB-level), where each ASIP is equipped with custom instructions to process the pixels of an MB in parallel (IMB-level). The system is flexible and scalable as the number of ASIPs (number of GOMBs) and custom instructions are not fixed, and are determined through design space exploration. We evaluated the multi-ASIP architecture in Tensilica's commercial design environment with varying numbers of ASIPs (up to nine), and compared hand-coded and automatically generated custom instructions. The results illustrate that systems with three and seven ASIPs delivered real-time throughput of 30 and 60 fps respectively for the "pedestrian", "rush hour" and "tractor" HD1080p video sequences. In addition, the results indicate that the multi-ASIP platform can be extended to even higher resolutions such as Ultra High Definition (UHD) due to its flexibility and scalability.

Keywords: high definition video;instruction sets;motion estimation;video coding;ASIP architecture;GOMB;application specific instruction set processor;group of macroblocks;high definition videos;motion estimation kernel;multi-ASIP based parallel implementation;multi-ASIP based scalable implementation;Computer architecture;High definition video;Kernel;Motion estimation;Parallel processing;Throughput;Videos
[54] Yee Jern Chong and S. Parameswaran. Configurable multimode embedded floating-point units for fpgas. Very Large Scale Integration (VLSI) Systems, IEEE Transactions on, 19(11):2033-2044, 2011. [ bib | DOI ]
The performance of field-programmable gate arrays (FPGAs) used for floating-point applications is poor due to the complexity of floating-point arithmetic. Implementing floating-point units (FPUs) on FPGAs consumes a large amount of resources, which makes FPGAs less attractive for use in floating-point intensive applications. Therefore, there is a need for embedded FPUs in FPGAs. However, if unutilized, embedded FPUs waste space on the FPGA die. To overcome this issue, we propose a flexible multimode embedded FPU for FPGAs that can be configured to perform a wide range of operations. The floating-point adder and multiplier in our embedded FPU can each be configured to perform one double-precision operation or two single-precision operations in parallel. To increase flexibility further, access to the large integer multiplier, adder and shifters in the FPU is provided. Benchmark circuits were implemented both on a standard Xilinx Virtex-II FPGA and on our FPGA with embedded FPU blocks. The results using our embedded FPUs showed a mean area improvement of 5.5 times and a mean delay improvement of 5.8 times for the double-precision benchmarks, and a mean area improvement of 3.8 times and a mean delay improvement of 4.2 times for the single-precision benchmarks. The embedded FPUs were also shown to provide significant area and delay benefits for fixed-point and integer circuits.

Keywords: adders;field programmable gate arrays;floating point arithmetic;Xilinx Virtex-II FPGA;configurable multimode embedded floating-point unit;double-precision benchmark;double-precision operation;field-programmable gate array;fixed-point circuit;flexible multimode embedded FPU;floating-point adder;floating-point arithmetic;floating-point intensive application;floating-point multiplier;integer circuit;single-precision operation;Adders;Benchmark testing;Computer architecture;Delay;Field programmable gate arrays;Hardware;Routing;Dual-precision;FPGA architecture;embedded block;field-programmable gate array (FPGA);floating-point;floating-point unit (FPU)
[55] H. Javaid, M. Shafique, J. Henkel, and S. Parameswaran. System-level application-aware dynamic power management in adaptive pipelined mpsocs for multimedia. In Computer-Aided Design (ICCAD), 2011 IEEE/ACM International Conference on, pages 616-623, 2011. [ bib | DOI ]
System-level dynamic power management (DPM) schemes in Multiprocessor System on Chips (MPSoCs) exploit the idleness of processors to reduce energy consumption by putting idle processors into low-power states. In the presence of multiple low-power states, the challenge is to predict the duration of the idle period with high accuracy so that the most beneficial power state can be selected for the idle processor. In this work, we propose a novel dynamic power management scheme for adaptive pipelined MPSoCs, suitable for multimedia applications. We leverage application knowledge in the form of future workload prediction to forecast the duration of idle periods. The predicted duration is then used to select an appropriate power state for the idle processor. We propose five heuristics as part of the DPM and compare their effectiveness using an MPSoC implementation of the H.264 video encoder supporting HD720p at 30 fps. The results show that one of the application prediction based heuristics, MAMAPBH, predicted the most beneficial power states for idle processors with less than 3% error when compared to an optimal solution. In terms of energy savings, MAMAPBH was always within 1% of the energy savings of the optimal solution. When compared with a naive approach (where only one of the possible power states is used for all the idle processors), MAMAPBH achieved up to 40% more energy savings with only 0.5% degradation in throughput. These results signify the importance of leveraging application knowledge at the system level for dynamic power management schemes.

Keywords: energy conservation;low-power electronics;multimedia systems;multiprocessing systems;pipeline processing;power aware computing;system-on-chip;video coding;H.264 video encoder;HD720p;MAMAPBH;adaptive pipelined MPSoC;energy consumption reduction;energy saving;multimedia applications;multiprocessor system-on-chip;system-level application-aware dynamic power management;Clocks;Energy consumption;History;Motion estimation;Multimedia communication;Prediction algorithms;Streaming media
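The selection step described above, matching a predicted idle duration to the most beneficial low-power state, can be sketched as a classic break-even rule: pick the lowest-power state whose wake-up cost is recovered within the predicted idle time. The state table and all its numbers below are hypothetical, not values from the paper:

```python
def pick_power_state(predicted_idle_ns, states):
    """Choose the lowest-power state whose break-even time fits within
    the predicted idle period (a textbook predictive-DPM rule).

    states: list of (name, idle_power_mw, break_even_ns) tuples.
    """
    best = states[0]  # fall back to the lightest state
    for name, power, break_even in states:
        # a deeper state only pays off if the idle period outlasts its
        # break-even time (transition energy amortized by lower power)
        if break_even <= predicted_idle_ns and power < best[1]:
            best = (name, power, break_even)
    return best[0]

# illustrative state table (made-up values, not from the paper)
STATES = [
    ("clock-gated", 40.0, 100),      # cheap to enter, saves little
    ("retention",    5.0, 5_000),
    ("power-gated",  0.5, 50_000),   # deepest state, longest break-even
]
```

With this table, a predicted idle period of 8 microseconds selects "retention", since "power-gated" would not recoup its transition cost in time.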
[56] M.S. Haque, J. Peddersen, A. Janapsatya, and S. Parameswaran. SCUD: A fast single-pass l1 cache simulation approach for embedded processors with round-robin replacement policy. In 2010 47th ACM/IEEE Design Automation Conference (DAC), pages 356-361, 2010. [ bib ]
Embedded systems designers are free to choose the most suitable configuration of L1 cache in modern processor based SoCs. Choosing the appropriate L1 cache configuration necessitates the simulation of long memory access traces to accurately obtain hit/miss rates. The long execution time taken to simulate these traces, particularly the separate simulation required for each configuration, is a major drawback. Researchers have proposed techniques to speed up the simulation of caches with the LRU replacement policy. These techniques are of little use in the majority of embedded processors, as these processors utilize Round-robin policy based caches. In this paper we propose a fast L1 cache simulation approach, called SCUD (Sorted Collection of Unique Data), for caches with the Round-robin policy. SCUD is a single-pass cache simulator that can simulate multiple L1 cache configurations (with varying set sizes and associativities) by reading the application trace once. Utilizing fast binary searches in a novel data structure, SCUD simulates an application trace significantly faster than a widely used single configuration cache simulator (Dinero IV). We show SCUD can simulate a set of cache configurations up to 57 times faster than Dinero IV. SCUD shows an average speedup of 19.34 times over Dinero IV for Mediabench applications, and an average speedup of over 10 times for SPEC CPU2000 applications.

Keywords: Analytical models, Australia, Cache memory, Cache simulation, Data structures, embedded system, Energy consumption, L1 cache, Miss rate, Performance analysis, Permission, Process design, Round robin, simulation
[57] A. Janapsatya, A. Ignjatovic, J. Peddersen, and S. Parameswaran. Dueling CLOCK: Adaptive cache replacement policy based on the CLOCK algorithm. In Design, Automation Test in Europe Conference Exhibition (DATE), 2010, pages 920-925, 2010. [ bib | DOI ]
We consider the problem of on-chip L2 cache management and replacement policies. We propose a new adaptive cache replacement policy, called Dueling CLOCK (DC), that has several advantages over the Least Recently Used (LRU) cache replacement policy. LRU's strength is that it keeps track of the 'recency' information of memory accesses. However, a) LRU has a high overhead cost of moving cache blocks into the most recently used position each time a cache block is accessed; b) LRU does not exploit 'frequency' information of memory accesses; and, c) LRU is prone to cache pollution when a sequence of single-use memory accesses that are larger than the cache size is fetched from memory (i.e., it is non scan resistant). The DC policy was developed to have low overhead cost, to capture 'recency' information in memory accesses, to exploit the 'frequency' pattern of memory accesses and to be scan resistant. In this paper, we propose a hardware implementation of the CLOCK algorithm for use within an on-chip cache controller to ensure low overhead cost. We then present the DC policy, which is an adaptive replacement policy that alternates between the CLOCK algorithm and the scan resistant version of the CLOCK algorithm. We present experimental results showing the MPKI (Misses per thousand instructions) comparison of DC against existing replacement policies, such as LRU. The results for an 8-way 1MB L2 cache show that DC can lower the MPKI of SPEC CPU2000 benchmark by an average of 10.6% when compared to the tree based Pseudo-LRU cache replacement policy.

Keywords: adaptive cache replacement policy, Australia, cache storage, CLOCK algorithm, Clocks, Computer science, Costs, DC generators, Delay, dueling CLOCK, Engineering management, Frequency, Hardware, least recently used cache replacement policy, misses per thousand instructions, on-chip L2 cache management, Pollution, single use memory accesses, SPEC CPU2000 benchmark, storage management chips
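The overhead argument above — a hit only sets a bit, while LRU reorders a recency list — is concrete in the base CLOCK algorithm. The sketch below is our illustration of plain CLOCK for one fully-associative set, not the adaptive Dueling CLOCK policy itself:

```python
class ClockCache:
    """Minimal CLOCK replacement for one fully-associative set.

    A hit merely sets a reference bit (no list reordering); on a miss
    the hand sweeps, clearing bits until it finds an unreferenced victim.
    """
    def __init__(self, ways):
        self.slots = [None] * ways   # cached block tags (None = empty)
        self.ref = [0] * ways        # per-slot reference bits
        self.hand = 0                # the clock hand

    def access(self, tag):
        if tag in self.slots:
            self.ref[self.slots.index(tag)] = 1
            return True              # hit: O(1) state update
        while self.ref[self.hand]:   # give referenced blocks a second chance
            self.ref[self.hand] = 0
            self.hand = (self.hand + 1) % len(self.slots)
        self.slots[self.hand] = tag  # evict/fill at the hand position
        self.ref[self.hand] = 1
        self.hand = (self.hand + 1) % len(self.slots)
        return False                 # miss
```

The Dueling CLOCK policy of the paper adaptively alternates between this algorithm and a scan-resistant variant; this sketch only shows the common substrate.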
[58] H. Javaid, Xin He, A. Ignjatovic, and S. Parameswaran. Optimal synthesis of latency and throughput constrained pipelined MPSoCs targeting streaming applications. In 2010 IEEE/ACM/IFIP International Conference on Hardware/Software Codesign and System Synthesis (CODES+ISSS), pages 75-84, 2010. [ bib ]
A streaming application, characterized by a kernel that can be broken down into independent tasks which can be executed in a pipelined fashion, inherently allows its implementation on a pipeline of Application Specific Instruction set Processors (ASIPs), called a pipelined MPSoC. The latency and throughput requirements of streaming applications put constraints on the design of such a pipelined MPSoC, where each ASIP has a number of available configurations differing by additional instructions, and instruction and data cache sizes. Thus, the design space of a pipelined MPSoC is all the possible combinations of ASIP configurations (design points). In this paper, a methodology is proposed to optimize the area of a pipelined MPSoC under a latency or a throughput constraint. The final design point is a set of ASIP configurations with one configuration for each ASIP. We proposed an Integer Linear Programming (ILP) based solution to the area optimization problem under a latency constraint, and an algorithm for optimization of pipelined MPSoC area under a throughput constraint. The proposed solutions were evaluated using four streaming applications: JPEG encoder; JPEG decoder; MP3 encoder; and H.264 decoder. The time to find the Pareto front of each pipelined MPSoC was less than 4 minutes where design spaces had up to 10^16 design points, illustrating the applicability of our approach.

Keywords: application specific instruction set processors, application specific integrated circuits, Clocks, Design Space Exploration, H.264 decoder, integer linear programming, integer programming, JPEG decoder, JPEG encoder, linear programming, MP3 encoder, multiprocessing systems, optimal latency synthesis, Optimization, Pareto front, Pareto optimisation, pipeline processing, Pipelines, Program processors, streaming applications, Streaming media, system-on-chip, Throughput, throughput constrained pipelined MPSoC, Transform coding
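One way to see why the two constraints call for different machinery: pipeline throughput is set by the slowest stage, so a throughput-only constraint decomposes into independent per-ASIP choices, whereas a latency bound sums over all stages and couples them (hence the ILP). A minimal sketch of the decomposed throughput case, with made-up (area, latency) configuration tuples rather than the paper's algorithm:

```python
def min_area_under_throughput(stages, min_throughput):
    """Pick, per ASIP, the smallest-area configuration whose stage
    latency still meets the throughput target.

    stages: list (one entry per ASIP) of lists of (area, latency)
    configuration tuples. Returns (total_area, chosen_configs) or None.
    """
    budget = 1.0 / min_throughput          # max allowed latency per stage
    chosen = []
    for configs in stages:
        # the choice is local: any stage meeting the budget keeps the
        # pipeline's bottleneck within the throughput constraint
        feasible = [c for c in configs if c[1] <= budget]
        if not feasible:
            return None                    # constraint cannot be met
        chosen.append(min(feasible, key=lambda c: c[0]))
    return sum(a for a, _ in chosen), chosen
```

Under a latency constraint the per-stage budgets interact (a fast, large stage buys slack for a slow, small one), which is exactly the coupling the paper's ILP formulation captures.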
[59] H. Javaid, A. Ignjatovic, and S. Parameswaran. Fidelity metrics for estimation models. In 2010 IEEE/ACM International Conference on Computer-Aided Design (ICCAD), pages 1-8, 2010. [ bib | DOI ]
Estimation models play a vital role in many aspects of day to day life. Extremely complex estimation models are employed in the design space exploration of SoCs, and the efficacy of these estimation models is usually measured by the absolute error of the models compared to known actual results. Such absolute error based metrics can often result in over-designed estimation models, with a number of researchers suggesting that fidelity of an estimation model (correlation between the ordering of the estimated points and the ordering of the actual points) should be examined instead of, or in addition to, the absolute error. In this paper, for the first time, we propose four metrics to measure the fidelity of an estimation model, in particular for use in design space exploration. The first two are based on two well known rank correlation coefficients. The other two are weighted versions of the first two metrics, to give importance to points nearer the Pareto front. The proposed fidelity metrics range from -1 to 1, where a value of 1 reflects a perfect positive correlation while a value of -1 reflects a perfect negative correlation. The proposed fidelity metrics were calculated for a single processor estimation model and a multiprocessor estimation model to observe their behavior, and were compared against the models' absolute error. For the multiprocessor estimation model, even though the worst average and maximum absolute error of 6.40% and 16.61% respectively can be considered reasonable in design automation, the worst fidelity of 0.753 suggests that the multiprocessor estimation model may not be as good a model (compared to an estimation model with same or higher absolute errors but a fidelity of 0.95) as depicted by its absolute accuracy, leading to an over-designed estimation model.

Keywords: absolute error based metrics, Algorithm design and analysis, Correlation, Design Space Exploration, Estimation, estimation models, fidelity metrics, Frequency modulation, integrated circuit design, integrated circuit modelling, Mathematical model, Measurement, multiprocessor estimation model, Pareto front, Pareto optimisation, rank correlation coefficients, Space exploration, system-on-chip
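The unweighted form of such a fidelity metric is simply a rank correlation between the estimated and actual orderings. A plain Kendall-tau-style sketch follows; the paper's four metrics, including the Pareto-weighted variants, go beyond this baseline:

```python
def kendall_fidelity(estimated, actual):
    """Rank-correlation fidelity in [-1, 1]: fraction of concordant
    pairs minus discordant pairs over all point pairs.

    1.0 means the estimator orders every pair of design points the same
    way as the actual values; -1.0 means every pair is inverted.
    """
    n = len(estimated)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            s = (estimated[i] - estimated[j]) * (actual[i] - actual[j])
            if s > 0:
                concordant += 1
            elif s < 0:
                discordant += 1   # the estimator flipped this pair
    return (concordant - discordant) / (n * (n - 1) / 2)
```

An estimator with sizeable absolute error can still score 1.0 here, which is the point of the abstract's argument against absolute-error-only evaluation.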
[60] Xin He, J. Peddersen, and S. Parameswaran. Improved architectures for range encoding in packet classification system. In 2010 9th IEEE International Symposium on Network Computing and Applications (NCA), pages 10-19, 2010. [ bib | DOI ]
Packet classification is an important aspect of modern network systems. Packet classification systems have traditionally been built utilizing Ternary Content Addressable Memory (TCAM) due to the high throughput needed. However, TCAMs are expensive in terms of area and power consumption. An alternative to TCAM based systems using SRAM and a novel rule encoding method has been proposed to match multiple packets simultaneously (using multiple store_compare_units). This alternative achieves similar or better throughput than the traditional TCAM approach while occupying smaller space and consuming less energy. This paper revamps the existing SRAM-based architecture to reduce area and power consumption without threatening throughput. Two methods are shown. The first method allows for range encoding SRAMs to be shared between store_compare_units (SCUs) to lower area and power consumption with minor effect on throughput. The second method discusses a hybrid system allowing rules with prefixes (single rules) and ranges (rules which match a range of addresses, usually translated to many prefixes) to exist in parallel for the same domain. This allows for lower power consumption than utilizing fixed range encoding due to optimization of the ruleset. Results show that this hybrid architecture saves more than 20% of power/field if half of the TCP ports contain ranges (and the other half contains prefixes). The new extensions have been tested in over ten benchmarks (including the SNORT ruleset) to verify the claimed improvements.

Keywords: Australia, Clocks, computer networks, content-addressable storage, Encoding, LOP, Memory management, Packet Classification, packet classification system, packet switching, Power demand, Random access memory, range encoding, rule encoding, SNORT ruleset, SRAM, SRAM chips, store compare units, TCP ports, ternary content addressable memory, Throughput
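The cost the hybrid scheme avoids comes from expanding a range rule into ternary prefixes. The sketch below is a standard greedy range-to-prefix expansion, our illustration of the translation the abstract mentions rather than the paper's encoder:

```python
def range_to_prefixes(lo, hi, bits=16):
    """Expand an inclusive port range into (value, prefix_len) prefixes.

    Greedily peels off the largest aligned power-of-two block starting
    at lo; worst-case ranges expand into up to 2*bits - 2 prefixes,
    which is why pure prefix storage of ranges is expensive.
    """
    prefixes = []
    while lo <= hi:
        # largest block that is aligned at lo and still fits in the range
        size = lo & -lo if lo else 1 << bits
        while size > hi - lo + 1:
            size >>= 1
        prefixes.append((lo, bits - size.bit_length() + 1))
        lo += size
    return prefixes
```

For example, the full port range 0-65535 collapses to a single zero-length prefix, while the small range 1-3 already needs two.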
[61] H. Javaid, A. Janapsatya, M.S. Haque, and S. Parameswaran. Rapid runtime estimation methods for pipelined MPSoCs. In Design, Automation Test in Europe Conference Exhibition (DATE), 2010, pages 363-368, 2010. [ bib | DOI ]
The pipelined Multiprocessor System on Chip (MPSoC) paradigm is well suited to the data flow nature of streaming applications. A pipelined MPSoC is a system where processing elements (PEs) are connected in a pipeline. Each PE is implemented using one of a number of processor configurations (configurations differ by instruction sets and cache sizes) available for that PE. The goal is to select a pipelined MPSoC with a mapping of a processor configuration to every PE. To estimate the run-time of a pipelined MPSoC, designers typically perform cycle-accurate simulation of the whole pipelined system. Since the number of possible pipelined implementations can be in the order of billions, estimation methods are necessary. In this paper, we propose two methods to estimate the runtime of a pipelined MPSoC, minimizing the use of slow cycle-accurate simulations. The first method estimates the runtime of the pipelined MPSoC, by performing cycle accurate simulations of individual processor configurations (rather than the whole pipelined system), and then utilizing an analytical model to estimate the runtime of the pipelined system. In the second method, runtimes of individual processor configurations are estimated using an analytical processor model (which uses cycle-accurate simulations of selected configurations, and an equation based on ISA and cache statistics). These estimated runtimes of individual processor configurations are then used to estimate the total runtime of the pipelined system. By evaluating our approach on three benchmarks, we show that the maximum estimation error is 5.91% and 16.45%, with an average estimation error of 2.28% and 6.30% for the first and second method respectively. The time to simulate all the possible pipelined implementations (design points) using cycle-accurate simulator is in the order of years, as design spaces with at least 10^10 design points are considered in this paper.
However, the time to simulate all processor configurations individually (first method) takes tens of hours, while the time to simulate a subset of processor configurations and estimate their runtimes (second method) is only a few hours. Once these simulations are done, the runtime of each pipelined implementation can be estimated within milliseconds.

Keywords: Analytical models, Application specific processors, average estimation error, cycle-accurate simulation, cycle-accurate simulator, data flow nature, Equations, Estimation error, instruction sets, maximum estimation error, multiprocessing systems, pipelined MPSoC, pipelined multiprocessor system on chip, pipeline processing, Pipelines, processing elements, processor configurations, rapid runtime estimation methods, Runtime, Space exploration, Statistical analysis, streaming applications, system-on-chip
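The combination step of the first method — per-stage cycle counts from individual simulations folded into a whole-pipeline runtime — can be illustrated with a simple steady-state model: fill the pipe once, then the slowest stage paces every remaining iteration. This fill-then-bottleneck equation is our own illustration, not the paper's exact analytical model:

```python
def pipeline_runtime(stage_cycles, iterations):
    """Analytical runtime estimate for a pipeline of PEs.

    stage_cycles: per-iteration cycle count of each PE, measured by
    simulating that PE individually (no whole-system simulation).
    iterations: number of data items flowing through the pipeline.
    """
    bottleneck = max(stage_cycles)        # slowest stage paces the pipe
    fill = sum(stage_cycles)              # first item traverses every stage
    return fill + (iterations - 1) * bottleneck
```

The appeal is that each candidate mapping can be re-evaluated with one `max` and one `sum` in milliseconds, while the per-stage cycle counts are simulated once and reused across the whole design space.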
[62] M.S. Haque, J. Peddersen, A. Janapsatya, and S. Parameswaran. DEW: A fast level 1 cache simulation approach for embedded processors with FIFO replacement policy. In Design, Automation Test in Europe Conference Exhibition (DATE), 2010, pages 496-501, 2010. [ bib | DOI ]
Increasing the speed of cache simulation to obtain hit/miss rates enables performance estimation, cache exploration for embedded systems and energy estimation. Previously, such simulations, particularly exact approaches, have been exclusively for caches which utilize the least recently used (LRU) replacement policy. In this paper, we propose a new, fast and exact cache simulation method for the First In First Out (FIFO) replacement policy. This method, called DEW, is able to simulate multiple level 1 cache configurations (different set sizes, associativities, and block sizes) with FIFO replacement policy. DEW utilizes a binomial tree based representation of cache configurations and a novel searching method to speed up simulation over single cache simulators like Dinero IV. Depending on different cache block sizes and benchmark applications, DEW operates around 8 to 40 times faster than Dinero IV. Dinero IV compares 2.17 to 19.42 times more cache ways than DEW to determine accurate miss rates.

Keywords: Analytical models, Application specific processors, Australia, binomial tree based representation, Cache memory, cache storage, circuit simulation, Computer aided instruction, Costs, DEW, Dinero IV, embedded processor system, embedded system, embedded systems, Energy consumption, energy estimation, fast level 1 cache simulation approach, FIFO replacement policy, first in first out replacement policy, LRU replacement policy, microprocessor chips, Process design, searching method, System performance, tree searching
[63] H. Javaid, A. Ignjatovic, and S. Parameswaran. Rapid design space exploration of application specific heterogeneous pipelined multiprocessor systems. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 29(11):1777-1789, 2010. [ bib | DOI ]
This paper describes a rapid design methodology to create a pipeline of processors to execute streaming applications. The methodology seeks a system with the smallest area while its runtime is within a specified runtime constraint. Initially, a heuristic is used to rapidly explore a large number of processor configurations to find the near Pareto front of the design space, and then an exact integer linear programming (ILP) formulation (EIF) is used to find an optimal solution. A reduced ILP formulation (RIF) or the heuristic is used if the EIF does not find an optimal solution in a given time window. This design methodology was integrated into a commercial design flow and was evaluated on four benchmarks with design spaces containing up to 10^16 design points. For each benchmark, the near Pareto front was found in less than 3 h using the heuristic, while EIF took up to 16 h. The results show that the average area error of the heuristic and RIF was within 2.25% and 1.25% of the optimal design points for all the benchmarks, respectively. The heuristic is faster than RIF, while both the heuristic and RIF are significantly faster than EIF.

Keywords: Algorithms, application specific architectures, application specific heterogeneous pipelined multiprocessor system, Design methodology, Design Space Exploration, integer linear programming, integer linear programming formulation, integer programming, linear programming, multiprocessing systems, multiprocessor system on chip (MPSoC), near Pareto front, Pareto optimisation, pipeline processing, Pipelines, Program processors, rapid design space exploration, reduced ILP formulation, Runtime, runtime constraint, Space exploration, system-on-chip, Transform coding
[64] H. Javaid, A. Janapsatya, M.S. Haque, and S. Parameswaran. Rapid runtime estimation methods for pipelined mpsocs. In Design, Automation Test in Europe Conference Exhibition (DATE), 2010, pages 363-368, 2010. [ bib | DOI ]
The pipelined Multiprocessor System on Chip (MPSoC) paradigm is well suited to the data flow nature of streaming applications. A pipelined MPSoC is a system where processing elements (PEs) are connected in a pipeline. Each PE is implemented using one of a number of processor configurations (configurations differ by instruction sets and cache sizes) available for that PE. The goal is to select a pipelined MPSoC with a mapping of a processor configuration to every PE. To estimate the run-time of a pipelined MPSoC, designers typically perform cycle-accurate simulation of the whole pipelined system. Since the number of possible pipelined implementations can be in the order of billions, estimation methods are necessary. In this paper, we propose two methods to estimate the runtime of a pipelined MPSoC, minimizing the use of slow cycle-accurate simulations. The first method estimates the runtime of the pipelined MPSoC, by performing cycle accurate simulations of individual processor configurations (rather than the whole pipelined system), and then utilizing an analytical model to estimate the runtime of the pipelined system. In the second method, runtimes of individual processor configurations are estimated using an analytical processor model (which uses cycle-accurate simulations of selected configurations, and an equation based on ISA and cache statistics). These estimated runtimes of individual processor configurations are then used to estimate the total runtime of the pipelined system. By evaluating our approach on three benchmarks, we show that the maximum estimation error is 5.91% and 16.45%, with an average estimation error of 2.28% and 6.30% for the first and second method respectively. The time to simulate all the possible pipelined implementations (design points) using cycle-accurate simulator is in the order of years, as design spaces with at least 1010 design points are considered in this paper. 
However, the time to simulate all processor configura- ions individually (first method) takes tens of hours, while the time to simulate a subset of processor configurations and estimate their runtimes (second method) is only a few hours. Once these simulations are done, the runtime of each pipelined implementation can be estimated within milliseconds.

Keywords: multiprocessing systems;pipeline processing;system-on-chip;average estimation error;cycle-accurate simulation;cycle-accurate simulator;data flow nature;maximum estimation error;pipelined MPSoC;pipelined multiprocessor system on chip;processing elements;processor configurations;rapid runtime estimation methods;streaming applications;Analytical models;Application specific processors;Equations;Estimation error;Instruction sets;Multiprocessing systems;Pipelines;Runtime;Space exploration;Statistical analysis
[65] H. Javaid, Xin He, A. Ignjatovic, and S. Parameswaran. Optimal synthesis of latency and throughput constrained pipelined mpsocs targeting streaming applications. In Hardware/Software Codesign and System Synthesis (CODES+ISSS), 2010 IEEE/ACM/IFIP International Conference on, pages 75-84, 2010. [ bib ]
A streaming application, characterized by a kernel that can be broken down into independent tasks which can be executed in a pipelined fashion, inherently allows its implementation on a pipeline of Application Specific Instruction set Processors (ASIPs), called a pipelined MPSoC. The latency and throughput requirements of streaming applications put constraints on the design of such a pipelined MPSoC, where each ASIP has a number of available configurations differing by additional instructions, and instruction and data cache sizes. Thus, the design space of a pipelined MPSoC is all the possible combinations of ASIP configurations (design points). In this paper, a methodology is proposed to optimize the area of a pipelined MPSoC under a latency or a throughput constraint. The final design point is a set of ASIP configurations with one configuration for each ASIP. We proposed an Integer Linear Programming (ILP) based solution to the area optimization problem under a latency constraint, and an algorithm for optimization of pipelined MPSoC area under a throughput constraint. The proposed solutions were evaluated using four streaming applications: JPEG encoder; JPEG decoder; MP3 encoder; and H.264 decoder. The time to find the Pareto front of each pipelined MPSoC was less than 4 minutes where design spaces had up to 10^16 design points, illustrating the applicability of our approach.

Keywords: Pareto optimisation;application specific integrated circuits;integer programming;linear programming;multiprocessing systems;pipeline processing;system-on-chip;H.264 decoder;JPEG decoder;JPEG encoder;MP3 encoder;Pareto front;application specific instruction set processors;integer linear programming;optimal latency synthesis;streaming applications;throughput constrained pipelined MPSoC;Clocks;Optimization;Pipelines;Program processors;Streaming media;Throughput;Transform coding;Design Space Exploration;Integer Linear Programming
[66] M.S. Haque, J. Peddersen, A. Janapsatya, and S. Parameswaran. Scud: A fast single-pass l1 cache simulation approach for embedded processors with round-robin replacement policy. In Design Automation Conference (DAC), 2010 47th ACM/IEEE, pages 356-361, 2010. [ bib ]
Embedded systems designers are free to choose the most suitable configuration of L1 cache in modern processor based SoCs. Choosing the appropriate L1 cache configuration necessitates the simulation of long memory access traces to accurately obtain hit/miss rates. The long execution time taken to simulate these traces, particularly the separate simulation required for each configuration, is a major drawback. Researchers have proposed techniques to speed up the simulation of caches with the LRU replacement policy. These techniques are of little use in the majority of embedded processors, as these processors utilize Round-robin policy based caches. In this paper we propose a fast L1 cache simulation approach, called SCUD (Sorted Collection of Unique Data), for caches with the Round-robin policy. SCUD is a single-pass cache simulator that can simulate multiple L1 cache configurations (with varying set sizes and associativities) by reading the application trace once. Utilizing fast binary searches in a novel data structure, SCUD simulates an application trace significantly faster than a widely used single configuration cache simulator (Dinero IV). We show SCUD can simulate a set of cache configurations up to 57 times faster than Dinero IV. SCUD shows an average speedup of 19.34 times over Dinero IV for Mediabench applications, and an average speedup of over 10 times for SPEC CPU2000 applications.

Keywords: Analytical models;Australia;Cache memory;Data structures;Embedded system;Energy consumption;Performance analysis;Permission;Process design;Round robin;Cache simulation;L1 cache;Miss rate;Round robin;Simulation
[67] M.S. Haque, J. Peddersen, A. Janapsatya, and S. Parameswaran. Dew: A fast level 1 cache simulation approach for embedded processors with fifo replacement policy. In Design, Automation Test in Europe Conference Exhibition (DATE), 2010, pages 496-501, 2010. [ bib | DOI ]
Increasing the speed of cache simulation to obtain hit/miss rates enables performance estimation, cache exploration for embedded systems and energy estimation. Previously, such simulations, particularly exact approaches, have been exclusively for caches which utilize the least recently used (LRU) replacement policy. In this paper, we propose a new, fast and exact cache simulation method for the First In First Out (FIFO) replacement policy. This method, called DEW, is able to simulate multiple level 1 cache configurations (different set sizes, associativities, and block sizes) with FIFO replacement policy. DEW utilizes a binomial tree based representation of cache configurations and a novel searching method to speed up simulation over single cache simulators like Dinero IV. Depending on different cache block sizes and benchmark applications, DEW operates around 8 to 40 times faster than Dinero IV. Dinero IV compares 2.17 to 19.42 times more cache ways than DEW to determine accurate miss rates.

Keywords: cache storage;circuit simulation;embedded systems;microprocessor chips;tree searching;DEW;Dinero IV;FIFO replacement policy;LRU replacement policy;binomial tree based representation;embedded processor system;energy estimation;fast level 1 cache simulation approach;first in first out replacement policy;searching method;Analytical models;Application specific processors;Australia;Cache memory;Computer aided instruction;Costs;Embedded system;Energy consumption;Process design;System performance
[68] Xin He, J. Peddersen, and S. Parameswaran. Improved architectures for range encoding in packet classification system. In Network Computing and Applications (NCA), 2010 9th IEEE International Symposium on, pages 10-19, 2010. [ bib | DOI ]
Packet classification is an important aspect of modern network systems. Packet classification systems have traditionally been built utilizing Ternary Content Addressable Memory (TCAM) due to the high throughput needed. However, TCAMs are expensive in terms of area and power consumption. An alternative to TCAM based systems using SRAM and a novel rule encoding method has been proposed to match multiple packets simultaneously (using multiple store_compare_units). This alternative achieves similar or better throughput than the traditional TCAM approach while occupying smaller space and consuming less energy. This paper revamps the existing SRAM-based architecture to reduce area and power consumption without threatening throughput. Two methods are shown. The first method allows for range encoding SRAMs to be shared between store_compare_units (SCUs) to lower area and power consumption with minor effect on throughput. The second method discusses a hybrid system allowing rules with prefixes (single rules) and ranges (rules which match a range of addresses, usually translated to many prefixes) to exist in parallel for the same domain. This allows for lower power consumption than utilizing fixed range encoding due to optimization of the ruleset. Results show that this hybrid architecture saves more than 20% of power/field if half of the TCP ports contain ranges (and the other half contains prefixes). The new extensions have been tested in over ten benchmarks (including the SNORT ruleset) to verify the claimed improvements.

Keywords: SRAM chips;computer networks;content-addressable storage;packet switching;SNORT ruleset;SRAM;TCP ports;packet classification system;range encoding;rule encoding;store compare units;ternary content addressable memory;Australia;Clocks;Encoding;Memory management;Power demand;Random access memory;Throughput;LOP;Packet Classification;Range Encoding
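As background for why range rules are costly in TCAM-based classifiers (the motivation for range encoding in the entry above), a range over a field must normally be expanded into multiple ternary prefixes. The helper below is an illustrative sketch of that classic expansion, not code from the paper:

```python
def range_to_prefixes(lo, hi, bits=16):
    """Expand an inclusive port range [lo, hi] into (value, prefix_len)
    pairs, the classic reason ranges inflate TCAM rule counts.

    Illustrative helper (hypothetical); range-encoding schemes such as
    the paper's exist precisely to avoid this expansion.
    """
    prefixes = []
    while lo <= hi:
        # Largest power-of-two block aligned at lo...
        size = lo & -lo if lo else 1 << bits
        # ...shrunk until it fits inside the remaining range.
        while size > hi - lo + 1:
            size >>= 1
        plen = bits - size.bit_length() + 1
        prefixes.append((lo, plen))
        lo += size
    return prefixes

# The full 16-bit range collapses to a single "match anything" prefix,
# while a range such as [1, 14] needs six prefixes.
```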
[69] A. Janapsatya, A. Ignjatovic, J. Peddersen, and S. Parameswaran. Dueling clock: Adaptive cache replacement policy based on the clock algorithm. In Design, Automation Test in Europe Conference Exhibition (DATE), 2010, pages 920-925, 2010. [ bib | DOI ]
We consider the problem of on-chip L2 cache management and replacement policies. We propose a new adaptive cache replacement policy, called Dueling CLOCK (DC), that has several advantages over the Least Recently Used (LRU) cache replacement policy. LRU's strength is that it keeps track of the 'recency' information of memory accesses. However, a) LRU has a high overhead cost of moving cache blocks into the most recently used position each time a cache block is accessed; b) LRU does not exploit 'frequency' information of memory accesses; and, c) LRU is prone to cache pollution when a sequence of single-use memory accesses that is larger than the cache size is fetched from memory (i.e., it is not scan-resistant). The DC policy was developed to have low overhead cost, to capture 'recency' information in memory accesses, to exploit the 'frequency' pattern of memory accesses and to be scan-resistant. In this paper, we propose a hardware implementation of the CLOCK algorithm for use within an on-chip cache controller to ensure low overhead cost. We then present the DC policy, which is an adaptive replacement policy that alternates between the CLOCK algorithm and the scan-resistant version of the CLOCK algorithm. We present experimental results showing the MPKI (misses per thousand instructions) comparison of DC against existing replacement policies, such as LRU. The results for an 8-way 1MB L2 cache show that DC can lower the MPKI of the SPEC CPU2000 benchmark by an average of 10.6% when compared to the tree based Pseudo-LRU cache replacement policy.

Keywords: cache storage;storage management chips;CLOCK algorithm;SPEC CPU2000 benchmark;adaptive cache replacement policy;dueling CLOCK;least recently used cache replacement policy;misses per thousand instructions;on-chip L2 cache management;single use memory accesses;Australia;Clocks;Computer science;Costs;DC generators;Delay;Engineering management;Frequency;Hardware;Pollution
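The classic CLOCK algorithm on which the DC policy builds can be sketched briefly; unlike LRU, a hit only sets a reference bit, with no reordering. The class below is an illustrative minimal version (DC's scan-resistant variant and its set-dueling mechanism are not reproduced):

```python
class ClockCache:
    """Minimal sketch of the classic CLOCK replacement algorithm
    (illustrative only; not the paper's hardware implementation)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.slots = []          # entries are [block, ref_bit]
        self.hand = 0            # the "clock hand" position

    def access(self, block):
        """Return True on a hit, False on a miss (with fill/replace)."""
        for entry in self.slots:
            if entry[0] == block:
                entry[1] = 1     # hit: set reference bit, no reordering
                return True
        if len(self.slots) < self.capacity:
            self.slots.append([block, 0])
            return False
        # Miss with a full cache: advance the hand, clearing reference
        # bits, until a block with ref_bit == 0 is found and replaced.
        while self.slots[self.hand][1] == 1:
            self.slots[self.hand][1] = 0
            self.hand = (self.hand + 1) % self.capacity
        self.slots[self.hand] = [block, 0]
        self.hand = (self.hand + 1) % self.capacity
        return False
```

The low overhead cited in the abstract comes from this structure: a hit touches only one bit, whereas LRU must update an ordering on every access.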
[70] H. Javaid, A. Ignjatovic, and S. Parameswaran. Rapid design space exploration of application specific heterogeneous pipelined multiprocessor systems. Computer-Aided Design of Integrated Circuits and Systems, IEEE Transactions on, 29(11):1777-1789, 2010. [ bib | DOI ]
This paper describes a rapid design methodology to create a pipeline of processors to execute streaming applications. The methodology seeks a system with the smallest area while its runtime is within a specified runtime constraint. Initially, a heuristic is used to rapidly explore a large number of processor configurations to find the near Pareto front of the design space, and then an exact integer linear programming (ILP) formulation (EIF) is used to find an optimal solution. A reduced ILP formulation (RIF) or the heuristic is used if the EIF does not find an optimal solution in a given time window. This design methodology was integrated into a commercial design flow and was evaluated on four benchmarks with design spaces containing up to 10^16 design points. For each benchmark, the near Pareto front was found in less than 3 h using the heuristic, while EIF took up to 16 h. The results show that the average area error of the heuristic and RIF was within 2.25% and 1.25% of the optimal design points for all the benchmarks, respectively. The heuristic is faster than RIF, while both the heuristic and RIF are significantly faster than EIF.

Keywords: Pareto optimisation;integer programming;linear programming;multiprocessing systems;pipeline processing;system-on-chip;application specific heterogeneous pipelined multiprocessor system;integer linear programming formulation;near Pareto front;rapid design space exploration;reduced ILP formulation;runtime constraint;system-on-chip;Design methodology;Multiprocessing systems;Pipelines;Program processors;Runtime;Space exploration;Transform coding;Algorithms;application specific architectures;design space exploration;integer linear programming;multiprocessor system on chip (MPSoC)
[71] H. Javaid, A. Ignjatovic, and S. Parameswaran. Fidelity metrics for estimation models. In Computer-Aided Design (ICCAD), 2010 IEEE/ACM International Conference on, pages 1-8, 2010. [ bib | DOI ]
Estimation models play a vital role in many aspects of day-to-day life. Extremely complex estimation models are employed in the design space exploration of SoCs, and the efficacy of these estimation models is usually measured by the absolute error of the models compared to known actual results. Such absolute error based metrics can often result in over-designed estimation models, with a number of researchers suggesting that fidelity of an estimation model (correlation between the ordering of the estimated points and the ordering of the actual points) should be examined instead of, or in addition to, the absolute error. In this paper, for the first time, we propose four metrics to measure the fidelity of an estimation model, in particular for use in design space exploration. The first two are based on two well known rank correlation coefficients. The other two are weighted versions of the first two metrics, to give importance to points nearer the Pareto front. The proposed fidelity metrics range from -1 to 1, where a value of 1 reflects a perfect positive correlation while a value of -1 reflects a perfect negative correlation. The proposed fidelity metrics were calculated for a single processor estimation model and a multiprocessor estimation model to observe their behavior, and were compared against the models' absolute error. For the multiprocessor estimation model, even though the worst average and maximum absolute errors of 6.40% and 16.61% respectively can be considered reasonable in design automation, the worst fidelity of 0.753 suggests that the model may not be as good as its absolute accuracy depicts (compared to an estimation model with the same or higher absolute error but a fidelity of 0.95), leading to an over-designed estimation model.

Keywords: Pareto optimisation;integrated circuit design;integrated circuit modelling;system-on-chip;Pareto front;absolute error based metrics;design space exploration;estimation models;fidelity metrics;multiprocessor estimation model;rank correlation coefficients;system-on-chip;Algorithm design and analysis;Correlation;Estimation;Frequency modulation;Mathematical model;Measurement;Space exploration
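The first two metrics are based on well-known rank correlation coefficients; plain Kendall's tau, sketched below, illustrates the underlying idea (the paper's four metrics, including the Pareto-weighted variants, are not reproduced here):

```python
def kendall_tau(est, actual):
    """Rank-correlation fidelity sketch (plain Kendall's tau).

    Counts concordant vs discordant pairs: +1 means the estimates
    preserve the ordering of the actual values perfectly, -1 means
    they reverse it. Ties count as neither.
    """
    n = len(est)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            s = (est[i] - est[j]) * (actual[i] - actual[j])
            if s > 0:
                concordant += 1
            elif s < 0:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)
```

For design space exploration this ordering view matters because a model that ranks design points correctly finds the same Pareto front as the true costs, even if its absolute errors are large.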
[72] A. Janapsatya, S. Parameswaran, and A. Ignjatovic. HitME: Low power hit MEmory buffer for embedded systems. In Design Automation Conference, 2009. ASP-DAC 2009. Asia and South Pacific, pages 335-340, 2009. [ bib | DOI ]
In this paper, we present a novel HitME (Hit-MEmory) buffer to reduce the energy consumption of the memory hierarchy in embedded processors. The HitME buffer is a small direct-mapped cache memory that is added as additional memory into existing cache memory hierarchies. The HitME buffer is loaded only when there is a hit on the L1 cache; otherwise, the L1 cache is updated from memory and the processor's memory request is served directly from the L1 cache. The strategy works because around 90% of memory locations are accessed only once, and these single-use accesses often pollute the cache. Energy reduction is achieved by reducing the number of accesses to the L1 cache memory. Experimental results show that the HitME buffer reduces L1 cache accesses, resulting in a reduction in the energy consumption of the memory hierarchy. This decrease in L1 cache accesses reduces the cache system energy consumption by an average of 60.9% when compared to a traditional L1 cache memory architecture, and by 6.4% when compared to a filter cache architecture, for 70 nm cache technology.

Keywords: Australia, Cache memory, cache storage, Computer science, direct-mapped cache memory, embedded system, embedded systems, Energy consumption, Filters, HitME, integrated memory circuits, low power hit memory buffer, memory hierarchy, Memory management, Microprocessors, Multiplexing, Pollution
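The fill policy described in the abstract can be sketched in a few lines; the structures are simplified to Python sets and the interface is hypothetical, purely to show why single-use accesses never enter the HitME buffer:

```python
def memory_access(addr, hitme, l1):
    """Sketch of the HitME access policy (illustrative, simplified).

    The small HitME buffer is checked first. It is filled only on an
    L1 hit, i.e. on the *second* touch of an address, so the ~90% of
    locations accessed only once never enter it and cannot pollute it.
    Returns which level served the request.
    """
    if addr in hitme:
        return 'hitme'            # cheapest access: small buffer hit
    if addr in l1:
        hitme.add(addr)           # second touch: promote into HitME
        return 'l1'
    l1.add(addr)                  # first touch: fill L1 only
    return 'memory'
```

Contrast with a filter cache, which sits in front of L1 and is filled on every access, so single-use data still displaces useful entries.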
[73] Xin He, J. Peddersen, and S. Parameswaran. LOP_re: Range encoding for low power packet classification. In IEEE 34th Conference on Local Computer Networks, 2009. LCN 2009, pages 137-144, 2009. [ bib | DOI ]
State-of-the-art hardware based techniques achieve high performance and maximize efficiency of packet classification applications. The predominant example of these, ternary content addressable memory (TCAM) based packet classification systems can achieve much higher throughput than software-based techniques. However, they suffer from high power consumption due to the highly parallel architecture and lack high-throughput range encoding techniques. In this paper, we propose a novel SRAM-based packet classification architecture with packet-side search key range encoding units, significantly reducing energy consumption without reducing the throughput from that of TCAM and additionally allowing range matching at wire speed. LOP_RE is a flexible packet classification system which can be customized to the requirement of the application. Ten different benchmarks were tested, with results showing that LOP_RE architectures provide high lookup rates and throughput, and consume low power and energy. Compared with a TCAM-based packet classification system (without range encoding) implemented in 65nm CMOS technology, LOP_RE can save 65% energy consumption for the same rule set over these benchmarks.

Keywords: 65nm CMOS technology, Application software, Associative memory, Benchmark testing, CMOS technology, content-addressable storage, Encoding, Energy consumption, energy consumption reduction, Hardware, high-throughput range encoding techniques, LOP_RE architectures, low power packet classification, Packet Classification, Packet-Side Key Encoding, packet-side search key range encoding units, parallel architecture, parallel architectures, pattern classification, range encoding, SRAM-based packet classification architecture, SRAM chips, telecommunication networks, ternary content addressable memory, Throughput, Wire
[74] H. Javaid and S. Parameswaran. A design flow for application specific heterogeneous pipelined multiprocessor systems. In 46th ACM/IEEE Design Automation Conference, 2009. DAC '09, pages 250-253, 2009. [ bib ]
This paper describes a rapid design methodology to create a pipeline of processors to execute streaming applications. The methodology is in two separate phases: the first phase uses a heuristic to rapidly search through a large number of processor configurations (configurations differ by the base processor, the additional instructions and cache sizes) to find the near Pareto front; the second phase utilizes either the above heuristic or an ILP (integer linear programming) formulation to search a smaller design space to find an appropriate final implementation. By utilizing the fast heuristic with differing runtime constraints in the first phase, we rapidly find the near Pareto front. The second phase provides either an optimal or a near optimal solution. Both the ILP formulation and the heuristic find a system with the smallest area, within a designer specified runtime constraint. The system has efficiently explored design spaces with over 10^12 design points. We integrated this design methodology into a commercial design flow and evaluated our approach with different benchmarks (JPEG encoder, JPEG decoder and MP3 encoder). For each benchmark, the near Pareto front was found in a few hours using the heuristic (the ILP took several days). The results show that the average area error of the heuristic is within 2.5% of the optimal design points (obtained using ILP) for all benchmarks.

Keywords: Application software, application specific heterogeneous pipelined multiprocessor systems, Application specific processors, Design methodology, Design Space Exploration, integer linear programming, integer programming, Java, JPEG decoder, JPEG encoder, linear programming, MP3 encoder, MPSoCs, multiprocessing systems, Pareto front, Pareto optimisation, pipeline processing, processor configurations, Runtime, Scheduling, Space exploration
[75] K. Patel, S. Parameswaran, and R.G. Ragel. CUFFS: An instruction count based architectural framework for security of MPSoCs. In Design, Automation Test in Europe Conference Exhibition, 2009. DATE '09., pages 779-784, 2009. [ bib | DOI ]
Multiprocessor system on chip (MPSoC) architecture is rapidly gaining momentum for modern embedded devices. The vulnerabilities in software on MPSoCs are often exploited to cause software attacks, which are the most common type of attacks on embedded systems. Therefore, we propose an MPSoC architectural framework, CUFFS, for an application specific instruction set processor (ASIP) design that has a dedicated security processor called iGuard for detecting software attacks.

Keywords: Application software, application specific instruction set processor design, Australia, Buffer overflow, circuit analysis computing, Computer science, Computer security, Data security, embedded devices, Embedded software, embedded system, embedded systems, Hardware, instruction count based architectural framework, instruction sets, logic design, MPSoC security, multiprocessor system on chip architecture, Runtime, security of data, software attack detection, system-on-chip
[76] J. Henkel, V. Narayanan, S. Parameswaran, and R. Ragel. Security and dependability of embedded systems: A computer architects' perspective. In 2009 22nd International Conference on VLSI Design, pages 30-32, 2009. [ bib | DOI ]
Designers of embedded systems have traditionally optimized circuits for speed, size, power and time to market. Recently however, the dependability of the system is emerging as a great concern to the modern designer with the decrease in feature size and the increase in the demand for functionality. Yet another crucial concern is the security of systems used for storage of personal details and for financial transactions. A significant number of techniques that are used to overcome security and dependability are the same or have similar origins. Thus this tutorial will examine the overlapping concerns of security and dependability and the design methods used to overcome the problems and threats. This tutorial is divided into four parts: the first will examine dependability issues due to technology effects; the second will look at reliability aware designs; the third, will describe the security threats; and, the fourth part will illustrate the countermeasures to security and reliability issues.

Keywords: Circuits, Computer architecture, Computer security, dependability, Embedded computing, embedded system, embedded systems, financial transactions, Hardware, Information security, personal details storage, Power system security, reliability, reliability aware designs, Robustness, Runtime, security of data, security threats, systems security
[77] S. Radhakrishnan, H. Guo, S. Parameswaran, and A. Ignjatovic. HMP-ASIPs: heterogeneous multi-pipeline application-specific instruction-set processors. IET Computers Digital Techniques, 3(1):94-108, 2009. [ bib | DOI ]
A heterogeneous multi-pipeline architecture to enable high-performance application-specific instruction-set processor (ASIP) design is proposed. Each pipeline in this architecture is extensively customised. The program instruction-level parallelism is statically explored during compilation. Techniques such as forwarding network reduction, instruction encoding customisation and pipeline structure/instruction-set tailoring are all used to achieve a high performance/area ratio, low power consumption and small code size. The simulations and experiments on a group of benchmarks show that when the multi-pipeline ASIP is employed, an average of 83% performance improvement can be achieved when compared with a single pipeline ASIP, with overheads of 31%, 33% and 86% for area, leakage power and code size, respectively.

Keywords: forwarding network reduction, heterogeneous multipipeline architecture, HMP-ASIP, instruction encoding customisation, instruction sets, pipeline processing, pipeline structure-instruction-set tailoring, specific instruction-set processors
[78] Yee Jern Chong and S. Parameswaran. Custom floating-point unit generation for embedded systems. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 28(5):638-650, 2009. [ bib | DOI ]
While application-specific instruction-set processors (ASIPs) have allowed designers to create processors with custom instructions to target specific applications, floating-point (FP) units (FPUs) are still instantiated as noncustomizable general-purpose units, which, if underutilized, wastes area and performance. Therefore, there is a need for custom FPUs for embedded systems. To create a custom FPU, the subset of FP instructions that should be implemented in hardware has to be determined. Implementing more instructions in hardware reduces the cycle count of the application but may lead to increased latency if the critical delay of the FPU increases. Therefore, a balance between the hardware-implemented and the software-emulated instructions, which produces the best performance, must be found. In order to find this balance, a rapid design space exploration was performed to explore the tradeoffs between the area and the performance. In order to reduce the area of the custom FPU, it is desirable to merge the datapaths for each of the FP operations so that redundant hardware is minimized. However, FP datapaths are complex and contain components with varying bit widths; hence, sharing components of different bit widths is necessary. This introduces the problem of bit alignment, which involves determining how smaller resources should be aligned within larger resources when merged. A novel algorithm for solving the bit-alignment problem during datapath merging was developed. Our results show that adding more FP hardware does not necessarily equate to lower runtime if the delays associated with the additional hardware overcomes the cycle count reductions. We found that, with the Mediabench applications, datapath merging with bit alignment reduced area by an average of 22.5%, compared with an average of 14.1% without bit alignment. 
With the Standard Performance Evaluation Corporation (SPEC) CPU2000 FP (CFP2000) applications, datapath merging with bit alignment reduced area by an average of 7.6%, compared with an average of 3.9% without bit alignment. The less pronounced improvement with the SPEC CFP2000 benchmarks occurs because the SPEC CFP2000 applications predominantly use double-precision operations only. Therefore, there are fewer resources with different bit widths, which benefit less from bit alignment.

Keywords: application-specific instruction-set processors, application specific integrated circuits, Bit alignment, bit-alignment problem, CPU2000 FP, custom floating-point unit generation, custom instructions, datapath merging, embedded systems, floating point arithmetic, floating-point (FP) arithmetic, hardware-implemented instructions, merging, microprocessor chips, resource sharing, software-emulated instructions, Standard Performance Evaluation Corporation
[79] H. Javaid and S. Parameswaran. A design flow for application specific heterogeneous pipelined multiprocessor systems. In Design Automation Conference, 2009. DAC '09. 46th ACM/IEEE, pages 250-253, 2009. [ bib ]
This paper describes a rapid design methodology to create a pipeline of processors to execute streaming applications. The methodology is in two separate phases: the first phase uses a heuristic to rapidly search through a large number of processor configurations (configurations differ by the base processor, the additional instructions and cache sizes) to find the near Pareto front; the second phase utilizes either the above heuristic or an ILP (integer linear programming) formulation to search a smaller design space to find an appropriate final implementation. By utilizing the fast heuristic with differing runtime constraints in the first phase, we rapidly find the near Pareto front. The second phase provides either an optimal or a near optimal solution. Both the ILP formulation and the heuristic find a system with the smallest area, within a designer specified runtime constraint. The system has efficiently explored design spaces with over 10^12 design points. We integrated this design methodology into a commercial design flow and evaluated our approach with different benchmarks (JPEG encoder, JPEG decoder and MP3 encoder). For each benchmark, the near Pareto front was found in a few hours using the heuristic (the ILP took several days). The results show that the average area error of the heuristic is within 2.5% of the optimal design points (obtained using ILP) for all benchmarks.

Keywords: Pareto optimisation;integer programming;linear programming;multiprocessing systems;pipeline processing;JPEG decoder;JPEG encoder;MP3 encoder;Pareto front;application specific heterogeneous pipelined multiprocessor systems;integer linear programming;processor configurations;Application software;Application specific processors;Design methodology;Integer linear programming;Java;Multiprocessing systems;Pipeline processing;Runtime;Scheduling;Space exploration;Design Space Exploration;Integer Linear Programming;MPSoCs
[80] Yee Jern Chong and S. Parameswaran. Custom floating-point unit generation for embedded systems. Computer-Aided Design of Integrated Circuits and Systems, IEEE Transactions on, 28(5):638-650, 2009. [ bib | DOI ]
While application-specific instruction-set processors (ASIPs) have allowed designers to create processors with custom instructions to target specific applications, floating-point (FP) units (FPUs) are still instantiated as noncustomizable general-purpose units, which, if underutilized, wastes area and performance. Therefore, there is a need for custom FPUs for embedded systems. To create a custom FPU, the subset of FP instructions that should be implemented in hardware has to be determined. Implementing more instructions in hardware reduces the cycle count of the application but may lead to increased latency if the critical delay of the FPU increases. Therefore, a balance between the hardware-implemented and the software-emulated instructions, which produces the best performance, must be found. In order to find this balance, a rapid design space exploration was performed to explore the tradeoffs between the area and the performance. In order to reduce the area of the custom FPU, it is desirable to merge the datapaths for each of the FP operations so that redundant hardware is minimized. However, FP datapaths are complex and contain components with varying bit widths; hence, sharing components of different bit widths is necessary. This introduces the problem of bit alignment, which involves determining how smaller resources should be aligned within larger resources when merged. A novel algorithm for solving the bit-alignment problem during datapath merging was developed. Our results show that adding more FP hardware does not necessarily equate to lower runtime if the delays associated with the additional hardware overcomes the cycle count reductions. We found that, with the Mediabench applications, datapath merging with bit alignment reduced area by an average of 22.5%, compared with an average of 14.1% without bit alignment. 
With the Standard Performance Evaluation Corporation (SPEC) CPU2000 FP (CFP2000) applications, datapath merging with bit alignment reduced area by an average of 7.6%, compared with an average of 3.9% without bit alignment. The less pronounced improvement with the SPEC CFP2000 benchmarks occurs because the SPEC CFP2000 applications predominantly use double-precision operations only. Therefore, there are fewer resources with different bit widths, which benefit less from bit alignment.

Keywords: application specific integrated circuits;embedded systems;floating point arithmetic;microprocessor chips;CPU2000 FP;Standard Performance Evaluation Corporation;application-specific instruction-set processors;bit-alignment problem;custom floating-point unit generation;custom instructions;datapath merging;embedded systems;hardware-implemented instructions;software-emulated instructions;Bit alignment;floating-point (FP) arithmetic;merging;resource sharing
[81] S. Radhakrishnan, H. Guo, S. Parameswaran, and A. Ignjatovic. Hmp-asips: heterogeneous multi-pipeline application-specific instruction-set processors. Computers & Digital Techniques, IET, 3(1):94-108, 2009. [ bib | DOI ]
A heterogeneous multi-pipeline architecture to enable high-performance application-specific instruction-set processor (ASIP) design is proposed. Each pipeline in this architecture is extensively customised. Instruction-level parallelism in the program is statically explored during compilation. Techniques such as forwarding network reduction, instruction encoding customisation and pipeline structure/instruction-set tailoring are all used to achieve a high performance/area ratio, low power consumption and small code size. Simulations and experiments on a group of benchmarks show that the multi-pipeline ASIP achieves an average performance improvement of 83% over a single-pipeline ASIP, with overheads of 31%, 33% and 86% for area, leakage power and code size, respectively.

Keywords: instruction sets;pipeline processing;HMP-ASIP;forwarding network reduction;heterogeneous multipipeline architecture;instruction encoding customisation;pipeline structure-instruction-set tailoring;specific instruction-set processors
[82] A. Janapsatya, S. Parameswaran, and A. Ignjatovic. Hitme: Low power hit memory buffer for embedded systems. In Design Automation Conference, 2009. ASP-DAC 2009. Asia and South Pacific, pages 335-340, 2009. [ bib | DOI ]
In this paper, we present a novel HitME (Hit-MEmory) buffer to reduce the energy consumption of the memory hierarchy in embedded processors. The HitME buffer is a small direct-mapped cache memory that is added as additional memory into existing cache memory hierarchies. The HitME buffer is loaded only when there is a hit on the L1 cache; otherwise, the L1 cache is updated from memory and the processor's memory request is served directly from the L1 cache. The strategy works because 90% of memory addresses are accessed only once, and these often pollute the cache. Energy reduction is achieved by reducing the number of accesses to the L1 cache memory. Experimental results show that the HitME buffer reduces L1 cache accesses, lowering the energy consumption of the memory hierarchy: cache system energy consumption falls by an average of 60.9% compared to a traditional L1 cache memory architecture, and by 6.4% compared to a filter cache architecture, for 70 nm cache technology.

Keywords: cache storage;embedded systems;integrated memory circuits;HitME;direct-mapped cache memory;embedded systems;energy consumption;low power hit memory buffer;memory hierarchy;Australia;Cache memory;Computer science;Embedded system;Energy consumption;Filters;Memory management;Microprocessors;Multiplexing;Pollution
[83] K. Patel, S. Parameswaran, and R.G. Ragel. Cuffs: An instruction count based architectural framework for security of mpsocs. In Design, Automation Test in Europe Conference Exhibition, 2009. DATE '09., pages 779-784, 2009. [ bib | DOI ]
Multiprocessor system on chip (MPSoC) architecture is rapidly gaining momentum for modern embedded devices. Vulnerabilities in software on MPSoCs are often exploited to mount software attacks, which are the most common type of attack on embedded systems. Therefore, we propose an MPSoC architectural framework, CUFFS, for an application specific instruction set processor (ASIP) design that has a dedicated security processor, called iGuard, for detecting software attacks.

Keywords: circuit analysis computing;embedded systems;instruction sets;logic design;security of data;system-on-chip;MPSoC security;application specific instruction set processor design;embedded devices;embedded systems;instruction count based architectural framework;multiprocessor system on chip architecture;software attack detection;Application software;Australia;Buffer overflow;Computer science;Computer security;Data security;Embedded software;Embedded system;Hardware;Runtime
[84] J. Peddersen and S. Parameswaran. Low-impact processor for dynamic runtime power management. IEEE Design & Test of Computers, 25(1):52-62, 2008. [ bib | DOI ]
This article presents a method of modifying a processor so that it can estimate its own power and energy consumption in parallel with application execution. The authors have applied the method to an existing 32-bit processor and demonstrated it on a range of benchmarks. The system adds only a small increase in average power consumption and chip area.

Keywords: 32-bit processor, Batteries, counters, dynamic runtime power management, energy aware, Energy consumption, Energy management, Frequency measurement, Heat engines, low-impact processor, low-power electronics, macromodeling, microprocessor chips, power consumption, power estimation, Power generation, power measurement, Power system management, Power system modeling, Runtime, runtime power management
[85] Xin He and S. Parameswaran. MCAD: Multiple connection based anomaly detection. In 11th IEEE Singapore International Conference on Communication Systems, 2008. ICCS 2008, pages 999-1004, 2008. [ bib | DOI ]
This paper describes a novel multi-connection based anomaly detection system. Previous techniques consume enormous amounts of time due to pre-processing features (unsupervised anomaly detection), or due to the lead time in creating specialized rules (supervised anomaly detection). The system described in this paper, MCAD, uses the observed premise that anomalous connections by one attacker are very similar to each other (e.g. an attacker will try to use similar connections to probe a network). MCAD tests for similarity amongst connections within clustered groups, and if the similarity for connections of the group is above a predetermined threshold, then these connections are deemed anomalous. MCAD was tested on two weeks of the MIT/LL DARPA dataset; the total number of connections tested was over a million. From this testing, MCAD was able to detect 15 types of multiple-connection-based attacks: 14 were fully detected, while the 15th was detected 2/3 of the time. The false positive rate was 0.466%.

Keywords: Australia, Computer crime, Computer science, Computer vision, Data mining, Intrusion detection, MCAD, MIT-LL DARPA dataset, multiple connection, Probes, security of data, supervised anomaly detection, Telecommunication traffic, Testing, unsupervised anomaly detection, Unsupervised learning
[86] K. Avnit, V. D'silva, A. Sowmya, S. Ramesh, and S. Parameswaran. A formal approach to the protocol converter problem. In Design, Automation and Test in Europe, 2008. DATE '08, pages 294-299, 2008. [ bib | DOI ]
In the absence of a single module interface standard, integration of pre-designed modules in System-on-Chip design often requires the use of protocol converters. Existing approaches to automatic synthesis of protocol converters mostly lack formal foundations and either employ abstractions that ignore crucial low-level behaviors, or grossly simplify the structure of the protocols considered. We present a state-machine based formal model for bus-based communication protocols, and precisely define protocol compatibility and correct protocol conversion. Our model is expressive enough to capture features of commercial protocols, such as bursts, pipelined transfers, wait state insertion, and data persistence, in cycle-accurate detail. We show that the most general correct converter for a pair of protocols can be described as the greatest fixed point of a function for updating buffer states. This characterization yields a natural algorithm for automatic synthesis of a provably correct converter by iterative computation of the fixed point. We report our experience with automatic converter synthesis between widely used commercial bus protocols, such as AMBA AHB, ASB, APB, and OCP, considering features which are beyond the scope of current techniques.

Keywords: Algorithm design and analysis, Australia, automatic converter synthesis, Bandwidth, bus protocols, Clocks, Context modeling, Hardware design languages, Iterative algorithms, protocol converter problem, Protocols, Signal synthesis, system buses, System-on-a-chip, system-on-chip, system-on-chip design
[87] K. Patel and S. Parameswaran. SHIELD: A software hardware design methodology for security and reliability of MPSoCs. In 45th ACM/IEEE Design Automation Conference, 2008. DAC 2008, pages 858-861, 2008. [ bib ]
Security of MPSoCs is an emerging area of concern in embedded systems. Security is jeopardized by code injection attacks, which are the most common type of software attack. Previous attempts to detect code injection in MPSoCs have been burdened with significant performance overheads. In this work, we present a hardware/software methodology, "SHIELD", to detect code injection attacks in MPSoCs. SHIELD instruments the software programs running on application processors in the MPSoC and also extracts control flow and basic block execution time information for runtime checking. We employ a dedicated security processor (monitor processor) to supervise the application processors on the MPSoC. Custom hardware is designed and used in the monitor and application processors. The monitor processor uses the custom hardware to rapidly analyze information communicated to it from the application processors at runtime. We have implemented SHIELD on a commercial extensible processor (Xtensa LX2) and tested it on a multiprocessor JPEG encoder program. In addition to code injection attacks, the system is also able to detect 83% of bit-flip errors in the control flow instructions. The experiments show that SHIELD produces systems whose runtime is at least 9 times faster than the previous solution. SHIELD incurs a runtime (clock cycles) performance overhead of only 6.6% and an area overhead of 26.9%, when compared to a non-secure system.

Keywords: application processors, Application software, Architecture, Bit Flips, code injection, code injection attacks, Data mining, Design methodology, embedded system, embedded systems, Hardware, hardware-software codesign, hardware-software design, Information analysis, Information security, Instruments, integrated circuit reliability, Monitoring, monitor processor, MPSoC reliability, MPSoC security, multiprocessor JPEG encoder, Multiprocessors, multiprocessor system on chips, Runtime, SHIELD instruments, software programs, system-on-chip, Tensilica, Xtensa LX2
[88] J. Chan and S. Parameswaran. NoCOUT: NoC topology generation with mixed packet-switched and point-to-point networks. In Design Automation Conference, 2008. ASP-DAC 2008. Asia and South Pacific, pages 265-270, 2008. [ bib | DOI ]
Networks-on-chip (NoC) have been widely proposed as the future communication paradigm for use in next-generation system-on-chip. In this paper, we present NoCOUT, a methodology for generating an energy-optimized application-specific NoC topology which supports both point-to-point and packet-switched networks. The algorithm uses a prohibitive greedy iterative improvement strategy to explore the design space efficiently. A system-level floorplanner is used to evaluate the iterative design improvements and provide feedback on the effects of the topology on wire length. The algorithm is integrated within a NoC synthesis framework with characterized NoC power and area models to allow accurate exploration for a NoC router library. We apply the topology generation algorithm to several test cases including real-world and synthetic communication graphs with both regular and irregular traffic patterns, and varying core sizes. Since the method is iterative, it is possible to start with a known design to search for improvements. Experimental results show that many different applications benefit from a mix of "on-chip networks" and "point-to-point networks". With such a hybrid network, we achieve approximately 25% lower energy consumption (with a maximum of 37%) than a state-of-the-art min-cut partition based topology generator for a variety of benchmarks. In addition, the average hop count is reduced by 0.75 hops, which would significantly reduce the network latency.

Keywords: Algorithm design and analysis, Feedback, greedy algorithms, greedy iterative improvement strategy, Iterative algorithms, logic design, Network-on-a-chip, network-on-chip, networks-on-chip, network topology, Next generation networking, NoCOUT, NoC topology generation, Optimization methods, packet-switched networks, packet switching, point-to-point networks, Space exploration, system-level floorplanner, System-on-a-chip, Wire
[89] Yee Jern Chong and S. Parameswaran. Rapid application specific floating-point unit generation with bit-alignment. In 45th ACM/IEEE Design Automation Conference, 2008. DAC 2008, pages 62-67, 2008. [ bib ]
While ASIPs have allowed designers to create processors with custom instructions to target specific applications, floating point units are still instantiated as fixed general-purpose units, which waste area if not fully utilized. Therefore, there is a need for custom FPUs for embedded systems. The creation of a custom FPU requires the selection of a subset of the full floating-point instruction set and the implementation of this subset in hardware, such that the runtime of the application is minimized. To minimize area, it is desirable to merge the datapaths for each of the floating-point operations, so that redundant hardware is minimized. Floating-point datapaths are complex and contain components with varying bit-widths, so sharing components of different bit-widths is necessary. However, this introduces the problem of bit-alignment, which involves determining how smaller resources should be aligned within larger resources when merged. This is a problem that has been largely neglected in previous work. Thus, this paper presents a novel algorithm for solving the bit-alignment problem, which neatly integrates into the datapath merging process. By solving this bit-alignment problem, automatic datapath merging can be made available for FPU generation. To explore the trade-offs between area and performance, a rapid design space exploration was performed to determine which FP operations should be implemented in hardware rather than emulated. Our results show that more floating-point hardware does not necessarily equate to lower runtime if the additional hardware increases delay. We found that bit-alignment reduced area by an average of 22.5% in our benchmarks, compared to an average of 14.1% without bit-alignment.

Keywords: Application software, application specific instruction set processor, Application specific processors, bit-alignment, Clocks, Computer science, datapath merging, Design engineering, Design Space Exploration, embedded system, embedded systems, floating-point, floating point arithmetic, floating-point datapath, floating-point instruction set, floating-point unit generation, Hardware, instruction sets, logic design, merging, microprocessor chips, Permission, redundant hardware, Runtime
[90] J.A. Ambrose, N. Aldon, A. Ignjatovic, and S. Parameswaran. Anatomy of differential power analysis for AES. In 10th International Symposium on Symbolic and Numeric Algorithms for Scientific Computing, 2008. SYNASC '08, pages 459-466, 2008. [ bib | DOI ]
Side channel attacks are a significant threat to the deployment of secure embedded systems. Differential power analysis is one of the most powerful power analysis attacks, and can be exploited against secure devices such as smart cards, PDAs and mobile phones. Several researchers in the past have presented experiments and countermeasures for differential power analysis in AES cryptography, though none of them have described the attack in a step-by-step manner covering all aspects of the attack. Some of the important missing segments are the consideration of pipelines, analysis of the power profile to locate the points of attack, and the correspondence between the source code, its assembly representation, and the point of attack. In this paper we give a detailed, step-by-step explanation of the differential power analysis of an AES implementation, covering all of the aspects identified above.

Keywords: AES cryptography, Algorithm design and analysis, Anatomy, authorisation, cryptography, differential power analysis, embedded system, embedded systems, Mobile handsets, Performance analysis, Personal digital assistants, power measurement, secure embedded system, Semiconductor device measurement, side channel attack, side channel attacks, Smart cards
[91] J.A. Ambrose, S. Parameswaran, and A. Ignjatovic. MUTE-AES: A multiprocessor architecture to prevent power analysis based side channel attack of the AES algorithm. In IEEE/ACM International Conference on Computer-Aided Design, 2008. ICCAD 2008, pages 678-684, 2008. [ bib | DOI ]
Side channel attack based upon the analysis of power traces is an effective way of obtaining the encryption key from secure processors. Power traces can be used to detect bitflips which betray the secure key. Balancing the bitflips with opposite bitflips has been proposed, by the use of opposite logic. This is an expensive solution, where the balancing processor continues to balance even when encryption is not carried out in the processor.

Keywords: Algorithm design and analysis, bitllips, cryptographic program, cryptography, differential power analysis, dual processor architecture, embedded system, embedded systems, encryption key, Hamming weight, Information analysis, Information security, Mobile handsets, multiprocessing systems, multiprocessor algorithmic balancing technique, Performance analysis, Personal digital assistants, Power system security, side channel attack
[98] P. Ray, N. Parameswaran, L. Lewis, and G. Jakobson. Distributed autonomic management: An approach and experiment towards managing service-centric networks. In Service Systems and Service Management, 2008 International Conference on, pages 1-6, 2008. [ bib | DOI ]
This paper describes a novel approach for managing service-centric communications networks called distributed autonomic management (DAM). Current approaches to network management employ the client/server model, cooperative stationary agents, and/or non-intelligent mobile agents. The DAM model consists of communities of mobile and stationary intelligent agents in collaboration. We discuss an experiment with DAM and proceed to discuss outstanding research issues. The DAM approach uses the properties and characteristics of autonomic systems in support of managing service-oriented communications networks and protecting e-commerce and business enterprises against cyber terrorism.

Keywords: distributed processing;mobile agents;distributed autonomic management;mobile agent;service-centric communication network;stationary intelligent agent;Collaboration;Communication networks;Computer crime;Computer network management;Humans;Immune system;Intelligent agent;Intrusion detection;Network servers;Protection
[100] Seng Lin Shee and S. Parameswaran. Design methodology for pipelined heterogeneous multiprocessor system. In 44th ACM/IEEE Design Automation Conference, 2007. DAC '07, pages 811-816, 2007. [ bib ]
Multiprocessor SoC systems have led to the increasing use of parallel hardware along with the associated software. These approaches have included coprocessor, homogeneous processor (e.g. SMP) and application specific architectures (i.e. DSP, ASIC). ASIPs have emerged as a viable alternative to conventional processing entities (PEs) due to their configurability and programmability. In this work, we introduce a heterogeneous multi-processor system using ASIPs as processing entities in a pipeline configuration. A streaming application is taken and manually broken into a series of algorithmic stages (each of which makes up a stage in a pipeline). We formulate the problem of mapping each algorithmic stage in the system to an ASIP configuration, and propose a heuristic to efficiently search the design space for a pipeline-based multi-ASIP system. We have implemented the proposed heterogeneous multiprocessor methodology using a commercial extensible processor (Xtensa LX from Tensilica Inc.). We have evaluated our system by creating two benchmarks (MP3 and JPEG encoders) which are mapped to our proposed design platform. Our multiprocessor design provided a performance improvement of at least 4.11X (JPEG) and 3.36X (MP3) compared to the single-processor design. The minimum cost obtained through our heuristic was within 5.47% and 5.74% of the best possible values for the JPEG and MP3 benchmarks respectively.

Keywords: Application software, application specific architectures, Application specific processors, Architecture, ASIP, Computer architecture, conventional processing entity, coprocessor, coprocessors, Design, Design methodology, Digital audio players, Digital signal processing, Experimentation, extensible processor, Hardware, hardware-software codesign, Hardware/software partitioning, heterogeneous multiprocessor methodology, homogeneous processor, image coding, JPEG, MP3, multiprocessing systems, multiprocessor design, multiprocessor SoC systems, parallel hardware, Performance, pipeline configuration, pipelined heterogeneous multiprocessor system, pipeline processing, Pipelines, single processor design, system-on-chip
[101] J.A. Ambrose, R.G. Ragel, and S. Parameswaran. A smart random code injection to mask power analysis based side channel attacks. In 2007 5th IEEE/ACM/IFIP International Conference on Hardware/Software Codesign and System Synthesis (CODES+ISSS), pages 51-56, 2007. [ bib ]
One of the security issues in embedded systems is the ability of an adversary to perform side channel attacks. Power analysis attacks are often very successful, where the power sequence dissipated by the system is observed and analyzed to predict secret keys. In this paper we show a processor architecture, which automatically detects the execution of the most common encryption algorithms, starts to scramble the power waveform by adding randomly placed instructions with random register accesses, and stops injecting instructions when it is safe to do so. Our technique prevents both Simple Power Analysis (SPA) and Differential Power Analysis (DPA). This approach has lower overheads compared to previous solutions and avoids software instrumentation, allowing programmers with no special knowledge to use the system. Our processor model costs an additional area of 1.2%, and an average of 25% in runtime and 28.5% in energy overheads for industry standard cryptographic algorithms.

Keywords: Benchmark testing, Cross Correlation, cryptographic algorithms, embedded system, embedded systems, Encryption, encryption algorithms, Hardware, Indexes, instructions, multiprocessing systems, power analysis, power analysis attacks, power sequence dissipation, power waveform, processor architecture, public key cryptography, random codes, Random Instruction Injection, random register accesses, Registers, Runtime, secret keys, security issues, side channel attack, side channel attacks, Signature Identification, smart random code injection, software instrumentation
[102] I.S. Lu, N. Weste, and S. Parameswaran. A power-efficient 5.6-GHz process-compensated CMOS frequency divider. IEEE Transactions on Circuits and Systems II: Express Briefs, 54(4):323-327, 2007. [ bib | DOI ]
This brief presents a robust, power-efficient CMOS frequency divider for the 5-GHz UNII band. The divider operates as a voltage controlled ring oscillator with the output frequency modulated by the switching of the input transmission gate. The divider, designed in a 0.25-μm SOS-CMOS technology, occupies 35×25 μm² and exhibits an operating frequency of 5.6 GHz while consuming 79 μW at a supply voltage of 0.8 V. Process and temperature tolerant operation can be achieved by utilizing a novel compensation circuitry to calibrate the speed of the ring oscillator-based divider. The simple compensation circuitry contains low-speed digital logic and dissipates minimal additional power since it is powered on only during the one-time factory calibration sequence.

Keywords: 0.8 V, 0.25 micron, 5.6 GHz, 79 muW, Circuits, CMOS frequency divider, CMOS logic circuits, CMOS process, CMOS technology, compensation circuit, controllable delay element, Controllable delay element (CDE), digital logic, Frequency conversion, frequency divider, frequency dividers, Frequency modulation, MMIC frequency convertors, process compensation, process tolerance, ring oscillator divider calibration, Ring oscillators, Robustness, SOS-CMOS technology, Temperature, temperature tolerant operation, UNII band, Voltage control, voltage-controlled oscillators, voltage controlled ring oscillator
[103] Yee Jern Chong and S. Parameswaran. Automatic application specific floating-point unit generation. In Design, Automation Test in Europe Conference Exhibition, 2007. DATE '07, pages 1-6, 2007. [ bib | DOI ]
This paper describes the creation of custom floating point units (FPUs) for application specific instruction set processors (ASIPs). ASIPs allow the customization of processors for use in embedded systems by extending the instruction set, which enhances the performance of an application or a class of applications. These extended instructions are manifested as separate hardware blocks, making the creation of any necessary floating point instructions quite unwieldy. On the other hand, using a predefined FPU includes a large monolithic hardware block with a considerable number of unused instructions. A customized FPU will overcome these drawbacks, yet the manual creation of one is a time consuming, error prone process. This paper presents a methodology for automatically generating floating-point units (FPUs) that are customized for specific applications at the instruction level. Generated FPUs comply with the IEEE754 standard, which is an advantage over FP format customization. Custom FPUs were generated for several Mediabench applications. Area savings over a fully-featured FPU of 26%-80% without resource sharing and 33%-87% with resource sharing were obtained. Clock period increased in some cases by up to 9.5% due to resource sharing.

Keywords: Application software, application specific instruction set processors, application specific integrated circuits, Application specific processors, Australia, automatic application specific floating-point unit generation, Clocks, Computer science, Costs, embedded system, embedded systems, floating point arithmetic, Hardware, instruction sets, Productivity, Resource management, resource sharing
[104] J.A. Ambrose, R.G. Ragel, and S. Parameswaran. RIJID: Random code injection to mask power analysis based side channel attacks. In 44th ACM/IEEE Design Automation Conference, 2007. DAC '07, pages 489-492, 2007. [ bib ]
Side channel attacks are becoming a major threat to the security of embedded systems. Countermeasures proposed to overcome Simple Power Analysis (SPA) and Differential Power Analysis (DPA) include data masking, table masking, current flattening, circuitry level solutions, dummy instruction insertions and balancing bit-flips. All these techniques are either susceptible to multi-order side channel attacks, not sufficiently generic to cover all encryption algorithms, or burden the system with high area cost, run-time or energy consumption. A HW/SW based randomized instruction injection technique is proposed in this paper to overcome the pitfalls of previous countermeasures. Our technique injects random instructions at random places during the execution of an application, which protects the system from both SPA and DPA. Further, we devise a systematic method to measure the security level of a power sequence and use it to determine the number of random instructions needed to suitably confuse the adversary. Our processor model costs 1.9% in additional area for a simplescalar processor, and costs on average 29.8% in runtime and 27.1% in additional energy consumption for six industry standard cryptographic algorithms.

Keywords: balancing bit-flips, circuitry level solutions, Circuits, Costs, Cross Correlation, cryptography, current flattening, Data analysis, data masking, Data security, Design, differential power analysis, dummy instruction insertions, embedded system, Energy consumption, mask power analysis, Measurement, Pattern Matching, power analysis, power measurement, Power system security, random code injection, Random Instruction Injection, randomized instruction injection technique, Runtime, Security, side channel attack, side channel attacks, simple power analysis, six industry standard cryptographic algorithms, table masking
[105] J. Peddersen and S. Parameswaran. Energy driven application self-adaptation. In 20th International Conference on VLSI Design, 2007, held jointly with 6th International Conference on Embedded Systems, pages 385-390, 2007. [ bib | DOI ]
Until recently, there has been a lack of methods to trade off energy use for quality of service at run-time in stand-alone embedded systems. Such systems are motivated by the need to increase the apparent available battery energy of portable devices, with minimal compromise in quality. The available systems either drew too much power or added considerable overheads due to task swapping. In this paper we demonstrate a feasible method to perform these trade-offs. This work has been enabled by a low-impact power/energy estimating processor which utilizes counters to estimate power and energy consumption at run-time. Techniques are shown that modify multimedia applications to vary the fidelity of their output to optimize the energy/quality trade-off. Two adaptation algorithms are applied to multimedia applications demonstrating the efficacy of the method. The method increases code size by 1% and execution time by 0.02%, yet is able to produce an output which is acceptable and processes up to double the number of frames.

Keywords: Application software, Australia, Batteries, Counting circuits, Design methodology, embedded system, embedded systems, Energy consumption, energy driven application self-adaptation, Energy management, low-impact energy estimating processor, low-impact power estimating processor, low-power electronics, multimedia applications, power aware computing, power consumption, Quality of service, Runtime, supervisory programs
[106] K. Patel, S. Parameswaran, and Seng Lin Shee. Ensuring secure program execution in multiprocessor embedded systems: A case study. In 2007 5th IEEE/ACM/IFIP International Conference on Hardware/Software Codesign and System Synthesis (CODES+ISSS), pages 57-62, 2007. [ bib ]
Multiprocessor SoCs are increasingly deployed in embedded systems with few or no security features built in. Code injection attacks are one of the most commonly encountered security threats. Most solutions to this problem in the single processor domain are purely software based and have high overheads. A few hardware solutions have been provided for the single processor case, which significantly reduce overheads. In this paper, for the first time, we propose a methodology addressing code injection attacks in a multiprocessor domain. A dedicated security (monitor) processor is used to oversee the application at runtime. Each processor communicates with the monitor processor through a FIFO queue, and is continuously checked. Static analysis of the program map and timing profile are used to obtain program information at compile time, which is utilized by the monitor processor at runtime. This information is encrypted using a secure key and stored in the monitor processor. A copy of this secure key is built into the processor's hardware and is used for decryption by the monitor processor. Each basic block of the program is also instrumented with security information that uniquely identifies itself at runtime. The information from static analysis thus allows the monitor processor to supervise the proceedings on each processor at runtime. Our approach uses a combination of hardware and software techniques to keep overheads to a minimum. We implemented our methodology on a commercial extensible processor (Xtensa LX). Our approach successfully detects the execution of injected code when tested on a JPEG multiprocessor benchmark. The results show a small increase of 6.6% in application processors' runtime (clock cycle count) and 35.2% in code size for the JPEG encoder benchmark.

Keywords: Benchmark testing, code injection attacks, embedded system, Embedded System Processors, embedded systems, FIFO queue, Hardware, hardware-software techniques, information encryption, Information security, JPEG encoder, Monitoring, monitor processor, multiprocessing systems, multiprocessor, Multiprocessors, Runtime, secure program execution, Security, security of data, SoC, Software, static analysis, supervisory programs, system-on-chip, Tensilica, timing profile, Transform coding
[116] Andhi Janapsatya, Aleksandar Ignjatovic, Sri Parameswaran, and Joerg Henkel. Instruction trace compression for rapid instruction cache simulation. In Design, Automation and Test in Europe (DATE'07) Conference, page 6pp. IEEE, 2007. [ bib | http ]
A method to both reduce energy and improve performance in a processor-based embedded system is described in this paper. Comprising a scratchpad memory instead of an instruction cache, the target system dynamically (at runtime) copies into the scratchpad code segments that are determined to be beneficial (in terms of energy efficiency and/or speed) to execute from the scratchpad. We develop a heuristic algorithm to select such code segments based on a metric, called concomitance. Concomitance is derived from the temporal relationships of instructions. A hardware controller is designed and implemented for managing the scratchpad memory. Strategically placed custom instructions in the program inform the hardware controller when to copy instructions from the main memory to the scratchpad. A novel heuristic algorithm is implemented for determining locations within the program where to insert these custom instructions. For a set of realistic benchmarks, experimental results indicate the method uses 41.9% less energy (on average) and improves performance by 40.0% compared to a traditional cache system which is identical in size.

[117] Ivan Siu-Chuang Lu, Neil Weste, and Sri Parameswaran. A power-efficient 5.6-ghz process-compensated cmos frequency divider. IEEE Transactions on Circuits and Systems II, 54(4):323-327, 2007. [ bib | .pdf ]
This brief presents a robust, power-efficient CMOS frequency divider for the 5-GHz UNII band. The divider operates as a voltage controlled ring oscillator with the output frequency modulated by the switching of the input transmission gate. The divider, designed in a 0.25-μm SOS-CMOS technology, occupies 35×25 μm² and exhibits an operating frequency of 5.6 GHz while consuming 79 μW at a supply voltage of 0.8 V. Process- and temperature-tolerant operation can be achieved by utilizing a novel compensation circuitry to calibrate the speed of the ring oscillator-based divider. The simple compensation circuitry contains low-speed digital logic and dissipates minimal additional power since it is powered on only during the one-time factory calibration sequence.

[118] Jorgen Peddersen and Sri Parameswaran. Low impact run-time power/energy estimating processor for dynamic power management. To appear in IEEE Design and Test, 2007. [ bib ]
[119] Jorgen Peddersen and Sri Parameswaran. Clipper: Counter-based low impact processor power estimation at run-time. In 12th Asia and South Pacific Design Automation Conference (ASP-DAC 2007), pages 890-895, Yokohama, Japan, 2007. IEEE. [ bib | .pdf ]
Numerous dynamic power management techniques have been proposed which utilize the knowledge of processor power/energy consumption at run-time. So far, no efficient method to provide run-time power/energy data has been presented: current measurement systems draw too much power to be used in small embedded designs, and existing performance counters cannot provide sufficient information for run-time optimization. This paper presents a novel methodology to solve the problem of run-time power optimization by designing a processor that estimates its own power/energy consumption. Estimation is performed by the addition of small counters that tally events which consume power. This methodology has been applied to an existing processor, resulting in an average power error of 2%, and adds little impact to the design, with only a 4.9% area overhead and a 3% clock period increase. A case study of an application that utilizes the processor showcases the benefits the methodology enables in dynamic power optimization.

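The counter-based scheme described in [119] can be caricatured in software: hardware counters tally power-consuming events, and energy is read out as a weighted sum of the counts. A minimal sketch, where the event names and per-event energy costs are invented for illustration and are not from the paper:

```python
# Toy model of counter-based run-time energy estimation: small counters
# tally events; energy is a weighted sum of counts. Weights are hypothetical.
EVENT_COST_PJ = {"alu_op": 3.0, "cache_hit": 5.0, "cache_miss": 60.0}

class EnergyCounters:
    def __init__(self):
        self.counts = {e: 0 for e in EVENT_COST_PJ}

    def tally(self, event):
        # In hardware this is a cheap counter increment per event.
        self.counts[event] += 1

    def energy_pj(self):
        # Weighted sum, read out at run time by e.g. a power manager.
        return sum(EVENT_COST_PJ[e] * n for e, n in self.counts.items())

ec = EnergyCounters()
for ev in ["alu_op"] * 10 + ["cache_hit"] * 4 + ["cache_miss"]:
    ec.tally(ev)
print(ec.energy_pj())  # 10*3.0 + 4*5.0 + 1*60.0 = 110.0
```

The appeal of the scheme is that the run-time cost is only the counter increments; the weighted sum is computed when an estimate is actually needed.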
[120] Jorgen Peddersen and Sri Parameswaran. Energy driven application adaptation at run-time. In 20th International Conference on VLSI DESIGN, pages 385-390, Bangalore, India, 2007. [ bib | .pdf ]
Until recently, there has been a lack of methods to trade off energy use for quality of service at run-time in stand-alone embedded systems. Such methods are motivated by the need to increase the apparent available battery energy of portable devices with minimal compromise in quality. The available systems either drew too much power or added considerable overheads due to task swapping. In this paper we demonstrate a feasible method to perform these trade-offs. This work has been enabled by a low-impact power/energy estimating processor which utilizes counters to estimate power and energy consumption at run-time. Techniques are shown that modify multimedia applications to vary the fidelity of their output to optimize the energy/quality trade-off. Two adaptation algorithms are applied to multimedia applications, demonstrating the efficacy of the method. The method increases code size by 1% and execution time by 0.02%, which is acceptable, and processes up to double the number of frames.

[121] Roshan Ragel and Sri Parameswaran. A hybrid hardware software technique to improve reliability in embedded processors. To appear in ACM Transactions on Embedded Computing Systems, 2007. [ bib ]
[122] Seng Lin Shee and Sri Parameswaran. Design methodology for pipelined heterogeneous multiprocessor system. In Design Automation Conference (DAC'07), page 6 pp., San Diego, CA, USA, 2007. [ bib | .pdf ]
Multiprocessor SoC systems have led to the increasing use of parallel hardware along with the associated software. These approaches have included coprocessor, homogeneous processor (e.g. SMP) and application specific architectures (i.e. DSP, ASIC). ASIPs have emerged as a viable alternative to conventional processing entities (PEs) due to their configurability and programmability. In this work, we introduce a heterogeneous multiprocessor system using ASIPs as processing entities in a pipeline configuration. A streaming application is taken and manually broken into a series of algorithmic stages (each of which makes up a stage in a pipeline). We formulate the problem of mapping each algorithmic stage in the system to an ASIP configuration, and propose a heuristic to efficiently search the design space for a pipeline-based multi-ASIP system. We have implemented the proposed heterogeneous multiprocessor methodology using a commercial extensible processor (Xtensa LX from Tensilica Inc.). We have evaluated our system by creating two benchmarks (MP3 and JPEG encoders) which are mapped to our proposed design platform. Our multiprocessor design provided a performance improvement of at least 4.11X (JPEG) and 3.36X (MP3) compared to the single processor design. The minimum cost obtained through our heuristic was within 5.47% of the optimal solution for the respective benchmarks.

[123] Seng Lin Shee and Sri Parameswaran. Architectural exploration of heterogeneous multiprocessor systems for jpeg. International Journal of Parallel Processing, 35, 2007. [ bib | .pdf ]
Multicore processors have been utilized in embedded systems and general computing applications for some time. However, these multicore chips execute multiple applications concurrently, with each core carrying out a particular task in the system. Such systems can be found in gaming, automotive real-time systems and video/image encoding devices. These systems are commonly deployed to overcome deadline misses, which are primarily due to overloading of a single multitasking core. In this paper, we explore the use of multiple cores for a single application, as opposed to multiple applications executing in a parallel fashion. A single application is parallelized using two different methods: one, a master-slave model; and two, a sequential pipeline model. The systems were implemented using Tensilica's Xtensa LX processors with queues as the means of communication between two cores. In the master-slave model, we utilized a coarse-grained approach whereby a main core distributes the workload to the remaining cores and reads the processed data before writing the results back to file. In the pipeline model, a lower granularity is used. The application is partitioned into multiple sequential blocks, each block representing a stage in a sequential pipeline. For both models we applied a number of differing configurations ranging from a single core to a nine-core system. We found that without any optimization for the seven-core system, the sequential pipeline approach has a more efficient area usage, with an area increase to speedup ratio of 1.83 compared to the master-slave approach of 4.34. With selective optimization in the pipeline approach, we obtained speedups of up to 4.6× with an area increase of only 3.1× (area increase to speedup ratio of just 0.68).

[124] Joerg Henkel and Sri Parameswaran, editors. Designing Embedded Processors: A Low Power Perspective. Springer, 2007. [ bib | .pdf ]
[125] Andhi Janapsatya, Aleks Ignjatovic, and Sri Parameswaran. Finding optimal l1 cache configuration for embedded systems. In Asia South Pacific Design Automation Conference (ASPDAC 2006), pages 796-801, Yokohama, Japan, 2006. [ bib | .pdf ]
Modern embedded systems execute a single application or a class of applications repeatedly. An emerging methodology of designing embedded systems utilizes configurable processors, where the cache size, associativity, and line size can be chosen by the designer. In this paper, a method is given to rapidly find the L1 cache miss rate of an application. An energy model and an execution time model are developed to find the best cache configuration for the given embedded application. Using benchmarks from Mediabench, we find that our method is on average 45 times faster at exploring the design space than Dinero IV, while still having 100% accuracy.

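The selection step in [125] amounts to scoring each candidate cache configuration with an energy model and keeping the cheapest. A minimal sketch of that idea, assuming per-configuration miss rates have already been obtained (all numbers, energies, and configurations here are invented for illustration, not taken from the paper):

```python
# Score candidate L1 configurations with a simple energy model and keep the
# cheapest. All numbers (energies, miss rates) are illustrative only.
ACCESSES = 1_000_000
E_MISS_NJ = 20.0  # extra energy paid per miss (hypothetical)

# (size_kB, assoc, line_B) -> (per-hit energy in nJ, miss rate),
# e.g. with miss rates obtained from a fast trace-driven simulation.
configs = {
    (4, 1, 32): (0.40, 0.080),
    (8, 2, 32): (0.55, 0.030),
    (16, 4, 64): (0.90, 0.025),
}

def energy_nj(cfg):
    e_hit, miss_rate = configs[cfg]
    misses = ACCESSES * miss_rate
    # Every access pays the hit energy; misses additionally pay the miss cost.
    return ACCESSES * e_hit + misses * E_MISS_NJ

best = min(configs, key=energy_nj)
print(best)  # -> (8, 2, 32): lowest total energy under this toy model
```

Note the trade-off the model captures: larger caches miss less but cost more per access, so the minimum is not automatically the biggest configuration.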
[126] Andhi Janapsatya, Aleks Ignjatovic, and Sri Parameswaran. A novel instruction scratchpad memory optimization method based on concomitance metric. In Asia South Pacific Design Automation Conference (ASPDAC 2006), page 6 pp., Yokohama, Japan, 2006. [ bib | .pdf ]
Scratchpad memory has been introduced as a replacement for cache memory as it improves the performance of certain embedded systems. Additionally, it has also been demonstrated that scratchpad memory can significantly reduce the energy consumption of the memory hierarchy of embedded systems. This is significant, as the memory hierarchy consumes a substantial proportion of the total energy of an embedded system. This paper deals with optimization of the instruction memory scratchpad based on a methodology that uses a metric which we call concomitance. This metric is used to find basic blocks which are executed frequently and in close proximity in time. Once such blocks are found, they are copied into the scratchpad memory at appropriate times; this is achieved using a special instruction inserted into the code at appropriate places. For a set of benchmarks taken from Mediabench, our scratchpad system consumed just 59% of the energy of a cache system, and 73% of the energy of a state-of-the-art scratchpad system, while improving the overall performance. Compared to the state-of-the-art method, the number of instructions copied into the scratchpad memory from the main memory is reduced by 88%.

Keywords: cache storage; embedded systems; low-power electronics; optimisation; concomitance metric; instruction scratchpad memory optimization; memory hierarchy; state-of-the-art scratchpad
[127] Andhi Janapsatya, Aleksander Ignjatovic, and Sri Parameswaran. Exploiting statistical information for implementation of instruction scratchpad memory in embedded system. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 14(8):816-29, 2006. [ bib | http ]
A method to both reduce energy and improve performance in a processor-based embedded system is described in this paper. Comprising a scratchpad memory instead of an instruction cache, the target system dynamically (at runtime) copies into the scratchpad code segments that are determined to be beneficial (in terms of energy efficiency and/or speed) to execute from the scratchpad. We develop a heuristic algorithm to select such code segments based on a metric, called concomitance. Concomitance is derived from the temporal relationships of instructions. A hardware controller is designed and implemented for managing the scratchpad memory. Strategically placed custom instructions in the program inform the hardware controller when to copy instructions from the main memory to the scratchpad. A novel heuristic algorithm is implemented for determining locations within the program where to insert these custom instructions. For a set of realistic benchmarks, experimental results indicate the method uses 41.9% less energy (on average), and improves performance by 40.0%, compared to a traditional cache system which is identical in size.

[128] Ivan Lu, Sri Parameswaran, and Neil Weste. Design and Analysis of an Integrated Low Power Ultrawideband Receiver (final editing stage). Springer, 2006. [ bib ]
[129] Ivan Siu-Chuang Lu, Neil Weste, and Sri Parameswaran. Adc precision requirement for digital ultra-wideband receivers with sublinear front-ends: a power and performance perspective. In 19th International Conference on VLSI Design held jointly with 5th International Conference on Embedded Systems and Design (VLSI Design '06), page 6 pp., Hyderabad, India, 2006. IEEE Computer Society. [ bib | .pdf ]
This paper presents the power and performance analysis of a digital, direct sequence ultra-wideband (DS-UWB) receiver operating in the 3 to 4 GHz band. The signal to noise and distortion ratio (SNDR) and bit error rate (BER) were evaluated with varying degrees of front-end linearity and analog to digital converter (ADC) accuracy. The analysis and simulation results indicate two or more ADC bits are required for reliable data reception in the presence of strong interference and intermodulation distortion. In addition to BER performance, power consumption of different hardware configurations is also evaluated to form the cost function for evaluating design choices. The combined power and performance analysis indicates that, starting from a one-bit ADC resolution, a substantial gain in reliability can be attained by increasing the resolution to two bits or more. When the ADC resolution improves beyond three bits, front-end linearization achieves similar BER improvements to increasing the ADC accuracy, at a fraction of the power cost. As a result, linear front-end designs become significant only when high-precision ADCs are utilized.

Keywords: analogue-digital conversion; intermodulation distortion; microwave receivers; radio receivers; ultra wideband technology
[130] Sri Parameswaran, Joerg Henkel, and Newton Cheung. Instruction matching and modelling. In Paolo Ienne and Rainer Leupers, editors, Customizable and Configurable Embedded Processors. Elsevier, 2006. [ bib ]
[131] Swarnalatha Radhakrishnan, Hui Guo, and Sri Parameswaran. Customization of application specific heterogeneous multipipeline processors. In Design, Automation and Test in Europe Conference and Exhibition (DATE '06), page 6 pp., Munich, Germany, 2006. IEEE Comput. Soc. [ bib | .pdf ]
In this paper we propose application specific instruction set processors with heterogeneous multiple pipelines to efficiently exploit the available parallelism at instruction level. We have developed a design system based on the Thumb processor architecture. Given an application specified in C language, the design system can generate a processor with a number of pipelines specifically suitable to the application, and the parallel code associated with the processor. Each pipeline in such a processor is customized, and implements its own special instruction set so that the instructions can be executed in parallel with low hardware overhead. Our simulations and experiments with a group of benchmarks, largely from the Mibench suite, show that on average a 77% performance improvement can be achieved compared to a single-pipeline ASIP, with overheads of 49% on power, 17% on switching activity, and 69% on code size.

[132] Swarnalatha Radhakrishnan, Hui Guo, Sri Parameswaran, and Aleksandar Ignjatovic. Application specific forwarding network and instruction encoding for multi-pipe asips. In International Conference on Hardware/Software Codesign and Systems Synthesis (CODES + ISSS '06), page 6, Seoul, Korea, 2006. [ bib ]
Small area and code size are two critical design issues in most embedded system designs. In this paper, we tackle these issues by customizing forwarding networks and instruction encoding schemes for multi-pipe Application Specific Instruction-Set Processors (ASIPs). Forwarding is a popular technique to reduce data hazards in the pipeline and improve performance, and is applied in almost all modern processor designs; but it is very expensive in area. Instruction encoding schemes have a direct impact on code size; an efficient encoding method can lead to a small instruction width, and hence reduce the code size. We propose application specific techniques to reduce forwarding networks and instruction widths for ASIPs with multiple pipelines. By these design techniques, it is possible to reduce area, code size, and even power consumption (due to reduced area), without costing any performance. Our experiments on a set of benchmarks using the proposed customization approaches show that, on average, savings of 27% on area and 30% on code size are achieved without any loss in execution time; performance even improves by 4% owing to a reduced clock period.

[133] Roshan Ragel and Sri Parameswaran. Hardware assisted pre-emptive control flow checking for embedded processors to improve reliability. In International Conference on Hardware/Software Codesign and Systems Synthesis (CODES + ISSS '06), page 6, Seoul, Korea, 2006. [ bib | .pdf ]
[134] Roshan G. Ragel and Sri Parameswaran. Impres: Integrated monitoring for processor reliability and security. In Design Automation Conference. (DAC '06), page 4 pages, San Francisco, CA, USA, 2006. ACM. [ bib | .pdf ]
Security and reliability in processor based systems are concerns requiring adroit solutions. Security is often compromised by code injection attacks, jeopardizing even `trusted software'. Reliability is of concern where unintended code is executed in modern processors with ever smaller feature sizes and low voltage swings causing bit flips. Countermeasures by software-only approaches increase code size by large amounts and therefore significantly reduce performance. Hardware-assisted approaches add extensive amounts of hardware monitors and thus incur unacceptably high hardware cost. This paper presents a novel hardware/software technique at the granularity of micro-instructions to reduce overheads considerably. Experiments show that our technique incurs an additional hardware overhead of 0.91%, a clock period increase of 0.06%, and a code size increase of just 11.9%. These overheads are far smaller than have been previously encountered.

[135] Seng Lin Shee, Andrea Erdos, and Sri Parameswaran. Heterogeneous multiprocessor implementations for jpeg : A case study. In International Conference on Hardware/Software Codesign and Systems Synthesis (CODES + ISSS '06), page 6, Seoul, Korea, 2006. [ bib ]
[136] Hui Wu and Sri Parameswaran. Minimising the energy consumption of real-time tasks with precedence constraints on a single processor. In The 2006 IFIP International Conference on Embedded And Ubiquitous Computing, page 12 pages, Seoul, Korea., 2006. Springer's Lecture Notes in Computer Science. [ bib ]
[137] Sri Parameswaran, Jorgen Peddersen, and Ashley Partis. A modular approach to tcp/ipv6 hardware implementation, 2 June 2005. [ bib ]
[138] Jeremy Chan and Sri Parameswaran. Nocee: energy macro-model extraction methodology for network on chip routers. In International Conference on Computer Aided Design (ICCAD-2005), pages 254-9, San Jose, CA, USA, 2005. IEEE. [ bib | .pdf ]
In this paper we present NoCEE, a fast and accurate method for extracting energy models for packet-switched network on chip (NoC) routers. Linear regression is used to model the relationship between events occurring in the NoC and energy consumption. The resulting models are cycle accurate and can be applied to different technology libraries. We verify the individual router estimation models with many different synthetically generated traffic patterns and data inputs. Characterization of a small library takes about two hours. The mean absolute energy estimation error of the resultant models is 5% compared to a complete gate level simulation. We also apply this method to a number of complete NoCs with inputs extracted from synthetic application traces and compare our estimated results to the gate level power simulations (mean absolute error is 5%). The methodology has been integrated with a commercial logic synthesis flow and power estimation tools (Synopsys Design Compiler and PrimePower), allowing application across different designs. The extracted models show the different trends across various parameterizations of network on chip routers and have been integrated into an architecture exploration framework.

Keywords: circuit CAD; integrated circuit design; integrated circuit modelling; network routing; network-on-chip; packet switching; regression analysis
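The regression step in [138] can be illustrated in miniature: collect event counts per cycle window, then fit per-event energy coefficients by least squares. A sketch under invented assumptions (the event types, coefficients, and noiseless data are hypothetical, not from the paper):

```python
# Least-squares fit of router energy against event counts, in the spirit of
# an event-based energy macro-model. Event names and costs are hypothetical.
import numpy as np

rng = np.random.default_rng(0)

# Synthetic training data: event counts per cycle window.
# Columns: buffer writes, crossbar traversals, arbitrations.
events = rng.integers(0, 10, size=(200, 3)).astype(float)
true_coeffs = np.array([2.1, 3.5, 0.7])   # pJ per event (made up)
energy = events @ true_coeffs + 5.0       # plus a constant static-energy term

# Fit the per-event coefficients and the constant term together.
design = np.hstack([events, np.ones((200, 1))])
coeffs, *_ = np.linalg.lstsq(design, energy, rcond=None)

print(np.allclose(coeffs, [2.1, 3.5, 0.7, 5.0]))  # True: data is noiseless
```

In a real characterization flow, the `energy` vector would come from gate-level power simulation rather than a synthetic formula, and the fitted coefficients then price each event at estimation time.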
[139] Newton Cheung, Sri Parameswaran, and Joerg Henkel. Battery aware instruction generation for embedded processors. In Asia South Pacific Design Automation Conference (ASP-DAC '05), volume 1, pages 553-556, Shanghai, China, 2005. [ bib | .pdf ]
Automatic instruction generation is an efficient method to satisfy growing performance demands and meet design constraints for application specific instruction-set processors. A typical approach for instruction generation is to combine a large group of primitive instructions into a single extensible instruction for maximizing speedups. However, this approach often leads to large power dissipation and discharge current, posing a challenge to battery-powered products. In this paper, we propose a battery-aware automatic tool to design extensible instructions which minimizes power dissipation distribution by separating an instruction into multiple instructions. We verify our automatic tool using 50 different code segments and five large real-world applications. Our tool reduces energy consumption by a further 5.8% (up to 17.7%) compared to previous approaches. For real-world applications, energy consumption is reduced by 6.6% for most cases. The automatic instruction generation tool is integrated into our application specific instruction-set processor tool suite.

Keywords: application specific integrated circuits; embedded systems; instruction sets; microprocessor chips; application specific instruction-set processor; automatic instruction generation; battery aware instruction generation; discharge current; embedded processor; energy consumption; extensible instruction; power dissipation
[140] Hui Guo and Sri Parameswaran. Balancing system level pipelines with stage voltage scaling. In IEEE Computer Society Annual Symposium on VLSI (IVLSI '05), pages 287-289, Tampa, FL, USA, 2005. IEEE Comput. Soc. [ bib | http ]
This paper presents an approach to dynamically balance the pipeline by scaling the stage supply voltages. Simulation results show that, by such an approach, about 50% improvement in throughput and response time and an 11% reduction in power consumption can be achieved with limited memory overhead.

Keywords: application specific integrated circuits; asynchronous circuits; logic partitioning; low-power electronics; memory architecture
[141] Ivan S. C. Lu, Neil Weste, and Sri Parameswaran. The effect of receiver front-end non-linearity on ds-uwb systems operating in the 3 to 4 ghz band. In 2005 IEEE Wireless Communications and Networking Conference (WCNC '05), volume 2, pages 776-81, New Orleans, LA, USA, 2005. IEEE. [ bib | .pdf ]
The paper presents a performance analysis of direct sequence ultra wideband (DS-UWB) systems operating with non-linear receiver front-ends. Following this analysis, we propose the novel use of pulse doublets to mitigate non-linearity induced distortion. The signal-to-noise-and-distortion ratio (SNDR) and bit error rate (BER) are evaluated with varying degrees of non-linearity and interference power. Simulation results indicate significant performance improvements by using pulse doublets under high interference power and non-linear operating conditions. Using pulse doublets allows reduced front-end linearity requirements and enables improvements in more critical circuit parameters. Front-end modules, such as low noise amplifiers (LNAs), mixers and baseband amplifiers, are designed using Peregrine's 0.5-μm SOS-CMOS process to demonstrate the benefits of circuits designed with relaxed linearity requirements. Simulation results obtained using the Cadence Spectre RF simulator indicate that the sub-linear front-end achieves a 33 dB increase in voltage gain, a 2 dB improvement in noise figure, a 64% reduction in power and a 917 MHz extension in bandwidth over its more linear counterpart.

Keywords: CMOS analogue integrated circuits; error statistics; interference (signal); microwave integrated circuits; microwave receivers; nonlinear distortion; radio receivers; random noise; silicon-on-insulator; ultra wideband communication
[142] Sri Parameswaran and Joerg Henkel. Instruction code mapping for performance increase and energy reduction in embedded computer systems. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 13(4):498-502, 2005. [ bib | .pdf ]
In this paper, we present a novel and fast constructive technique that relocates the instruction code into the main memory in such a manner that the cache is utilized more efficiently. The technique is applied as a preprocessing step, i.e., before the code is executed. Our technique is applicable in embedded systems where the number and characteristics of tasks running on the system are known a priori. The technique does not impose any computational overhead to the system. As a result of applying our technique to a variety of real-world applications, we observed through simulation a significant drop in cache misses. Furthermore, the energy consumption of the whole system (CPU, caches, buses, main memory) is reduced by up to 65%. These benefits could be achieved with a slightly increased main memory size of about 13% on average.

Keywords: cache storage; codes; embedded systems; instruction sets; low-power electronics
[143] Sri Parameswaran, Jorgen Peddersen, and Ashley Partis. Low power chip architecture, 2005. [ bib ]
[144] Jorgen Peddersen, Seng Lin Shee, Andhi Janapsatya, and Sri Parameswaran. Rapid embedded hardware/software system generation. In 18th International Conference on VLSI Design (VLSI Design '05), pages 111-16, Kolkata, India, 2005. IEEE Computer Soc. [ bib | .pdf ]
This paper presents an RTL generation scheme for a SimpleScalar/PISA instruction set architecture with system calls to implement C programs. The scheme utilizes ASIPmeister, a processor generation tool. The RTL generated is available for download. The second part of the paper shows a method of reducing the PISA instruction set and generating a processor for a given application. This reduction and generation can be performed within an hour, making this one of the fastest methods of generating an application specific processor. For five benchmark applications, we show that, on average, processor size can be reduced by 30% and energy consumption by 24%.

Keywords: application specific integrated circuits; C language; embedded systems; microprocessor chips
[145] Swarnalatha Radhakrishnan, Hui Guo, and Sri Parameswaran. n-pipe: Application specific heterogeneous multi-pipeline processor design. In Workshop on Application Specific Processors (WASP '05), page 8 pages, Jersey City, NJ, USA, 2005. [ bib ]
In this paper we propose Application Specific Instruction Set Processors with heterogeneous multiple pipelines to efficiently exploit the available parallelism at instruction level. We have developed a design system based on the Thumb processor architecture. Given an application specified in C language, the design system can generate a processor with a number of pipelines specifically suitable to the application, and the parallel code associated with the processor. Each pipeline in such a processor is customized, and implements its own special instruction set so that the instructions can be executed in parallel with low hardware overhead. Our simulations and experiments with a group of benchmarks, largely from the Mibench suite, show that on average a 77% performance improvement can be achieved compared to a single pipeline ASIP, with overheads of 49% on power, 17% on switching activity, and 69% on code size.

[146] Roshan Ragel, Sri Parameswaran, and Sayed Kia. Micro embedded monitoring for security in application specific instruction-set processors. In International Conference on Compilers, Architecture, and Synthesis for Embedded Systems (CASES '05), pages 304-314, San Francisco, CA, USA, 2005. ACM. [ bib | .pdf ]
This paper presents a methodology for monitoring security in Application Specific Instruction-set Processors (ASIPs). This is a generalized methodology for inline monitoring of insecure operations in machine instructions at the microinstruction level. Microinstructions are embedded into the critical machine instructions, forming self-checking instructions. We name this method Micro Embedded Monitoring. Since ASIPs are designed exclusively for a particular application domain, the Instruction Set Architecture (ISA) of an ASIP is based on the application executed. Knowledge of the domain gives an insight into the kinds of security threats which need to be considered. The fact that the ISA design is based on the application makes room to accommodate security monitoring support during the design phase by embedding microinstructions into the critical machine instructions. Since the microinstructions are at the lowest possible software architecture level, we could expect better performance by implementing security detection using microinstruction routines. Four different embedded security monitoring routines are implemented for evaluation. The average performance penalties with these monitoring routines, over ten different benchmarks, are 1.93% and 3.07%, respectively.

[147] Seng Lin Shee, Sri Parameswaran, and Newton Cheung. Architecture for loop acceleration: a case study. In International Conference on Hardware/Software Codesign and System Synthesis (CODES + ISSS '05), pages 297-302, Jersey City, NJ, USA, 2005. IEEE. [ bib | .pdf ]
In this paper, we show a novel approach to accelerate loops by tightly coupling a coprocessor to an ASIP. Latency hiding is used to exploit the parallelism available in this architecture. To illustrate the advantages of this approach, we investigate a JPEG encoding algorithm and accelerate one of its loops by implementing it in a coprocessor. We contrast the acceleration by implementing the critical segment as two different coprocessors and a set of customized instructions. The two coprocessor approaches are: a high-level synthesis (HLS) approach; and a custom coprocessor approach. The HLS approach provides a faster method of generating coprocessors. We show that a loop performance improvement of 2.57× is achieved using the custom coprocessor approach, compared to 1.58× for the HLS approach and 1.33× for the customized instruction approach, all relative to the main processor alone. Respective energy savings within the loop are 57%, 28% and 19%.

Keywords: application specific integrated circuits; coprocessors; energy conservation; hardware-software codesign; instruction sets; parallel architectures
[148] Jeremy Chan and Sri Parameswaran. Nocgen: a template based reuse methodology for networks on chip architecture. In 17th International Conference on VLSI Design (VLSI Design '04), pages 717-20, Mumbai, India, 2004. IEEE Comput. Soc. [ bib | .pdf ]
In this paper, we describe NoCGEN, a Network on Chip (NoC) generator, which is used to create a simulatable and synthesizable NoC description. NoCGEN uses a set of modularised router components that can be used to form different routers with a varying number of ports, routing algorithms, data widths and buffer depths. A graph description representing the interconnection between these routers is used to generate a top-level VHDL description. A wormhole output-queued 2-D mesh router was created to verify the capability of NoCGEN. Various parameterized designs were synthesized, giving estimated gate counts of 129K to 695K for topologies varying from a 2x2 mesh to a 4x4 mesh, with a constant data bus width of 32 bits. The NoC was simulated with random traffic using a mixed SystemC/VHDL environment to ensure correctness of operation and to obtain performance and average latency. The results show an accepted load of 53% for buffer depths from 8 to 32 flits for the 4x4 mesh router.

Keywords: hardware description languages network routing system-on-chip
[149] Newton Cheung, Sri Parameswaran, and Joerg Henkel. A quantitative study and estimation models for extensible instructions in embedded processors. In International Conference on Computer Aided Design (ICCAD '04), pages 183-189, San Jose, CA, USA, 2004. IEEE. [ bib | .pdf ]
Designing extensible instructions is a computationally complex task, due to the large design space each instruction is exposed to. One method of speeding up the design cycle is to characterize instructions and estimate their peculiarities during a design exploration. In this paper, we study and derive three estimation models for extensible instructions: area overhead, latency, and power consumption under a wide range of customization parameters. System decomposition and regression analysis are used as the underlying methods to characterize and analyze extensible instructions. We verify our estimation models using automatically and manually generated extensible instructions, plus extensible instructions used in large real-world applications. The mean absolute errors of our estimation models are as small as 3.4% and 4.2%, compared with the results obtained through the time-consuming synthesis and simulation steps using commercial tools. Our estimation models achieve an average speedup of three orders of magnitude over the commercial tools and thus enable us to conduct a fast and extensive design space exploration that would otherwise not be possible. The estimation models are integrated into our extensible processor tool suite.

[150] Newton Cheung, Sri Parameswaran, Joerg Henkel, and Jeremy Chan. Mince: matching instructions using combinational equivalence for extensible processor. In Design, Automation and Test in Europe Conference and Exhibition (DATE '04), volume 2, pages 1020-5, Paris, France, 2004. IEEE Comput. Soc. [ bib | .pdf ]
Designing custom-extensible instructions for extensible processors is a computationally complex task because of the large design space. The task of automatically matching candidate instructions in an application (e.g. written in a high-level language) to a pre-designed library of extensible instructions is especially challenging. Previous approaches have focused on identifying extensible instructions (e.g. through profiling), synthesizing extensible instructions, estimating expected performance gains, etc. In this paper we introduce our approach of automatically matching extensible instructions, as this key step is missing in automating the entire design flow of an ASIP with extensible instruction capabilities. Since matching using simulation is practically infeasible (due to simulation time), and traditional pattern matching approaches would not yield reliable results (functionally equivalent code can be represented in many different ways), we adopt combinational equivalence checking. Our MINCE tool, part of our ASIP design flow, consists of a translator, a filtering algorithm and a combinational equivalence checking tool. We report matching times of extensible instructions that are 7.3x faster on average (using Mediabench applications) compared to the best known approaches to the problem (partial simulations). In all our experiments MINCE matched correctly, and the outcome of the matching step yielded an average application speedup of 2.47x. In summary, our work represents a key step towards automating the whole design flow of an ASIP with extensible instruction capabilities.

Keywords: application specific integrated circuits combinational circuits instruction sets integrated circuit design
[151] Andhi Janapsatya, Sri Parameswaran, and Joerg Henkel. Remcode: relocating embedded code for improving system efficiency. IEE Proceedings-Computers and Digital Techniques, 151(6):457-65, 2004. [ bib | .pdf ]
The memory hierarchy subsystem has a significant impact on the performance and energy consumption of an embedded system. Methods which increase the hit ratio of the cache hierarchy will typically enhance performance and reduce the embedded system's total energy consumption. This is mainly due to reduced cache-to-memory bus transactions, fewer main memory accesses and fewer processor waiting cycles. A heuristic approach is presented to reduce the total number of cache misses by carefully relocating selected sections of the application's software code within the main memory, thus reducing conflict misses resulting from the cache hierarchy. The method requires no hardware modifications, i.e. it is a software-only approach. For the first time such a method is applied to large program traces, and the miss rates and corresponding energy savings are observed while varying cache size, line size and associativity. Relocating the code consistently produces superior performance on direct-mapped caches. Since direct-mapped caches are smaller in silicon area than caches of higher associativity (for the same capacity), cost less in terms of energy per access, and are faster to access, a direct-mapped instruction cache with code relocation is recommended for performance-oriented embedded systems. A maximum cache miss rate reduction of 71% and energy savings of up to 63% are achieved, with only a small increase in main memory size.

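The conflict-miss mechanism that the relocation heuristic of [151] targets can be sketched in a few lines. This is an illustrative model, not the paper's algorithm; the cache parameters and code addresses below are hypothetical.

```python
# Illustrative only: shows why relocating code removes conflict misses
# in a direct-mapped cache. Parameters and addresses are hypothetical.

LINE_SIZE = 32          # bytes per cache line
NUM_LINES = 256         # direct-mapped: one line per set

def cache_line(addr: int) -> int:
    """Direct-mapped index: (addr / line size) mod number of lines."""
    return (addr // LINE_SIZE) % NUM_LINES

def misses(trace) -> int:
    """Count misses of an instruction-address trace."""
    tags = [None] * NUM_LINES
    miss_count = 0
    for addr in trace:
        idx, tag = cache_line(addr), addr // (LINE_SIZE * NUM_LINES)
        if tags[idx] != tag:
            tags[idx] = tag
            miss_count += 1
    return miss_count

# Two hot code blocks that alternate in a loop.
block_a = 0x0000                     # maps to cache line 0
block_b = 0x2000                     # 0x2000/32 = 256 -> also line 0
trace = [block_a, block_b] * 1000

# Relocate block_b by one line so the two blocks no longer collide.
relocated = [block_a, block_b + LINE_SIZE] * 1000

print(misses(trace))      # 2000: each access evicts the other block
print(misses(relocated))  # 2: only the two compulsory misses remain
```

The relocation heuristic in the paper selects which code sections to move and where; the point here is only that moving a section by as little as one cache line can eliminate all conflict misses between two hot blocks.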
[152] Andhi Janapsatya, Sri Parameswaran, and Aleksander Ignjatovic. Hardware/software managed scratchpad memory for embedded system. In International Conference on Computer Aided Design (ICCAD 2004), pages 370-7, San Jose, CA, USA, 2004. IEEE. [ bib | .pdf ]
In this paper, we propose a methodology for energy reduction and performance improvement. The target system comprises an instruction scratchpad memory instead of an instruction cache. Highly utilized code segments are copied into the scratchpad memory and are executed from the scratchpad. The copying of code segments from main memory to the scratchpad is performed during runtime. A custom hardware controller is used to manage the copying process. The hardware controller is activated by strategically placed custom instructions within the executing program. These custom instructions inform the hardware controller when to copy during program execution. Novel heuristic algorithms are implemented to determine locations within the program to insert these custom instructions, as well as to choose the best sets of code segments to be copied to the scratchpad memory. For a set of realistic benchmarks, experimental results indicate that the method reduces energy consumption by 50.7% and improves performance by 53.2%, compared with a cache system identical in size. The cache systems compared had sizes ranging from 256 bytes to 16K bytes and associativities ranging from 1 to 32.

Keywords: cache storage embedded systems memory architecture power consumption program processors
[153] Swarana Radhakrishnan, Hui Guo, and Sri Parameswaran. Dual-pipeline heterogeneous asip design. In International Conference on Hardware/Software Codesign and Systems Synthesis (CODES + ISSS '04), pages 12-17, Stockholm, Sweden, 2004. ACM. [ bib | .pdf ]
In this paper we demonstrate the feasibility of a dual-pipeline application specific instruction set processor. We take a C program and create a target instruction set by compiling to a basic instruction set, from which some instructions are merged while others are discarded. Based on the target instruction set, the parallelism of the application program is analyzed and two unique instruction sets are generated for a heterogeneous dual-pipeline processor. The dual-pipeline processor is created by making two unique ASIPs (VHDL descriptions) utilizing the ASIP-Meister Tool Suite, and fusing the two VHDL descriptions to construct a dual-pipeline processor. Our results show that, in comparison to the single-pipeline application specific instruction set processor, performance improves by 27.6% and energy consumption reduces by 6.1%, at the cost of increased area, which for the benchmarks considered is 16.7% on average.

Keywords: C language hardware description languages instruction sets pipeline processing
[154] Newton Cheung, Sri Parameswaran, and Joerg Henkel. Rapid configuration & instruction selection for an asip: A case study. In Ahmed A. Jerraya, S. Yoo, N. Wehn, and D. Verkest, editors, Embedded Software for SoC. Kluwer Publishing, 2003. [ bib ]
[155] Newton Cheung, Sri Parameswaran, and Joerg Henkel. Inside: Instruction selection/identification & design exploration for extensible processors. In International Conference on Computer Aided Design (ICCAD '03), pages 291-7, San Jose, CA, USA, 2003. IEEE. [ bib | http ]
This paper presents the INSIDE system, which rapidly searches the design space for extensible processors, given the area and performance constraints of an embedded application, while minimizing the design turnaround time. Our system consists of a) a methodology to determine which code segments are most suited for implementation as a set of extensible instructions, b) a heuristic algorithm to select pre-configured extensible processors as well as extensible instructions (library), and c) an estimation tool which rapidly estimates the performance of an application on a generated extensible processor. By selecting the right combination of a processor core plus extensible instructions, we achieve a performance increase of 2.03x on average (up to 7x) compared to the base processor core, at a minimum hardware overhead of 25% on average.

Keywords: embedded systems instruction sets
[156] Newton Cheung, Sri Parameswaran, and Joerg Henkel. Rapid configuration and instruction selection for an asip: a case study. In Design, Automation and Test in Europe Conference and Exhibition (DATE '03), pages 802-7, Munich, Germany, 2003. IEEE Comput. Soc. [ bib | .pdf ]
We present a methodology that maximizes the performance of a Tensilica-based Application Specific Instruction-set Processor (ASIP) through instruction selection when an area constraint is given. Our approach rapidly selects from a set of pre-fabricated coprocessors/functional units in our library of pre-designed specific instructions (to evaluate our technology we use the Tensilica platform). As a result, we significantly increase application performance while area constraints are satisfied. Our methodology uses a combination of simulation, estimation and a pre-characterised library of instructions to select the appropriate co-processors and instructions. We report that by selecting the appropriate coprocessors/functional units and specific TIE instructions, the total execution time of complex applications (we study a voice encoder/decoder) can be reduced by up to 85%. The estimator used in our system typically takes less than a second, with an average error rate of 4% compared with simulation, which takes 45 minutes. The total selection process using our methodology takes 3-4 hours, while a full design space exploration using simulation would take several days.

Keywords: application specific integrated circuits coprocessors instruction sets integrated circuit design
[157] Ivan S. C. Lu, Neil Weste, and Sri Parameswaran. A digital ultra-wideband multiband transceiver architecture with fast frequency hopping capabilities. In 2003 IEEE Conference on Ultra Wideband Systems and Technologies (UWST '03), pages 448-52, Reston, VA, USA, 2003. IEEE. [ bib | .pdf ]
This paper presents for the first time the circuit parameter analysis of a digital multiband-UWB transceiver, encompassing a novel low-power sub-band generator. This sub-band generator is capable of producing multiple frequency bands, enabling sub-band generation from 3 to 10 GHz with nanosecond switching times. The circuit analysis of the complete transceiver is used to set the parameters of its components. The analysis indicates that an LNA gain of 20 dB, a baseband amplifier gain of 45 dB, a matched filter accuracy of five bits, an ADC accuracy of two bits, a 60 dB dynamic range of the multi-frequency generator, and a front-end offset voltage of less than 30 mV are required to achieve a 10 dB SNR. Hspice simulation utilizing 0.35 µm CMOS technology suggests that the power consumption of the sub-band generator is 8 mW from a 1.8 V power supply.

Keywords: amplifiers analogue-digital conversion broadband networks CMOS integrated circuits digital radio frequency hop communication matched filters power consumption transceivers
[158] Sri Parameswaran, Joerg Henkel, and Haris Lekastas. Multi-parametric improvements for embedded systems using code-placement and address bus coding. In Asia and South Pacific Design Automation Conference 2003 (ASP-DAC '03), pages 15-21, Kitakyushu, Japan, 2003. IEEE. [ bib | .pdf ]
Code placement techniques for instruction code have been shown to increase an SoC's performance, mostly due to increased cache hit ratios, and as such those techniques can be a major optimization strategy for embedded systems. Little has been investigated on the interdependencies between code placement techniques and interconnect traffic (e.g. bus traffic), or on optimization techniques combining both. In this paper we show, in the first approach of its kind, that a carefully designed code placement strategy, combined with and adapted to a known interconnect encoding scheme, not only leads to a performance increase but also to a significant reduction of interconnect-related energy consumption. This becomes especially interesting since future SoC bus systems (or more generally, "networks on a chip") are predicted to be a dominant energy consumer of an SoC. We show that a high-level optimization strategy like code placement and a lower-level optimization strategy like interconnect encoding are NOT orthogonal. Specifically, we report cache miss reduction ratios of 32%, with bus-related energy savings of 50.4% on average and of up to 95.7%. The results have been verified by means of diverse real-world SoC applications.

Keywords: cache storage circuit CAD circuit optimisation embedded systems encoding integrated circuit design integrated circuit interconnections low-power electronics storage allocation system buses system-on-chip VLSI
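The abstract of [158] does not name its interconnect encoding scheme. A well-known representative, bus-invert coding, illustrates how encoding cuts switching activity (and hence energy) on a bus; this sketch is illustrative and is not claimed to be the paper's scheme.

```python
# Bus-invert coding sketch: drive each word either as-is or inverted,
# whichever flips fewer wires relative to the previous bus state.
# Representative low-power encoding, not necessarily the paper's scheme.

BUS_WIDTH = 8

def hamming(a: int, b: int) -> int:
    """Number of bit positions in which a and b differ."""
    return bin(a ^ b).count("1")

def bus_invert(words):
    """Yield (value driven on the bus, invert-line bit) per word."""
    prev = 0
    for w in words:
        inverted = w ^ ((1 << BUS_WIDTH) - 1)
        if hamming(prev, inverted) < hamming(prev, w):
            prev = inverted
            yield inverted, 1          # drive the inverted word
        else:
            prev = w
            yield w, 0                 # drive the word unchanged

def transitions(words) -> int:
    """Total bit flips on the data lines for a word stream."""
    prev, total = 0, 0
    for w in words:
        total += hamming(prev, w)
        prev = w
    return total

stream = [0x00, 0xFF, 0x00, 0xFE]      # hypothetical address stream
encoded = [v for v, _ in bus_invert(stream)]
print(transitions(stream))             # → 23 data-line flips
print(transitions(encoded))            # → 1 data-line flip
```

Note that the invert line itself also toggles and costs one extra wire; the scheme pays off when words frequently differ from the previous bus state in more than half their bits.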
[159] Tony Han and Sri Parameswaran. Swasad: an asic design for high speed dna sequence matching. In ASP-DAC/VLSI Design 2002. 7th Asia and South Pacific Design Automation Conference and 15th International Conference on VLSI Design (ASP-DAC / VLSI Design '02), pages 541-6, Bangalore, India, 2002. IEEE Comput. Soc. [ bib | http ]
Presents the Smith and Waterman algorithm-specific ASIC design (SWASAD) project. This is a hardware solution that implements the S&W algorithm. The SWASAD is an improved implementation of the biological information signal processor (BISP) design. The SWASAD chip, fabricated on a 0.5 µm process, achieves 3200 million matrix cells per second (MCPS) per chip, with a layout size of 7.1 mm by 7.1 mm. This is a large improvement over existing designs and improves data throughput by using a smaller data width.

Keywords: application specific integrated circuits biological techniques digital signal processing chips DNA signal processing
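The matrix cells that SWASAD evaluates in hardware come from the Smith-Waterman local-alignment recurrence. A minimal software version of that scoring recurrence is sketched below; the scoring parameters are illustrative, not taken from the paper.

```python
# Smith-Waterman local-alignment scoring (software sketch of the
# recurrence the SWASAD ASIC evaluates cell by cell). The match,
# mismatch and gap scores are illustrative.

def smith_waterman(seq1: str, seq2: str,
                   match=2, mismatch=-1, gap=-1) -> int:
    """Return the best local-alignment score between two sequences."""
    rows, cols = len(seq1) + 1, len(seq2) + 1
    h = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = h[i-1][j-1] + (match if seq1[i-1] == seq2[j-1]
                                  else mismatch)
            # Local alignment: cell scores never drop below zero.
            h[i][j] = max(0, diag, h[i-1][j] + gap, h[i][j-1] + gap)
            best = max(best, h[i][j])
    return best

print(smith_waterman("AAA", "AAA"))   # → 6 (three matches at +2)
print(smith_waterman("ABC", "XBY"))   # → 2 (only the B matches)
```

Each cell depends only on its left, upper and upper-left neighbours, which is what makes the recurrence amenable to the systolic, cells-per-second hardware evaluation the abstract quantifies.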
[160] Sri Parameswaran, Joerg Henkel, Xiaobo Sharon Hu, and Rajesh Gupta. Proceedings of the tenth international symposium on hardware/software codesign, codes 2002. In CODES 2002, Estes Park, Colorado, 2002. ACM. [ bib ]
[161] Sri Parameswaran. Code placement in hardware software co synthesis to improve performance and reduce cost. In Design, Automation and Test in Europe. Conference and Exhibition (DATE '01), pages 626-32, Munich, Germany, 2001. IEEE Comput. Soc. [ bib | .pdf ]
This paper introduces an algorithm for code placement in cache, and maps the code to memory using a second algorithm. The target architecture is a multiprocessor system with first-level caches and a common main memory. These algorithms guarantee that as many instruction codewords as possible of the high priority tasks remain in cache at all times, so that other tasks do not overwrite them. This method improves the overall performance, and might result in cheaper systems if more powerful processors are no longer needed. The amount of memory increase necessary to facilitate this scheme is in the order of 13%, and the performance improvement from keeping the highest priority tasks always in cache can vary from 3% to 100%, depending upon how many tasks (and their sizes) are allocated to each processor.

Keywords: cache storage circuit CAD hardware-software codesign integrated circuit design integrated circuit economics VLSI
[162] Sri Parameswaran and Joerg Henkel. I-copes: fast instruction code placement for embedded systems to improve performance and energy efficiency. In IEEE/ACM International Conference on Computer Aided Design. (ICCAD 2001), pages 635-41, San Jose, CA, USA, 2001. IEEE. Also available on CD-ROM in PDF format. [ bib | .pdf ]
The ratio of cache hits to cache misses in a computer system is, to a large extent, responsible for its characteristics such as energy consumption and performance. In recent years energy efficiency has become one of the dominating design constraints, due to the rapid growth in market share of mobile computing/communication/internet devices. We present a novel, fast, constructive technique that relocates the instruction code within main memory in such a manner that the cache is utilized more efficiently. The technique is applied as a pre-processing step, i.e. before the code is executed. It is applicable for embedded systems where the number and characteristics of tasks running on the system are known a priori. The technique does not impose any computational overhead on the system. As a result of applying our technique to a variety of real-world applications, we measured (through simulation) that the number of cache misses drops significantly. Further, this reduces the energy consumption of the whole system (CPU, caches, buses, main memory) by up to 65%, at the cost of an increased main memory size of 13% on average.

Keywords: cache storage embedded systems instruction sets low-power electronics storage allocation
[163] Allan Rae and Sri Parameswaran. Voltage reduction of application-specific heterogeneous multiprocessor systems for power minimisation. IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences, E84-A(9):2296-302, 2001. [ bib ]
We present a design strategy to reduce power demands in application-specific heterogeneous multiprocessor systems with interdependent subtasks. This power reduction scheme can be used with a randomised search such as a genetic algorithm where multiple trial solutions are tested. The scheme is applied to each trial solution after allocation and scheduling have been performed. Power savings are achieved by equally expanding each processor's execution time with a corresponding reduction in their respective operating voltage. Lowest cost solutions achieve average reductions of 24% while minimum power solutions average 58%

Keywords: application specific integrated circuits CMOS digital integrated circuits genetic algorithms integrated circuit design low-power electronics microprocessor chips multiprocessing systems processor scheduling
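The trade-off that [163] (and its conference version [167]) exploits is that CMOS dynamic energy scales with the square of the supply voltage, while gate delay grows roughly as V/(V - Vt)^2, so expanding a processor's execution time into its schedule slack permits a lower supply voltage and a quadratic energy saving. A sketch with illustrative numbers; Vt and the voltages below are assumptions, not values from the paper.

```python
# Voltage-scaling sketch: how much slowdown buys how much energy.
# Uses the standard alpha-power delay model with alpha ~ 2.
# VT and the supply voltages are hypothetical.

VT = 0.8  # threshold voltage (V), assumed

def delay_factor(v: float) -> float:
    """Relative CMOS gate delay at supply voltage v."""
    return v / (v - VT) ** 2

def energy_saving(v_nominal: float, v_scaled: float) -> float:
    """Fractional dynamic-energy saving from lowering the supply."""
    return 1.0 - (v_scaled / v_nominal) ** 2

v_nom, v_low = 3.3, 2.4
slowdown = delay_factor(v_low) / delay_factor(v_nom)
print(f"slowdown: {slowdown:.2f}x")                     # slack consumed
print(f"energy saving: {energy_saving(v_nom, v_low):.0%}")
```

The paper's scheme applies this after allocation and scheduling: each processor's execution time is expanded equally into the available slack, and its voltage reduced correspondingly.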
[164] Allan Rae and Sri Parameswaran. Synthesising application-specific heterogeneous multiprocessors using differential evolution. IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences, E84-A(12):3125-31, 2001. [ bib ]
This paper presents an application-specific, heterogeneous multiprocessor synthesis system, named HeMPS, that combines a form of evolutionary computation known as Differential Evolution with a scheduling heuristic to search the design space efficiently. We demonstrate the effectiveness of our technique by comparing it to similar existing systems. The proposed strategy is shown to be faster than recent systems on large problems while providing equivalent or improved final solutions

Keywords: evolutionary computation hardware-software codesign multiprocessing systems processor scheduling
[165] Vince E Boros, Aleks D. Rakic, and Sri Parameswaran. High-level model of a wdma passive optical bus for a reconfigurable multiprocessor system. In Design Automation Conference. (DAC '00), pages 221-6, Los Angeles, CA, USA, 2000. ACM. [ bib | .pdf ]
We describe the first iteration of a comprehensive model with which we can investigate the practical limits on optical bus bandwidth and on the number of bus processing modules for a given signal power. The selection algorithm will ultimately allow programmable evaluation of the system parameters (bus bandwidth, optical power budget, electrical power budget, number of modules and space consumption) for an optimal design that is suitable for on-the-fly system reconfiguration.

Keywords: multiprocessing systems multiprocessor interconnection networks optical communication optical interconnections reconfigurable architectures wavelength division multiplexing
[166] Sri Parameswaran, Matthew F. Parkinson, and Peter Bartlett. Profiling in the asp codesign environment. Journal of Systems Architecture, 46(14):1263-74, 2000. [ bib | .pdf ]
Automation of the hardware/software codesign (HSC) methodology brings with it the need to develop sophisticated high-level profiling tools. This paper presents a profiling tool which uses execution profiling on standard C code to obtain accurate and consistent times at the level of individual compound code sections. This tool is used in the ASP hardware/software codesign project. The results from this tool show that profiling must be performed on dedicated hardware which is as close as possible to the final implementation, as opposed to a workstation. Further, in this paper a formula is derived for the number of times a program has to be profiled in order to get an accurate estimate of the number of times a loop with an indeterminate loop count is executed

Keywords: hardware-software codesign
[167] Allan Rae and Sri Parameswaran. Voltage reduction of application-specific heterogeneous multiprocessor systems for power minimisation. In Asia and South Pacific Design Automation Conference 2000 with EDA TechnoFair 2000 (ASP-DAC 2000), pages 147-52, Yokohama, Japan, 2000. IEEE. [ bib | .pdf ]
We present a design strategy to reduce power demands in application-specific heterogeneous multiprocessor systems with interdependent subtasks. This power reduction scheme can be used with a randomised search such as a genetic algorithm where multiple trial solutions are tested. The scheme is applied to each trial solution after allocation and scheduling have been performed. Power savings are achieved by equally expanding each processor's execution time with a corresponding reduction in their respective operating voltage. Lowest cost solutions achieve average reductions of 24% while minimum power solutions average 58%

Keywords: application specific integrated circuits circuit CAD CMOS digital integrated circuits genetic algorithms integrated circuit design microprocessor chips minimisation multiprocessing systems power consumption
[168] Seyed M. Kia and Sri Parameswaran. Self-checking synchronous controller design. IEE Proceedings-Computers and Digital Techniques, 146(1):9-12, 1999. [ bib | .pdf ]
Efficient models are introduced for totally self-checking/code disjoint (TSC/CD) and strongly fault-secure/strongly code disjoint (SFS/SCD) synchronous controllers. These models are based on two low-cost, modular, TSC edge-triggered and error-propagating CD flip-flops. Properties of the proposed synchronous controller models are proven. The design procedure for these models and their proper applications are explained.

Keywords: built-in self test flip-flops logic testing
[169] Hui Guo and Sri Parameswaran. Unrolling loops with indeterminate loop counts in system level pipelines. In Asia South Pacific Design Automation Conference (ASP-DAC '98), pages 195-200, Yokohama, Japan, 1998. [ bib | http ]
This paper describes the unrolling of loops with indeterminate loop counts in system level pipelines. Two methods are discussed. The first is the varied latency method, where the input is blocked until the pipeline is clear; this variation in the input arrival time gives rise to the name. In this method, the output is in the same order as the input. The second method, called the fixed latency method, allows the input arrival time to remain unchanged. Loops with a loop count in excess of the number of unrolled iterations must be stored until a suitable gap in the system becomes available. It is shown that the number of unrolled iterations should equal the sum of the expected value and the standard deviation of the loop count in the varied latency method, and the expected value of the loop count in the fixed latency method.
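The sizing rule stated in the abstract (unroll E[c] + sigma(c) iterations for the varied latency method and E[c] for the fixed latency method, where c is the indeterminate loop count) can be computed from a profile of observed loop counts. The profile below is hypothetical.

```python
# Unroll-depth sizing per the abstract's rule, applied to a
# hypothetical profile of observed loop counts.
from statistics import mean, pstdev

def unroll_count_varied(counts) -> int:
    """Varied latency: expected loop count plus its std deviation."""
    return round(mean(counts) + pstdev(counts))

def unroll_count_fixed(counts) -> int:
    """Fixed latency: expected loop count; overruns are buffered."""
    return round(mean(counts))

profile = [4, 6, 5, 7, 4, 6, 5, 3]   # hypothetical profiled counts
print(unroll_count_varied(profile))  # → 6 (mean 5 + stdev ~1.22)
print(unroll_count_fixed(profile))   # → 5
```

The extra standard deviation in the varied latency method buys headroom so that the pipeline rarely has to block the input; the fixed latency method instead buffers iterations that overrun the unrolled depth.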

[170] Seyed M. Kia and Sri Parameswaran. Designs for self checking flip-flops. IEE Proceedings-Computers and Digital Techniques, 145(2):81-8, 1998. [ bib | .pdf ]
The authors introduce two low-cost, modular, totally self checking (TSC), edge triggered and error propagating (code disjoint) flip-flops: one, a D flip-flop, used in TSC and strongly fault secure (SFS) synchronous circuits with two-rail codes; the other, a T flip-flop, used in a similar way to the D flip-flop but retaining the error as an indicator until the next presetting, to aid error propagation. Thus, the self checking T flip-flop can be used as an error indicator. The self checking D flip-flop is smaller than duplicated D flip-flop circuitry by 30%, and smaller than the previous error indicator in the literature. These circuits, unlike previously reported circuits, can also detect stuck-at faults in the clock inputs. The authors also present TSC/error propagating applications for the above flip-flops: a counter and a shift register.

Keywords: automatic testing flip-flops
[171] Sri Parameswaran. Hw-sw co-synthesis: the present and the future. In Asian and South Pacific Design Automation Conference 1998 (ASP-DAC '98), pages 19-22, Yokohama, Japan, 1998. IEEE. [ bib | .pdf ]
As we move towards several million transistors per chip it is desirable to move to higher levels of abstraction for the purposes of automated design of systems. Increasing performance of microprocessors in the marketplace is moving the balance between software and hardware. In this environment, it is necessary to adapt our tools to create systems, which encompass these fast microprocessors rather than compete with them. It is important to adapt other peripheral components such as sensors and RF circuits into our design methodology

Keywords: high level synthesis
[172] Sri Parameswaran and Hui Guo. Power reduction in pipelines. In Asian and South Pacific Design Automation Conference 1998 (ASP-DAC '98), pages 545-50, Yokohama, Japan, 1998. IEEE. [ bib | .pdf ]
The reduction of power consumption for a system level pipeline is addressed in this paper. The pipeline is composed of several stages. Each stage has several behaviours. Different behaviours have differing execution times. The speed of the pipeline is only affected by the behaviours on the critical path of the slowest stages. Other behaviours can be slowed down to decrease the power consumed in the system. We propose a multi-voltage supply scheme, in which differing behaviours are supplied with differing voltages. The formulas for computing the supply voltage of each behaviour and the minimal power consumption are derived in this paper. The results of computer experiments show that up to 80% of the hardware power can be saved with this scheme.

Keywords: computer power supplies; parallel architectures; pipeline processing; power consumption
[173] Allan Rae and Sri Parameswaran. Application-specific heterogeneous multiprocessor synthesis using differential-evolution - ii. In 11th International Symposium on System Synthesis (ISSS '98), pages 83-8, Hsinchu, Taiwan, 1998. IEEE Comput. Soc. [ bib | .pdf ]
This paper presents an application-specific, heterogeneous multiprocessor synthesis system, named HeMPS, that combines a form of Evolutionary Computation known as Differential Evolution with a scheduling heuristic to search the design space efficiently. We demonstrate the effectiveness of our technique by comparing it to similar existing systems. The proposed strategy is shown to be faster than recent systems on large problems while providing equivalent or improved final solutions.

Keywords: computer architecture; evolutionary computation; multiprocessing systems
[174] Allan R. Rae and Sri Parameswaran. Application-specific heterogeneous multiprocessor synthesis using differential evolution. In Asia Pacific Conference on Hardware Description Languages (APCHDL '98), 6 pages, Seoul, South Korea, 1998. [ bib ]
[175] Hui Guo and Sri Parameswaran. Unfolding loops with indeterminate count in system level pipelines. In 14th Australian Microelectronics Conference. Microelectronics: Technology Today for the Future (MICRO '97), pages 82-7, Melbourne, Vic., Australia, 1997. IREE Soc. [ bib ]
This paper describes the unrolling of loops with indeterminate loop counts in system level pipelines. Two methods are discussed in this paper. The first method is the varied latency method, where the input is blocked until the pipeline is clear. This variation in the input arrival time gives rise to the name. In this method, the output is in the same order as the input. The second method, called the fixed latency method, allows the input arrival time to remain unchanged. Inputs with a loop count in excess of the number of unrolled loops must be stored until a suitable gap in the pipeline becomes available. It is shown that the number of unrolled loops should be equal to the sum of the expected value of the loop count and the standard deviation of the loop count in the varied latency method, and the expected value of the loop count for the fixed latency method.

Keywords: pipeline processing
[176] Seyed M. Kia and Sri Parameswaran. Design of tsc/cd and sfs/scd synchronous circuits with tsc/error propagating flip-flops. In 11th Australian Microelectronics Conference (MICRO '97), pages 75-80, Sydney, NSW, Australia, 1997. Inst. Radio & Electron. Eng. [ bib ]
In this paper, we introduce a low-cost, modular, totally self checking (TSC), edge triggered and error propagating (code disjoint) D-type flip-flop. This D flip-flop can be used in TSC and strongly fault secure (SFS) synchronous circuits with two-rail codes.

[177] Seyed M. Kia and Sri Parameswaran. An efficient self exercising two rail checker. Journal of Microelectronic Systems Integration, 5(3):159-65, 1997. [ bib ]
In this paper, a new efficient self exercising checker for a class of non-systematic two rail codes is introduced. This proposed checker is called the Two Dimensional Self Exercising Two Rail checker (TDSETR checker). The TDSETR checker is used to check non-systematic two rail codes. In this method, the checking is done on line. This method significantly reduces the complexity and delay of checking compared to previous circuits. The high speed and low cost advantages of the TDSETR checker are compared with conventional circuits (e.g., a 30% improvement over previously reported circuits for 256 bit pair inputs).

Keywords: built-in self test; error detection codes; fault diagnosis; logic testing
[178] Sri Parameswaran and Hui Guo. Power consumption in cmos combinational logic blocks at high frequencies. In Asia and South Pacific Design Automation Conference 1997 (ASP-DAC '97), pages 195-200, Chiba, Japan, 1997. IEEE. [ bib | .pdf ]
A new model for estimating dynamic power dissipation in CMOS combinational circuits at differing voltages is presented. The proposed model deals with power dissipation of circuits at saturation frequencies, where the output voltage does not reach 100% and the output voltage waveform is almost triangular. We show that the dynamic power consumption at saturation frequencies is only dependent on the supply voltage, and is independent of load capacitance and switching speed. This model shows that when a circuit is working in the saturation frequency range, as the frequency is increased, the performance/power ratio is increased. However, this increase in performance/power ratio is at the expense of noise margin. The model is theoretically and empirically shown to be correct. This model can be used to design a system where the differing combinational logic blocks are supplied with differing voltages. Such a system would consume lower power than if the system was supplied by a single voltage rail.

Keywords: CMOS logic circuits; combinational circuits; integrated circuit design
[179] Sri Parameswaran and Hui Guo. Partitioning of system level pipelines. In 14th Australian Microelectronics Conference. Microelectronics: Technology Today for the Future (MICRO '97), pages 233-8, Melbourne, Vic., Australia, 1997. IREE Soc. [ bib ]
A system to be pipelined usually consists of several sub-systems or sub-processes. Each sub-process may have several implementations. Each implementation has an associated cost and an associated execution time. Different partitions result in different pipelines. To obtain an efficient pipeline, partitioning is important. In this paper, an algorithm to partition a system in order to obtain an optimal or near-optimal pipeline is given. Results obtained by this method are very close to the optimal solution, but can be found in a fraction of the time taken to search for an optimal solution.

Keywords: logic partitioning; optimisation; pipeline processing
[180] Sri Parameswaran and Hui Guo. Power reduction in pipelines. In 14th Australian Microelectronics Conference. Microelectronics: Technology Today for the Future (MICRO '97), pages 239-44, Melbourne, Vic., Australia, 1997. IREE Soc. [ bib | .pdf ]
The reduction of power consumption for a system level pipeline is addressed in this paper. The pipeline is composed of several stages. Each stage has several behaviours. Different behaviours have differing execution times. The speed of the pipeline is only affected by the behaviours on the critical path of the slowest stages. Other behaviours can be slowed down to decrease the power consumed in the system. We propose a multi-voltage supply scheme, in which differing behaviours are supplied with differing voltages. The formulas for computing the supply voltage of each behaviour and the minimal power consumption are derived in this paper. The results of computer experiments are also provided here.

Keywords: minimisation; pipeline processing; power supply circuits
[181] Sri Parameswaran and Hui Guo. Extracting higher performance/power ratio in combinational cmos circuits. In Sixth International Workshop on Power, Timing, Modelling, Optimization and Simulation (PATMOS '96), pages 93-102, University of Bologna, 1996. [ bib ]
[182] Pradip Jha, Sri Parameswaran, and Nikil Dutt. Reclocking controllers for minimum execution time. IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences, E78-A(12):1715-1721, 1995. [ bib ]
In this paper we describe a method for resynthesizing the controller of a design for a fixed datapath with the objective of increasing the design's throughput by minimizing its total execution time. This work has potential in two important areas: one, design reuse for retargetting datapaths to new libraries, new technologies and different bit-widths; and two, back-annotation of physical design information during High-Level Synthesis (HLS), and subsequent adjustment of the design's schedule to account for realistic physical design information with minimal changes to the datapath. We present our approach using various formulations, prove optimality of our algorithm and demonstrate the effectiveness of our technique on several HLS benchmarks. We have observed improvements of up to 34% through the application of our controller resynthesis technique to the outputs of HLS.

[183] Pradip Jha, Sri Parameswaran, and Nikil Dutt. Reclocking for high level synthesis. In Asia and South Pacific Design Automation Conference. IFIP International Conference on Computer Hardware Description Languages and their Applications. IFIP International Conference on Very Large Scale Integration (ASP-DAC'95/CHDL'95/VLSI'95), pages 49-54, Chiba, Japan, 1995. Nihon Gakkai Jimu Senta. [ bib | .pdf ]
Describes a powerful post-synthesis approach called reclocking, for performance improvement by minimizing the total execution time. By back annotating the wire delays of designs created by a high level synthesis system, and then finding an optimal clockwidth, we resynthesize the controller to improve performance without altering the datapath. Reclocking is versatile and can be applied not only for wire delay consideration, but also for bit-width migration, library migration and for feature size migration, supporting the philosophy of design reuse. Experimental results show that with reclocking, the performance of the input designs can be improved by as much as 34%.

Keywords: high level synthesis; logic design
[184] Matthew F. Parkinson and Sri Parameswaran. Profiling in the asp codesign environment. In Proceedings of the Eighth International Symposium on System Synthesis (ISSS '95), pages 128-33, Cannes, France, 1995. IEEE Comput. Soc. Press. [ bib | .pdf ]
Automation of the hardware/software codesign methodology brings with it the need to develop sophisticated high-level profiling tools. This paper presents a profiling tool which uses execution profiling on standard C code to obtain accurate and consistent times at the level of individual compound code sections. This tool is used in the ASP Hardware/Software Codesign project. The results from this tool show that profiling must be performed on dedicated hardware which is as close as possible to the final implementation, as opposed to a workstation.

Keywords: circuit CAD; computer architecture; software tools; systems analysis; virtual machines
[185] Seyed M. Kia and Sri Parameswaran. Design automation of self checking circuits. In European Design Automation Conference (EURO-DAC '94 + EURO-VHDL '94), pages 252-7, Grenoble, France, 1994. ACM. [ bib ]
We explain the steps of the CAD tools developed for self checking circuits. The CAD tools developed are used to design strongly fault secure, strongly code disjoint (SFS/SCD) and totally self checking, code disjoint (TSC/CD) circuits. Self checking combinatorial and sequential synchronous circuits, including shift registers, counters, adders and checkers, are designed using these tools. The output of these CAD tools is given in structural level VHDL which can be synthesized via commercial tools.

Keywords: combinatorial circuits; logic testing; sequential circuits; specification languages
[186] Seyed M. Kia and Sri Parameswaran. Novel architectures for tsc/cd and sfs/scd synchronous controllers. In Proceedings 12th IEEE VLSI Test Symposium (VTS '94), pages 138-43, Cherry Hill, NJ, USA, 1994. IEEE Comput. Soc. Press. [ bib | .pdf ]
Introduces design models for totally self checking, code disjoint (TSC/CD) and strongly fault secure, strongly code disjoint (SFS/SCD) synchronous controllers. The TSC/CD and SFS/SCD models are based on two newly proposed low-cost, modular, totally self checking (TSC), edge triggered and error propagating (code disjoint) flip-flops: one, a D flip-flop which can be used in TSC and strongly fault secure (SFS) synchronous circuits with two-rail codes; the other a T flip-flop, used in a similar way as the D flip-flop but retaining the error as an indicator until the next presetting, as an aid to error propagation.

Keywords: circuit reliability; flip-flops; logic testing; sequential circuits
[187] Sri Parameswaran, Pradip Jha, and Nikil Dutt. Resynthesizing controllers for minimum execution time. In Asia Pacific Conference in Hardware Description Languages (APCHDL '94), Toyohashi, Japan, 1994. [ bib ]
Describes a powerful post-synthesis approach called reclocking, for performance improvement by minimizing the total execution time. By back annotating the wire delays of designs created by a high level synthesis system, and then finding an optimal clockwidth, we resynthesize the controller to improve performance without altering the datapath. Reclocking is versatile and can be applied not only for wire delay consideration, but also for bit-width migration, library migration and for feature size migration, supporting the philosophy of design reuse. Experimental results show that with reclocking, the performance of the input designs can be improved by as much as 34%.

[188] Sri Parameswaran and Mark F. Schulz. Computer-aided selection of components for technology-independent specifications. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 13(11):1333-50, 1994. [ bib | .pdf ]
The specification of a synchronous circuit can be given as a set of abstract building blocks that are interconnected. A set of fast algorithms is presented here for the selection of components that map each of these abstract building blocks to one of a number of suitable physical components. The first set of algorithms selects the set of fastest or cheapest (smallest area) of all possible components. Another set of algorithms is given that will find a solution with user-defined constraints. These algorithms, which are implemented as part of the SPOT system, use an exhaustive list of timing information to increase the likelihood of a good solution.

Keywords: circuit CAD; digital circuits; logic CAD; synchronisation
[189] Matthew F. Parkinson, Paul M. Taylor, and Sri Parameswaran. C to vhdl converter in a codesign environment. In Proceedings of VHDL International Users Forum. Spring Conference (VHDL Forum '94), pages 100-9, Oakland, CA, USA, 1994. IEEE Comput. Soc. Press. [ bib | .pdf ]
Automation of the hardware/software codesign methodology brings with it the need to develop sophisticated high-level synthesis tools. This paper presents a tool which is the result of such development. This tool converts standard C code into an equivalent VHDL behavioural description. This description is used to generate a chip-level hardware interconnect of identical functionality to the original C code.

Keywords: circuit CAD; formal specification; program interpreters; specification languages
[190] Tim Healy and Sri Parameswaran. Bomara: A boltzmann machine for register allocation and interconnection. In 11th Australian Microelectronics Conference. Microelectronics, Meeting the Needs of Modern Technology (MICRO '93), pages 69-74, Gold Coast, Qld., Australia, 1993. Inst. Radio & Electron. Eng. [ bib ]
Most register allocation methods only minimise the number of registers. BoMaRA, a program using probabilistic methods for register allocation, attempts to minimise both the number of registers and the number of interconnections that arise when multiple variables are stored in a single register.

Keywords: Boltzmann machines; data flow graphs; logic CAD; logic partitioning; minimisation of switching nets; multiprocessor interconnection networks; probabilistic logic; shift registers
[191] Seyed M. Kia and Sri Parameswaran. Automated self checking system using vhdl. In Asia Pacific Conference on Hardware Description Languages (APCHDL '93), pages 131-135, Brisbane, Australia, 1993. APCHDL Conference Secretariat. [ bib ]
[192] Seyed M. Kia and Sri Parameswaran. Synchronous tsc/cd error indicator for self checking systems. In Pacific Rim International Symposium on Fault tolerant Computing (PRISFC '93), pages 156-160, Melbourne, Australia, 1993. [ bib ]
[193] Sri Parameswaran and Adam Postula. Proceedings of the first asia pacific conference on hardware description languages (apchdl '93). Brisbane, Australia, 1993. APCHDL Conference Secretariat. [ bib ]
[194] Sri Parameswaran and Mark F. Schulz. A critical look at adaptive logic networks. In Proceedings of the Fourth Australian Conference on Neural Networks (ACNN'93), pages 102-5, Melbourne, Vic., Australia, 1993. Sydney Univ. Electr. Eng. [ bib ]
Critically analyses adaptive logic networks (ALNs), which have been developed by W. Armstrong et al. (1979). The authors take some of the problems distributed by W. Armstrong in the Atree2 software package, and apply standard digital logic techniques to the same problems. From the set of tests done, it is concluded that the problems examined can be solved by standard digital logic techniques and Hamming error code correction. The standard digital logic techniques coupled with Hamming error code correction produce a minimized combinational logic up to 100 times faster than ALNs.

Keywords: adaptive systems; error correction codes; logic circuits; neural nets
[195] Matthew F. Parkinson, Paul M. Taylor, and Sri Parameswaran. A profiler for automated translation of signal processing algorithms into high speed hardware/software hybrid architectures. In 11th Australian Microelectronics Conference. Microelectronics, Meeting the Needs of Modern Technology (MICRO '93), pages 81-6, Gold Coast, Qld., Australia, 1993. Inst. Radio & Electron. Eng. [ bib ]
In order to perform a hardware/software codesign on an algorithm, it is essential to divide the code into hardware and software partitions. This paper presents the technique of execution profiling C code in preparation for automatic partitioning. The profiler presented overcomes many shortcomings of traditional profiling techniques.

Keywords: C language; formal specification; logic CAD; logic partitioning; program compilers; signal processing
[196] Matthew F. Parkinson, Paul M. Taylor, Sri Parameswaran, and Adam Postula. An automated hardware software codesign system using vhdl. In Asia Pacific Conference of Hardware Description Language (APCHDL '93), pages 267-280, Brisbane, Australia, 1993. APCHDL Conference Secretariat. [ bib ]
[197] Sri Parameswaran and Mark F. Schulz. Spot: an expert system for digital synthesis. In 8th Australian Conference on Microelectronics (MICROS '89), pages 95-101, Brisbane, Qld., Australia, 1989. Inst. Eng. Australia. [ bib ]
Describes an approach to the automation of digital synthesis using a knowledge based expert system named SPOT. A technology-independent description is produced from a functional description. This technology-independent description is further processed to create technology-dependent descriptions for both LSI/MSI/SSI and EPLD technologies.

Keywords: expert systems; integrated logic circuits; large scale integration; logic arrays; logic CAD
[198] Mark F. Schulz and Sri Parameswaran. An expert system for design of asics. In Conference on Computing Systems and Information Technology 1989 (CCSIT '89), pages 127-31, Sydney, NSW, Australia, 1989. Instn. Eng. Australia. [ bib ]
Describes an expert system named SPOT for the automation of digital design. The first section of the project described in this paper is used to create a technology-independent description from a structural description. The final section of the project is to create a technology-dependent description from the technology-independent description.

Keywords: application specific integrated circuits; circuit CAD; expert systems
[199] Center for History and New Media. Zotero quick start guide. [ bib | http ]
[200] Sri Parameswaran. Improving education planning in sri lanka - computer science and mathematics. Technical report, Asian Development Bank, June 1999. [ bib ]
[201] Sri Parameswaran. Proposal for a centre for system-on-a-chip research. Technical report, The University of Queensland / State Department of Queensland, September 1998. [ bib ]
[202] Sri Parameswaran. Strategies for the indian higher education student market in engineering/it. Technical report, The University of Queensland, June 1998. [ bib ]
[203] Sri Parameswaran. Proposed new horizons diploma & professional masters program in information technology. Technical report, The University of Queensland, September 1998. [ bib ]
[204] Sri Parameswaran, Jorgen Peddersen, and Ashley Partis. A low power chip architecture. [ bib ]

This file was generated by bibtex2html 1.95.