Joerg Henkel & Sri Parameswaran, editors. Designing embedded processors: A low power perspective. Springer, 2007. [ bib | .pdf ]

Seng Lin Shee & Sri Parameswaran. Architectural Exploration of Heterogeneous Multiprocessor Systems for JPEG. International Journal of Parallel Processing, vol. 35, 2007. [ bib | .pdf ]

Multicore processors have been utilized in embedded systems and general computing applications for some time. However, these multicore chips execute multiple applications concurrently, with each core carrying out a particular task in the system. Such systems can be found in gaming, automotive real-time systems and video/image encoding devices. These systems are commonly deployed to overcome deadline misses, which are primarily due to overloading of a single multitasking core. In this paper, we explore the use of multiple cores for a single application, as opposed to multiple applications executing in a parallel fashion. A single application is parallelized using two different methods: one, a master-slave model; and two, a sequential pipeline model. The systems were implemented using Tensilica's Xtensa LX processors with queues as the means of communication between two cores. In the master-slave model, we utilized a coarse-grained approach whereby a main core distributes the workload to the remaining cores and reads the processed data before writing the results back to file. In the pipeline model, a lower granularity is used. The application is partitioned into multiple sequential blocks, each block representing a stage in a sequential pipeline. For both models we applied a number of differing configurations ranging from a single core to a nine-core system. We found that without any optimization for the seven-core system, the sequential pipeline approach has a more efficient area usage, with an area increase to speedup ratio of 1.83 compared to 4.34 for the master-slave approach. With selective optimization in the pipeline approach, we obtained speedups of up to 4.6 times with an area increase of only 3.1 times (area increase to speedup ratio of just 0.68).
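
The area-increase-to-speedup ratio quoted in this abstract is a plain quotient, where values below 1 mean that speedup grows faster than area. A minimal sketch using the rounded figures from the abstract follows; the function name is illustrative and does not come from the paper.

```python
def area_to_speedup_ratio(area_increase: float, speedup: float) -> float:
    """Return area increase divided by speedup (lower is better)."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return area_increase / speedup


# Rounded figures quoted in the abstract for the optimized pipeline system.
ratio = area_to_speedup_ratio(area_increase=3.1, speedup=4.6)
print(f"area increase / speedup = {ratio:.2f}")
```

With these rounded inputs the quotient comes out at roughly 0.67, close to the reported 0.68, which was presumably computed from unrounded measurements.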

Seng Lin Shee & Sri Parameswaran. Design Methodology for Pipelined Heterogeneous Multiprocessor System. In Design Automation Conference (DAC'07), page 6pp, San Diego, CA, USA, 2007. [ bib | .pdf ]

Multiprocessor SoC systems have led to the increasing use of parallel hardware along with the associated software. These approaches have included coprocessor, homogeneous processor (e.g. SMP) and application specific architectures (e.g. DSP, ASIC). ASIPs have emerged as a viable alternative to conventional processing entities (PEs) due to their configurability and programmability. In this work, we introduce a heterogeneous multi-processor system using ASIPs as processing entities in a pipeline configuration. A streaming application is taken and manually broken into a series of algorithmic stages (each of which makes up a stage in a pipeline). We formulate the problem of mapping each algorithmic stage in the system to an ASIP configuration, and propose a heuristic to efficiently search the design space for a pipeline-based multi-ASIP system. We have implemented the proposed heterogeneous multiprocessor methodology using a commercial extensible processor (Xtensa LX from Tensilica Inc.). We have evaluated our system by creating two benchmarks (MP3 and JPEG encoders) which are mapped to our proposed design platform. Our multiprocessor design provided a performance improvement of at least 4.11X (JPEG) and 3.36X (MP3) compared to the single processor design. The minimum cost obtained through our heuristic was within 5.47 respectively.

Jude Angelo Ambrose, Roshan Ragel & Sri Parameswaran. RIJID: Random Code Injection to Mask Power Analysis based Side Channel Attacks. In Design Automation Conference (DAC '07), page 6pp, San Diego, CA, USA, 2007. [ bib | .pdf ]

Side channel attacks are becoming a major threat to the security of embedded systems. Countermeasures proposed to overcome Simple Power Analysis and Differential Power Analysis are data masking, table masking, current flattening, circuitry level solutions, dummy instruction insertions and balancing bit-flips. All these techniques are either susceptible to multi-order side channel attacks, not sufficiently generic to cover all encryption algorithms, or burden the system with high area cost, run-time or energy consumption. A HW/SW based randomized instruction injection technique is proposed in this paper to overcome the pitfalls of previous countermeasures. Our technique injects random instructions at random places during the execution of an application which protects the system from both SPA and DPA. Further, we devise a systematic method to measure the security level of a power sequence and use it to measure the number of random instructions needed to suitably confuse the adversary. Our processor model costs 1.9 processor, and costs on average 29.8 energy consumption for six industry standard cryptographic algorithms.

Yee Jern Chong & Sri Parameswaran. Automatic Application Specific Floating-point Unit Generation. In Design, Automation and Test in Europe (DATE'07) Conference, Nice, France, 2007. IEEE. [ bib | .pdf ]

This paper describes the creation of custom floating point units (FPUs) for Application Specific Instruction Set Processors (ASIPs). ASIPs allow the customization of processors for use in embedded systems, by extending the instruction set which enhances the performance of an application or a class of applications. These extended instructions are manifested as separate hardware blocks, making the creation of any necessary floating point instructions quite unwieldy. On the other hand, using a predefined FPU includes a large monolithic hardware block with a considerable number of unused instructions. A customized FPU will overcome these drawbacks, yet the manual creation of one is a time consuming, error prone process. This paper presents a methodology for automatically generating floating-point units (FPUs) that are customized for specific applications at the instruction level. Custom FPUs were generated for several Mediabench applications. Area savings over a fully-featured FPU without resource sharing of 26 resource sharing and 33 Clock period increased in some cases by up to 9.5 sharing.

Andhi Janapsatya, Aleksandar Ignjatovic, Sri Parameswaran & Joerg Henkel. Instruction Trace Compression for Rapid Instruction Cache Simulation. In Design, Automation and Test in Europe (DATE'07) Conference, page 6pp. IEEE, 2007. [ bib | .pdf ]

Ivan Siu-Chuang Lu, Neil Weste & Sri Parameswaran. A power-efficient 5.6-GHz process-compensated CMOS frequency divider. IEEE Transactions on Circuits and Systems II, vol. 54, no. 4, pages 323-327, 2007. [ bib | .pdf ]

This brief presents a robust, power efficient CMOS frequency divider for the 5-GHz UNII band. The divider operates as a voltage controlled ring oscillator with the output frequency modulated by the switching of the input transmission gate. The divider, designed in a 0.25-µm SOS-CMOS technology, occupies 35 × 25 µm² and exhibits an operating frequency of 5.6 GHz while consuming 79 µW at a supply voltage of 0.8 V. Process and temperature tolerant operation can be achieved by utilizing a novel compensation circuitry to calibrate the speed of the ring oscillator-based divider. The simple compensation circuitry contains low-speed digital logic and dissipates minimal additional power since it is powered on only during the one-time factory calibration sequence.

Jorgen Peddersen & Sri Parameswaran. CLIPPER: Counter-based Low Impact Processor Power Estimation at Run-time. In 12th Asia and South Pacific Design Automation Conference (ASP-DAC 2007), pages 890-895, Yokohama, Japan, 2007. IEEE. [ bib | .pdf ]

Numerous dynamic power management techniques have been proposed which utilize the knowledge of processor power/energy consumption at run-time. So far, no efficient method to provide run-time power/energy data has been presented. Current measurement systems draw too much power to be used in small embedded designs and existing performance counters cannot provide sufficient information for run-time optimization. This paper presents a novel methodology to solve the problem of run-time power optimization by designing a processor that estimates its own power/energy consumption. Estimation is performed by the addition of small counters that tally events which consume power. This methodology has been applied to an existing processor resulting in an average power error of 2 adds little impact to the design, with only a 4.9 area and a 3 of an application that utilizes the processor showcases the benefits the methodology enables in dynamic power optimization.
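
The counter-based estimation described in this abstract can be pictured as a weighted sum of event counts. The sketch below is only an illustration of that idea: the event names and per-event energy costs are hypothetical, and the abstract does not say which events the processor counts or how the weights are calibrated.

```python
from collections import Counter

# Hypothetical per-event energy costs in nanojoules. A real design would
# calibrate these against measurements or gate-level simulation.
EVENT_ENERGY_NJ = {"alu_op": 0.5, "cache_miss": 4.0, "mem_access": 2.0}


def estimate_energy_nj(event_counts):
    """Weighted sum over events: count * per-event energy."""
    return sum(EVENT_ENERGY_NJ[name] * n for name, n in event_counts.items())


def estimate_power_mw(event_counts, elapsed_ns):
    """Average power over the interval (nJ / ns is W, scaled to mW)."""
    return estimate_energy_nj(event_counts) / elapsed_ns * 1000.0


# Counter values as they might be read back at the end of an interval.
counts = Counter(alu_op=1000, cache_miss=10, mem_access=200)
energy = estimate_energy_nj(counts)
power = estimate_power_mw(counts, elapsed_ns=10_000)
print(f"{energy:.1f} nJ over 10 us, about {power:.1f} mW on average")
```

Keeping the model to one multiply-accumulate per event type is what makes such an estimate cheap enough to evaluate at run-time.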

Jorgen Peddersen & Sri Parameswaran. Energy driven Application Adaptation at Run-Time. In 20th International Conference on VLSI DESIGN, pages 385-390, Bangalore, India, 2007. [ bib | .pdf ]

Until recently, there has been a lack of methods to trade off energy use for quality of service at run-time in stand-alone embedded systems. Such systems are motivated by the need to increase the apparent available battery energy of portable devices, with minimal compromise in quality. The available systems either drew too much power or added considerable overheads due to task swapping. In this paper we demonstrate a feasible method to perform these trade-offs. This work has been enabled by a low-impact power/energy estimating processor which utilizes counters to estimate power and energy consumption at run-time. Techniques are shown that modify multimedia applications to vary the fidelity of their output to optimize the energy/quality trade-off. Two adaptation algorithms are applied to multimedia applications demonstrating the efficacy of the method. The method increases code size by 1 execution time by 0.02 acceptable and processes up to double the number of frames.

Andhi Janapsatya, Aleksandar Ignjatovic & Sri Parameswaran. Exploiting Statistical Information for Implementation of Instruction Scratchpad Memory in Embedded System. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 14, no. 8, pages 816-29, 2006. [ bib | .pdf ]

A method to both reduce energy and improve performance in a processor-based embedded system is described in this paper. Comprising a scratchpad memory instead of an instruction cache, the target system dynamically (at runtime) copies into the scratchpad code segments that are determined to be beneficial (in terms of energy efficiency and/or speed) to execute from the scratchpad. We develop a heuristic algorithm to select such code segments based on a metric, called concomitance. Concomitance is derived from the temporal relationships of instructions. A hardware controller is designed and implemented for managing the scratchpad memory. Strategically placed custom instructions in the program inform the hardware controller when to copy instructions from the main memory to the scratchpad. A novel heuristic algorithm is implemented for determining locations within the program where to insert these custom instructions. For a set of realistic benchmarks, experimental results indicate the method uses 41.9 (on average) and improves performance by 40.0 compared to a traditional cache system which is identical in size.

Andhi Janapsatya, Aleks Ignjatovic & Sri Parameswaran. Finding optimal L1 cache configuration for embedded systems. In Asia South Pacific Design Automation Conference (ASPDAC 2006), pages 796-801, Yokohama, Japan, 2006. [ bib | .pdf ]

Modern embedded systems execute a single application or a class of applications repeatedly. An emerging methodology of designing embedded systems utilizes configurable processors where the cache size, associativity, and line size can be chosen by the designer. In this paper, a method is given to rapidly find the L1 cache miss rate of an application. An energy model and an execution time model are developed to find the best cache configuration for the given embedded application. Using benchmarks from Mediabench, we find that our method is on average 45 times faster to explore the design space, compared to Dinero IV while still having 100% accuracy.
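
To make the design space concrete, the sketch below sweeps cache size, line size and associativity over an address trace and reports the miss rate of each configuration. It is a plain trace-driven LRU simulation, i.e. the slow baseline style of exploration, and not the rapid method of the paper; the trace and the parameter ranges are hypothetical.

```python
def miss_rate(addresses, cache_bytes, line_bytes, ways):
    """Simulate an LRU set-associative cache and return its miss rate."""
    num_sets = cache_bytes // (line_bytes * ways)
    sets = [[] for _ in range(num_sets)]  # tags per set, most recent last
    misses = 0
    for addr in addresses:
        block = addr // line_bytes
        index, tag = block % num_sets, block // num_sets
        lines = sets[index]
        if tag in lines:
            lines.remove(tag)        # hit: re-appended below as most recent
        else:
            misses += 1
            if len(lines) == ways:
                lines.pop(0)         # evict the least recently used tag
        lines.append(tag)
    return misses / len(addresses)


# Hypothetical trace: a loop sweeping a 256-byte region four times.
trace = [a for _ in range(4) for a in range(0, 256, 4)]
configs = [(size, line, ways)
           for size in (128, 256) for line in (16, 32) for ways in (1, 2)]
results = {cfg: miss_rate(trace, *cfg) for cfg in configs}
best = min(results, key=results.get)
for cfg, rate in results.items():
    print(cfg, f"{rate:.4f}")
print("lowest miss rate:", best)
```

A full exploration would feed these miss rates into energy and execution-time models rather than picking the lowest miss rate directly.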

Andhi Janapsatya, Aleks Ignjatovic & Sri Parameswaran. A novel instruction scratchpad memory optimization method based on concomitance metric. In Asia South Pacific Design Automation Conference (ASPDAC 2006), page 6 pp., Yokohama, Japan, 2006. [ bib | .pdf ]

Scratchpad memory has been introduced as a replacement for cache memory as it improves the performance of certain embedded systems. Additionally, it has also been demonstrated that scratchpad memory can significantly reduce the energy consumption of the memory hierarchy of embedded systems. This is significant, as the memory hierarchy consumes a substantial proportion of the total energy of an embedded system. This paper deals with optimization of the instruction memory scratchpad based on a methodology that uses a metric which we call the concomitance. This metric is used to find basic blocks which are executed frequently and in close proximity in time. Once such blocks are found, they are copied into the scratchpad memory at appropriate times; this is achieved using a special instruction inserted into the code at appropriate places. For a set of benchmarks taken from Mediabench, our scratchpad system consumed just 59 cache system, and 73 scratchpad system, while improving the overall performance. Compared to the state of the art method, the number of instructions copied into the scratchpad memory from the main memory is reduced by 88%.

Ivan Siu-Chuang Lu, Neil Weste & Sri Parameswaran. ADC precision requirement for digital ultra-wideband receivers with sublinear front-ends: a power and performance perspective. In 19th International Conference on VLSI Design held jointly with 5th International Conference on Embedded Systems and Design (VLSI Design '06), page 6 pp., Hyderabad, India, 2006. IEEE Computer Society. [ bib | .pdf ]

This paper presents the power and performance analysis of a digital, direct sequence ultra-wideband (DS-UWB) receiver operating in the 3 to 4 GHz band. The signal to noise and distortion ratio (SNDR) and bit error rate (BER) were evaluated with varying degrees of front-end linearity and analog to digital converter (ADC) accuracy. The analysis and simulation results indicate two or more ADC bits are required for reliable data reception in the presence of strong interference and intermodulation distortion. In addition to BER performance, power consumption of different hardware configurations is also evaluated to form the cost function for evaluating design choices. The combined power and performance analysis indicates that starting with one-bit ADC resolutions, a substantial gain in reliability can be attained by increasing ADC resolution to two bits or more. When the ADC resolution improves beyond three bits, front-end linearization achieves similar BER improvements to increasing the ADC accuracy, at a fraction of the power cost. As a result, linear front-end designs become significant only when high precision ADCs are utilized.

Sri Parameswaran, Joerg Henkel & Newton Cheung. Instruction Matching and Modelling. In Paolo Ienne & Rainer Leupers, editors, Customizable and Configurable Embedded Processors. Elsevier, 2006. [ bib ]

Swarnalatha Radhakrishnan, Hui Guo & Sri Parameswaran. Customization of Application Specific Heterogeneous MultiPipeline Processors. In Design, Automation and Test in Europe Conference and Exhibition (DATE '06), page 6 pages, Munich, Germany, 2006. IEEE Comput. Soc. [ bib | .pdf ]

In this paper we propose application specific instruction set processors with heterogeneous multiple pipelines to efficiently exploit the available parallelism at instruction level. We have developed a design system based on the Thumb processor architecture. Given an application specified in the C language, the design system can generate a processor with a number of pipelines specifically suitable to the application, and the parallel code associated with the processor. Each pipeline in such a processor is customized, and implements its own special instruction set so that the instructions can be executed in parallel with low hardware overhead. Our simulations and experiments with a group of benchmarks, largely from the MiBench suite, show that on average, 77 pipeline ASIP, with the overheads of 49 power, 17% on switching activity, and 69% on code size.

Swarnalatha Radhakrishnan, Hui Guo, Sri Parameswaran & Aleksandar Ignjatovic. Application Specific Forwarding Network and Instruction Encoding for Multi-pipe ASIPs. In International Conference on Hardware/Software Codesign and Systems Synthesis (CODES + ISSS '06), page 6, Seoul, Korea, 2006. [ bib ]

Small area and code size are two critical design issues in most embedded system designs. In this paper, we tackle these issues by customizing forwarding networks and instruction encoding schemes for multi-pipe Application Specific Instruction-Set Processors (ASIPs). Forwarding is a popular technique to reduce data hazards in the pipeline to improve performance and is applied in almost all modern processor designs, but it is very expensive in area. Instruction encoding schemes have a direct impact on code size; an efficient encoding method can lead to a small instruction width, hence reducing the code size. We propose application specific techniques to reduce forwarding networks and instruction widths for ASIPs with multiple pipelines. By these design techniques, it is possible to reduce area, code size, and even power consumption (due to reduced area), without costing any performance. Our experiments, on a set of benchmarks using the proposed customization approaches show that, on average, there are 27 on area, 30 time, performance even improves by 4 period.

Roshan Ragel & Sri Parameswaran. Hardware Assisted Pre-emptive Control Flow Checking for Embedded Processors to improve Reliability. In International Conference on Hardware/Software Codesign and Systems Synthesis (CODES + ISSS '06), page 6, Seoul, Korea, 2006. [ bib | .pdf ]

Roshan G. Ragel & Sri Parameswaran. IMPRES: Integrated Monitoring for Processor Reliability and Security. In Design Automation Conference. (DAC '06), page 4 pages, San Francisco, CA, USA, 2006. ACM. [ bib | .pdf ]

Security and reliability in processor based systems are concerns requiring adroit solutions. Security is often compromised by code injection attacks, jeopardizing even `trusted software'. Reliability is of concern where unintended code is executed in modern processors with ever smaller feature sizes and low voltage swings causing bit flips. Countermeasures by software-only approaches increase code size by large amounts and therefore significantly reduce performance. Hardware assisted approaches add extensive amounts of hardware monitors and thus incur unacceptably high hardware cost. This paper presents a novel hardware/software technique at the granularity of micro-instructions to reduce overheads considerably. Experiments show that our technique incurs an additional hardware overhead of 0.91 increase of 0.06 just 11.9 These overheads are far smaller than have been previously encountered.

Seng Lin Shee, Andrea Erdos & Sri Parameswaran. Heterogeneous Multiprocessor Implementations for JPEG : A Case Study. In International Conference on Hardware/Software Codesign and Systems Synthesis (CODES + ISSS '06), page 6, Seoul, Korea, 2006. [ bib ]

Hui Wu & Sri Parameswaran. Minimising the Energy Consumption of Real-Time Tasks with Precedence Constraints on A Single Processor. In The 2006 IFIP International Conference on Embedded And Ubiquitous Computing, page 12 pages, Seoul, Korea, 2006. Springer's Lecture Notes in Computer Science. [ bib ]

Jeremy Chan & Sri Parameswaran. NoCEE: energy macro-model extraction methodology for network on chip routers. In International Conference on Computer Aided Design (ICCAD-2005), pages 254-9, San Jose, CA, USA, 2005. IEEE. [ bib | .pdf ]

In this paper we present NoCEE, a fast and accurate method for extracting energy models for packet-switched network on chip (NoC) routers. Linear regression is used to model the relationship between events occurring in the NoC and energy consumption. The resulting models are cycle accurate and can be applied to different technology libraries. We verify the individual router estimation models with many different synthetically generated traffic patterns and data inputs. Characterization of a small library takes about two hours. The mean absolute energy estimation error of the resultant models is 5 a complete gate level simulation. We also apply this method to a number of complete NoCs with inputs extracted from synthetic application traces and compare our estimated results to the gate level power simulations (mean absolute error is 5 has been integrated with commercial logic synthesis flow and power estimation tools (Synopsys Design Compiler and PrimePower), allowing application across different designs. The extracted models show the different trends across various parameterizations of network on chip routers and have been integrated into an architecture exploration framework.
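
As an illustration of fitting an energy model by linear regression, the sketch below performs ordinary least squares on a single event type. It is deliberately simplified: the paper relates several NoC events to energy, whereas this example uses one hypothetical event count (flits forwarded) and made-up energy readings.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical characterization data: flits forwarded per window and the
# energy a reference (e.g. gate-level) simulation reported for that window.
flits = [0, 10, 20, 30, 40]
energy_pj = [5.0, 25.0, 45.0, 65.0, 85.0]

slope, intercept = fit_line(flits, energy_pj)
predicted = slope * 25 + intercept  # model estimate for an unseen window
print(f"energy ~ {slope:.2f} pJ/flit * flits + {intercept:.2f} pJ")
print(f"estimate for 25 flits: {predicted:.2f} pJ")
```

Here the slope plays the role of a per-event energy cost and the intercept that of a fixed energy per window; a multi-event model would fit one coefficient per event type.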

Newton Cheung, Sri Parameswaran & Joerg Henkel. Battery aware instruction generation for embedded processors. In Asia South Pacific Design Automation Conference (ASP-DAC '05), volume 1, pages 553-556, Shanghai, China, 2005. [ bib | .pdf ]

Automatic instruction generation is an efficient method to satisfy growing performance and meet design constraints for application specific instruction-set processors. A typical approach for instruction generation is to combine a large group of primitive instructions into a single extensible instruction for maximizing speedups. However, this approach often leads to large power dissipation and discharge current, posing a challenge to battery-powered products. In this paper, we propose a battery-aware automatic tool to design extensible instructions which minimizes power dissipation distribution by separating an instruction into multiple instructions. We verify our automatic tool using 50 different code segments, and five large real-world applications. Our tool reduces energy consumption by a further 5.8 (up to 17.7 approaches. For real-world applications, energy consumption is reduced by 6.6 for most cases. The automatic instruction generation tool is integrated into our application specific instruction-set processor tool suite.

Hui Guo & Sri Parameswaran. Balancing system level pipelines with stage voltage scaling. In IEEE Computer Society Annual Symposium on VLSI (IVLSI '05), pages 287-289, Tampa, FL, USA, 2005. IEEE Comput. Soc. [ bib | .pdf ]

This paper presents an approach to dynamically balance the pipeline by scaling the stage supply voltages. Simulation results show that by such an approach about 50 time, and 11 limited memory overhead

Ivan S. C. Lu, Neil Weste & Sri Parameswaran. The effect of receiver front-end non-linearity on DS-UWB systems operating in the 3 to 4 GHz band. In 2005 IEEE Wireless Communications and Networking Conference (WCNC '05), volume 2, pages 776-81, New Orleans, LA, USA, 2005. IEEE. [ bib | .pdf ]

The paper presents a performance analysis of direct sequence ultra wideband (DS-UWB) systems operating with non-linear receiver front-ends. Following this analysis, we propose the novel use of pulse doublets to mitigate non-linearity induced distortion. The signal-to-noise-and-distortion ratio (SNDR) and bit error rate (BER) are evaluated with varying degrees of non-linearity and interference power. Simulation results indicate significant performance improvements by using pulse doublets under high interference power and non-linear operating conditions. Using pulse doublets allows reduced front-end linearity requirements and enables improvements in more critical circuit parameters. Front-end modules, such as low noise amplifiers (LNAs), mixers and baseband amplifiers, are designed using Peregrine's 0.5 µm SOS-CMOS process to demonstrate the benefits of circuits designed with relaxed linearity requirements. Simulation results obtained using the Cadence Spectre RF simulator indicate that the sub-linear front-end achieves 33 dB increase in voltage gain, 2 dB improvement in noise figure, 64 in power and 917 MHz extension in bandwidth over its more linear counterpart.

Sri Parameswaran & Joerg Henkel. Instruction code mapping for performance increase and energy reduction in embedded computer systems. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 13, no. 4, pages 498-502, 2005. [ bib | .pdf ]

In this paper, we present a novel and fast constructive technique that relocates the instruction code in the main memory in such a manner that the cache is utilized more efficiently. The technique is applied as a preprocessing step, i.e., before the code is executed. Our technique is applicable in embedded systems where the number and characteristics of tasks running on the system are known a priori. The technique does not impose any computational overhead on the system. As a result of applying our technique to a variety of real-world applications we observed through simulation a significant drop in cache misses. Furthermore, the energy consumption of the whole system (CPU, caches, buses, main memory) is reduced by up to 65 benefits could be achieved by a slightly increased main memory size of about 13% on average.

Sri Parameswaran, Jorgen Peddersen & Ashley Partis. A Modular Approach to TCP/IPv6 Hardware Implementation, 2 June 2005. [ bib ]

Sri Parameswaran, Jorgen Peddersen & Ashley Partis. Low Power Chip Architecture, 2005. [ bib ]

Jorgen Peddersen, Seng Lin Shee, Andhi Janapsatya & Sri Parameswaran. Rapid embedded hardware/software system generation. In 18th International Conference on VLSI Design (VLSI Design '05), pages 111-16, Kolkata, India, 2005. IEEE Computer Soc. [ bib | .pdf ]

This paper presents an RTL generation scheme for a SimpleScalar/PISA instruction set architecture with system calls to implement C programs. The scheme utilizes ASIPmeister, a processor generation tool. The RTL generated is available for download. The second part of the paper shows a method of reducing the PISA instruction set and generating a processor for a given application. This reduction and generation can be performed within an hour, making this one of the fastest methods of generating an application specific processor. For five benchmark applications, we show that on average, processor size can be reduced by 30 by 24%

Swarnalatha Radhakrishnan, Hui Guo & Sri Parameswaran. n-pipe: Application Specific Heterogeneous Multi-Pipeline Processor Design. In Workshop on Application Specific Processors (WASP '05), page 8 pages, Jersey City, NJ, USA, 2005. [ bib ]

In this paper we propose Application Specific Instruction Set Processors with heterogeneous multiple pipelines to efficiently exploit the available parallelism at instruction level. We have developed a design system based on the Thumb processor architecture. Given an application specified in C language, the design system can generate a processor with a number of pipelines specifically suitable to the application and the parallel code associated with the processor. Each pipeline in such a processor is customized, and implements its own special instruction set so that the instructions can be executed in parallel with low hardware overhead. Our simulations and experiments with a group of benchmarks, largely from Mibench suite, show that on average, 77 performance improvement can be achieved compared to a single pipeline ASIP, with the overheads of 49 activity, and 69% on code size.

Roshan Ragel, Sri Parameswaran & Sayed Kia. Micro embedded monitoring for security in application specific instruction-set processors. In International Conference on Compilers, Architecture, and Synthesis for Embedded Systems (CASES '05), pages 304-314, San Francisco, CA, USA, 2005. ACM. [ bib | .pdf ]

This paper presents a methodology for monitoring security in Application Specific Instruction-set Processors (ASIPs). This is a generalized methodology for inline monitoring of insecure operations in machine instructions at microinstruction level. Microinstructions are embedded into the critical machine instructions forming self checking instructions. We name this method Micro Embedded Monitoring. Since ASIPs are designed exclusively for a particular application domain, the Instruction Set Architecture (ISA) of an ASIP is based on the application executed. Knowledge of the domain gives an insight into the kinds of security threats which need to be considered. The fact that the ISA design is based on the application makes room to accommodate security monitoring support during the design phase by embedding microinstructions into the critical machine instructions. Since the microinstructions are the lowest possible software level architecture, we could expect to get better performance by implementing security detection using microinstruction routines. Four different embedded security monitoring routines are implemented for evaluation. The average performance penalty with these monitoring routines with ten different benchmarks is 1.93 3.07% respectively.

Seng Lin Shee, Sri Parameswaran & Newton Cheung. Architecture for loop acceleration: a case study. In International Conference on Hardware/Software Codesign and System Synthesis (CODES + ISSS '05), pages 297-302, Jersey City, NJ, USA, 2005. IEEE. [ bib | .pdf ]

In this paper, we show a novel approach to accelerate loops by tightly coupling a coprocessor to an ASIP. Latency hiding is used to exploit the parallelism available in this architecture. To illustrate the advantages of this approach, we investigate a JPEG encoding algorithm and accelerate one of its loops by implementing it in a coprocessor. We contrast the acceleration by implementing the critical segment as two different coprocessors and a set of customized instructions. The two different coprocessor approaches are: a high-level synthesis (HLS) approach; and a custom coprocessor approach. The HLS approach provides a faster method of generating coprocessors. We show that a loop performance improvement of 2.57× is achieved using the custom coprocessor approach, compared to 1.58× for the HLS approach and 1.33× for the customized instruction approach compared with just the main processor. Respective energy savings within the loop are 57%, 28% and 19%.

Jeremy Chan & Sri Parameswaran. NoCGEN: a template based reuse methodology for Networks On Chip architecture. In 17th International Conference on VLSI Design (VLSI Design '04), pages 717-20, Mumbai, India, 2004. IEEE Comput. Soc. [ bib | .pdf ]

In this paper, we describe NoCGEN, a Network On Chip (NoC) generator, which is used to create a simulatable and synthesizable NoC description. NoCGEN uses a set of modularised router components that can be used to form different routers with a varying number of ports, routing algorithms, data widths and buffer depths. A graph description representing the interconnection between these routers is used to generate a top-level VHDL description. A wormhole output-queued 2-D mesh router was created to verify the capability of NoCGEN. Various parameterized designs were synthesized to provide estimated gate counts of 129 K to 695 K for a number of topologies varying from a 2×2 mesh to a 4×4 mesh, with constant data bus size width of 32. The NoC was simulated with random traffic using a mixed SystemC/VHDL environment to ensure correctness of operation and to obtain performance and average latency. The results show an accepted load of 53 depth from 8 to 32 flits for the 4×4 mesh router

Newton Cheung, Sri Parameswaran & Joerg Henkel. A Quantitative Study and Estimation Models for Extensible Instructions in Embedded Processors. In International Conference on Computer Aided Design (ICCAD '04), pages 183-189, San Jose, CA, USA, 2004. IEEE. [ bib | .pdf ]

Designing extensible instructions is a computationally complex task, due to the large design space each instruction is exposed to. One method of speeding up the design cycle is to characterize instructions and estimate their peculiarities during a design exploration. In this paper, we study and derive three estimation models for extensible instructions: area overhead, latency, and power consumption under a wide range of customization parameters. System decomposition and regression analysis are used as the underlying methods to characterize and analyze extensible instructions. We verify our estimation models using automatically and manually generated extensible instructions, plus extensible instructions used in large real-world applications. The mean absolute errors of our estimation models are as small as: 3.4 and 4.2 through the time consuming synthesis and simulation steps using commercial tools. Our estimation models achieve an average speedup of three orders of magnitude over the commercial tools and thus enable us to conduct a fast and extensive design space exploration that would otherwise not be possible. The estimation models are integrated into our extensible processor tool suite.

Newton Cheung, Sri Parameswaran, Joerg Henkel & Jeremy Chan. MINCE: matching instructions using combinational equivalence for extensible processor. In Design, Automation and Test in Europe Conference and Exhibition (DATE '04), volume 2, pages 1020-5, Paris, France, 2004. IEEE Comput. Soc. [ bib | .pdf ]

Designing custom-extensible instructions for extensible processors is a computationally complex task because of the large design space. The task of automatically matching candidate instructions in an application (e.g. written in a high-level language) to a pre-designed library of extensible instructions is especially challenging. Previous approaches have focused on identifying extensible instructions (e.g. through profiling), synthesizing extensible instructions, estimating expected performance gains etc. In this paper we introduce our approach of automatically matching extensible instructions as this key step is missing in automating the entire design flow of an ASIP with extensible instruction capabilities. Since matching using simulation is practically infeasible (simulation time), and traditional pattern matching approaches would not yield reliable results (ambiguity related to a functionally equivalent code that can be represented in many different ways), we adopt combinational equivalence checking. Our MINCE tool as part of our ASIP design flow consists of a translator, a filtering algorithm and a combinational equivalence checking tool. We report matching times of extensible instructions that are 7.3x faster on average (using Mediabench applications) compared to the best known approaches to the problem (partial simulations). In all our experiments MINCE matched correctly and the outcome of the matching step yielded an average speedup of the application of 2.47x. As a summary, our work represents a key step towards automating the whole design flow of an ASIP with extensible instruction capabilities

Andhi Janapsatya, Sri Parameswaran & Joerg Henkel. REMcode: relocating embedded code for improving system efficiency. IEE Proceedings-Computers and Digital Techniques, vol. 151, no. 6, pages 457-65, 2004. [ bib | .pdf ]

The memory hierarchy subsystem has a significant impact on performance and energy consumption of an embedded system. Methods which increase the hit ratio of the cache hierarchy will typically enhance the performance and reduce the embedded system's total energy consumption. This is mainly due to reduced cache-to-memory bus transactions, fewer main memory accesses and fewer processor waiting cycles. A heuristic approach is presented to reduce the total number of cache misses by carefully relocating selected sections of the application's software code within the main memory, thus reducing conflict misses resulting from the cache hierarchy. The method requires no hardware modifications i.e. it is a software-only approach. For the first time such a method is applied to large program traces, and the miss rates and corresponding energy savings are observed while varying cache size, line size and associativity. Relocating the code consistently produces superior performance on direct-mapped cache. Since direct-mapped caches, being smaller in silicon area than caches with higher associativity (for the same size), cost less in terms of energy/access, and access faster, using direct-mapped instruction cache with code relocation for performance-oriented embedded systems is recommended. A maximum cache miss rate reduction from 71 of up to 63% with only a small increase in main memory size

Andhi Janapsatya, Sri Parameswaran & Aleksander Ignjatovic. Hardware/software managed scratchpad memory for embedded system. In International Conference on Computer Aided Design (ICCAD 2004), pages 370-7, San Jose, CA, USA, 2004. IEEE. [ bib | .pdf ]

In this paper, we propose a methodology for energy reduction and performance improvement. The target system comprises an instruction scratchpad memory instead of an instruction cache. Highly utilized code segments are copied into the scratchpad memory, and are executed from the scratchpad. The copying of code segments from main memory to the scratchpad is performed during runtime. A custom hardware controller is used to manage the copying process. The hardware controller is activated by strategically placed custom instructions within the executing program. These custom instructions inform the hardware controller when to copy during program execution. Novel heuristic algorithms are implemented to determine locations within the program to insert these custom instructions, as well as to choose the best sets of code segments to be copied to the scratchpad memory. For a set of realistic benchmarks, experimental results indicate the method uses 50.7 by 53.2 which is identical in size. Cache systems compared had sizes ranging from 256 to 16K bytes and associativities ranging from 1 to 32

Swarana Radhakrishnan, Hui Guo & Sri Parameswaran. Dual-pipeline heterogeneous ASIP design. In International Conference on Hardware/Software Codesign and Systems Synthesis (CODES + ISSS '04), pages 12-17, Stockholm, Sweden, 2004. ACM. [ bib | .pdf ]

In this paper we demonstrate the feasibility of a dual pipeline application specific instruction set processor. We take a C program and create a target instruction set by compiling to a basic instruction set from which some instructions are merged, while others discarded. Based on the target instruction set, parallelism of the application program is analyzed and two unique instruction sets are generated for a heterogeneous dual-pipeline processor. The dual pipe processor is created by making two unique ASIPs (VHDL descriptions) utilizing the ASIP-Meister Tool Suite, and fusing the two VHDL descriptions to construct a dual pipeline processor. Our results show that in comparison to the single pipeline application specific instruction set processor, the performance improves by 27.6 reduces by 6.1 at the cost of increased area which for benchmarks considered is 16.7% on average

Newton Cheung, Sri Parameswaran & Joerg Henkel. Rapid Configuration & Instruction Selection for an ASIP: A Case Study. In Ahmed A. Jerraya, S. Yoo, N. Wehn & D. Verkest, editors, Embedded Software for SoC. Kluwer Publishing, 2003. [ bib ]

Newton Cheung, Sri Parameswaran & Joerg Henkel. INSIDE: INstruction Selection/Identification & Design Exploration for extensible processors. In International Conference on Computer Aided Design (ICCAD '03), pages 291-7, San Jose, CA, USA, 2003. IEEE. [ bib | .pdf ]

This paper presents the INSIDE system that rapidly searches the design space for extensible processors, given area and performance constraints of an embedded application, while minimizing the design turn-around-time. Our system consists of a) a methodology to determine which code segments are most suited for implementation as a set of extensible instructions, b) a heuristic algorithm to select pre-configured extensible processors as well as extensible instructions (library), and c) an estimation tool which rapidly estimates the performance of an application on a generated extensible processor. By selecting the right combination of a processor core plus extensible instructions, we achieve a performance increase on average of 2.03x (up to 7x) compared to the base processor core at a minimum hardware overhead of 25% on average

Newton Cheung, Sri Parameswaran & Joerg Henkel. Rapid configuration and instruction selection for an ASIP: a case study. In Design, Automation and Test in Europe Conference and Exhibition (DATE '03), pages 802-7, Munich, Germany, 2003. IEEE Comput. Soc. [ bib | .pdf ]

We present a methodology that maximizes the performance of Tensilica based Application Specific Instruction-set Processor (ASIP) through instruction selection when an area constraint is given. Our approach rapidly selects from a set of pre-fabricated coprocessors/functional units from our library of pre-designed specific instructions (to evaluate our technology we use the Tensilica platform). As a result, we significantly increase application performance while area constraints are satisfied. Our methodology uses a combination of simulation, estimation and a pre-characterised library of instructions, to select the appropriate co-processors and instructions. We report that by selecting the appropriate coprocessors/functional units and specific TIE instructions, the total execution time of complex applications (we study a voice encoder/decoder), an application's performance can be reduced by up to 85 Our estimator used in the system takes typically less than a second to estimate, with an average error rate of 4 simulation, which takes 45 minutes). The total selection process using our methodology takes 3-4 hours, while a full design space exploration using simulation would take several days

Ivan S. C. Lu, Neil Weste & Sri Parameswaran. A digital ultra-wideband multiband transceiver architecture with fast frequency hopping capabilities. In 2003 IEEE Conference on Ultra Wideband Systems and Technologies (UWST '03), pages 448-52, Reston, VA, USA, 2003. IEEE. [ bib | .pdf ]

This paper presents for the first time the circuit parameter analysis of a digital multiband-UWB transceiver, encompassing a novel low-power sub-band generator. This sub-band generator is capable of producing multiple frequency bands, enabling sub-band generation from 3 to 10 GHz with nanosecond switching times. The circuit analysis of the complete transceiver is used to set parameters of components. The analysis indicates that an LNA gain of 20 dB, baseband amplifier gain of 45 dB, matched filter accuracy of five bits, ADC accuracy of two bits, a 60 dB dynamic range of the multi-frequency generator, and front end offset voltage of less than 30 mV are required to achieve a 10 dB SNR. Hspice simulation utilizing 0.35 µm CMOS technology suggests that the power consumption of the sub-band generator is 8 mW from a 1.8 V power supply

Sri Parameswaran, Joerg Henkel & Haris Lekatsas. Multi-parametric improvements for embedded systems using code-placement and address bus coding. In Asia and South Pacific Design Automation Conference 2003 (ASP-DAC '03), pages 15-21, Kitakyushu, Japan, 2003. IEEE. [ bib | .pdf ]

Code placement techniques for instruction code have been shown to increase an SoC's performance, mostly due to increased cache hit ratios, and as such those techniques can be a major optimization strategy for embedded systems. Little has been investigated on the interdependencies between code placement techniques and interconnect traffic (e.g. bus traffic), or on optimization techniques combining both. In this paper we show, as the first approach of its kind, that a carefully designed known code placement strategy combined with and adapted to a known interconnect encoding scheme not only leads to a performance increase but also leads to a significant reduction of interconnect-related energy consumption. This becomes especially interesting since future SoC bus systems (or more generally: networks on a chip) are predicted to be a dominant energy consumer of an SoC. We show that a high-level optimization strategy like code placement and a lower-level optimization strategy like interconnect encoding are NOT orthogonal. Specifically, we report cache miss reduction ratios of 32 with bus related energy savings of 50.4 of up to 95.7 results have been verified by means of diverse real-world SoC applications

Tony Han & Sri Parameswaran. SWASAD: an ASIC design for high speed DNA sequence matching. In ASP-DAC/VLSI Design 2002. 7th Asia and South Pacific Design Automation Conference and 15th International Conference on VLSI Design (ASP-DAC / VLSI Design '02), pages 541-6, Bangalore, India, 2002. IEEE Comput. Soc. [ bib | .pdf ]

Presents the Smith and Waterman algorithm-specific ASIC design (SWASAD) project. This is a hardware solution that implements the S and W algorithm. The SWASAD is an improved implementation of the biological information signal processor (BISP) design. The SWASAD chip fabricated on a 0.5 µm process achieves 3200 million matrix cells per second (MCPS) per chip, with a layout size of 7.1 mm by 7.1 mm. This is a large improvement over existing designs and improves data throughput by using a smaller datawidth

Sri Parameswaran, Joerg Henkel, Xiaobo Sharon Hu & Rajesh Gupta. Proceedings of the Tenth International Symposium on Hardware/Software Codesign, CODES 2002. In CODES 2002, Estes Park, Colorado, 2002. ACM. [ bib ]

Sri Parameswaran. Code placement in hardware software Co synthesis to improve performance and reduce cost. In Design, Automation and Test in Europe. Conference and Exhibition (DATE '01), pages 626-32, Munich, Germany, 2001. IEEE Comput. Soc. [ bib | .pdf ]

This paper introduces an algorithm for code placement in cache, and maps it to memory using a second algorithm. The target architecture is a multiprocessor system with 1st level cache and a common main memory. These algorithms guarantee that as many instruction codewords as possible of the high priority tasks remain in cache all of the time so that other tasks do not overwrite them. This method improves the overall performance, and might result in cheaper systems if more powerful processors are not needed. Amount of memory increase necessary to facilitate this scheme is in the order of 13 of highest priority tasks always in memory can vary from 3 depending upon how many tasks (and their sizes) are allocated to each processor

Sri Parameswaran & Joerg Henkel. I-CoPES: fast instruction code placement for embedded systems to improve performance and energy efficiency. In IEEE/ACM International Conference on Computer Aided Design. (ICCAD 2001), pages 635-41, San Jose, CA, USA, 2001. IEEE. [ bib | .pdf ]

The ratio of cache hits to cache misses in a computer system is, to a large extent, responsible for its characteristics such as energy consumption and performance. In recent years energy efficiency has become one of the dominating design constraints, due to the rapid growth in market share for mobile computing/communication/internet devices. We present a novel fast constructive technique that relocates the instruction code within the main memory in such a manner that the cache is utilized more efficiently. The technique is applied as a pre-processing step, i.e. before the code is executed. It is applicable for embedded systems where the number and characteristics of tasks running on the system are known a priori. The technique does not impose any computational overhead on the system. As a result of applying our technique to a variety of real-world applications we measured (through simulation) that the number of cache misses drops significantly. Further, this reduces the energy consumption of a whole system (CPU, caches, buses, main memory) by up to 65 memory size of 13% on average

Allan Rae & Sri Parameswaran. Voltage reduction of application-specific heterogeneous multiprocessor systems for power minimisation. IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences, vol. E84-A, no. 9, pages 2296-302, 2001. [ bib ]

We present a design strategy to reduce power demands in application-specific heterogeneous multiprocessor systems with interdependent subtasks. This power reduction scheme can be used with a randomised search such as a genetic algorithm where multiple trial solutions are tested. The scheme is applied to each trial solution after allocation and scheduling have been performed. Power savings are achieved by equally expanding each processor's execution time with a corresponding reduction in their respective operating voltage. Lowest cost solutions achieve average reductions of 24% while minimum power solutions average 58%

Allan Rae & Sri Parameswaran. Synthesising application-specific heterogeneous multiprocessors using differential evolution. IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences, vol. E84-A, no. 12, pages 3125-31, 2001. [ bib ]

This paper presents an application-specific, heterogeneous multiprocessor synthesis system, named HeMPS, that combines a form of evolutionary computation known as Differential Evolution with a scheduling heuristic to search the design space efficiently. We demonstrate the effectiveness of our technique by comparing it to similar existing systems. The proposed strategy is shown to be faster than recent systems on large problems while providing equivalent or improved final solutions

Vince E Boros, Aleks D. Rakic & Sri Parameswaran. High-level model of a WDMA passive optical bus for a reconfigurable multiprocessor system. In Design Automation Conference. (DAC '00), pages 221-6, Los Angeles, CA, USA, 2000. ACM. [ bib | .pdf ]

We describe the first iteration of a comprehensive model with which we can investigate the practical limits on optical bus bandwidth and number of bus processing modules for given signal power. The selection algorithm will ultimately allow programmable evaluation of system parameters (bus bandwidth, optical power budget, electrical power budget, number of modules and space consumption) for an optimal design that is suitable for on-the-fly system reconfiguration

Sri Parameswaran, Matthew F. Parkinson & Peter Bartlett. Profiling in the ASP codesign environment. Journal of Systems Architecture, vol. 46, no. 14, pages 1263-74, 2000. [ bib | .pdf ]

Automation of the hardware/software codesign (HSC) methodology brings with it the need to develop sophisticated high-level profiling tools. This paper presents a profiling tool which uses execution profiling on standard C code to obtain accurate and consistent times at the level of individual compound code sections. This tool is used in the ASP hardware/software codesign project. The results from this tool show that profiling must be performed on dedicated hardware which is as close as possible to the final implementation, as opposed to a workstation. Further, in this paper a formula is derived for the number of times a program has to be profiled in order to get an accurate estimate of the number of times a loop with an indeterminate loop count is executed

Allan Rae & Sri Parameswaran. Voltage reduction of application-specific heterogeneous multiprocessor systems for power minimisation. In Asia and South Pacific Design Automation Conference 2000 with EDA TechnoFair 2000 (ASP-DAC 2000), pages 147-52, Yokohama, Japan, 2000. IEEE. [ bib | .pdf ]

We present a design strategy to reduce power demands in application-specific heterogeneous multiprocessor systems with interdependent subtasks. This power reduction scheme can be used with a randomised search such as a genetic algorithm where multiple trial solutions are tested. The scheme is applied to each trial solution after allocation and scheduling have been performed. Power savings are achieved by equally expanding each processor's execution time with a corresponding reduction in their respective operating voltage. Lowest cost solutions achieve average reductions of 24% while minimum power solutions average 58%

Seyed M. Kia & Sri Parameswaran. Self-checking synchronous controller design. IEE Proceedings-Computers and Digital Techniques, vol. 146, no. 1, pages 9-12, 1999. [ bib | .pdf ]

Efficient models are introduced for totally self-checking/code disjoint (TSC/CD) and strongly fault-secure/strongly code disjoint (SFS/SCD) synchronous controller models. These models are based on two low-cost, modular, TSC edge-triggered and error-propagating CD flip-flops. Properties of the proposed synchronous controller models are proven. The design procedure for these models and their proper applications are explained

Hui Guo & Sri Parameswaran. Unrolling loops with Indeterminate Loop Counts in System Level Pipelines. In Asia South Pacific Design Automation Conference (ASP-DAC '98), pages 195-200, Yokohama, Japan, 1998. [ bib | .pdf ]

This paper describes the unrolling of loops with indeterminate loop counts in system level pipelines. Two methods are discussed in this paper. The first method is the varied latency method, where the input is blocked until the pipeline is clear. This variation in the input arrival time gives rise to the name. In this method, the output is in the same order as the input. The second method, called the fixed latency method, allows for the input arrival time to remain unchanged. The loops with loop count in excess of the number of unrolled loops must be stored until a suitable gap in the system becomes available. It is shown that the number of loops should be equal to the sum of the expected value of the loop count and standard deviation of the loop count in the varied latency method, and the expected value of the loop count for the fixed latency method

Seyed M. Kia & Sri Parameswaran. Designs for self checking flip-flops. IEE Proceedings-Computers and Digital Techniques, vol. 145, no. 2, pages 81-8, 1998. [ bib | .pdf ]

The authors introduce two low-cost, modular, totally self checking (TSC), edge triggered and error propagating (code disjoint) flip-flops: one, a D flip-flop used in TSC and strongly fault secure (SFS) synchronous circuits with two-rail codes, the other a T flip-flop, used in a similar way as the D flip-flop but retaining the error as an indicator until the next presetting, to aid error propagation. Thus, the self checking T flip-flop can be used as an error indicator. The self checking D flip-flop is smaller than the duplicate D flip-flop circuitry by 30 than the previous error indicator in the literature. These circuits, unlike previously reported circuits, can also detect stuck-at faults in the clock inputs. The authors have also presented TSC/error propagating applications for the above flip-flops: a counter and a shift register

Sri Parameswaran. HW-SW co-synthesis: the present and the future. In Asian and South Pacific Design Automation Conference 1998 (ASP-DAC '98), pages 19-22, Yokohama, Japan, 1998. IEEE. [ bib | .pdf ]

As we move towards several million transistors per chip it is desirable to move to higher levels of abstraction for the purposes of automated design of systems. Increasing performance of microprocessors in the marketplace is moving the balance between software and hardware. In this environment, it is necessary to adapt our tools to create systems, which encompass these fast microprocessors rather than compete with them. It is important to adapt other peripheral components such as sensors and RF circuits into our design methodology

Sri Parameswaran & Hui Guo. Power reduction in pipelines. In Asian and South Pacific Design Automation Conference 1998 (ASP-DAC '98), pages 545-50, Yokohama, Japan, 1998. IEEE. [ bib | .pdf ]

The reduction of power consumption for a system level pipeline is addressed in this paper. The pipeline is composed of several stages. Each stage has several behaviours. Different behaviours have differing execution times. The speed of the pipeline is only affected by the behaviours on the critical path of the slowest stages. Other behaviours can be slowed down to decrease the power consumed in the system. We propose a multi-voltage supply scheme, in which differing behaviours are supplied with differing voltages. The formulas for computing the supply voltage of each behaviour and minimal power consumption are derived in this paper. The results of computer experiments show that up to 80% of hardware power can be saved with this scheme

Allan Rae & Sri Parameswaran. Application-specific heterogeneous multiprocessor synthesis using differential-evolution - II. In 11th International Symposium on System Synthesis (ISSS '98), pages 83-8, Hsinchu, Taiwan, 1998. IEEE Comput. Soc. [ bib | .pdf ]

This paper presents an application-specific, heterogeneous multiprocessor synthesis system, named HeMPS, that combines a form of Evolutionary Computation known as Differential Evolution with a scheduling heuristic to search the design space efficiently. We demonstrate the effectiveness of our technique by comparing it to similar existing systems. The proposed strategy is shown to be faster than recent systems on large problems while providing equivalent or improved final solutions

Allan R. Rae & Sri Parameswaran. Application-Specific Heterogeneous Multiprocessor Synthesis Using Differential Evolution. In Asia Pacific Conference on Hardware Description Languages (APCHDL '98), 6 pages, Seoul, South Korea, 1998. [ bib ]

Hui Guo & Sri Parameswaran. Unfolding loops with indeterminate count in system level pipelines. In 14th Australian Microelectronics Conference. Microelectronics: Technology Today for the Future (MICRO '97), pages 82-7, Melbourne, Vic., Australia, 1997. IREE Soc. [ bib ]

This paper describes the unrolling of loops with indeterminate loop counts in system level pipelines. Two methods are discussed in this paper. The first method is the varied latency method, where the input is blocked until the pipeline is clear. This variation in the input arrival time gives rise to the name. In this method, the output is in the same order as the input. The second method, called the fixed latency method, allows for the input arrival time to remain unchanged. The loops with loop count in excess of the number of unrolled loops must be stored until a suitable gap in the system becomes available. It is shown that the number of loops should be equal to the sum of the expected value of the loop count and standard deviation of the loop count in the varied latency method, and the expected value of the loop count for the fixed latency method

Seyed M Kia & Sri Parameswaran. Design of TSC/CD and SFS/SCD Synchronous Circuits with TSC/error propagating Flip-Flops. In 11th Australian Microelectronics Conference (MICRO '97), pages 75-80, Sydney, NSW, Australia, 1997. Inst. Radio & Electron. Eng. [ bib ]

In this paper, we introduce a low-cost, modular, totally self checking (TSC), edge triggered and error propagating (code disjoint) D-type flip-flop. This D flip-flop can be used in TSC and strongly fault secure (SFS) synchronous circuits with two-rail codes

Seyed M. Kia & Sri Parameswaran. An efficient self exercising two rail checker. Journal of Microelectronic Systems Integration, vol. 5, no. 3, pages 159-65, 1997. [ bib ]

In this paper, a new efficient self exercising checker for a class of non-systematic two rail codes is introduced. This proposed checker is called the “Two Dimensional Self Exercising Two Rail checker” (TDSETR checker). The TDSETR checker is used to check non-systematic two rail codes. In this method, the checking is done on line. This method significantly reduces the complexity and delay of checking compared to previous circuits. The high speed and low cost advantages of the TDSETR checker are compared with conventional circuits (e.g., the circuit is 30 reported circuits for 256 bit pair inputs)

Sri Parameswaran & Hui Guo. Power consumption in CMOS combinational logic blocks at high frequencies. In Asia and South Pacific Design Automation Conference 1997 (ASP-DAC '97), pages 195-200, Chiba, Japan, 1997. IEEE. [ bib | .pdf ]

A new model for estimating dynamic power dissipation in CMOS combinational circuits at differing voltages is presented. The proposed model deals with power dissipation of circuits at saturation frequencies, where the output voltage does not reach 100 the output voltage waveform is almost a triangular waveform. We show that the dynamic power consumption at saturation frequencies is only dependent on the supply voltage, and is independent of load capacitance and switching speed. This model shows that when a circuit is working in the saturation frequency range, as the frequency is increased, the performance/power ratio is increased. However, this increase in performance/power ratio is at the expense of noise margin. The model is theoretically and empirically shown to be correct. This model can be used to design a system where the differing combinational logic blocks are supplied with differing voltages. Such a system would consume lower power than if the system was supplied by a single voltage rail

Sri Parameswaran & Hui Guo. Partitioning of system level pipelines. In 14th Australian Microelectronics Conference. Microelectronics: Technology Today for the Future (MICRO '97), pages 233-8, Melbourne, Vic., Australia, 1997. IREE Soc. [ bib ]

A system to be pipelined usually consists of several sub-systems or sub-processes. Each sub-process may have several implementations. Each implementation has an associated cost and an associated execution time. Different partitions result in a different pipeline. To obtain an efficient pipeline, partitioning is important. In this paper, an algorithm to partition a system in order to obtain an optimal or near-optimal pipeline, is given. Results obtained by this method are very close to the optimal solution, but can be found in a fraction of the time taken to search for an optimal solution

Sri Parameswaran & Hui Guo. Power reduction in pipelines. In 14th Australian Microelectronics Conference. Microelectronics: Technology Today for the Future (MICRO '97), pages 239-44, Melbourne, Vic., Australia, 1997. IREE Soc. [ bib | .pdf ]

The reduction of power consumption for a system level pipeline is addressed in this paper. The pipeline is composed of several stages. Each stage has several behaviours. Different behaviours have differing execution times. The speed of the pipeline is only affected by the behaviours on the critical path of the slowest stages. Other behaviours can be slowed down to decrease the power consumed in the system. We propose a multi-voltage supply scheme, in which differing behaviours are supplied with differing voltages. The formulas for computation of the supply voltage for each behaviour and minimal power consumption are derived in this paper. The results of computer experiments are also provided here

Sri Parameswaran & Hui Guo. Extracting Higher Performance/Power Ratio in Combinational CMOS Circuits. In Sixth International Workshop on Power, Timing, Modelling, Optimization and Simulation (PATMOS '96), pages 93-102, University of Bologna, 1996. [ bib ]

Pradip Jha, Sri Parameswaran & Nikil Dutt. Reclocking Controllers for Minimum Execution Time. IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences, vol. E78-A, no. 12, pages 1715-1721, 1995. [ bib ]

In this paper we describe a method for resynthesizing the controller of a design for a fixed datapath with the objective of increasing the design's throughput by minimizing its total execution time. This work has potential in two important areas: one, design reuse for retargeting datapaths to new libraries, new technologies and different bit-widths; and two, back-annotation of physical design information during High-Level Synthesis (HLS), and subsequent adjustment of the design's schedule to account for realistic physical design information with minimal changes to the datapath. We present our approach using various formulations, prove optimality of our algorithm and demonstrate the effectiveness of our technique on several HLS benchmarks. We have observed improvements of up to 34% on application of our controller resynthesis technique to the outputs of HLS.

Pradip Jha, Sri Parameswaran & Nikil Dutt. Reclocking for high level synthesis. In Asia and South Pacific Design Automation Conference. IFIP International Conference on Computer Hardware Description Languages and their Applications. IFIP International Conference on Very Large Scale Integration (ASP-DAC'95/CHDL'95/VLSI'95), pages 49-54, Chiba, Japan, 1995. Nihon Gakkai Jimu Senta. [ bib | .pdf ]

Describes a powerful post-synthesis approach called reclocking, for performance improvement by minimizing the total execution time. By back annotating the wire delays of designs created by a high level synthesis system, and then finding an optimal clockwidth, we resynthesize the controller to improve performance without altering the datapath. Reclocking is versatile and can be applied not only for wire delay consideration, but also for bit-width migration, library migration and for feature size migration supporting the philosophy of design reuse. Experimental results show that with reclocking, the performance of the input designs can be improved by as much as 34%

Matthew F. Parkinson & Sri Parameswaran. Profiling in the ASP codesign environment. In Proceedings of the Eighth International Symposium on System Synthesis (ISSS '95), pages 128-33, Cannes, France, 1995. IEEE Comput. Soc. Press. [ bib | .pdf ]

Automation of the hardware/software codesign methodology brings with it the need to develop sophisticated high-level profiling tools. This paper presents a profiling tool which uses execution profiling on standard C code to obtain accurate and consistent times at the level of individual compound code sections. This tool is used in the ASP Hardware/Software Codesign project. The results from this tool show that profiling must be performed on dedicated hardware which is as close as possible to the final implementation, as opposed to a workstation

Seyed M. Kia & Sri Parameswaran. Design automation of self checking circuits. In European Design Automation Conference (EURO-DAC '94 + EURO VHDL '94), pages 252-7, Grenoble, France, 1994. ACM. [ bib ]

We explain the steps of the CAD tools developed for self checking circuits. The CAD tools developed are used to design strongly fault secure, strongly code disjoint (SFS/SCD) and totally self checking, code disjoint (TSC/CD) circuits. Self checking combinatorial and sequential synchronous circuits including shift registers, counters, adders and checkers are designed, using these tools. The output of these CAD tools is given in structural level VHDL which can be synthesized via commercial tools

Seyed M. Kia & Sri Parameswaran. Novel architectures for TSC/CD and SFS/SCD synchronous controllers. In Proceedings 12th IEEE VLSI Test Symposium (VTS '94), pages 138-43, Cherry Hill, NJ, USA, 1994. IEEE Comput. Soc. Press. [ bib | .pdf ]

Introduces design models for totally self checking, code disjoint (TSC/CD) and strongly fault secure, strongly code disjoint (SFS/SCD) synchronous controllers. The TSC/CD and SFS/SCD models are based on two new proposed low-cost, modular, totally self checking (TSC), edge triggered and error propagating (code disjoint) flip-flops; one, a D flip-flop which can be used in TSC and strongly fault secure (SFS) synchronous circuits with two-rail codes; the other a T flip-flop, used in a similar way as the D flip-flop but retaining the error as an indicator until the next presetting, as an aid to error propagation

Sri Parameswaran, Pradip Jha & Nikil Dutt. Resynthesizing Controllers for Minimum Execution Time. In Asia Pacific Conference in Hardware Description Languages (APCHDL '94), Toyohashi, Japan, 1994. [ bib ]

Describes a powerful post-synthesis approach called reclocking, for performance improvement by minimizing the total execution time. By back annotating the wire delays of designs created by a high level synthesis system, and then finding an optimal clockwidth, we resynthesize the controller to improve performance without altering the datapath. Reclocking is versatile and can be applied not only for wire delay consideration, but also for bit-width migration, library migration and for feature size migration supporting the philosophy of design reuse. Experimental results show that with reclocking, the performance of the input designs can be improved by as much as 34%

Sri Parameswaran & Mark F. Schulz. Computer-aided selection of components for technology-independent specifications. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 13, no. 11, pages 1333-50, 1994. [ bib | .pdf ]

The specification of a synchronous circuit can be given as a set of abstract building blocks that are interconnected. A set of fast algorithms is presented here for the selection of components that map each of these abstract building blocks to one of a number of suitable physical components. The first set of algorithms selects the fastest or cheapest (smallest area) of all possible components. Another set of algorithms is given that will find a solution with user-defined constraints. These algorithms, which are implemented as part of the SPOT system, use an exhaustive list of timing information to increase the likelihood of a good solution.

Matthew F. Parkinson, Paul M. Taylor & Sri Parameswaran. C to VHDL converter in a codesign environment. In Proceedings of VHDL International Users Forum. Spring Conference (VHDL Forum '94), pages 100-9, Oakland, CA, USA, 1994. IEEE Comput. Soc. Press. [ bib | .pdf ]

Automation of the hardware/software codesign methodology brings with it the need to develop sophisticated high-level synthesis tools. This paper presents a tool which is the result of such development. This tool converts standard C code into an equivalent VHDL behavioural description. This description is used to generate a chip-level hardware interconnect of identical functionality to the original C code

Tim Healy & Sri Parameswaran. BoMaRA: A Boltzmann machine for register allocation and interconnection. In 11th Australian Microelectronics Conference. Microelectronics, Meeting the Needs of Modern Technology (MICRO '93), pages 69-74, Gold Coast, Qld., Australia, 1993. Inst. Radio & Electron. Eng. [ bib ]

Most register allocation methods only minimise the number of registers. BoMaRA, a program using probabilistic methods for register allocation, attempts to minimise the number of registers and the number of interconnections that arise when multiple variables are stored in a single register

Seyed M. Kia & Sri Parameswaran. Automated Self Checking System using VHDL. In Asia Pacific Conference on Hardware Description Languages (APCHDL '93), pages 131-135, Brisbane, Australia, 1993. APCHDL Conference Secretariat. [ bib ]

Seyed M. Kia & Sri Parameswaran. Synchronous TSC/CD Error Indicator for self checking systems. In Pacific Rim International Symposium on Fault tolerant Computing (PRISFC '93), pages 156-160, Melbourne, Australia, 1993. [ bib ]

Sri Parameswaran & Adam Postula. Proceedings of the First Asia Pacific Conference on Hardware Description Languages (APCHDL '93). Brisbane, Australia, 1993. APCHDL Conference Secretariat. [ bib ]

Sri Parameswaran & Mark F. Schulz. A critical look at adaptive logic networks. In Proceedings of the Fourth Australian Conference on Neural Networks (ACNN'93), pages 102-5, Melbourne, Vic., Australia, 1993. Sydney Univ. Electr. Eng. [ bib ]

Critically analyses adaptive logic networks (ALNs), which have been developed by W. Armstrong et al. (1979). The authors take some of the problems distributed by W. Armstrong in the Atree2 software package, and apply standard digital logic techniques to the same problems. From the set of tests done, it is concluded that the problems looked at can be solved by standard digital logic techniques and Hamming error code correction. The standard digital logic techniques coupled with Hamming error code correction produce a minimized combinational logic up to 100 times faster than ALNs

Matthew F. Parkinson, Paul M. Taylor & Sri Parameswaran. A profiler for automated translation of signal processing algorithms into high speed hardware/software hybrid architectures. In 11th Australian Microelectronics Conference. Microelectronics, Meeting the Needs of Modern Technology (MICRO '93), pages 81-6, Gold Coast, Qld., Australia, 1993. Inst. Radio & Electron. Eng. [ bib ]

In order to perform hardware/software codesign on an algorithm, it is essential to divide the code into hardware and software partitions. This paper presents the technique of execution profiling of C code in preparation for automatic partitioning. The profiler presented overcomes many shortcomings of traditional profiling techniques.

Matthew F. Parkinson, Paul M. Taylor, Sri Parameswaran & Adam Postula. An Automated Hardware Software Codesign System Using VHDL. In Asia Pacific Conference of Hardware Description Language (APCHDL '93), pages 267-280, Brisbane, Australia, 1993. APCHDL Conference Secretariat. [ bib ]

Sri Parameswaran & Mark F. Schulz. SPOT: an expert system for digital synthesis. In 8th Australian Conference on Microelectronics (MICROS '89), pages 95-101, Brisbane, Qld., Australia, 1989. Inst. Eng. Australia. [ bib ]

Describes an approach to the automation of digital synthesis using a knowledge based expert system named SPOT. A technology-independent description is produced from a functional description. This technology-independent description is further processed to create a technology-dependent description for both LSI/MSI/SSI and EPLD technologies.

Mark F. Schulz & Sri Parameswaran. An expert system for design of ASICs. In Conference on Computing Systems and Information Technology 1989 (CCSIT '89), pages 127-31, Sydney, NSW, Australia, 1989. Instn. Eng. Australia. [ bib ]

Describes an expert system named SPOT for the automation of digital design. The first section of the project described in this paper is used to create a technology-independent description from a structural description. The final section of the project is to create a technology-dependent description from the technology-independent description

Sri Parameswaran. Improving Education Planning in Sri Lanka - Computer Science and Mathematics. Technical report, Asian Development Bank, June 1999. [ bib ]

Sri Parameswaran. Proposal for a Centre for System-on-a-Chip Research. Technical report, The University of Queensland / State Department of Queensland, September 1998. [ bib ]

Sri Parameswaran. Strategies for the Indian Higher Education Student Market in Engineering/IT. Technical report, The University of Queensland, June 1998. [ bib ]

Sri Parameswaran. Proposed New Horizons Diploma & Professional Masters Program in Information Technology. Technical report, The University of Queensland, September 1998. [ bib ]

Sri Parameswaran, Jorgen Peddersen & Ashley Partis. A Low Power Chip Architecture, 2005. [ bib ]

