Joerg Henkel & Sri Parameswaran, editors.
Designing embedded processors: A low power perspective.
Springer, 2007.
[ bib |
.pdf ]
Seng Lin Shee & Sri Parameswaran.
Architectural Exploration of Heterogeneous Multiprocessor
Systems for JPEG.
International Journal of Parallel Processing, vol. 35, 2007.
[ bib |
.pdf ]
Seng Lin Shee & Sri Parameswaran.
Design Methodology for Pipelined Heterogeneous Multiprocessor
System.
In Design Automation Conference (DAC'07), page 6pp, San Diego, CA,
USA, 2007.
[ bib |
.pdf ]
Jude Angelo Ambrose, Roshan Ragel & Sri Parameswaran.
RIJID: Random Code Injection to Mask Power Analysis based Side
Channel Attacks.
In Design Automation Conference (DAC '07), page 6pp, San Diego, CA,
USA, 2007.
[ bib |
.pdf ]
Yee Jern Chong & Sri Parameswaran.
Automatic Application Specific Floating-point Unit Generation.
In Design, Automation and Test in Europe (DATE'07) Conference, Nice,
France, 2007. IEEE.
[ bib |
.pdf ]
Andhi Janapsatya, Aleksandar Ignjatovic, Sri Parameswaran & Joerg Henkel.
Instruction Trace Compression for Rapid Instruction Cache
Simulation.
In Design, Automation and Test in Europe (DATE'07) Conference, page
6pp. IEEE, 2007.
[ bib |
.pdf ]
Ivan Siu-Chuang Lu, Neil Weste & Sri Parameswaran.
A power-efficient 5.6-GHz process-compensated CMOS frequency
divider.
IEEE Transactions on Circuits and Systems II, vol. 54, no. 4, pages
323-327, 2007.
[ bib |
.pdf ]
Jorgen Peddersen & Sri Parameswaran.
CLIPPER: Counter-based Low Impact Processor Power Estimation at
Run-time.
In 12th Asia and South Pacific Design Automation Conference (ASP-DAC
2007), pages 890-895, Yokohama, Japan, 2007. IEEE.
[ bib |
.pdf ]
Jorgen Peddersen & Sri Parameswaran.
Energy driven Application Adaptation at Run-Time.
In 20th International Conference on VLSI DESIGN, pages 385-390,
Bangalore, India, 2007.
[ bib |
.pdf ]
Andhi Janapsatya, Aleksander Ignjatovic & Sri Parameswaran.
Exploiting Statistical Information for Implementation of
Instruction Scratchpad Memory in Embedded System.
IEEE Transactions on Very Large Scale Integration (VLSI) Systems,
vol. 14, no. 8, pages 816-29, 2006.
[ bib |
.pdf ]
Andhi Janapsatya, Aleks Ignjatovic & Sri Parameswaran.
Finding optimal L1 cache configuration for embedded systems.
In Asia South Pacific Design Automation Conference (ASPDAC 2006),
pages 796-801, Yokohama, Japan, 2006.
[ bib |
.pdf ]
Andhi Janapsatya, Aleks Ignjatovic & Sri Parameswaran.
A novel instruction scratchpad memory optimization method based
on concomitance metric.
In Asia South Pacific Design Automation Conference (ASPDAC 2006),
page 6 pp., Yokohama, Japan, 2006.
[ bib |
.pdf ]
Ivan Siu-Chuang Lu, Neil Weste & Sri Parameswaran.
ADC precision requirement for digital ultra-wideband receivers
with sublinear front-ends: a power and performance perspective.
In 19th International Conference on VLSI Design held jointly with 5th
International Conference on Embedded Systems and Design (VLSI Design '06),
page 6 pp., Hyderabad, India, 2006. IEEE Computer Society.
[ bib |
.pdf ]
Sri Parameswaran, Joerg Henkel & Newton Cheung.
Instruction Matching and Modelling.
In Paolo Ienne & Rainer Leupers, editors, Customizable and
Configurable Embedded Processors. Elsevier, 2006.
[ bib ]
Swarana Radhakrishnan, Hui Guo & Sri Parameswaran.
Customization of Application Specific Heterogeneous MultiPipeline
Processors.
In Design, Automation and Test in Europe Conference and Exhibition
(DATE '06), page 6 pages, Munich, Germany, 2006. IEEE Comput. Soc.
[ bib |
.pdf ]
Swarnalatha Radhakrishnan, Hui Guo, Sri Parameswaran & Aleksandar Ignjatovic.
Application Specific Forwarding Network and Instruction Encoding
for Multi-pipe ASIPs.
In International Conference on Hardware/Software Codesign and Systems
Synthesis (CODES + ISSS '06), page 6, Seoul, Korea, 2006.
[ bib ]
Roshan Ragel & Sri Parameswaran.
Hardware Assisted Pre-emptive Control Flow Checking for Embedded
Processors to improve Reliability.
In International Conference on Hardware/Software Codesign and Systems
Synthesis (CODES + ISSS '06), page 6, Seoul, Korea, 2006.
[ bib |
.pdf ]
Roshan G. Ragel & Sri Parameswaran.
IMPRES: Integrated Monitoring for Processor Reliability and
Security.
In Design Automation Conference. (DAC '06), page 4 pages, San
Francisco, CA, USA, 2006. ACM.
[ bib |
.pdf ]
Seng Lin Shee, Andrea Erdos & Sri Parameswaran.
Heterogeneous Multiprocessor Implementations for JPEG : A Case
Study.
In International Conference on Hardware/Software Codesign and Systems
Synthesis (CODES + ISSS '06), page 6, Seoul, Korea, 2006.
[ bib ]
Hui Wu & Sri Parameswaran.
Minimising the Energy Consumption of Real-Time Tasks with
Precedence Constraints on A Single Processor.
In The 2006 IFIP International Conference on Embedded And Ubiquitous
Computing, page 12 pages, Seoul, Korea, 2006. Springer's Lecture Notes in
Computer Science.
[ bib ]
Jeremy Chan & Sri Parameswaran.
NoCEE: energy macro-model extraction methodology for network on
chip routers.
In International Conference on Computer Aided Design (ICCAD-2005),
pages 254-9, San Jose, CA, USA, 2005. IEEE.
[ bib |
.pdf ]
Newton Cheung, Sri Parameswaran & Joerg Henkel.
Battery aware instruction generation for embedded processors.
In Asia South Pacific Design Automation Conference (ASP-DAC '05),
volume 1, pages 553-556, Shanghai, China, 2005.
[ bib |
.pdf ]
Hui Guo & Sri Parameswaran.
Balancing system level pipelines with stage voltage scaling.
In IEEE Computer Society Annual Symposium on VLSI (IVLSI '05), pages
287-289, Tampa, FL, USA, 2005. IEEE Comput. Soc.
[ bib |
.pdf ]
Ivan S. C. Lu, Neil Weste & Sri Parameswaran.
The effect of receiver front-end non-linearity on DS-UWB systems
operating in the 3 to 4 GHz band.
In 2005 IEEE Wireless Communications and Networking Conference (WCNC
'05), volume 2, pages 776-81, New Orleans, LA, USA, 2005. IEEE.
[ bib |
.pdf ]
Sri Parameswaran & Joerg Henkel.
Instruction code mapping for performance increase and energy
reduction in embedded computer systems.
IEEE Transactions on Very Large Scale Integration (VLSI) Systems,
vol. 13, no. 4, pages 498-502, 2005.
[ bib |
.pdf ]
Sri Parameswaran, Jorgen Peddersen & Ashley Partis.
A Modular Approach to TCP/IPv6 Hardware Implementation, 2 June
2005.
[ bib ]
Sri Parameswaran, Jorgen Peddersen & Ashley Partis.
Low Power Chip Architecture, 2005.
[ bib ]
Jorgen Peddersen, Seng Lin Shee, Andhi Janapsatya & Sri Parameswaran.
Rapid embedded hardware/software system generation.
In 18th International Conference on VLSI Design (VLSI Design '05),
pages 111-16, Kolkata, India, 2005. IEEE Computer Soc.
[ bib |
.pdf ]
Swarnalatha Radhakrishnan, Hui Guo & Sri Parameswaran.
n-pipe: Application Specific Heterogeneous Multi-Pipeline
Processor Design.
In Workshop on Application Specific Processors (WASP '05), page 8
pages, Jersey City, NJ, USA, 2005.
[ bib ]
Roshan Ragel, Sri Parameswaran & Sayed Kia.
Micro embedded monitoring for security in application specific
instruction-set processors.
In International Conference on Compilers, Architecture, and Synthesis
for Embedded Systems (CASES '05), pages 304-314, San Francisco, CA, USA,
2005. ACM.
[ bib |
.pdf ]
Seng Lin Shee, Sri Parameswaran & Newton Cheung.
Architecture for loop acceleration: a case study.
In International Conference on Hardware/Software Codesign and System
Synthesis (CODES + ISSS '05), pages 297-302, Jersey City, NJ, USA, 2005.
IEEE.
[ bib |
.pdf ]
Jeremy Chan & Sri Parameswaran.
NoCGEN: a template based reuse methodology for Networks On Chip
architecture.
In 17th International Conference on VLSI Design (VLSI Design '04),
pages 717-20, Mumbai, India, 2004. IEEE Comput. Soc.
[ bib |
.pdf ]
Newton Cheung, Sri Parameswaran & Joerg Henkel.
A Quantitative Study and Estimation Models for Extensible
Instructions in Embedded Processors.
In International Conference on Computer Aided Design (ICCAD '04),
pages 183-189, San Jose, CA, USA, 2004. IEEE.
[ bib |
.pdf ]
Newton Cheung, Sri Parameswaran, Joerg Henkel & Jeremy Chan.
MINCE: matching instructions using combinational equivalence for
extensible processor.
In Design, Automation and Test in Europe Conference and Exhibition
(DATE '04), volume 2, pages 1020-5, Paris, France, 2004. IEEE Comput.
Soc.
[ bib |
.pdf ]
Andhi Janapsatya, Sri Parameswaran & Joerg Henkel.
REMcode: relocating embedded code for improving system
efficiency.
IEE Proceedings-Computers and Digital Techniques, vol. 151, no. 6,
pages 457-65, 2004.
[ bib |
.pdf ]
Andhi Janapsatya, Sri Parameswaran & Aleksander Ignjatovic.
Hardware/software managed scratchpad memory for embedded
system.
In International Conference on Computer Aided Design (ICCAD 2004),
pages 370-7, San Jose, CA, USA, 2004. IEEE.
[ bib |
.pdf ]
Swarana Radhakrishnan, Hui Guo & Sri Parameswaran.
Dual-pipeline heterogeneous ASIP design.
In International Conference on Hardware/Software Codesign and Systems
Synthesis (CODES + ISSS '04), pages 12-17, Stockholm, Sweden, 2004. ACM.
[ bib |
.pdf ]
Newton Cheung, Sri Parameswaran & Joerg Henkel.
Rapid Configuration & Instruction Selection for an ASIP: A Case
Study.
In Ahmed A. Jerraya, S. Yoo, N. Wehn & D. Verkest, editors,
Embedded Software for SoC. Kluwer Publishing, 2003.
[ bib ]
Newton Cheung, Sri Parameswaran & Joerg Henkel.
INSIDE: INstruction Selection/Identification & Design
Exploration for extensible processors.
In International Conference on Computer Aided Design (ICCAD '03),
pages 291-7, San Jose, CA, USA, 2003. IEEE.
[ bib |
.pdf ]
Newton Cheung, Sri Parameswaran & Joerg Henkel.
Rapid configuration and instruction selection for an ASIP: a
case study.
In Design, Automation and Test in Europe Conference and Exhibition
(DATE '03), pages 802-7, Munich, Germany, 2003. IEEE Comput. Soc.
[ bib |
.pdf ]
Ivan S. C. Lu, Neil Weste & Sri Parameswaran.
A digital ultra-wideband multiband transceiver architecture with
fast frequency hopping capabilities.
In 2003 IEEE Conference on Ultra Wideband Systems and Technologies
(UWST '03), pages 448-52, Reston, VA, USA, 2003. IEEE.
[ bib |
.pdf ]
Sri Parameswaran, Joerg Henkel & Haris Lekatsas.
Multi-parametric improvements for embedded systems using
code-placement and address bus coding.
In Asia and South Pacific Design Automation Conference 2003 (ASP-DAC
'03), pages 15-21, Kitakyushu, Japan, 2003. IEEE.
[ bib |
.pdf ]
Tony Han & Sri Parameswaran.
SWASAD: an ASIC design for high speed DNA sequence matching.
In ASP-DAC/VLSI Design 2002. 7th Asia and South Pacific Design
Automation Conference and 15th International Conference on VLSI Design
(ASP-DAC / VLSI Design '02), pages 541-6, Bangalore, India, 2002. IEEE
Comput. Soc.
[ bib |
.pdf ]
Sri Parameswaran, Joerg Henkel, Xiaobo Sharon Hu & Rajesh Gupta.
Proceedings of the Tenth International Symposium on
Hardware/Software Codesign, CODES 2002.
In CODES 2002, Estes Park, Colorado, 2002. ACM.
[ bib ]
Sri Parameswaran.
Code placement in hardware software Co synthesis to improve
performance and reduce cost.
In Design, Automation and Test in Europe. Conference and Exhibition
(DATE '01), pages 626-32, Munich, Germany, 2001. IEEE Comput. Soc.
[ bib |
.pdf ]
Sri Parameswaran & Joerg Henkel.
I-CoPES: fast instruction code placement for embedded systems to
improve performance and energy efficiency.
In IEEE/ACM International Conference on Computer Aided Design. (ICCAD
2001), pages 635-41, San Jose, CA, USA, 2001. IEEE.
[ bib |
.pdf ]
Allan Rae & Sri Parameswaran.
Voltage reduction of application-specific heterogeneous
multiprocessor systems for power minimisation.
IEICE Transactions on Fundamentals of Electronics, Communications and
Computer Sciences, vol. E84-A, no. 9, pages 2296-302, 2001.
[ bib ]
Allan Rae & Sri Parameswaran.
Synthesising application-specific heterogeneous multiprocessors
using differential evolution.
IEICE Transactions on Fundamentals of Electronics, Communications and
Computer Sciences, vol. E84-A, no. 12, pages 3125-31, 2001.
[ bib ]
Vince E Boros, Aleks D. Rakic & Sri Parameswaran.
High-level model of a WDMA passive optical bus for a
reconfigurable multiprocessor system.
In Design Automation Conference. (DAC '00), pages 221-6, Los
Angeles, CA, USA, 2000. ACM.
[ bib |
.pdf ]
Sri Parameswaran, Matthew F. Parkinson & Peter Bartlett.
Profiling in the ASP codesign environment.
Journal of Systems Architecture, vol. 46, no. 14, pages 1263-74,
2000.
[ bib |
.pdf ]
Allan Rae & Sri Parameswaran.
Voltage reduction of application-specific heterogeneous
multiprocessor systems for power minimisation.
In Asia and South Pacific Design Automation Conference 2000 with EDA
TechnoFair 2000 (ASP-DAC 2000), pages 147-52, Yokohama, Japan, 2000. IEEE.
[ bib |
.pdf ]
Seyed M. Kia & Sri Parameswaran.
Self-checking synchronous controller design.
IEE Proceedings-Computers and Digital Techniques, vol. 146, no. 1,
pages 9-12, 1999.
[ bib |
.pdf ]
Hui Guo & Sri Parameswaran.
Unrolling loops with Indeterminate Loop Counts in System Level
Pipelines.
In Asia South Pacific Design Automation Conference (ASP-DAC '98),
pages 195-200, Yokohama, Japan, 1998.
[ bib |
.pdf ]
Seyed M. Kia & Sri Parameswaran.
Designs for self checking flip-flops.
IEE Proceedings-Computers and Digital Techniques, vol. 145, no. 2,
pages 81-8, 1998.
[ bib |
.pdf ]
Sri Parameswaran.
HW-SW co-synthesis: the present and the future.
In Asian and South Pacific Design Automation Conference 1998 (ASP-DAC
'98), pages 19-22, Yokohama, Japan, 1998. IEEE.
[ bib |
.pdf ]
Sri Parameswaran & Hui Guo.
Power reduction in pipelines.
In Asian and South Pacific Design Automation Conference 1998 (ASP-DAC
'98), pages 545-50, Yokohama, Japan, 1998. IEEE.
[ bib |
.pdf ]
Allan Rae & Sri Parameswaran.
Application-specific heterogeneous multiprocessor synthesis
using differential-evolution - II.
In 11th International Symposium on System Synthesis (ISSS '98), pages
83-8, Hsinchu, Taiwan, 1998. IEEE Comput. Soc.
[ bib |
.pdf ]
Allan R. Rae & Sri Parameswaran.
Application-Specific Heterogeneous Multiprocessor Synthesis
Using Differential Evolution.
In Asia Pacific Conference on Hardware Description Languages (APCHDL
'98), page 6 pages, Seoul, South Korea, 1998.
[ bib ]
Hui Guo & Sri Parameswaran.
Unfolding loops with indeterminate count in system level
pipelines.
In 14th Australian Microelectronics Conference. Microelectronics:
Technology Today for the Future (MICRO '97), pages 82-7, Melbourne, Vic.,
Australia, 1997. IREE Soc.
[ bib ]
Seyed M Kia & Sri Parameswaran.
Design of TSC/CD and SFS/SCD Synchronous Circuits with TSC/error
propagating Flip-Flops.
In 11th Australian Microelectronics Conference (MICRO '97), pages
75-80, Sydney, NSW, Australia, 1997. Inst. Radio & Electron. Eng.
[ bib ]
Seyed M. Kia & Sri Parameswaran.
An efficient self exercising two rail checker.
Journal of Microelectronic Systems Integration, vol. 5, no. 3, pages
159-65, 1997.
[ bib ]
Sri Parameswaran & Hui Guo.
Power consumption in CMOS combinational logic blocks at high
frequencies.
In Asia and South Pacific Design Automation Conference 1997 (ASP-DAC
'97), pages 195-200, Chiba, Japan, 1997. IEEE.
[ bib |
.pdf ]
Sri Parameswaran & Hui Guo.
Partitioning of system level pipelines.
In 14th Australian Microelectronics Conference. Microelectronics:
Technology Today for the Future (MICRO '97), pages 233-8, Melbourne, Vic.,
Australia, 1997. IREE Soc.
[ bib ]
Sri Parameswaran & Hui Guo.
Power reduction in pipelines.
In 14th Australian Microelectronics Conference. Microelectronics:
Technology Today for the Future (MICRO '97), pages 239-44, Melbourne, Vic.,
Australia, 1997. IREE Soc.
[ bib |
.pdf ]
Sri Parameswaran & Hui Guo.
Extracting Higher Performance/Power Ratio in Combinational CMOS
Circuits.
In Sixth International Workshop on Power, Timing, Modelling,
Optimization and Simulation (PATMOS '96), pages 93-102, University of
Bologna, 1996.
[ bib ]
Pradip Jha, Sri Parameswaran & Nikil Dutt.
Reclocking Controllers for Minimum Execution Time.
IEICE Transactions on Fundamentals of Electronics, Communications and
Computer Sciences, vol. E78-A, no. 12, pages 1715-1721, 1995.
[ bib ]
Pradip Jha, Sri Parameswaran & Nikil Dutt.
Reclocking for high level synthesis.
In Asia and South Pacific Design Automation Conference. IFIP
International Conference on Computer Hardware Description Languages and their
Applications. IFIP International Conference on Very Large Scale Integration
(ASP-DAC'95/CHDL'95/VLSI'95.), pages 49-54, Chiba, Japan, 1995. Nihon Gakkai
Jimu Senta.
[ bib |
.pdf ]
Matthew F. Parkinson & Sri Parameswaran.
Profiling in the ASP codesign environment.
In Proceedings of the Eighth International Symposium on System
Synthesis (ISSS '95), pages 128-33, Cannes, France, 1995. IEEE Comput. Soc.
Press.
[ bib |
.pdf ]
Seyed M. Kia & Sri Parameswaran.
Design automation of self checking circuits.
In European Design Automation Conference (EURO-DAC '94 + EURO VHDL
'94), pages 252-7, Grenoble, France, 1994. ACM.
[ bib ]
Seyed M. Kia & Sri Parameswaran.
Novel architectures for TSC/CD and SFS/SCD synchronous
controllers.
In Proceedings 12th IEEE VLSI Test Symposium (VTS '94), pages
138-43, Cherry Hill, NJ, USA, 1994. IEEE Comput. Soc. Press.
[ bib |
.pdf ]
Sri Parameswaran, Pradip Jha & Nikil Dutt.
Resynthesizing Controllers for Minimum Execution Time.
In Asia Pacific Conference in Hardware Description Languages (APCHDL
'94), Toyohashi, Japan, 1994.
[ bib ]
Sri Parameswaran & Mark F. Schulz.
Computer-aided selection of components for
technology-independent specifications.
IEEE Transactions on Computer-Aided Design of Integrated Circuits and
Systems, vol. 13, no. 11, pages 1333-50, 1994.
[ bib |
.pdf ]
Matthew F. Parkinson, Paul M. Taylor & Sri Parameswaran.
C to VHDL converter in a codesign environment.
In Proceedings of VHDL International Users Forum. Spring Conference
(VHDL Forum '94), pages 100-9, Oakland, CA, USA, 1994. IEEE Comput. Soc.
Press.
[ bib |
.pdf ]
Tim Healy & Sri Parameswaran.
BoMaRA: A Boltzmann machine for register allocation and
interconnection.
In 11th Australian Microelectronics Conference. Microelectronics,
Meeting the Needs of Modern Technology (MICRO '93), pages 69-74, Gold Coast,
Qld., Australia, 1993. Inst. Radio & Electron. Eng.
[ bib ]
Seyed M. Kia & Sri Parameswaran.
Automated Self Checking System using VHDL.
In Asia Pacific Conference on Hardware Description Languages (APCHDL
'93), pages 131-135, Brisbane, Australia, 1993. APCHDL Conference
Secretariat.
[ bib ]
Seyed M. Kia & Sri Parameswaran.
Synchronous TSC/CD Error Indicator for self checking systems.
In Pacific Rim International Symposium on Fault tolerant Computing
(PRISFC '93), pages 156-160, Melbourne, Australia, 1993.
[ bib ]
Sri Parameswaran & Adam Postula.
Proceedings of the First Asia Pacific Conference on Hardware
Description Languages (APCHDL '93).
Brisbane, Australia, 1993. APCHDL Conference Secretariat.
[ bib ]
Sri Parameswaran & Mark F. Schulz.
A critical look at adaptive logic networks.
In Proceedings of the Fourth Australian Conference on Neural Networks
(ACNN'93), pages 102-5, Melbourne, Vic., Australia, 1993. Sydney Univ.
Electr. Eng.
[ bib ]
Matthew F. Parkinson, Paul M. Taylor & Sri Parameswaran.
A profiler for automated translation of signal processing
algorithms into high speed hardware/software hybrid architectures.
In 11th Australian Microelectronics Conference. Microelectronics,
Meeting the Needs of Modern Technology (MICRO '93), pages 81-6, Gold Coast,
Qld., Australia, 1993. Inst. Radio & Electron. Eng.
[ bib ]
Matthew F. Parkinson, Paul M. Taylor, Sri Parameswaran & Adam Postula.
An Automated Hardware Software Codesign System Using VHDL.
In Asia Pacific Conference of Hardware Description Language (APCHDL
'93), pages 267-280, Brisbane, Australia, 1993. APCHDL Conference
Secretariat.
[ bib ]
Sri Parameswaran & Mark F. Schulz.
SPOT: an expert system for digital synthesis.
In 8th Australian Conference on Microelectronics (MICROS '89), pages
95-101, Brisbane, Qld., Australia, 1989. Inst. Eng. Australia.
[ bib ]
Mark F. Schulz & Sri Parameswaran.
An expert system for design of ASICs.
In Conference on Computing Systems and Information Technology 1989
(CCSIT '89), pages 127-31, Sydney, NSW, Australia, 1989. Instn. Eng.
Australia.
[ bib ]
Sri Parameswaran.
Improving Education Planning in Sri Lanka - Computer Science and
Mathematics.
Technical report, Asian Development Bank, June 1999.
[ bib ]
Sri Parameswaran.
Proposal for a Centre for System-on-a-Chip Research.
Technical report, The University of Queensland / State Department of
Queensland, September 1998.
[ bib ]
Sri Parameswaran.
Strategies for the Indian Higher Education Student Market in
Engineering/IT.
Technical report, The University of Queensland, June 1998.
[ bib ]
Sri Parameswaran.
Proposed New Horizons Diploma & Professional Masters Program in
Information Technology.
Technical report, The University of Queensland, September 1998.
[ bib ]
Sri Parameswaran, Jorgen Peddersen & Ashley Partis.
A Low Power Chip Architecture, 2005.
[ bib ]
Multicore processors have been utilized in embedded systems and general
computing applications for some time. However, these multicore chips
execute multiple applications concurrently, with each core carrying
out a particular task in the system. Such systems can be found in
gaming, automotive real-time systems and video / image encoding devices.
These systems are commonly deployed to overcome deadline misses, which
are primarily due to overloading of a single multitasking core. In
this paper, we explore the use of multiple cores for a single application,
as opposed to multiple applications executing in a parallel fashion.
A single application is parallelized using two different methods:
one, a master-slave model; and two, a sequential pipeline model.
The systems were implemented using Tensilica's Xtensa LX processors
with queues as the means of communications between two cores. In
the master-slave model, we utilized a coarse-grained approach whereby
a main core distributes the workload to the remaining cores and reads
the processed data before writing the results back to file. In the
pipeline model, a lower granularity is used. The application is partitioned
into multiple sequential blocks; each block representing a stage
in a sequential pipeline. For both models we applied a number of
differing configurations ranging from a single core to a nine-core
system. We found that, without any optimization, the seven-core
sequential pipeline uses area more efficiently, with an
area-increase-to-speedup ratio of 1.83 compared to 4.34 for the
master-slave approach. With selective optimization in the pipeline
approach, we obtained speedups of up to 4.6 times with an area
increase of only 3.1 times (an area-increase-to-speedup ratio of
just 0.68).
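The area-increase-to-speedup figure quoted in this abstract is a simple quotient, and it can be checked with a few lines of arithmetic. The following is a minimal sketch and not code from the paper: the function name `area_speedup_ratio` is an assumption, and the inputs are the rounded values given in the abstract, so the result (about 0.67) differs slightly from the reported 0.68, which was presumably computed from unrounded measurements.

```python
def area_speedup_ratio(area_increase: float, speedup: float) -> float:
    """Return area increase divided by speedup.

    A lower value means each unit of speedup costs less additional area.
    """
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return area_increase / speedup


# Rounded values from the abstract: the optimized pipeline needs
# 3.1 times the area for a 4.6 times speedup.
optimized = area_speedup_ratio(3.1, 4.6)
print(f"optimized pipeline ratio: {optimized:.2f}")
```

A ratio below 1 means the speedup grew faster than the area, which is the case only for the selectively optimized pipeline in the figures above.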
Multiprocessor SoC systems have led to the increasing use of parallel
hardware along with the associated software. These approaches have
included coprocessor, homogeneous processor (e.g. SMP) and application
specific architectures (e.g. DSP, ASIC). ASIPs have emerged as a
viable alternative to conventional processing entities (PEs) due
to their configurability and programmability. In this work, we introduce
a heterogeneous multi-processor system using ASIPs as processing
entities in a pipeline configuration. A streaming application is
taken and manually broken into a series of algorithmic stages (each
of which makes up a stage in a pipeline). We formulate the problem
of mapping each algorithmic stage in the system to an ASIP configuration,
and propose a heuristic to efficiently search the design space for
a pipeline-based multi ASIP system.
We have implemented the proposed heterogeneous multiprocessor methodology
using a commercial extensible processor (Xtensa LX from Tensilica
Inc.). We have evaluated our system by creating two benchmarks (MP3
and JPEG encoders) which are mapped to our proposed design platform.
Our multiprocessor design provided a performance improvement of at
least 4.11X (JPEG) and 3.36X (MP3) compared to the single processor
design. The minimum cost obtained through our heuristic was within
5.47 respectively.
Side channel attacks are becoming a major threat to the security of
embedded systems. Countermeasures proposed against Simple Power
Analysis (SPA) and Differential Power Analysis (DPA) include data masking, table
masking, current flattening, circuitry level solutions, dummy instruction
insertions and balancing bit-flips. All these techniques are either
susceptible to multi-order side channel attacks, not sufficiently
generic to cover all encryption algorithms, or burden the system
with high area cost, run-time or energy consumption.
A HW/SW based randomized instruction injection technique is proposed
in this paper to overcome the pitfalls of previous countermeasures.
Our technique injects random instructions at random places during
the execution of an application, which protects the system from both
SPA and DPA. Further, we devise a systematic method to measure the
security level of a power sequence and use it to determine the number
of random instructions needed to suitably confuse the adversary.
Our processor model costs 1.9 processor, and costs on average 29.8 energy consumption for six industry standard cryptographic algorithms.
This paper describes the creation of custom floating point units (FPUs)
for Application Specific Instruction Set Processors (ASIPs). ASIPs
allow the customization of processors for use in embedded systems
by extending the instruction set, which enhances the performance of
an application or a class of applications. These extended instructions
are manifested as separate hardware blocks, making the creation of
any necessary floating point instructions quite unwieldy. On the
other hand, using a predefined FPU includes a large monolithic hardware
block with a considerable number of unused instructions. A customized
FPU will overcome these drawbacks, yet the manual creation of one
is a time-consuming, error-prone process. This paper presents a methodology
for automatically generating floating-point units (FPUs) that are
customized for specific applications at the instruction level. Custom
FPUs were generated for several Mediabench applications. Area savings
over a fully-featured FPU without resource sharing of 26 resource sharing and 33 Clock period increased in some cases by up to 9.5 sharing.
A method to both reduce energy and improve performance in
a processor-based embedded system is described in this paper. Comprising
a scratchpad memory instead of an instruction cache, the target
system dynamically (at runtime) copies into the scratchpad code segments
that are determined to be beneficial (in terms of energy efficiency
and/or speed) to execute from the scratchpad. We develop a heuristic
algorithm to select such code segments based on a metric called
concomitance. Concomitance is derived from the temporal relationships
of instructions. A hardware controller is designed and implemented
for managing the scratchpad memory. Strategically placed custom instructions
in the program inform the hardware controller when to copy instructions
from the main memory to the scratchpad. A novel heuristic algorithm
is implemented for determining the locations within the program at which
to insert these custom instructions. For a set of realistic benchmarks,
experimental results indicate the method uses 41.9% less energy (on
average) and improves performance by 40.0% compared to a traditional
cache system of identical size.
This brief presents a robust, power-efficient CMOS frequency divider
for the 5-GHz UNII band. The divider operates as a voltage-controlled
ring oscillator with the output frequency modulated by the switching
of the input transmission gate. The divider, designed in a 0.25-µm
SOS-CMOS technology, occupies 35 × 25 µm² and exhibits an operating
frequency of 5.6 GHz while consuming 79 µW at a supply voltage of
0.8 V. Process- and temperature-tolerant operation can be achieved
by utilizing novel compensation circuitry to calibrate the speed
of the ring oscillator-based divider. The simple compensation circuitry
contains low-speed digital logic and dissipates minimal additional
power since it is powered on only during the one-time factory calibration
sequence.
Numerous dynamic power management techniques have been proposed which
utilize the knowledge of processor power/energy consumption at run-time.
So far, no efficient method to provide run-time power/energy data
has been presented. Current measurement systems draw too much power
to be used in small embedded designs and existing performance counters
cannot provide sufficient information for run-time optimization.
This paper presents a novel methodology to solve the problem of run-time
power optimization by designing a processor that estimates its own
power/energy consumption. Estimation is performed by the addition
of small counters that tally events which consume power. This methodology
has been applied to an existing processor resulting in an average
power error of 2 adds little impact to the design, with only a 4.9 area and a 3 of an application that utilizes the processor showcases the benefits
the methodology enables in dynamic power optimization.
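The counter-based idea described in this abstract, tallying power-consuming events and weighting them, can be illustrated as a weighted sum. This is a minimal sketch under stated assumptions and not the CLIPPER design itself: the event names, the per-event energy coefficients, the clock rate and the helper names `estimate_energy_nj` and `average_power_mw` are all hypothetical.

```python
from typing import Mapping


def estimate_energy_nj(counts: Mapping[str, int],
                       energy_per_event_nj: Mapping[str, float]) -> float:
    """Estimate energy in nanojoules as a weighted sum of event counts."""
    missing = set(counts) - set(energy_per_event_nj)
    if missing:
        raise KeyError(f"no energy coefficient for events: {sorted(missing)}")
    return sum(n * energy_per_event_nj[event] for event, n in counts.items())


def average_power_mw(energy_nj: float, cycles: int, clock_hz: float) -> float:
    """Convert energy spent over a window of cycles into average milliwatts."""
    if cycles <= 0 or clock_hz <= 0:
        raise ValueError("cycles and clock_hz must be positive")
    seconds = cycles / clock_hz
    return energy_nj * 1e-9 / seconds * 1e3


# Hypothetical coefficients and counter readings, for illustration only.
COEFFS_NJ = {"alu_op": 0.5, "mem_access": 2.0, "branch": 0.8}
counters = {"alu_op": 1000, "mem_access": 200, "branch": 100}

energy = estimate_energy_nj(counters, COEFFS_NJ)
power = average_power_mw(energy, cycles=2000, clock_hz=100e6)
print(f"estimated energy: {energy:.1f} nJ, average power: {power:.1f} mW")
```

In hardware the counters would be small registers updated as events occur; the sketch only shows how such tallies could be turned into an energy and power figure.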
Until recently, there has been a lack of methods to trade off energy
use for quality of service at run-time in stand-alone embedded systems.
Such systems are motivated by the need to increase the apparent available
battery energy of portable devices, with minimal compromise in quality.
The available systems either drew too much power or added considerable
overheads due to task swapping. In this paper we demonstrate a feasible
method to perform these trade-offs. This work has been enabled by
a low-impact power/energy estimating processor which utilizes counters
to estimate power and energy consumption at run-time. Techniques
are shown that modify multimedia applications to vary the fidelity
of their output to optimize the energy/quality trade-off. Two adaptation
algorithms are applied to multimedia applications demonstrating the
efficacy of the method. The method increases code size by 1 execution time by 0.02 acceptable and processes up to double the number of frames.
Modern embedded systems execute a single application or a class of
applications repeatedly. An emerging methodology for designing
embedded systems utilizes configurable processors where the cache
size, associativity, and line size can be chosen by the designer.
In this paper, a method is given to rapidly find the L1 cache miss
rate of an application. An energy model and an execution time model
are developed to find the best cache configuration for the given
embedded application. Using benchmarks from Mediabench, we find that
our method is on average 45 times faster to explore the design space,
compared to Dinero IV while still having 100% accuracy.
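To make the design space in this abstract concrete, the sketch below enumerates cache size, line size and associativity, and simulates each configuration on an address trace. It is a plain brute-force baseline of the kind the paper sets out to speed up, not the paper's fast method. The LRU replacement policy, the function names `miss_rate` and `explore`, and the `toy_cost` function (a stand-in for the energy and execution time models) are assumptions for illustration.

```python
from collections import OrderedDict
from itertools import product


def miss_rate(trace, cache_size, line_size, assoc):
    """Simulate an LRU set-associative cache over a byte-address trace."""
    if not trace:
        raise ValueError("trace must not be empty")
    num_sets = cache_size // (line_size * assoc)
    if num_sets < 1:
        raise ValueError("cache too small for this line size and associativity")
    sets = [OrderedDict() for _ in range(num_sets)]
    misses = 0
    for addr in trace:
        block = addr // line_size
        lines = sets[block % num_sets]
        if block in lines:
            lines.move_to_end(block)       # mark as most recently used
        else:
            misses += 1
            if len(lines) >= assoc:
                lines.popitem(last=False)  # evict the least recently used line
            lines[block] = True
    return misses / len(trace)


def explore(trace, sizes, line_sizes, assocs, cost):
    """Return the (size, line, assoc) tuple with the lowest cost, and its miss rate."""
    best = None
    for size, line, assoc in product(sizes, line_sizes, assocs):
        if size < line * assoc:
            continue                       # configuration cannot hold one set
        rate = miss_rate(trace, size, line, assoc)
        score = cost((size, line, assoc), rate)
        if best is None or score < best[0]:
            best = (score, (size, line, assoc), rate)
    if best is None:
        raise ValueError("no valid configuration")
    return best[1], best[2]


def toy_cost(config, rate):
    """Hypothetical cost: penalize misses heavily and cache size slightly."""
    size, _line, _assoc = config
    return rate * 100.0 + size * 0.01


# Two blocks that conflict in a direct-mapped cache, accessed repeatedly.
trace = [0, 4, 8, 64, 0, 64] * 10
best_config, best_rate = explore(trace, [16, 32, 64], [16], [1, 2], toy_cost)
print(best_config, round(best_rate, 4))
```

On this toy trace the two-way configurations avoid the conflict misses that every direct-mapped configuration suffers, so the search settles on the smallest two-way cache.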
Scratchpad memory has been introduced as a replacement for cache memory
as it improves the performance of certain embedded systems. Additionally,
it has also been demonstrated that scratchpad memory can significantly
reduce the energy consumption of the memory hierarchy of embedded
systems. This is significant, as the memory hierarchy consumes a
substantial proportion of the total energy of an embedded system.
This paper deals with optimization of the instruction memory scratchpad
based on a methodology that uses a metric which we call the concomitance.
This metric is used to find basic blocks which are executed frequently
and in close proximity in time. Once such blocks are found, they
are copied into the scratchpad memory at appropriate times; this
is achieved using a special instruction inserted into the code at
appropriate places. For a set of benchmarks taken from Mediabench,
our scratchpad system consumed just 59 cache system, and 73 scratchpad system, while improving the overall performance. Compared
to the state of the art method, the number of instructions copied
into the scratchpad memory from the main memory is reduced by 88%.
This paper presents the power and performance analysis of a digital,
direct sequence ultra-wideband (DS-UWB) receiver operating in the
3 to 4 GHz band. The signal to noise and distortion ratio (SNDR)
and bit error rate (BER) were evaluated with varying degrees of front-end
linearity and analog to digital converter (ADC) accuracy. The analysis
and simulation results indicate two or more ADC bits are required
for reliable data reception in the presence of strong interference
and intermodulation distortion. In addition to BER performance, power
consumption of different hardware configurations is also evaluated
to form the cost function for evaluating design choices. The combined
power and performance analysis indicates that starting with one-bit
ADC resolutions, a substantial gain in reliability can be attained
by increasing ADC resolution to two-bits or more. When the ADC resolution
improves beyond three bits, front-end linearization achieves similar
BER improvements to increasing the ADC accuracy, at a fraction of
the power cost. As a result, linear front-end designs become significant
only when high precision ADCs are utilized
In this paper we propose application specific instruction set processors
with heterogeneous multiple pipelines to efficiently exploit the
available parallelism at instruction level. We have developed a design
system based on the Thumb processor architecture. Given an application
specified in C language, the design system can generate a processor
with a number of pipelines specifically suitable to the application,
and the parallel code associated with the processor. Each pipeline
in such a processor is customized, and implements its own special
instruction set so that the instructions can be executed in parallel
with low hardware overhead. Our simulations and experiments with
a group of benchmarks, largely from Mibench suite, show that on average,
77 pipeline ASIP, with the overheads of 49 power, 17% on switching activity, and 69% on code size.
Small area and code size are two critical design issues in most
embedded system designs. In this paper, we tackle these issues by
customizing forwarding networks and instruction encoding schemes
for multi-pipe Application Specific Instruction-Set Processors (ASIPs).
Forwarding is a popular technique to reduce data hazards in the pipeline
to improve performance and is applied in almost all modern processor
designs; but it is very area expensive. Instruction encoding schemes
have a direct impact on code size; an efficient encoding method can
lead to a small instruction width, hence reducing the code size.
We propose application specific techniques to reduce forwarding networks
and instruction widths for ASIPs with multiple pipelines. By these
design techniques, it is possible to reduce area, code size, and
even power consumption (due to reduced area), without costing any
performance. Our experiments, on a set of benchmarks using the proposed
customization approaches show that, on average, there are 27 on area, 30 time, performance even improves by 4 period.
Security and reliability in processor based systems are concerns requiring
adroit solutions. Security is often compromised by code injection
attacks, jeopardizing even `trusted software'. Reliability is of
concern where unintended code is executed in modern processors with
ever smaller feature sizes and low voltage swings causing bit flips.
Countermeasures by software-only approaches increase code size by
large amounts and therefore significantly reduce performance. Hardware
assisted approaches add extensive amounts of hardware monitors and
thus incur unacceptably high hardware cost. This paper presents a
novel hardware/software technique at the granularity of micro-instructions
to reduce overheads considerably. Experiments show that our technique
incurs an additional hardware overhead of 0.91 increase of 0.06 just 11.9 These overheads are far smaller than have been previously encountered.
In this paper we present NoCEE, a fast and accurate method for extracting
energy models for packet-switched network on chip (NoC) routers.
Linear regression is used to model the relationship between events
occurring in the NoC and energy consumption. The resulting models
are cycle accurate and can be applied to different technology libraries.
We verify the individual router estimation models with many different
synthetically generated traffic patterns and data inputs. Characterization
of a small library takes about two hours. The mean absolute energy
estimation error of the resultant models is 5 a complete gate level simulation. We also apply this method to a
number of complete NoCs with inputs extracted from synthetic application
traces and compare our estimated results to the gate level power
simulations (mean absolute error is 5 has been integrated with commercial logic synthesis flow and power
estimation tools (Synopsys Design Compiler and PrimePower), allowing
application across different designs. The extracted models show the
different trends across various parameterizations of network on chip
routers and have been integrated into an architecture exploration
framework.
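The regression step mentioned in the NoCEE abstract above can be sketched as an ordinary least-squares fit of energy against per-interval event counts. The event names (buffer writes, crossbar traversals), the sample data and the helper names are assumptions made for illustration; the paper's actual event set and tool flow are not given here.

```python
def fit_linear(X, y):
    """Least-squares fit y ~ X.w by solving the normal equations (X^T X) w = X^T y."""
    n = len(X[0])
    # Augmented matrix [X^T X | X^T y]
    A = [[sum(r[i] * r[j] for r in X) for j in range(n)]
         + [sum(r[i] * t for r, t in zip(X, y))]
         for i in range(n)]
    # Gauss-Jordan elimination with partial pivoting
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(A[r][c]))
        A[c], A[p] = A[p], A[c]
        for r in range(n):
            if r != c:
                f = A[r][c] / A[c][c]
                A[r] = [a - f * b for a, b in zip(A[r], A[c])]
    return [A[i][n] / A[i][i] for i in range(n)]

def predict(w, row):
    """Estimated energy for one row of event counts."""
    return sum(wi * xi for wi, xi in zip(w, row))

def mean_abs_error(w, X, y):
    """Mean absolute estimation error over a set of samples."""
    return sum(abs(predict(w, r) - t) for r, t in zip(X, y)) / len(y)

if __name__ == "__main__":
    # Columns (hypothetical): constant term, buffer writes, crossbar traversals
    X = [[1, 0, 0], [1, 1, 0], [1, 0, 1], [1, 2, 1], [1, 1, 3]]
    true_w = [2.0, 0.5, 1.5]          # made-up coefficients for the demo
    y = [predict(true_w, r) for r in X]
    w = fit_linear(X, y)
    print(w, mean_abs_error(w, X, y))
```

In a real characterization the reference energies in `y` would come from gate-level power simulation rather than from a known coefficient vector.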
Automatic instruction generation is an efficient method to satisfy
growing performance and meet design constraints for application specific
instruction-set processors. A typical approach for instruction generation
is to combine a large group of primitive instructions into a single
extensible instruction for maximizing speedups. However, this approach
often leads to large power dissipation and discharge current, posing
a challenge to battery-powered products. In this paper, we propose
a battery-aware automatic tool to design extensible instructions
which minimizes power dissipation distribution by separating an instruction
into multiple instructions. We verify our automatic tool using 50
different code segments, and five large real-world applications.
Our tool reduces energy consumption by a further 5.8 (up to 17.7 approaches. For real-world applications, energy consumption is reduced
by 6.6 for most cases. The automatic instruction generation tool is integrated
into our application specific instruction-set processor tool suite.
This paper presents an approach to dynamically balance the pipeline
by scaling the stage supply voltages. Simulation results show that
by such an approach about 50 time, and 11 limited memory overhead
The paper presents a performance analysis of direct sequence ultra
wideband (DS-UWB) systems operating with non-linear receiver front-ends.
Following this analysis, we propose the novel use of pulse doublets
to mitigate non-linearity induced distortion. The signal-to-noise-and-distortion
ratio (SNDR) and bit error rate (BER) are evaluated with varying
degrees of non-linearity and interference power. Simulation results
indicate significant performance improvements by using pulse doublets
under high interference power and non-linear operating conditions.
Using pulse doublets allows reduced front-end linearity requirements
and enables improvements in more critical circuit parameters. Front-end
modules, such as low noise amplifiers (LNAs), mixers and baseband
amplifiers, are designed using Peregrine's 0.5 µm SOS-CMOS process
to demonstrate the benefits of circuits designed with relaxed linearity
requirements. Simulation results obtained using the Cadence Spectre
RF simulator indicate that the sub-linear front-end achieves 33 dB
increase in voltage gain, 2 dB improvement in noise figure, 64 in power and 917 MHz extension in bandwidth over its more linear
counterpart
In this paper, we present a novel and fast constructive technique
that relocates the instruction code in such a manner into the main
memory that the cache is utilized more efficiently. The technique
is applied as a preprocessing step, i.e., before the code is executed.
Our technique is applicable in embedded systems where the number
and characteristics of tasks running on the system are known a priori.
The technique does not impose any computational overhead to the system.
As a result of applying our technique to a variety of real-world
applications we observed through simulation a significant drop of
cache misses. Furthermore, the energy consumption of the whole system
(CPU, caches, buses, main memory) is reduced by up to 65 benefits could be achieved by a slightly increased main memory size
of about 13% on average
This paper presents an RTL generation scheme for a SimpleScalar/PISA
instruction set architecture with system calls to implement C programs.
The scheme utilizes ASIPmeister, a processor generation tool. The
RTL generated is available for download. The second part of the paper
shows a method of reducing the PISA instruction set and generating
a processor for a given application. This reduction and generation
can be performed within an hour, making this one of the fastest methods
of generating an application specific processor. For five benchmark
applications, we show that on average, processor size can be reduced
by 30 by 24%
In this paper we propose Application Specific Instruction Set Processors
with heterogeneous multiple pipelines to efficiently exploit the available
parallelism at instruction level. We have developed a design system
based on the Thumb processor architecture. Given an application specified
in C language, the design system can generate a processor with a
number of pipelines specifically suitable to the application and the
parallel code associated with the processor. Each pipeline in such
a processor is customized, and implements its own special instruction
set so that the instructions can be executed in parallel with low
hardware overhead. Our simulations and experiments with a group of
benchmarks, largely from Mibench suite, show that on average, 77 performance improvement can be achieved compared to a single pipeline
ASIP, with the overheads of 49 activity, and 69% on code size.
This paper presents a methodology for monitoring security in Application
Specific Instruction-set Processors (ASIPs). This is a generalized
methodology for inline monitoring of insecure operations in machine
instructions at microinstruction level. Microinstructions are embedded
into the critical machine instructions forming self checking instructions.
We name this method Micro Embedded Monitoring. Since ASIPs are designed
exclusively for a particular application domain, the Instruction
Set Architecture (ISA) of an ASIP is based on the application executed.
Knowledge of the domain gives an insight into the kinds of the security
threats which need to be considered. The fact that the ISA design
is based on the application makes room to accommodate security monitoring
support during the design phase by embedding microinstructions into
the critical machine instructions. Since the microinstructions are
the lowest possible software level architecture, we could expect
to get better performance by implementing security detection using
microinstruction routines. Four different embedded security monitoring
routines are implemented for evaluation. The average performance
penalty with these monitoring routines with ten different benchmarks
is 1.93 3.07% respectively.
In this paper, we show a novel approach to accelerate loops by tightly
coupling a coprocessor to an ASIP. Latency hiding is used to exploit
the parallelism available in this architecture. To illustrate the
advantages of this approach, we investigate a JPEG encoding algorithm
and accelerate one of its loops by implementing it in a coprocessor.
We contrast the acceleration by implementing the critical segment
as two different coprocessors and a set of customized instructions.
The two different coprocessor approaches are: a high-level synthesis
(HLS) approach; and a custom coprocessor approach. The HLS approach
provides a faster method of generating coprocessors. We show that
a loop performance improvement of 2.57× is achieved using the custom
coprocessor approach, compared to 1.58× for the HLS approach and
1.33× for the customized instruction approach compared with just
the main processor. Respective energy savings within the loop are
57%, 28% and 19%
In this paper, we describe NoCGEN, a Network On Chip (NoC) generator,
which is used to create a simulatable and synthesizable NoC description.
NoCGEN uses a set of modularised router components that can be used
to form different routers with a varying number of ports, routing
algorithms, data widths and buffer depths. A graph description representing
the interconnection between these routers is used to generate a top-level
VHDL description. A wormhole output-queued 2-D mesh router was created
to verify the capability of NoCGEN. Various parameterized designs
were synthesized to provide estimated gate counts of 129 K to 695
K for a number of topologies varying from a 2×2 mesh to a 4×4 mesh,
with constant data bus size width of 32. The NoC was simulated with
random traffic using a mixed SystemC/VHDL environment to ensure correctness
of operation and to obtain performance and average latency. The results
show an accepted load of 53 depth from 8 to 32 flits for the 4×4 mesh router
Designing extensible instructions is a computationally complex task,
due to the large design space each instruction is exposed to. One
method of speeding up the design cycle is to characterize instructions
and estimate their peculiarities during a design exploration. In
this paper, we study and derive three estimation models for extensible
instructions: area overhead, latency, and power consumption under
a wide range of customization parameters. System decomposition and
regression analysis are used as the underlying methods to characterize
and analyze extensible instructions. We verify our estimation models
using automatically and manually generated extensible instructions,
plus extensible instructions used in large real-world applications.
The mean absolute errors of our estimation models are as small as:
3.4 and 4.2 through the time consuming synthesis and simulation steps using commercial
tools. Our estimation models achieve an average speedup of three
orders of magnitude over the commercial tools and thus enable us
to conduct a fast and extensive design space exploration that would
otherwise not be possible. The estimation models are integrated into
our extensible processor tool suite.
Designing custom-extensible instructions for extensible processors
is a computationally complex task because of the large design space.
The task of automatically matching candidate instructions in an application
(e.g. written in a high-level language) to a pre-designed library
of extensible instructions is especially challenging. Previous approaches
have focused on identifying extensible instructions (e.g. through
profiling), synthesizing extensible instructions, estimating expected
performance gains etc. In this paper we introduce our approach of
automatically matching extensible instructions as this key step is
missing in automating the entire design flow of an ASIP with extensible
instruction capabilities. Since matching using simulation is practically
infeasible (simulation time), and traditional pattern matching approaches
would not yield reliable results (ambiguity related to a functionally
equivalent code that can be represented in many different ways),
we adopt combinational equivalence checking. Our MINCE tool as part
of our ASIP design flow consists of a translator, a filtering algorithm
and a combinational equivalence checking tool. We report matching
times of extensible instructions that are 7.3x faster on average
(using Mediabench applications) compared to the best known approaches
to the problem (partial simulations). In all our experiments MINCE
matched correctly and the outcome of the matching step yielded an
average speedup of the application of 2.47x. As a summary, our work
represents a key step towards automating the whole design flow of
an ASIP with extensible instruction capabilities
The memory hierarchy subsystem has a significant impact on performance
and energy consumption of an embedded system. Methods which increase
the hit ratio of the cache hierarchy will typically enhance the performance
and reduce the embedded system's total energy consumption. This is
mainly due to reduced cache-to-memory bus transactions, fewer main
memory accesses and fewer processor waiting cycles. A heuristic approach
is presented to reduce the total number of cache misses by carefully
relocating selected sections of the application's software code within
the main memory, thus reducing conflict misses resulting from the
cache hierarchy. The method requires no hardware modifications i.e.
it is a software-only approach. For the first time such a method
is applied to large program traces, and the miss rates and corresponding
energy savings are observed while varying cache size, line size and
associativity. Relocating the code consistently produces superior
performance on direct-mapped cache. Since direct-mapped caches, being
smaller in silicon area than caches with higher associativity (for
the same size), cost less in terms of energy/access, and access faster,
using direct-mapped instruction cache with code relocation for performance-oriented
embedded systems is recommended. A maximum cache miss rate reduction
from 71 of up to 63% with only a small increase in main memory size
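The effect of code relocation on conflict misses, as described in the abstract above, can be shown with a toy direct-mapped instruction cache. This is only an illustration under assumed parameters (256-byte cache, 16-byte lines, two 64-byte code segments called alternately); it is not the paper's heuristic, and the function names are hypothetical.

```python
def dm_misses(trace, cache_size, line_size):
    """Count misses of a direct-mapped cache on an instruction address trace."""
    lines = cache_size // line_size
    resident = [None] * lines         # block number currently held by each line
    misses = 0
    for addr in trace:
        block = addr // line_size
        idx = block % lines
        if resident[idx] != block:
            resident[idx] = block
            misses += 1
    return misses

def call_trace(base_a, base_b, size, iterations, step=4):
    """Fetch trace of two code segments executed alternately."""
    trace = []
    for _ in range(iterations):
        trace += range(base_a, base_a + size, step)
        trace += range(base_b, base_b + size, step)
    return trace

if __name__ == "__main__":
    cache, line, seg = 256, 16, 64
    # Segment B at 256 maps onto the same cache lines as segment A at 0.
    before = dm_misses(call_trace(0, 256, seg, 10), cache, line)
    # Relocating B by 64 bytes of padding moves it to different cache lines.
    after = dm_misses(call_trace(0, 256 + seg, seg, 10), cache, line)
    print(before, after)
```

The padding inserted to move the second segment is the kind of main memory size increase that the abstract reports as a side effect.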
In this paper, we propose a methodology for energy reduction and performance
improvement. The target system comprises an instruction scratchpad
memory instead of an instruction cache. Highly utilized code segments
are copied into the scratchpad memory, and are executed from the
scratchpad. The copying of code segments from main memory to the
scratchpad is performed during runtime. A custom hardware controller
is used to manage the copying process. The hardware controller is
activated by strategically placed custom instructions within the
executing program. These custom instructions inform the hardware
controller when to copy during program execution. Novel heuristic
algorithms are implemented to determine locations within the program
to insert these custom instructions, as well as to choose the best
sets of code segments to be copied to the scratchpad memory. For
a set of realistic benchmarks, experimental results indicate the
method uses 50.7 by 53.2 which is identical in size. Cache systems compared had sizes ranging
from 256 to 16K bytes and associativities ranging from 1 to 32
In this paper we demonstrate the feasibility of a dual pipeline application
specific instruction set processor. We take a C program and create
a target instruction set by compiling to a basic instruction set
from which some instructions are merged, while others discarded.
Based on the target instruction set, parallelism of the application
program is analyzed and two unique instruction sets are generated
for a heterogeneous dual-pipeline processor. The dual pipe processor
is created by making two unique ASIPs (VHDL descriptions) utilizing
the ASIP-Meister Tool Suite, and fusing the two VHDL descriptions
to construct a dual pipeline processor. Our results show that in
comparison to the single pipeline application specific instruction
set processor, the performance improves by 27.6 reduces by 6.1 at the cost of increased area which for benchmarks considered is
16.7% on average
This paper presents the INSIDE system that rapidly searches the design
space for extensible processors, given area and performance constraints
of an embedded application, while minimizing the design turn-around-time.
Our system consists of a) a methodology to determine which code segments
are most suited for implementation as a set of extensible instructions,
b) a heuristic algorithm to select pre-configured extensible processors
as well as extensible instructions (library), and c) an estimation
tool which rapidly estimates the performance of an application on
a generated extensible processor. By selecting the right combination
of a processor core plus extensible instructions, we achieve a performance
increase on average of 2.03x (up to 7x) compared to the base processor
core at a minimum hardware overhead of 25% on average
We present a methodology that maximizes the performance of Tensilica
based Application Specific Instruction-set Processor (ASIP) through
instruction selection when an area constraint is given. Our approach
rapidly selects from a set of pre-fabricated coprocessors/functional
units from our library of pre-designed specific instructions (to
evaluate our technology we use the Tensilica platform). As a result,
we significantly increase application performance while area constraints
are satisfied. Our methodology uses a combination of simulation,
estimation and a pre-characterised library of instructions, to select
the appropriate co-processors and instructions. We report that by
selecting the appropriate coprocessors/functional units and specific
TIE instructions, the total execution time of complex applications
(we study a voice encoder/decoder), an application's performance
can be reduced by up to 85 Our estimator used in the system takes typically less than a second
to estimate, with an average error rate of 4 simulation, which takes 45 minutes). The total selection process
using our methodology takes 3-4 hours, while a full design space
exploration using simulation would take several days
This paper presents for the first time the circuit parameter analysis
of a digital multiband-UWB transceiver, encompassing a novel low-power
sub-band generator. This sub-band generator is capable of producing
multiple frequency bands, enabling sub-band generation from 3 to
10 GHz with nanosecond switching times. The circuit analysis of the
complete transceiver is used to set parameters of components. The
analysis indicates that an LNA gain of 20 dB, baseband amplifier gain
of 45 dB, matched filter accuracy of five bits, ADC accuracy of two
bits, a 60 dB dynamic range of the multi-frequency generator, and
front end offset voltage of less than 30 mV are required to achieve
a 10 dB SNR. Hspice simulations utilizing 0.35 µm CMOS technology
suggest that the power consumption of the sub-band generator is 8
mW from a 1.8 V power supply
Code placement techniques for instruction code have been shown to increase
an SoC's performance mostly due to the increased cache hit ratios
and as such those techniques can be a major optimization strategy
for embedded systems. Little has been investigated on the interdependencies
between code placement techniques and interconnect traffic (e.g.
bus traffic) and optimization techniques combining both. In this
paper we show as the first approach of its kind that a carefully
designed known code placement strategy combined and adapted to a
known interconnect encoding scheme does not only lead to a performance
increase but it does also lead to a significant reduction of interconnect-related
energy consumption. This becomes especially interesting since future
SoC bus systems (or more general: networks on a chip) are predicted
to be a dominant energy consumer of an SoC. We show that a high-level
optimization strategy like code placement and a lower-level optimization
strategy like interconnect encoding are NOT orthogonal. Specifically,
we report cache miss reduction ratios of 32 with bus related energy savings of 50.4 of up to 95.7 results have been verified by means of diverse real-world SoC applications
Presents the Smith and Waterman algorithm-specific ASIC design (SWASAD)
project. This is a hardware solution that implements the S and W
algorithm. The SWASAD is an improved implementation of the biological
information signal processor (BISP) design. The SWASAD chip fabricated
on a 0.5 µm process achieves 3200 million matrix cells per second
(MCPS) per chip, with a layout size of 7.1 mm by 7.1 mm. This is
a large improvement over existing designs and improves data throughput
by using a smaller datawidth
This paper introduces an algorithm for code placement in cache, and
maps it to memory using a second algorithm. The target architecture
is a multiprocessor system with 1st level cache and a common main
memory. These algorithms guarantee that as many instruction codewords
as possible of the high priority tasks remain in cache all of the
time so that other tasks do not overwrite them. This method improves
the overall performance, and might result in cheaper systems if more
powerful processors are not needed. Amount of memory increase necessary
to facilitate this scheme is in the order of 13 of highest priority tasks always in memory can vary from 3 depending upon how many tasks (and their sizes) are allocated to
each processor
The ratio of cache hits to cache misses in a computer system is, to
a large extent, responsible for its characteristics such as energy
consumption and performance. In recent years energy efficiency has
become one of the dominating design constraints, due to the rapid
growth in market share for mobile computing/communication/internet
devices. We present a novel fast constructive technique that relocates
the instruction code in such a manner into the main memory that the
cache is utilized more efficiently. The technique is applied as a
pre-processing step, i.e., before the code is executed. It is applicable
for embedded systems where the number and characteristics of tasks
running on the system are known a priori. The technique does not
impose any computational overhead to the system. As a result of applying
our technique to a variety of real-world applications we measured
(through simulation) that the number of cache misses drops significantly.
Further, this reduces the energy consumption of a whole system (CPU,
caches, buses, main memory) by up to 65 memory size of 13% on average
We present a design strategy to reduce power demands in application-specific
heterogeneous multiprocessor systems with interdependent subtasks.
This power reduction scheme can be used with a randomised search
such as a genetic algorithm where multiple trial solutions are tested.
The scheme is applied to each trial solution after allocation and
scheduling have been performed. Power savings are achieved by equally
expanding each processor's execution time with a corresponding reduction
in their respective operating voltage. Lowest cost solutions achieve
average reductions of 24% while minimum power solutions average 58%
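The idea of trading slack for voltage, as in the abstract above, can be sketched with a first-order model. The assumptions here are not from the paper: dynamic power is taken as proportional to V^2 * f, and the achievable clock frequency is taken as roughly proportional to V (threshold voltage ignored), so stretching execution time by a factor s lets both f and V drop by s. The deadline parameter and the name `stretch_schedule` are also hypothetical.

```python
def stretch_schedule(powers, makespan, deadline):
    """Stretch every processor's execution time equally to fill the deadline.

    Assumed model: P ~ V^2 * f and f ~ V, so a stretch factor s scales
    power by 1/s^3 (and energy per task by 1/s^2).
    """
    if makespan <= 0 or deadline < makespan:
        raise ValueError("deadline must be at least the (positive) makespan")
    s = deadline / makespan
    return s, [p / s ** 3 for p in powers]

if __name__ == "__main__":
    s, scaled = stretch_schedule([2.0, 1.0], makespan=8.0, deadline=10.0)
    saving = 1.0 - sum(scaled) / 3.0
    print(f"stretch {s:.2f}, power saving {saving:.1%}")
```

Inside a randomised search, a function like this would be applied to each trial solution after allocation and scheduling, so that the power figure used for ranking already includes the voltage reduction.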
This paper presents an application-specific, heterogeneous multiprocessor
synthesis system, named HeMPS, that combines a form of evolutionary
computation known as Differential Evolution with a scheduling heuristic
to search the design space efficiently. We demonstrate the effectiveness
of our technique by comparing it to similar existing systems. The
proposed strategy is shown to be faster than recent systems on large
problems while providing equivalent or improved final solutions
We describe the first iteration of a comprehensive model with which
we can investigate the practical limits on optical bus bandwidth
and number of bus processing modules for given signal power. The
selection algorithm will ultimately allow programmable evaluation
of the system parameters (bus bandwidth, optical power budget, electrical
power budget, number of modules and space consumption) for an optimal
design that is suitable for on-the-fly system reconfiguration.
Automation of the hardware/software codesign (HSC) methodology brings
with it the need to develop sophisticated high-level profiling tools.
This paper presents a profiling tool which uses execution profiling
on standard C code to obtain accurate and consistent times at the
level of individual compound code sections. This tool is used in
the ASP hardware/software codesign project. The results from this
tool show that profiling must be performed on dedicated hardware
which is as close as possible to the final implementation, as opposed
to a workstation. Further, in this paper a formula is derived for
the number of times a program has to be profiled in order to get
an accurate estimate of the number of times a loop with an indeterminate
loop count is executed
We present a design strategy to reduce power demands in application-specific
heterogeneous multiprocessor systems with interdependent subtasks.
This power reduction scheme can be used with a randomised search
such as a genetic algorithm where multiple trial solutions are tested.
The scheme is applied to each trial solution after allocation and
scheduling have been performed. Power savings are achieved by equally
expanding each processor's execution time with a corresponding reduction
in their respective operating voltage. Lowest cost solutions achieve
average reductions of 24% while minimum power solutions average 58%
Efficient models are introduced for totally self-checking/code disjoint
(TSC/CD) and strongly fault-secure/strongly code disjoint (SFS/SCD)
synchronous controller models. These models are based on two low-cost,
modular, TSC edge-triggered and error-propagating CD flip-flops.
Properties of the proposed synchronous controller models are proven.
The design procedure for these models and their proper applications
are explained
This paper describes the unrolling of loops with indeterminate loop
counts in system level pipelines. Two methods are discussed in this
paper. The first method is the varied latency method, where the input
is blocked until the pipeline is clear. This variation in the input
arrival time gives rise to the name. In this method, the output is
in the same order as the input. The second method, called the fixed
latency method, allows for the input arrival time to remain unchanged.
The loops with loop count in excess of the number of unrolled loops
must be stored until a suitable gap in the system becomes available.
It is shown that the number of loops should be equal to the sum of
the expected value of the loop count and standard deviation of the
loop count in the varied latency method, and the expected value of
the loop count for the fixed latency method
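The sizing rule stated at the end of the abstract above can be written down directly: the expected loop count plus its standard deviation for the varied latency method, and the expected loop count alone for the fixed latency method. The sketch below estimates both from observed loop counts; using the population standard deviation and rounding up to a whole number of stages are assumptions, and the function name is hypothetical.

```python
import math
import statistics

def unroll_counts(observed_loop_counts):
    """Return (varied_latency, fixed_latency) numbers of unrolled loop bodies."""
    mean = statistics.mean(observed_loop_counts)
    std = statistics.pstdev(observed_loop_counts)  # population std (assumed)
    varied = math.ceil(mean + std)   # varied latency: E[count] + std deviation
    fixed = math.ceil(mean)          # fixed latency: E[count]
    return varied, fixed

if __name__ == "__main__":
    samples = [2, 4, 4, 4, 5, 5, 7, 9]   # made-up profiled loop counts
    print(unroll_counts(samples))
```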
The authors introduce two low-cost, modular, totally self checking
(TSC), edge triggered and error propagating (code disjoint) flip-flops:
one, a D flip-flop used in TSC and strongly fault secure (SFS) synchronous
circuits with two-rail codes, the other a T flip-flop, used in a
similar way as the D flip-flop but retaining the error as an indicator
until the next presetting, to aid error propagation. Thus, the self
checking T flip-flop can be used as an error indicator. The self
checking D flip-flop is smaller than the duplicate D flip-flop circuitry
by 30 than the previous error indicator in the literature. These circuits,
unlike previously reported circuits, can also detect stuck-at faults
in the clock inputs. The authors have also presented TSC/error propagating
applications for the above flip-flops: a counter and a shift register
As we move towards several million transistors per chip, it is desirable
to move to higher levels of abstraction for the purposes of automated
design of systems. Increasing performance of microprocessors in the
marketplace is moving the balance between software and hardware.
In this environment, it is necessary to adapt our tools to create
systems that encompass these fast microprocessors rather than compete
with them. It is also important to incorporate other peripheral
components, such as sensors and RF circuits, into our design methodology.
The reduction of power consumption for a system level pipeline is
addressed in this paper. The pipeline is composed of several stages.
Each stage has several behaviours. Different behaviours have differing
execution times. The speed of the pipeline is only affected by the
behaviours on the critical path of the slowest stages. Other behaviours
can be slowed down to decrease the power consumed in the system.
We propose a multi-voltage supply scheme, in which differing behaviours
are supplied with differing voltages. The formulas for computing
the supply voltage of each behaviour and minimal power consumption
are derived in this paper. Computer experiments show that up to 80%
of hardware power can be saved with this scheme.
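The abstract above does not reproduce the derived formulas, so the following is only a sketch of the idea under a textbook first-order model, not the paper's derivation. It assumes gate delay proportional to V / (V - Vt)^2 and dynamic energy proportional to V^2; the function names, the 3.3 V nominal supply and the 0.7 V threshold are all hypothetical.

```python
def delay_factor(v, vt):
    """Relative gate delay at supply voltage v (assumed first-order model)."""
    return v / (v - vt) ** 2


def scaled_voltage(t_nominal, t_allowed, vdd=3.3, vt=0.7, tol=1e-6):
    """Lowest supply voltage at which a behaviour still finishes in t_allowed.

    t_nominal is the behaviour's execution time at the nominal supply vdd.
    Behaviours with no slack keep the nominal supply.
    """
    if t_allowed <= t_nominal:
        return vdd
    target = delay_factor(vdd, vt) * t_allowed / t_nominal
    lo, hi = vt + 1e-9, vdd
    # delay_factor decreases monotonically for v > vt, so bisection applies.
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if delay_factor(mid, vt) > target:
            lo = mid  # too slow at this voltage: raise it
        else:
            hi = mid  # fast enough: try a lower voltage
    return hi


def energy_saving(v, vdd=3.3):
    """Fraction of dynamic energy saved relative to the nominal supply."""
    return 1 - (v / vdd) ** 2
```

A behaviour off the critical path that may run twice as long as its nominal time would be passed as `scaled_voltage(5, 10)`; the result feeds `energy_saving` to estimate the saving for that behaviour alone.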
This paper presents an application-specific, heterogeneous multiprocessor
synthesis system, named HeMPS, that combines a form of Evolutionary
Computation known as Differential Evolution with a scheduling heuristic
to search the design space efficiently. We demonstrate the effectiveness
of our technique by comparing it to similar existing systems. The
proposed strategy is shown to be faster than recent systems on large
problems while providing equivalent or improved final solutions.
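For readers unfamiliar with Differential Evolution, a minimal sketch of the common DE/rand/1/bin variant follows. It is not the HeMPS implementation: in a synthesis system the cost function would presumably come from the scheduling heuristic, whereas here any Python callable stands in, and all parameter names and defaults are assumptions.

```python
import random


def differential_evolution(cost, bounds, pop_size=20, scale=0.5, cr=0.9,
                           generations=200, seed=0):
    """Minimise cost(x) over box bounds with DE/rand/1/bin."""
    rng = random.Random(seed)
    dim = len(bounds)
    pop = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(pop_size)]
    costs = [cost(x) for x in pop]
    for _ in range(generations):
        for i in range(pop_size):
            # Pick three distinct population members other than i.
            a, b, c = rng.sample([j for j in range(pop_size) if j != i], 3)
            forced = rng.randrange(dim)  # at least one gene comes from the mutant
            trial = []
            for d in range(dim):
                if d == forced or rng.random() < cr:
                    v = pop[a][d] + scale * (pop[b][d] - pop[c][d])
                    lo, hi = bounds[d]
                    v = min(max(v, lo), hi)  # clip to the search box
                else:
                    v = pop[i][d]
                trial.append(v)
            trial_cost = cost(trial)
            # Greedy selection: keep the trial only if it is no worse.
            if trial_cost <= costs[i]:
                pop[i], costs[i] = trial, trial_cost
    best = min(range(pop_size), key=costs.__getitem__)
    return pop[best], costs[best]
```

Because selection is greedy, the best cost in the population never increases from one generation to the next.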
In this paper, we introduce a low-cost, modular, totally self checking
(TSC), edge triggered and error propagating (code disjoint) D-type
flip-flop. This D flip-flop can be used in TSC and strongly fault
secure (SFS) synchronous circuits with two-rail codes.
In this paper, a new efficient self exercising checker for a class
of non-systematic two rail codes is introduced. This proposed checker
is called the “Two Dimensional Self Exercising Two Rail checker”
(TDSETR checker). The TDSETR checker is used to check non-systematic
two rail codes. In this method, the checking is done on line. This
method significantly reduces the complexity and delay of checking
compared to previous circuits. The high speed and low cost advantages
of the TDSETR checker are compared with conventional circuits (e.g.,
the circuit is 30% [...] reported circuits for 256 bit pair inputs).
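As background for the two-rail abstracts above, the conventional two-rail checker cell (not the TDSETR checker, whose structure the abstract does not give) can be sketched as follows. Each input is a pair of bits that is valid when the two bits are complementary; the function names are chosen for this illustration only.

```python
from functools import reduce


def two_rail_cell(a, b):
    """Conventional two-rail checker cell on bit pairs a and b.

    The output pair is complementary only if both input pairs are.
    """
    a0, a1 = a
    b0, b1 = b
    return ((a0 & b0) | (a1 & b1), (a0 & b1) | (a1 & b0))


def two_rail_check(pairs):
    """Reduce any number of bit pairs to one pair with a chain of cells."""
    return reduce(two_rail_cell, pairs)


def is_valid(pair):
    """A two-rail pair is a code word when its bits differ."""
    return pair[0] != pair[1]
```

A single non-complementary input pair, such as (1, 1), forces a non-complementary output, which is the error-propagating (code disjoint) behaviour the abstracts refer to.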
A new model for estimating dynamic power dissipation in CMOS combinational
circuits at differing voltages is presented. The proposed model deals
with power dissipation of circuits at saturation frequencies, where
the output voltage does not reach 100% [...] the output voltage waveform
is almost a triangular waveform. We show
that the dynamic power consumption at saturation frequencies is only
dependent on the supply voltage, and is independent of load capacitance
and switching speed. This model shows that when a circuit is working
in the saturation frequency range, as the frequency is increased,
the performance/power ratio is increased. However, this increase
in performance/power ratio is at the expense of noise margin. The
model is theoretically and empirically shown to be correct. This
model can be used to design a system where the differing combinational
logic blocks are supplied with differing voltages. Such a system
would consume less power than if it were supplied by a single
voltage rail.
A system to be pipelined usually consists of several sub-systems or
sub-processes. Each sub-process may have several implementations.
Each implementation has an associated cost and an associated execution
time. Different partitions result in different pipelines. To obtain
an efficient pipeline, partitioning is important. In this paper,
an algorithm to partition a system in order to obtain an optimal
or near-optimal pipeline is given. Results obtained by this method
are very close to the optimal solution, but can be found in a fraction
of the time taken to search for an optimal solution.
The reduction of power consumption for a system level pipeline is
addressed in this paper. The pipeline is composed of several stages.
Each stage has several behaviours. Different behaviours have differing
execution times. The speed of the pipeline is only affected by the
behaviours on the critical path of the slowest stages. Other behaviours
can be slowed down to decrease the power consumed in the system.
We propose a multi-voltage supply scheme, in which differing behaviours
are supplied with differing voltages. The formulas for computation
of the supply voltage for each behaviour and minimal power consumption
are derived in this paper. The results of computer experiments are
also provided here.
In this paper we describe a method for resynthesizing the controller
of a design for a fixed datapath with the objective of increasing
the design's throughput by minimizing its total execution time. This
work has potential in two important areas: one, design reuse for
retargeting datapaths to new libraries, new technologies and different
bit-widths; and two, back-annotation of physical design information
during High-Level Synthesis (HLS), and subsequent adjustment of the
design's schedule to account for realistic physical design information
with minimal changes to the datapath. We present our approach using
various formulations, prove the optimality of our algorithm and demonstrate
the effectiveness of our technique on several HLS benchmarks. We
have observed improvements of up to 34% [...] application of our
controller resynthesis technique to the outputs of HLS.
Describes a powerful post-synthesis approach called reclocking, for
performance improvement by minimizing the total execution time. By
back annotating the wire delays of designs created by a high level
synthesis system, and then finding an optimal clock width, we resynthesize
the controller to improve performance without altering the datapath.
Reclocking is versatile and can be applied not only for wire delay
consideration, but also for bit-width migration, library migration
and for feature size migration supporting the philosophy of design
reuse. Experimental results show that with reclocking, the performance
of the input designs can be improved by as much as 34%.
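The idea of searching for a clock width that minimizes total execution time can be illustrated with a deliberately simplified model. This is not the reclocking algorithm itself: it assumes each control step has a known integer delay (for instance with wire delays already back-annotated), that a step occupies a whole number of clock cycles, and that a minimum feasible period is given. All names are hypothetical.

```python
from fractions import Fraction
from math import ceil


def total_time(delays, period):
    """Execution time when each step takes ceil(delay / period) cycles."""
    return period * sum(ceil(Fraction(d) / period) for d in delays)


def best_clock_period(delays, min_period):
    """Clock period >= min_period that minimises total_time.

    delays and min_period are integers in the same time unit. Under this
    model the optimum lies at min_period or at some delay / k, so only
    those candidates are tried. Ties go to the longer period.
    """
    candidates = {Fraction(min_period)}
    for d in delays:
        k = 1
        while Fraction(d, k) >= min_period:
            candidates.add(Fraction(d, k))
            k += 1
    return min(candidates, key=lambda p: (total_time(delays, p), -p))
```

Exact fractions are used instead of floats so that a delay divided by its own candidate period always rounds to the intended cycle count.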
Automation of the hardware/software codesign methodology brings with
it the need to develop sophisticated high-level profiling tools.
This paper presents a profiling tool which uses execution profiling
on standard C code to obtain accurate and consistent times at the
level of individual compound code sections. This tool is used in
the ASP Hardware/Software Codesign project. The results from this
tool show that profiling must be performed on dedicated hardware
which is as close as possible to the final implementation, as opposed
to a workstation.
We explain the steps of the CAD tools developed for self checking
circuits. The CAD tools developed are used to design strongly fault
secure, strongly code disjoint (SFS/SCD) and totally self checking,
code disjoint (TSC/CD) circuits. Self checking combinatorial and
sequential synchronous circuits including shift registers, counters,
adders and checkers are designed using these tools. The output of
these CAD tools is given in structural level VHDL which can be synthesized
via commercial tools.
Introduces design models for totally self checking, code disjoint
(TSC/CD) and strongly fault secure, strongly code disjoint (SFS/SCD)
synchronous controllers. The TSC/CD and SFS/SCD models are based
on two newly proposed low-cost, modular, totally self checking (TSC),
edge triggered and error propagating (code disjoint) flip-flops:
one, a D flip-flop which can be used in TSC and strongly fault secure
(SFS) synchronous circuits with two-rail codes; the other, a T flip-flop
used in a similar way as the D flip-flop but retaining the error
as an indicator until the next presetting, as an aid to error propagation.
The specification of a synchronous circuit can be given as a set of
abstract building blocks that are interconnected. A set of fast algorithms
is presented here for the selection of components that map each
of these abstract building blocks to one of a number of suitable
physical components. The first set of algorithms selects the
fastest or cheapest (smallest area) of all possible components. Another
set of algorithms is given that will find a solution with user-defined
constraints. These algorithms, which are implemented as part of the
SPOT system, use an exhaustive list of timing information to increase
the likelihood of a good solution.
Automation of the hardware/software codesign methodology brings with
it the need to develop sophisticated high-level synthesis tools.
This paper presents a tool which is the result of such development.
This tool converts standard C code into an equivalent VHDL behavioural
description. This description is used to generate a chip-level hardware
interconnect of identical functionality to the original C code.
Most register allocation methods only minimise the number of registers.
BoMaRA, a program using probabilistic methods for register allocation,
attempts to minimise the number of registers and the number of interconnections
that arise when multiple variables are stored in a single register.
Critically analyses adaptive logic networks (ALNs), which have been
developed by W. Armstrong et al. (1979). The authors take some of
the problems distributed by W. Armstrong in the Atree2 software package,
and apply standard digital logic techniques to the same problems.
From the set of tests done, it is concluded that the problems looked
at can be solved by standard digital logic techniques and Hamming
error code correction. The standard digital logic techniques coupled
with Hamming error code correction produce a minimized combinational
logic up to 100 times faster than ALNs.
In order to perform a hardware/software codesign on an algorithm,
it is essential to divide the code into hardware and software partitions.
This paper presents the technique of execution profiling C code
in preparation for automatic partitioning. The profiler presented
overcomes many shortcomings of traditional profiling techniques.
Describes an approach to the automation of digital synthesis using
a knowledge based expert system named SPOT. A technology-independent
description is produced from a functional description. This technology-independent
description is further processed to create technology-dependent descriptions
for both LSI/MSI/SSI and EPLD technologies.
Describes an expert system named SPOT for the automation of digital
design. The first section of the project described in this paper
is used to create a technology-independent description from a structural
description. The final section of the project is to create a technology-dependent
description from the technology-independent description.