


# Modern (Embedded) Processor Systems

# Prof. Dr. Akash Kumar Chair for Processor Design

(Ack: my past and current students/PostDocs) (Some slides adapted from Koren, Krishna, Anand)

cfaed.tu-dresden.de





#### Outline

- History of computer systems
- Trends in modern computer systems
- Design flow and considerations
- Modern challenges and solutions(??)

# History of Hardware / VLSI

- Vacuum tube (Lee De Forest, 1906)
- ENIAC (1946, UPenn)
- Transistor (1947, Bardeen, Brattain, Shockley)









# History of Hardware / VLSI

- Intel 4004 (1971, about 2,300 transistors)
- Intel Core i7 - Ivy Bridge (2012, >1.4 billion transistors)
- Very Large Scale Integration (VLSI): originally defined for chips with on the order of 100,000 transistors. Other terms such as ULSI came along, but the term VLSI remains dominant

### Moore's Law

In 1965, Gordon Moore (who later co-founded Intel) predicted that the number of transistors that can be integrated on a single chip would double every year; in 1975 he revised the rate to about every two years
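The arithmetic behind a fixed doubling period can be sketched in a few lines. This is only an illustration: the function name `projected_transistors` and the starting count of 1,000 transistors are hypothetical and not taken from the slides.

```python
def projected_transistors(start_count, start_year, target_year,
                          doubling_period_years=2.0):
    """Project a transistor count assuming one doubling per fixed period."""
    doublings = (target_year - start_year) / doubling_period_years
    return start_count * 2 ** doublings


# Hypothetical chip with 1,000 transistors in 1971, projected 40 years ahead:
# 40 years / 2 years per doubling = 20 doublings, about a million-fold increase.
projection = projected_transistors(1_000, 1971, 2011)
```

Real products deviate from such a projection, which is why the measured trend data matters more than the formula.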



http://cpudb.stanford.edu/ © Akash Kumar

#### 40 Years of microprocessor trend data



Original data up to the year 2010 collected and plotted by M. Horowitz, F. Labonte, O. Shacham, K. Olukotun, L. Hammond, and C. Batten. New plot and data collected for 2010-2015 by K. Rupp.

## Design Productivity Gap


Increasing number of transistors makes it harder to design the system

Late launch of products directly hurts profits



## System Design Considerations

- System: sensor -> processor -> actuator
- Considerations
  - Technology
  - Performance
  - Power consumption
  - Volume of production
  - Upgradability / ease of maintenance
  - Reliability
  - Testability
  - Availability of CAD and software tools, IP blocks, hardware and software libraries
  - Cost, chip area
  - Legal and certification requirements, client specifications
  - ...

# **Digital Hardware Market Segments**

- Processor, GPU
- DRAM, Flash memories
- ...
- (Co-)Processor alternatives
  - ASIC (application specific integrated circuit)
  - ASSP (application specific standard product)
  - FPGA (field programmable gate array)
- Convergence as System on Chip (SoC), which may also contain analog, mixed-signal, and radio-frequency functions





## Embedded systems architecture

- Trend towards Multi-Processor Systems-on-chip (MPSoC)
- Homogeneous vs heterogeneous systems
- Different memory models
- Different network architectures
  - Network-on-chip
  - Buses

#### Homogeneous vs heterogeneous



### Homogeneous vs heterogeneous

- Heterogeneity is increasing
  - Different levels of parallelism in application
  - Microprocessor (µProc) better for control flow
  - DSP better for signal processing
  - Dedicated hardware blocks needed for certain parts
  - Improves efficiency and saves power
- Homogeneous systems
  - Better for fault-tolerance
  - Only one compiled version of any application needed
  - Easier to design and replicate
  - Easy to support task migration

## Memory usage

#### **Cell Broadband Engine Processor**




#### Embedded systems – local memory



Network/ bus delay may be unpredictable

#### Embedded systems – global memory



Global memory may be better for shared data

## Embedded systems – combination



Communication pattern also determines which architecture is better: message passing OR shared memory

#### Embedded systems – network

[Figure: Processors 1 to 4, memory, and input/output connected through an interconnection network with an arbiter]

#### Interconnection network-on-chip




#### Interconnection network – bus

[Figure: Processors 1 to 4, memory, and input/output connected through a high-speed bus with arbiters]

### Point-to-point networks

[Figure: Processors 1 to 4, memory, input/output, and arbiter connected by point-to-point links]

## System Design – Hw/Sw Codesign



- Decide whether to implement each function in hardware or software
  - Consider the advantages vs costs
- If hardware, decide whether to use commercial off-the-shelf (COTS) components or custom components



#### Modern Multimedia Embedded Systems




# Predictable Design Flow



#### ANALYSIS: Time Spent in a Restaurant

#### Restaurant













#### DESIGN

Automated design technique for multiple combinations of applications



#### MANAGEMENT

Resource manager for heterogeneous systems running multiple applications



### Design- and Run-time Flow



## **Design Template**




## **Design Template**




CA: Communication Assist (DMA-like)

### Design- and Run-time Flow



## Design- and Run-time Flow

- Are the applications known?
- Can multiple applications run simultaneously?
- Are application models available?
- Is the application domain (or domains) known?
- Use representative applications...

## Analysis – SDF Graph

- First proposed in 1987 by Edward Lee
- SDF Graphs used extensively
  - SDFG: Synchronous Data Flow Graphs
  - DSP applications
  - Multimedia applications
- Similar to task graphs with dependencies



# Synchronous Dataflow Graphs



#### Actors

- Execution time per processor
- Memory requirement per processor

#### Channels

- Buffer constraints
- Token size
- Bandwidth requirements

#### Graph

- Throughput constraint

## Analysis – SDF Graph

- Analyze deadlocks
- Check for consistency
- Compute throughput
- Model mapping of tasks on processors
- Model scheduling (depends on the algorithm)
- Model communication bandwidth
- Model buffers (local memory and network interface)
- Evaluate throughput-buffer trade-offs
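The consistency check can be sketched as solving the balance equations of the graph: for every channel, the tokens produced per iteration must equal the tokens consumed. The code below is a minimal sketch and not the SDF<sup>3</sup> implementation; the function name `repetition_vector`, the channel tuple layout, and the example graph are assumptions made for illustration, and the graph is assumed to be connected.

```python
from fractions import Fraction
from math import lcm


def repetition_vector(actors, channels):
    """Return the smallest positive integer repetition vector of an SDF graph,
    or None if the graph is inconsistent.

    channels: list of (src, dst, production_rate, consumption_rate).
    Balance equation per channel: q[src] * production == q[dst] * consumption.
    """
    adjacency = {actor: [] for actor in actors}
    for src, dst, prod, cons in channels:
        adjacency[src].append((dst, Fraction(prod, cons)))
        adjacency[dst].append((src, Fraction(cons, prod)))

    rates = {}
    for start in actors:
        if start in rates:
            continue
        rates[start] = Fraction(1)
        stack = [start]
        while stack:
            actor = stack.pop()
            for neighbour, ratio in adjacency[actor]:
                expected = rates[actor] * ratio
                if neighbour not in rates:
                    rates[neighbour] = expected
                    stack.append(neighbour)
                elif rates[neighbour] != expected:
                    return None  # conflicting balance equations

    # Scale the fractional firing rates to the smallest integers.
    scale = lcm(*(rate.denominator for rate in rates.values()))
    return {actor: int(rate * scale) for actor, rate in rates.items()}


# Hypothetical three-actor pipeline: A produces 2, B consumes 3; B produces 1, C consumes 2.
pipeline = [("A", "B", 2, 3), ("B", "C", 1, 2)]
consistent = repetition_vector(["A", "B", "C"], pipeline)

# Adding a channel A -> C with rates 1:1 contradicts the other two channels.
inconsistent = repetition_vector(["A", "B", "C"], pipeline + [("A", "C", 1, 1)])
```

An inconsistent graph either deadlocks or needs unbounded buffers, so this check typically comes before throughput or buffer analysis.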

#### Throughput-buffer trade-offs









Mapping applications to the architecture

- Model all aspects, leading to a predictable system
- Verify if mapping is deadlock-free
- Calculate buffer distributions
- Compute static-order schedules for hard real-time applications
- Integrated into the SDF<sup>3</sup> (Synchronous Data Flow For Free) tool flow

#### **Multi-Application Multi-Processor Synthesis**

Hardware

- Instantiate processing components
- Instantiate interconnect components
- Route connections, generate VHDL code

Software

- Generate wrapper code for each actor
- Reserve memory for communication
- Program connections, if needed



#### Design synthesized using TCL scripts

- Script ensures compatibility with different Xilinx software versions
- Carry out design space exploration

#### Tool-flow (MAMPS) targeted towards Xilinx FPGAs


- Virtex 6 Xilinx ML605 board
- Supports run-time reconfiguration

#### Tool available online for use

Currently used by 20 research groups worldwide

Generated a design with 100 Microblazes!!





## MJPEG Case Study




- One iteration decodes a single MCU (minimum coded unit)
- Each MCU consists of up to 10 blocks of frequency values
- WCET determined through measurement and scenario detection techniques

# **Designer Effort**


| Step                              | Time spent |
|-----------------------------------|------------|
| Parallelizing the MJPEG code      | < 3 days   |
| Creating the SDF graph            | 5 minutes  |
| Gathering required actor metrics  | 1 day      |
| Creating application model        | 1 hour     |
| Generating architecture model     | 1 second   |
| Mapping the design (SDF3)         | 1 minute   |
| Generating Xilinx project (MAMPS) | 16 seconds |
| Synthesis of the system           | 17 minutes |
| Total time spent                  | ~ 4 days   |

#### Design- and Run-time Flow



# (Re-)Configuration??


Determine which resource to use when

- Change the device types?
- Change the device functionality?
- Change the communication?
- Change the mapping
- Change the schedule

#### Reconfigurable Heterogeneous MPSoC

- Customizable at run-time depending upon the application requirements
- The tasks taking a long time in software can be accelerated by configuring the programmable tiles appropriately





- The reconfigurable tiles can be configured to achieve fault-tolerance as well
- Size and cost reduction by time-multiplexing the reconfigurable hardware

## Partially Reconfigurable MPSoC




## Loading Processor Executable Code at Run-time



## Migrating Tasks


## **Modern Challenges**

## **Issues and Modern Trends**

- The communication bottleneck
  - 3D chips
  - Optical interconnects
- Leakage current limiting size reduction
  - Multi-gate or gate-all-around transistors (Intel 22nm uses 3D/tri-gate transistors)
  - Channel strain engineering, silicon-on-insulator-based technologies, and high-k/metal gate materials
- One may not fit all
  - Hardware/Software Co-design
  - Fault-tolerant / reconfigurable computing
- Power issues
  - Multi-core and heterogeneous architectures









# **Technology Scaling**

#### Dennard scaling principles [1]

| Device Parameters      | Scaling Factor   |
|------------------------|------------------|
| Device dimension       | 1/k              |
| Doping concentration   | k                |
| Voltage                | 1/k              |
| Current                | 1/k              |
| Capacitance            | 1/k              |
| Delay time per circuit | 1/k              |
| Power dissipation      | 1/k<sup>2</sup>  |
| Area                   | 1/k<sup>2</sup>  |
| Power density          | 1                |

[1] R. Dennard et al. "Design of Ion-Implanted MOSFET's with Very Small Physical Dimensions," IEEE Journal of Solid-State Circuits, 1974.
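The rows for power dissipation, area, power density, and delay follow from the rows for dimension, voltage, current, and capacitance. A short derivation, assuming the usual first-order relations $P = VI$ and $\tau \approx CV/I$ (these relations and the symbols $L$, $W$ for device length and width are not stated on the slide), with primed symbols denoting the scaled device:

```latex
\begin{align*}
P' &= V' I' = \frac{V}{k}\cdot\frac{I}{k} = \frac{P}{k^{2}} \\
A' &= \frac{L}{k}\cdot\frac{W}{k} = \frac{A}{k^{2}} \\
\frac{P'}{A'} &= \frac{P/k^{2}}{A/k^{2}} = \frac{P}{A} \\
\tau' &\approx \frac{C' V'}{I'} = \frac{(C/k)(V/k)}{I/k} = \frac{\tau}{k}
\end{align*}
```

Power density stays constant only as long as voltage scales with dimension, which is the assumption that breaks down at smaller technology nodes.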



# **Technology Scaling**

#### Deviation from Dennard scaling beyond 65 nm

- Non-ideal voltage scaling: limit on threshold voltage scaling
- Non-ideal gate oxide scaling
- Sub-threshold leakage power

#### Power dissipation increases with technology scaling

- Heat localization (hot spots)
- Higher temperature => device wear-out

#### Technology Scaling and Power Density




#### Technology Scaling and Power Density



#### What causes Faults?

Manufacturing Defects

Aging (a.k.a. Circuit Wearout)

#### What causes Faults?






Internal Electronic Noise

**Electromagnetic Interference** 

#### What causes Faults?

[Figure: ZDNet news clipping about an explosion in a Siberian natural gas pipeline]



# Fault Classification



## Failures during Lifetime




- Three phases of system lifetime
  - Infant mortality (imperfect test, weak components)
  - Normal lifetime (transient/intermittent faults)
  - Wear-out period (circuit aging)

# The Impact of Technology Scaling



- More leakage
- More process variability
- Smaller critical charges
- Trends show soft-error rates increasing exponentially, 8% per technology generation
- Weaker transistors and wires

## Effect on Embedded systems

#### Decreased Lifetime:

- Mission failures
- Reduced safety in critical systems
  - Power plants, transportation, medical etc.











## Effect on Embedded systems













Computation errors

## Fault-Aware System Design

- Faults are inevitable... learn to live with faults!
- How to address them?
  - Fault prevention
  - Fault tolerance
  - Fault removal
  - Fault forecasting



### Single-layer Fault tolerance

- The usual "phenomenon-based" approach
- Provide "perfect" hardware to the upper layers



### Levels of Fault Tolerance




#### Application areas and requirements

Not all applications require the same level of reliability.

| Application area           | Functional reliability (priority) | Timing reliability (priority) | Other relevant metrics |
|----------------------------|-----------------------------------|-------------------------------|------------------------|
| Banking                    | High                              | Medium                        |                        |
| Multimedia                 |                                   |                               | Throughput             |
| Portable multimedia        | Medium                            | High                          | Throughput, Energy     |
| Health monitoring          | High                              | Medium ~ High                 | Energy, Lifetime       |
| Satellites / Space Missions | Medium                           | Medium ~ High                 | Lifetime               |

#### **Cross-layer Approach**



## Case-Study – Nanosatellites

- Light-weight: wet mass of 1-10 kg
- Small satellites: notion of CubeSats, 1U = 10 x 10 x 10 cm
- Increasingly being used as they are cheaper to design and launch
  - 2004-2013: 75 launches in total
  - **2014 Q1: 94 launches**
- Typically low Earth orbit
- Satellite swarms are also used



CubeSat – University of Liege

# Case-Study – Nanosatellites

- FPGA use is increasing in nanosats: lower price, faster development
- Nanosats are affected by high-energy particles in space, leading to glitches
- Most common error in FPGAs: Single Event Upset (SEU), a transient error that might flip configuration bits
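Why a flipped configuration bit matters can be illustrated with a toy look-up table (LUT). This is a simplified sketch under stated assumptions: a real FPGA configuration memory also holds routing and other settings, and the names `lut_output`, `inject_seu`, and `differing_inputs` are hypothetical helpers, not part of any vendor tool.

```python
from itertools import product


def lut_output(config_bits, inputs):
    """Evaluate a k-input LUT: the inputs select one bit of the configuration word."""
    index = 0
    for position, bit in enumerate(inputs):
        index |= (bit & 1) << position
    return (config_bits >> index) & 1


def inject_seu(config_bits, bit_position):
    """Model a single event upset by flipping one configuration bit."""
    return config_bits ^ (1 << bit_position)


def differing_inputs(golden, faulty, num_inputs):
    """List the input combinations for which the two LUT configurations disagree."""
    return [
        inputs
        for inputs in product((0, 1), repeat=num_inputs)
        if lut_output(golden, inputs) != lut_output(faulty, inputs)
    ]


AND2 = 0b1000                  # 2-input AND: only index 3 (a=1, b=1) yields 1
upset_and2 = inject_seu(AND2, 1)   # flip the bit selected by a=1, b=0
changed = differing_inputs(AND2, upset_and2, 2)
```

After the upset, the LUT no longer implements AND for the input a=1, b=0, and unlike a glitch in a data register the wrong function persists until the configuration is rewritten.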



### **CFAED** Paths



# **Path G: Resilience**



## Overview – Resilience at TU Dresden


# **Approximate Computing**

# The Computational Efficiency Gap



IBM Watson playing Jeopardy, 2011


### Humans Approximate



### But Computers DO NOT



- Overkill (for many applications)
- Leads to inefficiency
- Can computers be more efficient by producing "just good enough" results?

## It's an Approximate World ... At the Top

- No golden answer (multiple answers are equally acceptable)
  - Web search, recommendation systems
- Even the best algorithm cannot produce correct results all the time
  - Most recognition / machine learning problems
- Too expensive to produce fully correct or optimal results
  - Heuristic and probabilistic algorithms, relaxed consistency models, ...









Eventual consistency

## It's an Approximate World ... At the Top










Miller-Rabin primality test, eventual consistency
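The Miller-Rabin test is a compact example of trading certainty for speed: it never rejects a prime, but may accept a composite with a probability that shrinks with every extra round. The following is a minimal textbook-style sketch; the function name `is_probable_prime`, the default of 20 rounds, and the optional `rng` parameter are choices made here for illustration.

```python
import random


def is_probable_prime(n, rounds=20, rng=None):
    """Miller-Rabin test: False means certainly composite, True means probably prime."""
    rng = rng or random.Random()
    if n < 2:
        return False
    # Handle small primes and their multiples directly.
    for p in (2, 3, 5, 7, 11, 13):
        if n % p == 0:
            return n == p
    # Write n - 1 as d * 2**r with d odd.
    d, r = n - 1, 0
    while d % 2 == 0:
        d //= 2
        r += 1
    for _ in range(rounds):
        a = rng.randrange(2, n - 1)      # random witness candidate
        x = pow(a, d, n)
        if x in (1, n - 1):
            continue
        for _ in range(r - 1):
            x = pow(x, 2, n)
            if x == n - 1:
                break
        else:
            return False                 # a proves that n is composite
    return True


mersenne_61 = is_probable_prime(2**61 - 1, rng=random.Random(0))
carmichael_561 = is_probable_prime(561, rng=random.Random(0))
```

Each round lets a composite slip through with probability at most 1/4, so the number of rounds acts as a knob between run time and confidence.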

### Approximate Computing Throughout the Stack



# Approximation in System Design

- Arising from the application level
  - Inherent lack of notion or ability for a single 'correct' answer
  - 'Noisy' or redundant real-world data
  - Perceptual limitations
- Arising from the transistor level
  - Increasing fault rates
  - Increased effort/resources to achieve fault tolerance

# Approximation in System Design




# Conclusions

- Transistor scaling is leading to increased faults
  - Designing systems to tolerate faults is inevitable
  - Need to handle faults at all levels of critical systems
- Applications often lack the notion of a 'correct' result
- Immense need/potential to trade off performance and energy consumed



# Ongoing Research Activities

#### Reliability/Energy Optimization

- Reconfigurable approximate computing at run-time
- Optimize energy and reliability
- Minimize thermal cycling and peak temperature
- Task remapping and scheduling for dealing with faults

#### **Processing Architecture Design**

- Determine and design appropriate system architecture
- Design predictable components: network and communication assist
- Partially reconfigurable tile-based heterogeneous multiprocessor systems
- Task-migration module in hardware for predictable delay

#### Low-Power and Fault-Tolerant FPGA Designs

- Improving fault-tolerance of FPGA through LUT content manipulation
- Novel error-correction mechanisms for FPGAs
- Leakage-aware resource management techniques
- Electronic Design Automation: place and route for FPGAs

# Chair for Processor Design




## **Questions and Answers**




Email: akash.kumar@tu-dresden.de

