

## History

- 1994: Intel and HP began working on Merced
- 2001: Itanium processor released at 733 and 800 MHz
- 2002: Itanium 2 processor (codenamed McKinley) released at 900 MHz and 1 GHz
- 2003:
  - Madison was introduced in three versions
  - Hondo was announced as the HP mx2 dual-processor module
  - Deerfield was released as the first low-voltage Itanium processor
- 2004:
  - First processor in the Madison 9M series was released
  - Fanwood core debuted
- Upcoming: Montecito











## Register Stack (cont)

- Register stack frame can be resized using the *alloc* instruction
- Register Stack Engine (RSE)
  - Automatically saves and restores the register stack without explicit software intervention
  - Uses spare memory bandwidth in the background
  - Spills and fills generate RSE traffic
  - RSE traffic degrades performance
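
The spill and fill behaviour above can be sketched with a deliberately simplified model. Everything here is an assumption for illustration: the class `RegisterStack`, its fields, and the whole-frame spill policy are hypothetical, and real hardware spills individual registers and lets caller and callee frames overlap. The default of 96 physical stacked registers corresponds to GR32-GR127.

```python
class RegisterStack:
    """Toy model of a register stack with an RSE-like spill/fill counter.

    Assumption: frames are spilled and filled as whole units, which is
    coarser than what the hardware does.
    """

    def __init__(self, physical=96):
        self.physical = physical  # number of stacked physical registers
        self.frames = []          # sizes of frames resident in registers, oldest first
        self.spilled = []         # sizes of frames saved to the backing store
        self.traffic = 0          # registers moved to or from memory

    def call(self, frame_size):
        """Allocate a frame (like alloc), spilling the oldest frames if needed."""
        if frame_size > self.physical:
            raise ValueError("frame larger than the physical register file")
        while sum(self.frames) + frame_size > self.physical:
            oldest = self.frames.pop(0)
            self.spilled.append(oldest)
            self.traffic += oldest  # spill: registers -> memory
        self.frames.append(frame_size)

    def ret(self):
        """Release the newest frame, filling the caller's frame if it was spilled."""
        self.frames.pop()
        if not self.frames and self.spilled:
            restored = self.spilled.pop()
            self.frames.append(restored)
            self.traffic += restored  # fill: memory -> registers


rs = RegisterStack()
for _ in range(3):
    rs.call(40)        # the third call no longer fits and forces a spill
rs.ret()
rs.ret()               # returning to the spilled frame forces a fill
print(rs.traffic)
```

Deep call chains with large frames drive `traffic` up, which is the cost the last bullet refers to.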

#### **Register Rotation**

- Register renaming mechanism that enables the concurrent execution of multiple iterations of a loop
- Rotating registers
  - PR32-PR127, FR32-FR127, and a programmable-size GR area starting at GR32
  - Size of the rotating area in the GR file is set by the *alloc* instruction (either 0 or a multiple of 8)
  - Values rotate toward larger register numbers
  - Renamed register number = rotating register number + value of rrb, wrapping around within the rotating area
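
The renaming rule above can be illustrated with a minimal sketch. The function names `physical_register` and `rotate` are hypothetical, and the model assumes that a rotation decrements rrb modulo the size of the rotating area, which is what makes a value appear to move to the next higher register number.

```python
ROTATING_BASE = 32  # rotating areas start at register 32


def physical_register(logical, rrb, size, base=ROTATING_BASE):
    """Map a logical register number to a physical one.

    Registers outside the rotating area [base, base + size) are not renamed.
    """
    if size == 0 or not (base <= logical < base + size):
        return logical
    return base + (logical - base + rrb) % size


def rotate(rrb, size):
    """Advance one loop iteration: rrb is decremented, wrapping modulo size."""
    return (rrb - 1) % size if size else 0


size = 8   # rotating GR area of 8 registers: GR32-GR39
rrb = 0

# Iteration 1 writes logical GR32.
written = physical_register(32, rrb, size)

# After one rotation, the same physical register is visible as logical GR33.
rrb = rotate(rrb, size)
assert physical_register(33, rrb, size) == written
```

Each loop iteration therefore gets a fresh copy of "GR32" while earlier iterations' values stay reachable under higher register numbers.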



| Instruction Type | Description     | Execution Unit Type |
|------------------|-----------------|---------------------|
| A                | Integer ALU     | I-unit or M-unit    |
| I                | Non-ALU integer | I-unit              |
| M                | Memory          | M-unit              |
| F                | Floating-point  | F-unit              |
| B                | Branch          | B-unit              |
| L+X              | Extended        | I-unit/B-unit       |

Example:

- A: add, or
- I: mov, shl
- M: load
- F: fadd
- B: br, brp
- L+X: brl
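
The mapping from instruction type to execution unit can be written as a small lookup. This is only a sketch: the names `UNITS_FOR_TYPE` and `can_issue` are hypothetical, and the data is limited to what the table lists.

```python
# Instruction type -> execution unit types that can run it (from the table).
UNITS_FOR_TYPE = {
    "A": ("I-unit", "M-unit"),    # integer ALU ops can go to either unit
    "I": ("I-unit",),             # non-ALU integer
    "M": ("M-unit",),             # memory
    "F": ("F-unit",),             # floating-point
    "B": ("B-unit",),             # branch
    "L+X": ("I-unit", "B-unit"),  # extended
}


def can_issue(instruction_type, unit):
    """Return True if an instruction of this type may execute on the given unit."""
    return unit in UNITS_FOR_TYPE[instruction_type]


# An A-type instruction such as add is flexible; an I-type such as shl is not.
print(can_issue("A", "M-unit"), can_issue("I", "M-unit"))
```

The flexibility of A-type instructions is what lets simple integer operations fill otherwise idle memory slots.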





# **Branch Hints**

- Mechanism to decrease the branch misprediction rate
- Do not affect the functional behavior of the program and may be ignored by the processor

| Hint | Meaning           | Description                                                                                                  |
|------|-------------------|--------------------------------------------------------------------------------------------------------------|
| spnt | Static Not-Taken  | Ignore this branch, do not allocate prediction resources for this branch.                                    |
| sptk | Static Taken      | Always predict taken, do not allocate prediction resources for this branch.                                  |
| dpnt | Dynamic Not-Taken | Use dynamic prediction hardware. If no dynamic history information exists for this branch, predict not-taken. |
| dptk | Dynamic Taken     | Use dynamic prediction hardware. If no dynamic history information exists for this branch, predict taken.     |
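
How the four hints could steer a prediction is sketched below. The function `predict_taken` is hypothetical, and the `history` argument (the last observed outcome, or `None` when nothing is recorded) is an assumed stand-in for the dynamic prediction hardware, which is far more elaborate in practice.

```python
def predict_taken(hint, history=None):
    """Return the predicted direction (True = taken) for a branch.

    hint:    one of "spnt", "sptk", "dpnt", "dptk"
    history: last observed outcome as a bool, or None if no history exists
             (assumption: a 1-bit history models the dynamic predictor)
    """
    if hint == "spnt":
        return False              # static: treated as not taken, no resources used
    if hint == "sptk":
        return True               # static: always predicted taken
    if hint in ("dpnt", "dptk"):
        if history is not None:
            return history        # dynamic history overrides the hint
        return hint == "dptk"     # no history yet: fall back to the hint
    raise ValueError(f"unknown branch hint: {hint}")


# A dptk branch is predicted taken on first sight, then follows its history.
print(predict_taken("dptk"), predict_taken("dptk", history=False))
```

Because a hint only changes the prediction and never the branch outcome, a processor that ignores hints still runs the program correctly.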



| Itanium 2 Proc. Execution Units | # Units                                                                                                            | Latency     |
|---------------------------------|--------------------------------------------------------------------------------------------------------------------|-------------|
| Memory Load Ports               | 2                                                                                                                  | 1 cycle (L) |
| Memory Store Ports              | 2                                                                                                                  | NA          |
| ALUs (integer)                  | 6                                                                                                                  | 1 cycle     |
| Integer Units                   | 2                                                                                                                  | 1 cycle     |
| Integer Shift                   | 1                                                                                                                  | 1 cycle     |
| Multimedia ALUs                 | 6                                                                                                                  | 2 cycles    |
| Parallel Multiply Units         | 1                                                                                                                  | 2 cycles    |
| Parallel Shift-Mask Units       | 2                                                                                                                  | 2 cycles    |
| FP FMAC (multiply-accumulate)   | 2                                                                                                                  | 4 cycles    |
| FP FMISC (compares, merge, etc) | 2                                                                                                                  | 4 cycles    |
| Branch Unit                     | 3                                                                                                                  | 0-2 cycles  |




# Cache System Distinction

- Cache latency
  - L1i/L1d: 2 cycles => 1 cycle
  - L2 (I, FP): 6, 9 cycles => 5, 6 cycles
  - L3 (I, FP): 21, 24 cycles => 12, 13 cycles
- Virtual address and physical address
  - 50-bit => 64-bit for virtual address
  - 44-bit => 50-bit for physical address
- Cache line size
  - Doubled for every level of cache
- Page size
  - Up to 4GB, used to be up to 256MB
- Cache line transfer bandwidth
  - Doubled

