MIPS R4x00 Instruction Set
Courtesy Don Miller, Arizona State University
references
MIPS R4000 User's Manual, Joe Heinrich,
Prentice-Hall, 1993 (class text).
MIPSpro™ Assembly Language Programmer's Guide, Huffman, L. and Graves, D.,
Silicon Graphics, 1996.
Inside L4/MIPS, Gernot Heiser, UNSW, Sydney, Australia, Version 2.14,
July 12, 2000, pp. 10-11 and 24-33 (Class Notes I).
Computer Organization and Design the Hardware/Software Interface, 2nd Edition,
Patterson, D. and Hennessy, J., Morgan Kaufmann, 1998. (cse 330 text).
MIPS Handout #1 - MIPS Instructions - a Subset
MIPS Handout #2 - Endianness
MIPS Handout #3 - System Coprocessor Status Register - CP0 Register 12
MIPS Handout #4 - Endianness (continued)
MIPS Handout #5 - Super Simple Macro Example
MIPS Handout #6 - MIPS R4600 Pipeline
All references below to listing numbers, table numbers, figure
numbers and section numbers are from Inside L4/MIPS version 2.14:
L 4.1, lines 1, 7, and 15 page 24 means see listing L 4.1, lines 1, 7 and
15 on page 24 of Inside L4/MIPS.
T 2.1 page 11 is Table 2.1 on page 11, Figure 2.3 page 15 is
Figure 2.3 on page 15 and S 3.2 is section 3.2.
Examples from Inside L4/MIPS are preceded by listing number, line number(s) and
page number like this:
L 4.1 lines 1, 7, 15 page 24
overview
MIPS processors - a bunch of families
R2000, R3000, R4000, R5000 and R10000 <-- close to latest family
R4000 family - R4x00
R4600 <-- L4/MIPS runs on these
    R4600
    R4700
R4000
    R4000
    R4400
So when the difference doesn't matter we will use R4x00; otherwise we will use R4600
to refer specifically to the MIPS R4600.
No significant differences between R4600 and R4700
No significant differences between R4000 and R4400 (cse 530 text covers these)
R4x00
64-bit architecture
RISC
1 instruction per clock cycle
pipelined - 5 stage pipeline
branches, jumps and loads - 1 cycle delay
jumps and branches- till target instruction can be executed
loads - till data available
processor has 32 general registers - r0-r31
assemblers use symbolic names for these based on compiler conventions
Ref: Table 2.1 page 11
also see regdef.h
r0        zero     reads as 0 and writes to it are ignored
r1        at       assembler uses it to store intermediate results of pseudo-instruction macros
r30       s8/fp    frame pointer (callee saved, i.e., preserved across calls)
r31       ra       linkage register for jal instruction
.set at            a pseudo-op - the assembler uses at without a warning message; users get a warning message
.set noat          user programs can use at without a warning message; if the assembler requires at you get a warning message
r26, r27  k0, k1   kernel's exception handlers use these as temporaries
                   don't use if there can be an exception, i.e., only use if interrupts are disabled
                   you need to be sure no page fault can occur:
                       no mapped addresses used
                       or
                       all virtual memory pages already mapped
ti, si and vi: temporaries (caller saved), callee-saved registers and integer function results, respectively - see Table 2.1
also there are:
pc        program counter or instruction pointer
hi/lo     for multiply and divide - we may not need
SR        status register, aka CP0 register 12 (CP0 is "System CoProcessor 0") - see MIPS HO #3
          in other machines this is called PS or PSW (processor status register or word)
syntax: MIPS HO #1 uses a $ prefix for registers - the #defines in regdef.h get rid of the prefix in assembler source
MIPS R4x00 data sizes:
doubleword   dw   8 bytes
word         w    4 bytes
halfword     hw   2 bytes
byte         b    1 byte (8 bits)
There are 5 groups of instructions - see MIPS HO #1
· arithmetic
· logical
· data transfer
· conditional branch
· unconditional jump
and there are a few special ones (see below)
Note: RISC processors have
· cpu operations
· loads and stores
RISC processors don't have
· cpu operations on values in memory
e.g., no add A, B where A and B are in memory
and no add ri, A where A is in memory
RISC processors have a lot of registers, e.g., >= 32, for intermediate storage in cpu
- they get a lot of their high performance from this
code listings - see L 1.1 page 2
.set reorder      default - allows assembler to reorder machine language instructions for efficiency
.set noreorder    enables you to Do It Yourself (DIY) for maximum efficiency
L 1.1 lines 0-3 page 2
0   mtc0 t0, C0_STATUS
1   .....
2   .....
3   move a5, a4
assembler can't reorder - this is indicated with BOLD line numbers
L 1.1 lines 4-7 page 2
4   3: lui a0, KERNEL_BASE
5
6
7   and a2, a0, a4
lines that can be reordered have non-bold line numbers
Notes below sort of follow Patterson and Hennessy's order of covering the instruction set - skipping a lot of things that should be familiar to you.
We only do 64-bit cpu instructions and concentrate on the ones that come up in the L4 code covered in Inside L4/MIPS.
add and subtract
register format
op rd, rs, rt        rd <-- rs op rt
3 operands:
    rs    source register
    rd    destination register
    rt    target register
leftmost is destination
right two are sources
op is add or subtract
add s0, s1, s2       s0 <-- s1 + s2
add s0, s0, s1       s0 <-- s0 + s1
sub s0, s1, s2       s0 <-- s1 - s2
sub s0, s0, s1       s0 <-- s0 - s1
immediate format
immediate - 16 bit constant that gets sign extended
add immediate
addi s0, s1, 100     s0 <-- s1 + 100 (base 10)
unsigned
add unsigned
unsigned => overflow not detected
(signed => overflow detected)
addu s0, s1, s2      s0 <-- s1 + s2
immediate unsigned
addiu s0, s1, 100    s0 <-- s1 + 100
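The signed/unsigned distinction above can be sketched in Python. This is only a model, not code from the handouts: it assumes the MIPS III behaviour in which add/addu/addi work on 32-bit values and sign extend the result to 64 bits, and the function names are made up for illustration.

```python
MASK32 = 0xFFFFFFFF
MASK64 = 0xFFFFFFFFFFFFFFFF

def sign_extend(value, bits):
    """Interpret the low `bits` bits of value as a 2's complement integer."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value & (1 << (bits - 1)) else value

def addu(rs, rt):
    """addu: overflow is not detected; the 32-bit sum is sign extended."""
    return sign_extend((rs + rt) & MASK32, 32) & MASK64

def add(rs, rt):
    """add: same sum, but signed 32-bit overflow is detected (modelled as an exception)."""
    total = sign_extend(rs, 32) + sign_extend(rt, 32)
    if not -(1 << 31) <= total < (1 << 31):
        raise OverflowError("integer overflow exception")
    return total & MASK64

def addi(rs, immed16):
    """addi: the 16 bit immediate is sign extended before the add."""
    return add(rs, sign_extend(immed16, 16) & MASK64)
```

Note that an immediate of 0xFFFF acts as -1 here, so addi(5, 0xFFFF) gives 4.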
moving things around
moving to and from system control coprocessor (CP0) registers,
e.g., the Status Register (SR)
see MIPS HO #3 on SR
dmtc0 rt, rd         CP0 register rd <-- rt
dmfc0 rt, rd         rt <-- CP0 register rd
2 operands:
    rt    general register
    rd    coprocessor 0 register
L 4.1 lines 3, 12 page 24
3   dmfc0 k0, C0_ENTRYHI      /* k0 <-- C0_ENTRYHI
12  dmtc0 k0, C0_ENTRYLO1     /* C0_ENTRYLO1 <-- k0
notes: C0_ENTRYHI and C0_ENTRYLO1 are 64 bit registers
info flow is left to right for mtc0 (like a store) and right to left for mfc0 (like a load)
mtc0 rt, rd
mfc0 rt, rd
L 4.4 lines 0, 2 page 30
mtc0 k0, C0_STATUS   /* C0_STATUS <-- k0[31...0]
mfc0 k0, C0_STATUS   /* k0 <-- (C0_STATUS[31])^32 || C0_STATUS
notes: || = concatenated
x[31] = bit number 31 of x (sign bit of a 32 bit 2's complement integer)
(x[31])^32 = 32 copies of bit 31
so for mfc0 - coprocessor 0 status register contents are
sign extended and the result is moved to k0
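A small Python model of the 32-bit moves may make the sign extension concrete. It is a sketch only; the function names mirror the mnemonics but are otherwise made up, and registers are modelled as plain integers.

```python
def mfc0(cp0_reg32):
    """k0 <-- (reg[31])^32 || reg: replicate bit 31 into the upper 32 bits."""
    low = cp0_reg32 & 0xFFFFFFFF
    upper = 0xFFFFFFFF if (low >> 31) & 1 else 0
    return (upper << 32) | low

def mtc0(gpr64):
    """reg <-- k0[31...0]: only the low 32 bits are moved."""
    return gpr64 & 0xFFFFFFFF
```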
load and store
doubleword (8 bytes)
word (4 bytes)
halfword (2 bytes)
byte (8 bits)
op rt, offset (base)
3 operands:
    rt        destination for loads/source for stores
    offset    16 bit, base 10 - gets sign extended
    base      a general register
op is load or store
load rt, offset (base)     rt <-- M [base + offset]
store rt, offset (base)    M [base + offset] <-- rt
load and store doubleword
ld s0, 100(s1)       s0 <-- M [s1 + 100]
sd s0, 100(s1)       M [s1 + 100] <-- s0
L 4.1 lines 1, 15, 7 page 24
1   sd t0, K_TLB_T0_SAVE(k0)   /* M[k0 + 152 (base 10)] <-- t0 [save t0]
.....
7   ld t0, (k1)                /* t0 <-- M[k1 + 0]
.....
15  ld t0, K_TLB_T0_SAVE(k0)   /* t0 <-- M[k0 + 152 (base 10)] [restore t0]
Notes:
L 4.1 line 0 loads the base address of the kernel_vars structure into k0
K_TLB_T0_SAVE indexes into kernel_vars to access a save area for t0
K_TLB_T0_SAVE is defined as 152 in kernel.h ~ page 5
S 3.2 has more info
load and store word
lw rt, offset (base)
sw rt, offset (base)
loads: word that's loaded is sign extended
stores: least significant word (the lower numbered four bytes) is stored at the memory
address specified by base and offset
L 4.2 lines 17, 18 , 20, 21 page 26
17  lw t0, (t2)       /* these instructions
18  lw t1, 4(t2)      /* collectively do
.....                 /* the following:
20  sw t0, 8(k1)      /* 8(k1) <-- (t2)
21  sw t1, 12(k1)     /* 12(k1) <-- 4(t2)
load and store byte
same as load and store word except
loads: byte is sign extended
stores: least significant byte is stored at address
L4.3 line 20 page 28
Also see Figure 2.3 page 15 (and MIPS HO #3) and Figure 4.1 page 29.
20  sb t0, -24(sp)   /* M[sp-24] <-- SR[7...0]
Note: t0 has a copy of SR when line 20 is executed
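The load/store rules (only the least significant bytes are stored, loaded values are sign extended) can be tried out on a toy memory. This Python sketch is an illustration only: the 64-byte memory, the big-endian byte order (the way L4 configures the R4x00) and the helper names are assumptions.

```python
MASK64 = 0xFFFFFFFFFFFFFFFF
mem = bytearray(64)  # toy byte-addressed memory

def sign_extend(value, bits):
    """Sign extend the low `bits` bits of value to a 64-bit register image."""
    value &= (1 << bits) - 1
    if value & (1 << (bits - 1)):
        value -= 1 << bits
    return value & MASK64

def store(addr, rt, size):
    """sb/sh/sw/sd: store the least significant `size` bytes of rt."""
    low = rt & ((1 << (8 * size)) - 1)
    mem[addr:addr + size] = low.to_bytes(size, "big")

def load(addr, size):
    """lb/lh/lw/ld: load `size` bytes and sign extend to 64 bits."""
    raw = int.from_bytes(mem[addr:addr + size], "big")
    return sign_extend(raw, 8 * size)

store(8, 0x1234567880000001, 4)   # sw: only 0x80000001 reaches memory
word = load(8, 4)                 # lw: sign extended on the way back
store(16, 0xABCD, 1)              # sb: only 0xCD reaches memory
byte = load(16, 1)                # lb
```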
endianness
See MIPS HOs #2 and #4 on Endianness
Some processors are Little-endian (LE), some are Big-endian (BE) and some can be both.
Key point:
In a LE format the lsd has the lowest byte number and is stored at the lowest address.
In a BE format the msd has the lowest byte number and is stored at the lowest address.
LE: 80x86/VAX/PDP-11
BE: 680x0/IBM 370/Sparc
MIPS and Alpha can be programmed to be either. L4 configures MIPS R4x00 as Big-endian.
Since we are always on the same processor - endianness is only important to us if we want to access a subpart of a word or doubleword.
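To see the two byte orders side by side, here is a short Python sketch using the standard struct module; the example word 0x11223344 is arbitrary.

```python
import struct

word = 0x11223344

# "<I" packs a 32-bit unsigned word little-endian, ">I" big-endian.
little = struct.pack("<I", word)   # least significant byte at lowest address
big = struct.pack(">I", word)      # most significant byte at lowest address

# Byte 0 is the byte stored at the lowest address.
lowest_le = little[0]
lowest_be = big[0]
```

Reading a single byte at the lowest address therefore yields 0x44 on a LE machine and 0x11 on a BE machine, which is exactly the subpart access the note above warns about.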
load upper immediate
lui rt, immediate    /* main idea: rt[31...16] <-- immed
more exactly: rt <-- (immed[15])^32 || immed || 0^16
Put the 16 bit immediate into the upper half of the 32 bit lower half of a 64-bit register
and zero the low order 16 bits. Then sign extend to fill the upper half of the 64 bit register.
L4.1 line 0 page 24
lui k0, kernel_base /* k0 <-- 0xffff ffff 8004 0000
Notes:
kernel_base is defined as 0x8004 (8004H) in kernel.h ~ page 4.
k0 is loaded with base address of statically allocated kernel data - physical address 0x40000
see Figure 3.1 page 18 and S 3.2.1
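The lui formula can be checked with a few lines of Python. This is a sketch: the helper names are made up, and the CKSEG0-to-physical step assumes the usual unmapped-segment rule of keeping the low 29 address bits.

```python
def lui(immed16):
    """rt <-- (immed[15])^32 || immed || 0^16"""
    immed16 &= 0xFFFF
    upper = 0xFFFFFFFF if immed16 & 0x8000 else 0
    return (upper << 32) | (immed16 << 16)

def cksegs0_to_physical(vaddr):
    """Assumed CKSEG0 rule: the physical address is the low 29 bits."""
    return vaddr & 0x1FFFFFFF

k0 = lui(0x8004)                    # kernel_base from kernel.h
physical = cksegs0_to_physical(k0)
```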
Key point - RISC processors don't keep state (e.g., <, >, =, etc.) around in condition code bits in the PS word between instructions - this speeds up pipeline processing. HW to handle interrupts/exceptions is less complicated - pipeline flushing simpler/faster.
So how do RISC processors do conditional branching?
1. branch based on the contents of 2 general registers
       conditional branch instructions
2. set a general register based on some comparison and then branch based on what's in this register
       set instruction
       conditional branch instruction
There are zillions of conditional branch
instructions (37 of them).
We'll just consider a tiny subset that includes the ones in the L4 code.
branching based on the contents of 2 general registers
beq rs, rt, label
bne rs, rt, label
    rs       source register
    rt       target register
    label    branch target address
technically the instruction holds a 16 bit word offset from the instruction following the branch. The branch target is computed by taking the 16 bit offset field in the branch instruction, sign extending it, shifting it left by 2 bits and adding it to the address of the instruction following the branch.
beq s1, s2, foo      if s1 = s2 goto foo
                     else pc <-- pc + 4   /* all instructions are 4 bytes
bne s1, s2, foo      if s1 != s2 goto foo
                     else pc <-- pc + 4
Typical usage - compare contents of a register to r0 (zero)
beq s1, zero, foo    if s1 = 0 goto foo
                     else pc <-- pc + 4
bne s1, zero, foo    if s1 != 0 goto foo
                     else pc <-- pc + 4
L 4.1 lines 8 and 17 page 24
8   bne t0, k0, 2f     /* if t0 != k0 goto 2f
.....
17  2: j tlb2_miss
L 4.3 line 13 and 25 page 28
13  beq k1, zero, 1f   /* if k1 = 0 goto 1f
.....
25  1: slti t1, AT, MAX_SYSCALL_NUMBER
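How the 16 bit offset field turns into a branch target can be sketched in Python. The helper names are made up, and like the notes above the model ignores the delay slot and simply uses pc + 4 for the not-taken case.

```python
MASK64 = 0xFFFFFFFFFFFFFFFF

def sign_extend(value, bits):
    """Interpret the low `bits` bits of value as a 2's complement integer."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value & (1 << (bits - 1)) else value

def branch_target(pc, offset16):
    """Target = address of instruction after the branch + (offset << 2)."""
    return (pc + 4 + (sign_extend(offset16, 16) << 2)) & MASK64

def beq(pc, rs, rt, offset16):
    """Return the next pc for beq rs, rt, label."""
    return branch_target(pc, offset16) if rs == rt else pc + 4
```

An offset field of 0xFFFF is -1 words, so a branch at 0x1000 with that offset targets 0x1000 itself.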
set instructions
all are variants of set on less than
slt rd, rs, rt       if rs < rt
                         rd <-- 1     /* rs, rt considered as signed integers
                     else
                         rd <-- 0
slti rd, rs, immed   if rs < immed
                         rd <-- 1     /* rs and immed are signed integers
                     else
                         rd <-- 0
L 4.3 line 25 page 28
25 1: slti t1, AT, MAX_SYSCALL_NUMBER /* if syscall # < 7 then t1 <-- 1
Notes: #define MAX_SYSCALL_NUMBER 7 is in syscalls.h ~ page 1
valid system call numbers are 0-6
sltu, sltiu - same idea with rs, rt, immed considered as unsigned integers (natural numbers)
slt t0, s0, s1
bne t0, zero, foo    /* if s0 < s1 goto foo (t0 will be != 0 in this case)
goo:
....                 /* else continue in-line at goo
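The difference between the signed and unsigned set instructions shows up as soon as a register holds a value with bit 63 set. A Python sketch (function names mirror the mnemonics; registers are modelled as 64-bit integers):

```python
MASK64 = 0xFFFFFFFFFFFFFFFF

def to_signed64(x):
    """Interpret a 64-bit register image as a 2's complement integer."""
    x &= MASK64
    return x - (1 << 64) if x >> 63 else x

def slt(rs, rt):
    """rd <-- 1 if rs < rt as signed integers, else 0."""
    return 1 if to_signed64(rs) < to_signed64(rt) else 0

def sltu(rs, rt):
    """rd <-- 1 if rs < rt as unsigned integers, else 0."""
    return 1 if (rs & MASK64) < (rt & MASK64) else 0

minus_one = MASK64  # all ones: -1 signed, 2^64 - 1 unsigned
```

With these, slt(minus_one, 1) is 1 but sltu(minus_one, 1) is 0.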
more branches
bgtz rs, foo         if rs > 0 goto foo
bgez rs, foo         if rs >= 0 goto foo
blez rs, foo         if rs <= 0 goto foo
bltz rs, foo         if rs < 0 goto foo
L5.1 line 0 page 48
0 bltz sdesc, receive_only /* if a0 < 0 goto receive_only
Note: register naming convention in IPC code is given in T 5.1 page 48 - sdesc is mnemonic for a0.
jump
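j label              /* goto label
the 26 bit target field in the instruction is a word address, not an offset: it is shifted left by 2 bits
and combined with the high order bits of the address of the instruction in the delay slot
this is really an unconditional branch - it has more bits (26) than a branch offset (16) so it can go further away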
jump register
jr rs                /* goto address in rs
jump and link
jal label            /* goto label
                     ra <-- pc + 8
address of the instruction after the instruction in the branch delay slot is saved in r31
note: HO #1 seems wrong - it says: ra <-- pc + 4
jalr rs              /* goto address in rs
                     ra <-- pc + 8
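A Python sketch of the jump target and link address computations. It assumes the standard MIPS encoding, in which the 26 bit field replaces the low 28 bits of the delay slot address; the function names are made up for illustration.

```python
MASK64 = 0xFFFFFFFFFFFFFFFF

def jump_target(pc, index26):
    """Keep the high bits of the delay slot address, insert index26 << 2."""
    delay_slot = pc + 4
    high_bits = delay_slot & ~0x0FFFFFFF & MASK64
    return high_bits | ((index26 & 0x03FFFFFF) << 2)

def jal(pc, index26):
    """Return (new pc, ra): ra skips the jal and its delay slot."""
    return jump_target(pc, index26), (pc + 8) & MASK64
```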
and, or, xor
register format
op rd, rs, rt        rd <-- rs op rt   /* bitwise operations
op is and, or, exclusive or
and s0, s1, s2       s0 <-- s1 & s2
or s0, s1, s2        s0 <-- s1 | s2
xor s0, s1, s2       s0 <-- s1 xor s2
L 4.4 lines 4, 6 page 30
4   and k0, k0, k1   /* k0 <-- k0 & k1
.....
6   or k0, k0, k1    /* k0 <-- k0 | k1
L 4.2 line 9 page 26
9   xor k0, t0       /* k0 <-- k0 xor t0
immediate format
op rt, rs, immed     rt <-- rs op immed
immed is a 16-bit constant that gets 0-extended
andi s0, s1, immed   s0 <-- 0^48 || (immed & s1[15...0])
ori s0, s1, immed    s0 <-- s1[63...16] || (immed | s1[15...0])
xori s0, s1, immed   s0 <-- s1 xor (0^48 || immed)
andi s1, s1, 0xfffe  /* clear bit 0 of s1 (bits 63...16 are cleared too, since immed is 0-extended)
ori s1, s1, 0x0002   /* set bit 1 of s1
L 4.3 lines 12, 13 page 28
see also MIPS HO #3
12  andi k1, t0, ST_KSU   /* k1[4,3] <-- SR processor mode at time of exception
13  beq k1, zero, 1f      /* if this mode was kernel mode goto 1f
Notes:
when line 12 is executed, t0 has SR at time exception occurred
SR[4,3] are the KSU bit field: kmode = 00/umode = 10
ST_KSU = 0x0018 in r4kc0.h ~ page 2 - used to access KSU bit field in SR
so line 12 is: k1 <-- 0^59 || KSU || 0^3
L 4.2.2 lines 0-2 page 30
0   mfc0 k0, C0_STATUS    /* collectively these lines
1   ori k0, k0, ST_EXL    /* set the EXL bit in SR
2   mtc0 k0, C0_STATUS    /* to 1
Notes:
C0_STATUS = Coprocessor register 12 (SR) - see r4kc0.h ~ page 2
ST_EXL = 0x0002 - see r4kc0.h ~ page 2
see also MIPS HO #3
L 5.7 line 17 page 56
17  xori v0, v0, L4_IPC_SRC_MASK   /* v0 <-- v0 xor 0x08
Note: L4_IPC_SRC_MASK = 0x08 - see ipc.h ~ page 2
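The immediate forms can be modelled in Python to see the effect of 0-extension. This is a sketch: the mask values are the ones quoted in the notes above, while the sample status value 0x13 is invented for illustration.

```python
MASK64 = 0xFFFFFFFFFFFFFFFF
ST_KSU = 0x0018   # KSU field mask, as quoted from r4kc0.h
ST_EXL = 0x0002   # EXL bit mask, as quoted from r4kc0.h

def andi(rs, immed16):
    """rt <-- 0^48 || (immed & rs[15...0])"""
    return rs & (immed16 & 0xFFFF)

def ori(rs, immed16):
    """rt <-- rs[63...16] || (immed | rs[15...0])"""
    return (rs | (immed16 & 0xFFFF)) & MASK64

def xori(rs, immed16):
    """rt <-- rs xor (0^48 || immed)"""
    return (rs ^ (immed16 & 0xFFFF)) & MASK64

sr = 0x13                  # invented SR value: KSU = 10 (user), EXL = 1, IE = 1
mode = andi(sr, ST_KSU)    # isolates the KSU field; 0 would mean kernel mode
```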
shift instructions - dsll, dsrl, sll, srl, dsllv and dsrlv
doubleword shifts
dsll rd, rt, sa      rd <-- rt << sa   /* sa is shift amount, 0 into low order bits
dsrl rd, rt, sa      rd <-- rt >> sa   /* 0 into high order bits
L4.1 lines 4, 5 page 24
4   dsll t0, k0, 38   /* t0 <-- k0[25...0] || 0^38
5   dsrl t0, t0, 47   /* t0 <-- 0^47 || t0[63...47]
word shifts
sll rd, rt, sa       /* acts like a sll in a 32-bit register - result is sign extended
srl rd, rt, sa       /* acts like a srl in a 32-bit register - result is sign extended
L 4.3 lines 9, 10 page 28
9   srl k1, 5         /* collectively these lines clear
10  sll k1, 5         /* KSU ERL EXL and IE fields of SR
Note: at start of line 9 k1 has processor status at time of exception
see also MIPS HO #3
run-time variable length shifts
dsllv rd, rt, rs     /* same as dsll and dsrl except rs holds shift amount
dsrlv rd, rt, rs
L4.2 line 5 page 26
5 1: dsrlv t1, k0, t1 /* t1 <-- (k0 >> (t1))
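The shift pairs used in the listings are easy to check with a Python model. This is a sketch with made-up helper names; it assumes the variable shifts use only the low 6 bits of rs as the shift amount.

```python
MASK32 = 0xFFFFFFFF
MASK64 = 0xFFFFFFFFFFFFFFFF

def sext32(x):
    """Sign extend a 32-bit result into a 64-bit register image."""
    x &= MASK32
    return x | (0xFFFFFFFF << 32) if x >> 31 else x

def dsll(rt, sa):
    return (rt << sa) & MASK64          # 0 into low order bits

def dsrl(rt, sa):
    return (rt & MASK64) >> sa          # 0 into high order bits

def sll(rt, sa):
    return sext32(rt << sa)             # 32-bit shift, result sign extended

def srl(rt, sa):
    return sext32((rt & MASK32) >> sa)  # 32-bit shift, result sign extended

def dsrlv(rt, rs):
    return dsrl(rt, rs & 0x3F)          # shift amount comes from a register

# dsll by 38 then dsrl by 47 leaves bits 25...9 of the original value.
field = dsrl(dsll(0x3FFFFFF, 38), 47)
# srl by 5 then sll by 5 clears the low five bits.
cleared = sll(srl(0xFF1F, 5), 5)
```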
syscall, eret, nop, move, b, macros and dla
syscall - system call
generates a system call exception
simplified (not already in exception mode and not in bootstrap mode)
EPC <-- pc           /* address of instruction that caused exception is stored in EPC (CP0 R14)
CAUSE <-- info on exception cause is saved in CAUSE register (CP0 R13) - some info is:
    did exception occur in a branch delay slot
    coprocessor unit #
    interrupt pending mask at time of exception
    exception code, e.g., TLB exception, breakpoint, syscall, address error
pc <-- 0x180         /* 180H, a physical address addressed via CKSEG0, is put in pc
                     /* these physical addresses (PAs) are accessed through CKSEG0 virtual addresses, which start at 0xffff ffff 8000 0000
so boiled down
EPC <-- pc
CAUSE <-- cause info
pc <-- 0x180 - the starting address of the general exception handler
note: there's just one syscall instruction in the MIPS R4x00
in L4 higher level code: at <-- syscall # and the general exception handler
examines it to determine the type of system call
L4.3 lines 21 and 27-30 page 28
21  bne AT, zero, 1f                  /* if syscall is not ipc goto 1f (ipc is syscall 0)
.....
27  dsll AT, 3                        /* collectively these 3 lines
28  daddu t0, k0, AT                  /* index into system call jump table
29  ld t0, K_SYSCALL_JMP_TABLE(t0)    /* get address of the system call routine
30  jr t0                             /* and transfer control to it
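The address arithmetic in lines 27-29 can be sketched in Python. The offset value used for K_SYSCALL_JMP_TABLE below is hypothetical (the real value lives in the kernel headers), as is the function name; the point is only that each table entry is 8 bytes, hence the shift by 3.

```python
K_SYSCALL_JMP_TABLE = 0x100  # hypothetical offset, for illustration only

def jump_table_slot(k0, syscall_number):
    """Address of the doubleword holding the handler for this system call."""
    at = syscall_number << 3          # dsll AT, 3: 8 bytes per table entry
    t0 = k0 + at                      # daddu t0, k0, AT
    return t0 + K_SYSCALL_JMP_TABLE   # address used by the ld on line 29
```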
eret - exception return
simplified
pc <-- EPC
SR <-- current status word with EXL = 0   /* processor no longer at exception level
tlbwr
the four doublewords of the TLB entry addressed by the RANDOM register (CP0 register 1)
are loaded from CP0 registers 5, 10, 2 and 3 - RANDOM can be 0...47.
TLB G bit <-- EntryLo0[G] AND EntryLo1[G]
TLB[RANDOM] <-- PageMask || (EntryHi AND NOT PageMask) || EntryLo1 || EntryLo0
TLB[140] <-- EntryLo0[G] AND EntryLo1[G]
simplified
TLB[RANDOM] <-- a tlb entry
nop, move and b
not HW instructions - assembler generates 1 instruction equivalents such as the following
nop                  add r0, r0, r0                 L 4.3 line 7 page 28
move t2, sp          addu t2, sp, r0                L 4.3 line 14 page 28 - save sp in t2
b 5f                 beq r0, r0, 5f                 L 4.10 line 26 page 36
b send_only_short    beq r0, r0, send_only_short    L 4.11 line 23 page 38
macro instructions
see kernel/macros.h
L4.9 line 42 page 35
see MIPS HO#5 for the details
42 tcbtop(t8) /* t8 <-- top of current thread's tcb
in macros.h (~ page 3)
#define tcbtop(tcb) \
    ori tcb, sp, TCB0
in kernel.h (~ page 3)
#define TCB_SIZE 0x800
#define TCB0 (TCB_SIZE - 1)    /* so TCB0 = 0x7ff
syntax is same as C preprocessor
replace tcbtop(tcb) with
"ori tcb, sp, TCB0"
each line that is to be included in place of macro name and
arguments is preceded by a '\'
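What the expanded macro computes can be sketched in Python. This assumes, as the name suggests, that sp points somewhere inside a TCB that is aligned on a TCB_SIZE boundary; the sample stack pointer values are invented.

```python
TCB_SIZE = 0x800
TCB0 = TCB_SIZE - 1   # 0x7ff

def tcbtop(sp):
    """ori tcb, sp, TCB0: set the low 11 bits, giving the last byte of the TCB."""
    return sp | TCB0
```

For example, any sp in the range 0x1000...0x17ff maps to 0x17ff.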
dla
can't find this guy anywhere - probably an assembler instruction
or macro that
translates into the equivalent of la in other processors - it loads
an address
Load and Branch Delay Slots and Programming
Introduction
MIPS R4600 has a 5 stage pipeline
See MIPS HO#6
load delays
can get a "data hazard" type pipeline stall if
data being loaded is needed in the immediately following instruction
this is called a "load delay" and the instruction position immediately following
the load is called a "load delay slot"
good programmers and assemblers try to code so that the instruction in the load delay slot does not need the data that is being loaded
L 4.1 lines 9-11 page 24
9   lwu t0, 8(k1)
10  lwu k0, 12(k1)           /* executes in load delay slot of line 9 lwu
11  dmtc0 t0, C0_ENTRYLO0    /* data from lwu on line 9 is available now
branch delays
Suppose an instruction (e.g., I1 in MIPS HO#6)
fetched in clock cycle #1 (CC1) is a branch or
jump e.g.,
beq s1, zero, label
and suppose there is enough HW to do the branch computation in the decode stage (CC2)
as occurs in many MIPS implementations (e.g., R4600)
still the instruction that logically follows the branch, I2 , can't be fetched until CC3
i.e., this instruction is delayed whether the branch is taken or not
this is called a "branch delay" and the possibly unused slot is called a
"branch delay slot"
To reduce wasted instruction cycles we could place instructions
in the branch delay slot that we would like executed
· independent of whether or not the branch is taken
· in one of the cases (taken or not taken) and we don't care in the other (not taken or taken)
L 4.3 lines 21-23, 25 page 28
example of instruction useful whether or not branch is taken
21  bne AT, zero, 1f
22  dsubu sp, 24       /* instruction in branch delay slot always used
23  j k_ipc            /* i.e., it's used later in this routine and in k_ipc
.....
25  1: slti t1, AT, MAX_SYSCALL_NUMBER + 1
L 4.1 lines 8-10 page 24
example of instruction in branch delay slot useful if branch not taken and harmless if branch taken
8   bne t0, k0, 2f
9   lwu t0, 8(k1)      /* always executed/used in-line and harmless if branch taken
10  lwu k0, 12(k1)
L 4.2 lines 15, 16 page 26
sometimes can't use the branch delay slot.
15  bne t1, zero, xtlb_refill_fail
16  nop
Note: could have left nop out - Liedtke inserts nops to make unfilled branch delay slots obvious and as an aid to cycle counting.
L4.2 lines 11-14 page 26
one last case - user branch prediction
uses a different branch instruction - branch on equal likely
beql rs, rt, label   if rs = rt
                         goto label and also execute instruction in branch delay slot
                     else don't execute instruction in branch delay slot and continue in-line
11  beql t1, zero, 1b
12  dsrl t1, t0, 6     /* only executed if branch is taken
13  dsrl k0, t0, 6     /* normal in-line execution
14  dsllv t1, t1, k0   /* uses unmodified t1
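The difference between an ordinary branch and a branch likely comes down to when the delay slot instruction runs. A toy Python model (the function name and the trace strings are made up for illustration):

```python
def executed(branch, taken):
    """List what runs for an ordinary branch ("beq") or a branch likely ("beql")."""
    trace = ["branch"]
    # Ordinary branches always execute the delay slot instruction;
    # branch likely executes it only when the branch is taken.
    if branch == "beq" or taken:
        trace.append("delay slot")
    trace.append("target" if taken else "fall through")
    return trace
```

So in L 4.2 the dsrl on line 12 runs only when the beql on line 11 is taken, which is why line 14 can rely on an unmodified t1 in the not-taken case.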