

### School of Computer Science & Engineering

### **COMP9242 Advanced Operating Systems**

2023 T3 Week 03 Part 2
Virtualisation Principles
@GernotHeiser



# Copyright Notice

# These slides are distributed under the Creative Commons Attribution 3.0 License

- You are free:
  - to share—to copy, distribute and transmit the work
  - to remix—to adapt the work
- under the following conditions:
  - Attribution: You must attribute the work (but not in any way that suggests that the author endorses you or your use of the work) as follows:

"Courtesy of Gernot Heiser, UNSW Sydney"

The complete license text can be found at http://creativecommons.org/licenses/by/3.0/legalcode



# Today's Lecture

- What are virtual machines, and why do we have them
- Mechanics: how do they work
- Modern hardware support
- Fun and games with hypervisors



# Virtual Machine Basics



# Virtual Machine (VM)

"A VM is an efficient, isolated duplicate of a real machine" [Popek&Goldberg 74]

- Duplicate: VM should behave identically to the real machine
  - Programs cannot distinguish between real or virtual hardware
  - Except for:
    - Fewer resources (potentially different between executions)
    - Some timing differences (when dealing with devices)
- Isolated: Several VMs execute without interfering with each other
- Efficient: VM should execute at speed close to that of real hardware
  - Requires that most instruction are executed directly by real hardware

Hypervisor aka virtual machine monitor (VMM): Software layer implementing the VM







# Why Virtual Machines?

- Historically used for easier sharing of expensive mainframes
  - Run several (even different) OSes on same machine
    - called guest operating system
  - Each on a subset of physical resources
  - Can run single-user single-tasked OS in time-sharing mode







# Why Virtual Machines?

- Heterogenous concurrent guest OSes
  - eg Linux + Windows
- Improved isolation for consolidated servers: QoS & Security
  - total mediation/encapsulation:
    - replication
    - migration/consolidation
    - checkpointing
    - debugging
- Uniform view of hardware

Would not be needed if OSes provided proper security & resource management!





# Why Virtual Machines: Cloud Computing

- Increased utilisation by sharing hardware
- Reduced maintenance cost through scale
- On-demand provisioning
- Dynamic load balancing through migration





Cloud Provider Data Centre





# Hypervisor aka Virtual Machine Monitor

- Software layer that implements virtual machine
- Controls resources
  - Partitions hardware
  - Schedules guests
    - "world switch"
  - Mediates access to shared resources
    - e.g. console, network

### **Implications:**

- Hypervisor executes in privileged mode
- Guest software executes in unprivileged mode

Privileged guest instructions trap to hypervisor





# Native vs Hosted Hypervisor



Native execution



Native/
Bare-metal/
Type-I
Hypervisor



Hosted/ Type-II Hypervisor

- Hosted VMM besides native apps
  - Sandbox untrusted apps
  - Convenient for running alternative OS on desktop
  - leverage host drivers

### **Overheads:**

- Double mode switches
- Double context switches
- Host not optimised for exception forwarding



# Virtualisation Mechanics



### Instruction Emulation

- Traditional trap-and-emulate (T&E) approach:
  - guest attempts to access physical resource
  - hardware raises exception (trap), invoking HV's exception handler
  - hypervisor emulates result, based on access to virtual resource



### Most instructions do not trap

- prerequisite for efficient virtualisation
- requires VM ISA (almost) same as processor ISA



# Trap & Emulate Requirements

No-op is insufficient!

- Privileged instruction: when executed in user mode will trap
- Privileged state: determines resource allocation
  - Incl. privilege level, PT ptr, exception vectors...
- Sensitive instruction:
  - control sensitive: change privileged state
  - behaviour sensitive: expose privileged state
    - eg privileged instructions which NO-OP in user state
- Innocuous instruction: not sensitive

**T&E virtualisable HW**: All sensitive instructions are privileged

- Some inherently sensitive, e.g. set interrupt level
- Some contextdependent, e.g. store to page table

Can run unmodified guest binary





- Modify binary: binary translation (VMware)
- Modify hypervisor "ISA": para-virtualisation



### Virtualisation vs Address Translation



# Shadow Page Table





# **Shadow Page Table**

COMP9242 2023 T3 W03 Part 2: Virtual Machines

Hypervisor must shadow (virtualize) PT updates by guest: trap guest writes to guest PT translate guest PA in guest (virtual) PTE using memory map insert translated PTE in shadow PT Shadow PT has TLB semantics User ld r0, adr Guest (i.e. weak consistency) ⇒ virtual address Update at synchronisation points: Virt\_PT\_ptr page faults Guest Guest (Software) physical **TLB** flushes address SPT is a virtual TI B PT ptr similar semantics **VMM** (Hardware) **Physical** can be incomplete address data Memory

# Lazy Shadow Update

User Guest OS Hypervisor

write-protect GPT

unprotect GPT & mark dirty

add another mapping;
return to user

update dirty shadow;
write-protect GPT

unprotect GPT



# Lazy Shadow Update

User Guest OS Hypervisor

write-protect GPT
unprotect GPT & mark dirty
invalidate another mapping;
flush TLB
update dirty shadow;
write-protect GPT;
flush TLB
continue



# Real Guest Page Table



# Optimised Guest Page Table





### **Guest Self-Virtualisation**



Example: Interrupt-enable in virtual PSR

- guest and VMM agree on VPSR location
- VMM queues guest IRQs when disabled in VPSR



### **Device Models**





### **Emulated Device**



# Split Driver

VirtIO: Linux I/O virtualisation interface

Simplified, high-level device interface

- small number of hypercalls
- new (but very simple) driver
- low overhead
- must port drivers to hypervisor





# Driver OS (Xen Dom0)





# Pass-Through Driver

### Unmodified native driver

- Must trust driver (and guest) for DMA
  - except with hardware support: I/O MMU
- Can't share device between VMs
  - except with hardware support: recent NICs

### "Self-virtualising" devices:

- Single-root I/O virtualisation (SRIOV)
- NIC presenting multiple, isolated virtual NIC interfaces

Direct device access by guest





# Modern Hardware Support



### x86 Virtualisation Extensions: VT-x

**Traditional** New processor mode: VT-x root mode x86 behaviour orthogonal to protection rings entered on virtualisation trap Non-Root Root Ring 3 Ring 3 Ring 2 Ring 2 Kernel entry Ring 1 Ring 1 Ring 0 Ring 0 VM exit Hypervisor **Guest Kernel** 



# Arm Virtualisation Extensions [1/6]

### EL<sub>2</sub> aka "hyp mode"





### New privilege level

- Strictly higher than kernel (EL<sub>1</sub>)
- Virtualizes or traps all sensitive instructions
- Presently only available in Arm TrustZone "normal world"
- Latest ISA revision supports it also in "secure world"

# Arm Virtualisation Extensions [2/6]

### **Configurable Traps**







# Arm Virtualisation Extensions [3/6]

### **Emulation**



- 1) Load faulting instruction:
  - Compulsory L1-D miss!
- 2) Decode instruction
  - Complex logic
- 3) Emulate instruction
  - Usually straightforward



# Arm Virtualisation Extensions [3/6]



- 1) HW decodes instruction
  - No L1 miss
  - No software decode
- 2) SW emulates instruction
  - Usually straightforward

# Arm Virtualisation Extensions (4)

2-stage translation (EPTs)



# Arm Virtualisation Extensions [4/6]

### 2-stage translation cost

- On page fault walk twice number of page tables!
- Can have a page miss on each, requiring PT walk
- O(n²) misses in worst case for n-level PT
- Worst-case cost is massively worse than for single-level translation!





# Arm Virtualisation Extensions [5/6]

### **Virtual Interrupts**



- 2-part IRQ controller
  - global "distributor"
  - per-CPU "interface"
- New H/W "virt. CPU interface"
  - Mapped to guest
  - Used by HV to forward IRQ
  - Used by guest to acknowledge
- Halves hypervisor invocations for interrupt virtualization

x86: issue only for legacy level-triggered IRQs



# Arm Virtualisation Extensions [6/6]

### System MMU (I/O MMU)



- Devices use virtual addresses.
- Translated by system MMU
  - elsewhere called IOMMU
  - translation cache, like TLB
  - reloaded from I/O page table

x86 different (VT-d)

Many ARM SoCs different

- Can do pass-through I/O safely
  - guest accesses device registers
  - no hypervisor invocation



### RISC-V H Extension

# V = 1 Virtual U mode Virtual S mode Virtual S mode Trap Machine mode

### Add virtual U+S modes

- Extra registers for VM state
- Re-direct VS traps to S
- 2-stage address translation
- VIRQ injection

Hypervisor



# World Switch Comparison

**x86** 

Arm

RISC-V

- VM state is ≈ 4 KiB
- Save/restore done by hardware on VMexit/VMentry
- Fast and simple

- VM state is 488 B
- Save/restore done by hypervisor
- Selective save/restore
  - Eg traps w/o world switch

- VM state ≈ 80 B
- Save/restore done by hypervisor
- Selective save/restore
  - Eg traps w/o world switch



World switch

Guest 2 state

Restore

VM 2 control block



# Fun and Games with Hypervisors



# Hybrid Hypervisor-OSes.

Huge TCB, contains full Linux system (kernel and userland)!

Often falsely called a "Type-2" hypervisor

Idea: Turn OS into hypervisor by running in VT-x root mode, pioneered by KVM





# Why Still Have an OS?

COMP9242 2023 T3 W03 Part 2: Virtual Machines



# Qubes OS: Everything Is A VM



### More Fun and Games...

- Time-travelling virtual machines [King '05]
  - debug backwards by replaying VM from checkpoint, log state changes
- SecVisor: kernel integrity by virtualisation [Seshadri '07]
  - controls modifications to kernel (guest) memory
- Overshadow: protect apps from OS [Chen '08]
  - make user memory opaque to OS by transparently encrypting
- Turtles: Recursive virtualisation [Ben-Yehuda '10]
  - virtualize VT-x to run hypervisor in VM
- CloudVisor: mini-hypervisor underneath Xen [Zhang '11]
  - isolates co-hosted VMs belonging to different users
  - leverages remote attestation (TPM) and Turtles ideas
- Containers (Docker etc):
  - Example of OS API virtualisation

... and many more..

