COMP9315 21T1 ♢ Implementing Recovery ♢ [0/20]
For a DBMS to recover from a system failure, it needs
- a mechanism to record what updates were "in train" at failure time
- methods for restoring the database(s) to a valid state afterwards
Assume multiple transactions are running when failure occurs
- uncommitted transactions need to be rolled back (
ABORT
)
- committed, but not yet finalised, tx's need to be completed
A critical mechanism in achieving this is the
transaction (event) log
COMP9315 21T1 ♢ Implementing Recovery ♢ [1/20]
Three "styles" of logging
- undo ... removes changes by any uncommitted tx's
- redo ... repeats changes by any committed tx's
- undo/redo ... combines aspects of both
All approaches require:
- a sequential file of log records
- each log record describes a change to a data item
- log records are written before changes to data
- actual changes to data are written later
Known as
write-ahead logging (PostgreSQL uses WAL)
COMP9315 21T1 ♢ Implementing Recovery ♢ [2/20]
Simple form of logging which ensures atomicity.
Log file consists of a sequence of small records:
-
<START T>
... transaction T
begins
-
<COMMIT T>
... transaction T
completes successfully
-
<ABORT T>
... transaction T
fails (no changes)
-
<T,X,v>
... transaction T
changed value of X
from v
Notes:
- we refer to
<T,X,v>
generically as <UPDATE>
log records
- update log entry created for each
WRITE
(not OUTPUT
)
- update log entry contains old value
(new value is not recorded)
COMP9315 21T1 ♢ Implementing Recovery ♢ [3/20]
Data must be written to disk in the following order:
-
<START>
transaction log record
-
<UPDATE>
log records indicating changes
- the changed data elements themselves
-
<COMMIT>
log record
Note: sufficient to have
<T,X,v>
output before
X
, for each
X
COMP9315 21T1 ♢ Implementing Recovery ♢ [4/20]
For the example transaction, we would get:
t Action v B(A) B(B) D(A) D(B) Log
--------------------------------------------------------
(0) BEGIN . . . 8 5 <START T>
(1) READ(A,v) 8 8 . 8 5
(2) v = v*2 16 8 . 8 5
(3) WRITE(A,v) 16 16 . 8 5 <T,A,8>
(4) READ(B,v) 5 16 5 8 5
(5) v = v+1 6 16 5 8 5
(6) WRITE(B,v) 6 16 6 8 5 <T,B,5>
(7) FlushLog
(8) StartCommit
(9) OUTPUT(A) 6 16 6 16 5
(10) OUTPUT(B) 6 16 6 16 6
(11) EndCommit <COMMIT T>
(12) FlushLog
Note that T is not regarded as committed until (12) completes.
COMP9315 21T1 ♢ Implementing Recovery ♢ [5/20]
Simplified view of recovery using UNDO logging:
- scan backwards through log
- if
<COMMIT T>
, mark T
as committed
- if
<T,X,v>
and T
not committed, set X
to v
on disk
- if
<START T>
and T
not committed, put <ABORT T>
in log
Assumes we scan entire log; use checkpoints to limit scan.
COMP9315 21T1 ♢ Implementing Recovery ♢ [6/20]
Algorithmic view of recovery using UNDO logging:
committedTrans = abortedTrans = startedTrans = {}
for each log record from most recent to oldest {
switch (log record) {
<COMMIT T> : add T to committedTrans
<ABORT T> : add T to abortedTrans
<START T> : add T to startedTrans
<T,X,v> : if (T in committedTrans)
else
{ WRITE(X,v); OUTPUT(X) }
} }
for each T in startedTrans {
if (T in committedTrans) ignore
else if (T in abortedTrans) ignore
else write <ABORT T> to log
}
flush log
COMP9315 21T1 ♢ Implementing Recovery ♢ [7/20]
Simple view of recovery implies reading entire log file.
Since log file grows without bound, this is infeasible.
Eventually we can delete "old" section of log.
- i.e. where all prior transactions have committed
This point is called a
checkpoint.
- all of log prior to checkpoint can be ignored for recovery
COMP9315 21T1 ♢ Implementing Recovery ♢ [8/20]
Problem: many concurrent/overlapping transactions.
How to know that all have finished?
- periodically, write log record
<CHKPT (T1,..,Tk)>
(contains references to all active transactions ⇒ active tx table)
- continue normal processing (e.g. new tx's can start)
- when all of
T1,..,Tk
have completed,
write log record <ENDCHKPT>
and flush log
Note: tx manager maintains chkpt and active tx information
COMP9315 21T1 ♢ Implementing Recovery ♢ [9/20]
Recovery: scan backwards through log file processing as before.
Determining where to stop depends on ...
- whether we meet
<ENDCHKPT>
or <CHKPT...>
first
If we encounter <ENDCHKPT>
first:
- we know that all incomplete tx's come after prev
<CHKPT...>
- thus, can stop backward scan when we reach
<CHKPT...>
If we encounter
<CHKPT (T1,..,Tk)>
first:
- crash occurred during the checkpoint period
- any of
T1,..,Tk
that committed before crash are ok
- for uncommitted tx's, need to continue backward scan
COMP9315 21T1 ♢ Implementing Recovery ♢ [10/20]
Problem with UNDO logging:
- all changed data must be output to disk before committing
- conflicts with optimal use of the buffer pool
Alternative approach is
redo logging:
- allow changes to remain only in buffers after commit
- write records to indicate what changes are "pending"
- after a crash, can apply changes during recovery
COMP9315 21T1 ♢ Implementing Recovery ♢ [11/20]
Requirement for redo logging: write-ahead rule.
Data must be written to disk as follows:
- start transaction log record
- update log records indicating changes
- then commit log record (
OUTPUT
)
- then
OUTPUT
changed data elements themselves
Note that update log records now contain
<T,X,v'>
,
where
v'
is the
new value for
X
.
COMP9315 21T1 ♢ Implementing Recovery ♢ [12/20]
For the example transaction, we would get:
t Action v B(A) B(B) D(A) D(B) Log
--------------------------------------------------------
(0) BEGIN . . . 8 5 <START T>
(1) READ(A,v) 8 8 . 8 5
(2) v = v*2 16 8 . 8 5
(3) WRITE(A,v) 16 16 . 8 5 <T,A,16>
(4) READ(B,v) 5 16 5 8 5
(5) v = v+1 6 16 5 8 5
(6) WRITE(B,v) 6 16 6 8 5 <T,B,6>
(7) COMMIT <COMMIT T>
(8) FlushLog
(9) OUTPUT(A) 6 16 6 16 5
(10) OUTPUT(B) 6 16 6 16 6
Note that T is regarded as committed as soon as (8) completes.
COMP9315 21T1 ♢ Implementing Recovery ♢ [13/20]
Simplified view of recovery using REDO logging:
- identify all committed tx's (backwards scan)
- scan forwards through log
- if
<T,X,v>
and T
is committed, set X
to v
on disk
- if
<START T>
and T
not committed, put <ABORT T>
in log
Assumes we scan entire log; use checkpoints to limit scan.
COMP9315 21T1 ♢ Implementing Recovery ♢ [14/20]
UNDO logging and REDO logging are incompatible in
- order of outputting
<COMMIT T>
and changed data
- how data in buffers is handled during checkpoints
Undo/Redo logging combines aspects of both
- requires new kind of update log record
<T,X,v,v'>
gives both old and new values for X
- removes incompatibilities between output orders
As for previous cases, requires write-ahead of log records.
Undo/redo loging is common in practice; Aries algorithm.
COMP9315 21T1 ♢ Implementing Recovery ♢ [15/20]
❖ Undo/Redo Logging (cont) | |
For the example transaction, we might get:
t Action v B(A) B(B) D(A) D(B) Log
--------------------------------------------------------
(0) BEGIN . . . 8 5 <START T>
(1) READ(A,v) 8 8 . 8 5
(2) v = v*2 16 8 . 8 5
(3) WRITE(A,v) 16 16 . 8 5 <T,A,8,16>
(4) READ(B,v) 5 16 5 8 5
(5) v = v+1 6 16 5 8 5
(6) WRITE(B,v) 6 16 6 8 5 <T,B,5,6>
(7) FlushLog
(8) StartCommit
(9) OUTPUT(A) 6 16 6 16 5
(10) <COMMIT T>
(11) OUTPUT(B) 6 16 6 16 6
Note that T is regarded as committed as soon as (10) completes.
COMP9315 21T1 ♢ Implementing Recovery ♢ [16/20]
❖ Undo/Redo Logging (cont) | |
Simplified view of recovery using UNDO/REDO logging:
- scan log to determine committed/uncommitted txs
- for each uncommitted tx
T
add <ABORT T>
to log
- scan backwards through log
- if
<T,X,v,w>
and T
is not committed, set X
to v
on disk
- scan forwards through log
- if
<T,X,v,w>
and T
is committed, set X
to w
on disk
COMP9315 21T1 ♢ Implementing Recovery ♢ [17/20]
❖ Undo/Redo Logging (cont) | |
The above description simplifies details of undo/redo logging.
Aries is a complete algorithm for undo/redo logging.
Differences to what we have described:
- log records contain a sequence numnber (LSN)
- LSNs used in tx and buffer managers, and stored in data pages
- additional log record to mark
<END>
(of commit or abort)
-
<CHKPT>
contains only a timestamp
-
<ENDCHKPT..>
contains tx and dirty page info
COMP9315 21T1 ♢ Implementing Recovery ♢ [18/20]
PostgreSQL uses write-ahead undo/redo style logging.
It also uses multi-version concurrency control, which
- tags each record with a tx and update timestamp
MVCC simplifies some aspects of undo/redo, e.g.
- some info required by logging is already held in each tuple
- no need to undo effects of aborted tx's; use old version
COMP9315 21T1 ♢ Implementing Recovery ♢ [19/20]
❖ Recovery in PostgreSQL (cont) | |
Transaction/logging code is distributed throughout backend.
Core transaction code is in src/backend/access/transam
.
Transaction/logging data is written to files in PGDATA/pg_wal
- a number of very large files containing log records
- old files are removed once all txs noted there are completed
- new files added when existing files reach their capacity (16MB)
- number of tx log files varies depending on tx activity
COMP9315 21T1 ♢ Implementing Recovery ♢ [20/20]
Produced: 20 Apr 2021