Transaction Processing


Transaction Processing1/146

Where transaction processing fits in the DBMS:

[Diagram:Pics/txproc/dbmsarch-small.png]


Transactions2/146

A transaction is a single application-level unit of work, carried out by a sequence of database operations.

Transactions occur naturally in the context of DB applications (e.g. bank transfers, seat bookings).

In order to achieve satisfactory performance (throughput), a DBMS executes many transactions concurrently.

(Notation: we use the abbreviation "tx" as a synonym for "transaction")


... Transactions3/146

A transaction is executed as a sequence of database operations,

E.g.   select ... update ... insert ... select ... insert ...

A transaction can end in two possible ways: commit (all of its changes take effect) or abort (none of its changes take effect).

Above is essentially the atomicity requirement in ACID (see later).


... Transactions4/146

A transaction is a DB state-change operation.

[Diagram:Pics/txproc/tx-exec1-small.png]

Assume that the code of the transaction is correct, i.e. if run to completion without interference on a consistent database, it leaves the database in a consistent state.

Above is essentially the consistency requirement in ACID (see later).


... Transactions5/146

Transactions execute on a collection of data that is shared among concurrent users and must remain valid in the long term.

Transactions need an execution environment that protects this data against failures and against interference from other transactions.

Goal: data integrity should be maintained at all times (for each tx)


... Transactions6/146

If a transaction commits, must ensure that its effects persist in the database.

Part-way through a transaction, must ensure that its partial effects are hidden from other transactions.

If a transaction aborts, must ensure that none of its effects remain.

If there is a system failure, must ensure that on restart the database is returned to a consistent state.


Concurrency in DBMSs7/146

Concurrency in a multi-user DBMS like PostgreSQL:

[Diagram:Pics/txproc/pg-runtime-small.png]


ACID Properties8/146

Data integrity is assured if transactions satisfy the following:

Atomicity: either all of a tx's operations take effect, or none of them do.

Consistency: each tx, run by itself without interference, maps a consistent DB state to a consistent DB state.

Isolation: concurrent tx's do not interfere with each other; each appears to run alone.

Durability: once a tx commits, its changes survive subsequent system failures.


... ACID Properties9/146

Atomicity can be represented by state-transitions:

[Diagram:Pics/txproc/trans-states-small.png]

COMMIT ⇒ all changes preserved,   ABORT ⇒ database unchanged


... ACID Properties10/146

Transaction processing:

Consistency is the property mentioned earlier: a tx maps the database from one consistent state to another. Ensuring this must be left to application programmers.

Our discussion thus focusses on: Atomicity, Durability, Isolation


... ACID Properties11/146

Atomicity is handled by the commit and abort mechanisms

Durability is handled by implementing stable storage, via redundancy (e.g. RAID) and logging/recovery mechanisms.

Isolation is handled by concurrency control mechanisms


Review of Transaction Terminology12/146

To describe transaction effects, we consider: READ and WRITE operations on data items, plus the ABORT and COMMIT operations.

Relationship between the above operations and SQL: select produces READ operations; update and delete produce READ and WRITE operations; insert produces WRITE operations.


... Review of Transaction Terminology13/146

More on transactions and SQL: a tx begins with BEGIN, and ends with either COMMIT (succeed) or ROLLBACK (fail).

In PostgreSQL, tx's cannot be managed inside PLpgSQL functions (COMMIT/ROLLBACK are not allowed there).


... Review of Transaction Terminology14/146

The READ, WRITE, ABORT, COMMIT operations:

The operations are typically denoted as:

RT(X) read item X in transaction T
WT(X) write item X in transaction T
AT abort transaction T
CT commit transaction T


Schedules15/146

A schedule gives the sequence of operations that occur when a set of tx's execute, possibly interleaved.

E.g.   RT1(A)   RT2(B)   WT1(A)   WT3(C)   RT2(A)   WT3(B)   ...

For a single tx, there is a single schedule   (its operations in order).

For a group of tx's, there are very many possible schedules.


... Schedules16/146

Consider a simple banking transaction, expressed in "PLpgSQL":


create function
   withdraw(Acct integer, Required float) returns text
as $$
declare
   Bal float;
begin
   select balance into Bal from Accounts where id=Acct;  R
   Bal := Bal - Required;
   update Accounts set balance=Bal where id=Acct;        W
   if (Bal < 0) then
      rollback; return 'Insufficient funds';             A
   else
      commit; return 'New balance: '||Bal::text;         C
   end if;
end;
$$ language plpgsql;

Notes:


... Schedules17/146

If tx T = withdraw(A,R), it has two possible schedules:   RT(A) WT(A) AT   or   RT(A) WT(A) CT

If tx T = withdraw(A,R1) and tx S = withdraw(B,R2),
some possible schedules are:   RT(A) WT(A) CT RS(B) WS(B) CS,   RT(A) RS(B) WT(A) WS(B) CT CS,   ...


... Schedules18/146

Serial schedules have no interleave of operations from different tx's.

Why serial schedules are good: each tx runs without interference, so if each tx is individually consistent, the final DB state is consistent.

As would occur e.g. in a single-user database system.


... Schedules19/146

With different-ordered serial executions, tx's may get different results.

I.e. StateAfter(T1;T2) = StateAfter(T2;T1) is not generally true.

Consider the following two transactions:

T1 : select sum(salary)
     from Employee where dept='Sales'

T2 : insert into Employee
     values (....,'Sales',...)

If we execute T1 then T2 we get a smaller salary total than if we execute T2 then T1.

In both cases, however, the salary total is consistent with the state of the database at the time the transaction is executed.


... Schedules20/146

A serial execution of consistent transactions is always consistent.

If transactions execute under a concurrent (nonserial) schedule, the potential exists for conflict among their effects.

In the worst case, the effect of executing the transactions ...

So why don't we observe such problems in real DBMSs? ...


... Schedules21/146

Not all concurrent executions cause problems.

For example, the schedules

T1: R(X) W(X)           R(Y) W(Y)
T2:           R(X) W(X)

or

T1: R(X) W(X)      R(Y)      W(Y)
T2:           R(X)      W(X)

or ...

leave the database in a consistent state.
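Whether an interleaving like the ones above is "safe" can be checked mechanically. The following sketch (my own illustration, not part of the lecture; all function names are mine) builds a precedence graph from a schedule, with an edge Ti → Tj whenever an operation of Ti conflicts with a later operation of Tj, and declares the schedule conflict serializable iff the graph is acyclic:

```python
# Sketch: test conflict serializability of a schedule by building a
# precedence graph and checking it for cycles.
# A schedule is a list of (tx, op, item) tuples, e.g. ("T1", "R", "X").

def precedence_graph(schedule):
    """Edge Ti -> Tj if an op of Ti conflicts with a LATER op of Tj
    (same item, different tx's, at least one of the two ops is a write)."""
    edges = set()
    for i, (ti, op1, x1) in enumerate(schedule):
        for tj, op2, x2 in schedule[i+1:]:
            if ti != tj and x1 == x2 and "W" in (op1, op2):
                edges.add((ti, tj))
    return edges

def has_cycle(edges):
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
    def reachable(start, goal, seen):
        for nxt in graph.get(start, ()):
            if nxt == goal or (nxt not in seen and reachable(nxt, goal, seen | {nxt})):
                return True
        return False
    return any(reachable(t, t, set()) for t in graph)

def conflict_serializable(schedule):
    return not has_cycle(precedence_graph(schedule))

# First interleaved schedule from the text: T2's R(X) W(X) sits between
# T1's W(X) and R(Y); it is serializable (equivalent to T1;T2).
ok_schedule = [("T1","R","X"), ("T1","W","X"),
               ("T2","R","X"), ("T2","W","X"),
               ("T1","R","Y"), ("T1","W","Y")]
```

Running `conflict_serializable(ok_schedule)` confirms the claim in the text; a schedule whose conflicts point both ways between two tx's would fail the test.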


Example Transaction #122/146

Problem: Allocate a seat on a plane flight

Implement as a function returning success status:

function allocSeat(paxID    integer,
                   flightID integer,
                   seatNum  string)
         returning boolean
{
    check whether seat currently occupied
    if (already occupied)
        tell pax seat taken; return !ok
    else
        assign pax to seat; return ok
}

Assume schema:


Flight(flightID, flightNum, flightTime, ...)
SeatingAlloc(flightID, seatNum, paxID, ...)


... Example Transaction #123/146

PLpgSQL implementation for seat allocation:


create or replace function
    allocSeat(pax int, fltID int, seat text) returns boolean
as $$
declare
    pid int;
begin
    select paxID into pid from SeatingAlloc
    where  flightID = fltID and seatNum = seat;
    if (pid is not null) then
        return false;  -- someone else already has seat
    else
        update SeatingAlloc set paxID = pax
        where  flightID = fltID and seatNum = seat;
        commit;
        return true;
    end if;
end;
$$ language plpgsql;


... Example Transaction #124/146

If customer Cust1 executes allocSeat(Cust1,f,23B), seat 23B is assigned to Cust1.

If customer Cust2 then executes allocSeat(Cust2,f,23B), the request fails, since the seat is already taken. The system behaves as required; each tx is consistent.


... Example Transaction #125/146

Consider two customers trying allocSeat(?,f,23B) simultaneously.

A possible order of operations ...

Cause of problem: unfortunate interleaving of operations within concurrent transactions.

Serial execution (e.g. enforced by locks) could solve these kinds of problem.
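The unfortunate interleaving can be reproduced deterministically. The sketch below (my own illustration; the generator-based "transactions" and all names are mine) pauses each allocSeat between its check and its assignment: interleaving the two checks lets both customers "win" the seat, while serial execution gives the correct outcome:

```python
# Sketch: the allocSeat race, simulated with an explicit interleaving
# rather than real concurrency. Each tx runs in two steps:
# step 1 checks the seat, step 2 assigns it if the check saw it free.

def alloc_steps(seats, seat, pax):
    occupied = seats.get(seat) is not None   # step 1: check
    yield
    if not occupied:
        seats[seat] = pax                    # step 2: assign
    yield (not occupied)                     # "success" as seen by this tx

def run_interleaved(seats, seat):
    t1 = alloc_steps(seats, seat, "Cust1")
    t2 = alloc_steps(seats, seat, "Cust2")
    next(t1); next(t2)               # both check: seat looks free to both
    ok1 = next(t1); ok2 = next(t2)   # both assign; one assignment is lost
    return ok1, ok2

def run_serial(seats, seat):
    results = []
    for pax in ("Cust1", "Cust2"):
        t = alloc_steps(seats, seat, pax)
        next(t); results.append(next(t))
    return tuple(results)

interleaved = run_interleaved({}, "23B")   # both txs report success
serial = run_serial({}, "23B")             # only the first succeeds
```

Under the interleaved run both customers believe they hold 23B; under the serial run the second request correctly fails.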


Example Transaction #226/146

Problem: transfer funds between two accounts in same bank.

Implement as a function returning success status:

function transfer(sourceAcct integer,
                  destAcct   integer,
                  amount     real)
         returning boolean
{
    check whether sourceAcct is valid
    check whether destAcct is valid
    check whether sourceAcct has
          sufficient funds (>= amount)
    if (all ok) {
        withdraw money from sourceAcct
        deposit money into destAcct
    }
}


... Example Transaction #227/146

PLpgSQL for funds transfer between accounts:


create or replace function
    transfer(source int, dest int, amount float)
    returns boolean
as $$
declare
    ok  boolean := true;
    src Accounts%rowtype;
    dst Accounts%rowtype;
begin
    select * into src
    from   Accounts where id=source;
    if (not found) then
        raise warning 'Invalid Withdrawal Account';
        ok := false;
    end if;
    select * into dst
    from   Accounts where id=dest;
    if (not found) then
        raise warning 'Invalid Deposit Account';
        ok := false;
    end if;
    if (ok and src.balance < amount) then
        raise warning 'Insufficient funds';
        ok := false;
    end if;
    if (not ok) then rollback; return false; end if;
    update Accounts
    set    balance = balance - amount
    where  id = source;
    update Accounts
    set    balance = balance + amount
    where  id = dest;
    commit;
    return true;
end;
$$ language plpgsql;


... Example Transaction #228/146

If customer transfers $1000 from Acct1 to Acct2, the balance check passes and the transfer completes.

But if Cust1 and Cust2 both transfer from Acct1 together, both may pass the balance check, even though the account ran out of money after Cust1 took their cash.

Similar to earlier problem; could be fixed by serialization.


... Example Transaction #229/146

Consider customer transfers $1000 from Acct1 to Acct2

If transactions are not atomic/durable: a crash part-way through may leave the withdrawal done but not the deposit; a crash shortly after commit may lose the whole transfer.


Transaction Anomalies30/146

What problems can occur with concurrent transactions?

The set of phenomena can be characterised broadly under: dirty reads, nonrepeatable reads, phantom reads.


... Transaction Anomalies31/146

Dirty read: a transaction reads data written by a concurrent uncommitted transaction

Example (strictly, this shows the closely related lost update: both tx's read the committed value of a, then both write it, and one write overwrites the other):

     Transaction T1       Transaction T2
(1)  select a into X
     from R where id=1
(2)                       select a into Y
                          from R where id=1
(3)  update R set a=X+1
     where id=1
(4)  commit
(5)                       update R set a=Y+1
                          where id=1
(6)                       commit

Effect: T1's update on R.a is lost.


... Transaction Anomalies32/146

Nonrepeatable read: a transaction re-reads data it has previously read and finds that data has been modified by another transaction (that committed since the initial read)

Example:

     Transaction T1    Transaction T2
(1)  select * from R
     where id=5
(2)                    update R set a=8
                       where id=5
(3)                    commit
(4)  select * from R
     where id=5

Effect: T1 runs same query twice; sees different data


... Transaction Anomalies33/146

Phantom read: a transaction re-executes a query returning a set of rows that satisfy a search condition and finds that the set of rows satisfying the condition has changed due to another recently-committed transaction

Example:

     Transaction T1    Transaction T2
(1)  select count(*)
     from R where a=5
(2)                    insert into R(id,a,b)
                              values (2,5,8)
(3)                    commit
(4)  select count(*)
     from R where a=5

Effect: T1 runs same query twice; sees different result set


Example of Transaction Failure34/146

The above examples generally assumed that transactions committed.

Additional problems can arise when transactions abort.

We give examples using the following two transactions:

T1: read(X)           T2: read(X)
    X := X + N            X := X + M
    write(X)              write(X)
    read(Y)
    Y := Y - N
    write(Y)

and initial DB state X=100, Y=50, N=5, M=8.


... Example of Transaction Failure35/146

Consider the following schedule where one transaction fails:

T1: R(X) W(X) A
T2:             R(X) W(X)

Transaction T1 aborts after writing X.

The abort will rollback the changes to X, but where the undo occurs can affect the results.

Consider three places where rollback might occur:

T1: R(X) W(X) A [1]     [2]     [3]
T2:                 R(X)    W(X)


Transaction Failure - Case 136/146

This scenario is ok.   T1's effects have been eliminated.

Database   Transaction T1       Transaction T2
---------  ------------------   --------------
X    Y               X    Y               X
100  50              ?    ?               ?
           read(X)   100
           X:=X+N    105
105        write(X)
           abort
100        rollback
                                read(X)   100
                                X:=X+M    108
108                             write(X)
---------
108  50


Transaction Failure - Case 237/146

In this scenario, some of T1's effects have been retained.

Database   Transaction T1       Transaction T2
---------  ------------------   --------------
X    Y               X    Y               X
100  50              ?    ?               ?
           read(X)   100
           X:=X+N    105
105        write(X)
           abort
                                read(X)   105
                                X:=X+M    113
100        rollback
113                             write(X)
---------
113  50


Transaction Failure - Case 338/146

In this scenario, T2's effects have been lost, even after commit.

Database   Transaction T1       Transaction T2
---------  ------------------   --------------
X    Y               X    Y               X
100  50              ?    ?               ?
           read(X)   100
           X:=X+N    105
105        write(X)
           abort
                                read(X)   105
                                X:=X+M    113
113                             write(X)
100        rollback
---------
100  50


Transaction Isolation


Transaction Isolation40/146

If transactions always ran "single-user", there would be no interference between them, and isolation would come for free.

Simplest form of isolation: serial execution   (T1 ; T2 ; T3 ; ...)


... Transaction Isolation41/146

In practice, serial execution gives poor performance.

We need approaches that allow "safe" concurrency

The remainder of this discussion involves properties of concurrent schedules (serializability, recoverability) and mechanisms for concurrency control.


DBMS Transaction Management42/146

Abstract view of DBMS concurrency mechanisms:

[Diagram:Pics/txproc/txproc1-small.png]

The Scheduler determines the order in which operations from concurrent tx's are executed.


Serializability43/146

Consider two schedules S1 and S2 produced by executing the same set of transactions.

S1 and S2 are equivalent if, for every initial database state, they produce the same final database state.

A schedule S for a set of concurrent tx's T1..Tn is serializable if it is equivalent to some serial schedule of T1..Tn. Under these circumstances, consistency is guaranteed (assuming each tx is individually consistent).


... Serializability44/146

Two formulations of serializability: conflict serializability and view serializability.

View serializability is strictly weaker than conflict serializability,

i.e. there are VS schedules that are not CS, but not vice versa.


Isolation and Concurrency Control45/146

It is not useful to test each schedule for serializability as it runs: too expensive, and too late.

The goal is to devise concurrency control schemes that guarantee only serializable schedules are ever produced. Serializability tests are used in proving properties of these schemes.


Other Properties of Schedules46/146

Above discussion explicitly assumes that all transactions commit.

What happens if some transaction aborts?

Under serial execution, there is no problem: the aborted tx's changes are rolled back before the next tx starts.

With concurrent execution, there may be problems: other tx's may already have used values written by the aborted tx.


Recoverability47/146

Consider the serializable schedule:

T1:        R(X)  W(Y)  C
T2:  W(X)                 A

(where the final value of Y is dependent on the X value)

Notes:

The schedule produces an invalid database state, even though serializable.


... Recoverability48/146

Recoverable schedules avoid these kinds of problems.

For a schedule to be recoverable, we require an additional constraint: if Ti reads a value written by Tj, then Tj must commit before Ti commits,

and this property must hold for all transactions Tj.

Note that recoverability does not prevent "dirty reads".

In order to make schedules recoverable in the presence of dirty reads and aborts, may need to abort multiple transactions.
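The recoverability condition can be checked directly. This sketch (my own, not from the lecture; names are mine) records, for each tx, which tx's it read values from, then verifies that each such writer commits first:

```python
# Sketch: a schedule is recoverable if whenever Tj reads an item last
# written by Ti, then Ti commits before Tj commits.
# A schedule is a list of (tx, op, item); op is "R", "W" or "C" (commit),
# and every tx in the schedule is assumed to commit.

def recoverable(schedule):
    last_writer = {}      # item -> tx that last wrote it
    reads_from = {}       # Tj -> set of Ti whose writes Tj has read
    commit_pos = {}       # tx -> position of its commit
    for pos, (tx, op, item) in enumerate(schedule):
        if op == "W":
            last_writer[item] = tx
        elif op == "R":
            w = last_writer.get(item)
            if w is not None and w != tx:
                reads_from.setdefault(tx, set()).add(w)
        elif op == "C":
            commit_pos[tx] = pos
    return all(commit_pos[ti] < commit_pos[tj]
               for tj, sources in reads_from.items() for ti in sources)

# T1 reads X after T2's commit: recoverable.
good = [("T2","W","X"), ("T2","C",None), ("T1","R","X"), ("T1","C",None)]
# T1 reads T2's uncommitted X and commits first: not recoverable.
bad  = [("T2","W","X"), ("T1","R","X"), ("T1","C",None), ("T2","C",None)]
```

The `bad` schedule is exactly the problematic pattern above: if T2 subsequently aborted instead of committing, T1 would already have committed a value derived from dirty data.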


Cascading Aborts49/146

Recall the earlier non-recoverable schedule:

T1:        R(X)  W(Y)  C
T2:  W(X)                 A

To make it recoverable requires:

T1:        R(X)  W(Y)  ...   A
T2:  W(X)                 A

Known as cascading aborts (or cascading rollback).


... Cascading Aborts50/146

Example: T3 aborts, causing T2 to abort, causing T1 to abort

T1:                    R(Y)  W(Z)        A
T2:        R(X)  W(Y)                 A
T3:  W(X)                          A

Even though T1 has no direct connection with T3
(i.e. no shared data).

This kind of problem ...


... Cascading Aborts51/146

Cascading aborts can be avoided if transactions can only read values written by already-committed transactions.

Effectively: eliminate the possibility of reading dirty data.

Downside: reduces opportunity for concurrency.

GUW call these ACR (avoid cascading rollback) schedules.

All ACR schedules are also recoverable.


Strictness52/146

Strict schedules also eliminate the chance of writing dirty data.

A schedule is strict if a tx neither reads nor overwrites an item written by another tx until that tx has committed or aborted.

Strict schedules simplify the task of rolling back after aborts.


... Strictness53/146

Example: non-strict schedule

T1:  W(X)        A
T2:        W(X)     A

Problems with handling rollback after aborts: restoring T1's before-image of X would wipe out T2's write, so the correct final value of X depends on the order in which the aborts are processed.


Schedule Properties54/146

Relationship between various classes of schedules:

[Diagram:Pics/txproc/schedules-small.png]

Schedules ought to be serializable and strict.

But more serializable/strict ⇒ less concurrency.

DBMSs allow users to trade off "safety" against performance.


Transaction Isolation Levels55/146

Previous examples showed:

Safest approach: allow only serial schedules. Other approaches are weaker, trading some safety for concurrency. Is such a trade-off useful?


... Transaction Isolation Levels56/146

In many real applications, there is either low contention for shared data, or some tolerance for slightly out-of-date results.

This leads to a trade-off between performance and isolation


... Transaction Isolation Levels57/146

SQL provides a mechanism for database programmers to specify how much isolation to apply to transactions

set transaction
    read only  -- so weaker isolation may be ok
    read write -- suggests stronger isolation needed
isolation level
    -- weakest isolation, maximum concurrency
    read uncommitted
    read committed
    repeatable read
    serializable
    -- strongest isolation, minimum concurrency


... Transaction Isolation Levels58/146

Meaning of transaction isolation levels:


Isolation          Dirty          Nonrepeatable   Phantom
Level              Read           Read            Read

Read uncommitted   Possible       Possible        Possible
 
Read committed     Not possible   Possible        Possible

Repeatable read    Not possible   Not possible    Possible

Serializable       Not possible   Not possible    Not possible


... Transaction Isolation Levels59/146

For transaction isolation, PostgreSQL relies on its multi-version concurrency control (MVCC) machinery.

Note: cannot implement read uncommitted because of MVCC


... Transaction Isolation Levels60/146

A PostgreSQL tx consists of a sequence of SQL statements:

BEGIN S1; S2; ... Sn; COMMIT;

Isolation levels affect view of DB provided to each Si:


... Transaction Isolation Levels61/146

Using PostgreSQL's serializable isolation level, a select sees only data committed before the transaction began.

Using the serializable isolation level, an update fails if a concurrent tx has modified the relevant rows since the transaction began. The transaction containing the update must then rollback and re-start.


Implementing Concurrency Control


Concurrency Control63/146

Approaches to concurrency control: lock-based, optimistic (validation-based), and multi-version (MVCC).


Lock-based Concurrency Control64/146

Locks introduce additional mechanisms in DBMS:

[Diagram:Pics/txproc/txproc2-small.png]

The Scheduler now also handles lock requests: an operation may be delayed until its required lock becomes available.


... Lock-based Concurrency Control65/146

Lock table entries contain: the object being locked, the lock type (read/write), a list of tx's holding the lock, and a queue of tx's waiting on the lock.

Lock and unlock operations must be atomic.

Lock upgrade: if a tx holds a read lock, and it is the only tx holding that lock, the lock can be converted into a write lock.


... Lock-based Concurrency Control66/146

Synchronise access to shared data items via the following rules: before reading X, obtain a read (shared) lock on X; before writing X, obtain a write (exclusive) lock on X; a tx waits if the lock it needs is held in a conflicting mode by another tx.

These rules alone do not guarantee serializability.


... Lock-based Concurrency Control67/146

Consider the following schedule, using locks:

T1: Lr(Y)     R(Y)               U(Y)         Lw(X) W(X) U(X)
T2:      Lr(X)    R(X) U(X) Lw(Y)....W(Y) U(Y)

(where Lr = read-lock, Lw = write-lock, U = unlock)

Locks correctly ensure controlled access to shared objects (X, Y).

Despite this, the schedule is not serializable.


Two-Phase Locking68/146

To guarantee serializability, we require an additional constraint on how locks are applied: in every tx, all lock requests must precede all of its unlock requests.

Each transaction is then structured as: a growing phase, where locks are acquired, followed by a shrinking phase, where locks are released.


... Two-Phase Locking69/146

Consider the following two transactions:

T1: Lw(A) R(A) F1 W(A) Lw(B) U(A) R(B) G1 W(B) U(B)
T2: Lw(A) R(A) F2 W(A) Lw(B) U(A) R(B) G2 W(B) U(B)

They follow 2PL protocol, inducing a schedule like:

T1(a): Lw(A)      R(A) F1 W(A) Lw(B) U(A)
T2(a):      Lw(A) ...................... R(A) F2 W(A)

T1(b): R(B)      G1 W(B) U(B)
T2(b):     Lw(B) ............ U(A) R(B) G2 W(B) U(B)


Problems with Locking70/146

Appropriate locking can guarantee correctness.

However, it also introduces potential undesirable effects: deadlock, starvation, and reduced performance (lock management overheads).


Deadlock71/146

Deadlock occurs when two transactions are waiting for a lock on an item held by the other.

Example:

T1: Lw(A) R(A)            Lw(B) ......
T2:            Lw(B) R(B)       Lw(A) .....

How to deal with deadlock?


... Deadlock72/146

Handling deadlock involves forcing a transaction to "back off".


... Deadlock73/146

Simple approach: timeout. Set a maximum time a tx may wait for a lock; if exceeded, assume deadlock and abort the tx.

Better approach: waits-for graph. Maintain a graph with a node for each tx, and an edge Tj → Tk whenever Tj is waiting on a lock held by Tk.


... Deadlock74/146

A cycle in the waits-for graph indicates a deadlock.

Could prevent deadlock by never adding an edge that would complete a cycle (abort the requesting tx instead).

Could detect deadlock by periodically checking the graph for cycles, then aborting one tx in each cycle found.


... Deadlock75/146

Alternative deadlock handling: timestamps.

Each tx is permanently allocated a unique timestamp (e.g. start-time).

When Tj tries to get a lock held by Tk, the relative timestamps of Tj and Tk determine whether Tj waits or one of the two tx's is aborted.

Both schemes prevent waits-for cycles by imposing order/priority on tx's.


... Deadlock76/146

Tj tries to get a lock held by Tk ...

Wait-die scheme: if Tj is older than Tk, Tj waits for the lock; otherwise, Tj "dies" (it is aborted, and later restarted with its original timestamp).

Wound-wait scheme: if Tj is older than Tk, Tk is "wounded" (aborted, and later restarted), releasing its lock to Tj; otherwise, Tj waits.
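The two schemes can be captured as a pair of decision functions. In this sketch (my own encoding; function names and return strings are mine), a smaller timestamp means an older, higher-priority transaction:

```python
# Sketch: timestamp-based deadlock avoidance, where Tj (timestamp ts_j)
# requests a lock held by Tk (timestamp ts_k). The result says what
# happens to the requester Tj, or to the holder Tk.

def wait_die(ts_j, ts_k):
    """Older requester waits; younger requester dies (aborts, and restarts
    later with its ORIGINAL timestamp, so it eventually gets priority)."""
    return "Tj waits" if ts_j < ts_k else "Tj dies"

def wound_wait(ts_j, ts_k):
    """Older requester wounds (aborts) the lock holder; younger requester
    simply waits for the lock."""
    return "Tk wounded" if ts_j < ts_k else "Tj waits"
```

In both schemes, the older tx never aborts, so every tx eventually becomes the oldest and runs to completion; this is why neither scheme can produce a waits-for cycle.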


... Deadlock77/146

Properties of deadlock handling methods:


Starvation78/146

Starvation occurs when one transaction waits indefinitely on a lock, while other transactions continue normally.

Whether it occurs depends on the lock wait/release strategy.

When multiple tx's are waiting on the same lock, the system needs to decide which one to grant it to next.

Solutions: grant lock requests first-come-first-served, or increase the priority of tx's the longer they wait.


Locking Granularity79/146

Locking typically reduces concurrency ⇒ reduces throughput.

Granularity of locking can impact performance:

+ lock a small item ⇒ more of database accessible

+ lock a small item ⇒ quick update ⇒ quick lock release

- lock small items ⇒ more locks ⇒ more lock management

Granularity levels: field, row (tuple), table, whole database


... Locking Granularity80/146

Multiple-granularity locking protocol: before locking an item, a tx first acquires intention locks on all larger units containing that item (e.g. an intention-write lock on a table before write-locking one of its tuples).

Example: T1 scans table R and updates some tuples; T1 takes an intention-write lock on R, then write locks on individual tuples.

Example: T2 uses an index to read part of R; T2 takes an intention-read lock on R, then read locks on individual tuples.


Locking in B-trees81/146

How to lock a B-tree leaf node?

One possibility: lock the root, then lock each node on the path down to the required leaf, holding all of these locks until the operation completes.

If for searching (select), locks would be read locks.

If for insert/delete, locks would be write locks.

This approach gives poor performance (lock on root is bottleneck).


... Locking in B-trees82/146

Approach for searching (select) ...

Approach for insert/delete ...


Optimistic Concurrency Control83/146

Locking is a pessimistic approach to concurrency control: it assumes conflicts are likely, and prevents them in advance.

Costs: lock management, deadlock handling, contention.

In systems where the read:write ratio is very high, conflicts are rare, and most of the locking overhead is wasted; an optimistic approach may give better performance.


... Optimistic Concurrency Control84/146

Transactions have three distinct phases: reading (read from the DB, making updates only on private copies), validation (check for conflicts with other tx's), writing (install the updates into the database).

Timestamps are recorded at points noted:

[Diagram:Pics/txproc/occ-phases-small.png]


... Optimistic Concurrency Control85/146

Data structures needed for validation: the read set and write set of each tx, along with the timestamps marking its phases.

Use the V timestamps as ordering for transactions


... Optimistic Concurrency Control86/146

Validation check for transaction T: for each overlapping transaction Ti that validated before T, check that Ti's write set does not intersect T's read set.

If this test fails for any Ti, then T is rolled back.

What this prevents ...
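A simplified form of the validation test can be written directly. This sketch (my own simplification of backward validation; function and variable names are mine) checks T's read set against the write sets of transactions that validated while T was still reading:

```python
# Sketch: transaction T passes validation if, for every tx Ti that
# validated while T was still in its read phase, Ti's write set does
# not intersect T's read set. Any intersection means T may have read
# a value that Ti was concurrently replacing, so T must roll back.

def validate(T_read_set, overlapping_write_sets):
    return all(not (T_read_set & ws) for ws in overlapping_write_sets)

# T read {A, B}; a concurrently-validated tx wrote only {C}: T passes.
ok = validate({"A", "B"}, [{"C"}])

# Another concurrently-validated tx wrote {B, D}: T must roll back.
bad = validate({"A", "B"}, [{"C"}, {"B", "D"}])
```

Note the asymmetry: only writes of earlier-validated tx's against T's reads are checked here; fuller schemes also compare write sets when write phases overlap.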


... Optimistic Concurrency Control87/146

Problems with optimistic concurrency control:

Notes:


Multi-version Concurrency Control88/146

Multi-version concurrency control (MVCC) aims to retain the benefits of locking, while allowing more concurrency.

Achieves this by keeping multiple versions of each data item, rather than updating items in place.

Main difference between MVCC and standard locking: readers do not block writers, and writers do not block readers.


... Multi-version Concurrency Control89/146

Each transaction is identified by a unique, monotonically increasing transaction id (effectively a timestamp).

Each record in the database is tagged with the id of the tx that created it, and eventually with the id of the tx that deleted/replaced it.


... Multi-version Concurrency Control90/146

When a reader Ti is accessing the database, it ignores versions created after Ti started; for each item, it sees the latest version committed before Ti began.

When a writer Tj changes a data item, it does not overwrite the existing version; it creates a new version tagged with Tj's id.
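These two rules can be modeled with a versioned store. The sketch below (my own model, not PostgreSQL's implementation; class and method names are mine) tags each version with its writer's timestamp, and lets each reader pick the newest version that predates it:

```python
# Sketch: each item keeps a list of versions tagged with the timestamp
# of the writing tx. A reader with start-timestamp ts sees the newest
# version written at or before ts; writers append, never overwrite.

class MVStore:
    def __init__(self):
        self.versions = {}   # item -> list of (write_ts, value), ascending

    def write(self, ts, item, value):
        self.versions.setdefault(item, []).append((ts, value))

    def read(self, ts, item):
        """Return the value of the latest version with write_ts <= ts."""
        visible = [v for (wts, v) in self.versions.get(item, []) if wts <= ts]
        return visible[-1] if visible else None

db = MVStore()
db.write(1, "X", 100)       # tx with ts=1 sets X=100
db.write(5, "X", 200)       # tx with ts=5 later sets X=200
old_view = db.read(3, "X")  # a reader that started at ts=3 still sees 100
new_view = db.read(7, "X")  # a reader that started at ts=7 sees 200
```

The write at ts=5 never blocks, and never disturbs, the reader at ts=3; that is the "readers and writers don't block each other" property in miniature. (Commit status and version reclamation are ignored here.)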


Concurrency Control in SQL91/146

Transactions in SQL are specified by begin ... commit (successful end) or rollback (abort).

In PostgreSQL, other actions also cause rollback: e.g. an error in an SQL statement, or a raise exception in PLpgSQL, aborts the enclosing tx.


... Concurrency Control in SQL92/146

More fine-grained control of "undo" via savepoints:

Example:

begin;
  insert into numbersTable values (1);
  savepoint my_savepoint;
  insert into numbersTable values (2);
  rollback to savepoint my_savepoint;
  insert into numbersTable values (3);
commit;

will insert 1 and 3 into the table, but not 2.


... Concurrency Control in SQL93/146

SQL standard defines four levels of transaction isolation.

The weakest level allows dirty reads, phantom reads, etc.

PostgreSQL implements: repeatable-read = serializable, read-uncommitted = read-committed


... Concurrency Control in SQL94/146

Using the serializable isolation level, a select sees only data committed before the transaction began.

Using the serializable isolation level, an update fails if a concurrent tx has modified the relevant rows since the transaction began. The transaction containing the failed update will rollback and re-start.


... Concurrency Control in SQL95/146

Explicit control of concurrent access is available, e.g.

Table-level locking: LOCK TABLE

Row-level locking: SELECT FOR UPDATE, DELETE

All locks are released at end of transaction (no explicit unlock)


Concurrency Control in PostgreSQL96/146

PostgreSQL uses two styles of concurrency control: MVCC (implicit, used for all ordinary SQL statements) and explicit locking (lock table, select for update, ...).

From the SQL (PLpgSQL) level: MVCC happens automatically; locks are taken only when explicitly requested.


... Concurrency Control in PostgreSQL97/146

Implementing MVCC in PostgreSQL requires: a creating-tx id (xmin) and a deleting-tx id (xmax) in each tuple header, a record of each tx's status (commit log), and periodic vacuuming to reclaim tuples no longer visible to any tx.


... Concurrency Control in PostgreSQL98/146

Rules for a tuple to be visible to Tj: the tuple's creating tx (xmin) committed before Tj started, and either the tuple has not been deleted, or its deleting tx (xmax) had not yet committed when Tj started.


... Concurrency Control in PostgreSQL99/146

A transaction sees a consistent view of the database, but may not see the "current" view of the database.

E.g. T1 does a select and then concurrent T2 deletes some of T1's selected tuples

This is OK unless tx's communicate outside the database system.

For applications that require that every transaction accesses the current consistent version of the data, explicit locks are available.

Locks are available at various granularities:


Implementing Atomicity/Durability


Atomicity/Durability101/146

Reminder:

Transactions are atomic: if a tx aborts (or the system crashes mid-tx), none of its effects remain in the database.

Transaction effects are durable: if a tx commits, its effects persist, even if the system subsequently fails.

Implementation of atomicity/durability is intertwined.


Durability102/146

What kinds of "system failures" do we need to deal with? E.g. transaction failure, DBMS process crash, OS crash, power loss, disk failure, destruction of the site.

The last requires off-site backup; all others should be locally recoverable.


... Durability103/146

Consider following scenario:

[Diagram:Pics/txproc/crash-small.png]

Desired behaviour after system restart:


... Durability104/146

Durability begins with a stable disk storage subsystem.

Operations: getBlock(B) reads a block from disk into memory; putBlock(B) writes a block from memory to disk.

Each call to putBlock must ensure that the data actually reaches the disk surface (not merely an on-controller cache). A subsequent getBlock on the same disk sector must return the most recently put data.


... Durability105/146

Implementation of transaction operations R(V) and W(V)

Value R(Object V) {
    B = getBuf(blockContaining(V))
    return value of V from B
}
void W(Object V, value k) {
    B = getBuf(blockContaining(V))
    set value of V in B
}

Note: W changes only the buffered copy of V; the containing block reaches disk later, via putBuf().


... Durability106/146

getBuf() and putBuf() interface buffer pool with disk

Buffer getBuf(BlockAddr A) {
    if (!inBufferPool(A)) {
        B = availableBuffer(A);
        getBlock(B);
    }
    return bufferContaining(A);
}
void putBuf(BlockAddr A) {
    B = bufferContaining(A);
    putBlock(B);
}


Stable Store107/146

One simple strategy using redundancy: stable store.

Protects against all failures in a single disk sector.

Each logical disk page X is stored twice.

(Obvious disadvantage: disk capacity is halved)

X is stored in sector S on disk L and sector T on disk R

Assume that a sound parity-check is available

(i.e. can always recognise whether data has transferred correctly between memory and disk)


... Stable Store108/146

Low level sector i/o functions:

int writeSector(char *b, Disk d, Sector s) {
   int nTries = 0;
   do {
      nTries++;  write(d,s,b);
   } while (bad(parity) && nTries < MaxTries)
   return nTries;
}
int readSector(char *b, Disk d, Sector s) {
   int nTries = 0;
   do {
      nTries++;  read(d,s,b);
   } while (bad(parity) && nTries < MaxTries)
   return nTries;
}


... Stable Store109/146

Writing data to disk with stable store:

int writeBlock(Buffer *b, Disk d, Sector s) {
   int sec, n;
   for (;;) {
      sec = (s > 0) ? s : getUnusedSector(d);
      n = writeSector(b->data, d, sec);
      if (n == MaxTries) {
         mark sec as unusable
         s = -1;   // force choice of a fresh sector on retry
      } else
         return sec;
   }
}


... Stable Store110/146

Reading data from disk with stable store:

int readBlock(Buffer *b) {
   int n = readSector(b->data, b->diskL, b->sectorL);
   if (n == MaxTries) {
      n = readSector(b->data, b->diskR, b->sectorR);
      if (n == MaxTries) return -1; // read failure
   }
   return 0; // successful read
}


... Stable Store111/146

Consider scenario where power fails during write operation:

Wish to restore the system to a state where:


... Stable Store112/146

How stable storage handles failure during writing:


RAID113/146

RAID gives techniques to achieve reliability (by storing data redundantly) and performance (by striping data across devices).

Requires: multiple disks. (See texts for further discussion on RAID)


Stable Storage Subsystem114/146

We can prevent/minimise loss/corruption of data due to:

If all of these implemented, assume stable storage subsystem.


Dealing with Transactions115/146

The remaining "failure modes" that we need to consider: failure of an individual transaction (abort), and failure of the whole system (crash, with loss of memory contents).

Standard technique for managing these: keep a log of database changes, and use it after a failure to restore a consistent state.


Architecture for Atomicity/Durability116/146

How does a DBMS provide for atomicity/durability?

[Diagram:Pics/txproc/arch-small.png]


Execution of Transactions117/146

Transactions deal with three address spaces: stored data on the disk, data in memory buffers, and data in the tx's local variables.

Each of these may hold a different "version" of a DB object.


... Execution of Transactions118/146

Operations available for data transfer: READ(X,v) / WRITE(X,v) move a value between a buffer and a tx's local variable v; INPUT(X) / OUTPUT(X) move the block containing X between disk and buffer.

READ/WRITE are issued by transaction.

INPUT/OUTPUT are issued by buffer manager (and log manager).


... Execution of Transactions119/146

Example of transaction execution:

-- implements A = A*2; B = B+1;
BEGIN
READ(A,v); v = v*2; WRITE(A,v);
READ(B,v); v = v+1; WRITE(B,v);
COMMIT

READ accesses the buffer manager and may cause INPUT.

COMMIT needs to ensure that buffer contents go to disk.


... Execution of Transactions120/146

States as the transaction executes:


t   Action        v  Buf(A)  Buf(B)  Disk(A)  Disk(B)
-----------------------------------------------------
(0) BEGIN         .      .       .        8        5
(1) READ(A,v)     8      8       .        8        5
(2) v = v*2      16      8       .        8        5
(3) WRITE(A,v)   16     16       .        8        5
(4) READ(B,v)     5     16       5        8        5
(5) v = v+1       6     16       5        8        5
(6) WRITE(B,v)    6     16       6        8        5
(7) OUTPUT(A)     6     16       6       16        5
(8) OUTPUT(B)     6     16       6       16        6

After tx completes, we must have either
Disk(A)=8,Disk(B)=5   or   Disk(A)=16,Disk(B)=6

If system crashes before (8), may need to undo disk changes.
If system crashes after (8), may need to redo disk changes.
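The state table above can be replayed with a tiny buffer/disk model. This sketch (my own illustration; it elides INPUT as an implicit step of READ) shows that WRITE touches only the buffer, and only OUTPUT changes the disk:

```python
# Sketch: READ/WRITE operate on buffered copies only; OUTPUT copies a
# buffered value to disk. Initial state matches the table: A=8, B=5.

disk = {"A": 8, "B": 5}
buf = {}

def READ(x):              # fetch into buffer if needed, return value
    if x not in buf:
        buf[x] = disk[x]  # implicit INPUT
    return buf[x]

def WRITE(x, v):          # update the buffered copy only
    buf[x] = v

def OUTPUT(x):            # flush the buffered copy to disk
    disk[x] = buf[x]

# The example transaction: A = A*2; B = B+1
v = READ("A"); v = v * 2; WRITE("A", v)
v = READ("B"); v = v + 1; WRITE("B", v)
after_writes = dict(disk)   # disk unchanged: crash here loses nothing
OUTPUT("A")
partial = dict(disk)        # crash here leaves A=16, B=5: inconsistent!
OUTPUT("B")
after_output = dict(disk)   # both changes on disk
```

The `partial` state is exactly the danger the slide describes: a crash between the two OUTPUTs leaves the disk in a state matching neither "before" nor "after" the transaction.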


Transactions and Buffer Pool121/146

Two issues arise w.r.t. buffers: forcing (do we flush all of a tx's buffers to disk at commit time?) and stealing (may a buffer holding changes of an uncommitted tx be written out early, so its frame can be reused?).

Ideally, we want stealing and not forcing.


... Transactions and Buffer Pool122/146

Handling stealing: remember the old value of any item written out early, so an uncommitted tx's changes can be undone.

Handling no forcing: remember the new value of any committed-but-unflushed item, so the changes can be redone after a crash.


Logging123/146

Three "styles" of logging: undo, redo, undo/redo.

All approaches require: a sequential file of log records describing changes, and rules for when log records must reach disk.


Undo Logging124/146

Simple form of logging which ensures atomicity.

Log file consists of a sequence of small records: <START T>, <COMMIT T>, <ABORT T>, and update records <T,X,v>, where v is the old value of item X.

Notes:


... Undo Logging125/146

Data must be written to disk in the following order:

  1. start transaction log record
  2. update log records indicating changes
  3. the changed data elements themselves
  4. the commit log record
Note: sufficient to have <T,X,v> output before X, for each X


... Undo Logging126/146

For the example transaction, we would get:


t    Action        v  B(A)  B(B)  D(A)  D(B)  Log
--------------------------------------------------------
(0)  BEGIN         .    .     .     8     5   <START T>
(1)  READ(A,v)     8    8     .     8     5
(2)  v = v*2      16    8     .     8     5
(3)  WRITE(A,v)   16   16     .     8     5   <T,A,8>
(4)  READ(B,v)     5   16     5     8     5
(5)  v = v+1       6   16     5     8     5
(6)  WRITE(B,v)    6   16     6     8     5   <T,B,5>
(7)  FlushLog
(8)  StartCommit
(9)  OUTPUT(A)     6   16     6    16     5
(10) OUTPUT(B)     6   16     6    16     6
(11) EndCommit                                <COMMIT T>
(12) FlushLog

Note that T is not regarded as committed until (11).


... Undo Logging127/146

Simplified view of recovery using UNDO logging:

committedTrans = abortedTrans = startedTrans = {}
for each log record from most recent to oldest {
    switch (log record) {
    <COMMIT T> : add T to committedTrans
    <ABORT T>  : add T to abortedTrans
    <START T>  : add T to startedTrans
    <T,X,v>    : if (T in committedTrans)
                       // don't undo committed changes
                 else
                     { WRITE(X,v); OUTPUT(X) }
}   }
for each T in startedTrans {
    if (T in committedTrans) ignore
    else if (T in abortedTrans) ignore
    else write <ABORT T> to log
}
flush log


... Undo Logging128/146

Recall example transaction and consider effects of system crash at the following points:

Before (9) ... disk "restored" (unchanged); <ABORT T> written

(9)-(11) ... disk restored to original state; <ABORT T> written

After (12) ... A and B left unchanged; T treated as committed

"Disk restored" means

WRITE(B,5); OUTPUT(B); WRITE(A,8); OUTPUT(A);
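The recovery pseudocode above can be rendered as a runnable sketch (my own simplified version: no buffering, log records as tuples, and names of my choosing), replaying the crash-before-(9) case:

```python
# Sketch: scan the log backwards, undoing the updates of any transaction
# that has no <COMMIT> record. log is an oldest-first list of records:
# ("START",T) / ("UPDATE",T,X,old_v) / ("COMMIT",T) / ("ABORT",T).

def undo_recover(disk, log):
    """Mutates disk in place; returns the set of tx's that must have
    <ABORT T> appended to the log (started but never finished)."""
    committed = {r[1] for r in log if r[0] == "COMMIT"}
    finished = committed | {r[1] for r in log if r[0] == "ABORT"}
    started = set()
    for rec in reversed(log):
        if rec[0] == "START":
            started.add(rec[1])
        elif rec[0] == "UPDATE":
            _, T, X, old = rec
            if T not in committed:
                disk[X] = old          # undo uncommitted change
    return started - finished

# Crash before <COMMIT T>, after both updates reached disk:
disk = {"A": 16, "B": 6}
log = [("START", "T"), ("UPDATE", "T", "A", 8), ("UPDATE", "T", "B", 5)]
aborted = undo_recover(disk, log)      # disk restored to A=8, B=5
```

After recovery the disk holds the original values, and T is reported as needing an <ABORT T> record, matching the "disk restored; <ABORT T> written" cases above.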


Checkpointing129/146

Previous view of recovery implied reading entire log file.

Since log file grows without bound, this is infeasible.

Eventually we can delete "old" section of log.

This point is called a checkpoint.


... Checkpointing130/146

Problem: many concurrent/overlapping transactions.

How to know that all have finished?

Simplistic approach

  1. stop accepting new transactions (block them)
  2. wait until all active tx's either commit or abort
    (and have written <COMMIT T> or <ABORT T> to the log)
  3. flush the log to disk
  4. write new log record <CHKPT>, and flush log again
  5. resume accepting new transactions
Known as quiescent checkpointing.


... Checkpointing131/146

Obvious problem with quiescent checkpointing: the DBMS is effectively unavailable to users while it waits for all active tx's to finish.

Better strategy: nonquiescent checkpointing
  1. write log record <CHKPT (T1,..,Tk)>
    (contains references to all active transactions, taken from the active tx table)
  2. continue normal processing (e.g. new tx's can start)
  3. when all of T1,..,Tk have completed,
    write log record <ENDCHKPT> and flush log


... Checkpointing132/146

Recovery: scan backwards through the log file, processing records as before.

Determining where to stop depends on which checkpoint record we meet first while scanning backwards.

If we encounter <ENDCHKPT> first: all incomplete tx's began after the matching <CHKPT (T1,..,Tk)>, so the scan can stop at that <CHKPT...> record (and the log before it can be deleted).

If we encounter <CHKPT (T1,..,Tk)> first: the system crashed during the checkpoint; the scan must continue back to the earliest <START Ti> among T1,..,Tk.
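
The stopping rule can be sketched in Python (a hedged illustration, using the same list-of-tuples log representation as before; `chkpt` records carry the list of then-active tx names):

```python
# Find how far back an UNDO-log recovery scan must go in a
# checkpointed log (oldest record first). Returns a log index.
def undo_scan_limit(log):
    for i in range(len(log) - 1, -1, -1):
        kind = log[i][0]
        if kind == "endchkpt":
            # checkpoint completed: stop at the matching <CHKPT (T1..Tk)>
            for j in range(i - 1, -1, -1):
                if log[j][0] == "chkpt":
                    return j
        elif kind == "chkpt":
            # crash occurred during the checkpoint: go back to the
            # earliest <START Ti> of the tx's listed in the record
            active = set(log[i][1])
            starts = [j for j, rec in enumerate(log[:i])
                      if rec[0] == "start" and rec[1] in active]
            return min(starts) if starts else i
    return 0   # no checkpoint at all: scan the whole log
```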


Redo Logging133/146

Problem with UNDO logging: all changed data must be output to disk before a tx can commit, which conflicts with efficient buffer management.

Alternative approach is redo logging: changed data may remain in buffers after commit, and be output to disk at some convenient later time.


... Redo Logging134/146

Requirement for redo logging: write-ahead rule.

Data must be written to disk as follows:

  1. start transaction log record
  2. update log records indicating changes
  3. the commit log record
  4. the changed data elements themselves
Note that update log records now contain <T,X,v'>,
where v' is the new value for X.


... Redo Logging135/146

For the example transaction, we would get:


t    Action        v  B(A)  B(B)  D(A)  D(B)  Log
--------------------------------------------------------
(0)  BEGIN         .    .     .     8     5   <START T>
(1)  READ(A,v)     8    8     .     8     5
(2)  v = v*2      16    8     .     8     5
(3)  WRITE(A,v)   16   16     .     8     5   <T,A,16>
(4)  READ(B,v)     5   16     5     8     5
(5)  v = v+1       6   16     5     8     5
(6)  WRITE(B,v)    6   16     6     8     5   <T,B,6>
(7)  COMMIT                                   <COMMIT T>
(8)  FlushLog
(9)  OUTPUT(A)     6   16     6    16     5
(10) OUTPUT(B)     6   16     6    16     6

Note that T is regarded as committed as soon as (8) completes.


... Redo Logging136/146

Simplified view of recovery using REDO logging:

determine committedTrans // e.g. from tx table
for each log record from oldest to most recent {
    switch (log record) {
    <COMMIT T> : ignore // know these already
    <ABORT T>  : add T to abortedTrans
    <START T>  : add T to startedTrans
    <T,X,v'>   : if (T in committedTrans)
                       { WRITE(X,v'); OUTPUT(X) }
                 else
                     // nothing to do, no change on disk
}   }
for each T in startedTrans {
    if (T in committedTrans) ignore
    else if (T in abortedTrans) ignore
    else write <ABORT T> to log
}
flush log
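
As with UNDO logging, this pseudocode translates into a runnable Python sketch (hedged; same assumed log representation, but update records now carry the new value):

```python
# Sketch of REDO-log recovery: update records are ("update", T, X, new_value).
def redo_recover(log, disk):
    committed = {rec[1] for rec in log if rec[0] == "commit"}
    aborted   = {rec[1] for rec in log if rec[0] == "abort"}
    started   = {rec[1] for rec in log if rec[0] == "start"}
    # scan from oldest to most recent, re-installing committed changes
    for rec in log:
        if rec[0] == "update" and rec[1] in committed:
            _, T, X, new = rec
            disk[X] = new                  # WRITE(X,v'); OUTPUT(X)
        # uncommitted updates: nothing to do, no change reached disk
    for T in started - committed - aborted:
        log.append(("abort", T))
    return disk
```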


... Redo Logging137/146

Data required for REDO logging checkpoints: which buffers in the pool are dirty, and which tx's made the changes they contain.


... Redo Logging138/146

Checkpoints in REDO logging use (as before): <CHKPT (T1,..,Tk)> and <ENDCHKPT> log records.

Steps in REDO log checkpointing
  1. write <CHKPT (T1,..,Tk)> to log and flush log
  2. output to disk all changed objects in the buffer pool
    which have not yet been written to disk and
    whose tx's had committed before <CHKPT...>
  3. write <ENDCHKPT> to log and flush log
Note that other tx's may continue between steps 2 and 3.


... Redo Logging139/146

Recovery with a checkpointed REDO log.

As for UNDO logging, there are two cases ...

Last checkpoint record before crash is <ENDCHKPT>: changes by tx's that committed before the matching <CHKPT (T1,..,Tk)> are already on disk; we need only redo committed tx's that are among T1,..,Tk or that started after the <CHKPT...>.

Last checkpoint record before crash is <CHKPT (T1,..,Tk)>: this checkpoint did not complete; recovery must work from the previous <ENDCHKPT> and its matching <CHKPT...> instead.


Undo/Redo Logging140/146

UNDO logging and REDO logging are incompatible in their rules for when data may be output to disk: UNDO requires data on disk before <COMMIT T> is logged; REDO requires <COMMIT T> to be logged before data reaches disk.

Undo/Redo logging combines aspects of both: update records <T,X,v,v'> hold both the old value v and the new value v', so data can be flushed either before or after commit.


... Undo/Redo Logging141/146

Requirement for undo/redo logging: write-ahead.

Data must be written to disk as follows:

  1. start transaction log record
  2. update log records indicating changes
  3. the changed data elements themselves
Note that we do not specify when the <COMMIT T> record is written; it may come before or after the changed data is output.


... Undo/Redo Logging142/146

For the example transaction, we might get:


t    Action        v  B(A)  B(B)  D(A)  D(B)  Log
--------------------------------------------------------
(0)  BEGIN         .    .     .     8     5   <START T>
(1)  READ(A,v)     8    8     .     8     5
(2)  v = v*2      16    8     .     8     5
(3)  WRITE(A,v)   16   16     .     8     5   <T,A,8,16>
(4)  READ(B,v)     5   16     5     8     5
(5)  v = v+1       6   16     5     8     5
(6)  WRITE(B,v)    6   16     6     8     5   <T,B,5,6>
(7)  FlushLog
(8)  StartCommit
(9)  OUTPUT(A)     6   16     6    16     5
(10)                                          <COMMIT T>
(11) OUTPUT(B)     6   16     6    16     6

Note that T is regarded as committed as soon as the <COMMIT T> record written at (10) reaches disk.


... Undo/Redo Logging143/146

Recovery using undo/redo logging:

  1. redo all committed tx's from earliest to latest
  2. undo all incomplete tx's from latest to earliest
Consider the effects of a system crash on the example:

Before (10) ... treat as incomplete; undo all changes

After (10) ... treat as complete; redo all changes
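
Both phases fit in one short Python sketch (hedged; same log representation as earlier sketches, with update records now carrying both values, as in <T,X,v,v'>):

```python
# Sketch of UNDO/REDO recovery: updates are ("update", T, X, old, new).
def undo_redo_recover(log, disk):
    committed = {rec[1] for rec in log if rec[0] == "commit"}
    # 1. redo all committed tx's, from earliest to latest
    for rec in log:
        if rec[0] == "update" and rec[1] in committed:
            disk[rec[2]] = rec[4]      # install new value v'
    # 2. undo all incomplete tx's, from latest to earliest
    for rec in reversed(log):
        if rec[0] == "update" and rec[1] not in committed:
            disk[rec[2]] = rec[3]      # restore old value v
    return disk
```

Because each update record carries both values, the same log supports both phases; neither the redo pass nor the undo pass needs to know when the data elements themselves reached disk.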


... Undo/Redo Logging144/146

Steps in UNDO/REDO log checkpointing

  1. write <CHKPT (T1,..,Tk)> to log and flush log
  2. output to disk all dirty memory buffers
  3. write <ENDCHKPT> to log and flush log
Note that other tx's may continue between steps 2 and 3.

A consequence of the above:

(If we allowed this, a buffer may contain changes from both committed and aborted tx's)


... Undo/Redo Logging145/146

The above description simplifies the details of undo/redo logging.

ARIES is a complete, practical algorithm for undo/redo logging.

Differences to what we have described: log records carry log sequence numbers (LSNs); each data page records the LSN of the last update applied to it; recovery runs in three passes (analysis, redo, undo), repeating history before undoing incomplete tx's.

(For more details consult any text or COMP9315 05s1 lecture notes)


Recovery in PostgreSQL146/146

PostgreSQL uses write-ahead, redo-style logging (its WAL).

However, it also uses multi-version concurrency control: updates do not overwrite old tuple versions, so explicit undo logging is largely unnecessary.

Transaction/logging code is distributed throughout backend source.

Core transaction code is in src/backend/access/transam.


Produced: 15 May 2016