COMP9315: Storage: Devices, Files, Pages, Tuples, Buffers, Catalogs


Storage Management


Storage Management2/234

Aims of storage management in DBMS:


... Storage Management3/234

The storage manager provides mechanisms for:


... Storage Management4/234

Examples of references (addresses) used in DBMSs:

Note that offsets may be indexes into mapping tables that give the real address.


... Storage Management5/234

Levels of DBMS related to storage management:

[Diagram:Pics/storage/dbmsarch-small.png]


... Storage Management6/234

Topics to be considered:

Each topic will be illustrated by its implementation in PostgreSQL.


Views of Data7/234

Users and top-level query evaluator see data as

[Diagram:Pics/storage/abstract-store-small.png]


... Views of Data8/234

Relational operators and access methods see data as

[Diagram:Pics/storage/phys-store-small.png]


... Views of Data9/234

File manager sees both DB objects and file store

[Diagram:Pics/storage/table-to-file-small.png]


... Views of Data10/234

Disk manager sees data as

[Diagram:Pics/storage/disk-store-small.png]

On typical modern systems, this level is handled by the operating system's filesystem.


Storage Manager Interface11/234

The storage manager provides the higher levels of the system with operations like the following.

Example: simple scan of a relation:

select name from Employee

is implemented as something like

DB db = openDatabase("myDB");
Rel r = openRel(db,"Employee");
Scan s = startScan(r);
Tuple t;
while ((t = nextTuple(s)) != NULL)
{
   char *name = getField(t,"name");
   printf("%s\n", name);
}


... Storage Manager Interface12/234

The above shows several kinds of operations/mappings:

The DBMS storage manager provides all of these, broken down across several modules.


Data Flow in Query Evaluation13/234

[Diagram:Pics/storage/query-ops-small.png]


... Data Flow in Query Evaluation14/234

Notes on typical implementation strategies:


Files in DBMSs15/234

Data sets can be viewed at several levels of abstraction in DBMSs.

Logical view: a file is a named collection of data items   (e.g. a table of tuples)

Abstract physical view: a file is a sequence of fixed-size data blocks.

Physical realisation: a collection of sectors scattered over ≥1 disks.

The abstraction used for managing this: PageId.


... Files in DBMSs16/234

[Diagram:Pics/storage/file-views-small.png]


... Files in DBMSs17/234

Two possibilities for DBMS disk managers to handle data:


File System Interface18/234

Most access to data on disks in DBMSs is via a file system.

Typical operations provided by the operating system:

fd = open(fileName,mode)
  // open a named file for reading/writing/appending
close(fd)
  // close an open file, via its descriptor 
nread = read(fd, buf, nbytes)
  // attempt to read data from file into buffer 
nwritten = write(fd, buf, nbytes)
  // attempt to write data from buffer to file
lseek(fd, offset, seek_type)
  // move file pointer to relative/absolute file offset
fsync(fd)
  // flush contents of file buffers to disk
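To make the interface concrete, here is a minimal sketch that combines the operations above to read a given block of a file. The file name, block size, and function names here are illustrative assumptions, not part of any real DBMS:

```c
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

#define BLKSIZE 1024

// Read block number 'blk' of an open file into 'buf':
// lseek to the block's byte offset, then read one block.
static int read_block(int fd, int blk, char *buf) {
    if (lseek(fd, (off_t)blk * BLKSIZE, SEEK_SET) < 0) return -1;
    return (read(fd, buf, BLKSIZE) == BLKSIZE) ? 0 : -1;
}

// Round-trip demo: write two blocks, flush, read the second back.
// Returns 0 on success; the file name is arbitrary.
int fs_demo(void) {
    char out[BLKSIZE], in[BLKSIZE];
    int fd = open("/tmp/fsdemo.dat", O_CREAT | O_TRUNC | O_RDWR, 0600);
    if (fd < 0) return -1;
    memset(out, 'A', BLKSIZE);
    if (write(fd, out, BLKSIZE) != BLKSIZE) return -1;  // block 0
    memset(out, 'B', BLKSIZE);
    if (write(fd, out, BLKSIZE) != BLKSIZE) return -1;  // block 1
    fsync(fd);                       // force contents to disk
    if (read_block(fd, 1, in) != 0) return -1;
    close(fd);
    unlink("/tmp/fsdemo.dat");
    return (in[0] == 'B' && in[BLKSIZE-1] == 'B') ? 0 : -1;
}
```

This is essentially what a disk manager's get_page/put_page reduce to when the DBMS sits on top of a filesystem.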


Storage Technology19/234

At this point in memory technology development:

New technologies may eventually change this picture entirely. But expect spinning-disk technology to dominate for at least 5 more years.


Computational Storage20/234

Characteristics of main memory (RAM):

Accessing memory:

load   reg,byte_address
store  reg,byte_address

Cache memory has similar characteristics to RAM, but is

Typical capacities: RAM (256MB..64GB), Cache (64MB..2GB)


Bulk Data Storage21/234

Requirements for bulk data storage:


... Bulk Data Storage22/234

Several kinds of bulk data storage technology currently exist:

Characteristics of bulk data storage technologies:


Magnetic Disks23/234

Classical/dominant bulk storage technology.

Characteristics:

Capacity increase over last decade:   4MB → 1GB → 1TB

Modest increase in speed; good reduction in cost.


Optical Disks24/234

Optical disks provide an alternative spinning-disk storage technology.

Several varieties: CD-ROM, CD-R, CD-RW, DVD-RW

Compared to magnetic disks, CDs have

More suited to write-once, read-many applications (static DBs).


Flash Memory25/234

Flash memory is a non-mechanical alternative to disk storage.

Compared to disks, flash memory has


... Flash Memory26/234

Properties of flash memory require specialised file system

Example: updating data in flash storage

Limitations on updating reduce potential DB applications. Overall, flash is not yet a serious contender as a DBMS substrate.


Disk Management


Disk Manager28/234

Aim:


... Disk Manager29/234

Basic disk management interface is simple:

void get_page(PageId p, Page buf)

void put_page(PageId p, Page buf)

PageId allocate_pages(int n)

void deallocate_pages(PageId p, int n)


Disk Technology30/234

Disk architecture:

[Diagram:Pics/storage/disk-small.png]


... Disk Technology31/234

Characteristics of disks:

Accessing disk:

read  block at address (p,t,s)
write block at address (p,t,s)


Disk Access Costs32/234

Access time includes:

Cost to write a block is similar to the cost of reading. But if we need to verify the data on disk, we must wait for an extra rotation to read it back and check it.


... Disk Access Costs33/234

Example disk #1 characteristics:

[Diagram:Pics/storage/sector-blocks-small.png]

Note that this analysis is simplified, because #bytes/track and #sectors/track vary between outer and inner tracks (same storage density, reduced track length).


... Disk Access Costs34/234

Time Tr to read one random block on disk #1:

Tr = seek + rotation + transfer

Minimum Tr = 0 + 0 + 0.13 = 0.13 ms

Maximum Tr = 50 + 16.7 + 0.13 = 66.8 ms

Average Tr = 25 + (16.7/2) + 0.13 = 33.5 ms
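The average above can be captured in a small helper, using the same model as the calculation: expected seek is half the maximum seek, expected rotational delay is half a revolution. (The result is 33.48 ms, rounded to 33.5 ms above.)

```c
#include <assert.h>
#include <math.h>

// Average time (ms) to read one random block:
// Tr = seek + rotation + transfer, with expected seek = max/2
// and expected rotational delay = half a revolution.
double avg_read_time(double max_seek, double rev_time, double transfer) {
    return max_seek / 2.0 + rev_time / 2.0 + transfer;
}
```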


... Disk Access Costs35/234

If operating system deals in 4KB blocks:

[Diagram:Pics/storage/4k-blocks-small.png]

Tr(4-blocks) = 25 + (16.7/2) + 4×0.13 + 3×0.01 = 33.9 ms

Tr(1-block) = 25 + (16.7/2) + 0.13 = 33.5 ms

Note that the cost of reading 4KB is comparable to reading 1KB.

Sequential access reduces average block read cost significantly, but
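The Tr(4-blocks) calculation above generalises to k contiguous blocks in one request: one expected seek, one expected rotational delay, k transfers, and (k-1) inter-block gap times. A sketch under those assumptions:

```c
#include <assert.h>
#include <math.h>

// Time (ms) to read k contiguous blocks in a single request:
// seek/2 + rotation/2 + k transfers + (k-1) inter-block gaps.
double contig_read_time(int k, double max_seek, double rev_time,
                        double transfer, double gap) {
    return max_seek/2.0 + rev_time/2.0 + k*transfer + (k-1)*gap;
}
```

With the disk #1 figures, reading 4 blocks costs 33.9 ms versus 33.5 ms for one block, as shown above.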


... Disk Access Costs36/234

Example disk #2 characteristics:

Addressing = 3 bits (surface) + 13 bits (cylinder) + 8 bits (sector)

If using 32-bit addresses, this leaves 8 bits (2^8 = 256 items/block).
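The addressing scheme for disk #2 can be sketched as bit-field packing into a 32-bit word (3 bits surface, 13 bits cylinder, 8 bits sector, 8 bits item-within-block):

```c
#include <assert.h>
#include <stdint.h>

// Pack/unpack a 32-bit record address for disk #2:
// | 3 bits surface | 13 bits cylinder | 8 bits sector | 8 bits item |
uint32_t mkAddr(uint32_t surf, uint32_t cyl, uint32_t sect, uint32_t item) {
    return (surf << 29) | (cyl << 16) | (sect << 8) | item;
}
uint32_t surfOf(uint32_t a) { return a >> 29; }
uint32_t cylOf(uint32_t a)  { return (a >> 16) & 0x1FFF; }
uint32_t sectOf(uint32_t a) { return (a >> 8) & 0xFF; }
uint32_t itemOf(uint32_t a) { return a & 0xFF; }
```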


Disk Characteristics37/234

Three important characteristics of disk subsystems:

Mean time to (complete) failure: 3-10 years.


... Disk Characteristics38/234

Increasing capacity:

Improving access time: Improving reliability:


Increasing Disk Capacity39/234

Compress data (e.g. LZ encoding)

+ more data fits on disk

- compression/expansion overhead

For large compressible data (e.g. text), significant savings.

For most relational data (e.g. int, char(8)), no significant saving.

For high-performance memory caching, may never want to expand
(there is current research working on "computable" compressed data formats).


Improving Disk Access Costs40/234

Approach #1: Use knowledge of data access patterns.

E.g. two records frequently accessed together
put them in the same block (clustering)

E.g. records scanned sequentially
place them in "staggered" blocks, double-buffer

Arranging data to match access patterns can improve throughput by 10-20 times.

 

Approach #2: Avoid reading blocks for each item access.

E.g. buffer blocks in memory, assume likely re-use


Scheduled Disk Access41/234

Low-level disk manager (driver, controller):

Example head movement scheduler: elevator algorithm
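A simplified sketch of elevator (SCAN) scheduling: the head keeps moving in its current direction while pending requests remain ahead of it, then reverses. Real controllers maintain this state internally; here the caller tracks head position and direction:

```c
#include <assert.h>
#include <stdbool.h>

// Pick the next cylinder request to service under the elevator
// algorithm. Serviced requests are marked -1. Returns the chosen
// cylinder, or -1 if no requests are pending.
int next_request(int head, bool *moving_up, int reqs[], int n) {
    int best = -1;
    for (int pass = 0; pass < 2; pass++) {
        for (int i = 0; i < n; i++) {
            if (reqs[i] < 0) continue;   // already serviced
            bool ahead = *moving_up ? reqs[i] >= head : reqs[i] <= head;
            if (!ahead) continue;
            if (best < 0 ||
                (*moving_up ? reqs[i] < reqs[best] : reqs[i] > reqs[best]))
                best = i;
        }
        if (best >= 0) break;
        *moving_up = !*moving_up;        // nothing ahead: reverse direction
    }
    if (best < 0) return -1;
    int cyl = reqs[best];
    reqs[best] = -1;                     // mark serviced
    return cyl;
}

// Demo: head at cylinder 50 moving up, requests {10,60,55,90}
// are serviced in the order 55, 60, 90, then (reversing) 10.
int elevator_demo(void) {
    int reqs[] = {10, 60, 55, 90};
    bool up = true;
    int head = 50, order[4];
    for (int i = 0; i < 4; i++) {
        order[i] = next_request(head, &up, reqs, 4);
        head = order[i];
    }
    return order[0]==55 && order[1]==60 && order[2]==90 && order[3]==10;
}
```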


Disk Layout42/234

If data sets are going to be frequently accessed in a pre-determined manner, arrange data on disk to minimise access time.

E.g. sequential scan

Older operating systems provided fine-grained control of disk layout.

Modern systems generally don't, because of programmer complexity.

Unix has raw disk partitions: no file system; you write a driver to manage the disk.


Improving Writes43/234

Nonvolatile write buffers

Log disk


Double Buffering44/234

Double-buffering exploits potential concurrency between disk and memory.

While reads/writes to disk are underway, other processing can be done.

[Diagram:Pics/storage/double-buffering-small.png]

With at least two buffers, can keep disk working full-time.


... Double Buffering45/234

Example: select sum(salary) from Employee

read A into buffer then process buffer content
read B into buffer then process buffer content
read C into buffer then process buffer content
...

Costs:

Typically,   Tp < Tr   (depends on kind of processing)


... Double Buffering46/234

Double-buffering approach:

read A into buffer1
process A in buffer1
  and concurrently read B into buffer2
process B in buffer2
  and concurrently read C into buffer1
...

Costs:

General observation: use of multiple buffers can lead to substantial cost savings.
We will see numerous examples where multiple memory buffers are exploited.


Multiple Disk Systems47/234

Various strategies can be employed to improve capacity, performance and reliability when multiple disks are available.

RAID (redundant arrays of independent disks) defines a standard set of such techniques.

Essentially, multiple disks allow

Capacity increases naturally by adding multiple disks

(although there is obviously a trade-off between increased capacity and increased reliability via redundancy)


RAID Level 048/234

Uses striping to partition data for one file over several disks

E.g. for n disks, block i in the file is written to disk (i mod n)

Example: file with 6 data blocks striped onto 4 disks using (pid mod 4)

[Diagram:Pics/storage/raid0-small.png]

Increases capacity, improves data transfer rates, reduces reliability.
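The striping rule above reduces to two small mappings: block i of the file lives on disk (i mod n), at block offset (i div n) within that disk:

```c
#include <assert.h>

// RAID-0 striping over n disks.
int stripe_disk(int i, int ndisks)   { return i % ndisks; }  // which disk
int stripe_offset(int i, int ndisks) { return i / ndisks; }  // block on that disk
```

E.g. for the 6-block file on 4 disks above, block 5 goes to disk 1 at offset 1.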


... RAID Level 049/234

The disk manager and RAID controller have to perform a mapping something like:

    writePage(PageId)

to

    disk = diskOf(PageId,ndisks)
    cyln = cylinderOf(PageId)
    plat = platterOf(PageId)
    sect = sectorOf(PageId)
    writeDiskPage(disk, cyln, plat, sect)

(We discuss later how the pid might be represented and mapped)


RAID Level 150/234

Uses mirroring (or shadowing) to store multiple copies of each block.

Since disks can be read/written in parallel, transfer cost unchanged.

Multiple copies allows for single-disk failure with no data loss.

Example: file with 4 data blocks mirrored on two 2-disk partitions

[Diagram:Pics/storage/raid1-small.png]

Reduces capacity, improves reliability, no effect on data transfer rates.


... RAID Level 151/234

The disk manager and RAID controller have to perform a mapping something like:

    writePage(PageId)

to

    n = ndisksInPartition
    disk = diskOf(PageId,n)
    cyln = cylinderOf(PageId)
    plat = platterOf(PageId)
    sect = sectorOf(PageId)
    writeDiskPage(disk, cyln, plat, sect)
    writeDiskPage(disk+n, cyln, plat, sect)


RAID levels 2-652/234

The higher levels of raid incorporate various combinations of:

The differences are primarily in:

RAID levels 2-5 can recover from failure in a single disk.

RAID level 6 can recover from simultaneous failures in two disks.


Disk Media Failure53/234

Rarely, a bit will be transferred to/from the disk incorrectly.

Error-correcting codes can check for and recover from this.

If recovery is not possible, the operation can simply be repeated.

If repeated reads/writes on the same block fail:


Database Objects54/234

DBMSs maintain various kinds of objects/information:

database: can be viewed as a super-object for all others
parameters: global configuration information
catalogue: meta-information describing database contents
tables: named collections of tuples
tuples: collections of typed field values
indexes: access methods for efficient searching
update logs: for handling rollback/recovery
procedures: active elements


... Database Objects55/234

The disk manager implements how DB objects are mapped to file system.

References to data objects typically reduce to e.g.

The disk manager needs to convert buffer access to


Single-file DBMS56/234

One possible storage organisation is a single file for the entire database.

All objects are allocated to regions (segments) of this file.

[Diagram:Pics/storage/single-file-small.png]


If an object grows too large for allocated segment, allocate an extension.

What happens to allocated space when objects are removed?


... Single-file DBMS57/234

Allocating space in Unix files is easy:

If the seek goes way beyond the end of the file, the intervening region is a "hole": no disk blocks are allocated until data is written there. Under these circumstances, a disk manager is easy to implement.


Single-file Disk Manager58/234

Simple disk manager for a single-file database:

// Disk Manager data/functions
#define PAGESIZE 2048   // bytes per page
typedef int PageId;     // PageId is block index
typedef struct DBrec {
   char *dbname;     // copy of database name
   int fd;           // the database file
   SpaceTable map;   // map of free/used areas 
   NameTable names;  // map names to areas + sizes
} *DB;

typedef struct Relrec {
   char *relname;    // copy of table name
   int   start;      // page index of start of table data
   int   npages;     // number of pages of table data
   ...
} *Rel;


... Single-file Disk Manager59/234

DB openDatabase(char *name) { 
   DB db = new(struct DBrec);
   db->dbname = strdup(name);
   db->fd = open(name,O_RDWR);
   db->map = readSpaceTable(db->fd);
   db->names = readNameTable(db->fd);
   return db;
}
// stop using DB and update all meta-data
void closeDatabase(DB db) {
   writeSpaceTable(db->fd,db->map);
   writeNameTable(db->fd,db->names);
   fsync(db->fd);
   close(db->fd);
   free(db);
}


... Single-file Disk Manager60/234

// set up struct describing relation
Rel openRelation(DB db, char *rname) {
   Rel r = new(struct Relrec);
   r->relname = strdup(rname);
   // get relation data from map tables
   r->start = ...;
   r->npages = ...;
   return r;
}

// stop using a relation
void closeRelation(Rel r) {
   free(r);
}

#define nPages(r)  (r->npages)
#define makePageId(r,i)  (r->start + i)


... Single-file Disk Manager61/234

// assume that Page = byte[PAGESIZE]
// assume that PageId = block number in file

// read page from file into memory buffer
void get_page(DB db, PageId p, Page buf) {
   lseek(db->fd, p*PAGESIZE, SEEK_SET);
   read(db->fd, buf, PAGESIZE);
}

// write page from memory buffer to file
void put_page(Db db, PageId p, Page buf) {
   lseek(db->fd, p*PAGESIZE, SEEK_SET);
   write(db->fd, buf, PAGESIZE);
}


... Single-file Disk Manager62/234

// managing contents of mapping table is complex
// assume a list of (offset,length,status) tuples

// allocate n new pages at end of file
PageId allocate_pages(DB db, int n) {
   int endfile = lseek(db->fd, 0, SEEK_END);
   addNewEntry(db->map, endfile, n);
   // note that file itself is not changed
   return endfile / PAGESIZE;
}
// drop n pages starting from p
void deallocate_pages(DB db, PageId p, int n) {
   markUnused(db->map, p, n);
   // note that file itself is not changed
}


Example: Scanning a Relation63/234

With the above disk manager, our example:

select name from Employee

might be implemented as something like

DB db = openDatabase("myDB");
Rel r = openRelation(db,"Employee");
int npages = nPages(r);
Page buffer = malloc(PAGESIZE*sizeof(char));
for (int i = 0; i < npages; i++) {
   PageId pid = makePageId(r,i);
   get_page(db, pid, buffer);
   foreach tuple in buffer {
      get tuple data and extract name
   }
}


Multiple-file Disk Manager64/234

Most DBMSs don't use a single large file for all data.

They typically provide:

Precise file structure varies between individual DBMSs.


... Multiple-file Disk Manager65/234

Structure of PageId for data pages in such systems ...

If system uses one file per table, PageId contains:

If system uses several files per table, PageId contains:


Oracle File Structures66/234

Oracle uses five different kinds of files:

data files: catalogue, tables, procedures
redo log files: update logs
alert log files: record system events
control files: configuration info
archive files: off-line collected updates


... Oracle File Structures67/234

There may be multiple instances of each kind of file:

Data files are


... Oracle File Structures68/234

Tablespaces are logical units of storage (cf directories).

Every database object resides in exactly one tablespace.

Units of storage within a tablespace:

data block: fixed-size unit of storage (cf. 2KB page)
extent: specific number of contiguous data blocks
segment: set of extents allocated to a single database object

Segments can span multiple data files; extents cannot.

Confusingly, tables are called datafiles internally in Oracle.


... Oracle File Structures69/234

Layout of data within Oracle file storage:

[Diagram:Pics/storage/oracle-files-small.png]


PostgreSQL Storage Manager70/234

PostgreSQL uses the following file organisation ...

[Diagram:Pics/storage/pg-file-arch-small.png]


... PostgreSQL Storage Manager71/234

Components of storage subsystem:

PostgreSQL has two basic kinds of files:

Note: smgr was designed for many kinds of storage device; only the magnetic disk handler is now used


Relations as Files72/234

PostgreSQL identifies relation files via their OIDs.

The core data structure for this is RelFileNode:

typedef struct RelFileNode
{
    Oid  spcNode;  // tablespace
    Oid  dbNode;   // database
    Oid  relNode;  // relation
} RelFileNode;

Global (shared) tables (e.g. pg_database) have


... Relations as Files73/234

The relpath function maps RelFileNode to file:


char *relpath(RelFileNode rnode)  // simplified
{
    char *path = malloc(ENOUGH_SPACE);

    if (rnode.spcNode == GLOBALTABLESPACE_OID) {
        /* Shared system relations live in PGDATA/global */
        Assert(rnode.dbNode == 0);
        sprintf(path, "%s/global/%u",
                DataDir, rnode.relNode);
    }
    else if (rnode.spcNode == DEFAULTTABLESPACE_OID) {
        /* The default tablespace is PGDATA/base */
        sprintf(path, "%s/base/%u/%u",
                DataDir, rnode.dbNode, rnode.relNode);
    }
    else {
        /* All other tablespaces accessed via symlinks */
        sprintf(path, "%s/pg_tblspc/%u/%u/%u", DataDir,
                rnode.spcNode, rnode.dbNode, rnode.relNode);
    }
    return path;
}


File Descriptor Pool74/234

Unix has limits on the number of concurrently open files.

PostgreSQL maintains a pool of open file descriptors:

File names are simply strings: typedef char *FileName

Open files are referenced via: typedef int File

A File is an index into a table of "virtual file descriptors".


... File Descriptor Pool75/234

Interface to file descriptor (pool):


File FileNameOpenFile(FileName fileName,
                      int fileFlags, int fileMode);
     // open a file in the database directory ($PGDATA/base/...)
File PathNameOpenFile(FileName fileName,
                      int fileFlags, int fileMode);
     // open a file given a full file path
File OpenTemporaryFile(bool interXact);
     // open temp file; flag: close at end of transaction?
void FileClose(File file);
void FileUnlink(File file);
int  FileRead(File file, char *buffer, int amount);
int  FileWrite(File file, char *buffer, int amount);
int  FileSync(File file);
long FileSeek(File file, long offset, int whence);
int  FileTruncate(File file, long offset);


... File Descriptor Pool76/234

Virtual file descriptors (Vfd)

VfdCache[0] holds list head/tail pointers.


... File Descriptor Pool77/234

Virtual file descriptor records (simplified):


typedef struct vfd
{
    s_short  fd;              // current FD, or VFD_CLOSED if none
    u_short  fdstate;         // bitflags for VFD's state
    File     nextFree;        // link to next free VFD, if in freelist
    File     lruMoreRecently; // doubly linked recency-of-use list
    File     lruLessRecently;
    long     seekPos;         // current logical file position
    char     *fileName;       // name of file, or NULL for unused VFD
    // NB: fileName is malloc'd, and must be free'd when closing the VFD
    int      fileFlags;       // open(2) flags for (re)opening the file
    int      fileMode;        // mode to pass to open(2)
} Vfd;


File Manager78/234

The "magnetic disk storage manager"

PostgreSQL PageId values are structured:

typedef struct
{
    RelFileNode rnode;    // which relation
    ForkNumber  forkNum;  // which fork
    BlockNumber blockNum; // which block 
} BufferTag;


... File Manager79/234

Access to a block of data proceeds as follows:

offset = BlockNumber * BLCKSZ
fileID = RelFileNode+ForkNumber
if (fileID is already in Vfd pool) {
    if (offset is in this file)
    	fd = use Vfd from pool
    else
        fd = allocate new Vfd for next part of file
} else {
    fd = allocate new Vfd for this file
}
seek to offset in fd
read/write data page (BLCKSZ bytes)

BLCKSZ is a global configurable constant (default: 8192).
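The offset calculation above can be sketched as follows. Large relations are split by PostgreSQL's md.c into 1GB segment files, so a BlockNumber maps to a (segment, byte offset) pair; the RELSEG_SIZE value here is the default (131072 blocks × 8KB = 1GB):

```c
#include <assert.h>
#include <stdint.h>

#define BLCKSZ      8192     // default PostgreSQL block size
#define RELSEG_SIZE 131072   // blocks per 1GB segment file

// Which segment file a block lives in, and its byte offset there.
uint32_t segmentOf(uint32_t blockNum) { return blockNum / RELSEG_SIZE; }
uint64_t offsetOf(uint32_t blockNum) {
    return (uint64_t)(blockNum % RELSEG_SIZE) * BLCKSZ;
}
```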


Buffer Pool


Buffer Manager81/234

Aim:

Buffer pool


... Buffer Manager82/234

Buffer pool interposed between access methods and disk manager

[Diagram:Pics/storage/dbmsarch-small.png]

Access methods and the page manager normally work via get_page() calls; they now work via calls to get_page_via_buffer_pool().


... Buffer Manager83/234

Basic buffer pool interface

Page request_page(PageId p);

void release_page(PageId p);
void mark_page(PageId p);
void flush_page(PageId p);
void hold_page(PageId p);

The buffer pool typically provides an interface to allocate_page and deallocate_page as well.


Buffer Pool84/234

[Diagram:Pics/storage/pool-small.png]


... Buffer Pool85/234

Buffer pool data structures:

For each frame, we need to know:


... Buffer Pool86/234

In subsequent discussion, we assume:


Requesting Pages87/234

Call from client: request_page(pid)

If page pid is already in buffer pool:

If page pid is not already in buffer pool:


... Requesting Pages88/234

Advantages:

Disadvantages:


Releasing Pages89/234

The release_page function indicates that a client has finished using a page.

If the page hasn't been modified, it is simply overwritten when replaced.

If the page has been modified, it must be written to disk before being replaced.

Possible problem: changes not immediately reflected on disk


... Releasing Pages90/234

Advantages:

Disadvantages: if a page remains in the pool over multiple transactions, its changes are not yet on disk and could be lost in a crash. (This is generally handled by some kind of logging mechanism, e.g. Oracle redo log files.)


Buffer Manager Example #191/234

Self join: an example where buffer pool achieves major efficiency gains.

Consider a query to find pairs of employees with the same birthday:

select e1.name, e2.name
from   Employee e1, Employee e2
where  e1.id < e2.id and e1.birthday = e2.birthday

This might be implemented inside the DBMS via nested loops:

for each tuple t1 in Employee e1 {
    for each tuple t2 in Employee e2 {
        if (t1.id < t2.id &&
                t1.birthday == t2.birthday)
            append (t1.name,t2.name) to result set
    }
}


... Buffer Manager Example #192/234

In terms of page-level operations, the algorithm looks like:

DB db = openDatabase("myDB");
Rel emp = openRel(db,"Employee");
int npages = nPages(emp);

for (int i = 0; i < npages; i++) {
    PageId pid1 = makePageId(emp,i);
    Page p1 = request_page(pid1);
    for (int j = 0; j < npages; j++) {
        PageId pid2 = makePageId(emp,j);
        Page p2 = request_page(pid2);
        // compare all pairs of tuples from p1,p2
        // construct solution set from matching pairs
        release_page(pid2);
    }
    release_page(pid1);
}


... Buffer Manager Example #193/234

Consider a buffer pool with 200 frames and a relation with b ≤ 200 pages:

Total number of page reads = b   (entire relation is read exactly once)


... Buffer Manager Example #194/234

Now consider a buffer pool with 2 frames (the minimum required for the join):

(continued ...)


... Buffer Manager Example #195/234

(... continued)

Total number of page reads = b * (b-1)

Cf. 200-frame buffer vs 2-frame buffer ... if b=100, 100 reads vs 9900 reads.
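The two cases analysed above can be summarised in one function: with enough frames the relation is read once (b reads); with only 2 frames the total is b(b-1) reads, i.e. 9900 for b=100:

```c
#include <assert.h>

// Page reads for the nested-loop self-join of a b-page relation,
// per the analysis above: b reads if the whole relation fits in
// the pool, b*(b-1) reads with the minimal 2-frame pool.
long selfjoin_reads(long b, long nbuf) {
    return (nbuf >= b) ? b : b * (b - 1);
}
```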


The request_page Operation96/234

Method:

  1. Check buffer pool to see if it already contains requested page.
    If not, the page is brought in as follows:
    1. Choose a frame for replacement, using replacement policy
    2. If frame chosen is dirty, write page to disk
    3. Read requested page into now-vacant buffer frame
    4. Set dirty=False and pinCount=0 for this frame
  2. Pin the frame containing requested page (i.e. update pin count).
  3. Return reference to frame (Page) containing requested page.


Other Buffer Operations97/234

The release_page operation:

Note: no effect on disk or buffer contents until replacement required.

The mark_page operation:

Note: doesn't actually write to disk; indicates that frame needs to be written if used for replacement;

The flush_page operation:

Note: not generally used by higher levels of DBMS; they rely on request/release protocol.


Page Replacement Policies98/234

Several schemes are commonly in use:

LRU works for VM because of working set model
(recent past accesses determines future accesses)

For DBMS, we can predict patterns of page access better
(from our knowledge of how the relational operations are implemented)


... Page Replacement Policies99/234

The cost benefits from a buffer pool (with n frames) is determined by:

Example (a): sequential scan, LRU or MRU, n ≥ b, no competition

First scan costs b reads; subsequent scans are "free".

Example (b): sequential scan, MRU, n < b, no competition

First scan costs b reads; subsequent scans cost b - n reads.

Example (c): sequential scan, LRU, n < b, no competition

All scans cost b reads; known as sequential flooding.
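The three cases (a)-(c) reduce to simple formulas for the total reads over repeated scans, sketched here (no competition assumed, as above):

```c
#include <assert.h>

// Total page reads for 'nscans' sequential scans of a b-page file
// through an n-frame buffer pool. policy: 0 = LRU, 1 = MRU.
long scan_reads(int policy, long n, long b, long nscans) {
    if (n >= b)                                   // case (a): all cached
        return b;
    if (policy == 1)                              // case (b): MRU
        return b + (nscans - 1) * (b - n);
    return nscans * b;                            // case (c): LRU flooding
}
```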


Page Access Times100/234

How to determine when a page in the buffer was last accessed?

Could simply use the time of the last request_page for that PageId.

But this doesn't reflect real accesses to page.

For more realism, could use last request_page or release_page time.

Or could introduce operations for examining and modifying pages in pool:


Buffer Manager Example #2101/234

Standard join: an example where replacement policy can have large impact.

Consider a query to find customers who are also employees:

select c.name
from   Customer c, Employee e
where  c.ssn = e.ssn;

This might be implemented inside the DBMS via nested loops:

for each tuple t1 in Customer {
    for each tuple t2 in Employee {
        if (t1.ssn == t2.ssn)
            append (t1.name) to result set
    }
}


... Buffer Manager Example #2102/234

Assume that:


... Buffer Manager Example #2103/234

Works well with MRU strategy:

Total page reads = bC + (n - 2) + bC × (bE - (n-2)) = 20 + 9 + 20×1 = 49

Note: assumes that both request_page and release_page set the last usage timestamp.


... Buffer Manager Example #2104/234

Works less well with LRU strategy:

Total page reads = bC + bC × bE = 20 + 20*10 = 220


PostgreSQL Buffer Manager105/234

PostgreSQL buffer manager:

Same code used by backends which need a local buffer pool.

Buffers are located in a large region of shared memory.

Functions:  src/backend/storage/buffer/*.c

Definitions:  src/include/storage/buf*.h


... PostgreSQL Buffer Manager106/234

Buffer pool consists of:


... PostgreSQL Buffer Manager107/234

[Diagram:Pics/storage/buffer-memory-small.png]


... PostgreSQL Buffer Manager108/234

Definitions related to buffer manager:

include/storage/buf.h

include/storage/bufmgr.h
include/storage/buf_internals.h

Code in:  backend/storage/buffer/


Buffer Pool Data Objects109/234

BufferDescriptors: array of structures describing buffers

Buffer: index into BufferDescriptors
BufMgrLock: global lock on buffer pool
BufferTag


... Buffer Pool Data Objects110/234

Buffer manager data types:

BufFlags: BM_DIRTY, BM_VALID, BM_TAG_VALID, BM_IO_IN_PROGRESS, ...

typedef struct buftag {
   RelFileNode rnode;     /* physical relation identifier */
   ForkNumber  forkNum;
   BlockNumber blockNum;  /* relative to start of reln */
} BufferTag;

typedef struct sbufdesc {   // simplified
   BufferTag   tag;         /* ID of page contained in buffer */
   BufFlags    flags;       /* see bit definitions above */
   uint16      usage_count; /* usage counter for clock sweep */
   unsigned    refcount;    /* # of backends holding pins */
   int         buf_id;      /* buffer's index number (from 0) */
   int         freeNext;    /* link in freelist chain */
   ...
} BufferDesc;


Buffer Pool Functions111/234

Buffer manager interface:

Buffer ReadBuffer(Relation r, BlockNumber n)

Actually a special case of ReadBuffer_Common, which also handles variations like different replacement strategy, forks, temp buffers, ...


... Buffer Pool Functions112/234

Buffer manager interface (cont):

void ReleaseBuffer(Buffer buf)

void MarkBufferDirty(Buffer buf)


... Buffer Pool Functions113/234

Additional buffer manager functions:

Page BufferGetPage(Buffer buf)

BufferIsPinned(Buffer buf)
CheckPointBuffers, etc.


... Buffer Pool Functions114/234

Important internal buffer manager function:

BufferDesc *BufferAlloc(
                        Relation r, ForkNumber f,
                        BlockNumber n, bool *found)


Clock-sweep Replacement Strategy115/234

PostgreSQL page replacement strategy: clock-sweep

For specialised kinds of access (e.g. sequential scan), can allocate a private "buffer ring" with different replacement strategy.
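A simplified sketch of clock-sweep victim selection, in the style of PostgreSQL's StrategyGetBuffer: sweep over the frames, skipping pinned ones and decrementing non-zero usage counts, until a frame with usage_count == 0 is found. (The real version takes locks and handles the all-pinned case; this sketch assumes at least one unpinned frame.)

```c
#include <assert.h>

#define NBUFFERS 4

static int usage[NBUFFERS];   // usage_count per frame
static int pinned[NBUFFERS];  // pin count per frame
static int hand = 0;          // clock hand position

// Choose a victim frame for replacement via clock-sweep.
int clock_sweep(void) {
    for (;;) {
        int f = hand;
        hand = (hand + 1) % NBUFFERS;
        if (pinned[f] > 0) continue;            // cannot evict pinned page
        if (usage[f] > 0) { usage[f]--; continue; }
        return f;                               // victim frame found
    }
}

// Demo: usage counts {1,0,2,1}, frame 1 pinned. The sweep decrements
// frames 0, 2, 3 on the first pass, then evicts frame 0.
int sweep_demo(void) {
    int u[] = {1, 0, 2, 1};
    for (int i = 0; i < NBUFFERS; i++) { usage[i] = u[i]; pinned[i] = 0; }
    pinned[1] = 1;
    hand = 0;
    return clock_sweep();
}
```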


Record/Tuple Management


Views of Data117/234

The disk and buffer manager provide the following view:

Database applications view data as:

Standard terminology: records are also called tuples, items, rows, ...


... Views of Data118/234

The abstract view of a relation:

The physical representation of a relation:


... Views of Data119/234

We use the following low-level abstractions:

RecPage

Record


... Views of Data120/234

We use the following high-level abstractions:

Relation

Tuple


Records vs Tuples121/234

A table is defined by a collection of attributes (schema), e.g.

create table Employee (
   id#  integer primary key,
   name varchar(20),   -- or char(20)
   job  varchar(10),   -- or char(10)
   dept number(4)
);

A tuple is a collection of attribute values for such a schema, e.g.

    (33357462, 'Neil Young', 'Musician', 0277)

A record is a sequence of bytes, containing data for one tuple.


Record Management122/234

Aim:

In other words, the record manager reconciles the two views of a block.

Assumptions (neither of which is essential):


Page-level Operations123/234

Operations to access records from a page ...

Record get_record(RecordId rid)

Record first_record()
Record next_record()


... Page-level Operations124/234

Operations to make changes to records in a page ...

void update_record(RecordId rid, Record rec)

RecordId insert_record(Record rec)
void delete_record(RecordId rid)


Tuple-level Operations125/234

Typ   getTypField(int fno)

Examples:   getIntField(1),   getStringField(2)

void   setTypField(int fno, Typ val)

Examples:   setIntField(1,42),   setStringField(2,"abc")

Also need operations to convert between Record and Tuple formats.


Relation-level Operations126/234

Tuple get_tuple(RecordId rid)

Tuple first_tuple()
Tuple next_tuple()

Plus operations to insert, delete and modify Tuples (analogous to Records)


Example Query127/234

Recall previous example of simple scan of a relation:

select name from Employee

implemented as:

DB db = openDatabase("myDB");
Rel r = openRel(db,"Employee");
Scan s = startScan(r);
Tuple t;
while ((t = nextTuple(s)) != NULL)
{
   char *name = getField(t,"name");
   printf("%s\n", name);
}


... Example Query128/234

Conceptually, the scanning implementation is simple:

// maintain "current" state of scan
struct ScanRec { Rel curRel; RecId curRec; };
typedef struct ScanRec *Scan;

Scan startScan(Rel r) {
   Scan s = malloc(sizeof(struct ScanRec));
   s->curRel = r;
   s->curRec = firstRecId(r);
   return s;
}

Tuple nextTuple(Scan s) {
   Tuple t = fetchTuple(s->curRec);
   s->curRec = nextRecId(s->curRel, s->curRec);
   return t;
}


... Example Query129/234

The real implementation relies on the buffer manager:

struct ScanRec {
   Rel curRel; PageId curPID; RecPage curPage;
};
typedef struct ScanRec *Scan;

Scan startScan(Rel r)
{
   Scan s = malloc(sizeof(struct ScanRec));
   s->curRel = r;
   s->curPID = firstPageId(r);
   Buffer page = request_page(s->curPID);
   s->curPage = start_page_scan(page);
   return s;
}


... Example Query130/234

And similarly the nextTuple() function:

Tuple nextTuple(Scan s)
{
   // if more records in the current page
   Tuple t = next_rec_in_page(s->curPage);
   if (t != NULL)
      return t;
   while (t == NULL) {   // current page finished
      release_page(s->curPID);   // release current page
      s->curPID = next_page_id(s->curRel, s->curPID);
      // ... and if no more pages, then finished
      if (s->curPID == NULL) return NULL;
      Buffer page = request_page(s->curPID);
      s->curPage = start_page_scan(page);
      t = next_rec_in_page(s->curPage);
   }
   return t;
}


Record Identifiers131/234

The implementation of RecordIDs is determined by the physical storage structure of the DBMS.

A RecordId always has at least two components:

If there are multiple files for a relation, then we also need a file number. (Or, more likely, use a PageId which combines both the file number and page number.)

Some DBMSs provide ROWIDs in SQL to permit efficient tuple access.

PostgreSQL provides a unique OID for every row in the database.


... Record Identifiers132/234

RecordID components are

E.g. with 4KB pages and 16 bits available for page addressing, a file can contain at most 2^16 = 65536 pages (256MB).
E.g. using indexes into a slot table to identify records within a page.


Example RecordId Structure133/234

Consider a DBMS like Oracle which uses a small number of large files.

Suitable RecordIds for such a system, using 32-bits, might be built as:

Example:

[Diagram:Pics/storage/RecordId-small.png]

(Note: however you partition the bits, you can address at most 4 billion records)


... Example RecordId Structure134/234

Consider a DBMS like MiniSQL, which uses one data file per relation.

One possibility is a variation on the Oracle approach:

Another possibility is (Under this scheme, there will be multiple records in the DB with the same rid)


Manipulating RecordIds135/234

Functions for constructing/interrogating RecordIds:

typedef unsigned int RecordId;

RecordId makeRecordId(int file, int page, int slot) {
   return (file << 28) | (page << 8) | (slot);
}
int fileNo(RecordId rid) { return (rid >> 28) & 0xF; }

int pageNo(RecordId rid) { return (rid >> 8) & 0xFFFFF; }

int slotNo(RecordId rid) { return rid & 0xFF; }


... Manipulating RecordIds136/234

Alternative implementation if details of file/page are hidden within PageId:

typedef unsigned int PageId; //only uses 24-bits
typedef unsigned int RecordId;

RecordId makeRecordId(PageId pid, int slot) {
    return (pid << 8) | (slot);
}

int pageId(RecordId rid) { return (rid >> 8) & 0xFFFFFF; }

int slotNo(RecordId rid) { return rid & 0xFF; }


Record Formats137/234

Records are stored within fixed-length pages.

Records may be fixed-length:

Records may be variable-length:


Fixed-length Records138/234

Encoding scheme for fixed-length records:

[Diagram:Pics/storage/fixed-length-small.png]

Since record format is frequently consulted at query time, it should be memory-resident.


... Fixed-length Records139/234

Advantages of fixed-length records:

Disadvantages of fixed-length records:

Note: if all records were close to the specified maximum size, this would be the most compact format.


... Fixed-length Records140/234

Handling attempts to insert values larger than available fields:

Alignment considerations (for numeric fields) may require:


Variable-length Records141/234

Some encoding schemes for variable-length records:


... Variable-length Records142/234

More encoding schemes for variable-length records:


... Variable-length Records143/234

Advantages of variable-length records:

Disadvantages of variable-length records:


Spanned Records144/234

How to handle record that does not fit into free space in page?

Two approaches:


... Spanned Records145/234

Advantages of spanned records:

Disadvantages of spanned records:

More common strategy than spanning:


Converting Records to Tuples146/234

A Record

The information on how to interpret the bytes comes from the schema (system catalog).

For variable-length records, further formatting information is stored in the record itself.


... Converting Records to Tuples147/234

DBMSs typically define a fixed set of field types for use in schema.

E.g.   DATE, FLOAT, INTEGER, NUMBER(n), VARCHAR(n), ...

This determines the primitive types to be handled in the implementation:

DATE time_t
FLOAT float,double
INTEGER int,long
NUMBER(n) int[]
VARCHAR(n) char[]


Defining Tuples148/234

To convert a Record to a Tuple we need to know:

This leads to two structs: FieldDesc and RelnDesc

typedef struct {
    short offset;  // index of starting byte
    short length;  // number of bytes
    Types type;    // reference to Type data
} FieldDesc;
typedef struct {
    char     *relname;  // relation name
    ushort    nfields;  // # of fields
    FieldDesc fields[]; // field descriptors
} RelnDesc;


... Defining Tuples149/234

For the example relation:

FieldDesc *fields = malloc(4*sizeof(FieldDesc));
fields[0] = (FieldDesc){0, 4, INTEGER};
fields[1] = (FieldDesc){4, 20, VARCHAR};
fields[2] = (FieldDesc){24, 10, CHAR};
fields[3] = (FieldDesc){34, 4, NUMBER};

This defines the schema


... Defining Tuples150/234

A Tuple can be defined as

typedef struct {
    Record    data;     // pointer to data
    ushort    nfields;  // # fields
    FieldDesc fields[]; // field descriptions
} Tuple;


... Defining Tuples151/234

A Tuple is produced from a Record in the context of a RelnDesc.

It is also necessary to know how the Record byte-string is structured.

Assume the following Record structure:

[Diagram:Pics/storage/rec1-small.png]

Assume also that lengths are 1-byte quantities   (no field longer than 255 bytes).


... Defining Tuples152/234

How the Record → Tuple mapping might occur:


Tuple *mkTuple(RelnDesc schema, Record record)
{
    int i, pos = 0;
    int size = sizeof(Tuple) +
               schema.nfields*sizeof(FieldDesc);
    Tuple *t = malloc(size);
    t->data = record;
    t->nfields = schema.nfields;
    for (i=0; i < schema.nfields; i++) {
        int len = record[pos++];
        t->fields[i].offset = pos;
        t->fields[i].length = len;
        // could add checking for over-length fields, etc.
        t->fields[i].type = schema.fields[i].type;
        pos += len;
    }
    return t;
}


PostgreSQL Tuples153/234

Definitions: src/include/access/*tup*.h

Functions: src/backend/access/common/*tup*.c

PostgreSQL defines tuples via:


... PostgreSQL Tuples154/234

Tuple structure:

[Diagram:Pics/storage/pg-tuple-struct-small.png]


... PostgreSQL Tuples155/234

Tuple-related data types:


// representation of a data value
// may be the actual value, or may be a pointer to it
typedef uintptr_t Datum;

The actual data value:


... PostgreSQL Tuples156/234

Tuple-related data types: (cont)


typedef struct HeapTupleFields  // simplified
{
    TransactionId   t_xmin;       // inserting xact ID
    TransactionId   t_xmax;       // deleting or locking xact ID
    CommandId       t_cid;        // inserting/deleting command ID, or both
} HeapTupleFields;
typedef struct HeapTupleHeaderData // simplified
{
    HeapTupleFields t_heap;
    ItemPointerData t_ctid;       // current TID of this or newer tuple
    uint16          t_infomask2;  // number of attributes + flags
    uint16          t_infomask;   // flags e.g. has_null, has_varwidth
    uint8           t_hoff;       // sizeof header incl. bitmap+padding
    // above is fixed size (23 bytes) for all heap tuples
    bits8           t_bits[1];    // bitmap of NULLs, variable length
    // actual data follows at end of struct
} HeapTupleHeaderData;


... PostgreSQL Tuples157/234


typedef struct tupleDesc
{
    int                natts;      // number of attributes in the tuple
    Form_pg_attribute *attrs;      // array of pointers to attr descriptors
    TupleConstr       *constr;     // constraints, or NULL if none
    Oid                tdtypeid;   // composite type ID for tuple type
    int32              tdtypmod;   // typmod for tuple type
    bool               tdhasoid;   // tuple has oid attribute in its header
    int                tdrefcount; // reference count, -1 if not counting
} *TupleDesc;


... PostgreSQL Tuples158/234

Operations on Tuples:


// create Tuple from values
HeapTuple
heap_form_tuple(TupleDesc tupDesc, Datum *values, bool *isnull)

// return Datum given Tuple, attr and descriptor
//   sets isnull to true if value is NULL
#define heap_getattr(tup, attnum, tupleDesc, isnull) ...

// returns true if attribute has no value
bool heap_attisnull(HeapTuple tup, int attnum) ...

// produce a modified tuple from an existing one
HeapTuple
heap_modify_tuple(HeapTuple tuple, TupleDesc tupleDesc,
                  Datum *replValues, bool *replIsnull,
                  bool *doReplace)


Page Formats159/234

Ultimately, a Page is simply an array of bytes (byte[]).

We want to interpret/manipulate it as a collection of Records.

Typical operations on Pages:


... Page Formats160/234

Factors affecting Page formats:

Implementation of Page operations critically depends on format.


... Page Formats161/234

For fixed-length records, use record slots.

Insertion: place new record in first available slot.

Deletion: two possibilities for handling free record slots:

[Diagram:Pics/storage/rec-slots-small.png]


... Page Formats162/234

Problem with packed format and no slot directory

Could add a slot directory to overcome this, but wastes space.

Problem with unpacked/bitmap format


... Page Formats163/234

For variable-length records, use slot directory.

Possibilities for handling free-space within block:

In practice, a combination is useful:


... Page Formats164/234

Compacted free space:  

[Diagram:Pics/storage/free-list-small.png]

Note: "pointers" are implemented as word offsets within block.


... Page Formats165/234

Fragmented free space:  

[Diagram:Pics/storage/free-list1-small.png]


Storage Utilisation166/234

How many records can fit in a page?    (How long is a piece of string?)

Depends on:   page size, (avg) record size, slot directory, ...

For a typical DBMS application


... Storage Utilisation167/234

Example of determining space utilisation ...

Assumptions:


... Storage Utilisation168/234

Max record size = 4(offsets) + 4 + 20 + 12 + 4 = 44 bytes

Minimum number of records = 1024/44 = 23   (assume all max size and no directory)

Average number of records = 1024/40 = 25   (assume no directory)

So, allow 32 directory slots (5-bit slot indexes), and 32 bytes for directory.

Number of records = Nr,   where   44 × Nr + 32+4 ≤ 1024

Aim to maximise Nr, so Nr = 22

Notes: because there are 32 slots, could have up to 32 (small) records


... Storage Utilisation169/234

If we switched to 8KB pages, then

Minimum number of records = 8192/44 = 186   (assume all max size and no directory)

So, allow 256 slots (8-bit slot indexes), and 352 bytes for directory (256*11bits)

Number of records = Nr,   where   44 × Nr + 352 ≤ 8192

Aim to maximise Nr, so Nr = 178

Could reduce size of directory to allow more records ... but only so far.

Note: 11-bit directory entries also means that it's costly to access them.


Overflows170/234

Sometimes, it may not be possible to insert a record into a page:

  1. no free-space fragment large enough
  2. overall free-space is not large enough
  3. the record is larger than the page
  4. no more free directory slots in page
The first case can initially be handled by compacting the free-space.

If there is still insufficient space, we have one of the other cases.


... Overflows171/234

How the other cases are handled depends on the file organisation:


... Overflows172/234

Overflow files for very large records and BLOBs:

[Diagram:Pics/storage/ovflow-file-small.png]


... Overflows173/234

Page-based handling of overflows:

[Diagram:Pics/storage/ovflow-page-small.png]

Useful for scan-all-records type operations.


... Overflows174/234

Record-based handling of overflows:

[Diagram:Pics/storage/ovflow-record-small.png]

Useful for locating specific record via rid.


PostgreSQL Page Representation175/234

Functions: src/backend/storage/page/*.c

Definitions: src/include/storage/bufpage.h

Each page is 8KB (default BLCKSZ) and contains:

Large data items are stored in separate (TOAST) files.


... PostgreSQL Page Representation176/234

PostgreSQL tuple page layout:

[Diagram:Pics/storage/pg-page-struct-small.png]


... PostgreSQL Page Representation177/234

Page-related data types:


// a Page is simply a pointer to start of buffer
typedef Pointer Page;

// indexes into the tuple directory
typedef uint16  LocationIndex;

// entries in tuple directory (line pointer array)
typedef struct ItemIdData
{
   unsigned   lp_off:15,    // tuple offset from start of page
              lp_flags:2,   // state of item pointer
              lp_len:15;    // byte length of tuple
} ItemIdData;


... PostgreSQL Page Representation178/234

Page-related data types: (cont)


typedef struct PageHeaderData
{
   ...
   uint16        pd_flags;    // flag bits (e.g. free, full, ...)
   LocationIndex pd_lower;    // offset to start of free space
   LocationIndex pd_upper;    // offset to end of free space
   LocationIndex pd_special;  // offset to start of special space
   uint16        pd_pagesize_version;
   ...
   ItemIdData    pd_linp[1];  // beginning of line pointer array
} PageHeaderData;

typedef PageHeaderData *PageHeader;


... PostgreSQL Page Representation179/234

Operations on Pages:

void PageInit(Page page, Size pageSize, ...)

OffsetNumber
 PageAddItem(Page page, Item item, Size size, ...)
void PageRepairFragmentation(Page page)


... PostgreSQL Page Representation180/234

PostgreSQL has two kinds of pages:

Both kinds of page have the same page layout.

One important difference:


Representing Database Objects


Database Objects182/234

RDBMSs manage different kinds of objects

Many objects have names (and, in PostgreSQL, all have OIDs).

How are the different types of objects represented?

How do we go from a name (or OID) to bytes stored on disk?


... Database Objects183/234

Top-level "objects" in typical SQL standard databases:

catalog ... SQL terminology for a database

schema ... collection of DB object definitions

tablespace ... collection of DB files

PostgreSQL also has cluster: a server managing a set of DBs.


... Database Objects184/234

Consider what information the RDBMS needs about relations:

Similarly for other DBMS objects (e.g. views, functions, triggers, ...)

All of this information is stored in the system catalog.

(The "system catalog" is also called "data dictionary" or "system view")

In most RDBMSs, the catalog itself is also stored as tables.


... Database Objects185/234

Standard for catalogs in SQL:2003: INFORMATION_SCHEMA.


Schemata(catalog_name, schema_name, schema_owner, ...)

Tables(table_catalog, table_schema, table_name, table_type, ...)

Columns(table_catalog, table_schema, table_name, column_name,
        ordinal_position, column_default, is_nullable, data_type, ...)

Views(table_catalog, table_schema, table_name, view_definition,
                       check_option, is_updatable, is_insertable_into)

Role_table_grants(grantor, grantee, privilege_type, is_grantable,
                         table_catalog, table_schema, table_name, ...)
etc. etc.

For complete details, see Section 30 of the PostgreSQL 8.0.3 documentation.


System Catalog186/234

Most DBMSs also have their own internal catalog structure.

Would typically contain information such as:


Users(id:int, name:string, ...)

Databases(id:int, name:string, owner:ref(User), ...)

Schemas(id:int, name:string, owner:ref(User), ...)

Types(id:int, name:string, defn:string, size:int, ...)

Tables(id:int, name:string, owner:ref(User),
                                inSchema:ref(Schema), ...)

Attributes(id, name:string, table:ref(Table),
                           type:ref(Type), pkey:bool, ...)
etc. etc.

Standard SQL INFORMATION_SCHEMA is provided as a set of views on these tables.


... System Catalog187/234

The catalog is manipulated by a range of SQL operations:

where Object is one of table, view, function, trigger, schema, ...

E.g. consider an SQL DDL operation such as:

create table ABC (
    x integer primary key,
    y integer
);


... System Catalog188/234

This would produce a set of catalog changes something like ...

userID := current_user();
schemaID := current_schema();
tabID := nextval('tab_id_seq');
select into intID id
from Types where name='integer';
insert into Tables(id,name,owner,inSchema,...)
  values (tabID, 'abc', userID, schema, ...)
attrID := nextval('attr_id_seq');
insert into Attributes(id,name,table,type,pkey,...)
    values (attrID, 'x', tabID, intID, true, ...)
attrID := nextval('attr_id_seq');
insert into Attributes(id,name,table,type,pkey,...)
    values (attrID, 'y', tabID, intID, false, ...)


... System Catalog189/234

In PostgreSQL, the system catalog is available to users via:

The low-level representation is available to sysadmins via:


PostgreSQL Catalog190/234

The \d? special commands in psql are just wrappers around queries on the low-level catalog tables, e.g.

\dt list information about tables
\dv list information about views
\df list information about functions
\dp list table access privileges
\dT list information about data types
\dd shows comments attached to DB objects


... PostgreSQL Catalog191/234

A PostgreSQL installation typically has several databases.

Some catalog information is global, e.g.

Other catalog information is local to each database, e.g.


... PostgreSQL Catalog192/234

Global installation data is recorded in shared tables

Each kind of DB object has table(s) to describe it, e.g.


... PostgreSQL Catalog193/234

PostgreSQL tuples contain

OIDs are used as primary keys in many of the catalog tables.


Representing Users/Groups194/234

In version 8, PostgreSQL merged notions of users/groups into roles.

Represented by two base tables:   pg_authid,   pg_auth_members

View pg_shadow gives a more symbolic view of pg_authid.

View pg_user gives a copy of pg_shadow with passwords "hidden".

CREATE|ALTER|DROP USER statements modify pg_authid table.

CREATE|ALTER|DROP GROUP statements modify pg_auth_members table.

Both tables are global (shared across all DBs in a cluster).


... Representing Users/Groups195/234

pg_authid table contains information about roles:

oid unique integer key for this role
rolname symbolic name for role (PostgreSQL identifier)
rolpassword plain or md5-encrypted password
rolcreatedb can create new databases
rolsuper is a superuser (owns server process)
rolcatupdate can update system catalogs

etc. etc.


... Representing Users/Groups196/234

pg_shadow view contains information about users:

usename symbolic user name (e.g. 'jas')
usesysid integer key to reference user (pg_authid.oid)
passwd plain or md5-encrypted password
usecreatedb can create new databases
usesuper is a superuser (owns server process)
usecatupd can update system catalogs

etc. etc.


... Representing Users/Groups197/234

pg_group view contains information about user groups:

groname group name (e.g. 'developers')
grosysid integer key to reference group
grolist[] array containing group members
(vector of refs to pg_authid.oid)

Note the use of multi-valued attribute (PostgreSQL extension)


Representing High-level Objects198/234

Above the level of individual DB schemata, we have:

These tables are global to each PostgreSQL cluster.

Keys are names (strings) and must be unique within cluster.


... Representing High-level Objects199/234

pg_database contains information about databases:

datname database name (e.g. 'mydb')
datdba database owner (refs pg_authid.oid)
datpath where files for database are stored
(if not in the PGDATA directory)
datacl[] access permissions
datistemplate can be used to clone new databases
(e.g. template0, template1)

etc. etc.


... Representing High-level Objects200/234

Digression: access control lists (acl)

PostgreSQL represents access via an array of access elements.

Each access element contains:

UserName=Privileges/Grantor
group GroupName=Privileges/Grantor

where Privileges is a string enumerating privileges, e.g.

jas=arwdRxt/jas,fred=r/jas,joe=rwad/jas


... Representing High-level Objects201/234

pg_namespace contains information about schemata:

nspname namespace name (e.g. 'public')
nspowner namespace owner (refs pg_authid.oid)
nspacl[] access permissions

Note that nspname is a key and must be unique across cluster.


... Representing High-level Objects202/234

pg_tablespace contains information about tablespaces:

spcname tablespace name (e.g. 'disk5')
spcowner tablespace owner (refs pg_authid.oid)
spclocation full filepath to tablespace directory
spcacl[] access permissions

Two pre-defined tablespaces:


Representing Tables203/234

Entries in multiple catalog tables are required for each user-level table.

Due to O-O heritage, base table for tables is called pg_class.

The pg_class table also handles other "table-like" objects:

pg_class also handles sequences, indexes, and other "special" objects.

Tuples in pg_class have an OID, used as primary key.


... Representing Tables204/234

pg_class contains information about tables:

relname name of table (e.g. employee)
relnamespace schema in which table defined
(refs pg_namespace.oid)
reltype data type corresponding to table
(refs pg_type.oid)
relowner owner (refs pg_authid.oid)
reltuples # tuples in table
relacl access permissions


... Representing Tables205/234

pg_class also holds various flags/counters for each table:

relkind what kind of object
'r' = ordinary table, 'i' = index, 'v' = view
'c' = composite type, 'S' = sequence, 's' = special
relnatts # attributes in table
(how many entries in pg_attribute table)
relchecks # of constraints on table
(how many entries in pg_constraint table)
relhasindex table has/had an index?
relhaspkey table has/had a primary key?

etc.


... Representing Tables206/234

pg_type contains information about data types:

typname name of type (e.g. 'integer')
typnamespace schema in which type defined
(refs pg_namespace.oid)
typowner owner (refs pg_authid.oid)
typtype what kind of data type
'b' = base type, 'c' = complex (row) type, ...

Note: a complex type is automatically created for each table
(defines "type" for each tuple in table; also, type for functions returning SETOF)


... Representing Tables207/234

pg_type also contains storage-related information:

typlen how much storage used for values
(-1 for variable-length types, e.g. text)
typalign memory alignment for values
('c' = byte-boundary, 'i' = 4-byte-boundary, ...)
typrelid table associated with complex type
(refs pg_class.oid)
typstorage where/how values are stored
('p' = in-tuple, 'e' = in external table, 'm'/'x' = compressed)

(We discuss more details of the pg_type table later ...)


... Representing Tables208/234

pg_attribute contains information about attributes:

attname name of attribute (e.g. 'empname')
attrelid table this attribute belongs to
(refs pg_class.oid)
attnum attribute position (1..n, sys attrs are -ve)
atttypid data type of this attribute
(refs pg_type.oid)

(attrelid,attnum) is unique, and used as primary key.


... Representing Tables209/234

pg_attribute also holds storage-related information:

attlen storage space required by attribute
(copy of pg_type.typlen for fixed-size values)
atttypmod storage space for var-length attributes
(e.g. 6+ATTR_HEADER_SIZE for char(6))
attalign memory-alignment info (copy of pg_type.typalign)
attndims number of dimensions if attr is an array


... Representing Tables210/234

pg_attribute also holds constraint/status information:

attnotnull attribute may not be null?
atthasdef attribute has a default values
(value is held in pg_attrdef table)
attisdropped attribute has been dropped from table

Also has notion of large data being stored in a separate table (so-called "TOAST" table).


... Representing Tables211/234

An SQL DDL statement like

create table MyTable (
   a int unique not null,
   b char(6)
);

will cause entries to be made in the following tables:


... Representing Tables212/234

The example leads to a series of database changes like


rel_oid := new_oid(); user_id := current_user();
insert into
   pg_class(oid,name,owner,kind,pages,tuples,...)
   values (rel_oid, 'mytable', user_id, 'r', 0, 0, ...)
select oid,typlen into int_oid,int_len
from   pg_type where typname = 'int';
insert into
   pg_attribute(relid,name,typid,num,len,typmod,notnull...)
   values (rel_oid, 'a', int_oid, 1, int_len, -1, true, ...)
select oid,typlen into char_oid,char_len
from   pg_type where typname = 'char';
insert into
   pg_attribute(relid,name,typid,num,len,typmod,notnull...)
   values (rel_oid, 'b', char_oid, 2, -1, 6+4, false, ...)
insert into
   pg_type(name,owner,len,type,relid,align,...)
   values ('mytable', user_id, 4, 'c', rel_oid, 'i', ...)


... Representing Tables213/234

pg_attrdef contains information about default values:

adrelid table that column belongs to
(refs pg_class.oid)
adnum which column in the table
(refs pg_attribute.attnum)
adsrc readable representation of default value
adbin internal representation of default value


... Representing Tables214/234

pg_constraint contains information about constraints:

conname name of constraint (not unique)
connamespace schema containing this constraint
contype kind of constraint
'c' = check, 'u' = unique,
'p' = primary key, 'f' = foreign key
conrelid which table (refs pg_class.oid)
conkey which attributes
(vector of values from pg_attribute.attnum)
consrc check constraint expression

(Names are automatically generated from context (fkey,check) if not supplied)


... Representing Tables215/234

For foreign-key constraints, pg_constraint also contains:

confrelid referenced table for foreign key
confkey key attributes in foreign table
conkey corresponding attributes in local table

Foreign keys also introduce triggers to perform checking.

For column-specific constraints:

consrc readable check constraint expression
conbin internal check constraint expression


... Representing Tables216/234

An SQL DDL statement like

create table MyOtherTable (
   x int check (x > 0),
   y int references MyTable(a),
   z int default -1
);

will cause similar entries as before in catalogs, plus


... Representing Tables217/234

The example leads to a series of database changes like


rel_oid := new_oid(); user_id := current_user();
insert into
   pg_class(oid,name,owner,kind,pages,tuples,...)
   values (rel_oid, 'myothertable', user_id, 'r', 0, 0, ...)
select oid,typlen into int_oid,int_len
from   pg_type where typname = 'int';
select oid into old_oid
from   pg_class where relname='mytable';
-- pg_attribute entries for attributes x=1, y=2, z=3
insert into
   pg_attrdef(relid,num,src,bin)
   values (rel_oid, 3, -1, {CONST :...})
insert into
   pg_constraint(type,relid,key,src,...)
   values ('c', rel_oid, {1}, '(x > 0)', ...)
insert into
   pg_constraint(type,relid,key,frelid,fkey,...)
   values ('f', rel_oid, {2}, old_oid, {1}, ...)


Representing Functions218/234

Stored procedures (functions) are defined as


create function power(int x, int y) returns int
as $$
declare  i int;  product int := 1;
begin
   for i in 1..y loop
      product := product * x;
   end loop;
   return product;
end;
$$ language plpgsql;

Stored procedures are represented in the catalog via


... Representing Functions219/234

pg_proc contains information about functions:

proname name of function (e.g. substr)
pronamespace schema in which function defined
(refs pg_namespace.oid)
proowner owner (refs pg_authid.oid)
proacl[] access permissions

etc.


... Representing Functions220/234

pg_proc also contains argument/usage information:

pronargs how many arguments
prorettype return type (refs pg_type.oid)
proargtypes[] argument types (ref pg_type.oid vector)
proretset returns set of values of prorettype
proisagg is function an aggregate?
proisstrict returns null if any arg is null
provolatile return value depends on side-effects?
('i' = immutable, 's'= stable, 'v' = volatile)


... Representing Functions221/234

pg_proc also contains implementation information:

prolang what language function written in
prosrc source code if interpreted (e.g. PLpgSQL)
probin additional info on how to invoke function
(interpretation is language-specific)


... Representing Functions222/234

Consider two alternative ways of defining a square (x²) function.


sq.c  int square_in_c(int x) { return x * x; }

create function square(int) returns int
as '/path/to/sq.o', 'square_in_c' language 'C';

or


create function square(int) returns int
as $$
begin
    return $1 * $1;
end;
$$ language plpgsql;


... Representing Functions223/234

The above leads to a series of database changes like


user_id := current_user();
select oid,typlen into int_oid,int_len
from   pg_type where typname = 'int';
insert into
   pg_proc(name,owner,rettype,nargs,argtypes,
           prosrc,probin...)
   values ('square', user_id, int_oid, 1, {int_oid},
           'square_in_c', '/path/to/sq.o', ...)
-- or
insert into
   pg_proc(name,owner,rettype,nargs,argtypes,
           prosrc,probin...)
   values ('square', user_id, int_oid, 1, {int_oid},
           'begin return $1 * $1; end;', '-', ...)


... Representing Functions224/234

Users can define their own aggregate functions (like max()).

Requires definition of three components:

This information is stored in the pg_aggregate catalog.

The aggregate's name is stored in the pg_proc catalog.


... Representing Functions225/234

Consider defining your own average() function

Need to define a new aggregate:

create aggregate average (
   basetype   = integer,
   sfunc      = int_avg_accum,
   stype      = int[],
   finalfunc  = int_avg_result,
   initcond   = '{0,0}'
);

and need to define functions to support aggregate ...


... Representing Functions226/234

create function
       int_avg_accum(state int[], int) returns int[]
as $$
declare res int[2];
begin
   res[1] := state[1] + $2; res[2] := state[2] + 1;
   return res;
end;
$$ language plpgsql;

create function
       int_avg_result(state int[]) returns int
as $$
begin
   if (state[2] = 0) then return null; end if;
   return (state[1] / state[2]);
end;
$$ language plpgsql;


... Representing Functions227/234

Users can define their own operators to use in expressions.

Operators are syntactic sugar for unary/binary functions.

Consider defining an operator for the power(x,y) function:

create operator ** (
   procedure = power, leftarg = int, rightarg = int
);

-- which can be used as
select 4 ** 3;
-- giving a result of 64

Operator definitions are stored in pg_operator catalog.


Representing Types228/234

Users can also define new data types, which includes

Consider defining a 3-dimensional point type for spatial data:

create type point3d (
   input = point3d_in,    -- function to parse values
   output = point3d_out,  -- function to display values
   internallength = 24,   -- space for three float8's
   alignment = double     -- align tuples properly
);


... Representing Types229/234

pg_type additional fields for user-defined types:

typinput text input conversion function
typoutput text output conversion function
typreceive binary input conversion function
typsend binary output conversion function

All attributes are references to pg_proc.oid


... Representing Types230/234

All data types need access methods for querying.

The following catalog tables are involved in this:


... Representing Types231/234

pg_am holds information about access methods:

amname name of access method (e.g. btree)
amowner owner (refs pg_authid.oid)
amorderstrategy operator for determining sort order
(0 if unsorted)
amcanunique does AM support unique indexes?
ammulticol does AM support multicolumn indexes?
amindexnulls does AM support NULL index entries?
amconcurrent does AM support concurrent updates?


... Representing Types232/234

pg_am also contains links to access functions:

amgettuple "next valid tuple" function
ambeginscan "start new scan" function
amrescan "restart this scan" function
amendscan "end this scan" function
amcostestimate estimate cost of index scan

All attributes are references to pg_proc.oid

Functions drive the query evaluation process.


... Representing Types233/234

pg_am also contains links to update functions:

aminsert "insert this tuple" function
ambuild "build new index" function
ambulkdelete bulk delete function
amvacuumcleanup post-vacuum cleanup function

All attributes are references to pg_proc.oid

Functions implement different aspects of updating data/index files.


... Representing Types234/234

Built-in access methods:

Some access methods introduce additional files (e.g. B-tree)


Produced: 29 Feb 2016