COMP9315: Storage: Devices, Files, Pages, Tuples, Buffers, Catalogs
Storage Management |
Storage Management | 2/234 |
Aims of storage management in DBMS:
... Storage Management | 3/234 |
The storage manager provides mechanisms for:
DB
Rel
Page
Tuple
PageId
TupleId
... Storage Management | 4/234 |
Examples of references (addresses) used in DBMSs:
PageID
PageID = FileID + Offset
Offset
TupleID
TupleID = PageID + Offset
Offset
Offset
... Storage Management | 5/234 |
Levels of DBMS related to storage management:
... Storage Management | 6/234 |
Topics to be considered:
Views of Data | 7/234 |
Users and top-level query evaluator see data as
... Views of Data | 8/234 |
Relational operators and access methods see data as
... Views of Data | 9/234 |
File manager sees both DB objects and file store
... Views of Data | 10/234 |
Disk manager sees data as
On typical modern databases, handled by operating system filesystem.
Storage Manager Interface | 11/234 |
The storage manager provides higher levels of system
select name from Employee
is implemented as something like
DB db = openDatabase("myDB"); Rel r = openRel(db,"Employee"); Scan s = startScan(r); Tuple t; while ((t = nextTuple(s)) != NULL) { char *name = getField(t,"name"); printf("%s\n", name); }
... Storage Manager Interface | 12/234 |
The above shows several kinds of operations/mappings:
Data Flow in Query Evaluation | 13/234 |
... Data Flow in Query Evaluation | 14/234 |
Notes on typical implementation strategies:
int
PageId = (FileNum<24)||PageNum
get_page(r,i,buf)
get_page(pid,buf)
DB
Rel
structs
struct RelRec { int fd; int npages; int blksize; } typedef struct RelRec *Rel;
Files in DBMSs | 15/234 |
Data sets can be viewed at several levels of abstraction in DBMSs.
Logical view: a file is a named collection of data items (e.g. a table of tuples)
Abstract physical view: a file is a sequence of fixed-size data blocks.
Physical realisation: a collection of sectors scattered over ≥1 disks.
The abstraction used for managing this: PageId
... Files in DBMSs | 16/234 |
... Files in DBMSs | 17/234 |
Two possibilities for DBMS disk managers to handle data:
File System Interface | 18/234 |
Most access to data on disks in DBMSs is via a file system.
Typical operations provided by the operating system:
fd = open(fileName,mode)// open a named file for reading/writing/appending close(fd)// close an open file, via its descriptor nread = read(fd, buf, nbytes)// attempt to read data from file into buffer nwritten = write(fd, buf, nbytes)// attempt to write data from buffer to file lseek(fd, offset, seek_type)// move file pointer to relative/absolute file offset fsync(fd)// flush contents of file buffers to disk
Storage Technology | 19/234 |
At this point in memory technology development:
Computational Storage | 20/234 |
Characteristics of main memory (RAM):
load reg,byte_address store reg,byte_address
Cache memory has similar characteristics to RAM, but is
Bulk Data Storage | 21/234 |
Requirements for bulk data storage:
... Bulk Data Storage | 22/234 |
Several kinds of bulk data storage technology currently exist:
Magnetic Disks | 23/234 |
Classical/dominant bulk storage technology.
Characteristics:
Modest increase in speed; good reduction in cost.
Optical Disks | 24/234 |
Optical disks provides an alternative spinning disk storage technology.
Several varieties: CD-ROM, CD-R, CD-RW, DVD-RW
Compared to magnetic disks, CD's have
Flash Memory | 25/234 |
Flash memory is a non-mechanical alternative to disk storage.
Compared to disks, flash memory has
... Flash Memory | 26/234 |
Properties of flash memory require specialised file system
Example: updating data in flash storage
Disk Management |
Disk Manager | 28/234 |
Aim:
... Disk Manager | 29/234 |
Basic disk management interface is simple:
void get_page(PageId p, Page buf)
PageId
Page
void put_page(PageId p, Page buf)
Page
PageId
PageId allocate_pages(int n)
n
void deallocate_page(PageId p, int n)
n
PageId
Disk Technology | 30/234 |
Disk architecture:
... Disk Technology | 31/234 |
Characteristics of disks:
read block at address (p,t,s) write block at address (p,t,s)
Disk Access Costs | 32/234 |
Access time includes:
... Disk Access Costs | 33/234 |
Example disk #1 characteristics:
... Disk Access Costs | 34/234 |
Time Tr to read one random block on disk #1:
Minimum Tr = 0 + 0 + 0.13 = 0.13 ms
Maximum Tr = 50 + 16.7 + 0.13 = 66.8 ms
Average Tr = 25 + (16.7/2) + 0.13 = 33.5 ms
... Disk Access Costs | 35/234 |
If operating system deals in 4KB blocks:
Tr(4-blocks) = 25 + (16.7/2) + 4×0.13 + 3×0.01 = 33.9 ms
Tr(1-block) = 25 + (16.7/2) + 0.13 = 33.5 ms
Note that the cost of reading 4KB is comparable to reading 1KB.
Sequential access reduces average block read cost significantly, but
... Disk Access Costs | 36/234 |
Example disk #2 characteristics:
If using 32-bit addresses, this leaves 8 bits (28=256 items/block).
Disk Characteristics | 37/234 |
Three important characteristics of disk subsystems:
... Disk Characteristics | 38/234 |
Increasing capacity:
Increasing Disk Capacity | 39/234 |
Compress data (e.g. LZ encoding)
+ more data fits on disk
- compression/expansion overhead
For large compressible data (e.g. text
For most relational data (e.g. int
char(8)
For high-performance memory caching, may never want to expand
(there is current research working on "computable" compressed data formats).
Improving Disk Access Costs | 40/234 |
Approach #1: Use knowledge of data access patterns.
E.g. two records frequently accessed together
⇒ put them in the same block (clustering)
E.g. records scanned sequentially
⇒ place them in "staggered" blocks, double-buffer
Arranging data to match access patterns can improve throughput by 10-20 times.
Approach #2: Avoid reading blocks for each item access.
E.g. buffer blocks in memory, assume likely re-use
Scheduled Disk Access | 41/234 |
Low-level disk manager (driver, controller):
Disk Layout | 42/234 |
If data sets are going to be frequently accessed in a pre-determined manner, arrange data on disk to minimise access time.
E.g. sequential scan
Modern systems generally don't, because of programmer complexity.
Unix has raw disk partitions: no file system, you write driver to manage disk.
Improving Writes | 43/234 |
Nonvolatile write buffers
Double Buffering | 44/234 |
Double-buffering exploits potential concurrency between disk and memory.
While reads/writes to disk are underway, other processing can be done.
With at least two buffers, can keep disk working full-time.
... Double Buffering | 45/234 |
Example: select sum(salary) from Employee
read A into buffer then process buffer content read B into buffer then process buffer content read C into buffer then process buffer content ...
Costs:
... Double Buffering | 46/234 |
Double-buffering approach:
read A into buffer1 process A in buffer1 and concurrently read B into buffer2 process B in buffer2 and concurrently read C into buffer1 ...
Costs:
General observation: use of multiple buffers can lead to substantial
cost savings.
We will see numerous examples where multiple memory buffers
are exploited.
Multiple Disk Systems | 47/234 |
Various strategies can be employed to improve capacity, performance and reliability when multiple disks are available.
RAID (redundant arrays on independent disks) defines a standard set of such techniques.
Essentially, multiple disks allow
(although there is obviously a trade-off between increased capacity and increased reliability via redundancy)
RAID Level 0 | 48/234 |
Uses striping to partition data for one file over several disks
E.g. for n disks, block i in the file is written to disk (i mod n)
Example: file with 6 data blocks striped onto 4 disks using (pid mod 4)
Increases capacity, improves data transfer rates, reduces reliability.
... RAID Level 0 | 49/234 |
The disk manager and RAID controller have to perform a mapping something like:
writePage(PageId)
to
disk = diskOf(PageId,ndisks) cyln = cylinderOf(PageId) plat = platterOf(PageId) sect = sectorOf(PageId) writeDiskPage(disk, cyln, plat, sect)
(We discuss later how the pid
RAID Level 1 | 50/234 |
Uses mirroring (or shadowing) to store multiple copies of each block.
Since disks can be read/written in parallel, transfer cost unchanged.
Multiple copies allows for single-disk failure with no data loss.
Example: file with 4 data blocks mirrored on two 2-disk partitions
Reduces capacity, improves reliability, no effect on data transfer rates.
... RAID Level 1 | 51/234 |
The disk manager and RAID controller have to perform a mapping something like:
writePage(PageId)
to
n = ndisksInPartition disk = diskOf(PageId,n) cyln = cylinderOf(PageId) plat = platterOf(PageId) sect = sectorOf(PageId) writeDiskPage(disk, cyln, plat, sect) writeDiskPage(disk+n, cyln, plat, sect)
RAID levels 2-6 | 52/234 |
The higher levels of raid incorporate various combinations of:
The differences are primarily in:
RAID level 6 can recover from smultaneous failures in two disks.
Disk Media Failure | 53/234 |
Rarely, a bit will be transferred to/from the disk incorrectly.
Error-correcting codes can check for and recover from this.
If recovery is not possible, the operation can simply be repeated.
If repeated reads/writes on the same block fail:
Database Objects | 54/234 |
DBMSs maintain various kinds of objects/information:
can be viewed as an super-object for all others | ||
global configuration information | ||
meta-information describing database contents | ||
named collections of tuples | ||
collections of typed field values | ||
access methods for efficient searching | ||
for handling rollback/recovery | ||
active elements |
... Database Objects | 55/234 |
The disk manager implements how DB objects are mapped to file system.
References to data objects typically reduce to e.g.
Offset
PageId
RecordID
PageId+Offset
Single-file DBMS | 56/234 |
One possible storage organisation is a single file for the entire database.
All objects are allocated to regions of this file.
Objects are allocated to regions (segments) of the file.
If an object grows too large for allocated segment, allocate an extension.
What happens to allocated space when objects are removed?
... Single-file DBMS | 57/234 |
Allocating space in Unix files is easy:
Single-file Disk Manager | 58/234 |
Simple disk manager for a single-file database:
// Disk Manager data/functions #define PAGESIZE 2048// bytes per page typedef int PageId;// PageId is block index typedef struct DBrec { char *dbname;// copy of database name int fd;// the database file SpaceTable map;// map of free/used areas NameTable names;// map names to areas + sizes } *DB; typedef struct Relrec { char *relname;// copy of table name int start;// page index of start of table data int npages;// number of pages of table data ... } * Rel;
... Single-file Disk Manager | 59/234 |
DB openDatabase(char *name) { DB db = new(struct DBrec); db->dbname = strdup(name); db->fd = open(name,O_RDWR); db->map = readSpaceTable(DBfd); db->names = readNameTable(DBfd); return db; }// stop using DB and update all meta-data void closeDatabase(DB db) { writeSpaceTable(db->fd,db->map); writeNameTable(db->fd,db->map); fsync(db->fd); close(db->fd); free(db); }
... Single-file Disk Manager | 60/234 |
// set up struct describing relation Rel openRelation(DB db, char *rname) { Rel r = new(struct Relrec); r->relname = strdup(rname);// get relation data from map tables r->start = ...; r->npages = ...; return r; }// stop using a relation void closeRelation(Rel r) { free(r); } #define nPages(r) (r->npages) #define makePageId(r,i) (r->first + i)
... Single-file Disk Manager | 61/234 |
// assume that Page = byte[PageSize] // assume that PageId = block number in file // read page from file into memory buffer void get_page(DB db, PageId p, Page buf) { lseek(db->fd, p*PAGESIZE, SEEK_SET); read(db->fd, buf, PAGESIZE); }// write page from memory buffer to file void put_page(Db db, PageId p, Page buf) { lseek(db->fd, p*PAGESIZE, SEEK_SET); write(db->fd, buf, PAGESIZE); }
... Single-file Disk Manager | 62/234 |
// managing contents of mapping table is complex // assume a list of (offset,length,status) tuples // allocate n new pages at end of file PageId allocate_pages(int n) { int endfile = lseek(db->fd, 0, SEEK_END); addNewEntry(db->map, endfile, n);// note that file itself is not changed }// drop n pages starting from p void deallocate_pages(PageId p, int n) { markUnused(db->map, p, n);// note that file itself is not changed }
Example: Scanning a Relation | 63/234 |
With the above disk manager, our example:
select name from Employee
might be implemented as something like
DB db = openDatabase("myDB"); Rel r = openRelation(db,"Employee"); int npages = nPages(r); Page buffer = malloc(PAGESIZE*sizeof(char)); for (int i = 0; i < npages; i++) { PageId pid = makePageId(r,i); get_page(db, pid, buffer); foreach tuple in buffer { get tuple data and extract name } }
Multiple-file Disk Manager | 64/234 |
Most DBMSs don't use a single large file for all data.
They typically provide:
... Multiple-file Disk Manager | 65/234 |
Structure of PageId
If system uses one file per table, PageId
PageId
Oracle File Structures | 66/234 |
Oracle uses five different kinds of files:
catalogue, tables, procedures | ||
update logs | ||
record system events | ||
configuration info | ||
off-line collected updates |
... Oracle File Structures | 67/234 |
There may be multiple instances of each kind of file:
SYSTEM
... Oracle File Structures | 68/234 |
Tablespaces are logical units of storage (cf directories).
Every database object resides in exactly one tablespace.
Units of storage within a tablespace:
fixed size unit of storage (cf 2KB page) | ||
specific number of contiguous data blocks | ||
set of extents allocated to a single database object |
Segments can span multiple data files; extents cannot.
To be confusing, tables are called datafiles internally in Oracle.
... Oracle File Structures | 69/234 |
Layout of data within Oracle file storage:
PostgreSQL Storage Manager | 70/234 |
PostgreSQL uses the following file organisation ...
... PostgreSQL Storage Manager | 71/234 |
Components of storage subsystem:
RelFileNode
storage/smgr
storage/smgr/md.c
storage/file
smgr
Relations as Files | 72/234 |
PostgreSQL identifies relation files via their OIDs.
The core data structure for this is RelFileNode
typedef struct RelFileNode { Oid spcNode;// tablespace Oid dbNode;// database Oid relNode;// relation } RelFileNode;
Global (shared) tables (e.g. pg_database
spcNode == GLOBALTABLESPACE_OID
dbNode == 0
... Relations as Files | 73/234 |
The relpath
RelFileNode
char *relpath(RelFileNode rnode)// simplified { char *path = malloc(ENOUGH_SPACE); if (rnode.spcNode == GLOBALTABLESPACE_OID) {/* Shared system relations live in PGDATA/global */ Assert(rnode.dbNode == 0); sprintf(path, "%s/global/%u", DataDir, rnode.relNode); } else if (rnode.spcNode == DEFAULTTABLESPACE_OID) {/* The default tablespace is PGDATA/base */ sprintf(path, "%s/base/%u/%u", DataDir, rnode.dbNode, rnode.relNode); } else {/* All other tablespaces accessed via symlinks */ sprintf(path, "%s/pg_tblspc/%u/%u/%u", DataDir rnode.spcNode, rnode.dbNode, rnode.relNode); } return path; }
File Descriptor Pool | 74/234 |
Unix has limits on the number of concurrently open files.
PostgreSQL maintains a pool of open file descriptors:
open()
typedef char *FileName
Open files are referenced via: typedef int File
A File
... File Descriptor Pool | 75/234 |
Interface to file descriptor (pool):
File FileNameOpenFile(FileName fileName, int fileFlags, int fileMode);// open a file in the database directory ($PGDATA/base/...) File PathNameOpenFile(FileName fileName, int fileFlags, int fileMode);// open a file given a full file path File OpenTemporaryFile(bool interXact);// open temp file; flag: close at end of transaction? void FileClose(File file); void FileUnlink(File file); int FileRead(File file, char *buffer, int amount); int FileWrite(File file, char *buffer, int amount); int FileSync(File file); long FileSeek(File file, long offset, int whence); int FileTruncate(File file, long offset);
... File Descriptor Pool | 76/234 |
Virtual file descriptors (Vfd
VfdCache[0]
... File Descriptor Pool | 77/234 |
Virtual file descriptor records (simplified):
typedef struct vfd { s_short fd;// current FD, or VFD_CLOSED if none u_short fdstate;// bitflags for VFD's state File nextFree;// link to next free VFD, if in freelist File lruMoreRecently;// doubly linked recency-of-use list File lruLessRecently; long seekPos;// current logical file position char *fileName;// name of file, or NULL for unused VFD // NB: fileName is malloc'd, and must be free'd when closing the VFD int fileFlags;// open(2) flags for (re)opening the file int fileMode;// mode to pass to open(2) } Vfd;
File Manager | 78/234 |
The "magnetic disk storage manager"
PageId
PageId
typedef struct { RelFileNode rnode;// which relation ForkNumber forkNum;// which fork BlockNumber blockNum;// which block } BufferTag;
... File Manager | 79/234 |
Access to a block of data proceeds as follows:
offset = BlockNumber * BLCKSZ fileID = RelFileNode+ForkNumber if (fileID is already in Vfd pool) { if (offset is in this file) fd = use Vfd from pool else fd = allocate new Vfd for next part of file } else { fd = allocate new Vfd for this file } seek to offset in fd read/write data page (BLCKSZ bytes)
BLCKSZ
Buffer Pool |
Buffer Manager | 81/234 |
Aim:
Buffer pool
... Buffer Manager | 82/234 |
Buffer pool interposed between access methods and disk manager
Access methods/page manager normally work via get_page()
now work via calls to get_page_via_buffer_pool()
... Buffer Manager | 83/234 |
Basic buffer pool interface
Page request_page(PageId p);
p
void release_page(PageId p);
p
void mark_page(PageId p);
p
void flush_page(PageId p);
p
void hold_page(PageId p);
p
allocate_page
deallocate_page
Buffer Pool | 84/234 |
... Buffer Pool | 85/234 |
Buffer pool data structures:
... Buffer Pool | 86/234 |
In subsequent discussion, we assume:
Requesting Pages | 87/234 |
Call from client: request_page(pid)
If page pid
If page pid
... Requesting Pages | 88/234 |
Advantages:
Releasing Pages | 89/234 |
The release_page
If the page has been modified, must be written to disk before replaced.
Possible problem: changes not immediately reflected on disk
... Releasing Pages | 90/234 |
Advantages:
Buffer Manager Example #1 | 91/234 |
Self join: an example where buffer pool achieves major efficiency gains.
Consider a query to find pairs of employees with the same birthday:
select e1.name, e2.name from Employee e1, Employee e2 where e1.id < e2.id and e1.birthday = e2.birthday
This might be implemented inside the DBMS via nested loops:
for each tuple t1 in Employee e1 { for each tuple t2 in Employee e2 { if (t1.id < t2.id && t1.birthday == t2.birthday) append (t1.name,t2.name) to result set } }
... Buffer Manager Example #1 | 92/234 |
In terms of page-level operations, the algorithm looks like:
DB db = openDatabase("myDB"); Rel emp = openRel(db,"Employee"); int npages = nPages(emp); for (int i = 0; i < npages; i++) { PageId pid1 = makePageId(emp,i); Page p1 = request_page(pid1); for (int j = 0; j < npages; i++) { PageId pid2 = makePageId(emp,j); Page p2 = request_page(pid2);// compare all pairs of tuples from p1,p2 // construct solution set from matching pairs release_page(pid2); } release_page(pid1); }
... Buffer Manager Example #1 | 93/234 |
Consider a buffer pool with 200 frames and a relation with b ≤ 200 pages:
p1
p2
p2
Employee
... Buffer Manager Example #1 | 94/234 |
Now consider a buffer pool with 2 frames (the minimum required for the join):
p1
p2
p2
p2
... Buffer Manager Example #1 | 95/234 |
(... continued)
p2
p2
p1
p2
p1
Cf. 200-frame buffer vs 2-frame buffer ... if b=100, 100 reads vs 10000 reads.
The request_page | 96/234 |
Method:
dirty=False
pinCount=0
Page
Other Buffer Operations | 97/234 |
The release_page
The
The
Several schemes are commonly in use:
For DBMS, we can predict patterns of page access better
The cost benefits from a buffer pool (with n frames) is determined by:
First scan costs b reads; subsequent scans are ``free''.
Example (b): sequential scan, MRU, n < b, no competition
First scan costs b reads; subsequent scans cost b - n reads.
Example (c): sequential scan, LRU, n < b, no competition
All scans cost b reads; known as sequential flooding.
How to determine when a page in the buffer was last accessed?
Could simply use the time of the last
But this doesn't reflect real accesses to page.
For more realism, could use last
Or could introduce operations for examining and modifying pages in pool:
Standard join: an example where replacement policy can have large impact.
Consider a query to find customers who are also employees:
This might be implemented inside the DBMS via nested loops:
Assume that:
Works well with MRU strategy:
Note: assumes that both
Works less well with LRU strategy:
PostgreSQL buffer manager:
Buffers are located in a large region of shared memory.
Functions:
Definitions:
Buffer pool consists of:
Definitions related to buffer manager:
Buffer manager data types:
Buffer manager interface:
Buffer manager interface (cont):
Additional buffer manager functions:
Important internal buffer manager function:
PostgreSQL page replacement strategy: clock-sweep
The disk and buffer manager provide the following view:
The abstract view of a relation:
We use the following low-level abstractions:
We use the following high-level abstractions:
A table is defined by a collection of attributes (schema), e.g.
A tuple is a collection of attribute values for such a schema, e.g.
A record is a sequence of bytes, containing data for one tuple.
Aim:
Operations to access records from a page ...
Operations to make changes to records in a page ...
Typ
Also need operations to convert between
Recall previous example of simple scan of a relation:
implemented as:
Conceptually, the scanning implementation is simple:
The real implementation relies on the buffer manager:
And similarly the
The implementation of
A
Some DBMSs provide
PostgreSQL provides a unique OID for every row in the database.
Consider a DBMS like Oracle which uses a small number of large files.
Suitable
(Note: however you partition the bits, you can address at most 4 billion records)
Consider a DBMS like MiniSQL, which uses one data file per relation.
One possibility is a variation on the Oracle approach:
Functions for constructing/interrogating
Alternative implementation if details of file/page are hidden within
Records are stored within fixed-length pages.
Records may be fixed-length:
Encoding scheme for fixed-length records:
Since record format is frequently consulted at query time, it should be memory-resident.
Advantages of fixed-length records:
Handling attempts to insert values larger than available fields:
Some encoding schemes for variable-length records:
More encoding schemes for variable-length records:
Advantages of variable-length records:
How to handle record that does not fit into free space in page?
Two approaches:
Advantages of spanned records:
A
DBMSs typically define a fixed set of field types for use in schema.
E.g.
This determines the primitive types to be handled in the implementation:
To convert a
For the example relation:
This defines the schema
A
A
It also necessary to know how the
Assume the following
Assume also that lengths are 1-byte quantities
(no field longer than 256-bytes).
How the
Definitions:
Functions:
PostgreSQL defines tuples via:
Tuple structure:
Tuple-related data types:
The actual data value:
Tuple-related data types: (cont)
Operations on Tuples:
Ultimately, a
We want to interpret/manipulate it as a collection of
Typical operations on
Factors affecting
For fixed-length records, use record slots.
Insertion: place new record in first available slot.
Deletion: two possibilities for handling free record slots:
Problem with packed format and no slot directory
Problem with unpacked/bitmap format
For variable-length records, use slot directory.
Possibilities for handling free-space within block:
Compacted free space:
Note: "pointers" are implemented as word offsets within block.
Fragmented free space:
How many records can fit in a page?
(How long is a piece of string?)
Depends on: page size, (avg) record size, slot directory, ...
For a typical DBMS application
Example of determining space utilisation ...
Assumptions:
Max record size = 4(offsets) + 4 + 20 + 12 + 4 = 44 bytes
Minimum number of records = 1024/44 = 23
(assume all max size and no directory)
Average number of records = 1024/40 = 25
(assume no directory)
So, allow 32 directory slots (5-bit slot indexes), and 32 bytes for directory.
Number of records = Nr, where 44 × Nr + 32+4 ≤ 1024
Aim to maximise Nr, so Nr = 22
Notes: because there are 32 slots, could have up to 32 (small) records
If we switched to 8KB pages, then
So, allow 256 slots (8-bit slot indexes), and 352 bytes for directory (256*11bits)
Number of records = Nr, where 44 × Nr + 352 ≤ 8192
Aim to maximise Nr, so Nr = 178
Could reduce size of directory to allow more records ... but only so far.
Note: 11-bit directory entries also means that it's costly to access them.
Sometimes, it may not be possible to insert a record into a page:
If there is still insufficient space, we have one of the other cases.
How the other cases are handled depends on the file organisation:
Overflow files for very large records and BLOBs:
Page-based handling of overflows:
Useful for scan-all-records type operations.
Record-based handling of overflows:
Useful for locating specific record via rid.
Functions:
Definitions:
Each page is 8KB (default
PostgreSQL tuple page layout:
Page-related data types:
Page-related data types: (cont)
Operations on
PostgreSQL has two kinds of pages:
One important difference:
RDBMSs manage different kinds of objects
How are the different types of objects represented?
How do we go from a name (or OID) to bytes stored on disk?
Top-level "objects" in typical SQL standard databases:
catalog ... SQL terminology for a database
Consider what information the RDBMS needs about relations:
All of this information is stored in the system catalog.
(The "system catalog" is also called "data dictionary" or "system view")
In most RDBMSs, the catalog itself is also stored as tables.
Standard for catalogs in SQL:2003:
For complete details, see Section 30 of the PostgreSQL 8.0.3 documentation.
Most DBMSs also have their own internal catalog structure.
Would typically contain information such as:
Standard SQL
The catalog is manipulated by a range of SQL operations:
E.g. consider an SQL DDL operation such as:
This would produce a set of catalog changes something like ...
In PostgreSQL, the system catalog is available to users via:
The
A PostgreSQL installation typically has several databases.
Some catalog information is global, e.g.
Global installation data is recorded in shared tables
PostgreSQL tuples contain
In version 8, PostgreSQL merged notions of users/groups into roles.
Represented by two base tables:
View
View
Both tables are global (shared across all DBs in a cluster).
etc. etc.
etc. etc.
Note the use of multi-valued attribute (PostgreSQL extension)
Above the level of individual DB schemata, we have:
Keys are names (strings) and must be unique within cluster.
etc. etc.
Digression: access control lists (
PostgreSQL represents access via an array of access elements.
Each access element contains:
where Privileges is a string enumerating privileges, e.g.
Note that
Two pre-defined tablespaces:
Entries in multiple catalog tables are required for each user-level table.
Due to O-O heritage, base table for tables is called
The
Tuples in
etc.
Note: a complex type is automatically created for each table
(We discuss more details of the
Also has notion of large data being stored in a separate table
(so-called "TOAST" table).
An SQL DDL statement like
will cause entries to be made in the following tables:
The example leads to a series of database changes like
(Names are automatically generated from context (
For foreign-key constraints,
Foreign keys also introduce triggers to perform checking.
For column-specific constraints:
An SQL DDL statement like
will cause similar entries as before in catalogs, plus
The example leads to a series of database changes like
Stored procedures (functions) are defined as
Stored procedures are represented in the catalog via
etc.
Consider two alternative ways of defining a x2 function.
or
The above leads to a series of database changes like
Users can define their own aggregate functions (like
Requires definition of three components:
The aggregate's name is stored in the
Consider defining your own
Need to define a new aggregate:
and need to define functions to support aggregate ...
Users can define their own operators to use in expressions.
Operators are syntactic sugar for unary/binary functions.
Consider defining an operator for the
Operator definitions are stored in
Users can also define new data types, which includes
All attributes are references to
All data types need access methods for querying.
The following catalogs tables are involved in this:
All attributes are references to
Functions drive the query evaluation process.
All attributes are references to
Functions implement different aspects of updating data/index files.
Built-in access methods:
Produced: 29 Feb 2016
mark_page
operation:
Note: doesn't actually write to disk; indicates that frame needs to be written if used for replacement;
flush_page
Note: not generally used by higher levels of DBMS; they rely on request/release protocol.
write_page
Page Replacement Policies 98/234
LRU works for VM because of working set model
(recent past accesses determines future accesses)
(from our knowledge of how the relational operations are implemented)
... Page Replacement Policies 99/234
Example (a): sequential scan, LRU or MRU, n ≥ b, no competition
Page Access Times 100/234 request_page
PageId
request_page
release_page
examine_page(PageId, TupleId)
modify_page(PageId, TupleId, Tuple)
Buffer Manager Example #2 101/234
select c.name
from Customer c, Employee e
where c.ssn = e.ssn;
for each tuple t1 in Customer {
for each tuple t2 in Employee {
if (t1.ssn == t2.ssn)
append (t1.name) to result set
}
}
... Buffer Manager Example #2 102/234
Customer
Employee
... Buffer Manager Example #2 103/234
Total page reads = bC + (n - 2) + bC × (bE - (n-2)) = 20 + 9 + 20*2 = 189
Customer
Employee
Customer
Employee
request_page
release_page
... Buffer Manager Example #2 104/234
Total page reads = bC + bC × bE = 20 + 20*10 = 220
Customer
Employee
Employee
Customer
Employee
PostgreSQL Buffer Manager 105/234
Same code used by backends which need a local buffer pool.
src/backend/storage/buffer/*.c
src/include/storage/buf*.h
... PostgreSQL Buffer Manager 106/234
Nbuffers
BufferDesc
Nbuffers
Buffer
BufferDesc
Buffer
shared_buffers = 16MB
... PostgreSQL Buffer Manager 107/234
... PostgreSQL Buffer Manager 108/234 include/storage/buf.h
Buffer
include/storage/bufmgr.h
(i.e. the functions that other parts of the system call to user buffer manager)
include/storage/buf_internals.h
Code in: BufferDesc
backend/storage/buffer/
Buffer Pool Data Objects 109/234 BufferDescriptors
Buffer
BufferDescriptors
BufMgrLock
BufferTag
... Buffer Pool Data Objects 110/234
Buffer Pool Functions 111/234 Buffer ReadBuffer(Relation r, BlockNumber n)
Actually a special case of r
(may need to remove an existing unpinned page and read data from file)
Buffer
ForkNumber
ReadBuffer_Common
... Buffer Pool Functions 112/234 void ReleaseBuffer(Buffer buf)
ensures all activity on buffer is completed before returning
void MarkBufferDirty(Buffer buf)
... Buffer Pool Functions 113/234 Page BufferGetPage(Buffer buf)
BufferIsPinned(Buffer buf)
CheckPointBuffers
etc. etc. etc.
... Buffer Pool Functions 114/234 BufferDesc *BufferAlloc(
Relation r, ForkNumber f,
BlockNumber n, bool *found)
ReadBuffer
ReadBuffer
Clock-sweep Replacement Strategy 115/234
For specialised kinds of access (e.g. sequential scan),
can allocate a private "buffer ring"
with different replacement strategy.
NextVictimBuffer
usage_count
NextVictimBuffer
Record/Tuple Management
Views of Data 117/234
Database applications view data as:
PageId
Standard terminology: records are also called tuples, items, rows, ...
RecordId
RID
... Views of Data 118/234
The physical representation of a relation:
... Views of Data 119/234 RecPage
byte[]
Record
... Views of Data 120/234 Relation
Tuple
Records vs Tuples 121/234
create table Employee (
id# integer primary key,
name varchar(20), -- or char(20)
job varchar(10), -- or char(10)
dept number(4)
);
(33357462, 'Neil Young', 'Musician', 0277)
Record Management 122/234
In other words, the record manager reconciles the views of a block:
Tuple
Record
RecordId
Tuple
RecPage
Assumptions (neither of which are essential):
Page-level Operations 123/234 Record get_record(RecordId rid)
rid
Record
Record first_record()
Record
Record next_record()
Record
null
... Page-level Operations 124/234 void update_record(RecordId rid, Record rec)
rid
rec
RecordId insert_record(Record rec)
rid
void delete_record(RecordId rid)
rid
Tuple-level Operations 125/234 get
Field(int fno)
Examples: fno
Tuple
getIntField(1)
getStringField(2)
void
set
Field(int fno,
val)
Examples: fno
Tuple
val
setIntField(1,42)
setStringField(2,"abc")
Record
Tuple
Relation-level Operations 126/234 Tuple get_tuple(RecordId rid)
rid
Tuple
Tuple first_tuple()
Tuple
Tuple next_tuple()
Plus operations to insert, delete and modify Tuple
null
Tuples
Tuples
Records
Example Query 127/234
select name from Employee
DB db = openDatabase("myDB");
Rel r = openRel(db,"Employee");
Scan s = startScan(r);
Tuple t;
while ((t = nextTuple(s)) != NULL)
{
char *name = getField(t,"name");
printf("%s\n", name);
}
... Example Query 128/234
... Example Query 129/234
struct ScanRec {
Rel curRel; PageId curPID; RecPage curPage;
};
typedef struct ScanRec *Scan;
Scan startScan(Rel r)
{
Scan s = malloc(sizeof(struct ScanRec));
s->curPID = firstPageId(r);
Buffer page = request_page(s->curPage);
s->curPage = start_page_scan(page);
return s;
}
... Example Query 130/234 nextTuple()
Tuple nextTuple(Scan s)
{
Record Identifiers 131/234 RecordID
RecordId
If multiple files for a relation, then also need:
(Or, more likely, use a PageId
ROWID
... Record Identifiers 132/234 RecordID
E.g. with 4KB pages and 16 bits available for page addressing
E.g. using indexes into a slot table to identify records within a page
(page addresses are all of the form 0x0000, 0x1000, 0x2000, 0x3000, ...
RecordId
Example RecordId
133/234 RecordId
Example:
... Example RecordId
134/234
Another possibility is
(Under this scheme, there will be multiple records in the DB with the same RecordId
rid
Manipulating RecordId
135/234 RecordId
typedef unsigned int RecordId;
RecordId makeRecordId(int file, int page, int slot) {
return (file << 28) | (page << 8) | (slot);
}
int fileNo(RecordId rid) { return (rid >> 28) & 0xF; }
int pageNo(RecordId rid) { return (rid >> 8) & 0xFFFFF; }
int slotNo(RecordId rid) { return rid & 0xFF; }
... Manipulating RecordId
136/234 PageId
typedef unsigned int PageId;
Record Formats 137/234
Records may be variable-length:
Fixed-length Records 138/234
... Fixed-length Records 139/234
Disadvantages of fixed-length records:
Note: if all records were close to specified maximum size,
this would be the most compact format.
... Fixed-length Records 140/234
Alignment considerations (for numeric fields) may require:
RecordId
varchar
Variable-length Records 141/234
... Variable-length Records 142/234
<employee>
<id#>33357462</id#> <dept>0277</dept>
<name>Neil Young</name>
<job>Musician</job>
</employee>
Tuple
Record
... Variable-length Records 143/234
Disadvantages of variable-length records:
Spanned Records 144/234
... Spanned Records 145/234
Disadvantages of spanned records:
More common strategy than spanning:
Converting Records to Tuples 146/234 Record
The information on how to interpret the bytes
byte[]
Tuple
For variable-length records, further formatting information is stored in the record itself.
... Converting Records to Tuples 147/234 DATE
FLOAT
INTEGER
NUMBER(
)
VARCHAR(
)
DATE
time_t
FLOAT
float,double
INTEGER
int,long
NUMBER(
)
int[]
VARCHAR(
)
char[]
Defining Tuple
148/234 Record
Tuple
This leads to two structs: FieldDesc
RelnDesc
typedef struct {
short offset;
... Defining Tuple
149/234
FieldDesc fields[] = malloc(4*sizeof(FieldDesc);
fields[0] = FieldDesc(0,4,INTEGER);
fields[1] = FieldDesc(4,20,VARCHAR);
fields[2] = FieldDesc(24,10,CHAR);
fields[3] = FieldDesc(34,4,NUMBER);
... Defining Tuple
150/234 Tuple
Record
typedef struct {
Record data;
... Defining Tuple
151/234 Tuple
Record
RelnDesc
Record
Record
... Defining Tuple
152/234 Record
Tuple
Tuple mkTuple(RelnDesc schema, Record record)
{
int i, pos = 0;
int size = sizeof(Tuple) +
(nfields-1)*sizeof(FieldDesc);
Tuple *t = malloc(size);
t->data = record;
t->nfields = schema.nfields;
for (i=0; i < schema.nfields; i++) {
int len = record[pos++];
t->fields[i].offset = pos;
t->fields[i].length = len;
PostgreSQL Tuples 153/234 src/include/access/*tup*.h
src/backend/access/common/*tup*.c
Datum
... PostgreSQL Tuples 154/234
... PostgreSQL Tuples 155/234
Datum
int
... PostgreSQL Tuples 156/234
typedef struct HeapTupleFields
... PostgreSQL Tuples 157/234
typedef struct tupleDesc
{
int natts;
... PostgreSQL Tuples 158/234
Page Formats 159/234 Page
byte[]
Record
Page
get(rid)
TupleId
first()
Page
next()
Page
insert(rec)
Page
update(rid,rec)
delete(rid)
Page
... Page Formats 160/234 Page
Implementation of Page
Page
Page
Page
Page
... Page Formats 161/234
... Page Formats 162/234
Could add a slot directory to overcome this, but wastes space.
(e.g. 4KB page requires 12-bit offset (10-bit if word-aligned), 256 slots requires 8-bit index)
... Page Formats 163/234
In practice, a combination is useful:
... Page Formats 164/234
... Page Formats 165/234
Storage Utilisation 166/234
... Storage Utilisation 167/234
(integer,varchar(20),char(10),number(4))
char(10)
PageId
... Storage Utilisation 168/234
... Storage Utilisation 169/234
Minimum number of records = 8192/44 = 186
(assume all max size and no directory)
Overflows 170/234
The first case can initially be handled by compacting the free-space.
... Overflows 171/234
... Overflows 172/234
... Overflows 173/234
PageId
... Overflows 174/234
PostgreSQL Page Representation 175/234 src/backend/storage/page/*.c
src/include/storage/bufpage.h
BLCKSZ
Large data items are stored in separate (TOAST) files.
... PostgreSQL Page Representation 176/234
... PostgreSQL Page Representation 177/234
... PostgreSQL Page Representation 178/234
typedef struct PageHeaderData
{
...
uint16 pd_flags;
... PostgreSQL Page Representation 179/234 Page
void PageInit(Page page, Size pageSize, ...)
Page
pd_lower
pd_upper
OffsetNumber
PageAddItem(Page page, Item item, Size size, ...)
Page
void PageRepairFragmentation(Page page)
... PostgreSQL Page Representation 180/234
Both kinds of page have the same page layout.
Representing Database Objects
Database Objects 182/234
Many objects have names (and, in PostgreSQL, all have OIDs).
... Database Objects 183/234
schema ... collection of DB object definitions
tablespace ... collection of DB files
Schema.Relation
PostgreSQL also has cluster: a server managing a set of DBs.
... Database Objects 184/234
Similarly for other DBMS objects
(e.g. views, functions, triggers, ...)
... Database Objects 185/234 INFORMATION_SCHEMA
Schemata(catalog_name, schema_name, schema_owner, ...)
Tables(table_catalog, table_schema, table_name, table_type, ...)
Columns(table_catalog, table_schema, table_name, column_name,
ordinal_position, column_default, is_nullable, data_type, ...)
Views(table_catalog, table_schema, table_name, view_definition,
check_option, is_updatable, is_insertable_into)
Role_table_grants(grantor, grantee, privilege_type, is_grantable,
table_catalog, table_schema, table_name, ...)
System Catalog 186/234
Users(id:int, name:string, ...)
Databases(id:int, name:string, owner:ref(User), ...)
Schemas(id:int, name:string, owner:ref(User), ...)
Types(id:int, name:string, defn:string, size:int, ...)
Tables(id:int, name:string, owner:ref(User),
inSchema:ref(Schema), ...)
Attributes(id, name:string, table:ref(Table),
type:ref(Type), pkey:bool, ...)
INFORMATION_SCHEMA
... System Catalog 187/234
where Object is one of table, view, function, trigger, schema, ...
create
as
drop
alter
grant
on
create table ABC (
x integer primary key,
y integer
);
... System Catalog 188/234
userID := current_user();
schemaID := current_schema();
tabID := nextval('tab_id_seq');
select into intID id
from Types where name='integer';
insert into Tables(id,name,owner,inSchema,...)
values (tabID, 'abc', userID, schema, ...)
attrID := nextval('attr_id_seq');
insert into Attributes(id,name,table,type,pkey,...)
values (attrID, 'x', tabID, intID, true, ...)
attrID := nextval('attr_id_seq');
insert into Attributes(id,name,table,type,pkey,...)
values (attrID, 'y', tabID, intID, false, ...)
... System Catalog 189/234
The low-level representation is available to sysadmins via:
psql
\d
information_schema
(e.g. select * from information_schema.tables;
pg_catalog
pg_tables
PostgreSQL Catalog 190/234 \d?
psql
\dt
list information about tables
\dv
list information about views
\df
list information about functions
\dp
list table access privileges
\dT
list information about data types
\dd
shows comments attached to DB objects
... PostgreSQL Catalog 191/234
Other catalog information is local to each database, e.g
PGDATA/pg_global
... PostgreSQL Catalog 192/234
Each kind of DB object has table(s) to describe it, e.g.
... PostgreSQL Catalog 193/234
OIDs are used as primary keys in many of the catalog tables.
create table
oid
unique identifying number for tuple (optional)
tableoid
which table this tuple belongs to
xmin/xmax
which transaction created/deleted tuple (for MVCC)
Representing Users/Groups 194/234 pg_authid
pg_auth_members
pg_shadow
pg_authid
pg_user
pg_shadow
CREATE
ALTER
DROP
USER
pg_authid
CREATE
ALTER
DROP
GROUP
pg_auth_members
... Representing Users/Groups 195/234 pg_authid
oid
unique integer key for this role
rolname
symbolic name for role (PostgreSQL identifier)
rolpassword
plain or md5-encrypted password
rolcreatedb
can create new databases
rolsuper
is a superuser (owns server process)
rolcatupdate
can update system catalogs
... Representing Users/Groups 196/234 pg_shadow
usename
symbolic user name (e.g.
'jas'
usesysid
integer key to reference user (
pg_authid.oid
passwd
plain or md5-encrypted password
usecreatdb
can create new databases
usesuper
is a superuser (owns server process)
usecatupd
can update system catalogs
... Representing Users/Groups 197/234 pg_group
groname
group name (e.g.
'developers'
grosysid
integer key to reference group
grolist[]
array containing group members
(vector of refs to pg_authid.oid
Representing High-level Objects 198/234
These tables are global to each PostgreSQL cluster.
pg_database
pg_namespace
pg_tablespace
... Representing High-level Objects 199/234 pg_database
datname
database name (e.g.
'mydb'
datdba
database owner (refs
pg_authid.oid
datpath
where files for database are stored
(if not in the PGDATA directory)
datacl[]
access permissions
datistemplate
can be used to clone new databases
(e.g. template0
template1
... Representing High-level Objects 200/234 acl
UserName=Privileges/Grantor
group GroupName=Privileges/Grantor
jas=arwdRxt/jas,fred=r/jas,joe=rwad/jas
... Representing High-level Objects 201/234 pg_namespace
nspname
namespace name (e.g.
'public'
nspowner
namespace owner (refs
pg_authid.oid
nspacl[]
access permissions
nspname
... Representing High-level Objects 202/234 pg_tablespace
spcname
tablespace name (e.g.
'disk5'
spcowner
tablespace owner (refs
pg_authid.oid
spclocation
full filepath to tablespace directory
spcacl[]
access permissions
pg_default
pg_global
Representing Tables 203/234 pg_class
pg_class
CREATE TYPE AS
pg_class
pg_class
... Representing Tables 204/234 pg_class
relname
name of table (e.g.
employee
relnamespace
schema in which table defined
(refs pg_namespace.oid
reltype
data type corresponding to table
(refs pg_type.oid
relowner
owner (refs
pg_authid.oid
reltuples
# tuples in table
relacl
access permissions
... Representing Tables 205/234 pg_class
relkind
what kind of object
'r' = ordinary table, 'i' = index, 'v' = view
'c' = composite type, 'S' = sequence, 's' = special
relnatts
# attributes in table
(how many entries in pg_attribute
relchecks
# of constraints on table
(how many entries in pg_constraint
relhasindex
table has/had an index?
relhaspkey
table has/had a primary key?
... Representing Tables 206/234 pg_type
typname
name of type (e.g.
'integer'
typnamespace
schema in which type defined
(refs pg_namespace.oid
typowner
owner (refs
pg_authid.oid
typtype
what kind of data type
'b' = base type, 'c' = complex (row) type, ...
(defines "type" for each tuple in table; also, type for functions returning SETOF
... Representing Tables 207/234 pg_type
typlen
how much storage used for values
(-1 for variable-length types, e.g. text
typalign
memory alignment for values
('c' = byte-boundary, 'i' = 4-byte-boundary, ...)
typrelid
table associated with complex type
(refs pg_class.oid
typstorage
where/how values are stored
('p' = in-tuple, 'e' = in external table, compressed?)
pg_type
... Representing Tables 208/234 pg_attribute
attname
name of attribute (e.g.
'empname'
attrelid
table this attribute belongs to
(refs pg_class.oid
attnum
attribute position (1..n, sys attrs are -ve)
atttypid
data type of this attribute
(refs pg_type.oid
(attrelid,attnum)
... Representing Tables 209/234 pg_attribute
attlen
storage space required by attribute
(copy of pg_type.typlen
atttypmod
storage space for var-length attributes
(e.g. 6+ATTR_HEADER_SIZE for char(6)
attalign
memory-alignment info (copy of
pg_type.typalign
attndims
number of dimensions if attr is an array
... Representing Tables 210/234 pg_attribute
attnotnull
attribute may not be null?
atthasdef
attribute has a default values
(value is held in pg_attrdef
attisdropped
attribute has been dropped from table
... Representing Tables 211/234
create table MyTable (
a int unique not null,
b char(6)
);
pg_class
pg_attribute
pg_type
... Representing Tables 212/234
rel_oid := new_oid(); user_id = current_user();
insert into
pg_class(oid,name,owner,kind,pages,tuples,...)
values (rel_oid, 'mytable', user_id, 'r', 0, 0, ...)
select oid,typlen into int_oid,int_len
from pg_type where typname = 'int';
insert into
pg_attribute(relid,name,typid,num,len,typmod,notnull...)
values (rel_oid, 'a', int_oid, 1, int_len, -1, true, ...)
select oid,typlen into char_oid,char_len
from pg_type where typname = 'char';
insert into
pg_attribute(relid,name,typid,num,len,typmod,notnull...)
values (rel_oid, 'b', char_oid, 2, -1, 6+4, false, ...)
insert into
pg_type(name,owner,len,type,relid,align,...)
values ('mytable', user_id, 4, 'c', rel_oid, 'i', ...)
... Representing Tables 213/234 pg_attrdef
adrelid
table that column belongs to
(refs pg_class.oid
adnum
which column in the table
(refs pg_attribute.attnum
adsrc
readable representation of default value
adbin
internal representation of default value
... Representing Tables 214/234 pg_constraint
conname
name of constraint (not unique)
connamespace
schema containing this constraint
contype
kind of constraint
'c' = check, 'u' = unique,
'p' = primary key, 'f' = foreign key
conrelid
which table (refs
pg_class.oid
conkey
which attributes
(vector of values from pg_attribute.attnum
consrc
check constraint expression
fkey,check
... Representing Tables 215/234 pg_constraint
confrelid
referenced table for foreign key
confkey
key attributes in foreign table
conkey
corresponding attributes in local table
consrc
readable check constraint expression
conbin
internal check constraint expression
... Representing Tables 216/234
create table MyOtherTable (
x int check (x > 0),
y int references MyTable(a),
z int default -1
);
pg_constraint
x
y
pg_attrdef
z
... Representing Tables 217/234
rel_oid := new_oid(); user_id = current_user();
insert into
pg_class(oid,name,owner,kind,pages,tuples,...)
values (rel_oid, 'myothertable', user_id, 'r', 0, 0, ...)
select oid,typlen into int_oid,int_len
from pg_type where typname = 'int';
select oid into old_oid
from pg_class where relname='mytable';
Representing Functions 218/234
create function power(int x, int y) returns int
as $$
declare i int; product int := 1;
begin
for i in 1..y loop
product := product * x;
end loop;
return product;
end;
$$ language plpgsql;
pg_proc
pg_type
... Representing Functions 219/234 pg_proc
proname
name of function (e.g.
substr
pronamespace
schema in which function defined
(refs pg_namespace.oid
proowner
owner (refs
pg_authid.oid
proacl[]
access permissions
... Representing Functions 220/234 pg_proc
pronargs
how many arguments
prorettype
return type (refs
pg_type.oid
proargtypes[]
argument types (ref
pg_type.oid
proreset
returns set of values of
prorettype
proisagg
is function an aggregate?
proisstrict
returns null if any arg is null
provolatile
return value depends on side-effects?
('i' = immutable, 's'= stable, 'v' = volatile)
... Representing Functions 221/234 pg_proc
prolang
what language function written in
prosrc
source code if interpreted (e.g. PLpgSQL)
probin
additional info on how to invoke function
(interpretation is language-specific)
... Representing Functions 222/234
sq.c int square_in_c(int x) { return x * x; }
create function square(int) returns int
as '/path/to/sq.o', 'square_in_c' language 'C';
create function square(int) returns int
as $$
begin
return $1 * $1;
end;
$$ language plpgsql;
... Representing Functions 223/234
user_id := current_user();
select oid,typlen into int_oid,int_len
from pg_type where typname = 'int';
insert into
pg_proc(name,owner,rettype,nargs,argtypes,
prosrc,probin...)
values ('square', user_id, int_oid, 1, {int_oid},
'square_in_c', '/path/to/sq.o', ...)
... Representing Functions 224/234 max()
This information is stored in the pg_aggregate
pg_proc
... Representing Functions 225/234 average()
create aggregate average (
basetype = integer,
sfunc = int_avg_accum,
stype = int[],
finalfunc = int_avg_result,
initcond = '{0,0}'
);
... Representing Functions 226/234
create function
int_avg_accum(state int[], int) returns int[]
as $$
declare res int[2];
begin
res[1] := state[1] + $2; res[2] := res[2] + 1;
return res;
end;
$$ language plpgsql;
create function
int_avg_result(state int[]) returns int
as $$
begin
if (state[2] = 0) then return null; end if;
return (state[1] / state[2]);
end;
$$ language plpgsql;
... Representing Functions 227/234 power(x,y)
create operator ** (
procedure = power, leftarg = int, rightarg = int
);
pg_operator
Representing Types 228/234
Consider defining a 3-dimensional point type for spatial data:
create type point3d (
input = point3d_in,
... Representing Types 229/234 pg_type
typinput
text input conversion function
typoutput
text output conversion function
typreceive
binary input conversion function
typsend
binary output conversion function
pg_proc.oid
... Representing Types 230/234
pg_am
pg_opclass
pg_amop
pg_amproc
... Representing Types 231/234 pg_am
amname
name of access method (e.g.
btree
amowner
owner (refs
pg_authid.oid
amorder
strategy
operator for determining sort order
(0 if unsorted)
amcanunique
does AM support unique indexes?
ammulticol
does AM support multicolumn indexes?
amindexnulls
does AM support NULL index entries?
amconcurrent
does AM support concurrent updates?
... Representing Types 232/234 pg_am
amgettuple
"next valid tuple" function
ambeginscan
"start new scan" function
amrescan
"restart this scan" function
amendscan
"end this scan" function
amcostestimate
estimate cost of index scan
pg_proc.oid
... Representing Types 233/234 pg_am
aminsert
"insert this tuple" function
ambuild
"build new index" function
ambulkdelete
bulk delete function
amvacuum
cleanup
post-vacuum cleanup function
pg_proc.oid
... Representing Types 234/234
Some access methods introduce additional files (e.g. B-tree)
heap
btree
hash
rtree
GiST
SP-GiST
GIN