Non-classical DBMSs
Parallel and Distributed Databases | |
Parallel and Distributed Systems | 2/37 |
The discussion so far has revolved around systems
- with a single or small number of processors
- accessing a single memory space
- getting data from one or more disk devices
Parallel Architectures | 3/37 |
Types: shared memory, shared disk, shared nothing
Example shared-nothing architecture:
Typically in the same room (data transfer cost ~ 100's of μsecs)
... Parallel Architectures | 4/37 |
Hierarchical architectures are hybrid parallel ones
Typically on a local-area network (data transfer cost ~ msecs)
Distributed Architectures | 5/37 |
Distributed architectures are ...
- effectively shared-nothing, on a global-scale network
Typically on the Internet (data transfer cost ~ secs)
Parallel Databases (PDBs) | 6/37 |
Parallel databases provide various forms of parallelism ...
- processor parallelism can assist in speeding up memory ops
- processor parallelism introduces cache coherence issues
- disk parallelism can assist in overcoming latency
- disk parallelism can be used to improve fault-tolerance (RAID)
- one limiting factor is congestion on communication bus
PDBs typically run on closely-connected parallel architectures,
so we focus on hybrid architectures on a LAN.
Data Storage in PDBs | 7/37 |
Consider each table as a collection of pages ...
Page addressing: (Table, File, PageNum)
- Table maps to a set of files (e.g. named by tableID)
- File distinguishes primary/overflow files
- PageNum maps to an offset in a specific file
If all data for one table resides on one node
- the above addressing scheme is adequate
- Table can identify (Node, FileSet)
... Data Storage in PDBs | 8/37 |
However, with multiple nodes, we could ...
- replicate tables across several nodes
  - in which case, Table yields { (Node, FileSet) }
- partition pages for one table across several nodes
  - in which case, page addressing changes to include the node
  - (Node, Table, File, PageNum)
Could also have a combination of partitioning and replication
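To make the addressing concrete, a minimal sketch in Python; the type and field names (PageAddr, node, file, page_num) are illustrative only, not taken from any particular system:

from typing import NamedTuple

class PageAddr(NamedTuple):
    node: int      # which node holds the page (dropped if the whole table is on one node)
    table: int     # tableID, which maps to a set of files on that node
    file: int      # distinguishes primary/overflow files
    page_num: int  # offset within the chosen file

# with replication, one logical page maps to a set of physical addresses
replicas = { PageAddr(0, 42, 0, 7), PageAddr(3, 42, 0, 7) }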
... Data Storage in PDBs | 9/37 |
Data-partitioning example:
... Data Storage in PDBs | 10/37 |
Data-partitioning strategies for one table:
- round-robin partitioning
  - cycle through nodes, each new tuple is added on the "next" node
- hash partitioning
  - use hash value to determine which processor and page
- range partitioning
  - ranges of attr values are assigned to processors
Assume: R(a,b,c,...), D0 .. Dn-1 disks, tup0 .. tupr-1 tuples
Storing data on many disks maximises chance for parallel data access
... Data Storage in PDBs | 11/37 |
Round-robin partitioning:
- tuple ti sent to Dj, tuple ti+1 sent to D(j+1)%n
- advantage: spreads data uniformly across disks
- disadvantage: doesn't partition data "usefully" (for queries)
- sequential scan can exploit parallelism
- read data from multiple disks simultaneously
- index-based scan can exploit limited parallelism
- index gives list of pages, potential parallel read
- provides no assistance for hash-based access
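A one-line sketch of the allocation rule (assuming n disks numbered 0..n-1 and tuples numbered from 0):

def round_robin_disk(i: int, n: int) -> int:
    # tuple i goes to disk i mod n; tuple i+1 goes to disk (i+1) mod n
    return i % n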
... Data Storage in PDBs | 12/37 |
Hash partitioning
- hash functions: hN(Ai) → NodeID, hP(Ai) → PageNum
- well-designed hash functions can spread tuples uniformly
- hash-based access can work well
- all tuples matching query hash will be on one node
- sequential scan performance depends on uniform spread
- index-on-hash works as for round-robin
- provides no assistance for range queries
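A rough sketch, using Python's built-in hash as a stand-in for hN and hP (node and page counts are illustrative):

def hash_partition(Ai, n_nodes: int, n_pages: int):
    node = hash(Ai) % n_nodes            # plays the role of hN(Ai) → NodeID
    page = hash((Ai, "pg")) % n_pages    # plays the role of hP(Ai) → PageNum
    return node, page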
... Data Storage in PDBs | 13/37 |
Range partitioning
- uses partitioning vector pv to determine node for tuple
- allocates range of partitioning attribute values to each node
- pv = [ (v0,Di), (v1,Dj), ... (vh,Dm) ]
- all tuples with Ai ≤ v0 go to Di
- all tuples with v0 < Ai ≤ v1 go to Dj, etc.
- need to choose vi boundary points carefully
- to ensure reasonably uniform spread of data over disks
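A small sketch of lookup via a partitioning vector (boundary values and disk names invented; tuples with Ai > vh fall into one extra disk at the end):

from bisect import bisect_left

pv    = [100, 200, 300]               # boundary points v0 < v1 < v2
disks = ["D0", "D1", "D2", "D3"]      # one more disk than boundary points

def range_partition(Ai):
    # Ai ≤ v0 → D0, v0 < Ai ≤ v1 → D1, etc.
    return disks[bisect_left(pv, Ai)]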
PostgreSQL and Parallelism | 14/37 |
PostgreSQL assumes
- shared memory space accessible to all back-ends
- files for one table are located on one disk
PostgreSQL allows
- data to be distributed across multiple disk devices
So could run on ...
- shared-memory, shared-disk architectures
- hierarchical architectures with distributed virtual memory
... PostgreSQL and Parallelism | 15/37 |
PostgreSQL can provide
- multiple servers running on separate nodes
- application #1: high availability
- "standby" server takes over if primary server fails
- application #2: load balancing
- several servers can be used to provide same data
- direct queries to least loaded server
Both need data synchronisation between servers
PostgreSQL uses the notion of master and slave servers.
... PostgreSQL and Parallelism | 16/37 |
High availability ...
- updates occur on master, recorded in tx log
- tx logs shipped/streamed from master to slave(s)
- slave uses tx logs to maintain current state
- configuration controls frequency of log shipping
- bringing slave up-to-date is fast (~ 1-2 secs)
Note: small window for data loss (committed tx log records not sent)
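A sketch of the settings involved (PostgreSQL 9.x-era streaming replication; the values, paths and host names here are illustrative, and details vary with version):

# on the master (postgresql.conf)
wal_level = hot_standby          # write enough WAL for a standby to replay
max_wal_senders = 3              # allow streaming connections from slaves
archive_mode = on                # optionally also ship completed WAL segments
archive_command = 'cp %p /archive/%f'

# on the slave (recovery.conf)
standby_mode = 'on'
primary_conninfo = 'host=master port=5432 user=replicator'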
Distributed Databases | 17/37 |
Two kinds of distributed databases
- parallel database on a distributed architecture
- single schema/control, data distributed over network
- independent databases on a distributed architecture
- independent schemas/DBMSs, combined via global schema
The latter are also called federated databases
Distribution of data complicates tx processing ...
- potential for multiple copies of data to become inconsistent
- commit or abort must occur consistently on all nodes
... Distributed Databases | 18/37 |
Distributed tx processing handled by two-phase commit
- initiating site has transaction coordinator Ci ...
  - waits for all other sites executing tx T to "complete"
  - sends <prepare T> message to all other sites
  - waits for <ready T> response from all other sites
  - if not received (timeout), or <abort T> received, flag abort
  - if all other sites respond <ready T>, flag commit
  - writes <commit T> or <abort T> to log
  - sends <commit T> or <abort T> to all other sites
- non-initiating sites write log entries before responding
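A compressed sketch of the coordinator's side in Python; the messaging and logging primitives (send, wait_for_replies, write_log) are passed in as assumed callbacks, not taken from any real system:

def coordinator_commit(T, sites, send, wait_for_replies, write_log):
    # phase 1: ask every participating site to prepare
    for s in sites:
        send(s, ("prepare", T))
    replies = wait_for_replies(sites, T)     # returns None on timeout
    if replies is not None and all(r == ("ready", T) for r in replies):
        decision = "commit"
    else:
        decision = "abort"                   # timeout or an <abort T> reply
    # phase 2: log the decision locally, then inform all sites
    write_log((decision, T))
    for s in sites:
        send(s, (decision, T))
    return decision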
... Distributed Databases | 19/37 |
Distributed query processing
- may require query ops to be executed on different nodes
- node provides only source of some data
- some nodes may have limited set of operations
- needs to merge data received from different nodes
- may require data transformation (to fit schemas together)
Query optimisation in such contexts is difficult.
Assumptions made in conventional DBMSs:
- data is sets of tuples; tuples are lists of atomic values
- data values can be compared precisely (via =, >, <, ...)
- filters can be described via boolean formulae
- SQL is a suitable language for all data management
- transaction-based consistency is critical
- data stored on disk, processed in memory
- data transferred in blocks of many tuples
- disk ↔ memory cost is most expensive in system
- disks are connected to processors via fast local bus
Demands from modern applications
- more flexible data structuring mechanisms
- very large data objects/values (e.g. music, video)
- alternative comparisons/filters (e.g. similarity matching)
- massive amounts of data (too much to store "locally")
- massive number of clients (thousands of tx's per second)
- solid-state storage (minimal data latency)
- data required globally (network latency)
Clearly, not all of these are relevant for every modern application.
Some conclusions:
- relational model doesn't work for all applications
- SQL is not appropriate for all applications
- hard transactions not essential for all applications
Some "modernists" claim that
- "for all" is really "for any"
- ⇒ relational DBMSs and SQL are dinosaurs
- ⇒ NoSQL is the new way
Some approaches:
- storage systems: Google FS, Hadoop DFS, Amazon S3
- data structures: BigTable, HBase, Cassandra, XML, RDF
- data structures: column-oriented DBMSs e.g. C-store
- data structures: graph databases e.g. Neo4j
- operations: multimedia similarity search e.g. Shazam
- operations: web search e.g. Google
- transactions: eventual consistency
- programming: object-relational mapping (ORM)
- programming: MapReduce
- languages: Sawzall, Pig, Hive, SPARQL
- DB systems: CouchDB, MongoDB, F1, C-store
Scale, Distribution, Replication | 25/37 |
Data for modern applications is very large (TB, PB, XB)
- not feasible to store on a single machine
- not feasible to store in a single location
Many systems opt for massive networks of simple nodes
- each node holds moderate amount of data
- each data item is replicated on several nodes
- nodes clustered in different geographic sites
Benefits:
- reliability, fault-tolerance, availability
- proximity ... use data closest to client
- scope for parallel execution/evaluation
Schema-free Data Models | 26/37 |
Many new DBMSs provide (key,value) stores
- key is a unique identifier (cf. URI)
- value is an arbitrarily complex "object"
- e.g. a text document (often structured, e.g. Wiki, XML)
- e.g. a JSON object: (property,value) list
- e.g. an RDF triple (e.g. <John,worksFor,UNSW>)
- objects may contain keys to link to other objects
Tables can be simulated by a collection of "similar" objects.
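For example, a (key,value) store might hold entries like these (keys and property names invented for illustration):

store = {
    "person:7654321": { "name": "John", "worksFor": "org:unsw" },  # JSON-style object
    "org:unsw":       { "name": "UNSW", "type": "University" },
}
# values such as "org:unsw" act as keys linking to other objects;
# a collection of objects with the same properties plays the role of a table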
Eventual Consistency | 27/37 |
RDBMSs use a strong transactional/consistency model
- if a tx commits, changes take effect "instantly"
- all tx's have a strong guarantee about data integrity
Many new DBMS applications do not need strong consistency
- e.g. doesn't matter if catalogue shows yesterday's price
Because of distribution/replication
- update is initiated on one node
- different nodes may have different versions of data
- after some time, updates propagate to all nodes
... Eventual Consistency | 28/37 |
If different nodes have different versions of data
- conflicts arise, and need to be resolved (when noticed)
- need to decide which node has "the right value"
Levels of consistency (from Cassandra system)
- ONE: at least one node has committed change (weakest)
- QUORUM: a majority of the nodes holding the data have committed
- ALL: changes propagated to all copies (strongest)
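In Cassandra, QUORUM amounts to a majority of the replicas; a minimal sketch of the arithmetic, assuming replication factor rf:

def quorum(rf: int) -> int:
    # replicas that must acknowledge for QUORUM consistency
    return rf // 2 + 1

# e.g. rf=3 → 2 replicas, rf=5 → 3 replicas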
MapReduce is a programming model
- suited for use on large networks of computers
- processing large amounts of data with high parallelism
- originally developed by Google; Hadoop is an open-source implementation
Computation is structured in two phases:
- Map phase:
- master node partitions work into sub-problems
- distributes them to worker nodes (who may further distribute)
- Reduce phase:
- master collects results of sub-problems from workers
- combines results to produce final answer
MapReduce makes use of (key,value) pairs
- key values identify parts of computation
Map(key1,val1) → list(key2,val2)
- applied in parallel to all (key1,val1) pairs
- results with common key2 are collected in group for "reduction"
Reduce(key2,list(val2)) → val3
- collects all values tagged with key2
- combines them to produce result(s) val3
"Classic" MapReduce example (word frequency in set of docs):
function map(String name, String document):
    // name: document name
    // document: document contents
    for each word w in document:
        emit (w, 1)

function reduce(String word, Iterator partialCounts):
    // word: a word
    // partialCounts: list of aggregated partial counts
    sum = 0
    for each c in partialCounts:
        sum += c
    emit (word, sum)
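The same computation as a minimal runnable Python sketch, simulating the map, group-by-key and reduce phases over in-memory documents (no cluster involved):

from collections import defaultdict

def map_doc(name, document):
    return [(w, 1) for w in document.split()]       # emit (word, 1) pairs

def reduce_word(word, partial_counts):
    return (word, sum(partial_counts))              # combine partial counts

docs = { "d1": "the cat sat", "d2": "the cat ran" }
groups = defaultdict(list)
for name, text in docs.items():                     # map phase
    for k, v in map_doc(name, text):
        groups[k].append(v)                         # group values by key
results = [reduce_word(w, counts) for w, counts in groups.items()]
print(results)   # [('the', 2), ('cat', 2), ('sat', 1), ('ran', 1)]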
MapReduce as a "database language"
- some advocates of MapReduce have oversold it (replace SQL)
- DeWitt/Stonebraker criticised this
- return to low-level model of data access
- all done before in distributed DB research
- misses efficiency opportunities afforded by DBMSs
- consensus is emerging
- SQL/MapReduce good for different kinds of task
- MapReduce as a basis for SQL-like languages (e.g. Apache HiveQL)
Hadoop Distributed File System (HDFS): storage system to support distributed, replicated data (Apache)
- provides a view of data as a collection of named files
- each HDFS cluster has a collection of DataNodes
- manages blocks of data, typically on commodity hardware
- each file is a collection of blocks, stored on multiple nodes
- individual blocks may be replicated across nodes
- DataNodes provide read/write operations to clients
- each HDFS cluster has a NameNode
- manages name space and access to data by clients
- determines mapping of data blocks to DataNodes
- provides file open/close/rename operations to clients
Data replication in Hadoop
- organised by the NameNode
- all blocks in a file are the same size
- determines placement of copies of each block
- uses HeartBeat and BlockReport info from DataNodes
- attempts to maximise reliability and performance
- e.g. keeps copies on separate DataNodes (obvious)
- e.g. assigns closest DataNode to client for read/write
- a complex optimisation problem, requires tuning
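A toy illustration of the block-to-DataNode mapping the NameNode maintains (this is not HDFS's real placement algorithm; round-robin placement and node names are invented for the sketch):

def place_blocks(n_blocks, datanodes, replication=3):
    placement = []
    for b in range(n_blocks):
        # keep each copy on a distinct DataNode; round-robin for simplicity
        copies = [datanodes[(b + r) % len(datanodes)] for r in range(replication)]
        placement.append(copies)
    return placement

print(place_blocks(4, ["dn1", "dn2", "dn3", "dn4"]))
# block 0 → ['dn1','dn2','dn3'], block 1 → ['dn2','dn3','dn4'], ...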
Distributed NoSQL "database management system" (Apache)
- provides a hybrid (key,value)/row-oriented store
- a column is a (name,value) pair (e.g. (Name,John))
- a supercolumn is also a (name,value) pair
- the value is a set of columns (cf tuple)
- a columnfamily is a set of (name,value) pairs (cf table)
- each value is effectively a supercolumn
- a keyspace is a namespace to hold objects (cf database)
Tables are distributed on a Hadoop DFS (DBA-specified replication)
CQL is the Cassandra query language (cf SQL)
create keyspace UNSW; use UNSW;
// don't need to specify all columns
create columnfamily Student (sid int primary key);
// columns are named as values are added
insert into Student (sid, name, degree)
values (12345, 'John Smith', 'BSc(CompSci)');
// familiar syntax ...
select * from Student where sid=12345;
// cannot reference non-family columns
select * from Student where degree='BSc(CompSci)';
// consistency level can be specified explicitly
update Student using consistency QUORUM
set degree = 'MIT' where sid = 12345;
Produced: 30 May 2016