Non-classical DBMSs
Parallel and Distributed Databases | |
Parallel and Distributed Systems | 2/37 |
The discussion so far has revolved around systems
- with a single or small number of processors
- accessing a single memory space
- getting data from one or more disk devices
Parallel Architectures | 3/37 |
Types: shared memory, shared disk, shared nothing
Example shared-nothing architecture:
Typically in the same room (data transfer cost ~ 100's of μsecs)
... Parallel Architectures | 4/37 |
Hierarchical architectures are hybrid parallel ones
Typically on a local-area network (data transfer cost ~ msecs)
Distributed Architectures | 5/37 |
Distributed architectures are ...
- effectively shared-nothing, on a global-scale network
Typically on the Internet (data transfer cost ~ secs)
Parallel Databases (PDBs) | 6/37 |
Parallel databases provide various forms of parallelism ...
- processor parallelism can assist in speeding up memory ops
- processor parallelism introduces cache coherence issues
- disk parallelism can assist in overcoming latency
- disk parallelism can be used to improve fault-tolerance (RAID)
- one limiting factor is congestion on communication bus
PDBs typically run on closely-connected parallel architectures,
so we focus on hybrid architectures on a LAN.
Data Storage in PDBs | 7/37 |
Consider each table as a collection of pages ...
Page addressing: (Table, File, PageNum)
- Table maps to a set of files (e.g. named by tableID)
- File distinguishes primary/overflow files
- PageNum maps to an offset in a specific file
If all data for one table resides on one node
- the above addressing scheme is adequate
- Table can identify (Node, FileSet)
... Data Storage in PDBs | 8/37 |
However, with multiple nodes, we could ...
- replicate tables across several nodes
  - in which case, Table yields { (Node, FileSet) }
- partition pages for one table across several nodes
  - in which case, page addressing changes to include the node
  - (Node, Table, File, PageNum)
Could also have a combination of partitioning and replication
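To make the addressing concrete, a minimal sketch in Python; the type and field names (PageAddr, node, file, page_num) are illustrative only, not taken from any particular system:

from typing import NamedTuple

class PageAddr(NamedTuple):
    node: int      # which node holds the page (dropped if the whole table is on one node)
    table: int     # tableID, which maps to a set of files on that node
    file: int      # distinguishes primary/overflow files
    page_num: int  # offset within the chosen file

# with replication, one logical page maps to a set of physical addresses
replicas = { PageAddr(0, 42, 0, 7), PageAddr(3, 42, 0, 7) }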
... Data Storage in PDBs | 9/37 |
Data-partitioning example:
... Data Storage in PDBs | 10/37 |
Data-partitioning strategies for one table:
- round-robin partitioning
  - cycle through nodes, each new tuple is added on the "next" node
- hash partitioning
  - use hash value to determine which processor and page
- range partitioning
  - ranges of attr values are assigned to processors
Assume: R(a,b,c,...), D0 .. Dn-1 disks, tup0 .. tupr-1 tuples
Storing data on many disks maximises chance for parallel data access
... Data Storage in PDBs | 11/37 |
Round-robin partitioning:
- tuple ti sent to Dj, tuple ti+1 sent to D(j+1)%n
- advantage: spreads data uniformly across disks
- disadvantage: doesn't partition data "usefully" (for queries)
- sequential scan can exploit parallelism
- read data from multiple disks simultaneously
- index-based scan can exploit limited parallelism
- index gives list of pages, potential parallel read
- provides no assistance for hash-based access
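A one-line sketch of the allocation rule (assuming n disks numbered 0..n-1 and tuples numbered from 0):

def round_robin_disk(i: int, n: int) -> int:
    # tuple i goes to disk i mod n; tuple i+1 goes to disk (i+1) mod n
    return i % n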
... Data Storage in PDBs | 12/37 |
Hash partitioning
- hash functions: hN(Ai) → NodeID, hP(Ai) → PageNum
- well-designed hash functions can spread tuples uniformly
- hash-based access can work well
- all tuples matching query hash will be on one node
- sequential scan performance depends on uniform spread
- index-on-hash works as for round-robin
- provides no assistance for range queries
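A rough sketch, using Python's built-in hash as a stand-in for hN and hP (node and page counts are illustrative):

def hash_partition(Ai, n_nodes: int, n_pages: int):
    node = hash(Ai) % n_nodes            # plays the role of hN(Ai) → NodeID
    page = hash((Ai, "pg")) % n_pages    # plays the role of hP(Ai) → PageNum
    return node, page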
... Data Storage in PDBs | 13/37 |
Range partitioning
- uses partitioning vector pv to determine node for tuple
- allocates range of partitioning attribute values to each node
- pv = [ (v0,Di), (v1,Dj), ... (vh,Dm) ]
- all tuples with Ai ≤ v0 go to Di
- all tuples with v0 < Ai ≤ v1 go to Dj, etc.
- need to choose vi boundary points carefully
- to ensure reasonably uniform spread of data over disks
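A small sketch of lookup via a partitioning vector (boundary values and disk names invented; tuples with Ai > vh fall into one extra disk at the end):

from bisect import bisect_left

pv    = [100, 200, 300]               # boundary points v0 < v1 < v2
disks = ["D0", "D1", "D2", "D3"]      # one more disk than boundary points

def range_partition(Ai):
    # Ai ≤ v0 → D0, v0 < Ai ≤ v1 → D1, etc.
    return disks[bisect_left(pv, Ai)]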
PostgreSQL and Parallelism | 14/37 |
PostgreSQL assumes
- shared memory space accessible to all back-ends
- files for one table are located on one disk
PostgreSQL allows
- data to be distributed across multiple disk devices
So could run on ...
- shared-memory, shared-disk architectures
- hierarchical architectures with distributed virtual memory
... PostgreSQL and Parallelism | 15/37 |
PostgreSQL can provide
- multiple servers running on separate nodes
- application #1: high availability
- "standby" server takes over if primary server fails
- application #2: load balancing
- several servers can be used to provide same data
- direct queries to least loaded server
Both need data synchronisation between servers
PostgreSQL uses the notion of master and slave servers.
... PostgreSQL and Parallelism | 16/37 |
High availability ...
- updates occur on master, recorded in tx log
- tx logs shipped/streamed from master to slave(s)
- slave uses tx logs to maintain current state
- configuration controls frequency of log shipping
- bringing slave up-to-date is fast (~ 1-2 secs)
Note: small window for data loss (committed tx log records not sent)
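A sketch of the settings involved (PostgreSQL 9.x-era streaming replication; the values, paths and host names here are illustrative, and details vary with version):

# on the master (postgresql.conf)
wal_level = hot_standby          # write enough WAL for a standby to replay
max_wal_senders = 3              # allow streaming connections from slaves
archive_mode = on                # optionally also ship completed WAL segments
archive_command = 'cp %p /archive/%f'

# on the slave (recovery.conf)
standby_mode = 'on'
primary_conninfo = 'host=master port=5432 user=replicator'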
Distributed Databases | 17/37 |
Two kinds of distributed databases
- parallel database on a distributed architecture
- single schema/control, data distributed over network
- independent databases on a distributed architecture
- independent schemas/DBMSs, combined via global schema
The latter are also called federated databases
Distribution of data complicates tx processing ...
- potential for multiple copies of data to become inconsistent
- commit or abort must occur consistently on all nodes
... Distributed Databases | 18/37 |
Distributed tx processing handled by two-phase commit
- initiating site has transaction coordinator Ci ...
  - waits for all other sites executing tx T to "complete"
  - sends <prepare T> message to all other sites
  - waits for <ready T> response from all other sites
  - if not received (timeout), or <abort T> received, flag abort
  - if all other sites respond <ready T>, flag commit
  - writes <commit T> or <abort T> to log
  - sends <commit T> or <abort T> to all other sites
- non-initiating sites write log entries before responding
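A compressed sketch of the coordinator's side in Python; the messaging and logging primitives (send, wait_for_replies, write_log) are passed in as assumed callbacks, not taken from any real system:

def coordinator_commit(T, sites, send, wait_for_replies, write_log):
    # phase 1: ask every participating site to prepare
    for s in sites:
        send(s, ("prepare", T))
    replies = wait_for_replies(sites, T)     # returns None on timeout
    if replies is not None and all(r == ("ready", T) for r in replies):
        decision = "commit"
    else:
        decision = "abort"                   # timeout or an <abort T> reply
    # phase 2: log the decision locally, then inform all sites
    write_log((decision, T))
    for s in sites:
        send(s, (decision, T))
    return decision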
... Distributed Databases | 19/37 |
Distributed query processing
- may require query ops to be executed on different nodes
- node provides only source of some data
- some nodes may have limited set of operations
- needs to merge data received from different nodes
- may require data transformation (to fit schemas together)
Query optimisation in such contexts is difficult.
Assumptions made in conventional DBMSs:
- data is sets of tuples; tuples are lists of atomic values
- data values can be compared precisely (via =, >, <, ...)
- filters can be described via boolean formulae
- SQL is a suitable language for all data management
- transaction-based consistency is critical
- data stored on disk, processed in memory
- data transferred in blocks of many tuples
- disk ↔ memory cost is most expensive in system
- disks are connected to processors via fast local bus
Demands from modern applications
- more flexible data structuring mechanisms
- very large data objects/values (e.g. music, video)
- alternative comparisons/filters (e.g. similarity matching)
- massive amounts of data (too much to store "locally")
- massive number of clients (thousands of tx's per second)
- solid-state storage (minimal data latency)
- data required globally (network latency)
Clearly, not all of these are relevant for every modern application.
Some conclusions:
- relational model doesn't work for all applications
- SQL is not appropriate for all applications
- hard transactions not essential for all applications
Some "modernists" claim that
- "for all" is really "for any"
- ⇒ relational DBMSs and SQL are dinosaurs
- ⇒ NoSQL is the new way
Some approaches:
- storage systems: Google FS, Hadoop DFS, Amazon S3
- data structures: BigTable, HBase, Cassandra, XML, RDF
- data structures: column-oriented DBMSs e.g. C-store
- data structures: graph databases e.g. Neo4j
- operations: multimedia similarity search e.g. Shazam
- operations: web search e.g. Google
- transactions: eventual consistency
- programming: object-relational mapping (ORM)
- programming: MapReduce
- languages: Sawzall, Pig, Hive, SPARQL
- DB systems: CouchDB, MongoDB, F1, C-store
Scale, Distribution, Replication | 25/37 |
Data for modern applications is very large (TB, PB, XB)
- not feasible to store on a single machine
- not feasible to store in a single location
Many systems opt for massive networks of simple nodes
- each node holds moderate amount of data
- each data item is replicated on several nodes
- nodes clustered in different geographic sites
Benefits:
- reliability, fault-tolerance, availability
- proximity ... use data closest to client
- scope for parallel execution/evaluation
Schema-free Data Models | 26/37 |
Many new DBMSs provide (key,value) stores
- key is a unique identifier (cf. URI)
- value is an arbitrarily complex "object"
- e.g. a text document (often structured, e.g. Wiki, XML)
- e.g. a JSON object: (property,value) list
- e.g. an RDF triple (e.g. <John,worksFor,UNSW>)
- objects may contain keys to link to other objects
Tables can be simulated by a collection of "similar" objects.
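For example, a (key,value) store might hold entries like these (keys and property names invented for illustration):

store = {
    "person:7654321": { "name": "John", "worksFor": "org:unsw" },  # JSON-style object
    "org:unsw":       { "name": "UNSW", "type": "University" },
}
# values such as "org:unsw" act as keys linking to other objects;
# a collection of objects with the same properties plays the role of a table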
Eventual Consistency | 27/37 |
RDBMSs use a strong transactional/consistency model
- if a tx commits, changes take effect "instantly"
- all tx's have a strong guarantee about data integrity
Many new DBMS applications do not need strong consistency
- e.g. doesn't matter if catalogue shows yesterday's price
Because of distribution/replication
- update is initiated on one node
- different nodes may have different versions of data
- after some time, updates propagate to all nodes
... Eventual Consistency | 28/37 |
If different nodes have different versions of data
- conflicts arise, and need to be resolved (when noticed)
- need to decide which node has "the right value"
Levels of consistency (from Cassandra system)
- ONE: at least one node has committed change (weakest)
- QUORUM: a majority of the nodes holding the data have committed
- ALL: changes propagated to all copies (strongest)
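In Cassandra, QUORUM amounts to a majority of the replicas; a minimal sketch of the arithmetic, assuming replication factor rf:

def quorum(rf: int) -> int:
    # replicas that must acknowledge for QUORUM consistency
    return rf // 2 + 1

# e.g. rf=3 → 2 replicas, rf=5 → 3 replicas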
MapReduce is a programming model
- suited for use on large networks of computers
- processing large amounts of data with high parallelism
- originally developed by Google; Hadoop is an open-source implementation
Computation is structured in two phases:
- Map phase:
- master node partitions work into sub-problems
- distributes them to worker nodes (who may further distribute)
- Reduce phase:
- master collects results of sub-problems from workers
- combines results to produce final answer
MapReduce makes use of (key,value) pairs
- key values identify parts of computation
Map(key1,val1) → list(key2,val2)
- applied in parallel to all (key1,val1) pairs
- results with common key2 are collected in group for "reduction"
Reduce(key2,list(val2)) → val3
- collects all values tagged with key2
- combines them to produce result(s) val3
"Classic" MapReduce example (word frequency in set of docs):
function map(String name, String document):
    // name: document name
    // document: document contents
    for each word w in document:
        emit (w, 1)

function reduce(String word, Iterator partialCounts):
    // word: a word
    // partialCounts: list of aggregated partial counts
    sum = 0
    for each c in partialCounts:
        sum += c
    emit (word, sum)
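The same computation as a minimal runnable Python sketch, simulating the map, group-by-key and reduce phases over in-memory documents (no cluster involved):

from collections import defaultdict

def map_doc(name, document):
    return [(w, 1) for w in document.split()]       # emit (word, 1) pairs

def reduce_word(word, partial_counts):
    return (word, sum(partial_counts))              # combine partial counts

docs = { "d1": "the cat sat", "d2": "the cat ran" }
groups = defaultdict(list)
for name, text in docs.items():                     # map phase
    for k, v in map_doc(name, text):
        groups[k].append(v)                         # group values by key
results = [reduce_word(w, counts) for w, counts in groups.items()]
print(results)   # [('the', 2), ('cat', 2), ('sat', 1), ('ran', 1)]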
MapReduce as a "database language"
- some advocates of MapReduce have oversold it (replace SQL)
- DeWitt/Stonebraker criticised this
- return to low-level model of data access
- all done before in distributed DB research
- misses efficiency opportunities afforded by DBMSs
- consensus is emerging
- SQL/MapReduce good for different kinds of task
- MapReduce as a basis for SQL-like languages (e.g. Apache HiveQL)
Hadoop Distributed File System (HDFS): storage system to support distributed, replicated data (Apache)
- provides a view of data as a collection of named files
- each HDFS cluster has a collection of DataNodes
- manages blocks of data, typically on commodity hardware
- each file is a collection of blocks, stored on multiple nodes
- individual blocks may be replicated across nodes
- DataNodes provide read/write operations to clients
- each HDFS cluster has a NameNode
- manages name space and access to data by clients
- determines mapping of data blocks to DataNodes
- provides file open/close/rename operations to clients
Data replication in Hadoop
- organised by the NameNode
- all blocks in a file are the same size
- determines placement of copies of each block
- uses HeartBeat and BlockReport info from DataNodes
- attempts to maximise reliability and performance
- e.g. keeps copies on separate DataNodes (obvious)
- e.g. assigns closest DataNode to client for read/write
- a complex optimisation problem, requires tuning
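A toy illustration of the block-to-DataNode mapping the NameNode maintains (this is not HDFS's real placement algorithm; round-robin placement and node names are invented for the sketch):

def place_blocks(n_blocks, datanodes, replication=3):
    placement = []
    for b in range(n_blocks):
        # keep each copy on a distinct DataNode; round-robin for simplicity
        copies = [datanodes[(b + r) % len(datanodes)] for r in range(replication)]
        placement.append(copies)
    return placement

print(place_blocks(4, ["dn1", "dn2", "dn3", "dn4"]))
# block 0 → ['dn1','dn2','dn3'], block 1 → ['dn2','dn3','dn4'], ...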
Distributed NoSQL "database management system" (Apache)
- provides a hybrid (key,value)/row-oriented store
- a column is a (name,value) pair (e.g. (Name,John))
- a supercolumn is also a (name,value) pair
- the value is a set of columns (cf tuple)
- a columnfamily is a set of (name,value) pairs (cf table)
- each value is effectively a supercolumn
- a keyspace is a namespace to hold objects (cf database)
Tables are distributed on a Hadoop DFS (DBA-specified replication)
CQL is the Cassandra query language (cf SQL)
create keyspace UNSW; use UNSW;
// don't need to specify all columns
create columnfamily Student (sid int primary key);
// columns are named as values are added
insert into Student (sid, name, degree)
values (12345, 'John Smith', 'BSc(CompSci)');
// familiar syntax ...
select * from Student where sid=12345;
// cannot reference non-family columns
select * from Student where degree='BSc(CompSci)';
// consistency level can be specified explicitly
update Student using consistency QUORUM
set degree = 'MIT' where sid = 12345;
Produced: 30 May 2016