Non-classical DBMSs


Parallel and Distributed Databases


Parallel and Distributed Systems2/37

The discussion so far has revolved around systems

[Diagram:Pics/parallel/one-processor-small.png]


Parallel Architectures3/37

Types:   shared memory,   shared disk,   shared nothing

Example shared-nothing architecture:

[Diagram:Pics/parallel/shared-nothing-small.png]

Typically in the same room   (data transfer cost ~ 100's of μsecs)


... Parallel Architectures4/37

Hierarchical architectures are hybrid parallel ones

[Diagram:Pics/parallel/hybrid-arch-small.png]

Typically on a local-area network   (data transfer cost ~ msecs)


Distributed Architectures5/37

Distributed architectures are ...

[Diagram:Pics/parallel/distrib-arch-small.png]

Typically on the Internet   (data transfer cost ~ secs)


Parallel Databases (PDBs)6/37

Parallel databases provide various forms of parallelism ...

PDBs typically run on closely-connected parallel architectures,
so we focus on hybrid architectures on a LAN.


Data Storage in PDBs 7/37

Consider each table as a collection of pages ...

Page addressing: (Table, File, PageNum)

If all data for one table resides on one node


... Data Storage in PDBs 8/37

However, with multiple nodes, we could ...

Could also have a combination of partitioning and replication


... Data Storage in PDBs 9/37

Data-partitioning example:

[Diagram:Pics/parallel/data-partition-small.png]


... Data Storage in PDBs 10/37

Data-partitioning strategies for one table:


Assume:   R(a,b,c,...),   D0 .. Dn-1 disks,   tup0 .. tupr-1 tuples

Storing data on many disks maximises chance for parallel data access


... Data Storage in PDBs 11/37

Round-robin partitioning:


... Data Storage in PDBs 12/37

Hash partitioning


... Data Storage in PDBs 13/37

Range partitioning


PostgreSQL and Parallelism14/37

PostgreSQL assumes

PostgreSQL allows So could run on ...


... PostgreSQL and Parallelism15/37

PostgreSQL can provide

Both need data synchronisation between servers

PostgreSQL uses notion of master and slave servers.


... PostgreSQL and Parallelism16/37

High availability ...

  • updates occur on master, recorded in tx log
  • tx logs shipped/streamed from master to slave(s)
  • slave uses tx logs to maintain current state
  • configuration controls frequency of log shipping
  • bringing slave up-to-date is fast (~1-2secs)

    Note: small window for data loss (committed tx log records not sent)


    Distributed Databases17/37

    Two kinds of distributed databases

    The latter are also called federated databases

    Distribution of data complicates tx processing ...


    ... Distributed Databases18/37

    Distributed tx processing handled by two-phase commit


    ... Distributed Databases19/37

    Distributed query processing

    Query optimisation in such contexts is difficult.


    Non-classical DBMSs


    Classical DBMSs21/37

    Assumptions made in conventional DBMSs:


    Modern DBMSs22/37

    Demands from modern applications

    Clearly, not all of these are relevant for every modern application.


    ... Modern DBMSs23/37

    Some conclusions:

    Some "modernists" claim that


    ... Modern DBMSs24/37

    Some approaches:


    Scale, Distribution, Replication25/37

    Data for modern applications is very large (TB, PB, XB)

    Many systems opt for massive networks of simple nodes Benefits:


    Schema-free Data Models26/37

    Many new DBMSs provide (key,value) stores

    Tables can be simulated by a collection of "similar" objects.


    Eventual Consistency27/37

    RDBMSs use a strong transactional/consistency model

    Many new DBMSs applications do not need strong consistency Because of distribution/replication


    ... Eventual Consistency28/37

    If different nodes have different versions of data

    Levels of consistency (from Cassandra system)


    MapReduce29/37

    MapReduce is a programming model

    Computation is structured in two phases:


    ... MapReduce30/37

    MapReduce makes use of (key,value) pairs

    Map(key1,val1) → list(key2,val2) Reduce(key2,list(val2)) → val3


    ... MapReduce31/37

    "Classic" MapReduce example (word frequency in set of docs):

    function map(String name, String document):
      // name: document name
      // document: document contents
      for each word w in document:
        emit (w, 1)
     
    function reduce(String word, Iterator partialCounts):
      // word: a word
      // partialCounts: list of aggregated partial counts
      sum = 0
      for each c in partialCounts:
        sum += c
      emit (word, sum)
    


    ... MapReduce32/37

    MapReduce as a "database language"


    Modern vs Classical33/37

    Some criticisms of the NoSQL approach:

    [Diagram:Pics/newdb/xtra-small.png]


    Hadoop DFS34/37

    Storage system to support distributed, replicated data (Apache)


    ... Hadoop DFS35/37

    Data replication in Hadoop


    Cassandra36/37

    Distributed NoSQL "database management system" (Apache)

    Tables are distributed on a Hadoop DFS  (DBA-specified replication)


    ... Cassandra37/37

    CQL is the Cassandra query language  (cf SQL)

    create keyspace UNSW;  use UNSW;
    // don't need to specify all columns
    create columnfamily Student (sid int primary key);
    // columns are named as values are added
    insert into Student (sid, name, degree)
    values (12345, 'John Smith', 'BSc(CompSci)');
    // familiar syntax ...
    select * from Student where sid=12345;
    // cannot reference non-family columns
    select * from Student where degree='BSc(CompSci)';
    // consistency can enter explicitly
    update Student using consistency QUORUM
    set degree = 'MIT' where sid = 12345;
    


    Produced: 30 May 2016