Big Data Platforms

Application Layer

Clustering/Classification

Data Access Layer

ORM/Object Persistence

Kundera

https://github.com/impetus-opensource/Kundera

JPA-based object persistence (ORM) for BigTable-style databases such as HBase and Cassandra

GAE Datastore

Google App Engine (GAE) storage model; provides object persistence and simple query support

Ad Hoc Query

Hive

http://hive.apache.org/

SQL-like query language (HiveQL) for data stored in flat files on HDFS; user-defined serializers/deserializers (SerDes) convert between file contents and table rows

Presentation on Facebook's use of Hive:
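The serialize/deserialize idea can be sketched in a few lines of Python. This is only an illustration, not Hive's actual SerDe interface (which is Java); the function names, the column list, and the tab-delimited layout are all assumptions made for the example.

```python
# Hypothetical helpers: a deserializer/serializer pair for a tab-delimited flat file.
from typing import Dict, List

COLUMNS: List[str] = ["user", "page", "views"]  # assumed table schema


def deserialize(line: str) -> Dict[str, str]:
    """Turn one tab-delimited line into a row keyed by column name."""
    return dict(zip(COLUMNS, line.rstrip("\n").split("\t")))


def serialize(row: Dict[str, str]) -> str:
    """Turn a row back into one tab-delimited line."""
    return "\t".join(row[c] for c in COLUMNS)


lines = ["alice\t/home\t3", "bob\t/about\t1", "alice\t/about\t2"]
rows = [deserialize(line) for line in lines]

# Roughly what "SELECT page, views FROM t WHERE user = 'alice'" would compute.
result = [(r["page"], int(r["views"])) for r in rows if r["user"] == "alice"]
```

The point of the split is that the query logic only ever sees rows; swapping the file format means swapping the two helper functions, not the query.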

http://www.slideshare.net/zshao/hive-data-warehousing-analytics-on-hadoop-presentation/

Pig

http://pig.apache.org/

Provides a new query language, Pig Latin: a SQL-like dataflow language whose scripts are compiled into MapReduce jobs; runs on Hadoop/HDFS
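For readers unfamiliar with MapReduce, the following is a minimal single-process Python sketch of the map, shuffle, and reduce phases using a word count. It is not Pig Latin and does not use Hadoop; the function names are made up for the example.

```python
# Single-process illustration of the MapReduce phases (word count).
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(records: Iterable[str]) -> Iterator[Tuple[str, int]]:
    """Emit a (word, 1) pair for every word in every input line."""
    for line in records:
        for word in line.split():
            yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group all emitted values by key, as the framework does between phases."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups: Dict[str, List[int]]) -> Dict[str, int]:
    """Sum the values collected for each key."""
    return {key: sum(values) for key, values in groups.items()}


counts = reduce_phase(shuffle(map_phase(["a b a", "b c"])))
```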

SQL

Dremel

http://research.google.com/pubs/pub36632.html

Google system for query/analysis of large-scale datasets

SPARQL

Query language for RDF

http://www.w3.org/TR/rdf-sparql-query/

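RDF (Resource Description Framework) is a data model developed for the Semantic Web; it describes resources as subject-predicate-object triples (web ontology languages such as OWL are built on top of it)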
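A SPARQL query is essentially a set of triple patterns matched against RDF data. The Python sketch below illustrates single-pattern matching over an in-memory list of triples; it is not a SPARQL engine, and the sample data and the `match` helper are assumptions made for the example.

```python
# Toy triple-pattern matcher; None plays the role of a SPARQL variable.
from typing import List, Optional, Tuple

Triple = Tuple[str, str, str]
Pattern = Tuple[Optional[str], Optional[str], Optional[str]]

triples: List[Triple] = [
    ("ex:alice", "foaf:knows", "ex:bob"),
    ("ex:alice", "foaf:name", "Alice"),
    ("ex:bob", "foaf:name", "Bob"),
]


def match(pattern: Pattern, data: List[Triple]) -> List[Triple]:
    """Return every triple whose fixed positions equal the pattern's."""
    return [
        t for t in data
        if all(p is None or p == v for p, v in zip(pattern, t))
    ]


# Roughly: SELECT ?name WHERE { ?s foaf:name ?name }
names = [obj for (_, _, obj) in match((None, "foaf:name", None), triples)]
```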

Linear Algebra/Signal Processing

GraphLab

Abstract:

http://www.select.cs.cmu.edu/publications/scripts/papers.cgi?Low+al:uai10graphlab

Publications:

http://graphlab.org/home/publications/

D4M (Dynamic Distributed Dimensional Data Model)

http://www.mit.edu/~kepner/D4M/

http://ieeexplore.ieee.org/xpl/articleDetails.jsp?tp=&arnumber=6289129&contentType=Conference+Publications&sortType%3Dasc_p_Sequence%26filter%3DAND(p_IS_Number%3A6287775)%26pageNumber%3D54

Pregel

http://dl.acm.org/citation.cfm?id=1582723

Google's system for large-scale graph processing
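The Pregel paper describes a vertex-centric model in which computation proceeds in synchronized supersteps and vertices exchange messages along edges. Below is a minimal single-machine Python sketch of that idea, propagating the maximum value through a graph. The function name and data layout are assumptions for illustration, and nothing here reflects Pregel's actual API.

```python
# Vertex-centric "propagate the maximum value" in synchronized supersteps.
from typing import Dict, List


def propagate_max(values: Dict[str, int], edges: Dict[str, List[str]]) -> Dict[str, int]:
    values = dict(values)  # do not mutate the caller's dict

    # Superstep 0: every vertex sends its value to its neighbours.
    inbox: Dict[str, List[int]] = {v: [] for v in values}
    for v in values:
        for n in edges.get(v, []):
            inbox[n].append(values[v])

    # Later supersteps: a vertex that learns a larger value adopts it and
    # forwards it; a vertex that learns nothing new stays silent.
    while any(inbox.values()):
        outbox: Dict[str, List[int]] = {v: [] for v in values}
        for v, msgs in inbox.items():
            if msgs and max(msgs) > values[v]:
                values[v] = max(msgs)
                for n in edges.get(v, []):
                    outbox[n].append(values[v])
        inbox = outbox
    return values


# A three-vertex chain a - b - c, with edges listed in both directions.
final = propagate_max(
    {"a": 3, "b": 6, "c": 2},
    {"a": ["b"], "b": ["a", "c"], "c": ["b"]},
)
```

The loop ends when no messages are in flight, which mirrors the "all vertices vote to halt" termination condition in spirit.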

Data Storage Layer

RDBMS

Sharding schemes for SQL databases:

http://www.25hoursaday.com/weblog/2009/01/16/BuildingScalableDatabasesProsAndConsOfVariousDatabaseShardingSchemes.aspx
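As one concrete example, hash-based sharding routes each key to a database by hashing it. The Python sketch below assumes four hypothetical shard names and uses MD5 purely as a stable hash; it is an illustration, not a recommendation of any particular scheme.

```python
# Hash-based shard routing; shard names are hypothetical.
import hashlib
from typing import List

SHARDS: List[str] = ["db0", "db1", "db2", "db3"]


def shard_for(key: str, shards: List[str] = SHARDS) -> str:
    """Map a key to a shard using a hash that is stable across processes."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return shards[int(digest, 16) % len(shards)]


target = shard_for("user:42")
```

A known drawback of plain modulo hashing is that changing the number of shards remaps most keys, which is one reason range-based and directory-based schemes are also used.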

Oracle

Postgres

MySQL

Distributed Filesystem

HDFS

Hadoop Distributed Filesystem (HDFS)

http://hadoop.apache.org/docs/r0.17.1/hdfs_design.html

Distributes data across commodity cluster nodes; handles replication and failover; implemented in Java
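To make replication and failover concrete, here is a small Python sketch that places each block on several nodes and checks whether every block is still readable after some nodes fail. The round-robin placement is an assumption chosen for simplicity; it is not HDFS's actual placement policy, and the function names are hypothetical.

```python
# Toy block placement with replication (round-robin; not HDFS's real policy).
from typing import Dict, List, Set


def place_blocks(num_blocks: int, nodes: List[str], replication: int = 3) -> Dict[int, List[str]]:
    """Assign each block to `replication` distinct nodes."""
    if replication > len(nodes):
        raise ValueError("replication factor exceeds number of nodes")
    return {
        b: [nodes[(b + i) % len(nodes)] for i in range(replication)]
        for b in range(num_blocks)
    }


def readable_after_failure(placement: Dict[int, List[str]], failed: Set[str]) -> bool:
    """True if every block still has at least one replica on a live node."""
    return all(
        any(node not in failed for node in replicas)
        for replicas in placement.values()
    )


nodes = ["n1", "n2", "n3", "n4"]
placement = place_blocks(5, nodes, replication=3)
```

With three replicas per block, any two node failures leave every block readable; losing all three nodes that hold the same block does not.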

Lustre

http://wiki.lustre.org/index.php/Main_Page

Parallel distributed filesystem, often used in HPC systems and common among TOP500 supercomputers; typically deployed as a central filesystem on dedicated storage machines (unlike Hadoop clusters, in which compute nodes also serve as storage nodes)

BigTable/Triple Store

HBase and Accumulo are implementations of the design described in Google's BigTable paper (2006):

http://research.google.com/archive/bigtable-osdi06.pdf

All data is stored as triples, i.e. (row_key, column_key, value)

Rows are sorted by row key and distributed across tablet servers, so any given row key can be looked up quickly; relationships between rows must be managed by the application (do-it-yourself); transaction support is absent or limited
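The sorted, range-partitioned layout can be sketched in Python as follows. The `Tablet` and `Table` classes and the split points are assumptions made for illustration; real implementations add timestamps, persistence, and much more.

```python
# Toy BigTable-style table: sorted (row_key, column_key) -> value, split into tablets.
import bisect
from typing import Dict, List, Tuple


class Tablet:
    """Holds one contiguous range of rows, kept sorted for binary search."""

    def __init__(self) -> None:
        self.keys: List[Tuple[str, str]] = []  # sorted (row_key, column_key)
        self.values: List[str] = []

    def put(self, row: str, col: str, value: str) -> None:
        key = (row, col)
        i = bisect.bisect_left(self.keys, key)
        if i < len(self.keys) and self.keys[i] == key:
            self.values[i] = value  # overwrite existing cell
        else:
            self.keys.insert(i, key)
            self.values.insert(i, value)

    def get_row(self, row: str) -> Dict[str, str]:
        # (row, "") sorts before every real column of that row.
        i = bisect.bisect_left(self.keys, (row, ""))
        out: Dict[str, str] = {}
        while i < len(self.keys) and self.keys[i][0] == row:
            out[self.keys[i][1]] = self.values[i]
            i += 1
        return out


class Table:
    """Routes each row key to the tablet that owns its key range."""

    def __init__(self, split_points: List[str]) -> None:
        self.split_points = sorted(split_points)
        self.tablets = [Tablet() for _ in range(len(self.split_points) + 1)]

    def tablet_for(self, row: str) -> Tablet:
        return self.tablets[bisect.bisect_right(self.split_points, row)]

    def put(self, row: str, col: str, value: str) -> None:
        self.tablet_for(row).put(row, col, value)

    def get_row(self, row: str) -> Dict[str, str]:
        return self.tablet_for(row).get_row(row)


table = Table(split_points=["m"])  # rows before "m" go to tablet 0, the rest to tablet 1
table.put("alice", "email", "a@example.com")
table.put("zed", "email", "z@example.com")
table.put("alice", "age", "30")
```

Because both the routing and the in-tablet lookup are binary searches over sorted keys, finding a row never requires scanning the whole table.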

HBase

http://hbase.apache.org/

Accumulo

Developed by the NSA and released as open source

http://accumulo.apache.org/

Adds cell-level security
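Cell-level security means every cell carries its own visibility label, and a scan returns only the cells a reader is authorized to see. The Python sketch below simplifies the label to a set of tokens that must all be held by the reader; this is an assumption made for brevity, since Accumulo's real visibility labels are boolean expressions, and the `scan` function here is hypothetical.

```python
# Simplified cell-level visibility: a cell is visible if the reader holds
# every token in its label (labels here are plain sets, an assumption).
from typing import FrozenSet, Iterable, List, Tuple

Cell = Tuple[str, str, FrozenSet[str], str]  # (row, column, required tokens, value)

cells: List[Cell] = [
    ("row1", "name", frozenset(), "Alice"),
    ("row1", "ssn", frozenset({"pii"}), "123-45-6789"),
    ("row1", "diagnosis", frozenset({"pii", "medical"}), "flu"),
]


def scan(data: List[Cell], authorizations: Iterable[str]) -> List[Tuple[str, str, str]]:
    """Return only the cells whose label is satisfied by the authorizations."""
    auths = set(authorizations)
    return [(r, c, v) for (r, c, required, v) in data if required <= auths]


public_view = scan(cells, [])
pii_view = scan(cells, ["pii"])
```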

Cassandra

http://cassandra.apache.org/

BigTable-style data model combined with a Dynamo-style distribution design; used by many large applications

Spanner

Google's distributed datastore (paper published in 2012):

http://research.google.com/archive/spanner.html

Handles replication across geographically separate data centers (configurable per application); presents a SQL-like query model over a relational data model with transaction support; relies on a complex clock-synchronization scheme (the TrueTime API)

Array Store

SciDB

http://www.scidb.org/about/publications.php

Stores arrays of tuples for scientific applications; supports built-in and user-defined operations and queries distributed across a cluster

Graph Store

HBase, "the Hadoop database":

http://hbase.apache.org/

Built on HDFS; implements BigTable

Titan

http://thinkaurelius.github.com/titan/

Graph data model and operators built on top of HBase/Cassandra
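One way to layer a graph on a BigTable-style store is to give each vertex a row and encode its properties and edges as columns. The Python sketch below uses a plain dict as a stand-in for the table; the column naming scheme and helper functions are assumptions for illustration and are not Titan's actual storage format or API.

```python
# Graph on top of a row/column store: one row per vertex, edges as columns.
from typing import Dict, List

# row_key -> {column_key: value}; stands in for an HBase/Cassandra table.
store: Dict[str, Dict[str, str]] = {}


def add_vertex(vid: str, **props: str) -> None:
    row = store.setdefault(vid, {})
    for name, value in props.items():
        row["prop:" + name] = value


def add_edge(src: str, label: str, dst: str) -> None:
    # Store the edge on both endpoints so it can be traversed either way.
    store.setdefault(src, {})["out:%s:%s" % (label, dst)] = ""
    store.setdefault(dst, {})["in:%s:%s" % (label, src)] = ""


def out_neighbors(vid: str, label: str) -> List[str]:
    prefix = "out:%s:" % label
    return sorted(c[len(prefix):] for c in store.get(vid, {}) if c.startswith(prefix))


add_vertex("alice", name="Alice")
add_vertex("bob", name="Bob")
add_vertex("carol", name="Carol")
add_edge("alice", "knows", "bob")
add_edge("alice", "knows", "carol")
```

Since all of a vertex's edges live in its own row, a one-hop traversal is a single row lookup in the underlying store.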