The Big Data Platform Landscape

by (10flow.com)
Last update: October 16, 2012

The number of Big Data platforms has exploded. This map organizes Big Data technologies into three main categories. From the bottom up, the Data Storage Layer shows databases and filesystems that provide parallel, distributed storage of and access to data sets on the order of hundreds of terabytes to petabytes. Next, the Data Access Layer platforms provide middleware for persisting and querying data in the underlying storage. Finally, the Application Layer outlines major use cases or classes of algorithms that operate on Big Data.

Note that most packages listed here are free, open source software. The map is intended to chart the technology landscape, not to provide a directory of commercial products or start-ups. If you click a node on the map, you'll see a short description and links to publications or websites, if applicable.

Please let me know what I'm missing! Contact me on Twitter or Google+, or leave a comment on my blog.

Big Data Platforms
Application Layer: Science/HPC, Astronomy, CFD/Simulation, Search/Query, Graph Analytics, PageRank, Subgraph Detection, Belief Propagation, Clustering/Classification
Data Access Layer: ORM/Object Persistence, Kundera, GAE Datastore, Ad Hoc Query, Hive, Pig, SQL, Dremel, SPARQL, Linear Algebra/Signal Processing, GraphLab, D4M, Pregel
Data Storage Layer: RDBMS, Oracle, Postgres, MySQL, Distributed Filesystem, HDFS, Lustre, BigTable/Triple Store, HBase, Accumulo, Cassandra, Spanner, Array Store, SciDB, Graph Store, Titan, Neo4J

Data Storage Layer
Distributed Filesystem
HDFS

Hadoop Distributed Filesystem (HDFS)

http://hadoop.apache.org/docs/r0.17.1/hdfs_design.html

Distributes data across commodity cluster nodes; handles replication and failover; implemented in Java
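To make the replication and failover idea concrete, here is a toy sketch in Python. It is not HDFS code and does not reproduce HDFS's actual placement policy; the function names, node names, and the simple round-robin placement are all assumptions chosen for illustration. It only shows why keeping several replicas of each block on distinct nodes lets the data survive a node failure.

```python
def place_blocks(num_blocks, nodes, replication=3):
    """Assign each block to `replication` distinct nodes (toy round-robin policy)."""
    if replication > len(nodes):
        raise ValueError("replication factor exceeds number of nodes")
    placement = {}
    for block in range(num_blocks):
        # Consecutive nodes, wrapping around the cluster; replicas never repeat
        # a node because replication <= len(nodes).
        placement[block] = [nodes[(block + i) % len(nodes)]
                            for i in range(replication)]
    return placement


def surviving_replicas(placement, failed):
    """Return, per block, the replicas still reachable after `failed` nodes go down."""
    return {block: [n for n in replicas if n not in failed]
            for block, replicas in placement.items()}


# Hypothetical four-node cluster holding six blocks, three replicas each.
nodes = ["node-a", "node-b", "node-c", "node-d"]
placement = place_blocks(6, nodes, replication=3)
alive = surviving_replicas(placement, failed={"node-b"})
```

With three replicas per block, losing any single node still leaves at least two copies of every block, which is the property a real system would use to serve reads and re-replicate in the background.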

Lustre

http://wiki.lustre.org/index.php/Main_Page

Parallel distributed filesystem, often used in HPC systems; common among machines on the TOP500 list; typically deployed as a central filesystem on dedicated storage servers (unlike Hadoop clusters, in which compute nodes also serve as storage nodes)

Spanner

Google's new (2012) distributed datastore:

http://research.google.com/archive/spanner.html

Handles replication across geographically separate data centers (configurable on a per-application basis); presents a SQL-like query model; relational data model with transaction support; relies on a complex clock-synchronization scheme (the TrueTime API)

Array Store
SciDB

http://www.scidb.org/about/publications.php

Stores arrays of tuples for scientific applications; supports basic and user-defined operations/queries distributed across a cluster
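The idea of running an operation over a chunked array of tuples can be sketched in plain Python. This is not SciDB's API or query language; the helper names, the (reading, valid) tuple layout, and the chunk size are assumptions made for illustration. Each chunk stands in for the portion of the array held by one cluster node.

```python
def chunk(cells, chunk_size):
    """Split a 1-D array of tuples into fixed-size chunks (one per hypothetical node)."""
    return [cells[i:i + chunk_size] for i in range(0, len(cells), chunk_size)]


def distributed_aggregate(chunks, per_chunk, combine):
    """Apply a user-defined function to every chunk, then combine the partial results."""
    # In a real cluster each per_chunk call would run on the node holding that chunk;
    # here they simply run one after another.
    partials = [per_chunk(c) for c in chunks]
    return combine(partials)


# Array of (reading, valid) tuples; only even-indexed readings are marked valid.
cells = [(float(i), i % 2 == 0) for i in range(10)]
chunks = chunk(cells, 4)

# User-defined operation: sum the valid readings within one chunk.
total = distributed_aggregate(
    chunks,
    per_chunk=lambda c: sum(reading for reading, valid in c if valid),
    combine=sum,
)
```

The split into a per-chunk step and a combine step is the design choice that matters: only small partial results, not the array itself, need to move between nodes.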

Graph Store

HBase, the Hadoop database:

http://hbase.apache.org/

Built on HDFS; implements Google's BigTable data model

Titan

http://thinkaurelius.github.com/titan/

Graph data model and operators built on top of HBase/Cassandra
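As a rough illustration of layering a graph model on a BigTable-style store, the sketch below keeps one row per vertex and one column per incident edge. It is not Titan's API, and the column naming scheme, the `ColumnStore` stand-in, and the `Graph` class are all hypothetical; a Python dictionary takes the place of HBase or Cassandra.

```python
from collections import defaultdict


class ColumnStore:
    """Stand-in for a wide-column store: row key -> {column name -> value}."""

    def __init__(self):
        self.rows = defaultdict(dict)

    def put(self, row, column, value):
        self.rows[row][column] = value

    def scan_row(self, row, prefix=""):
        """Return the row's columns whose names start with `prefix`, in sorted order."""
        return {c: v for c, v in sorted(self.rows[row].items())
                if c.startswith(prefix)}


class Graph:
    """Graph operators expressed as reads and writes against the column store."""

    def __init__(self, store):
        self.store = store

    def add_edge(self, src, label, dst):
        # Store the edge in both endpoint rows so either direction is a single-row read.
        self.store.put(src, f"out:{label}:{dst}", dst)
        self.store.put(dst, f"in:{label}:{src}", src)

    def neighbors(self, vertex, label):
        """Vertices reachable from `vertex` over outgoing edges with this label."""
        return list(self.store.scan_row(vertex, f"out:{label}:").values())


g = Graph(ColumnStore())
g.add_edge("alice", "knows", "bob")
g.add_edge("alice", "knows", "carol")
friends = g.neighbors("alice", "knows")
```

Because every edge of a vertex lives in that vertex's row, a neighbor lookup becomes a prefix scan over one row, which is the kind of access pattern a wide-column store handles well.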

Neo4J

http://neo4j.org/

Popular graph database; likely to soon support sharding across a cluster