OrpheusDB is a database system with versioning capabilities.

What is OrpheusDB?

As more individuals in every organization and team perform data science, dataset versions proliferate at various stages of data analysis. More often than not, these versions are stored in an ad hoc manner in shared file systems, which leads to massive redundancy and duplication and makes it impossible to keep track of and find specific versions.
OrpheusDB is a database system that supports versioning capabilities. Because OrpheusDB is built on standard relational databases, it inherits many of their benefits, while also compactly storing, keeping track of, and recreating versions on demand.

Video of OrpheusDB in Action

Why OrpheusDB?

While git and svn are great at source code version control, they are unfortunately unable to efficiently support large unordered datasets. Moreover, they cannot support the full range of operations supported natively by SQL.
OrpheusDB is built as a wrapper on top of traditional database systems, with no modifications to the underlying database. At the same time, OrpheusDB supports a set of git-style version control commands: checkout, commit, init, create_user, config, whoami, ls, drop, and optimize. Lastly, OrpheusDB supports a rich SQL syntax for queries against either known or unknown version(s) of a particular dataset.
OrpheusDB is an offshoot of the MIT DataHub project, which is aimed at developing a platform for collaborative data analytics. Head here for more details.

Key Features

OrpheusDB is a hosted system that supports relational dataset version management, with the following design innovations:
  • OrpheusDB is built on top of a traditional relational database, so it inherits all of the standard benefits of relational database systems "for free".
  • OrpheusDB supports advanced querying and versioning capabilities, via both SQL queries and git-style version control commands.
  • OrpheusDB uses a sophisticated data model, coupled with partition optimization algorithms, to provide efficient version control performance over large-scale datasets.
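
To make the idea of compact, git-style versioning on top of an unmodified relational database more concrete, here is a toy sketch in Python using SQLite. It is not OrpheusDB's actual schema, API, or partitioning scheme: the table names (records, versions, membership) and the functions (init, commit, checkout) are hypothetical and chosen only for illustration. The sketch assumes that each distinct record is stored once and that each version merely lists the records it contains, so unchanged records are shared across versions. It also treats a version as a set of rows, so duplicate rows within one version collapse.

```python
import sqlite3


def init(conn):
    # Hypothetical layout: every distinct record is stored once in `records`;
    # `membership` lists which records belong to which version.
    conn.executescript("""
        CREATE TABLE records (
            rid INTEGER PRIMARY KEY,
            name TEXT NOT NULL,
            score INTEGER NOT NULL,
            UNIQUE (name, score)
        );
        CREATE TABLE versions (
            vid INTEGER PRIMARY KEY,
            parent INTEGER,
            message TEXT
        );
        CREATE TABLE membership (
            vid INTEGER NOT NULL,
            rid INTEGER NOT NULL,
            PRIMARY KEY (vid, rid)
        );
    """)


def commit(conn, rows, parent=None, message=""):
    """Store `rows` as a new version and return its version id."""
    cur = conn.execute(
        "INSERT INTO versions (parent, message) VALUES (?, ?)",
        (parent, message),
    )
    vid = cur.lastrowid
    for name, score in rows:
        # A record that already exists is reused rather than copied.
        conn.execute(
            "INSERT OR IGNORE INTO records (name, score) VALUES (?, ?)",
            (name, score),
        )
        rid = conn.execute(
            "SELECT rid FROM records WHERE name = ? AND score = ?",
            (name, score),
        ).fetchone()[0]
        conn.execute(
            "INSERT OR IGNORE INTO membership (vid, rid) VALUES (?, ?)",
            (vid, rid),
        )
    conn.commit()
    return vid


def checkout(conn, vid):
    """Recreate the rows of version `vid` on demand."""
    return conn.execute(
        "SELECT r.name, r.score FROM records r "
        "JOIN membership m ON m.rid = r.rid "
        "WHERE m.vid = ? ORDER BY r.rid",
        (vid,),
    ).fetchall()


conn = sqlite3.connect(":memory:")
init(conn)
v1 = commit(conn, [("alice", 90), ("bob", 75)], message="initial import")
v2 = commit(conn, [("alice", 90), ("bob", 80)], parent=v1, message="fix bob")

# Ordinary SQL can also answer "which versions contain a matching record?"
matching = conn.execute(
    "SELECT DISTINCT m.vid FROM membership m "
    "JOIN records r ON r.rid = m.rid "
    "WHERE r.name = 'bob' AND r.score >= 80"
).fetchall()
```

In this sketch the two versions together hold four rows, but only three records are stored, because the unchanged row is shared. Both versions can still be recreated with checkout, and the final query shows how plain SQL can search across versions without knowing in advance which one is of interest.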

Recent Releases

  • Here's a YouTube video of our version browser interface in action!
  • Our version 1.0.0 release is out. Check it out now!
  • A preprint describing our data representation and partitioning schemes can be found here.
  • Our recent presentation can be found here.

Papers

Contact Us

OrpheusDB is being developed at Illinois by a team of graduate students headed by Profs. Aaron Elmore and Aditya Parameswaran. The list of developers includes (in alphabetical order): Silu Huang, Sili Hui, and Liqi Xu.
Please reach out to the lead PhD students, Silu Huang (shuang86@illinois.edu) or Liqi Xu (liqixu2@illinois.edu), if you'd like to contribute to OrpheusDB or be a beta tester!

With thanks to our funding sources: