OrpheusDB is a database system with versioning capabilities.
What is OrpheusDB?
With a growing number of individuals performing data science
in every organization and team,
there is a proliferation of dataset versions at various stages of data analysis.
More often than not, these dataset versions are stored in an ad-hoc manner
in shared file systems, leading to massive redundancy and duplication, and
making it impossible to keep track of and find specific versions.
OrpheusDB is a database system that supports versioning capabilities.
Since OrpheusDB is built on standard relational databases, it inherits
many of the benefits of relational databases, while also compactly storing,
keeping track of, and recreating versions on demand.
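To make "compactly storing" and "recreating versions on demand" concrete, here is a minimal sketch using Python's built-in sqlite3 module. It is not OrpheusDB's actual schema or API: the table names (records, version_records) and the helpers commit and checkout are hypothetical, chosen only to illustrate one way a relational database could store each distinct record once and let versions share it.

```python
import sqlite3

# Hypothetical layout (an assumption, not OrpheusDB's real schema):
#   records          stores each distinct record exactly once
#   version_records  maps a version id to the records it contains
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE records (
        rid   INTEGER PRIMARY KEY,
        name  TEXT,
        score INTEGER,
        UNIQUE (name, score)
    );
    CREATE TABLE version_records (vid INTEGER, rid INTEGER);
""")


def commit(conn, vid, rows):
    """Store rows as version vid, reusing records that already exist."""
    for name, score in rows:
        # A record identical to an existing one is not stored again.
        conn.execute(
            "INSERT OR IGNORE INTO records (name, score) VALUES (?, ?)",
            (name, score),
        )
        (rid,) = conn.execute(
            "SELECT rid FROM records WHERE name = ? AND score = ?",
            (name, score),
        ).fetchone()
        conn.execute("INSERT INTO version_records VALUES (?, ?)", (vid, rid))


def checkout(conn, vid):
    """Recreate the rows of version vid by joining on the mapping table."""
    return conn.execute(
        "SELECT r.name, r.score FROM records r "
        "JOIN version_records v ON v.rid = r.rid "
        "WHERE v.vid = ? ORDER BY r.rid",
        (vid,),
    ).fetchall()


v1 = [("alice", 90), ("bob", 70)]
v2 = [("alice", 90), ("bob", 80), ("carol", 50)]  # one edit, one addition
commit(conn, 1, v1)
commit(conn, 2, v2)

# The two versions hold 5 rows in total, but only 4 distinct records are stored.
stored = conn.execute("SELECT COUNT(*) FROM records").fetchone()[0]
print(stored)             # 4
print(checkout(conn, 1))  # [('alice', 90), ('bob', 70)]
```

In this sketch the unchanged record for alice is shared by both versions, which is the kind of redundancy that ad-hoc file copies cannot avoid.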
Video of OrpheusDB in Action
Why OrpheusDB?
While git and svn are great at source code version control, they are unfortunately unable to efficiently
support large unordered datasets. Moreover, they cannot support
the full range of operations supported natively by SQL.
OrpheusDB is built as a wrapper on top of traditional database systems,
with no modifications to the underlying database. At the same time, OrpheusDB
supports a set of git-style version control commands, including checkout, commit, init, create_user, config, whoami, ls, drop, and optimize. Lastly, OrpheusDB supports a rich SQL syntax for queries against either known or unknown version(s) of a particular dataset.
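The difference between querying a known version and querying unknown versions can be illustrated with plain SQL over a toy layout. The sketch below uses Python's built-in sqlite3 module; the tables (records, version_records) and the sample data are assumptions made for illustration and do not reflect OrpheusDB's actual query syntax or storage format.

```python
import sqlite3

# Toy layout (hypothetical): each record is stored once, and a mapping
# table lists which records belong to which version.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE records (rid INTEGER PRIMARY KEY, name TEXT, score INTEGER);
    CREATE TABLE version_records (vid INTEGER, rid INTEGER);
    INSERT INTO records VALUES
        (1, 'alice', 90), (2, 'bob', 70), (3, 'bob', 80), (4, 'carol', 50);
    INSERT INTO version_records VALUES
        (1, 1), (1, 2),          -- version 1: alice, bob (70)
        (2, 1), (2, 3), (2, 4);  -- version 2: alice, bob (80), carol
""")

# Known version: run an ordinary filter against version 2 only.
known = conn.execute("""
    SELECT r.name, r.score
    FROM records r JOIN version_records v ON v.rid = r.rid
    WHERE v.vid = 2 AND r.score >= 80
    ORDER BY r.name
""").fetchall()
print(known)    # [('alice', 90), ('bob', 80)]

# Unknown version: ask which versions contain a matching record at all.
unknown = conn.execute("""
    SELECT DISTINCT v.vid
    FROM records r JOIN version_records v ON v.rid = r.rid
    WHERE r.name = 'bob' AND r.score < 75
    ORDER BY v.vid
""").fetchall()
print(unknown)  # [(1,)]
```

The second query is the kind of question a file-based version control tool cannot answer without checking out every version in turn.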
OrpheusDB is an offshoot of the MIT DataHub project, which is aimed
at developing a platform for collaborative data analytics. Head here for more details.
Key Features
OrpheusDB is a hosted system that supports relational dataset version management, with the following design innovations:
OrpheusDB is built on top of a traditional relational database, and thus inherits all of the standard benefits of relational database systems "for free".
OrpheusDB supports advanced querying and versioning capabilities, via both SQL queries and git-style version control commands.
OrpheusDB uses a sophisticated data model, coupled with partition optimization algorithms, to provide efficient version control performance over large-scale datasets.
Recent Releases
Here's a YouTube video of our version browser interface in action!
OrpheusDB is being developed at Illinois by a team of graduate students headed by Prof. Aaron Elmore and Aditya Parameswaran. The list of developers includes (in alphabetical order): Silu Huang, Sili Hui, and Liqi Xu.
Please reach out to the lead PhD students, Silu Huang
(shuang86@illinois.edu) or Liqi Xu (liqixu2@illinois.edu)
if you'd like to contribute or become
a beta tester of OrpheusDB!