dolt/README.md
2016-06-06 13:28:27 -07:00

# Store All the Things

Noms is a content-addressed, immutable, decentralized, strongly-typed database.

In other words, Noms is Git for data.

This repository contains two reference implementations of the database—one in Go, and one in JavaScript. It also includes a number of tools and sample applications.

Setup

  1. Install Go 1.6+
  2. Ensure your $GOPATH is configured
  3. Type type type:

    git clone https://github.com/attic-labs/noms $GOPATH/src/github.com/attic-labs/noms
    go install github.com/attic-labs/noms/cmd/...

    noms log http://demo.noms.io/cli-tour:film-locations

Samples  |  Command-Line Tour  |  JavaScript SDK Tour  |  Intro to Noms

Features

  • Versioning: Each commit is retained and can be viewed or reverted.
  • Type inference: Each dataset has a precise schema that is automatically inferred.
  • Atomic commits: Immutability enables atomic commits of any size.
  • Diff: Compare structured datasets of any size efficiently.
  • Schema versioning: Narrow or widen schemas instantly, without rewriting data.
  • Sorted indexes: Fast range queries on a single attribute or a combination of attributes.
  • Fork: Create your own isolated branch of a dataset to work on.
  • Schema validation (soon): Optionally constrain commit types on a per-dataset basis.
  • Insanely easy import: Noms auto-dedupes snapshots and generates a precise changelog.
  • Sync: Sync disconnected database instances efficiently and correctly.
  • Structural typing: Index, search, and match data by structure shape.
  • Awesome export: Use dataset history to precisely apply sync changes out of Noms.

Use Cases

We're just getting started, but here are a few use cases we think Noms is especially well-suited for:

Data Collaboration

Work on data together. Track changes, fork, merge, sync, etc. The entire Git workflow, but on large-scale, structured or unstructured data. Useful for teams doing data analysis, cleansing, enrichment, etc.

ETL

Noms should work really well as a backing store for ETL pipelines. Noms-backed ETL is naturally:

  • Incremental: Noms datasets can be efficiently diffed, so only the changed data needs to be run through the pipeline.
  • Versioned: Any transform can be compared to the previous run and trivially undone or re-applied.
  • Idempotent: If a transform fails in the middle for any reason, it can simply be re-run. A transform's result will always be the same no matter how many times it is run.
  • Auditable: Content-addressing enables precisely tracking inputs to each transform and result.
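
To make the "incremental" property concrete, here is a minimal sketch in plain Go. It does not use the Noms API: `Snapshot` and `ChangedKeys` are hypothetical names, and a dataset version is assumed to be a simple key/value map. This naive version scans both snapshots in full, so it only illustrates the shape of the pipeline, not how Noms computes diffs.

```go
package main

import (
	"fmt"
	"sort"
)

// Snapshot is a hypothetical stand-in for one version of a dataset:
// a map from record key to record value.
type Snapshot map[string]string

// ChangedKeys returns, in sorted order, every key that was added,
// removed, or modified between prev and next.
func ChangedKeys(prev, next Snapshot) []string {
	var changed []string
	// Added or modified records.
	for k, v := range next {
		if old, ok := prev[k]; !ok || old != v {
			changed = append(changed, k)
		}
	}
	// Removed records.
	for k := range prev {
		if _, ok := next[k]; !ok {
			changed = append(changed, k)
		}
	}
	sort.Strings(changed)
	return changed
}

func main() {
	prev := Snapshot{"a": "1", "b": "2", "c": "3"}
	next := Snapshot{"a": "1", "b": "20", "d": "4"}
	// Only the changed records need to go back through the pipeline.
	for _, k := range ChangedKeys(prev, next) {
		fmt.Println("reprocess:", k)
	}
}
```

Running the transform over only the keys that `ChangedKeys` returns is also what makes a re-run after a failure cheap: unchanged input produces no work.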

Data Integration and Enrichment

Noms should also be a natural way to collect, index, and integrate data from disparate sources.

Because Noms is content-addressed, it naturally deduplicates all data, so importers can be trivially simple: just dump coarse-grained snapshots periodically and have only the changes re-processed (see clients/js/fb, client/js/flickr for some early examples of this).
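
The deduplication idea can be sketched with a toy content-addressed store. This is an illustration only, not the Noms implementation: `Store`, `NewStore`, `Put`, and `Len` are hypothetical names, SHA-256 is an assumed hash choice, and each "row" is treated as one chunk.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// Store is a hypothetical in-memory content-addressed store.
type Store struct {
	chunks map[string][]byte
}

// NewStore returns an empty store.
func NewStore() *Store {
	return &Store{chunks: make(map[string][]byte)}
}

// Put stores data under the hex SHA-256 of its bytes and reports
// whether the chunk was new. Identical data always maps to the same
// address, so writing it again is a no-op.
func (s *Store) Put(data []byte) (addr string, added bool) {
	sum := sha256.Sum256(data)
	addr = hex.EncodeToString(sum[:])
	if _, ok := s.chunks[addr]; ok {
		return addr, false
	}
	s.chunks[addr] = append([]byte(nil), data...) // keep a private copy
	return addr, true
}

// Len returns the number of distinct chunks held.
func (s *Store) Len() int { return len(s.chunks) }

func main() {
	s := NewStore()
	// Two coarse-grained snapshots; the second repeats most of the first.
	snapshots := [][]string{
		{"row-1", "row-2"},
		{"row-1", "row-2", "row-3"},
	}
	for _, snapshot := range snapshots {
		for _, row := range snapshot {
			if _, added := s.Put([]byte(row)); added {
				fmt.Println("new:", row)
			}
		}
	}
	fmt.Println("chunks stored:", s.Len())
}
```

The importer here never compares snapshots itself. It dumps everything, and the store's addressing scheme decides what is actually new.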

Metadata in such an environment can be modeled non-destructively, as assertions from source object to metadata. Such assertions would be naturally versioned and revertible. They would also be owned by the program that made them, meaning they could be manipulated en masse, leading to easy experimentation.
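
One way to picture such assertions is the rough sketch below. The `Assertion` type, its fields, and `WithoutOwner` are all hypothetical and not part of Noms; they only show how metadata can live beside the source object and be retracted by owner in one step.

```go
package main

import "fmt"

// Assertion is a hypothetical record: the program named Owner claims
// that the object identified by Subject has metadata Key=Value.
// The source object itself is never modified.
type Assertion struct {
	Subject string
	Key     string
	Value   string
	Owner   string
}

// WithoutOwner returns a new slice with every assertion made by owner
// removed. The input slice is left untouched.
func WithoutOwner(all []Assertion, owner string) []Assertion {
	kept := make([]Assertion, 0, len(all))
	for _, a := range all {
		if a.Owner != owner {
			kept = append(kept, a)
		}
	}
	return kept
}

func main() {
	all := []Assertion{
		{"photo-1", "label", "beach", "tagger-v1"},
		{"photo-1", "place", "Hawaii", "geocoder"},
		{"photo-2", "label", "dog", "tagger-v1"},
	}
	// Retract everything an experimental tagger produced in one step.
	fmt.Println(len(WithoutOwner(all, "tagger-v1")), "assertion(s) remain")
}
```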

Decentralized Database

Noms should be a natural fit to move data around certain kinds of widely decentralized applications. Rather than moving raw data files, e.g., with rsync, and then rebuilding the database at each node, just move the database itself.
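
As a rough illustration of moving the database itself, the sketch below copies only the chunks a destination node is missing. It assumes a content-addressed store modeled as a plain map; `ChunkStore` and `Sync` are hypothetical names, the addresses are placeholder strings, and no real network transport or Noms API is involved.

```go
package main

import "fmt"

// ChunkStore is a hypothetical content-addressed store:
// address -> immutable chunk bytes.
type ChunkStore map[string][]byte

// Sync copies into dst every chunk of src that dst does not already
// hold and returns how many chunks were transferred. Because chunks
// are immutable and addressed by content, an address present in both
// stores refers to the same bytes, so it can be skipped safely.
func Sync(dst, src ChunkStore) int {
	moved := 0
	for addr, data := range src {
		if _, ok := dst[addr]; ok {
			continue
		}
		dst[addr] = data
		moved++
	}
	return moved
}

func main() {
	nodeA := ChunkStore{
		"addr1": []byte("x"),
		"addr2": []byte("y"),
		"addr3": []byte("z"),
	}
	nodeB := ChunkStore{"addr1": []byte("x")}

	fmt.Println("transferred:", Sync(nodeB, nodeA)) // first sync moves the missing chunks
	fmt.Println("transferred:", Sync(nodeB, nodeA)) // second sync has nothing left to move
}
```

Unlike copying raw files and rebuilding, the receiving node ends up with the same addressable chunks as the sender, and a repeated sync transfers nothing.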

Get Involved

Noms is developed in the open. Come say hi.