Commit Graph

32 Commits

Author SHA1 Message Date
cmasone-attic ccdf08c4f8 NBS: Serialize Commit() calls within a process (#3594)
This patch uses process-wide per-store locking to ensure that only one
NomsBlockStore instance is ever trying to update the upstream NBS
manifest at a time. It also locks out attempts to fetch the manifest
contents during that window.

Conjoining is now much simpler. Since only one instance can ever be in
the critical path of Commit at a time, and conjoining is triggered on
that critical path, we now simply perform the conjoin while excluding
all other in-process NBS instances. Hopefully, locking out instances
that want to fetch the manifest contents during a conjoin won't cripple
performance.

Fixes issue #3583
2017-07-20 14:04:43 -07:00
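The process-wide, per-store locking this commit describes could be sketched roughly as follows. This is a minimal illustration, not the actual Noms code; the names (`lockForStore`, keying by store directory) are hypothetical.

```go
package main

import "sync"

// One mutex per store location ensures that only one NomsBlockStore
// instance in this process updates (or fetches) the upstream manifest
// at a time. Names here are illustrative, not the real Noms internals.
var (
	locksMu sync.Mutex
	locks   = map[string]*sync.Mutex{}
)

// lockForStore returns the single mutex associated with a store's
// location, creating it on first use. Commit() and manifest fetches
// would both take this lock.
func lockForStore(dir string) *sync.Mutex {
	locksMu.Lock()
	defer locksMu.Unlock()
	l, ok := locks[dir]
	if !ok {
		l = &sync.Mutex{}
		locks[dir] = l
	}
	return l
}

func main() {
	a := lockForStore("/tmp/store1")
	b := lockForStore("/tmp/store1")
	c := lockForStore("/tmp/store2")
	println(a == b, a != c) // the same store always yields the same lock
}
```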
Rafael Weinstein 0736ca8b6c add store-granularity locking to manifest cache (#3576) 2017-06-28 17:53:09 -07:00
cmasone-attic fa40c6044f NBS: updateManifest() fails fast if Update is DOOOOOOMED (#3575)
If NomsBlockStore can assume that its manifest is a cachingManifest,
it can pre-emptively check to see if someone else in-process has
already moved the manifest forward and, if so, fail early.

Fixes #3574
2017-06-28 13:04:33 -07:00
Rafael Weinstein 3ff92950d8 Revert removal of |last| from Commit() (#3531) 2017-06-09 11:20:45 -07:00
Rafael Weinstein 214054986b Enforce clearer concurrency semantics of ValueStore (#3527) 2017-06-08 11:40:22 -07:00
cmasone-attic e014edfa66 NBS: s3TablePersister caches tables locally on write (#3507) 2017-05-31 12:10:37 -07:00
cmasone-attic 961970f155 NBS: add cache-on-write behavior for manifests (#3503)
Fixes #3494
2017-05-30 13:06:12 -07:00
cmasone-attic 3201a5c9e5 Add Read/WriteManifestLatency stats (#3495)
Fixes #3494
2017-05-22 16:59:32 -07:00
cmasone-attic 5ae0b5063f NBS: Avoid concurrent (in-process) conjoins to a given store (#3484)
Previously, every NomsBlockStore instance decided when to conjoin
tables (and which to conjoin) entirely on its own, which led to A LOT
of concurrent conjoining that would mostly be wasted effort, as one
instance would win the race and then all the rest would drop their
work on the floor, rebase, and continue. This patch introduces a
'conjoiner' that is either process-global, or owned by one of the NBS
factory objects you can create. Now, NBS instances vended by a given
factory call this single conjoiner during Commit(), asking it to
perform a conjoin if necessary. If a conjoin is already underway, the
conjoiner blocks the caller until it's finished and then
returns. Whether the conjoin was triggered at the caller's request, or
the caller got to opportunistically piggyback on a conjoin already in
progress, the caller must rebase after Conjoin() returns.

Fixes #3422
2017-05-18 16:40:28 -07:00
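The block-or-piggyback behavior described above can be sketched with a condition variable: the first caller performs the conjoin, callers arriving mid-conjoin wait for it to finish, and every caller rebases afterwards. Names and structure are hypothetical, not the actual Noms conjoiner.

```go
package main

import "sync"

// conjoiner serializes conjoins: at most one is in flight per process
// (or per factory, in the real design).
type conjoiner struct {
	mu         sync.Mutex
	inProgress bool
	cond       *sync.Cond
}

func newConjoiner() *conjoiner {
	c := &conjoiner{}
	c.cond = sync.NewCond(&c.mu)
	return c
}

// ConjoinIfNeeded runs doConjoin unless another goroutine is already
// conjoining, in which case it blocks until that conjoin completes
// (the "piggyback" case). Either way, the caller must rebase after
// this returns.
func (c *conjoiner) ConjoinIfNeeded(doConjoin func()) {
	c.mu.Lock()
	if c.inProgress {
		for c.inProgress {
			c.cond.Wait() // piggyback on the in-flight conjoin
		}
		c.mu.Unlock()
		return
	}
	c.inProgress = true
	c.mu.Unlock()

	doConjoin() // perform the conjoin outside the lock

	c.mu.Lock()
	c.inProgress = false
	c.cond.Broadcast()
	c.mu.Unlock()
}

func main() {
	c := newConjoiner()
	runs := 0
	c.ConjoinIfNeeded(func() { runs++ })
	println(runs)
}
```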
Rafael Weinstein a3cde48690 Instrument NBS with perf metrics (#3449) 2017-05-05 17:48:07 -07:00
cmasone-attic 6e217538a8 Clean up NBS cruft (#3451)
Clean up NBS cruft standing in the way of improvements:

Unmap buffer in newMmapTableReader()
By the time this function exits, we're done with this buffer.
Hanging on to it complicates lifetime management for the file
backing the mmapTableReader, which is something I'm trying to
make simpler. So...ditch it!

remove compactSourcesToBuffer
replace with simpler test-focused version
2017-05-04 23:08:15 -07:00
cmasone-attic c32d4e917f Streaming Compaction (#3434)
The old compaction code loaded all chunks to be compacted into memory, assembled a compacted table, and then persisted it to backing storage. The nice thing about this was that we could de-dup chunks across the compacted tables. The bad thing was that we needed to hold all the chunks in memory at once. That turned out to be a problem, so we've moved to a new strategy that calculates only the merged index for the compacted table in memory, but streams chunk data directly from old tables to the new, big table. This should be a big win on S3 at least, because it turns out that for tables with > 5MB and < 5GB of chunk data, we can actually just tell S3 to reference a range of the existing object when building a compacted table.

Fixes #3411
2017-05-01 08:55:36 -07:00
cmasone-attic ff7cae6d34 Merge chunks.RootTracker interface into chunks.ChunkStore (#3408)
You can't fully specify RootTracker without referring to the
ChunkStore interface, so they should just merge.

Fixes #3402
2017-04-19 21:34:20 -07:00
cmasone-attic cb930dee81 Merge BatchStore into ChunkStore (#3403)
BatchStore is dead, long live ChunkStore! Merging these two required
some modification of the old ChunkStore contract to make it more
BatchStore-like in places, most specifically around Root(), Put() and
PutMany().

The first big change is that Root() now returns a cached value for the
root hash of the Store. This is how NBS worked already, so the more
interesting change here is the addition of Rebase(), which loads the
latest persistent root. Any chunks that appeared in backing storage
since the ChunkStore was opened (or last rebased) also become
visible.

UpdateRoot() has been replaced with Commit(), because UpdateRoot() was
ALREADY doing the work of persisting novel chunks as well as moving
the persisted root hash of the ChunkStore in both NBS and
httpBatchStore. This name, and the new contract (essentially Flush() +
UpdateRoot()), is a more accurate representation of what's going on.

As for Put(), the former contract claimed to block until the chunk
was durable. That's no longer the case. Indeed, NBS was already not
fulfilling this contract. The new contract reflects this, asserting
that novel chunks aren't persisted until a Flush() or Commit() --
which has replaced UpdateRoot(). Novel chunks are immediately visible
to Get and Has calls, however.

In addition to this larger change, there are also some tweaks to
ValueStore and Database. ValueStore.Flush() no longer takes a hash,
and instead just persists any and all Chunks it has buffered since the
last time anyone called Flush(). Database.Close() used to have some
side effects where it persisted Chunks belonging to any Values the
caller had written -- that is no longer so. Values written to a
Database only become persistent upon a Commit-like operation (Commit,
CommitValue, FastForward, SetHead, or Delete).

/******** New ChunkStore interface ********/

type ChunkStore interface {
     ChunkSource
     RootTracker
}

// RootTracker allows querying and management of the root of an entire tree of
// references. The "root" is the single mutable variable in a ChunkStore. It
// can store any hash, but it is typically used by higher layers (such as
// Database) to store a hash to a value that represents the current state and
// entire history of a database.
type RootTracker interface {
     // Rebase brings this RootTracker into sync with the persistent storage's
     // current root.
     Rebase()

     // Root returns the currently cached root value.
     Root() hash.Hash

     // Commit atomically attempts to persist all novel Chunks and update the
     // persisted root hash from last to current. If last doesn't match the
     // root in persistent storage, returns false.
     // TODO: is last now redundant? Maybe this should just try to update from
     // the cached root to current?
     // TODO: Does having a separate RootTracker make sense anymore? BUG 3402
     Commit(current, last hash.Hash) bool
}

// ChunkSource is a place chunks live.
type ChunkSource interface {
     // Get the Chunk for the value of the hash in the store. If the hash is
     // absent from the store nil is returned.
     Get(h hash.Hash) Chunk

     // GetMany gets the Chunks with |hashes| from the store. On return, all
     // chunks that were found will have been sent to |foundChunks|. Any
     // non-present chunks are silently ignored.
     GetMany(hashes hash.HashSet, foundChunks chan *Chunk)

     // Returns true iff the value at the address |h| is contained in the
     // source
     Has(h hash.Hash) bool

     // Returns a new HashSet containing any members of |hashes| that are
     // present in the source.
     HasMany(hashes hash.HashSet) (present hash.HashSet)

     // Put caches c in the ChunkSink. Upon return, c must be visible to
     // subsequent Get and Has calls, but must not be persistent until a call
     // to Flush(). Put may be called concurrently with other calls to Put(),
     // PutMany(), Get(), GetMany(), Has() and HasMany().
     Put(c Chunk)

     // PutMany caches chunks in the ChunkSink. Upon return, all members of
     // chunks must be visible to subsequent Get and Has calls, but must not be
     // persistent until a call to Flush(). PutMany may be called concurrently
     // with other calls to Put(), PutMany(), Get(), GetMany(), Has() and
     // HasMany().
     PutMany(chunks []Chunk)

     // Returns the NomsVersion with which this ChunkSource is compatible.
     Version() string

     // On return, any previously Put chunks must be durable. It is not safe to
     // call Flush() concurrently with Put() or PutMany().
     Flush()

     io.Closer
}

Fixes #2945
2017-04-19 13:31:58 -07:00
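The Commit(current, last) contract above implies an optimistic retry loop in callers: on failure, Rebase, pick up the new root, and try again. A minimal sketch against a toy in-memory root tracker (hypothetical, not the Noms implementation):

```go
package main

type Hash string

// memRootTracker simulates the persisted root plus this instance's
// cached view of it.
type memRootTracker struct {
	persisted Hash
	cached    Hash
}

func (m *memRootTracker) Rebase()    { m.cached = m.persisted }
func (m *memRootTracker) Root() Hash { return m.cached }

// Commit succeeds only if |last| matches the persisted root, per the
// interface contract above.
func (m *memRootTracker) Commit(current, last Hash) bool {
	if m.persisted != last {
		return false // someone else moved the root; caller must Rebase
	}
	m.persisted = current
	m.cached = current
	return true
}

// commitWithRetry loops until the optimistic update lands. A real
// caller would also re-merge its novel work against each new root.
func commitWithRetry(m *memRootTracker, current Hash) {
	last := m.Root()
	for !m.Commit(current, last) {
		m.Rebase()
		last = m.Root()
	}
}

func main() {
	m := &memRootTracker{persisted: "r0", cached: "r0"}
	m.persisted = "r1" // simulate a concurrent writer landing first
	commitWithRetry(m, "r2")
	println(m.Root() == Hash("r2"))
}
```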
cmasone-attic fe2c476469 Fix NBS optimistic locking (#3353)
Introduce a "lock" hash into NBS manifests to address the bad
interaction between Flush() and optimistic locking. Our original
design didn't include Flush(), which changes the set of tables without
updating the root. Thus... an optimistic locking strategy predicated
on checking the currently-persisted root hash is not robust to
interleaved Flush() calls from multiple clients.

Fixes #3349
2017-04-07 16:55:39 -07:00
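The fix above can be illustrated by checking a lock value that changes on *every* manifest write — including a Flush, which changes the table set without moving the root — instead of checking the root alone. The struct and fields here are hypothetical, not the real NBS manifest format.

```go
package main

// manifest is a toy stand-in: lock changes on every write, root moves
// only on Commit.
type manifest struct {
	lock   string
	root   string
	tables []string
}

// tryUpdate installs next only if the caller saw the current lock.
// A check against the root alone would wrongly succeed after an
// interleaved Flush, which rewrites tables but leaves root untouched.
func tryUpdate(m *manifest, sawLock string, next manifest) bool {
	if m.lock != sawLock {
		return false // manifest written since we read it
	}
	*m = next
	return true
}

func main() {
	m := &manifest{lock: "l0", root: "r0"}
	// A concurrent Flush adds a table: root unchanged, lock moved.
	tryUpdate(m, "l0", manifest{lock: "l1", root: "r0", tables: []string{"t1"}})
	// Our commit, prepared against lock "l0", now correctly fails.
	println(tryUpdate(m, "l0", manifest{lock: "l2", root: "r1"}))
}
```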
Rafael Weinstein 9527907674 Nbs local store factory (#3191)
Add NBS LocalStoreFactory
2017-02-14 20:52:30 -08:00
cmasone-attic 83235c7965 NBS: Calculate maxTableSize precisely (#3165)
Though Raf and I can't figure out how, it's clear that the method we
initially used for calculating the max amount of space for
snappy-compressed chunk data was incorrect. That's the root cause of
#3156. The fix is to calculate the max table size precisely, by iterating
all of the chunks to be written and summing the snappy.MaxEncodedLen()
for each.

Fixes #3156
2017-02-09 11:46:06 -08:00
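The precise calculation could be sketched as below: sum the snappy worst-case size for each chunk plus fixed per-chunk table overhead. The overhead constant is illustrative (not the real NBS table format), and `snappyMaxEncodedLen` mirrors the Go snappy package's bound so the sketch stays dependency-free; real code would call snappy.MaxEncodedLen directly.

```go
package main

// perChunkOverhead is an illustrative per-chunk cost: a 20-byte hash,
// a 4-byte length, and a 4-byte CRC. The real table format defines its own.
const perChunkOverhead = 20 + 4 + 4

// snappyMaxEncodedLen mirrors the worst-case bound used by the Go
// snappy package (32 + n + n/6).
func snappyMaxEncodedLen(n int) int { return 32 + n + n/6 }

// maxTableSize returns an upper bound on the table size needed to hold
// the given chunk payloads once snappy-compressed: the sum of each
// chunk's worst-case compressed size plus per-chunk overhead.
func maxTableSize(chunkLens []int) (total uint64) {
	for _, n := range chunkLens {
		total += uint64(snappyMaxEncodedLen(n) + perChunkOverhead)
	}
	return total
}

func main() {
	println(maxTableSize([]int{100, 1000}))
}
```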
cmasone-attic 8cfc5e6512 Gather more info about Bug 3156 (#3158)
There's some case that causes chunks that compress to more than about
55k (we think these are quite big chunks, many hundreds of K in size)
not to wind up correctly inserted into tables. It looks like
the snappy library believes the buffer we've allocated may not be
large enough, so it allocates its own space and this screws us up.

This patch changes two things:
1) The CRC in the NBS format is now the CRC of the _compressed_ data
2) Such chunks will be manually copied into the table, so they won't
   be missing anymore

Also, when the code detects a case where the snappy library decided to
allocate its own storage, it saves the uncompressed data off to the
side, so that it can be pushed to durable storage. Such chunks are
stored on disk or in S3 named like "<chunk-hash>-errata", and logging
is dumped out so we can figure out which tables were supposed to
contain these chunks.

Towards #3156
2017-02-07 15:43:06 -08:00
cmasone-attic 8e40ee4959 First pass at compaction (#3143)
* First pass at compaction

The first cut at compaction blocks UpdateRoot() while it compacts n/2
tables down into a single, large table (where n == number of tables
named in the NBS manifest). It then attempts to update the manifest
with one referencing the compacted table, the novel tables from the
client, and the remaining upstream tables that were not compacted.

If the update fails, probably due to an optimistic lock failure, the
client drops the compacted table it just created, pulls in the tables
from the newly-discovered upstream manifest, and tries again.

Known flaws:
- may explode RAM (#3130)
- doesn't handle novel tables > max tables (#3142)
- may handle optimistic-lock-failures suboptimally (#3141)

Fixes #3132

Also, fixes #2944 because doing so simplifies some code.
2017-02-03 15:58:04 -08:00
Aaron Boodman a09ef6fb44 Revert "Introduce noms version 8. Use it to guard type simplification." (#3043) 2017-01-09 16:30:25 -08:00
Aaron Boodman a4ffa5ba9b Introduce noms version 8. Use it to guard type simplification. (#3035)
Introduce noms version 8. Use it to guard type simplification.
2017-01-06 17:32:32 -08:00
Rafael Weinstein 3242f18c20 [NBS] Implement Streaming GetMany (#3002)
Adds the ability to stream individual chunks requested via GetMany() back to caller.

Removes readAmpThresh and maxReadSize. Lowers the S3ReadBlockSize to 512k.
2017-01-03 12:25:01 -08:00
Rafael Weinstein d8d8c6c7e1 Parallel s3 Slice reads (#2979)
GetMany() calls can now be serviced by <= N goroutines, where N is the number of physical reads the request is broken down into.

This patch also adds a maxReadSize param to the code which decides how to break chunk reads into physical reads, and sets the s3 blockSize to 5MB, which experimentally resulted in lower total latency.

Lastly, some small refactors.
2016-12-22 11:45:33 -08:00
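The fan-out described above — one goroutine per physical read, so at most N goroutines for N reads — can be sketched as follows. The `byteRange` type and read callback are illustrative stand-ins for the real S3 read path.

```go
package main

import "sync"

// byteRange is a toy physical-read descriptor.
type byteRange struct{ off, len int }

// readAll issues every physical read from its own goroutine and
// collects the results in order. Each slot of out is written by exactly
// one goroutine, so no extra locking is needed.
func readAll(ranges []byteRange, read func(byteRange) []byte) [][]byte {
	out := make([][]byte, len(ranges))
	var wg sync.WaitGroup
	for i, r := range ranges {
		wg.Add(1)
		go func(i int, r byteRange) {
			defer wg.Done()
			out[i] = read(r)
		}(i, r)
	}
	wg.Wait()
	return out
}

func main() {
	data := []byte("hello world")
	got := readAll([]byteRange{{0, 5}, {6, 5}}, func(r byteRange) []byte {
		return data[r.off : r.off+r.len]
	})
	println(string(got[0]), string(got[1]))
}
```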
cmasone-attic d129580007 Add frag tool to measure nbs fragmentation (#2963)
Before we can defragment NBS stores, we need to understand how
fragmented they are. This tool provides a measure of fragmentation in
which optimal chunk-graph layout implies that ALL children of a given
parent can be read in one storage-layer operation (e.g. disk read, S3
transaction, etc).
2016-12-20 17:01:18 -08:00
Rafael Weinstein cc8ffacddf Factor tableIndex out of tableReader (#2950)
Factor tableIndex out of tableReader
2016-12-14 12:41:01 -08:00
Rafael Weinstein c159876992 Make read amplification threshold configurable (#2941) 2016-12-13 09:57:41 -08:00
cmasone-attic 7f36fad716 tablePersister.Compact returns a chunkSource (#2939)
It turns out the only caller of Compact() immediately
turns around and calls Open, so why don't I just do
that FOR you?

Fixes #2935
2016-12-13 06:20:33 -08:00
cmasone-attic de6e49c9e0 compactingChunkStore crash fix (#2936)
compactingChunkStore.close() must wait for compactions to finish.
2016-12-12 14:43:46 -08:00
cmasone-attic 7fe3b18a6b Make compaction async (#2934)
Introduce a 'compactingChunkStore', which knows how to compact itself
in the background. It satisfies get/has requests from an in-memory
table until compaction is complete. Once compaction is done, it
destroys the in-memory table and switches over to using solely the
persistent table.

Fixes #2879
2016-12-12 14:15:30 -08:00
cmasone-attic b3eef38fa4 Break NomsBlockStore dependency on disk storage (#2905)
This patch introduces/expands the 'manifest' and 'tableSet'
abstractions, so that NomsBlockStore is no longer explicitly using any
file system operations

Towards issue #2877
2016-12-05 09:05:40 -08:00
Rafael Weinstein a67bb9bf7b Minor rework of hash.Hash API (#2888)
Define the hash.Hash type to be a 20-byte array, rather than embed one. Hash API Changes: `hash.FromSlice` -> `hash.New`, `hash.FromData` -> `hash.Of`
2016-12-02 12:11:00 -08:00
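The reworked API above — Hash defined as a 20-byte array rather than a struct embedding one, with `FromSlice` renamed to `New` and `FromData` to `Of` — might look roughly like this. The digest function is stdlib sha1 purely for illustration; it is not what Noms actually uses.

```go
package main

import "crypto/sha1"

// Hash is a 20-byte array, so it is comparable and copyable by value.
type Hash [20]byte

// New builds a Hash from a byte slice (formerly hash.FromSlice).
func New(s []byte) Hash {
	var h Hash
	copy(h[:], s)
	return h
}

// Of digests arbitrary data into a Hash (formerly hash.FromData).
// sha1 is a stand-in digest here, not the real Noms hash function.
func Of(data []byte) Hash {
	return Hash(sha1.Sum(data))
}

func main() {
	h := Of([]byte("noms"))
	println(h == New(h[:])) // array types compare by value
}
```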
Rafael Weinstein a00a5f5611 Implement experimental block store (#2870)
* Move NBS into Noms

* vendor in deps
2016-12-01 10:04:09 -08:00