This patch uses process-wide per-store locking to ensure that only one
NomsBlockStore instance is ever trying to update the upstream NBS
manifest at a time. It also locks out attempts to fetch the manifest
contents during that window.
Conjoining is now much simpler. Since only one instance can ever be in
the critical path of Commit at a time, and conjoining is triggered on
that critical path, we now simply perform the conjoin while excluding
all other in-process NBS instances. Hopefully, locking out instances
that want to fetch the manifest contents during a conjoin won't
cripple performance.
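The process-wide per-store locking described above can be sketched as a registry of locks keyed by store; the names here (`lockFor`, `storeLocks`) are hypothetical, not the actual identifiers in the patch.

```go
package main

import (
	"fmt"
	"sync"
)

// A process-wide registry of per-store locks, keyed by the store's path.
// Every NomsBlockStore instance backed by the same store must acquire the
// same lock before updating (or fetching) the upstream manifest.
var (
	registryMu sync.Mutex
	storeLocks = map[string]*sync.Mutex{}
)

// lockFor returns the single process-wide lock for |storePath|, creating it
// on first use.
func lockFor(storePath string) *sync.Mutex {
	registryMu.Lock()
	defer registryMu.Unlock()
	mu, ok := storeLocks[storePath]
	if !ok {
		mu = &sync.Mutex{}
		storeLocks[storePath] = mu
	}
	return mu
}

func main() {
	mu := lockFor("/path/to/store")
	mu.Lock() // excludes every other in-process instance of this store
	// ... update the upstream manifest ...
	mu.Unlock()
	fmt.Println("manifest updated under the store-wide lock")
}
```

Two instances opened against the same path get the very same `*sync.Mutex`, which is what makes the exclusion process-wide rather than per-instance.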
Fixes #3583
If NomsBlockStore can assume that its manifest is a cachingManifest,
it can pre-emptively check to see if someone else in-process has
already moved the manifest forward and, if so, fail early.
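The early-out can be sketched as below; `cachedManifest` and `staleFor` are illustrative names, not the real API.

```go
package main

import (
	"fmt"
	"sync"
)

// cachedManifest stands in for the in-process manifest cache that a
// cachingManifest provides.
type cachedManifest struct {
	mu   sync.Mutex
	root string // latest root hash observed in-process
}

func (m *cachedManifest) set(root string) {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.root = root
}

// staleFor reports whether some other in-process instance has already moved
// the manifest past |last|, letting Commit fail early without ever touching
// persistent storage.
func (m *cachedManifest) staleFor(last string) bool {
	m.mu.Lock()
	defer m.mu.Unlock()
	return m.root != last
}

func main() {
	m := &cachedManifest{root: "r1"}
	fmt.Println(m.staleFor("r1")) // false: we're current, proceed with Commit
	m.set("r2")                   // another in-process instance commits
	fmt.Println(m.staleFor("r1")) // true: fail early
}
```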
Fixes #3574
Previously, every NomsBlockStore instance decided when to conjoin
tables (and which to conjoin) entirely on its own, which led to A LOT
of concurrent conjoining that would mostly be wasted effort, as one
instance would win the race and then all the rest would drop their
work on the floor, rebase, and continue. This patch introduces a
'conjoiner' that is either process-global, or owned by one of the NBS
factory objects you can create. Now, NBS instances vended by a given
factory call this single conjoiner during Commit(), asking it to
perform a conjoin if necessary. If a conjoin is already underway, the
conjoiner blocks the caller until it's finished and then
returns. Whether the conjoin was triggered at the caller's request, or
the caller got to opportunistically piggyback on a conjoin already in
progress, the caller must rebase after Conjoin() returns.
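The block-and-piggyback behavior can be sketched like this, assuming a shared conjoiner object; the field names and the `doConjoin` callback are illustrative only.

```go
package main

import (
	"fmt"
	"sync"
)

// conjoiner sketch: one instance per process (or per NBS factory). Callers
// that arrive while a conjoin is underway block until it finishes,
// piggybacking on the in-flight work; either way, the caller must rebase
// after Conjoin returns.
type conjoiner struct {
	mu     sync.Mutex
	inProg *sync.WaitGroup // non-nil while a conjoin is running
	runs   int             // number of conjoins actually executed
}

func (c *conjoiner) Conjoin(doConjoin func()) {
	c.mu.Lock()
	if wg := c.inProg; wg != nil {
		c.mu.Unlock()
		wg.Wait() // piggyback: wait for the conjoin already in progress
		return
	}
	wg := &sync.WaitGroup{}
	wg.Add(1)
	c.inProg = wg
	c.mu.Unlock()

	doConjoin() // we are the one caller actually doing the work

	c.mu.Lock()
	c.runs++
	c.inProg = nil
	c.mu.Unlock()
	wg.Done() // release any piggybacking callers
}

func main() {
	c := &conjoiner{}
	c.Conjoin(func() { fmt.Println("conjoining upstream tables") })
	// after Conjoin returns, the caller rebases against the new manifest
}
```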
Fixes #3422
Clean up NBS cruft standing in the way of improvements:
Unmap buffer in newMmapTableReader()
By the time this function exits, we're done with this buffer.
Hanging on to it complicates lifetime management for the file
backing the mmapTableReader, which is something I'm trying to
make simpler. So...ditch it!
remove compactSourcesToBuffer
replace with simpler test-focused version
The old compaction code loaded all chunks to be compacted into memory,
assembled a compacted table, and then persisted it to backing storage.
The nice thing about this was that we could de-dup chunks across the
compacted tables. The bad thing was that we needed to hold all the
chunks in memory at once.

That turned out to be a problem, so we've moved to a new strategy that
calculates only the merged index for the compacted table in memory,
but streams chunk data directly from old tables to the new, big
table. This should be a big win on S3 at least, because it turns out
that for tables with > 5MB and < 5GB of chunk data, we can actually
just tell S3 to reference a range of the existing object when building
a compacted table.
Fixes #3411
BatchStore is dead, long live ChunkStore! Merging these two required
some modification of the old ChunkStore contract to make it more
BatchStore-like in places, most specifically around Root(), Put() and
PutMany().
The first big change is that Root() now returns a cached value for the
root hash of the Store. This is how NBS worked already, so the more
interesting change here is the addition of Rebase(), which loads the
latest persistent root. Any chunks that appeared in backing storage
since the ChunkStore was opened (or last rebased) also become
visible.
UpdateRoot() has been replaced with Commit(), because UpdateRoot() was
ALREADY doing the work of persisting novel chunks as well as moving
the persisted root hash of the ChunkStore in both NBS and
httpBatchStore. This name, and the new contract (essentially Flush() +
UpdateRoot()), is a more accurate representation of what's going on.
As for Put(), the former contract claimed to block until the chunk
was durable. That's no longer the case. Indeed, NBS was already not
fulfilling this contract. The new contract reflects this, asserting
that novel chunks aren't persisted until a Flush() or Commit() --
which has replaced UpdateRoot(). Novel chunks are immediately visible
to Get and Has calls, however.
In addition to this larger change, there are also some tweaks to
ValueStore and Database. ValueStore.Flush() no longer takes a hash,
and instead just persists any and all Chunks it has buffered since the
last time anyone called Flush(). Database.Close() used to have some
side effects where it persisted Chunks belonging to any Values the
caller had written -- that is no longer so. Values written to a
Database only become persistent upon a Commit-like operation (Commit,
CommitValue, FastForward, SetHead, or Delete).
/******** New ChunkStore interface ********/
type ChunkStore interface {
	ChunkSource
	RootTracker
}
// RootTracker allows querying and management of the root of an entire tree of
// references. The "root" is the single mutable variable in a ChunkStore. It
// can store any hash, but it is typically used by higher layers (such as
// Database) to store a hash to a value that represents the current state and
// entire history of a database.
type RootTracker interface {
	// Rebase brings this RootTracker into sync with the persistent storage's
	// current root.
	Rebase()

	// Root returns the currently cached root value.
	Root() hash.Hash

	// Commit atomically attempts to persist all novel Chunks and update the
	// persisted root hash from last to current. If last doesn't match the
	// root in persistent storage, returns false.
	// TODO: is last now redundant? Maybe this should just try to update from
	// the cached root to current?
	// TODO: Does having a separate RootTracker make sense anymore? BUG 3402
	Commit(current, last hash.Hash) bool
}
// ChunkSource is a place chunks live.
type ChunkSource interface {
	// Get the Chunk for the value of the hash in the store. If the hash is
	// absent from the store, nil is returned.
	Get(h hash.Hash) Chunk

	// GetMany gets the Chunks with |hashes| from the store. On return, all
	// chunks that were found will have been sent to |foundChunks|. Any
	// absent chunks are silently ignored.
	GetMany(hashes hash.HashSet, foundChunks chan *Chunk)

	// Has returns true iff the value at the address |h| is contained in the
	// source.
	Has(h hash.Hash) bool

	// HasMany returns a new HashSet containing any members of |hashes| that
	// are present in the source.
	HasMany(hashes hash.HashSet) (present hash.HashSet)

	// Put caches c in the ChunkSink. Upon return, c must be visible to
	// subsequent Get and Has calls, but must not be persistent until a call
	// to Flush(). Put may be called concurrently with other calls to Put(),
	// PutMany(), Get(), GetMany(), Has() and HasMany().
	Put(c Chunk)

	// PutMany caches chunks in the ChunkSink. Upon return, all members of
	// chunks must be visible to subsequent Get and Has calls, but must not
	// be persistent until a call to Flush(). PutMany may be called
	// concurrently with other calls to Put(), PutMany(), Get(), GetMany(),
	// Has() and HasMany().
	PutMany(chunks []Chunk)

	// Version returns the NomsVersion with which this ChunkSource is
	// compatible.
	Version() string

	// Flush ensures that, on return, any previously Put chunks are durable.
	// It is not safe to call Flush() concurrently with Put() or PutMany().
	Flush()

	io.Closer
}
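The intended calling pattern under the new contract (Commit, and on optimistic-lock failure Rebase and retry) can be shown with a tiny in-memory stand-in. `fakeStore` and `commitWithRetry` are illustrative only; the real interfaces live in the chunks package and use hash.Hash rather than strings.

```go
package main

import "fmt"

// fakeStore is a minimal in-memory stand-in for the new ChunkStore contract,
// tracking a "persisted" root and a locally cached root.
type fakeStore struct {
	persisted string
	cached    string
}

func (s *fakeStore) Root() string { return s.cached }
func (s *fakeStore) Rebase()      { s.cached = s.persisted }
func (s *fakeStore) Commit(current, last string) bool {
	if s.persisted != last {
		return false // optimistic lock failure: someone else committed
	}
	s.persisted, s.cached = current, current
	return true
}

// commitWithRetry attempts Commit; on failure it Rebases to pick up the
// winner's changes, then tries again from the new cached root.
func commitWithRetry(s *fakeStore, newRoot string) {
	for !s.Commit(newRoot, s.Root()) {
		s.Rebase()
	}
}

func main() {
	s := &fakeStore{persisted: "r0", cached: "r0"}
	s.persisted = "r1" // another client commits behind our back
	commitWithRetry(s, "r2")
	fmt.Println(s.persisted) // r2
}
```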
Fixes #2945
Introduce a "lock" hash into NBS manifests to address the bad
interaction between Flush() and optimistic locking. Our original
design didn't include Flush(), which changes the set of tables without
updating the root. Thus... an optimistic locking strategy predicated
on checking the currently-persisted root hash is not robust to
interleaved Flush() calls from multiple clients.
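The lock-hash idea can be sketched like this; the struct layout and `tryUpdate` name are assumptions, not the real manifest code.

```go
package main

import "fmt"

// manifest sketch: updates are guarded by a dedicated lock hash that changes
// on every manifest write, including Flush()es that add tables without
// moving the root. Checking the root alone would miss those writes.
type manifest struct {
	lock, root string
	tables     []string
}

// tryUpdate succeeds only if the caller read the latest lock hash.
func (m *manifest) tryUpdate(lastLock, newLock, newRoot string, tables []string) bool {
	if m.lock != lastLock {
		return false // somebody (perhaps a Flush) wrote the manifest since our read
	}
	m.lock, m.root, m.tables = newLock, newRoot, tables
	return true
}

func main() {
	m := &manifest{lock: "l0", root: "r0"}
	m.tryUpdate("l0", "l1", "r0", []string{"t1"}) // a Flush: tables change, root doesn't
	ok := m.tryUpdate("l0", "l2", "r1", nil)      // stale lock: must fail
	fmt.Println(ok, m.root) // false r0
}
```

Note that the second update fails even though the root it read ("r0") is still current, which is exactly the case a root-only check would get wrong.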
Fixes #3349
Though Raf and I can't figure out how, it's clear that the method we
initially used for calculating the max amount of space for
snappy-compressed chunk data was incorrect. That's the root cause of
this bug. The bound is now computed by walking all the chunks to be
written and summing the snappy.MaxEncodedLen() for each.
Fixes #3156
There's some case that causes chunks that compress to more than about
55k (we think these are quite big chunks, many hundreds of KB in size)
not to wind up correctly inserted into tables. It looks like the
snappy library believes the buffer we've allocated may not be large
enough, so it allocates its own space, and this screws us up.
This patch changes two things:
1) The CRC in the NBS format is now the CRC of the _compressed_ data
2) Such chunks will be manually copied into the table, so they won't
be missing anymore
Also, when the code detects a case where the snappy library decided to
allocate its own storage, it saves the uncompressed data off to the
side, so that it can be pushed to durable storage. Such chunks are
stored on disk or in S3 named like "<chunk-hash>-errata", and logging
is dumped out so we can figure out which tables were supposed to
contain these chunks.
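The detection can be sketched as below: `snappy.Encode(dst, src)` returns a slice backed by dst when dst was large enough, and a freshly allocated slice otherwise, so comparing the backing arrays reveals which happened. `usedOwnBuffer` is an illustrative helper, not code from the patch.

```go
package main

import "fmt"

// usedOwnBuffer reports whether |out| (the slice returned by an
// Encode-style call) is backed by the caller-supplied |dst|, or by a buffer
// the library allocated on its own.
func usedOwnBuffer(dst, out []byte) bool {
	if len(dst) == 0 || len(out) == 0 {
		return true
	}
	return &dst[0] != &out[0]
}

func main() {
	dst := make([]byte, 8)
	fmt.Println(usedOwnBuffer(dst, dst[:4]))         // false: our buffer was used
	fmt.Println(usedOwnBuffer(dst, make([]byte, 4))) // true: library allocated its own
}
```

When the check fires, the patch saves the uncompressed data aside (the "<chunk-hash>-errata" objects) instead of silently losing the chunk.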
Towards #3156
* First pass at compaction
The first cut at compaction blocks UpdateRoot() while it compacts n/2
tables down into a single, large table (where n == number of tables
named in the NBS manifest). It then attempts to update the manifest
with one referencing the compacted table, the novel tables from the
client, and the remaining upstream tables that were not compacted.
If the update fails, probably due to an optimistic lock failure, the
client drops the compacted table it just created, pulls in the tables
from the newly-discovered upstream manifest, and tries again.
Known flaws:
- may explode RAM (#3130)
- doesn't handle novel tables > max tables (#3142)
- may handle optimistic-lock-failures suboptimally (#3141)
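The table-selection step can be sketched as follows. Choosing the n/2 smallest tables is an assumption made for illustration; the commit text says only that n/2 of the upstream tables are compacted.

```go
package main

import (
	"fmt"
	"sort"
)

// chooseToCompact splits the upstream tables (by size) into a set to merge
// into one large table and a set to leave alone. Compacting the smallest
// half is a plausible heuristic, not necessarily the real one.
func chooseToCompact(tableSizes []int) (compact, keep []int) {
	sorted := append([]int(nil), tableSizes...)
	sort.Ints(sorted)
	n := len(sorted) / 2
	return sorted[:n], sorted[n:]
}

func main() {
	compact, keep := chooseToCompact([]int{5, 1, 4, 2})
	fmt.Println(compact, keep) // [1 2] [4 5]
}
```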
Fixes #3132
Also fixes #2944, because doing so simplifies some code.
Adds the ability to stream individual chunks requested via GetMany() back to caller.
Removes readAmpThresh and maxReadSize. Lowers the S3ReadBlockSize to 512k.
GetMany() calls can now be serviced by <= N goroutines, where N is the number of physical reads the request is broken down into.
This patch also adds a maxReadSize param to the code which decides how to break chunk reads into physical reads, and sets the S3 blockSize to 5MB, which experimentally resulted in lower total latency.
Lastly, some small refactors.
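The read-planning step can be sketched like this: walk the (sorted) chunk spans and merge each into the previous physical read unless doing so would push that read past maxReadSize. `planReads` and the exact merge policy are assumptions for illustration.

```go
package main

import "fmt"

// span is a byte range within a table file.
type span struct{ off, length int }

// planReads coalesces per-chunk reads into larger physical reads, capping
// each physical read at maxReadSize. Input spans are assumed sorted by
// offset.
func planReads(chunks []span, maxReadSize int) []span {
	var reads []span
	for _, c := range chunks {
		if n := len(reads); n > 0 {
			last := &reads[n-1]
			if c.off+c.length-last.off <= maxReadSize {
				last.length = c.off + c.length - last.off // extend the previous read
				continue
			}
		}
		reads = append(reads, c) // start a new physical read
	}
	return reads
}

func main() {
	reads := planReads([]span{{0, 10}, {10, 10}, {100, 10}}, 64)
	fmt.Println(len(reads), reads[0].length) // 2 20
}
```

Each resulting physical read can then be handled by its own goroutine, which is where the "<= N goroutines" bound comes from.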
Before we can defragment NBS stores, we need to understand how
fragmented they are. This tool provides a measure of fragmentation in
which optimal chunk-graph layout implies that ALL children of a given
parent can be read in one storage-layer operation (e.g. disk read, S3
transaction, etc).
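One way to sketch the measure: for each parent, count how many storage-layer reads its children require under the current layout; an optimal layout needs exactly one. The helper below is a simplification of that idea, not the tool itself.

```go
package main

import "fmt"

// readsForChildren counts the distinct storage-layer reads needed to fetch
// all of a parent's children, given their byte offsets and the size of one
// storage-layer operation. An optimal layout yields 1.
func readsForChildren(childOffsets []int, readSize int) int {
	blocks := map[int]bool{}
	for _, off := range childOffsets {
		blocks[off/readSize] = true
	}
	return len(blocks)
}

func main() {
	fmt.Println(readsForChildren([]int{0, 100, 200}, 4096))   // 1: optimal
	fmt.Println(readsForChildren([]int{0, 5000, 9000}, 4096)) // 3: fragmented
}
```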
Introduce a 'compactingChunkStore', which knows how to compact itself
in the background. It satisfies get/has requests from an in-memory
table until compaction is complete. Once compaction is done, it
destroys the in-memory table and switches over to using solely the
persistent table.
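The switch-over can be sketched as below; the struct and method names are illustrative stand-ins for the real compactingChunkStore.

```go
package main

import (
	"fmt"
	"sync"
)

// compactingStore serves get requests from an in-memory table until the
// background compaction lands, after which the memory table is destroyed
// and only the persistent table is consulted.
type compactingStore struct {
	mu        sync.Mutex
	mem       map[string][]byte // nil once compaction has completed
	persisted map[string][]byte
}

func (s *compactingStore) get(h string) []byte {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.mem != nil {
		if c, ok := s.mem[h]; ok {
			return c
		}
	}
	return s.persisted[h]
}

// compactionDone is called by the background goroutine once the persistent
// table is ready; the in-memory table is dropped.
func (s *compactingStore) compactionDone(persisted map[string][]byte) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.persisted = persisted
	s.mem = nil
}

func main() {
	s := &compactingStore{mem: map[string][]byte{"h1": []byte("chunk")}}
	fmt.Println(string(s.get("h1"))) // served from memory during compaction
	s.compactionDone(map[string][]byte{"h1": []byte("chunk")})
	fmt.Println(string(s.get("h1"))) // served from the persistent table
}
```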
Fixes #2879
This patch introduces/expands the 'manifest' and 'tableSet'
abstractions, so that NomsBlockStore is no longer explicitly using any
file system operations.
Towards issue #2877