In httpBatchStore.GetMany(), we check our unwritten
puts to see if any of the requested chunks already
exist locally. If any do, we're _supposed_ to remove
their hashes from the set slated to be requested from
the server. That logic was borked.
Towards https://github.com/attic-labs/attic/issues/503
* Add zero check
Also Fixes #3063
Compaction not only persists the contents of a memTable, it also
filters out duplicate chunks. This means that calling count() on a
compactingChunkStore before and after compaction completes could lead
to different results. In the case where the memTable contains only
duplicate chunks, this is Very Bad because it leads to a non-existent
table winding up in the NBS manifest.
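The guard looks roughly like the sketch below (hypothetical memTable shape; the
real NBS compaction path differs in detail): count the chunks that survive
dedup and skip the table entirely when none do.

```go
package sketch

// Hypothetical stand-ins for the NBS types involved.
type Hash string
type memTable struct {
	chunks map[Hash][]byte
}
type tableSpec struct {
	name       string
	chunkCount int
}

// compact writes a table only if the memTable still holds novel chunks after
// filtering out duplicates that already exist in the store.
func compact(mt memTable, alreadyPersisted func(Hash) bool) (tableSpec, bool) {
	novel := 0
	for h := range mt.chunks {
		if !alreadyPersisted(h) {
			novel++
		}
	}
	if novel == 0 {
		// Everything was a duplicate; adding a table now would record a
		// non-existent table in the manifest.
		return tableSpec{}, false
	}
	return tableSpec{name: "sketch-table", chunkCount: novel}, true
}
```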
Fixes #3044
Chunk deserialization can run into errors sometimes if, e.g. the
client hangs up during a writeValue request. The old error strategy
worked by throwing a "catchable" error and recovering. That's OK if
you've only got one goroutine, but since the writeValue handler starts
so many goroutines, architecting the code to deal with error handling
by panic/recover is dicey.
Instead, make DeserializeToChan return an error in the more common
failure cases and handle it by passing it over a channel and raising
it from a central place.
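A sketch of that pattern, with a hypothetical decode callback standing in for
the real DeserializeToChan (whose exact signature is not reproduced here): each
goroutine reports failures on an error channel, and one central loop decides
what to surface.

```go
package sketch

import "sync"

type Chunk struct{ Data []byte }

// deserializeAll decodes each body in its own goroutine. Errors are sent over
// errChan instead of being thrown with panic, and the first one is returned
// from a single, central place. The caller is assumed to drain out while this
// runs.
func deserializeAll(bodies [][]byte, decode func([]byte) (Chunk, error), out chan<- Chunk) error {
	errChan := make(chan error, len(bodies))
	var wg sync.WaitGroup
	for _, b := range bodies {
		wg.Add(1)
		go func(b []byte) {
			defer wg.Done()
			c, err := decode(b)
			if err != nil {
				errChan <- err // report, don't panic
				return
			}
			out <- c
		}(b)
	}
	wg.Wait()
	close(errChan)
	for err := range errChan {
		return err // raise the first failure centrally
	}
	return nil
}
```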
The more code can use GetMany(), the better performance gets on top of
NBS. To this end, add a call to ValueStore that allows code to read
many values concurrently. This can be used e.g. by read-ahead code
that's navigating prolly trees to increase performance.
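As a hedged illustration (the readMany parameter stands in for the new
ValueStore call; its real signature may differ), read-ahead code can gather
the child hashes of a node and resolve them in one concurrent batch:

```go
package sketch

type (
	Hash  string
	Value interface{}
)

// prefetchChildren resolves all of a node's children with a single batched
// read instead of one ReadValue round trip per child. readMany is assumed to
// return values in the same order as the hashes it was given.
func prefetchChildren(childHashes []Hash, readMany func([]Hash) []Value) map[Hash]Value {
	prefetched := make(map[Hash]Value, len(childHashes))
	for i, v := range readMany(childHashes) {
		prefetched[childHashes[i]] = v
	}
	return prefetched
}
```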
Fixes #3019
In the edge case where all source chunks are already present in the
sink AND the sink dataset does not yet exist, the httpBatchStore code
was sending chunks in reverse order. This patch ensures that,
regardless of how few chunks are sent, any operation that sends chunks
to the server also resets chunk write order.
Fixes #3101
Now that we have GetMany, the server can use it directly to let the
chunk-fetching layer figure out the best way to batch up requests. A
small refactor allows ValidatingBatchingSink to directly update the
hint cache instead of relying on logic inside ReadValue to do it. I
made that change because ReadValue now also does a bunch of other
things around caching read values and checking a cache of 'pending
Puts' that will never have anything in it when used from the server.
Toward issue #3019
Prior to this patch, ValueStore only recorded which chunk each referenced Ref
resides in for chunks read from the server. This ignored the case of a client
doing subsequent commits with the same ValueStore (for example, writing
multiple states of a map), which forced the server to load a ton of chunks
just to validate.
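A rough sketch of the extra bookkeeping (hypothetical types; the real
ValueStore keys hints by hash.Hash): record a hint for every chunk the store
writes, not just for chunks it reads back from the server.

```go
package sketch

type Hash string

// hints maps a value's hash to a chunk the server can look in to validate
// refs to it.
type hints map[Hash]Hash

type valueStore struct {
	hints hints
}

// noteRead records where a value came from when it was read.
func (vs *valueStore) noteRead(valueHash, containingChunk Hash) {
	vs.hints[valueHash] = containingChunk
}

// noteWritten is the new part: values this client wrote are hinted too, so a
// follow-up commit from the same ValueStore doesn't force the server to
// reload chunks just to validate them.
func (vs *valueStore) noteWritten(valueHash, writtenChunk Hash) {
	vs.hints[valueHash] = writtenChunk
}
```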
First cut at "type simplification".
This dramatically reduces the sizes of the types generated by type
accretion, in some cases by several orders of magnitude.
Implemented as a drop-in replacement for MakeUnionType(), initially
behind a flag.
Fixes #2995
The old strategy for writing values was to recursively encode them,
putting the resulting chunks into a BatchStore from the bottom up as
they were generated. The BatchStore implementation was responsible for
handling concurrency, so chunks from different Values would be
interleaved if there were multiple calls to WriteValue happening
at the same time.
The new strategy tries to keep chunks from the same 'level' of a
graph together by caching chunks as they're encoded and only writing
them once they're referenced by some other value. When a collection
is written, the graph representing it is encoded recursively, and
chunks are generated bottom-up. The new strategy should, in practice,
mean that the children of a given parent node in this graph will be
cached until that parent gets written, and then they'll get written
all at once.
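The sketch below is a loose model of that strategy, with hypothetical types
rather than the real ValueStore internals: encoded chunks are parked in a
pending cache, and a chunk's pending children are flushed to the sink when the
chunk that references them arrives.

```go
package sketch

type Hash string
type Chunk struct {
	H    Hash
	Data []byte
}

// levelWriter caches encoded chunks and only forwards them to the sink once a
// chunk referencing them is written, so siblings from the same level of a
// graph tend to reach the sink together.
type levelWriter struct {
	pending map[Hash]Chunk
	sink    func(Chunk) // e.g. a BatchStore put
}

func (lw *levelWriter) Write(c Chunk, refs []Hash) {
	for _, r := range refs {
		if child, ok := lw.pending[r]; ok {
			delete(lw.pending, r)
			lw.sink(child) // children flush together, just before their parent is cached
		}
	}
	lw.pending[c.H] = c // the new chunk now waits for its own parent
}

// Flush drains anything still pending, e.g. the root at the end of a write.
func (lw *levelWriter) Flush() {
	for h, c := range lw.pending {
		delete(lw.pending, h)
		lw.sink(c)
	}
}
```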
Adds the ability to stream individual chunks requested via GetMany() back to the caller.
Removes readAmpThresh and maxReadSize. Lowers the S3ReadBlockSize to 512k.
The new spec is a URI, akin to what we use for HTTP. It allows the
specification of a DynamoDB table, an S3 bucket, a database ID, and a
dataset ID: aws://table-name:bucket-name/database::dataset
The bucket name is optional and, if not provided, Noms will use a
ChunkStore implementation backed only by DynamoDB.
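A minimal sketch of parsing that form (hypothetical helper, not the real spec
package): the bucket segment after ':' is optional, and '::' separates the
database from the dataset.

```go
package sketch

import (
	"fmt"
	"strings"
)

// parseAWSSpec splits aws://table[:bucket]/database[::dataset].
func parseAWSSpec(spec string) (table, bucket, db, ds string, err error) {
	const prefix = "aws://"
	if !strings.HasPrefix(spec, prefix) {
		return "", "", "", "", fmt.Errorf("not an aws spec: %s", spec)
	}
	rest := strings.TrimPrefix(spec, prefix)
	i := strings.Index(rest, "/")
	if i < 0 {
		return "", "", "", "", fmt.Errorf("missing database: %s", spec)
	}
	store, path := rest[:i], rest[i+1:]
	table, bucket = store, ""
	if j := strings.Index(store, ":"); j >= 0 {
		table, bucket = store[:j], store[j+1:] // no bucket => DynamoDB-only ChunkStore
	}
	db, ds = path, ""
	if j := strings.Index(path, "::"); j >= 0 {
		db, ds = path[:j], path[j+2:]
	}
	return table, bucket, db, ds, nil
}
```

For example, parseAWSSpec("aws://table-name:bucket-name/database::dataset")
yields ("table-name", "bucket-name", "database", "dataset").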
NBS benefits from related chunks being near one another. Initially,
let's use write-order as a proxy for "related".
This patch contains a pretty heinous hack to allow sync to continue
putting chunks into httpBatchStore top-down without breaking
server-side validation. Work to fix this is tracked in #2982
This patch fixes #2968, at least for now.
* Introduces PullWithFlush() to allow noms sync to explicitly
pull chunks over and flush directly after. This allows UpdateRoot
to behave as before.
Also clears out all the legacy batch-put machinery. Now, Flush()
just directly calls sendWriteRequests().
GetMany() calls can now be serviced by <= N goroutines, where N is the number of physical reads the request is broken down into.
This patch also adds a maxReadSize param to the code that decides how to break chunk reads into physical reads, and sets the s3 blockSize to 5MB, which experimentally resulted in lower total latency.
Lastly, some small refactors.
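A hedged sketch of the coalescing idea (names and the adjacency rule are
assumptions, not the real table-reader code): sort the requested records by
offset and group neighbors into physical reads capped at maxReadSize; each
group can then be serviced by its own goroutine.

```go
package sketch

import "sort"

// rec locates one chunk inside a table file.
type rec struct {
	offset, length uint64
}

// planReads groups adjacent records into physical reads no larger than
// maxReadSize; one goroutine can then service each returned group.
func planReads(recs []rec, maxReadSize uint64) [][]rec {
	sort.Slice(recs, func(i, j int) bool { return recs[i].offset < recs[j].offset })
	var reads [][]rec
	var cur []rec
	var start, end uint64
	for _, r := range recs {
		contiguous := len(cur) > 0 && r.offset == end
		fits := r.offset+r.length-start <= maxReadSize
		if contiguous && fits {
			cur = append(cur, r)
			end = r.offset + r.length
			continue
		}
		if len(cur) > 0 {
			reads = append(reads, cur)
		}
		cur = []rec{r}
		start, end = r.offset, r.offset+r.length
	}
	if len(cur) > 0 {
		reads = append(reads, cur)
	}
	return reads
}
```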
This is a potentially breaking change!
Before this change we required all the fields in a Go struct to be
present in the Noms struct when unmarshaling the Noms struct onto the
Go struct. This is no longer the case: all fields in the Go struct that
are present in the Noms struct will be copied over, and fields missing
from the Noms struct no longer cause an error.
This also means that `omitempty` is useless in Unmarshal and it has been
removed.
This might break your code if you expected to get errors when the field
names did not match!
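Hedged example (it assumes the marshal.Unmarshal(types.Value, interface{})
error signature, types.NewStruct, and the import paths of this era of the API;
check the package docs for the exact forms):

```go
package example

import (
	"github.com/attic-labs/noms/go/marshal"
	"github.com/attic-labs/noms/go/types"
)

type Person struct {
	Name string
	Age  int // absent from the Noms struct below
}

func decode() (Person, error) {
	v := types.NewStruct("Person", types.StructData{
		"name": types.String("Bob"),
	})
	var p Person
	// Previously this returned an error because Age had no counterpart in
	// the Noms struct; now Name is copied and Age is left at its zero value.
	err := marshal.Unmarshal(v, &p)
	return p, err
}
```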
Fixes #2971
This is a breaking change!
We used to create empty Go collections `[]int{}` when unmarshalling an
empty Noms collection onto a Go collection that was `nil`. Now we keep
the Go collection as `nil`, which means that you will get `[]int(nil)`
for an empty Noms List.
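For example (a hedged illustration; the types.NewList and marshal.Unmarshal
signatures here are assumptions from this era of the API):

```go
package example

import (
	"fmt"

	"github.com/attic-labs/noms/go/marshal"
	"github.com/attic-labs/noms/go/types"
)

func demo() error {
	var ints []int
	if err := marshal.Unmarshal(types.NewList(), &ints); err != nil {
		return err
	}
	// Previously this printed []int{}; now the slice is left nil.
	fmt.Printf("%#v\n", ints) // []int(nil)
	return nil
}
```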
Fixes #2969
Before we can defragment NBS stores, we need to understand how
fragmented they are. This tool provides a measure of fragmentation in
which optimal chunk-graph layout implies that ALL children of a given
parent can be read in one storage-layer operation (e.g. disk read, S3
transaction, etc.).
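One way to read that definition as code (a sketch under stated assumptions;
the real tool may define the measure differently): treat a parent as
fragmented when its children's byte ranges cannot be covered by a single read
of at most blockSize bytes.

```go
package sketch

type Hash string

// span is where a child chunk lives inside the store's backing storage.
type span struct {
	offset, length uint64
}

// fragmentation returns the fraction of parents whose children cannot all be
// fetched in one storage-layer operation of at most blockSize bytes.
func fragmentation(childrenOf map[Hash][]span, blockSize uint64) float64 {
	if len(childrenOf) == 0 {
		return 0
	}
	fragmented := 0
	for _, kids := range childrenOf {
		if len(kids) == 0 {
			continue
		}
		lo, hi := kids[0].offset, kids[0].offset+kids[0].length
		for _, s := range kids[1:] {
			if s.offset < lo {
				lo = s.offset
			}
			if s.offset+s.length > hi {
				hi = s.offset + s.length
			}
		}
		if hi-lo > blockSize {
			fragmented++ // needs more than one disk read / S3 transaction
		}
	}
	return float64(fragmented) / float64(len(childrenOf))
}
```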