Previously the buzhash boundary checker used a single value for two
different things: the buzhash buffer size when constructing a hash
object, and the window size it reported through the boundary checker
interface. This was wrong because we don't always pass single-byte
values to the hasher; refs, for example, are 20 bytes.
The compound list chunking compensated for this by only passing the
first byte of each list leaf's ref rather than the full ref. This is bad
because 1 byte carries far less entropy than 20 bytes.
The meta sequence chunking compensated for this by multiplying the
chunking window size by 20, but this also had the effect of
unnecessarily considering 20 times more chunked elements than would fit
in the buzhash buffer.
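As a rough illustration of why the two sizes differ, the sketch below
computes the byte size a hash buffer needs in order to hold a given
number of items. The boundaryChecker type, its fields, and the window of
64 items are hypothetical and not taken from the actual code; only the
20-byte ref size comes from the text above.

```go
package main

import "fmt"

// boundaryChecker is a hypothetical stand-in used only for this sketch.
type boundaryChecker struct {
	windowSize int // number of items the chunker has to reconsider
	valueSize  int // bytes each item feeds to the hasher
}

// hashBufferSize returns how many bytes the rolling-hash buffer must hold
// so that it covers exactly windowSize items.
func (c boundaryChecker) hashBufferSize() int {
	return c.windowSize * c.valueSize
}

func main() {
	byteChecker := boundaryChecker{windowSize: 64, valueSize: 1}
	refChecker := boundaryChecker{windowSize: 64, valueSize: 20}

	fmt.Println(byteChecker.hashBufferSize()) // 64
	fmt.Println(refChecker.hashBufferSize())  // 1280
}
```

With the sizes kept apart, the chunker still reasons about 64 items in
both cases, while the hasher sees all 20 bytes of every ref.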
After a compound blob is created we try to chunk it again, in a similar
way to how we chunk Lists. We take the refs of the sub blobs and compute
a rolling hash over them. If the hash matches a pattern, we split the
existing compound blob into a new compound blob whose sub blobs are
slices of the original compound blob.
Issue #17
The JSON serialization now only contains the length of each individual
blob child.
The Go representation still uses offsets, but each offset marks the end
of its child.
For the blobs "hi" and "bye" we get:
{"cb": [{"ref": "sha1-hi"}, 2, {"ref": "sha1-bye"}, 3]}
compoundBlob{[2, 5], [sha1-hi, sha1-bye]}
Keeping the lengths in the serialization makes it smaller.
Using end offsets in memory simplifies binary search and lets us use
the last entry as the total length.
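The relationship between the two representations can be sketched as
follows. The helper names endOffsets, childIndex and totalLen are made up
for this example and are not the actual implementation; the sketch only
assumes that the in-memory offsets are cumulative end positions, as in
the "hi"/"bye" example.

```go
package main

import (
	"fmt"
	"sort"
)

// endOffsets converts per-child lengths (the serialized form) into
// cumulative end offsets (the in-memory form).
func endOffsets(lengths []uint64) []uint64 {
	offsets := make([]uint64, len(lengths))
	var sum uint64
	for i, n := range lengths {
		sum += n
		offsets[i] = sum
	}
	return offsets
}

// childIndex returns the index of the child that contains byte position
// pos, or len(offsets) if pos is past the end of the blob.
func childIndex(offsets []uint64, pos uint64) int {
	return sort.Search(len(offsets), func(i int) bool {
		return offsets[i] > pos
	})
}

// totalLen returns the length of the whole blob: the last end offset.
func totalLen(offsets []uint64) uint64 {
	if len(offsets) == 0 {
		return 0
	}
	return offsets[len(offsets)-1]
}

func main() {
	offsets := endOffsets([]uint64{2, 3}) // "hi", "bye"

	fmt.Println(offsets)                // [2 5]
	fmt.Println(childIndex(offsets, 1)) // 0, the 'i' in "hi"
	fmt.Println(childIndex(offsets, 2)) // 1, the 'b' in "bye"
	fmt.Println(totalLen(offsets))      // 5
}
```

With end offsets the search predicate is a single comparison, and no
separate length field is needed.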
Issue #17
At this point the compoundBlob only contains blob leaves, but a future
change will create multiple tiers. Both of these implement the new Blob
interface.
The splitting is done with a rolling hash over the last 64 bytes; when
the hash ends with 13 consecutive one bits, we split the data.
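A minimal sketch of this kind of splitting is shown below. It uses a
simple buzhash-style (cyclic polynomial) hash written inline for the
example, with a lookup table filled from a fixed seed; the table, the
split function and the constant names are assumptions made for
illustration and are not the hasher used by the real code. Only the
64-byte window and the 13-bit pattern come from the text above.

```go
package main

import (
	"fmt"
	"math/bits"
	"math/rand"
)

const (
	windowSize = 64        // bytes covered by the rolling hash
	pattern    = 1<<13 - 1 // split when the low 13 bits are all ones
)

// table maps each byte value to a pseudo-random 32-bit word.
var table [256]uint32

func init() {
	r := rand.New(rand.NewSource(1)) // fixed seed: same table every run
	for i := range table {
		table[i] = r.Uint32()
	}
}

// split returns the offsets just past each byte at which the rolling
// hash of the preceding (at most) windowSize bytes matches the pattern.
func split(data []byte) []int {
	var h uint32
	var cuts []int
	for i, b := range data {
		// Slide the window forward: rotate, then mix in the new byte.
		h = bits.RotateLeft32(h, 1) ^ table[b]
		if i >= windowSize {
			// Cancel the byte that just left the window. It has been
			// rotated windowSize times since it was mixed in.
			out := data[i-windowSize]
			h ^= bits.RotateLeft32(table[out], windowSize)
		}
		if h&pattern == pattern {
			cuts = append(cuts, i+1)
		}
	}
	return cuts
}

func main() {
	// Deterministic pseudo-random input, just for the demonstration.
	r := rand.New(rand.NewSource(2))
	data := make([]byte, 1<<18)
	for i := range data {
		data[i] = byte(r.Intn(256))
	}
	cuts := split(data)
	fmt.Printf("%d bytes, %d split points\n", len(data), len(cuts))
}
```

Because the hash only depends on the last 64 bytes, inserting data near
the start of a blob shifts later split points but does not otherwise
change them. A 13-bit pattern matches on average once every 8192 bytes
for well-mixed input.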
Issue #17
This will enable us to walk the chunk graph without having to go
through weird contortions to figure out which values don't have
chunks in any chunkstore (because they were inlined).
Towards issue #82