This started being a problem since the block service update log
entry does not fit in a UDP packet (it's like 100KB). I think this
approach makes more sense anyway. See comment for `getCache()` for
gotchas.
Just check if we're also unable to count the blocks for the disk,
and if yes, assume it's a single file error.
Of course there will be a time period where we will not have detected
the bad disk when counting the blocks (a few minutes at most), but
that's OK -- the scrubber will scrub blocks for that period, and then
stop.
Once <internal-repo/issues/65#issuecomment-24747>
is done, we should use whatever error detection we use for migration
to also distinguish between these errors.
This is in preparation for #44, but more immediately, to better
stop writing to full block services.
The previous strategy of setting a flag was flawed since once
the flag was set it stayed set -- i.e. we would not remove it once
files would be deleted. This consideration should just be integrated
in distributing the block services.
I had copied the LIFO pattern from ETD codebase, but it's not needed
here given that the loop terminates gracefully and so we can coordinate
explicitly if needed.
See <https://mazzo.li/posts/stopping-linux-threads.html> for tradeoffs
regarding how to terminate threads gracefully.
The goal of this work was for valgrind to work correctly, which in turn
was to investigate #141. It looks like I have succeeded:
==2715080== Warning: unimplemented fcntl command: 1036
==2715080== 20,052 bytes in 5,013 blocks are definitely lost in loss record 133 of 135
==2715080== at 0x483F013: operator new(unsigned long) (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==2715080== by 0x3B708E: allocate (new_allocator.h:121)
==2715080== by 0x3B708E: allocate (allocator.h:173)
==2715080== by 0x3B708E: allocate (alloc_traits.h:460)
==2715080== by 0x3B708E: _M_allocate (stl_vector.h:346)
==2715080== by 0x3B708E: std::vector<Crc, std::allocator<Crc> >::_M_default_append(unsigned long) (vector.tcc:635)
==2715080== by 0x42BF1C: resize (stl_vector.h:940)
==2715080== by 0x42BF1C: ShardDBImpl::_fileSpans(rocksdb::ReadOptions&, FileSpansReq const&, FileSpansResp&) (shard/ShardDB.cpp:921)
==2715080== by 0x420867: ShardDBImpl::read(ShardReqContainer const&, ShardRespContainer&) (shard/ShardDB.cpp:1034)
==2715080== by 0x3CB3EE: ShardServer::_handleRequest(int, sockaddr_in*, char*, unsigned long) (shard/Shard.cpp:347)
==2715080== by 0x3C8A39: ShardServer::step() (shard/Shard.cpp:405)
==2715080== by 0x40B1E8: run (core/Loop.cpp:67)
==2715080== by 0x40B1E8: startLoop(void*) (core/Loop.cpp:37)
==2715080== by 0x4BEA258: start_thread (in /usr/lib/libpthread-2.33.so)
==2715080== by 0x4D005E2: clone (in /usr/lib/libc-2.33.so)
==2715080==
==2715080==
==2715080== Exit program on first error (--exit-on-first-error=yes)