269 Commits

Author SHA1 Message Date
Francesco Mazzoli e5f133d826 Correct rota for "queue full" alert 2024-02-20 13:55:30 +00:00
Francesco Mazzoli 303421763a Allow specifying rota per alert in C++ 2024-02-20 12:59:42 +00:00
Saulius Grusnys 796e46f466 shuckle to track if blockservices have any files on them (currently there is an issue with transient files) (#177)
2024-02-20 08:10:51 +00:00
Joshua Leahy 37a205b71e Docker networking seems to not work on new arch snaps, this is fine 2024-02-19 14:38:52 +00:00
Francesco Mazzoli bfe8a449df Some eggsktools additions/improvements 2024-02-12 11:50:18 +00:00
Miroslav Crnic 83d0469c7f SharedRocksDB: correctly export metrics 2024-02-08 19:39:00 +00:00
Miroslav Crnic 37ba9bc457 shard: support for sharing rocksdb and init LogsDB CFs 2024-02-08 17:44:03 +00:00
Miroslav Crnic 38707535e3 shuckle: support metadata replication 2024-02-07 13:57:00 +00:00
Francesco Mazzoli 9c477ffa40 Make RocksDB patching idempotent 2024-01-30 11:37:52 +00:00
Francesco Mazzoli 25676f1096 Handle concurrent block swapping better 2024-01-30 11:22:45 +00:00
Miroslav Crnic 1d6ac9f648 cmake: add patch -N back 2024-01-29 17:25:07 +00:00
Miroslav Crnic 1dedd7d181 core: SPSC return 0 on timeout in pull 2024-01-29 17:16:05 +00:00
Miroslav Crnic 2ec1304981 core: ppoll, futex don't like negative timeouts 2024-01-29 17:00:14 +00:00
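Both ppoll(2) and futex(2) reject a timespec with a negative tv_sec with EINVAL, so a deadline that has already passed must not be handed over as a negative duration. A minimal sketch of one way to guard against this, assuming the timeout is tracked as a signed nanosecond count; the helper name `clampedTimeout` is hypothetical and not taken from the repository:

```cpp
#include <cassert>
#include <cstdint>
#include <ctime>

// Hypothetical helper (not from the repository): converts a signed timeout
// in nanoseconds into a timespec, clamping negative values to zero so that
// the syscall returns immediately instead of failing with EINVAL.
inline timespec clampedTimeout(int64_t timeoutNs) {
    if (timeoutNs < 0) {
        timeoutNs = 0;
    }
    timespec ts;
    ts.tv_sec = static_cast<time_t>(timeoutNs / 1'000'000'000LL);
    ts.tv_nsec = static_cast<long>(timeoutNs % 1'000'000'000LL);
    return ts;
}
```

The resulting timespec can be passed as the timeout argument of ppoll or FUTEX_WAIT; a zero timeout simply makes the call non-blocking.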
Francesco Mazzoli 9d1a31b482 Fix another signedness mismatch 2024-01-29 16:46:05 +00:00
Miroslav Crnic e543665f8f core: SPSC support timeout in pull 2024-01-29 16:06:31 +00:00
Francesco Mazzoli 2a326f7c5f Fix usual signedness shenanigans 🥱 2024-01-29 16:05:19 +00:00
Francesco Mazzoli 9cf2931bc7 We do want the default patch behavior, not the -N one 2024-01-29 16:02:26 +00:00
Francesco Mazzoli 0a6a0c8f24 Process CDC timeouts in a timely manner 2024-01-29 15:08:06 +00:00
Francesco Mazzoli 1145ea10a3 Put patch in alpine docker build image 2024-01-29 14:43:36 +00:00
Francesco Mazzoli 2a6feb6df5 Patch RocksDB to make it compile with clang 15. 2024-01-29 14:15:29 +00:00
Miroslav Crnic 7ce185c219 cdc: remove unnecessary zeroing in shared 2024-01-24 14:24:06 +00:00
Francesco Mazzoli 8c0c246348 More robust detection of file vs. device errors
Just check if we're also unable to count the blocks for the disk,
and if yes, assume it's a single file error.

Of course there will be a time period where we will not have detected
the bad disk when counting the blocks (a few minutes at most), but
that's OK -- the scrubber will scrub blocks for that period, and then
stop.

Once <internal-repo/issues/65#issuecomment-24747>
is done, we should use whatever error detection we use for migration
to also distinguish between these errors.
2024-01-22 13:18:53 +00:00
Francesco Mazzoli f979a67b04 Always set non-zero transient deadline, fixes #145. 2024-01-18 19:04:36 +00:00
Francesco Mazzoli cd23deaf19 Accept DIRECTORY_NOT_FOUND in SOFT_UNLINK_DIRECTORY
Nothing prevents a nonexistent inode from being sent in that request.
2024-01-18 12:00:43 +00:00
Francesco Mazzoli 2a95b345d2 Many changes to make CI work on new runner
Most notably, we now run the non-kmod integration tests in docker.
The kmod tests are already in their own environment (qemu).
2024-01-18 11:57:17 +00:00
Francesco Mazzoli f8b432eb18 Add metric and alert for CDC update size 2024-01-16 23:22:39 +00:00
Francesco Mazzoli 694e17cbc2 Add alerts for full shard queues 2024-01-16 23:11:41 +00:00
Francesco Mazzoli b6cf2b67a6 Distribute block services from shuckle
This is in preparation for #44, but more immediately, to better
stop writing to full block services.

The previous strategy of setting a flag was flawed since once
the flag was set it stayed set -- i.e. we would not remove it once
files were deleted.  This consideration should just be integrated
in distributing the block services.
2024-01-16 16:17:27 +00:00
Francesco Mazzoli d569bdb494 Re-introduce thread names (they got lost in a refactor) 2024-01-11 17:32:52 +00:00
Francesco Mazzoli c80c6269d9 Remove spurious MsgsGen.hpp includes 2024-01-11 16:05:34 +00:00
Francesco Mazzoli 8d0b97171e Remove dead code 2024-01-11 13:03:26 +00:00
Francesco Mazzoli c27ba8398a Tear down all threads at once
I had copied the LIFO pattern from the ETD codebase, but it's not needed
here given that the loop terminates gracefully and so we can coordinate
explicitly if needed.
2024-01-09 16:53:23 +00:00
Francesco Mazzoli c9bf49d387 Fix silly SPSC bug 2024-01-09 11:14:18 +00:00
Francesco Mazzoli 3097752a30 Minor tweak 2024-01-08 16:03:07 +00:00
Francesco Mazzoli ee9e0ad0af Remove pthread_attr_setsigmask_np, musl does not have it 2024-01-08 15:58:31 +00:00
Francesco Mazzoli 002b2854ec Fix leak in FetchedSpan, and hopefully fix #141. 2024-01-08 15:58:31 +00:00
Francesco Mazzoli 8075e99bb6 Graceful shard teardown
See <https://mazzo.li/posts/stopping-linux-threads.html> for tradeoffs
regarding how to terminate threads gracefully.

The goal of this work was for valgrind to work correctly, which in turn
was to investigate #141. It looks like I have succeeded:

    ==2715080== Warning: unimplemented fcntl command: 1036
    ==2715080== 20,052 bytes in 5,013 blocks are definitely lost in loss record 133 of 135
    ==2715080==    at 0x483F013: operator new(unsigned long) (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
    ==2715080==    by 0x3B708E: allocate (new_allocator.h:121)
    ==2715080==    by 0x3B708E: allocate (allocator.h:173)
    ==2715080==    by 0x3B708E: allocate (alloc_traits.h:460)
    ==2715080==    by 0x3B708E: _M_allocate (stl_vector.h:346)
    ==2715080==    by 0x3B708E: std::vector<Crc, std::allocator<Crc> >::_M_default_append(unsigned long) (vector.tcc:635)
    ==2715080==    by 0x42BF1C: resize (stl_vector.h:940)
    ==2715080==    by 0x42BF1C: ShardDBImpl::_fileSpans(rocksdb::ReadOptions&, FileSpansReq const&, FileSpansResp&) (shard/ShardDB.cpp:921)
    ==2715080==    by 0x420867: ShardDBImpl::read(ShardReqContainer const&, ShardRespContainer&) (shard/ShardDB.cpp:1034)
    ==2715080==    by 0x3CB3EE: ShardServer::_handleRequest(int, sockaddr_in*, char*, unsigned long) (shard/Shard.cpp:347)
    ==2715080==    by 0x3C8A39: ShardServer::step() (shard/Shard.cpp:405)
    ==2715080==    by 0x40B1E8: run (core/Loop.cpp:67)
    ==2715080==    by 0x40B1E8: startLoop(void*) (core/Loop.cpp:37)
    ==2715080==    by 0x4BEA258: start_thread (in /usr/lib/libpthread-2.33.so)
    ==2715080==    by 0x4D005E2: clone (in /usr/lib/libc-2.33.so)
    ==2715080==
    ==2715080==
    ==2715080== Exit program on first error (--exit-on-first-error=yes)
2024-01-08 15:41:22 +00:00
Francesco Mazzoli 1963714c0f Remove avoidable stat in collect directories 2023-12-15 21:20:05 +00:00
Francesco Mazzoli 01af461477 Factor out function 2023-12-15 18:30:12 +00:00
Francesco Mazzoli 73200f24b6 Use DWARF 4, Ubuntu 20.04 does not understand DWARF 5. 2023-12-11 16:23:55 +00:00
Francesco Mazzoli 898b85ad9c Tweak GC parameters
We're almost in a steady state, no need to overwhelm the shards.
2023-12-11 15:04:41 +00:00
Francesco Mazzoli 8c172fd2e8 Tiny C++ xmon fix 2023-12-10 11:14:19 +00:00
Francesco Mazzoli 27bd28ead0 Remove outdated comment 2023-12-10 08:39:17 +00:00
Francesco Mazzoli 788b5eed57 Fill in current block services before applying the log
It makes a lot more sense to pick outside, given that it involves
randomness. Also, this is in preparation for shuckle picking them
in a smarter way.
2023-12-09 15:20:24 +00:00
Francesco Mazzoli 3394328000 Do not try to close xmon fd if we don't have one
Also, ignore errors if we can't close it. Fixes #134.
2023-12-09 14:50:51 +00:00
Francesco Mazzoli ab1df9137d Fix error logging when inserting stats 2023-12-08 15:57:02 +00:00
Francesco Mazzoli 128078988d Get rid of -parallel in GC
With separate workers it's not really needed anymore.
2023-12-08 11:51:21 +00:00
Francesco Mazzoli 5f4467d0c6 Synchronize access to in-memory block service data
This was already an issue before, but it had never surfaced until now.
Today the quants actually hit it.
2023-12-07 16:43:11 +00:00
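As an illustration of the kind of fix described above, here is a minimal sketch of mutex-guarded in-memory block service data. The type and member names (`BlockServiceInfo`, `BlockServiceCache`, `availableBytes`) are hypothetical, and the actual change may use a different locking scheme or even a different language:

```cpp
#include <cassert>
#include <cstdint>
#include <mutex>
#include <unordered_map>

// Hypothetical record; the real block service data has other fields.
struct BlockServiceInfo {
    uint64_t availableBytes;
};

// Hypothetical cache: every access to the map goes through one mutex,
// so concurrent readers and writers never observe a partial update.
class BlockServiceCache {
public:
    void set(uint64_t id, const BlockServiceInfo& info) {
        std::lock_guard<std::mutex> lock(_mu);
        _services[id] = info;
    }

    // Copies the entry out under the lock; returns false if it is absent.
    bool get(uint64_t id, BlockServiceInfo& out) const {
        std::lock_guard<std::mutex> lock(_mu);
        auto it = _services.find(id);
        if (it == _services.end()) {
            return false;
        }
        out = it->second;
        return true;
    }

private:
    mutable std::mutex _mu;
    std::unordered_map<uint64_t, BlockServiceInfo> _services;
};
```

Copying the entry out while holding the lock, rather than returning a reference, keeps callers from reading data that another thread is about to overwrite.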
Francesco Mazzoli 53049d5779 Shard batch writes, use batch UDP syscalls
The idea is to drain the socket and do a single RocksDB WAL
write/fsync for all the write requests we have found.

The read requests are immediately executed. The reasoning here is
that currently write requests are _a lot_ slower than the read
requests because fsyncing takes ~500us on fsf1. In the future this
might change.

Since we're at it, we also use batch UDP syscalls in the CDC.

Fixes #119.
2023-12-07 14:29:07 +00:00
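The batching idea in this commit can be sketched as follows: reads are executed as soon as they are seen, while writes are collected and committed together so that the expensive sync is paid once per batch. This is a sketch under stated assumptions, not the shard's actual code: `Request`, `BatchStats`, and `processDrainedRequests` are hypothetical names, the socket draining itself (for example with recvmmsg on Linux) is left out, and the single RocksDB WAL write/fsync is represented by a caller-supplied callback.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical request type; real shard requests carry much more.
struct Request {
    bool isWrite;
    int payload;
};

// Hypothetical counters describing what happened to one drained batch.
struct BatchStats {
    size_t readsExecuted;
    size_t writesBatched;
    size_t syncs;
};

// Executes reads immediately and commits all writes with a single call
// to commitWrites, which stands in for one WAL write followed by one fsync.
inline BatchStats processDrainedRequests(
    const std::vector<Request>& drained,
    const std::function<void(const Request&)>& executeRead,
    const std::function<void(const std::vector<Request>&)>& commitWrites) {
    BatchStats stats{0, 0, 0};
    std::vector<Request> pendingWrites;
    for (const Request& req : drained) {
        if (req.isWrite) {
            pendingWrites.push_back(req);
        } else {
            executeRead(req);
            stats.readsExecuted++;
        }
    }
    if (!pendingWrites.empty()) {
        commitWrites(pendingWrites);
        stats.writesBatched = pendingWrites.size();
        stats.syncs = 1;
    }
    return stats;
}
```

With an fsync cost of roughly 500us as quoted above, a batch of N writes pays that cost once instead of N times, at the price of delaying each write response until the whole batch is committed.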
Francesco Mazzoli 3eae5bbf9b Use an EMA for the in-flight CDC txns as well 2023-12-07 10:27:32 +00:00
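For reference, an exponential moving average (EMA) smooths a noisy gauge such as the number of in-flight transactions by blending each new sample with the previous average. A minimal sketch, assuming a fixed smoothing factor; the class name `Ema` and the choice to seed with the first sample are assumptions, not details from the commit:

```cpp
#include <cassert>
#include <cmath>

// Hypothetical EMA tracker. alpha in (0, 1] is an assumed smoothing factor:
// larger values react faster to new samples, smaller values smooth more.
class Ema {
public:
    explicit Ema(double alpha) : _alpha(alpha), _value(0.0), _initialized(false) {}

    void update(double sample) {
        if (!_initialized) {
            // Seed with the first sample to avoid a bias towards zero.
            _value = sample;
            _initialized = true;
        } else {
            _value = _alpha * sample + (1.0 - _alpha) * _value;
        }
    }

    double value() const { return _value; }

private:
    double _alpha;
    double _value;
    bool _initialized;
};
```

Calling `update` once per observation (for example, each time the in-flight count is sampled) keeps the metric stable enough to alert on without storing a window of past samples.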