Commit Graph

101 Commits

Author SHA1 Message Date
Miroslav Crnic ffb5989692 cdc: increase amount of message received to account for LogsDB msgs 2024-06-02 11:22:20 +00:00
Miroslav Crnic bb47c8f38d cdc: remove adaptive msg receive 2024-06-02 07:59:16 +00:00
Miroslav Crnic 98d0376fc9 cdc: dont forget requests until they are handed over to LogsDB 2024-05-24 10:16:54 +01:00
Miroslav Crnic 1f145c030e shard/cdc: support snapshoting 2024-05-23 10:17:59 +01:00
Miroslav Crnic 1446e4d0d2 cdc: force log cleanup after crash
Transactional db that CDC uses has a slightly
annoying property that it flushes WAL on transaction
start. As a result release point can get moved and
log records persisted even if we crash.
We want to remove them automatically for now.
2024-05-22 16:58:18 +00:00
Miroslav Crnic a377536b40 cdc: correctly rewind assumed logIndex 2024-05-22 15:28:15 +00:00
Miroslav Crnic 4e574374ca shard/cdc: cleanup logsdb options, hostmon name match service name 2024-05-22 10:21:41 +00:00
Miroslav Crnic 25e8264517 cdc: rewind expected LogIdx on append window full 2024-05-22 09:14:16 +00:00
Miroslav Crnic b524748210 cdc: use DEFAULT_UDP_MTU for serialization of entries 2024-05-21 14:23:27 +00:00
Miroslav Crnic 121340f1b2 cdc: log line fixes and handle interupt 2024-05-21 12:56:26 +00:00
Miroslav Crnic ab4c25e5e3 cdc: use normal buffer size in cdc sockets 2024-05-21 12:55:47 +00:00
Miroslav Crnic 8be746de5b cdc: differentiate replicas in xmon and in metrics 2024-05-21 09:51:37 +00:00
Miroslav Crnic 5d453179ad cdc: dont alert on missing replicas if replication is off 2024-05-20 13:10:31 +00:00
Miroslav Crnic 0b3348b458 SharedRocksDB: (ref) paths in constructor 2024-05-17 14:23:25 +00:00
Miroslav Crnic f5e17dace5 cdc: add LogsDB
* cdc: pack req/resp into log entries and apply

* shard: drop support for unused incomming packet drop

* cdc: add logsdb
2024-05-14 12:50:17 +01:00
Miroslav Crnic 8a0ea10cde core: UDPSocketPair and use IpPort AddrsInfo everywhere
* core: UDPSocketPair and use IpPort AddrsInfo everywhere

* Refactor UDPSocketPair a bit

* ci: kmod always delete img before create

* shuckle: fix scripts/json marshal

---------

Co-authored-by: Francesco Mazzoli <francesco.mazzoli@xtxmarkets.com>
2024-05-03 11:32:07 +01:00
Francesco Mazzoli 40ed10b2c6 Bump quiet period for metrics alerts
VictoraMetrics is often acting up
2024-04-29 11:39:54 +00:00
Miroslav Crnic 409b126e4b cdc: use SharedRocksDB 2024-04-05 23:22:39 +01:00
Miroslav Crnic 30ee029f7e shuckle: make requests interruptable and pass timeout to all operations
This means that they'll be interrupted at shutdown, rather than holding everything up when shuckle is overloaded.
We also detect idle connection or slow transmitting data.
2024-04-02 18:15:29 +01:00
Francesco Mazzoli 7a5fc9f8a9 Allow to disable shuckle stat inserting 2024-03-25 16:08:54 +00:00
Francesco Mazzoli 3a6e498664 Make some Loop methods static 2024-03-20 13:00:18 +00:00
Miroslav Crnic b240de53b5 shard: distributed log implementation and shard can use it with a flag set 2024-03-12 11:02:04 +00:00
Miroslav Crnic 712ed8973e core: simplify implementing custom stop for Loop 2024-02-23 13:52:34 +00:00
Francesco Mazzoli beb07dbe6e Silence CDC queue alert 2024-02-21 14:57:00 +00:00
Francesco Mazzoli 303421763a Allow to specify rota per alert in C++ 2024-02-20 12:59:42 +00:00
Francesco Mazzoli 0a6a0c8f24 Process CDC timeouts in a timely manner 2024-01-29 15:08:06 +00:00
Miroslav Crnic 7ce185c219 cdc: remove uneccessary zeroing in shared 2024-01-24 14:24:06 +00:00
Francesco Mazzoli f8b432eb18 Add metric and alert for CDC update size 2024-01-16 23:22:39 +00:00
Francesco Mazzoli c80c6269d9 Remove spurious MsgsGen.hpp includes 2024-01-11 16:05:34 +00:00
Francesco Mazzoli 8075e99bb6 Graceful shard teardown
See <https://mazzo.li/posts/stopping-linux-threads.html> for tradeoffs
regarding how to terminate threads gracefully.

The goal of this work was for valgrind to work correctly, which in turn
was to investigate #141. It looks like I have succeeded:

    ==2715080== Warning: unimplemented fcntl command: 1036
    ==2715080== 20,052 bytes in 5,013 blocks are definitely lost in loss record 133 of 135
    ==2715080==    at 0x483F013: operator new(unsigned long) (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
    ==2715080==    by 0x3B708E: allocate (new_allocator.h:121)
    ==2715080==    by 0x3B708E: allocate (allocator.h:173)
    ==2715080==    by 0x3B708E: allocate (alloc_traits.h:460)
    ==2715080==    by 0x3B708E: _M_allocate (stl_vector.h:346)
    ==2715080==    by 0x3B708E: std::vector<Crc, std::allocator<Crc> >::_M_default_append(unsigned long) (vector.tcc:635)
    ==2715080==    by 0x42BF1C: resize (stl_vector.h:940)
    ==2715080==    by 0x42BF1C: ShardDBImpl::_fileSpans(rocksdb::ReadOptions&, FileSpansReq const&, FileSpansResp&) (shard/ShardDB.cpp:921)
    ==2715080==    by 0x420867: ShardDBImpl::read(ShardReqContainer const&, ShardRespContainer&) (shard/ShardDB.cpp:1034)
    ==2715080==    by 0x3CB3EE: ShardServer::_handleRequest(int, sockaddr_in*, char*, unsigned long) (shard/Shard.cpp:347)
    ==2715080==    by 0x3C8A39: ShardServer::step() (shard/Shard.cpp:405)
    ==2715080==    by 0x40B1E8: run (core/Loop.cpp:67)
    ==2715080==    by 0x40B1E8: startLoop(void*) (core/Loop.cpp:37)
    ==2715080==    by 0x4BEA258: start_thread (in /usr/lib/libpthread-2.33.so)
    ==2715080==    by 0x4D005E2: clone (in /usr/lib/libc-2.33.so)
    ==2715080==
    ==2715080==
    ==2715080== Exit program on first error (--exit-on-first-error=yes)
2024-01-08 15:41:22 +00:00
Francesco Mazzoli 53049d5779 Shard batch writes, use batch UDP syscalls
The idea is to drain the socket and do a single RocksDB WAL
write/fsync for all the write requests we have found.

The read requests are immediately executed. The reasoning here is
that currently write requests are _a lot_ slower than the read
requests because fsyncing takes ~500us on fsf1. In the future this
might change.

Since we're at it, we also use batch UDP syscalls in the CDC.

Fixes #119.
2023-12-07 14:29:07 +00:00
Francesco Mazzoli 3eae5bbf9b Use an EMA for the in-flight CDC txns as well 2023-12-07 10:27:32 +00:00
Francesco Mazzoli 38f3d54ecd Wait forever, rather than having timeouts
The goal here is to not have constant wakeups due to timeout. Do
not attempt to clean things up nicely before termination -- just
terminate instead. We can setup a proper termination system in
the future, I first want to see if this makes a difference.

Also, change xmon to use pipes for communication, so that it can
wait without timers as well.

Also, `write` directly for logging, so that we know the logs will
make it to the file after the logging call returns (since we now
do not have the chance to flush them afterwards).
2023-12-07 10:11:19 +00:00
Francesco Mazzoli af46ab2173 Bump CDC shard response timeout 2023-11-29 15:00:08 +00:00
Francesco Mazzoli a52efe217b Tune CDC logging more 2023-11-29 14:40:33 +00:00
Francesco Mazzoli e4c01e8728 Metrics + logging 2023-11-29 14:32:37 +00:00
Francesco Mazzoli bd278ff6f6 Better metrics for shard responses in CDC 2023-11-29 13:52:44 +00:00
Francesco Mazzoli 4453083aa7 Correctly record request id when picking up transactions after restart 2023-11-29 11:08:07 +00:00
Francesco Mazzoli 7537bbc6cf Remove useless line 2023-11-29 11:08:07 +00:00
Francesco Mazzoli 2eab012d76 Fix bug in poll check code 2023-11-29 11:08:07 +00:00
Francesco Mazzoli c94ece50cf Integer sanitizer stuff 2023-11-29 11:08:07 +00:00
Francesco Mazzoli 59abb24a8e Add ceiling on max update size
We don't want it to grow without bound, but we want to maximize
throughput (we'd like for fsync to not be a factor).
2023-11-29 11:08:07 +00:00
Francesco Mazzoli 476009381a Remove maximum enqueued requests limit
We already drop in-flight requests that we're already processing,
so I don't think this matters very much currently.
2023-11-29 11:08:07 +00:00
Francesco Mazzoli c5562c7ca3 Parallelize CDC by directory
Fixes #66.
2023-11-29 11:08:07 +00:00
Francesco Mazzoli 057be91613 rocksDBStats -> rocksDBMetrics 2023-11-09 13:38:32 +00:00
Francesco Mazzoli c5979a9d90 Expose some RocksDB stats 2023-11-09 13:23:49 +00:00
Francesco Mazzoli 03e9510255 Align xmon's app instances and systemd services 2023-11-08 14:36:58 +00:00
Francesco Mazzoli 71556ce933 Switch to restech EggsFS rota 2023-11-03 14:23:44 +00:00
Francesco Mazzoli 64d400fcfe Insert shard/cdc metrics at more regular intervals 2023-11-03 13:49:38 +00:00
Francesco Mazzoli 654c0d4db4 Report CDC queue size in grafana 2023-11-03 13:49:32 +00:00