ternfs-XTXMarkets

mirror of https://github.com/XTXMarkets/ternfs.git synced 2026-02-09 03:48:55 -06:00

Author	SHA1	Message	Date
Miroslav Crnic	409b126e4b	cdc: use SharedRocksDB	2024-04-05 23:22:39 +01:00
Miroslav Crnic	30ee029f7e	shuckle: make requests interruptible and pass timeout to all operations. This means that they'll be interrupted at shutdown, rather than holding everything up when shuckle is overloaded. We also detect idle connections and slow data transmission.	2024-04-02 18:15:29 +01:00
Miroslav Crnic	65e5b6e7ac	cdc: shuckle-stats	2024-03-26 09:40:44 +00:00
Francesco Mazzoli	7a5fc9f8a9	Allow to disable shuckle stat inserting	2024-03-25 16:08:54 +00:00
Francesco Mazzoli	3a6e498664	Make some `Loop` methods static	2024-03-20 13:00:18 +00:00
Miroslav Crnic	b240de53b5	shard: distributed log implementation and shard can use it with a flag set	2024-03-12 11:02:04 +00:00
Miroslav Crnic	712ed8973e	core: simplify implementing custom stop for Loop	2024-02-23 13:52:34 +00:00
Francesco Mazzoli	beb07dbe6e	Silence CDC queue alert	2024-02-21 14:57:00 +00:00
Francesco Mazzoli	303421763a	Allow to specify rota per alert in C++	2024-02-20 12:59:42 +00:00
Francesco Mazzoli	0a6a0c8f24	Process CDC timeouts in a timely manner	2024-01-29 15:08:06 +00:00
Miroslav Crnic	7ce185c219	cdc: remove unnecessary zeroing in shared	2024-01-24 14:24:06 +00:00
Francesco Mazzoli	cd23deaf19	Accept `DIRECTORY_NOT_FOUND` in `SOFT_UNLINK_DIRECTORY`. Nothing is preventing a non-existent inode from being sent in that request.	2024-01-18 12:00:43 +00:00
Francesco Mazzoli	f8b432eb18	Add metric and alert for CDC update size	2024-01-16 23:22:39 +00:00
Francesco Mazzoli	c80c6269d9	Remove spurious `MsgsGen.hpp` includes	2024-01-11 16:05:34 +00:00
Francesco Mazzoli	8075e99bb6	Graceful shard teardown See <https://mazzo.li/posts/stopping-linux-threads.html> for tradeoffs regarding how to terminate threads gracefully. The goal of this work was for valgrind to work correctly, which in turn was to investigate #141. It looks like I have succeeded: ==2715080== Warning: unimplemented fcntl command: 1036 ==2715080== 20,052 bytes in 5,013 blocks are definitely lost in loss record 133 of 135 ==2715080== at 0x483F013: operator new(unsigned long) (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so) ==2715080== by 0x3B708E: allocate (new_allocator.h:121) ==2715080== by 0x3B708E: allocate (allocator.h:173) ==2715080== by 0x3B708E: allocate (alloc_traits.h:460) ==2715080== by 0x3B708E: _M_allocate (stl_vector.h:346) ==2715080== by 0x3B708E: std::vector<Crc, std::allocator<Crc> >::_M_default_append(unsigned long) (vector.tcc:635) ==2715080== by 0x42BF1C: resize (stl_vector.h:940) ==2715080== by 0x42BF1C: ShardDBImpl::_fileSpans(rocksdb::ReadOptions&, FileSpansReq const&, FileSpansResp&) (shard/ShardDB.cpp:921) ==2715080== by 0x420867: ShardDBImpl::read(ShardReqContainer const&, ShardRespContainer&) (shard/ShardDB.cpp:1034) ==2715080== by 0x3CB3EE: ShardServer::_handleRequest(int, sockaddr_in, char, unsigned long) (shard/Shard.cpp:347) ==2715080== by 0x3C8A39: ShardServer::step() (shard/Shard.cpp:405) ==2715080== by 0x40B1E8: run (core/Loop.cpp:67) ==2715080== by 0x40B1E8: startLoop(void*) (core/Loop.cpp:37) ==2715080== by 0x4BEA258: start_thread (in /usr/lib/libpthread-2.33.so) ==2715080== by 0x4D005E2: clone (in /usr/lib/libc-2.33.so) ==2715080== ==2715080== ==2715080== Exit program on first error (--exit-on-first-error=yes)	2024-01-08 15:41:22 +00:00
Francesco Mazzoli	53049d5779	Shard batch writes, use batch UDP syscalls. The idea is to drain the socket and do a single RocksDB WAL write/fsync for all the write requests we have found. The read requests are immediately executed. The reasoning here is that currently write requests are _a lot_ slower than the read requests because fsyncing takes ~500us on fsf1. In the future this might change. Since we're at it, we also use batch UDP syscalls in the CDC. Fixes #119.	2023-12-07 14:29:07 +00:00
Francesco Mazzoli	3eae5bbf9b	Use an EMA for the in-flight CDC txns as well	2023-12-07 10:27:32 +00:00
Francesco Mazzoli	38f3d54ecd	Wait forever, rather than having timeouts. The goal here is to not have constant wakeups due to timeout. Do not attempt to clean things up nicely before termination -- just terminate instead. We can set up a proper termination system in the future, I first want to see if this makes a difference. Also, change xmon to use pipes for communication, so that it can wait without timers as well. Also, `write` directly for logging, so that we know the logs will make it to the file after the logging call returns (since we now do not have the chance to flush them afterwards).	2023-12-07 10:11:19 +00:00
Francesco Mazzoli	af46ab2173	Bump CDC shard response timeout	2023-11-29 15:00:08 +00:00
Francesco Mazzoli	a52efe217b	Tune CDC logging more	2023-11-29 14:40:33 +00:00
Francesco Mazzoli	e4c01e8728	Metrics + logging	2023-11-29 14:32:37 +00:00
Francesco Mazzoli	bd278ff6f6	Better metrics for shard responses in CDC	2023-11-29 13:52:44 +00:00
Francesco Mazzoli	4453083aa7	Correctly record request id when picking up transactions after restart	2023-11-29 11:08:07 +00:00
Francesco Mazzoli	a367858684	Drop entire CF at once, rather than one-by-one. A dry run of the production upgrade using a backup revealed that dropping them one-by-one would take ages, since before we kept every single CDC request.	2023-11-29 11:08:07 +00:00
Francesco Mazzoli	7537bbc6cf	Remove useless line	2023-11-29 11:08:07 +00:00
Francesco Mazzoli	fac014a864	Self-PR review, part 2	2023-11-29 11:08:07 +00:00
Francesco Mazzoli	ba9424e224	Remove `unordered_set`. Almost certainly irrelevant, but it was bugging me	2023-11-29 11:08:07 +00:00
Francesco Mazzoli	2eab012d76	Fix bug in poll check code	2023-11-29 11:08:07 +00:00
Francesco Mazzoli	c94ece50cf	Integer sanitizer stuff	2023-11-29 11:08:07 +00:00
Francesco Mazzoli	59abb24a8e	Add ceiling on max update size. We don't want it to grow without bound, but we want to maximize throughput (we'd like for fsync to not be a factor).	2023-11-29 11:08:07 +00:00
Francesco Mazzoli	476009381a	Remove maximum enqueued requests limit. We already drop in-flight requests that we're already processing, so I don't think this matters very much currently.	2023-11-29 11:08:07 +00:00
Francesco Mazzoli	c5562c7ca3	Parallelize CDC by directory. Fixes #66.	2023-11-29 11:08:07 +00:00
Francesco Mazzoli	340e7f2f37	Harmonize addr-passing, add shuckle beacon and test it in kmod	2023-11-14 13:49:36 +00:00
Francesco Mazzoli	2ad278adaa	Add `ubuntu` image to build, use jemalloc in release build. I want to use the introspection capabilities of jemalloc, and it should also be much faster. Preserve alpine build for go build, it's also really useful to test inside the kmod.	2023-11-13 15:44:55 +00:00
Francesco Mazzoli	ad3c969772	Push full RocksDB stats to grafana	2023-11-09 16:48:51 +00:00
Francesco Mazzoli	f70c484883	Dump RocksDB full statistics to file	2023-11-09 14:12:54 +00:00
Francesco Mazzoli	057be91613	`rocksDBStats` -> `rocksDBMetrics`	2023-11-09 13:38:32 +00:00
Francesco Mazzoli	c5979a9d90	Expose some RocksDB stats	2023-11-09 13:23:49 +00:00
Francesco Mazzoli	03e9510255	Align xmon's app instances and systemd services	2023-11-08 14:36:58 +00:00
Francesco Mazzoli	afc4e78a62	Reduce default CDC queue size	2023-11-05 22:38:57 +00:00
Francesco Mazzoli	71556ce933	Switch to restech EggsFS rota	2023-11-03 14:23:44 +00:00
Francesco Mazzoli	64d400fcfe	Insert shard/cdc metrics at more regular intervals	2023-11-03 13:49:38 +00:00
Francesco Mazzoli	654c0d4db4	Report CDC queue size in grafana	2023-11-03 13:49:32 +00:00
Francesco Mazzoli	9e21969637	Slightly tighter error checks	2023-10-11 13:40:46 +01:00
Francesco Mazzoli	6726fff0fe	Better "innocuous error" handling in CDC	2023-10-04 18:12:15 +01:00
Francesco Mazzoli	440a78510e	Add concrete quiet windows to C++ alerts. This together with the previous commits fixes #72.	2023-10-02 23:06:40 +00:00
Francesco Mazzoli	59237ed673	Limit number of open RocksDB files. We got to the point where we had ~4k open SST files per shard, which meant that we ate up all the available FDs.	2023-09-30 11:08:35 +00:00
Francesco Mazzoli	2679ee7c80	Retry RocksDB transactions if appropriate	2023-09-30 10:44:40 +00:00
Francesco Mazzoli	1d4c4abafd	Correctly check that RocksDB txn succeeded. This was caught anyway by the fact that we check that the log index is what we expect. Would have been very nasty otherwise. The right thing to do is to check for `Status::TryAgain()` and retry. `Status::Busy()` should never happen because we never run transactions concurrently so far.	2023-09-30 09:51:26 +00:00
Francesco Mazzoli	02838e228f	Correct xmon app types	2023-09-28 11:53:12 +00:00

104 Commits