Commit Graph

169 Commits

Author SHA1 Message Date
Francesco Mazzoli
59237ed673 Limit number of open RocksDB files
We got to the point where we had ~4k open SST files per shard, which
meant that we eat up all the available FDs.
2023-09-30 11:08:35 +00:00
Francesco Mazzoli
2679ee7c80 Retry RocksDB transactions if appropriate 2023-09-30 10:44:40 +00:00
Francesco Mazzoli
1d4c4abafd Correctly check that RocksDB txn succeeded
This was caught anyway by the fact that we check that the log index
is what we expect. Would have been very nasty otherwise.

The right thing to do is to check for `Status::TryAgain()` and
retry. `Status::Busy()` should never happen because we never
run transactions concurrently so far.
2023-09-30 09:51:26 +00:00
Francesco Mazzoli
02838e228f Correct xmon app types 2023-09-28 11:53:12 +00:00
Francesco Mazzoli
762f047772 Add fsr17 and fsr18 to deployment 2023-09-19 12:56:34 +00:00
Francesco Mazzoli
77ac15af8d Allow to choose xmon env in C++ apps 2023-09-18 11:56:44 +00:00
Francesco Mazzoli
b87a43a297 Continue running GC if servers are down
This was triggered by a server failing hard (fsr13), without any
short term resolution (we've already replaced the mobo, we'll probably
replace the HBA). In this case GC should still run rather than
get stuck.
2023-08-29 12:47:24 +00:00
Francesco Mazzoli
1a8cda8747 Retry if we fail to get page in spans
See comment for explanation, this is in preparation for #50, see
<internal-repo/issues/50#issuecomment-23278>
in particular.
2023-08-22 15:01:33 +01:00
Francesco Mazzoli
1cab680110 Support arbitrary span/block/... policies in kmod...
...and also update them quickly, by indexing them by (inode, tag).

Currently they only get updated on local renames though, we should
also update them when things are moved around remotely.
2023-08-22 15:01:33 +01:00
Francesco Mazzoli
6fa520c582 Always update directory modification
This fixes a bona-fide bug -- we didn't update the mtime when an
edge was unlocked + moved. However we might as well blindly always
update the mtime, even if there is no POSIX-visible change, to
be on the safe side.
2023-08-21 13:33:30 +00:00
Francesco Mazzoli
b25f893403 Update estimates in ShardDB.cpp 2023-08-16 08:41:13 +00:00
Francesco Mazzoli
40f229b6f5 Add endpoint to specify which file to get the "reference" block services from
See comments for more details.
2023-08-16 08:40:47 +01:00
Francesco Mazzoli
9405b64a76 Remove ExpireTransientFile, make future cutoff tunable
Fixes #48. Also, reorganize error handling in `eggsblocks` requests,
especially around write requests, which might help with #45.
2023-08-15 12:43:49 +01:00
Ivan Korostelev
7ec477ca9f CDC.cpp: minor bugfix with using optional after reset()
harmless in release builds, since optional in questino is POD and destructor is a noop
2023-08-09 10:45:43 +00:00
Francesco Mazzoli
a5dbe189e3 Add some block services metrics 2023-08-08 11:48:35 +00:00
Francesco Mazzoli
32e2a011ee More grafana fixes 2023-08-08 09:28:07 +00:00
Francesco Mazzoli
467fcffefb A few metrics fixes 2023-08-08 09:21:35 +01:00
Francesco Mazzoli
e2246afc53 More tweaks to event loops 2023-08-08 09:21:35 +01:00
Francesco Mazzoli
b2f28955a5 Log timings 2023-08-08 09:21:35 +01:00
Francesco Mazzoli
e686222040 A bit more logging 2023-08-08 09:21:35 +01:00
Francesco Mazzoli
5117ddd16e Add shard/CDC metrics 2023-08-08 09:21:35 +01:00
Francesco Mazzoli
1922cf3c30 Factor out common looping patterns 2023-08-08 09:21:35 +01:00
Francesco Mazzoli
93b212c665 Alert while initializing shard DB 2023-08-07 10:16:00 +00:00
Francesco Mazzoli
63ed6a90fa Reconnect to xmon on expired heartbeat 2023-08-07 10:06:40 +00:00
Francesco Mazzoli
b370118e90 Rate limit binnable xmon requests
This involved clearly separating non-clearable and clearable alerts,
which simplifies the design and I think satisfies all our needs.
2023-08-05 23:41:10 +01:00
Francesco Mazzoli
02a2ca2a6f Wait for block services to come up before restarting the next one
This should already make #43 better.
2023-08-04 13:40:10 +00:00
Francesco Mazzoli
18b2397842 Some Timings.hpp functions 2023-08-03 23:41:11 +00:00
Francesco Mazzoli
698794ac44 Fix bad indexing in Timings.hpp 2023-08-03 21:06:49 +00:00
Francesco Mazzoli
ca987ed205 Fix UB in bincode
I `memcpy` a zero-sized string into `NULL`. UBsan rightfully
complains.
2023-08-03 09:58:53 +00:00
Francesco Mazzoli
0055575f71 alpine-debug -> alpinedebug, new cmake doesn't like dashes 2023-08-03 07:09:41 +00:00
Francesco Mazzoli
9ef3162882 Add error count to inspect how things failed 2023-08-03 06:53:35 +00:00
Francesco Mazzoli
acf4f129f6 Fix CDC txn status tracking
This might be a resolution to #38, although I'm not sure yet.
2023-08-02 13:49:38 +00:00
Francesco Mazzoli
5a6a13de5f Alpine 3.17 -> 3.18
To get <https://www.openwall.com/lists/musl/2023/05/02/1>
2023-08-02 13:05:49 +00:00
Francesco Mazzoli
2a1d8a497e Update some size estimates 2023-08-02 12:22:16 +00:00
Francesco Mazzoli
63e2db0889 Cap maximum number of CDC requests
No point letting huge queues build -- especially now that we
deduplicate client requests.
2023-08-01 21:17:23 +01:00
Francesco Mazzoli
fe2ce7aa17 See comments 2023-08-01 21:17:23 +01:00
Francesco Mazzoli
a5eb12a262 Do not alert/log on innocuous shard error in CDC 2023-08-01 13:41:18 +00:00
Francesco Mazzoli
5146a80c2d Use homegrown Xmon
I got annoyed at the old lib dropping requests when queue gets
full, I could probably fix but this is almost certainly quicker.
2023-07-30 11:16:35 +00:00
Francesco Mazzoli
e851457c52 Do not re-insert requests in C++ xmon code
It could mess up the ordering.
2023-07-30 10:58:58 +00:00
Francesco Mazzoli
8e9f4f3d8b Never die because of bad Xmon
It will alert if we're disconnected anyway, and when restarting
everything this causes crashes.
2023-07-28 08:08:03 +00:00
Francesco Mazzoli
7dceb5fda5 More alerts shenanigans 2023-07-27 15:51:15 +00:00
Francesco Mazzoli
a01b1f036d More alert-related fixes 2023-07-27 13:54:51 +00:00
Francesco Mazzoli
6a52a961eb Split CDC timings to distinguish queue time from exec time 2023-07-27 13:14:12 +00:00
Francesco Mazzoli
889c04766f Do not bump req ids when retrying requests in the CDC
Fixes #29.

The additions to codegen are unrelated -- I was exploring a different
approach based on request equality and I decided to keep those
changes in since they might be useful anyhow.
2023-07-27 11:55:33 +00:00
Francesco Mazzoli
f797663d8c Transient alerts for EPERM errors on sendto 2023-07-27 07:31:34 +00:00
Francesco Mazzoli
bf447408a6 Actually wait for things to finish terminating before reaping next one
Fixes #27. This is all kind of clunky right now, it would be much
better to just standardize the `run()` function pattern.
2023-07-26 22:31:42 +00:00
Francesco Mazzoli
15e59b8e67 More logging when closing (see #27)
It seems that we get the SIGSEGV while closing the DB.
2023-07-26 21:09:29 +00:00
Francesco Mazzoli
dd39466daa Insert CDC stats on shutdown 2023-07-26 20:41:35 +00:00
Francesco Mazzoli
999d2df52b Do not alert for missing CDC request
This is totally normal if the CDC is restarted with queued
transactions.
2023-07-26 19:38:17 +00:00
Francesco Mazzoli
45b2618296 Temporarily put a stop to alert spam 2023-07-26 19:33:00 +00:00