189 Commits

Author SHA1 Message Date
Francesco Mazzoli c5979a9d90 Expose some RocksDB stats 2023-11-09 13:23:49 +00:00
Francesco Mazzoli 03e9510255 Align xmon's app instances and systemd services 2023-11-08 14:36:58 +00:00
Francesco Mazzoli ef1885a4b2 Print out more info when failing because of bad proofs 2023-11-08 11:57:32 +00:00
Francesco Mazzoli 4cc917a1c7 Expose shard socket buf size to grafana
As a proxy to how behind shards are.
2023-11-07 14:12:55 +00:00
Francesco Mazzoli d0126d0656 Distinguish IO errors in eggsblocks
See #115 for background.
2023-11-06 19:35:05 +00:00
Francesco Mazzoli afc4e78a62 Reduce default CDC queue size 2023-11-05 22:38:57 +00:00
Francesco Mazzoli 1ec63f9710 Implement scrubbing functionality
Fixes #32. This also involves some reworking of the block request machinery
to make it more robust and faster. The scrubbing is done assuming that
the overwhelming majority of block checking will go through.
2023-11-05 18:33:00 +00:00
Francesco Mazzoli 71556ce933 Switch to restech EggsFS rota 2023-11-03 14:23:44 +00:00
Francesco Mazzoli 64d400fcfe Insert shard/cdc metrics at more regular intervals 2023-11-03 13:49:38 +00:00
Francesco Mazzoli 654c0d4db4 Report CDC queue size in grafana 2023-11-03 13:49:32 +00:00
Francesco Mazzoli 674c9f22a8 Do not crash shards when swapping blocks fails
Fixes #101
2023-10-31 08:39:32 +00:00
Francesco Mazzoli dd052b1919 Add excel spreadsheet to quickly adjust RocksDB size estimates 2023-10-26 14:32:35 +00:00
Francesco Mazzoli c529d96c88 Garbage collect zero block service files mappings.
See #91.
2023-10-21 11:41:33 +00:00
Francesco Mazzoli 83f38080de Do not return FILE_NOT_FOUND when getting spans of empty transient file 2023-10-13 21:10:44 +00:00
Francesco Mazzoli 9e21969637 Slightly tighter error checks 2023-10-11 13:40:46 +01:00
Francesco Mazzoli 03ed4f951f Alert when block proof is bad (see #89) 2023-10-10 21:37:39 +00:00
Francesco Mazzoli c461872ace Implement dir seeking. Fixes #83. 2023-10-09 22:32:38 +01:00
Francesco Mazzoli 6726fff0fe Better "innocuous error" handling in CDC 2023-10-04 18:12:15 +01:00
Francesco Mazzoli 440a78510e Add concrete quiet windows to C++ alerts
This together with the previous commits fixes #72.
2023-10-02 23:06:40 +00:00
Francesco Mazzoli 24d1588b21 Add quiet window for C++ alerts, too 2023-10-02 23:02:45 +00:00
Francesco Mazzoli 59237ed673 Limit number of open RocksDB files
We got to the point where we had ~4k open SST files per shard, which
meant that we ate up all the available FDs.
2023-09-30 11:08:35 +00:00
Francesco Mazzoli 2679ee7c80 Retry RocksDB transactions if appropriate 2023-09-30 10:44:40 +00:00
Francesco Mazzoli 1d4c4abafd Correctly check that RocksDB txn succeeded
This was caught anyway by the fact that we check that the log index
is what we expect. Would have been very nasty otherwise.

The right thing to do is to check for `Status::TryAgain()` and
retry. `Status::Busy()` should never happen because we never
run transactions concurrently so far.
2023-09-30 09:51:26 +00:00
Francesco Mazzoli 02838e228f Correct xmon app types 2023-09-28 11:53:12 +00:00
Francesco Mazzoli 762f047772 Add fsr17 and fsr18 to deployment 2023-09-19 12:56:34 +00:00
Francesco Mazzoli 77ac15af8d Allow to choose xmon env in C++ apps 2023-09-18 11:56:44 +00:00
Francesco Mazzoli b87a43a297 Continue running GC if servers are down
This was triggered by a server failing hard (fsr13), without any
short term resolution (we've already replaced the mobo, we'll probably
replace the HBA). In this case GC should still run rather than
get stuck.
2023-08-29 12:47:24 +00:00
Francesco Mazzoli 1a8cda8747 Retry if we fail to get page in spans
See comment for explanation, this is in preparation for #50, see
<internal-repo/issues/50#issuecomment-23278>
in particular.
2023-08-22 15:01:33 +01:00
Francesco Mazzoli 1cab680110 Support arbitrary span/block/... policies in kmod...
...and also update them quickly, by indexing them by (inode, tag).

Currently they only get updated on local renames though, we should
also update them when things are moved around remotely.
2023-08-22 15:01:33 +01:00
Francesco Mazzoli 6fa520c582 Always update directory modification
This fixes a bona-fide bug -- we didn't update the mtime when an
edge was unlocked + moved. However we might as well blindly always
update the mtime, even if there is no POSIX-visible change, to
be on the safe side.
2023-08-21 13:33:30 +00:00
Francesco Mazzoli b25f893403 Update estimates in ShardDB.cpp 2023-08-16 08:41:13 +00:00
Francesco Mazzoli 40f229b6f5 Add endpoint to specify which file to get the "reference" block services from
See comments for more details.
2023-08-16 08:40:47 +01:00
Francesco Mazzoli 9405b64a76 Remove ExpireTransientFile, make future cutoff tunable
Fixes #48. Also, reorganize error handling in `eggsblocks` requests,
especially around write requests, which might help with #45.
2023-08-15 12:43:49 +01:00
Ivan Korostelev 7ec477ca9f CDC.cpp: minor bugfix with using optional after reset()
harmless in release builds, since the optional in question is POD and its destructor is a noop
2023-08-09 10:45:43 +00:00
Francesco Mazzoli a5dbe189e3 Add some block services metrics 2023-08-08 11:48:35 +00:00
Francesco Mazzoli 32e2a011ee More grafana fixes 2023-08-08 09:28:07 +00:00
Francesco Mazzoli 467fcffefb A few metrics fixes 2023-08-08 09:21:35 +01:00
Francesco Mazzoli e2246afc53 More tweaks to event loops 2023-08-08 09:21:35 +01:00
Francesco Mazzoli b2f28955a5 Log timings 2023-08-08 09:21:35 +01:00
Francesco Mazzoli e686222040 A bit more logging 2023-08-08 09:21:35 +01:00
Francesco Mazzoli 5117ddd16e Add shard/CDC metrics 2023-08-08 09:21:35 +01:00
Francesco Mazzoli 1922cf3c30 Factor out common looping patterns 2023-08-08 09:21:35 +01:00
Francesco Mazzoli 93b212c665 Alert while initializing shard DB 2023-08-07 10:16:00 +00:00
Francesco Mazzoli 63ed6a90fa Reconnect to xmon on expired heartbeat 2023-08-07 10:06:40 +00:00
Francesco Mazzoli b370118e90 Rate limit binnable xmon requests
This involved clearly separating non-clearable and clearable alerts,
which simplifies the design and I think satisfies all our needs.
2023-08-05 23:41:10 +01:00
Francesco Mazzoli 02a2ca2a6f Wait for block services to come up before restarting the next one
This should already make #43 better.
2023-08-04 13:40:10 +00:00
Francesco Mazzoli 18b2397842 Some Timings.hpp functions 2023-08-03 23:41:11 +00:00
Francesco Mazzoli 698794ac44 Fix bad indexing in Timings.hpp 2023-08-03 21:06:49 +00:00
Francesco Mazzoli ca987ed205 Fix UB in bincode
I `memcpy` a zero-sized string into `NULL`. UBsan rightfully
complains.
2023-08-03 09:58:53 +00:00
Francesco Mazzoli 0055575f71 alpine-debug -> alpinedebug, new cmake doesn't like dashes 2023-08-03 07:09:41 +00:00