Francesco Mazzoli
c5979a9d90
Expose some RocksDB stats
2023-11-09 13:23:49 +00:00
Francesco Mazzoli
03e9510255
Align xmon's app instances and systemd services
2023-11-08 14:36:58 +00:00
Francesco Mazzoli
ef1885a4b2
Print out more info when failing because of bad proofs
2023-11-08 11:57:32 +00:00
Francesco Mazzoli
4cc917a1c7
Expose shard socket buf size to grafana
...
As a proxy to how behind shards are.
2023-11-07 14:12:55 +00:00
Francesco Mazzoli
d0126d0656
Distinguish IO errors in eggsblocks
...
See #115 for background.
2023-11-06 19:35:05 +00:00
Francesco Mazzoli
afc4e78a62
Reduce default CDC queue size
2023-11-05 22:38:57 +00:00
Francesco Mazzoli
1ec63f9710
Implement scrubbing functionality
...
Fixes #32. This also involves some reworking of the block request machinery
to make it more robust and faster. The scrubbing is done assuming that
the overwhelming majority of block checking will go through.
2023-11-05 18:33:00 +00:00
Francesco Mazzoli
71556ce933
Switch to restech EggsFS rota
2023-11-03 14:23:44 +00:00
Francesco Mazzoli
64d400fcfe
Insert shard/cdc metrics at more regular intervals
2023-11-03 13:49:38 +00:00
Francesco Mazzoli
654c0d4db4
Report CDC queue size in grafana
2023-11-03 13:49:32 +00:00
Francesco Mazzoli
674c9f22a8
Do not crash shards when swapping blocks fails
...
Fixes #101
2023-10-31 08:39:32 +00:00
Francesco Mazzoli
dd052b1919
Add Excel spreadsheet to quickly adjust RocksDB size estimates
2023-10-26 14:32:35 +00:00
Francesco Mazzoli
c529d96c88
Garbage collect zero block service files mappings.
...
See #91.
2023-10-21 11:41:33 +00:00
Francesco Mazzoli
83f38080de
Do not return FILE_NOT_FOUND when getting spans of empty transient file
2023-10-13 21:10:44 +00:00
Francesco Mazzoli
9e21969637
Slightly tighter error checks
2023-10-11 13:40:46 +01:00
Francesco Mazzoli
03ed4f951f
Alert when block proof is bad (see #89)
2023-10-10 21:37:39 +00:00
Francesco Mazzoli
c461872ace
Implement dir seeking. Fixes #83.
2023-10-09 22:32:38 +01:00
Francesco Mazzoli
6726fff0fe
Better "innocuous error" handling in CDC
2023-10-04 18:12:15 +01:00
Francesco Mazzoli
440a78510e
Add concrete quiet windows to C++ alerts
...
This together with the previous commits fixes #72.
2023-10-02 23:06:40 +00:00
Francesco Mazzoli
24d1588b21
Add quiet window for C++ alerts, too
2023-10-02 23:02:45 +00:00
Francesco Mazzoli
59237ed673
Limit number of open RocksDB files
...
We got to the point where we had ~4k open SST files per shard, which
meant that we ate up all the available FDs.
2023-09-30 11:08:35 +00:00
Francesco Mazzoli
2679ee7c80
Retry RocksDB transactions if appropriate
2023-09-30 10:44:40 +00:00
Francesco Mazzoli
1d4c4abafd
Correctly check that RocksDB txn succeeded
...
This was caught anyway by the fact that we check that the log index
is what we expect. Would have been very nasty otherwise.
The right thing to do is to check for `Status::TryAgain()` and
retry. `Status::Busy()` should never happen because we never
run transactions concurrently so far.
2023-09-30 09:51:26 +00:00
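The retry rule described in the commit above can be illustrated with a minimal sketch. It is not the code from this repository: `TxnStatus`, `RetryResult`, and `runWithRetry` are hypothetical stand-ins, and the attempt limit is an assumption (the commit message does not say whether retries are bounded). With RocksDB itself, the corresponding checks would be made on the returned `Status`, for example via `IsTryAgain()` and `IsBusy()`.

```cpp
#include <cassert>
#include <functional>

// Hypothetical stand-in for the outcome of a transaction commit.
// This is NOT the RocksDB Status type; it only mirrors the cases
// mentioned in the commit message.
enum class TxnStatus { Ok, TryAgain, Busy, Error };

struct RetryResult {
    TxnStatus status;  // status of the last attempt
    int attempts;      // number of attempts actually made
};

// Calls `attempt` until it returns something other than TryAgain,
// or until `maxAttempts` attempts have been made (assumed bound).
// Any other status, including Busy and Error, is returned to the
// caller unchanged so that it can be checked explicitly.
inline RetryResult runWithRetry(const std::function<TxnStatus()>& attempt,
                                int maxAttempts) {
    assert(maxAttempts > 0);
    TxnStatus status = TxnStatus::Error;
    int attempts = 0;
    while (attempts < maxAttempts) {
        status = attempt();
        ++attempts;
        if (status != TxnStatus::TryAgain) {
            break;  // success or a non-retryable status
        }
    }
    return {status, attempts};
}
```

A caller would pass a closure that builds and commits the transaction from scratch on each call, then treat anything other than `TxnStatus::Ok` in the result as a failure rather than assuming success.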
Francesco Mazzoli
02838e228f
Correct xmon app types
2023-09-28 11:53:12 +00:00
Francesco Mazzoli
762f047772
Add fsr17 and fsr18 to deployment
2023-09-19 12:56:34 +00:00
Francesco Mazzoli
77ac15af8d
Allow choosing xmon env in C++ apps
2023-09-18 11:56:44 +00:00
Francesco Mazzoli
b87a43a297
Continue running GC if servers are down
...
This was triggered by a server failing hard (fsr13), without any
short term resolution (we've already replaced the mobo, we'll probably
replace the HBA). In this case GC should still run rather than
get stuck.
2023-08-29 12:47:24 +00:00
Francesco Mazzoli
1a8cda8747
Retry if we fail to get page in spans
...
See comment for explanation; this is in preparation for #50, see
<internal-repo/issues/50#issuecomment-23278>
in particular.
2023-08-22 15:01:33 +01:00
Francesco Mazzoli
1cab680110
Support arbitrary span/block/... policies in kmod...
...
...and also update them quickly, by indexing them by (inode, tag).
Currently they only get updated on local renames though, we should
also update them when things are moved around remotely.
2023-08-22 15:01:33 +01:00
Francesco Mazzoli
6fa520c582
Always update directory modification
...
This fixes a bona-fide bug -- we didn't update the mtime when an
edge was unlocked + moved. However we might as well blindly always
update the mtime, even if there is no POSIX-visible change, to
be on the safe side.
2023-08-21 13:33:30 +00:00
Francesco Mazzoli
b25f893403
Update estimates in ShardDB.cpp
2023-08-16 08:41:13 +00:00
Francesco Mazzoli
40f229b6f5
Add endpoint to specify which file to get the "reference" block services from
...
See comments for more details.
2023-08-16 08:40:47 +01:00
Francesco Mazzoli
9405b64a76
Remove ExpireTransientFile, make future cutoff tunable
...
Fixes #48. Also, reorganize error handling in `eggsblocks` requests,
especially around write requests, which might help with #45.
2023-08-15 12:43:49 +01:00
Ivan Korostelev
7ec477ca9f
CDC.cpp: minor bugfix with using optional after reset()
...
harmless in release builds, since optional in question is POD and destructor is a noop
2023-08-09 10:45:43 +00:00
Francesco Mazzoli
a5dbe189e3
Add some block services metrics
2023-08-08 11:48:35 +00:00
Francesco Mazzoli
32e2a011ee
More grafana fixes
2023-08-08 09:28:07 +00:00
Francesco Mazzoli
467fcffefb
A few metrics fixes
2023-08-08 09:21:35 +01:00
Francesco Mazzoli
e2246afc53
More tweaks to event loops
2023-08-08 09:21:35 +01:00
Francesco Mazzoli
b2f28955a5
Log timings
2023-08-08 09:21:35 +01:00
Francesco Mazzoli
e686222040
A bit more logging
2023-08-08 09:21:35 +01:00
Francesco Mazzoli
5117ddd16e
Add shard/CDC metrics
2023-08-08 09:21:35 +01:00
Francesco Mazzoli
1922cf3c30
Factor out common looping patterns
2023-08-08 09:21:35 +01:00
Francesco Mazzoli
93b212c665
Alert while initializing shard DB
2023-08-07 10:16:00 +00:00
Francesco Mazzoli
63ed6a90fa
Reconnect to xmon on expired heartbeat
2023-08-07 10:06:40 +00:00
Francesco Mazzoli
b370118e90
Rate limit binnable xmon requests
...
This involved clearly separating non-clearable and clearable alerts,
which simplifies the design and I think satisfies all our needs.
2023-08-05 23:41:10 +01:00
Francesco Mazzoli
02a2ca2a6f
Wait for block services to come up before restarting the next one
...
This should already make #43 better.
2023-08-04 13:40:10 +00:00
Francesco Mazzoli
18b2397842
Some Timings.hpp functions
2023-08-03 23:41:11 +00:00
Francesco Mazzoli
698794ac44
Fix bad indexing in Timings.hpp
2023-08-03 21:06:49 +00:00
Francesco Mazzoli
ca987ed205
Fix UB in bincode
...
I `memcpy` a zero-sized string into `NULL`. UBsan rightfully
complains.
2023-08-03 09:58:53 +00:00
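The class of bug described in the commit above can be sketched as follows. Passing a null pointer to `memcpy` is undefined behavior under the C and C++ standards these sanitizers check against, even when the length is zero. The helper name `copyBytes` is hypothetical and this is only one possible guard, not necessarily the fix applied in the bincode code.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical helper: copies `len` bytes from `src` to `dst`.
// When `len` is zero, memcpy is skipped entirely, so null pointers
// (e.g. the data pointer of an empty string) never reach it.
inline void* copyBytes(void* dst, const void* src, std::size_t len) {
    if (len == 0) {
        return dst;  // nothing to copy; do not touch the pointers
    }
    // For a non-empty copy both pointers must be valid.
    assert(dst != nullptr && src != nullptr);
    return std::memcpy(dst, src, len);
}
```

With this guard, a zero-length copy into a null destination becomes a no-op instead of something UBSan flags.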
Francesco Mazzoli
0055575f71
alpine-debug -> alpinedebug, new cmake doesn't like dashes
2023-08-03 07:09:41 +00:00