Francesco Mazzoli
b87a43a297
Continue running GC if servers are down
...
This was triggered by a server failing hard (fsr13), without any
short term resolution (we've already replaced the mobo, we'll probably
replace the HBA). In this case GC should still run rather than
get stuck.
2023-08-29 12:47:24 +00:00
Francesco Mazzoli
1a8cda8747
Retry if we fail to get page in spans
...
See comment for explanation, this is in preparation for #50 , see
<internal-repo/issues/50#issuecomment-23278>
in particular.
2023-08-22 15:01:33 +01:00
Francesco Mazzoli
1cab680110
Support arbitrary span/block/... policies in kmod...
...
...and also update them quickly, by indexing them by (inode, tag).
Currently they only get updated on local renames though, we should
also update them when things are moved around remotely.
2023-08-22 15:01:33 +01:00
Francesco Mazzoli
6fa520c582
Always update directory modification
...
This fixes a bona-fide bug -- we didn't update the mtime when an
edge was unlocked + moved. However we might as well blindly always
update the mtime, even if there is no POSIX-visible change, to
be on the safe side.
2023-08-21 13:33:30 +00:00
Francesco Mazzoli
b25f893403
Update estimates in ShardDB.cpp
2023-08-16 08:41:13 +00:00
Francesco Mazzoli
40f229b6f5
Add endpoint to specify which file to get the "reference" block services from
...
See comments for more details.
2023-08-16 08:40:47 +01:00
Francesco Mazzoli
9405b64a76
Remove ExpireTransientFile, make future cutoff tunable
...
Fixes #48 . Also, reorganize error handling in `eggsblocks` requests,
especially around write requests, which might help with #45 .
2023-08-15 12:43:49 +01:00
Ivan Korostelev
7ec477ca9f
CDC.cpp: minor bugfix with using optional after reset()
...
harmless in release builds, since optional in questino is POD and destructor is a noop
2023-08-09 10:45:43 +00:00
Francesco Mazzoli
a5dbe189e3
Add some block services metrics
2023-08-08 11:48:35 +00:00
Francesco Mazzoli
32e2a011ee
More grafana fixes
2023-08-08 09:28:07 +00:00
Francesco Mazzoli
467fcffefb
A few metrics fixes
2023-08-08 09:21:35 +01:00
Francesco Mazzoli
e2246afc53
More tweaks to event loops
2023-08-08 09:21:35 +01:00
Francesco Mazzoli
b2f28955a5
Log timings
2023-08-08 09:21:35 +01:00
Francesco Mazzoli
e686222040
A bit more logging
2023-08-08 09:21:35 +01:00
Francesco Mazzoli
5117ddd16e
Add shard/CDC metrics
2023-08-08 09:21:35 +01:00
Francesco Mazzoli
1922cf3c30
Factor out common looping patterns
2023-08-08 09:21:35 +01:00
Francesco Mazzoli
93b212c665
Alert while initializing shard DB
2023-08-07 10:16:00 +00:00
Francesco Mazzoli
63ed6a90fa
Reconnect to xmon on expired heartbeat
2023-08-07 10:06:40 +00:00
Francesco Mazzoli
b370118e90
Rate limit binnable xmon requests
...
This involved clearly separating non-clearable and clearable alerts,
which simplifies the design and I think satisfies all our needs.
2023-08-05 23:41:10 +01:00
Francesco Mazzoli
02a2ca2a6f
Wait for block services to come up before restarting the next one
...
This should already make #43 better.
2023-08-04 13:40:10 +00:00
Francesco Mazzoli
18b2397842
Some Timings.hpp functions
2023-08-03 23:41:11 +00:00
Francesco Mazzoli
698794ac44
Fix bad indexing in Timings.hpp
2023-08-03 21:06:49 +00:00
Francesco Mazzoli
ca987ed205
Fix UB in bincode
...
I `memcpy` a zero-sized string into `NULL`. UBsan rightfully
complains.
2023-08-03 09:58:53 +00:00
Francesco Mazzoli
0055575f71
alpine-debug -> alpinedebug, new cmake doesn't like dashes
2023-08-03 07:09:41 +00:00
Francesco Mazzoli
9ef3162882
Add error count to inspect how things failed
2023-08-03 06:53:35 +00:00
Francesco Mazzoli
acf4f129f6
Fix CDC txn status tracking
...
This might be a resolution to #38 , although I'm not sure yet.
2023-08-02 13:49:38 +00:00
Francesco Mazzoli
5a6a13de5f
Alpine 3.17 -> 3.18
...
To get <https://www.openwall.com/lists/musl/2023/05/02/1 >
2023-08-02 13:05:49 +00:00
Francesco Mazzoli
2a1d8a497e
Update some size estimates
2023-08-02 12:22:16 +00:00
Francesco Mazzoli
63e2db0889
Cap maximum number of CDC requests
...
No point letting huge queues build -- especially now that we
deduplicate client requests.
2023-08-01 21:17:23 +01:00
Francesco Mazzoli
fe2ce7aa17
See comments
2023-08-01 21:17:23 +01:00
Francesco Mazzoli
a5eb12a262
Do not alert/log on innocuous shard error in CDC
2023-08-01 13:41:18 +00:00
Francesco Mazzoli
5146a80c2d
Use homegrown Xmon
...
I got annoyed at the old lib dropping requests when queue gets
full, I could probably fix but this is almost certainly quicker.
2023-07-30 11:16:35 +00:00
Francesco Mazzoli
e851457c52
Do not re-insert requests in C++ xmon code
...
It could mess up the ordering.
2023-07-30 10:58:58 +00:00
Francesco Mazzoli
8e9f4f3d8b
Never die because of bad Xmon
...
It will alert if we're disconnected anyway, and when restarting
everything this causes crashes.
2023-07-28 08:08:03 +00:00
Francesco Mazzoli
7dceb5fda5
More alerts shenanigans
2023-07-27 15:51:15 +00:00
Francesco Mazzoli
a01b1f036d
More alert-related fixes
2023-07-27 13:54:51 +00:00
Francesco Mazzoli
6a52a961eb
Split CDC timings to distinguish queue time from exec time
2023-07-27 13:14:12 +00:00
Francesco Mazzoli
889c04766f
Do not bump req ids when retrying requests in the CDC
...
Fixes #29 .
The additions to codegen are unrelated -- I was exploring a different
approach based on request equality and I decided to keep those
changes in since they might be useful anyhow.
2023-07-27 11:55:33 +00:00
Francesco Mazzoli
f797663d8c
Transient alerts for EPERM errors on sendto
2023-07-27 07:31:34 +00:00
Francesco Mazzoli
bf447408a6
Actually wait for things to finish terminating before reaping next one
...
Fixes #27 . This is all kind of clunky right now, it would be much
better to just standardize the `run()` function pattern.
2023-07-26 22:31:42 +00:00
Francesco Mazzoli
15e59b8e67
More logging when closing (see #27 )
...
It seems that we get the SIGSEGV while closing the DB.
2023-07-26 21:09:29 +00:00
Francesco Mazzoli
dd39466daa
Insert CDC stats on shutdown
2023-07-26 20:41:35 +00:00
Francesco Mazzoli
999d2df52b
Do not alert for missing CDC request
...
This is totally normal if the CDC is restarted with queued
transactions.
2023-07-26 19:38:17 +00:00
Francesco Mazzoli
45b2618296
Temporarily put a stop to alert spam
2023-07-26 19:33:00 +00:00
Francesco Mazzoli
b0ff28dc44
Do not alert on error which can happen naturally in GC
2023-07-26 19:28:44 +00:00
Francesco Mazzoli
0fc80dfe0f
Remove additional CDC status fields
...
`status()` was racy anyway (the txn might have been gone between
first and second lookup) and these are better solved by the stats
db/graphana anyway.
2023-07-26 19:21:24 +00:00
Francesco Mazzoli
d918df0fcc
Correctly return errors when failing to connect
...
Triggered by investigating
xmon: could not read message type: unexpected EOF, will reconnect
xmon: connected to xmon REDACTED
Undertaker: hard abort - running abort handlers
Uncaught exception thrown: SyscallException(Xmon.cpp@186, 9/EBADF=Bad file descriptor in void Xmon::run()): setsockopt
which caused crashes in shards/CDC.
2023-07-26 12:59:40 +00:00
Francesco Mazzoli
60554ec58d
Have bigger histograms, remove other metrics entirely
...
The `uint16_t` -> `size_t` in `packedSize` is because now
insert stats requests are bigger than `uint16_t`.
2023-07-26 10:01:27 +00:00
Francesco Mazzoli
c2bd882cdc
Allow erasing blocks for decommissioned block services
...
Otherwise GC cannot run after disposing of a broken disk. This
commit also adds various safety checks regarding decommissioned
block services.
2023-07-24 19:03:16 +01:00
Francesco Mazzoli
5776bb6d34
Include duration in mean/stddev stat
2023-07-24 19:03:16 +01:00