Commit Graph

134 Commits

Author SHA1 Message Date
Francesco Mazzoli
fe2ce7aa17 See comments 2023-08-01 21:17:23 +01:00
Francesco Mazzoli
a5eb12a262 Do not alert/log on innocuous shard error in CDC 2023-08-01 13:41:18 +00:00
Francesco Mazzoli
5146a80c2d Use homegrown Xmon
I got annoyed at the old lib dropping requests when the queue gets
full. I could probably fix it, but this is almost certainly quicker.
2023-07-30 11:16:35 +00:00
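The drop-on-full behaviour mentioned above can be illustrated with a bounded Go channel. This is a minimal sketch, assuming a channel-backed queue and a hypothetical `enqueueOrDrop` helper. It does not reflect the old library's API or the homegrown Xmon client.

```go
package main

import "fmt"

// enqueueOrDrop does a non-blocking send on a bounded queue. If the buffer
// is full, the request is discarded and false is returned, which is the
// drop-on-full behaviour described in the commit message.
func enqueueOrDrop(queue chan<- string, req string) bool {
	select {
	case queue <- req:
		return true
	default:
		return false // queue full: request silently dropped
	}
}

func main() {
	queue := make(chan string, 2) // hypothetical capacity
	dropped := 0
	for _, req := range []string{"alert-1", "alert-2", "alert-3"} {
		if !enqueueOrDrop(queue, req) {
			dropped++
		}
	}
	fmt.Println(len(queue), dropped) // 2 1
}
```

A plain blocking send (`queue <- req`) would apply back-pressure instead of losing the third request.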
Francesco Mazzoli
e851457c52 Do not re-insert requests in C++ xmon code
It could mess up the ordering.
2023-07-30 10:58:58 +00:00
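A minimal sketch of why re-inserting can break ordering, assuming a simple FIFO of messages (hypothetical, not the actual C++ Xmon queue). If the message at the head fails to write and is pushed back onto the tail, a later message overtakes it.

```go
package main

import "fmt"

// requeueAtTail models retrying a failed write by appending the head
// request back onto the end of the queue (the pattern this commit removes).
func requeueAtTail(queue []string) []string {
	if len(queue) == 0 {
		return queue
	}
	failed := queue[0]
	return append(queue[1:], failed)
}

func main() {
	// Hypothetical messages: raising an alert, then clearing it.
	queue := []string{"raise alert 1", "clear alert 1"}
	// Suppose writing the first message fails and it is re-inserted.
	queue = requeueAtTail(queue)
	fmt.Println(queue) // [clear alert 1 raise alert 1]
}
```

Here the clear would be sent before the raise, leaving the alert stuck open.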
Francesco Mazzoli
8e9f4f3d8b Never die because of bad Xmon
It will alert if we're disconnected anyway, and dying here causes
crashes when restarting everything.
2023-07-28 08:08:03 +00:00
Francesco Mazzoli
7dceb5fda5 More alerts shenanigans 2023-07-27 15:51:15 +00:00
Francesco Mazzoli
a01b1f036d More alert-related fixes 2023-07-27 13:54:51 +00:00
Francesco Mazzoli
6a52a961eb Split CDC timings to distinguish queue time from exec time 2023-07-27 13:14:12 +00:00
Francesco Mazzoli
889c04766f Do not bump req ids when retrying requests in the CDC
Fixes #29.

The additions to codegen are unrelated -- I was exploring a different
approach based on request equality and I decided to keep those
changes in since they might be useful anyhow.
2023-07-27 11:55:33 +00:00
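One reason a retry might keep its request id: a late reply to the first attempt can then still be matched. This is a minimal sketch, assuming hypothetical `request`/`response` types and a fake transport. The source does not describe the actual CDC matching logic or the details of #29.

```go
package main

import (
	"errors"
	"fmt"
)

var errTimeout = errors.New("timeout")

// request and response are hypothetical stand-ins for CDC/shard messages.
type request struct{ id uint64 }
type response struct{ id uint64 }

// sendWithRetries sends req up to maxAttempts times. When bumpOnRetry is
// false the id stays fixed across attempts, so a late reply to an earlier
// attempt is still accepted.
func sendWithRetries(req request, maxAttempts int, bumpOnRetry bool, send func(request) (response, error)) (response, error) {
	lastErr := errTimeout
	for attempt := 0; attempt < maxAttempts; attempt++ {
		if attempt > 0 && bumpOnRetry {
			req.id++ // the behaviour the commit removes
		}
		resp, err := send(req)
		if err != nil {
			lastErr = err
			continue
		}
		if resp.id != req.id {
			lastErr = fmt.Errorf("stale response id %d (want %d)", resp.id, req.id)
			continue
		}
		return resp, nil
	}
	return response{}, lastErr
}

// lateNetwork times out on the first attempt, then delivers the reply to
// that first attempt (id 42) while the retry is in flight.
func lateNetwork() func(request) (response, error) {
	calls := 0
	return func(request) (response, error) {
		calls++
		if calls == 1 {
			return response{}, errTimeout
		}
		return response{id: 42}, nil
	}
}

func main() {
	_, err := sendWithRetries(request{id: 42}, 2, false, lateNetwork())
	fmt.Println(err) // <nil>
	_, err = sendWithRetries(request{id: 42}, 2, true, lateNetwork())
	fmt.Println(err) // stale response id 42 (want 43)
}
```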
Francesco Mazzoli
f797663d8c Transient alerts for EPERM errors on sendto 2023-07-27 07:31:34 +00:00
Francesco Mazzoli
bf447408a6 Actually wait for things to finish terminating before reaping next one
Fixes #27. This is all kind of clunky right now; it would be much
better to just standardize the `run()` function pattern.
2023-07-26 22:31:42 +00:00
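Stopping components one at a time and waiting for each to finish terminating might look like the following. This is a minimal sketch, assuming goroutine-based components with hypothetical `stop`/`done` channels. The source does not show what the standardized `run()` pattern would be.

```go
package main

import (
	"fmt"
	"time"
)

// component is a hypothetical long-running task whose run() loop exits
// when stop is closed and signals completion by closing done.
type component struct {
	name string
	stop chan struct{}
	done chan struct{}
}

func (c *component) run(log *[]string) {
	defer close(c.done)
	<-c.stop
	time.Sleep(10 * time.Millisecond) // simulate slow cleanup, e.g. closing a DB
	*log = append(*log, c.name+" stopped")
}

// shutdownInOrder starts one component per name, then stops them one at a
// time, waiting for each to finish terminating before touching the next.
// Because shutdowns are strictly sequential, only one goroutine appends to
// log at any moment.
func shutdownInOrder(names []string) []string {
	var log []string
	comps := make([]*component, 0, len(names))
	for _, name := range names {
		c := &component{name: name, stop: make(chan struct{}), done: make(chan struct{})}
		go c.run(&log)
		comps = append(comps, c)
	}
	for _, c := range comps {
		close(c.stop)
		<-c.done // do not move on while c is still cleaning up
	}
	return log
}

func main() {
	fmt.Println(shutdownInOrder([]string{"shard", "cdc", "shuckle"}))
	// [shard stopped cdc stopped shuckle stopped]
}
```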
Francesco Mazzoli
15e59b8e67 More logging when closing (see #27)
It seems that we get the SIGSEGV while closing the DB.
2023-07-26 21:09:29 +00:00
Francesco Mazzoli
dd39466daa Insert CDC stats on shutdown 2023-07-26 20:41:35 +00:00
Francesco Mazzoli
999d2df52b Do not alert for missing CDC request
This is totally normal if the CDC is restarted with queued
transactions.
2023-07-26 19:38:17 +00:00
Francesco Mazzoli
45b2618296 Temporarily put a stop to alert spam 2023-07-26 19:33:00 +00:00
Francesco Mazzoli
b0ff28dc44 Do not alert on error which can happen naturally in GC 2023-07-26 19:28:44 +00:00
Francesco Mazzoli
0fc80dfe0f Remove additional CDC status fields
`status()` was racy anyway (the txn might have been gone between
the first and second lookups), and these are better served by the
stats db/Grafana.
2023-07-26 19:21:24 +00:00
Francesco Mazzoli
d918df0fcc Correctly return errors when failing to connect
Triggered by investigating

    xmon: could not read message type: unexpected EOF, will reconnect
    xmon: connected to xmon REDACTED
    Undertaker: hard abort - running abort handlers
    Uncaught exception thrown: SyscallException(Xmon.cpp@186, 9/EBADF=Bad file descriptor in void Xmon::run()): setsockopt

which caused crashes in shards/CDC.
2023-07-26 12:59:40 +00:00
Francesco Mazzoli
60554ec58d Have bigger histograms, remove other metrics entirely
The `uint16_t` -> `size_t` change in `packedSize` is needed because
insert stats requests can now be larger than a `uint16_t` can hold.
2023-07-26 10:01:27 +00:00
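The `packedSize` change guards against silent truncation. A minimal sketch in Go, assuming a hypothetical `packedSize16` helper and request size. Converting to `uint16_t` in C++ wraps modulo 2^16 the same way.

```go
package main

import "fmt"

// packedSize16 mimics storing a packed size in a 16-bit field. Go's integer
// conversion keeps only the low 16 bits, just as converting to uint16_t
// does in C++.
func packedSize16(n int) uint16 {
	return uint16(n)
}

func main() {
	size := 70000                   // hypothetical insert-stats request size in bytes
	fmt.Println(packedSize16(size)) // 4464: 70000 - 65536, silently wrong
	fmt.Println(uint64(size))       // 70000: a wider type keeps the real size
}
```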
Francesco Mazzoli
c2bd882cdc Allow erasing blocks for decommissioned block services
Otherwise GC cannot run after disposing of a broken disk. This
commit also adds various safety checks regarding decommissioned
block services.
2023-07-24 19:03:16 +01:00
Francesco Mazzoli
5776bb6d34 Include duration in mean/stddev stat 2023-07-24 19:03:16 +01:00
Francesco Mazzoli
4dbb6c79ba Fix bug in Xmon parsing (alert id is 8 bytes, not 4) 2023-07-24 07:40:49 +00:00
Francesco Mazzoli
18b01438d4 Have -short tests actually be short, split out longer tests 2023-07-22 20:17:53 +01:00
Francesco Mazzoli
fe14ec5c22 Aggregate mean/stddev stat into one, together with count
This makes more sense, since it lets us combine multiple stats together.
2023-07-22 20:17:53 +01:00
Francesco Mazzoli
37ce3be74c Implement utime-like functions
Also, update atime when opening a file.
2023-07-21 06:28:48 +00:00
Francesco Mazzoli
441eebb514 Do not crash on bad shuckle response 2023-07-20 12:46:38 +00:00
Francesco Mazzoli
ce21016ad9 Fix mean/stddev calculation 2023-07-19 21:44:17 +00:00
Francesco Mazzoli
6aa670b481 Remove mean/stddev computation in C++
It's broken (also in Go), will fix in the following days.
2023-07-19 11:48:38 +00:00
Francesco Mazzoli
dce2961d7f Re-insert xmon requests if we fail to write them 2023-07-18 16:11:01 +00:00
Francesco Mazzoli
6973ed9ff7 Reset xmon buffer before packing stuff in 2023-07-18 16:10:28 +00:00
Francesco Mazzoli
b4613bd47e Fix other little stddev things 2023-07-18 14:43:29 +00:00
Francesco Mazzoli
5c849c0d96 Fix timings stddev overflow
This adds a couple of locks which could be avoided by being a bit
more clever, but almost certainly doesn't matter for now.
2023-07-18 14:37:37 +00:00
Francesco Mazzoli
283f3508b9 Add binary /api endpoint, use it to draw histograms
This makes /stats _a lot_ faster.
2023-07-18 12:34:57 +00:00
Francesco Mazzoli
2b1b1a1c15 Insert stats when shutting down 2023-07-17 12:27:07 +00:00
Francesco Mazzoli
dcb76a86c2 Fix _hours operator 2023-07-17 12:26:49 +00:00
Francesco Mazzoli
3cc7310a6e Add histograms for all components in /stats 2023-07-17 08:56:09 +00:00
Francesco Mazzoli
2f7be11e29 Add query for single block service in shuckle
I thought I might need it for some upcoming migration improvements.
I probably don't, but it's still kinda nice to have.
2023-07-13 09:46:37 +00:00
Francesco Mazzoli
2f1385445b Tighten up the mtime story for transient files 2023-07-12 12:52:50 +00:00
Francesco Mazzoli
d93df7ef42 Make tests pass for now 2023-07-12 12:22:40 +01:00
Francesco Mazzoli
53598c2fe9 Allow re-opening files for writing if we're already writing them
This makes `cp` work.
2023-07-12 12:22:40 +01:00
Francesco Mazzoli
65174341a0 Drop MM after flushing out a transient file 2023-07-12 12:22:40 +01:00
Francesco Mazzoli
fe88efb1ce Remove UB in xmon code 2023-07-11 14:15:33 +00:00
Francesco Mazzoli
ff9306f6e3 Add Xmon support to C++ code 2023-07-11 12:13:22 +00:00
Francesco Mazzoli
d5fea6c08c Retry when block services are unavailable in kmod 2023-07-06 19:39:12 +01:00
Saulius Grusnys
0360ec85cf Switch cutoff time to blockservice to 1h and set the deadline in shard to 2 2023-07-06 13:28:12 +01:00
Francesco Mazzoli
1a4301a499 Simplify go span read/write code, make it work with broken block services
And some other assorted changes.
2023-07-04 08:05:42 +00:00
Francesco Mazzoli
4e0e6fe8a8 Configurable CDC shard timeout
Under valgrind we seemingly can't process a small FullReadDirReq
in 100ms, which is a bit concerning, but I'll let it slide for
now.
2023-07-04 08:05:42 +00:00
Francesco Mazzoli
87d0e69f85 Port kmod to new FullReadDir request 2023-07-04 08:05:42 +00:00
Francesco Mazzoli
f0add4d926 Remove C++ varint code, we don't use varints anymore 2023-07-04 08:05:42 +00:00
Francesco Mazzoli
e2dcd43fea Fix bug in CreateLockedCurrentEdge logic
See comment in `msgs.go`. This would normally have required
entirely new transactions, but since we're not in production yet
I'm going to just change the schema and wipe the current FS.

This also adds in an unrelated change regarding more flexible
blacklisting, which will be required for some additional testing
I'm preparing.
2023-07-04 08:05:42 +00:00