Commit Graph

88 Commits

Author SHA1 Message Date
Francesco Mazzoli 64d400fcfe Insert shard/cdc metrics at more regular intervals 2023-11-03 13:49:38 +00:00
Francesco Mazzoli 654c0d4db4 Report CDC queue size in grafana 2023-11-03 13:49:32 +00:00
Francesco Mazzoli c529d96c88 Garbage collect zero block service files mappings.
See #91.
2023-10-21 11:41:33 +00:00
Francesco Mazzoli 24d1588b21 Add quiet window for C++ alerts, too 2023-10-02 23:02:45 +00:00
Francesco Mazzoli 2679ee7c80 Retry RocksDB transactions if appropriate 2023-09-30 10:44:40 +00:00
Francesco Mazzoli 02838e228f Correct xmon app types 2023-09-28 11:53:12 +00:00
Francesco Mazzoli b87a43a297 Continue running GC if servers are down
This was triggered by a server failing hard (fsr13), without any
short term resolution (we've already replaced the mobo, we'll probably
replace the HBA). In this case GC should still run rather than
get stuck.
2023-08-29 12:47:24 +00:00
Francesco Mazzoli 40f229b6f5 Add endpoint to specify which file to get the "reference" block services from
See comments for more details.
2023-08-16 08:40:47 +01:00
Francesco Mazzoli 9405b64a76 Remove ExpireTransientFile, make future cutoff tunable
Fixes #48. Also, reorganize error handling in `eggsblocks` requests,
especially around write requests, which might help with #45.
2023-08-15 12:43:49 +01:00
Francesco Mazzoli a5dbe189e3 Add some block services metrics 2023-08-08 11:48:35 +00:00
Francesco Mazzoli 467fcffefb A few metrics fixes 2023-08-08 09:21:35 +01:00
Francesco Mazzoli e2246afc53 More tweaks to event loops 2023-08-08 09:21:35 +01:00
Francesco Mazzoli 5117ddd16e Add shard/CDC metrics 2023-08-08 09:21:35 +01:00
Francesco Mazzoli 1922cf3c30 Factor out common looping patterns 2023-08-08 09:21:35 +01:00
Francesco Mazzoli 63ed6a90fa Reconnect to xmon on expired heartbeat 2023-08-07 10:06:40 +00:00
Francesco Mazzoli b370118e90 Rate limit binnable xmon requests
This involved clearly separating non-clearable and clearable alerts,
which simplifies the design and I think satisfies all our needs.
2023-08-05 23:41:10 +01:00
Francesco Mazzoli 02a2ca2a6f Wait for block services to come up before restarting the next one
This should already make #43 better.
2023-08-04 13:40:10 +00:00
Francesco Mazzoli 18b2397842 Some Timings.hpp functions 2023-08-03 23:41:11 +00:00
Francesco Mazzoli 698794ac44 Fix bad indexing in Timings.hpp 2023-08-03 21:06:49 +00:00
Francesco Mazzoli ca987ed205 Fix UB in bincode
I `memcpy` a zero-sized string into `NULL`. UBsan rightfully
complains.
2023-08-03 09:58:53 +00:00
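
For context on ca987ed205: passing a null pointer to `memcpy` is undefined behavior under the C library rules that C++ inherits, even when the length is zero, which is why UBSan flags it. The following is a minimal sketch of the usual guard, not the actual bincode fix; the helper name `copyBytes` is hypothetical and does not come from the repository.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical helper (an assumption, not code from the repository).
// Copies `len` bytes from `src` to `dst`, skipping the call entirely for
// empty inputs so that a null pointer is never handed to memcpy.
inline void copyBytes(void* dst, const void* src, std::size_t len) {
    if (len == 0) {
        return; // nothing to copy; dst and src may legitimately be null here
    }
    std::memcpy(dst, src, len);
}
```

With this guard, an empty string backed by a null buffer can be packed or unpacked without tripping the sanitizer, and non-empty copies behave exactly as before.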
Francesco Mazzoli 9ef3162882 Add error count to inspect how things failed 2023-08-03 06:53:35 +00:00
Francesco Mazzoli 5146a80c2d Use homegrown Xmon
I got annoyed at the old lib dropping requests when the queue gets
full. I could probably fix it, but this is almost certainly quicker.
2023-07-30 11:16:35 +00:00
Francesco Mazzoli e851457c52 Do not re-insert requests in C++ xmon code
It could mess up the ordering.
2023-07-30 10:58:58 +00:00
Francesco Mazzoli 8e9f4f3d8b Never die because of bad Xmon
It will alert if we're disconnected anyway, and when restarting
everything this causes crashes.
2023-07-28 08:08:03 +00:00
Francesco Mazzoli 7dceb5fda5 More alerts shenanigans 2023-07-27 15:51:15 +00:00
Francesco Mazzoli 889c04766f Do not bump req ids when retrying requests in the CDC
Fixes #29.

The additions to codegen are unrelated -- I was exploring a different
approach based on request equality and I decided to keep those
changes in since they might be useful anyhow.
2023-07-27 11:55:33 +00:00
Francesco Mazzoli f797663d8c Transient alerts for EPERM errors on sendto 2023-07-27 07:31:34 +00:00
Francesco Mazzoli bf447408a6 Actually wait for things to finish terminating before reaping next one
Fixes #27. This is all kind of clunky right now; it would be much
better to just standardize the `run()` function pattern.
2023-07-26 22:31:42 +00:00
Francesco Mazzoli 0fc80dfe0f Remove additional CDC status fields
`status()` was racy anyway (the txn might have been gone between
the first and second lookup) and these are better solved by the stats
db/Grafana.
2023-07-26 19:21:24 +00:00
Francesco Mazzoli d918df0fcc Correctly return errors when failing to connect
Triggered by investigating

    xmon: could not read message type: unexpected EOF, will reconnect
    xmon: connected to xmon REDACTED
    Undertaker: hard abort - running abort handlers
    Uncaught exception thrown: SyscallException(Xmon.cpp@186, 9/EBADF=Bad file descriptor in void Xmon::run()): setsockopt

which caused crashes in shards/CDC.
2023-07-26 12:59:40 +00:00
Francesco Mazzoli 60554ec58d Have bigger histograms, remove other metrics entirely
The `uint16_t` -> `size_t` change in `packedSize` is because the size of
insert stats requests no longer fits in a `uint16_t`.
2023-07-26 10:01:27 +00:00
Francesco Mazzoli c2bd882cdc Allow erasing blocks for decommissioned block services
Otherwise GC cannot run after disposing of a broken disk. This
commit also adds various safety checks regarding decommissioned
block services.
2023-07-24 19:03:16 +01:00
Francesco Mazzoli 5776bb6d34 Include duration in mean/stddev stat 2023-07-24 19:03:16 +01:00
Francesco Mazzoli 4dbb6c79ba Fix bug in Xmon parsing (alert id is 8 bytes, not 4) 2023-07-24 07:40:49 +00:00
Francesco Mazzoli fe14ec5c22 Aggregate mean/stddev stat into one, together with count
This makes more sense, since it lets us combine multiple stats together.
2023-07-22 20:17:53 +01:00
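
To illustrate why carrying the count alongside mean/stddev (fe14ec5c22) makes stats combinable, here is a minimal sketch using the standard pairwise merge formula (Chan et al.) on top of Welford's update. It is not the repository's implementation: the `Stat` struct, its fields, and the function names are assumptions, and it stores the sum of squared deviations (`m2`) rather than the stddev itself, deriving the population stddev on demand.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Hypothetical aggregate (names are assumptions, not from the repository).
struct Stat {
    std::uint64_t count = 0;
    double mean = 0.0;
    double m2 = 0.0; // sum of squared deviations from the mean
};

// Welford's online update: fold one sample into the aggregate.
inline void addSample(Stat& s, double x) {
    s.count++;
    double delta = x - s.mean;
    s.mean += delta / static_cast<double>(s.count);
    s.m2 += delta * (x - s.mean);
}

// Pairwise merge: the result describes the union of both sample sets.
inline Stat combine(const Stat& a, const Stat& b) {
    if (a.count == 0) return b;
    if (b.count == 0) return a;
    Stat out;
    double na = static_cast<double>(a.count);
    double nb = static_cast<double>(b.count);
    double n = na + nb;
    double delta = b.mean - a.mean;
    out.count = a.count + b.count;
    out.mean = a.mean + delta * nb / n;
    out.m2 = a.m2 + b.m2 + delta * delta * na * nb / n;
    return out;
}

// Population standard deviation; 0 for an empty aggregate.
inline double stddev(const Stat& s) {
    return s.count == 0 ? 0.0 : std::sqrt(s.m2 / static_cast<double>(s.count));
}
```

Without the count, two mean/stddev pairs cannot be merged correctly, because each side's weight in the combined mean and variance is unknown.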
Francesco Mazzoli 37ce3be74c Implement utime-like functions
Also, update atime when opening a file.
2023-07-21 06:28:48 +00:00
Francesco Mazzoli 441eebb514 Do not crash on bad shuckle response 2023-07-20 12:46:38 +00:00
Francesco Mazzoli ce21016ad9 Fix mean/stddev calculation 2023-07-19 21:44:17 +00:00
Francesco Mazzoli 6aa670b481 Remove mean/stddev computation in C++
It's broken (also in Go), will fix in the following days.
2023-07-19 11:48:38 +00:00
Francesco Mazzoli dce2961d7f Re-insert xmon requests if we fail to write them 2023-07-18 16:11:01 +00:00
Francesco Mazzoli 6973ed9ff7 Reset xmon buffer before packing stuff in 2023-07-18 16:10:28 +00:00
Francesco Mazzoli b4613bd47e Fix other little stddev things 2023-07-18 14:43:29 +00:00
Francesco Mazzoli 5c849c0d96 Fix timings stddev overflow
This adds a couple of locks which could be avoided by being a bit
more clever, but almost certainly doesn't matter for now.
2023-07-18 14:37:37 +00:00
Francesco Mazzoli 283f3508b9 Add binary /api endpoint, use it to draw histograms
This makes /stats _a lot_ faster.
2023-07-18 12:34:57 +00:00
Francesco Mazzoli dcb76a86c2 Fix _hours operator 2023-07-17 12:26:49 +00:00
Francesco Mazzoli 3cc7310a6e Add histograms for all components in /stats 2023-07-17 08:56:09 +00:00
Francesco Mazzoli 2f7be11e29 Add query for single block service in shuckle
I thought I might need it for some upcoming migration improvements,
I probably don't, but still kinda nice to have.
2023-07-13 09:46:37 +00:00
Francesco Mazzoli 53598c2fe9 Allow re-opening files for writing if we're already writing them
This makes `cp` work.
2023-07-12 12:22:40 +01:00
Francesco Mazzoli 65174341a0 Drop MM after flushing out a transient file 2023-07-12 12:22:40 +01:00
Francesco Mazzoli fe88efb1ce Remove UB in xmon code 2023-07-11 14:15:33 +00:00