Francesco Mazzoli
91db9566e1
Remove option to not write out atime which is too recent
...
This was pretty nasty to begin with; we now do it in the client.
2023-11-23 13:28:23 +00:00
Francesco Mazzoli
bcf75d5308
Shut up sanitizer
2023-11-21 17:03:05 +00:00
Francesco Mazzoli
1fca8b84cd
Fix type signature
2023-11-17 22:48:31 +00:00
Francesco Mazzoli
b964d0632a
Add option to not write out atime which is too recent
...
This is to save on a ton of writes as jobs stat tons of files.
It would maybe be a bit cleaner to do it in the kmod, but this is
much quicker.
Thanks to @sgrusny for the good idea.
2023-11-16 14:45:58 +00:00
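
One way to read the "atime which is too recent" option above is as a simple threshold check: skip the write when the stored atime is already close to the new one. The sketch below is only an illustration under that assumption. `shouldWriteAtime` and `minInterval` are hypothetical names, not identifiers from the repository.

```cpp
#include <chrono>

using Nanos = std::chrono::nanoseconds;

// Hypothetical policy: only persist a new atime when it is at least
// `minInterval` later than the atime already stored. Jobs that stat the
// same file repeatedly then cause at most one write per interval.
inline bool shouldWriteAtime(Nanos storedAtime, Nanos newAtime, Nanos minInterval) {
    return newAtime - storedAtime >= minInterval;
}
```

With a one hour interval, an access five minutes after the stored atime would be skipped, while an access two hours later would be written.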
Saulius Grusnys
2ce5586eb9
Periodically refresh metadata info in kmod, use two IPs for shuckle
...
Fixes #112.
Co-authored-by: Francesco Mazzoli <francesco.mazzoli@xtxmarkets.com>
2023-11-14 13:49:36 +00:00
Francesco Mazzoli
3bc17301d6
Switch from tuple to variant for req/resp containers
...
The `tuple` was for when I thought it'd be useful to leave slots
for each request, but we don't need this anymore, and now leading
up to #66 I want to be able to keep vectors of reqs/resps.
2023-11-09 19:03:37 +00:00
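
The motivation for the variant can be sketched as follows: a `std::variant` holds exactly one request kind at a time, so a plain `std::vector` of containers works naturally. The request types and the `describe` helper here are hypothetical stand-ins, not the generated types from the repository.

```cpp
#include <cstdint>
#include <string>
#include <type_traits>
#include <variant>
#include <vector>

// Hypothetical request kinds, for illustration only.
struct LookupReq { uint64_t dirId; std::string name; };
struct StatFileReq { uint64_t fileId; };

// A container holds exactly one of the request kinds.
using ReqContainer = std::variant<LookupReq, StatFileReq>;

// Dispatch on the kind currently held by the container.
inline std::string describe(const ReqContainer& req) {
    return std::visit([](const auto& r) -> std::string {
        using T = std::decay_t<decltype(r)>;
        if constexpr (std::is_same_v<T, LookupReq>) {
            return "lookup";
        } else {
            return "stat";
        }
    }, req);
}
```

A `std::vector<ReqContainer>` can then queue mixed requests. A tuple with one slot per request type would instead reserve space for every kind in every element.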
Francesco Mazzoli
ad3c969772
Push full RocksDB stats to grafana
2023-11-09 16:48:51 +00:00
Francesco Mazzoli
057be91613
rocksDBStats -> rocksDBMetrics
2023-11-09 13:38:32 +00:00
Francesco Mazzoli
c5979a9d90
Expose some RocksDB stats
2023-11-09 13:23:49 +00:00
Francesco Mazzoli
d0126d0656
Distinguish IO errors in eggsblocks
...
See #115 for background.
2023-11-06 19:35:05 +00:00
Francesco Mazzoli
1ec63f9710
Implement scrubbing functionality
...
Fixes #32. This also involves some reworking of the block request machinery
to make it more robust and faster. The scrubbing is done assuming that
the overwhelming majority of block checks will succeed.
2023-11-05 18:33:00 +00:00
Francesco Mazzoli
71556ce933
Switch to restech EggsFS rota
2023-11-03 14:23:44 +00:00
Francesco Mazzoli
64d400fcfe
Insert shard/cdc metrics at more regular intervals
2023-11-03 13:49:38 +00:00
Francesco Mazzoli
654c0d4db4
Report CDC queue size in grafana
2023-11-03 13:49:32 +00:00
Francesco Mazzoli
c529d96c88
Garbage collect zero block service files mappings.
...
See #91.
2023-10-21 11:41:33 +00:00
Francesco Mazzoli
24d1588b21
Add quiet window for C++ alerts, too
2023-10-02 23:02:45 +00:00
Francesco Mazzoli
2679ee7c80
Retry RocksDB transactions if appropriate
2023-09-30 10:44:40 +00:00
Francesco Mazzoli
02838e228f
Correct xmon app types
2023-09-28 11:53:12 +00:00
Francesco Mazzoli
b87a43a297
Continue running GC if servers are down
...
This was triggered by a server failing hard (fsr13), without any
short term resolution (we've already replaced the mobo, we'll probably
replace the HBA). In this case GC should still run rather than
get stuck.
2023-08-29 12:47:24 +00:00
Francesco Mazzoli
40f229b6f5
Add endpoint to specify which file to get the "reference" block services from
...
See comments for more details.
2023-08-16 08:40:47 +01:00
Francesco Mazzoli
9405b64a76
Remove ExpireTransientFile, make future cutoff tunable
...
Fixes #48. Also, reorganize error handling in `eggsblocks` requests,
especially around write requests, which might help with #45.
2023-08-15 12:43:49 +01:00
Francesco Mazzoli
a5dbe189e3
Add some block services metrics
2023-08-08 11:48:35 +00:00
Francesco Mazzoli
467fcffefb
A few metrics fixes
2023-08-08 09:21:35 +01:00
Francesco Mazzoli
e2246afc53
More tweaks to event loops
2023-08-08 09:21:35 +01:00
Francesco Mazzoli
5117ddd16e
Add shard/CDC metrics
2023-08-08 09:21:35 +01:00
Francesco Mazzoli
1922cf3c30
Factor out common looping patterns
2023-08-08 09:21:35 +01:00
Francesco Mazzoli
63ed6a90fa
Reconnect to xmon on expired heartbeat
2023-08-07 10:06:40 +00:00
Francesco Mazzoli
b370118e90
Rate limit binnable xmon requests
...
This involved clearly separating non-clearable and clearable alerts,
which simplifies the design and I think satisfies all our needs.
2023-08-05 23:41:10 +01:00
Francesco Mazzoli
02a2ca2a6f
Wait for block services to come up before restarting the next one
...
This should already make #43 better.
2023-08-04 13:40:10 +00:00
Francesco Mazzoli
18b2397842
Some Timings.hpp functions
2023-08-03 23:41:11 +00:00
Francesco Mazzoli
698794ac44
Fix bad indexing in Timings.hpp
2023-08-03 21:06:49 +00:00
Francesco Mazzoli
ca987ed205
Fix UB in bincode
...
I `memcpy` a zero-sized string into `NULL`. UBsan rightfully
complains.
2023-08-03 09:58:53 +00:00
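
The usual guard for this class of bug is to skip the call when there is nothing to copy, because `memcpy` requires valid pointers even for a length of zero. The helper below is a minimal sketch of that guard. `copyBytes` is a hypothetical name, not the actual bincode code.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical helper: copy `len` bytes, tolerating null pointers when
// the length is zero. Passing a null pointer to memcpy is undefined
// behaviour even if len == 0, so the call is skipped in that case.
inline void copyBytes(void* dst, const void* src, std::size_t len) {
    if (len == 0) {
        return; // nothing to copy; dst and src may legitimately be null
    }
    assert(dst != nullptr && src != nullptr);
    std::memcpy(dst, src, len);
}
```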
Francesco Mazzoli
9ef3162882
Add error count to inspect how things failed
2023-08-03 06:53:35 +00:00
Francesco Mazzoli
5146a80c2d
Use homegrown Xmon
...
I got annoyed at the old lib dropping requests when the queue gets
full. I could probably fix it, but this is almost certainly quicker.
2023-07-30 11:16:35 +00:00
Francesco Mazzoli
e851457c52
Do not re-insert requests in C++ xmon code
...
It could mess up the ordering.
2023-07-30 10:58:58 +00:00
Francesco Mazzoli
8e9f4f3d8b
Never die because of bad Xmon
...
It will alert if we're disconnected anyway, and when restarting
everything this causes crashes.
2023-07-28 08:08:03 +00:00
Francesco Mazzoli
7dceb5fda5
More alerts shenanigans
2023-07-27 15:51:15 +00:00
Francesco Mazzoli
889c04766f
Do not bump req ids when retrying requests in the CDC
...
Fixes #29.
The additions to codegen are unrelated -- I was exploring a different
approach based on request equality and I decided to keep those
changes in since they might be useful anyhow.
2023-07-27 11:55:33 +00:00
Francesco Mazzoli
f797663d8c
Transient alerts for EPERM errors on sendto
2023-07-27 07:31:34 +00:00
Francesco Mazzoli
bf447408a6
Actually wait for things to finish terminating before reaping next one
...
Fixes #27. This is all kind of clunky right now; it would be much
better to just standardize the `run()` function pattern.
2023-07-26 22:31:42 +00:00
Francesco Mazzoli
0fc80dfe0f
Remove additional CDC status fields
...
`status()` was racy anyway (the txn might have gone away between the
first and second lookup) and these are better solved by the stats
db/grafana anyway.
2023-07-26 19:21:24 +00:00
Francesco Mazzoli
d918df0fcc
Correctly return errors when failing to connect
...
Triggered by investigating the following log output,
xmon: could not read message type: unexpected EOF, will reconnect
xmon: connected to xmon REDACTED
Undertaker: hard abort - running abort handlers
Uncaught exception thrown: SyscallException(Xmon.cpp@186, 9/EBADF=Bad file descriptor in void Xmon::run()): setsockopt
which caused crashes in shards/CDC.
2023-07-26 12:59:40 +00:00
Francesco Mazzoli
60554ec58d
Have bigger histograms, remove other metrics entirely
...
The `uint16_t` -> `size_t` in `packedSize` is because insert stats
requests can now be bigger than what fits in a `uint16_t`.
2023-07-26 10:01:27 +00:00
Francesco Mazzoli
c2bd882cdc
Allow erasing blocks for decommissioned block services
...
Otherwise GC cannot run after disposing of a broken disk. This
commit also adds various safety checks regarding decommissioned
block services.
2023-07-24 19:03:16 +01:00
Francesco Mazzoli
5776bb6d34
Include duration in mean/stddev stat
2023-07-24 19:03:16 +01:00
Francesco Mazzoli
4dbb6c79ba
Fix bug in Xmon parsing (alert id is 8 bytes, not 4)
2023-07-24 07:40:49 +00:00
Francesco Mazzoli
fe14ec5c22
Aggregate mean/stddev stat into one, together with count
...
This makes more sense, since it lets us combine multiple stats together.
2023-07-22 20:17:53 +01:00
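
Keeping the count next to the mean and spread is what makes two stats mergeable. The sketch below uses the standard pairwise update for count, mean, and sum of squared deviations. It is an illustration of the idea, not the repository's implementation, and `RunningStat` and its members are hypothetical names.

```cpp
#include <cmath>
#include <cstdint>

// Hypothetical aggregate: count, mean, and sum of squared deviations (m2).
struct RunningStat {
    uint64_t count = 0;
    double mean = 0.0;
    double m2 = 0.0;

    // Welford's single-sample update.
    void add(double x) {
        count++;
        double delta = x - mean;
        mean += delta / static_cast<double>(count);
        m2 += delta * (x - mean);
    }

    // Merge another aggregate into this one (pairwise update).
    void merge(const RunningStat& other) {
        if (other.count == 0) { return; }
        if (count == 0) { *this = other; return; }
        double n1 = static_cast<double>(count);
        double n2 = static_cast<double>(other.count);
        double delta = other.mean - mean;
        mean += delta * n2 / (n1 + n2);
        m2 += other.m2 + delta * delta * n1 * n2 / (n1 + n2);
        count += other.count;
    }

    // Population standard deviation; 0 for an empty aggregate.
    double stddev() const {
        return count == 0 ? 0.0 : std::sqrt(m2 / static_cast<double>(count));
    }
};
```

Merging a stat built from 1, 2, 3 with one built from 4, 5 gives the same count, mean, and standard deviation as feeding all five samples into a single stat.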
Francesco Mazzoli
37ce3be74c
Implement utime-like functions
...
Also, update atime when opening a file.
2023-07-21 06:28:48 +00:00
Francesco Mazzoli
441eebb514
Do not crash on bad shuckle response
2023-07-20 12:46:38 +00:00
Francesco Mazzoli
ce21016ad9
Fix mean/stddev calculation
2023-07-19 21:44:17 +00:00