Commit Graph

128 Commits

Author SHA1 Message Date
Francesco Mazzoli a01b1f036d More alert-related fixes 2023-07-27 13:54:51 +00:00
Francesco Mazzoli 6a52a961eb Split CDC timings to distinguish queue time from exec time 2023-07-27 13:14:12 +00:00
Francesco Mazzoli 889c04766f Do not bump req ids when retrying requests in the CDC
Fixes #29.

The additions to codegen are unrelated -- I was exploring a different
approach based on request equality and I decided to keep those
changes in since they might be useful anyhow.
2023-07-27 11:55:33 +00:00
Francesco Mazzoli f797663d8c Transient alerts for EPERM errors on sendto 2023-07-27 07:31:34 +00:00
Francesco Mazzoli bf447408a6 Actually wait for things to finish terminating before reaping next one
Fixes #27. This is all kind of clunky right now, it would be much
better to just standardize the `run()` function pattern.
2023-07-26 22:31:42 +00:00
Francesco Mazzoli 15e59b8e67 More logging when closing (see #27)
It seems that we get the SIGSEGV while closing the DB.
2023-07-26 21:09:29 +00:00
Francesco Mazzoli dd39466daa Insert CDC stats on shutdown 2023-07-26 20:41:35 +00:00
Francesco Mazzoli 999d2df52b Do not alert for missing CDC request
This is totally normal if the CDC is restarted with queued
transactions.
2023-07-26 19:38:17 +00:00
Francesco Mazzoli 45b2618296 Temporarily put a stop to alert spam 2023-07-26 19:33:00 +00:00
Francesco Mazzoli b0ff28dc44 Do not alert on error which can happen naturally in GC 2023-07-26 19:28:44 +00:00
Francesco Mazzoli 0fc80dfe0f Remove additional CDC status fields
`status()` was racy anyway (the txn might have gone away between the
first and second lookup), and these are better served by the stats
db/Grafana.
2023-07-26 19:21:24 +00:00
Francesco Mazzoli d918df0fcc Correctly return errors when failing to connect
Triggered by investigating

    xmon: could not read message type: unexpected EOF, will reconnect
    xmon: connected to xmon REDACTED
    Undertaker: hard abort - running abort handlers
    Uncaught exception thrown: SyscallException(Xmon.cpp@186, 9/EBADF=Bad file descriptor in void Xmon::run()): setsockopt

which caused crashes in shards/CDC.
2023-07-26 12:59:40 +00:00
Francesco Mazzoli 60554ec58d Have bigger histograms, remove other metrics entirely
The `uint16_t` -> `size_t` change in `packedSize` is needed because
insert stats requests can now be larger than a `uint16_t` can represent.
2023-07-26 10:01:27 +00:00
Francesco Mazzoli c2bd882cdc Allow erasing blocks for decommissioned block services
Otherwise GC cannot run after disposing of a broken disk. This
commit also adds various safety checks regarding decommissioned
block services.
2023-07-24 19:03:16 +01:00
Francesco Mazzoli 5776bb6d34 Include duration in mean/stddev stat 2023-07-24 19:03:16 +01:00
Francesco Mazzoli 4dbb6c79ba Fix bug in Xmon parsing (alert id is 8 bytes, not 4) 2023-07-24 07:40:49 +00:00
Francesco Mazzoli 18b01438d4 Have -short tests actually be short, split out longer tests 2023-07-22 20:17:53 +01:00
Francesco Mazzoli fe14ec5c22 Aggregate mean/stddev stat into one, together with count
This makes more sense, since it lets us combine multiple stats together.
2023-07-22 20:17:53 +01:00
Francesco Mazzoli 37ce3be74c Implement utime-like functions
Also, update atime when opening a file.
2023-07-21 06:28:48 +00:00
Francesco Mazzoli 441eebb514 Do not crash on bad shuckle response 2023-07-20 12:46:38 +00:00
Francesco Mazzoli ce21016ad9 Fix mean/stddev calculation 2023-07-19 21:44:17 +00:00
Francesco Mazzoli 6aa670b481 Remove mean/stddev computation in C++
It's broken (also in Go), will fix in the following days.
2023-07-19 11:48:38 +00:00
Francesco Mazzoli dce2961d7f Re-insert xmon requests if we fail to write them 2023-07-18 16:11:01 +00:00
Francesco Mazzoli 6973ed9ff7 Reset xmon buffer before packing stuff in 2023-07-18 16:10:28 +00:00
Francesco Mazzoli b4613bd47e Fix other little stddev things 2023-07-18 14:43:29 +00:00
Francesco Mazzoli 5c849c0d96 Fix timings stddev overflow
This adds a couple of locks which could be avoided by being a bit
more clever, but almost certainly doesn't matter for now.
2023-07-18 14:37:37 +00:00
Francesco Mazzoli 283f3508b9 Add binary /api endpoint, use it to draw histograms
This makes /stats _a lot_ faster.
2023-07-18 12:34:57 +00:00
Francesco Mazzoli 2b1b1a1c15 Insert stats when shutting down 2023-07-17 12:27:07 +00:00
Francesco Mazzoli dcb76a86c2 Fix _hours operator 2023-07-17 12:26:49 +00:00
Francesco Mazzoli 3cc7310a6e Add histograms for all components in /stats 2023-07-17 08:56:09 +00:00
Francesco Mazzoli 2f7be11e29 Add query for single block service in shuckle
I thought I might need it for some upcoming migration improvements,
I probably don't, but still kinda nice to have.
2023-07-13 09:46:37 +00:00
Francesco Mazzoli 2f1385445b Tighten up the mtime story for transient files 2023-07-12 12:52:50 +00:00
Francesco Mazzoli d93df7ef42 Make tests pass for now 2023-07-12 12:22:40 +01:00
Francesco Mazzoli 53598c2fe9 Allow re-opening files for writing if we're already writing them
This makes `cp` work
2023-07-12 12:22:40 +01:00
Francesco Mazzoli 65174341a0 Drop MM after flushing out a transient file 2023-07-12 12:22:40 +01:00
Francesco Mazzoli fe88efb1ce Remove UB in xmon code 2023-07-11 14:15:33 +00:00
Francesco Mazzoli ff9306f6e3 Add Xmon support to C++ code 2023-07-11 12:13:22 +00:00
Francesco Mazzoli d5fea6c08c Retry when block services are unavailable in kmod 2023-07-06 19:39:12 +01:00
Saulius Grusnys 0360ec85cf Switch cutoff time to blockservice to 1h and set the deadline in shard to 2 2023-07-06 13:28:12 +01:00
Francesco Mazzoli 1a4301a499 Simplify go span read/write code, make it work with broken block services
And some other assorted changes.
2023-07-04 08:05:42 +00:00
Francesco Mazzoli 4e0e6fe8a8 Configurable CDC shard timeout
When running under valgrind, it seems a small FullReadDirReq cannot
be processed within 100ms, which is a bit concerning, but I'll let
it slide for now.
2023-07-04 08:05:42 +00:00
Francesco Mazzoli 87d0e69f85 Port kmod to new FullReadDir request 2023-07-04 08:05:42 +00:00
Francesco Mazzoli f0add4d926 Remove C++ varint code, we don't use varints anymore 2023-07-04 08:05:42 +00:00
Francesco Mazzoli e2dcd43fea Fix bug in CreateLockedCurrentEdge logic
See comment in `msgs.go`. This would normally have required
entirely new transactions, but since we're not in production yet
I'm going to just change the schema and wipe the current FS.

This also adds in an unrelated change regarding more flexible
blacklisting, which will be required for some additional testing
I'm preparing.
2023-07-04 08:05:42 +00:00
Francesco Mazzoli 0f114623f3 Just use unix nanos for eggs times
This was bugging me for a while, but the final straw was that if
one wants to use the max time (for example to look backwards when
traversing edges), you cannot trivially convert from one to the
other, since you'd overflow. So you can't (for instance) trivially
convert from eggs time to `time.Time` in go.

The main disadvantage is that we lose ~50 of the ~600 years
representable with nanoseconds. But I think that's fine.
2023-07-04 08:05:42 +00:00
Francesco Mazzoli dd78912c0c More stuff as debug 2023-06-18 12:50:05 +00:00
Francesco Mazzoli c328cca75b Fix shard bug when returning from idempotent locked edge creation 2023-06-16 15:20:40 +00:00
Francesco Mazzoli 016c4bf162 First GH workflow attempt 2023-06-15 15:56:34 +00:00
Francesco Mazzoli 444ffba63f Propagate BS flags 2023-06-15 13:53:40 +00:00
Francesco Mazzoli e26eeaede1 Add "mtu" field to requests that benefit from it
Not used right now, but this way we can easily start stuffing more
data in responses.

I also split off some arguments in `NewClient`, unrelated change
(I wanted to pair the MTU with a single client, but I then realized
that it's enough to have it as some global property for now).
2023-06-15 11:57:05 +00:00