Commit Graph

57 Commits

Author SHA1 Message Date
Francesco Mazzoli 2679ee7c80 Retry RocksDB transactions if appropriate 2023-09-30 10:44:40 +00:00
Francesco Mazzoli 1d4c4abafd Correctly check that RocksDB txn succeeded
This was caught anyway by the fact that we check that the log index
is what we expect. Would have been very nasty otherwise.

The right thing to do is to check for `Status::TryAgain()` and
retry. `Status::Busy()` should never happen because we never
run transactions concurrently so far.
2023-09-30 09:51:26 +00:00
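
The retry rule described in the commit above can be sketched as follows. This is a minimal illustration, not the code from the repository: `CommitStatus` and `commitWithRetry` are hypothetical names, and the enum stands in for the `rocksdb::Status` checks (`ok()`, `IsTryAgain()`, `IsBusy()`) that a real implementation would presumably use. The handling of `Busy` as a hard error follows the commit message's assumption that transactions never run concurrently.

```cpp
#include <cassert>
#include <functional>
#include <stdexcept>

// Hypothetical stand-in for the outcome of one transaction attempt.
// Real code would inspect a rocksdb::Status instead.
enum class CommitStatus { Ok, TryAgain, Busy, OtherError };

// Runs `attempt` until it succeeds or `maxAttempts` is exhausted.
// Returns the number of attempts made; throws on non-retryable outcomes.
inline int commitWithRetry(const std::function<CommitStatus()>& attempt, int maxAttempts) {
    assert(maxAttempts > 0);
    for (int i = 1; i <= maxAttempts; i++) {
        switch (attempt()) {
        case CommitStatus::Ok:
            return i;
        case CommitStatus::TryAgain:
            continue; // transient: run the whole transaction again
        case CommitStatus::Busy:
            // Assumed impossible here, since transactions are never concurrent.
            throw std::logic_error("unexpected Busy status");
        case CommitStatus::OtherError:
            throw std::runtime_error("transaction failed");
        }
    }
    throw std::runtime_error("transaction still failing after retries");
}
```

The callable passed as `attempt` is expected to rebuild and commit the transaction from scratch on every call, so a retry never reuses state from a failed attempt.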
Francesco Mazzoli 02838e228f Correct xmon app types 2023-09-28 11:53:12 +00:00
Francesco Mazzoli 762f047772 Add fsr17 and fsr18 to deployment 2023-09-19 12:56:34 +00:00
Francesco Mazzoli 77ac15af8d Allow choosing xmon env in C++ apps 2023-09-18 11:56:44 +00:00
Ivan Korostelev 7ec477ca9f CDC.cpp: minor bugfix for using optional after reset()
Harmless in release builds, since the optional in question is POD and its destructor is a no-op.
2023-08-09 10:45:43 +00:00
Francesco Mazzoli 32e2a011ee More grafana fixes 2023-08-08 09:28:07 +00:00
Francesco Mazzoli 467fcffefb A few metrics fixes 2023-08-08 09:21:35 +01:00
Francesco Mazzoli e2246afc53 More tweaks to event loops 2023-08-08 09:21:35 +01:00
Francesco Mazzoli 5117ddd16e Add shard/CDC metrics 2023-08-08 09:21:35 +01:00
Francesco Mazzoli 1922cf3c30 Factor out common looping patterns 2023-08-08 09:21:35 +01:00
Francesco Mazzoli 93b212c665 Alert while initializing shard DB 2023-08-07 10:16:00 +00:00
Francesco Mazzoli b370118e90 Rate limit binnable xmon requests
This involved clearly separating non-clearable and clearable alerts,
which simplifies the design and I think satisfies all our needs.
2023-08-05 23:41:10 +01:00
Francesco Mazzoli 9ef3162882 Add error count to inspect how things failed 2023-08-03 06:53:35 +00:00
Francesco Mazzoli acf4f129f6 Fix CDC txn status tracking
This might be a resolution to #38, although I'm not sure yet.
2023-08-02 13:49:38 +00:00
Francesco Mazzoli 63e2db0889 Cap maximum number of CDC requests
No point letting huge queues build -- especially now that we
deduplicate client requests.
2023-08-01 21:17:23 +01:00
Francesco Mazzoli fe2ce7aa17 See comments 2023-08-01 21:17:23 +01:00
Francesco Mazzoli a5eb12a262 Do not alert/log on innocuous shard error in CDC 2023-08-01 13:41:18 +00:00
Francesco Mazzoli e851457c52 Do not re-insert requests in C++ xmon code
It could mess up the ordering.
2023-07-30 10:58:58 +00:00
Francesco Mazzoli 7dceb5fda5 More alerts shenanigans 2023-07-27 15:51:15 +00:00
Francesco Mazzoli a01b1f036d More alert-related fixes 2023-07-27 13:54:51 +00:00
Francesco Mazzoli 6a52a961eb Split CDC timings to distinguish queue time from exec time 2023-07-27 13:14:12 +00:00
Francesco Mazzoli 889c04766f Do not bump req ids when retrying requests in the CDC
Fixes #29.

The additions to codegen are unrelated -- I was exploring a different
approach based on request equality and I decided to keep those
changes in since they might be useful anyhow.
2023-07-27 11:55:33 +00:00
Francesco Mazzoli f797663d8c Transient alerts for EPERM errors on sendto 2023-07-27 07:31:34 +00:00
Francesco Mazzoli bf447408a6 Actually wait for things to finish terminating before reaping next one
Fixes #27. This is all kind of clunky right now; it would be much
better to just standardize the `run()` function pattern.
2023-07-26 22:31:42 +00:00
Francesco Mazzoli dd39466daa Insert CDC stats on shutdown 2023-07-26 20:41:35 +00:00
Francesco Mazzoli 999d2df52b Do not alert for missing CDC request
This is totally normal if the CDC is restarted with queued
transactions.
2023-07-26 19:38:17 +00:00
Francesco Mazzoli 45b2618296 Temporarily put a stop to alert spam 2023-07-26 19:33:00 +00:00
Francesco Mazzoli b0ff28dc44 Do not alert on error which can happen naturally in GC 2023-07-26 19:28:44 +00:00
Francesco Mazzoli 0fc80dfe0f Remove additional CDC status fields
`status()` was racy anyway (the txn might have been gone between
the first and second lookup) and these are better served by the stats
db/Grafana.
2023-07-26 19:21:24 +00:00
Francesco Mazzoli 60554ec58d Have bigger histograms, remove other metrics entirely
The `uint16_t` -> `size_t` in `packedSize` is because now
insert stats requests are bigger than `uint16_t`.
2023-07-26 10:01:27 +00:00
Francesco Mazzoli 2b1b1a1c15 Insert stats when shutting down 2023-07-17 12:27:07 +00:00
Francesco Mazzoli 3cc7310a6e Add histograms for all components in /stats 2023-07-17 08:56:09 +00:00
Francesco Mazzoli ff9306f6e3 Add Xmon support to C++ code 2023-07-11 12:13:22 +00:00
Francesco Mazzoli 4e0e6fe8a8 Configurable CDC shard timeout
When running under valgrind, even a small FullReadDirReq does not
seem to get processed within 100ms, which is a bit concerning, but
I'll let it slide for now.
2023-07-04 08:05:42 +00:00
Francesco Mazzoli e2dcd43fea Fix bug in CreateLockedCurrentEdge logic
See comment in `msgs.go`. This would normally have required
entirely new transactions, but since we're not in production yet
I'm going to just change the schema and wipe the current FS.

This also adds in an unrelated change regarding more flexible
blacklisting, which will be required for some additional testing
I'm preparing.
2023-07-04 08:05:42 +00:00
Francesco Mazzoli dd78912c0c More stuff as debug 2023-06-18 12:50:05 +00:00
Francesco Mazzoli e26eeaede1 Add "mtu" field to requests that benefit from it
Not used right now, but this way we can easily start stuffing more
data in responses.

I also split off some arguments in `NewClient`, unrelated change
(I wanted to pair the MTU with a single client, but I then realized
that it's enough to have it as some global property for now).
2023-06-15 11:57:05 +00:00
Francesco Mazzoli d1e02e261b Various QOL improvements
Also, try to avoid thundering herds on shuckle from CDC/shards too.
2023-06-08 11:59:09 +00:00
Francesco Mazzoli d076941ce8 Simplify block write/fetch
And hopefully reduce the likelihood of bugs. On the write end, given
that we do things less asynchronously, things might be a bit slower,
but I think the simplification is worth it for now.

Also, fix/improve a bunch of other stuff.
2023-06-08 11:59:09 +00:00
Francesco Mazzoli b041d14860 Add second ip/addr for CDC/shards too
This is one of the two data model/protocol changes I want to perform
before going into production, the other being file atime.

Right now the kernel module does not take advantage of this, but
it's OK since I tested the rest of the code reasonably and the goal
here is to perform the protocol/data changes.
2023-06-05 12:14:14 +00:00
Francesco Mazzoli a12a938c40 syslogify logs 2023-05-29 09:52:01 +00:00
Francesco Mazzoli 1458759534 Allow enabling shard/CDC debugging at runtime using USR2 2023-05-26 10:03:59 +00:00
Francesco Mazzoli 1eab8ee6cf Add versions to some RocksDB values
Only the ones where it is needed -- in some cases we can just
modify the keys (e.g. metadata stuff).

Also, come up with a sort of horrifying but more robust way
to specify the RocksDB values with the C preprocessor.
2023-05-22 08:03:01 +00:00
Francesco Mazzoli 6addbdee6a First version of kernel module
Initial version really by Pawel, but many changes in between.

Big outstanding issues:

* span cache reclamation (unbounded memory otherwise...)
* bad block service detection and workarounds
* corrupted block detection and workarounds

Co-authored-by: Paweł Dziepak <pawel.dziepak@xtxmarkets.com>
2023-05-18 15:29:41 +00:00
Francesco Mazzoli 5bff9b8fae Many, many changes -- tests pass, but FUSE is currently not present
The main thing that's added is full RS support, but a lot of things
were rejigged along the way. The tests are still a bit lacking,
and will be augmented in future commits.
2023-03-03 16:42:22 +00:00
Francesco Mazzoli e1b8de02dc More assorted improvements 2023-02-15 14:03:53 +00:00
Francesco Mazzoli 51860fac3a Various improvements, nothing substantial 2023-02-14 22:39:38 +00:00
Francesco Mazzoli 4288189766 Reorganize logs, add req/resp to CLI, add last seen to UI 2023-02-14 12:21:48 +00:00
Francesco Mazzoli e580cd5fe9 Select right source address in CDC/Shard 2023-02-14 12:20:21 +00:00