Commit Graph

83 Commits

Author SHA1 Message Date
Francesco Mazzoli bd278ff6f6 Better metrics for shard responses in CDC 2023-11-29 13:52:44 +00:00
Francesco Mazzoli 4453083aa7 Correctly record request id when picking up transactions after restart 2023-11-29 11:08:07 +00:00
Francesco Mazzoli a367858684 Drop entire CF at once, rather than one-by-one
A dry run of the production upgrade using a backup revealed that
dropping them one-by-one would take ages, since before we kept every
single CDC request.
2023-11-29 11:08:07 +00:00
Francesco Mazzoli 7537bbc6cf Remove useless line 2023-11-29 11:08:07 +00:00
Francesco Mazzoli fac014a864 Self-PR review, part 2 2023-11-29 11:08:07 +00:00
Francesco Mazzoli ba9424e224 Remove unordered_set
Almost certainly irrelevant, but it was bugging me
2023-11-29 11:08:07 +00:00
Francesco Mazzoli 2eab012d76 Fix bug in poll check code 2023-11-29 11:08:07 +00:00
Francesco Mazzoli c94ece50cf Integer sanitizer stuff 2023-11-29 11:08:07 +00:00
Francesco Mazzoli 59abb24a8e Add ceiling on max update size
We don't want it to grow without bound, but we want to maximize
throughput (we'd like for fsync to not be a factor).
2023-11-29 11:08:07 +00:00
Francesco Mazzoli 476009381a Remove maximum enqueued requests limit
We already drop in-flight requests that we're already processing,
so I don't think this matters very much currently.
2023-11-29 11:08:07 +00:00
Francesco Mazzoli c5562c7ca3 Parallelize CDC by directory
Fixes #66.
2023-11-29 11:08:07 +00:00
Francesco Mazzoli 340e7f2f37 Harmonize addr-passing, add shuckle beacon and test it in kmod 2023-11-14 13:49:36 +00:00
Francesco Mazzoli 2ad278adaa Add ubuntu image to build, use jemalloc in release build
I want to use the introspection capabilities of jemalloc, and it
should also be much faster. Preserve alpine build for go build,
it's also really useful to test inside the kmod.
2023-11-13 15:44:55 +00:00
Francesco Mazzoli ad3c969772 Push full RocksDB stats to grafana 2023-11-09 16:48:51 +00:00
Francesco Mazzoli f70c484883 Dump RocksDB full statistics to file 2023-11-09 14:12:54 +00:00
Francesco Mazzoli 057be91613 rocksDBStats -> rocksDBMetrics 2023-11-09 13:38:32 +00:00
Francesco Mazzoli c5979a9d90 Expose some RocksDB stats 2023-11-09 13:23:49 +00:00
Francesco Mazzoli 03e9510255 Align xmon's app instances and systemd services 2023-11-08 14:36:58 +00:00
Francesco Mazzoli afc4e78a62 Reduce default CDC queue size 2023-11-05 22:38:57 +00:00
Francesco Mazzoli 71556ce933 Switch to restech EggsFS rota 2023-11-03 14:23:44 +00:00
Francesco Mazzoli 64d400fcfe Insert shard/cdc metrics at more regular intervals 2023-11-03 13:49:38 +00:00
Francesco Mazzoli 654c0d4db4 Report CDC queue size in grafana 2023-11-03 13:49:32 +00:00
Francesco Mazzoli 9e21969637 Slightly tighter error checks 2023-10-11 13:40:46 +01:00
Francesco Mazzoli 6726fff0fe Better "innocuous error" handling in CDC 2023-10-04 18:12:15 +01:00
Francesco Mazzoli 440a78510e Add concrete quiet windows to C++ alerts
This together with the previous commits fixes #72.
2023-10-02 23:06:40 +00:00
Francesco Mazzoli 59237ed673 Limit number of open RocksDB files
We got to the point where we had ~4k open SST files per shard, which
meant that we ate up all the available FDs.
2023-09-30 11:08:35 +00:00
Francesco Mazzoli 2679ee7c80 Retry RocksDB transactions if appropriate 2023-09-30 10:44:40 +00:00
Francesco Mazzoli 1d4c4abafd Correctly check that RocksDB txn succeeded
This was caught anyway by the fact that we check that the log index
is what we expect. Would have been very nasty otherwise.

The right thing to do is to check for `Status::TryAgain()` and
retry. `Status::Busy()` should never happen because we never
run transactions concurrently so far.
2023-09-30 09:51:26 +00:00
Francesco Mazzoli 02838e228f Correct xmon app types 2023-09-28 11:53:12 +00:00
Francesco Mazzoli 762f047772 Add fsr17 and fsr18 to deployment 2023-09-19 12:56:34 +00:00
Francesco Mazzoli 77ac15af8d Allow to choose xmon env in C++ apps 2023-09-18 11:56:44 +00:00
Ivan Korostelev 7ec477ca9f CDC.cpp: minor bugfix with using optional after reset()
harmless in release builds, since optional in question is POD and destructor is a noop
2023-08-09 10:45:43 +00:00
Francesco Mazzoli 32e2a011ee More grafana fixes 2023-08-08 09:28:07 +00:00
Francesco Mazzoli 467fcffefb A few metrics fixes 2023-08-08 09:21:35 +01:00
Francesco Mazzoli e2246afc53 More tweaks to event loops 2023-08-08 09:21:35 +01:00
Francesco Mazzoli 5117ddd16e Add shard/CDC metrics 2023-08-08 09:21:35 +01:00
Francesco Mazzoli 1922cf3c30 Factor out common looping patterns 2023-08-08 09:21:35 +01:00
Francesco Mazzoli 93b212c665 Alert while initializing shard DB 2023-08-07 10:16:00 +00:00
Francesco Mazzoli b370118e90 Rate limit binnable xmon requests
This involved clearly separating non-clearable and clearable alerts,
which simplifies the design and I think satisfies all our needs.
2023-08-05 23:41:10 +01:00
Francesco Mazzoli 9ef3162882 Add error count to inspect how things failed 2023-08-03 06:53:35 +00:00
Francesco Mazzoli acf4f129f6 Fix CDC txn status tracking
This might be a resolution to #38, although I'm not sure yet.
2023-08-02 13:49:38 +00:00
Francesco Mazzoli 63e2db0889 Cap maximum number of CDC requests
No point letting huge queues build -- especially now that we
deduplicate client requests.
2023-08-01 21:17:23 +01:00
Francesco Mazzoli fe2ce7aa17 See comments 2023-08-01 21:17:23 +01:00
Francesco Mazzoli a5eb12a262 Do not alert/log on innocuous shard error in CDC 2023-08-01 13:41:18 +00:00
Francesco Mazzoli e851457c52 Do not re-insert requests in C++ xmon code
It could mess up the ordering.
2023-07-30 10:58:58 +00:00
Francesco Mazzoli 7dceb5fda5 More alerts shenanigans 2023-07-27 15:51:15 +00:00
Francesco Mazzoli a01b1f036d More alert-related fixes 2023-07-27 13:54:51 +00:00
Francesco Mazzoli 6a52a961eb Split CDC timings to distinguish queue time from exec time 2023-07-27 13:14:12 +00:00
Francesco Mazzoli 889c04766f Do not bump req ids when retrying requests in the CDC
Fixes #29.

The additions to codegen are unrelated -- I was exploring a different
approach based on request equality and I decided to keep those
changes in since they might be useful anyhow.
2023-07-27 11:55:33 +00:00
Francesco Mazzoli f797663d8c Transient alerts for EPERM errors on sendto 2023-07-27 07:31:34 +00:00