480 Commits

Author SHA1 Message Date
Francesco Mazzoli fe2ce7aa17 See comments 2023-08-01 21:17:23 +01:00
Francesco Mazzoli a5eb12a262 Do not alert/log on innocuous shard error in CDC 2023-08-01 13:41:18 +00:00
Francesco Mazzoli 1ac282dc08 Update README.md 2023-08-01 13:51:55 +01:00
Francesco Mazzoli 0d1d579449 Update README.md 2023-08-01 13:40:22 +01:00
Francesco Mazzoli 023885a6f6 Smooth out histos slightly 2023-08-01 12:05:02 +00:00
Francesco Mazzoli a7d119387c Add buttons in /stat to quickly choose range 2023-08-01 11:47:52 +00:00
Francesco Mazzoli baf1fd7278 Re-introduce longer timeouts in QEMU
Should fix #31
2023-07-31 13:00:04 +00:00
Francesco Mazzoli 9b38509362 Fix CDC request tracing 2023-07-31 12:59:24 +00:00
Francesco Mazzoli 1359a5d752 The timeouts seem to be some genuine errors, but I need more info to debug 2023-07-31 10:42:03 +00:00
Francesco Mazzoli 791d847f2b Increase timeouts in QEMU 2023-07-31 08:46:45 +00:00
Francesco Mazzoli 03684fee13 Timings -> Histogram, I want to add other kinds of timings soon 2023-07-30 13:18:09 +00:00
Francesco Mazzoli 7ab9c78770 Explain some /stats details 2023-07-30 13:14:44 +00:00
Francesco Mazzoli 5146a80c2d Use homegrown Xmon
I got annoyed at the old lib dropping requests when the queue gets
full. I could probably fix it, but this is almost certainly quicker.
2023-07-30 11:16:35 +00:00
Francesco Mazzoli f3b0a9be7c Correct accounting of header size when fetching blocks
Fixes #28
2023-07-30 10:58:58 +00:00
Francesco Mazzoli e851457c52 Do not re-insert requests in C++ xmon code
It could mess up the ordering.
2023-07-30 10:58:58 +00:00
Francesco Mazzoli 5fe63035a8 Fix kernel ci rsync 2023-07-29 16:18:27 +00:00
Francesco Mazzoli 76f581f194 Rationalize timeout handling 2023-07-28 15:01:55 +00:00
Francesco Mazzoli 41cf9b16f9 Much better /stats 2023-07-28 08:12:04 +00:00
Francesco Mazzoli 0ac6252832 Decouple shuckle registration and block counting
Block counting was getting slow, and it could cause block services
to be mistakenly marked as stale.
2023-07-28 08:09:55 +00:00
Francesco Mazzoli 8e9f4f3d8b Never die because of bad Xmon
It will alert if we're disconnected anyway, and when restarting
everything this causes crashes.
2023-07-28 08:08:03 +00:00
Francesco Mazzoli 7dceb5fda5 More alerts shenanigans 2023-07-27 15:51:15 +00:00
Francesco Mazzoli a01b1f036d More alert-related fixes 2023-07-27 13:54:51 +00:00
Francesco Mazzoli 2c3c09180b Ignore certain errors in shuckle connections
See <internal-repo/issues/22#issuecomment-22869>
2023-07-27 13:29:48 +00:00
Francesco Mazzoli 6a52a961eb Split CDC timings to distinguish queue time from exec time 2023-07-27 13:14:12 +00:00
Francesco Mazzoli fd132ca4b2 More CI tweaks 🥱 2023-07-27 12:44:44 +00:00
Francesco Mazzoli 85fd84af82 Fix logger confusion 2023-07-27 12:23:34 +00:00
Francesco Mazzoli 39ba351886 Yet improved CI log syncing 2023-07-27 12:15:24 +00:00
Francesco Mazzoli 889c04766f Do not bump req ids when retrying requests in the CDC
Fixes #29.

The additions to codegen are unrelated -- I was exploring a different
approach based on request equality and I decided to keep those
changes in since they might be useful anyhow.
2023-07-27 11:55:33 +00:00
Francesco Mazzoli 111bde371b Finally remove duplication between shardreq.go and cdcreq.go 2023-07-27 09:54:49 +00:00
Francesco Mazzoli f797663d8c Transient alerts for EPERM errors on sendto 2023-07-27 07:31:34 +00:00
Francesco Mazzoli bf447408a6 Actually wait for things to finish terminating before reaping next one
Fixes #27. This is all kind of clunky right now, it would be much
better to just standardize the `run()` function pattern.
2023-07-26 22:31:42 +00:00
Francesco Mazzoli 15e59b8e67 More logging when closing (see #27)
It seems that we get the SIGSEGV while closing the DB.
2023-07-26 21:09:29 +00:00
Francesco Mazzoli 401898ac98 Better /stats commentary. 2023-07-26 20:44:02 +00:00
Francesco Mazzoli dd39466daa Insert CDC stats on shutdown 2023-07-26 20:41:35 +00:00
Francesco Mazzoli 9979f54240 Ignore blockservices once they are decommissioned
This is so that we can restart the blockservice before removing the
mount, and have GC and block erasure in general work fine. It should
be fairly safe since we never allow the removal of the decommissioned
flag.
2023-07-26 20:41:27 +00:00
Francesco Mazzoli 999d2df52b Do not alert for missing CDC request
This is totally normal if the CDC is restarted with queued
transactions.
2023-07-26 19:38:17 +00:00
Francesco Mazzoli 45b2618296 Temporarily put a stop to alert spam 2023-07-26 19:33:00 +00:00
Francesco Mazzoli b0ff28dc44 Do not alert on error which can happen naturally in GC 2023-07-26 19:28:44 +00:00
Francesco Mazzoli 0fc80dfe0f Remove additional CDC status fields
`status()` was racy anyway (the txn might have been gone between
first and second lookup) and these are better solved by the stats
db/Grafana anyway.
2023-07-26 19:21:24 +00:00
Francesco Mazzoli 6d07e5d97d Fix stats with new histos 2023-07-26 13:18:43 +00:00
Francesco Mazzoli d918df0fcc Correctly return errors when failing to connect
Triggered by investigating

    xmon: could not read message type: unexpected EOF, will reconnect
    xmon: connected to xmon REDACTED
    Undertaker: hard abort - running abort handlers
    Uncaught exception thrown: SyscallException(Xmon.cpp@186, 9/EBADF=Bad file descriptor in void Xmon::run()): setsockopt

which caused crashes in shards/CDC.
2023-07-26 12:59:40 +00:00
Francesco Mazzoli 3206d7a564 Fix syncing of kmod test logs 2023-07-26 11:46:24 +00:00
Francesco Mazzoli e631096df7 Write shuckle stats before quitting 2023-07-26 10:01:27 +00:00
Francesco Mazzoli 60554ec58d Have bigger histograms, remove other metrics entirely
The `uint16_t` -> `size_t` in `packedSize` is because now
insert stats requests are bigger than `uint16_t`.
2023-07-26 10:01:27 +00:00
Francesco Mazzoli a35ea2dacf Minor shuckle tweaks 2023-07-26 07:58:17 +00:00
Francesco Mazzoli f381bb293f Wait for block services first in tests
This ensures that when the shards ask for block services we always
get the full set, which means that we'll be able to write from the
get-go.
2023-07-26 07:58:17 +00:00
Francesco Mazzoli 0b7d432051 Use -verbose in kernel tests, and sync logs 2023-07-25 10:28:30 +00:00
Francesco Mazzoli 4b0554d9a9 Start block services correctly in eggsrun 2023-07-24 21:04:20 +00:00
Francesco Mazzoli 9ce88cd8ed Don't do that, do the opposite of that 2023-07-24 18:18:48 +00:00
Francesco Mazzoli 704062ed5f Store dead block services ciphers, and use them 2023-07-24 18:10:25 +00:00