Francesco Mazzoli
59237ed673
Limit number of open RocksDB files
...
We got to the point where we had ~4k open SST files per shard, which
meant that we ate up all the available FDs.
2023-09-30 11:08:35 +00:00
Francesco Mazzoli
2679ee7c80
Retry RocksDB transactions if appropriate
2023-09-30 10:44:40 +00:00
Francesco Mazzoli
1d4c4abafd
Correctly check that RocksDB txn succeeded
...
This was caught anyway by the fact that we check that the log index
is what we expect. Would have been very nasty otherwise.
The right thing to do is to check for `Status::TryAgain()` and
retry. `Status::Busy()` should never happen because we never
run transactions concurrently so far.
2023-09-30 09:51:26 +00:00
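A minimal sketch of the retry rule described in the commit above, under stated assumptions: `CommitResult` and `commitWithRetry` are hypothetical stand-ins written for illustration, not the RocksDB API and not the code from this repository. The idea is to rerun the transaction on a "try again" outcome, and to treat "busy" as a logic error because transactions are never run concurrently.

```cpp
#include <functional>
#include <stdexcept>

// Hypothetical stand-in for the outcome of a transaction commit.
// This is NOT the RocksDB Status type.
enum class CommitResult { Ok, TryAgain, Busy, OtherError };

// Calls `attempt` until it stops reporting TryAgain, at most `maxAttempts`
// times. Returns the number of attempts made. Throws on any non-retryable
// outcome, or when every attempt reported TryAgain.
inline int commitWithRetry(const std::function<CommitResult()>& attempt, int maxAttempts) {
    for (int i = 1; i <= maxAttempts; i++) {
        switch (attempt()) {
        case CommitResult::Ok:
            return i;
        case CommitResult::TryAgain:
            continue; // transient: go around the loop and rerun the transaction
        case CommitResult::Busy:
            // Assumed impossible here, since transactions never run concurrently.
            throw std::logic_error("unexpected Busy result");
        case CommitResult::OtherError:
            throw std::runtime_error("transaction failed");
        }
    }
    throw std::runtime_error("transaction still failing after retries");
}
```

The attempt cap is an assumption of this sketch; the commit message does not say whether retries are bounded.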
Francesco Mazzoli
02838e228f
Correct xmon app types
2023-09-28 11:53:12 +00:00
Francesco Mazzoli
762f047772
Add fsr17 and fsr18 to deployment
2023-09-19 12:56:34 +00:00
Francesco Mazzoli
77ac15af8d
Allow choosing xmon env in C++ apps
2023-09-18 11:56:44 +00:00
Francesco Mazzoli
b87a43a297
Continue running GC if servers are down
...
This was triggered by a server failing hard (fsr13), without any
short term resolution (we've already replaced the mobo, we'll probably
replace the HBA). In this case GC should still run rather than
get stuck.
2023-08-29 12:47:24 +00:00
Francesco Mazzoli
1a8cda8747
Retry if we fail to get page in spans
...
See the comment for an explanation. This is in preparation for #50; see
<internal-repo/issues/50#issuecomment-23278>
in particular.
2023-08-22 15:01:33 +01:00
Francesco Mazzoli
1cab680110
Support arbitrary span/block/... policies in kmod...
...
...and also update them quickly, by indexing them by (inode, tag).
Currently they only get updated on local renames though; we should
also update them when things are moved around remotely.
2023-08-22 15:01:33 +01:00
Francesco Mazzoli
6fa520c582
Always update directory modification time
...
This fixes a bona-fide bug -- we didn't update the mtime when an
edge was unlocked + moved. However we might as well blindly always
update the mtime, even if there is no POSIX-visible change, to
be on the safe side.
2023-08-21 13:33:30 +00:00
Francesco Mazzoli
b25f893403
Update estimates in ShardDB.cpp
2023-08-16 08:41:13 +00:00
Francesco Mazzoli
40f229b6f5
Add endpoint to specify which file to get the "reference" block services from
...
See comments for more details.
2023-08-16 08:40:47 +01:00
Francesco Mazzoli
9405b64a76
Remove ExpireTransientFile, make future cutoff tunable
...
Fixes #48. Also, reorganize error handling in `eggsblocks` requests,
especially around write requests, which might help with #45.
2023-08-15 12:43:49 +01:00
Ivan Korostelev
7ec477ca9f
CDC.cpp: minor bugfix with using optional after reset()
...
Harmless in release builds, since the optional in question is POD and its destructor is a no-op.
2023-08-09 10:45:43 +00:00
Francesco Mazzoli
a5dbe189e3
Add some block services metrics
2023-08-08 11:48:35 +00:00
Francesco Mazzoli
32e2a011ee
More grafana fixes
2023-08-08 09:28:07 +00:00
Francesco Mazzoli
467fcffefb
A few metrics fixes
2023-08-08 09:21:35 +01:00
Francesco Mazzoli
e2246afc53
More tweaks to event loops
2023-08-08 09:21:35 +01:00
Francesco Mazzoli
b2f28955a5
Log timings
2023-08-08 09:21:35 +01:00
Francesco Mazzoli
e686222040
A bit more logging
2023-08-08 09:21:35 +01:00
Francesco Mazzoli
5117ddd16e
Add shard/CDC metrics
2023-08-08 09:21:35 +01:00
Francesco Mazzoli
1922cf3c30
Factor out common looping patterns
2023-08-08 09:21:35 +01:00
Francesco Mazzoli
93b212c665
Alert while initializing shard DB
2023-08-07 10:16:00 +00:00
Francesco Mazzoli
63ed6a90fa
Reconnect to xmon on expired heartbeat
2023-08-07 10:06:40 +00:00
Francesco Mazzoli
b370118e90
Rate limit binnable xmon requests
...
This involved clearly separating non-clearable and clearable alerts,
which simplifies the design and I think satisfies all our needs.
2023-08-05 23:41:10 +01:00
Francesco Mazzoli
02a2ca2a6f
Wait for block services to come up before restarting the next one
...
This should already make #43 better.
2023-08-04 13:40:10 +00:00
Francesco Mazzoli
18b2397842
Some Timings.hpp functions
2023-08-03 23:41:11 +00:00
Francesco Mazzoli
698794ac44
Fix bad indexing in Timings.hpp
2023-08-03 21:06:49 +00:00
Francesco Mazzoli
ca987ed205
Fix UB in bincode
...
I `memcpy` a zero-sized string into `NULL`. UBsan rightfully
complains.
2023-08-03 09:58:53 +00:00
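A minimal sketch of the kind of guard that avoids the issue described in the commit above. `copyBytes` is a hypothetical helper written for illustration; it is not the actual bincode fix. Under the C and C++ standards in force at the time, passing a null pointer to `memcpy` is undefined behaviour even when the length is zero, so the sketch skips the call when there is nothing to copy.

```cpp
#include <cstddef>
#include <cstring>

// Copies `len` bytes from `src` to `dst`. When `len` is zero the function
// returns without calling memcpy, so null pointers never reach it.
inline void copyBytes(void* dst, const void* src, std::size_t len) {
    if (len == 0) {
        return;
    }
    std::memcpy(dst, src, len);
}
```

With this guard, a call such as `copyBytes(nullptr, nullptr, 0)` does nothing, while non-empty copies behave exactly like `memcpy`.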
Francesco Mazzoli
0055575f71
alpine-debug -> alpinedebug, new cmake doesn't like dashes
2023-08-03 07:09:41 +00:00
Francesco Mazzoli
9ef3162882
Add error count to inspect how things failed
2023-08-03 06:53:35 +00:00
Francesco Mazzoli
acf4f129f6
Fix CDC txn status tracking
...
This might be a resolution to #38, although I'm not sure yet.
2023-08-02 13:49:38 +00:00
Francesco Mazzoli
5a6a13de5f
Alpine 3.17 -> 3.18
...
To get <https://www.openwall.com/lists/musl/2023/05/02/1>
2023-08-02 13:05:49 +00:00
Francesco Mazzoli
2a1d8a497e
Update some size estimates
2023-08-02 12:22:16 +00:00
Francesco Mazzoli
63e2db0889
Cap maximum number of CDC requests
...
No point letting huge queues build -- especially now that we
deduplicate client requests.
2023-08-01 21:17:23 +01:00
Francesco Mazzoli
fe2ce7aa17
See comments
2023-08-01 21:17:23 +01:00
Francesco Mazzoli
a5eb12a262
Do not alert/log on innocuous shard error in CDC
2023-08-01 13:41:18 +00:00
Francesco Mazzoli
5146a80c2d
Use homegrown Xmon
...
I got annoyed at the old lib dropping requests when the queue gets
full. I could probably fix it, but this is almost certainly quicker.
2023-07-30 11:16:35 +00:00
Francesco Mazzoli
e851457c52
Do not re-insert requests in C++ xmon code
...
It could mess up the ordering.
2023-07-30 10:58:58 +00:00
Francesco Mazzoli
8e9f4f3d8b
Never die because of bad Xmon
...
It will alert if we're disconnected anyway, and when restarting
everything this causes crashes.
2023-07-28 08:08:03 +00:00
Francesco Mazzoli
7dceb5fda5
More alerts shenanigans
2023-07-27 15:51:15 +00:00
Francesco Mazzoli
a01b1f036d
More alert-related fixes
2023-07-27 13:54:51 +00:00
Francesco Mazzoli
6a52a961eb
Split CDC timings to distinguish queue time from exec time
2023-07-27 13:14:12 +00:00
Francesco Mazzoli
889c04766f
Do not bump req ids when retrying requests in the CDC
...
Fixes #29.
The additions to codegen are unrelated -- I was exploring a different
approach based on request equality and I decided to keep those
changes in since they might be useful anyhow.
2023-07-27 11:55:33 +00:00
Francesco Mazzoli
f797663d8c
Transient alerts for EPERM errors on sendto
2023-07-27 07:31:34 +00:00
Francesco Mazzoli
bf447408a6
Actually wait for things to finish terminating before reaping next one
...
Fixes #27. This is all kind of clunky right now; it would be much
better to just standardize the `run()` function pattern.
2023-07-26 22:31:42 +00:00
Francesco Mazzoli
15e59b8e67
More logging when closing (see #27)
...
It seems that we get the SIGSEGV while closing the DB.
2023-07-26 21:09:29 +00:00
Francesco Mazzoli
dd39466daa
Insert CDC stats on shutdown
2023-07-26 20:41:35 +00:00
Francesco Mazzoli
999d2df52b
Do not alert for missing CDC request
...
This is totally normal if the CDC is restarted with queued
transactions.
2023-07-26 19:38:17 +00:00
Francesco Mazzoli
45b2618296
Temporarily put a stop to alert spam
2023-07-26 19:33:00 +00:00