With the reduced span cache time, the block service cache
is no longer needed. We also don't need to fetch
changed block services from the registry, as we'll get
them as part of span fetches.
Things not renamed because doing so would probably be disruptive:
* kmod filesystem string
* sysctl/debugfs/trace
* metrics names
* xmon instance names
Some of these might be renamed too, but we're starting with a relatively
safe set.
Also, split the timeouts for dentries and for stats. We generally
don't care if stats are out of date, but dentries should be up
to date.
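A minimal sketch of what split timeouts could look like. All names and
values here (`cache_timeouts`, `cached_entry`, `dentry_fresh`, `stat_fresh`)
are hypothetical and chosen for illustration; they are not the module's
actual fields or defaults.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical names, for illustration only. */
struct cache_timeouts {
    uint64_t dentry_ms; /* short: dentries should be up to date */
    uint64_t stat_ms;   /* long: stale stats are usually acceptable */
};

struct cached_entry {
    uint64_t dentry_fetched_ms; /* when the dentry was last refreshed */
    uint64_t stat_fetched_ms;   /* when the stat was last refreshed */
};

/* True if a value fetched at fetched_ms is still usable at now_ms. */
static bool still_fresh(uint64_t fetched_ms, uint64_t timeout_ms, uint64_t now_ms)
{
    assert(now_ms >= fetched_ms);
    return now_ms - fetched_ms < timeout_ms;
}

static bool dentry_fresh(const struct cache_timeouts *t,
                         const struct cached_entry *e, uint64_t now_ms)
{
    return still_fresh(e->dentry_fetched_ms, t->dentry_ms, now_ms);
}

static bool stat_fresh(const struct cache_timeouts *t,
                       const struct cached_entry *e, uint64_t now_ms)
{
    return still_fresh(e->stat_fetched_ms, t->stat_ms, now_ms);
}
```

With two independent timeouts, an expired dentry can be revalidated
while the cached stat is still served.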
The code leaves various aspects to be desired:
* No attempt is made to only send stats when needed -- it is always
done. It might be a good idea to instead wait for the first two
stats to come back.
* There's quite a bit of code duplication.
* It's pretty wasteful to have so many different packets for the
stats. It'd be much better to pack multiple requests and multiple
responses in single packets.
This could be done simply by allowing many requests to come
in the same packet (just one after the other would be fine),
and same for the responses. We can still use the protocol and
request id to keep track of things anyway.
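The packing idea from the last item can be sketched as follows. The
framing used here (a little-endian `u64` request id and `u16` body
length before each body) is an assumption made for the example, not the
actual wire format, and `packet_append` / `packet_next` are hypothetical
helpers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical framing, not the real wire format:
 * [u64 request id][u16 body length][body], repeated until the packet ends. */
#define MSG_HEADER_LEN 10

struct msg {
    uint64_t request_id;
    uint16_t body_len;
    const uint8_t *body;
};

/* Append one message at offset `off`.
 * Returns the new offset, or 0 if the message does not fit. */
static size_t packet_append(uint8_t *pkt, size_t cap, size_t off, const struct msg *m)
{
    assert(off <= cap);
    if (cap - off < MSG_HEADER_LEN + (size_t)m->body_len)
        return 0;
    for (int i = 0; i < 8; i++)
        pkt[off + i] = (uint8_t)(m->request_id >> (8 * i));
    pkt[off + 8] = (uint8_t)(m->body_len & 0xff);
    pkt[off + 9] = (uint8_t)(m->body_len >> 8);
    if (m->body_len > 0)
        memcpy(pkt + off + MSG_HEADER_LEN, m->body, m->body_len);
    return off + MSG_HEADER_LEN + m->body_len;
}

/* Decode the message starting at offset `off`.
 * Returns the offset of the next message, or 0 if the input is exhausted
 * or truncated. `out->body` points into `pkt`. */
static size_t packet_next(const uint8_t *pkt, size_t len, size_t off, struct msg *out)
{
    if (off > len || len - off < MSG_HEADER_LEN)
        return 0;
    out->request_id = 0;
    for (int i = 0; i < 8; i++)
        out->request_id |= (uint64_t)pkt[off + i] << (8 * i);
    out->body_len = (uint16_t)(pkt[off + 8] | (pkt[off + 9] << 8));
    if (len - off - MSG_HEADER_LEN < out->body_len)
        return 0;
    out->body = pkt + off + MSG_HEADER_LEN;
    return off + MSG_HEADER_LEN + out->body_len;
}
```

Because every message carries its own request id, responses packed
into one packet can still be matched to their requests individually.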
As noted by @achen, from `recv(2)`:
    When a stream socket peer has performed an orderly shutdown, the
    return value will be 0 (the traditional "end‐of‐file" return).
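A small userspace sketch of handling that case, assuming a blocking
stream socket. `read_status` and `read_some` are hypothetical names
used only for this example.

```c
#include <assert.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

enum read_status { READ_DATA, READ_EOF, READ_ERROR };

/* Classify the result of one recv() call on a stream socket. */
static enum read_status read_some(int fd, void *buf, size_t cap, size_t *got)
{
    /* With cap == 0, recv() can return 0 without the peer having shut down. */
    assert(cap > 0);
    ssize_t n = recv(fd, buf, cap, 0);
    if (n < 0)
        return READ_ERROR; /* caller should inspect errno */
    if (n == 0)
        return READ_EOF;   /* orderly shutdown by the peer, not an error */
    *got = (size_t)n;
    return READ_DATA;
}
```

Treating a return value of 0 as "no data yet, try again" would spin
forever once the peer has closed the connection.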
This change was triggered when an `open` + `lseek` sequence didn't
work, with `lseek` using `SEEK_END`: `i_size` wasn't filled in yet,
so the resulting absolute file offset was negative.
Thanks to @sgrusny for pinpointing the issue.
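The arithmetic behind the failure can be modelled in a few lines. This
is a simplified userspace model of `SEEK_END`, not the kernel's code,
and `seek_end_model` is a hypothetical name.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Simplified model of SEEK_END: new position = file size + offset.
 * Returns the new position, or -EINVAL if it would be negative
 * or would overflow. */
static int64_t seek_end_model(int64_t i_size, int64_t offset)
{
    assert(i_size >= 0);
    if (offset > 0 && i_size > INT64_MAX - offset)
        return -EINVAL;
    int64_t pos = i_size + offset;
    return pos < 0 ? -EINVAL : pos;
}
```

With a correct size of 4096, seeking 16 bytes before the end gives
4080. With `i_size` still at 0, the same call computes -16 and fails.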
By default it is

    s->s_maxbytes = MAX_NON_LFS;

that is to say `((1UL<<31) - 1)`. This tripped us in `sendfile`,
when the upper bound is set to `s_maxbytes`:

    if (!max)
        max = min(in_inode->i_sb->s_maxbytes, out_inode->i_sb->s_maxbytes);

See <https://elixir.bootlin.com/linux/v5.4.249/source/fs/read_write.c#L1443>
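To make the numbers concrete, here is a userspace model of that clamp.
`MAX_NON_LFS_MODEL`, `min_i64` and `sendfile_limit` are hypothetical
names for this sketch; only the `(1 << 31) - 1` value and the `min`
logic come from the kernel snippet quoted here.

```c
#include <assert.h>
#include <stdint.h>

/* Same value as the kernel's MAX_NON_LFS, written with a fixed-width type. */
#define MAX_NON_LFS_MODEL ((INT64_C(1) << 31) - 1)

static int64_t min_i64(int64_t a, int64_t b)
{
    return a < b ? a : b;
}

/* Model of the clamp: when the caller gives no explicit max,
 * the limit is the smaller of the two superblocks' s_maxbytes. */
static int64_t sendfile_limit(int64_t max, int64_t in_maxbytes, int64_t out_maxbytes)
{
    assert(in_maxbytes > 0 && out_maxbytes > 0);
    if (!max)
        max = min_i64(in_maxbytes, out_maxbytes);
    return max;
}
```

If either side keeps the default, the limit is 2147483647 bytes, just
under 2 GiB, no matter how large the other side's `s_maxbytes` is.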
...and also update them quickly, by indexing them by (inode, tag).
Currently they only get updated on local renames, though; we should
also update them when things are moved around remotely.
And hopefully reduce the likelihood of bugs. On the write end, given
that we do things less asynchronously, things might be a bit slower,
but I think the simplification is worth it for now.
Also, fix/improve a bunch of other stuff.
This is one of the two data model/protocol changes I want to perform
before going into production, the other being file atime.
Right now the kernel module does not take advantage of this, but
it's OK since I tested the rest of the code reasonably and the goal
here is to perform the protocol/data changes.
Initial version really by Pawel, but many changes in between.
Big outstanding issues:
* span cache reclamation (unbounded memory otherwise...)
* bad block service detection and workarounds
* corrupted block detection and workarounds
Co-authored-by: Paweł Dziepak <pawel.dziepak@xtxmarkets.com>