1014 Commits

Author SHA1 Message Date
Francesco Mazzoli 592927acc9 Fix a couple of issues around file download
* Client was closed, bricking the web UI
* File name wasn't properly threaded through
* Delete edges contained links to nowhere
2024-01-23 13:56:48 +00:00
Saulius Grusnys a38f2e0fab [kmod] add support for ftruncate() (#162)
* [kmod] add support for ftruncate()

* address review comments

* comment explaining time handling peculiarities

* comment explaining when ATTR_SIZE is used
2024-01-23 13:44:54 +00:00
Francesco Mazzoli affde4cde4 Reduce span lock critical section (see comment for motivation) 2024-01-23 13:07:29 +00:00
Francesco Mazzoli 8c0c246348 More robust detection of file vs. device errors
Just check if we're also unable to count the blocks for the disk,
and if yes, assume it's a single file error.

Of course there will be a time period where we will not have detected
the bad disk when counting the blocks (a few minutes at most), but
that's OK -- the scrubber will scrub blocks for that period, and then
stop.

Once <internal-repo/issues/65#issuecomment-24747>
is done, we should use whatever error detection we use for migration
to also distinguish between these errors.
2024-01-22 13:18:53 +00:00
Francesco Mazzoli 9b4f0cc809 Add small utility to see where in files we get IO errors 2024-01-22 11:36:53 +00:00
Francesco Mazzoli 1d5b6a10bc Check bad block errors consistently 2024-01-22 10:44:14 +00:00
Francesco Mazzoli 5a6bec1fdd Scrub file with BLOCK_IO_ERROR also
We should still not fail to scrub, and the BLOCK_PARTIAL_IO_ERROR
is a heuristic anyway, so it won't catch all cases.
2024-01-20 09:16:21 +00:00
Francesco Mazzoli 28390a6f51 Record GC cycles in GC metrics
This should let us know with more confidence if GC is keeping up.
2024-01-20 08:57:15 +00:00
Francesco Mazzoli 8ed4191ea0 Check for BUG/WARNING in dmesg without crashing the script 2024-01-18 19:04:36 +00:00
Francesco Mazzoli fa9ab31b51 Fix error handling in Go metadata code 2024-01-18 19:04:36 +00:00
Francesco Mazzoli f979a67b04 Always set non-zero transient deadline, fixes #145. 2024-01-18 19:04:36 +00:00
Saulius Grusnys b41f2971bc [kmod] check group leader and grab mm early in the call with preempti… (#158)
* [kmod] check group leader and grab mm early in the call with preemption disabled

* [go/eggstests] set higher cdc and shard timeouts during tests to prevent sporadic failures
2024-01-18 16:14:29 +00:00
Francesco Mazzoli 22136ead35 Some more output in kmod CI 2024-01-18 13:45:47 +00:00
Francesco Mazzoli 65c0fb08de Scrub files forever 2024-01-18 12:48:03 +00:00
Francesco Mazzoli cd23deaf19 Accept DIRECTORY_NOT_FOUND in SOFT_UNLINK_DIRECTORY
Nothing prevents a nonexistent inode from being sent in that request.
2024-01-18 12:00:43 +00:00
Francesco Mazzoli 2a95b345d2 Many changes to make CI work on new runner
Most notably, we now run the non-kmod integration tests in docker.
The kmod tests are already in their own environment (qemu).
2024-01-18 11:57:17 +00:00
Francesco Mazzoli f8b432eb18 Add metric and alert for CDC update size 2024-01-16 23:22:39 +00:00
Francesco Mazzoli 694e17cbc2 Add alerts for full shard queues 2024-01-16 23:11:41 +00:00
Francesco Mazzoli aa566069a7 Remove QuietPeriod options, we don't use them anymore 2024-01-16 22:50:52 +00:00
Francesco Mazzoli e859415b42 Re-introduce quiet period when destructing files
The busy loop was causing trouble in shards where we were done
collecting files.
2024-01-16 20:43:02 +00:00
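
A minimal sketch of the idea in e859415b42 above: when a collection pass finds nothing to destruct, wait for a quiet period instead of immediately looping again. This is not the project's code; `destruct_loop`, `collect_batch`, and `quiet_period_s` are hypothetical names chosen for illustration.

```python
import time
from typing import Callable


def destruct_loop(
    collect_batch: Callable[[], int],
    quiet_period_s: float,
    max_passes: int,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Run up to max_passes collection passes.

    collect_batch (hypothetical) returns how many files it destructed.
    Returns the number of quiet periods that were taken.
    """
    quiet_periods = 0
    for _ in range(max_passes):
        destructed = collect_batch()
        if destructed == 0:
            # Nothing left to collect in this shard: back off rather
            # than spinning in a busy loop.
            sleep(quiet_period_s)
            quiet_periods += 1
    return quiet_periods
```

Passing `sleep` as a parameter keeps the loop testable without real delays.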
Francesco Mazzoli 394e43cc53 Do not wait for GC to be over in all shards before rolling over 2024-01-16 17:00:54 +00:00
Francesco Mazzoli e1eff3a073 Do not page people from GC
Currently metadata often times out, which is OK, but is too noisy
on xmon.
2024-01-16 16:39:03 +00:00
Francesco Mazzoli f5ed515776 Add kmod generated files to list of generated files 2024-01-16 16:21:39 +00:00
Francesco Mazzoli b6cf2b67a6 Distribute block services from shuckle
This is in preparation for #44, but more immediately, to better
stop writing to full block services.

The previous strategy of setting a flag was flawed since once
the flag was set it stayed set -- i.e. we would not remove it once
files were deleted.  This consideration should just be integrated
in distributing the block services.
2024-01-16 16:17:27 +00:00
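
A rough sketch of the reasoning in b6cf2b67a6 above, assuming eligibility is recomputed from current free space every time block services are distributed, so a service becomes writable again once files are deleted. The `BlockService` fields and `writable_block_services` are hypothetical and do not reflect shuckle's actual data model.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class BlockService:
    # Hypothetical fields, for illustration only.
    id: int
    capacity_bytes: int
    used_bytes: int

    @property
    def free_bytes(self) -> int:
        return self.capacity_bytes - self.used_bytes


def writable_block_services(
    services: List[BlockService], min_free_bytes: int
) -> List[BlockService]:
    """Return the services with enough free space right now,
    ordered with the most free space first.

    No sticky "full" flag is stored: the decision is derived from
    the current usage on every call.
    """
    eligible = [s for s in services if s.free_bytes >= min_free_bytes]
    return sorted(eligible, key=lambda s: s.free_bytes, reverse=True)
```

Because nothing is persisted, freeing space on a full service is enough for it to be handed out again on the next distribution.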
Francesco Mazzoli c4805a56fe Bump directories GC limit 2024-01-16 16:03:37 +00:00
Saulius Grusnys 008b57e418 check dmesg for WARNING and BUG during ci runs 2024-01-16 13:16:41 +00:00
Ross Ilott d030907f61 add kmod_package makefile target
When building a module via dkms, the git metadata isn't available, so
revision.c isn't generated properly. Adding a new makefile target that
doesn't depend on revision.c allows package managers to generate it
manually beforehand without dkms overwriting it later.
2024-01-12 13:15:32 +00:00
Andrew Chen edd9dcf1ca kmod: fix sendfile() on linux 5.10
See:
https://git.launchpad.net/~ubuntu-kernel/ubuntu/+source/linux/+git/hirsute/commit/fs/shiftfs.c?h=master-next&id=f133ae22f2a891aae6fba20114bbc2468637231b
linux commit 36e2c7421f02 ("fs: don't allow splice read/write without explicit ops")
2024-01-11 17:56:57 +00:00
Francesco Mazzoli d569bdb494 Re-introduce thread names (they got lost in a refactor) 2024-01-11 17:32:52 +00:00
Francesco Mazzoli c80c6269d9 Remove spurious MsgsGen.hpp includes 2024-01-11 16:05:34 +00:00
Francesco Mazzoli 8d0b97171e Remove dead code 2024-01-11 13:03:26 +00:00
Francesco Mazzoli ab75efbe81 See comment for countBlocks 2024-01-10 15:02:49 +00:00
Francesco Mazzoli c27ba8398a Tear down all threads at once
I had copied the LIFO pattern from the ETD codebase, but it's not needed
here given that the loop terminates gracefully and so we can coordinate
explicitly if needed.
2024-01-09 16:53:23 +00:00
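
A small sketch of the "tear down all at once" approach from c27ba8398a above, assuming each loop polls a shared stop signal and exits gracefully. It is written in Python for brevity rather than the project's C++; `Loop` and `run_and_teardown` are hypothetical names.

```python
import threading
from typing import List


class Loop:
    """A worker that steps until the shared stop event is set."""

    def __init__(self, name: str, stop_event: threading.Event) -> None:
        self.stop_event = stop_event
        self.steps = 0
        self.thread = threading.Thread(target=self._run, name=name)

    def _run(self) -> None:
        while not self.stop_event.is_set():
            self.steps += 1
            # Wait briefly, waking up early if a stop is requested.
            self.stop_event.wait(0.01)


def run_and_teardown(names: List[str]) -> List[bool]:
    """Start one loop per name, stop them all together, and report
    whether each thread is still alive after joining."""
    stop_event = threading.Event()
    loops = [Loop(name, stop_event) for name in names]
    for loop in loops:
        loop.thread.start()
    # Signal every loop at once instead of stopping them in LIFO order.
    stop_event.set()
    for loop in loops:
        loop.thread.join(timeout=5)
    return [loop.thread.is_alive() for loop in loops]
```

Since every loop observes the same event, no ordering between threads is imposed; any ordering that is needed would have to be coordinated explicitly.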
Miroslav Crnic 2f5dc6e5b5 github: ignore generated files 2024-01-09 11:41:22 +00:00
Francesco Mazzoli c9bf49d387 Fix silly SPSC bug 2024-01-09 11:14:18 +00:00
Francesco Mazzoli d69eeaffc9 Simple find command in cli 2024-01-09 00:05:42 +00:00
Andrew Chen d06ce9584e changes to support linux 5.10 2024-01-08 16:14:18 +00:00
Francesco Mazzoli 3097752a30 Minor tweak 2024-01-08 16:03:07 +00:00
Francesco Mazzoli ee9e0ad0af Remove pthread_attr_setsigmask_np, musl does not have it 2024-01-08 15:58:31 +00:00
Francesco Mazzoli 002b2854ec Fix leak in FetchedSpan, and hopefully fix #141. 2024-01-08 15:58:31 +00:00
Francesco Mazzoli 8075e99bb6 Graceful shard teardown
See <https://mazzo.li/posts/stopping-linux-threads.html> for tradeoffs
regarding how to terminate threads gracefully.

The goal of this work was for valgrind to work correctly, which in turn
was to investigate #141. It looks like I have succeeded:

    ==2715080== Warning: unimplemented fcntl command: 1036
    ==2715080== 20,052 bytes in 5,013 blocks are definitely lost in loss record 133 of 135
    ==2715080==    at 0x483F013: operator new(unsigned long) (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
    ==2715080==    by 0x3B708E: allocate (new_allocator.h:121)
    ==2715080==    by 0x3B708E: allocate (allocator.h:173)
    ==2715080==    by 0x3B708E: allocate (alloc_traits.h:460)
    ==2715080==    by 0x3B708E: _M_allocate (stl_vector.h:346)
    ==2715080==    by 0x3B708E: std::vector<Crc, std::allocator<Crc> >::_M_default_append(unsigned long) (vector.tcc:635)
    ==2715080==    by 0x42BF1C: resize (stl_vector.h:940)
    ==2715080==    by 0x42BF1C: ShardDBImpl::_fileSpans(rocksdb::ReadOptions&, FileSpansReq const&, FileSpansResp&) (shard/ShardDB.cpp:921)
    ==2715080==    by 0x420867: ShardDBImpl::read(ShardReqContainer const&, ShardRespContainer&) (shard/ShardDB.cpp:1034)
    ==2715080==    by 0x3CB3EE: ShardServer::_handleRequest(int, sockaddr_in*, char*, unsigned long) (shard/Shard.cpp:347)
    ==2715080==    by 0x3C8A39: ShardServer::step() (shard/Shard.cpp:405)
    ==2715080==    by 0x40B1E8: run (core/Loop.cpp:67)
    ==2715080==    by 0x40B1E8: startLoop(void*) (core/Loop.cpp:37)
    ==2715080==    by 0x4BEA258: start_thread (in /usr/lib/libpthread-2.33.so)
    ==2715080==    by 0x4D005E2: clone (in /usr/lib/libc-2.33.so)
    ==2715080==
    ==2715080==
    ==2715080== Exit program on first error (--exit-on-first-error=yes)
2024-01-08 15:41:22 +00:00
Andrew Chen 0834c83e5c kmod: add mount options to set permissions: uid, gid, dmask, fmask 2023-12-22 10:23:09 +00:00
Francesco Mazzoli e7a1a185cc Simplify scrub code even further by retrying using the built-in retry 2023-12-21 22:40:10 +00:00
Francesco Mazzoli 2e5fe53e9b Simplify scrub.go
The case where we need to scrub is very rare: let's not complicate
things to speed things up when that happens. Also the previous code
had a race in checker termination.
2023-12-21 22:02:15 +00:00
Francesco Mazzoli be9991b25f Fix dentry leak, fixes #138 2023-12-21 17:27:16 +00:00
Andrew Chen 7d136c1f5a kmod fixes for building with linux 5.5 and GCC 9
This is due to wanting to run on job13, which is Fedora with that kernel
and GCC.
2023-12-21 13:34:04 +00:00
Francesco Mazzoli a99373a08d Fix possible out-of-bound access in put_transient_span 2023-12-20 15:32:16 +00:00
Francesco Mazzoli 491d7fdf5c Correctly retry forever when we should 2023-12-20 13:51:14 +00:00
Francesco Mazzoli b554da7452 Harmonize scrubbing with everything else, add rate limiting 2023-12-20 11:27:21 +00:00
Francesco Mazzoli 9199fc6cc3 Fix bad sharing of struct eggsfs_fs_info between mounts 2023-12-19 16:55:13 +00:00