ternfs-XTXMarkets/docs/kmod-file-tracking.md at cd24103c2183d85cfa1cae52cd0d23088f8fd2f2

mirror of https://github.com/XTXMarkets/ternfs.git synced 2026-01-06 02:49:45 -06:00


Joshua Leahy 7a4e466ac6 Make TernFS open source

2025-09-17 18:20:23 +01:00


TernFS files are immutable: they are created once and never modified. This presents a challenge when writing a kernel module for it, since the VFS API very much assumes that files can be modified.

So our high-level strategy is: allow users to open a file for writing, keep that file transient (i.e. not visible in the directory tree) until we declare it "done", and do not allow modifications after that.

The main problem in implementing the strategy above is when to declare the file "done". An attractive answer is "when the file is closed". However, it is not obvious how to tell a file that is closed "consciously" through close() from one that is closed because the process is winding down and all its open FDs are being closed.

The relevant VFS interface is flush in struct file_operations: exactly the same function gets called in the two situations above.

If we just blindly declare the file done when flush is called, we're going to get a ton of false positives. Consider the classic fork + execve pattern:

1. A TernFS file is opened for writing by process A;
2. Independently (perhaps in another thread), A forks into process B, for instance to run another program through execve;
3. B inherits all FDs of A, including the open TernFS file;
4. B terminates before A has finished writing the TernFS file it opened;
5. The file is prematurely declared "done" and A cannot finish writing it.
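
The FD inheritance behind this sequence can be reproduced from userspace. The sketch below is only an illustration on an ordinary temporary file, not TernFS code: the function name `write_after_child_exit` and the temporary path are made up for this example. It opens a file, forks a child that exits immediately (so the kernel closes the child's inherited copy of the descriptor), and then checks that the parent can still write.

```c
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical demo helper (not part of TernFS).
 * Returns the number of bytes the parent wrote after the child exited,
 * or -1 on error.
 */
ssize_t write_after_child_exit(void)
{
    char path[] = "/tmp/fd-inherit-XXXXXX";
    int fd = mkstemp(path);      /* "process A" opens a file for writing */
    if (fd < 0)
        return -1;
    unlink(path);                /* the demo cleans up after itself */

    pid_t child = fork();        /* "process B" inherits a copy of fd */
    if (child < 0) {
        close(fd);
        return -1;
    }
    if (child == 0)
        _exit(0);                /* B exits; the kernel closes B's copy of fd */

    int status;
    if (waitpid(child, &status, 0) < 0) {
        close(fd);
        return -1;
    }

    /* Only B's reference was dropped: A's descriptor is still usable. */
    static const char msg[] = "still writing";
    ssize_t n = write(fd, msg, sizeof(msg) - 1);
    close(fd);
    return n;
}
```

On a regular filesystem the child's exit is harmless and the parent's write succeeds. The point for TernFS is that the child's exit still results in a flush call on the file, which must not be mistaken for the writer being finished.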

So we need a better way to recognize when a flush is intentional, so to speak. We achieve this as follows:

1. When a file gets created, we attach to it a reference to the process that originated the syscall. Note that we use current->group_leader rather than current, so that we get the process, not the thread, that the syscall originates from.
2. We also attach the struct mm_struct of the process to the file.
3. When flushing:
   - If the flush does not originate from the same process that created the file, we do not declare the file done;
   - If the flush is in the same process, but the struct mm_struct of the process has already been torn down, we interrupt the creation of the file altogether. A torn-down struct mm_struct means that we are in the middle of tearing the process down, which in turn means that nobody explicitly asked for the file to be closed (i.e. the process exited or crashed before closing the file). For evidence of this, note how exit_mm() is called before exit_files() in do_exit().
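
The flush rules can be summarized as a small decision function. What follows is a userspace model under simplifying assumptions, not the kernel module's code: every name in it (`flush_action`, `creator_info`, `caller_info`, `decide_on_flush`) is hypothetical, a plain PID stands in for the process reference taken from current->group_leader, and a boolean stands in for "the struct mm_struct is still alive".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model; none of these names come from the TernFS source. */
enum flush_action {
    FLUSH_IGNORE,    /* flush from another process: file stays transient */
    FLUSH_ABORT,     /* creator is being torn down: interrupt the creation */
    FLUSH_FINALIZE,  /* explicit close by the creator: declare the file done */
};

struct creator_info {
    int owner_pid;   /* stands in for the group leader recorded at creation */
};

struct caller_info {
    int leader_pid;  /* stands in for current->group_leader at flush time */
    bool mm_alive;   /* false once the process's mm has been torn down */
};

enum flush_action decide_on_flush(const struct creator_info *file,
                                  const struct caller_info *caller)
{
    assert(file != NULL && caller != NULL);

    /* A flush from a different process (e.g. a forked child) is ignored. */
    if (caller->leader_pid != file->owner_pid)
        return FLUSH_IGNORE;

    /* Same process, but mm already gone: we are in process teardown. */
    if (!caller->mm_alive)
        return FLUSH_ABORT;

    /* Same process, mm intact: treat it as an intentional close. */
    return FLUSH_FINALIZE;
}
```

The order of the checks mirrors the text: the process comparison comes first, so a child that inherited the FD never finalizes or aborts the file, regardless of the state of its own address space.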

This scheme is quite dirty, but seems to be pretty solid¹. However, trouble arises if files are created from inside the kernel, which is exactly what happens with NFS, and which is what prompted me to write down this explanation. We'll have to do something else for NFS to work.

Also note that we want to keep the struct mm_struct around anyway, so that we can increase its MM_FILEPAGES counter when we allocate new pages to write files. That is more of a nice-to-have than a strict requirement, though.

¹ Note that the FUSE implementation is not as solid, given that we don't have access to the internals that we have access to in the kernel module.
