'fix' in scare-quotes, because the previous implementation of first_chunk was
wrong, but it never led to actually wrong outcomes, only to one recursive call
too many (for seconds, minutes).
I _think_ I meant that I had never actually seen those code-paths in action
(i.e. the note was not about automated tests but rather about any kind of visual
confirmation that it worked), but I have seen that now.
Using a pid-file that's implied by the ingestion directory.
We do this in `get_pc_registry`, i.e. on the first request. This means the failure
shows up on the first request handled by the 2nd process.
Why not on startup? Because we don't have a configtest or generic on-startup location
(yet). Making _that_ could be another source of fragility, and getting e.g. the number
of processes might be non-trivial / config-dependent.
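
A minimal sketch of such a pid-file check, to illustrate the idea rather than to
document the actual code. The file name, `claim_pid_file`, `pid_is_alive` and
`AlreadyRunning` are all hypothetical; it assumes a POSIX system, and the
check-then-write is not atomic:

```python
import os
from pathlib import Path


class AlreadyRunning(Exception):
    """Raised when another live process already owns the pid-file."""


def pid_is_alive(pid):
    # POSIX only: signal 0 performs error checking, no signal is delivered.
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def claim_pid_file(ingestion_directory):
    # Hypothetical file name; the location is implied by the ingestion directory.
    pid_file = Path(ingestion_directory) / "pc_registry.pid"

    if pid_file.exists():
        try:
            other_pid = int(pid_file.read_text().strip())
        except ValueError:
            other_pid = None  # unreadable content: treat the file as stale

        if other_pid not in (None, os.getpid()) and pid_is_alive(other_pid):
            raise AlreadyRunning(f"pid {other_pid} already owns {pid_file}")

    # Not atomic: two processes starting at the same instant could both get here.
    pid_file.write_text(str(os.getpid()))
    return pid_file
```

Calling something like this lazily (as `get_pc_registry` does on the first request)
is what makes the 2nd process fail late rather than at startup.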
Replacing it with passing the thresholds on each call to `inc`.
The event-based approach was broken in a multi-process setup (such as having a separate
gunicorn and snappea), because the unmute events would be registered GUI-side
(gunicorn), and the single process where the counting happened had a different PC
instance.
The solution is to get rid of the event-listener approach, and just make an inventory of
the threshold-checks that need to be done right before each call to `inc`. Because the
calls to `inc` happen in a single process (we [will] enforce this elsewhere) this fixes
the problem.
During refactoring it became clear that this is probably a good idea anyway: many
comments about corner-cases could be removed.
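
A simplified sketch of what "passing the thresholds on each call to `inc`" could
look like. `PeriodCounterSketch`, the `(name, period, gte)` tuple shape and the
returned dict are assumptions made for illustration, not the real signature, and
the sketch only counts the current bucket per period:

```python
from collections import Counter
from datetime import datetime, timezone

# Hypothetical bucketing: one strftime format per period.
PERIOD_FORMATS = {
    "day": "%Y-%m-%d",
    "hour": "%Y-%m-%d %H",
    "minute": "%Y-%m-%d %H:%M",
}


class PeriodCounterSketch:
    def __init__(self):
        self.counts = {period: Counter() for period in PERIOD_FORMATS}

    def inc(self, when, thresholds=()):
        """Count one event at `when`; return {name: bool} for the given thresholds."""
        for period, fmt in PERIOD_FORMATS.items():
            self.counts[period][when.strftime(fmt)] += 1

        # No registered listeners: the caller supplies the checks that matter
        # right now, so no state has to be shared with any other process.
        states = {}
        for name, period, gte in thresholds:
            bucket = when.strftime(PERIOD_FORMATS[period])
            states[name] = self.counts[period][bucket] >= gte
        return states


pc = PeriodCounterSketch()
now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
first = pc.inc(now, [("unmute", "minute", 2)])
second = pc.inc(now, [("unmute", "minute", 2)])
```

Because the thresholds are looked up by the caller just before `inc`, a change
made GUI-side only has to reach the database, not the counting process's memory.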
Other things I found:
* The now-removed `_digest_event_python_postprocessing` did more than Python alone (it
also touched the DB for unmutes) so that was probably a separate bug (now fixed).
* In the event-listener-based code, I foresaw the need for `on_become_false` (but did
not use it yet). The idea was probably that this could be useful in the quota setting
(a quota can become unmet after a while) but in fact it isn't useful, because when a
quota becomes unmet you'd still need to check all quotas and OR them.
Tests have not been truly refactored (the new architecture probably points to a new
desired set of tests) but rather have been made to run in the simplest way possible.
Exposed when playing around with arbitrary Tasks in a shell; this created
workers I could not run, which would put the foreman in a 'waiting for available threads'
mode.
I briefly looked at the rest of that loop to see whether more exception handling
is necessary, but TBH I don't think we can reasonably recover from e.g. task.delete()
failing (or at least I don't want to think about it now).
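
To illustrate the failure mode, a rough sketch of a foreman-like loop. This is not
the actual loop: `run_all`, the `registry` lookup and the "unknown task name"
failure are assumptions; the point is only that a slot acquired for a task that
cannot be started must be handed back:

```python
import threading


def run_all(tasks, registry, max_workers=2):
    """Run (name, args) tasks in threads; return the names that could not be started."""
    slots = threading.Semaphore(max_workers)
    threads, unrunnable = [], []

    for name, args in tasks:
        slots.acquire()  # wait for an available thread slot
        try:
            function = registry[name]  # assumed failure mode: unknown task name
        except KeyError:
            # Without this release every unrunnable task leaks a slot, and the
            # loop eventually waits forever for an available thread.
            slots.release()
            unrunnable.append(name)
            continue

        def worker(function=function, args=args):
            try:
                function(*args)
            finally:
                slots.release()  # the slot is returned even when the task raises

        thread = threading.Thread(target=worker)
        thread.start()
        threads.append(thread)

    for thread in threads:
        thread.join()
    return unrunnable
```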
Unscientifically (n=1, changing circumstances), this improved times like so when the max was 10k:
* 573.56ms EVICT; down to 8813, max irr. from 15 to 13 in 171ms+402ms and 5+4 queries (pre-index)
* 229.34ms EVICT; down to 7643, max irr. from 15 to 12 in 7ms+222ms and 5+7 queries (post-index)
The order of the index was chosen because we have 3 types of queries in our algo:
* on Project -> irrelevance <= amount of work
* on Project, timestamp -> irrelevance <= observed irrelevances
* on Project, timestamp, irrelevance -> deletion
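
The three query shapes above each filter on a leading prefix of the index columns,
which is why a single composite index can serve all of them. A small check of that
reasoning using SQLite; the table, column and index names are made up for the
example, and the queries are stand-ins for the real ones:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE event (
        id INTEGER PRIMARY KEY,
        project_id INTEGER NOT NULL,
        timestamp INTEGER NOT NULL,
        irrelevance INTEGER NOT NULL
    );
    -- Hypothetical index; column order matches the three query shapes.
    CREATE INDEX idx_project_ts_irr ON event (project_id, timestamp, irrelevance);
""")

QUERIES = {
    "project": "SELECT COUNT(*) FROM event WHERE project_id = 1",
    "project, timestamp": (
        "SELECT MAX(irrelevance) FROM event "
        "WHERE project_id = 1 AND timestamp <= 100"
    ),
    "project, timestamp, irrelevance": (
        "SELECT id FROM event "
        "WHERE project_id = 1 AND timestamp <= 100 AND irrelevance >= 3"
    ),
}


def plan_for(sql):
    # The human-readable plan step is the last column of each returned row.
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(row[-1] for row in rows)


plans = {label: plan_for(sql) for label, sql in QUERIES.items()}
for label, plan in plans.items():
    print(f"{label}: {plan}")
```

Each printed plan should mention `idx_project_ts_irr`; other databases may plan
these differently, so this says nothing definitive about the timings above.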