Now that we have eviction, events may disappear. Deal with it:
* event-specific 404 that still allows for navigation
* first/last buttons
* navigation to prev/next when prev/next is not just 1 step away
* don't use HttpRedirect for "lookup based" views
in principle, the thing you were looking for might go missing in-between
drawback: these URLs are not "stable"
in the belief that it was only (mostly?) useful in the context of
the pc-registry performance tests. (We do our other tests end-to-end
using the stress tests)
In the previous commit I put the code for a small performance-experiment.
The results are (very) obvious: don't do this. Response times go through
the roof, and more importantly, the server becomes unreliable. Reason:
time-outs caused by waiting for the write-lock.
* issue.event_count to digested_event_count
* event.ingest_order to event.digest_order
* issue.ingest_order to digest_order
This is generally more correct/explicit, and is also in preparation
of doing work on-digest (which may or may not happen)
The direct cause for this was the following observation: there was no mechanism
in place to safeguard counted events across evictions, i.e. the following order
of events was not accounted for:
* ingest/digest a bunch of events (PCs correctly updated)
* eviction (PC still correct)
* server/snappea restart (PC reloaded, but based on new events. not correct).
I though about various approaches to fix this (e.g. snapshotting) but in the end
such approaches added even more complexity to the PC mechanism. I decided to first
check how non-performant the SQL route would be, and this PoC seems to say: just
go SQL.
There's also a small semantic change (probably in the direction of what you'd
expect), namely: the periods are no longer 'calendar' periods.
'fix' in scare-quotes, because the previous implementation of first_chunk was
wrong, but never led to actually wrong outcomes, only to one-too-many recursive
call (for seconds, minutes)
I _think_ I meant that I had never actually seen those code-paths in action
(i.e. the note was not about automated tests but rather any kind of visual
confirmation that it worked) but I have seen that now
Using a pid-file that's implied by the ingestion directory.
We do this in `get_pc_registry`, i.e. on the first request. This means failure is
in the first request on the 2nd process.
Why not on startup? Because we don't have a configtest or generic on-startup location
(yet). Making _that_ could be another source of fragility, and getting e.g. the nr
of processes might be non-trivial / config-dependent.
Replacing it with passing the thresholds on each call to `inc`.
The event-based approach was broken in a multi-process setup (such as having a separate
gunicorn and snappea), because the unmute events would be registered GUI-side
(gunicorn), and the single process where the counting happened had a different PC
instance.
The solution is to get rid of the event-listener approach, and just make an inventory of
the threshold-checks that need to be done right before each call to `inc`. Because the
calls to `inc` happen in a single process (we [will] enforce this elsewhere) this fixes
the problem.
During refactoring it became clear that this is probably a good idea anyway: many
comments about corner-cases could be removed.
Other things I found:
* The now-removed `_digest_event_python_postprocessing` did more than Python alone (it
also touched the DB for unmutes) so that was probably a separate bug (now fixed).
* In the event-listener-based code, I foresaw the need for `on_become_false` (but did
not use it yet). The idea was probably that this could be useful in the quota setting
(a quota can become unmet after a while) but in fact it isn't useful, because when a
quota becomes unmet you'd still need to check all quota and OR them.
Tests have not been truly refactored (the new architecture probably points to a new
desired set of tests) but rather have been made to run in the simplest way possible.