As implied by this comment:
> this implementation is not supposed to be bullet-proof for race conditions (nor is it cross-platform)... it's
> just a small check to prevent the regularly occurring cases:
> * starting a second runsnappea in development
> * running 2 separate instances of bugsink on a single machine without properly distinguishing them
but this "small check" gets in the way sometimes, so it's better to be able to turn it off.
See #99
This commit fixes 3 related issues with the way runtime_limit was administered,
which could lead to race conditions (and hence: the wrong runtime_limit
applying at some point in time). Post-fix, the following holds (a rough sketch
of the resulting shape follows the list):
1. We use thread_locals to store this info, since there are at least 2 sources of
threaded code that touch this (snappea's workers and the django debugserver)
2. We distinguish between the "from connection settings" timeout and the
"temporarily overridden" ones, since we cannot assume
connection-initialization happens first (as per the comment in base.py)
3. We store runtime-limits per alias ('using'). Needed for [2] (each connection
may have a different moment-of-initialization, clobbering CM-set values from
the other connection) and also needed once you realize there may be
different defaults for the timeouts.
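Sketch, for reference — the names, defaults and exact structure here are
illustrative, not the actual Bugsink code:

```python
import threading
from contextlib import contextmanager

_local = threading.local()  # per-thread storage, as per [1]

FALLBACK_LIMIT = 5.0  # stand-in for a per-connection default, in seconds


def _state(using):
    # one record per (thread, alias), as per [3]: the connection-settings
    # value plus a stack of temporary overrides
    if not hasattr(_local, "per_alias"):
        _local.per_alias = {}
    return _local.per_alias.setdefault(
        using, {"from_settings": None, "overrides": []})


def set_from_connection_settings(using, limit):
    # called on connection initialization; kept separate from the overrides,
    # as per [2], because initialization may happen _after_ a context manager
    # has already pushed a temporary value
    _state(using)["from_settings"] = limit


def get_runtime_limit(using="default"):
    state = _state(using)
    if state["overrides"]:
        return state["overrides"][-1]
    if state["from_settings"] is not None:
        return state["from_settings"]
    return FALLBACK_LIMIT


@contextmanager
def different_runtime_limit(limit, using="default"):
    # temporary override; unwinds correctly even when nested
    state = _state(using)
    state["overrides"].append(limit)
    try:
        yield
    finally:
        state["overrides"].pop()
```

The F5/^R test described below is then just two of these context managers
nested; the inner value should win until it unwinds.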
General context: I've recently started using the 'different runtime' helper
quite a bit more, and across connections (snappea!), which raised more and
more doubts about whether it actually works as advertised.
Thoughts on "using" being required. I used to think "you can reason about a
global timeout value, and the current transaction makes clear what you're
actually doing", but as per the notes above that doesn't really work.
Thoughts on reproducing: reproducing race-condition problems is always hairy,
so in the end I settled on a solution that's hopefully easy to reason about,
even if it's verbose.
When I started work on this commit, I focussed on thread-safety; "proving the
problem" consisted of F5/^R on a web page with 2 context managers with different
timeouts, hoping to show that the stack unrolling didn't work properly.
However, during those "tests" I noticed quite a few resets-to-5s (from the
connection defaults), which prompted fix [2] from above.
This will hopefully help when getting issue-reports from people who
have not set up dogfooding.
See [Dogfooding Bugsink](https://www.bugsink.com/docs/dogfooding/)
Triggered by issue_event_list taking more than 5s on "emu" (my 1,500,000-event
test machine). Reason: sorting those events on a non-indexed field. Switching
to a field with an index solved it.
I then analysed (grepped) for "ordering" and "order_by" and set indexes
accordingly and more or less indiscriminately (i.e. even on tables that are
assumed to have relatively few rows, such as Project & Team).
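Roughly, each such change is just a Meta-level index on the field that
order_by uses; a made-up example (model and field names here are not the
actual ones):

```python
from django.db import models


class Event(models.Model):  # illustrative stand-in, not the actual model
    ingested_at = models.DateTimeField()  # hypothetical order_by field

    class Meta:
        indexes = [
            # with this index the ORDER BY can be served from the index
            # instead of sorting the whole table
            models.Index(fields=["ingested_at"]),
        ]
```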
As discussed in #11, there are scenarios (e.g. misconfiguration) where snappea
does not pick up the tasks. Events not showing up in Bugsink, without any
indication of why, leaves people confused. Better to warn explicitly in that
case.
Using a pid-file that's implied by the ingestion directory.
We do this in `get_pc_registry`, i.e. on the first request. This means the
failure surfaces on the first request handled by the 2nd process.
Why not on startup? Because we don't have a configtest or generic on-startup location
(yet). Making _that_ could be another source of fragility, and getting e.g. the number
of processes might be non-trivial / config-dependent.
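A minimal sketch of the kind of check this is (file name, exact behaviour and
wording are assumptions on my part, not the actual implementation):

```python
import logging
import os

logger = logging.getLogger("bugsink")  # illustrative


def warn_if_snappea_not_running(ingest_dir):
    # read the pid-file that lives in the ingestion directory; if no live
    # process is behind it, warn explicitly instead of letting events
    # silently pile up unprocessed
    pid_file = os.path.join(ingest_dir, "snappea.pid")  # hypothetical name

    try:
        with open(pid_file) as f:
            pid = int(f.read().strip())
        os.kill(pid, 0)  # signal 0: existence check only; not cross-platform
    except (OSError, ValueError):
        # no pid-file, garbage in it, or no such process
        logger.warning(
            "snappea does not appear to be running; tasks will not be picked "
            "up and events will not show up in Bugsink")
```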
Exposed when playing around with arbitrary Tasks in a shell; this created
workers that could not run, which would put the foreman in a 'waiting for
available threads' mode.
I briefly looked at the rest of that loop to see whether more exception handling
is necessary, but TBH I don't think we can reasonably recover from e.g. task.delete()
failing (or at least I don't want to think about it now).
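The shape of the fix, sketched (the real foreman code differs; the semaphore
and the task-resolution details here are assumptions):

```python
import importlib
import logging

logger = logging.getLogger("snappea")  # illustrative


def run_in_worker(task, worker_semaphore):
    # whatever goes wrong while resolving or running an arbitrary Task (e.g.
    # one created by hand in a shell), the worker slot must be released again,
    # otherwise the foreman ends up 'waiting for available threads' forever
    try:
        module_name, function_name = task.task_name.rsplit(".", 1)
        function = getattr(importlib.import_module(module_name), function_name)
        function(*task.get_args(), **task.get_kwargs())  # hypothetical accessors
    except Exception:
        logger.exception("task %s failed", task.task_name)
    finally:
        worker_semaphore.release()
```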
* recommend just running in the home dir
* don't use private tmp
The trouble was: when set up using private tmp, the 2 processes
cannot communicate with each other.
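Assuming the setup in question is a systemd unit (an assumption on my part),
the relevant fragment looks something like this; PrivateTmp gives each service
its own private /tmp, so two services set up that way cannot share files:

```ini
# example unit fragment; paths and user are made up
[Service]
User=bugsink
WorkingDirectory=/home/bugsink
# keep this off (or simply omit it): with PrivateTmp=true each service gets
# its own /tmp, and the two processes can no longer see each other's files
PrivateTmp=false
```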