75 Commits

Author SHA1 Message Date
Klaas van Schelven
4a73880ea7 PID_FILE check: make optional
As implied by this comment:

> this implementation is not supposed to be bullet-proof for race conditions (nor is it cross-platform)... it's
> just a small check to prevent the regularly occurring cases:
> * starting a second runsnappea in development
> * running 2 separate instances of bugsink on a single machine without properly distinguishing them

but this "small check" gets in the way sometimes, so it's better to be able to turn it off.

See #99
2025-07-28 20:46:45 +02:00
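A minimal sketch of what an optional PID-file check along the lines of 4a73880ea7 might look like. The function name `check_pid_file` and the `enabled` parameter are hypothetical, not taken from the Bugsink code base. Like the check quoted above, it is neither race-free nor cross-platform (it assumes POSIX semantics for `os.kill`).

```python
import os
import tempfile


def check_pid_file(pid_file, enabled=True):
    """Return True if startup may proceed, writing our own PID when it may.

    Hypothetical helper: deliberately simple, not bullet-proof against races.
    """
    if not enabled:
        return True  # check switched off: never block startup

    if os.path.exists(pid_file):
        try:
            with open(pid_file) as f:
                old_pid = int(f.read().strip())
            os.kill(old_pid, 0)  # signal 0 only tests whether the process exists
            return False  # a live process already owns the PID file
        except (ValueError, ProcessLookupError):
            pass  # unreadable or stale PID: safe to take over
        except PermissionError:
            return False  # the process exists but belongs to another user

    with open(pid_file, "w") as f:
        f.write(str(os.getpid()))
    return True


if __name__ == "__main__":
    path = os.path.join(tempfile.mkdtemp(), "snappea.pid")
    print(check_pid_file(path))                 # first start: allowed
    print(check_pid_file(path))                 # second start: refused
    print(check_pid_file(path, enabled=False))  # check disabled: allowed
```

Making the check a plain boolean switch keeps the common development case protected by default while leaving an escape hatch for setups where the check gets in the way.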
Klaas van Schelven
aa255978b7 Snappea: refuse to start in TASK_ALWAYS_EAGER mode 2025-07-08 20:57:26 +02:00
Klaas van Schelven
53d4be8183 Fix 'different_runtime_limit' race conditions
This commit fixes 3 related issues with the way runtime_limit was administered;
which could lead to race conditions (and hence: the wrong runtime_limit
applying at some point in time). Post-fix, the following holds:

1. We use thread_locals to store this info, since there are at least 2 sources of
    threaded code that touch this (snappea's workers and the django debugserver)

2. We distinguish between the "from connection settings" timeout and the
    "temporarily overridden" ones, since we cannot assume
    connection-initialization happens first (as per the comment in base.py)

3. We store runtime-limits per alias ('using'). Needed for [2] (each connection
    may have a different moment-of-initialization, clobbering CM-set values from
    the other connection) and also needed once you realize there may be
    different defaults for the timeouts.

General context: I've recently started introducing the 'different runtime'
helper quite a bit more; and across connections (snappea!), which created more
and more doubts as to it actually working as advertised.

Thoughts on "using" being required. I used to think "you can reason about a
global timeout value, and the current transaction makes clear what you're
actually doing", but as per the notes above that doesn't really work.

Thoughts on reproducing:
A few thoughts/notes on reproducing problems with race conditions. Basic note:
that's always hairy. So in the end I settled on a solution that's hopefully
easy to reason about, even if it's verbose.

When I started work on this commit, I focussed on thread-safety; "proving the
problem" consisted of F5/^R on a web page with 2 context managers with different
timeouts, hoping to show that the stack unrolling didn't work properly.
However, during those "tests" I noticed quite a few resets-to-5s (from the
connection defaults), which prompted fix [2] from above.
2025-04-22 22:08:53 +02:00
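The three properties listed in 53d4be8183 (thread-local storage, separate "from settings" and "temporarily overridden" values, and per-alias bookkeeping) can be illustrated with a small sketch. This is not the Bugsink implementation: the helper names `set_default_limit` and `get_limit`, the signature of `different_runtime_limit`, and the 5-second fallback are all assumptions made for illustration.

```python
import threading
from contextlib import contextmanager

_local = threading.local()


def _state():
    # Lazily create per-thread dicts, keyed by connection alias ('using').
    if not hasattr(_local, "defaults"):
        _local.defaults = {}   # alias -> limit taken from connection settings
        _local.overrides = {}  # alias -> stack of temporarily overridden limits
    return _local


def set_default_limit(using, seconds):
    """Called at connection initialization, whenever that happens to occur."""
    _state().defaults[using] = seconds


def get_limit(using, fallback=5.0):
    state = _state()
    stack = state.overrides.get(using)
    if stack:
        return stack[-1]  # the innermost override wins
    return state.defaults.get(using, fallback)


@contextmanager
def different_runtime_limit(seconds, using):
    stack = _state().overrides.setdefault(using, [])
    stack.append(seconds)
    try:
        yield
    finally:
        stack.pop()  # unwind even if the body raised


if __name__ == "__main__":
    set_default_limit("default", 5.0)
    with different_runtime_limit(30.0, using="default"):
        # A late connection initialization no longer clobbers the override.
        set_default_limit("default", 5.0)
        print(get_limit("default"))
    print(get_limit("default"))
```

Because the defaults and the override stack live in separate dicts, the order in which connection initialization and the context manager run no longer matters, and a second alias or thread cannot overwrite values set for the first.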
Klaas van Schelven
9b6fbe523f Snappea foreman: on catastrophic errors, wait for workers 2025-04-18 14:57:08 +02:00
Klaas van Schelven
366d22f295 Snappea stats: fix for when no tasks remain 2025-04-18 14:37:44 +02:00
Klaas van Schelven
7616b0ea77 Document timing of task.create/delete in code 2025-04-17 10:16:43 +02:00
Klaas van Schelven
89927c7ab4 Snappea stats: never bring down snappea 2025-04-17 10:13:19 +02:00
Klaas van Schelven
6500548168 Snappea Stats: document the need for separate table 2025-04-17 09:41:03 +02:00
Klaas van Schelven
abd05b7269 Snappea stats: silently ignore backwards clock drift 2025-04-17 09:38:42 +02:00
Klaas van Schelven
4cedffc1b7 Snappea stats: configurable retention 2025-04-16 17:10:15 +02:00
Klaas van Schelven
e27439ab7b snappea stats: log cost of stats themselves 2025-04-16 16:57:53 +02:00
Klaas van Schelven
94338051ef Snappea Stats: first version 2025-04-16 16:40:28 +02:00
Klaas van Schelven
1084796763 When not dogfooding, just print a regular stacktrace in the logs
This will hopefully help when getting issue-reports for those that
have not set up dogfooding.

See [Dogfooding Bugsink](https://www.bugsink.com/docs/dogfooding/)
2025-04-11 13:22:37 +02:00
Klaas van Schelven
eb780c0008 Snappea Foreman: don't crash on "non-bullet-proof" pid-check 2025-03-19 08:53:21 +01:00
Klaas van Schelven
14d34807ca Snappea 'worker done': display task name
for the important case of 'quickly eye-balling what-took-you-so-long'
this saves those eye-balls a lookup
2025-02-20 21:38:14 +01:00
Klaas van Schelven
918b1ef54c Add ids to 2 system-checks 2025-02-18 12:10:32 +01:00
Klaas van Schelven
19aa439339 reword comment slightly 2025-02-07 10:21:20 +01:00
Klaas van Schelven
86e8c4318b Add indexes on fields on which we order and vice versa
Triggered by issue_event_list taking more than 5s on "emu" (my 1,500,000 event
test-machine). Reason: sorting those events on a non-indexed field. Switching
to a field-with-index solved it.

I then analysed (grepped) for "ordering" and "order_by" and set indexes
accordingly and more or less indiscriminately (i.e. even on tables that are
assumed to have relatively few rows, such as Project & Team).
2025-02-04 21:19:24 +01:00
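The effect described in 86e8c4318b (ordering on a non-indexed field forces a sort, an index avoids it) can be observed in isolation with SQLite's `EXPLAIN QUERY PLAN`. The sketch below uses the standard-library `sqlite3` module rather than Django models; the table name `event`, the column `ts`, and the index name are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE event (id INTEGER PRIMARY KEY, ts INTEGER)")
conn.executemany(
    "INSERT INTO event (ts) VALUES (?)",
    [(i * 7 % 1000,) for i in range(1000)],
)

QUERY = "SELECT id, ts FROM event ORDER BY ts"


def query_plan(sql):
    # The last column of each EXPLAIN QUERY PLAN row is the readable detail.
    return [row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]


# Without an index, SQLite has to sort the rows in a temporary B-tree.
plan_before = query_plan(QUERY)

# With an index on the ordering column, rows can be read in index order.
conn.execute("CREATE INDEX event_ts_idx ON event (ts)")
plan_after = query_plan(QUERY)

if __name__ == "__main__":
    print("before:", plan_before)
    print("after: ", plan_after)
```

The exact wording of the plan differs between SQLite versions, but the plan before the index typically mentions a temporary B-tree for the `ORDER BY`, and the plan after it names the index instead.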
Klaas van Schelven
59372aba33 First version of multi-tenant setup (EE) 2025-01-29 09:04:19 +01:00
Klaas van Schelven
705cf43fc2 Remove doc-TODO 2025-01-24 11:43:28 +01:00
Klaas van Schelven
cf23ba707e Warn about top-level settings 2025-01-24 11:40:13 +01:00
Klaas van Schelven
0ad878d1bc AttrLike dict: better exceptions 2024-11-20 16:30:48 +01:00
Klaas van Schelven
71d6e89c93 Show warning message when there are many/stale snappea tasks
As discussed in #11, there are scenarios (e.g. misconfiguration) where snappea
does not pick up the tasks. Events not showing up in Bugsink, w/o further
indication why that may be, leaves people confused. Better to warn explicitly
in that case.
2024-11-15 14:51:41 +01:00
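One way to picture the warning from 71d6e89c93 is a pure function that looks at the number of queued tasks and the age of the oldest one. The function name `snappea_warning`, the thresholds, and the message texts below are assumptions for illustration only; the commit does not state the actual values.

```python
from datetime import datetime, timedelta, timezone


def snappea_warning(task_count, oldest_created_at, now,
                    max_tasks=100, max_age=timedelta(minutes=5)):
    """Return a warning string, or None when the queue looks healthy.

    Hypothetical thresholds: more than `max_tasks` queued tasks counts as
    'many', an oldest task older than `max_age` counts as 'stale'.
    """
    if task_count == 0:
        return None
    if task_count > max_tasks:
        return f"Snappea has {task_count} tasks in the queue; is it running?"
    if now - oldest_created_at > max_age:
        return "Snappea's oldest task is stale; is it running?"
    return None


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    print(snappea_warning(3, now - timedelta(hours=1), now))
```

Passing `now` in explicitly keeps the function free of hidden clock reads, which makes both the "many" and the "stale" branch easy to exercise.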
Klaas van Schelven
67cfbb58d7 Use 'monofy' package now that it is extracted 2024-09-04 22:54:27 +02:00
Klaas van Schelven
68c556cdca Comment update about WAL & Snappea 2024-08-29 10:23:10 +02:00
Klaas van Schelven
5d6983042a Snappea: workaholic mode 2024-08-28 08:58:35 +02:00
Klaas van Schelven
66bece58c1 snappea: remove a comment that's mostly of historic interest 2024-08-28 08:58:00 +02:00
Klaas van Schelven
46046f894c snappea foreman comment clarifications 2024-08-27 22:17:28 +02:00
Klaas van Schelven
129a8db421 Fix various flake8 errors 2024-08-21 09:31:05 +02:00
Klaas van Schelven
63417d555f Explain why we deal with SIGTERM as we do
(from memory, in response to the glib remark in b09cfb21c3: 'as demanded by systemd')
2024-07-26 16:22:30 +02:00
Klaas van Schelven
d56a8663a7 Remove the periodCounter and the PC registry
direct consequence of switching to SQL-based counting
2024-07-16 15:08:05 +02:00
Klaas van Schelven
6767ea593a Fix port number in example nginx conf 2024-07-12 08:39:50 +02:00
Klaas van Schelven
eb23d44962 Enforce a single pc_registry for a single ingesting process
Using a pid-file that's implied by the ingestion directory.

We do this in `get_pc_registry`, i.e. on the first request. This means that a
2nd process fails on its first request.

Why not on startup? Because we don't have a configtest or generic on-startup location
(yet). Making _that_ could be another source of fragility, and getting e.g. the nr
of processes might be non-trivial / config-dependent.
2024-07-09 13:14:27 +02:00
Klaas van Schelven
c2ec150f52 Cost of connection.close and subsequent reopen documented 2024-07-08 09:53:15 +02:00
Klaas van Schelven
b6cc268333 Remove connection_close
this was never supposed to have been committed, mistake in c453ca00e5
2024-07-08 09:41:57 +02:00
Klaas van Schelven
c453ca00e5 Snappea connection_close 2024-07-05 16:28:23 +02:00
Klaas van Schelven
1eb65a7790 Release worker_semaphore when failing to create worker
exposed when playing around with arbitrary Tasks in a shell; this created
workers I could not run, which would put the foreman in a 'waiting for available threads'
mode.

I briefly looked at the rest of that loop to see whether more exception handling
is necessary, but TBH I don't think we can reasonably recover from e.g. task.delete()
failing (or at least I don't want to think about it now)
2024-07-05 15:58:15 +02:00
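The failure mode fixed in 1eb65a7790 can be reproduced in miniature: if a slot is acquired from the semaphore before the worker is created, and creation fails, the slot must be handed back or the foreman eventually waits forever. The sketch below is an assumption-laden toy, not Snappea's code; the `Foreman` class shape, the `start_worker` method, and the `registry` dict are hypothetical.

```python
import threading


class Foreman:
    def __init__(self, max_workers):
        self.worker_semaphore = threading.Semaphore(max_workers)

    def start_worker(self, task_name, registry):
        """Start a worker thread for task_name; return None if that fails."""
        self.worker_semaphore.acquire()  # wait for an available slot
        try:
            function = registry[task_name]  # may raise for unknown tasks
            thread = threading.Thread(target=self._work, args=(function,))
            thread.start()
        except Exception:
            # No worker exists that could release the slot later, so do it here.
            self.worker_semaphore.release()
            return None
        return thread

    def _work(self, function):
        try:
            function()
        finally:
            self.worker_semaphore.release()  # normal path: the worker releases


if __name__ == "__main__":
    foreman = Foreman(max_workers=1)
    print(foreman.start_worker("no.such.task", {}))
    worker = foreman.start_worker("hello", {"hello": lambda: print("ran")})
    worker.join()
```

With a single slot, the second call would block indefinitely if the failed first call had kept the semaphore, which matches the 'waiting for available threads' symptom described in the commit message.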
Klaas van Schelven
259069f6e2 Explicit error message for malformed task name 2024-07-05 15:54:55 +02:00
Klaas van Schelven
253380bf2f Foreman: document current understanding of connection.close() 2024-07-05 13:00:08 +02:00
Klaas van Schelven
bf5d221a03 Snappea: fixes on 'atomic' call for Task-getting
prompted by the work in the previous commit; but somewhat separate from it
2024-07-04 14:05:04 +02:00
Klaas van Schelven
4daa6c9e09 Close database connections in snappea 2024-07-04 14:04:03 +02:00
Klaas van Schelven
75b620941a Don't (accidentally) load all events into memory on-init 2024-05-23 10:43:28 +02:00
Klaas van Schelven
82d40e3741 Push get_pc_registry into snappea.foreman init 2024-05-23 10:12:24 +02:00
Klaas van Schelven
5af1d2384e Performance logging of Snappea task create/delete 2024-05-22 17:45:28 +02:00
Klaas van Schelven
151af98559 Performance logging: push into development.py (i.e. remove from non-development servers) 2024-05-22 17:06:21 +02:00
Klaas van Schelven
b09cfb21c3 Configure runsnappea to be rebooted every day
a way to limit memory leaks

also: deal with SIGTERM 'correctly' (i.e. as demanded by systemd)
2024-05-22 12:44:58 +02:00
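As a rough illustration of handling SIGTERM so that a service manager can stop and restart the process cleanly, the sketch below records the signal in an event and lets the main loop finish its current task before exiting. This is a generic pattern, not necessarily what b09cfb21c3 does; the names `stopping`, `handle_sigterm`, `install_handlers`, and `run_loop` are hypothetical.

```python
import signal
import threading

stopping = threading.Event()


def handle_sigterm(signum, frame):
    # Only record the request; the main loop decides when it is safe to exit.
    stopping.set()


def install_handlers():
    # signal.signal() may only be called from the main thread.
    signal.signal(signal.SIGTERM, handle_sigterm)


def run_loop(get_next_task):
    """Run tasks until SIGTERM arrives; return the number of tasks run."""
    handled = 0
    while not stopping.is_set():
        task = get_next_task()
        if task is None:
            stopping.wait(timeout=0.1)  # idle, but wake promptly on SIGTERM
            continue
        task()
        handled += 1
    return handled
```

A process manager that restarts the service on a schedule, as the daily reboot in this commit implies, can then rely on the process exiting shortly after the signal instead of being killed in the middle of a task.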
Klaas van Schelven
f150c839dc Recommended setup: fix tmpfile troubles in 2 ways
* recommend to just run in the home dir
* don't use private tmp

The troubles were: when set up using private tmp files, the 2 processes
cannot communicate with each other
2024-05-22 08:37:52 +02:00
Klaas van Schelven
89dba6e6e5 Fix typo 2024-05-17 15:52:43 +02:00
Klaas van Schelven
5ff2623112 Add checksnappea command 2024-05-17 12:03:40 +02:00
Klaas van Schelven
46220f97ea Snappea: 'ensure' it is running as a singleton 2024-04-27 21:59:20 +02:00