19 Commits

Author SHA1 Message Date
Mohammed Nafees
54701e87d0 Retry RMQ messages indefinitely with aggressive logging after 5 retries (#2448)
* aggressively log errors when rmq retry more than 5 times

* revisit comments

* new vars and fix integration test

* fix test

* log only after 5 retries
2025-10-28 16:51:39 +01:00
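
The behavior named in the title of 54701e87d0 (retry without giving up, and log loudly once a message has failed more than 5 times) can be sketched roughly as below. This is an illustration only and not the code from the commit: `retryForever`, `loudAfter`, `errTransient`, the fixed delay, and the `stop` channel are all hypothetical names and choices, and no real RabbitMQ client is involved.

```go
package main

import (
	"errors"
	"fmt"
	"log"
	"time"
)

// loudAfter is a hypothetical threshold taken from the commit title:
// failures beyond this count are logged on every further attempt.
const loudAfter = 5

// errTransient is a made-up error used to simulate a failing handler.
var errTransient = errors.New("transient failure")

// retryForever calls handle until it succeeds or stop is closed.
// It never gives up on its own. Once the number of failures exceeds
// loudAfter, every further failure is reported through logf.
// It returns how many times handle failed before the loop ended.
func retryForever(stop <-chan struct{}, delay time.Duration, handle func() error, logf func(string, ...any)) (int, error) {
	failures := 0
	for {
		err := handle()
		if err == nil {
			return failures, nil
		}
		failures++
		if failures > loudAfter {
			logf("message has failed %d times, still retrying: %v", failures, err)
		}
		select {
		case <-stop:
			return failures, errors.New("stopped while retrying")
		case <-time.After(delay):
		}
	}
}

func main() {
	calls := 0
	// Simulated handler: fails on the first 7 calls, then succeeds.
	handle := func() error {
		calls++
		if calls <= 7 {
			return errTransient
		}
		return nil
	}

	failures, err := retryForever(nil, time.Millisecond, handle, log.Printf)
	fmt.Println("failures before success:", failures, "err:", err)
}
```

The first five failures stay quiet and only the later ones are logged, which is one reading of the "log only after 5 retries" bullet. The actual change may count or back off differently.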
Mohammed Nafees
e2b1f1353e Fix OTel span attribute naming convention (#2409)
* rename spans according to convention

* low cardinality
2025-10-16 18:43:40 +02:00
Sean Reilly
190f3f984a clean up rabbit mq session stuff, add a quick ack and error processin… (#1197)
* clean up rabbit mq session stuff, add a quick ack and error processing for AddMessage

* bit more paranoid about getting stuck in chans

* first pass at locking the message to deal with the failed states better

* clean up the access to ready for the mq

* make sure we don't block sending this ack
2025-01-23 16:06:02 -08:00
abelanger5
95558138a4 chore: improve throughput, remove deadlocks (#949)
* add otel to pub

* temporarily remove tenant id exchange

* fix: increase internal queue throughput

* fix: remove potential deadlocking

* rollback hash factor multiplier

* fix: batch update issues

* fix: rm unneeded locks

* move disable tenant pubsub to an env var

---------

Co-authored-by: gabriel ruttner <gabriel.ruttner@gmail.com>
2024-10-10 08:54:34 -04:00
abelanger5
8939c94f63 fix: send fewer messages to job queue when it's not necessary (#932)
* handle started at differently

* fix: start job runs in workflows controller

* fix: keep job runs around for backwards compat
2024-10-03 07:39:06 -04:00
abelanger5
c3fa2c57f3 fix: don't need acks on queue checks (#926) 2024-10-02 00:52:02 +00:00
abelanger5
9d69e4d192 fix: use read-only message queue (#897)
* fix: use read-only message queue

* set very high qos for read-heavy queue
2024-09-24 18:30:43 -04:00
abelanger5
263eaf069b feat: pass otel through msgqueue (#802)
* feat: pass otel through msgqueue

* feat: more spans on scheduling

* otel increase batch size
2024-08-28 14:45:02 +00:00
Gabe Ruttner
4ea4712d4d refactor: performance and throughput (#756)
Refactors the queueing logic to be fairly balanced between actions, with each action backed as a separate FIFO queue. Also adds support for priority queueing and custom queues, though those aren't exposed on the API layer yet. Improves throughput to be > 5000 tasks/second on a single queue. 

---------

Co-authored-by: Alexander Belanger <alexander@hatchet.run>
2024-08-12 14:38:47 +00:00
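
One way to picture the scheme described in 4ea4712d4d, where each action is backed by its own FIFO queue and work is balanced fairly between actions, is the in-memory sketch below. It assumes simple round-robin rotation as the fairness policy, which the commit message does not specify. `fairQueue`, `push`, and `pop` are hypothetical names, and priority queueing and custom queues are left out.

```go
package main

import "fmt"

// fairQueue is a hypothetical in-memory model: one FIFO slice per action,
// served in round-robin order so a busy action cannot starve the others.
type fairQueue struct {
	order  []string            // actions in first-seen order
	queues map[string][]string // per-action FIFO of task IDs
	next   int                 // index into order of the next action to serve
}

func newFairQueue() *fairQueue {
	return &fairQueue{queues: map[string][]string{}}
}

// push appends a task to the FIFO queue of its action.
func (q *fairQueue) push(action, task string) {
	if _, ok := q.queues[action]; !ok {
		q.order = append(q.order, action)
	}
	q.queues[action] = append(q.queues[action], task)
}

// pop returns the oldest task of the next non-empty action in rotation.
// The boolean is false when every queue is empty.
func (q *fairQueue) pop() (string, bool) {
	for i := 0; i < len(q.order); i++ {
		idx := (q.next + i) % len(q.order)
		action := q.order[idx]
		if pending := q.queues[action]; len(pending) > 0 {
			q.queues[action] = pending[1:]
			q.next = (idx + 1) % len(q.order)
			return pending[0], true
		}
	}
	return "", false
}

func main() {
	q := newFairQueue()
	q.push("resize", "r1")
	q.push("resize", "r2")
	q.push("resize", "r3")
	q.push("email", "e1")

	// Tasks come out interleaved across actions while both have work.
	for task, ok := q.pop(); ok; task, ok = q.pop() {
		fmt.Println(task)
	}
}
```

Within a single action, tasks keep their arrival order, and the rotation only decides which action is served next.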
Viktor Szépe
0948598749 Fix typos (#775) 2024-08-10 10:58:33 +00:00
Gabe Ruttner
b4670af138 Fix qos otel config (#754)
* feat: otel trace id ratio

* feat: rabbitmq qos

* feat: requeue limit

* fix: tests
2024-07-30 18:11:10 -04:00
abelanger5
5538196169 fix: correct lengths on random.Generate (#638) 2024-06-25 15:12:59 -04:00
Luca Steeb
b6dcb4e7e9 refactor(random): refactor random string generation (#633) 2024-06-24 23:44:03 +01:00
abelanger5
7c3ddfca32 feat: api server extensions (#614)
* feat: allow extending the api server

* chore: remove internal packages to pkg

* chore: update db_gen.go

* fix: expose auth

* fix: move logger to pkg

* fix: don't generate gitignore for prisma client

* fix: allow extensions to register their own api spec

* feat: expose pool on server config

* fix: nil pointer exception on empty opts

* fix: run.go file
2024-06-19 09:36:13 -04:00
abelanger5
ff90533458 fix: only close rabbitmq channels if they are open (#402) 2024-04-22 05:35:30 -04:00
abelanger5
347bc5dd53 feat: rabbitmq connection pooling (#387)
* feat: add rabbitmq connection pool and remove non-fatal worker errors

* chore: go mod tidy

* fix: release pool after opening channel

* fix: make sure channel is closed after all tasks return on subscribe

* fix: don't loop endlessly
2024-04-16 16:45:03 -04:00
abelanger5
08f0864046 fix: retry rabbitmq connections properly and retry published messages (#369) 2024-04-10 15:48:06 -04:00
abelanger5
7b7fbe3668 fix: update Requeue and Reassign logic to fix performance degradation when many events are queued (#310)
Logic for requeueing and reassigning did not limit the number of step runs to requeue, so when events accumulate with no worker present it causes memory to spike along with a very high query latency on the database. This commit limits the number of step runs returned in the requeue and reassign queries, and also properly locks step run rows for these queries so only a step run in a PENDING or PENDING_ASSIGNMENT state can be requeued.

It also improves performance of the `AssignStepRunToWorker` query and ensures that `maxRuns` on workers are always respected through the introduction of a `WorkerSemaphore` model. The value gets decremented when a step run is assigned and incremented when a step run is in a final state. 

Co-authored-by: Luca Steeb <contact@luca-steeb.com>

* Update controller.go

---------

Co-authored-by: steebchen <contact@luca-steeb.com>
2024-04-01 12:33:18 -04:00
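
The slot accounting described in 7b7fbe3668 (decrement when a step run is assigned, increment when it reaches a final state) can be illustrated with a small in-memory sketch. In the commit, `WorkerSemaphore` is a database model. The `workerSemaphore` type, its methods, and `ErrNoSlots` below are hypothetical stand-ins that only show the counting logic, not the real queries or row locking.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// ErrNoSlots is a hypothetical error returned when a worker is at capacity.
var ErrNoSlots = errors.New("worker has no free slots")

// workerSemaphore is an in-memory stand-in for the WorkerSemaphore model
// named in the commit message. slots starts at maxRuns.
type workerSemaphore struct {
	mu      sync.Mutex
	maxRuns int
	slots   int
}

func newWorkerSemaphore(maxRuns int) *workerSemaphore {
	return &workerSemaphore{maxRuns: maxRuns, slots: maxRuns}
}

// acquire decrements the slot count when a step run is assigned.
// It refuses the assignment when no slots remain, so maxRuns is respected.
func (s *workerSemaphore) acquire() error {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.slots == 0 {
		return ErrNoSlots
	}
	s.slots--
	return nil
}

// release increments the slot count when a step run reaches a final state.
// The count is capped at maxRuns to guard against double releases.
func (s *workerSemaphore) release() {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.slots < s.maxRuns {
		s.slots++
	}
}

func main() {
	sem := newWorkerSemaphore(2)
	fmt.Println(sem.acquire()) // first step run assigned
	fmt.Println(sem.acquire()) // second step run assigned
	fmt.Println(sem.acquire()) // worker is full, assignment refused
	sem.release()              // one step run finished
	fmt.Println(sem.acquire()) // a slot is free again
}
```

A database-backed version would presumably do the decrement and the assignment in one transaction. That part is outside what this sketch models.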
abelanger5
c66f97c856 fix: deadlocks on workers and tickers (#241)
* chore: add sentry support to engine

* fix: deadlocks on workers and tickers

* refactor: reduce prisma calls in engine

* trigger

* fix: remove some tenant lookups

* feat: dlx and renamed taskqueue -> msgqueue

* refactor: get group key run logic

* fix: retry counts on messages and concurrency edge cases

* fix: rabbitmq integration tests

* feat: add consumer timeouts

---------

Co-authored-by: Luca Steeb <contact@luca-steeb.com>
2024-03-12 00:45:18 -04:00