Commit Graph

45 Commits

Author SHA1 Message Date
abelanger5
9dabe7d902 feat: dlq for dispatcher queues (#2600)
* feat: dlq for dispatcher queues

* reduce dispatcher message ttl to 20 seconds

* rename dispatcher queue for clarity

* add error logs when dead lettering

* address comment
2025-12-04 14:19:01 -05:00
abelanger5
3f5c243325 fix: move check for large payloads to after json.Marshal (#2594) 2025-12-02 11:45:37 -05:00
abelanger5
d906a441d4 fix: ensure that slow worker doesn't interrupt dispatcher, guard large RabbitMQ pubs (#2591)
* ensure that slow worker doesn't interrupt dispatcher

* fix: large payload pub issues

* add comments

* fix: review comments
2025-12-02 09:54:54 -05:00
Mohammed Nafees
54701e87d0 Retry RMQ messages indefinitely with aggressive logging after 5 retries (#2448)
* aggressively log errors when rmq retry more than 5 times

* revisit comments

* new vars and fix integration test

* fix test

* log only after 5 retries
2025-10-28 16:51:39 +01:00
Mohammed Nafees
e2b1f1353e Fix OTel span attribute naming convention (#2409)
* rename spans according to convention

* low cardinality
2025-10-16 18:43:40 +02:00
matt
d677cb2b08 feat: gzip compression for large payloads, persistent OLAP writes (#2368)
* debug: remove event pub

* add additional spans to publish message

* debug: don't publish payloads

* fix: persistent messages on olap

* add back other payloads

* remove pub buffers temporarily

* fix: correct queue

* hacky partitioning

* add back pub buffers to scheduler

* don't send no worker events

* add attributes for queue name and message id to publish

* add back pub buffers to grpc api

* remove pubs again, no worker writes though

* task processing queue hashes

* remove payloads again

* gzip compression over 5kb

* add back task controller payloads

* add back no worker requeueing event, with expirable lru cache

* add back pub buffers

* remove hash partitioned queues

* small fixes

* ignore lru cache top fn

* config vars for compression, disable by default

---------

Co-authored-by: Alexander Belanger <alexander@hatchet.run>
2025-10-08 11:44:04 -04:00
Mohammed Nafees
ed40a82dbb Include tenant_id in OTel spans wherever possible (#2382) 2025-10-03 18:16:16 +02:00
abelanger5
2edeeb10ea feat: max channels for rabbitmq (#2365)
* feat: max conns for rabbitmq

* rename conns -> chans
2025-09-30 08:49:45 -04:00
abelanger5
733feedbff fix: use separate connections for pub and sub (#2358)
* use separate connections for pub and sub

* Update internal/msgqueue/v1/rabbitmq/rabbitmq.go

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
2025-09-29 14:29:45 -04:00
matt
025f42af74 Debug: Error log if we send >10mb message over the internal queue (#2345)
* fix: send error log if we try to send message > 10mb

* feat: add some span attributes

* fix: span attribute names

* fix: cleanup

* fix: add message id
2025-09-25 18:15:35 -04:00
matt
92843bb277 Feat: Payload Store Repository (#2047)
* feat: add table for storing payloads

* feat: add payload type enum

* feat: gen sqlc

* feat: initial sql impl

* feat: add payload store repo to shared

* feat: add overwrite

* fix: impl

* feat: bulk op

* feat: initial wiring of inputs for task triggers

* feat: wire up dag matches

* feat: create V1TaskWithPayload and use it everywhere

* fix: couple bugs

* fix: clean up types

* fix: overwrite

* fix: rm input from replay

* fix: move payload store to shared repo

* fix: schema

* refactor: repo setup

* refactor: repos

* fix: gen

* chore: lint

* fix: rename

* feat: naming, write dag inputs

* fix: more naming, trigger bug

* fix: dual writes for now

* fix: pass in tx

* feat: initial work on offloader

* feat: improve external offloader

* fix: some refs

* add withExternalHandler

* fix: improve impl of external store

* feat: implement offloading, fix other impls

* feat: add query to update JSON

* fix: implement offloading + updating records in payloads table

* feat: add WAL table

* feat: add queries for polling WAL and evicting

* feat: wire up writes into WAL

* fix: get job working

* refactor: improve types

* fix: infinite loop

* feat: improve offloading logic to run in two separate txes

* refactor: rework how overrides work

* fix: lint

* fix: migration number

* fix: migration

* fix: migration version

* fix: revert back to reading payloads out

* fix: fall back to previous input, part i

* fix: input fallback

* fix: add back input to replay

* fix: input fallback in dispatcher

* fix: nil check

* feat: advisory locks, part i

* fix: no skip locked

* feat: hash partitioned wal table

* fix: modify queries a bit, tweak crud enum

* fix: pk order, function to find tenants

* feat: wal processing

* fix: only write wal if an external store is enabled, fix offloading logic

* fix: spacing

* feat: schema cleanup

* fix: rm external store loc name

* fix: set content to null when offloading

* fix: cleanup, naming

* fix: pass overwrite payload store along

* debug: add some logging

* Revert "debug: add some logging"

This reverts commit 43e71eadf1.

* fix: typo

* fix: add offloatAt to store opts for offloading

* fix: handle leasing with advisory lock

* fix: struct def

* fix: requeue on payloads not found

* fix: rm hack for triggers

* fix: revert empty input on write

* fix: write input

* feat: env var for enabling / disabling dual writes

* feat: wire up dual writes

* fix: comments

* feat: generics!

* fix: panic from type cast

* fix: migration

* fix: generic

* fix: hack for T key in map

* fix: cleanup
2025-09-12 09:53:01 -04:00
Mohammed Nafees
89e6d00a8f Add telemetry around task statuses in controller (#2090)
* add telemetry around task statuses in controller

* fixes

* more fixes
2025-08-06 08:41:54 -04:00
abelanger5
1abb2a20e7 fix: hatchet-lite connection leakage and improve listen/notify performance (#1924)
* fix: hatchet-lite connection leakage and improve listen/notify performance

* fix: cancel mq listener

* remove event deps

* skip webhook test for now
2025-06-30 17:13:09 -04:00
Matt Kaye
e62f7edab3 Fix: Streaming Bugs (#1913)
* fix: bug with json parsing failing

* fix: hang up on cancel and fail

* fix: pub stream events even if tenant pubs are disabled

* fix: condition

* fix: eq
2025-06-26 16:22:56 -04:00
abelanger5
b8352bcaca config: allow buffer settings to be configurable (#1649) 2025-05-01 07:13:30 -04:00
abelanger5
2c1f1f4808 test: improve Go testing harness (#1631)
* test: improves testing harness for engine

* update CI test

* fix: race condition in test

* make tests more stable

* cleanup pub and sub buffers

* fix: goleak on rampup test

* feat: matrix tests for engine
2025-04-29 10:55:16 -04:00
abelanger5
ef6668a8c3 fix: go signature and docs (#1561)
* fix: go signature and docs

* Update examples/v1/workflows/concurrency-rr.go

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
2025-04-15 17:31:48 -04:00
abelanger5
6813ab1c75 fix: streaming order improvements, go sdk stability (#1536)
* fix: streaming order improvements, go sdk stability

* fix: improve replay query
2025-04-11 13:02:47 -04:00
abelanger5
b03a8d2666 improve ttl cache on pgmq (#1438)
* improve ttl cache on pgmq

* fix: panic
2025-03-28 09:27:12 -07:00
abelanger5
a20ab2de65 fix(v1): add exponential backoff for internal retries (#1399) 2025-03-25 09:14:15 -07:00
abelanger5
ac968e94b8 fix: concurrency issues and a few small improvements (#1324) 2025-03-12 16:30:34 -04:00
abelanger5
1f2096313d feat: v1 engine (#1318) 2025-03-11 14:57:13 -04:00
Sean Reilly
190f3f984a clean up rabbit mq session stuff, add a quick ack and error processin… (#1197)
* clean up rabbit mq session stuff, add a quick ack and error processing for AddMessage

* bit more paranoid about getting stuck in chans

* first pass at locking the message to deal with the failed states better

* clean up the access to ready for the mq

* make sure we don't block sending this ack
2025-01-23 16:06:02 -08:00
abelanger5
61ae067014 fix: race condition on err in pgmq (#1198) 2025-01-18 16:20:24 +00:00
abelanger5
dcb67a1dac feat: postgres-backed message queue (#1119) 2024-12-18 09:00:54 -05:00
abelanger5
2cdee59aea refactor: optimize v0.50.0 release (#975)
- Simplifies architecture for splitting engine services into different components. The three supported services are now `grpc-api`, `scheduler`, and `controllers`. The `grpc-api` service is the only one which needs to be exposed for workers. The other two can run as unexposed services.
- Fixes a set of bugs and race conditions in the `v2` scheduler
- Adds a `lastActive` time to the `Queue` table and includes a migration which sets `lastActive` for queues from the most recent 24 hours. In effect, the maximum scheduling time in a queue is 24 hours.
- Rewrites the `ListWorkflowsForEvent` query to improve performance and select far fewer rows.
2024-10-23 12:05:16 +00:00
abelanger5
95558138a4 chore: improve throughput, remove deadlocks (#949)
* add otel to pub

* temporarily remove tenant id exchange

* fix: increase internal queue throughput

* fix: remove potential deadlocking

* rollback hash factor multiplier

* fix: batch update issues

* fix: rm unneeded locks

* move disable tenant pubsub to an env var

---------

Co-authored-by: gabriel ruttner <gabriel.ruttner@gmail.com>
2024-10-10 08:54:34 -04:00
abelanger5
8939c94f63 fix: send fewer messages to job queue when it's not necessary (#932)
* handle started at differently

* fix: start job runs in workflows controller

* fix: keep job runs around for backwards compat
2024-10-03 07:39:06 -04:00
abelanger5
c3fa2c57f3 fix: don't need acks on queue checks (#926) 2024-10-02 00:52:02 +00:00
abelanger5
5f5e1e8a88 refactor: use shared tenant listener for messages (#911)
* refactor: use shared tenant listener per tenant exchange

* fix: remove subscription properly
2024-09-26 14:54:11 -04:00
abelanger5
9d69e4d192 fix: use read-only message queue (#897)
* fix: use read-only message queue

* set very high qos for read-heavy queue
2024-09-24 18:30:43 -04:00
abelanger5
0204929b02 fix: concurrency key performance (#894) 2024-09-19 21:28:08 -04:00
abelanger5
263eaf069b feat: pass otel through msgqueue (#802)
* feat: pass otel through msgqueue

* feat: more spans on scheduling

* otel increase batch size
2024-08-28 14:45:02 +00:00
Gabe Ruttner
4ea4712d4d refactor: performance and throughput (#756)
Refactors the queueing logic to be fairly balanced between actions, with each action backed by a separate FIFO queue. Also adds support for priority queueing and custom queues, though those aren't exposed on the API layer yet. Improves throughput to more than 5000 tasks/second on a single queue.

---------

Co-authored-by: Alexander Belanger <alexander@hatchet.run>
2024-08-12 14:38:47 +00:00
Viktor Szépe
0948598749 Fix typos (#775) 2024-08-10 10:58:33 +00:00
Gabe Ruttner
b4670af138 Fix qos otel config (#754)
* feat: otel trace id ratio

* feat: rabbitmq qos

* feat: requeue limit

* fix: tests
2024-07-30 18:11:10 -04:00
abelanger5
5538196169 fix: correct lengths on random.Generate (#638) 2024-06-25 15:12:59 -04:00
Luca Steeb
b6dcb4e7e9 refactor(random): refactor random string generation (#633) 2024-06-24 23:44:03 +01:00
abelanger5
7c3ddfca32 feat: api server extensions (#614)
* feat: allow extending the api server

* chore: remove internal packages to pkg

* chore: update db_gen.go

* fix: expose auth

* fix: move logger to pkg

* fix: don't generate gitignore for prisma client

* fix: allow extensions to register their own api spec

* feat: expose pool on server config

* fix: nil pointer exception on empty opts

* fix: run.go file
2024-06-19 09:36:13 -04:00
abelanger5
b0b2e26952 feat: hatchet-lite (#560)
* feat: hatchet-lite mvp

* fix: init shadow db

* fix: install atlas

* fix: correct env

* fix: wait for db ready

* fix: remove name flag

* fix: add hatchet-lite to build
2024-06-06 14:03:53 -04:00
abelanger5
ff90533458 fix: only close rabbitmq channels if they are open (#402) 2024-04-22 05:35:30 -04:00
abelanger5
347bc5dd53 feat: rabbitmq connection pooling (#387)
* feat: add rabbitmq connection pool and remove non-fatal worker errors

* chore: go mod tidy

* fix: release pool after opening channel

* fix: make sure channel is closed after all tasks return on subscribe

* fix: don't loop endlessly
2024-04-16 16:45:03 -04:00
abelanger5
08f0864046 fix: retry rabbitmq connections properly and retry published messages (#369) 2024-04-10 15:48:06 -04:00
abelanger5
7b7fbe3668 fix: update Requeue and Reassign logic to fix performance degradation when many events are queued (#310)
The requeue and reassign logic did not limit the number of step runs to requeue, so when events accumulated with no worker present, memory spiked and query latency on the database became very high. This commit limits the number of step runs returned by the requeue and reassign queries, and also properly locks step run rows for these queries so that only a step run in a PENDING or PENDING_ASSIGNMENT state can be requeued.

It also improves performance of the `AssignStepRunToWorker` query and ensures that `maxRuns` on workers is always respected through the introduction of a `WorkerSemaphore` model. The value gets decremented when a step run is assigned and incremented when a step run is in a final state.

Co-authored-by: Luca Steeb <contact@luca-steeb.com>

* Update controller.go

---------

Co-authored-by: steebchen <contact@luca-steeb.com>
2024-04-01 12:33:18 -04:00
abelanger5
c66f97c856 fix: deadlocks on workers and tickers (#241)
* chore: add sentry support to engine

* fix: deadlocks on workers and tickers

* refactor: reduce prisma calls in engine

* trigger

* fix: remove some tenant lookups

* feat: dlx and renamed taskqueue -> msgqueue

* refactor: get group key run logic

* fix: retry counts on messages and concurrency edge cases

* fix: rabbitmq integration tests

* feat: add consumer timeouts

---------

Co-authored-by: Luca Steeb <contact@luca-steeb.com>
2024-03-12 00:45:18 -04:00
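
Note on 7b7fbe3668 (#310): the commit message describes a `WorkerSemaphore` value that is decremented when a step run is assigned and incremented when a step run reaches a final state, so that `maxRuns` is respected. The counting idea can be sketched in memory as below. This is only an illustration under stated assumptions: the real `WorkerSemaphore` is a database model, and the type layout, the method names `TryAssign` and `Release`, and the mutex are all hypothetical and not taken from the repository.

```go
package main

import (
	"fmt"
	"sync"
)

// WorkerSemaphore is a hypothetical in-memory stand-in for the model named
// in the commit message. It tracks how many more step runs a worker may take.
type WorkerSemaphore struct {
	mu    sync.Mutex
	slots int // remaining capacity
	max   int // the worker's maxRuns
}

// NewWorkerSemaphore starts with all maxRuns slots free.
func NewWorkerSemaphore(maxRuns int) *WorkerSemaphore {
	return &WorkerSemaphore{slots: maxRuns, max: maxRuns}
}

// TryAssign decrements the value when a step run is assigned.
// It returns false, and changes nothing, when no slot is free.
func (s *WorkerSemaphore) TryAssign() bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.slots == 0 {
		return false
	}
	s.slots--
	return true
}

// Release increments the value when a step run reaches a final state.
// The value is capped at max so a duplicate release cannot exceed maxRuns.
func (s *WorkerSemaphore) Release() {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.slots < s.max {
		s.slots++
	}
}

// Slots reports the remaining capacity.
func (s *WorkerSemaphore) Slots() int {
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.slots
}

func main() {
	sem := NewWorkerSemaphore(2)
	// The third assignment is refused because maxRuns is 2.
	fmt.Println(sem.TryAssign(), sem.TryAssign(), sem.TryAssign()) // true true false
	sem.Release() // one step run finished
	fmt.Println(sem.TryAssign()) // true
}
```

In a database-backed version the decrement and the assignment would presumably happen in the same transaction; the mutex here only plays that role for the in-memory sketch.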