
64 Commits

Author SHA1 Message Date
Julius Park
d94a3b4718 Add queue to update scheduled cron triggers on-demand (#3149)
Adds a queue that is triggered whenever a cron is created, updated, or deleted, and that automatically updates the list of crons running in the ticker.
2026-03-04 11:34:43 -05:00
Gabe Ruttner
2fdc47a6af feat: multiple slot types (#2927)
* feat: adds support for multiple slot types, primarily motivated by durable slots

---------

Co-authored-by: mrkaye97 <mrkaye97@gmail.com>
2026-02-17 05:43:47 -08:00
abelanger5
851fbaf214 feat: reduced cold starts for new workers and queues (#2969)
* feat: reduced cold starts for new workers and queues

* address changes from pr review

* fix: data race

* set logs to debug on the harness

* debug for queue level as well

* debug lines for queuer

* fix: add queue notifier to v0 workflow registration

* revert: lease manager interval

* revert log level changes

* add more debug, revert reverts

* more debug

* add debug to lease manager

* do it, try it

* fix: call upsertQueue as part of workflow version put

* change log level to error again

* pr review changes
2026-02-11 13:12:10 -08:00
Greg Furman
80dc9786fd chore: run go-fmt (#2984) 2026-02-10 17:34:36 -05:00
abelanger5
2ddcbd2672 refactor: use typed maps (#2928)
* refactor: use typed maps

* self-review comments
2026-02-03 19:35:09 -05:00
matt
058968c06b Refactor: Attempt II at removing pgtype.UUID everywhere + convert string UUIDs into uuid.UUID (#2894)
* fix: add type override in sqlc.yaml

* chore: gen sqlc

* chore: big find and replace

* chore: more

* fix: clean up bunch of outdated `.Valid` refs

* refactor: remove `sqlchelpers.uuidFromStr()` in favor of `uuid.MustParse()`

* refactor: remove uuidToStr

* fix: lint

* fix: use pointers for null uuids

* chore: clean up more null pointers

* chore: clean up a bunch more

* fix: couple more

* fix: some types on the api

* fix: incorrectly non-null param

* fix: more nullable params

* fix: more refs

* refactor: start replacing tenant id strings with uuids

* refactor: more tenant id uuid casting

* refactor: fix a bunch more

* refactor: more

* refactor: more

* refactor: is that all of them?!

* fix: panic

* fix: rm scans

* fix: unwind some broken things

* chore: tests

* fix: rebase issues

* fix: more tests

* fix: nil checks

* Refactor: Make all UUIDs into `uuid.UUID` (#2897)

* refactor: remove a bunch more string uuids

* refactor: pointers and lists

* refactor: fix all the refs

* refactor: fix a few more

* fix: config loader

* fix: revert some changes

* fix: tests

* fix: test

* chore: proto

* fix: durable listener

* fix: some more string types

* fix: python health worker sleep

* fix: remove a bunch of `MustParse`s from the various gRPC servers

* fix: rm more uuid.MustParse calls

* fix: rm mustparse from api

* fix: test

* fix: merge issues

* fix: handle a bunch more uses of `MustParse` everywhere

* fix: nil id for worker label

* fix: more casting in the oss

* fix: more id parsing

* fix: stringify jwt opt

* fix: couple more bugs in untyped calls

* fix: more types

* fix: broken test

* refactor: implement `GetKeyUuid`

* chore: regen sqlc

* chore: replace pgtype.UUID again

* fix: bunch more type errors

* fix: panic
2026-02-03 11:02:59 -05:00
abelanger5
d56dee4266 feat: durable user event log (#2861)
* placeholder

* feat: db tables for user events (#2862)

* feat: db tables for user events

* move event payloads to payloads table, fix env var loading

* fix: address pr review comments

* missed save

* feat: optimistic scheduling (#2867)

* feat: db tables for user events

* move event payloads to payloads table, fix env var loading

* refactor: small changes to prepare optimistic txs

* feat: optimistic scheduling

* address pr review comments

* rm comments

* fix: rampup test race condition

* fix: goleak

* feat: grpc-side triggers

* fix: config and sem logic

* fix: respect optimistic scheduling env var

* add optimistic to testing matrix, remove pg-only mode

* fix cleanup of pubbuffers

* merge migrations

* last testing fixes
2026-02-02 18:04:02 -05:00
abelanger5
04953129a4 fix: compute payload size correctly for pg_notify (#2873) 2026-01-31 16:31:49 -05:00
Gabe Ruttner
a8afa07dcf fix: validate json at edges and dont retry on invalid (#2882)
* drop and validate at edges

* rm submod

* use enum

* lint
2026-01-29 08:04:55 -08:00
matt
3bd605d4ed Fix: Chunk and recursively retry too-large message sends (#2761)
* feat: recursively split payload list into chunks

* fix: use slices.Chunk and run sequentially

* fix: return error if only one payload

* fix: log error

* fix: couple edge cases
2026-01-08 11:10:06 -05:00
Andrei Gaspar
4dda2b2884 Send create:user Event from OAuth Flow (#2683)
* feat: Send create:user event from OAuth flow

* feat: Implement user and tenant creation events in callbacks

* move callback into cb.Do

---------

Co-authored-by: Alexander Belanger <alexander@hatchet.run>
2026-01-06 14:06:38 -05:00
abelanger5
9f463e92d6 refactor: move v1 packages, remove webhook worker references (#2749)
* chore: move v1 packages, remove webhook worker references

* chore: move msgqueue

* fix: relative paths in sqlc.yaml
2026-01-02 11:42:40 -05:00
abelanger5
f82d3bd071 refactor: consolidate repository methods (#2730)
* refactor: remove v0 paths from codebase

* remove uiVersion references

* refactor: remove v0-exclusive database queries

* remove webhook test

* chore: move api token repository

* chore: move dispatcher repository to v1

* chore: move health repository to v1

* chore: remove event repository

* remove some unused repositories

* chore: move mq implementation to v1

* chore: consolidate rate limit implementations

* chore: move security check to v1 repository

* chore: move slack to v1 repository

* chore: move sns implementation to v1 repository

* clean up step repository

* chore: move tenant invite to v1 repository

* chore: move limits, workers, tenant alerts to v1 repository

* chore: move user, tenant, userSession to v1 repository

* chore: move ticker to v1 repository

* chore: move scheduled workflows to v1 repository

* chore: remove workflows

* fix: remove pointer for limits config file

* propagate cache value to api token

* propagate cache durations
2025-12-31 16:35:46 -05:00
abelanger5
dd9c36c315 refactor: remove v0 paths from codebase (#2728)
* refactor: remove v0 paths from codebase

* remove uiVersion references
2025-12-30 09:57:00 -05:00
Mohammed Nafees
a13c74bd1d Reuse timers for delayed semaphore release in MQ buffers (#2691)
* reuse msg buffer semaphore timer

* goroutine

* comments
2025-12-25 12:01:12 +01:00
Mohammed Nafees
88e7a60b83 msgqueue msg IDs as constants for ease of navigation and readability (#2692) 2025-12-25 11:56:07 +01:00
matt
b65c6de53f Feat: Hatchet Metrics Monitoring, I (#2699)
* Revert "Revert "Feat: Hatchet Metrics Monitoring, I (#2480)" (#2698)"

This reverts commit b87150767a.

* go mod tidy

---------

Co-authored-by: Mohammed Nafees <hello@mnafees.me>
2025-12-23 20:14:14 +01:00
matt
b87150767a Revert "Feat: Hatchet Metrics Monitoring, I (#2480)" (#2698)
This reverts commit fdc075ec6f.
2025-12-22 16:26:14 -05:00
matt
fdc075ec6f Feat: Hatchet Metrics Monitoring, I (#2480)
* feat: queries + task methods for oldest running task and oldest task

* feat: worker slot and sdk metrics

* feat: wal metrics

* repository stub

* feat: add meter provider thingy

* pg queries

* fix: add task

* feat: repo methods for worker metrics

* feat: active workers query, fix where clauses

* fix: aliasing

* fix: sql, cleanup

* chore: cast

* feat: olap queries

* feat: olap queries

* feat: finish wiring up olap status update metrics

* chore: lint

* chore: lint

* fix: dupes, other code review comments

* send metrics to OTel collector

* last autovac

* flag

* logging updates

* address PR comments

---------

Co-authored-by: gabriel ruttner <gabriel.ruttner@gmail.com>
Co-authored-by: Mohammed Nafees <hello@mnafees.me>
2025-12-23 01:04:02 +05:30
abelanger5
9dabe7d902 feat: dlq for dispatcher queues (#2600)
* feat: dlq for dispatcher queues

* reduce dispatcher message ttl to 20 seconds

* rename dispatcher queue for clarity

* add error logs when dead lettering

* address comment
2025-12-04 14:19:01 -05:00
abelanger5
3f5c243325 fix: move check for large payloads to after json.Marshal (#2594) 2025-12-02 11:45:37 -05:00
abelanger5
d906a441d4 fix: ensure that slow worker doesn't interrupt dispatcher, guard large RabbitMQ pubs (#2591)
* ensure that slow worker doesn't interrupt dispatcher

* fix: large payload pub issues

* add comments

* fix: review comments
2025-12-02 09:54:54 -05:00
Mohammed Nafees
54701e87d0 Retry RMQ messages indefinitely with aggressive logging after 5 retries (#2448)
* aggressively log errors when rmq retry more than 5 times

* revisit comments

* new vars and fix integration test

* fix test

* log only after 5 retries
2025-10-28 16:51:39 +01:00
Mohammed Nafees
e2b1f1353e Fix OTel span attribute naming convention (#2409)
* rename spans according to convention

* low cardinality
2025-10-16 18:43:40 +02:00
matt
d677cb2b08 feat: gzip compression for large payloads, persistent OLAP writes (#2368)
* debug: remove event pub

* add additional spans to publish message

* debug: don't publish payloads

* fix: persistent messages on olap

* add back other payloads

* remove pub buffers temporarily

* fix: correct queue

* hacky partitioning

* add back pub buffers to scheduler

* don't send no worker events

* add attributes for queue name and message id to publish

* add back pub buffers to grpc api

* remove pubs again, no worker writes though

* task processing queue hashes

* remove payloads again

* gzip compression over 5kb

* add back task controller payloads

* add back no worker requeueing event, with expirable lru cache

* add back pub buffers

* remove hash partitioned queues

* small fixes

* ignore lru cache top fn

* config vars for compression, disable by default

---------

Co-authored-by: Alexander Belanger <alexander@hatchet.run>
2025-10-08 11:44:04 -04:00
Mohammed Nafees
ed40a82dbb Include tenant_id in OTel spans wherever possible (#2382) 2025-10-03 18:16:16 +02:00
abelanger5
2edeeb10ea feat: max channels for rabbitmq (#2365)
* feat: max conns for rabbitmq

* rename conns -> chans
2025-09-30 08:49:45 -04:00
abelanger5
733feedbff fix: use separate connections for pub and sub (#2358)
* use separate connections for pub and sub

* Update internal/msgqueue/v1/rabbitmq/rabbitmq.go

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
2025-09-29 14:29:45 -04:00
matt
025f42af74 Debug: Error log if we send >10mb message over the internal queue (#2345)
* fix: send error log if we try to send message > 10mb

* feat: add some span attributes

* fix: span attribute names

* fix: cleanup

* fix: add message id
2025-09-25 18:15:35 -04:00
matt
92843bb277 Feat: Payload Store Repository (#2047)
* feat: add table for storing payloads

* feat: add payload type enum

* feat: gen sqlc

* feat: initial sql impl

* feat: add payload store repo to shared

* feat: add overwrite

* fix: impl

* feat: bulk op

* feat: initial wiring of inputs for task triggers

* feat: wire up dag matches

* feat: create V1TaskWithPayload and use it everywhere

* fix: couple bugs

* fix: clean up types

* fix: overwrite

* fix: rm input from replay

* fix: move payload store to shared repo

* fix: schema

* refactor: repo setup

* refactor: repos

* fix: gen

* chore: lint

* fix: rename

* feat: naming, write dag inputs

* fix: more naming, trigger bug

* fix: dual writes for now

* fix: pass in tx

* feat: initial work on offloader

* feat: improve external offloader

* fix: some refs

* add withExternalHandler

* fix: improve impl of external store

* feat: implement offloading, fix other impls

* feat: add query to update JSON

* fix: implement offloading + updating records in payloads table

* feat: add WAL table

* feat: add queries for polling WAL and evicting

* feat: wire up writes into WAL

* fix: get job working

* refactor: improve types

* fix: infinite loop

* feat: improve offloading logic to run in two separate txes

* refactor: rework how overrides work

* fix: lint

* fix: migration number

* fix: migration

* fix: migration version

* fix: revert back to reading payloads out

* fix: fall back to previous input, part i

* fix: input fallback

* fix: add back input to replay

* fix: input fallback in dispatcher

* fix: nil check

* feat: advisory locks, part i

* fix: no skip locked

* feat: hash partitioned wal table

* fix: modify queries a bit, tweak crud enum

* fix: pk order, function to find tenants

* feat: wal processing

* fix: only write wal if an external store is enabled, fix offloading logic

* fix: spacing

* feat: schema cleanup

* fix: rm external store loc name

* fix: set content to null when offloading

* fix: cleanup, naming

* fix: pass overwrite payload store along

* debug: add some logging

* Revert "debug: add some logging"

This reverts commit 43e71eadf1.

* fix: typo

* fx: add offloatAt to store opts for offloading

* fix: handle leasing with advisory lock

* fix: struct def

* fix: requeue on payloads not found

* fix: rm hack for triggers

* fix: revert empty input on write

* fix: write input

* feat: env var for enabling / disabling dual writes

* feat: wire up dual writes

* fix: comments

* feat: generics!

* fix: panic from type cast

* fix: migration

* fix: generic

* fix: hack for T key in map

* fix: cleanup
2025-09-12 09:53:01 -04:00
Mohammed Nafees
89e6d00a8f Add telemetry around task statuses in controller (#2090)
* add telemetry around task statuses in controller

* fixes

* more fixes
2025-08-06 08:41:54 -04:00
abelanger5
1abb2a20e7 fix: hatchet-lite connection leakage and improve listen/notify performance (#1924)
* fix: hatchet-lite connection leakage and improve listen/notify performance

* fix: cancel mq listener

* remove event deps

* skip webhook test for now
2025-06-30 17:13:09 -04:00
Matt Kaye
e62f7edab3 Fix: Streaming Bugs (#1913)
* fix: bug with json parsing failing

* fix: hang up on cancel and fail

* fix: pub stream events even if tenant pubs are disabled

* fix: condition

* fix: eq
2025-06-26 16:22:56 -04:00
abelanger5
b8352bcaca config: allow buffer settings to be configurable (#1649) 2025-05-01 07:13:30 -04:00
abelanger5
2c1f1f4808 test: improve Go testing harness (#1631)
* test: improves testing harness for engine

* update CI test

* fix: race condition in test

* make tests more stable

* cleanup pub and sub buffers

* fix: goleak on rampup test

* feat: matrix tests for engine
2025-04-29 10:55:16 -04:00
abelanger5
ef6668a8c3 fix: go signature and docs (#1561)
* fix: go signature and docs

* Update examples/v1/workflows/concurrency-rr.go

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
2025-04-15 17:31:48 -04:00
abelanger5
6813ab1c75 fix: streaming order improvements, go sdk stability (#1536)
* fix: streaming order improvements, go sdk stability

* fix: improve replay query
2025-04-11 13:02:47 -04:00
abelanger5
b03a8d2666 improve ttl cache on pgmq (#1438)
* improve ttl cache on pgmq

* fix: panic
2025-03-28 09:27:12 -07:00
abelanger5
a20ab2de65 fix(v1): add exponential backoff for internal retries (#1399) 2025-03-25 09:14:15 -07:00
abelanger5
ac968e94b8 fix: concurrency issues and a few small improvements (#1324) 2025-03-12 16:30:34 -04:00
abelanger5
1f2096313d feat: v1 engine (#1318) 2025-03-11 14:57:13 -04:00
Sean Reilly
190f3f984a clean up rabbit mq session stuff, add a quick ack and error processin… (#1197)
* clean up rabbit mq session stuff, add a quick ack and error processing for AddMessage

* bit more paranoid about getting stuck in chans

* first pass at locking the message to deal with the failed states better

* clean up the access to ready for the mq

* make sure we don't block sending this ack
2025-01-23 16:06:02 -08:00
abelanger5
61ae067014 fix: race condition on err in pgmq (#1198) 2025-01-18 16:20:24 +00:00
abelanger5
dcb67a1dac feat: postgres-backed message queue (#1119) 2024-12-18 09:00:54 -05:00
abelanger5
2cdee59aea refactor: optimize v0.50.0 release (#975)
- Simplifies architecture for splitting engine services into different components. The three supported services are now `grpc-api`, `scheduler`, and `controllers`. The `grpc-api` service is the only one which needs to be exposed for workers. The other two can run as unexposed services.
- Fixes a set of bugs and race conditions in the `v2` scheduler
- Adds a `lastActive` time to the `Queue` table and includes a migration which sets this `lastActive` time for the most recent 24 hours of queues. Effectively this means that the max scheduling time in a queue is 24 hours.
- Rewrites the `ListWorkflowsForEvent` query to improve performance and select far fewer rows.
2024-10-23 12:05:16 +00:00
abelanger5
95558138a4 chore: improve throughput, remove deadlocks (#949)
* add otel to pub

* temporarily remove tenant id exchange

* fix: increase internal queue throughput

* fix: remove potential deadlocking

* rollback hash factor multiplier

* fix: batch update issues

* fix: rm unneeded locks

* move disable tenant pubsub to an env var

---------

Co-authored-by: gabriel ruttner <gabriel.ruttner@gmail.com>
2024-10-10 08:54:34 -04:00
abelanger5
8939c94f63 fix: send fewer messages to job queue when it's not necessary (#932)
* handle started at differently

* fix: start job runs in workflows controller

* fix: keep job runs around for backwards compat
2024-10-03 07:39:06 -04:00
abelanger5
c3fa2c57f3 fix: don't need acks on queue checks (#926) 2024-10-02 00:52:02 +00:00
abelanger5
5f5e1e8a88 refactor: use shared tenant listener for messages (#911)
* refactor: use shared tenant listener per tenant exchange

* fix: remove subscription properly
2024-09-26 14:54:11 -04:00
abelanger5
9d69e4d192 fix: use read-only message queue (#897)
* fix: use read-only message queue

* set very high qos for read-heavy queue
2024-09-24 18:30:43 -04:00