42 Commits

Author SHA1 Message Date
abelanger5
5c5c1aa5a1 feat: more features in the load testing harness (#1691)
* fix: make stripped payload size configurable

* feat: more load test features

* Update cmd/hatchet-loadtest/do.go

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* fix: try to fix load tests

* increase timeout, update goleak ignores

* fix: data race in scheduler with snapshot input

* fix: logger improvements

* add one more goleak ignore

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
2025-05-07 21:39:30 -04:00
Gabe Ruttner
3f4424b0fc fix: check candidate slots in bounds (#1684)
* fix: check candidate slots in bounds

* fix
2025-05-06 15:34:52 -04:00
abelanger5
ffbeafc204 revert: add back testing harness (#1659)
* re-add new testing harness

* add healthcheck port and pick random grpc port to listen on

* feat: parallel load tests and faster tests

* make parallelism = 5

* fix: lint

* add linter to pre

* fix: add back rampup fixes

* reduce matrix on PR, add matrix to pre-release step

* make load tests less likely to block

* make limit strategy group round robin

* uncomment lines
2025-05-01 15:22:30 -04:00
abelanger5
fa7ab2bd75 fix: random ticker stops working when no receiver (#1653) 2025-04-30 15:44:22 -04:00
abelanger5
d047813fd8 fix: randomize concurrency loop (#1644) 2025-04-30 07:38:34 -04:00
abelanger5
5084934b40 fix: critical deadlock bug in scheduler (#1621) 2025-04-25 21:28:15 -04:00
abelanger5
9aead7ab68 feat: global prometheus metrics (#1568)
* feat: global prometheus metrics

* configure prom with env vars, clean up metrics

* add histogram and docs

* update port
2025-04-17 15:11:38 -04:00
abelanger5
aebcf0bb0c fix: boundary conditions on 1-second rate limiters (#1379) 2025-03-20 21:44:08 +00:00
abelanger5
21bd707ba6 fix(v1): improved query plans for replay and task outputs, reassignment + timeout tweaks (#1354)
* don't call parent output task when not necessary

* help query planner by refactoring replay task

* fix: use failed task pathway for reassignments and timeouts
2025-03-17 14:10:32 -04:00
abelanger5
1f2096313d feat: v1 engine (#1318) 2025-03-11 14:57:13 -04:00
abelanger5
9b30a3c5a3 fix: make extension less memory intensive (#1241) 2025-01-31 10:28:53 -05:00
Gabe Ruttner
3185a6740d Optimization scheduler memory (#1240)
* memory optimizations

* revert mu

* trace

* revert trace

* chore: lint
2025-01-31 09:48:48 -05:00
Gabe Ruttner
ffa0e2782e fix: memory (#1237)
* fix: memory

* fix: rip

* simplify structs

* fix: unassigned
2025-01-29 19:55:30 -05:00
abelanger5
75657a109e fix: hard sticky strategy with no desired worker id (#1186) 2025-01-14 09:12:29 -08:00
abelanger5
332ccb77cf fix: don't exit early out of queuer (#1184)
* fix: don't exit early out of queuer

* rm unused file
2025-01-13 19:54:25 -08:00
Gabe Ruttner
df75ddb611 fix: fifo (#1173)
* fix: serially try assign

* fix: ensure queue sort
2025-01-09 19:33:41 -05:00
abelanger5
08356f8ae4 feat: scheduling extensions (#1131)
* feat: scheduling extensions

* add maxRuns to worker objects

* don't double count slots

* add unassigned and actions to slots to unassigned qi

* fix: proper use of extensions channel

* convert ext results ch to slice

* fix: race conditions with read locks

* add ability to set tenants on the extension

* fix: panic on scheduler
2025-01-08 19:50:19 +00:00
abelanger5
4c74a62183 refactor(repository): improve usability of repository (#1114)
* refactor(repository): consolidate repository buffers, create pattern for callbacks, consolidate queries

* fix: spelling

* fix: clean up cache
2024-12-11 18:45:02 -05:00
abelanger5
92a96beaf5 fix: latency issues on queueing caused by race condition (#1078)
* fix: remove todo

* fix: race condition on queue inserts causing high latency, improved telemetry
2024-12-02 13:52:33 -05:00
abelanger5
197bdd1f88 feat: exponential backoff (#1062)
* initial migration

* feat: exp backoff, fix linting

* fix utc issue and cleanup
2024-11-21 13:39:02 -05:00
abelanger5
c40b9154d8 fix: tenant race conditions, cleanup logic, old workers getting assigned (#1050) 2024-11-15 09:19:36 -05:00
abelanger5
48aadc6ace fix: avoid panics in lease manager (#1029) 2024-11-07 16:07:01 -05:00
Gabe Ruttner
5759311574 fix: ratelimit and invalid output blocking queue (#1023)
* fix: rm unused offending code, handle unacked

* fix: handle invalid outputs

* fix: don't reset failed

* fix: case on json err

* fix: completed step run ids

* fix: scope
2024-11-06 18:21:22 +00:00
abelanger5
9d133bc15c fix: catch all nack cases for rate limits (#1015)
* fix: properly nack rate limit when failing to schedule

* more nack cases
2024-11-05 11:37:47 -05:00
abelanger5
68bc5a0197 fix: unacked messages in the queuer (#1014)
* fix: when scheduling fails with schedule timeouts, we never ack the queue item

* add error line if we don't process everything we pass into the scheduler
2024-11-05 10:27:53 -05:00
abelanger5
3e0f15c0d8 fix: divide by zero panic (#995)
* fix: divide by zero panic

* fix: add continue
2024-10-25 19:57:55 -04:00
abelanger5
509542b804 fix: duplicate assignments in queuer (#993)
* wip: individual mutexes for actions

* tmp: debug panic

* remove debug code

* remove deadlocks package and don't write unassigned events

* fix: race condition in scheduler and add internal retries

* fix: data race
2024-10-25 16:52:43 +00:00
abelanger5
2cdee59aea refactor: optimize v0.50.0 release (#975)
- Simplifies architecture for splitting engine services into different components. The three supported services are now `grpc-api`, `scheduler`, and `controllers`. The `grpc-api` service is the only one which needs to be exposed for workers. The other two can run as unexposed services.
- Fixes a set of bugs and race conditions in the `v2` scheduler
- Adds a `lastActive` time to the `Queue` table and includes a migration which sets this `lastActive` time for the most recent 24 hours of queues. Effectively, this means that the maximum scheduling time in a queue is 24 hours.
- Rewrites the `ListWorkflowsForEvent` query to improve performance and select far fewer rows.
2024-10-23 12:05:16 +00:00
abelanger5
7b701ed209 fix: proper deletion of tenants from the scheduling pool (#974)
* fix: proper deletion of tenants from the scheduling pool

* adds some assignment spans

* feat: caching for rankings

* remove cache
2024-10-17 15:47:15 -04:00
Sean Reilly
ecb9ce1e1e rejig the query for creating multiple sticky states (#973)
* rejig the query for creating multiple sticky states

* fix: sticky strategy of soft and improve query

* fix: sort method was using indexes that didn't necessarily correspond to original indexes, leading to inconsistent behavior

---------

Co-authored-by: Sean Reilly <sean@hatchet.run>
Co-authored-by: Alexander Belanger <alexander@hatchet.run>
2024-10-17 13:29:19 +00:00
abelanger5
17dc80cad8 fix: don't append invalid slots with a hard sticky strategy (#972) 2024-10-16 20:21:39 +00:00
abelanger5
e4af494f69 fix: add slot expiry and delete actions from scheduler properly (#969)
* fix: add back slot expiry

* fix: remove action if all slots are inactive
2024-10-16 15:55:18 -04:00
abelanger5
cb39c938b3 fix: ack rate limits properly (#968) 2024-10-16 13:32:10 -04:00
Sean Reilly
7e526de381 fix: deadlocks on events and incorrect step run ordering query (#966)
* make it so the bulk example succeeds

* make the bulk workflows work a little harder

* add some ordering to mitigate deadlocks

* fix: link step run parents bad query, improvements to locking

* add timed mutex and telemetry

* remove for update on cancel

---------

Co-authored-by: Sean Reilly <sean@hatchet.run>
Co-authored-by: Alexander Belanger <alexander@hatchet.run>
2024-10-16 10:28:33 -04:00
abelanger5
67a96d7166 feat(throughput): single process per queue (#956)
* feat(throughput): single process per queue

* fix data race

* fix: golint and data race on load test

* wrap up initial v2 scheduler

* fix: more debug logs and tighten channel logic/blocking sends

* improved casing on dispatcher and lease manager

* fix: data race on min id

* increase wait on load test, fix data race

* fix: trylock -> lock

* clean up queue when no longer in set

* fix: clean up cache on exit

* ensure cleanup is only called once

* address review comments
2024-10-15 11:05:19 -04:00
abelanger5
3d218302ff fix: internal queue items performance and race conditions (#943)
* fix: don't use xmin hack

* fix: assign not append

* refactor: parallel step run updates via hashes

* fix: intermittent double execution of child step runs

* fix: rollback rate limits

* fix: bulk event writes from single buffer

* expose cleanup

* fix: race conditions on failures and cancellations

* change logger defaults to warn and console
2024-10-07 11:16:53 -04:00
abelanger5
a1a10b4073 feat: dynamic rate limits (#904)
* wip: step run expressions on rate limits

* feat: dynamic rate limits

* chore: v0.47.0

* chore: address changes from PR review

* fix: improved error handling

* address pr review

* better error messages for step run cels, remove debug logs

* fix: hash

---------

Co-authored-by: gabriel ruttner <gabriel.ruttner@gmail.com>
2024-09-26 22:00:34 +00:00
abelanger5
891514b461 feat: queue v4 (#842)
* wip: v4 of queue

* fix: correct query for updating counts

* tmp: save migration files

* feat: wrap up initial queue

* fix compilation

* fix: reassigns
2024-09-06 16:12:22 -04:00
abelanger5
263eaf069b feat: pass otel through msgqueue (#802)
* feat: pass otel through msgqueue

* feat: more spans on scheduling

* otel increase batch size
2024-08-28 14:45:02 +00:00
Gabe Ruttner
ee5d86796f fix: required affinity (#812)
* fix: required affinity

* chore: rm dead code
2024-08-23 15:19:29 -04:00
abelanger5
dd8a4144cb fix: hard sticky assignment to workers when no desired worker id (#809) 2024-08-23 07:42:52 -04:00
Gabe Ruttner
4ea4712d4d refactor: performance and throughput (#756)
Refactors the queueing logic to be fairly balanced between actions, with each action backed by a separate FIFO queue. Also adds support for priority queueing and custom queues, though those aren't exposed on the API layer yet. Improves throughput to more than 5000 tasks/second on a single queue.

---------

Co-authored-by: Alexander Belanger <alexander@hatchet.run>
2024-08-12 14:38:47 +00:00