54 Commits

Author SHA1 Message Date
abelanger5
2249ef3b79 fix: small scheduler optimizations (#2426)
* fix: actually increment snapshot count

* add a context with timeout to wrap replenish
2025-11-17 15:45:14 -05:00
Mohammed Nafees
cf5c5989ff add vars to tune concurrency poller (#2428) 2025-10-23 11:36:12 -04:00
Mohammed Nafees
e2b1f1353e Fix OTel span attribute naming convention (#2409)
* rename spans according to convention

* low cardinality
2025-10-16 18:43:40 +02:00
Mohammed Nafees
ed40a82dbb Include tenant_id in OTel spans wherever possible (#2382) 2025-10-03 18:16:16 +02:00
matt
cf59a7bcd9 Feat: Worker slot Prom metrics (#2195)
* feat: add slots to prom metrics

* feat: available

* fix: extension instead

* fix: docs

* fix: rm unused query changes

* fix: rm unused struct

* fix: labels

* feat: improve total slots

* fix: pr feedback

* fix: docs

* Revert "fix: docs"

This reverts commit 7fe105da92.

* fix: derive total slots
2025-09-08 14:07:44 -04:00
abelanger5
2c8ea66a7a fix: remove rate limited items from in memory buffer (#2207) 2025-08-27 14:51:35 -04:00
abelanger5
acf7215b3f fix: don't query database when flush is called concurrently (#2202) 2025-08-26 11:00:47 -04:00
abelanger5
8463b2c4a3 limit frequency of updates to rate limits (#2173) 2025-08-21 12:50:22 -04:00
abelanger5
1407594902 fix: move rate limited queue items off the main queue (#2155)
* fix: move rate limited queue items off the main queue

* preserve FIFO behavior on queues

* fix unit tests, address pr comments

* fix: generated

* rename table
2025-08-18 11:31:21 -04:00
Mohammed Nafees
c5915a3b14 Add rate limiter around scheduler concurrency (#2021)
* add rate limiter around scheduler concurrency

* have upper limit

* loadtest should pass now
2025-07-18 08:24:57 -04:00
Jean-Baptiste Souvestre
f08c348710 fix(scheduling): negative weights ranks were not excluded from the candidate workers pool (#1941)
Co-authored-by: jbsouvestre <jean-baptiste@ubble.ai>
2025-07-03 09:03:12 -04:00
Mohammed Nafees
ef498a6235 Introduce tenant Prometheus metrics (#1875)
* introduce tenant workflow completed metric

* expose tenant prom metrics via handler

* fix workflow and worker id in metrics

* correctly add workflow metrics from workflow controller

* use olap DB to gather information for workflow completion

* fix prom metrics endpoint for tenant

* workflow name from external id

* simplify tenant registry based metrics

* add docs for prometheus metrics

* fix docs lint

* run prettier fix

* WIP metrics work

* use federate prom server URL to proxy metrics

* implement workflow duration histogram metric

* separate prom stack docker compose

* fix duration metrics calls

* move scheduler metrics to prom tenant specific file

* update docs for prom metrics

* fix lint

* use proper indices to query for durations

* reorg tenant metrics

* fix lint for doc

* update docs with promql examples and casing around prom metrics enabled

* update prom server url

* fix lint

* enabled prom metrics for v1 only from controller
2025-06-27 11:46:31 -04:00
abelanger5
5c5c1aa5a1 feat: more features in the load testing harness (#1691)
* fix: make stripped payload size configurable

* feat: more load test features

* Update cmd/hatchet-loadtest/do.go

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* fix: try to fix load tests

* increase timeout, update goleak ignores

* fix: data race in scheduler with snapshot input

* fix: logger improvements

* add one more goleak ignore

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
2025-05-07 21:39:30 -04:00
Gabe Ruttner
3f4424b0fc fix: check candidate slots in bounds (#1684)
* fix: check candidate slots in bounds

* fix
2025-05-06 15:34:52 -04:00
abelanger5
ffbeafc204 revert: add back testing harness (#1659)
* re-add new testing harness

* add healthcheck port and pick random grpc port to listen on

* feat: parallel load tests and faster tests

* make parallelism = 5

* fix: lint

* add linter to pre

* fix: add back rampup fixes

* reduce matrix on PR, add matrix to pre-release step

* make load tests less likely to block

* make limit strategy group round robin

* uncomment lines
2025-05-01 15:22:30 -04:00
abelanger5
fa7ab2bd75 fix: random ticker stops working when no receiver (#1653) 2025-04-30 15:44:22 -04:00
abelanger5
d047813fd8 fix: randomize concurrency loop (#1644) 2025-04-30 07:38:34 -04:00
abelanger5
5084934b40 fix: critical deadlock bug in scheduler (#1621) 2025-04-25 21:28:15 -04:00
abelanger5
9aead7ab68 feat: global prometheus metrics (#1568)
* feat: global prometheus metrics

* configure prom with env vars, clean up metrics

* add histogram and docs

* update port
2025-04-17 15:11:38 -04:00
abelanger5
aebcf0bb0c fix: boundary conditions on 1-second rate limiters (#1379) 2025-03-20 21:44:08 +00:00
abelanger5
21bd707ba6 fix(v1): improved query plans for replay and task outputs, reassignment + timeout tweaks (#1354)
* don't call parent output task when not necessary

* help query planner by refactoring replay task

* fix: use failed task pathway for reassignments and timeouts
2025-03-17 14:10:32 -04:00
abelanger5
1f2096313d feat: v1 engine (#1318) 2025-03-11 14:57:13 -04:00
abelanger5
9b30a3c5a3 fix: make extension less memory intensive (#1241) 2025-01-31 10:28:53 -05:00
Gabe Ruttner
3185a6740d Optimization scheduler memory (#1240)
* memory optimizations

* revert mu

* trace

* revert trace

* chore: lint
2025-01-31 09:48:48 -05:00
Gabe Ruttner
ffa0e2782e fix: memory (#1237)
* fix: memory

* fix: rip

* simplify structs

* fix: unassigned
2025-01-29 19:55:30 -05:00
abelanger5
75657a109e fix: hard sticky strategy with no desired worker id (#1186) 2025-01-14 09:12:29 -08:00
abelanger5
332ccb77cf fix: don't exit early out of queuer (#1184)
* fix: don't exit early out of queuer

* rm unused file
2025-01-13 19:54:25 -08:00
Gabe Ruttner
df75ddb611 fix: fifo (#1173)
* fix: serially try assign

* fix: ensure queue sort
2025-01-09 19:33:41 -05:00
abelanger5
08356f8ae4 feat: scheduling extensions (#1131)
* feat: scheduling extensions

* add maxRuns to worker objects

* don't double count slots

* add unassigned and actions to slots to unassigned qi

* fix: proper use of extensions channel

* convert ext results ch to slice

* fix: race conditions with read locks

* add ability to set tenants on the extension

* fix: panic on scheduler
2025-01-08 19:50:19 +00:00
abelanger5
4c74a62183 refactor(repository): improve usability of repository (#1114)
* refactor(repository): consolidate repository buffers, create pattern for callbacks, consolidate queries

* fix: spelling

* fix: clean up cache
2024-12-11 18:45:02 -05:00
abelanger5
92a96beaf5 fix: latency issues on queueing caused by race condition (#1078)
* fix: remove todo

* fix: race condition on queue inserts causing high latency, improved telemetry
2024-12-02 13:52:33 -05:00
abelanger5
197bdd1f88 feat: exponential backoff (#1062)
* initial migration

* feat: exp backoff, fix linting

* fix utc issue and cleanup
2024-11-21 13:39:02 -05:00
abelanger5
c40b9154d8 fix: tenant race conditions, cleanup logic, old workers getting assigned (#1050) 2024-11-15 09:19:36 -05:00
abelanger5
48aadc6ace fix: avoid panics in lease manager (#1029) 2024-11-07 16:07:01 -05:00
Gabe Ruttner
5759311574 fix: ratelimit and invalid output blocking queue (#1023)
* fix: rm unused offending code, handle unacked

* fix: handle invalid outputs

* fix: dont reset failed

* fix: case on json err

* fix: completed step run ids

* fix: scope
2024-11-06 18:21:22 +00:00
abelanger5
9d133bc15c fix: catch all nack cases for rate limits (#1015)
* fix: properly nack rate limit when failing to schedule

* more nack cases
2024-11-05 11:37:47 -05:00
abelanger5
68bc5a0197 fix: unacked messages in the queuer (#1014)
* fix: when scheduling fails with schedule timeouts, we never ack the queue item

* add error line if we don't process everything we pass into the scheduler
2024-11-05 10:27:53 -05:00
abelanger5
3e0f15c0d8 fix: divide by zero panic (#995)
* fix: divide by zero panic

* fix: add continue
2024-10-25 19:57:55 -04:00
abelanger5
509542b804 fix: duplicate assignments in queuer (#993)
* wip: individual mutexes for actions

* tmp: debug panic

* remove debug code

* remove deadlocks package and don't write unassigned events

* fix: race condition in scheduler and add internal retries

* fix: data race
2024-10-25 16:52:43 +00:00
abelanger5
2cdee59aea refactor: optimize v0.50.0 release (#975)
- Simplifies architecture for splitting engine services into different components. The three supported services are now `grpc-api`, `scheduler`, and `controllers`. The `grpc-api` service is the only one which needs to be exposed for workers. The other two can run as unexposed services.
- Fixes a set of bugs and race conditions in the `v2` scheduler
- Adds a `lastActive` time to the `Queue` table and includes a migration that sets this `lastActive` time for the most recent 24 hours of queues. Effectively, this means that the maximum scheduling time in a queue is 24 hours.
- Rewrites the `ListWorkflowsForEvent` query to improve performance and select far fewer rows.
2024-10-23 12:05:16 +00:00
abelanger5
7b701ed209 fix: proper deletion of tenants from the scheduling pool (#974)
* fix: proper deletion of tenants from the scheduling pool

* adds some assignment spans

* feat: caching for rankings

* remove cache
2024-10-17 15:47:15 -04:00
Sean Reilly
ecb9ce1e1e rejig the query for creating multiple sticky states (#973)
* rejig the query for creating multiple sticky states

* fix: sticky strategy of soft and improve query

* fix: sort method was using indexes that didn't necessarily correspond to original indexes, leading to inconsistent behavior

---------

Co-authored-by: Sean Reilly <sean@hatchet.run>
Co-authored-by: Alexander Belanger <alexander@hatchet.run>
2024-10-17 13:29:19 +00:00
abelanger5
17dc80cad8 fix: don't append invalid slots with a hard sticky strategy (#972) 2024-10-16 20:21:39 +00:00
abelanger5
e4af494f69 fix: add slot expiry and delete actions from scheduler properly (#969)
* fix: add back slot expiry

* fix: remove action if all slots are inactive
2024-10-16 15:55:18 -04:00
abelanger5
cb39c938b3 fix: ack rate limits properly (#968) 2024-10-16 13:32:10 -04:00
Sean Reilly
7e526de381 fix: deadlocks on events and incorrect step run ordering query (#966)
* make it so the bulk example succeeds

* make the bulk workflows work a little harder

* add some ordering to mitigate deadlocks

* fix: link step run parents bad query, improvements to locking

* add timed mutex and telemetry

* remove for update on cancel

---------

Co-authored-by: Sean Reilly <sean@hatchet.run>
Co-authored-by: Alexander Belanger <alexander@hatchet.run>
2024-10-16 10:28:33 -04:00
abelanger5
67a96d7166 feat(throughput): single process per queue (#956)
* feat(throughput): single process per queue

* fix data race

* fix: golint and data race on load test

* wrap up initial v2 scheduler

* fix: more debug logs and tighten channel logic/blocking sends

* improved casing on dispatcher and lease manager

* fix: data race on min id

* increase wait on load test, fix data race

* fix: trylock -> lock

* clean up queue when no longer in set

* fix: clean up cache on exit

* ensure cleanup is only called once

* address review comments
2024-10-15 11:05:19 -04:00
abelanger5
3d218302ff fix: internal queue items performance and race conditions (#943)
* fix: don't use xmin hack

* fix: assign not append

* refactor: parallel step run updates via hashes

* fix: intermittent double execution of child step runs

* fix: rollback rate limits

* fix: bulk event writes from single buffer

* expose cleanup

* fix: race conditions on failures and cancellations

* change logger defaults to warn and console
2024-10-07 11:16:53 -04:00
abelanger5
a1a10b4073 feat: dynamic rate limits (#904)
* wip: step run expressions on rate limits

* feat: dynamic rate limits

* chore: v0.47.0

* chore: address changes from PR review

* fix: improved error handling

* address pr review

* better error messages for step run cels, remove debug logs

* fix: hash

---------

Co-authored-by: gabriel ruttner <gabriel.ruttner@gmail.com>
2024-09-26 22:00:34 +00:00
abelanger5
891514b461 feat: queue v4 (#842)
* wip: v4 of queue

* fix: correct query for updating counts

* tmp: save migration files

* feat: wrap up initial queue

* fix compilation

* fix: reassigns
2024-09-06 16:12:22 -04:00