24 Commits

Author SHA1 Message Date
abelanger5
9f463e92d6 refactor: move v1 packages, remove webhook worker references (#2749)
* chore: move v1 packages, remove webhook worker references

* chore: move msgqueue

* fix: relative paths in sqlc.yaml
2026-01-02 11:42:40 -05:00
abelanger5
f82d3bd071 refactor: consolidate repository methods (#2730)
* refactor: remove v0 paths from codebase

* remove uiVersion references

* refactor: remove v0-exclusive database queries

* remove webhook test

* chore: move api token repository

* chore: move dispatcher repository to v1

* chore: move health repository to v1

* chore: remove event repository

* remove some unused repositories

* chore: move mq implementation to v1

* chore: consolidate rate limit implementations

* chore: move security check to v1 repository

* chore: move slack to v1 repository

* chore: move sns implementation to v1 repository

* clean up step repository

* chore: move tenant invite to v1 repository

* chore: move limits, workers, tenant alerts to v1 repository

* chore: move user, tenant, userSession to v1 repository

* chore: move ticker to v1 repository

* chore: move scheduled workflows to v1 repository

* chore: remove workflows

* fix: remove pointer for limits config file

* propagate cache value to api token

* propagate cache durations
2025-12-31 16:35:46 -05:00
abelanger5
dd9c36c315 refactor: remove v0 paths from codebase (#2728)
* refactor: remove v0 paths from codebase

* remove uiVersion references
2025-12-30 09:57:00 -05:00
abelanger5
2249ef3b79 fix: small scheduler optimizations (#2426)
* fix: actually increment snapshot count

* add a context with timeout to wrap replenish
2025-11-17 15:45:14 -05:00
Mohammed Nafees
cf5c5989ff add vars to tune concurrency poller (#2428) 2025-10-23 11:36:12 -04:00
Mohammed Nafees
e2b1f1353e Fix OTel span attribute naming convention (#2409)
* rename spans according to convention

* low cardinality
2025-10-16 18:43:40 +02:00
Mohammed Nafees
ed40a82dbb Include tenant_id in OTel spans wherever possible (#2382) 2025-10-03 18:16:16 +02:00
matt
cf59a7bcd9 Feat: Worker slot Prom metrics (#2195)
* feat: add slots to prom metrics

* feat: available

* fix: extension instead

* fix: docs

* fix: rm unused query changes

* fix: rm unused struct

* fix: labels

* feat: improve total slots

* fix: pr feedback

* fix: docs

* Revert "fix: docs"

This reverts commit 7fe105da92.

* fix: derive total slots
2025-09-08 14:07:44 -04:00
abelanger5
2c8ea66a7a fix: remove rate limited items from in memory buffer (#2207) 2025-08-27 14:51:35 -04:00
abelanger5
acf7215b3f fix: don't query database when flush is called concurrently (#2202) 2025-08-26 11:00:47 -04:00
abelanger5
8463b2c4a3 limit frequency of updates to rate limits (#2173) 2025-08-21 12:50:22 -04:00
abelanger5
1407594902 fix: move rate limited queue items off the main queue (#2155)
* fix: move rate limited queue items off the main queue

* preserve FIFO behavior on queues

* fix unit tests, address pr comments

* fix: generated

* rename table
2025-08-18 11:31:21 -04:00
Mohammed Nafees
c5915a3b14 Add rate limiter around scheduler concurrency (#2021)
* add rate limiter around scheduler concurrency

* have upper limit

* loadtest should pass now
2025-07-18 08:24:57 -04:00
Jean-Baptiste Souvestre
f08c348710 fix(scheduling): negative weights ranks were not excluded from the candidate workers pool (#1941)
Co-authored-by: jbsouvestre <jean-baptiste@ubble.ai>
2025-07-03 09:03:12 -04:00
Mohammed Nafees
ef498a6235 Introduce tenant Prometheus metrics (#1875)
* introduce tenant workflow completed metric

* expose tenant prom metrics via handler

* fix workflow and worker id in metrics

* correctly add workflow metrics from workflow controller

* use olap DB to gather information for workflow completion

* fix prom metrics endpoint for tenant

* workflow name from external id

* simplify tenant registry based metrics

* add docs for prometheus metrics

* fix docs lint

* run prettier fix

* WIP metrics work

* use federate prom server URL to proxy metrics

* implement workflow duration histogram metric

* separate prom stack docker compose

* fix duration metrics calls

* move scheduler metrics to prom tenant specific file

* update docs for prom metrics

* fix lint

* use proper indices to query for durations

* reorg tenant metrics

* fix lint for doc

* update docs with promql examples and casing around prom metrics enabled

* update prom server url

* fix lint

* enabled prom metrics for v1 only from controller
2025-06-27 11:46:31 -04:00
abelanger5
5c5c1aa5a1 feat: more features in the load testing harness (#1691)
* fix: make stripped payload size configurable

* feat: more load test features

* Update cmd/hatchet-loadtest/do.go

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* fix: try to fix load tests

* increase timeout, update goleak ignores

* fix: data race in scheduler with snapshot input

* fix: logger improvements

* add one more goleak ignore

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
2025-05-07 21:39:30 -04:00
Gabe Ruttner
3f4424b0fc fix: check candidate slots in bounds (#1684)
* fix: check candidate slots in bounds

* fix
2025-05-06 15:34:52 -04:00
abelanger5
ffbeafc204 revert: add back testing harness (#1659)
* re-add new testing harness

* add healthcheck port and pick random grpc port to listen on

* feat: parallel load tests and faster tests

* make parallelism = 5

* fix: lint

* add linter to pre

* fix: add back rampup fixes

* reduce matrix on PR, add matrix to pre-release step

* make load tests less likely to block

* make limit strategy group round robin

* uncomment lines
2025-05-01 15:22:30 -04:00
abelanger5
d047813fd8 fix: randomize concurrency loop (#1644) 2025-04-30 07:38:34 -04:00
abelanger5
5084934b40 fix: critical deadlock bug in scheduler (#1621) 2025-04-25 21:28:15 -04:00
abelanger5
9aead7ab68 feat: global prometheus metrics (#1568)
* feat: global prometheus metrics

* configure prom with env vars, clean up metrics

* add histogram and docs

* update port
2025-04-17 15:11:38 -04:00
abelanger5
aebcf0bb0c fix: boundary conditions on 1-second rate limiters (#1379) 2025-03-20 21:44:08 +00:00
abelanger5
21bd707ba6 fix(v1): improved query plans for replay and task outputs, reassignment + timeout tweaks (#1354)
* don't call parent output task when not necessary

* help query planner by refactoring replay task

* fix: use failed task pathway for reassignments and timeouts
2025-03-17 14:10:32 -04:00
abelanger5
1f2096313d feat: v1 engine (#1318) 2025-03-11 14:57:13 -04:00