hatchet

mirror of https://github.com/hatchet-dev/hatchet.git synced 2025-12-20 08:10:26 -06:00

Author	SHA1	Message	Date
abelanger5	2249ef3b79	fix: small scheduler optimizations (#2426 ) * fix: actually increment snapshot count * add a context with timeout to wrap replenish	2025-11-17 15:45:14 -05:00
Mohammed Nafees	cf5c5989ff	add vars to tune concurrency poller (#2428 )	2025-10-23 11:36:12 -04:00
Mohammed Nafees	e2b1f1353e	Fix OTel span attribute naming convention (#2409 ) * rename spans according to convention * low cardinality	2025-10-16 18:43:40 +02:00
Mohammed Nafees	ed40a82dbb	Include `tenant_id` in OTel spans wherever possible (#2382 )	2025-10-03 18:16:16 +02:00
matt	cf59a7bcd9	Feat: Worker slot Prom metrics (#2195 ) * feat: add slots to prom metrics * feat: available * fix: extension instead * fix: docs * fix: rm unused query changes * fix: rm unused struct * fix: labels * feat: improve total slots * fix: pr feedback * fix: docs * Revert "fix: docs" This reverts commit `7fe105da92`. * fix: derive total slots	2025-09-08 14:07:44 -04:00
abelanger5	2c8ea66a7a	fix: remove rate limited items from in memory buffer (#2207 )	2025-08-27 14:51:35 -04:00
abelanger5	acf7215b3f	fix: don't query database when flush is called concurrently (#2202 )	2025-08-26 11:00:47 -04:00
abelanger5	8463b2c4a3	limit frequency of updates to rate limits (#2173 )	2025-08-21 12:50:22 -04:00
abelanger5	1407594902	fix: move rate limited queue items off the main queue (#2155 ) * fix: move rate limited queue items off the main queue * preserve FIFO behavior on queues * fix unit tests, address pr comments * fix: generated * rename table	2025-08-18 11:31:21 -04:00
Mohammed Nafees	c5915a3b14	Add rate limiter around scheduler concurrency (#2021 ) * add rate limiter around scheduler concurrency * have upper limit * loadtest should pass now	2025-07-18 08:24:57 -04:00
Jean-Baptiste Souvestre	f08c348710	fix(scheduling): negative weights ranks were not excluded from the candidate workers pool (#1941 ) Co-authored-by: jbsouvestre <jean-baptiste@ubble.ai>	2025-07-03 09:03:12 -04:00
Mohammed Nafees	ef498a6235	Introduce tenant Prometheus metrics (#1875 ) * introduce tenant workflow completed metric * expose tenant prom metrics via handler * fix workflow and worker id in metrics * correctly add workflow metrics from workflow controller * use olap DB to gather information for workflow completion * fix prom metrics endpoint for tenant * workflow name from external id * simplify tenant registry based metrics * add docs for prometheus metrics * fix docs lint * run prettier fix * WIP metrics work * use federate prom server URL to proxy metrics * implement workflow duration histogram metric * separate prom stack docker compose * fix duration metrics calls * move scheduler metrics to prom tenant specific file * update docs for prom metrics * fix lint * use proper indices to query for durations * reorg tenant metrics * fix lint for doc * update docs with promql examples and casing around prom metrics enabled * update prom server url * fix lint * enabled prom metrics for v1 only from controller	2025-06-27 11:46:31 -04:00
abelanger5	5c5c1aa5a1	feat: more features in the load testing harness (#1691 ) * fix: make stripped payload size configurable * feat: more load test features * Update cmd/hatchet-loadtest/do.go Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * fix: try to fix load tests * increase timeout, update goleak ignores * fix: data race in scheduler with snapshot input * fix: logger improvements * add one more goleak ignore --------- Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>	2025-05-07 21:39:30 -04:00
Gabe Ruttner	3f4424b0fc	fix: check candidate slots in bounds (#1684 ) * fix: check candidate slots in bounds * fix	2025-05-06 15:34:52 -04:00
abelanger5	ffbeafc204	revert: add back testing harness (#1659 ) * re-add new testing harness * add healthcheck port and pick random grpc port to listen on * feat: parallel load tests and faster tests * make parallelism = 5 * fix: lint * add linter to pre * fix: add back rampup fixes * reduce matrix on PR, add matrix to pre-release step * make load tests less likely to block * make limit strategy group round robin * uncomment lines	2025-05-01 15:22:30 -04:00
abelanger5	fa7ab2bd75	fix: random ticker stops working when no receiver (#1653 )	2025-04-30 15:44:22 -04:00
abelanger5	d047813fd8	fix: randomize concurrency loop (#1644 )	2025-04-30 07:38:34 -04:00
abelanger5	5084934b40	fix: critical deadlock bug in scheduler (#1621 )	2025-04-25 21:28:15 -04:00
abelanger5	9aead7ab68	feat: global prometheus metrics (#1568 ) * feat: global prometheus metrics * configure prom with env vars, clean up metrics * add histogram and docs * update port	2025-04-17 15:11:38 -04:00
abelanger5	aebcf0bb0c	fix: boundary conditions on 1-second rate limiters (#1379 )	2025-03-20 21:44:08 +00:00
abelanger5	21bd707ba6	fix(v1): improved query plans for replay and task outputs, reassignment + timeout tweaks (#1354 ) * don't call parent output task when not necessary * help query planner by refactoring replay task * fix: use failed task pathway for reassignments and timeouts	2025-03-17 14:10:32 -04:00
abelanger5	1f2096313d	feat: v1 engine (#1318 )	2025-03-11 14:57:13 -04:00
abelanger5	9b30a3c5a3	fix: make extension less memory intensive (#1241 )	2025-01-31 10:28:53 -05:00
Gabe Ruttner	3185a6740d	Optimization scheduler memory (#1240 ) * memory optimizations * revert mu * trace * revert trace * chore: lint	2025-01-31 09:48:48 -05:00
Gabe Ruttner	ffa0e2782e	fix: memory (#1237 ) * fix: memory * fix: rip * simplify structs * fix: unassigned	2025-01-29 19:55:30 -05:00
abelanger5	75657a109e	fix: hard sticky strategy with no desired worker id (#1186 )	2025-01-14 09:12:29 -08:00
abelanger5	332ccb77cf	fix: don't exit early out of queuer (#1184 ) * fix: don't exit early out of queuer * rm unused file	2025-01-13 19:54:25 -08:00
Gabe Ruttner	df75ddb611	fix: fifo (#1173 ) * fix: serially try assign * fix: ensure queue sort	2025-01-09 19:33:41 -05:00
abelanger5	08356f8ae4	feat: scheduling extensions (#1131 ) * feat: scheduling extensions * add maxRuns to worker objects * don't double count slots * add unassigned and actions to slots to unassigned qi * fix: proper use of extensions channel * convert ext results ch to slice * fix: race conditions with read locks * add ability to set tenants on the extension * fix: panic on scheduler	2025-01-08 19:50:19 +00:00
abelanger5	4c74a62183	refactor(repository): improve usability of repository (#1114 ) * refactor(repository): consolidate repository buffers, create pattern for callbacks, consolidate queries * fix: spelling * fix: clean up cache	2024-12-11 18:45:02 -05:00
abelanger5	92a96beaf5	fix: latency issues on queueing caused by race condition (#1078 ) * fix: remove todo * fix: race condition on queue inserts causing high latency, improved telemetry	2024-12-02 13:52:33 -05:00
abelanger5	197bdd1f88	feat: exponential backoff (#1062 ) * initial migration * feat: exp backoff, fix linting * fix utc issue and cleanup	2024-11-21 13:39:02 -05:00
abelanger5	c40b9154d8	fix: tenant race conditions, cleanup logic, old workers getting assigned (#1050 )	2024-11-15 09:19:36 -05:00
abelanger5	48aadc6ace	fix: avoid panics in lease manager (#1029 )	2024-11-07 16:07:01 -05:00
Gabe Ruttner	5759311574	fix: ratelimit and invalid output blocking queue (#1023 ) * fix: rm unused offending code, handle unacked * fix: handle invalid outputs * fix: dont reset failed * fix: case on json err * fix: completed step run ids * fix: scope	2024-11-06 18:21:22 +00:00
abelanger5	9d133bc15c	fix: catch all nack cases for rate limits (#1015 ) * fix: properly nack rate limit when failing to schedule * more nack cases	2024-11-05 11:37:47 -05:00
abelanger5	68bc5a0197	fix: unacked messages in the queuer (#1014 ) * fix: when scheduling fails with schedule timeouts, we never ack the queue item * add error line if we don't process everything we pass into the scheduler	2024-11-05 10:27:53 -05:00
abelanger5	3e0f15c0d8	fix: divide by zero panic (#995 ) * fix: divide by zero panic * fix: add continue	2024-10-25 19:57:55 -04:00
abelanger5	509542b804	fix: duplicate assignments in queuer (#993 ) * wip: individual mutexes for actions * tmp: debug panic * remove debug code * remove deadlocks package and don't write unassigned events * fix: race condition in scheduler and add internal retries * fix: data race	2024-10-25 16:52:43 +00:00
abelanger5	2cdee59aea	refactor: optimize v0.50.0 release (#975 ) - Simplifies architecture for splitting engine services into different components. The three supported services are now `grpc-api`, `scheduler`, and `controllers`. The `grpc-api` service is the only one which needs to be exposed for workers. The other two can run as unexposed services. - Fixes a set of bugs and race conditions in the `v2` scheduler - Adds a `lastActive` time to the `Queue` table and includes a migration which sets this `lastActive` time for the most recent 24 hours of queues. Effectively this means that the max scheduling time in a queue is 24 hours. - Rewrites the `ListWorkflowsForEvent` query to improve performance and select far fewer rows.	2024-10-23 12:05:16 +00:00
abelanger5	7b701ed209	fix: proper deletion of tenants from the scheduling pool (#974 ) * fix: proper deletion of tenants from the scheduling pool * adds some assignment spans * feat: caching for rankings * remove cache	2024-10-17 15:47:15 -04:00
Sean Reilly	ecb9ce1e1e	rejig the query for creating multiple sticky states (#973 ) * rejig the query for creating multiple sticky states * fix: sticky strategy of soft and improve query * fix: sort method was using indexes that didn't necessarily correspond to original indexes, leading to inconsistent behavior --------- Co-authored-by: Sean Reilly <sean@hatchet.run> Co-authored-by: Alexander Belanger <alexander@hatchet.run>	2024-10-17 13:29:19 +00:00
abelanger5	17dc80cad8	fix: don't append invalid slots with a hard sticky strategy (#972 )	2024-10-16 20:21:39 +00:00
abelanger5	e4af494f69	fix: add slot expiry and delete actions from scheduler properly (#969 ) * fix: add back slot expiry * fix: remove action if all slots are inactive	2024-10-16 15:55:18 -04:00
abelanger5	cb39c938b3	fix: ack rate limits properly (#968 )	2024-10-16 13:32:10 -04:00
Sean Reilly	7e526de381	fix: deadlocks on events and incorrect step run ordering query (#966 ) * make it so the bulk example succeeds * make the bulk workflows work a little harder * add some ordering to mitigate deadlocks * fix: link step run parents bad query, improvements to locking * add timed mutex and telemetry * remove for update on cancel --------- Co-authored-by: Sean Reilly <sean@hatchet.run> Co-authored-by: Alexander Belanger <alexander@hatchet.run>	2024-10-16 10:28:33 -04:00
abelanger5	67a96d7166	feat(throughput): single process per queue (#956 ) * feat(throughput): single process per queue * fix data race * fix: golint and data race on load test * wrap up initial v2 scheduler * fix: more debug logs and tighten channel logic/blocking sends * improved casing on dispatcher and lease manager * fix: data race on min id * increase wait on load test, fix data race * fix: trylock -> lock * clean up queue when no longer in set * fix: clean up cache on exit * ensure cleanup is only called once * address review comments	2024-10-15 11:05:19 -04:00
abelanger5	3d218302ff	fix: internal queue items performance and race conditions (#943 ) * fix: don't use xmin hack * fix: assign not append * refactor: parallel step run updates via hashes * fix: intermittent double execution of child step runs * fix: rollback rate limits * fix: bulk event writes from single buffer * expose cleanup * fix: race conditions on failures and cancellations * change logger defaults to warn and console	2024-10-07 11:16:53 -04:00
abelanger5	a1a10b4073	feat: dynamic rate limits (#904 ) * wip: step run expressions on rate limits * feat: dynamic rate limits * chore: v0.47.0 * chore: address changes from PR review * fix: improved error handling * address pr review * better error messages for step run cels, remove debug logs * fix: hash --------- Co-authored-by: gabriel ruttner <gabriel.ruttner@gmail.com>	2024-09-26 22:00:34 +00:00
abelanger5	891514b461	feat: queue v4 (#842 ) * wip: v4 of queue * fix: correct query for updating counts * tmp: save migration files * feat: wrap up initial queue * fix compilation * fix: reassigns	2024-09-06 16:12:22 -04:00

54 Commits