323 Commits

Author SHA1 Message Date
Sean Reilly
e32f353587 Speed up the delete worker query (#1103)
* add an index on lastHeartbeatAt and don't do highly related actions concurrently

---------

Co-authored-by: Sean Reilly <sean@hatchet.run>
2024-12-12 20:49:22 -05:00
abelanger5
94d14336aa feat(go-sdk): blocking worker (#1106) 2024-12-12 20:42:13 -05:00
abelanger5
4c74a62183 refactor(repository): improve usability of repository (#1114)
* refactor(repository): consolidate repository buffers, create pattern for callbacks, consolidate queries

* fix: spelling

* fix: clean up cache
2024-12-11 18:45:02 -05:00
Gabe Ruttner
44ffe1d66c fix: panic (#1105) 2024-12-09 15:50:36 +00:00
abelanger5
1499668df9 fix: duplicate cron expressions only cause a single trigger (#1101) 2024-12-06 16:02:37 -05:00
abelanger5
92a96beaf5 fix: latency issues on queueing caused by race condition (#1078)
* fix: remove todo

* fix: race condition on queue inserts causing high latency, improved telemetry
2024-12-02 13:52:33 -05:00
Gabe Ruttner
574eb0b67e feat: dynamic crons (#1000)
* wip: stub schedule page

* wip: stub list

* fix: 2025 bug...

* feat: wip cron list

* feat: addl meta

* feat: expose metadata column

* feat: sort and created at

* cron to recurring

* scheduled: with statuses

* fix: links

* feat: expose schedule ids

* feat: delete run

* fix: remove search

* feat: filterable scheduled

* fix: remove broken features

* chore: lint

* rm metadata for now

* chore: lint

* chore: recurring to cron job

* fix: review comments

* fix: populator

* wip cron changes

* fix: ids are helpful

* fix: populator

* wip

* wip: create crons, stub scheduled

* wip: create schedule

* wip add trigger buttons to all the pages

* wip: reusable trigger form

* fix: hash

* fixes: cron bugs

* fixes: cron sort

* fix: out of order migrations

* fix: add internalRetryCount

* feat: api things survive version transitions

* feat: table things

* feat: delete disabled for non api

* feat: prevent delete non api

* feat: filters

* require cron name for api

* default name

* fix: migrations

* frontend improvements and migrations

* fix: pagination

---------

Co-authored-by: Alexander Belanger <alexander@hatchet.run>
2024-11-21 16:18:24 -05:00
abelanger5
197bdd1f88 feat: exponential backoff (#1062)
* initial migration

* feat: exp backoff, fix linting

* fix utc issue and cleanup
2024-11-21 13:39:02 -05:00
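For readers unfamiliar with the technique named in the commit above, the following is a minimal sketch of an exponential backoff delay calculation. It is not taken from the repository: the function name `backoff_delay` and the parameters `factor`, `base_seconds`, and `max_seconds` are hypothetical, chosen only for illustration.

```python
def backoff_delay(retry_count, factor=2.0, base_seconds=1.0, max_seconds=60.0):
    """Return the wait before retry number `retry_count` (1 = first retry).

    All parameter names and defaults are illustrative assumptions,
    not values from the project this log describes.
    """
    if retry_count < 1:
        raise ValueError("retry_count starts at 1")
    # The delay grows geometrically with each retry and is capped at max_seconds.
    return min(max_seconds, base_seconds * factor ** (retry_count - 1))


def demo_delays():
    """Delays for the first eight retries under the default settings."""
    return [backoff_delay(n) for n in range(1, 9)]
```

With the defaults, the delay doubles from 1 second up to 32 seconds and then stays at the 60-second cap. The cap keeps a long-failing step from waiting arbitrarily long between attempts.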
abelanger5
ae5df5b88d fix: make race condition on reassignment more rare (#1052)
* fix: make race condition on reassignment more rare

* fix: proper concurrency on bulk dispatch

* prevent concurrent err assignments
2024-11-15 14:17:51 -05:00
abelanger5
faff6001a8 fix: propagate schedule timeouts to children (#1051) 2024-11-15 10:07:33 -05:00
Sean Reilly
d7d80393c3 add some logging so it is easier to see what grpc rate limits are set (#1045)
Co-authored-by: Sean Reilly <sean@hatchet.run>
2024-11-15 09:20:15 -05:00
abelanger5
c40b9154d8 fix: tenant race conditions, cleanup logic, old workers getting assigned (#1050) 2024-11-15 09:19:36 -05:00
abelanger5
780496e7fb fix: prevent infinite reassign loop (#1028) 2024-11-07 17:28:12 +00:00
Gabe Ruttner
c531c36870 fix: filter-cancel-cases (#1027)
* fix: filter-cancel-cases

* fix: case CANCELLED_BY_CONCURRENCY_LIMIT
2024-11-07 11:18:50 -05:00
Alexander Belanger
5b59af076e fix: cancellation status propagation and minimap view 2024-11-07 11:13:14 -05:00
Gabe Ruttner
c227960453 fix: drop e in Requeuing (#1013) 2024-11-04 16:30:38 -05:00
Sean Reilly
b456382429 add multiple rate limiter in grpc using a token bucket (#984)
* add multiple rate limiter in grpc using a token bucket

* PR feedback

* add in client retry for go client

* update test files

* remove log line; only retry on ResourceExhausted and Unavailable

* add some concurrency limits so we don't swamp ourselves

* add some logging for when we are getting backed up

* let's not queue up when we are too full, to prevent OOM problems

* fix spelling

* add config options for maximum concurrent and how long to wait for flush; let the wait for flush setting be used as back pressure and a signal to writers that we are slowing up

* lots of changes to buffering

* fix data race

* add some comments explaining how this works, change errors to be ResourceExhausted now that we have client retry, and limit how many gofuncs we can create on cleanup and wait for them to finish before we exit

* hooking up the config values so they go to the right place

* Update config.go to default to 1 ms waitForFlush

* disable grpc_retry for client streams

* explicitly set the limit if it is 0

* weirdness because we were using an older version of the lib

---------

Co-authored-by: Sean Reilly <sean@hatchet.run>
Co-authored-by: Alexander Belanger <alexander@hatchet.run>
2024-11-01 11:48:23 +00:00
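The commit above names a token bucket as the rate-limiting technique. The following is a minimal, generic sketch of a single token bucket, assuming a refill rate in tokens per second and a fixed capacity. It is not the repository's implementation: `TokenBucket`, `FakeClock`, and `demo` are hypothetical names, and a real gRPC server would presumably keep one bucket per limited scope and answer a rejected call with ResourceExhausted, as the commit notes suggest.

```python
import time


class TokenBucket:
    """Holds up to `capacity` tokens, refilled at `rate` tokens per second."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.tokens = float(capacity)  # start full so an initial burst is allowed
        self.last = clock()

    def allow(self, cost=1.0):
        """Spend `cost` tokens and return True if available, else return False."""
        now = self.clock()
        # Refill in proportion to the elapsed time, never above capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


class FakeClock:
    """Manually advanced clock (illustrative) so the demo is deterministic."""

    def __init__(self):
        self.now = 0.0

    def __call__(self):
        return self.now


def demo():
    clock = FakeClock()
    bucket = TokenBucket(rate=2, capacity=2, clock=clock)
    results = [bucket.allow(), bucket.allow(), bucket.allow()]  # burst of three calls
    clock.now += 0.5  # half a second refills one token at 2 tokens/s
    results.append(bucket.allow())
    return results
```

In `demo`, the first two calls drain the bucket, the third is rejected, and the fourth succeeds once half a second has refilled one token. A client that retries only on ResourceExhausted and Unavailable, as the commit describes, would back off and retry the rejected call.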
Gabe Ruttner
1003a1f5e7 fix: filter alert runs by failure only (#1001)
* fix: filter runs by failure only

* fix: post-lookup filter

* fix: filtered failures

---------

Co-authored-by: Alexander Belanger <alexander@hatchet.run>
2024-11-01 11:46:27 +00:00
Gabe Ruttner
44addbb47e Feat scheduled improvements (#992)
* wip: stub schedule page

* wip: stub list

* fix: 2025 bug...

* feat: wip cron list

* feat: addl meta

* feat: expose metadata column

* feat: sort and created at

* cron to recurring

* scheduled: with statuses

* fix: links

* feat: expose schedule ids

* feat: delete run

* fix: remove search

* feat: filterable scheduled

* fix: remove broken features

* chore: lint

* rm metadata for now

* chore: lint

* chore: recurring to cron job

* fix: review comments

* fix: populator
2024-11-01 07:16:20 -04:00
Sean Reilly
7d5b41b082 add an essential pool for heartbeats (#1003)
* add an essential pool for heartbeats

* add some telemetry spans to heartbeat and capture any errors

---------

Co-authored-by: Sean Reilly <sean@hatchet.run>
2024-11-01 07:09:45 -04:00
Gabe Ruttner
4932e7f863 Feat sdk runtime (#942)
* feat: runtime signature

* feat: add sdk runtime to worker model

* feat: post runtime

* feat: expose sdk version on worker

* feat: go inf

* chore: gen

* chore: migrations and generation

* fix: simpler runtime

* feat: hatchet sdk ver

* fix: rm debug line
2024-10-28 13:47:12 -07:00
abelanger5
509542b804 fix: duplicate assignments in queuer (#993)
* wip: individual mutexes for actions

* tmp: debug panic

* remove debug code

* remove deadlocks package and don't write unassigned events

* fix: race condition in scheduler and add internal retries

* fix: data race
2024-10-25 16:52:43 +00:00
abelanger5
dd5bc90497 fix: more efficient step run events, reduce caching on queue (#981) 2024-10-23 16:23:59 -04:00
abelanger5
2cdee59aea refactor: optimize v0.50.0 release (#975)
- Simplifies architecture for splitting engine services into different components. The three supported services are now `grpc-api`, `scheduler`, and `controllers`. The `grpc-api` service is the only one which needs to be exposed for workers. The other two can run as unexposed services.
- Fixes a set of bugs and race conditions in the `v2` scheduler
- Adds a `lastActive` time to the `Queue` table and includes a migration which sets this `lastActive` time for the most recent 24 hours of queues. Effectively this means that the max scheduling time in a queue is 24 hours. 
- Rewrites the `ListWorkflowsForEvent` query to improve performance and select far fewer rows.
2024-10-23 12:05:16 +00:00
Gabe Ruttner
7cd08077d5 feat: improved sdk ack (#931)
* feat: add step run event reasons

* feat: ack

* fix: remove rejected reason

* fix: merge

* fix: correct buffer

* fix: consistent message

* chore: rm todo
2024-10-15 15:52:42 +00:00
abelanger5
67a96d7166 feat(throughput): single process per queue (#956)
* feat(throughput): single process per queue

* fix data race

* fix: golint and data race on load test

* wrap up initial v2 scheduler

* fix: more debug logs and tighten channel logic/blocking sends

* improved casing on dispatcher and lease manager

* fix: data race on min id

* increase wait on load test, fix data race

* fix: trylock -> lock

* clean up queue when no longer in set

* fix: clean up cache on exit

* ensure cleanup is only called once

* address review comments
2024-10-15 11:05:19 -04:00
Sean Reilly
29721cd1f0 Feat bulk workflows (#940)
Adds support for inserting workflows in bulk via the API and an optional buffered insert on the engine.
2024-10-14 15:35:29 -04:00
Gabe Ruttner
2519b71b9e fix: concurrency (#962)
* fix: concurrency

* fix: counts

* fix: try lock
2024-10-14 18:44:53 +00:00
Gabe Ruttner
6af75638f2 feat: add helpful context to alert email (#954) 2024-10-11 09:53:28 -04:00
abelanger5
95558138a4 chore: improve throughput, remove deadlocks (#949)
* add otel to pub

* temporarily remove tenant id exchange

* fix: increase internal queue throughput

* fix: remove potential deadlocking

* rollback hash factor multiplier

* fix: batch update issues

* fix: rm unneeded locks

* move disable tenant pubsub to an env var

---------

Co-authored-by: gabriel ruttner <gabriel.ruttner@gmail.com>
2024-10-10 08:54:34 -04:00
abelanger5
3d218302ff fix: internal queue items performance and race conditions (#943)
* fix: don't use xmin hack

* fix: assign not append

* refactor: parallel step run updates via hashes

* fix: intermittent double execution of child step runs

* fix: rollback rate limits

* fix: bulk event writes from single buffer

* expose cleanup

* fix: race conditions on failures and cancellations

* change logger defaults to warn and console
2024-10-07 11:16:53 -04:00
abelanger5
fd4ee804d3 refactor: buffered writes of step run statuses (#941)
* (wip) handle step run updates without deferred updates

* refactor: buffered writes of step run statuses

* fix: add more safety on tenant pools

* add configurable flush period, remove wait for started

* flush immediately if last flush time plus flush period is in the past

* feat: add configurable flush interval/max items
2024-10-04 15:08:21 -04:00
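The buffering behaviour described in the commit above (a configurable flush period, a maximum item count, and an immediate flush when the last flush time plus the flush period is already in the past) can be sketched as follows. This is an illustration under stated assumptions, not the repository's code: `WriteBuffer`, `flush_fn`, `max_items`, and `flush_period` are hypothetical names, and a production buffer would also need a background timer and locking, both omitted here.

```python
import time


class WriteBuffer:
    """Collects items and hands them to `flush_fn` in batches."""

    def __init__(self, flush_fn, max_items, flush_period, clock=time.monotonic):
        self.flush_fn = flush_fn          # called with a list of buffered items
        self.max_items = max_items        # flush once this many items are buffered
        self.flush_period = flush_period  # seconds allowed between flushes
        self.clock = clock
        self.items = []
        self.last_flush = clock()

    def add(self, item):
        self.items.append(item)
        # Flush immediately if the flush period has already elapsed,
        # or if the buffer has reached its size limit.
        overdue = self.last_flush + self.flush_period <= self.clock()
        if overdue or len(self.items) >= self.max_items:
            self.flush()

    def flush(self):
        batch, self.items = self.items, []
        self.last_flush = self.clock()
        if batch:
            self.flush_fn(batch)


def demo():
    now = [0.0]  # manually advanced clock so the demo is deterministic
    flushed = []
    buf = WriteBuffer(flushed.append, max_items=3, flush_period=1.0,
                      clock=lambda: now[0])
    for item in ("a", "b", "c"):  # the third item hits max_items
        buf.add(item)
    now[0] += 2.0                 # the flush period has now elapsed
    buf.add("d")                  # overdue, so it is flushed right away
    return flushed
```

In `demo`, the first batch is written when the size limit is reached, and the later single item is written at once because the period has passed. Batching this way trades a bounded delay for fewer, larger writes.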
Sean Reilly
27736fa30f bulk insert buffering (#913)
Adds bulk inserts to event writes, and adds a generic buffer which can be used by future batch implementations.
2024-10-03 16:26:12 -04:00
Gabe Ruttner
f5add0d15c fix: write duration (#936) 2024-10-03 09:37:53 -04:00
abelanger5
8939c94f63 fix: send fewer messages to job queue when it's not necessary (#932)
* handle started at differently

* fix: start job runs in workflows controller

* fix: keep job runs around for backwards compat
2024-10-03 07:39:06 -04:00
abelanger5
b4c861d7a1 patch: release semaphore slots before jobs controller (#927)
* fix: don't need acks on queue checks

* patch: release semaphores early

* proper list on high queue depth

* fix: don't release on started
2024-10-02 11:36:05 -04:00
Gabe Ruttner
5fcf5eff6a fix: separate context (#929)
* fix: separate context

* chore: comments

* chore: generate

* chore: gen

* chore: update protoc 28.2
2024-10-02 10:53:51 -04:00
abelanger5
c3fa2c57f3 fix: don't need acks on queue checks (#926) 2024-10-02 00:52:02 +00:00
abelanger5
c29984305e fix: faster processing of timeout queue items (#924) 2024-10-01 13:50:38 +00:00
Gabe Ruttner
7d7e43d4e1 feat: pauseable workflows (#879)
* feat: pause workflow state

* feat: dont run paused workflows

* feat: skipped paused

* implement unpaused behavior for workflow runs

* fix: frontend

* fix: more frontend

* fix: imports

---------

Co-authored-by: Alexander Belanger <alexander@hatchet.run>
2024-09-29 10:58:10 -04:00
abelanger5
bfb11cac51 fix: always use retention on queues, optional data/worker (#916) 2024-09-27 14:23:14 -04:00
Gabe Ruttner
3ff00a1866 feat: improved dag validation (#915) 2024-09-27 14:23:06 -04:00
abelanger5
a1a10b4073 feat: dynamic rate limits (#904)
* wip: step run expressions on rate limits

* feat: dynamic rate limits

* chore: v0.47.0

* chore: address changes from PR review

* fix: improved error handling

* address pr review

* better error messages for step run cels, remove debug logs

* fix: hash

---------

Co-authored-by: gabriel ruttner <gabriel.ruttner@gmail.com>
2024-09-26 22:00:34 +00:00
abelanger5
5f5e1e8a88 refactor: use shared tenant listener for messages (#911)
* refactor: use shared tenant listener per tenant exchange

* fix: remove subscription properly
2024-09-26 14:54:11 -04:00
Alexander Belanger
85f6d07ddf patch: handle nil result.Output 2024-09-24 19:34:20 -04:00
Gabe Ruttner
f98d3277b7 fix: trunc large payloads (#903)
* fix: trunc large payloads

* let's send the stepRuns and steps with output back on the WorkflowRunGet

* fix: times

* fix: rm unsafe

* rename to GetStepRunsForJobRunsWithOutput so we know we might potentially be getting a very large result set

---------

Co-authored-by: Sean Reilly <sean@hatchet.run>
2024-09-24 22:52:00 +00:00
abelanger5
9d69e4d192 fix: use read-only message queue (#897)
* fix: use read-only message queue

* set very high qos for read-heavy queue
2024-09-24 18:30:43 -04:00
Sean Reilly
5811929928 feat: bulk inserts of events (#887)
* progress commit of bulk inserts

* in_flight: Add changes to metering finish the bulk insert

* remove an attempt to override enforce limits

* merge in PR fixes

* update docs to add in an additional section in the User guide to describe pushing single events and pushing multiple events

* run lint fix

---------

Co-authored-by: Sean Reilly <sean@hatchet.run>
2024-09-23 09:19:39 -07:00
abelanger5
4936b3dce0 fix: use worker id properly on timeout (#901) 2024-09-23 08:30:31 -07:00
abelanger5
ad12f658da fix: have refresh timeout use timeout queue item (#898) 2024-09-23 05:41:06 -07:00