Commit Graph

257 Commits

Author SHA1 Message Date
abelanger5 fbbe02fa33 fix: revert previous migration for new build of 0.52.0 (#1072)
* fix: revert previous migration for new build of 0.52.0

* also remove identityId
2024-11-25 14:03:36 -05:00
Gabe Ruttner 574eb0b67e feat: dynamic crons (#1000)
* wip: stub schedule page

* wip: stub list

* fix: 2025 bug...

* feat: wip cron list

* feat: addl meta

* feat: expose metadata column

* feat: sort and created at

* cron to recurring

* scheduled: with statuses

* fix: links

* feat: expose schedule ids

* feat: delete run

* fix: remove search

* feat: filterable scheduled

* fix: remove broken features

* chore: lint

* rm metadata for now

* chore: lint

* chore: recurring to cron job

* fix: review comments

* fix: populator

* wip cron changes

* fix: ids are helpful

* fix: populator

* wip

* wip: create crons, stub scheduled

* wip: create schedule

* wip add trigger buttons to all the pages

* wip: reusable trigger form

* fix: hash

* fixes: cron bugs

* fixes: cron sort

* fix: out of order migrations

* fix: add internalRetryCount

* feat: api things survive version transitions

* feat: table things

* feat: delete disabled for non api

* feat: prevent delete non api

* feat: filters

* require cron name for api

* default name

* fix: migrations

* frontend improvements and migrations

* fix: pagination

---------

Co-authored-by: Alexander Belanger <alexander@hatchet.run>
2024-11-21 16:18:24 -05:00
Sean Reilly 31e425a858 lets make retry configurable and do not retry for unavailable because the retry is slower than regular heartbeat (#1046)
Co-authored-by: Sean Reilly <sean@hatchet.run>
2024-11-21 13:39:31 -05:00
abelanger5 197bdd1f88 feat: exponential backoff (#1062)
* initial migration

* feat: exp backoff, fix linting

* fix utc issue and cleanup
2024-11-21 13:39:02 -05:00
Sean Reilly 42afe083cf Partition Step Run and Remove Prisma (#982)
* add in the migration for now

* Update step_runs.sql

remove TODO

* change the schema so we don't undo it

* add the migration for step run partition. remove prisma. add a helper task for recreating the db

* do a manual merge of the schema.sql

* add in the serial

* update docs

* PR feedback

* add Identity to all tables that don't have a Bigserial

* do the atlas hash with the new migration

* squash the migrations

---------

Co-authored-by: Sean Reilly <sean@hatchet.run>
2024-11-20 15:20:36 -08:00
Sean Reilly b5de6e26ff Add a dynamic strategy for flushing as a function of currently flushing (#1055)
* add a dynamic strategy for flushing where we make the trigger for flush a function of the depth of the concurrency

* default value for tests and New for FlushStrategy

* clean up the currently flushing locking and add deadlock.Mutex

* don't wait as long for the buffer

* lets see if this 2ms thing is what is causing things to break

* lets error for this to see if we are actually hitting these limits

* put a really short deadline on the lock timeout to see if github actions will blow up

* let's use RW mutexes so we don't block as much

* lets extend this out to 100ms

* lets just do fewer locks

* add a lock to prevent a queue behind the semaphore

* deal with potential data races

* a simpler loop fib and now locks

* lets get rid of the wait for flush

* remove the deadlock stuff

* mod tidy

---------

Co-authored-by: Sean Reilly <sean@hatchet.run>
2024-11-20 19:49:30 +00:00
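
One possible reading of "make the trigger for flush a function of the depth of the concurrency" is that the buffer waits for more items before starting another flush when many flushes are already in flight. The sketch below assumes a Fibonacci-style growth, prompted only by the "simpler loop fib" note in the commit body. The name `flushThreshold` and both parameters are hypothetical, and the actual strategy in #1055 may differ.

```go
package main

import "fmt"

// flushThreshold returns how many buffered items are required before a new
// flush is triggered, given how many flushes are currently in flight.
// The threshold follows 1, 2, 3, 5, 8, ... and is capped at maxThreshold.
// This is an illustrative assumption, not the repository's implementation.
func flushThreshold(currentlyFlushing, maxThreshold int) int {
	a, b := 1, 2
	for i := 0; i < currentlyFlushing; i++ {
		// Stop growing once the cap is reached; this also avoids overflow.
		if a >= maxThreshold {
			return maxThreshold
		}
		a, b = b, a+b
	}
	if a > maxThreshold {
		return maxThreshold
	}
	return a
}

func main() {
	// With no flush in flight a single item triggers a flush; under load the
	// buffer batches more items per flush.
	for inFlight := 0; inFlight <= 6; inFlight++ {
		fmt.Println(inFlight, flushThreshold(inFlight, 100))
	}
}
```

The effect of such a strategy is that an idle system flushes with low latency, while a busy one trades latency for larger batches.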
abelanger5 ae5df5b88d fix: make race condition on reassignment more rare (#1052)
* fix: make race condition on reassignment more rare

* fix: proper concurrency on bulk dispatch

* prevent concurrent err assignments
2024-11-15 14:17:51 -05:00
abelanger5 c40b9154d8 fix: tenant race conditions, cleanup logic, old workers getting assigned (#1050) 2024-11-15 09:19:36 -05:00
Gabe Ruttner 4eaa9e7fd9 feat: configurable internal retry (#1049)
* feat: configurable internal retry

* fix: bump default to 3
2024-11-15 09:19:24 -05:00
Sean Reilly 9a5acc5179 modify the Event created at to be a clock_timestamp instead of a transaction timestamp so we maintain ordering of inserted events - also extend the length of the timestamp so we have enough significant bits (#1044)
* add the migration for the timestamp and clock

* regenerate

---------

Co-authored-by: Sean Reilly <sean@hatchet.run>
2024-11-14 11:15:45 -08:00
Gabe Ruttner 3850964a98 feat: initial doc pages (#1020)
* generate initial cloud client

* feat: initial doc pages

* feat: cloud register id, action filtering

* feat: cloud register

* fix: env var

* chore: lint

---------

Co-authored-by: Alexander Belanger <alexander@hatchet.run>
2024-11-08 07:46:43 -08:00
abelanger5 48aadc6ace fix: avoid panics in lease manager (#1029) 2024-11-07 16:07:01 -05:00
abelanger5 780496e7fb fix: prevent infinite reassign loop (#1028) 2024-11-07 17:28:12 +00:00
Gabe Ruttner c531c36870 fix: filter-cancel-cases (#1027)
* fix: filter-cancel-cases

* fix: case CANCELLED_BY_CONCURRENCY_LIMIT
2024-11-07 11:18:50 -05:00
Alexander Belanger 5b59af076e fix: cancellation status propagation and minimap view 2024-11-07 11:13:14 -05:00
Gabe Ruttner 3871df01ee fix: dont bump deleted (#1024) 2024-11-06 16:11:36 -05:00
Gabe Ruttner 5759311574 fix: ratelimit and invalid output blocking queue (#1023)
* fix: rm unused offending code, handle unacked

* fix: handle invalid outputs

* fix: dont reset failed

* fix: case on json err

* fix: completed step run ids

* fix: scope
2024-11-06 18:21:22 +00:00
abelanger5 71e01b3b5a fix: compute wording and add user callback (#1018)
* user callbacks and move location of managed workers

* rename pools to compute

* move managed workers to right fs location, remove prefix on /workers
2024-11-05 20:14:57 +00:00
abelanger5 9d133bc15c fix: catch all nack cases for rate limits (#1015)
* fix: properly nack rate limit when failing to schedule

* more nack cases
2024-11-05 11:37:47 -05:00
abelanger5 68bc5a0197 fix: unacked messages in the queuer (#1014)
* fix: when scheduling fails with schedule timeouts, we never ack the queue item

* add error line if we don't process everything we pass into the scheduler
2024-11-05 10:27:53 -05:00
abelanger5 75a89d00f0 use essential pool for dispatcher heartbeats too (#1007) 2024-11-01 08:55:54 -04:00
Gabe Ruttner abdd81c1eb fix: orderby (#1008) 2024-11-01 08:48:09 -04:00
Sean Reilly b456382429 add multiple rate limiter in grpc using a token bucket (#984)
* add multiple rate limiter in grpc using a token bucket

* PR feedback

* add in client retry for go client

* update test files

* remove log line only retry on ResourceExhausted and Unavailable

* add some concurrency limits so we don't swamp ourselves

* add some logging for when we are getting backed up

* lets not queue up when we are too full to prevent OOM problems

* fix spelling

* add config options for maximum concurrent and how long to wait for flush, let the wait for flush setting be used as back pressure and a signal to writers that we are slowing up

* lots of changes to buffering

* fix data race

* add some comments explaining how this works, change errors to be ResourceExhausted now that we have client retry and limit how many gofuncs we can create on cleanup and wait for them to finish before we exit

* hooking up the config values so they go to the right place

* Update config.go to default to 1 ms waitForFlush

* disable grpc_retry for client streams

* explicitly set the limit if it is 0

* weirdness because we were using an older version of the lib

---------

Co-authored-by: Sean Reilly <sean@hatchet.run>
Co-authored-by: Alexander Belanger <alexander@hatchet.run>
2024-11-01 11:48:23 +00:00
Gabe Ruttner 1003a1f5e7 fix: filter alert runs by failure only (#1001)
* fix: filter runs by failure only

* fix: post-lookup filter

* fix: filtered failures

---------

Co-authored-by: Alexander Belanger <alexander@hatchet.run>
2024-11-01 11:46:27 +00:00
Gabe Ruttner 44addbb47e Feat scheduled improvements (#992)
* wip: stub schedule page

* wip: stub list

* fix: 2025 bug...

* feat: wip cron list

* feat: addl meta

* feat: expose metadata column

* feat: sort and created at

* cron to recurring

* scheduled: with statuses

* fix: links

* feat: expose schedule ids

* feat: delete run

* fix: remove search

* feat: filterable scheduled

* fix: remove broken features

* chore: lint

* rm metadata for now

* chore: lint

* chore: recurring to cron job

* fix: review comments

* fix: populator
2024-11-01 07:16:20 -04:00
Sean Reilly 7d5b41b082 add an essential pool for heartbeats (#1003)
* add an essential pool for heartbeats

* add some telemetry spans to heartbeat and capture any errors

---------

Co-authored-by: Sean Reilly <sean@hatchet.run>
2024-11-01 07:09:45 -04:00
Sean Reilly ea682f5c6b Feat concurrency limit for flush (#991)
* add some concurrency limits so we don't swamp ourselves

* lets not queue up when we are too full to prevent OOM problems

* add config options for maximum concurrent and how long to wait for flush, let the wait for flush setting be used as back pressure and a signal to writers that we are slowing up


---------

Co-authored-by: Sean Reilly <sean@hatchet.run>
2024-10-31 09:43:21 -07:00
abelanger5 a9936ef687 fix: set otel insecure flag for all telemetry instantiations (#999) 2024-10-30 17:34:36 -04:00
abelanger5 6158aa2a4c feat: docs for performance (#997)
* feat: docs for performance

* wrap up perf doc

* address review comments
2024-10-29 18:29:03 -04:00
Gabe Ruttner 4932e7f863 Feat sdk runtime (#942)
* feat: runtime signature

* feat: add sdk runtime to worker model

* feat: post runtime

* feat: expose sdk version on worker

* feat: go inf

* chore: gen

* chore: migrations and generation

* fix: simpler runtime

* feat: hatchet sdk ver

* fix: rm debug line
2024-10-28 13:47:12 -07:00
abelanger5 3e0f15c0d8 fix: divide by zero panic (#995)
* fix: divide by zero panic

* fix: add continue
2024-10-25 19:57:55 -04:00
Sean Reilly 9f4b63817d add a serial write for step run events (#990)
* add a serial write for step run events

* update other problematic queries

* tmp: don't upsert queue

* add SerialBuffer to the config

* revert the change to config

* fix: add back queue upsert

* add statement timeout to upsert queue

---------

Co-authored-by: Sean Reilly <sean@hatchet.run>
Co-authored-by: Alexander Belanger <alexander@hatchet.run>
2024-10-25 16:56:38 +00:00
abelanger5 509542b804 fix: duplicate assignments in queuer (#993)
* wip: individual mutexes for actions

* tmp: debug panic

* remove debug code

* remove deadlocks package and don't write unassigned events

* fix: race condition in scheduler and add internal retries

* fix: data race
2024-10-25 16:52:43 +00:00
abelanger5 718d8f59c9 fix: rewrite queries for checking child workflows (#983)
* rewrite queries for child workflows

* add index

* fix: remove tenant id where it's not needed
2024-10-23 19:18:26 -04:00
abelanger5 dd5bc90497 fix: more efficient step run events, reduce caching on queue (#981) 2024-10-23 16:23:59 -04:00
Sean Reilly 35b115cb4f don't need to filter on tenant id for step runs & some debug for buffers (#980)
Co-authored-by: Sean Reilly <sean@hatchet.run>
2024-10-23 15:04:11 -04:00
abelanger5 2cdee59aea refactor: optimize v0.50.0 release (#975)
- Simplifies architecture for splitting engine services into different components. The three supported services are now `grpc-api`, `scheduler`, and `controllers`. The `grpc-api` service is the only one which needs to be exposed for workers. The other two can run as unexposed services.
- Fixes a set of bugs and race conditions in the `v2` scheduler
- Adds a `lastActive` time to the `Queue` table and includes a migration which sets this `lastActive` time for the most recent 24 hours of queues. Effectively this means that the max scheduling time in a queue is 24 hours. 
- Rewrites the `ListWorkflowsForEvent` query to improve performance and select far fewer rows.
2024-10-23 12:05:16 +00:00
abelanger5 7b701ed209 fix: proper deletion of tenants from the scheduling pool (#974)
* fix: proper deletion of tenants from the scheduling pool

* adds some assignment spans

* feat: caching for rankings

* remove cache
2024-10-17 15:47:15 -04:00
Sean Reilly ecb9ce1e1e rejig the query for creating multiple sticky states (#973)
* rejig the query for creating multiple sticky states

* fix: sticky strategy of soft and improve query

* fix: sort method was using indexes that didn't necessarily correspond to original indexes, leading to inconsistent behavior

---------

Co-authored-by: Sean Reilly <sean@hatchet.run>
Co-authored-by: Alexander Belanger <alexander@hatchet.run>
2024-10-17 13:29:19 +00:00
abelanger5 17dc80cad8 fix: don't append invalid slots with a hard sticky strategy (#972) 2024-10-16 20:21:39 +00:00
abelanger5 c86a50711b fix: don't reset input for concurrency keys on replay (#970) 2024-10-16 15:55:28 -04:00
abelanger5 e4af494f69 fix: add slot expiry and delete actions from scheduler properly (#969)
* fix: add back slot expiry

* fix: remove action if all slots are inactive
2024-10-16 15:55:18 -04:00
abelanger5 cb39c938b3 fix: ack rate limits properly (#968) 2024-10-16 13:32:10 -04:00
Sean Reilly 7e526de381 fix: deadlocks on events and incorrect step run ordering query (#966)
* make it so the bulk example succeeds

* make the bulk workflows work a little harder

* add some ordering to mitigate deadlocks

* fix: link step run parents bad query, improvements to locking

* add timed mutex and telemetry

* remove for update on cancel

---------

Co-authored-by: Sean Reilly <sean@hatchet.run>
Co-authored-by: Alexander Belanger <alexander@hatchet.run>
2024-10-16 10:28:33 -04:00
Gabe Ruttner 7cd08077d5 feat: improved sdk ack (#931)
* feat: add step run event reasons

* feat: ack

* fix: remove rejected reason

* fix: merge

* fix: correct buffer

* fix: consistent message

* chore: rm todo
2024-10-15 15:52:42 +00:00
abelanger5 19e151e29a fix: RunWorkflow and SpawnWorkflow should respond with consistent APIs (#965) 2024-10-15 11:09:58 -04:00
abelanger5 67a96d7166 feat(throughput): single process per queue (#956)
* feat(throughput): single process per queue

* fix data race

* fix: golint and data race on load test

* wrap up initial v2 scheduler

* fix: more debug logs and tighten channel logic/blocking sends

* improved casing on dispatcher and lease manager

* fix: data race on min id

* increase wait on load test, fix data race

* fix: trylock -> lock

* clean up queue when no longer in set

* fix: clean up cache on exit

* ensure cleanup is only called once

* address review comments
2024-10-15 11:05:19 -04:00
Sean Reilly 29721cd1f0 Feat bulk workflows (#940)
Adds support for inserting workflows in bulk via the API and an optional buffered insert on the engine.
2024-10-14 15:35:29 -04:00
Gabe Ruttner c8711f7f83 fix: id constraint (#957)
* fix: id constraint

* chore: gen
2024-10-11 18:00:12 -04:00
Gabe Ruttner 6af75638f2 feat: add helpful context to alert email (#954) 2024-10-11 09:53:28 -04:00