337 Commits

Author SHA1 Message Date
abelanger5 61ae067014 fix: race condition on err in pgmq (#1198) 2025-01-18 16:20:24 +00:00
Matt Kaye 9efd56c7de Feat: Propagate Error Through Context (#1193)
* feat: add query to fetch upstream errors from db

* fix: return many

* feat: propagate errors through `input`

* fix: implement the method to get the errors out

* fix: query cleanup

* feat: rename errors

* fix: col names

* fix: key name in the json

* feat: add method to context to get failed step errors

* fix: add 👀

Co-authored-by: abelanger5 <belanger@sas.upenn.edu>

* feat: add error log if not errors

* fix: logger

* fix: simplify query

---------

Co-authored-by: abelanger5 <belanger@sas.upenn.edu>
2025-01-17 21:49:13 -05:00
Sean Reilly a8dd33c61f Feature - configurable logging backend (#1188)
* allow us to configure different repos

* make the struct contents public

* pass in config values to new log repo

* rename functions - possibly breaking changes so let's discuss

* make the logging backend configurable

* fix tests

* don't allow calls to WithAdditionalConfig

* cleanup

* replace sc with server

Co-authored-by: abelanger5 <belanger@sas.upenn.edu>

* rename sc to server

* add a LRU cache for the step run lookup

* let's not use an expirable cache and just use the regular one - we cannot close the go func in expirable

---------

Co-authored-by: abelanger5 <belanger@sas.upenn.edu>
2025-01-17 15:34:10 -08:00
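
For readers unfamiliar with the "LRU cache for the step run lookup" mentioned in a8dd33c61f, the idea can be sketched with the Go standard library alone. This is a minimal illustration, not the code or library used in that commit: the `lruCache` type, its string keys and values, and the capacity of 2 are all assumptions made for the example. It has no expiry, which mirrors the "regular" (non-expirable) choice described in the commit message, and it is not safe for concurrent use.

```go
package main

import (
	"container/list"
	"fmt"
)

// entry is what each list element stores (hypothetical type).
type entry struct {
	key   string
	value string
}

// lruCache is a hypothetical, non-thread-safe LRU cache without expiry.
type lruCache struct {
	capacity int
	order    *list.List               // front = most recently used
	items    map[string]*list.Element // key -> element in order
}

func newLRUCache(capacity int) *lruCache {
	return &lruCache{
		capacity: capacity,
		order:    list.New(),
		items:    make(map[string]*list.Element),
	}
}

// get returns the cached value and marks the key as recently used.
func (c *lruCache) get(key string) (string, bool) {
	el, ok := c.items[key]
	if !ok {
		return "", false
	}
	c.order.MoveToFront(el)
	return el.Value.(*entry).value, true
}

// put inserts or updates a key, evicting the least recently used
// entry when the cache grows past its capacity.
func (c *lruCache) put(key, value string) {
	if el, ok := c.items[key]; ok {
		el.Value.(*entry).value = value
		c.order.MoveToFront(el)
		return
	}
	c.items[key] = c.order.PushFront(&entry{key: key, value: value})
	if c.order.Len() > c.capacity {
		oldest := c.order.Back()
		c.order.Remove(oldest)
		delete(c.items, oldest.Value.(*entry).key)
	}
}

func main() {
	cache := newLRUCache(2)
	cache.put("step-run-a", "tenant-1")
	cache.put("step-run-b", "tenant-2")
	cache.get("step-run-a")             // a is now the most recently used
	cache.put("step-run-c", "tenant-3") // evicts b, the least recently used
	_, ok := cache.get("step-run-b")
	fmt.Println("b still cached:", ok)
}
```

A production cache shared between goroutines would additionally need a mutex around `get` and `put`.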
Gabe Ruttner 49000e5c65 Fix webhook stop healthcheck (#1163)
* fix: concurrent map writes

* fix: cancel healthcheck on move

* fix: cancel healthcheck on move

* revert: remove unneeded check
2025-01-08 09:42:58 -05:00
Gabe Ruttner e92146816f fix: webhook workers on rebalance (#1162)
* fix: log ui

* fix: partition handling and unregister

* fix: concurrent cleanup

* feat: op pool

* fix: run or continue partition id

* fix: return false out of check
2025-01-07 10:54:15 -08:00
Sean Reilly 9e961ac196 Feature add version info (#1154)
* adding a /version endpoint for the engine and a /api/v1/version endpoint for the API

* make the security optional so we don't get redirected for having auth

* lint

* upgrade protoc to the latest available version on brew

* use useQuery and clean up html
2025-01-06 10:50:04 -08:00
abelanger5 a237f90450 fix: circuit breaker for dispatcher reassignment (#1144) 2024-12-20 16:00:23 -05:00
abelanger5 b383ae8047 Improve handling of result size in dispatcher (#1133)
* Improve handling of result size in dispatcher

* small if case

* 3MB as var
2024-12-18 16:56:07 -05:00
abelanger5 23dc410552 fix: make retries with exp backoff atomic, and fix issues related to cancelling states (#1132)
* fix: exp backoff retries and cancelling states

* fix flaky concurrency test
2024-12-18 19:32:08 +00:00
abelanger5 dcb67a1dac feat: postgres-backed message queue (#1119) 2024-12-18 09:00:54 -05:00
abelanger5 c696263d20 fix: don't cancel context on failed sends (#1129) 2024-12-18 02:02:58 +00:00
abelanger5 e12e700980 feat: CANCEL_NEWEST strategy and make cancel in progress more reliable (#1127) 2024-12-18 01:40:14 +00:00
Sean Reilly cbc2526c0b add a monitoring probe (#1108)
* add a monitoring probe

---------

Co-authored-by: Sean Reilly <sean@hatchet.run>
2024-12-17 15:55:50 -05:00
Sean Reilly 9943452490 Make round robin enqueueing atomic (#1085) 2024-12-17 15:18:20 -05:00
Sean Reilly e32f353587 Speed up the delete worker query (#1103)
* add an index on lastHeartbeatAt and don't do highly related actions concurrently



---------

Co-authored-by: Sean Reilly <sean@hatchet.run>
2024-12-12 20:49:22 -05:00
abelanger5 94d14336aa feat(go-sdk): blocking worker (#1106) 2024-12-12 20:42:13 -05:00
abelanger5 4c74a62183 refactor(repository): improve usability of repository (#1114)
* refactor(repository): consolidate repository buffers, create pattern for callbacks, consolidate queries

* fix: spelling

* fix: clean up cache
2024-12-11 18:45:02 -05:00
Gabe Ruttner 44ffe1d66c fix: panic (#1105) 2024-12-09 15:50:36 +00:00
abelanger5 1499668df9 fix: duplicate cron expressions only cause a single trigger (#1101) 2024-12-06 16:02:37 -05:00
abelanger5 92a96beaf5 fix: latency issues on queueing caused by race condition (#1078)
* fix: remove todo

* fix: race condition on queue inserts causing high latency, improved telemetry
2024-12-02 13:52:33 -05:00
Gabe Ruttner 574eb0b67e feat: dynamic crons (#1000)
* wip: stub schedule page

* wip: stub list

* fix: 2025 bug...

* feat: wip cron list

* feat: addl meta

* feat: expose metadata column

* feat: sort and created at

* cron to recurring

* scheduled: with statuses

* fix: links

* feat: expose schedule ids

* feat: delete run

* fix: remove search

* feat: filterable scheduled

* fix: remove broken features

* chore: lint

* rm metadata for now

* chore: lint

* chore: recurring to cron job

* fix: review comments

* fix: populator

* wip cron changes

* fix: ids are helpful

* fix: populator

* wip

* wip: create crons, stub scheduled

* wip: create schedule

* wip add trigger buttons to all the pages

* wip: reusable trigger form

* fix: hash

* fixes: cron bugs

* fixes: cron sort

* fix: out of order migrations

* fix: add internalRetryCount

* feat: api things survive version transitions

* feat: table things

* feat: delete disabled for non api

* feat: prevent delete non api

* feat: filters

* require cron name for api

* default name

* fix: migrations

* frontend improvements and migrations

* fix: pagination

---------

Co-authored-by: Alexander Belanger <alexander@hatchet.run>
2024-11-21 16:18:24 -05:00
abelanger5 197bdd1f88 feat: exponential backoff (#1062)
* initial migration

* feat: exp backoff, fix linting

* fix utc issue and cleanup
2024-11-21 13:39:02 -05:00
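
As background for the exponential backoff feature in 197bdd1f88, the delay schedule such a retry policy typically produces can be sketched as follows. This is an illustration under assumptions, not the implementation from that commit: the function name `backoffDelay`, the growth factor, and the cap are all hypothetical parameters chosen for the example.

```go
package main

import (
	"fmt"
	"time"
)

// backoffDelay returns the wait before retry number attempt (0-based),
// computed as base * factor^attempt and capped at max.
// All names and parameters here are hypothetical.
func backoffDelay(attempt int, base time.Duration, factor float64, max time.Duration) time.Duration {
	delay := float64(base)
	for i := 0; i < attempt; i++ {
		delay *= factor
		// Stop early once the cap is reached so large attempt
		// counts cannot overflow.
		if delay >= float64(max) {
			return max
		}
	}
	if delay > float64(max) {
		return max
	}
	return time.Duration(delay)
}

func main() {
	for attempt := 0; attempt < 6; attempt++ {
		fmt.Println(attempt, backoffDelay(attempt, time.Second, 2, 10*time.Second))
	}
}
```

With a 1s base, a factor of 2, and a 10s cap, the delays grow as 1s, 2s, 4s, 8s and then stay at 10s. Real retry policies often add random jitter on top so that many failed runs do not retry at the same instant.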
abelanger5 ae5df5b88d fix: make race condition on reassignment more rare (#1052)
* fix: make race condition on reassignment more rare

* fix: proper concurrency on bulk dispatch

* prevent concurrent err assignments
2024-11-15 14:17:51 -05:00
abelanger5 faff6001a8 fix: propagate schedule timeouts to children (#1051) 2024-11-15 10:07:33 -05:00
Sean Reilly d7d80393c3 add some logging so it is easier to see what grpc rate limits are set (#1045)
Co-authored-by: Sean Reilly <sean@hatchet.run>
2024-11-15 09:20:15 -05:00
abelanger5 c40b9154d8 fix: tenant race conditions, cleanup logic, old workers getting assigned (#1050) 2024-11-15 09:19:36 -05:00
abelanger5 780496e7fb fix: prevent infinite reassign loop (#1028) 2024-11-07 17:28:12 +00:00
Gabe Ruttner c531c36870 fix: filter-cancel-cases (#1027)
* fix: filter-cancel-cases

* fix: case CANCELLED_BY_CONCURRENCY_LIMIT
2024-11-07 11:18:50 -05:00
Alexander Belanger 5b59af076e fix: cancellation status propagation and minimap view 2024-11-07 11:13:14 -05:00
Gabe Ruttner c227960453 fix: drop e in Requeuing (#1013) 2024-11-04 16:30:38 -05:00
Sean Reilly b456382429 add multiple rate limiter in grpc using a token bucket (#984)
* add multiple rate limiter in grpc using a token bucket

* PR feedback

* add in client retry for go client

* update test files

* remove log line; only retry on ResourceExhausted and Unavailable

* add some concurrency limits so we don't swamp ourselves

* add some logging for when we are getting backed up

* let's not queue up when we are too full to prevent OOM problems

* fix spelling

* add config options for maximum concurrent and how long to wait for flush, let the wait for flush setting be used as back pressure and a signal to writers that we are slowing up

* lots of changes to buffering

* fix data race

* add some comments explaining how this works, change errors to be ResourceExhausted now that we have client retry and limit how many gofuncs we can create on cleanup and wait for them to finish before we exit

* hooking up the config values so they go to the right place

* Update config.go to default to 1 ms waitForFlush

* disable grpc_retry for client streams

* explicitly set the limit if it is 0

* weirdness because we were using an older version of the lib

---------

Co-authored-by: Sean Reilly <sean@hatchet.run>
Co-authored-by: Alexander Belanger <alexander@hatchet.run>
2024-11-01 11:48:23 +00:00
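
The token bucket technique named in b456382429 can be pictured with the short sketch below. It is a generic illustration, not the limiter added in that commit: the `tokenBucket` type, the use of plain seconds as timestamps (so the example is deterministic), and the rate and burst values are all assumptions. In a gRPC server, a rejected call would typically be reported to the client with the ResourceExhausted status that the commit message mentions.

```go
package main

import (
	"fmt"
	"sync"
)

// tokenBucket is a hypothetical token bucket limiter. Tokens refill
// continuously at rate per second up to capacity; each allowed call
// spends one token.
type tokenBucket struct {
	mu       sync.Mutex
	capacity float64 // maximum burst size
	tokens   float64 // tokens currently available
	rate     float64 // tokens added per second
	last     float64 // time of the last refill, in seconds
}

func newTokenBucket(rate, capacity, now float64) *tokenBucket {
	return &tokenBucket{capacity: capacity, tokens: capacity, rate: rate, last: now}
}

// allow reports whether a call arriving at time now (in seconds)
// may proceed, refilling the bucket for the elapsed time first.
func (b *tokenBucket) allow(now float64) bool {
	b.mu.Lock()
	defer b.mu.Unlock()

	if elapsed := now - b.last; elapsed > 0 {
		b.tokens += elapsed * b.rate
		if b.tokens > b.capacity {
			b.tokens = b.capacity
		}
		b.last = now
	}
	if b.tokens >= 1 {
		b.tokens--
		return true
	}
	return false
}

func main() {
	// 1 token per second with a burst of 2.
	bucket := newTokenBucket(1, 2, 0)
	for _, t := range []float64{0, 0, 0, 1, 1} {
		fmt.Printf("t=%.0fs allowed=%v\n", t, bucket.allow(t))
	}
}
```

Here the first two calls at t=0 spend the initial burst, the third is rejected, and one more call is admitted after a second has passed. Running several such buckets side by side (for example one per tenant or per method) is one way to read "multiple rate limiter", though the commit log does not say how the limiters are keyed.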
Gabe Ruttner 1003a1f5e7 fix: filter alert runs by failure only (#1001)
* fix: filter runs by failure only

* fix: post-lookup filter

* fix: filtered failures

---------

Co-authored-by: Alexander Belanger <alexander@hatchet.run>
2024-11-01 11:46:27 +00:00
Gabe Ruttner 44addbb47e Feat scheduled improvements (#992)
* wip: stub schedule page

* wip: stub list

* fix: 2025 bug...

* feat: wip cron list

* feat: addl meta

* feat: expose metadata column

* feat: sort and created at

* cron to recurring

* scheduled: with statuses

* fix: links

* feat: expose schedule ids

* feat: delete run

* fix: remove search

* feat: filterable scheduled

* fix: remove broken features

* chore: lint

* rm metadata for now

* chore: lint

* chore: recurring to cron job

* fix: review comments

* fix: populator
2024-11-01 07:16:20 -04:00
Sean Reilly 7d5b41b082 add an essential pool for heartbeats (#1003)
* add an essential pool for heartbeats

* add some telemetry spans to heartbeat and capture any errors

---------

Co-authored-by: Sean Reilly <sean@hatchet.run>
2024-11-01 07:09:45 -04:00
Gabe Ruttner 4932e7f863 Feat sdk runtime (#942)
* feat: runtime signature

* feat: add sdk runtime to worker model

* feat: post runtime

* feat: expose sdk version on worker

* feat: go inf

* chore: gen

* chore: migrations and generation

* fix: simpler runtime

* feat: hatchet sdk ver

* fix: rm debug line
2024-10-28 13:47:12 -07:00
abelanger5 509542b804 fix: duplicate assignments in queuer (#993)
* wip: individual mutexes for actions

* tmp: debug panic

* remove debug code

* remove deadlocks package and don't write unassigned events

* fix: race condition in scheduler and add internal retries

* fix: data race
2024-10-25 16:52:43 +00:00
abelanger5 dd5bc90497 fix: more efficient step run events, reduce caching on queue (#981) 2024-10-23 16:23:59 -04:00
abelanger5 2cdee59aea refactor: optimize v0.50.0 release (#975)
- Simplifies architecture for splitting engine services into different components. The three supported services are now `grpc-api`, `scheduler`, and `controllers`. The `grpc-api` service is the only one which needs to be exposed for workers. The other two can run as unexposed services.
- Fixes a set of bugs and race conditions in the `v2` scheduler
- Adds a `lastActive` time to the `Queue` table and includes a migration which sets this `lastActive` time for the most recent 24 hours of queues. Effectively this means that the max scheduling time in a queue is 24 hours. 
- Rewrites the `ListWorkflowsForEvent` query to improve performance and select far fewer rows.
2024-10-23 12:05:16 +00:00
Gabe Ruttner 7cd08077d5 feat: improved sdk ack (#931)
* feat: add step run event reasons

* feat: ack

* fix: remove rejected reason

* fix: merge

* fix: correct buffer

* fix: consistent message

* chore: rm todo
2024-10-15 15:52:42 +00:00
abelanger5 67a96d7166 feat(throughput): single process per queue (#956)
* feat(throughput): single process per queue

* fix data race

* fix: golint and data race on load test

* wrap up initial v2 scheduler

* fix: more debug logs and tighten channel logic/blocking sends

* improved casing on dispatcher and lease manager

* fix: data race on min id

* increase wait on load test, fix data race

* fix: trylock -> lock

* clean up queue when no longer in set

* fix: clean up cache on exit

* ensure cleanup is only called once

* address review comments
2024-10-15 11:05:19 -04:00
Sean Reilly 29721cd1f0 Feat bulk workflows (#940)
Adds support for inserting workflows in bulk via the API and an optional buffered insert on the engine.
2024-10-14 15:35:29 -04:00
Gabe Ruttner 2519b71b9e fix: concurrency (#962)
* fix: concurrency

* fix: counts

* fix: try lock
2024-10-14 18:44:53 +00:00
Gabe Ruttner 6af75638f2 feat: add helpful context to alert email (#954) 2024-10-11 09:53:28 -04:00
abelanger5 95558138a4 chore: improve throughput, remove deadlocks (#949)
* add otel to pub

* temporarily remove tenant id exchange

* fix: increase internal queue throughput

* fix: remove potential deadlocking

* rollback hash factor multiplier

* fix: batch update issues

* fix: rm unneeded locks

* move disable tenant pubsub to an env var

---------

Co-authored-by: gabriel ruttner <gabriel.ruttner@gmail.com>
2024-10-10 08:54:34 -04:00
abelanger5 3d218302ff fix: internal queue items performance and race conditions (#943)
* fix: don't use xmin hack

* fix: assign not append

* refactor: parallel step run updates via hashes

* fix: intermittent double execution of child step runs

* fix: rollback rate limits

* fix: bulk event writes from single buffer

* expose cleanup

* fix: race conditions on failures and cancellations

* change logger defaults to warn and console
2024-10-07 11:16:53 -04:00
abelanger5 fd4ee804d3 refactor: buffered writes of step run statuses (#941)
* (wip) handle step run updates without deferred updates

* refactor: buffered writes of step run statuses

* fix: add more safety on tenant pools

* add configurable flush period, remove wait for started

* flush immediately if last flush time plus flush period is in the past

* feat: add configurable flush interval/max items
2024-10-04 15:08:21 -04:00
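
The flush rules listed in fd4ee804d3 (a configurable flush period, a maximum item count, and "flush immediately if last flush time plus flush period is in the past") can be sketched as a small generic buffer. This is an illustration under assumptions, not the buffer from the repository: the `buffer` type, its field names, and the use of integer millisecond timestamps (to keep the example deterministic) are hypothetical.

```go
package main

import "fmt"

// buffer is a hypothetical write buffer that hands items to flush in
// batches. It is not safe for concurrent use.
type buffer[T any] struct {
	items       []T
	maxItems    int             // flush when this many items are queued
	flushPeriod int64           // milliseconds allowed between flushes
	lastFlush   int64           // timestamp of the last flush, in ms
	flush       func(batch []T) // receives each batch, e.g. a bulk insert
}

// add queues an item and flushes if the buffer is full or if
// lastFlush + flushPeriod is already in the past.
func (b *buffer[T]) add(item T, nowMs int64) {
	b.items = append(b.items, item)
	if len(b.items) >= b.maxItems || b.lastFlush+b.flushPeriod < nowMs {
		b.flushNow(nowMs)
	}
}

// flushNow hands the queued items to flush and resets the buffer.
func (b *buffer[T]) flushNow(nowMs int64) {
	if len(b.items) == 0 {
		return
	}
	batch := b.items
	b.items = nil
	b.lastFlush = nowMs
	b.flush(batch)
}

func main() {
	b := &buffer[string]{
		maxItems:    3,
		flushPeriod: 100,
		flush: func(batch []string) {
			fmt.Println("flushing", len(batch), "items:", batch)
		},
	}
	b.add("a", 10)
	b.add("b", 20)
	b.add("c", 30)  // third item reaches maxItems and triggers a flush
	b.add("d", 50)  // within the flush period, stays buffered
	b.add("e", 200) // flush period has elapsed, flushes d and e
}
```

A long-running service would also need a background timer so that items left in an idle buffer are still flushed once the period elapses, plus locking if several goroutines write to the same buffer.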
Sean Reilly 27736fa30f bulk insert buffering (#913)
Adds bulk inserts to event writes, and adds a generic buffer which can be used by future batch implementations.
2024-10-03 16:26:12 -04:00
Gabe Ruttner f5add0d15c fix: write duration (#936) 2024-10-03 09:37:53 -04:00
abelanger5 8939c94f63 fix: send fewer messages to job queue when it's not necessary (#932)
* handle started at differently

* fix: start job runs in workflows controller

* fix: keep job runs around for backwards compat
2024-10-03 07:39:06 -04:00
abelanger5 b4c861d7a1 patch: release semaphore slots before jobs controller (#927)
* fix: don't need acks on queue checks

* patch: release semaphores early

* proper list on high queue depth

* fix: don't release on started
2024-10-02 11:36:05 -04:00