55 Commits

Author SHA1 Message Date
abelanger5 f25c408d5c fix: reassignments consistent with v0 behavior (#1360) 2025-03-18 09:17:31 -04:00
Gabe Ruttner 3670b94fc4 Feat v1 UI tweaks (#1344)
* fix: drop uncached loader

* feat: upgrade modal

* add beta

* hacky feature flag

* fix: build

* refetch interval

* 5s

* stop flashing on load

* lint

* fix: map

* fix: last redir

* nil check

* small styling and wording things, change default canUpgrade -> true

* switch link to github discussion

---------

Co-authored-by: Alexander Belanger <alexander@hatchet.run>
2025-03-15 09:23:32 -04:00
abelanger5 afd853e223 v1 hotfixes (#1320)
* fix: when grpcInsecure is set to true with no internal client overrides, use TLS strategy=none

* fix: invites
2025-03-11 16:18:07 -04:00
abelanger5 1f2096313d feat: v1 engine (#1318) 2025-03-11 14:57:13 -04:00
abelanger5 769fed7d97 feat(go-sdk): adds preset labels on workers for autoscaling (#1195)
* feat(go-sdk): adds preset labels on workers for autoscaling

* fix: env var consistency

---------

Co-authored-by: gabriel ruttner <gabriel.ruttner@gmail.com>
2025-01-28 14:58:41 +00:00
Matt Kaye fc9ff0eb05 Feat: Sample Sentries in the Engine (#1209)
* feat: sample sentries in the engine

* set sample rate via env var

* fix: propagate sample rate through config

* fix: bind env
2025-01-23 17:41:41 -05:00
Sean Reilly a8dd33c61f Feature - configurable logging backend (#1188)
* allow us to configure different repos

* make the struct contents public

* pass in config values to new log repo

* rename functions - possibly breaking changes so lets discuss

* make the logging backend configurable

* fix tests

* don't allow calls to WithAdditionalConfig

* cleanup

* replace sc with server

Co-authored-by: abelanger5 <belanger@sas.upenn.edu>

* rename sc to server

* add a LRU cache for the step run lookup

* lets not use an expirable cache and just use the regular one - we cannot close the go func in expirable

---------

Co-authored-by: abelanger5 <belanger@sas.upenn.edu>
2025-01-17 15:34:10 -08:00
Sean Reilly c2248c08ab Fix security headers and emails (#1181)
* add a bunch of default headers

* add a check on the emails so we don't resend if we have a valid invite in future

* lets people invite for a new role

* add in some logging so we have more visibility on what is happening here

* Add a limit to the number of pending invites a user can have. Add comments for the various headers
2025-01-17 15:06:26 -08:00
Sean Reilly 9e961ac196 Feature add version info (#1154)
* adding a /version endpoint for the engine and a /api/v1/version endpoint for the API

* make the security optional so we don't get redirected for having auth

* lint

* upgrade protoc to the latest available version on brew

* use useQuery and clean up html
2025-01-06 10:50:04 -08:00
abelanger5 dcb67a1dac feat: postgres-backed message queue (#1119) 2024-12-18 09:00:54 -05:00
Sean Reilly cbc2526c0b add a monitoring probe (#1108)
* add a monitoring probe

---------

Co-authored-by: Sean Reilly <sean@hatchet.run>
2024-12-17 15:55:50 -05:00
abelanger5 4c74a62183 refactor(repository): improve usability of repository (#1114)
* refactor(repository): consolidate repository buffers, create pattern for callbacks, consolidate queries

* fix: spelling

* fix: clean up cache
2024-12-11 18:45:02 -05:00
abelanger5 0b2b12d851 docs: high availability and docs on HA helm chart (#1074)
* docs: high availability and docs on HA helm chart

* fix: linting

* ignore manifests for typos
2024-11-25 15:17:20 -08:00
Sean Reilly 31e425a858 lets make retry configurable and do not retry for unavailable because the retry is slower than regular heartbeat (#1046)
Co-authored-by: Sean Reilly <sean@hatchet.run>
2024-11-21 13:39:31 -05:00
abelanger5 197bdd1f88 feat: exponential backoff (#1062)
* initial migration

* feat: exp backoff, fix linting

* fix utc issue and cleanup
2024-11-21 13:39:02 -05:00
Sean Reilly b5de6e26ff Add a dynamic strategy for flushing as a function of currently flushing (#1055)
* add a dynamic strategy for flushing where we make the trigger for flush a function of the depth of the concurrency

* default value for tests and New for FlushStrategy

* clean up the currently flushing locking and add deadlock.Mutex

* don't wait as long for the buffer

* lets see if this 2ms thing is what is causing things to break

* lets error for this to see if we are actually hitting these limits

* put a really short deadline on the lock timeout to see if github actions will blow up

* lets use RW mutexes so we don't block as much

* lets extend this out to 100ms

* lets just do fewer locks

* add a lock to prevent a queue behind the semaphore

* deal with potential data races

* a simpler loop fib and now locks

* lets get rid of the wait for flush

* remove the deadlock stuff

* mod tidy

---------

Co-authored-by: Sean Reilly <sean@hatchet.run>
2024-11-20 19:49:30 +00:00
Gabe Ruttner 4eaa9e7fd9 feat: configurable internal retry (#1049)
* feat: configurable internal retry

* fix: bump default to 3
2024-11-15 09:19:24 -05:00
Gabe Ruttner 3850964a98 feat: initial doc pages (#1020)
* generate initial cloud client

* feat: initial doc pages

* feat: cloud register id, action filtering

* feat: cloud register

* fix: env var

* chore: lint

---------

Co-authored-by: Alexander Belanger <alexander@hatchet.run>
2024-11-08 07:46:43 -08:00
Sean Reilly b456382429 add multiple rate limiter in grpc using a token bucket (#984)
* add multiple rate limiter in grpc using a token bucket

* PR feedback

* add in client retry for go client

* update test files

* remove log line only retry on ResourceExhausted and Unavailable

* add some concurrency limits so we don't swamp ourselves

* add some logging for when we are getting backed up

* lets not queue up when we are too full to prevent OOM problems

* fix spelling

* add config options for maximum concurrent and how long to wait for flush, let the wait for flush setting be used as back pressure and a signal to writers that we are slowing up

* lots of changes to buffering

* fix data race

* add some comments explaining how this works, change errors to be ResourceExhausted now that we have client retry and limit how many gofuncs we can create on cleanup and wait for them to finish before we exit

* hooking up the config values so they go to the right place

* Update config.go to default to 1 ms waitForFlush

* disable grpc_retry for client streams

* explicitly set the limit if it is 0

* weirdness because we were using an older version of the lib

---------

Co-authored-by: Sean Reilly <sean@hatchet.run>
Co-authored-by: Alexander Belanger <alexander@hatchet.run>
2024-11-01 11:48:23 +00:00
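
The commit above names a token bucket as its rate-limiting technique but does not show the implementation. As a minimal sketch of that technique only (the class `TokenBucket` and the parameters `rate`, `capacity`, and `clock` are hypothetical names chosen for illustration, not taken from the repository), a bucket refills at a fixed rate up to a cap, and a request is rejected when too few tokens remain:

```python
import time


class TokenBucket:
    """Illustrative token bucket: holds up to `capacity` tokens, refilled at `rate` tokens/second."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate            # tokens added per second (assumed parameter)
        self.capacity = capacity    # maximum burst size (assumed parameter)
        self.clock = clock          # injectable clock so the sketch can be tested
        self.tokens = float(capacity)
        self.last = clock()

    def allow(self, cost=1.0):
        now = self.clock()
        # Refill for the time elapsed since the last call, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        # A server in this position would typically reject the call,
        # for example with a ResourceExhausted status the client can retry on.
        return False
```

With `rate=2` and `capacity=2`, two calls succeed immediately, a third is rejected, and one more token becomes available after half a second.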
Sean Reilly 7d5b41b082 add an essential pool for heartbeats (#1003)
* add an essential pool for heartbeats

* add some telemetry spans to heartbeat and capture any errors

---------

Co-authored-by: Sean Reilly <sean@hatchet.run>
2024-11-01 07:09:45 -04:00
Sean Reilly ea682f5c6b Feat concurrency limit for flush (#991)
* add some concurrency limits so we don't swamp ourselves

* lets not queue up when we are too full to prevent OOM problems

* add config options for maximum concurrent and how long to wait for flush, let the wait for flush setting be used as back pressure and a signal to writers that we are slowing up

---------

Co-authored-by: Sean Reilly <sean@hatchet.run>
2024-10-31 09:43:21 -07:00
abelanger5 a9936ef687 fix: set otel insecure flag for all telemetry instantiations (#999) 2024-10-30 17:34:36 -04:00
abelanger5 6158aa2a4c feat: docs for performance (#997)
* feat: docs for performance

* wrap up perf doc

* address review comments
2024-10-29 18:29:03 -04:00
Sean Reilly 9f4b63817d add a serial write for step run events (#990)
* add a serial write for step run events

* update other problematic queries

* tmp: don't upsert queue

* add SerialBuffer to the config

* revert the change to config

* fix: add back queue upsert

* add statement timeout to upsert queue

---------

Co-authored-by: Sean Reilly <sean@hatchet.run>
Co-authored-by: Alexander Belanger <alexander@hatchet.run>
2024-10-25 16:56:38 +00:00
abelanger5 2cdee59aea refactor: optimize v0.50.0 release (#975)
- Simplifies architecture for splitting engine services into different components. The three supported services are now `grpc-api`, `scheduler`, and `controllers`. The `grpc-api` service is the only one which needs to be exposed for workers. The other two can run as unexposed services.
- Fixes a set of bugs and race conditions in the `v2` scheduler
- Adds a `lastActive` time to the `Queue` table and includes a migration which sets this `lastActive` time for the most recent 24 hours of queues. Effectively this means that the max scheduling time in a queue is 24 hours. 
- Rewrites the `ListWorkflowsForEvent` query to improve performance and select far fewer rows.
2024-10-23 12:05:16 +00:00
abelanger5 67a96d7166 feat(throughput): single process per queue (#956)
* feat(throughput): single process per queue

* fix data race

* fix: golint and data race on load test

* wrap up initial v2 scheduler

* fix: more debug logs and tighten channel logic/blocking sends

* improved casing on dispatcher and lease manager

* fix: data race on min id

* increase wait on load test, fix data race

* fix: trylock -> lock

* clean up queue when no longer in set

* fix: clean up cache on exit

* ensure cleanup is only called once

* address review comments
2024-10-15 11:05:19 -04:00
Sean Reilly 29721cd1f0 Feat bulk workflows (#940)
Adds support for inserting workflows in bulk via the API and an optional buffered insert on the engine.
2024-10-14 15:35:29 -04:00
abelanger5 95558138a4 chore: improve throughput, remove deadlocks (#949)
* add otel to pub

* temporarily remove tenant id exchange

* fix: increase internal queue throughput

* fix: remove potential deadlocking

* rollback hash factor multiplier

* fix: batch update issues

* fix: rm unneeded locks

* move disable tenant pubsub to an env var

---------

Co-authored-by: gabriel ruttner <gabriel.ruttner@gmail.com>
2024-10-10 08:54:34 -04:00
abelanger5 3d218302ff fix: internal queue items performance and race conditions (#943)
* fix: don't use xmin hack

* fix: assign not append

* refactor: parallel step run updates via hashes

* fix: intermittent double execution of child step runs

* fix: rollback rate limits

* fix: bulk event writes from single buffer

* expose cleanup

* fix: race conditions on failures and cancellations

* change logger defaults to warn and console
2024-10-07 11:16:53 -04:00
abelanger5 7b5bb398e4 improvements to conn pooling (#939)
* attempt improvements to conn pooling

* cleanup PR
2024-10-04 15:23:34 -04:00
abelanger5 fd4ee804d3 refactor: buffered writes of step run statuses (#941)
* (wip) handle step run updates without deferred updates

* refactor: buffered writes of step run statuses

* fix: add more safety on tenant pools

* add configurable flush period, remove wait for started

* flush immediately if last flush time plus flush period is in the past

* feat: add configurable flush interval/max items
2024-10-04 15:08:21 -04:00
Sean Reilly 27736fa30f bulk insert buffering (#913)
Adds bulk inserts to event writes, and adds a generic buffer which can be used by future batch implementations.
2024-10-03 16:26:12 -04:00
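
The two commits above describe a generic buffer for bulk inserts that flushes on a configurable period or item limit. A rough sketch of that idea, assuming hypothetical names (`BulkBuffer`, `flush_fn`, `max_items`, `flush_period`) that are not taken from the repository, might look like this:

```python
import time


class BulkBuffer:
    """Illustrative write buffer: collects items and hands them to flush_fn in batches."""

    def __init__(self, flush_fn, max_items=100, flush_period=0.01, clock=time.monotonic):
        self.flush_fn = flush_fn          # called with a list of items, e.g. one bulk INSERT
        self.max_items = max_items        # flush when this many items are buffered
        self.flush_period = flush_period  # seconds allowed between flushes
        self.clock = clock                # injectable clock so the sketch can be tested
        self.items = []
        self.last_flush = clock()

    def add(self, item):
        self.items.append(item)
        full = len(self.items) >= self.max_items
        # Flush immediately if the last flush time plus the flush period is in the past.
        overdue = self.last_flush + self.flush_period <= self.clock()
        if full or overdue:
            self.flush()

    def flush(self):
        batch, self.items = self.items, []
        self.last_flush = self.clock()
        if batch:
            self.flush_fn(batch)
```

This sketch only checks the period when an item is added. A production buffer would presumably also flush from a timer so that a quiet buffer does not hold items indefinitely.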
abelanger5 bfb11cac51 fix: always use retention on queues, optional data/worker (#916) 2024-09-27 14:23:14 -04:00
abelanger5 6172956bbd refactor: remove foreign keys from unchanged/non-cascading parent tables (#918)
* refactor: remove fks from unchanged/non-cascading parent tables

* fix: cleanup cache for engine repository

* fix: remove streamevent
2024-09-27 14:21:45 -04:00
abelanger5 9efcebe6af fix: better logic for multiple restricted domains (#860) 2024-09-10 12:07:55 -04:00
abelanger5 7308876776 fix: use separate database pool for queueing, statement timeouts on tx (#839)
* fix: different queue pool and statement timeouts on step runs

* fix: implement prepareTx

* fix: defer rollback properly

* fix: race condition
2024-09-03 21:07:26 +00:00
abelanger5 b5014f6b3d chore: more visibility and debug lines for queues (#836)
* chore: more visibility and debug options for queues

* better debug lines on queue repo

* don't log so much in load test
2024-08-29 14:49:24 -04:00
abelanger5 17b7e84876 fix: delete queue items when no longer used (#831) 2024-08-28 17:12:31 -04:00
Gabe Ruttner 4ea4712d4d refactor: performance and throughput (#756)
Refactors the queueing logic to be fairly balanced between actions, with each action backed as a separate FIFO queue. Also adds support for priority queueing and custom queues, though those aren't exposed on the API layer yet. Improves throughput to be > 5000 tasks/second on a single queue. 

---------

Co-authored-by: Alexander Belanger <alexander@hatchet.run>
2024-08-12 14:38:47 +00:00
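
The commit above describes queueing that is fairly balanced between actions, with each action backed by its own FIFO queue. One simple way to get that behavior is round-robin over per-action queues. The sketch below illustrates that technique under assumed names (`FairQueue`, `push`, `pop`); it is not the repository's implementation and ignores priorities and custom queues:

```python
from collections import deque


class FairQueue:
    """Illustrative fair queue: one FIFO per action, served round-robin."""

    def __init__(self):
        self.queues = {}      # action name -> FIFO deque of tasks
        self.order = deque()  # actions with pending tasks, in round-robin order

    def push(self, action, task):
        if action not in self.queues:
            self.queues[action] = deque()
            self.order.append(action)
        self.queues[action].append(task)

    def pop(self):
        if not self.order:
            return None
        action = self.order.popleft()
        queue = self.queues[action]
        task = queue.popleft()
        if queue:
            # Action still has work: send it to the back of the rotation.
            self.order.append(action)
        else:
            # Drop empty queues so idle actions cost nothing.
            del self.queues[action]
        return action, task
```

With three tasks queued for action `a` and one for action `b`, the tasks come out as `a`, `b`, `a`, `a`, so a busy action cannot starve a quiet one.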
abelanger5 652f604873 fix: add max msg size as env var (#759) 2024-08-01 09:25:05 -04:00
Gabe Ruttner b4670af138 Fix qos otel config (#754)
* feat: otel trace id ratio

* feat: rabbitmq qos

* feat: requeue limit

* fix: tests
2024-07-30 18:11:10 -04:00
abelanger5 aafdd278db make max msg size configurable (#745) 2024-07-26 10:58:16 -07:00
Gabe Ruttner b7cec9ec53 feat: soft delete (#717)
* feat: soft delete workflows and versions

* feat: filter soft deletes wf and wfr

* feat: filter events and step runs

* fix: query

* fix: query

* chore: generate

* wip

* chore: squash migrations

* chore: separate retention into new service

* feat: regularly clean up

* chore: migrations

* fix: tests

* fix: queries

* fix: ambiguous

* fix: refs

* fix: ambiguous id

* fix: remove update from

* fix: soft delete

* fix: cleanup retention scheduler

* fix: has more query

* chore: gen

* fix: query

* fix: table
2024-07-18 09:06:05 -04:00
Gabe Ruttner 1e20bf946a fix: improved assign, reassign, and requeue (#702)
* fix: improved queries

* fix: 1s timeout

* fix: indexes

* fix: increase timeout to 4s

* fix: migration

* merge in db changes

* chore: squash migrations

* chore: re-hash

* chore: remove comment

* chore: rm unused query

* fix: state

* fix: check valid workers before commit

* fix: query

* chore: gen
2024-07-10 12:45:08 -04:00
abelanger5 f36e66cd28 feat: configurable data retention period (#693)
* feat: data retention for tenants

* chore: generate and docs

* chore: lint
2024-07-06 14:31:12 +00:00
abelanger5 f2c6bc1f44 feat: tenant partitioning (#649)
* feat: tenant partitioning

* fix: rebalance inactive partitions, split into separate partitioner

* fix: shutdown partitioner scheduler properly

* update config options

* fix: config options linting
2024-06-26 21:06:51 +00:00
Gabe Ruttner a8d42819ea feat: check security service (#639)
* feat: check security service

* feat: propagate version

* feat: with ident

* fix: lint

* chore: generate

* fix: change domain

* fix: panic recover

* fix: migrations

* fix: hash

* fix: don't check in tests
2024-06-26 16:26:29 -04:00
Gabe Ruttner 35979bea68 feat: disable signups (#643)
* feat: disable signups

* feat: ui and password

* merge main

* fix: disable allow oauth
2024-06-26 16:12:40 -04:00
abelanger5 39eeed04a5 fix: correct alarm limit vars (#645) 2024-06-26 14:44:41 +00:00
Luca Steeb 1490d88954 feat: webhook workers (#542)
Adds serverless support via the concept of webhook workers. Allows any webhook to be registered as a serverless endpoint for executing a step.
2024-06-25 17:06:43 -04:00