* add in the migration for now
* Update step_runs.sql
remove TODO
* change the schema so we don't undo it
* add the migration for step run partition. remove prisma. add a helper task for recreating the db
* do a manual merge of the schema.sql
* add in the serial
* update docs
* PR feedback
* add Identity to all tables that don't have a Bigserial
* do the atlas hash with the new migration
* squash the migrations
---------
Co-authored-by: Sean Reilly <sean@hatchet.run>
* add a dynamic strategy for flushing where we make the flush trigger a function of the concurrency depth
* default value for tests and New for FlushStrategy
* clean up the currently flushing locking and add deadlock.Mutex
* don't wait as long for the buffer
* let's see if this 2ms thing is what is causing things to break
* let's error for this to see if we are actually hitting these limits
* put a really short deadline on the lock timeout to see if github actions will blow up
* let's use RW mutexes so we don't block as much
* let's extend this out to 100ms
* let's just do fewer locks
* add a lock to prevent a queue behind the semaphore
* deal with potential data races
* a simpler fib loop and no locks
* let's get rid of the wait for flush
* remove the deadlock stuff
* mod tidy
---------
Co-authored-by: Sean Reilly <sean@hatchet.run>
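
The "flush trigger as a function of the concurrency depth" idea above could look roughly like this sketch. The `flushThreshold` name and the scaling constants are illustrative, not the actual implementation:

```go
package main

import "fmt"

// flushThreshold returns how many buffered items should trigger a flush,
// scaled by the current concurrency depth: the deeper the backlog, the
// sooner we flush. Constants are illustrative, not real config values.
func flushThreshold(depth int) int {
	const maxThreshold = 100 // flush lazily when the system is idle
	const minThreshold = 5   // flush eagerly under heavy concurrency
	t := maxThreshold / (depth + 1)
	if t < minThreshold {
		return minThreshold
	}
	return t
}

func main() {
	for _, depth := range []int{0, 1, 9, 99} {
		fmt.Printf("depth=%d threshold=%d\n", depth, flushThreshold(depth))
	}
	// → depth=0 threshold=100, depth=1 threshold=50,
	//   depth=9 threshold=10, depth=99 threshold=5
}
```

The point of the shape is that an idle buffer can batch generously, while a deep backlog flushes almost immediately.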
* fix: when scheduling fails with schedule timeouts, we never ack the queue item
* add error line if we don't process everything we pass into the scheduler
* add multiple rate limiters in gRPC using a token bucket
* PR feedback
* add in client retry for go client
* update test files
* remove log line; only retry on ResourceExhausted and Unavailable
* add some concurrency limits so we don't swamp ourselves
* add some logging for when we are getting backed up
* let's not queue up when we are too full, to prevent OOM problems
* fix spelling
* add config options for maximum concurrency and how long to wait for flush; let the wait-for-flush setting be used as back pressure and a signal to writers that we are slowing down
* lots of changes to buffering
* fix data race
* add some comments explaining how this works, change errors to ResourceExhausted now that we have client retry, and limit how many goroutines we can create on cleanup, waiting for them to finish before we exit
* hooking up the config values so they go to the right place
* Update config.go to default to 1 ms waitForFlush
* disable grpc_retry for client streams
* explicitly set the limit if it is 0
* weirdness because we were using an older version of the lib
---------
Co-authored-by: Sean Reilly <sean@hatchet.run>
Co-authored-by: Alexander Belanger <alexander@hatchet.run>
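
The token-bucket limiting above, paired with client retry on ResourceExhausted, can be sketched with a minimal stdlib-only bucket of the kind a gRPC interceptor could consult before admitting a request. The type and field names here are illustrative:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// TokenBucket is a minimal token-bucket limiter: a burst capacity plus a
// steady refill rate. When Allow reports false, a server would return a
// ResourceExhausted status so that retrying clients back off.
type TokenBucket struct {
	mu       sync.Mutex
	tokens   float64
	capacity float64
	rate     float64 // tokens added per second
	last     time.Time
}

func NewTokenBucket(capacity, rate float64) *TokenBucket {
	return &TokenBucket{tokens: capacity, capacity: capacity, rate: rate, last: time.Now()}
}

// Allow refills tokens for the elapsed time, then consumes one if available.
func (b *TokenBucket) Allow() bool {
	b.mu.Lock()
	defer b.mu.Unlock()
	now := time.Now()
	b.tokens += now.Sub(b.last).Seconds() * b.rate
	if b.tokens > b.capacity {
		b.tokens = b.capacity
	}
	b.last = now
	if b.tokens >= 1 {
		b.tokens--
		return true
	}
	return false
}

func main() {
	bucket := NewTokenBucket(2, 1) // burst of 2, refill 1 token/sec
	fmt.Println(bucket.Allow(), bucket.Allow(), bucket.Allow())
	// → true true false: the burst is spent, the third call is rejected
}
```

Running one bucket per method (or per tenant) is what makes this "multiple rate limiters" rather than a single global gate.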
* add an essential pool for heartbeats
* add some telemetry spans to heartbeat and capture any errors
---------
Co-authored-by: Sean Reilly <sean@hatchet.run>
* feat: runtime signature
* feat: add sdk runtime to worker model
* feat: post runtime
* feat: expose sdk version on worker
* feat: go inf
* chore: gen
* chore: migrations and generation
* fix: simpler runtime
* feat: hatchet sdk ver
* fix: rm debug line
* add a serial write for step run events
* update other problematic queries
* tmp: don't upsert queue
* add SerialBuffer to the config
* revert the change to config
* fix: add back queue upsert
* add statement timeout to upsert queue
---------
Co-authored-by: Sean Reilly <sean@hatchet.run>
Co-authored-by: Alexander Belanger <alexander@hatchet.run>
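
The "serial write for step run events" can be pictured as a buffer that accepts items concurrently but flushes one batch at a time. This is a sketch under assumed names (`SerialBuffer` here is not the real type), showing the swap-under-lock pattern so producers never block behind the write:

```go
package main

import (
	"fmt"
	"sync"
)

// SerialBuffer accepts items from many goroutines but writes them out one
// batch at a time, serializing writes that would otherwise contend when
// interleaved (e.g. step run events). Illustrative sketch only.
type SerialBuffer struct {
	mu    sync.Mutex
	items []string
}

func (b *SerialBuffer) Add(item string) {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.items = append(b.items, item)
}

// Flush swaps out the pending batch under the lock, then performs the write
// without holding the lock, so Add is never blocked behind the write.
func (b *SerialBuffer) Flush(write func([]string)) int {
	b.mu.Lock()
	batch := b.items
	b.items = nil
	b.mu.Unlock()
	if len(batch) > 0 {
		write(batch)
	}
	return len(batch)
}

func main() {
	var buf SerialBuffer
	buf.Add("step-run-event-1")
	buf.Add("step-run-event-2")
	n := buf.Flush(func(batch []string) { fmt.Println("writing", len(batch), "events") })
	fmt.Println("flushed:", n)
	// → writing 2 events / flushed: 2
}
```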
- Simplifies the architecture for splitting engine services into different components. The three supported services are now `grpc-api`, `scheduler`, and `controllers`. The `grpc-api` service is the only one that needs to be exposed for workers; the other two can run as unexposed services.
- Fixes a set of bugs and race conditions in the `v2` scheduler
- Adds a `lastActive` time to the `Queue` table and includes a migration which sets this `lastActive` time for the most recent 24 hours of queues. Effectively this means that the max scheduling time in a queue is 24 hours.
- Rewrites the `ListWorkflowsForEvent` query to improve performance and select far fewer rows.
* rejig the query for creating multiple sticky states
* fix: sticky strategy of soft and improve query
* fix: sort method was using indexes that didn't necessarily correspond to original indexes, leading to inconsistent behavior
---------
Co-authored-by: Sean Reilly <sean@hatchet.run>
Co-authored-by: Alexander Belanger <alexander@hatchet.run>
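
The sort bug above, where a less function compared positions in an already-partially-sorted slice instead of the original, is a classic pitfall with parallel slices. One safe pattern is to sort a permutation of indexes and read the data through it. A sketch under assumed names:

```go
package main

import (
	"fmt"
	"sort"
)

// sortedOrder returns a permutation of indexes ordered by weight. Sorting
// the permutation rather than the data keeps the less function reading
// through stable original positions, instead of positions that shift as
// the sort proceeds.
func sortedOrder(weights []int) []int {
	order := make([]int, len(weights))
	for i := range order {
		order[i] = i
	}
	sort.Slice(order, func(i, j int) bool {
		return weights[order[i]] < weights[order[j]]
	})
	return order
}

func main() {
	workers := []string{"w-c", "w-a", "w-b"}
	weights := []int{3, 1, 2} // parallel slice keyed by original index
	for _, idx := range sortedOrder(weights) {
		fmt.Println(workers[idx], weights[idx])
	}
	// → w-a 1, w-b 2, w-c 3
}
```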
* make it so the bulk example succeeds
* make the bulk workflows work a little harder
* add some ordering to mitigate deadlocks
* fix: bad query when linking step run parents, improvements to locking
* add timed mutex and telemetry
* remove for update on cancel
---------
Co-authored-by: Sean Reilly <sean@hatchet.run>
Co-authored-by: Alexander Belanger <alexander@hatchet.run>
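
"Add some ordering to mitigate deadlocks" refers to the standard trick of always acquiring locks in one global order, so two transactions can never each hold a lock the other is waiting for. A sketch with illustrative names:

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

type stepRun struct {
	id int
	mu sync.Mutex
}

// lockAll acquires the mutexes in ascending ID order (it sorts the slice
// in place for the sketch). When every code path locks step runs in the
// same global order, circular waits, and hence deadlocks, are impossible.
func lockAll(runs []*stepRun) {
	sort.Slice(runs, func(i, j int) bool { return runs[i].id < runs[j].id })
	for _, r := range runs {
		r.mu.Lock()
	}
}

func unlockAll(runs []*stepRun) {
	for _, r := range runs {
		r.mu.Unlock()
	}
}

func main() {
	a, b := &stepRun{id: 2}, &stepRun{id: 1}
	runs := []*stepRun{a, b}
	lockAll(runs) // locks id 1 before id 2, regardless of input order
	fmt.Println("locked in order:", runs[0].id, runs[1].id)
	unlockAll(runs)
	// → locked in order: 1 2
}
```

In SQL the same idea shows up as ordering the rows touched by `FOR UPDATE` (e.g. `ORDER BY id`) so concurrent transactions lock rows in the same sequence.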
* feat(throughput): single process per queue
* fix data race
* fix: golint and data race on load test
* wrap up initial v2 scheduler
* fix: more debug logs and tighten channel logic/blocking sends
* improved casing on dispatcher and lease manager
* fix: data race on min id
* increase wait on load test, fix data race
* fix: trylock -> lock
* clean up queue when no longer in set
* fix: clean up cache on exit
* ensure cleanup is only called once
* address review comments
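
"Ensure cleanup is only called once" is the textbook use case for `sync.Once`: the cleanup can then be invoked safely from both the exit path and the queue-removal path. A sketch, with the type and field names being illustrative:

```go
package main

import (
	"fmt"
	"sync"
)

// cache guards its cleanup with sync.Once so that concurrent callers,
// e.g. an exit path racing a queue-removal path, release resources
// exactly once.
type cache struct {
	once    sync.Once
	cleaned int
}

func (c *cache) cleanup() {
	c.once.Do(func() {
		c.cleaned++ // release resources exactly once
	})
}

func main() {
	c := &cache{}
	var wg sync.WaitGroup
	for i := 0; i < 10; i++ { // ten goroutines racing to clean up
		wg.Add(1)
		go func() { defer wg.Done(); c.cleanup() }()
	}
	wg.Wait()
	fmt.Println("cleanup ran", c.cleaned, "time(s)")
	// → cleanup ran 1 time(s)
}
```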