* feat: add query to fetch upstream errors from db
* fix: return many
* feat: propagate errors through `input`
* fix: implement the method to get the errors out
* fix: query cleanup
* feat: rename errors
* fix: col names
* fix: key name in the json
* feat: add method to context to get failed step errors
* fix: add 👀
* feat: add error log if not errors
* fix: logger
* fix: simplify query
---------
Co-authored-by: abelanger5 <belanger@sas.upenn.edu>
* fix: log ui
* fix: partition handling and unregister
* fix: concurrent cleanup
* feat: op pool
* fix: run or continue partition id
* fix: return false out of check
* feat: allow extending the api server
* chore: move internal packages to pkg
* chore: update db_gen.go
* fix: expose auth
* fix: move logger to pkg
* fix: don't generate gitignore for prisma client
* fix: allow extensions to register their own api spec
* feat: expose pool on server config
* fix: nil pointer exception on empty opts
* fix: run.go file
* feat: add release slot proto
* feat: add semaphore release state and methods
* feat: go sdk and example
* docs: manual slot release
* chore: linting
* fix: broken test
* fix: unlink step run on manual release
* feat: release slot event
* fix: test
* fix: revert e2e test changes
* chore: remove debug line
* fix: place step run query in same tx
* fix: change migration release version
---------
Co-authored-by: Alexander Belanger <belanger@sas.upenn.edu>
* new api-contract for workflow run events
* feat: initial implementation for new subscribe listener
* fix: sync issues and send workflow runs immediately
* refactor: add context to all engine db queries, fix deadlocking query
* fix: use new ctx for deleting dispatcher and ticker
* add cancellation reasons
* fix: docs linting
---------
Co-authored-by: gabriel ruttner <gabriel.ruttner@gmail.com>
The requeue and reassign logic did not limit the number of step runs it handled, so when events accumulated with no worker present, memory spiked and query latency on the database became very high. This commit limits the number of step runs returned by the requeue and reassign queries, and also properly locks step run rows for these queries so that only a step run in a PENDING or PENDING_ASSIGNMENT state can be requeued.
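The batching rule described above can be sketched in Go. This is an in-memory illustration only, not the actual query: the `StepRun` type, the `selectRequeueBatch` function, and the `RUNNING` state are hypothetical names introduced here, and the row locking that the real queries perform is not modeled.

```go
package main

import "fmt"

// StepRunStatus is a hypothetical type; only PENDING and PENDING_ASSIGNMENT
// are named in the commit message. RUNNING is added here as a counterexample.
type StepRunStatus string

const (
	Pending           StepRunStatus = "PENDING"
	PendingAssignment StepRunStatus = "PENDING_ASSIGNMENT"
	Running           StepRunStatus = "RUNNING"
)

// StepRun is a hypothetical in-memory stand-in for a step run row.
type StepRun struct {
	ID     string
	Status StepRunStatus
}

// selectRequeueBatch returns at most limit step runs that are eligible for
// requeueing, so one pass never loads an unbounded number of rows.
func selectRequeueBatch(runs []StepRun, limit int) []StepRun {
	if limit <= 0 {
		return nil
	}
	batch := make([]StepRun, 0, limit)
	for _, r := range runs {
		if len(batch) >= limit {
			break
		}
		// Only PENDING or PENDING_ASSIGNMENT step runs may be requeued.
		if r.Status == Pending || r.Status == PendingAssignment {
			batch = append(batch, r)
		}
	}
	return batch
}

func main() {
	runs := []StepRun{
		{ID: "a", Status: Pending},
		{ID: "b", Status: Running},
		{ID: "c", Status: PendingAssignment},
		{ID: "d", Status: Pending},
	}
	// With a limit of 2, the third eligible run is left for a later pass.
	for _, r := range selectRequeueBatch(runs, 2) {
		fmt.Println(r.ID, r.Status)
	}
}
```

Capping the batch keeps memory use per pass bounded regardless of how many step runs have accumulated; the remaining runs are picked up on later passes.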
The commit also improves performance of the `AssignStepRunToWorker` query and ensures that `maxRuns` on workers is always respected by introducing a `WorkerSemaphore` model. The semaphore value is decremented when a step run is assigned and incremented when a step run reaches a final state.
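The semaphore bookkeeping can be illustrated with a minimal in-memory Go sketch. The real `WorkerSemaphore` is a database model; the struct fields, the method names `Acquire`, `Release`, and `Available`, and the `ErrNoSlots` error below are assumptions made for illustration.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// ErrNoSlots is a hypothetical error returned when a worker is at maxRuns.
var ErrNoSlots = errors.New("worker has no free slots")

// WorkerSemaphore is an in-memory illustration of the counting behavior
// described above; it is not the actual database model.
type WorkerSemaphore struct {
	mu      sync.Mutex
	slots   int // remaining capacity
	maxRuns int // upper bound configured on the worker
}

func NewWorkerSemaphore(maxRuns int) *WorkerSemaphore {
	return &WorkerSemaphore{slots: maxRuns, maxRuns: maxRuns}
}

// Acquire decrements the value when a step run is assigned to the worker.
func (s *WorkerSemaphore) Acquire() error {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.slots <= 0 {
		return ErrNoSlots
	}
	s.slots--
	return nil
}

// Release increments the value when a step run reaches a final state.
// It never raises the value above maxRuns.
func (s *WorkerSemaphore) Release() {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.slots < s.maxRuns {
		s.slots++
	}
}

// Available reports the remaining capacity.
func (s *WorkerSemaphore) Available() int {
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.slots
}

func main() {
	sem := NewWorkerSemaphore(2)
	fmt.Println(sem.Acquire(), sem.Acquire()) // two assignments fit within maxRuns
	fmt.Println(sem.Acquire())                // a third assignment is rejected
	sem.Release()                             // one step run reaches a final state
	fmt.Println(sem.Available())
}
```

Checking and decrementing under a single lock is what keeps two concurrent assignments from both taking the last slot; in a database-backed version the same guarantee would have to come from the transaction or row lock.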
Co-authored-by: Luca Steeb <contact@luca-steeb.com>
* Update controller.go
---------
Co-authored-by: steebchen <contact@luca-steeb.com>
* feat(go-sdk): spawnWorkflow method and get up to speed with other sdks
* fix: manual trigger example
* fix: linting errors
* fix: double serialization from go sdk
* fix: spawn workflow logic and procedural example
* test(e2e): add procedural test
* fix: panic in e2e test
* fix: e2e test preparation
* fix: api server url in test.yml
* fix: load test server url
* chore: make num children configurable
* address pr review
This PR adds support for retrying failed step runs in the engine and SDKs. It was tested with up to 30 retries per step run, covering both failure and success at the 30th step run. Each SDK now has a configurable `retries` param for steps when declaring a workflow.