The requeue and reassign logic did not limit the number of step runs it fetched, so when events accumulated with no worker present, memory usage spiked and database query latency became very high. This commit limits the number of step runs returned by the requeue and reassign queries, and locks the step run rows in those queries so that only a step run in a PENDING or PENDING_ASSIGNMENT state can be requeued.
It also improves the performance of the `AssignStepRunToWorker` query and ensures that `maxRuns` on workers is always respected by introducing a `WorkerSemaphore` model. The semaphore value is decremented when a step run is assigned and incremented when a step run reaches a final state.
Co-authored-by: Luca Steeb <contact@luca-steeb.com>
* Update controller.go
---------
Co-authored-by: steebchen <contact@luca-steeb.com>
* feat(go-sdk): spawnWorkflow method and get up to speed with other sdks
* fix: manual trigger example
* fix: linting errors
* fix: double serialization from go sdk
* fix: spawn workflow logic and procedural example
* test(e2e): add procedural test
* fix: panic in e2e test
* fix: e2e test preparation
* fix: api server url in test.yml
* fix: load test server url
* chore: make num children configurable
* address pr review
* refactor: separate api and engine repositories, change ticker logic
* fix: nil error blocks
* fix: run migration on load test
* fix: generate db package in load test
* fix: test.yml
* fix: add pnpm to load test
* fix: don't lock CTEs with columns that don't get updated
* fix: update heartbeat for worker every 4 seconds, not 5
* chore: remove dead code
* chore: update python sdk
* chore: add back telemetry attributes
* fix(go-sdk): support tls strategy of none, with docs
* chore: errorf -> sprintf in examples
* Apply suggestions from code review
Co-authored-by: Luca Steeb <contact@luca-steeb.com>
* fix: remove time from example
---------
Co-authored-by: Luca Steeb <contact@luca-steeb.com>
This PR adds support for retrying failed step runs in the engine and SDKs. This was tested with up to 30 retries per step run, covering both failure and success at the 30th step run. Each SDK now has a configurable `retries` parameter for steps when declaring a workflow.
* feat: dag-style execution
* docs: update to reflect new context
* ensure no cycles
* remove example cycle
* linting
* lint and small fixes
* update deferred rollback
* last rollback handling
* unset max issues
* fix requeue edge case
* fix: simple example
* chore: telemetry improvements
- Adds opentelemetry integration for the engine
- Adds standard logger with json and more readable output formats
* remove env from nodemon config files
* feat: support one-time scheduled workflows
* refactor: move schedule out of workflow trigger def
* docs: add scheduling workflows section
* docs: update creating workflow
* only cancel schedules that are in the future
* feat: add initial docs site
* feat: allow workflows to be defined from go sdk
* fix release action
* chore: remove server dependencies from client
* fix: use correct certificate for server
* chore: add port and bind address to grpc config
* fix: add env for grpc config
* fix: nil pointer when output is null
* chore: support variation in output args
* fix unresolved merge conflict
* fix: quickstart improvements
* temp remove database url
* fix: action id not required for event
* fix: actionid validation for events
* Remove deleted files