* introduce tenant workflow completed metric
* expose tenant prom metrics via handler
* fix workflow and worker id in metrics
* correctly add workflow metrics from workflow controller
* use olap DB to gather information for workflow completion
* fix prom metrics endpoint for tenant
* workflow name from external id
* simplify tenant registry based metrics
* add docs for prometheus metrics
* fix docs lint
* run prettier fix
* WIP metrics work
* use federate prom server URL to proxy metrics
* implement workflow duration histogram metric
* separate prom stack docker compose
* fix duration metrics calls
* move scheduler metrics to prom tenant specific file
* update docs for prom metrics
* fix lint
* use proper indices to query for durations
* reorg tenant metrics
* fix lint for doc
* update docs with promql examples and casing around prom metrics enabled
* update prom server url
* fix lint
* enable prom metrics for v1 only from controller
* re-add new testing harness
* add healthcheck port and pick random grpc port to listen on
* feat: parallel load tests and faster tests
* make parallelism = 5
* fix: lint
* add linter to pre
* fix: add back rampup fixes
* reduce matrix on PR, add matrix to pre-release step
* make load tests less likely to block
* make limit strategy group round robin
* uncomment lines
* don't call parent output task when not necessary
* help query planner by refactoring replay task
* fix: use failed task pathway for reassignments and timeouts
* feat: scheduling extensions
* add maxRuns to worker objects
* don't double count slots
* add unassigned and actions to slots to unassigned qi
* fix: proper use of extensions channel
* convert ext results ch to slice
* fix: race conditions with read locks
* add ability to set tenants on the extension
* fix: panic on scheduler
* fix: when scheduling fails with schedule timeouts, we never ack the queue item
* add error line if we don't process everything we pass into the scheduler
- Simplifies architecture for splitting engine services into different components. The three supported services are now `grpc-api`, `scheduler`, and `controllers`. The `grpc-api` service is the only one which needs to be exposed for workers. The other two can run as unexposed services.
- Fixes a set of bugs and race conditions in the `v2` scheduler
- Adds a `lastActive` time to the `Queue` table and includes a migration which sets this `lastActive` time for queues from the most recent 24 hours. In effect, the maximum scheduling time for a queue is 24 hours.
- Rewrites the `ListWorkflowsForEvent` query to improve performance and select far fewer rows.
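
The `lastActive` cutoff described above can be pictured with a small Go sketch. This is only one reading of the note: it assumes the scheduler skips queues whose `lastActive` is older than 24 hours. The names `maxQueueInactivity`, `queueIsSchedulable`, `hoursBefore`, and `demoNow` are hypothetical and do not come from the codebase, where the check presumably lives in a SQL query.

```go
package main

import (
	"fmt"
	"time"
)

// maxQueueInactivity mirrors the 24-hour window mentioned in the notes.
// Assumption: the window is applied as a simple age cutoff.
const maxQueueInactivity = 24 * time.Hour

// demoNow is a fixed reference time so the example is deterministic.
var demoNow = time.Date(2024, 1, 2, 12, 0, 0, 0, time.UTC)

// hoursBefore returns the instant h hours before t.
func hoursBefore(t time.Time, h int) time.Time {
	return t.Add(-time.Duration(h) * time.Hour)
}

// queueIsSchedulable reports whether a queue last active at lastActive
// would still be considered at time now. Hypothetical helper.
func queueIsSchedulable(lastActive, now time.Time) bool {
	return now.Sub(lastActive) <= maxQueueInactivity
}

func main() {
	fmt.Println(queueIsSchedulable(hoursBefore(demoNow, 1), demoNow))  // true
	fmt.Println(queueIsSchedulable(hoursBefore(demoNow, 25), demoNow)) // false
}
```
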
* rejig the query for creating multiple sticky states
* fix: soft sticky strategy and improve query
* fix: sort method was using indexes that didn't necessarily correspond to the original indexes, leading to inconsistent behavior
---------
Co-authored-by: Sean Reilly <sean@hatchet.run>
Co-authored-by: Alexander Belanger <alexander@hatchet.run>
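
The sort fix above concerns a common Go pitfall: the indexes passed to a `sort.Slice` comparator are current positions in the slice being sorted, so using them to look up a parallel slice goes wrong once elements move. A minimal sketch of one safe pattern follows, sorting a permutation of the original indexes. The function `sortByPriority` and its data are hypothetical and are not the code changed in this commit.

```go
package main

import (
	"fmt"
	"sort"
)

// sortByPriority returns names ordered by descending priority.
// names and priorities are parallel slices; neither is modified.
func sortByPriority(names []string, priorities []int) []string {
	// idx holds original indexes; only idx is reordered by the sort.
	idx := make([]int, len(names))
	for i := range idx {
		idx[i] = i
	}
	sort.SliceStable(idx, func(a, b int) bool {
		// a and b are positions in idx, so map them back to the
		// original indexes before reading the parallel slice.
		return priorities[idx[a]] > priorities[idx[b]]
	})
	out := make([]string, len(names))
	for pos, orig := range idx {
		out[pos] = names[orig]
	}
	return out
}

func main() {
	fmt.Println(sortByPriority([]string{"a", "b", "c"}, []int{1, 3, 2})) // [b c a]
}
```
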
* make it so the bulk example succeeds
* make the bulk workflows work a little harder
* add some ordering to mitigate deadlocks
* fix: link step run parents bad query, improvements to locking
* add timed mutex and telemetry
* remove for update on cancel
---------
Co-authored-by: Sean Reilly <sean@hatchet.run>
Co-authored-by: Alexander Belanger <alexander@hatchet.run>
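
For the timed mutex mentioned above, a minimal Go sketch of one common interpretation: a lock whose acquisition gives up after a timeout, built on a buffered channel of capacity 1. The type `TimedMutex`, its methods, and `demoTimeout` are hypothetical. The commit's actual implementation may differ, for example by also recording wait times for telemetry.

```go
package main

import (
	"fmt"
	"time"
)

// demoTimeout is an arbitrary value used only by this example.
const demoTimeout = 50 * time.Millisecond

// TimedMutex is a mutual-exclusion lock whose acquisition can time out.
// A full channel means the lock is held.
type TimedMutex struct {
	ch chan struct{}
}

// NewTimedMutex returns an unlocked TimedMutex.
func NewTimedMutex() *TimedMutex {
	return &TimedMutex{ch: make(chan struct{}, 1)}
}

// LockWithTimeout waits up to d for the lock and reports whether it was acquired.
func (m *TimedMutex) LockWithTimeout(d time.Duration) bool {
	timer := time.NewTimer(d)
	defer timer.Stop()
	select {
	case m.ch <- struct{}{}:
		return true
	case <-timer.C:
		return false
	}
}

// Unlock releases the lock and panics if it was not held.
func (m *TimedMutex) Unlock() {
	select {
	case <-m.ch:
	default:
		panic("unlock of unlocked TimedMutex")
	}
}

func main() {
	m := NewTimedMutex()
	fmt.Println(m.LockWithTimeout(demoTimeout)) // true: lock was free
	fmt.Println(m.LockWithTimeout(demoTimeout)) // false: still held, times out
	m.Unlock()
}
```

A timeout turns a silent deadlock into a visible failure that can be logged or counted, which fits alongside the lock-ordering changes in the same group of commits.
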
* feat(throughput): single process per queue
* fix data race
* fix: golint and data race on load test
* wrap up initial v2 scheduler
* fix: more debug logs and tighten channel logic/blocking sends
* improved casing on dispatcher and lease manager
* fix: data race on min id
* increase wait on load test, fix data race
* fix: trylock -> lock
* clean up queue when no longer in set
* fix: clean up cache on exit
* ensure cleanup is only called once
* address review comments
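
The "ensure cleanup is only called once" item above is typically handled in Go with `sync.Once`. A minimal sketch, assuming a per-queue worker that several code paths may try to shut down. The type `queueWorker` and its fields are hypothetical and are not taken from the scheduler code.

```go
package main

import (
	"fmt"
	"sync"
)

// queueWorker stands in for a per-queue process that owns resources.
type queueWorker struct {
	cleanupOnce sync.Once
	cleanups    int // counts how many times the cleanup body actually ran
}

// Cleanup releases the worker's resources. It is safe to call from
// multiple goroutines; the body runs only on the first call.
func (w *queueWorker) Cleanup() {
	w.cleanupOnce.Do(func() {
		w.cleanups++
	})
}

func main() {
	w := &queueWorker{}
	w.Cleanup()
	w.Cleanup()
	fmt.Println(w.cleanups) // 1
}
```
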