* introduce tenant workflow completed metric
* expose tenant prom metrics via handler
* fix workflow and worker id in metrics
* correctly add workflow metrics from workflow controller
* use olap DB to gather information for workflow completion
* fix prom metrics endpoint for tenant
* workflow name from external id
* simplify tenant registry based metrics
* add docs for prometheus metrics
* fix docs lint
* run prettier fix
* WIP metrics work
* use federate prom server URL to proxy metrics
* implement workflow duration histogram metric
* separate prom stack docker compose
* fix duration metrics calls
* move scheduler metrics to prom tenant specific file
* update docs for prom metrics
* fix lint
* use proper indices to query for durations
* reorg tenant metrics
* fix lint for doc
* update docs with promql examples and casing around prom metrics enabled
* update prom server url
* fix lint
* enable prom metrics for v1 only from controller
* re-add new testing harness
* add healthcheck port and pick random grpc port to listen on
* feat: parallel load tests and faster tests
* make parallelism = 5
* fix: lint
* add linter to pre
* fix: add back rampup fixes
* reduce matrix on PR, add matrix to pre-release step
* make load tests less likely to block
* make limit strategy group round robin
* uncomment lines
* wip: api contracts
* feat: implement put workflow version endpoint
* add support for match existing data, get scaffolding in place for additional triggers
* create additional matches
* feat: durable sleep, user event matching
* update protos
* fix: working poc of user events, durable sleep
* add migration
* fix: migration column
* feat: durable event listener
* fix: skip overrides
* fix: input -> output
* allow us to configure different repos
* make the struct contents public
* pass in config values to new log repo
* rename functions - possibly breaking changes so let's discuss
* make the logging backend configurable
* fix tests
* don't allow calls to WithAdditionalConfig
* cleanup
* replace sc with server
Co-authored-by: abelanger5 <belanger@sas.upenn.edu>
* rename sc to server
* add an LRU cache for the step run lookup
* let's not use an expirable cache and just use the regular one - we cannot close the go func in expirable
---------
Co-authored-by: abelanger5 <belanger@sas.upenn.edu>
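The LRU cache for the step run lookup mentioned above can be illustrated with a minimal standard-library sketch. It assumes string keys and values and a fixed capacity; the `LRU`, `NewLRU`, `Get`, and `Put` names are hypothetical and are not the cache library the commits actually use. Because it has no expiry, it needs no background goroutine to shut down.

```go
package main

import (
	"container/list"
	"fmt"
)

// entry is the payload stored in each list element (hypothetical type).
type entry struct {
	key, value string
}

// LRU is a fixed-capacity, non-expiring least-recently-used cache.
type LRU struct {
	capacity int
	order    *list.List               // front = most recently used
	items    map[string]*list.Element // key -> element in order
}

func NewLRU(capacity int) *LRU {
	return &LRU{
		capacity: capacity,
		order:    list.New(),
		items:    make(map[string]*list.Element),
	}
}

// Get returns the cached value and marks the key as recently used.
func (c *LRU) Get(key string) (string, bool) {
	el, ok := c.items[key]
	if !ok {
		return "", false
	}
	c.order.MoveToFront(el)
	return el.Value.(entry).value, true
}

// Put inserts or updates a key, evicting the least recently used
// entry when the cache grows past its capacity.
func (c *LRU) Put(key, value string) {
	if el, ok := c.items[key]; ok {
		el.Value = entry{key, value}
		c.order.MoveToFront(el)
		return
	}
	c.items[key] = c.order.PushFront(entry{key, value})
	if c.order.Len() > c.capacity {
		oldest := c.order.Back()
		c.order.Remove(oldest)
		delete(c.items, oldest.Value.(entry).key)
	}
}

func main() {
	cache := NewLRU(2)
	cache.Put("step-run-1", "queued")
	cache.Put("step-run-2", "running")
	cache.Get("step-run-1")              // touch step-run-1 so step-run-2 is oldest
	cache.Put("step-run-3", "succeeded") // evicts step-run-2
	_, found := cache.Get("step-run-2")
	fmt.Println("step-run-2 still cached:", found)
}
```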
* adding a /version endpoint for the engine and a /api/v1/version endpoint for the API
* make the security optional so we don't get redirected for having auth
* lint
* upgrade protoc to the latest available version on brew
* use useQuery and clean up html
- Simplifies architecture for splitting engine services into different components. The three supported services are now `grpc-api`, `scheduler`, and `controllers`. The `grpc-api` service is the only one which needs to be exposed for workers. The other two can run as unexposed services.
- Fixes a set of bugs and race conditions in the `v2` scheduler
- Adds a `lastActive` time to the `Queue` table and includes a migration which sets this `lastActive` time for queues from the most recent 24 hours. In effect, the maximum scheduling time in a queue is 24 hours.
- Rewrites the `ListWorkflowsForEvent` query to improve performance and select far fewer rows.
* feat(throughput): single process per queue
* fix data race
* fix: golint and data race on load test
* wrap up initial v2 scheduler
* fix: more debug logs and tighten channel logic/blocking sends
* improved casing on dispatcher and lease manager
* fix: data race on min id
* increase wait on load test, fix data race
* fix: trylock -> lock
* clean up queue when no longer in set
* fix: clean up cache on exit
* ensure cleanup is only called once
* address review comments
Refactors the queueing logic to be fairly balanced between actions, with each action backed by a separate FIFO queue. Also adds support for priority queueing and custom queues, though those aren't exposed on the API layer yet. Improves throughput to more than 5000 tasks/second on a single queue.
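One way to picture fair balancing across per-action FIFO queues is a round-robin pop over the actions. The sketch below is illustrative only and assumes a simple in-memory structure; `Task`, `FairQueue`, `Push`, and `Pop` are hypothetical names, and it ignores priorities, custom queues, and persistence.

```go
package main

import "fmt"

// Task is a hypothetical unit of work tagged with its action.
type Task struct {
	Action string
	ID     int
}

// FairQueue keeps one FIFO queue per action and serves the actions
// in round-robin order so a busy action cannot starve the others.
type FairQueue struct {
	queues map[string][]Task // per-action FIFO queues
	order  []string          // actions in first-seen order
	next   int               // index in order of the next action to serve
}

func NewFairQueue() *FairQueue {
	return &FairQueue{queues: make(map[string][]Task)}
}

// Push appends the task to the FIFO queue of its action.
func (q *FairQueue) Push(t Task) {
	if _, ok := q.queues[t.Action]; !ok {
		q.order = append(q.order, t.Action)
	}
	q.queues[t.Action] = append(q.queues[t.Action], t)
}

// Pop returns the oldest task of the next non-empty action and
// advances the rotation past that action.
func (q *FairQueue) Pop() (Task, bool) {
	for i := 0; i < len(q.order); i++ {
		idx := (q.next + i) % len(q.order)
		action := q.order[idx]
		pending := q.queues[action]
		if len(pending) == 0 {
			continue
		}
		q.queues[action] = pending[1:]
		q.next = (idx + 1) % len(q.order)
		return pending[0], true
	}
	return Task{}, false
}

func main() {
	q := NewFairQueue()
	for i := 1; i <= 3; i++ {
		q.Push(Task{Action: "send-email", ID: i})
	}
	q.Push(Task{Action: "resize-image", ID: 1})
	for t, ok := q.Pop(); ok; t, ok = q.Pop() {
		fmt.Printf("%s #%d\n", t.Action, t.ID)
	}
}
```

In this example the single `resize-image` task is served second, after one `send-email` task, rather than waiting behind all three.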
---------
Co-authored-by: Alexander Belanger <alexander@hatchet.run>
* fix: reduce max throughput of requeue
* fix: reassign query
* fix: move step run timeout to partition model
* fix: partitioning queries and index
* better logs on requeue
* fix: inactive rebalance and get step run for engine query
* fix: correct inactive queries
* feat: allow extending the api server
* chore: move internal packages to pkg
* chore: update db_gen.go
* fix: expose auth
* fix: move logger to pkg
* fix: don't generate gitignore for prisma client
* fix: allow extensions to register their own api spec
* feat: expose pool on server config
* fix: nil pointer exception on empty opts
* fix: run.go file
* feat: alerting. implements slack alerting, email, and refactors tenant settings to make them more manageable
* chore: generate
* chore: generate sqlc after migrate