* aggressively log errors when rmq retry more than 5 times
* revisit comments
* new vars and fix integration test
* fix test
* log only after 5 retries
* debug: remove event pub
* add additional spans to publish message
* debug: don't publish payloads
* fix: persistent messages on olap
* add back other payloads
* remove pub buffers temporarily
* fix: correct queue
* hacky partitioning
* add back pub buffers to scheduler
* don't send no worker events
* add attributes for queue name and message id to publish
* add back pub buffers to grpc api
* remove pubs again, no worker writes though
* task processing queue hashes
* remove payloads again
* gzip compression over 5kb
* add back task controller payloads
* add back no worker requeueing event, with expirable lru cache
* add back pub buffers
* remove hash partitioned queues
* small fixes
* ignore lru cache top fn
* config vars for compression, disable by default
---------
Co-authored-by: Alexander Belanger <alexander@hatchet.run>
* feat: add table for storing payloads
* feat: add payload type enum
* feat: gen sqlc
* feat: initial sql impl
* feat: add payload store repo to shared
* feat: add overwrite
* fix: impl
* feat: bulk op
* feat: initial wiring of inputs for task triggers
* feat: wire up dag matches
* feat: create V1TaskWithPayload and use it everywhere
* fix: couple bugs
* fix: clean up types
* fix: overwrite
* fix: rm input from replay
* fix: move payload store to shared repo
* fix: schema
* refactor: repo setup
* refactor: repos
* fix: gen
* chore: lint
* fix: rename
* feat: naming, write dag inputs
* fix: more naming, trigger bug
* fix: dual writes for now
* fix: pass in tx
* feat: initial work on offloader
* feat: improve external offloader
* fix: some refs
* add withExternalHandler
* fix: improve impl of external store
* feat: implement offloading, fix other impls
* feat: add query to update JSON
* fix: implement offloading + updating records in payloads table
* feat: add WAL table
* feat: add queries for polling WAL and evicting
* feat: wire up writes into WAL
* fix: get job working
* refactor: improve types
* fix: infinite loop
* feat: improve offloading logic to run in two separate txes
* refactor: rework how overrides work
* fix: lint
* fix: migration number
* fix: migration
* fix: migration version
* fix: revert back to reading payloads out
* fix: fall back to previous input, part i
* fix: input fallback
* fix: add back input to replay
* fix: input fallback in dispatcher
* fix: nil check
* feat: advisory locks, part i
* fix: no skip locked
* feat: hash partitioned wal table
* fix: modify queries a bit, tweak crud enum
* fix: pk order, function to find tenants
* feat: wal processing
* fix: only write wal if an external store is enabled, fix offloading logic
* fix: spacing
* feat: schema cleanup
* fix: rm external store loc name
* fix: set content to null when offloading
* fix: cleanup, naming
* fix: pass overwrite payload store along
* debug: add some logging
* Revert "debug: add some logging"
This reverts commit 43e71eadf1.
* fix: typo
* fx: add offloatAt to store opts for offloading
* fix: handle leasing with advisory lock
* fix: struct def
* fix: requeue on payloads not found
* fix: rm hack for triggers
* fix: revert empty input on write
* fix: write input
* feat: env var for enabling / disabling dual writes
* feat: wire up dual writes
* fix: comments
* feat: generics!
* fix: panic from type cast
* fix: migration
* fix: generic
* fix: hack for T key in map
* fix: cleanup
* fix: bug with json parsing failing
* fix: hang up on cancel and fail
* fix: pub stream events even if tenant pubs are disabled
* fix: condition
* fix: eq
* test: improves testing harness for engine
* update CI test
* fix: race condition in test
* make tests more stable
* cleanup pub and sub buffers
* fix: goleak on rampup test
* feat: matrix tests for engine
* clean up rabbit mq session stuff, add a quick ack and error processing for AddMessage
* bit more paranoid about getting stuck in chans
* first pass at locking the message to deal with the failed states better
* clean up the access to ready for the mq
* make sure we don't block sending this ack
- Simplifies architecture for splitting engine services into different components. The three supported services are now `grpc-api`, `scheduler`, and `controllers`. The `grpc-api` service is the only one which needs to be exposed for workers. The other two can run as unexposed services.
- Fixes a set of bugs and race conditions in the `v2` scheduler
- Adds a `lastActive` time to the `Queue` table and includes a migration which sets this `lastActive` time for the most recent 24 hours of queues. Effectively this means that the max scheduling time in a queue is 24 hours.
- Rewrites the `ListWorkflowsForEvent` query to improve performance and select far fewer rows.
Refactors the queueing logic to be fairly balanced between actions, with each action backed as a separate FIFO queue. Also adds support for priority queueing and custom queues, though those aren't exposed on the API layer yet. Improves throughput to be > 5000 tasks/second on a single queue.
---------
Co-authored-by: Alexander Belanger <alexander@hatchet.run>
* feat: allow extending the api server
* chore: remove internal packages to pkg
* chore: update db_gen.go
* fix: expose auth
* fix: move logger to pkg
* fix: don't generate gitignore for prisma client
* fix: allow extensions to register their own api spec
* feat: expose pool on server config
* fix: nil pointer exception on empty opts
* fix: run.go file
* feat: add rabbitmq connection pool and remove non-fatal worker errors
* chore: go mod tidy
* fix: release pool after opening channel
* fix: make sure channel is closed after all tasks return on subscribe
* fix: don't loop endlessly
Logic for requeueing and reassigning did not limit the number of step runs to requeue, so when events accumulate with no worker present it causes memory to spike along with a very high query latency on the database. This commit limits the number of step runs returned in the requeue and reassign queries, and also properly locks step run rows for these queries so only a step run in a PENDING or PENDING_ASSIGNMENT state can be requeued.
It also improves performance of the `AssignStepRunToWorker` query and ensures that `maxRuns` on workers are always respected through the introduction of a `WorkerSemaphore` model. The value gets decremented when a step run is assigned and incremented when a step run is in a final state.
Co-authored-by: Luca Steeb <contact@luca-steeb.com>
* Update controller.go
---------
Co-authored-by: steebchen <contact@luca-steeb.com>