Commit Graph

72 Commits

Author SHA1 Message Date
abelanger5
d071a1c36b fix: prevent large worker gRPC stream backlogs (#2597)
* fix: prevent large worker backlogs

* add config value

* add doc for troubleshooting
2025-12-03 17:15:43 -05:00
matt
7fe9806f5d Feat: Configurable OLAP status update size limits (#2499)
* feat: configurable status updates

* fix: config

* fix: wiring

* feat: export limits from olap

* fix: param drilling
2025-11-06 13:37:40 -05:00
Mohammed Nafees
ed4c0327ce [hotfix] Meaningful casing for engine liveness and readiness probes (#2465)
* more fixes for engine live and ready probes

* rename

* no need to set it to false

* fix casing health check

* log only when not shutting down
2025-10-30 20:24:33 +01:00
Mohammed Nafees
b58359d7b3 Do not run cleanup on v1_workflow_concurrency_slot (#2463)
* do not run cleanup on v1_concurrency_slot

* fix health endpoints for engine
2025-10-30 15:34:50 +01:00
Mohammed Nafees
e2b1f1353e Fix OTel span attribute naming convention (#2409)
* rename spans according to convention

* low cardinality
2025-10-16 18:43:40 +02:00
Mohammed Nafees
a750ce950d Introduce vars to tune ANALYZE job gocron run intervals (#2407)
* introduce vars to tune ANALYZE job gocron run intervals

* update config doc

* fix assignment
2025-10-10 11:02:10 +02:00
Gabe Ruttner
f59ebd6c47 feat: analytics events (#2171)
* feat: analytics events

* review comments
2025-08-22 05:41:17 -07:00
Mohammed Nafees
793df41ccb Deploy HyperDX locally via docker-compose and add traces to task controller (#2058)
* deploy jaeger locally and add traces to task controller

* use jaeger v2

* add SERVER_OTEL_COLLECTOR_AUTH

* fix PR comments

* fix span name
2025-07-29 16:24:38 +02:00
abelanger5
27435a72d6 feat: option to disable logging (#2030) 2025-07-21 16:53:11 +02:00
Mohammed Nafees
ef498a6235 Introduce tenant Prometheus metrics (#1875)
* introduce tenant workflow completed metric

* expose tenant prom metrics via handler

* fix workflow and worker id in metrics

* correctly add workflow metrics from workflow controller

* use olap DB to gather information for workflow completion

* fix prom metrics endpoint for tenant

* workflow name from external id

* simplify tenant registry based metrics

* add docs for prometheus metrics

* fix docs lint

* run prettier fix

* WIP metrics work

* use federate prom server URL to proxy metrics

* implement workflow duration histogram metric

* separate prom stack docker compose

* fix duration metrics calls

* move scheduler metrics to prom tenant specific file

* update docs for prom metrics

* fix lint

* use proper indices to query for durations

* reorg tenant metrics

* fix lint for doc

* update docs with promql examples and casing around prom metrics enabled

* update prom server url

* fix lint

* enabled prom metrics for v1 only from controller
2025-06-27 11:46:31 -04:00
Gabe Ruttner
68de72d534 Ops disableable replay (#1855)
* try lock

* revert

* Update pkg/repository/v1/scheduler_concurrency.go

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Update pkg/repository/v1/scheduler_concurrency.go

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* both strats

* disable

* remove input

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
2025-06-12 15:25:38 -04:00
Gabe Ruttner
1261509755 configurable task ops jitter (#1800)
* configurable task ops jitter

* single config, configurable poll

* revert timeout

* fix correct param
2025-06-02 16:02:01 -04:00
Gabe Ruttner
1421c826ad Feat configurable olap jitter (#1759)
* jitter

* times

* configurable olap jitter and interval
2025-05-21 11:01:00 -04:00
abelanger5
8f9ae4ecf2 fix: make stripped payload size configurable (#1685) 2025-05-07 09:13:07 -04:00
abelanger5
d4ba9c761d feat: pause internal controllers (#1670)
* feat: pause internal controllers

* improve controller active logic
2025-05-03 18:19:34 -04:00
abelanger5
ffbeafc204 revert: add back testing harness (#1659)
* re-add new testing harness

* add healthcheck port and pick random grpc port to listen on

* feat: parallel load tests and faster tests

* make parallelism = 5

* fix: lint

* add linter to pre

* fix: add back rampup fixes

* reduce matrix on PR, add matrix to pre-release step

* make load tests less likely to block

* make limit strategy group round robin

* uncomment lines
2025-05-01 15:22:30 -04:00
abelanger5
dacf48180b feat: sampling (#1592)
* feat: sampling

* Update internal/services/controllers/v1/olap/controller.go

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* docs: sampling

* sampling -> trace sampling

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
2025-04-23 17:28:58 -04:00
abelanger5
9aead7ab68 feat: global prometheus metrics (#1568)
* feat: global prometheus metrics

* configure prom with env vars, clean up metrics

* add histogram and docs

* update port
2025-04-17 15:11:38 -04:00
abelanger5
c54bf9266c feat(v1): tenant limits (#1388)
* feat(v1): tenant limits

* fix: migration

* fix: kill metered cache
2025-03-23 19:03:55 -07:00
abelanger5
00c4bbff09 feat(v1): new gRPC API endpoints (#1367)
* wip: api contracts

* feat: implement put workflow version endpoint

* add support for match existing data, get scaffolding in place for additional triggers

* create additional matches

* feat: durable sleep, user event matching

* update protos

* fix: working poc of user events, durable sleep

* add migration

* fix: migration column

* feat: durable event listener

* fix: skip overrides

* fix: input -> output
2025-03-23 18:58:20 -07:00
abelanger5
e91047d7b3 feat: add back tenant alerting to v1 (#1372) 2025-03-19 17:50:42 -04:00
abelanger5
1f2096313d feat: v1 engine (#1318) 2025-03-11 14:57:13 -04:00
Sean Reilly
a8dd33c61f Feature - configurable logging backend (#1188)
* allow us to configure different repos

* make the struct contents public

* pass in config values to new log repo

* rename functions - possibly breaking changes so let's discuss

* make the logging backend configurable

* fix tests

* don't allow calls to WithAdditionalConfig

* cleanup

* replace sc with server

Co-authored-by: abelanger5 <belanger@sas.upenn.edu>

* rename sc to server

* add a LRU cache for the step run lookup

* let's not use an expirable cache and just use the regular one - we cannot close the go func in expirable

---------

Co-authored-by: abelanger5 <belanger@sas.upenn.edu>
2025-01-17 15:34:10 -08:00
Sean Reilly
9e961ac196 Feature add version info (#1154)
* adding a /version endpoint for the engine and a /api/v1/version endpoint for the API

* make the security optional so we don't get redirected for having auth

* lint

* upgrade protoc to the latest available version on brew

* use useQuery and clean up html
2025-01-06 10:50:04 -08:00
abelanger5
a9936ef687 fix: set otel insecure flag for all telemetry instantiations (#999) 2024-10-30 17:34:36 -04:00
abelanger5
7ece86dfff fix: start scheduler if old config is used (#989) 2024-10-24 10:52:57 -04:00
abelanger5
2cdee59aea refactor: optimize v0.50.0 release (#975)
- Simplifies architecture for splitting engine services into different components. The three supported services are now `grpc-api`, `scheduler`, and `controllers`. The `grpc-api` service is the only one which needs to be exposed for workers. The other two can run as unexposed services.
- Fixes a set of bugs and race conditions in the `v2` scheduler
- Adds a `lastActive` time to the `Queue` table and includes a migration which sets this `lastActive` time for the most recent 24 hours of queues. Effectively this means that the max scheduling time in a queue is 24 hours. 
- Rewrites the `ListWorkflowsForEvent` query to improve performance and select far fewer rows.
2024-10-23 12:05:16 +00:00
abelanger5
0ec434d62e feat: allow insecure option for otel collector address (#971)
* feat: allow insecure option for otel collector address

* cast to lower
2024-10-16 20:16:22 +00:00
abelanger5
67a96d7166 feat(throughput): single process per queue (#956)
* feat(throughput): single process per queue

* fix data race

* fix: golint and data race on load test

* wrap up initial v2 scheduler

* fix: more debug logs and tighten channel logic/blocking sends

* improved casing on dispatcher and lease manager

* fix: data race on min id

* increase wait on load test, fix data race

* fix: trylock -> lock

* clean up queue when no longer in set

* fix: clean up cache on exit

* ensure cleanup is only called once

* address review comments
2024-10-15 11:05:19 -04:00
Sean Reilly
27736fa30f bulk insert buffering (#913)
Adds bulk inserts to event writes, and adds a generic buffer which can be used by future batch implementations.
2024-10-03 16:26:12 -04:00
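
The generic buffer described in the bulk insert buffering commit (#913) above can be pictured with a minimal sketch: items are collected and handed to a single flush function once a size threshold is reached. This is an illustration only, not the code from that commit. The names `Buffer`, `NewBuffer`, `Add`, and `Flush` are hypothetical, and a production buffer would likely also flush on a timer.

```go
package main

import (
	"fmt"
	"sync"
)

// Buffer is a hypothetical generic batcher (not Hatchet's actual type).
// It collects items and passes them to flushFn in one call once maxSize
// items have accumulated.
type Buffer[T any] struct {
	mu      sync.Mutex
	items   []T
	maxSize int
	flushFn func([]T) error
}

// NewBuffer creates a buffer that flushes after maxSize items.
func NewBuffer[T any](maxSize int, flushFn func([]T) error) *Buffer[T] {
	return &Buffer[T]{maxSize: maxSize, flushFn: flushFn}
}

// Add appends an item and triggers a flush when the buffer is full.
func (b *Buffer[T]) Add(item T) error {
	b.mu.Lock()
	b.items = append(b.items, item)
	full := len(b.items) >= b.maxSize
	b.mu.Unlock()
	if full {
		return b.Flush()
	}
	return nil
}

// Flush hands everything currently buffered to flushFn as one batch.
func (b *Buffer[T]) Flush() error {
	b.mu.Lock()
	batch := b.items
	b.items = nil
	b.mu.Unlock()
	if len(batch) == 0 {
		return nil
	}
	return b.flushFn(batch)
}

func main() {
	// The flush function stands in for a single bulk INSERT statement.
	buf := NewBuffer[string](3, func(batch []string) error {
		fmt.Printf("bulk insert of %d events\n", len(batch))
		return nil
	})
	for _, e := range []string{"a", "b", "c", "d"} {
		_ = buf.Add(e)
	}
	_ = buf.Flush() // write the remainder
}
```

Because the buffer is generic over the item type, the same structure could serve other batched writes, which matches the commit's note about future batch implementations.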
abelanger5
bfb11cac51 fix: always use retention on queues, optional data/worker (#916) 2024-09-27 14:23:14 -04:00
abelanger5
b5014f6b3d chore: more visibility and debug lines for queues (#836)
* chore: more visibility and debug options for queues

* better debug lines on queue repo

* don't log so much in load test
2024-08-29 14:49:24 -04:00
abelanger5
6317f86793 refactor: consolidate partition logic (#826)
* refactor: consolidate partition logic

* fix: race on scheduler

* fix: move partition uuid to db query

* fix: generate
2024-08-27 15:28:53 -04:00
Gabe Ruttner
4ea4712d4d refactor: performance and throughput (#756)
Refactors the queueing logic to be fairly balanced between actions, with each action backed by a separate FIFO queue. Also adds support for priority queueing and custom queues, though those aren't exposed on the API layer yet. Improves throughput to > 5000 tasks/second on a single queue.

---------

Co-authored-by: Alexander Belanger <alexander@hatchet.run>
2024-08-12 14:38:47 +00:00
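
The idea in the performance and throughput commit (#756) above, one FIFO queue per action with fair balancing between actions, can be sketched as a round-robin over per-action queues. This is an in-memory illustration under assumed names (`FairQueue`, `Push`, `Pop`). The actual change operates on database-backed queues and is not reproduced here.

```go
package main

import "fmt"

// FairQueue is a hypothetical sketch: one FIFO per action, drained
// round-robin so a busy action cannot starve the others.
type FairQueue struct {
	order  []string            // actions in first-seen order
	queues map[string][]string // per-action FIFO of task IDs
	next   int                 // index in order to try first on the next Pop
}

// NewFairQueue returns an empty queue set.
func NewFairQueue() *FairQueue {
	return &FairQueue{queues: make(map[string][]string)}
}

// Push appends a task to the FIFO queue of its action.
func (f *FairQueue) Push(action, task string) {
	if _, ok := f.queues[action]; !ok {
		f.order = append(f.order, action)
	}
	f.queues[action] = append(f.queues[action], task)
}

// Pop returns the oldest task of the next non-empty action,
// or false when every queue is empty.
func (f *FairQueue) Pop() (string, bool) {
	for i := 0; i < len(f.order); i++ {
		idx := (f.next + i) % len(f.order)
		action := f.order[idx]
		if q := f.queues[action]; len(q) > 0 {
			f.queues[action] = q[1:]
			f.next = (idx + 1) % len(f.order)
			return q[0], true
		}
	}
	return "", false
}

func main() {
	fq := NewFairQueue()
	fq.Push("email", "a1")
	fq.Push("email", "a2")
	fq.Push("email", "a3")
	fq.Push("resize", "b1")
	for {
		task, ok := fq.Pop()
		if !ok {
			break
		}
		fmt.Println(task)
	}
}
```

In this sketch the single `resize` task is served second rather than waiting behind all three `email` tasks, while tasks within one action keep their FIFO order.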
Gabe Ruttner
b4670af138 Fix qos otel config (#754)
* feat: otel trace id ratio

* feat: rabbitmq qos

* feat: requeue limit

* fix: tests
2024-07-30 18:11:10 -04:00
Gabe Ruttner
b802f9f45f feat: stream by addl meta (#751)
* feat: prop schedule and run

* wip

* fix: filter wfrid

* feat: hangup

* chore: rm debug log

* chore: func name

* fix: cancelled payload

* fix: load

* fix: cleanup the cache

* fix: single proto

* fix: key -> val

* chore: case

* chore: rm dead code

* chore: rm dead code

* feat: go and docs

* fix: docs
2024-07-29 19:09:51 +00:00
Gabe Ruttner
ad29edb44f fix: partitioned semaphore resolver (#731)
* fix: partition and improve query

* feat: paginate until done

* chore: address comments

* fix: write partitions
2024-07-18 11:06:25 -04:00
Gabe Ruttner
b7cec9ec53 feat: soft delete (#717)
* feat: soft delete workflows and versions

* feat: filter soft deletes wf and wfr

* feat: filter events and step runs

* fix: query

* fix: query

* chore: generate

* wip

* chore: squash migrations

* chore: separate retention into new service

* feat: regularly clean up

* chore: migrations

* fix: tests

* fix: queries

* fix: ambiguous

* fix: refs

* fix: ambiguous id

* fix: remove update from

* fix: soft delete

* fix: cleanup retention scheduler

* fix: has more query

* chore: gen

* fix: query

* fix: table
2024-07-18 09:06:05 -04:00
abelanger5
8f8f3ad287 fix: reduce max throughput of requeue (#713)
* fix: reduce max throughput of requeue

* fix: reassign query

* fix: move step run timeout to partition model

* fix: partitioning queries and index

* better logs on requeue

* fix: inactive rebalance and get step run for engine query

* fix: correct inactive queries
2024-07-12 14:03:55 -04:00
abelanger5
c2debe62d8 fix: add back deprecated service names and fix webhook worker query (#660) 2024-06-27 08:01:02 -04:00
abelanger5
f2c6bc1f44 feat: tenant partitioning (#649)
* feat: tenant partitioning

* fix: rebalance inactive partitions, split into separate partitioner

* fix: shutdown partitioner scheduler properly

* update config options

* fix: config options linting
2024-06-26 21:06:51 +00:00
Gabe Ruttner
a8d42819ea feat: check security service (#639)
* feat: check security service

* feat: propagate version

* feat: with ident

* fix: lint

* chore: generate

* fix: change domain

* fix: panic recover

* fix: migrations

* fix: hash

* fix: dont check in tests
2024-06-26 16:26:29 -04:00
abelanger5
d19e299d1e refactor: make engine runnable with config instead of loader (#640)
* refactor: make hatchet-engine runnable programmatically

* feat: export teardown name and fn
2024-06-26 08:14:30 -04:00
Luca Steeb
1490d88954 feat: webhook workers (#542)
Adds serverless support via the concept of webhook workers. Allows any webhook to be registered as a serverless endpoint for executing a step.
2024-06-25 17:06:43 -04:00
abelanger5
7c3ddfca32 feat: api server extensions (#614)
* feat: allow extending the api server

* chore: remove internal packages to pkg

* chore: update db_gen.go

* fix: expose auth

* fix: move logger to pkg

* fix: don't generate gitignore for prisma client

* fix: allow extensions to register their own api spec

* feat: expose pool on server config

* fix: nil pointer exception on empty opts

* fix: run.go file
2024-06-19 09:36:13 -04:00
Gabe Ruttner
bbc4e58dd9 feat: limits (#559)
* feat: workflow run limits

* fix: resource exhausted 429

* feat: event limit

* feat: worker limit

* fix: sensible error

* fix: pb

* feat: expose limits api

* feat: default limits

* feat: add enable alert option

* feat: slack and email alerts

* fix: cron interval

* feat: make metered util

* wip: schedules and crons

* chore: squash migration

* fix: select or insert

* fix: remove unfinished meter

* chore: atlas migration

* fix: template format

* fix: shared ErrResourceExhausted

* feat: cache

* fix: limit can be nil

* fix: clarification

* fix: close meter ticker

* fix: friendly error for child workflows
2024-06-07 10:57:57 -07:00
abelanger5
68a79fe071 fix: handle nil input more gracefully (#486) 2024-05-13 13:07:41 -04:00
abelanger5
b50ed62924 feat: alerting from slack and email (#461)
* feat: alerting. implements slack alerting, email, and refactors tenant settings to make them more manageable

* chore: generate

* chore: generate sqlc after migrate
2024-05-08 10:04:58 -04:00
abelanger5
e0d363e796 chore: intercept grpc errors and don't send internal to client (#370) 2024-04-10 19:03:18 -04:00
Gabe Ruttner
d8b6843dec feat: streaming events (#309)
* feat: add stream event model

* docs: how to work with db models

* feat: put stream event

* chore: rm comments

* feat: add stream resource type

* feat: enqueue stream event

* fix: contracts

* feat: protos

* chore: set properties correctly for typing

* fix: stream example

* chore: rm old example

* fix: async on

* fix: bytea type

* fix: worker

* feat: put stream data

* feat: stream type

* fix: correct queue

* feat: streaming payloads

* fix: cleanup

* fix: validation

* feat: example file streaming

* chore: rm unused query

* fix: tenant check and read only consumer

* fix: check tenant-steprun relation

* Update prisma/schema.prisma

Co-authored-by: abelanger5 <belanger@sas.upenn.edu>

* chore: generate protos

* chore: rename migration

* release: 0.20.0

* feat(go-sdk): implement streaming in go

---------

Co-authored-by: gabriel ruttner <gabe@hatchet.run>
Co-authored-by: abelanger5 <belanger@sas.upenn.edu>
2024-04-01 15:46:21 -04:00