Commit Graph

611 Commits

Author SHA1 Message Date
Gabe Ruttner
1e2a587b21 fix: GetLatestWorkflowVersionForWorkflows (#2590)
* fix query

* gen
2025-12-02 05:14:08 -08:00
Sid Premkumar
709dd89a18 Add gzip compression (#2539)
* Add gzip compression init

* revert

* Feat: Initial cross-domain identify setup (#2533)

* feat: initial setup

* fix: factor out

* chore: lint

* fix: xss vuln

* feat: set up properly

* fix: lint

* fix: key

* fix: keys, cleanup

* Fix: use sessionStorage instead of localStorage (#2541)

* chore(deps): bump golang.org/x/crypto from 0.44.0 to 0.45.0 (#2545)

Bumps [golang.org/x/crypto](https://github.com/golang/crypto) from 0.44.0 to 0.45.0.
- [Commits](https://github.com/golang/crypto/compare/v0.44.0...v0.45.0)

---
updated-dependencies:
- dependency-name: golang.org/x/crypto
  dependency-version: 0.45.0
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* chore(deps): bump google/osv-scanner-action/.github/workflows/osv-scanner-reusable-pr.yml (#2547)

Bumps [google/osv-scanner-action/.github/workflows/osv-scanner-reusable-pr.yml](https://github.com/google/osv-scanner-action) from 2.2.4 to 2.3.0.
- [Release notes](https://github.com/google/osv-scanner-action/releases)
- [Commits](https://github.com/google/osv-scanner-action/compare/v2.2.4...v2.3.0)

---
updated-dependencies:
- dependency-name: google/osv-scanner-action/.github/workflows/osv-scanner-reusable-pr.yml
  dependency-version: 2.3.0
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* [Go SDK] Resubscribe and get a new listener stream when gRPC connections fail (#2544)

* fix listener cache issue to resubscribe when erroring out

* worker retry message clarification (#2543)

* add another retry layer and add comments

* fix loop logic

* make listener channel retry

* Compression test utils, and add log to indicate it's enabled

* clean + fix

* more fallbacks

* common pgxpool afterconnect method (#2553)

* remove

* lint

* lint

* add cpu monitor during test

* fix background monitor and lint

* Make envvar to disable compression

* cleanup monitoring

* PR Feedback

* Update paths in compression tests + bump package versions

* path issue on test script

---------

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: matt <mrkaye97@gmail.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Mohammed Nafees <hello@mnafees.me>
2025-11-26 17:14:38 -05:00
matt
be9c7df026 Fix: Noisy Payload Error (#2561)
* fix: noisy error

* fix: only error if the task is completed but has no payload
2025-11-26 17:04:37 -05:00
Gabe Ruttner
3e5f737ef5 fix: query optimization get latest workflow version (#2576) 2025-11-26 08:56:20 -08:00
Gabe Ruttner
c920d54519 analyze v1 lookup table (#2568)
Co-authored-by: matt <mrkaye97@gmail.com>
2025-11-25 17:25:40 -05:00
matt
727a8fe470 Fix: OLAP Task Event Dual Write Bug (#2572)
* fix: task events bug

* fix: fallback bug

* fix: simplify test
2025-11-25 17:24:56 -05:00
matt
8350cb2205 Revert "optimize UUID sqlchelpers (#2532)" (#2571)
This reverts commit 9a09105e52.
2025-11-25 12:10:34 -05:00
Mohammed Nafees
9a09105e52 optimize UUID sqlchelpers (#2532) 2025-11-24 16:50:21 +01:00
Mohammed Nafees
7bb3e1da8d common pgxpool afterconnect method (#2553) 2025-11-21 14:55:04 +01:00
Mohammed Nafees
f66fe63ad0 [Go SDK] Resubscribe and get a new listener stream when gRPC connections fail (#2544)
* fix listener cache issue to resubscribe when erroring out

* worker retry message clarification (#2543)

* add another retry layer and add comments

* fix loop logic

* make listener channel retry
2025-11-20 19:13:24 +01:00
abelanger5
2249ef3b79 fix: small scheduler optimizations (#2426)
* fix: actually increment snapshot count

* add a context with timeout to wrap replenish
2025-11-17 15:45:14 -05:00
matt
62a163d835 Fix: Revert n+1 queries on the list API (#2531)
* feat: revert query

* feat: revert n+1 query

* feat: revert another n+1 query

* fix: payloads
2025-11-17 10:54:05 -05:00
Mohammed Nafees
49b11b2548 Fix seq scan in PollCronSchedules query (#2524)
* fix seq scan

* new CTE

* fmt
2025-11-14 17:15:39 +01:00
Mohammed Nafees
8d47de193b Attempt to fix pgx multi dimensional slice reflection error #1 (#2523)
* multi dim slice pgx reflection error

* make sure to maintain the cardinality

* fix nil
2025-11-14 16:54:26 +01:00
Mohammed Nafees
f97171f245 [Go SDK] Case on worker labels for durable tasks (#2511)
* fix durable task worker labels

* fix assignment
2025-11-12 18:32:58 +01:00
Jishnu
e82915b626 feat: add pagination support for V1LogLineList (#2354)
* feat: pagination for v1 loglines list

* add: sqlc v1 query for loglines count

* add: generated rest-client changes for python sdk

* refactor: frontend logs UI with pagination elements

* add: ts-sdk logline pagination, py logline list pagination docstring

* feat: add since queryparam for v1logline, add infinitescroll pagination on FE

* add custom polling for logs refresh on FE, remove inefficient default refresh logic

* add since queryparam of v1logline to all rest-clients

* refactor: remove offset query param, add until query param(v1logline)

* remove pagination from v1loglinelist

* fix: inconsistent scroll behaviour, add pagination response schema on v1loglist

* add: infinite scroll behavior for smooth log scrolling; prefetch next page logs in advance

* fix: pagination scroll, when task is running, remove stale pagination data when logs tab inactive

* chore: lint

* chore: lint

---------

Co-authored-by: mrkaye97 <mrkaye97@gmail.com>
2025-11-07 17:38:29 +01:00
matt
2824646ad7 Immediate Payload Offloads OLAP Wiring (#2492)
* feat: payload store updates for immediate offloads

* feat: handle immediate offloads

* feat: start wiring up immediate offloads

* fix: get rid of payload store return

* feat: start immediate offloads work

* fix: event trigger put call

* fix: dynamic payload put depending on if offload worked

* fix: rm put

* fix: write event payload from the right place

* fix: dummy id for task events to prevent duplication issues with the tasks themselves

* fix: rm comments

* fix: rm unused struct

* fix: enabled wal

* fix: rm `RETURNING`

* fix: small cleanup

* fix: wal issue
2025-11-07 17:38:10 +01:00
Mohammed Nafees
c5496184be pass labels to durable worker (#2504) 2025-11-07 16:10:01 +01:00
matt
7fe9806f5d Feat: Configurable OLAP status update size limits (#2499)
* feat: configurable status updates

* fix: config

* fix: wiring

* feat: export limits from olap

* fix: param drilling
2025-11-06 13:37:40 -05:00
Mohammed Nafees
57ad1af68d fix: deadlocks on trigger, olap prometheus background worker, otel improvements (#2475)
* print error log temporarily

* casing

* only for create-monitoring-event

* rate limit iterator

* add a debugger

* remove rate limiter

* improve otel on trigger

* cache probability stuff

* track misses

* move down one ln

* default

* Fix: Pass tx down into payload retrieve (#2483)

* [Python] Feat: Dataclass Support (#2476)

* fix: prevent lifespan error from hanging worker

* fix: handle cleanup

* feat: dataclass outputs

* feat: dataclasses

* feat: incremental dataclass work

* feat: dataclass tests

* fix: lint

* fix: register wf

* fix: ugh

* chore: changelog

* fix: validation issue

* fix: none check

* fix: lint

* fix: error type

* chore: regenerate examples (#2477)

Co-authored-by: GitHub Action <action@github.com>

* feat: add health and metrics api on typescript sdk worker (#2457)

* feat: add health and metrics api on typescript sdk worker

add: prom-client to fetch metrics data
add: track health status of worker across different states

* refactor: keep prom-client as optional dependency

* refactor: remove async import of prom-client

* chore: update package version for ts sdk

* fix: lint

* fix: lint, const enum

---------

Co-authored-by: mrkaye97 <mrkaye97@gmail.com>

* Update frontend onboarding steps (#2478)

* Update frontend onboarding steps

* Update sidebar as well

* Fix Go SDK cron inputs (#2481)

* cron input in Go SDK

* add example

* fix: pass tx down to retrieve

* fix: attempt 2, another pool use

* fix: spans and debugging for task statuses

* attempted hotfix on olap statuses

* process tenants in parallel in prom worker

---------

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: GitHub Action <action@github.com>
Co-authored-by: Jishnu <jishnun789@gmail.com>
Co-authored-by: Sid Premkumar <sid.premkumar@gmail.com>
Co-authored-by: Mohammed Nafees <hello@mnafees.me>
Co-authored-by: Alexander Belanger <alexander@hatchet.run>

* move debugger package, clean up init

* remove probability factor logic

* remove debug

* fix: debugger instantiation

---------

Co-authored-by: Alexander Belanger <alexander@hatchet.run>
Co-authored-by: gabriel ruttner <gabriel.ruttner@gmail.com>
Co-authored-by: mrkaye97 <mrkaye97@gmail.com>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: GitHub Action <action@github.com>
Co-authored-by: Jishnu <jishnun789@gmail.com>
Co-authored-by: Sid Premkumar <sid.premkumar@gmail.com>
2025-11-04 09:05:44 +01:00
Mohammed Nafees
861e205171 Fix Go SDK cron inputs (#2481)
* cron input in Go SDK

* add example
2025-11-02 18:00:23 +01:00
abelanger5
7603b5ef39 feat: add grpc otel spans, better tx debugging (#2474)
* feat: add grpc otel spans, better tx debugging

* fix: ctx
2025-10-31 18:55:42 +01:00
matt
c33091e815 fix: include payload partitions in olap partitions to drop (#2472) 2025-10-31 10:39:39 +01:00
matt
99544bbd4e Fix: read payloads from payload store for event API (#2471)
* fix: read payloads from payload store

* debug: add log

* debug: more log lines

* fix: bug

* fix: rm debug lines

* fix: comment loc
2025-10-31 00:57:36 +01:00
matt
4700c42183 fix: re-enable writes (#2469) 2025-10-31 00:11:43 +01:00
abelanger5
3a27bdf7cb fix: don't send expiry alert on internal proxy tokens (#2468) 2025-10-30 23:17:56 +01:00
Mohammed Nafees
1aabbe3e94 Run cleanup on more tables (#2467)
* cleanup more tables

* use task retention period

* use task retention period

* cleanup

* fix query
2025-10-30 23:17:36 +01:00
Mohammed Nafees
bc3dc53433 no need to check for partitions when updating them (#2466) 2025-10-30 22:13:46 +01:00
Mohammed Nafees
b58359d7b3 Do not run cleanup on v1_workflow_concurrency_slot (#2463)
* do not run cleanup on v1_concurrency_slot

* fix health endpoints for engine
2025-10-30 15:34:50 +01:00
Mohammed Nafees
91cdb28ddf Logs for liveness and readiness endpoints + PG conn stats (#2460)
* error logs for liveness and readiness endpoints with pg information

* use context timeout of 3 seconds

* context
2025-10-30 14:35:02 +01:00
abelanger5
745918ba2c fix: reduce status update limits from 10k -> 1k (#2462)
* reduce status update limits from 10k -> 1k

* remove comment
2025-10-30 14:34:03 +01:00
Sid Premkumar
4f7a8da580 Add support for non-wal payload store logic to skip main db (#2445) 2025-10-29 07:24:11 +01:00
Mohammed Nafees
f1eccfddf4 [hotfix] Fix running task stats without concurrency keys (#2452)
* fix task stats running

* formatting

* if block fix
2025-10-28 22:19:52 +01:00
Mohammed Nafees
56eb054a1e New tenant task stats endpoint (#2433)
* tenant workflow stats endpoint

* not olap but task

* lint

* enable rate limiting on endpoint

* fix SQL query

* spelling

* fewer CTEs

* fix query

* rename to task

* update query

* fix nil pointer

* typed API object

* queues have counts
2025-10-28 16:52:19 +01:00
Mohammed Nafees
54701e87d0 Retry RMQ messages indefinitely with aggressive logging after 5 retries (#2448)
* aggressively log errors when rmq retry more than 5 times

* revisit comments

* new vars and fix integration test

* fix test

* log only after 5 retries
2025-10-28 16:51:39 +01:00
abelanger5
e1fdeeaf1c fix: payload performance (#2441)
* change some olap flush settings

* increase timeouts for payload wal

* fix: improve performance of payload wal metrics

* slight updates

* more small tweaks

* undo some olap changes, don't offload some payloads

* remove double reads

* try reducing wal poll limit

* analyze v1_dag

* move partition method
2025-10-23 17:45:49 -04:00
Mohammed Nafees
cf5c5989ff add vars to tune concurrency poller (#2428) 2025-10-23 11:36:12 -04:00
abelanger5
1f35782b59 fix: move err check to before len check (#2437) 2025-10-21 19:24:19 -04:00
matt
c6e154fd03 Feat: OLAP Payloads (#2410)
* feat: olap payloads table

* feat: olap queue messages for payload puts

* feat: wire up writes on task write

* driveby: add + ignore psql-connect

* fix: down migration

* fix: use external id for pk

* fix: insert query

* fix: more external ids

* fix: bit more cleanup

* feat: dags

* fix: the rest of the refs

* fix: placeholder uuid

* fix: write external ids

* feat: wire up messages over the queue

* fix: panic

* Revert "fix: panic"

This reverts commit c0adccf2ea.

* Revert "feat: wire up messages over the queue"

This reverts commit 36f425f3c1.

* fix: rm unused method

* fix: rm more

* fix: rm cruft

* feat: wire up failures

* feat: start wiring up completed events

* fix: more wiring

* fix: finish wiring up completed event payloads

* fix: lint

* feat: start wiring up external ids in the core

* feat: olap pub

* fix: add returning

* fix: wiring

* debug: log lines for pubs

* fix: external id writes

* Revert "debug: log lines for pubs"

This reverts commit fe430840bd.

* fix: rm sample

* debug: rm pub buffer param

* Revert "debug: rm pub buffer param"

This reverts commit b42a5cacbb.

* debug: stuck queries

* debug: more logs

* debug: yet more logs

* fix: rename BulkRetrieve -> Retrieve

* chore: lint

* fix: naming

* fix: conn leak in putpayloads

* fix: revert debug

* Revert "debug: more logs"

This reverts commit 95da7de64f.

* Revert "debug: stuck queries"

This reverts commit 8fda64adc4.

* feat: improve getters, olap getter

* fix: key type

* feat: first pass at pulling olap payloads from the payload store

* fix: start fixing bugs

* fix: start reworking `includePayloads` param

* fix: include payloads wiring

* feat: analyze for payloads

* fix: simplify writes more + write event payloads

* feat: read out event payloads

* feat: env vars for dual writes

* refactor: clean up task prop drilling a bit

* feat: add include payloads params to python for tests

* fix: tx commit

* fix: dual writes

* fix: not null constraint

* fix: one more

* debug: logging

* fix: more debugging, tweak function sig

* fix: function sig

* fix: refs

* debug: more logging

* debug: more logging

* debug: fix condition

* debug: overwrite properly

* fix: revert debug

* fix: rm more drilling

* fix: comments

* fix: partitioning jobs

* chore: ver

* fix: bug, docs

* hack: dummy id and inserted at for payload offloads

* fix: bug

* fix: no need to handle offloads for task event data

* hack: jitter + current ts

* fix: short circuit

* fix: offload payloads in a tx

* fix: uncomment sampling

* fix: don't offload if external store is disabled

* chore: gen sqlc

* fix: migration

* fix: start reworking types

* fix: couple more

* fix: rm unused code

* fix: drill includePayloads down again

* fix: silence annoying error in some cases

* fix: always store payloads

* debug: use workflow run id for input

* fix: improve logging

* debug: logging on retrieve

* debug: task input

* fix: use correct field

* debug: write even null payloads to limit errors

* debug: hide error lines

* fix: quieting more errors

* fix: duplicate example names, remove print lines

* debug: add logging for olap event writes

* hack: immediate event offloads and cutovers

* fix: rm log line

* fix: import

* fix: short circuit events

* fix: duped names
2025-10-20 09:09:49 -04:00
Mohammed Nafees
8f57989730 fix race condition in child spawn (#2429) 2025-10-17 16:56:41 +02:00
Mohammed Nafees
e2b1f1353e Fix OTel span attribute naming convention (#2409)
* rename spans according to convention

* low cardinality
2025-10-16 18:43:40 +02:00
Mohammed Nafees
d9268c7270 Cleanup job for old and invalid entries (#2378)
* auto run table cleanup

* batched cleanup of tables

* address PR comments

* fix timeout

* update queries

* fix shouldContinue

* also call cleanup for v1_workflow_concurrency_slot

* fix comment

* comment fix
2025-10-16 16:51:08 +02:00
matt
aa38c6d2df fix: payload fallback for child runs (#2421) 2025-10-15 16:16:51 -04:00
abelanger5
b16be655be feat: stateful polling intervals (#2417)
* initial pass on stateful intervals

* pr review comments + add evict expired idempotency keys

* fix: goroutine leak and name vars better

* fix some cleanup logic
2025-10-15 11:40:22 -04:00
matt
5b5adcb8ed Feat: Scheduled run detail view, bulk cancel / replay with pagination helper (#2416)
* feat: endpoint for listing external ids

* feat: wire up external id list

* chore: regen api

* feat: py sdk wrapper

* fix: since type

* fix: log

* fix: improve defaults for statuses

* feat: docs

* feat: docs

* fix: rm extra file

* feat: add id column to scheduled runs

* feat: side panel for scheduled runs

* fix: side panel header pinned

* fix: border + padding

* chore: gen

* chore: lint

* chore: changelog, version

* fix: spacing of cols

* fix: empty webhook resource limit

* fix: tsc

* fix: sort organizations and tenants alphabetically
2025-10-15 11:36:45 -04:00
Mohammed Nafees
a750ce950d Introduce vars to tune ANALYZE job gocron run intervals (#2407)
* introduce vars to tune ANALYZE job gocron run intervals

* update config doc

* fix assignment
2025-10-10 11:02:10 +02:00
Mohammed Nafees
0695db820c Use UTC for all pgx connections and check for database TZ (#2398)
* set utc for all pgx sessions

* helper func

* also accept IANA Etc/UTC
2025-10-09 10:54:27 +02:00
matt
d677cb2b08 feat: gzip compression for large payloads, persistent OLAP writes (#2368)
* debug: remove event pub

* add additional spans to publish message

* debug: don't publish payloads

* fix: persistent messages on olap

* add back other payloads

* remove pub buffers temporarily

* fix: correct queue

* hacky partitioning

* add back pub buffers to scheduler

* don't send no worker events

* add attributes for queue name and message id to publish

* add back pub buffers to grpc api

* remove pubs again, no worker writes though

* task processing queue hashes

* remove payloads again

* gzip compression over 5kb

* add back task controller payloads

* add back no worker requeueing event, with expirable lru cache

* add back pub buffers

* remove hash partitioned queues

* small fixes

* ignore lru cache top fn

* config vars for compression, disable by default

---------

Co-authored-by: Alexander Belanger <alexander@hatchet.run>
2025-10-08 11:44:04 -04:00
matt
c48a3211b5 Feat: Immediate Payload Offloads (#2375)
* feat: modify operations

* feat: attempt 1 at doing the cutover + the offload in the same query

* fix: operation write

* debug: add some print lines

* fix: check constraint

* fix: select records to offload properly

* fix: fn

* feat: add second table to hold queued cutovers

* fix: start reworking queries

* fix: select

* fix: missing cols

* fix: for update

* fix: query name for finalize

* feat: cut over query finalizer

* feat: query for writes into cutover queue

* feat: add query for cut over polling

* feat: add cutover job

* fix: rm operations

* feat: write cutover queue items at the same time as setting payload keys

* fix: simplify into single query

* fix: revert debug

* chore: lint

* fix: don't remove operation column yet

* feat: refactor into struct of opts and make job intervals configurable

* fix: add analyze for payload table

* fix: schema copy paste

* fix: drop fk

* feat: add an index to help with poll performance for a short while

* fix: simplify poll ordering

* fix: simplify more

* fix: ctx

Co-authored-by: Mohammed Nafees <hello@mnafees.me>

* Feat: Task Event and DAG Payloads (#2370)

* feat: initial work on task event payloads

* fix: iterator

* feat: wire up task events

* fix: backwards compat

* fix: migrations

* fix: duplication

* fix: col

* fix: add timestamptz col

* fix: overwrite

* fix: rm debugging

* fix: revert debugging

* fix: rm unused cols

* fix: spelling

* fix: use `current_timestamp` as default

* feat: dual writes for payloads

* fix: improve debug lines

* debug: add log

* debug: always write

* fix: make annoying log debug level

* fix: rm debug lines

* fix: add comment

* feat: dag payloads

* fix: index

* fix: migration ver

* fix: error msg

Co-authored-by: abelanger5 <belanger@sas.upenn.edu>

* fix: create, then set default

* fix: inserted at copy paste

* fix: n+1 query

* fix: another n+1 query

* fix: rm unused singleton retrieve

---------

Co-authored-by: abelanger5 <belanger@sas.upenn.edu>

---------

Co-authored-by: Mohammed Nafees <hello@mnafees.me>
Co-authored-by: abelanger5 <belanger@sas.upenn.edu>
2025-10-08 11:22:34 -04:00
matt
8fd90a29a6 Feat: Pausable Crons (#2395)
* feat: update query, patch route

* feat: api for update

* fix: simplify ui a bit

* feat: wire up fe

* feat: improve copy, spinners

* fix: invert naming to avoid horrible double negative

* fix: improve handling of optional types

* fix: last bits of naming

* feat: persist enabled flag across workflow versions properly

* fix: update spinner
2025-10-08 11:12:14 -04:00