Mohammed Nafees
3f28a6c45b
Fix nil error in handleTaskBulkAssignedTask ( #2427 )
...
* continue from loop when bulk retrieve for payload fails
* nil map fix
* revert space
2025-11-14 21:33:42 +05:30
matt
7fe9806f5d
Feat: Configurable OLAP status update size limits ( #2499 )
...
* feat: configurable status updates
* fix: config
* fix: wiring
* feat: export limits from olap
* fix: param drilling
2025-11-06 13:37:40 -05:00
Mohammed Nafees
57ad1af68d
fix: deadlocks on trigger, olap prometheus background worker, otel improvements ( #2475 )
...
* print error log temporarily
* casing
* only for create-monitoring-event
* rate limit iterator
* add a debugger
* remove rate limiter
* improve otel on trigger
* cache probability stuff
* track misses
* move down one ln
* default
* Fix: Pass tx down into payload retrieve (#2483 )
* [Python] Feat: Dataclass Support (#2476 )
* fix: prevent lifespan error from hanging worker
* fix: handle cleanup
* feat: dataclass outputs
* feat: dataclasses
* feat: incremental dataclass work
* feat: dataclass tests
* fix: lint
* fix: register wf
* fix: ugh
* chore: changelog
* fix: validation issue
* fix: none check
* fix: lint
* fix: error type
* chore: regenerate examples (#2477 )
Co-authored-by: GitHub Action <action@github.com >
* feat: add health and metrics api on typescript sdk worker (#2457 )
* feat: add health and metrics api on typescript sdk worker
add: prom-client to fetch metrics data
add: track health status of worker across different states
* refactor: keep prom-client as optional dependency
* refactor: remove async import of prom-client
* chore: update package version for ts sdk
* fix: lint
* fix: lint, const enum
---------
Co-authored-by: mrkaye97 <mrkaye97@gmail.com >
* Update frontend onboarding steps (#2478 )
* Update frontend onboarding steps
* Update sidebar as well
* Fix Go SDK cron inputs (#2481 )
* cron input in Go SDK
* add example
* fix: pass tx down to retrieve
* fix: attempt 2, another pool use
* fix: spans and debugging for task statuses
* attempted hotfix on olap statuses
* process tenants in parallel in prom worker
---------
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: GitHub Action <action@github.com >
Co-authored-by: Jishnu <jishnun789@gmail.com >
Co-authored-by: Sid Premkumar <sid.premkumar@gmail.com >
Co-authored-by: Mohammed Nafees <hello@mnafees.me >
Co-authored-by: Alexander Belanger <alexander@hatchet.run >
* move debugger package, clean up init
* remove probability factor logic
* remove debug
* fix: debugger instantiation
---------
Co-authored-by: Alexander Belanger <alexander@hatchet.run >
Co-authored-by: gabriel ruttner <gabriel.ruttner@gmail.com >
Co-authored-by: mrkaye97 <mrkaye97@gmail.com >
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: GitHub Action <action@github.com >
Co-authored-by: Jishnu <jishnun789@gmail.com >
Co-authored-by: Sid Premkumar <sid.premkumar@gmail.com >
2025-11-04 09:05:44 +01:00
abelanger5
7603b5ef39
feat: add grpc otel spans, better tx debugging ( #2474 )
...
* feat: add grpc otel spans, better tx debugging
* fix: ctx
2025-10-31 18:55:42 +01:00
Mohammed Nafees
ed4c0327ce
[hotfix] Meaningful casing for engine liveness and readiness probes ( #2465 )
...
* more fixes for engine live and ready probes
* rename
* no need to set it to false
* fix casing health check
* log onlt when not shutting down
2025-10-30 20:24:33 +01:00
Mohammed Nafees
dc77404030
increase timeout and log more ( #2464 )
2025-10-30 19:08:25 +01:00
Mohammed Nafees
b58359d7b3
Do not run cleanup on v1_workflow_concurrency_slot ( #2463 )
...
* do not run cleanup on v1_concurrency_slot
* fix health endpoints for engine
2025-10-30 15:34:50 +01:00
Mohammed Nafees
91cdb28ddf
Logs for liveness and readiness endpoints + PG conn stats ( #2460 )
...
* error logs for liveness and readiness endpoints with pg information
* use context timeout of 3 seconds
* context
2025-10-30 14:35:02 +01:00
Mohammed Nafees
54701e87d0
Retry RMQ messages indefinitely with aggressive logging after 5 retries ( #2448 )
...
* aggressively log errors when rmq retry more than 5 times
* revisit comments
* new vars and fix integration test
* fix test
* log only after 5 retries
2025-10-28 16:51:39 +01:00
Mohammed Nafees
8412a985e3
increase timeout to 30 seconds ( #2449 )
2025-10-28 16:50:36 +01:00
Mohammed Nafees
c3a1ac621d
fix confusing error ( #2447 )
2025-10-27 15:58:53 -04:00
abelanger5
e1fdeeaf1c
fix: payload performance ( #2441 )
...
* change some olap flush settings
* increase timeouts for payload wal
* fix: improve performance of payload wal metrics
* slight updates
* more small tweaks
* undo some olap changes, don't offload some payloads
* remove double reads
* try reducing wal poll limit
* analyze v1_dag
* move partition method
2025-10-23 17:45:49 -04:00
Mohammed Nafees
bd95b78f38
run cleanup job every minute ( #2440 )
2025-10-23 14:45:10 -04:00
matt
c6e154fd03
Feat: OLAP Payloads ( #2410 )
...
* feat: olap payloads table
* feat: olap queue messages for payload puts
* feat: wire up writes on task write
* driveby: add + ignore psql-connect
* fix: down migration
* fix: use external id for pk
* fix: insert query
* fix: more external ids
* fix: bit more cleanup
* feat: dags
* fix: the rest of the refs
* fix: placeholder uuid
* fix: write external ids
* feat: wire up messages over the queue
* fix: panic
* Revert "fix: panic"
This reverts commit c0adccf2ea .
* Revert "feat: wire up messages over the queue"
This reverts commit 36f425f3c1 .
* fix: rm unused method
* fix: rm more
* fix: rm cruft
* feat: wire up failures
* feat: start wiring up completed events
* fix: more wiring
* fix: finish wiring up completed event payloads
* fix: lint
* feat: start wiring up external ids in the core
* feat: olap pub
* fix: add returning
* fix: wiring
* debug: log lines for pubs
* fix: external id writes
* Revert "debug: log lines for pubs"
This reverts commit fe430840bd .
* fix: rm sample
* debug: rm pub buffer param
* Revert "debug: rm pub buffer param"
This reverts commit b42a5cacbb .
* debug: stuck queries
* debug: more logs
* debug: yet more logs
* fix: rename BulkRetrieve -> Retrieve
* chore: lint
* fix: naming
* fix: conn leak in putpayloads
* fix: revert debug
* Revert "debug: more logs"
This reverts commit 95da7de64f .
* Revert "debug: stuck queries"
This reverts commit 8fda64adc4 .
* feat: improve getters, olap getter
* fix: key type
* feat: first pass at pulling olap payloads from the payload store
* fix: start fixing bugs
* fix: start reworking `includePayloads` param
* fix: include payloads wiring
* feat: analyze for payloads
* fix: simplify writes more + write event payloads
* feat: read out event payloads
* feat: env vars for dual writes
* refactor: clean up task prop drilling a bit
* feat: add include payloads params to python for tests
* fix: tx commit
* fix: dual writes
* fix: not null constraint
* fix: one more
* debug: logging
* fix: more debugging, tweak function sig
* fix: function sig
* fix: refs
* debug: more logging
* debug: more logging
* debug: fix condition
* debug: overwrite properly
* fix: revert debug
* fix: rm more drilling
* fix: comments
* fix: partitioning jobs
* chore: ver
* fix: bug, docs
* hack: dummy id and inserted at for payload offloads
* fix: bug
* fix: no need to handle offloads for task event data
* hack: jitter + current ts
* fix: short circuit
* fix: offload payloads in a tx
* fix: uncomment sampling
* fix: don't offload if external store is disabled
* chore: gen sqlc
* fix: migration
* fix: start reworking types
* fix: couple more
* fix: rm unused code
* fix: drill includePayloads down again
* fix: silence annoying error in some cases
* fix: always store payloads
* debug: use workflow run id for input
* fix: improve logging
* debug: logging on retrieve
* debug: task input
* fix: use correct field
* debug: write even null payloads to limit errors
* debug: hide error lines
* fix: quieting more errors
* fix: duplicate example names, remove print lines
* debug: add logging for olap event writes
* hack: immediate event offloads and cutovers
* fix: rm log line
* fix: import
* fix: short circuit events
* fix: duped names
2025-10-20 09:09:49 -04:00
Mohammed Nafees
e2b1f1353e
Fix OTel span attribute naming convention ( #2409 )
...
* rename spans according to convention
* low cardinality
2025-10-16 18:43:40 +02:00
Mohammed Nafees
d9268c7270
Cleanup job for old and invalid entries ( #2378 )
...
* auto run table cleanup
* batched cleanup of tables
* address PR comments
* fix timeout
* update queries
* fix shouldContinue
* also call cleanup for v1_workflow_concurrency_slot
* fix comment
* comment fix
2025-10-16 16:51:08 +02:00
matt
fb6720a909
Hotfix: Swallow idempotency key error for scheduled runs ( #2425 )
...
* fix: attempt to swallow idempotency error on dupe
* fix: format a little better
* fix: comment
* fix: return early
* fix: comment
* fix: magic string
* fix: log level
* fix: comment
2025-10-16 09:32:41 -04:00
abelanger5
b16be655be
feat: stateful polling intervals ( #2417 )
...
* initial pass on stateful intervals
* pr review comments + add evict expired idempotency keys
* fix: goroutine leak and name vars better
* fix some cleanup logic
2025-10-15 11:40:22 -04:00
Mohammed Nafees
a750ce950d
Introduce vars to tune ANALYZE job gocron run intervals ( #2407 )
...
* introduce cars to tune ANALYZE job gocron run intervals
* update config doc
* fix assignment
2025-10-10 11:02:10 +02:00
matt
d677cb2b08
feat: gzip compression for large payloads, persistent OLAP writes ( #2368 )
...
* debug: remove event pub
* add additional spans to publish message
* debug: don't publish payloads
* fix: persistent messages on olap
* add back other payloads
* remove pub buffers temporarily
* fix: correct queue
* hacky partitioning
* add back pub buffers to scheduler
* don't send no worker events
* add attributes for queue name and message id to publish
* add back pub buffers to grpc api
* remove pubs again, no worker writes though
* task processing queue hashes
* remove payloads again
* gzip compression over 5kb
* add back task controller payloads
* add back no worker requeueing event, with expirable lru cache
* add back pub buffers
* remove hash partitioned queues
* small fixes
* ignore lru cache top fn
* config vars for compression, disable by default
---------
Co-authored-by: Alexander Belanger <alexander@hatchet.run >
2025-10-08 11:44:04 -04:00
matt
c48a3211b5
Feat: Immediate Payload Offloads ( #2375 )
...
* feat: modify operations
* feat: attempt 1 at doing the cutover + the offload in the same query
* fix: operation write
* debug: add some print lines
* fix: check constraint
* fix: select records to offload properly
* fix: fn
* feat: add second table to hold queued cutovers
* fix: start reworking queries
* fix: select
* fix: missing cols
* fix: for update
* fix: query name for finalize
* feat: cut over query finalizer
* feat: query for writes into cutover queue
* feat: add query for cut over polling
* feat: add cutover job
* fix: rm operations
* feat: write cutover queue items at the same time as setting payload keys
* fix: simplify into single query
* fix: revert debug
* chore: lint
* fix: don't remove operation column yet
* feat: refactor into struct of opts and make job intervals configurable
* fix: add analyze for payload table
* fix: schema copy paste
* fix: drop fk
* feat: add an index to help with poll performance for a short while
* fix: simplify poll ordering
* fix: simplify more
* fix: ctx
Co-authored-by: Mohammed Nafees <hello@mnafees.me >
* Feat: Task Event and DAG Payloads (#2370 )
* feat: initial work on task event payloads
* fix: iterator
* feat: wire up task events
* fix: backwards compat
* fix: migrations
* fix: duplication
* fix: col
* fix: add timestamptz col
* fix: overwrite
* fix: rm debugging
* fix: revert debugging
* fix: rm unused cols
* fix: spelling
* fix: use `current_timestamp` as default
* feat: dual writes for payloads
* fix: improve debug lines
* debug: add log
* debug: always write
* fix: make annoying log debug level
* fix: rm debug lines
* fix: add comment
* feat: dag payloads
* fix: index
* fix: migration ver
* fix: error msg
Co-authored-by: abelanger5 <belanger@sas.upenn.edu >
* fix: create, then set default
* fix: inserted at copy paste
* fix: n+1 query
* fix: another n+1 query
* fix: rm unused singleton retrieve
---------
Co-authored-by: abelanger5 <belanger@sas.upenn.edu >
---------
Co-authored-by: Mohammed Nafees <hello@mnafees.me >
Co-authored-by: abelanger5 <belanger@sas.upenn.edu >
2025-10-08 11:22:34 -04:00
Mohammed Nafees
ccf92482d8
properly case on output byte length ( #2394 )
2025-10-08 12:06:30 +02:00
matt
dfc5074057
Fix: Payload fallbacks, WAL conflict handling, WAL eviction ( #2372 )
...
* fix: improve error handling
* fix: add default operation
* fix: dont write operation
* fix: refactor offload to always evict
* fix: err check
* fix: err
2025-10-03 14:50:46 -04:00
Mohammed Nafees
ed40a82dbb
Include tenant_id in OTel spans wherever possible ( #2382 )
2025-10-03 18:16:16 +02:00
matt
bb1de91254
fix: run analyze every 3 hours ( #2380 )
2025-10-03 09:49:35 -04:00
matt
4730273bce
Fix: Relax check constraint to allow null payloads ( #2366 )
...
* fix: relax check constraint
* fix: tweak logs
* fix: constraint in schema
2025-09-30 12:24:38 -04:00
abelanger5
2edeeb10ea
feat: max channels for rabbitmq ( #2365 )
...
* feat: max conns for rabbitmq
* rename conns -> chans
2025-09-30 08:49:45 -04:00
abelanger5
733feedbff
fix: use separate connections for pub and sub ( #2358 )
...
* use separate connections for pub and sub
* Update internal/msgqueue/v1/rabbitmq/rabbitmq.go
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com >
---------
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com >
2025-09-29 14:29:45 -04:00
matt
025f42af74
Debug: Error log if we send >10mb message over the internal queue ( #2345 )
...
* fix: send error log if we try to send message > 10mb
* feat: add some span attributes
* fix: span attribute names
* fix: cleanup
* fix: add message id
2025-09-25 18:15:35 -04:00
matt
d2cab4924a
Fix: use SplitN instead of Split ( #2336 )
2025-09-24 15:26:38 -04:00
matt
63c644577a
Feat: Add error level logs if we fall back to the task input for monitoring ( #2328 )
...
* feat: logs on fallback to input from task
* drive-by: couple more status badge colors
* fix: durable sleep matches
2025-09-23 15:48:30 -04:00
matt
8b8ded655d
Fix: Update payload properly on replay ( #2317 )
...
* fix: overwrite payloads when task is in an initially e.g. cancelled state
* fix: add distinct to payload writes to limit conflict resolution
* feat: first pass at test
* fix: tenant in warning
* fix: lint, more assertions
* fix: bug
* fix: my pet peeve
2025-09-18 20:42:39 -04:00
matt
bdedab653a
Fix: WAL partition poll function type ( #2301 )
...
* fix: type
* fix: cast to int32
* debug: add logging
* debug: more logs
* Revert "debug: more logs"
This reverts commit 2ff8033f89 .
* Revert "debug: add logging"
This reverts commit a7aaa05b9c .
* fix: rm unnecessary generic
* feat: span attrs + names
* fix: span naming, more details
2025-09-16 12:44:55 -04:00
matt
92843bb277
Feat: Payload Store Repository ( #2047 )
...
* feat: add table for storing payloads
* feat: add payload type enum
* feat: gen sqlc
* feat: initial sql impl
* feat: add payload store repo to shared
* feat: add overwrite
* fix: impl
* feat: bulk op
* feat: initial wiring of inputs for task triggers
* feat: wire up dag matches
* feat: create V1TaskWithPayload and use it everywhere
* fix: couple bugs
* fix: clean up types
* fix: overwrite
* fix: rm input from replay
* fix: move payload store to shared repo
* fix: schema
* refactor: repo setup
* refactor: repos
* fix: gen
* chore: lint
* fix: rename
* feat: naming, write dag inputs
* fix: more naming, trigger bug
* fix: dual writes for now
* fix: pass in tx
* feat: initial work on offloader
* feat: improve external offloader
* fix: some refs
* add withExternalHandler
* fix: improve impl of external store
* feat: implement offloading, fix other impls
* feat: add query to update JSON
* fix: implement offloading + updating records in payloads table
* feat: add WAL table
* feat: add queries for polling WAL and evicting
* feat: wire up writes into WAL
* fix: get job working
* refactor: improve types
* fix: infinite loop
* feat: improve offloading logic to run in two separate txes
* refactor: rework how overrides work
* fix: lint
* fix: migration number
* fix: migration
* fix: migration version
* fix: revert back to reading payloads out
* fix: fall back to previous input, part i
* fix: input fallback
* fix: add back input to replay
* fix: input fallback in dispatcher
* fix: nil check
* feat: advisory locks, part i
* fix: no skip locked
* feat: hash partitioned wal table
* fix: modify queries a bit, tweak crud enum
* fix: pk order, function to find tenants
* feat: wal processing
* fix: only write wal if an external store is enabled, fix offloading logic
* fix: spacing
* feat: schema cleanup
* fix: rm external store loc name
* fix: set content to null when offloading
* fix: cleanup, naming
* fix: pass overwrite payload store along
* debug: add some logging
* Revert "debug: add some logging"
This reverts commit 43e71eadf1 .
* fix: typo
* fx: add offloatAt to store opts for offloading
* fix: handle leasing with advisory lock
* fix: struct def
* fix: requeue on payloads not found
* fix: rm hack for triggers
* fix: revert empty input on write
* fix: write input
* feat: env var for enabling / disabling dual writes
* feat: wire up dual writes
* fix: comments
* feat: generics!
* fix: panic from type cast
* fix: migration
* fix: generic
* fix: hack for T key in map
* fix: cleanup
2025-09-12 09:53:01 -04:00
matt
f385964fcc
Fix: Scheduled runs race w/ idempotency key check ( #2077 )
...
* feat: create table for storing key
* feat: is_filled col
* feat: idempotency repo
* fix: handle filling
* fix: improve queries
* feat: check if was created already before triggering
* fix: handle partitions
* feat: improve schema
* feat: initial idempotency key claiming impl
* fix: db
* fix: sql fmt
* feat: crazy query
* fix: downstream
* fix: queries
* fix: query bug
* fix: migration rename
* fix: couple small issues
* feat: eviction job
* fix: copilot comments
* fix: index name
* fix: rm comment
2025-09-12 07:54:42 -04:00
Mohammed Nafees
de08285517
Remove nginx and use custom static fileservers ( #1928 )
...
* serve static files using staticfileserver
* server static files on port 3000 by default
* frontend expose port 80 default
2025-09-12 12:37:32 +02:00
Gabe Ruttner
9459dad14d
Feat improve auth error handling ( #1893 )
...
* common errors
* rate limits
* add IP extractor to api server
* use echo rate limit middleware func
* use rate limit for webhooks as well
---------
Co-authored-by: Mohammed Nafees <hello@mnafees.me >
2025-09-11 18:30:07 +02:00
Mohammed Nafees
03e5b37059
Introduce UI for Organizations ( #2247 )
...
* org selector
* org selector and pages
* org page starts to look nice I think
* add mgmt tokens section
* better messaging
* custom auth interface
* add comments
* more modals
* more fixes
* onboarding create tenant for orgs
* use ConfirmDialog
* org invite modal
* org invites work
* email service into pkg
* fix build error
* attempt at creating hook
* address PR comments
* more fixes
* update for org list endpoint
2025-09-05 21:30:37 +02:00
Mohammed Nafees
1a2891154e
Periodically run ANALYZE on v1_task and v1_task_event ( #2236 )
...
* analyze v1_task and v1_task_event tables periodically
* copy pasta
2025-09-02 11:07:05 -04:00
matt
54caf2e68a
fix: rm annoying loki logs ( #2224 )
2025-08-28 20:54:42 -04:00
abelanger5
f7eda21c10
fix: confusing error message ( #2199 )
2025-08-26 10:55:23 -04:00
abelanger5
67aef4fa64
add visibility to stream send event ( #2174 )
...
* add visibility to stream send event
* more otel
* track down stream timings
* experimental: use PrepareMsg before writing to the stream
* add control over stream window size, add error to span if large delays in stream sends
2025-08-22 09:51:31 -04:00
Gabe Ruttner
f59ebd6c47
feat: analytics events ( #2171 )
...
* feat: analytics events
* review comments
2025-08-22 05:41:17 -07:00
matt
5eab4b74e7
Feat: Run ANALYZE on a few tables once a day ( #2163 )
...
* feat: add analyze for a few tables
* feat: run at 5am utc
* fix: add tx, timeout
* fix: 30m timeout
2025-08-19 13:43:27 -04:00
abelanger5
1407594902
fix: move rate limited queue items off the main queue ( #2155 )
...
* fix: move rate limited queue items off the main queue
* preserve FIFO behavior on queues
* fix unit tests, address pr comments
* fix: generated
* rename table
2025-08-18 11:31:21 -04:00
matt
36924936fa
Feat: Webhook fixes / improvements ( #2131 )
...
* feat: webhook update
* feat: add headers to cel env
* fix: header casing
* feat: wire up edits
* fix: updates
* fix: finish wiring up updates
* fix: handle save on enter
* fix: lint
* feat: add slack and discord
* feat: initial slack setup
* fix: get slack working
* fix: rm discord for now
* fix: lint
* chore: gen
* fix: explicit save button
* feat: add link to CEL docs
* feat: add callout for reaching out to support
* feat: docs
* refactor: challenge
* fix: naming
* fix: return
* fix: resp codes
* fix: webhooks beta flag
* fix: rm discord
* fix: docs
2025-08-14 10:46:57 -05:00
matt
ed65e41ff2
Fix: Optimize DAG timing query for Prom ( #2102 )
...
* feat: improve dag duration query
* fix: naming
* fix: wiring
* feat: add trace
* fix: add timeouts
* fix: inserted at
* fix: correctness tweak
* fix: try upgrading pino
2025-08-12 08:01:00 -04:00
Mohammed Nafees
d504d51f13
add k8s pod info to traces ( #2109 )
2025-08-12 07:57:47 -04:00
abelanger5
508c18342d
fix: don't wait for grpc stream send on rabbitmq loop ( #2115 )
...
* fix: don't wait for grpc stream send on rabbitmq loop
* fix: ctx cancel
2025-08-12 07:54:45 -04:00
Mohammed Nafees
8a0e88ac48
[HAT-432] Enforce task priorities to be between 1 and 3 ( #2110 )
...
* user provided priorities can only be 1,2,3
* sanitize
* check for retry counts
* update partition functions to include constraints
* do SQL migration afterwards
* revert sql changes
2025-08-11 21:50:34 +02:00