# feat: add map step type in sql #208
**PLAN.md** (Outdated)

```sql
jsonb_build_object('run', r.input) || COALESCE(dep_out.deps_output, '{}'::jsonb)
WHEN s.step_type = 'map' AND s.deps_count = 0 THEN
  -- Root map: extract item from run input array
  (SELECT input -> st.task_index FROM pgflow.runs WHERE run_id = st.run_id)
```
Potential array bounds issue: the direct array indexing with `task_index` lacks bounds checking, which could cause runtime errors if the index exceeds the array length. Consider adding validation:

```sql
CASE
  WHEN st.task_index < jsonb_array_length(input)
  THEN input -> st.task_index
  ELSE (SELECT ERROR('Task index out of bounds'))
END
```

This would provide a clear error message rather than a potentially confusing SQL error when an index is out of range.
Suggested change:

```diff
-(SELECT input -> st.task_index FROM pgflow.runs WHERE run_id = st.run_id)
+CASE
+  WHEN st.task_index < jsonb_array_length((SELECT input FROM pgflow.runs WHERE run_id = st.run_id))
+  THEN (SELECT input -> st.task_index FROM pgflow.runs WHERE run_id = st.run_id)
+  ELSE (SELECT ERROR('Task index out of bounds'))
+END
```
*Spotted by Diamond*
**PLAN.md** (Outdated)

```sql
-- Set remaining_tasks = NULL for all 'created' steps (not started yet)
UPDATE pgflow.step_states
SET remaining_tasks = NULL
WHERE status = 'created';
```
The data migration statement could be optimized to avoid unnecessary updates. Consider adding a condition to only update rows that need changing:

```sql
UPDATE pgflow.step_states
SET remaining_tasks = NULL
WHERE status = 'created' AND remaining_tasks IS NOT NULL;
```

This prevents updating rows where `remaining_tasks` is already NULL, reducing lock contention and improving migration performance, especially in production environments with existing data.
Suggested change:

```diff
 -- Set remaining_tasks = NULL for all 'created' steps (not started yet)
 UPDATE pgflow.step_states
 SET remaining_tasks = NULL
-WHERE status = 'created';
+WHERE status = 'created' AND remaining_tasks IS NOT NULL;
```
*Spotted by Diamond*
```diff
 SET status = 'started',
-    started_at = now()
+    started_at = now(),
+    remaining_tasks = ready_steps.initial_tasks -- Copy initial_tasks to remaining_tasks when starting
```
Consider adding a null check when copying `initial_tasks` to `remaining_tasks`. While the schema defines `initial_tasks` with DEFAULT 1, existing data might contain null values. Using COALESCE would provide a safety net:

```sql
remaining_tasks = COALESCE(ready_steps.initial_tasks, 1) -- Safely copy with fallback
```

This ensures the task counting logic remains robust even with unexpected data states.
Suggested change:

```diff
-remaining_tasks = ready_steps.initial_tasks -- Copy initial_tasks to remaining_tasks when starting
+remaining_tasks = COALESCE(ready_steps.initial_tasks, 1) -- Safely copy with fallback
```
*Spotted by Diamond*
🔍 Preview deployments: Website https://pr-208.pgflow.pages.dev · Playground https://pr-208--pgflow-demo.netlify.app
# Map Step Type Implementation (SQL Core)
## Overview
This PR implements **map step type** functionality in pgflow's SQL Core, enabling parallel processing of array data through fanout/map operations. This foundational feature allows workflows to spawn multiple tasks from a single array input, process them in parallel, and aggregate results back into ordered arrays.
## What We're Building
### Core Concept: Map Steps
Map steps transform single array inputs into parallel task execution:
```typescript
// Input array
const items = ['item1', 'item2', 'item3'];
// Map step spawns 3 parallel tasks
// Each task processes one array element
// Results aggregated back: ["result1", "result2", "result3"]
```
### Two Types of Map Steps
**1. Root Maps (0 dependencies)**
- Input: `runs.input` must be a JSONB array
- Behavior: Spawns N tasks where N = array length
- Example: `start_flow('my_flow', ["a", "b", "c"])` → 3 parallel tasks
**2. Dependent Maps (1 dependency)**
- Input: Output from a completed dependency step (must be array)
- Behavior: Spawns N tasks based on dependency's output array length
- Example: `single_step → ["x", "y"]` → `map_step` spawns 2 tasks
### Map→Map Chaining
Maps can depend on other maps:
- Parent map completes all tasks first
- Child map receives parent's task count as its planned count
- Child tasks read their elements from the parent's aggregated output array (see the sketch below)
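A minimal sketch of how such a chain might be declared, assuming the `deps_slugs` and `step_type` parameter names on `add_step()` (the exact signature is defined elsewhere in this PR):
```sql
-- Hypothetical chained-map flow (parameter names assumed)
select pgflow.create_flow('chained_maps');
select pgflow.add_step('chained_maps', 'parent_map', step_type => 'map');
select pgflow.add_step('chained_maps', 'child_map',
                       deps_slugs => array['parent_map'], step_type => 'map');

-- A 3-element input gives parent_map initial_tasks = 3; once its tasks
-- complete, child_map's initial_tasks is set to 3 from the aggregated output
select pgflow.start_flow('chained_maps', '["a", "b", "c"]'::jsonb);
```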
## Technical Implementation
### Hybrid Architecture: "Fresh Data" vs "Dumb Spawning"
**Key Innovation**: Separate array parsing from task spawning for optimal performance.
**Fresh Data Functions** (parse arrays, set counts):
- `start_flow()`: Validates `runs.input` arrays for root maps, sets `initial_tasks`
- `complete_task()`: Validates dependency outputs for dependent maps, sets `initial_tasks`
**Dumb Functions** (use pre-computed counts):
- `start_ready_steps()`: Copies `initial_tasks → remaining_tasks` and spawns N tasks (no JSON parsing)
- `start_tasks()`: Extracts array elements using `task_index`
- `maybe_complete_run()`: Aggregates results into ordered arrays
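As a rough illustration of the ordered aggregation step (table and column names follow this PR's schema; the actual query inside `maybe_complete_run()` may differ):
```sql
-- Hedged sketch: aggregate map task outputs back into an ordered array,
-- using jsonb_agg with an ORDER BY inside the aggregate
select jsonb_agg(st.output order by st.task_index) as step_output
from pgflow.step_tasks st
where st.run_id = '00000000-0000-0000-0000-000000000000'::uuid  -- placeholder run_id
  and st.step_slug = 'process_items';
```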
### Schema Changes
1. **Enable Map Step Type**
```sql
-- steps table constraint updated
check (step_type in ('single', 'map'))
```
2. **Remove Single Task Limit**
```sql
-- step_tasks table constraint removed
-- constraint only_single_task_per_step check (task_index = 0)
```
3. **Add Task Planning Columns**
```sql
-- step_states table updates
remaining_tasks int NULL, -- NULL = not started, >0 = active countdown
initial_tasks int DEFAULT 1 CHECK (initial_tasks >= 0), -- Planned count
CONSTRAINT remaining_tasks_state_consistency CHECK (
remaining_tasks IS NULL OR status != 'created'
)
```
4. **Enhanced Functions**
- `add_step()`: Accepts `step_type` parameter, validates map constraints (sketched after this list)
- `start_flow()`: Root map validation and count setting
- `complete_task()`: Dependent map count setting
- `start_ready_steps()`: Efficient bulk task spawning
- `start_tasks()`: Array element extraction
- `maybe_complete_run()`: Result aggregation
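A minimal sketch of the map-constraint check described for `add_step()`, with assumed variable names and error wording, wrapped in a `DO` block so it runs standalone:
```sql
do $$
declare
  p_step_type text := 'map';
  p_deps text[] := array['dep_a', 'dep_b'];  -- two deps: invalid for a map step
begin
  -- map steps may have at most one dependency (0 = root map, 1 = dependent map)
  if p_step_type = 'map' and coalesce(array_length(p_deps, 1), 0) > 1 then
    raise exception 'map steps can have at most one dependency';
  end if;
end $$;
```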
### Performance Optimizations
- **Minimal JSON Parsing**: Arrays parsed exactly twice (once in each "fresh data" function)
- **Batch Operations**: Efficient task creation using `generate_series(0, N-1)` (see the sketch below)
- **Atomic Updates**: Count setting under existing locks
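For instance, spawning tasks for `initial_tasks = 3` only needs the index series; a simplified sketch (the real `INSERT` into `pgflow.step_tasks` carries more columns):
```sql
-- generate_series(0, N-1) produces the task_index values 0, 1, 2,
-- which start_ready_steps() is described as inserting in one batch
select i as task_index
from generate_series(0, 3 - 1) as i;
```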
### Edge Cases Handled
- **Empty Arrays**: Auto-complete with `[]` output, not failures (example below)
- **Array Validation**: Fail fast with clear errors for non-array inputs
- **Mixed Workflows**: Single + map steps in same flow
- **Task Indexing**: 0-based indexing matching JSONB array indices
- **Ordered Results**: Results aggregated in `task_index` order
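A quick illustration of the empty-array case, reusing the flow name from the examples below:
```sql
-- Root map started with an empty array: initial_tasks = 0, and the step
-- is expected to auto-complete with [] as its output instead of failing
select pgflow.start_flow('example_flow', '[]'::jsonb);
```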
## Database Flow Examples
### Root Map Flow
```sql
-- 1. Create flow with root map step
SELECT pgflow.create_flow('example_flow');
SELECT pgflow.add_step('example_flow', 'process_items', 'map');
-- 2. Start with array input
SELECT pgflow.start_flow('example_flow', '["item1", "item2", "item3"]');
-- → Validates input is array
-- → Sets initial_tasks = 3 for 'process_items' step
-- 3. start_ready_steps() spawns 3 tasks
-- → Creates step_tasks with task_index 0, 1, 2
-- → Sends 3 messages to pgmq queue
-- 4. Workers process tasks in parallel
-- → Each gets individual array element: "item1", "item2", "item3"
-- → Returns processed results
-- 5. Results aggregated in run output
-- → Final output: {"process_items": ["result1", "result2", "result3"]}
```
### Dependent Map Flow
```sql
-- 1. Single step produces array output
SELECT pgflow.complete_task(run_id, 'prep_step', 0, '["a", "b"]');
-- → complete_task() detects map dependent
-- → Sets initial_tasks = 2 for dependent map step
-- 2. Map step becomes ready and spawns 2 tasks
-- → Each task gets individual element: "a", "b"
-- → Processes in parallel
-- 3. Results aggregated
-- → Map step output: ["processed_a", "processed_b"]
```
## Handler Interface
Map task handlers receive individual array elements (TypeScript `Array.map()` style):
```typescript
// Single step handler
const singleHandler = (input: {run: Json, dep1?: Json}) => { ... }
// Map step handler
const mapHandler = (item: Json) => {
// Receives just the array element, not wrapped object
return processItem(item);
}
```
## Comprehensive Testing
**33 test files** covering:
- Schema validation and constraints
- Function-by-function unit tests
- Integration workflows (root maps, dependent maps, mixed)
- Edge cases (empty arrays, validation failures)
- Error handling and failure propagation
- Idempotency and double-execution safety
## Benefits
- **Parallel Processing**: Automatic fanout for array data
- **Type Safety**: Strong validation ensures arrays where expected
- **Performance**: Minimal JSON parsing, efficient batch operations
- **Simplicity**: Familiar map semantics, clear error messages
- **Scalability**: Handles large arrays efficiently
- **Composability**: Map steps integrate seamlessly with existing single steps
## Breaking Changes
None. This is a pure addition:
- Existing single steps unchanged
- New `step_type` parameter defaults to `'single'`
- Backward compatibility maintained
- Database migrations are additive only
## Implementation Status
- ✅ **Architecture Designed**: Hybrid approach validated
- ✅ **Plan Documented**: Complete implementation roadmap
- 🔄 **In Progress**: Schema changes and function updates
- ⏳ **Next**: TDD implementation starting with `add_step()` function
---
## Overview
This PR implements root map step support in pgflow, enabling map steps that run as the first steps in a workflow with no dependencies. Root map steps automatically spawn N parallel tasks based on the input array length provided to `start_flow()`.
## What's Changed
### Enhanced start_flow() Function
- **Root map detection**: Identifies map steps with `deps_count = 0` as root maps
- **Array validation**: Validates that `runs.input` is an array when root map steps are present
- **Dynamic task count**: Sets `initial_tasks = jsonb_array_length(input)` for root map steps
- **Clear error messages**: Fails with a descriptive error if input is not an array for flows with root map steps
### Implementation Details
The `start_flow()` function now:
1. Checks for root map steps (`step_type = 'map'` AND `deps_count = 0`)
2. Validates input is an array if root maps exist
3. Sets the appropriate `initial_tasks` count:
   - Root map steps: array length from input
   - All other steps: 1 (existing behavior)
### Testing Coverage
Added comprehensive test suites:
- `root_map_array_validation.test.sql`: Validates the array input requirement for root map flows
- `root_map_initial_tasks.test.sql`: Tests `initial_tasks` calculation for various array sizes (0, 3, 100 elements)
- `multiple_root_maps.test.sql`: Tests flows with multiple root map steps
- `mixed_step_types.test.sql`: Tests flows combining single and map step types
## Key Features
- ✅ Automatic parallel task spawning based on input array length
- ✅ Support for empty arrays (`initial_tasks = 0`)
- ✅ Support for multiple root map steps in the same flow
- ✅ Mixed step type support (single and map steps can coexist)
- ✅ Clear validation errors for incorrect input types
## Migration
- Updated migration: `20250911095520_pgflow_add_map_step_type.sql`
- Includes all schema changes from PR #208 plus the enhanced `start_flow()` function
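A minimal sketch of the validation rule described above, with assumed variable names and error wording, wrapped in a `DO` block so it runs standalone (the real check lives inside `start_flow()`):
```sql
do $$
declare
  v_input jsonb := '"not an array"'::jsonb;  -- invalid input for a root map flow
  v_has_root_map boolean := true;            -- flow has a step with step_type='map' and deps_count=0
begin
  if v_has_root_map and jsonb_typeof(v_input) <> 'array' then
    raise exception 'Input must be a JSONB array when the flow has root map steps';
  end if;
end $$;
```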

Map Step Type Implementation (SQL Core)
Overview
This PR implements map step type functionality in pgflow's SQL Core, enabling parallel processing of array data through fanout/map operations. This foundational feature allows workflows to spawn multiple tasks from a single array input, process them in parallel, and aggregate results back into ordered arrays.
What We're Building
Core Concept: Map Steps
Map steps transform single array inputs into parallel task execution:
Two Types of Map Steps
1. Root Maps (0 dependencies)
runs.inputmust be a JSONB arraystart_flow('my_flow', ["a", "b", "c"])→ 3 parallel tasks2. Dependent Maps (1 dependency)
single_step → ["x", "y"]→map_stepspawns 2 tasksMap→Map Chaining
Maps can depend on other maps:
Technical Implementation
Hybrid Architecture: "Fresh Data" vs "Dumb Spawning"
Key Innovation: Separate array parsing from task spawning for optimal performance.
Fresh Data Functions (parse arrays, set counts):
start_flow(): Validatesruns.inputarrays for root maps, setsinitial_taskscomplete_task(): Validates dependency outputs for dependent maps, setsinitial_tasksDumb Functions (use pre-computed counts):
start_ready_steps(): Copiesinitial_tasks → remaining_tasksand spawns N tasks (no JSON parsing)start_tasks(): Extracts array elements usingtask_indexmaybe_complete_run(): Aggregates results into ordered arraysSchema Changes
Enable Map Step Type
Remove Single Task Limit
Add Task Planning Columns
Enhanced Functions
add_step(): Acceptsstep_typeparameter, validates map constraintsstart_flow(): Root map validation and count settingcomplete_task(): Dependent map count settingstart_ready_steps(): Efficient bulk task spawningstart_tasks(): Array element extractionmaybe_complete_run(): Result aggregationPerformance Optimizations
generate_series(0, N-1)Edge Cases Handled
[]output (not failures)task_indexorderDatabase Flow Examples
Root Map Flow
Dependent Map Flow
Handler Interface
Map task handlers receive individual array elements (TypeScript
Array.map()style):Comprehensive Testing
33 test files covering:
Benefits
Breaking Changes
None. This is a pure addition:
step_typeparameter defaults to'single'Implementation Status
add_step()function