A high-performance, configurable Go data processing pipeline library with support for batch processing and data deduplication.
- Reliably processes tens of billions of data entries daily
- Handles hundreds of thousands of entries per second per instance
- Controlled memory usage, supports large-scale distributed deployment
- Excellent performance in high-concurrency and big data scenarios
- Generic support for processing any data type
- Provides both synchronous and asynchronous processing modes
- Data deduplication support
- Configurable batch size and flush intervals
- Built-in error handling and recovery mechanisms
- Graceful shutdown and resource release
**Production Environment Validation:**

- Stable operation with tens of billions of daily data entries
- Single instance processes hundreds of thousands of entries per second
- Controlled memory usage, supports large-scale distributed deployment
- Excellent performance in high-concurrency and big data scenarios

Installation:

```bash
go get github.com/rushairer/go-pipeline
```

Basic usage (asynchronous mode):

```go
package main

import (
    "context"
    "fmt"
    "time"

    "github.com/rushairer/go-pipeline"
)

func main() {
    // Create a context with a timeout
    ctx, cancel := context.WithTimeout(context.Background(), time.Second*10)
    defer cancel()

    // Create a pipeline instance
    pipeline := gopipeline.NewPipeline[string](
        gopipeline.PipelineConfig{
            FlushSize:     1000,        // Batch size
            BufferSize:    2000,        // Buffer size
            FlushInterval: time.Second, // Flush interval
        },
        func(ctx context.Context, batchData []string) error {
            // Process batch data
            fmt.Printf("Processing batch data, count: %d\n", len(batchData))
            return nil
        },
    )

    // Start async processing
    go pipeline.AsyncPerform(ctx)

    // Add data
    for i := 0; i < 5000; i++ {
        if err := pipeline.Add(ctx, fmt.Sprintf("item-%d", i)); err != nil {
            fmt.Printf("Failed to add data: %v\n", err)
            return
        }
    }

    // Give the background flush a moment to drain the remaining buffered data
    // before main exits (a simple wait for demonstration purposes).
    time.Sleep(time.Second * 2)
}
```
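
The library also advertises a synchronous mode; the sketch below assumes the blocking counterpart of AsyncPerform is named SyncPerform (a hypothetical name, see the Performer interface described later) and runs the flush loop on the current goroutine:

```go
// Hypothetical: SyncPerform is an assumed name for the blocking counterpart of
// AsyncPerform; it is expected to return once ctx is cancelled.
go func() {
    for i := 0; i < 5000; i++ {
        _ = pipeline.Add(ctx, fmt.Sprintf("item-%d", i)) // produce from another goroutine
    }
}()
if err := pipeline.SyncPerform(ctx); err != nil {
    fmt.Printf("pipeline stopped: %v\n", err)
}
```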

Deduplication example:

```go
package main

import (
    "context"
    "fmt"
    "time"

    "github.com/rushairer/go-pipeline"
)

// Item is a data structure that supports deduplication
type Item struct {
    ID   string
    Data string
}

// GetKey implements the MapData interface; entries sharing a key are deduplicated
func (i Item) GetKey() string {
    return i.ID
}

func main() {
    ctx, cancel := context.WithTimeout(context.Background(), time.Second*10)
    defer cancel()

    // Create a deduplication pipeline instance
    pipeline := gopipeline.NewPipelineDeduplication[Item](
        gopipeline.PipelineConfig{
            FlushSize:     1000,
            BufferSize:    2000,
            FlushInterval: time.Second,
        },
        func(ctx context.Context, batchData map[string]Item) error {
            fmt.Printf("Processing deduplicated batch data, count: %d\n", len(batchData))
            return nil
        },
    )

    go pipeline.AsyncPerform(ctx)

    // Add duplicate data
    items := []Item{
        {ID: "1", Data: "data1"},
        {ID: "2", Data: "data2"},
        {ID: "1", Data: "data1-new"}, // Overrides the earlier entry with ID="1"
    }
    for _, item := range items {
        if err := pipeline.Add(ctx, item); err != nil {
            fmt.Printf("Failed to add data: %v\n", err)
            return
        }
    }

    // Wait for the flush interval so the deduplicated batch is processed
    // before main exits.
    time.Sleep(time.Second * 2)
}
```

Interface architecture:

```mermaid
graph TB
    subgraph Core Interfaces
        A[DataProcessor] -->|implements| B[Pipeline]
        A -->|implements| C[PipelineDeduplication]
        D[DataAdder] -->|implements| E[BasePipelineImpl]
        F[Performer] -->|implements| E
        G[BasePipeline] -->|combines| D
        G -->|combines| F
        G -->|combines| A
    end
```

Pipeline data flow:

```mermaid
graph TB
    subgraph Data Input
        A[External Data] --> B[Add Method]
        B --> C[dataChan Channel]
    end

    subgraph Perform Processing Loop
        C --> D{select processing}
        D -->|data event| E[data receive processing]
        D -->|timer event| F[timer flush processing]
        D -->|Context Done| G[exit processing]
        E --> H{check batch size}
        H -->|less than FlushSize| I[add to batchData]
        H -->|reaches FlushSize| J[trigger batch flush]
        F --> K{check batchData}
        K -->|has data| L[trigger timer flush]
        K -->|no data| M[continue waiting]
        J --> N[clear batchData]
        L --> N
    end

    subgraph Batch Processing
        I --> O[batchData array]
        O --> P[sequential data storage]
    end

    subgraph Async Flush Processing
        J --> R[async execute flushFunc]
        L --> R
        R --> S[process batch data]
    end
```
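
The flow above corresponds roughly to a select loop over the data channel plus a flush ticker. The sketch below is inferred from the diagram and is not the library's actual performLoop implementation:

```go
// Illustrative sketch of the batching loop shown in the diagram above.
func performLoop[T any](
    ctx context.Context,
    dataChan <-chan T,
    flushSize int,
    flushInterval time.Duration,
    flushFunc func(context.Context, []T) error,
) {
    ticker := time.NewTicker(flushInterval)
    defer ticker.Stop()

    batchData := make([]T, 0, flushSize)
    flush := func() {
        if len(batchData) == 0 { // no data: continue waiting
            return
        }
        toFlush := batchData
        batchData = make([]T, 0, flushSize) // clear batchData
        go flushFunc(ctx, toFlush)          // async execute flushFunc (error handling omitted)
    }

    for {
        select {
        case <-ctx.Done(): // Context Done: exit processing
            flush()
            return
        case data := <-dataChan: // data event
            batchData = append(batchData, data)
            if len(batchData) >= flushSize { // reaches FlushSize
                flush() // trigger batch flush
            }
        case <-ticker.C: // timer event
            flush() // timer flush if any data has accumulated
        }
    }
}
```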

PipelineDeduplication data flow:

```mermaid
graph TB
    subgraph Data Input
        A[External Data] --> B[Add Method]
        B --> C[dataChan Channel]
    end

    subgraph Perform Processing Loop
        C --> D{select processing}
        D -->|data event| E[data receive processing]
        D -->|timer event| F[timer flush processing]
        D -->|Context Done| G[exit processing]
        E --> H{check batch size}
        H -->|less than FlushSize| I[add to batchData]
        H -->|reaches FlushSize| J[trigger batch flush]
        F --> K{check batchData}
        K -->|has data| L[trigger timer flush]
        K -->|no data| M[continue waiting]
        J --> N[clear batchData]
        L --> N
    end

    subgraph Data Deduplication Processing
        I --> O[batchData Map]
        O -->|Key conflict| P[override old value]
        O -->|new Key| Q[add new data]
    end

    subgraph Async Flush Processing
        J --> R[async execute flushFunc]
        L --> R
        R --> S[process batch data]
    end
```

**Interface Design** (a rough sketch of these interfaces follows the list)

- DataProcessor: Core interface defining batch data processing, including initialization, addition, flushing, and status-checking methods
- DataAdder: Provides the data addition capability
- Performer: Provides synchronous and asynchronous execution capabilities
- BasePipeline: Combines the above interfaces to define the complete pipeline functionality
- MapData: Defines the GetKey method used by the deduplication functionality
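
The following is an illustrative sketch of how these interfaces might be declared; the method sets are assumptions based on the descriptions above (only AsyncPerform, Add, and GetKey appear in the examples), and DataProcessor and BasePipeline are omitted for brevity:

```go
// Illustrative sketch only; method names other than AsyncPerform, Add, and
// GetKey are assumptions inferred from the descriptions above.
type DataAdder[T any] interface {
    Add(ctx context.Context, data T) error
}

type Performer interface {
    SyncPerform(ctx context.Context) error  // assumed name for the synchronous mode
    AsyncPerform(ctx context.Context) error // used in the examples above
}

type MapData interface {
    GetKey() string // implemented by values fed to the deduplication pipeline
}
```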

**Data Storage Structure** (illustrated after this list)

- Pipeline: Uses a slice ([]T) to store data, preserving input order
- PipelineDeduplication: Uses a map (map[string]T) to store data, deduplicating by key
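
A minimal illustration of the difference, reusing the Item type from the deduplication example above:

```go
// Illustrative only: how the two pipeline variants accumulate a batch.
items := []Item{
    {ID: "1", Data: "data1"},
    {ID: "2", Data: "data2"},
    {ID: "1", Data: "data1-new"},
}

batchSlice := make([]Item, 0, len(items)) // Pipeline: preserves input order
batchMap := make(map[string]Item)         // PipelineDeduplication: keyed by GetKey()

for _, it := range items {
    batchSlice = append(batchSlice, it)
    batchMap[it.GetKey()] = it
}
// len(batchSlice) == 3; len(batchMap) == 2 (the entry with ID "1" was overwritten)
```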

**Error Handling Mechanism** (see the usage sketch after this list)

- Unified error type definitions (e.g., ErrContextIsClosed)
- Panic recovery mechanism in the Add method
- Deferred panic handling in performLoop
- Graceful exit on context cancellation
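
For example, a producer can check for the documented ErrContextIsClosed error when Add fails after the context is done; the errors.Is usage below assumes the error is exposed as a comparable sentinel value:

```go
// Assumes ErrContextIsClosed is exported as a sentinel error by the package.
if err := pipeline.Add(ctx, "item"); err != nil {
    if errors.Is(err, gopipeline.ErrContextIsClosed) {
        // The pipeline's context is already closed; stop producing.
        return
    }
    fmt.Printf("Failed to add data: %v\n", err)
}
```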

**Performance Optimization Features**

- Configurable batch size (FlushSize)
- Adjustable buffer size (BufferSize)
- Flexible flush interval (FlushInterval)
- Support for both synchronous and asynchronous processing modes

**FlushSize Configuration**

- Recommended range: 1000-100000
- Considerations: downstream processing capacity, memory usage
- Default value: 100000

**BufferSize Configuration**

- Recommended: 1.5-2 times FlushSize
- Default value: 200000
- Adjustment principles:
  - Production faster than consumption: increase appropriately
  - Consumption faster than production: can be reduced
  - Memory constrained: reduce FlushSize and BufferSize proportionally

**FlushInterval Configuration** (a combined configuration example follows this list)

- Default value: 60 seconds
- Adjust based on real-time requirements
- Smaller intervals improve real-time performance but increase processing overhead
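
A configuration following these recommendations might look like the following; the values are illustrative and should be tuned against downstream capacity and the memory budget:

```go
// Illustrative values within the recommended ranges above.
config := gopipeline.PipelineConfig{
    FlushSize:     10000,            // within the recommended 1000-100000 range
    BufferSize:    20000,            // roughly 2x FlushSize
    FlushInterval: 10 * time.Second, // below the 60s default for fresher data
}
```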

**Concurrency Control**

- Consider implementing a goroutine pool to control concurrency (see the sketch after this list)
- Take measures to prevent goroutine leaks under high load
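
One simple way to bound concurrent batch processing is a channel-based semaphore inside the flush callback; this limiter is an illustration, not part of the library:

```go
// Allow at most 8 batches to be processed concurrently (illustrative limit).
sem := make(chan struct{}, 8)

flushFunc := func(ctx context.Context, batchData []string) error {
    select {
    case sem <- struct{}{}: // acquire a slot
        defer func() { <-sem }() // release it when done
    case <-ctx.Done():
        return ctx.Err()
    }
    // ... process batchData ...
    return nil
}
// Pass flushFunc to gopipeline.NewPipeline as in the basic example.
```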

**Error Handling Enhancement**

- Consider adding an error callback mechanism (sketched after this list)
- Implement a comprehensive graceful shutdown strategy
- Consider adding batch processing status tracking
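
Until such a mechanism exists in the library, an error callback can be layered on top of the user-supplied flush function; the onError hook below is hypothetical user-level code:

```go
// withErrorCallback wraps a flush function and reports failures to a
// hypothetical onError hook; this is user-level code, not a library API.
func withErrorCallback[T any](
    flush func(context.Context, []T) error,
    onError func(error, []T),
) func(context.Context, []T) error {
    return func(ctx context.Context, batch []T) error {
        if err := flush(ctx, batch); err != nil {
            onError(err, batch) // report the failure and the affected batch
            return err
        }
        return nil
    }
}
```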

**Performance Optimization**

- Implement a memory pool for batchData reuse
- Add a configurable retry mechanism (sketched after this list)
- Provide performance monitoring metrics:
  - Processing latency
  - Success rate
  - Memory usage
  - Throughput
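
A retry wrapper can likewise be layered around the flush function today; the fixed-attempt, linear-backoff policy below is only an example:

```go
// withRetry retries a failed flush a fixed number of times with linear backoff;
// user-level helper, not part of the library.
func withRetry[T any](
    flush func(context.Context, []T) error,
    attempts int,
    backoff time.Duration,
) func(context.Context, []T) error {
    return func(ctx context.Context, batch []T) error {
        var err error
        for i := 0; i < attempts; i++ {
            if err = flush(ctx, batch); err == nil {
                return nil
            }
            select {
            case <-ctx.Done():
                return ctx.Err()
            case <-time.After(backoff * time.Duration(i+1)): // linear backoff
            }
        }
        return err
    }
}
```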

**Observability Improvements**

- Add detailed logging (see the sketch after this list)
- Integrate monitoring metrics export
- Provide debugging interfaces
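
As a starting point, batch size and flush latency can be recorded by wrapping the flush function; the standard library logger below stands in for whatever metrics exporter is used:

```go
// withLogging records batch size, flush latency, and the outcome of each flush.
func withLogging[T any](flush func(context.Context, []T) error) func(context.Context, []T) error {
    return func(ctx context.Context, batch []T) error {
        start := time.Now()
        err := flush(ctx, batch)
        log.Printf("flushed batch: size=%d duration=%s err=%v", len(batch), time.Since(start), err)
        return err
    }
}
```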
This project is licensed under the MIT License - see the LICENSE file for details.