Dual-Pipeline Architecture: GPU vs CPU for High-Volume ML Inference
Cloud DevOps/SRE engineer working with Kubernetes, GitHub Actions, Terraform, and distributed systems. I share practical guides, architecture patterns, and troubleshooting stories learned from running production systems.
Project Overview
| Industry | SaaS / ML Platform |
| Challenge | Single ML pipeline couldn't efficiently serve both small real-time requests and large batch jobs |
| Solution | Dual-pipeline architecture routing workloads to GPU or CPU based on batch size and latency requirements |
| Timeline | 8 weeks |
| Impact | 75% cost reduction on large batches, 82% lower p99 latency on small requests |
The Challenge
A high-volume ML classification platform processed millions of text documents through multiple NLP models. The system performed categorization, entity extraction, and scoring on incoming documents.
The existing single-pipeline architecture using SageMaker Serverless faced critical limitations:
Linear cost scaling: Processing 100,000 items cost exactly 100,000 times as much as a single item, with no economies of scale
Concurrency limits: SageMaker Serverless caps at 200 concurrent requests per endpoint
Resource contention: Large enterprise batch jobs blocked demo and trial requests
Timeout failures: Long-running batches caused cascading failures across the pipeline
Business requirements demanded two contradictory capabilities: immediate response for demos and small requests, plus cost-effective processing for enterprise batch jobs.
The Approach
Workload Analysis
Traffic pattern analysis revealed a bimodal distribution:
| Batch Size | % of Requests | % of Total Volume | Latency Sensitivity |
| --- | --- | --- | --- |
| 1-10 items | 73% | 2% | High |
| 11-100 items | 18% | 8% | Medium |
| 101-1000 items | 7% | 15% | Low |
| 1000+ items | 2% | 75% | Low |
Most requests were small, but most volume came from large batches. Each required fundamentally different infrastructure.
Architecture Design
GPU Pipeline (Large Batches)
SageMaker Async Inference Endpoints
Scale-to-zero with CloudWatch-triggered spin-up
6-7 minute cold start, milliseconds per inference
Cost-effective at scale through GPU parallelism
CPU Pipeline (Small Batches)
SageMaker Serverless Inference Endpoints
Always warm, synchronous processing
Seconds per inference, zero cold start
Handles demos, trials, and ad-hoc requests
Predictable latency for user-facing workloads
Shared Infrastructure
DynamoDB for real-time state management
Aurora RDS Serverless for analytics (eventual consistency)
S3 for inference payloads and results
SQS queues between all microservices
Slack integration for alerting

Technical Implementation
Routing Logic
Incoming requests route based on batch size and account type:
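# Placeholder values for illustration; the real gateway identifiers are not shown in this write-up.
CPU_PIPELINE_GATEWAY = "cpu-pipeline"
GPU_PIPELINE_GATEWAY = "gpu-pipeline"
BATCH_THRESHOLD = 500  # items


def route_request(batch_size: int, is_demo: bool, is_trial: bool) -> str:
    """Return the gateway that should handle this request."""
    # Demos and trials always use CPU for fast response
    if is_demo or is_trial:
        return CPU_PIPELINE_GATEWAY
    # Large batches use GPU for cost efficiency
    if batch_size > BATCH_THRESHOLD:
        return GPU_PIPELINE_GATEWAY
    return CPU_PIPELINE_GATEWAY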
The threshold (~500 items) was determined empirically based on GPU cold start time versus total processing time.
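One way to reason about where such a threshold lands is a break-even on wall-clock time: the GPU path pays a fixed cold start and then a small per-item cost, while the CPU path pays a larger per-item cost from the first item. The sketch below is illustrative only. The function name is hypothetical, and the per-item timings are assumed values picked to match the "milliseconds" and "seconds" figures above, not measurements from this project. The real threshold was set empirically and also reflects cost, which this sketch ignores.

```python
def break_even_batch_size(cold_start_s: float, cpu_item_s: float, gpu_item_s: float) -> float:
    """Batch size above which the GPU path finishes sooner than the CPU path.

    Solves cold_start_s + n * gpu_item_s = n * cpu_item_s for n.
    """
    if cpu_item_s <= gpu_item_s:
        raise ValueError("GPU must be faster per item for a break-even to exist")
    return cold_start_s / (cpu_item_s - gpu_item_s)


# Assumed, illustrative timings (not measurements from the project):
COLD_START_S = 6.5 * 60  # midpoint of the 6-7 minute cold start
CPU_ITEM_S = 0.8         # effective seconds per item on the CPU pipeline
GPU_ITEM_S = 0.02        # effective seconds per item on a warm GPU

threshold = break_even_batch_size(COLD_START_S, CPU_ITEM_S, GPU_ITEM_S)
print(f"break-even at roughly {threshold:.0f} items")
```

With these assumed inputs the break-even falls near the ~500 item mark. Different per-item timings move it considerably, which is why an empirical measurement is the safer basis.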

GPU Auto-Scaling
GPUs scale from zero to minimize costs during idle periods:
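resource "aws_cloudwatch_metric_alarm" "gpu_pipeline_traffic" {
  alarm_name          = "gpu-pipeline-incoming-traffic"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 1
  metric_name         = "ApproximateNumberOfMessagesVisible"
  namespace           = "AWS/SQS"
  period              = 60
  statistic           = "Sum"
  threshold           = 0

  # SQS metrics are published per queue, so the alarm needs a QueueName dimension.
  # The aws_sqs_queue.gpu_pipeline resource name is assumed for illustration.
  dimensions = {
    QueueName = aws_sqs_queue.gpu_pipeline.name
  }

  # Trigger the scale-up Lambda as soon as any message is waiting.
  alarm_actions = [aws_lambda_function.scale_up_gpu.arn]
}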
EventBridge cron jobs periodically check queue status and scale GPUs to zero when idle.
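The scale-down check itself is not shown in this write-up. A minimal sketch of the decision it might make follows, assuming the check reads both the visible and in-flight SQS message counts and waits for a few consecutive idle checks before scaling to zero. The class, function, and parameter names are hypothetical, as is the default of three idle checks.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QueueSnapshot:
    visible: int    # SQS ApproximateNumberOfMessages
    in_flight: int  # SQS ApproximateNumberOfMessagesNotVisible


def desired_gpu_instances(
    snapshot: QueueSnapshot,
    current: int,
    idle_checks: int,
    required_idle_checks: int = 3,  # assumed value
) -> int:
    """Return the instance count the periodic check should request."""
    busy = snapshot.visible > 0 or snapshot.in_flight > 0
    if busy:
        # Never scale in while work is queued or still being processed.
        return max(current, 1)
    # Queue is empty: only go to zero after several consecutive idle checks,
    # so a brief gap between batches does not trigger another cold start.
    return 0 if idle_checks >= required_idle_checks else current
```

Checking in-flight messages as well as visible ones matters here: a long GPU job holds its message invisible for the whole visibility timeout, so a check on visible messages alone could scale down mid-job.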
Queue Architecture
SQS queues between microservices provide modularity and backpressure:
GPU pipeline queues: 15-minute visibility timeout for long inference jobs
CPU pipeline queues: 2-minute visibility timeout for fast processing
Dead letter queues capture failures for investigation
Lambda concurrency limits keep downstream services from being overwhelmed
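The queue settings above map onto a small set of SQS attributes. The helper below is a sketch, not the project's configuration: the function name, the placeholder ARNs, and the `max_receive_count` default of 3 are assumptions. Only the 15-minute and 2-minute visibility timeouts come from the list above.

```python
import json


def queue_attributes(visibility_timeout_s: int, dlq_arn: str, max_receive_count: int = 3) -> dict:
    """Build an SQS attribute map (all values are strings, as the API expects)."""
    return {
        "VisibilityTimeout": str(visibility_timeout_s),
        # After max_receive_count failed receives, SQS moves the message to the DLQ.
        "RedrivePolicy": json.dumps(
            {"deadLetterTargetArn": dlq_arn, "maxReceiveCount": max_receive_count}
        ),
    }


# Placeholder ARNs for illustration only.
GPU_QUEUE_ATTRS = queue_attributes(15 * 60, "arn:aws:sqs:us-east-1:123456789012:gpu-pipeline-dlq")
CPU_QUEUE_ATTRS = queue_attributes(2 * 60, "arn:aws:sqs:us-east-1:123456789012:cpu-pipeline-dlq")
```

A map like this can be passed as the `Attributes` argument to the SQS `create_queue` or `set_queue_attributes` calls. The visibility timeout should exceed the longest expected processing time for a message, otherwise SQS redelivers it while the first consumer is still working.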
Database Strategy
The initial Aurora RDS implementation hit write concurrency limits under GPU pipeline load:
ERROR: SQLSTATE[40001]: Serialization failure: 1213 Deadlock found
ERROR: too many connections for role "app_user"
Solution: DynamoDB On-Demand for real-time writes with horizontal scaling. Aurora RDS was retained for analytics, fed by eventually consistent replication.
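The write path is not shown in this write-up. As a sketch of why DynamoDB avoids the lock contention above, the helper below builds the arguments for an atomic per-item counter update, which needs no transaction or connection pool. The table name, key name, and attribute name are hypothetical.

```python
def progress_update(job_id: str, processed: int) -> dict:
    """Build UpdateItem arguments that atomically add to a job's progress counter."""
    return {
        "TableName": "inference_jobs",     # hypothetical table name
        "Key": {"job_id": {"S": job_id}},  # hypothetical partition key
        # ADD increments the numeric attribute in place, creating it if absent.
        "UpdateExpression": "ADD processed_items :n",
        "ExpressionAttributeValues": {":n": {"N": str(processed)}},
    }
```

With boto3 installed, these arguments could be passed as `boto3.client("dynamodb").update_item(**progress_update("job-42", 128))`. Many workers can issue such updates concurrently without deadlocking, because each update is applied atomically to a single item.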
Results & Impact
Quantitative Improvements
| Metric | Before | After | Improvement |
| --- | --- | --- | --- |
| Large batch cost per item | $0.012 | $0.003 | 75% reduction |
| Small batch latency (p99) | 45s | 8s | 82% reduction |
| Demo response time | 2-3 min | 15s | ~90% reduction |
| Max throughput | 50K items/hr | 500K items/hr | 10x increase |
| Failed jobs per day | 23 | 2 | 91% reduction |
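The improvement column follows directly from the before and after values. A short check, using only the figures in the table (the helper name is just for illustration):

```python
def percent_reduction(before: float, after: float) -> float:
    """Percentage by which `after` is lower than `before`."""
    return (before - after) / before * 100


rows = {
    "Large batch cost per item ($)": (0.012, 0.003),
    "Small batch latency p99 (s)": (45, 8),
    "Failed jobs per day": (23, 2),
}
for name, (before, after) in rows.items():
    print(f"{name}: {percent_reduction(before, after):.1f}% reduction")

# Throughput is a ratio rather than a reduction.
throughput_gain = 500_000 / 50_000
```

The reductions come out at about 75%, 82%, and 91%, and the throughput ratio is 10x. The demo response time is left out because its starting value is a range (2-3 minutes) rather than a single number.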
Operational Improvements
Enterprise batch clients became the highest-margin segment (previously unprofitable)
Sales demos no longer blocked by production workloads
Trial conversion improved due to faster initial experience
On-call incidents fell thanks to proactive queue depth alerting
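The alerting rule is not spelled out here. One plausible form, sketched below, alerts when the backlog would take longer to drain than users are expected to tolerate. The function names and the idea of a maximum acceptable wait are assumptions, not the project's actual rule.

```python
def minutes_to_drain(queue_depth: int, messages_per_minute: float) -> float:
    """Estimated minutes to clear the backlog at the current processing rate."""
    if messages_per_minute <= 0:
        return float("inf")  # nothing is being consumed, so the backlog never clears
    return queue_depth / messages_per_minute


def should_alert(queue_depth: int, messages_per_minute: float, max_wait_minutes: float) -> bool:
    """Alert when the backlog would take longer to drain than the allowed wait."""
    return minutes_to_drain(queue_depth, messages_per_minute) > max_wait_minutes
```

For example, a backlog of 10,000 messages draining at 500 per minute needs 20 minutes, so `should_alert(10_000, 500, 15)` returns `True`. Comparing drain time with an allowed wait, rather than using a fixed depth, keeps the alert meaningful as throughput changes.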
Cost Optimization
GPU scale-to-zero eliminates idle costs during nights and weekends
Right-sized workload routing prevents expensive GPU usage for small requests
Reduced failure rate eliminates re-processing costs
Technologies Used
| Category | Technologies |
| --- | --- |
| Compute | AWS Lambda, SageMaker Async Inference, SageMaker Serverless |
| Messaging | Amazon SQS, Amazon SNS |
| Storage | Amazon S3, Amazon DynamoDB, Aurora RDS Serverless |
| Orchestration | Amazon EventBridge, CloudWatch Alarms |
| Infrastructure | Terraform, Docker, AWS ECR |
| Monitoring | CloudWatch Dashboards, Slack Webhooks |
Key Takeaways
Bimodal workloads need bimodal infrastructure. A single pipeline optimized for averages serves neither use case well.
Cold start is acceptable when amortized. The 6-7 minute GPU startup becomes negligible across large batch processing times.
Queue depth predicts failures. Alerting on queue depth provides lead time before user-visible impact.
Database access patterns matter more than data models. DynamoDB's horizontal write scaling solved problems that RDS couldn't handle regardless of instance size.
Operational complexity is manageable with shared infrastructure. Two pipelines with shared databases, storage, and alerting minimize maintenance overhead.