Dual-Pipeline Architecture: GPU vs CPU for High-Volume ML Inference

Project Overview


Industry	SaaS / ML Platform
Challenge	Single ML pipeline couldn't efficiently serve both small real-time requests and large batch jobs
Solution	Dual-pipeline architecture routing workloads to GPU or CPU based on batch size and latency requirements
Timeline	8 weeks
Impact	75% cost reduction on large batches, 82% lower p99 latency on small requests

The Challenge

A high-volume ML classification platform processed millions of text documents through multiple NLP models. The system performed categorization, entity extraction, and scoring on incoming documents.

The existing single-pipeline architecture using SageMaker Serverless faced critical limitations:

Linear cost scaling: Processing 100,000 items cost exactly 100,000x a single item with no economies of scale
Concurrency limits: SageMaker Serverless caps at 200 concurrent requests per endpoint
Resource contention: Large enterprise batch jobs blocked demo and trial requests
Timeout failures: Long-running batches caused cascading failures across the pipeline

Business requirements demanded two contradictory capabilities: immediate response for demos and small requests, plus cost-effective processing for enterprise batch jobs.

The Approach

Workload Analysis

Traffic pattern analysis revealed a bimodal distribution:

Batch Size	% of Requests	% of Total Volume	Latency Sensitivity
1-10 items	73%	2%	High
11-100 items	18%	8%	Medium
101-1000 items	7%	15%	Low
1000+ items	2%	75%	Low

Most requests were small, but most volume came from large batches. Each required fundamentally different infrastructure.
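
The skew is easy to see by summing the table's buckets on either side of a cut-off. A minimal sketch using the figures from the table above; the 100-item cut-off is chosen here only for illustration and is not the production routing threshold:

```python
# Traffic buckets from the table above:
# (label, smallest batch size in bucket, % of requests, % of total volume).
BUCKETS = [
    ("1-10 items", 1, 73, 2),
    ("11-100 items", 11, 18, 8),
    ("101-1000 items", 101, 7, 15),
    ("1000+ items", 1001, 2, 75),
]


def share_above(min_items: int) -> tuple[int, int]:
    """Return (% of requests, % of volume) for buckets that start above min_items."""
    selected = [b for b in BUCKETS if b[1] > min_items]
    return sum(b[2] for b in selected), sum(b[3] for b in selected)


requests_pct, volume_pct = share_above(100)
print(f"Batches over 100 items: {requests_pct}% of requests, {volume_pct}% of volume")
```

Batches over 100 items account for 9% of requests but 90% of the items processed, which is why a single pipeline tuned for the "average" request fits neither group.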

Architecture Design

GPU Pipeline (Large Batches)

SageMaker Async Inference Endpoints
Scale-to-zero with CloudWatch-triggered spin-up
6-7 minute cold start, milliseconds per inference
Cost-effective at scale through GPU parallelism

CPU Pipeline (Small Batches)

SageMaker Serverless Inference Endpoints
Always warm, synchronous processing
Seconds per inference, zero cold start
Handles demos, trials, and ad-hoc requests
Predictable latency for user-facing workloads

Shared Infrastructure

DynamoDB for real-time state management
Aurora RDS Serverless for analytics (eventual consistency)
S3 for inference payloads and results
SQS queues between all microservices
Slack integration for alerting

Technical Implementation

Routing Logic

Incoming requests route based on batch size and account type:

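# Placeholder identifiers for the two pipeline entry points (assumed values).
CPU_PIPELINE_GATEWAY = "cpu-pipeline"
GPU_PIPELINE_GATEWAY = "gpu-pipeline"

# Empirical cut-off of roughly 500 items (see below).
BATCH_THRESHOLD = 500


def route_request(batch_size: int, is_demo: bool, is_trial: bool) -> str:
    """Return the gateway that should handle a request."""
    # Demos and trials always use CPU for fast response
    if is_demo or is_trial:
        return CPU_PIPELINE_GATEWAY

    # Large batches use GPU for cost efficiency
    if batch_size > BATCH_THRESHOLD:
        return GPU_PIPELINE_GATEWAY

    return CPU_PIPELINE_GATEWAY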

The threshold (~500 items) was determined empirically based on GPU cold start time versus total processing time.
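
One way to reason about such a threshold is a break-even calculation: the GPU path pays a fixed cold-start cost but a much smaller per-item cost. The sketch below is illustrative only. The per-item timings are hypothetical placeholders, not measurements from this project, and the production threshold was set empirically rather than from this formula.

```python
import math


def break_even_batch_size(
    cold_start_s: float, cpu_s_per_item: float, gpu_s_per_item: float
) -> int:
    """Smallest batch size at which the GPU path finishes no later than the CPU path.

    Assumes a simple sequential model:
        CPU time = n * cpu_s_per_item
        GPU time = cold_start_s + n * gpu_s_per_item
    """
    if cpu_s_per_item <= gpu_s_per_item:
        raise ValueError("GPU must be faster per item for a break-even point to exist")
    return math.ceil(cold_start_s / (cpu_s_per_item - gpu_s_per_item))


# Hypothetical inputs: 6.5-minute cold start, 0.8 s/item on CPU, 10 ms/item on GPU.
print(break_even_batch_size(cold_start_s=390, cpu_s_per_item=0.8, gpu_s_per_item=0.01))
```

A real threshold would also need to account for cost per instance-hour and CPU-side concurrency, which this time-only model ignores.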

GPU Auto-Scaling

GPUs scale from zero to minimize costs during idle periods:

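resource "aws_cloudwatch_metric_alarm" "gpu_pipeline_traffic" {
  alarm_name          = "gpu-pipeline-incoming-traffic"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 1
  metric_name         = "ApproximateNumberOfMessagesVisible"
  namespace           = "AWS/SQS"
  period              = 60
  statistic           = "Sum"
  threshold           = 0
  alarm_actions       = [aws_lambda_function.scale_up_gpu.arn]

  # SQS metrics are published per queue, so the alarm needs a QueueName dimension.
  # aws_sqs_queue.gpu_pipeline is an assumed resource name.
  dimensions = {
    QueueName = aws_sqs_queue.gpu_pipeline.name
  }
}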

EventBridge cron jobs periodically check queue status and scale GPUs to zero when idle.
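
The scale-down check can be reduced to a small decision function. The sketch below is an assumption about how such a check might look, not the project's actual Lambda: the function name, the idle grace period, and the one-instance floor are all hypothetical. In a real handler the two queue counts would come from the SQS `ApproximateNumberOfMessages` and `ApproximateNumberOfMessagesNotVisible` queue attributes.

```python
from dataclasses import dataclass


@dataclass
class QueueStatus:
    visible: int         # messages waiting to be picked up
    in_flight: int       # messages received but not yet deleted
    idle_seconds: float  # time since the queue was last non-empty


def desired_gpu_instances(
    status: QueueStatus, current: int, idle_grace_s: float = 600
) -> int:
    """Decide the GPU instance count for the next period.

    - Any pending or in-flight work keeps at least one instance running.
    - An empty queue scales to zero only after the grace period, so a short
      gap between batches does not trigger another multi-minute cold start.
    """
    if status.visible > 0 or status.in_flight > 0:
        return max(current, 1)
    if status.idle_seconds >= idle_grace_s:
        return 0
    return current


idle = QueueStatus(visible=0, in_flight=0, idle_seconds=900)
print(desired_gpu_instances(idle, current=1))
```

Keeping the decision separate from the AWS calls makes the idle policy easy to unit test without touching real endpoints.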

Queue Architecture

SQS queues between microservices provide modularity and backpressure:

GPU pipeline queues: 15-minute visibility timeout for long inference jobs
CPU pipeline queues: 2-minute visibility timeout for fast processing
Dead letter queues capture failures for investigation
Lambda concurrency controls prevent downstream service overwhelm
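
The two timeout profiles can be expressed as SQS queue attributes. This is a hedged sketch rather than the project's configuration (which was managed in Terraform): the helper name, the dead letter queue ARN, and the `max_receive_count` of 3 are assumptions made for illustration.

```python
import json


def queue_attributes(
    visibility_timeout_s: int, dlq_arn: str, max_receive_count: int = 3
) -> dict[str, str]:
    """Build the Attributes mapping for an SQS queue with a dead letter queue.

    SQS expects every attribute value as a string; RedrivePolicy is a JSON string.
    """
    return {
        "VisibilityTimeout": str(visibility_timeout_s),
        "RedrivePolicy": json.dumps(
            {"deadLetterTargetArn": dlq_arn, "maxReceiveCount": max_receive_count}
        ),
    }


# Hypothetical ARN, for illustration only.
DLQ_ARN = "arn:aws:sqs:us-east-1:123456789012:pipeline-dlq"

gpu_attrs = queue_attributes(15 * 60, DLQ_ARN)  # 15-minute timeout for long GPU jobs
cpu_attrs = queue_attributes(2 * 60, DLQ_ARN)   # 2-minute timeout for fast CPU jobs

# With boto3 these would be passed as:
#   boto3.client("sqs").create_queue(QueueName="...", Attributes=gpu_attrs)
print(gpu_attrs["VisibilityTimeout"], cpu_attrs["VisibilityTimeout"])
```

The visibility timeout should exceed the longest expected processing time for a message; otherwise SQS redelivers it while the first consumer is still working.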

Database Strategy

The initial Aurora RDS implementation hit write-concurrency limits under GPU pipeline load:

ERROR: SQLSTATE[40001]: Serialization failure: 1213 Deadlock found
ERROR: too many connections for role "app_user"

Solution: DynamoDB On-Demand handles real-time writes and scales horizontally. Aurora RDS was retained for analytics, fed by eventually consistent replication.
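
One plausible write pattern for this kind of real-time state is a single-item atomic counter, which avoids the multi-statement transactions that can deadlock under concurrent writers. The sketch below only builds the request arguments. The table name, key, and attribute names are hypothetical and do not come from the project.

```python
def progress_update(job_id: str, processed: int) -> dict:
    """Build keyword arguments for a DynamoDB UpdateItem call that atomically
    increments a per-job counter. Table and attribute names are hypothetical."""
    if processed < 0:
        raise ValueError("processed must be non-negative")
    return {
        "TableName": "inference_jobs",
        "Key": {"job_id": {"S": job_id}},
        # ADD increments a numeric attribute in place, creating it if absent.
        "UpdateExpression": "ADD processed_count :n",
        "ExpressionAttributeValues": {":n": {"N": str(processed)}},
    }


# With boto3 this would be sent as:
#   boto3.client("dynamodb").update_item(**progress_update("job-123", 25))
print(progress_update("job-123", 25)["UpdateExpression"])
```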

Results & Impact

Quantitative Improvements

Metric	Before	After	Improvement
Large batch cost per item	$0.012	$0.003	75% reduction
Small batch latency (p99)	45s	8s	82% faster
Demo response time	2-3 min	15s	90% faster
Max throughput	50K items/hr	500K items/hr	10x increase
Failed jobs per day	23	2	91% reduction
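
The improvement column follows directly from the before and after values. A short check, using only the numbers in the table above:

```python
def percent_reduction(before: float, after: float) -> float:
    """Percentage decrease from before to after, rounded to one decimal place."""
    return round((before - after) / before * 100, 1)


print(percent_reduction(0.012, 0.003))  # large batch cost per item
print(percent_reduction(45, 8))         # small batch p99 latency, seconds
print(percent_reduction(23, 2))         # failed jobs per day
print(percent_reduction(120, 15), percent_reduction(180, 15))  # demo time, 2-3 min range
print(500_000 / 50_000)                 # throughput multiple
```

These evaluate to 75.0, 82.2, 91.3, 87.5 to 91.7, and 10.0, matching the rounded figures in the table.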

Operational Improvements

Enterprise batch clients became highest-margin segment (previously unprofitable)
Sales demos no longer blocked by production workloads
Trial conversion improved due to faster initial experience
On-call incidents reduced through proactive queue depth alerting
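
Queue depth alerting of this kind can be sketched as a threshold check that produces a Slack incoming-webhook payload. This is an illustration under assumptions, not the project's alerting code: the function names, queue name, and threshold are hypothetical, and the webhook URL would be supplied by the deployment.

```python
import json
import urllib.request
from typing import Optional


def build_queue_alert(queue_name: str, depth: int, threshold: int) -> Optional[dict]:
    """Return a Slack incoming-webhook payload if depth exceeds threshold, else None."""
    if depth <= threshold:
        return None
    return {"text": f":warning: {queue_name} depth is {depth} (threshold {threshold})"}


def post_to_slack(webhook_url: str, payload: dict) -> None:
    """POST the payload as JSON to a Slack incoming webhook."""
    request = urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=5) as response:
        response.read()


# Hypothetical queue name and threshold; nothing is sent in this example.
alert = build_queue_alert("gpu-pipeline-inference", depth=1200, threshold=1000)
print(alert)
```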

Cost Optimization

GPU scale-to-zero eliminates idle costs during nights and weekends
Right-sized workload routing prevents expensive GPU usage for small requests
Reduced failure rate eliminates re-processing costs

Technologies Used

Category	Technologies
Compute	AWS Lambda, SageMaker Async Inference, SageMaker Serverless
Messaging	Amazon SQS, Amazon SNS
Storage	Amazon S3, Amazon DynamoDB, Aurora RDS Serverless
Orchestration	Amazon EventBridge, CloudWatch Alarms
Infrastructure	Terraform, Docker, AWS ECR
Monitoring	CloudWatch Dashboards, Slack Webhooks

Key Takeaways

Bimodal workloads need bimodal infrastructure. A single pipeline optimized for averages serves neither use case well.
Cold start is acceptable when amortized. The 6-7 minute GPU startup becomes negligible across large batch processing times.
Queue depth predicts failures. Alerting on queue depth provides lead time before user-visible impact.
Database access patterns matter more than data models. DynamoDB's horizontal write scaling solved problems that RDS couldn't handle regardless of instance size.
Operational complexity is manageable with shared infrastructure. Two pipelines with shared databases, storage, and alerting minimize maintenance overhead.
