Dual-Pipeline Architecture: GPU vs CPU for High-Volume ML Inference

Cloud DevOps/SRE engineer working with Kubernetes, GitHub Actions, Terraform, and distributed systems. I share practical guides, architecture patterns, and troubleshooting stories learned from running production systems.

Project Overview

Industry: SaaS / ML Platform
Challenge: Single ML pipeline couldn't efficiently serve both small real-time requests and large batch jobs
Solution: Dual-pipeline architecture routing workloads to GPU or CPU based on batch size and latency requirements
Timeline: 8 weeks
Impact: 75% cost reduction on large batches, 80% latency improvement on small requests

The Challenge

A high-volume ML classification platform processed millions of text documents through multiple NLP models. The system performed categorization, entity extraction, and scoring on incoming documents.

The existing single-pipeline architecture using SageMaker Serverless faced critical limitations:

  • Linear cost scaling: Processing 100,000 items cost exactly 100,000 times as much as processing one item, with no economies of scale

  • Concurrency limits: SageMaker Serverless caps at 200 concurrent requests per endpoint

  • Resource contention: Large enterprise batch jobs blocked demo and trial requests

  • Timeout failures: Long-running batches caused cascading failures across the pipeline

Business requirements demanded two competing capabilities: immediate responses for demos and small requests, and cost-effective processing for enterprise batch jobs.


The Approach

Workload Analysis

Traffic pattern analysis revealed a bimodal distribution:

| Batch Size | % of Requests | % of Total Volume | Latency Sensitivity |
| --- | --- | --- | --- |
| 1-10 items | 73% | 2% | High |
| 11-100 items | 18% | 8% | Medium |
| 101-1000 items | 7% | 15% | Low |
| 1000+ items | 2% | 75% | Low |

Most requests were small, but most volume came from large batches. Each required fundamentally different infrastructure.
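The skew is easier to see as running totals. The sketch below uses only the percentages from the table above; the names `TIERS` and `cumulative_shares` are illustrative and not part of the original system.

```python
# Traffic shares copied from the table: (batch size, % of requests, % of volume).
TIERS = [
    ("1-10", 73, 2),
    ("11-100", 18, 8),
    ("101-1000", 7, 15),
    ("1000+", 2, 75),
]


def cumulative_shares(tiers):
    """Return running totals of request share and volume share per tier."""
    totals = []
    requests_so_far = 0
    volume_so_far = 0
    for label, request_pct, volume_pct in tiers:
        requests_so_far += request_pct
        volume_so_far += volume_pct
        totals.append((label, requests_so_far, volume_so_far))
    return totals


for label, requests, volume in cumulative_shares(TIERS):
    print(f"up to {label:>8} items: {requests:3d}% of requests, {volume:3d}% of volume")
```

Batches of up to 100 items make up 91% of requests but only 10% of processed volume, which is why a single pipeline tuned for the average request fits neither end.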

Architecture Design

GPU Pipeline (Large Batches)

  • SageMaker Async Inference Endpoints

  • Scale-to-zero with CloudWatch-triggered spin-up

  • 6-7 minute cold start, milliseconds per inference

  • Cost-effective at scale through GPU parallelism

CPU Pipeline (Small Batches)

  • SageMaker Serverless Inference Endpoints

  • Always warm, synchronous processing

  • Seconds per inference, zero cold start

  • Handles demos, trials, and ad-hoc requests

  • Predictable latency for user-facing workloads

Shared Infrastructure

  • DynamoDB for real-time state management

  • Aurora RDS Serverless for analytics (eventual consistency)

  • S3 for inference payloads and results

  • SQS queues between all microservices

  • Slack integration for alerting


Technical Implementation

Routing Logic

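Incoming requests are routed by batch size and account type:

# Gateway identifiers are illustrative placeholders for the real endpoints.
CPU_PIPELINE_GATEWAY = "cpu-pipeline-gateway"
GPU_PIPELINE_GATEWAY = "gpu-pipeline-gateway"
BATCH_THRESHOLD = 500  # items; approximate, empirically chosen break-even


def route_request(batch_size: int, is_demo: bool, is_trial: bool) -> str:
    """Return the gateway that should handle this request."""
    # Demos and trials always use CPU for fast response
    if is_demo or is_trial:
        return CPU_PIPELINE_GATEWAY

    # Large batches use GPU for cost efficiency
    if batch_size > BATCH_THRESHOLD:
        return GPU_PIPELINE_GATEWAY

    # Everything else is small enough for the synchronous CPU path
    return CPU_PIPELINE_GATEWAY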

The threshold (~500 items) was determined empirically by weighing GPU cold start time against total processing time: below it, the 6-7 minute cold start outweighs the GPU's faster per-item inference.
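One simple way to reason about such a threshold is a break-even calculation. The sketch below is a simplified sequential model, not the author's method: it ignores concurrency and pricing, and the per-item timings are made-up inputs chosen only to match the article's "seconds on CPU, milliseconds on GPU" description.

```python
import math


def break_even_batch_size(cold_start_s: float,
                          cpu_s_per_item: float,
                          gpu_s_per_item: float) -> int:
    """Smallest batch size at which GPU total time, cold start included, beats CPU."""
    if cpu_s_per_item <= gpu_s_per_item:
        raise ValueError("GPU must be faster per item for a break-even point to exist")
    # Solve cold_start + n * gpu < n * cpu for the smallest integer n.
    return math.floor(cold_start_s / (cpu_s_per_item - gpu_s_per_item)) + 1


# Assumed inputs: 6.5 min cold start, 1 s/item on CPU, 250 ms/item on GPU.
print(break_even_batch_size(cold_start_s=390, cpu_s_per_item=1.0, gpu_s_per_item=0.25))
```

With these assumed inputs the break-even lands at 521 items, the same order of magnitude as the empirically chosen threshold. Real measurements would shift it, which is why the original value came from observation rather than from a formula.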

GPU Auto-Scaling

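GPU instances scale up from zero when work arrives, which avoids paying for idle capacity:

resource "aws_cloudwatch_metric_alarm" "gpu_pipeline_traffic" {
  alarm_name          = "gpu-pipeline-incoming-traffic"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 1
  metric_name         = "ApproximateNumberOfMessagesVisible"
  namespace           = "AWS/SQS"
  period              = 60
  statistic           = "Sum"
  threshold           = 0

  # SQS metrics are published per queue, so the alarm needs a QueueName dimension.
  # aws_sqs_queue.gpu_pipeline is an assumed resource name, not from the original.
  dimensions = {
    QueueName = aws_sqs_queue.gpu_pipeline.name
  }

  # Invoke the scale-up Lambda as soon as any message is waiting.
  alarm_actions = [aws_lambda_function.scale_up_gpu.arn]
}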

EventBridge scheduled rules periodically check queue status and scale the GPU endpoint back to zero when it is idle.
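The scale-down check could be structured as a small decision function that the scheduled job calls on each run. This is a sketch under assumptions: the `QueueSnapshot` type, the `next_action` name, and the rule of three consecutive empty checks are all hypothetical. The two counters correspond to the SQS queue attributes `ApproximateNumberOfMessages` and `ApproximateNumberOfMessagesNotVisible`.

```python
from dataclasses import dataclass


@dataclass
class QueueSnapshot:
    visible: int      # messages waiting (ApproximateNumberOfMessages)
    in_flight: int    # messages being processed (ApproximateNumberOfMessagesNotVisible)
    idle_checks: int  # consecutive earlier checks that found the queue empty


def next_action(snapshot: QueueSnapshot, required_idle_checks: int = 3):
    """Return (action, updated idle counter); action is "scale_down" or "wait"."""
    if snapshot.visible > 0 or snapshot.in_flight > 0:
        # Work is queued or still running: keep the GPUs and reset the counter.
        return "wait", 0
    idle = snapshot.idle_checks + 1
    if idle >= required_idle_checks:
        return "scale_down", idle
    return "wait", idle


print(next_action(QueueSnapshot(visible=0, in_flight=0, idle_checks=2)))
```

Checking in-flight messages as well as visible ones guards against scaling down while a long inference job is still running, and requiring several empty checks in a row avoids paying the cold start again for closely spaced batches.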

Queue Architecture

SQS queues between microservices provide modularity and backpressure:

  • GPU pipeline queues: 15-minute visibility timeout for long inference jobs

  • CPU pipeline queues: 2-minute visibility timeout for fast processing

  • Dead letter queues capture failures for investigation

  • Lambda concurrency controls prevent downstream services from being overwhelmed
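The per-pipeline queue settings above can be expressed as the attribute mapping that the SQS `CreateQueue` and `SetQueueAttributes` APIs accept. A minimal sketch: the helper name, the ARNs, and the `maxReceiveCount` of 3 are assumptions, and only the two visibility timeouts come from the article.

```python
import json


def queue_attributes(visibility_timeout_s: int, dlq_arn: str,
                     max_receive_count: int = 3) -> dict:
    """Build an SQS Attributes mapping with a visibility timeout and a dead letter queue."""
    if not 0 <= visibility_timeout_s <= 43200:  # SQS allows 0 seconds to 12 hours
        raise ValueError("visibility timeout must be between 0 and 43200 seconds")
    return {
        # SQS attribute values are passed as strings.
        "VisibilityTimeout": str(visibility_timeout_s),
        "RedrivePolicy": json.dumps({
            "deadLetterTargetArn": dlq_arn,
            "maxReceiveCount": max_receive_count,
        }),
    }


# Placeholder ARNs; substitute the real dead letter queues.
GPU_QUEUE = queue_attributes(15 * 60, "arn:aws:sqs:us-east-1:123456789012:gpu-pipeline-dlq")
CPU_QUEUE = queue_attributes(2 * 60, "arn:aws:sqs:us-east-1:123456789012:cpu-pipeline-dlq")
print(GPU_QUEUE["VisibilityTimeout"], CPU_QUEUE["VisibilityTimeout"])
```

The visibility timeout should exceed the longest expected processing time for a message. Otherwise SQS redelivers the message while the first consumer is still working on it.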

Database Strategy

The initial Aurora RDS implementation hit write-concurrency limits under GPU pipeline load:

ERROR: SQLSTATE[40001]: Serialization failure: 1213 Deadlock found
ERROR: too many connections for role "app_user"

Solution: move real-time writes to DynamoDB On-Demand, which scales horizontally, and keep Aurora RDS for analytics, fed by eventually consistent replication.
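A per-item status write might look like the sketch below, which builds the keyword arguments for a boto3 DynamoDB `Table.update_item` call. The key schema (`job_id`, `item_id`) and the attribute names are assumptions for illustration; the article does not describe the actual table design.

```python
from datetime import datetime, timezone
from typing import Optional


def status_update_kwargs(job_id: str, item_id: str, status: str,
                         now: Optional[datetime] = None) -> dict:
    """Keyword arguments for Table.update_item that record one item's status."""
    now = now or datetime.now(timezone.utc)
    return {
        # Hypothetical composite key: one DynamoDB item per processed document.
        "Key": {"job_id": job_id, "item_id": item_id},
        "UpdateExpression": "SET #s = :status, updated_at = :ts",
        # "status" is a DynamoDB reserved word, so it needs a name placeholder.
        "ExpressionAttributeNames": {"#s": "status"},
        "ExpressionAttributeValues": {":status": status, ":ts": now.isoformat()},
    }


kwargs = status_update_kwargs("job-1", "item-9", "done")
print(kwargs["UpdateExpression"])
# With boto3: boto3.resource("dynamodb").Table("<table name>").update_item(**kwargs)
```

Because each write touches a single item addressed by its own key, concurrent GPU workers do not compete for connections or row locks the way they did against the relational database.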


Results & Impact

Quantitative Improvements

| Metric | Before | After | Improvement |
| --- | --- | --- | --- |
| Large batch cost per item | $0.012 | $0.003 | 75% reduction |
| Small batch latency (p99) | 45 s | 8 s | 82% faster |
| Demo response time | 2-3 min | 15 s | 90% faster |
| Max throughput | 50K items/hr | 500K items/hr | 10x increase |
| Failed jobs per day | 23 | 2 | 91% reduction |

Operational Improvements

  • Enterprise batch clients became the highest-margin segment (previously unprofitable)

  • Sales demos no longer blocked by production workloads

  • Trial conversion improved due to faster initial experience

  • On-call incidents reduced through proactive queue depth alerting
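Queue depth alerting of the kind mentioned in the last bullet can be sketched as an estimate of how long the backlog will take to drain, with a Slack message once that estimate passes a limit. The function names, the 30-minute limit, and the message wording are assumptions. The `{"text": ...}` body is the basic payload format for Slack incoming webhooks.

```python
import json
import math
from typing import Optional


def drain_minutes(queue_depth: int, items_per_minute: float) -> float:
    """Estimated minutes to empty the queue at the current processing rate."""
    if queue_depth == 0:
        return 0.0
    if items_per_minute <= 0:
        return math.inf  # backlog exists but nothing is being processed
    return queue_depth / items_per_minute


def slack_alert(queue_name: str, queue_depth: int, items_per_minute: float,
                max_drain_minutes: float = 30.0) -> Optional[str]:
    """Return a Slack webhook JSON body if the backlog is too deep, else None."""
    eta = drain_minutes(queue_depth, items_per_minute)
    if eta <= max_drain_minutes:
        return None
    eta_text = "unknown (nothing is being processed)" if math.isinf(eta) else f"{eta:.0f} min"
    message = f"{queue_name}: {queue_depth} messages queued, estimated drain time {eta_text}"
    return json.dumps({"text": message})


print(slack_alert("gpu-pipeline", 60000, 1000.0))
```

Alerting on estimated drain time rather than on raw depth keeps a large but fast-moving queue from paging anyone, while a small stalled queue still raises an alert.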

Cost Optimization

  • GPU scale-to-zero eliminates idle costs during nights and weekends

  • Right-sized workload routing prevents expensive GPU usage for small requests

  • Reduced failure rate eliminates re-processing costs


Technologies Used

| Category | Technologies |
| --- | --- |
| Compute | AWS Lambda, SageMaker Async Inference, SageMaker Serverless |
| Messaging | Amazon SQS, Amazon SNS |
| Storage | Amazon S3, Amazon DynamoDB, Aurora RDS Serverless |
| Orchestration | Amazon EventBridge, CloudWatch Alarms |
| Infrastructure | Terraform, Docker, AWS ECR |
| Monitoring | CloudWatch Dashboards, Slack Webhooks |

Key Takeaways

  1. Bimodal workloads need bimodal infrastructure. A single pipeline optimized for averages serves neither use case well.

  2. Cold start is acceptable when amortized. The 6-7 minute GPU startup becomes negligible across large batch processing times.

  3. Queue depth predicts failures. Alerting on queue depth provides lead time before user-visible impact.

  4. Database access patterns matter more than data models. DynamoDB's horizontal write scaling solved problems that RDS couldn't handle regardless of instance size.

  5. Operational complexity is manageable with shared infrastructure. Two pipelines with shared databases, storage, and alerting minimize maintenance overhead.