ScriptMatix - OpenAI Rate Limit Scaling Architecture

Problem: Hitting Tier 5 rate limits (10M tokens/min, 10K requests/min) with screenplay generation requiring 30-100 sequential OpenAI API calls.

Solution: Horizontal scaling across multiple OpenAI organizations + persistent worker queue system.

Current Architecture (The Problem)

React Frontend → Java/Tomcat Backend → Single OpenAI Org (Tier 5 limits)
  → 8GB in-memory queue (lost on crash)
  → 30-100 sequential calls (5-15 minutes)
  → Rate limit hit ❌

Issues:

  • A single organization caps throughput at one Tier 5 limit (10M tokens/min, 10K requests/min)
  • The 8GB in-memory queue loses all pending jobs on a crash or restart
  • Each screenplay blocks for 5-15 minutes of 30-100 sequential calls, so concurrent jobs hit the rate limit

Proposed Architecture (The Solution)

React Frontend → Java/Tomcat API (returns immediately)
  → Redis/PostgreSQL persistent job queue
  → Worker Pool (load balancer)
  → OpenAI Org A / Org B / Org C / Org D (each Tier 5)

Benefits:

  • The API returns immediately; generation runs asynchronously in workers
  • Jobs persist in Redis/PostgreSQL and survive crashes and restarts
  • 3-5x token and request capacity by spreading load across organizations
  • Automatic failover when any one org hits its limit
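
The "returns immediately" part of the proposed flow can be sketched as follows. This is a minimal illustration, not the real implementation: an in-memory queue and status map stand in for Redis/PostgreSQL, and all names (`JobApi`, `submit`, `getStatus`) are hypothetical.

```java
import java.util.Map;
import java.util.Queue;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentLinkedQueue;

// Sketch: the API endpoint enqueues a generation job and returns a job ID
// immediately, instead of blocking the request thread for the 5-15 minute
// generation run. In-memory structures stand in for Redis/PostgreSQL.
public class JobApi {
    static final Queue<String> queue = new ConcurrentLinkedQueue<>();
    static final Map<String, String> status = new ConcurrentHashMap<>();

    // Called by the HTTP layer; returns in microseconds, not minutes.
    public static String submit(String screenplayRequest) {
        String jobId = UUID.randomUUID().toString();
        status.put(jobId, "QUEUED");
        queue.add(jobId + "|" + screenplayRequest); // would be a Redis/Postgres write
        return jobId;
    }

    // Polled by the frontend to track progress.
    public static String getStatus(String jobId) {
        return status.getOrDefault(jobId, "UNKNOWN");
    }

    public static void main(String[] args) {
        String id = submit("Act 1, Scene 1...");
        System.out.println("job " + id + " -> " + getStatus(id));
    }
}
```

Workers would later pop from the same queue and update the status entry as the job progresses.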

Cloud Deployment Options

Option                     | Timeout Limit | Auto Scaling | Complexity | Monthly Cost | Best For
AWS Lambda                 | 15 min ❌     | Yes ✅       | Low        | $50-100      | Short jobs only (NOT ideal)
AWS EC2 Auto Scaling       | None ✅       | Yes ✅       | Medium     | $100-200     | Full control, traditional VMs
Digital Ocean App Platform | None ✅       | Yes ✅       | Low ✅     | $50-100      | Simplicity, Heroku-like experience
Keep Current + Add Workers | None ✅       | Manual       | Low ✅     | $25-50       | Fastest implementation, least risk

Implementation Approach

Phase 1: Quick Win (2-3 weeks)

Keep current Java/Tomcat, add:

  • Persistent queue (Redis or PostgreSQL)
  • Background workers (Java or Node.js)
  • 3-5 OpenAI organizations
  • Load balancer (route to least-loaded org)

✅ Fastest to implement

✅ Lowest risk (minimal changes)

✅ Immediate 3-5x scaling
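
The background-worker half of Phase 1 can be sketched like this. It is an illustration only: the queue is an in-memory `BlockingQueue` standing in for Redis/PostgreSQL, and `fakeOpenAiCall` is a placeholder for the real OpenAI SDK call; all names are invented.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Sketch of a background worker: pulls jobs off the persistent queue and runs
// the 30-100 sequential OpenAI calls outside the HTTP request thread.
public class Worker {
    // One entry per screenplay job; each job needs `calls` sequential API calls.
    record Job(String id, int calls) {}

    static String fakeOpenAiCall(String jobId, int step) {
        return jobId + ":" + step; // placeholder for the real chat completion
    }

    // Drain the queue, running each job's calls in order; returns completed job IDs.
    public static List<String> drain(BlockingQueue<Job> queue) {
        List<String> done = new ArrayList<>();
        Job job;
        while ((job = queue.poll()) != null) {
            for (int step = 1; step <= job.calls(); step++) {
                fakeOpenAiCall(job.id(), step); // sequential, as the pipeline requires
            }
            done.add(job.id());
        }
        return done;
    }

    public static void main(String[] args) {
        BlockingQueue<Job> q = new LinkedBlockingQueue<>();
        q.add(new Job("job-1", 30));
        q.add(new Job("job-2", 100));
        System.out.println(drain(q)); // prints [job-1, job-2]
    }
}
```

A real worker would run this loop forever (blocking on `take()` instead of `poll()`) and write progress back to the job-status store.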

Phase 2: Cloud Native (Later)

If you need AWS-level scaling:

  • Migrate workers to ECS Fargate containers
  • Redis on AWS ElastiCache
  • Auto-scaling based on queue depth
  • CloudWatch monitoring

✅ True auto-scaling

✅ Enterprise-grade reliability

❌ Higher maintenance overhead
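
The "auto-scaling based on queue depth" rule is simple enough to show directly. This is a sketch of the decision function the scaler would feed to ECS; the jobs-per-worker target and min/max bounds are illustrative values, not tuned numbers.

```java
// Sketch: desired worker count grows with queue backlog, clamped to a floor
// (so the pool never drains to zero) and a ceiling (cost control).
public class AutoScaler {
    public static int desiredWorkers(int queueDepth, int jobsPerWorker,
                                     int minWorkers, int maxWorkers) {
        int wanted = (queueDepth + jobsPerWorker - 1) / jobsPerWorker; // ceiling division
        return Math.max(minWorkers, Math.min(maxWorkers, wanted));
    }

    public static void main(String[] args) {
        // 45 queued jobs, ~10 jobs per worker, keep between 1 and 20 workers
        System.out.println(desiredWorkers(45, 10, 1, 20)); // prints 5
    }
}
```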

Multi-Organization Strategy

How It Works:

  1. Set up 3-5 OpenAI organizations (official, not against ToS)
  2. Each org gets Tier 5 limits: 10M tokens/min, 10K requests/min
  3. Worker pool tracks usage per org (real-time monitoring)
  4. Jobs route to least-loaded org (smart load balancing)
  5. If org hits limit → route to next org (automatic failover)
Result: 3 orgs = 30M tokens/min capacity (3x current limit)
Result: 5 orgs = 50M tokens/min capacity (5x current limit)
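
Steps 3-5 above (track usage per org, route to the least-loaded org, fail over when one is full) can be sketched as a single routing function. This is an in-memory illustration: real counters would live in Redis so all workers share them, and the class and method names are invented.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch of least-loaded routing: track tokens consumed per org in the current
// minute, send the next job to the org with the most headroom, and skip any
// org that cannot fit the job within its 10M tokens/min budget.
public class OrgRouter {
    static final long TOKENS_PER_MIN = 10_000_000L; // Tier 5 budget per org

    // usage: tokens consumed this minute, keyed by org name.
    // Returns the org with the most remaining budget, or null if all are full.
    public static String pickOrg(Map<String, Long> usage, long tokensNeeded) {
        String best = null;
        long bestRemaining = -1;
        for (Map.Entry<String, Long> e : usage.entrySet()) {
            long remaining = TOKENS_PER_MIN - e.getValue();
            if (remaining >= tokensNeeded && remaining > bestRemaining) {
                best = e.getKey();
                bestRemaining = remaining;
            }
        }
        return best; // caller re-queues the job if null (all orgs saturated)
    }

    public static void main(String[] args) {
        Map<String, Long> usage = new LinkedHashMap<>();
        usage.put("org-a", 9_900_000L); // nearly exhausted -> skipped
        usage.put("org-b", 2_000_000L); // most headroom -> picked
        usage.put("org-c", 5_000_000L);
        System.out.println(pickOrg(usage, 150_000L)); // prints org-b
    }
}
```

The null return is the failover path: when every org is saturated, the job goes back on the queue instead of being dropped, and the per-minute counters reset on the next window.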

Why NOT AWS Lambda?

Lambda Limitations for Your Use Case:

  • Hard 15-minute execution timeout; generation jobs already run 5-15 minutes, leaving no headroom for longer screenplays or retries
  • Stateless invocations make it awkward to track per-org token usage across calls, which the load balancer needs

Better: Persistent Workers (AWS ECS Fargate)

  • No execution time limit; a job runs as long as the screenplay needs
  • Long-lived processes can hold per-org usage counters and connection pools
  • Scales on queue depth instead of per-request invocations

Recommended Next Steps

  1. Phase 1 (Week 1): Set up persistent queue (Redis) + migrate 8GB in-memory queue
  2. Phase 1 (Week 2): Build worker processes with multi-org load balancing
  3. Phase 1 (Week 3): Test with 3 OpenAI orgs, verify 3x throughput
  4. Phase 2 (Later): Migrate to ECS Fargate for auto-scaling (optional)

Deliverables:
  • Persistent queue system (no more job loss)
  • Multi-org worker pool (3-5x rate limit capacity)
  • Smart load balancing (automatic failover)
  • Monitoring dashboard (track usage per org)
  • Documentation for maintenance
