ScriptMatix - OpenAI Rate Limit Scaling Architecture
Problem: Hitting Tier 5 rate limits (10M tokens/min, 10K requests/min) with screenplay generation requiring 30-100 sequential OpenAI API calls.
Solution: Horizontal scaling across multiple OpenAI organizations + persistent worker queue system.
Current Architecture (The Problem)
React Frontend
  → Java/Tomcat Backend
  → Single OpenAI Org (Tier 5 limits)
  → 8GB in-memory queue (lost on crash)
  → Sequential 30-100 calls (5-15 minutes)
  → Rate Limit Hit ❌
Issues:
- Single rate limit ceiling (can't scale beyond Tier 5)
- In-memory queue (jobs lost on server restart)
- No load balancing across multiple OpenAI orgs
Proposed Architecture (The Solution)
React Frontend
  → Java/Tomcat API (returns immediately)
  → Redis/PostgreSQL persistent job queue
  → Worker Pool (load balancer)
      ├─ OpenAI Org A (Tier 5)
      ├─ OpenAI Org B (Tier 5)
      ├─ OpenAI Org C (Tier 5)
      └─ OpenAI Org D (Tier 5)
Benefits:
- 3-5x throughput: Multiple orgs = multiple Tier 5 rate limit pools
- Persistent queue: Jobs survive server restarts
- Smart load balancing: Routes to least-loaded org
- Automatic failover: If one org hits limit, route to next
- No job loss: Redis/PostgreSQL persistence
Cloud Deployment Options
| Option | Timeout Limit | Auto Scaling | Complexity | Monthly Cost | Best For |
|---|---|---|---|---|---|
| AWS Lambda | 15 min ❌ | Yes ✅ | Low | $50-100 | Short jobs only (NOT ideal) |
| AWS ECS Fargate (RECOMMENDED) | None ✅ | Yes ✅ | Medium | $150-300 | Serverless containers, auto-scaling |
| AWS EC2 Auto Scaling | None ✅ | Yes ✅ | Medium | $100-200 | Full control, traditional VMs |
| DigitalOcean App Platform | None ✅ | Yes ✅ | Low ✅ | $50-100 | Simplicity, Heroku-like experience |
| Keep Current + Add Workers | None ✅ | Manual | Low ✅ | $25-50 | Fastest implementation, least risk |
Implementation Approach
Phase 1: Quick Win (2-3 weeks)
Keep current Java/Tomcat, add:
- Persistent queue (Redis or PostgreSQL)
- Background workers (Java or Node.js)
- 3-5 OpenAI organizations
- Load balancer (route to least-loaded org)
✅ Fastest to implement
✅ Lowest risk (minimal changes)
✅ Immediate 3-5x scaling
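A minimal sketch of the persistent-queue idea behind Phase 1, using a local append-only file as a stand-in for Redis/PostgreSQL so the example is self-contained (the `PersistentQueue` class and its method names are illustrative, not an existing library):

```java
import java.io.IOException;
import java.nio.file.*;
import java.util.List;

// Sketch: every enqueue is written durably BEFORE the HTTP request is
// acknowledged, so a crash or restart can replay unfinished jobs.
// In production the file would be replaced by Redis (RPUSH/BLPOP) or a
// PostgreSQL jobs table; the durability guarantee is the point.
public class PersistentQueue {
    private final Path log;

    public PersistentQueue(Path log) { this.log = log; }

    // Durably record the job id before handing it to a worker.
    public void enqueue(String jobId) throws IOException {
        Files.writeString(log, jobId + "\n",
                StandardOpenOption.CREATE, StandardOpenOption.APPEND);
    }

    // After a restart, reload every job that was never marked done.
    public List<String> pendingJobs() throws IOException {
        if (!Files.exists(log)) return List.of();
        return Files.readAllLines(log);
    }
}
```

A second `PersistentQueue` pointed at the same log sees the same jobs, which is exactly what the current 8GB in-memory queue cannot do across a restart.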
Phase 2: Cloud Native (Later)
If you need AWS-level scaling:
- Migrate workers to ECS Fargate containers
- Redis on AWS ElastiCache
- Auto-scaling based on queue depth
- CloudWatch monitoring
✅ True auto-scaling
✅ Enterprise-grade reliability
⚠️ Higher maintenance overhead
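The "auto-scaling based on queue depth" step could be as simple as this hypothetical helper (`jobsPerWorker` and the min/max clamps are assumed tuning knobs, not values from the source):

```java
// Sketch of a queue-depth scaling rule for Phase 2: one worker per
// jobsPerWorker queued jobs, clamped to a floor and ceiling. The result
// would feed an ECS service's desired-count update.
public class Autoscaler {
    public static int desiredWorkers(int queueDepth, int jobsPerWorker,
                                     int min, int max) {
        int wanted = (queueDepth + jobsPerWorker - 1) / jobsPerWorker; // ceil
        return Math.max(min, Math.min(max, wanted));
    }
}
```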
Multi-Organization Strategy
How It Works:
- Set up 3-5 OpenAI organizations (confirm with OpenAI's terms of service that multiple organizations are permitted for your use case)
- Each org gets Tier 5 limits: 10M tokens/min, 10K requests/min
- Worker pool tracks usage per org (real-time monitoring)
- Jobs route to least-loaded org (smart load balancing)
- If org hits limit → route to next org (automatic failover)
Result: 3 orgs = 30M tokens/min capacity (3x current limit)
Result: 5 orgs = 50M tokens/min capacity (5x current limit)
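The least-loaded routing with failover described above can be sketched as follows (`Org`, `pickOrg`, and the per-minute counter are illustrative names, not an existing API):

```java
import java.util.*;

// Sketch: each org tracks tokens used in the current one-minute window.
// A job goes to the org with the most remaining headroom; orgs without
// enough room are skipped, which is the automatic failover.
public class OrgRouter {
    static final long TOKENS_PER_MIN = 10_000_000L; // Tier 5 token limit

    public static final class Org {
        final String name;
        long tokensUsedThisMinute;
        public Org(String name, long used) {
            this.name = name;
            this.tokensUsedThisMinute = used;
        }
    }

    // Returns empty when every org is exhausted; the caller should then
    // back off until the rate-limit window resets.
    public static Optional<Org> pickOrg(List<Org> orgs, long tokensNeeded) {
        return orgs.stream()
                .filter(o -> TOKENS_PER_MIN - o.tokensUsedThisMinute >= tokensNeeded)
                .max(Comparator.comparingLong(
                        o -> TOKENS_PER_MIN - o.tokensUsedThisMinute));
    }
}
```

In a real deployment the per-org counters would live in Redis so every worker shares the same view of usage.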
Why NOT AWS Lambda?
Lambda Limitations for Your Use Case:
- 15-minute timeout: Your jobs can run 5-15+ minutes, approaching Lambda's hard limit
- Not designed for long-running jobs: Lambda is optimized for short-lived functions
- Cost model mismatch: Paying per 100ms while mostly waiting on API calls is inefficient
Better: Persistent Workers (AWS ECS Fargate)
- No timeout limits: Jobs can run for hours if needed
- Cost-efficient: Pay per hour, not per millisecond of wait time
- Built for this: Long-running sequential background jobs
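A persistent worker as it might run in a Fargate container is just a blocking loop with no platform-imposed timeout; in this sketch the queue and `callModel` are stand-ins for Redis/PostgreSQL and the OpenAI API:

```java
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.function.UnaryOperator;

// Sketch of the long-running worker model that Lambda's 15-minute limit
// rules out: block until a job arrives, then make the 30-100 sequential
// model calls, however long they take.
public class ScreenplayWorker {
    // Each call may depend on the previous output, so the calls stay
    // sequential within one job; parallelism comes from running many
    // workers across many orgs.
    public static String processJob(List<String> scenes,
                                    UnaryOperator<String> callModel) {
        StringBuilder screenplay = new StringBuilder();
        for (String scene : scenes) {
            screenplay.append(callModel.apply(scene));
        }
        return screenplay.toString();
    }

    public static void runForever(BlockingQueue<List<String>> queue,
                                  UnaryOperator<String> callModel)
            throws InterruptedException {
        while (true) {                       // no 15-minute ceiling here
            List<String> job = queue.take(); // blocks until work arrives
            processJob(job, callModel);
        }
    }
}
```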
Recommended Next Steps
- Phase 1 (Week 1): Set up persistent queue (Redis) + migrate 8GB in-memory queue
- Phase 1 (Week 2): Build worker processes with multi-org load balancing
- Phase 1 (Week 3): Test with 3 OpenAI orgs, verify 3x throughput
- Phase 2 (Later): Migrate to ECS Fargate for auto-scaling (optional)
Deliverables:
- Persistent queue system (no more job loss)
- Multi-org worker pool (3-5x rate limit capacity)
- Smart load balancing (automatic failover)
- Monitoring dashboard (track usage per org)
- Documentation for maintenance