pLLM
High-Performance LLM Gateway
A drop-in replacement for the OpenAI API, built in Go. Handle thousands of concurrent requests with adaptive routing, multi-provider support, and enterprise-grade reliability.
Enterprise-Grade Features
Built from the ground up for production workloads with performance, reliability, and developer experience in mind.
100% OpenAI Compatible
A drop-in replacement for the OpenAI API. No code changes needed: just update your base URL and you're ready to go.
Multi-Provider Support
Support for OpenAI, Anthropic, Azure OpenAI, AWS Bedrock, Vertex AI, Llama, and Cohere through a unified interface.
Adaptive Routing
Intelligent request routing with automatic failover, circuit breakers, and health-based load balancing.
High Performance
Built in Go for maximum performance. Handle thousands of concurrent requests with minimal latency overhead.
Enterprise Security
JWT authentication, RBAC, audit logging, and comprehensive monitoring with Prometheus metrics.
Cost Optimization
Budget management, intelligent caching, and multi-key load balancing to minimize API costs.
Technical Excellence
Deep technical capabilities designed for mission-critical production environments.
Performance
- Sub-millisecond routing overhead
- Thousands of concurrent connections
- Native compilation with Go
- Efficient memory management
Reliability
- Circuit breaker protection
- Automatic health monitoring
- Graceful degradation
- Zero-downtime deployments
Scalability
- Horizontal scaling ready
- Kubernetes native
- Redis-backed caching
- Distributed rate limiting
Observability
- Prometheus metrics
- Grafana dashboards
- Distributed tracing
- Comprehensive logging
Performance Advantage
See how pLLM compares to typical interpreted gateway solutions.
Metric | pLLM (Go) | Typical Gateway | Advantage |
---|---|---|---|
Concurrent Connections | Thousands | Limited | 🚀 Superior |
Memory Usage | 50-80MB | 150-300MB+ | 💾 3-6x Less |
Startup Time | <100ms | 2-5s | ⚡ 20-50x Faster |
CPU Efficiency | All cores | GIL limited | 🔥 True Parallel |
Enterprise Authentication
Seamless integration with your existing identity infrastructure through OAuth/OIDC support powered by Dex.
Zero Configuration
Connect to your existing identity providers without complex setup. Dex handles the OAuth/OIDC protocols while pLLM manages authorization.
Enterprise Security
Industry-standard OAuth 2.0 and OpenID Connect protocols with support for SAML, LDAP, and multi-factor authentication.
Supported Identity Providers
Connect with popular identity providers and enterprise systems


Google, Microsoft, LDAP, and Active Directory integrate through Dex, and many more identity providers connect via standard OIDC and SAML protocols.
Simple Configuration
Get started with OAuth/OIDC in minutes with a simple YAML configuration
auth:
  dex:
    issuer: https://dex.yourcompany.com
    connectors:
      - type: oidc
        name: Google
        config:
          issuer: https://accounts.google.com
          clientID: your-google-client-id
          clientSecret: your-google-client-secret
      - type: ldap
        name: Corporate Directory
        config:
          host: ldap.yourcompany.com:636
          insecureNoSSL: false
          bindDN: cn=admin,dc=company,dc=com
System Architecture
Enterprise-grade architecture designed for high availability, scalability, and performance
Client Layer
Applications & Services
- Web Applications: React, Vue, Angular
- Mobile Apps: iOS, Android, React Native
- Backend Services: Node.js, Python, Go
- AI Platforms: LangChain, AutoGPT
pLLM Gateway
Intelligent Routing Engine
- Core Gateway: High-performance Go runtime
- Router: Chi HTTP Router
- Auth: JWT & RBAC
- Cache: Redis Layer
- Monitor: Metrics & Logs
Provider Layer
LLM Service Providers
Example provider health snapshot:
Provider | Status | Uptime |
---|---|---|
OpenAI | healthy | 99.9% |
Anthropic | healthy | 99.9% |
Azure OpenAI | degraded | 85.2% |
AWS Bedrock | healthy | 99.9% |
Google Vertex | healthy | 99.9% |
Llama | failed | 0% |
Data Flow & Features
Real-time monitoring and intelligent routing
- Circuit Breaker: Automatic failover protection
- Health Checks: Continuous monitoring
- Rate Limiting: Traffic control & quotas
- Analytics: Performance insights
Live Performance Metrics
- 1000+ requests/sec
- <1ms latency
- 99.9% uptime
- 65MB memory
Intelligent Load Balancing
Choose from six routing strategies, each optimized for different use cases and workload patterns.
Round Robin
Even distribution across all providers
Least Busy
Routes to least loaded provider
Weighted
Custom weight distribution
Priority
Prefers high-priority providers
Latency-Based
Routes to fastest responding provider
Usage-Based
Respects rate limits and quotas
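The strategy is chosen in the gateway's configuration. As a rough sketch (the key names below, such as routing.strategy and the per-provider weight field, are illustrative assumptions rather than pLLM's documented schema), a weighted setup might look like this:
# Illustrative sketch only; key names are assumptions, not pLLM's documented schema
routing:
  strategy: weighted          # or: round_robin, least_busy, priority, latency, usage
  providers:
    - name: openai-primary
      weight: 70              # receives roughly 70% of traffic
    - name: azure-backup
      weight: 30              # takes the rest, and all traffic if the primary's circuit breaker opens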
Production-Ready Stack
Built with battle-tested technologies and modern best practices for enterprise reliability.
Chi Router
Lightning-fast HTTP routing and middleware
PostgreSQL
Reliable data persistence with GORM ORM
Redis
High-speed caching and rate limiting
Prometheus
Enterprise monitoring and metrics
Adaptive Request Flow
Interactive visualization of intelligent routing with automatic failover and circuit breaker protection.
Product Roadmap
Exciting features coming soon to make pLLM even more powerful and enterprise-ready.
Key Rotation & Secret Management
Automated key rotation and integration with external secret managers for enhanced security.
Advanced Guardrails
Content filtering, rate limiting, and safety mechanisms to ensure responsible AI usage.
Enhanced Audit & Logging
Comprehensive audit trails with retention policies and compliance reporting.
Shape Our Roadmap
Have a feature request or want to influence our development priorities? We'd love to hear from you.
Enterprise Support
While pLLM is open source and free, we offer professional support and custom solutions for enterprises with mission-critical requirements.
Community
Perfect for developers and small teams getting started with pLLM.
What's included:
- GitHub Issues & Discussions
- Community Discord support
- Documentation and guides
- Best effort response time
- Open source under MIT license
Limitations:
- No SLA guarantees
- Community-driven support
- No priority bug fixes
Professional
For growing businesses that need reliable support and faster issue resolution.
What's included:
- Priority email support
- Guaranteed 24-hour response time
- Deployment assistance
- Configuration review
- Priority bug fixes
- Access to beta features
Limitations:
- Business hours support only
- Email-based communication
Enterprise
Comprehensive support for mission-critical deployments with custom requirements.
What's included:
- Dedicated support engineer
- Custom SLA (down to 2-hour response)
- Phone & video call support
- Custom feature development
- Architecture consulting
- On-site deployment assistance
- Training and workshops
- Priority feature requests
Ready for Enterprise Deployment?
Contact our team for custom integrations, dedicated support, and enterprise-grade deployment assistance.
Professional Consultation
Architecture review, deployment planning, and best practices guidance from our core team.
Custom Development
Tailored features, custom integrations, and specialized deployment configurations for your use case.
Dedicated Support
Priority support channels, SLA guarantees, and direct access to our engineering team.
Schedule a Consultation
Book a 30-minute call to discuss your requirements and explore how pLLM can fit your enterprise needs.
Book Consultation
Enterprise Inquiry
Submit a detailed inquiry for custom features, deployment assistance, or partnership opportunities.
Submit Inquiry Form
Frequently Asked Questions
Everything you need to know about pLLM, from technical details to enterprise support options.
Is pLLM free to use?
Yes, pLLM is completely free and open source under the MIT license. This means you can use it in commercial applications, modify the code, and deploy it anywhere without licensing fees. The only costs you'll incur are your infrastructure expenses (servers, cloud resources) and API costs from the LLM providers themselves (OpenAI, Anthropic, etc.).
Why choose pLLM over other LLM gateways?
pLLM is built in Go for superior performance and lower resource usage compared to Python-based solutions. Key advantages include: sub-millisecond routing overhead, native compilation for better performance, 3-6x lower memory usage, 20-50x faster startup times, and true parallel processing without GIL limitations. Plus, it's 100% OpenAI API compatible, requiring zero code changes to integrate.
Which LLM providers are supported?
pLLM supports all major LLM providers including OpenAI (GPT-3.5, GPT-4, GPT-4 Turbo), Anthropic Claude, Azure OpenAI, AWS Bedrock, Google Vertex AI, Groq, and Cohere. The unified API interface means you can switch between providers or use multiple providers simultaneously with intelligent routing and automatic failover.
Do you offer enterprise support?
Yes, we provide comprehensive enterprise support including dedicated support engineers, custom SLA agreements (down to 2-hour response times), priority bug fixes, custom feature development, architecture consulting, on-site deployment assistance, and training workshops. Enterprise support is available through custom pricing based on your specific requirements.
Can pLLM be deployed on-premise or in air-gapped environments?
Absolutely. pLLM is designed for flexible deployment scenarios including on-premise installations, air-gapped environments, and hybrid cloud setups. We provide Kubernetes manifests, Docker containers, and can assist with custom deployment configurations. The gateway can run entirely within your infrastructure while connecting to external LLM APIs or internal models.
What security features are included?
pLLM includes comprehensive security features: JWT-based authentication, Role-Based Access Control (RBAC), audit logging for compliance, OAuth/OIDC integration through Dex (supporting Google, Microsoft, LDAP, Active Directory), API key management, rate limiting, and request monitoring. All communications use TLS encryption, and we support enterprise identity providers.
What kind of performance can I expect?
pLLM is optimized for high-performance scenarios: handles thousands of concurrent connections, sub-millisecond routing overhead, efficient memory usage (50-80MB typical), fast startup times (<100ms), and intelligent caching to reduce API costs. The Go-based architecture provides significant performance advantages over interpreted language solutions.
Is there community support and documentation?
Yes! We have comprehensive documentation, GitHub discussions for community support, a Discord server for real-time help, and regular updates on our roadmap. The open-source community actively contributes features and bug fixes. For enterprise customers, we provide dedicated documentation, training materials, and direct access to our engineering team.
Performance Benchmarks
Real-world performance data showing why Go-based pLLM outperforms interpreted gateway solutions.
Headline benchmark metrics: requests/sec, P99 latency, memory efficiency, and cold start time.
Performance Comparison
pLLM vs Typical Interpreted Gateway
Comparison chart: concurrent connections, memory usage, startup time, and response time for pLLM versus a typical gateway.
Load Testing Results
Stress tested with 10,000 concurrent users making chat completion requests.
Results:
- Zero failed requests under normal conditions
- Latency measured as gateway overhead only, excluding LLM processing
- Peak memory measured during 10K concurrent connections
Enterprise Scalability
Built-in scalability features that make pLLM ideal for high-volume production workloads.
True Parallelism
No GIL limitations - utilize all CPU cores effectively
Memory Efficient
Native compilation with optimized memory management
Instant Scaling
Sub-100ms startup enables aggressive auto-scaling
Network Optimized
Efficient connection pooling and keep-alive management
Enterprise Performance Scaling
At massive throughput and ultra-low latency targets, the bottleneck is often the LLM providers themselves, not the gateway. To achieve true enterprise scale:
- Multiple LLM Deployments: Deploy several instances of the same model (e.g., 5-10 GPT-4 Azure OpenAI deployments)
- Multi-Provider Redundancy: Use multiple AWS Bedrock accounts, Azure regions, or provider accounts
- Geographic Distribution: Deploy models across regions for latency optimization
Why This Matters: A single LLM deployment typically handles 60-100 RPM. For 10,000+ concurrent users, you need multiple deployments of the same model to prevent provider-side bottlenecks. pLLM's adaptive routing automatically distributes load across all deployments.
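As a sketch of that pattern (the deployment names and configuration keys below are illustrative assumptions, not pLLM's documented schema), several deployments of the same model can be registered under one model name and load-balanced by the router:
# Illustrative sketch only; field names are assumptions, not pLLM's documented schema
model_list:
  - model_name: gpt-4              # the single name clients request
    provider: azure
    deployment: gpt-4-eastus-1
  - model_name: gpt-4
    provider: azure
    deployment: gpt-4-eastus-2
  - model_name: gpt-4
    provider: azure
    deployment: gpt-4-westeurope-1
routing:
  strategy: latency                # send each request to the fastest healthy deployment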
Get Started in Minutes
Choose your deployment method and get pLLM running in your environment quickly.
Deployment Options
Kubernetes with Helm
Production-ready deployment with auto-scaling
# Add the Helm repository
helm repo add pllm https://andreimerfu.github.io/pllm
helm repo update
# Install with your configuration
helm install pllm pllm/pllm \
  --set pllm.secrets.jwtSecret="your-jwt-secret" \
  --set pllm.secrets.masterKey="sk-master-key" \
  --set pllm.secrets.openaiApiKey="sk-your-openai-key"
# Check status
kubectl get pods -l app.kubernetes.io/name=pllm
Docker Compose
Perfect for development and testing
# Clone and setup
git clone https://github.com/andreimerfu/pllm.git
cd pllm
# Configure environment
cp .env.example .env
echo "OPENAI_API_KEY=sk-your-key-here" >> .env
# Launch pLLM
docker compose up -d
# Test it works
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "gpt-3.5-turbo", "messages": [{"role": "user", "content": "Hello!"}]}'
Binary Installation
Lightweight deployment for simple setups
# Download latest release
wget https://github.com/andreimerfu/pllm/releases/latest/download/pllm-linux-amd64
# Make executable
chmod +x pllm-linux-amd64
# Set environment variables
export OPENAI_API_KEY=sk-your-key-here
export JWT_SECRET=your-jwt-secret
export MASTER_KEY=sk-master-key
# Run pLLM
./pllm-linux-amd64 server
Drop-in Integration
pLLM is 100% OpenAI compatible. Just change your base URL and you're ready to go.
Python
from openai import OpenAI

# Just change the base_url - that's it!
client = OpenAI(
    api_key="your-api-key",
    base_url="http://localhost:8080/v1"  # ← Point to pLLM
)

# Use exactly like OpenAI
response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Hello!"}]
)
Node.js
import OpenAI from 'openai';

const openai = new OpenAI({
  apiKey: 'your-api-key',
  baseURL: 'http://localhost:8080/v1' // ← Point to pLLM
});

const completion = await openai.chat.completions.create({
  model: "gpt-3.5-turbo",
  messages: [{role: "user", content: "Hello!"}]
});
cURL
curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "X-API-Key: your-api-key" \
  -d '{
    "model": "gpt-3.5-turbo",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'