Overview
Groq provides lightning-fast AI inference powered by its custom Language Processing Unit (LPU™) architecture, delivering industry-leading speed for open-source models.

Key Features:
- Ultra-low latency inference (up to 10x faster than GPUs)
- Fully OpenAI-compatible API
- Support for top open-source models (Llama, Mixtral, Gemma)
- Competitive pricing with generous free tier
Authentication
Groq uses Bearer token authentication with the OpenAI-compatible format.

Header: `Authorization: Bearer <your Groq API key or Lava forward token>`

Popular Models (October 2025)
| Model | Context | Description | Speed |
|---|---|---|---|
| llama-3.3-70b-versatile | 128K | Meta’s Llama 3.3 flagship | ~300 tokens/sec |
| mixtral-8x7b-32768 | 32K | Mistral’s mixture-of-experts | ~500 tokens/sec |
| gemma2-9b-it | 8K | Google’s efficient instruction model | ~800 tokens/sec |
Quick Start Example
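A minimal sketch in Python using the official openai SDK, assuming direct access to Groq's OpenAI-compatible endpoint; if you route through Lava, point the base URL at your gateway and pass your forward token as the API key.

```python
from openai import OpenAI

# Assumes direct Groq access; swap base_url and api_key for your
# Lava gateway URL and forward token if proxying through Lava.
client = OpenAI(
    base_url="https://api.groq.com/openai/v1",
    api_key="YOUR_GROQ_API_KEY",
)

response = client.chat.completions.create(
    model="llama-3.3-70b-versatile",
    messages=[{"role": "user", "content": "Explain LPUs in two sentences."}],
)
print(response.choices[0].message.content)
```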
Available Endpoints
Groq supports standard OpenAI-compatible endpoints:

| Endpoint | Method | Description |
|---|---|---|
| /openai/v1/chat/completions | POST | Text generation with conversation context |
| /openai/v1/models | GET | List available models |
| /openai/v1/audio/transcriptions | POST | Whisper audio transcription |
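As a quick check of what your key can reach, the models endpoint can be queried directly (a sketch, assuming direct Groq access and the requests library):

```python
import requests

# GET /openai/v1/models lists the models available to your key.
resp = requests.get(
    "https://api.groq.com/openai/v1/models",
    headers={"Authorization": "Bearer YOUR_GROQ_API_KEY"},
    timeout=10,
)
resp.raise_for_status()
for model in resp.json()["data"]:
    print(model["id"])
```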
Usage Tracking
Usage data is returned in the response body (OpenAI format) under `data.usage`.

Format: Standard OpenAI usage object + Groq-specific timing metrics
Lava Tracking: Automatically tracked via the x-lava-request-id header
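A sketch of reading the usage object from a chat completion, reusing the client from the quick-start example; the exact names of the Groq-specific timing fields are an assumption here, so inspect the raw response to confirm what your account returns:

```python
response = client.chat.completions.create(
    model="llama-3.3-70b-versatile",
    messages=[{"role": "user", "content": "Hello!"}],
)

usage = response.usage  # standard OpenAI fields
print(usage.prompt_tokens, usage.completion_tokens, usage.total_tokens)

# Groq's timing metrics (e.g. queue/prompt/completion/total time) ride
# alongside the standard fields; dump the raw payload to see them.
print(response.model_dump().get("usage"))
```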
Features & Capabilities
JSON Mode: structured JSON output is available via the OpenAI-compatible `response_format` parameter, as sketched below.
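A minimal JSON-mode sketch, reusing the client from the quick-start example; note that many OpenAI-compatible providers require the word "JSON" to appear somewhere in the prompt when this mode is enabled:

```python
import json

response = client.chat.completions.create(
    model="llama-3.3-70b-versatile",
    # Constrains the model to emit a single valid JSON object.
    response_format={"type": "json_object"},
    messages=[
        {"role": "system", "content": "Reply with a JSON object containing 'city' and 'country'."},
        {"role": "user", "content": "Where is the Eiffel Tower?"},
    ],
)
print(json.loads(response.choices[0].message.content))
```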
BYOK Support
Status: ✅ Supported (managed keys + BYOK)

BYOK Implementation:
- Append your Groq API key to the forward token: `${TOKEN}.${YOUR_GROQ_KEY}` (see the sketch after the steps below)
- Lava tracks usage and billing while you maintain key control
- No additional Lava API key costs (metering-only mode available)
To create a Groq API key:
- Sign up at the Groq Console
- Navigate to the API Keys section
- Create a new API key
- Use it in your Lava forward token (4th segment)
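A sketch of assembling the forward token; the environment variable names are illustrative, and the dot-joined format follows the `${TOKEN}.${YOUR_GROQ_KEY}` pattern above:

```python
import os

# Hypothetical env var names; the Groq key rides as the final
# dot-separated segment of the Lava forward token.
forward_token = f"{os.environ['LAVA_FORWARD_TOKEN']}.{os.environ['GROQ_API_KEY']}"

# Use it wherever an API key is expected, e.g. the Authorization header:
headers = {"Authorization": f"Bearer {forward_token}"}
```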
Best Practices
- Model Selection: Use Llama 3.3 for reasoning, Gemma2 for speed, Mixtral for balanced performance
- Speed Optimization: Groq excels at streaming - set `stream: true` for real-time UX (see the sketch after this list)
- Temperature: Keep between 0.5-0.9 for open models (they tend toward deterministic output at low temperatures)
- Context Management: Llama 3.3 supports 128K context - ideal for long documents
- Rate Limits: Groq has generous limits - check console for current tier
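A streaming sketch, reusing the client from the quick-start example:

```python
stream = client.chat.completions.create(
    model="llama-3.3-70b-versatile",
    messages=[{"role": "user", "content": "Write a haiku about speed."}],
    stream=True,  # chunks arrive as tokens are generated
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
```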
Speed Benchmarks
Groq LPU™ vs Traditional GPU:
- Llama 3.3 70B: ~300 tokens/sec (vs ~30 tokens/sec on GPU)
- Mixtral 8x7B: ~500 tokens/sec (vs ~50 tokens/sec on GPU)
- Gemma2 9B: ~800 tokens/sec (vs ~80 tokens/sec on GPU)
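To put those rates in perspective: at ~300 tokens/sec a 500-token completion finishes in under 2 seconds, versus roughly 17 seconds at GPU-class ~30 tokens/sec.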
Ideal Use Cases:
- Real-time chat applications
- Low-latency voice assistants
- Streaming content generation
- High-throughput batch processing