Overview
Groq provides lightning-fast AI inference powered by its custom Language Processing Unit (LPU™) architecture, delivering industry-leading speed for open-source models.

Key Features:
- Ultra-low latency inference (up to 10x faster than GPUs)
- Fully OpenAI-compatible API
- Support for top open-source models (Llama, Mixtral, Gemma)
- Competitive pricing with generous free tier
Authentication
Groq uses Bearer token authentication with the OpenAI-compatible format.

Header: `Authorization: Bearer <your Groq API key or Lava forward token>`

Popular Models (October 2025)
| Model | Context | Description | Speed |
|---|---|---|---|
| llama-3.3-70b-versatile | 128K | Meta’s Llama 3.3 flagship | ~300 tokens/sec |
| mixtral-8x7b-32768 | 32K | Mistral’s mixture-of-experts | ~500 tokens/sec |
| gemma2-9b-it | 8K | Google’s efficient instruction model | ~800 tokens/sec |
Quick Start Example
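A minimal sketch in Python using the official openai SDK, assuming direct access to Groq's OpenAI-compatible endpoint; if you route through Lava, point the base URL at your gateway and pass your forward token as the API key.

```python
from openai import OpenAI

# Assumes direct Groq access; swap base_url and api_key for your
# Lava gateway URL and forward token if proxying through Lava.
client = OpenAI(
    base_url="https://api.groq.com/openai/v1",
    api_key="YOUR_GROQ_API_KEY",
)

response = client.chat.completions.create(
    model="llama-3.3-70b-versatile",
    messages=[{"role": "user", "content": "Explain LPUs in two sentences."}],
)
print(response.choices[0].message.content)
```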
Available Endpoints
Groq supports standard OpenAI-compatible endpoints:

| Endpoint | Method | Description |
|---|---|---|
| /openai/v1/chat/completions | POST | Text generation with conversation context |
| /openai/v1/models | GET | List available models |
| /openai/v1/audio/transcriptions | POST | Whisper audio transcription |
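As a quick check of what your key can reach, the models endpoint can be queried directly (a sketch, assuming direct Groq access and the requests library):

```python
import requests

# GET /openai/v1/models lists the models available to your key.
resp = requests.get(
    "https://api.groq.com/openai/v1/models",
    headers={"Authorization": "Bearer YOUR_GROQ_API_KEY"},
    timeout=10,
)
resp.raise_for_status()
for model in resp.json()["data"]:
    print(model["id"])
```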
Usage Tracking
Usage data is returned in the response body (OpenAI format) under `data.usage`.

Format: Standard OpenAI usage object + Groq-specific timing metrics
Lava Tracking: Automatically tracked via the x-lava-request-id header
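A sketch of reading the usage object from a chat completion, reusing the client from the quick-start example; the exact names of the Groq-specific timing fields are an assumption here, so inspect the raw response to confirm what your account returns:

```python
response = client.chat.completions.create(
    model="llama-3.3-70b-versatile",
    messages=[{"role": "user", "content": "Hello!"}],
)

usage = response.usage  # standard OpenAI fields
print(usage.prompt_tokens, usage.completion_tokens, usage.total_tokens)

# Groq's timing metrics (e.g. queue/prompt/completion/total time) ride
# alongside the standard fields; dump the raw payload to see them.
print(response.model_dump().get("usage"))
```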
Features & Capabilities
JSON Mode: structured JSON output is available via the OpenAI-compatible `response_format` parameter, as sketched below.
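A minimal JSON-mode sketch, reusing the client from the quick-start example; note that many OpenAI-compatible providers require the word "JSON" to appear somewhere in the prompt when this mode is enabled:

```python
import json

response = client.chat.completions.create(
    model="llama-3.3-70b-versatile",
    # Constrains the model to emit a single valid JSON object.
    response_format={"type": "json_object"},
    messages=[
        {"role": "system", "content": "Reply with a JSON object containing 'city' and 'country'."},
        {"role": "user", "content": "Where is the Eiffel Tower?"},
    ],
)
print(json.loads(response.choices[0].message.content))
```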
BYOK Support
Status: ✅ Supported (managed keys + BYOK)

BYOK Implementation:
- Append your Groq API key to the forward token: `${TOKEN}.${YOUR_GROQ_KEY}` (see the sketch after the steps below)
- Lava tracks usage and billing while you maintain key control
- No additional Lava API key costs (metering-only mode available)
To create a Groq API key:
- Sign up at the Groq Console
- Navigate to the API Keys section
- Create a new API key
- Use it in your Lava forward token (4th segment)
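A sketch of assembling the forward token; the environment variable names are illustrative, and the dot-joined format follows the `${TOKEN}.${YOUR_GROQ_KEY}` pattern above:

```python
import os

# Hypothetical env var names; the Groq key rides as the final
# dot-separated segment of the Lava forward token.
forward_token = f"{os.environ['LAVA_FORWARD_TOKEN']}.{os.environ['GROQ_API_KEY']}"

# Use it wherever an API key is expected, e.g. the Authorization header:
headers = {"Authorization": f"Bearer {forward_token}"}
```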
Best Practices
- Model Selection: Use Llama 3.3 for reasoning, Gemma2 for speed, Mixtral for balanced performance
- Speed Optimization: Groq excels at streaming - set `stream: true` for real-time UX (see the sketch after this list)
- Temperature: Keep between 0.5-0.9 for open models (they tend toward deterministic output at low temperatures)
- Context Management: Llama 3.3 supports 128K context - ideal for long documents
- Rate Limits: Groq has generous limits - check console for current tier
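A streaming sketch, reusing the client from the quick-start example:

```python
stream = client.chat.completions.create(
    model="llama-3.3-70b-versatile",
    messages=[{"role": "user", "content": "Write a haiku about speed."}],
    stream=True,  # chunks arrive as tokens are generated
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
```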
Speed Benchmarks
Groq LPU™ vs Traditional GPU:
- Llama 3.3 70B: ~300 tokens/sec (vs ~30 tokens/sec on GPU)
- Mixtral 8x7B: ~500 tokens/sec (vs ~50 tokens/sec on GPU)
- Gemma2 9B: ~800 tokens/sec (vs ~80 tokens/sec on GPU)
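To put those rates in perspective: at ~300 tokens/sec a 500-token completion finishes in under 2 seconds, versus roughly 17 seconds at GPU-class ~30 tokens/sec.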
Ideal Use Cases:
- Real-time chat applications
- Low-latency voice assistants
- Streaming content generation
- High-throughput batch processing