Build with the world's best AI models.
Access all models through a single, OpenAI-compatible gateway. Sub-millisecond routing, unified billing, and 100% SDK compatibility.
Quickstart
Get up and running in less than 60 seconds using the official OpenAI SDKs.
from openai import OpenAI

client = OpenAI(
    api_key="sk-ht-xxxx",
    base_url="https://api.heytoken.ai"
)

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Hello!"}]
)

print(response.choices[0].message.content)

Authentication
All API requests must include your API key in the Authorization HTTP header. You can generate and manage your keys in the API Keys dashboard.
Authorization: Bearer sk-ht-xxxxxxxxxxxx

Protect your Secret Key
Your API key carries the same privileges as your account. Never share it, commit it to version control, or use it in client-side code.
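A common pattern is to keep the key out of source code entirely and read it from an environment variable at runtime. A minimal sketch (HEYTOKEN_API_KEY is an illustrative variable name, not one the platform requires):

import os
from openai import OpenAI

# Read the secret from the environment so it never lands in version control
client = OpenAI(
    api_key=os.environ["HEYTOKEN_API_KEY"],
    base_url="https://api.heytoken.ai"
)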
Chat Completions
/v1/chat/completions

Generate intelligent responses using state-of-the-art language models. Our API supports high-performance streaming (SSE) and multimodal inputs.
Request Parameters
| Parameter | Type | Description |
|---|---|---|
| model (required) | string | ID of the model to use. See /v1/models for a full list. |
| messages (required) | array | A list of messages comprising the conversation. |
| stream | boolean | If true, partial message deltas will be sent via SSE. |
| temperature | number | Sampling temperature between 0 and 2. |
| max_tokens | integer | The maximum number of tokens to generate. |
| reasoning | boolean | Enable chain-of-thought for supported models (o1/DeepSeek). |
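For example, a non-streaming request that combines these parameters, reusing the client from the Quickstart (the prompt and values are illustrative):

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Summarize the plot of Hamlet in two sentences."}],
    temperature=0.7,   # between 0 and 2; lower is more deterministic
    max_tokens=256     # cap the completion length
)
print(response.choices[0].message.content)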
Response Formats
Standard JSON (Non-streaming)
{
  "id": "chatcmpl-123",
  "object": "chat.completion",
  "created": 1677652288,
  "model": "gpt-4o",
  "choices": [{
    "index": 0,
    "message": {
      "role": "assistant",
      "content": "Hello!"
    },
    "finish_reason": "stop"
  }],
  "usage": {
    "prompt_tokens": 9,
    "completion_tokens": 12,
    "total_tokens": 21
  }
}

Streaming (Server-Sent Events)
data: {"choices":[{"delta":{"content":" world"},"index":0}]}
data: [DONE]
Tip: Set stream: true to significantly reduce perceived latency (Time To First Token).
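With the Python SDK from the Quickstart, a streaming request might look like this:

stream = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Write a haiku about latency."}],
    stream=True
)
for chunk in stream:
    # Each SSE event carries a partial delta; content can be None on some chunks
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)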
Embeddings
Convert text into high-dimensional vectors for semantic search, clustering, and RAG applications.
cURL Example
curl https://api.heytoken.ai/embeddings \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer sk-ht-xxxx" \
  -d '{
    "input": "The food was delicious and the service was excellent.",
    "model": "text-embedding-3-small"
  }'
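The same request through the Python SDK, reusing the client from the Quickstart:

embedding = client.embeddings.create(
    model="text-embedding-3-small",
    input="The food was delicious and the service was excellent."
)
print(len(embedding.data[0].embedding))  # dimensionality of the returned vector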
Image Generation

Create high-resolution images from text prompts using DALL-E 3, Midjourney, and Stable Diffusion.
- Multiple aspect ratios
- HD quality support
- Style consistency
Example Request
{
  "model": "dall-e-3",
  "prompt": "A futuristic city at sunset",
  "size": "1024x1024",
  "quality": "hd"
}
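Assuming the gateway exposes the standard OpenAI-compatible images endpoint, a sketch of the equivalent Python SDK call:

image = client.images.generate(
    model="dall-e-3",
    prompt="A futuristic city at sunset",
    size="1024x1024",
    quality="hd"
)
print(image.data[0].url)  # URL of the generated image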
Video Generation

Generate cinematic videos from text or images using Sora, Runway, and Kling.
1. Submit a generation task to receive a task_id.
2. Poll the status endpoint or wait for a webhook callback (a polling sketch follows this list).
3. Download the high-quality MP4 result once completed.
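As an illustration only: the task endpoints are not documented in this section, so the path and field names below (/v1/video/generations, status, video_url) are placeholders, not confirmed API details.

import time
import requests

BASE = "https://api.heytoken.ai"
HEADERS = {"Authorization": "Bearer sk-ht-xxxx"}

# Placeholder path: consult the video endpoint reference for the real one
task = requests.post(
    f"{BASE}/v1/video/generations",
    headers=HEADERS,
    json={"model": "sora", "prompt": "A drone shot over a neon city at night"},
).json()
task_id = task["task_id"]

# Poll until the task finishes, then download the MP4
while True:
    status = requests.get(f"{BASE}/v1/video/generations/{task_id}", headers=HEADERS).json()
    if status.get("status") == "completed":
        with open("result.mp4", "wb") as f:
            f.write(requests.get(status["video_url"]).content)
        break
    time.sleep(5)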
Realtime Voice (Beta)
Low-latency, full-duplex voice interactions using OpenAI Realtime and Google Gemini Multimodal.
This endpoint requires a WebSocket connection. Documentation for the /v1/realtime socket protocol is available upon request for Enterprise customers.
Rate Limits
Enforced to ensure fair usage and system stability across all developers.
| Limit | Value | Scope |
|---|---|---|
| RPM | 100 | Requests per minute |
| RPH | 1,000 | Requests per hour |
| TPM | 100k | Tokens per minute |
| Concurrency | 10 | Active streams |
Errors & Handling
We use standard HTTP status codes. A successful request returns a 2xx status code.
- Invalid or missing API key. Check your Authorization header format.
- Insufficient balance. Top up your credits in the Billing section.
- The requested model or endpoint does not exist.
- Too many requests. Please implement exponential backoff (a backoff sketch follows this list).
- Upstream provider is temporarily down. Our system will automatically retry on another channel.
- The model is currently experiencing high traffic. Try again in a few seconds.
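A minimal retry-with-backoff sketch for rate-limit errors, using the Python SDK:

import time
import openai

def chat_with_backoff(client, max_retries=5, **kwargs):
    # Double the wait after each rate-limited attempt
    for attempt in range(max_retries):
        try:
            return client.chat.completions.create(**kwargs)
        except openai.RateLimitError:
            time.sleep(2 ** attempt)
    raise RuntimeError("Still rate-limited after retries")

response = chat_with_backoff(client, model="gpt-4o",
                             messages=[{"role": "user", "content": "Hello!"}])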
Best Practices
1. Use streaming for a better user experience in chat applications.
2. Implement timeouts (30-60 s recommended) to handle long-running generations.
3. Set max_tokens to control costs and prevent unexpected usage spikes (see the example after this list).
4. Cache frequent requests to reduce latency and save on token costs.
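For instance, the timeout and token-cap recommendations above can be applied on the client and on each request (values are illustrative):

from openai import OpenAI

client = OpenAI(
    api_key="sk-ht-xxxx",
    base_url="https://api.heytoken.ai",
    timeout=60.0  # seconds; fail fast instead of hanging on long generations
)
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Hello!"}],
    max_tokens=512  # hard cap on completion length to bound cost
)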