HeyToken AI Developer Hub

Build with the world's best AI models.

Access all models through a single, OpenAI-compatible gateway. Sub-millisecond routing, unified billing, and 100% SDK compatibility.

Quickstart

Get up and running in less than 60 seconds using the official OpenAI SDKs.

from openai import OpenAI

client = OpenAI(
    api_key="sk-ht-xxxx",
    base_url="https://api.heytoken.ai/v1"
)

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Hello!"}]
)

print(response.choices[0].message.content)

Authentication

All API requests must include your API key in the Authorization HTTP header. You can generate and manage your keys in the API Keys dashboard.

Authorization: Bearer sk-ht-xxxxxxxxxxxx

Protect your Secret Key

Your API key carries the same privileges as your account. Never share it, commit it to version control, or use it in client-side code.
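
A common pattern is to load the key from an environment variable instead of hard-coding it. A minimal sketch (the HEYTOKEN_API_KEY variable name is our own choice, not one the platform mandates):

import os
from openai import OpenAI

# Read the key from the environment so it never appears in source control.
# HEYTOKEN_API_KEY is an illustrative name; use whatever your deployment defines.
api_key = os.environ.get("HEYTOKEN_API_KEY")
if not api_key:
    raise RuntimeError("HEYTOKEN_API_KEY is not set")

client = OpenAI(
    api_key=api_key,
    base_url="https://api.heytoken.ai/v1"
)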

Chat Completions

POST /v1/chat/completions

Generate intelligent responses using state-of-the-art language models. Our API supports high-performance streaming (SSE) and multimodal inputs.

Request Parameters

model (string, required)
    ID of the model to use. See /v1/models for a full list.

messages (array, required)
    A list of messages comprising the conversation.

stream (boolean, optional)
    If true, partial message deltas will be sent via SSE. Default: false.

temperature (number, optional)
    Sampling temperature between 0 and 2. Default: 0.7.

max_tokens (integer, optional)
    The maximum number of tokens to generate.

reasoning (boolean, optional)
    Enable chain-of-thought output for supported models (o1, DeepSeek). Default: false.

Response Formats

Standard JSON (Non-streaming)

{
  "id": "chatcmpl-123",
  "object": "chat.completion",
  "created": 1677652288,
  "model": "gpt-4o",
  "choices": [{
    "index": 0,
    "message": {
      "role": "assistant",
      "content": "Hello!"
    },
    "finish_reason": "stop"
  }],
  "usage": {
    "prompt_tokens": 9,
    "completion_tokens": 12,
    "total_tokens": 21
  }
}

Streaming (Server-Sent Events)

# data: prefix for each chunk
data: {"choices":[{"delta":{"content":"Hello"},"index":0}]}
data: {"choices":[{"delta":{"content":" world"},"index":0}]}
data: [DONE]

Tip: Set stream: true to significantly reduce perceived latency (Time To First Token).
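
With the OpenAI SDK you rarely parse the SSE frames yourself; passing stream=True returns an iterator of chunks. A minimal sketch, reusing the client from the Quickstart:

stream = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Hello!"}],
    stream=True
)

# Print tokens as they arrive instead of waiting for the full response.
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()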

Embeddings

Convert text into high-dimensional vectors for semantic search, clustering, and RAG applications.

cURL Example

curl https://api.heytoken.ai/v1/embeddings \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer sk-ht-xxxx" \
  -d '{
    "input": "The food was delicious and the service was excellent.",
    "model": "text-embedding-3-small"
  }'
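
The equivalent call through the Python SDK, plus a small cosine-similarity helper to illustrate the semantic-search use case (the helper is ours, not part of the API):

resp = client.embeddings.create(
    model="text-embedding-3-small",
    input=[
        "The food was delicious and the service was excellent.",
        "The meal was great and the staff were friendly."
    ]
)

def cosine(a, b):
    # Cosine similarity without external dependencies.
    dot = sum(x * y for x, y in zip(a, b))
    return dot / ((sum(x * x for x in a) ** 0.5) * (sum(y * y for y in b) ** 0.5))

print(f"similarity: {cosine(resp.data[0].embedding, resp.data[1].embedding):.3f}")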

Image Generation

Create high-resolution images from text prompts using DALL-E 3, Midjourney, and Stable Diffusion.

POST /v1/images/generations
Key Features
  • Multiple aspect ratios
  • HD quality support
  • Style consistency

Example Request

{
  "model": "dall-e-3",
  "prompt": "A futuristic city at sunset",
  "size": "1024x1024",
  "quality": "hd"
}
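
Because the gateway is OpenAI-compatible, the same request should also work through the SDK's images API; a sketch under that assumption:

result = client.images.generate(
    model="dall-e-3",
    prompt="A futuristic city at sunset",
    size="1024x1024",
    quality="hd"
)

# DALL-E 3 style responses return a URL (or base64 payload) per generated image.
print(result.data[0].url)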

Video Generation

Generate cinematic videos from text or images using Sora, Runway, and Kling.

POST /v1/videos/generations
Asynchronous API
  1. Submit a generation task to receive a task_id.
  2. Poll the status endpoint or wait for a webhook callback (a polling sketch follows below).
  3. Download the high-quality MP4 result once completed.
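
A polling sketch in plain HTTP. The status path (/v1/videos/generations/{task_id}), the model ID, and the status/video_url fields are assumptions for illustration; check the actual task response for the real shape.

import time
import requests

BASE = "https://api.heytoken.ai/v1"
HEADERS = {"Authorization": "Bearer sk-ht-xxxx"}

# 1. Submit a generation task. The model ID here is illustrative.
task = requests.post(
    f"{BASE}/videos/generations",
    headers=HEADERS,
    json={"model": "sora", "prompt": "A drone shot over a coastline at dawn"}
).json()
task_id = task["task_id"]

# 2. Poll until the task finishes. The status path and field names are assumed.
while True:
    status = requests.get(f"{BASE}/videos/generations/{task_id}", headers=HEADERS).json()
    if status.get("status") in ("completed", "failed"):
        break
    time.sleep(5)

# 3. Download the MP4 once completed.
if status["status"] == "completed":
    with open("result.mp4", "wb") as f:
        f.write(requests.get(status["video_url"]).content)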

Realtime Voice (Beta)

Low-latency, full-duplex voice interactions using OpenAI Realtime and Google Gemini Multimodal.

This endpoint requires a WebSocket connection. Documentation for the /v1/realtime socket protocol is available upon request for Enterprise customers.

Rate Limits

Rate limits are enforced to ensure fair usage and system stability across all developers.

RPM: 100 requests per minute
RPH: 1,000 requests per hour
TPM: 100,000 tokens per minute
Concurrency: 10 active streams

Errors & Handling

We use standard HTTP status codes. A successful request returns a 2xx status code.

401 Unauthorized: Invalid or missing API key. Check your Authorization header format.

402 Payment Required: Insufficient balance. Top up your credits in the Billing section.

404 Not Found: The requested model or endpoint does not exist.

429 Rate Limit: Too many requests. Implement exponential backoff (see the sketch below).

500 Server Error: Upstream provider is temporarily down. Our system will auto-retry another channel.

503 Overloaded: The model is currently experiencing high traffic. Try again in a few seconds.
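
For 429 and transient 5xx responses, an exponential-backoff wrapper around the SDK call is usually enough (note the official SDK already retries a couple of times by default). A minimal sketch, reusing the client from the Quickstart:

import random
import time
from openai import APIStatusError

def with_backoff(call, max_retries=5):
    # Double the wait after each failed attempt and add jitter so that
    # many clients do not retry in lockstep.
    delay = 1.0
    for attempt in range(max_retries):
        try:
            return call()
        except APIStatusError as e:
            if e.status_code not in (429, 500, 503) or attempt == max_retries - 1:
                raise
            time.sleep(delay + random.uniform(0, delay))
            delay *= 2

response = with_backoff(lambda: client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Hello!"}]
))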

Best Practices

  1. Use streaming for a better user experience in chat applications.
  2. Implement timeouts (30-60 seconds recommended) to handle long-running generations.
  3. Set max_tokens to control costs and prevent unexpected usage spikes.
  4. Cache frequent requests to reduce latency and save on token costs (see the sketch below).
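
For point 4, a minimal in-process cache keyed on the request payload; this only makes sense for deterministic requests (temperature 0):

import hashlib
import json

_cache = {}

def cached_completion(client, **kwargs):
    # Hash the full request so identical prompts and settings share one result.
    key = hashlib.sha256(json.dumps(kwargs, sort_keys=True).encode()).hexdigest()
    if key not in _cache:
        _cache[key] = client.chat.completions.create(**kwargs)
    return _cache[key]

reply = cached_completion(
    client,
    model="gpt-4o",
    temperature=0,
    messages=[{"role": "user", "content": "Define RAG in one sentence."}]
)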