Inference API

Table of contents

Authentication
List models
Chat completions
Streaming chat completions
Embeddings

The Inference API provides an OpenAI-compatible interface for chat completions, streaming chat completions, server-side web search, embeddings, and model listing. Standard OpenAI-compatible SDKs and HTTP clients can connect to the API by using the Tempico endpoints and a Tempico API key.

Authentication

All API requests require an API key. Bearer authentication is the recommended method.

Authorization: Bearer <API_KEY>

For compatibility, the API also accepts the x-api-key: <API_KEY> header.
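As a minimal sketch of the two header styles described above, the Python helpers below build the headers and prepare a request with the standard library only. The names `auth_headers` and `build_request` are hypothetical, and `BASE_URL` is simply the common prefix of the endpoint URLs shown in this document.

```python
import urllib.request

# Common prefix of the endpoint URLs used in this document.
BASE_URL = "https://api.tempico.com/v1"


def auth_headers(api_key: str, use_x_api_key: bool = False) -> dict:
    """Return headers for one of the two supported authentication styles."""
    if use_x_api_key:
        # Compatibility header.
        return {"x-api-key": api_key}
    # Recommended Bearer authentication.
    return {"Authorization": f"Bearer {api_key}"}


def build_request(path: str, api_key: str) -> urllib.request.Request:
    """Prepare, but do not send, an authenticated GET request (hypothetical helper)."""
    return urllib.request.Request(
        BASE_URL + path, headers=auth_headers(api_key), method="GET"
    )
```

Passing the prepared request to `urllib.request.urlopen` would send it; the sketch stops before that step so it can be inspected without network access.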

List models

Returns available model IDs for the current account. Returned id values can be used as the model field in chat completion and embedding requests.

GET

https://api.tempico.com/v1/models

{
  "object": "list",
  "data": [
    {
      "id": "kimi-k2.7-code:1t",
      "object": "model",
      "created": 0,
      "owned_by": "tempicolabs",
      "context_window": 262144,
      "capabilities": [
        "chat"
      ],
      "max_output_tokens": 65536
    },
    {
      "id": "embeddinggemma:300m",
      "object": "model",
      "created": 0,
      "owned_by": "tempicolabs",
      "context_window": 2048,
      "capabilities": [
        "embeddings"
      ]
    }
  ]
}

Field	Description
`id`	Model ID used in API requests.
`created`	Model creation timestamp when available.
`context_window`	Maximum context length supported by the model, in tokens.
`capabilities`	Supported API features for the model, such as chat or embeddings.
`max_output_tokens`	Maximum output token limit when available.
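Because `capabilities` tells a client which endpoint a model can serve, a client may want to filter the list before choosing a `model` value. The following Python sketch does that on an already-decoded response; the function name `model_ids_with` is hypothetical, and the sample data is trimmed from the example response above.

```python
# Trimmed copy of the example response shown above.
SAMPLE_MODELS = {
    "object": "list",
    "data": [
        {"id": "kimi-k2.7-code:1t", "capabilities": ["chat"], "max_output_tokens": 65536},
        {"id": "embeddinggemma:300m", "capabilities": ["embeddings"]},
    ],
}


def model_ids_with(models_response: dict, capability: str) -> list:
    """Return the IDs of models that list the given capability."""
    return [
        model["id"]
        for model in models_response.get("data", [])
        # Treat a missing capabilities field as "no capabilities".
        if capability in model.get("capabilities", [])
    ]


if __name__ == "__main__":
    print(model_ids_with(SAMPLE_MODELS, "chat"))
    print(model_ids_with(SAMPLE_MODELS, "embeddings"))
```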

Chat completions

Generates a model response from a conversation in OpenAI chat format. The endpoint supports standard JSON responses, streaming output, and server-side web search.

POST

https://api.tempico.com/v1/chat/completions

Web search

Server-side web search works as a tool available to the model during chat completion generation. For most chat models, web search is enabled by default. The model decides when search is needed and what query should be used.

Search behavior can be controlled through the prompt. System or user messages can define when search should be used, what sources to prefer, and when the model should answer without searching.

Set web_search to false to disable server-side search for a specific request.

curl https://api.tempico.com/v1/chat/completions \
  -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "kimi-k2.7-code:1t",
    "messages": [
      {
        "role": "system",
        "content": "You are a concise technical assistant."
      },
      {
        "role": "user",
        "content": "Search the web and summarize the current Python 3.13 release status."
      }
    ],
    "max_tokens": 800,
    "temperature": 0.2,
    "web_search": true,
    "web_search_options": {
      "search_context_size": "medium",
      "user_location": {
        "type": "approximate",
        "approximate": {
          "country": "US"
        }
      },
      "safesearch": "moderate"
    }
  }'

Field	Description
`model`	Model ID used to generate the response.
`messages`	Conversation history sent to the model in OpenAI chat format.
`max_tokens`	Maximum number of tokens the model can generate in the response.
`web_search`	Enables or disables server-side web search. Enabled by default for most models.
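`search_context_size`	Amount of search context passed to the model, set under `web_search_options`. Supported values are `low`, `medium`, and `high`.
`country`	Country code used as the location hint, set under `web_search_options.user_location.approximate`.
`safesearch`	Search safety preference passed to the search backend, set under `web_search_options`.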
`accept_language`	Language preference for search results.
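The request fields above can be assembled programmatically. The Python sketch below builds the JSON body and leaves `web_search` out unless the caller sets it, so the server-side default applies; the function name `build_chat_payload` is hypothetical, and only fields shown in the curl example are used.

```python
import json
from typing import Optional


def build_chat_payload(
    model: str,
    messages: list,
    max_tokens: Optional[int] = None,
    web_search: Optional[bool] = None,
    web_search_options: Optional[dict] = None,
) -> dict:
    """Build a chat completion request body (hypothetical helper)."""
    payload = {"model": model, "messages": messages}
    if max_tokens is not None:
        payload["max_tokens"] = max_tokens
    if web_search is not None:
        # Only sent when set explicitly; otherwise the server default applies.
        payload["web_search"] = web_search
    if web_search_options:
        payload["web_search_options"] = web_search_options
    return payload


if __name__ == "__main__":
    body = build_chat_payload(
        "kimi-k2.7-code:1t",
        [{"role": "user", "content": "Explain list comprehensions briefly."}],
        max_tokens=800,
        web_search=False,  # disable server-side search for this request
    )
    print(json.dumps(body, indent=2))
```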

Example Response

{
  "id": "chatcmpl-0000000000000000",
  "object": "chat.completion",
  "created": 0,
  "model": "kimi-k2.7-code:1t",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "The answer text appears here."
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 120,
    "completion_tokens": 64,
    "total_tokens": 184
  }
}

Field	Description
`finish_reason`	Reason generation stopped, such as reaching a stop condition or token limit.
`usage`	Token usage information for the request.
`prompt_tokens`	Number of input tokens processed by the model.
`completion_tokens`	Number of output tokens generated by the model.
`total_tokens`	Sum of input and output tokens.
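To illustrate how the response fields fit together, the Python sketch below extracts the assistant text from a decoded response and checks that `total_tokens` equals the sum of the other two counters. The function name `summarize_completion` is hypothetical; the sample data mirrors the example response above.

```python
# Trimmed copy of the example response shown above.
SAMPLE_RESPONSE = {
    "choices": [
        {
            "index": 0,
            "message": {"role": "assistant", "content": "The answer text appears here."},
            "finish_reason": "stop",
        }
    ],
    "usage": {"prompt_tokens": 120, "completion_tokens": 64, "total_tokens": 184},
}


def summarize_completion(response: dict) -> dict:
    """Pull the first choice's text and basic usage checks out of a response."""
    choice = response["choices"][0]
    usage = response.get("usage", {})
    expected_total = usage.get("prompt_tokens", 0) + usage.get("completion_tokens", 0)
    return {
        "text": choice["message"]["content"],
        # "stop" is the finish_reason shown in the example response.
        "stopped_normally": choice.get("finish_reason") == "stop",
        "usage_consistent": usage.get("total_tokens") == expected_total,
    }


if __name__ == "__main__":
    print(summarize_completion(SAMPLE_RESPONSE))
```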

Streaming chat completions

Returns a chat completion as a stream of Server-Sent Events. Each event contains a partial response chunk. The stream ends with data: [DONE].

POST

https://api.tempico.com/v1/chat/completions

curl https://api.tempico.com/v1/chat/completions \
  -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "kimi-k2.7-code:1t",
    "stream": true,
    "messages": [
      {
        "role": "user",
        "content": "Write a short Python example using requests."
      }
    ]
  }'

Example Stream Response

data: {
  "id": "chatcmpl-0000000000000000",
  "object": "chat.completion.chunk",
  "created": 0,
  "model": "kimi-k2.7-code:1t",
  "choices": [
    {
      "index": 0,
      "delta": {
        "role": "assistant"
      },
      "finish_reason": null
    }
  ]
}

data: {
  "id": "chatcmpl-0000000000000000",
  "object": "chat.completion.chunk",
  "created": 0,
  "model": "kimi-k2.7-code:1t",
  "choices": [
    {
      "index": 0,
      "delta": {
        "content": "import requests\n\n"
      },
      "finish_reason": null
    }
  ]
}

data: {
  "id": "chatcmpl-0000000000000000",
  "object": "chat.completion.chunk",
  "created": 0,
  "model": "kimi-k2.7-code:1t",
  "choices": [
    {
      "index": 0,
      "delta": {},
      "finish_reason": "stop"
    }
  ]
}

data: [DONE]

Field	Description
`stream`	Enables streaming mode when set to true. The response is returned as Server-Sent Events instead of one JSON object.
`choices`	Array of streamed output choices. For normal single-response generation, this usually contains one item.
`index`	Position of the choice in the choices array.
`delta`	Incremental update for the assistant message. It can contain `role` or `content`, and it can be empty in the final chunk.
`finish_reason`	Indicates why generation ended. The value is `null` while generation is still running.
`data: [DONE]`	Final stream marker. No more chunks are sent after this event.
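A client typically reassembles the message by concatenating the `content` values from each `delta` until the `data: [DONE]` marker arrives. The Python sketch below shows one way to do that over an iterable of text lines. It assumes each event's JSON arrives on a single `data:` line (the example above is pretty-printed across several lines), and the function names `iter_chunks` and `collect_text` are hypothetical.

```python
import json
from typing import Iterable, Iterator


def iter_chunks(lines: Iterable[str]) -> Iterator[dict]:
    """Yield decoded chunks from SSE lines, stopping at the [DONE] marker."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and any non-data lines
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            return  # final stream marker: nothing after it is read
        yield json.loads(data)


def collect_text(lines: Iterable[str]) -> str:
    """Concatenate delta.content for the choice at index 0."""
    parts = []
    for chunk in iter_chunks(lines):
        for choice in chunk.get("choices", []):
            if choice.get("index") != 0:
                continue
            content = choice.get("delta", {}).get("content")
            if content:  # role-only and empty deltas carry no text
                parts.append(content)
    return "".join(parts)
```

Any line iterator works as input, for example an HTTP response body decoded line by line.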

Embeddings

Embeddings convert text input into vectors for semantic search, similarity matching, clustering, and ranking.

POST

https://api.tempico.com/v1/embeddings

curl https://api.tempico.com/v1/embeddings \
  -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "embeddinggemma:300m",
    "input": "Represent this text as a vector for semantic search."
  }'

Example Response

{
  "object": "list",
  "data": [
    {
      "object": "embedding",
      "index": 0,
      "embedding": [
        0.0123,
        -0.0456,
        0.0789
      ]
    }
  ],
  "model": "embeddinggemma:300m",
  "usage": {
    "prompt_tokens": 12,
    "total_tokens": 12
  }
}

Field	Description
`input`	Text or array of texts used to create embeddings.
`embedding`	Vector representation of the input text.
`prompt_tokens`	Number of input tokens processed by the model.
`total_tokens`	Total tokens counted for the request.
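For the similarity-matching use case mentioned above, one common choice (not something this API prescribes) is cosine similarity between two returned vectors. The Python sketch below extracts vectors from a decoded response in `index` order and compares them; the function names `embedding_vectors` and `cosine_similarity` are hypothetical.

```python
import math


def embedding_vectors(response: dict) -> list:
    """Return the embedding vectors from a response, ordered by index."""
    items = sorted(response["data"], key=lambda item: item["index"])
    return [item["embedding"] for item in items]


def cosine_similarity(a: list, b: list) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)
```

Values near 1 indicate vectors pointing in nearly the same direction, and values near 0 indicate nearly orthogonal vectors.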
