The problem
NexToken is an OpenAI-compatible gateway: keep your existing OpenAI client,
point base_url at us, and reach 40+ models across providers
(OpenAI, Anthropic, Google, Qwen, local) through one
/v1/chat/completions endpoint and one key.
Text worked everywhere. Images didn't. If you sent the standard OpenAI vision
shape — a content array mixing text and image_url
parts — the request was rejected, because our message schema only accepted a
string. Anyone doing document, chart, or screenshot analysis was stuck. This
is how we closed that gap without breaking a single existing call.
Constraint 1: backward compatibility is non-negotiable
A large share of existing traffic sends content as a plain
string. The fix had to keep those requests working byte-for-byte, so content
became a union:
content: str | list[dict] | None
String in → unchanged path. List in → new multimodal path. No migration for anyone.
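As a rough illustration of that union, the sketch below routes a message by the shape of its content. The `route_content` helper and the path labels are hypothetical names for this example, not the gateway's actual code.

```python
from typing import Any, Union

# Mirrors the union above: a plain string, a list of parts, or nothing.
Content = Union[str, list[dict[str, Any]], None]


def route_content(content: Content) -> tuple[str, Content]:
    """Pick a processing path from the shape of `content` (hypothetical helper)."""
    if content is None or isinstance(content, str):
        # Legacy path: the value is handed on untouched.
        return "text", content
    if isinstance(content, list):
        # New path: every part must be an object that declares its type.
        for part in content:
            if not isinstance(part, dict) or "type" not in part:
                raise ValueError("each content part must be an object with a 'type'")
        return "multimodal", content
    raise TypeError(f"unsupported content type: {type(content).__name__}")
```

Because the string branch returns the original value as-is, existing callers would see no difference in what gets forwarded.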
Constraint 2: every provider speaks a different dialect
OpenAI's vision format is a content array of
{type:"text"} and {type:"image_url"} parts.
Anthropic's Messages API wants {type:"image","source":{…}}
blocks, with base64 and URL sources expressed differently. Our job as a
gateway is to absorb that difference so you never see it:
# OpenAI image_url -> Anthropic image block
data: URL -> {"type":"image","source":{"type":"base64","media_type":…,"data":…}}
http(s) URL -> {"type":"image","source":{"type":"url","url":…}}
Plain strings still pass straight through, so we never pay conversion cost on the common text path.
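A minimal sketch of that mapping might look like the following. The function names and the data-URL regex are assumptions made for this example; only the input and output shapes come from the mapping described above.

```python
import re
from typing import Any, Union

# Matches data:<media type>;base64,<payload>. An illustrative pattern, not a full RFC 2397 parser.
_DATA_URL = re.compile(
    r"^data:(?P<media_type>[\w.+-]+/[\w.+-]+);base64,(?P<data>.+)$", re.DOTALL
)


def image_url_to_anthropic(part: dict[str, Any]) -> dict[str, Any]:
    """Convert one OpenAI image_url part into an Anthropic image block."""
    url = part["image_url"]["url"]
    if url.startswith("data:"):
        match = _DATA_URL.match(url)
        if match is None:
            raise ValueError("malformed data URL")
        source = {
            "type": "base64",
            "media_type": match["media_type"],
            "data": match["data"],
        }
    elif url.startswith(("http://", "https://")):
        source = {"type": "url", "url": url}
    else:
        raise ValueError("unsupported image URL scheme")
    return {"type": "image", "source": source}


def to_anthropic_content(content: Union[str, list[dict[str, Any]]]):
    """Translate message content; plain strings are returned unchanged."""
    if isinstance(content, str):
        return content  # common text path: no conversion work
    blocks = []
    for part in content:
        if part["type"] == "text":
            blocks.append({"type": "text", "text": part["text"]})
        elif part["type"] == "image_url":
            blocks.append(image_url_to_anthropic(part))
        else:
            raise ValueError(f"unsupported part type: {part['type']}")
    return blocks
```

Rejecting malformed data URLs during translation keeps the failure inside the gateway rather than surfacing as an upstream error.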
Constraint 3: fail loudly, in the right place
Send an image to a model that can't see and the worst outcome is a cryptic error from an upstream after billing pre-checks have already run. Our catalog tags each model with capabilities, so we gate at the edge:
image content + non-vision model
-> 400 NEX_MODEL_NO_VISION
"Model 'X' does not support image input.
Use claude-sonnet-4-6, gpt-4o, …"
You learn what's wrong before the request leaves our building.
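The gate can be pictured roughly as below. The `CATALOG` contents, the `GatewayError` class, and the `text-only-model` entry are hypothetical stand-ins for this sketch; only the 400 status and the `NEX_MODEL_NO_VISION` code come from the behavior described above.

```python
from typing import Any

# Hypothetical capability catalog; the real one is larger and lives in the gateway.
CATALOG: dict[str, set[str]] = {
    "claude-sonnet-4-6": {"vision"},
    "gpt-4o": {"vision"},
    "text-only-model": set(),  # made-up model name for illustration
}


class GatewayError(Exception):
    """Error returned to the caller before any upstream request is made."""

    def __init__(self, status: int, code: str, message: str) -> None:
        super().__init__(message)
        self.status = status
        self.code = code
        self.message = message


def has_image(messages: list[dict[str, Any]]) -> bool:
    """True if any message carries an image_url part."""
    for message in messages:
        content = message.get("content")
        if isinstance(content, list):
            if any(part.get("type") == "image_url" for part in content):
                return True
    return False


def check_vision(model: str, messages: list[dict[str, Any]]) -> None:
    """Raise a 400 if image content targets a model without vision."""
    if not has_image(messages):
        return  # text-only requests are never gated
    if "vision" in CATALOG.get(model, set()):
        return
    suggestions = ", ".join(sorted(m for m, caps in CATALOG.items() if "vision" in caps))
    raise GatewayError(
        400,
        "NEX_MODEL_NO_VISION",
        f"Model '{model}' does not support image input. Use {suggestions}.",
    )
```

Running this check before forwarding means a mismatched request fails with a specific, actionable message instead of an opaque upstream error.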
Constraint 4: don't open a compliance hole
We run content moderation on prompts before forwarding upstream — it protects the shared provider accounts every customer depends on. Text was moderated; images would have bypassed it. So image parts now go through the same moderation step. Closing that gap was part of shipping, not a follow-up.
What it looks like to you
curl https://api.nextoken.biz/v1/chat/completions \
  -H "Authorization: Bearer $NEX_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "claude-sonnet-4-6",
    "max_tokens": 1024,
    "messages": [{ "role": "user", "content": [
      {"type": "text", "text": "What does this chart show?"},
      {"type": "image_url",
       "image_url": {"url": "data:image/png;base64,…"}}
    ]}]
  }'
The response is the OpenAI shape you already parse —
choices[0].message.content — and every response also reports the
exact cost of that call in nex.cost_usd.
image_url accepts both base64 data URLs and http(s) URLs.
Vision-capable models (see /v1/models): claude-sonnet-4-6,
claude-opus-4-6, claude-haiku-4-5-20251001,
gpt-4o, gpt-4o-mini, gpt-4.1,
gemini-2.5-pro. Send an image to any of them through the same
endpoint and key.
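For callers building the request in code rather than with curl, a small sketch using only the standard library could look like this. The helper names are hypothetical, and reading the cost from a top-level `nex` object is an assumption about where `nex.cost_usd` sits in the response body.

```python
import base64
from typing import Any, Optional


def build_vision_request(
    model: str,
    prompt: str,
    image_bytes: bytes,
    media_type: str = "image/png",
    max_tokens: int = 1024,
) -> dict[str, Any]:
    """Build a chat completions body with one text part and one inline image."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return {
        "model": model,
        "max_tokens": max_tokens,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:{media_type};base64,{encoded}"}},
            ],
        }],
    }


def read_response(body: dict[str, Any]) -> tuple[str, Optional[float]]:
    """Pull the reply text and the per-call cost out of a parsed response."""
    text = body["choices"][0]["message"]["content"]
    # Assumption: cost is exposed as body["nex"]["cost_usd"].
    cost = body.get("nex", {}).get("cost_usd")
    return text, cost
```

The returned dictionary can be serialized with `json.dumps` and posted to /v1/chat/completions by any HTTP client, or its `messages` list passed to an OpenAI SDK client whose base_url points at the gateway.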
Things we kept honest
- Image tokens are counted and billed like any input; every response still reports nex.cost_usd for that exact call.
- One message can carry multiple images for multi-image analysis.
- Tests cover string content, text+image arrays, base64 vs URL sources, malformed data URLs, the vision gate, and image moderation.
Takeaway
A good gateway is one your code doesn't notice. Multimodal shipped as an additive union type, a per-provider translation layer, an edge capability gate, and a moderation extension — and zero changes for anyone sending plain text. That's the bar we hold for every feature.
Build multimodal without picking a vendor
One OpenAI-compatible endpoint, 40+ models, per-request cost visibility.