Vision

Analyze local images or image URLs using Vision API. Supports content description, text extraction (OCR), object recognition, and more.

Model Selection

The vision tool uses a multi-level auto-selection strategy with automatic fallback — no manual configuration required:
  1. Main model — uses the currently configured main model for image recognition (zero extra cost)
  2. Other configured models — auto-discovers other models with configured API keys as alternatives
  3. OpenAI — uses open_ai_api_key to call gpt-4.1-mini
  4. LinkAI — uses linkai_api_key to call LinkAI vision service
When use_linkai=true, LinkAI is promoted to the highest priority. If the current provider fails, the tool automatically tries the next one until it succeeds or all fail.
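The fallback loop above can be sketched as follows. This is a minimal illustration, not the tool's actual implementation: the `analyze_image` function and the provider objects (with `name` and `describe` attributes) are hypothetical names chosen for the example.

```python
def analyze_image(image, question, providers):
    """Try each vision provider in priority order until one succeeds.

    `providers` is assumed to be pre-sorted: main model first, then other
    configured models, then OpenAI, then LinkAI (or LinkAI first when
    use_linkai=true).
    """
    errors = []
    for provider in providers:
        try:
            # First provider that returns a result wins.
            return provider.describe(image, question)
        except Exception as exc:
            # Record the failure and fall through to the next provider.
            errors.append(f"{provider.name}: {exc}")
    raise RuntimeError("All vision providers failed: " + "; ".join(errors))
```

The key design point is that a failure at any level is swallowed and the next candidate is tried, so a misconfigured or rate-limited provider degrades gracefully instead of breaking image analysis outright.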

Supported Models

Vendor | Vision Model | Notes
OpenAI / Compatible | Main model | All OpenAI-compatible multimodal models
Baidu Qianfan | Main model | Multimodal main models (e.g. ernie-5.1) handle images directly; falls back to ernie-4.5-turbo-vl for text-only main models
Qwen (DashScope) | Main model | Via MultiModalConversation API
Claude | Main model | Anthropic native image format
Gemini | Main model | inlineData format
Doubao | Main model | doubao-seed-2-0 series natively supported
Kimi (Moonshot) | Main model | kimi-k2.6, kimi-k2.5 natively supported
ZhipuAI | glm-5v-turbo | Always uses dedicated vision model
MiniMax | MiniMax-Text-01 | Always uses dedicated vision model
ZhipuAI and MiniMax text models do not support image understanding, so their dedicated vision models are always used automatically.

Parameters

Parameter | Type | Required | Description
image | string | Yes | Local file path or HTTP(S) image URL
question | string | Yes | Question to ask about the image
Supported image formats: jpg, jpeg, png, gif, webp
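A caller can distinguish the two accepted forms of the `image` parameter and reject unsupported formats up front. The helper below is an illustrative sketch (the function name and return values are not part of the tool's API); it mirrors the accepted inputs and the format list above.

```python
import os
from urllib.parse import urlparse

# Formats listed in the documentation above.
ALLOWED_EXTS = {".jpg", ".jpeg", ".png", ".gif", ".webp"}

def classify_image_param(image: str) -> str:
    """Return "url" for an HTTP(S) URL or "path" for a local file path
    with a supported extension; raise ValueError otherwise."""
    if urlparse(image).scheme in ("http", "https"):
        return "url"
    ext = os.path.splitext(image)[1].lower()
    if ext not in ALLOWED_EXTS:
        raise ValueError(f"Unsupported image format: {ext or '(none)'}")
    return "path"
```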

Custom Configuration

To specify a particular model for the vision tool, add to config.json:
{
    "tool": {
        "vision": {
            "model": "ernie-4.5-turbo-vl"
        }
    }
}
In most cases no configuration is needed. The tool works automatically as long as the main model supports multimodal input or any vision-capable API key is configured.

Use Cases

  • Describe image content
  • Extract text from images (OCR)
  • Identify objects, colors, scenes
  • Analyze screenshots and scanned documents
Images larger than 1MB are automatically compressed (max edge 1536px). All images (including remote URLs) are converted to base64 for transmission to ensure compatibility with all model backends.
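The resize rule and the base64 step can be sketched with plain arithmetic and the standard library. This is illustrative only (helper names are invented here; the tool's actual resampling and encoding details may differ):

```python
import base64

MAX_EDGE = 1536  # documented cap on the longest image edge

def target_size(width: int, height: int) -> tuple:
    """Cap the longest edge at 1536px, preserving the aspect ratio."""
    longest = max(width, height)
    if longest <= MAX_EDGE:
        return (width, height)  # small enough, no resize
    scale = MAX_EDGE / longest
    return (round(width * scale), round(height * scale))

def to_base64(data: bytes) -> str:
    """Encode raw image bytes as base64 text for transmission."""
    return base64.b64encode(data).decode("ascii")
```

Converting everything, remote URLs included, to base64 trades some payload size for uniformity: every backend receives images the same way, regardless of whether it could fetch URLs itself.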