# vision - Image Understanding

> Analyze image content (recognition, description, OCR, etc.)

Analyzes local images or images at HTTP(S) URLs with a vision-capable model. Supports content description, text extraction (OCR), object recognition, and more.

## Model Selection

The vision tool uses a multi-level auto-selection strategy with automatic fallback — no manual configuration required:

1. **Main model** — uses the currently configured main model for image recognition (must be a multimodal model)
2. **Other configured models** — auto-discovers other multimodal models with configured API keys as alternatives

If the current provider fails, the tool automatically tries the next one until it succeeds or all fail.
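
The fallback behavior described above can be sketched roughly as follows. This is a minimal illustration, not the tool's actual implementation: the `Provider` type and the `ask_with_fallback` function are hypothetical names, and each provider is assumed to be a callable that either returns an answer or raises an exception.

```python
from typing import Callable, List, Tuple

# Hypothetical type: a provider is a (name, callable) pair. The callable takes
# (image, question) and returns the answer text, or raises on failure.
Provider = Tuple[str, Callable[[str, str], str]]


def ask_with_fallback(providers: List[Provider], image: str, question: str) -> Tuple[str, str]:
    """Try each provider in order; return (provider_name, answer) for the first success."""
    errors = []
    for name, call in providers:
        try:
            return name, call(image, question)
        except Exception as exc:  # any provider failure moves on to the next one
            errors.append(f"{name}: {exc}")
    # Reached only when every provider has failed (or the list was empty).
    raise RuntimeError("all vision providers failed: " + "; ".join(errors))
```

In this sketch the main model would be the first entry in `providers`, followed by any other multimodal models that have API keys configured.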

### Supported Models

| Provider            | Vision Model    | Notes                                                                                                                              |
| ------------------- | --------------- | ---------------------------------------------------------------------------------------------------------------------------------- |
| OpenAI / Compatible | Main model      | All OpenAI-protocol-compatible multimodal models                                                                                   |
| Qwen (DashScope)    | Main model      | e.g. qwen3.7-plus                                                                                                                  |
| Claude              | Main model      | Anthropic native image format                                                                                                      |
| Gemini              | Main model      | inlineData format                                                                                                                  |
| Doubao              | Main model      | doubao-seed-2-1 series natively supported                                                                                          |
| Kimi (Moonshot)     | Main model      | kimi-k2.6, kimi-k2.5 natively supported                                                                                            |
| ERNIE               | Main model      | Defaults to the multimodal main model (e.g. `ernie-5.1`); falls back to `ernie-4.5-turbo-vl` when the main model is not multimodal |
| ZhipuAI             | glm-5v-turbo    | Always uses the dedicated vision model                                                                                             |
| MiniMax             | MiniMax-Text-01 | Always uses the dedicated vision model                                                                                             |

<Note>
  ZhipuAI and MiniMax text models do not support image understanding, so their dedicated vision models are always used automatically.
</Note>

> When `use_linkai=true`, LinkAI's multimodal model is used by default.

## Custom Configuration

To pin the model used by the vision tool, set `tools.vision.model` in `config.json`, for example:

```json
{
    "tools": {
        "vision": {
            "model": "gpt-4.1"
        }
    }
}
```

The specified model is **used first**, and the tool automatically routes to the corresponding provider based on the model name; on failure, it falls back to other configured providers.

In most cases no configuration is needed — the tool works automatically as long as the main model supports multimodal input or any vision-capable API key is configured.
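
As a rough illustration of the "used first, then fall back" ordering, the sketch below reads `tools.vision.model` from a `config.json` string and places it ahead of the auto-discovered candidates. The function names and the `auto_discovered` list are assumptions made for this example; how the tool actually discovers and routes models is not shown here.

```python
import json
from typing import List, Optional


def configured_vision_model(config_text: str) -> Optional[str]:
    """Return tools.vision.model from a config.json string, or None if it is unset."""
    config = json.loads(config_text)
    return config.get("tools", {}).get("vision", {}).get("model")


def candidate_order(config_text: str, auto_discovered: List[str]) -> List[str]:
    """Put the configured model first, then the auto-discovered ones, without duplicates."""
    preferred = configured_vision_model(config_text)
    ordered = [preferred] if preferred else []
    ordered += [m for m in auto_discovered if m != preferred]
    return ordered
```

With the configuration shown above, `gpt-4.1` would come first in the list; with no `tools.vision` entry, the order is simply the auto-discovered one.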

## Parameters

| Parameter  | Type   | Required | Description                          |
| ---------- | ------ | -------- | ------------------------------------ |
| `image`    | string | Yes      | Local file path or HTTP(S) image URL |
| `question` | string | Yes      | Question to ask about the image      |

Supported image formats: jpg, jpeg, png, gif, webp
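
A caller could pre-check the two parameters along these lines. This is a simplified sketch under stated assumptions: `validate_args` is a hypothetical helper, and judging the format by file extension is a simplification (the tool itself may inspect the image differently, for example for URLs that carry no extension).

```python
import os
from urllib.parse import urlparse

# Formats listed in the documentation above.
SUPPORTED_EXTENSIONS = {".jpg", ".jpeg", ".png", ".gif", ".webp"}


def is_remote(image: str) -> bool:
    """True when the image argument is an HTTP(S) URL rather than a local path."""
    return urlparse(image).scheme in ("http", "https")


def validate_args(image: str, question: str) -> None:
    """Raise ValueError if a required parameter is missing or the extension is unsupported."""
    if not image or not question:
        raise ValueError("both 'image' and 'question' are required")
    # For URLs, look only at the path so query strings do not hide the extension.
    path = urlparse(image).path if is_remote(image) else image
    ext = os.path.splitext(path)[1].lower()
    if ext not in SUPPORTED_EXTENSIONS:
        raise ValueError(f"unsupported image format: {ext or '(none)'}")
```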

## Use Cases

* Describe image content
* Extract text from images (OCR)
* Identify objects, colors, scenes
* Analyze screenshots and scanned documents

<Note>
  Images larger than 1 MB are automatically compressed before upload. All images (including remote URLs) are converted to base64 for transmission to ensure compatibility with all model backends.
</Note>
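
The base64 step in the note above might look like the sketch below. It is illustrative only: the helper names are hypothetical, reading 1 MB as 1024 * 1024 bytes is an assumption, and the `data:` URL shape is just one common way to carry base64 images (some backends expect the raw base64 string plus a separate media type instead). The compression itself is left out because it would need an imaging library.

```python
import base64

# Assumption: "1 MB" is interpreted as 1024 * 1024 bytes.
COMPRESS_THRESHOLD = 1024 * 1024

# MIME types for the supported formats listed in this page.
MIME_TYPES = {
    "jpg": "image/jpeg",
    "jpeg": "image/jpeg",
    "png": "image/png",
    "gif": "image/gif",
    "webp": "image/webp",
}


def needs_compression(data: bytes) -> bool:
    """True when the raw image bytes exceed the compression threshold."""
    return len(data) > COMPRESS_THRESHOLD


def to_data_url(data: bytes, extension: str) -> str:
    """Encode image bytes as a base64 data URL for the given file extension."""
    mime = MIME_TYPES[extension.lower().lstrip(".")]
    encoded = base64.b64encode(data).decode("ascii")
    return f"data:{mime};base64,{encoded}"
```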
