MiMo - CowAgent

Xiaomi MiMo is a native omni-modal large model. A single mimo_api_key enables text chat, image understanding, and text-to-speech all at once.

All capabilities below can be configured in one place on the “Model Management” page of the Web Console, with no need to edit the configuration file by hand.

Text Chat

{
  "model": "mimo-v2.5-pro",
  "mimo_api_key": "YOUR_API_KEY",
  "mimo_api_base": "https://api.xiaomimimo.com/v1"
}

Parameter	Description
`model`	Default recommendation: `mimo-v2.5-pro`; `mimo-v2.5` is also supported
`mimo_api_key`	Create one in the MiMo Open Platform
`mimo_api_base`	Optional, defaults to `https://api.xiaomimimo.com/v1`
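
The defaulting rule in the table can be sketched in Python. This is a minimal illustration, not CowAgent's actual loader. The helper name `resolve_mimo_settings`, the decision to raise on a missing key, and the use of the recommended model as a fallback are assumptions. Only the three keys and the default URL come from the table above.

```python
import json

DEFAULT_MIMO_API_BASE = "https://api.xiaomimimo.com/v1"  # default from the table above
RECOMMENDED_MODEL = "mimo-v2.5-pro"  # assumption: used here when "model" is absent


def resolve_mimo_settings(config: dict) -> dict:
    """Hypothetical helper: return the MiMo chat settings with defaults applied."""
    api_key = config.get("mimo_api_key")
    if not api_key:
        # Assumption: this sketch treats a missing key as a hard error.
        raise ValueError("mimo_api_key is required")
    return {
        "model": config.get("model") or RECOMMENDED_MODEL,
        "mimo_api_key": api_key,
        # mimo_api_base is optional, so fall back to the documented default.
        "mimo_api_base": config.get("mimo_api_base") or DEFAULT_MIMO_API_BASE,
    }


raw = '{"model": "mimo-v2.5", "mimo_api_key": "YOUR_API_KEY"}'
settings = resolve_mimo_settings(json.loads(raw))
print(settings["mimo_api_base"])
```

Because `mimo_api_base` is omitted from the input, the resolved settings carry the default endpoint.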

Model Selection

Model	Use Case
`mimo-v2.5-pro`	Flagship: native omni-modal + Agent capability, up to 1M tokens context
`mimo-v2.5`	General-purpose, native omni-modal (text / image / video / audio)

Thinking Mode

The MiMo V2.5 series enables “thinking mode” by default: the model emits reasoning_content (chain-of-thought) before the final answer, improving performance on complex tasks. Use the global enable_thinking flag to toggle visibility (also switchable from the Web Console settings):

{
  "enable_thinking": true
}
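
To make the visibility toggle concrete, here is a small sketch of how a client might render a reply. The message shape (a dict with `content` and `reasoning_content`) and the function `render_reply` are assumptions made for illustration. This is not CowAgent's rendering code, and the flag may have further effects not shown here.

```python
def render_reply(message: dict, enable_thinking: bool) -> str:
    """Hypothetical renderer: show the reasoning only when the flag is on."""
    answer = message.get("content", "")
    reasoning = message.get("reasoning_content")
    if enable_thinking and reasoning:
        # Reasoning comes before the final answer, mirroring the order it is emitted in.
        return f"[thinking]\n{reasoning}\n\n{answer}"
    return answer


message = {"reasoning_content": "2 + 2 is 4.", "content": "The answer is 4."}
visible = render_reply(message, enable_thinking=True)
hidden = render_reply(message, enable_thinking=False)
print(hidden)
```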

Image Understanding

Once mimo_api_key is configured, the Agent’s Vision tool can automatically use MiMo’s vision models:

When the main model itself is multimodal (mimo-v2.5-pro / mimo-v2.5), images are handled directly by the main model with no extra setup.
When the main model belongs to another provider, the Vision tool falls back to mimo-v2.5-pro.

To force a specific Vision model, set it explicitly in the configuration:

{
  "tools": {
    "vision": {
      "provider": "mimo",
      "model": "mimo-v2.5-pro"
    }
  }
}
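
The selection rules in this section can be sketched as a small function. This is an illustrative reading of the text, not the Vision tool's real implementation. The name `pick_vision_model` is hypothetical, and giving the explicit `tools.vision.model` setting priority over the main model is an assumption based on the word "force".

```python
MIMO_MULTIMODAL_MODELS = {"mimo-v2.5-pro", "mimo-v2.5"}
FALLBACK_VISION_MODEL = "mimo-v2.5-pro"


def pick_vision_model(config: dict) -> str:
    """Hypothetical helper: choose which model handles images."""
    vision = config.get("tools", {}).get("vision", {})
    if vision.get("model"):
        # Assumption: an explicit setting overrides everything else.
        return vision["model"]
    main_model = config.get("model", "")
    if main_model in MIMO_MULTIMODAL_MODELS:
        # A multimodal main model handles images itself.
        return main_model
    # Main model belongs to another provider: fall back to MiMo.
    return FALLBACK_VISION_MODEL


print(pick_vision_model({"model": "mimo-v2.5"}))
print(pick_vision_model({"model": "some-other-model"}))
```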

Text-to-Speech (TTS)

{
  "text_to_voice": "mimo",
  "text_to_voice_model": "mimo-v2.5-tts",
  "tts_voice_id": "冰糖"
}

Parameter	Description
`text_to_voice_model`	Currently only `mimo-v2.5-tts` is supported (preset voices and singing mode)
`tts_voice_id`	Preset voice name (Chinese voice IDs use the Chinese name directly)

Preset Voices

Voice ID	Description
`Mia`	English · Female
`Chloe`	English · Female
`Milo`	English · Male
`Dean`	English · Male
`冰糖`	Chinese · Female (default)
`茉莉`	Chinese · Female
`苏打`	Chinese · Male
`白桦`	Chinese · Male

You can also pick a voice visually from the Web Console under “Model Management → Text-to-Speech”.
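
If the configuration is written by hand rather than through the Web Console, a small check against the preset table can catch a mistyped voice ID early. The sketch below is an assumption-laden illustration. `build_tts_config` is a hypothetical helper, and rejecting unknown IDs is a choice made here, not documented behavior.

```python
import json

# Voice IDs copied from the preset table above.
PRESET_VOICES = {
    "Mia": "English · Female",
    "Chloe": "English · Female",
    "Milo": "English · Male",
    "Dean": "English · Male",
    "冰糖": "Chinese · Female",
    "茉莉": "Chinese · Female",
    "苏打": "Chinese · Male",
    "白桦": "Chinese · Male",
}
DEFAULT_VOICE = "冰糖"


def build_tts_config(voice_id=None) -> dict:
    """Hypothetical helper: build the TTS config block for a preset voice."""
    voice = voice_id or DEFAULT_VOICE
    if voice not in PRESET_VOICES:
        # Assumption: unknown IDs are rejected in this sketch.
        raise ValueError(f"unknown voice id: {voice!r}")
    return {
        "text_to_voice": "mimo",
        "text_to_voice_model": "mimo-v2.5-tts",
        "tts_voice_id": voice,
    }


# ensure_ascii=False keeps Chinese voice names readable in the output.
print(json.dumps(build_tts_config("Milo"), ensure_ascii=False, indent=2))
```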

Style Control

MiMo TTS supports embedding audio tags in the synthesis text to control emotion, tone, dialect, persona, and even singing. Tags must appear in the text that will be synthesized to speech (i.e. the Agent’s reply), with the overall style tag placed at the very beginning:

(style)content-to-synthesize

Half-width (), full-width （）, and [] brackets are all accepted. Both Chinese and English style descriptors work — pick whichever language expresses the timbre most precisely. Common examples:

Category	Example tags
Basic emotions	`happy` `sad` `angry` `fear` `surprised` `excited` `aggrieved` `calm` `indifferent`
Compound emotions	`wistful` `relieved` `helpless` `guilty` `at ease` `uneasy` `touched`
Overall tone	`gentle` `aloof` `lively` `serious` `languid` `playful` `deep` `sharp` `cutting`
Voice character	`magnetic` `mellow` `bright` `ethereal` `childlike` `aged` `sweet` `husky`
Persona	`squeaky` `mature lady` `young boy` `uncle` `Taiwanese accent`
Dialect	`Northeastern` `Sichuan` `Henan` `Cantonese`
Role-play	`Sun Wukong` `Lin Daiyu`
Singing	`sing` / `singing`

Examples:

(magnetic)The night is deep, and the city is still breathing.
(gentle)Take a breath. You've got this.
(serious)This is the final warning before the system reboots.
(singing)Oh, when the saints go marching in…
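
The leading-tag convention can be handled with two small helpers: one that detects an existing style tag in any of the three accepted bracket forms, and one that adds a tag when none is present. Both function names are hypothetical, and leaving an existing tag untouched is a design choice of this sketch, not something MiMo or CowAgent prescribes.

```python
# Bracket pairs accepted for tags: half-width, full-width, and square.
BRACKET_PAIRS = (("(", ")"), ("（", "）"), ("[", "]"))


def split_leading_style(text: str):
    """Return (style, rest) if text starts with a style tag, else (None, text)."""
    stripped = text.lstrip()
    for open_b, close_b in BRACKET_PAIRS:
        if stripped.startswith(open_b):
            end = stripped.find(close_b, len(open_b))
            if end > len(open_b):  # require a non-empty tag body
                style = stripped[len(open_b):end].strip()
                if style:
                    return style, stripped[end + len(close_b):].lstrip()
    return None, text


def with_style(style: str, text: str) -> str:
    """Prefix text with a (style) tag unless it already starts with one."""
    existing, _ = split_leading_style(text)
    if existing is not None:
        return text  # assumption: an existing tag is left untouched
    return f"({style}){text}"


print(with_style("gentle", "Take a breath. You've got this."))
print(split_leading_style("（温柔）你好"))
```

A check like `split_leading_style` can also be used on the Agent's reply to confirm that the system prompt instruction is being followed before the text is sent for synthesis.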

You can also insert fine-grained audio tags at any position in the text to control breathing, laughter, pauses, etc. For example:

(nervous, deep breath) Phew… stay calm, stay calm. (faster pace) I've rehearsed this intro fifty times, it'll be fine.

See the MiMo speech synthesis documentation for the full tag list.

When CowAgent calls TTS, the Agent’s reply text (including any (...) tags) is forwarded directly to MiMo for synthesis. Tell the model in its persona / system prompt to “prefix replies with a (style) tag to control the tone”, and IM channels (WeChat / Feishu / DingTalk / WeCom) will play voice replies with the corresponding emotion, dialect, or even singing.
