Xiaomi MiMo is a native omni-modal large model. A single mimo_api_key enables text chat, image understanding, and text-to-speech all at once.
All capabilities below can be configured in one place via the “Model Management” page in the Web Console — no need to manually edit the configuration file.

Text Chat

{
  "model": "mimo-v2.5-pro",
  "mimo_api_key": "YOUR_API_KEY",
  "mimo_api_base": "https://api.xiaomimimo.com/v1"
}
| Parameter | Description |
| --- | --- |
| model | Default recommendation: mimo-v2.5-pro; mimo-v2.5 is also supported |
| mimo_api_key | Create one in the MiMo Open Platform |
| mimo_api_base | Optional, defaults to https://api.xiaomimimo.com/v1 |
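
As a rough illustration of the parameter table above, the following sketch fills in the documented default for mimo_api_base when it is omitted. The function name `resolve_mimo_config` is hypothetical and not part of CowAgent; only the key names and the default URL come from this page.

```python
DEFAULT_MIMO_API_BASE = "https://api.xiaomimimo.com/v1"


def resolve_mimo_config(config: dict) -> dict:
    """Return a copy of `config` with the documented default applied.

    Hypothetical helper: CowAgent's own loader may behave differently.
    """
    if not config.get("mimo_api_key"):
        # The key is the one setting that has no default.
        raise ValueError("mimo_api_key is required")
    resolved = dict(config)
    # mimo_api_base is optional and falls back to the documented default.
    if not resolved.get("mimo_api_base"):
        resolved["mimo_api_base"] = DEFAULT_MIMO_API_BASE
    return resolved


if __name__ == "__main__":
    print(resolve_mimo_config({"model": "mimo-v2.5-pro", "mimo_api_key": "YOUR_API_KEY"}))
```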

Model Selection

| Model | Use Case |
| --- | --- |
| mimo-v2.5-pro | Flagship: native omni-modal + Agent capability, up to 1M tokens context |
| mimo-v2.5 | General-purpose, native omni-modal (text / image / video / audio) |

Thinking Mode

The MiMo V2.5 series enables “thinking mode” by default: the model emits reasoning_content (chain-of-thought) before the final answer, improving performance on complex tasks. Use the global enable_thinking flag to toggle visibility (also switchable from the Web Console settings):
{
  "enable_thinking": true
}
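
To make the effect of the flag concrete, here is a minimal sketch of separating the chain-of-thought from the final answer. It assumes, purely for illustration, that a reply arrives as a dict with `reasoning_content` and `content` keys; the actual response shape and the function `split_reply` are assumptions, not CowAgent's or MiMo's documented API.

```python
from typing import Optional, Tuple


def split_reply(message: dict, enable_thinking: bool) -> Tuple[Optional[str], str]:
    """Return (reasoning, answer) for a reply message.

    Assumed shape: {"reasoning_content": "...", "content": "..."}.
    When enable_thinking is False the reasoning is hidden (returned as None).
    """
    answer = message.get("content", "")
    if not enable_thinking:
        return None, answer
    # An empty or missing reasoning field is normalized to None.
    return message.get("reasoning_content") or None, answer


if __name__ == "__main__":
    msg = {"reasoning_content": "2 + 2 is 4.", "content": "4"}
    print(split_reply(msg, enable_thinking=True))
    print(split_reply(msg, enable_thinking=False))
```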

Image Understanding

Once mimo_api_key is configured, the Agent’s Vision tool can automatically use MiMo’s vision models:
  • When the main model itself is multimodal (mimo-v2.5-pro / mimo-v2.5), images are handled directly by the main model with no extra setup.
  • When the main model belongs to another vendor, the Vision tool falls back to mimo-v2.5-pro.
To force a specific Vision model, set it explicitly in the configuration:
{
  "tools": {
    "vision": {
      "provider": "mimo",
      "model": "mimo-v2.5-pro"
    }
  }
}
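
The selection rules described in this section can be sketched as a small function. This is an illustrative reading of the rules, not CowAgent's actual code: `pick_vision_model` is a hypothetical name, and the sketch assumes mimo_api_key is already configured.

```python
# Multimodal main models named on this page.
MIMO_MULTIMODAL_MODELS = {"mimo-v2.5-pro", "mimo-v2.5"}
FALLBACK_VISION_MODEL = "mimo-v2.5-pro"


def pick_vision_model(config: dict) -> str:
    """Pick the model the Vision tool would use, per the rules above.

    Hypothetical helper; assumes mimo_api_key is set in `config`.
    """
    # 1. An explicit tools.vision.model setting wins.
    explicit = config.get("tools", {}).get("vision", {}).get("model")
    if explicit:
        return explicit
    # 2. A multimodal MiMo main model handles images itself.
    main_model = config.get("model", "")
    if main_model in MIMO_MULTIMODAL_MODELS:
        return main_model
    # 3. Otherwise fall back to the flagship MiMo model.
    return FALLBACK_VISION_MODEL


if __name__ == "__main__":
    print(pick_vision_model({"model": "mimo-v2.5"}))
    print(pick_vision_model({"model": "some-other-vendor-model"}))
```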

Text-to-Speech (TTS)

{
  "text_to_voice": "mimo",
  "text_to_voice_model": "mimo-v2.5-tts",
  "tts_voice_id": "冰糖"
}
| Parameter | Description |
| --- | --- |
| text_to_voice_model | Currently only mimo-v2.5-tts (preset voices + singing mode) |
| tts_voice_id | Preset voice name (Chinese voice IDs use the Chinese name directly) |

Preset Voices

| Voice ID | Description |
| --- | --- |
| Mia | English · Female |
| Chloe | English · Female |
| Milo | English · Male |
| Dean | English · Male |
| 冰糖 | Chinese · Female (default) |
| 茉莉 | Chinese · Female |
| 苏打 | Chinese · Male |
| 白桦 | Chinese · Male |
You can also pick a voice visually from the Web Console under “Model Management → Text-to-Speech”.
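
If you generate the TTS settings programmatically, a quick check against the preset list can catch a mistyped voice ID before it reaches the configuration file. The sketch below is only an illustration: `build_tts_config` is a hypothetical helper, and the voice list is copied from the table on this page, so it may lag behind the platform.

```python
# Preset voices as listed on this page (may not be exhaustive in the future).
PRESET_VOICES = {
    "Mia": "English · Female",
    "Chloe": "English · Female",
    "Milo": "English · Male",
    "Dean": "English · Male",
    "冰糖": "Chinese · Female",
    "茉莉": "Chinese · Female",
    "苏打": "Chinese · Male",
    "白桦": "Chinese · Male",
}
DEFAULT_VOICE = "冰糖"


def build_tts_config(voice_id: str = DEFAULT_VOICE) -> dict:
    """Return the TTS config fragment for a known preset voice."""
    if voice_id not in PRESET_VOICES:
        raise ValueError(f"unknown voice ID: {voice_id!r}")
    return {
        "text_to_voice": "mimo",
        "text_to_voice_model": "mimo-v2.5-tts",
        "tts_voice_id": voice_id,
    }


if __name__ == "__main__":
    print(build_tts_config("Milo"))
```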

Style Control

MiMo TTS supports embedding audio tags in the synthesis text to control emotion, tone, dialect, persona, and even singing. Tags must appear in the text that will be synthesized to speech (i.e. the Agent’s reply), with the overall style tag placed at the very beginning:
(style)content-to-synthesize
Half-width (), full-width （）, and [] brackets are all accepted. Both Chinese and English style descriptors work, so pick whichever language expresses the timbre most precisely. Common examples:
| Category | Example tags |
| --- | --- |
| Basic emotions | happy, sad, angry, fear, surprised, excited, aggrieved, calm, indifferent |
| Compound emotions | wistful, relieved, helpless, guilty, at ease, uneasy, touched |
| Overall tone | gentle, aloof, lively, serious, languid, playful, deep, sharp, cutting |
| Voice character | magnetic, mellow, bright, ethereal, childlike, aged, sweet, husky |
| Persona | squeaky, mature lady, young boy, uncle, Taiwanese accent |
| Dialect | Northeastern, Sichuan, Henan, Cantonese |
| Role-play | Sun Wukong, Lin Daiyu |
| Singing | sing / singing |
Examples:
  • (magnetic)The night is deep, and the city is still breathing.
  • (gentle)Take a breath. You've got this.
  • (serious)This is the final warning before the system reboots.
  • (singing)Oh, when the saints go marching in…
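
The tag format shown in these examples can be produced and read back with a few lines of Python. This is a minimal sketch under the rules stated above (overall style tag at the very beginning; half-width, full-width, and square brackets accepted); `with_style` and `split_leading_style` are hypothetical helper names, not part of CowAgent or the MiMo API.

```python
import re
from typing import Optional, Tuple

# Matches one leading tag in any of the three accepted bracket styles.
_LEADING_TAG = re.compile(r"^\s*(?:\(([^()]+)\)|（([^（）]+)）|\[([^\[\]]+)\])")


def with_style(style: str, text: str) -> str:
    """Prefix `text` with an overall style tag, e.g. (gentle)Hello."""
    style = style.strip()
    return f"({style}){text}" if style else text


def split_leading_style(text: str) -> Tuple[Optional[str], str]:
    """Return (style, remaining_text); style is None when no tag leads."""
    match = _LEADING_TAG.match(text)
    if match is None:
        return None, text
    # Exactly one of the three groups is set, depending on bracket style.
    style = next(g for g in match.groups() if g is not None).strip()
    return style, text[match.end():]


if __name__ == "__main__":
    tagged = with_style("gentle", "Take a breath. You've got this.")
    print(tagged)
    print(split_leading_style(tagged))
```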
You can also insert fine-grained audio tags at any position in the text to control breathing, laughter, pauses, etc. For example:
(nervous, deep breath) Phew… stay calm, stay calm. (faster pace) I've rehearsed this intro fifty times, it'll be fine.
See the MiMo speech synthesis documentation for the full tag list.
When CowAgent calls TTS, the Agent’s reply text (including any (...) tags) is forwarded directly to MiMo for synthesis. Tell the model in its persona / system prompt to “prefix replies with a (style) tag to control the tone”, and IM channels (WeChat / Feishu / DingTalk / WeCom) will play voice replies with the corresponding emotion, dialect, or even singing.
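
Because the tags travel inside the reply text, you may want a tag-free copy of the same reply for display or logging. The following sketch lists and strips bracketed tags; it is a hypothetical post-processing step, not something CowAgent is documented to do, and it would also remove ordinary parenthetical remarks, so treat it as a starting point only.

```python
import re
from typing import List

# Any bracketed span in the three accepted bracket styles.
_ANY_TAG = re.compile(r"\([^()]*\)|（[^（）]*）|\[[^\[\]]*\]")


def list_audio_tags(text: str) -> List[str]:
    """Return the contents of every bracketed tag, in order of appearance."""
    # Drop the opening and closing bracket of each match.
    return [m.group(0)[1:-1].strip() for m in _ANY_TAG.finditer(text)]


def strip_audio_tags(text: str) -> str:
    """Remove bracketed tags and tidy the leftover whitespace.

    Caveat: ordinary parenthetical text is removed as well.
    """
    without_tags = _ANY_TAG.sub("", text)
    return re.sub(r"\s{2,}", " ", without_tags).strip()


if __name__ == "__main__":
    reply = "(nervous, deep breath) Phew… stay calm, stay calm. (faster pace) It'll be fine."
    print(list_audio_tags(reply))
    print(strip_audio_tags(reply))
```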