Installation¶
Step 1 — Core package¶
Step 2 — Choose an inference backend¶
Unsloth — recommended (GPU required)¶
GPU-specific PyTorch must be installed before the unsloth extra, otherwise pip falls back to a slow CPU-only build.
Supported checkpoints
unsloth/Qwen3-VL-3B-Instruct, unsloth/Qwen3-VL-8B-Instruct,
unsloth/gemma-3-4b-it, unsloth/Qwen2-VL-2B-Instruct,
unsloth/Qwen2.5-VL-7B-Instruct-bnb-4bit.
Any vision model that unsloth.FastVisionModel can load should work.
Ollama — lightweight (no GPU required)¶
Install the Ollama application first:
Download the installer from ollama.com, then:
llama.cpp — CLI-based¶
The llama-mtmd-cli binary must be installed separately:
Then install the Python binding:
Cloud APIs (Claude / GPT-4o / Gemini)¶
Optional extras¶
| Extra | What it adds |
|---|---|
audio |
pydub — needed for audio slicing (get_sound_from_location) |
all |
All inference backends + API providers (no audio) |
all,audio |
Everything |
dev |
Pytest, ruff, build tools |
GPU torch + [all]
Pre-install the CUDA torch wheel before running pip install "urban-worm[all]".
See the Unsloth tab above for the one-liner.