Urban-WORM (minimal Colab demo)
This notebook shows the core VLM inference workflow in under 5 minutes on a free Colab T4 GPU.
What we do:
- Install urban-worm with the Unsloth backend
- Download three sample street-view images of a Detroit house (no API key needed)
- Ask the model a structured occupancy question about a single image
- Run batched inference over all three images and inspect the results
Before running: go to Runtime → Change runtime type and select T4 GPU.
1 Install
# @title Install urban-worm + Unsloth backend
# Colab already ships a CUDA-enabled PyTorch build, so we only need the extras.
!pip install "urban-worm[unsloth]" -q
# @title Verify GPU
import torch
if not torch.cuda.is_available():
    raise RuntimeError(
        "No GPU detected. Go to Runtime → Change runtime type and select T4 GPU."
    )
props = torch.cuda.get_device_properties(0)
print(f"GPU : {props.name}")
print(f"VRAM: {props.total_memory / 1024**3:.1f} GiB")
2 Download sample images
Three reoriented street-view images of the same residential property in Detroit, MI are bundled with the repository. No API key is needed.
# @title Download sample street-view images from the urban-worm repo
import urllib.request, pathlib
BASE = "https://raw.githubusercontent.com/billbillbilly/urbanworm/main/docs/data"
img_paths = []
for name in ["img_1.jpg", "img_2.jpg", "img_3.jpg"]:
    dst = pathlib.Path(name)
    if not dst.exists():
        urllib.request.urlretrieve(f"{BASE}/{name}", dst)
    img_paths.append(str(dst))
    print(f"Ready: {dst}")
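The download loop above skips files that already exist, so re-running the cell costs nothing. A minimal sketch of the same idea as a reusable helper follows. The names fetch_if_missing and fetcher are hypothetical, introduced here for illustration (they are not part of urban-worm); the injectable fetcher simply makes the caching logic easy to try without network access.

```python
import pathlib
import urllib.request
from typing import Callable, Iterable, List


def fetch_if_missing(
    base_url: str,
    names: Iterable[str],
    dest_dir: str = ".",
    fetcher: Callable[[str, pathlib.Path], object] = urllib.request.urlretrieve,
) -> List[str]:
    """Download each file only if it is not already on disk; return local paths."""
    out_dir = pathlib.Path(dest_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for name in names:
        dst = out_dir / name
        if not dst.exists():
            # Only hit the network for files we do not have yet.
            fetcher(f"{base_url}/{name}", dst)
        paths.append(str(dst))
    return paths
```

Calling it twice with the same arguments should trigger downloads only on the first call; the second call just returns the cached paths.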
# @title Preview the three images
import matplotlib.pyplot as plt
from PIL import Image
fig, axes = plt.subplots(1, 3, figsize=(15, 5))
for ax, path in zip(axes, img_paths):
    ax.imshow(Image.open(path))
    ax.axis("off")
    ax.set_title(path)
plt.tight_layout()
plt.show()
3 Define the output schema
Pass a plain dict to declare the structured fields the model must return.
Standard Python type hints (Literal, str, bool, …) control what values are allowed.
# @title Schema + prompts
from typing import Literal
schema = {
    "occupancy": (Literal["occupied", "unoccupied"], ...),
    "visual_evidence": (str, ...),
}
SYSTEM = "You are an urban researcher assessing residential housing conditions."
PROMPT = (
    "Does this house look occupied or unoccupied? "
    "Briefly describe the visual evidence that supports your answer."
)
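To make the effect of the schema concrete, here is a minimal standard-library sketch that checks a parsed response against a dict of (type, default) pairs like the one above. It is an illustration only: check_record is a hypothetical helper, and treating ... as "required" is an assumption about the convention, not a description of how urban-worm enforces the schema internally.

```python
from typing import Any, Dict, Literal, get_args, get_origin


def check_record(record: Dict[str, Any], schema: Dict[str, tuple]) -> list:
    """Return a list of problems; an empty list means the record conforms."""
    problems = []
    for field, (annotation, default) in schema.items():
        if field not in record:
            # Assumption: Ellipsis as the default marks the field as required.
            if default is ...:
                problems.append(f"missing required field: {field}")
            continue
        value = record[field]
        if get_origin(annotation) is Literal:
            allowed = get_args(annotation)
            if value not in allowed:
                problems.append(f"{field}={value!r} not in {allowed}")
        elif isinstance(annotation, type) and not isinstance(value, annotation):
            problems.append(f"{field} should be {annotation.__name__}")
    return problems


schema = {
    "occupancy": (Literal["occupied", "unoccupied"], ...),
    "visual_evidence": (str, ...),
}
good = {"occupancy": "occupied", "visual_evidence": "Curtains in the windows."}
bad = {"occupancy": "vacant"}  # label outside the Literal, evidence missing
print(check_record(good, schema))
print(check_record(bad, schema))
```

The second record is rejected twice over: "vacant" is not one of the allowed Literal values, and visual_evidence is absent. Any label outside the Literal set counts as invalid, however reasonable it sounds.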
4 Single-image inference
The model is downloaded from HuggingFace Hub on first run (~1.5 GB for the 2B model in 4-bit).
Subsequent cells reuse the same loaded model.
# @title Load model + run single-image inference
from urbanworm import InferenceUnsloth
infer = InferenceUnsloth(
    llm="unsloth/Qwen2-VL-2B-Instruct",  # 2B model; comfortably fits on a T4 in 4-bit
    load_in_4bit=True,                   # ~4x smaller weights than fp16, with minimal quality loss
    schema=schema,
)
result = infer.one_inference(
    system=SYSTEM,
    prompt=PROMPT,
    image=img_paths[0],
)
result
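As a back-of-the-envelope check on why the 2B model fits on a T4 in 4-bit, weight memory is roughly the parameter count times the bits per weight. The sketch below assumes a nominal 2 billion parameters and ignores activations, the KV cache, and any layers left unquantized, so real usage is higher. weight_gib is a helper name introduced here, not part of any library.

```python
def weight_gib(n_params: float, bits_per_weight: int) -> float:
    """Approximate size of the weights alone, in GiB."""
    return n_params * bits_per_weight / 8 / 1024**3


N_PARAMS = 2e9  # assumption: a nominal "2B" model
for bits in (16, 4):
    print(f"{bits:>2}-bit weights: {weight_gib(N_PARAMS, bits):.2f} GiB")
```

Under these assumptions the weights come to about 3.7 GiB at 16-bit and about 0.9 GiB at 4-bit. The download size quoted above is larger than the 4-bit figure, presumably because not every tensor is stored in 4-bit.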
5 Batch inference
batch_inference processes a list of images in GPU-batched chunks.
batch_size=4 groups up to four items per forward pass for higher throughput; the three images here fit in a single batch.
checkpoint_path lets an interrupted run resume from where it stopped.
# @title Batch inference over all three images
infer.imgs = img_paths # one path per location; nest lists for multi-image-per-prompt
df = infer.batch_inference(
    system=SYSTEM,
    prompt=PROMPT,
    batch_size=4,
    max_new_tokens=256,
    checkpoint_path="demo_checkpoint.jsonl",  # safe to re-run: completed items are skipped
)
df
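To make the batching and resume behaviour concrete, here is a minimal sketch of the general pattern: split pending items into chunks of at most batch_size, append one JSON line per finished item, and skip anything already recorded when the run is repeated. It is an illustration only. The record layout ({"item": ..., "result": ...}) and the names run_with_checkpoint and process_batch are assumptions made here, not urban-worm's actual checkpoint format or internals.

```python
import json
import pathlib
from typing import Callable, List, Sequence


def chunks(items: Sequence[str], size: int) -> List[Sequence[str]]:
    """Split items into consecutive groups of at most `size`."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def run_with_checkpoint(
    items: Sequence[str],
    process_batch: Callable[[Sequence[str]], List[dict]],
    checkpoint_path: str,
    batch_size: int = 4,
) -> List[dict]:
    """Process items in batches, persisting each result so a re-run can resume."""
    ckpt = pathlib.Path(checkpoint_path)
    done = {}
    if ckpt.exists():
        # Reload results from a previous (possibly interrupted) run.
        for line in ckpt.read_text().splitlines():
            if line.strip():
                rec = json.loads(line)
                done[rec["item"]] = rec["result"]

    pending = [it for it in items if it not in done]
    with ckpt.open("a") as fh:
        for batch in chunks(pending, batch_size):
            for item, result in zip(batch, process_batch(batch)):
                done[item] = result
                fh.write(json.dumps({"item": item, "result": result}) + "\n")
                fh.flush()  # keep the file current in case of a crash
    return [{"item": it, "result": done[it]} for it in items]
```

Here process_batch stands in for the model call: it receives one chunk and returns one result per item. Appending and flushing after every item means an interruption loses at most the batch in flight.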
6 Inspect results
# @title Show each image alongside its predicted label
import matplotlib.pyplot as plt
from PIL import Image
fig, axes = plt.subplots(1, len(df), figsize=(6 * len(df), 5))
if len(df) == 1:
    axes = [axes]
for ax, (_, row) in zip(axes, df.iterrows()):
    ax.imshow(Image.open(row["data"]))
    ax.axis("off")
    resp = row["responses"]
    if resp:
        r = resp[0]
        label = getattr(r, "occupancy", "n/a")
        evidence = getattr(r, "visual_evidence", "")
        ax.set_title(f"{label}\n{evidence[:80]}", fontsize=9)
plt.tight_layout()
plt.show()
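Because the three images show the same property, one natural follow-up is to collapse the per-image labels into a single call per house. The sketch below does a simple majority vote with collections.Counter. The majority_label helper and the sample labels are made up for illustration; they are not model output and not part of urban-worm.

```python
from collections import Counter
from typing import Iterable, Optional, Tuple


def majority_label(labels: Iterable[Optional[str]]) -> Tuple[Optional[str], float]:
    """Return the most common non-empty label and the share of votes it received."""
    votes = [lab for lab in labels if lab]
    if not votes:
        return None, 0.0
    label, count = Counter(votes).most_common(1)[0]
    return label, count / len(votes)


# Hypothetical per-image labels for one property (not real model output).
example = ["occupied", "occupied", "unoccupied"]
label, share = majority_label(example)
print(f"{label} ({share:.0%} of views agree)")
```

With an even split, Counter returns the label it encountered first, so a tie is worth flagging for manual review rather than trusting the vote.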
Next steps
- Collect real geo-located data: use GeoTaggedData to fetch street views, Flickr photos, or audio near building footprints (see docs/2_inference_geo_located_data.ipynb).
- Try a larger model: swap in unsloth/Qwen3-VL-3B-Instruct for higher quality, or unsloth/Qwen3-VL-8B-Instruct on an A100.
- Use a cloud API: replace InferenceUnsloth with InferenceAPI to use Claude, GPT-4o, or Gemini instead.
- Export: call GeoTaggedData.export() to produce a metadata.csv paired with an organised image folder.
Full documentation: https://billbillbilly.github.io/urbanworm/