Urban-WORM (minimal Colab demo)

This notebook shows the core VLM inference workflow in under 5 minutes on a free Colab T4 GPU.

What we do:

Install urban-worm with the Unsloth backend
Download three sample street-view images of a Detroit house (no API key needed)
Ask the model a structured occupancy question about a single image
Run batched inference over all three images and inspect the results

Before running: go to Runtime → Change runtime type and select T4 GPU.

1 Install

# @title Install urban-worm + Unsloth backend
# Colab already ships a CUDA-enabled PyTorch build, so we only need the extras.
!pip install "urban-worm[unsloth]" -q

# @title Verify GPU
import torch
if not torch.cuda.is_available():
    raise RuntimeError(
        "No GPU detected. Go to Runtime → Change runtime type and select T4 GPU."
    )
props = torch.cuda.get_device_properties(0)
print(f"GPU : {props.name}")
print(f"VRAM: {props.total_memory / 1024**3:.1f} GiB")

2 Download sample images

Three reoriented street-view images of the same residential property in Detroit, MI are bundled with the repository. No API key is needed.

# @title Download sample street-view images from the urban-worm repo
import urllib.request, pathlib

BASE = "https://raw.githubusercontent.com/billbillbilly/urbanworm/main/docs/data"
img_paths = []
for name in ["img_1.jpg", "img_2.jpg", "img_3.jpg"]:
    dst = pathlib.Path(name)
    if not dst.exists():
        urllib.request.urlretrieve(f"{BASE}/{name}", dst)
    img_paths.append(str(dst))
    print(f"Ready: {dst}")

# @title Preview the three images
import matplotlib.pyplot as plt
from PIL import Image

fig, axes = plt.subplots(1, 3, figsize=(15, 5))
for ax, path in zip(axes, img_paths):
    ax.imshow(Image.open(path))
    ax.axis("off")
    ax.set_title(path)
plt.tight_layout()
plt.show()

3 Define the output schema

Pass a plain dict to declare the structured fields the model must return.
Standard Python type hints (Literal, str, bool, …) control what values are allowed.

# @title Schema + prompts
from typing import Literal

schema = {
    "occupancy": (Literal["occupied", "unoccupied"], ...),
    "visual_evidence": (str, ...),
}

SYSTEM = "You are an urban researcher assessing residential housing conditions."
PROMPT = (
    "Does this house look occupied or unoccupied? "  # wording matches the schema's Literal values
    "Briefly describe the visual evidence that supports your answer."
)
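
To make the schema format concrete, here is a minimal standard-library sketch of what checking a record against such a dict could look like. It is not urban-worm's implementation. It assumes each value is a (type, default) pair and that an Ellipsis default marks the field as required, which is the convention the dict above appears to follow. check_record is a hypothetical helper name.

```python
from typing import Literal, get_args, get_origin

schema = {
    "occupancy": (Literal["occupied", "unoccupied"], ...),
    "visual_evidence": (str, ...),
}

def check_record(record: dict, schema: dict) -> list[str]:
    """Return a list of problems; an empty list means the record fits the schema."""
    problems = []
    for field, (annotation, default) in schema.items():
        if field not in record:
            if default is ...:  # assumption: Ellipsis marks a required field
                problems.append(f"missing required field: {field}")
            continue
        value = record[field]
        if get_origin(annotation) is Literal:
            # Literal[...] restricts the value to an explicit set of options
            if value not in get_args(annotation):
                problems.append(f"{field}: {value!r} not in {get_args(annotation)}")
        elif not isinstance(value, annotation):
            problems.append(
                f"{field}: expected {annotation.__name__}, got {type(value).__name__}"
            )
    return problems

good = {"occupancy": "occupied", "visual_evidence": "curtains, parked car"}
bad = {"occupancy": "vacant"}

print(check_record(good, schema))
print(check_record(bad, schema))
```

The second record fails twice: "vacant" is not one of the allowed Literal values, and visual_evidence is missing. This is why the label vocabulary in the prompt should match the vocabulary in the schema.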

4 Single-image inference

The model is downloaded from the Hugging Face Hub on first run (~1.5 GB for the 2B model in 4-bit).
Subsequent cells reuse the same loaded model.

# @title Load model + run single-image inference
from urbanworm import InferenceUnsloth

infer = InferenceUnsloth(
    llm="unsloth/Qwen2-VL-2B-Instruct",  # 2B model; fits comfortably on a T4 in 4-bit
    load_in_4bit=True,                   # quantized weights take roughly a quarter of their fp16 size, at a small quality cost
    schema=schema,
)

result = infer.one_inference(
    system=SYSTEM,
    prompt=PROMPT,
    image=img_paths[0],
)
result
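
A rough back-of-the-envelope check of the memory figures, as a sketch only. It assumes a nominal 2e9 parameters and counts weights alone. Real checkpoints usually keep some layers in higher precision and store quantization metadata, which would be consistent with a download somewhat larger than the 4-bit figure below. Activations and the KV cache come on top at run time.

```python
def weight_memory_gib(n_params: float, bits_per_weight: float) -> float:
    """Approximate memory for the weights alone, in GiB."""
    return n_params * bits_per_weight / 8 / 1024**3

N_PARAMS = 2e9  # nominal "2B"; the exact parameter count differs somewhat

fp16_gib = weight_memory_gib(N_PARAMS, 16)
int4_gib = weight_memory_gib(N_PARAMS, 4)

print(f"fp16 : {fp16_gib:.2f} GiB")
print(f"4-bit: {int4_gib:.2f} GiB")
print(f"ratio: {fp16_gib / int4_gib:.0f}x")
```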

5 Batch inference

batch_inference processes a list of images in GPU-batched chunks.
batch_size=4 groups up to four items per forward pass for higher throughput; the three images here fit in a single batch.
checkpoint_path lets an interrupted run resume where it stopped: items already recorded in the checkpoint are skipped.

# @title Batch inference over all three images
infer.imgs = img_paths  # one path per location; nest lists for multi-image-per-prompt

df = infer.batch_inference(
    system=SYSTEM,
    prompt=PROMPT,
    batch_size=4,
    max_new_tokens=256,
    checkpoint_path="demo_checkpoint.jsonl",  # safe to re-run: completed items are skipped
)
df
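
The chunk-and-resume pattern described above can be sketched with the standard library alone. This is an illustration of the general technique, not urban-worm's code. The JSONL record layout, fake_model, chunked, and run_with_checkpoint are all hypothetical names and formats chosen for the example.

```python
import json
import tempfile
from pathlib import Path

def chunked(items, size):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

def fake_model(batch):
    """Hypothetical stand-in for one batched forward pass."""
    return [{"occupancy": "occupied", "visual_evidence": f"stub for {p}"} for p in batch]

def run_with_checkpoint(paths, batch_size, checkpoint_path, model=fake_model):
    """Process `paths` in batches, appending each result to a JSONL checkpoint."""
    checkpoint = Path(checkpoint_path)
    done = {}
    if checkpoint.exists():
        # Reload results from an earlier (possibly interrupted) run
        with checkpoint.open() as f:
            for line in f:
                if line.strip():
                    rec = json.loads(line)
                    done[rec["data"]] = rec["response"]
    pending = [p for p in paths if p not in done]
    calls = 0
    with checkpoint.open("a") as f:
        for batch in chunked(pending, batch_size):
            outputs = model(batch)
            calls += 1
            for path, response in zip(batch, outputs):
                done[path] = response
                f.write(json.dumps({"data": path, "response": response}) + "\n")
            f.flush()  # persist each batch before starting the next
    return [done[p] for p in paths], calls

with tempfile.TemporaryDirectory() as tmp:
    ckpt = Path(tmp) / "demo_checkpoint.jsonl"
    paths = ["img_1.jpg", "img_2.jpg", "img_3.jpg"]
    _, first_calls = run_with_checkpoint(paths, 4, ckpt)
    _, second_calls = run_with_checkpoint(paths, 4, ckpt)

print(first_calls, second_calls)
```

With three items and a batch size of four, the first run needs a single model call. The second run finds every item in the checkpoint and makes none.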

6 Inspect results

# @title Show each image alongside its predicted label
import matplotlib.pyplot as plt
from PIL import Image

fig, axes = plt.subplots(1, len(df), figsize=(6 * len(df), 5))
if len(df) == 1:
    axes = [axes]

for ax, (_, row) in zip(axes, df.iterrows()):
    ax.imshow(Image.open(row["data"]))
    ax.axis("off")
    resp = row["responses"]
    if resp:
        r = resp[0]
        label    = getattr(r, "occupancy", "—")
        evidence = getattr(r, "visual_evidence", "")
        ax.set_title(f"{label}\n{evidence[:80]}", fontsize=9)

plt.tight_layout()
plt.show()
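
Because the three images show the same property, one possible follow-up is to combine the per-view labels into a single property-level label by majority vote. The sketch below is an assumption about how one might post-process the results, not part of urban-worm. majority_label is a hypothetical helper, and SimpleNamespace objects stand in for the response objects read with getattr in the cell above.

```python
from collections import Counter
from types import SimpleNamespace

def majority_label(responses, field="occupancy"):
    """Return (most common label, share of views that agree with it)."""
    labels = [getattr(r, field) for r in responses if r is not None and hasattr(r, field)]
    if not labels:
        return None, 0.0
    label, count = Counter(labels).most_common(1)[0]
    return label, count / len(labels)

# Stand-in responses for three views of one house (illustrative values only)
views = [
    SimpleNamespace(occupancy="occupied", visual_evidence="car in driveway"),
    SimpleNamespace(occupancy="unoccupied", visual_evidence="boarded window"),
    SimpleNamespace(occupancy="occupied", visual_evidence="maintained lawn"),
]

label, share = majority_label(views)
print(f"{label} ({share:.0%} of views agree)")
```

With the DataFrame from the batch step, the input list could be built by taking the first element of each non-empty responses cell, following the same access pattern as the plotting code.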

Next steps

Collect real geo-located data — use GeoTaggedData to fetch street views, Flickr photos, or audio near building footprints (see docs/2_inference_geo_located_data.ipynb).
Try a larger model — swap in unsloth/Qwen3-VL-3B-Instruct for higher quality, or unsloth/Qwen3-VL-8B-Instruct on an A100.
Use a cloud API — replace InferenceUnsloth with InferenceAPI to use Claude, GPT-4o, or Gemini instead.
Export — call GeoTaggedData.export() to produce a metadata.csv paired with an organised image folder.

Full documentation: https://billbillbilly.github.io/urbanworm/