Urban-WORM
Workflow Of Reproducible Multimodal Inference

Urban-WORM is a high-level Python interface for building geo-referenced urban datasets with model-generated ground-truth labels. It covers the full pipeline — from collecting crowdsourced street views, photos, and sounds near building footprints, through batched VLM inference, to an organised export of labelled metadata.
Features
Data collection
- Collect geotagged street views (Mapillary / Google), photos (Flickr), and audio (Freesound / Radio Aporee) within the proximity of building footprints or other points of interest
- Calibrate panorama orientation to face a given location; auto-compute field-of-view from building footprints
- Filter personal photos with face detection; slice audio recordings into fixed-duration clips
- Crash-safe checkpointing: pass `checkpoint_path` to any collection method; already-fetched locations are skipped on resume
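To make "calibrate panorama orientation to face a given location" concrete, the underlying geometry can be sketched as follows: compute the initial great-circle bearing from the camera position to the target, then the signed rotation needed from the camera's compass heading. The helper names `bearing_deg` and `yaw_offset_deg` are hypothetical and are not part of Urban-WORM; the library's own calibration may work differently.

```python
import math

def bearing_deg(cam_lat, cam_lon, tgt_lat, tgt_lon):
    """Initial great-circle bearing from camera to target, degrees clockwise from north."""
    phi1, phi2 = math.radians(cam_lat), math.radians(tgt_lat)
    dlon = math.radians(tgt_lon - cam_lon)
    x = math.sin(dlon) * math.cos(phi2)
    y = math.cos(phi1) * math.sin(phi2) - math.sin(phi1) * math.cos(phi2) * math.cos(dlon)
    return math.degrees(math.atan2(x, y)) % 360.0

def yaw_offset_deg(camera_heading, target_bearing):
    """Signed rotation in [-180, 180) that turns the camera heading toward the target."""
    return (target_bearing - camera_heading + 180.0) % 360.0 - 180.0

# Hypothetical usage: a camera heading 350 degrees, target bearing 10 degrees,
# needs a 20 degree clockwise turn rather than a 340 degree counter-clockwise one.
turn = yaw_offset_deg(350.0, 10.0)
```

Wrapping the offset into a signed range keeps the rotation minimal when the heading and bearing straddle north.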
Inference / ground-truth labelling
- Define a structured output schema once; all backends share the same `one_inference` / `batch_inference` interface
- Unsloth: GPU-accelerated local VLM; auto-detects multiple GPUs; OOM-safe chunk retry; 2–4× faster than Ollama
- Ollama: lightweight local inference, no GPU required
- llama.cpp: highly customisable sampling; supports audio input
- Cloud APIs: Claude, GPT-4o, Gemini via `InferenceAPI`
- Crash-safe checkpointing on all `batch_inference` methods
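The resume behaviour described above can be illustrated with a generic append-only JSONL checkpoint. This is a minimal sketch of the general technique, not Urban-WORM's implementation: `load_done`, `run_with_checkpoint`, the `label_fn` callback, and the `id` / `label` record keys are all hypothetical names chosen for the example.

```python
import json
import os

def load_done(checkpoint_path):
    """Return {item_id: record} for every complete line in a JSONL checkpoint."""
    done = {}
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path, encoding="utf-8") as f:
            for line in f:
                try:
                    rec = json.loads(line)
                except json.JSONDecodeError:
                    continue  # skip blank or half-written lines left by a crash
                done[rec["id"]] = rec
    return done

def run_with_checkpoint(items, label_fn, checkpoint_path):
    """Label (item_id, payload) pairs, persisting each result as soon as it is ready."""
    done = load_done(checkpoint_path)

    # If a crash cut the last line short, start the next record on a fresh line.
    needs_newline = False
    if os.path.exists(checkpoint_path) and os.path.getsize(checkpoint_path) > 0:
        with open(checkpoint_path, "rb") as f:
            f.seek(-1, os.SEEK_END)
            needs_newline = f.read(1) != b"\n"

    with open(checkpoint_path, "a", encoding="utf-8") as f:
        if needs_newline:
            f.write("\n")
        for item_id, payload in items:
            if item_id in done:
                continue  # already labelled in an earlier run
            rec = {"id": item_id, "label": label_fn(payload)}
            f.write(json.dumps(rec) + "\n")
            f.flush()  # push the record out of Python's buffer before moving on
            done[item_id] = rec
    return [done[item_id] for item_id, _ in items]
```

Writing one record per line means an interrupted run loses at most the item in flight; everything before it is picked up by `load_done` on the next call.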
Export
- `GeoTaggedData.export()`: one call produces a `metadata.csv` paired with an organised `images/` or `audio/` folder
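For readers who want a picture of the resulting layout, the sketch below builds a comparable folder with the standard library only: media files copied into a subfolder and one CSV row per file. It is an illustration under assumptions, not the library's code; `export_dataset`, the `path` key, and the `file` column are hypothetical, and the real `metadata.csv` columns may differ.

```python
import csv
import shutil
from pathlib import Path

def export_dataset(records, output_dir, media_dir="images"):
    """Copy each record's file into output_dir/media_dir and write metadata.csv next to it."""
    out = Path(output_dir)
    (out / media_dir).mkdir(parents=True, exist_ok=True)

    rows = []
    for rec in records:
        src = Path(rec["path"])
        shutil.copy2(src, out / media_dir / src.name)
        # Keep every attribute except the original path; store a relative path instead.
        row = {k: v for k, v in rec.items() if k != "path"}
        row["file"] = f"{media_dir}/{src.name}"
        rows.append(row)

    # "file" first, remaining columns in alphabetical order for a stable header.
    fieldnames = ["file"] + sorted({k for r in rows for k in r if k != "file"})
    csv_path = out / "metadata.csv"
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
    return csv_path
```

Storing relative paths in the CSV keeps the exported folder portable: it can be moved or zipped without breaking the link between rows and files.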
Quick example
from urbanworm import GeoTaggedData, InferenceUnsloth
from typing import Literal
# 1 — collect street views near building footprints
gtd = GeoTaggedData()
gtd.getBuildings(bbox=(-83.208, 42.374, -83.206, 42.375), source='osm')
gtd.get_svi_from_locations(key="YOUR_MAPILLARY_KEY", distance=30, reoriented=True)
# 2 — define output schema and run inference
schema = {
    "occupancy": (Literal["occupied", "unoccupied", "uncertain"], ...),
    "visual_evidence": (str, ...),
}
infer = InferenceUnsloth(
    llm="unsloth/Qwen2-VL-2B-Instruct",
    load_in_4bit=True,
    geo_tagged_data=gtd,
    schema=schema,
)
df = infer.batch_inference(
    prompt="Does this house look occupied or vacant?",
    batch_size=4,
    checkpoint_path="labels.jsonl",
)
# 3 — export
gtd.export(output_dir="dataset", data="svi", labels=df)
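The schema in the example maps each field name to a `(type, default)` pair, where `...` appears to mark the field as required (the same convention pydantic's `create_model` uses, although whether Urban-WORM relies on pydantic is an assumption here). The sketch below shows, with the standard library only, what checking a model response against such a schema could look like. The `validate` helper is hypothetical and is not part of the package.

```python
from typing import Literal, get_args, get_origin

schema = {
    "occupancy": (Literal["occupied", "unoccupied", "uncertain"], ...),
    "visual_evidence": (str, ...),
}

def validate(record, schema):
    """Return a list of problems; an empty list means the record fits the schema."""
    problems = []
    for name, (ftype, default) in schema.items():
        if name not in record:
            if default is ...:  # Ellipsis marks a required field
                problems.append(f"missing required field: {name}")
            continue
        value = record[name]
        if get_origin(ftype) is Literal:
            allowed = get_args(ftype)
            if value not in allowed:
                problems.append(f"{name}: {value!r} not in {allowed}")
        elif not isinstance(value, ftype):  # only plain classes such as str or int
            problems.append(f"{name}: expected {ftype.__name__}, got {type(value).__name__}")
    return problems

ok = validate({"occupancy": "occupied", "visual_evidence": "lights on"}, schema)
bad = validate({"occupancy": "maybe"}, schema)
```

Constraining a label to a `Literal` set is what keeps free-text model output from leaking into a categorical column.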
License
MIT — see LICENSE.
The development of this package is supported and inspired by the city of Detroit.