Batched input with geolocated data¶

In this tutorial, we use the urbanworm.dataset module to collect geo-located data, including street views, Flickr photos, and Freesound recordings. The urbanworm.inference module is then used to run inference on the imagery and audio data with InternVL3-8B-Instruct and Qwen2.5-Omni.

We will be using three case studies to demonstrate what insight may be gained from these datasets:

Case study (in Detroit) using street views: Does the house look occupied?
Case study (in Hong Kong) using Flickr photos: What was captured in the photo?
Case study (in Tokyo) using Freesound recordings: Did you hear the wind?

For each case study, we follow these steps:

Query and process data
Download the dataset
Pass the dataset constructor to the inference constructor
Batch inference

Retrieving data requires API keys for Mapillary, Google, Flickr, and Freesound, which can be requested from the respective providers.

Note:

To see all the available street views on Mapillary, please check out Mapillary Map App
To see all the available geo-tagged photos on Flickr, please check out everyone's photo on the map
To see all the available geo-tagged recordings on Freesound, please check out the map of sounds

In [2]:

from urbanworm.dataset import GeoTaggedData
from urbanworm.inference.llama import InferenceLlamacpp
# Optional fast local VLM backend (requires `pip install "urban-worm[unsloth]"`)
from urbanworm import InferenceUnsloth
# Aporee helpers (Internet Archive catalog + duration enrichment)
from urbanworm.sources.aporee import fetch_aporee_catalog, enrich_aporee_catalog

In [3]:

# Import API keys (strip the trailing newline that text files usually end with)
with open("mapillary_key.txt", 'r') as file:
    mapillary_key = file.read().strip()
with open("google_key.txt", 'r') as file:
    google_key = file.read().strip()
with open("flickr_key.txt", 'r') as file:
    flickr_key = file.read().strip()
with open("freesound_key.txt", 'r') as file:
    freesound_key = file.read().strip()

1 Does the house look occupied?¶

1.1 Retrieve street views at property-level¶

Building footprints are used as proxies for house locations when gathering the data.

In [4]:

# get building footprints from OSM

# Initiate the constructor
gtd = GeoTaggedData()
# Define the area of interest using a bounding box (bbox)
bbox = (-83.208003,42.374646,-83.206608,42.375328) # in Detroit, USA
# keep only buildings with a footprint between 60 and 200 square meters (single-family houses; garages excluded)
gtd.getBuildings(bbox, min_area=60, max_area=200)

In [5]:

gtd.units.plot()

Out[5]:

<Axes: >

[Figure: plot of the building footprints in gtd.units]

In [5]:

gtd.units['geometry'][0].centroid.x, gtd.units['geometry'][0].centroid.y

Out[5]:

(-83.2069498, 42.374744050000004)

For each house location, we find nearby (≤30 m) street view images and output perspective crops reoriented to center on the house. The source parameter selects the data provider:

source='mapillary' (default): queries the Mapillary API for panoramic images, then reprojects them in-memory. Supports pano, reoriented, multi_num, interval, year, season, and time_of_day filtering.
source='google': queries the Google Street View Static API. Always returns a perspective image facing the target — no reprojection step needed. multi_num > 1, pano, reoriented, year, season, and time_of_day are not supported and will be ignored with a warning. Requires a GOOGLE_STREETVIEW_API_KEY.
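
For intuition about what "reoriented to center on the house" involves, the sketch below computes the compass bearing from a camera position to a house centroid and the yaw offset relative to the camera heading. This is an illustration only, not urbanworm's implementation: `bearing_deg` is a hypothetical helper, and the coordinates and heading are example values of the kind this tutorial reports for a house centroid and a Mapillary image (assuming the reported compass angle is the camera heading).

```python
import math

def bearing_deg(lon1, lat1, lon2, lat2):
    """Initial compass bearing from point 1 to point 2, in degrees [0, 360)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dlam = math.radians(lon2 - lon1)
    x = math.sin(dlam) * math.cos(phi2)
    y = (math.cos(phi1) * math.sin(phi2)
         - math.sin(phi1) * math.cos(phi2) * math.cos(dlam))
    return math.degrees(math.atan2(x, y)) % 360

# Example values (lon, lat); assumed to be a camera position and a house centroid
camera = (-83.206931, 42.374638)
house = (-83.2069498, 42.374744050000004)
camera_heading = 88.091452  # degrees clockwise from north (assumption)

target = bearing_deg(*camera, *house)      # direction from camera to house
yaw = (target - camera_heading) % 360      # rotation needed within the panorama

print(f"bearing to house: {target:.1f} deg, yaw offset: {yaw:.1f} deg")
```

A perspective crop centered at that yaw, with the chosen field of view, would then frame the house.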

The example below uses Mapillary: panoramic images, up to 3 views per house, reoriented to face the house, filtered to daytime captures in 2024–2025.

In [6]:

gtd.get_svi_from_locations(key = mapillary_key, # API key
                           distance = 30,       # only search for street views within 30 meters of the house location
                           pano = True,         # only search for 360-degree street view images
                           reoriented = True,   # reorient and crop the street view images so that the house sits at the center of the frame
                           multi_num = 3,       # return the three street views closest to the house location
                           fov = 80,            # field of view, in degrees, of the reoriented images
                           interval = 2,        # spacing between collected street views (e.g., `interval = 2` means two available images lie between two consecutive collected images)
                           year = (2024, 2025), # only search for images captured between 2024 and 2025
                           time_of_day = 'day'  # only search for images captured during the daytime
                           )


The metadata is stored in the constructor, including:

mapillary image id,
sequence id,
when it was captured,
the original orientation angle,
image coordinates,
and the house location index.

In [9]:

gtd.svi_metadata.head(5)

Out[9]:

	id	sequence	captured_at	compass_angle	image_lon	image_lat	url	loc_id
0	787292673517322	CQSM1xmrYkn4IKuE0d6BDc	2024-6-23-16	88.091452	-83.206931	42.374638	https://scontent-ord5-3.xx.fbcdn.net/m1/v/t6/A...	0
1	466445912772974	CQSM1xmrYkn4IKuE0d6BDc	2024-6-23-16	88.492384	-83.206807	42.374640	https://scontent-ord5-3.xx.fbcdn.net/m1/v/t6/A...	0
2	441853692093021	CQSM1xmrYkn4IKuE0d6BDc	2024-6-23-16	87.594986	-83.207055	42.374636	https://scontent-ord5-3.xx.fbcdn.net/m1/v/t6/A...	0
0	787292673517322	CQSM1xmrYkn4IKuE0d6BDc	2024-6-23-16	88.091452	-83.206931	42.374638	https://scontent-ord5-3.xx.fbcdn.net/m1/v/t6/A...	1
1	466445912772974	CQSM1xmrYkn4IKuE0d6BDc	2024-6-23-16	88.492384	-83.206807	42.374640	https://scontent-ord5-3.xx.fbcdn.net/m1/v/t6/A...	1
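
The loc_id column ties each image back to a house, and the same image can serve more than one nearby house. A minimal sketch of grouping images by location is shown here, using the (id, loc_id) pairs from the rows above. `group_by_location` is a hypothetical helper written for illustration; it is not how urbanworm packs images internally.

```python
from collections import defaultdict

# (image id, loc_id) pairs copied from the first rows of svi_metadata
rows = [
    ("787292673517322", 0),
    ("466445912772974", 0),
    ("441853692093021", 0),
    ("787292673517322", 1),
    ("466445912772974", 1),
]

def group_by_location(pairs):
    """Collect image ids per location index, preserving row order."""
    groups = defaultdict(list)
    for image_id, loc_id in pairs:
        groups[loc_id].append(image_id)
    return dict(groups)

packed = group_by_location(rows)
for loc_id, image_ids in packed.items():
    print(loc_id, image_ids)
```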

The same information is also stored in dictionary form (gtd.svis) for later downloading and processing.

In this case study, because the images have been reoriented and cropped, they are stored in base64 format inside the dataset constructor.

In [11]:

gtd.svis['loc_id'][0], gtd.svis['id'][0], gtd.svis['data'][0][:100]

Out[11]:

(0,
 '787292673517322',
 'iVBORw0KGgoAAAANSUhEUgAAArwAAAH0CAIAAABQO2mIAAAgAElEQVR4AWzB6c+m53ke9uM4r+u+72d5n3eZeWfjIooSF5HaaNmW')
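
The base64 string can be sanity-checked without decoding the whole image. The sketch below decodes only the first 32 characters printed above and reads the PNG signature and the image size from the header; it uses the standard library only and assumes nothing about urbanworm beyond the string shown.

```python
import base64
import struct

# First 32 base64 characters of gtd.svis['data'][0] as printed above
prefix = "iVBORw0KGgoAAAANSUhEUgAAArwAAAH0"

header = base64.b64decode(prefix)  # 32 characters decode to 24 bytes
is_png = header[:8] == b"\x89PNG\r\n\x1a\n"

# In a PNG file the IHDR chunk follows the 8-byte signature; width and height
# are big-endian 32-bit integers at byte offsets 16 and 20.
width, height = struct.unpack(">II", header[16:24])

print(is_png, width, height)
```

To write a full image to disk, the complete string can be passed through base64.b64decode and the resulting bytes written to a file opened in binary mode.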

Alternative: same locations with Google Street View¶

Swap source='google' and pass your Google API key. The image is always returned facing the target (reorientation is built into the API call), so pano / reoriented / multi_num are not needed. Time-based filters are not supported — Google only exposes year and month, stored as captured_at = 'YYYY-MM-1-1' (day and hour are nominal placeholders).

The url column in svi_metadata has the API key replaced with the literal key placeholder so it is safe to share or persist.

In [6]:

gtd.get_svi_from_locations(
    source = 'google',
    key = google_key,      # GOOGLE_STREETVIEW_API_KEY
    distance = 30,         # search radius in metres
    fov = 80,              # field of view passed directly to the Static API (clamped to [10, 120])
    pitch = 5,             # camera pitch
    height = 500,
    width = 700,
    # multi_num, pano, reoriented, year, season, time_of_day are not supported for Google
)


getSV: year/season/time_of_day filtering is not supported for source='google' (API does not expose historical imagery). These parameters will be ignored.

In [8]:

gtd.svi_metadata.head(5)

Out[8]:

	id	sequence	captured_at	compass_angle	image_lon	image_lat	url	loc_id
0	PhtJRmWTL6kZMnpR5lDvNA	None	2025-08-1-1	359.208256	-83.206948	42.374622	https://maps.googleapis.com/maps/api/streetvie...	0
1	PhtJRmWTL6kZMnpR5lDvNA	None	2025-08-1-1	0.102347	-83.206948	42.374622	https://maps.googleapis.com/maps/api/streetvie...	1
2	3uFBBQFo0uYmkztWSxBWAA	None	2022-07-1-1	79.019397	-83.207271	42.374917	https://maps.googleapis.com/maps/api/streetvie...	2
3	1j-ALV8yzUUwri2bJ1Fc1A	None	2022-07-1-1	77.760277	-83.207274	42.375008	https://maps.googleapis.com/maps/api/streetvie...	3
4	fJANZWNZigu6gVc0HoTmqQ	None	2025-07-1-1	81.220750	-83.207278	42.375118	https://maps.googleapis.com/maps/api/streetvie...	4
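
Because Google rows carry only a nominal day and hour, any filtering by capture time has to be done on year and month after retrieval. The sketch below parses captured_at strings in the year-month-day-hour layout seen in this tutorial and keeps captures from 2024 or 2025. `parse_captured_at` is a hypothetical helper, and the sample strings are taken from the metadata tables shown for Mapillary and Google.

```python
from datetime import datetime

def parse_captured_at(value: str) -> datetime:
    """Parse 'YYYY-M-D-H' strings such as '2024-6-23-16'."""
    year, month, day, hour = (int(part) for part in value.split("-"))
    return datetime(year, month, day, hour)

samples = ["2024-6-23-16", "2025-08-1-1", "2022-07-1-1"]
parsed = [parse_captured_at(s) for s in samples]

# Keep captures from 2024 or 2025 (same range as year=(2024, 2025))
recent = [d for d in parsed if 2024 <= d.year <= 2025]

print([d.strftime("%Y-%m") for d in recent])
```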

1.2 Download data (optional but recommended)¶

For batched inference, working with local data is usually more stable than streaming it. Downloading the data (images or sound recordings) to a directory is therefore optional but highly recommended.

In [ ]:

gtd.download_to_dir(data='svi',
                    to_dir='/svi_download',
                    prefix='svi')
# after downloading all images, the local path of images will be stored, which allows the inference constructor to access local images
gtd.svis['path'][:5]

1.3 Pass to the inference constructor¶

In [5]:

# indicate that the street view images will be used
gtd.set_images('svi')
# pass the dataset to the inference constructor
data = InferenceLlamacpp(geo_tagged_data=gtd)

In [6]:

# pack images by their sample locations
data.pack_by_location()

1.4 Batched inference¶

In [8]:

from typing import Literal

prompt = '''
    Question: Does this house look occupied if it is not a vacant lot?

    **An occupied house means that the house is not abandoned and
    some people may live in this house even if there are no people outside**
'''

# specify model
data.llm = 'ggml-org/InternVL3-8B-Instruct-GGUF:Q8_0'
# define output schema
data.schema = {"answer": (Literal['occupied', 'unoccupied', 'vacant'], ...),
               "explanation": (str, ...),}
# inference
data.batch_inference(prompt=prompt)

Processing...: 100%|███████████████████████| 13/13 [01:36<00:00,  7.40s/it]

Out[8]:

	answer_1	explanation_1	data_1	data_2	data_3
0	occupied	The house has a well-maintained lawn and a fen...	/var/folders/fb/4kj6xrcs195bxml3gxcrrkq80000gn...	/var/folders/fb/4kj6xrcs195bxml3gxcrrkq80000gn...	/var/folders/fb/4kj6xrcs195bxml3gxcrrkq80000gn...
1	occupied	The house appears to be occupied because it is...	/var/folders/fb/4kj6xrcs195bxml3gxcrrkq80000gn...	/var/folders/fb/4kj6xrcs195bxml3gxcrrkq80000gn...	/var/folders/fb/4kj6xrcs195bxml3gxcrrkq80000gn...
2	occupied	The house in the image appears to be occupied ...	/var/folders/fb/4kj6xrcs195bxml3gxcrrkq80000gn...	/var/folders/fb/4kj6xrcs195bxml3gxcrrkq80000gn...	/var/folders/fb/4kj6xrcs195bxml3gxcrrkq80000gn...
3	occupied	The house is not a vacant lot, so it is likely...	/var/folders/fb/4kj6xrcs195bxml3gxcrrkq80000gn...	/var/folders/fb/4kj6xrcs195bxml3gxcrrkq80000gn...	/var/folders/fb/4kj6xrcs195bxml3gxcrrkq80000gn...
4	occupied	The house is not a vacant lot, so it is likely...	/var/folders/fb/4kj6xrcs195bxml3gxcrrkq80000gn...	/var/folders/fb/4kj6xrcs195bxml3gxcrrkq80000gn...	/var/folders/fb/4kj6xrcs195bxml3gxcrrkq80000gn...
5	occupied	The house is not a vacant lot, so it is likely...	/var/folders/fb/4kj6xrcs195bxml3gxcrrkq80000gn...	/var/folders/fb/4kj6xrcs195bxml3gxcrrkq80000gn...	/var/folders/fb/4kj6xrcs195bxml3gxcrrkq80000gn...
6	occupied	The house is not a vacant lot, so it is likely...	/var/folders/fb/4kj6xrcs195bxml3gxcrrkq80000gn...	/var/folders/fb/4kj6xrcs195bxml3gxcrrkq80000gn...	/var/folders/fb/4kj6xrcs195bxml3gxcrrkq80000gn...
7	occupied	The house is not a vacant lot, so it is likely...	/var/folders/fb/4kj6xrcs195bxml3gxcrrkq80000gn...	/var/folders/fb/4kj6xrcs195bxml3gxcrrkq80000gn...	/var/folders/fb/4kj6xrcs195bxml3gxcrrkq80000gn...
8	occupied	The house is not a vacant lot, so it is likely...	/var/folders/fb/4kj6xrcs195bxml3gxcrrkq80000gn...	/var/folders/fb/4kj6xrcs195bxml3gxcrrkq80000gn...	/var/folders/fb/4kj6xrcs195bxml3gxcrrkq80000gn...
9	occupied	The house in the image appears to be occupied ...	/var/folders/fb/4kj6xrcs195bxml3gxcrrkq80000gn...	/var/folders/fb/4kj6xrcs195bxml3gxcrrkq80000gn...	/var/folders/fb/4kj6xrcs195bxml3gxcrrkq80000gn...
10	occupied	The house has a car parked in the driveway, wh...	/var/folders/fb/4kj6xrcs195bxml3gxcrrkq80000gn...	/var/folders/fb/4kj6xrcs195bxml3gxcrrkq80000gn...	/var/folders/fb/4kj6xrcs195bxml3gxcrrkq80000gn...
11	occupied	The house looks occupied because there is a ca...	/var/folders/fb/4kj6xrcs195bxml3gxcrrkq80000gn...	/var/folders/fb/4kj6xrcs195bxml3gxcrrkq80000gn...	/var/folders/fb/4kj6xrcs195bxml3gxcrrkq80000gn...
12	occupied	The house has a car parked in the driveway and...	/var/folders/fb/4kj6xrcs195bxml3gxcrrkq80000gn...	/var/folders/fb/4kj6xrcs195bxml3gxcrrkq80000gn...	/var/folders/fb/4kj6xrcs195bxml3gxcrrkq80000gn...
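
The schema constrains each answer to one of three labels. The (type, ...) tuples resemble the field definitions accepted by pydantic's create_model, where ... marks a required field; whether urbanworm builds its schema that way is an assumption. The standard-library sketch below shows what such a schema enforces by checking a response dictionary against it. `check_record` is a hypothetical helper, not part of urbanworm.

```python
from typing import Literal, get_args, get_origin

schema = {
    "answer": (Literal['occupied', 'unoccupied', 'vacant'], ...),
    "explanation": (str, ...),
}

def check_record(record, schema):
    """Return a list of problems found in one response (empty if valid)."""
    problems = []
    for name, (expected, default) in schema.items():
        if name not in record:
            if default is ...:  # Ellipsis marks the field as required
                problems.append(f"missing required field: {name}")
            continue
        value = record[name]
        if get_origin(expected) is Literal:
            if value not in get_args(expected):
                problems.append(f"{name}={value!r} is not one of {get_args(expected)}")
        elif not isinstance(value, expected):
            problems.append(f"{name} should be of type {expected.__name__}")
    return problems

good = {"answer": "occupied", "explanation": "The house has a well-maintained lawn."}
bad = {"answer": "maybe"}

print(check_record(good, schema))
print(check_record(bad, schema))
```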

Alternative: same task with `InferenceUnsloth`¶

Same prompt and schema, but using the Unsloth backend. The gtd is reused as-is; pack_by_location() was already called above so multi-view inference per location still works.

In [ ]:

data_unsloth = InferenceUnsloth(
    llm='unsloth/Qwen3-VL-3B-Instruct',
    load_in_4bit=True,
    geo_tagged_data=gtd,
    schema={
        "answer": (Literal['occupied', 'unoccupied', 'vacant'], ...),
        "explanation": (str, ...),
    },
)
data_unsloth.batch_inference(prompt=prompt, batch_size=1)

2 What was captured in the photo?¶

2.1 Retrieve Flickr photos within a radius¶

In [3]:

# set a center point for spatially querying data
gtd = GeoTaggedData(locations=[[114.176773,22.302554]])
gtd.get_photo_from_location(key=flickr_key,
                            distance=1000, # only searching for data within the distance from the given center point
                            max_return=200, # only return the given number of photos
                            exclude_personal_photo = True, # drop personal photos using opencv/face_detection_yunet
                            )


More than half of the photos were detected as personal photos.

In [8]:

len(gtd.photo_metadata)

Out[8]:

The metadata is stored in the constructor, including:

Flickr photo id,
title of the photo,
owner,
when it was taken,
photo coordinates,
...

In [4]:

gtd.photo_metadata.head(5)

Out[4]:

	id	title	owner	datetaken	latitude	longitude	distance_m	tags	views	url
18	55039184398	Observatory Road, Hong Kong / 天文臺道,香港	23502041@N06	2026-01-13 19:15:41	22.301397	114.174430	273.225805		11	https://live.staticflickr.com/65535/5503918439...
115	55033921774	[Big Bee Taxi]Kaiyi X3 Pro EV WN4210	200369775@N04	2026-01-04 16:43:13	22.303136	114.180794	418.698089		1	https://live.staticflickr.com/65535/5503392177...
78	55036639646	20260110 金利宝香港一日遊	30199947@N07	2026-01-10 13:43:11	22.301252	114.172058	506.210597	香港金利寶 kongkong	20	https://live.staticflickr.com/65535/5503663964...
215	54993258227	IMG_7521	204002042@N04	2025-12-04 10:07:34	22.303761	114.171822	526.726446		6	https://live.staticflickr.com/65535/5499325822...
212	54994140806	IMG_7519	204002042@N04	2025-12-04 10:02:15	22.303791	114.171830	526.791740		8	https://live.staticflickr.com/65535/5499414080...
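
The distance_m column can be cross-checked independently. The sketch below computes the great-circle (haversine) distance between the query center and the first photo in the table. How urbanworm itself computes distance_m is not stated here, so treat this as an independent check under the assumption of a spherical Earth with a 6,371 km radius; `haversine_m` is a hypothetical helper.

```python
import math

def haversine_m(lon1, lat1, lon2, lat2, radius_m=6_371_000.0):
    """Great-circle distance in meters between two (lon, lat) points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * radius_m * math.asin(math.sqrt(a))

center = (114.176773, 22.302554)   # (lon, lat) passed to GeoTaggedData
photo = (114.174430, 22.301397)    # longitude, latitude of the first row

d = haversine_m(*center, *photo)
print(round(d, 1))
```

The result should land close to the 273.2 m reported for that row.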

In [5]:

gtd.photos['id'][0], gtd.photos['data'][0]

Out[5]:

('55039184398',
 'https://live.staticflickr.com/65535/55039184398_17460374cf_o.jpg')

2.2 Download data (optional but recommended)¶

In [9]:

# download data to directory
gtd.download_to_dir(data='photo',
                    to_dir='/Users/xiaohaoyang/Downloads/test_download',
                    prefix='test_download')

2.3 Pass to the inference constructor¶

In [6]:

# indicate that the Flickr photos will be used
gtd.set_images('photo')
# pass the dataset to the inference constructor
data = InferenceLlamacpp(geo_tagged_data=gtd)

2.4 Inference¶

In [13]:

from typing import Literal

prompt = '''
    Please answer the following questions (make a guess) after seeing the photo:
    What is the main thing/focus captured in the photo, is this indoor or outdoor, and does the photo capture any outdoor urban or natural scenery?
'''

# specify model
data.llm = 'ggml-org/InternVL3-8B-Instruct-GGUF:Q8_0'
# define output schema
data.schema = {"focus": (str, ...),
               "indoor_outdoor": (Literal['indoor', 'outdoor'], ...),
               "scenery": (Literal['neither', 'urban', 'nature', 'both'], ...),}
# inference
result = data.batch_inference(prompt=prompt)
result.head(5)

Processing...: 100%|███████████████████████| 93/93 [09:03<00:00,  5.84s/it]

Out[13]:

	focus_1	indoor_outdoor_1	scenery_1	data_1
0	The main focus of the photo is the brightly li...	indoor	urban	https://live.staticflickr.com/65535/5503918439...
1	The main focus of the photo is the busy urban ...	outdoor	urban	https://live.staticflickr.com/65535/5503392177...
2	The main focus of the photo is the colorful, r...	outdoor	urban	https://live.staticflickr.com/65535/5503663964...
3	The main focus of the photo is a plate of food...	indoor	urban	https://live.staticflickr.com/65535/5499325822...
4	The main focus of the photo is a plate of food...	indoor	urban	https://live.staticflickr.com/65535/5499414080...
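
Categorical outputs like these are easy to tally and to screen for combinations worth a manual look. The sketch below counts the labels in the five rows shown above using plain dictionaries; with the actual result table, the same counts could be obtained from its columns. Flagging "indoor" rows whose scenery is not "neither" is a review heuristic chosen for this illustration, not a rule from urbanworm.

```python
from collections import Counter

# Labels copied from the five rows shown above
rows = [
    {"indoor_outdoor_1": "indoor", "scenery_1": "urban"},
    {"indoor_outdoor_1": "outdoor", "scenery_1": "urban"},
    {"indoor_outdoor_1": "outdoor", "scenery_1": "urban"},
    {"indoor_outdoor_1": "indoor", "scenery_1": "urban"},
    {"indoor_outdoor_1": "indoor", "scenery_1": "urban"},
]

setting_counts = Counter(r["indoor_outdoor_1"] for r in rows)
pair_counts = Counter((r["indoor_outdoor_1"], r["scenery_1"]) for r in rows)

# Rows labelled indoor but still assigned an outdoor scenery class
to_review = [i for i, r in enumerate(rows)
             if r["indoor_outdoor_1"] == "indoor" and r["scenery_1"] != "neither"]

print(setting_counts)
print(pair_counts)
print(to_review)
```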

Alternative: same task with `InferenceUnsloth`¶

Same prompt and schema, but using the Unsloth backend. Set batch_size>1 if you have spare VRAM to amortize the per-image overhead.

In [ ]:

# reuse the same GeoTaggedData (gtd) — set_images('photo') was already called above
data_unsloth = InferenceUnsloth(
    llm='unsloth/Qwen3-VL-3B-Instruct',
    load_in_4bit=True,
    geo_tagged_data=gtd,
    schema={
        "focus": (str, ...),
        "indoor_outdoor": (Literal['indoor', 'outdoor'], ...),
        "scenery": (Literal['neither', 'urban', 'nature', 'both'], ...),
    },
)
result = data_unsloth.batch_inference(prompt=prompt, batch_size=2)
result.head(5)

3 Did you hear the wind?¶

3.1 Retrieve Freesound recordings within a radius¶

In [9]:

# set a center point for spatially querying data
gtd = GeoTaggedData(locations=[[139.726978,35.658524]])
gtd.get_sound_from_location(key=freesound_key,
                            query='field_recording', # search for field recordings
                            distance=5000, # search radius in meters
                            max_return=200, # maximum number of recordings to return
                            duration=(20, 6000), # only search for recordings with a duration between 20 and 6000 seconds
                            slice_duration=10, # mark a checkpoint every 10 seconds for clipping the sound
                            slice_max_num=2, # keep at most two clips from each sound
                            silent=False
                            )


Alternative: same task with Radio Aporee¶

Radio Aporee field recordings live on the Internet Archive (radio-aporee-maps collection). urbanworm ships three helpers:

fetch_aporee_catalog(...) — pull a catalog of geotagged sounds from IA with optional bbox / year / hour / season filters.
enrich_aporee_catalog(...) — probe each URL once with pydub / mutagen to populate duration_s (Aporee doesn't carry duration in its metadata). Run this once and persist the CSV; subsequent calls reuse it.
gtd.get_sound_from_location(source='aporee', catalog=..., ...) — filter the catalog by spatial proximity and slice for inference. Mirrors the Freesound path including the slice_duration / max_return knobs.

No API key needed for Aporee, just the catalog (CSV path or DataFrame).
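
One detail to watch when preparing the bbox filter: the Aporee catalog fetch in this tutorial takes its bounding box as (lat_min, lon_min, lat_max, lon_max), while point locations are given as [lon, lat]. The sketch below converts a longitude-first box to the latitude-first order and checks that the query center falls inside it. Both helper functions are hypothetical and written only for this illustration.

```python
def lonlat_to_latlon_bbox(bbox):
    """(lon_min, lat_min, lon_max, lat_max) -> (lat_min, lon_min, lat_max, lon_max)."""
    lon_min, lat_min, lon_max, lat_max = bbox
    return (lat_min, lon_min, lat_max, lon_max)

def contains(latlon_bbox, lon, lat):
    """True if the (lon, lat) point lies inside a latitude-first bounding box."""
    lat_min, lon_min, lat_max, lon_max = latlon_bbox
    return lat_min <= lat <= lat_max and lon_min <= lon <= lon_max

tokyo_lonlat = (139.65, 35.55, 139.85, 35.75)   # longitude-first order
tokyo_bay = lonlat_to_latlon_bbox(tokyo_lonlat)  # latitude-first order
center = (139.726978, 35.658524)                 # (lon, lat) query point

print(tokyo_bay, contains(tokyo_bay, *center))
```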

In [ ]:

# 1) Fetch the catalog from Internet Archive (server-side bbox + year filters,
#    client-side hour + season filters). This is a one-shot operation; persist
#    the CSV and skip this step on subsequent runs.
tokyo_bay = (35.55, 139.65, 35.75, 139.85)  # (lat_min, lon_min, lat_max, lon_max)
catalog = fetch_aporee_catalog(
    bbox=tokyo_bay,
    year=(2015, 2024),
    rows=200,                  # cap for the demo
    out_path='aporee_tokyo.csv',
)
catalog.head(5)

In [ ]:

# 2) (Optional) probe durations once so slicing works without per-call probes.
#    Drops anything shorter than 10s. Skip this if you don't need slicing.
enrich_aporee_catalog(
    'aporee_tokyo.csv',
    out_path='aporee_tokyo.csv',
    min_duration=10,
)

In [ ]:

# 3) Now use it like Freesound — same `get_sound_from_location` API,
#    just `source='aporee'` and `catalog=...` instead of `key=...`.
gtd = GeoTaggedData(locations=[[139.726978, 35.658524]])
gtd.get_sound_from_location(
    source='aporee',
    catalog='aporee_tokyo.csv',
    distance=5000,                 # meters
    max_return=200,
    duration=(20, 6000),           # filter on duration_s if available
    slice_duration=10,             # 10s clip checkpoints
    slice_max_num=2,               # at most 2 clips per recording
    silent=False,
)

In [ ]:

# Inspect what we got — same shape as the Freesound payload.
gtd.audio_metadata.head(5)

Once gtd.audios is populated, the download and inference cells below work unchanged for either source — download_to_dir(data='audio', ...) and InferenceLlamacpp(geo_tagged_data=gtd) do the same thing whether the underlying sounds came from Freesound or Aporee.

In [10]:

len(gtd.audio_metadata)

Out[10]:

In [11]:

gtd.audio_metadata.head(5)

Out[11]:

	id	name	username	license	created	duration	tags	geotag	latitude	...	url	page_url	description	num_downloads	avg_rating	slice	preview-hq-mp3	preview-hq-ogg	preview-lq-mp3	preview-lq-ogg
18	686617	Rapid Male Chanting in Temple in Japan	calebjay	https://creativecommons.org/licenses/by/4.0/	2023-05-10T07:47:39Z	28.7270	[acapella, a-cappella, acappella, field-record...	35.64929634007673 139.74117500000023	35.649296	...	https://freesound.org/people/calebjay/sounds/6...	https://freesound.org/people/calebjay/sounds/6...	A man chanting in a temple in Japan. The templ...	151	5.000000	[[0, 10000], [10000, 20000]]	https://cdn.freesound.org/previews/686/686617_...	https://cdn.freesound.org/previews/686/686617_...	https://cdn.freesound.org/previews/686/686617_...	https://cdn.freesound.org/previews/686/686617_...
1	135207	train from ebisu to shibuya 恵比寿	djgriffin	http://creativecommons.org/publicdomain/zero/1.0/	2011-11-19T06:20:01Z	160.5000	[board, dr-1, ebisu, field, japan, platform, r...	35.6467 139.71012	35.646700	...	https://freesound.org/people/djgriffin/sounds/...	https://freesound.org/people/djgriffin/sounds/...	a field recording using the tascam dr-1 from e...	486	4.888889	[[0, 10000], [10000, 20000]]	https://cdn.freesound.org/previews/135/135207_...	https://cdn.freesound.org/previews/135/135207_...	https://cdn.freesound.org/previews/135/135207_...	https://cdn.freesound.org/previews/135/135207_...
36	440606	Gong in zen buddhism	florianreichelt	http://creativecommons.org/publicdomain/zero/1.0/	2018-09-17T14:08:42Z	32.6200	[bell, buddhism, buddhist, cymbal, drum, gong,...	35.6580390655 139.749506012	35.658039	...	https://freesound.org/people/florianreichelt/s...	https://freesound.org/people/florianreichelt/s...	We recorded this sound while our trip through ...	1093	4.277778	[[0, 10000], [10000, 20000]]	https://cdn.freesound.org/previews/440/440606_...	https://cdn.freesound.org/previews/440/440606_...	https://cdn.freesound.org/previews/440/440606_...	https://cdn.freesound.org/previews/440/440606_...
38	440600	buddhsim monk is playing a gong in zen buddhism	florianreichelt	http://creativecommons.org/publicdomain/zero/1.0/	2018-09-17T14:08:32Z	38.1747	[bang, bell, blacksmith, buddhism, clang, ding...	35.6577128899 139.749545429	35.657713	...	https://freesound.org/people/florianreichelt/s...	https://freesound.org/people/florianreichelt/s...	We recorded this sound while our trip through ...	203	4.888889	[[0, 10000], [10000, 20000]]	https://cdn.freesound.org/previews/440/440600_...	https://cdn.freesound.org/previews/440/440600_...	https://cdn.freesound.org/previews/440/440600_...	https://cdn.freesound.org/previews/440/440600_...
33	799435	Hie Shrine,Shichi-Go-San,Tsuri Taiko (Noise Fi...	Hinoki.owo	https://creativecommons.org/licenses/by/4.0/	2025-04-18T11:25:21Z	52.0000	[ambiance, ambience, ambient, asia, background...	35.674787 139.739845	35.674787	...	https://freesound.org/people/Hinoki.owo/sounds...	https://freesound.org/people/Hinoki.owo/sounds...	Late November of 2024, the visit of Hie Shrine...	13	5.000000	[[0, 10000], [10000, 20000]]	https://cdn.freesound.org/previews/799/799435_...	https://cdn.freesound.org/previews/799/799435_...	https://cdn.freesound.org/previews/799/799435_...	https://cdn.freesound.org/previews/799/799435_...

5 rows × 22 columns

In [5]:


gtd.audios['id'][0], gtd.audios['data'][0], gtd.audios['slice'][0]

Out[5]:

(686617,
 'https://cdn.freesound.org/previews/686/686617_13137374-hq.mp3',
 [0, 10000])
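Each entry in gtd.audios carries a slice such as [0, 10000], and the metadata table lists [[0, 10000], [10000, 20000]] for every recording shown, whether it lasts 28.7 s or 160.5 s. One plausible reading is that slices are fixed 10-second windows expressed in milliseconds, capped at two per recording. The sketch below reproduces that pattern under those assumptions; make_slices, window_ms, and max_slices are hypothetical names for illustration and are not part of urbanworm.

```python
def make_slices(duration_s, window_ms=10_000, max_slices=2):
    """Return [start_ms, end_ms] windows from the start of a recording.

    Assumptions (not confirmed by the library): boundaries are in
    milliseconds, windows have a fixed length, an incomplete trailing
    window is dropped, and at most `max_slices` windows are kept.
    """
    total_ms = int(duration_s * 1000)
    slices = []
    start = 0
    while start + window_ms <= total_ms and len(slices) < max_slices:
        slices.append([start, start + window_ms])
        start += window_ms
    return slices


# Durations taken from the metadata table above.
print(make_slices(28.727))
print(make_slices(160.5))
```

Under these assumptions both calls yield the same two windows seen in the slice column, which would explain why a recording contributes more than one item to batch inference.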

3.2 Download dataset (optional but recommended)¶

In [ ]:


gtd.download_to_dir(data='audio',
                    to_dir='/audio_download',
                    prefix='audio_download')

3.3 Pass it to the inference constructor¶

In [13]:


# pass the dataset to the inference constructor
data = InferenceLlamacpp(geo_tagged_data=gtd)

3.4 Inference¶

In [17]:

from typing import Literal

prompt = '''
    Please answer the following question after listening to the audio:
    Can you clearly hear wind sounds?

    Your answer should be yes / no
'''

# specify model
data.llm = 'ggml-org/Qwen2.5-Omni-7B-GGUF:Q8_0'
# define output schema
data.schema = {"answer": (Literal['yes', 'no'], ...)}
# inference
result = data.batch_inference(prompt=prompt,
                              audio_input = True # indicate that input data is audio
                              )
result.head(5)

Processing...: 100%|███████████████████████| 82/82 [13:19<00:00,  9.75s/it]

Out[17]:

	answer_1	data_1	answer_2
0	no	https://cdn.freesound.org/previews/686/686617_...	NaN
1	no	https://cdn.freesound.org/previews/686/686617_...	NaN
2	no	https://cdn.freesound.org/previews/135/135207_...	NaN
3	no	https://cdn.freesound.org/previews/135/135207_...	NaN
4	no	https://cdn.freesound.org/previews/440/440606_...	NaN
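In the output above the same preview URL appears on consecutive rows, which suggests one row per audio slice rather than one per recording. To answer the case-study question per recording, the slice-level answers can be collapsed so that a recording counts as windy if any of its slices was answered "yes". The sketch below is a minimal illustration in plain Python; the rows are hypothetical stand-ins (the file names are invented), and with the real table they could be obtained with result.to_dict("records").

```python
from collections import defaultdict

# Hypothetical slice-level rows using the column names shown above.
rows = [
    {"data_1": "rec_a.mp3", "answer_1": "no"},
    {"data_1": "rec_a.mp3", "answer_1": "yes"},
    {"data_1": "rec_b.mp3", "answer_1": "no"},
    {"data_1": "rec_b.mp3", "answer_1": "no"},
]


def wind_by_recording(rows):
    """Map each recording to True if any of its slices was answered 'yes'."""
    heard = defaultdict(bool)
    for row in rows:
        heard[row["data_1"]] |= row["answer_1"] == "yes"
    return dict(heard)


summary = wind_by_recording(rows)
print(summary)
print(sum(summary.values()), "of", len(summary), "recordings with wind")
```

The "any slice" rule is one design choice among several; a stricter alternative would require every slice of a recording to be answered "yes".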
