Urban Sound & Sight: Dataset and benchmark for audio-visual urban scene understanding

Fuentes, Magdalena - Steers, Bea - Zinemanas, Pablo - Rocamora, Martín - Bondi, Luca - Wilkins, Julia - Shi, Qianyi - Hou, Yao - Das, Samarjit - Serra, Xavier - Bello, Juan Pablo

Abstract:

Automatic audio-visual urban traffic understanding is a growing area of research with many potential applications of value to industry, academia, and the public sector. Yet the lack of well-curated resources for training and evaluating models hinders research in this area. To address this, we present a curated audio-visual dataset, Urban Sound & Sight (Urbansas), developed for investigating the detection and localization of sounding vehicles in the wild. Urbansas consists of 12 hours of unlabeled data along with 3 hours of manually annotated data, including bounding boxes with vehicle classes and unique track IDs, and strong audio labels specifying vehicle types and indicating off-screen sounds. We discuss the challenges presented by the dataset and how to use its annotations for the localization of vehicles in the wild through audio models.
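As a rough illustration of how annotations of this kind might be consumed, the sketch below pairs frame-level bounding boxes with clip-level strong audio labels. The file names, column names, and the `offscreen` flag are assumptions made for illustration, not the dataset's documented schema; consult the Urbansas release for the actual annotation format.

```python
import csv
from collections import defaultdict

# Hypothetical file names and column layout -- adjust to the
# actual annotation files shipped with the dataset.
VIDEO_CSV = "video_annotations.csv"  # frame-level bounding boxes
AUDIO_CSV = "audio_annotations.csv"  # strong (timestamped) audio labels

def load_video_annotations(path=VIDEO_CSV):
    """Group bounding-box records by clip."""
    clips = defaultdict(list)
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            clips[row["clip_id"]].append({
                "frame": int(row["frame"]),
                "track_id": int(row["track_id"]),  # unique per vehicle
                "label": row["label"],             # vehicle class
                "box": tuple(float(row[k]) for k in ("x", "y", "w", "h")),
            })
    return clips

def load_audio_annotations(path=AUDIO_CSV):
    """Collect strong audio labels: onset/offset, class, off-screen flag."""
    clips = defaultdict(list)
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            clips[row["clip_id"]].append({
                "start": float(row["start"]),
                "end": float(row["end"]),
                "label": row["label"],
                # Off-screen sources have no matching bounding box.
                "offscreen": row["offscreen"] == "True",
            })
    return clips

if __name__ == "__main__":
    video = load_video_annotations()
    audio = load_audio_annotations()
    # Keep only audio events that can be matched to on-screen boxes,
    # e.g. when supervising an audio localization model with video.
    for clip_id, events in audio.items():
        on_screen = [e for e in events if not e["offscreen"]]
        print(clip_id, len(on_screen), "on-screen events,",
              len(video.get(clip_id, [])), "boxes")
```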


Bibliographic Details
2022
Location awareness
Training
Industries
Annotations
Conferences
Signal processing
Benchmark testing
Audio-visual
Urban research
Traffic
Dataset
English
Universidad de la República
COLIBRI
https://ieeexplore.ieee.org/document/9747644
https://hdl.handle.net/20.500.12008/31397
Open access
Creative Commons Attribution-NonCommercial-NoDerivatives License (CC BY-NC-ND 4.0)