Enhancing Urban Sound Classification with CNN-Transformer Hybrid Model and Spectrogram Augmentation
DOI: https://doi.org/10.26692/surjss.v57i02.7708

Keywords: Urban sound classification, CNNs, LSTM, Spectrogram augmentation

Abstract
Urban Sound Classification (USC) is a crucial component of audio recognition systems, with applications in smart cities, surveillance, and multimedia. Despite significant advances, classifying environmental sounds remains challenging due to the complex nature of urban audio signals, which exhibit high intra-class variability and overlapping sound events. In this paper, we propose a novel hybrid model that integrates the strengths of Convolutional Neural Networks (CNNs) and Transformer architectures to improve the identification accuracy of urban sounds. The CNN component extracts local spectral features from Mel spectrograms, while the Transformer captures global temporal dependencies through self-attention mechanisms. Additionally, we incorporate advanced spectrogram augmentation techniques, such as time masking, frequency masking, and time warping, to further enhance the model's robustness and generalization. Experimental results on the UrbanSound8K dataset demonstrate that the proposed CNN-Transformer hybrid model outperforms traditional CNN and Long Short-Term Memory (LSTM)-based approaches, achieving a classification accuracy of 93.36%. These results highlight the effectiveness of combining CNNs with Transformers and data augmentation strategies for robust urban sound classification.
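The time- and frequency-masking augmentations mentioned in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation; the function names, default mask widths, and the zero fill value are illustrative assumptions, and time warping is omitted for brevity.

```python
import numpy as np

def time_mask(spec, max_width=20, rng=None):
    """Zero out a random contiguous span of time frames in a Mel spectrogram.

    spec: 2-D array of shape (n_mels, n_frames).
    max_width: upper bound (inclusive) on the number of masked frames.
    """
    rng = rng or np.random.default_rng()
    n_frames = spec.shape[1]
    width = int(rng.integers(0, max_width + 1))
    start = int(rng.integers(0, max(1, n_frames - width + 1)))
    out = spec.copy()
    out[:, start:start + width] = 0.0
    return out

def freq_mask(spec, max_width=8, rng=None):
    """Zero out a random contiguous band of Mel frequency bins."""
    rng = rng or np.random.default_rng()
    n_mels = spec.shape[0]
    width = int(rng.integers(0, max_width + 1))
    start = int(rng.integers(0, max(1, n_mels - width + 1)))
    out = spec.copy()
    out[start:start + width, :] = 0.0
    return out
```

In SpecAugment-style training pipelines, masks of this kind are typically drawn afresh for each training example on every epoch, so the network never sees the same occlusion pattern twice.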
Copyright (c) 2026 Sindh University Research Journal - SURJ (Science Series)

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.


