Enhancing Urban Sound Classification with CNN-Transformer Hybrid Model and Spectrogram Augmentation

Authors

  • NOUMAN IJAZ Department of Electrical, Electronic and Computer Engineering, University of Ulsan, Ulsan 44610, South Korea
  • MD NAZMUL HASSAN Department of Electrical, Electronic and Computer Engineering, University of Ulsan, Ulsan 44610, South Korea
  • SANA ULLAH JAN School of Computing, Engineering and Built Environment, Edinburgh Napier University, EH10 5DT Edinburgh, U.K.
  • INSOO KOO Department of Electrical, Electronic and Computer Engineering, University of Ulsan, Ulsan 44610, South Korea

DOI:

https://doi.org/10.26692/surjss.v57i02.7708

Keywords:

Urban sound classification, CNNs, LSTM, Spectrogram augmentation

Abstract

Urban Sound Classification (USC) is a crucial component of audio recognition systems, with applications in smart cities, surveillance, and multimedia. Despite significant advances, the classification of environmental sounds remains a challenge due to the complex nature of urban audio signals, characterized by high intra-class variability and overlapping sound events. In this paper, we propose a novel hybrid model that integrates the strengths of Convolutional Neural Networks (CNNs) and Transformer architectures to improve the identification accuracy of urban sounds. The CNN component effectively extracts local spectral features from Mel spectrograms, while the Transformer captures global temporal dependencies through self-attention mechanisms. Additionally, we incorporate advanced spectrogram augmentation techniques, such as time masking, frequency masking, and time warping, to further enhance the model’s robustness and generalization capabilities. Experimental results on the UrbanSound8K dataset demonstrate that the proposed CNN-Transformer hybrid model outperforms traditional CNN and Long Short-Term Memory (LSTM)-based approaches, achieving a classification accuracy of 93.36%. These results highlight the effectiveness of combining CNNs with transformers and data augmentation strategies for robust urban sound classification.
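For readers who want a concrete picture of the pipeline described above, the following is a minimal PyTorch/torchaudio sketch of one way a CNN-Transformer hybrid with SpecAugment-style masking could be organized. It is illustrative only: the CNNTransformerHybrid class, its layer widths, the 128-Mel-bin input resolution, the pooling factors, and the masking parameters are assumptions for demonstration, not the configuration reported in the paper, and time warping is omitted.

import torch
import torch.nn as nn
import torchaudio.transforms as T

class CNNTransformerHybrid(nn.Module):
    # CNN front-end over (batch, 1, n_mels, time) Mel spectrograms,
    # followed by a Transformer encoder over the resulting time frames.
    def __init__(self, n_mels=128, n_classes=10, d_model=256, n_heads=4, n_layers=4):
        super().__init__()
        self.cnn = nn.Sequential(                            # local spectral features
            nn.Conv2d(1, 32, 3, padding=1), nn.BatchNorm2d(32), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.BatchNorm2d(64), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.proj = nn.Linear(64 * (n_mels // 4), d_model)   # flatten the frequency axis
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)  # global temporal context
        self.classifier = nn.Linear(d_model, n_classes)

    def forward(self, x):                                    # x: (B, 1, n_mels, T)
        h = self.cnn(x)                                      # (B, 64, n_mels//4, T//4)
        b, c, f, t = h.shape
        h = h.permute(0, 3, 1, 2).reshape(b, t, c * f)       # one feature vector per time frame
        h = self.encoder(self.proj(h))                       # self-attention across frames
        return self.classifier(h.mean(dim=1))                # clip-level class scores

# SpecAugment-style masking applied to the Mel spectrogram (mask sizes are illustrative).
augment = nn.Sequential(
    T.FrequencyMasking(freq_mask_param=24),                  # zero a random band of Mel bins
    T.TimeMasking(time_mask_param=48),                       # zero a random span of frames
)

mel = torch.randn(4, 1, 128, 200)                            # dummy batch: 4 clips, 128 Mel bins, 200 frames
logits = CNNTransformerHybrid()(augment(mel))                # (4, 10) scores for UrbanSound8K's 10 classes

In this sketch the CNN halves the frequency and time axes twice before the frames are treated as a sequence, and mean-pooling the encoder output over time yields a single clip-level prediction, mirroring the local-feature/global-dependency split described in the abstract.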


Published

2025-12-30

How to Cite

Ijaz, N., Hassan, M. N., Jan, S. U., & Koo, I. (2025). Enhancing Urban Sound Classification with CNN-Transformer Hybrid Model and Spectrogram Augmentation. Sindh University Research Journal - SURJ (Science Series), 57(02), 39–46. https://doi.org/10.26692/surjss.v57i02.7708