Processing Audio Data for a Neural Network

For my capstone project, I am collaborating with two other students to write a neural network that classifies music genres from audio files. In this post, I want to talk about getting the data ready to train our model. Processing audio data takes a fair amount of background knowledge: knowing a little about Librosa, spectrograms, zero crossings, and spectral centroids will save you research time. I also think that knowing the general layout of TensorFlow and having some background on Fast Fourier Transforms will help you hit the ground running.

I will begin by talking a bit about Librosa, a Python library used to analyze and extract useful features from audio files. You can pip install it into a virtual environment or onto your local machine, either on its own or together with everything else listed in our project's requirements.txt [5].

pip install librosa
pip install -r requirements.txt

I prefer the local machine since I will be using it for the entire term. Besides Librosa, you will need TensorFlow, matplotlib, numpy, pandas, scikit-learn (imported as sklearn), and a number of other libraries that I will not list here but that you can see in the requirements.txt file in our project directory. The os module is used as well, but it ships with Python, so there is nothing to install for it.

To get the audio data ready for our model, I used common preprocessing methods. What follows describes how the data was processed, which parameters were taken into account, and what technology was used to do it.

We begin by taking the short-time Fourier transform (STFT) of the audio and using it to compute Chroma features. Chroma features built on the STFT carry information about pitch classes and signal structure [3]. I also had to obtain the root-mean-square deviation, shown in Formula (1).

Formula (1): root mean square deviation.

$$\mathrm{RMSD} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(x_i - \hat{x}_i\right)^2}$$

Formula (1) is a necessary parameter, and thankfully it is already built into Librosa: you simply call it as a function from the library. I also needed the Spectral Centroid. This parameter indicates where the "center of mass" of a sound is located and is calculated as the weighted mean of the frequencies present in the sound. If the frequency content of a piece of music stays the same throughout, the spectral centroid stays around one central value; if high frequencies come in at the end of the sound, the centroid shifts toward the end [1].
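To make the two quantities in this paragraph concrete, here is a minimal, dependency-free sketch of a per-frame root-mean-square value (the root-mean-square deviation of the samples from zero) and a spectral centroid. This is not the project's code: in the project these values come from Librosa (`librosa.feature.rms` and `librosa.feature.spectral_centroid`). The helper names, the 8000 Hz sample rate, the 256-sample frame, and the two test tones are all assumptions made for illustration, and the naive DFT is only there to keep the example self-contained.

```python
import cmath
import math


def rms(frame):
    """Root-mean-square value of one frame of samples."""
    return math.sqrt(sum(x * x for x in frame) / len(frame))


def magnitude_spectrum(frame):
    """Magnitudes of the non-negative-frequency DFT bins (naive O(n^2) DFT)."""
    n = len(frame)
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * i / n) for i, x in enumerate(frame)))
        for k in range(n // 2 + 1)
    ]


def spectral_centroid(frame, sample_rate):
    """Magnitude-weighted mean frequency (in Hz) of one frame."""
    mags = magnitude_spectrum(frame)
    freqs = [k * sample_rate / len(frame) for k in range(len(mags))]
    total = sum(mags)
    return sum(f * m for f, m in zip(freqs, mags)) / total if total else 0.0


def sine(freq, sample_rate, n):
    """n samples of a unit-amplitude sine wave (hypothetical test signal)."""
    return [math.sin(2 * math.pi * freq * i / sample_rate) for i in range(n)]


SAMPLE_RATE = 8000  # Hz, assumed for this example
N = 256             # samples per frame, assumed for this example

low = sine(500, SAMPLE_RATE, N)
high = sine(2000, SAMPLE_RATE, N)

rms_low = rms(low)
centroid_low = spectral_centroid(low, SAMPLE_RATE)
centroid_high = spectral_centroid(high, SAMPLE_RATE)
print(rms_low, centroid_low, centroid_high)
```

For a pure tone the centroid sits at the tone's own frequency, so the 2000 Hz tone gets a higher centroid than the 500 Hz tone. Real music mixes many frequencies, and the centroid lands wherever most of the spectral weight is.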

I also calculated the spectral bandwidth, which is another feature available from Librosa. The definition given in [4], the width of a band of light at one-half the peak maximum, comes from optics; for audio, the spectral bandwidth describes how widely the frequencies in a frame spread around the spectral centroid.

The last features we needed from the audio were a bit more complicated to obtain: the zero-crossing rate and the Mel-Frequency Cepstral Coefficients. The zero-crossing rate is the simpler of the two, so I will begin there. It is the rate of sign changes along a signal. This feature has been used heavily in both speech recognition and music information retrieval, and it usually has higher values for highly percussive sounds like those in metal and rock [1]. The Mel-Frequency Cepstral Coefficients (MFCCs) are a small set of features (usually about 10–20) that concisely describe the overall shape of a spectral envelope [1].

All of these features were extracted from 1,000 .wav files, each 30 seconds long, covering 10 genres with 100 snippets per genre.
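The two simpler features above can be sketched without any libraries. This is only an illustration under stated assumptions, not the project's script: the project gets these values from Librosa (`librosa.feature.zero_crossing_rate` and `librosa.feature.spectral_bandwidth`). Here the bandwidth is taken as the magnitude-weighted standard deviation of frequency around the centroid, and the helper names, sample rate, frame size, and test tones are all made up for the example.

```python
import cmath
import math


def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)


def magnitude_spectrum(frame):
    """Magnitudes of the non-negative-frequency DFT bins (naive O(n^2) DFT)."""
    n = len(frame)
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * i / n) for i, x in enumerate(frame)))
        for k in range(n // 2 + 1)
    ]


def spectral_bandwidth(frame, sample_rate):
    """Magnitude-weighted spread (in Hz) of frequencies around the centroid."""
    mags = magnitude_spectrum(frame)
    total = sum(mags)
    if total == 0:
        return 0.0
    weights = [m / total for m in mags]
    freqs = [k * sample_rate / len(frame) for k in range(len(mags))]
    centroid = sum(f * w for f, w in zip(freqs, weights))
    variance = sum(w * (f - centroid) ** 2 for f, w in zip(freqs, weights))
    return math.sqrt(variance)


def tone(freq, sample_rate, n, phase=0.1):
    """n samples of a sine wave; the small phase keeps samples off exact zero."""
    return [math.sin(2 * math.pi * freq * i / sample_rate + phase) for i in range(n)]


SAMPLE_RATE = 8000  # Hz, assumed for this example
N = 256             # samples per frame, assumed for this example

low = tone(500, SAMPLE_RATE, N)
high = tone(2000, SAMPLE_RATE, N)
mix = [a + b for a, b in zip(low, high)]  # two equally loud tones

zcr_low = zero_crossing_rate(low)
zcr_high = zero_crossing_rate(high)
bw_pure = spectral_bandwidth(low, SAMPLE_RATE)
bw_mix = spectral_bandwidth(mix, SAMPLE_RATE)
print(zcr_low, zcr_high, bw_pure, bw_mix)
```

The 2000 Hz tone changes sign about four times as often as the 500 Hz tone, which is why noisy, percussive material tends to score higher on this feature. The single tone has a bandwidth close to zero, while the two-tone mix spreads 750 Hz on either side of its 1250 Hz centroid.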

Building a neural network is complicated. There are a lot of moving parts that need to be put in place. Thankfully, I have a great team that has been working on different parts of our neural network. This week, I got to learn about processing data and getting it ready for our model to train and test on. Next week, I will be able to run the script I wrote and let our model take in the processed data for training.

References:

[1] https://towardsdatascience.com/extract-features-of-music-75a3f9bc265d

[2] http://cs229.stanford.edu/proj2018/report/21.pdf

[3] https://www.researchgate.net/figure/Mel-Spectrogram-3-Chroma-STFT-The-Chroma-value-of-an-audio-basically-represent-the_fig4_346659500

[4] http://www.perseena.com/index/newsinfo/c_id/37/n_id/7.html

[5] https://github.com/cannon-steven/MusicGenreClassification_NeuralNetwork

Published by Mateo Estrada Jorge

Hi, I am a CS and physics double major. I like rock climbing, kayaking, and playing video games. I am always up for a game of chess!
