Mel-Spectrograms, Music, and Machines

Working on the Top-n Music Genre Classification project has exposed my team and me to various technologies for representing music, performing mathematical operations efficiently, and building deep learning models. Of these technologies, Librosa, the Python package for music and audio analysis, has been my favorite so far. In particular, we have been using Librosa to perform feature extraction on our audio dataset, converting audio clips into mel-spectrograms, which are visual representations of audio data. As someone with a background in music and a deep appreciation for mathematics and data analysis, I have found working with a technology that bridges these two realms both satisfying and stimulating. As far as data representations go, mel-spectrograms have the rare trait of being readable by both machines and humans. A mel-spectrogram is essentially a matrix of floating-point numbers in which each row is a mel-frequency band, each column is a moment in time, and each value is the energy in that band at that moment. When the matrix is visualized, the “shapes” in the audio are clear to see.


Mel-spectrograms are created by splitting audio clips into short, overlapping segments, converting each segment into a frequency-domain representation, and mapping the resulting frequencies onto the mel scale, which spaces them the way human hearing perceives pitch. The segments are then arranged side by side into a time-frequency grid that captures how the energy at each pitch changes over the course of the clip. We are using these mel-spectrograms as the training data for our neural network, which is built with TensorFlow and Keras. Our application will eventually apply the same conversion to incoming audio so that the model can read it and return genre predictions.
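
To make those steps concrete, here is a minimal from-scratch sketch in NumPy. It is not Librosa's implementation and is not the code from our project: the function names, the frame and hop sizes, the 128 mel bands, and the synthetic 440 Hz test tone are all assumptions chosen for illustration. It also uses the HTK-style mel formula and skips the frame padding and filter normalization that Librosa applies by default, so the values will not match Librosa's output exactly.

```python
import numpy as np


def hz_to_mel(hz):
    """Convert Hz to mels (HTK-style formula; Librosa defaults to the Slaney variant)."""
    return 2595.0 * np.log10(1.0 + np.asarray(hz, dtype=float) / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (np.asarray(mel, dtype=float) / 2595.0) - 1.0)


def mel_filterbank(sr, n_fft, n_mels):
    """Triangular filters spaced evenly on the mel scale, shape (n_mels, 1 + n_fft // 2)."""
    fft_freqs = np.linspace(0.0, sr / 2.0, 1 + n_fft // 2)
    # n_mels + 2 points give every filter a left edge, a center, and a right edge.
    edges = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2))
    bank = np.zeros((n_mels, fft_freqs.size))
    for m in range(n_mels):
        left, center, right = edges[m], edges[m + 1], edges[m + 2]
        rising = (fft_freqs - left) / (center - left)
        falling = (right - fft_freqs) / (right - center)
        bank[m] = np.maximum(0.0, np.minimum(rising, falling))
    return bank


def mel_spectrogram(y, sr, n_fft=2048, hop_length=512, n_mels=128):
    """Return a (n_mels, n_frames) power mel-spectrogram of a mono signal."""
    # 1. Split the clip into overlapping segments of n_fft samples.
    n_frames = 1 + (len(y) - n_fft) // hop_length
    frames = np.stack([y[i * hop_length:i * hop_length + n_fft] for i in range(n_frames)])
    # 2. Window each segment and convert it to the frequency domain.
    power = np.abs(np.fft.rfft(frames * np.hanning(n_fft), axis=1)) ** 2
    # 3. Pool the linear frequency bins into mel bands: the time-frequency grid.
    return mel_filterbank(sr, n_fft, n_mels) @ power.T


sr = 22050
t = np.arange(sr) / sr                          # one second of audio
tone = np.sin(2 * np.pi * 440.0 * t)            # synthetic test tone (assumption)
S = mel_spectrogram(tone, sr)
S_db = 10.0 * np.log10(np.maximum(S, 1e-10))    # decibel scale, convenient for plotting
```

In practice, Librosa wraps these steps in `librosa.feature.melspectrogram`, which is the function to reach for rather than a hand-rolled version like this one.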


Currently, we use Librosa's mel-spectrogram feature extraction to preprocess our training data, and we plan to use it in our application to process novel audio streams. Our goal is a final application that can process streaming audio in real time. However, because Librosa does not support GPU acceleration, this may be a stretch, depending on the underlying CPU and other hardware variables. Since mel-spectrograms and the other features Librosa can extract are relatively large matrices, GPU acceleration would be a significant improvement for the library and could open up many new opportunities for its use in streaming applications.
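
One rough way to judge whether CPU-only extraction can keep up with a live stream is to time the extractor on fixed-length chunks and compare the processing time with the duration of the audio. The sketch below is a hypothetical harness, not something from our project: `realtime_factor`, `stand_in_extractor`, the one-second chunk length, and the random-noise input are all assumptions, and the stand-in computes a plain power spectrogram rather than Librosa's mel pipeline so that the example runs with NumPy alone.

```python
import time

import numpy as np


def realtime_factor(extract, sr, chunk_seconds=1.0, n_chunks=20, seed=0):
    """Time `extract` on fixed-length chunks; return processing time / audio duration.

    A value below 1.0 means the extractor keeps up with the stream on this machine.
    """
    rng = np.random.default_rng(seed)
    chunk_len = int(sr * chunk_seconds)
    elapsed = 0.0
    for _ in range(n_chunks):
        # Random noise stands in for a chunk of live audio (assumption).
        chunk = rng.standard_normal(chunk_len).astype(np.float32)
        start = time.perf_counter()
        extract(chunk)
        elapsed += time.perf_counter() - start
    return elapsed / (n_chunks * chunk_seconds)


def stand_in_extractor(chunk, n_fft=2048, hop_length=512):
    """Hypothetical placeholder: a power spectrogram of shape (n_frames, 1 + n_fft // 2)."""
    n_frames = 1 + (len(chunk) - n_fft) // hop_length
    frames = np.stack([chunk[i * hop_length:i * hop_length + n_fft] for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames * np.hanning(n_fft), axis=1)) ** 2


rtf = realtime_factor(stand_in_extractor, sr=22050)
print(f"real-time factor: {rtf:.4f}")
```

To measure the real extractor, pass something like `lambda y: librosa.feature.melspectrogram(y=y, sr=22050)` in place of `stand_in_extractor`. The result depends entirely on the machine, so it is a measurement to run on the target hardware rather than a number to quote.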
