Considering the project I’ve chosen is called “Top-n Music Genre Classification Neural Network,” dealing with audio data in one way or another was a given.
In the last week, my team members and I have gotten started on the initial tasks of the project, which mostly involved research on machine learning and, as the title of this blog post suggests, audio. While I’ve learned a surprising amount about machine learning since this project started, what I find just as surprising is how much I’ve learned about audio.
Going into this project, I naively assumed the only data we would have to give a machine learning model was more or less an audio file at face value, and that with enough audio files (and corresponding genre labels), the model would be able to train itself and make accurate predictions. After spending time reading papers on music classification, I quickly found out that approach isn’t very effective.
The concept of audio features was brought to my attention, and ever since then my efforts have shifted toward understanding the different representations of audio and, more importantly, which representations are distinctive between genres and which aren’t. Some of these features include the mel-frequency cepstrum (MFC), tonnetz, chroma, RMS energy, zero-crossing rate, and spectral contrast, among others.
While these features are still new to me, it’s interesting to learn about them in depth, especially what they represent and the math involved in creating them. They also produce cool-looking images. Overall, this has given me a new perspective on audio that I previously glossed over, and I look forward to continuing to learn about it.