Today, I wrote my first simple program using scikit-learn, a machine learning library for Python. During our kickoff meeting, our sponsor recommended that we familiarize ourselves with this library before getting started on our project. My teammate found a helpful tutorial and shared it with the rest of us.
Of course, the first thing the tutorial had us do was write a Hello World program. Instead of using an IDE like PyCharm or Visual Studio Code, we used Jupyter. Jupyter is helpful when writing a program that interacts with data because it allows you to view the data just below any function calls you make.
After getting familiar with Jupyter and its shortcuts, we began working on a program that predicts a person’s musical preference based on their age and sex. We used a simple dataset with only a few attributes, and we assumed that all males within a certain age range prefer the same type of music. The same was assumed to be true of females. At this point, the intimidation I was feeling about getting started faded away. Working with such a small set of data and having only two factors (age and sex) made it easy to grasp.
Our first step was to train our model to predict a person’s musical preference. Since we only had one set of data, we learned how to split it into four parts: input data for training, input data for testing, output data for training, and output data for testing. We were also able to measure the accuracy of the model by comparing its predictions for the test inputs (musical preference) with the actual test outputs. Then, we played around with different ratios of training data to testing data and saw how drastically this changed the accuracy.
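To make the split-train-score loop concrete, here is a minimal sketch of what it can look like. The rows below are toy data I made up for illustration (not the tutorial's dataset), the 1/0 encoding for sex is my own assumption, and `evaluate` is a hypothetical helper name.

```python
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Made-up toy data for illustration only (not the tutorial's dataset).
# Assumed encoding: sex is 1 for male, 0 for female.
male_ages = [20, 23, 25, 26, 29, 30, 31, 33, 37]
female_ages = [20, 21, 25, 26, 27, 30, 31, 34, 35]

# Inputs: one [age, sex] pair per person.
X = [[age, 1] for age in male_ages] + [[age, 0] for age in female_ages]

# Outputs: the genre each person prefers, in the same order as X.
y = (["HipHop"] * 3 + ["Jazz"] * 3 + ["Classical"] * 3
     + ["Dance"] * 3 + ["Acoustic"] * 3 + ["Classical"] * 3)


def evaluate(test_size, seed=0):
    """Train on one part of the data and score the model on the rest."""
    # Split inputs and outputs into training and testing portions.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=seed
    )

    # Fit a decision tree on the training portion only.
    model = DecisionTreeClassifier(random_state=seed)
    model.fit(X_train, y_train)

    # Compare predictions for the test inputs with the actual test outputs.
    predictions = model.predict(X_test)
    return accuracy_score(y_test, predictions)


# Try a few ratios: test_size is the fraction of rows held back for testing.
for test_size in (0.2, 0.5, 0.8):
    print(f"test_size={test_size}: accuracy={evaluate(test_size):.2f}")
```

With so few rows, the score can swing a lot depending on which rows land in the test set, so trying several values of `seed` as well as `test_size` gives a fairer picture.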
Once we had a model with a sufficiently high accuracy score, we learned how to make our model persistent so we don’t have to train a new model each time we edit our program. This isn’t exactly essential for this simple program, as it deals with only a small amount of data, but it would absolutely be necessary when dealing with more complex programs and large amounts of data.
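One common way to persist a scikit-learn model is with joblib, so here is a rough sketch along those lines. I'm assuming joblib here rather than quoting the tutorial, and the toy data and the file name `music-recommender.joblib` are made up for the example.

```python
import os
import tempfile

import joblib
from sklearn.tree import DecisionTreeClassifier

# Made-up toy data: [age, sex] with sex assumed to be 1 for male, 0 for female.
X = [[20, 1], [25, 1], [30, 1], [20, 0], [25, 0], [30, 0]]
y = ["HipHop", "HipHop", "Jazz", "Dance", "Dance", "Acoustic"]

# Train once.
model = DecisionTreeClassifier(random_state=0).fit(X, y)

# Save the trained model to disk (hypothetical file name, in the temp directory).
model_path = os.path.join(tempfile.gettempdir(), "music-recommender.joblib")
joblib.dump(model, model_path)

# Later, load the saved model instead of training again.
restored = joblib.load(model_path)
print(restored.predict([[21, 1]]))
```

One caveat worth knowing: a saved model should only be loaded from a source you trust, and ideally with the same scikit-learn version that created it.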
We also learned how to create a .dot file, a Graphviz description of the decision tree that our model builds and uses. Below, you can see the line of code that creates this file, followed by the visual representation of the decision tree.
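For anyone following along, here is a sketch of what that export step can look like using scikit-learn's `export_graphviz`. The toy data, the output file name, and the styling options are my own choices for illustration.

```python
import os
import tempfile

from sklearn import tree
from sklearn.tree import DecisionTreeClassifier

# Made-up toy data: [age, sex] with sex assumed to be 1 for male, 0 for female.
X = [[20, 1], [25, 1], [30, 1], [20, 0], [25, 0], [30, 0]]
y = ["HipHop", "HipHop", "Jazz", "Dance", "Dance", "Acoustic"]

model = DecisionTreeClassifier(random_state=0).fit(X, y)

# Hypothetical output path; any writable location works.
dot_path = os.path.join(tempfile.gettempdir(), "music-recommender.dot")

# Write the fitted tree to a Graphviz .dot file.
tree.export_graphviz(
    model,
    out_file=dot_path,
    feature_names=["age", "sex"],      # labels for the two input columns
    class_names=list(model.classes_),  # genre names, in the model's own order
    label="all",                       # show labels on every node
    rounded=True,                      # rounded node boxes
    filled=True,                       # color nodes by majority class
)

print(f"Decision tree written to {dot_path}")
```

The resulting file is plain text, so it can be opened in any editor or rendered as an image with a Graphviz viewer.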
This tutorial showed how accessible machine learning is – with the help of scikit-learn, anyone who is familiar with Python and has even a small amount of data can write a program that uses machine learning to make predictions. This program uses a binary decision tree, but there are several other approaches (including neural networks, which we will be using in our Capstone project). I’m excited to start working with other components from this library!