Intro to ML – Part 1 – Data Exploration

I was drawn to my senior capstone project on fire risk prediction largely because of my interest in ML. After I finish my degree, I’m excited to be joining a team that relies heavily on big data and ML algorithms for customer insights, and it’s been really interesting to learn through my project a little more about what ML, well, actually is.

I thought it’d be fun to use the next few entries to walk through basic ML modeling in Python with a Jupyter notebook. I’ve had a little exposure to this before, but I’m basically re-learning as I go, and it’s been a fun and educational process.

The dataset that I am working with is from Kaggle, which is a great resource for learning ML and finding ML datasets. In my capstone project we are working with proprietary data, so as a substitute for this exercise I am using the Kaggle dataset on US Wages. These are the variables in my dataset, the first few rows, and the commands to display them in Jupyter.
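As a minimal sketch of that step, here is roughly what loading and previewing a table looks like with pandas. The column names and the handful of rows below are made-up stand-ins, not the real Kaggle data, and the file name in the comment is an assumption too.

```python
import pandas as pd

# With the real file downloaded from Kaggle, the load step would be something
# like: df = pd.read_csv("wages.csv")   (file name is an assumption)
# The rows below are invented purely so this cell runs on its own.
df = pd.DataFrame({
    "earn":   [32000, 45000, 51000, 28000, 76000, 39000],
    "height": [65, 70, 68, 62, 72, 66],
    "sex":    ["female", "male", "male", "female", "male", "female"],
    "race":   ["white", "black", "white", "hispanic", "white", "other"],
    "ed":     [12, 14, 16, 12, 18, 13],
    "age":    [34, 41, 29, 52, 45, 38],
})

# List the variables (columns) in the dataset
print(df.columns.tolist())

# Show the first five rows; in Jupyter, ending a cell with df.head()
# renders the same rows as a formatted table
print(df.head())
```

`df.dtypes` and `df.describe()` are two more quick checks worth running at this point, since they show which columns are numeric and which are text.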

We can begin to do some basic data visualization by drawing scatterplots. For example, there is a clear relationship between educational level and earnings based on what we see here.
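A scatterplot like that could be drawn along these lines with matplotlib. This is only a sketch: the `ed` and `earn` column names and the values are invented for illustration, so the shape of this toy plot says nothing about the real data.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Invented stand-in values; with the real dataset, df would come from the CSV
df = pd.DataFrame({
    "ed":   [10, 12, 12, 14, 16, 16, 18],
    "earn": [21000, 28000, 32000, 41000, 50000, 58000, 76000],
})

# One point per person: education on the x-axis, earnings on the y-axis
fig, ax = plt.subplots()
ax.scatter(df["ed"], df["earn"])
ax.set_xlabel("Years of education")
ax.set_ylabel("Annual earnings")
ax.set_title("Earnings vs. education")
```

In a Jupyter notebook the figure shows up under the cell once it finishes running. Swapping `ed` for another numeric column such as height or age gives the same kind of plot for that variable.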

You can see that the variables are the type that we may be able to use to estimate wages: height, gender, educational level, age, etc. Before we can run this as a model, notice that some of our variables need to be transformed. You can’t plug “white” or “female” into an equation! We handle this by breaking the categorical variables down into dummy variables, which are 0/1 columns with one column per category.
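One common way to create dummy variables is `pandas.get_dummies`. The sketch below assumes categorical columns named `sex` and `race` and uses a few invented rows; it may not match the exact command or column names used with the real dataset.

```python
import pandas as pd

# Invented rows with two text (categorical) columns and one numeric column
df = pd.DataFrame({
    "earn": [32000, 45000, 28000],
    "sex":  ["female", "male", "female"],
    "race": ["white", "black", "hispanic"],
})

# Replace each listed text column with one indicator column per category;
# numeric columns such as earn are passed through unchanged
encoded = pd.get_dummies(df, columns=["sex", "race"])

print(encoded.columns.tolist())
print(encoded)
```

Each person now has a true/1 in exactly one `sex_*` column and one `race_*` column. For a linear model, passing `drop_first=True` drops one category per variable, which avoids carrying a redundant column.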

Thanks for joining me as I explored and learned about basic data loading and visualization in Jupyter Notebook. Please continue to follow me in the upcoming weeks as I start implementing some basic ML tools!
