For my capstone project, “Leveraging AI for Improved Public Transit”, we finally have all the data we need to work freely. Going from being an undergrad computer science student to training ML models in PyTorch on 91 million rows of data can be quite a hassle. What’s arguably more of a hassle is prepping that data for the model.
It is not nearly as difficult to make a good ML model as it is to make good data. Usually, good data will produce a good model. The hard part is sorting out nulls and fixing errors so that the data ends up in a shape the model can actually use.
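As a rough illustration of what "sorting out nulls" can look like in practice, here is a minimal sketch that streams a CSV row by row and counts missing values per column, so nothing has to fit in memory at once. The column names (`route_id`, `stop_id`, `delay_minutes`), the sample rows, and the list of strings treated as missing are all made up for this example and are not taken from the project's data.

```python
import csv
import io
from collections import Counter

# Hypothetical sample standing in for a slice of a much larger file.
SAMPLE_CSV = """route_id,stop_id,delay_minutes
10,A1,2.5
10,A2,
12,,4.0
12,B7,NA
15,C3,-1.0
"""

# Strings treated as missing; this is an assumption, adjust it to your data.
MISSING_TOKENS = {"", "na", "nan", "null", "none"}


def count_missing(lines):
    """Stream rows and count missing values per column without loading the whole file."""
    reader = csv.DictReader(lines)
    missing = Counter({name: 0 for name in reader.fieldnames})
    total = 0
    for row in reader:
        total += 1
        for name, value in row.items():
            # A short row yields None for the absent columns.
            if value is None or value.strip().lower() in MISSING_TOKENS:
                missing[name] += 1
    return total, dict(missing)


total_rows, missing_by_column = count_missing(io.StringIO(SAMPLE_CSV))
print(total_rows, missing_by_column)
```

For a real file you would pass an open file handle instead of the `io.StringIO` sample. A per-column count like this is usually a reasonable first step before deciding whether to drop, fill, or investigate the missing values.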
With big data projects like this one, there is a separate problem to worry about: just looking at the data can be hard. It is easy when the data is small enough to sort through as a table, but these are multi-GB files. You can’t just open them like an ordinary CSV file; they are too large. To get around this, you need to learn how to write good SQL queries. I enjoy using a mix of SQL and Python, so I can also display the data nicely in the terminal. The pandas library is very helpful. Specifically for this project, I found that a DuckDB database also works very well.
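To make the "SQL from Python" idea concrete, here is a small sketch of the kind of summary query that replaces scrolling through a table. It uses Python's built-in `sqlite3` module as a stand-in so it runs without installing anything; it is not the DuckDB setup from the project, though plain SQL like this should carry over to DuckDB with little or no change. The `trips` table, its columns, and the rows are hypothetical.

```python
import sqlite3

# In-memory database with a tiny hypothetical table; the real data would be far larger.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE trips (route_id TEXT, delay_minutes REAL)")
con.executemany(
    "INSERT INTO trips VALUES (?, ?)",
    [("10", 2.5), ("10", None), ("12", 4.0), ("12", 6.0), ("15", None)],
)

# Summarize per route instead of scrolling through raw rows.
summary = con.execute(
    """
    SELECT route_id,
           COUNT(*) AS n_rows,
           SUM(CASE WHEN delay_minutes IS NULL THEN 1 ELSE 0 END) AS n_null_delays,
           AVG(delay_minutes) AS avg_delay
    FROM trips
    GROUP BY route_id
    ORDER BY route_id
    """
).fetchall()

# Print one compact line per route in the terminal.
for row in summary:
    print(row)

con.close()
```

The point of the design is that the database does the aggregation and only a handful of summary rows ever reach Python, which is what keeps multi-GB data manageable from a terminal.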
When I first started this project, I focused a lot on getting comfortable with SQL and writing simpler ML models. I did a decent amount of data analysis too. If I could do it over again, I would focus a lot more on data prep. What people don’t realize when they first get into big data is that data prep is nearly ninety percent of the work. The analysis and modeling are only a fraction of the work, even though they seem like the key part. As I said earlier, for good models you need good data, and for good data you need data prep.
If you’re an aspiring data scientist or data work intrigues you, hammer down on your data prep skills! It will take you much further than spending hours and hours trying to perfect an ML model or typing in SQL queries until you find something interesting. I’ve learned a lot from this project, but that is probably the most important lesson of all.