Throughout the creation of our capstone project, we have had to adapt to many unfamiliar languages and frameworks to get the most out of it. We started with Malloy and Python, which let us not only analyze the data but also build models to clean and prep it.
Malloy is a query language very similar to SQL, but the code is much more human-readable and often more concise (it compiles down to SQL rather than replacing it). SQL is the standard for most data queries, and there isn't a whole lot of documentation for Malloy, as the language is far less widespread. So while it can be a lot nicer to work in than SQL, the scarcity of resources makes it harder to learn.
Python on its own isn't much use for data prep and cleaning, so we bring in the Pandas and NumPy libraries. Both can transform data, and they proved especially useful in last week's work I did on the Eugene Weather Data; NumPy is particularly handy for array manipulation. These were libraries I was largely unfamiliar with because I hadn't worked much with data in Python before this point.
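To give a flavor of the kind of prep work this involves, here is a minimal sketch in Pandas and NumPy. The column names and values are made up for illustration; the real Eugene Weather Data has its own schema.

```python
import numpy as np
import pandas as pd

# Hypothetical weather records standing in for the real dataset.
raw = pd.DataFrame({
    "date": ["2024-01-01", "2024-01-02", "2024-01-02", "2024-01-03"],
    "temp_f": [41.0, np.nan, 39.5, 44.2],
    "precip_in": [0.10, 0.35, 0.35, -1.0],  # -1.0 stands in for a bad sensor reading
})

clean = (
    raw.drop_duplicates(subset="date")  # drop repeated days
       .assign(
           date=lambda d: pd.to_datetime(d["date"]),
           # treat negative precipitation as missing, then fill with 0
           precip_in=lambda d: d["precip_in"].where(d["precip_in"] >= 0).fillna(0.0),
           # NumPy handles the unit conversion as a vectorized array operation
           temp_c=lambda d: np.round((d["temp_f"] - 32) * 5 / 9, 1),
       )
)
print(clean)
```

Pandas handles the table-level work (duplicates, types, missing values) while the NumPy expression converts an entire column at once instead of looping row by row.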
Soon, we will be getting into PyTorch, which will let us run machine learning models over our data and turn it into actionable insights. BigQuery is also in the mix: it's Google Cloud's data warehouse, and its BigQuery ML feature can train and run models right where the data lives. That can be super useful for us, but we need to be cautious, because BigQuery charges by the amount of data scanned; just twenty-five table scans produced around seven dollars in charges for us. Regular queries probably won't cost that much since they scan less data, but leaving an ML model running overnight can easily rack up two hundred to four hundred dollars if you aren't careful. PyTorch is a good option for testing ML models, but BigQuery makes it a lot easier to get to the actionable insights we're looking for faster.
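Since PyTorch runs locally for free, it is a natural place to prototype before paying for BigQuery ML jobs. Here is a minimal sketch of the basic PyTorch training loop, fitting a tiny linear model to made-up numbers; the feature and target are placeholders, not our real data.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Made-up stand-in data: one feature (say, daily precipitation) and one
# target with a noisy linear relationship (true weight 3.0, bias 0.5).
X = torch.rand(100, 1)
y = 3.0 * X + 0.5 + 0.05 * torch.randn(100, 1)

model = nn.Linear(1, 1)  # a one-weight linear model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.MSELoss()

for epoch in range(500):
    optimizer.zero_grad()
    loss = loss_fn(model(X), y)  # mean squared error on the whole batch
    loss.backward()              # compute gradients
    optimizer.step()             # update the weight and bias

print(f"learned weight {model.weight.item():.2f}, bias {model.bias.item():.2f}")
```

The same zero-grad / backward / step loop scales up to the bigger models we'll eventually run, just with more layers and real features.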
What have I learned from these technologies? For one, there are a lot of ways to manipulate data using different languages and systems, and each does a slightly different thing in a slightly different way. You can also use many of them together to create better data insights: for example, cleaning the data with Malloy and then running PyTorch ML models on the result. So far, my favorite technology to learn has been BigQuery. There's still a lot to explore with it, but so far I like the interface and its capabilities, and Google Cloud gives you three hundred dollars in free credits to work with.
Malloy has been great for queries, and Python, specifically with the Pandas and NumPy libraries, has been great for data cleaning and prepping. So far I've used Python the most, so I'm also biased toward it. Picking a least favorite is hard because I like all of these technologies, but I might give it to Malloy, just because I already know SQL and Malloy is difficult to pick up given the lack of tutorials and documentation.
As an update on our capstone project overall, we have been prepping and cleaning the Eugene Weather Data and Eugene AQI Data. These are meant for practice, but they simultaneously help with our larger project. For example, if incidents or rider dissatisfaction spike during periods of higher precipitation, we likely need better accommodations for that precipitation. We are also trying to find insights into what happens when a rider takes a certain route at a certain time of day. The list goes on, but the ML models will surface these patterns for us and let us work out solutions to the problems they reveal.
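The precipitation-versus-incidents question can be sanity-checked with a simple join before any ML is involved. This sketch uses entirely hypothetical tables and column names; the real ridership and weather data have their own schemas.

```python
import pandas as pd

# Hypothetical daily tables standing in for the real datasets.
weather = pd.DataFrame({
    "date": pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-03", "2024-01-04"]),
    "precip_in": [0.0, 0.6, 0.0, 0.8],
})
incidents = pd.DataFrame({
    "date": pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-02", "2024-01-04"]),
    "route": ["A", "A", "B", "B"],
})

# Count incidents per day, join to weather, and compare rainy vs dry days.
daily = (
    incidents.groupby("date").size().rename("incident_count").reset_index()
    .merge(weather, on="date", how="right")  # keep days with zero incidents
    .fillna({"incident_count": 0})
)
daily["rainy"] = daily["precip_in"] > 0.1
print(daily.groupby("rainy")["incident_count"].mean())
```

If the mean incident count on rainy days is consistently higher, that is the kind of pattern an ML model could then quantify and extend across routes and times of day.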