
ML & Data Prep

For my capstone project, “Leveraging AI for Improved Public Transit”, we finally have all the data we need to work freely. Making the leap from typical undergrad computer science coursework to training PyTorch ML models on 91 million rows of data can be quite a hassle. What’s arguably more of a hassle is prepping that data for the model.

Making a good ML model is not nearly as difficult as making good data. Good data will usually produce a good model. The hard part is sorting out nulls and fixing errors so the data is in a shape the model can actually use.
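To give a feel for it, here is a minimal pandas sketch of that kind of cleanup; the file and column names are made up for illustration, not our real schema:

import pandas as pd

df = pd.read_csv("rides_sample.csv")  # hypothetical export of the data

print(df.isna().sum())  # see how many nulls each column has

df = df.dropna(subset=["stop_id"])  # rows missing a key field are unusable

df["duration_min"] = df["duration_min"].clip(lower=0)  # negative durations are data-entry errors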

Big data projects like this add a separate element to worry about: visualizing the data. With smaller datasets you can just sort through a visualized table, but these are multi-GB files; you can’t simply open them like a CSV, they are just too large. To combat this, you need to learn how to write good SQL queries. I enjoy using a mix of SQL and Python so I can also display the data nicely in the terminal. The pandas library is very helpful, and specifically for this project, I found that a DuckDB database works very well too.
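As a rough sketch of what I mean (hypothetical file and column names), DuckDB can run SQL over a huge file in place and hand the result to pandas for a clean terminal display:

import duckdb

con = duckdb.connect()
top_routes = con.execute("""
    SELECT route_id, COUNT(*) AS n_trips
    FROM read_parquet('rides.parquet')
    GROUP BY route_id
    ORDER BY n_trips DESC
    LIMIT 10
""").df()  # .df() returns the query result as a pandas DataFrame
print(top_routes)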

When I first started this project, I focused a lot on getting comfortable with SQL and writing simpler ML models. I did a decent amount of data analysis too. If I could do it over again, I would focus much more on data prep. What people don’t realize when they first get into big data is that data prep is nearly ninety percent of the work. The analysis and modeling are only a fraction of it, even though they seem like the key part. As I said earlier, good models need good data, and good data needs data prep.

If you’re an aspiring data scientist or data work intrigues you, hammer down on your data prep skills! It will take you much farther than spending hours and hours trying to perfect an ML model or typing in SQL queries until you find something interesting. I’ve learned a lot from this project, but that is probably the most important thing I have learned.


New Technologies

Throughout the creation of my capstone project, we have had to pick up many unfamiliar languages and frameworks to get the most out of our work. We started with Malloy and Python, which let us analyze the data and also build tooling to clean and prep it.

Malloy is a query language very similar to SQL, but its code is much more human-readable and often more concise. SQL is the standard for most data queries, and there isn’t a whole lot of documentation for Malloy, as the language is far less widespread. So while it can be nicer to work with than SQL, the thin pool of resources makes it harder to learn.

Python on its own is fairly limited for data prep and cleaning, so we bring in the pandas and NumPy libraries. Both can transform data, and they proved especially useful in last week’s work I did on the Eugene Weather Data. NumPy is particularly useful for array manipulation. These were libraries I was largely unfamiliar with, because I hadn’t worked much with data in Python before this point.
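Here is a tiny sketch of the kind of transform I mean; the values and column names are invented, not the real weather schema:

import numpy as np
import pandas as pd

df = pd.DataFrame({"temp_f": [54.0, np.nan, 61.0, 999.0]})

df["temp_c"] = (df["temp_f"] - 32) * 5 / 9  # vectorized transform over the whole column
df["temp_c"] = np.where(df["temp_c"] > 60, np.nan, df["temp_c"])  # mark impossible sensor readings as missing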

Soon, we will be getting into work with PyTorch, which will allow us to run machine learning models over our data. This will let us come up with actionable insights into problems within the data. We are also using BigQuery, Google Cloud’s data warehouse, which can train ML models directly on your tables through BigQuery ML. It can be super useful for us, but we need to be cautious about running ML models on it, because just twenty-five table scans can produce around seven dollars in charges. Regular queries probably won’t cost that much, but running an ML model overnight can easily rack up two hundred to four hundred dollars in charges if you aren’t careful. PyTorch would be a good option for testing ML models, but BigQuery makes it a lot easier to get to the actionable insights we are looking for faster.
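One way to stay safe is to estimate and cap a query’s cost before running it. Here is a sketch using the official BigQuery Python client; the project and table names are placeholders, not our real dataset:

from google.cloud import bigquery

client = bigquery.Client()

query = "SELECT route_id, COUNT(*) AS trips FROM `my-project.transit.rides` GROUP BY route_id"

# Dry run: BigQuery reports how much data the query would scan without charging for it.
dry = client.query(query, job_config=bigquery.QueryJobConfig(dry_run=True, use_query_cache=False))
print(f"Would scan {dry.total_bytes_processed / 1e9:.2f} GB")

# Hard cap: the job fails instead of billing more than ~1 GB.
job = client.query(query, job_config=bigquery.QueryJobConfig(maximum_bytes_billed=10**9))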

What have I learned from these technologies? Well, for one, there are many ways to manipulate data using different languages and systems; each one does a slightly different thing in a slightly different way. You can also use many of them together to create better data insights, for example cleaning the data with Malloy and then running PyTorch ML models on it. So far, my favorite technology to learn has been BigQuery. There’s still a lot to explore with it, but I like the interface and its capabilities. Google also gives you three hundred dollars in free-trial credits to work with.

Malloy has been great for queries, and Python, specifically with the pandas and NumPy libraries, has been great for data cleaning and prepping. So far I’ve used Python the most, so I’m also biased toward it. Picking a least favorite is hard because I like all of these technologies, but I might give it to Malloy, just because I already know SQL and Malloy is difficult to pick up given the lack of tutorials and documentation.

As an update on our capstone project overall, we have been prepping and cleaning the Eugene Weather Data and Eugene AQI Data. These are meant for practice, but they simultaneously help with our larger project. For example, if a certain kind of incident or a dip in rider satisfaction happens during times of higher precipitation, then we likely need more accommodation for that precipitation. We are also trying to find insights on what happens when a rider takes a certain route at a certain time of day. The list goes on and on, but the ML models will surface these patterns for us and let us think of solutions to the problems we find.


Clean Code & Code Smells

There are many ways to make code better. Many of the organizational tactics software engineers use are simpler than you might think, and code benefits from that organization whenever someone is bug-fixing, reading your code, or just trying to understand what is going on.

This is where clean code comes in. Clean code lets you read your own code better and be more efficient with your time. If your variable names or comments don’t make it clear what is happening, that’s a huge problem! When working on a large codebase, especially on a team, it is important to make your code readable. One thing I have taken from the article “Clean Code — A practical approach” by Mikel Ors is that even something as small as replacing a loop’s bare numeric limit with a named variable helps.

Example:

const numCars = 10; // naming the limit tells the reader what the loop counts

for (let i = 0; i < numCars; i++) {
    // ... process car i ...
}

// If the loop just counted to a bare 10, it wouldn't show what it was iterating to, and the reader might get confused. It only takes a little bit of extra effort.

On the flip side, code smells are something to avoid! You don’t want unnecessary things in your code: they hurt readability and bloat the codebase. As Jeff Atwood puts it in “Code Smells”, it is important to ruthlessly delete your unused code. A block of code lying around that contributes nothing to the overall project makes it harder for anyone trying to locate and fix things.

Example:

const numCars = 10;
const numberCars = 5; // never used anywhere: dead code

for (let i = 0; i < numCars; i++) {
    // ... process car i ...
}

// The unused numberCars can confuse the reader about whether the loop should run to 5 or 10, so delete it.

Overall, it is important to clean up your code. Just because your code works and the result is great doesn’t mean it can’t be made cleaner and easier to maintain. In big projects, problems you might think are small turn very big very fast, so it is important to address these issues as soon as you see them.

References:

“Clean Code — A practical approach” by Mikel Ors on Medium

“Code Smells” by Jeff Atwood on Coding Horror


Skill Prep

It is about time for another update on my capstone project. Last time I discussed our roadblock: being unable to get all the data we need. That roadblock still remains. We have some data, just not everything we need to start an analysis.

Luckily, this time we decided to be much more proactive and strengthen our data analysis skills. For our first test, we used Malloy to analyze real estate data. Malloy, as mentioned earlier, is a data analysis language similar to SQL but intended to be more readable and concise. Using Malloy, we each came up with our own analysis of the data based on the tables produced by our code. For example, Brayden found a positive correlation between the number of convenience stores in an area and the house price per unit in that same area.
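We ran the analysis in Malloy, but for the curious, the same sanity check is a near one-liner in pandas; the file and column names here are invented for the sketch:

import pandas as pd

df = pd.read_csv("real_estate.csv")  # hypothetical export of the dataset
print(df["n_convenience_stores"].corr(df["price_per_unit"]))  # Pearson correlation; a positive value matches Brayden's finding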

Our second test is currently in progress: sentiment analysis, which will be useful for analyzing rider satisfaction for LTD. We are using Spotify reviews, but we can apply the same logic to our transportation reviews. Something we picked up on is that this kind of analysis is most useful to someone who understands the domain; a sentiment analysis of Spotify reviews, for example, would be particularly useful to someone who works at Spotify. Working on this also got us comfortable with using AI APIs to conduct the sentiment analysis.
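I won’t share our exact setup here, but a minimal sketch of API-based sentiment classification, assuming an OpenAI-style client with an API key set in the environment, looks something like this:

from openai import OpenAI

client = OpenAI()  # reads the key from the OPENAI_API_KEY environment variable

def classify_review(text: str) -> str:
    # Ask the model to label one review; temperature 0 keeps the labels consistent.
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        temperature=0,
        messages=[
            {"role": "system", "content": "Label the sentiment of this app review as positive, negative, or neutral. Reply with one word."},
            {"role": "user", "content": text},
        ],
    )
    return response.choices[0].message.content.strip().lower()

print(classify_review("The new shuffle update ruined my playlists."))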

Overall, I would say progress is being made without progress actually being made. We are staying productive and doing what we can with what we have. When the data comes in, we will have a lot more to discuss and hopefully have some very interesting things to share.


A Murky Start

Recently, I started my capstone project with three other students: Brayden, Yuji, and Jacob. We have begun to dive into the topic “Leveraging AI for Improved Public Transit”, specifically looking at the Lane Transit District. With help from our sponsors, we’ve been able to start homing in on research for Lane.

We have had a bit of a rough start with making tangible progress, but the first part is just data collection. People often underestimate the amount of time needed to collect data from various sources. Right now, we know which tools we are using (i.e., BigQuery and Malloy), but we are still largely unfamiliar with the data itself.

Once we get the data from the Lane Transit District, we will use Malloy to clean it up. BigQuery will help us find actionable insights in the data so we can inform LTD of our findings. We are also going to collect data from organizations like CAHOOTS, an emergency response service whose records should provide good data on dangerous riding situations, along with any weather data we might need.

Overall, the structure of the capstone course we are in has not worked to our benefit. We haven’t been able to do much hands-on work yet. Even though we’ve done a lot of thorough research, we haven’t had actual data to work with. This week we are working with mock data to develop strategies for finding as many insights as we can in the real data we’re going to get.


Hello world!

Hi, my name is Zach Benedetti and I am a Computer Science student at Oregon State University. I currently reside here in Corvallis, but I am from Tigard, Oregon.

I have two younger brothers and many cousins, three of whom went to Oregon State, and many aunts and uncles who did as well, so the choice was easy. I was historically a Ducks fan because of my dad, but I grew to like the Beavers more and more as I got older. This created some friendly drama in the household.

I enjoy hiking, watching football, running, weightlifting, and going out. Honestly, anything outdoors has a huge appeal to me. Keeping up with the NFL is a huge hobby of mine; obsessed might be an understatement. I root for the Houston Texans.

As far as my Computer Science background, I have always had an interest in the field. My mom and I went to the Apple Store when I was young to fix our family laptop, and I pointed out that I wanted to be one of the help desk assistants one day. Since then, my goals have become a little more ambitious, but they’ve mostly stayed on the same track. I also considered the field of business, since I am very people-oriented, but I decided to minor in it instead.

I haven’t had a paid CS job yet, but I will have one come spring 2025 through the MECOP program. I’ve been able to build a lot of projects both in and outside of school, most of them along the lines of front-end website development or smaller back-end projects.

I’ve always liked creating things that are useful and make me more productive in my daily life, which is why most of my projects have been to-do lists or productivity software. It also translates well to my leadership position in SigEp, because I can program things that make everyone’s life easier, even if only in a minor way.

I’ll be documenting my capstone project here; my interests lie in projects related to AI and software engineering. I have a very systematic way of doing things: my projects, my schoolwork, and my daily tasks. I’ll discuss this more in later posts as the project continues.

This was just a little bit about me, so you can know who I am as the future blog posts come out regarding my project. You’ll eventually learn how I balance my time with leadership positions and outside involvement as well as things I enjoy doing in my free time on top of working on a large project. Stay tuned!