This blog post will be about a few more tricks for manipulating data that I’ve found to be useful. Creating columns, utilizing the group by function and sorting values are all good ways to analyze data.
Creating a column helps you to experiment with data without altering any existing columns.
marathon['increased'] = np.where(marathon['run_2'] > marathon['run_1'], True, False)
Say we have a dataframe called marathon with existing columns run_1 and run_2. To mark every row where run_2 is greater than run_1, we can create a boolean column. (If the columns hold finishing times, a larger run_2 means a slower second run.) We can then filter on marathon['increased'] being True to get all rows where the value increased from run_1 to run_2.
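Here is a minimal sketch of the flag-and-filter step. The sample rows and the runner column are made up for illustration. It also assumes a plain comparison is acceptable in place of np.where, since the comparison already returns True/False values.

```python
import pandas as pd

# Hypothetical sample data; the values are invented for this example.
marathon = pd.DataFrame({
    "runner": ["a", "b", "c", "d"],
    "run_1": [10.0, 12.5, 9.0, 11.0],
    "run_2": [11.0, 12.0, 9.5, 11.0],
})

# The comparison yields a boolean Series, one value per row.
marathon["increased"] = marathon["run_2"] > marathon["run_1"]

# Filter on the new column to keep only the rows flagged True.
increased_rows = marathon[marathon["increased"]]
print(increased_rows)
```

The original run_1 and run_2 columns are untouched, so the flag can be dropped or recomputed at any time.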
Combining groupby with a new column is also really useful. We can create a new column based on grouped results to analyze the data further.
marathon['avg_run_1_country'] = marathon.groupby(['country'])['run_1'].transform('mean')
In the example above, the marathon dataframe has the same run_1 column plus a country column. We group the run_1 results by country and use transform('mean') to write each country's mean back onto every row belonging to that country. The dataframe keeps its original number of rows, and each row now carries the average run_1 for its country.
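A small sketch of what transform does, using made-up countries and times. For comparison it also shows a plain grouped mean, which returns one value per country instead of one per row.

```python
import pandas as pd

# Hypothetical sample data for illustration only.
marathon = pd.DataFrame({
    "country": ["KE", "KE", "US", "US", "JP"],
    "run_1": [10.0, 12.0, 14.0, 16.0, 13.0],
})

# transform returns a Series aligned to the original rows,
# so it can be assigned straight back as a new column.
marathon["avg_run_1_country"] = (
    marathon.groupby("country")["run_1"].transform("mean")
)

# A plain grouped mean collapses to one value per country.
country_means = marathon.groupby("country")["run_1"].mean()

print(marathon)
print(country_means)
```

Use transform when you want the group statistic next to each row, and the plain mean when you only need a per-country summary.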
Using the sort function we can then sort the data in a way that is presentable.
marathon_top_10 = marathon.sort_values('avg_run_1_country',ascending=False).head(10)
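In the above example we create a new dataframe by sorting the marathon dataframe from earlier by avg_run_1_country in descending order. The head(10) function lets us keep the first 10 rows. Note that because avg_run_1_country is repeated on every row of a country, these 10 rows can all come from the same one or two countries. To get 10 distinct countries, drop the duplicate countries before sorting.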
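The sketch below contrasts the top rows with the top distinct countries. The data is invented, and head(2) stands in for head(10) to keep the example small. Using drop_duplicates is one possible approach, not the only one.

```python
import pandas as pd

# Hypothetical sample data for illustration only.
marathon = pd.DataFrame({
    "country": ["KE", "KE", "US", "US", "JP"],
    "run_1": [10.0, 12.0, 14.0, 16.0, 13.0],
})
marathon["avg_run_1_country"] = (
    marathon.groupby("country")["run_1"].transform("mean")
)

# Top rows: both rows belong to the country with the highest average.
top_rows = marathon.sort_values("avg_run_1_country", ascending=False).head(2)

# Top countries: keep one row per country first, then sort.
top_countries = (
    marathon.drop_duplicates("country")
    .sort_values("avg_run_1_country", ascending=False)
    .head(2)
)

print(top_rows["country"].tolist())
print(top_countries["country"].tolist())
```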
We can do a quick adjustment on the data to make it only show what we want.
marathon_top_10 = marathon_top_10[['country', 'avg_run_1_country']]
This would trim down the dataframe to only show the country and avg_run_1_country columns.
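As a quick illustration of the double brackets, here is a sketch with a small made-up dataframe. Passing a list of column names returns a dataframe, while a single name returns a Series.

```python
import pandas as pd

# Hypothetical stand-in for the sorted dataframe from the previous step.
marathon_top_10 = pd.DataFrame({
    "country": ["US", "JP"],
    "run_1": [14.0, 13.0],
    "avg_run_1_country": [15.0, 13.0],
})

# A list of names inside the brackets selects several columns as a DataFrame.
trimmed = marathon_top_10[["country", "avg_run_1_country"]]

# A single name selects one column as a Series.
single = marathon_top_10["country"]

print(trimmed)
```

Assigning the selection to a new name, as done here with trimmed, leaves the wider dataframe available for later steps.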
This process flow can help with getting started with data analysis using pandas. You can start by creating some columns that are useful for the data. You can then group, filter, and manipulate the columns to enhance your analysis. The existing data stays the same so you can always add on additional columns for exploring the data.