This tutorial will cover grouping and summarising data with Pandas and Plotly, and handling missing data with Pandas.
It is based on:
pynoon_data_3.ipynb
As we did last lesson, let’s load the Pandas and Plotly libraries, and load our AirBnB listings data into Pandas:
import pandas as pd
import plotly.express as px
listings_df = pd.read_csv('https://pynoon.github.io/data/inside_airbnb_listings_nz_2023_09.csv')
listings_df
We can see how big the DataFrame is using .shape
:
listings_df.shape
We can also see what columns we have available to work with:
listings_df.columns
When analysing data, we often want to compare different groups.
One simple way is to colour data points by a categorical column when plotting. For example, colouring listings by their room type:
px.scatter_geo(listings_df, lon='longitude', lat='latitude', color='room_type')
We can also specify a categorical column as a facet_row
or facet_col
to generate a separate graph for data points
having each “facet” value.
In the plot below, note that we:
accommodates
column, so that the
facet plots are sorted by value.histnorm='percent'
so that groups with few data
points still have a “full size” plot.px.histogram(
listings_df.sort_values(by='accommodates'),
x='price_nzd',
facet_row='accommodates',
height=2000,
histnorm='percent',
)
It can be useful to compute our own categorical columns, such as to compare one category against all others:
listings_df['is_queenstown'] = listings_df['region_parent_name'] == 'Queenstown-Lakes District'
px.histogram(
listings_df,
x='price_nzd',
facet_row='is_queenstown',
histnorm='percent',
)
We can also use Pandas’ groupby()
to split up a
DataFrame according to a categorical attribute:
for accommodates, accommodates_group_df in listings_df.groupby('accommodates'):
print('Accommodates:', accommodates)
display(accommodates_group_df)
groupby()
can also be used to produce a DataFrame of
aggregate statistics for each group. In the following, see how we:
['accommodates', 'room_type']
)['price_nzd', 'review_scores_rating']
)mean()
sum()
, std()
, and
more.listings_df.groupby(['accommodates', 'room_type'])[['price_nzd', 'review_scores_rating']].mean()
Note that groupby()
turns the group columns into an
index. We can turn it back into a regular column with
reset_index()
:
listing_groups_df = listings_df.groupby(['accommodates', 'room_type'])[['price_nzd', 'review_scores_rating']].mean().reset_index()
listing_groups_df
Such DataFrames produced with groupby()
can be very
useful for producing plots of statistics, like bar charts:
px.bar(
listing_groups_df,
x='accommodates',
y='price_nzd',
color='room_type',
barmode='group',
title='Mean Price',
)
We can also export our new DataFrame to a CSV file:
listing_groups_df.to_csv('listing_groups.csv')
We can also use .count()
to count the number of rows in
each group, but notice that the count for the rating column is less than
the price column:
listings_df.groupby('accommodates')[['price_nzd', 'review_scores_rating']].count().reset_index()
This is because there are missing
(aka
null
, aka NA
) values in the rating column that
are ignored by count()
.
We can use isna()
to get a mask of rows where a value is
missing:
listings_df['review_scores_rating'].isna()
listings_df[listings_df['review_scores_rating'].isna()]
We can then invert the mask with ~
and filter out rows
with missing values:
listings_df[~listings_df['review_scores_rating'].isna()]
Alternatively, we can use fillna()
to replace missing
values with a fixed value. For example, we make want to consider
un-reviewed listings with a mid-point rating:
listings_df['review_scores_rating_filled'] = listings_df['review_scores_rating'].fillna(2.5)
listings_df
Finally, dropna()
on a DataFrame can be used to simply
remove any rows where any column contains a missing value:
listings_df.dropna()