This tutorial will cover grouping and summarising data with Pandas and Plotly, and handling missing data with Pandas.
It is based on:
pynoon_data_3.ipynb
As we did last lesson, let’s load the Pandas and Plotly libraries, and load our AirBnB listings data into Pandas:
import pandas as pd
import plotly.express as px
listings_df = pd.read_csv('https://pynoon.github.io/data/inside_airbnb_listings_nz_2023_09.csv')
listings_df
When analysing data, we often want to compare different groups or subsets.
Plotly makes a lot of group comparisons easy if we have a tidy DataFrame - with a row for each data point and a column for each attribute - including categorical attributes we want to group by.
For example, we can specify a categorical column to colour data points by:
px.scatter_geo(listings_df, lon='longitude', lat='latitude', color='room_type')
Note: You can select/deselect colours in the legend.
We can also specify a numerical column to determine the colour:
px.scatter_geo(listings_df, lon='longitude', lat='latitude', color='price_nzd')
We can go further and create a separate subplot for the data points
belonging to each group by specifying a categorical column as a
facet_row
or facet_col
.
In the plot below, note that we:
accommodates
column, so that the
facet plots are sorted by value.histnorm='percent'
so that groups with few data
points still have a “full size” plot.px.histogram(
listings_df.sort_values(by='accommodates'),
x='price_nzd',
facet_row='accommodates',
height=2000,
histnorm='percent',
)
Sometimes we’d like to treat a numeric column like a categorical column in our visualisation.
We can do this by using pd.qcut()
to create a
categorical column by dividing a numeric column into bins.
pd.qcut()
makes each bin have the same
number of data points, while pd.cut()
makes each
bin cover an equal-length interval of the numeric
column.For example, we can plot the distribution of review counts for different tiers of pricing:
listings_df['price_bin'] = pd.qcut(listings_df['price_nzd'], q=10).astype(str)
px.box(
# Ensure lower price bins are shown first.
listings_df.sort_values(by='price_nzd'),
x='price_bin',
y='number_of_reviews',
)
We can also use Pandas’ groupby()
to split up a
DataFrame according to a categorical attribute.
If you’re familiar with SQL, you’ll see similarities to the
GROUP BY
clause.
.shape
returns the number of rows and number of columns
as a pair (tuple)for room_type, room_type_group_df in listings_df.groupby('room_type'):
print('Room type', room_type)
print(room_type_group_df.shape[0])
groupby()
can also be used to produce a DataFrame of
aggregate statistics for each group. In the following, see how we:
['accommodates', 'room_type']
)['price_nzd', 'review_scores_rating']
)mean()
sum()
, std()
, and
more.listings_df.groupby(['accommodates', 'room_type'])[['price_nzd', 'review_scores_rating']].mean()
Note that groupby()
turns the group columns into an
index. We can turn it back into a regular column with
reset_index()
:
stats_df = listings_df.groupby(['accommodates', 'room_type'])[['price_nzd', 'review_scores_rating']].mean().reset_index()
stats_df
.groupby()
is a very powerful tool for reshaping data
into the right tidy format that will support the plot you
want.
Such DataFrames produced with groupby()
can be very
useful for producing plots of statistics, like bar charts:
px.bar(
stats_df.sort_values(by='accommodates'),
x='accommodates',
y='price_nzd',
color='room_type',
barmode='group',
title='Mean Price',
)
We can also export our new DataFrame to a CSV file:
listing_groups_df.to_csv('listing_groups.csv')
We can use isna()
to get a mask of rows where a value is
missing:
listings_df[listings_df['review_scores_rating'].isna()]
We can invert the mask with ~
and filter out rows with
missing values:
listings_df[~listings_df['review_scores_rating'].isna()]
Alternatively, we can use fillna()
to replace missing
values with a fixed value. For example, we make want to consider
un-reviewed listings with a mid-point rating:
listings_df['review_scores_rating_filled'] = listings_df['review_scores_rating'].fillna(2.5)
listings_df
Finally, dropna()
on a DataFrame can be used to simply
remove any rows where any column contains a missing value:
listings_df.dropna()