This tutorial will cover Boolean comparisons and more Pandas for analysing and manipulating tabular data.
It is based on:
DEMO-ONLY TUTORIAL BEGINS HERE
Comparison operators can be used to compare two values:
1 < 2
In the code above, we are effectively asking “is 1 less than 2?”.
The value returned is either True
or False
,
a Boolean value:
type(1 < 2)
To check if two values are equal, use two equal signs (as a single equal sign is used by Python for assignment):
'pynoon' == 'PYNOON'
We can directly state a Boolean value by name - but remember to start with a capital letter!
True
Boolean values can be inverted with not
:
not True
Multiple Boolean values can be combined with and
and
or
, or inverted with not
:
True and False
True or False
As usual, parentheses can be used to group operations:
True and (True or False)
Comparisons are commonly used to conditionally run different lines of
code using if
statements (as seen in more detail in the
Futurecoder lessons):
bag_kg = 3
if bag_kg <= 7:
print('Bag allowed as carry-on')
print('Please proceed to board the plane')
else:
print('Please check your bag')
<=
is “less than or equal to”
Note: The
else:
clause is optional
FOLLOW-ALONG TUTORIAL BEGINS HERE
pynoon_data_2.ipynb
As we did last lesson, let’s load our AirBnB listings data into Pandas:
import pandas as pd
listings_df = pd.read_csv('https://pynoon.github.io/data/inside_airbnb_listings_nz_2023_09.csv')
We can load data from local files or, in this case, directly from a URL.
listings_df
We can see how big the DataFrame is using .shape
:
listings_df.shape
We can also see what columns we have available to work with:
listings_df.columns
Every column has a type of value stored in it:
listings_df.info()
object
(we
only have strings here)last_review
column should be a datetimeWe can convert date columns from strings to dates:
pd.to_datetime(listings_df['last_review'])
We must assign the transformed column to replace the original column:
listings_df['last_review'] = pd.to_datetime(listings_df['last_review'])
listings_df.info()
Applying standard Python maths operators on a Series performs the operation on each value in the Series and returns a new Series:
listings_df['price_nzd'] * 0.5
Performing maths with two Series applies the operation element-wise to each pair of values from the two Series:
listings_df['price_nzd'] / listings_df['accommodates']
Just as we can updated columns, we can add a new column to an existing DataFrame by assigning to a new column name that doesn’t exist in the DataFrame yet:
listings_df['price_per_person'] = listings_df['price_nzd'] / listings_df['accommodates']
listings_df
Using a comparison operator on a Series performs the comparison to each value in the Series, and returns a new Series full of Boolean values:
listings_df['region_parent_name'] == 'Auckland'
Remember: the type of the value determines what an operation will do with it.
We can use the Boolean Series, commonly called a
mask, to return a new DataFrame that is
filtered to contain only the rows where the mask is
True
:
auckland_mask = listings_df['region_parent_name'] == 'Auckland'
listings_df[auckland_mask]
Let’s use filtering to plot the ratings of highly reviewed listings.
First, import Plotly and plot the ratings:
%pip install plotly nbformat
import plotly.express as px
px.histogram(listings_df, x='review_scores_rating')
Now, filter the DataFrame to only contain listings with more than 100 reviews:
px.histogram(listings_df[listings_df['number_of_reviews'] > 100], x='review_scores_rating')
Just as we could use not
, and
, and
or
to combine Boolean values, we can also combine Boolean
Series.
Let’s remind ourselves of the value of
auckland_mask
:
auckland_mask
Just like how we can use not
to invert a Boolean value,
we can use ~
to invert a mask:
~auckland_mask
We can also use |
to perform an or
operation between two masks, and &
to perform an
and
operation:
good_mask = listings_df['review_scores_rating'] > 4.9
good_mask
auckland_mask | good_mask
auckland_mask & good_mask
One final note: Conditions on NaN values always return False.
The tools for transforming and filtering data we’ve used here are the kind of code you’ll spend much of your time writing when analysing data with Pandas.
Find out what other functions (methods) and variables (attributes) are attached to DataFrames and Series from their reference documentation:
Similarly, look at the user guide and reference documentation for Plotly express to see what other plot types are available and what arguments they will accept to configure them:
We can also remove the dollar sign from each price:
listings_df['price'].str.replace('$', '', regex=False)
We can “chain” additional method calls on the results of previous method calls to replace commas and convert the column data type from string to float:
listings_df['price'].str.replace('$', '', regex=False).str.replace(',', '', regex=False).astype(float)
listings_df['price'] = listings_df['price'].str.replace('$', '', regex=False).str.replace(',', '', regex=False).astype(float)
listings_df.info()
listings_df