This tutorial will cover Boolean comparisons and more Pandas for analysing and manipulating tabular data.
It is based on:
DEMO-ONLY TUTORIAL BEGINS HERE
Comparison operators can be used to compare two values:
1 < 2
In the code above, we are effectively asking “is 1 less than 2?”.
Let’s try the “greater than or equal to” operator next:
1 >= 2
The value returned is either True
or False
,
a Boolean value:
type(1 < 2)
To check if two values are equal, use two equal signs (as a single equal sign is used by Python for assignment):
'pynoon' == 'pynoon'
'pynoon' == 'PYNOON'
We can directly state a Boolean value by name - but remember to start with a capital letter!
True
Boolean values can be inverted with not
:
not True
Multiple Boolean values can be combined with and
and
or
, or inverted with not
:
True and False
True or False
As usual, parentheses can be used to group operations:
True and (True or False)
Comparisons are commonly used to conditionally run different lines of
code using if
statements (as seen in more detail in the
Futurecoder lessons):
bag_kg = 3
if bag_kg < 7:
print('Bag allowed as carry-on')
print('Please proceed to board the plane')
else:
print('Please check your bag')
Note: The
else:
clause is optional
FOLLOW-ALONG TUTORIAL BEGINS HERE
pynoon_data_2.ipynb
As we did last lesson, let’s load our AirBnB listings data into Pandas:
import pandas as pd
listings_df = pd.read_csv('https://pynoon.github.io/data/inside_airbnb_listings_nz_2023_09.csv')
We can load data from local files or, in this case, directly from a URL.
listings_df
Recall that we can refer to a single column from a DataFrame,
returning a value with a type of Pandas’ Series
:
listings_df['region_parent_name']
type(listings_df['region_parent_name'])
Using a comparison operator on a Series performs the comparison to each value in the Series, and returns a new Series full of Boolean values:
listings_df['region_parent_name'] == 'Auckland'
Remember: the type of the value determines what an operation will do with it.
We can use the Boolean Series, commonly called a
mask, to return a new DataFrame that is
filtered to contain only the rows where the mask is
True
:
auckland_mask = listings_df['region_parent_name'] == 'Auckland'
listings_df[auckland_mask]
Let’s use filtering to plot the ratings of highly reviewed listings.
First, import Plotly and plot the ratings:
%pip install plotly nbformat
import plotly.express as px
px.histogram(listings_df, x='review_scores_rating')
Now, filter the DataFrame to only contain listings with more than 100 reviews:
px.histogram(listings_df[listings_df['number_of_reviews'] > 100], x='review_scores_rating')
Just as we could use not
, and
, and
or
to combine Boolean values, we can also combine Boolean
Series.
Let’s remind ourselves of the value of
auckland_mask
:
auckland_mask
Just like how we can use not
to invert a Boolean value,
we can use ~
to invert a mask:
~auckland_mask
We can also use |
to perform an or
operation between two masks, and &
to perform an
and
operation:
good_mask = listings_df['review_scores_rating'] > 4.9
good_mask
auckland_mask | good_mask
auckland_mask & good_mask
One final note: Conditions on NaN values always return False.
Performing maths on a Series applies the operation to each value in the Series, returning a new Series:
nzd_to_aud = 0.93
listings_df['price_nzd'] * nzd_to_aud
Performing maths with two Series applies the operation element-wise to each pair of values from the two Series:
listings_df['price_nzd'] / listings_df['accommodates']
We can add a new column to an existing DataFrame by assigning to a new column name that doesn’t exist in the DataFrame yet:
listings_df['price_per_person'] = listings_df['price_nzd'] / listings_df['accommodates']
listings_df
listings_df['price_per_person'].describe()
Find out what other functions (methods) and variables (attributes) are attached to DataFrames and Series from their reference documentation:
Similarly, look at the user guide and reference documentation for Plotly express to see what other plot types are available and what arguments they will accept to configure them:
Every column has a type of value stored in it:
listings_df.info()
object
(we
only have strings here)We can convert date columns from strings to dates:
pd.to_datetime(listings_df['host_since'])
We must assign the transformed columns to replace the original columns:
listings_df['host_since'] = pd.to_datetime(listings_df['host_since'])
listings_df['last_review'] = pd.to_datetime(listings_df['last_review'])
listings_df.info()
listings_df
We can also remove the dollar sign from each price:
listings_df['price'].str.replace('$', '', regex=False)
We can “chain” additional method calls on the results of previous method calls to replace commas and convert the column data type from string to float:
listings_df['price'].str.replace('$', '', regex=False).str.replace(',', '', regex=False).astype(float)
listings_df['price'] = listings_df['price'].str.replace('$', '', regex=False).str.replace(',', '', regex=False).astype(float)
listings_df.info()
listings_df