PyNoon Data Lesson 4 - Tutorial

This tutorial will cover defining and using your own functions, and reducing duplication in your code.

It is based on:

Setup

  1. Make a new notebook for this lesson
  2. What’s the first thing to do? RENAME IT!
  3. Name it pynoon_starter_3.ipynb

Defining Functions

def print_greeting():
    print('Hello World!')
    print('How are you?')

Defining the function does not execute the code inside it, but we can call the function just like any other function:

print_greeting()

# Redefine the function so that it takes a name parameter
def print_greeting(name):
    print(f'Hello {name}!')
    print('How are you?')

print_greeting('Cooper')

# Functions can also return a value to the caller
def shorten_description(description, max_length):
    if len(description) > max_length:
        short_description = description[:max_length] + '...'
        return short_description
    return description

short_description = shorten_description('This is a very long description', 10)
short_description

# Check that a description within the limit is returned unchanged
shorten_description('12345', 5) == '12345'
assert shorten_description('12345', 5) == '12345'

Now let’s test that the description is limited to the given max_length:

assert shorten_description('123456789', 5) == '12...'

Hmmm, why did that fail?

shorten_description('123456789', 5)

Aha! We need to take into account the length of the ellipsis:

def shorten_description(description, max_length):
    if len(description) > max_length:
        ellipsis = '...'
        return description[:(max_length - len(ellipsis))] + ellipsis
    return description
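
With the corrected definition in place, the earlier checks can be re-run. The sketch below repeats the function so that it runs on its own; the variable name shortened is only used here for illustration.

```python
def shorten_description(description, max_length):
    # Only shorten descriptions that exceed the limit
    if len(description) > max_length:
        ellipsis = '...'
        # Leave room for the ellipsis within max_length
        return description[:(max_length - len(ellipsis))] + ellipsis
    return description

# A description that already fits is returned unchanged
assert shorten_description('12345', 5) == '12345'

# A longer description is cut so that the result,
# including the ellipsis, fits within max_length
shortened = shorten_description('123456789', 5)
assert shortened == '12...'
assert len(shortened) <= 5
```

Neither assert raises an error this time, which is how a passing check looks in a notebook: no output at all.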

Short aside: F-Strings

name = 'Ben'
print(f'Hello {name}')
print(f'Hello {name.upper()}')

Applying functions to DataFrames

import pandas as pd
listings_df = pd.read_csv('https://pynoon.github.io/data/inside_airbnb_listings_nz_2023_09.csv')
listings_df

While Pandas provides many more functions for transforming DataFrames and Series, it is still often convenient to express a transformation as plain-old-Python code applied to a single value or row.

We can do this by writing our transformation as a regular Python function and then applying it to a Series or DataFrame.

To transform a listing ID into a URL, we can do the following:

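# Named listing_id rather than id to avoid shadowing Python's built-in id() function
listing_id = 'l11909616'
f'https://www.airbnb.co.nz/rooms/{listing_id[1:]}'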

Let’s define a function to transform a listing ID into a URL:

def id_to_url(listing_id):
    # Drop the leading 'l' from the ID before building the URL
    return f'https://www.airbnb.co.nz/rooms/{listing_id[1:]}'

id_to_url('l11909616')

Calling .apply(id_to_url) on a single column Series passes each item in the Series to the function and returns a new Series where each value is the corresponding value returned by the function. We can then assign the resulting Series into a new url column:

listings_df['url'] = listings_df['id'].apply(id_to_url)
listings_df
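
To try this without downloading the full dataset, here is a minimal sketch using a tiny hand-made DataFrame. The name sample_df and the three IDs in it are made up for illustration and are not real listings.

```python
import pandas as pd

# Hypothetical sample data standing in for the real listings
sample_df = pd.DataFrame({'id': ['l101', 'l202', 'l303']})

def id_to_url(listing_id):
    # Drop the leading 'l' from the ID before building the URL
    return f'https://www.airbnb.co.nz/rooms/{listing_id[1:]}'

# The function is called once for each value in the 'id' column
sample_df['url'] = sample_df['id'].apply(id_to_url)
print(sample_df['url'].tolist())
```

Note that we pass the function itself (id_to_url, with no parentheses) to .apply(), and Pandas takes care of calling it for each value.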

We can also use .apply() with axis='columns' on an entire DataFrame to pass an entire row at a time to the function.

The output will still be a single Series of the returned values.

def listing_to_description(row):
    room_type = row['room_type']
    host_name = row['host_name']
    return f'{room_type} by {host_name}'

listings_df['description'] = listings_df.apply(listing_to_description, axis='columns')
listings_df

The row passed into the function will be a Series representing a single row in the DataFrame.

We can access the row’s value for each column in the same way we access columns in a DataFrame.
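
As a small self-contained sketch of row-wise .apply(), the example below uses a hypothetical two-row sample_df (the room types and host names are made up) and checks inside the function that each row really does arrive as a Series.

```python
import pandas as pd

# Hypothetical sample data standing in for the real listings
sample_df = pd.DataFrame({
    'room_type': ['Entire home/apt', 'Private room'],
    'host_name': ['Aroha', 'Sam'],
})

def listing_to_description(row):
    # Each row is passed in as a Series indexed by column name
    assert isinstance(row, pd.Series)
    room_type = row['room_type']
    host_name = row['host_name']
    return f'{room_type} by {host_name}'

# axis='columns' means the function receives one row at a time
descriptions = sample_df.apply(listing_to_description, axis='columns')
print(descriptions.tolist())
```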

One important point to know about .apply() is that Pandas' built-in operations, which work on a whole column at once, will often be much faster than running plain-old-Python on each row.

However, this often won’t make much of a difference until you’re dealing with hundreds of thousands or millions of rows. And remember, when exploring the data it’s most important for you to be able to quickly translate your ideas into working code!
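
For comparison, here is one possible way to write both transformations with Pandas' built-in string operations instead of .apply(). This is a sketch on hypothetical sample data (sample_df and its values are made up), not the only way to do it.

```python
import pandas as pd

# Hypothetical sample data standing in for the real listings
sample_df = pd.DataFrame({
    'id': ['l101', 'l202'],
    'room_type': ['Entire home/apt', 'Private room'],
    'host_name': ['Aroha', 'Sam'],
})

def id_to_url(listing_id):
    return f'https://www.airbnb.co.nz/rooms/{listing_id[1:]}'

# Plain-old-Python version: the function runs once per value
url_via_apply = sample_df['id'].apply(id_to_url)

# Built-in version: .str[1:] slices every string in the column at once,
# and + joins the prefix onto every value
url_built_in = 'https://www.airbnb.co.nz/rooms/' + sample_df['id'].str[1:]

# Whole columns of strings can also be joined with +
description_built_in = sample_df['room_type'] + ' by ' + sample_df['host_name']

# Both approaches give the same URLs
print(url_via_apply.tolist() == url_built_in.tolist())
print(description_built_in.tolist())
```

On a handful of rows the two approaches are indistinguishable, so pick whichever you find easier to read and write.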

Organising code with functions

akl_listings_df = listings_df[listings_df['region_parent_name'] == 'Auckland']
akl_average_price = akl_listings_df['price_nzd'].median()
akl_above_average_price_df = akl_listings_df[akl_listings_df['price_nzd'] > akl_average_price]
display(akl_above_average_price_df)

wlg_listings_df = listings_df[listings_df['region_parent_name'] == 'Wellington City']
wlg_average_rating = wlg_listings_df['review_scores_rating'].median()
wlg_above_average_rating_df = wlg_listings_df[wlg_listings_df['review_scores_rating'] > wlg_average_rating]
display(wlg_above_average_rating_df)

def get_above_average_listings_df(listings_df, comparison_column):
    """Returns the subset of the given listings_df that is above average
    according to the given comparison_column."""
    # The median is used as the "average" here
    average_value = listings_df[comparison_column].median()
    return listings_df[listings_df[comparison_column] > average_value]

akl_above_average_price_df = get_above_average_listings_df(
    listings_df=listings_df[listings_df['region_parent_name'] == 'Auckland'],
    comparison_column='price_nzd',
)

wlg_above_average_rating_df = get_above_average_listings_df(
    listings_df=listings_df[listings_df['region_parent_name'] == 'Wellington City'],
    comparison_column='review_scores_rating',
)
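
A quick way to check a function like this is to run it on a tiny DataFrame where the answer is easy to work out by hand. The sketch below repeats the function so that it runs on its own; sample_df and its prices are made up for illustration.

```python
import pandas as pd

def get_above_average_listings_df(listings_df, comparison_column):
    """Returns the subset of the given listings_df that is above average
    according to the given comparison_column."""
    average_value = listings_df[comparison_column].median()
    return listings_df[listings_df[comparison_column] > average_value]

# Hypothetical sample data: three Auckland listings and one Wellington listing
sample_df = pd.DataFrame({
    'region_parent_name': ['Auckland', 'Auckland', 'Auckland', 'Wellington City'],
    'price_nzd': [100, 200, 300, 150],
})

akl_sample_df = sample_df[sample_df['region_parent_name'] == 'Auckland']
above_df = get_above_average_listings_df(akl_sample_df, 'price_nzd')

# The median Auckland price is 200, so only the 300 listing is above it
assert above_df['price_nzd'].tolist() == [300]
```

Because the filtering logic now lives in one place, a fix or change to it only has to be made once, rather than in every copy-pasted block.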