PyNoon Data Lesson 4 - Tutorial

This tutorial will cover defining and using your own functions, and reducing duplication in your code.

It is based on:

Setup

  1. Make a new notebook for this lesson
  2. What’s the first thing to do? RENAME IT!
  3. Name it pynoon_starter_3.ipynb

Defining Functions

def print_greeting():
    print('Hello World!')
    print('How are you?')

Defining the function does not execute the code inside it, but we can call the function just like any other function:

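print_greeting()

Functions can take parameters, which let us pass values in when we call them:

def print_greeting(name):
    print(f'Hello {name}!')
    print('How are you?')

print_greeting('Cooper')

Functions can also hand a result back to the caller with return:

def shorten_description(description, max_length):
    if len(description) > max_length:
        short_description = description[:max_length] + '...'
        return short_description
    return description

short_description = shorten_description('This is a very long description', 10)
short_description

Let’s check that a description that already fits within max_length is returned unchanged:

shorten_description('12345', 5) == '12345'

Wrapping the check in assert turns it into a simple test:

assert shorten_description('12345', 5) == '12345'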
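To see what assert does when its condition is false, without stopping the notebook, here is a minimal sketch. The double function is a made-up helper for illustration and is not part of the lesson:

```python
def double(number):
    return number * 2

# A true condition: assert does nothing and execution continues.
assert double(2) == 4

# A false condition raises AssertionError. The optional message after the
# comma is attached to the error, which helps when diagnosing a failure.
try:
    assert double(2) == 5, f'expected 5 but got {double(2)}'
except AssertionError as error:
    failure_message = str(error)

print(failure_message)  # expected 5 but got 4
```

In a notebook you would normally leave out the try/except and let a failing assert stop the cell, so the failure is hard to miss.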

Now let’s test that the description is limited to the given max_length:

assert shorten_description('123456789', 5) == '12...'

Hmmm, why did that fail?

shorten_description('123456789', 5)

Aha! We need to take into account the length of the ellipsis:

def shorten_description(description, max_length):
    if len(description) > max_length:
        ellipsis = '...'
        return description[:(max_length - len(ellipsis))] + ellipsis
    return description
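As a quick check of the corrected version, the sketch below repeats the function and re-runs the earlier tests. The last call is an extra edge case that the lesson does not cover: when max_length is smaller than the ellipsis, the slice index becomes negative and the result is longer than requested.

```python
def shorten_description(description, max_length):
    if len(description) > max_length:
        ellipsis = '...'
        # Leave room for the ellipsis so the result is exactly max_length long
        return description[:(max_length - len(ellipsis))] + ellipsis
    return description

# Short descriptions are returned unchanged
assert shorten_description('12345', 5) == '12345'

# Long descriptions are cut so that the ellipsis fits within max_length
assert shorten_description('123456789', 5) == '12...'
assert len(shorten_description('This is a very long description', 10)) == 10

# Edge case: max_length (2) is shorter than the ellipsis (3), so the slice
# becomes description[:-1], which only drops the last character
too_long = shorten_description('123456789', 2)
print(too_long)  # 12345678...
```

If that case matters for your data, one option is to check max_length at the top of the function before slicing.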

Short aside: F-Strings

name = 'Ben'
print(f'Hello {name}')
print(f'Hello {name.upper()}')
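F-strings accept any expression inside the braces, plus an optional format specifier after a colon. A small sketch, where price and rating are made-up values rather than figures from the dataset:

```python
name = 'Ben'
price = 129.5     # made-up value for illustration
rating = 4.8567   # made-up value for illustration

# Any expression can go inside the braces, including method and function calls
greeting = f'Hello {name.upper()}, your name has {len(name)} letters'

# A format specifier after the colon controls how the value is displayed
price_text = f'Price: ${price:.2f}'    # two decimal places
rating_text = f'Rating: {rating:.1f}'  # one decimal place

print(greeting)     # Hello BEN, your name has 3 letters
print(price_text)   # Price: $129.50
print(rating_text)  # Rating: 4.9
```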

Applying functions to DataFrames

import pandas as pd
listings_df = pd.read_csv('https://pynoon.github.io/data/inside_airbnb_listings_nz_2023_09.csv')
listings_df

To transform a listing ID into a URL, we can do the following:

listing_id = 'l11909616'  # named listing_id to avoid shadowing Python's built-in id()
f'https://www.airbnb.co.nz/rooms/{listing_id[1:]}'

Let’s define a function to transform a listing ID into a URL:

def id_to_url(listing_id):
    # Drop the leading 'l' and append the rest of the ID to the rooms URL
    return f'https://www.airbnb.co.nz/rooms/{listing_id[1:]}'

id_to_url('l11909616')
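The [1:] slice works by dropping the first character of the ID, whatever that character is. The sketch below shows the slice on its own, along with a hypothetical variant (id_to_url_safe, not part of the lesson) that uses str.removeprefix, available from Python 3.9, to strip the 'l' only when it is present:

```python
listing_id = 'l11909616'

# Slicing from index 1 drops the first character, whatever it is
numeric_part = listing_id[1:]
print(numeric_part)  # 11909616

def id_to_url_safe(listing_id):
    # Hypothetical variant: removeprefix leaves IDs without an 'l' untouched
    return f"https://www.airbnb.co.nz/rooms/{listing_id.removeprefix('l')}"

print(id_to_url_safe('l11909616'))  # https://www.airbnb.co.nz/rooms/11909616
print(id_to_url_safe('11909616'))   # https://www.airbnb.co.nz/rooms/11909616
```

For this lesson's dataset the simple slice is enough, assuming every ID starts with 'l' as in the example above.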

Calling .apply(id_to_url) on a single column Series passes each item in the Series to the function and returns a new Series where each value is the corresponding value returned by the function. We can then assign the resulting Series into a new url column:

listings_df['url'] = listings_df['id'].apply(id_to_url)
listings_df
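To make the behaviour of .apply() on a Series concrete, here is a minimal sketch on a tiny hand-made DataFrame. The toy_df name and its second and third IDs are made up for illustration. It compares the result with an explicit loop over the same values:

```python
import pandas as pd

def id_to_url(listing_id):
    return f'https://www.airbnb.co.nz/rooms/{listing_id[1:]}'

# Tiny made-up table standing in for listings_df
toy_df = pd.DataFrame({'id': ['l11909616', 'l123', 'l456']})

# .apply calls id_to_url once per value and collects the results in a new Series
toy_df['url'] = toy_df['id'].apply(id_to_url)

# An explicit loop over the column produces the same values
urls_from_loop = [id_to_url(listing_id) for listing_id in toy_df['id']]

print(toy_df)
print(toy_df['url'].tolist() == urls_from_loop)  # True
```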

We can also use .apply() with axis='columns' on an entire DataFrame to pass one row at a time to the function:

def listing_to_description(row):
    room_type = row['room_type']
    host_name = row['host_name']
    return f'{room_type} by {host_name}'

listings_df['description'] = listings_df.apply(listing_to_description, axis='columns')
listings_df
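The row-wise form can be tried on a small hand-made table first. In this sketch toy_df and its values are made up for illustration. Each row arrives in the function as a Series indexed by column name, which is why row['room_type'] works. The last line shows that the same result can be built from whole columns without .apply():

```python
import pandas as pd

def listing_to_description(row):
    # row is a Series whose index labels are the DataFrame's column names
    return f"{row['room_type']} by {row['host_name']}"

# Tiny made-up table standing in for listings_df
toy_df = pd.DataFrame({
    'room_type': ['Entire home/apt', 'Private room'],
    'host_name': ['Aroha', 'Sam'],
})

# axis='columns' passes one row at a time to the function
toy_df['description'] = toy_df.apply(listing_to_description, axis='columns')

# The same strings built from whole columns, with no per-row function call
toy_df['description_from_columns'] = toy_df['room_type'] + ' by ' + toy_df['host_name']

print(toy_df['description'].tolist())
```

Row-wise .apply() is convenient when the logic needs several columns and is awkward to express on whole columns, though it tends to be slower on large tables.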

Organising code with functions

akl_listings_df = listings_df[listings_df['region_parent_name'] == 'Auckland']
akl_average_price = akl_listings_df['price_nzd'].median()
akl_above_average_price_df = akl_listings_df[akl_listings_df['price_nzd'] > akl_average_price]
display(akl_above_average_price_df)

wlg_listings_df = listings_df[listings_df['region_parent_name'] == 'Wellington City']
wlg_average_rating = wlg_listings_df['review_scores_rating'].median()
wlg_above_average_rating_df = wlg_listings_df[wlg_listings_df['review_scores_rating'] > wlg_average_rating]
display(wlg_above_average_rating_df)

Both blocks follow the same steps: filter to a region, take the median of a column, and keep the rows above it. We can move the shared steps into a function and call it once per region:

def get_above_average_listings_df(listings_df, comparison_column):
    """Returns the subset of the given listings_df that is above average
    (the median) according to the given comparison_column."""
    average_value = listings_df[comparison_column].median()
    return listings_df[listings_df[comparison_column] > average_value]

akl_above_average_price_df = get_above_average_listings_df(
    listings_df=listings_df[listings_df['region_parent_name'] == 'Auckland'],
    comparison_column='price_nzd',
)
wlg_above_average_rating_df = get_above_average_listings_df(
    listings_df=listings_df[listings_df['region_parent_name'] == 'Wellington City'],
    comparison_column='review_scores_rating',
)
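A function like this is easy to check on a small table where the answer can be worked out by hand. The sketch below uses made-up prices (toy_df is not part of the lesson) and confirms that only rows strictly above the regional median are kept:

```python
import pandas as pd

def get_above_average_listings_df(listings_df, comparison_column):
    """Returns the rows whose comparison_column value is above the median."""
    average_value = listings_df[comparison_column].median()
    return listings_df[listings_df[comparison_column] > average_value]

# Made-up listings: three in Auckland, two in Wellington City
toy_df = pd.DataFrame({
    'region_parent_name': ['Auckland', 'Auckland', 'Auckland',
                           'Wellington City', 'Wellington City'],
    'price_nzd': [100, 200, 300, 150, 250],
})

akl_df = toy_df[toy_df['region_parent_name'] == 'Auckland']
wlg_df = toy_df[toy_df['region_parent_name'] == 'Wellington City']

# Auckland median is 200, so only the 300 listing is strictly above it
akl_above = get_above_average_listings_df(akl_df, 'price_nzd')
# Wellington City median is 200 (midpoint of 150 and 250), so only 250 remains
wlg_above = get_above_average_listings_df(wlg_df, 'price_nzd')

print(akl_above['price_nzd'].tolist())  # [300]
print(wlg_above['price_nzd'].tolist())  # [250]
```

Because the comparison uses >, a listing priced exactly at the median is left out.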