PyNoon Data Lesson 1 - Tutorial

This tutorial will cover:

It is based on:

DEMO-ONLY TUTORIAL BEGINS HERE

Modules and Libraries

To get more functions (and other things) to use in Python, you can import modules, which are just files of Python code.

To use the contents of a module, we must import it once (usually at the top of our notebook or script):

import math

Then we can use functions and variables from the module:

math.pi
math.sqrt(4)

We can use the help() function to find out what is available in a module:

help(math)

Sometimes we want to import only certain functions or variables from a module…

from math import pi, e

Or import the module under an aliased name…

import pandas as pd
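The three import styles above can be compared side by side. This is a minimal sketch using only the standard-library math module; the alias m is an arbitrary name chosen for illustration, not a convention.

```python
import math                # plain import: names are reached as math.<name>
import math as m           # aliased import: the same module under a shorter name
from math import pi, e     # selective import: names are used without a prefix

# Area of a circle with radius 2, written three equivalent ways.
area_full = math.pi * 2 ** 2
area_alias = m.pi * 2 ** 2
area_bare = pi * 2 ** 2

# All three refer to the same value from the same module.
all_equal = area_full == area_alias == area_bare
print(all_equal)
```

Whichever style is used, the module is loaded only once; the styles differ only in the names made available in the notebook.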

Third-party Python libraries can be installed with pip, which can be run in a notebook as follows:

%pip install pandas
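After installing, it can be useful to confirm which version ended up in the environment. The sketch below uses the standard-library importlib.metadata module; the helper name installed_version and the deliberately non-existent package name are made up for illustration.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package_name):
    """Return the installed version string, or None if the package is absent."""
    try:
        return version(package_name)
    except PackageNotFoundError:
        return None


# Returns a version string if pandas is installed, otherwise None.
pandas_version = installed_version('pandas')

# A name that is assumed not to exist, to show the "not installed" case.
missing = installed_version('surely-not-a-real-package-xyz')
print(pandas_version, missing)
```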

Anyone can publish a third-party library, so it’s prudent to ask the following questions before installing a library:

The webpage for a library on pypi.org will often have links to:

For example, check out: pypi.org/project/pandas

FOLLOW-ALONG TUTORIAL BEGINS HERE

Setup

  1. Make a new notebook for this lesson
  2. What’s the first thing to do? RENAME IT!
  3. Name it pynoon_data_1.ipynb

Let’s look at the data we’re going to analyse:

Use the link on pynoon.github.io/lessons to open inside_airbnb_listings_nz_2023_09.csv

Pandas DataFrames

We can use the pandas library to work with tabular data in Python.

import pandas as pd

Importing pandas under the alias pd is conventional, and saves us typing out pandas every time we want to use it.

To load the CSV into a DataFrame:

listings_df = pd.read_csv('https://pynoon.github.io/data/inside_airbnb_listings_nz_2023_09.csv')
type(listings_df)

Let’s look at the DataFrame:

listings_df

Only a few rows from the top and bottom of the DataFrame are shown.
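To check the full size of a DataFrame, or to control how many rows are displayed, the shape attribute and the head() method can help. The sketch below reads a made-up three-row CSV from a string rather than the real listings file, so every value in it is hypothetical.

```python
import io

import pandas as pd

# Hypothetical three-row CSV standing in for the real listings file.
csv_text = """name,price_nzd
Beach bach,120
City flat,250
Farm stay,95
"""

# read_csv accepts a file-like object as well as a URL or file path.
toy_df = pd.read_csv(io.StringIO(csv_text))

n_rows, n_columns = toy_df.shape     # size of the whole table
first_two = toy_df.head(2)           # only the first 2 rows
column_names = list(toy_df.columns)  # column headers read from the CSV
print(n_rows, n_columns, column_names)
```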

We can sort the DataFrame by a column:

listings_df.sort_values('review_scores_rating')

Note: This doesn’t change the original DataFrame; it produces a new DataFrame that is sorted.
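A small sketch can make this concrete. The DataFrame below is a made-up miniature stand-in for listings_df, so the names and ratings are hypothetical.

```python
import pandas as pd

# Hypothetical miniature stand-in for listings_df (values are made up).
toy_df = pd.DataFrame({
    'name': ['Beach bach', 'City flat', 'Farm stay'],
    'review_scores_rating': [4.9, 4.2, 4.6],
})

sorted_df = toy_df.sort_values('review_scores_rating')

# The original keeps its row order; only the returned DataFrame is sorted.
original_order = toy_df['name'].tolist()
sorted_order = sorted_df['name'].tolist()

# ascending=False sorts from highest to lowest instead.
top_first = toy_df.sort_values('review_scores_rating', ascending=False)
print(original_order, sorted_order)
```

To keep the sorted result, assign it to a variable as done with sorted_df here.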

We can extract a single column from the DataFrame:

listings_df['name']

The type of a single column or row from a Pandas DataFrame is a Series:

type(listings_df['name'])

It is also handy to get a new DataFrame that contains only a subset of columns:

listings_df[['latitude', 'longitude']]
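The difference between single and double square brackets is easy to miss, so here is a short sketch on a made-up DataFrame (the rows are hypothetical, not taken from the listings data).

```python
import pandas as pd

# Hypothetical miniature stand-in for listings_df (values are made up).
toy_df = pd.DataFrame({
    'name': ['Beach bach', 'City flat'],
    'latitude': [-36.85, -41.29],
    'longitude': [174.76, 174.78],
})

one_column = toy_df['name']                 # a column name gives a Series
coords = toy_df[['latitude', 'longitude']]  # a list of names gives a DataFrame
one_column_df = toy_df[['name']]            # a one-item list still gives a DataFrame

print(type(one_column).__name__, type(coords).__name__, type(one_column_df).__name__)
```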

Summary Statistics

We can get summary statistics for all numeric columns:

listings_df.describe()

And we can get specific summary statistics for columns individually:

listings_df['price_nzd'].max()
listings_df['review_scores_rating'].mean()
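As a rough illustration of what these calls return, the sketch below runs them on a made-up three-row DataFrame; the prices and ratings are hypothetical.

```python
import pandas as pd

# Hypothetical miniature stand-in for listings_df (values are made up).
toy_df = pd.DataFrame({
    'name': ['Beach bach', 'City flat', 'Farm stay'],
    'price_nzd': [120, 250, 95],
    'review_scores_rating': [4.9, 4.2, 4.6],
})

# describe() returns a DataFrame with one column per numeric column
# and one row per statistic (count, mean, std, min, quartiles, max).
summary = toy_df.describe()
summary_columns = list(summary.columns)

# Methods on a single column return a single value.
highest_price = toy_df['price_nzd'].max()
mean_rating = toy_df['review_scores_rating'].mean()
print(summary_columns, highest_price, mean_rating)
```

The text column name is left out of the describe() result because, by default, only numeric columns are summarised.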

Plotting Demo

As a quick demo of the power of DataFrames, we can install and use the Plotly library to create plots from DataFrame columns:

%pip install plotly nbformat
import plotly.express as px
px.scatter(listings_df, x='longitude', y='latitude')
px.histogram(listings_df, x='review_scores_rating')