This tutorial will cover:
It is based on:
DEMO-ONLY TUTORIAL BEGINS HERE
To get more functions (and other things) to use in Python, you can import modules, which are just files of Python code.
To use the contents of a module, we must import it once (usually at the top of our notebook or script):
import math
Then we can use functions and variables from the module:
math.pi
math.sqrt(4)
We can use the help() function to find out what is available in a module:
help(math)
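The output of help() on a whole module can be long. As a lighter-weight sketch, the built-in dir() function lists the names a module defines, and help() can also be given a single function. The variable name public_names is just an illustrative choice.

```python
import math

# dir() returns the names defined in a module as a list of strings.
# Names starting with an underscore are internal details, so we skip them.
public_names = [name for name in dir(math) if not name.startswith("_")]
print(public_names[:10])

# help() also accepts a single function, which gives a much shorter printout.
help(math.sqrt)
```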
Sometimes we want to import only certain functions or variables from a module…
from math import pi, e
Or import the module under an aliased name…
import pandas as pd
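The three import styles can be compared side by side. This is a minimal sketch using only standard-library modules; the alias stats for the statistics module is an arbitrary choice for illustration, not an established convention.

```python
# Style 1: import the whole module and use its name as a prefix.
import math

# Style 2: import only specific names; no prefix is needed afterwards.
from math import pi, e

# Style 3: import a module under an alias (the alias `stats` is illustrative).
import statistics as stats

circle_area = math.pi * 2 ** 2       # prefix style
same_area = pi * 2 ** 2              # direct-name style
average = stats.mean([1, 2, 3, 4])   # alias style

print(circle_area == same_area)
print(math.e == e)
print(average)
```

Whichever style is used, the module's code is the same; only the name we type to reach it changes.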
math is one of the many modules in Python’s standard library, which is well known for its extensive functionality.
3rd party Python libraries can be downloaded with pip, which can be run as follows in a notebook:
%pip install pandas
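To check whether a library is already installed, and which version, one option is the standard library's importlib.metadata module. This is a sketch only: the helper installed_version is a hypothetical name introduced here, and the second package name is deliberately made up.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package_name):
    """Return the installed version of a package, or None if it is not installed."""
    try:
        return version(package_name)
    except PackageNotFoundError:
        return None


# Prints a version string if pandas is installed, otherwise None.
print(installed_version("pandas"))

# This name is invented for the example and should not be installed anywhere.
print(installed_version("surely-not-a-real-package-name-12345"))
```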
Anyone can publish a 3rd party library, so it’s prudent to ask the following questions before installing a library:
- Has someone you trust recommended this library?
- Is there concrete evidence of an active user base for this library?
- Am I sure I am spelling the library’s name correctly? (malicious users have been known to publish “typo-squatting” libraries with similar names to popular libraries)
The webpage for a library on pypi.org will often have links to:
For example, check out: pypi.org/project/pandas
FOLLOW-ALONG TUTORIAL BEGINS HERE
pynoon_data_1.ipynb
Let’s look at the data we’re going to analyse:
Use the link on pynoon.github.io/lessons to open inside_airbnb_listings_nz_2023_09.csv
We can use the pandas library to work with tabular data in Python.
import pandas as pd
Importing pandas under the alias pd is conventional, and saves us typing out pandas every time we want to use it.
To load the CSV into a DataFrame:
listings_df = pd.read_csv('https://pynoon.github.io/data/inside_airbnb_listings_nz_2023_09.csv')
df is a conventional variable suffix for a DataFrame.
pandas provides read_* functions for reading from different types of files (e.g. read_excel(), read_parquet()).
type(listings_df)
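To experiment with read_csv() without downloading anything, we can read from an in-memory string. This is a small sketch: the three rows are invented, and only the column names are borrowed from the lesson data.

```python
import io

import pandas as pd

# A tiny invented CSV; the real lesson file has many more rows and columns.
csv_text = """name,price_nzd,review_scores_rating
Beach hut,120,4.8
City flat,95,4.5
Farm stay,150,4.9
"""

# read_csv() accepts a URL, a file path, or a file-like object such as StringIO.
sample_df = pd.read_csv(io.StringIO(csv_text))

print(type(sample_df))
print(sample_df.shape)  # (number of rows, number of columns)
```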
Let’s look at the DataFrame:
listings_df
Only a few rows from the top and bottom of the DataFrame are shown.
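When we only want to peek at part of a long DataFrame, head() and tail() make the choice explicit. The sketch below uses an invented 100-row DataFrame; the column name listing_id is made up for the example.

```python
import pandas as pd

# An invented 100-row DataFrame, long enough that a notebook would truncate it.
long_df = pd.DataFrame({"listing_id": range(100), "price_nzd": range(50, 150)})

first_rows = long_df.head(3)   # the first 3 rows
last_rows = long_df.tail(3)    # the last 3 rows

print(long_df.shape)
print(first_rows)
print(last_rows)
```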
We can sort the DataFrame by a column:
listings_df.sort_values('review_scores_rating')
Note: This doesn’t change the original DataFrame; it produces a new DataFrame that is sorted.
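We can confirm this behaviour on a small example. The values below are invented; only the column names come from the lesson data.

```python
import pandas as pd

# Invented values; only the column names come from the lesson data.
sample_df = pd.DataFrame({
    "name": ["Beach hut", "City flat", "Farm stay"],
    "review_scores_rating": [4.8, 4.5, 4.9],
})

# sort_values() returns a new DataFrame, sorted in ascending order by default.
sorted_df = sample_df.sort_values("review_scores_rating")

print(sample_df["name"].tolist())  # the original row order is unchanged
print(sorted_df["name"].tolist())  # rows ordered from lowest to highest rating

# Pass ascending=False to put the highest ratings first.
best_first_df = sample_df.sort_values("review_scores_rating", ascending=False)
print(best_first_df["name"].tolist())
```

To keep the sorted result, assign it to a variable as above, since the original DataFrame stays in its original order.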
We can extract a single column from the DataFrame:
listings_df['name']
The type of a single column or row from a pandas DataFrame is a Series:
type(listings_df['name'])
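The same holds for a single row. As a sketch with invented values, the example below selects a row by position with iloc, which is a pandas feature not otherwise covered in this lesson.

```python
import pandas as pd

# Invented values for illustration.
sample_df = pd.DataFrame({
    "name": ["Beach hut", "City flat"],
    "price_nzd": [120, 95],
})

name_column = sample_df["name"]   # one column
first_row = sample_df.iloc[0]     # one row, selected by position

print(type(name_column))
print(type(first_row))

# A row Series is labelled by column name, so we can look up a single value.
print(first_row["price_nzd"])
```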
It is also handy to get a new DataFrame that contains only a subset of columns:
listings_df[['latitude', 'longitude']]
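The double square brackets matter: the inner pair is a Python list of column names. A small sketch with invented coordinates shows how the result type depends on what goes inside the outer brackets.

```python
import pandas as pd

# Invented values for illustration.
sample_df = pd.DataFrame({
    "name": ["Beach hut", "City flat"],
    "latitude": [-36.85, -41.29],
    "longitude": [174.76, 174.78],
})

coords_df = sample_df[["latitude", "longitude"]]  # list of names gives a DataFrame
latitude_only_df = sample_df[["latitude"]]        # one-item list still gives a DataFrame
latitude_series = sample_df["latitude"]           # a bare name gives a Series

print(type(coords_df))
print(type(latitude_only_df))
print(type(latitude_series))
```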
We can get summary statistics for all numeric columns:
listings_df.describe()
And we can get specific summary statistics for columns individually:
listings_df['price_nzd'].max()
listings_df['review_scores_rating'].mean()
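On a tiny DataFrame these statistics are easy to check by hand. The sketch below uses invented values; note that describe() leaves out the text column by default.

```python
import pandas as pd

# Invented values; only the column names come from the lesson data.
sample_df = pd.DataFrame({
    "name": ["Beach hut", "City flat", "Farm stay"],
    "price_nzd": [120, 95, 150],
    "review_scores_rating": [4.0, 4.5, 5.0],
})

# describe() summarises numeric columns only (count, mean, std, min, quartiles, max).
summary_df = sample_df.describe()
print(summary_df)

highest_price = sample_df["price_nzd"].max()
average_rating = sample_df["review_scores_rating"].mean()
print(highest_price)
print(average_rating)
```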
As a quick demo of the power of DataFrames, we can install and use the Plotly library to create plots from DataFrame columns:
%pip install plotly nbformat
import plotly.express as px
px.scatter(listings_df, x='longitude', y='latitude')
px.histogram(listings_df, x='review_scores_rating')