In this video we learn about Pandas. The Pandas Library is built on top of Numpy and is designed to make working with data fast and easy. Like Numpy, the library includes data structures and functions to manipulate that data.
from pandas import Series, Dataframe
import pandas as pd
There are two main data structures in Pandas: Series and DataFrame
A Series is a one-dimensional array-like object containing an array of data (of any NumPy data type) and an associated array of data labels, called its index
The simplest Series is formed from only an array of data:
s1 = Series([1,2, 3, 4, 5])
s1
The result is:
0 1
1 2
2 3
3 4
4 5
dtype: int64
The left column is the index and the right the values at those particular index values. If you don’t specify an index one is autogenerated that starts at 0 and goes to the length of the Series.
You can extract the index and the values with:
s1.index
which results in
RangeIndex(start=0, stop=5, step=1)
and
s1.values
which results in
array([1, 2, 3, 4, 5])
You can specify your own index for a Series using the index
parameter:
s2 = Series([1380, 1205, 102, 36, 1100], index=['Faith', 'Addie', 'Peyton', 'Erin', 'Margaret'])
s2
which results in:
Faith 1380
Addie 1205
Peyton 102
Erin 36
Margaret 1100
dtype: int64
Again, the column on the left is the index and the one on the right the values.
s2.index
results in:
Index(['Faith', 'Addie', 'Peyton', 'Erin', 'Margaret'], dtype='object')
and
s2.values
results in
array([1380, 1205, 102, 36, 1100])
Recall that s1
is
0 1
1 2
2 3
3 4
4 5
dtype: int64
So
s1[s1 % 2 == 0]
results in:
1 2
3 4
dtype: int64
where the index values 1 and 3 are preserved. Also
s2[s2 > 1000]
results in:
Faith 1380
Addie 1205
Margaret 1100
dtype: int64
and
s2 * 2
results in:
Faith 2760
Addie 2410
Peyton 204
Erin 72
Margaret 2200
dtype: int64
So, for example,
'Faith' in s2
results in True
and
'Dylan' in s2
results in False
.
First, let’s create a Python dictionary:
data1 = { 'state': ['New Mexico', 'Arizona', 'Utah', 'Colorado' ], 'population': [2.01, 7.28, 3.21, 5.76], 'poverty':[.197, .140, .110, .115]}
data1
which returns:
{'population': [2.01, 7.28, 3.21, 5.76],
'poverty': [0.197, 0.14, 0.11, 0.115],
'state': ['New Mexico', 'Arizona', 'Utah', 'Colorado']}
Now we can construct a DataFrame:
states = DataFrame(data1)
states
which outputs
state population poverty
0 New Mexico 2.01 0.197
1 Arizona 7.28 0.140
2 Utah 3.21 0.110
3 Colorado 5.76 0.115
Since we didn’t specify an index, one is autogenerated for us as shown in the left column - 0, 1, 2, 3. If we want to specify one we can use the index
parameter:
states = DataFrame(data1, index=['NM', 'AZ', 'UT', 'CO'])
states
state population poverty
NM New Mexico 2.01 0.197
AZ Arizona 7.28 0.140
UT Utah 3.21 0.110
CO Colorado 5.76 0.115
states['population']
outputs
NM 2.01
AZ 7.28
UT 3.21
CO 5.76
Name: population, dtype: float64
and
states[['population', 'poverty']]
outputs
population poverty
NM 2.01 0.197
AZ 7.28 0.140
UT 3.21 0.110
CO 5.76 0.115
One common way to extract a row is to use loc
states.loc['NM']
state New Mexico
population 2.01
poverty 0.197
Name: NM, dtype: object
and by using conditionals
states[states['population'] < 5]
results in
state population poverty
NM New Mexico 2.01 0.197
UT Utah 3.21 0.110
and
states[(states['population'] < 5) & (states['poverty'] < .15)]
results in
state population poverty
UT Utah 3.21 0.110
We can execute operating systems commands in a Jupyter Notebook by preceding the command with a bang (an exclamation point). For example,
ls
pwd
both work as expected. And
curl https://raw.githubusercontent.com/zacharski/ml-class/master/data/athletesMissingValue.csv
produces:
Name,Sport,Height,Weight
Asuka Teramoto,Gymnastics,54,66
Brittainey Raven,Basketball,,162
Chen Nan,Basketball,78,204
Gabby Douglas,Gymnastics,49,90
Helalia Johannes,Track,65,99
Irina Miketenko,Track,,106
Jennifer Lacy,Basketball,75,175
Note that this is a CSV (comma separated values) file. In this case the file starts with the names of the columns:
Name,Sport,Height,Weight
Again, commas separate the columns.
To create a DataFrame from a CSV file we use read_csv
athletes = pd.read_csv('https://raw.githubusercontent.com/zacharski/ml-class/master/data/athletesMissingValue.csv')
athletes
results in the DataFrame:
Name Sport Height Weight
0 Asuka Teramoto Gymnastics 54.0 66
1 Brittainey Raven Basketball NaN 162
2 Chen Nan Basketball 78.0 204
3 Gabby Douglas Gymnastics 49.0 90
4 Helalia Johannes Track 65.0 99
5 Irina Miketenko Track NaN 106
6 Jennifer Lacy Basketball 75.0 175
7 Kara Goucher Track 67.0 123
Again, notice that since we didn’t specify an index one was autogenerated for us (0, 1, 2 …).
Let us say we want to make a copy of the athletes
DataFrame. We might be tempted to do something like
women_athletes = athletes
Let us change the sport of Asuka in the original athletes
DataFrame to Rugby
athletes.at[0, 'Sport'] = 'Rugby'
Now let’s look at our womens_athletes
copy:
women_athletes
Name Sport Height Weight
0 Asuka Teramoto Rugby 54.0 66
1 Brittainey Raven Basketball NaN 162
2 Chen Nan Basketball 78.0 204
You can see that this value also got updated in the copy. This is similar to how arrays work in most programming languages. athletes
and women_athletes
are pointing to the same location in memory. To make a copy of the values of the DataFrame we, unsurprisingly, use the copy
command:
ath = athletes.copy()
Now, if you change a value in one DataFrame it won’t affect the values in the other!
In the above example we see NaN
which stands for not a number and is a sentinel value that represents missing data. For example, in the original file we had the line:
Brittainey Raven,Basketball,,162
which indicates there is no value in the height column. When this is read into a DataFrame that row is converted to:
1 Brittainey Raven Basketball NaN 162
We can specify a column of the original data to be the index by using the index_col
parameter:
athletes2 = pd.read_csv('https://raw.githubusercontent.com/zacharski/ml-class/master/data/athletesMissingValue.csv', index_col='Name')
athletes2
results in
Sport Height Weight
Name
Asuka Teramoto Gymnastics 54.0 66
Brittainey Raven Basketball NaN 162
Chen Nan Basketball 78.0 204
Gabby Douglas Gymnastics 49.0 90
Helalia Johannes Track 65.0 99
Irina Miketenko Track NaN 106
Jennifer Lacy Basketball 75.0 175
Kara Goucher Track 67.0 123
We use index_col
when we read in the file. Alternatively, we can do so after the fact with set_index
:
athletes.set_index('Name', inplace=True)
athletes
In machine learning tasks, it is a good idea to get a sense of the data before forging ahead with the ML part of the project. We can use:
athletes.describe()
which results in
Height Weight
count 25.000000 28.000000
mean 65.200000 121.428571
std 7.416198 40.274455
min 49.000000 66.000000
25% 61.000000 97.000000
50% 65.000000 106.000000
75% 70.000000 156.750000
max 78.000000 204.000000
You can get statistics separately:
athletes.mean()
prints:
Height 65.200000
Weight 121.428571
dtype: float64
and combined with a conditional:
athletes[athletes['Sport'] == 'Basketball'].mean()
Height 73.250000
Weight 174.222222
dtype: float64
The average basketball player is 8 inches taller than the average in our dataset.
Other functions include:
athletes.max()
athletes.min()
Of course we can combine these with arithmetic operators:
athletes[['Height', 'Weight']] - athletes.mean()
results in:
Height Weight
Name
Asuka Teramoto -11.2 -55.428571
Brittainey Raven NaN 40.571429
Chen Nan 12.8 82.571429
Gabby Douglas -16.2 -31.428571
Helalia Johannes -0.2 -22.428571
Irina Miketenko NaN -15.428571
So Asuka Teramoto is 11 inches shorter than average and 55 pounds lighter.
We can do the same calculation with:
athletes[['Height', 'Weight']].sub(athletes.mean())
We can sort a DataFrame either by the index or by one or more of the columns.
To sort by the index do:
athletes.sort_index()
This will sort the DataFrame alphabetically by the name of the athlete (Asuka coming before Chen). To sort in reverse we can do:
athletes.sort_index(ascending=False)
which would put Chen before Asuka.
athletes.sort_values('Height')
which will sort by height, shortest person first. To get the tallest person first:
athletes.sort_values('Height', ascending=False)
and to sort first by Height and then by Weight:
athletes.sort_values(['Height', 'Weight'], ascending=False)