Machine Learning Fundamentals

Introduction to Pandas

In this video we learn about Pandas. The Pandas Library is built on top of Numpy and is designed to make working with data fast and easy. Like Numpy, the library includes data structures and functions to manipulate that data.

Pandas Import Conventions

from pandas import Series, Dataframe
import pandas as pd

Pandas Data Structures

There are two main data structures in Pandas: Series and DataFrame

Series

A Series is a one-dimensional array-like object containing an array of data (of any NumPy data type) and an associated array of data labels, called its index

The simplest Series is formed from only an array of data:

s1 = Series([1,2, 3, 4, 5])
s1

The result is:

  1
  2
  3
  4
  5
dtype: int64

The left column is the index and the right the values at those particular index values. If you don’t specify an index one is autogenerated that starts at 0 and goes to the length of the Series.

You can extract the index and the values with:

s1.index

which results in

RangeIndex(start=0, stop=5, step=1)

and

s1.values

which results in

array([1, 2, 3, 4, 5])

Specifying your own index

You can specify your own index for a Series using the index parameter:

s2 = Series([1380, 1205, 102, 36, 1100], index=['Faith', 'Addie', 'Peyton', 'Erin', 'Margaret'])
s2

which results in:

Faith       1380
Addie       1205
Peyton       102
Erin          36
Margaret    1100
dtype: int64

Again, the column on the left is the index and the one on the right the values.

s2.index

results in:

Index(['Faith', 'Addie', 'Peyton', 'Erin', 'Margaret'], dtype='object')

and

s2.values

results in

array([1380, 1205,  102,   36, 1100])

Boolean operations preserve the index

Recall that s1 is

  1
  2
  3
  4
  5
dtype: int64

s1[s1 % 2 == 0]

results in:

1    2
3    4
dtype: int64

where the index values 1 and 3 are preserved. Also

s2[s2 > 1000]

results in:

Faith       1380
Addie       1205
Margaret    1100
dtype: int64

and

s2 * 2

results in:

Faith       2760
Addie       2410
Peyton       204
Erin          72
Margaret    2200
dtype: int64

Series have some similarities to Python dictionaries

So, for example,

'Faith' in s2

results in True and

'Dylan' in s2

results in False.

DataFrame - a spreadsheet like structure

ordered collection of columns
row index and column index

One way to construct a DataFrame is from a Python dictionary:

First, let’s create a Python dictionary:

data1  = { 'state': ['New Mexico', 'Arizona', 'Utah', 'Colorado' ], 'population': [2.01, 7.28, 3.21, 5.76], 'poverty':[.197, .140, .110, .115]}
data1

which returns:

{'population': [2.01, 7.28, 3.21, 5.76],
 'poverty': [0.197, 0.14, 0.11, 0.115],
 'state': ['New Mexico', 'Arizona', 'Utah', 'Colorado']}

Now we can construct a DataFrame:

states = DataFrame(data1)
states

which outputs

	state       population	poverty
New Mexico  2.01        0.197
Arizona	    7.28        0.140
Utah	    3.21        0.110
Colorado    5.76        0.115

Since we didn’t specify an index, one is autogenerated for us as shown in the left column - 0, 1, 2, 3. If we want to specify one we can use the index parameter:

states = DataFrame(data1, index=['NM', 'AZ', 'UT', 'CO'])
states

	state       population	poverty
NM	New Mexico  2.01        0.197
AZ	Arizona	    7.28        0.140
UT	Utah	    3.21        0.110
CO	Colorado    5.76        0.115

Extracting one or more columns:

states['population']

outputs

NM    2.01
AZ    7.28
UT    3.21
CO    5.76
Name: population, dtype: float64

and

states[['population', 'poverty']]

outputs

     population  poverty
NM         2.01    0.197
AZ         7.28    0.140
UT         3.21    0.110
CO         5.76    0.115

Extracting rows

One common way to extract a row is to use loc

states.loc['NM']

state         New Mexico
population          2.01
poverty            0.197
Name: NM, dtype: object

and by using conditionals

states[states['population'] < 5]

results in

         state   population  poverty
NM	New Mexico         2.01    0.197
UT	      Utah         3.21    0.110

and

states[(states['population'] < 5) & (states['poverty'] < .15)]

results in

         state   population  poverty
UT	      Utah         3.21    0.110

Executing Arbitrary Operating System Commands

We can execute operating systems commands in a Jupyter Notebook by preceding the command with a bang (an exclamation point). For example,

ls 
pwd

both work as expected. And

curl https://raw.githubusercontent.com/zacharski/ml-class/master/data/athletesMissingValue.csv

produces:

Name,Sport,Height,Weight
Asuka Teramoto,Gymnastics,54,66
Brittainey Raven,Basketball,,162
Chen Nan,Basketball,78,204
Gabby Douglas,Gymnastics,49,90
Helalia Johannes,Track,65,99
Irina Miketenko,Track,,106
Jennifer Lacy,Basketball,75,175

Note that this is a CSV (comma separated values) file. In this case the file starts with the names of the columns:

Name,Sport,Height,Weight

Again, commas separate the columns.

Creating a DataFrame from a CSV file

To create a DataFrame from a CSV file we use read_csv

athletes = pd.read_csv('https://raw.githubusercontent.com/zacharski/ml-class/master/data/athletesMissingValue.csv')
athletes

results in the DataFrame:

               Name        Sport   Height  Weight
  Asuka Teramoto   Gymnastics     54.0      66
Brittainey Raven   Basketball      NaN     162
        Chen Nan   Basketball     78.0     204
   Gabby Douglas   Gymnastics     49.0      90
Helalia Johannes        Track     65.0      99
 Irina Miketenko        Track      NaN     106
   Jennifer Lacy   Basketball     75.0     175
    Kara Goucher        Track     67.0     123

Again, notice that since we didn’t specify an index one was autogenerated for us (0, 1, 2 …).

Copying a DataFrame

Let us say we want to make a copy of the athletes DataFrame. We might be tempted to do something like

women_athletes = athletes

Let us change the sport of Asuka in the original athletes DataFrame to Rugby

athletes.at[0, 'Sport'] = 'Rugby'

Now let’s look at our womens_athletes copy:

women_athletes
	Name	Sport	Height	Weight
0	Asuka Teramoto	Rugby	54.0	66
1	Brittainey Raven	Basketball	NaN	162
2	Chen Nan	Basketball	78.0	204

You can see that this value also got updated in the copy. This is similar to how arrays work in most programming languages. athletes and women_athletes are pointing to the same location in memory. To make a copy of the values of the DataFrame we, unsurprisingly, use the copy command:

ath = athletes.copy()

Now, if you change a value in one DataFrame it won’t affect the values in the other!

Missing Data

In the above example we see NaN which stands for not a number and is a sentinel value that represents missing data. For example, in the original file we had the line:

Brittainey Raven,Basketball,,162

which indicates there is no value in the height column. When this is read into a DataFrame that row is converted to:

1  Brittainey Raven   Basketball      NaN     162

Creating an index.

We can specify a column of the original data to be the index by using the index_col parameter:

athletes2 = pd.read_csv('https://raw.githubusercontent.com/zacharski/ml-class/master/data/athletesMissingValue.csv', index_col='Name')
athletes2

results in

                           Sport   Height  Weight
			   Name
     Asuka Teramoto   Gymnastics     54.0      66
   Brittainey Raven   Basketball      NaN     162
           Chen Nan   Basketball     78.0     204
      Gabby Douglas   Gymnastics     49.0      90
   Helalia Johannes        Track     65.0      99
    Irina Miketenko        Track      NaN     106
      Jennifer Lacy   Basketball     75.0     175
       Kara Goucher        Track     67.0     123

We use index_col when we read in the file. Alternatively, we can do so after the fact with set_index:

athletes.set_index('Name', inplace=True)
athletes

DataFrame Basic Statistics

In machine learning tasks, it is a good idea to get a sense of the data before forging ahead with the ML part of the project. We can use:

athletes.describe()

which results in

           Height       Weight
count   25.000000    28.000000
mean    65.200000   121.428571
std      7.416198    40.274455
min     49.000000    66.000000
25%     61.000000    97.000000
50%     65.000000   106.000000
75%     70.000000   156.750000
max     78.000000   204.000000

You can get statistics separately:

athletes.mean()

prints:

Height     65.200000
Weight    121.428571
dtype: float64

and combined with a conditional:

athletes[athletes['Sport'] == 'Basketball'].mean()

Height     73.250000
Weight    174.222222
dtype: float64

The average basketball player is 8 inches taller than the average in our dataset.

Other functions include:

athletes.max()
athletes.min()

Of course we can combine these with arithmetic operators:

athletes[['Height', 'Weight']] - athletes.mean()

results in:

	             Height       Weight
Name		
Asuka Teramoto    -11.2   -55.428571
Brittainey Raven    NaN    40.571429
Chen Nan           12.8    82.571429
Gabby Douglas     -16.2   -31.428571
Helalia Johannes   -0.2   -22.428571
Irina Miketenko     NaN   -15.428571

So Asuka Teramoto is 11 inches shorter than average and 55 pounds lighter.

We can do the same calculation with:

athletes[['Height', 'Weight']].sub(athletes.mean())

Sorting

We can sort a DataFrame either by the index or by one or more of the columns.

sort by index

To sort by the index do:

athletes.sort_index()

This will sort the DataFrame alphabetically by the name of the athlete (Asuka coming before Chen). To sort in reverse we can do:

athletes.sort_index(ascending=False)

which would put Chen before Asuka.

Sort by one or more columns

athletes.sort_values('Height')

which will sort by height, shortest person first. To get the tallest person first:

athletes.sort_values('Height', ascending=False)

and to sort first by Height and then by Weight:

athletes.sort_values(['Height', 'Weight'], ascending=False)

Previous section:

Colab And Numpy

Next section:

scikit Learn