Pandas Tutorial


exds_00: Getting Started

Introduction:

Now more than ever, data has a unique ability to describe the many facets of our reality. Thanks to relatively recent (and accelerating) advances in computing,those with proper motivation can use data to affect change in any realm they so choose, from ones more directly related to computer science such as machine learning to others with not so obvious connections such as biology and economics!

You have likely heard a lot of talk recently about data science, but what exactly is data science? A quick search gives this definition: “an inter-disciplinary field that uses scientific methods, processes, algorithms and systems to extract knowledge and insights from structured and unstructured data, and apply knowledge from data across a broad range of application domains”. Now, this is a mouthful, but essentially, data science pulls from the best of computer science, statistics, and mathematics to make data understandable and actionable. This is the key, because what worth does data have if we can not interpret it and make decisions based on it?

pandas and Data Science in Python:

Learning Python in this class has already given you a headstart in data science, since Python is incredibly well-equipped for handling large datasets with many specialized third-party libraries. The industry standard for this type of work is a library called pandas. There are other libraries that work well with data science in Python that we may explore later, but pandas is all you need for now! :)

Below, you will see the usual convention for importing pandas:

Importing pandas

import pandas as pd

Now, that was not so bad! :) but we have a ways to go…

The End-all Be-all of pandas: the DataFrame

In pandas, the primary data type that you will work is known as a DataFrame. Those who have worked in R may know this by the same name, and those who have in the SAS programming lanaguage may know this as a data set! Here, you will see a declaration of a DataFrame in pandas:

# Setting up the dictionary with str keys and list[int] values
d: dict[str, list[int]] = {'col1': [1, 2], 'col2': [3, 4]}

# Creating the DataFrame from the above structure
df = pd.DataFrame(data=d)

# We do not have to use df.head() (more info to come) here because the DataFrame is quite small (this will not be true going forward)
print(df)
   col1  col2
0     1     3
1     2     4

You might have noticed something very interesting, a marked similarity between this declaration and something you have worked with before, perhaps in EX04! Full disclosure, this is entirely purposeful. EX04 was meant to help you understand the underlying functionality of a DataFrame in pandas! Many things work similarly in pandas to how you implemented them in EX04. But, more on that later!

Importing Data with pandas

It stands to reason that, to work with data, we need data! In pandas, we can read in many kinds of files, but the most common for datasets is the csv or “comma-separated value” file. The dataset imported below contains information about some Pokemon! To view your data, you should use a command of the form print(dataframe.head(n)), where n is the number of top rows you wish to display. If left blank, this will default to 5.

Additionally, if we choose to modify the DataFrame, we can actually write to a new csv that will include the modifications using a similar command to the one you will see below. Do NOT worry if the code for modification seems foreign, we will get to that in time, but the gist of it is that we are ordering or sorting the pokemon DataFrame by ascending values of HP! It is worth noting that we could write back to the original csv, but in some cases it may be worth preserving the original dataset.

# Reading in the CSV
pokemon = pd.read_csv("pokemon.csv")

# Sorting the values rows and resetting their indices
pokemon_sorted = pokemon.sort_values("HP",ascending=True)
pokemon_sorted = pokemon_sorted.reset_index(drop=True)

# Writing the new DataFrame to the new csv
pokemon_sorted.to_csv("pokemon_new.csv",index=False)

Summarizing the data in your DataFrame

After we load the data, the structure of our DataFrame may not be immediately obvious, and usually using a command such as df.head() works totally fine. But, sometimes we need a bit more information to understand our data before we attempt to analyze it. See the code below for some commands that may be useful in interpreting the structure of your DataFrame.

# Getting the number of rows in your DataFrame
print(len(pokemon))

# Getting the dimensions of your DataFrame (rows, columns)
pokemon.shape

# Getting a list of the column names in your DataFrame
pokemon.columns

# Provides the most comprehensive overview of the columns in your DataFrame
# also provides the name, count of non-null values and datatype of each column
# as well as some information about memory usage!
pokemon.info()
800
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 800 entries, 0 to 799
Data columns (total 14 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   Unnamed: 0  800 non-null    int64 
 1   #           800 non-null    int64 
 2   Name        800 non-null    object
 3   Type 1      800 non-null    object
 4   Type 2      414 non-null    object
 5   Total       800 non-null    int64 
 6   HP          800 non-null    int64 
 7   Attack      800 non-null    int64 
 8   Defense     800 non-null    int64 
 9   Sp. Atk     800 non-null    int64 
 10  Sp. Def     800 non-null    int64 
 11  Speed       800 non-null    int64 
 12  Generation  800 non-null    int64 
 13  Legendary   800 non-null    bool  
dtypes: bool(1), int64(10), object(3)
memory usage: 82.2+ KB

Subsetting by variable (column) in a DataFrame

Of course, having the csv loaded is great, but there is not much use in a DataFrame if we can not actually get to the data and perform meaningful analyses. So, we will start by selecting columns! In pandas, there several ways to do this. See the code below and read the comments for insight into the action of each command and note the addition of another library called numpy, which will help us perform some basic analysis on the data we collect.

Something of note is that the very last command uses something called a regular expression. We do not cover this topic in COMP 110, but they are quite useful and something you will learn more about in later COMP courses if you so choose to take them! :)

# Importing numpy
import numpy as np

# Selecting a column using . notation and find the mean HP of the pokemon in our dataset
hp = pokemon.HP
print(np.mean(hp))

# Selecting a column using subscription notation (should be more familiar), also finding the standard deviation using numpy
defense = pokemon["Defense"]
print(np.std(defense))

# Selecting multiple columns, can be useful to "declutter" your data or more directly compare columns
narrowed_pokemon = pokemon[["Name","HP","Attack","Defense"]]
print(narrowed_pokemon.head())

# Selecting columns in a range using .loc() and columns at certian indices using .iloc()
names_and_types = pokemon.loc[:,'Name':'Type 2']
narrowed_pokemon = pokemon.iloc[:,[2,6,7,8]] # note that this achieves the same thing as the subscription notation above!
print(names_and_types.head())
print(narrowed_pokemon.head())

# Filtering columns using .filter(), getting only the columns that start with T
types_and_total = pokemon.filter(regex="^T")
print(types_and_total.head())

# Selecting columns based on data type, only columns of a numeric types
numbers_only = pokemon.select_dtypes(include='number')
print(numbers_only.head())
69.25875
31.164004777146342
                    Name  HP  Attack  Defense
0              Bulbasaur  45      49       49
1                Ivysaur  60      62       63
2               Venusaur  80      82       83
3  VenusaurMega Venusaur  80     100      123
4             Charmander  39      52       43
                    Name Type 1  Type 2
0              Bulbasaur  Grass  Poison
1                Ivysaur  Grass  Poison
2               Venusaur  Grass  Poison
3  VenusaurMega Venusaur  Grass  Poison
4             Charmander   Fire     NaN
                    Name  HP  Attack  Defense
0              Bulbasaur  45      49       49
1                Ivysaur  60      62       63
2               Venusaur  80      82       83
3  VenusaurMega Venusaur  80     100      123
4             Charmander  39      52       43
  Type 1  Type 2  Total
0  Grass  Poison    318
1  Grass  Poison    405
2  Grass  Poison    525
3  Grass  Poison    625
4   Fire     NaN    309
   Unnamed: 0  #  Total  HP  Attack  Defense  Sp. Atk  Sp. Def  Speed  \
0           0  1    318  45      49       49       65       65     45   
1           1  2    405  60      62       63       80       80     60   
2           2  3    525  80      82       83      100      100     80   
3           3  3    625  80     100      123      122      120     80   
4           4  4    309  39      52       43       60       50     65   

   Generation  
0           1  
1           1  
2           1  
3           1  
4           1  

Subsetting by observation (row) in a DataFrame

Next, we will focus on how to subset data by row. This is where Boolean expressions come into play since we typically want the rows we filter to meet a specific requirement such as one of its values falling within a range of numbers or being equal to some particular string. In this section, you will see how to generate a sample from larger dataset. However, pay particular attention to the use of Boolean conditions to filter rows of the DataFrame, as this is the most common and readable method in pandas.

# Using .head() and tail() to select the first and last 'n' rows respectively (n defaults to 5)
first_5 = pokemon.head(5)
last_5 = pokemon.tail(5)
print(first_5)
print(last_5)

# Filtering rows for a specific Boolean condition, all Pokemon with HP greater than 100
healthy_pokemon = pokemon[pokemon.HP > 100]
print(healthy_pokemon.head())

# Dropping duplicate observations from a DataFrame
no_repeats = pokemon.drop_duplicates()
print(no_repeats.head())

# Randomly sample (select) (1-frac)100% of the rows in the DataFrame
sample_frac = pokemon.sample(frac=0.25)
print(sample_frac.head())

# Randomly sample (select) n of the rows in the DataFrame
sample_n = pokemon.sample(n=10)
print(sample_n)
   Unnamed: 0  #                   Name Type 1  Type 2  Total  HP  Attack  \
0           0  1              Bulbasaur  Grass  Poison    318  45      49   
1           1  2                Ivysaur  Grass  Poison    405  60      62   
2           2  3               Venusaur  Grass  Poison    525  80      82   
3           3  3  VenusaurMega Venusaur  Grass  Poison    625  80     100   
4           4  4             Charmander   Fire     NaN    309  39      52   

   Defense  Sp. Atk  Sp. Def  Speed  Generation  Legendary  
0       49       65       65     45           1      False  
1       63       80       80     60           1      False  
2       83      100      100     80           1      False  
3      123      122      120     80           1      False  
4       43       60       50     65           1      False  
     Unnamed: 0    #                 Name   Type 1 Type 2  Total  HP  Attack  \
795         795  719              Diancie     Rock  Fairy    600  50     100   
796         796  719  DiancieMega Diancie     Rock  Fairy    700  50     160   
797         797  720  HoopaHoopa Confined  Psychic  Ghost    600  80     110   
798         798  720   HoopaHoopa Unbound  Psychic   Dark    680  80     160   
799         799  721            Volcanion     Fire  Water    600  80     110   

     Defense  Sp. Atk  Sp. Def  Speed  Generation  Legendary  
795      150      100      150     50           6       True  
796      110      160      110    110           6       True  
797       60      150      130     70           6       True  
798       60      170      130     80           6       True  
799      120      130       90     70           6       True  
     Unnamed: 0    #        Name  Type 1 Type 2  Total   HP  Attack  Defense  \
44           44   39  Jigglypuff  Normal  Fairy    270  115      45       20   
45           45   40  Wigglytuff  Normal  Fairy    435  140      70       45   
96           96   89         Muk  Poison    NaN    500  105     105       75   
120         120  112      Rhydon  Ground   Rock    485  105     130      120   
121         121  113     Chansey  Normal    NaN    450  250       5        5   

     Sp. Atk  Sp. Def  Speed  Generation  Legendary  
44        45       25     20           1      False  
45        85       50     45           1      False  
96        65      100     50           1      False  
120       45       45     40           1      False  
121       35      105     50           1      False  
   Unnamed: 0  #                   Name Type 1  Type 2  Total  HP  Attack  \
0           0  1              Bulbasaur  Grass  Poison    318  45      49   
1           1  2                Ivysaur  Grass  Poison    405  60      62   
2           2  3               Venusaur  Grass  Poison    525  80      82   
3           3  3  VenusaurMega Venusaur  Grass  Poison    625  80     100   
4           4  4             Charmander   Fire     NaN    309  39      52   

   Defense  Sp. Atk  Sp. Def  Speed  Generation  Legendary  
0       49       65       65     45           1      False  
1       63       80       80     60           1      False  
2       83      100      100     80           1      False  
3      123      122      120     80           1      False  
4       43       60       50     65           1      False  
     Unnamed: 0    #        Name  Type 1    Type 2  Total  HP  Attack  \
720         720  652  Chesnaught   Grass  Fighting    530  88     107   
249         249  230     Kingdra   Water    Dragon    540  75      95   
160         160  148   Dragonair  Dragon       NaN    420  61      84   
659         659  598  Ferrothorn   Grass     Steel    489  74      94   
461         461  414      Mothim     Bug    Flying    424  70      94   

     Defense  Sp. Atk  Sp. Def  Speed  Generation  Legendary  
720      122       74       75     64           6      False  
249       95       95       95     85           2      False  
160       65       70       70     70           1      False  
659      131       54      116     20           5      False  
461       50       94       50     66           4      False  
     Unnamed: 0    #                 Name  Type 1    Type 2  Total   HP  \
529         529  477             Dusknoir   Ghost       NaN    525   45   
585         585  526             Gigalith    Rock       NaN    515   85   
787         787  711  GourgeistSuper Size   Ghost     Grass    494   85   
777         777  707               Klefki   Steel     Fairy    470   57   
363         363  332             Cacturne   Grass      Dark    475   70   
398         398  363               Spheal     Ice     Water    290   70   
313         313  289              Slaking  Normal       NaN    670  150   
401         401  366             Clamperl   Water       NaN    345   35   
277         277  256            Combusken    Fire  Fighting    405   60   
254         254  235             Smeargle  Normal       NaN    250   55   

     Attack  Defense  Sp. Atk  Sp. Def  Speed  Generation  Legendary  
529     100      135       65      135     45           4      False  
585     135      130       60       80     25           5      False  
787     100      122       58       75     54           6      False  
777      80       91       80       87     75           6      False  
363     115       60      115       60     55           3      False  
398      40       50       55       50     25           3      False  
313     160      100       95       65    100           3      False  
401      64       85       74       55     32           3      False  
277      85       60       85       60     55           3      False  
254      20       35       20       45     75           2      False  

Conclusion of Tutorial:

To wrap things up, let’s recall what we have learned so far:

  • importing pandas
  • creating a DataFrame, underlying structure
  • reading from and writing to .csv files
  • observing the structure of a specific DataFrame
  • selecting data by column and by row
  • some basic analysis using numpy

Congratulations on taking the first step in your data science journey. If you so choose, you can try out the Data Science Challenge in this folder which will push you to apply the skills you learned in this tutorial and do some of your own analysis on a dataset! Great work and good luck to all of you future data scientists! :)

Contributor(s): Kyle Sorensen