Monday, December 23, 2019

nba_analysis_with_pandas (Source: RealPython)

exploring_with_pandas_real_python_example

Downloading data from the web

  • use the requests module
  • response = requests.get(url)
  • response.raise_for_status() - checks whether the request succeeded
  • download it to a file by writing the response content to it
with open(target_csv_path, "wb") as f:
    f.write(response.content)
In [6]:
import requests

download_url = "https://raw.githubusercontent.com/fivethirtyeight/data/master/nba-elo/nbaallelo.csv"
target_csv_path = "nba_all_elo.csv"

response = requests.get(download_url)
response.raise_for_status() # check the request was successful

with open(target_csv_path, "wb") as f:
    f.write(response.content)

print ("Download ready")
Download ready

Reading a CSV file using pandas

In [7]:
import pandas as pd

nba = pd.read_csv("nba_all_elo.csv")
type(nba)
Out[7]:
pandas.core.frame.DataFrame

len(df) - gives us the number of rows in df

In [8]:
len(nba) # to determine number of rows - len()
Out[8]:
126314

df.shape - to see the number of rows and columns

In [9]:
nba.shape # dimensionality: a tuple of (rows, columns)
Out[9]:
(126314, 23)

To see the first few rows of the data frame

In [10]:
nba.head()
Out[10]:
gameorder game_id lg_id _iscopy year_id date_game seasongame is_playoffs team_id fran_id ... win_equiv opp_id opp_fran opp_pts opp_elo_i opp_elo_n game_location game_result forecast notes
0 1 194611010TRH NBA 0 1947 11/1/1946 1 0 TRH Huskies ... 40.294830 NYK Knicks 68 1300.0000 1306.7233 H L 0.640065 NaN
1 1 194611010TRH NBA 1 1947 11/1/1946 1 0 NYK Knicks ... 41.705170 TRH Huskies 66 1300.0000 1293.2767 A W 0.359935 NaN
2 2 194611020CHS NBA 0 1947 11/2/1946 1 0 CHS Stags ... 42.012257 NYK Knicks 47 1306.7233 1297.0712 H W 0.631101 NaN
3 2 194611020CHS NBA 1 1947 11/2/1946 2 0 NYK Knicks ... 40.692783 CHS Stags 63 1300.0000 1309.6521 A L 0.368899 NaN
4 3 194611020DTF NBA 0 1947 11/2/1946 1 0 DTF Falcons ... 38.864048 WSC Capitols 50 1300.0000 1320.3811 H L 0.640065 NaN

5 rows × 23 columns

To see all the columns on the terminal

In [11]:
pd.set_option("display.max.columns", None) # to see all the 23 columns
nba.head()
Out[11]:
gameorder game_id lg_id _iscopy year_id date_game seasongame is_playoffs team_id fran_id pts elo_i elo_n win_equiv opp_id opp_fran opp_pts opp_elo_i opp_elo_n game_location game_result forecast notes
0 1 194611010TRH NBA 0 1947 11/1/1946 1 0 TRH Huskies 66 1300.0000 1293.2767 40.294830 NYK Knicks 68 1300.0000 1306.7233 H L 0.640065 NaN
1 1 194611010TRH NBA 1 1947 11/1/1946 1 0 NYK Knicks 68 1300.0000 1306.7233 41.705170 TRH Huskies 66 1300.0000 1293.2767 A W 0.359935 NaN
2 2 194611020CHS NBA 0 1947 11/2/1946 1 0 CHS Stags 63 1300.0000 1309.6521 42.012257 NYK Knicks 47 1306.7233 1297.0712 H W 0.631101 NaN
3 2 194611020CHS NBA 1 1947 11/2/1946 2 0 NYK Knicks 47 1306.7233 1297.0712 40.692783 CHS Stags 63 1300.0000 1309.6521 A L 0.368899 NaN
4 3 194611020DTF NBA 0 1947 11/2/1946 1 0 DTF Falcons 33 1300.0000 1279.6189 38.864048 WSC Capitols 50 1300.0000 1320.3811 H L 0.640065 NaN

To set the precision value for decimal fields

In [12]:
pd.set_option("display.precision", 2) #we dont need 6 decimal places
nba.head()
Out[12]:
gameorder game_id lg_id _iscopy year_id date_game seasongame is_playoffs team_id fran_id pts elo_i elo_n win_equiv opp_id opp_fran opp_pts opp_elo_i opp_elo_n game_location game_result forecast notes
0 1 194611010TRH NBA 0 1947 11/1/1946 1 0 TRH Huskies 66 1300.00 1293.28 40.29 NYK Knicks 68 1300.00 1306.72 H L 0.64 NaN
1 1 194611010TRH NBA 1 1947 11/1/1946 1 0 NYK Knicks 68 1300.00 1306.72 41.71 TRH Huskies 66 1300.00 1293.28 A W 0.36 NaN
2 2 194611020CHS NBA 0 1947 11/2/1946 1 0 CHS Stags 63 1300.00 1309.65 42.01 NYK Knicks 47 1306.72 1297.07 H W 0.63 NaN
3 2 194611020CHS NBA 1 1947 11/2/1946 2 0 NYK Knicks 47 1306.72 1297.07 40.69 CHS Stags 63 1300.00 1309.65 A L 0.37 NaN
4 3 194611020DTF NBA 0 1947 11/2/1946 1 0 DTF Falcons 33 1300.00 1279.62 38.86 WSC Capitols 50 1300.00 1320.38 H L 0.64 NaN

To see the last rows of the data frame

In [13]:
nba.tail()
Out[13]:
gameorder game_id lg_id _iscopy year_id date_game seasongame is_playoffs team_id fran_id pts elo_i elo_n win_equiv opp_id opp_fran opp_pts opp_elo_i opp_elo_n game_location game_result forecast notes
126309 63155 201506110CLE NBA 0 2015 6/11/2015 100 1 CLE Cavaliers 82 1723.41 1704.39 60.31 GSW Warriors 103 1790.96 1809.98 H L 0.55 NaN
126310 63156 201506140GSW NBA 0 2015 6/14/2015 102 1 GSW Warriors 104 1809.98 1813.63 68.01 CLE Cavaliers 91 1704.39 1700.74 H W 0.77 NaN
126311 63156 201506140GSW NBA 1 2015 6/14/2015 101 1 CLE Cavaliers 91 1704.39 1700.74 60.01 GSW Warriors 104 1809.98 1813.63 A L 0.23 NaN
126312 63157 201506170CLE NBA 0 2015 6/16/2015 102 1 CLE Cavaliers 97 1700.74 1692.09 59.29 GSW Warriors 105 1813.63 1822.29 H L 0.48 NaN
126313 63157 201506170CLE NBA 1 2015 6/16/2015 103 1 GSW Warriors 105 1813.63 1822.29 68.52 CLE Cavaliers 97 1700.74 1692.09 A W 0.52 NaN
In [14]:
nba.tail(3) # to get the last 3 rows of nba
Out[14]:
gameorder game_id lg_id _iscopy year_id date_game seasongame is_playoffs team_id fran_id pts elo_i elo_n win_equiv opp_id opp_fran opp_pts opp_elo_i opp_elo_n game_location game_result forecast notes
126311 63156 201506140GSW NBA 1 2015 6/14/2015 101 1 CLE Cavaliers 91 1704.39 1700.74 60.01 GSW Warriors 104 1809.98 1813.63 A L 0.23 NaN
126312 63157 201506170CLE NBA 0 2015 6/16/2015 102 1 CLE Cavaliers 97 1700.74 1692.09 59.29 GSW Warriors 105 1813.63 1822.29 H L 0.48 NaN
126313 63157 201506170CLE NBA 1 2015 6/16/2015 103 1 GSW Warriors 105 1813.63 1822.29 68.52 CLE Cavaliers 97 1700.74 1692.09 A W 0.52 NaN

To check the data types and columns - df.info()

In [15]:
nba.info() # discover data types
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 126314 entries, 0 to 126313
Data columns (total 23 columns):
gameorder        126314 non-null int64
game_id          126314 non-null object
lg_id            126314 non-null object
_iscopy          126314 non-null int64
year_id          126314 non-null int64
date_game        126314 non-null object
seasongame       126314 non-null int64
is_playoffs      126314 non-null int64
team_id          126314 non-null object
fran_id          126314 non-null object
pts              126314 non-null int64
elo_i            126314 non-null float64
elo_n            126314 non-null float64
win_equiv        126314 non-null float64
opp_id           126314 non-null object
opp_fran         126314 non-null object
opp_pts          126314 non-null int64
opp_elo_i        126314 non-null float64
opp_elo_n        126314 non-null float64
game_location    126314 non-null object
game_result      126314 non-null object
forecast         126314 non-null float64
notes            5424 non-null object
dtypes: float64(6), int64(7), object(10)
memory usage: 22.2+ MB
  • The object type in pandas is "a catch-all for columns that pandas didn't recognize as any other specific type"
  • In practice, it often means that all of the values in the column are strings (a quick check follows below)
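A quick check (sketch) of what the object type means here: the game_id column reports the object dtype, and its individual values are ordinary Python strings.

print(nba["game_id"].dtype)           # object
print(type(nba["game_id"].iloc[0]))   # <class 'str'>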

Showing basic statistics of a data frame - describe()

  • default: describe() only analyzes numeric columns
  • we can include other data types by using the include parameter
In [16]:
nba.describe()
Out[16]:
gameorder _iscopy year_id seasongame is_playoffs pts elo_i elo_n win_equiv opp_pts opp_elo_i opp_elo_n forecast
count 126314.00 126314.0 126314.00 126314.00 126314.00 126314.00 126314.00 126314.00 126314.00 126314.00 126314.00 126314.00 126314.00
mean 31579.00 0.5 1988.20 43.53 0.06 102.73 1495.24 1495.24 41.71 102.73 1495.24 1495.24 0.50
std 18231.93 0.5 17.58 25.38 0.24 14.81 112.14 112.46 10.63 14.81 112.14 112.46 0.22
min 1.00 0.0 1947.00 1.00 0.00 0.00 1091.64 1085.77 10.15 0.00 1091.64 1085.77 0.02
25% 15790.00 0.0 1975.00 22.00 0.00 93.00 1417.24 1416.99 34.10 93.00 1417.24 1416.99 0.33
50% 31579.00 0.5 1990.00 43.00 0.00 103.00 1500.95 1500.95 42.11 103.00 1500.95 1500.95 0.50
75% 47368.00 1.0 2003.00 65.00 0.00 112.00 1576.06 1576.29 49.64 112.00 1576.06 1576.29 0.67
max 63157.00 1.0 2015.00 108.00 1.00 186.00 1853.10 1853.10 71.11 186.00 1853.10 1853.10 0.98
In [17]:
import numpy as np
nba.describe(include = np.object)
Out[17]:
game_id lg_id date_game team_id fran_id opp_id opp_fran game_location game_result notes
count 126314 126314 126314 126314 126314 126314 126314 126314 126314 5424
unique 63157 2 12426 104 53 104 53 3 2 231
top 200412270MIA NBA 4/17/2013 BOS Lakers BOS Lakers H L at New York NY
freq 2 118016 30 5997 6024 5997 6024 63138 63157 440

Exploratory data analysis - to answer more questions about our dataset

  • value_counts() - to examine how often specific values occur in a column
In [18]:
nba["team_id"].value_counts()
Out[18]:
BOS    5997
NYK    5769
LAL    5078
DET    4985
PHI    4533
CHI    4307
PHO    4080
ATL    4035
MIL    4034
POR    3870
HOU    3820
CLE    3810
GSW    3701
SEA    3547
SAS    3515
IND    3364
DEN    3312
UTA    3145
DAL    3013
NJN    2939
LAC    2563
SAC    2488
MIA    2371
ORL    2207
MIN    2131
WSB    1992
TOR    1634
WAS    1475
CIN    1230
MEM    1197
       ... 
OAK     172
NOP     168
PTC     168
PRO     168
MMT     168
NOK     164
HSM     159
TRI     135
WSA      91
MMS      89
CAP      89
TEX      89
MNM      88
MNP      85
CHO      82
NYN      82
CHP      80
CHZ      80
ANA      78
NJA      78
AND      72
SHE      65
CLR      63
DNN      62
WAT      62
INJ      60
DTF      60
PIT      60
TRH      60
SDS      11
Name: team_id, Length: 104, dtype: int64
In [19]:
nba["fran_id"].value_counts()
Out[19]:
Lakers          6024
Celtics         5997
Knicks          5769
Warriors        5657
Pistons         5650
Sixers          5644
Hawks           5572
Kings           5475
Wizards         4582
Spurs           4309
Bulls           4307
Pacers          4227
Thunder         4178
Rockets         4154
Nuggets         4120
Nets            4106
Suns            4080
Bucks           4034
Trailblazers    3870
Cavaliers       3810
Clippers        3733
Jazz            3555
Mavericks       3013
Heat            2371
Pelicans        2254
Magic           2207
Timberwolves    2131
Grizzlies       1657
Raptors         1634
Hornets          894
Colonels         846
Squires          799
Spirits          777
Stars            756
Sounds           697
Baltimore        467
Floridians       440
Condors          430
Capitols         291
Olympians        282
Sails            274
Stags            260
Bombers          249
Steamrollers     168
Packers           72
Redskins          65
Rebels            63
Denver            62
Waterloo          62
Huskies           60
Falcons           60
Ironmen           60
Jets              60
Name: fran_id, dtype: int64
In [20]:
nba.loc[nba["fran_id"]=="Lakers", "team_id"].value_counts()
Out[20]:
LAL    5078
MNL     946
Name: team_id, dtype: int64
In [21]:
nba.loc[nba["team_id"]=="MNL", "date_game"].min()
Out[21]:
'1/1/1949'
In [22]:
nba.loc[nba["team_id"]=="MNL", "date_game"].max()
Out[22]:
'4/9/1959'
In [23]:
nba.loc[nba["team_id"]=="MNL","date_game"].agg(("min","max"))
Out[23]:
min    1/1/1949
max    4/9/1959
Name: date_game, dtype: object
In [24]:
# How many points the Boston Celtics have scored
nba.loc[nba["team_id"]=="BOS", "pts"].sum()
Out[24]:
626484

Understanding Data Frame Objects

  • A DataFrame is a sequence of Series objects that share the same index
In [25]:
revenues = pd.Series([5555, 7000, 1980])
print(revenues)
0    5555
1    7000
2    1980
dtype: int64
In [26]:
print(revenues.values) # A sequence of values
[5555 7000 1980]
In [27]:
print(revenues.index) # a sequence of identifiers, which is the index
RangeIndex(start=0, stop=3, step=1)
In [28]:
# A Series can have an arbitrary type of index
# here, the index is a list of city names represented by strings
city_revenues = pd.Series(
    [4200, 8000, 6500], # without an explicit index, a Series resembles a Python list (positional index only)
    index = ["Amsterdam", "Toronto","Tokyo"]
)
print(city_revenues)
Amsterdam    4200
Toronto      8000
Tokyo        6500
dtype: int64
In [29]:
# To construct a Series with a label index from a python dictionary
city_employee_count= pd.Series({"Amsterdam":5, "Tokyo":8})
print(city_employee_count)
Amsterdam    5
Tokyo        8
dtype: int64
In [30]:
# Dictionary keys become the index,
# Dictionary values are the Series values
# keys() and in keyword
print(city_employee_count.keys())
Index(['Amsterdam', 'Tokyo'], dtype='object')
In [31]:
print("Tokyo" in city_employee_count)
True
In [32]:
# We can have 2 series objects with cities as keys
# city_revenues 
# city_employee_count
city_data = pd.DataFrame({
    "revenue": city_revenues,
    "employee_count":city_employee_count
})
print(city_data)
           revenue  employee_count
Amsterdam     4200             5.0
Tokyo         6500             8.0
Toronto       8000             NaN
In [33]:
# now the new DataFrame index is the union of 2 series indices
print(city_data.index)
Index(['Amsterdam', 'Tokyo', 'Toronto'], dtype='object')
In [34]:
# Just like a Series, a DataFrame stores its values in a NumPy array
print(city_data.values)
[[4.2e+03 5.0e+00]
 [6.5e+03 8.0e+00]
 [8.0e+03     nan]]
In [35]:
# We can refer to the 2 dimensions of a dataframe as AXES
print(city_data.axes)
[Index(['Amsterdam', 'Tokyo', 'Toronto'], dtype='object'), Index(['revenue', 'employee_count'], dtype='object')]
In [36]:
print(city_data.axes[0]) # [0] = row index
Index(['Amsterdam', 'Tokyo', 'Toronto'], dtype='object')
In [37]:
print(city_data.axes[1]) # [1] = column index
Index(['revenue', 'employee_count'], dtype='object')
In [38]:
# A DF is also a dictionary-like data structure
# Hence it supports .keys() and in
print(city_data.keys())
Index(['revenue', 'employee_count'], dtype='object')
In [39]:
print("Amsterdam" in city_data)
False
In [40]:
print(city_data["revenue"])
Amsterdam    4200
Tokyo        6500
Toronto      8000
Name: revenue, dtype: int64
In [41]:
print(city_data.revenue)
Amsterdam    4200
Tokyo        6500
Toronto      8000
Name: revenue, dtype: int64

Does the NBA dataset contain a pts or points column?

In [42]:
print(nba.axes)
[RangeIndex(start=0, stop=126314, step=1), Index(['gameorder', 'game_id', 'lg_id', '_iscopy', 'year_id', 'date_game',
       'seasongame', 'is_playoffs', 'team_id', 'fran_id', 'pts', 'elo_i',
       'elo_n', 'win_equiv', 'opp_id', 'opp_fran', 'opp_pts', 'opp_elo_i',
       'opp_elo_n', 'game_location', 'game_result', 'forecast', 'notes'],
      dtype='object')]
In [43]:
print(nba.index)
RangeIndex(start=0, stop=126314, step=1)
In [44]:
print("points" in nba.keys())
False
In [45]:
print("pts" in nba.keys())
True

Accessing Series Elements

Two pandas-specific access methods:

1. .loc  2. .iloc

Using the indexing operator

  • A Series has 2 indices:
    1. A positional or implicit index, which is always a RangeIndex
    2. A label or explicit index, which can contain any hashable objects
In [46]:
print(city_revenues)
Amsterdam    4200
Toronto      8000
Tokyo        6500
dtype: int64
In [47]:
print(city_revenues["Toronto"])
8000
In [48]:
print(city_revenues[1])
8000
In [49]:
print(city_revenues[-1])
6500
In [50]:
print(city_revenues[1:])
Toronto    8000
Tokyo      6500
dtype: int64
In [51]:
print(city_revenues["Toronto":])
Toronto    8000
Tokyo      6500
dtype: int64

The indexing operator ([]) is convenient, but there's a caveat:

what if the labels are also numbers?

In [52]:
colors = pd.Series(
    ["red","purple","blue","green","yellow"],
    index = [1,2,3,5,8]
)
print(colors)
1       red
2    purple
3      blue
5     green
8    yellow
dtype: object
In [53]:
print(colors.loc[1]) #.loc - refers to the label index
red
In [54]:
print(colors.iloc[1]) #.iloc - refers to the positional index
purple
In [55]:
print(colors.iloc[1:3]) # returns the elements at positional indices 1 and 2
2    purple
3      blue
dtype: object
In [56]:
print(colors.loc[3:8]) # returns elements with explicit labels between 3 and 8 (inclusive)
3      blue
5     green
8    yellow
dtype: object

Accessing data frame elements with dot notation may not work or may lead to surprises

  • This happens when a column name coincides with a DataFrame attribute or method name (see the sketch after this example)
In [57]:
toys = pd.DataFrame([
    {"name":"ball", "shape":"sphere"},
    {"name":"Rubik's cube", "shape":"cube"}
])
print(toys["shape"])
0    sphere
1      cube
Name: shape, dtype: object
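The surprise with dot notation: toys.shape resolves to the DataFrame's own .shape attribute instead of the "shape" column, so the indexing operator is the safe choice here. A quick sketch:

print(toys.shape)      # (2, 2) - the DataFrame's dimensions, not the "shape" column
print(toys["shape"])   # the indexing operator still returns the column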

Using .loc and .iloc

  • city_data.loc["Amsterdam"] : selects the row with label index "Amsterdam"

  • city_data.loc["Tokyo":"Tornoto"] : selects the rows with label indices from "Tokyo" to "Toronto"

  • city_data.iloc[1]: selects the row with the positional index 1, which is Tokyo

In [58]:
print(city_data)
           revenue  employee_count
Amsterdam     4200             5.0
Tokyo         6500             8.0
Toronto       8000             NaN
In [59]:
print(city_data.loc["Amsterdam"])
revenue           4200.0
employee_count       5.0
Name: Amsterdam, dtype: float64
In [60]:
print(city_data.loc["Tokyo": "Toronto"])
         revenue  employee_count
Tokyo       6500             8.0
Toronto     8000             NaN
In [61]:
print(city_data.iloc[1])
revenue           6500.0
employee_count       8.0
Name: Tokyo, dtype: float64
In [62]:
## To display the second-to-last row of the nba dataset ##

print(nba.iloc[-2])
gameorder               63157
game_id          201506170CLE
lg_id                     NBA
_iscopy                     0
year_id                  2015
date_game           6/16/2015
seasongame                102
is_playoffs                 1
team_id                   CLE
fran_id             Cavaliers
pts                        97
elo_i                 1.7e+03
elo_n                 1.7e+03
win_equiv                  59
opp_id                    GSW
opp_fran             Warriors
opp_pts                   105
opp_elo_i             1.8e+03
opp_elo_n             1.8e+03
game_location               H
game_result                 L
forecast                 0.48
notes                     NaN
Name: 126312, dtype: object

.loc and .iloc - accept a second parameter

  • First parameter: selects rows based on the indices
  • Second parameter: selects the columns
  • Together, these parameters can be used to select a subset of rows and columns from your DF

city_data.loc["Amsterdam": "Tokyo", "revenue"]

In [63]:
print(city_data.loc["Amsterdam":"Tokyo", "revenue"])
Amsterdam    4200
Tokyo        6500
Name: revenue, dtype: int64
In [64]:
## For the nba dataset, select all games between labels 5555 and 5559 and get only the team and score ##

print(nba.loc[5555:5559,["team_id", "pts"]])
     team_id  pts
5555     FTW   83
5556     BOS   95
5557     NYK   74
5558     ROC   81
5559     SYR   86
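The same selection can be done with .iloc, which takes positional row and column indexers. A sketch, using the column positions from the nba.axes output above (team_id is position 8, pts is position 10); note that an .iloc slice excludes its endpoint, unlike .loc:

# positional rows 5555-5559 (endpoint 5560 excluded), positional columns 8 and 10
print(nba.iloc[5555:5560, [8, 10]])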

Querying the dataset

  • Apart from accessing subsets of a huge dataset based on its indices,
  • we can also select rows based on the values in our dataset to query the data

Create a new DF that contains only games played after 2010

In [65]:
current_decade = nba[nba["year_id"]>2010]
print(current_decade.shape)
(12658, 23)
In [66]:
## select rows where a specific field is not null

games_with_notes = nba[nba["notes"].notnull()]
print(games_with_notes.shape)
(5424, 23)
In [67]:
## we can use string methods too ##
## Get the franchises whose names end with "ers" ##

ers = nba[nba["fran_id"].str.endswith("ers")]
print(ers.shape)
(27797, 23)
In [68]:
## To combine multiple query criteria, use | and & ##

## Search for Baltimore games where both teams scored over 100 points ##
## To see each game only once, exclude duplicates ##

print(nba[
    (nba["_iscopy"]==0) &
    (nba["pts"]>100) &
    (nba["opp_pts"]>100) &
    (nba["team_id"]=="BLB")
])
      gameorder       game_id lg_id  _iscopy  year_id  date_game  seasongame  \
1726        864  194902260BLB   NBA        0     1949  2/26/1949          53   
4890       2446  195301100BLB   NBA        0     1953  1/10/1953          32   
4909       2455  195301140BLB   NBA        0     1953  1/14/1953          34   
5208       2605  195303110BLB   NBA        0     1953  3/11/1953          66   
5825       2913  195402220BLB   NBA        0     1954  2/22/1954          60   

      is_playoffs team_id    fran_id  pts    elo_i    elo_n  win_equiv opp_id  \
1726            0     BLB  Baltimore  114  1421.94  1419.43      38.56    MNL   
4890            0     BLB  Baltimore  126  1328.67  1356.65      25.80    BOS   
4909            0     BLB  Baltimore  104  1349.83  1346.36      24.88    MNL   
5208            0     BLB  Baltimore  107  1284.52  1282.24      19.58    NYK   
5825            0     BLB  Baltimore  110  1303.75  1301.97      20.74    BOS   

     opp_fran  opp_pts  opp_elo_i  opp_elo_n game_location game_result  \
1726   Lakers      115    1637.99    1640.49             H           L   
4890  Celtics      105    1591.14    1563.17             H           W   
4909   Lakers      112    1665.44    1668.91             H           L   
5208   Knicks      113    1649.15    1651.44             H           L   
5825  Celtics      111    1591.49    1593.27             H           L   

      forecast            notes  
1726      0.34              NaN  
4890      0.28              NaN  
4909      0.22              NaN  
5208      0.18     at Boston MA  
5825      0.25  at Worcester MA  
In [69]:
## In the spring of 1992, two teams from LA played a home game on another court.
## query to find both games
## both teams have an ID starting with "LA"
print(nba.info())
print(nba[
    (nba["_iscopy"]==0) &
    (nba["team_id"].str.startswith("LA")) &
    (nba["year_id"]==1992) &
    (nba["notes"].notnull())
])
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 126314 entries, 0 to 126313
Data columns (total 23 columns):
gameorder        126314 non-null int64
game_id          126314 non-null object
lg_id            126314 non-null object
_iscopy          126314 non-null int64
year_id          126314 non-null int64
date_game        126314 non-null object
seasongame       126314 non-null int64
is_playoffs      126314 non-null int64
team_id          126314 non-null object
fran_id          126314 non-null object
pts              126314 non-null int64
elo_i            126314 non-null float64
elo_n            126314 non-null float64
win_equiv        126314 non-null float64
opp_id           126314 non-null object
opp_fran         126314 non-null object
opp_pts          126314 non-null int64
opp_elo_i        126314 non-null float64
opp_elo_n        126314 non-null float64
game_location    126314 non-null object
game_result      126314 non-null object
forecast         126314 non-null float64
notes            5424 non-null object
dtypes: float64(6), int64(7), object(10)
memory usage: 22.2+ MB
None
       gameorder       game_id lg_id  _iscopy  year_id date_game  seasongame  \
68901      34451  199205030LAC   NBA        0     1992  5/3/1992          86   
68903      34452  199205030LAL   NBA        0     1992  5/3/1992          86   

       is_playoffs team_id   fran_id  pts    elo_i   elo_n  win_equiv opp_id  \
68901            1     LAC  Clippers  115  1546.90  1556.0      47.89    UTA   
68903            1     LAL    Lakers   76  1489.65  1474.4      39.40    POR   

           opp_fran  opp_pts  opp_elo_i  opp_elo_n game_location game_result  \
68901          Jazz      107    1647.51    1638.41             H           W   
68903  Trailblazers      102    1655.12    1670.37             H           L   

       forecast                                       notes  
68901      0.50  at Anaheim, CA (Anaheim Convention Center)  
68903      0.41                            at Las Vegas, NV  

Grouping and Aggregating Data

Sum, mean, or max value of a group of elements

In [70]:
print(city_revenues.sum())
18700
In [71]:
print(city_revenues.max())
8000
In [72]:
points = nba["pts"]
print(points.sum())
12976235
In [73]:
## passing sort=False can lead to performance gains
print(nba.groupby("fran_id", sort=False)["pts"].sum())
fran_id
Huskies           3995
Knicks          582497
Stags            20398
Falcons           3797
Capitols         22387
Celtics         626484
Steamrollers     12372
Ironmen           3674
Bombers          17793
Rebels            4474
Warriors        591224
Baltimore        37219
Jets              4482
Pistons         572758
Lakers          637444
Kings           569245
Hawks           567261
Denver            4818
Olympians        22864
Redskins          5372
Waterloo          4921
Packers           6193
Sixers          585891
Wizards         474809
Bulls           437269
Thunder         437735
Squires          91127
Stars            84940
Rockets         432504
Colonels         94435
Pacers          438288
Nuggets         445780
Spurs           453822
Spirits          85874
Sounds           75582
Floridians       49568
Nets            417809
Condors          49642
Bucks           418326
Suns            437486
Clippers        380523
Cavaliers       380416
Trailblazers    402695
Sails            30080
Jazz            363155
Mavericks       309239
Pelicans        220794
Heat            229103
Timberwolves    207693
Magic           219436
Grizzlies       157683
Raptors         158370
Hornets          84489
Name: pts, dtype: int64
In [74]:
## grouping by multiple columns

nba[
    (nba["fran_id"]=="Spurs") &
    (nba["year_id"] > 2010)
].groupby(["year_id", "game_result"])["game_id"].count()
Out[74]:
year_id  game_result
2011     L              25
         W              63
2012     L              20
         W              60
2013     L              30
         W              73
2014     L              27
         W              78
2015     L              31
         W              58
Name: game_id, dtype: int64
In [75]:
# Take a look at the GSW 2014-15 season (year_id = 2015)
# How many wins and losses did they have during the regular season and the playoffs?

nba[
    (nba["fran_id"]=="Warriors") &
    (nba["year_id"]==2015)
].groupby(["is_playoffs", "game_result"])["game_id"].count()
Out[75]:
is_playoffs  game_result
0            L              15
             W              67
1            L               5
             W              16
Name: game_id, dtype: int64

Manipulating DF columns

In [76]:
# create a copy of your DF to work with
df = nba.copy()
df.shape
Out[76]:
(126314, 23)
In [77]:
df["difference"]=df.pts - df.opp_pts
df.shape
Out[77]:
(126314, 24)
In [78]:
df["difference"].max()
Out[78]:
68
In [79]:
# rename game_result and game_location
renamed_df = df.rename(
    columns = {"game_result":"result", "game_location":"location"}
)
renamed_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 126314 entries, 0 to 126313
Data columns (total 24 columns):
gameorder      126314 non-null int64
game_id        126314 non-null object
lg_id          126314 non-null object
_iscopy        126314 non-null int64
year_id        126314 non-null int64
date_game      126314 non-null object
seasongame     126314 non-null int64
is_playoffs    126314 non-null int64
team_id        126314 non-null object
fran_id        126314 non-null object
pts            126314 non-null int64
elo_i          126314 non-null float64
elo_n          126314 non-null float64
win_equiv      126314 non-null float64
opp_id         126314 non-null object
opp_fran       126314 non-null object
opp_pts        126314 non-null int64
opp_elo_i      126314 non-null float64
opp_elo_n      126314 non-null float64
location       126314 non-null object
result         126314 non-null object
forecast       126314 non-null float64
notes          5424 non-null object
difference     126314 non-null int64
dtypes: float64(6), int64(8), object(10)
memory usage: 23.1+ MB
In [80]:
## Delete unwanted columns
## Ex: delete the 4 columns related to Elo ratings
df.shape
Out[80]:
(126314, 24)
In [81]:
elo_columns = ["elo_i", "elo_n", "opp_elo_i", "opp_elo_n"]
df.drop(elo_columns, inplace=True, axis=1)
df.shape
Out[81]:
(126314, 20)

Specifying Data types

  • Pandas does a pretty good job of inferring the schema, but it's not perfect
  • Performance can be improved if we select the right data types for our columns upfront (see the read_csv sketch below)
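A sketch of the "upfront" approach, assuming the same CSV file as above: pd.read_csv can be told the column dtypes and date columns while reading, instead of converting them afterwards.

nba_typed = pd.read_csv(
    "nba_all_elo.csv",
    dtype={"game_location": "category", "game_result": "category"},  # store as categoricals
    parse_dates=["date_game"],                                        # parse dates on load
)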

df.info()

In [82]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 126314 entries, 0 to 126313
Data columns (total 20 columns):
gameorder        126314 non-null int64
game_id          126314 non-null object
lg_id            126314 non-null object
_iscopy          126314 non-null int64
year_id          126314 non-null int64
date_game        126314 non-null object
seasongame       126314 non-null int64
is_playoffs      126314 non-null int64
team_id          126314 non-null object
fran_id          126314 non-null object
pts              126314 non-null int64
win_equiv        126314 non-null float64
opp_id           126314 non-null object
opp_fran         126314 non-null object
opp_pts          126314 non-null int64
game_location    126314 non-null object
game_result      126314 non-null object
forecast         126314 non-null float64
notes            5424 non-null object
difference       126314 non-null int64
dtypes: float64(2), int64(8), object(10)
memory usage: 19.3+ MB
  • 10 columns have the data type object
  • Most of these object columns contain arbitrary text,
  • but some are candidates for data type conversion
In [83]:
## use pd.to_datetime() to store all game dates as datetime objects
df["date_game"] = pd.to_datetime(df["date_game"])
In [84]:
df["game_location"].nunique()
Out[84]:
3
In [85]:
df["game_location"].value_counts()
Out[85]:
H    63138
A    63138
N       38
Name: game_location, dtype: int64
  • But what data type should we use for the game_location column?
  • In a relational database, we would probably not use a varchar type but rather an enum
  • Pandas provides categorical data for the same purpose

Advantages of categorical data over unstructured text

  • It makes validation easy
  • Saves a ton of memory, as pandas stores only the unique values internally (a rough check follows below)
  • The higher the ratio of total values to unique values, the more space we save
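A rough check of the savings (sketch), run before the conversion below: compare the object column's memory footprint with a categorical copy.

before = df["game_location"].memory_usage(deep=True)                    # object dtype
after = df["game_location"].astype("category").memory_usage(deep=True)  # categorical copy
print(before, after)  # the categorical version should be far smaller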
In [86]:
df["game_location"] = pd.Categorical(df["game_location"])
In [87]:
df["game_location"].dtype
Out[87]:
CategoricalDtype(categories=['A', 'H', 'N'], ordered=False)
In [88]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 126314 entries, 0 to 126313
Data columns (total 20 columns):
gameorder        126314 non-null int64
game_id          126314 non-null object
lg_id            126314 non-null object
_iscopy          126314 non-null int64
year_id          126314 non-null int64
date_game        126314 non-null datetime64[ns]
seasongame       126314 non-null int64
is_playoffs      126314 non-null int64
team_id          126314 non-null object
fran_id          126314 non-null object
pts              126314 non-null int64
win_equiv        126314 non-null float64
opp_id           126314 non-null object
opp_fran         126314 non-null object
opp_pts          126314 non-null int64
game_location    126314 non-null category
game_result      126314 non-null object
forecast         126314 non-null float64
notes            5424 non-null object
difference       126314 non-null int64
dtypes: category(1), datetime64[ns](1), float64(2), int64(8), object(8)
memory usage: 18.4+ MB
In [89]:
## changing the game_result column
df["game_result"].nunique()
Out[89]:
2
In [90]:
df["game_result"].value_counts()
Out[90]:
L    63157
W    63157
Name: game_result, dtype: int64
In [91]:
# to improve performance, convert the game_result column into a categorical column
df["game_result"] = pd.Categorical(df["game_result"])
In [92]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 126314 entries, 0 to 126313
Data columns (total 20 columns):
gameorder        126314 non-null int64
game_id          126314 non-null object
lg_id            126314 non-null object
_iscopy          126314 non-null int64
year_id          126314 non-null int64
date_game        126314 non-null datetime64[ns]
seasongame       126314 non-null int64
is_playoffs      126314 non-null int64
team_id          126314 non-null object
fran_id          126314 non-null object
pts              126314 non-null int64
win_equiv        126314 non-null float64
opp_id           126314 non-null object
opp_fran         126314 non-null object
opp_pts          126314 non-null int64
game_location    126314 non-null category
game_result      126314 non-null category
forecast         126314 non-null float64
notes            5424 non-null object
difference       126314 non-null int64
dtypes: category(2), datetime64[ns](1), float64(2), int64(8), object(7)
memory usage: 17.6+ MB

Cleaning Data

  • MISSING VALUES:

    • .info() shows how many non-null values each column has
    • null values often indicate a problem in the data-gathering process
    • they can make some analyses impossible

    • EASIEST WAYS TO DEAL WITH MISSING DATA:

    • dropna() - drops the rows (or, with axis=1, the columns) that contain missing values
    • replace them with default values using fillna()
  • INVALID VALUES:

    • can be more dangerous than missing values
    • often more challenging to detect, especially if the dataset is enormous or relied on manual entry
    • but we can implement some sanity checks with queries and aggregations
    • use describe()
  • INCONSISTENT VALUES:
    • a value may look fine on its own but not fit with values in other columns
    • we can define query criteria that are mutually exclusive and verify that they don't occur together
    • check whether the columns are consistent using the .empty attribute
In [93]:
nba.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 126314 entries, 0 to 126313
Data columns (total 23 columns):
gameorder        126314 non-null int64
game_id          126314 non-null object
lg_id            126314 non-null object
_iscopy          126314 non-null int64
year_id          126314 non-null int64
date_game        126314 non-null object
seasongame       126314 non-null int64
is_playoffs      126314 non-null int64
team_id          126314 non-null object
fran_id          126314 non-null object
pts              126314 non-null int64
elo_i            126314 non-null float64
elo_n            126314 non-null float64
win_equiv        126314 non-null float64
opp_id           126314 non-null object
opp_fran         126314 non-null object
opp_pts          126314 non-null int64
opp_elo_i        126314 non-null float64
opp_elo_n        126314 non-null float64
game_location    126314 non-null object
game_result      126314 non-null object
forecast         126314 non-null float64
notes            5424 non-null object
dtypes: float64(6), int64(7), object(10)
memory usage: 22.2+ MB
In [94]:
rows_without_missing_data = nba.dropna()
rows_without_missing_data.shape
Out[94]:
(5424, 23)
In [95]:
# We can also drop problematic columns that are not relevant for our analysis
# to do this, use dropna() again and provide axis = 1
# the result won't contain the sometimes-empty notes column

data_without_missing_columns = nba.dropna(axis=1)
data_without_missing_columns.shape
Out[95]:
(126314, 22)
In [96]:
# replacing with default values

data_with_default_notes = nba.copy()
data_with_default_notes["notes"].fillna(
    value="no notes at all",
    inplace=True
)

data_with_default_notes["notes"].describe()
Out[96]:
count              126314
unique                232
top       no notes at all
freq               120890
Name: notes, dtype: object
In [97]:
# INVALID VALUES
nba.describe()
Out[97]:
gameorder _iscopy year_id seasongame is_playoffs pts elo_i elo_n win_equiv opp_pts opp_elo_i opp_elo_n forecast
count 126314.00 126314.0 126314.00 126314.00 126314.00 126314.00 126314.00 126314.00 126314.00 126314.00 126314.00 126314.00 126314.00
mean 31579.00 0.5 1988.20 43.53 0.06 102.73 1495.24 1495.24 41.71 102.73 1495.24 1495.24 0.50
std 18231.93 0.5 17.58 25.38 0.24 14.81 112.14 112.46 10.63 14.81 112.14 112.46 0.22
min 1.00 0.0 1947.00 1.00 0.00 0.00 1091.64 1085.77 10.15 0.00 1091.64 1085.77 0.02
25% 15790.00 0.0 1975.00 22.00 0.00 93.00 1417.24 1416.99 34.10 93.00 1417.24 1416.99 0.33
50% 31579.00 0.5 1990.00 43.00 0.00 103.00 1500.95 1500.95 42.11 103.00 1500.95 1500.95 0.50
75% 47368.00 1.0 2003.00 65.00 0.00 112.00 1576.06 1576.29 49.64 112.00 1576.06 1576.29 0.67
max 63157.00 1.0 2015.00 108.00 1.00 186.00 1853.10 1853.10 71.11 186.00 1853.10 1853.10 0.98
In [98]:
# the years range sensibly from 1947 to 2015
# but can pts really be 0?
nba[nba["pts"]==0] # this game was forfeited, so we may want to remove it from the dataset
Out[98]:
gameorder game_id lg_id _iscopy year_id date_game seasongame is_playoffs team_id fran_id pts elo_i elo_n win_equiv opp_id opp_fran opp_pts opp_elo_i opp_elo_n game_location game_result forecast notes
26684 13343 197210260VIR ABA 1 1973 10/26/1972 7 0 DNR Nuggets 0 1460.34 1457.45 40.41 VIR Squires 2 1484.19 1487.08 A L 0.33 at Richmond VA; forfeit to VIR
In [99]:
# the values of the fields pts, opp_pts and game_result should be consistent
# so check with .empty attribute

nba[(nba["pts"] > nba["opp_pts"]) & (nba["game_result"] != 'W')].empty
Out[99]:
True
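The complementary, mutually exclusive criterion can be checked the same way (sketch): a team that scored fewer points should never be recorded as the winner.

# should also be True if the dataset is consistent
nba[(nba["pts"] < nba["opp_pts"]) & (nba["game_result"] != 'L')].empty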

Combining Multiple Datasets

In [100]:
further_city_data = pd.DataFrame(
    {"revenue":[7000,3400], "employee_count":[2,2]},
    index=["New York","Barcelona"]
)
In [101]:
city_data.info()
<class 'pandas.core.frame.DataFrame'>
Index: 3 entries, Amsterdam to Toronto
Data columns (total 2 columns):
revenue           3 non-null int64
employee_count    2 non-null float64
dtypes: float64(1), int64(1)
memory usage: 152.0+ bytes
In [102]:
further_city_data.info()
<class 'pandas.core.frame.DataFrame'>
Index: 2 entries, New York to Barcelona
Data columns (total 2 columns):
revenue           2 non-null int64
employee_count    2 non-null int64
dtypes: int64(2)
memory usage: 48.0+ bytes
In [103]:
# add these cities to city_data using .concat()
# By default, .concat() combines along axis=0, i.e., it appends rows
all_city_data = pd.concat([city_data, further_city_data], sort=False)
all_city_data
Out[103]:
revenue employee_count
Amsterdam 4200 5.0
Tokyo 6500 8.0
Toronto 8000 NaN
New York 7000 2.0
Barcelona 3400 2.0
In [104]:
# use axis=1 to append columns with concat
city_countries = pd.DataFrame({
    "country": ["Holland","Japan","Holland","Canada","Spain"],
    "capital":[1,1,0,0,0]},
    index = ["Amsterdam", "Tokyo", "Rotterdam", "Toronto", "Barcelona"]
)
cities = pd.concat([all_city_data, city_countries], axis=1, sort=False)
cities
Out[104]:
revenue employee_count country capital
Amsterdam 4200.0 5.0 Holland 1.0
Tokyo 6500.0 8.0 Japan 1.0
Toronto 8000.0 NaN Canada 0.0
New York 7000.0 2.0 NaN NaN
Barcelona 3400.0 2.0 Spain 0.0
Rotterdam NaN NaN Holland 0.0
In [105]:
# To combine only the cities that appear in both DataFrame objects, use an inner join

pd.concat([all_city_data, city_countries], axis=1, join="inner")
Out[105]:
revenue employee_count country capital
Amsterdam 4200 5.0 Holland 1
Tokyo 6500 8.0 Japan 1
Toronto 8000 NaN Canada 0
Barcelona 3400 2.0 Spain 0
In [106]:
# If we do not want to combine based on index, use merge()

countries = pd.DataFrame({
    "population_millions": [17,127,37],
    "continent": ["Europe", "Asia", "North America"]
}, index=["Holland", "Japan", "Canada"])

pd.merge(cities, countries, left_on="country", right_index=True)
Out[106]:
revenue employee_count country capital population_millions continent
Amsterdam 4200.0 5.0 Holland 1.0 17 Europe
Rotterdam NaN NaN Holland 0.0 17 Europe
Tokyo 6500.0 8.0 Japan 1.0 127 Asia
Toronto 8000.0 NaN Canada 0.0 37 North America
In [107]:
cities
Out[107]:
revenue employee_count country capital
Amsterdam 4200.0 5.0 Holland 1.0
Tokyo 6500.0 8.0 Japan 1.0
Toronto 8000.0 NaN Canada 0.0
New York 7000.0 2.0 NaN NaN
Barcelona 3400.0 2.0 Spain 0.0
Rotterdam NaN NaN Holland 0.0
In [108]:
countries
Out[108]:
population_millions continent
Holland 17 Europe
Japan 127 Asia
Canada 37 North America
In [109]:
# merge - performs inner join by default
# change it using the how parameter

pd.merge(
    cities,
    countries,
    left_on="country",
    right_index=True,
    how='left'
)
Out[109]:
revenue employee_count country capital population_millions continent
Amsterdam 4200.0 5.0 Holland 1.0 17.0 Europe
Tokyo 6500.0 8.0 Japan 1.0 127.0 Asia
Toronto 8000.0 NaN Canada 0.0 37.0 North America
New York 7000.0 2.0 NaN NaN NaN NaN
Barcelona 3400.0 2.0 Spain 0.0 NaN NaN
Rotterdam NaN NaN Holland 0.0 17.0 Europe
In [110]:
import sys  
!{sys.executable} -m pip install --user matplotlib
Requirement already satisfied: matplotlib in /Users/ladusu863/Library/Python/3.7/lib/python/site-packages (3.1.2)
Requirement already satisfied: numpy>=1.11 in /usr/local/lib/python3.7/site-packages (from matplotlib) (1.16.4)
Requirement already satisfied: python-dateutil>=2.1 in /usr/local/lib/python3.7/site-packages (from matplotlib) (2.8.0)
Requirement already satisfied: cycler>=0.10 in /Users/ladusu863/Library/Python/3.7/lib/python/site-packages (from matplotlib) (0.10.0)
Requirement already satisfied: pyparsing!=2.0.4,!=2.1.2,!=2.1.6,>=2.0.1 in /usr/local/lib/python3.7/site-packages (from matplotlib) (2.4.1.1)
Requirement already satisfied: kiwisolver>=1.0.1 in /Users/ladusu863/Library/Python/3.7/lib/python/site-packages (from matplotlib) (1.1.0)
Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.7/site-packages (from python-dateutil>=2.1->matplotlib) (1.12.0)
Requirement already satisfied: setuptools in /usr/local/lib/python3.7/site-packages (from kiwisolver>=1.0.1->matplotlib) (41.0.1)

Visualizing Data Frame

  • use %matplotlib inline
  • Both Series and DataFrame objects have a .plot() method (a wrapper around matplotlib.pyplot.plot()) - see the sketch below
  • Default: line plot
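Because .plot() is only a wrapper, it forwards keyword arguments such as figsize and title straight to matplotlib. A small sketch, reusing the Knicks grouping that also appears in the cells below:

nba[nba["fran_id"] == "Knicks"].groupby("year_id")["pts"].sum().plot(
    figsize=(8, 4),                      # figure size in inches
    title="Knicks points per season",    # title passed through to matplotlib
)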
In [111]:
%matplotlib inline
In [112]:
# Visualize how many points the Knicks scored throughout the seasons
nba[nba["fran_id"]=="Knicks"].groupby("year_id")["pts"].sum().plot()
Out[112]:
<matplotlib.axes._subplots.AxesSubplot at 0x11be9b438>
In [113]:
# To create a bar plot
nba["fran_id"].value_counts().head(10).plot(kind="bar")
Out[113]:
<matplotlib.axes._subplots.AxesSubplot at 0x11be1d6d8>
In [114]:
# In 2013, the Heat won the championship
# create a pie chart showing the counts of wins and losses during that season

nba[
    (nba["fran_id"] == "Heat") &
    (nba["year_id"] == 2013)
]["game_result"].value_counts().plot(kind="pie")
Out[114]:
<matplotlib.axes._subplots.AxesSubplot at 0x11ce11128>
