Monday, December 23, 2019

nba_analysis_with_pandas (Source: RealPython)

exploring_with_pandas_real_python_example

Downloading data from the web

  • use the requests module
  • response = requests.get(url)
  • response.raise_for_status() - checks whether the request succeeded
  • download it to a file by writing the response content to it
with open(target_csv_path, "wb") as f:
    f.write(response.content)
In [6]:
import requests

download_url = "https://raw.githubusercontent.com/fivethirtyeight/data/master/nba-elo/nbaallelo.csv"
target_csv_path = "nba_all_elo.csv"

response = requests.get(download_url)
response.raise_for_status() # check the request was successful

with open(target_csv_path, "wb") as f:
    f.write(response.content)

print ("Download ready")
Download ready

Reading a CSV file using pandas

In [7]:
import pandas as pd

nba = pd.read_csv("nba_all_elo.csv")
type(nba)
Out[7]:
pandas.core.frame.DataFrame

len(df) - gives us the number of rows in df

In [8]:
len(nba) # to determine number of rows - len()
Out[8]:
126314

df.shape - to see the number of rows and columns

In [9]:
nba.shape # dimensionality: a tuple of (rows, columns)
Out[9]:
(126314, 23)

To see the first few rows of the data frame

In [10]:
nba.head()
Out[10]:
gameorder game_id lg_id _iscopy year_id date_game seasongame is_playoffs team_id fran_id ... win_equiv opp_id opp_fran opp_pts opp_elo_i opp_elo_n game_location game_result forecast notes
0 1 194611010TRH NBA 0 1947 11/1/1946 1 0 TRH Huskies ... 40.294830 NYK Knicks 68 1300.0000 1306.7233 H L 0.640065 NaN
1 1 194611010TRH NBA 1 1947 11/1/1946 1 0 NYK Knicks ... 41.705170 TRH Huskies 66 1300.0000 1293.2767 A W 0.359935 NaN
2 2 194611020CHS NBA 0 1947 11/2/1946 1 0 CHS Stags ... 42.012257 NYK Knicks 47 1306.7233 1297.0712 H W 0.631101 NaN
3 2 194611020CHS NBA 1 1947 11/2/1946 2 0 NYK Knicks ... 40.692783 CHS Stags 63 1300.0000 1309.6521 A L 0.368899 NaN
4 3 194611020DTF NBA 0 1947 11/2/1946 1 0 DTF Falcons ... 38.864048 WSC Capitols 50 1300.0000 1320.3811 H L 0.640065 NaN

5 rows × 23 columns

To see all the columns on the terminal

In [11]:
pd.set_option("display.max.columns", None) # to see all the 23 columns
nba.head()
Out[11]:
gameorder game_id lg_id _iscopy year_id date_game seasongame is_playoffs team_id fran_id pts elo_i elo_n win_equiv opp_id opp_fran opp_pts opp_elo_i opp_elo_n game_location game_result forecast notes
0 1 194611010TRH NBA 0 1947 11/1/1946 1 0 TRH Huskies 66 1300.0000 1293.2767 40.294830 NYK Knicks 68 1300.0000 1306.7233 H L 0.640065 NaN
1 1 194611010TRH NBA 1 1947 11/1/1946 1 0 NYK Knicks 68 1300.0000 1306.7233 41.705170 TRH Huskies 66 1300.0000 1293.2767 A W 0.359935 NaN
2 2 194611020CHS NBA 0 1947 11/2/1946 1 0 CHS Stags 63 1300.0000 1309.6521 42.012257 NYK Knicks 47 1306.7233 1297.0712 H W 0.631101 NaN
3 2 194611020CHS NBA 1 1947 11/2/1946 2 0 NYK Knicks 47 1306.7233 1297.0712 40.692783 CHS Stags 63 1300.0000 1309.6521 A L 0.368899 NaN
4 3 194611020DTF NBA 0 1947 11/2/1946 1 0 DTF Falcons 33 1300.0000 1279.6189 38.864048 WSC Capitols 50 1300.0000 1320.3811 H L 0.640065 NaN

To set the precision value for decimal fields

In [12]:
pd.set_option("display.precision", 2) #we dont need 6 decimal places
nba.head()
Out[12]:
gameorder game_id lg_id _iscopy year_id date_game seasongame is_playoffs team_id fran_id pts elo_i elo_n win_equiv opp_id opp_fran opp_pts opp_elo_i opp_elo_n game_location game_result forecast notes
0 1 194611010TRH NBA 0 1947 11/1/1946 1 0 TRH Huskies 66 1300.00 1293.28 40.29 NYK Knicks 68 1300.00 1306.72 H L 0.64 NaN
1 1 194611010TRH NBA 1 1947 11/1/1946 1 0 NYK Knicks 68 1300.00 1306.72 41.71 TRH Huskies 66 1300.00 1293.28 A W 0.36 NaN
2 2 194611020CHS NBA 0 1947 11/2/1946 1 0 CHS Stags 63 1300.00 1309.65 42.01 NYK Knicks 47 1306.72 1297.07 H W 0.63 NaN
3 2 194611020CHS NBA 1 1947 11/2/1946 2 0 NYK Knicks 47 1306.72 1297.07 40.69 CHS Stags 63 1300.00 1309.65 A L 0.37 NaN
4 3 194611020DTF NBA 0 1947 11/2/1946 1 0 DTF Falcons 33 1300.00 1279.62 38.86 WSC Capitols 50 1300.00 1320.38 H L 0.64 NaN

To see the last rows of the data frame

In [13]:
nba.tail()
Out[13]:
gameorder game_id lg_id _iscopy year_id date_game seasongame is_playoffs team_id fran_id pts elo_i elo_n win_equiv opp_id opp_fran opp_pts opp_elo_i opp_elo_n game_location game_result forecast notes
126309 63155 201506110CLE NBA 0 2015 6/11/2015 100 1 CLE Cavaliers 82 1723.41 1704.39 60.31 GSW Warriors 103 1790.96 1809.98 H L 0.55 NaN
126310 63156 201506140GSW NBA 0 2015 6/14/2015 102 1 GSW Warriors 104 1809.98 1813.63 68.01 CLE Cavaliers 91 1704.39 1700.74 H W 0.77 NaN
126311 63156 201506140GSW NBA 1 2015 6/14/2015 101 1 CLE Cavaliers 91 1704.39 1700.74 60.01 GSW Warriors 104 1809.98 1813.63 A L 0.23 NaN
126312 63157 201506170CLE NBA 0 2015 6/16/2015 102 1 CLE Cavaliers 97 1700.74 1692.09 59.29 GSW Warriors 105 1813.63 1822.29 H L 0.48 NaN
126313 63157 201506170CLE NBA 1 2015 6/16/2015 103 1 GSW Warriors 105 1813.63 1822.29 68.52 CLE Cavaliers 97 1700.74 1692.09 A W 0.52 NaN
In [14]:
nba.tail(3) # to get the last 3 rows of nba
Out[14]:
gameorder game_id lg_id _iscopy year_id date_game seasongame is_playoffs team_id fran_id pts elo_i elo_n win_equiv opp_id opp_fran opp_pts opp_elo_i opp_elo_n game_location game_result forecast notes
126311 63156 201506140GSW NBA 1 2015 6/14/2015 101 1 CLE Cavaliers 91 1704.39 1700.74 60.01 GSW Warriors 104 1809.98 1813.63 A L 0.23 NaN
126312 63157 201506170CLE NBA 0 2015 6/16/2015 102 1 CLE Cavaliers 97 1700.74 1692.09 59.29 GSW Warriors 105 1813.63 1822.29 H L 0.48 NaN
126313 63157 201506170CLE NBA 1 2015 6/16/2015 103 1 GSW Warriors 105 1813.63 1822.29 68.52 CLE Cavaliers 97 1700.74 1692.09 A W 0.52 NaN

To check the data types and columns - df.info()

In [15]:
nba.info() # discover data types
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 126314 entries, 0 to 126313
Data columns (total 23 columns):
gameorder        126314 non-null int64
game_id          126314 non-null object
lg_id            126314 non-null object
_iscopy          126314 non-null int64
year_id          126314 non-null int64
date_game        126314 non-null object
seasongame       126314 non-null int64
is_playoffs      126314 non-null int64
team_id          126314 non-null object
fran_id          126314 non-null object
pts              126314 non-null int64
elo_i            126314 non-null float64
elo_n            126314 non-null float64
win_equiv        126314 non-null float64
opp_id           126314 non-null object
opp_fran         126314 non-null object
opp_pts          126314 non-null int64
opp_elo_i        126314 non-null float64
opp_elo_n        126314 non-null float64
game_location    126314 non-null object
game_result      126314 non-null object
forecast         126314 non-null float64
notes            5424 non-null object
dtypes: float64(6), int64(7), object(10)
memory usage: 22.2+ MB
  • The object type in pandas is "a catch-all for columns that pandas didn't recognize as any other specific type"
  • In practice, it often means that all of the values in the column are strings (a quick check follows below)
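A quick check (sketch) of what the object type means here: the game_id column reports the object dtype, and its individual values are ordinary Python strings.

print(nba["game_id"].dtype)           # object
print(type(nba["game_id"].iloc[0]))   # <class 'str'>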

Showing basic statistics of a data frame - describe()

  • default: describe() only analyzes numeric columns
  • we can include other data types by using the include parameter
In [16]:
nba.describe()
Out[16]:
gameorder _iscopy year_id seasongame is_playoffs pts elo_i elo_n win_equiv opp_pts opp_elo_i opp_elo_n forecast
count 126314.00 126314.0 126314.00 126314.00 126314.00 126314.00 126314.00 126314.00 126314.00 126314.00 126314.00 126314.00 126314.00
mean 31579.00 0.5 1988.20 43.53 0.06 102.73 1495.24 1495.24 41.71 102.73 1495.24 1495.24 0.50
std 18231.93 0.5 17.58 25.38 0.24 14.81 112.14 112.46 10.63 14.81 112.14 112.46 0.22
min 1.00 0.0 1947.00 1.00 0.00 0.00 1091.64 1085.77 10.15 0.00 1091.64 1085.77 0.02
25% 15790.00 0.0 1975.00 22.00 0.00 93.00 1417.24 1416.99 34.10 93.00 1417.24 1416.99 0.33
50% 31579.00 0.5 1990.00 43.00 0.00 103.00 1500.95 1500.95 42.11 103.00 1500.95 1500.95 0.50
75% 47368.00 1.0 2003.00 65.00 0.00 112.00 1576.06 1576.29 49.64 112.00 1576.06 1576.29 0.67
max 63157.00 1.0 2015.00 108.00 1.00 186.00 1853.10 1853.10 71.11 186.00 1853.10 1853.10 0.98
In [17]:
import numpy as np
nba.describe(include = np.object)
Out[17]:
game_id lg_id date_game team_id fran_id opp_id opp_fran game_location game_result notes
count 126314 126314 126314 126314 126314 126314 126314 126314 126314 5424
unique 63157 2 12426 104 53 104 53 3 2 231
top 200412270MIA NBA 4/17/2013 BOS Lakers BOS Lakers H L at New York NY
freq 2 118016 30 5997 6024 5997 6024 63138 63157 440

Exploratory data analysis - to answer more questions about our dataset

  • value_counts() - to examine how often specific values occur in a column
In [18]:
nba["team_id"].value_counts()
Out[18]:
BOS    5997
NYK    5769
LAL    5078
DET    4985
PHI    4533
CHI    4307
PHO    4080
ATL    4035
MIL    4034
POR    3870
HOU    3820
CLE    3810
GSW    3701
SEA    3547
SAS    3515
IND    3364
DEN    3312
UTA    3145
DAL    3013
NJN    2939
LAC    2563
SAC    2488
MIA    2371
ORL    2207
MIN    2131
WSB    1992
TOR    1634
WAS    1475
CIN    1230
MEM    1197
       ... 
OAK     172
NOP     168
PTC     168
PRO     168
MMT     168
NOK     164
HSM     159
TRI     135
WSA      91
MMS      89
CAP      89
TEX      89
MNM      88
MNP      85
CHO      82
NYN      82
CHP      80
CHZ      80
ANA      78
NJA      78
AND      72
SHE      65
CLR      63
DNN      62
WAT      62
INJ      60
DTF      60
PIT      60
TRH      60
SDS      11
Name: team_id, Length: 104, dtype: int64
In [19]:
nba["fran_id"].value_counts()
Out[19]:
Lakers          6024
Celtics         5997
Knicks          5769
Warriors        5657
Pistons         5650
Sixers          5644
Hawks           5572
Kings           5475
Wizards         4582
Spurs           4309
Bulls           4307
Pacers          4227
Thunder         4178
Rockets         4154
Nuggets         4120
Nets            4106
Suns            4080
Bucks           4034
Trailblazers    3870
Cavaliers       3810
Clippers        3733
Jazz            3555
Mavericks       3013
Heat            2371
Pelicans        2254
Magic           2207
Timberwolves    2131
Grizzlies       1657
Raptors         1634
Hornets          894
Colonels         846
Squires          799
Spirits          777
Stars            756
Sounds           697
Baltimore        467
Floridians       440
Condors          430
Capitols         291
Olympians        282
Sails            274
Stags            260
Bombers          249
Steamrollers     168
Packers           72
Redskins          65
Rebels            63
Denver            62
Waterloo          62
Huskies           60
Falcons           60
Ironmen           60
Jets              60
Name: fran_id, dtype: int64
In [20]:
nba.loc[nba["fran_id"]=="Lakers", "team_id"].value_counts()
Out[20]:
LAL    5078
MNL     946
Name: team_id, dtype: int64
In [21]:
nba.loc[nba["team_id"]=="MNL", "date_game"].min()
Out[21]:
'1/1/1949'
In [22]:
nba.loc[nba["team_id"]=="MNL", "date_game"].max()
Out[22]:
'4/9/1959'
In [23]:
nba.loc[nba["team_id"]=="MNL","date_game"].agg(("min","max"))
Out[23]:
min    1/1/1949
max    4/9/1959
Name: date_game, dtype: object
In [24]:
# How many points the Boston Celtics have scored
nba.loc[nba["team_id"]=="BOS", "pts"].sum()
Out[24]:
626484

Understanding Data Frame Objects

  • A DataFrame is a sequence of Series objects that share the same index
In [25]:
revenues = pd.Series([5555, 7000, 1980])
print(revenues)
0    5555
1    7000
2    1980
dtype: int64
In [26]:
print(revenues.values) # A sequence of values
[5555 7000 1980]
In [27]:
print(revenues.index) # a sequence of identifiers, which is the index
RangeIndex(start=0, stop=3, step=1)
In [28]:
# A Series can have an arbitrary type of index
# here, the index is a list of city names represented by strings
city_revenues = pd.Series(
    [4200, 8000, 6500], # without an explicit index, a Series resembles a Python list (positional index only)
    index = ["Amsterdam", "Toronto","Tokyo"]
)
print(city_revenues)
Amsterdam    4200
Toronto      8000
Tokyo        6500
dtype: int64
In [29]:
# To construct a Series with a label index from a python dictionary
city_employee_count= pd.Series({"Amsterdam":5, "Tokyo":8})
print(city_employee_count)
Amsterdam    5
Tokyo        8
dtype: int64
In [30]:
# Dictionary keys become the index,
# Dictionary values are the Series values
# keys() and in keyword
print(city_employee_count.keys())
Index(['Amsterdam', 'Tokyo'], dtype='object')
In [31]:
print("Tokyo" in city_employee_count)
True
In [32]:
# We can have 2 series objects with cities as keys
# city_revenues 
# city_employee_count
city_data = pd.DataFrame({
    "revenue": city_revenues,
    "employee_count":city_employee_count
})
print(city_data)
           revenue  employee_count
Amsterdam     4200             5.0
Tokyo         6500             8.0
Toronto       8000             NaN
In [33]:
# now the new DataFrame index is the union of 2 series indices
print(city_data.index)
Index(['Amsterdam', 'Tokyo', 'Toronto'], dtype='object')
In [34]:
# Just like a Series, a DataFrame stores its values in a NumPy array
print(city_data.values)
[[4.2e+03 5.0e+00]
 [6.5e+03 8.0e+00]
 [8.0e+03     nan]]
In [35]:
# We can refer to the 2 dimensions of a dataframe as AXES
print(city_data.axes)
[Index(['Amsterdam', 'Tokyo', 'Toronto'], dtype='object'), Index(['revenue', 'employee_count'], dtype='object')]
In [36]:
print(city_data.axes[0]) # [0] = row index
Index(['Amsterdam', 'Tokyo', 'Toronto'], dtype='object')
In [37]:
print(city_data.axes[1]) # [1] = column index
Index(['revenue', 'employee_count'], dtype='object')
In [38]:
# A DF is also a dictionary-like data structure
# Hence it supports .keys() and in
print(city_data.keys())
Index(['revenue', 'employee_count'], dtype='object')
In [39]:
print("Amsterdam" in city_data)
False
In [40]:
print(city_data["revenue"])
Amsterdam    4200
Tokyo        6500
Toronto      8000
Name: revenue, dtype: int64
In [41]:
print(city_data.revenue)
Amsterdam    4200
Tokyo        6500
Toronto      8000
Name: revenue, dtype: int64

Does the NBA dataset contain a pts or points column?

In [42]:
print(nba.axes)
[RangeIndex(start=0, stop=126314, step=1), Index(['gameorder', 'game_id', 'lg_id', '_iscopy', 'year_id', 'date_game',
       'seasongame', 'is_playoffs', 'team_id', 'fran_id', 'pts', 'elo_i',
       'elo_n', 'win_equiv', 'opp_id', 'opp_fran', 'opp_pts', 'opp_elo_i',
       'opp_elo_n', 'game_location', 'game_result', 'forecast', 'notes'],
      dtype='object')]
In [43]:
print(nba.index)
RangeIndex(start=0, stop=126314, step=1)
In [44]:
print("points" in nba.keys())
False
In [45]:
print("pts" in nba.keys())
True

Accessing Series Elements

Two pandas-specific access methods:

1. .loc  2. .iloc

Using the indexing operator

  • A Series has 2 indices:
    1. A positional or implicit index, which is always a RangeIndex
    2. A label or explicit index, which can contain any hashable objects
In [46]:
print(city_revenues)
Amsterdam    4200
Toronto      8000
Tokyo        6500
dtype: int64
In [47]:
print(city_revenues["Toronto"])
8000
In [48]:
print(city_revenues[1])
8000
In [49]:
print(city_revenues[-1])
6500
In [50]:
print(city_revenues[1:])
Toronto    8000
Tokyo      6500
dtype: int64
In [51]:
print(city_revenues["Toronto":])
Toronto    8000
Tokyo      6500
dtype: int64

The indexing operator ([]) is convenient, but there's a caveat:

what if the labels are also numbers?

In [52]:
colors = pd.Series(
    ["red","purple","blue","green","yellow"],
    index = [1,2,3,5,8]
)
print(colors)
1       red
2    purple
3      blue
5     green
8    yellow
dtype: object
In [53]:
print(colors.loc[1]) #.loc - refers to the label index
red
In [54]:
print(colors.iloc[1]) #.iloc - refers to the positional index
purple
In [55]:
print(colors.iloc[1:3]) # returns the elements at positional indices 1 and 2
2    purple
3      blue
dtype: object
In [56]:
print(colors.loc[3:8]) # returns elements with explicit labels between 3 and 8 (inclusive)
3      blue
5     green
8    yellow
dtype: object

Accessing data frame elements with dot notation may not work or may lead to surprises

  • This happens when a column name coincides with a DataFrame attribute or method name (see the sketch after this example)
In [57]:
toys = pd.DataFrame([
    {"name":"ball", "shape":"sphere"},
    {"name":"Rubik's cube", "shape":"cube"}
])
print(toys["shape"])
0    sphere
1      cube
Name: shape, dtype: object
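The surprise with dot notation: toys.shape resolves to the DataFrame's own .shape attribute instead of the "shape" column, so the indexing operator is the safe choice here. A quick sketch:

print(toys.shape)      # (2, 2) - the DataFrame's dimensions, not the "shape" column
print(toys["shape"])   # the indexing operator still returns the column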

Using .loc and .iloc

  • city_data.loc["Amsterdam"] : selects the row with label index "Amsterdam"

  • city_data.loc["Tokyo":"Tornoto"] : selects the rows with label indices from "Tokyo" to "Toronto"

  • city_data.iloc[1]: selects the row with the positional index 1, which is Tokyo

In [58]:
print(city_data)
           revenue  employee_count
Amsterdam     4200             5.0
Tokyo         6500             8.0
Toronto       8000             NaN
In [59]:
print(city_data.loc["Amsterdam"])
revenue           4200.0
employee_count       5.0
Name: Amsterdam, dtype: float64
In [60]:
print(city_data.loc["Tokyo": "Toronto"])
         revenue  employee_count
Tokyo       6500             8.0
Toronto     8000             NaN
In [61]:
print(city_data.iloc[1])
revenue           6500.0
employee_count       8.0
Name: Tokyo, dtype: float64
In [62]:
## To display the second-to-last row of the nba dataset ##

print(nba.iloc[-2])
gameorder               63157
game_id          201506170CLE
lg_id                     NBA
_iscopy                     0
year_id                  2015
date_game           6/16/2015
seasongame                102
is_playoffs                 1
team_id                   CLE
fran_id             Cavaliers
pts                        97
elo_i                 1.7e+03
elo_n                 1.7e+03
win_equiv                  59
opp_id                    GSW
opp_fran             Warriors
opp_pts                   105
opp_elo_i             1.8e+03
opp_elo_n             1.8e+03
game_location               H
game_result                 L
forecast                 0.48
notes                     NaN
Name: 126312, dtype: object

.loc and .iloc - accept a second parameter

  • First parameter: selects rows based on the indices
  • Second parameter: selects the columns
  • Together, these parameters can be used to select a subset of rows and columns from your DF

city_data.loc["Amsterdam": "Tokyo", "revenue"]

In [63]:
print(city_data.loc["Amsterdam":"Tokyo", "revenue"])
Amsterdam    4200
Tokyo        6500
Name: revenue, dtype: int64
In [64]:
## For the nba dataset, select all games between labels 5555 and 5559 and get only the team and score ##

print(nba.loc[5555:5559,["team_id", "pts"]])
     team_id  pts
5555     FTW   83
5556     BOS   95
5557     NYK   74
5558     ROC   81
5559     SYR   86
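The same selection can be done with .iloc, which takes positional row and column indexers. A sketch, using the column positions from the nba.axes output above (team_id is position 8, pts is position 10); note that an .iloc slice excludes its endpoint, unlike .loc:

# positional rows 5555-5559 (endpoint 5560 excluded), positional columns 8 and 10
print(nba.iloc[5555:5560, [8, 10]])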

Querying the dataset

  • Apart from accessing subsets of a huge dataset based on its indices,
  • we can also select rows based on the values in our dataset to query the data

Create a new DF that contains only games played after 2010

In [65]:
current_decade = nba[nba["year_id"]>2010]
print(current_decade.shape)
(12658, 23)
In [66]:
## select rows where a specific field is not null

games_with_notes = nba[nba["notes"].notnull()]
print(games_with_notes.shape)
(5424, 23)
In [67]:
## we can use string methods too ##
## Get the franchises whose names end with "ers" ##

ers = nba[nba["fran_id"].str.endswith("ers")]
print(ers.shape)
(27797, 23)
In [68]:
## To combine multiple query criteria, use | and & ##

## Search for Baltimore games where both teams scored over 100 points ##
## To see each game only once, exclude duplicates ##

print(nba[
    (nba["_iscopy"]==0) &
    (nba["pts"]>100) &
    (nba["opp_pts"]>100) &
    (nba["team_id"]=="BLB")
])
      gameorder       game_id lg_id  _iscopy  year_id  date_game  seasongame  \
1726        864  194902260BLB   NBA        0     1949  2/26/1949          53   
4890       2446  195301100BLB   NBA        0     1953  1/10/1953          32   
4909       2455  195301140BLB   NBA        0     1953  1/14/1953          34   
5208       2605  195303110BLB   NBA        0     1953  3/11/1953          66   
5825       2913  195402220BLB   NBA        0     1954  2/22/1954          60   

      is_playoffs team_id    fran_id  pts    elo_i    elo_n  win_equiv opp_id  \
1726            0     BLB  Baltimore  114  1421.94  1419.43      38.56    MNL   
4890            0     BLB  Baltimore  126  1328.67  1356.65      25.80    BOS   
4909            0     BLB  Baltimore  104  1349.83  1346.36      24.88    MNL   
5208            0     BLB  Baltimore  107  1284.52  1282.24      19.58    NYK   
5825            0     BLB  Baltimore  110  1303.75  1301.97      20.74    BOS   

     opp_fran  opp_pts  opp_elo_i  opp_elo_n game_location game_result  \
1726   Lakers      115    1637.99    1640.49             H           L   
4890  Celtics      105    1591.14    1563.17             H           W   
4909   Lakers      112    1665.44    1668.91             H           L   
5208   Knicks      113    1649.15    1651.44             H           L   
5825  Celtics      111    1591.49    1593.27             H           L   

      forecast            notes  
1726      0.34              NaN  
4890      0.28              NaN  
4909      0.22              NaN  
5208      0.18     at Boston MA  
5825      0.25  at Worcester MA  
In [69]:
## In the spring of 1992, two teams from LA played a home game on another court.
## query to find both games
## both teams have an ID starting with "LA"
print(nba.info())
print(nba[
    (nba["_iscopy"]==0) &
    (nba["team_id"].str.startswith("LA")) &
    (nba["year_id"]==1992) &
    (nba["notes"].notnull())
])
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 126314 entries, 0 to 126313
Data columns (total 23 columns):
gameorder        126314 non-null int64
game_id          126314 non-null object
lg_id            126314 non-null object
_iscopy          126314 non-null int64
year_id          126314 non-null int64
date_game        126314 non-null object
seasongame       126314 non-null int64
is_playoffs      126314 non-null int64
team_id          126314 non-null object
fran_id          126314 non-null object
pts              126314 non-null int64
elo_i            126314 non-null float64
elo_n            126314 non-null float64
win_equiv        126314 non-null float64
opp_id           126314 non-null object
opp_fran         126314 non-null object
opp_pts          126314 non-null int64
opp_elo_i        126314 non-null float64
opp_elo_n        126314 non-null float64
game_location    126314 non-null object
game_result      126314 non-null object
forecast         126314 non-null float64
notes            5424 non-null object
dtypes: float64(6), int64(7), object(10)
memory usage: 22.2+ MB
None
       gameorder       game_id lg_id  _iscopy  year_id date_game  seasongame  \
68901      34451  199205030LAC   NBA        0     1992  5/3/1992          86   
68903      34452  199205030LAL   NBA        0     1992  5/3/1992          86   

       is_playoffs team_id   fran_id  pts    elo_i   elo_n  win_equiv opp_id  \
68901            1     LAC  Clippers  115  1546.90  1556.0      47.89    UTA   
68903            1     LAL    Lakers   76  1489.65  1474.4      39.40    POR   

           opp_fran  opp_pts  opp_elo_i  opp_elo_n game_location game_result  \
68901          Jazz      107    1647.51    1638.41             H           W   
68903  Trailblazers      102    1655.12    1670.37             H           L   

       forecast                                       notes  
68901      0.50  at Anaheim, CA (Anaheim Convention Center)  
68903      0.41                            at Las Vegas, NV  

Grouping and Aggregating Data

Sum, mean, or max value of a group of elements

In [70]:
print(city_revenues.sum())
18700
In [71]:
print(city_revenues.max())
8000
In [72]:
points = nba["pts"]
print(points.sum())
12976235
In [73]:
## passing sort=False can lead to performance gains
print(nba.groupby("fran_id", sort=False)["pts"].sum())
fran_id
Huskies           3995
Knicks          582497
Stags            20398
Falcons           3797
Capitols         22387
Celtics         626484
Steamrollers     12372
Ironmen           3674
Bombers          17793
Rebels            4474
Warriors        591224
Baltimore        37219
Jets              4482
Pistons         572758
Lakers          637444
Kings           569245
Hawks           567261
Denver            4818
Olympians        22864
Redskins          5372
Waterloo          4921
Packers           6193
Sixers          585891
Wizards         474809
Bulls           437269
Thunder         437735
Squires          91127
Stars            84940
Rockets         432504
Colonels         94435
Pacers          438288
Nuggets         445780
Spurs           453822
Spirits          85874
Sounds           75582
Floridians       49568
Nets            417809
Condors          49642
Bucks           418326
Suns            437486
Clippers        380523
Cavaliers       380416
Trailblazers    402695
Sails            30080
Jazz            363155
Mavericks       309239
Pelicans        220794
Heat            229103
Timberwolves    207693
Magic           219436
Grizzlies       157683
Raptors         158370
Hornets          84489
Name: pts, dtype: int64
In [74]:
## grouping by multiple columns

nba[
    (nba["fran_id"]=="Spurs") &
    (nba["year_id"] > 2010)
].groupby(["year_id", "game_result"])["game_id"].count()
Out[74]:
year_id  game_result
2011     L              25
         W              63
2012     L              20
         W              60
2013     L              30
         W              73
2014     L              27
         W              78
2015     L              31
         W              58
Name: game_id, dtype: int64
In [75]:
# Take a look at the GSW 2014-15 season (year_id = 2015)
# How many wins and losses did they have during the regular season and the playoffs?

nba[
    (nba["fran_id"]=="Warriors") &
    (nba["year_id"]==2015)
].groupby(["is_playoffs", "game_result"])["game_id"].count()
Out[75]:
is_playoffs  game_result
0            L              15
             W              67
1            L               5
             W              16
Name: game_id, dtype: int64

Manipulating DF columns

In [76]:
# create a copy of your DF to work with
df = nba.copy()
df.shape
Out[76]:
(126314, 23)
In [77]:
df["difference"]=df.pts - df.opp_pts
df.shape
Out[77]:
(126314, 24)
In [78]:
df["difference"].max()
Out[78]:
68
In [79]:
# rename game_result and game_location
renamed_df = df.rename(
    columns = {"game_result":"result", "game_location":"location"}
)
renamed_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 126314 entries, 0 to 126313
Data columns (total 24 columns):
gameorder      126314 non-null int64
game_id        126314 non-null object
lg_id          126314 non-null object
_iscopy        126314 non-null int64
year_id        126314 non-null int64
date_game      126314 non-null object
seasongame     126314 non-null int64
is_playoffs    126314 non-null int64
team_id        126314 non-null object
fran_id        126314 non-null object
pts            126314 non-null int64
elo_i          126314 non-null float64
elo_n          126314 non-null float64
win_equiv      126314 non-null float64
opp_id         126314 non-null object
opp_fran       126314 non-null object
opp_pts        126314 non-null int64
opp_elo_i      126314 non-null float64
opp_elo_n      126314 non-null float64
location       126314 non-null object
result         126314 non-null object
forecast       126314 non-null float64
notes          5424 non-null object
difference     126314 non-null int64
dtypes: float64(6), int64(8), object(10)
memory usage: 23.1+ MB
In [80]:
## Delete unwanted columns
## Ex: delete the 4 columns related to Elo ratings
df.shape
Out[80]:
(126314, 24)
In [81]:
elo_columns = ["elo_i", "elo_n", "opp_elo_i", "opp_elo_n"]
df.drop(elo_columns, inplace=True, axis=1)
df.shape
Out[81]:
(126314, 20)

Specifying Data types

  • Pandas does a pretty good job of inferring the schema, but it's not perfect
  • Performance can be improved if we select the right data types for our columns upfront (see the read_csv sketch below)
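A sketch of the "upfront" approach, assuming the same CSV file as above: pd.read_csv can be told the column dtypes and date columns while reading, instead of converting them afterwards.

nba_typed = pd.read_csv(
    "nba_all_elo.csv",
    dtype={"game_location": "category", "game_result": "category"},  # store as categoricals
    parse_dates=["date_game"],                                        # parse dates on load
)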

df.info()

In [82]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 126314 entries, 0 to 126313
Data columns (total 20 columns):
gameorder        126314 non-null int64
game_id          126314 non-null object
lg_id            126314 non-null object
_iscopy          126314 non-null int64
year_id          126314 non-null int64
date_game        126314 non-null object
seasongame       126314 non-null int64
is_playoffs      126314 non-null int64
team_id          126314 non-null object
fran_id          126314 non-null object
pts              126314 non-null int64
win_equiv        126314 non-null float64
opp_id           126314 non-null object
opp_fran         126314 non-null object
opp_pts          126314 non-null int64
game_location    126314 non-null object
game_result      126314 non-null object
forecast         126314 non-null float64
notes            5424 non-null object
difference       126314 non-null int64
dtypes: float64(2), int64(8), object(10)
memory usage: 19.3+ MB
  • 10 columns have the data type object
  • Most of these object columns contain arbitrary text,
  • but some are candidates for data type conversion
In [83]:
## use pd.to_datetime() to store all game dates as datetime objects
df["date_game"] = pd.to_datetime(df["date_game"])
In [84]:
df["game_location"].nunique()
Out[84]:
3
In [85]:
df["game_location"].value_counts()
Out[85]:
H    63138
A    63138
N       38
Name: game_location, dtype: int64
  • But what data type should we use for the game_location column?
  • In a relational database, we would probably not use a varchar type but rather an enum
  • Pandas provides categorical data for the same purpose

Advantages of categorical data over unstructured text

  • It makes validation easy
  • Saves a ton of memory, as pandas stores only the unique values internally (a rough check follows below)
  • The higher the ratio of total values to unique values, the more space we save
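A rough check of the savings (sketch), run before the conversion below: compare the object column's memory footprint with a categorical copy.

before = df["game_location"].memory_usage(deep=True)                    # object dtype
after = df["game_location"].astype("category").memory_usage(deep=True)  # categorical copy
print(before, after)  # the categorical version should be far smaller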
In [86]:
df["game_location"] = pd.Categorical(df["game_location"])
In [87]:
df["game_location"].dtype
Out[87]:
CategoricalDtype(categories=['A', 'H', 'N'], ordered=False)
In [88]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 126314 entries, 0 to 126313
Data columns (total 20 columns):
gameorder        126314 non-null int64
game_id          126314 non-null object
lg_id            126314 non-null object
_iscopy          126314 non-null int64
year_id          126314 non-null int64
date_game        126314 non-null datetime64[ns]
seasongame       126314 non-null int64
is_playoffs      126314 non-null int64
team_id          126314 non-null object
fran_id          126314 non-null object
pts              126314 non-null int64
win_equiv        126314 non-null float64
opp_id           126314 non-null object
opp_fran         126314 non-null object
opp_pts          126314 non-null int64
game_location    126314 non-null category
game_result      126314 non-null object
forecast         126314 non-null float64
notes            5424 non-null object
difference       126314 non-null int64
dtypes: category(1), datetime64[ns](1), float64(2), int64(8), object(8)
memory usage: 18.4+ MB
In [89]:
## changing the game_result column
df["game_result"].nunique()
Out[89]:
2
In [90]:
df["game_result"].value_counts()
Out[90]:
L    63157
W    63157
Name: game_result, dtype: int64
In [91]:
# to improve performance, convert the game_result column into a categorical column
df["game_result"] = pd.Categorical(df["game_result"])
In [92]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 126314 entries, 0 to 126313
Data columns (total 20 columns):
gameorder        126314 non-null int64
game_id          126314 non-null object
lg_id            126314 non-null object
_iscopy          126314 non-null int64
year_id          126314 non-null int64
date_game        126314 non-null datetime64[ns]
seasongame       126314 non-null int64
is_playoffs      126314 non-null int64
team_id          126314 non-null object
fran_id          126314 non-null object
pts              126314 non-null int64
win_equiv        126314 non-null float64
opp_id           126314 non-null object
opp_fran         126314 non-null object
opp_pts          126314 non-null int64
game_location    126314 non-null category
game_result      126314 non-null category
forecast         126314 non-null float64
notes            5424 non-null object
difference       126314 non-null int64
dtypes: category(2), datetime64[ns](1), float64(2), int64(8), object(7)
memory usage: 17.6+ MB

Cleaning Data

  • MISSING VALUES:

    • .info() shows how many non-null values each column has
    • null values often indicate a problem in the data-gathering process
    • they can make some analyses impossible

    • EASIEST WAYS TO DEAL WITH MISSING DATA:

    • dropna() - drops the rows (or, with axis=1, the columns) that contain missing values
    • replace them with default values using fillna()
  • INVALID VALUES:

    • can be more dangerous than missing values
    • often more challenging to detect, especially if the dataset is enormous or relied on manual entry
    • but we can implement some sanity checks with queries and aggregations
    • use describe()
  • INCONSISTENT VALUES:
    • a value may look fine on its own but not fit with values in other columns
    • we can define query criteria that are mutually exclusive and verify that they don't occur together
    • check whether the columns are consistent using the .empty attribute
In [93]:
nba.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 126314 entries, 0 to 126313
Data columns (total 23 columns):
gameorder        126314 non-null int64
game_id          126314 non-null object
lg_id            126314 non-null object
_iscopy          126314 non-null int64
year_id          126314 non-null int64
date_game        126314 non-null object
seasongame       126314 non-null int64
is_playoffs      126314 non-null int64
team_id          126314 non-null object
fran_id          126314 non-null object
pts              126314 non-null int64
elo_i            126314 non-null float64
elo_n            126314 non-null float64
win_equiv        126314 non-null float64
opp_id           126314 non-null object
opp_fran         126314 non-null object
opp_pts          126314 non-null int64
opp_elo_i        126314 non-null float64
opp_elo_n        126314 non-null float64
game_location    126314 non-null object
game_result      126314 non-null object
forecast         126314 non-null float64
notes            5424 non-null object
dtypes: float64(6), int64(7), object(10)
memory usage: 22.2+ MB
In [94]:
rows_without_missing_data = nba.dropna()
rows_without_missing_data.shape
Out[94]:
(5424, 23)
In [95]:
# We can also drop problematic columns that are not relevant for our analysis
# to do this, use dropna() again and provide axis = 1
# the result won't contain the sometimes-empty notes column

data_without_missing_columns = nba.dropna(axis=1)
data_without_missing_columns.shape
Out[95]:
(126314, 22)
In [96]:
# replacing with default values

data_with_default_notes = nba.copy()
data_with_default_notes["notes"].fillna(
    value="no notes at all",
    inplace=True
)

data_with_default_notes["notes"].describe()
Out[96]:
count              126314
unique                232
top       no notes at all
freq               120890
Name: notes, dtype: object
In [97]:
# INVALID VALUES
nba.describe()
Out[97]:
gameorder _iscopy year_id seasongame is_playoffs pts elo_i elo_n win_equiv opp_pts opp_elo_i opp_elo_n forecast
count 126314.00 126314.0 126314.00 126314.00 126314.00 126314.00 126314.00 126314.00 126314.00 126314.00 126314.00 126314.00 126314.00
mean 31579.00 0.5 1988.20 43.53 0.06 102.73 1495.24 1495.24 41.71 102.73 1495.24 1495.24 0.50
std 18231.93 0.5 17.58 25.38 0.24 14.81 112.14 112.46 10.63 14.81 112.14 112.46 0.22
min 1.00 0.0 1947.00 1.00 0.00 0.00 1091.64 1085.77 10.15 0.00 1091.64 1085.77 0.02
25% 15790.00 0.0 1975.00 22.00 0.00 93.00 1417.24 1416.99 34.10 93.00 1417.24 1416.99 0.33
50% 31579.00 0.5 1990.00 43.00 0.00 103.00 1500.95 1500.95 42.11 103.00 1500.95 1500.95 0.50
75% 47368.00 1.0 2003.00 65.00 0.00 112.00 1576.06 1576.29 49.64 112.00 1576.06 1576.29 0.67
max 63157.00 1.0 2015.00 108.00 1.00 186.00 1853.10 1853.10 71.11 186.00 1853.10 1853.10 0.98
In [98]:
# the years range sensibly from 1947 to 2015
# but can pts really be 0?
nba[nba["pts"]==0] # this game was forfeited, so we may want to remove it from the dataset
Out[98]:
gameorder game_id lg_id _iscopy year_id date_game seasongame is_playoffs team_id fran_id pts elo_i elo_n win_equiv opp_id opp_fran opp_pts opp_elo_i opp_elo_n game_location game_result forecast notes
26684 13343 197210260VIR ABA 1 1973 10/26/1972 7 0 DNR Nuggets 0 1460.34 1457.45 40.41 VIR Squires 2 1484.19 1487.08 A L 0.33 at Richmond VA; forfeit to VIR
In [99]:
# the values of the fields pts, opp_pts and game_result should be consistent
# so check with .empty attribute

nba[(nba["pts"] > nba["opp_pts"]) & (nba["game_result"] != 'W')].empty
Out[99]:
True
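The complementary, mutually exclusive criterion can be checked the same way (sketch): a team that scored fewer points should never be recorded as the winner.

# should also be True if the dataset is consistent
nba[(nba["pts"] < nba["opp_pts"]) & (nba["game_result"] != 'L')].empty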

Combining Multiple Datasets

In [100]:
further_city_data = pd.DataFrame(
    {"revenue":[7000,3400], "employee_count":[2,2]},
    index=["New York","Barcelona"]
)
In [101]:
city_data.info()
<class 'pandas.core.frame.DataFrame'>
Index: 3 entries, Amsterdam to Toronto
Data columns (total 2 columns):
revenue           3 non-null int64
employee_count    2 non-null float64
dtypes: float64(1), int64(1)
memory usage: 152.0+ bytes
In [102]:
further_city_data.info()
<class 'pandas.core.frame.DataFrame'>
Index: 2 entries, New York to Barcelona
Data columns (total 2 columns):
revenue           2 non-null int64
employee_count    2 non-null int64
dtypes: int64(2)
memory usage: 48.0+ bytes
In [103]:
# add these cities to city_data using .concat()
# By default, .concat() combines along axis=0, i.e., it appends rows
all_city_data = pd.concat([city_data, further_city_data], sort=False)
all_city_data
Out[103]:
revenue employee_count
Amsterdam 4200 5.0
Tokyo 6500 8.0
Toronto 8000 NaN
New York 7000 2.0
Barcelona 3400 2.0
In [104]:
# use axis=1 to append columns with concat
city_countries = pd.DataFrame({
    "country": ["Holland","Japan","Holland","Canada","Spain"],
    "capital":[1,1,0,0,0]},
    index = ["Amsterdam", "Tokyo", "Rotterdam", "Toronto", "Barcelona"]
)
cities = pd.concat([all_city_data, city_countries], axis=1, sort=False)
cities
Out[104]:
revenue employee_count country capital
Amsterdam 4200.0 5.0 Holland 1.0
Tokyo 6500.0 8.0 Japan 1.0
Toronto 8000.0 NaN Canada 0.0
New York 7000.0 2.0 NaN NaN
Barcelona 3400.0 2.0 Spain 0.0
Rotterdam NaN NaN Holland 0.0
In [105]:
# To combine only the cities that appear in both DataFrame objects, use an inner join

pd.concat([all_city_data, city_countries], axis=1, join="inner")
Out[105]:
revenue employee_count country capital
Amsterdam 4200 5.0 Holland 1
Tokyo 6500 8.0 Japan 1
Toronto 8000 NaN Canada 0
Barcelona 3400 2.0 Spain 0
In [106]:
# If we do not want to combine based on index, use merge()

countries = pd.DataFrame({
    "population_millions": [17,127,37],
    "continent": ["Europe", "Asia", "North America"]
}, index=["Holland", "Japan", "Canada"])

pd.merge(cities, countries, left_on="country", right_index=True)
Out[106]:
revenue employee_count country capital population_millions continent
Amsterdam 4200.0 5.0 Holland 1.0 17 Europe
Rotterdam NaN NaN Holland 0.0 17 Europe
Tokyo 6500.0 8.0 Japan 1.0 127 Asia
Toronto 8000.0 NaN Canada 0.0 37 North America
In [107]:
cities
Out[107]:
revenue employee_count country capital
Amsterdam 4200.0 5.0 Holland 1.0
Tokyo 6500.0 8.0 Japan 1.0
Toronto 8000.0 NaN Canada 0.0
New York 7000.0 2.0 NaN NaN
Barcelona 3400.0 2.0 Spain 0.0
Rotterdam NaN NaN Holland 0.0
In [108]:
countries
Out[108]:
population_millions continent
Holland 17 Europe
Japan 127 Asia
Canada 37 North America
In [109]:
# merge - performs inner join by default
# change it using the how parameter

pd.merge(
    cities,
    countries,
    left_on="country",
    right_index=True,
    how='left'
)
Out[109]:
revenue employee_count country capital population_millions continent
Amsterdam 4200.0 5.0 Holland 1.0 17.0 Europe
Tokyo 6500.0 8.0 Japan 1.0 127.0 Asia
Toronto 8000.0 NaN Canada 0.0 37.0 North America
New York 7000.0 2.0 NaN NaN NaN NaN
Barcelona 3400.0 2.0 Spain 0.0 NaN NaN
Rotterdam NaN NaN Holland 0.0 17.0 Europe
In [110]:
import sys  
!{sys.executable} -m pip install --user matplotlib
Requirement already satisfied: matplotlib in /Users/ladusu863/Library/Python/3.7/lib/python/site-packages (3.1.2)
Requirement already satisfied: numpy>=1.11 in /usr/local/lib/python3.7/site-packages (from matplotlib) (1.16.4)
Requirement already satisfied: python-dateutil>=2.1 in /usr/local/lib/python3.7/site-packages (from matplotlib) (2.8.0)
Requirement already satisfied: cycler>=0.10 in /Users/ladusu863/Library/Python/3.7/lib/python/site-packages (from matplotlib) (0.10.0)
Requirement already satisfied: pyparsing!=2.0.4,!=2.1.2,!=2.1.6,>=2.0.1 in /usr/local/lib/python3.7/site-packages (from matplotlib) (2.4.1.1)
Requirement already satisfied: kiwisolver>=1.0.1 in /Users/ladusu863/Library/Python/3.7/lib/python/site-packages (from matplotlib) (1.1.0)
Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.7/site-packages (from python-dateutil>=2.1->matplotlib) (1.12.0)
Requirement already satisfied: setuptools in /usr/local/lib/python3.7/site-packages (from kiwisolver>=1.0.1->matplotlib) (41.0.1)

Visualizing Data Frame

  • use %matplotlib inline
  • Both Series and DataFrame objects have a .plot() method (a wrapper around matplotlib.pyplot.plot()) - see the sketch below
  • Default: line plot
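Because .plot() is only a wrapper, it forwards keyword arguments such as figsize and title straight to matplotlib. A small sketch, reusing the Knicks grouping that also appears in the cells below:

nba[nba["fran_id"] == "Knicks"].groupby("year_id")["pts"].sum().plot(
    figsize=(8, 4),                      # figure size in inches
    title="Knicks points per season",    # title passed through to matplotlib
)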
In [111]:
%matplotlib inline
In [112]:
# Visualize how many points the Knicks scored throughout the seasons
nba[nba["fran_id"]=="Knicks"].groupby("year_id")["pts"].sum().plot()
Out[112]:
<matplotlib.axes._subplots.AxesSubplot at 0x11be9b438>
In [113]:
# To create a bar plot
nba["fran_id"].value_counts().head(10).plot(kind="bar")
Out[113]:
<matplotlib.axes._subplots.AxesSubplot at 0x11be1d6d8>
In [114]:
# In 2013, the Heat won the championship
# create a pie chart showing the counts of wins and losses during that season

nba[
    (nba["fran_id"] == "Heat") &
    (nba["year_id"] == 2013)
]["game_result"].value_counts().plot(kind="pie")
Out[114]:
<matplotlib.axes._subplots.AxesSubplot at 0x11ce11128>
