Downloading data from the web¶
- use the requests module
- response = requests.get(url)
- response.raise_for_status() checks whether the request succeeded
- download to a file by writing the response content to it:
with open(target_csv_path, "wb") as f:
    f.write(response.content)
In [6]:
import requests
download_url="https://raw.githubusercontent.com/fivethirtyeight/data/master/nba-elo/nbaallelo.csv"
target_csv_path = "nba_all_elo.csv"
response = requests.get(download_url)
response.raise_for_status() # check that the request was successful
with open(target_csv_path, "wb") as f:
    f.write(response.content)
print("Download ready")
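For large files, a streaming download avoids holding the whole response in memory. A minimal sketch using requests' standard streaming API (same URL and target path as above):
with requests.get(download_url, stream=True) as response:
    response.raise_for_status()
    with open(target_csv_path, "wb") as f:
        # write the body in 8 KB chunks instead of loading it all at once
        for chunk in response.iter_content(chunk_size=8192):
            f.write(chunk)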
Reading csv file using pandas¶
In [7]:
import pandas as pd
nba = pd.read_csv("nba_all_elo.csv")
type(nba)
Out[7]:
len(df) - gives us the number of rows in df¶
In [8]:
len(nba) # to determine number of rows - len()
Out[8]:
df.shape - to see the number of rows and columns¶
In [9]:
nba.shape # to see dimensionality: a tuple of (rows, columns)
Out[9]:
To see the first few rows of the data frame¶
In [10]:
nba.head()
Out[10]:
To see all the columns in the output¶
In [11]:
pd.set_option("display.max_columns", None) # show all 23 columns
nba.head()
Out[11]:
To set the precision value for decimal fields¶
In [12]:
pd.set_option("display.precision", 2) # we don't need 6 decimal places
nba.head()
Out[12]:
To see the last rows of the data frame¶
In [13]:
nba.tail()
Out[13]:
In [14]:
nba.tail(3) # to get the last 3 rows of the dataset
Out[14]:
To check the data types and columns - df.info()¶
In [15]:
nba.info() # discover data types
- The object type in pandas is "a catch-all for columns that pandas did not recognize as any other specific type"
- In practice, it often means that all of the values in the column are strings (a quick check is sketched below)
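To see which columns .info() reported as object, select_dtypes() is handy. A minimal sketch, using the nba DataFrame loaded above:
# list only the columns that pandas stored with the object dtype
nba.select_dtypes(include="object").columns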
Showing basic statistics of a data frame - describe()¶
- default: describe() only analyzes numeric columns
- we can include other data types via the include parameter
In [16]:
nba.describe()
Out[16]:
In [17]:
nba.describe(include=object) # np.object was removed in NumPy; use the builtin object instead
Out[17]:
Exploratory data analysis - to answer more questions about our dataset¶
- value_counts() - to examine how often specific values occur in a column
In [18]:
nba["team_id"].value_counts()
Out[18]:
In [19]:
nba["fran_id"].value_counts()
Out[19]:
In [20]:
nba.loc[nba["fran_id"]=="Lakers", "team_id"].value_counts()
Out[20]:
In [21]:
nba.loc[nba["team_id"]=="MNL", "date_game"].min()
Out[21]:
In [22]:
nba.loc[nba["team_id"]=="MNL", "date_game"].max()
Out[22]:
In [23]:
nba.loc[nba["team_id"]=="MNL","date_game"].agg(("min","max"))
Out[23]:
In [24]:
# How many points have the Boston Celtics scored in total?
nba.loc[nba["team_id"]=="BOS", "pts"].sum()
Out[24]:
Understanding Data Frame Objects¶
- A DataFrame is a collection of Series objects that share the same index
In [25]:
revenues = pd.Series([5555, 7000, 1980])
print(revenues)
In [26]:
print(revenues.values) # A sequence of values
In [27]:
print(revenues.index) # a sequence of identifiers, which is the index
In [28]:
# A Series can have an arbitrary type of index
# Index is a list of city names represented by Strings
city_revenues = pd.Series(
    [4200, 8000, 6500],  # without index=, a Series only has a positional index, like a Python list
    index=["Amsterdam", "Toronto", "Tokyo"]
)
print(city_revenues)
In [29]:
# To construct a Series with a label index from a python dictionary
city_employee_count = pd.Series({"Amsterdam": 5, "Tokyo": 8})
print(city_employee_count)
In [30]:
# Dictionary keys become the index,
# Dictionary values are the Series values
# keys() and in keyword
print(city_employee_count.keys())
In [31]:
print("Tokyo" in city_employee_count)
In [32]:
# We can have 2 series objects with cities as keys
# city_revenues
# city_employee_count
city_data = pd.DataFrame({
    "revenue": city_revenues,
    "employee_count": city_employee_count
})
print(city_data)
In [33]:
# the new DataFrame index is the union of the two Series indices
print(city_data.index)
In [34]:
# Just like a Series, a DataFrame stores its values in a NumPy array
print(city_data.values)
In [35]:
# We can refer to the 2 dimensions of a dataframe as AXES
print(city_data.axes)
In [36]:
print(city_data.axes[0]) # [0] = row index
In [37]:
print(city_data.axes[1]) # [1] = column index
In [38]:
# A DataFrame is also a dictionary-like data structure
# hence it supports .keys() and the in keyword
print(city_data.keys())
In [39]:
print("Amsterdam" in city_data)
In [40]:
print(city_data["revenue"])
In [41]:
print(city_data.revenue)
Does NBA Dataset contain pts or points column?¶
In [42]:
print(nba.axes)
In [43]:
print(nba.index)
In [44]:
print("points" in nba.keys())
In [45]:
print("pts" in nba.keys())
In [46]:
print(city_revenues)
In [47]:
print(city_revenues["Toronto"])
In [48]:
print(city_revenues[1])  # positional access with [] is deprecated in newer pandas; prefer .iloc[1]
In [49]:
print(city_revenues[-1])  # likewise, prefer .iloc[-1]
In [50]:
print(city_revenues[1:])
In [51]:
print(city_revenues["Toronto":])
In [52]:
colors = pd.Series(
    ["red", "purple", "blue", "green", "yellow"],
    index=[1, 2, 3, 5, 8]
)
print(colors)
In [53]:
print(colors.loc[1]) #.loc - refers to the label index
In [54]:
print(colors.iloc[1]) #.iloc - refers to the positional index
In [55]:
print(colors.iloc[1:3]) # returns the elements at positions 1 and 2 (the stop position is excluded)
In [56]:
print(colors.loc[3:8]) # returns the elements with labels from 3 to 8, endpoints included
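With an integer index like this one, plain [] indexing is ambiguous, which is why .loc and .iloc are preferred. A quick check (label lookup wins for []):
print(colors[1])       # "red" - the element labeled 1, not the element at position 1
print(colors.iloc[1])  # "purple" - the element at position 1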
Accessing data frame elements with dot notation may not work or may lead to surprises¶
- This happens when a column name coincides with a DataFrame attribute or method name
In [57]:
toys = pd.DataFrame([
    {"name": "ball", "shape": "sphere"},
    {"name": "Rubik's cube", "shape": "cube"}
])
print(toys["shape"])
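The collision is easy to see with this DataFrame: "shape" is both a column name and a built-in DataFrame attribute, so the two access styles return different things:
print(toys["shape"])  # the column named "shape"
print(toys.shape)     # the DataFrame attribute: the dimensions (2, 2)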
Using .loc and .iloc¶
city_data.loc["Amsterdam"] : selects the row with label index "Amsterdam"
city_data.loc["Tokyo":"Tornoto"] : selects the rows with label indices from "Tokyo" to "Toronto"
city_data.iloc[1]: selects the row with the positional index 1, which is Tokyo
In [58]:
print(city_data)
In [59]:
print(city_data.loc["Amsterdam"])
In [60]:
print(city_data.loc["Tokyo": "Toronto"])
In [61]:
print(city_data.iloc[1])
In [62]:
## Display the second-to-last row of the nba dataset ##
print(nba.iloc[-2])
.loc and .iloc accept a second parameter¶
- First parameter: selects rows based on the indices
- Second parameter: selects the columns
- Together, these parameters select a subset of rows and columns from your DF
city_data.loc["Amsterdam": "Tokyo", "revenue"]
In [63]:
print(city_data.loc["Amsterdam":"Tokyo", "revenue"])
In [64]:
## For the NBA dataset, select all games between labels 5555 and 5559, keeping only the team and points columns ##
print(nba.loc[5555:5559,["team_id", "pts"]])
Querying the dataset¶
- Apart from accessing subsets of a huge dataset based on its indices,
- we can also select rows based on the values in the dataset, i.e. query the data
Create a new DF that contains only games played after 2010:
In [65]:
current_decade = nba[nba["year_id"]>2010]
print(current_decade.shape)
In [66]:
## select rows where a specific field is not null
games_with_notes = nba[nba["notes"].notnull()]
print(games_with_notes.shape)
In [67]:
## we can use string methods too ##
## Get the franchises whose names end with "ers" ##
ers = nba[nba["fran_id"].str.endswith("ers")]
print(ers.shape)
In [68]:
## To combine multiple query criteria, use | and & ##
## Search for Baltimore games where both teams scored over 100 points ##
## To see each game only once, exclude duplicates (_iscopy == 0) ##
print(nba[
    (nba["_iscopy"] == 0) &
    (nba["pts"] > 100) &
    (nba["opp_pts"] > 100) &
    (nba["team_id"] == "BLB")
])
In [69]:
## In spring 1992, two teams from LA played a home game at another court.
## Query to find both games:
## both teams have an ID starting with "LA"
nba.info()  # .info() prints directly and returns None, so wrapping it in print() is unnecessary
print(nba[
    (nba["_iscopy"] == 0) &
    (nba["team_id"].str.startswith("LA")) &
    (nba["year_id"] == 1992) &
    (nba["notes"].notnull())
])
In [70]:
print(city_revenues.sum())
In [71]:
print(city_revenues.max())
In [72]:
points = nba["pts"]
print(points.sum())
In [73]:
## passing sort=False may lead to performance gains
print(nba.groupby("fran_id", sort=False)["pts"].sum())
In [74]:
## grouping by multiple columns
nba[
    (nba["fran_id"] == "Spurs") &
    (nba["year_id"] > 2010)
].groupby(["year_id", "game_result"])["game_id"].count()
Out[74]:
In [75]:
# Take a look at GSW's 2014-15 season (year_id = 2015)
# How many wins and losses did they have during the regular season and the playoffs?
nba[
    (nba["fran_id"] == "Warriors") &
    (nba["year_id"] == 2015)
].groupby(["is_playoffs", "game_result"])["game_id"].count()
Out[75]:
Manipulating DF columns¶
In [76]:
# create a copy of your DF to work with
df = nba.copy()
df.shape
Out[76]:
In [77]:
df["difference"]=df.pts - df.opp_pts
df.shape
Out[77]:
In [78]:
df["difference"].max()
Out[78]:
In [79]:
# rename game_result and game_location
renamed_df = df.rename(
    columns={"game_result": "result", "game_location": "location"}
)
renamed_df.info()
In [80]:
## Delete unwanted columns
## Ex: delete the 4 columns related to Elo ratings
df.shape
Out[80]:
In [81]:
elo_columns = ["elo_i", "elo_n", "opp_elo_i", "opp_elo_n"]
df.drop(elo_columns, inplace=True, axis=1)
df.shape
Out[81]:
Specifying Data types¶
- Pandas' type inference does a pretty good job, but it's not perfect
- Performance can improve if we choose the right data types for our columns upfront
df.info()
In [82]:
df.info()
- 10 columns have the data type object
- Most of these object columns contain arbitrary text,
- but there are some candidates for data type conversion
In [83]:
## use pd.to_datetime() to store all game dates as datetime objects
df["date_game"] = pd.to_datetime(df["date_game"])
In [84]:
df["game_location"].nunique()
Out[84]:
In [85]:
df["game_location"].value_counts()
Out[85]:
- But what data type should we use for the game_location column?
- In a relational database, we would go with a varchar or, even better, an enum type
- Pandas provides categorical data for the same purpose
Advantages of categorical data over unstructured text:
- It makes validation easy
- It saves a ton of memory, as pandas only stores the unique values internally
- The higher the ratio of total values to unique values, the more space we save (a quick check is sketched below)
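A rough way to see the savings is to compare the memory footprint of the object column with a categorical copy. A minimal sketch (memory_usage with deep=True also counts the string contents):
# compare the raw object column against a categorical version of it
print(df["game_location"].memory_usage(deep=True))
print(df["game_location"].astype("category").memory_usage(deep=True))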
In [86]:
df["game_location"] = pd.Categorical(df["game_location"])
In [87]:
df["game_location"].dtype
Out[87]:
In [88]:
df.info()
In [89]:
## changing the game_result column
df["game_result"].nunique()
Out[89]:
In [90]:
df["game_result"].value_counts()
Out[90]:
In [91]:
# to improve performance, make the game_result column categorical as well
df["game_result"] = pd.Categorical(df["game_result"])
In [92]:
df.info()
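These type choices can also be made once at load time instead of converting afterwards. A minimal sketch with read_csv's parse_dates and dtype parameters (df_typed is a hypothetical fresh load of the same file):
df_typed = pd.read_csv(
    "nba_all_elo.csv",
    parse_dates=["date_game"],  # parse game dates as datetimes while reading
    dtype={"game_location": "category", "game_result": "category"},  # categoricals from the start
)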
Cleaning Data¶
Missing values:
- .info() shows how many non-null values each column contains
- null values often indicate a problem in the data-gathering process
- they make some analyses impossible
Easiest ways to deal with missing data:
- dropna() - drops the rows (or columns) that contain missing values
- replace them with default values, e.g. with fillna()
Invalid values:
- can be more dangerous than missing values
- often challenging to detect, especially if the dataset is enormous or was entered manually
- but we can implement some sanity checks with queries and aggregations
- use describe()
Inconsistent values:
- a value may look fine on its own but not fit with the values in other columns
- we can define query criteria that are mutually exclusive and verify that they don't occur together
- check whether the columns are consistent using the .empty attribute
In [93]:
nba.info()
In [94]:
rows_without_missing_data = nba.dropna()
rows_without_missing_data.shape
Out[94]:
In [95]:
# We can also drop problematic columns that are not relevant for our analysis
# to do this, use dropna() again with axis=1
# the result won't contain the sometimes-empty notes column
data_without_missing_columns = nba.dropna(axis=1)
data_without_missing_columns.shape
Out[95]:
In [96]:
# replacing missing values with a default
data_with_default_notes = nba.copy()
# assign the result back instead of using inplace=True on a single column,
# which may not modify the DataFrame under pandas' copy-on-write
data_with_default_notes["notes"] = data_with_default_notes["notes"].fillna(
    "no notes at all"
)
data_with_default_notes["notes"].describe()
Out[96]:
In [97]:
# INVALID VALUES
nba.describe()
Out[97]:
In [98]:
# the years make sense: between 1947 and 2015
# but pts == 0?
nba[nba["pts"] == 0] # this game was forfeited, so we may want to remove it from the dataset
Out[98]:
In [99]:
# the values of the fields pts, opp_pts and game_result should be consistent
# so check with the .empty attribute
nba[(nba["pts"] > nba["opp_pts"]) & (nba["game_result"] != 'W')].empty
Out[99]:
Combining Multiple Datasets¶
In [100]:
further_city_data = pd.DataFrame(
    {"revenue": [7000, 3400], "employee_count": [2, 2]},
    index=["New York", "Barcelona"]
)
In [101]:
city_data.info()
In [102]:
further_city_data.info()
In [103]:
# add these cities to city_data using .concat()
# by default, it combines along axis=0, i.e., it appends rows
all_city_data = pd.concat([city_data, further_city_data], sort=False)
all_city_data
Out[103]:
In [104]:
# use axis=1 to append columns with concat
city_countries = pd.DataFrame(
    {"country": ["Holland", "Japan", "Holland", "Canada", "Spain"],
     "capital": [1, 1, 0, 0, 0]},
    index=["Amsterdam", "Tokyo", "Rotterdam", "Toronto", "Barcelona"]
)
cities = pd.concat([all_city_data, city_countries], axis=1, sort=False)
cities
Out[104]:
In [105]:
# To combine only the cities that appear in both DataFrame objects, use an inner join
pd.concat([all_city_data, city_countries], axis=1, join="inner")
Out[105]:
In [106]:
# If we don't want to combine based on the index, use merge()
countries = pd.DataFrame({
    "population_millions": [17, 127, 37],
    "continent": ["Europe", "Asia", "North America"]
}, index=["Holland", "Japan", "Canada"])
pd.merge(cities, countries, left_on="country", right_index=True)
Out[106]:
In [107]:
cities
Out[107]:
In [108]:
countries
Out[108]:
In [109]:
# merge performs an inner join by default
# change it with the how parameter
pd.merge(
    cities,
    countries,
    left_on="country",
    right_index=True,
    how="left"
)
Out[109]:
In [110]:
import sys
!{sys.executable} -m pip install --user matplotlib
Visualizing a Data Frame¶
- use the %matplotlib inline magic (in Jupyter)
- Both Series and DataFrame objects have .plot() [ a wrapper around matplotlib.pyplot.plot() ]
- Default: line plot
In [111]:
%matplotlib inline
In [112]:
# Visualize how many points the Knicks scored throughout the seasons
nba[nba["fran_id"]=="Knicks"].groupby("year_id")["pts"].sum().plot()
Out[112]:
In [113]:
# To create a bar plot
nba["fran_id"].value_counts().head(10).plot(kind="bar")
Out[113]:
In [114]:
# In 2013, the Heat won the championship
# create a pie chart showing the counts of wins and losses during that season
nba[
    (nba["fran_id"] == "Heat") &
    (nba["year_id"] == 2013)
]["game_result"].value_counts().plot(kind="pie")
Out[114]: