Python for Data Science - Pandas
Pandas is a package which is built on top of Numpy and provides efficient implementation of DataFrame
. These dataframes are multidimensional arrays with attached row and column labels and often with heterogenous types.In this post we will focus on Series
, DataFrame
and related structure effectively.
Installing and Using Pandas
Once pandas is installed you can import and check the version.
import pandas as pd
pd.__version__
## '0.24.2'
Pandas Objects
There are three fundamental pandas objects or structures : the Series
, DataFrame
and Index
.
The Pandas Series Object
A Pandas Series
is a one-dimensional array of indexed data. It can be created from a list
or array
as follows :
data = pd.Series([0.23,1,3,5.89,67])
data
## 0 0.23
## 1 1.00
## 2 3.00
## 3 5.89
## 4 67.00
## dtype: float64
As Series
wraps both sequence of values and sequence of indices, which we can access with the values
and index
attributes.
** Value**
data.values
## array([ 0.23, 1. , 3. , 5.89, 67. ])
data.index
## RangeIndex(start=0, stop=5, step=1)
Accessing data by associated index.
data[2]
## 3.0
👉 Pandas Series is much more flexible and general as compare to one-dimensional numpy array.
Numpy Array has an implicitly defined integer index used to access the values, the Pandas Series has an explicitly defined index associated with the values.
This explicit index definition gives the Series object additional capabilities. For example, the index need not be an integer, but can consist of values of any desired type.
data = pd.Series([0,2,4,6,8], index = ['a','b','c','d','e'])
data
## a 0
## b 2
## c 4
## d 6
## e 8
## dtype: int64
Series as specialized dictionary
Constructing a series
object directly from a Python dictionary.
mob_dict = {'Nokia': 37834,
'Samsung':234893,
'Apple':93849,
'Oneplus':298343}
mobile = pd.Series(mob_dict)
mobile
## Nokia 37834
## Samsung 234893
## Apple 93849
## Oneplus 298343
## dtype: int64
By default, a Series will be created where the index is drawn from the sorted keys.
Accessing items in dictionary style
mobile['Oneplus']
## 298343
Series also supports array style operations such as slicing
mobile['Nokia':'Samsung']
## Nokia 37834
## Samsung 234893
## dtype: int64
Constructing Series objects
pd.Series(data, index = index)
where index
is optional argument and data can be one of many entities.
For ex:- Data can be list or numpy array in which case index defaults to an integer sequence.
pd.Series([2,4,5,9])
## 0 2
## 1 4
## 2 5
## 3 9
## dtype: int64
data can be scalar
pd.Series(5, index = [100,200,300])
## 100 5
## 200 5
## 300 5
## dtype: int64
data can be dicitonary in which index defaults to the sorted dictionary keys
pd.Series({2:'a', 1:'b', 3:'c'})
## 2 a
## 1 b
## 3 c
## dtype: object
DataFrame Object
If series is an analog of a one dimension array with flexible indices, a dataframe is analog of a two dimensional array with both flexible row indices and flexible column names.
mob_dict = {'Nokia': 35,
'Samsung':65,
'Apple':16,
'Oneplus':10}
models = pd.Series(mob_dict)
models
## Nokia 35
## Samsung 65
## Apple 16
## Oneplus 10
## dtype: int64
We can use dictionary to construct a single two dimensional object containing this information.
mobiles = pd.DataFrame({'Sales' : mobile, 'models': models})
mobiles
## Sales models
## Nokia 37834 35
## Samsung 234893 65
## Apple 93849 16
## Oneplus 298343 10
Like the Series object, the DataFrame has an index attribute that gives access to the index labels
mobiles.index
## Index(['Nokia', 'Samsung', 'Apple', 'Oneplus'], dtype='object')
mobiles.columns
## Index(['Sales', 'models'], dtype='object')
mobiles['models']
## Nokia 35
## Samsung 65
## Apple 16
## Oneplus 10
## Name: models, dtype: int64
A pandas dataframe can be constructed in many ways.Here are few examples
- From single Series object
pd.DataFrame(mobile, columns=['mobile'])
## mobile
## Nokia 37834
## Samsung 234893
## Apple 93849
## Oneplus 298343
- From list of dicts
Any list of dictionaries can be made into dataframe.
data = [{'a': i, 'b': 2 * i}
for i in range(3)]
data
## [{'a': 0, 'b': 0}, {'a': 1, 'b': 2}, {'a': 2, 'b': 4}]
pd.DataFrame(data)
## a b
## 0 0 0
## 1 1 2
## 2 2 4
- From dictionary of series objects
pd.DataFrame({'Sales':mobile, 'models': models})
## Sales models
## Nokia 37834 35
## Samsung 234893 65
## Apple 93849 16
## Oneplus 298343 10
- From two dimensional numpy array
import numpy as np
pd.DataFrame(np.random.rand(3, 2),
columns=['foo', 'bar'],
index=['a', 'b', 'c'])
## foo bar
## a 0.349491 0.801267
## b 0.553471 0.268641
## c 0.262647 0.805099
- From numpy structured array
A = np.zeros(3,dtype=[('A', 'i8'), ('B', 'f8')])
A
## array([(0, 0.), (0, 0.), (0, 0.)], dtype=[('A', '<i8'), ('B', '<f8')])
pd.DataFrame(A)
## A B
## 0 0 0.0
## 1 0 0.0
## 2 0 0.0
Data Indexing and Selection
As we know a series object acts in many ways like a one-dimensional Numpy array and in many ways like a standard python dictionary.
data = pd.Series([0.25, 0.5, 0.75, 1.0], index = ['a','b', 'c','d'])
data
## a 0.25
## b 0.50
## c 0.75
## d 1.00
## dtype: float64
data['a']
## 0.25
data.keys()
## Index(['a', 'b', 'c', 'd'], dtype='object')
list(data.items())
## [('a', 0.25), ('b', 0.5), ('c', 0.75), ('d', 1.0)]
you can extend a dictionary by assigning to a new key
data['e']=1.34
data
## a 0.25
## b 0.50
## c 0.75
## d 1.00
## e 1.34
## dtype: float64
Series as one-dimensional array
slice by explicit index
data['a':'c']
## a 0.25
## b 0.50
## c 0.75
## dtype: float64
slice by implicit integer index
data[0:2]
## a 0.25
## b 0.50
## dtype: float64
masking
data[(data > 0.3) & (data < 0.8)]
## b 0.50
## c 0.75
## dtype: float64
fancy indexing
data[['a','e']]
## a 0.25
## e 1.34
## dtype: float64
👉 When slicing with an explicit index (i.e., data[‘a’:‘c’]), the final index is included in the slice, while when slicing with an implicit index (i.e., data[0:2]), the final index is excluded from the slice
Indexers: loc, iloc and ix
data = pd.Series(['a', 'b', 'c'], index = [1,3,5])
data
## 1 a
## 3 b
## 5 c
## dtype: object
explicit index when indexing
data[1]
## 'a'
implicit index when slicing
data[1:3]
## 3 b
## 5 c
## dtype: object
Due to above confusion in case of integer indexes, pandas provide some special indexer attributes explicitly expose certain indexing schemes.
First, the loc
attribute allows indexing and slicing that always references the explicit index:
data.loc[1]
## 'a'
data.loc[1:3]
## 1 a
## 3 b
## dtype: object
The iloc
attribute allows indexing and slicing that always references the implicit Python-style index
data.iloc[1]
## 'b'
data.iloc[1:3]
## 3 b
## 5 c
## dtype: object
A third indexing attribute, ix
, is a hybrid of the two, and for Series
objects is equivalent to standard []-based indexing.The purpose of the ix
indexer will become more apparent in the context of DataFrame
.
Data Selection in Data Frame
A DataFrame
acts in many ways like a two-dimensional or structured array, and in other ways like a dictionary of Series
structures sharing the same index. These analogies can be helpful to keep in mind as we explore data selection within this structure.
Data Frame as a Dictionary
area = pd.Series({'California': 423967, 'Texas': 695662,
'New York': 141297, 'Florida': 170312,
'Illinois': 149995})
pop = pd.Series({'California': 38332521, 'Texas': 26448193,
'New York': 19651127, 'Florida': 19552860,
'Illinois': 12882135})
data = pd.DataFrame({'area':area, 'pop':pop})
data
## area pop
## California 423967 38332521
## Texas 695662 26448193
## New York 141297 19651127
## Florida 170312 19552860
## Illinois 149995 12882135
The individual Series that make up the columns of the DataFrame can be accessed via dictionary-style indexing of the column name:
data['area']
## California 423967
## Texas 695662
## New York 141297
## Florida 170312
## Illinois 149995
## Name: area, dtype: int64
Equivalently, we can use attribute-style access with column names that are strings:
data.area
## California 423967
## Texas 695662
## New York 141297
## Florida 170312
## Illinois 149995
## Name: area, dtype: int64
This attribute-style column access actually accesses the exact same object as the dictionary-style access
data.area is data['area']
## True
👉 If the column names are not strings, or if the column names conflict with methods of the DataFrame, this attribute-style access is not possible.
data.pop is data['pop']
## False
You should avoid the temptation to try column assignment via attribute.
Dictionary-style syntax can also be used to modify the object, in this case adding a new column
data['Density'] = data['pop']/data['area']
data
## area pop Density
## California 423967 38332521 90.413926
## Texas 695662 26448193 38.018740
## New York 141297 19651127 139.076746
## Florida 170312 19552860 114.806121
## Illinois 149995 12882135 85.883763
** DataFrame as 2-dimensional Array
As mentioned previously, we can also view the DataFrame as an enhanced two-dimensional array. We can examine the raw underlying data array using the values attribute :
data.values
## array([[4.23967000e+05, 3.83325210e+07, 9.04139261e+01],
## [6.95662000e+05, 2.64481930e+07, 3.80187404e+01],
## [1.41297000e+05, 1.96511270e+07, 1.39076746e+02],
## [1.70312000e+05, 1.95528600e+07, 1.14806121e+02],
## [1.49995000e+05, 1.28821350e+07, 8.58837628e+01]])
With this picture in mind, many familiar array-like observations can be done on the DataFrame itself. For example, we can transpose the full DataFrame to swap rows and columns
data.T
## California Texas New York Florida Illinois
## area 4.239670e+05 6.956620e+05 1.412970e+05 1.703120e+05 1.499950e+05
## pop 3.833252e+07 2.644819e+07 1.965113e+07 1.955286e+07 1.288214e+07
## Density 9.041393e+01 3.801874e+01 1.390767e+02 1.148061e+02 8.588376e+01
When it comes to indexing of DataFrame objects, however, it is clear that the dictionary-style indexing of columns precludes our ability to simply treat it as a NumPy array. In particular, passing a single index to an array accesses a row:
data.values[0]
## array([4.23967000e+05, 3.83325210e+07, 9.04139261e+01])
and passing a single “index” to a DataFrame accesses a column:
data['area']
## California 423967
## Texas 695662
## New York 141297
## Florida 170312
## Illinois 149995
## Name: area, dtype: int64
Thus for array-style indexing, we need another convention. Here Pandas again uses the loc, iloc, and ix indexers mentioned earlier. Using the iloc indexer, we can index the underlying array as if it is a simple NumPy array (using the implicit Python-style index), but the DataFrame index and column labels are maintained in the result:
data
## area pop Density
## California 423967 38332521 90.413926
## Texas 695662 26448193 38.018740
## New York 141297 19651127 139.076746
## Florida 170312 19552860 114.806121
## Illinois 149995 12882135 85.883763
data.iloc[:3, :2]
## area pop
## California 423967 38332521
## Texas 695662 26448193
## New York 141297 19651127
Similarly, using the loc indexer we can index the underlying data in an array-like style but using the explicit index and column names:
data.loc[:'Illinois', :'pop']
## area pop
## California 423967 38332521
## Texas 695662 26448193
## New York 141297 19651127
## Florida 170312 19552860
## Illinois 149995 12882135
The ix
indexer allows a hybrid of these two approaches:
data.ix[:3, :'pop']
## area pop
## California 423967 38332521
## Texas 695662 26448193
## New York 141297 19651127
In the loc indexer we can combine masking and fancy indexing as in the following:
data.loc[data.Density > 100, ['pop', 'Density']]
## pop Density
## New York 19651127 139.076746
## Florida 19552860 114.806121
Any of these indexing conventions may also be used to set or modify values; this is done in the standard way that you might be accustomed to from working with NumPy:
data.iloc[0, 2] = 90
data
## area pop Density
## California 423967 38332521 90.000000
## Texas 695662 26448193 38.018740
## New York 141297 19651127 139.076746
## Florida 170312 19552860 114.806121
## Illinois 149995 12882135 85.883763
Quick analysis of any data
data.describe()
## area pop Density
## count 5.000000 5.000000e+00 5.000000
## mean 316246.600000 2.337337e+07 93.557074
## std 242437.411951 9.640386e+06 37.681565
## min 141297.000000 1.288214e+07 38.018740
## 25% 149995.000000 1.955286e+07 85.883763
## 50% 170312.000000 1.965113e+07 90.000000
## 75% 423967.000000 2.644819e+07 114.806121
## max 695662.000000 3.833252e+07 139.076746
To get the complete information about the data set we can use info()
function.
data.info()
## <class 'pandas.core.frame.DataFrame'>
## Index: 5 entries, California to Illinois
## Data columns (total 3 columns):
## area 5 non-null int64
## pop 5 non-null int64
## Density 5 non-null float64
## dtypes: float64(1), int64(2)
## memory usage: 320.0+ bytes
To sort the DataFrame
data.sort_values(by = ['area'], ascending=True, inplace= False)
## area pop Density
## New York 141297 19651127 139.076746
## Illinois 149995 12882135 85.883763
## Florida 170312 19552860 114.806121
## California 423967 38332521 90.000000
## Texas 695662 26448193 38.018740
👉 Inplace = True will make changes to the data
We can sort by multiple columns as well
data.sort_values(by = ['area', 'pop'], ascending=[True,False], inplace= False)
## area pop Density
## New York 141297 19651127 139.076746
## Illinois 149995 12882135 85.883763
## Florida 170312 19552860 114.806121
## California 423967 38332521 90.000000
## Texas 695662 26448193 38.018740
How to remove duplicate rows ?
consider below DataFrame
data = pd.DataFrame({'k1':['one']*3 + ['two']*4, 'k2':[3,2,1,3,3,4,4]})
data
## k1 k2
## 0 one 3
## 1 one 2
## 2 one 1
## 3 two 3
## 4 two 3
## 5 two 4
## 6 two 4
To sort values
data.sort_values(by = 'k2')
## k1 k2
## 2 one 1
## 1 one 2
## 0 one 3
## 3 two 3
## 4 two 3
## 5 two 4
## 6 two 4
remove duplicates
data.drop_duplicates()
## k1 k2
## 0 one 3
## 1 one 2
## 2 one 1
## 3 two 3
## 5 two 4
We can even removes duplicates based on particular column. Let’s remove duplicate values from the k1 column.
data.drop_duplicates(subset='k1')
## k1 k2
## 0 one 3
## 3 two 3