Python for Data Science

Pandas is a package built on top of NumPy that provides an efficient implementation of the DataFrame. DataFrames are essentially multidimensional arrays with attached row and column labels, often with heterogeneous types. In this post we will focus on the Series, the DataFrame and related structures.

Installing and Using Pandas

Once pandas is installed, you can import it and check the version.

import pandas as pd
pd.__version__

## '0.24.2'

Pandas Objects

There are three fundamental pandas objects: the Series, the DataFrame and the Index.

The Pandas Series Object

A Pandas Series is a one-dimensional array of indexed data. It can be created from a list or array as follows :

data = pd.Series([0.23,1,3,5.89,67])
data

## 0     0.23
## 1     1.00
## 2     3.00
## 3     5.89
## 4    67.00
## dtype: float64

A Series wraps both a sequence of values and a sequence of indices, which we can access with the values and index attributes.

Values and index

data.values

## array([ 0.23,  1.  ,  3.  ,  5.89, 67.  ])

data.index

## RangeIndex(start=0, stop=5, step=1)

Accessing data by associated index.

data[2]

## 3.0

👉 A Pandas Series is much more flexible and general than a one-dimensional NumPy array.

While a NumPy array has an implicitly defined integer index used to access the values, a Pandas Series has an explicitly defined index associated with the values.

This explicit index definition gives the Series object additional capabilities. For example, the index need not be an integer, but can consist of values of any desired type.

data = pd.Series([0,2,4,6,8], index = ['a','b','c','d','e'])
data

## a    0
## b    2
## c    4
## d    6
## e    8
## dtype: int64
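The index labels do not even have to be contiguous or sorted. A minimal sketch of this, where the name `scores` and its values are made up for illustration:

```python
import pandas as pd

# Hypothetical data: integer labels that are neither sequential nor sorted
scores = pd.Series([0.25, 0.5, 0.75, 1.0], index=[2, 5, 3, 7])

# Lookup goes through the label, not the position
print(scores[5])        # the value stored under label 5
print(scores.loc[7])    # .loc makes the label-based lookup explicit
```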

Series as specialized dictionary

Constructing a series object directly from a Python dictionary.

mob_dict = {'Nokia': 37834,
'Samsung':234893,
'Apple':93849,
'Oneplus':298343}
mobile = pd.Series(mob_dict)
mobile

## Nokia       37834
## Samsung    234893
## Apple       93849
## Oneplus    298343
## dtype: int64

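By default, a Series will be created where the index is drawn from the dictionary keys. As the output above shows, the keys keep the dictionary's insertion order (pandas versions before 0.23 sorted the keys instead).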
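A small sketch of how key order carries over to the index, assuming pandas 0.23 or later on Python 3.6 or later. The name `sales` is only for illustration; `sort_index()` is one way to get an alphabetical index when that is what you want:

```python
import pandas as pd

# Dict insertion order is preserved in the index (pandas 0.23+, Python 3.6+)
sales = pd.Series({'Nokia': 37834, 'Samsung': 234893,
                   'Apple': 93849, 'Oneplus': 298343})
print(list(sales.index))

# Sort explicitly when an alphabetical index is wanted
sales_sorted = sales.sort_index()
print(list(sales_sorted.index))
```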

Accessing items in dictionary style

mobile['Oneplus']

## 298343

Unlike a dictionary, a Series also supports array-style operations such as slicing:

mobile['Nokia':'Samsung']

## Nokia       37834
## Samsung    234893
## dtype: int64

Constructing Series objects

pd.Series(data, index = index)

where index is an optional argument and data can be one of many entities.

For example, data can be a list or NumPy array, in which case index defaults to an integer sequence.

pd.Series([2,4,5,9])

## 0    2
## 1    4
## 2    5
## 3    9
## dtype: int64

data can be a scalar, which is repeated to fill the specified index

pd.Series(5, index = [100,200,300])

## 100    5
## 200    5
## 300    5
## dtype: int64

data can be a dictionary, in which case the index defaults to the dictionary keys in insertion order (pandas versions before 0.23 sorted the keys instead)

pd.Series({2:'a', 1:'b', 3:'c'})

## 2    a
## 1    b
## 3    c
## dtype: object
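In each case an explicit index can also be passed to select and order the entries. A short sketch for the dictionary case (the name `subset` is illustrative):

```python
import pandas as pd

# Only the keys listed in `index` are kept, in the order given
subset = pd.Series({2: 'a', 1: 'b', 3: 'c'}, index=[3, 2])
print(subset)
```

Here the Series is populated only with the explicitly requested keys 3 and 2.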

DataFrame Object

If a Series is an analog of a one-dimensional array with flexible indices, a DataFrame is an analog of a two-dimensional array with both flexible row indices and flexible column names.

mob_dict = {'Nokia': 35,
'Samsung':65,
'Apple':16,
'Oneplus':10}
models = pd.Series(mob_dict)
models

## Nokia      35
## Samsung    65
## Apple      16
## Oneplus    10
## dtype: int64

We can use a dictionary to construct a single two-dimensional object that combines the sales Series from earlier with this models Series.

mobiles = pd.DataFrame({'Sales' : mobile, 'models': models})
mobiles

##           Sales  models
## Nokia     37834      35
## Samsung  234893      65
## Apple     93849      16
## Oneplus  298343      10

Like the Series object, the DataFrame has an index attribute that gives access to the index labels, and a columns attribute that holds the column labels.

mobiles.index

## Index(['Nokia', 'Samsung', 'Apple', 'Oneplus'], dtype='object')

mobiles.columns

## Index(['Sales', 'models'], dtype='object')

mobiles['models']

## Nokia      35
## Samsung    65
## Apple      16
## Oneplus    10
## Name: models, dtype: int64

A pandas DataFrame can be constructed in many ways. Here are a few examples.

From single Series object

pd.DataFrame(mobile, columns=['mobile'])

##          mobile
## Nokia     37834
## Samsung  234893
## Apple     93849
## Oneplus  298343

From list of dicts

Any list of dictionaries can be made into a DataFrame.

data = [{'a': i, 'b': 2 * i}
        for i in range(3)]
data

## [{'a': 0, 'b': 0}, {'a': 1, 'b': 2}, {'a': 2, 'b': 4}]

pd.DataFrame(data)

##    a  b
## 0  0  0
## 1  1  2
## 2  2  4
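The dictionaries do not all need the same keys. As a brief sketch with made-up values, pandas fills the gaps with NaN:

```python
import pandas as pd

# Keys missing from a dictionary become NaN in the resulting DataFrame
df = pd.DataFrame([{'a': 1, 'b': 2}, {'b': 3, 'c': 4}])
print(df)
```

Columns that contain NaN are stored as floating point, which is why `a` and `c` print with a decimal point.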

From dictionary of series objects

pd.DataFrame({'Sales':mobile, 'models': models})

##           Sales  models
## Nokia     37834      35
## Samsung  234893      65
## Apple     93849      16
## Oneplus  298343      10

From two dimensional numpy array

import numpy as np
pd.DataFrame(np.random.rand(3, 2),
             columns=['foo', 'bar'],
             index=['a', 'b', 'c'])

##         foo       bar
## a  0.349491  0.801267
## b  0.553471  0.268641
## c  0.262647  0.805099

From numpy structured array

A = np.zeros(3,dtype=[('A', 'i8'), ('B', 'f8')])
A

## array([(0, 0.), (0, 0.), (0, 0.)], dtype=[('A', '<i8'), ('B', '<f8')])

pd.DataFrame(A)

##    A    B
## 0  0  0.0
## 1  0  0.0
## 2  0  0.0

Data Indexing and Selection

As we know, a Series object acts in many ways like a one-dimensional NumPy array and in many ways like a standard Python dictionary.

data = pd.Series([0.25, 0.5, 0.75, 1.0], index = ['a','b', 'c','d'])
data

## a    0.25
## b    0.50
## c    0.75
## d    1.00
## dtype: float64

data['a']

## 0.25

data.keys()

## Index(['a', 'b', 'c', 'd'], dtype='object')

list(data.items())

## [('a', 0.25), ('b', 0.5), ('c', 0.75), ('d', 1.0)]

Just as you can extend a dictionary by assigning to a new key, you can extend a Series by assigning to a new index value:

data['e']=1.34
data

## a    0.25
## b    0.50
## c    0.75
## d    1.00
## e    1.34
## dtype: float64

Series as one-dimensional array

slice by explicit index

data['a':'c']

## a    0.25
## b    0.50
## c    0.75
## dtype: float64

slice by implicit integer index

data[0:2]

## a    0.25
## b    0.50
## dtype: float64

masking

data[(data > 0.3) & (data < 0.8)]

## b    0.50
## c    0.75
## dtype: float64

fancy indexing

data[['a','e']]

## a    0.25
## e    1.34
## dtype: float64

👉 When slicing with an explicit index (e.g., data['a':'c']), the final index is included in the slice, while when slicing with an implicit index (e.g., data[0:2]), the final index is excluded from the slice.
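The difference is easy to check by counting the rows each slice returns. A minimal sketch, using `loc` and `iloc` to make the intent explicit (the names `by_label` and `by_position` are illustrative):

```python
import pandas as pd

s = pd.Series([0.25, 0.5, 0.75, 1.0], index=['a', 'b', 'c', 'd'])

by_label = s.loc['a':'c']     # label slice: the end label 'c' is included
by_position = s.iloc[0:2]     # positional slice: position 2 is excluded

print(len(by_label), len(by_position))
```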

Indexers: loc, iloc and ix

data = pd.Series(['a', 'b', 'c'], index = [1,3,5])
data

## 1    a
## 3    b
## 5    c
## dtype: object

explicit index when indexing

data[1]

## 'a'

implicit index when slicing

data[1:3]

## 3    b
## 5    c
## dtype: object

Because of this potential confusion with integer indexes, pandas provides some special indexer attributes that explicitly expose certain indexing schemes.

First, the loc attribute allows indexing and slicing that always references the explicit index:

data.loc[1]

## 'a'

data.loc[1:3]

## 1    a
## 3    b
## dtype: object

The iloc attribute allows indexing and slicing that always references the implicit Python-style index

data.iloc[1]

## 'b'

data.iloc[1:3]

## 3    b
## 5    c
## dtype: object

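A third indexing attribute, ix, is a hybrid of the two, and for Series objects is equivalent to standard []-based indexing. The purpose of the ix indexer will become more apparent in the context of DataFrame. Note that ix was deprecated in pandas 0.20.0 and removed in pandas 1.0, so it is only available in older versions such as the 0.24.2 used here; loc and iloc are the recommended indexers.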

Data Selection in Data Frame

A DataFrame acts in many ways like a two-dimensional or structured array, and in other ways like a dictionary of Series structures sharing the same index. These analogies can be helpful to keep in mind as we explore data selection within this structure.

Data Frame as a Dictionary

area = pd.Series({'California': 423967, 'Texas': 695662,
                  'New York': 141297, 'Florida': 170312,
                  'Illinois': 149995})
pop = pd.Series({'California': 38332521, 'Texas': 26448193,
                 'New York': 19651127, 'Florida': 19552860,
                 'Illinois': 12882135})
data = pd.DataFrame({'area':area, 'pop':pop})
data

##               area       pop
## California  423967  38332521
## Texas       695662  26448193
## New York    141297  19651127
## Florida     170312  19552860
## Illinois    149995  12882135

The individual Series that make up the columns of the DataFrame can be accessed via dictionary-style indexing of the column name:

data['area']

## California    423967
## Texas         695662
## New York      141297
## Florida       170312
## Illinois      149995
## Name: area, dtype: int64

Equivalently, we can use attribute-style access with column names that are strings:

data.area

## California    423967
## Texas         695662
## New York      141297
## Florida       170312
## Illinois      149995
## Name: area, dtype: int64

This attribute-style column access actually accesses the exact same object as the dictionary-style access

data.area is data['area']

## True

👉 If the column names are not strings, or if they conflict with methods of the DataFrame, attribute-style access is not possible. For example, the DataFrame has a pop() method, so data.pop refers to the method rather than the 'pop' column:

data.pop is data['pop']

## False

You should also avoid the temptation to try column assignment via attribute (that is, use data['pop'] = z rather than data.pop = z).
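To see why, here is a small sketch with a hypothetical `density` column. Assigning through an attribute only sets a plain Python attribute on the object, and pandas emits a warning about it (silenced here to keep the output short):

```python
import warnings
import pandas as pd

df = pd.DataFrame({'area': [423967, 695662], 'pop': [38332521, 26448193]},
                  index=['California', 'Texas'])

with warnings.catch_warnings():
    warnings.simplefilter('ignore')   # pandas warns about this pattern
    # Sets a plain attribute on the object; no column is created
    df.density = df['pop'] / df['area']

created_by_attribute = 'density' in df.columns
print(created_by_attribute)   # False

# Dictionary-style assignment does create the column
df['density'] = df['pop'] / df['area']
created_by_key = 'density' in df.columns
print(created_by_key)         # True
```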

Dictionary-style syntax can also be used to modify the object, in this case adding a new column

data['Density'] = data['pop']/data['area']
data

##               area       pop     Density
## California  423967  38332521   90.413926
## Texas       695662  26448193   38.018740
## New York    141297  19651127  139.076746
## Florida     170312  19552860  114.806121
## Illinois    149995  12882135   85.883763

DataFrame as a Two-Dimensional Array

As mentioned previously, we can also view the DataFrame as an enhanced two-dimensional array. We can examine the raw underlying data array using the values attribute :

data.values

## array([[4.23967000e+05, 3.83325210e+07, 9.04139261e+01],
##        [6.95662000e+05, 2.64481930e+07, 3.80187404e+01],
##        [1.41297000e+05, 1.96511270e+07, 1.39076746e+02],
##        [1.70312000e+05, 1.95528600e+07, 1.14806121e+02],
##        [1.49995000e+05, 1.28821350e+07, 8.58837628e+01]])
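The integer columns show up as floats above because a NumPy array holds a single dtype, so mixed integer and float columns are converted to a common one. A sketch with a small made-up frame; `to_numpy()` is the newer spelling of the same conversion and assumes pandas 0.24 or later:

```python
import pandas as pd

df = pd.DataFrame({'area': [423967, 695662],      # int64 column
                   'density': [90.41, 38.02]})    # float64 column

arr = df.to_numpy()   # same data as df.values
print(df.dtypes)      # one dtype per column
print(arr.dtype)      # one common dtype for the whole array
```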

With this picture in mind, many familiar array-like operations can be done on the DataFrame itself. For example, we can transpose the full DataFrame to swap rows and columns:

data.T

##            California         Texas      New York       Florida      Illinois
## area     4.239670e+05  6.956620e+05  1.412970e+05  1.703120e+05  1.499950e+05
## pop      3.833252e+07  2.644819e+07  1.965113e+07  1.955286e+07  1.288214e+07
## Density  9.041393e+01  3.801874e+01  1.390767e+02  1.148061e+02  8.588376e+01

When it comes to indexing of DataFrame objects, however, it is clear that the dictionary-style indexing of columns precludes our ability to simply treat it as a NumPy array. In particular, passing a single index to an array accesses a row:

data.values[0]

## array([4.23967000e+05, 3.83325210e+07, 9.04139261e+01])

and passing a single “index” to a DataFrame accesses a column:

data['area']

## California    423967
## Texas         695662
## New York      141297
## Florida       170312
## Illinois      149995
## Name: area, dtype: int64

Thus for array-style indexing, we need another convention. Here Pandas again uses the loc, iloc, and ix indexers mentioned earlier. Using the iloc indexer, we can index the underlying array as if it is a simple NumPy array (using the implicit Python-style index), but the DataFrame index and column labels are maintained in the result:

data

##               area       pop     Density
## California  423967  38332521   90.413926
## Texas       695662  26448193   38.018740
## New York    141297  19651127  139.076746
## Florida     170312  19552860  114.806121
## Illinois    149995  12882135   85.883763

data.iloc[:3, :2]

##               area       pop
## California  423967  38332521
## Texas       695662  26448193
## New York    141297  19651127

Similarly, using the loc indexer we can index the underlying data in an array-like style but using the explicit index and column names:

data.loc[:'Illinois', :'pop']

##               area       pop
## California  423967  38332521
## Texas       695662  26448193
## New York    141297  19651127
## Florida     170312  19552860
## Illinois    149995  12882135

The ix indexer allows a hybrid of these two approaches (it only exists in pandas versions before 1.0):

data.ix[:3, :'pop']

##               area       pop
## California  423967  38332521
## Texas       695662  26448193
## New York    141297  19651127
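On pandas 1.0 and later, where ix is not available, the same mixed selection can be written with loc or iloc alone. A sketch, assuming a small frame shaped like the one above (Density values are rounded for illustration, and the variable names are made up):

```python
import pandas as pd

df = pd.DataFrame({'area': [423967, 695662, 141297, 170312],
                   'pop': [38332521, 26448193, 19651127, 19552860],
                   'Density': [90.41, 38.02, 139.08, 114.81]},
                  index=['California', 'Texas', 'New York', 'Florida'])

# Option 1: turn the row positions into labels, then use loc
first_three = df.loc[df.index[:3], :'pop']

# Option 2: turn the column label into a position, then use iloc
# (+1 because positional slices exclude the end)
same_rows = df.iloc[:3, :df.columns.get_loc('pop') + 1]

print(first_three.equals(same_rows))
```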

In the loc indexer we can combine masking and fancy indexing as in the following:

data.loc[data.Density > 100, ['pop', 'Density']]

##                pop     Density
## New York  19651127  139.076746
## Florida   19552860  114.806121

Any of these indexing conventions may also be used to set or modify values; this is done in the standard way that you might be accustomed to from working with NumPy:

data.iloc[0, 2] = 90
data

##               area       pop     Density
## California  423967  38332521   90.000000
## Texas       695662  26448193   38.018740
## New York    141297  19651127  139.076746
## Florida     170312  19552860  114.806121
## Illinois    149995  12882135   85.883763

Quick analysis of any data

data.describe()

##                 area           pop     Density
## count       5.000000  5.000000e+00    5.000000
## mean   316246.600000  2.337337e+07   93.557074
## std    242437.411951  9.640386e+06   37.681565
## min    141297.000000  1.288214e+07   38.018740
## 25%    149995.000000  1.955286e+07   85.883763
## 50%    170312.000000  1.965113e+07   90.000000
## 75%    423967.000000  2.644819e+07  114.806121
## max    695662.000000  3.833252e+07  139.076746

To get a summary of the data set (index, column names, non-null counts, dtypes and memory usage) we can use the info() method.

data.info()

## <class 'pandas.core.frame.DataFrame'>
## Index: 5 entries, California to Illinois
## Data columns (total 3 columns):
## area       5 non-null int64
## pop        5 non-null int64
## Density    5 non-null float64
## dtypes: float64(1), int64(2)
## memory usage: 320.0+ bytes

To sort the DataFrame

data.sort_values(by = ['area'], ascending=True, inplace= False)

##               area       pop     Density
## New York    141297  19651127  139.076746
## Illinois    149995  12882135   85.883763
## Florida     170312  19552860  114.806121
## California  423967  38332521   90.000000
## Texas       695662  26448193   38.018740

👉 With inplace=True, sort_values modifies the DataFrame itself instead of returning a sorted copy.
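A quick sketch of the difference, using a cut-down version of the frame above (the variable names are illustrative):

```python
import pandas as pd

df = pd.DataFrame({'area': [423967, 695662, 141297]},
                  index=['California', 'Texas', 'New York'])

# Default (inplace=False): a sorted copy is returned, df keeps its order
sorted_copy = df.sort_values(by='area')
order_after_copy = list(df.index)

# inplace=True: df itself is reordered and the call returns None
result = df.sort_values(by='area', inplace=True)
order_after_inplace = list(df.index)

print(order_after_copy)
print(order_after_inplace)
```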

We can sort by multiple columns as well

data.sort_values(by = ['area', 'pop'], ascending=[True,False], inplace= False)

##               area       pop     Density
## New York    141297  19651127  139.076746
## Illinois    149995  12882135   85.883763
## Florida     170312  19552860  114.806121
## California  423967  38332521   90.000000
## Texas       695662  26448193   38.018740
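In the example above every area value is unique, so the second sort key never comes into play. A sketch with made-up data that contains ties in the first column shows its effect:

```python
import pandas as pd

# Hypothetical data with repeated values in 'group' so the second key matters
df = pd.DataFrame({'group': [2, 1, 2, 1],
                   'score': [10, 30, 40, 20]})

# Ascending by 'group', then descending by 'score' within each group
ordered = df.sort_values(by=['group', 'score'], ascending=[True, False])
print(ordered)
```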

How to remove duplicate rows?

Consider the DataFrame below:

data = pd.DataFrame({'k1':['one']*3 + ['two']*4, 'k2':[3,2,1,3,3,4,4]})
data

##     k1  k2
## 0  one   3
## 1  one   2
## 2  one   1
## 3  two   3
## 4  two   3
## 5  two   4
## 6  two   4

To sort values

data.sort_values(by = 'k2')

##     k1  k2
## 2  one   1
## 1  one   2
## 0  one   3
## 3  two   3
## 4  two   3
## 5  two   4
## 6  two   4

remove duplicates

data.drop_duplicates()

##     k1  k2
## 0  one   3
## 1  one   2
## 2  one   1
## 3  two   3
## 5  two   4

We can also remove duplicates based on a particular column. Let's drop the rows with duplicate values in the k1 column, which keeps only the first row for each distinct value.

data.drop_duplicates(subset='k1')

##     k1  k2
## 0  one   3
## 3  two   3
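Which of the duplicated rows survives is controlled by the keep parameter of drop_duplicates. A short sketch on the same data (the result names are illustrative):

```python
import pandas as pd

df = pd.DataFrame({'k1': ['one'] * 3 + ['two'] * 4,
                   'k2': [3, 2, 1, 3, 3, 4, 4]})

# keep='last' retains the last row for each distinct k1 value
last_per_key = df.drop_duplicates(subset='k1', keep='last')

# keep=False drops every row that has a duplicate anywhere in the frame
no_repeats = df.drop_duplicates(keep=False)

print(last_per_key)
print(no_repeats)
```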

Vidyasagar Bhargava
