Python Basics - NumPy and Pandas

At the end of this week, you will be able to:

Practice with Python Basics
Practice using NumPy and Pandas libraries
Write your first Python script!

References

Python Data Science Handbook (VanderPlas 2016). Free access
Think Python (Downey 2015). Free access
Data Science and Analytics with Python (Rogel-Salazar 2018)

Introduction to Python

Python has emerged in recent years as one of the most widely used tools for data science projects. It is known for its readable code and interactive features. Like R, Python is supported by a large number of packages that extend its features and functions. Common packages include, to name a few:

NumPy: provides functions for manipulating arrays
Pandas: provides functions for manipulating data frames
Matplotlib: provides functions for visualizations and plotting
Statsmodels: provides functions for statistical models
Scikit-learn: provides functions for machine learning algorithms

Getting started with Python

We will use the RStudio IDE to run Python, but there are other environments you may want to explore, such as PyCharm and Jupyter. We will be using Python 3. We will see that R and Python have many similarities.

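Indentation refers to the spaces at the beginning of a line of code. In Python, indentation is part of the syntax: it defines which lines belong to a block (such as the body of a function, a loop, or an if statement), and missing or inconsistent indentation raises an error.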
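To make the role of indentation concrete, here is a minimal sketch. The function name describe and the string bad_source are made up for illustration; compile is Python's built-in function, used here only so the error can be shown without stopping the script.

```python
# Indented lines under "if" and "else" form the body of each branch.
def describe(n):
    if n > 0:
        label = "positive"
    else:
        label = "non-positive"
    return label

print(describe(3))

# A block header followed by an unindented line is rejected by Python.
bad_source = "if True:\nprint('not indented')\n"
try:
    compile(bad_source, "<example>", "exec")
    error_name = None
except IndentationError as exc:
    error_name = type(exc).__name__

print(error_name)
```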

This week's recordings cover the following concepts:

Python Basics

Python Variables:

# This is Python Code
print("Hello World!")

Hello World!

You can name a variable following these rules:

One word
Use only letters, numbers, and the underscore (_) character
Can’t begin with a number
Python is case-sensitive
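The naming rules can be checked from Python itself. The sketch below uses the built-in str.isidentifier() method and the standard keyword module; the candidate names are invented for illustration. It also rejects reserved words such as class, a restriction that is not in the list above but applies to variable names as well.

```python
import keyword

candidates = ["my_var", "myVar2", "2nd_value", "my var", "my-var", "class"]

# isidentifier() checks the character rules; iskeyword() flags reserved words
valid = {
    name: name.isidentifier() and not keyword.iskeyword(name)
    for name in candidates
}
for name, ok in valid.items():
    print(f"{name!r}: {ok}")

# Case sensitivity: age and Age are two different variables
age = 40
Age = 85
print(age, Age)
```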

x = "HeyHey"
y = 40
x

'HeyHey'

x, y = "Hey", 45 # Assign values to multiple variables
print(x)

Hey

print(y)

ranks = ["first","second","third"] # list
x, y, z = ranks # unpack the list into three variables
print(ranks)

['first', 'second', 'third']

x

'first'

y

'second'

z

'third'

def myf():
  x="Hello"
  print(x)
  
myf()

Hello

def myf():
  global x # declare x as global, so the assignment below is visible outside the function
  x="Hello"
  print(x)
  
myf()

Hello
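The difference between the two versions of the function only shows once a variable with the same name exists outside it. A minimal sketch, with function names set_local and set_global chosen for illustration:

```python
x = "outside"

def set_local():
    x = "local"       # creates a new local variable; the outer x is untouched
    return x

def set_global():
    global x          # refer to the module-level x instead of a local one
    x = "changed"
    return x

local_result = set_local()
after_local = x       # the outer x still holds its original value
set_global()
after_global = x      # the outer x has been reassigned by set_global()

print(local_result, after_local, after_global)
```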

Data Types:

x = str(3)    # x will be '3'
x = int(3)    # x will be 3
x = float(3)  # x is a float - 3.0
x = 1j       # x is complex
x = range(5,45)    # x is a range type
x = [1,2,1,24,54,45,2,1]  # x is a list
x = (1,2,1,24,54,45,2,1)  # x is a tuple
x = {"name" : "Ach", "age" : 85}  # x is a dict (mapping)

Math operations:

5+4   # Addition

5*4   # Multiplication

5**4  # power / exponent

print("Hey"*3) # String operations

HeyHeyHey

import math as mt # more math functions are available in the math package
mt.cos(556) # cosine function

-0.9980848261016746

import random # generate random numbers
print(random.randrange(1, 10))

import numpy as np # generate random numbers
print(np.random.normal(loc=0,scale=1,size=2))

[-1.01193164 -0.62035193]

String operations:

word = "Hello There!"
word[1] # access a character by index; indexing starts at 0, so this is the 2nd character

'e'

for z in word:
  print(z)

H
e
l
l
o
 
T
h
e
r
e
!

len(word) # string length

"or" in word # check if "or" is in word

False

word1 = "Do you use Python or R or both!"
"or" in word1 # check if "or" is in word1

True
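A few more string operations follow the same pattern. The sketch below reuses the two strings from above; the variable names on the left are chosen for illustration, and all methods are part of Python's built-in str type.

```python
word = "Hello There!"
word1 = "Do you use Python or R or both!"

first_word = word[0:5]         # slice: characters at index 0 up to (not including) 5
last_char = word[-1]           # negative indices count from the end
shout = word.upper()           # returns a new string; word itself is unchanged
or_count = word1.count("or")   # number of non-overlapping occurrences of "or"
pieces = word.split(" ")       # split on spaces into a list of strings

print(first_word, last_char, shout, or_count, pieces)
```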

Python assignment operators:

Operator	Example	Equivalent to
=	x = 10	x = 10
+=	x += 10	x = x+10
-=	x -= 10	x = x-10
*=	x *= 10	x = x*10
/=	x /= 10	x = x/10
%=	x %= 10	x = x%10
**=	x **= 10	x = x**10
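As a quick check of the table, the operators can be applied one after another to the same variable. The starting value and the operands are arbitrary choices for illustration.

```python
x = 10
x += 10    # x is now 20
x -= 5     # x is now 15
x *= 2     # x is now 30
x /= 4     # x is now 7.5 (division always returns a float)
x %= 4     # x is now 3.5 (remainder of 7.5 divided by 4)
x **= 2    # x is now 12.25
print(x)
```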

If-Else Statements:

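h = 2
if h > 50: # test the stricter condition first; after "h > 2" it would never be reached
  print("Yes Yes!") # indentation is required; without it Python raises an IndentationError
elif h > 2:
  print("Yes!")
else:
  print("No")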

No
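Conditions in an if/elif chain are tested from top to bottom, and only the first true branch runs, so the order of the tests matters. A minimal sketch, wrapping the same thresholds in a hypothetical function classify so that several values can be tried:

```python
def classify(h):
    # The first condition that is true decides the result.
    if h > 50:
        return "Yes Yes!"
    elif h > 2:
        return "Yes!"
    else:
        return "No"

results = [classify(h) for h in (2, 10, 100)]
print(results)
```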

For Loop Statements:

for k in range(1,10): 
  print(str(k)) # range(1, 10) stops before 10, so this prints 1 through 9
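The sketch below shows a few common variations on the for loop: a range with a step, looping over a list with enumerate, and accumulating a total. The variable names are chosen for illustration.

```python
first = list(range(1, 10))       # 1 to 9; the stop value 10 is excluded
evens = list(range(0, 10, 2))    # the third argument is the step

ranks = ["first", "second", "third"]
for position, rank in enumerate(ranks, start=1):
    print(position, rank)        # enumerate yields a counter and the item

total = 0
for k in range(1, 10):
    total += k                   # running sum of 1..9

print(first, evens, total)
```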

Python NumPy

NumPy is a Python library. Its name stands for Numerical Python, and it is very useful for manipulating arrays. Operations on NumPy arrays are typically faster than the equivalent operations on lists, which makes NumPy well suited to machine learning applications.

import numpy # import the NumPy library
arr1 =  numpy.array([1,2,45,564,98]) # create array using NumPy
print(arr1)

[  1   2  45 564  98]

Usually, we give a library an alias, such as np for NumPy. Array objects in NumPy are of type ndarray. We can pass any array-like object (list, tuple, etc.) to the function array():

import numpy as np
arr1 = np.array([1,2,45,564,98])
print(arr1)

[  1   2  45 564  98]
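A short sketch of two points made above: array() accepts both lists and tuples, and arithmetic on an ndarray works element by element, unlike on a list. The variable names are chosen for illustration.

```python
import numpy as np

from_list = np.array([1, 2, 45, 564, 98])
from_tuple = np.array((1, 2, 45, 564, 98))
print(type(from_list).__name__)           # the array type is ndarray

values = [1, 2, 3]
doubled_list = values * 2                 # on a list, * repeats the list
doubled_array = np.array(values) * 2      # on an array, * multiplies each element

print(doubled_list, doubled_array)
```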

# Multidimensional arrays!
d0 = np.array(56)
d1 = np.array([15, 52, 83, 84, 55])
d2 = np.array([[1, 2, 3], [4, 5, 6]])
d3 = np.array([[[1, 2, 3], [4, 5, 6]], [[11, 21, 31], [41, 51, 61]]])

print(d0.ndim) # print dimension

print(d1.ndim)

print(d2.ndim)

print(d3.ndim)
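Alongside ndim, the shape attribute reports the size along each dimension. The sketch below repeats the four arrays from above so that it runs on its own.

```python
import numpy as np

d0 = np.array(56)
d1 = np.array([15, 52, 83, 84, 55])
d2 = np.array([[1, 2, 3], [4, 5, 6]])
d3 = np.array([[[1, 2, 3], [4, 5, 6]], [[11, 21, 31], [41, 51, 61]]])

# ndim is the number of dimensions; shape gives the length of each one
for name, arr in [("d0", d0), ("d1", d1), ("d2", d2), ("d3", d3)]:
    print(name, arr.ndim, arr.shape)

dims = [d0.ndim, d1.ndim, d2.ndim, d3.ndim]
```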

Array Indexing:

import numpy as np

D2 = np.array([[1,2,3,4,5], [6,7,8,9,10]], dtype=float)

print('4th element of 1st row: ', D2[0, 3])

4th element of 1st row:  4.0

print('4th element of 2nd row: ', D2[1, 3])

4th element of 2nd row:  9.0

print('1st row: ', D2[0, :])

1st row:  [1. 2. 3. 4. 5.]

arr = np.array([1, 2, 3, 4, 5, 6, 7])

print("From the start to index 2 (not included): ", arr[:2])

From the start to index 2 (not included):  [1 2]

print("From the index 2 (included) to the end: ", arr[2:])

From the index 2 (included) to the end:  [3 4 5 6 7]
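Indexing and slicing can be combined in a few more ways. The sketch below reuses D2 and arr from above and adds negative indices, a step, and slices along both dimensions; the names on the left are chosen for illustration.

```python
import numpy as np

D2 = np.array([[1, 2, 3, 4, 5], [6, 7, 8, 9, 10]], dtype=float)
arr = np.array([1, 2, 3, 4, 5, 6, 7])

last = arr[-1]              # negative index: last element
every_other = arr[1:6:2]    # start:stop:step, here indices 1, 3 and 5
third_column = D2[:, 2]     # all rows, column with index 2
block = D2[0:2, 1:3]        # rows 0-1, columns 1-2

print(last, every_other, third_column, block)
```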

Arithmetic operations and Math/Stat functions:

import numpy as np

a = np.array([[1,2,3,4,5], [6,7,8,9,10]], dtype="f")
b = np.array([[10,20,30,40,50], [60,70,80,90,100]], dtype="i")

np.subtract(b,a) # b-a

array([[ 9., 18., 27., 36., 45.],
       [54., 63., 72., 81., 90.]])

np.add(b,a) # b+a

array([[ 11.,  22.,  33.,  44.,  55.],
       [ 66.,  77.,  88.,  99., 110.]])

np.divide(b,a) # b/a

array([[10., 10., 10., 10., 10.],
       [10., 10., 10., 10., 10.]])

np.multiply(b,a) # b*a

array([[  10.,   40.,   90.,  160.,  250.],
       [ 360.,  490.,  640.,  810., 1000.]])
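The functions np.subtract, np.add, np.divide and np.multiply give the same results as the operators -, +, / and *. The sketch below checks this on the same two arrays, with the dtypes written out as np.float32 and np.int32 (an assumption about what "f" and "i" map to on a typical platform), and also shows that combining these two dtypes yields a float64 result.

```python
import numpy as np

a = np.array([[1, 2, 3, 4, 5], [6, 7, 8, 9, 10]], dtype=np.float32)
b = np.array([[10, 20, 30, 40, 50], [60, 70, 80, 90, 100]], dtype=np.int32)

# Each operator is elementwise and matches the corresponding function
same_sub = np.array_equal(b - a, np.subtract(b, a))
same_add = np.array_equal(b + a, np.add(b, a))
same_div = np.array_equal(b / a, np.divide(b, a))
same_mul = np.array_equal(b * a, np.multiply(b, a))

result_dtype = (b - a).dtype   # int32 combined with float32 is promoted to float64
scaled = a * 10                # a scalar is applied to every element

print(same_sub, same_add, same_div, same_mul, result_dtype)
```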

np.exp(a) # exponential function

array([[2.7182820e+00, 7.3890557e+00, 2.0085537e+01, 5.4598148e+01,
        1.4841316e+02],
       [4.0342877e+02, 1.0966332e+03, 2.9809580e+03, 8.1030840e+03,
        2.2026467e+04]], dtype=float32)

np.log(a) # natural logarithm function

array([[0.       , 0.6931472, 1.0986123, 1.3862944, 1.609438 ],
       [1.7917595, 1.9459102, 2.0794415, 2.1972246, 2.3025851]],
      dtype=float32)

np.sqrt(a) # square root function

array([[1.       , 1.4142135, 1.7320508, 2.       , 2.236068 ],
       [2.4494898, 2.6457512, 2.828427 , 3.       , 3.1622777]],
      dtype=float32)

np.full((3,3),5) # 3x3 constant array

array([[5, 5, 5],
       [5, 5, 5],
       [5, 5, 5]])

a.mean() # mean

np.float32(5.5)

a.std() # standard deviation

np.float32(2.8722813)

a.var() # variance

np.float32(8.25)

a.mean(axis=0) # mean along axis 0: one mean per column

array([3.5, 4.5, 5.5, 6.5, 7.5], dtype=float32)

np.median(a) # median

np.float32(5.5)

np.median(a,axis=0) # median along axis 0: one median per column

array([3.5, 4.5, 5.5, 6.5, 7.5], dtype=float32)
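The axis argument is easy to mix up, so here is a minimal side-by-side sketch on the same values as above (stored as float64 here for simplicity): axis=0 collapses the rows and returns one value per column, while axis=1 collapses the columns and returns one value per row.

```python
import numpy as np

a = np.array([[1, 2, 3, 4, 5], [6, 7, 8, 9, 10]], dtype=float)

col_means = a.mean(axis=0)   # one mean per column (5 values)
row_means = a.mean(axis=1)   # one mean per row (2 values)
overall = a.mean()           # mean of all 10 elements

print(col_means, row_means, overall)
```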

Random numbers generation:

random is a NumPy module that offers functions for working with random numbers.

from numpy import random

x = random.randint(100) # a random integer from 0 to 99 (100 is excluded)
print(x)

x = random.rand(10) # 10 random floats in the interval [0, 1)
print(x)

[0.7103468  0.39685383 0.81285286 0.80701665 0.921142   0.4840836
 0.83442244 0.80832368 0.59405329 0.69103172]

x = random.randint(100,size=(10)) # 10 random integers from 0 to 99
print(x)

[11 91 23 69 11  1 10 98 23 13]

x = random.randint(100,size=(10,10)) # 10x10 array of random integers from 0 to 99
print(x)

[[31 11 98  4 54 28 26 37 61 46]
 [33 30 47 50 62 60 21 61 55 23]
 [34  1 99 88 16 26 43 82 40 72]
 [20 61 86 92 78 52 13 17 34 61]
 [10 50 16  7 32 31 28 26 87 11]
 [50 25 21 91 52  0 97 26 99  3]
 [99 88 99 85 47 85  2 76  1  2]
 [98 58  5 99 32  7 54  9 87 95]
 [54 67 49 69 23 78 96 59 65 34]
 [85 98 79  9 19 54 91 41 60 94]]

x = random.choice([100,12,0,45]) # sample one value from an array
print(x)

x = random.choice([100,12,0,45],size=(10)) # sample 10 values from an array (with replacement)
print(x)

[100  12   0  12   0  45  45 100 100  45]

x = random.choice([100, 12, 0, 45], p=[0.1, 0.3, 0.6, 0.0], size=(10)) # Probability sampling
print(x)

[ 0  0 12 12  0  0  0  0  0  0]

x = random.normal(loc=1, scale=0.5, size=(10)) # Normal distribution
print(x)

[0.94518906 0.52280145 0.64625995 1.14689861 1.28152303 0.72201845
 0.97455791 1.06225798 1.31525931 0.35428343]

x = random.normal(loc=1, scale=0.5, size=(10)) # Normal distribution
print(x)

[0.57114677 0.65539387 0.92993001 0.68774553 1.31484725 0.96349392
 1.14567454 0.77124733 0.47719705 1.25891459]
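The two identical calls above print different numbers because each call draws new random values. When results need to be reproducible, a seed can be fixed. The sketch below uses NumPy's np.random.default_rng generator rather than the module-level functions shown above; the seed value 42 is an arbitrary choice.

```python
import numpy as np

# Two generators created with the same seed produce the same sequence
rng_a = np.random.default_rng(seed=42)
rng_b = np.random.default_rng(seed=42)

ints_a = rng_a.integers(0, 100, size=5)    # 5 integers from 0 to 99
ints_b = rng_b.integers(0, 100, size=5)    # identical to ints_a

normals = rng_a.normal(loc=1, scale=0.5, size=3)   # 3 draws from a normal distribution

print(ints_a, ints_b, normals)
```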

📚 For more reading visit Introduction to NumPy.

Python Pandas

Pandas is a Python library. It is useful for data wrangling and working with data sets. The name Pandas refers to both "Panel Data" and "Python Data Analysis". This is a handy Cheat Sheet for Pandas for data wrangling.

import pandas as pd

a = [1,6,8]
series = pd.Series(a) # this is a pandas Series
print(series)

0    1
1    6
2    8
dtype: int64

mydata = {
  "calories": [1000, 690, 190],
  "duration": [50, 40, 20]
}
mydataframe = pd.DataFrame(mydata) # data frame
mydataframe

   calories  duration
0      1000        50
1       690        40
2       190        20
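Once a data frame exists, columns and rows can be selected and new columns computed from existing ones. A minimal sketch on the same data; the column name calories_per_minute and the threshold of 40 are made up for illustration.

```python
import pandas as pd

mydata = {
    "calories": [1000, 690, 190],
    "duration": [50, 40, 20],
}
mydataframe = pd.DataFrame(mydata)

calories = mydataframe["calories"]    # a single column is a Series
first_row = mydataframe.loc[0]        # the row with index label 0

# New column computed element by element from two existing columns
mydataframe["calories_per_minute"] = mydataframe["calories"] / mydataframe["duration"]

# Keep only the rows where the condition is true
long_sessions = mydataframe[mydataframe["duration"] >= 40]

print(mydataframe)
print(long_sessions)
```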

Read CSV Files

CSV files are a simple way to store large data sets. Pandas can read a CSV file directly into a data frame:

import pandas as pd
import numpy as np

df = pd.read_csv("../datasets/mycars.csv")
print(df.info()) # Info about Data

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50 entries, 0 to 49
Data columns (total 3 columns):
 #   Column      Non-Null Count  Dtype
---  ------      --------------  -----
 0   Unnamed: 0  50 non-null     int64
 1   speed       50 non-null     int64
 2   dist        50 non-null     int64
dtypes: int64(3)
memory usage: 1.3 KB
None

df.head()

   Unnamed: 0  speed  dist
0           1      4     2
1           2      4    10
2           3      7     4
3           4      7    22
4           5      8    16
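A column named "Unnamed: 0" typically appears when the first column of the file has no header, for example when row numbers were saved along with the data. Whether that is the case for mycars.csv is an assumption. The sketch below imitates such a file with a short in-memory string (csv_text, built from the first rows shown above) and uses the index_col parameter of read_csv to treat that column as the row index instead.

```python
import io
import pandas as pd

# Stand-in for a CSV file whose first column has an empty header
csv_text = ",speed,dist\n1,4,2\n2,4,10\n3,7,4\n"

df_default = pd.read_csv(io.StringIO(csv_text))               # keeps the column as "Unnamed: 0"
df_indexed = pd.read_csv(io.StringIO(csv_text), index_col=0)  # uses it as the row index

print(df_default.columns.tolist())
print(df_indexed.columns.tolist())
```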

df.loc[3,"speed"] = np.nan # insert NaN in the "speed" column of the row with index label 3 (the 4th row)
df.head()

   Unnamed: 0  speed  dist
0           1    4.0     2
1           2    4.0    10
2           3    7.0     4
3           4    NaN    22
4           5    8.0    16

newdf = df.dropna() # drop rows that contain NA values
newdf.head()

   Unnamed: 0  speed  dist
0           1    4.0     2
1           2    4.0    10
2           3    7.0     4
4           5    8.0    16
5           6    9.0    10

df.dropna(inplace = True) # drop rows that contain NA values, modifying df in place
df.head()

   Unnamed: 0  speed  dist
0           1    4.0     2
1           2    4.0    10
2           3    7.0     4
4           5    8.0    16
5           6    9.0    10
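Before dropping rows, it often helps to count how many values are missing and to decide which columns should trigger a drop. A minimal sketch on a small made-up data frame (the values are for illustration only, not taken from mycars.csv):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "speed": [4, 4, np.nan, 7],
    "dist": [2, np.nan, 4, 22],
})

na_counts = df.isna().sum()                   # number of missing values per column
any_dropped = df.dropna()                     # drops rows with NaN in any column
speed_dropped = df.dropna(subset=["speed"])   # drops rows only if "speed" is NaN

print(na_counts)
print(len(df), len(any_dropped), len(speed_dropped))
```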

df = pd.read_csv("../datasets/mycars.csv")
df.fillna(100, inplace = True) # replace NA values with 100.

df["speed"] = df["speed"].fillna(10) # replace NA values with 10 only in column "speed"

x = df["speed"].mean() # find mean of speed
df["speed"] = df["speed"].fillna(x) # replace NA values with the mean only in column "speed"
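Another way to fill a single column, without calling an inplace method on df["speed"], is to pass fillna a dictionary that maps column names to fill values. The sketch below uses a small made-up data frame; the values are for illustration only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "speed": [4.0, np.nan, 8.0],
    "dist": [2.0, 10.0, np.nan],
})

# Only the columns named in the dictionary are filled; "dist" keeps its NaN
filled_const = df.fillna({"speed": 10})

# The fill value can also be computed, for example the column mean
filled_mean = df.fillna({"speed": df["speed"].mean()})

print(filled_const)
print(filled_mean)
```

Both calls return a new data frame and leave df unchanged; assign the result back to df to keep it.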

print(df.duplicated().head()) # show duplicates

0    False
1    False
2    False
3    False
4    False
dtype: bool

df.drop_duplicates().head() # drop duplicates

   Unnamed: 0  speed  dist
0           1      4     2
1           2      4    10
2           3      7     4
3           4      7    22
4           5      8    16
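The output above shows no duplicates in its first rows, so the effect is easier to see on a small made-up data frame that does contain a repeated row (values for illustration only):

```python
import pandas as pd

df = pd.DataFrame({
    "speed": [4, 4, 7, 4],
    "dist": [2, 10, 4, 2],
})

flags = df.duplicated()                                # True for rows that repeat an earlier row
unique_rows = df.drop_duplicates()                     # keeps the first occurrence of each row
unique_speed = df.drop_duplicates(subset=["speed"])    # compares the "speed" column only

print(flags.tolist())
print(unique_rows)
print(unique_speed)
```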

🛎 🎙️ Recordings on Canvas will cover more details and examples! Have fun learning and coding 😃! Let me know how I can help!

📚 👈 Assignments - Python Basics

Instructions are posted on Canvas.