12) Intro to NumPy (pronounced Num-Pie), Numerical Python

Libraries commonly used for Data Science

numpy, scipy, matplotlib, pandas, and scikit-learn will all be at least briefly covered in this course. The two we have not used yet are the last two; you can install them with:

$ conda install pandas scikit-learn

We’ll go over each one, starting with NumPy, which we’ve already used several times.

The Case for NumPy

NumPy provides an efficient way to store and manipulate multi-dimensional dense arrays in Python. The important features of NumPy are:

  • It provides an ndarray structure, which allows efficient storage and manipulation of vectors, matrices, and higher-dimensional datasets.
  • It provides a readable and efficient syntax for operating on this data, from simple element-wise arithmetic to more complicated linear algebraic operations. (As we’ve previously discussed: avoid writing your own loops where a library can do the work; this speeds up both your programming and your programs.)

In the simplest case, NumPy arrays look a lot like Python lists. For example, here is an array containing the range of numbers 1 to 9 (compare this with Python’s built-in range()):

In [1]:
import numpy as np

x = np.arange(1, 10)
print(type(x), x)

y = list(range(1, 10))
print(type(y), y)
<class 'numpy.ndarray'> [1 2 3 4 5 6 7 8 9]
<class 'list'> [1, 2, 3, 4, 5, 6, 7, 8, 9]

Key differences include:

  • We can directly do math on an ndarray, versus needing a loop (= slow) for lists
  • Lists are always 1D (although you can have lists of lists) while arrays can have any number of dimensions
In [2]:
x_squared = x**2
print(x_squared)
y_squared = [val ** 2 for val in y]
print(y_squared)
[ 1  4  9 16 25 36 49 64 81]
[1, 4, 9, 16, 25, 36, 49, 64, 81]
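
To put a rough number on "loop = slow", here is a minimal timing sketch using the standard-library timeit module. The array size and the repeat count are arbitrary choices made for illustration, and timings vary from machine to machine, so no expected output is shown.

```python
import timeit

import numpy as np

n = 10_000                      # arbitrary size chosen for illustration
arr = np.arange(n)
lst = list(range(n))

# Time 100 repetitions of squaring every element, each way
t_array = timeit.timeit(lambda: arr ** 2, number=100)
t_list = timeit.timeit(lambda: [val ** 2 for val in lst], number=100)

print(f"ndarray: {t_array:.4f} s   list comprehension: {t_list:.4f} s")
```

On most machines the ndarray version should come out well ahead, because the per-element work happens in NumPy's compiled code rather than in the Python interpreter.
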
In [3]:
m = x.reshape((3,3))  # This reshape command will only work if the total size remains the same
print("matrix:")
print(m)
matrix:
[[1 2 3]
 [4 5 6]
 [7 8 9]]
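
As a small sketch of the "total size must remain the same" rule: passing -1 for one dimension asks NumPy to infer it, and a shape whose product does not match the number of elements raises a ValueError. The (2, 5) shape below is just an illustrative mismatch.

```python
import numpy as np

x = np.arange(1, 10)               # 9 elements

inferred = x.reshape((3, -1))      # -1 lets NumPy work out the second dimension
print(inferred.shape)              # (3, 3)

try:
    x.reshape((2, 5))              # 2 * 5 = 10, but x has only 9 elements
except ValueError as err:
    print("reshape failed:", err)
```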

Furthermore, NumPy knows how to do lots of math, including linear algebra.

What is a matrix? Who remembers what the transpose of a matrix is?

For next class, would you like a review of linear algebra basics?

In [4]:
print(m.T)
[[1 4 7]
 [2 5 8]
 [3 6 9]]
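
As one hedged example of the linear algebra NumPy knows, the sketch below contrasts element-wise multiplication (*) with the matrix product (the @ operator), rebuilding the same 3x3 matrix m used above.

```python
import numpy as np

m = np.arange(1, 10).reshape((3, 3))

elementwise = m * m      # multiplies matching entries
product = m @ m.T        # matrix product of m with its transpose

print(elementwise)
print(product)

# A matrix times its own transpose is always symmetric
print(np.array_equal(product, product.T))   # True
```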

There are multiple ways to make ndarrays, including from lists or lists of lists.

In [5]:
# an array from a list
a = np.array([3.14, 4, 2, 3])
print(a, a.shape)
[ 3.14  4.    2.    3.  ] (4,)
In [6]:
m * 2.5
Out[6]:
array([[  2.5,   5. ,   7.5],
       [ 10. ,  12.5,  15. ],
       [ 17.5,  20. ,  22.5]])
In [7]:
# nested lists result in multi-dimensional arrays
list_of_lists = [list(range(i, i + 3)) for i in [2, 4, 6, 8]]
print(list_of_lists)
b = np.array(list_of_lists)
print(b, b.shape)
[[2, 3, 4], [4, 5, 6], [6, 7, 8], [8, 9, 10]]
[[ 2  3  4]
 [ 4  5  6]
 [ 6  7  8]
 [ 8  9 10]] (4, 3)
In [8]:
# the same works with a list of lists written out literally
c = np.array([[1, 2], [3, 4], [5, 6]])
print(c, c.shape)
[[1 2]
 [3 4]
 [5 6]] (3, 2)

If you don’t have a specific set of values you want to use, it is more efficient to directly generate ndarrays.

In [9]:
# Create a length-10 integer array filled with zeros
np.zeros(10, dtype=int)
Out[9]:
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0])
In [10]:
# Create a 3x5 floating-point (the default type) array filled with ones
np.ones((3, 5))
Out[10]:
array([[ 1.,  1.,  1.,  1.,  1.],
       [ 1.,  1.,  1.,  1.,  1.],
       [ 1.,  1.,  1.,  1.,  1.]])
In [11]:
# Create a 3x5 array filled with NaN (any fill value works, e.g. 3.14)
np.full((3, 5), np.nan)
Out[11]:
array([[ nan,  nan,  nan,  nan,  nan],
       [ nan,  nan,  nan,  nan,  nan],
       [ nan,  nan,  nan,  nan,  nan]])
In [12]:
# Create an array filled with a linear sequence
# Starting at 0, stopping before 20 (the stop value is excluded), stepping by 2
# (this is similar to the built-in range() function)
np.arange(0, 20, 2)
Out[12]:
array([ 0,  2,  4,  6,  8, 10, 12, 14, 16, 18])
In [13]:
# Careful: the third argument is the step, not the number of points
np.arange(0, 1, 5)
Out[13]:
array([0])
In [14]:
# Create an array of five values evenly spaced between 0 and 1
np.linspace(0, 1, 5)
Out[14]:
array([ 0.  ,  0.25,  0.5 ,  0.75,  1.  ])
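
The two cells above are easy to mix up, so here is a short side-by-side sketch: np.arange takes a step and excludes the stop value, while np.linspace takes a number of points and includes the stop value by default. The variable names are ours, chosen for illustration.

```python
import numpy as np

by_step = np.arange(0, 1, 0.25)     # step of 0.25, stop value excluded
by_count = np.linspace(0, 1, 5)     # 5 points, stop value included

print(len(by_step), by_step[-1])    # 4 0.75
print(len(by_count), by_count[-1])  # 5 1.0
```
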
In [15]:
# Create a 3x3 array of uniformly distributed
# random values between 0 and 1
np.random.random((3, 3))
Out[15]:
array([[ 0.18421248,  0.51522334,  0.85313282],
       [ 0.62069783,  0.99603971,  0.22842997],
       [ 0.5869599 ,  0.27569295,  0.43192871]])
In [16]:
# Create a 3x3 array of normally distributed random values
# with mean 0 and standard deviation 1
np.random.normal(0, 1, (3, 3))
Out[16]:
array([[-0.50834015,  0.16258983, -1.1034789 ],
       [-2.19197721, -0.46079505,  0.80806383],
       [-1.39395404, -1.69823953,  0.6562409 ]])
In [17]:
# Create a 3x3 array of random integers in the interval [0, 10)
np.random.randint(0, 10, (3, 3))
Out[17]:
array([[2, 0, 9],
       [2, 5, 4],
       [6, 3, 8]])
In [18]:
# Create a 4x4 identity matrix
np.eye(4)
Out[18]:
array([[ 1.,  0.,  0.,  0.],
       [ 0.,  1.,  0.,  0.],
       [ 0.,  0.,  1.,  0.],
       [ 0.,  0.,  0.,  1.]])
In [19]:
# Create an uninitialized array of three floats (the default type)
# The values will be whatever happens to already exist at that memory location
np.empty(3)
Out[19]:
array([ -2.68156159e+154,  -2.68156159e+154,  -2.68156159e+154])
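
Like the other constructors, np.empty accepts a dtype argument. A minimal sketch of checking which type you actually got (the contents themselves are arbitrary, so we only look at the dtype):

```python
import numpy as np

floats = np.empty(3)              # no dtype given, so float64
ints = np.empty(3, dtype=int)     # explicitly request integers

print(floats.dtype)   # float64
print(ints.dtype)     # the platform's default integer, typically int64
```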

Other data types you can use in NumPy arrays include booleans and complex numbers.

In [20]:
# The following two commands are equivalent
np.ones((3, 5), dtype=bool)
np.full((3, 5), True)
Out[20]:
array([[ True,  True,  True,  True,  True],
       [ True,  True,  True,  True,  True],
       [ True,  True,  True,  True,  True]], dtype=bool)
In [21]:
np.zeros((3, 5), dtype=complex)
Out[21]:
array([[ 0.+0.j,  0.+0.j,  0.+0.j,  0.+0.j,  0.+0.j],
       [ 0.+0.j,  0.+0.j,  0.+0.j,  0.+0.j,  0.+0.j],
       [ 0.+0.j,  0.+0.j,  0.+0.j,  0.+0.j,  0.+0.j]])
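
An existing array can also be converted to another data type with its astype method, which returns a new array. A small sketch with made-up values:

```python
import numpy as np

values = np.array([1.7, -2.3, 0.0])

as_int = values.astype(int)          # fractional part dropped (truncates toward zero)
as_bool = values.astype(bool)        # zero -> False, anything else -> True
as_complex = values.astype(complex)  # imaginary part set to zero

print(as_int)             # [ 1 -2  0]
print(as_bool)            # [ True  True False]
print(as_complex.dtype)   # complex128
```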

Some Useful NumPy Array Attributes

First let’s discuss some useful array attributes. We’ll start by defining three random arrays, a one-dimensional, two-dimensional, and three-dimensional array. We’ll use NumPy’s random number generator, which we will seed with a set value in order to ensure that the same random arrays are generated each time this code is run:

In [22]:
np.random.seed(0)  # seed for reproducibility

x1 = np.random.randint(10, size=6)  # One-dimensional array
x2 = np.random.randint(10, size=(3, 4))  # Two-dimensional array
x3 = np.random.randint(10, size=(3, 4, 5))  # Three-dimensional array
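
As an aside, newer NumPy code often uses a seeded Generator object from np.random.default_rng instead of the global np.random.seed. The sketch below shows only that two generators with the same seed agree with each other; their values will not match the np.random.randint arrays above, since the two APIs draw numbers differently.

```python
import numpy as np

rng_a = np.random.default_rng(0)   # seeded generator
rng_b = np.random.default_rng(0)   # second generator, same seed

a = rng_a.integers(10, size=(3, 4))   # integers in [0, 10)
b = rng_b.integers(10, size=(3, 4))

print(np.array_equal(a, b))   # True: same seed, same sequence
```
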
In [23]:
print("x3 shape:", x3.shape) # shape
print("x3 ndim: ", x3.ndim) # number of dimensions
print("x3 size: ", x3.size) # total size of the array
print("dtype:", x3.dtype) # data type
x3 shape: (3, 4, 5)
x3 ndim:  3
x3 size:  60
dtype: int64
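
Two more attributes that are sometimes handy are itemsize (bytes per element) and nbytes (bytes for the whole array). A minimal sketch, using an array of zeros with the same shape as x3 and an explicit int64 dtype so the numbers do not depend on the platform:

```python
import numpy as np

arr = np.zeros((3, 4, 5), dtype=np.int64)

print(arr.itemsize)   # 8 bytes per int64 element
print(arr.nbytes)     # 480 = 60 elements * 8 bytes
```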

Some Ways ndarrays Are Like Python Lists: Slicing and indexing

What we previously learned about list slicing and indexing applies here, too:

In [24]:
x1
print(x1)
print(type(x1))
[5 0 3 3 7 9]
<class 'numpy.ndarray'>
In [25]:
x1[4]
Out[25]:
7
In [26]:
x1[0]
Out[26]:
5
In [27]:
x1[-2]
Out[27]:
7
In [28]:
# Let's compare 2D indexing
list2 = [[3, 5, 2, 4], [7, 6, 8, 8], [1, 6, 7, 7]]
print(list2)
print(x2)
[[3, 5, 2, 4], [7, 6, 8, 8], [1, 6, 7, 7]]
[[3 5 2 4]
 [7 6 8 8]
 [1 6 7 7]]
In [29]:
print(list2[2][0])
print(x2[2, 0])
1
1
In [30]:
# And modification
list2[0][1] = 12
x2[0, 1] = 12
print(list2)
print(x2)
[[3, 12, 2, 4], [7, 6, 8, 8], [1, 6, 7, 7]]
[[ 3 12  2  4]
 [ 7  6  8  8]
 [ 1  6  7  7]]

The general format for slicing is:

x[start:stop:step]
In [31]:
z = np.arange(10)
d = list(range(10))
print(z)
print(d)
[0 1 2 3 4 5 6 7 8 9]
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
In [32]:
print(z[:5])
print(d[:5])
[0 1 2 3 4]
[0, 1, 2, 3, 4]
In [33]:
print(z[4:7])
print(d[4:7])
[4 5 6]
[4, 5, 6]
In [34]:
print(z[::2])  # every other element
print(d[::2])
[0 2 4 6 8]
[0, 2, 4, 6, 8]
In [35]:
print(z[3::2])  # every other element, starting at 3
print(d[3::2])
[3 5 7 9]
[3, 5, 7, 9]
In [36]:
print(z[::-1])  # all elements, reversed
print(d[::-1])
[9 8 7 6 5 4 3 2 1 0]
[9, 8, 7, 6, 5, 4, 3, 2, 1, 0]

And some differences

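Lists can be heterogeneous. Arrays cannot: every element of an array shares one data type, so a value assigned into an integer array is converted to an integer (8.4 becomes 8).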

In [37]:
list2[0][1] = 8.4
x2[0, 1] = 8.4
print(list2)
print(x2)
[[3, 8.4, 2, 4], [7, 6, 8, 8], [1, 6, 7, 7]]
[[3 8 2 4]
 [7 6 8 8]
 [1 6 7 7]]
In [38]:
x2[0] = x2[1] * 1.1
print(x2)
[[7 6 8 8]
 [7 6 8 8]
 [1 6 7 7]]
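
If you do want to keep the fractional values, one option is to convert the array to a floating-point type first. A minimal sketch with illustrative values:

```python
import numpy as np

ints = np.array([[7, 6, 8, 8], [1, 6, 7, 7]])
ints[0, 1] = 8.4              # cast to the array's integer dtype: stored as 8

floats = ints.astype(float)   # new array with a floating-point dtype
floats[0, 1] = 8.4            # now the fractional part is kept

print(ints[0, 1])     # 8
print(floats[0, 1])   # 8.4
```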

For ndarrays, slicing works similarly for higher dimensional arrays. Things are more complicated for lists of lists.

In [39]:
print(list2)
print(x2)
[[3, 8.4, 2, 4], [7, 6, 8, 8], [1, 6, 7, 7]]
[[7 6 8 8]
 [7 6 8 8]
 [1 6 7 7]]
In [40]:
print(list2[:2][:1])
print(list2[:2])
print(x2[:2, :3])
[[3, 8.4, 2, 4]]
[[3, 8.4, 2, 4], [7, 6, 8, 8]]
[[7 6 8]
 [7 6 8]]
In [41]:
print(list2[2][:2])
[1, 6]
In [42]:
print(x2[0])  # equivalent to x2[0, :], prints the first row
print(list2[0])
[7 6 8 8]
[3, 8.4, 2, 4]
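
To make "more complicated for lists of lists" concrete: list2[:2][:1] above slices the outer list twice, so it never reaches the columns. Selecting columns from a list of lists needs a comprehension, while an ndarray takes one comma-separated index. A sketch using the original values of x2 (the names rows and grid are ours):

```python
import numpy as np

rows = [[3, 5, 2, 4], [7, 6, 8, 8], [1, 6, 7, 7]]   # list of lists
grid = np.array(rows)                                # same data as an ndarray

# Second column
print(grid[:, 1])                  # [5 6 6]
print([row[1] for row in rows])    # [5, 6, 6]

# First two rows, first three columns
print(grid[:2, :3].tolist())             # [[3, 5, 2], [7, 6, 8]]
print([row[:3] for row in rows[:2]])     # [[3, 5, 2], [7, 6, 8]]
```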

Slices of an array are views of the same data, so changing a slice changes the parent array

In [43]:
print(x2)
x2_part = x2[:2, :2]
print(x2_part)
[[7 6 8 8]
 [7 6 8 8]
 [1 6 7 7]]
[[7 6]
 [7 6]]
In [44]:
x2_part[0,1] = 0
print(x2)
[[7 0 8 8]
 [7 6 8 8]
 [1 6 7 7]]

We can get around this the same way we did with lists: make a copy!

In [45]:
x2_part_copy = x2[:2, :2].copy()
x2_part_copy[0, 0] = 42
print(x2)
[[7 0 8 8]
 [7 6 8 8]
 [1 6 7 7]]
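
If you are ever unsure whether two arrays share data, np.shares_memory will tell you. A minimal sketch with a fresh array (the names parent, view, and copy are ours):

```python
import numpy as np

parent = np.arange(12).reshape((3, 4))
view = parent[:2, :2]            # slice: shares the parent's data
copy = parent[:2, :2].copy()     # independent data

print(np.shares_memory(parent, view))   # True
print(np.shares_memory(parent, copy))   # False

view[0, 0] = 99    # visible through parent
copy[0, 1] = -1    # not visible through parent
print(parent[0, :2])   # [99  1]
```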

Array Concatenation and Splitting

There are three main ways that arrays can be joined: np.concatenate, np.vstack, and np.hstack. In each case, the arrays to be joined must be of compatible dimensions.

np.concatenate joins a sequence of arrays end to end. You can specify the axis along which they are joined (default is axis=0).

In [46]:
x = np.array([1, 2, 3])
y = np.array([3, 2, 1])
z = np.concatenate([x, y])
print(z)
[1 2 3 3 2 1]
In [47]:
a = np.array([[1, 2], [3, 4]])
b = np.array([[5, 6]])
c = np.concatenate([a, b])
print(c)
[[1 2]
 [3 4]
 [5 6]]
In [48]:
# multiple arrays can be concatenated at once
d = np.array([[7, 8]])
e = np.concatenate([a, b.T, d.T], axis=1)
print(e)
[[1 2 5 7]
 [3 4 6 8]]
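
"Compatible dimensions" means the arrays must agree on every axis except the one being joined. A short sketch using the same a and b as above, including one join that fails on purpose:

```python
import numpy as np

a = np.array([[1, 2], [3, 4]])   # shape (2, 2)
b = np.array([[5, 6]])           # shape (1, 2)

print(np.concatenate([a, b], axis=0).shape)     # (3, 2): column counts match
print(np.concatenate([a, b.T], axis=1).shape)   # (2, 3): row counts match

try:
    np.concatenate([a, b], axis=1)   # row counts differ: 2 vs 1
except ValueError as err:
    print("concatenate failed:", err)
```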

np.vstack and np.hstack work like concatenate, without your having to remember which axis is vertical and which is horizontal.

In [49]:
print(a)
print(b)
[[1 2]
 [3 4]]
[[5 6]]
In [50]:
f = np.vstack([a, b])
print(f)
[[1 2]
 [3 4]
 [5 6]]
In [51]:
g = np.hstack([a, b.T])
print(g)
[[1 2 5]
 [3 4 6]]
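
The two functions also behave differently on one-dimensional arrays, which is worth seeing once. A minimal sketch with the x and y from the concatenate example:

```python
import numpy as np

x = np.array([1, 2, 3])
y = np.array([3, 2, 1])

stacked = np.vstack([x, y])   # each 1D array becomes a row
joined = np.hstack([x, y])    # 1D arrays are joined end to end

print(stacked.shape)   # (2, 3)
print(joined.shape)    # (6,)
```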

The opposite of stacking is splitting.

In [52]:
x = [1, 2, 3, 99, 99, 3, 2, 1]
x1, x2, x3, x4 = np.split(x, [3, 5, 6])  # the second argument gives the split points
print(x1, x2, x3, x4)
[1 2 3] [99 99] [3] [2 1]
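
A few more details on np.split, as a hedged sketch: N split points give N + 1 pieces, an integer instead of a list asks for that many equal pieces, and np.array_split is the variant that tolerates sizes that do not divide evenly.

```python
import numpy as np

x = np.arange(6)

# An integer asks for that many equal-sized pieces
print(np.split(x, 3))     # three arrays: [0 1], [2 3], [4 5]

# A list of N split points gives N + 1 pieces
print(len(np.split(x, [1, 4])))   # 3

# np.array_split tolerates sizes that do not divide evenly
pieces = np.array_split(np.arange(7), 3)
print([len(p) for p in pieces])   # [3, 2, 2]
```
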
In [53]:
grid = np.arange(25).reshape((5, 5))
print(grid)
[[ 0  1  2  3  4]
 [ 5  6  7  8  9]
 [10 11 12 13 14]
 [15 16 17 18 19]
 [20 21 22 23 24]]
In [54]:
upper, lower = np.vsplit(grid, [2])
print(upper)
print(lower)
[[0 1 2 3 4]
 [5 6 7 8 9]]
[[10 11 12 13 14]
 [15 16 17 18 19]
 [20 21 22 23 24]]
In [55]:
left, right = np.hsplit(grid, [2])
print(left)
print(right)
[[ 0  1]
 [ 5  6]
 [10 11]
 [15 16]
 [20 21]]
[[ 2  3  4]
 [ 7  8  9]
 [12 13 14]
 [17 18 19]
 [22 23 24]]

Similarly, np.dsplit will split arrays along the third axis (depth).
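
A minimal sketch of np.dsplit on a small three-dimensional array (the shape and split point are arbitrary choices for illustration):

```python
import numpy as np

cube = np.arange(16).reshape((2, 2, 4))   # third axis has length 4
front, back = np.dsplit(cube, [1])        # split the third axis at index 1

print(front.shape)   # (2, 2, 1)
print(back.shape)    # (2, 2, 3)
```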

Next up: computations with ndarrays!