NOTE: This is a lab project accompanying the following book [MLF], and it should be used together with the book.
[MLF] H. Jiang, "Machine Learning Fundamentals: A Concise Introduction", Cambridge University Press, 2021.
The purpose of this lab is to introduce the vectorization-based programming style, which is essential for implementing machine learning algorithms in high-level programming languages (such as Python and Matlab), and then to give some recipes for loading the data sets used in all projects in the book [MLF].
Prerequisites: a basic understanding of Python, NumPy, and JAX.
Vectorization is a special programming style for numerical computation, in which we pack all arguments into vectors/matrices and cast all numerical operations as matrix operations. By doing so, we get rid of loops and array indexing so as to deliver clean and efficient programs, even when they are written in high-level programming languages such as Python and Matlab.
Vectorization: programming with vectors, matrices, and even tensors without explicit loops, indexing, or conditionals over the vector/matrix/tensor elements.
Advantages: vectorized code is shorter and easier to read, and it runs much faster, since the underlying matrix operations are carried out by highly optimized (and often parallelized) linear algebra routines.
The key idea in vectorization-based programming is to use linear algebra techniques to pack the data into vectors/matrices/tensors. In the following, we use two simple examples to explain how to write efficient vectorized code, and to show how much faster vectorized code can run than the regular programming style (on either CPUs or GPUs).
Example 1: clip all elements in a vector/matrix to [0,1].
1.1 POOR programming style using loops, indexing and conditionals
def clip_loops(A):
    # clip every element of matrix A to [0,1] using explicit loops
    # and element-wise conditionals (slow); NOTE: A is modified in place
    for i in range(A.shape[0]):
        for j in range(A.shape[1]):
            if A[i,j] > 1:
                A[i,j] = 1
            elif A[i,j] < 0:
                A[i,j] = 0
    return A
1.2 Use the NumPy function clip() to do it efficiently
import numpy as np
A = np.clip(A,0,1)
1.3 Run both versions on a random matrix to compare their results and running speed.
import numpy as np
X = np.random.normal(size=(5000,784))*1.5
X1 = clip_loops(X.copy())   # pass a copy, since clip_loops() modifies its argument in place
X2 = np.clip(X,0,1)
print(np.sum((X1-X2)*(X1-X2)))   # should print 0.0 if the two implementations agree
%timeit clip_loops(X)
%timeit np.clip(X,0,1)
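As an aside, the same clipping also runs on a GPU via JAX, whose jax.numpy module mirrors the NumPy API; below is a minimal sketch (assuming JAX is installed; the timing comparison is left to the reader):
import jax.numpy as jnp
X3 = jnp.clip(jnp.array(X), 0, 1)       # vectorized clipping on the default device (e.g. a GPU)
print(np.sum((X2 - np.array(X3))**2))   # should print 0.0, matching np.clip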
1.4 Summary: the vectorized np.clip() produces exactly the same result as the loop-based clip_loops(), but runs dramatically faster.
Example 2: write a function to compute the sample covariance matrix from a set of data samples: $$\mathcal{D} = \big\{ \mathbf{x}_1, \mathbf{x}_2, \cdots, \mathbf{x}_N \big\}$$ as $$ \mathbf{S} = \frac{1}{N} \, \sum_{i=1}^N \, (\mathbf{x}_i - \bar{\mathbf{x}} ) \, (\mathbf{x}_i - \bar{\mathbf{x}} )^\intercal $$ (NOTE: the sample covariance matrix is discussed in Section 4.2 Linear Dimension Reduction, and it is the backbone of some popular machine learning methods such as PCA and LDA.)
2.1 Regular (but poor) programming using loops
import numpy as np
## input: matrix X as N by d, each row is a feature vector of d dimensions
def cov_loops(X):
N = X.shape[0]
d = X.shape[1]
# compute the mean
mean = np.zeros((d))
for i in range(N):
for j in range(d):
mean[j] += X[i,j]
for j in range(d):
mean[j] /= N
# compute the sample covariance matrix
S = np.zeros((d,d))
z = np.zeros((d))
for i in range(N):
for j in range(d):
z[j] = X[i,j] - mean[j]
for m in range(d):
for n in range(d):
S[m,n] += z[m] * z[n]
for m in range(d):
for n in range(d):
S[m,n] /= N
return S
2.2 Vectorized code using NumPy for CPUs
For the above sample covariance example, we use the following identity (refer to Q2.3 on page 64): $$ \sum_{i=1}^m \mathbf{x}_i \mathbf{x}_i^\intercal = \mathbf{X} \mathbf{X}^\intercal $$ where the vectors $\mathbf{x}_i$ are the columns of $\mathbf{X}$. Since our data matrix stores each sample as a row, the same sum is computed as $\mathbf{X}^\intercal \mathbf{X}$.
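Concretely, let $\mathbf{Z} = \mathbf{X} - \mathbf{1}\bar{\mathbf{x}}^\intercal$ denote the row-wise centered data matrix, where $\mathbf{1}$ is the all-ones column vector of length $N$. Applying the identity to the centered rows collapses the whole computation into a single matrix product: $$ \mathbf{S} = \frac{1}{N} \sum_{i=1}^N (\mathbf{x}_i - \bar{\mathbf{x}})(\mathbf{x}_i - \bar{\mathbf{x}})^\intercal = \frac{1}{N}\, \mathbf{Z}^\intercal \mathbf{Z} $$ which is exactly what the vectorized code below implements.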
import numpy as np
## input: matrix X as N by d, each row is a feature vector of d dimensions
def cov_vec(X):
    mean = np.mean(X, axis=0)                    # d-dimensional sample mean
    return (X-mean).T @ (X-mean) / X.shape[0]    # d-by-d sample covariance matrix
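As a sanity check, NumPy's built-in np.cov computes the same matrix when samples are rows (rowvar=False) and the biased 1/N normalization is requested (bias=True); a small illustrative test (the array sizes here are arbitrary):
X = np.random.normal(size=(100, 5))
S = cov_vec(X)
S_ref = np.cov(X, rowvar=False, bias=True)   # d-by-d covariance with 1/N normalization
print(np.max(np.abs(S - S_ref)))             # should be numerically close to 0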
2.3 Vectorized code using JAX for GPUs
Vectorized code can be further accelerated on GPUs using Python libraries such as JAX.
import jax.numpy as jnp
from jax import jit
## input: matrix X as N by d, each row is a feature vector of d dimensions
@jit
def cov_jax(X):
    X = jnp.array(X)                             # move the data onto the default device (e.g. a GPU)
    mean = jnp.mean(X, axis=0)
    return (X-mean).T @ (X-mean) / X.shape[0]    # d-by-d sample covariance matrix
2.4 Run the above three implementations on some random samples, and compare their results and running speed.
import numpy as np
# N=300 samples, each of dimension d=784
X = np.random.normal(size=(300,784))
cov_jax(X).block_until_ready()   # warm-up call: trigger JIT compilation before timing
%timeit -n1 -r1 cov_loops(X)
%timeit -n3 -r3 cov_vec(X)
%timeit -n3 -r3 cov_jax(X).block_until_ready()   # wait for the asynchronous device computation to finish
S1 = cov_loops(X)
S2 = cov_vec(X)
S3 = cov_jax(X)
print(np.trace(S1), np.trace(S2), np.trace(S3))   # the three traces should agree
Compare CPU vs. GPU implementations on a larger sample set:
import numpy as np
# N=5000 samples, each of dimension d=784
X = np.random.normal(size=(5000,784))
%timeit cov_vec(X)
%timeit cov_jax(X).block_until_ready()
S2 = cov_vec(X)
S3 = cov_jax(X)
print(np.trace(S2), np.trace(S3))
2.5 Summary: the vectorized NumPy code runs dramatically faster than the loop-based implementation, and on a sufficiently large sample set the JIT-compiled JAX version running on a GPU can be faster still.
# show the GPU type used in the above computation
!nvidia-smi
The MNIST data set is used in Lab Project I (page 92), Lab Project II (page 129) and Lab Project IV (page 200).
Download the MNIST data set using gdown. The data will be located in a folder called 'MNIST', including four files: train-images-idx3-ubyte, train-labels-idx1-ubyte, t10k-images-idx3-ubyte, t10k-labels-idx1-ubyte.
!gdown --folder https://drive.google.com/drive/folders/1r20aRjc2iu9O3kN3Xj9jNYY2uMgcERY1 2> /dev/null
# install idx2numpy
!pip install idx2numpy
import idx2numpy
import numpy as np
# load training images and labels
train_data=idx2numpy.convert_from_file('MNIST/train-images-idx3-ubyte')
train_data = np.reshape(train_data,(60000,28*28))
train_label = idx2numpy.convert_from_file('MNIST/train-labels-idx1-ubyte')
print(train_data.shape)
print(train_label.shape)
# load testing images and labels
test_data=idx2numpy.convert_from_file('MNIST/t10k-images-idx3-ubyte')
test_data = np.reshape(test_data,(10000,28*28))
test_label = idx2numpy.convert_from_file('MNIST/t10k-labels-idx1-ubyte')
print(test_data.shape)
print(test_label.shape)
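The loaded images are 8-bit integers in [0, 255]. Many algorithms in the book work better with real-valued features, so a common (optional) preprocessing step is to rescale the pixels to [0,1] with a single vectorized expression:
# optional: rescale pixel values from [0,255] to [0,1] (no loops needed)
train_data = train_data.astype(np.float64) / 255.0
test_data = test_data.astype(np.float64) / 255.0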
# install python_mnist
!pip install python_mnist
from mnist import MNIST
import numpy as np
mnist_loader = MNIST('MNIST')
train_data, train_label = mnist_loader.load_training()
test_data, test_label = mnist_loader.load_testing()
train_data = np.array(train_data)
train_label = np.array(train_label)
test_data = np.array(test_data)
test_label = np.array(test_label)
print(train_data.shape)
print(train_label.shape)
print(test_data.shape)
print(test_label.shape)
# use a matplotlib function to display some MNIST images
import matplotlib.pyplot as plt
import numpy as np
# functions to show an image
def show_img(num):
img = train_data[num,:]
img = img.reshape(28,-1)
print(str(img.shape) + ' No.' + str(num) + ' label:' + str(train_label[num]))
plt.imshow(img, cmap='gray')
plt.show()
fig = plt.figure()
for i in range(9):
img = train_data[i,:]
img = img.reshape(28,-1)
ax = fig.add_subplot(3,3,i+1)
ax.title.set_text(str(train_label[i]))
plt.imshow(img, cmap='gray')
plt.show(block=True)
The Boston House data set is used in Q7.7 on page 150.
Download boston.csv using gdown. Refer to the accompanying data description for the meaning of each column.
!gdown --folder https://drive.google.com/drive/folders/12L9XNwhIH2wQBa4-IdQrhsrtgRFbeIMZ 2> /dev/null
import pandas as pd
import numpy as np
raw_data = pd.read_csv('Boston/boston.csv', header=None)
data_rows = np.reshape(raw_data.to_numpy(), (506,14))
data = data_rows[:,:13]
target = data_rows[:,13]
print(data.shape)
print(target.shape)
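As a quick usage example that ties back to Example 2, the 13-by-13 sample covariance matrix of the Boston features can be computed with the vectorized cov_vec() defined earlier (assuming it is still defined in the current session):
S = cov_vec(data)    # 13-by-13 sample covariance of the Boston features
print(S.shape)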
Download the Ames House data set using gdown.
!gdown --folder https://drive.google.com/drive/folders/1w2hM-TIzvFFYHp_5JCVEGTA3DWNNskx9 2> /dev/null
import pandas as pd
train_dataframe = pd.read_csv("AmesHouse/train.csv")
test_dataframe = pd.read_csv("AmesHouse/test.csv")
print(train_dataframe.shape)
print(test_dataframe.shape)
# WARNING: both train_dataframe and test_dataframe contain symbolic (categorical) features (refer to the data description);
# they need to be pre-processed into numbers prior to model training and testing
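One common way to convert the symbolic features into numbers is one-hot encoding with pandas; the snippet below is only a sketch of the idea (it encodes the training set alone, whereas in practice the train and test sets must be encoded consistently, e.g. by encoding them together):
# one-hot encode every non-numeric column; numeric columns pass through unchanged
train_numeric = pd.get_dummies(train_dataframe)
print(train_numeric.shape)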
Download the MLF-Gauss data set using gdown.
!gdown --folder https://drive.google.com/drive/folders/1agkY7npAHzav-e1yYIVBJNgfux5AsmlX 2> /dev/null
import pandas as pd
train_dataframe = pd.read_csv("MLF-Gauss/train-gaussian.csv")
test_dataframe = pd.read_csv("MLF-Gauss/test-gaussian.csv")
train_data = train_dataframe.to_numpy()
test_data = test_dataframe.to_numpy()
print(train_data.shape)
print(test_data.shape)
Write a program to compute the pairwise Euclidean distances (i.e., $L_2$ norms) between all vectors in a set of data samples: $$\mathcal{D} = \big\{ \mathbf{x}_1, \mathbf{x}_2, \cdots, \mathbf{x}_N \big\}$$ where each $\mathbf{x}_i \in \mathbb{R}^d$. Store all these pairwise distances in a symmetric matrix $\mathbf{E} \in \mathbb{R}^{N \times N}$, where each element $e_{ij}$ indicates the distance between $\mathbf{x}_i$ and $\mathbf{x}_j$.
Implement it using the regular loop-based programming style.
Implement it using vectorization (one vectorized approach is sketched after this list).
Generate some random samples, and compare the above two implementations in terms of running speed.
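As a hint for the vectorized implementation, expand $\|\mathbf{x}_i - \mathbf{x}_j\|^2 = \|\mathbf{x}_i\|^2 + \|\mathbf{x}_j\|^2 - 2\, \mathbf{x}_i^\intercal \mathbf{x}_j$, so that the whole distance matrix follows from one matrix product; a minimal sketch (one possible solution, not the only one):
import numpy as np
def pairwise_dist(X):
    # squared norm of every row, as a length-N vector
    sq = np.sum(X*X, axis=1)
    # broadcasting: sq[:,None] + sq[None,:] - 2*X@X.T gives all pairwise squared distances
    D2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)
    # clip tiny negative values caused by floating-point round-off before the square root
    return np.sqrt(np.maximum(D2, 0.0))
X = np.random.normal(size=(100, 8))
E = pairwise_dist(X)
print(E.shape, np.max(np.abs(E - E.T)))   # E is a symmetric N-by-N matrix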
Referring to Q6.8 (page 130), write some programs to compute the SVM matrix $Q$ for three different kernel types: linear, polynomial, and RBF kernels.
Implement it using the regular loop-based programming style.
Re-write it using vectorization to compute the matrix $Q$ for all three kernel types (a sketch of the kernel-matrix computation is given after this list).
Generate some random samples, and compare the above two implementations in terms of running speed.
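As a starting point for the vectorized version, all three kernel matrices can be built from the single Gram matrix $\mathbf{X}\mathbf{X}^\intercal$ (plus, for the RBF kernel, the pairwise squared distances from the previous exercise). The sketch below is only illustrative: the kernel parameters (degree p, offset c, width gamma) and the exact relation between the kernel matrix and $Q$ should be taken from Q6.8.
import numpy as np
def kernel_matrix(X, kind='linear', p=3, c=1.0, gamma=0.1):
    # Gram matrix of inner products between all pairs of rows of X
    G = X @ X.T
    if kind == 'linear':
        return G
    if kind == 'poly':
        return (G + c) ** p
    if kind == 'rbf':
        sq = np.sum(X*X, axis=1)
        D2 = sq[:, None] + sq[None, :] - 2.0 * G   # pairwise squared distances
        return np.exp(-gamma * D2)
    raise ValueError('unknown kernel type: ' + kind)
X = np.random.normal(size=(50, 4))
y = np.sign(np.random.normal(size=50))
K = kernel_matrix(X, 'rbf')
Q = (y[:, None] * y[None, :]) * K   # e.g. Q = (y y^T) * K element-wise, if Q6.8 defines Q this way
print(Q.shape)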