Machine Learning Part I
1. Introduction
1.1 What is Machine Learning?
Machine learning was described by Arthur Samuel as the "field of study that gives computers the ability to learn without being explicitly programmed".
The following three principles are central to Machine Learning:
- a pattern exists: machine learning will not work on data that is completely random.
- the pattern cannot be "pinned down" mathematically: if a pattern can be pinned down mathematically (for example, Newton's equations), it is probably more appropriate to use those equations directly.
- we have data on that pattern: machine learning algorithms require relevant data, in order to search for patterns in that data.
The last point raises the following question - how much data is needed in order to learn?
1.2 The Feasibility of Learning - how much data do we need in order to learn?
Two related concepts give us insight into why we are able to learn from data, and how having more data helps us to learn: Hoeffding's Inequality and the law of large numbers.
1.2.1 Hoeffding's Inequality
Consider an infinite bin from which we take a sample of size $N$, which we find to have a sample frequency, $\nu$. The bin frequency, $\mu$, however, is unknown. Hoeffding's Inequality allows us to place a probability bound on the difference between these two quantities, i.e.:

$$P\left[\,|\nu - \mu| > \epsilon\,\right] \le 2e^{-2\epsilon^2 N}$$

This inequality applies to a single bin, is valid for all $N$ and $\epsilon$, and the bound does not depend on $\mu$.
It illustrates the tradeoff between $N$, $\epsilon$, and the bound, i.e. the larger the sample size, $N$, the smaller our probability bound. On the other hand, the smaller the tolerance, $\epsilon$, the harder it will be to keep the probability bound small.
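As a quick numerical illustration (a minimal sketch with arbitrarily chosen values of $N$ and $\epsilon$, not part of the original exercise), the bound $2e^{-2\epsilon^2 N}$ can be evaluated directly:

import numpy as np

def hoeffding_bound(N, eps):
    # upper bound on P[|nu - mu| > eps] for a single bin with sample size N
    return 2.0 * np.exp(-2.0 * eps**2 * N)

for N in (100, 1000, 10000):
    print("N = {:6d}, eps = 0.05, bound = {:.3g}".format(N, hoeffding_bound(N, 0.05)))

For a fixed tolerance, the bound decays exponentially in $N$: for $N = 100$ it is larger than 1 and therefore uninformative, while for $N = 10000$ it is vanishingly small.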
1.2.2 The Law of Large Numbers
A related concept is the weak law of large numbers, given by:

$$\lim_{N \to \infty} P\left[\,|\nu - \mu| > \epsilon\,\right] = 0$$

That is, as the sample size, $N$, approaches infinity, the probability that the sample frequency $\nu$ deviates from the bin frequency $\mu$ by more than $\epsilon$ approaches zero.
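To see this behaviour empirically (a small illustrative simulation with an assumed bin frequency $\mu = 0.6$, not part of the original exercise), we can draw increasingly large Bernoulli samples and watch the sample frequency $\nu$ approach $\mu$:

import numpy as np

np.random.seed(0)
mu = 0.6  # assumed "bin" frequency (unknown to the learner in practice)

for N in (10, 100, 1000, 10000, 100000):
    sample = np.random.rand(N) < mu   # Bernoulli(mu) sample of size N
    nu = sample.mean()                # sample frequency
    print("N = {:6d}, nu = {:.4f}, |nu - mu| = {:.4f}".format(N, nu, abs(nu - mu)))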
Further Reading:
- Machine Learning Theory - Part 1: Introduction | Mostafa Samir
- Machine Learning Theory - Part 2: Generalization Bounds | Mostafa Samir
1.3 Libraries used in this exercise
The following Python libraries are used in this exercise:
from __future__ import print_function, division
import numpy as np
import matplotlib.pyplot as plt
import itertools
np.set_printoptions(edgeitems=3,infstr='inf',linewidth=75,nanstr='nan',precision=4,suppress=False,threshold=1000,formatter=None)
%matplotlib inline
2. Machine Learning Example
2.1 Learning Boolean Target Function
In this example, we intend to "learn" a Boolean target function. The function takes an input vector of size 3, say [0, 1, 0] or [0, 1, 1], and produces a single output, $y$, which can be 0 or 1. Since there are $2^3 = 8$ possible input vectors and each can be mapped to either output, enumerating the set of all possible target functions gives $2^8 = 256$ distinct Boolean functions on 3 Boolean inputs.
We wish to score 4 different hypotheses based on how well they perform on the 3 "out-of-sample" points. The scoring system is 1 point per correct point, i.e. if a hypothesis gets all 3 out-of-sample points correct, it receives a score of 3. We wish to calculate the total score accumulated when enumerating over the entire set of 256 possible target functions.
In the code below, the 256 distinct Boolean functions are enumerated using itertools.product([0,1],repeat=8), and a total score is calculated for each of the 4 hypotheses. The scores turn out to be identical: for each out-of-sample point, exactly half of the 256 functions ($2^7 = 128$) agree with any fixed prediction, so every hypothesis earns the same total score of $3 \times 128 = 384$.
def calculate_score_last3(y,g):
    # Total agreement between hypothesis g and the last 3 columns
    # (the out-of-sample points) of every enumerated target function.
    total_score=0
    for y_last in y[:,5:]:
        total_score += np.sum(y_last==g)
    return total_score

# Candidate hypothesis vectors (defined here but not used by the scoring
# calls below, which pass each hypothesis's 3 out-of-sample predictions directly).
g_a = np.vstack(np.array([1,1,1,1,1]))
g_b = np.vstack(np.array([0,0,0,0,0]))
g_c = np.vstack(np.array([0,1,1,0,1]))
g_d = np.logical_not(g_a).astype(int)

# Enumerate all 2^8 = 256 Boolean functions on 3 Boolean inputs, one per row
y = np.array(list(itertools.product([0,1],repeat=8)))

print("Scores:\n(a) {}\n(b) {}\n(c) {}\n(d) {}".format(
    calculate_score_last3(y,[1,1,1]),
    calculate_score_last3(y,[0,0,0]),
    calculate_score_last3(y,[0,0,1]),
    calculate_score_last3(y,[1,1,0])))
3. The Perceptron
3.1 Introduction & Background
The Perceptron is attributed to the work of Frank Rosenblatt in 1957. It is a binary classifier that works on the basis of whether a dot product, $\mathbf{w} \cdot \mathbf{x}$, exceeds a certain threshold:

$$\sum_{i=1}^{d} w_i x_i > \text{threshold}$$

Absorbing the threshold into a bias weight $w_0$ (with a fixed input $x_0 = 1$), this can be written in the form of a hypothesis given by:

$$h(\mathbf{x}) = \operatorname{sign}\left(\mathbf{w}^T \mathbf{x}\right)$$
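As a quick illustration (a minimal sketch with arbitrarily chosen weights and input, not part of the original exercise), the hypothesis can be evaluated directly with NumPy:

import numpy as np

w = np.array([-0.2, 0.5, 1.0])   # illustrative weights [w0, w1, w2]
x = np.array([1.0, 0.3, 0.4])    # augmented input, with x0 = 1 absorbing the threshold
h = np.sign(np.dot(w, x))        # -1 or +1 (0 only if the point lies exactly on the boundary)
print(h)                         # 1.0, since -0.2 + 0.5*0.3 + 1.0*0.4 = 0.35 > 0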
3.2 Generating Sample Data
To generate the sample data for the perceptron classifier above, a three-column random matrix with the first column as ones is created as follows:
def generate_data(n,seed=None):
    if seed is not None:
        np.random.seed(seed)
    x0 = np.ones(n)
    x1 = np.random.uniform(low=-1,high=1,size=(2,n))
    return np.vstack((x0,x1)).T
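As a quick sanity check (an illustrative call using an arbitrary seed), the returned array has one row per sample and three columns $(x_0, x_1, x_2)$, with the first column fixed at one:

X = generate_data(5, seed=0)
print(X.shape)    # (5, 3)
print(X[:, 0])    # first column (x0) is all ones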
3.3 Creating a random line inside the region of interest
The region of interest is $[-1, 1] \times [-1, 1]$, where $\times$ denotes the Cartesian product.
A random line is created, and to ensure that it falls within the region of interest, it is constructed from two random points, $(x_1^{(1)}, x_2^{(1)})$ and $(x_1^{(2)}, x_2^{(2)})$, which are generated within $[-1, 1] \times [-1, 1]$. The equation for this line in slope-intercept form and in the hypothesis / weights form is derived as follows.
3.3.1 Equation of the random line in the slope-intercept form, $y = mx + c$
From the two random points, the slope and intercept are given by (as implemented in get_random_line below):

$$m = \frac{x_2^{(2)} - x_2^{(1)}}{x_1^{(2)} - x_1^{(1)}}, \qquad c = x_2^{(1)} - m\,x_1^{(1)}$$
3.3.2 Equation of the random line in hypothesis [weights] form
The hypothesis, in the two-dimensional case ($d = 2$), and since $x_0 = 1$, is:

$$h(\mathbf{x}) = \operatorname{sign}(w_0 + w_1 x_1 + w_2 x_2)$$

The decision boundary, $h(\mathbf{x}) = 0$, is given by:

$$w_0 + w_1 x_1 + w_2 x_2 = 0$$

This can be converted to the slope-intercept form if we take $x_1$ to be the $x$ axis, and $x_2$ to be the $y$ axis, i.e.:

$$x_2 = -\frac{w_1}{w_2} x_1 - \frac{w_0}{w_2}$$

Comparison with the equation $y = mx + c$ yields the following relationships:

$$m = -\frac{w_1}{w_2}, \qquad c = -\frac{w_0}{w_2}$$

If we arbitrarily set $w_2 = 1$, we arrive at the following set of weights, which is consistent with the decision boundary denoted by $y = mx + c$:

$$\mathbf{w} = \begin{bmatrix} w_0 \\ w_1 \\ w_2 \end{bmatrix} = \begin{bmatrix} -c \\ -m \\ 1 \end{bmatrix}$$
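As a quick check of this conversion (an illustrative example with assumed values $m = 0.5$ and $c = -0.2$), any point lying on the line $y = mx + c$ should give exactly zero when dotted with the corresponding weight vector:

import numpy as np

m, c = 0.5, -0.2                    # assumed slope and intercept
w = np.array([-c, -m, 1.0])         # weights consistent with y = m*x + c
x1 = 0.7
x = np.array([1.0, x1, m*x1 + c])   # a point on the line, with x0 = 1
print(np.dot(w, x))                 # 0.0 (up to floating point), i.e. on the decision boundary

The functions below implement this construction: get_random_line returns the weights of a random line, draw_line plots a line given its weights, and get_hypothesis evaluates $h(\mathbf{x}) = \operatorname{sign}(\mathbf{w}^T\mathbf{x})$ for each sample.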
def get_random_line(seed=None):
    # two random points in [-1,1] x [-1,1] define the line
    X = generate_data(2,seed=seed)
    x = X[:,1]
    y = X[:,2]
    m = (y[1]-y[0])/(x[1]-x[0])
    c = y[0] - m*x[0]
    # weights [w0,w1,w2] = [-c,-m,1], consistent with the derivation above
    return np.array([-c,-m,1])

def draw_line(ax,w,marker='g--',label=None):
    # convert the weights back to slope-intercept form for plotting
    m = -w[1]/w[2]
    c = -w[0]/w[2]
    x = np.linspace(-1,1,20)
    y = m*x + c
    if label is None:
        ax.plot(x,y,marker)
    else:
        ax.plot(x,y,marker,label=label)

def get_hypothesis(X,w):
    # h(x) = sign(w^T x) for every row of X
    h = np.dot(X,w)
    return np.sign(h).astype(int)
3.4 Generating a Training Dataset
The following code generates a training dataset (separated into positive and negative classes according to a random line in $[-1, 1] \times [-1, 1]$) and plots it.
def plot_data(fig,plot_id,X,y=None,w_arr=None,my_x=None,title=None):
    ax = fig.add_subplot(plot_id)
    if y is None:
        ax.plot(X[:,1],X[:,2],'gx')
    else:
        ax.plot(X[y > 0,1],X[y > 0,2],'b+',label='Positive (+)')
        ax.plot(X[y < 0,1],X[y < 0,2],'ro',label='Negative (-)')
    ax.set_xlim(-1,1)
    ax.set_ylim(-1,1)
    ax.grid(True)
    if w_arr is not None:
        if isinstance(w_arr,list) is not True:
            w_arr=[w_arr]
        for i,w in enumerate(w_arr):
            if i==0:
                draw_line(ax,w,'g-',label='Theoretical')
            else:
                draw_line(ax,w,'g--')
    if my_x is not None:
        ax.plot([my_x[0]],[my_x[1]],'kx',markersize=10)
    if title is not None:
        ax.set_title(title)
    ax.legend(loc='best',frameon=True)

def create_dataset(N,make_plot=True,seed=None):
    X = generate_data(N,seed=seed)
    w_theoretical = get_random_line()
    y = get_hypothesis(X,w_theoretical)
    if make_plot is True:
        fig = plt.figure(figsize=(7,5))
        plot_data(fig,111,X,y,w_theoretical,title="Initial Dataset")
    return X,y,w_theoretical
3.5 The Perceptron Learning Algorithm
The Perceptron Learning Algorithm (PLA) is implemented in the following steps:
- Calculate $h(\mathbf{x}) = \operatorname{sign}(\mathbf{w}^T\mathbf{x})$, which can take on values of -1, 0, or 1 for each sample.
- Compare $h$ with the true labels $y$ to find misclassified point(s), if any.
- Pick one misclassified point at random.
- Iterate the weights according to the PLA update rule $\mathbf{w} \leftarrow \mathbf{w} + y\,\mathbf{x}$, where $y$ is the correct classification for the misclassified point, and $\mathbf{x}$ is the misclassified point.
The code below also keeps track of the weights, misclassification error, and misclassified point at each iteration.
def PLA(w,X,y0,n_iterations=10,verbose=True):
    assert len(y0)==X.shape[0]
    n=len(y0)
    # work on a copy so that the caller's initial weight vector is not modified in place
    w = np.copy(w).astype(float)
    x_arr = list()   # misclassified point chosen at each iteration
    w_arr = list()   # weight vector at each iteration
    m_arr = list()   # misclassification error (%) at each iteration
    for i in range(n_iterations):
        h = get_hypothesis(X,w)
        bad = h != y0
        bad = np.argwhere(bad).flatten()
        if len(bad) > 0:
            # pick one misclassified point at random and apply the update w <- w + y*x
            idx = np.random.choice(bad,1)[0]
            my_x = X[idx,:]
            m_arr.append(100.0*len(bad)/n)
            x_arr.append(my_x)
            w_arr.append(np.copy(w))
            w += y0[idx]*my_x
            if verbose is True:
                print("iter {}: {}% misclassified, w={}" \
                      .format(i,m_arr[-1],w_arr[-1]))
        else:
            # no misclassified points remain: record the final state and stop
            m_arr.append(0.0)
            x_arr.append(np.array([1.0, np.nan,np.nan]))
            w_arr.append(np.copy(w))
            if verbose is True:
                print("iter {}: zero misclassified (PLA has converged)".format(i))
            return w,w_arr,m_arr,x_arr
    print("PLA failed to converge after {} iterations".format(i))
    return None,None,None,None
3.6 Implementing the PLA
Here, we generate a sample dataset of 10 points and plot it. The perceptron learning algorithm, starting from an initial weight vector of $\mathbf{w} = [0, 0, 0]$, converges in fewer than 10 iterations.
X,y,w_theoretical = create_dataset(N=10,make_plot=True,seed=247)
w0 = np.array([0,0,0],dtype=float)
w,w_arr,m_arr,x_arr = PLA(w0,X,y,n_iterations=100,verbose=True)
def draw_plot_steps(fig,plot_id,X,y,w_theoretical,w_arr,x_arr,idx_arr):
    assert len(idx_arr) <= 9
    for idx in idx_arr:
        print("w_arr[{}] = {}, x_arr[{}] = {}".format(idx,w_arr[idx],idx,x_arr[idx][1:]))
        plot_data(fig,plot_id,X,y,[w_theoretical] + [w_arr[idx]],
                  x_arr[idx][1:],title="iteration {}".format(idx))
        plot_id += 1

fig = plt.figure(figsize=(10,15))
draw_plot_steps(fig,421,X,y,w_theoretical,w_arr,x_arr,np.arange(len(w_arr)-1)+1)
def plot_convergence(m_arr):
    fig = plt.figure(figsize=(7,5))
    ax = fig.add_subplot(111)
    ax.plot(m_arr,'g+-',markersize=10)
    ax.grid(True)
    ax.set_title("Convergence")
    ax.set_xlabel("Iterations")
    ax.set_ylabel("Misclassification Error (%)")

plot_convergence(m_arr)
3.7 Number of iterations required for convergence
The following code calculates the number of iterations required for convergence and plots its distribution.
def plot_histogram(my_count,bins=200,x_max=80):
    fig = plt.figure(figsize=(7,5))
    ax = fig.add_subplot(111)
    ax.hist(my_count,bins=bins);
    ax.set_xlim(0,x_max)
    ax.grid(True)

def get_iteration_distribution(N,n_trials=1000,max_iterations=10000,summary=True):
    n_iterations=np.zeros(n_trials,dtype=int)
    w0 = np.array([0,0,0],dtype=float)
    for i in range(n_trials):
        X,y,w_theoretical = create_dataset(N=N,make_plot=False,seed=None)
        w,w_arr,m_arr,x_arr = PLA(w0,X,y,n_iterations=max_iterations,verbose=False)
        n_iterations[i]=len(w_arr)
    if summary is True:
        print("Minimum iterations: {}".format(np.min(n_iterations)))
        print("Maximum iterations: {}".format(np.max(n_iterations)))
        print("Mean iterations: {}".format(np.mean(n_iterations)))
    return n_iterations
n_iterations = get_iteration_distribution(N=10,n_trials=1000)
plot_histogram(n_iterations,bins=200,x_max=50)
3.8 Calculate the misclassification error for the converged weights
If we know the theoretical decision boundary, w_theoretical, which 'knows' the correct classification of every point, we can calculate the number of points misclassified by w (the final weights after PLA convergence) via random sampling. For this run, the misclassification error is slightly less than 20%.
def calculate_misclassification(w_theoretical,w,n_samples=1000,verbose=True):
    X = generate_data(n_samples,seed=None)
    y0 = get_hypothesis(X,w_theoretical)
    y = get_hypothesis(X,w)
    n_correct = np.sum(y == y0)
    if verbose is True:
        if w_theoretical[0] != 0.0:
            print("Theoretical Weights : {}".format(w_theoretical/w_theoretical[0]))
        else:
            print("Theoretical Weights : {}".format(w_theoretical))
        print("PLA Predicted Weights: {}".format(w))
        print("Correct points = {}".format(n_correct))
        print("Incorrect points = {}".format(n_samples-n_correct))
        print("Misclassification= {}%".format(np.round(100 * (n_samples-n_correct)/n_samples, 4)))
        fig = plt.figure(figsize=(7,5))
        plot_data(fig,111,X,y0,[w_theoretical,w])
    return (n_samples-n_correct)/n_samples
misclassification = calculate_misclassification(w_theoretical,w,verbose=True)
misclassification
def plot_misclassification(my_count,bins=20,x_max=0.25):
    fig = plt.figure(figsize=(7,5))
    ax = fig.add_subplot(111)
    ax.hist(my_count,bins=bins);
    ax.set_xlim(0,x_max)
    ax.grid(True)

def get_misclassification_distribution(N=10,n_trials=1000,max_iterations=10000,summary=True):
    w0 = np.array([0,0,0],dtype=float)
    misclassification=np.zeros(n_trials)
    for i in range(n_trials):
        X,y,w_theoretical = create_dataset(N,make_plot=False,seed=None)
        w,w_arr,m_arr,x_arr = PLA(w0,X,y,n_iterations=max_iterations,verbose=False)
        misclassification[i]=calculate_misclassification(w_theoretical,w,n_samples=1000,verbose=False)
    if summary is True:
        print("Minimum misclassification: {}".format(np.min(misclassification)))
        print("Maximum misclassification: {}".format(np.max(misclassification)))
        print("Mean misclassification: {}".format(np.mean(misclassification)))
    return misclassification
3.9 Iteration distribution and misclassification distribution for N=10
Here, we find that for $N = 10$, an average of about 10 iterations is required for convergence. The average misclassification error is about 10%.
n_iterations = get_iteration_distribution(N=10,n_trials=1000)
plot_histogram(n_iterations,bins=100,x_max=40)
misclassification = get_misclassification_distribution(N=10)
plot_misclassification(misclassification,bins=20,x_max=0.4)
3.10 Iteration distribution and misclassification distribution for N=100
For $N = 100$, an average of about 80 to 100 iterations is required for convergence. The average misclassification error is slightly over 1%.
n_iterations = get_iteration_distribution(N=100,n_trials=1000)
plot_histogram(n_iterations,bins=300,x_max=300)
misclassification = get_misclassification_distribution(N=100)
plot_misclassification(misclassification,bins=40,x_max=0.05)
3.11 Convergence Plot for N=100
X,y,w_theoretical = create_dataset(N=100,make_plot=True,seed=12345)
w0 = np.array([0,0,0],dtype=float)
w,w_arr,m_arr,x_arr = PLA(w0,X,y,n_iterations=1000,verbose=False)
plot_convergence(m_arr)
fig = plt.figure(figsize=(8,7))
plot_data(fig,111,X,y,[w_theoretical] + w_arr)