Machine Learning Part I

Machine Learning/course 2016. 10. 28. 09:17

https://nathanielng.github.io/machine-learning/perceptron/perceptron.html

Machine Learning Part I

1. Introduction

1.1 What is Machine Learning?

Machine learning was described by Arthur Samuel as the "field of study that gives computers the ability to learn without being explicitly programmed".

The following three principles are central to Machine Learning:

a pattern exists: machine learning will not work on data that is completely random.
the pattern cannot be "pinned down" mathematically: if a pattern can be pinned down mathematically (for example Newton's equations), it is probably more appropriate to use the original equations
we have data on that pattern: machine learning algorithms require relevant data, in order to search for patterns in that data.

The last point raises the following question - how much data is needed in order to learn?

1.2 The Feasibility of Learning - how much data do we need in order to learn?

Two related concepts can give us insight as to why we are able to learn from data, and how does having more data help us to learn--these are Hoeffding's Inequality and the law of large numbers.

1.2.1 Hoeffding's Inequality

Consider an infinite bin, from which we take a sample of size $N$ , which we find to have a sample frequency, $ν$ . However, the bin frequency, $μ$ , is unknown. However, Hoeffding's Inequality allows us to calculate a probability bound between these two quantities, i.e.:

P [| ν - μ | > ϵ] \leq 2 e^{- 2 ϵ^{2} N}

This inequality applies to a single bin, and is valid for all $N$ and $ϵ$ and the bound does not depend on $μ$ .

It illustrates the tradeoff between $N$ , $ϵ$ , and the bound, i.e. the larger the sample size, $N$ , the smaller our probability bound. On the other hand, the smaller the tolerance, $ϵ$ , the harder it will be to keep the probability small.

1.2.2 The Law of Large Numbers

A related concept is the weak law of large numbers, given by:

lim_{m \to \infty} P [| \underset{X \sim P}{E} [X] - \frac{1}{m} \sum_{i = 1}^{m} x_{i} | > ϵ] = 0

As the sample size, $m$ , approaches infinity, the probability approaches zero.

Further Reading:

Machine Learning Theory - Part 1: Introduction | Mostafa Samir
Machine Learning Theory - Part 2: Generalization Bounds | Mostafa Samir

1.3 Libraries used in this exercise

The following Python libraries are used in this exercise

In [1]:

from __future__ import print_function, division
import numpy as np
import matplotlib.pyplot as plt
import itertools
np.set_printoptions(edgeitems=3,infstr='inf',linewidth=75,nanstr='nan',precision=4,suppress=False,threshold=1000,formatter=None)
%matplotlib inline

2. Machine Learning Example

2.1 Learning Boolean Target Function

In this example, we intend to "learn" a Boolean target function. The function takes in input vector of size 3, say [0, 1, 0] or [0, 1, 1] and outputs a single output, $y_{n}$ , which can be zero or 1. If we enumerate the set of all possible target functions, we would have $2^{2^{3}} = 256$ distinct Boolean functions on 3 Boolean inputs.

In this example, we wish to score 4 different hypotheses, $g_{a}, g_{b}, g_{c}, g_{d}$ based on how well they perform on "out of sample points". The scoring system is 1 point per correct point, i.e. if the function gets 3 "out of sample points" correct, it will get a score of 3. We wish to calcuate the total score when enumerating over the entire set of all possible functions.

In the code below, the 256 distinct Boolean functions are enumerated using itertools.product([0,1],repeat=8) and a score is calculated for each of 4 possible cases. It is determined that the score for each of these cases is identical.

In [2]:

def calculate_score_last3(y,g):
    total_score=0
    for y_last in y[:,5:]:
        total_score += np.sum(y_last==g)
    return total_score

In [3]:

g_a = np.vstack(np.array([1,1,1,1,1]))
g_b = np.vstack(np.array([0,0,0,0,0]))
g_c = np.vstack(np.array([0,1,1,0,1]))
g_d = np.logical_not(g_a).astype(int)
y = np.array(list(itertools.product([0,1],repeat=8)))

print("Scores:\n(a) {}\n(b) {}\n(c) {}\n(d) {}".format(
    calculate_score_last3(y,[1,1,1]),
    calculate_score_last3(y,[0,0,0]),
    calculate_score_last3(y,[0,0,1]),
    calculate_score_last3(y,[1,1,0])))

Scores:
(a) 384
(b) 384
(c) 384
(d) 384

3. The Perceptron

3.1 Introduction & Background

The Perceptron is attributed to the work of Frank Rosenblatt in 1957. It is a binary classifier that works on the basis of whether a dot product, $w \cdot x$ , exceeds a certain threshold:

f (x) = {\begin{array}{cl} 1, & if w \cdot x > 0 \\ 0, & otherwise \end{array}

This can be written in the form of a hypothesis given by:

h (x) = sign (w^{T} x) = sign (\sum_{i = 1}^{N} w_{i} x_{i})

3.2 Generating Sample Data

To generate the sample data for the perceptron classifier above, a three-column random matrix with the first column as ones is created as follows:

In [4]:

def generate_data(n,seed=None):
    if seed is not None:
        np.random.seed(seed)
    x0 = np.ones(n)
    x1 = np.random.uniform(low=-1,high=1,size=(2,n))
    return np.vstack((x0,x1)).T

3.3 Creating a random line inside the region of interest

The region of interest is $X = [- 1, 1] \times [- 1, 1]$ , where $\times$ denotes the Cartesian Product.

A random line is created, and to ensure that it falls within the region of interest, it is created from two random points, $(x_{0}, y_{0})$ and $(x_{1}, y_{1})$ which are generated within $X$ . The equation for this line in slope-intercept form and in the hypothesis / weights form are derived as follows.

3.3.1 Equation of the random line in the slope-intercept form, $y = m x + c$

Slope, m = \frac{y_{1} - y_{0}}{x_{1} - x_{0}}

Intercept, c = y_{0} - m x_{0}

3.3.2 Equation of the random line in hypothesis [weights] form

The hypothesis, in the two dimensional case ( $N = 2$ ), and since $x_{0} = 1$ :

h (x) = sign (w_{0} + w_{1} x_{1} + w_{2} x_{2})

The decision boundary, $h (x) = 0$ is given by:

w_{0} + w_{1} x_{1} + w_{2} x_{2} = 0

This can be converted to the slope-intercept form if we take $x_{1}$ to be the $x$ axis, and $x_{2}$ to be the $y$ axis, i.e.:

w_{0} + w_{1} x + w_{2} y = 0

y = - \frac{w_{1}}{w_{2}} x - \frac{w_{0}}{w_{2}}

Comparison with the equation $y = m x + c$ yields the following relationships:

m = - \frac{w_{1}}{w_{2}}

c = - \frac{w_{0}}{w_{2}}

If we arbitrarily set $w_{2} = 1$ , we arrive at the following set of weights, which is consistent with the decision boundary denoted by $y = m x + c$ :

w = (- c, - m, 1)

In [5]:

def get_random_line(seed=None):
    X = generate_data(2,seed=seed)
    x = X[:,1]
    y = X[:,2]
    m = (y[1]-y[0])/(x[1]-x[0])
    c = y[0] - m*x[0]
    return np.array([-c,-m,1])

def draw_line(ax,w,marker='g--',label=None):
    m = -w[1]/w[2]
    c = -w[0]/w[2]
    x = np.linspace(-1,1,20)
    y = m*x + c
    if label is None:
        ax.plot(x,y,marker)
    else:
        ax.plot(x,y,marker,label=label)
    
def get_hypothesis(X,w):
    h=np.dot(X,w)
    return np.sign(h).astype(int)

3.4 Generating a Training Dataset

The following code generates a training dataset (separated into positive and negative classes according to a random line in $X$ ) and plots it.

In [6]:

def plot_data(fig,plot_id,X,y=None,w_arr=None,my_x=None,title=None):
    ax = fig.add_subplot(plot_id)
    if y is None:
        ax.plot(X[:,1],X[:,2],'gx')
    else:
        ax.plot(X[y > 0,1],X[y > 0,2],'b+',label='Positive (+)')
        ax.plot(X[y < 0,1],X[y < 0,2],'ro',label='Negative (-)')
    ax.set_xlim(-1,1)
    ax.set_ylim(-1,1)
    ax.grid(True)
    if w_arr is not None:
        if isinstance(w_arr,list) is not True:
            w_arr=[w_arr]
        for i,w in enumerate(w_arr):
            if i==0:
                draw_line(ax,w,'g-',label='Theoretical')
            else:
                draw_line(ax,w,'g--')
    if my_x is not None:
        ax.plot([my_x[0]],[my_x[1]],'kx',markersize=10)
    if title is not None:
        ax.set_title(title)
    ax.legend(loc='best',frameon=True)

In [7]:

def create_dataset(N,make_plot=True,seed=None):
    X = generate_data(N,seed=seed)
    w_theoretical = get_random_line()
    y = get_hypothesis(X,w_theoretical)
    if make_plot is True:
        fig = plt.figure(figsize=(7,5))
        plot_data(fig,111,X,y,w_theoretical,title="Initial Dataset")
    return X,y,w_theoretical

3.5 The Perceptron Learning Algorithm

The Preceptron Learning Algorithm (PLA) is implemented in the following steps

Calculate $h (x) = sign (w^{T} x)$ which can take on values of -1, 0, or 1 for each sample.
Compare $h (x)$ with $y$ to find misclassified point(s) if any.
Pick one misclassified point at random.
Iterate the weights according to the PLA: $w = w + y_{n} x_{n}$ , where $y_{n}$ is the correct classification for the misclassified point, and $x_{n}$ is the misclassified point.

The code below also keeps track of the weights, misclassification error, and misclassified point at each iteration.

In [8]:

def PLA(w,X,y0,n_iterations=10,verbose=True):
    assert len(y0)==X.shape[0]
    n=len(y0)
    x_arr = list()
    w_arr = list()
    m_arr = list()
    for i in range(n_iterations):
        h   = get_hypothesis(X,w)
        bad = h != y0
        bad = np.argwhere(bad).flatten()
        if len(bad) > 0:
            idx  = np.random.choice(bad,1)[0]
            my_x = X[idx,:]
            m_arr.append(100.0*len(bad)/n)
            x_arr.append(my_x)
            w_arr.append(np.copy(w))
            w += np.dot(y0[idx],my_x)
            if verbose is True:
                print("iter {}: {}% misclassified, w={}" \
                      .format(i,m_arr[-1],w_arr[-1]))
        else:
            m_arr.append(0.0)
            x_arr.append(np.array([1.0, np.nan,np.nan]))
            w_arr.append(np.copy(w))
            if verbose is True:
                print("iter {}: zero misclassified (PLA has converged)".format(i))
            return w,w_arr,m_arr,x_arr
    print("PLA failed to converge after {} iterations".format(i))
    return None,None,None,None

3.6 Implementing the PLA

Here, we generate a sample dataset of 10 points and plot it. The perceptron learning algorithm, starting from an initial weight of $(0, 0, 0)$ , converges in less than 10 iterations.

In [9]:

X,y,w_theoretical = create_dataset(N=10,make_plot=True,seed=247)
w0 = np.array([0,0,0],dtype=float)

In [10]:

w,w_arr,m_arr,x_arr = PLA(w0,X,y,n_iterations=100,verbose=True)

iter 0: 100.0% misclassified, w=[ 0.  0.  0.]
iter 1: 40.0% misclassified, w=[ 1.     -0.2676 -0.3799]
iter 2: 10.0% misclassified, w=[ 0.     -1.0827  0.2181]
iter 3: 20.0% misclassified, w=[ 1.     -0.83    0.9908]
iter 4: 20.0% misclassified, w=[ 0.     -1.4921  1.3024]
iter 5: zero misclassified (PLA has converged)

In [11]:

def draw_plot_steps(fig,plot_id,X,y,w_theoretical,w_arr,x_arr,idx_arr):
    assert len(idx_arr) <= 9
    for idx in idx_arr:
        print("w_arr[{}] = {}, x_arr[{}] = {}".format(idx,w_arr[idx],idx,x_arr[idx][1:]))
        plot_data(fig,plot_id,X,y,[w_theoretical] + [w_arr[idx]],
                  x_arr[idx][1:],title="iteration {}".format(idx))
        plot_id += 1

fig = plt.figure(figsize=(10,15))
draw_plot_steps(fig,421,X,y,w_theoretical,w_arr,x_arr,np.arange(len(w_arr)-1)+1)

w_arr[1] = [ 1.     -0.2676 -0.3799], x_arr[1] = [ 0.8151 -0.598 ]
w_arr[2] = [ 0.     -1.0827  0.2181], x_arr[2] = [ 0.2527  0.7728]
w_arr[3] = [ 1.     -0.83    0.9908], x_arr[3] = [ 0.6621 -0.3116]
w_arr[4] = [ 0.     -1.4921  1.3024], x_arr[4] = [-0.2676 -0.3799]
w_arr[5] = [ 1.     -1.7598  0.9225], x_arr[5] = [ nan  nan]

In [12]:

def plot_convergence(m_arr):
    fig = plt.figure(figsize=(7,5))
    ax = fig.add_subplot(111)
    ax.plot(m_arr,'g+-',markersize=10)
    ax.grid(True)
    ax.set_title("Convergence")
    ax.set_xlabel("Iterations")
    ax.set_ylabel("Misclassification Error (%)")
plot_convergence(m_arr)

3.7 Number of iterations required for convergence

The following code calculates the number of iterations required for convergence and plots its distribution.

In [13]:

def plot_histogram(my_count,bins=200,x_max=80):
    fig = plt.figure(figsize=(7,5))
    ax = fig.add_subplot(111)
    ax.hist(my_count,bins=bins);
    ax.set_xlim(0,x_max)
    ax.grid(True)
    
def get_iteration_distribution(N,n_trials=1000,max_iterations=10000,summary=True):
    n_iterations=np.zeros(n_trials,dtype=int)
    w0 = np.array([0,0,0],dtype=float)
    for i in range(n_trials):
        X,y,w_theoretical = create_dataset(N=N,make_plot=False,seed=None)
        w,w_arr,m_arr,x_arr = PLA(w0,X,y,n_iterations=max_iterations,verbose=False)
        n_iterations[i]=len(w_arr)
        
    if summary is True:
        print("Minumum iterations: {}".format(np.min(n_iterations)))
        print("Maximum iterations: {}".format(np.max(n_iterations)))
        print("Mean    iterations: {}".format(np.mean(n_iterations)))
    return n_iterations

In [14]:

n_iterations = get_iteration_distribution(N=10,n_trials=1000)
plot_histogram(n_iterations,bins=200,x_max=50)

Minumum iterations: 1
Maximum iterations: 240
Mean    iterations: 9.006

3.8 Calculate the misclassification error for the converged weights

If we know the theoretical decision boundary, w_theoretical, that 'knows' the correct classification of points, we can calculate the number of points misclassified by w (the final weights after convergence using the PLA) via random sampling. The misclassification error is slightly less than 20%.

In [15]:

def calculate_misclassification(w_theoretical,w,n_samples=1000,verbose=True):
    X  = generate_data(n_samples,seed=None)
    y0 = get_hypothesis(X,w_theoretical)
    y  = get_hypothesis(X,w)
    n_correct = np.sum(y == y0)
    if verbose is True:
        if w_theoretical[0] != 0.0:
            print("Theoretical Weights  : {}".format(w_theoretical/w_theoretical[0]))
        else:
            print("Theoretical Weights  : {}".format(w_theoretical))
        print("PLA Predicted Weights: {}".format(w))
        print("Correct points   = {}".format(n_correct))
        print("Incorrect points = {}".format(n_samples-n_correct))
        print("Misclassification= {}%".format(np.round(100 * (n_samples-n_correct)/n_samples, 4)))
        fig = plt.figure(figsize=(7,5))
        plot_data(fig,111,X,y0,[w_theoretical,w])
    return (n_samples-n_correct)/n_samples

In [16]:

misclassification = calculate_misclassification(w_theoretical,w,verbose=True)
misclassification

Theoretical Weights  : [ 1.     -2.8719  4.1079]
PLA Predicted Weights: [ 1.     -1.7598  0.9225]
Correct points   = 825
Incorrect points = 175
Misclassification= 17.5%

Out[16]:

0.17499999999999999

In [17]:

def plot_misclassification(my_count,bins=20,x_max=0.25):
    fig = plt.figure(figsize=(7,5))
    ax = fig.add_subplot(111)
    ax.hist(my_count,bins=bins);
    ax.set_xlim(0,x_max)
    ax.grid(True)

In [18]:

def get_misclassification_distribution(N=10,n_trials=1000,max_iterations=10000,summary=True):
    w0 = np.array([0,0,0],dtype=float)
    misclassification=np.zeros(n_trials)
    for i in range(n_trials):
        X,y,w_theoretical = create_dataset(N,make_plot=False,seed=None)
        w,w_arr,m_arr,x_arr = PLA(w0,X,y,n_iterations=max_iterations,verbose=False)
        misclassification[i]=calculate_misclassification(w_theoretical,w,n_samples=1000,verbose=False)
        
    if summary is True:
        print("Minumum misclassification: {}".format(np.min(misclassification)))
        print("Maximum misclassification: {}".format(np.max(misclassification)))
        print("Mean    misclassification: {}".format(np.mean(misclassification)))
    return misclassification

3.9 Iteration distribution and misclassification distribution for N=10

Here, we find that for $N$ =10, an average of about 10 iterations is required for convergence. The average misclassification error is about 10%.

In [19]:

n_iterations = get_iteration_distribution(N=10,n_trials=1000)
plot_histogram(n_iterations,bins=100,x_max=40)

Minumum iterations: 1
Maximum iterations: 265
Mean    iterations: 9.471

In [20]:

misclassification = get_misclassification_distribution(N=10)
plot_misclassification(misclassification,bins=20,x_max=0.4)

Minumum misclassification: 0.002
Maximum misclassification: 0.474
Mean    misclassification: 0.108496

3.10 Iteration distribution and misclassification distribution for N=100

For $N$ =100, an average of about 80~100 iterations is required for convergence. The average misclassification error is slightly over ~1%.

In [21]:

n_iterations = get_iteration_distribution(N=100,n_trials=1000)
plot_histogram(n_iterations,bins=300,x_max=300)

Minumum iterations: 1
Maximum iterations: 6713
Mean    iterations: 81.904

In [22]:

misclassification = get_misclassification_distribution(N=100)
plot_misclassification(misclassification,bins=40,x_max=0.05)

Minumum misclassification: 0.0
Maximum misclassification: 0.075
Mean    misclassification: 0.014019000000000002

3.11 Convergence Plot for N=100

In [23]:

X,y,w_theoretical = create_dataset(N=100,make_plot=True,seed=12345)
w0 = np.array([0,0,0],dtype=float)
w,w_arr,m_arr,x_arr = PLA(w0,X,y,n_iterations=1000,verbose=False)
plot_convergence(m_arr)

In [24]:

fig = plt.figure(figsize=(8,7))
plot_data(fig,111,X,y,[w_theoretical] + w_arr)

//anaconda/lib/python3.5/site-packages/ipykernel/__main__.py:10: RuntimeWarning: invalid value encountered in double_scalars
//anaconda/lib/python3.5/site-packages/ipykernel/__main__.py:11: RuntimeWarning: invalid value encountered in double_scalars

In [ ]:

저작자표시 비영리 동일조건

'Machine Learning > course' 카테고리의 다른 글

How do I start AI/ML/DL? (0)	2017.03.18
Machine Learning/Neural Networks recommended courses (0)	2017.03.16
Machine Learning with MATLAB video lectures (0)	2016.12.30
machine learning course by Professor Andrew Ng - Lecture notes and exercise code (0)	2016.12.15
Andrew Ng's Machine Learning lectures list (0)	2016.12.14

Posted by uniqueone

Be the only one, not the best one