593 posts in 'Deep Learning'

  1. 2017.04.13 Setting up a TensorFlow development environment with Python and Jupyter
  2. 2017.04.12 10 Free Must-Read Books for Machine Learning and Data Science
  3. 2017.04.11 Trend Prediction using LSTM RNNs with Keras implementation (Tensorflow)
  4. 2017.04.11 Python TensorFlow Tutorial – Build a Neural Network
  5. 2017.04.10 My experiments with AlexNet, using Keras and Theano
  6. 2017.04.08 DeepSpeech speech-to-text: source code and paper
  7. 2017.04.08 10 minutes Practical TensorFlow Tutorial for quick learners
  8. 2017.04.05 Why Momentum Really Works
  9. 2017.04.05 How Deep Neural Networks Work video lecture (24:37)
  10. 2017.04.03 A GitBook page with concise explanations of RCNN, Fast RCNN, Faster RCNN and other object detection models
  11. 2017.04.02 awesome-deep-learning
  12. 2017.04.02 Understanding GANs from a novice graduate student's perspective (series)
  13. 2017.04.02 cs231n and cs224d lecture videos in Korean, Data Science & Business Analytics Lab, Korea University School of Industrial Management Engineering
  14. 2017.04.02 10 minutes Practical TensorFlow Tutorial for quick learners
  15. 2017.04.01 Unpaired Image-to-Image Translation using Cycle-Consistent Adversarial Networks
  16. 2017.03.30 Classifying White Blood Cells With Deep Learning (Code and data included!)
  17. 2017.03.28 How to Install TensorFlow | NVIDIA
  18. 2017.03.28 deep-photo-styletransfer
  19. 2017.03.28 TensorFlow RNN Tutorial
  20. 2017.03.28 Torch7 tutorial materials translated into Korean
  21. 2017.03.28 Five video classification methods implemented in Keras and TensorFlow
  22. 2017.03.27 From the basics to slightly more interesting applications of Tensorflow
  23. 2017.03.27 Simple tutorials using Google's TensorFlow Framework
  24. 2017.03.27 TensorFlow Tutorial and Examples for beginners https://tensorflow.org
  25. 2017.03.26 A miscellany of fun deep learning papers
  26. 2017.03.26 Faster R-CNN explained in Korean
  27. 2017.03.25 tensorflow install & pycharm install & setting
  28. 2017.03.25 Ubuntu 16.04LTS + TensorFlow + Cuda + Cudnn + Pycharm + Anaconda
  29. 2017.03.24 How to build TensorFlow from source (adding my own code to TensorFlow)
  30. 2017.03.24 Time Series Forecasting with Python 7-Day Mini-Course

Setting up a TensorFlow development environment with Python and Jupyter | Popit
http://www.popit.kr/%ed%8c%8c%ec%9d%b4%ec%8d%ac-%ec%a5%ac%ed%94%bc%ed%84%b0%eb%a5%bc-%ec%9d%b4%ec%9a%a9%ed%95%9c-%ed%85%90%ec%84%9c%ed%94%8c%eb%a1%9c%ec%9a%b0-%ea%b0%9c%eb%b0%9c%ed%99%98%ea%b2%bd-%ea%b5%ac%ec%84%b1/
Posted by uniqueone

10 Free Must-Read Books for Machine Learning and Data Science
http://www.kdnuggets.com/2017/04/10-free-must-read-books-machine-learning-data-science.html

What better way to enjoy this spring weather than with some free machine learning and data science ebooks? Right? Right?

Here is a quick collection of such books to start your fair weather study off on the right foot. The list begins with a base of statistics, moves on to machine learning foundations, progresses to a few bigger picture titles, has a quick look at an advanced topic or two, and ends with something that brings it all together. It's a mix of classic and contemporary titles; hopefully you'll find something new (to you) and of interest here.

Free books!

1. Think Stats: Probability and Statistics for Programmers
By Allen B. Downey

Think Stats is an introduction to Probability and Statistics for Python programmers.

Think Stats emphasizes simple techniques you can use to explore real data sets and answer interesting questions. The book presents a case study using data from the National Institutes of Health. Readers are encouraged to work on a project with real datasets.

2. Probabilistic Programming & Bayesian Methods for Hackers
By Cam Davidson-Pilon

An intro to Bayesian methods and probabilistic programming from a computation/understanding-first, mathematics-second point of view.

The Bayesian method is the natural approach to inference, yet it is hidden from readers behind chapters of slow, mathematical analysis. The typical text on Bayesian inference involves two to three chapters on probability theory, then enters what Bayesian inference is. Unfortunately, due to mathematical intractability of most Bayesian models, the reader is only shown simple, artificial examples. This can leave the user with a so-what feeling about Bayesian inference. In fact, this was the author's own prior opinion.

3. Understanding Machine Learning: From Theory to Algorithms
By Shai Shalev-Shwartz and Shai Ben-David

Machine learning is one of the fastest growing areas of computer science, with far-reaching applications. The aim of this textbook is to introduce machine learning, and the algorithmic paradigms it offers, in a principled way. The book provides a theoretical account of the fundamentals underlying machine learning and the mathematical derivations that transform these principles into practical algorithms. Following a presentation of the basics, the book covers a wide array of central topics unaddressed by previous textbooks. These include a discussion of the computational complexity of learning and the concepts of convexity and stability; important algorithmic paradigms including stochastic gradient descent, neural networks, and structured output learning; and emerging theoretical concepts such as the PAC-Bayes approach and compression-based bounds.

4. The Elements of Statistical Learning
By Trevor Hastie, Robert Tibshirani and Jerome Friedman

This book describes the important ideas in these areas in a common conceptual framework. While the approach is statistical, the emphasis is on concepts rather than mathematics. Many examples are given, with a liberal use of color graphics. It should be a valuable resource for statisticians and anyone interested in data mining in science or industry. The book's coverage is broad, from supervised learning (prediction) to unsupervised learning. The many topics include neural networks, support vector machines, classification trees and boosting--the first comprehensive treatment of this topic in any book.

5. An Introduction to Statistical Learning with Applications in R
By Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani

This book provides an introduction to statistical learning methods. It is aimed at upper level undergraduate students, masters students and Ph.D. students in the non-mathematical sciences. The book also contains a number of R labs with detailed explanations on how to implement the various methods in real life settings, and should be a valuable resource for a practicing data scientist.

6. Foundations of Data Science
By Avrim Blum, John Hopcroft, and Ravindran Kannan

While traditional areas of computer science remain highly important, increasingly researchers of the future will be involved with using computers to understand and extract usable information from massive data arising in applications, not just how to make computers useful on specific well-defined problems. With this in mind we have written this book to cover the theory likely to be useful in the next 40 years, just as an understanding of automata theory, algorithms, and related topics gave students an advantage in the last 40 years.

7. A Programmer's Guide to Data Mining: The Ancient Art of the Numerati
By Ron Zacharski

This guide follows a learn-by-doing approach. Instead of passively reading the book, I encourage you to work through the exercises and experiment with the Python code I provide. I hope you will be actively involved in trying out and programming data mining techniques. The textbook is laid out as a series of small steps that build on each other until, by the time you complete the book, you have laid the foundation for understanding data mining techniques.

8. Mining of Massive Datasets
By Jure Leskovec, Anand Rajaraman and Jeff Ullman

The book is based on Stanford Computer Science course CS246: Mining Massive Datasets (and CS345A: Data Mining).

The book, like the course, is designed at the undergraduate computer science level with no formal prerequisites. To support deeper explorations, most of the chapters are supplemented with further reading references.

9. Deep Learning
By Ian Goodfellow, Yoshua Bengio and Aaron Courville

The Deep Learning textbook is a resource intended to help students and practitioners enter the field of machine learning in general and deep learning in particular. The online version of the book is now complete and will remain available online for free.

10. Machine Learning Yearning
By Andrew Ng

AI, Machine Learning and Deep Learning are transforming numerous industries. But building a machine learning system requires that you make practical decisions:

  • Should you collect more training data?
  • Should you use end-to-end deep learning?
  • How do you deal with your training set not matching your test set?
  • and many more.

Historically, the only way to learn how to make these "strategy" decisions has been a multi-year apprenticeship in a graduate program or company. I am writing a book to help you quickly gain this skill, so that you can become better at building AI systems.

Posted by uniqueone
https://m.facebook.com/groups/5582633474?view=permalink&id=10155938232383475

Trend Prediction using LSTM RNNs with Keras implementation (Tensorflow)


Posted by uniqueone
http://adventuresinmachinelearning.com/python-tensorflow-tutorial/

Google’s TensorFlow has been a hot topic in deep learning recently.  The open source software, designed to allow efficient computation of data flow graphs, is especially suited to deep learning tasks.  It is designed to be executed on single or multiple CPUs and GPUs, making it a good option for complex deep learning tasks.  In its most recent incarnation – version 1.0 – it can even be run on certain mobile operating systems.

This introductory tutorial to TensorFlow will give an overview of some of the basic concepts of TensorFlow in Python.  These will be a good stepping stone to building more complex deep learning networks, such as Convolution Neural Networks and Recurrent Neural Networks, in the package.  We’ll be creating a simple three-layer neural network to classify the MNIST dataset.  This tutorial assumes that you are familiar with the basics of neural networks, which you can get up to scratch with in the neural networks tutorial if required.  To install TensorFlow, follow the instructions here. First, let’s have a look at the main ideas of TensorFlow.

1.0 TensorFlow graphs

TensorFlow is based on graph based computation – “what on earth is that?”, you might say.  It’s an alternative way of conceptualising mathematical calculations.  Consider the following expression a = (b + c)(c + 2).  We can break this function down into the following components:

d = b + c
e = c + 2
a = d * e

Now we can represent these operations graphically as:

[Figure: Simple computational graph]

This may seem like a silly example – but notice a powerful idea in expressing the equation this way: two of the computations (d = b + c and e = c + 2) can be performed in parallel.  By splitting up these calculations across CPUs or GPUs, this can give us significant gains in computational times.  These gains are a must for big data applications and deep learning – especially for complicated neural network architectures such as Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs).  The idea behind TensorFlow is the ability to create these computational graphs in code and allow significant performance improvements via parallel operations and other efficiency gains.

We can look at a similar graph in TensorFlow below, which shows the computational graph of a three-layer neural network.

[Figure: TensorFlow data flow graph of a three-layer neural network]

The animated data flows between different nodes in the graph are tensors which are multi-dimensional data arrays.  For instance, the input data tensor may be 5000 x 64 x 1, which represents a 64 node input layer with 5000 training samples.  After the input layer there is a hidden layer with rectified linear units as the activation function.  There is a final output layer (called a “logit layer” in the above graph) which uses cross entropy as a cost/loss function.  At each point we see the relevant tensors flowing to the “Gradients” block which finally flow to the Stochastic Gradient Descent optimiser which performs the back-propagation and gradient descent.

Here we can see how computational graphs can be used to represent the calculations in neural networks, and this, of course, is what TensorFlow excels at.  Let’s see how to perform some basic mathematical operations in TensorFlow to get a feel for how it all works.

2.0 A Simple TensorFlow example

Let’s first make TensorFlow perform our little example calculation above – a = (b + c)(c + 2).  First we need to introduce ourselves to TensorFlow variables and constants.  Let’s declare some, then I’ll explain the syntax:

import tensorflow as tf

# first, create a TensorFlow constant
const = tf.constant(2.0, name="const")
    
# create TensorFlow variables
b = tf.Variable(2.0, name='b')
c = tf.Variable(1.0, name='c')

As can be observed above, TensorFlow constants can be declared using the tf.constant function, and variables with the tf.Variable function.  The first element in both is the value to be assigned the constant / variable when it is initialised.  The second is an optional name string which can be used to label the constant / variable – this is handy for when you want to do visualisations (as will be discussed briefly later).  TensorFlow will infer the type of the constant / variable from the initialised value, but it can also be set explicitly using the optional dtype argument.  TensorFlow has many of its own types like tf.float32, tf.int32 etc. – see them all here.
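As a quick aside (a sketch of my own, not from the original tutorial), setting the dtype explicitly looks like this under the TensorFlow 1.x API used throughout:

import tensorflow as tf

# hedged sketch: explicitly setting dtypes (TensorFlow 1.x API)
const64 = tf.constant(2.0, dtype=tf.float64, name="const64")
b_int = tf.Variable(2, dtype=tf.int32, name="b_int")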

It’s important to note that, as the Python code runs through these commands, the variables haven’t actually been declared as they would have been if you just had a standard Python declaration (i.e. b = 2.0).  Instead, all the constants, variables, operations and the computational graph are only created when the initialisation commands are run.

Next, we create the TensorFlow operations:

# now create some operations
d = tf.add(b, c, name='d')
e = tf.add(c, const, name='e')
a = tf.multiply(d, e, name='a')

TensorFlow has a wealth of operations available to perform all sorts of interactions between variables, some of which we’ll get to later in the tutorial.  The operations above are pretty obvious, and they instantiate the operations b + c, c + 2.0 and d * e.

The next step is to setup an object to initialise the variables and the graph structure:

# setup the variable initialisation
init_op = tf.global_variables_initializer()

Ok, so now we are all set to go.  To run the operations between the variables, we need to start a TensorFlow session – tf.Session.  The TensorFlow session is an object where all operations are run.  Using the with Python syntax, we can run the graph with the following code:

# start the session
with tf.Session() as sess:
    # initialise the variables
    sess.run(init_op)
    # compute the output of the graph
    a_out = sess.run(a)
    print("Variable a is {}".format(a_out))

The first command within the with block is the initialisation, which is run with the, well, run command.  Next we want to figure out what the variable a should be.  All we have to do is run the operation which calculates a, i.e. a = tf.multiply(d, e, name='a').  Note that a is an operation, not a variable, and therefore it can be run.  We do just that with the sess.run(a) command and assign the output to a_out, the value of which we then print out.

Note something cool – we defined operations d and e which need to be calculated before we can figure out what a is.  However, we don’t have to explicitly run those operations, as TensorFlow knows what other operations and variables the operation a depends on, and therefore runs the necessary operations on its own.  It does this through its data flow graph which shows it all the required dependencies. Using the TensorBoard functionality, we can see the graph that TensorFlow created in this little program:

[Figure: Simple TensorFlow graph]

Now that’s obviously a trivial example – what if we had an array of b values that we wanted to calculate the value of a over?

2.1 The TensorFlow placeholder

Let’s also say that we didn’t know what the value of the array b would be during the declaration phase of the TensorFlow problem (i.e. before the with tf.Session() as sess stage).  In this case, TensorFlow requires us to declare the basic structure of the data by using the tf.placeholder variable declaration.  Let’s use it for b:

# create TensorFlow variables
b = tf.placeholder(tf.float32, [None, 1], name='b')

Because we aren’t providing an initialisation in this declaration, we need to tell TensorFlow what data type each element within the tensor is going to be.  In this case, we want to use tf.float32.  The second argument is the shape of the data that will be “injected” into this variable.  In this case, we want to use a (? x 1) sized array – because we are being cagey about how much data we are supplying to this variable (hence the “?”), the placeholder is willing to accept a None argument in the size declaration.  Now we can inject as much 1-dimensional data as we want into the b variable.

The only other change we need to make to our program is in the sess.run(a,…) command:

import numpy as np

a_out = sess.run(a, feed_dict={b: np.arange(0, 10)[:, np.newaxis]})

Note that we have added the feed_dict argument to the sess.run(a,…) command.  Here we remove the mystery and specify exactly what the variable b is to be – a one-dimensional range from 0 to 10.  As suggested by the argument name, feed_dict, the input to be supplied is a Python dictionary, with each key being the name of the placeholder that we are filling.
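Putting the placeholder pieces together, here is a minimal end-to-end sketch of the modified program (my own assembly of the snippets above, assuming the TensorFlow 1.x API used throughout this tutorial):

import numpy as np
import tensorflow as tf

const = tf.constant(2.0, name="const")
b = tf.placeholder(tf.float32, [None, 1], name='b')  # shape (? x 1)
c = tf.Variable(1.0, name='c')

d = tf.add(b, c, name='d')
e = tf.add(c, const, name='e')
a = tf.multiply(d, e, name='a')

init_op = tf.global_variables_initializer()

with tf.Session() as sess:
    sess.run(init_op)
    # feed a (10 x 1) column of the values 0..9 into the placeholder b
    a_out = sess.run(a, feed_dict={b: np.arange(0, 10)[:, np.newaxis]})
    print("Variable a is {}".format(a_out))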

When we run the program again this time we get:

Variable a is [[  3.]
 [  6.]
 [  9.]
 [ 12.]
 [ 15.]
 [ 18.]
 [ 21.]
 [ 24.]
 [ 27.]
 [ 30.]]

Notice how TensorFlow adapts naturally from a scalar output (i.e. a singular output when a=9.0) to a tensor (i.e. an array/matrix)?  This is based on its understanding of how the data will flow through the graph.

Now we are ready to build a basic MNIST predicting neural network.

3.0 A Neural Network Example

Now we’ll go through an example in TensorFlow of creating a simple three layer neural network.  In future articles, we’ll show how to build more complicated neural network structures such as convolution neural networks and recurrent neural networks.  For this example though, we’ll keep it simple.  If you need to scrub up on your neural network basics, check out my popular tutorial on the subject.  In this example, we’ll be using the MNIST dataset (and its associated loader) that the TensorFlow package provides.  This MNIST dataset is a set of 28×28 pixel grayscale images which represent hand-written digits.  It has 55,000 training rows, 10,000 testing rows and 5,000 validation rows.

We can load the data by running:

from tensorflow.examples.tutorials.mnist import input_data
mnist = input_data.read_data_sets("MNIST_data/", one_hot=True)

The one_hot=True argument specifies that instead of the labels associated with each image being the digit itself i.e. “4”, it is a vector with “one hot” node and all the other nodes being zero i.e. [0, 0, 0, 0, 1, 0, 0, 0, 0, 0].  This lets us easily feed it into the output layer of our neural network.
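To make the encoding concrete, here is a tiny sketch (mine, in plain numpy) of what the one-hot vector for a hypothetical label looks like:

import numpy as np

label = 4                # hypothetical digit label
one_hot = np.zeros(10)
one_hot[label] = 1.0
print(one_hot)           # [0. 0. 0. 0. 1. 0. 0. 0. 0. 0.]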

3.1 Setting things up

Next, we can set-up the placeholder variables for the training data (and some training parameters):

# Python optimisation variables
learning_rate = 0.5
epochs = 10
batch_size = 100

# declare the training data placeholders
# input x - for 28 x 28 pixels = 784
x = tf.placeholder(tf.float32, [None, 784])
# now declare the output data placeholder - 10 digits
y = tf.placeholder(tf.float32, [None, 10])

Notice the x input layer is 784 nodes corresponding to the 28 x 28 (=784) pixels, and the y output layer is 10 nodes corresponding to the 10 possible digits.  Again, the size of x is (? x 784), where the ? stands for an as yet unspecified number of samples to be input – this is the function of the placeholder variable.

Now we need to setup the weight and bias variables for the three layer neural network.  There are always L-1 weight/bias tensor pairs, where L is the number of layers.  So in this case, we need to setup two tensors for each:

# now declare the weights connecting the input to the hidden layer
W1 = tf.Variable(tf.random_normal([784, 300], stddev=0.03), name='W1')
b1 = tf.Variable(tf.random_normal([300]), name='b1')
# and the weights connecting the hidden layer to the output layer
W2 = tf.Variable(tf.random_normal([300, 10], stddev=0.03), name='W2')
b2 = tf.Variable(tf.random_normal([10]), name='b2')

Ok, so let’s unpack the above code a little.  First, we declare some variables for W1 and b1, the weights and bias for the connections between the input and hidden layer.  This neural network will have 300 nodes in the hidden layer, so the size of the weight tensor W1 is [784, 300].  We initialise the values of the weights using a random normal distribution with a mean of zero and a standard deviation of 0.03.  TensorFlow has a replicated version of the numpy random normal function, which allows you to create a matrix of a given size populated with random samples drawn from a given distribution.  Likewise, we create W2 and b2 variables to connect the hidden layer to the output layer of the neural network.

Next, we have to setup node inputs and activation functions of the hidden layer nodes:

# calculate the output of the hidden layer
hidden_out = tf.add(tf.matmul(x, W1), b1)
hidden_out = tf.nn.relu(hidden_out)

In the first line, we execute the standard matrix multiplication of the weights (W1) by the input vector x and we add the bias b1.  The matrix multiplication is executed using the tf.matmul operation.  Next, we finalise the hidden_out operation by applying a rectified linear unit activation function to the matrix multiplication plus bias.  Note that TensorFlow has a rectified linear unit activation already setup for us, tf.nn.relu.

This is to execute the following equations, as detailed in the neural networks tutorial:

z^{(l+1)} = W^{(l)} x + b^{(l)}
h^{(l+1)} = f(z^{(l+1)})

Now, let’s setup the output layer, y_:

# now calculate the output layer - in this case, let's use a softmax activated
# output layer
y_ = tf.nn.softmax(tf.add(tf.matmul(hidden_out, W2), b2))

Again we perform the weight multiplication with the output from the hidden layer (hidden_out) and add the bias, b2.  In this case, we are going to use a softmax activation for the output layer – we can use the included TensorFlow softmax function tf.nn.softmax.

We also have to include a cost or loss function for the optimisation / backpropagation to work on. Here we’ll use the cross entropy cost function, represented by:

J = -(1/m) Σ_{i=1}^{m} Σ_{j=1}^{n} [ y_j^{(i)} log(ŷ_j^{(i)}) + (1 - y_j^{(i)}) log(1 - ŷ_j^{(i)}) ]

Where y_j^{(i)} is the target label for output node j on training sample i, ŷ_j^{(i)} (the y_ of our code) is the predicted value for output node j on sample i, m is the number of training / batch samples and n is the number of output nodes.  There are two operations occurring in the above equation.  The first is the summation of the logarithmic products and additions across all the output nodes.  The second is taking a mean of this summation across all the training samples.  We can implement this cross entropy cost function in TensorFlow with the following code:

y_clipped = tf.clip_by_value(y_, 1e-10, 0.9999999)
cross_entropy = -tf.reduce_mean(tf.reduce_sum(y * tf.log(y_clipped)
                         + (1 - y) * tf.log(1 - y_clipped), axis=1))

Some explanation is required.  The first line is an operation converting the output y_ to a clipped version, limited to between 1e-10 and 0.9999999.  This is to make sure that we never get a case where we have a log(0) operation occurring during training – this would return NaN and break the training process.  The second line is the cross entropy calculation.

To perform this calculation, first we use TensorFlow’s tf.reduce_sum function – this function basically takes the sum of a given axis of the tensor you supply.  In this case, the tensor that is supplied is the element-wise cross-entropy calculation for a single node and training sample, i.e. y_j^{(i)} log(ŷ_j^{(i)}) + (1 - y_j^{(i)}) log(1 - ŷ_j^{(i)}).  Remember that y and y_clipped in the above calculation are (m x 10) tensors – therefore we need to perform the first sum over the second axis.  This is specified using the axis=1 argument, where “1” actually refers to the second axis when we have a zero-based indexing system like Python’s.

After this operation, we have an (m x 1) tensor.  To take the mean of this tensor and complete our cross entropy cost calculation (i.e. execute the (1/m) Σ_{i=1}^{m} part), we use TensorFlow’s tf.reduce_mean function.  This function simply takes the mean of whatever tensor you provide it.  So now we have a cost function that we can use in the training process.
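As a sanity check (my own numpy sketch, not part of the original tutorial), the same two-step reduction can be reproduced outside TensorFlow:

import numpy as np

m, n = 4, 10                              # hypothetical batch of 4 samples, 10 output nodes
y = np.eye(n)[np.array([3, 1, 4, 0])]     # one-hot targets for made-up labels
y_pred = np.full((m, n), 0.1)             # a dummy uniform prediction
y_clipped = np.clip(y_pred, 1e-10, 0.9999999)

# first sum over the output nodes (axis=1), then mean over the samples
per_sample = -np.sum(y * np.log(y_clipped) + (1 - y) * np.log(1 - y_clipped), axis=1)
cross_entropy = np.mean(per_sample)
print(cross_entropy)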

Let’s setup the optimiser in TensorFlow:

# add an optimiser
optimiser = tf.train.GradientDescentOptimizer(learning_rate=learning_rate).minimize(cross_entropy)

Here we are just using the gradient descent optimiser provided by TensorFlow.  We initialize it with a learning rate, then specify what we want it to do – i.e. minimise the cross entropy cost operation we created.  This function will then perform the gradient descent (for more details on gradient descent see here and here) and the backpropagation for you.  How easy is that?  TensorFlow has a library of popular neural network training optimisers, see here.

Finally, before we move on to the main show, where we actually run the operations, let’s setup the variable initialisation operation and an operation to measure the accuracy of our predictions:

# finally setup the initialisation operator
init_op = tf.global_variables_initializer()

# define an accuracy assessment operation
correct_prediction = tf.equal(tf.argmax(y, 1), tf.argmax(y_, 1))
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))

The correct prediction operation correct_prediction makes use of the TensorFlow tf.equal function which returns True or False depending on whether the two arguments supplied to it are equal.  The tf.argmax function is the same as the numpy argmax function, which returns the index of the maximum value in a vector / tensor.  Therefore, the correct_prediction operation returns a tensor of size (m x 1) of True and False values designating whether the neural network has correctly predicted the digit.  We then want to calculate the mean accuracy from this tensor – first we have to cast the type of the correct_prediction operation from a Boolean to a TensorFlow float in order to perform the reduce_mean operation.  Once we’ve done that, we have an accuracy operation ready to assess the performance of our neural network.
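A tiny numpy analogue of this accuracy chain (an illustrative sketch of mine, with made-up predictions):

import numpy as np

y_true = np.array([[0, 1, 0], [1, 0, 0]])               # hypothetical one-hot labels
y_hat = np.array([[0.1, 0.8, 0.1], [0.3, 0.5, 0.2]])    # hypothetical network outputs

correct = np.equal(np.argmax(y_true, 1), np.argmax(y_hat, 1))
accuracy = np.mean(correct.astype(np.float32))
print(accuracy)  # 0.5 - only the first sample is predicted correctly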

3.2 Setting up the training

We now have everything we need to setup the training process of our neural network.  I’m going to show the full code below, then talk through it:

# start the session
with tf.Session() as sess:
    # initialise the variables
    sess.run(init_op)
    total_batch = int(len(mnist.train.labels) / batch_size)
    for epoch in range(epochs):
        avg_cost = 0
        for i in range(total_batch):
            batch_x, batch_y = mnist.train.next_batch(batch_size=batch_size)
            _, c = sess.run([optimiser, cross_entropy],
                            feed_dict={x: batch_x, y: batch_y})
            avg_cost += c / total_batch
        print("Epoch:", (epoch + 1), "cost =", "{:.3f}".format(avg_cost))
    print(sess.run(accuracy, feed_dict={x: mnist.test.images, y: mnist.test.labels}))

Stepping through the lines above, the first couple relate to setting up the with statement and running the initialisation operation.  The third line relates to our mini-batch training scheme: it calculates the number of batches to run through in each training epoch (if you want to know more about mini-batch gradient descent, check out this post).  After that, we loop through each training epoch and initialise an avg_cost variable to keep track of the average cross entropy cost for each epoch.  The next line is where we extract a randomised batch of samples, batch_x and batch_y, from the MNIST training dataset.  The TensorFlow provided MNIST dataset has a handy utility function, next_batch, that makes it easy to extract batches of data for training.

The following line is where we run two operations.  Notice that sess.run is capable of taking a list of operations to run as its first argument.  In this case, supplying [optimiser, cross_entropy] as the list means that both these operations will be performed.  As such, we get two outputs, which we have assigned to the variables _ and c.  We don’t really care too much about the output from the optimiser operation but we want to know the output from the cross_entropy operation – which we have assigned to the variable c.  Note, we run the optimiser (and cross_entropy) operation on the batch samples.  In the following line, we use c to calculate the average cost for the epoch.

Finally, we print out our progress in the average cost, and after the training is complete, we run the accuracy operation to print out the accuracy of our trained network on the test set.  Running this program produces the following output:

Epoch: 1 cost = 0.586
Epoch: 2 cost = 0.213
Epoch: 3 cost = 0.150
Epoch: 4 cost = 0.113
Epoch: 5 cost = 0.094
Epoch: 6 cost = 0.073
Epoch: 7 cost = 0.058
Epoch: 8 cost = 0.045
Epoch: 9 cost = 0.036
Epoch: 10 cost = 0.027

Training complete!
0.9787

There we go – approximately 98% accuracy on the test set, not bad.  We could do a number of things to improve the model, such as regularisation (see this tips and tricks post), but here we are just interested in exploring TensorFlow.  You can also use TensorBoard visualisation to look at things like the increase in accuracy over the epochs:

[Figure: TensorBoard plot of the increase in accuracy over 10 epochs]

In a future article, I’ll introduce you to TensorBoard visualisation, which is a really nice feature of TensorFlow.  For now, I hope this tutorial was instructive and helps get you going on the TensorFlow journey – in future articles I’ll show you how to build more complex neural networks such as convolution neural networks and recurrent neural networks. So stay tuned.

Have fun!
Posted by uniqueone

GitHub - duggalrahul/AlexNet-Experiments-Keras: Code examples for training AlexNet using Keras and Theano
https://github.com/duggalrahul/AlexNet-Experiments-Keras
Posted by uniqueone
https://m.facebook.com/groups/107107546348803?view=permalink&id=413934485666106

As a lot of you have read, Baidu has released their paper on speech-to-text, called DeepSpeech. As written in the paper, their end-to-end architecture offers a 7x speed-up over previous architectures. And, as I understand it, sets the new state-of-the-art.

Papers are fun, but without data and code they're hard to implement for a lot of individuals.

The great thing is that Mozilla open-sourced a DeepSpeech model implemented in TensorFlow. So now all of us can use it, tweak it, train it.

GitHub: https://github.com/mozilla/DeepSpeech
Blog: https://svail.github.io/mandarin/
Paper: http://jmlr.org/proceedings/papers/v48/amodei16.pdf
Posted by uniqueone

10 minutes Practical TensorFlow Tutorial for quick learners | CV-Tricks.com
http://cv-tricks.com/artificial-intelligence/deep-learning/deep-learning-frameworks/tensorflow-tutorial/
Posted by uniqueone
http://distill.pub/2017/momentum/

Why Momentum Really Works

[Interactive demo: gradient descent with momentum; step size α = 0.02, momentum β = 0.99]
We often think of Momentum as a means of dampening oscillations and speeding up the iterations, leading to faster convergence. But it has other interesting behavior. It allows a larger range of step-sizes to be used, and creates its own oscillations. What is going on?

Here’s a popular story about momentum [1, 2, 3]: gradient descent is a man walking down a hill. He follows the steepest path downwards; his progress is slow, but steady. Momentum is a heavy ball rolling down the same hill. The added inertia acts both as a smoother and an accelerator, dampening oscillations and causing us to barrel through narrow valleys, small humps and local minima.

This standard story isn’t wrong, but it fails to explain many important behaviors of momentum. In fact, momentum can be understood far more precisely if we study it on the right model.

One nice model is the convex quadratic. This model is rich enough to reproduce momentum’s local dynamics in real problems, and yet simple enough to be understood in closed form. This balance gives us powerful traction for understanding this algorithm.


We begin with Gradient Descent. The algorithm has many virtues, but speed is not one of them. It is simple -- when optimizing a smooth function $f$, we make a small step in the gradient $w^{k+1} = w^k-\alpha\nabla f(w^k)$. For a step size small enough, gradient descent makes a monotonic improvement at every iteration. It always converges, albeit to a local minimum. And under a few weak curvature conditions it can even get there at an exponential rate.

But the exponential decrease, though appealing in theory, can often be infuriatingly small. Things often begin quite well -- with an impressive, almost immediate decrease in the loss. But as the iterations progress, things start to slow down. You start to get a nagging feeling you're not making as much progress as you should be. What has gone wrong?

The problem could be the optimizer's old nemesis, pathological curvature. Pathological curvature is, simply put, regions of $f$ which aren't scaled properly. The landscapes are often described as valleys, trenches, canals and ravines. The iterates either jump between valleys, or approach the optimum in small, timid steps. Progress along certain directions grinds to a halt. In these unfortunate regions, gradient descent fumbles.

Momentum proposes the following tweak to gradient descent. We give gradient descent a short-term memory:
$$\begin{aligned} z^{k+1}&=\beta z^{k}+\nabla f(w^{k})\\ w^{k+1}&=w^{k}-\alpha z^{k+1} \end{aligned}$$
The change is innocent, and costs almost nothing. When $\beta = 0$, we recover gradient descent. But for $\beta = 0.99$ (sometimes $0.999$, if things are really bad), this appears to be the boost we need. Our iterations regain the speed and boldness they lost, speeding to the optimum with a renewed energy.

Optimizers call this minor miracle "acceleration".

The new algorithm may seem at first glance like a cheap hack. A simple trick to get around gradient descent's more aberrant behavior - a smoother for oscillations between steep canyons. But the truth, if anything, is the other way round. It is gradient descent which is the hack. First, momentum gives up to a quadratic speedup on many functions.^1 This is no small matter -- this is similar to the speedup you get from the Fast Fourier Transform, Quicksort, and Grover's Algorithm. When the universe gives you quadratic speedups, you should start to pay attention.

But there's more. A lower bound, courtesy of Nesterov [5], states that momentum is, in a certain very narrow and technical sense, optimal. Now, this doesn't mean it is the best algorithm for all functions in all circumstances. But it does satisfy some curiously beautiful mathematical properties which scratch a very human itch for perfection and closure. But more on that later. Let's say this for now -- momentum is an algorithm for the book.


First Steps: Gradient Descent

We begin by studying gradient descent on the simplest model possible which isn't trivial -- the convex quadratic,
$$f(w) = \tfrac{1}{2}w^TAw - b^Tw, \qquad w \in \mathbf{R}^n.$$
When $A$ is invertible, the optimal solution $w^{\star}$ occurs at $w^{\star} = A^{-1}b$. Simple as this model may be, it is rich enough to approximate many functions (think of $A$ as your favorite model of curvature -- the Hessian, Fisher Information Matrix [6], etc.) and captures all the key features of pathological curvature. And more importantly, we can write an exact closed formula for gradient descent on this function.

This is how it goes. Since $\nabla f(w)=Aw - b$, the iterates are
$$w^{k+1}=w^{k}- \alpha (Aw^{k} - b).$$
Here's the trick. There is a very natural space to view gradient descent where all the dimensions act independently -- the eigenvectors of $A$.

Every symmetric matrix $A$ has an eigenvalue decomposition
$$A=Q\ \text{diag}(\lambda_{1},\ldots,\lambda_{n})\ Q^{T},\qquad Q = [q_1,\ldots,q_n],$$
and as per convention, we will assume that the $\lambda_i$'s are sorted, from smallest $\lambda_1$ to biggest $\lambda_n$. If we perform a change of basis, $x^{k} = Q^T(w^{k} - w^\star)$, the iterations break apart, becoming:
$$x_{i}^{k+1} = x_{i}^{k}-\alpha \lambda_i x_{i}^{k} = (1-\alpha\lambda_i)x^k_i = (1-\alpha \lambda_i)^{k+1}x^0_i$$
Moving back to our original space $w$, we can see that
$$w^k - w^\star = Qx^k=\sum_i^n x^0_i(1-\alpha\lambda_i)^k q_i$$
and there we have it -- gradient descent in closed form.
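To make the closed form concrete, here is a short numpy sketch (my own illustration, not from the original article) checking the eigenbasis formula against the explicit iteration:

import numpy as np

rng = np.random.default_rng(0)
n = 4
M = rng.standard_normal((n, n))
A = M.T @ M + np.eye(n)            # a symmetric positive definite quadratic
b = rng.standard_normal(n)
w_star = np.linalg.solve(A, b)

lam, Q = np.linalg.eigh(A)         # eigenvalues (ascending) and eigenvectors
alpha = 1.0 / lam[-1]              # a workable step size: 0 < alpha * lam_i < 2
w = np.zeros(n)
for k in range(50):
    w = w - alpha * (A @ w - b)    # explicit gradient descent

# closed form: w^k - w* = sum_i x_i^0 (1 - alpha*lam_i)^k q_i
x0 = Q.T @ (np.zeros(n) - w_star)
w_closed = w_star + Q @ ((1 - alpha * lam) ** 50 * x0)
print(np.allclose(w, w_closed))    # True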

 

Decomposing the Error

The above equation admits a simple interpretation. Each element of $x^0$ is the component of the error in the initial guess in the $Q$-basis. There are $n$ such errors, and each of these errors follows its own, solitary path to the minimum, decreasing exponentially with a compounding rate of $1-\alpha\lambda_i$. The closer that number is to $1$, the slower it converges.

For most step-sizes, the eigenvectors with largest eigenvalues converge the fastest. This triggers an explosion of progress in the first few iterations, before things slow down as the smaller eigenvectors' struggles are revealed. By writing the contributions of each eigenspace's error to the loss,
$$f(w^{k})-f(w^{\star})=\sum_i (1-\alpha\lambda_{i})^{2k}\lambda_{i}[x_{i}^{0}]^2,$$
we can visualize the contributions of each error component to the loss.

Optimization can be seen as a combination of several component problems, shown here as three modes with eigenvalues $\lambda_1=0.01$, $\lambda_2=0.1$, and $\lambda_3=1$ respectively. [Interactive figure: per-component error over 140 iterations, with an adjustable step size and an "Optimal Step Size" marker.]

Choosing A Step-size

The above analysis gives us immediate guidance as to how to set a step-size $\alpha$. In order to converge, each $|1-\alpha \lambda_i|$ must be strictly less than 1. All workable step-sizes, therefore, fall in the interval $0<\alpha\lambda_i<2$. The overall convergence rate is determined by the slowest error component, which must be either $\lambda_1$ or $\lambda_n$:
$$\text{rate}(\alpha) = \max_{i}\left|1-\alpha\lambda_{i}\right| = \max\left\{|1-\alpha\lambda_{1}|,\ |1-\alpha\lambda_{n}|\right\}$$

This overall rate is minimized when the rates for $\lambda_1$ and $\lambda_n$ are the same -- this mirrors our informal observation in the previous section that the optimal step size causes the first and last eigenvectors to converge at the same time. If we work this through we get:
$$\text{optimal }\alpha = \mathop{\text{argmin}}_\alpha \text{rate}(\alpha) = \frac{2}{\lambda_{1}+\lambda_{n}}, \qquad \text{optimal rate} = \min_\alpha \text{rate}(\alpha) = \frac{\lambda_{n}/\lambda_{1}-1}{\lambda_{n}/\lambda_{1}+1}$$

Notice the ratio $\lambda_n/\lambda_1$ determines the convergence rate of the problem. In fact, this ratio appears often enough that we give it a name, and a symbol -- the condition number:
$$\text{condition number} := \kappa :=\frac{\lambda_n}{\lambda_1}$$
The condition number means many things. It is a measure of how close to singular a matrix is. It is a measure of how robust $A^{-1}b$ is to perturbations in $b$. And, in this context, the condition number gives us a measure of how poorly gradient descent will perform. A ratio of $\kappa = 1$ is ideal, giving convergence in one step (of course, the function is trivial). Unfortunately, the larger the ratio, the slower gradient descent will be. The condition number is therefore a direct measure of pathological curvature.
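A quick sketch (mine, using the three-eigenvalue example above) of these formulas in code:

import numpy as np

lam = np.array([0.01, 0.1, 1.0])            # eigenvalues from the running example
alpha_opt = 2 / (lam[0] + lam[-1])          # optimal step size
rate = max(abs(1 - alpha_opt * lam[0]), abs(1 - alpha_opt * lam[-1]))
kappa = lam[-1] / lam[0]
print(alpha_opt)                            # ~1.98
print(rate, (kappa - 1) / (kappa + 1))      # both ~0.980: rate = (kappa-1)/(kappa+1)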


Example: Polynomial Regression

The above analysis reveals an insight: all errors are not made equal. Indeed, there are different kinds of errors, $n$ to be exact, one for each of the eigenvectors of $A$. And gradient descent is better at correcting some kinds of errors than others. But what do the eigenvectors of $A$ mean? Surprisingly, in many applications they admit a very concrete interpretation.

Let's see how this plays out in polynomial regression. Given 1D data, $\xi_i$, our problem is to fit the model
$$\text{model}(\xi)=w_{1}p_{1}(\xi)+\cdots+w_{n}p_{n}(\xi), \qquad p_{i} = \xi \mapsto \xi^{i-1}$$
to our observations, $d_i$. This model, though nonlinear in the input $\xi$, is linear in the weights, and therefore we can write it as a linear combination of monomials.

Because of the linearity, we can fit this model to our data $\xi_i$ using linear regression on the model mismatch
$$\text{minimize}_w \quad \tfrac{1}{2}\sum_i (\text{model}(\xi_{i})-d_{i})^{2} = \tfrac{1}{2}\|Zw - d\|^2$$
where
$$Z=\begin{pmatrix} 1 & \xi_{1} & \xi_{1}^{2} & \ldots & \xi_{1}^{n-1}\\ 1 & \xi_{2} & \xi_{2}^{2} & \ldots & \xi_{2}^{n-1}\\ \vdots & \vdots & \vdots & \ddots & \vdots\\ 1 & \xi_{m} & \xi_{m}^{2} & \ldots & \xi_{m}^{n-1} \end{pmatrix}.$$

The path of convergence, as we know, is elucidated when we view the iterates in the space of $Q$ (the eigenvectors of $Z^T Z$). So let's recast our regression problem in the basis of $Q$. First, we do a change of basis, by rotating $w$ into $Qw$, and counter-rotating our feature maps $p$ into eigenspace, $\bar{p}$. We can now conceptualize the same regression as one over a different polynomial basis, with the model
$$\text{model}(\xi) = x_{1}\bar{p}_{1}(\xi) + \cdots + x_{n}\bar{p}_{n}(\xi), \qquad \bar{p}_{i}=\sum_j q_{ij}p_j.$$
This model is identical to the old one. But these new features $\bar{p}$ (which I call "eigenfeatures") and weights have the pleasing property that each coordinate acts independently of the others. Now our optimization problem breaks down, really, into $n$ small 1D optimization problems. Each coordinate can be optimized greedily and independently, one at a time in any order, to produce the final, global optimum. The eigenfeatures are also much more informative.

The observations in the above diagram can be justified mathematically. From a statistical point of view, we would like a model which is, in some sense, robust to noise. Our model cannot possibly be meaningful if the slightest perturbation to the observations changes the entire model dramatically. And the eigenfeatures, the principal components of the data, give us exactly the decomposition we need to sort the features by their sensitivity to perturbations in the $d_i$'s. The most robust components appear in the front (with the largest eigenvalues), and the most sensitive components in the back (with the smallest eigenvalues).

This measure of robustness, by a rather convenient coincidence, is also a measure of how easily an eigenspace converges. And thus, the "pathological directions" -- the eigenspaces which converge the slowest -- are also those which are most sensitive to noise! So starting at a simple initial point like $0$ (by a gross abuse of language, let's think of this as a prior), we track the iterates till a desired level of complexity is reached. Let's see how this plays out in gradient descent.

This effect is harnessed with the heuristic of early stopping: by stopping the optimization early, you can often get better generalizing results. Indeed, the effect of early stopping is very similar to that of more conventional methods of regularization, such as Tikhonov Regression. Both methods try to suppress the components of the smallest eigenvalues directly, though they employ different methods of spectral decay.^2 But early stopping has a distinct advantage. Once the stepsize is chosen, there are no regularization parameters to fiddle with. Indeed, in the course of a single optimization, we have the entire family of models, from underfitted to overfitted, at our disposal. This gift, it seems, doesn't come at a price. A beautiful free lunch [7] indeed.


The Dynamics of Momentum

Let's turn our attention back to momentum. Recall the momentum update is
$$\begin{aligned} z^{k+1}&=\beta z^{k}+\nabla f(w^{k})\\ w^{k+1}&=w^{k}-\alpha z^{k+1}. \end{aligned}$$
Since $\nabla f(w^k) = Aw^k - b$, the update on the quadratic is
$$\begin{aligned} z^{k+1}&=\beta z^{k}+ (Aw^{k}-b)\\ w^{k+1}&=w^{k}-\alpha z^{k+1}. \end{aligned}$$
Following [8], we go through the same motions, with the change of basis $x^{k} = Q(w^{k} - w^\star)$ and $y^{k} = Qz^{k}$, to yield the update rule
$$\begin{aligned} y_{i}^{k+1}&=\beta y_{i}^{k}+\lambda_{i}x_{i}^{k}\\ x_{i}^{k+1}&=x_{i}^{k}-\alpha y_{i}^{k+1}, \end{aligned}$$
in which each component acts independently of the other components (though $x^k_i$ and $y^k_i$ are coupled). This lets us rewrite our iterates as^3
$$\left(\begin{array}{c} y_{i}^{k}\\ x_{i}^{k} \end{array}\right)=R^k\left(\begin{array}{c} y_{i}^{0}\\ x_{i}^{0} \end{array}\right), \qquad R = \left(\begin{array}{cc} \beta & \lambda_{i}\\ -\alpha\beta & 1-\alpha\lambda_{i} \end{array}\right).$$
There are many ways of taking a matrix to the $k^{th}$ power. But for the $2 \times 2$ case there is an elegant and little known formula [9] in terms of the eigenvalues of $R$, $\sigma_1$ and $\sigma_2$:
$$R^{k}=\begin{cases} \sigma_{1}^{k}R_{1}-\sigma_{2}^{k}R_{2} & \sigma_{1}\neq\sigma_{2}\\ \sigma_{1}^{k}(kR/\sigma_1-(k-1)I) & \sigma_{1}=\sigma_{2} \end{cases},\qquad R_{j}=\frac{R-\sigma_{j}I}{\sigma_{1}-\sigma_{2}}$$
This formula is rather complicated, but the takeaway here is that it plays the exact same role that the individual convergence rates $1-\alpha\lambda_i$ do in gradient descent. But instead of one geometric series, we have two coupled series, which may have real or complex values. The convergence rate is therefore the slowest of the two rates, $\max\{|\sigma_{1}|,|\sigma_{2}|\}$.^4 By plotting this out, we see there are distinct regions of the parameter space which reveal a rich taxonomy of convergence behavior [10]:
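Here is a small numpy sketch (my own, using the definitions above) that computes this convergence rate $\max\{|\sigma_1|, |\sigma_2|\}$ for one eigen-component:

import numpy as np

def momentum_rate(alpha, beta, lam):
    # R is the 2x2 update matrix of a single eigen-component
    R = np.array([[beta, lam],
                  [-alpha * beta, 1 - alpha * lam]])
    sigma = np.linalg.eigvals(R)   # sigma_1, sigma_2 (possibly complex)
    return max(abs(sigma))         # the slower of the two coupled rates

print(momentum_rate(0.02, 0.99, 1.0))   # momentum
print(momentum_rate(0.02, 0.0, 1.0))    # beta = 0 recovers gradient descent: |1 - alpha*lam| = 0.98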

[Figure: Convergence Rate. A plot of $\max\{|\sigma_1|, |\sigma_2|\}$ over $\alpha$ and $\beta$ reveals distinct regions, each with its own style of convergence.]

For what values of $\alpha$ and $\beta$ does momentum converge? Since we need both $\sigma_1$ and $\sigma_2$ to converge, our convergence criterion is now $\max\{|\sigma_{1}|,|\sigma_{2}|\} < 1$. The range of available step-sizes works out^5 to be
$$0<\alpha\lambda_{i}<2+2\beta \qquad \text{for} \qquad 0 \leq \beta < 1.$$
We recover the previous result for gradient descent when $\beta = 0$. But notice an immediate boon we get: momentum allows us to crank up the step-size by a factor of 2 before diverging.


The Critical Damping Coefficient

The true magic happens, however, when we find the sweet spot of $\alpha$ and $\beta$. Let us first try to optimize over $\beta$.

Momentum admits an interesting physical interpretation when $\alpha$ is small [11]: it is a discretization of a damped harmonic oscillator. Consider a physical simulation operating in discrete time (like a video game).

$$y_{i}^{k+1} = \beta y_{i}^{k} + \lambda_{i}x_{i}^{k}$$
We can think of $-y_i^k$ as velocity, which is dampened by $\beta$ at each step and perturbed by the external force field $\lambda_i x_i^k$.
$$x_i^{k+1} = x_i^k - \alpha y_i^{k+1}$$
And $x$ is our particle's position, which is moved at each step by a small amount in the direction of the velocity $y^{k+1}_i$.

We can break this equation apart to see how each component affects the dynamics of the system. Here we plot, for 150 iterates, the particle's velocity (the horizontal axis) against its position (the vertical axis), in a phase diagram.

This system is best imagined as a weight suspended on a spring. We pull the weight down by one unit, and we study the path it takes as it returns to equilibrium. In the analogy, the spring is the source of our external force $\lambda_i x^k_i$, and equilibrium is the state when both the position $x^k_i$ and the speed $y^k_i$ are 0. The choice of $\beta$ crucially affects the rate of return to equilibrium.

 

The critical value $\beta = (1 - \sqrt{\alpha \lambda_i})^2$ gives us a convergence rate (in eigenspace $i$) of $1 - \sqrt{\alpha\lambda_i}$. A square root improvement over gradient descent, $1-\alpha\lambda_i$! Alas, this only applies to the error in the $i^{th}$ eigenspace, with $\alpha$ fixed.

Optimal parameters

To get a global convergence rate, we must optimize over both $\alpha$ and $\beta$. This is a more complicated affair,^6 but the optimal parameters work out to be
$$\alpha = \left(\frac{2}{\sqrt{\lambda_{1}}+\sqrt{\lambda_{n}}}\right)^{2}, \qquad \beta = \left(\frac{\sqrt{\lambda_{n}}-\sqrt{\lambda_{1}}}{\sqrt{\lambda_{n}}+\sqrt{\lambda_{1}}}\right)^{2}.$$
Plug this into the convergence rate, and you get

$$\frac{\sqrt{\kappa}-1}{\sqrt{\kappa}+1} \ \ \text{(convergence rate, momentum)} \qquad \text{vs.} \qquad \frac{\kappa-1}{\kappa+1} \ \ \text{(convergence rate, gradient descent)}$$

With barely a modicum of extra effort, we have essentially square rooted the condition number! These gains, in principle, require explicit knowledge of $\lambda_1$ and $\lambda_n$. But the formulas reveal a simple guideline. When the problem's conditioning is poor, the optimal $\alpha$ is approximately twice that of gradient descent, and the momentum term is close to $1$. So set $\beta$ as close to $1$ as you can, and then find the highest $\alpha$ which still converges. Being at the knife's edge of divergence, like in gradient descent, is a good place to be.
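In code (a sketch of mine based on the formulas above):

import numpy as np

lam1, lamn = 0.01, 1.0                  # smallest and largest eigenvalues
alpha = (2 / (np.sqrt(lam1) + np.sqrt(lamn))) ** 2
beta = ((np.sqrt(lamn) - np.sqrt(lam1)) / (np.sqrt(lamn) + np.sqrt(lam1))) ** 2

kappa = lamn / lam1
print(alpha, beta)                                   # ~3.31, ~0.67
print((np.sqrt(kappa) - 1) / (np.sqrt(kappa) + 1))   # momentum rate ~0.82
print((kappa - 1) / (kappa + 1))                     # gradient descent rate ~0.98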

We can do the same decomposition here with momentum, with eigenvalues $\lambda_1=0.01$, $\lambda_2=0.1$, and $\lambda_3=1$. The decrease in $f(w^k) - f(w^\star)$ is no longer monotonic, but it is significantly faster. [Interactive figure: per-component loss under momentum, with adjustable step size $\alpha$ and momentum $\beta$.] Note that the optimal parameters do not necessarily give the fastest convergence, only the fastest asymptotic convergence rate.

While the loss function of gradient descent had a graceful, monotonic curve, optimization with momentum displays clear oscillations. These ripples are not restricted to quadratics, and occur in all kinds of functions in practice. They are not cause for alarm, but are an indication that extra tuning of the hyperparameters is required.


Example: The Colorization Problem

Let's look at how momentum accelerates convergence with a concrete example. On a grid of pixels, let $G$ be the graph with vertices as pixels, $E$ be the set of edges connecting each pixel to its four neighboring pixels, and $D$ be a small set of a few distinguished vertices. Consider the problem of minimizing

$$\text{minimize} \quad \frac{1}{2} \sum_{i\in D} (w_i - 1)^2 + \frac{1}{2} \sum_{i,j\in E} (w_i - w_j)^2$$

The first term (the colorizer) pulls the distinguished pixels towards 1; the second (the smoother) spreads the color out.

The optimal solution to this problem is a vector of all $1$'s.^7 An inspection of the gradient iteration reveals why we take a long time to get there. The gradient step, for each component, is some form of weighted average of the current value and its neighbors:
$$w_{i}^{k+1}=w_{i}^{k}-\alpha\sum_{j\in N(i)}(w_{i}^{k}-w_{j}^{k})-\begin{cases} \alpha(w_{i}^{k}-1) & i\in D\\ 0 & i\notin D \end{cases}$$
This kind of local averaging is effective at smoothing out local variations in the pixels, but poor at taking advantage of global structure. The updates are akin to a drop of ink, diffusing through water. Movement towards equilibrium is made only through local corrections and so, left undisturbed, its march towards the solution is slow and laborious. Fortunately, momentum speeds things up significantly.
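To see the speed-up concretely, here is a rough numpy sketch (my own construction, not from the article: a path graph with the first pixel as the only distinguished vertex) comparing gradient descent and momentum with their optimal parameters:

import numpy as np

n = 50
# quadratic form: path-graph Laplacian L_G plus the colorizer term on pixel 0
A = 2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
A[0, 0] -= 1.0
A[-1, -1] -= 1.0                  # A is now the path Laplacian
A[0, 0] += 1.0                    # colorizer: (w_0 - 1)^2 contributes e_0 e_0^T ...
b = np.zeros(n)
b[0] = 1.0                        # ... and a linear term; the solution is all ones

lam = np.linalg.eigvalsh(A)
w_star = np.linalg.solve(A, b)

w_gd = np.zeros(n)
alpha_gd = 2 / (lam[0] + lam[-1])                 # optimal gradient descent step
w_m, z = np.zeros(n), np.zeros(n)
alpha_m = (2 / (np.sqrt(lam[0]) + np.sqrt(lam[-1]))) ** 2
beta = ((np.sqrt(lam[-1]) - np.sqrt(lam[0])) / (np.sqrt(lam[-1]) + np.sqrt(lam[0]))) ** 2

for k in range(500):
    w_gd = w_gd - alpha_gd * (A @ w_gd - b)
    z = beta * z + (A @ w_m - b)
    w_m = w_m - alpha_m * z

print(np.linalg.norm(w_gd - w_star))   # gradient descent is still far away
print(np.linalg.norm(w_m - w_star))    # momentum is orders of magnitude closer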

The eigenvectors of the colorization problem form a generalized Fourier basis for $\mathbf{R}^n$. The smallest eigenvalues have low frequencies, hence gradient descent corrects high frequency errors well but not low frequency ones.

In vectorized form, the colorization problem is

$$\text{minimize} \quad \frac{1}{2}\sum_{i\in D}\left(x^{T}e_{i}e_{i}^{T}x-e_{i}^{T}x\right) + \frac{1}{2}x^{T}L_{G}x$$

where the quadratic form $\frac{1}{2}x^T L_G x$ is the smoother (its matrix is the Graph Laplacian), the sum over $D$ is the colorizer (a small low rank correction with a linear term), and $e_i$ is the $i^{th}$ unit vector.

The Laplacian matrix $L_G$,^8 which dominates the behavior of the optimization problem, is a valuable bridge between linear algebra and graph theory. This is a rich field of study, but one fact is pertinent to our discussion here. The conditioning of $L_G$, here defined as the ratio of the second eigenvalue to the last (the first eigenvalue is always 0, with eigenvector equal to the vector of all 1's), is directly connected to the connectivity of the graph.

Small world graphs, like expanders and dense graphs, have excellent conditioning.
The conditioning of grids improves with their dimensionality.
And long, wiry graphs, like paths, condition poorly.

This observation carries through to the colorization problem, and the intuition behind it should be clear. Well connected graphs allow rapid diffusion of information through the edges, while graphs with poor connectivity do not. And this principle, taken to the extreme, furnishes a class of functions so hard to optimize they reveal the limits of first order optimization.


The Limits of Descent

Let's take a step back. We have, with a clever trick, improved the convergence of gradient descent by a quadratic factor with the introduction of a single auxiliary sequence. But is this the best we can do? Could we improve convergence even more with two sequences? Could one perhaps choose the $\alpha$'s and $\beta$'s intelligently and adaptively? It is tempting to ride this wave of optimism - to the cube root and beyond!

Unfortunately, while improvements to the momentum algorithm do exist, they all run into a certain, critical, almost inescapable lower bound.

Adventures in Algorithmic Space

To understand the limits of what we can do, we must first formally define the algorithmic space in which we are searching. Here's one possible definition. The observation we will make is that both gradient descent and momentum can be "unrolled". Indeed, since
$$\begin{array}{lll} w^{1} & = & w^{0} - \alpha\nabla f(w^{0})\\ w^{2} & = & w^{1} - \alpha\nabla f(w^{1}) = w^{0} - \alpha\nabla f(w^{0}) - \alpha\nabla f(w^{1})\\ & \vdots & \\ w^{k} & = & w^{0} - \alpha\nabla f(w^{0}) - \cdots - \alpha\nabla f(w^{k-1}), \end{array}$$
we can write gradient descent as
$$w^{k} = w^{0} - \alpha\sum_{i}^{k-1}\nabla f(w^{i}).$$
A similar trick can be done with momentum:
$$w^{k} = w^{0} - \alpha\sum_{i}^{k-1}\frac{1-\beta^{k-i}}{1-\beta}\nabla f(w^i).$$
In fact, all manner of first order algorithms, including the Conjugate Gradient algorithm, AdaMax, Averaged Gradient and more, can be written (though not quite so neatly) in this unrolled form. Therefore, the class of algorithms for which
$$w^{k} = w^{0} + \sum_{i}^{k}\gamma_{i}^{k}\nabla f(w^{i}) \qquad \text{for some } \gamma_{i}^{k}$$
contains momentum, gradient descent and a whole bunch of other algorithms you might dream up. This is what is assumed in Assumption 2.1.4 [5] of Nesterov. But let's push this even further, and expand this class to allow different step sizes for different directions:
$$w^{k} = w^{0} + \sum_{i}^{k}\Gamma_{i}^{k}\nabla f(w^{i}) \qquad \text{for some diagonal matrix } \Gamma_{i}^{k}.$$
This class of methods covers most of the popular algorithms for training neural networks, including ADAM and AdaGrad. We shall refer to this class of methods as "Linear First Order Methods", and we will show a single function all these methods ultimately fail on.

The Resisting Oracle

Earlier, when we talked about the colorizer problem, we observed that wiry graphs cause bad conditioning in our optimization problem. Taking this to its extreme, we can look at a graph consisting of a single path -- a function so badly conditioned that Nesterov called a variant of it "the worst function in the world". The function follows the same structure as the colorizer problem, and we shall call this the Convex Rosenbrock,

$$f^{n}(w) \;=\; \underbrace{\tfrac{1}{2}\left(w_{1}-1\right)^{2}}_{\text{a colorizer of one node}} \;+\; \underbrace{\tfrac{1}{2}\sum_{i=1}^{n}(w_{i}-w_{i+1})^{2}}_{\text{strong couplings between adjacent nodes in the path}} \;+\; \underbrace{\tfrac{2}{\kappa-1}\|w\|^{2}}_{\text{a small regularization term}}$$

The optimal solution of this problem is

$$w_{i}^{\star}=\left(\frac{\sqrt{\kappa}-1}{\sqrt{\kappa}+1}\right)^{i}$$

and the condition number of the problem $f^n$ approaches $\kappa$ as $n$ goes to infinity. Now observe the behavior of the momentum algorithm on this function, starting from $w^0 = 0$.
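Here is a direct implementation of $f^n$ (a sketch under my reading of the formula above, taking $w_{n+1} = 0$; kappa and n are arbitrary choices matching the figure below). A finite-difference check shows the claimed optimum is nearly stationary, with the small boundary residual vanishing as $n$ grows:

import numpy as np

def f(w, kappa):
    # the Convex Rosenbrock, taking w_{n+1} = 0 (an assumption of this sketch)
    w_pad = np.append(w, 0.0)
    return (0.5 * (w[0] - 1.0) ** 2
            + 0.5 * np.sum((w_pad[:-1] - w_pad[1:]) ** 2)
            + (2.0 / (kappa - 1.0)) * np.sum(w ** 2))

def num_grad(w, kappa, eps=1e-6):
    # finite-difference gradient, good enough for a sanity check
    g = np.zeros_like(w)
    for i in range(len(w)):
        e = np.zeros_like(w)
        e[i] = eps
        g[i] = (f(w + e, kappa) - f(w - e, kappa)) / (2 * eps)
    return g

kappa, n = 100.0, 25
r = (np.sqrt(kappa) - 1) / (np.sqrt(kappa) + 1)
w_star = r ** np.arange(1, n + 1)             # the claimed optimum
print(np.abs(num_grad(w_star, kappa)).max())  # small, and -> 0 as n grows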

[Interactive figure: the first 50 iterates of momentum on the Convex Rosenbrock for $n=25$, with adjustable step size $\alpha$ and momentum $\beta$; axes show the error and the weights. The behavior here is similar to that of any Linear First Order Algorithm. A triangle of iterates forms a "dead zone": those iterates are always 0, no matter what the parameters. The remaining expanding space is the "light cone" of the iterates' influence; momentum does very well here with the optimal parameters.]

The observations made in the above diagram are true for any Linear First Order algorithm. Let us prove this. First observe that each component of the gradient depends only on the values directly before and after it:

$$\nabla f(w)_{i}=2w_{i}-w_{i-1}-w_{i+1}+\frac{4}{\kappa-1}w_{i}, \qquad i \neq 1.$$

Therefore the fact that we start at 0 guarantees that each component must remain stoically at 0 until an element either before or after it turns nonzero; in particular, if $w_{i-1}$, $w_i$, and $w_{i+1}$ are all 0, then $\nabla f(w)_i = 0$ and nothing moves coordinate $i$. By induction, for any linear first order algorithm,

$$\begin{array}{lllllllll} w^{0} & = & [~~0, & 0, & 0, & \ldots & 0, & 0, & \ldots & 0~]\\ w^{1} & = & [~w_{1}^{1}, & 0, & 0, & \ldots & 0, & 0, & \ldots & 0~]\\ w^{2} & = & [~w_{1}^{2}, & w_{2}^{2}, & 0, & \ldots & 0, & 0, & \ldots & 0~]\\ & \vdots \\ w^{k} & = & [~w_{1}^{k}, & w_{2}^{k}, & w_{3}^{k}, & \ldots & w_{k}^{k}, & 0, & \ldots & 0~] \end{array}$$

Think of this restriction as a "speed of light" of information transfer. Error signals will take at least $k$ steps to move from $w_0$ to $w_k$. We can therefore lower-bound the error by the components which cannot have changed yet (footnote 9):

$$\begin{aligned} \|w^{k}-w^{\star}\|_{\infty}&\geq\max_{i>k}\{|w_{i}^{\star}|\}\\ &=\left(\frac{\sqrt{\kappa}-1}{\sqrt{\kappa}+1}\right)^{k+1}\\ &=\left(\frac{\sqrt{\kappa}-1}{\sqrt{\kappa}+1}\right)^{k}\|w^{0}-w^{\star}\|_{\infty} \end{aligned}$$

As $n$ gets large, the condition number of $f^n$ approaches $\kappa$, and the gap therefore closes: the convergence rate that momentum promises matches the best any linear first order algorithm can do. We arrive at the disappointing conclusion that on this problem, we cannot do better.
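The "speed of light" restriction can be observed directly. In this sketch (my own, using the gradient formula above with the same $w_{n+1} = 0$ convention), gradient descent started from $w^0 = 0$ shows exactly the claimed sparsity pattern, the last nonzero coordinate advancing one position per step:

import numpy as np

kappa, n, alpha = 100.0, 25, 0.1

def grad(w):
    # gradient of the Convex Rosenbrock; w_0 := 1 encodes the (w_1 - 1)^2 term
    w_pad = np.concatenate(([1.0], w, [0.0]))
    return 2.0 * w - w_pad[:-2] - w_pad[2:] + (4.0 / (kappa - 1.0)) * w

w = np.zeros(n)
for k in range(1, 6):
    w = w - alpha * grad(w)
    print(k, np.flatnonzero(w).max() + 1)   # index of the last nonzero coordinate: k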

Like many such lower bounds, this result must not be taken literally, but spiritually. It, perhaps, gives a sense of closure and finality to our investigation. But this is not the final word on first order optimization. This lower bound does not preclude the possibility, for example, of reformulating the problem to change the condition number itself! There is still much room for speedups, if you understand the right places to look.

Momentum with Stochastic Gradients

There is a final point worth addressing. All the discussion above assumes access to the true gradient -- a luxury seldom afforded in modern machine learning. Computing the exact gradient requires a full pass over all the data, the cost of which can be prohibitive. Instead, randomized approximations of the gradient, like minibatch sampling, are often used as a plug-in replacement for $\nabla f(w)$. We can write the approximation in two parts:

$$\underbrace{\nabla f(w)}_{\text{the true gradient}} \;+\; \underbrace{\text{error}(w)}_{\text{an approximation error}}$$

We assume the estimator is unbiased, i.e. $\mathbf{E}[\text{error}(w)] = 0$.

It is helpful to think of our approximate gradient as the injection of a special kind of noise into our iteration. And using the machinery developed in the previous sections, we can deal with this extra term directly. On a quadratic, the error term cleaves cleanly into a separate term (footnote 10):

$$\left(\begin{array}{c} y_{i}^{k}\\ x_{i}^{k}\end{array}\right) \;=\; \underbrace{R^{k}\left(\begin{array}{c} y_{i}^{0}\\ x_{i}^{0}\end{array}\right)}_{\text{the noiseless, deterministic iterates}} \;+\; \underbrace{\sum_{j=1}^{k}\epsilon_{i}^{j}\,R^{k-j}\left(\begin{array}{c} 1\\ -\alpha\end{array}\right)}_{\text{a decaying sum of the errors}}$$

where $\epsilon^{k} = Q \cdot \text{error}(w^{k})$.

The error term $\epsilon^k$, with its dependence on $w^k$, is a fairly hairy object. Following [10], we model it as independent zero-mean Gaussian noise. In this simplified model, the objective too breaks into two separable components, a sum of a deterministic error and a stochastic error (footnote 11), visualized here.

[Interactive figure: the expected suboptimality $\mathbf{E}\,f(w) - f(w^{\star})$ decomposed into a deterministic part and a stochastic part, with adjustable step size $\alpha$ and momentum $\beta$. The small black dots are a single run of stochastic gradient descent.]
As [1] observes, the optimization has two phases. In the initial transient phase the magnitude of the noise is smaller than the magnitude of the gradient, and Momentum still makes good progress. In the second, stochastic phase, the noise overwhelms the gradient, and momentum is less effective.
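A toy simulation makes the two phases visible. In the sketch below (my own construction; the quadratic, the noise scale, and the parameters are arbitrary), momentum is run once with exact gradients and once with noisy ones. The noisy run tracks the deterministic one through the transient phase, then stalls at a noise floor:

import numpy as np

rng = np.random.default_rng(0)
lam, alpha, beta, sigma = 1.0, 0.1, 0.9, 0.05   # f(x) = (lam/2) x^2, noisy gradients

def run(noisy, steps=300):
    x, z, errs = 1.0, 0.0, []
    for _ in range(steps):
        g = lam * x + (sigma * rng.normal() if noisy else 0.0)
        z = beta * z + g                  # momentum update
        x = x - alpha * z
        errs.append(0.5 * lam * x * x)    # suboptimality f(x) - f(x*)
    return errs

det, sto = run(False), run(True)
print(det[20], sto[20])    # transient phase: both runs make fast progress
print(det[-1], sto[-1])    # stochastic phase: the noisy run sits at a noise floor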

Note that there is a set of unfortunate tradeoffs which seem to pit the two components of the error against each other. Lowering the step size, for example, decreases the stochastic error, but also slows down the rate of convergence. And increasing momentum, contrary to popular belief, causes the errors to compound. Despite these undesirable properties, stochastic gradient descent with momentum has still been shown to have competitive performance on neural networks. As [1] has observed, the transient phase seems to matter more than the fine-tuning phase in machine learning. And in fact, it has recently been suggested [12] that this noise is a good thing, acting as an implicit regularizer which, like early stopping, prevents overfitting in the fine-tuning phase of optimization.


Onwards and Downwards

The study of acceleration is seeing a small revival within the optimization community. If the ideas in this article excite you, you may wish to read [13], which fully explores the idea of momentum as the discretization of a certain differential equation. But other, less physical, interpretations exist. There is an algebraic interpretation of momentum in terms of approximating polynomials [3, 14]. Geometric interpretations are emerging [15, 16], connecting momentum to older methods like the Ellipsoid method. And finally, there are interpretations relating momentum to duality [17], perhaps providing a clue as to how to accelerate second order and quasi-Newton methods (for a first step, see [18]). But like the proverbial blind men feeling an elephant, momentum seems like something bigger than the sum of its parts. One day, hopefully soon, the many perspectives will unify into a satisfying whole.

 

Acknowledgments

I am deeply indebted to the editorial contributions of Shan Carter and Chris Olah, without which this article would be greatly impoverished. Shan Carter provided complete redesigns of many of my original interactive widgets, a visual coherence for all the figures, and valuable optimizations to the page's performance. Chris Olah provided impeccable editorial feedback at all levels of detail and abstraction - from the structure of the content, to the alignment of equations.

I am also grateful to Michael Nielsen for providing the title of this article, which really tied the article together. Marcos Ginestra provided editorial input for the earliest drafts of this article, and spiritual encouragement when I needed it the most. And my gratitude extends to my reviewers, Matt Hoffman and Anonymous Reviewer B for their astute observations and criticism. I would like to thank Reviewer B, in particular, for pointing out two non-trivial errors in the original manuscript (discussion here). The contour plotting library in the hero visualization was provided by Mike Bostock.

Footnotes

  1. It is possible, however, to construct very specific counterexamples where momentum does not converge, even on convex functions. See [4] for a counterexample.
  2. In Tikhonov regression we add a quadratic penalty to the regression, minimizing

$$\text{minimize}\qquad\tfrac{1}{2}\|Zw-d\|^{2}+\frac{\eta}{2}\|w\|^{2}=\tfrac{1}{2}w^{T}(Z^{T}Z+\eta I)w-(Zd)^{T}w$$

Recall that $Z^{T}Z=Q\,\text{diag}(\lambda_{1},\ldots,\lambda_{n})\,Q^{T}$. The solution to Tikhonov regression is therefore

$$(Z^{T}Z+\eta I)^{-1}(Zd)=Q\,\text{diag}\left(\frac{1}{\lambda_{1}+\eta},\cdots,\frac{1}{\lambda_{n}+\eta}\right)Q^{T}(Zd)$$

We can think of regularization as a function which decays the largest eigenvalues, as follows:

$$\text{Tikhonov regularized } \lambda_i = \frac{1}{\lambda_{i}+\eta}=\frac{1}{\lambda_{i}}\left(1-\left(1+\lambda_{i}/\eta\right)^{-1}\right).$$

Gradient descent can be seen as employing a similar decay, but with

$$\text{Gradient descent regularized } \lambda_i = \frac{1}{\lambda_i} \left( 1-\left(1-\alpha\lambda_{i}\right)^{k} \right)$$

instead. Note that this decay is dependent on the step size.
  3. This is true as we can write the updates in matrix form as

$$\left(\begin{array}{cc} 1 & 0\\ \alpha & 1 \end{array}\right)\left(\begin{array}{c} y_{i}^{k+1}\\ x_{i}^{k+1} \end{array}\right)=\left(\begin{array}{cc} \beta & \lambda_{i}\\ 0 & 1 \end{array}\right)\left(\begin{array}{c} y_{i}^{k}\\ x_{i}^{k} \end{array}\right)$$

which implies, by inverting the matrix on the left,

$$\left(\begin{array}{c} y_{i}^{k+1}\\ x_{i}^{k+1} \end{array}\right)=\left(\begin{array}{cc} \beta & \lambda_{i}\\ -\alpha\beta & 1-\alpha\lambda_{i} \end{array}\right)\left(\begin{array}{c} y_{i}^{k}\\ x_{i}^{k} \end{array}\right)=R^{k+1}\left(\begin{array}{c} y_{i}^{0}\\ x_{i}^{0} \end{array}\right)$$
  4. We can write out the convergence rates explicitly. The eigenvalues are

$$\begin{aligned} \sigma_{1} & =\frac{1}{2}\left(1-\alpha\lambda+\beta+\sqrt{(-\alpha\lambda+\beta+1)^{2}-4\beta}\right)\\ \sigma_{2} & =\frac{1}{2}\left(1-\alpha\lambda+\beta-\sqrt{(-\alpha\lambda+\beta+1)^{2}-4\beta}\right) \end{aligned}$$

When $(-\alpha\lambda+\beta+1)^{2}-4\beta<0$, the roots are complex and the convergence rate is

$$|\sigma_{1}|=|\sigma_{2}| =\frac{1}{2}\sqrt{(1-\alpha\lambda+\beta)^{2}+\left|(-\alpha\lambda+\beta+1)^{2}-4\beta\right|}=\sqrt{\beta}$$

which is, surprisingly, independent of the step size $\alpha$ and of the eigenvalue $\lambda$. When the roots are real, the convergence rate is

$$\max\{|\sigma_{1}|,|\sigma_{2}|\}=\tfrac{1}{2}\max\left\{ \left|1-\alpha\lambda_{i}+\beta\pm\sqrt{(1-\alpha\lambda_{i}+\beta)^{2}-4\beta}\right|\right\}$$

(A quick numerical check of the complex-root magnitude appears just after these footnotes.)
  5. This can be derived by reducing the inequalities for all 4 + 1 cases in the explicit form of the convergence rate above.
  6. We must optimize over

$$\min_{\alpha,\beta}\max\left\{ \left\|\left(\begin{array}{cc} \beta & \lambda_{1}\\ -\alpha\beta & 1-\alpha\lambda_{1} \end{array}\right)\right\|,\ldots,\left\|\left(\begin{array}{cc} \beta & \lambda_{n}\\ -\alpha\beta & 1-\alpha\lambda_{n} \end{array}\right)\right\|\right\}$$

($\|\cdot\|$ here denotes the magnitude of the maximum eigenvalue). The optimum occurs when the roots of the characteristic polynomial are repeated for the matrices corresponding to the extremal eigenvalues.
  7. The above optimization problem is bounded from below by $0$, and the vector of all $1$'s achieves this bound.
  8. This can be written explicitly as

$$[L_{G}]_{ij}=\begin{cases} \text{degree of vertex } i & i=j\\ -1 & i\neq j,\ (i,j)\text{ or }(j,i)\in E\\ 0 & \text{otherwise} \end{cases}$$
  9. We use the infinity norm to measure our error; similar results can be derived for the 1 and 2 norms.
  10. The momentum iterations are

$$\begin{aligned} z^{k+1}&=\beta z^{k}+ A w^{k} + \text{error}(w^k) \\ w^{k+1}&=w^{k}-\alpha z^{k+1} \end{aligned}$$

which, after a change of variables, become

$$\left(\begin{array}{cc} 1 & 0\\ \alpha & 1 \end{array}\right)\left(\begin{array}{c} y_{i}^{k+1}\\ x_{i}^{k+1} \end{array}\right)=\left(\begin{array}{cc} \beta & \lambda_{i}\\ 0 & 1 \end{array}\right)\left(\begin{array}{c} y_{i}^{k}\\ x_{i}^{k} \end{array}\right)+\left(\begin{array}{c} \epsilon_{i}^{k}\\ 0 \end{array}\right)$$

Inverting the $2 \times 2$ matrix on the left and applying the formula recursively yields the final solution.
  11. On the 1D function $f(x)=\frac{\lambda}{2}x^{2}$, the objective value is

$$\begin{aligned} \mathbf{E}\,f(x^{k})&=\frac{\lambda}{2}\,\mathbf{E}[(x^{k})^{2}]\\ &=\frac{\lambda}{2}\,\mathbf{E}\left(e_{2}^{T}R^{k}\left(\begin{array}{c} y^{0}\\ x^{0} \end{array}\right)+\sum_{i=1}^{k}\epsilon^{i}\,e_{2}^{T}R^{k-i}\left(\begin{array}{c} 1\\ -\alpha \end{array}\right)\right)^{2}\\ &=\frac{\lambda}{2}\left(e_{2}^{T}R^{k}\left(\begin{array}{c} y^{0}\\ x^{0} \end{array}\right)\right)^{2}+\frac{\lambda}{2}\,\mathbf{E}\left(\sum_{i=1}^{k}\epsilon^{i}\,e_{2}^{T}R^{k-i}\left(\begin{array}{c} 1\\ -\alpha \end{array}\right)\right)^{2}\\ &=\frac{\lambda}{2}\left(e_{2}^{T}R^{k}\left(\begin{array}{c} y^{0}\\ x^{0} \end{array}\right)\right)^{2}+\frac{\lambda}{2}\sum_{i=1}^{k}\mathbf{E}[(\epsilon^{i})^{2}]\,\gamma_{i}^{2},\qquad \gamma_{i}=e_{2}^{T}R^{k-i}\left(\begin{array}{c} 1\\ -\alpha \end{array}\right) \end{aligned}$$

The third equality uses the fact that $\mathbf{E}[\epsilon^{i}]=0$ (so the cross term vanishes), and the fourth uses the fact that the $\epsilon^{i}$ are uncorrelated.
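As a numerical check of the complex-root magnitude in footnote 4 (a quick sketch of mine, with arbitrary $\alpha$, $\beta$, $\lambda$ chosen inside the complex regime):

import numpy as np

alpha, beta, lam = 0.3, 0.8, 1.0        # (-alpha*lam + beta + 1)^2 - 4*beta < 0 here
R = np.array([[beta, lam],
              [-alpha * beta, 1.0 - alpha * lam]])
print(np.abs(np.linalg.eigvals(R)))     # both magnitudes equal sqrt(beta) ~ 0.894
print(np.sqrt(beta))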

References

  1. On the importance of initialization and momentum in deep learning.[PDF]
    Sutskever, I., Martens, J., Dahl, G.E. and Hinton, G.E., 2013. ICML (3), Vol 28, pp. 1139--1147.
  2. Some methods of speeding up the convergence of iteration methods[PDF]
    Polyak, B.T., 1964. USSR Computational Mathematics and Mathematical Physics, Vol 4(5), pp. 1--17. Elsevier. DOI: 10.1016/0041-5553(64)90137-5
  3. Theory of gradient methods
    Rutishauser, H., 1959. Refined iterative methods for computation of the solution and the eigenvalues of self-adjoint boundary value problems, pp. 24--49. Springer. DOI: 10.1007/978-3-0348-7224-9_2
  4. Analysis and design of optimization algorithms via integral quadratic constraints[PDF]
    Lessard, L., Recht, B. and Packard, A., 2016. SIAM Journal on Optimization, Vol 26(1), pp. 57--95. SIAM.
  5. Introductory lectures on convex optimization: A basic course
    Nesterov, Y., 2013. Vol 87. Springer Science & Business Media. DOI: 10.1007/978-1-4419-8853-9
  6. Natural gradient works efficiently in learning[link]
    Amari, S., 1998. Neural computation, Vol 10(2), pp. 251--276. MIT Press. DOI: 10.1162/089976698300017746
  7. Deep Learning, NIPS'2015 Tutorial[PDF]
    Hinton, G., Bengio, Y. and LeCun, Y., 2015.
  8. Adaptive restart for accelerated gradient schemes[PDF]
    O’Donoghue, B. and Candes, E., 2015. Foundations of computational mathematics, Vol 15(3), pp. 715--732. Springer. DOI: 10.1007/s10208-013-9150-3
  9. The Nth Power of a 2x2 Matrix.[PDF]
    Williams, K., 1992. Mathematics Magazine, Vol 65(5), pp. 336. MAA. DOI: 10.2307/2691246
  10. From Averaging to Acceleration, There is Only a Step-size.[PDF]
    Flammarion, N. and Bach, F.R., 2015. COLT, pp. 658--695.
  11. On the momentum term in gradient descent learning algorithms[PDF]
    Qian, N., 1999. Neural networks, Vol 12(1), pp. 145--151. Elsevier. DOI: 10.1016/s0893-6080(98)00116-6
  12. Understanding deep learning requires rethinking generalization[PDF]
    Zhang, C., Bengio, S., Hardt, M., Recht, B. and Vinyals, O., 2016. arXiv preprint arXiv:1611.03530.
  13. A differential equation for modeling Nesterov’s accelerated gradient method: Theory and insights[PDF]
    Su, W., Boyd, S. and Candes, E., 2014. Advances in Neural Information Processing Systems, pp. 2510--2518.
  14. The Zen of Gradient Descent[HTML]
    Hardt, M., 2013.
  15. A geometric alternative to Nesterov's accelerated gradient descent[PDF]
    Bubeck, S., Lee, Y.T. and Singh, M., 2015. arXiv preprint arXiv:1506.08187.
  16. An optimal first order method based on optimal quadratic averaging[PDF]
    Drusvyatskiy, D., Fazel, M. and Roy, S., 2016. arXiv preprint arXiv:1604.06543.
  17. Linear coupling: An ultimate unification of gradient and mirror descent[PDF]
    Allen-Zhu, Z. and Orecchia, L., 2014. arXiv preprint arXiv:1407.1537.
  18. Accelerating the cubic regularization of Newton’s method on convex problems[PDF]
    Nesterov, Y., 2008. Mathematical Programming, Vol 112(1), pp. 159--181. Springer. DOI: 10.1007/s10107-006-0089-x

Updates and Corrections

View all changes to this article since it was first published. If you see a mistake or want to suggest a change, please create an issue on GitHub.

Citations and Reuse

Diagrams and text are licensed under Creative Commons Attribution CC-BY 2.0, unless noted otherwise, with the source available on GitHub. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: “Figure from …”.

For attribution in academic contexts, please cite this work as

Goh, "Why Momentum Really Works", Distill, 2017. http://doi.org/10.23915/distill.00006

BibTeX citation

@article{goh2017why,
  author = {Goh, Gabriel},
  title = {Why Momentum Really Works},
  journal = {Distill},
  year = {2017},
  url = {http://distill.pub/2017/momentum},
  doi = {10.23915/distill.00006}
}

 

Posted by uniqueone
,

Object Localization and Detection · Artificial Inteligence
https://leonardoaraujosantos.gitbooks.io/artificial-inteligence/content/object_localization_and_detection.html
Posted by uniqueone
,

awesome-deep-learning/README.md at master · ChristosChristofidis/awesome-deep-learning · GitHub
https://github.com/ChristosChristofidis/awesome-deep-learning/blob/master/README.md
Posted by uniqueone
,
https://m.facebook.com/groups/255834461424286?view=permalink&id=448254728848924

There are now so many papers and entertaining developments around GANs that one could teach a full semester course on GANs alone. At a moment like this, JaeJun Yoo's fun series of posts is well worth reading straight through:

1. Generative Adversarial Nets (1), http://jaejunyoo.blogspot.com/2017/01/generative-adversarial-nets-1.html

2. Generative Adversarial Nets (2) http://jaejunyoo.blogspot.com/2017/01/generative-adversarial-nets-2.html

3. Domain-Adversarial Training of Neural Networks (DANN) (1) http://jaejunyoo.blogspot.com/2017/01/domain-adversarial-training-of-neural.html

4. Domain-Adversarial Training of Neural Networks (DANN) (2) http://jaejunyoo.blogspot.com/2017/01/domain-adversarial-training-of-neural-2.html

5. Domain-Adversarial Training of Neural Networks (DANN) (3) http://jaejunyoo.blogspot.com/2017/01/domain-adversarial-training-of-neural-3.html

6.  Deep Convolutional Generative Adversarial Network (DCGAN) (1) http://jaejunyoo.blogspot.com/2017/02/deep-convolutional-gan-dcgan-1.html

7.  Deep Convolutional Generative Adversarial Network (DCGAN) http://jaejunyoo.blogspot.com/2017/02/deep-convolutional-gan-dcgan-2.html

8. Unrolled Generative Adversarial Networks (1) http://jaejunyoo.blogspot.com/2017/02/unrolled-generative-adversarial-network-1.html

9. Unrolled Generative Adversarial Networks (2) http://jaejunyoo.blogspot.com/2017/02/unrolled-generative-adversarial-network-2.html

10. InfoGAN (1) http://jaejunyoo.blogspot.com/2017/03/infogan-1.html

11. InfoGAN (2)
 http://jaejunyoo.blogspot.com/2017/03/infogan-2.html

12. LSGAN (1) http://jaejunyoo.blogspot.com/2017/03/lsgan-1.html

"초짜 대학원생의 입장에서 이해하는" 시리즈인데, 제생각에는 "최고수 대학원생의 입장에서 이해하는" 으로 바꾸어야 할듯합니다.

"내 멋대로 정리 및 결론" 코너도 매우 재미있고 유쾌하게 논문을 정리해줍니다.
Posted by uniqueone
,
Hello, we are the Data Science & Business Analytics Lab in the Department of Industrial and Management Engineering at Korea University.

First, thank you for all the great information we regularly find on AI Korea.

Since last winter, as one form of research seminar, our lab members have each studied publicly available AI lecture videos in advance; one member then presents on behalf of the group, after which everyone discusses the material together.

As our first course we completed the 2016 offering of Stanford University's cs231n (Convolutional Neural Networks for Visual Recognition), which many of you already know well, and we are currently working through cs224d (Deep Learning for Natural Language Processing).

Many experts have already posted good slides for these lectures; in addition, for each course module we are releasing the slides made by our lab's presenter and the seminar video (strictly speaking, slides plus recorded audio) via GitHub and a YouTube channel.

If you are new to AI, or find it daunting to follow the original lectures in English, feel free to use these as a light preview; if you are an expert and spot any errors or room for improvement in the presentations, please don't hesitate to email dsba.koreauniv@gmail.com.

Slides and supplementary code on GitHub:
https://github.com/dsba-koreauniv

YouTube channel with the presentation videos:
https://www.youtube.com/channel/UCPq01cgCcEwhXl7BvcwIQyg

Thank you.


Posted by uniqueone
,

10 minutes Practical TensorFlow Tutorial for quick learners | CV-Tricks.com
http://cv-tricks.com/artificial-intelligence/deep-learning/deep-learning-frameworks/tensorflow-tutorial/
Posted by uniqueone
,

Unpaired Image-to-Image Translation using Cycle-Consistent Adversarial Networks
https://junyanz.github.io/CycleGAN/
Posted by uniqueone
,

Classifying White Blood Cells With Deep Learning (Code and data included!)
https://blog.athelas.com/classifying-white-blood-cells-with-convolutional-neural-networks-2ca6da239331
Posted by uniqueone
,

How to Install TensorFlow | NVIDIA
http://www.nvidia.com/object/gpu-accelerated-applications-tensorflow-installation.html
Posted by uniqueone
,

deep-photo-styletransfer/README.md at master · luanfujun/deep-photo-styletransfer · GitHub
https://github.com/luanfujun/deep-photo-styletransfer/blob/master/README.md
Posted by uniqueone
,

TensorFlow RNN Tutorial - Silicon Valley Data Science
https://svds.com/tensorflow-rnn-tutorial/
Posted by uniqueone
,

GitHub - jaewoosong/torch-tutorial-korean: Torch7 tutorial materials translated into Korean.
https://github.com/jaewoosong/torch-tutorial-korean/
Posted by uniqueone
,

Five video classification methods implemented in Keras and TensorFlow
https://hackernoon.com/five-video-classification-methods-implemented-in-keras-and-tensorflow-99cad29cc0b5#.4r0ggl6t9
Posted by uniqueone
,
https://github.com/pkmital/tensorflow_tutorials

 

UPDATE (July 12, 2016)

New free MOOC course covering all of this material in much more depth, as well as much more including combined variational autoencoders + generative adversarial networks, visualizing gradients, deep dream, style net, and recurrent networks: https://www.kadenze.com/courses/creative-applications-of-deep-learning-with-tensorflow-i/info

TensorFlow Tutorials

You can find python source code under the python directory, and associated notebooks under notebooks.

Source code:
1. basics.py: Setup with TensorFlow and graph computation.
2. linear_regression.py: Performing regression with a single factor and bias.
3. polynomial_regression.py: Performing regression using polynomial factors.
4. logistic_regression.py: Performing logistic regression using a single layer neural network.
5. basic_convnet.py: Building a deep convolutional neural network.
6. modern_convnet.py: Building a deep convolutional neural network with batch normalization and leaky rectifiers.
7. autoencoder.py: Building a deep autoencoder with tied weights.
8. denoising_autoencoder.py: Building a deep denoising autoencoder which corrupts the input.
9. convolutional_autoencoder.py: Building a deep convolutional autoencoder.
10. residual_network.py: Building a deep residual network.
11. variational_autoencoder.py: Building an autoencoder with a variational encoding.
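For a flavor of what the first tutorial in the list covers, here is a minimal graph-computation snippet in the spirit of basics.py (written for this post rather than taken from the repository, using the TensorFlow 1.x API these tutorials target):

import tensorflow as tf

# build a small computation graph; nothing runs until a session executes it
x = tf.placeholder(tf.float32, name='x')
y = x * x + 2.0 * x + 1.0

with tf.Session() as sess:
    print(sess.run(y, feed_dict={x: 3.0}))   # 16.0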

Installation Guides

For Ubuntu users using python3.4+ with CUDA 7.5 and cuDNN 7.0, you can find compiled wheels under the wheels directory. Use pip3 install tensorflow-0.8.0rc0-py3-none-any.whl to install, and be sure to add export LD_LIBRARY_PATH="$LD_LIBRARY_PATH:/usr/local/cuda/lib64" to your .bashrc. Note, this still requires you to install CUDA 7.5 and cuDNN 7.0 under /usr/local/cuda.

Resources

Author

Parag K. Mital, Jan. 2016.

http://pkmital.com

License

See LICENSE.md

 

Posted by uniqueone
,
Posted by uniqueone
,
https://github.com/aymericdamien/TensorFlow-Examples

 

TensorFlow Examples

TensorFlow tutorial with implementations of popular machine learning algorithms. This tutorial was designed for diving easily into TensorFlow through examples.

It is suitable for beginners who want to find clear and concise examples about TensorFlow. For readability, the tutorial includes both notebook and code with explanations.

Note: If you are using an older TensorFlow version (before 0.12), please have a look here

Tutorial index

0 - Prerequisite

  • Introduction to Machine Learning (notebook)
  • Introduction to MNIST Dataset (notebook)

1 - Introduction

2 - Basic Models

3 - Neural Networks

4 - Utilities

  • Save and Restore a model (notebook) (code)
  • Tensorboard - Graph and loss visualization (notebook) (code)
  • Tensorboard - Advanced visualization (code)

5 - Multi GPU

Dataset

Some examples require the MNIST dataset for training and testing. Don't worry, this dataset will automatically be downloaded when running examples (with input_data.py). MNIST is a database of handwritten digits; for a quick description of that dataset, you can check this notebook.

Official Website: http://yann.lecun.com/exdb/mnist/
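Loading the dataset through the bundled helper looks roughly like this (a sketch based on the TF 1.x tutorial helpers these examples rely on; the download directory is an arbitrary choice):

from tensorflow.examples.tutorials.mnist import input_data

# downloads MNIST on first use, then reads it from the given directory
mnist = input_data.read_data_sets("/tmp/data/", one_hot=True)
batch_xs, batch_ys = mnist.train.next_batch(100)
print(batch_xs.shape, batch_ys.shape)   # (100, 784) (100, 10)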

More Examples

The following examples come from TFLearn, a library that provides a simplified interface for TensorFlow. You can have a look; there are many examples and pre-built operations and layers.

Tutorials

  • TFLearn Quickstart. Learn the basics of TFLearn through a concrete machine learning task. Build and train a deep neural network classifier.

Basics

Computer Vision

Natural Language Processing

Reinforcement Learning

Others

Notebooks

Extending TensorFlow

  • Layers. Use TFLearn layers along with TensorFlow.
  • Trainer. Use TFLearn trainer class to train any TensorFlow graph.
  • Built-in Ops. Use TFLearn built-in operations along with TensorFlow.
  • Summaries. Use TFLearn summarizers along with TensorFlow.
  • Variables. Use TFLearn variables along with TensorFlow.

Dependencies

tensorflow 1.0alpha
numpy
matplotlib
cuda
tflearn (if using tflearn examples)

For more details about TensorFlow installation, you can check TensorFlow Installation Guide

 

 

Posted by uniqueone
,
https://blog.acolyer.org/


Posted by uniqueone
,

Faster R-CNN
https://curt-park.github.io/2017-03-17/faster-rcnn/
Posted by uniqueone
,

http://yeramee.tistory.com/1

Posted by uniqueone
,

Ubuntu 16.04LTS + TensorFlow + Cuda + Cudnn + Pycharm + Anaconda

http://yeramee.tistory.com/3


Posted by uniqueone
,

https://deeptensorflow.github.io/2017/02/21/how-to-build-tensorflow-with-pip/



Why build with bazel yourself?

According to reports in the TensorFlow community, building from source gives a modest speed improvement. I plan to run my own experiments and post a speed comparison later. And if you modify the internal source code before building, you can create "your own TensorFlow".

Setting up the environment before building

You need pip (the Python package manager) and the Java JDK installed. The installation steps differ by OS but are very simple, and most readers will already have both, so I will skip them.

Downloading bazel

You can download it from https://bazel.build/. Downloading with curl works as follows:

export BAZELRC=/home/<yourid>/.bazelrc
export BAZEL_VERSION=0.4.2
mkdir /home/<yourid>/bazel
cd /home/<yourid>/bazel
curl -fSsL -O https://github.com/bazelbuild/bazel/releases/download/$BAZEL_VERSION/bazel-$BAZEL_VERSION-installer-linux-x86_64.sh
curl -fSsL -o /home/<yourid>/bazel/LICENSE.txt https://raw.githubusercontent.com/bazelbuild/bazel/master/LICENSE.txt
chmod +x bazel-*.sh
sudo ./bazel-$BAZEL_VERSION-installer-linux-x86_64.sh
cd /home/<yourid>/
rm -f /home/<yourid>/bazel/bazel-$BAZEL_VERSION-installer-linux-x86_64.sh

Download TensorFlow from GitHub

You need the source in order to build, of course. You will also need the wheel, six, and numpy packages.

git clone https://github.com/tensorflow/tensorflow
cd tensorflow
git checkout r1.0 # check out whichever version you want to build
sudo apt-get install python3-numpy python3-dev python3-pip python3-wheel # Ubuntu users only
brew install python3 # macOS users only
sudo pip3 install six numpy wheel # macOS users only

Before building, you can modify the source code at this point to add your own code to TensorFlow. The official documentation covers this as well: https://www.tensorflow.org/extend/adding_an_op. I will cover this topic in detail in the next post; a small taste follows below.
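As a small taste: once a custom op, say the ZeroOut example from the guide, has been compiled into a shared library, loading it from Python looks roughly like this (a sketch following the official guide; zero_out.so and the op itself come from that guide's example, not from this build):

import tensorflow as tf

# load the compiled custom op library (the path is illustrative)
zero_out_module = tf.load_op_library('./zero_out.so')

with tf.Session() as sess:
    # ZeroOut keeps the first element of its input and zeroes out the rest
    print(sess.run(zero_out_module.zero_out([1, 2, 3, 4])))   # [1 0 0 0]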

TensorFlow provides an easy way to configure the build.

Running ./configure in the tensorflow directory walks you through the build settings.

$ cd tensorflow # cd to the top-level directory created
$ ./configure
Please specify the location of python. [Default is /usr/bin/python]: /usr/bin/python2.7
Please specify optimization flags to use during compilation when bazel option "--config=opt" is specified [Default is -march=native]:
Do you wish to use jemalloc as the malloc implementation? [Y/n]
jemalloc enabled
Do you wish to build TensorFlow with Google Cloud Platform support? [y/N]
No Google Cloud Platform support will be enabled for TensorFlow
Do you wish to build TensorFlow with Hadoop File System support? [y/N]
No Hadoop File System support will be enabled for TensorFlow
Do you wish to build TensorFlow with the XLA just-in-time compiler (experimental)? [y/N]
No XLA JIT support will be enabled for TensorFlow
Found possible Python library paths:
/usr/local/lib/python2.7/dist-packages
/usr/lib/python2.7/dist-packages
Please input the desired Python library path to use. Default is [/usr/local/lib/python2.7/dist-packages]
Using python library path: /usr/local/lib/python2.7/dist-packages
Do you wish to build TensorFlow with OpenCL support? [y/N] N
No OpenCL support will be enabled for TensorFlow
Do you wish to build TensorFlow with CUDA support? [y/N] Y
CUDA support will be enabled for TensorFlow
Please specify which gcc should be used by nvcc as the host compiler. [Default is /usr/bin/gcc]:
Please specify the Cuda SDK version you want to use, e.g. 7.0. [Leave empty to use system default]: 8.0
Please specify the location where CUDA 8.0 toolkit is installed. Refer to README.md for more details. [Default is /usr/local/cuda]:
Please specify the cuDNN version you want to use. [Leave empty to use system default]: 5
Please specify the location where cuDNN 5 library is installed. Refer to README.md for more details. [Default is /usr/local/cuda]:
Please specify a list of comma-separated Cuda compute capabilities you want to build with.
You can find the compute capability of your device at: https://developer.nvidia.com/cuda-gpus.
Please note that each additional compute capability significantly increases your build time and binary size.
[Default is: "3.5,5.2"]: 3.0
Setting up Cuda include
Setting up Cuda lib
Setting up Cuda bin
Setting up Cuda nvvm
Setting up CUPTI include
Setting up CUPTI lib64
Configuration finished

For the prompts along the way, such as the Python version or CUDA support, choose the settings that match your environment. The questions are laid out clearly, so it should not be difficult.

Building with bazel

Once configuration is done, it is time to build:

bazel build --config=opt //tensorflow/tools/pip_package:build_pip_package # CPU version
bazel build --config=opt --config=cuda //tensorflow/tools/pip_package:build_pip_package # GPU version

Let's package it as a pip wheel so it is easy to manage:

bazel-bin/tensorflow/tools/pip_package/build_pip_package /tmp/tensorflow_pkg

Now install it with pip:

sudo pip install /tmp/tensorflow_pkg/tensorflow-1.0.0-py2-none-any.whl # Python 2
sudo pip3 install /tmp/tensorflow_pkg/tensorflow-1.0.0-cp36-cp36m-macosx_10_12_x86_64.whl # Python 3

The exact wheel filename depends on your configuration, so type sudo pip (pip3 for Python 3) install /tmp/tensorflow_pkg/ and press the Tab key to autocomplete the filename, then press Enter.

Posted by uniqueone
,