593 posts under 'Deep Learning'

  1. 2017.03.22 GAN and Unsupervised Learning
  2. 2017.03.21 [Caffe] Installing Caffe on Windows (as of 161102)
  3. 2017.03.21 Windows 10 / Caffe / CUDA / cuDNN / Python installation
  4. 2017.03.21 Installing Keras, Theano and Dependencies on Windows 10
  5. 2017.03.21 TensorFlow: training on my own image
  6. 2017.03.19 What is the best probabilistic graphical model toolkit for MATLAB?
  7. 2017.03.18 Installing TensorFlow on Windows
  8. 2017.03.18 Installing TensorFlow
  9. 2017.03.18 Windows TensorFlow installation (Tensorflow installation in window)
  10. 2017.03.18 Installing TensorFlow on Windows using Anaconda
  11. 2017.03.18 Installing TensorFlow 1.0.0 on Windows 10 64bit
  12. 2017.03.18 How to run deep learning on smartphones - Squeezing Deep Learning into Mobile Phones - A Practitioner's guide
  13. 2017.03.18 All AI Resources at one place – AI medicines
  14. 2017.03.18 Simple Reinforcement Learning with Tensorflow Part 0: Q-Learning with Tables and Neural Networks
  15. 2017.03.17 Convolutional Neural Networks (CNNs): An Illustrated Explanation
  16. 2017.03.16 linear algebra start video lectures
  17. 2017.03.13 Winning Tips on Machine Learning Competitions by Kazanova, Current Kaggle #3 | HackerEarth Blog
  18. 2017.03.11 Directory of tutorials and open-source code repositories for working with Keras, the Python deep learning library
  19. 2017.03.09 [NIPS 2016 tutorial - Summary] Nuts and bolts of building AI applications using Deep Learning 1
  20. 2017.03.09 Linear algebra cheat sheet for deep learning
  21. 2017.03.09 Running TensorFlow on Android
  22. 2017.03.09 [Machine Learning] 19. Machine learning methods (part 14) - AutoEncoder (6) : Naver Blog
  23. 2017.03.09 Neural networks and deep learning
  24. 2017.03.08 5 algorithms to train a neural network
  25. 2017.03.01 ssl verification error ssl certificate_verify_failed
  26. 2017.03.01 Installing Python Keras+Tensorflow on Windows 7 64bit
  27. 2017.03.01 TensorFlow-v1.0.0 + Keras installation (Windows/Linux/macOS)
  28. 2017.02.28 Keras Tutorial: The Ultimate Beginner's Guide to Deep Learning in Python
  29. 2017.02.28 Good sites for studying with Keras / Theano
  30. 2017.02.28 Keras resources

T-Robotics: GAN and Unsupervised Learning
http://t-robotics.blogspot.kr/2017/03/gan-unsupervised-learning.html?m=1


 
T-Robotics
Wednesday, March 8, 2017
GAN and Unsupervised Learning
This post is a backup of an article contributed to TechM.

At NIPS, one of the world's top machine learning conferences, Andrew Ng, head of the AI research lab at China's Baidu, diagnosed recent machine learning trends as follows:
"The deep learning techniques that have so far made money for giant companies like Google, such as image recognition and speech recognition, have been supervised learning techniques like convolutional neural networks (CNN) and recurrent neural networks (RNN). But as many of this year's papers show, the technology that will lead the future of deep learning will be unsupervised learning, such as generative adversarial networks (GAN)."
Ng is not alone in this view; many deep learning researchers have been making the same point. Supervised learning performs well when backed by large amounts of data, but it is still far from the 'true understanding of experience' that future AI will have to handle.
"What I cannot create, I do not understand."      (Physicist Richard Feynman)
The unsupervised learning that many AI experts expect to lead the future is closely tied to generative models. Conventional CNNs and RNNs can classify images and recognize speech, but they could not create images or speech.
With generative models, however, machines will be able to produce images and speech themselves. The step from merely recognizing things to actually being able to draw them: at the center of that progress is GAN, first published in 2014.
Ian Goodfellow, the inventor of GAN and a senior researcher at OpenAI. He is also a favorite student of Professor Yoshua Bengio of the University of Montreal in Canada.

Supervised Learning and Generative Models

First, let's look at the difference between supervised learning and unsupervised learning.
Supervised learning, which has been the mainstream of machine learning so far, learns pairs of data and labels intensively. If you think of the computer as a child, you show it many objects and keep teaching it pairs of object X and name Y: 'this is a dog', 'this is a cat', 'this is a table'.
Once the function over many such (X, Y) pairs has been learned, the machine can produce a reasonable prediction Y even for a completely unseen object X. The face recognition, object recognition, and speech recognition used by Google, Facebook, and others are all technologies based on this supervised learning approach.

This approach has a limitation, however: learning is possible only if someone provides the 'answer' (label) for each object. When you upload a photo of people to Facebook, it encourages users to tag who is in the picture, and this labeling ultimately becomes a major driving force behind recognizing specific people. Put the other way around, without such labels no knowledge can be obtained and no learning can take place; that is the limitation of today's supervised learning.

Now think about how a child learns about objects. Children do not need someone to tell them what every object is. Give a child toy blocks of various shapes to play with, and the child figures out the differences between squares, triangles, and circles on its own.
To achieve true artificial intelligence, a system must be able to grasp the characteristics of objects on its own even when nobody provides the answers. Learning knowledge from the data itself, without labels, is called unsupervised learning.

The reason unsupervised learning is drawing attention as a technology of the future is that the world contains vastly more unlabeled data than labeled data. Take photos: we take countless pictures, but we only rarely label whose face is in a picture or what the objects are.
Supervised deep learning has been able to advance this far largely thanks to well-labeled public datasets such as ImageNet and CIFAR. But bringing the benefits of deep learning to the real world, that is, using the enormous amount of unlabeled data out there, is impossible without progress in unsupervised learning.

The technology that opened the door to that future is GAN (Generative Adversarial Network). Unsupervised methods such as the restricted Boltzmann machine (RBM) and the autoencoder existed before, but they were hard to use for much beyond pre-processing for supervised learning.
GAN, an 'unsupervised learning that works like supervised learning', showed remarkable performance and a versatility comparable to supervised learning, putting it at the center of machine learning research. What exactly is this technology, which Facebook's Yann LeCun called "the most interesting idea in the last 10 years of machine learning research"?

Learning by Competing with Counterfeits

GAN applies the training approach of supervised learning, which had been producing good results, to unsupervised learning. How can the supervised approach be borrowed when there is no information about the 'answer'?

The first goal is to produce samples that are as close to real as possible. (Recall Feynman's remark above that to understand something is to be able to create it.) Earlier unsupervised methods could express a rough function over the real data and then generalize it somewhat to generate new data.
For example, given faces at age 10 and at age 20, one could model the transition appropriately and generate a face at age 15. But this approach requires many assumptions about the probability model, and the diverse data get blended toward an average, so the result tends to be 'blurry' images.

GAN instead reframed model training as a game in which a 'generator' and a 'discriminator' compete with each other. By analogy, a counterfeiter (the generator) tries to produce bills that look as real as possible, while an appraiser (the discriminator) tries to tell real bills from fakes, and through this competition both keep improving (rivals push each other forward!).

GAN produces samples ever closer to the real thing through the competitive co-evolution of the generator (counterfeit maker) and the discriminator (authenticity appraiser).
Here the discriminator uses the same 'recognition' technology as existing supervised learning. To fool the discriminator, the generator must produce forgeries of very high quality, and in doing so it learns to become an excellent generator itself.
Turning 'learning' into a 'competitive game' is a very clever idea: GAN sidesteps the hard problems of traditional unsupervised learning while still making use of supervised learning techniques.
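
To make the generator/discriminator game above concrete, here is a minimal toy sketch of a GAN training loop in Keras. This is purely illustrative and not the code behind any result described in this article; the network sizes, the random stand-in "real" data, and all variable names are assumptions made for the example.

import numpy as np
from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import Adam

latent_dim = 100          # size of the random noise vector fed to the generator
img_dim = 28 * 28         # toy "image" size (flattened)

# Generator: maps random noise z to a fake sample.
generator = Sequential([
    Dense(256, activation='relu', input_dim=latent_dim),
    Dense(img_dim, activation='sigmoid'),
])

# Discriminator: outputs the probability that a sample is real.
discriminator = Sequential([
    Dense(256, activation='relu', input_dim=img_dim),
    Dense(1, activation='sigmoid'),
])
discriminator.compile(optimizer=Adam(0.0002), loss='binary_crossentropy')

# Combined model: the generator is trained to fool a frozen discriminator.
discriminator.trainable = False
gan = Sequential([generator, discriminator])
gan.compile(optimizer=Adam(0.0002), loss='binary_crossentropy')

real_images = np.random.rand(1000, img_dim)   # random stand-in for a real dataset
batch_size = 64
for step in range(1000):
    # 1) Train the discriminator on half real, half generated samples.
    z = np.random.normal(size=(batch_size, latent_dim))
    fake = generator.predict(z)
    real = real_images[np.random.randint(0, len(real_images), batch_size)]
    d_loss_real = discriminator.train_on_batch(real, np.ones((batch_size, 1)))
    d_loss_fake = discriminator.train_on_batch(fake, np.zeros((batch_size, 1)))
    # 2) Train the generator: it wants the discriminator to answer "real" (1).
    z = np.random.normal(size=(batch_size, latent_dim))
    g_loss = gan.train_on_batch(z, np.ones((batch_size, 1)))
    if step % 200 == 0:
        print(step, d_loss_real, d_loss_fake, g_loss)

The design choice is exactly the one described above: the discriminator is an ordinary supervised classifier, and the generator improves only by learning to fool it.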

As a result, GAN can generate much more vivid images that are not blurred like those of earlier generative models. It can also generate the sample that best matches conditions given by the user: a rough sketch by the user can be turned into a realistic picture, enabling a form of image editing. It has also been used to restore blurry images into sharper ones by training them to 'look more real', and it makes image-to-image translation (image to image transfer) possible, such as converting satellite photos into map images.

Image restoration using GAN (top) and image-to-image translation using GAN (bottom).

Bedroom images generated with DC-GAN. They all look like real photographs, but some of the bedrooms do not quite make logical sense.


GAN and the Future of AI

With the advent of GAN, artificial intelligence is taking a step forward from 'passive recognition' toward 'active behavior'. Barely two years have passed since Ian Goodfellow published GAN, yet it has already proven useful in applications ranging from image generation to editing, translation, and restoration. In 2017, GAN is expected to be applied beyond image data to speech and natural language; that could make speech generation and editing, as well as voice conversion and restoration, possible too.
Theoretically, reframing the problem from 'learning' to a 'competitive game' is a very interesting approach, and it will be exciting to see in what directions machine learning techniques, which have leaned heavily on optimization-based learning, evolve by exploiting such competitive games.

GAN's biggest weakness is that training the generator and the discriminator in balance is much harder than conventional optimization. If the skill gap between two players is large, neither can be expected to improve, and GAN training often suffers from exactly this kind of imbalance, which makes it difficult.
Various tricks to avoid this are being proposed, and much theoretical work on the efficient convergence of adversarial games is expected to follow. Also, GAN has so far been used mostly on static image data; extending it to speech or natural language will require modifying the algorithm to handle sequence data.

Through GAN, artificial intelligence has taken its first step from a passive entity toward an active one. For now it only produces interesting transformations of data, but no one knows how much longer it will take for such systems to accumulate new knowledge and learn on their own (self-learning).
To build an active AI that accumulates experience by interacting with its environment on its own, rather than having humans supply the data, we need agents that hunger for knowledge. This will likely require combining GAN with reinforcement learning, which is currently being applied widely to games.

[facebook] http://facebook.com/trobotics
[rss] http://t-robotics.blogspot.kr/feeds/posts/default




Terry TaeWoong Um, March 8, 2017
Posted by uniqueone
,

[Caffe] Installing Caffe on Windows (as of 161102)
http://jangjy.tistory.com/m/249
Posted by uniqueone
,

Windows 10 / Caffe / CUDA / cuDNN / Python installation
http://hanmaruj.tistory.com/m/15
Posted by uniqueone
,

Installing Keras, Theano and Dependencies on Windows 10 – Ankivil
http://ankivil.com/installing-keras-theano-and-dependencies-on-windows-10/
Posted by uniqueone
,

TensorFlow: training on my own image - Stack Overflow
http://stackoverflow.com/questions/37340129/tensorflow-training-on-my-own-image
Posted by uniqueone
,

What is the best probabilistic graphical model toolkit for MATLAB? - Cross Validated
http://stats.stackexchange.com/questions/50871/what-is-the-best-probabilistic-graphical-model-toolkit-for-matlab
Posted by uniqueone
,

http://blog.daum.net/buillee/1513

 

1. System type: 64-bit operating system (Windows).

2. Download Anaconda3 4.2.0, which provides Python 3.5.

(Download Anaconda3-4.2.0-Windows-x86_64.exe from https://repo.continuum.io/archive/index.html)

3. Run the Anaconda3 4.2.0 installer to install it.

4. Control Panel --> System --> Advanced --> Environment Variables --> User variables --> add the Anaconda3 installation paths to Path.

(In my case they are as follows:

C:\Users\buil\Anaconda3;C:\Users\buil\Anaconda3\Scripts;C:\Users\buil\Anaconda3\Library\bin;)

5. Control Panel --> System --> Advanced --> Environment Variables --> System variables --> add the Anaconda3 Scripts path to Path as follows:

(C:\Users\buil\Anaconda3\Scripts;)

6. Restart the computer.

7. Run --> cmd

8. In the command window, enter conda create -n <env name> python=3.5

(In my case: conda create -n tf3 python=3.5)

When a prompt appears, answer y.

9. In the command window, enter activate tf3 (the environment name chosen in step 8).

10. In the command window, enter pip install tensorflow

If no errors occur, it is installed correctly.

11. In the command window, enter jupyter notebook

A browser will open, and you can practice TensorFlow by typing commands there.
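
To confirm the installation, a short check such as the following can be run in a notebook cell or in the Python interpreter of the activated environment (a minimal sketch; the session-based API shown matches the TensorFlow 1.x version installed above):

import tensorflow as tf

print(tf.__version__)                     # e.g. 1.0.0

hello = tf.constant('Hello, TensorFlow!')
a = tf.constant(2)
b = tf.constant(3)

with tf.Session() as sess:                # TF 1.x graph/session API
    print(sess.run(hello))
    print(sess.run(a + b))                # should print 5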

 

 

Posted by uniqueone
,

1. Installing TensorFlow
http://aileen93.tistory.com/58
Posted by uniqueone
,

Windows TensorFlow installation (Tensorflow installation in window)
http://lifestudying.tistory.com/9
Posted by uniqueone
,

Installing TensorFlow on Windows using Anaconda : Naver Blog
http://blog.naver.com/windpriest/220937771369
Posted by uniqueone
,

Installing TensorFlow 1.0.0 on Windows 10 64bit
http://blog.ggaman.com/1000
Posted by uniqueone
,



Squeezing Deep Learning into Mobile Phones - A Practitioner's guide
Posted by Vincent Granville on March 17, 2017 at 7:30am
This is a slideshare presentation by Anirudh Koul. Anirudh is a deep learning data scientist at Microsoft AI & Research. He earned a master's in computational data science at Carnegie Mellon University and a graduate certificate in data mining from Stanford University. He currently lives in the Bay Area. Anirudh is leading projects such as Seeing AI (for the blind community), among others.


Yesterday, I gave a talk at the Strata+Hadoop World Conference on “Squeezing Deep Learning into Mobile Phones - A Practitioner's guide”. Luckily, it seems to have organically gone viral on Twitter, with 3000 views in 12 hours. I thought your readers might find it interesting too, hence sharing it with you.
Tweet: https://twitter.com/petewarden/status/842169469401104384
Slideshare: https://www.slideshare.net/anirudhkoul/squeezing-deep-learning-into...
My twitter id : @anirudhkoul
Below is a transcript of the presentation:
1. Squeezing Deep Learning into mobile phones - A Practitioners guide Anirudh Koul
2. Anirudh Koul , @anirudhkoul , http://koul.ai Project Lead, Seeing AI Applied Researcher, Microsoft AI & Research Akoul at Microsoft dot com Currently working on applying artificial intelligence for productivity, augmented reality and accessibility Along with Eugene Seleznev, Saqib Shaikh, Meher Kasam
3. Why Deep Learning On Mobile? Latency Privacy
4. Mobile Deep Learning Recipe Mobile Inference Engine + Pretrained Model = DL App (Efficient) (Efficient)
5. Building a DL App in _ time
6. Building a DL App in 1 hour
7. Use Cloud APIs Microsoft Cognitive Services Clarifai Google Cloud Vision IBM Watson Services Amazon Rekognition
8. Microsoft Cognitive Services Models won the 2015 ImageNet Large Scale Visual Recognition Challenge Vision, Face, Emotion, Video and 21 other topics
9. Building a DL App in 1 day
10. http://deeplearningkit.org/2015/12/28/deeplearningkit-deep-learning... Energy to train Convolutional Neural Network Energy to use Convolutional Neural Network
11. Base PreTrained Model ImageNet – 1000 Object Categorizer Inception Resnet
12. Running pre-trained models on mobile MXNet Tensorflow CNNDroid DeepLearningKit Caffe Torch
13. MXNET Amalgamation : Pack all the code in a single source file Pro: • Cross Platform (iOS, Android), Easy porting • Usable in any programming language Con: • CPU only, Slow https://github.com/Leliana/WhatsThis
14. Tensorflow Easy pipeline to bring Tensorflow models to mobile Great documentation Optimizations to bring model to mobile Upcoming : XLA (Accelerated Linear Algebra) compiler to optimize for hardware
15. CNNdroid GPU accelerated CNNs for Android Supports Caffe, Torch and Theano models ~30-40x Speedup using mobile GPU vs CPU (AlexNet) Internally, CNNdroid expresses data parallelism for different layers, instead of leaving to the GPU’s hardware scheduler
16. DeepLearningKit Platform : iOS, OS X and tvOS (Apple TV) DNN Type : CNNs models trained in Caffe Runs on mobile GPU, uses Metal Pro : Fast, directly ingests Caffe models Con : Unmaintained
17. Caffe Caffe for Android https://github.com/sh1r0/caffe-android-lib Sample app https://github.com/sh1r0/caffe-android-demo Caffe for iOS : https://github.com/aleph7/caffe Sample app https://github.com/noradaiko/caffe-ios-sample Pro : Usually couple of lines to port a pretrained model to mobile CPU Con : Unmaintained
18. Running pre-trained models on mobile (Mobile Library / Platform / GPU / DNN Architectures Supported / Trained Models Supported):
Tensorflow / iOS, Android / Yes / CNN, RNN, LSTM, etc. / Tensorflow
CNNDroid / Android / Yes / CNN / Caffe, Torch, Theano
DeepLearningKit / iOS / Yes / CNN / Caffe
MXNet / iOS, Android / No / CNN, RNN, LSTM, etc. / MXNet
Caffe / iOS, Android / No / CNN / Caffe
Torch / iOS, Android / No / CNN, RNN, LSTM, etc. / Torch
19. Building a DL App in 1 week
20. Learn Playing an Accordion 3 months
21. Learn Playing an Accordion 3 months Knows Piano Fine Tune Skills 1 week
22. I got a dataset, Now What? Step 1 : Find a pre-trained model Step 2 : Fine tune a pre-trained model Step 3 : Run using existing frameworks “Don’t Be A Hero” - Andrej Karpathy
23. How to find pretrained models for my task? Search “Model Zoo” Microsoft Cognitive Toolkit (previously called CNTK) – 50 Models Caffe Model Zoo Keras Tensorflow MXNet
24. AlexNet, 2012 (simplified) [Krizhevsky, Sutskever,Hinton’12] Honglak Lee, Roger Grosse, Rajesh Ranganath, and Andrew Ng, “Unsupervised Learning of Hierarchical Representations with Convolutional Deep Belief Networks”, 11 n-dimension Feature representation
25. Deciding how to fine tune (Size of New Dataset / Similarity to Original Dataset / What to do):
Large / High: Fine tune.
Small / High: Don't fine tune, it will overfit. Train a linear classifier on CNN features.
Small / Low: Train a classifier from activations in lower layers. Higher layers are specific to the older dataset.
Large / Low: Train the CNN from scratch.
http://blog.revolutionanalytics.com/2016/08/deep-learning-part-2.html
26. Deciding when to fine tune (same table as slide 25).
27. Deciding when to fine tune (same table as slide 25).
28. Deciding when to fine tune (same table as slide 25).
29. Building a DL Website in 1 week
30. Less Data + Smaller Networks = Faster browser training
31. Several JavaScript Libraries Run large CNNs • Keras-JS • MXNetJS • CaffeJS Train and Run CNNs • ConvNetJS Train and Run LSTMs • Brain.js • Synaptic.js Train and Run NNs • Mind.js • DN2A
32. ConvNetJS Both Train and Test NNs in browser Train CNNs in browser
33. Keras.js Run Keras models in browser, with GPU support.
34. Brain.JS Train and run NNs in browser Supports Feedforward, RNN, LSTM, GRU No CNNs Demo : http://brainjs.com/ Trained NN to recognize color contrast
35. MXNetJS On Firefox and Microsoft Edge, performance is 8x faster than Chrome. Optimization difference because of ASM.js.
36. Building a DL App in 1 month (and get featured in Apple App store)
37. Response Time Limits – Powers of 10 0.1 second : Reacting instantly 1.0 seconds : User’s flow of thought 10 seconds : Keeping the user’s attention [Miller 1968; Card et al. 1991; Jakob Nielsen 1993]:
38. Apple frameworks for Deep Learning Inference BNNS – Basic Neural Network Subroutine MPS – Metal Performance Shaders
39. Metal Performance Shaders (MPS) Fast, Provides GPU acceleration for inference phase Faster app load times than Tensorflow (Jan 2017) About 1/3rd the run time memory of Tensorflow on Inception-V3 (Jan 2017) ~130 ms on iPhone 7S Plus to run Inception-V3 Cons: • Limited documentation. • No easy way to programmatically port models. • No batch normalization. Solution : Join Conv and BatchNorm weights
40. Putting out more frames than an art gallery
41. Basic Neural Network Subroutines (BNNS) Runs on CPU BNNS is faster for smaller networks than MPS but slower for bigger networks
42. BrainCore NN Framework for iOS Provides LSTMs functionality Fast, uses Metal, runs on iPhone GPU https://github.com/aleph7/braincore
43. Building a DL App in 6 months
44. What you want vs. what you can afford: $200,000 vs. $2,000. https://www.flickr.com/photos/kenjonbro/9075514760/ and http://www.newcars.com/land-rover/range-rover-sport/2016
45. Revolution of Depth: AlexNet, 8 layers (ILSVRC 2012) [architecture diagram]. Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun, "Deep Residual Learning for Image Recognition", 2015.
46. Revolution of Depth: AlexNet, 8 layers (ILSVRC 2012); VGG, 19 layers (ILSVRC 2014); GoogleNet, 22 layers (ILSVRC 2014) [architecture diagrams]. Kaiming He et al., "Deep Residual Learning for Image Recognition", 2015.
47. Revolution of Depth: AlexNet, 8 layers (ILSVRC 2012); VGG, 19 layers (ILSVRC 2014); ResNet, 152 layers (ILSVRC 2015), ultra deep [architecture diagrams]. Kaiming He et al., "Deep Residual Learning for Image Recognition", 2015.
48. Revolution of Depth: ResNet, 152 layers [architecture diagram]. Kaiming He et al., "Deep Residual Learning for Image Recognition", 2015.
49. Revolution of Depth: ImageNet classification top-5 error (%): ILSVRC'10: 28.2 (shallow); ILSVRC'11: 25.8; ILSVRC'12 AlexNet (8 layers): 16.4; ILSVRC'13: 11.7; ILSVRC'14 VGG (19 layers): 7.3; ILSVRC'14 GoogleNet (22 layers): 6.7; ILSVRC'15 ResNet (152 layers): 3.57. Kaiming He et al., "Deep Residual Learning for Image Recognition", 2015.
50. Your Budget - Smartphone Floating Point Operations Per Second (2015) http://pages.experts-exchange.com/processing-power-compared/
51. Accuracy vs Operations Per Image Inference Size is proportional to num parameters Alfredo Canziani, Adam Paszke, Eugenio Culurciello, “An Analysis of Deep Neural Network Models for Practical Applications” 2016 552 MB 240 MB What we want
52. Accuracy Per Parameter Alfredo Canziani, Adam Paszke, Eugenio Culurciello, “An Analysis of Deep Neural Network Models for Practical Applications” 2016
53. Pick your DNN Architecture for your mobile architecture Resnet Family Under 150 ms on iPhone 7 using Metal GPU Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun, "Deep Residual Learning for Image Recognition”, 2015
54. Strategies to make DNNs even more efficient Shallow networks Compressing pre-trained networks Designing compact layers Quantizing parameters Network binarization
55. Pruning Aim : Remove all connections with absolute weights below a threshold Song Han, Jeff Pool, John Tran, William J. Dally, "Learning both Weights and Connections for Efficient Neural Networks", 2015
56. Observation : Most parameters in Fully Connected Layers AlexNet 240 MB VGG-16 552 MB 96% of all parameters 90% of all parameters
57. Pruning gets quickest model compression without accuracy loss AlexNet 240 MB VGG-16 552 MB First layer which directly interacts with image is sensitive and cannot be pruned too much without hurting accuracy
58. Weight Sharing Idea : Cluster weights with similar values together, and store in a dictionary. Codebook Huffman coding HashedNets Simplest implementation: • Round all weights into 256 levels • Tensorflow export script reduces inception zip file from 87 MB to 26 MB with 1% drop in precision (A short NumPy sketch of this 256-level rounding idea appears right after this transcript.)
59. Selective training to keep networks shallow Idea : Augment data limited to how your network will be used Example : If making a selfie app, no benefit in rotating training images beyond +-45 degrees. Your phone will anyway rotate. Followed by WordLens / Google Translate Example : Add blur if analyzing mobile phone frames
60. Design consideration for custom architectures – Small Filters Three layers of 3x3 convolutions >> One layer of 7x7 convolution Replace large 5x5, 7x7 convolutions with stacks of 3x3 convolutions Replace NxN convolutions with stack of 1xN and Nx1 Fewer parameters  Less compute  More non-linearity  Better Faster Stronger Andrej Karpathy, CS-231n Notes, Lecture 11
61. SqueezeNet - AlexNet-level accuracy in 0.5 MB SqueezeNet base 4.8 MB SqueezeNet compressed 0.5 MB 80.3% top-5 Accuracy on ImageNet 0.72 GFLOPS/image Fire Block Forrest N. Iandola, Song Han et al, "SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size"
62. Reduced precision Reduce precision from 32 bits to <=16 bits or lesser Use stochastic rounding for best results In Practice: • Ristretto + Caffe • Automatic Network quantization • Finds balance between compression rate and accuracy • Apple Metal Performance Shaders automatically quantize to 16 bits • Tensorflow has 8 bit quantization support • Gemmlowp – Low precision matrix multiplication library
63. Binary weighted Networks Idea :Reduce the weights to -1,+1 Speedup : Convolution operation can be approximated by only summation and subtraction Mohammad Rastegari, Vicente Ordonez, Joseph Redmon, Ali Farhadi, “XNOR-Net: ImageNet Classification Using Binary Convolutional Neural Networks”
64. Binary weighted Networks Idea :Reduce the weights to -1,+1 Speedup : Convolution operation can be approximated by only summation and subtraction Mohammad Rastegari, Vicente Ordonez, Joseph Redmon, Ali Farhadi, “XNOR-Net: ImageNet Classification Using Binary Convolutional Neural Networks”
65. Binary weighted Networks Idea :Reduce the weights to -1,+1 Speedup : Convolution operation can be approximated by only summation and subtraction Mohammad Rastegari, Vicente Ordonez, Joseph Redmon, Ali Farhadi, “XNOR-Net: ImageNet Classification Using Binary Convolutional Neural Networks”
66. XNOR-Net Idea :Reduce both weights + inputs to -1,+1 Speedup : Convolution operation can be approximated by XNOR and Bitcount operations Mohammad Rastegari, Vicente Ordonez, Joseph Redmon, Ali Farhadi, “XNOR-Net: ImageNet Classification Using Binary Convolutional Neural Networks”
67. XNOR-Net Idea :Reduce both weights + inputs to -1,+1 Speedup : Convolution operation can be approximated by XNOR and Bitcount operations Mohammad Rastegari, Vicente Ordonez, Joseph Redmon, Ali Farhadi, “XNOR-Net: ImageNet Classification Using Binary Convolutional Neural Networks”
68. XNOR-Net Idea :Reduce both weights + inputs to -1,+1 Speedup : Convolution operation can be approximated by XNOR and Bitcount operations Mohammad Rastegari, Vicente Ordonez, Joseph Redmon, Ali Farhadi, “XNOR-Net: ImageNet Classification Using Binary Convolutional Neural Networks”
69. XNOR-Net on Mobile
70. Building a DL App and get $10 million in funding (or a PhD)
71. Minerva
72. Minerva
73. DeepX Toolkit Nicholas D. Lane et al, “DXTK : Enabling Resource-efficient Deep Learning on Mobile and Embedded Devices with the DeepX Toolkit",2016
74. EIE : Efficient Inference Engine on Compressed DNNs Song Han, Xingyu Liu, Huizi Mao, Jing Pu, Ardavan Pedram, Mark Horowitz, William Dally, "EIE: Efficient Inference Engine on Compressed Deep Neural Network", 2016 189x faster on CPU 13x faster on GPU
75. One Last Question
76. How to access the slides in 1 second Link posted here -> @anirudhkoul
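
The "round all weights into 256 levels" idea from slide 58 can be illustrated with a short NumPy sketch (purely illustrative, not code from the talk; the function and variable names are made up):

import numpy as np

def quantize_to_256_levels(weights):
    # Map float weights onto 256 evenly spaced levels between min and max,
    # so each weight can be stored in a single byte plus a shared scale/offset.
    w_min, w_max = float(weights.min()), float(weights.max())
    scale = (w_max - w_min) / 255.0
    codes = np.round((weights - w_min) / scale).astype(np.uint8)   # 1 byte per weight
    dequantized = codes.astype(np.float32) * scale + w_min         # used at inference time
    return codes, dequantized

w = np.random.randn(1000).astype(np.float32)
codes, w_hat = quantize_to_256_levels(w)
print("max absolute rounding error:", np.abs(w - w_hat).max())

Storing one byte per weight instead of a 32-bit float is what yields most of the size reduction mentioned on the slide, at a small cost in precision.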
Posted by uniqueone
,

All AI Resources at one place – AI medicines
http://aimedicines.com/2017/03/17/all-ai-resources-at-one-place/
Posted by uniqueone
,

Simple Reinforcement Learning with Tensorflow Part 0: Q-Learning with Tables and Neural Networks
https://medium.com/emergent-future/simple-reinforcement-learning-with-tensorflow-part-0-q-learning-with-tables-and-neural-networks-d195264329d0#.3i736uw1y
Posted by uniqueone
,

http://xrds.acm.org/blog/2016/06/convolutional-neural-networks-cnns-illustrated-explanation/

 

Convolutional Neural Networks (CNNs): An Illustrated Explanation

Artificial Neural Networks (ANNs) are used everyday for tackling a broad spectrum of prediction and classification problems, and for scaling up applications which would otherwise require intractable amounts of data. ML has been witnessing a “Neural Revolution”1 since the mid 2000s, as ANNs found application in tools and technologies such as search engines, automatic translation, or video classification. Though structurally diverse, Convolutional Neural Networks (CNNs) stand out for their ubiquity of use, expanding the ANN domain of applicability from feature vectors to variable-length inputs.


The aim of this article is to give a detailed description of the inner workings of CNNs, and an account of the their recent merits and trends.

Table of Contents:

  1. Background
  2. Motivation
  3. CNN Concepts
    • Input/Output Volumes
    • Features
    • Filters (Convolution Kernels)
      • Kernel Operations Detailed
    • Receptive Field
    • Zero-Padding
    • Hyperparameters
  4. The CNN Architecture
    • Convolutional Layer
    • The ReLu (Rectified Linear Unit) Layer
    • The Fully Connected Layer
  5. CNN Design Principles
  6. Conclusion
  7. References

 1The Neural Revolution is a reference to the period beginning 1982, when academic interest in the field of Neural Networks was invigorated by CalTech professor John J. Hopfield, who authored a research paper[1] that detailed the neural network architecture named after himself. The crucial breakthrough, however, occurred  in 1986, when the backpropagation algorithm was proposed as such by David Rumelhart, Geoffrey E. Hinton and R.J. Williams [2]. For a history of neural networks, please see Andrey Kurenkov’s blog [3].


Acknowledgement

I would like to thank Adrian Scoica and Pedro Lopez for their immense patience and help with writing this piece. The sincerity of efforts and guidance that they’ve provided is ineffable. I’m forever inspired.

Background

The modern Convolutional Neural Networks owe their inception to a well-known 1998 research paper[4] by Yann LeCun and Léon Bottou. In this highly instructional and detailed paper, the authors propose a neural architecture called LeNet 5 used for recognizing hand-written digits and words that established a new state of the art2 classification accuracy of 99.2% on the MNIST dataset[5].

According to the author’s accounts, CNNs are biologically-inspired models. The research investigations carried out by D. H. Hubel and T. N. Wiesel in their paper[6] proposed an explanation for the way in which mammals visually perceive the world around them using a layered architecture of neurons in the brain, and this in turn inspired engineers to attempt to develop similar pattern recognition mechanisms in computer vision.
The most popular application for CNNs in the recent times has been Image Analysis, but many researchers have also found other interesting and exciting ways to use them: from winning Go matches against human players([7], a related video [8]) to an innovative application in discovering new drugs by training over large quantities of molecular structure data of organic compounds[9].

Motivation

A first question to answer with CNNs is why are they called Convolutional in the first place.

Convolution is a mathematical concept used heavily in Digital Signal Processing when dealing with signals that take the form of a time series. In lay terms, convolution is a mechanism to combine or “blend”[10] two functions of time3 in a coherent manner. It can be mathematically described as follows:

For a discrete domain of one variable:

$(f * g)[n] \;=\; \sum_{m=-\infty}^{\infty} f[m]\, g[n-m]$

For a discrete domain of two variables:

$(f * g)[n_1, n_2] \;=\; \sum_{m_1=-\infty}^{\infty} \sum_{m_2=-\infty}^{\infty} f[m_1, m_2]\, g[n_1 - m_1, n_2 - m_2]$


2A point to note here is the improvement is, in fact, modest. Classification accuracies greater than or equal to 99% on MNIST have been achieved using non-neural methods as well, such as K-Nearest Neighbours (KNN) or Support Vector Machines (SVM). For a list of ML methods applied and the respective classification accuracies attained, please refer to this[11] table.

3Or, for that matter, of another parameter.


Eq. 2 is perhaps more descriptive of what convolution truly is: a summation of pointwise products of function values, subject to traversal.

Though conventionally called as such, the operation performed on image inputs with CNNs is not strictly convolution, but rather a slightly modified variant called cross-correlation[10], in which one of the inputs is time-reversed:

$(f \star g)[n_1, n_2] \;=\; \sum_{m_1=-\infty}^{\infty} \sum_{m_2=-\infty}^{\infty} f[m_1, m_2]\, g[n_1 + m_1, n_2 + m_2]$

CNN Concepts

CNNs have an associated terminology and a set of concepts that is unique to them, and that sets them apart from other types of neural network architectures. The main ones are explained as follows:

Input/Output Volumes

CNNs are usually applied to image data. Every image is a matrix of pixel values. The range of values that can be encoded in each pixel depends upon its bit size. Most commonly, we have 8 bit or 1 Byte-sized pixels. Thus the possible range of values a single pixel can represent is [0, 255]. However, with coloured images, particularly RGB (Red, Green, Blue)-based images, the presence of separate colour channels (3 in the case of RGB images) introduces an additional 'depth' field to the data, making the input 3-dimensional. Hence, for a given RGB image of size, say 256×256 (Width x Height) pixels, we'll have 3 matrices associated with each image, one for each of the colour channels. Thus the image in its entirety constitutes a 3-dimensional structure called the Input Volume (256x256x3).

Figure 1: The cross-section of an input volume of size: 4 x 4 x 3. It comprises of the 3 Colour channel matrices of the input image.

Features

Just as its literal meaning implies, a feature is a distinct and useful observation or pattern obtained from the input data that aids in performing the desired image analysis. The CNN learns the features from the input images. Typically, they emerge repeatedly from the data to gain prominence. As an example, when performing Face Detection, the fact that every human face has a pair of eyes will be treated as a feature by the system, that will be detected and learned by the distinct layers. In generic object classification, the edge contours of the objects serve as the features.

Filters (Convolution Kernels)

A filter (or kernel) is an integral component of the layered architecture.

Generally, it refers to an operator applied to the entirety of the image such that it transforms the information encoded in the pixels. In practice, however, a kernel is a smaller-sized matrix in comparison to the input dimensions of the image, that consists of real valued entries.

The kernels are then convolved with the input volume to obtain so-called ‘activation maps’. Activation maps indicate ‘activated’ regions, i.e. regions where features specific to the kernel have been detected in the input. The real values of the kernel matrix change with each learning iteration over the training set, indicating that the network is learning to identify which regions are of significance for extracting features from the data.

Kernel Operations Detailed

The exact procedure for convolving a Kernel (say, of size 16 x 16) with the input volume (a 256 x 256 x 3 sized RGB image in our case) involves taking patches from the input image of size equal to that of the kernel (16 x 16), and convolving (or calculating the dot product) between the values in the patch and those in the kernel matrix.

The convolved value obtained by summing the resultant terms from the dot product forms a single entry in the activation matrix. The patch selection is then slid (towards the right, or downwards when the boundary of the matrix is reached) by a certain amount called the 'stride' value, and the process is repeated until the entire input image has been processed. The process is carried out for all colour channels. For normalization purposes, we divide the calculated value of the activation matrix by the sum of values in the kernel matrix.
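
The sliding-window computation described above can be written in a few lines of NumPy for a single channel (an illustrative sketch, not code from the article; the function name and the toy inputs are made up):

import numpy as np

def slide_kernel_over_image(image, kernel, stride=1):
    # "Valid" sliding of the kernel over one channel: at each position take the
    # dot product of the patch and the kernel, then move by `stride`.
    kh, kw = kernel.shape
    out_h = (image.shape[0] - kh) // stride + 1
    out_w = (image.shape[1] - kw) // stride + 1
    activation = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i * stride:i * stride + kh, j * stride:j * stride + kw]
            activation[i, j] = np.sum(patch * kernel)
    return activation

image = np.arange(16, dtype=float).reshape(4, 4)    # toy 4x4 single-channel "image"
kernel = np.ones((3, 3))                            # toy 3x3 kernel
act = slide_kernel_over_image(image, kernel)
print(act / kernel.sum())   # divide by the kernel sum to normalize, as described above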

The process is demonstrated in Figure 2, using a toy example consisting of a 3-channel 4×4-pixels input image and a 3×3 kernel matrix.  Note that:

  • pixels are numbered from 1 in the example;
  • the values in the activation map are normalized to ensure the same intensity range between the input volume and the output volume. Hence, for normalization, we divide the calculated value for the ‘red’ channel by 2 (the sum of values in the kernel matrix);
  • we assume the same kernel matrix for all the three channels, but it is possible to have a separate kernel matrix for each colour channel;
  • for a more detailed and intuitive explanation of the convolution operation, you can refer to the excellent blog-posts by Chris Olah[12] and by Tim Dettmers[13].


Figure 2: The convolution value is calculated by taking the dot product of the corresponding values in the Kernel and the channel matrices. The current path is indicated by the red-coloured, bold outline in the Input Image volume. Here, the entry in the activation matrix is calculated as:

$\text{activation}(i, j) \;=\; \frac{\sum_{m}\sum_{n} \text{patch}(m, n)\cdot\text{kernel}(m, n)}{\sum_{m}\sum_{n} \text{kernel}(m, n)}$

Receptive Field

It is impractical to connect all neurons with all possible regions of the input volume. It would lead to too many weights to train, and produce too high a computational complexity. Thus, instead of connecting each neuron to all possible pixels, we specify a 2 dimensional region called the ‘receptive field[14]’ (say of size 5×5 units) extending to the entire depth of the input (5x5x3 for a 3 colour channel input), within which the encompassed pixels are fully connected to the neural network’s input layer. It’s over these small regions that the network layer cross-sections (each consisting of several neurons (called ‘depth columns’)) operate and produce the activation map.

Zero-Padding

Zero-padding refers to the process of symmetrically adding zeroes to the input matrix. It’s a commonly used modification that allows the size of the input to be adjusted to our requirement. It is mostly used in designing the CNN layers when the dimensions of the input volume need to be preserved in the output volume.


Figure 3: A zero-padded 4 x 4 matrix becomes a 6 x 6 matrix.

Hyperparameters

In CNNs, the properties pertaining to the structure of layers and neurons, such as spatial arrangement and receptive field values, are called hyperparameters. Hyperparameters uniquely specify layers. The main CNN hyperparameters are the receptive field (R), zero-padding (P), the input volume dimensions (Width x Height x Depth, or W x H x D) and stride length (S).

The CNN Architecture

Now that we are familiar with the CNN terminology, let’s go on ahead and study the CNN architecture in detail.

The architecture of a typical CNN is composed of multiple layers where each layer performs a specific function of transforming its input into a useful representation. There are 3 major types of layers that are commonly observed in complex neural network architectures:

Convolutional Layer
Also referred to as Conv. layer, it forms the basis of the CNN and performs the core operations of training and consequently firing the neurons of the network. It performs the convolution operation over the input volume as specified in the previous section, and consists of a 3-dimensional arrangement of neurons (a stack of 2-dimensional layers of neurons, one for each channel depth).


Figure 4: A 3-D representation of the Convolutional layer with 3 x 3 x 4 = 36 neurons.

Each neuron is connected to a certain region of the input volume called the receptive field (explained in the previous section). For example, for an input image of dimensions 28x28x3, if the receptive field is 5 x 5, then each neuron in the Conv. layer is connected to a region of 5x5x3 (the region always comprises the entire depth of the input, i.e. all the channel matrices) in the input volume. Hence each neuron will have 75 weighted inputs. For a particular value of R (receptive field), we have a cross-section of neurons entirely dedicated to taking inputs from this region. Such a cross-section is called a ‘depth column’. It extends to the entire depth of the Conv. layer.

For optimized Conv. layer implementations, we may use a Shared Weights model that reduces the number of unique weights to train and consequently the matrix calculations to be performed per layer. In this model, each ‘depth slice’ or a single 2-dimensional layer of neurons in the Conv architecture all share the same weights. The caveat with parameter sharing is that it doesn’t work well with images that encompass a spatially centered structure (such as face images), and in applications where we want the distinct features of the image to be detected in spatially different locations of the layer.


Figure 5: Concept of Receptive Field.

We must keep in mind though that the network operates in the same way that a feed-forward network would: the weights in the Conv layers are trained and updated in each learning iteration using a Back-propagation algorithm extended to be applicable to 3-dimensional arrangements of neurons.

The ReLu (Rectified Linear Unit) Layer

ReLu refers to the Rectifier Unit, the most commonly deployed activation function for the outputs of the CNN neurons. Mathematically, it’s described as:

$f(x) = \max(0, x)$

Unfortunately, the ReLu function is not differentiable at the origin, which makes it hard to use with backpropagation training. Instead, a smoothed version called the Softplus function is used in practice:

$f(x) = \ln\left(1 + e^{x}\right)$

The derivative of the softplus function is the sigmoid function, as mentioned in a prior blog post.

$f'(x) = \frac{e^{x}}{1 + e^{x}} = \frac{1}{1 + e^{-x}}$
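
The relationship between these three functions can be checked numerically with a few lines of NumPy (an illustrative sketch, not part of the original article):

import numpy as np

relu = lambda x: np.maximum(0.0, x)
softplus = lambda x: np.log1p(np.exp(x))          # ln(1 + e^x), a smooth version of ReLU
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

x = np.linspace(-5.0, 5.0, 11)
eps = 1e-5
numeric_derivative = (softplus(x + eps) - softplus(x - eps)) / (2 * eps)

print(np.allclose(numeric_derivative, sigmoid(x)))    # True: d/dx softplus = sigmoid
print(np.max(np.abs(relu(x) - softplus(x))))          # largest gap (~0.69) is near x = 0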

The Pooling Layer

The pooling layer is usually placed after the Convolutional layer. Its primary utility lies in reducing the spatial dimensions (Width x Height) of the Input Volume for the next Convolutional Layer. It does not affect the depth dimension of the Volume.  

The operation performed by this layer is also called ‘down-sampling’, as the reduction of size leads to loss of information as well. However, such a loss is beneficial for the network for two reasons:

  1. the decrease in size leads to less computational overhead for the upcoming layers of the network;
  2. it works against over-fitting.

Much like the convolution operation performed above, the pooling layer takes a sliding window or a certain region that is moved in stride across the input transforming the values into representative values. The transformation is either performed by taking the maximum value from the values observable in the window (called ‘max pooling’), or by taking the average of the values. Max pooling has been favoured over others due to its better performance characteristics.

The operation is performed for each depth slice. For example, if the input is a volume of size 4x4x3, and the sliding window is of size 2×2, then for each color channel, the values will be down-sampled to their representative maximum value if we perform the max pooling operation.

No new parameters are introduced in the matrix by this operation. The operation can be thought of as applying a function over input values, taking fixed-sized portions at a time, with the size modifiable as a parameter. Pooling is optional in CNNs, and many architectures do not perform pooling operations.
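
As a small illustration of the max-pooling operation described above (a toy NumPy sketch, not code from the article):

import numpy as np

def max_pool_2d(x, window=2, stride=2):
    # Slide a window over one depth slice and keep only the maximum of each patch.
    out_h = (x.shape[0] - window) // stride + 1
    out_w = (x.shape[1] - window) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = x[i * stride:i * stride + window, j * stride:j * stride + window]
            out[i, j] = patch.max()
    return out

x = np.array([[1, 3, 2, 4],
              [5, 6, 7, 8],
              [3, 2, 1, 0],
              [1, 2, 3, 4]], dtype=float)
print(max_pool_2d(x))   # a 4x4 slice down-samples to 2x2: [[6., 8.], [3., 4.]]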


Figure 6: The Max-Pooling operation can be observed in sub-figures (i), (ii) and (iii) that max-pools the 3 colour channels for an example input volume for the pooling layer. The operation uses a stride value of [2, 2]. The dark and red boundary regions describe the window movement. Sub-figure (iv) shows the operation applied for a stride value of [1, 1], resulting in a 3×3 matrix. Here we observe overlap between regions.

The Fully Connected Layer

The Fully Connected layer is configured exactly the way its name implies: it is fully connected with the output of the previous layer. Fully-connected layers are typically used in the last stages of the CNN to connect to the output layer and construct the desired number of outputs.

CNN Design Principles

Given the aforementioned building blocks, the last detail before implementing a CNN is to specify its design end to end, and to decide on the layer dimensions of the Convolutional layers.

A quick and dirty empirical formula[15] for calculating the spatial dimensions of the Convolutional Layer as a function of the input volume size and the hyperparameters we discussed before can be written as follows:

For each (ith) dimension of the input volume, pick:

$W_{out}^{(i)} \;=\; \frac{W_{in}^{(i)} - R + 2P}{S} + 1$

where $W_{in}^{(i)}$ is the (ith) input dimension, R is the receptive field value, P is the padding value, and S is the value of the stride. Note that the formula does not rely on the depth of the input.

To better understand how it works, let's consider the following example:

  1. Let the dimensions of the input volume be 288x288x3, the stride value be 2 (both along horizontal and vertical directions).
  2. Now, since WIn = 288 and S = 2, (2P - R) must be an even integer for the calculated value to be an integer. If we set the padding to 0 and R = 4, we get WOut = (288 - 4 + 2*0)/2 + 1 = 284/2 + 1 = 143 (a quick numeric check is sketched after this list). As the spatial dimensions are symmetrical (same value for width and height), the output dimensions are going to be: 143 x 143 x K, where K is the depth of the layer. K can be set to any value, with increasing values for every Conv. layer added. For larger networks values of 512 are common.
  3. The output volume from a Conv. layer either has the same dimensions as that of the Conv. layer (143x143x2 for the example considered above), or the same as that of the input volume (288x288x3 for the example above).

The generic arrangement of layers can thus be summarized as follows[15]:

INPUT -> [[CONV -> RELU] * N -> POOL?] * M -> [FC -> RELU] * K -> FC

Where N usually takes values between 0 and 3, M >= 0 and K∈[0,3).

The expression indicates multiple layers, with or without per-layer Pooling. The final layer is the fully-connected output layer. See [15] for more case-studies of CNN architectures, as well as a detailed discussion of layers and hyper-parameters.

Conclusion

CNNs showcase the awesome levels of control over performance that can be achieved by making effective use of theoretical and mathematical insights. Many real world problems are being efficiently tackled using CNNs, and MNIST represents a simple, "Hello World"-type use-case of this technique. More complex problems such as object and image recognition require the use of deep neural networks with millions of parameters to obtain state-of-the-art results. CIFAR-10 is a good problem to tackle in this domain; the dataset was introduced by Alex Krizhevsky et al.[16] in 2009. You can read through the technical report and try to grasp the approach before making your way to the TensorFlow tutorial that solves the same problem[17].

Furthermore, applications are not limited to computer vision. The most recent win of Google's AlphaGo project over Lee Sedol in the Go game series relied on a CNN at its core[18]. The self-driving cars which, in the coming years, will arguably become a regular sight on our streets, rely on CNNs for steering[19]. Google even held an art show[20] for imagery created by its DeepDream project that showcased beautiful works of art created by visualizing the transformations of the network!

Thus a Machine Learning researcher or engineer in today’s world can rejoice at the technological melange of techniques at her disposal, among which an in-depth understanding of CNNs is both indispensable and empowering.


 

References

[1] Hopfield, John J. “Neural networks and physical systems with emergent collective computational abilities.” Proceedings of the national academy of sciences 79.8 (1982): 2554-2558.[http://www.pnas.org/content/79/8/2554.abstract]

[2]  Rumelhart, D. E., Hinton, G. E., and Williams, R. J. (1986). Learning representations by back-propagating errors. Nature, 323, 533–536.[http://www.iro.umontreal.ca/~vincentp/ift3395/lectures/backprop_old.pdf]

[3]   Andrey Kurenkov, “A brief History of Neural Nets and Deep Learning”.[http://www.andreykurenkov.com/writing/a-brief-history-of-neural-nets-and-deep-learning/]

[4]  LeCun, Yann, et al. “Gradient-based learning applied to document recognition.” Proceedings of the IEEE 86.11 (1998): 2278-2324.

[5] The MNIST database of handwritten digits

[6] Hubel, David H., and Torsten N. Wiesel. “Receptive fields and functional architecture of monkey striate cortex.” The Journal of physiology 195.1 (1968): 215-243.

[7] Alpha Go video by Nature. [http://www.nature.com/news/google-ai-algorithm-masters-ancient-game-of-go-1.19234]

[8] Clark, Christopher, and Amos Storkey. “Teaching deep convolutional neural networks to play go.” arXiv preprint arXiv:1412.3409 (2014).[http://arxiv.org/pdf/1412.3409.pdf]

[9] Wallach, Izhar, Michael Dzamba, and Abraham Heifets. “AtomNet: A Deep Convolutional Neural Network for Bioactivity Prediction in Structure-based Drug Discovery.” arXiv preprint arXiv:1510.02855 (2015).

[http://arxiv.org/pdf/1510.02855.pdf]

[10] Weisstein, Eric W. “Convolution.” From MathWorld — A Wolfram Web Resource. [http://mathworld.wolfram.com/Convolution.html]

[11] Table of classification accuracies attained over MNIST. [https://en.wikipedia.org/wiki/MNIST_database#Performance]

[12] Chris Olah, "Understanding Convolutions". [http://colah.github.io/posts/2014-07-Understanding-Convolutions/]

[13] Tim Dettmers, “Understanding Convolution In Deep Learning”.[http://timdettmers.com/2015/03/26/convolution-deep-learning/]

[14] TensorFlow Documentation: Convolution [https://www.tensorflow.org/versions/r0.7/api_docs/python/nn.html#convolution]

[15] Andrej Karpathy, “CS231n: Convolutional Neural Networks for Visual Recognition” [http://cs231n.github.io/convolutional-networks/]

[16] Krizhevsky, Alex, and Geoffrey Hinton. “Learning multiple layers of features from tiny images.” (2009).[http://www.cs.toronto.edu/~kriz/learning-features-2009-TR.pdf]

[17] TensorFlow: Convolutional Networks.[https://www.tensorflow.org/versions/r0.7/tutorials/deep_cnn/index.html#cifar-10-model]

[18] Google Deepmind's AlphaGo: How it works. [https://www.tastehit.com/blog/google-deepmind-alphago-how-it-works/]

[19] An Empirical Evaluation of Deep Learning on Highway Driving.[http://arxiv.org/pdf/1504.01716.pdf]

[20] Inside Google’s First DeepDream Art Project. [http://www.fastcodesign.com/3057368/inside-googles-first-deepdream-art-show/11]

 

 

 

 

 

 

 

 

 

 

Posted by uniqueone
,
If you haven't taken linear algebra start here: https://www.youtube.com/watch?v=kjBOesZCoqc&list=PLZHQObOWTQDPD3MizzM2xVFitgF8hE_ab

Then go on to Udacity's free Tensorflow course. Keep in mind that neural nets are one of many ML techniques.
Posted by uniqueone
,

Winning Tips on Machine Learning Competitions by Kazanova, Current Kaggle #3 | HackerEarth Blog
http://blog.hackerearth.com/winning-tips-machine-learning-competitions-kazanova-current-kaggle-3
Posted by uniqueone
,

https://github.com/fchollet/keras-resources

 

Keras resources

This is a directory of tutorials and open-source code repositories for working with Keras, the Python deep learning library.

If you have a high-quality tutorial or project to add, please open a PR.

Official starter resources

Tutorials

Code examples

Working with text

Working with images

Creative visual applications

Reinforcement learning

  • DQN
  • FlappyBird DQN
  • async-RL: Tensorflow + Keras + OpenAI Gym implementation of 1-step Q Learning from "Asynchronous Methods for Deep Reinforcement Learning"
  • keras-rl: A library for state-of-the-art reinforcement learning. Integrates with OpenAI Gym and implements DQN, double DQN, Continuous DQN, and DDPG.

Miscellaneous architecture blueprints

Third-party libraries

  • Elephas: Distributed Deep Learning with Keras & Spark
  • Hyperas: Hyperparameter optimization
  • Hera: in-browser metrics dashboard for Keras models
  • Kerlym: reinforcement learning with Keras and OpenAI Gym
  • Qlearning4K: reinforcement learning add-on for Keras
  • seq2seq: Sequence to Sequence Learning with Keras
  • Seya: Keras extras
  • Keras Language Modeling: Language modeling tools for Keras
  • Recurrent Shop: Framework for building complex recurrent neural networks with Keras
  • Keras.js: Run trained Keras models in the browser, with GPU support
  • keras-vis: Neural network visualization toolkit for keras.

Projects built with Keras

 

 

Posted by uniqueone
,

http://jaejunyoo.blogspot.com/2017/03/nips-2016-tutorial-summary-nuts-and-bolts-of-building-AI-AndrewNg.html

 

 

Thursday, March 9, 2017

[NIPS 2016 tutorial - Summary] Nuts and bolts of building AI applications using Deep Learning

Today, I am going to review a very educational tutorial by Prof. Andrew Ng that was delivered at NIPS 2016. You can download the material by googling, though it seems that there is no official video clip provided on the NIPS 2016 homepage.

Still, you can watch a video with almost identical contents (even the title is exactly the same) at the following YouTube link:


* I really recommend listening to his full lecture. Sometimes, however, watching a video takes too much time just to get the gist of it. Here, I have tried to summarize what he tried to deliver in his talk. I hope this helps.
** Note that I skipped a few slides or changed their order to make them easier to explain.

TLDR;

  • Workflow guidelines

  • Follow this link and try some questionnaires. 


Outline

  • Trends of Deep Learning (DL)
    • Scale is driving DL progress
    • Rise of end-to-end learning
    • When to and when not to use "End-to-End" learning
  • Machine Learning (ML) Strategy (very practical advice)
    • How to manage train/dev/test data set and bias/variance
    • Basic recipe for ML
    • Defining a human level performance of each application is very useful

Trend #1


Q) Why is Deep Learning working so well NOW?

A) Scale drives DL progress



The red line, which stands for traditional learning algorithms such as SVM and logistic regression, shows a performance plateau after a while even with a large amount of data. Those models did not know what to do with all the data we collected.

For the last ten years, thanks to the rise of the internet, mobile and IoT (internet of things), we could march along the X-axis. Andrew Ng commented that this is the number one reason why DL algorithms work so well.

So... the implication of this :
To hit the top margin, you need a huge amount of data and a large NN model. 

Trend #2


The second major trend which he is excited about is end-to-end learning. 



Until recently, a lot of machine learning used real or integer numbers as output. In contrast to those, end-to-end learning can output much more complex things than numbers, e.g. image captioning.

It is called "end-to-end" learning because the input and output of the system are directly linked by a neural network unlike traditional models which have several intermediate steps:


This works well in many cases where traditional models are not effective. For example, end-to-end learning shows better performance on speech recognition tasks.

While presenting this slide, he introduced the following anecdote:

"This end-to-end story really upset many people. I used to get around and say that I believe "phonemes" are the fantasy of the linguists and machines can do well without them. One day at the meeting in Stanford a linguist yelled at me in public for saying that. Well...we turned out to be right."

This story may sound as if end-to-end learning were a magic key for any application, but he actually warned the audience to be careful when applying the model to their problems.

Despite all the excitement about end-to-end learning, he does not think that it is the solution for every application. It works well in "some" cases, but not in many others. For example, given the safety-critical requirement of autonomous driving and thus the need for extremely high levels of accuracy, a pure end-to-end approach is still challenging to get to work for autonomous driving.

In addition to this, he also commented that even though DL can almost always train a mapping from X to Y with a reasonable amount of data and you may publish a paper about it, it does not mean that using DL is actually a good idea, e.g. medical diagnosis or imaging.
End-to-end learning works only when you have enough (x, y) data to learn a function of the needed level of complexity. 
I totally agree with his point that we should not naively rely on the learning capability of the neural network. We should exploit all the power and knowledge of hand-designed or carefully chosen features which we already have.

In the same context, however, I have a slightly different point of view on "phonemes". I think they can and should also be used as an additional feature in parallel, which can reduce the labor of the neural network.

Machine Learning Strategy


Now let's move on to the next phase of his lecture. Here, he tries to give some sorts of answer or guideline to the following issues:
  • Often you will have a lot of ideas for how to improve an AI system; what do you do?
  • A good strategy will help you avoid months of wasted effort. So what is it?

I think this part is the gist of his lecture. I really liked his practical tips, all of which can actually be applied to my situation right away.

One of those tips he proposed is a kind of "standard workflow" which guides you while training the model:



When the training error is high, it means that the bias between the output of your model and the real data is too big. To mitigate this issue, you need to train longer, use a bigger model, or adopt a new architecture. Next, you should check whether your dev error is high or not. If it is, you need more data, some regularization, or a new model architecture.

* Yes, I know, this seems too obvious. Still, I want to mention that everything looks simple once it is organized under a unified system. Turning implicit know-how into an explicit framework is not an easy thing.
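To make that workflow concrete, here is a minimal sketch of the decision logic in plain Python. The function name, thresholds and error values are my own placeholders for illustration, not numbers from the talk.

# Hedged sketch of the basic recipe; thresholds and errors are made up.
def next_step(train_error, dev_error, target_error=0.05):
    if train_error > target_error:
        # High bias: the model underfits even the training data.
        return "train longer, use a bigger model, or try a new architecture"
    if dev_error > train_error + target_error:
        # High variance: the model does not generalize to the dev set.
        return "get more data, add regularization, or try a new architecture"
    return "done, or pick a harder target"

print(next_step(train_error=0.12, dev_error=0.14))  # high-bias advice
print(next_step(train_error=0.03, dev_error=0.15))  # high-variance advice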

Here, you should be careful with the implication of the keywords, bias and variance. In his talk, bias and variance have slightly different meanings than textbook definitions (so we do not try to trade off between both entities) although they share a similar concept.

In the era before DL, people used to trade off between bias and variance by playing with regularization, and this coupling could not be overcome because the two were tied too strongly.

Nowadays, however, the coupling between these two seems to have become weaker than before, because you can deal with each separately using simple strategies, i.e. use a bigger model (bias) and gather more data (variance).

This also implicitly shows why DL seems to be more applicable to various problems than traditional learning models. With DL, there are almost always at least two ways out of the problems we get stuck in within real-life situations, as introduced above.

Human Level Error as a Reference


To say whether your error is high or low, you need a reference. Andrew Ng suggests using human-level error as a stand-in for the optimal error, or Bayes error. He strongly recommended finding this number before going deep into research, because it is a critical component for guiding your next step.

Let's say our goal is to build a human-level speech system using DL. What we usually do with our data set is split it into three sets: train, dev (val) and test. Then, the gaps between these errors may occur as below:


You can see the gaps between the errors are named bias and variance. If you take a moment to think about it, you will find it quite intuitive why he named the gap between human-level error and training set error bias, and the other one variance.

If you find that your model has a high bias and a low variance, try to find a new model architecture or simply increase the capacity of the model. On the other hand, if you have a low bias but a high variance, you had better try gathering more data as an easy remedy. 

As you can see, because you now use human-level performance as a baseline, you always have a guideline for where to focus among the several options you may have. 
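As a worked example (with made-up numbers, not figures from the talk): if human-level error is 1%, training error is 6%, and dev error is 8%, then the bias gap is 5% and the variance gap is 2%, so working on bias (a bigger model or a better architecture) is the more promising next step.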



Note that he did not say that it is "easy" to train a big model or to gather a huge amount of data. What he tries to deliver here is that at least you have an "easy option to try" even though you are not an expert in the area. You know... building a new model architecture which actually works is not a trivial task even for the experts. 

Still, there remain some unavoidable issues you need to overcome. 


Data Crisis


To efficiently train the model with a finite amount of data, you need to carefully manipulate the data set or find a way to get more data (data synthesis). Here, I will focus on the former, which gives more intuition to practitioners.

Say you want to build a speech recognition system for a new in-car rearview mirror product. You have 50,000 hours of general speech data and 10 hours of in-car data. How do you split your data?

This is a BAD way to do it:


Having mismatched dev and test distributions is not a good idea. You may spend months optimizing for dev set performance only to find that it does not work well on the test set. 

So the following is his suggestion to do better:



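The slide itself is not reproduced here, but the point of the text above is that the dev and test sets should both come from the distribution you actually care about (the in-car data). A hedged sketch of a split along those lines follows; the exact proportions are my assumption, not the numbers on the slide.

# Hypothetical split: dev and test are both drawn from the in-car distribution.
general_speech_hours = 50_000          # generic speech data -> mostly training
in_car_hours = 10                      # data from the rearview-mirror product

train = {"general": general_speech_hours, "in_car": 0}   # some in-car data could also go here
dev   = {"general": 0, "in_car": in_car_hours / 2}       # 5 h of in-car speech
test  = {"general": 0, "in_car": in_car_hours / 2}       # 5 h of in-car speech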

A few remaining remarks

While the performance is worse than human level, there are many good ways to make progress:
  • error analysis
  • estimate bias/variance
  • etc.
After surpassing human performance, or at least getting near that point, however, what you usually observe is that progress becomes slow and almost stuck. There can be several reasons for this, such as:
  • Labels are made by humans (so the limit lies there)
  • Human-level error may be close to the optimal error (Bayes error)
What you can do here is find a subset of the data on which the model still performs worse than humans and make it do better there. 


I hope you enjoyed my summary. Thank you for reading :)

Interesting references

  • Slides (link)
  • Github markdown which you can actually try the workflow by clicking the multiple choices with questions. (link)
Posted by uniqueone
,
https://medium.com/towards-data-science/linear-algebra-cheat-sheet-for-deep-learning-cd67aba4526c#.2dqnqasml

 

 

Linear algebra cheat sheet for deep learning

While participating in Jeremy Howard’s excellent deep learning course I realized I was a little rusty on the prerequisites and my fuzziness was impacting my ability to understand concepts like backpropagation. I decided to put together a few wiki pages on these topics to improve my understanding. Here is a prettier version of my linear algebra page.

What is linear algebra?

In the context of deep learning, linear algebra is a mathematical toolbox that offers helpful techniques for manipulating groups of numbers simultaneously. It provides structures like vectors and matrices (spreadsheets) to hold these numbers and new rules for how to add, subtract, multiply, or divide them.

Why is it useful?

It turns complicated problems into simple, intuitive, and efficiently calculated problems. Here is an example of how linear algebra can achieve greater simplicity in code.

# Multiply two arrays
x = [1,2,3]
y = [2,3,4]
product = []
for i in range(len(x)):
    product.append(x[i]*y[i])

# Linear algebra version
import numpy
x = numpy.array([1,2,3])
y = numpy.array([2,3,4])
x * y

How is it used in deep learning?

Neural networks store weights in matrices. Linear algebra makes matrix operations fast and easy, especially when training on GPUs. In fact, GPUs were created with vector and matrix operations in mind. Similar to how images can be represented as arrays of pixels, video games generate compelling gaming experiences using enormous, constantly evolving matrices. Instead of processing pixels one-by-one, GPUs manipulate entire matrices of pixels in parallel.

Vectors

Vectors are 1-dimensional arrays of numbers or terms. In geometry, vectors store the magnitude and direction of a potential change to a point in space. The vector [3, -2], for instance, says go right 3 and down 2. An array with more than one dimension is called a matrix.

Vector notation

There are a variety of ways to represent vectors. Here are a few you might come across in your reading.

Vectors in geometry

Vectors typically represent movement from a point. They store both the magnitude and direction of potential changes to a point. The vector [-2,5] says move left 2 units and up 5 units. Source.

v = [-2, 5]

A vector can be applied to any point in space. The vector’s direction equals the slope of the hypotenuse created moving up 5 and left 2. Its magnitude equals the length of the hypotenuse.

Scalar operations

Scalar operations involve a vector and a number. You modify the vector in place by adding, subtracting, or multiplying the number with every value in the vector.

Scalar addition
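As a quick numpy illustration of my own (following the same conventions as the other snippets in this post):

import numpy as np

v = np.array([2, 4, 6])
v + 1    # array([3, 5, 7])     scalar addition
v * 3    # array([ 6, 12, 18])  scalar multiplication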

Elementwise operations

In elementwise operations like addition, subtraction, and division, values that correspond positionally are combined to produce a new vector. The 1st value in vector A is paired with the 1st value in vector B. The 2nd value is paired with the 2nd, and so on. This means the vectors must have equal dimensions to complete the operation.*

Vector addition
y = np.array([1,2,3])
x = np.array([2,3,4])
y + x = [3, 5, 7]
y - x = [-1, -1, -1]
y / x = [.5, .67, .75]

*In numpy if one of the vectors is of size 1 it can be treated like a scalar and applied to the elements in the larger vector. See below for more details on broadcasting in numpy.

Vector multiplication

There are two types of vector multiplication: Dot product and Hadamard product.

Dot product

The dot product of two vectors is a scalar. Dot product of vectors and matrices (matrix multiplication) is one of the most important operations in deep learning.

y = np.array([1,2,3])
x = np.array([2,3,4])
np.dot(y,x) = 20

Hadamard product

Hadamard Product is elementwise multiplication and it outputs a vector.

y = np.array([1,2,3])
x = np.array([2,3,4])
y * x = [2, 6, 12]

Vector fields

A vector field shows how far the point (x,y) would hypothetically move if we applied a vector function to it like addition or multiplication. Given a point in space, a vector field shows the power and direction of our proposed change at a variety of points in a graph.

Source

This vector field is an interesting one since it moves in different directions depending on the starting point. The reason is that the vector behind this field stores terms like 2x rather than scalar values like -2 and 5. For each point on the graph, we plug the x-coordinate into that term and draw an arrow from the starting point to the new location. Vector fields are extremely useful for visualizing machine learning techniques like gradient descent.
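As a small aside, a few lines of matplotlib make this concrete; the field v(x, y) = (2x, y) below is my own choice for illustration, not the exact field from the article's figure.

import numpy as np
import matplotlib.pyplot as plt

# Sample a grid of starting points and evaluate the (assumed) field v(x, y) = (2x, y)
xs, ys = np.meshgrid(np.linspace(-2, 2, 11), np.linspace(-2, 2, 11))
u, v = 2 * xs, ys            # arrow components at every grid point

plt.quiver(xs, ys, u, v)     # one arrow per starting point
plt.title("Vector field v(x, y) = (2x, y)")
plt.show()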

Matrices

A matrix is a rectangular grid of numbers or terms (like an Excel spreadsheet) with special rules for addition, subtraction, and multiplication.

Matrix dimensions

We describe the dimensions of a matrix in terms of rows by columns.

a = np.array([
[1,2,3],
[4,5,6]
])
a.shape == (2,3)
b = np.array([
[1,2,3]
])
b.shape == (1,3)

Matrix scalar operations

Scalar operations with matrices work the same way as they do for vectors. Simply apply the scalar to every element in the matrix — add, subtract, divide, multiply, etc.

Matrix scalar addition
a = np.array(
[[1,2],
[3,4]])
a + 1
[[2,3],
[4,5]]

Matrix elementwise operations

In order to add, subtract, or divide two matrices they must have equal dimensions.* We combine corresponding values in an elementwise fashion to produce a new matrix.

a = np.array([
[1,2],
[3,4]
])
b = np.array([
[1,2],
[3,4]
])
a + b
[[2, 4],
[6, 8]]
a - b
[[0, 0],
[0, 0]]

Numpy broadcasting*

I can’t escape talking about this, since it’s incredibly relevant in practice. In numpy the dimension requirements for elementwise operations are relaxed via a mechanism called broadcasting. Two matrices are compatible if the corresponding dimensions in each matrix (rows vs rows, columns vs columns) meet the following requirements:

  1. The dimensions are equal, or
  2. One dimension is of size 1
a = np.array([
[1],
[2]
])
b = np.array([
[3,4],
[5,6]
])
c = np.array([
[1,2]
])
# Same no. of rows
# Different no. of columns
# but a has one column so this works
a * b
[[ 3, 4],
[10, 12]]
# Same no. of columns
# Different no. of rows
# but c has one row so this works
b * c
[[ 3, 8],
[5, 12]]
# Different no. of columns
# Different no. of rows
# but both a and c meet the
# size 1 requirement rule
a + c
[[2, 3],
[3, 4]]

Things get weirder in higher dimensions — 3D, 4D, but for now we won’t worry about that. It’s less common in deep learning.

Matrix Hadamard product

Hadamard product of matrices is an elementwise operation, just like with vectors. Values that correspond positionally are multiplied to produce a new matrix.

a = np.array(
[[2,3],
[2,3]])
b = np.array(
[[3,4],
[5,6]])
# Uses python's multiply operator
a * b
[[ 6, 12],
[10, 18]]

In numpy you can take the Hadamard product of a matrix and vector as long as their dimensions meet the requirements of broadcasting.

Matrix transpose

Neural networks frequently process weights and inputs of different sizes where the dimensions do not meet the requirements of matrix multiplication. Matrix transpose provides a way to “rotate” one of the matrices so that the operation complies with multiplication requirements and can continue. There are two steps to transpose a matrix:

  1. Rotate the matrix right 90°
  2. Reverse the order of elements in each row (e.g. [a b c] becomes [c b a])

As an example, transpose matrix M into T:

a = np.array([
[1, 2],
[3, 4]])
a.T
[[1, 3],
[2, 4]]

Matrix multiplication

Matrix multiplication specifies a set of rules for multiplying matrices together to produce a new matrix.

Rules

Not all matrices are eligible for multiplication. In addition, there is a requirement on the dimensions of the resulting matrix output. Source.

  1. The number of columns of the 1st matrix must equal the number of rows of the 2nd
  2. The product of an M x N matrix and an N x K matrix is an M x K matrix. The new matrix takes the rows of the 1st and columns of the 2nd

Steps

Matrix multiplication relies on dot product to multiply various combinations of rows and columns. In the image below, taken from Khan Academy’s excellent linear algebra course, each entry in Matrix C is the dot product of a row in matrix A and a column in matrix B.

Source

The operation a1 · b1 means we take the dot product of the 1st row in matrix A (1, 7) and the 1st column in matrix B (3, 5).
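To spell the rule out in code, here is a minimal pure-Python version of my own that builds each entry from a row-column dot product, matching the description above:

def matmul(A, B):
    # Multiply an M x N matrix A by an N x K matrix B using row-column dot products
    assert len(A[0]) == len(B), "columns of A must equal rows of B"
    rows, inner, cols = len(A), len(B), len(B[0])
    return [[sum(A[i][k] * B[k][j] for k in range(inner)) for j in range(cols)]
            for i in range(rows)]

A = [[1, 7],
     [2, 4]]
B = [[3, 3],
     [5, 2]]
print(matmul(A, B))   # [[38, 17], [26, 14]]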

Here’s another way to look at it:

Why does matrix multiplication work this way?

It’s useful. There is no mathematical law behind why it works this way. Mathematicians developed it because it streamlines previously tedious calculations. It’s an arbitrary human construct, but one that’s extremely useful.

Test yourself with these examples

Matrix multiplication with numpy

Numpy uses the function np.dot(A,B) for both vector and matrix multiplication. It has some other interesting features and gotchas so I encourage you to read the documentation here before use.

a = np.array([
[1, 2]
])
a.shape == (1,2)
b = np.array([
[3, 4],
[5, 6]
])
b.shape == (2,2)
# Multiply
mm = np.dot(a,b)
mm == [13, 16]
mm.shape == (1,2)

Tutorials

Khan Academy Linear Algebra
Deep Learning Book Math Section
Andrew Ng’s Course Notes
Explanation of Linear Algebra
Explanation of Matrices
Intro To Linear Algebra


Please hit the ♥ button below if you enjoyed reading. Also feel free to suggest new linear algebra concepts I should add to this post!

Posted by uniqueone
,

Running TensorFlow on Android – 꿈꾸는 개발자의 로그
http://www.kmshack.kr/2017/03/android%ec%97%90%ec%84%9c-tensorflow-%ec%8b%a4%ed%96%89%ed%95%98%ea%b8%b0/
Posted by uniqueone
,

[Machine Learning] 19. Machine learning training methods (part 14) - AutoEncoder (6) : Naver Blog
http://blog.naver.com/laonple/220949087243
Posted by uniqueone
,

Neural networks and deep learning
http://neuralnetworksanddeeplearning.com/chap1.html

Hello guys,

If any of you are getting started with Deep Learning, I highly recommend this book. It's well written, nicely paced, clear, and starts right from the basic. In brief, it's perfect for a beginner to Deep Learning.

Hope this helps some learners!

Good luck!
Posted by uniqueone
,

https://www.neuraldesigner.com/blog/5_algorithms_to_train_a_neural_network

 

5 algorithms to train a neural network

By Alberto Quesada, Artelnics.


 

algorithm picture

The procedure used to carry out the learning process in a neural network is called the training algorithm. There are many different training algorithms, with different characteristics and performance.

Problem formulation

The learning problem in neural networks is formulated in terms of the minimization of a loss function, f. This function is, in general, composed of an error term and a regularization term. The error term evaluates how well a neural network fits the data set. The regularization term, on the other hand, is used to prevent overfitting by controlling the effective complexity of the neural network.

The loss function depends on the adaptive parameters (biases and synaptic weights) of the neural network. We can conveniently group them together into a single n-dimensional weight vector w. The picture below represents the loss function f(w).

Loss function picture

 

As we can see in the previous picture, the point w* is a minimum of the loss function. At any point A, we can calculate the first and second derivatives of the loss function. The first derivatives are grouped in the gradient vector, whose elements can be written as

ᐁif(w) = df/dwi   (i = 1,...,n)


Similarly, the second derivatives of the loss function can be grouped in the Hessian matrix,

Hi,jf(w) = d²f/dwi·dwj   (i,j = 1,...,n)

 

The problem of minimizing continuous and differentiable functions of many variables has been widely studied. Many of the conventional approaches to this problem are directly applicable to that of training neural networks.

One-dimensional optimization

Although the loss function depends on many parameters, one-dimensional optimization methods are of great importance here. Indeed, they are very often used in the training process of a neural network.

Many training algorithms first compute a training direction d and then a training rate η that minimizes the loss in that direction, f(η). The next picture illustrates this one-dimensional function.

one-dimensional function picture

 

The points η1 and η2 define an interval that contains the minimum of f, η*.

In this regard, one-dimensional optimization methods search for the minimum of a given one-dimensional function. Some widely used algorithms are the golden section method and Brent's method. Both reduce the bracket of a minimum until the distance between the two outer points in the bracket is less than a defined tolerance.

Multidimensional optimization

The learning problem for neural networks is formulated as searching of a parameter vector w* at which the loss function f takes a minimum value. The necessary condition states that if the neural network is at a minimum of the loss function, then the gradient is the zero vector.

The loss function is, in general, a nonlinear function of the parameters. As a consequence, it is not possible to find closed-form training algorithms for the minima. Instead, we consider a search through the parameter space consisting of a succession of steps. At each step, the loss decreases as the neural network parameters are adjusted.

In this way, to train a neural network we start with some parameter vector (often chosen at random). Then, we generate a sequence of parameters, so that the loss function is reduced at each iteration of the algorithm. The change of loss between two steps is called the loss decrement. The training algorithm stops when a specified condition, or stopping criterion, is satisfied.

Now, we are going to describe the most important training algorithms for neural networks.

Algorithm types

 


 

1. Gradient descent

Gradient descent, also known as steepest descent, is the simplest training algorithm. It requires information from the gradient vector, and hence it is a first order method.

Let us denote f(wi) = fi and ᐁf(wi) = gi. The method begins at a point w0 and, until a stopping criterion is satisfied, moves from wi to wi+1 along the training direction di = -gi. Therefore, the gradient descent method iterates in the following way:

wi+1 = wi + di·ηi = wi - gi·ηi,   i=0,1,...

 

The parameter η is the training rate. This value can either be set to a fixed value or found by one-dimensional optimization along the training direction at each step. An optimal value for the training rate obtained by line minimization at each successive step is generally preferable. However, there are still many software tools that only use a fixed value for the training rate.

The next picture is an activity diagram of the training process with gradient descent. As we can see, the parameter vector is improved in two steps: First, the gradient descent training direction is computed. Second, a suitable training rate is found.

Gradient descent diagram

 

The gradient descent training algorithm has the severe drawback of requiring many iterations for functions which have long, narrow valley structures. Indeed, the downhill gradient is the direction in which the loss function decreases most rapidly, but this does not necessarily produce the fastest convergence. The following picture illustrates this issue.

Gradient descent picture

 

Gradient descent is the recommended algorithm when we have very big neural networks, with many thousands of parameters. The reason is that this method only stores the gradient vector (size n) and does not store the Hessian matrix (size n²).
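As an illustration (a toy example of my own, not code from the article), one gradient descent loop on a simple quadratic loss looks like this:

import numpy as np

def f(w):                       # toy loss: f(w) = ||w||^2, minimum at w* = 0
    return np.dot(w, w)

def grad(w):                    # gradient vector g = 2w
    return 2 * w

w = np.array([3.0, -2.0])       # initial parameter vector w0
eta = 0.1                       # fixed training rate

for i in range(100):
    d = -grad(w)                # training direction di = -gi
    w = w + eta * d             # wi+1 = wi + di*eta  (= wi - gi*eta)

print(w, f(w))                  # w ends up very close to the minimum [0, 0]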

2. Newton's method

Newton's method is a second-order algorithm because it makes use of the Hessian matrix. The objective of this method is to find better training directions by using the second derivatives of the loss function.

Let us denote f(wi) = fi, ᐁf(wi) = gi and Hf(wi) = Hi. Consider the quadratic approximation of f at w0 using the Taylor series expansion:

f = f0 + g0·(w - w0) + 0.5·(w - w0)T·H0·(w - w0)

 

H0 is the Hessian matrix of f evaluated at the point w0. By setting g equal to 0 for the minimum of f(w), we obtain the next equation

g = g0 + H0 · (w - w0) = 0

 

Therefore, starting from a parameter vector w0, Newton's method iterates as follows

wi+1 = wi - Hi-1·gi,   i=0,1,...

 

The vector Hi-1·gi is known as Newton's step. Note that this change in the parameters may move towards a maximum rather than a minimum. This occurs if the Hessian matrix is not positive definite. Thus, the function evaluation is not guaranteed to be reduced at each iteration. In order to prevent such trouble, Newton's method equation is usually modified as:

wi+1 = wi - (Hi-1·gi)·ηi,   i=0,1,...

 

The training rate, η, can either be set to a fixed value or found by line minimization. The vector d = Hi-1·gi is now called Newton's training direction.

The state diagram for the training process with Newton's method is depicted in the next figure. Here, improvement of the parameters is performed by first obtaining Newton's training direction and then a suitable training rate.

Newton's method diagram

 

The picture below illustrates the performance of this method. As we can see, Newton's method requires fewer steps than gradient descent to find the minimum value of the loss function.

Newton's method graph

 

However, Newton's method has the difficulty that the exact evaluation of the Hessian and its inverse is quite expensive in computational terms.


 

3. Conjugate gradient

The conjugate gradient method can be regarded as something intermediate between gradient descent and Newton's method. It is motivated by the desire to accelerate the typically slow convergence associated with gradient descent. This method also avoids the information requirements associated with the evaluation, storage, and inversion of the Hessian matrix, as required by the Newton's method.

In the conjugate gradient training algorithm, the search is performed along conjugate directions which produces generally faster convergence than gradient descent directions. These training directions are conjugated with respect to the Hessian matrix.

Let d denote the training direction vector. Then, starting with an initial parameter vector w0 and an initial training direction vector d0 = -g0, the conjugate gradient method constructs a sequence of training directions as:

di+1 = -gi+1 + di·γi,   i=0,1,...

 

Here γ is called the conjugate parameter, and there are different ways to calculate it. Two of the most used are due to Fletcher and Reeves and to Polak and Ribiere. For all conjugate gradient algorithms, the training direction is periodically reset to the negative of the gradient.

The parameters are then improved according to the next expression. The training rate, η, is usually found by line minimization.

wi+1 = wi + di·ηi,   i=0,1,...

 

The picture below depicts an activity diagram for the training process with the conjugate gradient. Here, the parameters are improved by first computing the conjugate gradient training direction and then a suitable training rate in that direction.

Conjugate gradient diagram

 

This method has proved to be more effective than gradient descent in training neural networks. Since it does not require the Hessian matrix, conjugate gradient is also recommended when we have very big neural networks.


 

4. Quasi-Newton method

Applying Newton's method is computationally expensive, since it requires many operations to evaluate the Hessian matrix and compute its inverse. Alternative approaches, known as quasi-Newton or variable metric methods, were developed to overcome that drawback. These methods, instead of calculating the Hessian directly and then evaluating its inverse, build up an approximation to the inverse Hessian at each iteration of the algorithm. This approximation is computed using only information on the first derivatives of the loss function.

The Hessian matrix is composed of the second partial derivatives of the loss function. The main idea behind the quasi-Newton method is to approximate the inverse Hessian by another matrix G, using only the first partial derivatives of the loss function. Then, the quasi-Newton formula can be expressed as:

wi+1 = wi - (Gi·gi)·ηi,   i=0,1,...

 

The training rate η can either be set to a fixed value or found by line minimization. The inverse Hessian approximation G has different flavours. Two of the most used are the Davidon–Fletcher–Powell formula (DFP) and the Broyden–Fletcher–Goldfarb–Shanno formula (BFGS).

The activity diagram of the quasi-Newton training process is illustrated below. Improvement of the parameters is performed by first obtaining the quasi-Newton training direction and then finding a satisfactory training rate.

Quasi newton algorithm diagram

 

This is the default method to use in most cases: It is faster than gradient descent and conjugate gradient, and the exact Hessian does not need to be computed and inverted.
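In practice you rarely code BFGS by hand. As an illustration (SciPy is my own choice here, not something the article relies on), scipy.optimize.minimize exposes a quasi-Newton BFGS trainer directly:

import numpy as np
from scipy.optimize import minimize

# Toy loss and gradient (illustration only): a quadratic with minimum at (1, 2)
def loss(w):
    return (w[0] - 1.0) ** 2 + 10.0 * (w[1] - 2.0) ** 2

def grad(w):
    return np.array([2.0 * (w[0] - 1.0), 20.0 * (w[1] - 2.0)])

w0 = np.zeros(2)                                    # initial parameter vector
res = minimize(loss, w0, jac=grad, method="BFGS")   # quasi-Newton training
print(res.x)                                        # approximately [1. 2.]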


 

5. Levenberg-Marquardt algorithm

The Levenberg-Marquardt algorithm, also known as the damped least-squares method, has been designed to work specifically with loss functions which take the form of a sum of squared errors. It works without computing the exact Hessian matrix. Instead, it works with the gradient vector and the Jacobian matrix.

Consider a loss function which can be expressed as a sum of squared errors of the form

f = ∑ ei²,   i=1,...,m


Here m is the number of instances in the data set.

 

We can define the Jacobian matrix of the loss function as that containing the derivatives of the errors with respect to the parameters,

Ji,jf(w) = dei/dwj (i = 1,...,m & j = 1,...,n)


Where m is the number of instances in the data set and n is the number of parameters in the neural network. Note that the size of the Jacobian matrix is m·n.

 

The gradient vector of the loss function can be computed as:

ᐁf = 2 JT·e


Here e is the vector of all error terms.

 

Finally, we can approximate the Hessian matrix with the following expression.

Hf ≈ 2 JT·J + λI


Where λ is a damping factor that ensures the positive definiteness of the Hessian approximation and I is the identity matrix.

 

The next expression defines the parameters improvement process with the Levenberg-Marquardt algorithm

wi+1 = wi - (JiT·Ji + λ·I)-1·(2 JiT·ei),   i=0,1,...

 

When the damping parameter λ is zero, this is just Newton's method, using the approximate Hessian matrix. On the other hand, when λ is large, this becomes gradient descent with a small training rate.

The parameter λ is initialized to be large so that the first updates are small steps in the gradient descent direction. If any iteration happens to result in a failure, then λ is increased by some factor. Otherwise, as the loss decreases, λ is decreased, so that the Levenberg-Marquardt algorithm approaches Newton's method. This process typically accelerates convergence to the minimum.
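A minimal numpy sketch of this update follows (my own toy example; the data, the simple linear model and the halving/doubling schedule for λ are assumptions, not part of the article):

import numpy as np

# Toy data for a linear model y = w0 + w1*x, generated with w = (1, 2)
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 3.0, 5.0, 7.0])

w = np.zeros(2)                               # initial parameters
lam = 10.0                                    # damping factor lambda

def errors(w):                                # e_i = prediction_i - target_i
    return (w[0] + w[1] * x) - y

J = np.column_stack([np.ones_like(x), x])     # Jacobian de_i/dw_j, shape (m, n)

for _ in range(20):
    e = errors(w)
    step = np.linalg.solve(J.T @ J + lam * np.eye(2), 2 * J.T @ e)
    w_new = w - step                          # wi+1 = wi - (J'J + lam*I)^-1 (2 J'e)
    if np.sum(errors(w_new) ** 2) < np.sum(e ** 2):
        w, lam = w_new, lam * 0.5             # loss decreased: accept, reduce damping
    else:
        lam *= 2.0                            # loss increased: reject, increase damping

print(w)                                      # close to [1. 2.]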

The picture below represents a state diagram for the training process of a neural network with the Levenberg-Marquardt algorithm. The first step is to calculate the loss, the gradient and the Hessian approximation. Then the damping parameter is adjusted so as to reduce the loss at each iteration.

Levenberg-Marquardt algorithm diagram

 

As we have seen, the Levenberg-Marquardt algorithm is a method tailored to functions of the sum-of-squared-error type. That makes it very fast when training neural networks whose error is measured with that kind of metric. However, this algorithm has some drawbacks. The first one is that it cannot be applied to functions such as the root mean squared error or the cross-entropy error. Also, it is not compatible with regularization terms. Finally, for very big data sets and neural networks, the Jacobian matrix becomes huge, and therefore it requires a lot of memory. The Levenberg-Marquardt algorithm is therefore not recommended when we have big data sets and/or neural networks.


 

Memory and speed comparison

The next graph depicts the computational speed and the memory requirements of the training algorithms discussed in this post. As we can see, the slowest training algorithm is usually gradient descent, but it is also the one requiring the least memory. On the contrary, the fastest one might be the Levenberg-Marquardt algorithm, but it usually requires a lot of memory. A good compromise might be the quasi-Newton method.

Performance comparison between algorithms

 

To conclude, if our neural network has many thousands of parameters, we can use gradient descent or conjugate gradient in order to save memory. If we have many neural networks to train, each with just a few thousand instances and a few hundred parameters, the best choice might be the Levenberg-Marquardt algorithm. In all other situations, the quasi-Newton method will work well.


Posted by uniqueone
,

https://github.com/conda/conda/issues/1979

 

 

The workaround suggested in that issue is:

conda config --set ssl_verify false
conda update requests
conda config --set ssl_verify true
Posted by uniqueone
,

http://skyer9.tistory.com/11

 

 

 

 

Installing Python Keras+Tensorflow on Windows 7 64bit




1. Install Python 3.5 64bit


https://www.python.org/downloads/release/python-352/

--> Download the Windows x86-64 executable installer


"Add Python 3.5.2 to PATH" 를 체크하고 "Install Now" 를 선택한다.


Open a command prompt and run the command below to confirm that the installation succeeded.


C:\Users\skyer9>python -V

Python 3.5.2




2. Install TensorFlow


Enter the following command at the command prompt.


C:\> pip3 install --upgrade tensorflow-gpu


As of this writing (2017-02-25), the released build has a bug.

If an error occurs when you test TensorFlow, install it with the command below instead.


C:\> pip3 install --upgrade http://ci.tensorflow.org/view/Nightly/job/nightly-win/85/DEVICE=gpu,OS=windows/artifact/cmake_build/tf_python/dist/tensorflow_gpu-1.0.0rc2-cp35-cp35m-win_amd64.whl




3. Install CUDA 8.0


Download it from the site below.


https://developer.nvidia.com/cuda-downloads


For some reason... it seems you may have to install, uninstall and reinstall several times before it goes through.

If needed, also install Visual Studio 2015 Community.



4. Install cuDNN


https://developer.nvidia.com/cudnn


Download the following file from the site above.


cuDNN v5.1 Library for Windows 7


Extract the archive and paste its contents into C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v8.0.




5. Run a test program


Save the following content as hello.py.

(It is easier to just add TF_CPP_MIN_LOG_LEVEL as an environment variable.)


# ------------------------------------------------------------------------------

from __future__ import print_function


#disable tensorflow logging

import os

os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'


import tensorflow as tf


hello = tf.constant('Hello, TensorFlow!')


# Start tf session

sess = tf.Session()


print(str(sess.run(hello).strip(), 'utf-8'))

# ------------------------------------------------------------------------------


C:\> python hello.py




6. Install Keras


Download numpy-1.12.0+mkl-cp35-cp35m-win_amd64.whl from the site below.


http://www.lfd.uci.edu/~gohlke/pythonlibs/#numpy


Install it.

(Even if an error appears during installation, it is installed correctly as long as numpy (1.12.0+mkl) shows up in the list.)


C:\> pip3 install --upgrade numpy-1.12.0+mkl-cp35-cp35m-win_amd64.whl

C:\> pip3 list


Download scipy-0.19.0rc2-cp35-cp35m-win_amd64.whl from the site below.


http://www.lfd.uci.edu/~gohlke/pythonlibs/#scipy


Install scipy.


C:\> pip3 install scipy-0.19.0rc2-cp35-cp35m-win_amd64.whl


Install Keras.


C:\> pip3 install --upgrade keras




7. Run a test program


Save the following content as hello2.py.


(An error will occur if neither Microsoft Visual C++ 2015 Redistributable nor Visual Studio 2015 Community is installed.)


# ------------------------------------------------------------------------------

import os


#disable tensorflow logging

os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'

os.environ['KERAS_BACKEND'] = 'tensorflow'


import tensorflow as tf

sess = tf.Session()


from keras import backend as K

K.set_session(sess)

# ------------------------------------------------------------------------------


C:\> python hello2.py


Posted by uniqueone
,

http://tmmse.xyz/2017/03/01/tensorflow-keras-installation-windows-linux-macos/

 

 

 

Reference: https://groups.google.com/forum/#!topic/keras-users/_hXfBOjXow8

Quick summary:

# export PATH=~/anaconda/bin:$PATH # MAC
conda create -n tf python=3.5  # only Python 3.5 is supported by TensorFlow/Keras
activate tf # Windows
# source activate tf  : Linux/macOS

# from here on you are in the (tf) environment; the install order matters
pip install tensorflow   # pip install tensorflow-gpu : GPU version
conda install pandas matplotlib scikit-learn
pip install keras
conda install jupyter notebook

jupyter notebook # try it out

First of all, you have to use Python 3.5.x for TensorFlow v1.0.0 and Keras to be supported at the moment. I usually prefer installing from source, but since that is a bit painful on Windows, Anaconda lets you install everything relatively easily. I also confirmed that all of this works on Linux/macOS.

For brevity, the rest is written as terse instructions. Go to the Anaconda download page and install the Anaconda Python 3.x version for your platform.

Mine is Python 3.6.

On Windows, there is a checkbox during installation to add Anaconda to PATH; make sure it is checked.

After installation finishes, the conda --v command should work in a terminal on Linux/macOS or in a CMD window on Windows.

If the conda command does not work on Linux/macOS, add Anaconda to your path with export PATH=~/anaconda/bin:$PATH. Your Anaconda path may differ, so add it as either anaconda or anaconda3. For convenience, put it in your .bashrc: echo 'export PATH=~/anaconda/bin:$PATH' >> ~/.bashrc

Check that the command works. The exact Anaconda version does not seem to matter.

Then create a conda environment. After creating it, activate the environment.

conda create -n tf python=3.5 # proceed with y, etc.
activate tf # Windows
# source activate tf   # Linux/macOS

This step is similar to Python's virtualenv. It safely isolates the environment so that the system's Python libraries do not get tangled up, for example when you want to use a different version of TensorFlow.

The important part here is specifying python=3.5. If Anaconda is on Python 3.6, TensorFlow and Keras cannot be installed. Also, the activation command differs between Windows and Linux/macOS.

The shell/terminal prompt changes to show the (tf) environment, as follows.

Then install everything in this order.

pip install tensorflow  # pip install tensorflow-gpu  
conda install pandas matplotlib scikit-learn  
pip install keras  
conda install jupyter notebook  

If you can use the GPU version of TensorFlow, install it with pip install tensorflow-gpu.
Installing Keras appears to pull in Theano as well, but the default backend runs on TensorFlow.

Note that if jupyter notebook is installed before Keras, there is an issue where the keras module cannot be imported. So install jupyter notebook after Keras.

Finally, quickly check that Keras imports.
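For example (a minimal check of my own, run inside the (tf) environment; not the screenshot from the original post):

import keras                       # typically prints "Using TensorFlow backend."
print(keras.backend.backend())     # expected output: tensorflow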

In closing...

Although I have not read it in detail, the following blog post on installing Keras+TensorFlow on Windows 7 64bit, including CUDA and cuDNN, may also be helpful: Installing Python Keras+Tensorflow on Windows 7 64bit

 

 

------------------------------------------------------------

Kyung Mo Kweon: To add to this, only Windows is limited to 3.5; the other platforms are already built for 3.6, so you don't necessarily have to use 3.5.
https://pypi.python.org/pypi/tensorflow/1.0.0

 

Posted by uniqueone
,

Keras Tutorial: The Ultimate Beginner's Guide to Deep Learning in Python
https://elitedatascience.com/keras-tutorial-deep-learning-in-python
Posted by uniqueone
,

Practical Deep Learning For Coders—18 hours of lessons for free
http://course.fast.ai/
Posted by uniqueone
,

Keras materials

Deep Learning/Keras 2017. 2. 28. 18:18
TensorFlow is great, but for people (like me) whose major is not computer science and who mainly want to apply existing algorithms, I think Keras is more than enough (my major is industrial engineering..).

So I have been studying Keras, but there is not much material out there.

I took a Keras CNN tutorial I found by Googling,
https://elitedatascience.com/keras-tutorial-deep-learning-in-python
adapted it to my environment (updating it for the changes in the latest Keras),

and wrote it up in my own way:
https://byeongkijeong.github.io/Keras-cnn-tutorial/

Since I use TensorFlow as the Keras backend, I don't think this goes against the group's rules. haha

This is my first time with a blog, Jekyll and Markdown, so the post is a bit messy.

I will clean it up soon.
Posted by uniqueone
,