The document discusses challenges in training deep neural networks and solutions to those challenges. Training deep neural networks with many layers and parameters can be slow and prone to overfitting. A key challenge is the vanishing gradient problem, where gradients shrink exponentially as they propagate backward through many layers, making earlier layers very slow to train. Solutions include initialization techniques like He initialization and activation functions like ReLU and leaky ReLU that do not saturate, preventing gradients from vanishing. Later improvements include the ELU activation function.
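The vanishing-gradient effect described above can be seen in a few lines. This sketch (with illustrative numbers, not taken from the presentation) compares the gradient factor that survives backpropagation through 20 layers for a saturating sigmoid versus ReLU:

```python
import math

def sigmoid_grad(x):
    """Derivative of the logistic sigmoid; its maximum is 0.25 at x = 0."""
    s = 1.0 / (1.0 + math.exp(-x))
    return s * (1.0 - s)

def relu_grad(x):
    """Derivative of ReLU: exactly 1 for positive inputs, so it never shrinks."""
    return 1.0 if x > 0 else 0.0

# Product of derivatives across 20 layers -- the factor that multiplies the
# gradient signal during backpropagation -- evaluated at the best case x = 0
# for sigmoid and at a positive input for ReLU.
depth = 20
sigmoid_product = sigmoid_grad(0.0) ** depth   # 0.25**20, about 9.1e-13
relu_product = relu_grad(1.0) ** depth         # 1.0**20 = 1.0

print(sigmoid_product)  # vanishingly small
print(relu_product)     # unchanged
```

Even in the sigmoid's best case the signal is multiplied by at most 0.25 per layer, so twenty layers shrink it by roughly twelve orders of magnitude, while ReLU passes it through unchanged.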
Machine learning models can be either generative or discriminative. Generative models directly model the joint distribution of inputs and outputs, while discriminative models directly model the conditional distribution of outputs given inputs. Common deep generative models include restricted Boltzmann machines, deep belief networks, variational autoencoders, generative adversarial networks, and deep convolutional generative adversarial networks. These models use different network architectures and training procedures to generate new examples that resemble samples from the training data distribution.
The document discusses the AdaBoost classifier algorithm. AdaBoost is an algorithm that combines multiple weak classifiers to produce a strong classifier. It works by training weak classifiers on weighted versions of the training data and combining them through a weighted majority vote. The weights are updated at each iteration to focus on misclassified examples. The final strong classifier is a linear combination of the weak classifiers.
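The weight-update rule summarized above can be made concrete with one boosting round on toy data (the labels and the single mistake are invented for illustration):

```python
import math

# Toy data: 5 examples, and a weak classifier that misclassifies example 3.
y_true = [1, 1, -1, -1, 1]
y_pred = [1, 1, -1, 1, 1]          # one mistake (index 3)
w = [1.0 / 5] * 5                  # uniform initial weights

# Weighted error of the weak classifier.
err = sum(wi for wi, yt, yp in zip(w, y_true, y_pred) if yt != yp)
alpha = 0.5 * math.log((1 - err) / err)   # this classifier's vote weight

# Reweight: boost misclassified examples, shrink correct ones, renormalize.
w = [wi * math.exp(-alpha * yt * yp) for wi, yt, yp in zip(w, y_true, y_pred)]
total = sum(w)
w = [wi / total for wi in w]

print(round(alpha, 3))             # positive because err < 0.5
print([round(wi, 3) for wi in w])  # the misclassified example now carries half the weight
```

After renormalization the one misclassified example holds weight 0.5, so the next weak classifier is forced to focus on it; the final strong classifier sums each weak classifier's vote scaled by its alpha.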
This document provides an overview of deep learning and neural networks. It begins with definitions of machine learning, artificial intelligence, and the different types of machine learning problems. It then introduces deep learning, explaining that it uses neural networks with multiple layers to learn representations of data. The document discusses why deep learning works better than traditional machine learning for complex problems. It covers key concepts like activation functions, gradient descent, backpropagation, and overfitting. It also provides examples of applications of deep learning and popular deep learning frameworks like TensorFlow. Overall, the document gives a high-level introduction to deep learning concepts and techniques.
Mathematics Foundation Course for Machine Learning & AI by Eduonix (Nick Trott)
This document provides information about a course on the mathematical foundations of machine learning and AI. It discusses how mathematics is important for tasks like selecting algorithms, setting parameters, and identifying overfitting. The course will cover key topics from linear algebra, multivariate calculus, and probability theory that are necessary for understanding machine learning algorithms and building AI systems. It consists of 18 lectures totaling 4 hours and provides a certificate of completion upon finishing.
Chap 8. Optimization for training deep models by Young-Geun Choi
Internal lab seminar material. It summarizes and excerpts Chapter 8 of Goodfellow et al. (2016), Deep Learning, MIT Press, and introduces the methods commonly used to optimize the objective function when training deep neural network models.
This presentation on recurrent neural networks will help you understand what a neural network is, which neural networks are popular, why we need recurrent neural networks, what a recurrent neural network is, how an RNN works, what the vanishing and exploding gradient problems are, and what an LSTM is; you will also see a use-case implementation of LSTM (long short-term memory). The neural networks used in deep learning consist of different layers connected to each other and are modeled on the structure and function of the human brain. A neural net learns from huge volumes of data and uses complex algorithms for training. A recurrent neural network works on the principle of saving the output of a layer and feeding it back to the input in order to predict the layer's output. Now let's dive into this presentation and understand what an RNN is and how it actually works.
Below topics are explained in this recurrent neural networks tutorial:
1. What is a neural network?
2. Popular neural networks?
3. Why recurrent neural network?
4. What is a recurrent neural network?
5. How does an RNN work?
6. Vanishing and exploding gradient problem
7. Long short term memory (LSTM)
8. Use case implementation of LSTM
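The feedback loop listed in the topics above can be sketched with a minimal recurrent cell. The scalar weights here are illustrative placeholders, not trained values from the presentation:

```python
import math

# Minimal recurrent cell with scalar weights: the hidden state h is fed back
# at every step, which is how an RNN "remembers" earlier inputs.
w_x, w_h, b = 0.5, 0.9, 0.0   # illustrative weights, not trained

def rnn_step(x_t, h_prev):
    return math.tanh(w_x * x_t + w_h * h_prev + b)

states = []
h = 0.0
for x_t in [1.0, 0.0, 0.0, 0.0]:   # a single input pulse followed by silence
    h = rnn_step(x_t, h)
    states.append(h)

print([round(s, 4) for s in states])
```

The first input's influence persists in the hidden state but shrinks at every step; this gradual decay over long sequences is exactly the vanishing-gradient problem that LSTMs were designed to address.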
Simplilearn’s Deep Learning course will transform you into an expert in deep learning techniques using TensorFlow, the open-source software library designed to conduct machine learning & deep neural network research. With our deep learning course, you'll master deep learning and TensorFlow concepts, learn to implement algorithms, build artificial neural networks and traverse layers of data abstraction to understand the power of data, preparing you for your new role as a deep learning scientist.
Why Deep Learning?
TensorFlow is one of the most popular software platforms used for deep learning and contains powerful tools to help you build and implement artificial neural networks.
Advancements in deep learning are being seen in smartphone applications, creating efficiencies in the power grid, driving advancements in healthcare, improving agricultural yields, and helping us find solutions to climate change. With this Tensorflow course, you’ll build expertise in deep learning models, learn to operate TensorFlow to manage neural networks and interpret the results.
And according to payscale.com, the median salary for engineers with deep learning skills tops $120,000 per year.
You can gain in-depth knowledge of Deep Learning by taking our Deep Learning certification training course. With Simplilearn’s Deep Learning course, you will prepare for a career as a Deep Learning engineer as you master concepts and techniques including supervised and unsupervised learning, mathematical and heuristic aspects, and hands-on modeling to develop algorithms. Those who complete the course will be able to:
Learn more at: https://www.simplilearn.com/
The document discusses hyperparameters and hyperparameter tuning in deep learning models. It defines hyperparameters as parameters that govern how the model parameters (weights and biases) are determined during training, in contrast to model parameters which are learned from the training data. Important hyperparameters include the learning rate, number of layers and units, and activation functions. The goal of training is for the model to perform optimally on unseen test data. Model selection, such as through cross-validation, is used to select the optimal hyperparameters. Training, validation, and test sets are also discussed, with the validation set used for model selection and the test set providing an unbiased evaluation of the fully trained model.
This document provides an overview of different techniques for hyperparameter tuning in machine learning models. It begins with introductions to grid search and random search, then discusses sequential model-based optimization techniques like Bayesian optimization and Tree-structured Parzen Estimators (TPE). Evolutionary algorithms like CMA-ES and particle-based methods like particle swarm optimization are also covered. Multi-fidelity methods like successive halving and Hyperband are described, along with recommendations on when to use different techniques. The document concludes by listing several popular libraries for hyperparameter tuning.
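Random search, the simplest technique above, fits in a dozen lines. The objective below is a hypothetical stand-in for a real train-and-evaluate run, chosen only so the sketch is self-contained:

```python
import random

def validation_loss(lr, n_units):
    """Stand-in for a real train-and-validate run; any black-box score works.
    (Hypothetical objective chosen only for illustration.)"""
    return (lr - 0.01) ** 2 + (n_units - 64) ** 2 * 1e-6

random.seed(0)
best = None
for _ in range(50):                      # budget: 50 random trials
    lr = 10 ** random.uniform(-4, -1)    # sample learning rate on a log scale
    n_units = random.randint(16, 256)    # sample layer width uniformly
    loss = validation_loss(lr, n_units)
    if best is None or loss < best[0]:
        best = (loss, lr, n_units)

print(best)   # (best loss, best learning rate, best width)
```

Sampling the learning rate on a log scale is the usual practice, since useful values span orders of magnitude; grid search would instead enumerate a fixed Cartesian product of these two axes.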
Genetic algorithms are optimization techniques inspired by Darwin's theory of evolution. They use operations like selection, crossover and mutation to evolve solutions to problems by iteratively trying random variations. The document outlines the history, concepts, process and applications of genetic algorithms, including using them to optimize engineering design, routing, computer games and more. It describes how genetic algorithms encode potential solutions and use fitness functions to guide the evolution toward better outcomes.
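The selection, crossover, and mutation operations described above can be sketched on the classic "OneMax" toy problem (evolving a bit string toward all ones); the population size, rates, and fitness function are illustrative choices, not from the document:

```python
import random

random.seed(42)
TARGET_LEN = 20

def fitness(bits):
    """Fitness = number of 1-bits; the optimum is the all-ones string."""
    return sum(bits)

def crossover(a, b):
    cut = random.randint(1, TARGET_LEN - 1)   # single-point crossover
    return a[:cut] + b[cut:]

def mutate(bits, rate=0.05):
    return [1 - b if random.random() < rate else b for b in bits]

# Random initial population, then iterate: select the fittest, breed, mutate.
pop = [[random.randint(0, 1) for _ in range(TARGET_LEN)] for _ in range(30)]
for gen in range(100):
    pop.sort(key=fitness, reverse=True)
    if fitness(pop[0]) == TARGET_LEN:
        break
    parents = pop[:10]                        # selection: keep the top third
    pop = parents + [mutate(crossover(random.choice(parents),
                                      random.choice(parents)))
                     for _ in range(20)]

print(fitness(pop[0]))
```

Keeping the parents in the next generation (elitism) guarantees the best fitness never decreases, so the random variation introduced by crossover and mutation can only ratchet the population toward the optimum.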
This is a presentation I gave as a short overview of LSTMs. The slides are accompanied by two examples which apply LSTMs to Time Series data. Examples were implemented using Keras. See links in slide pack.
Generative Adversarial Networks (GANs) are a class of machine learning frameworks where two neural networks contest with each other in a game. A generator network generates new data instances, while a discriminator network evaluates them for authenticity, classifying them as real or generated. This adversarial process allows the generator to improve over time and generate highly realistic samples that can pass for real data. The document provides an overview of GANs and their variants, including DCGAN, InfoGAN, EBGAN, and ACGAN models. It also discusses techniques for training more stable GANs and escaping issues like mode collapse.
Talk on Optimization for Deep Learning, which gives an overview of gradient descent optimization algorithms and highlights some current research directions.
The document discusses transfer learning and building complex models using Keras and TensorFlow. It provides examples of using the functional API to build models with multiple inputs and outputs. It also discusses reusing pretrained layers from models like ResNet, Xception, and VGG to perform transfer learning for new tasks with limited labeled data. Freezing pretrained layers initially and then training the entire model is recommended for transfer learning.
MLPfit is a tool for designing and training multi-layer perceptrons (MLPs) for tasks like function approximation and classification. It implements stochastic minimization as well as more powerful methods like conjugate gradients and BFGS. MLPfit is designed to be simple, precise, fast and easy to use for both standalone and integrated applications. Documentation and source code are available online.
The document describes multilayer neural networks and their use for classification problems. It discusses how neural networks can handle continuous-valued inputs and outputs unlike decision trees. Neural networks are inherently parallel and can be sped up through parallelization techniques. The document then provides details on the basic components of neural networks, including neurons, weights, biases, and activation functions. It also describes common network architectures like feedforward networks and discusses backpropagation for training networks.
This presentation examines one of the most popular algorithmic problems, from the evolutionary computation perspective. Contains problem definition, comparison between genetic algorithms and dynamic programming, the software design stage and how fitness function works in GA.
Welcome to Supervised Machine Learning and Data Science.
Algorithms for building models: Support Vector Machines.
An explanation of the SVM classification algorithm, with code in Python.
Activation functions and Training Algorithms for Deep Neural network by Gayatri Khanvilkar
Training a deep neural network is a difficult task. Deep neural networks are trained with the help of training algorithms and activation functions. This is an overview of the activation functions and training algorithms used for deep neural networks, with a brief comparative study of both.
This document provides an overview of deep learning including definitions, prerequisites, and examples of techniques like linear regression, multi-layer perceptrons, backpropagation, convolutional neural networks, and frameworks like PyTorch. It defines deep learning as being driven by very deep neural networks, explains why large networks are necessary to handle non-well-defined and ambiguous problems, and discusses how frameworks make deep learning models easy to implement and generalize.
Deep Learning Module 2A Training MLP.pptx by vipul6601
This document provides an overview of deep learning concepts including linear regression, neural networks, and training multilayer perceptrons. It discusses:
1) How linear regression can be used for prediction tasks by learning weights to relate features to targets.
2) How neural networks extend this by using multiple layers of neurons and nonlinear activation functions to learn complex patterns in data.
3) The process of training neural networks, including forward propagation to make predictions, backpropagation to calculate gradients, and updating weights to reduce loss.
4) Key aspects of multilayer perceptrons like their architecture with multiple fully-connected layers, use of activation functions, and training algorithm involving forward/backward passes and parameter updates.
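The forward pass, backward pass, and parameter update in steps 3 and 4 can be sketched for a single sigmoid neuron on one training example (the inputs, weights, and learning rate are illustrative numbers):

```python
import math

# One neuron, one training example: forward pass, loss, backward pass, update.
x, y = [1.0, 2.0], 1.0            # input features and target label
w, b = [0.1, -0.2], 0.0           # initial weights and bias
lr = 0.5                          # learning rate

# Forward propagation: weighted sum -> sigmoid -> predicted probability.
z = sum(wi * xi for wi, xi in zip(w, x)) + b
p = 1.0 / (1.0 + math.exp(-z))
loss = -(y * math.log(p) + (1 - y) * math.log(1 - p))  # cross-entropy

# Backpropagation: for sigmoid + cross-entropy, d(loss)/dz simplifies to p - y.
dz = p - y
w = [wi - lr * dz * xi for wi, xi in zip(w, x)]
b = b - lr * dz

# One gradient step lowers the loss on the same example.
z2 = sum(wi * xi for wi, xi in zip(w, x)) + b
p2 = 1.0 / (1.0 + math.exp(-z2))
loss2 = -(y * math.log(p2) + (1 - y) * math.log(1 - p2))
print(loss, loss2)   # loss2 < loss
```

A multilayer perceptron repeats exactly this pattern layer by layer, with the chain rule carrying the `dz` term backward through each fully-connected layer.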
Interpreting Deep Neural Networks Based on Decision TreesTsukasaUeno1
The document discusses interpreting deep neural networks using decision trees. It describes experiments comparing the performance of neural networks with 1-15 hidden layers to decision trees built from the outputs of hidden layers on several datasets. The results show decision trees can approximate neural networks with less than 1% difference in accuracy. Tree size generally decreases as more hidden layers are added, though it may increase or stop changing for some problems. The work seeks to determine the optimal number of hidden layers needed for a problem by examining tree size. Future work is proposed to further analyze the effects of training parameters and to test on larger datasets.
The document discusses various activation functions used in neural networks including Tanh, ReLU, Leaky ReLU, Sigmoid, and Softmax. It explains that activation functions introduce non-linearity and allow neural networks to learn complex patterns. Tanh squashes outputs between -1 and 1, while ReLU sets negative values to zero, which can cause the "dying ReLU" problem; Leaky ReLU addresses this by allowing a small negative slope. Sigmoid and Softmax transform outputs between 0 and 1 for classification problems. Activation functions determine whether a neuron's output is important for prediction.
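All five activation functions from the summary above fit in a short sketch (the leaky slope of 0.01 is the conventional default, assumed here rather than taken from the slides):

```python
import math

def tanh(x):       # squashes to (-1, 1)
    return math.tanh(x)

def relu(x):       # zero for negatives, identity for positives
    return max(0.0, x)

def leaky_relu(x, slope=0.01):   # small negative slope keeps units from "dying"
    return x if x > 0 else slope * x

def sigmoid(x):    # squashes to (0, 1); common for binary classification
    return 1.0 / (1.0 + math.exp(-x))

def softmax(zs):   # turns a score vector into probabilities that sum to 1
    m = max(zs)                      # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in zs]
    total = sum(exps)
    return [e / total for e in exps]

print(relu(-2.0), leaky_relu(-2.0))             # 0.0 vs a small negative value
print(sum(softmax([1.0, 2.0, 3.0])))            # probabilities sum to 1
```

Note that `leaky_relu(-2.0)` still returns a (small) nonzero value, so a gradient can flow back through the unit, which is precisely how it avoids the dying-ReLU problem.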
This document provides an overview of deep learning including:
- Deep learning uses neural networks with multiple hidden layers to learn complex patterns in data.
- It can learn powerful feature representations from raw data in an unsupervised manner, unlike traditional ML which requires handcrafted features.
- The basics of neural networks including perceptrons, forward/backward propagation, and activation functions are explained.
- Training a neural network involves calculating loss, taking gradients to minimize loss through methods like stochastic gradient descent and adapting the learning rate.
- Regularization techniques help prevent overfitting, and H2O is introduced as a tool for scalable deep learning on large datasets.
This document provides an introduction and overview of deep learning. It begins with defining neural networks and how they are inspired by biological neurons. It then discusses different types of neural networks like single perceptrons, multi-layer perceptrons, convolutional neural networks, recurrent neural networks, and autoencoders. The document explains key concepts in deep learning like weights, biases, activation functions, loss functions, and training neural networks using gradient descent. It also clarifies terms like epochs, batches, and iterations in the training process. Finally, it touches on important hyperparameters like learning rate that control the training of neural networks.
https://github.com/telecombcn-dl/dlmm-2017-dcu
Deep learning technologies are at the core of the current revolution in artificial intelligence for multimedia data analysis. The convergence of big annotated data and affordable GPU hardware has allowed the training of neural networks for data analysis tasks which had been addressed until now with hand-crafted features. Architectures such as convolutional neural networks, recurrent neural networks and Q-nets for reinforcement learning have shaped a brand new scenario in signal processing. This course will cover the basic principles and applications of deep learning to computer vision problems, such as image classification, object detection or text captioning.
This document provides legal notices and disclaimers for an informational presentation by Intel. It states that the presentation is for informational purposes only and that Intel makes no warranties. It also notes that Intel technologies' features and benefits depend on system configuration. Finally, it specifies that the sample source code in the presentation is released under the Intel Sample Source Code License Agreement and that Intel and its logo are trademarks.
Introduction to artificial neural network and deep learning by Pramod Ramachandra
Neural networks are modeled after the human brain and consist of interconnected neurons that process information. A neural network has an input layer, hidden layers, and an output layer. Within each hidden layer are neurons that perform computations using weights, biases, and an activation function. The network is trained using gradient descent and backpropagation to minimize a loss function by adjusting the weights until the desired output is achieved. Popular applications of neural networks include computer vision, natural language processing, and autonomous vehicles. Deep learning uses neural networks with many hidden layers to learn complex patterns from large amounts of data.
Deep learning uses neural networks with multiple layers to simulate the human brain. A network consists of an input layer, hidden layers, and an output layer, with data passed between each layer. The example task uses a neural network for multi-class classification of handwritten digits, with a softmax output activation and error minimized through gradient descent and backpropagation. After experimentation, the model achieved 98% validation accuracy. Deep learning is widely used for computer vision, natural language processing, and reinforcement learning problems.
introduction to deep Learning with full details by sonykhan3
1. Deep learning involves using neural networks with multiple hidden layers to learn representations of data with multiple levels of abstraction.
2. These neural networks are able to learn increasingly complex features from the input data as the number of layers increases. The layers closer to the input learn simpler features while layers further from the input learn complex patterns in the data.
3. A breakthrough in deep learning was developing algorithms that can successfully train deep neural networks by unsupervised learning on each layer before using the learned features for supervised learning on the final layer. This pretraining helps the network learn useful internal representations.
introduction to DL network deep learning.ppt by QuangMinhHuynh
1. Deep learning involves using neural networks with multiple hidden layers to learn representations of data with multiple levels of abstraction.
2. These neural networks are able to learn increasingly complex features from the input data as the number of layers increases. The layers closer to the input learn simple features that the later layers combine to learn more complex patterns.
3. A breakthrough in deep learning was developing training methods that can effectively train deep neural networks by unsupervised learning on each layer before combining them into a full network for supervised learning tasks like classification. This pretraining helps the network learn useful internal representations.
Understanding computer vision with Deep Learning by CloudxLab
Computer vision is a branch of computer science that deals with recognising objects and people and identifying patterns in visuals. It is broadly analogous to the vision of an animal.
Topics covered:
1. Overview of Machine Learning
2. Basics of Deep Learning
3. What is computer vision and its use-cases?
4. Various algorithms used in Computer Vision (mostly CNN)
5. Live hands-on demo of either Auto Cameraman or Face recognition system
6. What next?
This document provides an agenda for an introduction to deep learning presentation. It begins with an introduction to basic AI, machine learning, and deep learning terms. It then briefly discusses use cases of deep learning. The document outlines how to approach a deep learning problem, including which tools and algorithms to use. It concludes with a question and answer section.
This document discusses recurrent neural networks (RNNs) and their applications. It begins by explaining that RNNs can process input sequences of arbitrary lengths, unlike other neural networks. It then provides examples of RNN applications, such as predicting time series data, autonomous driving, natural language processing, and music generation. The document goes on to describe the fundamental concepts of RNNs, including recurrent neurons, memory cells, and different types of RNN architectures for processing input/output sequences. It concludes by demonstrating how to implement basic RNNs using TensorFlow's static_rnn function.
Natural Language Processing (NLP) is a field of artificial intelligence that deals with interactions between computers and human languages. NLP aims to program computers to process and analyze large amounts of natural language data. Some common NLP tasks include speech recognition, text classification, machine translation, question answering, and more. Popular NLP tools include Stanford CoreNLP, NLTK, OpenNLP, and TextBlob. Vectorization is commonly used to represent text in a way that can be used for machine learning algorithms like calculating text similarity. Tf-idf is a common technique used to weigh words based on their frequency and importance.
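The tf-idf weighting mentioned above is simple enough to compute by hand; this sketch uses a three-document toy corpus invented for illustration:

```python
import math

# Three tiny documents; tf-idf weighs terms that are frequent in a document
# but rare across the corpus.
docs = [
    "the cat sat".split(),
    "the dog sat".split(),
    "the cat ran".split(),
]

def tfidf(term, doc, corpus):
    tf = doc.count(term) / len(doc)              # term frequency in this doc
    df = sum(1 for d in corpus if term in d)     # document frequency
    idf = math.log(len(corpus) / df)             # inverse document frequency
    return tf * idf

# "the" appears in every document, so its idf (and tf-idf) is zero;
# "dog" appears in only one document, so it gets the highest weight there.
print(tfidf("the", docs[1], docs))
print(tfidf("dog", docs[1], docs))
```

Libraries differ in the details (smoothed idf, normalization), but the core idea is the same: down-weight ubiquitous words and up-weight distinctive ones before computing text similarity.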
- Naive Bayes is a classification technique based on Bayes' theorem that uses "naive" independence assumptions. It is easy to build and can perform well even with large datasets.
- It works by calculating the posterior probability for each class given predictor values using the Bayes theorem and independence assumptions between predictors. The class with the highest posterior probability is predicted.
- It is commonly used for text classification, spam filtering, and sentiment analysis due to its fast performance and high success rates compared to other algorithms.
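The posterior calculation in the bullets above can be sketched for a two-class spam filter; the priors and word probabilities below are invented toy numbers, not estimates from real data:

```python
# Tiny Naive Bayes: classify a message as spam/ham from word features.
# All probabilities below are invented toy values for illustration.
priors = {"spam": 0.4, "ham": 0.6}
# P(word | class), assumed conditionally independent given the class.
likelihood = {
    "spam": {"free": 0.8, "meeting": 0.1},
    "ham":  {"free": 0.2, "meeting": 0.7},
}

def posterior(words):
    """Unnormalized posteriors via Bayes' theorem; the highest score wins."""
    scores = {}
    for cls, prior in priors.items():
        p = prior
        for w in words:
            p *= likelihood[cls][w]      # the "naive" independence assumption
        scores[cls] = p
    return scores

scores = posterior(["free"])
print(max(scores, key=scores.get))   # "spam": 0.4 * 0.8 beats 0.6 * 0.2
```

Because the denominator of Bayes' theorem is the same for every class, comparing these unnormalized products is enough to pick the most probable class.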
An autoencoder is an artificial neural network that is trained to copy its input to its output. It consists of an encoder that compresses the input into a lower-dimensional latent-space encoding, and a decoder that reconstructs the output from this encoding. Autoencoders are useful for dimensionality reduction, feature learning, and generative modeling. When constrained by limiting the latent space or adding noise, autoencoders are forced to learn efficient representations of the input data. For example, a linear autoencoder trained with mean squared error performs principal component analysis.
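The final claim, that a linear autoencoder trained with mean squared error performs principal component analysis, can be illustrated directly: projecting onto the top principal component and mapping back is exactly a one-unit linear encoder/decoder. The synthetic data below is an assumption made only so the sketch is self-contained:

```python
import numpy as np

rng = np.random.default_rng(0)
# 2-D data that mostly varies along one direction, plus a little noise.
t = rng.normal(size=(200, 1))
X = np.hstack([t, 0.5 * t]) + 0.01 * rng.normal(size=(200, 2))
X -= X.mean(axis=0)

# A linear "encoder/decoder" built from the top principal component:
# encode = project onto the component, decode = map back to 2-D.
_, _, Vt = np.linalg.svd(X, full_matrices=False)
component = Vt[0]                    # direction of maximum variance
codes = X @ component                # 1-D latent encoding
X_hat = np.outer(codes, component)   # reconstruction from the latent code

mse = float(np.mean((X - X_hat) ** 2))
print(mse)   # tiny: almost all variance lies along the component
```

A linear autoencoder with a 1-unit bottleneck, trained to convergence on this data with MSE loss, would recover (up to sign and scale) this same projection; nonlinear activations and deeper stacks generalize the idea to curved manifolds.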
( Machine Learning & Deep Learning Specialization Training: https://goo.gl/5u2RiS )
This CloudxLab Reinforcement Learning tutorial helps you to understand Reinforcement Learning in detail. Below are the topics covered in this tutorial:
1) What is Reinforcement?
2) Reinforcement Learning an Introduction
3) Reinforcement Learning Example
4) Learning to Optimize Rewards
5) Policy Search - Brute Force Approach, Genetic Algorithms and Optimization Techniques
6) OpenAI Gym
7) The Credit Assignment Problem
8) Inverse Reinforcement Learning
9) Playing Atari with Deep Reinforcement Learning
10) Policy Gradients
11) Markov Decision Processes
Apache Spark - Key Value RDD - Transformations | Big Data Hadoop Spark Tutori... by CloudxLab
The document provides information about key-value RDD transformations and actions in Spark. It defines transformations like keys(), values(), groupByKey(), combineByKey(), sortByKey(), subtractByKey(), join(), leftOuterJoin(), rightOuterJoin(), and cogroup(). It also defines actions like countByKey() and lookup() that can be performed on pair RDDs. Examples are given showing how to use these transformations and actions to manipulate key-value RDDs.
Advanced Spark Programming - Part 2 | Big Data Hadoop Spark Tutorial | CloudxLab
Big Data with Hadoop & Spark Training: http://bit.ly/2kyRTuW
This CloudxLab Advanced Spark Programming tutorial helps you to understand Advanced Spark Programming in detail. Below are the topics covered in this slide:
1) Shared Variables - Accumulators & Broadcast Variables
2) Accumulators and Fault Tolerance
3) Custom Accumulators - Version 1.x & Version 2.x
4) Examples of Broadcast Variables
5) Key Performance Considerations - Level of Parallelism
6) Serialization Format - Kryo
7) Memory Management
8) Hardware Provisioning
Apache Spark - Dataframes & Spark SQL - Part 2 | Big Data Hadoop Spark Tutori... by CloudxLab
Big Data with Hadoop & Spark Training: http://bit.ly/2sm9c61
This CloudxLab Introduction to Spark SQL & DataFrames tutorial helps you to understand Spark SQL & DataFrames in detail. Below are the topics covered in this slide:
1) Loading XML
2) What is RPC - Remote Procedure Call
3) Loading AVRO
4) Data Sources - Parquet
5) Creating DataFrames From Hive Table
6) Setting up Distributed SQL Engine
Apache Spark - Dataframes & Spark SQL - Part 1 | Big Data Hadoop Spark Tutori... by CloudxLab
Big Data with Hadoop & Spark Training: http://bit.ly/2sf2z6i
This CloudxLab Introduction to Spark SQL & DataFrames tutorial helps you to understand Spark SQL & DataFrames in detail. Below are the topics covered in this slide:
1) Introduction to DataFrames
2) Creating DataFrames from JSON
3) DataFrame Operations
4) Running SQL Queries Programmatically
5) Datasets
6) Inferring the Schema Using Reflection
7) Programmatically Specifying the Schema
Apache Spark - Running on a Cluster | Big Data Hadoop Spark Tutorial | CloudxLab
Big Data with Hadoop & Spark Training: http://bit.ly/2IUsWca
This CloudxLab Running in a Cluster tutorial helps you to understand running Spark in the cluster in detail. Below are the topics covered in this tutorial:
1) Spark Runtime Architecture
2) Driver Node
3) Scheduling Tasks on Executors
4) Understanding the Architecture
5) Cluster Managers
6) Executors
7) Launching a Program using spark-submit
8) Local Mode & Cluster-Mode
9) Installing Standalone Cluster
10) Cluster Mode - YARN
11) Launching a Program on YARN
12) Cluster Mode - Mesos and AWS EC2
13) Deployment Modes - Client and Cluster
14) Which Cluster Manager to Use?
15) Common flags for spark-submit
Introduction to SparkR | Big Data Hadoop Spark Tutorial | CloudxLab
Big Data with Hadoop & Spark Training: http://bit.ly/2LCTufA
This CloudxLab Introduction to SparkR tutorial helps you to understand SparkR in detail. Below are the topics covered in this tutorial:
1) SparkR (R on Spark)
2) SparkR DataFrames
3) Launch SparkR
4) Creating DataFrames from Local DataFrames
5) DataFrame Operation
6) Creating DataFrames - From JSON
7) Running SQL Queries from SparkR
Introduction to NoSQL | Big Data Hadoop Spark Tutorial | CloudxLab
1) NoSQL databases are non-relational and schema-free, providing alternatives to SQL databases for big data and high availability applications.
2) Common NoSQL database models include key-value stores, column-oriented databases, document databases, and graph databases.
3) The CAP theorem states that a distributed data store can only provide two out of three guarantees around consistency, availability, and partition tolerance.
Introduction to MapReduce - Hadoop Streaming | Big Data Hadoop Spark Tutorial... by CloudxLab
Big Data with Hadoop & Spark Training: http://bit.ly/2sh5b3E
This CloudxLab Hadoop Streaming tutorial helps you to understand Hadoop Streaming in detail. Below are the topics covered in this tutorial:
1) Hadoop Streaming and Why Do We Need it?
2) Writing Streaming Jobs
3) Testing Streaming jobs and Hands-on on CloudxLab
Introduction To TensorFlow | Deep Learning Using TensorFlow | CloudxLab
This document provides instructions for getting started with TensorFlow using a free CloudxLab. It outlines the following steps:
1. Open CloudxLab and enroll if not already enrolled. Otherwise go to "My Lab".
2. In "My Lab", open Jupyter and run commands to clone an ML repository containing TensorFlow examples.
3. Go to the deep learning folder in Jupyter and open the TensorFlow notebook to get started with examples.
Introduction to Deep Learning | CloudxLab
( Machine Learning & Deep Learning Specialization Training: https://goo.gl/goQxnL )
This CloudxLab Deep Learning tutorial helps you to understand Deep Learning in detail. Below are the topics covered in this tutorial:
1) What is Deep Learning
2) Deep Learning Applications
3) Artificial Neural Network
4) Deep Learning Neural Networks
5) Deep Learning Frameworks
6) AI vs Machine Learning
In this tutorial, we will learn the following topics -
+ The Curse of Dimensionality
+ Main Approaches for Dimensionality Reduction
+ PCA - Principal Component Analysis
+ Kernel PCA
+ LLE
+ Other Dimensionality Reduction Techniques
In this tutorial, we will learn the following topics -
+ Voting Classifiers
+ Bagging and Pasting
+ Random Patches and Random Subspaces
+ Random Forests
+ Boosting
+ Stacking
In this tutorial, we will learn the following topics -
+ Training and Visualizing a Decision Tree
+ Making Predictions
+ Estimating Class Probabilities
+ The CART Training Algorithm
+ Computational Complexity
+ Gini Impurity or Entropy?
+ Regularization Hyperparameters
+ Regression
+ Instability
An invited talk given by Mark Billinghurst on Research Directions for Cross Reality Interfaces. This was given on July 2nd 2024 as part of the 2024 Summer School on Cross Reality in Hagenberg, Austria (July 1st - 7th)
INDIAN AIR FORCE FIGHTER PLANES LIST.pdf by jackson110191
These fighter aircraft have uses outside of traditional combat situations. They are essential in defending India's territorial integrity, averting dangers, and delivering aid to those in need during natural calamities. Additionally, the IAF improves its interoperability and fortifies international military alliances by working together and conducting joint exercises with other air forces.
How RPA Help in the Transportation and Logistics Industry.pptx by SynapseIndia
Revolutionize your transportation processes with our cutting-edge RPA software. Automate repetitive tasks, reduce costs, and enhance efficiency in the logistics sector with our advanced solutions.
Understanding Insider Security Threats: Types, Examples, Effects, and Mitigat... by Bert Blevins
Today’s digitally connected world presents a wide range of security challenges for enterprises. Insider security threats are particularly noteworthy because they have the potential to cause significant harm. Unlike external threats, insider risks originate from within the company, making them more subtle and challenging to identify. This blog aims to provide a comprehensive understanding of insider security threats, including their types, examples, effects, and mitigation techniques.
Details of description part II: Describing images in practice - Tech Forum 2024 by BookNet Canada
This presentation explores the practical application of image description techniques. Familiar guidelines will be demonstrated in practice, and descriptions will be developed “live”! If you have learned a lot about the theory of image description techniques but want to feel more confident putting them into practice, this is the presentation for you. There will be useful, actionable information for everyone, whether you are working with authors, colleagues, alone, or leveraging AI as a collaborator.
Link to presentation recording and transcript: https://bnctechforum.ca/sessions/details-of-description-part-ii-describing-images-in-practice/
Presented by BookNet Canada on June 25, 2024, with support from the Department of Canadian Heritage.
Scaling Connections in PostgreSQL | Postgres Bangalore (PGBLR) Meetup-2 - Mydbops
This presentation, delivered at the Postgres Bangalore (PGBLR) Meetup-2 on June 29th, 2024, dives deep into connection pooling for PostgreSQL databases. Aakash M, a PostgreSQL Tech Lead at Mydbops, explores the challenges of managing numerous connections and explains how connection pooling optimizes performance and resource utilization.
Key Takeaways:
* Understand why connection pooling is essential for high-traffic applications
* Explore various connection poolers available for PostgreSQL, including pgbouncer
* Learn the configuration options and functionalities of pgbouncer
* Discover best practices for monitoring and troubleshooting connection pooling setups
* Gain insights into real-world use cases and considerations for production environments
This presentation is ideal for:
* Database administrators (DBAs)
* Developers working with PostgreSQL
* DevOps engineers
* Anyone interested in optimizing PostgreSQL performance
Contact info@mydbops.com for PostgreSQL Managed, Consulting and Remote DBA Services
UiPath Community Day Kraków: Devs4Devs ConferenceUiPathCommunity
We are honored to launch and host this event for our UiPath Polish Community, with the help of our partners - Proservartner!
We certainly hope we have managed to spike your interest in the subjects to be presented and the incredible networking opportunities at hand, too!
Check out our proposed agenda below 👇👇
08:30 ☕ Welcome coffee (30')
09:00 Opening note/ Intro to UiPath Community (10')
Cristina Vidu, Global Manager, Marketing Community @UiPath
Dawid Kot, Digital Transformation Lead @Proservartner
09:10 Cloud migration - Proservartner & DOVISTA case study (30')
Marcin Drozdowski, Automation CoE Manager @DOVISTA
Pawel Kamiński, RPA developer @DOVISTA
Mikolaj Zielinski, UiPath MVP, Senior Solutions Engineer @Proservartner
09:40 From bottlenecks to breakthroughs: Citizen Development in action (25')
Pawel Poplawski, Director, Improvement and Automation @McCormick & Company
Michał Cieślak, Senior Manager, Automation Programs @McCormick & Company
10:05 Next-level bots: API integration in UiPath Studio (30')
Mikolaj Zielinski, UiPath MVP, Senior Solutions Engineer @Proservartner
10:35 ☕ Coffee Break (15')
10:50 Document Understanding with my RPA Companion (45')
Ewa Gruszka, Enterprise Sales Specialist, AI & ML @UiPath
11:35 Power up your Robots: GenAI and GPT in REFramework (45')
Krzysztof Karaszewski, Global RPA Product Manager
12:20 🍕 Lunch Break (1hr)
13:20 From Concept to Quality: UiPath Test Suite for AI-powered Knowledge Bots (30')
Kamil Miśko, UiPath MVP, Senior RPA Developer @Zurich Insurance
13:50 Communications Mining - focus on AI capabilities (30')
Thomasz Wierzbicki, Business Analyst @Office Samurai
14:20 Polish MVP panel: Insights on MVP award achievements and career profiling
Best Practices for Effectively Running dbt in Airflow.pdfTatiana Al-Chueyr
As a popular open-source library for analytics engineering, dbt is often used in combination with Airflow. Orchestrating and executing dbt models as DAGs ensures an additional layer of control over tasks, observability, and provides a reliable, scalable environment to run dbt models.
This webinar will cover a step-by-step guide to Cosmos, an open source package from Astronomer that helps you easily run your dbt Core projects as Airflow DAGs and Task Groups, all with just a few lines of code. We’ll walk through:
- Standard ways of running dbt (and when to utilize other methods)
- How Cosmos can be used to run and visualize your dbt projects in Airflow
- Common challenges and how to address them, including performance, dependency conflicts, and more
- How running dbt projects in Airflow helps with cost optimization
Webinar given on 9 July 2024
Quality Patents: Patents That Stand the Test of TimeAurora Consulting
Is your patent a vanity piece of paper for your office wall? Or is it a reliable, defendable, assertable, property right? The difference is often quality.
Is your patent simply a transactional cost and a large pile of legal bills for your startup? Or is it a leverageable asset worthy of attracting precious investment dollars, worth its cost in multiples of valuation? The difference is often quality.
Is your patent application only good enough to get through the examination process? Or has it been crafted to stand the tests of time and varied audiences if you later need to assert that document against an infringer, find yourself litigating with it in an Article 3 Court at the hands of a judge and jury, God forbid, end up having to defend its validity at the PTAB, or even needing to use it to block pirated imports at the International Trade Commission? The difference is often quality.
Quality will be our focus for a good chunk of the remainder of this season. What goes into a quality patent, and where possible, how do you get it without breaking the bank?
** Episode Overview **
In this first episode of our quality series, Kristen Hansen and the panel discuss:
⦿ What do we mean when we say patent quality?
⦿ Why is patent quality important?
⦿ How to balance quality and budget
⦿ The importance of searching, continuations, and draftsperson domain expertise
⦿ Very practical tips, tricks, examples, and Kristen’s Musts for drafting quality applications
https://www.aurorapatents.com/patently-strategic-podcast.html
Comparison Table of DiskWarrior Alternatives.pdfAndrey Yasko
To help you choose the best DiskWarrior alternative, we've compiled a comparison table summarizing the features, pros, cons, and pricing of six alternatives.
Are you interested in dipping your toes in the cloud native observability waters, but as an engineer you are not sure where to get started with tracing problems through your microservices and application landscapes on Kubernetes? Then this is the session for you, where we take you on your first steps in an active open-source project that offers a buffet of languages, challenges, and opportunities for getting started with telemetry data.
The project is called openTelemetry, but before diving into the specifics, we’ll start with de-mystifying key concepts and terms such as observability, telemetry, instrumentation, cardinality, percentile to lay a foundation. After understanding the nuts and bolts of observability and distributed traces, we’ll explore the openTelemetry community; its Special Interest Groups (SIGs), repositories, and how to become not only an end-user, but possibly a contributor.We will wrap up with an overview of the components in this project, such as the Collector, the OpenTelemetry protocol (OTLP), its APIs, and its SDKs.
Attendees will leave with an understanding of key observability concepts, become grounded in distributed tracing terminology, be aware of the components of openTelemetry, and know how to take their first steps to an open-source contribution!
Key Takeaways: Open source, vendor neutral instrumentation is an exciting new reality as the industry standardizes on openTelemetry for observability. OpenTelemetry is on a mission to enable effective observability by making high-quality, portable telemetry ubiquitous. The world of observability and monitoring today has a steep learning curve and in order to achieve ubiquity, the project would benefit from growing our contributor community.
Implementations of Fused Deposition Modeling in real worldEmerging Tech
The presentation showcases the diverse real-world applications of Fused Deposition Modeling (FDM) across multiple industries:
1. **Manufacturing**: FDM is utilized in manufacturing for rapid prototyping, creating custom tools and fixtures, and producing functional end-use parts. Companies leverage its cost-effectiveness and flexibility to streamline production processes.
2. **Medical**: In the medical field, FDM is used to create patient-specific anatomical models, surgical guides, and prosthetics. Its ability to produce precise and biocompatible parts supports advancements in personalized healthcare solutions.
3. **Education**: FDM plays a crucial role in education by enabling students to learn about design and engineering through hands-on 3D printing projects. It promotes innovation and practical skill development in STEM disciplines.
4. **Science**: Researchers use FDM to prototype equipment for scientific experiments, build custom laboratory tools, and create models for visualization and testing purposes. It facilitates rapid iteration and customization in scientific endeavors.
5. **Automotive**: Automotive manufacturers employ FDM for prototyping vehicle components, tooling for assembly lines, and customized parts. It speeds up the design validation process and enhances efficiency in automotive engineering.
6. **Consumer Electronics**: FDM is utilized in consumer electronics for designing and prototyping product enclosures, casings, and internal components. It enables rapid iteration and customization to meet evolving consumer demands.
7. **Robotics**: Robotics engineers leverage FDM to prototype robot parts, create lightweight and durable components, and customize robot designs for specific applications. It supports innovation and optimization in robotic systems.
8. **Aerospace**: In aerospace, FDM is used to manufacture lightweight parts, complex geometries, and prototypes of aircraft components. It contributes to cost reduction, faster production cycles, and weight savings in aerospace engineering.
9. **Architecture**: Architects utilize FDM for creating detailed architectural models, prototypes of building components, and intricate designs. It aids in visualizing concepts, testing structural integrity, and communicating design ideas effectively.
Each industry example demonstrates how FDM enhances innovation, accelerates product development, and addresses specific challenges through advanced manufacturing capabilities.
RPA In Healthcare Benefits, Use Case, Trend And Challenges 2024.pptxSynapseIndia
Your comprehensive guide to RPA in healthcare for 2024. Explore the benefits, use cases, and emerging trends of robotic process automation. Understand the challenges and prepare for the future of healthcare automation
2. Training Deep Neural Nets
Training Deep Neural Nets
● In previous chapter
○ We introduced artificial neural networks and
○ Trained our first deep neural network
○ It was a shallow NN
■ With only two hidden layers
○ This shallow neural network will not work if
■ We have to deal with complex problems such as
■ Detecting hundreds of objects in high-resolution images
3. Training Deep Neural Nets
Training Deep Neural Nets
● In that case, we may need to train a deeper neural network containing
○ Many layers
○ Each layer containing hundreds of neurons
○ Connected by hundreds of thousands of connections
4. Training Deep Neural Nets
Training Deep Neural Nets
Question
What will be the challenges in training such a
deep neural network?
5. Training Deep Neural Nets
Training Deep Neural Nets
● We may face problem of vanishing gradients (which we will cover
shortly)
● Training such a large network will take a lot of time
● Such a model with millions of parameters may be prone to overfitting
6. Training Deep Neural Nets
Training Deep Neural Nets
● In this chapter we will
○ Go through the vanishing gradients problem
■ And explore solutions to it
○ Look at various optimizers that can speed up training large models
● We will also look at
○ Popular regularization techniques for large neural networks
8. Training Deep Neural Nets
Vanishing / Exploding Gradients Problem
● As discussed earlier
○ Backpropagation algorithm works by going from the
○ Output layer to the input layer
○ Propagating the error gradient on the way
9. Training Deep Neural Nets
Vanishing / Exploding Gradients Problem
● Once the algorithm computes the gradient of the cost function
○ With regards to each parameter in the network
○ Then it uses these gradients to update each parameter
10. Training Deep Neural Nets
Vanishing / Exploding Gradients Problem
● Here the problem is that
○ Gradients often get smaller and smaller
○ As the algorithm progresses down to the early layers
11. Training Deep Neural Nets
Vanishing / Exploding Gradients Problem
● Because of this,
○ The lower layers' connection weights remain virtually unchanged
○ And training never converges to a good solution
○ This is called the vanishing gradients problem
12. Training Deep Neural Nets
Vanishing / Exploding Gradients Problem
● Let’s understand Vanishing Gradient Problem with an example
13. Training Deep Neural Nets
Vanishing / Exploding Gradients Problem
● Let’s recall the sigmoid function
○ A popular activation function for ANNs in the classification context
○ Its output is in the range 0 to 1
Check the code to plot sigmoid function in the notebook
15. Training Deep Neural Nets
Vanishing / Exploding Gradients Problem
● Let’s look at the derivative of sigmoid function
Sigmoid Function: S(z) = 1 / (1 + e^(-z))
Derivative of Sigmoid: S'(z) = S(z) (1 - S(z))
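A quick NumPy check (not from the slides) confirming the two formulas above and the key fact used later — the sigmoid's derivative never exceeds ¼:

```python
import numpy as np

def sigmoid(z):
    """Logistic (sigmoid) function: output is always in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_deriv(z):
    """Derivative of the sigmoid: S(z) * (1 - S(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)

z = np.linspace(-10, 10, 1001)
print(sigmoid(0.0))            # 0.5
print(sigmoid_deriv(0.0))      # 0.25 -- the maximum of the derivative
print(sigmoid_deriv(z).max())  # never exceeds 0.25
```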
16. Training Deep Neural Nets
Vanishing / Exploding Gradients Problem
● Let’s plot the derivative of sigmoid function
Derivative of Sigmoid function
17. Training Deep Neural Nets
Vanishing / Exploding Gradients Problem
● Let’s plot the derivative of sigmoid function
● As we can see
○ The output of the derivative of the Sigmoid function is
○ Always between 0 and ¼ (0.25)
Derivative of Sigmoid function
18. Training Deep Neural Nets
Vanishing / Exploding Gradients Problem
● Now let’s look at the below univariate neural network
○ It has 2 hidden layers
○ act() is a sigmoid activation function
○ J returns the aggregate error of the model
Univariate 2-layer Neural Network
19. Training Deep Neural Nets
Vanishing / Exploding Gradients Problem
● Now as per the chain rule in backpropagation
○ Rate of change in error because of weight w1 is
∂J/∂w1 = (∂J/∂output) · (∂output/∂hidden2) · (∂hidden2/∂hidden1) · (∂hidden1/∂w1)
Univariate 2-layer Neural Network
20. Training Deep Neural Nets
Vanishing / Exploding Gradients Problem
● Let’s focus on an individual derivative for now
Univariate 2-layer Neural Network
21. Training Deep Neural Nets
Vanishing / Exploding Gradients Problem
● A typical approach of weight initialization in a neural network is to
○ Choose weights using a normal distribution with
■ Mean of 0 and
■ Standard deviation of 1
○ Hence, the weights in the neural network are usually
■ Between -1 and 1
22. Training Deep Neural Nets
Vanishing / Exploding Gradients Problem
● Now let’s come back to our individual derivative
● As we have seen earlier
○ The output of the derivative of the sigmoid function lies between 0 and ¼
● And we have just discussed that
○ Weights in the neural network are usually between -1 and 1
(in this product, the sigmoid derivative term is < ¼ and the weight term is < 1 in magnitude)
23. Training Deep Neural Nets
Vanishing / Exploding Gradients Problem
● Important - If we multiply two numbers between 0 and 1
○ Then the result will always be smaller
○ For example
○ ⅓ * ¼ = 1/12 (which is less than ⅓ and ¼)
● Here we are multiplying 2 values which are between 0 and 1
○ And the resulting gradient will be smaller
24. Training Deep Neural Nets
Vanishing / Exploding Gradients Problem
● Now let’s take another individual derivative
Univariate 2-layer Neural Network
25. Training Deep Neural Nets
Vanishing / Exploding Gradients Problem
● This derivative has
○ Two sigmoid activation functions
○ And here we multiply 4 values between 0 and 1
○ So this gradient will be much smaller than
○ The earlier derivative (∂output / ∂hidden2)
(here the two sigmoid derivative terms are each < ¼ and the two weight terms are each < 1 in magnitude)
26. Training Deep Neural Nets
Vanishing / Exploding Gradients Problem
● So we can see that in the backpropagation as we move backward
○ Gradient just becomes smaller and smaller in every layer
○ And it becomes tiny in the early layers (input layers or the first layers)
○ This is called the Vanishing Gradient Problem
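The shrinking chain product can be shown with a tiny calculation. The per-layer numbers here are illustrative assumptions: each weight has magnitude 0.8, and each sigmoid derivative takes its maximum possible value of ¼:

```python
# Each backpropagated factor is (weight x sigmoid derivative).
# Illustrative assumption: |weight| = 0.8, and the sigmoid derivative
# takes its *maximum* possible value of 0.25 at every layer.
n_layers = 20
w = 0.8
s_prime = 0.25

grad = (w * s_prime) ** n_layers
print(grad)  # 0.2 ** 20, about 1e-14: almost nothing reaches the first layer
```

Even with the most generous sigmoid derivative, twenty layers shrink the gradient by fourteen orders of magnitude.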
27. Training Deep Neural Nets
Vanishing / Exploding Gradients Problem
● Let’s understand it once again
● Below is a 2-layer neural network
Input Layer → Hidden Layer 1 → Hidden Layer 2 → Output Layer
Backpropagation runs in the reverse direction
28. Training Deep Neural Nets
Vanishing / Exploding Gradients Problem
● Gradients will be largest in the output layer
○ Hence output layer is easiest to train
Largest gradients in the output layer
Input Layer → Hidden Layer 1 → Hidden Layer 2 → Output Layer (backpropagation runs in reverse)
29. Training Deep Neural Nets
Vanishing / Exploding Gradients Problem
● Hidden layer 2 has
○ Smaller gradients than the output layer
Smaller gradients in hidden layer 2 than in the output layer
Input Layer → Hidden Layer 1 → Hidden Layer 2 → Output Layer (backpropagation runs in reverse)
30. Training Deep Neural Nets
Vanishing / Exploding Gradients Problem
● Hidden layer 1 has
○ Smaller gradients than hidden layer 2
Smaller gradients in hidden layer 1 than in hidden layer 2
Input Layer → Hidden Layer 1 → Hidden Layer 2 → Output Layer (backpropagation runs in reverse)
31. Training Deep Neural Nets
Vanishing / Exploding Gradients Problem
● As the input layer is farthest from the output layer
○ Its derivative will be the longest expression (using the chain rule)
○ Hence it will contain more sigmoid derivatives
○ And it will have the smallest derivative
○ This makes the lower layers slowest to train
Smallest derivative in the input layer
Input Layer → Hidden Layer 1 → Hidden Layer 2 → Output Layer (backpropagation runs in reverse)
33. Training Deep Neural Nets
Vanishing / Exploding Gradients Problem
First problem
● Since gradient becomes really small in early layers (input layers)
○ It becomes really slow to train the early layers
Flat surface - small gradients: Gradient Descent converges slowly. Larger gradients: Gradient Descent converges fast.
34. Training Deep Neural Nets
Vanishing / Exploding Gradients Problem
First problem
● Also because of small steps
○ Gradient Descent may converge at a local minimum instead of the global minimum
Flat surface - small gradients: Gradient Descent converges slowly. Larger gradients: Gradient Descent converges fast.
35. Training Deep Neural Nets
Vanishing / Exploding Gradients Problem
Second problem
● Since the later layers depend on the early layers
○ If early layers are not accurate
○ Then the later layers just build on this inaccuracy
○ And the entire neural net gets corrupted
● Early layers are responsible for
○ Detecting simple patterns and are
○ Building blocks of the neural network
○ Hence it becomes important that early layers are accurate
36. Training Deep Neural Nets
Vanishing / Exploding Gradients Problem
Second problem
● For example, in face recognition
○ Early layers detect the edges
○ Which get combined to form facial features later in the network
● And if early layers get it wrong
○ The result built up by the neural network will be wrong
Original Image | Image seen by the neural network
37. Training Deep Neural Nets
Vanishing / Exploding Gradients Problem
Exploding Gradients Problem
● Like vanishing gradients problem
○ We can also have exploding gradients problem
○ If the gradients are bigger than 1 (repeatedly multiplying numbers greater than 1
gives an ever larger result)
○ Because of this, some layers may get insanely large weights and
○ The algorithm diverges instead of converging
○ This is called Exploding Gradients Problem
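The mirror image of the vanishing case can be shown with another tiny calculation; the per-layer factor of 1.5 is an illustrative assumption, not a value from the slides:

```python
# If each backpropagated factor is even modestly above 1,
# the product grows exponentially with depth.
n_layers = 50
factor = 1.5  # illustrative per-layer factor > 1

grad = factor ** n_layers
print(grad)  # hundreds of millions: the gradient explodes instead of vanishing
```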
38. Training Deep Neural Nets
Vanishing / Exploding Gradients Problem
● As we have seen deep neural networks suffer from unstable gradients
○ Different layers may learn at widely different speeds
● Because of vanishing gradients problem
○ Deep neural networks were abandoned for a long time
○ Training the early layers correctly was the basis of the network
○ But it proved too difficult at the time because of
○ The available activation functions and hardware
39. Training Deep Neural Nets
Vanishing / Exploding Gradients Problem
● In 2010, Xavier Glorot and Yoshua Bengio published a paper titled
○ “Understanding the Difficulty of Training Deep Feedforward
Neural Networks”
● Authors of this paper suggested that root cause of vanishing gradient
problem is
○ Nature of the sigmoid activation function derivative
40. Training Deep Neural Nets
Vanishing / Exploding Gradients Problem
● If the input is a large positive or negative number
○ The sigmoid function saturates at 1 or 0 respectively
○ And its derivative becomes extremely close to 0
41. Training Deep Neural Nets
Vanishing / Exploding Gradients Problem
● Thus when backpropagation kicks in
○ There is no gradient to propagate back through the network
○ And the little gradient that exists gets diluted as
○ Backpropagation reaches the early layers
○ So there is nothing left for early layers
42. Training Deep Neural Nets
Question
So what is the solution of vanishing gradients
problem?
43. Training Deep Neural Nets
Answer:
Good strategy for initializing weights
&
Use better activation functions
44. Training Deep Neural Nets
Vanishing / Exploding Gradients Problem
● Kaiming He et al. suggested a strategy for initializing the weights
○ To avoid the vanishing gradients problem
○ It’s called He initialization
○ With scaling parameters adapted to the activation function used
45. Training Deep Neural Nets
Vanishing / Exploding Gradients Problem
HE Initialization
import tensorflow as tf

reset_graph()

n_inputs = 28 * 28  # MNIST
n_hidden1 = 300

X = tf.placeholder(tf.float32, shape=(None, n_inputs), name="X")

he_init = tf.contrib.layers.variance_scaling_initializer()
hidden1 = tf.layers.dense(X, n_hidden1, activation=tf.nn.relu,
                          kernel_initializer=he_init, name="hidden1")
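The idea behind the variance-scaling initializer can be sketched in plain NumPy (a sketch of the concept, not TensorFlow's implementation): for ReLU layers, He initialization draws weights from a normal distribution with standard deviation √(2 / fan_in).

```python
import numpy as np

def he_init(fan_in, fan_out, seed=0):
    # He initialization for ReLU layers: N(0, sqrt(2 / fan_in))
    rng = np.random.default_rng(seed)
    stddev = np.sqrt(2.0 / fan_in)
    return rng.normal(0.0, stddev, size=(fan_in, fan_out))

W = he_init(28 * 28, 300)  # e.g. MNIST inputs -> 300-unit hidden layer
print(W.shape)             # (784, 300)
print(W.std())             # close to sqrt(2 / 784), about 0.0505
```

Scaling the variance by the fan-in keeps the variance of the activations roughly constant from layer to layer, which is what prevents the gradients from shrinking or growing systematically.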
46. Training Deep Neural Nets
ReLU Activation Function
● It turns out that ReLU activation function works better for Deep Neural
Networks
○ Because it does not saturate for positive values
○ And it is quite fast to compute
ReLU (z) = max (0, z)
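A minimal NumPy version of the formula above:

```python
import numpy as np

def relu(z):
    # ReLU(z) = max(0, z), applied element-wise
    return np.maximum(0.0, z)

print(relu(np.array([-2.0, -0.5, 0.0, 3.0])))  # [0. 0. 0. 3.]
```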
47. Training Deep Neural Nets
ReLU Activation Function
Derivative of ReLU activation function
● It is not differentiable at z = 0
Derivative of ReLU activation function
For positive inputs, the derivative is always 1
48. Training Deep Neural Nets
ReLU Activation Function
Derivative of ReLU activation function
● So with ReLU our gradients will never vanish
● As long as inputs are positive
Derivative of ReLU activation function
For positive inputs, the derivative is always 1
49. Training Deep Neural Nets
Question
Do you see any problem with the derivative of
ReLU activation function?
50. Training Deep Neural Nets
ReLU Activation Function
● ReLU suffers from a problem known as dying ReLUs
● For negative inputs derivative is zero
Derivative of ReLU activation function
For negative inputs, the derivative is always 0
51. Training Deep Neural Nets
ReLU Activation Function
Dying ReLUs
● Because of dying ReLUs, during training
○ Some neurons effectively die and
○ They stop outputting anything other than 0
○ It completely blocks the backpropagation
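A small NumPy illustration of a dead unit (the pre-activation values are hypothetical, chosen for illustration):

```python
import numpy as np

def relu_grad(z):
    # Derivative of ReLU: 1 for positive inputs, 0 otherwise
    return (z > 0).astype(float)

# A "dead" unit: its pre-activation is negative for every training input,
# so its output and its gradient are both identically zero -- weight
# updates through this unit stop entirely.
pre_activations = np.array([-3.2, -0.7, -1.1, -5.0])
print(np.maximum(0.0, pre_activations))  # outputs: all zeros
print(relu_grad(pre_activations))        # gradients: all zeros -> backprop blocked
```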
53. Training Deep Neural Nets
Leaky ReLU
● To solve the dying ReLUs problem we use
○ A variant of ReLU known as leaky ReLU
● Leaky ReLU outputs a very small gradient when the input is negative

LeakyReLU(z) = max(αz, z)

α is the hyperparameter which defines how much the function “leaks” and is typically set to α = 0.01
54. Training Deep Neural Nets
Leaky ReLU
● This small gradient ensures that the
○ Leaky ReLUs never die
● Recent research has shown that
○ Setting α = 0.2 (a huge leak) results in better performance
56. Training Deep Neural Nets
Leaky ReLU
# Implementing Leaky ReLU in TensorFlow
reset_graph()

X = tf.placeholder(tf.float32, shape=(None, n_inputs), name="X")

def leaky_relu(z, name=None):
    return tf.maximum(0.01 * z, z, name=name)

hidden1 = tf.layers.dense(X, n_hidden1, activation=leaky_relu,
                          name="hidden1")
57. Training Deep Neural Nets
Leaky ReLU
Follow the code in the notebook to train a
neural network on MNIST using the Leaky
ReLU
58. Training Deep Neural Nets
ELU Activation Function
● In 2015, Djork-Arné Clevert et al. proposed a new activation function
○ ELU - Exponential Linear Unit
● It outperformed all the ReLU variants in their experiments
○ Training time was reduced and
○ Neural network performed better on the test set
60. Training Deep Neural Nets
ELU Activation Function
● The ELU function is defined as
○ ELU(z) = z if z > 0
○ ELU(z) = α (exp(z) − 1) if z ≤ 0
● The hyperparameter α defines the value
○ That the ELU function approaches when z is a large negative number
○ α is usually set to 1
○ But we can tweak it like any other hyperparameter
61. Training Deep Neural Nets
ELU Activation Function
Advantage over ReLU
● It has a nonzero gradient for z < 0
○ Which avoids the dying units issue
(plots: ELU vs ReLU)
62. Training Deep Neural Nets
ELU Activation Function
Advantage over ReLU
● It is smooth everywhere including around z = 0
○ It helps speed up Gradient Descent
(plots: ELU vs ReLU)
63. Training Deep Neural Nets
ELU Activation Function
Drawbacks over ReLU
● Because of the use of exponential function
○ It is slower to compute than the ReLU
● But during training this slowness gets compensated by
○ The faster convergence rate
● However during testing
○ ELU networks are slower than the ReLU networks
64. Training Deep Neural Nets
ELU Activation Function
# ELU plot
import numpy as np
import matplotlib.pyplot as plt

def elu(z, alpha=1):
    return np.where(z < 0, alpha * (np.exp(z) - 1), z)

z = np.linspace(-5, 5, 200)  # input range for the plot
plt.plot(z, elu(z), "b-", linewidth=2)
plt.plot([-5, 5], [0, 0], 'k-')
plt.plot([-5, 5], [-1, -1], 'k--')
plt.plot([0, 0], [-2.2, 3.2], 'k-')
plt.grid(True)
plt.title(r"ELU activation function ($\alpha=1$)", fontsize=14)
plt.axis([-5, 5, -2.2, 3.2])
plt.show()
65. Training Deep Neural Nets
ELU Activation Function
# Implementing ELU in TensorFlow
reset_graph()
X = tf.placeholder(tf.float32, shape=(None, n_inputs), name="X")
hidden1 = tf.layers.dense(X, n_hidden1, activation=tf.nn.elu,
name="hidden1")
66. Training Deep Neural Nets
SELU Activation Function
● In June 2017, Günter Klambauer, Thomas Unterthiner and Andreas Mayr
○ Proposed SELU activation function
○ It outperforms the other activation functions
○ Very significantly for deep neural networks
○ Even for a 100-layer deep neural network
67. Training Deep Neural Nets
SELU Activation Function
SELU Function in Python
def selu(z,
         scale=1.0507009873554804934193349852946,
         alpha=1.6732632423543772848170429916717):
    return scale * elu(z, alpha)
68. Training Deep Neural Nets
SELU Activation Function
Plot SELU Function
plt.plot(z, selu(z), "b-", linewidth=2)
plt.plot([-5, 5], [0, 0], 'k-')
plt.plot([-5, 5], [-1.758, -1.758], 'k--')
plt.plot([0, 0], [-2.2, 3.2], 'k-')
plt.grid(True)
plt.title(r"SELU activation function", fontsize=14)
plt.axis([-5, 5, -2.2, 3.2])
plt.show()
69. Training Deep Neural Nets
SELU Activation Function
● With this activation function
○ Even a 100 layer deep neural network
○ Preserves roughly mean 0 and standard deviation 1 across all layers
○ Avoiding the exploding/vanishing gradients problem
70. Training Deep Neural Nets
SELU Activation Function
Check the mean and standard deviation in the deep layers
np.random.seed(42)
Z = np.random.normal(size=(500, 100))
for layer in range(100):
    W = np.random.normal(size=(100, 100), scale=np.sqrt(1/100))
    Z = selu(np.dot(Z, W))
    means = np.mean(Z, axis=1)
    stds = np.std(Z, axis=1)
    if layer % 10 == 0:
        print("Layer {}: {:.2f} < mean < {:.2f}, {:.2f} < std deviation < {:.2f}".format(
            layer, means.min(), means.max(), stds.min(), stds.max()))
71. Training Deep Neural Nets
SELU Activation Function
Follow the code in the notebook to create a
neural net for MNIST using the SELU activation
function
73. Training Deep Neural Nets
Which Activation Function to Use?
Answer
In general,
SELU > ELU > Leaky ReLU > ReLU > tanh > logistic
(risk of vanishing gradients increases toward the right of this ordering)
74. Training Deep Neural Nets
Which Activation Function to Use?
● If runtime performance is important then
○ Prefer Leaky ReLUs over ELUs
● Also instead of tweaking the hyperparameter α
○ We may use the default suggested values
■ α = 0.2 for the leaky ReLUs and
■ α = 1 for ELU
● If we have spare time and computing power
○ Use cross-validation to evaluate the other activation functions
76. Training Deep Neural Nets
Batch Normalization
● Using He initialization and proper activation functions
○ Like ELU or any variant of ReLU
○ The vanishing / exploding gradients problem is significantly reduced
○ But there is no guarantee that
○ This problem will not come back during training
● In 2015, Sergey Ioffe and Christian Szegedy
○ Proposed a technique called Batch Normalization (BN)
○ To address the vanishing/exploding gradients problems
77. Training Deep Neural Nets
Batch Normalization
● Batch Normalization helps in
○ Vanishing gradient problem and
○ It also helps the neural network to learn faster
● Let’s understand Batch Normalization
78. Training Deep Neural Nets
Batch Normalization - Feature Scaling
● As discussed earlier in machine learning projects
○ Gradient Descent does not work well
○ If the input features are on different scales
○ Like, say, the number of miles an individual has driven in the last 5 years
■ This data can have a widely varying scale
■ As someone might have driven 100,000 miles
■ While another person might have driven only 100 miles
■ So here the range is 100 - 100,000
79. Training Deep Neural Nets
Batch Normalization - Feature Scaling
● One of the techniques of feature scaling is
○ Standardization
● In Standardization, features are rescaled
○ So that output will have the properties of
○ Standard normal distribution with
■ Zero mean and
■ Unit variance
x' = (x − μ) / σ   (μ is the Mean, σ is the Standard Deviation)
80. Training Deep Neural Nets
Batch Normalization - Feature Scaling
Standardization
● The general method of calculation
○ Calculate the mean and standard deviation of each feature
○ Subtract the mean from each feature
○ Divide the result from the previous step by the feature’s standard deviation

Standardized Value: x' = (x − μ) / σ
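These steps in NumPy, reusing the miles-driven example from earlier (the individual values are illustrative):

```python
import numpy as np

# A feature with a widely varying scale: miles driven in the last 5 years
miles = np.array([100.0, 5000.0, 42000.0, 100000.0])

# Standardize: subtract the mean, divide by the standard deviation
standardized = (miles - miles.mean()) / miles.std()
print(standardized.mean())  # ~0 (zero mean)
print(standardized.std())   # 1.0 (unit standard deviation)
```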
81. Training Deep Neural Nets
Batch Normalization - Feature Scaling
Standardization
● As a preprocessing step
○ We apply standardization to the input dataset
○ So that all the features will have the same scale
■ With 0 mean
■ And unit standard deviation
○ And Gradient Descent converges faster
82. Training Deep Neural Nets
Batch Normalization - Feature Scaling
Two-Layer Neural Network with normalized input features
Input Layer → Hidden Layer 1 → Hidden Layer 2 → Output Layer
83. Training Deep Neural Nets
Batch Normalization - Feature Scaling
● As we have discussed, normalizing the input features
○ Helps in converging faster
● If we also normalize the hidden layers of a deep neural network
○ Then it will speed up the learning
○ This is what we do in Batch Normalization
■ We normalize the hidden layers
○ Now let’s understand how we do batch normalization in deep
neural networks
86. Training Deep Neural Nets
Batch Normalization - Algorithm
Algorithm

for T in 1 … number_of_mini_batches:
    Compute forward propagation for mini-batch X(T)
    In each hidden layer, normalize the inputs
    Use backpropagation and update the parameters
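The normalization step inside a hidden layer can be sketched in NumPy (a sketch with assumed shapes — a mini-batch of 64 instances with 10 activations — not the notebook's TensorFlow code):

```python
import numpy as np

def batch_norm_forward(X, gamma, beta, eps=1e-5):
    # Normalize each feature over the mini-batch, then scale and shift
    mu = X.mean(axis=0)                     # per-feature mini-batch mean
    var = X.var(axis=0)                     # per-feature mini-batch variance
    X_hat = (X - mu) / np.sqrt(var + eps)   # zero mean, unit variance
    return gamma * X_hat + beta             # learned scale (gamma) and shift (beta)

rng = np.random.default_rng(0)
X = rng.normal(5.0, 3.0, size=(64, 10))     # mini-batch: 64 instances, 10 features
Z = batch_norm_forward(X, gamma=np.ones(10), beta=np.zeros(10))
print(Z.mean(), Z.std())  # roughly 0 and 1 after normalization
```

With γ = 1 and β = 0 the layer's activations come out with zero mean and unit variance regardless of the scale of its inputs; during training γ and β are learned so the network can undo the normalization where that helps.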
87. Training Deep Neural Nets
Batch Normalization - Feature Scaling
● Let’s say we have a simple network with inputs x1, x2, x3, parameters W, b and prediction ŷ
● Here normalizing the input features helps in calculating W and b more efficiently
● Normalize the input features
○ Step 1 - Calculate the mean
○ Step 2 - Calculate the SD (standard deviation)
○ Step 3 - Normalize
91. Training Deep Neural Nets
Batch Normalization - Feature Scaling
● Generalize previous steps for the deep neural network
μB
is the mean,
evaluated over the
whole mini-batch B
92. Training Deep Neural Nets
Batch Normalization - Feature Scaling
● Generalize previous steps for the deep neural network
σB
is the standard
deviation, evaluated over
the whole mini-batch B
93. Training Deep Neural Nets
Batch Normalization - Feature Scaling
● Generalize previous steps for the deep neural network
mB
is number of
instances in the
mini-batch B
94. Training Deep Neural Nets
Batch Normalization - Feature Scaling
● Generalize previous steps for the deep neural network
X(i)
is the normalized
output
95. Training Deep Neural Nets
Batch Normalization - Feature Scaling
● Generalize previous steps for the deep neural network
ε is a tiny small number
to avoid division by zero
96. Training Deep Neural Nets
Batch Normalization - Feature Scaling
● Generalize previous steps for the deep neural network
γ and β are parameters
which are learnt during
training
97. Training Deep Neural Nets
Batch Normalization - Feature Scaling
● Generalize previous steps for the deep neural network
Z(i)
is the output of the
BN operations.
98. Training Deep Neural Nets
Batch Normalization - Feature Scaling
● Generalize previous steps for the deep neural network
● In general four
parameters are trained
for each
batch-normalized layer
○ μ (mean)
○ σ (SD)
○ γ and
○ β
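Putting the pieces together, the per-layer BN transform can be sketched in plain Python (γ and β are fixed here for illustration; in a real layer they are learned during training):

```python
def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-5):
    # mu_B: mean over the whole mini-batch
    mu = sum(batch) / len(batch)
    # sigma_B^2: variance over the whole mini-batch
    var = sum((x - mu) ** 2 for x in batch) / len(batch)
    # x_hat: normalized values; eps avoids division by zero
    x_hat = [(x - mu) / (var + eps) ** 0.5 for x in batch]
    # z: output of the BN operation, scaled by gamma and shifted by beta
    return [gamma * xh + beta for xh in x_hat]

print(batch_norm([1.0, 2.0, 3.0], gamma=2.0, beta=1.0))
```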
99. Training Deep Neural Nets
Question
At the test time how do we test the deep neural network
trained with batch normalization as there will not be any mini
batch to compute the mean and standard deviation?
100. Training Deep Neural Nets
Answer
By computing a moving average of the whole training set’s mean
and standard deviation during training
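This moving average can be sketched as follows (the momentum value is illustrative; TensorFlow’s batch-norm implementation maintains such running averages for us):

```python
class RunningStats:
    # Keep exponentially weighted averages of the batch statistics
    # seen during training; use .mean and .var at test time.
    def __init__(self, momentum=0.99):
        self.momentum = momentum
        self.mean = 0.0
        self.var = 1.0

    def update(self, batch_mean, batch_var):
        m = self.momentum
        self.mean = m * self.mean + (1 - m) * batch_mean
        self.var = m * self.var + (1 - m) * batch_var
```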
101. Training Deep Neural Nets
Follow code in the notebook to implement
Batch Normalization with TensorFlow
102. Training Deep Neural Nets
Batch Normalization
Drawbacks
● In batch normalization,
○ The neural network makes slower predictions
○ Due to the extra computations required at each layer
● If we need fast predictions
○ We should first check
■ How Plain ELU + He initialization performs
■ Before playing with batch normalization
104. Training Deep Neural Nets
Gradient Clipping
● We can reduce the exploding gradients problem
○ By clipping the gradients during backpropagation
○ So that they never exceed some threshold
○ This is called Gradient Clipping
105. Training Deep Neural Nets
Gradient Clipping
>>> threshold = 1.0
>>> optimizer = tf.train.GradientDescentOptimizer(learning_rate)
>>> grads_and_vars = optimizer.compute_gradients(loss)
>>> capped_gvs = [(tf.clip_by_value(grad, -threshold, threshold), var)
for grad, var in grads_and_vars]
>>> training_op = optimizer.apply_gradients(capped_gvs)
How to apply Gradient Clipping?
Step 1
● Specify threshold and optimizer
106. Training Deep Neural Nets
Gradient Clipping
>>> threshold = 1.0
>>> optimizer = tf.train.GradientDescentOptimizer(learning_rate)
>>> grads_and_vars = optimizer.compute_gradients(loss)
>>> capped_gvs = [(tf.clip_by_value(grad, -threshold, threshold), var)
for grad, var in grads_and_vars]
>>> training_op = optimizer.apply_gradients(capped_gvs)
How to apply Gradient Clipping?
Step 2
● Call the optimizer’s compute_gradients() method
107. Training Deep Neural Nets
Gradient Clipping
>>> threshold = 1.0
>>> optimizer = tf.train.GradientDescentOptimizer(learning_rate)
>>> grads_and_vars = optimizer.compute_gradients(loss)
>>> capped_gvs = [(tf.clip_by_value(grad, -threshold, threshold), var)
for grad, var in grads_and_vars]
>>> training_op = optimizer.apply_gradients(capped_gvs)
How to apply Gradient Clipping?
Step 3
● Create an operation to clip the gradients
○ Using the clip_by_value() function
108. Training Deep Neural Nets
Gradient Clipping
>>> threshold = 1.0
>>> optimizer = tf.train.GradientDescentOptimizer(learning_rate)
>>> grads_and_vars = optimizer.compute_gradients(loss)
>>> capped_gvs = [(tf.clip_by_value(grad, -threshold, threshold), var)
for grad, var in grads_and_vars]
>>> training_op = optimizer.apply_gradients(capped_gvs)
How to apply Gradient Clipping?
Step 4
● Create an operation to apply the
○ Clipped gradients using the optimizer’s
○ apply_gradients() method
109. Training Deep Neural Nets
Gradient Clipping
>>> threshold = 1.0
>>> optimizer = tf.train.GradientDescentOptimizer(learning_rate)
>>> grads_and_vars = optimizer.compute_gradients(loss)
>>> capped_gvs = [(tf.clip_by_value(grad, -threshold, threshold), var)
for grad, var in grads_and_vars]
>>> training_op = optimizer.apply_gradients(capped_gvs)
How to apply Gradient Clipping?
Step 5
● Run this training_op at every training step
○ It will compute gradients
○ Clip them between –1.0 and 1.0, and apply them
○ Note that threshold is a hyperparameter and can be tuned
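The clipping itself is just an element-wise bound; a plain-Python sketch of what tf.clip_by_value does to a list of gradient components:

```python
def clip_by_value(grads, threshold=1.0):
    # Cap every gradient component into [-threshold, threshold],
    # which is what tf.clip_by_value does element-wise.
    return [max(-threshold, min(threshold, g)) for g in grads]

print(clip_by_value([0.5, -3.2, 1.7]))
```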
110. Training Deep Neural Nets
Gradient Clipping
Follow code in the notebook to create a simple
neural net for MNIST and add gradient clipping
112. Training Deep Neural Nets
Reusing Pretrained Layers
Transfer Learning
● It is not a good idea to train a very large DNN from scratch
● We should find an existing neural network if possible
○ Which accomplishes a similar task we are trying to tackle
● If we can find such network
○ Then just reuse the lower layers (early layers) of this network
○ This is called Transfer Learning
113. Training Deep Neural Nets
Reusing Pretrained Layers
Transfer Learning
● There are two major advantages of Transfer Learning
○ It speeds up training considerably
○ It requires much less training data
114. Training Deep Neural Nets
Reusing Pretrained Layers
Transfer Learning - Examples
● Let’s say we have found an existing DNN
○ That was trained to classify pictures
○ Into 100 different categories like
■ Animals,
■ Plants,
■ Vehicles and
■ Everyday objects
115. Training Deep Neural Nets
Reusing Pretrained Layers
Transfer Learning - Examples
● Now we want to train a DNN to classify specific types of vehicles
● These tasks are similar to existing DNN and
● We should try to reuse the pretrained layers of the existing network
Reusing pretrained
layers
116. Training Deep Neural Nets
Reusing Pretrained Layers
Transfer Learning
● If input pictures in our task do not have the same size as the one in the
existing network
● Then we have to add a preprocessing step to resize them to the size
○ As expected by the existing model
● Also transfer learning works only when inputs in our task
○ Have similar low-level features as in the existing model
117. Training Deep Neural Nets
Reusing Pretrained Layers
Reusing a TensorFlow Model
● If the original model was trained using TensorFlow
○ We can simply restore it and train it on the new task
118. Training Deep Neural Nets
Reusing Pretrained Layers
Let’s see example of how to reuse a
TensorFlow model
119. Training Deep Neural Nets
Reusing Pretrained Layers
Reusing a TensorFlow Model - Step 1
● To reuse the model
○ First we need to load the graph structure
○ Using import_meta_graph()
>>> reset_graph()
>>> saver = tf.train.import_meta_graph("model_ckps/my_model_final.ckpt.meta")
120. Training Deep Neural Nets
Reusing Pretrained Layers
Reusing a TensorFlow Model - Step 2
● Next, get a handle on all the operations we will need for training
● If we do not know the graph structure, then
○ List all the operations using the code below
>>> for op in tf.get_default_graph().get_operations():
print(op.name)
121. Training Deep Neural Nets
Reusing Pretrained Layers
Reusing a TensorFlow Model - Step 3
● Once we know which operations we need, then
○ We can get a handle on them using the graph’s
■ get_operation_by_name() or
■ get_tensor_by_name() methods
>>> X = tf.get_default_graph().get_tensor_by_name("X:0")
>>> y = tf.get_default_graph().get_tensor_by_name("y:0")
>>> accuracy = tf.get_default_graph().get_tensor_by_name("eval/accuracy:0")
>>> training_op = tf.get_default_graph().get_operation_by_name("GradientDescent")
122. Training Deep Neural Nets
Reusing Pretrained Layers
Reusing a TensorFlow Model - Step 4
● Now we can start a session, restore the model's state and continue
training on our data
with tf.Session() as sess:
    saver.restore(sess, "model_ckps/my_model_final.ckpt")
    for epoch in range(n_epochs):
        for iteration in range(mnist.train.num_examples // batch_size):
            X_batch, y_batch = mnist.train.next_batch(batch_size)
            sess.run(training_op, feed_dict={X: X_batch, y: y_batch})
        accuracy_val = accuracy.eval(feed_dict={X: mnist.test.images,
                                                y: mnist.test.labels})
        print(epoch, "Test accuracy:", accuracy_val)
    save_path = saver.save(sess, "model_ckps/my_new_model_final.ckpt")
123. Training Deep Neural Nets
Reusing Pretrained Layers
Reusing only part of the original model
124. Training Deep Neural Nets
Reusing Pretrained Layers
Reusing only part of the original model
● In general, we restore only part of the original model
○ Specifically the early layers
○ Let’s restore only hidden layers 1, 2 and 3
125. Training Deep Neural Nets
Reusing Pretrained Layers
Reusing only part of the original model
● Restore only hidden layers 1, 2 and 3
Get all trainable variables in hidden layers 1 to 3
126. Training Deep Neural Nets
Reusing Pretrained Layers
Reusing only part of the original model
● Restore only hidden layers 1, 2 and 3
Create a dictionary mapping the name of each variable in the original
model to its name in the new model
127. Training Deep Neural Nets
Reusing Pretrained Layers
Reusing only part of the original model
● Restore only hidden layers 1, 2 and 3
Create a Saver that will restore only original model
128. Training Deep Neural Nets
Reusing Pretrained Layers
Reusing only part of the original model
● Restore only hidden layers 1, 2 and 3
Create another Saver to save the entire new model, not just layers 1 to 3
129. Training Deep Neural Nets
Reusing Pretrained Layers
Reusing only part of the original model
● Restore only hidden layers 1, 2 and 3
Start the session
130. Training Deep Neural Nets
Reusing Pretrained Layers
Reusing only part of the original model
● Restore only hidden layers 1, 2 and 3
Restore the variables from the original model’s layers 1 to 3
131. Training Deep Neural Nets
Reusing Pretrained Layers
Reusing only part of the original model
● Restore only hidden layers 1, 2 and 3
Train the new model
132. Training Deep Neural Nets
Reusing Pretrained Layers
Reusing only part of the original model
● Restore only hidden layers 1, 2 and 3
Save the whole model
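The idea behind these steps, independent of TensorFlow’s Saver machinery, is simply “copy the lower-layer weights, leave the rest freshly initialized”. A framework-agnostic sketch with made-up layer names and values:

```python
# Pretrained weights keyed by layer name (names and values are made up).
pretrained = {"hidden1": [0.1, 0.2], "hidden2": [0.3], "hidden3": [0.4],
              "hidden4": [9.9], "outputs": [8.8]}

# The new model starts with fresh (here: zero) weights everywhere.
new_model = {name: [0.0] * len(w) for name, w in pretrained.items()}

# Restore only hidden layers 1 to 3; hidden4 and outputs stay fresh.
for name in ("hidden1", "hidden2", "hidden3"):
    new_model[name] = list(pretrained[name])
```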
133. Training Deep Neural Nets
Reusing Pretrained Layers
Follow the complete code to restore only
hidden layers 1, 2 and 3 in the notebook
134. Training Deep Neural Nets
Reusing Pretrained Layers
Reusing Models from Other Frameworks
135. Training Deep Neural Nets
Reusing Models from Other Frameworks
● If the model was trained using another framework
○ Such as Theano
○ Then we need to load the weights manually
● Let’s see the example of
○ How we would copy the weight and biases from the first hidden layer
of a model trained using another framework
136. Training Deep Neural Nets
Reusing Models from Other Frameworks
● Step 1
Load the weights from the other framework manually
137. Training Deep Neural Nets
Reusing Models from Other Frameworks
● Step 2
● Find the initializer’s assignment operation for every variable
○ That we want to reuse
138. Training Deep Neural Nets
Reusing Models from Other Frameworks
● Step 2
● The weights variable created by the tf.layers.dense() function is called
"kernel"
139. Training Deep Neural Nets
Reusing Models from Other Frameworks
● Step 2
● Get the initialization value of every variable that we want to reuse
140. Training Deep Neural Nets
Reusing Models from Other Frameworks
● Step 3
● When we run the initializer, we replace the initialization values with the
ones we want, using a feed_dict
141. Training Deep Neural Nets
Reusing Models from Other Frameworks
Check the complete code of “reusing models
from other frameworks” in the notebook
143. Training Deep Neural Nets
Freezing the Lower Layers
● As discussed earlier, the lower layers detect low-level features
○ So we can reuse these lower layers as they are
○ This is also called freezing lower layers
● While training a new DNN
○ We generally freeze lower-layer weights
○ So that higher-layer weights will be easier to train
○ Because they won’t have to learn a moving target
144. Training Deep Neural Nets
Freezing the Lower Layers
● To freeze the lower layers during training
○ We give the optimizer the list of variables to train, excluding the
variables from the lower layers
>>> train_vars = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES,
scope="hidden[34]|outputs")
>>> training_op = optimizer.minimize(loss, var_list=train_vars)
145. Training Deep Neural Nets
Freezing the Lower Layers
● Freeze the lower layers - Step 1
>>> train_vars = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES,
scope="hidden[34]|outputs")
>>> training_op = optimizer.minimize(loss, var_list=train_vars)
● Gets list of all the trainable variables
○ In the hidden layers 3 and 4 and
○ In the output layer
● This leaves out the variables
○ In the hidden layers 1 and 2
146. Training Deep Neural Nets
Freezing the Lower Layers
● Freeze the lower layers - Step 2
>>> train_vars = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES,
scope="hidden[34]|outputs")
>>> training_op = optimizer.minimize(loss, var_list=train_vars)
● Next we provide this restricted list of trainable variables
○ To the optimizer’s minimize() function
● That’s it
○ Now hidden layer 1 and 2 are frozen
147. Training Deep Neural Nets
Reusing Pretrained Layers
Tweaking, Dropping, or Replacing the Upper
Layers
148. Training Deep Neural Nets
Tweaking, Dropping, Replacing the Upper Layers
● While training a new DNN using existing DNN
○ The output layer of the original model is usually replaced
○ As it is most likely not useful at all for the new task
○ Also it may not even have the right number of
○ Outputs/classes for the new task
● Also the upper hidden layers of the original model
○ Are less likely to be useful
○ As compared to early layers
149. Training Deep Neural Nets
Tweaking, Dropping, Replacing the Upper Layers
Question
How do we find the right number of layers to
reuse?
150. Training Deep Neural Nets
Tweaking, Dropping, Replacing the Upper Layers
● Try freezing all the copied layers first
○ Then train the model and see how it performs
● Then try unfreezing one or two top hidden layers
○ So that backpropagation can tweak them
○ And see if performance improves
● The more training data we have, the more layers we can unfreeze
151. Training Deep Neural Nets
Tweaking, Dropping, Replacing the Upper Layers
● If we still can not get good performance and we have little training data
○ Then try dropping the top hidden layers
○ And freeze all remaining hidden layers again
● We can iterate until we find the right number of layers to reuse
● If we have plenty of training data then
○ Try replacing the top hidden layers
○ Instead of dropping them
○ Also add more hidden layers to get good performance
153. Training Deep Neural Nets
Model Zoos
● As we discussed we can reuse the existing pretrained neural network for
our new tasks
● But where can we find a trained neural network for the task similar to
ours?
154. Training Deep Neural Nets
Model Zoos
● The first place to look is our own catalog of models
○ This is why we should save all our models and
○ Organize them properly so that
○ We can retrieve them later
● Another option is to search in a model zoo
○ Many people after training their models
○ Release the trained models to the public
155. Training Deep Neural Nets
Model Zoos
● TensorFlow has its own model zoo available at
○ https://github.com/tensorflow/models
● It contains most of the image classification nets such as
○ VGG, Inception and ResNet
■ Including the code
■ The pretrained models and
■ Tools to download popular image datasets
156. Training Deep Neural Nets
Model Zoos
● Another popular model zoo is Caffe’s Model Zoo
○ https://github.com/BVLC/caffe/wiki/Model-Zoo
● It contains many computer vision models trained on various datasets
● We can also use below converter
○ To convert Caffe models to TensorFlow models
○ https://github.com/ethereon/caffe-tensorflow
158. Training Deep Neural Nets
Unsupervised Pretraining
● If we want to train a model for complex task
○ And we do not have much labeled training data
○ Also we could not find a pretrained model on similar task
● Then in this case how should we tackle the task?
159. Training Deep Neural Nets
Unsupervised Pretraining
● Try to gather more labeled training data
○ But if it is too hard or too expensive to get the training data
○ Then try to perform unsupervised pretraining
160. Training Deep Neural Nets
Unsupervised Pretraining
● If we have plenty of unlabelled training data then
○ Try to train the layers one by one
○ Starting with the lowest layer and then going up
○ Using an unsupervised feature detector algorithm such as
■ Restricted Boltzmann Machines (RBMs) or autoencoders
161. Training Deep Neural Nets
Unsupervised Pretraining
● Each layer is trained on the output of the
○ Previously trained layers
○ All layers except the one being trained are frozen
162. Training Deep Neural Nets
Unsupervised Pretraining
● Once all layers have been trained
○ We can fine-tune the network
○ Using supervised learning (with backpropagation)
● This is a long and tedious process
○ But often works well
163. Training Deep Neural Nets
Unsupervised Pretraining
● This technique was used by Geoffrey Hinton and his team in 2006
● It led to the revival of neural networks and the success of Deep Learning
● Until 2010, unsupervised pretraining (typically using RBMs)
○ Was the norm for deep nets
● Only after the vanishing gradients problem was alleviated
○ It became much more common to train
○ DNNs purely using backpropagation
164. Training Deep Neural Nets
Unsupervised Pretraining
● Unsupervised pretraining
○ Using autoencoders rather than RBMs (Restricted Boltzmann Machines) is
still a good option when we have a complex task to solve
■ And no similar pretrained model is available
■ And there is little labeled training data but a lot of unlabeled
training data available
165. Training Deep Neural Nets
Reusing Pretrained Layers
Pretraining on an Auxiliary Task
166. Training Deep Neural Nets
Pretraining on an Auxiliary Task
● Let’s say we want to build a system to recognize faces
● And as a training set
○ We may only have a few pictures of each individual
○ Clearly not enough to train a good classifier
○ And gathering hundreds of pictures of each person would not be practical
Solution??
167. Training Deep Neural Nets
Pretraining on an Auxiliary Task
Solution -
● We can download a lot of pictures of random people from the internet
● And train a first neural network to detect
○ If two different pictures are of the same person
● Such a network would learn good feature detectors for faces
● So reusing its lower layers would allow us to train
○ A good face classifier
○ Using little training data which we had
168. Training Deep Neural Nets
Pretraining on an Auxiliary Task
● It is cheap to gather unlabeled training data
○ Like in the previous example
○ We could download images from the internet almost for free
○ But it is quite expensive to label them
● A common technique is to
○ Label all the training examples as “good”
○ And then generate many new labeled training instances
○ By corrupting the good ones and
○ Label these corrupted instances as bad
169. Training Deep Neural Nets
Pretraining on an Auxiliary Task
● Then we can train a neural network
○ To classify these instances as good or bad
● For example
○ Download millions of sentences
○ Then label all of them as “good”
○ Then randomly change a word in each sentence
○ And label the resulting sentence as “bad”
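A minimal sketch of this corruption trick (the vocabulary and sentence are made up for the example):

```python
import random

def corrupt(sentence, vocabulary, rng):
    # Replace one randomly chosen word with a random vocabulary word,
    # turning a "good" sentence into a "bad" training example.
    words = sentence.split()
    i = rng.randrange(len(words))
    words[i] = rng.choice(vocabulary)
    return " ".join(words)

rng = random.Random(0)
good = "the dog sleeps"
bad = corrupt(good, ["they", "blue", "ran"], rng)
dataset = [(good, "good"), (bad, "bad")]
```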
170. Training Deep Neural Nets
Pretraining on an Auxiliary Task
● Now if neural network can tell that
○ “The dog sleeps” is a good sentence and
○ “The dog they” is a bad sentence
○ Then it probably knows a lot about language
● Reusing its lower layers will help in many language processing tasks
172. Training Deep Neural Nets
Faster Optimizers
1. Training a deep neural network can be painfully slow
2. So far we have seen four ways to speed up training
2.1. Applying a good initialization strategy for the connection weights
2.2. Using a good activation function
2.3. Using Batch Normalization
2.4. Reusing parts of a pretrained network
173. Training Deep Neural Nets
Faster Optimizers
● Speed boost also comes from using a faster optimizer
○ Than the Gradient Descent optimizer
● Popular optimizers are
○ Momentum optimization
○ Nesterov Accelerated Gradient
○ AdaGrad
○ RMSProp and
○ Adam optimization
Increasing order of performance
174. Training Deep Neural Nets
Faster Optimizers - Momentum optimization
Analogy
● Imagine a bowling ball rolling down a gentle slope on a smooth surface
● It will start out slowly, but it will quickly pick up momentum until it
eventually reaches terminal velocity.
● This is the very simple idea behind Momentum optimization, proposed by
Boris Polyak in 1964
175. Training Deep Neural Nets
Faster Optimizers - Momentum optimization
How is Momentum optimization different from Gradient Descent
● Regular Gradient Descent will simply take small regular steps down
the slope, so it will take much more time to reach the bottom.
● Gradient Descent simply updates the weights θ by directly subtracting
the gradient of the cost function J(θ) with regards to the weights
(∇θJ(θ)) multiplied by the learning rate η.
176. Training Deep Neural Nets
Faster Optimizers - Momentum optimization
How is Momentum optimization different from Gradient Descent
● The equation of Gradient descent is: θ ← θ – η∇θJ(θ).
● It does not care about what the earlier gradients were. If the local
gradient is tiny, it goes very slowly
● Momentum optimization cares a great deal about what previous
gradients were
177. Training Deep Neural Nets
Faster Optimizers - Momentum optimization
How does Momentum optimization work ?
● At each iteration, it adds the local gradient to the momentum vector
m, multiplied by the learning rate η,
● And it updates the weights by simply subtracting this momentum vector.
178. Training Deep Neural Nets
Faster Optimizers - Momentum optimization
How does Momentum optimization work ?
● In other words, the gradient is used as an acceleration, not as a speed.
● To simulate some sort of friction mechanism and prevent the momentum
from growing too large, the algorithm introduces a new
hyperparameter β, simply called the momentum, which must be set
between 0 (high friction) and 1 (no friction).
● A typical momentum value is 0.9.
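In equation form, each iteration computes m ← βm + η∇θJ(θ) and then θ ← θ − m. A minimal sketch on the 1-D cost J(θ) = θ² (an illustration, not TensorFlow’s implementation):

```python
def momentum_step(theta, m, grad, lr=0.1, beta=0.9):
    # m accumulates past gradients; beta (the momentum) acts as friction
    m = beta * m + lr * grad
    # the weights are updated by subtracting the momentum vector
    return theta - m, m

theta, m = 4.0, 0.0
for _ in range(200):
    grad = 2 * theta  # gradient of J(theta) = theta**2
    theta, m = momentum_step(theta, m, grad)
```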
179. Training Deep Neural Nets
Faster Optimizers - Momentum optimization
Advantages of Momentum optimization
● Gradient Descent goes down the steep slope quite fast, but then it takes
a very long time to go down the valley.
● Whereas Momentum optimization will roll down the bottom of the valley
faster and faster until it reaches the bottom (the optimum)
● In deep neural networks that don’t use Batch Normalization, the upper
layers will often end up having inputs with very different scales, so using
Momentum optimization helps a lot.
● It can also help roll past local optima.
180. Training Deep Neural Nets
Faster Optimizers - Momentum optimization
Disadvantage of Momentum optimization
● The one drawback of Momentum optimization is that it adds yet another
hyperparameter to tune.
● However, the momentum value of 0.9 usually works well in practice and
almost always goes faster than Gradient Descent.
181. Training Deep Neural Nets
Faster Optimizers - Momentum optimization
Implementing Momentum optimization
Implementing Momentum optimization in TensorFlow is easy : just replace
the GradientDescentOptimizer with the MomentumOptimizer
>>> optimizer = tf.train.MomentumOptimizer(learning_rate=learning_rate,
momentum=0.9)
182. Training Deep Neural Nets
Faster Optimizers - Nesterov Accelerated Gradient
● It is a small variant of Momentum optimization, proposed by Yurii
Nesterov in 1983, and is almost always faster than vanilla Momentum
optimization.
183. Training Deep Neural Nets
Faster Optimizers - Nesterov Accelerated Gradient
● The idea of Nesterov Momentum optimization, or Nesterov Accelerated
Gradient (NAG), is to
○ Measure the gradient of the cost function not at the local position but
slightly ahead in the direction of the momentum.
○ The only difference from vanilla Momentum optimization is that the
gradient is measured at θ + βm rather than at θ
184. Training Deep Neural Nets
Faster Optimizers - Nesterov Accelerated Gradient
● This small tweak works because in general the momentum vector will be
pointing in the right direction (i.e., toward the optimum),
● So it will be slightly more accurate to use the gradient measured a bit
farther in that direction rather than using the gradient at the original
position
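A minimal sketch on J(θ) = θ² (an illustration, not TensorFlow’s implementation; note that with the θ ← θ − m update used in this sketch, “slightly ahead in the direction of the momentum” means evaluating the gradient at θ − βm, while texts that write the update as θ ← θ + m place the look-ahead point at θ + βm):

```python
def nag_step(theta, m, grad_fn, lr=0.1, beta=0.9):
    # Measure the gradient slightly ahead, where the momentum
    # is about to carry the weights.
    grad = grad_fn(theta - beta * m)
    m = beta * m + lr * grad
    return theta - m, m

theta, m = 4.0, 0.0
for _ in range(200):
    theta, m = nag_step(theta, m, lambda t: 2 * t)  # J(theta) = theta**2
```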
185. Training Deep Neural Nets
Faster Optimizers - Nesterov Accelerated Gradient
● ∇1 represents the
gradient of the cost
function measured at the
starting point θ
● ∇2 represents the
gradient at the point
located at θ + βm
186. Training Deep Neural Nets
Faster Optimizers - Nesterov Accelerated Gradient
● The Nesterov update
ends up slightly closer to
the optimum.
● After a while, these small
improvements add up
and NAG ends up being
significantly faster than
regular Momentum
optimization
187. Training Deep Neural Nets
Faster Optimizers - Nesterov Accelerated Gradient
● Note that when the
momentum pushes the
weights across a valley,
∇1 continues to push
further across the valley,
while ∇2 pushes back
toward the bottom of
the Valley.
● This helps reduce
oscillations and thus
converges faster.
188. Training Deep Neural Nets
Faster Optimizers - Nesterov Accelerated Gradient
Implementing Nesterov Accelerated Gradient
NAG will almost always speed up training compared to regular Momentum
optimization. To use it, simply set use_nesterov=True when creating the
MomentumOptimizer:
>>> optimizer = tf.train.MomentumOptimizer(learning_rate=learning_rate,
momentum=0.9, use_nesterov=True)
189. Training Deep Neural Nets
Faster Optimizers - AdaGrad
● Gradient Descent starts by quickly going down the steepest slope, then
slowly goes down the bottom of the valley
● It would be nice if the algorithm could detect this early on and correct its
direction to point a bit more toward the global optimum
● The AdaGrad algorithm achieves this by scaling down the gradient vector
along the steepest dimensions
190. Training Deep Neural Nets
Faster Optimizers - AdaGrad
How does AdaGrad work ?
● The first step accumulates the square of the gradients into the vector s:
s ← s + ∇θJ(θ) ⊗ ∇θJ(θ)
● The ⊗ symbol represents the element-wise multiplication
● This vectorized form is equivalent to computing si ← si + (∂J(θ)/∂θi)²
for each element si of the vector s
191. Training Deep Neural Nets
Faster Optimizers - AdaGrad
How does AdaGrad work ?
● In other words, each si accumulates the squares of the partial derivative
of the cost function with regards to parameter θi
● If the cost function is steep along the ith dimension, then si will get
larger and larger at each iteration.
192. Training Deep Neural Nets
Faster Optimizers - AdaGrad
How does AdaGrad work ?
● The second step is almost identical to Gradient Descent, but with one
big difference:
○ The gradient vector is scaled down by a factor of √(s + ε):
θ ← θ − η ∇θJ(θ) ⊘ √(s + ε)
○ The ⊘ symbol represents the element-wise division, and ϵ is a
smoothing term to avoid division by zero, typically set to 10⁻¹⁰
193. Training Deep Neural Nets
Faster Optimizers - AdaGrad
How does AdaGrad work ?
● This vectorized form is equivalent to computing
θi ← θi − η (∂J(θ)/∂θi) / √(si + ε) for all parameters θi
● This algorithm decays the learning rate, but it does so faster for steep
dimensions than for dimensions with gentler slopes.
● This is called an adaptive learning rate.
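The two AdaGrad steps can be sketched for a vector of parameters (the learning rate and toy gradients are illustrative):

```python
def adagrad_step(theta, s, grad, lr=0.5, eps=1e-10):
    # Step 1: accumulate the squared gradients into s.
    s = [si + g * g for si, g in zip(s, grad)]
    # Step 2: gradient descent, scaled down by sqrt(s + eps) per dimension;
    # steep dimensions accumulate large s, so their later updates shrink fastest.
    theta = [t - lr * g / (si + eps) ** 0.5
             for t, g, si in zip(theta, grad, s)]
    return theta, s
```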
194. Training Deep Neural Nets
Faster Optimizers - AdaGrad
Advantages of AdaGrad
● It helps point the resulting updates more directly toward the global
optimum. One additional benefit is that it requires much less tuning of
the learning rate hyperparameter η
195. Training Deep Neural Nets
Faster Optimizers - AdaGrad
Disadvantages of AdaGrad
● AdaGrad often performs well for simple quadratic problems, but
unfortunately it often stops too early when training neural networks
● The learning rate gets scaled down so much that the algorithm ends
up stopping entirely before reaching the global optimum.
● So even though TensorFlow has an AdagradOptimizer, you should
not use it to train deep neural networks
● It may be efficient for simpler tasks such as Linear Regression
196. Training Deep Neural Nets
Faster Optimizers - RMSProp
● AdaGrad slows down a bit too fast and ends up never converging to the
global optimum
● The RMSProp algorithm fixes this by accumulating only the gradients
from the most recent iterations, as opposed to all the gradients since the
beginning of training
● It does so by using exponential decay in the first step
197. Training Deep Neural Nets
Faster Optimizers - RMSProp
● The decay rate β is typically set to 0.9
● It is once again a new hyperparameter, but this default value often works
well, so you may not need to tune it at all
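The two RMSProp steps, s ← βs + (1 − β)∇θJ(θ) ⊗ ∇θJ(θ) and θ ← θ − η∇θJ(θ) ⊘ √(s + ε), can be sketched as:

```python
def rmsprop_step(theta, s, grad, lr=0.01, beta=0.9, eps=1e-10):
    # Exponentially decaying average of squared gradients: old gradients
    # fade away instead of accumulating forever as in AdaGrad.
    s = [beta * si + (1 - beta) * g * g for si, g in zip(s, grad)]
    theta = [t - lr * g / (si + eps) ** 0.5
             for t, g, si in zip(theta, grad, s)]
    return theta, s
```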
198. Training Deep Neural Nets
Faster Optimizers - RMSProp
Implementing RMSProp
>>> optimizer =
tf.train.RMSPropOptimizer(learning_rate=learning_rate,
momentum=0.9, decay=0.9, epsilon=1e-10)
● Except on very simple problems, this optimizer almost always performs
much better than AdaGrad
● It also generally performs better than Momentum optimization and
Nesterov Accelerated Gradients
● In fact, it was the preferred optimization algorithm of many researchers
until Adam optimization came around
199. Training Deep Neural Nets
Faster Optimizers - Adam Optimization
● Adam, which stands for adaptive moment estimation, combines the ideas of
○ Momentum optimization
○ And RMSProp
● Just like Momentum optimization it keeps track of an exponentially
decaying average of past gradients
● And just like RMSProp it keeps track of an exponentially decaying average
of past squared gradients
201. Training Deep Neural Nets
● If you just look at steps 1, 2, and 5, you will notice Adam’s close similarity
to both Momentum optimization and RMSProp.
Faster Optimizers - Adam Optimization
202. Training Deep Neural Nets
Faster Optimizers - Adam Optimization
● The only difference is that step 1 computes an exponentially decaying
average rather than an exponentially decaying sum
● But these are actually equivalent except for a constant factor, the
decaying average is just 1 – β1 times the decaying sum
203. Training Deep Neural Nets
Faster Optimizers - Adam Optimization
● Steps 3 and 4 are somewhat of a technical detail
○ Since m and s are initialized at 0, they will be biased toward 0 at the
beginning of training
● So these two steps will help boost m and s at the beginning of training.
204. Training Deep Neural Nets
Faster Optimizers - Adam Optimization
● The momentum decay hyperparameter β1 is typically initialized to 0.9,
while the scaling decay hyperparameter β2 is often initialized to 0.999.
● As earlier, the smoothing term ϵ is usually initialized to a tiny number
such as 10⁻⁸
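The five steps can be sketched for a single scalar parameter (t is the iteration count, starting at 1; the hyperparameter defaults follow the values above):

```python
def adam_step(theta, m, s, grad, t, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    # Step 1: decaying average of past gradients (as in Momentum)
    m = b1 * m + (1 - b1) * grad
    # Step 2: decaying average of past squared gradients (as in RMSProp)
    s = b2 * s + (1 - b2) * grad * grad
    # Steps 3 and 4: bias correction, boosting m and s early in training
    m_hat = m / (1 - b1 ** t)
    s_hat = s / (1 - b2 ** t)
    # Step 5: the parameter update
    theta = theta - lr * m_hat / (s_hat ** 0.5 + eps)
    return theta, m, s
```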
205. Training Deep Neural Nets
Faster Optimizers - Adam Optimization
● Since Adam is an adaptive learning rate algorithm, like AdaGrad and
RMSProp, it requires less tuning of the learning rate hyperparameter η
● We can often use the default value η = 0.001, making Adam even easier
to use than Gradient Descent
206. Training Deep Neural Nets
Faster Optimizers - Adam Optimization
Implementing Adam Optimization in TensorFlow
>>> optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate)
207. Training Deep Neural Nets
Faster Optimizers - Learning Rate Scheduling
How do we find a good learning rate?
208. Training Deep Neural Nets
Faster Optimizers - Learning Rate Scheduling
● Finding a good learning rate can be tricky.
● If we set it way too high,
○ Training may actually diverge
● If we set it too low,
○ Training will eventually converge to the optimum, but it will take a
very long time.
209. Training Deep Neural Nets
Faster Optimizers - Learning Rate Scheduling
● If you set it slightly too high,
○ It will make progress very quickly at first,
○ But it will end up dancing around the optimum, never settling down
● We have to use an adaptive learning rate optimization algorithm such as
AdaGrad, RMSProp, or Adam,
○ But even then it may take time to settle
● If you have a limited computing budget, you may have to interrupt
training before it has converged properly, yielding a suboptimal solution
210. Training Deep Neural Nets
Faster Optimizers - Learning Rate Scheduling
● We may be able to find a fairly good learning rate by training our
network several times during just a few epochs using various learning
rates and comparing the learning curves
211. Training Deep Neural Nets
Faster Optimizers - Learning Rate Scheduling
The ideal learning rate will learn quickly and converge to good solution
212. Training Deep Neural Nets
Faster Optimizers - Learning Rate Scheduling
● We can do better than a constant learning rate:
● If we start with a high learning rate and then reduce it once it stops
making fast progress
● We can reach a good solution faster than with the optimal constant
learning rate.
213. Training Deep Neural Nets
Faster Optimizers - Learning Rate Scheduling
● There are many different strategies to reduce the learning rate during
training.
● These strategies are called learning schedules, the most common ones
are now discussed
214. Training Deep Neural Nets
Predetermined piecewise constant learning rate
● For example, set the learning rate to η₀ = 0.1 at first, then to η₁ = 0.001
after 50 epochs.
● Although this solution can work very well, it often requires fiddling
around to figure out the right learning rates and when to use them.
Faster Optimizers - Learning Rate Scheduling
215. Training Deep Neural Nets
Performance scheduling
● Measure the validation error every N steps, just like for early stopping,
and reduce the learning rate by a factor of λ when the error stops
dropping.
Exponential scheduling
● Set the learning rate to a function of the iteration number t:
η(t) = η₀ · 10^(-t/r). This works great, but it requires tuning η₀ and r.
The learning rate will drop by a factor of 10 every r steps.
Faster Optimizers - Learning Rate Scheduling
216. Training Deep Neural Nets
Faster Optimizers - Learning Rate Scheduling
Power scheduling
● Set the learning rate to η(t) = η₀ · (1 + t/r)^(-c).
● The hyperparameter c is typically set to 1.
● This is similar to exponential scheduling, but the learning rate drops much
more slowly.
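The three schedules above can be compared side by side in plain Python (a minimal sketch; the hyperparameter values are illustrative):

```python
eta0 = 0.1  # initial learning rate

def piecewise(t):
    """Predetermined piecewise constant: 0.1 at first, 0.001 after 50 epochs."""
    return 0.1 if t < 50 else 0.001

def exponential(t, r=50):
    """eta(t) = eta0 * 10^(-t/r): drops by a factor of 10 every r steps."""
    return eta0 * 10 ** (-t / r)

def power(t, r=50, c=1):
    """eta(t) = eta0 * (1 + t/r)^(-c): drops much more slowly."""
    return eta0 * (1 + t / r) ** (-c)

print(round(exponential(50), 3))  # 0.01 -- one factor of 10 after r steps
print(round(power(50), 3))        # 0.05 -- only halved after the same r steps
```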
217. Training Deep Neural Nets
Faster Optimizers - Learning Rate Scheduling
Implementing a learning schedule with TensorFlow
>>> initial_learning_rate = 0.1
>>> decay_steps = 10000
>>> decay_rate = 1/10
>>> global_step = tf.Variable(0, trainable=False)
>>> learning_rate = tf.train.exponential_decay(initial_learning_rate,
global_step, decay_steps, decay_rate)
>>> optimizer = tf.train.MomentumOptimizer(learning_rate,
momentum=0.9)
>>> training_op = optimizer.minimize(loss, global_step=global_step)
Run it on Notebook
218. Training Deep Neural Nets
Implementing a learning schedule with TensorFlow
Understanding previous code
● After setting the hyperparameter values, we create a nontrainable
variable global_step (initialized to 0) to keep track of the current training
iteration number.
● Then we define an exponentially decaying learning rate, with η₀ = 0.1 and
r = 10,000, using TensorFlow’s exponential_decay() function.
Faster Optimizers - Learning Rate Scheduling
219. Training Deep Neural Nets
Implementing a learning schedule with TensorFlow
Understanding previous code
● Next, we create an optimizer, in this example, a MomentumOptimizer
using this decaying learning rate.
● Finally, we create the training operation by calling the optimizer’s
minimize() method; since we pass it the global_step variable, it will
kindly take care of incrementing it.
Faster Optimizers - Learning Rate Scheduling
220. Training Deep Neural Nets
Since AdaGrad, RMSProp, and Adam optimization automatically reduce the learning rate during
training, it is not necessary to add an extra learning schedule.
For other optimization algorithms, using exponential decay or performance scheduling can
considerably speed up convergence.
Faster Optimizers - Learning Rate Scheduling
221. Training Deep Neural Nets
Faster Optimizers
● The conclusion is that we should generally use Adam optimization
○ We do not really have to know about its internals
○ Simply replace GradientDescentOptimizer with AdamOptimizer
○ With this small change, training will often be several times faster
223. Training Deep Neural Nets
"With four parameters I can fit an elephant and with five I can make him wiggle his trunk. "
-- John von Neumann, cited by Enrico Fermi in Nature 427
Overfitting
224. Training Deep Neural Nets
Avoid Overfitting Through Regularization
● Deep neural networks may have millions of parameters
● With so many parameters, the network
○ Has a huge amount of freedom
○ And can fit a variety of complex datasets
○ But it also becomes prone to overfitting
225. Training Deep Neural Nets
Avoid Overfitting Through Regularization
● In this section, we will go through
○ Some of the most popular regularization techniques
○ For neural networks and how to implement them with TensorFlow
■ Early stopping
■ ℓ1 and ℓ2 regularization
■ Dropout
■ Max-Norm Regularization and
■ Data augmentation
227. Training Deep Neural Nets
Avoid Overfitting Through Regularization
Early Stopping
228. Training Deep Neural Nets
Early Stopping
● As discussed in Machine Learning course
○ To avoid overfitting the training set
○ A great solution is early stopping
229. Training Deep Neural Nets
Early Stopping
● Stop training as soon as the validation error reaches a minimum
● This is called early stopping
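The idea can be sketched as a simple loop that remembers the best validation error seen so far and interrupts training after a number of checks without improvement (a minimal sketch; the `patience` parameter and the toy error curve are illustrative):

```python
def early_stopping(val_errors, patience=3):
    """Return the step with the lowest validation error, stopping
    once `patience` consecutive checks show no improvement."""
    best_error = float("inf")
    best_step = 0
    checks_without_progress = 0
    for step, err in enumerate(val_errors):
        if err < best_error:
            best_error, best_step = err, step
            checks_without_progress = 0
        else:
            checks_without_progress += 1
            if checks_without_progress >= patience:
                break   # interrupt training, roll back to best_step
    return best_step, best_error

# Toy validation-error curve: improves, then starts overfitting
errors = [0.9, 0.5, 0.3, 0.25, 0.27, 0.3, 0.35, 0.4]
print(early_stopping(errors))  # (3, 0.25)
```

In practice we would also save a snapshot of the model at each improvement, so we can restore the weights from the best step.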
230. Training Deep Neural Nets
Avoid Overfitting Through Regularization
ℓ1 and ℓ2 Regularization
231. Training Deep Neural Nets
ℓ1 and ℓ2 Regularization
● Just like we apply ℓ1 and ℓ2 regularization for simple linear models
○ We can apply the same regularization to constrain
○ Neural network’s connection weights (not biases)
● To do so in TensorFlow
○ Simply add the appropriate regularization terms to the cost function
232. Training Deep Neural Nets
ℓ1 and ℓ2 Regularization
● For example, suppose
○ We have just one hidden layer with weights weights1 and
○ One output layer with weights weights2
○ Then we can apply ℓ1 regularization like this
233. Training Deep Neural Nets
ℓ1 and ℓ2 Regularization
Follow the code in the notebook to implement
ℓ1 regularization manually assuming we have
only one hidden layer
234. Training Deep Neural Nets
ℓ1 and ℓ2 Regularization
● Manually applying ℓ1 regularization will not be convenient
○ If we have many layers
● In TensorFlow,
○ We can pass a regularization function to the tf.layers.dense()
function
○ Which computes the regularization loss
235. Training Deep Neural Nets
ℓ1 and ℓ2 Regularization
● This code creates a neural network
○ With two hidden layers and one output layer
○ It also creates nodes in the graph to compute
■ The ℓ1 regularization loss corresponding to each layer’s weights
○ TensorFlow automatically adds these nodes to a
■ Special collection containing all the regularization losses
236. Training Deep Neural Nets
ℓ1 and ℓ2 Regularization
● We just need to add
○ These regularization losses to the overall loss, as in the code below
● Important
○ Don’t forget to add the regularization losses to the overall loss
○ Or else they will simply be ignored
>>> reg_losses = tf.get_collection(tf.GraphKeys.REGULARIZATION_LOSSES)
>>> loss = tf.add_n([base_loss] + reg_losses, name="loss")
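The same idea can be sketched without TensorFlow: the ℓ1 penalty is just the sum of the absolute values of the connection weights, scaled and added to the base loss (toy values and names, for illustration only):

```python
def l1_penalty(weight_matrices, scale=0.001):
    """Sum of |w| over all connection weights (biases excluded)."""
    return scale * sum(abs(w) for weights in weight_matrices
                       for row in weights for w in row)

# Two layers' weight matrices (toy values); biases are not regularized
weights1 = [[0.5, -1.0], [2.0, 0.0]]
weights2 = [[-0.5, 1.0]]
base_loss = 0.42
loss = base_loss + l1_penalty([weights1, weights2])
print(round(loss, 5))  # 0.42 + 0.001 * (0.5+1.0+2.0+0.0+0.5+1.0) = 0.425
```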
237. Training Deep Neural Nets
ℓ1 and ℓ2 Regularization
Follow the code in the notebook to implement
ℓ1 regularization in neural network with two
hidden layers
239. Training Deep Neural Nets
Dropout
● Dropout is the most popular
○ Regularization technique for deep neural networks
● It was proposed by G. E. Hinton in 2012
● Even the state-of-the-art neural networks
○ Got a 1–2% accuracy boost
○ Simply by adding dropout
● 1-2% accuracy boost may not sound like a lot
○ But when a model has 95% accuracy
○ Then 2% accuracy boost means dropping the error rate by 40%
○ (Going from 5% error to roughly 3%)
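The arithmetic behind that claim, spelled out:

```python
# A 2% accuracy boost on a 95%-accurate model, in terms of error rate
old_error, new_error = 1 - 0.95, 1 - 0.97
relative_drop = (old_error - new_error) / old_error
print(round(relative_drop, 2))  # 0.4 -- a 40% drop in the error rate
```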
240. Training Deep Neural Nets
Dropout
● It is a fairly simple algorithm
● At every training step, every neuron
○ Including the input neurons but excluding the output neurons
○ Has a probability p of being temporarily “dropped out”
○ Meaning it will be entirely ignored during this training step
○ But it may be active during the next step
241. Training Deep Neural Nets
Dropout
● The hyperparameter p is called the dropout rate
○ And it is typically set to 50%
● After training, neurons don’t get dropped anymore
● Let’s understand this technique with an example
242. Training Deep Neural Nets
Dropout
Question
Would a company perform better if its
employees were told to toss a coin every
morning to decide whether or not to go to
work?
244. Training Deep Neural Nets
Dropout
● In that case the company would be forced to adapt its organization
○ No single person would be responsible for filling the coffee machine
○ Or cleaning the office
○ Or performing any other critical tasks
● So this expertise would have to be spread across many people
● Employees would have to learn to
○ Cooperate with many of their coworkers
245. Training Deep Neural Nets
Dropout
Question
What will be the advantages of such a system?
246. Training Deep Neural Nets
Dropout
● The company would become much more resilient
● If one person quits, it would not make much difference
● Not sure if this idea will work for companies
○ But it definitely works for neural networks
247. Training Deep Neural Nets
Dropout
● Neurons trained with dropout
○ Cannot co-adapt with their neighbouring neurons
○ They have to be as useful as possible on their own
○ They cannot rely excessively on just a few input neurons
○ They must pay attention to each of their input neurons
○ As a result of this
■ They end up being less sensitive to slight changes in the inputs
● In the end we get a more robust network that generalizes better
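The training-time behaviour can be sketched in a few lines: each activation is zeroed with probability p, and the survivors are scaled up by 1/(1 − p) so the expected output stays the same (this is the "inverted dropout" variant; a minimal sketch, not TensorFlow's implementation):

```python
import random

def dropout(activations, rate=0.5, training=True):
    """Zero each activation with probability `rate` during training,
    scaling the survivors by 1/(1 - rate); do nothing at test time."""
    if not training:
        return list(activations)
    keep = 1.0 - rate
    return [a / keep if random.random() < keep else 0.0
            for a in activations]

random.seed(0)
h = [0.2, -1.3, 0.8, 0.5]
print(dropout(h, rate=0.5))                  # some units zeroed, the rest doubled
print(dropout(h, rate=0.5, training=False))  # [0.2, -1.3, 0.8, 0.5]
```

The scaling is what lets the network be used unchanged after training, exactly as the slides describe.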
248. Training Deep Neural Nets
Dropout
● To implement dropout using TensorFlow
○ Just apply the dropout() function to the
○ Input layer and the output of every hidden layer
● During training, the dropout() function randomly drops some items
● After training, this function does nothing at all
>>> hidden1_drop = tf.layers.dropout(hidden1, dropout_rate,
training=training)
Just like batch normalization, set training to
True during training and to False when testing
249. Training Deep Neural Nets
Dropout
Follow the code in the notebook to apply
dropout regularization to three-layer neural
network
250. Training Deep Neural Nets
Dropout
● If we observe that the model is overfitting
○ Then increase the dropout rate
● If the model is underfitting
○ Then decrease the dropout rate
● It can also help to
○ Increase the dropout rate for large layers, and
○ Reduce it for small ones
251. Training Deep Neural Nets
Dropout
● Please note that dropout does
○ Tend to slow down convergence
○ But it results in a much better model when tuned properly
○ So it is worth the extra time
252. Training Deep Neural Nets
Avoid Overfitting Through Regularization
Data Augmentation
253. Training Deep Neural Nets
Data Augmentation
● Data augmentation consists of
○ Generating new training instances from existing ones
○ Thereby increasing the size of the training set
● Let’s understand this with an example
● Let’s say we have to train a model to classify pictures of mushrooms
● Then we can slightly shift, rotate and resize
○ Every picture in the training set and
○ Add the resulting pictures to the training set
○ Thereby increasing the size of the training set
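One of those transformations, shifting, can be sketched on a tiny grayscale image stored as a list of rows (a minimal sketch; a real pipeline would also rotate and resize, and the function name is mine):

```python
def shift_right(image, pixels=1, fill=0):
    """Shift every row of a 2-D image to the right, padding with `fill`."""
    return [[fill] * pixels + row[:-pixels] for row in image]

image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]

# One original picture becomes two training instances
augmented = [image, shift_right(image)]
print(augmented[1])  # [[0, 1, 2], [0, 4, 5], [0, 7, 8]]
```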
254. Training Deep Neural Nets
Data Augmentation
Generating new training instances of mushrooms from existing ones
255. Training Deep Neural Nets
Data Augmentation
● The trick is to generate realistic training instances
● A human should not be able to tell
○ Which instances were generated and which ones were not
● Moreover, the modifications we apply should be learnable (adding white noise is not)
256. Training Deep Neural Nets
Data Augmentation
● These newly added pictures
○ Force the model to be more tolerant to the
■ Position,
■ Orientation, and
■ Size of the mushrooms in the picture
257. Training Deep Neural Nets
Data Augmentation
● If we want the model to be more tolerant to lighting conditions
○ We can also generate images with various contrasts and
○ Add them to the training set
258. Training Deep Neural Nets
Data Augmentation
● It is preferable to generate new images on the fly during training
○ Rather than wasting
■ Storage space and
■ Network bandwidth
259. Training Deep Neural Nets
Data Augmentation
● TensorFlow offers several image manipulation operations such as
○ Transposing (shifting)
○ Rotating
○ Resizing
○ Flipping
○ Cropping
○ Adjusting the brightness
○ Contrast
○ Saturation and
○ Hue
● These operations make it easy to implement data augmentation for
image datasets
261. Training Deep Neural Nets
Practical Guidelines
● In this topic we have covered a wide range of techniques
● A common question is which ones we should use
● The configuration shown below works fine in most cases
Default DNN Configuration
262. Training Deep Neural Nets
Practical Guidelines
● We should also always look for a pretrained neural network solving a
similar problem
● The default configuration shown on the last slide may be
tweaked as per the problem statement
○ If the training set is too small, then implement data augmentation
○ If we can’t find a good learning rate, then try adding a
■ Learning schedule such as exponential decay
○ If we need a lightning-fast model at run time
■ Then drop batch normalization and
■ Replace ELU with leaky ReLU
263. Training Deep Neural Nets
Practical Guidelines
● If we need a sparse model
○ Add some ℓ1 regularization
● With these guidelines
○ We can train deep neural networks
○ But if we use a single machine then
○ It may take days or months for training to complete
○ So be patient :)
○ Or else train the model across many servers and GPUs