Revisiting important concepts before training deep neural networks.

Suraj
7 min read · Jun 28, 2020

Let’s dive into this guide, which aims to provide an intuitive summary of the deep learning concepts that anyone training a deep neural network should know. Many beginners in the field tend to rely on open-sourced GitHub repositories from eminent research groups/researchers to get started with an implementation; however, if prolonged, this habit can erode their coding and innovation skills.

The main reason the open-source community exists is to ensure that future generations don’t waste time reinventing the wheel, but instead use that wheel to build a fully functional car. However, like everything with pros, this approach, often referred to as Reverse-Engineering, comes with its own cons.

The pro is that one can save time and build something on top of existing work (wrapper functions in any programming/scripting language are one such illustration), but the cons are noticeably more evident and impactful: someone who doesn’t understand the wheel and the function of its parts could end up swapping it with the steering wheel.

The same analogy carries over to the domain of deep learning: if we don’t know which hyper-parameters affect the training and optimization of a deep learning model, we could end up cursing the architectures open-sourced for the community. I am assuming the audience reading this blog is aware of the analogy between data and oil; without proper oil a car won’t reach the desired place, and without proper data, deep learning models won’t lead us to the desired results.

If you are among those curious about what to do when data is scarce, I’ll try to cover that in a separate blog.

I intentionally wrote the paragraph(s) above to ensure that you are patient enough to learn things properly, and to discourage the habit of hopping between open-source GitHub repositories whenever one doesn’t immediately work for your problem statement. My suggestion is to stick to one approach from the literature and modify it to fit the problem. If you do solve the problem, contribute back to the open-sourced GitHub repo so that someone else can resume from that checkpoint and add new functionality to the repository.

Let’s now get to the core of the blog. I’ll break the rest of it into Q&As, which will hopefully assist you in your pursuit of implementing efficient deep learning models. I am assuming the audience knows the fundamentals and the mathematics behind each of the following concepts.

Q.1: I have studied that the initial layers of a Convolutional Neural Network learn simple features like edges, textures and patterns, while layers closer to the output become more representational, capturing object parts and entire objects. Can we visualize this?

Ans: Yes, why not! You can view the activations of each layer, from input to output, using open-source libraries like Keract, which return activations as NumPy arrays that can easily be plotted with matplotlib or with Keract’s built-in display_activations function.
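Here is a minimal sketch of how that might look, assuming Keract is installed; the tiny CNN and the random input below are purely illustrative, not from any particular project:

```python
import numpy as np
from tensorflow.keras import layers, models
from keract import get_activations, display_activations  # pip install keract

# A tiny CNN purely for illustration.
model = models.Sequential([
    layers.Conv2D(8, 3, activation="relu", input_shape=(28, 28, 1)),
    layers.MaxPooling2D(),
    layers.Conv2D(16, 3, activation="relu"),
    layers.Flatten(),
    layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam", loss="categorical_crossentropy")

x = np.random.rand(1, 28, 28, 1).astype("float32")  # dummy input image

activations = get_activations(model, x)  # dict: layer name -> NumPy array
display_activations(activations)         # plots each layer's activation maps
```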

Q.2: I have studied that using non-linear activations while training deep neural networks is preferred. Why?

Ans: If you don’t use a non-linear activation, the hidden layers collapse into a single layer, and the whole network can be represented as one linear function Y = Wx, where Y is the predicted output, x is the input and W is the combination of all intermediate weight matrices w0, …, wn. Such a linear mapping between input and output cannot learn the mapping for complex real-world data.
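A quick NumPy sketch (with made-up shapes) shows why stacking purely linear layers adds no expressive power:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(4,))        # input vector
W1 = rng.normal(size=(5, 4))     # "hidden layer" weights, no activation
W2 = rng.normal(size=(3, 5))     # "output layer" weights, no activation

two_layers = W2 @ (W1 @ x)       # two stacked linear layers
one_layer = (W2 @ W1) @ x        # a single equivalent layer W = W2 @ W1

print(np.allclose(two_layers, one_layer))  # True: the extra depth adds nothing
```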

Q.3: There are multiple kinds of non-linear activation functions, like the sigmoid function, hyperbolic tangent function, softmax function, rectified linear unit, leaky rectified linear unit, parametric rectified linear unit, exponential linear unit, etc. Which one should be used, and when, in a deep neural network?

Ans: Sigmoid function: It is rarely used in deep learning algorithms due to the vanishing gradient problem during backpropagation, slow convergence, and an output distribution that is not zero-centred, which often causes gradient updates to move in inconsistent directions. When used, it is mostly placed at the output layer of binary classification models to predict a class probability between 0 and 1.

Hyperbolic tangent function (Tanh): It is often used as a substitute for the sigmoid function, as it overcomes most of the aforementioned drawbacks by producing values between -1 and 1. Since its output distribution is zero-centred, it often makes backpropagation easier. It is commonly used at the output layer in natural language processing tasks. Its limitation is that it also suffers from vanishing gradients.

Softmax function: It is often used at the output layer of multi-class classification models, producing a probability for each class in the range of 0 to 1 that sums to 1 across classes. In a well-trained model the actual target class receives the highest probability, and the prediction can be reduced to a class index (or a one-hot encoded vector) using the argmax function.

Rectified Linear Unit (ReLU) function: It is the most common activation function in the intermediate layers of deep learning models for learning the complex mapping between input and output. The function passes each input through unchanged if it is greater than zero and sets it to zero otherwise, which makes the gradients easy to compute and also speeds up inference when models are deployed in production. Its limitations are that neurons can easily “die” if the network learns a large negative input bias, silencing downstream neurons, and that the mean activation is usually greater than zero.

Leaky Rectified Linear Unit (Leaky-ReLU) function: It is a useful substitute for ReLU in scenarios where the model has learnt a large negative input bias, because it keeps a small negative slope for inputs less than zero. Apart from this, it is very similar to ReLU and enables faster convergence during backpropagation.

Parametric Rectified Linear Unit (Parametric-ReLU) function: It is quite similar to the Leaky-ReLU function, except that the small negative slope for inputs less than zero is learnt during training.

Exponential Linear Unit (ELU) function: It is similar to ReLU but varies exponentially for inputs less than zero. Since ELU tends to push the mean activation towards zero, it often facilitates faster learning and convergence of deep learning models.

You can use the TensorFlow Playground to explore more about this.
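For reference, here is a minimal NumPy sketch of these activations. The alpha values below are common defaults, not prescriptions; in Parametric-ReLU the negative slope would be a learnt parameter rather than a constant:

```python
import numpy as np

def sigmoid(x):                  # squashes to (0, 1); binary-classification outputs
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):                     # squashes to (-1, 1); zero-centred
    return np.tanh(x)

def softmax(x):                  # score vector -> class probabilities summing to 1
    e = np.exp(x - np.max(x))
    return e / e.sum()

def relu(x):                     # passes positives through, zeroes out negatives
    return np.maximum(0.0, x)

def leaky_relu(x, alpha=0.01):   # small fixed negative slope
    return np.where(x > 0, x, alpha * x)

def elu(x, alpha=1.0):           # smooth exponential curve for negative inputs
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(x), leaky_relu(x), elu(x), sep="\n")
```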

Q.4: I have heard about multiple types of loss functions for deep neural networks. When and why should each of them be used?

Ans: Assuming the audience is aware of loss functions, classification and regression, and why and when they are used while training deep neural networks, I’ll focus here on describing loss functions for classification (binary and multi-class) and for regression.

Loss functions for binary classification primarily include binary cross-entropy, hinge loss and squared hinge loss. Binary cross-entropy is the default loss function for training deep neural networks for binary classification, as it follows directly from the Maximum Likelihood Estimation framework. In cases where cross-entropy fails, the hinge loss function acts as a useful substitute. The ground-truth labels must lie in {-1, 1} to use it. Hinge loss cares mostly about the sign of the predicted value relative to the ground truth: if the prediction and the ground truth differ in sign (negative vs. positive or vice versa), hinge loss penalizes it with a large error during training. The output node of the network is therefore often paired with Tanh to produce values in the range (-1, 1). Squared hinge loss squares the hinge loss, smoothing the loss surface and making it easier to optimize.
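A minimal NumPy sketch of these binary-classification losses (the toy labels and predictions below are made up for illustration):

```python
import numpy as np

def binary_cross_entropy(y_true, y_pred, eps=1e-7):
    # y_true in {0, 1}, y_pred = sigmoid output in (0, 1)
    y_pred = np.clip(y_pred, eps, 1 - eps)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

def hinge_loss(y_true, y_pred):
    # y_true in {-1, +1}, y_pred = tanh output in (-1, 1); penalizes sign disagreement
    return np.mean(np.maximum(0.0, 1.0 - y_true * y_pred))

def squared_hinge_loss(y_true, y_pred):
    # squared version of hinge loss, which smooths the loss surface
    return np.mean(np.maximum(0.0, 1.0 - y_true * y_pred) ** 2)

y_true01 = np.array([1, 0, 1, 1])
y_pred01 = np.array([0.9, 0.2, 0.6, 0.4])
print(binary_cross_entropy(y_true01, y_pred01))

y_true_pm = np.array([1, -1, 1, 1])
y_pred_pm = np.array([0.8, -0.6, 0.2, -0.3])
print(hinge_loss(y_true_pm, y_pred_pm), squared_hinge_loss(y_true_pm, y_pred_pm))
```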

Loss functions for multi-class classification primarily include multi-class cross-entropy (categorical cross-entropy) and Kullback-Leibler divergence. As in binary classification, multi-class cross-entropy is the default loss function for training deep neural networks for multi-class classification, as it follows from the Maximum Likelihood Estimation framework. It scores the average difference between the predicted probability distribution and the ground-truth distribution across all classes, and an ideal training run drives the cross-entropy towards zero. It is referred to as categorical cross-entropy in several deep learning frameworks. A variation is the sparse multi-class cross-entropy loss, which avoids converting labels to one-hot vectors before feeding them into the network. In applications such as auto-encoders, where the predicted distribution should match the ground-truth distribution and the task is more complex than plain multi-class classification, Kullback-Leibler divergence is often used instead; an ideal training session drives the KL divergence loss towards zero.
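Again a minimal NumPy sketch, with made-up predictions over three classes, just to make the definitions concrete:

```python
import numpy as np

def categorical_cross_entropy(y_true_onehot, y_pred_probs, eps=1e-7):
    # y_pred_probs: softmax outputs; y_true_onehot: one-hot ground truth
    logp = np.log(np.clip(y_pred_probs, eps, 1.0))
    return -np.mean(np.sum(y_true_onehot * logp, axis=1))

def sparse_categorical_cross_entropy(y_true_idx, y_pred_probs, eps=1e-7):
    # same loss, but labels are integer class indices instead of one-hot vectors
    picked = y_pred_probs[np.arange(len(y_true_idx)), y_true_idx]
    return -np.mean(np.log(np.clip(picked, eps, 1.0)))

def kl_divergence(p, q, eps=1e-7):
    # how much distribution q diverges from reference distribution p
    p, q = np.clip(p, eps, 1.0), np.clip(q, eps, 1.0)
    return np.sum(p * np.log(p / q))

y_pred = np.array([[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]])
y_onehot = np.array([[1, 0, 0], [0, 1, 0]])
y_idx = np.array([0, 1])
print(categorical_cross_entropy(y_onehot, y_pred))
print(sparse_categorical_cross_entropy(y_idx, y_pred))
print(kl_divergence(np.array([0.5, 0.5]), np.array([0.9, 0.1])))
```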

Loss functions for regression primarily include Mean Squared Error, Mean Squared Logarithmic Error and Mean Absolute Error. Mean Squared Error is the default loss function for training deep neural networks for regression; if the output ground-truth distribution is Gaussian, it is the go-to loss under the Maximum Likelihood Estimation framework. However, it strongly punishes large errors, which can dominate training when the targets themselves take large values; Mean Squared Logarithmic Error is often used to soften this effect. When the ground-truth variable contains outliers, Mean Absolute Error is the preferred loss function.
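And a final NumPy sketch of the three regression losses, with toy targets chosen to show how differently they treat large values and outliers:

```python
import numpy as np

def mse(y_true, y_pred):
    # squared penalty; large errors dominate
    return np.mean((y_true - y_pred) ** 2)

def msle(y_true, y_pred):
    # log1p softens the penalty for large-valued targets (values must be non-negative)
    return np.mean((np.log1p(y_true) - np.log1p(y_pred)) ** 2)

def mae(y_true, y_pred):
    # linear penalty; less sensitive to outliers than MSE
    return np.mean(np.abs(y_true - y_pred))

y_true = np.array([3.0, 50.0, 7.0, 1000.0])
y_pred = np.array([2.5, 55.0, 6.0, 900.0])
print(mse(y_true, y_pred), msle(y_true, y_pred), mae(y_true, y_pred))
```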

With this set of questions answered, I’d like to wrap up this blog, which I hope assists you in implementing efficient deep learning models.

Written by Suraj

Seasoned machine learning engineer/data scientist versed in the entire life cycle of a data science project.
