Deep Learning Fundamentals

Gradient * The multi-dimensional equivalent of the derivative is the gradient * The gradient is just the vector (or tensor) of partial derivatives of the function with respect to each of the variables (or parameters)
* PyTorch computes it for us with automatic differentiation (autograd), as in the sketch below
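A minimal autograd sketch (the toy function and parameter values below are illustrative, not from the notes):

```python
import torch

# Toy scalar function of two parameters: f(w) = w1^2 + 3*w2
w = torch.tensor([2.0, 5.0], requires_grad=True)

f = w[0] ** 2 + 3 * w[1]
f.backward()                 # autograd fills in df/dw for each parameter

print(w.grad)                # tensor([4., 3.]) i.e. the gradient [2*w1, 3]
```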
 
Gradient Descent Gradient descent is a first-order iterative optimization algorithm
for finding a local minimum of a differentiable function.
The idea is to take repeated steps in the opposite direction
of the gradient (or approximate gradient) of the function at the current point,
because this is the direction of steepest descent. Conversely, stepping in the
direction of the gradient will lead to a local maximum of that function;
the procedure is then known as gradient ascent.
Batch Gradient Descent
Batch Gradient Descent uses the whole training set at every step.
It calculates the error for each record and takes the average to determine
the gradient. The advantage of Batch Gradient Descent is that the algorithm
is computationally efficient and produces a stable learning path, so it
is easier to converge. However, Batch Gradient Descent takes a long time
when the training set is large.

Stochastic Gradient Descent
At the other extreme, Stochastic Gradient Descent picks just one instance
from the training set at every step and updates the gradient based on that single record.
The advantage of Stochastic Gradient Descent is that each iteration is much faster,
which remedies the limitation of Batch Gradient Descent.
However, the algorithm produces a less regular and stable learning path compared
to Batch Gradient Descent. Instead of decreasing smoothly, the cost function
bounces up and down. After many rounds of iterations, the algorithm may find good parameters,
but the final result is not necessarily the global optimum.

Mini-Batch Gradient Descent
Mini-Batch Gradient Descent combines the ideas of Batch and Stochastic Gradient Descent.
At each step, the algorithm computes the gradient on a subset of the training set instead of
the full data set or a single record. The advantage of Mini-Batch Gradient Descent is that the
algorithm can take advantage of vectorized matrix operations, and the cost function
decreases more smoothly and stably than with Stochastic Gradient Descent (see the sketch below).
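A minimal mini-batch gradient-descent loop on a toy linear-regression problem (data, learning rate and batch size are illustrative):

```python
import torch

# Toy data: y = X @ [1.5, -2.0, 0.5] + noise
X = torch.randn(1000, 3)
y = X @ torch.tensor([1.5, -2.0, 0.5]) + 0.1 * torch.randn(1000)

w = torch.zeros(3, requires_grad=True)
lr, batch_size = 0.1, 32

for epoch in range(20):
    perm = torch.randperm(len(X))              # shuffle once per epoch
    for i in range(0, len(X), batch_size):
        idx = perm[i:i + batch_size]           # one mini-batch
        loss = ((X[idx] @ w - y[idx]) ** 2).mean()
        loss.backward()                        # gradient over the mini-batch only
        with torch.no_grad():
            w -= lr * w.grad                   # gradient-descent step
            w.grad.zero_()

print(w)   # approaches [1.5, -2.0, 0.5]
```

Setting batch_size = len(X) recovers Batch Gradient Descent, and batch_size = 1 recovers Stochastic Gradient Descent.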
 
Activation Function An activation function takes in weighted data (matrix multiplication between input
data and weights) and outputs a non-linear transformation of the data.
Sigmoid
* Pros: Maps to (0,1) and is non-linear, smooth, differentiable
* Cons: Is not centered around 0 i.e. not standardized
* Cons: Vanishing gradient (because of derivatives of composite functions via the chain rule - the max value of the sigmoid derivative is 0.25, and chain-multiplying many such derivatives results in a very small number)

Tanh (Hyperbolic tangent)
* Pros: Maps to (-1,1) and is non-linear, smooth, differentiable
* Pros: Is centered around 0 i.e. standardized
* Cons: Vanishing gradient

ReLU (Rectified Linear Unit - looks like hockey stick)
* Pros: Doesn’t have vanishing gradient, Non-zero gradient for all values > 0
* Pros: Reasonable default choice
* Cons: Not differentiable at 0, Dead neuron (Zero gradient for all values < 0)
* Cons: Not standardized

Leaky ReLU
* Pros: Has a non-zero gradient for all values (slope < 1 for values < 0 and slope = 1 for values > 0)
* Cons: The negative slope is an extra hyperparameter to tune

ELU (exponential linear unit)
* Pros: Faster learning
* Pros: Allows negative outputs i.e. standardized outputs

Softplus
* Pros: Differentiable
* Cons: Vanishing gradient on left
* Cons: Not standardized

BRU (Binodal Root Unit)
* Biological activation function
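A quick way to compare these activations in PyTorch (BRU has no built-in; the input values are illustrative):

```python
import torch
import torch.nn.functional as F

x = torch.linspace(-3, 3, 7)

print(torch.sigmoid(x))                       # maps to (0, 1), not zero-centered
print(torch.tanh(x))                          # maps to (-1, 1), zero-centered
print(F.relu(x))                              # 0 for x < 0, identity for x > 0
print(F.leaky_relu(x, negative_slope=0.01))   # small non-zero slope for x < 0
print(F.elu(x))                               # smooth, allows negative outputs
print(F.softplus(x))                          # smooth approximation of ReLU
```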
 
Output Activation Function * The final layer uses one of these, depending on the task:

* Dense i.e. linear / no activation (Regression),
* Sigmoid (Binary Classification),
* Softmax (Multiclass Classification)
Binary Classification :
* Sigmoid

Multiclass Classification :
* Softmax (only used for output layer)
* Example -
* Character / handwriting recognition,
* Speech recognition (each word is a category),
* Image classification (ImageNet dataset)

Linear Regression :
* Dense
 
Loss Function & Cost Function
Loss Function
* Measures the error between predicted and actual values in a machine learning model.



Cost Function
* Quantifies the overall cost or error of the model on the entire training set.
* Aggregates the loss values over the entire training set.
* Often the average or sum of individual loss values in the training set.
* Used to determine the direction and magnitude of parameter updates during optimization.
* Typically derived from the loss function, but can include additional regularization terms or other considerations.


PyTorch Loss-Input Confusion (Cheatsheet)
* torch.nn.functional.binary_cross_entropy takes logistic sigmoid values as inputs
* torch.nn.functional.binary_cross_entropy_with_logits takes logits as inputs
* torch.nn.functional.cross_entropy takes logits as inputs (performs log_softmax internally)
* torch.nn.functional.nll_loss is like cross_entropy but takes log-probabilities (log-softmax) values as inputs
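A quick sketch (with made-up logits and targets) showing which input each of these losses expects:

```python
import torch
import torch.nn.functional as F

# Binary case: one logit per example
logits = torch.tensor([0.8, -1.2, 2.5])
targets = torch.tensor([1.0, 0.0, 1.0])

loss_a = F.binary_cross_entropy(torch.sigmoid(logits), targets)   # expects probabilities
loss_b = F.binary_cross_entropy_with_logits(logits, targets)      # expects raw logits
assert torch.allclose(loss_a, loss_b)

# Multiclass case: one row of logits per example
logits = torch.tensor([[1.0, 2.0, 0.1], [0.3, 0.2, 2.2]])
labels = torch.tensor([1, 2])

loss_c = F.cross_entropy(logits, labels)                    # logits in, log_softmax done internally
loss_d = F.nll_loss(F.log_softmax(logits, dim=1), labels)   # expects log-probabilities
assert torch.allclose(loss_c, loss_d)
```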
Mean Square Error (MSE)
* Task : Regression
* Model : Ordinary Least Square (OLS) Regression estimates the conditional mean of the response variable as a linear function of the explanatory variables

Negative Log Likelihood
* Task : Regression
* Model : Maximum Likelihood Estimation (MLE) Regression (a different parameter-estimation method)
* In statistics, maximum likelihood estimation (MLE) is a method of estimating the parameters of an assumed probability distribution, given some observed data. This is achieved by maximizing a likelihood function so that, under the assumed statistical model, the observed data is most probable. The point in the parameter space that maximizes the likelihood function is called the maximum likelihood estimate.
* The ordinary least squares estimator for a linear regression model maximizes the likelihood when all observed outcomes are assumed to have normal distributions with the same variance.

Quantile Loss
* Task : Regression
* Model : Quantile regression estimates the conditional median (or other quantiles) of the response variable as a linear function of the explanatory variables

Negative Log Likelihood or Binary Cross Entropy Loss
* Task : Binary Classification
* Model : Logistic Regression

Hinge Loss
* Task : Binary Classification
* Model : Support Vector Machines (SVM)

Cross Entropy Loss
* Task : Multi-Class Classification
* Model : Softmax Regression
 
Unit * A unit often refers to the activation function in a layer by which the inputs are
transformed via a nonlinear activation function (for example by the logistic sigmoid function).

* Usually, a unit has several incoming connections and several outgoing connections.

* However, units can also be more complex, like long short-term memory (LSTM) units,
which have multiple activation functions with a distinct layout of connections to the
nonlinear activation functions, or maxout units, which compute the final output over
an array of nonlinearly transformed input values.

* Pooling, convolution, and other input transforming functions are usually not referred to as units.
Simple
* sigmoid
* tanh
* relu
* leaky relu
* elu
* softplus
* bru

Complex
* RNN - Simple RNN, GRU, LSTM
 
Layer * A layer is the highest-level building block in deep learning. A layer is a container that
usually receives weighted input, transforms it with a set of mostly non-linear functions and
then passes these values as output to the next layer.

* A layer is usually uniform, that is, it contains only one type of activation function, pooling, convolution, etc., so that it can be easily compared to other parts of the network.

* The first and last layers in a network are called input and output layers, respectively, and all
layers in between are called hidden layers.
* Linear
* Dense
* Batch Normalization
* Convolutional - Dimensionality, Patch size, Stride, Number of feature maps to generate, Padding strategy, Activation function
* Pooling
* Dropout
* RNN - Dimensionality, Type of recurrent neural network (LSTM, GRU, or standard RNN layer), Return sequence, Dropout
* GRU
* LSTM
* Attention
 
Forward Propagation Input Layer –> Hidden Layer 1 –> Hidden Layer 2 –> Output Layer * Forward propagation is where input data is fed through a network, in a forward direction, to generate an output.
* The data is accepted by hidden layers and processed, as per the activation function, and moves to the successive layer.
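A minimal forward-propagation sketch with explicit weight matrices (all sizes and random values are illustrative):

```python
import torch

x = torch.randn(8, 4)                            # batch of 8 samples, 4 input features

W1, b1 = torch.randn(4, 16), torch.zeros(16)     # input layer -> hidden layer 1
W2, b2 = torch.randn(16, 8), torch.zeros(8)      # hidden 1    -> hidden layer 2
W3, b3 = torch.randn(8, 1),  torch.zeros(1)      # hidden 2    -> output layer

h1 = torch.relu(x @ W1 + b1)     # weighted input, then activation
h2 = torch.relu(h1 @ W2 + b2)    # processed by the next hidden layer
y_hat = h2 @ W3 + b3             # output layer (no activation, e.g. regression)
print(y_hat.shape)               # torch.Size([8, 1])
```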
 
Backpropagation * Minimize the Cost Function, example - Negative Log Likelihood
* (by tuning model weights via Gradient Descent)
* Backpropagation of errors, or often simply backpropagation, is a method for finding the gradient of the error with respect to weights over a neural network.
* The gradient signifies how the error of the network changes with changes to the network’s weights.
* The gradient is used to perform gradient descent and thus find a set of weights that minimize the error of the network.
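A minimal backpropagation step in PyTorch (toy model, data and learning rate are illustrative); loss.backward() computes the gradient of the cost with respect to every weight, and the optimizer takes the gradient-descent step:

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(4, 16), nn.ReLU(), nn.Linear(16, 1))
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.MSELoss()

x, y = torch.randn(32, 4), torch.randn(32, 1)   # toy batch

y_hat = model(x)          # forward pass
loss = loss_fn(y_hat, y)  # cost to minimize
loss.backward()           # backpropagation: d(loss)/d(weights)
optimizer.step()          # gradient-descent update of the weights
optimizer.zero_grad()     # clear gradients before the next step
```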
 
ANN (Artificial Neural Network) Artificial Neural Networks
* Feed-forward Neural Networks
* Each neural network layer is a “feature transformation”
* Dense Layer = Fully Connected (FC) = nn.Linear()
* Input is always on left & output is always on right
* No recurrent connections (unlike RNN)

* Activation functions
* Multiclass classification
* Image data (all data is the same)
* Regression
Basically, an ANN consists of the following components:
* An input layer
* A hidden layer w/Activation Function
* An output layer
* Weights between the layers
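As a minimal sketch (layer sizes are illustrative), those components in PyTorch, with Dense / FC = nn.Linear:

```python
import torch.nn as nn

ann = nn.Sequential(
    nn.Linear(784, 128),   # input layer -> hidden layer (weights between layers)
    nn.ReLU(),             # hidden layer activation function
    nn.Linear(128, 10),    # hidden layer -> output layer (10-class logits)
)
```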
 
CNN (Convolutional Neural Net) Why CNN architecture?
* Modern CNNs originated from LeNet
* Series of Convolutional Layers & Pooling Layers –> Dense Layers (or Fully Connected Layers)
* Pooling (can be thought of as downsampling) - Max Pooling / Avg. Pooling
* Practical - if we shrink the image, we have less data to process
* Translational Invariance - I don’t care where in the image a feature occurred, I just care that it did
* Hyperparameters - Pool Size = 2 x 2 and Stride = 2

Why have a convolutional layer followed by a pooling layer (and repeat)?
* After each pair of Conv/Pool, the image shrinks
* As we go deeper in the CNN, the 3 x 3 convolutional filter looks for larger & larger patterns - thus learning hierarchical patterns
* With CNNs, the conventions are pretty standard

What if we have different size images?
* Use - Global max pooling layer

Summary of CNN architecture
Step-1
* Conv > Pool > Conv > Pool
* Strided Conv

Step-2
* Flatten - x.view(-1, shape)
* Global max pool - torch.max(x, dims)

Step-3
* Dense > Dense
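A minimal CNN following the Conv > Pool > Conv > Pool, flatten, Dense > Dense pattern above (the 28 x 28 grayscale input and channel counts are illustrative):

```python
import torch
import torch.nn as nn

cnn = nn.Sequential(
    nn.Conv2d(1, 16, kernel_size=3), nn.ReLU(),   # 28x28 -> 26x26
    nn.MaxPool2d(kernel_size=2, stride=2),        # 26x26 -> 13x13
    nn.Conv2d(16, 32, kernel_size=3), nn.ReLU(),  # 13x13 -> 11x11
    nn.MaxPool2d(kernel_size=2, stride=2),        # 11x11 -> 5x5
    nn.Flatten(),                                 # same effect as x.view(-1, 32*5*5)
    nn.Linear(32 * 5 * 5, 64), nn.ReLU(),         # Dense
    nn.Linear(64, 10),                            # Dense -> 10-class logits
)

x = torch.randn(4, 1, 28, 28)   # batch of 4 grayscale images
print(cnn(x).shape)             # torch.Size([4, 10])
```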
Convolution
* Input Image * Filter (another name, Kernel) = Output Image
* Example : Edge detection filter, Gaussian filter
* Input size = N x N
* Filter size = K x K (almost always a square matrix by convention)
* Unique positions for the filter = N - K + 1 (per dimension)
* Output height = input height - kernel height + 1
* Output width = input width - kernel width + 1 (verified in the sketch below)
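A quick check of the output-size formula with F.conv2d (sizes are illustrative; note that PyTorch's convolution actually computes cross-correlation, as discussed next):

```python
import torch
import torch.nn.functional as F

image = torch.randn(1, 1, 7, 7)    # N x N input, N = 7
kernel = torch.randn(1, 1, 3, 3)   # K x K filter, K = 3

out = F.conv2d(image, kernel)      # no padding, stride 1
print(out.shape)                   # torch.Size([1, 1, 5, 5]) -> N - K + 1 = 5
```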

Convolution Operator in Deep Learning
* What we are actually doing is cross-correlation
* A CNN can be thought of as a “correlation neural network”
* The true convolution operator flips the kernel on both the x & y axes before sliding it
* The correlation operator doesn’t do any flipping

Padding
* To have output size = input size

What is convolution?
* Why is dot product important?
* Can be thought of as a correlation measure

Why convolution?
* A convolution filter is a pattern finder, a shared-parameter matrix multiplication, a feature transformer
* Parameter sharing
* Fewer connections than a fully connected layer
* Saves memory and time
 
RNN (Recurrent Neural Networks) Sequence Data
* Example: Text, Speech, Financial Data / Stock Returns
* Simplest kind of sequence: Time series signal
* Airline passengers forecast (used for planning for enough airport workers, modify prices, marketing expenses planning)
* Weather tracking (dynamical system - chaos theory, butterfly effect, numerical round off errors)
* Speech
* Text

Shape of time series sequence
* N = # samples (number of windows in time series)
* D = # features
* T = # time steps in sequence
* Shape = N x T x D (convention)
* Example: model the path employees take to get to work
* N = number of employees x trips
* D = 2 (latitude & longitude)
* T = 1800 (lat-long recorded every second for 30 mins, i.e. 30 x 60 = 1800 time steps; assume equal-length sequences)
* Example: S&P 500
* N = number of windows
* D = 500 (500 different stocks)
* T = 10 (window size)


Forecasting

Simplest method: use auto-regressive linear regression
* Input windows have shape N x T x 1 (a single feature)
* Flattened to N x T for the linear model
* For a series of length L and window size T, the number of windows is N = L - T + 1 (see the windowing sketch below)
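A minimal windowing sketch under the notation above (toy univariate series, so D = 1):

```python
import torch

series = torch.arange(10, dtype=torch.float32)   # toy series, L = 10
T = 4                                            # window size

# All length-T windows: N = L - T + 1 of them
windows = torch.stack([series[i:i + T] for i in range(len(series) - T + 1)])
print(windows.shape)            # torch.Size([7, 4]) -> N x T, N = 10 - 4 + 1

X = windows.unsqueeze(-1)       # N x T x D with D = 1, the RNN shape convention
print(X.shape)                  # torch.Size([7, 4, 1])
```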


Recurrent Neural Networks (RNN)
(also called Elman unit)
* Why don’t ANNs work for sequence data?
* Will take too much space - too many connections and weights
* Just like with CNNs, RNNs take advantage of the structure of data
* Make the hidden feature (“hidden state”) depend on previous hidden state
* Linear regression forecasting model: output is linear function of inputs
* Now - hidden state is a non-linear function of input and past hidden state
* h[t] = Sigma(W[xh].x[t] + W[hh].h[t-1] + b[h])
* y_hat[t] = Sigma(W[o].h[t] + b[o])

* x = input
* h = hidden
* o = output
* xh = input-to-hidden
* hh = hidden-to-hidden

* Recurrent loop implies a time delay of 1


How do we calculate output prediction?
* Given = x1, x2,…., xT
* Shape(xT) = D
* First step
* h1 = Sigma(W[xh].x[1] + W[hh].h[0] + b[h])
* y_hat[1] = Sigma(W[o].h[1] + b[o])
* Set h0 to an array of zeros (or you can also make it a learnable parameter)
* h2 = Sigma(W[xh].x[2] + W[hh].h[1] + b[h])
* y_hat[2] = Sigma(W[o].h[2] + b[o])
* What is the purpose of having y_hat[t] for every time step?
* Mostly only the last y_hat[T] is used in time series forecasting
* All of y_hat[1], y_hat[2], y_hat[3]… y_hat[T] are used in neural machine translation
* Classification Probability
* For ANN & CNN - P(y = k | x)
* Machine Translation - P(y_t = k | ?)
* Unrolled RNN
* P(y1 = k | x1)
* P(y2 = k | x1, x2)
* P(y3 = k | x1, x2, x3)
* P(yT = k | x1, x2, x3… xT)

* Relationship to Markov model
* The Markov assumption is that the current value depends only on the immediately previous value

* Unlike Markov model
* An RNN will forecast the next word based on all the previous words - much more powerful!


Pseudocode
Given

Wxh - input to hidden weight
Whh - hidden to hidden weight
bh - hidden bias
Wo - hidden to output weight
bo - output bias

X - T x D input matrix

tanh hidden activation
softmax output activation
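The pseudocode itself is missing from the notes; a minimal sketch under the givens above (tanh hidden activation, softmax output, h0 = zeros) might look like:

```python
import numpy as np

def rnn_forward(X, Wxh, Whh, bh, Wo, bo):
    """Unroll a simple RNN over a T x D input matrix X."""
    T, D = X.shape
    M = Whh.shape[0]                       # hidden size
    h = np.zeros(M)                        # h0 = zeros (could also be a learned parameter)
    y_hats = []
    for t in range(T):
        h = np.tanh(X[t] @ Wxh + h @ Whh + bh)   # tanh hidden activation
        logits = h @ Wo + bo
        e = np.exp(logits - logits.max())
        y_hats.append(e / e.sum())               # softmax output activation
    return np.array(y_hats), h             # prediction at every step + final hidden state
```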


* Neurons that fire together wire together, and neurons that fire out of sync fail to link


Savings from RNN
ANN
* T = 100
* D = 10
* M = 15 (hidden vector size)
* Flattened input vector - T x D = 1000
* We will have hidden states - T x M = 1500
* Assume binary classification - K = 1
* Input to hidden weight - 1000 x 1500 = 1.5 million
* Hidden to output weight - 1500
* Total - 1.5 million

Simple RNN
* Wxh - D x M = 10 x 15 = 150
* Whh - M x M = 15 x 15 = 225
* Wo - M x K = 15 x 1 = 15
* Total - 150 + 225 + 15 = 390
* Savings - 1.5M / 390 ≈ 3850x fewer parameters

Modern RNN Units

* LSTM - Long Short-Term Memory
* GRU - Gated Recurrent Unit
* GRU is like a simplified version of the LSTM (fewer parameters and thus more efficient)

Why do we need these at all?
* Vanishing gradient problem due to deeply nested derivatives of recurrent functions
* RNNs have problems learning long-term dependencies
* Just using ReLUs doesn’t work
* The most effective way to deal with vanishing gradients is to use an entirely different unit - the LSTM & GRU
* Equations are generally more useful than diagrams for the LSTM


1. Simple RNN
h(t) = tanh(W{xh}.X(t) + W{hh}.h(t-1) + b{h})

2. GRU
* What did we learn so far?
* SimpleRNN has problems learning long-term dependencies
* In the GRU, the hidden state becomes a weighted sum of the previous hidden state and a new candidate value (allowing you to remember the old state)
* These are controlled by “gates” which are like binary classifiers / logistic regression / neurons

* z(t), r(t), h(t) are all vectors of size M
* M is a hyperparameter (“number of hidden units / features”)
* This implies the shape of all the weights
* Any weight going from x(t) is D x M
* Any weight going from h(t) is M x M
* All bias terms are of size M

2.1 update gate vector
Z(t) = sigma(W{xz}.X(t) + W{hz}.h(t-1) + b{z})

Z(t) –> rough interpretation - tells us if we should take the new value or keep the old value

2.2 reset gate vector
r(t) = sigma(W{xr}.X(t) + W{hr}.h(t-1) + b{r})

r(t) –> rough interpretation - tells us which parts of h(t-1) we want to remember and which parts we want to forget

2.3 hidden state vector
(* denotes element-wise multiplication)

h(t) = (1 - z(t)) * h(t-1) + z(t) * tanh(W{xh}.X(t) + W{hh}(r(t) * h(t-1)) + b{h})

2.4 rough interpretation
h(t) = p{keep h(t-1)} * h(t-1) + p{discard h(t-1)} * SimpleRNN(x(t), h(t-1))

2.5 summary
Same API as simple recurrent unit
* Output is h(t), depends on h(t-1) and x(t)
* Has gates to remember / forget each component of h(t-1)
* SimpleRNNs have no choice but to eventually forget, due to the vanishing gradient
* We use binary classifiers (logistic regression neurons) as our gates
* SUMMARY - GRU unit has functionality to selectively remember and forget previous hidden state which helps it to learn long term dependencies (using update gate and reset gate)
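A minimal from-scratch GRU cell following equations 2.1-2.3 (x is size D, h is size M; sizes and the random weights are illustrative):

```python
import torch

D, M = 4, 3   # input size and hidden size (illustrative)

# Any weight leaving x(t) is D x M, any weight leaving h(t-1) is M x M, biases are size M
Wxz, Whz, bz = torch.randn(D, M), torch.randn(M, M), torch.zeros(M)
Wxr, Whr, br = torch.randn(D, M), torch.randn(M, M), torch.zeros(M)
Wxh, Whh, bh = torch.randn(D, M), torch.randn(M, M), torch.zeros(M)

def gru_cell(x, h_prev):
    z = torch.sigmoid(x @ Wxz + h_prev @ Whz + bz)           # update gate (2.1)
    r = torch.sigmoid(x @ Wxr + h_prev @ Whr + br)           # reset gate (2.2)
    h_cand = torch.tanh(x @ Wxh + (r * h_prev) @ Whh + bh)   # candidate state
    return (1 - z) * h_prev + z * h_cand                     # hidden state (2.3)

h = torch.zeros(M)
for x_t in torch.randn(5, D):   # unroll over a toy sequence of length 5
    h = gru_cell(x_t, h)
print(h.shape)                  # torch.Size([3])
```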


3. LSTM
* Not exactly the same API
* “like the GRU” but with more state vectors and gates
* LSTM returns 2 states:
* Hidden state h(t)
* Cell state c(t) [usually ignored]
* The LSTM unit outputs h(t), but can also optionally output c(t)
* Also means you need 2 initial states: c0 and h0
* SUMMARY - The LSTM unit can selectively remember and forget the previous state, which helps it learn long-term dependencies (using the forget gate, update gate, output gate and cell state)


3.1 Forget gate vector
f(t) = sigma(W{xf}.X(t) + W{hf}.h(t-1) + b{f})
Neuron (binary classifier)

3.2 Update gate vector
i(t) = sigma(W{xi}.X(t) + W{hi}.h(t-1) + b{i})
Neuron (binary classifier)

3.3 Output gate vector
o(t) = sigma(W{xo}.X(t) + W{ho}.h(t-1) + b{o})
Neuron (binary classifier)

3.4 Cell state [f_c = tanh]
c(t) = f(t) * c(t-1) + i(t) * f_c(W{xc}.X(t) + W{hc}.h(t-1) + b{c})

c(t) = f(t) * c(t-1) + i(t) * SimpleRNN

3.5 Hidden state [f_h = tanh]
h(t) = o(t) * f_h(c(t))
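A matching from-scratch LSTM cell following equations 3.1-3.5 (same illustrative sizes and random weights; note the two states h and c, so two initial states are needed):

```python
import torch

D, M = 4, 3   # input size and hidden/cell size (illustrative)

Wxf, Whf, bf = torch.randn(D, M), torch.randn(M, M), torch.zeros(M)   # forget gate
Wxi, Whi, bi = torch.randn(D, M), torch.randn(M, M), torch.zeros(M)   # update gate
Wxo, Who, bo = torch.randn(D, M), torch.randn(M, M), torch.zeros(M)   # output gate
Wxc, Whc, bc = torch.randn(D, M), torch.randn(M, M), torch.zeros(M)   # cell candidate

def lstm_cell(x, h_prev, c_prev):
    f = torch.sigmoid(x @ Wxf + h_prev @ Whf + bf)                    # 3.1 forget gate
    i = torch.sigmoid(x @ Wxi + h_prev @ Whi + bi)                    # 3.2 update gate
    o = torch.sigmoid(x @ Wxo + h_prev @ Who + bo)                    # 3.3 output gate
    c = f * c_prev + i * torch.tanh(x @ Wxc + h_prev @ Whc + bc)      # 3.4 cell state
    h = o * torch.tanh(c)                                             # 3.5 hidden state
    return h, c

h, c = torch.zeros(M), torch.zeros(M)   # the two initial states: h0 and c0
for x_t in torch.randn(5, D):
    h, c = lstm_cell(x_t, h, c)
```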
 
Image Data Previously (before CNN)

* Image data was flattened (all data is the same)
* Flattened dimensions = N x D array
How to Represent Images

* Color images are stored as 3-D tensors (height x width x 3 channels)
* Each pixel holds 3 channel values - [r, g, b] (red, green, blue are the primary colors)

For a 500 x 500 color image: 500 x 500 x 3 values x 8 bits = 6 million bits = 750,000 bytes ≈ 732 KB
* JPEG compression helps us store the file in a compressed format

Grayscale images

* stored as 2-d arrays
* Value is 0 (black) to 255 (white)
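A small sketch of these representations in PyTorch (the 500 x 500 size comes from the example above; the N x C x H x W permute is the usual PyTorch convention):

```python
import torch

img = torch.randint(0, 256, (500, 500, 3), dtype=torch.uint8)   # H x W x C, values 0-255
print(img.numel())            # 750,000 values -> 750,000 bytes at 8 bits each

gray = torch.randint(0, 256, (500, 500), dtype=torch.uint8)     # 2-D array, 0 = black, 255 = white

# PyTorch conv layers expect float tensors shaped N x C x H x W
batch = img.permute(2, 0, 1).unsqueeze(0).float() / 255.0
print(batch.shape)            # torch.Size([1, 3, 500, 500])
```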
 
Optimization Algorithm By repeatedly applying gradient descent to the weight parameters, we will eventually arrive at weights that minimize the loss function. At the same time, these weights allow the neural network to make better predictions.



* Local optima and saddle points of the loss function pose problems where plain gradient descent reaches its limits when training neural networks
* AdaGrad, RMSProp and Adam are technical ways to make gradient descent more efficient at finding the optimal weights.
* Adam is usually the better optimization algorithm for training neural networks.
Stochastic Gradient Descent with Momentum
* In other words: we are moving more in the direction of the accumulated velocity than in the direction of the slope at the current point.

AdaGrad
* With AdaGrad, each weight’s update is divided by the square root of its accumulated squared gradients, so updates across the different dimensions proceed in roughly the same proportions.

RMSProp
* RMSProp is a slight modification of AdaGrad.
* For RMSProp, the running sum of squared gradients is multiplied by a decay rate α and the squared current gradient, weighted by (1 - α), is added.
* The update step in RMSProp looks the same as in AdaGrad.
* We divide the current gradient by the square root of this running average, which speeds up the weight updates along flat dimensions and slows down the motion along steep ones.

Adam Optimizer
* A combination of RMSProp and SGD with momentum
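Switching between these optimizers in PyTorch is a one-line change (the model and learning rates below are illustrative):

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 1)

sgd_momentum = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
adagrad = torch.optim.Adagrad(model.parameters(), lr=0.01)
rmsprop = torch.optim.RMSprop(model.parameters(), lr=0.001, alpha=0.99)   # alpha = decay rate
adam = torch.optim.Adam(model.parameters(), lr=0.001)                     # RMSProp + momentum
```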
 
Embeddings * Embeddings are dense vector representations of values or objects like text, images, and audio that are designed to be consumed by machine learning models and semantic search algorithms.
* Embeddings capture semantics
Word2Vec
Take a fake problem (CBOW or Skipgrams)
Solve it using a neural network
You get word embeddings as a side effect
Two methods
1. CBOW : Continuous Bag of Words
- Given context words predict the target word
2. Skipgrams
- Given the target predict context words

GloVe : Global Vectors for Word Representation
Similar to Word2Vec

BERT : Bidirectional Encoder Representations from Transformers
Based on transformer architecture
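A minimal embedding lookup and similarity check in PyTorch (vocabulary size, dimension and token ids are made up; real Word2Vec / GloVe / BERT vectors would come pre-trained):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

embedding = nn.Embedding(num_embeddings=10_000, embedding_dim=300)   # vocab size x dim

word_ids = torch.tensor([42, 1337])     # token ids for two words (illustrative)
vecs = embedding(word_ids)              # 2 x 300 dense vectors

similarity = F.cosine_similarity(vecs[0], vecs[1], dim=0)   # semantic-similarity proxy
print(similarity)
```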
 
Attention Mechanism (Transformers) Main Points

* The attention mechanism allows neural networks to learn very long-range dependencies in sequences
* Longer range than LSTM, a type of RNN
* Attention was created for RNNs, but transformers use attention only, while doing away with the recurrent part
* Transformers are big and slow
* But computations can be done in parallel (unlike RNNs!)
Transformers

* Embedding
Embeddings are low-dimensional dense semantics aware representation of objects

* Similarity between words
* dot product
* cosine similarity
* Pearson correlation

* Context
* please buy an apple and an orange
* move ‘apple’ towards the ‘orange’ in embedding vector space
* apple unveiled the new phone
* move ‘apple’ towards the ‘phone’ in embedding vector space

* Attention Mechanism
* Use the similarity matrix to move the words around in the embedding space


* Key and Query matrices (as linear transformations) => give us the Left Embeddings
* Orange embedding vector * Keys matrix
* Queries matrix.T * Phone embedding vector.T

* The Key and Query matrices transform the embeddings into a space where it’s convenient to calculate the similarities

* The Left Embeddings know the features of the word

* Value matrices (as linear transformations) => give us the Right Embeddings
* The embeddings on the right are optimized to find the next word in the sentence
* Left Embeddings x Value Matrix => Right Embeddings

* The Right Embeddings know when two words could appear in the same context

Main Architectures
* Self-attention
* Multi-head attention (using Transformers)
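A minimal scaled dot-product self-attention sketch (sequence length, embedding size and the random Query/Key/Value matrices are illustrative):

```python
import torch
import torch.nn.functional as F

T, d = 5, 16                       # sequence length, embedding dimension
x = torch.randn(T, d)              # one embedding vector per token

Wq, Wk, Wv = (torch.randn(d, d) for _ in range(3))   # query / key / value matrices

Q, K, V = x @ Wq, x @ Wk, x @ Wv
scores = Q @ K.T / d ** 0.5        # similarity of every token with every other token
weights = F.softmax(scores, dim=-1)
out = weights @ V                  # each token "moved" toward the tokens it attends to
print(out.shape)                   # torch.Size([5, 16])
```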
 
Batch Normalization * Batch normalization (also known as batch norm) is a method used to make training of artificial neural networks faster and more stable through normalization of the layers’ inputs by re-centering and re-scaling.

Wiki Reference
* Batch Norm is just another network layer that gets inserted between a hidden layer and the next hidden layer. Its job is to take the outputs from the first hidden layer and normalize them before passing them on as the input of the next hidden layer.

Medium Reference
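A minimal sketch of inserting a batch-norm layer between hidden layers (sizes illustrative):

```python
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(64, 128),
    nn.BatchNorm1d(128),   # re-centers and re-scales the 128 hidden activations per batch
    nn.ReLU(),
    nn.Linear(128, 10),
)
```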
 
Pooling Layer * Pooling layers, like convolutional layers, are used to reduce the size of the image representation * Example - Max Pooling or Avg. Pooling

* f = filter size

* s = stride
 

Resources