Optimization in Neural Networks and Newton's Method

Last Updated : 06 Nov, 2023

In machine learning, optimizers and loss functions are two components that help improve the performance of a model. A loss function measures a model's performance by computing the difference between the output the model is expected to produce and the output it actually produces; mean squared error, hinge loss, and log loss are common examples. An optimizer improves the model by adjusting its parameters so as to minimize the value of the loss function; SGD, Adam, RMSProp, and Newton's method are common examples. The role of the optimizer is to find the set of parameters (weights and biases) of the neural network that allows it to make accurate predictions.

Optimization in Neural Networks

Computational methods for iterative optimization can be broadly divided into three types:

  • Zero Order or Direct Search
    • These involve exploring a range of potential values for the variable x (akin to a grid search) to find the minimum of the objective function. Such techniques are generally computationally intensive compared to higher-order methods, but they are reliable and straightforward to implement. While more advanced techniques exist that improve upon grid search, our primary emphasis here is on the higher-order iterative approaches.
  • First-order or Gradient Methods
    • These techniques make use of the first-order partial derivatives.
    • Example: gradient descent and its variants SGD, ADAM, RMSPROP, etc.
  • Second Order Methods:
    • These techniques make use of the second-order partial derivatives (hessian).
    • Example: Newton's method, quasi-Newton methods.

In this article we will focus on Newton's method for optimization and how it can be used for training neural networks. Let us first compare it with gradient descent.

Gradient Descent vs Newton's Method

Gradient descent and Newton's method are two well-known optimization techniques for training neural networks. Each has its advantages and disadvantages, and the choice between them depends on the problem at hand, the complexity of the neural network, and the available computational resources.

| Criteria | Gradient Descent | Newton's Method |
| --- | --- | --- |
| Working | Relies only on the first-order derivative (gradient) of the loss function to update model parameters. | Relies on both the first-order derivative (gradient) and the second-order derivatives (Hessian) to update model parameters. |
| Applicability | Relatively easy to implement and widely used in practice due to its simplicity. | May not be practical for very large neural networks, or when the Hessian matrix is expensive to compute or invert. |
| Convergence | Typically converges to a local minimum, but is sensitive to the choice of learning rate, may get stuck at saddle points, and can converge slowly in some cases. | Often converges much faster than gradient descent. |
| Learning rate | Highly sensitive. | Less sensitive. |
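
To make the contrast concrete, here is a minimal sketch in plain Python (the quadratic f(x) = (x - 3)^2 + 1, the learning rate, and the step counts are illustrative choices, not from the article). Because a quadratic is matched exactly by Newton's second-order model, Newton's method lands on the minimum in a single step, while gradient descent needs many steps and a suitable learning rate.

Python3
# Minimize f(x) = (x - 3)^2 + 1, whose minimum is at x = 3
def f_prime(x):         # first derivative
    return 2 * (x - 3)

def f_double_prime(x):  # second derivative (constant for a quadratic)
    return 2.0

# Gradient descent: x_{t+1} = x_t - lr * f'(x_t)
x_gd, lr = 0.0, 0.1
for _ in range(20):
    x_gd -= lr * f_prime(x_gd)

# Newton's method: x_{t+1} = x_t - f'(x_t) / f''(x_t)
x_newton = 0.0
x_newton -= f_prime(x_newton) / f_double_prime(x_newton)

print(f"Gradient descent after 20 steps: {x_gd:.4f}")      # close to 3
print(f"Newton's method after 1 step:    {x_newton:.4f}")  # exactly 3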

Basic Concept of Newton's Method

The genesis of Newton's method lies in calculus, where it is used to find the roots of a differentiable function f, i.e., the solutions of the equation f(x) = 0.

Let us first understand Newton's method for root finding; then we will see how it can be used for optimization.

  • Problem: We have a function f(x). The goal is to find a root of f(x), i.e., a value of x for which f(x) = 0.
  • Initial guess: We make an initial guess X_0 for the root.
  • Update: We update the current guess X_t to get a new estimate X_{t+1} using the formula

X_{t+1} = X_t - f(X_t)/f'(X_t)

  • Derivation of the above formula: Geometrically, the tangent line to f at X_t (the orange line in the figure) has slope f'(X_t), and it crosses the x-axis at X_{t+1}. Since slope is rise over run, we get

f'(X_t) = (f(X_t) - 0) / (X_t - X_{t+1})

Rearranging this equation gives

X_{t+1} = X_t - f(X_t)/f'(X_t)

  • Repeat: The method proceeds iteratively. We repeat the update step until we reach a predefined tolerance or a maximum number of iterations, as in the sketch below.
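
The following is a minimal sketch of this root-finding iteration in plain Python. The example function f(x) = x^2 - 2 (whose positive root is sqrt(2)), the tolerance, and the iteration cap are illustrative choices, not part of the original article.

Python3
def newton_root(f, f_prime, x0, tol=1e-10, max_iter=100):
    """Find a root of f using Newton's method: x_{t+1} = x_t - f(x_t)/f'(x_t)."""
    x = x0
    for _ in range(max_iter):
        fx = f(x)
        if abs(fx) < tol:           # predefined tolerance reached
            break
        x = x - fx / f_prime(x)     # Newton update
    return x

# Example: root of f(x) = x^2 - 2, i.e. sqrt(2) ~= 1.41421356
root = newton_root(lambda x: x**2 - 2, lambda x: 2 * x, x0=1.0)
print(root)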

Newton's Method for Optimization

The method discussed above finds the roots of a function, i.e., solves f(x) = 0. We also know that the derivative of a function is zero at its critical points (minima/maxima/saddle points). So instead of finding the roots of f(x), we can find the roots of f'(x); by doing so, we locate the critical points of f(x).

[Figure: Newton's Method for Optimization]


Let us first build an intuition for Newton's method using the single-variable case; then we will extend it to the multivariate case.
Consider the graph above, which shows a single step of Newton's method.

Here, we want to find the minimum of the function f(X):

  1. We start with an initial guess X_0.
  2. We approximate the function around X_0 using a Taylor series, keeping terms up to second order:
    • f(X) \approx f(X_0)+\frac{f^{\prime}(X_0)}{1 !}(X-X_0)+\frac{f^{\prime \prime}(X_0)}{2 !}(X-X_0)^{2}
  3. Since the derivative of a function is zero at a minimum/maximum, we find the minimum of this approximation by differentiating it and setting the result to zero. Note that this is the minimum of the Taylor approximation built at X_0, not necessarily of f itself.
    • f'(X) \approx f'(X_0) + f''(X_0)(X - X_0) = 0
    • Rearranging this equation gives the next point X_1:
      • X_1 = X_0 - [f''(X_0)]^{-1} f'(X_0)
  4. Repeat steps 2 and 3 until convergence is reached (e.g., when the change in X is sufficiently small). A minimal one-dimensional sketch is given after this list.
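
Here is a minimal sketch of this one-dimensional update rule in plain Python. The test function f(x) = x^4 - 3x^3 + 2 (with a local minimum at x = 2.25), the starting point, and the tolerance are illustrative choices, not from the original article.

Python3
def newton_minimize_1d(f_prime, f_double_prime, x0, tol=1e-10, max_iter=100):
    """Minimize a 1-D function by applying Newton's method to f'(x) = 0."""
    x = x0
    for _ in range(max_iter):
        step = f_prime(x) / f_double_prime(x)
        x = x - step                  # X_{t+1} = X_t - f'(X_t) / f''(X_t)
        if abs(step) < tol:           # change in x is sufficiently small
            break
    return x

# Example: f(x) = x**4 - 3*x**3 + 2 has a local minimum at x = 2.25
x_min = newton_minimize_1d(lambda x: 4 * x**3 - 9 * x**2,
                           lambda x: 12 * x**2 - 18 * x,
                           x0=3.0)
print(x_min)   # ~2.25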

We can extend the above method to the multivariate case.

The multivariable quadratic approximation to f(x_1, …, x_n) around a point x_0 can be written as

f(x) \approx f(x_0) + (x-x_0)^{T}\nabla f(x_0) + \frac{1}{2}(x -x_0)^{T}H(x_0)(x-x_0)

Here,

  • x is a vector of length n representing the variables x_1, …, x_n
  • x_0 is a vector of length n representing the values of the variables at the approximation point
  • \nabla f(x_0) is the multivariable equivalent of the first derivative, the gradient: it collects the first partial derivatives of f(x) with respect to every parameter in one vector.

\overrightarrow{\nabla}f(\overrightarrow{x}) =\begin{bmatrix} \frac{\partial f}{\partial x_1}\\ \frac{\partial f}{\partial x_2}\\ \vdots \\ \frac{\partial f}{\partial x_n}\end{bmatrix}

The multivariable equivalent of the second derivative: instead of a vector, we now need a matrix. This matrix Hf(x) is called the Hessian, and it is the matrix of second derivatives of f(x) with respect to each pair of parameters.

  • To calculate the Hessian, we take the gradient and differentiate each of its components with respect to each of the parameters.
  • For example, to calculate the first row of the Hessian we take the partial derivative of the gradient \begin{bmatrix} \frac{\partial f}{\partial x_1} & \frac{\partial f}{\partial x_2} & \frac{\partial f}{\partial x_3} & ... & \frac{\partial f}{\partial x_n} \end{bmatrix} with respect to x_1, which gives us \begin{bmatrix} \frac{\partial^2 f}{\partial x_{1}^{2}} & \frac{\partial^2 f}{\partial x_{1} \partial x_{2}} & \frac{\partial^2 f}{\partial x_{1} \partial x_{3}} & ... & \frac{\partial^2 f}{\partial x_{1} \partial x_{n}} \end{bmatrix}, the first row of the Hessian matrix.
  • To get the second row we take the partial derivative of the gradient with respect to x_2, which gives us \begin{bmatrix} \frac{\partial^2 f}{\partial x_{2} \partial x_{1}} & \frac{\partial^2 f}{\partial x_{2}^{2}} & \frac{\partial^2 f}{\partial x_{2} \partial x_{3}} & ... & \frac{\partial^2 f}{\partial x_{2} \partial x_{n}} \end{bmatrix}
  • This is repeated up to x_n, giving the Hessian matrix below:

Hf(\overrightarrow{x}) = \begin{bmatrix} \frac{\partial^2 f}{\partial x_{1}^{2}} & \frac{\partial^2 f}{\partial x_{1} \partial x_{2}} & ... & \frac{\partial^2 f}{\partial x_1 \partial x_n}\\ \frac{\partial^2 f}{\partial x_2 \partial x_1} & & & \frac{\partial^2 f}{\partial x_2 \partial x_n}\\ \vdots & & & \vdots \\ \frac{\partial^2 f}{\partial x_n \partial x_1} & ... & ... & \frac{\partial^2 f}{\partial x_{n}^{2}} \end{bmatrix}

Hessian Matrix

The update rule in Newton's method uses the Hessian matrix and the gradient to compute the step: setting the gradient of the quadratic approximation to zero gives x_{t+1} = x_t - H(x_t)^{-1} \nabla f(x_t), which in practice is obtained by solving a linear system of equations involving the Hessian and the gradient rather than inverting the Hessian explicitly. A small sketch of one such update is given below.
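
As a concrete illustration, here is a minimal sketch of a few multivariate Newton updates, assuming NumPy is available; the test function f(x, y) = (x-1)^2 + 2(y+2)^2 + xy, its hand-computed gradient and Hessian, the starting point, and the iteration count are illustrative choices, not from the original article.

Python3
import numpy as np

def grad(v):
    """Gradient of f(x, y) = (x-1)^2 + 2(y+2)^2 + x*y."""
    x, y = v
    return np.array([2 * (x - 1) + y, 4 * (y + 2) + x])

def hessian(v):
    """Hessian of the same f (constant, since f is quadratic)."""
    return np.array([[2.0, 1.0],
                     [1.0, 4.0]])

v = np.array([0.0, 0.0])           # initial guess
for _ in range(5):
    g = grad(v)
    H = hessian(v)
    step = np.linalg.solve(H, g)   # solve H * step = g instead of inverting H
    v = v - step                   # Newton update: v_{t+1} = v_t - H^{-1} g

print(v)         # ~[ 2.2857, -2.5714]
print(grad(v))   # ~[0, 0] -> critical point reached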

Significance of Hessian Matrix

  • Curvature Information: The Hessian matrix provides information about the curvature of the loss landscape. Specifically, it tells you how the loss function's gradient (first derivative) changes with respect to changes in each parameter, and how the parameters interact with each other in the optimization space.
  • Minima, Maxima and Saddle Points: The Hessian matrix can help identify whether a critical point (a point where the gradient is zero) is a local minimum, a local maximum, or a saddle point, based on the eigenvalues of the Hessian at that point: if all eigenvalues are positive the point is a local minimum, if all are negative it is a local maximum, and if some are positive and some negative it is a saddle point. A small check of this kind is sketched below.
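
A minimal sketch of this eigenvalue test, assuming NumPy is available. The function f(x, y) = x^2 - y^2, whose critical point at the origin is a saddle point, is an illustrative choice, not from the original article.

Python3
import numpy as np

# Hessian of f(x, y) = x^2 - y^2 at the critical point (0, 0)
H = np.array([[ 2.0,  0.0],
              [ 0.0, -2.0]])

eigenvalues = np.linalg.eigvalsh(H)   # eigenvalues of a symmetric matrix
if np.all(eigenvalues > 0):
    kind = "local minimum"
elif np.all(eigenvalues < 0):
    kind = "local maximum"
else:
    kind = "saddle point"

print(eigenvalues, "->", kind)        # [-2.  2.] -> saddle point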

Utilization of Second Derivative

The geometric explanation of Newton's method involves an iterative process where, at each step, it approximates the graph of the function f(x) near the trial value by fitting a parabola. Newton's method makes use of the first derivative (the gradient) and the second derivative (the Hessian matrix) of the objective function. The parabola is designed to have the same slope and curvature as the function's graph at that particular point. Using the second derivative provides a more accurate and informative characterization of the local curvature of the function than the first derivative, which only provides slope information. Once the parabola is constructed, the method proceeds to the maximum or minimum of this parabolic approximation.

Gradient descent, on the other hand, relies solely on the first derivative (the gradient). Incorporating second-derivative information allows Newton's method to take more precise and efficient steps towards the minimum; it can converge faster than first-order methods like gradient descent because it considers both the direction and the curvature of the loss landscape.

How to use Newton's Method for Optimization?

A basic outline of implementing Newton's method for a neural network is given below:

  • Define the model: Determine the neural network architecture, including the number of layers and activation functions. Identify the model parameters, which are the weights and biases of the network, and choose hyperparameters such as the learning rate.
  • Initialize parameters: Choose an initial guess for the model parameters.
  • Loss function: Define a loss function that measures the error between the model's output and the actual output.
  • Optimizer: Define the optimizer that will update the model parameters; in our case it is the Newton optimizer.
  • Regularization: Incorporate regularization techniques if needed, such as L2 regularization (weight decay) or dropout, to prevent overfitting.
  • Iterative update (see the sketch after this list):
    • Pass the input to the model and compute the loss.
    • Compute the gradient and the Hessian matrix of the loss with respect to the model parameters (the gradient is obtained via backpropagation).
    • Update the parameter estimate using the Newton step.
    • Repeat the above steps until convergence or until the maximum number of iterations is reached.
  • Validation and Testing: After training, evaluate on a validation dataset to assess the model performance.
  • Fine-Tuning: Experiment with different hyperparameters, learning rates, and regularization strengths to achieve better results.
  • Convergence Check
    • Check for convergence by evaluating a stopping criterion. Common criteria include:
    • The norm of the gradient becoming very small.
    • The change in the parameter estimate becoming very small.
    • Reaching a specified number of iterations.
  • Termination and Output: If the convergence criterion is met, terminate the optimization. Output the final parameter estimate which represents the optimal neural network weights and biases.
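
The full PyTorch example later in this article uses the L-BFGS optimizer, but the outline above can also be illustrated directly on a tiny problem. The following is a minimal sketch, assuming PyTorch is available, of an explicit Newton update for a small linear model whose parameters fit in a single vector; the synthetic data, the loss_fn helper, and the iteration counts are illustrative choices, not from the original article. For a real neural network the Hessian over all weights is usually far too large to form explicitly, which is why quasi-Newton methods such as L-BFGS are used instead.

Python3
import torch

torch.manual_seed(0)

# Tiny synthetic regression problem (illustrative only)
X = torch.randn(100, 3)
true_w = torch.tensor([1.5, -2.0, 0.5])
y = X @ true_w + 0.01 * torch.randn(100)

def loss_fn(w):
    """Mean squared error of a linear model y_hat = X @ w."""
    return ((X @ w - y) ** 2).mean()

w = torch.zeros(3)                                           # initial guess

for _ in range(5):                                           # iterative update
    grad = torch.autograd.functional.jacobian(loss_fn, w)    # gradient vector
    hess = torch.autograd.functional.hessian(loss_fn, w)     # Hessian matrix
    step = torch.linalg.solve(hess, grad)                    # solve H * step = grad
    w = w - step                                             # Newton update
    if grad.norm() < 1e-8:                                   # convergence check
        break

print(w)   # close to [1.5, -2.0, 0.5]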

Step-by-Step Guide of Newton Method in Python

We will train a custom neural network for image classification using PyTorch, with the LBFGS solver as the optimizer, on the well-known CIFAR10 dataset.

Import Necessary Libraries

torch is the main library for implementing PyTorch-based neural networks.

Python3
import torch
from torch import nn, optim
import torch.nn.functional as F
import pandas as pd
import numpy as np


Initialize Device

We initialize our device variable to "cuda" if a GPU is available, otherwise to "cpu".

Python3
device = "cuda" if torch.cuda.is_available() else "cpu" 


Create Model

Here we design a simple CNN for classification. It consists of 2 convolution layers followed by 2 linear layers. We use ReLU activation and max pooling after each convolution layer, a dropout layer to reduce overfitting, and Xavier initialization for the weights of the two convolution layers.

Python3
# Creating our own ConvNet
class ConvNet(nn.Module):
    def __init__(self):
        super(ConvNet, self).__init__()
        self.conv1 = nn.Conv2d(3, 6, 5)     # 3 input channels -> 6 feature maps, 5x5 kernel
        self.pool = nn.MaxPool2d(2, 2)
        self.conv2 = nn.Conv2d(6, 16, 5)
        self.fc1 = nn.Linear(16 * 5 * 5, 1000)
        self.fc2 = nn.Linear(1000, 10)      # 10 output classes for CIFAR10
        self.dropout = nn.Dropout(0.3)

    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))
        x = self.dropout(x)
        x = self.pool(F.relu(self.conv2(x)))
        x = self.dropout(x)
        x = x.view(-1, 16 * 5 * 5)          # flatten feature maps for the linear layers
        x = F.relu(self.fc1(x))
        x = self.fc2(x)
        return x

model = ConvNet().to(device)
torch.nn.init.xavier_uniform_(model.conv1.weight)
torch.nn.init.xavier_uniform_(model.conv2.weight)
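
As a quick sanity check (not part of the original article), we can pass a dummy CIFAR10-sized batch through the model and confirm that the output has one logit per class.

Python3
# One fake image: batch of 1, 3 channels, 32x32 pixels (CIFAR10 size)
dummy = torch.randn(1, 3, 32, 32).to(device)
print(model(dummy).shape)   # expected: torch.Size([1, 10])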


Load the dataset

Here we load the dataset using the built-in CIFAR10 class. We use a transform to make sure the data is in tensor format, split the data into train and test sets, and create DataLoader objects that provide iterators for the model.

Python3
from torchvision import transforms
from torchvision.datasets import CIFAR10

train_transforms = transforms.Compose([transforms.ToTensor()])
dataset = CIFAR10(root="./test/", train=False, download=True,
                  transform=train_transforms)

# Split the 10,000 images into 8,000 for training and 2,000 for testing
train_set, test_set = torch.utils.data.random_split(dataset, [8000, 2000])

trainloader = torch.utils.data.DataLoader(train_set, batch_size=1024, shuffle=True)
testloader = torch.utils.data.DataLoader(test_set, batch_size=1024, shuffle=True)


Loss function and optimizer

We use cross-entropy loss, since this is a classification problem, and the LBFGS optimizer. L-BFGS is a limited-memory quasi-Newton method: instead of computing and inverting the full Hessian, it builds an estimate of the inverse Hessian from a short history of recent gradients, which saves memory and computation.

Python3
criterion = nn.CrossEntropyLoss()
optimizer = optim.LBFGS(model.parameters(), lr=0.39, max_iter=2,
                        history_size=10, line_search_fn='strong_wolfe')


Model Training

We now train the model. Below is standard PyTorch training code; the only difference is the use of a closure function, which the PyTorch implementation of LBFGS requires because it may need to re-evaluate the loss and gradients several times per optimization step.

Python3
N_EPOCHS = 50
for epoch in range(N_EPOCHS):
    epoch_loss = 0.0

    model.train()
    for inputs, labels in trainloader:
        inputs = inputs.to(device)
        labels = labels.to(device)

        # LBFGS may re-evaluate the loss several times per step, so it expects
        # a closure that recomputes the forward pass and gradients
        def closure():
            optimizer.zero_grad()
            outputs = model(inputs)
            loss = criterion(outputs, labels)
            loss.backward(retain_graph=True)
            return loss

        optimizer.step(closure)

        outputs = model(inputs)
        loss = criterion(outputs, labels)
        epoch_loss += loss.item()

    val_loss = 0.0
    model.eval()
    for inputs, labels in testloader:
        inputs = inputs.to(device)
        labels = labels.to(device)

        outputs = model(inputs)
        loss = criterion(outputs, labels)
        val_loss += loss.item()

    print("Epoch: {} Train Loss: {} Val Loss: {}".format(
        epoch, epoch_loss / len(trainloader), val_loss / len(testloader)))

Output:

Epoch: 0 Train Loss: 2.302582234144211 Val Loss: 2.302242159843445 
Epoch: 1 Train Loss: 2.302254170179367 Val Loss: 2.302234172821045
Epoch: 2 Train Loss: 2.301358014345169 Val Loss: 2.3020946979522705
Epoch: 3 Train Loss: 2.302258551120758 Val Loss: 2.3020176887512207
Epoch: 4 Train Loss: 2.300609827041626 Val Loss: 2.298862934112549
Epoch: 5 Train Loss: 2.199757754802704 Val Loss: 2.0873111486434937

After training for 50 epochs, we get a loss of about 1.6 on both the train and validation sets.

Let's check the output of our model on a sample from the test set:

Python3
y_pred = model(test_set[901][0].unsqueeze(0).to(device))
print(f"Model predicted label : {np.argmax(y_pred.detach().cpu().numpy())}")
print(f"Actual label : {test_set[901][1]}")

Output:

Model predicted label : 1
Actual label : 1
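
Beyond single samples, a natural follow-up (not part of the original article) is to compute the overall accuracy on the test loader, which is a minimal sketch of the "validation and testing" step from the outline above.

Python3
# Overall test-set accuracy (illustrative follow-up)
correct, total = 0, 0
model.eval()
with torch.no_grad():
    for inputs, labels in testloader:
        inputs, labels = inputs.to(device), labels.to(device)
        preds = model(inputs).argmax(dim=1)   # class with the highest logit
        correct += (preds == labels).sum().item()
        total += labels.size(0)
print(f"Test accuracy: {correct / total:.3f}")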
