Optimization in Neural Networks and Newton's Method

Last Updated : 06 Nov, 2023

In machine learning, optimizers and loss functions are two components that help improve the performance of a model. A loss function measures a model's performance by computing the difference between the output the model is expected to produce and the output it actually produces; mean squared error, hinge loss, and log loss are common examples. An optimizer improves the model by adjusting its parameters so as to minimize the value of the loss function; SGD, Adam, RMSProp, and Newton's method are common examples. The role of the optimizer is to find the set of parameters (weights and biases) of the neural network that allows it to make accurate predictions.

Optimization in Neural Networks

Computational methods for iterative optimization can be broadly divided into three types:

  • Zero Order or Direct Search
    • These involve exploring a range of potential values for the variable x (akin to a grid search) to find the minimum of the objective function. Such techniques are generally computationally intensive compared to higher-order methods, but they are reliable and straightforward to implement. While more advanced techniques exist that improve upon grid search, our primary emphasis here is on the higher-order iterative approaches.
  • First-order or Gradient Methods
    • These techniques make use of the first-order partial derivatives.
    • Example: gradient descent and its variants SGD, ADAM, RMSPROP, etc.
  • Second Order Methods:
    • These techniques make use of the second-order partial derivatives (hessian).
    • Example: Newton's method, quasi-Newton methods.

In this article we will focus on Newton's method for optimization and how it can be used for training neural networks. Let us first compare it with gradient descent.

Gradient Descent vs Newton's Method

Gradient descent and Newton's method are two well-known optimization techniques for training neural networks. Each has its advantages and disadvantages, and the choice between them depends on the problem at hand, the complexity of the neural network, and the available computational resources.

| Criteria | Gradient Descent | Newton's Method |
| --- | --- | --- |
| Working | Relies only on the first-order derivative (gradient) of the loss function to update model parameters. | Relies on both the first-order derivative (gradient) and the second-order derivatives (Hessian) to update model parameters. |
| Applicability | Relatively easy to implement and widely used in practice due to its simplicity. | May not be practical for very large neural networks, or when the Hessian matrix is expensive to compute or invert. |
| Convergence | Typically converges to a local minimum, but is sensitive to the choice of learning rate, may get stuck at saddle points, and can converge slowly in some cases. | Often converges much faster than gradient descent. |
| Learning rate | Highly sensitive. | Less sensitive. |
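
To make the contrast concrete, here is a minimal sketch in plain Python (the quadratic f(x) = (x - 3)^2 + 1, the learning rate, and the step counts are illustrative choices, not from the article). Because a quadratic is matched exactly by Newton's second-order model, Newton's method lands on the minimum in a single step, while gradient descent needs many steps and a suitable learning rate.

Python3
# Minimize f(x) = (x - 3)^2 + 1, whose minimum is at x = 3
def f_prime(x):         # first derivative
    return 2 * (x - 3)

def f_double_prime(x):  # second derivative (constant for a quadratic)
    return 2.0

# Gradient descent: x_{t+1} = x_t - lr * f'(x_t)
x_gd, lr = 0.0, 0.1
for _ in range(20):
    x_gd -= lr * f_prime(x_gd)

# Newton's method: x_{t+1} = x_t - f'(x_t) / f''(x_t)
x_newton = 0.0
x_newton -= f_prime(x_newton) / f_double_prime(x_newton)

print(f"Gradient descent after 20 steps: {x_gd:.4f}")      # close to 3
print(f"Newton's method after 1 step:    {x_newton:.4f}")  # exactly 3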

Basic Concept of Newton's Method

The genesis of Newton's method lies in calculus, where it is used to find the roots of a differentiable function f, i.e., the solutions of the equation f(x) = 0.

Let us first understand Newton's method for root finding; then we will see how it can be used for optimization.

  • Problem: We have a function f(x). The goal is to find a root of f(x), i.e., a value of x for which f(x) = 0.
  • Initial guess: We make an initial guess X_0 for the root.
  • Update: We update the current guess X_t to get a new estimate X_{t+1} using the formula

X_{t+1} = X_t - f(X_t)/f'(X_t)

  • Derivation of the above formula: Geometrically, the tangent line to f at X_t (the orange line in the figure) has slope f'(X_t), and it crosses the x-axis at X_{t+1}. Since slope is rise over run, we get

f'(X_t) = (f(X_t) - 0) / (X_t - X_{t+1})

Rearranging this equation gives

X_{t+1} = X_t - f(X_t)/f'(X_t)

  • Repeat: The method proceeds iteratively. We repeat the update step until we reach a predefined tolerance or a maximum number of iterations, as in the sketch below.
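
The following is a minimal sketch of this root-finding iteration in plain Python. The example function f(x) = x^2 - 2 (whose positive root is sqrt(2)), the tolerance, and the iteration cap are illustrative choices, not part of the original article.

Python3
def newton_root(f, f_prime, x0, tol=1e-10, max_iter=100):
    """Find a root of f using Newton's method: x_{t+1} = x_t - f(x_t)/f'(x_t)."""
    x = x0
    for _ in range(max_iter):
        fx = f(x)
        if abs(fx) < tol:           # predefined tolerance reached
            break
        x = x - fx / f_prime(x)     # Newton update
    return x

# Example: root of f(x) = x^2 - 2, i.e. sqrt(2) ~= 1.41421356
root = newton_root(lambda x: x**2 - 2, lambda x: 2 * x, x0=1.0)
print(root)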

Newton's Method for Optimization

The method discussed above finds the roots of a function, i.e., solves f(x) = 0. We also know that the derivative of a function is zero at its critical points (minima/maxima/saddle points). So instead of finding the roots of f(x), we can find the roots of f'(x); by doing so, we locate the critical points of f(x).

[Figure: Newton's Method for Optimization]


Let us first build an intuition for Newton's method using the single-variable case; then we will extend it to the multivariate case.
Consider the graph above, which shows a single step of Newton's method.

Here, we want to find the minimum of the function f(X):

  1. We start with an initial guess X_0.
  2. We approximate the function around X_0 using a Taylor series, keeping terms up to second order:
    • f(X) \approx f(X_0)+\frac{f^{\prime}(X_0)}{1 !}(X-X_0)+\frac{f^{\prime \prime}(X_0)}{2 !}(X-X_0)^{2}
  3. Since the derivative of a function is zero at a minimum/maximum, we find the minimum of this approximation by differentiating it and setting the result to zero. Note that this is the minimum of the Taylor approximation built at X_0, not necessarily of f itself.
    • f'(X) \approx f'(X_0) + f''(X_0)(X - X_0) = 0
    • Rearranging this equation gives the next point X_1:
      • X_1 = X_0 - [f''(X_0)]^{-1} f'(X_0)
  4. Repeat steps 2 and 3 until convergence is reached (e.g., when the change in X is sufficiently small). A minimal one-dimensional sketch is given after this list.
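
Here is a minimal sketch of this one-dimensional update rule in plain Python. The test function f(x) = x^4 - 3x^3 + 2 (with a local minimum at x = 2.25), the starting point, and the tolerance are illustrative choices, not from the original article.

Python3
def newton_minimize_1d(f_prime, f_double_prime, x0, tol=1e-10, max_iter=100):
    """Minimize a 1-D function by applying Newton's method to f'(x) = 0."""
    x = x0
    for _ in range(max_iter):
        step = f_prime(x) / f_double_prime(x)
        x = x - step                  # X_{t+1} = X_t - f'(X_t) / f''(X_t)
        if abs(step) < tol:           # change in x is sufficiently small
            break
    return x

# Example: f(x) = x**4 - 3*x**3 + 2 has a local minimum at x = 2.25
x_min = newton_minimize_1d(lambda x: 4 * x**3 - 9 * x**2,
                           lambda x: 12 * x**2 - 18 * x,
                           x0=3.0)
print(x_min)   # ~2.25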

We can extend the above method to the multivariate case.

The multivariable quadratic approximation to f(x_1, …, x_n) around a point x_0 can be written as

f(x) \approx f(x_0) + (x-x_0)^{T}\nabla f(x_0) + \frac{1}{2}(x -x_0)^{T}H(x_0)(x-x_0)

Here,

  • x is a vector of length n representing the variables x_1, …, x_n
  • x_0 is a vector of length n representing the values of the variables at the approximation point
  • \nabla f(x_0) is the multivariable equivalent of the first derivative, the gradient: it collects the first partial derivatives of f(x) with respect to every parameter in one vector.

\overrightarrow{\nabla}f(\overrightarrow{x}) =\begin{bmatrix} \frac{\partial f}{\partial x_1}\\ \frac{\partial f}{\partial x_2}\\ \vdots \\ \frac{\partial f}{\partial x_n}\end{bmatrix}

The multivariable equivalent of the second derivative: instead of a vector, we now need a matrix. This matrix Hf(x) is called the Hessian, and it is the matrix of second derivatives of f(x) with respect to each pair of parameters.

  • To calculate the Hessian, we take the gradient and differentiate each of its components with respect to each of the parameters.
  • For example, to calculate the first row of the Hessian we take the partial derivative of the gradient \begin{bmatrix} \frac{\partial f}{\partial x_1} & \frac{\partial f}{\partial x_2} & \frac{\partial f}{\partial x_3} & ... & \frac{\partial f}{\partial x_n} \end{bmatrix} with respect to x_1, which gives us \begin{bmatrix} \frac{\partial^2 f}{\partial x_{1}^{2}} & \frac{\partial^2 f}{\partial x_{1} \partial x_{2}} & \frac{\partial^2 f}{\partial x_{1} \partial x_{3}} & ... & \frac{\partial^2 f}{\partial x_{1} \partial x_{n}} \end{bmatrix}, the first row of the Hessian matrix.
  • To get the second row we take the partial derivative of the gradient with respect to x_2, which gives us \begin{bmatrix} \frac{\partial^2 f}{\partial x_{2} \partial x_{1}} & \frac{\partial^2 f}{\partial x_{2}^{2}} & \frac{\partial^2 f}{\partial x_{2} \partial x_{3}} & ... & \frac{\partial^2 f}{\partial x_{2} \partial x_{n}} \end{bmatrix}
  • This is repeated up to x_n, giving the Hessian matrix below:

Hf(\overrightarrow{x}) = \begin{bmatrix} \frac{\partial^2 f}{\partial x_{1}^{2}} & \frac{\partial^2 f}{\partial x_{1} \partial x_{2}} & ... & \frac{\partial^2 f}{\partial x_1 \partial x_n}\\ \frac{\partial^2 f}{\partial x_2 \partial x_1} & & & \frac{\partial^2 f}{\partial x_2 \partial x_n}\\ \vdots & & & \vdots \\ \frac{\partial^2 f}{\partial x_n \partial x_1} & ... & ... & \frac{\partial^2 f}{\partial x_{n}^{2}} \end{bmatrix}

Hessian Matrix

The update rule in Newton's method uses the Hessian matrix and the gradient to compute the step: setting the gradient of the quadratic approximation to zero gives x_{t+1} = x_t - H(x_t)^{-1} \nabla f(x_t), which in practice is obtained by solving a linear system of equations involving the Hessian and the gradient rather than inverting the Hessian explicitly. A small sketch of one such update is given below.
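
As a concrete illustration, here is a minimal sketch of a few multivariate Newton updates, assuming NumPy is available; the test function f(x, y) = (x-1)^2 + 2(y+2)^2 + xy, its hand-computed gradient and Hessian, the starting point, and the iteration count are illustrative choices, not from the original article.

Python3
import numpy as np

def grad(v):
    """Gradient of f(x, y) = (x-1)^2 + 2(y+2)^2 + x*y."""
    x, y = v
    return np.array([2 * (x - 1) + y, 4 * (y + 2) + x])

def hessian(v):
    """Hessian of the same f (constant, since f is quadratic)."""
    return np.array([[2.0, 1.0],
                     [1.0, 4.0]])

v = np.array([0.0, 0.0])           # initial guess
for _ in range(5):
    g = grad(v)
    H = hessian(v)
    step = np.linalg.solve(H, g)   # solve H * step = g instead of inverting H
    v = v - step                   # Newton update: v_{t+1} = v_t - H^{-1} g

print(v)         # ~[ 2.2857, -2.5714]
print(grad(v))   # ~[0, 0] -> critical point reached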

Significance of Hessian Matrix

  • Curvature Information: The Hessian matrix provides information about the curvature of the loss landscape. Specifically, it tells you how the loss function's gradient (first derivative) changes with respect to changes in each parameter, and how the parameters interact with each other in the optimization space.
  • Minima, Maxima and Saddle Points: The Hessian matrix can help identify whether a critical point (a point where the gradient is zero) is a local minimum, a local maximum, or a saddle point, based on the eigenvalues of the Hessian at that point: if all eigenvalues are positive the point is a local minimum, if all are negative it is a local maximum, and if some are positive and some negative it is a saddle point. A small check of this kind is sketched below.
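
A minimal sketch of this eigenvalue test, assuming NumPy is available. The function f(x, y) = x^2 - y^2, whose critical point at the origin is a saddle point, is an illustrative choice, not from the original article.

Python3
import numpy as np

# Hessian of f(x, y) = x^2 - y^2 at the critical point (0, 0)
H = np.array([[ 2.0,  0.0],
              [ 0.0, -2.0]])

eigenvalues = np.linalg.eigvalsh(H)   # eigenvalues of a symmetric matrix
if np.all(eigenvalues > 0):
    kind = "local minimum"
elif np.all(eigenvalues < 0):
    kind = "local maximum"
else:
    kind = "saddle point"

print(eigenvalues, "->", kind)        # [-2.  2.] -> saddle point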

Utilization of Second Derivative

The geometric explanation of Newton's method involves an iterative process where, at each step, it approximates the graph of the function f(x) near the trial value by fitting a parabola. Newton's method makes use of the first derivative (the gradient) and the second derivative (the Hessian matrix) of the objective function. The parabola is designed to have the same slope and curvature as the function's graph at that particular point. Using the second derivative provides a more accurate and informative characterization of the local curvature of the function than the first derivative, which only provides slope information. Once the parabola is constructed, the method proceeds to the maximum or minimum of this parabolic approximation.

Gradient descent, on the other hand, relies solely on the first derivative (the gradient). Incorporating second-derivative information allows Newton's method to take more precise and efficient steps towards the minimum; it can converge faster than first-order methods like gradient descent because it considers both the direction and the curvature of the loss landscape.

How to use Newton's Method for Optimization?

A basic outline of implementing Newton's method for a neural network is given below:

  • Define the model: Determine the neural network architecture, including the number of layers and activation functions. Identify the model parameters, which are the weights and biases of the network, and choose hyperparameters such as the learning rate.
  • Initialize parameters: Choose an initial guess for the model parameters.
  • Loss function: Define a loss function that measures the error between the model's output and the actual output.
  • Optimizer: Define the optimizer that will update the model parameters; in our case it is the Newton optimizer.
  • Regularization: Incorporate regularization techniques if needed, such as L2 regularization (weight decay) or dropout, to prevent overfitting.
  • Iterative update (see the sketch after this list):
    • Pass the input to the model and compute the loss.
    • Compute the gradient and the Hessian matrix of the loss with respect to the model parameters (the gradient is obtained via backpropagation).
    • Update the parameter estimate using the Newton step.
    • Repeat the above steps until convergence or until the maximum number of iterations is reached.
  • Validation and Testing: After training, evaluate on a validation dataset to assess the model performance.
  • Fine-Tuning: Experiment with different hyperparameters, learning rates, and regularization strengths to achieve better results.
  • Convergence Check
    • Check for convergence by evaluating a stopping criterion. Common criteria include:
    • The norm of the gradient becoming very small.
    • The change in the parameter estimate becoming very small.
    • Reaching a specified number of iterations.
  • Termination and Output: If the convergence criterion is met, terminate the optimization. Output the final parameter estimate which represents the optimal neural network weights and biases.
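
The full PyTorch example later in this article uses the L-BFGS optimizer, but the outline above can also be illustrated directly on a tiny problem. The following is a minimal sketch, assuming PyTorch is available, of an explicit Newton update for a small linear model whose parameters fit in a single vector; the synthetic data, the loss_fn helper, and the iteration counts are illustrative choices, not from the original article. For a real neural network the Hessian over all weights is usually far too large to form explicitly, which is why quasi-Newton methods such as L-BFGS are used instead.

Python3
import torch

torch.manual_seed(0)

# Tiny synthetic regression problem (illustrative only)
X = torch.randn(100, 3)
true_w = torch.tensor([1.5, -2.0, 0.5])
y = X @ true_w + 0.01 * torch.randn(100)

def loss_fn(w):
    """Mean squared error of a linear model y_hat = X @ w."""
    return ((X @ w - y) ** 2).mean()

w = torch.zeros(3)                                           # initial guess

for _ in range(5):                                           # iterative update
    grad = torch.autograd.functional.jacobian(loss_fn, w)    # gradient vector
    hess = torch.autograd.functional.hessian(loss_fn, w)     # Hessian matrix
    step = torch.linalg.solve(hess, grad)                    # solve H * step = grad
    w = w - step                                             # Newton update
    if grad.norm() < 1e-8:                                   # convergence check
        break

print(w)   # close to [1.5, -2.0, 0.5]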

Step-by-Step Guide of Newton Method in Python

We will train a custom neural network for image classification using PyTorch, with the LBFGS solver as the optimizer, on the well-known CIFAR10 dataset.

Import Necessary Libraries

torch is the main library for implementing PyTorch-based neural networks.

Python3
import torch
from torch import nn, optim
import torch.nn.functional as F
import pandas as pd
import numpy as np


Initialize Device

We initialize our device variable to "cuda" if a GPU is available, otherwise to "cpu".

Python3
device = "cuda" if torch.cuda.is_available() else "cpu" 


Create Model

Here we design a simple CNN for classification. It consists of 2 convolution layers followed by 2 linear layers. We use ReLU activation and max pooling after each convolution layer, a dropout layer to reduce overfitting, and Xavier initialization for the weights of the two convolution layers.

Python3
# Creating our own ConvNet
class ConvNet(nn.Module):
    def __init__(self):
        super(ConvNet, self).__init__()
        self.conv1 = nn.Conv2d(3, 6, 5)     # 3 input channels -> 6 feature maps, 5x5 kernel
        self.pool = nn.MaxPool2d(2, 2)
        self.conv2 = nn.Conv2d(6, 16, 5)
        self.fc1 = nn.Linear(16 * 5 * 5, 1000)
        self.fc2 = nn.Linear(1000, 10)      # 10 output classes for CIFAR10
        self.dropout = nn.Dropout(0.3)

    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))
        x = self.dropout(x)
        x = self.pool(F.relu(self.conv2(x)))
        x = self.dropout(x)
        x = x.view(-1, 16 * 5 * 5)          # flatten feature maps for the linear layers
        x = F.relu(self.fc1(x))
        x = self.fc2(x)
        return x

model = ConvNet().to(device)
torch.nn.init.xavier_uniform_(model.conv1.weight)
torch.nn.init.xavier_uniform_(model.conv2.weight)
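
As a quick sanity check (not part of the original article), we can pass a dummy CIFAR10-sized batch through the model and confirm that the output has one logit per class.

Python3
# One fake image: batch of 1, 3 channels, 32x32 pixels (CIFAR10 size)
dummy = torch.randn(1, 3, 32, 32).to(device)
print(model(dummy).shape)   # expected: torch.Size([1, 10])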


Load the dataset

Here we load the dataset using the built-in CIFAR10 class. We use a transform to make sure the data is in tensor format, split the data into train and test sets, and create DataLoader objects that provide iterators for the model.

Python3
from torchvision import transforms
from torchvision.datasets import CIFAR10

train_transforms = transforms.Compose([transforms.ToTensor()])
dataset = CIFAR10(root="./test/", train=False, download=True,
                  transform=train_transforms)

# Split the 10,000 images into 8,000 for training and 2,000 for testing
train_set, test_set = torch.utils.data.random_split(dataset, [8000, 2000])

trainloader = torch.utils.data.DataLoader(train_set, batch_size=1024, shuffle=True)
testloader = torch.utils.data.DataLoader(test_set, batch_size=1024, shuffle=True)


Loss function and optimizer

We use cross-entropy loss, since this is a classification problem, and the LBFGS optimizer. L-BFGS is a limited-memory quasi-Newton method: instead of computing and inverting the full Hessian, it builds an estimate of the inverse Hessian from a short history of recent gradients, which saves memory and computation.

Python3
criterion = nn.CrossEntropyLoss()
optimizer = optim.LBFGS(model.parameters(), lr=0.39, max_iter=2,
                        history_size=10, line_search_fn='strong_wolfe')


Model Training

We now train the model. Below is standard PyTorch training code; the only difference is the use of a closure function, which the PyTorch implementation of LBFGS requires because it may need to re-evaluate the loss and gradients several times per optimization step.

Python3
N_EPOCHS = 50
for epoch in range(N_EPOCHS):
    epoch_loss = 0.0

    model.train()
    for inputs, labels in trainloader:
        inputs = inputs.to(device)
        labels = labels.to(device)

        # LBFGS may re-evaluate the loss several times per step, so it expects
        # a closure that recomputes the forward pass and gradients
        def closure():
            optimizer.zero_grad()
            outputs = model(inputs)
            loss = criterion(outputs, labels)
            loss.backward(retain_graph=True)
            return loss

        optimizer.step(closure)

        outputs = model(inputs)
        loss = criterion(outputs, labels)
        epoch_loss += loss.item()

    val_loss = 0.0
    model.eval()
    for inputs, labels in testloader:
        inputs = inputs.to(device)
        labels = labels.to(device)

        outputs = model(inputs)
        loss = criterion(outputs, labels)
        val_loss += loss.item()

    print("Epoch: {} Train Loss: {} Val Loss: {}".format(
        epoch, epoch_loss / len(trainloader), val_loss / len(testloader)))

Output:

Epoch: 0 Train Loss: 2.302582234144211 Val Loss: 2.302242159843445 
Epoch: 1 Train Loss: 2.302254170179367 Val Loss: 2.302234172821045
Epoch: 2 Train Loss: 2.301358014345169 Val Loss: 2.3020946979522705
Epoch: 3 Train Loss: 2.302258551120758 Val Loss: 2.3020176887512207
Epoch: 4 Train Loss: 2.300609827041626 Val Loss: 2.298862934112549
Epoch: 5 Train Loss: 2.199757754802704 Val Loss: 2.0873111486434937

After training for 50 epochs, we get a loss of about 1.6 on both the train and validation sets.

Let's check the output of our model on a sample from the test set:

Python3
y_pred = model(test_set[901][0].unsqueeze(0).to(device))
print(f"Model predicted label : {np.argmax(y_pred.detach().cpu().numpy())}")
print(f"Actual label : {test_set[901][1]}")

Output:

Model predicted label : 1
Actual label : 1
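
Beyond single samples, a natural follow-up (not part of the original article) is to compute the overall accuracy on the test loader, which is a minimal sketch of the "validation and testing" step from the outline above.

Python3
# Overall test-set accuracy (illustrative follow-up)
correct, total = 0, 0
model.eval()
with torch.no_grad():
    for inputs, labels in testloader:
        inputs, labels = inputs.to(device), labels.to(device)
        preds = model(inputs).argmax(dim=1)   # class with the highest logit
        correct += (preds == labels).sum().item()
        total += labels.size(0)
print(f"Test accuracy: {correct / total:.3f}")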
