Gradient Descent Algorithm in R

Last Updated : 09 Sep, 2024

Gradient Descent is a fundamental optimization algorithm used in machine learning and statistics. It minimizes a function by iteratively moving in the direction of steepest descent, defined by the negative of the gradient. The goal is to find the set of parameters that yields the lowest possible error for a given model.

How Gradient Descent Works

  • The goal of Gradient Descent is to adjust a model's parameters so that the error between the model's predictions and the actual values is minimized.
  • The algorithm calculates the gradient (slope) of the cost function with respect to each parameter and updates the parameters in the direction opposite to the gradient.
  • With each iteration, the parameters move closer to a point where the error is minimized (ideally the global minimum). The general update rule is shown below.
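
In symbols, every iteration applies the same update to each parameter \( \theta_j \), where \( \alpha \) is the learning rate and \( J(\theta) \) is the cost function being minimized:

\[ \theta_j = \theta_j - \alpha \frac{\partial J(\theta)}{\partial \theta_j} \]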

Learning Rate and Its Effect

The learning rate (α) determines how large a step is taken during each update. It plays a critical role in the convergence of the algorithm:

  • A high learning rate may cause the algorithm to overshoot the minimum, resulting in divergence.
  • A low learning rate can lead to slow convergence, making the algorithm inefficient.
  • An optimal learning rate strikes a balance between these two extremes, allowing for faster and more stable convergence, as the short sketch after this list illustrates.
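
To see the effect concretely, here is a minimal, self-contained sketch (not part of the worked example below; the function name and learning-rate values are our own) that minimizes \( f(x) = x^2 \), whose gradient is \( 2x \), with three different learning rates:

R
# Illustrative sketch: minimize f(x) = x^2 (gradient 2x) starting from x = 10.
# The three alpha values below are arbitrary, chosen to show slow convergence,
# fast convergence, and divergence.
descend <- function(alpha, steps = 50) {
  x <- 10
  for (i in 1:steps) {
    x <- x - alpha * 2 * x  # gradient step: x - alpha * f'(x)
  }
  x
}

for (alpha in c(0.01, 0.1, 1.05)) {
  cat("alpha =", alpha, " x after 50 steps:", descend(alpha), "\n")
}
# alpha = 0.01 creeps slowly toward 0, alpha = 0.1 converges quickly,
# and alpha = 1.05 overshoots on every step and diverges.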

Types of Gradient Descent

There are three main types of Gradient Descent:

  1. Batch Gradient Descent
  2. Stochastic Gradient Descent (SGD)
  3. Mini-batch Gradient Descent

1. Batch Gradient Descent

In Batch Gradient Descent, the gradient is calculated using the entire dataset. This means that every iteration takes into account all the data points when updating the model's parameters. While this approach is accurate, it can be slow and computationally expensive, especially with large datasets.

For a linear regression model, the loss function (Mean Squared Error, written here with the conventional 1/2 factor that simplifies the gradient) is:

\[ J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_{\theta}(x^{(i)}) - y^{(i)} \right)^2 \]

where,

  • \( m \) is the number of data points.
  • \( h_{\theta}(x^{(i)}) \) is the predicted value for the \( i \)-th data point.
  • \( y^{(i)} \) is the actual value.

The gradient for each parameter \( \theta_j \) is:

\[ \frac{\partial J(\theta)}{\partial \theta_j} = \frac{1}{m} \sum_{i=1}^{m} \left( h_{\theta}(x^{(i)}) - y^{(i)} \right) x_j^{(i)} \]

The parameters are updated as follows:

\[ \theta_j = \theta_j - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_{\theta}(x^{(i)}) - y^{(i)} \right) x_j^{(i)} \]

2. Stochastic Gradient Descent (SGD)

Stochastic Gradient Descent (SGD) updates the model's parameters for each individual data point, rather than using the entire dataset at once. This makes the algorithm faster and more efficient, especially for large datasets. However, because it uses only one data point at a time, the updates can be noisy, causing the loss function to fluctuate.

The gradient for each parameter \( \theta_j \) is calculated from a single data point:

\[ \frac{\partial J(\theta)}{\partial \theta_j} = \left( h_{\theta}(x^{(i)}) - y^{(i)} \right) x_j^{(i)} \]

The parameters are updated as follows:

\[ \theta_j = \theta_j - \alpha \left( h_{\theta}(x^{(i)}) - y^{(i)} \right) x_j^{(i)} \]
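
The worked example later in this article uses batch updates only. For comparison, a minimal SGD sketch for the same kind of simple linear model \( y \approx mx + b \) might look like the following (the function and variable names are our own, not from the example):

R
# Minimal SGD sketch: one randomly chosen observation per parameter update.
# Assumes numeric vectors x and y of equal length are already defined.
sgd <- function(x, y, alpha = 1e-4, n_epochs = 50) {
  m <- 0
  b <- 0
  n <- length(y)
  for (epoch in 1:n_epochs) {
    for (i in sample(n)) {            # visit the data in a fresh random order
      err <- (m * x[i] + b) - y[i]    # prediction error for one point
      m <- m - alpha * err * x[i]     # one-point gradient step for the slope
      b <- b - alpha * err            # one-point gradient step for the intercept
    }
  }
  list(m = m, b = b)
}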

3. Mini-batch Gradient Descent

Mini-batch Gradient Descent is a compromise between Batch Gradient Descent and Stochastic Gradient Descent. It splits the dataset into small batches and updates the model's parameters after processing each batch. This approach balances the efficiency of SGD and the accuracy of Batch Gradient Descent, reducing the noise while still being computationally efficient.

The gradient for each parameter \( \theta_j \) is calculated over a mini-batch of data points:

\[ \frac{\partial J(\theta)}{\partial \theta_j} = \frac{1}{b} \sum_{i=1}^{b} \left( h_{\theta}(x^{(i)}) - y^{(i)} \right) x_j^{(i)} \]

where,

  • \( b \) is the number of data points in the mini-batch.

The parameters are updated as follows:

\[ \theta_j = \theta_j - \alpha \frac{1}{b} \sum_{i=1}^{b} \left( h_{\theta}(x^{(i)}) - y^{(i)} \right) x_j^{(i)} \]
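
As with the SGD sketch above, the worked example does not implement this variant; a minimal sketch under the same assumptions (our own names, an arbitrary batch size) could look like this:

R
# Minimal mini-batch sketch: average the gradient over each batch of points.
# Assumes numeric vectors x and y of equal length are already defined.
minibatch_gd <- function(x, y, alpha = 1e-4, batch_size = 16, n_epochs = 50) {
  m <- 0
  b <- 0
  n <- length(y)
  for (epoch in 1:n_epochs) {
    idx <- sample(n)                          # shuffle the data each epoch
    for (start in seq(1, n, by = batch_size)) {
      batch <- idx[start:min(start + batch_size - 1, n)]
      err <- (m * x[batch] + b) - y[batch]    # errors for this mini-batch
      m <- m - alpha * mean(err * x[batch])   # averaged gradient for the slope
      b <- b - alpha * mean(err)              # averaged gradient for the intercept
    }
  }
  list(m = m, b = b)
}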

Now we implement Batch Gradient Descent step by step for a linear regression problem in the R programming language.

Step 1: Data Preparation

First, create a synthetic dataset for this example.

R
set.seed(42)  # For reproducibility

n <- 100                                         # Number of data points
x <- runif(n, min = 0, max = 100)                # Predictor values
y <- 50 * x + 100 + rnorm(n, mean = 0, sd = 10)  # True line: slope 50, intercept 100, plus noise

Step 2: Initialize Parameters

Initialize the slope (m), intercept (b), learning rate (alpha), and the number of iterations.

R
m <- 0              # Initial slope
b <- 0              # Initial intercept
alpha <- 0.00001    # Learning rate
iterations <- 1000  # Number of iterations

Step 3: Manual Gradient Descent Implementation

We will implement the gradient descent algorithm using loops.

R
gradient_descent <- function(x, y, m, b, alpha, iterations) {
  n <- length(y)                       # Number of data points
  cost_history <- numeric(iterations)  # To store the cost at each iteration

  for (i in 1:iterations) {
    # Predicted values
    y_pred <- m * x + b

    # Calculate gradients
    gradient_m <- -(2/n) * sum(x * (y - y_pred))  # Gradient for slope (m)
    gradient_b <- -(2/n) * sum(y - y_pred)        # Gradient for intercept (b)

    # Update parameters
    m <- m - alpha * gradient_m
    b <- b - alpha * gradient_b

    # Calculate and store the cost (Mean Squared Error)
    cost <- sum((y - y_pred)^2) / n
    cost_history[i] <- cost

    # Print the cost every 100 iterations
    if (i %% 100 == 0) {
      cat("Iteration:", i, " Cost:", cost, "\n")
    }
  }

  return(list(m = m, b = b, cost_history = cost_history))
}

# Run the gradient descent algorithm
result <- gradient_descent(x, y, m, b, alpha, iterations)

# Extract final slope, intercept, and cost history
final_m <- result$m
final_b <- result$b
cost_history <- result$cost_history

Output:

Iteration: 100  Cost: 2395.926
Iteration: 200  Cost: 2390.765
Iteration: 300  Cost: 2388.487
Iteration: 400  Cost: 2386.211
Iteration: 500  Cost: 2383.938
Iteration: 600  Cost: 2381.667
Iteration: 700  Cost: 2379.397
Iteration: 800  Cost: 2377.131
Iteration: 900  Cost: 2374.866
Iteration: 1000  Cost: 2372.604

Step 4: Plot the Fitted Line

Visualize the data points and the best-fit line obtained from gradient descent.

R
plot(x, y, main = "Gradient Descent: Fitted Line", xlab = "x", ylab = "y")
abline(a = final_b, b = final_m, col = "red", lwd = 2)

Output:

[Figure: scatter plot of the data points with the fitted line from gradient descent]

Step 5: Visualization of the Cost Function Over Iterations

Plot the cost function over the iterations to visualize how the algorithm converges toward the minimum.

R
plot(1:iterations, cost_history, type = "l", col = "blue", lwd = 2,
     main = "Cost Function over Iterations", xlab = "Iterations", ylab = "Cost")

Output:

[Figure: plot of the cost function decreasing over iterations]

Step 6: Summary of Results

Check the final result.

R
cat("Final Slope (m):", final_m, "\nFinal Intercept (b):", final_b, "\n") 

Output:

Final Slope (m): 51.42538 
Final Intercept (b): 1.215063
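
Note that the slope is close to the true value of 50 used to generate the data, but with this very small learning rate the intercept (true value 100) is still far from converged after 1,000 iterations. As a quick sanity check (not part of the original walkthrough), the estimates can be compared against R's built-in closed-form least-squares fit:

R
# Sanity check: compare gradient descent with the closed-form fit from lm().
fit <- lm(y ~ x)
cat("lm() Slope:", coef(fit)[2], "\nlm() Intercept:", coef(fit)[1], "\n")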

Applications and Considerations

Applications:

  • Linear and Logistic Regression: Gradient Descent is used to optimize the parameters for these models.
  • Neural Networks: It is crucial for training neural networks, where it helps adjust weights and biases to minimize error.
  • Support Vector Machines (SVMs): Gradient Descent can be used to optimize the margin in SVMs.

Considerations:

  • The learning rate must be chosen carefully. Too high a rate can cause the algorithm to overshoot the minimum, while too low a rate can make convergence very slow.
  • Gradient Descent might not always reach the global minimum, especially if the loss function has multiple minima (local minima).
  • The choice between Batch, Stochastic, and Mini-batch Gradient Descent depends on the dataset size and available computational resources.

Conclusion

Gradient Descent is a versatile and essential optimization algorithm used across various machine learning models. By understanding its different types and how to implement it in R, one can effectively optimize models for better performance. Choosing the right type of Gradient Descent and properly tuning the learning rate are critical for achieving the best results.

