Stochastic Gradient Descent (SGD): An iterative method for optimising an objective function with suitable smoothness properties

Stochastic Gradient Descent (SGD) is one of the most widely used optimisation methods in machine learning. It powers the training of models ranging from linear regression to deep neural networks because it scales well to large datasets and can make progress even when computing an exact gradient is expensive. If you are exploring optimisation as part of a data science course in Kolkata, SGD is a core concept because it connects mathematical ideas (smoothness, gradients, convergence) with hands-on model training.

At a high level, SGD minimises an objective function—often called a loss function—by repeatedly taking steps in the direction that reduces the loss. The “stochastic” part means that instead of using the full dataset to compute the gradient every time, SGD uses a randomly selected example (or a small batch). This makes each update cheaper and often faster in practice.

The optimisation goal and the “smoothness” idea

In supervised learning, we usually minimise an average loss over data points:

F(θ) = (1/n) Σ_{i=1}^{n} ℓ_i(θ)

Here, θ represents the model parameters, and ℓ_i(θ) is the loss on the i-th example. Many commonly used losses (like mean squared error and logistic loss) are smooth or “smooth enough” in typical operating regions.
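
As a concrete reading of this formula, the short sketch below evaluates the average loss when ℓ_i is the squared error on example i; the function name and the NumPy-based setup are illustrative assumptions rather than anything specified in the article.

```python
import numpy as np

def average_loss(theta, X, y):
    """Average loss F(theta) = (1/n) * sum_i l_i(theta), where here
    l_i(theta) = 0.5 * (x_i^T theta - y_i)^2 (an illustrative squared-error choice)."""
    residuals = X @ theta - y             # one residual per data point
    return 0.5 * np.mean(residuals ** 2)  # mean of the per-example losses
```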

When we say the objective has “suitable smoothness properties,” we typically mean the gradient does not change too abruptly. A common mathematical way to express this is: the function has an L-Lipschitz continuous gradient, which informally means small changes in parameters produce bounded changes in the gradient. This matters because gradient-based updates assume the gradient direction is a reliable local guide to decreasing the objective.
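
Formally, one standard way to state this condition (a textbook definition rather than anything specific to this article) is that for all parameter vectors θ and θ′:

‖∇F(θ) − ∇F(θ′)‖ ≤ L ‖θ − θ′‖

Classical analyses then tie the learning rate to L (often requiring η ≤ 1/L) so that a gradient step does not overshoot the decrease predicted by the local gradient.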

The SGD update rule (and why it works)

Standard gradient descent uses the full gradient:

θ_{t+1} = θ_t − η ∇F(θ_t)

SGD replaces the full gradient with an estimate computed from a randomly chosen data point (or mini-batch):

θ_{t+1} = θ_t − η ∇ℓ_{i_t}(θ_t)

where i_t is a random index and η is the learning rate.

Why does this make sense? Because ∇ℓ_{i_t}(θ_t) is an unbiased (or approximately unbiased) estimate of the true gradient under typical sampling assumptions. Over many iterations, the noise averages out, and the parameters move toward regions of lower loss. In fact, the noise can sometimes help optimisation escape shallow local minima or flat regions, which is one reason SGD remains popular for deep learning.
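
To make the update concrete, here is a minimal sketch of plain SGD on a synthetic least-squares problem; the data, learning rate, and iteration count are illustrative assumptions, not values prescribed by the article.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic least-squares problem (illustrative data only).
n, d = 1_000, 5
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = X @ w_true + 0.1 * rng.normal(size=n)

eta = 0.01            # learning rate
theta = np.zeros(d)   # initial parameters

for t in range(10_000):
    i = rng.integers(n)              # i_t: a uniformly random example index
    residual = X[i] @ theta - y[i]   # prediction error on that single example
    grad_i = residual * X[i]         # gradient of 0.5 * (x_i^T theta - y_i)^2
    theta = theta - eta * grad_i     # SGD update: theta <- theta - eta * grad

print("estimated parameters:", theta)
print("true parameters:     ", w_true)
```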

If you are implementing training loops while taking a data science course in Kolkata, you will notice that SGD often improves the loss quickly at first, even though it may “jitter” near the optimum because each step is based on a noisy gradient estimate.

Mini-batch SGD, stability, and convergence intuition

In practice, we often use mini-batches: instead of one example, we compute the gradient over a small set (like 32, 64, or 128 samples). Mini-batches reduce gradient noise and better utilise modern hardware.
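
Below is a minimal sketch of a single mini-batch step, reusing the least-squares setup from the earlier example; the batch size of 64 and the function signature are illustrative assumptions.

```python
import numpy as np

def minibatch_sgd_step(theta, X, y, rng, eta=0.01, batch_size=64):
    """One mini-batch SGD step for a least-squares loss (illustrative setup).

    Sampling `batch_size` rows and averaging their gradients reduces the
    variance of the update compared with using a single example.
    """
    idx = rng.choice(len(y), size=batch_size, replace=False)
    residuals = X[idx] @ theta - y[idx]       # prediction errors on the batch
    grad = X[idx].T @ residuals / batch_size  # average gradient over the batch
    return theta - eta * grad

# Example usage inside a training loop:
# theta = minibatch_sgd_step(theta, X, y, np.random.default_rng(0))
```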

The key trade-off is:

  • Smaller batches: faster updates, more noise, sometimes better generalisation.
  • Larger batches: smoother updates, potentially faster per-epoch convergence, but can require careful learning-rate tuning.

Convergence behaviour depends on the type of objective:

  • For convex and smooth objectives, SGD can be shown to converge under appropriate learning-rate schedules (often decreasing over time).
  • For non-convex objectives (common in deep networks), SGD typically converges to a point where gradients are small, but not necessarily the global optimum.

The learning rate is crucial. If it is too large, updates overshoot and training becomes unstable. If it is too small, progress slows dramatically.

Practical tuning: learning rate schedules, momentum, and common pitfalls

SGD is simple, but good performance depends on tuning a few key pieces:

Learning rate schedules

A fixed learning rate can work, but schedules often work better (see the code sketch after this list):

  • Step decay: reduce η after certain epochs.
  • Exponential decay: shrink η steadily over time.
  • Cosine annealing: reduce η gradually along a smooth cosine curve.
  • Warm-up: start with a smaller η and increase it briefly to stabilise early training.
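
The sketch below implements these four schedules as plain Python functions; the decay factors, warm-up length, and other constants are illustrative defaults rather than recommended values.

```python
import math

def step_decay(eta0, epoch, drop=0.5, every=10):
    """Step decay: multiply the base rate by `drop` every `every` epochs."""
    return eta0 * (drop ** (epoch // every))

def exponential_decay(eta0, epoch, k=0.05):
    """Exponential decay: eta0 * exp(-k * epoch)."""
    return eta0 * math.exp(-k * epoch)

def cosine_annealing(eta0, epoch, total_epochs, eta_min=0.0):
    """Cosine annealing: smoothly decrease from eta0 to eta_min over total_epochs."""
    return eta_min + 0.5 * (eta0 - eta_min) * (1 + math.cos(math.pi * epoch / total_epochs))

def linear_warmup(eta0, step, warmup_steps=500):
    """Warm-up: ramp linearly from near zero to eta0 over the first warmup_steps updates."""
    return eta0 * min(1.0, (step + 1) / warmup_steps)
```

In a training loop, these would typically be evaluated once per epoch (or once per update step for warm-up) to set the η used by the SGD update.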

Momentum

Momentum helps smooth noisy updates by accumulating a velocity term. It tends to speed up progress along consistent directions and reduce oscillations in steep, narrow valleys.
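
A minimal sketch of the classical “heavy-ball” form of momentum follows; this is one common convention among several, and the coefficient β = 0.9 is a typical illustrative default.

```python
import numpy as np

def momentum_step(theta, velocity, grad, eta=0.01, beta=0.9):
    """One SGD-with-momentum update (classical heavy-ball form).

    The velocity accumulates an exponentially weighted sum of past gradients,
    so steps along consistently downhill directions grow while oscillating
    components tend to cancel out.
    """
    velocity = beta * velocity - eta * grad   # update the velocity term
    theta = theta + velocity                  # move parameters along the velocity
    return theta, velocity
```

Several conventions exist for where the learning rate multiplies the velocity, so treat this as one representative form rather than the only one.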

Common pitfalls

  • Not normalising features: for many models, poorly scaled features make gradients ill-conditioned and SGD struggles (see the standardisation sketch after this list).
  • Ignoring batch size effects: changing batch size usually requires changing the learning rate.
  • Overfitting due to long training: monitor validation performance and use early stopping when needed.
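
As a small illustration of the first pitfall, one common way to standardise features before running SGD is sketched below; the helper name, the train/test split, and the small epsilon constant are assumptions for the example.

```python
import numpy as np

def standardise(X_train, X_test):
    """Scale features to zero mean and unit variance using training statistics only,
    so that no information from the test set leaks into preprocessing."""
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0) + 1e-8   # small constant avoids division by zero
    return (X_train - mean) / std, (X_test - mean) / std
```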

These are exactly the “engineering meets theory” lessons that appear repeatedly in a data science course in Kolkata, because real model training is rarely “set it and forget it.”

Conclusion

Stochastic Gradient Descent remains a foundational optimisation method because it is computationally efficient, conceptually simple, and effective for large-scale learning. Its success comes from using cheap, noisy gradient estimates to steadily reduce a smooth objective function, provided the learning rate and training setup are chosen carefully. Once you understand SGD (its update rule, its relationship to smoothness, and its practical tuning), you gain a strong base for understanding more advanced optimisers as well. For learners building solid ML fundamentals in a data science course in Kolkata, mastering SGD is one of the most valuable steps toward training models reliably and efficiently.
