How I Understood Logistic Regression and Implemented It from Scratch

Why logistic regression confused me at first

Before I started learning logistic regression in detail, I knew that it is used for classification problems, but the name says “regression”. That already felt inconsistent. Most explanations I found jumped straight into the sigmoid curve and said “This converts values into probabilities”. I memorized the formula, but didn’t understand why it existed in the first place.

Another thing that bothered me was this: in linear regression, the equation directly gives the prediction. In logistic regression, the same-looking equation suddenly stands for something else, and most explanations never clearly said what.

I also thought logistic regression was non-linear because of the sigmoid curve. Visually, everything looked curved, so I assumed the model itself must be non-linear. Overall, my confusion wasn’t about formulas or math difficulty. It was about what exactly the model was trying to learn and what the output of the linear equation actually represented. Until that was clear, everything else (loss functions, gradients, decision boundaries) felt disconnected and hard to reason about.

From Probabilities to Log-Odds: What Logistic Regression Actually Models

My first assumption was simple: logistic regression takes input features, applies a linear equation, and directly predicts probability. That felt intuitive, especially because the final output is a number between 0 and 1.

BUT

Probability is bounded between 0 and 1. A linear function, on the other hand, has no such bounds. If we try to model probability directly using a linear equation, we immediately run into a mismatch between what the math can produce and what probability allows.

At this point, I thought the sigmoid was introduced just to “squash” the line into a curve. Visually, it looked like we had a straight line that somehow became an S-shaped curve, and I assumed that was the whole reason sigmoid existed.

That explanation turned out to be incomplete.

The real shift in understanding came when I realized that logistic regression does not try to model probability directly. Instead, it models odds, and more specifically, the log of odds.

$$\text{odds} = \frac{p}{1 - p}$$

Here p is the probability. Odds range from 0 to infinity. Taking the logarithm of odds maps this to a range from negative infinity to positive infinity.

$$\log\left(\frac{p}{1 - p}\right) \in (-\infty, +\infty)$$

This is exactly the range a linear model naturally operates in. This leads to the key assumption of logistic regression:

$$w^T x + b = \log\left(\frac{p}{1 - p}\right)$$

At first, this felt arbitrary. Why assume the output of a linear equation is the log-odds? The answer is:

  • Log-odds are unbounded

  • Log-odds are symmetric around 0: p = 0.5 maps to 0, and p and 1 - p map to values of equal size and opposite sign

  • A linear model fits naturally in this space

Only after modeling log-odds do we convert the result back into probability. That conversion is done using the sigmoid function, which is simply the inverse of the log-odds transformation:

$$p = \frac{1}{1 + e^{-(w^T x + b)}}$$
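To see why this is the inverse, here is a short derivation. I write z as shorthand for the linear output, z = w^T x + b (the name z is just my notation for this step), and solve the log-odds equation for p:

$$z = \log\left(\frac{p}{1 - p}\right) \;\Rightarrow\; e^{z} = \frac{p}{1 - p} \;\Rightarrow\; p\,(1 + e^{z}) = e^{z} \;\Rightarrow\; p = \frac{e^{z}}{1 + e^{z}} = \frac{1}{1 + e^{-z}}$$

The last step divides the numerator and denominator by e^z, which gives the familiar sigmoid form.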

This helped me reframe my entire understanding. The sigmoid function is not the model. It is just the final step that translates log-odds into probability. So when logistic regression produces a value using the familiar form wx + b, that value is not the predicted probability. It is the predicted log-odds.

Once this clicked, the rest of logistic regression stopped feeling like some magic and started looking like a consistent probabilistic model.

Why Odds? And why take the Log?

A question might arise: Why did we even introduce odds and log-odds?

The problem is that probability is limited. It can only take values between 0 and 1. But the formula we use in the model, wx + b, can produce any number: very large, very small, positive, or negative. So it cannot safely represent probability by itself. To fix this, we change how we represent probability. First, we convert probability into odds:

$$\text{odds} = \frac{p}{1 - p}$$

Odds answer a simple question: “How much more likely is class 1 compared to class 0?”

Still, odds have a problem. They can go from 0 to infinity, but they cannot be negative. A linear model needs freedom on both sides. So we take the log of the odds:

$$\log\left(\frac{p}{1 - p}\right)$$

Now the values can range from negative infinity to positive infinity. This perfectly matches what a linear equation can produce.

So the linear equation does not predict probability. It predicts log-odds, because log-odds fit naturally with linear equations.

Once we have log-odds, converting them back to probability is easy. That conversion is done using the sigmoid function. This is why sigmoid is used: not to make a curve look nice, but because it is the inverse of the log-odds transformation.
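The chain from probability to odds to log-odds and back can be checked numerically. This is a minimal sketch; the function names `logit` and `sigmoid` and the sample probabilities are my own choices for illustration.

```python
import numpy as np

def logit(p):
    """Log-odds of a probability p in (0, 1)."""
    return np.log(p / (1 - p))

def sigmoid(z):
    """Inverse of logit: maps any real number back into (0, 1)."""
    return 1 / (1 + np.exp(-z))

# Illustrative probabilities (chosen for this example only)
probs = np.array([0.1, 0.25, 0.5, 0.75, 0.9])
odds = probs / (1 - probs)   # always positive
log_odds = logit(probs)      # can be negative, zero, or positive

for p, o, lo in zip(probs, odds, log_odds):
    print(f"p={p:.2f}  odds={o:.3f}  log-odds={lo:+.3f}")

# Round trip: applying sigmoid to the log-odds recovers the probabilities
recovered = sigmoid(log_odds)
```

Probabilities below 0.5 give negative log-odds, 0.5 gives zero, and probabilities above 0.5 give positive log-odds, which is the two-sided range a linear equation needs.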

Why Cross-Entropy Is Used? Why Not MSE?

After understanding how logistic regression predicts probability, the next confusion I had was about the loss function, since the model is trained by minimizing a loss function to find the best weights and bias.

In linear regression, we used Mean Squared Error (MSE). So a natural question is: why not use the same loss here?

At first, I assumed cross-entropy was just another optimization trick. It felt like something added because “everyone uses it,” not because it was necessary. That assumption was wrong. The key difference lies in the type of data we are modeling.

In linear regression, we assume the target values are continuous and affected by Gaussian noise. That assumption directly leads to Mean Squared Error as the correct loss function.

In logistic regression, the target variable is binary: 0 or 1. This kind of data follows a Bernoulli distribution, not a Gaussian one. So the loss function must come from Bernoulli probability, not from squared error.

$$y \sim \text{Bernoulli}(p)$$

For a single data point, the probability of observing the true label is:

$$P(y \mid p) = p^{y}(1 - p)^{1 - y}$$

To train the model, we maximize this probability across all data points (the likelihood). Instead of maximizing it directly, we minimize the negative log-likelihood, which for a single data point is:

$$L = -\left[y \log(p) + (1 - y)\log(1 - p)\right]$$

The average cross-entropy loss over the dataset is:

$$L = -\frac{1}{n}\sum_{i=1}^{n} \left[y_i \log(p_i) + (1 - y_i)\log(1 - p_i)\right]$$

This expression is exactly what we call binary cross-entropy loss.
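As a rough sketch, the averaged formula can be written in a few lines of NumPy. The `eps` clipping is an assumption I add to avoid taking log(0); it is not part of the formula itself, and the labels and predictions are made up for illustration.

```python
import numpy as np

def binary_cross_entropy(y_true, p_pred, eps=1e-12):
    """Average binary cross-entropy over a dataset.

    eps is an assumed safety margin that keeps log() away from 0.
    """
    p = np.clip(p_pred, eps, 1 - eps)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

# Made-up labels and two sets of predictions
y = np.array([1, 0, 1, 0])
good = np.array([0.9, 0.1, 0.8, 0.2])  # mostly right and confident
bad = np.array([0.1, 0.9, 0.2, 0.8])   # mostly wrong and confident

loss_good = binary_cross_entropy(y, good)
loss_bad = binary_cross_entropy(y, bad)
print(f"good predictions: {loss_good:.4f}")
print(f"bad predictions:  {loss_bad:.4f}")
```

The confidently wrong predictions produce a much larger loss than the confidently right ones.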

At this point, things became clearer. Cross-entropy is not an arbitrary choice. It is simply the loss that naturally comes from modeling binary data with probabilities. Another important reason MSE fails here is its behavior during training. With MSE, confident wrong predictions are not punished strongly enough: the squared error of a single prediction can never exceed 1, and its gradient through the sigmoid is multiplied by p(1 - p), which shrinks toward zero exactly when the model is confidently wrong. That leads to slow learning. Cross-entropy, on the other hand, heavily penalizes confident mistakes, which pushes the model to correct itself faster.
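To make the difference concrete, here is a small sketch comparing the gradient of each loss with respect to the linear output z for one confidently wrong prediction. The values y = 1 and z = -6 are arbitrary choices for illustration, and the MSE loss is assumed to be (p - y)² applied to the sigmoid output.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# One confidently wrong prediction: true label is 1, model says p is near 0
y = 1.0
z = -6.0          # arbitrary log-odds chosen for illustration
p = sigmoid(z)

# Cross-entropy: dL/dz = p - y
grad_ce = p - y

# MSE on the sigmoid output, L = (p - y)^2: dL/dz = 2 (p - y) p (1 - p)
grad_mse = 2 * (p - y) * p * (1 - p)

print(f"p = {p:.5f}")
print(f"cross-entropy gradient: {grad_ce:.5f}")
print(f"MSE gradient:           {grad_mse:.5f}")
```

The cross-entropy gradient stays close to -1, while the MSE gradient is almost zero, so an MSE-trained model would barely move on exactly the examples it gets most wrong.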

How Logistic Regression Is Trained: Gradient and Optimization

Once the model and loss function are defined, the next question is how the model actually learns. Unlike linear regression, logistic regression does not have a closed-form solution. There is no single formula that directly gives the best weights. So we rely on optimization.

The goal of training is simple: adjust the weights so that the cross-entropy loss becomes as small as possible.

To do this, we need to know how the loss changes when the weights change. This is where gradients come in.

A gradient tells us the direction in which the loss increases the most. If we move in the opposite direction, the loss decreases. This idea is called gradient descent.

For logistic regression, something important happens when we compute the gradient of the loss. The next part is a bit math heavy. If it becomes hard to follow, you can just focus on the final result.

$$z = w^{T}x + b $$

Sigmoid output:

$$p = \sigma(z) = \frac{1}{1 + e^{-z}}$$

Binary cross-entropy loss:

$$L = -\left[ y \log(p) + (1 - y)\log(1 - p) \right]$$

Gradient of the loss w.r.t. the probability:

$$\frac{\partial L}{\partial p} = -\left( \frac{y}{p} - \frac{1 - y}{1 - p} \right)$$

Gradient of the probability w.r.t. the linear output:

$$\frac{\partial p}{\partial z} = p(1 - p)$$

Gradient of the linear output w.r.t. the weights:

$$\frac{\partial z}{\partial w} = x$$

Applying the chain rule:

$$\frac{\partial L}{\partial w} = \frac{\partial L}{\partial p} \cdot \frac{\partial p}{\partial z} \cdot \frac{\partial z}{\partial w}$$

The first two factors simplify nicely:

$$\frac{\partial L}{\partial p} \cdot \frac{\partial p}{\partial z} = -\left( \frac{y}{p} - \frac{1 - y}{1 - p} \right) p(1 - p) = -\left[ y(1 - p) - (1 - y)p \right] = p - y$$

Final result:

$$\frac{\partial L}{\partial w} = (p - y)x$$
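One way to convince yourself of the final result is to compare it against a numerical gradient. The sketch below does this for a single random data point; the random values, the seed, and the step size `h` are arbitrary choices for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def loss(w, b, x, y):
    """Binary cross-entropy for a single data point."""
    p = sigmoid(w @ x + b)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

# Arbitrary data point and parameters (illustration only)
rng = np.random.default_rng(0)
x = rng.normal(size=3)
w = rng.normal(size=3)
b = 0.5
y = 1.0

# Analytic gradient from the derivation: (p - y) * x
p = sigmoid(w @ x + b)
analytic = (p - y) * x

# Numerical gradient via central differences
h = 1e-6
numeric = np.zeros_like(w)
for j in range(len(w)):
    step = np.zeros_like(w)
    step[j] = h
    numeric[j] = (loss(w + step, b, x, y) - loss(w - step, b, x, y)) / (2 * h)

print("analytic:", analytic)
print("numeric: ", numeric)
```

The two vectors should agree to several decimal places, which is a quick sanity check that the chain-rule derivation is right.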

This expression explains a lot:

  • If the prediction p is close to the true label y, the update is small.

  • If the model is confident and wrong, the update is large.

  • The input feature x scales how much each weight is updated.

During training, the model repeatedly:

  1. Computes predicted probabilities

  2. Measures the loss using cross-entropy

  3. Computes gradients

  4. Updates weights slightly in the direction that reduces loss

The process continues until the model converges. At first, I thought gradients were just a technical detail. But this equation made it clear that training is not magic. The model simply adjusts itself based on how wrong it is and how confident it was.

Is Logistic Regression Linear or Non-Linear?

Before moving to the implementation from scratch, I would like to address a few confusions I had, along with how I resolved them.

When I first saw the sigmoid curve, I assumed logistic regression must be a non-linear model. Visually, everything looked curved, so I equated that curve with model complexity. That assumption turned out to be wrong.

Logistic regression is a linear model.

The confusion comes from mixing up two different things:

  • the model

  • the output transformation

The core of logistic regression is still a linear equation, w^T x + b. This linear equation defines the decision boundary. The model predicts class 1 when this value is greater than zero (equivalently, when the predicted probability is above 0.5) and class 0 otherwise. In two dimensions, this boundary is a straight line. In higher dimensions, it is a hyperplane.

We now know that the sigmoid function does not make the model non-linear. It only converts the linear output into a probability. So even though probabilities change smoothly in an S-shaped manner, the point where the model switches its prediction remains linear.

Once I realized that sigmoid only rescales the output and does not change the shape of the boundary, it became clear why logistic regression struggles with problems that require non-linear separation. With this, let’s talk about a few conditions in which logistic regression struggles.
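Here is a small sketch of that idea in two dimensions. The weights `w` and bias `b` are hypothetical values picked for illustration, not the result of any training.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Hypothetical parameters (illustration only)
w = np.array([2.0, -1.0])
b = 0.5

# Points on the boundary satisfy w1*x1 + w2*x2 + b = 0,
# which is the straight line x2 = -(w1*x1 + b) / w2
x1 = np.linspace(-3, 3, 7)
x2 = -(w[0] * x1 + b) / w[1]
boundary_points = np.column_stack([x1, x2])

# Every point on that line has probability 0.5
p_on_boundary = sigmoid(boundary_points @ w + b)
print(p_on_boundary)

# Thresholding the probability at 0.5 gives the same decisions
# as checking the sign of the linear output
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 2))
z = X @ w + b
same_decision = np.array_equal(sigmoid(z) >= 0.5, z >= 0)
print(same_decision)
```

The sigmoid changes how confident the model is on each side of the line, but the line itself comes entirely from the linear equation.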

Where Logistic Regression Breaks (and when to use it Carefully)

Understanding logistic regression also means knowing where it fails. These failures are not mistakes in the algorithm; they come directly from its assumptions.

Feature Interactions: What the model cannot learn on its own

Logistic regression combines features in a purely additive way. With two features, for example:

$$z = w_1 x_1 + w_2 x_2 + b$$

This means each feature contributes to the log-odds independently of the others. If feature A alone is weak and feature B alone is weak, but together they form a strong signal, logistic regression will miss it.

To capture such relationships, we must manually add an interaction term, for example the product of the two features:

$$z = w_1 x_1 + w_2 x_2 + w_3 (x_1 x_2) + b$$

Only after adding this does the model become capable of learning the combined effect. This highlights an important limitation: logistic regression does not discover interactions automatically. It depends heavily on feature engineering.
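The classic XOR pattern is a compact way to see this limitation. The sketch below is my own illustration: a tiny hypothetical dataset and a bare-bones gradient descent loop (`train` and `accuracy` are helper names I made up), first on the raw features and then with the product x1 * x2 added as a third feature.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def train(X, y, lr=1.0, epochs=5000):
    """Plain gradient descent on the average cross-entropy loss."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        p = sigmoid(X @ w + b)
        w -= lr * (X.T @ (p - y)) / len(y)
        b -= lr * np.mean(p - y)
    return w, b

def accuracy(X, y, w, b):
    return np.mean((sigmoid(X @ w + b) >= 0.5) == y)

# XOR: the label is 1 only when exactly one of the two features is 1
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 0], dtype=float)

# Raw features only: no straight line separates the classes
w_plain, b_plain = train(X, y)
acc_plain = accuracy(X, y, w_plain, b_plain)

# Add the interaction term x1 * x2 as a third feature
X_inter = np.column_stack([X, X[:, 0] * X[:, 1]])
w_inter, b_inter = train(X_inter, y)
acc_inter = accuracy(X_inter, y, w_inter, b_inter)

print(f"accuracy without interaction: {acc_plain:.2f}")
print(f"accuracy with interaction:    {acc_inter:.2f}")
```

With only the raw features the model cannot classify all four points correctly, while the hand-built interaction feature makes the problem linearly separable.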

Perfect Separation and the Need for Regularization

Another practical issue appears when data is perfectly separable. In this case, logistic regression keeps increasing weights to make predictions more confident. The loss keeps decreasing, and the optimal weights move toward infinity.

As a result, the model fails to converge.

Regularization fixes this by penalizing large weights and forcing the model to settle on a stable solution. Without regularization, logistic regression is fragile in such scenarios.

L2 regularization (optional, but good):

$$L = -\left[ y \log(p) + (1 - y)\log(1 - p) \right] + \lambda \|w\|_2^2$$

L1 regularization (optional):

$$L = -\left[ y \log(p) + (1 - y)\log(1 - p) \right] + \lambda \|w\|_1$$
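The effect of perfect separation is easy to reproduce. The following sketch uses a made-up one-dimensional dataset that is perfectly separable and a small `train` helper of my own. It assumes an L2 penalty whose gradient is `reg * w`, which corresponds to a penalty of (reg / 2) · ||w||², so `reg` plays the role of 2λ in the formula above.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def train(X, y, lr=0.1, epochs=1000, reg=0.0):
    """Gradient descent; reg * w is the gradient of (reg / 2) * ||w||^2."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        p = sigmoid(X @ w + b)
        dw = (X.T @ (p - y)) / len(y) + reg * w
        db = np.mean(p - y)
        w -= lr * dw
        b -= lr * db
    return w, b

# Perfectly separable toy data: negatives on the left, positives on the right
X = np.array([[-2.0], [-1.0], [1.0], [2.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])

# Without regularization the weight keeps growing with more training
w_short, _ = train(X, y, epochs=2000)
w_long, _ = train(X, y, epochs=20000)

# With regularization it settles at a finite value
w_reg_short, _ = train(X, y, epochs=2000, reg=0.1)
w_reg_long, _ = train(X, y, epochs=20000, reg=0.1)

print(f"no reg:   w after 2000 = {w_short[0]:.3f}, after 20000 = {w_long[0]:.3f}")
print(f"reg=0.1:  w after 2000 = {w_reg_short[0]:.3f}, after 20000 = {w_reg_long[0]:.3f}")
```

Without the penalty, ten times more training gives a noticeably larger weight, and it would keep growing. With the penalty, both runs land on essentially the same weight.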

What this Means in Practice

Logistic regression works best when:

  • The decision boundary is roughly linear.

  • Features are well-designed.

  • Regularization is used.

It struggles when:

  • Relationships are highly non-linear.

  • Interactions dominate the signal.

  • Features are raw and unstructured.

Final Section: Logistic Regression From Scratch (Putting everything together)

Minimal From-Scratch Implementation

import numpy as np

class LogisticRegressionScratch:
    def __init__(self, lr=0.1, epochs=1000, reg=0.0):
        self.lr = lr
        self.epochs = epochs
        self.reg = reg
        self.w = None
        self.b = None

    def sigmoid(self, z):
        # Clip z so np.exp cannot overflow for very negative log-odds
        z = np.clip(z, -500, 500)
        return 1 / (1 + np.exp(-z))

    def fit(self, X, y):
        n_samples, n_features = X.shape
        self.w = np.zeros(n_features)
        self.b = 0

        for _ in range(self.epochs):
            z = X @ self.w + self.b          # log-odds
            p = self.sigmoid(z)              # probability

            dw = (1 / n_samples) * (X.T @ (p - y)) + self.reg * self.w
            db = np.mean(p - y)

            self.w -= self.lr * dw
            self.b -= self.lr * db

    def predict_proba(self, X):
        z = X @ self.w + self.b
        return self.sigmoid(z)

    def predict(self, X, threshold=0.5):
        return (self.predict_proba(X) >= threshold).astype(int)

The constructor takes the following parameters:

  • lr : learning rate for gradient descent

  • epochs : number of optimization steps

  • reg : regularization strength

This sets the training hyperparameters.

# This computes the log-odds
# This is the core linear assumption of logistic regression
z = X @ self.w + self.b

# Converts log-odds into probability
# Output lies strictly between 0 and 1
p = self.sigmoid(z)

Gradient Computation:

  • (p - y) is the error signal

  • If the model is confident and wrong, this term is large

  • X.T @ (p - y) distributes error across features

  • self.reg * self.w is the gradient of an L2 penalty, (reg / 2) · ||w||², and prevents weights from exploding in case of perfectly separable data.

dw = (1 / n_samples) * (X.T @ (p - y)) + self.reg * self.w
db = np.mean(p - y)

Gradient Descent update

# Moving weights in the direction that reduces cross-entropy
# Repeat this for many epochs until convergence
self.w -= self.lr * dw
self.b -= self.lr * db
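"Until convergence" can be made concrete by tracking the loss and stopping once it no longer improves. This is a sketch under my own assumptions: synthetic data generated from hypothetical weights `true_w`, a helper `bce`, and an arbitrary tolerance `tol`. None of these are part of the class above.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def bce(y, p, eps=1e-12):
    """Average binary cross-entropy, clipped to avoid log(0)."""
    p = np.clip(p, eps, 1 - eps)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

# Synthetic data from hypothetical "true" parameters (illustration only)
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 2))
true_w = np.array([1.5, -2.0])
y = (rng.random(200) < sigmoid(X @ true_w + 0.3)).astype(float)

w = np.zeros(2)
b = 0.0
lr = 0.1
tol = 1e-7      # assumed stopping tolerance on the loss improvement
losses = []

for epoch in range(5000):
    p = sigmoid(X @ w + b)
    losses.append(bce(y, p))
    # Stop when the loss improved by less than tol since the last epoch
    if epoch > 0 and losses[-2] - losses[-1] < tol:
        break
    w -= lr * (X.T @ (p - y)) / len(y)
    b -= lr * np.mean(p - y)

print(f"stopped after {len(losses)} epochs")
print(f"initial loss {losses[0]:.4f}, final loss {losses[-1]:.4f}")
```

With all weights at zero every prediction is 0.5, so the first recorded loss is log(2). From there the loss drops as the weights move toward values that explain the labels.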

predict_proba(): probability output

# Returns predicted probabilities
# Useful for ROC, AUC, threshold testing
def predict_proba(self, X):
    z = X @ self.w + self.b
    return self.sigmoid(z)
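Having probabilities instead of hard labels means the threshold becomes a choice. The sketch below uses made-up probabilities and labels (not output from the class above) and a helper `precision_recall` that I wrote for illustration, to show how precision and recall move as the threshold changes.

```python
import numpy as np

# Made-up predicted probabilities and true labels (illustration only)
probs = np.array([0.05, 0.20, 0.35, 0.55, 0.65, 0.80, 0.95])
y_true = np.array([0, 0, 1, 0, 1, 1, 1])

def precision_recall(y_true, probs, threshold):
    """Precision and recall of the positive class at a given threshold."""
    y_pred = (probs >= threshold).astype(int)
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    return precision, recall

results = {t: precision_recall(y_true, probs, t) for t in (0.3, 0.5, 0.7)}
for t, (prec, rec) in results.items():
    print(f"threshold={t:.1f}  precision={prec:.2f}  recall={rec:.2f}")
```

On this toy data, a low threshold catches every positive at the cost of a false alarm, while a high threshold makes no false alarms but misses half of the positives.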

Conclusion

Logistic regression became clear once I understood what it actually models. It does not predict probabilities directly. It predicts log-odds, and probability is obtained only at the final step using the sigmoid function.

Cross-entropy also stopped feeling arbitrary when I saw that it comes directly from modeling binary data with a Bernoulli distribution. The loss, gradients, and optimization are all consequences of this assumption, not independent choices.

Implementing logistic regression from scratch tied everything together. The same linear equation, sigmoid, and gradient updates appeared again in code, making the model feel consistent rather than confusing.

Logistic regression is simple, but only when its assumptions are respected. Knowing both its strengths and limits is what makes it useful in practice.