March 26, 2026 · 12 min read

Deep Learning and Neural Networks: What's Actually Happening Under the Hood

Understand how neural networks work from the ground up. Covers neurons, layers, backpropagation, and building a digit classifier with PyTorch.

deep-learning neural-networks pytorch machine-learning ai

Neural networks power everything from ChatGPT to self-driving cars to the recommendations on your phone. But most explanations either drown you in math or wave their hands and say "it learns patterns." Neither is helpful.

This guide walks through what's actually happening inside a neural network. We'll build up from a single neuron to a working digit classifier, and you'll understand every layer of the system. The math exists, but we'll focus on intuition first and equations second.

What Is a Neural Network?

A neural network is a function. It takes some input (an image, a sentence, a row of numbers) and produces some output (a classification, a prediction, a probability). The "neural" part comes from a loose biological analogy -- the network is made of simple units called neurons connected in layers.

Here's the key idea: the network has millions of adjustable parameters (called weights), and during training, those weights are tuned so the function maps inputs to correct outputs. The "learning" is just finding the right weight values.

Input → [Layer 1] → [Layer 2] → [Layer 3] → Output
         weights     weights     weights

That's it at the highest level. Now let's zoom in.

The Single Neuron

A neuron does three things:

  1. Takes inputs and multiplies each by a weight
  2. Adds them all up (plus a bias term)
  3. Passes the result through an activation function

# A single neuron (conceptual)
def neuron(inputs, weights, bias):
    # Step 1 & 2: weighted sum
    z = sum(x * w for x, w in zip(inputs, weights)) + bias

    # Step 3: activation function (ReLU in this case)
    output = max(0, z)
    return output

The weights control how much each input matters. The bias shifts the output. The activation function introduces non-linearity -- without it, stacking layers would be pointless because multiple linear transformations collapse into a single linear transformation.
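
You can verify that collapse with a quick numpy sketch (arbitrary random matrices, just for illustration): two linear layers with no activation between them are mathematically identical to one linear layer with combined weights.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(4)        # input vector
W1 = rng.standard_normal((3, 4))  # first linear "layer" (no activation)
W2 = rng.standard_normal((2, 3))  # second linear "layer" (no activation)

# Two stacked linear layers...
stacked = W2 @ (W1 @ x)
# ...equal one linear layer whose weights are the matrix product
collapsed = (W2 @ W1) @ x

print(np.allclose(stacked, collapsed))  # True
```

Insert a ReLU between the two layers and the equivalence breaks, which is exactly the point of the activation function.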

Activation Functions

The most common activation functions:

  • ReLU (Rectified Linear Unit): max(0, x) -- simple, fast, and used in most hidden layers. If the input is negative, output is 0. If positive, pass it through unchanged.
  • Sigmoid: squashes output to range (0, 1). Used in binary classification output layers.
  • Softmax: converts a vector of numbers into probabilities that sum to 1. Used in multi-class classification output layers.
  • Tanh: squashes output to range (-1, 1). Sometimes used in RNNs.

import numpy as np

def relu(x):
    return np.maximum(0, x)

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def softmax(x):
    exp_x = np.exp(x - np.max(x))  # Subtract max for numerical stability
    return exp_x / exp_x.sum()

def tanh(x):
    return np.tanh(x)

Layers: Stacking Neurons

A layer is a group of neurons that all receive the same inputs and produce a set of outputs. Layers are stacked sequentially:

  • Input layer: your raw data (pixel values, features, etc.)
  • Hidden layers: intermediate transformations where the network builds up abstract representations
  • Output layer: the final answer (a class label, a probability, a number)
Input (784 pixels)
    ↓
Hidden Layer 1 (128 neurons) → ReLU
    ↓
Hidden Layer 2 (64 neurons) → ReLU
    ↓
Output Layer (10 neurons) → Softmax
    ↓
Probabilities for digits 0-9

Each connection between neurons has a weight. A network with 784 inputs, a 128-neuron hidden layer, and a 10-neuron output layer has (784 × 128) + 128 + (128 × 10) + 10 = 101,770 parameters. And that's a tiny network.
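
A quick sanity check of that arithmetic, using the layer sizes from the sentence above:

```python
layer_sizes = [784, 128, 10]  # input → hidden → output

total = 0
for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
    total += n_in * n_out + n_out  # weight matrix + bias vector per layer

print(total)  # 101770
```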

The Forward Pass

The forward pass is simply running input data through the network layer by layer to get an output. Nothing is being learned yet -- you're just computing a prediction.

import numpy as np

# Simple 2-layer network (no framework)
def forward(X, W1, b1, W2, b2):
    # Hidden layer
    z1 = X @ W1 + b1        # Matrix multiplication + bias
    a1 = np.maximum(0, z1)  # ReLU activation

    # Output layer
    z2 = a1 @ W2 + b2       # Matrix multiplication + bias
    # Softmax
    exp_z2 = np.exp(z2 - np.max(z2, axis=1, keepdims=True))
    a2 = exp_z2 / exp_z2.sum(axis=1, keepdims=True)

    return a2

With random weights, the output is garbage. The magic happens in training.
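
You can see this for yourself. A compact version of the forward pass above, fed random weights, produces a perfectly valid probability distribution that carries no information:

```python
import numpy as np

def forward(X, W1, b1, W2, b2):
    a1 = np.maximum(0, X @ W1 + b1)  # hidden layer with ReLU
    z2 = a1 @ W2 + b2                # output layer
    exp_z2 = np.exp(z2 - np.max(z2, axis=1, keepdims=True))
    return exp_z2 / exp_z2.sum(axis=1, keepdims=True)  # softmax

rng = np.random.default_rng(42)
X = rng.standard_normal((1, 784))  # one fake "image"
W1 = rng.standard_normal((784, 128)) * 0.01
b1 = np.zeros(128)
W2 = rng.standard_normal((128, 10)) * 0.01
b2 = np.zeros(10)

probs = forward(X, W1, b1, W2, b2)
print(probs.shape)  # (1, 10)
print(probs.sum())  # 1.0 -- valid probabilities, but essentially random guesses
```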

Loss Functions: Measuring How Wrong You Are

A loss function (or cost function) quantifies how far the network's predictions are from the correct answers. The goal of training is to minimize this number.

For classification, the standard loss is cross-entropy loss:

def cross_entropy_loss(predictions, targets):
    # predictions: softmax output (probabilities), shape (n, num_classes)
    # targets: integer class labels, shape (n,)
    n = predictions.shape[0]
    log_probs = -np.log(predictions[range(n), targets])
    return log_probs.mean()

Intuitively: if the correct answer is class 3 and your model says "90% chance it's class 3," the loss is small. If it says "10% chance it's class 3," the loss is large. Cross-entropy penalizes confident wrong answers more heavily than uncertain ones.
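
Plugging in the numbers from that example makes the asymmetry concrete:

```python
import numpy as np

# "90% chance it's the correct class" → small loss
confident_right = -np.log(0.9)
# "10% chance it's the correct class" → much larger loss
unconfident = -np.log(0.1)

print(round(float(confident_right), 3))  # 0.105
print(round(float(unconfident), 3))      # 2.303
```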

For regression, the standard loss is Mean Squared Error (MSE):

def mse_loss(predictions, targets):
    return ((predictions - targets) ** 2).mean()
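
A worked example with arbitrary small numbers, chosen only for illustration:

```python
import numpy as np

predictions = np.array([2.5, 0.0, 2.1])
targets = np.array([3.0, -0.5, 2.0])

# Errors: -0.5, 0.5, 0.1 → squared: 0.25, 0.25, 0.01 → mean: 0.17
mse = ((predictions - targets) ** 2).mean()
print(round(float(mse), 2))  # 0.17
```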

Backpropagation: How the Network Learns

Backpropagation is the algorithm that adjusts weights to reduce the loss. Here's the intuition without getting lost in calculus:

  1. Forward pass: run an input through the network, get a prediction
  2. Calculate loss: compare prediction to the correct answer
  3. Backward pass: for each weight, figure out "if I nudge this weight a tiny bit, how much does the loss change?" This is the gradient -- the direction and magnitude of change that would reduce the loss
  4. Update weights: adjust each weight in the direction that reduces the loss
Forward:  Input → Prediction → Loss
Backward: Loss → Gradients → Updated Weights

The "back" in backpropagation refers to the chain rule from calculus -- you compute gradients starting from the output layer and propagate them backward through each layer. The good news: frameworks like PyTorch compute all of this automatically.
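
You can watch autograd do this on a single scalar. For the toy "loss" x², the gradient at x = 3 is 2 × 3 = 6, and PyTorch computes exactly that:

```python
import torch

x = torch.tensor(3.0, requires_grad=True)
loss = x ** 2      # a toy "loss"
loss.backward()    # backward pass: computes d(loss)/dx = 2x
print(x.grad)      # tensor(6.)
```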

Gradient Descent

Gradient descent is the optimization algorithm that uses the gradients to update weights:

# Simplified gradient descent
for each training step:
    gradients = compute_gradients(loss, weights)
    weights = weights - learning_rate * gradients

The learning rate controls how big each step is. Too large and you overshoot the optimal weights. Too small and training takes forever (or gets stuck). Typical starting values: 0.001 to 0.01.
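
Here's the update rule in action on a toy one-parameter loss f(w) = (w − 3)², whose gradient is 2(w − 3):

```python
w = 0.0              # initial weight
learning_rate = 0.1

for step in range(100):
    gradient = 2 * (w - 3)           # f'(w) for f(w) = (w - 3)^2
    w = w - learning_rate * gradient

print(round(w, 4))  # 3.0 -- converged to the minimum
```

Try learning_rate = 1.1 and w diverges; try 0.0001 and 100 steps barely move it. That's the trade-off in miniature.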

Stochastic Gradient Descent (SGD) and Variants

In practice, you don't compute gradients on the entire dataset at once (too slow). Instead, you process small batches (32, 64, 128 examples) and update weights after each batch. This is Stochastic Gradient Descent.

Modern optimizers like Adam adapt the learning rate for each parameter individually, making training more stable:

optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

Adam is the default choice for most projects. Start with it unless you have a reason not to.

Building a Digit Classifier with PyTorch

Time to build something real. We'll classify handwritten digits (0-9) from the MNIST dataset -- the "hello world" of deep learning.

Setup

pip install torch torchvision

Loading the Data

import torch
import torch.nn as nn
import torch.optim as optim
from torchvision import datasets, transforms
from torch.utils.data import DataLoader

# Transform images to tensors and normalize
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.1307,), (0.3081,))
])

# Download MNIST dataset
train_dataset = datasets.MNIST('./data', train=True, download=True,
                                transform=transform)
test_dataset = datasets.MNIST('./data', train=False, transform=transform)

# Create data loaders (batches of 64)
train_loader = DataLoader(train_dataset, batch_size=64, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=1000, shuffle=False)

MNIST contains 60,000 training images and 10,000 test images. Each image is 28x28 pixels in grayscale.

Defining the Model

class DigitClassifier(nn.Module):
    def __init__(self):
        super().__init__()
        self.flatten = nn.Flatten()     # 28x28 → 784
        self.fc1 = nn.Linear(784, 128)  # Hidden layer 1
        self.fc2 = nn.Linear(128, 64)   # Hidden layer 2
        self.fc3 = nn.Linear(64, 10)    # Output layer (10 digits)
        self.relu = nn.ReLU()

    def forward(self, x):
        x = self.flatten(x)
        x = self.relu(self.fc1(x))
        x = self.relu(self.fc2(x))
        x = self.fc3(x)  # No activation -- CrossEntropyLoss handles softmax
        return x

model = DigitClassifier()
print(model)

Training Loop

criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

def train(model, train_loader, criterion, optimizer, epochs=5):
    model.train()
    for epoch in range(epochs):
        running_loss = 0.0
        correct = 0
        total = 0

        for batch_idx, (data, target) in enumerate(train_loader):
            optimizer.zero_grad()             # Reset gradients
            output = model(data)              # Forward pass
            loss = criterion(output, target)  # Calculate loss
            loss.backward()                   # Backward pass (compute gradients)
            optimizer.step()                  # Update weights

            running_loss += loss.item()
            _, predicted = output.max(1)
            total += target.size(0)
            correct += predicted.eq(target).sum().item()

        accuracy = 100.0 * correct / total
        avg_loss = running_loss / len(train_loader)
        print(f"Epoch {epoch+1}: Loss={avg_loss:.4f}, Accuracy={accuracy:.2f}%")

train(model, train_loader, criterion, optimizer)

After 5 epochs, you should see accuracy above 97%.

Evaluation

def evaluate(model, test_loader):
    model.eval()
    correct = 0
    total = 0

    with torch.no_grad():  # No need to track gradients during evaluation
        for data, target in test_loader:
            output = model(data)
            _, predicted = output.max(1)
            total += target.size(0)
            correct += predicted.eq(target).sum().item()

    accuracy = 100.0 * correct / total
    print(f"Test Accuracy: {accuracy:.2f}%")

evaluate(model, test_loader)

Expected test accuracy: ~97-98%. Not bad for a simple network.

The Training Loop Explained

Every training loop follows this pattern:

for each epoch:
    for each batch:
        1. optimizer.zero_grad()   # Clear old gradients
        2. output = model(data)    # Forward pass
        3. loss = criterion(output, target)  # Compute loss
        4. loss.backward()         # Compute gradients (backprop)
        5. optimizer.step()        # Update weights

This five-line loop is the heart of all deep learning training in PyTorch. Every project -- from MNIST to GPT -- follows this structure.

CNNs: Convolutional Neural Networks

Our digit classifier used fully-connected layers, which treat the image as a flat list of 784 numbers. This throws away spatial information -- the network doesn't know that pixel (0,0) is next to pixel (0,1).

Convolutional Neural Networks (CNNs) preserve spatial structure by using filters (small grids, typically 3x3 or 5x5) that slide across the image and detect features like edges, corners, and textures.

class CNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 32, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(32, 64, kernel_size=3, padding=1)
        self.pool = nn.MaxPool2d(2, 2)
        self.fc1 = nn.Linear(64 * 7 * 7, 128)
        self.fc2 = nn.Linear(128, 10)
        self.relu = nn.ReLU()

    def forward(self, x):
        x = self.pool(self.relu(self.conv1(x)))  # 28x28 → 14x14
        x = self.pool(self.relu(self.conv2(x)))  # 14x14 → 7x7
        x = x.view(-1, 64 * 7 * 7)               # Flatten
        x = self.relu(self.fc1(x))
        x = self.fc2(x)
        return x

This CNN would push MNIST accuracy above 99%. CNNs are the backbone of most modern image recognition.

Key CNN concepts:
  • Convolution: a filter slides across the image, computing dot products to detect features
  • Pooling: reduces spatial dimensions (MaxPool keeps the strongest signal in each region)
  • Feature hierarchy: early layers detect edges, middle layers detect shapes, later layers detect objects
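
The shape annotations in the CNN's forward method are easy to verify directly, using the same layer settings as above:

```python
import torch
import torch.nn as nn

x = torch.zeros(1, 1, 28, 28)                      # one grayscale 28x28 image
conv = nn.Conv2d(1, 32, kernel_size=3, padding=1)  # padding=1 preserves 28x28
pool = nn.MaxPool2d(2, 2)                          # halves height and width

print(conv(x).shape)        # torch.Size([1, 32, 28, 28])
print(pool(conv(x)).shape)  # torch.Size([1, 32, 14, 14])
```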

RNNs: Recurrent Neural Networks

RNNs are designed for sequential data -- text, time series, audio. The key idea: they have a hidden state that acts as memory, carrying information from previous time steps.

Input: "The cat sat on the ___"

Step 1: Process "The" → update hidden state
Step 2: Process "cat" → update hidden state
Step 3: Process "sat" → update hidden state
Step 4: Process "on" → update hidden state
Step 5: Process "the" → update hidden state
Output: predict "mat"

In practice, vanilla RNNs struggle with long sequences (the hidden state "forgets" early information). LSTMs (Long Short-Term Memory) and GRUs (Gated Recurrent Units) solve this with gating mechanisms that control what information to keep and what to discard.

class TextRNN(nn.Module):
    def __init__(self, vocab_size, embed_dim, hidden_dim, output_dim):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, output_dim)

    def forward(self, x):
        embedded = self.embedding(x)
        _, (hidden, _) = self.lstm(embedded)
        output = self.fc(hidden.squeeze(0))
        return output

Note: for most modern NLP tasks, Transformers (the architecture behind GPT, BERT, Claude) have largely replaced RNNs. But understanding RNNs gives you the foundation for understanding why Transformers were invented.

Common Deep Learning Pitfalls

Overfitting

The network memorizes training data instead of learning general patterns. Solutions:

  • Dropout: randomly disable neurons during training (nn.Dropout(0.5))
  • Data augmentation: artificially increase training data with transformations
  • Early stopping: stop training when validation loss starts increasing
  • Regularization: add a penalty for large weights

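A small sketch of how dropout behaves in the two modes, which is a common source of bugs (forgetting model.eval() leaves dropout active at test time):

```python
import torch
import torch.nn as nn

drop = nn.Dropout(0.5)
x = torch.ones(8)

drop.train()    # training mode: each value is zeroed with probability 0.5,
print(drop(x))  # survivors are scaled by 1/(1-0.5) = 2.0

drop.eval()     # eval mode: dropout is a no-op
print(drop(x))  # all ones
```
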
Vanishing/Exploding Gradients

In very deep networks, gradients can shrink to near-zero (vanishing) or grow to infinity (exploding) as they propagate backward. Solutions:

  • Use ReLU activation (avoids vanishing gradients in most cases)
  • Batch normalization (nn.BatchNorm2d)
  • Residual connections (skip connections, as in ResNet)
  • Gradient clipping for exploding gradients (torch.nn.utils.clip_grad_norm_)

Wrong Learning Rate

Too high: loss oscillates or diverges. Too low: training is extremely slow. Use a learning rate scheduler:

scheduler = optim.lr_scheduler.ReduceLROnPlateau(optimizer, patience=3)
# Automatically reduces LR when validation loss plateaus
# Call scheduler.step(val_loss) once per epoch, after validation

GPU Acceleration

Deep learning on CPU is painfully slow. GPUs parallelize the matrix operations that neural networks rely on:

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = model.to(device)

# Move data to GPU during training
for data, target in train_loader:
    data, target = data.to(device), target.to(device)
    output = model(data)
    # ... rest of training loop

An NVIDIA GPU with CUDA support can speed up training by 10-50x depending on the model.

Where to Go from Here

  1. Solidify the basics: modify the MNIST classifier (add layers, change hyperparameters, try CNNs) and observe the effects
  2. Image classification: try CIFAR-10 (color images, 10 classes) -- it's harder than MNIST and forces you to use CNNs
  3. Transfer learning: use pre-trained models (ResNet, VGG) and fine-tune them for your own task -- this is how most real-world image classification works
  4. NLP: learn about Transformers and attention mechanisms -- they're the foundation of modern language models
  5. Frameworks: get comfortable with PyTorch (or TensorFlow/JAX) -- the ability to quickly prototype and iterate is essential

Deep learning is a field where building things teaches you more than reading papers. Start with the MNIST example in this guide, break it, fix it, extend it, and move on to harder problems.

For more deep learning tutorials, AI guides, and programming content, check out CodeUp.
