Autodiff from Scratch
Last updated: 28/06/2025
I implemented an autodiff engine from scratch to understand PyTorch's autograd at a deeper level. The math is simpler than expected — it's just the chain rule applied systematically.
Computational Graphs
Every operation builds a DAG. Nodes are tensors, edges are operations. The graph records how each value was computed.
```
FORWARD PASS: Building the Graph
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

    x       w                 b
    │       │                 │
    ▼       ▼                 │
┏━━━━━━━━━━━━━━━━┓            │
┃    multiply    ┃            │
┃    (x * w)     ┃            │
┗━━━━━━━┳━━━━━━━━┛            │
        │                     │
        ▼                     ▼
   ┏━━━━━━━━━━━━━━━━━━━━━━━━━┓
   ┃           add           ┃
   ┃       (x * w) + b       ┃
   ┗━━━━━━━━━━━━┳━━━━━━━━━━━━┛
                │
                ▼
             output
          (y = wx + b)
```
The Chain Rule
For a composite function L = f(g(x)), the chain rule gives dL/dx = (dL/dg) · (dg/dx).
Each operation knows its local gradient. We multiply by the incoming gradient and propagate backwards.
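The chain rule is easy to sanity-check numerically. A minimal sketch with an illustrative composite (f(u) = u², g(x) = sin x — these function choices are mine, not from the engine), comparing the analytic derivative against a central finite difference:

```python
import math

# Composite function L = f(g(x)) with f(u) = u**2 and g(x) = sin(x)
def g(x): return math.sin(x)
def f(u): return u ** 2
def L(x): return f(g(x))

x = 0.7

# Chain rule: dL/dx = f'(g(x)) * g'(x) = 2*sin(x) * cos(x)
analytic = 2 * math.sin(x) * math.cos(x)

# Central finite difference approximation of dL/dx
h = 1e-6
numeric = (L(x + h) - L(x - h)) / (2 * h)

assert abs(analytic - numeric) < 1e-8
```

The same finite-difference trick is how the full engine's gradients can be tested operation by operation.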
```
BACKWARD PASS: Flowing Gradients
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

          ∂L/∂output = 1
                │
                ▼
   ┏━━━━━━━━━━━━━━━━━━━━━━━━━┓
   ┃           add           ┃
   ┃   local: ∂/∂(xw) = 1    ┃  ← gradient passes through
   ┃   local: ∂/∂b = 1       ┃
   ┗━━━━━━━━━━━━┳━━━━━━━━━━━━┛
        │                 │
        ▼                 ▼
┏━━━━━━━━━━━━━━━━┓   ∂L/∂b = 1
┃    multiply    ┃
┃ local: ∂/∂x=w  ┃  ← chain rule: multiply by incoming
┃ local: ∂/∂w=x  ┃
┗━━━━━━━┳━━━━━━━━┛
    │        │
    ▼        ▼
∂L/∂x = w  ∂L/∂w = x
```
The Tensor Class
A Tensor wraps a numpy array and tracks its gradient and backprop function.
```python
class Tensor:
    def __init__(self, data: np.ndarray, requires_grad=False, dtype=DType.LONG):
        self.data = data.astype(np_type)  # the actual numpy array (np_type resolved from dtype, elided here)
        self.requires_grad = requires_grad
        self.grad: Tensor | None = None       # ∂L/∂(this tensor)
        self.grad_fn: Callable | None = None  # how to backprop
```
`c = a + b` creates a new `Tensor` with a `grad_fn` closure that propagates gradients back to `a` and `b`.
```
┏━━━┓     ┏━━━┓
┃ a ┃     ┃ b ┃
┗━┳━┛     ┗━┳━┛
  ┃    +    ┃
  ┗━━━━┳━━━━┛
       ┃
       ▼
     ┏━━━┓
     ┃ c ┃  ← c.grad_fn knows about a and b
     ┗━━━┛    via closure
```
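The closure mechanics can be shown without any tensor machinery. A hypothetical toy (the `Node` class and `add` function are illustrative, not the engine's actual code):

```python
class Node:
    """Minimal stand-in for Tensor: holds a value, a gradient, and a grad_fn."""
    def __init__(self, value):
        self.value = value
        self.grad = 0.0
        self.grad_fn = None

def add(a, b):
    c = Node(a.value + b.value)

    def _backward():
        # The closure captures a and b: c.grad_fn can reach its parents
        # without the graph being stored anywhere else.
        a.grad += c.grad  # local gradient of addition is 1
        b.grad += c.grad

    c.grad_fn = _backward
    return c

a, b = Node(2.0), Node(3.0)
c = add(a, b)
c.grad = 1.0   # seed ∂L/∂c
c.grad_fn()    # flows the gradient to a and b via the closure
print(a.grad, b.grad)  # both receive gradient 1.0
```

Nothing else holds a reference to the graph: the edges live entirely inside these closures.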
Local Gradients
Each operation defines a local gradient:
| Operation | Output | Local Gradients |
|---|---|---|
| `c = a + b` | c = a + b | ∂c/∂a = 1, ∂c/∂b = 1 |
| `c = a * b` | c = a·b | ∂c/∂a = b, ∂c/∂b = a |
| `c = a ** n` | c = aⁿ | ∂c/∂a = n·aⁿ⁻¹ |
| `c = sigmoid(a)` | c = σ(a) | ∂c/∂a = σ(a)·(1 − σ(a)) |
| `c = relu(a)` | c = max(a, 0) | ∂c/∂a = 1 if a > 0, else 0 |
| `C = A @ B` | C = AB | ∂L/∂A = ∂L/∂C @ Bᵀ, ∂L/∂B = Aᵀ @ ∂L/∂C |
Recursive Backprop
`backwards()` initializes the gradient to ones, then executes `grad_fn`, which recurses on parent tensors.
```python
def backwards(self):
    # Initialize gradient for starting tensor
    if self.grad is None:
        self.grad = Tensor(np.ones_like(self.data), requires_grad=False)

    if self.grad_fn is not None:
        self.grad_fn()  # computes parent grads, then recurses
```
Each `grad_fn` computes local gradients, accumulates into parent tensors, then recurses:
```python
def multiply_tensor(tensor1: Tensor, tensor2: Tensor) -> Tensor:
    result = Tensor(tensor1.data * tensor2.data, requires_grad=True)

    def _backward():
        # Local gradients for multiplication: ∂(a*b)/∂a = b, ∂(a*b)/∂b = a
        grad1 = tensor2.data * result.grad.data
        grad2 = tensor1.data * result.grad.data

        # Accumulate gradients
        if tensor1.grad is None:
            tensor1.grad = Tensor(grad1, requires_grad=False)
        else:
            tensor1.grad += Tensor(grad1, requires_grad=False)
        if tensor2.grad is None:
            tensor2.grad = Tensor(grad2, requires_grad=False)
        else:
            tensor2.grad += Tensor(grad2, requires_grad=False)

        # Recurse to parents
        tensor1.backwards()
        tensor2.backwards()

    result.grad_fn = _backward
    return result
```
`add_tensor` is the same pattern with local gradients of 1:

```python
def add_tensor(tensor1: Tensor, tensor2: Tensor) -> Tensor:
    result = Tensor(tensor1.data + tensor2.data, requires_grad=True)

    def _backward():
        # Local gradient for addition is 1
        grad1 = np.ones_like(tensor1.data) * result.grad.data
        grad2 = np.ones_like(tensor2.data) * result.grad.data

        # Accumulate and recurse...
        if tensor1.grad is None:
            tensor1.grad = Tensor(grad1, requires_grad=False)
        else:
            tensor1.grad += Tensor(grad1, requires_grad=False)
        # ...

    result.grad_fn = _backward
    return result
```
Activation Functions
Same pattern, different local gradients. ReLU:
```python
def relu_tensor(tensor: Tensor):
    result = Tensor(np.maximum(tensor.data, 0), requires_grad=True)

    def _backward():
        # ReLU gradient: 1 where input > 0, 0 elsewhere
        relu_grad = result.grad.data * (tensor.data > 0).astype(tensor.data.dtype)

        if tensor.grad is None:
            tensor.grad = Tensor(relu_grad, requires_grad=False)
        else:
            tensor.grad += Tensor(relu_grad, requires_grad=False)
        tensor.backwards()

    result.grad_fn = _backward
    return result
```
Sigmoid:
```python
def sigmoid_tensor(tensor: Tensor) -> Tensor:
    sigmoid_output = 1 / (1 + np.exp(-tensor.data))
    result = Tensor(sigmoid_output, requires_grad=True)

    def _backward():
        # Sigmoid gradient: σ(x) * (1 - σ(x))
        sigmoid_grad = result.grad.data * result.data * (1 - result.data)

        if tensor.grad is None:
            tensor.grad = Tensor(sigmoid_grad, requires_grad=False)
        else:
            tensor.grad += Tensor(sigmoid_grad, requires_grad=False)
        tensor.backwards()

    result.grad_fn = _backward
    return result
```
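Both local gradients can be checked against finite differences. A standalone sketch, independent of the `Tensor` class (the test points are arbitrary, chosen away from ReLU's kink at 0):

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

x = np.array([-2.0, -0.5, 0.3, 1.7])
h = 1e-6

# Sigmoid: analytic gradient σ(x) * (1 - σ(x)) vs central difference
s = sigmoid(x)
analytic_sig = s * (1 - s)
numeric_sig = (sigmoid(x + h) - sigmoid(x - h)) / (2 * h)
assert np.allclose(analytic_sig, numeric_sig)

# ReLU: analytic gradient is the 0/1 mask
relu = lambda v: np.maximum(v, 0)
analytic_relu = (x > 0).astype(x.dtype)
numeric_relu = (relu(x + h) - relu(x - h)) / (2 * h)
assert np.allclose(analytic_relu, numeric_relu)
```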
Matrix Multiplication
For C = A @ B, the gradients require transposing: ∂L/∂A = ∂L/∂C @ Bᵀ and ∂L/∂B = Aᵀ @ ∂L/∂C.
```python
def matmul_tensor(tensor1: Tensor, tensor2: Tensor) -> Tensor:
    result = Tensor(tensor1.data @ tensor2.data, requires_grad=True)

    def _backward():
        # For C = A @ B:
        #   ∂L/∂A = ∂L/∂C @ B.T
        #   ∂L/∂B = A.T @ ∂L/∂C
        grad1 = np.matmul(result.grad.data, tensor2.data.T)
        grad2 = np.matmul(tensor1.data.T, result.grad.data)

        if tensor1.grad is None:
            tensor1.grad = Tensor(grad1, requires_grad=False)
        else:
            tensor1.grad += Tensor(grad1, requires_grad=False)
        if tensor2.grad is None:
            tensor2.grad = Tensor(grad2, requires_grad=False)
        else:
            tensor2.grad += Tensor(grad2, requires_grad=False)

        tensor1.backwards()
        tensor2.backwards()

    result.grad_fn = _backward
    return result
```
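The transpose identities are easy to verify numerically for a scalar loss such as L = sum(A @ B), where the upstream gradient ∂L/∂C is all ones. A standalone check (shapes and seed are arbitrary, not from the engine):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 4))
B = rng.standard_normal((4, 2))

# For L = sum(C) with C = A @ B, the upstream gradient ∂L/∂C is all ones
G = np.ones((3, 2))
grad_A = G @ B.T   # ∂L/∂A = ∂L/∂C @ B.T
grad_B = A.T @ G   # ∂L/∂B = A.T @ ∂L/∂C

# Finite-difference check on a single entry of A
h = 1e-6
A_pert = A.copy()
A_pert[1, 2] += h
numeric = ((A_pert @ B).sum() - (A @ B).sum()) / h
assert abs(numeric - grad_A[1, 2]) < 1e-4
```

Note that the gradient shapes match the inputs (`grad_A` is 3×4 like `A`), which is exactly what the transposes buy you.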
A Complete Example
Let's trace through L = (a·b + c)² with a = 2, b = 3, c = 1:
```
FORWARD:                            BACKWARD:
════════                            ═════════

a=2   b=3   c=1                     ∂L/∂L = 1
 │     │     │
 └──┬──┘     │                      ∂L/∂e = 2e = 14
    ▼        │
   [*]       │    d = a×b = 6       ∂L/∂d = 14 × 1 = 14
    │ d=6    │                      ∂L/∂c = 14 × 1 = 14
    └───┬────┘
        ▼                           ∂L/∂a = 14 × b = 42
       [+]        e = d+c = 7       ∂L/∂b = 14 × a = 28
        │ e=7
        ▼                        ┏━━━━━━━━━━━━━━━━━━━━┓
      [**2]       L = e² = 49    ┃  FINAL GRADIENTS:  ┃
        │                        ┃  a.grad = 42       ┃
        ▼                        ┃  b.grad = 28       ┃
      L=49                       ┃  c.grad = 14       ┃
                                 ┗━━━━━━━━━━━━━━━━━━━━┛
```
The math checks out: with L = (ab + c)², we get ∂L/∂a = 2(ab + c)·b = 14 × 3 = 42, ∂L/∂b = 2(ab + c)·a = 14 × 2 = 28, and ∂L/∂c = 2(ab + c) = 14.
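The same three numbers fall out of a finite-difference check, with no autodiff at all (a quick sanity sketch, not part of the engine):

```python
# Finite-difference check of L = (a*b + c)**2 at a=2, b=3, c=1
def L(a, b, c):
    return (a * b + c) ** 2

a, b, c = 2.0, 3.0, 1.0
h = 1e-6

grad_a = (L(a + h, b, c) - L(a - h, b, c)) / (2 * h)
grad_b = (L(a, b + h, c) - L(a, b - h, c)) / (2 * h)
grad_c = (L(a, b, c + h) - L(a, b, c - h)) / (2 * h)

print(round(grad_a), round(grad_b), round(grad_c))  # 42 28 14
```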
Operator Overloading
Overriding `__add__`, `__mul__`, `__matmul__`, etc. lets ordinary arithmetic build the graph automatically.
```python
class Tensor:
    def __add__(self, other):
        return add_tensor(self, other)

    def __mul__(self, other):
        if isinstance(other, Tensor):
            return multiply_tensor(self, other)
        else:
            return scalar_multiply_tensor(self, other)

    def __matmul__(self, other):
        return matmul_tensor(self, other)

    def __pow__(self, power):
        return tensor_pow(self, power)

    def __neg__(self):
        return scalar_multiply_tensor(self, -1)

    def __sub__(self, other):
        # Reuse __neg__ here: `-1 * other` would dispatch to int.__mul__
        # and fail without an __rmul__ on Tensor.
        return add_tensor(self, -other)
```
Reflections
Two ideas make this work: track operations during the forward pass, then walk backwards through the graph applying the chain rule. Each node stores a closure over its parents, and `backwards()` recurses through the graph.
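For scalars, the whole design fits in a few dozen lines. A minimal sketch following the same closure-and-recursion pattern (names like `Scalar` are illustrative, not the repo's API; this handles tree-shaped graphs like the example above, not tensors with shared nodes):

```python
class Scalar:
    def __init__(self, data):
        self.data = data
        self.grad = None
        self.grad_fn = None

    def backwards(self):
        if self.grad is None:
            self.grad = 1.0  # seed ∂L/∂L = 1
        if self.grad_fn is not None:
            self.grad_fn()

    def __add__(self, other):
        out = Scalar(self.data + other.data)
        def _backward():
            # local gradients of addition are 1
            self.grad = (self.grad or 0.0) + out.grad
            other.grad = (other.grad or 0.0) + out.grad
            self.backwards()
            other.backwards()
        out.grad_fn = _backward
        return out

    def __mul__(self, other):
        out = Scalar(self.data * other.data)
        def _backward():
            # ∂(ab)/∂a = b, ∂(ab)/∂b = a
            self.grad = (self.grad or 0.0) + other.data * out.grad
            other.grad = (other.grad or 0.0) + self.data * out.grad
            self.backwards()
            other.backwards()
        out.grad_fn = _backward
        return out

    def __pow__(self, n):
        out = Scalar(self.data ** n)
        def _backward():
            # ∂(a**n)/∂a = n * a**(n-1)
            self.grad = (self.grad or 0.0) + n * self.data ** (n - 1) * out.grad
            self.backwards()
        out.grad_fn = _backward
        return out

a, b, c = Scalar(2.0), Scalar(3.0), Scalar(1.0)
loss = (a * b + c) ** 2
loss.backwards()
print(a.grad, b.grad, c.grad)  # 42.0 28.0 14.0
```

Running it reproduces the worked example's gradients, which is a nice end-to-end confirmation that the two ideas are all that's needed.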
The implementation also handles broadcasting, dtype resolution, and gradient accumulation for shared tensors.
Resources
GitHub Repository — Full implementation with examples
Technical Report (PDF) — Detailed mathematical writeup