Autodiff from Scratch

Last updated: 28/06/2025

I implemented an autodiff engine from scratch to understand PyTorch's autograd at a deeper level. The math is simpler than expected — it's just the chain rule applied systematically.

Computational Graphs

Every operation builds a DAG. Nodes are tensors, edges are operations. The graph records how each value was computed.


FORWARD PASS: Building the Graph
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

         x            w            b
         │            │            │
         ▼            ▼            │
       ┏━━━━━━━━━━━━━━━━┓          │
       ┃    multiply    ┃          │
       ┃     (x * w)    ┃          │
       ┗━━━━━━━┳━━━━━━━━┛          │
               │                   │
               ▼                   ▼
           ┏━━━━━━━━━━━━━━━━━━━━━━━━━┓
           ┃          add            ┃
           ┃      (x * w) + b        ┃
           ┗━━━━━━━━━━━━┳━━━━━━━━━━━━┛
                        │
                        ▼
                     output
                    (y = wx + b)

The Chain Rule

For a composite function L = f(g(x)):

∂L/∂x = (∂L/∂g) · (∂g/∂x)

Each operation knows its local gradient. We multiply by the incoming gradient and propagate backwards.
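This is easy to sanity-check numerically. As a quick aside (the choice of f and g below is illustrative, not part of the engine), a central finite difference on the composite function should agree with the chain-rule derivative:

```python
import math

# L = f(g(x)) with f(u) = u**2 and g(x) = sin(x) (illustrative choices)
g = math.sin
f = lambda u: u ** 2

x = 0.7
# Chain rule: dL/dx = (dL/dg) * (dg/dx) = 2*g(x) * cos(x)
analytic = 2 * g(x) * math.cos(x)

# Central finite difference on the composite function
h = 1e-6
numeric = (f(g(x + h)) - f(g(x - h))) / (2 * h)

assert abs(analytic - numeric) < 1e-6
```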


BACKWARD PASS: Flowing Gradients
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

                        ∂L/∂output = 1
                              │
                              ▼
           ┏━━━━━━━━━━━━━━━━━━━━━━━━━┓
           ┃          add            ┃
           ┃  local: ∂/∂(xw) = 1     ┃  ← gradient passes through
           ┃  local: ∂/∂b = 1        ┃
           ┗━━━━━━━━━━━━┳━━━━━━━━━━━━┛
               │                 │
               ▼                 ▼
       ┏━━━━━━━━━━━━━━━━┓    ∂L/∂b = 1
       ┃    multiply    ┃
       ┃  local: ∂/∂x=w ┃  ← chain rule: multiply by incoming
       ┃  local: ∂/∂w=x ┃
       ┗━━━━━━━┳━━━━━━━━┛
           │        │
           ▼        ▼
     ∂L/∂x = w   ∂L/∂w = x

The Tensor Class

A Tensor wraps a numpy array and tracks its gradient and backprop function.

class Tensor:
    def __init__(self, data: np.ndarray, requires_grad=False, dtype=DType.LONG):
        np_type = dtype.value                 # assumes DType maps to a numpy dtype
        self.data = data.astype(np_type)      # the actual numpy array
        self.requires_grad = requires_grad
        self.grad: Tensor | None = None       # ∂L/∂(this tensor)
        self.grad_fn: Callable | None = None  # how to backprop

c = a + b creates a new Tensor with a grad_fn closure that propagates gradients back to a and b.


            ┏━━━┓     ┏━━━┓
            ┃ a ┃     ┃ b ┃
            ┗━┳━┛     ┗━┳━┛
              ┃    +    ┃
              ┗━━━━┳━━━━┛
                   ┃
                   ▼
                 ┏━━━┓
                 ┃ c ┃  ← c.grad_fn knows about a and b
                 ┗━━━┛    via closure
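The closure idea can be shown in isolation with a deliberately minimal sketch; Value and add here are illustrative names, not the engine's actual API:

```python
# Minimal sketch of the closure idea with plain Python floats
# (illustrative names, not the engine's actual API).
class Value:
    def __init__(self, data):
        self.data = data
        self.grad = 0.0
        self.grad_fn = None

def add(a: Value, b: Value) -> Value:
    c = Value(a.data + b.data)

    def _backward():
        # The closure captures a and b, so c "knows" its parents.
        a.grad += c.grad
        b.grad += c.grad

    c.grad_fn = _backward
    return c

a, b = Value(2.0), Value(3.0)
c = add(a, b)
c.grad = 1.0   # seed ∂L/∂c
c.grad_fn()    # propagate to parents
print(a.grad, b.grad)  # → 1.0 1.0
```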

Local Gradients

Each operation defines a local gradient:

Operation           Output        Local gradients
c = a + b           a + b         ∂c/∂a = 1,  ∂c/∂b = 1
c = a * b           a × b         ∂c/∂a = b,  ∂c/∂b = a
c = a ** n          aⁿ            ∂c/∂a = n·aⁿ⁻¹
c = sigmoid(a)      σ(a)          ∂c/∂a = σ(a)(1 − σ(a))
c = relu(a)         max(0, a)     ∂c/∂a = 1 if a > 0, else 0
C = A @ B           AB            ∂L/∂A = (∂L/∂C)·Bᵀ,  ∂L/∂B = Aᵀ·(∂L/∂C)
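Each row can be verified against a finite difference. This standalone check (plain numpy, no Tensor class involved) covers a few of them:

```python
import numpy as np

def numeric_grad(f, x, h=1e-6):
    # Central finite difference for a scalar function
    return (f(x + h) - f(x - h)) / (2 * h)

a, b, n = 2.0, 3.0, 3

# c = a * b: ∂c/∂a = b
assert abs(numeric_grad(lambda x: x * b, a) - b) < 1e-5

# c = a ** n: ∂c/∂a = n * a**(n-1)
assert abs(numeric_grad(lambda x: x ** n, a) - n * a ** (n - 1)) < 1e-4

# c = relu(a): gradient is 1 for a > 0
assert abs(numeric_grad(lambda x: max(x, 0.0), a) - 1.0) < 1e-6

# c = sigmoid(a): ∂c/∂a = σ(a)(1 - σ(a))
sigmoid = lambda x: 1 / (1 + np.exp(-x))
assert abs(numeric_grad(sigmoid, a) - sigmoid(a) * (1 - sigmoid(a))) < 1e-6
```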

Recursive Backprop

backwards() initializes the gradient to ones, then executes grad_fn which recurses on parent tensors.

def backwards(self):
    # Initialize gradient for starting tensor
    if self.grad is None:
        self.grad = Tensor(np.ones_like(self.data), requires_grad=False)

    if self.grad_fn is not None:
        self.grad_fn()  # computes parent grads, then recurses

Each grad_fn computes local gradients, accumulates into parent tensors, then recurses:

def multiply_tensor(tensor1: Tensor, tensor2: Tensor) -> Tensor:
    result = Tensor(tensor1.data * tensor2.data, requires_grad=True)

    def _backward():
        # Local gradients for multiplication: ∂(a*b)/∂a = b, ∂(a*b)/∂b = a
        grad1 = tensor2.data * result.grad.data
        grad2 = tensor1.data * result.grad.data

        # Accumulate gradients
        if tensor1.grad is None:
            tensor1.grad = Tensor(grad1, requires_grad=False)
        else:
            tensor1.grad += Tensor(grad1, requires_grad=False)
        if tensor2.grad is None:
            tensor2.grad = Tensor(grad2, requires_grad=False)
        else:
            tensor2.grad += Tensor(grad2, requires_grad=False)

        # Recurse to parents
        tensor1.backwards()
        tensor2.backwards()

    result.grad_fn = _backward
    return result
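One detail worth pausing on is the += accumulation: when a tensor feeds more than one operation, each branch contributes a term, and the contributions must sum. A tiny worked example with plain floats (not the Tensor class):

```python
# Why accumulation matters: for L = a*b + a*c, the tensor a feeds two
# operations, so ∂L/∂a = b + c (the sum over both paths).
a, b, c = 2.0, 3.0, 4.0

grad_a = 0.0
grad_a += b * 1.0   # contribution through the a*b branch (upstream grad 1)
grad_a += c * 1.0   # contribution through the a*c branch
assert grad_a == b + c == 7.0
```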

add_tensor is the same pattern with local gradients of 1:

def add_tensor(tensor1: Tensor, tensor2: Tensor) -> Tensor:
    result = Tensor(tensor1.data + tensor2.data, requires_grad=True)

    def _backward():
        # Local gradient for addition is 1
        grad1 = np.ones_like(tensor1.data) * result.grad.data
        grad2 = np.ones_like(tensor2.data) * result.grad.data

        # Accumulate and recurse...
        if tensor1.grad is None:
            tensor1.grad = Tensor(grad1, requires_grad=False)
        else:
            tensor1.grad += Tensor(grad1, requires_grad=False)
        # ...

    result.grad_fn = _backward
    return result

Activation Functions

Same pattern, different local gradients. ReLU:

∂ReLU(x)/∂x = 1 if x > 0, else 0

def relu_tensor(tensor: Tensor) -> Tensor:
    result = Tensor(np.maximum(tensor.data, 0), requires_grad=True)

    def _backward():
        # ReLU gradient: 1 where input > 0, 0 elsewhere
        relu_grad = result.grad.data * (tensor.data > 0).astype(tensor.data.dtype)

        if tensor.grad is None:
            tensor.grad = Tensor(relu_grad, requires_grad=False)
        else:
            tensor.grad += Tensor(relu_grad, requires_grad=False)
        tensor.backwards()

    result.grad_fn = _backward
    return result

Sigmoid:

∂σ(x)/∂x = σ(x)(1 − σ(x))

def sigmoid_tensor(tensor: Tensor) -> Tensor:
    sigmoid_output = 1 / (1 + np.exp(-tensor.data))
    result = Tensor(sigmoid_output, requires_grad=True)

    def _backward():
        # Sigmoid gradient: σ(x) * (1 - σ(x)), computed from the cached output
        sigmoid_grad = result.grad.data * result.data * (1 - result.data)

        if tensor.grad is None:
            tensor.grad = Tensor(sigmoid_grad, requires_grad=False)
        else:
            tensor.grad += Tensor(sigmoid_grad, requires_grad=False)
        tensor.backwards()

    result.grad_fn = _backward
    return result

Matrix Multiplication

For C = AB, the gradients require transposing:

∂L/∂A = (∂L/∂C) · Bᵀ,    ∂L/∂B = Aᵀ · (∂L/∂C)

def matmul_tensor(tensor1: Tensor, tensor2: Tensor) -> Tensor:
    result = Tensor(tensor1.data @ tensor2.data, requires_grad=True)

    def _backward():
        # For C = A @ B:
        # ∂L/∂A = ∂L/∂C @ B.T
        # ∂L/∂B = A.T @ ∂L/∂C
        grad1 = np.matmul(result.grad.data, tensor2.data.T)
        grad2 = np.matmul(tensor1.data.T, result.grad.data)

        if tensor1.grad is None:
            tensor1.grad = Tensor(grad1, requires_grad=False)
        else:
            tensor1.grad += Tensor(grad1, requires_grad=False)
        if tensor2.grad is None:
            tensor2.grad = Tensor(grad2, requires_grad=False)
        else:
            tensor2.grad += Tensor(grad2, requires_grad=False)

        tensor1.backwards()
        tensor2.backwards()

    result.grad_fn = _backward
    return result
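The transpose formulas can be checked numerically with plain numpy before trusting them in the engine; the names below are local to this sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((2, 3))
B = rng.standard_normal((3, 4))

# Scalar loss L = sum(A @ B), so the upstream gradient ∂L/∂C is all ones
dC = np.ones((2, 4))
dA = dC @ B.T
dB = A.T @ dC

# Finite-difference check on one entry of A
h = 1e-6
A_pert = A.copy()
A_pert[0, 1] += h
numeric = (np.sum(A_pert @ B) - np.sum(A @ B)) / h
assert abs(numeric - dA[0, 1]) < 1e-4

# Gradient shapes must match the parameter shapes
assert dA.shape == A.shape and dB.shape == B.shape
```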

A Complete Example

Let's trace through L = (a × b + c)² with a = 2, b = 3, c = 1:


FORWARD:                          BACKWARD:
════════                          ═════════

a=2   b=3   c=1                   ∂L/∂L = 1
 │     │     │
 └──┬──┘     │                    ∂L/∂e = 2e = 14
    ▼        │
   [*]       │    d = a×b = 6     ∂L/∂d = 14 × 1 = 14
    │ d=6    │                    ∂L/∂c = 14 × 1 = 14
    └───┬────┘
        ▼                         ∂L/∂a = 14 × b = 42
       [+]        e = d+c = 7     ∂L/∂b = 14 × a = 28
        │ e=7
        ▼                         ┏━━━━━━━━━━━━━━━━━━━━┓
      [**2]       L = e² = 49     ┃  FINAL GRADIENTS:  ┃
        │                         ┃  a.grad = 42       ┃
        ▼                         ┃  b.grad = 28       ┃
       L=49                       ┃  c.grad = 14       ┃
                                  ┗━━━━━━━━━━━━━━━━━━━━┛

The math checks out:

∂L/∂a = (∂L/∂e) · (∂e/∂d) · (∂d/∂a) = 2e · 1 · b = 2(7)(3) = 42
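The whole trace can be reproduced end to end with a stripped-down, float-valued version of the engine. This Value class is a sketch for illustration (it mirrors the closure pattern above, not the exact Tensor API):

```python
class Value:
    # Minimal float-valued sketch of the engine, enough for (a*b + c)**2.
    def __init__(self, data):
        self.data = data
        self.grad = None
        self.grad_fn = None

    def backwards(self):
        if self.grad is None:
            self.grad = 1.0       # seed ∂L/∂L = 1
        if self.grad_fn is not None:
            self.grad_fn()

    def __mul__(self, other):
        out = Value(self.data * other.data)
        def _backward():
            # ∂(a*b)/∂a = b, ∂(a*b)/∂b = a
            self.grad = (self.grad or 0.0) + other.data * out.grad
            other.grad = (other.grad or 0.0) + self.data * out.grad
            self.backwards(); other.backwards()
        out.grad_fn = _backward
        return out

    def __add__(self, other):
        out = Value(self.data + other.data)
        def _backward():
            # Local gradient of addition is 1: pass the upstream grad through
            self.grad = (self.grad or 0.0) + out.grad
            other.grad = (other.grad or 0.0) + out.grad
            self.backwards(); other.backwards()
        out.grad_fn = _backward
        return out

    def __pow__(self, n):
        out = Value(self.data ** n)
        def _backward():
            # ∂(a**n)/∂a = n * a**(n-1)
            self.grad = (self.grad or 0.0) + n * self.data ** (n - 1) * out.grad
            self.backwards()
        out.grad_fn = _backward
        return out

a, b, c = Value(2.0), Value(3.0), Value(1.0)
L = (a * b + c) ** 2
L.backwards()
assert (a.grad, b.grad, c.grad) == (42.0, 28.0, 14.0)
```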

Operator Overloading

Overriding __add__, __mul__, __matmul__, etc. lets arithmetic build the graph automatically.

class Tensor:
    def __add__(self, other):
        return add_tensor(self, other)

    def __mul__(self, other):
        if isinstance(other, Tensor):
            return multiply_tensor(self, other)
        else:
            return scalar_multiply_tensor(self, other)

    def __matmul__(self, other):
        return matmul_tensor(self, other)

    def __pow__(self, power):
        return tensor_pow(self, power)

    def __neg__(self):
        return scalar_multiply_tensor(self, -1)

    def __sub__(self, other):
        return add_tensor(self, -other)  # reuses __neg__

Reflections

Two ideas make this work: record every operation during the forward pass, then walk the graph backwards applying the chain rule. Each node stores a closure over its parents, and backwards() recurses through the graph.

The implementation also handles broadcasting, dtype resolution, and gradient accumulation for shared tensors.

Resources

GitHub Repository — Full implementation with examples

Technical Report (PDF) — Detailed mathematical writeup