Autodiff from Scratch

Last updated: 28/06/2025

I implemented an autodiff engine from scratch to understand PyTorch's autograd at a deeper level. The math is simpler than expected — it's just the chain rule applied systematically.

Computational Graphs

Every operation builds a DAG. Nodes are tensors, edges are operations. The graph records how each value was computed.


FORWARD PASS: Building the Graph
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

         x            w            b
         │            │            │
         ▼            ▼            │
       ┏━━━━━━━━━━━━━━━━┓          │
       ┃    multiply    ┃          │
       ┃     (x * w)    ┃          │
       ┗━━━━━━━┳━━━━━━━━┛          │
               │                   │
               ▼                   ▼
           ┏━━━━━━━━━━━━━━━━━━━━━━━━━┓
           ┃          add            ┃
           ┃      (x * w) + b        ┃
           ┗━━━━━━━━━━━━┳━━━━━━━━━━━━┛
                        │
                        ▼
                     output
                    (y = wx + b)

The Chain Rule

For a composite function L = f(g(x)):

∂L/∂x = (∂L/∂g) · (∂g/∂x)

Each operation knows its local gradient. We multiply by the incoming gradient and propagate backwards.
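This is easy to sanity-check numerically. As a quick aside (the choice of f and g below is illustrative, not part of the engine), a central finite difference on the composite function should agree with the chain-rule derivative:

```python
import math

# L = f(g(x)) with f(u) = u**2 and g(x) = sin(x) (illustrative choices)
g = math.sin
f = lambda u: u ** 2

x = 0.7
# Chain rule: dL/dx = (dL/dg) * (dg/dx) = 2*g(x) * cos(x)
analytic = 2 * g(x) * math.cos(x)

# Central finite difference on the composite function
h = 1e-6
numeric = (f(g(x + h)) - f(g(x - h))) / (2 * h)

assert abs(analytic - numeric) < 1e-6
```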


BACKWARD PASS: Flowing Gradients
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

                        ∂L/∂output = 1
                              │
                              ▼
           ┏━━━━━━━━━━━━━━━━━━━━━━━━━┓
           ┃          add            ┃
           ┃  local: ∂/∂(xw) = 1     ┃  ← gradient passes through
           ┃  local: ∂/∂b = 1        ┃
           ┗━━━━━━━━━━━━┳━━━━━━━━━━━━┛
               │                 │
               ▼                 ▼
       ┏━━━━━━━━━━━━━━━━┓    ∂L/∂b = 1
       ┃    multiply    ┃
       ┃  local: ∂/∂x=w ┃  ← chain rule: multiply by incoming
       ┃  local: ∂/∂w=x ┃
       ┗━━━━━━━┳━━━━━━━━┛
           │        │
           ▼        ▼
     ∂L/∂x = w   ∂L/∂w = x

The Tensor Class

A Tensor wraps a numpy array and tracks its gradient and backprop function.

class Tensor:
    def __init__(self, data: np.ndarray, requires_grad=False, dtype=DType.LONG):
        np_type = dtype.value                 # assumes DType maps to a numpy dtype
        self.data = data.astype(np_type)      # the actual numpy array
        self.requires_grad = requires_grad
        self.grad: Tensor | None = None       # ∂L/∂(this tensor)
        self.grad_fn: Callable | None = None  # how to backprop

c = a + b creates a new Tensor with a grad_fn closure that propagates gradients back to a and b.


            ┏━━━┓     ┏━━━┓
            ┃ a ┃     ┃ b ┃
            ┗━┳━┛     ┗━┳━┛
              ┃    +    ┃
              ┗━━━━┳━━━━┛
                   ┃
                   ▼
                 ┏━━━┓
                 ┃ c ┃  ← c.grad_fn knows about a and b
                 ┗━━━┛    via closure
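The closure idea can be shown in isolation with a deliberately minimal sketch; Value and add here are illustrative names, not the engine's actual API:

```python
# Minimal sketch of the closure idea with plain Python floats
# (illustrative names, not the engine's actual API).
class Value:
    def __init__(self, data):
        self.data = data
        self.grad = 0.0
        self.grad_fn = None

def add(a: Value, b: Value) -> Value:
    c = Value(a.data + b.data)

    def _backward():
        # The closure captures a and b, so c "knows" its parents.
        a.grad += c.grad
        b.grad += c.grad

    c.grad_fn = _backward
    return c

a, b = Value(2.0), Value(3.0)
c = add(a, b)
c.grad = 1.0   # seed ∂L/∂c
c.grad_fn()    # propagate to parents
print(a.grad, b.grad)  # → 1.0 1.0
```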

Local Gradients

Each operation defines a local gradient:

Operation           Output        Local gradients
c = a + b           a + b         ∂c/∂a = 1,  ∂c/∂b = 1
c = a * b           a × b         ∂c/∂a = b,  ∂c/∂b = a
c = a ** n          aⁿ            ∂c/∂a = n·aⁿ⁻¹
c = sigmoid(a)      σ(a)          ∂c/∂a = σ(a)(1 − σ(a))
c = relu(a)         max(0, a)     ∂c/∂a = 1 if a > 0, else 0
C = A @ B           AB            ∂L/∂A = (∂L/∂C)·Bᵀ,  ∂L/∂B = Aᵀ·(∂L/∂C)
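Each row can be verified against a finite difference. This standalone check (plain numpy, no Tensor class involved) covers a few of them:

```python
import numpy as np

def numeric_grad(f, x, h=1e-6):
    # Central finite difference for a scalar function
    return (f(x + h) - f(x - h)) / (2 * h)

a, b, n = 2.0, 3.0, 3

# c = a * b: ∂c/∂a = b
assert abs(numeric_grad(lambda x: x * b, a) - b) < 1e-5

# c = a ** n: ∂c/∂a = n * a**(n-1)
assert abs(numeric_grad(lambda x: x ** n, a) - n * a ** (n - 1)) < 1e-4

# c = relu(a): gradient is 1 for a > 0
assert abs(numeric_grad(lambda x: max(x, 0.0), a) - 1.0) < 1e-6

# c = sigmoid(a): ∂c/∂a = σ(a)(1 - σ(a))
sigmoid = lambda x: 1 / (1 + np.exp(-x))
assert abs(numeric_grad(sigmoid, a) - sigmoid(a) * (1 - sigmoid(a))) < 1e-6
```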

Recursive Backprop

backwards() initializes the gradient to ones, then executes grad_fn which recurses on parent tensors.

def backwards(self):
    # Initialize gradient for starting tensor
    if self.grad is None:
        self.grad = Tensor(np.ones_like(self.data), requires_grad=False)

    if self.grad_fn is not None:
        self.grad_fn()  # computes parent grads, then recurses

Each grad_fn computes local gradients, accumulates into parent tensors, then recurses:

def multiply_tensor(tensor1: Tensor, tensor2: Tensor) -> Tensor:
    result = Tensor(tensor1.data * tensor2.data, requires_grad=True)

    def _backward():
        # Local gradients for multiplication: ∂(a*b)/∂a = b, ∂(a*b)/∂b = a
        grad1 = tensor2.data * result.grad.data
        grad2 = tensor1.data * result.grad.data

        # Accumulate gradients
        if tensor1.grad is None:
            tensor1.grad = Tensor(grad1, requires_grad=False)
        else:
            tensor1.grad += Tensor(grad1, requires_grad=False)
        if tensor2.grad is None:
            tensor2.grad = Tensor(grad2, requires_grad=False)
        else:
            tensor2.grad += Tensor(grad2, requires_grad=False)

        # Recurse to parents
        tensor1.backwards()
        tensor2.backwards()

    result.grad_fn = _backward
    return result
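One detail worth pausing on is the += accumulation: when a tensor feeds more than one operation, each branch contributes a term, and the contributions must sum. A tiny worked example with plain floats (not the Tensor class):

```python
# Why accumulation matters: for L = a*b + a*c, the tensor a feeds two
# operations, so ∂L/∂a = b + c (the sum over both paths).
a, b, c = 2.0, 3.0, 4.0

grad_a = 0.0
grad_a += b * 1.0   # contribution through the a*b branch (upstream grad 1)
grad_a += c * 1.0   # contribution through the a*c branch
assert grad_a == b + c == 7.0
```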

add_tensor is the same pattern with local gradients of 1:

def add_tensor(tensor1: Tensor, tensor2: Tensor) -> Tensor:
    result = Tensor(tensor1.data + tensor2.data, requires_grad=True)

    def _backward():
        # Local gradient for addition is 1
        grad1 = np.ones_like(tensor1.data) * result.grad.data
        grad2 = np.ones_like(tensor2.data) * result.grad.data

        # Accumulate and recurse...
        if tensor1.grad is None:
            tensor1.grad = Tensor(grad1, requires_grad=False)
        else:
            tensor1.grad += Tensor(grad1, requires_grad=False)
        # ...

    result.grad_fn = _backward
    return result

Activation Functions

Same pattern, different local gradients. ReLU:

∂ReLU(x)/∂x = 1 if x > 0, else 0

def relu_tensor(tensor: Tensor) -> Tensor:
    result = Tensor(np.maximum(tensor.data, 0), requires_grad=True)

    def _backward():
        # ReLU gradient: 1 where input > 0, 0 elsewhere
        relu_grad = result.grad.data * (tensor.data > 0).astype(tensor.data.dtype)

        if tensor.grad is None:
            tensor.grad = Tensor(relu_grad, requires_grad=False)
        else:
            tensor.grad += Tensor(relu_grad, requires_grad=False)
        tensor.backwards()

    result.grad_fn = _backward
    return result

Sigmoid:

∂σ(x)/∂x = σ(x)(1 − σ(x))

def sigmoid_tensor(tensor: Tensor) -> Tensor:
    sigmoid_output = 1 / (1 + np.exp(-tensor.data))
    result = Tensor(sigmoid_output, requires_grad=True)

    def _backward():
        # Sigmoid gradient: σ(x) * (1 - σ(x)), computed from the cached output
        sigmoid_grad = result.grad.data * result.data * (1 - result.data)

        if tensor.grad is None:
            tensor.grad = Tensor(sigmoid_grad, requires_grad=False)
        else:
            tensor.grad += Tensor(sigmoid_grad, requires_grad=False)
        tensor.backwards()

    result.grad_fn = _backward
    return result

Matrix Multiplication

For C = AB, the gradients require transposing:

∂L/∂A = (∂L/∂C) · Bᵀ,    ∂L/∂B = Aᵀ · (∂L/∂C)

def matmul_tensor(tensor1: Tensor, tensor2: Tensor) -> Tensor:
    result = Tensor(tensor1.data @ tensor2.data, requires_grad=True)

    def _backward():
        # For C = A @ B:
        # ∂L/∂A = ∂L/∂C @ B.T
        # ∂L/∂B = A.T @ ∂L/∂C
        grad1 = np.matmul(result.grad.data, tensor2.data.T)
        grad2 = np.matmul(tensor1.data.T, result.grad.data)

        if tensor1.grad is None:
            tensor1.grad = Tensor(grad1, requires_grad=False)
        else:
            tensor1.grad += Tensor(grad1, requires_grad=False)
        if tensor2.grad is None:
            tensor2.grad = Tensor(grad2, requires_grad=False)
        else:
            tensor2.grad += Tensor(grad2, requires_grad=False)

        tensor1.backwards()
        tensor2.backwards()

    result.grad_fn = _backward
    return result
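The transpose formulas can be checked numerically with plain numpy before trusting them in the engine; the names below are local to this sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((2, 3))
B = rng.standard_normal((3, 4))

# Scalar loss L = sum(A @ B), so the upstream gradient ∂L/∂C is all ones
dC = np.ones((2, 4))
dA = dC @ B.T
dB = A.T @ dC

# Finite-difference check on one entry of A
h = 1e-6
A_pert = A.copy()
A_pert[0, 1] += h
numeric = (np.sum(A_pert @ B) - np.sum(A @ B)) / h
assert abs(numeric - dA[0, 1]) < 1e-4

# Gradient shapes must match the parameter shapes
assert dA.shape == A.shape and dB.shape == B.shape
```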

A Complete Example

Let's trace through L = (a × b + c)² with a = 2, b = 3, c = 1:


FORWARD:                          BACKWARD:
════════                          ═════════

a=2   b=3   c=1                   ∂L/∂L = 1
 │     │     │
 └──┬──┘     │                    ∂L/∂e = 2e = 14
    ▼        │
   [*]       │    d = a×b = 6     ∂L/∂d = 14 × 1 = 14
    │ d=6    │                    ∂L/∂c = 14 × 1 = 14
    └───┬────┘
        ▼                         ∂L/∂a = 14 × b = 42
       [+]        e = d+c = 7     ∂L/∂b = 14 × a = 28
        │ e=7
        ▼                         ┏━━━━━━━━━━━━━━━━━━━━┓
      [**2]       L = e² = 49     ┃  FINAL GRADIENTS:  ┃
        │                         ┃  a.grad = 42       ┃
        ▼                         ┃  b.grad = 28       ┃
       L=49                       ┃  c.grad = 14       ┃
                                  ┗━━━━━━━━━━━━━━━━━━━━┛

The math checks out:

∂L/∂a = (∂L/∂e) · (∂e/∂d) · (∂d/∂a) = 2e · 1 · b = 2(7)(3) = 42
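The whole trace can be reproduced end to end with a stripped-down, float-valued version of the engine. This Value class is a sketch for illustration (it mirrors the closure pattern above, not the exact Tensor API):

```python
class Value:
    # Minimal float-valued sketch of the engine, enough for (a*b + c)**2.
    def __init__(self, data):
        self.data = data
        self.grad = None
        self.grad_fn = None

    def backwards(self):
        if self.grad is None:
            self.grad = 1.0       # seed ∂L/∂L = 1
        if self.grad_fn is not None:
            self.grad_fn()

    def __mul__(self, other):
        out = Value(self.data * other.data)
        def _backward():
            # ∂(a*b)/∂a = b, ∂(a*b)/∂b = a
            self.grad = (self.grad or 0.0) + other.data * out.grad
            other.grad = (other.grad or 0.0) + self.data * out.grad
            self.backwards(); other.backwards()
        out.grad_fn = _backward
        return out

    def __add__(self, other):
        out = Value(self.data + other.data)
        def _backward():
            # Local gradient of addition is 1: pass the upstream grad through
            self.grad = (self.grad or 0.0) + out.grad
            other.grad = (other.grad or 0.0) + out.grad
            self.backwards(); other.backwards()
        out.grad_fn = _backward
        return out

    def __pow__(self, n):
        out = Value(self.data ** n)
        def _backward():
            # ∂(a**n)/∂a = n * a**(n-1)
            self.grad = (self.grad or 0.0) + n * self.data ** (n - 1) * out.grad
            self.backwards()
        out.grad_fn = _backward
        return out

a, b, c = Value(2.0), Value(3.0), Value(1.0)
L = (a * b + c) ** 2
L.backwards()
assert (a.grad, b.grad, c.grad) == (42.0, 28.0, 14.0)
```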

Operator Overloading

Overriding __add__, __mul__, __matmul__, etc. lets arithmetic build the graph automatically.

class Tensor:
    def __add__(self, other):
        return add_tensor(self, other)

    def __mul__(self, other):
        if isinstance(other, Tensor):
            return multiply_tensor(self, other)
        else:
            return scalar_multiply_tensor(self, other)

    def __matmul__(self, other):
        return matmul_tensor(self, other)

    def __pow__(self, power):
        return tensor_pow(self, power)

    def __neg__(self):
        return scalar_multiply_tensor(self, -1)

    def __sub__(self, other):
        return add_tensor(self, -other)  # reuses __neg__

Reflections

Two ideas make this work: record every operation during the forward pass, then walk the graph backwards applying the chain rule. Each node stores a closure over its parents, and backwards() recurses through the graph.

The implementation also handles broadcasting, dtype resolution, and gradient accumulation for shared tensors.

Resources

GitHub Repository — Full implementation with examples

Technical Report (PDF) — Detailed mathematical writeup