In the video "The spelled-out intro to language modeling: building makemore", Andrej Karpathy briefly mentioned a bug that can occur when broadcasting tensors during mathematical operations.

In the lines below, N is a 27x27 tensor. The code tries to normalize the rows of the tensor by dividing each row by its sum.

Here is the incorrect calculation:

P = N.float()
P /= P.sum(1)

The correct calculation should be:

P = N.float()
P /= P.sum(1,keepdim=True)
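Before digging into why, it's worth noting a cheap way to catch this class of bug: assert the invariant you expect after normalizing. Here is a minimal sketch, assuming N is the 27x27 count tensor from the video:

import torch

P = N.float()
P /= P.sum(1, keepdim=True)

# Every row should now be a probability distribution that sums to 1.
assert torch.allclose(P.sum(1), torch.ones(27))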

Andrej briefly explained why this happens, but his explanation didn't land for me, so I walked through a few smaller examples to understand the mechanics of the tensor operations.

P / P.sum(1, keepdim=True)

import torch

P = [[1, 2, 3],
     [4, 5, 6],
     [7, 8, 9]]

P = torch.tensor(P)
P_sum = P.sum(1, keepdim=True)

# P_sum: this is what a column vector looks like in PyTorch
# [[6],
#  [15],
#  [24]]

Here, P.sum(1, keepdim=True) computes the sum of each row in P and returns the results as a 3x1 column vector. The 1 argument specifies that the sums should be computed along dimension 1, the columns, which collapses each row into a single value. keepdim=True tells PyTorch to keep the reduced dimension in the output as a dimension of size 1, so the result stays a column vector with shape (3, 1) rather than being flattened during the summation operation.
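The shape difference is easy to see directly:

P.sum(1, keepdim=True).shape
# torch.Size([3, 1])

P.sum(1).shape
# torch.Size([3])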

Now when you divide P by P_sum, each element in a row of P is divided by the sum of that row. This normalizes each row so that the sum of the elements in each row is 1.

P / P.sum(1, keepdim=True)  

# [[1/6,  2/6,  3/6],
#  [4/15, 5/15, 6/15],
#  [7/24, 8/24, 9/24]]
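A quick check confirms the normalization (up to floating point rounding):

torch.allclose((P / P.sum(1, keepdim=True)).sum(1), torch.ones(3))
# True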

P / P.sum(1)

P = [[1, 2, 3],
     [4, 5, 6],
     [7, 8, 9]]
     
P = torch.tensor(P)
P_sum = P.sum(1)

# P_sum: the row sums, flattened into a 1-dimensional tensor
# [6, 15, 24]

P.sum(1) computes the sum of each row in P, but without keepdim=True the summed dimension is dropped: the result is a flat, 1-dimensional tensor of shape (3,) rather than a 3x1 column vector.

When you divide P by P.sum(1), PyTorch's broadcasting rules align the shapes from the right. P has shape (3, 3) and P.sum(1) has shape (3,), so the row-sum vector is treated as a single row of shape (1, 3) and stretched down across all three rows of P. That aligns the 3 elements of P.sum(1) with the columns of P, which is not the intended behavior: each row sum ends up dividing a column rather than its own row. It's operating on the wrong dimension, and the result is incorrect normalization.

P / P.sum(1)  

# [[1/6, 2/15, 3/24],
#  [4/6, 5/15, 6/24],
#  [7/6, 8/15, 9/24]]
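You can make the misalignment explicit by expanding the flat row-sum vector to P's shape, which is effectively what broadcasting does before dividing:

P.sum(1).expand(3, 3)
# [[ 6, 15, 24],
#  [ 6, 15, 24],
#  [ 6, 15, 24]]

Each row sum divides a column instead of its own row, and the rows of the result no longer sum to 1:

(P / P.sum(1)).sum(1)
# tensor([0.4250, 1.2500, 2.0750])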

PyTorch mechanics can seem a little unintuitive and mind-bendy when you're learning them, so it can be helpful to pick them apart with small examples like we did here.