Raw notes for tackling ✨Deep Learning✨'s second assignment on CNNs from scratch. These were helpful for a few people in the recent cohort, so why not share them.
Motivations
Earlier, we did:
- fully connected layer - every neuron to every other neuron - dense connections
- in TensorFlow, it's called a Dense layer
- in PyTorch, it's called a Linear layer
- but this leads to a lot of parameters
- a 28x28 input flattened to 784 values already needs 784 x 10 params for a 10-class output
- e.g. with a hidden layer: 784 x 256 plus 256 x 10
Motivations for CNNs:
- can we organize our weights into a smaller set of weights and convolve or cross-correlate it with the input? That reduces the number of params a lot (see the quick count after this list)!
- and that motivates learning about CNNs.
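A quick back-of-the-envelope count of the numbers above (the 16-filter, 3x3 conv layer is just an illustrative assumption, not the assignment's architecture):

```python
# Rough parameter counts, biases ignored, matching the numbers above.
dense_params = 784 * 256 + 256 * 10   # two fully connected layers on a flattened 28x28 input
conv_params = 16 * (3 * 3 * 1)        # e.g. 16 conv filters of size 3x3 over 1 input channel
print(dense_params)                   # 203264
print(conv_params)                    # 144
```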
TL;DR: use NumPy and implement the forward and backward passes.
Part 2 | Coding
uses torch's autograd, leveraging the library to compute the gradients
- we compose the forward function
- then when backward method is called, the gradient is computed
- gradient can be used in optimization process
- do some experiments, play around with architecture
- tldr: build CNNs in PyTorch and run some experiments
Part 3 and 4 | Imbalance
we use an imbalanced dataset and discuss some ways to deal with class imbalance
- read papers
- synthesize the implementation from those papers
- and run the experiments
Part 1 in NumPy
implement modules / layers:
- the forward and backward
- convolution
- max pooling
- relu (familiar from a1)
- linear (familiar from a1)
In a1, the architecture was hard-coded, but in a2 we're going to make generalized modules and chain them together
implement optimizer:
- sgd with momentum (slight extension to a1)
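A minimal sketch of SGD with momentum in NumPy. The layer interface here (a .weights dict and a matching .grads dict per module) is an assumption, not the assignment's actual API:

```python
import numpy as np

class SGDMomentum:
    # Classic momentum update: v = mu * v - lr * grad; w = w + v
    # Assumes each layer exposes .weights / .grads dicts (hypothetical interface).
    def __init__(self, layers, lr=0.01, momentum=0.9):
        self.layers, self.lr, self.mu = layers, lr, momentum
        # one zero-initialized velocity buffer per parameter
        self.velocity = [{k: np.zeros_like(w) for k, w in layer.weights.items()}
                         for layer in layers]

    def step(self):
        for layer, vel in zip(self.layers, self.velocity):
            for k in layer.weights:
                vel[k] = self.mu * vel[k] - self.lr * layer.grads[k]
                layer.weights[k] += vel[k]
```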
Convolution | Forward Pass

First one (A) as an example:
- Multiply and sum, e.g. for the first one: 1*1 + 1*0 + 1*1 (first row) + 0*0 + 1*1 + 1*0 (second row) + 0*1 + 0*0 + 1*1 (third row) = 4
- 9 such multiply-and-sum operations in total
- 9 different linear transforms ⇒ the feature map has 9 values
- We can use different filters/kernels to extract certain types of features
- vertical or horizontal edges
- dots
- The weights are repeatedly shared across the image
- and important to note we can have multiple kernels/filters
- The images in the dataset are 3 channel: 32 x 32 x 3
- a linear layer (fully connected) on this will be huge
- The kernel has the same depth as the input
- if the input is 3-channel, our kernel is 3x3x3 ⇒ the depth of the kernel == depth of the input
- Motivated by parameter sharing and notions from signal processing, we create this convolution operation that is applied to images. This is the FORWARD PASS.
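A loop-based sketch of that forward pass (argument names and shapes are my own choices, not the assignment's spec; deep-learning "convolution" here is really cross-correlation):

```python
import numpy as np

def conv2d_forward(x, w, stride=1):
    # x: (C, H, W) input image, w: (F, C, k, k) filters; no padding, no bias
    C, H, W = x.shape
    F, _, k, _ = w.shape
    H_out = (H - k) // stride + 1
    W_out = (W - k) // stride + 1
    out = np.zeros((F, H_out, W_out))
    for f in range(F):                  # each filter produces one feature map
        for i in range(H_out):
            for j in range(W_out):
                patch = x[:, i*stride:i*stride + k, j*stride:j*stride + k]
                out[f, i, j] = np.sum(patch * w[f])   # multiply-and-sum over C*k*k values
    return out
```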
Vectorizing
- It is okay to use loops.
- Vectorizing is not necessary but it's important in practice
- We know a priori from the kernel size, image size and stride:
- how many operations we will do
- and what these feature views look like
- using this a priori knowledge, we can use fancy indexing to take each of these views and create them all at once
- we can take copies of the kernels and broadcast it over each of the views
Vectorization | Stride tricks (1 of 2)

- Goal: create views of image the same size as filters we’re applying:
- (c, h, w) → (h_new, w_new, c, k, k), i.e. h_new * w_new views, each of size c * k * k (here 3 x 3)
- Image A as an array:
- A.shape = (c, h, w) = (3, 45, 40) = channels * height * width
- A.strides = (s_c, s_h, s_w), e.g. (14400, 320, 8); strides represent how many bytes of offset it takes to get to the next value in that dimension
- 14400 - MOVE ACROSS CHANNELS - 45 x 40 x 8 bytes per stride
- number of steps between CHANNELS
- between R G B channel
- 320 - MOVE ACROSS ROWS in SAME CHANNEL - 40 x 8 bytes per stride
- if we are at the first row and want to jump to next row, we need to jump 320 bytes to go to next element
- 8 - MOVE across COLUMN in SAME ROW in SAME CHANNEL - 8 bytes per stride
- to move from one column to next in same row
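A quick check of those numbers (np.zeros defaults to float64, i.e. 8 bytes per element):

```python
import numpy as np

A = np.zeros((3, 45, 40))   # (c, h, w)
print(A.strides)            # (14400, 320, 8) = (45*40*8, 40*8, 8)
```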
Vectorization | Stride tricks (2 of 2)

The strides from trick #1 go into the function above (numpy's np.lib.stride_tricks.as_strided).
- A is the input image tensor
- shape = the desired shape
- strides: how we want to index it
- s_h*2: strides=2 in h, similar for s_w
- s_c: we’re not striding over channel dimension
- writeable=False ⇒ see docs below.
- If set to False, the returned array will always be readonly. Otherwise it will be writable if the original array was. It is advisable to set this to False if possible.
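A minimal sketch of building those views with as_strided, assuming a 3x3 kernel and stride 2 to match the shapes above:

```python
import numpy as np
from numpy.lib.stride_tricks import as_strided

A = np.zeros((3, 45, 40))                 # (c, h, w)
k, stride = 3, 2
c, h, w = A.shape
s_c, s_h, s_w = A.strides
h_new = (h - k) // stride + 1
w_new = (w - k) // stride + 1

# (h_new, w_new, c, k, k) views into A; no data is copied
views = as_strided(
    A,
    shape=(h_new, w_new, c, k, k),
    strides=(s_h * stride, s_w * stride, s_c, s_h, s_w),
    writeable=False,                      # safer: the views share memory with A
)
print(views.shape)                        # (22, 19, 3, 3, 3)
```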
Vectorization | Multiplying tensors

Both tensordot and einsum are fine. Tensordot is a tiny bit faster.
- We have a tensor A and kernel B.
- With einsum, you can do many things.
- We put in the string and that tells einsum what operation we intend to do.
- Dot products
- Transposes
- Diagonal etc
- strings:
- (‘i,i’, a,b): dot product
- (‘ij,jk’, a,b): matmul
- (‘nchw,chwk→nk’, a, b): reduce across c, h, w and we’re left with an n-by-k matrix
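A small sketch of both options on the conv shapes from above (the shapes and the 16-filter bank are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
views = rng.standard_normal((22, 19, 3, 3, 3))   # (h_new, w_new, c, k, k) views from as_strided
filters = rng.standard_normal((16, 3, 3, 3))     # (num_filters, c, k, k)

# einsum: reduce over (c, k, k), keep (num_filters, h_new, w_new)
out_einsum = np.einsum('hwcij,fcij->fhw', views, filters)

# tensordot: same reduction; result comes out as (h_new, w_new, f), so move f to the front
out_tdot = np.moveaxis(np.tensordot(views, filters, axes=([2, 3, 4], [1, 2, 3])), -1, 0)

print(out_einsum.shape, np.allclose(out_einsum, out_tdot))   # (16, 22, 19) True
```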
Vectorization | Broadcasting

Broadcasting expands B to size (a, b, c) so we can perform the addition. .reshape can be used as well.
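A minimal example, assuming the addition in question is a per-filter bias added onto the conv output:

```python
import numpy as np

out = np.zeros((16, 22, 19))                # (num_filters, h_new, w_new)
bias = np.arange(16, dtype=float)           # one bias per filter, shape (16,)

# Broadcast bias to (16, 1, 1) so it expands across the spatial dims.
out = out + bias[:, None, None]             # equivalently: bias.reshape(16, 1, 1)
print(out[:, 0, 0])                         # [ 0.  1. ... 15.]
```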
Vectorization: Unravel Index | Max Pooling
Useful for max pooling

argmax returns 2 because that's the index in the flattened array.
Unravel index tells us "okay, here's the index in the flattened array; where would it occur in the original matrix?", which is (1, 0).
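A tiny example matching the numbers above (the window values are made up):

```python
import numpy as np

window = np.array([[1, 3],
                   [7, 2]])                       # a 2x2 pooling window
flat_idx = np.argmax(window)                      # 2: index into the flattened array [1, 3, 7, 2]
print(np.unravel_index(flat_idx, window.shape))   # (1, 0): row 1, column 0 of the window
```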
Part 2 - Using PyTorch
We only do the forward and PyTorch’s autograd does the backward which is then used to perform the training.
Note the difference between training and validation: weights are only updated during training.
Training loop of PyTorch
- Implement training and validation step
- in validation, we don’t need gradients
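A generic sketch of that loop (not the assignment's exact scaffold; model, loader, criterion, optimizer and device all come from your own setup):

```python
import torch

def train_one_epoch(model, loader, criterion, optimizer, device):
    model.train()
    for x, y in loader:
        x, y = x.to(device), y.to(device)
        optimizer.zero_grad()
        loss = criterion(model(x), y)   # forward pass; autograd records the graph
        loss.backward()                 # autograd computes the gradients
        optimizer.step()                # gradients drive the parameter update

@torch.no_grad()                        # validation: no gradients needed
def validate(model, loader, device):
    model.eval()
    correct, total = 0, 0
    for x, y in loader:
        x, y = x.to(device), y.to(device)
        correct += (model(x).argmax(dim=1) == y).sum().item()
        total += y.numel()
    return correct / total
```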
PyTorch Cross Entropy
In PyTorch, the cross-entropy loss applies the softmax internally (like in A1, where we combined the two functions because the gradient calculation becomes simpler). So softmax and CE loss happen in the same function.
You can separate the two pieces out if needed, but by default the function just returns the loss.
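For example, nn.CrossEntropyLoss expects raw logits and returns a single scalar by default:

```python
import torch
import torch.nn as nn

logits = torch.randn(4, 10)                       # raw model outputs, no softmax applied
targets = torch.tensor([3, 0, 7, 1])              # class indices
loss = nn.CrossEntropyLoss()(logits, targets)     # (log-)softmax happens inside the loss
print(loss)                                       # a single scalar
```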
PyTorch CNN
In PyTorch, convolution is called conv2d which means we’re doing convolution in 2d spatial dimensions. Similarly, conv3d is in 3 dimensions.
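For instance (the channel counts here are arbitrary):

```python
import torch
import torch.nn as nn

conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, padding=1)
x = torch.randn(8, 3, 32, 32)      # (batch, channels, height, width)
print(conv(x).shape)               # torch.Size([8, 16, 32, 32]); padding=1 keeps the 32x32 size
```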
Implementing our own model
implement the training loop and the forward pass

We are allowed to take inspiration from the lectures, which talk about different CNN architectures, i.e. we can use the lectures as a priori knowledge.
There may be certain architectures motivated by certain concepts. Feel free to leverage that intuition in your own experimentation.
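One possible shape for such a model, purely as an illustration (the layer sizes are my own guesses, not a recommended architecture):

```python
import torch
import torch.nn as nn

class SmallCNN(nn.Module):
    # A deliberately small CNN for 32x32x3 inputs and 10 classes.
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                              # 32x32 -> 16x16
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                              # 16x16 -> 8x8
        )
        # a small fully connected head to project back down to 10 classes
        self.classifier = nn.Linear(64 * 8 * 8, num_classes)

    def forward(self, x):
        return self.classifier(torch.flatten(self.features(x), 1))
```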
Limits to own model:
- Size limit for the full submission: < 100 MB
- a 2 MB model can pass (i.e. score > 0.8)
- Time limit: < 10 minutes, and that 10 minutes covers all tests
- so if part 1 takes a long time to run, you're limiting your part 2 submission
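A quick way to sanity-check the size side of this (the tiny model here is a stand-in; assumes float32 weights, roughly 4 bytes per parameter):

```python
import torch
import torch.nn as nn

# stand-in model, not the real one
model = nn.Sequential(nn.Conv2d(3, 32, 3, padding=1), nn.Flatten(), nn.Linear(32 * 32 * 32, 10))
n_params = sum(p.numel() for p in model.parameters())
approx_mb = n_params * 4 / 1e6        # float32: ~4 bytes per parameter
print(n_params, f"~{approx_mb:.1f} MB")
```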
Once the code is written and run, it spits out the model and the results. Per the instructions, we package the entire submission. Gradescope takes our trained model and tests it; Gradescope does not train. The .sh script packages the checkpoints.
A checkpoint is a snapshot of the model training, typically the weights of the model. It may not be just the weights; it could include:
- weights
- learning rate if you use a learning rate schedule
- many other things
ENSURE the model has been trained and that the code creates and saves the checkpoints. Gradescope just takes these checkpoints and runs them against the validation set.
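A minimal save/load sketch with torch.save (the filename and the tiny stand-in model are just examples):

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 2)                                   # stand-in for the real model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# Save the weights plus anything else needed later (optimizer/scheduler state, epoch, ...).
torch.save(
    {"model_state": model.state_dict(),
     "optimizer_state": optimizer.state_dict(),
     "epoch": 5},
    "checkpoint.pt",
)

# Load, e.g. what the grader would do before running the validation set.
ckpt = torch.load("checkpoint.pt", map_location="cpu")
model.load_state_dict(ckpt["model_state"])
```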
Tips:
- If the model is too deep (too many params), you will get a timeout, especially with fully connected networks because they have a lot of params.
- You will need “some” level of fully connected network to project the data back to 10 classes.
Inspiration
A good video to take inspiration from:
Imbalanced dataset part
We're given an imbalanced dataset and use the ResNet implementation provided to us.
Implement/Experiment with the methods provided in the papers that deal with class imbalance.
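One common baseline, not necessarily what the assigned papers propose, is to weight the loss by inverse class frequency (the class counts below are made up):

```python
import torch
import torch.nn as nn

class_counts = torch.tensor([5000., 2500., 500., 100.])    # made-up counts for 4 classes
weights = class_counts.sum() / (len(class_counts) * class_counts)
criterion = nn.CrossEntropyLoss(weight=weights)             # rare classes contribute more to the loss

logits = torch.randn(8, 4)
targets = torch.randint(0, 4, (8,))
print(criterion(logits, targets))
```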
What’s the difference between multiclass and multilabel classification?
- Multiclass output vector has a 1 in the correct class and 0 everywhere else.
- In multilabel classification, we can have multiple 1s in the output vector.
- We can't use softmax, so we use multiple binary classifiers - need to look into this.
- The paper talks about both these approaches.
- a car is a vehicle + an object
- a particular breed of a dog
There does not seem to be a huge difference between the softmax and multi binary classifiers.
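A side-by-side sketch of the two loss setups (the label vectors are illustrative):

```python
import torch
import torch.nn as nn

logits = torch.randn(4, 5)                        # 4 samples, 5 classes/labels

# Multiclass: exactly one correct class per sample -> softmax + cross entropy.
targets_mc = torch.tensor([2, 0, 4, 1])
loss_mc = nn.CrossEntropyLoss()(logits, targets_mc)

# Multilabel: each label is an independent yes/no -> one sigmoid per label.
targets_ml = torch.tensor([[1., 0., 1., 0., 0.],  # e.g. "vehicle" and "car" both on
                           [0., 1., 0., 0., 0.],
                           [1., 1., 0., 1., 0.],
                           [0., 0., 0., 0., 1.]])
loss_ml = nn.BCEWithLogitsLoss()(logits, targets_ml)
print(loss_mc, loss_ml)
```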
Infrastructure
- GPU could be useful for Part 2 and 3.
- It mostly amounts to installing PyTorch with the right GPU lib. Eg. Apple has mps/Metal.
- CPU is fine for most parts.
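A device-selection snippet covering the common cases (assumes a reasonably recent PyTorch build with the MPS backend):

```python
import torch

# Pick whatever accelerator is available; CPU is the fallback and is fine for Part 1.
if torch.cuda.is_available():
    device = torch.device("cuda")
elif torch.backends.mps.is_available():    # Apple Silicon (Metal / MPS)
    device = torch.device("mps")
else:
    device = torch.device("cpu")
print(device)
```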