# Lab 1
Today's lab consists of practice questions to review the topics presented thus far in class. We will be focusing on:
1. Neural network terminology and architecture
2. Python
3. Forward and backward propagation

### Question 1
Let's review the terminology introduced so far by thinking about how to design a model for each of the following scenarios. While each case has more than one correct answer, the goal is to develop intuition that saves time and computational resources during parameter tuning and training. We'll also briefly touch on some advanced topics to provide a foundation for later in the course. Remember that you do not need to use a deep neural network in every case.

*Case 1:* The input is the MNIST handwritten digits dataset (features are 28x28 pixel intensities and labels are digits 0-9). We want to predict which digit the image represents and there are only 10 images per category ($N=100$).

    - Random forest or k-nearest neighbors, because of the small sample size and relatively easy prediction task.

*Case 2:* The identical setup but this time there are thousands of images per category.

    - Either of the above methods is fine, as is a simple neural network. For the network, the output activation should be softmax and the loss function should be categorical cross-entropy.
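
As a rough illustration of the softmax and categorical cross-entropy pairing mentioned above, here is a minimal NumPy sketch. The scores in `logits` are made-up values rather than the output of a trained model, and the helper names `softmax` and `categorical_cross_entropy` are chosen for this example only.

```python
import numpy as np

def softmax(logits):
    """Convert a vector of raw scores into probabilities that sum to 1."""
    shifted = logits - np.max(logits)   # subtract the max for numerical stability
    exp_scores = np.exp(shifted)
    return exp_scores / np.sum(exp_scores)

def categorical_cross_entropy(probs, true_class):
    """Negative log-probability assigned to the correct class."""
    return -np.log(probs[true_class])

# Hypothetical raw scores for the 10 digit classes (illustrative values only)
logits = np.array([0.1, 0.2, 3.0, 0.0, -1.0, 0.5, 0.3, 0.0, 0.1, -0.5])
probs = softmax(logits)
cce_loss = categorical_cross_entropy(probs, true_class=2)

print('Predicted digit:', np.argmax(probs))
print('Loss if the true digit is 2:', cce_loss)
```

Because softmax forces the ten probabilities to sum to 1, this pairing suits problems where each image contains exactly one digit.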

*Case 3:* The identical setup as case 2 but this time images may contain multiple digits or no digits at all.

    - The last-layer activation should be sigmoid and the loss should be BCE (binary cross-entropy), so that the presence of each digit is predicted independently.
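
To make the multi-label setup concrete, the following sketch applies a sigmoid to each of ten output scores and averages the binary cross-entropy over them. The values in `scores` and `targets` are invented for illustration, and the 0.5 decision threshold is an assumption rather than something fixed by the problem.

```python
import numpy as np

def sigmoid(z):
    """Squash each score independently into the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def binary_cross_entropy(probs, targets):
    """Average binary cross-entropy over the independent output units."""
    return -np.mean(targets * np.log(probs) + (1 - targets) * np.log(1 - probs))

# Hypothetical output scores for digits 0-9 (illustrative values only)
scores = np.array([-4.0, 3.0, -3.5, -5.0, -4.2, -3.8, -4.5, 2.5, -4.1, -3.9])
targets = np.array([0, 1, 0, 0, 0, 0, 0, 1, 0, 0])  # image contains a 1 and a 7

digit_probs = sigmoid(scores)
present = np.where(digit_probs > 0.5)[0]   # assumed decision threshold of 0.5
bce_loss = binary_cross_entropy(digit_probs, targets)

print('Digits predicted present:', present)
print('BCE loss:', bce_loss)
```

Unlike softmax, the sigmoid outputs are not constrained to sum to 1, so the model can report several digits, or none, for one image.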

*Case 4:* The covariates are BMI measurements and reported smoking status, and the labels are binary, denoting cardiovascular disease. Our sample consists of 70 individuals and we want to predict an individual's health status based on their BMI and smoking status. We are interested in the effect of BMI on cardiovascular disease.

    - Logistic regression.

*Case 5:* The input consists of thousands of images of different animals and we want to classify which animal the image contains.

    - CNN with softmax and CCE (categorical cross-entropy) or sigmoid and BCE (binary cross-entropy).

*Case 6:* The input consists of thousands of English sentences and we want to predict the next word in the sentences.

    - RNN

*Case 7:* The input consists of biomarker status for thousands of loci across thousands of individuals (e.g., Ancestry.com). There are no associated labels and we wish to learn about population substructure.

    - PCA, VAE (variational autoencoders), etc.
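
As a small sketch of the PCA option, the example below projects simulated data onto its first two principal components using the singular value decomposition. The data are randomly generated stand-ins for real biomarker measurements, and the two simulated subgroups, their sizes, and the shift between them are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated stand-in data: 200 individuals x 50 loci, drawn from two subgroups
group_a = rng.normal(loc=0.0, scale=1.0, size=(100, 50))
group_b = rng.normal(loc=1.0, scale=1.0, size=(100, 50))
genotypes = np.vstack([group_a, group_b])

# PCA via SVD of the column-centered data matrix
centered = genotypes - genotypes.mean(axis=0)
U, S, Vt = np.linalg.svd(centered, full_matrices=False)

pcs = centered @ Vt[:2].T                 # coordinates on the first two components
explained = S**2 / np.sum(S**2)           # fraction of variance per component

print('Projected shape:', pcs.shape)
print('Variance explained by PC1 and PC2:', explained[:2])
```

Plotting the two columns of `pcs` against each other would be one way to look for clusters that correspond to subpopulations.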

### Question 2

Draw the architecture of a neural network satisfying the following conditions:

1. X is a univariate covariate. We will consider the case when X=5.
2. There are two hidden layers. The first consists of two nodes, each with a bias term taking values (-1 and 1, respectively). The weight going to the first node takes value 0.5 and the weight going to the second node takes value -0.5.
3. The nodes in hidden layer 1 each use a linear activation function.
4. Hidden layer 2 consists of a single node with no bias term and the ReLU activation function. The weight from node 1 in hidden layer 1 is 0.3 and the weight from node 2 in hidden layer 1 is 0.7.
5. Hidden layer 2 outputs to a single output node. The bias term for the output node is 0.5 and the weight from hidden layer 2 is 2.
6. The loss function to be optimized is squared loss.

![network](https://drive.google.com/uc?id=12tva0pDMq4gCM_GABnbsmyh-TOL9lnU5)
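
The conditions above can also be written as equations. The symbols $h^{(1)}_1$, $h^{(1)}_2$, $h^{(2)}$, $\hat{y}$, and $L$ are notation introduced here for convenience and are not part of the original problem statement.

$$
\begin{aligned}
h^{(1)}_1 &= 0.5\,x - 1, \qquad h^{(1)}_2 = -0.5\,x + 1, \\
h^{(2)} &= \mathrm{ReLU}\left(0.3\,h^{(1)}_1 + 0.7\,h^{(1)}_2\right), \\
\hat{y} &= 2\,h^{(2)} + 0.5, \\
L &= (y - \hat{y})^2 .
\end{aligned}
$$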





### Question 3
Implement a single forward pass of the network described in Question 2. You do not need to implement the network in Keras and should instead use numpy operations (either scalar or matrix). Start by defining the weights and input matrices.

In [None]:
# Import necessary packages
import numpy as np

In [None]:
x = np.array([1, 5])                        # add bias/intercept as first entry (treated as a 1x2 row vector)
w_hidden1 = np.matrix([[-1, 1], [.5, -.5]]) # 2x2 matrix of first-layer biases and weights
w_hidden2 = np.matrix([[.3], [.7]])         # 2x1 matrix of second-layer weights
w_out = 2                                   # 1x1 scalar of third-layer weights
b_out = 0.5                                 # 1x1 scalar of third-layer bias

In [None]:
print(w_hidden2)

[[0.3]
 [0.7]]


Now perform the forward pass.

In [None]:
hidden1 = np.matmul(x,w_hidden1)         # perform matrix multiplication to get hidden layer 1
hidden2 = np.matmul(hidden1,w_hidden2)   # perform matrix multiplication to get hidden layer 2
hidden2_clamped = np.maximum(hidden2, 0) # relu
y_hat = hidden2_clamped*w_out + b_out    # perform third multiplication to get output layer

In [None]:
print(y_hat)

[[0.5]]


And let's print the values.

In [None]:
print('The values for the hidden layer 1 are:', hidden1)
print('The values for the hidden layer 2 are:', hidden2)
print('The post-relu values for the hidden layer 2 are:', hidden2_clamped)
print('The value for the output layer is:', y_hat)

The values for the hidden layer 1 are: [[ 1.5 -1.5]]
The values for the hidden layer 2 are: [[-0.6]]
The post-relu values for the hidden layer 2 are: [[0.]]
The value for the output layer is: [[0.5]]


Calculate the loss for the training example given a label of Y = 0.25.

In [None]:
y_i = 0.25 # label for this training example, as given in the problem
loss_i = (y_i-y_hat)**2
print('The loss is:',loss_i)

The loss is: [[0.0625]]


Implement a single backward pass of the network. Again use numpy. Start by defining the individual gradient terms.

In [None]:
# gradient for loss
dl_dyhat = -2*(y_i-y_hat) # gradient of loss wrt prediction y_hat (1x1)

# gradients for output layer
dyhat_dhidden2_clamped = w_out # gradient of y_hat wrt hidden output (1x1)
dyhat_dw_out = hidden2_clamped # gradient of y_hat wrt output layer weight (1x1)
dyhat_db_out = 1 # gradient of y_hat wrt output layer bias (1x1)

# gradient (gate) for relu
dhidden2_clamped_dhidden2 = (hidden2>0)*1 # gradient for relu

# gradients for second hidden layer
dhidden2_dw_2 = hidden1 # gradient of second hidden layer wrt second hidden layer weights
dhidden2_dhidden1 = w_hidden2 # gradient of second hidden layer wrt first hidden layer output

# gradients for first hidden layer
dhidden1_dw_1 = x # gradient of first hidden layer wrt first hidden layer weights

In [None]:
print(dl_dyhat)
print(dyhat_dhidden2_clamped)
print(dyhat_dw_out)
print(dyhat_db_out)
print(dhidden2_clamped_dhidden2)
print(dhidden2_dw_2)
print(dhidden2_dhidden1)

[[0.5]]
2
[[0.]]
1
[[0]]
[[ 1.5 -1.5]]
[[0.3]
 [0.7]]


In [None]:
dl_dw_out = dl_dyhat*dyhat_dw_out # gradient of loss wrt output weights
dl_db_out = dl_dyhat*dyhat_db_out # gradient of loss wrt output bias
dl_dw_2 = dl_dyhat*dyhat_dhidden2_clamped*dhidden2_clamped_dhidden2*dhidden2_dw_2                       # gradient of loss wrt second hidden layer weights
dl_dw_11 = dl_dyhat*dyhat_dhidden2_clamped*dhidden2_clamped_dhidden2*dhidden2_dhidden1[0]*dhidden1_dw_1 # gradient of loss wrt first hidden layer weights (node 1)
dl_dw_12 = dl_dyhat*dyhat_dhidden2_clamped*dhidden2_clamped_dhidden2*dhidden2_dhidden1[1]*dhidden1_dw_1 # gradient of loss wrt first hidden layer weights (node 2)

In [None]:
print(dl_dw_out)
print(dl_db_out)
print(dl_dw_2)
print(dl_dw_11)
print(dl_dw_12)

[[0.]]
[[0.5]]
[[0. 0.]]
[[0. 0.]]
[[0. 0.]]
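
One way to sanity-check hand-derived gradients is to compare them with central finite differences. The sketch below is an independent check rather than part of the lab's solution: it re-implements the forward pass as a scalar function, and the names `squared_loss`, `param_names`, and the step size `eps` are assumptions made for this example.

```python
import numpy as np

def squared_loss(params, x_val=5.0, y_val=0.25):
    """Forward pass of the Question 2 network followed by squared loss."""
    b11, b12, w11, w12, w21, w22, w_o, b_o = params
    h1_a = w11 * x_val + b11                  # hidden layer 1, node 1 (linear)
    h1_b = w12 * x_val + b12                  # hidden layer 1, node 2 (linear)
    h2 = max(w21 * h1_a + w22 * h1_b, 0.0)    # hidden layer 2 (ReLU, no bias)
    pred = w_o * h2 + b_o                     # output node
    return (y_val - pred) ** 2

param_names = ['b11', 'b12', 'w11', 'w12', 'w21', 'w22', 'w_out', 'b_out']
param_values = np.array([-1.0, 1.0, 0.5, -0.5, 0.3, 0.7, 2.0, 0.5])

eps = 1e-6   # small perturbation for the central difference
numeric_grads = {}
for i, name in enumerate(param_names):
    up, down = param_values.copy(), param_values.copy()
    up[i] += eps
    down[i] -= eps
    numeric_grads[name] = (squared_loss(up) - squared_loss(down)) / (2 * eps)

for name, grad in numeric_grads.items():
    print(name, round(grad, 6))
```

Under these values, only the output bias should have a nonzero gradient (about 0.5, which is `-2*(y_i-y_hat)` evaluated at `y_i = 0.25` and `y_hat = 0.5`). Every other gradient is zero because the ReLU is inactive and the post-ReLU value multiplying `w_out` is 0.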



What is the purpose of the partial derivative w.r.t. the weight/parameter?
    
    - The partial derivative w.r.t. the parameter/weights is used to update the parameter values in the training loop via gradient descent.
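
As a minimal sketch of such an update, the example below takes one gradient descent step on the output bias of the network from Question 2. The learning rate of 0.1 is a hypothetical choice, and the helper `loss_given_bias` is introduced only for this illustration.

```python
# Values taken from the forward pass of the Question 2 network
y_label = 0.25        # training label
w_out_val = 2.0       # output weight
h2_relu = 0.0         # post-ReLU value of hidden layer 2
b_out_before = 0.5    # output bias before the update

def loss_given_bias(b):
    """Squared loss as a function of the output bias, all else held fixed."""
    return (y_label - (w_out_val * h2_relu + b)) ** 2

# dL/db_out = dL/dy_hat * dy_hat/db_out = -2 * (y - y_hat) * 1
grad_b_out = -2 * (y_label - (w_out_val * h2_relu + b_out_before))

learning_rate = 0.1   # hypothetical step size
b_out_after = b_out_before - learning_rate * grad_b_out

print('Bias before and after:', b_out_before, b_out_after)
print('Loss before and after:', loss_given_bias(b_out_before), loss_given_bias(b_out_after))
```

The step moves the bias in the direction that lowers the loss. In this particular example the other parameters would not change, since their gradients are all zero while the ReLU is inactive.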