Week 4 Programming Assignment (1) - Building your Deep Neural Network: Step by Step (Part 3)

Summary: Week 4 Programming Assignment (1) - Building your Deep Neural Network: Step by Step (Part 3)

6 - Backward propagation module


Just like with forward propagation, you will implement helper functions for backpropagation. Remember that back propagation is used to calculate the gradient of the loss function with respect to the parameters.


Reminder:


<caption><center> Figure 3: Forward and backward propagation for LINEAR->RELU->LINEAR->SIGMOID. The purple blocks represent the forward propagation, and the red blocks represent the backward propagation. </center></caption>

Now, similar to forward propagation, you are going to build the backward propagation in three steps:

  • LINEAR backward
  • LINEAR -> ACTIVATION backward where ACTIVATION computes the derivative of either the ReLU or sigmoid activation
  • [LINEAR -> RELU] $\times$ (L-1) -> LINEAR -> SIGMOID backward (whole model)


6.1 - Linear backward


For layer $l$, the linear part is: $Z^{[l]} = W^{[l]} A^{[l-1]} + b^{[l]}$ (followed by an activation).

Suppose you have already calculated the derivative $dZ^{[l]} = \frac{\partial \mathcal{L} }{\partial Z^{[l]}}$. You want to get $(dW^{[l]}, db^{[l]}, dA^{[l-1]})$.


<caption><center> Figure 4 </center></caption>

The three outputs $(dW^{[l]}, db^{[l]}, dA^{[l-1]})$ are computed using the input $dZ^{[l]}$. Here are the formulas you need:

$$ dW^{[l]} = \frac{\partial \mathcal{L} }{\partial W^{[l]}} = \frac{1}{m} dZ^{[l]} A^{[l-1] T} \tag{8}$$

$$ db^{[l]} = \frac{\partial \mathcal{L} }{\partial b^{[l]}} = \frac{1}{m} \sum_{i = 1}^{m} dZ^{l}\tag{9}$$

$$ dA^{[l-1]} = \frac{\partial \mathcal{L} }{\partial A^{[l-1]}} = W^{[l] T} dZ^{[l]} \tag{10}$$
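
As a quick dimension check (an added note, using the layer sizes $n^{[l]}$ from earlier in the assignment): since $W^{[l]}$ has shape $(n^{[l]}, n^{[l-1]})$ and $A^{[l-1]}$ has shape $(n^{[l-1]}, m)$, each gradient has the same shape as the quantity it corresponds to, which is exactly what the asserts in the code below verify:

$$ dW^{[l]} \in \mathbb{R}^{n^{[l]} \times n^{[l-1]}}, \qquad db^{[l]} \in \mathbb{R}^{n^{[l]} \times 1}, \qquad dA^{[l-1]} \in \mathbb{R}^{n^{[l-1]} \times m} $$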


Exercise: Use the 3 formulas above to implement linear_backward().

# GRADED FUNCTION: linear_backward
def linear_backward(dZ, cache):
    """
    Implement the linear portion of backward propagation for a single layer (layer l)
    Arguments:
    dZ -- Gradient of the cost with respect to the linear output (of current layer l)
    cache -- tuple of values (A_prev, W, b) coming from the forward propagation in the current layer
    Returns:
    dA_prev -- Gradient of the cost with respect to the activation (of the previous layer l-1), same shape as A_prev
    dW -- Gradient of the cost with respect to W (current layer l), same shape as W
    db -- Gradient of the cost with respect to b (current layer l), same shape as b
    """
    A_prev, W, b = cache
    m = A_prev.shape[1]
    ### START CODE HERE ### (≈ 3 lines of code)
    dW = (1 / m) * np.dot(dZ, A_prev.T)
    db = (1 / m) * np.sum(dZ, axis=1, keepdims=True)
    dA_prev = np.dot(W.T, dZ)
    ### END CODE HERE ###
    assert (dA_prev.shape == A_prev.shape)
    assert (dW.shape == W.shape)
    assert (db.shape == b.shape)
    return dA_prev, dW, db

# Set up some test inputs
dZ, linear_cache = linear_backward_test_case()
dA_prev, dW, db = linear_backward(dZ, linear_cache)
print ("dA_prev = "+ str(dA_prev))
print ("dW = " + str(dW))
print ("db = " + str(db))

dA_prev = [[ 0.51822968 -0.19517421]
 [-0.40506361  0.15255393]
 [ 2.37496825 -0.89445391]]
dW = [[-0.10076895  1.40685096  1.64992505]]
db = [[ 0.50629448]]


Expected Output:

<table style="width:90%">

<tr>

<td> dA_prev </td>

<td > [[ 0.51822968 -0.19517421]

[-0.40506361  0.15255393]

[ 2.37496825 -0.89445391]] </td>

</tr>

<tr>
    <td> dW </td>
    <td> [[-0.10076895  1.40685096  1.64992505]] </td>
</tr>
<tr>
    <td> db </td>
    <td> [[ 0.50629448]] </td>
</tr>


</table>


6.2 - Linear-Activation backward


Next, you will create a function, linear_activation_backward, that merges the linear_backward helper with the backward step for the activation.

To help you implement linear_activation_backward, we provided two backward functions:

  • sigmoid_backward: Implements the backward propagation for the SIGMOID unit. You can call it as follows:

dZ = sigmoid_backward(dA, activation_cache)


  • relu_backward: Implements the backward propagation for the RELU unit. You can call it as follows:

dZ = relu_backward(dA, activation_cache)


If $g(.)$ is the activation function, sigmoid_backward and relu_backward compute

$$dZ^{[l]} = dA^{[l]} * g'(Z^{[l]}) \tag{11}$$
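
You do not need to implement these two helpers yourself; they are provided with the assignment's utility file. Purely as a hedged sketch of what they compute (assuming, as in this assignment, that activation_cache stores the pre-activation Z of the current layer), they behave roughly like this:

# Illustrative sketch only -- the assignment already provides these helpers.
# Assumes activation_cache holds Z for the current layer.
import numpy as np

def sigmoid_backward(dA, activation_cache):
    Z = activation_cache
    s = 1 / (1 + np.exp(-Z))      # sigmoid(Z)
    return dA * s * (1 - s)       # dZ = dA * g'(Z), since g'(Z) = s * (1 - s)

def relu_backward(dA, activation_cache):
    Z = activation_cache
    dZ = np.array(dA, copy=True)  # g'(Z) = 1 where Z > 0
    dZ[Z <= 0] = 0                # g'(Z) = 0 where Z <= 0
    return dZ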


Exercise: Implement the backpropagation for the LINEAR->ACTIVATION layer.

# GRADED FUNCTION: linear_activation_backward
def linear_activation_backward(dA, cache, activation):
    """
    Implement the backward propagation for the LINEAR->ACTIVATION layer.
    Arguments:
    dA -- post-activation gradient for current layer l 
    cache -- tuple of values (linear_cache, activation_cache) we store for computing backward propagation efficiently
    activation -- the activation to be used in this layer, stored as a text string: "sigmoid" or "relu"
    Returns:
    dA_prev -- Gradient of the cost with respect to the activation (of the previous layer l-1), same shape as A_prev
    dW -- Gradient of the cost with respect to W (current layer l), same shape as W
    db -- Gradient of the cost with respect to b (current layer l), same shape as b
    """
    linear_cache, activation_cache = cache
    if activation == "relu":
        ### START CODE HERE ### (≈ 2 lines of code)
        dZ = relu_backward(dA, activation_cache)
        dA_prev, dW, db = linear_backward(dZ, linear_cache)
        ### END CODE HERE ###
    elif activation == "sigmoid":
        ### START CODE HERE ### (≈ 2 lines of code)
        dZ = sigmoid_backward(dA, activation_cache)
        dA_prev, dW, db = linear_backward(dZ, linear_cache)
        ### END CODE HERE ###
    return dA_prev, dW, db

AL, linear_activation_cache = linear_activation_backward_test_case()
dA_prev, dW, db = linear_activation_backward(AL, linear_activation_cache, activation = "sigmoid")
print ("sigmoid:")
print ("dA_prev = "+ str(dA_prev))
print ("dW = " + str(dW))
print ("db = " + str(db) + "\n")
dA_prev, dW, db = linear_activation_backward(AL, linear_activation_cache, activation = "relu")
print ("relu:")
print ("dA_prev = "+ str(dA_prev))
print ("dW = " + str(dW))
print ("db = " + str(db))

sigmoid:
dA_prev = [[ 0.11017994  0.01105339]
 [ 0.09466817  0.00949723]
 [-0.05743092 -0.00576154]]
dW = [[ 0.10266786  0.09778551 -0.01968084]]
db = [[-0.05729622]]
relu:
dA_prev = [[ 0.44090989  0.        ]
 [ 0.37883606  0.        ]
 [-0.2298228   0.        ]]
dW = [[ 0.44513824  0.37371418 -0.10478989]]
db = [[-0.20837892]]


Expected output with sigmoid:

<table style="width:100%">

<tr>

<td > dA_prev </td>

<td >[[ 0.11017994  0.01105339]

[ 0.09466817  0.00949723]

[-0.05743092 -0.00576154]] </td>

</tr>

<tr>
<td > dW </td> 
       <td > [[ 0.10266786  0.09778551 -0.01968084]] </td>


</tr>

<tr>
<td > db </td> 
       <td > [[-0.05729622]] </td>


</tr>

</table>


Expected output with relu:

<table style="width:100%">

<tr>

<td > dA_prev </td>

<td > [[ 0.44090989  0.        ]

[ 0.37883606  0.        ]

[-0.2298228   0.        ]] </td>

</tr>

<tr>
<td > dW </td> 
       <td > [[ 0.44513824  0.37371418 -0.10478989]] </td>


</tr>

<tr>
<td > db </td> 
       <td > [[-0.20837892]] </td>


</tr>

</table>


6.3 - L-Model Backward


Now you will implement the backward function for the whole network. Recall that when you implemented the L_model_forward function, at each iteration you stored a cache which contains (X, W, b, and Z). In the backpropagation module, you will use those variables to compute the gradients. Therefore, in the L_model_backward function, you will iterate through all the hidden layers backward, starting from layer $L$. On each step, you will use the cached values for layer $l$ to backpropagate through layer $l$. Figure 5 below shows the backward pass.


<caption><center> Figure 5: Backward pass </center></caption>

**Initializing backpropagation**:

To backpropagate through this network, we know that the output is $A^{[L]} = \sigma(Z^{[L]})$. Your code thus needs to compute dAL $= \frac{\partial \mathcal{L}}{\partial A^{[L]}}$.

To do so, use this formula (derived using calculus, which you don't need in-depth knowledge of):

dAL = - (np.divide(Y, AL) - np.divide(1 - Y, 1 - AL)) # derivative of cost with respect to AL
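
For reference, this line follows directly from the cross-entropy loss used throughout this course (you are not asked to derive it):

$$\mathcal{L}(a^{[L]}, y) = -\big(y \log a^{[L]} + (1-y) \log(1 - a^{[L]})\big) \quad\Rightarrow\quad \frac{\partial \mathcal{L}}{\partial a^{[L]}} = -\left(\frac{y}{a^{[L]}} - \frac{1-y}{1-a^{[L]}}\right)$$

which is exactly what the np.divide expression above computes element-wise over AL and Y.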


You can then use this post-activation gradient dAL to keep going backward. As seen in Figure 5, you can now feed dAL into the LINEAR->SIGMOID backward function you implemented (which will use the cached values stored by the L_model_forward function). After that, you will have to use a for loop to iterate through all the other layers using the LINEAR->RELU backward function. You should store each dA, dW, and db in the grads dictionary. To do so, use this formula:

$$grads["dW" + str(l)] = dW^{[l]}\tag{15} $$

For example, for $l=3$ this would store $dW^{[3]}$ in grads["dW3"].


Exercise: Implement backpropagation for the [LINEAR->RELU] $\times$ (L-1) -> LINEAR -> SIGMOID model.

# GRADED FUNCTION: L_model_backward
def L_model_backward(AL, Y, caches):
    """
    Implement the backward propagation for the [LINEAR->RELU] * (L-1) -> LINEAR -> SIGMOID group
    Arguments:
    AL -- probability vector, output of the forward propagation (L_model_forward())
    Y -- true "label" vector (containing 0 if non-cat, 1 if cat)
    caches -- list of caches containing:
                every cache of linear_activation_forward() with "relu" (it's caches[l], for l in range(L-1) i.e l = 0...L-2)
                the cache of linear_activation_forward() with "sigmoid" (it's caches[L-1])
    Returns:
    grads -- A dictionary with the gradients
             grads["dA" + str(l)] = ... 
             grads["dW" + str(l)] = ...
             grads["db" + str(l)] = ... 
    """
    grads = {}
    L = len(caches) # the number of layers
    m = AL.shape[1]
    Y = Y.reshape(AL.shape) # after this line, Y is the same shape as AL
    # Initializing the backpropagation
    ### START CODE HERE ### (1 line of code)
    dAL = - (np.divide(Y, AL) - np.divide(1 - Y, 1 - AL))
    ### END CODE HERE ###
    # Lth layer (SIGMOID -> LINEAR) gradients. Inputs: "AL, Y, caches". Outputs: "grads["dAL"], grads["dWL"], grads["dbL"]"
    ### START CODE HERE ### (approx. 2 lines)
    current_cache = caches[L-1]
    grads["dA" + str(L)], grads["dW" + str(L)], grads["db" + str(L)] = linear_activation_backward(dAL,current_cache,activation="sigmoid")
    ### END CODE HERE ###
    for l in reversed(range(L-1)):
        # lth layer: (RELU -> LINEAR) gradients.
        # Inputs: "grads["dA" + str(l + 2)], caches". Outputs: "grads["dA" + str(l + 1)] , grads["dW" + str(l + 1)] , grads["db" + str(l + 1)] 
        ### START CODE HERE ### (approx. 5 lines)
        current_cache = caches[l]
        dA_prev_temp, dW_temp, db_temp = linear_activation_backward(grads["dA" + str(l + 2)], current_cache, activation="relu")
        grads["dA" + str(l + 1)] = dA_prev_temp
        grads["dW" + str(l + 1)] = dW_temp
        grads["db" + str(l + 1)] = db_temp
        ### END CODE HERE ###
    return grads

AL, Y_assess, caches = L_model_backward_test_case()
grads = L_model_backward(AL, Y_assess, caches)
print ("dW1 = "+ str(grads["dW1"]))
print ("db1 = "+ str(grads["db1"]))
print ("dA1 = "+ str(grads["dA1"]))

dW1 = [[ 0.41010002  0.07807203  0.13798444  0.10502167]
 [ 0.          0.          0.          0.        ]
 [ 0.05283652  0.01005865  0.01777766  0.0135308 ]]
db1 = [[-0.22007063]
 [ 0.        ]
 [-0.02835349]]
dA1 = [[ 0.          0.52257901]
 [ 0.         -0.3269206 ]
 [ 0.         -0.32070404]
 [ 0.         -0.74079187]]


Expected Output

<table style="width:60%">

<tr>

<td > dW1 </td>

<td > [[ 0.41010002  0.07807203  0.13798444  0.10502167]

[ 0.          0.          0.          0.        ]

[ 0.05283652  0.01005865  0.01777766  0.0135308 ]] </td>

</tr>

<tr>
<td > db1 </td> 
       <td > [[-0.22007063]


[ 0.        ]

[-0.02835349]] </td>

</tr>

<tr>

<td > dA1 </td>

<td > [[ 0.          0.52257901]

[ 0.         -0.3269206 ]

[ 0.         -0.32070404]

[ 0.         -0.74079187]] </td>

</tr>

</table>


6.4 - Update Parameters


In this section you will update the parameters of the model, using gradient descent:

$$ W^{[l]} = W^{[l]} - \alpha \text{ } dW^{[l]} \tag{16}$$

$$ b^{[l]} = b^{[l]} - \alpha \text{ } db^{[l]} \tag{17}$$

where $\alpha$ is the learning rate. After computing the updated parameters, store them in the parameters dictionary.


Exercise: Implement update_parameters() to update your parameters using gradient descent.


Instructions:

Update parameters using gradient descent on every $W^{[l]}$ and $b^{[l]}$ for $l = 1, 2, ..., L$.

# GRADED FUNCTION: update_parameters
def update_parameters(parameters, grads, learning_rate):
    """
    Update parameters using gradient descent
    Arguments:
    parameters -- python dictionary containing your parameters 
    grads -- python dictionary containing your gradients, output of L_model_backward
    learning_rate -- learning rate used in the gradient descent update rule
    Returns:
    parameters -- python dictionary containing your updated parameters 
                  parameters["W" + str(l)] = ... 
                  parameters["b" + str(l)] = ...
    """
    L = len(parameters) // 2 # number of layers in the neural network
    # Update rule for each parameter. Use a for loop.
    ### START CODE HERE ### (≈ 3 lines of code)
    for l in range(L):
        parameters["W" + str(l+1)] = parameters["W" + str(l+1)]-grads["dW"+str(l+1)]*learning_rate
        parameters["b" + str(l+1)] = parameters["b" + str(l+1)]-grads["db"+str(l+1)]*learning_rate
    ### END CODE HERE ###
    return parameters

parameters, grads = update_parameters_test_case()
parameters = update_parameters(parameters, grads, 0.1)
print ("W1 = "+ str(parameters["W1"]))
print ("b1 = "+ str(parameters["b1"]))
print ("W2 = "+ str(parameters["W2"]))
print ("b2 = "+ str(parameters["b2"]))

W1 = [[-0.59562069 -0.09991781 -2.14584584  1.82662008]
 [-1.76569676 -0.80627147  0.51115557 -1.18258802]
 [-1.0535704  -0.86128581  0.68284052  2.20374577]]
b1 = [[-0.04659241]
 [-1.28888275]
 [ 0.53405496]]
W2 = [[-0.55569196  0.0354055   1.32964895]]
b2 = [[-0.84610769]]


Expected Output:

<table style="width:100%">

<tr>

<td > W1 </td>

<td > [[-0.59562069 -0.09991781 -2.14584584  1.82662008]

[-1.76569676 -0.80627147  0.51115557 -1.18258802]

[-1.0535704  -0.86128581  0.68284052  2.20374577]] </td>

</tr>

<tr>
<td > b1 </td> 
       <td > [[-0.04659241]


[-1.28888275]

[ 0.53405496]] </td>

</tr>

<tr>

<td > W2 </td>

<td > [[-0.55569196  0.0354055   1.32964895]]</td>

</tr>

<tr>
<td > b2 </td> 
       <td > [[-0.84610769]] </td>


</tr>

</table>


7 - Conclusion


Congrats on implementing all the functions required for building a deep neural network!

We know it was a long assignment, but going forward it will only get better. The next part of the assignment is easier.

In the next assignment you will put all these together to build two models:

  • A two-layer neural network
  • An L-layer neural network

You will in fact use these models to classify cat vs non-cat images!
