
Commit 5462733

Merge pull request #1840 from d2l-ai/master
Release v0.17.0
2 parents (b9483ae + 7684ce7) · commit 5462733

File tree

69 files changed: +18,898 −673 lines


README.md

Lines changed: 3 additions & 3 deletions
@@ -34,6 +34,8 @@ Our goal is to offer a resource that could
 
 ## Cool Papers Using D2L
 
+1. [**Descending through a Crowded Valley--Benchmarking Deep Learning Optimizers**](https://arxiv.org/pdf/2007.01547.pdf). R. Schmidt, F. Schneider, P. Hennig. *International Conference on Machine Learning, 2021*
+
 1. [**Universal Average-Case Optimality of Polyak Momentum**](https://arxiv.org/pdf/2002.04664.pdf). D. Scieur, F. Pedregosan. *International Conference on Machine Learning, 2020*
 
 1. [**2D Digital Image Correlation and Region-Based Convolutional Neural Network in Monitoring and Evaluation of Surface Cracks in Concrete Structural Elements**](https://www.mdpi.com/1996-1944/13/16/3527/pdf). M. Słoński, M. Tekieli. *Materials, 2020*

@@ -42,11 +44,9 @@ Our goal is to offer a resource that could
 
 1. [**Detecting Human Driver Inattentive and Aggressive Driving Behavior Using Deep Learning: Recent Advances, Requirements and Open Challenges**](https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=9107077). M. Alkinani, W. Khan, Q. Arshad. *IEEE Access, 2020*
 
-1. [**Diagnosing Parkinson by Using Deep Autoencoder Neural Network**](https://link.springer.com/chapter/10.1007/978-981-15-6325-6_5). U. Kose, O. Deperlioglu, J. Alzubi, B. Patrut. *Deep Learning for Medical Decision Support Systems, 2020*
-
 <details><summary>more</summary>
 
-1. [**Descending through a Crowded Valley--Benchmarking Deep Learning Optimizers**](https://arxiv.org/pdf/2007.01547.pdf). R. Schmidt, F. Schneider, P. Hennig.
+1. [**Diagnosing Parkinson by Using Deep Autoencoder Neural Network**](https://link.springer.com/chapter/10.1007/978-981-15-6325-6_5). U. Kose, O. Deperlioglu, J. Alzubi, B. Patrut. *Deep Learning for Medical Decision Support Systems, 2020*
 
 1. [**Deep Learning Architectures for Medical Diagnosis**](https://link.springer.com/chapter/10.1007/978-981-15-6325-6_2). U. Kose, O. Deperlioglu, J. Alzubi, B. Patrut. *Deep Learning for Medical Decision Support Systems, 2020*
 

chapter_attention-mechanisms/bahdanau-attention.md

Lines changed: 1 addition & 0 deletions
@@ -431,6 +431,7 @@ d2l.show_heatmaps(attention_weights[:, :, :, :len(engs[-1].split()) + 1],
 * When predicting a token, if not all the input tokens are relevant, the RNN encoder-decoder with Bahdanau attention selectively aggregates different parts of the input sequence. This is achieved by treating the context variable as an output of additive attention pooling.
 * In the RNN encoder-decoder, Bahdanau attention treats the decoder hidden state at the previous time step as the query, and the encoder hidden states at all the time steps as both the keys and values.
 
+
 ## Exercises
 
 1. Replace GRU with LSTM in the experiment.
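For reference, the additive attention pooling described in this summary can be sketched roughly as follows. This is a minimal illustration with batch-first shapes assumed, not the chapter's actual `d2l.AdditiveAttention` implementation:

```python
import torch
from torch import nn

class AdditiveAttentionSketch(nn.Module):
    """Rough sketch of additive (Bahdanau-style) attention pooling."""
    def __init__(self, key_size, query_size, num_hiddens):
        super().__init__()
        self.W_k = nn.Linear(key_size, num_hiddens, bias=False)
        self.W_q = nn.Linear(query_size, num_hiddens, bias=False)
        self.w_v = nn.Linear(num_hiddens, 1, bias=False)

    def forward(self, queries, keys, values):
        # queries: (batch, num_queries, query_size), e.g. the decoder hidden
        # state at the previous time step; keys/values: the encoder hidden
        # states at all time steps, shape (batch, num_steps, size).
        features = torch.tanh(self.W_q(queries).unsqueeze(2) +
                              self.W_k(keys).unsqueeze(1))
        scores = self.w_v(features).squeeze(-1)   # (batch, num_queries, num_steps)
        weights = torch.softmax(scores, dim=-1)   # attention weights
        return torch.bmm(weights, values)         # context variables
```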

chapter_computer-vision/transposed-conv.md

Lines changed: 1 addition & 1 deletion
@@ -308,7 +308,7 @@ Therefore,
 the transposed convolutional layer
 can just exchange the forward propagation function
 and the backpropagation function of the convolutional layer:
-its forward propagation
+its forward propagation
 and backpropagation functions
 multiply their input vector with
 $\mathbf{W}^\top$ and $\mathbf{W}$, respectively.
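The claim in this passage can be checked numerically with the matrix view of convolution used in this section. The sketch below follows a `kernel2matrix`-style construction for a 2x2 kernel over a 3x3 input (details are illustrative and may differ from the chapter's exact cell): convolution multiplies the flattened input by $\mathbf{W}$, and the transposed convolution multiplies by $\mathbf{W}^\top$.

```python
import torch

def kernel2matrix(K):
    # Unroll a 2x2 kernel into the 4x9 matrix W so that convolving a 3x3
    # input equals multiplying its flattened version by W.
    k, W = torch.zeros(5), torch.zeros((4, 9))
    k[:2], k[3:5] = K[0, :], K[1, :]
    W[0, :5], W[1, 1:6], W[2, 3:8], W[3, 4:] = k, k, k, k
    return W

X = torch.arange(9.0).reshape(3, 3)
K = torch.tensor([[1.0, 2.0], [3.0, 4.0]])
W = kernel2matrix(K)

Y = (W @ X.reshape(-1)).reshape(2, 2)    # convolution: multiply by W
Z = (W.T @ Y.reshape(-1)).reshape(3, 3)  # transposed convolution: multiply by W^T
```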

chapter_convolutional-neural-networks/channels.md

Lines changed: 1 addition & 0 deletions
@@ -102,6 +102,7 @@ corr2d_multi_in(X, K)
 ```
 
 ## Multiple Output Channels
+:label:`subsec_multi-output-channels`
 
 Regardless of the number of input channels,
 so far we always ended up with one output channel.
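The multi-output-channel cross-correlation that this newly labeled subsection covers can be sketched as below. This mirrors the section's approach but is not guaranteed to match its exact code; `d2l.corr2d` is assumed to be the 2D cross-correlation defined earlier in the book.

```python
import torch
from d2l import torch as d2l

def corr2d_multi_in(X, K):
    # Sum the per-channel 2D cross-correlations over the input channels.
    return sum(d2l.corr2d(x, k) for x, k in zip(X, K))

def corr2d_multi_in_out(X, K):
    # K has shape (c_o, c_i, k_h, k_w): compute one multi-input-channel
    # result per output-channel kernel and stack them along a new axis.
    return torch.stack([corr2d_multi_in(X, k) for k in K], 0)

X = torch.stack((torch.arange(9.0).reshape(3, 3),
                 torch.arange(1.0, 10.0).reshape(3, 3)), 0)      # 2 input channels
K = torch.stack((torch.ones(2, 2, 2), torch.zeros(2, 2, 2)), 0)  # 2 output channels
corr2d_multi_in_out(X, K).shape  # torch.Size([2, 2, 2])
```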

chapter_deep-learning-computation/model-construction.md

Lines changed: 17 additions & 1 deletion
@@ -206,13 +206,29 @@ Before we implement our own custom block,
 we briefly summarize the basic functionality
 that each block must provide:
 
+:begin_tab:`mxnet, tensorflow`
+
+1. Ingest input data as arguments to its forward propagation function.
+1. Generate an output by having the forward propagation function return a value. Note that the output may have a different shape from the input. For example, the first fully-connected layer in our model above ingests an input of arbitrary dimension but returns an output of dimension 256.
+1. Calculate the gradient of its output with respect to its input, which can be accessed via its backpropagation function. Typically this happens automatically.
+1. Store and provide access to those parameters necessary
+to execute the forward propagation computation.
+1. Initialize model parameters as needed.
+
+:end_tab:
+
+:begin_tab:`pytorch`
+
 1. Ingest input data as arguments to its forward propagation function.
-1. Generate an output by having the forward propagation function return a value. Note that the output may have a different shape from the input. For example, the first fully-connected layer in our model above ingests an input of arbitrary dimension but returns an output of dimension 256.
+1. Generate an output by having the forward propagation function return a value. Note that the output may have a different shape from the input. For example, the first fully-connected layer in our model above ingests an input of dimension 20 but returns an output of dimension 256.
 1. Calculate the gradient of its output with respect to its input, which can be accessed via its backpropagation function. Typically this happens automatically.
 1. Store and provide access to those parameters necessary
 to execute the forward propagation computation.
 1. Initialize model parameters as needed.
 
+:end_tab:
+
+
 In the following snippet,
 we code up a block from scratch
 corresponding to an MLP
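A minimal PyTorch sketch of such a custom block, consistent with the dimensions 20 and 256 mentioned above, is shown below; the chapter's own class may differ in details.

```python
import torch
from torch import nn
from torch.nn import functional as F

class MLP(nn.Module):
    def __init__(self):
        # Initialize model parameters by constructing two fully-connected layers.
        super().__init__()
        self.hidden = nn.Linear(20, 256)  # ingests inputs of dimension 20
        self.out = nn.Linear(256, 10)

    def forward(self, X):
        # The output shape differs from the input shape; gradients of the
        # output with respect to inputs and parameters come from autograd.
        return self.out(F.relu(self.hidden(X)))

net = MLP()
net(torch.rand(2, 20)).shape  # torch.Size([2, 10])
```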

chapter_natural-language-processing-applications/index.md

Lines changed: 16 additions & 12 deletions
@@ -1,30 +1,34 @@
 # Natural Language Processing: Applications
 :label:`chap_nlp_app`
 
-We have seen how to represent text tokens and train their representations in :numref:`chap_nlp_pretrain`.
+We have seen how to represent tokens in text sequences and train their representations in :numref:`chap_nlp_pretrain`.
 Such pretrained text representations can be fed to various models for different downstream natural language processing tasks.
 
-This book does not intend to cover natural language processing applications in a comprehensive manner.
-Our focus is on *how to apply (deep) representation learning of languages to addressing natural language processing problems*.
-Nonetheless, we have already discussed several natural language processing applications without pretraining in earlier chapters,
+In fact,
+earlier chapters have already discussed some natural language processing applications
+*without pretraining*,
 just for explaining deep learning architectures.
 For instance, in :numref:`chap_rnn`,
 we have relied on RNNs to design language models to generate novella-like text.
 In :numref:`chap_modern_rnn` and :numref:`chap_attention`,
-we have also designed models based on RNNs and attention mechanisms
-for machine translation.
+we have also designed models based on RNNs and attention mechanisms for machine translation.
+
+However, this book does not intend to cover all such applications in a comprehensive manner.
+Instead,
+our focus is on *how to apply (deep) representation learning of languages to addressing natural language processing problems*.
 Given pretrained text representations,
-in this chapter, we will consider two more downstream natural language processing tasks:
-sentiment analysis and natural language inference.
-These are popular and representative natural language processing applications:
-the former analyzes single text and the latter analyzes relationships of text pairs.
+this chapter will explore two
+popular and representative
+downstream natural language processing tasks:
+sentiment analysis and natural language inference,
+which analyze single text and relationships of text pairs, respectively.
 
 ![Pretrained text representations can be fed to various deep learning architectures for different downstream natural language processing applications. This chapter focuses on how to design models for different downstream natural language processing applications.](../img/nlp-map-app.svg)
 :label:`fig_nlp-map-app`
 
 As depicted in :numref:`fig_nlp-map-app`,
 this chapter focuses on describing the basic ideas of designing natural language processing models using different types of deep learning architectures, such as MLPs, CNNs, RNNs, and attention.
-Though it is possible to combine any pretrained text representations with any architecture for either downstream natural language processing task in :numref:`fig_nlp-map-app`,
+Though it is possible to combine any pretrained text representations with any architecture for either application in :numref:`fig_nlp-map-app`,
 we select a few representative combinations.
 Specifically, we will explore popular architectures based on RNNs and CNNs for sentiment analysis.
 For natural language inference, we choose attention and MLPs to demonstrate how to analyze text pairs.

@@ -33,7 +37,7 @@ for a wide range of natural language processing applications,
 such as on a sequence level (single text classification and text pair classification)
 and a token level (text tagging and question answering).
 As a concrete empirical case,
-we will fine-tune BERT for natural language processing.
+we will fine-tune BERT for natural language inference.
 
 As we have introduced in :numref:`sec_bert`,
 BERT requires minimal architecture changes

chapter_natural-language-processing-applications/natural-language-inference-and-dataset.md

Lines changed: 6 additions & 6 deletions
@@ -48,7 +48,7 @@ To study this problem, we will begin by investigating a popular natural language
 
 ## The Stanford Natural Language Inference (SNLI) Dataset
 
-Stanford Natural Language Inference (SNLI) Corpus is a collection of over $500,000$ labeled English sentence pairs :cite:`Bowman.Angeli.Potts.ea.2015`.
+Stanford Natural Language Inference (SNLI) Corpus is a collection of over 500000 labeled English sentence pairs :cite:`Bowman.Angeli.Potts.ea.2015`.
 We download and store the extracted SNLI dataset in the path `../data/snli_1.0`.
 
 ```{.python .input}

@@ -110,7 +110,7 @@ def read_snli(data_dir, is_train):
     return premises, hypotheses, labels
 ```
 
-Now let us print the first $3$ pairs of premise and hypothesis, as well as their labels ("0", "1", and "2" correspond to "entailment", "contradiction", and "neutral", respectively ).
+Now let us print the first 3 pairs of premise and hypothesis, as well as their labels ("0", "1", and "2" correspond to "entailment", "contradiction", and "neutral", respectively ).
 
 ```{.python .input}
 #@tab all
@@ -121,8 +121,8 @@ for x0, x1, y in zip(train_data[0][:3], train_data[1][:3], train_data[2][:3]):
     print('label:', y)
 ```
 
-The training set has about $550,000$ pairs,
-and the testing set has about $10,000$ pairs.
+The training set has about 550000 pairs,
+and the testing set has about 10000 pairs.
 The following shows that
 the three labels "entailment", "contradiction", and "neutral" are balanced in
 both the training set and the testing set.
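One hypothetical way to verify the label balance mentioned in this hunk, assuming the `read_snli` function and `data_dir` from the surrounding cells; the chapter's own cell may count the labels differently:

```python
from collections import Counter

# Count how many examples carry each label: 0 ("entailment"),
# 1 ("contradiction"), and 2 ("neutral") in the training and test splits.
train_data = read_snli(data_dir, is_train=True)
test_data = read_snli(data_dir, is_train=False)
for split in [train_data, test_data]:
    print(Counter(split[2]))
```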
@@ -246,7 +246,7 @@ def load_data_snli(batch_size, num_steps=50):
     return train_iter, test_iter, train_set.vocab
 ```
 
-Here we set the batch size to $128$ and sequence length to $50$,
+Here we set the batch size to 128 and sequence length to 50,
 and invoke the `load_data_snli` function to get the data iterators and vocabulary.
 Then we print the vocabulary size.
 

@@ -258,7 +258,7 @@ len(vocab)
 
 Now we print the shape of the first minibatch.
 Contrary to sentiment analysis,
-we have $2$ inputs `X[0]` and `X[1]` representing pairs of premises and hypotheses.
+we have two inputs `X[0]` and `X[1]` representing pairs of premises and hypotheses.
 
 ```{.python .input}
 #@tab all
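A hypothetical usage sketch consistent with the text above (batch size 128, sequence length 50); the shapes in the comments follow from those settings, though the chapter's actual cell may differ:

```python
train_iter, test_iter, vocab = load_data_snli(128, 50)
print(len(vocab))  # vocabulary size

for X, Y in train_iter:
    print(X[0].shape)  # premises:   (128, 50)
    print(X[1].shape)  # hypotheses: (128, 50)
    print(Y.shape)     # labels:     (128,)
    break
```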
