Sowmik Sarker

Literature Review and Background Study for Machine Translation and Sequential Data Classification


Attention Based Neural Networks for Natural Language Translation and Sequential Data Classification

S.S.K.Sarker, M.Banik

Department of Computer Science and Engineering, University of Dhaka, Dhaka, Bangladesh

The PDF version can be found here.

Abstract: Sequential data is a structured pattern of continuous observations that follows a particular order, which poses an inherent challenge in both translation and classification tasks. In machine translation (MT), neural machine translation (NMT) is a recent and effective technique that has led to remarkable improvements over conventional MT techniques. This report provides an overview of MT techniques and looks in detail at attention-based neural networks (NNs).

Index Terms: Neural Networks, Machine Translation, Attention Model, Sequential data classification.

1. Introduction

Communication and information exchange among people is necessary for sharing knowledge, thoughts and opinions. English is used globally for communication, and almost all content on the internet is in English. However, people in different regions use their own regional languages, and the number of people who use the Bengali language is far from small. Bridging this language gap requires effective and accurate computational approaches, and this task can be addressed with machine translation [1].

1.1. Motivation

The aim of machine translation is to generate translations that convey the same meaning as the source sentence and are grammatically correct in the target language. Many approaches have been proposed to achieve increasingly accurate translation, such as rule-based, knowledge-based, corpus-based, hybrid and statistical machine translation, each with its own successes and failures. In recent years, the use of neural networks in machine translation has become a popular and novel direction, known as Neural Machine Translation (NMT).

1.2. Research Objectives

In our research work, we intend to design attention-based neural network architectures that achieve close to human-level performance in translating a low-resource Indic language, and to study the effectiveness of attention-based NMT approaches for both spoken and written language domains. We also aim to capture the temporal and spatial context in the high- and low-level features extracted from sequential time-series data through neural network architectures with attention models.

2. Literature Review and Background Study

2.1. Sequence-to-Sequence Model

Sequence-to-sequence deep learning architectures have gained notable success in tasks such as machine translation, text summarization and image captioning. Sequence-to-sequence models were introduced in two pioneering papers by Sutskever et al. 2014 [2] and Cho et al. 2014 [3].

2.1.1. Long Short-Term Memory [4]

2.1.1.a. Introduction:

LSTM is an efficient, gradient-based method that trains faster than earlier recurrent approaches and solves complex, artificial long-time-lag tasks. It is mainly designed to overcome the problem of vanishing or exploding backpropagated error.

2.1.1.b. Previous Works:

Earlier training methods such as Back-Propagation Through Time (BPTT) and Real-Time Recurrent Learning (RTRL) suffer from error signals that vanish or blow up over long time lags; LSTM does not suffer from these problems.

2.1.1.c. The Model:

A Long Short-Term Memory (LSTM) recurrent neural network processes a variable-length sequence x = (x_1, x_2, ..., x_n) by incrementally adding new content into a single memory slot, with gates controlling the extent to which new content should be memorized, old content should be erased, and current content should be exposed. At time step t, the memory c_t and the hidden state h_t are updated with the following equations:

i_t = σ(W_i · [h_{t−1}, x_t])

f_t = σ(W_f · [h_{t−1}, x_t])

o_t = σ(W_o · [h_{t−1}, x_t])

ĉ_t = tanh(W_c · [h_{t−1}, x_t])

c_t = f_t ⊙ c_{t−1} + i_t ⊙ ĉ_t

h_t = o_t ⊙ tanh(c_t)
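As an illustration, the following minimal NumPy sketch implements a single LSTM update step following the equations above; the weight shapes, the concatenated [h_{t−1}, x_t] input and the omission of bias terms are assumptions made for this example, not details taken from the original paper.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W_i, W_f, W_o, W_c):
    """One LSTM update step. Each W_* has shape (hidden, hidden + input)
    and acts on the concatenation [h_{t-1}, x_t]; biases are omitted."""
    z = np.concatenate([h_prev, x_t])   # [h_{t-1}, x_t]
    i_t = sigmoid(W_i @ z)              # input gate
    f_t = sigmoid(W_f @ z)              # forget gate
    o_t = sigmoid(W_o @ z)              # output gate
    c_hat = np.tanh(W_c @ z)            # candidate memory content
    c_t = f_t * c_prev + i_t * c_hat    # erase old content, memorize new content
    h_t = o_t * np.tanh(c_t)            # expose gated memory as the hidden state
    return h_t, c_t

# Toy usage with random weights
hidden, inp = 4, 3
rng = np.random.default_rng(0)
W = [rng.standard_normal((hidden, hidden + inp)) for _ in range(4)]
h, c = np.zeros(hidden), np.zeros(hidden)
h, c = lstm_step(rng.standard_normal(inp), h, c, *W)
```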

2.1.1.d. Success:

Constant error backpropagation within memory cells gives LSTM the ability to bridge very long time lags. LSTM can handle noise, distributed representations and continuous values, and it works over a broad range of parameters such as learning rate, input gate bias and output gate bias. LSTM is local in both space and time: its update complexity per weight and time step is O(1). A useful property of the LSTM is that it learns to map an input sequence of variable length into a fixed-dimensional vector representation.

2.1.1.e. Limitations:

The efficient truncated-backprop version of the LSTM algorithm will not easily solve problems similar to the strongly delayed XOR problem, where the goal is to compute the XOR of two widely separated inputs. Each memory cell block needs two additional units (an input gate and an output gate), but this does not increase the number of weights by more than a factor of 9. Like all gradient-based approaches, LSTM suffers from a practical inability to precisely count discrete time steps.

2.1.2. Sequence to Sequence Learning with Neural Networks [2]

2.1.2.a. Introduction:

This paper presents a general end-to-end approach to sequence learning that makes minimal assumptions about the sequence structure. It uses a multilayered Long Short-Term Memory (LSTM) network to map the input sequence to a vector of fixed dimensionality, and then another deep LSTM to decode the target sequence from that vector. The main result is that on an English-to-French translation task from the WMT-14 dataset, the translations produced by the LSTM achieve a better BLEU score than previous approaches. Additionally, the LSTM did not have difficulty with long sentences.

2.1.2.b. Prev. Works:

Among previous approaches, a phrase-based Statistical Machine Translation (SMT) system achieved a high BLEU score on the same (WMT-14) dataset. In this paper, the LSTM is also used to rerank the 1000 hypotheses produced by the SMT system, which further increases the BLEU score. The LSTM learned sensible phrase and sentence representations, and reversing the order of the words in all source sentences improved its performance.

2.1.2.c. The Model:

This paper shows that a straightforward application of the Long Short-Term Memory (LSTM) architecture [4] can solve general sequence-to-sequence problems. The idea is to use one LSTM to read the input sequence, one timestep at a time, to obtain a large fixed-dimensional vector representation, and then to use another LSTM to extract the output sequence from that vector (see Fig. 1).

A Recurrent Neural Network (RNN) is a natural generalization of feedforward neural networks to sequences. Given a sequence of inputs x_1, ..., x_T, a standard RNN computes a sequence of outputs y_1, ..., y_T by iterating the following equations:

h_t = sigm(W^{hx} x_t + W^{hh} h_{t−1})

y_t = W^{yh} h_t
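For completeness, a short NumPy rendering of this recurrence (the weight names are assumptions for illustration) is shown below.

```python
import numpy as np

def rnn_step(x_t, h_prev, W_hx, W_hh, W_yh):
    """Standard RNN step: h_t = sigm(W_hx x_t + W_hh h_{t-1}), y_t = W_yh h_t."""
    h_t = 1.0 / (1.0 + np.exp(-(W_hx @ x_t + W_hh @ h_prev)))  # element-wise sigmoid
    y_t = W_yh @ h_t
    return h_t, y_t
```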

An RNN can easily map sequences to sequences whenever the alignment between the inputs and the outputs is known ahead of time, but it is not clear how to apply an RNN to problems whose input and output sequences have different lengths with complicated and non-monotonic relationships.

A simple strategy for general sequence learning is to map the input sequence to a fixed-sized vector using one RNN, and then to map the vector to the target sequence with another RNN [3]. While this could work in principle, since the RNN is provided with all the relevant information, it would be difficult to train the RNNs due to the resulting long-term dependencies. However, the Long Short-Term Memory (LSTM) [4] is known to learn problems with long-range temporal dependencies, so an LSTM may succeed in this setting. The goal of the LSTM is to estimate the conditional probability p(y_1, ..., y_{T′} | x_1, ..., x_T), where (x_1, ..., x_T) is an input sequence and (y_1, ..., y_{T′}) is its corresponding output sequence, whose length T′ may differ from T:

p(y_1, ..., y_{T′} | x_1, ..., x_T) = ∏_{t=1}^{T′} p(y_t | v, y_1, ..., y_{t−1})

In this equation, each distribution p(y_t | v, y_1, ..., y_{t−1}) is represented with a softmax over all the words in the vocabulary.
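As a toy illustration of this factorization (the code is an exposition aid, not from the paper), the sketch below accumulates the log-probability of a target sequence from per-step softmax distributions over a small vocabulary.

```python
import numpy as np

def softmax(logits):
    e = np.exp(logits - logits.max())
    return e / e.sum()

def sequence_log_prob(step_logits, target_ids):
    """log p(y_1..y_T' | x) = sum_t log p(y_t | v, y_<t), with each step's
    distribution given as a vector of vocabulary logits."""
    log_p = 0.0
    for logits, y_t in zip(step_logits, target_ids):
        log_p += np.log(softmax(logits)[y_t])
    return log_p

# Toy example: 3 decoding steps over a 5-word vocabulary
rng = np.random.default_rng(1)
logits = [rng.standard_normal(5) for _ in range(3)]
print(sequence_log_prob(logits, target_ids=[2, 0, 4]))
```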

One LSTM is used for the input sequence and another for the output sequence, because this increases the number of model parameters at negligible computational cost and makes it natural to train the LSTM on multiple language pairs simultaneously. The encoder LSTM computes the representation of "A", "B", "C", "<EOS>", and the decoder then uses this representation to compute the probability of "W", "X", "Y", "Z", "<EOS>". Deep LSTMs significantly outperformed shallow LSTMs, so an LSTM with four layers was chosen. It also proved extremely valuable to reverse the order of the words of the input sentence. For example, instead of mapping the sentence a, b, c to the sentence α, β, γ, the LSTM is asked to map c, b, a to α, β, γ, where α, β, γ is the translation of a, b, c. This way, a is in close proximity to α, b is fairly close to β, and so on.


Fig. 1. This model reads an input sentence "ABC" and produces "WXYZ" as the output sentence. The model stops making predictions after outputting the end-of-sentence <EOS> token. Note that the LSTM reads the input sentence in reverse, because doing so introduces many short-term dependencies in the data that make the optimization problem much easier.

2.1.2.d. Success:

A large deep LSTM with a limited vocabulary can outperform a standard SMT-based system with an unlimited vocabulary on a large-scale MT task and can correctly translate very long sentences. The success of this simple LSTM-based approach to MT suggests that it should do well on many other sequence learning problems, provided enough training data is available.

2.1.2.e. Limitations:

It was expected that the LSTM would fail on very long sentences due to its limited memory, and other researchers had reported poor performance on long sentences with models similar to this one.

2.1.3. Long Short-Term Memory-Networks for Machine Reading [5]

2.1.3.a. Introduction:

This paper designs a machine reader that automatically learns to understand text and proposes a machine reading simulator to address the limitations of recurrent neural networks when processing inherently structured input. The model is based on a Long Short-Term Memory architecture embedded with a memory network, explicitly storing contextual representations of input tokens without recursively compressing them.

2.1.3.b. Prev. Works:

Recurrent neural networks have been successfully applied to various sequence modeling and sequence-to-sequence transduction tasks. The latter have assumed several guises in the literature, such as machine translation [3], sentence compression [6] and reading comprehension [7]. A key contributing factor to their success has been the ability to handle well-known problems with exploding or vanishing gradients, leading to models with gated activation functions and more advanced architectures that enhance the information flow within the network. In this paper, memory and attention are added within a sequence encoder, allowing the network to uncover lexical relations between tokens.

2.1.3.c. The Model:

The main model is a modification of the standard LSTM structure in which the memory cell is replaced with a memory network. The resulting Long Short-Term Memory-Network (LSTMN) stores the contextual representation of each input token in a unique memory slot, and the size of the memory grows with time until an upper bound on the memory span is reached.

A standard tool for modeling two sequences with recurrent networks is the encoder-decoder architecture, where the second sequence (also known as the target) is processed conditioned on the first one (also known as the source). The figures below illustrate how to combine the LSTMN, which applies attention for intra-relation reasoning, with an encoder-decoder network whose attention module learns the inter-alignment between two sequences.


Fig. 2. Long Short-Term Memory-Network. Color indicates degree of memory activation



Fig. 3. LSTMNs for sequence-to-sequence modeling. The encoder uses intra-attention, while the decoder incorporates both intra- and inter-attention. The two figures present two ways to combine the intra- and inter-attention in the decoder.



Fig. 4. Illustration of the model while reading the sentence "The FBI is chasing a criminal on the run." Red represents the current word being fixated, blue represents memories. Shading indicates the degree of memory activation.

2.1.3.d. Success:

The simulator successfully overcomes the limitations of RNNs when processing inherently structured input and can store contextual representations of input tokens without recursively compressing them. When direct supervision is provided, similar architectures can be adapted to tasks such as dependency parsing and relation extraction.

2.1.3.e. Limitations:

The architecture is not more linguistically plausible than standard alternatives and is not able to reason over nested structures. It is also unable to learn to discover compositionality with weak or indirect supervision.

2.2. Attention Mechanism

2.2.1. Effective Approaches to Attention-based Neural Machine Translation [1]

2.2.1.a. Introduction:

This paper proposes two simple and effective attentional mechanisms for neural machine translation: a global approach, which always looks at all source positions, and a local approach, which only attends to a subset of source positions at a time. The effectiveness of these models is tested on WMT translation tasks between English and German in both directions.

2.2.1.b. Prev. Works

Because NMT requires minimal domain knowledge and is conceptually simple, it is appealing and has achieved state-of-the-art performance in large-scale translation tasks such as English to French and English to German [1]. In parallel, the concept of an attention mechanism has recently gained popularity in training neural networks, allowing models to learn alignments between image objects and agent actions in dynamic control problems, between speech frames and text in speech recognition, or between the visual features of a picture and its text description in image caption generation [8]. For English-to-German translation, this model achieves new state-of-the-art (SOTA) results for both WMT'14 and WMT'15, outperforming previous SOTA systems backed by NMT models and n-gram LM rerankers.

2.2.1.c. The Model:

1. Global attention: attention is placed on all source positions. 2. Local attention: attention is placed only on a few source positions. Both attention-based models differ from the normal encoder-decoder architecture only in the decoding phase, and they differ from each other in the way they compute the context vector c(t).

Global attention: Global attention takes all encoder hidden states into consideration to derive the context vector c(t). To calculate c(t), a variable-length alignment vector a(t) is first computed. The alignment vector is derived by computing a similarity measure between the current target hidden state h(t) and each source hidden state h̄(s); similar states in the encoder and decoder are assumed to refer to the same meaning. The alignment vector a(t, s) is defined as:

a_t(s) = align(h_t, h̄_s) = exp(score(h_t, h̄_s)) / Σ_{s′} exp(score(h_t, h̄_{s′}))

The task of the score function is to calculate the similarity between the hidden states of the target and the source.
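A minimal NumPy sketch of global attention follows; the dot-product score and the array shapes are illustrative assumptions, since the paper also proposes "general" and "concat" score functions.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def global_attention(h_t, source_states):
    """h_t: target hidden state, shape (d,); source_states: all encoder states, shape (S, d).
    Returns the alignment weights a_t over source positions and the context vector c_t."""
    scores = source_states @ h_t   # dot-product score(h_t, h_bar_s) for every source position s
    a_t = softmax(scores)          # alignment vector of length S
    c_t = a_t @ source_states      # context vector: weighted average of encoder states
    return a_t, c_t

rng = np.random.default_rng(0)
a_t, c_t = global_attention(rng.standard_normal(8), rng.standard_normal((6, 8)))
```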

Local attention: Because global attention focuses on all source-side words for every target word, it is computationally expensive and impractical when translating long sentences. To overcome this deficiency, local attention focuses only on a small subset of the encoder hidden states per target word. The model first generates an aligned position p(t) for each target word at time t. The context vector c(t) is then derived as a weighted average over the set of source hidden states within the window [p(t) − D, p(t) + D], where D is selected empirically. In contrast to the global alignment vector, the local alignment vector a(t) is of fixed dimension. There are two variants of alignment: monotonic alignment (local-m), where the alignment vector is computed as in the global model, and predictive alignment (local-p), which predicts an alignment position as follows:

p_t = S · sigmoid(v_p^T tanh(W_p h_t))
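The following sketch (parameter names and shapes are assumptions, and the Gaussian weighting the paper applies around p_t is omitted) shows how the predictive alignment position and the local window of source states could be computed.

```python
import numpy as np

def local_p_window(h_t, source_states, W_p, v_p, D):
    """Predictive local attention: p_t = S * sigmoid(v_p^T tanh(W_p h_t)),
    then attend only to encoder states inside [p_t - D, p_t + D]."""
    S = source_states.shape[0]                                  # source sentence length
    p_t = S * (1.0 / (1.0 + np.exp(-(v_p @ np.tanh(W_p @ h_t)))))
    lo, hi = max(0, int(p_t) - D), min(S, int(p_t) + D + 1)
    return p_t, source_states[lo:hi]                            # window of at most 2D + 1 states

rng = np.random.default_rng(0)
d, S, D = 8, 20, 2
p_t, window = local_p_window(rng.standard_normal(d),
                             rng.standard_normal((S, d)),
                             rng.standard_normal((d, d)),
                             rng.standard_normal(d), D)
```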

Input feeding approach: In the proposed attention mechanisms the attention decisions are made independently (previously predicted alignments do not influence the next alignment), which is suboptimal. To make sure that future alignment decisions take past alignment information into account, the attentional vector h̃(t) is concatenated with the inputs at the next time steps, making the model fully aware of previous alignment choices and creating a very deep network spanning both horizontally and vertically.

2.2.1.d. Success:

Local attention focuses only on the D hidden states on each side of p(t), so it can handle long sentences efficiently.

2.2.1.e. Limitations:

Global attention is computationally more expensive and impractical for long sentences.

2.2.2. Accelerating Neural Transformer via an Average Attention Network [9]

2.2.2.a. Introduction:

This paper describes an average attention network that considerably alleviates the decoding bottleneck of the neural Transformer. It employs a cumulative-average operation to capture important contextual clues from previous target words and a feed-forward gating layer to enrich the expressiveness of the learned hidden representations, and it is further enhanced with a masking trick and a dynamic programming method to accelerate the Transformer's decoder. The proposed average attention network is able to speed up the Transformer's decoder by over four times.

2.2.2.b. Prev. Works:

GRU and LSTM RNNs are widely used in neural machine translation to deal with long-range dependencies as well as the vanishing gradient issue. The neural Transformer is very fast to train, but due to its auto-regressive architecture and the self-attention in the decoder, the decoding procedure is slow. To alleviate this issue, the authors propose an average attention network as an alternative to the self-attention network in the decoder of the neural Transformer.

2.2.2.c. The Model:

The proposed Average Attention Network (AAN) replaces the dynamically computed attention weights of the self-attention network in the decoder of the neural Transformer with simple, fixed average weights. A benefit of the cumulative-average operation is that no matter how long the input is, the connection strength with each previous input embedding is invariant. The AAN controls how much past information is preserved from the previous context and how much new information is captured from the current input, which helps the model detect correlations inside the input embeddings. Although the cumulative average can only be computed sequentially, the decoding procedure is accelerated by reformulating it with dynamic programming, decomposing each step into the following two parts:

g̃_j = g̃_{j−1} + y_j

g_j = FFN(g̃_j / j)

In this way the model computes the j-th input representation based only on the previous state g̃_{j−1}.
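A minimal sketch of this incremental computation is shown below; the feed-forward layer is a placeholder and the gating layer described in the paper is omitted, so this is a simplified assumption rather than the full model.

```python
import numpy as np

def ffn(x):
    # Placeholder for the position-wise feed-forward layer (FFN); the real
    # Transformer FFN has two learned linear layers and a ReLU in between.
    return np.tanh(x)

def aan_decode_step(y_j, g_sum_prev, j):
    """Average attention via dynamic programming:
    g~_j = g~_{j-1} + y_j,  g_j = FFN(g~_j / j)."""
    g_sum = g_sum_prev + y_j     # running sum of target-side input embeddings
    g_j = ffn(g_sum / j)         # cumulative average fed through the FFN
    return g_j, g_sum

# Decode a toy sequence of 4 embeddings one step at a time
rng = np.random.default_rng(0)
g_sum = np.zeros(8)
for j, y in enumerate(rng.standard_normal((4, 8)), start=1):
    g_j, g_sum = aan_decode_step(y, g_sum, j)
```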

2.2.2.d. Success:

The model is further improved with a masking trick and a dynamic programming method that accelerate the Transformer's decoder. Experiments on WMT14 and six WMT17 language pairs demonstrate that the proposed average attention network is able to speed up the Transformer's decoder by over four times.

2.2.2.e. Limitations:

In the future, the authors plan to apply this model to other sequence-to-sequence learning tasks (not specifically mentioned) and to improve it so that its modeling ability consistently outperforms the original neural Transformer.

2.3. Transformer Architecture

2.3.1. Attention Is All You Need [10]

2.3.1.a. Introduction:

This paper presents the Transformer, the first sequence transduction model based entirely on attention, replacing the recurrent layers most commonly used in encoder-decoder architectures with multi-headed self-attention. For translation tasks, the Transformer can be trained significantly faster than architectures based on recurrent or convolutional layers. On both the WMT 2014 English-to-German and WMT 2014 English-to-French translation tasks, the model achieves a new state of the art, outperforming even all previously reported ensembles.

2.3.1.b. Prev. Works:

Recurrent neural networks, long short-term memory [4] and gated recurrent neural networks in particular, have been firmly established as state-of-the-art approaches to sequence modeling and transduction problems such as language modeling and machine translation. Numerous efforts have since continued to push the boundaries of recurrent language models and encoder-decoder architectures, and attention mechanisms have become an integral part of compelling sequence modeling and transduction models in various tasks. This model architecture, however, eliminates recurrence and instead relies entirely on an attention mechanism to draw global dependencies between input and output. The Transformer allows for significantly more parallelization and reaches a new state of the art in translation.


2.3.1.c. The Model:

The Transformer proposed in this paper is a model architecture which relies entirely on an attention mechanism to draw global dependencies between input and output. Neural sequence transduction models generally have an encoder-decoder structure. The encoder maps an input sequence of symbol representations to a sequence of continuous representations. The decoder then generates an output sequence of symbols, one element at a time.

Fig. 5. Transformer model architecture.

The Transformer follows this overall architecture using stacked self-attention and point-wise, fully connected layers for both the encoder and decoder.

Self-attention is motivated by three criteria: the total computational complexity per layer, the amount of computation that can be parallelized (as measured by the minimum number of sequential operations required), and the path length between long-range dependencies in the network.

The Transformer uses two types of attention functions. Scaled dot-product attention computes the attention function on a set of queries simultaneously, packed together into a matrix. Multi-head attention allows the model to jointly attend to information from different representation subspaces at different positions. A self-attention layer connects all positions with a constant number of sequentially executed operations, whereas a recurrent layer requires O(n) sequential operations.
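A compact NumPy sketch of scaled dot-product attention follows; batching, masking and the multi-head projections are omitted, and the shapes are assumptions for illustration.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V.
    Q: (n_q, d_k), K: (n_k, d_k), V: (n_k, d_v)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                    # similarity of each query to each key
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)     # row-wise softmax
    return weights @ V                                 # weighted sum of values

rng = np.random.default_rng(0)
out = scaled_dot_product_attention(rng.standard_normal((3, 8)),   # 3 queries
                                   rng.standard_normal((5, 8)),   # 5 keys
                                   rng.standard_normal((5, 16)))  # 5 values
```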

2.3.1.d. Success:

It introduces the Transformer, a novel sequence transduction model based entirely on an attention mechanism, replacing the recurrent layers most commonly used in encoder-decoder architectures with multi-headed self-attention. The Transformer can be trained significantly faster than architectures based on recurrent or convolutional layers for translation tasks, and on both the WMT 2014 English-to-German and WMT 2014 English-to-French translation tasks it achieves a new state of the art, outperforming all previously reported ensembles.

2.3.1.e. Limitations:

So far the Transformer has only been applied to transduction tasks. In the near future, the authors plan to apply it to problems involving input and output modalities other than text, and to use attention mechanisms to efficiently handle large inputs and outputs such as images, audio and video.

2.3.2. Improving Language Understanding by Generative Pre-Training [11]

2.3.2.a. Introduction:

This paper introduces a framework for achieving strong natural language understanding with a single task-agnostic model through generative pre-training and discriminative fine-tuning. By pre-training on a diverse corpus with long stretches of contiguous text, the model acquires significant world knowledge and the ability to process long-range dependencies, which are then successfully transferred to discriminative tasks such as question answering, semantic similarity assessment, entailment determination and text classification.

2.3.2.b. Prev. Works:

The earliest approaches used unlabeled data to compute word-level or phrase-level statistics, which were then used as features in a supervised model. Over the last few years, researchers have demonstrated the benefits of word embeddings trained on unlabeled corpora for improving performance on a variety of tasks. These approaches, however, mainly transfer word-level information, whereas this work aims to capture higher-level semantics.

2.3.2.c. The Model:

This model builds on ULMFiT [12] and the Transformer [10]. It uses a simple pre-processing approach that fits structured tasks to the architecture rather than the other way around. The approach roughly follows ULMFiT, though the training process consists of only two phases and is somewhat simplified. It also swaps out ULMFiT's LSTM for a Transformer, which is better at capturing long-term dependencies. This work additionally uses an auxiliary language-model objective in the fine-tuning phase, which is a contributing factor in getting away with two phases instead of ULMFiT's three.

The first phase is unsupervised pre-training and is quite standard. The authors train a 12-layer Transformer decoder model with masked self-attention; the masking is necessary to avoid looking at the words to be predicted. The objective L1 maximizes the log-likelihood of the next word, summed over the corpus, given a context window of 512 contiguous tokens. They use BookCorpus, which contains long stretches of contiguous text with long-range dependencies.

The second phase is supervised fine-tuning, which considers a set of inputs x1, ..., xm and an output label y. The output is predicted by introducing a new weight matrix Wy, which is multiplied by the last Transformer activation and fed into a softmax to produce the output probabilities P(y). Again the objective L2 is to maximize the sum of the log-probabilities, though the objective from the first phase is kept as an auxiliary objective, so the final objective is L3 = L2 + λ·L1 (with λ = 0.5). The only new parameters to learn in this phase are in Wy, so fine-tuning is relatively fast.
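As a rough sketch of this combined objective (the tensor shapes, names and the single-example setting are assumptions made for exposition), the code below adds the supervised task log-likelihood to the auxiliary language-modelling term with weight λ = 0.5.

```python
import numpy as np

def log_softmax(logits):
    z = logits - logits.max()
    return z - np.log(np.exp(z).sum())

def fine_tune_objective(task_logits, y, lm_logits, next_tokens, lam=0.5):
    """L3 = L2 + lam * L1: supervised task log-likelihood plus the auxiliary
    language-modelling objective kept from pre-training (to be maximized)."""
    L2 = log_softmax(task_logits)[y]                              # log P(y | x_1..x_m)
    L1 = sum(log_softmax(l)[t] for l, t in zip(lm_logits, next_tokens))
    return L2 + lam * L1

rng = np.random.default_rng(0)
L3 = fine_tune_objective(rng.standard_normal(3), 1,               # 3 task classes, gold label 1
                         rng.standard_normal((4, 10)), [2, 7, 0, 5])  # 4 LM steps, 10-word vocab
```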

2.3.2.d. Success:

The paper handles structured inputs (for example, sentence pairs) by applying a "traversal-style" pre-processing approach that converts them into ordered sequences the pre-trained model can process.

2.3.2.e. Limitations:

The pre-training phase took a very long time (about one month) on 8 GPUs. Fortunately, the pre-trained model is publicly available.

2.4. Human Activity Recognition using wearables sensors

2.4.1. Deep, Convolutional, and Recurrent Models for Human Activity Recognition using Wearables [13]

2.4.1.a. Introduction:

In this paper the authors explore the performance of state-of-the-art deep learning approaches to Human Activity Recognition (HAR) using wearable sensors. They describe how to train recurrent approaches in this setting and introduce a novel regularisation approach. They find that bi-directional LSTMs outperform the previous state of the art on the Opportunity dataset by a considerable margin.

2.4.1.b. Prev. Works:

The dominant technical approach in HAR includes sliding-window segmentation of time-series data captured with body-worn sensors, manually designed feature extraction procedures, and a wide variety of (supervised) classification methods. Deep learning has the potential to have a significant impact on HAR in ubicomp: it can substitute for manually designed feature extraction procedures, which lack the robust physiological basis that benefits other fields such as speech recognition. It is also difficult for practitioners to select the most suitable method for their application. The authors therefore provide the first unbiased and systematic exploration of the performance of state-of-the-art deep learning approaches on three recognition problems typical for HAR in ubicomp, and introduce a novel approach to regularisation for recurrent networks.

2.4.1.c. The Model:

The training process for the deep, convolutional and recurrent models is as follows. Deep feed-forward networks (DNN): The deep feed-forward network is implemented as a neural network with up to five hidden layers followed by a softmax group. Each hidden layer contains the same number of units and corresponds to a linear transformation followed by a rectified-linear (ReLU) activation function. Two regularisation techniques are used: i) dropout and ii) max-in-norm. A purely supervised learning approach is used to limit the number of hyperparameters, without any generative pre-training. The DNN is trained with mini-batches of 64 frames.

Convolutional networks (CNN): Each CNN contains at least one temporal convolution layer, one pooling layer and at least one fully connected layer prior to a top-level softmax group. The temporal convolution layer corresponds to a convolution of the input sequence with kernels of different widths. The width of the max-pooling is fixed to 2 throughout all experiments, and the output of each max-pooling layer is transformed using a ReLU activation function. The subsequent fully connected part effectively corresponds to a DNN and follows the same architecture. For regularisation, dropout is applied after each max-pooling or fully connected layer. Like the DNN, the CNN is trained using stratified mini-batches of 64 frames and stochastic gradient descent to minimise the negative log-likelihood.
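A possible PyTorch rendering of such a temporal CNN is sketched below; the number of filters, kernel width and layer sizes are illustrative assumptions, not the exact configuration from the paper.

```python
import torch
import torch.nn as nn

class TemporalCNN(nn.Module):
    """Temporal convolution + max-pooling (width 2) + ReLU with dropout,
    followed by a fully connected part and a top-level softmax group."""
    def __init__(self, n_channels, n_classes, kernel_width=5, frame_len=64):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(n_channels, 32, kernel_size=kernel_width),
            nn.MaxPool1d(2), nn.ReLU(), nn.Dropout(0.5),
            nn.Conv1d(32, 64, kernel_size=kernel_width),
            nn.MaxPool1d(2), nn.ReLU(), nn.Dropout(0.5),
        )
        feat_len = ((frame_len - kernel_width + 1) // 2 - kernel_width + 1) // 2
        self.classifier = nn.Sequential(
            nn.Flatten(), nn.Linear(64 * feat_len, 128), nn.ReLU(),
            nn.Dropout(0.5), nn.Linear(128, n_classes),  # softmax applied via cross-entropy loss
        )

    def forward(self, x):                   # x: (batch, channels, frame_len)
        return self.classifier(self.features(x))

model = TemporalCNN(n_channels=9, n_classes=5)
logits = model(torch.randn(64, 9, 64))      # one stratified mini-batch of 64 frames
```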

Recurrent networks: To exploit the temporal dependencies within the movement data, recurrent neural networks based on LSTM are implemented in two flavours: i) deep forward LSTMs, which contain multiple layers of recurrent units, and ii) bi-directional LSTMs, which contain two parallel recurrent layers. In the first case, the input fed into the network at any given time t corresponds to the current frame of movement data, which stretches a certain length of time and whose dimensions have been concatenated; this model is denoted LSTM-F. The second application of forward LSTMs represents a real-time setting in which each sample of movement data is presented to the network in the order it was recorded, denoted LSTM-S. The final scenario applies bi-directional LSTMs to the same sample-by-sample prediction problem, denoted b-LSTM-S.
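A minimal PyTorch sketch of a sample-by-sample bi-directional LSTM classifier in the spirit of b-LSTM-S follows; the hidden size, layer count and sensor-channel count are assumptions for illustration.

```python
import torch
import torch.nn as nn

class BiLSTMTagger(nn.Module):
    """Bi-directional LSTM that emits one activity prediction per sample,
    in the spirit of the b-LSTM-S setting described above."""
    def __init__(self, n_channels, n_classes, hidden=128, layers=2):
        super().__init__()
        self.lstm = nn.LSTM(n_channels, hidden, num_layers=layers,
                            batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden, n_classes)

    def forward(self, x):          # x: (batch, time, channels)
        h, _ = self.lstm(x)        # h: (batch, time, 2 * hidden)
        return self.out(h)         # per-sample class scores

model = BiLSTMTagger(n_channels=9, n_classes=5)
scores = model(torch.randn(8, 100, 9))   # 8 sequences of 100 sensor samples each
```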

2.4.1.d. Success:

For bi-directional LSTMs (b-LSTM-S), the number of units in each layer has a surprisingly large effect on performance, which should motivate practitioners to focus first on tuning this parameter. For sample-based forward LSTMs (LSTM-S), the results mostly confirm earlier findings for this type of model that the learning rate is the most crucial parameter.

2.4.1.e. Limitations:

The models differ in the spread of recognition performance across parameter settings. Recurrent networks significantly outperform convolutional networks on activities that are short in duration but have a natural ordering. The authors recommend exploring learning rates before optimising the architecture of the network, as the learning parameters had the largest effect on performance, and they advise practitioners not to discard a model even if a preliminary exploration leads to poor recognition performance. More sophisticated approaches like CNNs and RNNs show a much smaller spread of performance.

References

[1] M.-T. Luong, H. Pham, and C. D. Manning, “Effective approaches to attention-based neural machine translation,” arXiv preprint arXiv:1508.04025, 2015.

[2] I. Sutskever, O. Vinyals, and Q. V. Le, “Sequence to sequence learning with neural networks,” in Advances in neural information processing systems, 2014, pp. 3104–3112.

[3] D. Bahdanau, K. Cho, and Y. Bengio, “Neural machine translation by jointly learning to align and translate,” arXiv preprint arXiv:1409.0473, 2014.

[4] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural computation, vol. 9, no. 8, pp. 1735–1780, 1997.

[5] J. Cheng, L. Dong, and M. Lapata, “Long short-term memory-networks for machine reading,” arXiv preprint arXiv:1601.06733, 2016.

[6] A. M. Rush, S. Chopra, and J. Weston, “A neural attention model for abstractive sentence summarization,” arXiv preprint arXiv:1509.00685, 2015.

[7] K. M. Hermann, T. Kocisky, E. Grefenstette, L. Espeholt, W. Kay, M. Suleyman, and P. Blunsom, “Teaching machines to read and comprehend,” in Advances in neural information processing systems, 2015, pp. 1693–1701.

[8] K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhudinov, R. Zemel, and Y. Bengio, “Show, attend and tell: Neural image caption generation with visual attention,” in International conference on machine learning, 2015, pp. 2048–2057.

[9] B. Zhang, D. Xiong, and J. Su, “Accelerating neural transformer via an average attention network,” arXiv preprint arXiv:1805.00631, 2018.

[10] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” in Advances in neural information processing systems, 2017, pp. 5998–6008.

[11] A. Radford, K. Narasimhan, T. Salimans, and I. Sutskever, “Improving language understanding by generative pre-training,” 2018.

[12] J. Howard and S. Ruder, “Universal language model fine-tuning for text classification,” arXiv preprint arXiv:1801.06146, 2018.

[13] N. Y. Hammerla, S. Halloran, and T. Plotz, “Deep, convolutional, and recurrent models for human activity recognition using wearables,” arXiv preprint arXiv:1604.08880, 2016.


