Long Short-Term Memory (LSTM)


Sequence prediction problems have been around for a long time. They cover a wide range of tasks: predicting sales, finding patterns in stock market data, understanding movie plots, recognizing speech, translating languages, and predicting the next word on your phone’s keyboard. They are considered some of the hardest problems to solve. With recent breakthroughs, Long Short-Term Memory (LSTM) networks have proven to be among the most effective solutions for many of these sequence prediction problems. LSTMs have an edge over plain RNNs because they can selectively remember patterns for long durations of time.

The aim of this article is to explain LSTMs. We will cover the limitations of RNNs, an introduction to LSTMs, the LSTM architecture and its gates, and common LSTM applications.

Before learning about LSTMs, let’s look at the limitations of RNNs.

Limitations of RNNs

A typical RNN processes a sequence one element at a time, feeding the hidden state from each step into the next.

RNNs turn out to be quite effective when we are dealing with short-term dependencies, as in the example below:

The color of the sky is __

An RNN works just fine here because this problem does not depend on the wider context of the statement. The RNN need not remember what was said before this, or what it meant; all it needs to know is that in most cases the sky is blue. Thus the prediction would be:

The color of the sky is blue

The example above shows that RNNs work fine on short-term dependencies, but they fail to capture context that lies far back in the input. Something that was said long before cannot be recalled when making predictions in the present. Let’s understand this with the example below:

I spent more than 20 years working for underprivileged kids in Spain. Then I moved to Africa.


I can speak fluent __

This is where a plain Recurrent Neural Network fails! Here, we can understand that since the author has worked in Spain for 20 years, it is very likely that they have a good command of Spanish. But to make the right prediction, the RNN needs to remember this context. The relevant information may be separated from the point where it is needed by a large amount of irrelevant data.

The reason behind this is the vanishing gradient problem. With activation functions like the sigmoid, whose derivative is never larger than 0.25, backpropagation multiplies these small derivatives together many times as the error is propagated back towards the earliest layers. As a result, the gradient almost vanishes by the time it reaches those layers, and it becomes difficult to train them. The same thing happens in Recurrent Neural Networks, where the gradient is propagated back through the time steps. An RNN remembers things for only short durations: if we need a piece of information shortly after it was seen, it can be recovered, but once many more words are fed in, that information gets lost. This issue can be addressed by a modified version of the RNN, the Long Short-Term Memory (LSTM) network.
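To get a feel for how quickly the gradient shrinks, here is a small sketch. It is a simplification: it ignores the weight matrices that also appear in the real gradient and assumes the best case, where every step contributes the largest possible sigmoid derivative. The function names are illustrative only.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_grad(z):
    # Derivative of the sigmoid: s * (1 - s)
    s = sigmoid(z)
    return s * (1.0 - s)

# The sigmoid derivative peaks at z = 0.
max_grad = sigmoid_grad(0.0)

def gradient_scale(steps, per_step=max_grad):
    # Product of `steps` identical derivative factors (weights ignored).
    return per_step ** steps

for steps in (1, 5, 10, 20):
    print(steps, gradient_scale(steps))
```

Even in this best case the factor drops below one in a million after 10 steps, which is why information from far back in the sequence barely influences training.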

Introduction to Long Short-Term Memory (LSTM)

Long Short-Term Memory (LSTM) is a special kind of RNN that performs well on a large variety of problems. LSTM networks are capable of learning long-term dependencies, especially in sequence prediction problems. An LSTM has feedback connections, so it can process entire sequences of data, not just single data points such as images.

The unit is called a long short-term memory block because it uses a structure founded on short-term memory processes to create longer-term memory. These systems are often used, for example, in natural language processing. The recurrent neural network uses the long short-term memory blocks to take a particular word or phoneme and evaluate it in the context of the others in a string, where memory is useful for sorting and categorizing these types of inputs. In general, the LSTM is a well-established and widely used recurrent neural network architecture.

Architecture of LSTMs

Now let’s get into the details of the architecture of an LSTM network:

A typical LSTM network is made up of memory blocks called cells. Two states are transferred to the next cell: the cell state and the hidden state. The memory blocks are responsible for remembering things, and manipulations of this memory are done through three major mechanisms, called gates. These gates optionally let information flow in and out of the cell. Each gate contains a sigmoid neural net layer and a pointwise multiplication operation. The sigmoid layer outputs numbers between zero and one, where zero means ‘nothing should be let through’ and one means ‘everything should be let through’. Each gate is discussed below.

Forget Gate

A forget gate is responsible for removing information from the cell state. Information that the LSTM no longer needs, or that is of less importance, is removed by multiplying the cell state by a filter. This is required for optimizing the performance of the LSTM network.

This gate takes in two inputs: h_t-1 and x_t. h_t-1 is the hidden state from the previous cell (that is, the output of the previous cell), and x_t is the input at the current time step. The inputs are multiplied by the weight matrices and a bias is added. The sigmoid function is then applied to this value. It outputs a vector with values ranging from 0 to 1, one for each number in the cell state.

Basically, the sigmoid function is responsible for deciding which values to keep and which to discard. If a ‘0’ is output for a particular value in the cell state, it means that the forget gate wants the cell state to forget that piece of information completely. Similarly, a ‘1’ means that the forget gate wants to remember that entire piece of information. This vector output from the sigmoid function is multiplied element-wise with the cell state.
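The forget gate described above can be sketched in a few lines of plain Python. This is a minimal illustration, not a trained model: the names W_f, b_f and c_prev and all of the numbers are made up, and a single weight matrix is applied to the concatenation of h_t-1 and x_t, which is equivalent to using one matrix per input.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forget_gate(h_prev, x_t, W_f, b_f):
    """Return one value in (0, 1) per entry of the cell state."""
    concat = h_prev + x_t  # list concatenation: [h_t-1, x_t]
    return [sigmoid(sum(w * v for w, v in zip(row, concat)) + b)
            for row, b in zip(W_f, b_f)]

# Hypothetical toy values: 2 cell-state entries, 2 hidden units, 1 input feature.
h_prev = [0.5, -0.5]
x_t = [1.0]
W_f = [[0.0, 0.0, 0.0],
       [0.0, 0.0, 0.0]]
b_f = [10.0, -10.0]   # large biases push the gate close to 1 and close to 0
c_prev = [2.0, 3.0]   # previous cell state

f_t = forget_gate(h_prev, x_t, W_f, b_f)
c_kept = [f * c for f, c in zip(f_t, c_prev)]  # element-wise multiplication
print(f_t, c_kept)
```

With these values the first cell-state entry is kept almost unchanged, while the second is almost completely forgotten.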

Input Gate

The input gate is responsible for adding information to the cell state. This addition of information is basically a three-step process:

  1. Regulating which values need to be added to the cell state by applying a sigmoid function. This is very similar to the forget gate and acts as a filter for all the information from h_t-1 and x_t.
  2. Creating a vector containing all the candidate values that could be added to the cell state (as perceived from h_t-1 and x_t). This is done using the tanh function, which outputs values from -1 to +1.
  3. Multiplying the value of the regulatory filter (the sigmoid gate) by the created vector (the tanh output) and then adding this useful information to the cell state via an addition operation.

Once this three-step process is complete, only information that is important and not redundant has been added to the cell state.
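The three steps of the input gate map onto code quite directly. The sketch below uses the same assumptions as before: plain Python lists, one weight matrix per layer applied to the concatenation of h_t-1 and x_t, and made-up names (W_i, b_i, W_c, b_c) and values.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def affine(W, b, v):
    # Matrix-vector product plus bias, written out for plain lists.
    return [sum(w * x for w, x in zip(row, v)) + bias
            for row, bias in zip(W, b)]

def input_gate_update(c_state, h_prev, x_t, W_i, b_i, W_c, b_c):
    concat = h_prev + x_t
    i_t = [sigmoid(z) for z in affine(W_i, b_i, concat)]          # step 1: filter
    candidate = [math.tanh(z) for z in affine(W_c, b_c, concat)]  # step 2: candidates
    return [c + i * g for c, i, g in zip(c_state, i_t, candidate)]  # step 3: add

# Hypothetical toy values: zero weights so that only the biases matter.
zeros = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
c_new = input_gate_update(
    c_state=[0.0, 0.0], h_prev=[0.5, -0.5], x_t=[1.0],
    W_i=zeros, b_i=[10.0, -10.0],   # gate nearly open, gate nearly closed
    W_c=zeros, b_c=[5.0, 5.0],      # both candidate values close to +1
)
print(c_new)
```

Both candidate values are close to +1 here, but only the first one reaches the cell state, because the sigmoid filter for the second entry is nearly zero.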

Output Gate

The job of selecting useful information from the current cell state and presenting it as output is done by the output gate.

The functioning of an output gate can again be broken down into three steps:

  1. Creating a vector by applying the tanh function to the cell state, thereby scaling the values to the range -1 to +1.
  2. Making a filter using the values of h_t-1 and x_t, such that it can regulate the values that need to be output from the vector created above. This filter again employs a sigmoid function.
  3. Multiplying the value of this regulatory filter by the vector created in step 1, and sending the result out as the output and also as the hidden state to the next cell.

In other words, the filter is built from the input and hidden state values and is applied to the cell state vector.
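The output gate can be sketched in the same style. As with the earlier examples, W_o and b_o are hypothetical names, the numbers are invented for illustration, and a single weight matrix acts on the concatenation of h_t-1 and x_t.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def affine(W, b, v):
    # Matrix-vector product plus bias, written out for plain lists.
    return [sum(w * x for w, x in zip(row, v)) + bias
            for row, bias in zip(W, b)]

def output_gate(c_state, h_prev, x_t, W_o, b_o):
    concat = h_prev + x_t
    squashed = [math.tanh(c) for c in c_state]            # step 1: scale to (-1, 1)
    o_t = [sigmoid(z) for z in affine(W_o, b_o, concat)]  # step 2: build the filter
    return [o * s for o, s in zip(o_t, squashed)]         # step 3: filtered output

# Hypothetical toy values: zero weights so that only the biases matter.
h_t = output_gate(
    c_state=[2.0, -2.0], h_prev=[0.5, -0.5], x_t=[1.0],
    W_o=[[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]],
    b_o=[10.0, -10.0],  # first entry passes through, second is blocked
)
print(h_t)
```

The returned vector is both the output at this time step and the hidden state handed to the next cell. Note that the filter is computed from h_t-1 and x_t only, while the values it filters come from the cell state.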

LSTM Applications

LSTM neural networks can handle numerous tasks that earlier learning algorithms, such as plain RNNs, could not solve well. LSTMs capture long-term temporal dependencies effectively without running into as many optimization hurdles. LSTM networks find useful applications in the following areas:

  • Language modeling
  • Machine translation
  • Handwriting recognition
  • Image captioning
  • Image generation using attention models
  • Question answering
  • Video-to-text conversion
  • Polyphonic music modeling
  • Speech synthesis
  • Protein secondary structure prediction

I hope this was helpful and informative.
