I have just completed the first week of my internship at freshlybuilt, and it has been a great experience so far.

On this platform, I am working on a project in which we are going to build a Python library. I am delighted to be part of this project. It is really interesting that we are going to create our own library that everyone will be able to use.

At the beginning of the week, there was a brief introduction to the project. All the team members shared their ideas for implementing it, and we discussed the project at length.

First, we read about different text extraction tools, such as Tesseract, the Google Vision API, and Kraken, for extracting the text present in documents and photographed images. Then I tested the accuracy of these tools on different text images. I got some good results and some bad results.
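
One common way to put a number on "good" and "bad" results is character-level accuracy: compare the text a tool returns against a hand-typed reference and count how many edits separate them. The following is a minimal sketch under that assumption, not the exact script used in the project; the function names `edit_distance` and `char_accuracy` and the sample strings are made up for illustration.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: fewest insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete a character from a
                            curr[j - 1] + 1,      # insert a character into a
                            prev[j - 1] + cost))  # substitute (or match)
        prev = curr
    return prev[-1]


def char_accuracy(reference: str, ocr_output: str) -> float:
    """1 minus the character error rate, clipped at 0."""
    if not reference:
        return 1.0 if not ocr_output else 0.0
    cer = edit_distance(reference, ocr_output) / len(reference)
    return max(0.0, 1.0 - cer)


# Hypothetical example: the reference text and what a tool might return.
reference = "freshlybuilt internship"
ocr_output = "fresh1ybuilt intemship"
print(f"character accuracy: {char_accuracy(reference, ocr_output):.2f}")
```

Running the same function over the output of each tool on the same set of images gives scores that can be compared side by side.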

After that I started working on the IMAGE DESCRIPTION GENERATION task.



In this task, I had to generate a human-readable textual description of an image based on the objects and actions in it. So basically, we were describing an image with text.

Here are some examples:


I started reading about it and then settled on an approach. In this approach, the task involves two main modules:

  1. Feature Extraction.
  2. Language Model.


Feature Extraction

The feature extraction model is an image-based model that extracts the features and nuances of our image. For this, a CNN (Convolutional Neural Network) is used as the feature extractor. The CNN produces a feature vector, called an embedding, so the CNN is referred to as the encoder.
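
To make the idea of an encoder concrete, here is a toy sketch in plain Python: it slides a couple of small filters over a tiny image, applies ReLU, and averages each result into one number, so the image ends up as a short vector. This only illustrates the principle. A real encoder would be a deep, trained CNN; the image, the filters, and the function names `conv2d_valid` and `encode` are assumptions made up for this example.

```python
def conv2d_valid(image, kernel):
    """Slide a kernel over a 2-D image (no padding, stride 1)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


def encode(image, kernels):
    """Toy encoder: convolution -> ReLU -> global average pooling.

    Returns one number per kernel; together they form the embedding.
    """
    embedding = []
    for kernel in kernels:
        feature_map = conv2d_valid(image, kernel)
        activated = [max(0.0, v) for row in feature_map for v in row]
        embedding.append(sum(activated) / len(activated))
    return embedding


# Hypothetical 4x4 grayscale image with a vertical edge in the middle.
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
# Hand-written filters; a real CNN learns its filters from data.
kernels = [
    [[-1, 1], [-1, 1]],   # responds where brightness increases left to right
    [[1, 1], [-1, -1]],   # responds where brightness decreases top to bottom
]
embedding = encode(image, kernels)
print(embedding)
```

The first filter fires on the vertical edge and the second does not, so the two numbers in the embedding already say something about what is in the image.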


Language model

The language model is a language-based model that translates the features and objects given by our image-based model into a natural sentence. For this, an RNN (Recurrent Neural Network) such as an LSTM (Long Short-Term Memory network) is used as the language model, and it is referred to as the decoder. Given the initial embedding of the image, the LSTM is trained to predict the most probable next word of the sequence.
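
The decoding loop can be sketched as follows: start from a start token, repeatedly ask the model for next-word probabilities, pick the most probable word, and stop at an end token. In this sketch a small hand-written lookup table stands in for the trained LSTM, so `TOY_NEXT_WORD`, `next_word_probs`, and `greedy_caption` are all hypothetical names, and picking the single best word at each step (greedy decoding) is just one simple choice.

```python
START, END = "<start>", "<end>"

# Hypothetical stand-in for a trained LSTM: maps the previous word to
# next-word probabilities. A real decoder would also condition on the
# image embedding and on its own hidden state.
TOY_NEXT_WORD = {
    START: {"a": 0.9, "the": 0.1},
    "a": {"dog": 0.6, "cat": 0.4},
    "the": {"dog": 1.0},
    "dog": {"runs": 0.7, END: 0.3},
    "cat": {"sleeps": 0.8, END: 0.2},
    "runs": {END: 1.0},
    "sleeps": {END: 1.0},
}


def next_word_probs(image_embedding, words_so_far):
    """Placeholder for one decoder step; this toy ignores the embedding."""
    return TOY_NEXT_WORD[words_so_far[-1]]


def greedy_caption(image_embedding, max_len=10):
    """Build a caption by always taking the most probable next word."""
    words = [START]
    for _ in range(max_len):
        probs = next_word_probs(image_embedding, words)
        best = max(probs, key=probs.get)
        if best == END:
            break
        words.append(best)
    return " ".join(words[1:])


caption = greedy_caption(image_embedding=[0.67, 0.0])
print(caption)
```

With this toy table the loop produces the caption `a dog runs`; with a trained model, the probabilities would come from the LSTM instead of the table.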


The image below describes the approach:


