The Challenge of Machines Understanding Text Data
It’s easy for you and me to understand a sentence on a computer screen within a fraction of a second. Machines are a different story: they cannot understand or process text data in its raw form.
In practice, we need to convert the text into a numerical format that a machine can read and compute on, and this is one of the core ideas behind Natural Language Processing (NLP).
This is where Bag-of-Words (BoW) and TF-IDF come into play: both are techniques that help us convert text sentences into numeric vectors.
So, in this blog, we’ll discuss both Bag-of-Words and TF-IDF, using one simple, intuitive example to understand each concept in detail. Let’s start learning with some fun.
Introduction to the Challenge of Machines Understanding Text Data
Most of us, students especially, love online shopping (to varying degrees) and tend to look at the product reviews before deciding to buy. I know a lot of you do the same!
So, let us take a sample of three reviews, out of a thousand, about a particular novel we are planning to buy:
- Review 1: This novel is very humorous and long
- Review 2: This novel is not humorous but is interesting
- Review 3: This novel is factual based and interesting
Clearly, there are a lot of interesting insights we can draw from these reviews to judge how well the novel is rated. But as discussed above, we cannot simply give these sentences to a machine learning model and ask it to predict whether a review is positive or negative. We first need to turn the text into numbers, using vectorization techniques such as BoW and TF-IDF.
Creating Vectors from Text
Before I say it, can you suggest a technique we could use to vectorize a sentence? If the term “Word Embedding” comes to mind, you are on the right track. So, what does Word Embedding mean?
In simple words, Word Embedding is a technique for representing text using vectors. In the broad sense used in this blog, that includes BoW (Bag of Words) and TF-IDF (Term Frequency-Inverse Document Frequency). Strictly speaking, these two are count-based vectorization methods, while the term usually refers to learned dense vectors such as Word2Vec. Now, let us see how we can represent the above novel reviews as vectors and get them ready for a machine learning model.
Bag of Words (BoW) Model
The Bag of Words (BoW) model is the simplest form of text representation using numbers. It is a common way of representing text data as input feature vectors to an ML model. As the name suggests, a sentence is treated as a bag of its words: word order is ignored, and the sentence is represented as a vector of numbers, one per vocabulary word.
The biggest advantage of bag of words is that it is flexible and simple to understand and implement.
Do you know the practical applications of Bag of Words? It is widely used in Natural Language Processing, document classification and information retrieval.
Try to recall the three novel reviews we saw earlier:
- Review 1: This novel is very humorous and long
- Review 2: This novel is not humorous but is interesting
- Review 3: This novel is factual based and interesting
Step 1: The review list above is the data for our bag of words model. In this small example, we treat each of the three lines as a separate “document”; together they form our entire corpus of documents.
Step 2: The next step is designing our vocabulary. We build it by listing all the unique words in the three reviews, ignoring case. The vocabulary consists of these 12 words: ‘This’, ‘novel’, ‘is’, ‘very’, ‘humorous’, ‘and’, ‘long’, ‘not’, ‘but’, ‘interesting’, ‘factual’, ‘based’.
Step 3: Once the vocabulary is built, we need to create the document vectors by scoring the occurrence of each vocabulary word in each document. Here we simply count how many times each word occurs in each of the three reviews, so most entries are 1s and 0s, with a 2 where a word is repeated. This gives us three vectors for the three reviews:
Review | this | novel | is | very | humorous | and | long | not | but | interesting | factual | based | Length of review |
R1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 7 |
R2 | 1 | 1 | 2 | 0 | 1 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 8 |
R3 | 1 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 1 | 7 |
Step 4: Vector of Review 1: [1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0]; Vector of Review 2: [1, 1, 2, 0, 1, 0, 0, 1, 1, 1, 0, 0]; Vector of Review 3: [1, 1, 1, 0, 0, 1, 0, 0, 0, 1, 1, 1].
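The four steps above can be sketched in a few lines of plain Python. This is only a minimal illustration, not a production tokenizer: the helper names (`tokenize`, `build_vocabulary`, `bow_vector`) are made up for this example, and splitting on whitespace is an assumption that works only because our toy reviews contain no punctuation.

```python
from collections import Counter

# Step 1: the corpus, one string per document.
reviews = [
    "This novel is very humorous and long",
    "This novel is not humorous but is interesting",
    "This novel is factual based and interesting",
]


def tokenize(text):
    # Assumption: lowercasing and whitespace splitting is enough for this toy corpus.
    return text.lower().split()


def build_vocabulary(documents):
    # Step 2: collect the unique words in order of first appearance.
    vocabulary = []
    seen = set()
    for document in documents:
        for word in tokenize(document):
            if word not in seen:
                seen.add(word)
                vocabulary.append(word)
    return vocabulary


def bow_vector(document, vocabulary):
    # Step 3: count how often each vocabulary word occurs in the document.
    counts = Counter(tokenize(document))
    return [counts[word] for word in vocabulary]


vocabulary = build_vocabulary(reviews)
vectors = [bow_vector(review, vocabulary) for review in reviews]

# Step 4: print one vector per review.
print(vocabulary)
for number, vector in enumerate(vectors, start=1):
    print(f"Review {number}: {vector}")
```

Keeping the vocabulary in order of first appearance makes the columns line up with the table above, which is handy when checking the vectors by hand.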
Deciding whether a novel review is positive or negative is a classification problem known as sentiment analysis. A popular approach to building sentiment analysis models is to use bag-of-words to transform documents into vectors in which each word is assigned a score.
And that’s the brief idea behind a Bag of Words (BoW) model. Hopefully you now have a basic understanding of it.
Drawbacks of using a Bag-of-Words (BoW) Model
In the above example, we have seen vectors of length 12. However, we will start facing issues when we add new sentences to our dataset. Let’s see why:
- Firstly, if the new sentences contain new words, our vocabulary grows, and the length of every vector grows with it.
- Additionally, most entries in the vectors will be 0s, which gives us a sparse matrix that wastes memory and computation, something we would like to avoid.
- And finally, we do not retain any information about the grammar of the sentences or the ordering of the words in the text.
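The three drawbacks listed above are easy to see in a small experiment. The sketch below is only illustrative: the extra review and the two "dog" sentences are invented for this example, and the `bow_matrix` helper is a hypothetical name, not part of any library.

```python
from collections import Counter


def tokenize(text):
    # Assumption: lowercase plus whitespace split, fine for punctuation-free text.
    return text.lower().split()


def bow_matrix(documents):
    # Build the vocabulary (first-appearance order) and one count vector per document.
    vocabulary = []
    for document in documents:
        for word in tokenize(document):
            if word not in vocabulary:
                vocabulary.append(word)
    rows = []
    for document in documents:
        counts = Counter(tokenize(document))
        rows.append([counts[word] for word in vocabulary])
    return vocabulary, rows


reviews = [
    "This novel is very humorous and long",
    "This novel is not humorous but is interesting",
    "This novel is factual based and interesting",
]

# Drawback 1: a new document with unseen words makes every vector longer.
new_review = "A slow plot but wonderful characters"  # made-up review
vocab_before, _ = bow_matrix(reviews)
vocab_after, matrix_after = bow_matrix(reviews + [new_review])
print("vector length before:", len(vocab_before), "after:", len(vocab_after))

# Drawback 2: most entries are zero, so the matrix is sparse.
zeros = sum(row.count(0) for row in matrix_after)
total = len(matrix_after) * len(vocab_after)
print(f"zero entries: {zeros} of {total}")

# Drawback 3: word order is lost, so different sentences can share one vector.
_, order_rows = bow_matrix(["the dog bit the man", "the man bit the dog"])
print("same vector:", order_rows[0] == order_rows[1])
```

With a thousand real reviews instead of four toy ones, the vocabulary would run into the thousands of words while each review still uses only a handful, so the share of zeros would be far higher.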
Do you love programming, or want to learn how to implement a bag of words model? If yes, then you are on the right track: read the article Implementation of Bag of Words using Python.
Term Frequency-Inverse Document Frequency
To overcome some of these drawbacks of bag of words, we turn to the TF-IDF vector of a document. Formally, TF-IDF is a numerical statistic intended to reflect how important a word is to a document in a collection or corpus. So, what is Term Frequency?
In simple words, it is a measure of how frequently a term, t, appears in a document, d:
$$\mathrm{tf}_{t,d} = \frac{n_{t,d}}{\text{number of terms in the document } d}$$
Here, $n_{t,d}$ in the numerator is the number of times the term t appears in the document d. Thus, each document and term pair has its own TF value.
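So, what is IDF now? It is a measure of how informative a term is across the whole corpus: the fewer documents a term appears in, the higher its score. We need the IDF value because TF alone is not sufficient to understand the importance of words; common words such as ‘this’ or ‘is’ occur in every document yet tell us little about any one of them. In mathematical notation:
$$\mathrm{idf}_{t} = \log\left(\frac{\text{number of documents}}{\text{number of documents containing the term } t}\right)$$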
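As a quick check of the IDF formula, here is a minimal sketch that computes it for every word in our three reviews. Note the assumptions: the formula above does not fix the base of the logarithm, so base 10 is used here purely as an illustrative choice, and the function name `inverse_document_frequency` is made up for this example.

```python
import math

reviews = [
    "This novel is very humorous and long",
    "This novel is not humorous but is interesting",
    "This novel is factual based and interesting",
]

# Each document is reduced to its set of unique lowercase words.
documents = [set(review.lower().split()) for review in reviews]
vocabulary = sorted(set().union(*documents))


def inverse_document_frequency(term, documents):
    # Number of documents that contain the term at least once.
    # Only call this for terms that occur in the corpus, otherwise this is 0.
    document_frequency = sum(1 for document in documents if term in document)
    # Assumption: log base 10; the article's formula leaves the base open.
    return math.log10(len(documents) / document_frequency)


idf = {term: inverse_document_frequency(term, documents) for term in vocabulary}

# Print terms from least to most informative.
for term, value in sorted(idf.items(), key=lambda item: item[1]):
    print(f"{term:12s} {value:.3f}")
```

Words that occur in all three reviews (‘this’, ‘novel’, ‘is’) get an IDF of exactly zero, while words that occur in only one review get the highest value.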
In short, Term Frequency scores how frequent the word is in the current document, whereas Inverse Document Frequency scores how rare the word is across documents. The TF-IDF score of a term in a document is the product of the two: $\mathrm{tfidf}_{t,d} = \mathrm{tf}_{t,d} \times \mathrm{idf}_{t}$.
We will again use the previous novel reviews to show how to calculate the TF for Review 2: This novel is not humorous but is interesting. Here,
Step 1: Words in Review 2: ‘This’, ‘novel’, ‘is’, ‘not’, ‘humorous’, ‘but’, ‘is’, ‘interesting’. The number of words in Review 2 = 8.
Step 2: TF for the word ‘this’ = (number of times ‘this’ appears in review 2)/(number of terms in review 2) = 1/8
Step 3: Similarly, we can calculate the term frequencies for all the terms and all the reviews in this manner:
Term | Review 1 | Review 2 | Review 3 | TF (Review 1) | TF (Review 2) | TF (Review 3) |
this | 1 | 1 | 1 | 1/7 | 1/8 | 1/7 |
novel | 1 | 1 | 1 | 1/7 | 1/8 | 1/7 |
is | 1 | 2 | 1 | 1/7 | 1/4 | 1/7 |
very | 1 | 0 | 0 | 1/7 | 0 | 0 |
humorous | 1 | 1 | 0 | 1/7 | 1/8 | 0 |
and | 1 | 0 | 1 | 1/7 | 0 | 1/7 |
long | 1 | 0 | 0 | 1/7 | 0 | 0 |
not | 0 | 1 | 0 | 0 | 1/8 | 0 |
but | 0 | 1 | 0 | 0 | 1/8 | 0 |
interesting | 0 | 1 | 1 | 0 | 1/8 | 1/7 |
factual | 0 | 0 | 1 | 0 | 0 | 1/7 |
based | 0 | 0 | 1 | 0 | 0 | 1/7 |
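The table above can be reproduced with a short script. This is a minimal sketch: the `term_frequencies` helper is a name invented for this example, and `fractions.Fraction` is used only so that the output matches the exact fractions in the table instead of rounded decimals.

```python
from collections import Counter
from fractions import Fraction

reviews = {
    "Review 1": "This novel is very humorous and long",
    "Review 2": "This novel is not humorous but is interesting",
    "Review 3": "This novel is factual based and interesting",
}


def term_frequencies(text):
    # TF = (times the term appears in the document) / (number of terms in the document).
    tokens = text.lower().split()
    counts = Counter(tokens)
    return {term: Fraction(count, len(tokens)) for term, count in counts.items()}


tf = {name: term_frequencies(text) for name, text in reviews.items()}

for name, scores in tf.items():
    print(name, {term: str(value) for term, value in scores.items()})
```

Terms that do not occur in a review are simply absent from that review's dictionary, which corresponds to the 0 entries in the table.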
The next question is: what advantages does TF-IDF offer over a plain bag of words model? TF-IDF vectors are easy to compute, and they give us a basic metric for extracting the most descriptive terms in a document.
Moreover, we can easily compute the similarity between two documents using this technique.
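To illustrate the similarity point, the sketch below builds TF-IDF vectors for our three reviews and compares them pairwise. Two things here are assumptions rather than anything stated in this article: cosine similarity is picked as the similarity measure because it is a common choice for TF-IDF vectors, and the logarithm is taken in base 10. The function names are made up for this example.

```python
import math
from collections import Counter

reviews = [
    "This novel is very humorous and long",
    "This novel is not humorous but is interesting",
    "This novel is factual based and interesting",
]
tokenized = [review.lower().split() for review in reviews]
vocabulary = sorted({word for tokens in tokenized for word in tokens})


def tfidf_vector(tokens, corpus, vocabulary):
    # One TF-IDF score per vocabulary term: tf(t, d) * idf(t).
    counts = Counter(tokens)
    vector = []
    for term in vocabulary:
        tf = counts[term] / len(tokens)
        document_frequency = sum(1 for document in corpus if term in document)
        idf = math.log10(len(corpus) / document_frequency)  # assumed base 10
        vector.append(tf * idf)
    return vector


def cosine_similarity(u, v):
    # Cosine of the angle between two vectors; 0.0 if either vector is all zeros.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


vectors = [tfidf_vector(tokens, tokenized, vocabulary) for tokens in tokenized]

similarities = {}
for i in range(len(vectors)):
    for j in range(i + 1, len(vectors)):
        similarities[(i + 1, j + 1)] = cosine_similarity(vectors[i], vectors[j])
        print(f"Review {i + 1} vs Review {j + 1}: {similarities[(i + 1, j + 1)]:.3f}")
```

Because ‘this’, ‘novel’ and ‘is’ have an IDF of zero, the similarity between two reviews comes only from the more distinctive words they share, such as ‘humorous’ or ‘interesting’.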
But this form of vectorization has some disadvantages too. Since TF-IDF is built on the bag-of-words (BoW) model, it does not capture word position in the text, semantics, or co-occurrence across documents.
This is why TF-IDF is useful only as a lexical-level feature: unlike topic models and word embeddings, it cannot capture semantics.
The Ending Words
From the above sections, we can conclude that Bag of Words is simply a tally of how many times each word appears in a document. With TF-IDF, words are instead given weights: TF-IDF measures relevance, not just frequency.
That is, word counts are replaced with TF-IDF scores computed across the whole dataset. Bag of Words is very easy to interpret, but TF-IDF often performs better in machine learning models. Still, both techniques have many limitations. This is where learned word embedding techniques such as Word2Vec (with its Continuous Bag of Words (CBOW) and Skip-gram variants), GloVe and fastText come into play.
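To make the count-versus-weight contrast concrete, here is a small sketch that prints both scores side by side for Review 2. As before, it is only an illustration under stated assumptions: whitespace tokenization and a base-10 logarithm are choices made for this example.

```python
import math
from collections import Counter

reviews = [
    "This novel is very humorous and long",
    "This novel is not humorous but is interesting",
    "This novel is factual based and interesting",
]
tokenized = [review.lower().split() for review in reviews]

target = tokenized[1]  # Review 2
counts = Counter(target)

tfidf = {}
for term, count in counts.items():
    tf = count / len(target)
    document_frequency = sum(1 for document in tokenized if term in document)
    tfidf[term] = tf * math.log10(len(tokenized) / document_frequency)  # assumed base 10

# Sort by TF-IDF weight, highest first, and show the raw count next to it.
for term in sorted(counts, key=tfidf.get, reverse=True):
    print(f"{term:12s} count={counts[term]}  tf-idf={tfidf[term]:.3f}")
```

The word ‘is’ has the highest raw count in Review 2, yet its TF-IDF weight is zero because it appears in every review, while ‘not’ and ‘but’, which appear only in this review, rise to the top.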
Contributed by: Ram Tavva