Generally in machine learning, any raw dataset must first be converted into a format suitable for training an ML model. All the steps involved in this conversion fall under data preprocessing.
For that very reason, in this post we are going to discuss two of the most important data preprocessing techniques: Stop Words Removal and Stemming.
Stop Words Removal:
Stop words are basically a group of commonly used words that barely add any meaning to a sentence. Their presence or absence does not make much difference to the overall meaning of the sentence.
Stop word removal involves getting rid of common articles, pronouns and prepositions such as “and”, “the” or “to” in English. In this process, very common words that provide little or no value to the NLP objective are filtered out of the text before it is processed.
Stop words can be safely ignored by carrying out a lookup in a pre-defined list of keywords, freeing up database space and improving processing time.
We can import the stop word list using NLTK:
from nltk.corpus import stopwords
Stemming:
The process of slicing off the end or the beginning of words with the intention of removing affixes (lexical additions to the root of a word) is called stemming. It can also help normalize minor spelling variations across tokens. If you need to improve speed and performance in an NLP pipeline, stemming is certainly a way to go.
To see its advantages and disadvantages, let us look at an example:
Here, we can see that after stemming, the word 'Playing' boils down to 'Play'. In the case of the word 'News', however, it boils down to 'New', which is an absolute blunder and completely changes the meaning of the word.
So we can say that stemming does not always produce an accurate result.