Tuesday, April 28, 2020

Text Summarizer and its Types!

Hello everyone!!

In this post I am going to explain how we can summarize lengthy news articles, comprehensive reports and study notes using a text summarizer.

In our busy schedules we don't get time to read the full newspaper. Often we collect news by hearing summaries from other people. But it is not always possible to manually summarize news or other text. There is an enormous amount of textual material, and it is only growing every single day.

      

Solution – Implementation of Text Summarizer using NLP.

This is where the awesome concept of text summarization using machine learning helps us out. It solves the one issue that kept bothering us before – now our model can understand the context of the entire text. It is a dream come true for those who can't live without reading the newspaper, and for students who cram before exams.

Let's first understand what text summarization is before looking at how it works. Here is one definition:

"The ideal of automatic summarization work is to develop techniques by which a machine can generate summaries that successfully imitate summaries generated by human beings."
                          — Page 2, Innovative Document Summarization Techniques: Revolutionizing Knowledge Understanding, 2014.

Types of summarizer -
Text summarizers can be divided based on many factors, but here are the two types based on the output of the summarization.

Extraction-based Summarization
The name gives away what this approach does. Here we identify which sentences are most important and carry most of the information; those extracted sentences form our summary. The diagram below illustrates extractive summarization:

Abstraction-based Summarization
Here both NLU (Natural Language Understanding) and NLG (Natural Language Generation) are at work. We try to generate new sentences from the original text by understanding it. The generated sentences are not necessarily present in the original document; they are a paraphrased, condensed version of the whole text. This is why it is generally preferred over extraction-based summarization.


Thank you guys for reading !!
Stay tuned for my next article on how to implement this text summarizer.
If you have any doubts you can comment below.
                                                                                                         - By Ashwini Ghode

Saturday, April 25, 2020

Implementation of Text Summarizer using NLP!

Hello everyone !!

In this post I will explain how I implemented a text summarizer using NLP techniques.
But before the implementation process, let's learn about the core algorithm on which this technique is based, i.e. the TextRank algorithm.

PageRank algorithm
Before getting started with the TextRank algorithm, there is another algorithm we should know – the PageRank algorithm. In fact, it is what inspired TextRank. Here are its fundamentals.





Here the probability of a user moving from one web page to another is calculated and stored in a matrix. Each link out of a page contributes a transition probability of 1 / (the number of links that page contains), and iterating over this matrix gives every page its PageRank score.
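To make this concrete, here is a rough sketch of the idea in Python (the three-page link structure is made up purely for illustration):

import numpy as np

# hypothetical link structure: page 0 links to pages 1 and 2, page 1 links to page 2, page 2 links back to page 0
links = {0: [1, 2], 1: [2], 2: [0]}
n = len(links)

# transition matrix: every outgoing link contributes 1 / (number of links on that page)
M = np.zeros((n, n))
for page, outgoing in links.items():
    for target in outgoing:
        M[target, page] = 1.0 / len(outgoing)

# start with equal scores and repeatedly redistribute them (0.85 is the usual damping factor)
scores = np.full(n, 1.0 / n)
for _ in range(50):
    scores = 0.15 / n + 0.85 * M.dot(scores)

print(scores)  # a higher score means a more "important" page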

From PageRank to TextRank:

  • The first step would be to concatenate all the text contained in the articles
  • Then split the text into individual sentences
  • In the next step, we will find vector representation (word embedding) for each and every sentence
  • Similarities between sentence vectors are then calculated and stored in a matrix
  • The similarity matrix is then converted into a graph, with sentences as vertices and similarity scores as edges, for sentence rank calculation
  • Finally, a certain number of top-ranked sentences form the final summary.
Actual implementation of text summarizer:
1. Import required libraries.
2. Read the data.
3. Split Text into Sentences
 We can use the sent_tokenize( ) function of the NLTK library to do this.
4. Use GloVe word embeddings or Bag of Words.
GloVe word embeddings are vector representations of words. These embeddings will be used to create vectors for our sentences. We could also use the Bag-of-Words or TF-IDF approaches to create features for our sentences, but these methods ignore the order of the words (and the number of features is usually pretty large).
5. Text Preprocessing
So, some basic text cleaning is required: get rid of the stop words (commonly used words of a language – is, am, the, of, in, etc.) present in the sentences.
6. Vector Representation of Sentences
7. Similarity Matrix Preparation
The next step is to find similarities between the sentences, and we can use the cosine similarity approach for this challenge.
8. Applying the TextRank Algorithm.
9. Summary Extraction
Finally, your summarized report is ready. A compact sketch of the whole pipeline is given below.
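Here is a minimal sketch of the pipeline put together. To keep it self-contained it uses TF-IDF sentence vectors instead of GloVe embeddings (both options are described above), along with NLTK, scikit-learn and networkx; the NLTK 'punkt' tokenizer data has to be downloaded once before running it:

import networkx as nx
from nltk.tokenize import sent_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def summarize(text, top_n=3):
    # steps 2-3: split the article into individual sentences
    sentences = sent_tokenize(text)

    # steps 4-6: build a vector for every sentence (TF-IDF here; averaged GloVe vectors are the alternative)
    vectors = TfidfVectorizer(stop_words="english").fit_transform(sentences)

    # step 7: similarity matrix between every pair of sentences
    sim_matrix = cosine_similarity(vectors)

    # step 8: convert the matrix into a graph and run PageRank on it, i.e. TextRank
    scores = nx.pagerank(nx.from_numpy_array(sim_matrix))

    # step 9: the top-ranked sentences, kept in their original order, form the summary
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    return " ".join(sentences[i] for i in sorted(ranked[:top_n]))

Calling summarize() on the text of an article returns the extracted summary as a single string.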

Conclusion:

As a result, we are able to summarize text automatically, with no need for manual summarization. People can keep themselves updated in a short time, and we can manage to read important news articles even with a busy schedule. Inshorts is an innovative news app that converts news articles into a 60-word summary. Most students have the habit of studying just before exams, but of course a vast syllabus cannot be covered in one night manually; a text summarizer is the best option for those students, since it gives them meaningful study notes. Our research papers and project reports can be summarized too.

Text summarizers can be used in fields like media monitoring, newsletters, social media marketing, question answering and bots, and programming languages.

Thanks for reading !!
If you have any doubts, comment below.
                                                                                                             By Ashwini Ghode

Friday, April 10, 2020

Open Problems in NLP

The study of Natural Language Processing (NLP) started somewhere in the 1950s. Ever since, there has been a lot of advancement in this field. It has been in a cycle of constant evolution, which has made it a very powerful tool in the world of Artificial Intelligence.

Despite so much achievement so far, NLP still has a lot of limitations and open problems. Natural Language Understanding is one of the areas in which NLP is lagging behind the most; despite so much advancement, there is no NLP model that can excel like a human being. In this post, we are going to discuss some of these problems.

1. Ambiguity : 

One of the biggest challenges is to understand the meaning of an ambiguous statement (a statement open to different interpretations). NLP models often do not handle ambiguous statements well.

For eg: "John went to the bank" – here the word 'bank' can either refer to the place where money is kept or to a river bank.

2. Synonymy: 

The same idea can often be expressed in sentences where one word is replaced by its synonym without really changing the meaning of the text. But NLP models sometimes fail to understand that in some cases synonyms can also change the meaning of the whole text.

For eg: 'large' and 'big' are synonyms, yet "He is my big brother" cannot be replaced by "He is my large brother".

3. Intention/Sarcasm :

I am honestly quite skeptical about NLP models understanding sarcasm at the level that is so common among humans. It is one of the limitations of NLP: people might sarcastically criticize a product, and the model might interpret it in a completely different way.

4. Language-Resource unavailability:

Even though there are thousands of languages spoken worldwide, there are hardly any data resources available in languages other than English and Chinese.

So, because of this problem, it is fair to say that NLP is really powerful only for languages like English and Chinese.

Now, we have come to the end of the article. Feel free to ask any questions in the comment section below.

Thursday, April 9, 2020

Lexical Analysis & Syntactic Analysis(Parsing)

In this post, I'm going to discuss two of the most important steps which need to be implemented whenever we deal with NLP. They are Lexical Analysis and Syntactic Analysis.




Lexical Analysis:

It basically is a stage in Natural Language Processing(NLP) where the given raw text is divided/segmented into different chunks of words or other units like paragraphs or sentences.


It involves identifying and analyzing the structure of words. The lexicon of a language basically means the collection of words and phrases in that language. Lexical analysis is also known as morphological analysis.



For eg:
" Tom owns an iPhone" could be tokenized into :
Tom
owns
an 
iPhone.
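As a rough sketch, the same tokenization can be done with NLTK's word_tokenize (assuming the 'punkt' tokenizer data has been downloaded):

from nltk.tokenize import word_tokenize

print(word_tokenize("Tom owns an iPhone"))
# ['Tom', 'owns', 'an', 'iPhone']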

Syntactic Analysis(Parsing):

In simple words, this step acts as a grammar checker. It analyses the words in the given sentence for grammar and checks whether the words are arranged in an order that satisfies the relationships among them.

What it basically does is it checks the basic grammar of the sentence.

For eg:
"The College goes to boy"
This sentence is syntactically wrong.
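One simple way to see syntactic analysis in action is to parse sentences against a small grammar with NLTK. The toy grammar below is made up purely for illustration; the first sentence parses, while the second one fails, which is the parser's way of flagging it as ungrammatical:

import nltk

toy_grammar = nltk.CFG.fromstring("""
S -> NP VP
NP -> Det N
VP -> V PP
PP -> P NP
Det -> 'the'
N -> 'boy' | 'college'
V -> 'goes'
P -> 'to'
""")

parser = nltk.ChartParser(toy_grammar)
for sentence in [['the', 'boy', 'goes', 'to', 'the', 'college'],
                 ['the', 'college', 'goes', 'to', 'boy']]:
    trees = list(parser.parse(sentence))
    print(sentence, '->', 'parses' if trees else 'no parse found')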




Discourse integration and pragmatic analysis

In this blog, I am going to explain discourse integration and pragmatic analysis.
Discourse integration
Discourse integration considers the larger context around any smaller part of the natural language (NL) structure.

Natural language is complex and, most of the time, sequences of text depend on the prior discourse.

This concept occurs often in pragmatic ambiguity. This analysis deals with how the immediately preceding sentence can affect the meaning and interpretation of the next sentence.

The concept of discourse integration is often used in NLG (Natural Language Generation) applications such as chatbots, which are developed to deliver generalized AI. In this kind of application, deep learning is commonly used.



Pragmatic Analysis
Pragmatics is the study of how words are used, or the study of signs and symbols.

An example of pragmatics is how the same word can have different meanings in different settings.

Pragmatic Analysis is part of the process of extracting information from text.

Specifically, it's the portion that focuses on taking a structured set of text and figuring out what the actual meaning was.

Pragmatic analysis refers to a set of linguistic and logical tools with which analysts develop systematic accounts of discursive political interactions.




Sentiment Analysis

Sentiment Analysis

Sentiment analysis is the interpretation and classification of emotions (positive, negative and neutral) within text data using text analysis techniques. Sentiment analysis allows businesses to identify customer sentiment toward products, brands or services in online conversations and feedback.
With the recent advances in deep learning, the ability of algorithms to analyse text has improved considerably. Creative use of advanced computing techniques can be a good tool for doing in-depth research. We believe it is important to classify incoming customer conversations about a brand based on the following:
1. Key aspects of a brand's product and service that customers care about.
2. Users' underlying intentions and reactions concerning those aspects.
Sentiment analysis is also the most common text classification tool: it analyses incoming messages, social posts, forum comments, etc. Closely related tasks are known as intent analysis and profanity analysis.

What does it do ?

A sentiment analysis model detects the polarity within a text (positive or negative). Understanding people's emotions is incredibly important for any business, since users can express themselves in reviews more freely than ever.
For example: a business owner used sentiment analysis on the reviews given by customers and found that the majority of the customers were happy with his product, as you can see in the image below.


Now look at the figure below; as you can see, it can also tell the emotion attached to comments by analyzing the sentences.


Types of Sentiment Analysis

If polarity is very important to the owner, then the emotions are usually graded like this:
1. Very Good
2. Good
3. Neutral
4. Bad
5. Very Bad.
Here, Very Good = 10 points and Very Bad = 1 point.
This same scale is used in one of our college feedback forms, which is pretty good, as it directly tells the guardians and faculty members whether we are happy or not...

How does this work?

You might be wondering by now: how does this work?
The process is pretty simple:
1. Break each text document down into its component parts (sentences, phrases, tokens and parts of speech) (Bag of Words).
2. Identify each sentiment-bearing phrase and component (dictionary meaning and in-context meaning) (Lemmatisation).
3. Assign a sentiment score to each phrase and component (-1 to +1; it can be any range, for example 0 to 10 as above).
4. Optional: combine scores for multi-layered sentiment analysis (for systems which have multiple outputs like "Very Good" and "Good").

Based on these points, train the model. There are 3 types of systems:
1. Rule-based systems that perform sentiment analysis based on a set of manually crafted rules.
2. Automatic systems that rely on machine learning techniques to learn from data.
3. Hybrid systems that combine both rule-based and automatic approaches.
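As a small example of the rule-based kind, NLTK ships with the VADER sentiment lexicon. A minimal sketch (the two reviews are made up, and the 'vader_lexicon' data has to be downloaded once):

from nltk.sentiment.vader import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()
reviews = ["The product is absolutely great, I love it!",
           "Terrible quality, it broke after two days."]
for review in reviews:
    scores = analyzer.polarity_scores(review)
    print(review, '->', scores['compound'])  # the compound score runs from -1 (negative) to +1 (positive)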

Where do we use it?

1. We can use it to analyze customer reviews in any business.
2. As of now, the majority of social sites like Facebook and Twitter use this to understand user behaviour on the site.


There are many databases available online to which you can add some of your own words and train on. There are also ready-made pre-trained open-source models on GitHub which you can use for your project.


Thank you for reading this.
If you have any doubts, please feel free to ask in the comments below. We will get back to you ASAP.
By Kapil Kadadas






Topic Modelling

Hello Guys, here we are back again with a new topic known as Topic Modelling

Topic Modelling

Basically, topic modeling is an unsupervised machine learning technique that scans a collection of documents, detects word and phrase patterns within them, and automatically clusters them into word groups and similar expressions that best characterize the collection.
Topic models are also called probabilistic topic models, which refers to statistical algorithms for discovering the latent semantic structures of an extensive body of text. In the information age, the amount of written material we encounter each day is simply beyond our processing capacity. Topic models can help to organize the text and offer insights that let us understand large collections of unstructured text.


How does it work ?

Topic modeling involves counting words and grouping similar word patterns to infer topics within unstructured data. Let's say you're a software company and you want to know what customers are saying about particular features of your product. Instead of spending hours going through piles of feedback, trying to deduce which texts are talking about your topics of interest, you can analyze them with a topic modeling algorithm.
By detecting patterns in these words, such as their frequency, the topic model clusters feedback that is similar, along with the words and phrases that appear most often. Also, as this is an unsupervised technique, no labelled training data is required.
For example:
"The nice thing about Eventbrite is that it's free to use as long as you're not charging for the event. There's a fee if you're charging for the event – 2.5% plus a $0.99 transaction fee."
Taking the above sentence and identifying words and phrases like "free to use", "charging", "fee", "$0.99" and so on, the topic model can group this review with many other reviews which may or may not talk about pricing.
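A minimal sketch of this with the gensim library (the tiny, pre-tokenized 'documents' below are made up just to show the API):

from gensim import corpora
from gensim.models import LdaModel

docs = [['free', 'event', 'fee', 'charge', 'transaction'],
        ['ticket', 'page', 'setup', 'event', 'design'],
        ['fee', 'price', 'charge', 'payment'],
        ['page', 'design', 'banner', 'setup']]

dictionary = corpora.Dictionary(docs)                 # the vocabulary
corpus = [dictionary.doc2bow(doc) for doc in docs]    # bag-of-words counts per document

# unsupervised: we only ask for 2 topics, no labels are given
lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=2, passes=20, random_state=0)
for topic in lda.print_topics():
    print(topic)  # each topic comes out as a weighted mix of words, e.g. pricing words vs page-setup words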
















Where is Topic Modelling used?

Mostly used in NLP.
1. To sort and filter reviews about a product according to user preferences.
2. To highlight the important points in a document.

Algorithms:

In practice, researchers attempt to fit appropriate model parameters to the data corpus using one of several heuristics for a maximum likelihood fit. A recent survey by Blei describes this suite of algorithms. Several groups of researchers, starting with Papadimitriou et al., have attempted to design algorithms with provable guarantees.

Libraries which we use in our code :

There are many libraries and software packages which you can use; here I will mention a few of them.
1. BigARTM. site : (https://github.com/bigartm/bigartm)
2. Mallet. site :  (http://mallet.cs.umass.edu/)
3. Stanford Topic Modelling Toolkit. site : (http://nlp.stanford.edu/software/tmt/tmt-0.4/)
etc

Thank you for reading this.
If you have any doubts, please feel free to ask in the comments below. We will get back to you ASAP.
By Kapil Kadadas


Semantic Analysis

Hello everyone! In this post I am going to explain semantic analysis, one of the steps in NLP.
Semantic analysis describes the process of understanding natural language (the way that humans communicate) based on meaning and context.
In compilers, semantic analysis is the task of ensuring that the declarations and statements of a program are semantically correct, i.e. that their meaning is clear and consistent with the way in which control structures and data types are supposed to be used.
The semantic analysis of natural language content starts by reading all of the words in the content to capture the real meaning of the text.
It analyzes the context in the surrounding text and the text structure to accurately disambiguate the proper meaning of words that have more than one definition.
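One classic technique for that disambiguation step is the Lesk algorithm, which NLTK exposes directly. A minimal sketch (the WordNet and 'punkt' data must be downloaded once, and the sense that gets picked depends on the surrounding words):

from nltk.wsd import lesk
from nltk.tokenize import word_tokenize

sentence = "He went to the bank to deposit his money"
sense = lesk(word_tokenize(sentence), 'bank')
print(sense, '-', sense.definition())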

Semantic technology processes the logical structure of sentences to identify the most relevant elements in text.
Semantic analysis is also the third phase of a compiler, and the checks listed below come from that setting.

Semantic analysis implementation steps

Functions of Semantic Analysis

Type Checking –
Ensures that data types are used in a way consistent with their definition.

Label Checking –
Ensures that all labels referenced in the program are defined.

Flow Control Check –
Checks that control structures are used in a proper manner (for example: no break statement outside a loop).

Application of Semantic analysis:
The application of semantic analysis methods generally streamlines organizational processes of any knowledge management system.
Academic libraries often use a domain-specific application to create a more efficient organizational system.

Wednesday, April 8, 2020

Stop Words Removal & Stemming

Generally in machine learning, any raw dataset needs to be converted into a certain format before it can finally be used to train an ML model. All of these steps come under data preprocessing.

That's the very reason why, in this post, we are going to discuss two of the most important text pre-processing techniques: STOP WORDS REMOVAL & STEMMING.

Stop Words Removal:

Stop words are basically a group of commonly used words that barely add any meaning to a sentence. The presence or absence of these words does not make much difference to the overall meaning of the sentence.

Stop word removal includes getting rid of common articles, pronouns and prepositions such as "and", "the" or "to" in English. In this process, very common words that appear to provide little or no value to the NLP objective are filtered out and excluded from the text to be processed.

Stop words can be safely ignored by carrying out a lookup in a pre-defined list of keywords, freeing up database space and improving processing time.



We can import Stopwords using NLTK:

from nltk.corpus import stopwords
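A minimal sketch of using that list to filter a sentence (assuming the NLTK 'stopwords' and 'punkt' data have already been downloaded):

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

stop_words = set(stopwords.words('english'))
sentence = "This is an example showing off stop word filtration"
filtered = [w for w in word_tokenize(sentence) if w.lower() not in stop_words]
print(filtered)  # roughly ['example', 'showing', 'stop', 'word', 'filtration']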


Stemming:


The process of slicing off the end or the beginning of words with the intention of removing affixes (lexical additions to the root of the word) is called stemming. It can also be used to correct spelling errors in the tokens. If you want to improve speed and performance in an NLP model, then stemming is certainly the way to go.



In order to look at its advantages and disadvantages, let us consider the following example:




Here, we can see that after stemming, the word 'Playing' boils down to 'Play'. Whereas in the case of the word 'News', it boils down to 'New', which is an absolute blunder and completely changes the meaning of the word.


So we can say Stemming does not always provide an accurate result.
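A quick sketch with NLTK's Porter stemmer shows both sides of this:

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ['playing', 'played', 'plays', 'news']:
    print(word, '->', stemmer.stem(word))

# 'playing', 'played' and 'plays' all collapse to 'play', which is what we want;
# depending on the stemmer variant, 'news' may be chopped down to 'new', which changes the meaning entirely.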



Tuesday, April 7, 2020

Lemmatisation and What is it?

Hello Guys.
Today we are going to see what Lemmatisation is.

Lemmatisation.

Lemmatisation (or lemmatization) in linguistics is the process of grouping together the inflected forms of a word so that they can be analysed as a single item, identified by the word's lemma, or dictionary form. In computing, lemmatisation is the algorithmic process of determining the lemma of a word based on its intended meaning.
Lemmatisation is widely used in NLP: the machine goes through all the words and reduces each one to its base form, which keeps the same meaning when put back into a sentence.

So what is the dictionary form of a word?

For example, in English, the verb 'to care' may appear as 'care', 'caring' or 'cared'. The base form, 'care', that one might look up in a dictionary, is called the lemma of the word.

Why Lemmatisation ?

The goal of both stemming (similar to lemmatization, please refer to the post by Arman for more details) and lemmatization is to reduce inflectional forms, and sometimes derivationally related forms, of a word to a common base form.
For Example :

As you can see here, all the words originate from the base word 'trouble', so the lemmatized form of trouble, troubling, troubled and troubles is the same: TROUBLE.

Similarly here, the lemmatised word is 'goose'.






Algorithms

A trivial way to do lemmatization is by simple dictionary lookup, which can be lengthy and complex, but it results in accurate words that have the proper meaning in context.
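NLTK's WordNet-based lemmatizer is essentially such a dictionary lookup. A minimal sketch (the 'wordnet' data must be downloaded once, and the part-of-speech hint matters):

from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize('troubling', pos='v'))  # expected: trouble
print(lemmatizer.lemmatize('troubles'))            # expected: trouble
print(lemmatizer.lemmatize('geese'))               # expected: goose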

Thank you for reading this.
If you have any doubts, please feel free to ask in the comments below.
By Kapil Kadadas

Algorithms and techniques used in NLP

In this blog, I am going to explain about techniques used in natural language processing.
1. Bag of Words
2. Tokenization

1. Bag of Words
Bag of Words is a method to extract features from text documents. These features can be used for training machine learning algorithms. It creates a vocabulary of all the unique words occurring in the documents of the training set. The approach is very simple and flexible, and can be used in a myriad of ways for extracting features from documents.



Bag of words is a representation of text that describes the occurrence of words within a document. It involves two things.

1. A Vocabulary of known  words.
2. A measure of the presence of known words.
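A minimal sketch with scikit-learn's CountVectorizer shows both parts (the two toy documents are made up, and the feature-name method may differ slightly between scikit-learn versions):

from sklearn.feature_extraction.text import CountVectorizer

docs = ["the cat sat on the mat", "the dog sat on the log"]
vectorizer = CountVectorizer()
bow = vectorizer.fit_transform(docs)       # the measure of presence: word counts per document
print(vectorizer.get_feature_names_out())  # the vocabulary of known words
print(bow.toarray())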

2. Tokenization

In NLP, tokenization is the process of breaking a stream of text into smaller units called tokens, usually words, sub-words or sentences, while retaining all the essential information of the original text.

These tokens are the basic units that later steps such as stop word removal, stemming, lemmatisation and vectorisation operate on; the sent_tokenize() and word_tokenize() functions of NLTK used in the earlier posts are examples of tokenizers.

(The same word is also used in data security, where tokenization means replacing sensitive data such as credit card numbers with non-sensitive identification symbols, but that is a different concept.)




Wednesday, April 1, 2020

WHAT IS NLP AND WHY IT MATTERS

Hello everyone!!

In this post, you will find out what natural language processing means and why it is so important.

NLP, or Natural Language Processing, is defined as the automatic manipulation of natural language, like text and speech, by software.

It is a field of artificial intelligence. NLP gives machines the ability to read, understand and derive meaning from human language. It is a discipline and technique that focuses on the interaction between machine learning and human language, and it is scaling to many industries. In other words, it is an approach to process, analyze and understand the large amounts of text data that are being generated every day.



What Is a Virtual Assistant?
In addition to human language understanding, this field is also occupied with teaching computers to generate human language, so that they can make you think you are talking with a real human being. I'm sure you have already experienced virtual assistants.




Now the question arises: why is NLP important?

"The computer is incredibly fast, accurate, and stupid. Man is unbelievably slow, inaccurate, and brilliant. The marriage of the two is a force beyond calculation."
– Leo Cherne


Human knowledge is limited, while today, in the big data era, the computer has access to almost unlimited knowledge. So what if we taught computers to understand us human beings?
Of course, some of these applications already exist, and they use NLP to do it.

1. Handling large volumes of text data




Let's face it. There are billions of pieces of text data being generated every day.
Look around.

In-app messages (WhatsApp, WeChat, Telegram etc.), social media (Facebook, Instagram, Twitter, YouTube etc.), forums (Quora, Reddit etc.), blogs, news publishing platforms, Google searches and many other channels.

All these channels are constantly generating large amounts of text data every minute of every day.
Because of the large volumes of text data, as well as the highly unstructured nature of the data sources, we can no longer use the common manual approach to understand the text, and this is where NLP comes in.

With the big data technology, NLP has entered the mainstream as this approach can now be applied to handle large volumes of text data via cloud/distributed computing at an unprecedented speed.


2. Structuring highly unstructured data sources

Human language is astoundingly complex and diverse.  When we speak, most of the time we have regional accents, and  sometimes we mumble, stutter and borrow terms from other languages too. 

NLP is important because it helps resolve ambiguity in language and adds useful numeric structure to the data for many downstream applications, such as speech recognition or text analytics. 

Thanks for reading.
Hope you find my article helpful.
If you have any doubts, ask your questions in the comments below and I will do my best to answer.
By Ashwini Ghode


Components of NLP

Hello everyone!! In this post I am going to explain what NLU and NLG are, how the three of them work together, and also the techniques used in NLP...