Natural Language Processing Using Python & NLTK

What is consciousness? Could an artificial machine truly think? For many, these have been the central questions about the future of Artificial Intelligence (AI). But British computer scientist Alan Turing chose to set all of these questions aside in favour of a much simpler one: can a computer talk like a human? That question led to the idea of measuring the intelligence of machines, famously known as the Turing Test.

In his 1950 paper, “Computing Machinery and Intelligence”, Turing proposed the following game. A human judge holds a text conversation with unseen players and evaluates their responses. To pass the test, a machine must be able to substitute for one of the players without substantially changing the results. In other words, a machine passes if it plays well enough to fool the human judge into believing it is human. In practice, machines soon found that clever tricks fooled the judge more readily than raw computational power.

The first program to succeed with some claim to passing was ELIZA. With a fairly short script, it managed to mislead people by mimicking a psychotherapist and encouraging them to keep talking. And that is how the spotlight first fell on Natural Language Processing.

Communication that comes naturally to humans is nonetheless hard for machines: they face huge amounts of unstructured data, a lack of formal standards and rules, and an absence of real-world context or intent. As AI gets more sophisticated, so will Natural Language Processing (NLP). And while the terms AI and NLP might conjure images of advanced robots, there are already basic examples of NLP at work in our daily lives.

Here are a few prominent examples.

Email Filters

The most basic and earliest application is spam filtering (separating spam from non-spam mail). Gmail goes further and classifies emails into one of three categories (Primary, Social, Promotions) based on their contents. This helps users keep their inboxes manageable, with important and relevant emails in front.

Smart Assistants

Smart assistants like Apple’s Siri, Amazon’s Alexa and Google’s “OK Google” identify patterns in speech, recognize voices, infer meaning and provide a response. The interactions grow more personal as the assistants get to know more about us. As a New York Times article, “Why We May Soon Be Living in Alexa’s World,” explained: “Something bigger is afoot. Alexa has the best shot of becoming the third great consumer computing platform of this decade.”

Search Results

Search engines use NLP to surface relevant results based on similar search behaviours or user intent, letting people find what they need without being experts in the topic. For instance, Google does not just predict popular searches from users’ queries; it looks at the whole picture and works out what we are trying to say rather than matching the exact search words.

Predictive Texts

Any smartphone user has come across autocorrect, autocomplete and predictive text, which anticipate what we want to say based on what we type, completing the word or suggesting a relevant one to keep the overall message sensible and meaningful. Predictive text adapts to a user’s personal language the more it is used.

There are many more examples: language translation, digital phone calls (Google Assistant making a hair appointment), question answering systems, and NLP also lends a hand in data analysis, data visualization, text analytics and so on.

Broadly, NLP techniques have evolved through three phases:
  1. Symbolic NLP
  2. Statistical NLP
  3. Neural NLP

Symbolic NLP: In the early days, many language systems were designed using symbolic NLP, i.e. hand-coded sets of rules assisted by dictionary lookup, such as writing grammars or devising heuristic rules for stemming. However, systems that can automatically learn rules produce more accurate results. Symbolic methods are still used when insufficient training data exists, for preprocessing (e.g. tokenization) and for post-processing (e.g. knowledge extraction).

Statistical NLP: From the 1980s to the mid-1990s, NLP research started to rely on machine learning paradigms that used statistical inference to automatically learn rules by analysing large corpora of real-world text. Many different classes of ML algorithms have been applied to NLP tasks. These algorithms take as input a large set of features generated from the input data. Increasingly, research focused on statistical models; for example, parts-of-speech (POS) tagging used hidden Markov models, which make soft, probabilistic decisions by attaching real-valued weights to each input feature. Such models have the advantage that they can express the relative certainty of many different possible answers rather than only one, producing more reliable results when the model is included as a component of a larger system.

Neural NLP: The major drawback of statistical NLP is the need for elaborate feature engineering. This pushed researchers towards deep-learning-based neural network approaches that can handle sequence-to-sequence transformations. A popular technique is the use of word embeddings to capture the semantics of words.


Natural Language Processing is divided into two parts:

  1. Natural Language Understanding (NLU)
  2. Natural Language Generation (NLG)

Natural Language Understanding: The process of enabling intelligent systems to understand natural language input via text or speech.

Understanding a piece of text typically involves the following levels of analysis.

Lexical Analysis: With lexical analysis, we divide a whole chunk of text into paragraphs, sentences, and words. It involves identifying and analyzing words’ structure.

Syntactic Analysis: Syntactic analysis involves the analysis of words in a sentence for grammar and arranging words in a manner that shows the relationship among the words.

Semantic Analysis: Semantic analysis draws the exact meaning for the words, and it analyzes the text meaningfulness.

Discourse Analysis: Discourse analysis (or discourse integration) takes into account the context of the text: the meaning of any sentence depends on the meaning of the sentences that come before it.

Pragmatic Analysis: Pragmatic analysis deals with overall communication and interpretation of language. It deals with deriving meaningful use of language in various situations.

Natural Language Generation: The process of producing meaningful phrases and sentences in the form of natural language from some internal representation.

Generating language typically involves the following stages.

Discourse Generation: The process whose input is the communication goal and whose output is the discourse, often in the form of a content tree.

Sentence Planning: It involves surface realization or linearization according to grammar.

Lexical Choice: It involves choosing the content words (nouns, verbs, adjectives, and adverbs) in a generated text.

Sentence Structuring: The process of creating the sentence text, which should be correct according to the rules of syntax.

Morphological Generation: The final structuring step, which may involve correcting discrepancies such as tense or gender agreement with respect to the entity, situation, etc.

There are many open-source NLP libraries, such as Apache OpenNLP, Stanford NLP and MALLET, that provide the algorithmic building blocks of NLP for real-world applications. A few easy-to-use Python frameworks for NLP are NLTK, spaCy, Gensim and TextBlob.

“an amazing library to play with natural language.”

The Natural Language Toolkit (NLTK) is a Python library that provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along with text processing libraries for classification, tokenization, stemming, tagging, parsing, sentiment analysis and more. It also provides packages for building chatbots, wrappers for industrial-strength NLP libraries, and an active discussion forum. It is available for Windows, Mac OS X and Linux.

The code below was executed in Google Colaboratory and walks through the preprocessing of text in NLP.

For the various text processing steps in NLP, we need to import some libraries. In this case we are going to use NLTK and perform various operations on the text with it.

The NLTK data package includes punkt, a pre-trained sentence tokenizer for English; averaged_perceptron_tagger, a pre-trained parts-of-speech tagger for English; and stopwords, a list of 179 English stop words such as “I”, “a” and “the”, which add little meaning to the text when analysed.
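The original Colab cells are not reproduced in this page, so the snippet below is a minimal sketch of the setup just described, assuming a standard NLTK installation:

    import nltk

    # Download the data packages used throughout this walkthrough.
    nltk.download('punkt')                       # pre-trained sentence/word tokenizer models
    nltk.download('averaged_perceptron_tagger')  # pre-trained parts-of-speech tagger
    nltk.download('stopwords')                   # list of English stop words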

Tokenization: The process of splitting a piece of text into smaller units called tokens. Tokens can be sentences, words, sub-words or characters. Tokenization is performed on the corpus to obtain tokens, which are then used to build a vocabulary: the set of unique tokens in the corpus.

Code Explanation: The sent_tokenize function is imported from the NLTK library. A variable text is initialized with three sentences, passed to sent_tokenize, and the result is printed. The function splits the text into sentences at sentence-ending punctuation, as you can see in the output.
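Since the original screenshot is not shown here, this is an illustrative sketch; the sample sentences are made up for demonstration:

    from nltk.tokenize import sent_tokenize

    text = ("Natural Language Processing is fascinating. "
            "NLTK makes it approachable. It is widely used in teaching.")
    print(sent_tokenize(text))
    # ['Natural Language Processing is fascinating.',
    #  'NLTK makes it approachable.',
    #  'It is widely used in teaching.']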

Code Explanation: The word_tokenize function is imported from the NLTK library. A variable text is initialized with three sentences, passed to word_tokenize, and the result is printed. The function breaks the text into word tokens, treating punctuation marks as separate tokens, as you can see in the output.
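A corresponding sketch for word tokenization, using the same illustrative text:

    from nltk.tokenize import word_tokenize

    text = ("Natural Language Processing is fascinating. "
            "NLTK makes it approachable. It is widely used in teaching.")
    print(word_tokenize(text))
    # ['Natural', 'Language', 'Processing', 'is', 'fascinating', '.', 'NLTK', ...]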

Code Explanation: The TreebankWordTokenizer class is imported from the NLTK library. A variable text is initialized with four sentences, passed to the tokenizer, and the result is printed. This tokenizer assumes the text has already been segmented into sentences and splits each word off as a token. It splits off most punctuation, separates commas and single quotes when followed by whitespace, and splits off periods that appear at the end of a line, as you can see in the output.
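A sketch of the Treebank tokenizer on an illustrative sentence (the original four-sentence example is not reproduced here):

    from nltk.tokenize import TreebankWordTokenizer

    tokenizer = TreebankWordTokenizer()
    sentence = "Good muffins cost $3.88 in New York. Please buy me two of them."
    print(tokenizer.tokenize(sentence))
    # Because the tokenizer assumes pre-segmented sentences, only the final
    # period is split off as a token; the one after 'York' stays attached.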

The string methods lower() and upper() convert the given text to lower case and upper case respectively.
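For example:

    text = "Natural Language Processing"
    print(text.lower())   # natural language processing
    print(text.upper())   # NATURAL LANGUAGE PROCESSING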

Stemming: The process of reducing words to their word stem or root form. For example, “connection”, “connected” and “connecting” reduce to the common stem “connect”. The objective is to reduce related words to the same stem even if the stem is not a dictionary word. Stemming does not preserve the context of a word, since it operates on one word at a time, and it suffers from two main kinds of error.

Over-stemming: When two words with different stems are stemmed to the same root. This is also known as a false positive. For example, “universe”, “universal” and “university”, though different words, are all stemmed to “univers”, which is not correct.

Under-stemming: When two words that should be stemmed to the same root are not. This is also known as a false negative. For example, “alumnus”, “alumni” and “alumnae”, though they convey the same meaning, are stemmed to different stems.

There are several approaches to stemming, such as the Porter Stemmer, Snowball Stemmer and Lancaster Stemmer. The Porter Stemmer algorithm is considered empirically effective, and the Snowball Stemmer is an improved version of Porter, also known as the Porter2 stemmer.
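A minimal sketch comparing the Porter and Snowball stemmers on the example words used above:

    from nltk.stem import PorterStemmer, SnowballStemmer

    porter = PorterStemmer()
    snowball = SnowballStemmer('english')

    for word in ['connection', 'connected', 'connecting', 'universe', 'university']:
        print(word, '->', porter.stem(word), '|', snowball.stem(word))
    # 'connection', 'connected' and 'connecting' all reduce to 'connect';
    # 'universe' and 'university' both over-stem to 'univers'.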

Lemmatization: Reduces words to their base word, handling inflected forms properly and ensuring that the root word (lemma) belongs to the language. The lemma is the dictionary form of a set of words. Lemmatization identifies the part of speech (POS) of the word and then normalizes it. For example, the word ‘leaves’ without a POS tag would be lemmatized to ‘leaf’, but with a verb tag its lemma becomes ‘leave’.
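An illustrative sketch of the ‘leaves’ example with NLTK’s WordNet lemmatizer (the wordnet data package is an extra download):

    import nltk
    from nltk.stem import WordNetLemmatizer

    nltk.download('wordnet')  # lexical database required by the lemmatizer

    lemmatizer = WordNetLemmatizer()
    print(lemmatizer.lemmatize('leaves'))           # 'leaf'  (defaults to noun)
    print(lemmatizer.lemmatize('leaves', pos='v'))  # 'leave' (as a verb)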

Stopwords Removal: Stopwords are the most commonly used words in English that do not add much meaning to a sentence. They can be safely removed without affecting the meaning of the sentence. For example, “the”, “about”, “to”, “he”.

Import the stopwords module from the NLTK corpus, then compare each token with the stopwords list: if a match is found, remove the token; otherwise keep it in the output.
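A sketch of that filtering step, using an illustrative sentence:

    from nltk.corpus import stopwords
    from nltk.tokenize import word_tokenize

    stop_words = set(stopwords.words('english'))

    text = "He wrote a short note about the meeting to his manager."
    tokens = word_tokenize(text)

    filtered = [t for t in tokens if t.lower() not in stop_words]
    print(filtered)   # stop words such as 'He', 'a', 'about', 'the', 'to', 'his' are dropped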

Parts-Of-Speech Tagging: It helps us identify the tag of each word, whether it is a noun, verb, adjective, etc. NLTK’s default tagger uses the Penn Treebank tag set for English (NN for noun, VB for verb, JJ for adjective, and so on).
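A minimal sketch of POS tagging with NLTK’s default tagger; the full tag list can be printed with nltk.help.upenn_tagset() after downloading the 'tagsets' data package. The sample sentence is illustrative:

    from nltk import word_tokenize, pos_tag

    tokens = word_tokenize("NLTK makes natural language processing easy")
    print(pos_tag(tokens))
    # e.g. [('NLTK', 'NNP'), ('makes', 'VBZ'), ('natural', 'JJ'),
    #       ('language', 'NN'), ('processing', 'NN'), ('easy', 'JJ')]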


Named Entity Recognition: It is used to capture all textual mentions of named entities. A named entity can be a person, place, organization, etc. For example, GPE stands for Geo-Political Entity, ORG for Organization, and so on.
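An illustrative sketch using NLTK’s built-in chunker; the example sentence is made up, and the extra downloads reflect a typical setup:

    import nltk
    from nltk import word_tokenize, pos_tag, ne_chunk

    nltk.download('maxent_ne_chunker')  # pre-trained named-entity chunker
    nltk.download('words')              # word list the chunker relies on

    sentence = "Mark works at Google in New York."
    tree = ne_chunk(pos_tag(word_tokenize(sentence)))
    print(tree)
    # Named entities appear as subtrees labelled PERSON, ORGANIZATION, GPE, ...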

Sentiment Analysis: The process of analysing the sentiment in a text. The sentiment is described by polarity (neg: negative, neu: neutral, pos: positive) and magnitude (the weight assigned to each polarity). Download the VADER lexicon from NLTK. VADER (Valence Aware Dictionary and sEntiment Reasoner) is a model for text sentiment analysis that is sensitive to both the polarity and the intensity of emotion. It is available in the NLTK package and can be applied directly to unlabelled text data.
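A minimal sketch of VADER in NLTK, scoring an illustrative sentence:

    import nltk
    from nltk.sentiment.vader import SentimentIntensityAnalyzer

    nltk.download('vader_lexicon')  # the VADER sentiment lexicon

    sia = SentimentIntensityAnalyzer()
    print(sia.polarity_scores("NLTK is an amazing library to play with natural language!"))
    # {'neg': 0.0, 'neu': ..., 'pos': ..., 'compound': ...}  -- strongly positive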

Hope this article gave you an insight into NLP and some practical exposure to it using NLTK. Try executing all the code yourself. Do follow.

To view the code samples, click Colab Code Contents.
