Python’s Natural Language Tool Kit (NLTK) Tutorial part - 1 (2024)


It is quite possible that you have heard of the term ‘Natural Language Processing’, or NLP for short. Natural Language Processing is a very hot topic in the field of Artificial Intelligence, and particularly in machine learning, because of its enormous range of day-to-day applications.

These applications include chatbots, language translation, text classification, paragraph summarization, spam filtering and many more. There are a few open-source NLP libraries that do the job of processing text, such as NLTK, the Stanford NLP suite, and Apache OpenNLP. NLTK is the most popular as well as the easiest to understand.

OK. Enough chit-chat. Let us start with some basic scripts to get our hands adept at NLTK.

Tokenisation - Splitting bigger parts into smaller parts: we can tokenize paragraphs into sentences and sentences into words. It is the process of converting normal text strings into a list of tokens (the words we actually want).

#TOKENISATION
from nltk.tokenize import sent_tokenize, word_tokenize

EXAMPLE_TEXT = "An an valley indeed so no wonder future nature vanity. Debating all she mistaken indulged believed provided declared. He many kept on draw lain song as same. Whether at dearest certain spirits is entered in to. Rich fine bred real use too many good. She compliment unaffected expression favorable any. Unknown chiefly showing to conduct no."

tokened_sent = sent_tokenize(EXAMPLE_TEXT)
tokened_word = word_tokenize(EXAMPLE_TEXT)

print(tokened_sent)
print(tokened_word)

The output of the above program is the list of seven sentences, followed by the list of individual word (and punctuation) tokens.

Stemming - Removing affixes from words and returning the root word. (The stem of the word ‘working’ will be ‘work’.)

#STEMMING
from nltk.stem import PorterStemmer

ps = PorterStemmer()
example_words = ["python", "pythoner", "pythoning", "pythoned", "pythonly"]
for w in example_words:
    print(ps.stem(w))

'''
THERE ARE MANY MORE STEMMERS AVAILABLE IN THE NLTK LIBRARY:
1) PorterStemmer
2) SnowballStemmer
3) LancasterStemmer
4) RegexpStemmer
5) RSLPStemmer
'''
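To see how the other stemmers listed above behave, here is a small sketch (assuming the standard NLTK stemmer classes) comparing PorterStemmer, SnowballStemmer and LancasterStemmer on the same words:

```python
# Comparing three NLTK stemmers on the same words: Snowball is a
# refinement of Porter, while Lancaster is the most aggressive.
from nltk.stem import PorterStemmer, SnowballStemmer, LancasterStemmer

porter = PorterStemmer()
snowball = SnowballStemmer("english")  # Snowball needs a language name
lancaster = LancasterStemmer()

for w in ["running", "generously", "maximum"]:
    print(w, "->", porter.stem(w), "|", snowball.stem(w), "|", lancaster.stem(w))
```

The three will agree on easy cases like "running", but diverge on harder words, which is why it pays to try more than one stemmer on your own data.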

Output: the first four words all stem to ‘python’, while ‘pythonly’ becomes ‘pythonli’ - a reminder that a stem need not be a real word.

Lemmatization - Lemmatizing is similar to stemming, but the difference lies in the output: the lemmatized output is a real dictionary word, not just a trimmed string. For this piece of code to work, you will have to download the wordnet package for NLTK (nltk.download('wordnet')).

#LEMMATIZATION
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize('increases'))
print(lemmatizer.lemmatize('playing', pos="v"))
print(lemmatizer.lemmatize('playing', pos="n"))
print(lemmatizer.lemmatize('playing', pos="a"))
print(lemmatizer.lemmatize('playing', pos="r"))
print(lemmatizer.lemmatize("cats"))
print(lemmatizer.lemmatize("cacti"))
print(lemmatizer.lemmatize("geese"))
print(lemmatizer.lemmatize("rocks"))
print(lemmatizer.lemmatize("python"))
print(lemmatizer.lemmatize("better", pos="a"))
print(lemmatizer.lemmatize("best", pos="a"))
print(lemmatizer.lemmatize("run"))
print(lemmatizer.lemmatize("run", 'v'))

In the output, notice how ‘increases’ becomes ‘increase’, ‘cacti’ becomes ‘cactus’, ‘geese’ becomes ‘goose’, and ‘better’ (tagged as an adjective) becomes ‘good’, while ‘playing’ is only reduced to ‘play’ when the verb part-of-speech tag is supplied.

Now that we have a clearer idea of how to use NLTK, let’s level things up a notch.

Stop words: There are some words in English like “the,” “of,” “a,” “an,” and so on that carry little meaning on their own. These are called ‘stop words’, and they differ from language to language. Stop words can skew the results of an analysis, so removing them is often necessary.

#FILTERING ALL THE STOPWORDS
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

example_sent = "This is a sample sentence, showing off the stop words filtration. Here you can write whatever you want to. You can also add a very big text file and see how this technique works"

#STOP WORDS ARE PARTICULAR TO THEIR RESPECTIVE LANGUAGES (english, spanish, french, et cetera)
stop_words = set(stopwords.words('english'))
word_tokens = word_tokenize(example_sent)
filtered_sentence = [w for w in word_tokens if w not in stop_words]
print(filtered_sentence)

Output: the token list with stop words such as ‘is’, ‘a’, ‘the’, and ‘off’ removed. Note that the capitalised ‘This’ survives, because the stop-word list is all lowercase.
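Because the stop-word list is all lowercase, the usual fix is to lowercase each token before the membership test. A minimal sketch, using a tiny hand-picked stop-word set for illustration (in practice you would use the full stopwords.words('english') list):

```python
# Lowercase each token before the membership test so capitalised
# stop words ("This") are filtered out as well.
stop_words = {"this", "is", "a", "the", "off"}  # tiny illustrative set

word_tokens = "This is a sample sentence showing off the stop words filtration".split()
filtered_sentence = [w for w in word_tokens if w.lower() not in stop_words]
print(filtered_sentence)
# ['sample', 'sentence', 'showing', 'stop', 'words', 'filtration']
```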

Count word frequency - Counting how often each word occurs is a crucial part of language analysis. NLTK ships with a frequency counter, FreqDist, to count the number of times each word is repeated in a particular dataset.

#COUNTING THE FREQUENCY OF THE WORDS USED
import nltk
from nltk.tokenize import word_tokenize

EXAMPLE_TEXT = "An an valley indeed so no wonder future nature vanity. Debating all she mistaken indulged believed provided declared. He many kept on draw lain song as same. Whether at dearest certain spirits is entered in to. Rich fine bred real use too many good. She compliment unaffected expression favourable any. Unknown chiefly showing to conduct no."

#FreqDist OVER A RAW STRING WOULD COUNT CHARACTERS, SO TOKENIZE FIRST
frequency = nltk.FreqDist(word_tokenize(EXAMPLE_TEXT))
for key, val in frequency.items():
    print(str(key) + ':' + str(val))

Output: each distinct token followed by its count, for example many:2 and no:2.
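FreqDist also exposes a most_common() method (it is a subclass of collections.Counter), which is handy when you only care about the top few tokens. A quick sketch on a whitespace-split toy sentence, so the example stays self-contained:

```python
import nltk

# FreqDist accepts any iterable of tokens; a plain split() keeps
# this example free of extra downloads.
tokens = "the cat sat on the mat near the hat".split()
frequency = nltk.FreqDist(tokens)
print(frequency.most_common(2))  # the two most frequent tokens first
```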

Synonyms/Antonyms - And finally, we can also find synonyms as well as antonyms of any English word we desire, using WordNet.

#FINDING SYNONYMS FROM WORDNET
from nltk.corpus import wordnet

#syn.name() WOULD GIVE SYNSET IDS LIKE 'dog.n.01'; THE LEMMA NAMES ARE THE ACTUAL SYNONYMS
synonyms = []
for syn in wordnet.synsets('dog'):
    for l in syn.lemmas():
        synonyms.append(l.name())
print("synonyms", synonyms)

#FINDING ANTONYMS FROM WORDNET
antonyms = []
for syn in wordnet.synsets("good"):
    for l in syn.lemmas():
        if l.antonyms():
            antonyms.append(l.antonyms()[0].name())
print(antonyms)

The output for synonyms includes words such as ‘dog’, ‘domestic_dog’, ‘Canis_familiaris’, and ‘hound’.

And the output for antonyms contains words such as ‘evil’ and ‘bad’.

BONUS: Extracting email addresses from any given sentence using a regular expression.

import re

text = "Please contact us at contact@blahblah.com for further information." + \
       " You can also give feedback at feedback@blah.com"

emails = re.findall(r"[a-z0-9\.\-+_]+@[a-z0-9\.\-+_]+\.[a-z]+", text)
print(emails)

Output: ['contact@blahblah.com', 'feedback@blah.com']
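The character classes above only match lowercase letters, so an address like Contact@BlahBlah.com would be missed. A small variant (my addition, not from the original) compiles the same pattern with re.IGNORECASE:

```python
import re

# re.IGNORECASE makes the lowercase-only character classes
# match uppercase letters too.
text = "Reach us at Contact@BlahBlah.com or feedback@blah.com"
pattern = re.compile(r"[a-z0-9\.\-+_]+@[a-z0-9\.\-+_]+\.[a-z]+", re.IGNORECASE)
print(pattern.findall(text))  # ['Contact@BlahBlah.com', 'feedback@blah.com']
```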

Here we conclude the first part of this tutorial series. In the next part, we will cover more advanced topics such as chunking, part-of-speech tagging, etc.

I have a GitHub repository containing all of the above code in a well-commented structure. I will make sure to update it as this series of posts advances.

Stay tuned. Until next time…!

Author: Allyn Kozey
