Python’s Natural Language Tool Kit (NLTK) Tutorial part - 1 (2024)


It is quite possible that you have heard of the term ‘Natural Language Processing’, or NLP for short. Natural Language Processing is a very hot topic in the field of Artificial Intelligence, and particularly in machine learning, because of its enormous range of day-to-day applications.

These applications include chatbots, language translation, text classification, paragraph summarization, spam filtering and many more. There are a few open-source NLP libraries that do the job of processing text, such as NLTK, the Stanford NLP suite, and Apache OpenNLP. NLTK is the most popular as well as the easiest to understand.

OK. Enough chit-chat. Let us start with some basic scripts to get our hands adept at NLTK.

Tokenisation - Splitting bigger parts into smaller parts: we can tokenize paragraphs into sentences and sentences into words. It is the process of converting normal text strings into a list of tokens (the words we actually want).

#TOKENISATION
from nltk.tokenize import sent_tokenize, word_tokenize

EXAMPLE_TEXT = "An an valley indeed so no wonder future nature vanity. Debating all she mistaken indulged believed provided declared. He many kept on draw lain song as same. Whether at dearest certain spirits is entered in to. Rich fine bred real use too many good. She compliment unaffected expression favorable any. Unknown chiefly showing to conduct no."

tokened_sent = sent_tokenize(EXAMPLE_TEXT)
tokened_word = word_tokenize(EXAMPLE_TEXT)

print(tokened_sent)
print(tokened_word)

The output of the above program is the list of seven sentences, followed by the list of individual word (and punctuation) tokens.

Stemming - Removing affixes from words and returning the root word. (The stem of the word ‘working’ will be ‘work’.)

#STEMMING
from nltk.stem import PorterStemmer

ps = PorterStemmer()
example_words = ["python", "pythoner", "pythoning", "pythoned", "pythonly"]
for w in example_words:
    print(ps.stem(w))

'''
THERE ARE MANY MORE STEMMERS AVAILABLE IN THE NLTK LIBRARY:
1) PorterStemmer
2) SnowballStemmer
3) LancasterStemmer
4) RegexpStemmer
5) RSLPStemmer
'''
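To see how the other stemmers listed above behave, here is a small sketch (assuming the standard NLTK stemmer classes) comparing PorterStemmer, SnowballStemmer and LancasterStemmer on the same words:

```python
# Comparing three NLTK stemmers on the same words: Snowball is a
# refinement of Porter, while Lancaster is the most aggressive.
from nltk.stem import PorterStemmer, SnowballStemmer, LancasterStemmer

porter = PorterStemmer()
snowball = SnowballStemmer("english")  # Snowball needs a language name
lancaster = LancasterStemmer()

for w in ["running", "generously", "maximum"]:
    print(w, "->", porter.stem(w), "|", snowball.stem(w), "|", lancaster.stem(w))
```

The three will agree on easy cases like "running", but diverge on harder words, which is why it pays to try more than one stemmer on your own data.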

Output: the first four words all stem to ‘python’, while ‘pythonly’ becomes ‘pythonli’ - a reminder that a stem need not be a real word.

Lemmatization - Lemmatizing is similar to stemming, but the difference lies in the output: the lemmatized output is a real dictionary word, not just a trimmed string. For this piece of code to work, you will have to download the wordnet package for NLTK (nltk.download('wordnet')).

#LEMMATIZATION
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize('increases'))
print(lemmatizer.lemmatize('playing', pos="v"))
print(lemmatizer.lemmatize('playing', pos="n"))
print(lemmatizer.lemmatize('playing', pos="a"))
print(lemmatizer.lemmatize('playing', pos="r"))
print(lemmatizer.lemmatize("cats"))
print(lemmatizer.lemmatize("cacti"))
print(lemmatizer.lemmatize("geese"))
print(lemmatizer.lemmatize("rocks"))
print(lemmatizer.lemmatize("python"))
print(lemmatizer.lemmatize("better", pos="a"))
print(lemmatizer.lemmatize("best", pos="a"))
print(lemmatizer.lemmatize("run"))
print(lemmatizer.lemmatize("run", 'v'))

In the output, notice how ‘increases’ becomes ‘increase’, ‘cacti’ becomes ‘cactus’, ‘geese’ becomes ‘goose’, and ‘better’ (tagged as an adjective) becomes ‘good’, while ‘playing’ is only reduced to ‘play’ when the verb part-of-speech tag is supplied.

Now that we have a clearer idea of how to use NLTK, let’s level things up a notch.

Stop words: There are some words in English like “the,” “of,” “a,” “an,” and so on that carry little meaning on their own. These are called ‘stop words’, and they differ from language to language. Stop words can skew the results of an analysis, so removing them is often necessary.

#FILTERING ALL THE STOPWORDS
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

example_sent = "This is a sample sentence, showing off the stop words filtration. Here you can write whatever you want to. You can also add a very big text file and see how this technique works"

#STOP WORDS ARE PARTICULAR TO THEIR RESPECTIVE LANGUAGES (english, spanish, french, et cetera)
stop_words = set(stopwords.words('english'))
word_tokens = word_tokenize(example_sent)
filtered_sentence = [w for w in word_tokens if w not in stop_words]
print(filtered_sentence)

Output: the token list with stop words such as ‘is’, ‘a’, ‘the’, and ‘off’ removed. Note that the capitalised ‘This’ survives, because the stop-word list is all lowercase.
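Because the stop-word list is all lowercase, the usual fix is to lowercase each token before the membership test. A minimal sketch, using a tiny hand-picked stop-word set for illustration (in practice you would use the full stopwords.words('english') list):

```python
# Lowercase each token before the membership test so capitalised
# stop words ("This") are filtered out as well.
stop_words = {"this", "is", "a", "the", "off"}  # tiny illustrative set

word_tokens = "This is a sample sentence showing off the stop words filtration".split()
filtered_sentence = [w for w in word_tokens if w.lower() not in stop_words]
print(filtered_sentence)
# ['sample', 'sentence', 'showing', 'stop', 'words', 'filtration']
```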

Count word frequency - Counting how often each word occurs is a crucial part of language analysis. NLTK ships with a frequency counter, FreqDist, to count the number of times each word is repeated in a particular dataset.

#COUNTING THE FREQUENCY OF THE WORDS USED
import nltk
from nltk.tokenize import word_tokenize

EXAMPLE_TEXT = "An an valley indeed so no wonder future nature vanity. Debating all she mistaken indulged believed provided declared. He many kept on draw lain song as same. Whether at dearest certain spirits is entered in to. Rich fine bred real use too many good. She compliment unaffected expression favourable any. Unknown chiefly showing to conduct no."

#FreqDist OVER A RAW STRING WOULD COUNT CHARACTERS, SO TOKENIZE FIRST
frequency = nltk.FreqDist(word_tokenize(EXAMPLE_TEXT))
for key, val in frequency.items():
    print(str(key) + ':' + str(val))

Output: each distinct token followed by its count, for example many:2 and no:2.
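FreqDist also exposes a most_common() method (it is a subclass of collections.Counter), which is handy when you only care about the top few tokens. A quick sketch on a whitespace-split toy sentence, so the example stays self-contained:

```python
import nltk

# FreqDist accepts any iterable of tokens; a plain split() keeps
# this example free of extra downloads.
tokens = "the cat sat on the mat near the hat".split()
frequency = nltk.FreqDist(tokens)
print(frequency.most_common(2))  # the two most frequent tokens first
```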

Synonyms/Antonyms - And finally, we can also find synonyms as well as antonyms of any English word we desire, using WordNet.

#FINDING SYNONYMS FROM WORDNET
from nltk.corpus import wordnet

#syn.name() WOULD GIVE SYNSET IDS LIKE 'dog.n.01'; THE LEMMA NAMES ARE THE ACTUAL SYNONYMS
synonyms = []
for syn in wordnet.synsets('dog'):
    for l in syn.lemmas():
        synonyms.append(l.name())
print("synonyms", synonyms)

#FINDING ANTONYMS FROM WORDNET
antonyms = []
for syn in wordnet.synsets("good"):
    for l in syn.lemmas():
        if l.antonyms():
            antonyms.append(l.antonyms()[0].name())
print(antonyms)

The output for synonyms includes words such as ‘dog’, ‘domestic_dog’, ‘Canis_familiaris’, and ‘hound’.

And the output for antonyms contains words such as ‘evil’ and ‘bad’.

BONUS: Extracting email addresses from any given sentence using a regular expression.

import re

text = "Please contact us at contact@blahblah.com for further information." + \
       " You can also give feedback at feedback@blah.com"

emails = re.findall(r"[a-z0-9\.\-+_]+@[a-z0-9\.\-+_]+\.[a-z]+", text)
print(emails)

Output: ['contact@blahblah.com', 'feedback@blah.com']
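The character classes above only match lowercase letters, so an address like Contact@BlahBlah.com would be missed. A small variant (my addition, not from the original) compiles the same pattern with re.IGNORECASE:

```python
import re

# re.IGNORECASE makes the lowercase-only character classes
# match uppercase letters too.
text = "Reach us at Contact@BlahBlah.com or feedback@blah.com"
pattern = re.compile(r"[a-z0-9\.\-+_]+@[a-z0-9\.\-+_]+\.[a-z]+", re.IGNORECASE)
print(pattern.findall(text))  # ['Contact@BlahBlah.com', 'feedback@blah.com']
```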

Here we conclude the first part of this tutorial series. In the next part, we will cover more advanced topics such as chunking, part-of-speech tagging, etc.

I have a GitHub repository containing all of the above code in a well-commented structure. I will make sure to update it as this series of posts advances.

Stay tuned. Until next time…!

Author: Allyn Kozey
