Introducing Natural Language Processing (NLP) using the Natural Language Toolkit (NLTK)

In this post we will show some uses of the Natural Language Toolkit (NLTK). We load the example texts that ship with the NLTK book (several of them come from Project Gutenberg) and use them to illustrate the commands.

To load the texts, write:
import nltk
from nltk.book import *

Output

*** Introductory Examples for the NLTK Book ***
Loading text1, ..., text9 and sent1, ..., sent9
Type the name of the text or sentence to view it.
Type: 'texts()' or 'sents()' to list the materials.
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908
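
If these texts are not yet present on your machine, the import above will complain about missing data. A minimal sketch of fetching them first (the 'book' collection id bundles the data used throughout the NLTK book):

import nltk

# One-time download of the data used by the NLTK book
# (Moby Dick, the chat corpus, the inaugural addresses, and so on).
nltk.download('book')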

To see a text's details, print its variable:
print(text1)
<Text: Moby Dick by Herman Melville 1851>

Input:-
text1.concordance("wonderfullest")

Output:-

Displaying 1 of 1 matches:
d seemed scorching to his feet . Wonderfullest things are ever the unmentionabl
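
concordance() also accepts width and lines arguments to control how much context is shown and how many matches are printed. A small sketch (the word and numbers are chosen only for illustration):

# Show at most 3 matches of "whale", each in an 80-character window.
text1.concordance("whale", width=80, lines=3)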

Input:-

text1.common_contexts(["whale", "white"])

Hint:- common_contexts() shows the contexts that the two words share, i.e. the neighbouring words that appear around both of them.

Output:-

a_- the_- the_, and_- his_- -_, ,_, the_head
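
The same call works on any of the loaded texts; for example, here is a sketch with a word pair from Sense and Sensibility (chosen purely for illustration):

# Contexts shared by "monstrous" and "very" in text2.
text2.common_contexts(["monstrous", "very"])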

Input:-

text1.dispersion_plot(["captain", "Ahab"])

Hint:- A dispersion plot marks every occurrence of each word with a tick, positioned by its offset from the beginning of the text.

Output:- (a dispersion plot is displayed in a separate window)
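
Note that dispersion_plot() relies on matplotlib being installed; without it, NLTK raises an error asking for the library. A sketch with a few more character names (the word list is only illustrative):

# One row per word, with a tick at every position where the word occurs.
text1.dispersion_plot(["captain", "Ahab", "whale", "Starbuck"])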

Input:-

fdist1 = FreqDist(text1)
print(fdist1)
print(fdist1.most_common(50))
print(fdist1['Abhay'])

Output:- (most_common(50) returns the 50 most frequent tokens together with their counts; the final 0 is the count for 'Abhay', which does not appear in the text)

<FreqDist with 19317 samples and 260819 outcomes>
[(',', 18713), ('the', 13721), ('.', 6862), ('of', 6536), ('and', 6024), ('a', 4569), ('to', 4542), (';', 4072), ('in', 3916), ('that', 2982), ("'", 2684), ('-', 2552), ('his', 2459), ('it', 2209), ('I', 2124), ('s', 1739), ('is', 1695), ('he', 1661), ('with', 1659), ('was', 1632), ('as', 1620), ('"', 1478), ('all', 1462), ('for', 1414), ('this', 1280), ('!', 1269), ('at', 1231), ('by', 1137), ('but', 1113), ('not', 1103), ('--', 1070), ('him', 1058), ('from', 1052), ('be', 1030), ('on', 1005), ('so', 918), ('whale', 906), ('one', 889), ('you', 841), ('had', 767), ('have', 760), ('there', 715), ('But', 705), ('or', 697), ('were', 680), ('now', 646), ('which', 640), ('?', 637), ('me', 627), ('like', 624)]
0
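
A FreqDist offers more than most_common(). For instance, hapaxes() lists the tokens that occur exactly once, and plot() draws the frequency curve; a brief sketch:

# Ten of the words that appear only once in Moby Dick.
print(fdist1.hapaxes()[:10])
# Cumulative frequency plot of the 50 most common tokens.
fdist1.plot(50, cumulative=True)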

Input:-

print(text1.collocations())

Output:- (collocations() prints pairs of words that occur together unusually often; it returns None, which is why print() also shows None at the end)

Sperm Whale; Moby Dick; White Whale; old man; Captain Ahab; sperm
whale; Right Whale; Captain Peleg; New Bedford; Cape Horn; cried Ahab;
years ago; lower jaw; never mind; Father Mapple; cried Stubb; chief
mate; white whale; ivory leg; one hand
None
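
NLTK also exposes the lower-level collocation machinery directly. The sketch below scores bigrams by pointwise mutual information, a choice made here for illustration rather than necessarily what collocations() uses internally (it assumes text1 is already loaded from nltk.book):

from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

bigram_measures = BigramAssocMeasures()
finder = BigramCollocationFinder.from_words(text1)
finder.apply_freq_filter(5)                      # ignore pairs seen fewer than 5 times
print(finder.nbest(bigram_measures.pmi, 10))     # ten highest-scoring word pairs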

Input:-

from nltk.tokenize import word_tokenize
text = word_tokenize("what do you do?")
print(text)
print(nltk.pos_tag(text))

Output:- word_tokenize() splits the sentence into a list of tokens, and pos_tag() returns the part of speech of each token.

['what', 'do', 'you', 'do', '?']
[('what', 'WP'), ('do', 'VBP'), ('you', 'PRP'), ('do', 'VB'), ('?', '.')]
Here '.' is the tag for sentence terminators: . ! ?
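
The meaning of any Penn Treebank tag can be looked up with nltk.help.upenn_tagset(); a small sketch (it needs the tag documentation, which in most NLTK versions is the downloadable 'tagsets' resource):

import nltk

nltk.download('tagsets')        # one-time download of the tag documentation
nltk.help.upenn_tagset('WP')    # prints the description of the WP (WH-pronoun) tag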

similar() finds words that appear in the same lexical contexts, i.e. words that have the same kind of neighbours on the left and on the right.

import nltk
from nltk.book import *
from nltk.tokenize import word_tokenize

text = word_tokenize("This is three cats. This is two cats.")
text = nltk.Text(text)
print(nltk.pos_tag(text))
print('similar')
text.similar('two')

Output
[('This', 'DT'), ('is', 'VBZ'), ('three', 'CD'), ('cats', 'NNS'), ('.', '.'), ('This', 'DT'), ('is', 'VBZ'), ('two', 'CD'), ('cats', 'NNS'), ('.', '.')]
similar
three

pos_tag() marks every word with its POS (part-of-speech) tag from the Penn Treebank tagset. For instance:
text = word_tokenize("This is three cats. This is two cats.")
print(text)
x=nltk.pos_tag(text)
print(x)

['This', 'is', 'three', 'cats', '.', 'This', 'is', 'two', 'cats', '.']
[('This', 'DT'), ('is', 'VBZ'), ('three', 'CD'), ('cats', 'NNS'), ('.', '.'), ('This', 'DT'), ('is', 'VBZ'), ('two', 'CD'), ('cats', 'NNS'), ('.', '.')]

The following program computes frequency distributions over the given sentences, both for the tags and for the words, and plots each of them.

import nltk
from nltk.book import *
from nltk.tokenize import word_tokenize
text = word_tokenize("This is three cats. This is two cats.")
text = nltk.Text(text)
x=nltk.pos_tag(text)
print(x)
x=(tag for (word, tag) in x)
print(x)
tag_fd = nltk.FreqDist(x)
tag_fd.plot(cumulative=True)
print(tag_fd.most_common())
x=nltk.pos_tag(text)
x=(word for (word, tag) in x)
print(x)
tag_fd = nltk.FreqDist(x)
print(tag_fd.most_common())
tag_fd.plot(cumulative=True)

[('This', 'DT'), ('is', 'VBZ'), ('three', 'CD'), ('cats', 'NNS'), ('.', '.'), ('This', 'DT'), ('is', 'VBZ'), ('two', 'CD'), ('cats', 'NNS'), ('.', '.')]
<generator object <genexpr> at 0x0000006694125938>

(cumulative frequency plot of the tags)
[('DT', 2), ('VBZ', 2), ('CD', 2), ('NNS', 2), ('.', 2)]
<generator object <genexpr> at 0x0000006694125938>
[('This', 2), ('is', 2), ('cats', 2), ('.', 2), ('three', 1), ('two', 1)]

(cumulative frequency plot of the words)
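
A related tool is ConditionalFreqDist, which keeps one frequency distribution per condition, here one per tag. A minimal sketch on the same sentences:

import nltk
from nltk.tokenize import word_tokenize

tokens = word_tokenize("This is three cats. This is two cats.")
tagged = nltk.pos_tag(tokens)
# For each tag, count the words that carry it.
cfd = nltk.ConditionalFreqDist((tag, word) for (word, tag) in tagged)
print(cfd['CD'].most_common())   # e.g. [('three', 1), ('two', 1)]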

startswith() checks whether a string begins with a given prefix; here we use it to print the words that start with "t".

Input:-

words = ['This', 'is', 'a', 'train', 'with', 'brains']

for word in words:
    if word.startswith("t"):
        print(word)

Output:-train

Similarly, endswith() returns the words that end with the given suffix.

for word in words:
    if word.endswith("s"):
        print(word)

Output:-This
is
brains
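
Both filters are often written as list comprehensions; a minimal sketch on the same word list:

words = ['This', 'is', 'a', 'train', 'with', 'brains']
print([w for w in words if w.startswith('t')])   # ['train']
print([w for w in words if w.endswith('s')])     # ['This', 'is', 'brains']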

Input:-

for word in words:
    if "rain" in word:  # the idiomatic form of word.__contains__("rain")
        print(word)

Output:-

train
brains

 

Input:-

words = ['This', 'is', 'a', 'Train', 'with', 'brains']
for token in words:
    if token.islower():
        print(token, 'is a lowercase word')
    elif token.istitle():
        print(token, 'is a titlecase word')
    else:
        print(token, 'is punctuation')

Output:-

This is a titlecase word
is is a lowercase word
a is a lowercase word
Train is a titlecase word
with is a lowercase word
brains is a lowercase word
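
The same string tests can of course be applied to a full corpus; a short sketch that counts the distinct titlecase tokens in Moby Dick (it assumes text1 has been loaded via from nltk.book import *):

# Distinct tokens in Moby Dick that are in title case.
titlecase_words = [w for w in set(text1) if w.istitle()]
print(len(titlecase_words))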

 

