Python - Text Classification

We often need to sort available text into categories according to predefined criteria. NLTK provides this kind of categorization as part of several of its corpora. In the example below, we look at the movie review corpus and check the categories it defines.

# Let's see how the movie reviews are classified
# (the corpus must be available locally, e.g. via nltk.download('movie_reviews'))
from nltk.corpus import movie_reviews
all_cats = [cat.lower() for cat in movie_reviews.categories()]
print(all_cats)

When we run the above program, we get the following output −

['neg', 'pos']

Now let's look at the content of one of the files with a positive review. We split the text of this file into sentences and print the first four as a sample. Because the raw text keeps its original line breaks, a single sentence may span more than one line of output.

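from nltk.corpus import movie_reviews
from nltk.tokenize import sent_tokenize

# Read one positive review as a raw string
sample = movie_reviews.raw("pos/cv944_13521.txt")
# Split the text into sentences (needs NLTK's Punkt tokenizer data)
sentences = sent_tokenize(sample)
# Print the first four sentences
for sentence in sentences[:4]:
    print(sentence)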

When we run the above program, we get the following output −

meteor threat set to blow away all volcanoes & twisters !
summer is here again !
this season could probably be the most ambitious = season this decade with hollywood churning out films 
like deep impact , = godzilla , the x-files , armageddon , the truman show , 
all of which has but = one main aim , to rock the box office .
leading the pack this summer is = deep impact , one of the first few film 
releases from the = spielberg-katzenberg-geffen's dreamworks production company .
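
To make the idea of sentence tokenization concrete, here is a rough sketch that splits text wherever `.`, `!`, or `?` is followed by whitespace. This is not how `sent_tokenize` works internally (NLTK uses a trained Punkt model that also copes with abbreviations and similar cases). The function name `naive_sent_split` and the sample string are assumptions made for illustration.

```python
import re

def naive_sent_split(text):
    """Split after '.', '!' or '?' when followed by whitespace (illustrative only)."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

# Made-up sample in the same pre-tokenized style as the corpus
sample = ("summer is here again ! this season could be ambitious . "
          "will it rock the box office ?")
sentences = naive_sent_split(sample)
for s in sentences:
    print(s)
```

A rule this simple would break on text such as "Mr. Smith", which is one reason to prefer `sent_tokenize` for real data.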

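Next, we take the words from all of these files and find the most common ones using the FreqDist class from nltk. The corpus reader's words() method returns the text already split into word tokens, so we only need to lowercase them before counting.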

import nltk
from nltk.corpus import movie_reviews

# Lowercase every token so that "The" and "the" are counted together
all_words = [w.lower() for w in movie_reviews.words()]
# Build a frequency distribution and print the ten most common tokens
freq_dist = nltk.FreqDist(all_words)
print(freq_dist.most_common(10))

When we run the above program, we get the following output −

[(',', 77717), ('the', 76529), ('.', 65876), ('a', 38106), ('and', 35576), 
('of', 34123), ('to', 31937), ("'", 30585), ('is', 25195), ('in', 21822)]
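
The most common entries in such a count tend to be punctuation and function words, which say little about whether a review is positive or negative. One possible refinement is to drop non-alphabetic tokens and stop words before counting. The sketch below uses `collections.Counter` (which `nltk.FreqDist` subclasses, so `most_common` behaves the same way) on a small made-up token list. The tokens, the tiny `STOP_WORDS` set, and the helper name `content_word_counts` are assumptions for illustration, not data or code from the corpus.

```python
from collections import Counter

# Hypothetical tokens standing in for movie_reviews.words()
tokens = ["The", "film", "is", "a", "great", "film", ",", "and", "the",
          "cast", "is", "great", ".", "A", "great", "cast", "!"]

# Tiny illustrative stop-word set (an assumption; real lists are much longer)
STOP_WORDS = {"the", "is", "a", "and", "of", "to", "in"}

def content_word_counts(words, stop_words):
    """Count lowercased alphabetic tokens that are not stop words."""
    return Counter(
        w.lower() for w in words
        if w.isalpha() and w.lower() not in stop_words
    )

counts = content_word_counts(tokens, STOP_WORDS)
print(counts.most_common(3))
```

With NLTK installed, the same helper could presumably be fed `movie_reviews.words()` together with a fuller list such as `nltk.corpus.stopwords.words('english')`, which requires the separate `stopwords` corpus.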