Gensim - Creating LSI & HDP Topic Models

This chapter covers creating Latent Semantic Indexing (LSI) and Hierarchical Dirichlet Process (HDP) topic models with Gensim.

Alongside Latent Dirichlet Allocation (LDA), one of the first topic modeling algorithms implemented in Gensim was Latent Semantic Indexing (LSI). It is also called Latent Semantic Analysis (LSA). It was patented in 1988 by Scott Deerwester, Susan Dumais, George Furnas, Richard Harshman, Thomas Landauer, Karen Lochbaum and Lynn Streeter.

In this section, we will set up our LSI model. This can be done in the same way as setting up an LDA model. We need to import the LSI model from gensim.models.

Role of LSI

LSI is an NLP technique, specifically in distributional semantics. It analyzes the relationship between a set of documents and the terms these documents contain. In terms of how it works, it builds a matrix of word counts per document from a large body of text.

Once the matrix is built, the LSI model uses a mathematical technique called singular value decomposition (SVD) to reduce the number of rows. While reducing the number of rows, it also preserves the similarity structure among the columns.

In the matrix, the rows represent unique words and the columns represent the documents. LSI rests on the distributional hypothesis, i.e. it assumes that words that are close in meaning will occur in the same kind of text.
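
The matrix-and-SVD idea described above can be sketched on a toy corpus. This is only an illustration using NumPy, not Gensim's own implementation; the four tiny documents and the choice of k = 2 are assumptions made for the example.

```python
import numpy as np

# Toy corpus (hypothetical); each document is already tokenized.
docs = [
    ["car", "engine", "wheel"],
    ["car", "wheel", "road"],
    ["disk", "memory", "cpu"],
    ["cpu", "memory", "clock"],
]

# Rows are unique terms, columns are documents.
vocab = sorted({word for doc in docs for word in doc})
term_index = {word: i for i, word in enumerate(vocab)}
term_doc = np.zeros((len(vocab), len(docs)))
for j, doc in enumerate(docs):
    for word in doc:
        term_doc[term_index[word], j] += 1

# SVD factorizes the matrix; keeping the k largest singular values
# gives every document a k-dimensional representation.
u, s, vt = np.linalg.svd(term_doc, full_matrices=False)
k = 2
doc_vectors = (np.diag(s[:k]) @ vt[:k]).T  # shape: (n_docs, k)

def cosine(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(doc_vectors.shape)
print(cosine(doc_vectors[0], doc_vectors[1]))  # two car documents
print(cosine(doc_vectors[0], doc_vectors[2]))  # car vs. hardware document
```

In this toy corpus the two car documents end up with a cosine similarity close to 1, while a car document and a hardware document, which share no terms, end up close to 0.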

Implementation with Gensim

Here, we are going to use LSI (Latent Semantic Indexing) to extract the naturally discussed topics from a dataset.

Loading the Dataset

The dataset we are going to use is the '20 Newsgroups' dataset, which contains thousands of news posts from 20 different newsgroups. It is available through Sklearn's datasets module. We can easily download it with the help of the following Python script:

from sklearn.datasets import fetch_20newsgroups
newsgroups_train = fetch_20newsgroups(subset='train')

Let's look at some sample posts with the help of the following script:

newsgroups_train.data[:4]
["From: [email protected] (where's my thing)\nSubject: 
WHAT car is this!?\nNntp-Posting-Host: rac3.wam.umd.edu\nOrganization: 
University of Maryland, College Park\nLines: 15\n\n 
I was wondering if anyone out there could enlighten me on this car 
I saw\nthe other day. It was a 2-door sports car,
looked to be from the late 60s/\nearly 70s. It was called a Bricklin. 
The doors were really small. In addition,\nthe front bumper was separate from 
the rest of the body. This is \nall I know. If anyone can tellme a model name, 
engine specs, years\nof production, where this car is made, history, or 
whatever info you\nhave on this funky looking car, 
please e-mail.\n\nThanks,\n- IL\n ---- brought to you by your neighborhood 
Lerxst ----\n\n\n\n\n",

"From: [email protected] (Guy Kuo)\nSubject: 
SI Clock Poll - Final Call\nSummary: Final call for SI clock reports\nKeywords: 
SI,acceleration,clock,upgrade\nArticle-I.D.: shelley.1qvfo9INNc3s\nOrganization: 
University of Washington\nLines: 11\nNNTP-Posting-Host: carson.u.washington.edu\n\nA 
fair number of brave souls who upgraded their SI clock oscillator have\nshared their 
experiences for this poll. Please send a brief message detailing\nyour experiences with 
the procedure. Top speed attained, CPU rated speed,\nadd on cards and adapters, heat 
sinks, hour of usage per day, floppy disk\nfunctionality with 800 and 1.4 m floppies 
are especially requested.\n\nI will be summarizing in the next two days, so please add 
to the network\nknowledge base if you have done the clock upgrade and haven't answered 
this\npoll. Thanks.\n\nGuy Kuo <[email protected]>\n",

'From: [email protected] (Thomas E Willis)\nSubject: 
PB questions...\nOrganization: Purdue University Engineering Computer 
Network\nDistribution: usa\nLines: 36\n\nwell folks, my mac plus finally gave up the 
ghost this weekend after\nstarting life as a 512k way back in 1985. sooo, i\'m in the 
market for a\nnew machine a bit sooner than i intended to be...\n\ni\'m looking into 
picking up a powerbook 160 or maybe 180 and have a bunch\nof questions that (hopefully) 
somebody can answer:\n\n* does anybody know any dirt on when the next round of 
powerbook\nintroductions are expected? i\'d heard the 185c was supposed to make 
an\nappearence "this summer" but haven\'t heard anymore on it - and since i\ndon\'t 
have access to macleak, i was wondering if anybody out there had\nmore info...\n\n* has 
anybody heard rumors about price drops to the powerbook line like the\nones the duo\'s 
just went through recently?\n\n* what\'s the impression of the display on the 180? i 
could probably swing\na 180 if i got the 80Mb disk rather than the 120, but i don\'t 
really have\na feel for how much "better" the display is (yea, it looks great in 
the\nstore, but is that all "wow" or is it really that good?). could i solicit\nsome 
opinions of people who use the 160 and 180 day-to-day on if its worth\ntaking the disk 
size and money hit to get the active display? (i realize\nthis is a real subjective 
question, but i\'ve only played around with the\nmachines in a computer store breifly 
and figured the opinions of somebody\nwho actually uses the machine daily might prove 
helpful).\n\n* how well does hellcats perform? ;)\n\nthanks a bunch in advance for any 
info - if you could email, i\'ll post a\nsummary (news reading time is at a premium 
with finals just around the\ncorner... :( )\n--\nTom Willis \\ [email protected] 
\\ Purdue Electrical 
Engineering\n---------------------------------------------------------------------------\
n"Convictions are more dangerous enemies of truth than lies." - F. W.\nNietzsche\n',

'From: jgreen@amber (Joe Green)\nSubject: Re: Weitek P9000 ?\nOrganization: Harris 
Computer Systems Division\nLines: 14\nDistribution: world\nNNTP-Posting-Host: 
amber.ssd.csd.harris.com\nX-Newsreader: TIN [version 1.1 PL9]\n\nRobert J.C. Kyanko 
([email protected]) wrote:\n > [email protected] writes in article <
[email protected]>:\n> > Anyone know about the Weitek P9000 
graphics chip?\n > As far as the low-level stuff goes, it looks pretty nice. It\'s 
got this\n > quadrilateral fill command that requires just the four
points.\n\nDo you have Weitek\'s address/phone number? I\'d like to get some 
information\nabout this chip.\n\n--\nJoe Green\t\t\t\tHarris 
Corporation\[email protected]\t\t\tComputer Systems Division\n"The only thing that 
really scares me is a person with no sense of humor."\n\t\t\t\t\t\t-- Jonathan 
Winters\n']

Prerequisites

We need the stopwords from NLTK and an English model from spaCy. Both can be loaded as follows:

import nltk
import spacy
nltk.download('stopwords')
# Assumes the model was installed first: python -m spacy download en_core_web_md
nlp = spacy.load('en_core_web_md', disable=['parser', 'ner'])

Importing the Necessary Packages

To build the LSI model, we need to import the following packages:

import re
import numpy as np
import pandas as pd
from pprint import pprint
import gensim
import gensim.corpora as corpora
from gensim.utils import simple_preprocess
from gensim.models import CoherenceModel
import spacy
import matplotlib.pyplot as plt

Preparing Stopwords

Now we need to import the stopwords and extend the list with a few words that are common in this dataset:

from nltk.corpus import stopwords
stop_words = stopwords.words('english')
stop_words.extend(['from', 'subject', 're', 'edu', 'use'])
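
To show how such a list is used, here is a minimal sketch of filtering tokens against it. To keep the example free of downloads, a tiny hand-written stop list stands in for stopwords.words('english'); that list and the sample tokens are assumptions for illustration only.

```python
# A tiny hand-written stop list standing in for stopwords.words('english');
# it is an assumption for illustration, not the real NLTK list.
stop_words = ['the', 'is', 'a', 'of', 'to']
stop_words.extend(['from', 'subject', 're', 'edu', 'use'])

# Hypothetical tokens from one post.
tokens = ['from', 'the', 'subject', 'line', 'of', 'a', 'car', 'post']

# Keep only the tokens that are not in the stop list.
filtered = [word for word in tokens if word not in stop_words]
print(filtered)  # ['line', 'car', 'post']
```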

Cleaning Up the Text

Now, with the help of Gensim's simple_preprocess(), we need to tokenize each document into a list of words. We should also remove punctuation and unnecessary characters. To do this, we will create a function named sent_to_words():

def sent_to_words(sentences):
   for sentence in sentences:
      yield(gensim.utils.simple_preprocess(str(sentence), deacc=True))
# data is the list of raw documents (newsgroups_train.data after cleaning)
data_words = list(sent_to_words(data))
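
The complete example later in this chapter also runs three regular-expression substitutions on the raw posts before tokenizing them. The sketch below applies the same three patterns to a single made-up string (the sample text is an assumption) so that the effect of each step is visible.

```python
import re

# Hypothetical raw post with an e-mail address, a quote and odd whitespace.
raw = "From: someone@example.com (Joe)\nSubject: It's  a\ttest\n"

no_email = re.sub(r'\S*@\S*\s?', '', raw)   # drop tokens containing '@'
one_space = re.sub(r'\s+', ' ', no_email)   # collapse whitespace runs
cleaned = re.sub(r"'", "", one_space)       # drop single quotes

print(repr(cleaned))  # 'From: (Joe) Subject: Its a test '
```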

Building Bigram & Trigram Models

Bigrams are two words that frequently occur together in a document, and trigrams are three words that frequently occur together. With the help of Gensim's Phrases model, we can detect both:

bigram = gensim.models.Phrases(data_words, min_count=5, threshold=100)
trigram = gensim.models.Phrases(bigram[data_words], threshold=100)
bigram_mod = gensim.models.phrases.Phraser(bigram)
trigram_mod = gensim.models.phrases.Phraser(trigram)

Filtering Out Stopwords

Next, we need to filter out the stopwords. Along with that, we will also create functions to make bigrams and trigrams and to perform lemmatization:

def remove_stopwords(texts):
   return [[word for word in simple_preprocess(str(doc)) 
   if word not in stop_words] for doc in texts]
def make_bigrams(texts):
   return [bigram_mod[doc] for doc in texts]
def make_trigrams(texts):
   return [trigram_mod[bigram_mod[doc]] for doc in texts]
def lemmatization(texts, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV']):
   texts_out = []
   for sent in texts:
      doc = nlp(" ".join(sent))
      texts_out.append([token.lemma_ for token in doc if token.pos_ in allowed_postags])
   return texts_out

Building the Dictionary & Corpus for the Topic Model

Now we need to build the dictionary and the corpus. We did the same in the previous examples:

id2word = corpora.Dictionary(data_lemmatized)
texts = data_lemmatized
corpus = [id2word.doc2bow(text) for text in texts]

Building the LSI Topic Model

We have implemented everything required to train the LSI model. Now it is time to build the LSI topic model. For our implementation example, this can be done with the following code:

lsi_model = gensim.models.lsimodel.LsiModel(
   corpus=corpus, id2word=id2word, num_topics=20,chunksize=100
)

Implementation Example

Let's see the complete implementation example to build the LSI topic model:

import re
import numpy as np
import pandas as pd
from pprint import pprint
import gensim
import gensim.corpora as corpora
from gensim.utils import simple_preprocess
from gensim.models import CoherenceModel
import spacy
import matplotlib.pyplot as plt
from nltk.corpus import stopwords
stop_words = stopwords.words('english')
stop_words.extend(['from', 'subject', 're', 'edu', 'use'])
from sklearn.datasets import fetch_20newsgroups
newsgroups_train = fetch_20newsgroups(subset='train')
data = newsgroups_train.data
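data = [re.sub(r'\S*@\S*\s?', '', sent) for sent in data] #remove e-mail addresses
data = [re.sub(r'\s+', ' ', sent) for sent in data] #collapse whitespace
data = [re.sub(r"\'", "", sent) for sent in data] #remove single quotes
def sent_to_words(sentences):
   for sentence in sentences:
      yield(gensim.utils.simple_preprocess(str(sentence), deacc=True))
data_words = list(sent_to_words(data))
print(data_words[:4]) #it will print the tokenized data before stopword removal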
bigram = gensim.models.Phrases(data_words, min_count=5, threshold=100)
trigram = gensim.models.Phrases(bigram[data_words], threshold=100)
bigram_mod = gensim.models.phrases.Phraser(bigram)
trigram_mod = gensim.models.phrases.Phraser(trigram)
def remove_stopwords(texts):
   return [[word for word in simple_preprocess(str(doc)) 
   if word not in stop_words] for doc in texts]
def make_bigrams(texts):
   return [bigram_mod[doc] for doc in texts]
def make_trigrams(texts):
   return [trigram_mod[bigram_mod[doc]] for doc in texts]
def lemmatization(texts, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV']):
   texts_out = []
   for sent in texts:
      doc = nlp(" ".join(sent))
      texts_out.append([token.lemma_ for token in doc if token.pos_ in allowed_postags])
   return texts_out
data_words_nostops = remove_stopwords(data_words)
data_words_bigrams = make_bigrams(data_words_nostops)
nlp = spacy.load('en_core_web_md', disable=['parser', 'ner'])
data_lemmatized = lemmatization(
   data_words_bigrams, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV']
)
print(data_lemmatized[:4]) #it will print the lemmatized data.
id2word = corpora.Dictionary(data_lemmatized)
texts = data_lemmatized
corpus = [id2word.doc2bow(text) for text in texts]
print(corpus[:4]) #it will print the corpus we created above.
pprint([[(id2word[id], freq) for id, freq in cp] for cp in corpus[:4]])
#it will print the words with their frequencies.
lsi_model = gensim.models.lsimodel.LsiModel(
   corpus=corpus, id2word=id2word, num_topics=20,chunksize=100
)

Now we can use the LSI model created above to get the topics.

Viewing Topics in the LSI Model

The LSI model (lsi_model) we created above can be used to view the topics from the documents. This can be done with the help of the following script:

pprint(lsi_model.print_topics())
doc_lsi = lsi_model[corpus]

Output

[
   (0,
   '1.000*"ax" + 0.001*"_" + 0.000*"tm" + 0.000*"part" +    0.000*"pne" + '
   '0.000*"biz" + 0.000*"mbs" + 0.000*"end" + 0.000*"fax" + 0.000*"mb"'),
   (1,
   '0.239*"say" + 0.222*"file" + 0.189*"go" + 0.171*"know" + 0.169*"people" + '
   '0.147*"make" + 0.140*"use" + 0.135*"also" + 0.133*"see" + 0.123*"think"')
]

Hierarchical Dirichlet Process (HDP)

Topic models such as LDA and LSI help to summarize and organize large archives of text that cannot be analyzed by hand. Apart from LDA and LSI, another powerful topic model in Gensim is HDP (Hierarchical Dirichlet Process). It is essentially a mixed-membership model for the unsupervised analysis of grouped data. Unlike LDA (its finite counterpart), HDP infers the number of topics from the data.

Implementation with Gensim

To implement HDP in Gensim, we need to train an HDP topic model, which is available as gensim.models.HdpModel, on a corpus and a dictionary (built as in the examples above for the LDA and LSI topic models). Here too we will implement the HDP topic model on the 20 Newsgroups data, and the steps are the same.

With our corpus and dictionary (created in the examples above for the LSI and LDA models), we can create the HdpModel as follows:

Hdp_model = gensim.models.hdpmodel.HdpModel(corpus=corpus, id2word=id2word)

Viewing Topics in the HDP Model

The HDP model (Hdp_model) can be used to view the topics from the documents. This can be done with the help of the following script:

pprint(Hdp_model.print_topics())

Output

[
   (0,
   '0.009*line + 0.009*write + 0.006*say + 0.006*article + 0.006*know + '
   '0.006*people + 0.005*make + 0.005*go + 0.005*think + 0.005*be'),
   (1,
   '0.016*line + 0.011*write + 0.008*article + 0.008*organization + 0.006*know '
   '+ 0.006*host + 0.006*be + 0.005*get + 0.005*use + 0.005*say'),
   (2,
   '0.810*ax + 0.001*_ + 0.000*tm + 0.000*part + 0.000*mb + 0.000*pne + '
   '0.000*biz + 0.000*end + 0.000*wwiz + 0.000*fax'),
   (3,
   '0.015*line + 0.008*write + 0.007*organization + 0.006*host + 0.006*know + '
   '0.006*article + 0.005*use + 0.005*thank + 0.004*get + 0.004*problem'),
   (4,
   '0.004*line + 0.003*write + 0.002*believe + 0.002*think + 0.002*article + '
   '0.002*belief + 0.002*say + 0.002*see + 0.002*look + 0.002*organization'),
   (5,
   '0.005*line + 0.003*write + 0.003*organization + 0.002*article + 0.002*time '
   '+ 0.002*host + 0.002*get + 0.002*look + 0.002*say + 0.001*number'),
   (6,
   '0.003*line + 0.002*say + 0.002*write + 0.002*go + 0.002*gun + 0.002*get + '
   '0.002*organization + 0.002*bill + 0.002*article + 0.002*state'),
   (7,
   '0.003*line + 0.002*write + 0.002*article + 0.002*organization + 0.001*none '
   '+ 0.001*know + 0.001*say + 0.001*people + 0.001*host + 0.001*new'),
   (8,
   '0.004*line + 0.002*write + 0.002*get + 0.002*team + 0.002*organization + '
   '0.002*go + 0.002*think + 0.002*know + 0.002*article + 0.001*well'),
   (9,
   '0.004*line + 0.002*organization + 0.002*write + 0.001*be + 0.001*host + '
   '0.001*article + 0.001*thank + 0.001*use + 0.001*work + 0.001*run'),
   (10,
   '0.002*line + 0.001*game + 0.001*write + 0.001*get + 0.001*know + '
   '0.001*thing + 0.001*think + 0.001*article + 0.001*help + 0.001*turn'),
   (11,
   '0.002*line + 0.001*write + 0.001*game + 0.001*organization + 0.001*say + '
   '0.001*host + 0.001*give + 0.001*run + 0.001*article + 0.001*get'),
   (12,
   '0.002*line + 0.001*write + 0.001*know + 0.001*time + 0.001*article + '
   '0.001*get + 0.001*think + 0.001*organization + 0.001*scope + 0.001*make'),
   (13,
   '0.002*line + 0.002*write + 0.001*article + 0.001*organization + 0.001*make '
   '+ 0.001*know + 0.001*see + 0.001*get + 0.001*host + 0.001*really'),
   (14,
   '0.002*write + 0.002*line + 0.002*know + 0.001*think + 0.001*say + '
   '0.001*article + 0.001*argument + 0.001*even + 0.001*card + 0.001*be'),
   (15,
   '0.001*article + 0.001*line + 0.001*make + 0.001*write + 0.001*know + '
   '0.001*say + 0.001*exist + 0.001*get + 0.001*purpose + 0.001*organization'),
   (16,
   '0.002*line + 0.001*write + 0.001*article + 0.001*insurance + 0.001*go + '
   '0.001*be + 0.001*host + 0.001*say + 0.001*organization + 0.001*part'),
   (17,
   '0.001*line + 0.001*get + 0.001*hit + 0.001*go + 0.001*write + 0.001*say + '
   '0.001*know + 0.001*drug + 0.001*see + 0.001*need'),
   (18,
   '0.002*option + 0.001*line + 0.001*flight + 0.001*power + 0.001*software + '
   '0.001*write + 0.001*add + 0.001*people + 0.001*organization + 0.001*module'),
   (19,
   '0.001*shuttle + 0.001*line + 0.001*roll + 0.001*attitude + 0.001*maneuver + '
   '0.001*mission + 0.001*also + 0.001*orbit + 0.001*produce + 0.001*frequency')
]