Gensim - Creación del modelo de tema LDA

Este capítulo le ayudará a aprender cómo crear un modelo de tema de asignación de Dirichlet latente (LDA) en Gensim.

Extracción automática de información sobre temas de un gran volumen de textos en una de las principales aplicaciones de PNL (procesamiento del lenguaje natural). Un gran volumen de textos podría ser información de reseñas de hoteles, tweets, publicaciones de Facebook, fuentes de cualquier otro canal de redes sociales, reseñas de películas, noticias, comentarios de los usuarios, correos electrónicos, etc.

En esta era digital, saber de qué hablan las personas / clientes, comprender sus opiniones y sus problemas, puede ser muy valioso para las empresas, las campañas políticas y los administradores. Pero, ¿es posible leer manualmente volúmenes tan grandes de texto y luego extraer la información de los temas?

No, no es. Requiere un algoritmo automático que pueda leer este gran volumen de documentos de texto y extraer automáticamente la información o los temas que se debaten.

Papel de LDA

El enfoque de LDA para el modelado de temas es clasificar el texto de un documento en un tema en particular. Modelado como distribuciones de Dirichlet, LDA construye:

Un tema por modelo de documento y
Palabras por modelo de tema

Después de proporcionar el algoritmo del modelo de tema LDA, para obtener una buena composición de la distribución tema-palabra clave, reorganiza:

Las distribuciones de temas dentro del documento y
Distribución de palabras clave dentro de los temas.

Durante el procesamiento, algunas de las suposiciones hechas por LDA son:

Cada documento se modela como distribuciones multinominales de temas.
Cada tema se modela como distribuciones multinominales de palabras.
Deberíamos elegir el corpus correcto de datos porque LDA asume que cada fragmento de texto contiene las palabras relacionadas.
LDA también asume que los documentos se producen a partir de una combinación de temas.

Implementación con Gensim

Aquí, vamos a utilizar LDA (asignación de Dirichlet latente) para extraer los temas discutidos naturalmente del conjunto de datos.

Cargando conjunto de datos

El conjunto de datos que vamos a utilizar es el conjunto de datos de ’20 Newsgroups’tener miles de artículos de noticias de varias secciones de un informe de noticias. Está disponible bajoSklearnconjuntos de datos. Podemos descargar fácilmente con la ayuda de la siguiente secuencia de comandos de Python:

from sklearn.datasets import fetch_20newsgroups
newsgroups_train = fetch_20newsgroups(subset='train')

Veamos algunas de las noticias de muestra con la ayuda del siguiente guión:

newsgroups_train.data[:4]

["From: [email protected] (where's my thing)\nSubject: 
WHAT car is this!?\nNntp-Posting-Host: rac3.wam.umd.edu\nOrganization: 
University of Maryland, College Park\nLines: 
15\n\n I was wondering if anyone out there could enlighten me on this car 
I saw\nthe other day. It was a 2-door sports car, looked to be from the 
late 60s/\nearly 70s. It was called a Bricklin. The doors were really small. 
In addition,\nthe front bumper was separate from the rest of the body. 
This is \nall I know. If anyone can tellme a model name, 
engine specs, years\nof production, where this car is made, history, or 
whatever info you\nhave on this funky looking car, please e-mail.\n\nThanks,
\n- IL\n ---- brought to you by your neighborhood Lerxst ----\n\n\n\n\n",

"From: [email protected] (Guy Kuo)\nSubject: SI Clock Poll - Final 
Call\nSummary: Final call for SI clock reports\nKeywords: 
SI,acceleration,clock,upgrade\nArticle-I.D.: shelley.1qvfo9INNc3s\nOrganization: 
University of Washington\nLines: 11\nNNTP-Posting-Host: carson.u.washington.edu\n\nA 
fair number of brave souls who upgraded their SI clock oscillator have\nshared their 
experiences for this poll. Please send a brief message detailing\nyour experiences with 
the procedure. Top speed attained, CPU rated speed,\nadd on cards and adapters, heat 
sinks, hour of usage per day, floppy disk\nfunctionality with 800 and 1.4 m floppies 
are especially requested.\n\nI will be summarizing in the next two days, so please add 
to the network\nknowledge base if you have done the clock upgrade and haven't answered 
this\npoll. Thanks.\n\nGuy Kuo <;[email protected]>\n",

'From: [email protected] (Thomas E Willis)\nSubject: 
PB questions...\nOrganization: Purdue University Engineering 
Computer Network\nDistribution: usa\nLines: 36\n\nwell folks, 
my mac plus finally gave up the ghost this weekend after\nstarting 
life as a 512k way back in 1985. sooo, i\'m in the market for 
a\nnew machine a bit sooner than i intended to be...\n\ni\'m looking 
into picking up a powerbook 160 or maybe 180 and have a bunch\nof 
questions that (hopefully) somebody can answer:\n\n* does anybody 
know any dirt on when the next round of powerbook\nintroductions 
are expected? i\'d heard the 185c was supposed to make an\nappearence 
"this summer" but haven\'t heard anymore on it - and since i\ndon\'t 
have access to macleak, i was wondering if anybody out there had\nmore 
info...\n\n* has anybody heard rumors about price drops to the powerbook 
line like the\nones the duo\'s just went through recently?\n\n* what\'s 
the impression of the display on the 180? i could probably swing\na 180 
if i got the 80Mb disk rather than the 120, but i don\'t really have\na 
feel for how much "better" the display is (yea, it looks great in the\nstore, 
but is that all "wow" or is it really that good?). could i solicit\nsome 
opinions of people who use the 160 and 180 day-to-day on if its
worth\ntaking the disk size and money hit to get the active display? 
(i realize\nthis is a real subjective question, but i\'ve only played around 
with the\nmachines in a computer store breifly and figured the opinions 
of somebody\nwho actually uses the machine daily might prove helpful).\n\n* 
how well does hellcats perform? ;)\n\nthanks a bunch in advance for any info - 
if you could email, i\'ll post a\nsummary (news reading time is at a premium 
with finals just around the\ncorner... :
( )\n--\nTom Willis \\ [email protected] \\ Purdue Electrical 
Engineering\n---------------------------------------------------------------------------\
n"Convictions are more dangerous enemies of truth than lies." - F. W.\nNietzsche\n',

'From: jgreen@amber (Joe Green)\nSubject: Re: Weitek P9000 ?\nOrganization: 
Harris Computer Systems Division\nLines: 14\nDistribution: world\nNNTP-Posting-Host: 
amber.ssd.csd.harris.com\nX-Newsreader: TIN [version 1.1 PL9]\n\nRobert 
J.C. Kyanko ([email protected]) wrote:\n >[email protected] writes in article 
<[email protected] >:\n> > Anyone know about the 
Weitek P9000 graphics chip?\n > As far as the low-level stuff goes, it looks 
pretty nice. It\'s got this\n> quadrilateral fill command that requires just 
the four points.\n\nDo you have Weitek\'s address/phone number? I\'d like to get 
some information\nabout this chip.\n\n--\nJoe Green\t\t\t\tHarris 
Corporation\[email protected]\t\t\tComputer Systems Division\n"The only 
thing that really scares me is a person with no sense of humor.
"\n\t\t\t\t\t\t-- Jonathan Winters\n']

Requisito previo

Necesitamos palabras vacías de NLTK y el modelo en inglés de Scapy. Ambos se pueden descargar de la siguiente manera:

import nltk;
nltk.download('stopwords')
nlp = spacy.load('en_core_web_md', disable=['parser', 'ner'])

Importación de paquetes necesarios

Para construir el modelo LDA, necesitamos importar el siguiente paquete necesario:

import re
import numpy as np
import pandas as pd
from pprint import pprint
import gensim
import gensim.corpora as corpora
from gensim.utils import simple_preprocess
from gensim.models import CoherenceModel
import spacy
import pyLDAvis
import pyLDAvis.gensim
import matplotlib.pyplot as plt

Preparación de palabras vacías

Ahora, necesitamos importar las palabras vacías y usarlas:

from nltk.corpus import stopwords
stop_words = stopwords.words('english')
stop_words.extend(['from', 'subject', 're', 'edu', 'use'])

Limpiar el texto

Ahora, con la ayuda de Gensim's simple_preprocess()necesitamos tokenizar cada oración en una lista de palabras. También debemos eliminar las puntuaciones y los caracteres innecesarios. Para hacer esto, crearemos una función llamadasent_to_words() -

def sent_to_words(sentences):
   for sentence in sentences:
      yield(gensim.utils.simple_preprocess(str(sentence), deacc=True))
data_words = list(sent_to_words(data))

Creación de modelos Bigram y Trigram

Como sabemos, los bigramas son dos palabras que aparecen juntas con frecuencia en el documento y los trigramas son tres palabras que aparecen juntas con frecuencia en el documento. Con la ayuda de Gensim'sPhrases modelo, podemos hacer esto -

bigram = gensim.models.Phrases(data_words, min_count=5, threshold=100)
trigram = gensim.models.Phrases(bigram[data_words], threshold=100)
bigram_mod = gensim.models.phrases.Phraser(bigram)
trigram_mod = gensim.models.phrases.Phraser(trigram)

Filtrar palabras vacías

A continuación, debemos filtrar las palabras vacías. Junto con eso, también crearemos funciones para hacer bigramas, trigramas y para lematización -

def remove_stopwords(texts):
   return [[word for word in simple_preprocess(str(doc))
if word not in stop_words] for doc in texts]
def make_bigrams(texts):
   return [bigram_mod[doc] for doc in texts]
def make_trigrams(texts):
   return [trigram_mod[bigram_mod[doc]] for doc in texts]
def lemmatization(texts, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV']):
   texts_out = []
   for sent in texts:
     doc = nlp(" ".join(sent))
     texts_out.append([token.lemma_ for token in doc if token.pos_ in allowed_postags])
   return texts_out

Creación de diccionario y corpus para el modelo de tema

Ahora necesitamos construir el diccionario y el corpus. También lo hicimos en los ejemplos anteriores:

id2word = corpora.Dictionary(data_lemmatized)
texts = data_lemmatized
corpus = [id2word.doc2bow(text) for text in texts]

Modelo de tema de construcción de LDA

Ya implementamos todo lo que se requiere para entrenar el modelo LDA. Ahora es el momento de construir el modelo de temas de LDA. Para nuestro ejemplo de implementación, se puede hacer con la ayuda de la siguiente línea de códigos:

lda_model = gensim.models.ldamodel.LdaModel(
   corpus=corpus, id2word=id2word, num_topics=20, random_state=100, 
   update_every=1, chunksize=100, passes=10, alpha='auto', per_word_topics=True
)

Ejemplo de implementación

Veamos el ejemplo de implementación completo para construir el modelo de temas de LDA:

import re
import numpy as np
import pandas as pd
from pprint import pprint
import gensim
import gensim.corpora as corpora
from gensim.utils import simple_preprocess
from gensim.models import CoherenceModel
import spacy
import pyLDAvis
import pyLDAvis.gensim
import matplotlib.pyplot as plt
from nltk.corpus import stopwords
stop_words = stopwords.words('english')
stop_words.extend(['from', 'subject', 're', 'edu', 'use'])
from sklearn.datasets import fetch_20newsgroups
newsgroups_train = fetch_20newsgroups(subset='train')
data = newsgroups_train.data
data = [re.sub('\S*@\S*\s?', '', sent) for sent in data]
data = [re.sub('\s+', ' ', sent) for sent in data]
data = [re.sub("\'", "", sent) for sent in data]
print(data_words[:4]) #it will print the data after prepared for stopwords
bigram = gensim.models.Phrases(data_words, min_count=5, threshold=100)
trigram = gensim.models.Phrases(bigram[data_words], threshold=100)
bigram_mod = gensim.models.phrases.Phraser(bigram)
trigram_mod = gensim.models.phrases.Phraser(trigram)
def remove_stopwords(texts):
   return [[word for word in simple_preprocess(str(doc)) 
   if word not in stop_words] for doc in texts]
def make_bigrams(texts):
   return [bigram_mod[doc] for doc in texts]
def make_trigrams(texts):
   [trigram_mod[bigram_mod[doc]] for doc in texts]
def lemmatization(texts, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV']):
   texts_out = []
   for sent in texts:
      doc = nlp(" ".join(sent))
      texts_out.append([token.lemma_ for token in doc if token.pos_ in allowed_postags])
   return texts_out
data_words_nostops = remove_stopwords(data_words)
data_words_bigrams = make_bigrams(data_words_nostops)
nlp = spacy.load('en_core_web_md', disable=['parser', 'ner'])
data_lemmatized = lemmatization(data_words_bigrams, allowed_postags=[
   'NOUN', 'ADJ', 'VERB', 'ADV'
])
print(data_lemmatized[:4]) #it will print the lemmatized data.
id2word = corpora.Dictionary(data_lemmatized)
texts = data_lemmatized
corpus = [id2word.doc2bow(text) for text in texts]
print(corpus[:4]) #it will print the corpus we created above.
[[(id2word[id], freq) for id, freq in cp] for cp in corpus[:4]] 
#it will print the words with their frequencies.
lda_model = gensim.models.ldamodel.LdaModel(
   corpus=corpus, id2word=id2word, num_topics=20, random_state=100, 
   update_every=1, chunksize=100, passes=10, alpha='auto', per_word_topics=True
)

Ahora podemos usar el modelo LDA creado anteriormente para obtener los temas y calcular la perplejidad del modelo.