Приступайте к работе с вложениями GloVe

Nov 27 2022

Вы хотите использовать вложения GloVe в своем проекте? Различные термины вызывают у вас затруднения? Поздравляю! Вы находитесь в правильном месте. Примечание. В этой статье не обсуждается математика внедрения GloVe.

Фото Ника Моррисона на Unsplash

Вы хотите использовать вложения GloVe в своем проекте? Различные термины вызывают у вас затруднения? Поздравляю! Вы находитесь в правильном месте.

Примечание. В этой статье не обсуждается математика внедрения GloVe.

В этой статье мы узнаем, как использовать вложения GloVe для преобразования любых текстовых данных в числа. Мы изучим шаги, используя короткий текстовый корпус, а затем применим эти шаги, чтобы получить вложение для набора данных обзора фильмов IMDB. Мы будем использовать полученное встраивание для обучения бинарного классификатора настроений на том же наборе данных.

Давайте начнем!

Вступление

Существует множество предварительно обученных вложений слов GloVe, доступных для загрузки. Более подробную информацию об обучающем корпусе различных вложений Glove можно найти на этом веб-сайте. В этом уроке мы будем использовать вложения glovetwitter27b50d, которые имеют 50 измерений и были обучены на твитах 2B из Twitter.

Вложение доступно в виде текстового файла, где каждая строка содержит строку, содержащую слово и его векторное представление. Мы преобразуем содержимое этого текстового файла в словарь.

# Read the text file
glovetwitter27b50d = "pathe_to_glovetwitter27b50d.txt"
file = open(glovetwitter27b50d)
glovetwitter27b50d = file.readlines()


# Convert the text file into a dictionary
def ConvertToEmbeddingDictionary(glovetwitter27b50d):
    embedding_dictionary = {}
    for word_embedding in tqdm(glovetwitter27b50d):
        word_embedding = word_embedding.split()
        word = word_embedding[0]
        embedding = np.array([float(i) for i in word_embedding[1:]])
        embedding_dictionary[word] = embedding
    return embedding_dictionary
embedding_dictionary = ConvertToEmbeddingDictionary(glovetwitter27b50d)

# Let's look at the embedding of the word "hello."
embedding_dictionary['hello']
Output:
array([ 0.28751  ,  0.31323  , -0.29318  ,  0.17199  , -0.69232  ,
       -0.4593   ,  1.3364   ,  0.709    ,  0.12118  ,  0.11476  ,
       -0.48505  , -0.088608 , -3.0154   , -0.54024  , -1.326    ,
        0.39477  ,  0.11755  , -0.17816  , -0.32272  ,  0.21715  ,
        0.043144 , -0.43666  , -0.55857  , -0.47601  , -0.095172 ,
        0.0031934,  0.1192   , -0.23643  ,  1.3234   , -0.45093  ,
       -0.65837  , -0.13865  ,  0.22145  , -0.35806  ,  0.20988  ,
        0.054894 , -0.080322 ,  0.48942  ,  0.19206  ,  0.4556   ,
       -1.642    , -0.83323  , -0.12974  ,  0.96514  , -0.18214  ,
        0.37733  , -0.19622  , -0.12231  , -0.10496  ,  0.45388  ])

sample_corpus = ['The woods are lovely, dark and deep',
                 'But I have promises to keep',   
                 'And miles to go before I sleep', 
                 'And miles to go before I sleep']

# This is the maximum number of tokens we wish to consider from our dataset.
# When there are more tokens, the tokens with the highest frequency are chosen.
max_number_of_words = 5

# Note: Keras tokenizer selects only top n-1 tokens if the num_words is set to n
tokenizer = Tokenizer(num_words=max_number_of_words)
tokenizer.fit_on_texts(sample_corpus)
sample_corpus_tokenized = tokenizer.texts_to_sequences(sample_corpus)
print(tokenizer.word_index)
Output:
{'and': 1, 'i': 2, 'to': 3, 'miles': 4, 'go': 5, 'before': 6, 'sleep': 7, 'the': 8, 'woods': 9, 'are': 10, 'lovely': 11, 'dark': 12, 'deep': 13, 'but': 14, 'have': 15, 'promises': 16, 'keep': 17}
print("But I have promises to keep: ", sample_corpus_tokenized[1])
Output:
But I have promises to keep:  [2, 3]

Теперь, когда мы выбрали набор токенов из нашего текстового корпуса, мы должны разработать для них матрицу внедрения. Матрица внедрения будет иметь столбцы, равные размерности внедрения, и строки, равные количеству токенов .

# Create embedding matrix
total_number_of_words = min(max_number_of_words, len(tokenizer.word_index))
embedding_matrix = np.zeros((total_number_of_words,50))
for word, i in tokenizer.word_index.items():
    if i >= total_number_of_words: break
    if word in embedding_dictionary.keys():
        embedding_vector = embedding_dictionary[word]
        embedding_matrix[i] = embedding_vector

Искусственные нейронные сети и алгоритмы машинного обучения не могут работать с входными данными переменной длины, поэтому нам нужно преобразовать вложения каждой входной последовательности в фиксированный размер. Для этого есть много подходов, но самый простой — суммировать встраивание каждой лексемы в предложение и нормализовать вектор.

def convertToSentenceVector(sentences):
    new_sentences = []
    for sentence in sentences:
        sentence_vector = []
        for word_index in sentence:
            sentence_vector.append(embedding_matrix[word_index])
        sentence_vector = np.array(sentence_vector).sum(axis=0)
        embedding_vector / np.sqrt((embedding_vector ** 2).sum())
        new_sentences.append(sentence_vecto

sample_corpus_vectorized = convertToSentenceVector(sample_corpus_tokenized)

# Print the 50-dimensional embedding of the first sentence in our text corpus.
print(sample_corpus_vectorized[0])
Output:
[-0.43196   -0.18965   -0.028294  -0.25903   -0.4481     0.53591
  0.94627   -0.07806   -0.54519   -0.72878   -0.030083  -0.28677
 -6.464     -0.31295    0.12351   -0.2463     0.029458  -0.83529
  0.19647   -0.15722   -0.5562    -0.027029  -0.23915    0.18188
 -0.15156    0.54768    0.13767    0.21828    0.61069   -0.3679
  0.023187   0.33281   -0.18062   -0.0094163  0.31861   -0.19201
  0.35759    0.50104    0.55981    0.20561   -1.1167    -0.3063
 -0.14224    0.20285    0.10245   -0.39289   -0.26724   -0.37573
  0.16076   -0.74501  ]

Классификация настроений: набор данных обзоров фильмов IMDB

# Read the data
df = pd.read_csv("IMDB_Dataset.csv")
X = df['review']
y = df['sentiment']
# Convert labels to numbers
le = LabelEncoder()
y = le.fit_transform(y)
# Split the data into train and test sets.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.10, random_state=0)
#Set the maximum number of tokens to 50000
max_number_of_words = 50000

# Tokenize training set
tokenizer = Tokenizer(num_words=max_number_of_words)
tokenizer.fit_on_texts(X_train)
X_train = tokenizer.texts_to_sequences(X_train)
X_test = tokenizer.texts_to_sequences(X_test)

# Create embedding matrix
total_number_of_words = min(max_number_of_words, len(tokenizer.word_index))
embedding_matrix = np.zeros((total_number_of_words+1,50))
for word, i in tokenizer.word_index.items():
    if i >= total_number_of_words: break
    if word in embedding_dictionary.keys():
        embedding_vector = embedding_dictionary[word]
        embedding_matrix[i] = embedding_vector

# Get a fixed-size embedding for every comment using the function defined earlier
X_train = convertToSentenceVector(X_train)
X_test = convertToSentenceVector(X_test)

# Define a sequential model
model = Sequential()
model.add(Dense(100, input_shape = (50,), activation = "relu"))
model.add(Dense(1000, activation = "relu"))
model.add(Dropout(0.2))
model.add(Dense(1, activation = "sigmoid"))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.summary()

# Fit the model
model.fit(X_train, y_train, batch_size=32, epochs=10, validation_split=0.1)

# Get the classification report on the test set
y_pred = model.predict(X_test)
y_pred = y_pred.round()
print(classification_report(y_test, y_pred))
Output:
              precision    recall  f1-score   support

           0       0.82      0.77      0.79      2553
           1       0.77      0.82      0.80      2447

    accuracy                           0.80      5000
   macro avg       0.80      0.80      0.80      5000
weighted avg       0.80      0.80      0.80      5000

max_number_of_words = 50000
max_length = 100

tokenizer = Tokenizer(num_words=max_number_of_words)
tokenizer.fit_on_texts(X_train)
X_train = tokenizer.texts_to_sequences(X_train)
X_test = tokenizer.texts_to_sequences(X_test)

# We can use padding to make the length of every comment equal to max_length
X_train = pad_sequences(X_train, maxlen=max_length)
X_test = pad_sequences(X_test, maxlen=max_length)

# Define a sequential model
model = Sequential()
# Add an embedding layer and pass the embedding matrix as weights
model.add(Embedding(max_number_of_words+1, 50, input_shape = (100,), weights=[embedding_matrix]))
model.add(Bidirectional(LSTM(50, return_sequences=True, dropout=0.1, recurrent_dropout=0.1)))
model.add(GlobalMaxPool1D())
model.add(Dense(50, activation="relu"))
model.add(Dropout(0.1))
model.add(Dense(1, activation="sigmoid"))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.summary()

# Fit the model
model.fit(X_train, y_train, batch_size=32, epochs=10, validation_split=0.1);

# Get the classification report
y_pred = model.predict(X_test)
y_pred = y_pred.round()
print(classification_report(y_test, y_pred))
Output:
              precision    recall  f1-score   support

           0       0.87      0.83      0.85      2553
           1       0.83      0.87      0.85      2447

    accuracy                           0.85      5000
   macro avg       0.85      0.85      0.85      5000
weighted avg       0.85      0.85      0.85      5000

Теперь, когда вы лучше понимаете встраивание GloVe, вы готовы применить его к различным задачам НЛП.