Getting Started with GloVe Embeddings
Do you want to use GloVe embeddings in your project? Is the terminology getting in your way? Great! You are in the right place.
Note: This article does not cover the math behind GloVe embeddings.
In this article, we will learn how to use GloVe embeddings to convert any text data into numbers. We will walk through the steps on a short text corpus and then apply them to obtain embeddings for the IMDB movie review dataset. We will then use those embeddings to train a binary sentiment classifier on the same dataset.
Let's get started!
Introduction
There are several pretrained GloVe word embeddings available for download. More information about the training corpora of the different GloVe embeddings can be found on this website. In this tutorial, we will use the glovetwitter27b50d embeddings, which have 50 dimensions and were trained on 2B tweets from Twitter.
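As a side note (not part of the original walkthrough), if you would rather not handle the raw text file yourself, the same vectors can usually be fetched through gensim's downloader; the model name "glove-twitter-50" used below is gensim's identifier for these vectors and is an assumption, so treat this as an optional sketch.
# Optional alternative: load the 50-dimensional Twitter GloVe vectors with gensim
# (requires the gensim package; downloads the vectors on first use)
import gensim.downloader as api

glove_vectors = api.load("glove-twitter-50")
print(glove_vectors["hello"][:5])  # first five dimensions of the vector for "hello"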
The embeddings come as a text file in which each line is a string containing a word followed by its vector representation. We will convert the contents of this text file into a dictionary.
import numpy as np
from tqdm import tqdm

# Read the text file
glovetwitter27b50d = "path_to_glovetwitter27b50d.txt"
file = open(glovetwitter27b50d)
glovetwitter27b50d = file.readlines()
file.close()

# Convert the text file into a dictionary mapping each word to its vector
def ConvertToEmbeddingDictionary(glovetwitter27b50d):
    embedding_dictionary = {}
    for word_embedding in tqdm(glovetwitter27b50d):
        word_embedding = word_embedding.split()
        word = word_embedding[0]
        embedding = np.array([float(i) for i in word_embedding[1:]])
        embedding_dictionary[word] = embedding
    return embedding_dictionary

embedding_dictionary = ConvertToEmbeddingDictionary(glovetwitter27b50d)
# Let's look at the embedding of the word "hello."
embedding_dictionary['hello']
Output:
array([ 0.28751 , 0.31323 , -0.29318 , 0.17199 , -0.69232 ,
-0.4593 , 1.3364 , 0.709 , 0.12118 , 0.11476 ,
-0.48505 , -0.088608 , -3.0154 , -0.54024 , -1.326 ,
0.39477 , 0.11755 , -0.17816 , -0.32272 , 0.21715 ,
0.043144 , -0.43666 , -0.55857 , -0.47601 , -0.095172 ,
0.0031934, 0.1192 , -0.23643 , 1.3234 , -0.45093 ,
-0.65837 , -0.13865 , 0.22145 , -0.35806 , 0.20988 ,
0.054894 , -0.080322 , 0.48942 , 0.19206 , 0.4556 ,
-1.642 , -0.83323 , -0.12974 , 0.96514 , -0.18214 ,
0.37733 , -0.19622 , -0.12231 , -0.10496 , 0.45388 ])
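As a quick sanity check (an addition for illustration, not from the original article), we can compare two word vectors with cosine similarity; related words such as "hello" and "hi" should score noticeably higher than an unrelated pair.
# Cosine similarity between two word vectors (illustrative helper)
def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine_similarity(embedding_dictionary['hello'], embedding_dictionary['hi']))
print(cosine_similarity(embedding_dictionary['hello'], embedding_dictionary['car']))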
Next, let's tokenize a small sample corpus with the Keras Tokenizer so that each word is mapped to an integer index.
sample_corpus = ['The woods are lovely, dark and deep',
                 'But I have promises to keep',
                 'And miles to go before I sleep',
                 'And miles to go before I sleep']
from tensorflow.keras.preprocessing.text import Tokenizer

# This is the maximum number of tokens we wish to consider from our dataset.
# When there are more tokens, the tokens with the highest frequency are kept.
max_number_of_words = 5

# Note: the Keras Tokenizer keeps only the top n-1 tokens when num_words is set to n
tokenizer = Tokenizer(num_words=max_number_of_words)
tokenizer.fit_on_texts(sample_corpus)
sample_corpus_tokenized = tokenizer.texts_to_sequences(sample_corpus)
print(tokenizer.word_index)
Output:
{'and': 1, 'i': 2, 'to': 3, 'miles': 4, 'go': 5, 'before': 6, 'sleep': 7, 'the': 8, 'woods': 9, 'are': 10, 'lovely': 11, 'dark': 12, 'deep': 13, 'but': 14, 'have': 15, 'promises': 16, 'keep': 17}
print("But I have promises to keep: ", sample_corpus_tokenized[1])
Output:
But I have promises to keep: [2, 3]
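If you want to see which words actually survived the num_words cutoff, the index sequences can be mapped back to text with the tokenizer's sequences_to_texts method (an optional check, not part of the original walkthrough).
# Map the token indices back to words to see what was kept after the cutoff
print(tokenizer.sequences_to_texts(sample_corpus_tokenized))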
Now that we have selected a set of tokens from our text corpus, we need to build an embedding matrix for them. The embedding matrix has as many columns as the embedding dimension and as many rows as the number of tokens.
# Create embedding matrix
total_number_of_words = min(max_number_of_words, len(tokenizer.word_index))
embedding_matrix = np.zeros((total_number_of_words, 50))
for word, i in tokenizer.word_index.items():
    if i >= total_number_of_words: break
    if word in embedding_dictionary.keys():
        embedding_vector = embedding_dictionary[word]
        embedding_matrix[i] = embedding_vector
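It can also be useful to check how much of our vocabulary is covered by GloVe; rows of the embedding matrix that are still all zeros correspond to tokens without a pretrained vector (this check is an addition for illustration, not part of the original article).
# Count rows of the embedding matrix that received no GloVe vector
# (row 0 is never assigned, since Keras token indices start at 1)
oov_rows = int(np.sum(~embedding_matrix.any(axis=1)))
print(f"{oov_rows} of {embedding_matrix.shape[0]} rows have no GloVe vector")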
Neural networks and other ML algorithms cannot handle variable-length inputs, so we need to convert the embedding of each input sequence into a fixed-size vector. There are many ways to do this, but the simplest is to sum the embeddings of every token in a sentence and normalize the resulting vector.
def convertToSentenceVector(sentences):
    new_sentences = []
    for sentence in sentences:
        sentence_vector = []
        for word_index in sentence:
            sentence_vector.append(embedding_matrix[word_index])
        # Sum the token embeddings and normalize the resulting vector
        sentence_vector = np.array(sentence_vector).sum(axis=0)
        sentence_vector = sentence_vector / np.sqrt((sentence_vector ** 2).sum())
        new_sentences.append(sentence_vector)
    return np.array(new_sentences)
sample_corpus_vectorized = convertToSentenceVector(sample_corpus_tokenized)
# Print the 50-dimensional embedding of the first sentence in our text corpus.
print(sample_corpus_vectorized[0])
Output:
[-0.43196 -0.18965 -0.028294 -0.25903 -0.4481 0.53591
0.94627 -0.07806 -0.54519 -0.72878 -0.030083 -0.28677
-6.464 -0.31295 0.12351 -0.2463 0.029458 -0.83529
0.19647 -0.15722 -0.5562 -0.027029 -0.23915 0.18188
-0.15156 0.54768 0.13767 0.21828 0.61069 -0.3679
0.023187 0.33281 -0.18062 -0.0094163 0.31861 -0.19201
0.35759 0.50104 0.55981 0.20561 -1.1167 -0.3063
-0.14224 0.20285 0.10245 -0.39289 -0.26724 -0.37573
0.16076 -0.74501 ]
Sentiment Classification: The IMDB Movie Review Dataset
Let's now apply the same steps to the IMDB movie review dataset and train a binary sentiment classifier on it.
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# Read the data
df = pd.read_csv("IMDB_Dataset.csv")
X = df['review']
y = df['sentiment']

# Convert labels to numbers
le = LabelEncoder()
y = le.fit_transform(y)
# Split the data into train and test sets.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.10, random_state=0)
# Set the maximum number of tokens to 50000
max_number_of_words = 50000
# Tokenize training set
tokenizer = Tokenizer(num_words=max_number_of_words)
tokenizer.fit_on_texts(X_train)
X_train = tokenizer.texts_to_sequences(X_train)
X_test = tokenizer.texts_to_sequences(X_test)
# Create embedding matrix
total_number_of_words = min(max_number_of_words, len(tokenizer.word_index))
embedding_matrix = np.zeros((total_number_of_words+1,50))
for word, i in tokenizer.word_index.items():
if i >= total_number_of_words: break
if word in embedding_dictionary.keys():
embedding_vector = embedding_dictionary[word]
embedding_matrix[i] = embedding_vector
# Get a fixed-size embedding for every review using the function defined earlier
X_train = convertToSentenceVector(X_train)
X_test = convertToSentenceVector(X_test)
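As a quick check (added here for clarity, not part of the original flow), every review should now be represented by a single 50-dimensional vector.
# Each review is now a single 50-dimensional vector
print(X_train.shape, X_test.shape)  # expected: (45000, 50) (5000, 50)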
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout

# Define a sequential model
model = Sequential()
model.add(Dense(100, input_shape=(50,), activation="relu"))
model.add(Dense(1000, activation = "relu"))
model.add(Dropout(0.2))
model.add(Dense(1, activation = "sigmoid"))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.summary()
# Fit the model
model.fit(X_train, y_train, batch_size=32, epochs=10, validation_split=0.1)
# Get the classification report on the test set
y_pred = model.predict(X_test)
y_pred = y_pred.round()
print(classification_report(y_test, y_pred))
Output:
              precision    recall  f1-score   support

           0       0.82      0.77      0.79      2553
           1       0.77      0.82      0.80      2447

    accuracy                           0.80      5000
   macro avg       0.80      0.80      0.80      5000
weighted avg       0.80      0.80      0.80      5000
Instead of collapsing each review into a single fixed-size vector, we can also feed the padded token sequences to an Embedding layer initialized with the GloVe matrix and train a bidirectional LSTM on top of it.
# Recreate the raw-text train/test split (X_train and X_test were overwritten
# with sentence vectors in the previous approach)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.10, random_state=0)

max_number_of_words = 50000
max_length = 100

tokenizer = Tokenizer(num_words=max_number_of_words)
tokenizer.fit_on_texts(X_train)
X_train = tokenizer.texts_to_sequences(X_train)
X_test = tokenizer.texts_to_sequences(X_test)
# We can use padding to make the length of every review equal to max_length
from tensorflow.keras.preprocessing.sequence import pad_sequences

X_train = pad_sequences(X_train, maxlen=max_length)
X_test = pad_sequences(X_test, maxlen=max_length)

from tensorflow.keras.layers import Embedding, Bidirectional, LSTM, GlobalMaxPool1D

# Define a sequential model
model = Sequential()
# Add an embedding layer and pass the GloVe embedding matrix as its initial weights
model.add(Embedding(max_number_of_words+1, 50, input_shape=(100,), weights=[embedding_matrix]))
model.add(Bidirectional(LSTM(50, return_sequences=True, dropout=0.1, recurrent_dropout=0.1)))
model.add(GlobalMaxPool1D())
model.add(Dense(50, activation="relu"))
model.add(Dropout(0.1))
model.add(Dense(1, activation="sigmoid"))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.summary()
# Fit the model
model.fit(X_train, y_train, batch_size=32, epochs=10, validation_split=0.1);
# Get the classification report
y_pred = model.predict(X_test)
y_pred = y_pred.round()
print(classification_report(y_test, y_pred))
Output:
              precision    recall  f1-score   support

           0       0.87      0.83      0.85      2553
           1       0.83      0.87      0.85      2447

    accuracy                           0.85      5000
   macro avg       0.85      0.85      0.85      5000
weighted avg       0.85      0.85      0.85      5000
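As a final illustrative sketch (not from the original article), the trained LSTM model can be used to score a new review by reusing the tokenizer, pad_sequences, and max_length defined above; the example review below is made up.
# Score a new (made-up) review with the trained model
new_review = ["The plot was predictable but the acting made it worth watching"]
new_sequence = pad_sequences(tokenizer.texts_to_sequences(new_review), maxlen=max_length)
probability = model.predict(new_sequence)[0][0]
print("positive" if probability >= 0.5 else "negative", probability)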
Now that you have a better understanding of GloVe embeddings, you are ready to apply them to a variety of NLP problems.
Note: You can access the complete code from this link.
References:
- https://www.tensorflow.org/text/guide/word_embeddings
- https://www.kaggle.com/code/jhoward/improved-lstm-baseline-glove-dropout
- https://www.kaggle.com/code/abhishek/approaching-almost-any-nlp-problem-on-kaggle