Getting Started with GloVe Embeddings

Nov 27 2022

Do you want to use GloVe embeddings in your project? Confused by all the terminology? Congratulations! You have come to the right place. Note: this article does not discuss the math behind GloVe embeddings.

Photo by Nick Morrison on Unsplash

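In this article, you will learn how to convert text data into numbers using GloVe embeddings. We will walk through the steps on a short text corpus, then apply the same steps to obtain embeddings for the IMDB movie review dataset. Finally, we will use those embeddings to train a binary sentiment classifier on the same dataset.

Let's get started!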

Introduction

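A variety of pre-trained GloVe word embeddings are available for download. Details about the training corpus behind each set of GloVe embeddings can be found on this website. In this tutorial, we use the glovetwitter27b50d embeddings, which have 50 dimensions and were trained on 2B tweets from Twitter.

The embeddings come as a text file in which each line is a string containing a word followed by its vector representation. We will convert the contents of this text file into a dictionary.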

import numpy as np
from tqdm import tqdm

# Read the text file (replace the placeholder with the path to your copy)
glove_path = "path_to_glovetwitter27b50d.txt"
with open(glove_path, encoding="utf-8") as file:
    glovetwitter27b50d = file.readlines()


# Convert the text file into a dictionary mapping each word to its vector
def ConvertToEmbeddingDictionary(glovetwitter27b50d):
    embedding_dictionary = {}
    for word_embedding in tqdm(glovetwitter27b50d):
        # Each line holds the word followed by its vector components
        word_embedding = word_embedding.split()
        word = word_embedding[0]
        embedding = np.array([float(i) for i in word_embedding[1:]])
        embedding_dictionary[word] = embedding
    return embedding_dictionary

embedding_dictionary = ConvertToEmbeddingDictionary(glovetwitter27b50d)

# Let's look at the embedding of the word "hello."
embedding_dictionary['hello']
Output:
array([ 0.28751  ,  0.31323  , -0.29318  ,  0.17199  , -0.69232  ,
       -0.4593   ,  1.3364   ,  0.709    ,  0.12118  ,  0.11476  ,
       -0.48505  , -0.088608 , -3.0154   , -0.54024  , -1.326    ,
        0.39477  ,  0.11755  , -0.17816  , -0.32272  ,  0.21715  ,
        0.043144 , -0.43666  , -0.55857  , -0.47601  , -0.095172 ,
        0.0031934,  0.1192   , -0.23643  ,  1.3234   , -0.45093  ,
       -0.65837  , -0.13865  ,  0.22145  , -0.35806  ,  0.20988  ,
        0.054894 , -0.080322 ,  0.48942  ,  0.19206  ,  0.4556   ,
       -1.642    , -0.83323  , -0.12974  ,  0.96514  , -0.18214  ,
        0.37733  , -0.19622  , -0.12231  , -0.10496  ,  0.45388  ])
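
To make the file format described above concrete, here is a minimal sketch that parses two made-up lines with 3-dimensional vectors (the real file has 50 values per line). The sample values and the helper name parse_glove_line are hypothetical and used only for illustration; they do not come from the GloVe files.

```python
import numpy as np

# Two made-up lines in the "word v1 v2 ... vN" layout (values are illustrative only).
sample_lines = [
    "hello 0.1 -0.2 0.3\n",
    "world 0.4 0.5 -0.6\n",
]

def parse_glove_line(line):
    """Split one line into (word, vector). Hypothetical helper for illustration."""
    parts = line.split()
    return parts[0], np.array([float(x) for x in parts[1:]])

toy_dictionary = dict(parse_glove_line(line) for line in sample_lines)
print(toy_dictionary["hello"].shape)
```

Every vector in the dictionary should have the same length, so checking the shape of one or two entries is a cheap sanity test after loading.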

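# Assumed import: the original snippet did not show where Tokenizer comes from
from tensorflow.keras.preprocessing.text import Tokenizer

sample_corpus = ['The woods are lovely, dark and deep',
                 'But I have promises to keep',
                 'And miles to go before I sleep',
                 'And miles to go before I sleep']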

# This is the maximum number of tokens we wish to consider from our dataset.
# When there are more tokens, the tokens with the highest frequency are chosen.
max_number_of_words = 5

# Note: Keras tokenizer selects only top n-1 tokens if the num_words is set to n
tokenizer = Tokenizer(num_words=max_number_of_words)
tokenizer.fit_on_texts(sample_corpus)
sample_corpus_tokenized = tokenizer.texts_to_sequences(sample_corpus)
print(tokenizer.word_index)
Output:
{'and': 1, 'i': 2, 'to': 3, 'miles': 4, 'go': 5, 'before': 6, 'sleep': 7, 'the': 8, 'woods': 9, 'are': 10, 'lovely': 11, 'dark': 12, 'deep': 13, 'but': 14, 'have': 15, 'promises': 16, 'keep': 17}
print("But I have promises to keep: ", sample_corpus_tokenized[1])
Output:
But I have promises to keep:  [2, 3]
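
The "top n-1 tokens" behavior noted in the comment above can be reproduced in plain Python. The sketch below is a simplified stand-in for the Keras tokenizer, not its actual implementation: it only lowercases and strips commas, and the names simple_tokens, toy_word_index, and to_sequence are hypothetical.

```python
from collections import Counter

corpus = ["The woods are lovely, dark and deep",
          "But I have promises to keep",
          "And miles to go before I sleep",
          "And miles to go before I sleep"]

def simple_tokens(text):
    # Lowercase and strip commas; a simplification of the Keras default filters.
    return text.lower().replace(",", "").split()

counts = Counter(tok for line in corpus for tok in simple_tokens(line))
# Rank words by frequency (ties keep first-seen order); indices start at 1.
ranked = sorted(counts, key=lambda w: -counts[w])
toy_word_index = {w: i + 1 for i, w in enumerate(ranked)}

num_words = 5

def to_sequence(text):
    # Keep only tokens whose index is strictly below num_words (indices 1..4 here).
    return [toy_word_index[t] for t in simple_tokens(text)
            if toy_word_index[t] < num_words]

print(to_sequence("But I have promises to keep"))  # [2, 3]
```

Because index 0 is never assigned to a word, setting num_words to 5 leaves only the four most frequent words ("and", "i", "to", "miles"), which is why only "I" and "to" survive in the sentence above.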

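Now that we have selected a set of tokens from the text corpus, we need to build an embedding matrix for them. The embedding matrix has as many columns as the embedding dimension and as many rows as the number of tokens.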

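# Create embedding matrix
# Keras word indices start at 1, so row 0 stays all zeros and the matrix
# needs one more row than the largest index it stores.
total_number_of_words = min(max_number_of_words, len(tokenizer.word_index) + 1)
embedding_matrix = np.zeros((total_number_of_words, 50))
for word, i in tokenizer.word_index.items():
    if i >= total_number_of_words:
        break  # word_index is ordered by rank, so every later index is too large
    if word in embedding_dictionary:
        # Words missing from the GloVe vocabulary keep an all-zero row
        embedding_matrix[i] = embedding_dictionary[word]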
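
As a rough illustration of what the resulting matrix looks like, the sketch below builds one from a toy 3-dimensional dictionary. The vectors and the names toy_embeddings, toy_word_index, and toy_matrix are made up for this example; only the row-filling logic mirrors the step above.

```python
import numpy as np

# Hypothetical 3-dimensional vectors standing in for the 50-dimensional GloVe ones.
toy_embeddings = {"and": np.array([1.0, 0.0, 0.0]),
                  "to": np.array([0.0, 1.0, 0.0])}
toy_word_index = {"and": 1, "i": 2, "to": 3, "miles": 4, "go": 5}
num_rows, dim = 5, 3

toy_matrix = np.zeros((num_rows, dim))
for word, i in toy_word_index.items():
    # Skip indices outside the matrix and words without a known vector.
    if i < num_rows and word in toy_embeddings:
        toy_matrix[i] = toy_embeddings[word]

# Look up the rows for the token sequence [2, 3] ("i", "to").
print(toy_matrix[[2, 3]])
```

Row 0 and the rows of words without a vector ("i" and "miles" here) remain all zeros, so a tokenized sentence can be turned into its embeddings by plain integer indexing.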

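Many artificial neural networks and ML algorithms cannot process inputs of varying length, so the embeddings of every input sequence must be converted to a fixed size. There are several ways to do this, but the simplest is to sum the embeddings of all the tokens in a sentence and normalize the resulting vector.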
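
The sum-and-normalize step can be sketched as follows. This is a minimal example under stated assumptions: "normalize" is taken to mean scaling to unit L2 length, which the text does not specify, and the function name sentence_vector and the 2-dimensional toy_matrix are hypothetical.

```python
import numpy as np

def sentence_vector(token_ids, matrix):
    """Sum the rows for token_ids and scale the result to unit L2 length."""
    if len(token_ids) == 0:
        # No tokens survived tokenization: fall back to an all-zero vector.
        return np.zeros(matrix.shape[1])
    total = matrix[token_ids].sum(axis=0)
    norm = np.linalg.norm(total)
    # Avoid dividing by zero when every token maps to a zero row.
    return total / norm if norm > 0 else total

# Toy matrix: row 0 is the unused zero row, rows 1 and 2 are made-up vectors.
toy_matrix = np.array([[0.0, 0.0], [3.0, 0.0], [0.0, 4.0]])
print(sentence_vector([1, 2], toy_matrix))  # [0.6 0.8]
```

Whatever the sentence length, the output has as many entries as the embedding dimension, which is what lets a fixed-size classifier consume it.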