Python - Stemming and Lemmatization

In Natural Language Processing we often meet two or more words that share a common root. For example, the three words agreed, agreeing and agreeable share the root word agree. A search involving any of these words should treat them all as the same word, namely the root. It therefore becomes necessary to link each word to its root. The NLTK library provides methods that perform this linking and return the root word.
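To make the idea of linking words to a shared root concrete, here is a minimal sketch of naive suffix stripping. It is not the Porter algorithm and not part of NLTK: the names SUFFIXES, toy_stem and group_by_stem are hypothetical, and the rule list is deliberately tiny. It uses the word family around "connect" because rules this simple mishandle forms such as "agreed".

```python
# Toy suffix-stripping stemmer (illustrative only; NOT the Porter algorithm).
# The rule list is a hypothetical, deliberately small set, longest suffix first.
SUFFIXES = ("ions", "ion", "ing", "ed", "s")


def toy_stem(word):
    """Lowercase the word and strip the first matching suffix.

    A suffix is only removed if at least three characters remain.
    """
    lowered = word.lower()
    for suffix in SUFFIXES:
        if lowered.endswith(suffix) and len(lowered) - len(suffix) >= 3:
            return lowered[: -len(suffix)]
    return lowered


def group_by_stem(words):
    """Map each stem to the list of original words that reduce to it."""
    groups = {}
    for word in words:
        groups.setdefault(toy_stem(word), []).append(word)
    return groups


if __name__ == "__main__":
    family = ["connect", "connected", "connecting", "connections"]
    print(group_by_stem(family))
```

A search index built on such stems would treat every word in the family as the same entry. Real stemmers add many more rules and conditions to avoid over-stripping.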

The program below uses the Porter stemming algorithm to reduce each word to its stem.

import nltk
from nltk.stem.porter import PorterStemmer

porter_stemmer = PorterStemmer()

word_data = "It originated from the idea that there are readers who prefer learning new skills from the comforts of their drawing rooms"
# First, split the sentence into word tokens.
# This needs the NLTK tokenizer data, e.g. nltk.download('punkt')
# (newer NLTK releases may ask for 'punkt_tab' instead).
nltk_tokens = nltk.word_tokenize(word_data)
# Next, find the stem of each token
for w in nltk_tokens:
    print("Actual: %s  Stem: %s" % (w, porter_stemmer.stem(w)))

Executing the code above produces output like the following. Recent NLTK releases lowercase each stem by default, so the first line may read "Stem: it".

Actual: It  Stem: It
Actual: originated  Stem: origin
Actual: from  Stem: from
Actual: the  Stem: the
Actual: idea  Stem: idea
Actual: that  Stem: that
Actual: there  Stem: there
Actual: are  Stem: are
Actual: readers  Stem: reader
Actual: who  Stem: who
Actual: prefer  Stem: prefer
Actual: learning  Stem: learn
Actual: new  Stem: new
Actual: skills  Stem: skill
Actual: from  Stem: from
Actual: the  Stem: the
Actual: comforts  Stem: comfort
Actual: of  Stem: of
Actual: their  Stem: their
Actual: drawing  Stem: draw
Actual: rooms  Stem: room

Lemmatization is similar to stemming, but it brings context to the words. Instead of stripping suffixes, it goes a step further and links each word to its dictionary form (lemma) by looking it up in a vocabulary, so the result is always a real word. For example, a lemmatizer maps cars to car and, when told the word is an adjective, better to good. The program below uses the WordNet lexical database for lemmatization.

import nltk
from nltk.stem import WordNetLemmatizer

# Requires the WordNet data, e.g. nltk.download('wordnet')
wordnet_lemmatizer = WordNetLemmatizer()

word_data = "It originated from the idea that there are readers who prefer learning new skills from the comforts of their drawing rooms"
# Split the sentence into word tokens
nltk_tokens = nltk.word_tokenize(word_data)
# Look up the lemma of each token (treated as a noun by default)
for w in nltk_tokens:
    print("Actual: %s  Lemma: %s" % (w, wordnet_lemmatizer.lemmatize(w)))

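Executing the code above produces the following output. Because lemmatize() treats every word as a noun unless a part of speech is given, verb forms such as originated, learning and drawing are returned unchanged.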

Actual: It  Lemma: It
Actual: originated  Lemma: originated
Actual: from  Lemma: from
Actual: the  Lemma: the
Actual: idea  Lemma: idea
Actual: that  Lemma: that
Actual: there  Lemma: there
Actual: are  Lemma: are
Actual: readers  Lemma: reader
Actual: who  Lemma: who
Actual: prefer  Lemma: prefer
Actual: learning  Lemma: learning
Actual: new  Lemma: new
Actual: skills  Lemma: skill
Actual: from  Lemma: from
Actual: the  Lemma: the
Actual: comforts  Lemma: comfort
Actual: of  Lemma: of
Actual: their  Lemma: their
Actual: drawing  Lemma: drawing
Actual: rooms  Lemma: room