Graph to connect sentences

Aug 20 2020

I have a list of sentences covering a few topics (two), like the following:

Sentences
Trump says that it is useful to win the next presidential election. 
The Prime Minister suggests the name of the winner of the next presidential election.
In yesterday's conference, the Prime Minister said that it is very important to win the next presidential election. 
The Chinese Minister is in London to discuss about climate change.
The president Donald Trump states that he wants to win the presidential election. This will require a strong media engagement.
The president Donald Trump states that he wants to win the presidential election. The UK has proposed collaboration. 
The president Donald Trump states that he wants to win the presidential election. He has the support of his electors.

As you can see, the sentences are similar to one another.

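I am trying to relate multiple sentences and visualize their characteristics using a (directed) graph. The graph is built from a similarity matrix, applying the row order of the sentences shown above. I created a new column, Time, to record the order of the sentences, so the first row (Trump says that ...) is at time 1, the second row (The Prime Minister suggests ...) is at time 2, and so on. Something like this: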

Time    Sentences
1       Trump said that it is useful to win the next presidential election.
2       The Prime Minister suggests the name of the winner of the next presidential election.
3       In today's conference, the Prime Minister said that it is very important to win the next presidential election.
...

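Then I would like to find the relationships between the sentences in order to get a clear overview of the topics. Multiple paths from a sentence would show that there are multiple pieces of information associated with it. To determine the similarity between two sentences, I tried extracting nouns and verbs as follows: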

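from nltk import pos_tag, word_tokenize

nouns = []
verbs = []
for index, row in df.iterrows():
    tagged = pos_tag(word_tokenize(row["Sentences"]))  # tag words, not characters
    # NN* covers all noun tags (NN, NNS, NNP, ...); VB* covers all verb tags
    nouns.append([word for word, pos in tagged if pos.startswith("NN")])
    verbs.append([word for word, pos in tagged if pos.startswith("VB")])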

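I chose nouns and verbs because they are the keywords of any sentence. So, if a keyword (noun or verb) appears in sentence x but not in the other sentences, it represents a difference between those two sentences. However, I think a better approach might be to use word2vec or gensim (WMD).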
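The keyword-difference idea described above can be sketched with plain set operations. This is only an illustration: the small `STOP_WORDS` set and the helper names `keywords` and `new_information` are assumptions made for the example, standing in for proper POS-based noun/verb extraction.

```python
import re

# Hypothetical stop-word list standing in for POS-based noun/verb filtering.
STOP_WORDS = {"the", "a", "an", "of", "to", "that", "it", "is",
              "he", "his", "in", "this", "will", "has"}


def keywords(sentence):
    """Return the set of lowercased tokens that are not stop words."""
    tokens = re.findall(r"[a-z']+", sentence.lower())
    return {t for t in tokens if t not in STOP_WORDS}


def new_information(previous, current):
    """Keywords that appear in `current` but not in `previous`."""
    return keywords(current) - keywords(previous)


s5 = ("The president Donald Trump states that he wants to win the "
      "presidential election. This will require a strong media engagement.")
s6 = ("The president Donald Trump states that he wants to win the "
      "presidential election. The UK has proposed collaboration.")

diff = new_information(s5, s6)
print(sorted(diff))
```

For the sentences at times 5 and 6, the difference contains only the words that the second sentence adds, which is what a node or edge label in the graph could display.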

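This similarity has to be computed for every sentence. I would like to build a graph that shows the content of the sentences in the example above. Since there are two topics (Trump and the Chinese Minister), I need to look for sub-topics for each of them. Trump, for example, has the sub-topic presidential election. A node in the graph should represent a sentence. The words in each node represent the differences between sentences and show the new information in that sentence. For example, the word states in the sentence at time 5 also appears in the adjacent sentences at times 6 and 7.

I would like to find a way to get a result similar to the one shown in the picture below. I have mainly tried noun and verb extraction, but that is probably not the right way to proceed. What I tried to do was to take the sentence at time 1, compare it with the other sentences, assign a similarity score (using noun and verb extraction, but also word2vec), and repeat this for all the other sentences. My problem, however, is how to extract the differences and build a graph that makes sense.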

For the graph part, I would consider using networkx (DiGraph):

import networkx as nx
from pyvis.network import Network  # assumption: Network is the pyvis class

G = nx.DiGraph()
N = Network(directed=True)

to show the direction of the relationships.
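One possible way to turn the sentence order into directed relationships is sketched below. It is not the approach from the question: it assumes Jaccard overlap of raw tokens as the similarity score, a hypothetical threshold of 0.2, and edges that always point from the earlier sentence to the later one, labeled with the words the later sentence adds.

```python
import re
from itertools import combinations


def tokens(sentence):
    """Lowercased word tokens of a sentence, as a set."""
    return set(re.findall(r"[a-z']+", sentence.lower()))


def jaccard(a, b):
    """Jaccard similarity of two token sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def directed_edges(sentences, threshold=0.2):
    """Return (earlier_time, later_time, new_words) for similar pairs."""
    token_sets = [tokens(s) for s in sentences]
    edges = []
    for i, j in combinations(range(len(sentences)), 2):
        if jaccard(token_sets[i], token_sets[j]) >= threshold:
            # Times are 1-based; the label holds what the later sentence adds.
            edges.append((i + 1, j + 1, sorted(token_sets[j] - token_sets[i])))
    return edges


sentences = [
    "Trump says that it is useful to win the next presidential election.",
    "The Prime Minister suggests the name of the winner of the next presidential election.",
    "The Chinese Minister is in London to discuss about climate change.",
]
edges = directed_edges(sentences)
print(edges)
```

Each tuple can then be added to the DiGraph with something like `G.add_edge(src, dst, label=", ".join(new_words))`. With these three sentences only times 1 and 2 are linked, while the sentence about the Chinese Minister stays isolated as a separate topic.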

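I have provided another example to make this clearer (working with the previous example is fine too. Sorry for the inconvenience, but my first question was not very clear, so I had to provide a better, and probably easier, example).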

Answers

4 ilia Oct 10 2020 at 17:42

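I did not implement NLP for the verb/noun separation; I just added a list of good words. They can be extracted and normalized with spacy relatively easily. Note that walk occurs in sentences 1, 2 and 5 (nodes 0, 1 and 4 in the code, where the shared token is walking) and forms a triad.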

import re
import networkx as nx
import matplotlib.pyplot as plt

plt.style.use("ggplot")

sentences = [
    "I went out for a walk or walking.",
    "When I was walking, I saw a cat. ",
    "The cat was injured. ",
    "My mum's name is Marylin.",
    "While I was walking, I met John. ",
    "Nothing has happened.",
]

G = nx.Graph()
# set of possible good words
good_words = {"went", "walk", "cat", "walking"}

# remove punctuation and keep only good words inside sentences
words = list(
    map(
        lambda x: set(re.sub(r"[^\w\s]", "", x).lower().split()).intersection(
            good_words
        ),
        sentences,
    )
)

# convert sentences to dict for further labeling
sentences = {k: v for k, v in enumerate(sentences)}

# add nodes
for i, sentence in sentences.items():
    G.add_node(i)

# add edges if two nodes have the same word inside
for i in range(len(words)):
    for j in range(i + 1, len(words)):
        for edge_label in words[i].intersection(words[j]):
            G.add_edge(i, j, r=edge_label)

# compute layout coords
coord = nx.spring_layout(G)

plt.figure(figsize=(20, 14))

# place the label coords slightly above the nodes
node_label_coords = {}
for node, coords in coord.items():
    node_label_coords[node] = (coords[0], coords[1] + 0.04)

# draw the network
nodes = nx.draw_networkx_nodes(G, pos=coord)
edges = nx.draw_networkx_edges(G, pos=coord)
edge_labels = nx.draw_networkx_edge_labels(G, pos=coord)
node_labels = nx.draw_networkx_labels(G, pos=node_label_coords, labels=sentences)
plt.title("Sentences network")
plt.axis("off")
plt.show()

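Update
If you want to measure the similarity between different sentences, I would suggest computing the difference between their sentence embeddings.
This lets you find semantic similarity between sentences that use different words, such as "A soccer game with multiple males playing" and "Some men are playing a sport". A near-SoTA approach using BERT can be found here, and a simpler approach here.
Once you have a similarity measure, just replace the add_edge block so that a new edge is added only if the similarity is greater than some threshold. The resulting edge-adding code looks like this: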

# add an edge only if the two sentences are similar enough
threshold = 0.90
for i in range(len(sentences)):
    for j in range(i + 1, len(sentences)):
        # suppose you have some similarity function using BERT or PCA
        similarity = check_similarity(sentences[i], sentences[j])
        if similarity > threshold:
            G.add_edge(i, j, r=similarity)
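The check_similarity function is left undefined above. As a minimal stand-in, assuming nothing beyond the standard library, it could be sketched as cosine similarity over bag-of-words counts. This is a purely lexical measure, not the BERT or PCA embeddings the answer suggests, so it will not capture the semantic similarity of sentences that use different words.

```python
import math
import re
from collections import Counter


def check_similarity(sentence_a, sentence_b):
    """Cosine similarity between bag-of-words count vectors (0.0 to 1.0)."""
    counts_a = Counter(re.findall(r"\w+", sentence_a.lower()))
    counts_b = Counter(re.findall(r"\w+", sentence_b.lower()))
    # Dot product over the words both sentences share.
    dot = sum(counts_a[w] * counts_b[w] for w in counts_a.keys() & counts_b.keys())
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty sentence is similar to nothing
    return dot / (norm_a * norm_b)


score = check_similarity(
    "When I was walking, I saw a cat.",
    "While I was walking, I met John.",
)
print(round(score, 3))
```

With a lexical measure like this one, a threshold of 0.90 would link only near-identical sentences, so the threshold probably needs to be tuned to whichever similarity function is actually used.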

1 mujjiga Oct 10 2020 at 20:09

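One way to handle this is to tokenize the sentences, remove the stop words, and build a vocabulary. Then draw the graph based on this vocabulary. Below I show an example with unigram-based tokens, but a much better approach would be to identify phrases (ngrams) and use them as the vocabulary instead of unigrams. Sentence similarity will then show up in the picture as nodes (and the corresponding sentences) with a higher degree (in-degree and out-degree).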