Finding the noun phrases before and after numbers in a text

Aug 19 2020

Given a text, I have to find the words preceding every number, going back as far as the nearest stop word that belongs to a check_words list (a kind of stop-word list).

My code:

check_words = ['the', 'a', 'with','to']
mystring = 'the code to find the beautiful words 78 that i have to nicely check 45 with the snippet'
list_of_words = mystring.split()

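In this particular text I would look at the words before '78' and '45', moving backwards until I find any of the words in check_words (but going back no more than 8 words).

The code for doing that might look like this: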

preceding_chunks = []
for i,word in enumerate(list_of_words):
    if any(char.isdigit() for char in word):
       
        # Up to 8 preceding words (max() guards against slicing before the start of the list)
        preceding_words = list_of_words[max(0,i-8):i]

        # I check from the end of the list towards the start
        for j,sub_word in enumerate(preceding_words[::-1]):
            if sub_word in check_words:
                # printing out j for checking
                print(j)
                real_preceding_chunk = preceding_words[len(preceding_words)-j:]
                print(real_preceding_chunk)
                preceding_chunks.append(real_preceding_chunk)
                break
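For reference, here is a minimal sketch of the same loop wrapped in a function, so the result for the sample sentence can be checked directly. The function name find_preceding_chunks and the max_back parameter are illustrative names, not part of the original code.

```python
def find_preceding_chunks(text, check_words, max_back=8):
    """Collect the words before each digit-bearing token, back to the nearest stop word."""
    words = text.split()
    chunks = []
    for i, word in enumerate(words):
        if not any(char.isdigit() for char in word):
            continue
        # At most max_back words before the number; max() avoids a negative start index.
        window = words[max(0, i - max_back):i]
        # Walk backwards from the number until a stop word is found.
        for j, candidate in enumerate(reversed(window)):
            if candidate in check_words:
                chunks.append(window[len(window) - j:])
                break
    return chunks


check_words = ['the', 'a', 'with', 'to']
mystring = 'the code to find the beautiful words 78 that i have to nicely check 45 with the snippet'
print(find_preceding_chunks(mystring, check_words))
# [['beautiful', 'words'], ['nicely', 'check']]
```

As in the original loop, nothing is collected for a number when no stop word occurs within the window.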

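This code works. Basically I check every word, but I have the impression (perhaps I am wrong) that it could be achieved with a couple of one-liners and hence without loops. Any ideas?

NOTE: This question is about improving code readability, trying to get rid of the loops to make the code faster, and trying to make the code nicer, which is part of the Zen of Python.

NOTE 2: Some previous checks that I did:

Finding the position of an item in another list from a number in a different list
Finding the index of an item in a list
Find in list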

Answers

MarioIshac Aug 19 2020 at 09:13

I came up with this:

import itertools
import re

chunks = (grouped_chunk.split() for grouped_chunk in re.split("\\s+\\d+\\s+", mystring))
preceding_chunks = []

for reversed_chunk in map(reversed, chunks):
    preceding_chunk = list(itertools.takewhile(lambda word: word not in check_words, reversed_chunk))[::-1]
    preceding_chunks.append(preceding_chunk)

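We apply itertools.takewhile to reversed_chunk, which gives us the preceding chunk in reverse order. We then get the correctly ordered preceding_chunk by reversing it at the end with [::-1].

The regex splits mystring on numbers (the escaped \d+). The surrounding escaped \s+ parts stand for the whitespace padding around the number. Because the pattern requires whitespace on both sides of a run of digits, a word that mixes digits and letters (for example, a1) is not treated as a number here, whereas your code treats any word containing a digit as one. In that case this code behaves differently from yours.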
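To make the takewhile step concrete, here is a small sketch on a single chunk. The chunk variable is just the part of the sample sentence that comes before 78.

```python
import itertools

check_words = ['the', 'a', 'with', 'to']
chunk = 'the code to find the beautiful words'.split()

# Walk from the last word backwards and keep words until a stop word appears.
reversed_part = list(itertools.takewhile(lambda w: w not in check_words, reversed(chunk)))
print(reversed_part)        # ['words', 'beautiful']

# Reverse once more to restore reading order.
print(reversed_part[::-1])  # ['beautiful', 'words']
```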
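A rough sketch of that difference, assuming the two approaches are wrapped in functions. The names regex_chunks and token_chunks are hypothetical. The second one is only one possible way to combine takewhile with the question's digit test and 8-word window, not something from the original answer.

```python
import itertools
import re

check_words = ['the', 'a', 'with', 'to']


def regex_chunks(text):
    # Same approach as above: split on standalone, whitespace-padded numbers.
    chunks = (part.split() for part in re.split(r"\s+\d+\s+", text))
    return [
        list(itertools.takewhile(lambda w: w not in check_words, reversed(chunk)))[::-1]
        for chunk in chunks
    ]


def token_chunks(text, max_back=8):
    # Hypothetical variant: any word containing a digit counts as a number,
    # and at most max_back preceding words are inspected.
    words = text.split()
    result = []
    for i, word in enumerate(words):
        if any(ch.isdigit() for ch in word):
            window = words[max(0, i - max_back):i]
            taken = list(itertools.takewhile(lambda w: w not in check_words, reversed(window)))
            if len(taken) < len(window):  # a stop word was actually reached
                result.append(taken[::-1])
    return result


sample = 'the code to find a1 words 78 with the snippet'
print(regex_chunks(sample))  # [['find', 'a1', 'words'], ['snippet']]
print(token_chunks(sample))  # [['find'], ['find', 'a1', 'words']]
```

Two things show up in this sketch: the regex version also produces a chunk for the text after the last number (['snippet']), and it has no 8-word limit, while the token-based version produces an extra chunk for a1.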