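OpenNLP - Sentence Detection

When processing natural language, determining where a sentence begins and ends is one of the problems to be addressed. This process is known as Sentence Boundary Disambiguation (SBD), or simply sentence breaking.

The technique used to detect the sentences in a given text depends on the language of the text.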
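As a small illustration of that language dependence, the sketch below splits a Japanese string first on ASCII terminators and then on Japanese ones. It uses only the JDK; the class name `TerminatorsByLanguage`, the helper `countPieces`, and the sample text are made up for this example. The string is written with Unicode escapes so the source stays ASCII.

```java
class TerminatorsByLanguage {

    // Counts the pieces produced by splitting text on the given terminator character class.
    static int countPieces(String text, String terminatorRegex) {
        return text.split(terminatorRegex).length;
    }

    public static void main(String[] args) {
        // "Konnichiwa. Genki desu ka." in Japanese script; \u3002 is the ideographic full stop.
        String japanese = "\u3053\u3093\u306b\u3061\u306f\u3002\u5143\u6c17\u3067\u3059\u304b\u3002";

        // ASCII terminators never occur in this text, so it stays in one piece.
        System.out.println(countPieces(japanese, "[.?!]"));

        // Ideographic full stop plus full-width question and exclamation marks.
        System.out.println(countPieces(japanese, "[\u3002\uff1f\uff01]"));
    }
}
```

The first call prints 1 and the second prints 2: the same rule that works for English finds no sentence boundary at all until the terminator set is changed for the language.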

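Sentence Detection Using Java

In Java, you can detect the sentences in a given text using regular expressions and a set of simple rules.

For example, assuming that a period, question mark, or exclamation mark ends a sentence in the given text, you can split the text into sentences using the split() method of the String class. This method takes a regular expression, passed as a string.

The following program determines the sentences in a given text using a Java regular expression (the split() method). Save this program in a file named SentenceDetection_RE.java.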

public class SentenceDetection_RE {  
   public static void main(String args[]){ 
     
      String sentence = " Hi. How are you? Welcome to Tutorialspoint. " 
         + "We provide free tutorials on various technologies"; 
     
      String simple = "[.?!]";      
      String[] splitString = (sentence.split(simple));     
      for (String string : splitString)   
         System.out.println(string);      
   } 
}

Compile and execute the saved Java file from the command prompt using the following commands.

javac SentenceDetection_RE.java 
java SentenceDetection_RE

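On execution, the program prints the following output. Because split() discards the characters it matches, the terminating punctuation is gone, and each fragment keeps the space that preceded it.

 Hi
 How are you
 Welcome to Tutorialspoint
 We provide free tutorials on various technologies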
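The split() program above drops the terminators and leaves leading spaces. One possible variation, sketched here with only the JDK, splits on the whitespace that follows a terminator instead of on the terminator itself. The lookbehind `(?<=[.?!])` is zero-width, so the punctuation stays attached to its sentence. The class name `SentenceSplitKeepPunctuation` and the helper `split` are illustrative names, not part of the original program.

```java
class SentenceSplitKeepPunctuation {

    // Splits on runs of whitespace that come right after '.', '?' or '!'.
    // trim() removes the leading and trailing spaces of the whole text first.
    static String[] split(String text) {
        return text.trim().split("(?<=[.?!])\\s+");
    }

    public static void main(String[] args) {
        String sentence = " Hi. How are you? Welcome to Tutorialspoint. "
            + "We provide free tutorials on various technologies";

        for (String s : split(sentence)) {
            System.out.println(s);
        }
    }
}
```

This prints "Hi.", "How are you?", "Welcome to Tutorialspoint." and "We provide free tutorials on various technologies" on separate lines. It is still a purely rule-based split, so it shares the limits of the original rule.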

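Sentence Detection Using OpenNLP

To detect sentences, OpenNLP uses a predefined model, a file named en-sent.bin. This model is trained to detect the sentences in a given raw text.

The opennlp.tools.sentdetect package contains the classes and interfaces used to perform the sentence detection task.

To detect sentences using the OpenNLP library, you need to:

Load the en-sent.bin model using the SentenceModel class.
Instantiate the SentenceDetectorME class.
Detect the sentences using the sentDetect() method of this class.

The following are the steps to follow to write a program that detects the sentences in a given raw text.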

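Step 1: Loading the model

The model for sentence detection is represented by the class named SentenceModel, which belongs to the package opennlp.tools.sentdetect.

To load a sentence detection model:

Create an InputStream object for the model (instantiate FileInputStream and pass the path of the model, as a string, to its constructor).
Instantiate the SentenceModel class and pass the InputStream object of the model as a parameter to its constructor, as shown in the following code block: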

//Loading sentence detector model 
InputStream inputStream = new FileInputStream("C:/OpenNLP_models/en-sent.bin"); 
SentenceModel model = new SentenceModel(inputStream);

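Step 2: Instantiating the SentenceDetectorME class

The SentenceDetectorME class of the package opennlp.tools.sentdetect contains methods to split raw text into sentences. This class uses a maximum entropy model to evaluate the end-of-sentence characters in a string and determine whether they mark the end of a sentence.

Instantiate this class and pass the model object created in the previous step, as shown below.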

//Instantiating the SentenceDetectorME class 
SentenceDetectorME detector = new SentenceDetectorME(model);
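To see why each candidate terminator has to be evaluated rather than trusted, consider what the plain [.?!] rule from the regular-expression program does with an abbreviation and a decimal number. This is a minimal JDK-only sketch; the class name `NaivePunctuationSplit`, the helper `naiveSplit`, and the sample text are made up for illustration, and it makes no claim about how OpenNLP scores this particular input.

```java
class NaivePunctuationSplit {

    // Applies the simple rule: every '.', '?' or '!' ends a sentence.
    static String[] naiveSplit(String text) {
        return text.split("[.?!]");
    }

    public static void main(String[] args) {
        // Two sentences for a human reader, but four periods in the text.
        String text = "Mr. Smith paid 3.50 dollars. He left.";

        String[] pieces = naiveSplit(text);
        System.out.println(pieces.length + " pieces:");
        for (String piece : pieces) {
            System.out.println("[" + piece + "]");
        }
    }
}
```

The rule produces four pieces ("Mr", " Smith paid 3", "50 dollars", " He left") because it cannot tell a sentence-final period from the ones in "Mr." and "3.50". Deciding that per character is the job the text above assigns to the model.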

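Step 3: Detecting the sentences

The sentDetect() method of the SentenceDetectorME class detects the sentences in the raw text passed to it. It accepts a String as a parameter and returns the detected sentences as a String array.

Invoke this method by passing the text to it as a string.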

//Detecting the sentence 
String sentences[] = detector.sentDetect(sentence);

Example

The following program detects the sentences in a given raw text. Save this program in a file named SentenceDetectionME.java.

import java.io.FileInputStream; 
import java.io.InputStream;  

import opennlp.tools.sentdetect.SentenceDetectorME; 
import opennlp.tools.sentdetect.SentenceModel;  

public class SentenceDetectionME { 
  
   public static void main(String args[]) throws Exception { 
   
      String sentence = "Hi. How are you? Welcome to Tutorialspoint. " 
         + "We provide free tutorials on various technologies"; 
       
      //Loading sentence detector model 
      InputStream inputStream = new FileInputStream("C:/OpenNLP_models/en-sent.bin"); 
      SentenceModel model = new SentenceModel(inputStream); 
       
      //Instantiating the SentenceDetectorME class 
      SentenceDetectorME detector = new SentenceDetectorME(model);  
    
      //Detecting the sentence
      String sentences[] = detector.sentDetect(sentence); 
    
      //Printing the sentences 
      for(String sent : sentences)        
         System.out.println(sent);  
   } 
}

Compile and execute the saved Java file from the command prompt using the following commands:

javac SentenceDetectionME.java 
java SentenceDetectionME

On execution, the program reads the given string, detects the sentences in it, and displays the following output.

Hi. How are you? 
Welcome to Tutorialspoint. 
We provide free tutorials on various technologies

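Detecting the Positions of the Sentences

You can also detect the positions of the sentences using the sentPosDetect() method of the SentenceDetectorME class.

The following are the steps to follow to write a program that detects the positions of the sentences in a given raw text.

Step 1: Loading the model

The model for sentence detection is represented by the class named SentenceModel, which belongs to the package opennlp.tools.sentdetect.

To load a sentence detection model:

Create an InputStream object for the model (instantiate FileInputStream and pass the path of the model, as a string, to its constructor).
Instantiate the SentenceModel class and pass the InputStream object of the model as a parameter to its constructor, as shown in the following code block.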

//Loading sentence detector model 
InputStream inputStream = new FileInputStream("C:/OpenNLP_models/en-sent.bin"); 
SentenceModel model = new SentenceModel(inputStream);

Step 2: Instantiating the SentenceDetectorME class

Instantiate this class and pass the model object created in the previous step.

//Instantiating the SentenceDetectorME class 
SentenceDetectorME detector = new SentenceDetectorME(model);

Step 3: Detecting the positions of the sentences

The sentPosDetect() method of the SentenceDetectorME class detects the positions of the sentences in the raw text passed to it. It accepts a String as a parameter.

Invoke this method by passing the text to it as a string.

//Detecting the position of the sentences in the paragraph  
Span[] spans = detector.sentPosDetect(sentence);

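Step 4: Printing the spans of the sentences

The sentPosDetect() method of the SentenceDetectorME class returns an array of objects of type Span. The Span class of the opennlp.tools.util package stores the start and end integers of a span.

You can store the spans returned by sentPosDetect() in a Span array and print them, as shown in the following code block.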

//Printing the spans of the sentences 
for (Span span : spans)         
   System.out.println(span);

Example

The following program detects the positions of the sentences in a given raw text. Save this program in a file named SentencePosDetection.java.

import java.io.FileInputStream; 
import java.io.InputStream; 
  
import opennlp.tools.sentdetect.SentenceDetectorME; 
import opennlp.tools.sentdetect.SentenceModel; 
import opennlp.tools.util.Span;

public class SentencePosDetection { 
  
   public static void main(String args[]) throws Exception { 
   
      String paragraph = "Hi. How are you? Welcome to Tutorialspoint. " 
         + "We provide free tutorials on various technologies"; 
       
      //Loading sentence detector model 
      InputStream inputStream = new FileInputStream("C:/OpenNLP_models/en-sent.bin"); 
      SentenceModel model = new SentenceModel(inputStream); 
       
      //Instantiating the SentenceDetectorME class 
      SentenceDetectorME detector = new SentenceDetectorME(model);  
       
      //Detecting the position of the sentences in the raw text 
      Span spans[] = detector.sentPosDetect(paragraph); 
       
      //Printing the spans of the sentences in the paragraph 
      for (Span span : spans)         
         System.out.println(span);  
   } 
}

Compile and execute the saved Java file from the command prompt using the following commands:

javac SentencePosDetection.java 
java SentencePosDetection

On execution, the program reads the given string, detects the positions of the sentences in it, and displays the following output.

[0..16) 
[17..43) 
[44..93)
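For comparison, character offsets can also be collected with java.util.regex alone. The sketch below is not OpenNLP's algorithm: it applies the simple "period, question mark, or exclamation mark ends a sentence" rule and records where each match starts and ends. The class name `RegexSentenceOffsets` and the helper `offsets` are illustrative names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegexSentenceOffsets {

    // A run that starts at a non-space, non-terminator character and
    // extends up to and including the next '.', '?' or '!' (if any).
    private static final Pattern SENTENCE = Pattern.compile("[^.?!\\s][^.?!]*[.?!]?");

    // Returns {start, end} pairs; end is exclusive, as in Matcher.end().
    static List<int[]> offsets(String text) {
        List<int[]> result = new ArrayList<>();
        Matcher m = SENTENCE.matcher(text);
        while (m.find()) {
            result.add(new int[] { m.start(), m.end() });
        }
        return result;
    }

    public static void main(String[] args) {
        String paragraph = "Hi. How are you? Welcome to Tutorialspoint. "
            + "We provide free tutorials on various technologies";

        for (int[] o : offsets(paragraph)) {
            System.out.println("[" + o[0] + ".." + o[1] + ")");
        }
    }
}
```

On the same paragraph this prints [0..3), [4..16), [17..43) and [44..93). The rule reports four ranges because it cuts after "Hi.", whereas the model output above covers "Hi. How are you?" with the single span [0..16).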

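Sentences Along with Their Positions

The substring() method of the String class accepts begin and end offsets and returns the corresponding string. You can use this method to print the sentences together with their spans (positions), as shown in the following code block.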

for (Span span : spans)         
   System.out.println(sen.substring(span.getStart(), span.getEnd())+" "+ span);

The following program detects the sentences in a given raw text and displays them along with their positions. Save this program in a file named SentencesAndPosDetection.java.

import java.io.FileInputStream; 
import java.io.InputStream;  

import opennlp.tools.sentdetect.SentenceDetectorME; 
import opennlp.tools.sentdetect.SentenceModel; 
import opennlp.tools.util.Span; 
   
public class SentencesAndPosDetection { 
  
   public static void main(String args[]) throws Exception { 
     
      String sen = "Hi. How are you? Welcome to Tutorialspoint." 
         + " We provide free tutorials on various technologies"; 
      //Loading a sentence model 
      InputStream inputStream = new FileInputStream("C:/OpenNLP_models/en-sent.bin"); 
      SentenceModel model = new SentenceModel(inputStream); 
       
      //Instantiating the SentenceDetectorME class 
      SentenceDetectorME detector = new SentenceDetectorME(model);  
       
      //Detecting the position of the sentences in the paragraph  
      Span[] spans = detector.sentPosDetect(sen);  
      
      //Printing the sentences and their spans of a paragraph 
      for (Span span : spans)         
         System.out.println(sen.substring(span.getStart(), span.getEnd())+" "+ span);  
   } 
}

Compile and execute the saved Java file from the command prompt using the following commands:

javac SentencesAndPosDetection.java 
java SentencesAndPosDetection

On execution, the program reads the given string, detects the sentences along with their positions, and displays the following output.

Hi. How are you? [0..16) 
Welcome to Tutorialspoint. [17..43)  
We provide free tutorials on various technologies [44..93)
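The [start..end) notation in this output is half-open: the start offset is inclusive and the end offset is exclusive, which is the same convention String.substring() uses. The sketch below checks that with the offsets copied from the output above, hard-coded as plain integers so it runs without OpenNLP on the classpath. The class name `HalfOpenOffsets` and the helper `slice` are illustrative names.

```java
class HalfOpenOffsets {

    // Returns the characters from start (inclusive) to end (exclusive).
    static String slice(String text, int start, int end) {
        return text.substring(start, end);
    }

    public static void main(String[] args) {
        String sen = "Hi. How are you? Welcome to Tutorialspoint."
            + " We provide free tutorials on various technologies";

        // Offsets copied from the span output; indices 16 and 43 are the
        // separating spaces, which belong to no sentence.
        int[][] offsets = { { 0, 16 }, { 17, 43 }, { 44, 93 } };

        for (int[] o : offsets) {
            String s = slice(sen, o[0], o[1]);
            // With a half-open range, end - start equals the sentence length.
            System.out.println(s + " [" + o[0] + ".." + o[1] + ") length=" + s.length());
        }
    }
}
```

Because the end offset is exclusive, the last span ends at 93, which is the length of the whole string rather than the index of its final character.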

Sentence Probability Detection

The getSentenceProbabilities() method of the SentenceDetectorME class returns the probabilities associated with the most recent call to the sentDetect() method.

//Getting the probabilities of the last decoded sequence       
double[] probs = detector.getSentenceProbabilities();

The following program prints the probabilities associated with the call to the sentDetect() method. Save this program in a file named SentenceDetectionMEProbs.java.

import java.io.FileInputStream; 
import java.io.InputStream;  

import opennlp.tools.sentdetect.SentenceDetectorME; 
import opennlp.tools.sentdetect.SentenceModel;  

public class SentenceDetectionMEProbs { 
  
   public static void main(String args[]) throws Exception { 
   
      String sentence = "Hi. How are you? Welcome to Tutorialspoint. " 
         + "We provide free tutorials on various technologies"; 
       
      //Loading sentence detector model 
      InputStream inputStream = new FileInputStream("C:/OpenNLP_models/en-sent.bin");
      SentenceModel model = new SentenceModel(inputStream); 
       
      //Instantiating the SentenceDetectorME class
      SentenceDetectorME detector = new SentenceDetectorME(model);  
      
      //Detecting the sentence 
      String sentences[] = detector.sentDetect(sentence); 
    
      //Printing the sentences 
      for(String sent : sentences)        
         System.out.println(sent);   
         
      //Getting the probabilities of the last decoded sequence       
      double[] probs = detector.getSentenceProbabilities(); 
       
      System.out.println("  "); 
       
      for(int i = 0; i<probs.length; i++) 
         System.out.println(probs[i]); 
   } 
}

Compile and execute the saved Java file from the command prompt using the following commands:

javac SentenceDetectionMEProbs.java 
java SentenceDetectionMEProbs

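On execution, the program reads the given string, detects the sentences, and prints them. It also prints the probabilities associated with the most recent call to the sentDetect() method, as shown below.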

Hi. How are you? 
Welcome to Tutorialspoint. 
We provide free tutorials on various technologies 
   
0.9240246995179983 
0.9957680129995953 
1.0
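The program prints the sentences and the probabilities as two separate lists. One way to read them together is to pair them by index and flag the low values. The sketch below assumes one probability per detected sentence, as in the output above, and hard-codes that output so it runs without OpenNLP. The class name `SentenceConfidence`, the helper `annotate`, and the 0.95 threshold are arbitrary choices for illustration.

```java
import java.util.Locale;

class SentenceConfidence {

    // Pairs each sentence with its probability and marks values below the threshold.
    static String[] annotate(String[] sentences, double[] probs, double threshold) {
        if (sentences.length != probs.length) {
            throw new IllegalArgumentException("expected one probability per sentence");
        }
        String[] lines = new String[sentences.length];
        for (int i = 0; i < sentences.length; i++) {
            String flag = probs[i] < threshold ? " (below threshold)" : "";
            lines[i] = String.format(Locale.ROOT, "%.4f  %s%s", probs[i], sentences[i], flag);
        }
        return lines;
    }

    public static void main(String[] args) {
        // Values copied from the program output above.
        String[] sentences = {
            "Hi. How are you?",
            "Welcome to Tutorialspoint.",
            "We provide free tutorials on various technologies"
        };
        double[] probs = { 0.9240246995179983, 0.9957680129995953, 1.0 };

        for (String line : annotate(sentences, probs, 0.95)) {
            System.out.println(line);
        }
    }
}
```

With the 0.95 threshold only the first line is flagged, which matches the one place where the detected sentence, "Hi. How are you?", contains an internal period.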