java.lang.Object
  kr.ac.kaist.swrc.jhannanum.plugin.SupplementPlugin.PlainTextProcessor.SentenceSegmentor.SentenceSegmentor
public class SentenceSegmentor
This plug-in reads a document that consists of more than one sentence and recognizes the end of each sentence based on punctuation marks. If punctuation marks are not used correctly in the sentences, this plug-in will not work well. It treats '.', '!', and '?' as end-of-sentence marks, but because these symbols can also be used for other purposes, it handles those cases as well. For example:
- 12.42 : number
- A. Introduction : section title
- I'm fine... : ellipsis
- U.S. : abbreviation
It is a Plain Text Processor plug-in, which is a supplement plug-in of phase 1 in the HanNanum work flow.
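The ambiguous cases listed above can be illustrated with a small, self-contained sketch. It is not the HanNanum implementation: the class name `SentenceBoundarySketch`, its helper methods, and the specific rules (a period inside a token with letters marks an abbreviation, a single letter plus period at the start of a sentence marks a section label, a run of periods counts as one boundary) are assumptions chosen only to show how such exceptions might be handled.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; names and rules are illustrative assumptions,
// not taken from the SentenceSegmentor source.
final class SentenceBoundarySketch {

    /** The three candidate end-of-sentence marks named in the description. */
    static boolean isEndMark(char c) {
        return c == '.' || c == '!' || c == '?';
    }

    static boolean hasLetter(String s) {
        for (int i = 0; i < s.length(); i++) {
            if (Character.isLetter(s.charAt(i))) return true;
        }
        return false;
    }

    /** Decides whether a whitespace-delimited token closes the current sentence. */
    static boolean endsSentence(String token, boolean firstInSentence) {
        int end = token.length();
        // Ignore closing quotes and brackets that trail the punctuation.
        while (end > 0 && ")]}\"'".indexOf(token.charAt(end - 1)) >= 0) end--;
        // Tokens such as "12.42" do not end with a mark, so they never split.
        if (end == 0 || !isEndMark(token.charAt(end - 1))) return false;

        // Separate the trailing run of marks from the text before it.
        int runStart = end;
        while (runStart > 0 && isEndMark(token.charAt(runStart - 1))) runStart--;
        String body = token.substring(0, runStart);
        String marks = token.substring(runStart, end);

        if (marks.indexOf('!') >= 0 || marks.indexOf('?') >= 0) return true;
        if (marks.length() > 1) return true; // "fine..." : ellipsis, one boundary
        if (body.indexOf('.') >= 0 && hasLetter(body)) return false; // "U.S."
        if (firstInSentence && body.length() == 1
                && Character.isLetter(body.charAt(0))) return false; // "A."
        return true;
    }

    /** Splits text into sentences using the token-level rule above. */
    static List<String> split(String text) {
        List<String> sentences = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (String token : text.trim().split("\\s+")) {
            if (token.isEmpty()) continue;
            boolean first = current.length() == 0;
            if (!first) current.append(' ');
            current.append(token);
            if (endsSentence(token, first)) {
                sentences.add(current.toString());
                current.setLength(0);
            }
        }
        // Text without a closing mark is kept as a final sentence.
        if (current.length() > 0) sentences.add(current.toString());
        return sentences;
    }

    public static void main(String[] args) {
        for (String s : split("It costs 12.42 dollars. A. Introduction is next! I'm fine... The U.S. is big.")) {
            System.out.println(s);
        }
    }
}
```

Under these rules an abbreviation that really does close a sentence (for example a sentence ending in "U.S.") is not split, which is one reason heuristics of this kind depend on punctuation being used consistently.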
Field Summary | |
---|---|
private java.lang.String[] | bufEojeols: the buffer for storing the part that remains after one sentence is returned |
private int | bufEojeolsIdx: the index into the buffer that stores the remaining part |
private java.lang.String | bufRes: the buffer for storing intermediate results |
private int | documentID: the ID of the document |
private boolean | endOfDocument: the flag that indicates whether the current sentence is the end of the document |
private boolean | hasRemainingData: the flag that indicates whether data remains in the input buffer |
private int | sentenceID: the ID of the sentence |
Constructor Summary | |
---|---|
SentenceSegmentor() | |
Method Summary | |
---|---|
PlainSentence | doProcess(PlainSentence ps): recognizes the end of each sentence and returns the first sentence. |
PlainSentence | flush(): returns the text that has been stored in the internal buffer. |
boolean | hasRemainingData(): checks whether any text remains. |
void | initialize(java.lang.String baseDir, java.lang.String configFile): called before the work flow starts in order to initialize the plug-in. |
private boolean | isSym(char c): checks whether the specified symbol can appear with previous symbols. |
void | shutdown(): called before the work flow is closed. |
Methods inherited from class java.lang.Object |
---|
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
Field Detail |
---|
private int documentID
private int sentenceID
private boolean hasRemainingData
private java.lang.String bufRes
private java.lang.String[] bufEojeols
private int bufEojeolsIdx
private boolean endOfDocument
Constructor Detail |
---|
public SentenceSegmentor()
Method Detail |
---|
private boolean isSym(char c)
Checks whether the specified symbol can appear with previous symbols.
Parameters:
c - the character to check
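A check of this kind might look like the following sketch. The set of characters accepted here (further end marks plus closing brackets and quotes) is an assumption made for illustration, as are the class name `TrailingSymbolSketch` and the helper `skipTrailingSymbols`; the page does not list which symbols the real isSym accepts.

```java
// Hypothetical sketch; the accepted character set is an assumption.
final class TrailingSymbolSketch {

    /** Returns true if c may follow an end mark as part of the same sentence ending. */
    static boolean isSym(char c) {
        switch (c) {
            case '.': case '!': case '?':   // repeated marks, e.g. "..." or "?!"
            case ')': case ']': case '}':   // closing brackets
            case '\'': case '"':            // closing quotes
                return true;
            default:
                return false;
        }
    }

    /** Returns the index just past the run of trailing symbols that starts at 'from'. */
    static int skipTrailingSymbols(String s, int from) {
        int i = from;
        while (i < s.length() && isSym(s.charAt(i))) i++;
        return i;
    }
}
```

With such a helper, a segmenter that finds an end mark can extend the sentence over any symbols that belong to the same ending before it cuts the text.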
public PlainSentence doProcess(PlainSentence ps)
Recognizes the end of each sentence and returns the first sentence.
Specified by:
doProcess in interface PlainTextProcessor
Parameters:
ps - the plain sentence, which can consist of several sentences
public void initialize(java.lang.String baseDir, java.lang.String configFile) throws java.io.FileNotFoundException, java.io.IOException
Description copied from interface: Plugin
This method is called before the work flow starts in order to initialize the plug-in.
Specified by:
initialize in interface Plugin
Parameters:
baseDir - the base directory of HanNanum files
configFile - the path for the configuration file
Throws:
java.io.FileNotFoundException
java.io.IOException
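public void shutdown()
Description copied from interface: Plugin
This method is called before the work flow is closed.
Specified by:
shutdown in interface Plugin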
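public PlainSentence flush()
Description copied from interface: PlainTextProcessor
Returns the text that has been stored in the internal buffer.
Specified by:
flush in interface PlainTextProcessor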
public boolean hasRemainingData()
Description copied from interface: PlainTextProcessor
Checks whether any text remains.
Specified by:
hasRemainingData in interface PlainTextProcessor
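Because doProcess returns only the first sentence and buffers the rest, a caller has to keep asking until hasRemainingData reports false. The sketch below shows that calling pattern with stand-in types so that it runs on its own. Everything in it is hypothetical: the `TextProcessor` interface uses String in place of PlainSentence, `LineProcessor` is a toy that splits on newlines rather than punctuation, and passing null to continue with buffered text is an assumption about the calling convention, not something this page states.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical sketch of the doProcess / hasRemainingData / flush pattern.
final class BufferedProcessorSketch {

    /** Stand-in for the three buffer-related methods, with String instead of PlainSentence. */
    interface TextProcessor {
        String doProcess(String text);
        boolean hasRemainingData();
        String flush();
    }

    /** Toy processor: buffers non-empty lines and hands back one per call. */
    static final class LineProcessor implements TextProcessor {
        private final Deque<String> buffer = new ArrayDeque<>();

        public String doProcess(String text) {
            // Assumption: null means "no new input, continue with the buffer".
            if (text != null) {
                for (String line : text.split("\n")) {
                    if (!line.isEmpty()) buffer.addLast(line);
                }
            }
            return buffer.pollFirst(); // null when nothing is buffered
        }

        public boolean hasRemainingData() {
            return !buffer.isEmpty();
        }

        public String flush() {
            if (buffer.isEmpty()) return null;
            String rest = String.join("\n", buffer);
            buffer.clear();
            return rest;
        }
    }

    /** Feeds one document to the processor and collects every unit it yields. */
    static List<String> drain(TextProcessor p, String document) {
        List<String> out = new ArrayList<>();
        String first = p.doProcess(document);
        if (first != null) out.add(first);
        while (p.hasRemainingData()) {
            String next = p.doProcess(null);
            if (next != null) out.add(next);
        }
        // Collect anything the processor still holds once input is exhausted.
        String rest = p.flush();
        if (rest != null) out.add(rest);
        return out;
    }

    public static void main(String[] args) {
        System.out.println(drain(new LineProcessor(), "first\nsecond\nthird"));
    }
}
```

Keeping the unread remainder inside the processor lets the caller handle one unit at a time without splitting the whole document up front.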