kr.ac.kaist.swrc.jhannanum.plugin.SupplementPlugin.PlainTextProcessor.InformalSentenceFilter
Class InformalSentenceFilter

java.lang.Object
  extended by kr.ac.kaist.swrc.jhannanum.plugin.SupplementPlugin.PlainTextProcessor.InformalSentenceFilter.InformalSentenceFilter
All Implemented Interfaces:
Plugin, PlainTextProcessor

public class InformalSentenceFilter
extends java.lang.Object
implements PlainTextProcessor

This plug-in filters informal sentences in which an eojeol is unusually long and some characters are repeated many times. Such informal patterns degrade the performance of morphological analysis, so this plug-in should be used in any HanNanum work flow that analyzes documents containing informal sentences. It is a Plain Text Processor plug-in, which is a supplement plug-in for phase 1 of the HanNanum work flow.

Author:
Sangwon Park (hudoni@world.kaist.ac.kr), CILab, SWRC, KAIST

Field Summary
private static int REPEAT_CHAR_ALLOWED
          The maximum number of times a character may be repeated.
 
Constructor Summary
InformalSentenceFilter()
           
 
Method Summary
 PlainSentence doProcess(PlainSentence ps)
          It recognizes informal sentences in which an eojeol is unusually long and some characters are repeated many times.
 PlainSentence flush()
          It returns the text which has been stored in the internal buffer.
 boolean hasRemainingData()
          It checks whether any text remains to be processed.
 void initialize(java.lang.String baseDir, java.lang.String configFile)
          This method is called before the work flow starts in order to initialize the plug-in.
 void shutdown()
          This method is called before the work flow is closed.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

REPEAT_CHAR_ALLOWED

private static final int REPEAT_CHAR_ALLOWED
The maximum number of times a character may be repeated.

See Also:
Constant Field Values
Constructor Detail

InformalSentenceFilter

public InformalSentenceFilter()
Method Detail

doProcess

public PlainSentence doProcess(PlainSentence ps)
It recognizes informal sentences in which an eojeol is unusually long and some characters are repeated many times. To keep these unimportant irregular patterns from degrading analysis performance, it inserts blanks into those eojeols to separate them.

Specified by:
doProcess in interface PlainTextProcessor
Parameters:
ps - the plain text
Returns:
the result plain sentence after processing
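
The blank-insertion idea described for doProcess can be illustrated with a minimal sketch. This is not the plug-in's actual code: the class RepeatSplitSketch, the method splitRepeats, and the threshold passed to it are hypothetical names and values chosen for illustration, and the sketch handles only repeated characters, not overly long eojeols.

```java
public class RepeatSplitSketch {

    /**
     * Inserts a blank whenever a run of the same character would exceed
     * maxRepeat. Hypothetical helper; the real value of
     * REPEAT_CHAR_ALLOWED is not stated on this page.
     */
    static String splitRepeats(String eojeol, int maxRepeat) {
        StringBuilder out = new StringBuilder();
        int run = 0;      // length of the current run of identical characters
        char prev = 0;    // previous character seen
        for (int i = 0; i < eojeol.length(); i++) {
            char c = eojeol.charAt(i);
            if (i > 0 && c == prev) {
                run++;
            } else {
                run = 1;
            }
            if (run > maxRepeat) {
                out.append(' ');   // break the run with a blank
                run = 1;           // the current character starts a new run
            }
            out.append(c);
            prev = c;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Seven repeated characters with an assumed limit of three.
        System.out.println(splitRepeats("aaaaaaa", 3)); // prints "aaa aaa a"
    }
}
```

Splitting a long run into shorter pieces lets a later analysis phase treat each piece as a separate short token instead of one very long eojeol.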

initialize

public void initialize(java.lang.String baseDir,
                       java.lang.String configFile)
                throws java.io.FileNotFoundException,
                       java.io.IOException
Description copied from interface: Plugin
This method is called before the work flow starts in order to initialize the plug-in. A configuration file can be passed to the plug-in, which makes the plug-in more flexible.

Specified by:
initialize in interface Plugin
Parameters:
baseDir - the base directory of HanNanum files
configFile - the path for the configuration file
Throws:
java.io.FileNotFoundException
java.io.IOException

flush

public PlainSentence flush()
Description copied from interface: PlainTextProcessor
It returns the text that has been stored in the internal buffer. This method is called by the HanNanum work flow only if hasRemainingData() returns true.

Specified by:
flush in interface PlainTextProcessor
Returns:
the data in the internal buffer, or null if the internal buffer is empty

shutdown

public void shutdown()
Description copied from interface: Plugin
This method is called before the work flow is closed.

Specified by:
shutdown in interface Plugin

hasRemainingData

public boolean hasRemainingData()
Description copied from interface: PlainTextProcessor
It checks whether any text remains to be processed. If it returns true, the HanNanum work flow does not give this plug-in more data and passes null to doProcess() instead. This is because, from the next phase on, the processing unit must be exactly one sentence. This mechanism means the plug-in does not have to manage an input buffer.

Specified by:
hasRemainingData in interface PlainTextProcessor
Returns:
true if some data remains; false if all given text has been processed
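
The hasRemainingData() contract can be sketched as a small driver loop. Everything here is a hypothetical stand-in rather than HanNanum code: the Processor interface, the LineSplitter class, and the drive method are assumptions made for illustration, and plain strings are used in place of PlainSentence. The sketch shows only the path in which the caller passes null to doProcess() while buffered text remains.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

public class RemainingDataSketch {

    /** Hypothetical stand-in for a plain text processor; not the HanNanum interface. */
    interface Processor {
        String doProcess(String text);   // null means "continue with buffered text"
        boolean hasRemainingData();
    }

    /** Toy processor that returns one line per call and buffers the rest. */
    static class LineSplitter implements Processor {
        private final Deque<String> buffer = new ArrayDeque<>();

        public String doProcess(String text) {
            if (text != null) {
                for (String line : text.split("\n")) {
                    buffer.addLast(line);
                }
            }
            return buffer.pollFirst();   // null when nothing is buffered
        }

        public boolean hasRemainingData() {
            return !buffer.isEmpty();
        }
    }

    /** Feeds one chunk of text, then drains the processor by passing null. */
    static List<String> drive(Processor p, String chunk) {
        List<String> sentences = new ArrayList<>();
        String out = p.doProcess(chunk);
        if (out != null) {
            sentences.add(out);
        }
        while (p.hasRemainingData()) {
            out = p.doProcess(null);     // no new input while data remains
            if (out != null) {
                sentences.add(out);
            }
        }
        return sentences;
    }

    public static void main(String[] args) {
        // prints [first, second, third]
        System.out.println(drive(new LineSplitter(), "first\nsecond\nthird"));
    }
}
```

Because the caller withholds new input until the buffer is drained, each call hands exactly one unit to the next phase.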