kr.ac.kaist.swrc.jhannanum.plugin.MajorPlugin.PosTagger.HmmPosTagger
Class HMMTagger

java.lang.Object
  extended by kr.ac.kaist.swrc.jhannanum.plugin.MajorPlugin.PosTagger.HmmPosTagger.HMMTagger
All Implemented Interfaces:
PosTagger, Plugin

public class HMMTagger
extends java.lang.Object
implements PosTagger

A part-of-speech (POS) tagger based on a Hidden Markov Model (HMM). It is a POS tagger plug-in, which is a major plug-in of phase 3 in the HanNanum work flow. It applies an HMM built on the features of Korean eojeols to choose, for each eojeol, the morphological analysis result that is most promising for the entire sentence.

Author:
Sangwon Park (hudoni@world.kaist.ac.kr), CILab, SWRC, KAIST

Nested Class Summary
private  class HMMTagger.MNode
          Node for the Markov model.
private  class HMMTagger.WPhead
          Header of an eojeol.
 
Field Summary
private static double LAMBDA
          lambda value
private static double Lambda1
          lambda 1
private static double Lambda2
          lambda 2
private  HMMTagger.MNode[] mn
          the nodes for the Markov model
private  int mn_end
          the last index of the Markov model
private static double PCONSTANT
          the default probability
private  java.lang.String PTT_POS_TDBM_FILE
          the statistic file for the probability P(T|T) for morphemes
private  ProbabilityDBM ptt_pos_tf
          for the probability P(T|T)
private  java.lang.String PTT_WP_TDBM_FILE
          the statistic file for the probability P(T|T) for eojeols
private  ProbabilityDBM ptt_wp_tf
          for the probability P(T|T) for eojeols
private  java.lang.String PWT_POS_TDBM_FILE
          the statistic file for the probability P(W|T) for morphemes
private  ProbabilityDBM pwt_pos_tf
          for the probability P(W|T)
private static double SF
          smoothing factor (log 0.01)
private  HMMTagger.WPhead[] wp
          the list of nodes for each eojeol
private  int wp_end
          the last index of eojeol list
 
Constructor Summary
HMMTagger()
           
 
Method Summary
private  double compute_wt(Eojeol eojeol)
          Computes P(T_i, W_i) of the specified eojeol.
private  Sentence end_sentence(SetOfSentences sos)
          Runs the Viterbi algorithm to get the final morphological analysis result, which has the highest probability.
 void initialize(java.lang.String baseDir, java.lang.String configFile)
          This method is called before the work flow starts in order to initialize the plug-in.
private  int new_mnode(Eojeol eojeol, java.lang.String wp_tag, double prob)
          Adds a new node for the Markov model.
private  int new_wp(java.lang.String str)
          Adds a new header of an eojeol.
private  void reset()
          Resets the model.
 void shutdown()
          This method is called before the work flow is closed.
 Sentence tagPOS(SetOfSentences sos)
          It performs POS tagging, which selects the most promising morphological analysis result of each eojeol, so that the final result is the morphologically analyzed sentence with the highest probability.
private  void update_prob_score(int from, int to)
          Updates the probability regarding the transition between two eojeols.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

SF

private static double SF
smoothing factor (log 0.01)


wp

private HMMTagger.WPhead[] wp
the list of nodes for each eojeol


wp_end

private int wp_end
the last index of eojeol list


mn

private HMMTagger.MNode[] mn
the nodes for the Markov model


mn_end

private int mn_end
the last index of the Markov model


pwt_pos_tf

private ProbabilityDBM pwt_pos_tf
for the probability P(W|T)


ptt_pos_tf

private ProbabilityDBM ptt_pos_tf
for the probability P(T|T)


ptt_wp_tf

private ProbabilityDBM ptt_wp_tf
for the probability P(T|T) for eojeols


PWT_POS_TDBM_FILE

private java.lang.String PWT_POS_TDBM_FILE
the statistic file for the probability P(W|T) for morphemes


PTT_POS_TDBM_FILE

private java.lang.String PTT_POS_TDBM_FILE
the statistic file for the probability P(T|T) for morphemes


PTT_WP_TDBM_FILE

private java.lang.String PTT_WP_TDBM_FILE
the statistic file for the probability P(T|T) for eojeols


PCONSTANT

private static final double PCONSTANT
the default probability

See Also:
Constant Field Values

LAMBDA

private static final double LAMBDA
lambda value

See Also:
Constant Field Values

Lambda1

private static final double Lambda1
lambda 1

See Also:
Constant Field Values

Lambda2

private static final double Lambda2
lambda 2

See Also:
Constant Field Values

Constructor Detail

HMMTagger

public HMMTagger()

Method Detail

tagPOS

public Sentence tagPOS(SetOfSentences sos)
Description copied from interface: PosTagger
It performs POS tagging, which selects the most promising morphological analysis result of each eojeol, so that the final result is the morphologically analyzed sentence with the highest probability.

Specified by:
tagPOS in interface PosTagger
Parameters:
sos - the result of morphological analysis, in which each eojeol has more than one candidate analysis
Returns:
the morphologically analyzed sentence with the highest probability

initialize

public void initialize(java.lang.String baseDir,
                       java.lang.String configFile)
                throws java.lang.Exception
Description copied from interface: Plugin
This method is called before the work flow starts in order to initialize the plug-in. A configuration file can be passed to the plug-in, which makes the plug-in more flexible.

Specified by:
initialize in interface Plugin
Parameters:
baseDir - the base directory of HanNanum files
configFile - the path for the configuration file
Throws:
java.lang.Exception - x

shutdown

public void shutdown()
Description copied from interface: Plugin
This method is called before the work flow is closed.

Specified by:
shutdown in interface Plugin
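
The initialize and shutdown contract described above can be sketched as follows. This is a minimal illustration, not HanNanum code: PluginLike is a hypothetical stand-in that mirrors only the two Plugin method signatures documented on this page, RecordingPlugin and LifecycleSketch are invented for the example, and both paths are placeholders.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in that mirrors only the two Plugin methods documented above.
interface PluginLike {
    void initialize(String baseDir, String configFile) throws Exception;
    void shutdown();
}

// Toy plug-in that records the calls it receives, so the call order is visible.
final class RecordingPlugin implements PluginLike {
    final List<String> calls = new ArrayList<>();

    @Override
    public void initialize(String baseDir, String configFile) {
        calls.add("initialize(" + baseDir + ", " + configFile + ")");
    }

    @Override
    public void shutdown() {
        calls.add("shutdown()");
    }
}

final class LifecycleSketch {
    // Initializes the plug-in, runs the work, and always shuts the plug-in down afterwards.
    static void runWith(PluginLike plugin, String baseDir, String configFile, Runnable work)
            throws Exception {
        plugin.initialize(baseDir, configFile);
        try {
            work.run();
        } finally {
            plugin.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        RecordingPlugin plugin = new RecordingPlugin();
        // Both paths are placeholders, not real HanNanum locations.
        runWith(plugin, "base", "conf/tagger.json", () -> plugin.calls.add("work"));
        System.out.println(plugin.calls);
    }
}
```

The try/finally block is a design choice of the sketch: it guarantees that shutdown() runs even when the work in between throws.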

compute_wt

private double compute_wt(Eojeol eojeol)
Computes P(T_i, W_i) of the specified eojeol.

Parameters:
eojeol - the eojeol whose probability is computed
Returns:
P(T_i, W_i) of the specified eojeol
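
For orientation, a joint probability of this kind can be computed with the textbook first-order HMM factorization, summing log P(t_k | t_{k-1}) and log P(w_k | t_k) over the morphemes of one eojeol. The sketch below assumes that factorization; it is not taken from HMMTagger's source. The class name, the key formats ("prev>tag" and "morpheme/tag"), the start marker, the fallback constant, and the toy numbers are all hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

final class JointProbabilitySketch {
    // Hypothetical fallback log-probability for unseen events; not HMMTagger's PCONSTANT or SF.
    static final double UNSEEN_LOG_PROB = Math.log(0.01);

    // Looks up a log-probability, falling back to the constant when the key is missing.
    static double lookup(Map<String, Double> table, String key) {
        return table.getOrDefault(key, UNSEEN_LOG_PROB);
    }

    // log P(T, W) = sum over k of [ log P(t_k | t_{k-1}) + log P(w_k | t_k) ]
    static double logJoint(String[] morphemes, String[] tags,
                           Map<String, Double> emission, Map<String, Double> transition) {
        double logProb = 0.0;
        String prevTag = "<s>"; // hypothetical start-of-eojeol marker
        for (int k = 0; k < morphemes.length; k++) {
            logProb += lookup(transition, prevTag + ">" + tags[k]);
            logProb += lookup(emission, morphemes[k] + "/" + tags[k]);
            prevTag = tags[k];
        }
        return logProb;
    }

    public static void main(String[] args) {
        // Toy tables with made-up probabilities and placeholder tag labels.
        Map<String, Double> emission = new HashMap<>();
        emission.put("hakgyo/N", Math.log(0.2));
        emission.put("e/J", Math.log(0.5));
        Map<String, Double> transition = new HashMap<>();
        transition.put("<s>>N", Math.log(0.4));
        transition.put("N>J", Math.log(0.6));

        double logProb = logJoint(new String[] {"hakgyo", "e"}, new String[] {"N", "J"},
                emission, transition);
        System.out.println(Math.exp(logProb)); // approximately 0.024 = 0.4 * 0.2 * 0.6 * 0.5
    }
}
```

Working in log space turns the product into a sum, which avoids numeric underflow when many small probabilities are multiplied.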

end_sentence

private Sentence end_sentence(SetOfSentences sos)
Runs the Viterbi algorithm to get the final morphological analysis result, which has the highest probability.

Parameters:
sos - all the candidates of morphological analysis
Returns:
the final morphological analysis result which has the highest probability
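
As a rough illustration of the Viterbi step, the sketch below runs a generic Viterbi search over a lattice that holds several candidate analyses per eojeol and returns the index of the chosen candidate at each position. It is a textbook version under stated assumptions, not HMMTagger's implementation: ViterbiSketch, the array layout, the transition function, and the toy scores are hypothetical, and every position is assumed to have at least one candidate.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;
import java.util.function.ToDoubleBiFunction;

final class ViterbiSketch {
    // local[i][j]: log-score of candidate j at position i (one position per eojeol).
    // tags[i][j]: tag label of that candidate.
    // trans.applyAsDouble(a, b): log-score of moving from tag a to tag b.
    static int[] bestPath(double[][] local, String[][] tags,
                          ToDoubleBiFunction<String, String> trans) {
        int n = local.length;
        int[] path = new int[n];
        if (n == 0) {
            return path;
        }
        double[][] best = new double[n][];
        int[][] back = new int[n][];
        for (int i = 0; i < n; i++) {
            int m = local[i].length;
            best[i] = new double[m];
            back[i] = new int[m];
            for (int j = 0; j < m; j++) {
                if (i == 0) {
                    best[i][j] = local[i][j];
                    back[i][j] = -1; // no predecessor at the first position
                    continue;
                }
                double top = Double.NEGATIVE_INFINITY;
                int arg = -1;
                // Pick the best predecessor among the previous position's candidates.
                for (int p = 0; p < local[i - 1].length; p++) {
                    double s = best[i - 1][p] + trans.applyAsDouble(tags[i - 1][p], tags[i][j]);
                    if (s > top) {
                        top = s;
                        arg = p;
                    }
                }
                best[i][j] = top + local[i][j];
                back[i][j] = arg;
            }
        }
        // Find the best final candidate, then follow the back-pointers to the start.
        int last = 0;
        for (int j = 1; j < best[n - 1].length; j++) {
            if (best[n - 1][j] > best[n - 1][last]) {
                last = j;
            }
        }
        for (int i = n - 1; i >= 0; i--) {
            path[i] = last;
            last = back[i][last];
        }
        return path;
    }

    public static void main(String[] args) {
        // Toy lattice: two positions with two candidates each; all numbers are made up.
        double[][] local = {
            {Math.log(0.4), Math.log(0.6)},
            {Math.log(0.5), Math.log(0.5)}
        };
        String[][] tags = {{"N", "V"}, {"J", "E"}};
        Map<String, Double> t = new HashMap<>();
        t.put("N>J", Math.log(0.9));
        t.put("V>E", Math.log(0.3));
        int[] path = bestPath(local, tags,
                (a, b) -> t.getOrDefault(a + ">" + b, Math.log(0.1)));
        System.out.println(Arrays.toString(path)); // prints [0, 0]
    }
}
```

In the toy lattice the second candidate has the higher local score at the first position, yet the search selects the first one because its transition into the next position is much more likely. That is the reason for scoring the whole sentence rather than each eojeol in isolation.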

new_mnode

private int new_mnode(Eojeol eojeol,
                      java.lang.String wp_tag,
                      double prob)
Adds a new node for the Markov model.

Parameters:
eojeol - the eojeol to add
wp_tag - the eojeol tag
prob - the probability P(w|t)
Returns:
the index of the new node

new_wp

private int new_wp(java.lang.String str)
Adds a new header of an eojeol.

Parameters:
str - the plain string of the eojeol
Returns:
the index of the new header

reset

private void reset()
Resets the model.


update_prob_score

private void update_prob_score(int from,
                               int to)
Updates the probability regarding the transition between two eojeols.

Parameters:
from - the previous eojeol
to - the current eojeol