hepple.postag
Class POSTagger

java.lang.Object
  extended by hepple.postag.POSTagger

public class POSTagger
extends Object

A Java POS tagger.

Author: Mark Hepple (hepple@dcs.shef.ac.uk)

Input: An ASCII text file in "Brill input format", i.e. one sentence per line, with tokens separated by spaces.

Output: The same text with each token tagged, i.e. "token" -> "token/tag". Output is streamed to standard output, so it is commonly redirected into a target file.

Revision: 13/9/00. Version 1.0.

Comments: Implements a version of the decision-list-based tagging method described in: M. Hepple. 2000. Independence and Commitment: Assumptions for Rapid Training and Execution of Rule-based Part-of-Speech Taggers. Proceedings of the 38th Annual Meeting of the Association for Computational Linguistics (ACL-2000). Hong Kong, October 2000.

Modified by Niraj Aswani/Ian Roberts to allow explicit specification of the character encoding to use when reading rules and lexicon files.

$Id: POSTagger.java 12919 2010-08-03 10:31:37Z valyt $
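The "token/tag" output convention can be illustrated with a small, self-contained sketch. It does not call the tagger: `TagFormatSketch` and `formatSentence` are hypothetical names, and the tags in the usage example are supplied by hand rather than produced by any rule set.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

class TagFormatSketch {
    /** Joins (word, tag) pairs into one output line, e.g. "The/DT dog/NN". */
    static String formatSentence(List<String[]> taggedSentence) {
        return taggedSentence.stream()
                .map(pair -> pair[0] + "/" + pair[1]) // index 0 = word, index 1 = tag
                .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        // Hand-written tags, purely for illustration.
        List<String[]> sentence = Arrays.asList(
                new String[] {"The", "DT"},
                new String[] {"dog", "NN"},
                new String[] {"barks", "VBZ"});
        System.out.println(formatSentence(sentence)); // prints "The/DT dog/NN barks/VBZ"
    }
}
```

One line of output corresponds to one input sentence, which mirrors the one-sentence-per-line input format.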


Field Summary
private  String[] deflex_CD
           
private  String[] deflex_JJ
           
private  String[] deflex_NN
           
private  String[] deflex_NNP
           
private  String[] deflex_NNS
           
private  String[] deflex_RB
           
private  String[] deflex_VBG
           
private  String encoding
           
 String[][] lexBuff
           
(package private)  Lexicon lexicon
           
protected  Map rules
           
(package private) static String staart
           
private  String[] staartLex
           
 String[] tagBuff
           
 String[] wordBuff
           
 
Constructor Summary
POSTagger(URL lexiconURL, URL rulesURL)
          Construct a POS tagger using the platform's native encoding to read the lexicon and rules files.
POSTagger(URL lexiconURL, URL rulesURL, String encoding)
          Construct a POS tagger using the specified encoding to read the lexicon and rules files.
 
Method Summary
protected  String[] classifyWord(String wd)
          Attempts to classify an unknown word.
 Rule createNewRule(String ruleId)
          Creates a new rule of the required type according to the provided ID.
private static void help()
          Prints the help message.
static void main(String[] args)
          Main method.
protected  boolean oneStep(String word, List taggedSentence)
          Adds a new word at the last position of the 7-word window and tags the word currently in the middle (i.e. at position 3, counting from 0).
private static List readInput(String file)
          Reads one input file and creates the structure needed by the tagger for input.
 void readRules(URL rulesURL)
          Reads the rules from the rules input file.
 List runTagger(List sentences)
          Runs the tagger over a set of sentences.
 void setEncoding(String encoding)
          Deprecated. The rules and lexicon are read at construction time, so setting the encoding later will have no effect.
 void showRules()
           
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

rules

protected Map rules

lexicon

Lexicon lexicon

encoding

private String encoding

staart

static final String staart
See Also:
Constant Field Values

staartLex

private String[] staartLex

deflex_NNP

private String[] deflex_NNP

deflex_JJ

private String[] deflex_JJ

deflex_CD

private String[] deflex_CD

deflex_NNS

private String[] deflex_NNS

deflex_RB

private String[] deflex_RB

deflex_VBG

private String[] deflex_VBG

deflex_NN

private String[] deflex_NN

wordBuff

public String[] wordBuff

tagBuff

public String[] tagBuff

lexBuff

public String[][] lexBuff
Constructor Detail

POSTagger

public POSTagger(URL lexiconURL,
                 URL rulesURL)
          throws InvalidRuleException,
                 IOException
Construct a POS tagger using the platform's native encoding to read the lexicon and rules files.

Throws:
InvalidRuleException
IOException

POSTagger

public POSTagger(URL lexiconURL,
                 URL rulesURL,
                 String encoding)
          throws InvalidRuleException,
                 IOException
Construct a POS tagger using the specified encoding to read the lexicon and rules files.

Throws:
InvalidRuleException
IOException
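
As a rough illustration of what reading a URL resource with an explicit encoding can look like, here is a minimal sketch using only the standard library. It is not the tagger's own loading code: `EncodingReadSketch` and `readLines` are hypothetical names, and falling back to the platform default when the encoding is null is an assumption made to mirror the two constructors.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

class EncodingReadSketch {
    /** Reads all lines from a URL, decoding with the named charset (platform default if null). */
    static List<String> readLines(URL resource, String encoding) throws IOException {
        Charset charset = (encoding == null)
                ? Charset.defaultCharset()
                : Charset.forName(encoding);
        List<String> lines = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(resource.openStream(), charset))) {
            String line;
            while ((line = reader.readLine()) != null) {
                lines.add(line);
            }
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        // Arbitrary non-ASCII content written to a temporary file for the demo.
        Path tmp = Files.createTempFile("encoding-sketch", ".txt");
        Files.write(tmp, "caf\u00e9\n".getBytes(StandardCharsets.UTF_8));
        System.out.println(readLines(tmp.toUri().toURL(), "UTF-8"));
        Files.delete(tmp);
    }
}
```

Passing the encoding explicitly keeps the result independent of the machine's default charset, which is the motivation given for the three-argument constructor.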
Method Detail

createNewRule

public Rule createNewRule(String ruleId)
                   throws InvalidRuleException
Creates a new rule of the required type according to the provided ID.

Parameters:
ruleId - the ID for the rule to be created
Throws:
InvalidRuleException
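
Creating an object "of the required type according to the provided ID" is a factory pattern. The sketch below shows one common way to build such a factory with a lookup table. Everything in it is hypothetical: the `Rule` interface, the `UnknownRuleException` type, and the IDs "PREVTAG" and "NEXTTAG" are stand-ins chosen for illustration, and the page does not say how the real method maps IDs to rule classes.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

class RuleFactorySketch {
    /** Hypothetical stand-in for a rule type. */
    interface Rule {
        String describe();
    }

    /** Hypothetical stand-in for an "invalid rule" failure. */
    static class UnknownRuleException extends Exception {
        UnknownRuleException(String message) {
            super(message);
        }
    }

    // Maps each (made-up) rule ID to a supplier that builds a fresh rule.
    private static final Map<String, Supplier<Rule>> REGISTRY = new HashMap<>();
    static {
        REGISTRY.put("PREVTAG", () -> () -> "looks at the previous tag");
        REGISTRY.put("NEXTTAG", () -> () -> "looks at the next tag");
    }

    /** Returns a new rule for the ID, or fails if the ID is not registered. */
    static Rule createRule(String ruleId) throws UnknownRuleException {
        Supplier<Rule> supplier = REGISTRY.get(ruleId);
        if (supplier == null) {
            throw new UnknownRuleException("No rule registered for ID: " + ruleId);
        }
        return supplier.get();
    }

    public static void main(String[] args) throws UnknownRuleException {
        System.out.println(createRule("PREVTAG").describe());
    }
}
```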

runTagger

public List runTagger(List sentences)
Runs the tagger over a set of sentences.

Parameters:
sentences - a List of Lists of words to be tagged. Each inner list is one sentence, represented as a list of its words.
Returns:
a List of Lists of String[]: one inner list per tagged sentence, each containing two-element String arrays with the word at index 0 and the tag at index 1.
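
The nested input and output shapes can be easier to see with generics written out. The sketch below is self-contained and does not use the tagger: `stubTag` is a hypothetical stand-in that labels every word "NN" purely so the structures can be built and walked. With a real instance, the call would be `runTagger(sentences)`, and because the method is declared with raw `List` types, a cast on the result would be needed.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class RunTaggerShapeSketch {
    /** Hypothetical stand-in for a tagger: tags every word "NN" to show the shapes only. */
    static List<List<String[]>> stubTag(List<List<String>> sentences) {
        List<List<String[]>> result = new ArrayList<>();
        for (List<String> sentence : sentences) {
            List<String[]> tagged = new ArrayList<>();
            for (String word : sentence) {
                tagged.add(new String[] {word, "NN"}); // index 0 = word, index 1 = tag
            }
            result.add(tagged);
        }
        return result;
    }

    public static void main(String[] args) {
        // Input: one inner list per sentence, one String per word.
        List<List<String>> sentences = Arrays.asList(
                Arrays.asList("Dogs", "bark"),
                Arrays.asList("Cats", "sleep"));

        // Output: one inner list per sentence, one String[2] per word.
        for (List<String[]> sentence : stubTag(sentences)) {
            for (String[] pair : sentence) {
                System.out.println(pair[0] + "\t" + pair[1]);
            }
        }
    }
}
```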

setEncoding

public void setEncoding(String encoding)
Deprecated. The rules and lexicon are read at construction time, so setting the encoding later will have no effect.

Sets the encoding that the POS tagger uses to read the rules and lexicon files.


oneStep

protected boolean oneStep(String word,
                          List taggedSentence)
Adds a new word at the last position of the 7-word window and tags the word currently in the middle (i.e. at position 3, counting from 0). This method also reads the word at the first position and adds it, together with its tag, to the taggedSentence structure, because that word would be lost at the next advance. Returns true if this word completes a sentence, and false otherwise.

Parameters:
word - the new word
taggedSentence - a List of pairs of strings representing the results of tagging the current sentence so far.
Returns:
true if a full sentence is now tagged, otherwise false.
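
The sliding-window mechanics described for oneStep can be sketched in isolation. This is a simplified model under stated assumptions, not the class's code: `PAD` is a hypothetical padding marker (the class has a `staart` constant whose value is not shown on this page), `placeholderTag` replaces the lexicon lookup, and the rule application on the middle slot is left as a comment.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class WindowSketch {
    static final String PAD = "<pad>"; // hypothetical padding marker
    static final int SIZE = 7;         // window length
    static final int MIDDLE = 3;       // slot whose tag a real tagger would revise

    final String[] wordBuff = new String[SIZE];
    final String[] tagBuff = new String[SIZE];

    WindowSketch() {
        Arrays.fill(wordBuff, PAD);
        Arrays.fill(tagBuff, PAD);
    }

    /** Placeholder for a lexicon lookup: capitalised words get "NNP", others "NN". */
    static String placeholderTag(String word) {
        if (PAD.equals(word)) return PAD;
        return (!word.isEmpty() && Character.isUpperCase(word.charAt(0))) ? "NNP" : "NN";
    }

    /** Shifts the window left by one, appends the word, and emits the word in slot 0. */
    boolean oneStep(String word, List<String[]> taggedSentence) {
        System.arraycopy(wordBuff, 1, wordBuff, 0, SIZE - 1);
        System.arraycopy(tagBuff, 1, tagBuff, 0, SIZE - 1);
        wordBuff[SIZE - 1] = word;
        tagBuff[SIZE - 1] = placeholderTag(word);

        // A rule-based tagger would inspect the context here and possibly
        // change tagBuff[MIDDLE]; this sketch keeps the initial guess.

        if (PAD.equals(wordBuff[0])) return false; // nothing real has reached slot 0 yet
        taggedSentence.add(new String[] {wordBuff[0], tagBuff[0]});
        return PAD.equals(wordBuff[1]);            // padding follows: sentence is complete
    }

    /** Feeds a sentence, then padding, until the last word has left the window. */
    static List<String[]> tagSentence(List<String> words) {
        WindowSketch window = new WindowSketch();
        List<String[]> tagged = new ArrayList<>();
        for (String word : words) {
            window.oneStep(word, tagged);
        }
        boolean done = words.isEmpty();
        while (!done) {
            done = window.oneStep(PAD, tagged);
        }
        return tagged;
    }

    public static void main(String[] args) {
        for (String[] pair : tagSentence(Arrays.asList("The", "dog", "barks"))) {
            System.out.println(pair[0] + "/" + pair[1]);
        }
    }
}
```

Because each word is emitted only when it reaches slot 0, it has already passed through the middle slot with three words of context on each side by the time it is recorded.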

readRules

public void readRules(URL rulesURL)
               throws IOException,
                      InvalidRuleException
Reads the rules from the rules input file.

Throws:
IOException
InvalidRuleException

showRules

public void showRules()

classifyWord

protected String[] classifyWord(String wd)
Attempts to classify an unknown word.

Parameters:
wd - the word to be classified
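
The page does not describe how unknown words are classified. The field names deflex_CD, deflex_JJ, deflex_NN, deflex_NNP, deflex_NNS, deflex_RB and deflex_VBG suggest a set of default entries for those tags, so the sketch below shows one plausible style of surface-form guessing. The specific tests and their order are assumptions made for illustration, and `UnknownWordSketch` and `guessTags` are hypothetical names.

```java
class UnknownWordSketch {
    /** Guesses candidate tags for a word from its surface form (illustrative heuristics only). */
    static String[] guessTags(String wd) {
        if (wd.isEmpty()) return new String[] {"NN"};
        if (Character.isUpperCase(wd.charAt(0))) return new String[] {"NNP"}; // capitalised: proper noun
        if (wd.chars().anyMatch(Character::isDigit)) return new String[] {"CD"}; // contains a digit: number
        if (wd.indexOf('-') > 0) return new String[] {"JJ"};  // internal hyphen: adjective
        if (wd.endsWith("ing")) return new String[] {"VBG"};  // gerund or present participle
        if (wd.endsWith("ly")) return new String[] {"RB"};    // adverb
        if (wd.endsWith("s")) return new String[] {"NNS"};    // plural noun
        return new String[] {"NN"};                           // fallback: singular noun
    }

    public static void main(String[] args) {
        for (String word : new String[] {"London", "42", "well-known", "running", "quickly", "dogs", "table"}) {
            System.out.println(word + " -> " + guessTags(word)[0]);
        }
    }
}
```

Returning an array rather than a single tag matches the declared return type, which leaves room for more than one candidate tag per word.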

main

public static void main(String[] args)
Main method. Runs the tagger using the arguments to find the resources to be used for initialisation and the input file.


help

private static void help()
Prints the help message.


readInput

private static List readInput(String file)
                       throws IOException
Reads one input file and creates the structure needed by the tagger for input.

Throws:
IOException
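
Since readInput is private, its behaviour can only be inferred from the documented formats: Brill-style input (one sentence per line, tokens separated by spaces) going in, and a List of Lists of words coming out. A minimal sketch of such a conversion follows. `BrillInputSketch` and `readSentences` are hypothetical names, and skipping blank lines and splitting on any run of whitespace are assumptions.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class BrillInputSketch {
    /** Converts one-sentence-per-line text into a list of sentences, each a list of tokens. */
    static List<List<String>> readSentences(Reader source) throws IOException {
        List<List<String>> sentences = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            while ((line = reader.readLine()) != null) {
                String trimmed = line.trim();
                if (trimmed.isEmpty()) {
                    continue; // assumption: a blank line carries no sentence
                }
                sentences.add(Arrays.asList(trimmed.split("\\s+")));
            }
        }
        return sentences;
    }

    public static void main(String[] args) throws IOException {
        String text = "The dog barks .\nIt is loud .\n";
        System.out.println(readSentences(new StringReader(text)));
    }
}
```

Taking a Reader rather than a file name keeps the sketch testable without touching the file system. A file-based caller could wrap a file stream in an InputStreamReader with the desired charset.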