gate.creole.annic.lucene
Class LuceneDocument

java.lang.Object
  extended by gate.creole.annic.lucene.LuceneDocument

public class LuceneDocument
extends Object

Given an instance of a GATE Document, this class provides a method to convert it into a format that Lucene can understand and store in its indexes. This class also stores the token stream on disk so that it can be retrieved at search time.

Author:
niraj

Nested Class Summary
private  class LuceneDocument.OffsetGroup
          Internal class used for storing the offsets of annotations.
 
Constructor Summary
LuceneDocument()
           
 
Method Summary
 List<Document> createDocuments(String corpusPersistenceID, Document gateDoc, String documentID, ArrayList<String> annotSetsToInclude, ArrayList<String> annotSetsToExclude, ArrayList<String> featuresToInclude, ArrayList<String> featuresToExclude, String indexLocation, String baseTokenAnnotationType, Boolean createTokensAutomatically, String indexUnitAnnotationType)
          Given an instance of a GATE Document, converts it into a format that Lucene can understand and store in its indexes.
private  boolean createTokens(Document gateDocument, AnnotationSet set)
           
private  String getCompatibleName(String name)
          Some file names are not compatible with the underlying file system.
private  ArrayList<Token>[] getTokens(Document document, AnnotationSet inputAs, ArrayList<String> featuresToInclude, ArrayList<String> featuresToExclude, String baseTokenAnnotationType, AnnotationSet baseTokenSet, String indexUnitAnnotationType, AnnotationSet indexUnitSet, Set<String> indexedFeatures)
          Given a GATE document and the other required parameters, this method creates, for each annotation of type indexUnitAnnotationType, a separate list of the base tokens that lie within it.
private  void writeOnDisk(ArrayList tokenStream, String folderName, String fileName, String location)
          Given a token stream and a file name, this method writes the token stream to the provided location.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

LuceneDocument

public LuceneDocument()
Method Detail

createDocuments

public List<Document> createDocuments(String corpusPersistenceID,
                                      Document gateDoc,
                                      String documentID,
                                      ArrayList<String> annotSetsToInclude,
                                      ArrayList<String> annotSetsToExclude,
                                      ArrayList<String> featuresToInclude,
                                      ArrayList<String> featuresToExclude,
                                      String indexLocation,
                                      String baseTokenAnnotationType,
                                      Boolean createTokensAutomatically,
                                      String indexUnitAnnotationType)
Given an instance of a GATE Document, converts it into a format that Lucene can understand and store in its indexes. This method also stores the token stream on disk so that it can be retrieved at search time.

Parameters:
corpusPersistenceID -
gateDoc -
documentID -
annotSet -
featuresToExclude -
indexLocation -
baseTokenAnnotationType -
indexUnitAnnotationType -
Returns:

createTokens

private boolean createTokens(Document gateDocument,
                             AnnotationSet set)

getCompatibleName

private String getCompatibleName(String name)
Some file names are not compatible with the underlying file system. This method replaces every incompatible character with '_'.

Parameters:
name -
Returns:
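
The replacement described above can be sketched as follows. This is a minimal illustration, not the actual implementation: the documentation does not say which characters count as incompatible, so the whitelist below (letters, digits, '.', '_' and '-') is an assumption, and the class and method names are hypothetical.

```java
final class CompatibleNameSketch {

    // Assumption: anything outside letters, digits, '.', '_' and '-' is
    // treated as incompatible. The real method may use a different set.
    static String toCompatibleName(String name) {
        return name.replaceAll("[^A-Za-z0-9._-]", "_");
    }

    public static void main(String[] args) {
        // prints "corpus_1_doc_name.xml"
        System.out.println(toCompatibleName("corpus:1/doc name.xml"));
    }
}
```

A whitelist is used here because it stays safe across file systems; a blacklist of known-bad characters would preserve more of the original name.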

writeOnDisk

private void writeOnDisk(ArrayList tokenStream,
                         String folderName,
                         String fileName,
                         String location)
                  throws Exception
Given a token stream and a file name, this method writes the token stream to the provided location.

Parameters:
tokenStream -
fileName -
location -
Throws:
Exception
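
One way such a method could persist a token stream is sketched below. The on-disk format is not documented here, so the use of Java object serialization, the `<location>/<folderName>/<fileName>` layout, and the use of plain strings in place of Token objects are all assumptions made for illustration. The class and method names are hypothetical.

```java
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;

final class TokenStreamStoreSketch {

    // Assumed layout: <location>/<folderName>/<fileName>.
    static Path write(ArrayList<String> tokenStream, String folderName,
                      String fileName, String location) throws IOException {
        Path folder = Paths.get(location, folderName);
        Files.createDirectories(folder); // create missing parent folders
        Path file = folder.resolve(fileName);
        try (ObjectOutputStream out =
                 new ObjectOutputStream(Files.newOutputStream(file))) {
            out.writeObject(tokenStream); // ArrayList and String are Serializable
        }
        return file;
    }

    // Reads the list back, as would be needed at search time.
    @SuppressWarnings("unchecked")
    static ArrayList<String> read(Path file)
            throws IOException, ClassNotFoundException {
        try (ObjectInputStream in =
                 new ObjectInputStream(Files.newInputStream(file))) {
            return (ArrayList<String>) in.readObject();
        }
    }
}
```

The `read` helper is included only to show the round trip; the class documented here exposes no such method.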

getTokens

private ArrayList<Token>[] getTokens(Document document,
                                     AnnotationSet inputAs,
                                     ArrayList<String> featuresToInclude,
                                     ArrayList<String> featuresToExclude,
                                     String baseTokenAnnotationType,
                                     AnnotationSet baseTokenSet,
                                     String indexUnitAnnotationType,
                                     AnnotationSet indexUnitSet,
                                     Set<String> indexedFeatures)
Given a GATE document and the other required parameters, this method creates, for each annotation of type indexUnitAnnotationType, a separate list of the base tokens that lie within it.

Parameters:
document -
inputAs -
featuresToExclude -
baseTokenAnnotationType -
indexUnitAnnotationType -
Returns:
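
The grouping described above, one list of base tokens per index unit, can be illustrated without the GATE API. In this sketch `Span` is a hypothetical stand-in for an annotation, reduced to its start and end offsets and its text, and a token is taken to belong to a unit when it lies entirely within the unit's offsets. That containment rule is an assumption, and feature filtering is left out.

```java
import java.util.ArrayList;
import java.util.List;

final class IndexUnitGroupingSketch {

    /** Hypothetical stand-in for an annotation: offsets plus covered text. */
    static final class Span {
        final long start;
        final long end;
        final String text;

        Span(long start, long end, String text) {
            this.start = start;
            this.end = end;
            this.text = text;
        }
    }

    // Returns one list per index unit, holding the base tokens that lie
    // entirely inside that unit, in the order they were supplied.
    static List<List<Span>> groupByUnit(List<Span> units, List<Span> baseTokens) {
        List<List<Span>> result = new ArrayList<>();
        for (Span unit : units) {
            List<Span> inside = new ArrayList<>();
            for (Span token : baseTokens) {
                if (token.start >= unit.start && token.end <= unit.end) {
                    inside.add(token);
                }
            }
            result.add(inside);
        }
        return result;
    }
}
```

With sentence annotations as index units, for example, each sentence yields its own token list. The documented method returns an array of `ArrayList<Token>` rather than the nested `List` used here.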