gate.creole.annic.lucene
Class LuceneSearchThread

java.lang.Object
  extended by gate.creole.annic.lucene.LuceneSearchThread

public class LuceneSearchThread
extends Object

A boolean query is first translated into one or more AND-normalized queries; for example, (A|B)C is translated into AC and BC. One instance of LuceneSearchThread is created for each normalized query. Each query is issued separately, and its results are submitted to the main LuceneSearcher instance.
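
The normalization step described above can be illustrated with a minimal sketch. It is not the ANNIC query parser: the class name AndNormalizationSketch and the method expand are hypothetical, and the query is assumed to be already split into a flat sequence of positions, each holding a list of alternatives, so (A|B)C is modeled as [[A, B], [C]].

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of GATE: expands a flat sequence of
// alternatives into every AND-normalized combination.
final class AndNormalizationSketch {

  // Each inner list holds the alternatives allowed at one position,
  // so (A|B)C is passed in as [[A, B], [C]].
  static List<String> expand(List<List<String>> groups) {
    List<String> results = new ArrayList<>();
    results.add(""); // start with one empty prefix
    for (List<String> alternatives : groups) {
      List<String> next = new ArrayList<>();
      // Append every alternative of this position to every prefix so far.
      for (String prefix : results) {
        for (String alternative : alternatives) {
          next.add(prefix + alternative);
        }
      }
      results = next;
    }
    return results;
  }
}
```

Under these assumptions, each string returned by expand corresponds to one normalized query, and therefore to one LuceneSearchThread instance.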

Author:
niraj

Nested Class Summary
private  class LuceneSearchThread.PatternResult
          Inner class to store pattern results.
private  class LuceneSearchThread.QueryItem
          Inner class to store query Item.
 
Field Summary
private  String baseTokenAnnotationType
          BaseTokenAnnotationType.
private  int contextWindow
          Number of base token annotations to be used in context.
private static boolean DEBUG
          Debug variable
 boolean finished
          Indicates if searching process is finished.
private  List ftp
          First term positions.
private  int ftpIndex
          First term position index.
private  boolean fwdIterationEnded
          Indicates if we've reached the end of search results.
private  String indexLocation
          The location of index.
private  LuceneSearcher luceneSearcher
          Instance of the LuceneSearcher.
private  String query
          Query
private  int queryItemIndex
          QueryItemIndex
private  QueryParser queryParser
          Instance of a QueryParser.
private  Map<String,List<LuceneSearchThread.QueryItem>> searchResultInfoMap
          A Map that holds information about search results.
private  int serializedFileIDIndex
          Index of the serializedFileID we are currently searching for.
private  String serializedFileIDInUse
          We keep track of what was the last ID of the serialized File that we visited.
private  List<String> serializedFilesIDsList
          List of serialized file IDs retrieved from the Lucene index.
private  boolean success
          Indicates whether the query was successful.
private  List<Token> tokenStreamInUse
          This is where we store the tokenStreamInUse
 
Constructor Summary
LuceneSearchThread()
           
 
Method Summary
private  boolean areTheyEqual(List ftp, List ftp1, int qType)
          Checks if two first term positions are identical.
private  List<Pattern> createAnnicPatterns(LuceneQueryResult aResult)
          Given a LuceneQueryResult, converts each pattern found in it into an ANNIC pattern.
private  String getCompatibleName(String name)
          Given a file name, replaces all invalid characters with '_'.
private  LuceneSearchThread.PatternResult getPatternResult(List<Token> subTokens, String annotationSetName, int patLen, int qType, int patWindow, String query, String baseTokenAnnotationType, int numberOfResultsToFetch)
          Takes the token stream, the first term positions, the pattern length, the query type and the pattern window, and returns the GateAnnotations for each pattern with its left and right context.
private  LuceneSearchThread.PatternResult getPatternResult(List<Token> subTokens, String annotationSetName, int patLen, int patWindow, String query, String baseTokenAnnotationType, int noOfResultsToFetch)
          Returns the valid patterns and their respective GateAnnotations.
 String getQuery()
          Gets the query.
private  List<Token> getTokenStreamFromDisk(String indexDirectory, String documentFolder, String documentID)
          This method looks on the disk to find the tokenStream
private  List<Pattern> locatePatterns(String docID, String annotationSetName, List<List<PatternAnnotation>> gateAnnotations, List firstTermPositions, List<Integer> patternLength, String queryString)
          Locates the valid patterns in token stream and discards the invalid first term positions returned by the lucene searcher.
 List<Pattern> next(int numberOfResults)
          This method returns a list containing instances of Pattern
private  String removeUnitNumber(String documentID)
          Each index unit is first converted into a separate lucene document.
 boolean search(String query, int patternWindow, String indexLocation, String corpusToSearchIn, String annotationSetToSearchIn, LuceneSearcher luceneSearcher)
          Collects the necessary information from Lucene; that information is used when the next method is called.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

DEBUG

private static boolean DEBUG
Debug variable


contextWindow

private int contextWindow
Number of base token annotations to be used in context.


indexLocation

private String indexLocation
The location of index.


queryParser

private QueryParser queryParser
Instance of a QueryParser.


baseTokenAnnotationType

private String baseTokenAnnotationType
BaseTokenAnnotationType.


luceneSearcher

private LuceneSearcher luceneSearcher
Instance of the LuceneSearcher.


finished

public boolean finished
Indicates if searching process is finished.


serializedFileIDIndex

private int serializedFileIDIndex
Index of the serializedFileID we are currently searching for.


queryItemIndex

private int queryItemIndex
QueryItemIndex


serializedFilesIDsList

private List<String> serializedFilesIDsList
List of serialized file IDs retrieved from the Lucene index.


searchResultInfoMap

private Map<String,List<LuceneSearchThread.QueryItem>> searchResultInfoMap
A Map that holds information about search results.


ftpIndex

private int ftpIndex
First term position index.


success

private boolean success
Indicates whether the query was successful.


fwdIterationEnded

private boolean fwdIterationEnded
Indicates if we've reached the end of search results.


serializedFileIDInUse

private String serializedFileIDInUse
Keeps track of the ID of the serialized file that was visited last. This is used for optimization.


tokenStreamInUse

private List<Token> tokenStreamInUse
This is where we store the tokenStreamInUse


query

private String query
Query


ftp

private List ftp
First term positions.

Constructor Detail

LuceneSearchThread

public LuceneSearchThread()
Method Detail

getCompatibleName

private String getCompatibleName(String name)
Given a file name, replaces all invalid characters with '_'.

Parameters:
name -
Returns:
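
As an illustration only, the replacement could be sketched as follows. The set of characters treated as invalid is an assumption (the characters \ / : * ? " < > |, which are commonly rejected in file names); this page does not state which characters the method actually replaces. The class name CompatibleNameSketch is hypothetical.

```java
// Hypothetical sketch, not the GATE implementation.
final class CompatibleNameSketch {

  // ASSUMPTION: the invalid characters are \ / : * ? " < > |
  // The documentation above does not list them.
  static String getCompatibleName(String name) {
    // The character class matches any single assumed-invalid character.
    return name.replaceAll("[\\\\/:*?\"<>|]", "_");
  }
}
```

With this assumed character set, a name that contains none of the listed characters is returned unchanged.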

search

public boolean search(String query,
                      int patternWindow,
                      String indexLocation,
                      String corpusToSearchIn,
                      String annotationSetToSearchIn,
                      LuceneSearcher luceneSearcher)
               throws SearchException
Collects the necessary information from Lucene; that information is used when the next method is called.

Parameters:
query - the query supplied by the user
patternWindow - number of tokens to include in the left and right context
indexLocation - location of the index the searcher should search in
corpusToSearchIn - the corpus to search in
annotationSetToSearchIn - the annotation set to search in
luceneSearcher - the LuceneSearcher instance from which this search thread is invoked
Returns:
true if the search was successful, false otherwise
Throws:
SearchException

next

public List<Pattern> next(int numberOfResults)
                   throws Exception
This method returns a list containing instances of Pattern

Parameters:
numberOfResults - the number of results to fetch
Returns:
a list of Pattern instances
Throws:
Exception
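
The split between search, which collects information, and next, which hands out results in batches, can be pictured with a small stand-in. Everything in this sketch is hypothetical: PagedSearchSketch is not a GATE class, results are plain strings rather than Pattern instances, and the way the finished flag is set is an assumption about how a caller might detect the end of the results.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the search/next protocol; not GATE code.
final class PagedSearchSketch {
  private final List<String> collected = new ArrayList<>();
  private int position = 0;
  boolean finished = false;

  // Stands in for search(...): remembers the hits for later calls to next.
  // ASSUMPTION: this stand-in always reports success.
  boolean search(List<String> hits) {
    collected.clear();
    collected.addAll(hits);
    position = 0;
    finished = collected.isEmpty();
    return true;
  }

  // Stands in for next(int): returns at most numberOfResults further hits.
  List<String> next(int numberOfResults) {
    int count = Math.max(0, numberOfResults);
    int end = Math.min(position + count, collected.size());
    List<String> batch = new ArrayList<>(collected.subList(position, end));
    position = end;
    if (position >= collected.size()) {
      finished = true; // nothing left to hand out
    }
    return batch;
  }
}
```

A caller of such a class would invoke search once and then call next repeatedly until finished is true or an empty batch is returned.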

createAnnicPatterns

private List<Pattern> createAnnicPatterns(LuceneQueryResult aResult)
Given a LuceneQueryResult, converts each pattern found in it into an ANNIC pattern. In other words, for each pattern it collects information such as the annotations in its context.

Parameters:
aResult -
Returns:

locatePatterns

private List<Pattern> locatePatterns(String docID,
                                     String annotationSetName,
                                     List<List<PatternAnnotation>> gateAnnotations,
                                     List firstTermPositions,
                                     List<Integer> patternLength,
                                     String queryString)
Locates the valid patterns in token stream and discards the invalid first term positions returned by the lucene searcher.

Parameters:
docID -
gateAnnotations -
firstTermPositions -
patternLength -
queryString -
Returns:

removeUnitNumber

private String removeUnitNumber(String documentID)
Each index unit is first converted into a separate Lucene document, and a new ID consisting of the document name and a unit number is assigned to it. When results are returned, the unit number is removed again.

Parameters:
documentID -
Returns:
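
A minimal sketch of stripping the unit number is shown below. The ID layout is an assumption: the sketch supposes that the unit number is appended to the document name after a final '-' character, which this page does not confirm. The class name UnitNumberSketch is hypothetical.

```java
// Hypothetical sketch, not the GATE implementation.
final class UnitNumberSketch {

  // ASSUMPTION: IDs look like "<documentName>-<unitNumber>".
  static String removeUnitNumber(String documentID) {
    int index = documentID.lastIndexOf('-');
    // No separator found: treat the whole ID as the document name.
    return index == -1 ? documentID : documentID.substring(0, index);
  }
}
```

Using the last separator rather than the first keeps document names that themselves contain '-' intact under this assumed layout.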

getTokenStreamFromDisk

private List<Token> getTokenStreamFromDisk(String indexDirectory,
                                           String documentFolder,
                                           String documentID)
                                    throws Exception
This method looks on the disk to find the tokenStream

Parameters:
indexDirectory - String
documentFolder - String
documentID - String
Returns:
the token stream as a List of Token
Throws:
Exception

getPatternResult

private LuceneSearchThread.PatternResult getPatternResult(List<Token> subTokens,
                                                          String annotationSetName,
                                                          int patLen,
                                                          int qType,
                                                          int patWindow,
                                                          String query,
                                                          String baseTokenAnnotationType,
                                                          int numberOfResultsToFetch)
Takes the token stream, the first term positions, the pattern length, the query type and the pattern window, and returns the GateAnnotations for each pattern with its left and right context.

Parameters:
subTokens -
ftp -
patLen -
qType -
patWindow -
query -
baseTokenAnnotationType -
Returns:
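
How a pattern and its left and right context relate to a first term position, a pattern length and a pattern window can be sketched as follows. This is a simplified illustration, not the method itself: tokens are plain strings instead of Token objects, the class name ContextWindowSketch and the method window are hypothetical, and the context is assumed to be counted in tokens and clipped at the ends of the stream.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of cutting a match and its context out of a token list.
final class ContextWindowSketch {

  // Returns three lists: left context, pattern, right context.
  static List<List<String>> window(List<String> tokens, int firstTermPosition,
                                   int patLen, int patWindow) {
    if (firstTermPosition < 0 || firstTermPosition > tokens.size()
        || patLen < 0 || patWindow < 0) {
      throw new IllegalArgumentException("position, length or window out of range");
    }
    int start = firstTermPosition;
    int end = Math.min(tokens.size(), start + patLen);       // end of the pattern
    int leftStart = Math.max(0, start - patWindow);          // clip at stream start
    int rightEnd = Math.min(tokens.size(), end + patWindow); // clip at stream end

    List<List<String>> parts = new ArrayList<>();
    parts.add(new ArrayList<>(tokens.subList(leftStart, start)));
    parts.add(new ArrayList<>(tokens.subList(start, end)));
    parts.add(new ArrayList<>(tokens.subList(end, rightEnd)));
    return parts;
  }
}
```

In this sketch a match near the beginning or end of the stream simply gets a shorter context on that side.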

getPatternResult

private LuceneSearchThread.PatternResult getPatternResult(List<Token> subTokens,
                                                          String annotationSetName,
                                                          int patLen,
                                                          int patWindow,
                                                          String query,
                                                          String baseTokenAnnotationType,
                                                          int noOfResultsToFetch)
Returns the valid patterns and their respective GateAnnotations.

Parameters:
subTokens - ArrayList
ftp - ArrayList
patLen - int
patWindow - int
query - String
Returns:
PatternResult

areTheyEqual

private boolean areTheyEqual(List ftp,
                             List ftp1,
                             int qType)
Checks if two first term positions are identical.

Parameters:
ftp -
ftp1 -
qType -
Returns:

getQuery

public String getQuery()
Gets the query.

Returns: