gate
Interface SimpleCorpus

All Superinterfaces:
Collection, FeatureBearer, Iterable, LanguageResource, List, NameBearer, Resource, Serializable
All Known Subinterfaces:
Corpus, IndexedCorpus
All Known Implementing Classes:
CorpusImpl, DatabaseCorpusImpl, SerialCorpusImpl

public interface SimpleCorpus
extends LanguageResource, List, NameBearer

Corpora are lists of Document. TIPSTER equivalent: Collection.


Field Summary
static String CORPUS_DOCLIST_PARAMETER_NAME
           
static String CORPUS_NAME_PARAMETER_NAME
           
 
Method Summary
 String getDocumentName(int index)
          Gets the name of a document in this corpus.
 List<String> getDocumentNames()
          Gets the names of the documents in this corpus.
 void populate(URL directory, FileFilter filter, String encoding, boolean recurseDirectories)
          Fills this corpus with documents created on the fly from selected files in a directory.
 void populate(URL directory, FileFilter filter, String encoding, String mimeType, boolean recurseDirectories)
          Fills this corpus with documents created on the fly from selected files in a directory.
 long populate(URL singleConcatenatedFile, String documentRootElement, String encoding, int numberOfDocumentsToExtract, String documentNamePrefix, DocType documentType)
          Fills this corpus with documents extracted from the provided TREC file.
 
Methods inherited from interface gate.LanguageResource
getDataStore, getLRPersistenceId, getParent, isModified, setDataStore, setLRPersistenceId, setParent, sync
 
Methods inherited from interface gate.Resource
cleanup, getParameterValue, init, setParameterValue, setParameterValues
 
Methods inherited from interface gate.util.FeatureBearer
getFeatures, setFeatures
 
Methods inherited from interface gate.util.NameBearer
getName, setName
 
Methods inherited from interface java.util.List
add, add, addAll, addAll, clear, contains, containsAll, equals, get, hashCode, indexOf, isEmpty, iterator, lastIndexOf, listIterator, listIterator, remove, remove, removeAll, retainAll, set, size, subList, toArray, toArray
 

Field Detail

CORPUS_NAME_PARAMETER_NAME

static final String CORPUS_NAME_PARAMETER_NAME
See Also:
Constant Field Values

CORPUS_DOCLIST_PARAMETER_NAME

static final String CORPUS_DOCLIST_PARAMETER_NAME
See Also:
Constant Field Values
Method Detail

getDocumentNames

List<String> getDocumentNames()
Gets the names of the documents in this corpus.

Returns:
a List of Strings representing the names of the documents in this corpus.

getDocumentName

String getDocumentName(int index)
Gets the name of a document in this corpus.

Parameters:
index - the index of the document
Returns:
a String value representing the name of the document at index in this corpus.
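
The relationship between the two name accessors can be sketched with a plain-Java stand-in. This assumes, as the descriptions above suggest but do not state outright, that getDocumentNames() lists names in corpus order, so that getDocumentName(i) matches element i of that list. NamedCorpusSketch and addDocument are hypothetical names used only for illustration; they are not part of GATE.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical stand-in for a corpus; it stores document names only.
class NamedCorpusSketch {
    private final List<String> names = new ArrayList<>();

    // Hypothetical helper: registers a document by name.
    void addDocument(String name) {
        names.add(name);
    }

    // Mirrors getDocumentNames(): all names, in corpus order (assumed).
    List<String> getDocumentNames() {
        return Collections.unmodifiableList(new ArrayList<>(names));
    }

    // Mirrors getDocumentName(int): the name of the document at index.
    String getDocumentName(int index) {
        return names.get(index);
    }
}
```

Under that assumption, looping over the indices and calling getDocumentName(i) yields the same sequence as iterating over getDocumentNames().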

populate

void populate(URL directory,
              FileFilter filter,
              String encoding,
              boolean recurseDirectories)
              throws IOException,
                     ResourceInstantiationException
Fills this corpus with documents created on the fly from selected files in a directory. A FileFilter selects which files are used and which are ignored. A simple extension-based file filter is provided in the GATE distribution (ExtensionFileFilter).

Parameters:
directory - the directory from which the files will be picked. This parameter is a URL for uniformity. It must be a URL of type file; otherwise an IllegalArgumentException is thrown. An implementation of this method is provided as the static method CorpusImpl.populate(Corpus, URL, FileFilter, String, boolean).
filter - the file filter used to select files from the target directory. If the filter is null, all files are accepted.
encoding - the encoding to be used for reading the documents
recurseDirectories - whether the directory should be traversed recursively. If true, all files in the provided directory and in all of its subdirectories (to any depth) are picked if accepted by the filter; otherwise the subdirectories are ignored.
Throws:
IOException
ResourceInstantiationException
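
The selection rules described for directory, filter and recurseDirectories can be modelled in plain Java. This is a simplified sketch and not GATE's implementation: PopulateSelectionSketch, byExtension and selectFiles are hypothetical names, the filter is applied to regular files only, and the exception type for a non-file URL is an assumption.

```java
import java.io.File;
import java.io.FileFilter;
import java.net.URISyntaxException;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper that models the file-selection rules only.
class PopulateSelectionSketch {

    // Extension-based filter, similar in spirit to ExtensionFileFilter.
    static FileFilter byExtension(String extension) {
        String suffix = "." + extension.toLowerCase();
        return file -> file.getName().toLowerCase().endsWith(suffix);
    }

    static List<File> selectFiles(URL directory, FileFilter filter,
                                  boolean recurseDirectories) {
        // Only URLs of type file can be listed as directories.
        if (!"file".equalsIgnoreCase(directory.getProtocol())) {
            throw new IllegalArgumentException("Not a file URL: " + directory);
        }
        File root;
        try {
            root = new File(directory.toURI());
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException("Malformed URL: " + directory, e);
        }
        List<File> selected = new ArrayList<>();
        collect(root, filter, recurseDirectories, selected);
        return selected;
    }

    private static void collect(File dir, FileFilter filter,
                                boolean recurse, List<File> out) {
        File[] entries = dir.listFiles();
        if (entries == null) {
            return; // not a directory, or not readable
        }
        for (File entry : entries) {
            if (entry.isDirectory()) {
                // Subdirectories are visited only when recursion is requested.
                if (recurse) {
                    collect(entry, filter, true, out);
                }
            } else if (filter == null || filter.accept(entry)) {
                // A null filter accepts every file.
                out.add(entry);
            }
        }
    }
}
```

For example, selectFiles(url, byExtension("txt"), true) would return every .txt file under the directory, while passing null and false returns all files directly inside it.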

populate

void populate(URL directory,
              FileFilter filter,
              String encoding,
              String mimeType,
              boolean recurseDirectories)
              throws IOException,
                     ResourceInstantiationException
Fills this corpus with documents created on the fly from selected files in a directory. A FileFilter selects which files are used and which are ignored. A simple extension-based file filter is provided in the GATE distribution (ExtensionFileFilter).

Parameters:
directory - the directory from which the files will be picked. This parameter is a URL for uniformity. It must be a URL of type file; otherwise an IllegalArgumentException is thrown. An implementation of this method is provided as the static method CorpusImpl.populate(Corpus, URL, FileFilter, String, boolean).
filter - the file filter used to select files from the target directory. If the filter is null, all files are accepted.
encoding - the encoding to be used for reading the documents
mimeType - the MIME type to be used when loading documents. If null, the MIME type is determined automatically.
recurseDirectories - whether the directory should be traversed recursively. If true, all files in the provided directory and in all of its subdirectories (to any depth) are picked if accepted by the filter; otherwise the subdirectories are ignored.
Throws:
IOException
ResourceInstantiationException
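
The mimeType rule (an explicit value applies to every file, null means automatic detection) can be sketched as follows. MimeTypeChoiceSketch and resolveMimeType are hypothetical names, and the JDK's name-based guess is only a stand-in for whatever detection GATE performs internally.

```java
import java.net.URLConnection;

// Hypothetical helper modelling the rule: explicit value wins, null means "detect".
class MimeTypeChoiceSketch {

    static String resolveMimeType(String mimeType, String fileName) {
        if (mimeType != null) {
            // The caller forced one MIME type for all selected files.
            return mimeType;
        }
        // Stand-in for automatic detection (assumption: GATE uses its own mechanism).
        // The result may be null when the file extension is not recognised.
        return URLConnection.guessContentTypeFromName(fileName);
    }
}
```

Passing an explicit type is useful when file extensions are missing or misleading, since no guessing takes place in that case.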

populate

long populate(URL singleConcatenatedFile,
              String documentRootElement,
              String encoding,
              int numberOfDocumentsToExtract,
              String documentNamePrefix,
              DocType documentType)
              throws IOException,
                     ResourceInstantiationException
Fills this corpus with documents extracted from the provided TREC file.

Parameters:
singleConcatenatedFile - the file with multiple documents in it.
documentRootElement - content between the start and end of this element is considered for documents.
encoding - the encoding of the TREC file.
numberOfDocumentsToExtract - the number of documents to extract from the concatenated file. Use -1 to extract all documents.
documentNamePrefix - the prefix to use for document names when creating from
documentType - the type of the document (e.g. XML, HTML)
Returns:
the total length, in bytes, of the documents populated into the corpus
Throws:
IOException
ResourceInstantiationException
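
How documents might be cut out of a single concatenated file can be sketched in plain Java. This is a model under stated assumptions, not GATE's implementation: ConcatenatedFileSketch is a hypothetical class, the file content is passed in as a String, the root element is assumed to carry no attributes, each extracted document keeps its enclosing tags, document names are assumed to be the prefix plus a running number, and the documentType parameter is left out.

```java
import java.nio.charset.Charset;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical model of splitting one concatenated file into documents.
class ConcatenatedFileSketch {

    // Extracted documents keyed by generated name (naming scheme is assumed).
    final Map<String, String> documents = new LinkedHashMap<>();

    long populate(String fileContent, String documentRootElement, String encoding,
                  int numberOfDocumentsToExtract, String documentNamePrefix) {
        String open = "<" + documentRootElement + ">";
        String close = "</" + documentRootElement + ">";
        Charset charset = Charset.forName(encoding);
        long totalBytes = 0;
        int from = 0;
        // A negative limit (such as -1) means "extract all documents".
        while (numberOfDocumentsToExtract < 0
                || documents.size() < numberOfDocumentsToExtract) {
            int start = fileContent.indexOf(open, from);
            if (start < 0) {
                break; // no further documents
            }
            int end = fileContent.indexOf(close, start);
            if (end < 0) {
                break; // unterminated document, stop here
            }
            end += close.length();
            String doc = fileContent.substring(start, end);
            documents.put(documentNamePrefix + "_" + (documents.size() + 1), doc);
            // The return value accumulates the encoded size of each document.
            totalBytes += doc.getBytes(charset).length;
            from = end;
        }
        return totalBytes;
    }
}
```

With a limit of 2 and a file holding three DOC elements, this sketch stores two documents and returns the combined byte length of just those two.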