public interface SimpleCorpus
Corpora are lists of Document. TIPSTER equivalent: Collection.
| Field Summary | |
|---|---|
| static String | CORPUS_DOCLIST_PARAMETER_NAME |
| static String | CORPUS_NAME_PARAMETER_NAME |
| Method Summary | |
|---|---|
| String | getDocumentName(int index): Gets the name of a document in this corpus. |
| List<String> | getDocumentNames(): Gets the names of the documents in this corpus. |
| void | populate(URL directory, FileFilter filter, String encoding, boolean recurseDirectories): Fills this corpus with documents created on the fly from selected files in a directory. |
| void | populate(URL directory, FileFilter filter, String encoding, String mimeType, boolean recurseDirectories): Fills this corpus with documents created on the fly from selected files in a directory. |
| long | populate(URL singleConcatenatedFile, String documentRootElement, String encoding, int numberOfDocumentsToExtract, String documentNamePrefix, DocType documentType): Fills the provided corpus with documents extracted from the provided trec file. |
| Methods inherited from interface gate.LanguageResource |
|---|
| getDataStore, getLRPersistenceId, getParent, isModified, setDataStore, setLRPersistenceId, setParent, sync |

| Methods inherited from interface gate.Resource |
|---|
| cleanup, getParameterValue, init, setParameterValue, setParameterValues |

| Methods inherited from interface gate.util.FeatureBearer |
|---|
| getFeatures, setFeatures |

| Methods inherited from interface gate.util.NameBearer |
|---|
| getName, setName |

| Methods inherited from interface java.util.List |
|---|
| add, add, addAll, addAll, clear, contains, containsAll, equals, get, hashCode, indexOf, isEmpty, iterator, lastIndexOf, listIterator, listIterator, remove, remove, removeAll, retainAll, set, size, subList, toArray, toArray |
| Field Detail |
|---|
static final String CORPUS_NAME_PARAMETER_NAME
static final String CORPUS_DOCLIST_PARAMETER_NAME
| Method Detail |
|---|
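List<String> getDocumentNames()
Gets the names of the documents in this corpus.
Returns:
a List of Strings representing the names of the documents in this corpus.
String getDocumentName(int index)
Gets the name of a document in this corpus.
Parameters:
index - the index of the document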
void populate(URL directory,
FileFilter filter,
String encoding,
boolean recurseDirectories)
throws IOException,
ResourceInstantiationException
Fills this corpus with documents created on the fly from selected files in a directory. A FileFilter selects which files will be used and which will be ignored. A simple file filter based on extensions is provided in the GATE distribution (ExtensionFileFilter). An implementation of this method is provided as a static method at CorpusImpl.populate(Corpus, URL, FileFilter, String, boolean).
Parameters:
directory - the directory from which the files will be picked. This parameter is a URL for uniformity. It must be a URL of type file; otherwise an IllegalArgumentException will be thrown.
filter - the file filter used to select files from the target directory. If the filter is null, all files will be accepted.
encoding - the encoding to be used for reading the documents.
recurseDirectories - whether the directory should be traversed recursively. If true, all files from the provided directory and all its child directories (on as many levels as necessary) will be picked if accepted by the filter; otherwise the child directories will be ignored.
Throws:
IOException
ResourceInstantiationException
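The selection rules described for directory, filter and recurseDirectories can be illustrated with a small stand-alone sketch. It uses only the JDK and is not GATE's implementation: the class FileSelectionSketch and its select method are hypothetical names, and the choice to apply the filter to regular files only (never to directories) is an assumption, since this page does not say how directories are filtered.

```java
import java.io.File;
import java.io.FileFilter;
import java.net.URISyntaxException;
import java.net.URL;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper, not part of GATE: mirrors the documented selection rules. */
final class FileSelectionSketch {

    /** Returns the files under the directory that a populate call would consider. */
    static List<File> select(URL directory, FileFilter filter, boolean recurseDirectories)
            throws URISyntaxException {
        // The directory must be a URL of type file.
        if (!"file".equalsIgnoreCase(directory.getProtocol())) {
            throw new IllegalArgumentException("The URL provided is not of type file");
        }
        List<File> selected = new ArrayList<>();
        collect(new File(directory.toURI()), filter, recurseDirectories, selected);
        return selected;
    }

    private static void collect(File dir, FileFilter filter, boolean recurse, List<File> out) {
        File[] children = dir.listFiles();
        if (children == null) {
            return; // not a directory, or not readable
        }
        Arrays.sort(children); // keep the traversal order deterministic
        for (File child : children) {
            if (child.isDirectory()) {
                // Child directories are visited only when recursion is requested.
                if (recurse) {
                    collect(child, filter, recurse, out);
                }
            } else if (filter == null || filter.accept(child)) {
                // Assumption: a null filter accepts every regular file.
                out.add(child);
            }
        }
    }
}
```

For example, `FileSelectionSketch.select(dir.toURI().toURL(), f -> f.getName().endsWith(".xml"), true)` would list every .xml file below dir, where the lambda plays the role that ExtensionFileFilter plays in the GATE distribution.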
void populate(URL directory,
FileFilter filter,
String encoding,
String mimeType,
boolean recurseDirectories)
throws IOException,
ResourceInstantiationException
Fills this corpus with documents created on the fly from selected files in a directory. A FileFilter selects which files will be used and which will be ignored. A simple file filter based on extensions is provided in the GATE distribution (ExtensionFileFilter). An implementation of this method is provided as a static method at CorpusImpl.populate(Corpus, URL, FileFilter, String, boolean).
Parameters:
directory - the directory from which the files will be picked. This parameter is a URL for uniformity. It must be a URL of type file; otherwise an IllegalArgumentException will be thrown.
filter - the file filter used to select files from the target directory. If the filter is null, all files will be accepted.
encoding - the encoding to be used for reading the documents.
mimeType - the MIME type to be used when loading documents. If null, the MIME type will be determined automatically.
recurseDirectories - whether the directory should be traversed recursively. If true, all files from the provided directory and all its child directories (on as many levels as necessary) will be picked if accepted by the filter; otherwise the child directories will be ignored.
Throws:
IOException
ResourceInstantiationException
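The mimeType rule (an explicit value wins, null means automatic detection) can be sketched as follows. This is an illustration only: MimeTypeFallbackSketch, resolve and the extension table are hypothetical, and the extension lookup merely stands in for whatever detection GATE actually performs, which this page does not describe.

```java
import java.util.Locale;
import java.util.Map;

/** Hypothetical helper, not part of GATE: shows the "explicit value or auto-detect" rule. */
final class MimeTypeFallbackSketch {

    // Illustrative table only; the real detection mechanism is not documented here.
    private static final Map<String, String> BY_EXTENSION = Map.of(
            "xml", "text/xml",
            "html", "text/html",
            "txt", "text/plain");

    /** Returns mimeType when given, otherwise a guess based on the file extension. */
    static String resolve(String mimeType, String fileName) {
        if (mimeType != null) {
            return mimeType; // an explicit MIME type is used for every document
        }
        int dot = fileName.lastIndexOf('.');
        String extension = dot < 0 ? "" : fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
        return BY_EXTENSION.get(extension); // null when the extension is unknown
    }
}
```

Passing an explicit MIME type is useful when the files in the directory carry misleading or missing extensions, because it bypasses the per-file guess entirely.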
long populate(URL singleConcatenatedFile,
String documentRootElement,
String encoding,
int numberOfDocumentsToExtract,
String documentNamePrefix,
DocType documentType)
throws IOException,
ResourceInstantiationException
Fills the provided corpus with documents extracted from the provided trec file.
Parameters:
singleConcatenatedFile - the file with multiple documents in it.
documentRootElement - content between the start and end of this element is considered for documents.
encoding - the encoding of the trec file.
numberOfDocumentsToExtract - the number of documents to extract from the concatenated file; -1 indicates all documents.
documentNamePrefix - the prefix to use for document names when creating documents.
documentType - the type of the document (e.g. XML, HTML).
Throws:
IOException
ResourceInstantiationException
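How documentRootElement, numberOfDocumentsToExtract and documentNamePrefix interact can be shown with a plain-string sketch. It is not GATE's implementation: ConcatenatedFileSketch and extract are hypothetical names, the start tag is assumed to carry no attributes, and the naming scheme (prefix, underscore, running number) is invented for illustration. The sketch also works on content already read into a String, so encoding and documentType are left out.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper, not part of GATE: splits concatenated content into named documents. */
final class ConcatenatedFileSketch {

    /** Maps an illustrative document name to the text of each extracted document. */
    static Map<String, String> extract(String content, String documentRootElement,
                                       int numberOfDocumentsToExtract, String documentNamePrefix) {
        // Assumption: the root element's start tag has no attributes.
        String open = "<" + documentRootElement + ">";
        String close = "</" + documentRootElement + ">";
        Map<String, String> documents = new LinkedHashMap<>();
        int from = 0;
        // -1 means "extract everything"; otherwise stop once the limit is reached.
        while (numberOfDocumentsToExtract == -1 || documents.size() < numberOfDocumentsToExtract) {
            int start = content.indexOf(open, from);
            if (start < 0) {
                break; // no further documents
            }
            int end = content.indexOf(close, start);
            if (end < 0) {
                break; // unterminated document: ignore the remainder
            }
            end += close.length();
            // Hypothetical naming scheme: prefix plus a 1-based running number.
            String name = documentNamePrefix + "_" + (documents.size() + 1);
            documents.put(name, content.substring(start, end));
            from = end;
        }
        return documents;
    }
}
```

With a TREC-style input such as `<DOC>a</DOC><DOC>b</DOC>`, calling `extract(input, "DOC", -1, "trec")` yields two entries, while a limit of 1 stops after the first document.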