gate.corpora
Class TikaFormat
java.lang.Object
gate.util.AbstractFeatureBearer
gate.creole.AbstractResource
gate.creole.AbstractLanguageResource
gate.DocumentFormat
gate.corpora.TikaFormat
- All Implemented Interfaces:
- LanguageResource, Resource, FeatureBearer, NameBearer, Serializable
@CreoleResource(name="Apache Tika Document Format",
isPrivate=true,
autoinstances=)
public class TikaFormat
extends DocumentFormat
- See Also:
- Serialized Form
| Methods inherited from class gate.DocumentFormat |
addStatusListener, areEqual, decideBetweenThreeMimeTypes, decideBetweenTwoMimeTypes, fireStatusChanged, getDocumentFormat, getDocumentFormat, getDocumentFormat, getElement2StringMap, getFeatures, getMarkupElementsMap, getMimeType, getMimeTypeForString, getShouldCollectRepositioning, getSupportedFileSuffixes, guessTypeUsingMagicNumbers, removeStatusListener, runMagicNumbers, setElement2StringMap, setFeatures, setMarkupElementsMap, setMimeType, setShouldCollectRepositioning, unpackMarkup |
| Methods inherited from class gate.creole.AbstractResource |
checkParameterValues, getBeanInfo, getName, getParameterValue, getParameterValue, removeResourceListeners, setName, setParameterValue, setParameterValue, setParameterValues, setParameterValues, setResourceListeners |
| Methods inherited from class java.lang.Object |
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
serialVersionUID
private static final long serialVersionUID
- See Also:
- Constant Field Values
log
private static final org.apache.log4j.Logger log
TikaFormat
public TikaFormat()
init
public Resource init()
throws ResourceInstantiationException
- Description copied from class:
AbstractResource
- Initialise this resource, and return it.
- Specified by:
init in interface Resource
- Overrides:
init in class AbstractResource
- Throws:
ResourceInstantiationException
assignMime
private void assignMime(MimeType mime,
String... exts)
supportsRepositioning
public Boolean supportsRepositioning()
- Description copied from class:
DocumentFormat
- Returns true if this document format can collect repositioning
information during the unpack phase.
A subclass of a document format should override this method if it can
collect repositioning information.
- Overrides:
supportsRepositioning in class DocumentFormat
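The override pattern described above can be sketched in plain Java. This is only an illustration, not GATE code: SketchFormat and RepositioningSketchFormat are hypothetical stand-ins for DocumentFormat and one of its subclasses, and the default of false in the base class is an assumption.

```java
// Hypothetical stand-in for gate.DocumentFormat; not the real GATE class.
class SketchFormat {
    // Assumed default: the format does not collect repositioning information.
    public Boolean supportsRepositioning() {
        return Boolean.FALSE;
    }
}

// Hypothetical subclass that can collect repositioning information while
// unpacking, so it overrides the method to advertise that.
class RepositioningSketchFormat extends SketchFormat {
    @Override
    public Boolean supportsRepositioning() {
        return Boolean.TRUE;
    }
}

class SupportsRepositioningDemo {
    // A caller would check the flag before asking for repositioning data.
    static String describe(SketchFormat format) {
        return format.supportsRepositioning()
                ? "can collect repositioning info"
                : "cannot collect repositioning info";
    }

    public static void main(String[] args) {
        System.out.println(describe(new SketchFormat()));
        System.out.println(describe(new RepositioningSketchFormat()));
    }
}
```

Returning Boolean rather than boolean mirrors the signature shown above; the ternary in describe relies on auto-unboxing.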
unpackMarkup
public void unpackMarkup(Document doc)
throws DocumentFormatException
- Description copied from class:
DocumentFormat
- Unpack the markup in the document. This converts markup from the
native format (e.g. XML, RTF) into annotations in GATE format.
Uses the markupElementsMap to determine which elements to convert, and
what annotation type names to use.
- Specified by:
unpackMarkup in class DocumentFormat
- Throws:
DocumentFormatException
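The role of the markupElementsMap can be illustrated with a small self-contained sketch. It does not use GATE classes: MarkupMapSketch and annotationTypes are hypothetical names, and the null-map behaviour (every element is converted under its own name) is an assumption about how such a map is typically applied, not something this page states.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of mapping markup element names to
// annotation type names; not part of the GATE API.
final class MarkupMapSketch {

    /**
     * Returns the annotation type names that would be created for the
     * given element names.
     * Assumption: with a null map every element keeps its own name; with
     * a non-null map only mapped elements are converted, using the mapped
     * name as the annotation type.
     */
    static List<String> annotationTypes(List<String> elementNames,
                                        Map<String, String> markupElementsMap) {
        List<String> types = new ArrayList<>();
        for (String element : elementNames) {
            if (markupElementsMap == null) {
                types.add(element);
            } else {
                String mapped = markupElementsMap.get(element);
                if (mapped != null) {
                    types.add(mapped);
                }
            }
        }
        return types;
    }

    public static void main(String[] args) {
        Map<String, String> map = new HashMap<>();
        map.put("p", "paragraph");
        map.put("h1", "heading");
        // "span" has no entry, so it is skipped under the stated assumption.
        System.out.println(annotationTypes(Arrays.asList("h1", "p", "span"), map));
    }
}
```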
unpackMarkup
public void unpackMarkup(Document doc,
RepositioningInfo repInfo,
RepositioningInfo ampCodingInfo)
throws DocumentFormatException
- Specified by:
unpackMarkup in class DocumentFormat
- Throws:
DocumentFormatException
createParser
private org.apache.tika.parser.Parser createParser()
setDocumentFeatures
private void setDocumentFeatures(org.apache.tika.metadata.Metadata metadata,
Document doc)
setTikaFeature
private void setTikaFeature(org.apache.tika.metadata.Metadata metadata,
String key,
Map fmap)
extractParserTips
private org.apache.tika.metadata.Metadata extractParserTips(Document doc)
- Tries to extract tips (hints) for the parser, as described at
http://tika.apache.org/0.7/parser.html. The tips are optional: parsing
can succeed without them.
- Parameters:
doc - the document from which parser tips are extracted
- Returns:
- metadata, not null but may be empty
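The contract above (best-effort hints, a result that is never null but may be empty) can be sketched without Tika or GATE on the classpath. Everything here is hypothetical: ParserTipsSketch and extractTips are illustrative names, a plain Map stands in for org.apache.tika.metadata.Metadata, and the choice of hints (resource name and content type) is an assumption about what the real private method collects. The two key strings are believed to match the values of Tika's Metadata.RESOURCE_NAME_KEY and Metadata.CONTENT_TYPE constants.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration of collecting optional parser hints;
// a Map stands in for Tika's Metadata object.
final class ParserTipsSketch {
    // Assumed to equal Metadata.RESOURCE_NAME_KEY and Metadata.CONTENT_TYPE.
    static final String RESOURCE_NAME_KEY = "resourceName";
    static final String CONTENT_TYPE = "Content-Type";

    /**
     * Builds a map of hints from whatever is known about the document.
     * Either argument may be null; the result is never null but may be empty.
     */
    static Map<String, String> extractTips(String sourceUrl, String mimeType) {
        Map<String, String> tips = new LinkedHashMap<>();
        if (sourceUrl != null && !sourceUrl.isEmpty()) {
            // Use the last path segment as the resource name, if there is one.
            String name = sourceUrl.substring(sourceUrl.lastIndexOf('/') + 1);
            if (!name.isEmpty()) {
                tips.put(RESOURCE_NAME_KEY, name);
            }
        }
        if (mimeType != null && !mimeType.isEmpty()) {
            tips.put(CONTENT_TYPE, mimeType);
        }
        return tips;
    }

    public static void main(String[] args) {
        System.out.println(extractTips("file:/tmp/report.pdf", "application/pdf"));
        System.out.println(extractTips(null, null));
    }
}
```

Because missing inputs simply leave the map empty, a caller can pass the result to the parser unconditionally, which matches the note that the tips are not critical for successful parsing.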