gate.corpora
Class TikaFormat

java.lang.Object
  extended by gate.util.AbstractFeatureBearer
      extended by gate.creole.AbstractResource
          extended by gate.creole.AbstractLanguageResource
              extended by gate.DocumentFormat
                  extended by gate.corpora.TikaFormat
All Implemented Interfaces:
LanguageResource, Resource, FeatureBearer, NameBearer, Serializable

@CreoleResource(name="Apache Tika Document Format",
                isPrivate=true,
                autoinstances=)
public class TikaFormat
extends DocumentFormat

See Also:
Serialized Form

Field Summary
private static org.apache.log4j.Logger log
           
private static long serialVersionUID
           
 
Fields inherited from class gate.DocumentFormat
element2StringMap, magic2mimeTypeMap, markupElementsMap, mimeString2ClassHandlerMap, mimeString2mimeTypeMap, suffixes2mimeTypeMap
 
Fields inherited from class gate.creole.AbstractLanguageResource
dataStore, lrPersistentId
 
Fields inherited from class gate.creole.AbstractResource
name
 
Constructor Summary
TikaFormat()
           
 
Method Summary
private  void assignMime(MimeType mime, String... exts)
           
private  org.apache.tika.parser.Parser createParser()
           
private  org.apache.tika.metadata.Metadata extractParserTips(Document doc)
          Tries to extract tips for the parser, as described at http://tika.apache.org/0.7/parser.html.
 Resource init()
          Initialise this resource, and return it.
private  void setDocumentFeatures(org.apache.tika.metadata.Metadata metadata, Document doc)
           
private  void setTikaFeature(org.apache.tika.metadata.Metadata metadata, String key, Map fmap)
           
 Boolean supportsRepositioning()
          Returns true if this document format can collect repositioning information during the unpack phase.
 void unpackMarkup(Document doc)
          Unpack the markup in the document.
 void unpackMarkup(Document doc, RepositioningInfo repInfo, RepositioningInfo ampCodingInfo)
           
 
Methods inherited from class gate.DocumentFormat
addStatusListener, areEqual, decideBetweenThreeMimeTypes, decideBetweenTwoMimeTypes, fireStatusChanged, getDocumentFormat, getDocumentFormat, getDocumentFormat, getElement2StringMap, getFeatures, getMarkupElementsMap, getMimeType, getMimeTypeForString, getShouldCollectRepositioning, getSupportedFileSuffixes, guessTypeUsingMagicNumbers, removeStatusListener, runMagicNumbers, setElement2StringMap, setFeatures, setMarkupElementsMap, setMimeType, setShouldCollectRepositioning, unpackMarkup
 
Methods inherited from class gate.creole.AbstractLanguageResource
cleanup, getDataStore, getLRPersistenceId, getParent, isModified, setDataStore, setLRPersistenceId, setParent, sync
 
Methods inherited from class gate.creole.AbstractResource
checkParameterValues, getBeanInfo, getName, getParameterValue, getParameterValue, removeResourceListeners, setName, setParameterValue, setParameterValue, setParameterValues, setParameterValues, setResourceListeners
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 
Methods inherited from interface gate.LanguageResource
getDataStore, getLRPersistenceId, getParent, isModified, setDataStore, setLRPersistenceId, setParent, sync
 
Methods inherited from interface gate.Resource
cleanup, getParameterValue, setParameterValue, setParameterValues
 
Methods inherited from interface gate.util.NameBearer
getName, setName
 

Field Detail

serialVersionUID

private static final long serialVersionUID
See Also:
Constant Field Values

log

private static final org.apache.log4j.Logger log
Constructor Detail

TikaFormat

public TikaFormat()
Method Detail

init

public Resource init()
              throws ResourceInstantiationException
Description copied from class: AbstractResource
Initialise this resource, and return it.

Specified by:
init in interface Resource
Overrides:
init in class AbstractResource
Throws:
ResourceInstantiationException

assignMime

private void assignMime(MimeType mime,
                        String... exts)

supportsRepositioning

public Boolean supportsRepositioning()
Description copied from class: DocumentFormat
Returns true if this document format can collect repositioning information during the unpack phase.
Subclasses of DocumentFormat that can collect repositioning information should override this method.

Overrides:
supportsRepositioning in class DocumentFormat
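
The override pattern described above can be sketched as follows. This is a minimal illustration that does not use the GATE API: `FormatSketch` and `RepositioningFormatSketch` are hypothetical stand-ins for `DocumentFormat` and a subclass, and the `false` default is an assumption made for the sketch.

```java
// Hypothetical stand-in for a base document format; not a GATE class.
abstract class FormatSketch {
    // Assumed default for this sketch: repositioning is not supported.
    Boolean supportsRepositioning() {
        return Boolean.FALSE;
    }
}

// Hypothetical subclass that can collect repositioning information
// while unpacking, and therefore overrides the method.
class RepositioningFormatSketch extends FormatSketch {
    @Override
    Boolean supportsRepositioning() {
        return Boolean.TRUE;
    }
}
```

A caller would check the returned value before asking the format to collect repositioning information.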

unpackMarkup

public void unpackMarkup(Document doc)
                  throws DocumentFormatException
Description copied from class: DocumentFormat
Unpack the markup in the document. This converts markup from the native format (e.g. XML, RTF) into annotations in GATE format. Uses the markupElementsMap to determine which elements to convert, and what annotation type names to use.

Specified by:
unpackMarkup in class DocumentFormat
Throws:
DocumentFormatException
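
To illustrate the role of the markup elements map described above, the following sketch shows how such a map could select which elements are converted and which annotation type names they receive. It does not use the GATE API: `MarkupMappingSketch`, `toAnnotationTypes`, and the sample element names are hypothetical, and skipping unmapped elements is an assumption made for the sketch rather than documented behaviour of `unpackMarkup`.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in, not part of GATE: shows how a markup elements
// map can drive the conversion of native markup into annotation types.
class MarkupMappingSketch {

    // Returns the annotation type name for each element that has a
    // mapping, in document order. Elements without a mapping are
    // skipped (an assumption made for this sketch).
    static List<String> toAnnotationTypes(List<String> elementNames,
                                          Map<String, String> markupElementsMap) {
        List<String> types = new ArrayList<>();
        for (String element : elementNames) {
            String type = markupElementsMap.get(element);
            if (type != null) {
                types.add(type);
            }
        }
        return types;
    }

    public static void main(String[] args) {
        Map<String, String> map = new LinkedHashMap<>();
        map.put("p", "paragraph");
        map.put("h1", "heading");
        // "script" has no mapping, so it produces no annotation type.
        List<String> elements = Arrays.asList("h1", "p", "script", "p");
        System.out.println(toAnnotationTypes(elements, map));
    }
}
```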

unpackMarkup

public void unpackMarkup(Document doc,
                         RepositioningInfo repInfo,
                         RepositioningInfo ampCodingInfo)
                  throws DocumentFormatException
Specified by:
unpackMarkup in class DocumentFormat
Throws:
DocumentFormatException

createParser

private org.apache.tika.parser.Parser createParser()

setDocumentFeatures

private void setDocumentFeatures(org.apache.tika.metadata.Metadata metadata,
                                 Document doc)

setTikaFeature

private void setTikaFeature(org.apache.tika.metadata.Metadata metadata,
                            String key,
                            Map fmap)

extractParserTips

private org.apache.tika.metadata.Metadata extractParserTips(Document doc)
Tries to extract tips for the parser, as described at http://tika.apache.org/0.7/parser.html. The tips are not critical for successful parsing.

Parameters:
doc - the document from which to extract the tips
Returns:
the metadata; never null, but may be empty
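
As a rough illustration of the contract above (optional tips, a result that is never null but may be empty), here is a minimal sketch. It does not use the Tika or GATE APIs: `ParserHintsSketch`, `extractHints`, and its two parameters are hypothetical, a plain `Map` stands in for `org.apache.tika.metadata.Metadata`, and the choice of tips is an assumption. The key strings `resourceName` and `Content-Type` are intended to mirror the values of Tika's `Metadata.RESOURCE_NAME_KEY` and `Metadata.CONTENT_TYPE`.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in, not part of GATE or Tika.
class ParserHintsSketch {

    // Collects whatever tips are available. Missing inputs are simply
    // skipped, so the result is never null but may be empty.
    static Map<String, String> extractHints(String sourceUrl, String mimeType) {
        Map<String, String> hints = new LinkedHashMap<>();
        if (sourceUrl != null && !sourceUrl.isEmpty()) {
            // Use the last path segment as the resource name.
            String name = sourceUrl.substring(sourceUrl.lastIndexOf('/') + 1);
            if (!name.isEmpty()) {
                hints.put("resourceName", name);
            }
        }
        if (mimeType != null && !mimeType.isEmpty()) {
            hints.put("Content-Type", mimeType);
        }
        return hints;
    }
}
```

Because the tips are not critical, a caller can pass the result to the parser unconditionally, whether or not any tips were found.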