gate.corpora
Class NekoHtmlDocumentFormat
java.lang.Object
gate.util.AbstractFeatureBearer
gate.creole.AbstractResource
gate.creole.AbstractLanguageResource
gate.DocumentFormat
gate.corpora.TextualDocumentFormat
gate.corpora.HtmlDocumentFormat
gate.corpora.NekoHtmlDocumentFormat
- All Implemented Interfaces:
- LanguageResource, Resource, FeatureBearer, NameBearer, Serializable
@CreoleResource(name="GATE HTML Document Format",
isPrivate=true,
autoinstances=)
public class NekoHtmlDocumentFormat
extends HtmlDocumentFormat
DocumentFormat that uses Andy Clark's NekoHTML
parser to parse HTML documents. It tries to render HTML in a similar
way to a web browser, i.e. whitespace is normalized, paragraphs are
separated by a blank line, etc. By default the text content of style
and script tags is ignored completely, though the set of tags treated
in this way is configurable via a CREOLE parameter.
This class extends HtmlDocumentFormat to cause DocumentImpl
to put the necessary whitespace normalization information into the
format's ampCodingInfo.
- See Also:
- Serialized Form
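The browser-like rendering described above (whitespace normalized, paragraphs separated by a blank line) can be illustrated with a small standalone sketch. It is not the code used by NekoHtmlDocumentFormat. The class name WhitespaceSketch and its methods are hypothetical, and it assumes the paragraph texts have already been extracted from the markup.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; not part of GATE or NekoHTML.
public class WhitespaceSketch {

    /** Collapse every run of whitespace into one space and trim both ends. */
    public static String normalize(String text) {
        return text.replaceAll("\\s+", " ").trim();
    }

    /** Join paragraph texts with one blank line between them, skipping empty ones. */
    public static String render(List<String> paragraphs) {
        List<String> cleaned = new ArrayList<>();
        for (String paragraph : paragraphs) {
            String normalized = normalize(paragraph);
            if (!normalized.isEmpty()) {
                cleaned.add(normalized);
            }
        }
        return String.join("\n\n", cleaned);
    }

    public static void main(String[] args) {
        List<String> paragraphs = Arrays.asList("  Hello \n   world ", "Second\tparagraph");
        System.out.println(render(paragraphs));
    }
}
```

The real format also has to record how offsets shift when whitespace is collapsed, which this sketch leaves out.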
| Methods inherited from class gate.DocumentFormat |
addStatusListener, areEqual, decideBetweenThreeMimeTypes, decideBetweenTwoMimeTypes, fireStatusChanged, getDocumentFormat, getDocumentFormat, getDocumentFormat, getElement2StringMap, getFeatures, getMarkupElementsMap, getMimeType, getMimeTypeForString, getShouldCollectRepositioning, getSupportedFileSuffixes, guessTypeUsingMagicNumbers, removeStatusListener, runMagicNumbers, setElement2StringMap, setFeatures, setMarkupElementsMap, setMimeType, setShouldCollectRepositioning, unpackMarkup |
| Methods inherited from class gate.creole.AbstractResource |
checkParameterValues, getBeanInfo, getName, getParameterValue, getParameterValue, removeResourceListeners, setName, setParameterValue, setParameterValue, setParameterValues, setParameterValues, setResourceListeners |
| Methods inherited from class java.lang.Object |
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
NekoHtmlDocumentFormat
public NekoHtmlDocumentFormat()
- Default constructor.
setIgnorableTags
@CreoleParameter(comment="HTML tags whose text content should be ignored",
defaultValue="script;style")
public void setIgnorableTags(Set<String> newTags)
getIgnorableTags
public Set<String> getIgnorableTags()
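As a rough illustration of how a default value such as "script;style" could correspond to the Set<String> taken by setIgnorableTags, the sketch below splits the string on semicolons. The class IgnorableTagsSketch and its methods are hypothetical. The semicolon splitting and the case-insensitive matching are assumptions made for the example, not behaviour confirmed by this page.

```java
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical illustration only; not part of GATE.
public class IgnorableTagsSketch {

    /** Split a semicolon-separated list such as "script;style" into a set of tag names. */
    public static Set<String> parseTags(String value) {
        Set<String> tags = new LinkedHashSet<>();
        for (String part : value.split(";")) {
            // Assumption: tag names are compared in lower case.
            String tag = part.trim().toLowerCase(Locale.ROOT);
            if (!tag.isEmpty()) {
                tags.add(tag);
            }
        }
        return tags;
    }

    /** Return true if the text content of the named element should be skipped. */
    public static boolean isIgnorable(Set<String> tags, String elementName) {
        return tags.contains(elementName.toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        Set<String> tags = parseTags("script;style");
        System.out.println(isIgnorable(tags, "SCRIPT"));
        System.out.println(isIgnorable(tags, "p"));
    }
}
```

A set built this way could be passed to setIgnorableTags on an instance of the format, for example to add further tags whose text should be dropped.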
supportsRepositioning
public Boolean supportsRepositioning()
- We support repositioning info for HTML files.
- Overrides:
supportsRepositioning in class HtmlDocumentFormat
unpackMarkup
public void unpackMarkup(Document doc)
throws DocumentFormatException
- Old-style unpackMarkup, without repositioning info.
- Overrides:
unpackMarkup in class HtmlDocumentFormat
- Throws:
DocumentFormatException
unpackMarkup
public void unpackMarkup(Document doc,
RepositioningInfo repInfo,
RepositioningInfo ampCodingInfo)
throws DocumentFormatException
- Unpack the markup in the document. This converts markup from the
native format into annotations in GATE format. If the document was
created from a String, it is advisable to set the doc's
sourceUrl to null. If the document has a valid URL,
the parser will try to parse the XML document the URL points to.
If the URL is not valid, or is null, then the doc's content
will be parsed. If the doc's content is not valid XML, the
parser might crash.
- Overrides:
unpackMarkup in class HtmlDocumentFormat
- Parameters:
doc - The GATE document you want to parse. If
doc.getSourceUrl() returns null
then the content of doc will be parsed. Using a URL is
recommended because the parser will report errors correctly
if the document is not well-formed.
- Throws:
DocumentFormatException
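The source selection described above (read from the URL when it is usable, otherwise parse the document's own content) could look roughly like the sketch below. It is a standalone illustration, not the method's actual implementation. The class SourceSelectionSketch and its methods are hypothetical, and the UTF-8 decoding of the URL stream is an assumption.

```java
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.net.URL;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration only; not part of GATE.
public class SourceSelectionSketch {

    /** Prefer the source URL; fall back to the in-memory content if it is null or unreadable. */
    public static Reader openSource(URL sourceUrl, String content) {
        if (sourceUrl != null) {
            try {
                // Assumption: the remote document is UTF-8 encoded.
                return new InputStreamReader(sourceUrl.openStream(), StandardCharsets.UTF_8);
            } catch (IOException e) {
                // URL could not be opened; fall through to the content.
            }
        }
        return new StringReader(content);
    }

    /** Read a reader to the end and close it. */
    public static String readAll(Reader reader) {
        StringBuilder text = new StringBuilder();
        char[] buffer = new char[1024];
        try (Reader in = reader) {
            int count;
            while ((count = in.read(buffer)) != -1) {
                text.append(buffer, 0, count);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        // A null source URL means the document content itself is parsed.
        System.out.println(readAll(openSource(null, "<p>hi</p>")));
    }
}
```

This mirrors the advice to set sourceUrl to null for documents created from a String: with no URL, the content is the only source.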
init
public Resource init()
throws ResourceInstantiationException
- Initialise this resource, and return it.
- Specified by:
init in interface Resource
- Overrides:
init in class HtmlDocumentFormat
- Throws:
ResourceInstantiationException