java.lang.Object
gate.util.AbstractFeatureBearer
gate.creole.AbstractResource
gate.creole.AbstractLanguageResource
gate.DocumentFormat
gate.corpora.TextualDocumentFormat
gate.corpora.HtmlDocumentFormat
gate.corpora.NekoHtmlDocumentFormat
@CreoleResource(name="GATE HTML Document Format",
isPrivate=true,
autoinstances=)
public class NekoHtmlDocumentFormat
extends HtmlDocumentFormat
DocumentFormat that uses Andy Clark's NekoHTML parser to parse HTML documents. It tries to render HTML in a similar way to a web browser, i.e. whitespace is normalized, paragraphs are separated by a blank line, etc. By default the text content of style and script tags is ignored completely, though the set of tags treated in this way is configurable via a CREOLE parameter.
This class extends HtmlDocumentFormat to cause DocumentImpl
to put the necessary whitespace normalization information into the
format's ampCodingInfo.
| Field Summary | |
|---|---|
private static Pattern |
afterNewlinePattern
Pattern that matches the beginning of every line in a multi-line string. |
private static boolean |
DEBUG
Debug flag |
private Set<String> |
ignorableTags
The set of tags whose text content is to be ignored when parsing. |
| Fields inherited from class gate.DocumentFormat |
|---|
element2StringMap, magic2mimeTypeMap, markupElementsMap, mimeString2ClassHandlerMap, mimeString2mimeTypeMap, suffixes2mimeTypeMap |
| Fields inherited from class gate.creole.AbstractLanguageResource |
|---|
dataStore, lrPersistentId |
| Fields inherited from class gate.creole.AbstractResource |
|---|
name |
| Constructor Summary | |
|---|---|
NekoHtmlDocumentFormat()
Default construction |
|
| Method Summary | |
|---|---|
private int[] |
buildLineOffsets(String docContent)
Build an array giving the starting character offset of each line in the document. |
Set<String> |
getIgnorableTags()
|
Resource |
init()
Initialise this resource, and return it. |
void |
setIgnorableTags(Set<String> newTags)
|
Boolean |
supportsRepositioning()
We support repositioning info for HTML files. |
void |
unpackMarkup(Document doc)
Old-style unpackMarkup, without repositioning info. |
void |
unpackMarkup(Document doc,
RepositioningInfo repInfo,
RepositioningInfo ampCodingInfo)
Unpack the markup in the document. |
| Methods inherited from class gate.corpora.TextualDocumentFormat |
|---|
annotateParagraphs, getDataStore, hasContentButNoValidUrl, setNewLineProperty |
| Methods inherited from class gate.creole.AbstractLanguageResource |
|---|
cleanup, getLRPersistenceId, getParent, isModified, setDataStore, setLRPersistenceId, setParent, sync |
| Methods inherited from class gate.creole.AbstractResource |
|---|
checkParameterValues, getBeanInfo, getName, getParameterValue, getParameterValue, removeResourceListeners, setName, setParameterValue, setParameterValue, setParameterValues, setParameterValues, setResourceListeners |
| Methods inherited from class java.lang.Object |
|---|
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
| Methods inherited from interface gate.LanguageResource |
|---|
getLRPersistenceId, getParent, isModified, setDataStore, setLRPersistenceId, setParent, sync |
| Methods inherited from interface gate.Resource |
|---|
cleanup, getParameterValue, setParameterValue, setParameterValues |
| Methods inherited from interface gate.util.NameBearer |
|---|
getName, setName |
| Field Detail |
|---|
private static final boolean DEBUG
private Set<String> ignorableTags
private static Pattern afterNewlinePattern
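The summary describes afterNewlinePattern as matching the beginning of every line in a multi-line string. The page does not show the actual regular expression, so the following is only a minimal sketch of one way to build such a pattern with java.util.regex, assuming `^` compiled with `Pattern.MULTILINE`. The class name `LineStartDemo` and the method `lineStarts` are hypothetical and are not part of GATE.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of GATE.
final class LineStartDemo {
    // Assumption: in MULTILINE mode "^" matches at the start of the input
    // and after each line terminator (except at the very end of the input).
    static final Pattern AFTER_NEWLINE = Pattern.compile("^", Pattern.MULTILINE);

    // Collect the character offset at which each line begins.
    static List<Integer> lineStarts(String text) {
        List<Integer> starts = new ArrayList<>();
        Matcher m = AFTER_NEWLINE.matcher(text);
        while (m.find()) {
            starts.add(m.start());
        }
        return starts;
    }
}
```

Note that with this pattern a trailing line terminator does not produce an extra match at the end of the input, and an empty string yields no matches at all.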
| Constructor Detail |
|---|
public NekoHtmlDocumentFormat()
| Method Detail |
|---|
@CreoleParameter(comment="HTML tags whose text content should be ignored",
defaultValue="script;style")
public void setIgnorableTags(Set<String> newTags)
public Set<String> getIgnorableTags()
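The default value "script;style" suggests that the set of ignorable tags is written as a semicolon-separated list. The sketch below illustrates, under that assumption, how such a string could be turned into a Set<String> and consulted while parsing. `IgnorableTagsDemo`, `parseTagList` and `isIgnorable` are hypothetical names for illustration only, and the case-insensitive comparison is an assumption of this sketch rather than documented behaviour of this class.

```java
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical demo class, not part of GATE.
final class IgnorableTagsDemo {
    // Split a semicolon-separated list such as "script;style" into a set
    // of lower-cased tag names, skipping empty entries.
    static Set<String> parseTagList(String value) {
        Set<String> tags = new LinkedHashSet<>();
        for (String part : value.split(";")) {
            String tag = part.trim().toLowerCase(Locale.ROOT);
            if (!tag.isEmpty()) {
                tags.add(tag);
            }
        }
        return tags;
    }

    // Assumption of this sketch: element names are compared case-insensitively.
    static boolean isIgnorable(Set<String> tags, String elementName) {
        return tags.contains(elementName.toLowerCase(Locale.ROOT));
    }
}
```

A parser built along these lines would drop character data while it is inside any element for which `isIgnorable` returns true.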
public Boolean supportsRepositioning()
Overrides: supportsRepositioning in class HtmlDocumentFormat
public void unpackMarkup(Document doc)
throws DocumentFormatException
Overrides: unpackMarkup in class HtmlDocumentFormat
Throws: DocumentFormatException
public void unpackMarkup(Document doc,
RepositioningInfo repInfo,
RepositioningInfo ampCodingInfo)
throws DocumentFormatException
Overrides: unpackMarkup in class HtmlDocumentFormat
Parameters: doc - The GATE document you want to parse. If doc.getSourceUrl() returns null, then the content of doc will be parsed. Using a URL is recommended because the parser will report errors correctly if the document is not well formed.
Throws: DocumentFormatException
private int[] buildLineOffsets(String docContent)
Parameters: docContent -
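buildLineOffsets is private and its body is not shown on this page. As a rough illustration of what "an array giving the starting character offset of each line" can look like, here is a minimal sketch. The treatment of "\r\n" as a single terminator and the helper `toOffset` (converting a 1-based line and column into a character offset) are assumptions of this sketch, and `LineOffsetsDemo` is a hypothetical class name.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class, not part of GATE.
final class LineOffsetsDemo {
    // Return the character offset at which each line of docContent starts.
    // The first line always starts at offset 0.
    static int[] buildLineOffsets(String docContent) {
        List<Integer> starts = new ArrayList<>();
        starts.add(0);
        for (int i = 0; i < docContent.length(); i++) {
            char c = docContent.charAt(i);
            // Assumption: "\r\n" counts as one line terminator.
            if (c == '\r' && i + 1 < docContent.length()
                    && docContent.charAt(i + 1) == '\n') {
                i++;
            }
            if (c == '\n' || c == '\r') {
                starts.add(i + 1);
            }
        }
        int[] result = new int[starts.size()];
        for (int k = 0; k < result.length; k++) {
            result[k] = starts.get(k);
        }
        return result;
    }

    // Convert a 1-based (line, column) position into a 0-based character offset.
    static int toOffset(int[] lineOffsets, int line, int column) {
        return lineOffsets[line - 1] + (column - 1);
    }
}
```

A table like this is useful when a parser reports positions as line and column numbers but annotations need character offsets. In this sketch, content that ends with a line terminator yields a final entry for the empty line that follows it.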
public Resource init()
throws ResourceInstantiationException
Specified by: init in interface Resource
Overrides: init in class HtmlDocumentFormat
Throws: ResourceInstantiationException