gate.corpora
Class DocumentStaxUtils

java.lang.Object
  extended by gate.corpora.DocumentStaxUtils

public class DocumentStaxUtils
extends Object

This class provides support for reading and writing GATE XML format using StAX (the Streaming API for XML).


Field Summary
static char INVALID_CHARACTER_REPLACEMENT
          The char used to replace characters in text content that are illegal in XML.
static Comparator<Annotation> LONGEST_FIRST_OFFSET_COMPARATOR
          Comparator that compares annotations based on their offsets; when two annotations start at the same location, the longer one is considered to come first in the ordering.
static int LT_THRESHOLD
          The number of < signs after which we encode a string using CDATA rather than writeCharacters.
static String XCES_NAMESPACE
          XCES namespace URI.
static String XCES_VERSION
          Version of XCES that this class can handle.
 
Constructor Summary
DocumentStaxUtils()
           
 
Method Summary
static Boolean readAnnotationSet(javax.xml.stream.XMLStreamReader xsr, AnnotationSet annotationSet, Map<Integer,Long> nodeIdToOffsetMap, Set<Integer> allAnnotIds, Boolean requireAnnotationIds)
          Processes an AnnotationSet element from the given reader and fills the given annotation set with the corresponding annotations.
static FeatureMap readFeatureMap(javax.xml.stream.XMLStreamReader xsr)
          Processes a GateDocumentFeatures or Annotation element to build a feature map.
static void readGateXmlDocument(javax.xml.stream.XMLStreamReader xsr, Document doc)
          Reads GATE XML format data from the given XMLStreamReader and puts the content and annotation sets into the given Document, replacing its current content.
static void readGateXmlDocument(javax.xml.stream.XMLStreamReader xsr, Document doc, StatusListener statusListener)
          Reads GATE XML format data from the given XMLStreamReader and puts the content and annotation sets into the given Document, replacing its current content.
static String readTextWithNodes(javax.xml.stream.XMLStreamReader xsr, Map<Integer,Long> nodeIdToOffsetMap)
          Processes the TextWithNodes element from this XMLStreamReader, returning the text content of the document.
static void readXces(InputStream is, AnnotationSet as)
          Read XML data in XCES format from the given stream and add the corresponding annotations to the given annotation set.
static void readXces(javax.xml.stream.XMLStreamReader xsr, AnnotationSet as)
          Read XML data in XCES format from the given reader and add the corresponding annotations to the given annotation set.
static FeatureMap readXcesFeatureMap(javax.xml.stream.XMLStreamReader xsr)
          Processes a struct element to build a feature map.
static String toXml(Document doc)
          Returns a string containing the specified document in GATE XML format.
static void writeAnnotationSet(AnnotationSet annotations, String asName, javax.xml.stream.XMLStreamWriter xsw, String namespaceURI)
          Retained for binary compatibility; new code should call the Collection<Annotation> version instead.
static void writeAnnotationSet(AnnotationSet annotations, javax.xml.stream.XMLStreamWriter xsw, String namespaceURI)
          Writes the given annotation set to an XMLStreamWriter as GATE XML format.
static void writeAnnotationSet(Collection<Annotation> annotations, String asName, javax.xml.stream.XMLStreamWriter xsw, String namespaceURI)
          Writes the given annotation set to an XMLStreamWriter as GATE XML format.
static void writeDocument(Document doc, File file)
          Write the specified GATE document to a File.
static void writeDocument(Document doc, File file, String namespaceURI)
          Write the specified GATE document to a File, optionally putting the XML in a namespace.
static void writeDocument(Document doc, Map<String,Collection<Annotation>> annotationSets, javax.xml.stream.XMLStreamWriter xsw, String namespaceURI)
          Write the specified GATE Document to an XMLStreamWriter.
static void writeDocument(Document doc, javax.xml.stream.XMLStreamWriter xsw, String namespaceURI)
          Write the specified GATE Document to an XMLStreamWriter.
static void writeFeatures(FeatureMap features, javax.xml.stream.XMLStreamWriter xsw, String namespaceURI)
          Write a feature map to the given XMLStreamWriter.
static void writeTextWithNodes(Document doc, Collection<Collection<Annotation>> annotationSets, javax.xml.stream.XMLStreamWriter xsw, String namespaceURI)
          Writes the content of the given document to an XMLStreamWriter as a mixed content element called "TextWithNodes".
static void writeTextWithNodes(Document doc, javax.xml.stream.XMLStreamWriter xsw, String namespaceURI)
          Write a TextWithNodes section containing nodes for all annotations in the given document.
static void writeXcesAnnotations(Collection<Annotation> annotations, OutputStream os, String encoding)
          Save annotations to the given output stream in XCES format, with their IDs included as the "n" attribute of each struct.
static void writeXcesAnnotations(Collection<Annotation> annotations, javax.xml.stream.XMLStreamWriter xsw)
          Save annotations to the given XMLStreamWriter in XCES format, with their IDs included as the "n" attribute of each struct.
static void writeXcesAnnotations(Collection<Annotation> annotations, javax.xml.stream.XMLStreamWriter xsw, boolean includeId)
          Save annotations to the given XMLStreamWriter in XCES format.
static void writeXcesContent(Document doc, OutputStream out, String encoding)
          Save the content of a document to the given output stream.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

INVALID_CHARACTER_REPLACEMENT

public static final char INVALID_CHARACTER_REPLACEMENT
The char used to replace characters in text content that are illegal in XML.

See Also:
Constant Field Values
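
As a rough illustration of the replacement described above, the following sketch substitutes a space for every character outside the XML 1.0 "Char" production. It uses only the JDK; the class and method names are hypothetical, and this is not taken from the GATE source, which may differ in detail.

```java
class XmlCharSanitizer {
  // Assumed stand-in for INVALID_CHARACTER_REPLACEMENT (this page describes it as a space).
  static final char REPLACEMENT = ' ';

  /** True if the code point is allowed by the XML 1.0 "Char" production. */
  static boolean isLegalXmlChar(int cp) {
    return cp == 0x9 || cp == 0xA || cp == 0xD
        || (cp >= 0x20 && cp <= 0xD7FF)
        || (cp >= 0xE000 && cp <= 0xFFFD)
        || (cp >= 0x10000 && cp <= 0x10FFFF);
  }

  /** Returns a copy of text with every illegal character replaced by REPLACEMENT. */
  static String replaceIllegal(String text) {
    StringBuilder sb = new StringBuilder(text.length());
    for (int i = 0; i < text.length(); ) {
      int cp = text.codePointAt(i);
      if (isLegalXmlChar(cp)) {
        sb.appendCodePoint(cp);
      } else {
        sb.append(REPLACEMENT);
      }
      i += Character.charCount(cp);
    }
    return sb.toString();
  }

  public static void main(String[] args) {
    System.out.println(replaceIllegal("a\u0001b")); // prints "a b"
  }
}
```

Every illegal character is a single Java char, so the replacement keeps the string length, and therefore any character offsets into it, unchanged.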

LT_THRESHOLD

public static final int LT_THRESHOLD
The number of < signs after which we encode a string using CDATA rather than writeCharacters.

See Also:
Constant Field Values
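
The choice between CDATA and writeCharacters might look roughly like the sketch below. The threshold is passed in as a parameter because this page does not state the value of LT_THRESHOLD. Treating "after which" as "strictly more than", and falling back to writeCharacters when the text contains "]]>", are assumptions of this sketch rather than documented behaviour.

```java
import java.io.StringWriter;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;

class CdataThresholdSketch {
  /** Counts the '<' characters in the given text. */
  static int countLessThan(String text) {
    int n = 0;
    for (int i = 0; i < text.length(); i++) {
      if (text.charAt(i) == '<') n++;
    }
    return n;
  }

  /** Assumed rule: use CDATA when there are more than threshold '<' signs. */
  static boolean useCdata(String text, int threshold) {
    // "]]>" cannot appear inside a single CDATA section, so escape instead.
    return countLessThan(text) > threshold && !text.contains("]]>");
  }

  static void writeText(XMLStreamWriter xsw, String text, int threshold)
      throws XMLStreamException {
    if (useCdata(text, threshold)) {
      xsw.writeCData(text);
    } else {
      xsw.writeCharacters(text); // escapes '<' as &lt;
    }
  }

  /** Renders text inside a hypothetical <t> element and returns the XML. */
  static String render(String text, int threshold) {
    try {
      StringWriter out = new StringWriter();
      XMLStreamWriter xsw = XMLOutputFactory.newInstance().createXMLStreamWriter(out);
      xsw.writeStartElement("t");
      writeText(xsw, text, threshold);
      xsw.writeEndElement();
      xsw.flush();
      xsw.close();
      return out.toString();
    } catch (XMLStreamException e) {
      throw new IllegalStateException(e);
    }
  }

  public static void main(String[] args) {
    System.out.println(render("a<b", 5));
    System.out.println(render("<<<", 2));
  }
}
```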

XCES_VERSION

public static final String XCES_VERSION
Version of XCES that this class can handle.

See Also:
Constant Field Values

XCES_NAMESPACE

public static final String XCES_NAMESPACE
XCES namespace URI.

See Also:
Constant Field Values

LONGEST_FIRST_OFFSET_COMPARATOR

public static final Comparator<Annotation> LONGEST_FIRST_OFFSET_COMPARATOR
Comparator that compares annotations based on their offsets; when two annotations start at the same location, the longer one is considered to come first in the ordering.
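
The ordering described above can be sketched with plain JDK comparators. Span is a hypothetical stand-in for an Annotation that holds only a start and end offset; the real comparator works on Annotation objects and may break remaining ties differently.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

class LongestFirstSketch {
  /** Hypothetical stand-in for an annotation: just a start and an end offset. */
  static final class Span {
    final long start;
    final long end;

    Span(long start, long end) {
      this.start = start;
      this.end = end;
    }

    @Override
    public String toString() {
      return "[" + start + "," + end + ")";
    }
  }

  // Ascending start offset; for equal starts, the larger end (longer span) first.
  static final Comparator<Span> LONGEST_FIRST =
      Comparator.comparingLong((Span s) -> s.start)
          .thenComparing(Comparator.comparingLong((Span s) -> s.end).reversed());

  /** Returns a sorted copy of the input list. */
  static List<Span> sorted(List<Span> spans) {
    List<Span> copy = new ArrayList<>(spans);
    copy.sort(LONGEST_FIRST);
    return copy;
  }

  public static void main(String[] args) {
    List<Span> spans = Arrays.asList(new Span(0, 3), new Span(2, 4), new Span(0, 10));
    System.out.println(sorted(spans)); // prints [[0,10), [0,3), [2,4)]
  }
}
```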

Constructor Detail

DocumentStaxUtils

public DocumentStaxUtils()
Method Detail

readGateXmlDocument

public static void readGateXmlDocument(javax.xml.stream.XMLStreamReader xsr,
                                       Document doc)
                                throws javax.xml.stream.XMLStreamException
Reads GATE XML format data from the given XMLStreamReader and puts the content and annotation sets into the given Document, replacing its current content. The reader must be positioned on the opening GateDocument tag (i.e. the last event was a START_ELEMENT for which getLocalName returns "GateDocument"), and when the method returns the reader will be left positioned on the corresponding closing tag.

Parameters:
xsr - the source of the XML to parse
doc - the document to update
Throws:
javax.xml.stream.XMLStreamException

readGateXmlDocument

public static void readGateXmlDocument(javax.xml.stream.XMLStreamReader xsr,
                                       Document doc,
                                       StatusListener statusListener)
                                throws javax.xml.stream.XMLStreamException
Reads GATE XML format data from the given XMLStreamReader and puts the content and annotation sets into the given Document, replacing its current content. The reader must be positioned on the opening GateDocument tag (i.e. the last event was a START_ELEMENT for which getLocalName returns "GateDocument"), and when the method returns the reader will be left positioned on the corresponding closing tag.

Parameters:
xsr - the source of the XML to parse
doc - the document to update
statusListener - optional status listener to receive status messages
Throws:
javax.xml.stream.XMLStreamException
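
Both overloads expect the reader to be positioned on the opening GateDocument tag already. A minimal sketch of that positioning step, using only the JDK StAX API, might look like this; the helper names are hypothetical, and the call to readGateXmlDocument is shown only as a comment because it needs the GATE library on the classpath.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

class GateDocumentPositioning {
  /**
   * Advances the reader until the last event is a START_ELEMENT whose
   * local name matches. Returns false if the input ends first.
   */
  static boolean advanceTo(XMLStreamReader xsr, String localName)
      throws XMLStreamException {
    while (xsr.hasNext()) {
      if (xsr.next() == XMLStreamConstants.START_ELEMENT
          && localName.equals(xsr.getLocalName())) {
        return true;
      }
    }
    return false;
  }

  /** Convenience wrapper for in-memory XML; reports whether the tag was found. */
  static boolean canPosition(String xml, String localName) {
    try {
      XMLStreamReader xsr =
          XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml));
      try {
        boolean found = advanceTo(xsr, localName);
        // With GATE available, this is where one would call:
        // DocumentStaxUtils.readGateXmlDocument(xsr, doc);
        return found;
      } finally {
        xsr.close();
      }
    } catch (XMLStreamException e) {
      throw new IllegalStateException(e);
    }
  }

  public static void main(String[] args) {
    String xml = "<?xml version=\"1.0\"?><GateDocument><TextWithNodes>hi</TextWithNodes></GateDocument>";
    System.out.println(canPosition(xml, "GateDocument")); // prints true
  }
}
```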

readAnnotationSet

public static Boolean readAnnotationSet(javax.xml.stream.XMLStreamReader xsr,
                                        AnnotationSet annotationSet,
                                        Map<Integer,Long> nodeIdToOffsetMap,
                                        Set<Integer> allAnnotIds,
                                        Boolean requireAnnotationIds)
                                 throws javax.xml.stream.XMLStreamException
Processes an AnnotationSet element from the given reader and fills the given annotation set with the corresponding annotations. The reader must initially be positioned on the starting AnnotationSet tag and will be left positioned on the corresponding closing tag.

Parameters:
xsr - the reader
annotationSet - the annotation set to fill.
nodeIdToOffsetMap - a map mapping node IDs (Integer) to their offsets in the text (Long). If null, we assume that the node ids and offsets are the same (useful if parsing an annotation set in isolation).
allAnnotIds - a set to contain all annotation IDs specified in the annotation set. It should initially be empty and will be updated if any of the annotations in this set specify an ID.
requireAnnotationIds - whether annotations are required to specify their IDs. If true, it is an error for an annotation to omit the Id attribute. If false, it is an error for the Id to be present. If null, we have not yet determined what style of XML this is.
Returns:
requireAnnotationIds. If the passed in value was null, and we have since determined what it should be, the updated value is returned.
Throws:
javax.xml.stream.XMLStreamException

readTextWithNodes

public static String readTextWithNodes(javax.xml.stream.XMLStreamReader xsr,
                                       Map<Integer,Long> nodeIdToOffsetMap)
                                throws javax.xml.stream.XMLStreamException
Processes the TextWithNodes element from this XMLStreamReader, returning the text content of the document. The supplied map is updated with the offset of each Node element encountered. The reader must be positioned on the starting TextWithNodes tag and will be returned positioned on the corresponding closing tag.

Parameters:
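xsr - the reader, positioned on the starting TextWithNodes tag
nodeIdToOffsetMap - the map to update with the offset of each Node element encountered
Returns:
the text content of the document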
Throws:
javax.xml.stream.XMLStreamException
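
The behaviour described for readTextWithNodes can be approximated with the JDK StAX API alone. The sketch below is not the GATE implementation. It assumes each Node element is empty and carries its ID in an un-namespaced "id" attribute, and it records the number of characters read so far as that node's offset.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

class TextWithNodesSketch {
  static final String SAMPLE =
      "<TextWithNodes><Node id=\"0\"/>Hello<Node id=\"5\"/> world<Node id=\"11\"/></TextWithNodes>";

  /** Reader must be on the opening TextWithNodes tag; it is left on the closing tag. */
  static String readText(XMLStreamReader xsr, Map<Integer, Long> nodeIdToOffset)
      throws XMLStreamException {
    StringBuilder text = new StringBuilder();
    while (xsr.hasNext()) {
      int event = xsr.next();
      if (event == XMLStreamConstants.START_ELEMENT && "Node".equals(xsr.getLocalName())) {
        // Offset of a node = number of characters seen before it.
        int id = Integer.parseInt(xsr.getAttributeValue(null, "id"));
        nodeIdToOffset.put(id, (long) text.length());
      } else if (event == XMLStreamConstants.CHARACTERS
          || event == XMLStreamConstants.CDATA
          || event == XMLStreamConstants.SPACE) {
        text.append(xsr.getText());
      } else if (event == XMLStreamConstants.END_ELEMENT
          && "TextWithNodes".equals(xsr.getLocalName())) {
        break;
      }
    }
    return text.toString();
  }

  /** Parses in-memory XML whose root element is TextWithNodes. */
  static String parse(String xml, Map<Integer, Long> nodeIdToOffset) {
    try {
      XMLStreamReader xsr =
          XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml));
      xsr.nextTag(); // move onto the opening TextWithNodes tag
      return readText(xsr, nodeIdToOffset);
    } catch (XMLStreamException e) {
      throw new IllegalStateException(e);
    }
  }

  /** Returns only the node ID to offset map for the given XML. */
  static Map<Integer, Long> offsets(String xml) {
    Map<Integer, Long> map = new LinkedHashMap<>();
    parse(xml, map);
    return map;
  }

  public static void main(String[] args) {
    Map<Integer, Long> map = new LinkedHashMap<>();
    System.out.println(parse(SAMPLE, map)); // prints Hello world
    System.out.println(map);                // prints {0=0, 5=5, 11=11}
  }
}
```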

readFeatureMap

public static FeatureMap readFeatureMap(javax.xml.stream.XMLStreamReader xsr)
                                 throws javax.xml.stream.XMLStreamException
Processes a GateDocumentFeatures or Annotation element to build a feature map. The element is expected to contain Feature children, each with a Name and Value. The reader will be returned positioned on the closing GateDocumentFeatures or Annotation tag.

Parameters:
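xsr - the reader, positioned on the opening GateDocumentFeatures or Annotation tag
Returns:
the feature map built from the Feature children of the element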
Throws:
javax.xml.stream.XMLStreamException

readXces

public static void readXces(InputStream is,
                            AnnotationSet as)
                     throws javax.xml.stream.XMLStreamException
Read XML data in XCES format from the given stream and add the corresponding annotations to the given annotation set. This method does not close the stream; that is the responsibility of the caller.

Parameters:
is - the input stream to read from, which will not be closed before returning.
as - the annotation set to read into.
Throws:
javax.xml.stream.XMLStreamException

readXces

public static void readXces(javax.xml.stream.XMLStreamReader xsr,
                            AnnotationSet as)
                     throws javax.xml.stream.XMLStreamException
Read XML data in XCES format from the given reader and add the corresponding annotations to the given annotation set. The reader must be positioned on the starting cesAna tag and will be left pointing to the corresponding end tag.

Parameters:
xsr - the XMLStreamReader to read from.
as - the annotation set to read into.
Throws:
javax.xml.stream.XMLStreamException

readXcesFeatureMap

public static FeatureMap readXcesFeatureMap(javax.xml.stream.XMLStreamReader xsr)
                                     throws javax.xml.stream.XMLStreamException
Processes a struct element to build a feature map. The element is expected to contain feat children, each with name and value attributes. The reader will be returned positioned on the closing struct tag.

Parameters:
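xsr - the reader, positioned on the opening struct tag
Returns:
the feature map built from the feat children of the struct element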
Throws:
javax.xml.stream.XMLStreamException
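
For illustration, reading feat children from a struct element could be sketched as follows with the JDK StAX API. A plain Map<String,String> stands in for FeatureMap, the struct is assumed not to contain nested struct elements, and the "type" attribute in the sample input is only an assumed example.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

class XcesFeatSketch {
  /** Reader must be on the opening struct tag; it is left on the closing struct tag. */
  static Map<String, String> readFeats(XMLStreamReader xsr) throws XMLStreamException {
    Map<String, String> feats = new LinkedHashMap<>();
    while (xsr.hasNext()) {
      int event = xsr.next();
      if (event == XMLStreamConstants.START_ELEMENT && "feat".equals(xsr.getLocalName())) {
        // Each feat carries its data in "name" and "value" attributes.
        feats.put(xsr.getAttributeValue(null, "name"), xsr.getAttributeValue(null, "value"));
      } else if (event == XMLStreamConstants.END_ELEMENT
          && "struct".equals(xsr.getLocalName())) {
        break;
      }
    }
    return feats;
  }

  /** Parses in-memory XML whose root element is a single struct. */
  static Map<String, String> parse(String xml) {
    try {
      XMLStreamReader xsr =
          XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml));
      xsr.nextTag(); // move onto the opening struct tag
      return readFeats(xsr);
    } catch (XMLStreamException e) {
      throw new IllegalStateException(e);
    }
  }

  public static void main(String[] args) {
    String xml = "<struct type=\"Token\"><feat name=\"kind\" value=\"word\"/>"
        + "<feat name=\"length\" value=\"5\"/></struct>";
    System.out.println(parse(xml)); // prints {kind=word, length=5}
  }
}
```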

toXml

public static String toXml(Document doc)
Returns a string containing the specified document in GATE XML format.

Parameters:
doc - the document

writeDocument

public static void writeDocument(Document doc,
                                 File file)
                          throws javax.xml.stream.XMLStreamException,
                                 IOException
Write the specified GATE document to a File.

Parameters:
doc - the document to write
file - the file to write it to
Throws:
javax.xml.stream.XMLStreamException
IOException

writeDocument

public static void writeDocument(Document doc,
                                 File file,
                                 String namespaceURI)
                          throws javax.xml.stream.XMLStreamException,
                                 IOException
Write the specified GATE document to a File, optionally putting the XML in a namespace.

Parameters:
doc - the document to write
file - the file to write it to
namespaceURI - the namespace URI to use for the XML elements. Must not be null, but can be the empty string if no namespace is desired.
Throws:
javax.xml.stream.XMLStreamException
IOException

writeDocument

public static void writeDocument(Document doc,
                                 Map<String,Collection<Annotation>> annotationSets,
                                 javax.xml.stream.XMLStreamWriter xsw,
                                 String namespaceURI)
                          throws javax.xml.stream.XMLStreamException
Write the specified GATE Document to an XMLStreamWriter. This method writes just the GateDocument element - the XML declaration must be filled in by the caller if required.

Parameters:
doc - the Document to write
annotationSets - the annotations to include. If the map contains an entry for the key null, this will be treated as the default set. All other entries are treated as named annotation sets.
xsw - the StAX XMLStreamWriter to use for output
Throws:
GateException - if an error occurs during writing
javax.xml.stream.XMLStreamException

writeDocument

public static void writeDocument(Document doc,
                                 javax.xml.stream.XMLStreamWriter xsw,
                                 String namespaceURI)
                          throws javax.xml.stream.XMLStreamException
Write the specified GATE Document to an XMLStreamWriter. This method writes just the GateDocument element - the XML declaration must be filled in by the caller if required. This method writes all the annotations in all the annotation sets on the document. To write just specific annotations, use writeDocument(Document, Map, XMLStreamWriter, String).

Throws:
javax.xml.stream.XMLStreamException
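
Because the XMLStreamWriter overloads emit only the GateDocument element, the caller writes the XML declaration around it. A minimal sketch of that wrapping with the JDK StAX API follows. The empty GateDocument element is only a placeholder for the real writeDocument call, which needs the GATE library and a Document instance.

```java
import java.io.StringWriter;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;

class XmlDeclarationSketch {
  /** Writes an XML declaration followed by a placeholder root element. */
  static String writeWithDeclaration() {
    try {
      StringWriter out = new StringWriter();
      XMLStreamWriter xsw = XMLOutputFactory.newInstance().createXMLStreamWriter(out);
      xsw.writeStartDocument("1.0"); // the caller supplies the XML declaration
      // With GATE available, replace the placeholder below with:
      // DocumentStaxUtils.writeDocument(doc, xsw, "");
      xsw.writeEmptyElement("GateDocument");
      xsw.writeEndDocument();
      xsw.flush();
      xsw.close();
      return out.toString();
    } catch (XMLStreamException e) {
      throw new IllegalStateException(e);
    }
  }

  public static void main(String[] args) {
    System.out.println(writeWithDeclaration());
  }
}
```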

writeAnnotationSet

public static void writeAnnotationSet(AnnotationSet annotations,
                                      javax.xml.stream.XMLStreamWriter xsw,
                                      String namespaceURI)
                               throws javax.xml.stream.XMLStreamException
Writes the given annotation set to an XMLStreamWriter as GATE XML format. The Name attribute of the generated AnnotationSet element is set to the default value, i.e. annotations.getName().

Parameters:
annotations - the annotation set to write
xsw - the writer to use for output
namespaceURI - the namespace URI to use for the generated XML elements
Throws:
javax.xml.stream.XMLStreamException

writeAnnotationSet

public static void writeAnnotationSet(Collection<Annotation> annotations,
                                      String asName,
                                      javax.xml.stream.XMLStreamWriter xsw,
                                      String namespaceURI)
                               throws javax.xml.stream.XMLStreamException
Writes the given annotation set to an XMLStreamWriter as GATE XML format. The value for the Name attribute of the generated AnnotationSet element is given by asName.

Parameters:
annotations - the annotation set to write
asName - the name under which to write the annotation set. null means that no name will be used.
xsw - the writer to use for output
namespaceURI - the namespace URI to use for the generated XML elements
Throws:
javax.xml.stream.XMLStreamException

writeAnnotationSet

public static void writeAnnotationSet(AnnotationSet annotations,
                                      String asName,
                                      javax.xml.stream.XMLStreamWriter xsw,
                                      String namespaceURI)
                               throws javax.xml.stream.XMLStreamException
Retained for binary compatibility; new code should call the Collection<Annotation> version instead.

Throws:
javax.xml.stream.XMLStreamException

writeTextWithNodes

public static void writeTextWithNodes(Document doc,
                                      Collection<Collection<Annotation>> annotationSets,
                                      javax.xml.stream.XMLStreamWriter xsw,
                                      String namespaceURI)
                               throws javax.xml.stream.XMLStreamException
Writes the content of the given document to an XMLStreamWriter as a mixed content element called "TextWithNodes". At each point where there is the start or end of an annotation in any of the given annotation sets, a "Node" element is written with an "id" attribute whose value is the offset of that node.

Parameters:
doc - the document whose content is to be written
annotationSets - the annotations for which nodes are required. This is a collection of collections.
xsw - the XMLStreamWriter to write to.
namespaceURI - the namespace URI. May be empty but may not be null.
Throws:
javax.xml.stream.XMLStreamException

writeTextWithNodes

public static void writeTextWithNodes(Document doc,
                                      javax.xml.stream.XMLStreamWriter xsw,
                                      String namespaceURI)
                               throws javax.xml.stream.XMLStreamException
Write a TextWithNodes section containing nodes for all annotations in the given document.

Throws:
javax.xml.stream.XMLStreamException
See Also:
writeTextWithNodes(Document, Collection, XMLStreamWriter, String)

writeFeatures

public static void writeFeatures(FeatureMap features,
                                 javax.xml.stream.XMLStreamWriter xsw,
                                 String namespaceURI)
                          throws javax.xml.stream.XMLStreamException
Write a feature map to the given XMLStreamWriter. The map is output as a sequence of "Feature" elements, each having "Name" and "Value" children. Note that there is no enclosing element - the caller must write the enclosing "GateDocumentFeatures" or "Annotation" element. Characters in feature values that are illegal in XML are replaced by INVALID_CHARACTER_REPLACEMENT (a space). Feature names are not modified - an illegal character in a feature name will cause the serialization to fail.

Parameters:
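features - the feature map to write
xsw - the XMLStreamWriter to write to
namespaceURI - the namespace URI to use for the generated XML elements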
Throws:
javax.xml.stream.XMLStreamException
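
The element layout described for writeFeatures can be sketched with the JDK StAX API. This is a simplified illustration, not the GATE implementation: a Map<String,String> stands in for FeatureMap, no namespace is used, and the real output may carry additional attributes that this page does not describe. The enclosing Annotation element is written by the caller, as the description requires.

```java
import java.io.StringWriter;
import java.util.Collections;
import java.util.Map;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;

class FeatureWriterSketch {
  /** Writes one Feature element, with Name and Value children, per map entry. */
  static void writeFeatures(Map<String, String> features, XMLStreamWriter xsw)
      throws XMLStreamException {
    for (Map.Entry<String, String> e : features.entrySet()) {
      xsw.writeStartElement("Feature");
      xsw.writeStartElement("Name");
      xsw.writeCharacters(e.getKey());
      xsw.writeEndElement();
      xsw.writeStartElement("Value");
      xsw.writeCharacters(e.getValue());
      xsw.writeEndElement();
      xsw.writeEndElement(); // closes Feature
    }
  }

  /** Wraps the features in a caller-written Annotation element and returns the XML. */
  static String render(Map<String, String> features) {
    try {
      StringWriter out = new StringWriter();
      XMLStreamWriter xsw = XMLOutputFactory.newInstance().createXMLStreamWriter(out);
      xsw.writeStartElement("Annotation"); // enclosing element is the caller's job
      writeFeatures(features, xsw);
      xsw.writeEndElement();
      xsw.flush();
      xsw.close();
      return out.toString();
    } catch (XMLStreamException e) {
      throw new IllegalStateException(e);
    }
  }

  public static void main(String[] args) {
    System.out.println(render(Collections.singletonMap("kind", "word")));
  }
}
```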

writeXcesContent

public static void writeXcesContent(Document doc,
                                    OutputStream out,
                                    String encoding)
                             throws IOException
Save the content of a document to the given output stream. Since XCES content files are plain text (not XML), XML-illegal characters are not replaced when writing. The stream is not closed by this method; that is left to the caller.

Parameters:
doc - the document to save
out - the stream to write to
encoding - the character encoding to use. If null, defaults to UTF-8
Throws:
IOException

writeXcesAnnotations

public static void writeXcesAnnotations(Collection<Annotation> annotations,
                                        OutputStream os,
                                        String encoding)
                                 throws javax.xml.stream.XMLStreamException
Save annotations to the given output stream in XCES format, with their IDs included as the "n" attribute of each struct. The stream is not closed by this method; that is left to the caller.

Parameters:
annotations - the annotations to save, typically an AnnotationSet
os - the output stream to write to
encoding - the character encoding to use.
Throws:
javax.xml.stream.XMLStreamException

writeXcesAnnotations

public static void writeXcesAnnotations(Collection<Annotation> annotations,
                                        javax.xml.stream.XMLStreamWriter xsw)
                                 throws javax.xml.stream.XMLStreamException
Save annotations to the given XMLStreamWriter in XCES format, with their IDs included as the "n" attribute of each struct. The writer is not closed by this method; that is left to the caller. This method writes just the cesAna element - the XML declaration must be filled in by the caller if required.

Parameters:
annotations - the annotations to save, typically an AnnotationSet
xsw - the XMLStreamWriter to write to
Throws:
javax.xml.stream.XMLStreamException

writeXcesAnnotations

public static void writeXcesAnnotations(Collection<Annotation> annotations,
                                        javax.xml.stream.XMLStreamWriter xsw,
                                        boolean includeId)
                                 throws javax.xml.stream.XMLStreamException
Save annotations to the given XMLStreamWriter in XCES format. The writer is not closed by this method; that is left to the caller. This method writes just the cesAna element - the XML declaration must be filled in by the caller if required. Characters in feature values that are illegal in XML are replaced by INVALID_CHARACTER_REPLACEMENT (a space). Feature names are not modified, nor are annotation types - an illegal character in one of these will cause the serialization to fail.

Parameters:
annotations - the annotations to save, typically an AnnotationSet
xsw - the XMLStreamWriter to write to
includeId - should we include the annotation IDs (as the "n" attribute on each struct)?
Throws:
javax.xml.stream.XMLStreamException