net.sansa_stack.hadoop.generic

Class RecordReaderGenericBase<U,G,A,T>

    • Field Detail

      • minRecordLengthKey

        protected final String minRecordLengthKey
      • maxRecordLengthKey

        protected final String maxRecordLengthKey
      • probeRecordCountKey

        protected final String probeRecordCountKey
      • headerBytesKey

        protected final String headerBytesKey
      • baseIriKey

        protected final String baseIriKey
      • recordStartPattern

        protected final Pattern recordStartPattern
        Regex pattern used to search for candidate record starts, so that the actual parser (which may start a new thread) does not have to be invoked on every single character.
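
        For illustration, a line-oriented format might use a pattern like the following to identify candidate record starts (this pattern is an assumption made for the example, not the one used by any concrete SANSA reader):

            // Hypothetical candidate pattern: the start of any non-comment line.
            // MULTILINE makes '^' match after every line break.
            Pattern candidateStart = Pattern.compile("^(?!#).", Pattern.MULTILINE);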
      • baseIri

        protected String baseIri
      • maxRecordLength

        protected long maxRecordLength
      • minRecordLength

        protected long minRecordLength
      • probeRecordCount

        protected int probeRecordCount
      • currentValue

        protected T currentValue
      • datasetFlow

        protected Iterator<T> datasetFlow
      • decompressor

        protected org.apache.hadoop.io.compress.Decompressor decompressor
      • split

        protected org.apache.hadoop.mapreduce.lib.input.FileSplit split
      • codec

        protected org.apache.hadoop.io.compress.CompressionCodec codec
      • prefixBytes

        protected byte[] prefixBytes
      • rawStream

        protected org.apache.hadoop.fs.FSDataInputStream rawStream
      • isEncoded

        protected boolean isEncoded
      • splitStart

        protected long splitStart
      • splitLength

        protected long splitLength
      • splitEnd

        protected long splitEnd
    • Method Detail

      • initialize

        public void initialize(org.apache.hadoop.mapreduce.InputSplit inputSplit,
                               org.apache.hadoop.mapreduce.TaskAttemptContext context)
                        throws IOException
        Reads out the configuration parameters (prefixes, length thresholds, ...) and examines the codec in order to set an internal flag indicating whether the stream is encoded, as sketched below.
        Specified by:
        initialize in class org.apache.hadoop.mapreduce.RecordReader<org.apache.hadoop.io.LongWritable,T>
        Throws:
        IOException
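
        A minimal sketch of the kind of setup performed here, assuming Hadoop's standard Configuration and CompressionCodecFactory APIs (the default values shown are made up):

            org.apache.hadoop.conf.Configuration conf = context.getConfiguration();
            // Read the length thresholds and probe count (defaults are illustrative).
            minRecordLength = conf.getLong(minRecordLengthKey, 1L);
            maxRecordLength = conf.getLong(maxRecordLengthKey, 1024L * 1024L);
            probeRecordCount = conf.getInt(probeRecordCountKey, 10);
            baseIri = conf.get(baseIriKey);

            split = (org.apache.hadoop.mapreduce.lib.input.FileSplit) inputSplit;
            // A codec matching the file's extension implies an encoded stream.
            codec = new org.apache.hadoop.io.compress.CompressionCodecFactory(conf)
                    .getCodec(split.getPath());
            isEncoded = codec != null;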
      • parse

        protected abstract io.reactivex.rxjava3.core.Flowable<U> parse(Callable<InputStream> inputStreamSupplier)
        Creates a Flowable from the input stream. The input stream may be incorrectly positioned, in which case the Flowable is expected to indicate this by raising an error event. A sketch of a possible implementation follows below.
        Parameters:
        inputStreamSupplier - A supplier of input streams. It may supply the same underlying stream on each call, hence at most a single stream should be taken from the supplier. Supplied streams are safe to use in try-with-resources blocks (possibly using CloseShieldInputStream). Taken streams should be closed by the client code.
        Returns:
        A Flowable over the items parsed from the supplied stream
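
        As an illustration, a hypothetical subclass with U = String could parse line records as follows (a sketch only; the real RDF-based subclasses delegate to an actual parser, which raises an error event on malformed input):

            @Override
            protected io.reactivex.rxjava3.core.Flowable<String> parse(
                    Callable<InputStream> inputStreamSupplier) {
                // Take at most one stream from the supplier and tie its lifetime
                // to the Flowable; a mispositioned stream surfaces as an error event.
                return io.reactivex.rxjava3.core.Flowable.using(
                        inputStreamSupplier::call,
                        in -> {
                            Iterable<String> lines = () -> new java.io.BufferedReader(
                                    new java.io.InputStreamReader(in)).lines().iterator();
                            return io.reactivex.rxjava3.core.Flowable.fromIterable(lines);
                        },
                        java.io.InputStream::close);
            }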
      • setStreamToInterval

        public Map.Entry<Long,Long> setStreamToInterval(long start,
                                                        long end)
                                                 throws IOException
        Seeks to a given offset and prepares to read up to the 'end' position (exclusive). For non-encoded streams this just performs a seek on the stream and returns start/end unchanged. For encoded streams the seek is adjusted to the data part that follows any header information; see the usage sketch below.
        Parameters:
        start - The absolute start offset to seek to
        end - The absolute end position (exclusive)
        Returns:
        The possibly adjusted (start, end) interval
        Throws:
        IOException
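
        A usage sketch:

            // Position the stream at this split; encoded streams may shift the interval.
            Map.Entry<Long, Long> adjusted = setStreamToInterval(splitStart, splitEnd);
            long effectiveStart = adjusted.getKey();   // equals splitStart for non-encoded streams
            long effectiveEnd = adjusted.getValue();   // exclusive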
      • logClose

        public static void logClose(String str)
      • logUnexpectedClose

        public static void logUnexpectedClose(String str)
      • aggregate

        protected io.reactivex.rxjava3.core.Flowable<T> aggregate(boolean isFirstSplit,
                                                                  io.reactivex.rxjava3.core.Flowable<U> splitFlow,
                                                                  List<U> tailItems)
        Modifies a flow to perform aggregation of items into records according to the specification. The complex part here is to correctly combine the two flows:
        - The first group of the split's aggregate flow needs to be skipped, as it is handled by the previous split's processor.
        - If there are no further groups in splitFlow then no items are emitted at all (because all items belong to the previous split).
        - ONLY if splitFlow owned at least one group is the first group of the tail flow emitted.
        A worked example follows after the parameter list below.
        Parameters:
        isFirstSplit - If true then the first record is included in the output; otherwise it is skipped
        splitFlow - The flow of items obtained from the split
        tailItems - The first set of group items taken from the next split
        Returns:
        The flow of records aggregated from this split's items
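
        A worked example of these rules, with illustrative group names:

            // Suppose the items of this split aggregate into groups [g0, g1, g2]
            // and the first group formed from the next split's items is g3.
            //
            //   isFirstSplit == true  -> emit g0, g1, g2, g3
            //   isFirstSplit == false -> emit     g1, g2, g3  (g0 is completed by
            //                                                  the previous split)
            //
            // If the split yields only g0 and isFirstSplit == false, nothing is
            // emitted at all: g0 belongs to the previous split, and the tail
            // group g3 is only emitted by a split that owned at least one group.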
      • createRecordFlow

        protected io.reactivex.rxjava3.core.Flowable<T> createRecordFlow()
                                                                  throws IOException
        Throws:
        IOException
      • findNextRecord

        public static long findNextRecord(Pattern recordSearchPattern,
                                          org.aksw.jena_sparql_api.io.binseach.Seekable nav,
                                          long splitStart,
                                          long absProbeRegionStart,
                                          long maxRecordLength,
                                          long absDataRegionEnd,
                                          Predicate<Long> posValidator,
                                          Predicate<org.aksw.jena_sparql_api.io.binseach.Seekable> prober)
                                   throws IOException
        Parameters:
        nav -
        splitStart -
        absProbeRegionStart -
        maxRecordLength -
        absDataRegionEnd -
        posValidator -
        prober -
        Returns:
        Throws:
        IOException
      • findFirstPositionWithProbeSuccess

        public static long findFirstPositionWithProbeSuccess(org.aksw.jena_sparql_api.io.binseach.Seekable rawSeekable,
                                                             Predicate<Long> posValidator,
                                                             Matcher m,
                                                             boolean isFwd,
                                                             Predicate<org.aksw.jena_sparql_api.io.binseach.Seekable> prober)
                                                      throws IOException
        Uses the matcher to find candidate probing positions, and returns the first position where probing succeeds. Matching ranges are part of the matcher configuration; a call sketch follows after the parameter list below.
        Parameters:
        rawSeekable -
        posValidator - Test whether the seekable's absolute position is a valid start point. Used to prevent testing start points past a split boundary with unknown split lengths.
        m -
        isFwd -
        prober -
        Returns:
        The first position at which probing succeeded
        Throws:
        IOException
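
        A hedged call sketch (the char view, validator, and prober shown are assumptions made for illustration):

            // Search forward for the first match of recordStartPattern at which
            // probing also succeeds. 'charSequence' is assumed to be a character
            // view over the probed region of 'seekable'.
            Matcher m = recordStartPattern.matcher(charSequence);
            long pos = findFirstPositionWithProbeSuccess(
                    seekable,
                    candidatePos -> candidatePos < splitEnd, // no starts past the split boundary
                    m,
                    true,                                    // isFwd: search forward
                    seek -> tryParseSomeRecords(seek));      // hypothetical probe helper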
      • nextKeyValue

        public boolean nextKeyValue()
                             throws IOException
        Specified by:
        nextKeyValue in class org.apache.hadoop.mapreduce.RecordReader<org.apache.hadoop.io.LongWritable,T>
        Throws:
        IOException
      • getCurrentKey

        public org.apache.hadoop.io.LongWritable getCurrentKey()
        Specified by:
        getCurrentKey in class org.apache.hadoop.mapreduce.RecordReader<org.apache.hadoop.io.LongWritable,T>
      • getCurrentValue

        public T getCurrentValue()
        Specified by:
        getCurrentValue in class org.apache.hadoop.mapreduce.RecordReader<org.apache.hadoop.io.LongWritable,T>
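
        Together with nextKeyValue and getCurrentKey, this follows the standard Hadoop RecordReader protocol, which a caller (normally the framework itself) drives like this:

            // Typical RecordReader consumption loop; 'reader', 'split' and
            // 'context' are assumed to be in scope.
            reader.initialize(split, context);
            while (reader.nextKeyValue()) {
                org.apache.hadoop.io.LongWritable key = reader.getCurrentKey();
                T record = reader.getCurrentValue();
                // ... process the record ...
            }
            reader.close();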
      • getProgress

        public float getProgress()
                          throws IOException
        Specified by:
        getProgress in class org.apache.hadoop.mapreduce.RecordReader<org.apache.hadoop.io.LongWritable,T>
        Throws:
        IOException

Copyright © 2016–2021 Smart Data Analytics (SDA) Research Group. All rights reserved.