Package net.sansa_stack.hadoop.core
Class RecordReaderGenericBase<U,G,A,T>
java.lang.Object
org.apache.hadoop.mapreduce.RecordReader<org.apache.hadoop.io.LongWritable,T>
net.sansa_stack.hadoop.core.RecordReaderGenericBase<U,G,A,T>
- Type Parameters:
U - The element type
G - The group key type - elements are mapped to group keys
A - The accumulator type (some collection type such as List, Graph, etc.)
T - The record type - maps an accumulator to a final value
- All Implemented Interfaces:
Closeable, AutoCloseable
- Direct Known Subclasses:
RecordReaderCsv, RecordReaderCsvUnivocity, RecordReaderGenericRdfBase, RecordReaderJsonArray, RecordReaderJsonSequence
public abstract class RecordReaderGenericBase<U,G,A,T>
extends org.apache.hadoop.mapreduce.RecordReader<org.apache.hadoop.io.LongWritable,T>
A generic record reader that uses a callback mechanism to detect a consecutive sequence of records
that must start in the current split and which may extend over any number of successor splits.
The callback that needs to be implemented is
parse(InputStream, boolean).
This method receives a supplier of input streams and must return an RxJava Flowable over such an
input stream.
Note that in contrast to Java Streams, RxJava Flowables idiomatically
support error propagation, cancellation and resource management.
The RecordReaderGenericBase framework will attempt to read a configurable number of records from the
flowable in order to decide whether a position on the input stream is a valid start of records.
If for a probe position the flowable yields an error or zero records then probing continues at a higher
offset until the end of data (or the MAX_RECORD_SIZE limit) is reached.
This implementation can therefore handle data skews where
sometimes large data blocks span multiple splits. E.g. a mix of 1 million graphs of 1 triple
and 1 graph of 1 million triples.
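The probing strategy described above can be illustrated in isolation. The following is a minimal, self-contained sketch; the method name `findRecordStart` and the `LongPredicate` standing in for the actual prober are hypothetical, not part of this API:

```java
import java.util.function.LongPredicate;

public class ProbingSketch {
    /**
     * Conceptual probing loop: starting from 'splitStart', advance a candidate
     * offset until 'tryParse' succeeds (i.e. the parser yields the requested
     * number of records without error) or the search limit is exceeded.
     * Returns the first valid record start, or -1 if none is found.
     */
    static long findRecordStart(long splitStart, long maxRecordSize, LongPredicate tryParse) {
        long limit = splitStart + maxRecordSize;
        for (long offset = splitStart; offset < limit; offset++) {
            if (tryParse.test(offset)) { // parser yielded enough records at this offset
                return offset;
            }
            // error or zero records: continue probing at a higher offset
        }
        return -1; // no valid record start within the maximum record size
    }

    public static void main(String[] args) {
        // Toy data: pretend records start at every offset divisible by 4.
        long start = findRecordStart(5, 100, offset -> offset % 4 == 0);
        System.out.println(start); // first multiple of 4 at or after 5, i.e. 8
    }
}
```

The real implementation additionally narrows the candidate offsets with a regex matcher (see the recordStartPattern field) before invoking the parser.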
Each split is separated into a head, body and tail region.
- The tail region always extends beyond the current split's end, up to the starting position of the third record in the successor split. The first record of the successor split may actually be a continuation of a record on this split: if you consider two quads separated by the split boundary, such as ":g :s :p :o |splitboundary| :g :x :y :z", then the first record after the boundary still uses the graph :g and thus belongs to the graph record started in the current split.
- Likewise, the head region - with one exception - always starts at the third record of a split, because, as mentioned, the first record may belong to the prior split. The exception is the first split (identified by an absolute starting position of 0), whose head region starts at the beginning of the split. The length of the head region depends on the amount of data that was considered for verifying that the starting position is a valid record start.
- The body region starts immediately after the head region and extends up to the split boundary
- We first buffer the tail region and then the head region. As a consequence, after buffering the head region the underlying stream is positioned at the start of the body region
- The effective input stream comprises the buffer of the head region, the 'live' stream of the body region (up to the split boundary), followed by the tail region. This is then passed to the RDF parser, which for valid input data is able to parse all records in a single pass without errors or interruptions.
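The splicing of the three regions into one effective stream can be sketched with standard Java alone. The helper `effectiveStream` below is hypothetical; the actual implementation additionally handles pre/postamble bytes, compression codecs and the live positioning of the underlying stream:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.SequenceInputStream;
import java.nio.charset.StandardCharsets;
import java.util.Collections;
import java.util.List;

public class EffectiveStreamSketch {
    /**
     * Concatenate the buffered head region, the 'live' body stream and the
     * buffered tail region into one effective stream, mirroring what the
     * record reader hands to the parser.
     */
    static InputStream effectiveStream(byte[] head, InputStream body, byte[] tail) {
        return new SequenceInputStream(Collections.enumeration(List.of(
                new ByteArrayInputStream(head),
                body,
                new ByteArrayInputStream(tail))));
    }

    public static void main(String[] args) throws IOException {
        byte[] head = "head-record\n".getBytes(StandardCharsets.UTF_8);
        InputStream body = new ByteArrayInputStream("body-record\n".getBytes(StandardCharsets.UTF_8));
        byte[] tail = "tail-record\n".getBytes(StandardCharsets.UTF_8);
        String all = new String(effectiveStream(head, body, tail).readAllBytes(), StandardCharsets.UTF_8);
        System.out.print(all); // head-record, body-record, tail-record, in order
    }
}
```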
- Author:
- Claus Stadler, Lorenz Buehmann
-
Nested Class Summary
Nested Classes
static class - Remove buffering from a channel. -
Field Summary
Fields
protected final Accumulating<U,G,A,T>
protected org.apache.hadoop.io.compress.CompressionCodec
protected AtomicLong
protected T
protected org.apache.hadoop.io.compress.Decompressor
static final byte[]
protected boolean
protected boolean
protected boolean
protected long
protected long
protected final String
protected long
protected final String
protected byte[]
protected byte[] - Subclasses may initialize the pre/post-amble bytes in the initialize(InputSplit, TaskAttemptContext) method rather than the ctor! A (possibly empty) sequence of bytes to be prepended to any stream passed to the parser.
protected int
protected final String
protected int
protected final String
protected org.apache.hadoop.fs.FSDataInputStream
protected Runnable
protected CustomPattern - Regex pattern to search for candidate record starts, used to avoid having to invoke the actual parser (which may start a new thread) on each single character
protected Boolean
protected Boolean
protected int
protected org.apache.hadoop.mapreduce.lib.input.FileSplit
protected long
protected String
protected long
protected String
protected long
protected org.aksw.commons.io.hadoop.SeekableInputStream
protected org.aksw.commons.io.buffer.array.BufferOverReadableChannel<byte[]>
protected long
protected org.aksw.commons.io.buffer.array.BufferOverReadableChannel<U[]>
protected long
protected net.sansa_stack.hadoop.core.ProbeResult
protected long
protected long -
Constructor Summary
Constructors
RecordReaderGenericBase(RecordReaderConf conf, Accumulating<U,G,A,T> accumulating) -
Method Summary
Methods
static String abbreviate(InputStreamReader reader, int maxWidth, String abbrevMarker)
static String abbreviate(InputStream in, Charset charset, int maxWidth, String abbrevMarker)
static String abbreviateAsUTF8(InputStream in, int maxWidth, String abbrevMarker)
aggregate(…) - Modify a flow to perform aggregation of items into records according to the specification. The complex part here is to correctly combine the two flows: the first group of the splitAggregateFlow needs to be skipped, as this is handled by the previous split's processor; if there are no further groups in splitFlow then no items are emitted at all (because all items belong to a previous split); ONLY if the splitFlow owned at least one group, the first group in the tailFlow needs to be emitted.
static <T,G,A,U> Stream<U> aggregate(Stream<T> eltStream, Accumulating<T,G,A,U> accumulating) - Direct aggregation of a stream via an accumulating instance
void close()
static ProbeStats convert(org.apache.jena.rdf.model.Resource tgt, net.sansa_stack.hadoop.core.ProbeResult src)
protected static MatcherFactory createMatcherFactory(CustomPattern recordSearchPattern, int relProbeRegionEnd, LongPredicate matcherAbsReadPosValidator)
protected void detectTail(org.aksw.commons.io.buffer.array.BufferOverReadableChannel<byte[]> tailByteBuffer) - Try to detect the nth record offset in the subsequent split.
boolean didHitSplitBound(org.apache.hadoop.fs.Seekable seekable, long splitPos)
protected InputStream effectiveInputStream(…) - This method is meant for overriding the input stream
protected InputStream effectiveInputStreamSupp(org.aksw.commons.io.input.ReadableChannel<byte[]> seekable)
static <U> net.sansa_stack.hadoop.core.ProbeResult findFirstPositionWithProbeSuccess(org.aksw.commons.io.input.SeekableReadableChannel<byte[]> rawSeekable, LongPredicate posValidator, MatcherFactory matcherFactory, boolean isFwd, org.aksw.commons.io.buffer.array.BufferOverReadableChannel<U[]> outBuffer, Prober<U> prober) - Uses the matcher to find candidate probing positions, and returns the first position where probing succeeds.
static <U> net.sansa_stack.hadoop.core.ProbeResult findNextRegion(CustomPattern recordSearchPattern, org.aksw.commons.io.input.SeekableReadableChannel<byte[]> nav, long splitStart, long absProbeRegionStart, long maxRecordLength, long absDataRegionEnd, LongPredicate matcherAbsReadPosValidator, LongPredicate posValidator, org.aksw.commons.io.buffer.array.BufferOverReadableChannel<U[]> outBuffer, Prober<U> prober)
org.apache.hadoop.io.LongWritable getCurrentKey()
static long getPos(org.aksw.commons.io.seekable.api.Seekable seekable)
static long getPosition(org.aksw.commons.io.input.SeekableReadableChannel<byte[]> channel)
float getProgress()
getStats()
void initialize(org.apache.hadoop.mapreduce.InputSplit inputSplit, org.apache.hadoop.mapreduce.TaskAttemptContext context) - Read out config parameters (prefixes, length thresholds, ...) and examine the codec in order to set an internal flag whether the stream will be encoded or not.
void lines(org.aksw.commons.io.seekable.api.Seekable seekable)
static void logClose(…)
static void logUnexpectedClose(String str)
boolean nextKeyValue()
parse(InputStream inputStream, boolean isProbe) - Create a flowable from the input stream.
parseFromSeekable(org.aksw.commons.io.input.ReadableChannel<byte[]> seekable, boolean isProbe)
protected OffsetSeekResult prober(org.aksw.commons.io.input.SeekableReadableChannel<byte[]> seekable, org.aksw.commons.io.buffer.array.BufferOverReadableChannel<U[]> resultBuffer) - Turn a sequence of bytes into a sequence of records.
setStreamToInterval(long start, long end) - Seek to a given offset and prepare to read up to the 'end' position (exclusive). For non-encoded streams this just performs a seek on the stream and returns start/end unchanged.
static <T> Stream<T> unbufferedStream(org.aksw.commons.io.buffer.array.BufferOverReadableChannel<T[]> borc)
-
Field Details
-
EMPTY_BYTE_ARRAY
public static final byte[] EMPTY_BYTE_ARRAY -
minRecordLengthKey
-
maxRecordLengthKey
-
probeRecordCountKey
-
probeElementCountKey
-
recordStartPattern
Regex pattern to search for candidate record starts, used to avoid having to invoke the actual parser (which may start a new thread) on each single character -
maxRecordLength
protected long maxRecordLength -
minRecordLength
protected long minRecordLength -
probeRecordCount
protected int probeRecordCount -
probeElementCount
protected int probeElementCount -
enableStats
protected boolean enableStats -
accumulating
-
currentKey
-
currentValue
-
recordFlowCloseable
-
datasetFlow
-
decompressor
protected org.apache.hadoop.io.compress.Decompressor decompressor -
split
protected org.apache.hadoop.mapreduce.lib.input.FileSplit split -
codec
protected org.apache.hadoop.io.compress.CompressionCodec codec -
preambleBytes
protected byte[] preambleBytes - Subclasses may initialize the pre/post-amble bytes in the initialize(InputSplit, TaskAttemptContext) method rather than the ctor! A (possibly empty) sequence of bytes to be prepended to any stream passed to the parser. For example, for RDF data this could be a set of prefix declarations. -
postambleBytes
protected byte[] postambleBytes -
rawStream
protected org.apache.hadoop.fs.FSDataInputStream rawStream -
stream
protected org.aksw.commons.io.hadoop.SeekableInputStream stream -
isEncoded
protected boolean isEncoded -
splitStart
protected long splitStart -
splitLength
protected long splitLength -
splitEnd
protected long splitEnd -
isFirstSplit
protected boolean isFirstSplit -
splitName
-
splitId
-
totalEltCount
protected long totalEltCount -
totalRecordCount
protected long totalRecordCount -
skipRecordCount
protected int skipRecordCount -
maxExtraByteCount
protected long maxExtraByteCount -
tailRecordOffset
protected net.sansa_stack.hadoop.core.ProbeResult tailRecordOffset -
tailBytes
protected long tailBytes -
tailEltBuffer
-
tailByteBuffer
protected org.aksw.commons.io.buffer.array.BufferOverReadableChannel<byte[]> tailByteBuffer -
tailElts
-
tailEltsTime
protected long tailEltsTime -
regionStartSearchReadOverSplitEnd
-
regionStartSearchReadOverRegionEnd
-
-
Constructor Details
-
RecordReaderGenericBase
-
-
Method Details
-
initialize
public void initialize(org.apache.hadoop.mapreduce.InputSplit inputSplit, org.apache.hadoop.mapreduce.TaskAttemptContext context) throws IOException - Read out config parameters (prefixes, length thresholds, ...) and examine the codec in order to set an internal flag whether the stream will be encoded or not.
- Specified by:
initialize in class org.apache.hadoop.mapreduce.RecordReader<org.apache.hadoop.io.LongWritable,T> - Throws:
IOException
-
initRecordFlow
- Throws:
IOException
-
parse
Create a flowable from the input stream. The input stream may be incorrectly positioned, in which case the Flowable is expected to indicate this by raising an error event.
- Parameters:
inputStream - A supplier of input streams. May supply the same underlying stream on each call, hence at most a single stream should be taken from the supplier. Supplied streams are safe to use in try-with-resources blocks (possibly using CloseShieldInputStream). Taken streams should be closed by the client code.
isProbe - Whether the parser should be configured for probing. For example, it is desirable to suppress parse errors during probing. Also, for probing the parser may optimize itself for minimizing the latency of yielding items rather than overall throughput.
- Returns:
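The contract above can be mimicked in plain Java. The real callback returns an RxJava Flowable, which reports a bad start position as an onError event; this stdlib-only sketch uses a List and a thrown exception instead. The record format (lines beginning with '@' are valid record starts) is purely illustrative:

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class ParseContractSketch {
    /**
     * Read line-based records; signal an error when the stream is positioned
     * mid-record so that probing can move on to the next candidate offset.
     */
    static List<String> parse(InputStream in, boolean isProbe) throws IOException {
        BufferedReader r = new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8));
        List<String> records = new ArrayList<>();
        String line;
        while ((line = r.readLine()) != null) {
            if (!line.startsWith("@")) {
                // Incorrectly positioned stream: raise an error event.
                throw new IOException("not a record start: " + line);
            }
            records.add(line);
            // Probing favors latency over throughput: stop after a few records.
            if (isProbe && records.size() >= 2) break;
        }
        return records;
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "@r1\n@r2\n@r3\n".getBytes(StandardCharsets.UTF_8);
        System.out.println(parse(new ByteArrayInputStream(data), true)); // [@r1, @r2]
    }
}
```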
-
setStreamToInterval
Seek to a given offset and prepare to read up to the 'end' position (exclusive). For non-encoded streams this just performs a seek on the stream and returns start/end unchanged. Encoded streams will adjust the seek to the data part that follows some header information.
- Parameters:
start -
end -
- Returns:
- Throws:
IOException
-
logClose
-
logUnexpectedClose
-
didHitSplitBound
public boolean didHitSplitBound(org.apache.hadoop.fs.Seekable seekable, long splitPos) -
aggregate
Modify a flow to perform aggregation of items into records according to the specification. The complex part here is to correctly combine the two flows:
- The first group of the splitAggregateFlow needs to be skipped, as this is handled by the previous split's processor
- If there are no further groups in splitFlow then no items are emitted at all (because all items belong to a previous split)
- ONLY if the splitFlow owned at least one group: the first group in the tailFlow needs to be emitted
- Parameters:
isFirstSplit - If true then the first record is included in the output; otherwise it is skipped
splitFlow - The flow of items obtained from the split
tailElts - The first set of group items in the next split
- Returns:
-
aggregate
public static <T,G,A,U> Stream<U> aggregate(Stream<T> eltStream, Accumulating<T,G,A,U> accumulating) - Direct aggregation of a stream via an accumulating instance -
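Conceptually, this aggregation folds consecutive elements that map to the same group key into one record. The sketch below reimplements that idea without the Accumulating interface; the classifier (first token of a line, e.g. a graph name) and the record type (a List of lines) are illustrative assumptions only:

```java
import java.util.ArrayList;
import java.util.List;

public class AggregateSketch {
    /**
     * Fold consecutive elements with equal group keys into records.
     * A key change starts a new record, matching the idea that, e.g.,
     * consecutive quads in the same graph form one graph record.
     */
    static List<List<String>> aggregate(List<String> elts) {
        List<List<String>> records = new ArrayList<>();
        String currentKey = null;
        List<String> acc = null;
        for (String elt : elts) {
            String key = elt.split(" ", 2)[0]; // group key: first token
            if (!key.equals(currentKey)) {     // key change starts a new record
                if (acc != null) records.add(acc);
                acc = new ArrayList<>();
                currentKey = key;
            }
            acc.add(elt);
        }
        if (acc != null) records.add(acc); // flush the final accumulator
        return records;
    }

    public static void main(String[] args) {
        List<String> quads = List.of(":g1 :a", ":g1 :b", ":g2 :c");
        System.out.println(aggregate(quads).size()); // 2 records: one per graph
    }
}
```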
effectiveInputStream
This method is meant for overriding the input stream -
effectiveInputStreamSupp
protected InputStream effectiveInputStreamSupp(org.aksw.commons.io.input.ReadableChannel<byte[]> seekable) -
getPos
public static long getPos(org.aksw.commons.io.seekable.api.Seekable seekable) -
parseFromSeekable
-
prober
protected OffsetSeekResult prober(org.aksw.commons.io.input.SeekableReadableChannel<byte[]> seekable, org.aksw.commons.io.buffer.array.BufferOverReadableChannel<U[]> resultBuffer) - Turn a sequence of bytes into a sequence of records.
- Parameters:
seekable -
resultBuffer -
- Returns:
-
convert
public static ProbeStats convert(org.apache.jena.rdf.model.Resource tgt, net.sansa_stack.hadoop.core.ProbeResult src) -
getStats
-
detectTail
protected void detectTail(org.aksw.commons.io.buffer.array.BufferOverReadableChannel<byte[]> tailByteBuffer) - Try to detect the nth record offset in the subsequent split.
- Parameters:
tailByteBuffer -
-
createRecordFlow
- Throws:
IOException
-
lines
-
unbufferedStream
public static <T> Stream<T> unbufferedStream(org.aksw.commons.io.buffer.array.BufferOverReadableChannel<T[]> borc) -
findNextRegion
public static <U> net.sansa_stack.hadoop.core.ProbeResult findNextRegion(CustomPattern recordSearchPattern, org.aksw.commons.io.input.SeekableReadableChannel<byte[]> nav, long splitStart, long absProbeRegionStart, long maxRecordLength, long absDataRegionEnd, LongPredicate matcherAbsReadPosValidator, LongPredicate posValidator, org.aksw.commons.io.buffer.array.BufferOverReadableChannel<U[]> outBuffer, Prober<U> prober) throws IOException
- Parameters:
nav -
splitStart -
absProbeRegionStart -
maxRecordLength -
absDataRegionEnd -
posValidator -
prober -
- Returns:
- Throws:
IOException
-
getPosition
public static long getPosition(org.aksw.commons.io.input.SeekableReadableChannel<byte[]> channel) -
createMatcherFactory
protected static MatcherFactory createMatcherFactory(CustomPattern recordSearchPattern, int relProbeRegionEnd, LongPredicate matcherAbsReadPosValidator) -
findFirstPositionWithProbeSuccess
public static <U> net.sansa_stack.hadoop.core.ProbeResult findFirstPositionWithProbeSuccess(org.aksw.commons.io.input.SeekableReadableChannel<byte[]> rawSeekable, LongPredicate posValidator, MatcherFactory matcherFactory, boolean isFwd, org.aksw.commons.io.buffer.array.BufferOverReadableChannel<U[]> outBuffer, Prober<U> prober) throws IOException - Uses the matcher to find candidate probing positions, and returns the first position where probing succeeds. Matching ranges are part of the matcher configuration.
- Parameters:
rawSeekable -
posValidator - Test whether the seekable's absolute position is a valid start point. Used to prevent testing start points past a split boundary with unknown split lengths.
matcherFactory - A factory that creates matcher instances for given offsets
isFwd -
prober -
- Returns:
- Throws:
IOException
-
nextKeyValue
- Specified by:
nextKeyValue in class org.apache.hadoop.mapreduce.RecordReader<org.apache.hadoop.io.LongWritable,T> - Throws:
IOException
-
getCurrentKey
public org.apache.hadoop.io.LongWritable getCurrentKey()
- Specified by:
getCurrentKey in class org.apache.hadoop.mapreduce.RecordReader<org.apache.hadoop.io.LongWritable,T>
-
getCurrentValue
- Specified by:
getCurrentValue in class org.apache.hadoop.mapreduce.RecordReader<org.apache.hadoop.io.LongWritable,T>
-
getProgress
- Specified by:
getProgress in class org.apache.hadoop.mapreduce.RecordReader<org.apache.hadoop.io.LongWritable,T> - Throws:
IOException
-
close
- Specified by:
close in interface AutoCloseable
- Specified by:
close in interface Closeable
- Specified by:
close in class org.apache.hadoop.mapreduce.RecordReader<org.apache.hadoop.io.LongWritable,T> - Throws:
IOException
-
abbreviateAsUTF8
public static String abbreviateAsUTF8(InputStream in, int maxWidth, String abbrevMarker) throws IOException - Throws:
IOException
-
abbreviate
public static String abbreviate(InputStream in, Charset charset, int maxWidth, String abbrevMarker) throws IOException - Throws:
IOException
-
abbreviate
public static String abbreviate(InputStreamReader reader, int maxWidth, String abbrevMarker) throws IOException - Throws:
IOException
-