Class RecordReaderCsvUnivocity
java.lang.Object
org.apache.hadoop.mapreduce.RecordReader<org.apache.hadoop.io.LongWritable,T>
net.sansa_stack.hadoop.core.RecordReaderGenericBase<String[],String[],String[],String[]>
net.sansa_stack.hadoop.format.univocity.csv.csv.RecordReaderCsvUnivocity
- All Implemented Interfaces:
Closeable,AutoCloseable
public class RecordReaderCsvUnivocity
extends RecordReaderGenericBase<String[],String[],String[],String[]>
A generic parser implementation for CSV with the offset-seeking condition that
CSV rows must all have the same length.
Note that the CSVW dialect "header" attribute is interpreted slightly differently: it controls whether to use the given header names (true) or to generate Excel-style header names (false) ("a", "b", ..., "aa", ...).
The skip
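The Excel-style header naming mentioned above follows the usual spreadsheet column pattern (bijective base-26). As an illustration only (this class is not part of the SANSA API, just a sketch of the naming scheme), such names can be generated like this:

```java
public class ExcelHeaders {
    /**
     * Returns the n-th (0-based) spreadsheet-style column name:
     * "a", "b", ..., "z", "aa", "ab", ...
     * Uses bijective base-26 numeration, where digits run 1..26.
     */
    public static String columnName(int n) {
        StringBuilder sb = new StringBuilder();
        n++; // switch to 1-based bijective base-26
        while (n > 0) {
            n--; // shift so the "digit" range becomes 0..25
            sb.append((char) ('a' + (n % 26)));
            n /= 26;
        }
        return sb.reverse().toString();
    }
}
```

For example, column 0 is "a", column 25 is "z", and column 26 wraps around to "aa".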
-
Nested Class Summary
Nested classes/interfaces inherited from class net.sansa_stack.hadoop.core.RecordReaderGenericBase
RecordReaderGenericBase.ReadTooFarException -
Field Summary
Fields
- static final String RECORD_MINLENGTH_KEY
- static final String RECORD_MAXLENGTH_KEY
- static final String RECORD_PROBECOUNT_KEY
- static final String CSV_FORMAT_RAW_KEY - Key for the serialized bytes of a CSVFormat instance
- static final String CELL_MAXLENGTH_KEY - The maximum length of a CSV cell containing new lines
- static final int CELL_MAXLENGTH_DEFAULT_VALUE
- static final String CELL_MAXLINES_KEY
- static final int CELL_MAXLINES_DEFAULT_VALUE
- protected org.aksw.commons.model.csvw.domain.api.Dialect requestedDialect
- protected org.aksw.commons.model.csvw.univocity.UnivocityParserFactory parserFactory
- protected long postProcessRowSkipCount
Fields inherited from class net.sansa_stack.hadoop.core.RecordReaderGenericBase
accumulating, codec, currentKey, currentValue, datasetFlow, decompressor, EMPTY_BYTE_ARRAY, enableStats, isEncoded, isFirstSplit, maxExtraByteCount, maxRecordLength, maxRecordLengthKey, minRecordLength, minRecordLengthKey, postambleBytes, preambleBytes, probeElementCount, probeElementCountKey, probeRecordCount, probeRecordCountKey, rawStream, recordFlowCloseable, recordStartPattern, regionStartSearchReadOverRegionEnd, regionStartSearchReadOverSplitEnd, skipRecordCount, split, splitEnd, splitId, splitLength, splitName, splitStart, stream, tailByteBuffer, tailBytes, tailEltBuffer, tailElts, tailEltsTime, tailRecordOffset, totalEltCount, totalRecordCount -
Constructor Summary
Constructors
- RecordReaderCsvUnivocity() - Create a regex for matching CSV record starts.
Method Summary
Methods
- createRecordFlow() - Override createRecordFlow to skip the first record if the requested format demands so.
- void initialize(org.apache.hadoop.mapreduce.InputSplit inputSplit, org.apache.hadoop.mapreduce.TaskAttemptContext context) - Read out config parameters (prefixes, length thresholds, ...) and examine the codec in order to set an internal flag indicating whether the stream will be encoded or not.
- parse(InputStream in, boolean isProbe) - Create a flowable from the input stream.
- protected io.reactivex.rxjava3.core.Flowable<String[]> parse(Callable<InputStream> inputStreamSupplier)
Methods inherited from class net.sansa_stack.hadoop.core.RecordReaderGenericBase
abbreviate, abbreviate, abbreviateAsUTF8, aggregate, aggregate, close, convert, createMatcherFactory, detectTail, didHitSplitBound, effectiveInputStream, effectiveInputStreamSupp, findFirstPositionWithProbeSuccess, findNextRegion, getCurrentKey, getCurrentValue, getPos, getPosition, getProgress, getStats, initRecordFlow, lines, logClose, logUnexpectedClose, nextKeyValue, parseFromSeekable, prober, setStreamToInterval, unbufferedStream
-
Field Details
-
RECORD_MINLENGTH_KEY
- See Also:
-
RECORD_MAXLENGTH_KEY
- See Also:
-
RECORD_PROBECOUNT_KEY
- See Also:
-
CSV_FORMAT_RAW_KEY
Key for the serialized bytes of a CSVFormat instance
- See Also:
-
CELL_MAXLENGTH_KEY
The maximum length of a CSV cell containing new lines
- See Also:
-
CELL_MAXLENGTH_DEFAULT_VALUE
public static final int CELL_MAXLENGTH_DEFAULT_VALUE
- See Also:
-
CELL_MAXLINES_KEY
- See Also:
-
CELL_MAXLINES_DEFAULT_VALUE
public static final int CELL_MAXLINES_DEFAULT_VALUE
- See Also:
-
requestedDialect
protected org.aksw.commons.model.csvw.domain.api.Dialect requestedDialect -
parserFactory
protected org.aksw.commons.model.csvw.univocity.UnivocityParserFactory parserFactory -
postProcessRowSkipCount
protected long postProcessRowSkipCount
-
-
Constructor Details
-
RecordReaderCsvUnivocity
public RecordReaderCsvUnivocity()
Create a regex for matching CSV record starts. Matches the character following a newline character, provided that the newline is not within a CSV cell with respect to a certain amount of lookahead. The regex does the following: match a character (the dot at the end) if the immediately preceding character is a newline; match a newline, but only if there is no subsequent sole double quote followed by a newline, comma or EOF; stop matching the 'dot' if there was a single quote in the past.
- Parameters:
maxCharsPerColumn - The maximum number of lookahead bytes to check for whether a newline character might be within a CSV cell
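The core difficulty the regex addresses is that a newline only starts a new record when it is not inside a quoted cell. A deliberately simplified sketch of that idea in plain Java (not the actual implementation, which uses a lookahead-bounded regex; this version uses quote parity and ignores escaped quotes):

```java
import java.util.ArrayList;
import java.util.List;

public class CsvRecordStarts {
    /**
     * Returns the offsets at which a new CSV record starts, using a
     * simplified rule: a character following a newline begins a record
     * only if an even number of double quotes precede that newline,
     * i.e. the newline is not inside a quoted cell.
     * Escaped quotes ("") inside cells are not handled here.
     */
    public static List<Integer> recordStarts(String csv) {
        List<Integer> starts = new ArrayList<>();
        if (!csv.isEmpty()) starts.add(0); // first record starts at offset 0
        int quoteCount = 0;
        for (int i = 0; i < csv.length(); i++) {
            char c = csv.charAt(i);
            if (c == '"') {
                quoteCount++;
            } else if (c == '\n' && quoteCount % 2 == 0 && i + 1 < csv.length()) {
                starts.add(i + 1); // newline outside any quoted cell
            }
        }
        return starts;
    }
}
```

For the input `a,b\n"x\ny",c\nd,e\n` this yields offsets 0, 4 and 12: the newline inside the quoted cell `"x\ny"` does not start a record.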
-
RecordReaderCsvUnivocity
-
-
Method Details
-
initialize
public void initialize(org.apache.hadoop.mapreduce.InputSplit inputSplit, org.apache.hadoop.mapreduce.TaskAttemptContext context) throws IOException
Description copied from class: RecordReaderGenericBase
Read out config parameters (prefixes, length thresholds, ...) and examine the codec in order to set an internal flag indicating whether the stream will be encoded or not.
- Overrides:
initialize in class RecordReaderGenericBase<String[],String[],String[],String[]>
- Throws:
IOException
-
createRecordFlow
Override createRecordFlow to skip the first record if the requested format demands so. This assumes that the header actually resides on the first split!
- Overrides:
createRecordFlow in class RecordReaderGenericBase<String[],String[],String[],String[]>
- Throws:
IOException
-
parse
Description copied from class: RecordReaderGenericBase
Create a flowable from the input stream. The input stream may be incorrectly positioned, in which case the Flowable is expected to indicate this by raising an error event.
- Specified by:
parse in class RecordReaderGenericBase<String[],String[],String[],String[]>
- Parameters:
in - A supplier of input streams. It may supply the same underlying stream on each call, hence at most a single stream should be taken from the supplier. Supplied streams are safe to use in try-with-resources blocks (possibly using CloseShieldInputStream). Taken streams should be closed by the client code.
isProbe - Whether the parser should be configured for probing. For example, it is desirable to suppress parse errors during probing. Also, for probing the parser may optimize itself for minimizing the latency of yielding items rather than overall throughput.
- Returns:
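The supplier contract described above (take at most one stream, close it via try-with-resources) can be illustrated with a small stand-alone sketch. The class and its naive comma-splitting are hypothetical; the real implementation returns an RxJava Flowable backed by the univocity parser and handles quoting:

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.stream.Collectors;

public class SupplierParseSketch {
    /**
     * Takes exactly one stream from the supplier, parses it line by line
     * into String[] rows, and closes it via try-with-resources.
     * Quoted cells and embedded newlines are deliberately not handled.
     */
    public static List<String[]> parse(Callable<InputStream> inputStreamSupplier) throws Exception {
        try (InputStream in = inputStreamSupplier.call();
             BufferedReader reader = new BufferedReader(
                     new InputStreamReader(in, StandardCharsets.UTF_8))) {
            return reader.lines()
                    .map(line -> line.split(",", -1)) // naive split; no quote handling
                    .collect(Collectors.toList());
        }
    }

    public static void main(String[] args) throws Exception {
        byte[] data = "a,b\nc,d\n".getBytes(StandardCharsets.UTF_8);
        List<String[]> rows = parse(() -> new ByteArrayInputStream(data));
        // rows.get(0) is ["a", "b"], rows.get(1) is ["c", "d"]
    }
}
```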
-
parse
protected io.reactivex.rxjava3.core.Flowable<String[]> parse(Callable<InputStream> inputStreamSupplier)
-