Class RecordReaderCsv
java.lang.Object
org.apache.hadoop.mapreduce.RecordReader<org.apache.hadoop.io.LongWritable,T>
net.sansa_stack.hadoop.core.RecordReaderGenericBase<List,List,List,List>
net.sansa_stack.hadoop.format.commons_csv.csv.RecordReaderCsv
- All Implemented Interfaces:
Closeable,AutoCloseable
A generic parser implementation for CSV with the offset-seeking condition that
CSV rows must all have the same length.
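One way to read the equal-length condition is as a plausibility check while seeking a record start: records parsed from a candidate offset are accepted only if they all have the same number of cells. The sketch below illustrates that idea on already-parsed rows. The class and method names are hypothetical, and it is an assumption about how the condition might be used, not the logic of RecordReaderCsv itself.

```java
import java.util.List;

class UniformRowProbeSketch {

    /**
     * Hypothetical probe: returns true if up to probeCount leading rows
     * exist and all share the column count of the first row.
     */
    static boolean looksLikeRecordStart(List<List<String>> rows, int probeCount) {
        int limit = Math.min(probeCount, rows.size());
        if (limit <= 0) {
            return false; // nothing to check, so the candidate cannot be confirmed
        }
        int width = rows.get(0).size();
        for (int i = 1; i < limit; i++) {
            if (rows.get(i).size() != width) {
                return false; // a differing row length suggests a wrong offset
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Rows as they might look when parsing starts inside a quoted cell
        List<List<String>> misaligned = List.of(List.of("y\""), List.of("2", "z"));
        // Rows as they might look when parsing starts at a true record boundary
        List<List<String>> aligned = List.of(List.of("1", "x"), List.of("2", "z"));
        System.out.println(looksLikeRecordStart(misaligned, 2));
        System.out.println(looksLikeRecordStart(aligned, 2));
    }
}
```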
-
Nested Class Summary
Nested classes/interfaces inherited from class net.sansa_stack.hadoop.core.RecordReaderGenericBase
RecordReaderGenericBase.ReadTooFarException -
Field Summary
Fields
Modifier and Type, Field, Description
static final long CELL_MAXLENGTH_DEFAULT_VALUE
static final String CELL_MAXLENGTH_KEY: The maximum length of a CSV cell containing new lines
static final String CSV_FORMAT_RAW_KEY: Key for the serialized bytes of a CSVFormat instance
protected org.apache.commons.csv.CSVFormat effectiveCsvFormat
static final String RECORD_MAXLENGTH_KEY
static final String RECORD_MINLENGTH_KEY
static final String RECORD_PROBECOUNT_KEY
protected org.apache.commons.csv.CSVFormat requestedCsvFormat
Fields inherited from class net.sansa_stack.hadoop.core.RecordReaderGenericBase
accumulating, codec, currentKey, currentValue, datasetFlow, decompressor, EMPTY_BYTE_ARRAY, enableStats, isEncoded, isFirstSplit, maxExtraByteCount, maxRecordLength, maxRecordLengthKey, minRecordLength, minRecordLengthKey, postambleBytes, preambleBytes, probeElementCount, probeElementCountKey, probeRecordCount, probeRecordCountKey, rawStream, recordFlowCloseable, recordStartPattern, regionStartSearchReadOverRegionEnd, regionStartSearchReadOverSplitEnd, skipRecordCount, split, splitEnd, splitId, splitLength, splitName, splitStart, stream, tailByteBuffer, tailBytes, tailEltBuffer, tailElts, tailEltsTime, tailRecordOffset, totalEltCount, totalRecordCount -
Constructor Summary
Constructors -
Method Summary
Modifier and Type, Method, Description
createRecordFlow: Override createRecordFlow to skip the first record if the requested format demands so.
static CustomPattern createStartOfCsvRecordPattern(long n): Create a regex for matching CSV record starts.
protected org.apache.commons.csv.CSVFormat disableSkipHeaderRecord(org.apache.commons.csv.CSVFormat csvFormat)
void initialize(org.apache.hadoop.mapreduce.InputSplit inputSplit, org.apache.hadoop.mapreduce.TaskAttemptContext context): Read out config parameters (prefixes, length thresholds, ...) and examine the codec in order to set an internal flag whether the stream will be encoded or not.
protected org.apache.commons.csv.CSVParser newCsvParser(Reader reader)
parse(InputStream in, boolean isProbe): Create a flowable from the input stream.
Methods inherited from class net.sansa_stack.hadoop.core.RecordReaderGenericBase
abbreviate, abbreviate, abbreviateAsUTF8, aggregate, aggregate, close, convert, createMatcherFactory, detectTail, didHitSplitBound, effectiveInputStream, effectiveInputStreamSupp, findFirstPositionWithProbeSuccess, findNextRegion, getCurrentKey, getCurrentValue, getPos, getPosition, getProgress, getStats, initRecordFlow, lines, logClose, logUnexpectedClose, nextKeyValue, parseFromSeekable, prober, setStreamToInterval, unbufferedStream
-
Field Details
-
RECORD_MINLENGTH_KEY
- See Also:
-
RECORD_MAXLENGTH_KEY
- See Also:
-
RECORD_PROBECOUNT_KEY
- See Also:
-
CSV_FORMAT_RAW_KEY
Key for the serialized bytes of a CSVFormat instance
- See Also:
-
CELL_MAXLENGTH_KEY
The maximum length of a CSV cell containing new lines
- See Also:
-
CELL_MAXLENGTH_DEFAULT_VALUE
public static final long CELL_MAXLENGTH_DEFAULT_VALUE
- See Also:
-
requestedCsvFormat
protected org.apache.commons.csv.CSVFormat requestedCsvFormat -
effectiveCsvFormat
protected org.apache.commons.csv.CSVFormat effectiveCsvFormat
-
-
Constructor Details
-
RecordReaderCsv
public RecordReaderCsv() -
RecordReaderCsv
-
-
Method Details
-
createStartOfCsvRecordPattern
Create a regex for matching CSV record starts. Matches the character following a newline character, provided that the newline is not within a CSV cell w.r.t. a certain amount of lookahead. The regex does the following: Match a character (the dot at the end) if for the preceding character it holds: Match a newline; however only if there is no subsequent sole double quote followed by a newline, comma or EOF. Stop matching 'dot' if there was a single quote in the past
- Parameters:
n - The maximum number of lookahead bytes to check for whether a newline character might be within a CSV cell
- Returns:
The corresponding regex pattern
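The description above can be illustrated with a simplified pattern built on java.util.regex. This is a sketch under stated assumptions and not the pattern the library actually generates. The class and method names are hypothetical, it returns a plain Pattern rather than a CustomPattern, and the bound n counts lookahead steps (each a single non-quote character or one escaped quote pair) rather than bytes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CsvRecordStartSketch {

    /** Hypothetical helper: simplified record-start pattern with a lookahead bound of n steps. */
    static Pattern recordStartPattern(int n) {
        String regex =
              "(?<=\\n)"                          // the preceding character must be a newline
            + "(?!(?:[^\"]|\"\"){0," + n + "}"    // look ahead over non-quotes and escaped quote pairs
            + "\"(?:[\\r\\n,]|\\z))"              // reject if a sole quote then ends a cell (newline, comma or EOF)
            + ".";                                // the first character of the record
        return Pattern.compile(regex, Pattern.DOTALL);
    }

    /** Returns the offsets of all characters accepted as record starts. */
    static List<Integer> findRecordStarts(String csv, int n) {
        List<Integer> starts = new ArrayList<>();
        Matcher m = recordStartPattern(n).matcher(csv);
        while (m.find()) {
            starts.add(m.start());
        }
        return starts;
    }

    public static void main(String[] args) {
        // The second record contains a quoted cell with an embedded newline
        String csv = "a,b\n1,\"x\ny\"\n2,z\n";
        System.out.println(findRecordStarts(csv, 100)); // prints [4, 12]
    }
}
```

With a lookahead bound of 0 the same input yields [4, 9, 12]: the closing quote after the embedded newline is no longer seen, so offset 9 is wrongly accepted. The bound therefore needs to cover the longest cell that may contain a newline.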
-
initialize
public void initialize(org.apache.hadoop.mapreduce.InputSplit inputSplit, org.apache.hadoop.mapreduce.TaskAttemptContext context) throws IOException
Description copied from class: RecordReaderGenericBase
Read out config parameters (prefixes, length thresholds, ...) and examine the codec in order to set an internal flag whether the stream will be encoded or not.
- Overrides:
initialize in class RecordReaderGenericBase<List,List,List,List>
- Throws:
IOException
-
disableSkipHeaderRecord
protected org.apache.commons.csv.CSVFormat disableSkipHeaderRecord(org.apache.commons.csv.CSVFormat csvFormat) -
newCsvParser
- Throws:
IOException
-
createRecordFlow
Override createRecordFlow to skip the first record if the requested format demands so. This assumes that the header actually resides on the first split!
- Overrides:
createRecordFlow in class RecordReaderGenericBase<List,List,List,List>
- Throws:
IOException
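The header handling described here can be sketched with a plain java.util.stream.Stream standing in for the record flow. The class name, method name and both boolean parameters are hypothetical, and the real method operates on the reader's own flow type rather than on a Stream. The point is only that the first record is dropped on the first split and nowhere else.

```java
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class HeaderSkipSketch {

    /**
     * Hypothetical stand-in for the override: drops the first record only when
     * the format requests header skipping and this is the first split.
     */
    static Stream<List<String>> recordFlow(Stream<List<String>> records,
                                           boolean isFirstSplit,
                                           boolean skipHeaderRecord) {
        return (isFirstSplit && skipHeaderRecord) ? records.skip(1) : records;
    }

    public static void main(String[] args) {
        List<List<String>> rows = List.of(List.of("id", "name"), List.of("1", "a"));
        // First split: the header row is removed
        System.out.println(recordFlow(rows.stream(), true, true).collect(Collectors.toList()));
        // Any later split: all rows are data and are kept
        System.out.println(recordFlow(rows.stream(), false, true).collect(Collectors.toList()));
    }
}
```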
-
parse
Description copied from class: RecordReaderGenericBase
Create a flowable from the input stream. The input stream may be incorrectly positioned, in which case the Flowable is expected to indicate this by raising an error event.
- Specified by:
parse in class RecordReaderGenericBase<List,List,List,List>
- Parameters:
in - A supplier of input streams. May supply the same underlying stream on each call, hence at most a single stream should be taken from the supplier. Supplied streams are safe to use in try-with-resources blocks (possibly using CloseShieldInputStream). Taken streams should be closed by the client code.
isProbe - Whether the parser should be configured for probing. For example, it is desirable to suppress parse errors during probing. Also, for probing the parser may optimize itself for minimizing latency of yielding items rather than overall throughput.
- Returns:
-
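The note on try-with-resources can be illustrated with a small close shield written against the JDK only. The CloseShield class below is a hypothetical stand-in for a close-shielding wrapper such as CloseShieldInputStream, and TrackingStream exists only to make the effect observable. It is a sketch of the general technique, not code from this class.

```java
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

class CloseShieldSketch {

    /** Stand-in for a close shield: close() does not reach the wrapped stream. */
    static final class CloseShield extends FilterInputStream {
        CloseShield(InputStream in) {
            super(in);
        }

        @Override
        public void close() {
            // Intentionally empty: the underlying stream stays open for reuse
        }
    }

    /** Records whether close() was ever called on the underlying stream. */
    static final class TrackingStream extends ByteArrayInputStream {
        boolean closed = false;

        TrackingStream(byte[] data) {
            super(data);
        }

        @Override
        public void close() throws IOException {
            closed = true;
            super.close();
        }
    }

    /** Reads one byte inside try-with-resources without closing the shared stream. */
    static int readFirstByte(InputStream shared) throws IOException {
        try (InputStream in = new CloseShield(shared)) {
            return in.read();
        }
    }

    public static void main(String[] args) throws IOException {
        TrackingStream shared = new TrackingStream("a,b\n".getBytes(StandardCharsets.UTF_8));
        System.out.println((char) readFirstByte(shared)); // prints a
        System.out.println(shared.closed);                // prints false
    }
}
```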