Class RecordReaderCsv
- java.lang.Object
-
- org.apache.hadoop.mapreduce.RecordReader<org.apache.hadoop.io.LongWritable,T>
-
- net.sansa_stack.hadoop.core.RecordReaderGenericBase<String[],String[],String[],String[]>
-
- net.sansa_stack.hadoop.format.univocity.csv.csv.RecordReaderCsv
-
- All Implemented Interfaces:
Closeable, AutoCloseable
public class RecordReaderCsv extends RecordReaderGenericBase<String[],String[],String[],String[]>
A generic parser implementation for CSV with the offset-seeking condition that CSV rows must all have the same length.
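Below is a minimal usage sketch, not taken from the project's documentation, that drives the reader directly against a single file split. It assumes getCurrentValue() yields String[] rows, as the type parameters above suggest; the input path and split length are hypothetical. In a real job the Hadoop framework computes splits and drives the RecordReader itself.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.TaskAttemptID;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl;
import net.sansa_stack.hadoop.format.univocity.csv.csv.RecordReaderCsv;

public class RecordReaderCsvSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Hypothetical input file and split boundaries.
        FileSplit split = new FileSplit(new Path("hdfs:///data/example.csv"), 0, 64 * 1024 * 1024, null);

        try (RecordReaderCsv reader = new RecordReaderCsv()) {
            reader.initialize(split, new TaskAttemptContextImpl(conf, new TaskAttemptID()));
            while (reader.nextKeyValue()) {
                String[] row = reader.getCurrentValue(); // assumed to be String[] per the type parameters
                System.out.println(String.join("|", row));
            }
        }
    }
}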
-
-
Field Summary
Fields (Modifier and Type, Field, Description):
- static long CELL_MAXLENGTH_DEFAULT_VALUE
- static String CELL_MAXLENGTH_KEY
  The maximum length of a CSV cell containing new lines
- static String CSV_FORMAT_RAW_KEY
  Key for the serialized bytes of a CSVFormat instance
- protected org.apache.commons.csv.CSVFormat effectiveCsvFormat
- static String RECORD_MAXLENGTH_KEY
- static String RECORD_MINLENGTH_KEY
- static String RECORD_PROBECOUNT_KEY
- protected org.apache.commons.csv.CSVFormat requestedCsvFormat
-
Fields inherited from class net.sansa_stack.hadoop.core.RecordReaderGenericBase
accumulating, codec, currentKey, currentValue, datasetFlow, decompressor, EMPTY_BYTE_ARRAY, isEncoded, isFirstSplit, maxRecordLength, maxRecordLengthKey, minRecordLength, minRecordLengthKey, postambleBytes, preambleBytes, probeRecordCount, probeRecordCountKey, raisedThrowable, rawStream, recordStartPattern, split, splitEnd, splitLength, splitStart, stream
-
-
Constructor Summary
Constructors:
- RecordReaderCsv()
- RecordReaderCsv(String minRecordLengthKey, String maxRecordLengthKey, String probeRecordCountKey, Pattern recordSearchPattern)
-
Method Summary
Methods (Modifier and Type, Method, Description):
- protected io.reactivex.rxjava3.core.Flowable<String[]> createRecordFlow()
  Override createRecordFlow to skip the first record if the requested format demands it.
- static Pattern createStartOfCsvRecordPattern(long n)
  Create a regex for matching CSV record starts.
- protected org.apache.commons.csv.CSVFormat disableSkipHeaderRecord(org.apache.commons.csv.CSVFormat csvFormat)
- void initialize(org.apache.hadoop.mapreduce.InputSplit inputSplit, org.apache.hadoop.mapreduce.TaskAttemptContext context)
  Read out config parameters (prefixes, length thresholds, ...) and examine the codec in order to set an internal flag indicating whether the stream will be encoded or not.
- protected com.univocity.parsers.csv.CsvParser newCsvParser(Reader reader)
- protected io.reactivex.rxjava3.core.Flowable<String[]> parse(Callable<InputStream> inputStreamSupplier)
  Create a flowable from the input stream.
-
Methods inherited from class net.sansa_stack.hadoop.core.RecordReaderGenericBase
aggregate, close, didHitSplitBound, effectiveInputStream, effectiveInputStreamSupp, findFirstPositionWithProbeSuccess, findNextRecord, getCurrentKey, getCurrentValue, getPos, getProgress, initRecordFlow, lines, logClose, logUnexpectedClose, nextKeyValue, parseFromSeekable, prober, setStreamToInterval
-
-
-
-
Field Detail
-
RECORD_MINLENGTH_KEY
public static final String RECORD_MINLENGTH_KEY
- See Also:
- Constant Field Values
-
RECORD_MAXLENGTH_KEY
public static final String RECORD_MAXLENGTH_KEY
- See Also:
- Constant Field Values
-
RECORD_PROBECOUNT_KEY
public static final String RECORD_PROBECOUNT_KEY
- See Also:
- Constant Field Values
-
CSV_FORMAT_RAW_KEY
public static final String CSV_FORMAT_RAW_KEY
Key for the serialized bytes of a CSVFormat instance
- See Also:
- Constant Field Values
-
CELL_MAXLENGTH_KEY
public static final String CELL_MAXLENGTH_KEY
The maximum length of a CSV cell containing new lines
- See Also:
- Constant Field Values
-
CELL_MAXLENGTH_DEFAULT_VALUE
public static final long CELL_MAXLENGTH_DEFAULT_VALUE
- See Also:
- Constant Field Values
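The configuration keys above can be set on a Hadoop Configuration before the reader is initialized. The following sketch is illustrative only: the numeric values are made up, and the assumption that these keys hold numeric values is inferred from their names rather than stated on this page.

import org.apache.hadoop.conf.Configuration;
import net.sansa_stack.hadoop.format.univocity.csv.csv.RecordReaderCsv;

public class CsvReaderConfigSketch {
    public static void main(String[] args) {
        Configuration conf = new Configuration();

        // Illustrative values; suitable thresholds depend on the data.
        conf.setLong(RecordReaderCsv.RECORD_MINLENGTH_KEY, 1L);          // minimum record length
        conf.setLong(RecordReaderCsv.RECORD_MAXLENGTH_KEY, 1_000_000L);  // maximum record length
        conf.setInt(RecordReaderCsv.RECORD_PROBECOUNT_KEY, 10);          // number of records to probe
        conf.setLong(RecordReaderCsv.CELL_MAXLENGTH_KEY, RecordReaderCsv.CELL_MAXLENGTH_DEFAULT_VALUE);

        // CSV_FORMAT_RAW_KEY holds the serialized bytes of a CSVFormat instance
        // (see above); the exact encoding expected by the reader is not shown here.
    }
}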
-
requestedCsvFormat
protected org.apache.commons.csv.CSVFormat requestedCsvFormat
-
effectiveCsvFormat
protected org.apache.commons.csv.CSVFormat effectiveCsvFormat
-
-
Method Detail
-
createStartOfCsvRecordPattern
public static Pattern createStartOfCsvRecordPattern(long n)
Create a regex for matching CSV record starts. Matches the character following a newline character, where that newline is not within a CSV cell with respect to a certain amount of lookahead. The regex does the following: match a character (the dot at the end) if, for the preceding character, the following holds: match a newline, but only if there is no subsequent sole double quote followed by a newline, comma or EOF. Stop matching the 'dot' if there was a single quote in the past.
- Parameters:
n - The maximum number of lookahead bytes to check for whether a newline character might be within a CSV cell
- Returns:
- The corresponding regex pattern
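As an illustration, not taken from the original documentation, the pattern can be applied to a window of text to locate a candidate record start; the lookahead value and sample data below are hypothetical.

import java.util.regex.Matcher;
import java.util.regex.Pattern;
import net.sansa_stack.hadoop.format.univocity.csv.csv.RecordReaderCsv;

public class RecordStartPatternSketch {
    public static void main(String[] args) {
        // A lookahead of 1000 characters is illustrative.
        Pattern recordStart = RecordReaderCsv.createStartOfCsvRecordPattern(1000);

        // The first newline is inside a quoted cell; the second newline ends a row.
        String window = "x,\"a cell\nwith a newline\",1\nalice,bob,2\n";

        Matcher m = recordStart.matcher(window);
        if (m.find()) {
            // Per the description above, this should point at the start of the
            // "alice,..." row rather than at the newline inside the quoted cell.
            System.out.println("Candidate record start at offset " + m.start());
        }
    }
}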
-
initialize
public void initialize(org.apache.hadoop.mapreduce.InputSplit inputSplit, org.apache.hadoop.mapreduce.TaskAttemptContext context) throws IOException
Description copied from class: RecordReaderGenericBase
Read out config parameters (prefixes, length thresholds, ...) and examine the codec in order to set an internal flag indicating whether the stream will be encoded or not.
- Overrides:
initialize in class RecordReaderGenericBase<String[],String[],String[],String[]>
- Throws:
IOException
-
disableSkipHeaderRecord
protected org.apache.commons.csv.CSVFormat disableSkipHeaderRecord(org.apache.commons.csv.CSVFormat csvFormat)
-
newCsvParser
protected com.univocity.parsers.csv.CsvParser newCsvParser(Reader reader) throws IOException
- Throws:
IOException
-
createRecordFlow
protected io.reactivex.rxjava3.core.Flowable<String[]> createRecordFlow() throws IOException
Override createRecordFlow to skip the first record if the requested format demands it. This assumes that the header actually resides on the first split!
- Overrides:
createRecordFlow in class RecordReaderGenericBase<String[],String[],String[],String[]>
- Throws:
IOException
-
parse
protected io.reactivex.rxjava3.core.Flowable<String[]> parse(Callable<InputStream> inputStreamSupplier)
Description copied from class: RecordReaderGenericBase
Create a flowable from the input stream. The input stream may be incorrectly positioned, in which case the Flowable is expected to indicate this by raising an error event.
- Specified by:
parse in class RecordReaderGenericBase<String[],String[],String[],String[]>
- Parameters:
inputStreamSupplier - A supplier of input streams. It may supply the same underlying stream on each call, hence at most a single stream should be taken from the supplier. Supplied streams are safe to use in try-with-resources blocks (possibly using CloseShieldInputStream). Taken streams should be closed by the client code.
- Returns:
-
-
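The following is a hedged sketch of how an implementation might satisfy this supplier contract; ParseContractSketch, parseSketch and the empty flowable body are illustrative placeholders, not the project's actual implementation (which presumably feeds the stream into the univocity CsvParser obtained via newCsvParser).

import java.io.InputStream;
import java.util.concurrent.Callable;
import io.reactivex.rxjava3.core.Flowable;

// Illustrative only: one way a parse-style implementation can honor the contract.
final class ParseContractSketch {

    static Flowable<String[]> parseSketch(Callable<InputStream> inputStreamSupplier) {
        return Flowable.using(
                inputStreamSupplier::call,   // take at most one stream per subscription
                in -> {
                    // A real implementation would feed 'in' (optionally wrapped in a
                    // CloseShieldInputStream) into a CSV parser and emit String[] rows;
                    // an incorrectly positioned stream is signalled as an error event.
                    return Flowable.<String[]>empty();
                },
                InputStream::close);         // the taken stream is closed here, as the contract requires
    }
}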