Class RecordReaderCsv

java.lang.Object
org.apache.hadoop.mapreduce.RecordReader<org.apache.hadoop.io.LongWritable,T>
net.sansa_stack.hadoop.core.RecordReaderGenericBase<List,List,List,List>
net.sansa_stack.hadoop.format.commons_csv.csv.RecordReaderCsv
All Implemented Interfaces:
Closeable, AutoCloseable

public class RecordReaderCsv extends RecordReaderGenericBase<List,List,List,List>
A generic parser implementation for CSV with the offset-seeking condition that CSV rows must all have the same length.
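The same-length condition can be pictured as a check run on records parsed from a candidate offset. The following is only a minimal sketch under assumptions: it reads "length" as the number of cells per row, and the class `UniformRowLengthProbe`, the method `hasUniformWidth`, and the `probeCount` parameter are hypothetical names for illustration, not part of RecordReaderCsv.

```java
import java.util.List;

public class UniformRowLengthProbe {

    /**
     * Hypothetical check: returns true if the first probeCount rows
     * all have the same number of cells as the first row.
     */
    static boolean hasUniformWidth(List<List<String>> rows, int probeCount) {
        if (rows.isEmpty()) {
            return false; // nothing parsed, so the offset cannot be confirmed
        }
        int expected = rows.get(0).size();
        return rows.stream()
                .limit(probeCount)
                .allMatch(row -> row.size() == expected);
    }

    public static void main(String[] args) {
        List<List<String>> rows = List.of(
                List.of("a", "b"),
                List.of("c", "d"),
                List.of("e"));
        System.out.println(hasUniformWidth(rows, 2)); // only the first two rows are probed
        System.out.println(hasUniformWidth(rows, 3)); // the third row has a different width
    }
}
```

A candidate offset whose probed rows differ in width would be rejected as a likely mid-record position.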
  • Field Details

    • RECORD_MINLENGTH_KEY

      public static final String RECORD_MINLENGTH_KEY
    • RECORD_MAXLENGTH_KEY

      public static final String RECORD_MAXLENGTH_KEY
    • RECORD_PROBECOUNT_KEY

      public static final String RECORD_PROBECOUNT_KEY
    • CSV_FORMAT_RAW_KEY

      public static final String CSV_FORMAT_RAW_KEY
      Key for the serialized bytes of a CSVFormat instance
    • CELL_MAXLENGTH_KEY

      public static final String CELL_MAXLENGTH_KEY
      The maximum length of a CSV cell containing new lines
    • CELL_MAXLENGTH_DEFAULT_VALUE

      public static final long CELL_MAXLENGTH_DEFAULT_VALUE
    • requestedCsvFormat

      protected org.apache.commons.csv.CSVFormat requestedCsvFormat
    • effectiveCsvFormat

      protected org.apache.commons.csv.CSVFormat effectiveCsvFormat
  • Constructor Details

    • RecordReaderCsv

      public RecordReaderCsv()
    • RecordReaderCsv

      public RecordReaderCsv(RecordReaderConf conf)
  • Method Details

    • createStartOfCsvRecordPattern

      public static CustomPattern createStartOfCsvRecordPattern(long n)
      Create a regex for matching CSV record starts. The pattern matches the character that follows a newline, provided that the newline is not inside a CSV cell as far as a bounded amount of lookahead can tell. In detail, the regex matches a character (the dot at the end) if the preceding character is a newline and that newline is not followed by a sole double quote that is in turn followed by a newline, comma, or EOF. The lookahead stops scanning once it has passed a sole quote.
      Parameters:
      n - The maximum number of lookahead bytes used to check whether a newline character might be inside a CSV cell
      Returns:
      The corresponding regex pattern
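The idea described for createStartOfCsvRecordPattern can be approximated with plain `java.util.regex`. This is a simplified sketch, not the library's actual pattern: the class `CsvRecordStartSketch` and its methods are hypothetical, the real method returns a CustomPattern, and here the bound `n` counts scan steps (a plain character or an escaped `""` pair) rather than bytes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CsvRecordStartSketch {

    // Hypothetical helper; a simplified stand-in for the documented pattern.
    static Pattern recordStartPattern(long n) {
        String regex =
              "(?<=\\n)"                         // the preceding character is a newline
            + "(?!(?:[^\"]|\"\"){0," + n + "}"   // scan at most n plain chars or escaped quote pairs
            + "\"(?:\\r?\\n|,|\\z))"             // reject if a sole quote then closes a cell
            + ".";                               // the first character of the candidate record
        return Pattern.compile(regex, Pattern.DOTALL);
    }

    // Collects the offsets of all characters matched as record starts.
    static List<Integer> findRecordStarts(String text, long n) {
        List<Integer> starts = new ArrayList<>();
        Matcher m = recordStartPattern(n).matcher(text);
        while (m.find()) {
            starts.add(m.start());
        }
        return starts;
    }

    public static void main(String[] args) {
        // The second record contains a quoted cell with an embedded newline.
        String csv = "a,b\n\"multi\nline\",c\nd,e\n";
        System.out.println(findRecordStarts(csv, 100)); // prints [4, 19]
        System.out.println(findRecordStarts(csv, 2));   // prints [4, 11, 19]
    }
}
```

With enough lookahead, the newline inside the quoted cell (offset 10) is not treated as a record boundary. With `n = 2` the closing quote lies beyond the lookahead window, so offset 11 is reported as a false candidate, which is why the lookahead bound matters.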
    • initialize

      public void initialize(org.apache.hadoop.mapreduce.InputSplit inputSplit, org.apache.hadoop.mapreduce.TaskAttemptContext context) throws IOException
      Description copied from class: RecordReaderGenericBase
      Read out config parameters (prefixes, length thresholds, ...) and examine the codec in order to set an internal flag indicating whether the stream will be encoded or not.
      Overrides:
      initialize in class RecordReaderGenericBase<List,List,List,List>
      Throws:
      IOException
    • disableSkipHeaderRecord

      protected org.apache.commons.csv.CSVFormat disableSkipHeaderRecord(org.apache.commons.csv.CSVFormat csvFormat)
    • newCsvParser

      protected org.apache.commons.csv.CSVParser newCsvParser(Reader reader) throws IOException
      Throws:
      IOException
    • createRecordFlow

      protected Stream<List> createRecordFlow() throws IOException
      Overrides createRecordFlow to skip the first record if the requested format demands it. This assumes that the header actually resides in the first split!
      Overrides:
      createRecordFlow in class RecordReaderGenericBase<List,List,List,List>
      Throws:
      IOException
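The header-skipping behaviour described for createRecordFlow can be sketched as follows. This is an illustration under assumptions, not the actual override: `HeaderSkipSketch`, `recordFlow`, `splitStart`, and `skipHeaderRecord` are hypothetical names, and the first split is assumed to be the one starting at byte offset 0.

```java
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class HeaderSkipSketch {

    /**
     * Hypothetical helper: drops the first record only when header skipping
     * is requested and the split starts at the beginning of the file.
     */
    static Stream<List<String>> recordFlow(Stream<List<String>> records,
                                           long splitStart,
                                           boolean skipHeaderRecord) {
        boolean isFirstSplit = splitStart == 0; // assumption: the header resides here
        return (skipHeaderRecord && isFirstSplit) ? records.skip(1) : records;
    }

    public static void main(String[] args) {
        List<List<String>> rows = List.of(
                List.of("id", "name"),
                List.of("1", "alice"));

        // First split: the header row is dropped.
        System.out.println(recordFlow(rows.stream(), 0, true).collect(Collectors.toList()));
        // Later split: every row is data, so nothing is skipped.
        System.out.println(recordFlow(rows.stream(), 4096, true).collect(Collectors.toList()));
    }
}
```

If the header were to span a split boundary, this simple rule would drop the wrong record, which is the caveat the description points out.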
    • parse

      protected Stream<List> parse(InputStream in, boolean isProbe)
      Description copied from class: RecordReaderGenericBase
      Create a flowable from the input stream. The input stream may be incorrectly positioned in which case the Flowable is expected to indicate this by raising an error event.
      Specified by:
      parse in class RecordReaderGenericBase<List,List,List,List>
      Parameters:
      in - A supplier of input streams. May supply the same underlying stream on each call hence only at most a single stream should be taken from the supplier. Supplied streams are safe to use in try-with-resources blocks (possibly using CloseShieldInputStream). Taken streams should be closed by the client code.
      isProbe - Whether the parser should be configured for probing. For example, it is desirable to suppress parse errors during probing. Also, for probing the parser may optimize itself for minimizing the latency of yielding items rather than overall throughput.
      Returns: