Class RecordReaderCsvUnivocity

java.lang.Object
org.apache.hadoop.mapreduce.RecordReader<org.apache.hadoop.io.LongWritable,T>
net.sansa_stack.hadoop.core.RecordReaderGenericBase<String[],String[],String[],String[]>
net.sansa_stack.hadoop.format.univocity.csv.csv.RecordReaderCsvUnivocity
All Implemented Interfaces:
Closeable, AutoCloseable

public class RecordReaderCsvUnivocity extends RecordReaderGenericBase<String[],String[],String[],String[]>
A generic parser implementation for CSV. Offset seeking relies on the condition that all CSV rows have the same length. Note that the csvw dialect "header" attribute is interpreted slightly differently: it controls whether to use the given header names (true) or to generate Excel-style header names (false) ("a", "b", ..., "aa", ...). The skip
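The Excel-style header naming mentioned above ("a", "b", ..., "aa", ...) can be illustrated with a short sketch. The class name ExcelHeaderSketch and the method name are hypothetical and not part of this API; the sketch only assumes that names follow the usual spreadsheet column scheme in lower case, starting at index 0.

```java
final class ExcelHeaderSketch {
    private ExcelHeaderSketch() {}

    /**
     * Hypothetical helper: maps a zero-based column index to a
     * spreadsheet-style name (0 -> "a", 25 -> "z", 26 -> "aa", ...).
     */
    static String name(int index) {
        if (index < 0) {
            throw new IllegalArgumentException("index must be >= 0");
        }
        StringBuilder sb = new StringBuilder();
        int n = index + 1;               // switch to 1-based "bijective base 26"
        while (n > 0) {
            n--;                         // shift so that the remainder is in 0..25
            sb.append((char) ('a' + n % 26));
            n /= 26;
        }
        return sb.reverse().toString();  // digits were produced least significant first
    }
}
```

For example, ExcelHeaderSketch.name(27) yields "ab" and ExcelHeaderSketch.name(702) yields "aaa".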
  • Field Details

    • RECORD_MINLENGTH_KEY

      public static final String RECORD_MINLENGTH_KEY
    • RECORD_MAXLENGTH_KEY

      public static final String RECORD_MAXLENGTH_KEY
    • RECORD_PROBECOUNT_KEY

      public static final String RECORD_PROBECOUNT_KEY
    • CSV_FORMAT_RAW_KEY

      public static final String CSV_FORMAT_RAW_KEY
      Key for the serialized bytes of a CSVFormat instance
    • CELL_MAXLENGTH_KEY

      public static final String CELL_MAXLENGTH_KEY
      The maximum length of a CSV cell containing new lines
    • CELL_MAXLENGTH_DEFAULT_VALUE

      public static final int CELL_MAXLENGTH_DEFAULT_VALUE
    • CELL_MAXLINES_KEY

      public static final String CELL_MAXLINES_KEY
    • CELL_MAXLINES_DEFAULT_VALUE

      public static final int CELL_MAXLINES_DEFAULT_VALUE
    • requestedDialect

      protected org.aksw.commons.model.csvw.domain.api.Dialect requestedDialect
    • parserFactory

      protected org.aksw.commons.model.csvw.univocity.UnivocityParserFactory parserFactory
    • postProcessRowSkipCount

      protected long postProcessRowSkipCount
  • Constructor Details

    • RecordReaderCsvUnivocity

      public RecordReaderCsvUnivocity()
      Create a regex for matching CSV record starts. It matches the character following a newline character, provided that the newline is not within a CSV cell with respect to a certain amount of lookahead. The regex does the following: it matches a character (the dot at the end) if the immediately preceding character is a newline; the newline only qualifies if it is not followed by a sole double quote that is in turn followed by a newline, comma or EOF. Matching of the dot stops if there was a single quote in the past.
      Parameters:
      maxCharsPerColumn - The maximum number of lookahead bytes to check whether a newline character might be within a CSV cell
    • RecordReaderCsvUnivocity

      public RecordReaderCsvUnivocity(RecordReaderConf conf)
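The record-start regex described for the no-argument constructor can be approximated with a simplified sketch. This is not the pattern used by the class: the class name CsvRecordStartSketch, its methods and the exact pattern are assumptions for illustration. The lookahead here counts characters rather than bytes and does not treat escaped quotes ("") specially, so a cell containing doubled quotes can produce false positives.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class CsvRecordStartSketch {
    private CsvRecordStartSketch() {}

    /** Hypothetical helper: builds a simplified record-start pattern. */
    static Pattern recordStartPattern(int maxLookahead) {
        return Pattern.compile(
            // the candidate character must directly follow a newline
            "(?<=\\n)"
            // reject the candidate if, within maxLookahead non-quote characters,
            // a double quote appears that is followed by a newline, comma or EOF
            // (such a quote looks like the end of a cell we are still inside of)
            + "(?![^\"]{0," + maxLookahead + "}\"(?:\\r?\\n|,|\\z))"
            // the character at the candidate record start
            + ".");
    }

    /** Returns the offsets of all candidate record starts in the given text. */
    static List<Integer> recordStarts(String text, int maxLookahead) {
        List<Integer> starts = new ArrayList<>();
        Matcher m = recordStartPattern(maxLookahead).matcher(text);
        while (m.find()) {
            starts.add(m.start());
        }
        return starts;
    }
}
```

For the input "id,text\n1,\"hello\nworld\"\n2,plain\n" with a lookahead of 100, recordStarts returns [8, 24]: the newline inside the quoted cell is not reported as a record start.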
  • Method Details

    • initialize

      public void initialize(org.apache.hadoop.mapreduce.InputSplit inputSplit, org.apache.hadoop.mapreduce.TaskAttemptContext context) throws IOException
      Description copied from class: RecordReaderGenericBase
      Read out config parameters (prefixes, length thresholds, ...) and examine the codec in order to set an internal flag indicating whether the stream will be encoded or not.
      Overrides:
      initialize in class RecordReaderGenericBase<String[],String[],String[],String[]>
      Throws:
      IOException
    • createRecordFlow

      protected Stream<String[]> createRecordFlow() throws IOException
      Overrides createRecordFlow to skip the first record if the requested format demands it. This assumes that the header actually resides on the first split!
      Overrides:
      createRecordFlow in class RecordReaderGenericBase<String[],String[],String[],String[]>
      Throws:
      IOException
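The header-skipping behaviour described for createRecordFlow can be sketched with plain Java streams. The class HeaderSkipSketch, the method skipHeader and the isFirstSplit flag are hypothetical; how the class itself decides that it is processing the first split is not stated here, so that decision is passed in as a parameter.

```java
import java.util.stream.Stream;

final class HeaderSkipSketch {
    private HeaderSkipSketch() {}

    /**
     * Hypothetical helper: drops the first skipCount records, but only on the
     * first split, because the header is assumed to reside there.
     */
    static Stream<String[]> skipHeader(Stream<String[]> records,
                                       long skipCount,
                                       boolean isFirstSplit) {
        if (isFirstSplit && skipCount > 0) {
            return records.skip(skipCount);
        }
        return records; // later splits contain data rows only
    }
}
```

If the header happened to span a split boundary, or did not reside on the first split at all, a skip like this would drop data rows or leave header rows in place, which is why the assumption matters.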
    • parse

      protected Stream<String[]> parse(InputStream in, boolean isProbe)
      Description copied from class: RecordReaderGenericBase
      Create a Flowable from the input stream. The input stream may be incorrectly positioned, in which case the Flowable is expected to indicate this by raising an error event.
      Specified by:
      parse in class RecordReaderGenericBase<String[],String[],String[],String[]>
      Parameters:
      in - A supplier of input streams. May supply the same underlying stream on each call, hence at most a single stream should be taken from the supplier. Supplied streams are safe to use in try-with-resources blocks (possibly using CloseShieldInputStream). Taken streams should be closed by the client code.
      isProbe - Whether the parser should be configured for probing. For example, it is desirable to suppress parse errors during probing. Also, for probing the parser may optimize itself for minimizing latency of yielding items rather than overall throughput.
    • parse

      protected io.reactivex.rxjava3.core.Flowable<String[]> parse(Callable<InputStream> inputStreamSupplier)
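The class description ties offset seeking to the condition that all CSV rows have the same length, and the isProbe parameter of parse marks a call as a probe. One way a probe result could be checked against that condition is sketched below. The class ProbeSketch and the method hasUniformWidth are hypothetical, and the actual acceptance logic of the reader may differ; probeCount stands in for a value such as the one configured under RECORD_PROBECOUNT_KEY.

```java
import java.util.List;

final class ProbeSketch {
    private ProbeSketch() {}

    /**
     * Hypothetical check: accepts a candidate offset if the first probeCount
     * rows parsed from it (or all rows, if fewer are available) have the same
     * number of cells. Rejects an empty probe result.
     */
    static boolean hasUniformWidth(List<String[]> probedRows, int probeCount) {
        int n = Math.min(probeCount, probedRows.size());
        if (n <= 0) {
            return false;
        }
        int width = probedRows.get(0).length;
        for (int i = 1; i < n; i++) {
            if (probedRows.get(i).length != width) {
                return false; // a differing row length suggests a wrong start offset
            }
        }
        return true;
    }
}
```

A probe that starts in the middle of a quoted cell tends to yield rows of differing length, so a check of this kind would reject that offset and move on to the next candidate.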