Class FileSplitUtils

java.lang.Object
net.sansa_stack.hadoop.util.FileSplitUtils

public class FileSplitUtils extends Object
  • Constructor Summary

    Constructors
    Constructor
    Description
    FileSplitUtils()
     
  • Method Summary

    Modifier and Type
    Method
    Description
    static <T> Stream<T>
    createFlow(org.apache.hadoop.mapreduce.Job job, org.apache.hadoop.mapreduce.InputFormat<?,T> inputFormat, org.apache.hadoop.mapreduce.InputSplit inputSplit)
    Create a flow of records for a given input split w.r.t. a given input format.
    static <T> io.reactivex.rxjava3.core.Flowable<T>
    createFlow2(org.apache.hadoop.mapreduce.Job job, org.apache.hadoop.mapreduce.InputFormat<?,T> inputFormat, org.apache.hadoop.mapreduce.InputSplit inputSplit)
     
    static List<Object[]>
    createTestParameters(Map<String,com.google.common.collect.Range<Integer>> fileToNumSplits)
    Utility method, typically used in split-related unit tests.
    static InputStream
    getDecodedStreamFromSplit(org.apache.hadoop.mapreduce.lib.input.FileSplit split, org.apache.hadoop.conf.Configuration job)
    Utility method to open a decoded stream from a split.
    static List<org.apache.hadoop.mapreduce.InputSplit>
    listFileSplits(org.apache.hadoop.fs.Path path, long fileLengthTotal, long numSplits)
     
    static Stream<org.apache.hadoop.mapreduce.InputSplit>
    streamFileSplits(org.apache.hadoop.fs.Path path, long fileLengthTotal, long numSplits)
    Utility method to create a specific number of splits for a file.

    Methods inherited from class java.lang.Object

    clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
  • Constructor Details

    • FileSplitUtils

      public FileSplitUtils()
  • Method Details

    • getDecodedStreamFromSplit

      public static InputStream getDecodedStreamFromSplit(org.apache.hadoop.mapreduce.lib.input.FileSplit split, org.apache.hadoop.conf.Configuration job) throws IOException
      Utility method to open a decoded stream from a split. Useful for reading header information from the first split and storing it in the Hadoop configuration before parallel processing starts.
      Parameters:
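      split - the file split to open
      job - the Hadoop configuration
      Returns:
      an input stream over the decoded content of the split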
      Throws:
      IOException
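      The header-probing use case described above can be illustrated without Hadoop. The following is a minimal sketch, not the library's implementation: it assumes the decoded stream is line-oriented text whose header lines share a known prefix, and it uses java.util.zip in place of Hadoop's compression codecs. The class name HeaderProbeSketch, its methods, and the sample data are hypothetical.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper, not part of FileSplitUtils.
public class HeaderProbeSketch {

    /** Collects the leading lines of an already decoded stream that start with the given prefix. */
    public static List<String> readHeaderLines(InputStream decoded, String prefix) throws IOException {
        List<String> header = new ArrayList<>();
        BufferedReader reader = new BufferedReader(new InputStreamReader(decoded, StandardCharsets.UTF_8));
        String line;
        // Stop at the first line that is not a header line.
        while ((line = reader.readLine()) != null && line.startsWith(prefix)) {
            header.add(line);
        }
        return header;
    }

    /** Compresses text in memory so that the example needs no files. */
    public static byte[] gzip(String text) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(bytes)) {
            out.write(text.getBytes(StandardCharsets.UTF_8));
        }
        return bytes.toByteArray();
    }

    /** Builds sample data, decodes it, and returns the header lines found. */
    public static List<String> demo() {
        try {
            byte[] compressed = gzip("@prefix ex: <http://example.org/> .\nex:s ex:p ex:o .\n");
            try (InputStream decoded = new GZIPInputStream(new ByteArrayInputStream(compressed))) {
                return readHeaderLines(decoded, "@prefix");
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

      With the real API, the InputStream returned by getDecodedStreamFromSplit would take the place of the GZIPInputStream, and the collected header could then be stored with Configuration.set(String, String) before the job is started.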
    • streamFileSplits

      public static Stream<org.apache.hadoop.mapreduce.InputSplit> streamFileSplits(org.apache.hadoop.fs.Path path, long fileLengthTotal, long numSplits) throws IOException
      Utility method to create a specific number of splits for a file.
      Parameters:
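      path - the path of the file to split
      fileLengthTotal - the total length of the file
      numSplits - the number of splits to create
      Returns:
      a stream of the created input splits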
      Throws:
      IOException
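      How a file length might be divided into a fixed number of splits can be sketched in plain Java. This is an assumption about the general technique, not the actual behavior of streamFileSplits: the sketch spreads the remainder over the first splits, and the library may distribute it differently. The class SplitRangeSketch and its method are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of FileSplitUtils.
public class SplitRangeSketch {

    /**
     * Returns {start, length} pairs that cover the range [0, fileLengthTotal)
     * with exactly numSplits contiguous, non-overlapping pieces.
     */
    public static List<long[]> splitRanges(long fileLengthTotal, long numSplits) {
        if (fileLengthTotal < 0 || numSplits <= 0) {
            throw new IllegalArgumentException("length must be >= 0 and numSplits must be > 0");
        }
        List<long[]> result = new ArrayList<>();
        long base = fileLengthTotal / numSplits;
        long remainder = fileLengthTotal % numSplits;
        long start = 0;
        for (long i = 0; i < numSplits; i++) {
            // The first 'remainder' splits receive one extra byte each.
            long length = base + (i < remainder ? 1 : 0);
            result.add(new long[] {start, length});
            start += length;
        }
        return result;
    }

    public static void main(String[] args) {
        for (long[] range : splitRanges(10, 3)) {
            System.out.println("start=" + range[0] + " length=" + range[1]);
        }
        // prints:
        // start=0 length=4
        // start=4 length=3
        // start=7 length=3
    }
}
```

      In Hadoop terms, each pair would correspond to a FileSplit built with the constructor FileSplit(Path file, long start, long length, String[] hosts).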
    • listFileSplits

      public static List<org.apache.hadoop.mapreduce.InputSplit> listFileSplits(org.apache.hadoop.fs.Path path, long fileLengthTotal, long numSplits) throws IOException
      Throws:
      IOException
    • createFlow

      public static <T> Stream<T> createFlow(org.apache.hadoop.mapreduce.Job job, org.apache.hadoop.mapreduce.InputFormat<?,T> inputFormat, org.apache.hadoop.mapreduce.InputSplit inputSplit)
      Create a flow of records for a given input split w.r.t. a given input format.
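      The idea of exposing a record reader as a java.util.stream.Stream can be sketched without Hadoop. This is a sketch under stated assumptions, not the library's implementation: the RecordSource interface is a hypothetical stand-in for Hadoop's RecordReader (whose nextKeyValue() and getCurrentValue() methods it mirrors), and listSource is a hypothetical in-memory source used only for the demonstration.

```java
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.Spliterator;
import java.util.Spliterators;
import java.util.stream.Collectors;
import java.util.stream.Stream;
import java.util.stream.StreamSupport;

// Hypothetical helper, not part of FileSplitUtils.
public class RecordFlowSketch {

    /** Hypothetical stand-in for a Hadoop RecordReader. */
    public interface RecordSource<T> {
        boolean advance(); // moves to the next record; false when exhausted
        T current();       // the record at the current position
        void close();      // releases underlying resources
    }

    /** Wraps a record source in a lazy, sequential Stream that closes the source on close(). */
    public static <T> Stream<T> toStream(RecordSource<T> source) {
        Iterator<T> it = new Iterator<T>() {
            private boolean fetched = false;    // has advance() been called for the pending element?
            private boolean hasCurrent = false; // result of that advance() call

            @Override
            public boolean hasNext() {
                if (!fetched) {
                    hasCurrent = source.advance();
                    fetched = true;
                }
                return hasCurrent;
            }

            @Override
            public T next() {
                if (!hasNext()) {
                    throw new NoSuchElementException();
                }
                fetched = false;
                return source.current();
            }
        };
        return StreamSupport
                .stream(Spliterators.spliteratorUnknownSize(it, Spliterator.ORDERED), false)
                .onClose(source::close);
    }

    /** In-memory record source for demonstration purposes. */
    public static <T> RecordSource<T> listSource(List<T> items) {
        return new RecordSource<T>() {
            private int index = -1;

            @Override
            public boolean advance() {
                index++;
                return index < items.size();
            }

            @Override
            public T current() {
                return items.get(index);
            }

            @Override
            public void close() {
                // nothing to release for an in-memory list
            }
        };
    }

    public static void main(String[] args) {
        // try-with-resources triggers the onClose handler and thus closes the source.
        try (Stream<String> records = toStream(listSource(List.of("a", "b", "c")))) {
            System.out.println(records.collect(Collectors.toList())); // prints [a, b, c]
        }
    }
}
```

      Because such a stream holds an open reader, callers would normally consume it inside try-with-resources, as in main above.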
    • createFlow2

      public static <T> io.reactivex.rxjava3.core.Flowable<T> createFlow2(org.apache.hadoop.mapreduce.Job job, org.apache.hadoop.mapreduce.InputFormat<?,T> inputFormat, org.apache.hadoop.mapreduce.InputSplit inputSplit)
    • createTestParameters

      public static List<Object[]> createTestParameters(Map<String,com.google.common.collect.Range<Integer>> fileToNumSplits)
      Utility method, typically used in split-related unit tests.
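      The List<Object[]> return type matches the shape expected by parameterized unit tests. The following sketch shows one plausible expansion, which is an assumption and not confirmed by this page: one row per file and per split count within the file's range. To stay free of dependencies it uses an int[] {min, max} pair (both inclusive) instead of Guava's Range<Integer>. The class name and the file names are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of FileSplitUtils.
public class TestParameterSketch {

    /**
     * Expands each file name and its inclusive {min, max} split range
     * into rows of the form {fileName, numSplits}.
     */
    public static List<Object[]> expand(Map<String, int[]> fileToSplitRange) {
        List<Object[]> rows = new ArrayList<>();
        for (Map.Entry<String, int[]> entry : fileToSplitRange.entrySet()) {
            int min = entry.getValue()[0];
            int max = entry.getValue()[1];
            for (int numSplits = min; numSplits <= max; numSplits++) {
                rows.add(new Object[] {entry.getKey(), numSplits});
            }
        }
        return rows;
    }

    public static void main(String[] args) {
        Map<String, int[]> input = new LinkedHashMap<>();
        input.put("data.ttl", new int[] {1, 3});  // hypothetical file name
        input.put("data.trig", new int[] {2, 2}); // hypothetical file name
        for (Object[] row : expand(input)) {
            System.out.println(row[0] + " with " + row[1] + " split(s)");
        }
    }
}
```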