Class CBZip2InputStreamAdapted

java.lang.Object
java.io.InputStream
org.aksw.commons.io.hadoop.compress.bzip2.CBZip2InputStreamAdapted
All Implemented Interfaces:
Closeable, AutoCloseable, org.apache.hadoop.io.compress.bzip2.BZip2Constants

public class CBZip2InputStreamAdapted extends InputStream implements org.apache.hadoop.io.compress.bzip2.BZip2Constants
An input stream that decompresses from the BZip2 format (without the file header chars) to be read as any other stream.

The decompression requires large amounts of memory. Thus you should call the close() method as soon as possible, to force CBZip2InputStream to release the allocated memory. See CBZip2OutputStream for information about memory usage.

CBZip2InputStream reads bytes from the compressed source stream exclusively via the single-byte read() method, so you should consider using a buffered source stream.
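
The effect of buffering on a byte-wise consumer can be illustrated with a minimal, library-independent sketch. The class and method names below (BufferedSourceSketch, CountingInputStream, drainByteWise) are hypothetical and not part of this API; the sketch only counts how many read calls reach the underlying source with and without a java.io.BufferedInputStream in between.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical illustration only; none of these names belong to the documented class.
final class BufferedSourceSketch {

    /** Counts how many read calls reach the wrapped stream. */
    static final class CountingInputStream extends FilterInputStream {
        long calls = 0;

        CountingInputStream(InputStream in) {
            super(in);
        }

        @Override
        public int read() throws IOException {
            calls++;
            return super.read();
        }

        @Override
        public int read(byte[] b, int off, int len) throws IOException {
            calls++;
            return super.read(b, off, len);
        }
    }

    /** Drains a stream one byte at a time, similar to how a byte-wise decompressor pulls input. */
    static long drainByteWise(InputStream in) throws IOException {
        long n = 0;
        while (in.read() != -1) {
            n++;
        }
        return n;
    }

    /** Number of calls that hit the source when it is read directly. */
    static long callsUnbuffered(byte[] data) throws IOException {
        CountingInputStream source = new CountingInputStream(new ByteArrayInputStream(data));
        drainByteWise(source);
        return source.calls;
    }

    /** Number of calls that hit the source when a BufferedInputStream sits in front of it. */
    static long callsBuffered(byte[] data) throws IOException {
        CountingInputStream source = new CountingInputStream(new ByteArrayInputStream(data));
        drainByteWise(new BufferedInputStream(source));
        return source.calls;
    }
}
```

With an in-memory source the difference is only a call count, but for file or network sources each call that reaches the source may be comparatively expensive, which is why wrapping the source before handing it to the decompressor is usually worthwhile.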

This code, originally from Ant, was enhanced so that it can decompress individual blocks of bzip2 data. The current position in the stream is an important statistic for Hadoop. For example, LineRecordReader depends solely on the current stream position to report progress. The notion of position is complicated for compressed files: Hadoop splits input in terms of the compressed file, but a compressed file expands to a much larger amount of data.

The problem is handled as follows. At object creation time, the stream searches for the next block start delimiter. Once such a marker is found, the stream stops there (any compressed data read during the search is discarded) and the position is reported as the beginning of the block start delimiter. At this point the stream is ready for actual reading (i.e. decompression) of data, and subsequent read calls return data.

The position is updated when the caller of this class has read off the current block + 1 bytes. While a block is being read, the position is not updated; it can only be updated on block boundaries.
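
As a rough illustration of why a position in the compressed stream matters, the following sketch derives a progress fraction for one split from such a position. The class SplitProgressSketch and its parameter names are assumptions made for this example and are not part of this API or of Hadoop; the formula is the usual "consumed compressed bytes over split length", clamped to the range 0 to 1.

```java
// Hypothetical helper, not part of the documented class or of Hadoop.
final class SplitProgressSketch {

    /**
     * Progress through a split, measured in compressed bytes.
     *
     * @param start       compressed offset where the split begins
     * @param end         compressed offset where the split ends
     * @param reportedPos last compressed position reported by the stream
     */
    static float progress(long start, long end, long reportedPos) {
        if (end <= start) {
            return 0.0f; // empty or invalid split: nothing to report
        }
        float fraction = (reportedPos - start) / (float) (end - start);
        return Math.max(0.0f, Math.min(1.0f, fraction));
    }
}
```

Because the position only advances on block boundaries, a value computed this way moves in steps of one compressed block rather than continuously.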

Instances of this class are not threadsafe.

  • Field Details

  • Constructor Details

    • CBZip2InputStreamAdapted

      public CBZip2InputStreamAdapted(InputStream in, org.apache.hadoop.io.compress.SplittableCompressionCodec.READ_MODE readMode) throws IOException
      Constructs a new CBZip2InputStream which decompresses bytes read from the specified stream.

      Although BZip2 headers are marked with the magic "BZ", this constructor expects the next byte in the stream to be the first one after the magic. Callers therefore have to skip the first two bytes; otherwise this constructor will throw an exception.

      Throws:
      IOException - if the stream content is malformed or an I/O error occurs.
      NullPointerException - if in == null
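
      A caller-side sketch of the header handling described above, using only java.io. The helper name BZip2MagicSkipSketch.skipMagic is hypothetical; it consumes and checks the two magic bytes so that the next byte in the stream is the first one after the magic.

```java
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper, not part of the documented class.
final class BZip2MagicSkipSketch {

    /**
     * Consumes the two magic bytes 'B' and 'Z' from the stream.
     *
     * @return the number of bytes consumed (always 2 on success)
     * @throws IOException if the stream does not start with "BZ"
     */
    static int skipMagic(InputStream in) throws IOException {
        int first = in.read();
        int second = in.read();
        if (first != 'B' || second != 'Z') {
            throw new IOException("Stream does not start with the BZip2 magic \"BZ\"");
        }
        return 2;
    }
}
```

      After such a call the stream could be passed to the constructor. A caller that tracks positions would presumably also account for the two skipped bytes, for example through updateReportedByteCount.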
    • CBZip2InputStreamAdapted

      public CBZip2InputStreamAdapted(InputStream in) throws IOException
      Throws:
      IOException
  • Method Details

    • getProcessedByteCount

      public long getProcessedByteCount()
      This method reports the number of bytes processed so far. Please note that this statistic is only updated on block boundaries and only when the stream is initialized in BYBLOCK mode.
    • updateProcessedByteCount

      protected void updateProcessedByteCount(int count)
      This method keeps track of raw processed compressed bytes.
      Parameters:
      count - the number of bytes to add to the raw processed byte count
    • updateReportedByteCount

      public void updateReportedByteCount(int count)
      This method is called by the client of this class when the stream position needs to be corrected. One common example is when the client of this code removes the leading BZ characters from the compressed stream.
      Parameters:
      count - the number of bytes to add to the reported byte count
    • skipToNextMarker

      public boolean skipToNextMarker(long marker, int markerBitLength) throws IOException, IllegalArgumentException
      This method tries to find the marker (passed as the first parameter) in the stream. It can find bit patterns of up to 63 bits in length. Specifically, this method is used in CBZip2InputStream to find the end of block (EOB) delimiter in the stream, starting from the current stream position. If the marker is found, the stream position will be at the byte containing the starting bit of the marker.
      Parameters:
      marker - The bit pattern to be found in the stream
      markerBitLength - Number of bits in the marker
      Returns:
      true if the marker was found, otherwise false
      Throws:
      IOException
      IllegalArgumentException - if markerBitLength is greater than 63
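
      The idea of searching for a bit pattern that need not be byte-aligned can be sketched independently of this class. The code below is an illustrative re-implementation under stated assumptions, not the implementation used here: BitMarkerSearchSketch and findMarker are hypothetical names, and the sketch returns a bit offset instead of repositioning the stream. It slides a window of markerBitLength bits over the input, one bit at a time.

```java
import java.io.IOException;
import java.io.InputStream;

// Hypothetical illustration; not the implementation behind skipToNextMarker.
final class BitMarkerSearchSketch {

    /**
     * Scans the stream bit by bit (most significant bit first) for the marker.
     *
     * @return the bit offset at which the marker starts, or -1 if it was not found
     */
    static long findMarker(InputStream in, long marker, int markerBitLength) throws IOException {
        if (markerBitLength < 1 || markerBitLength > 63) {
            throw new IllegalArgumentException("markerBitLength must be between 1 and 63");
        }
        long mask = (1L << markerBitLength) - 1; // keeps only the lowest markerBitLength bits
        long window = 0;                         // the most recently read bits
        long bitsRead = 0;
        int b;
        while ((b = in.read()) != -1) {
            for (int i = 7; i >= 0; i--) {
                window = ((window << 1) | ((b >> i) & 1)) & mask;
                bitsRead++;
                if (bitsRead >= markerBitLength && window == marker) {
                    return bitsRead - markerBitLength;
                }
            }
        }
        return -1;
    }
}
```

      For example, the 48-bit bzip2 block header magic 0x314159265359L could be searched for with findMarker(in, 0x314159265359L, 48); dividing the returned bit offset by 8 gives the byte that contains the starting bit of the marker.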
    • reportCRCError

      protected void reportCRCError() throws IOException
      Throws:
      IOException
    • numberOfBytesTillNextMarker

      public static long numberOfBytesTillNextMarker(InputStream in) throws IOException
      Returns the number of bytes between the current stream position and the immediate next BZip2 block marker.
      Parameters:
      in - The InputStream
      Returns:
      the number of bytes between the current stream position and the next BZip2 block start marker
      Throws:
      IOException
    • read

      public int read() throws IOException
      Specified by:
      read in class InputStream
      Throws:
      IOException
    • read

      public int read(byte[] dest, int offs, int len) throws IOException
      In CONTINUOUS reading mode, this read method starts at the beginning of the compressed stream and ends at the end of the file, emitting uncompressed data. In this mode the stream position is not announced and should be ignored. In BYBLOCK reading mode, this read method signals the end of a BZip2 block by returning EOB. At this event, the compressed stream position is also announced. This announcement tells how much of the compressed stream has been decompressed and read out of this class. Between EOB events, the stream position is not updated.
      Overrides:
      read in class InputStream
      Returns:
      a value greater than 0 is the number of bytes read; -1 means end of stream and -2 means end of block
      Throws:
      IOException - if the stream content is malformed or an I/O error occurs.
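
      A caller in BYBLOCK mode has to distinguish the two sentinel values from ordinary byte counts. The following sketch shows one possible read loop. Everything in it is an assumption made for illustration: ByBlockReadLoopSketch, readBlocks and FakeBlockStream are hypothetical names, the constants simply mirror the documented return values, and FakeBlockStream is a stand-in that only imitates the documented return contract without performing any decompression.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of a caller-side loop; not part of the documented class.
final class ByBlockReadLoopSketch {
    static final int END_OF_STREAM = -1; // mirrors the documented "end of stream" value
    static final int END_OF_BLOCK = -2;  // mirrors the documented "end of block" value

    /** Drains the stream and returns the bytes of each block separately. */
    static List<byte[]> readBlocks(InputStream in) throws IOException {
        List<byte[]> blocks = new ArrayList<>();
        ByteArrayOutputStream current = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        while (true) {
            int n = in.read(buf, 0, buf.length);
            if (n == END_OF_STREAM) {
                break;
            }
            if (n == END_OF_BLOCK) {
                // Block boundary: a real caller would refresh its stream position here.
                blocks.add(current.toByteArray());
                current.reset();
                continue;
            }
            current.write(buf, 0, n);
        }
        if (current.size() > 0) {
            blocks.add(current.toByteArray()); // data not followed by an end-of-block signal
        }
        return blocks;
    }

    /** Stand-in stream that returns -2 after each block and -1 at the end. */
    static final class FakeBlockStream extends InputStream {
        private final byte[][] blocks;
        private int block = 0;
        private int pos = 0;

        FakeBlockStream(byte[]... blocks) {
            this.blocks = blocks;
        }

        @Override
        public int read() throws IOException {
            byte[] one = new byte[1];
            int n;
            do {
                n = read(one, 0, 1); // single-byte reads skip over block boundaries
            } while (n == END_OF_BLOCK);
            return n == END_OF_STREAM ? -1 : (one[0] & 0xff);
        }

        @Override
        public int read(byte[] dest, int offs, int len) {
            if (block >= blocks.length) {
                return END_OF_STREAM;
            }
            byte[] b = blocks[block];
            if (pos >= b.length) {
                block++;
                pos = 0;
                return END_OF_BLOCK;
            }
            int n = Math.min(len, b.length - pos);
            System.arraycopy(b, pos, dest, offs, n);
            pos += n;
            return n;
        }
    }
}
```

      Note that code written against the general InputStream contract treats any negative return value as end of stream, so a loop of the form while ((n = in.read(buf)) > 0) would stop at the first block boundary in BYBLOCK mode.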
    • close

      public void close() throws IOException
      Specified by:
      close in interface AutoCloseable
      Specified by:
      close in interface Closeable
      Overrides:
      close in class InputStream
      Throws:
      IOException