§archive-to-parquet
This is a small tool that recursively extracts data from many archive files and writes the contents to a single Parquet file. It supports the following compression and archive formats, which may be layered within one another (a sketch of layered decoding follows the list):
- Tar
- Zip
- Gzip
- Zstd
- Bzip2
- Xz
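These formats compose: a .tar.gz, for instance, is a Gzip stream wrapped around a Tar archive, and each layer is decoded in turn. Below is a minimal illustration of that kind of layered decoding using the flate2 and tar crates; it is a sketch of the technique, not this crate's internal code.

```rust
use std::fs::File;

use flate2::read::GzDecoder;
use tar::Archive;

fn main() -> std::io::Result<()> {
    // Open a gzip-compressed tarball and peel off the Gzip layer first.
    let file = File::open("example.tar.gz")?;
    let mut archive = Archive::new(GzDecoder::new(file));

    // Walk the Tar layer underneath, listing each member's path and size.
    for entry in archive.entries()? {
        let entry = entry?;
        println!("{} ({} bytes)", entry.path()?.display(), entry.size());
    }
    Ok(())
}
```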
Features:
- Archive members are hashed with SHA256, which is included in the output
- Recursive extraction of archives within archives
- Filtering by file size or type (binary/text)
- Content-based deduplication (a sketch of the idea follows this list)
- Speed! 🚀
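To illustrate the SHA256 hashing and content-based deduplication mentioned above, here is a minimal sketch using the sha2 crate. The Deduplicator type is invented for this example and is not part of the crate's API; it simply keeps the first occurrence of each distinct content.

```rust
use std::collections::HashSet;

use sha2::{Digest, Sha256};

/// Tracks SHA-256 digests of content already seen.
/// A sketch of the idea, not the crate's actual code.
struct Deduplicator {
    seen: HashSet<[u8; 32]>,
}

impl Deduplicator {
    fn new() -> Self {
        Self { seen: HashSet::new() }
    }

    /// Returns true if this content has not been seen before.
    fn insert(&mut self, content: &[u8]) -> bool {
        let digest: [u8; 32] = Sha256::digest(content).into();
        self.seen.insert(digest)
    }
}

fn main() {
    let mut dedup = Deduplicator::new();
    assert!(dedup.insert(b"hello"));
    assert!(!dedup.insert(b"hello")); // duplicate content is rejected
}
```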
§Example: extracting all files within a Docker image
$ skopeo copy docker://python:latest oci:docker-image/ --all
$ archive-to-parquet output.parquet docker-image/blobs/**/*
INFO archive_to_parquet: Converting 112 files to Parquet
INFO archive_to_parquet: Options: ConvertionOptions(include=All, unique=false, compression=SNAPPY, min_size=None, size_range=None, batch_count=14, batch_size=100.00 MB)
...
INFO archive_to_parquet::channel: File written in 37 seconds. size=9.43 GB, batches=415 (0 pending), entries: in=263,862 out=263,862 bytes: in=25.23 GB out=25.23 GB
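Once written, output.parquet can be read with any Parquet reader. Below is a minimal sketch in Rust using the parquet crate (with its arrow feature enabled); the exact columns of the output schema are not reproduced here, so the example simply prints the schema and counts rows.

```rust
use std::fs::File;

// Requires the `parquet` crate with the "arrow" feature.
use parquet::arrow::arrow_reader::ParquetRecordBatchReaderBuilder;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let file = File::open("output.parquet")?;
    let builder = ParquetRecordBatchReaderBuilder::try_new(file)?;

    // Print the schema to see which columns were actually written.
    println!("schema: {:?}", builder.schema());

    // Stream record batches and count the total number of rows.
    let reader = builder.build()?;
    let mut rows = 0usize;
    for batch in reader {
        rows += batch?.num_rows();
    }
    println!("total rows: {rows}");
    Ok(())
}
```

The same file can equally be inspected with other Parquet tooling such as DuckDB or pandas.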
§Usage
$ archive-to-parquet --help
Usage: archive-to-parquet [OPTIONS] <OUTPUT> <PATHS>...

Arguments:
  <OUTPUT>    Output Parquet file to create
  <PATHS>...  Input paths to read

Options:
      --min-size <MIN_SIZE>          Min file size to output. Files below this size are skipped
      --max-size <MAX_SIZE>          Max file size to output. Files above this size are skipped
      --unique                       Only output unique files by hash
      --include <INCLUDE>            Only output text files, skipping binary files [default: all] [possible values: all, text, binary]
      --threads <THREADS>            Number of threads to use when extracting. Defaults to number of CPU cores [default: 12]
      --compression <COMPRESSION>    Compression to use [default: SNAPPY]
      --batch-count <BATCH_COUNT>    Number of batches to buffer in memory at once [default: 14]
      --batch-size <BATCH_SIZE>      Maximum size of each batch in memory [default: 100MB]
  -h, --help                         Print help
Structs§
- A utility struct to keep track of the current archive stack. This is useful when processing nested archives: it supports pushing and popping archives from the stack, and provides the current nested path, including all previous nested paths (a hypothetical sketch follows this list).
- Represents the details of a FileEntry, including its path and size.
- A FileEntry represents a file in an archive, along with its format and size. It can be used to read the file's contents and to visit the contents of an archive.
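As a way to picture the archive-stack struct described above, here is a hypothetical sketch; the type and method names are invented for illustration and do not match the crate's actual API.

```rust
use std::path::PathBuf;

// Hypothetical sketch of a nested-archive stack; the crate's real
// struct and method names may differ.
struct ArchiveStack {
    paths: Vec<PathBuf>,
}

impl ArchiveStack {
    /// Enter a nested archive.
    fn push(&mut self, path: PathBuf) {
        self.paths.push(path);
    }

    /// Leave the current archive.
    fn pop(&mut self) -> Option<PathBuf> {
        self.paths.pop()
    }

    /// Full nested path, e.g. "outer.tar/inner.zip/file.txt"
    /// (Unix-style separators shown).
    fn nested_path(&self) -> PathBuf {
        self.paths.iter().collect()
    }
}

fn main() {
    let mut stack = ArchiveStack { paths: Vec::new() };
    stack.push("outer.tar".into());
    stack.push("inner.zip".into());
    stack.push("file.txt".into());
    assert_eq!(
        stack.nested_path(),
        PathBuf::from("outer.tar/inner.zip/file.txt")
    );
}
```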
Enums§
- Supported block compression algorithms.
- Supported file/compression formats.