# archive-to-parquet

A small tool that recursively extracts files from many archives and writes their contents to a single Parquet file. It supports the following archive and compression formats:
- Tar
- Zip
- Gzip
- Zstd
- Bzip2
- Xz
## Features

- Archive members are hashed with SHA-256, and the hash is included in the output
- Recursive extraction of archives within archives
- Filtering by file size or type (binary/text)
- Content-based deduplication
- Speed! :rocket:
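The hashing and deduplication features go hand in hand: once each member's content is hashed, duplicates can be dropped by keeping only the first entry per digest. The following is a hypothetical Python sketch of that pattern, not the tool's Rust implementation.

```python
import hashlib

def dedup(entries):
    """Keep the first occurrence of each distinct content.

    entries: iterable of (path, bytes) pairs.
    Returns a list of (path, sha256_hex, bytes) rows, deduplicated
    by content digest (sketch of content-based deduplication).
    """
    seen = set()
    rows = []
    for path, data in entries:
        digest = hashlib.sha256(data).hexdigest()
        if digest in seen:
            continue  # identical content already emitted
        seen.add(digest)
        rows.append((path, digest, data))
    return rows

rows = dedup([
    ("a/readme.txt", b"hello"),
    ("b/copy.txt", b"hello"),   # duplicate content, dropped
    ("c/other.txt", b"world"),
])
print([p for p, _, _ in rows])  # ['a/readme.txt', 'c/other.txt']
```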
## Example: extracting all files within a Docker image

```shell
$ skopeo copy docker://python:latest oci:docker-image/ --all
$ archive-to-parquet output.parquet docker-image/blobs/**/*
INFO archive_to_parquet: Converting 112 files to Parquet
INFO archive_to_parquet: Options: ConvertionOptions(include=All, unique=false, compression=SNAPPY, min_size=None, size_range=None, batch_count=14, batch_size=100.00 MB)
...
INFO archive_to_parquet::channel: File written in 37 seconds. size=9.43 GB, batches=415 (0 pending), entries: in=263,862 out=263,862 bytes: in=25.23 GB out=25.23 GB
```
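Docker image blobs are themselves gzipped tar layers, so this example exercises the recursive extraction feature: archives found inside archives are descended into until plain files are reached. A minimal stdlib-only sketch of that recursion (zip and tar only; the real tool handles more formats and streams the data):

```python
import io
import tarfile
import zipfile

def walk(name, data, emit):
    """Recursively descend into zip/tar members; emit leaf files."""
    if zipfile.is_zipfile(io.BytesIO(data)):
        with zipfile.ZipFile(io.BytesIO(data)) as zf:
            for info in zf.infolist():
                if not info.is_dir():
                    walk(f"{name}/{info.filename}", zf.read(info), emit)
    else:
        try:
            with tarfile.open(fileobj=io.BytesIO(data)) as tf:
                for m in tf.getmembers():
                    if m.isfile():
                        walk(f"{name}/{m.name}", tf.extractfile(m).read(), emit)
        except tarfile.ReadError:
            emit(name, data)  # not an archive: a leaf file

# Build a zip that contains a tar that contains one file.
inner = io.BytesIO()
with tarfile.open(fileobj=inner, mode="w") as tf:
    payload = b"nested!"
    ti = tarfile.TarInfo("deep.txt")
    ti.size = len(payload)
    tf.addfile(ti, io.BytesIO(payload))
outer = io.BytesIO()
with zipfile.ZipFile(outer, "w") as zf:
    zf.writestr("inner.tar", inner.getvalue())

found = []
walk("outer.zip", outer.getvalue(), lambda p, d: found.append((p, d)))
print(found)  # [('outer.zip/inner.tar/deep.txt', b'nested!')]
```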
## Usage

```
<OUTPUT>    Output
<PATHS>...
```