A high-performance archiving tool written in C++ that creates compressed archives with intelligent file deduplication. Unlike traditional TAR, this tool eliminates duplicate files by storing them only once and using references, significantly reducing archive size for datasets with repeated content.
- 📦 Custom Archive Format: Uses TLV (Tag-Length-Value) structure for efficient storage
- 🔄 Intelligent Deduplication: Automatically detects and eliminates duplicate files
- 🚀 High Performance: Fast archiving and extraction with optimized algorithms
- 🧮 Content-Based Hashing: Uses XXHash for fast and reliable duplicate detection
- 📁 Directory Support: Preserves complete directory structures
- 🔍 Archive Inspection: List archive contents without extraction
- ✅ Comprehensive Testing: Full test suite with performance benchmarks
- Archive Engine (
Archive.cpp/hpp): Main archiving logic with create/extract/list operations - TLV Format (
Tlv.hpp): Tag-Length-Value structure for efficient data storage - Deduplication: Content-based file comparison using XXHash fingerprinting
- File Handling: Cross-platform file operations with error handling
The tool uses a custom TLV-based format:
TAG_TYPE | DESCRIPTION
------------|--------------------------------------------------
FILE | File entry with metadata and data
DIR_ | Directory entry
NAME | File/directory name
DATA | Actual file content
DATR | Data reference (points to duplicate file data)
- Fingerprint Generation: Calculate XXHash64 for each file
- Duplicate Detection: Compare fingerprints and verify byte-by-byte
- Reference Storage: Store duplicates as references to original data
- Space Optimization: Achieve significant compression for redundant data
- C++17 or higher
- CMake 3.10+
- xxHash (included): Fast hashing algorithm for duplicate detection
- Standard Libraries:
<filesystem>,<fstream>,<map>,<optional>
Ubuntu/Debian:
sudo apt-get update
sudo apt-get install cmake build-essential g++CentOS/RHEL:
sudo yum install cmake gcc-c++ makemacOS:
brew install cmake- Clone and navigate to project:
git clone <repository-url>
cd custom-tar- Configure with CMake:
cmake -S . -B build -DCMAKE_BUILD_TYPE=Release- Build the project:
cmake --build build --parallel $(nproc)- Verify build:
ls -la build/bin/custom-tar
./build/bin/custom-tar helpThe tool supports three main operations:
./build/bin/custom-tar create <archive_path> <input_path>Examples:
# Archive a directory
./build/bin/custom-tar create backup.mtar ./documents/
# Archive specific files
./build/bin/custom-tar create project.mtar ./src/./build/bin/custom-tar list <archive_path>Example:
./build/bin/custom-tar list backup.mtar./build/bin/custom-tar extract <archive_path> <output_path>Examples:
# Extract to specific directory
./build/bin/custom-tar extract backup.mtar ./restored/
# Extract to current directory
./build/bin/custom-tar extract project.mtar ./# Test with sample data
python3 generate_test_data.py
./build/bin/custom-tar create test.mtar test_logs/
./build/bin/custom-tar list test.mtar
./build/bin/custom-tar extract test.mtar extracted/# Check archive efficiency
ls -lh input_directory/ # Original size
./build/bin/custom-tar create archive.mtar input_directory/
ls -lh archive.mtar # Archive sizeThe project includes comprehensive test suites covering functionality, performance, and edge cases.
# Install Python for test runner
sudo apt-get install python3
# Ensure project is built
cmake --build buildcd tests
python3 run_tests.pypython3 -m unittest test_archive.CustomTarTest -vCoverage:
- Archive creation and validation
- Directory structure preservation
- File extraction and integrity verification
- Empty directory handling
- Large file support (1MB+)
- Duplicate file detection and handling
python3 -m unittest test_performance.CustomTarPerformanceTest -vCoverage:
- Many small files (1000+ files)
- Deep directory structures (20+ levels)
- Special filenames (spaces, unicode, special chars)
- Binary file handling
- Compression ratio analysis
- Deduplication efficiency testing
- Basic Tests: File integrity, directory structure, duplicate handling
- Performance Tests: Compression ratios, deduplication efficiency
- Edge Cases: Special characters, large files, empty directories
| Data Type | Expected Compression | Use Case |
|---|---|---|
| Highly Repetitive | 60-70% ratio | Logs, generated files |
| Random Data | 95-120% ratio | Binary data, encrypted files |
| Mixed with Duplicates | 70-85% ratio | Backup scenarios |
| Many Small Files | 100-300% ratio | Source code, configs |
- Backup Scenarios: 20-80% space savings
- Log Files: 50-90% space savings
- Development Projects: 10-40% space savings
- Generated Data: 70-95% space savings
# Clean and rebuild
cmake --build build --target clean
cmake -S . -B build -DCMAKE_BUILD_TYPE=Debug
cmake --build buildchmod +x build/bin/custom-tar# Check executable exists
ls -la build/bin/custom-tar
# Run individual test
python3 -m unittest test_archive.CustomTarTest.test_create_archive -vFor development and troubleshooting:
cmake -S . -B build -DCMAKE_BUILD_TYPE=Debug
cmake --build build
gdb ./build/bin/custom-tar├── Archive.cpp/hpp # Core archiving engine
├── main.cpp # CLI interface
├── xxhash.c/h # Hashing library
├── Tlv.hpp # TLV format definitions
├── tests/ # Test suite
├── build/ # Build artifacts
└── test_logs/ # Sample test data
- New Archive Format: Extend TLV tags in
Tlv.hpp - Compression: Add compression layer in
Archive.cpp - Encryption: Implement in file read/write operations
- CLI Options: Extend argument parsing in
main.cpp
- Follow C++17 standards
- Add tests for new features
- Update documentation
- Ensure cross-platform compatibility
This project is licensed under the MIT License - see the LICENSE file for details.
Tested on Ubuntu 22.04, Intel i7, 16GB RAM:
- Creation Speed: ~50MB/s for mixed data
- Extraction Speed: ~80MB/s average
- Deduplication: Real-time during archiving
- Memory Usage: <100MB for typical workloads
- Compression integration (zlib/lz4)
- Encryption support (AES-256)
- Incremental backups
- Network streaming
- GUI interface
- Plugin architecture