
go-smart-deduper

A high-performance file deduplication tool that detects and manages duplicate files using content hashing and intelligent similarity analysis.

Features

  • Fast and Concurrent: Uses Go's goroutines and worker pools for high-performance parallel file hashing
  • SHA-256 Hashing: Secure and reliable content-based duplicate detection
  • Fuzzy Hashing: Optional similarity detection for finding near-duplicate files
  • Recursive Scanning: Scan entire directory trees with configurable depth
  • Smart Filtering:
    • Exclude patterns (glob-style)
    • File size thresholds (min/max)
    • Hidden file handling
    • Symbolic link following
  • Multiple Output Formats: Text, JSON, and CSV reports
  • Interactive Modes:
    • Terminal UI (TUI) using Bubbletea
    • Interactive CLI deletion
  • Flexible Actions:
    • Dry-run mode (preview without changes)
    • Automatic deletion (keeps oldest file)
    • Hard-link replacement (saves disk space)
  • Cross-Platform: Works on Linux, macOS, and Windows

Installation

From Source

git clone https://github.com/BaseMax/go-smart-deduper.git
cd go-smart-deduper
go build -o go-smart-deduper

Using Go Install

go install github.com/BaseMax/go-smart-deduper@latest

Usage

Basic Usage

Scan the current directory for duplicates:

go-smart-deduper

Scan specific directories:

go-smart-deduper /path/to/dir1 /path/to/dir2

Filtering Options

Set minimum file size (in bytes):

go-smart-deduper --min-size 1024

Set maximum file size:

go-smart-deduper --max-size 10485760  # 10MB

Exclude patterns:

go-smart-deduper --exclude "*.tmp" --exclude "*.log"
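As a rough illustration of how glob-style exclusion behaves, patterns like `*.tmp` can be checked with Go's `filepath.Match`. This sketch assumes patterns are matched against each file's base name, which is the usual convention but not confirmed tool behavior:

```go
package main

import (
	"fmt"
	"path/filepath"
)

// excluded reports whether a path matches any glob pattern.
// Assumption for this sketch: patterns are matched against the
// file's base name, the way "*.tmp" is normally used.
func excluded(patterns []string, path string) bool {
	name := filepath.Base(path)
	for _, pat := range patterns {
		if ok, err := filepath.Match(pat, name); err == nil && ok {
			return true
		}
	}
	return false
}

func main() {
	patterns := []string{"*.tmp", "*.log"}
	for _, p := range []string{"/work/cache.tmp", "/work/app.log", "/work/data.csv"} {
		fmt.Printf("%s excluded: %v\n", p, excluded(patterns, p))
	}
}
```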

Include hidden files:

go-smart-deduper --exclude-hidden=false

Follow symbolic links:

go-smart-deduper --follow-symlinks

Output Formats

Generate JSON report:

go-smart-deduper --format json

Generate CSV report:

go-smart-deduper --format csv --output duplicates.csv

Verbose output:

go-smart-deduper -v

Action Modes

Dry-run (preview without making changes):

go-smart-deduper --delete --dry-run

Interactive deletion (choose which files to delete):

go-smart-deduper --interactive

Automatic deletion (keeps oldest file in each duplicate group):

go-smart-deduper --delete

Hard-link replacement (replace duplicates with hard links to save space):

go-smart-deduper --hard-link

Terminal UI Mode

Launch the interactive TUI:

go-smart-deduper --tui

In TUI mode:

  • Use arrow keys or j/k to navigate
  • Press space to select duplicate groups
  • Press q to quit

Note: TUI mode currently displays duplicates for review only. To delete files, use CLI mode with --interactive, --delete, or --hard-link options.

Advanced Options

Use fuzzy hashing for similarity detection:

go-smart-deduper --fuzzy

Set number of worker threads:

go-smart-deduper --workers 8

Combine multiple options:

go-smart-deduper /home/user/Documents \
  --min-size 1024 \
  --exclude "*.tmp" \
  --exclude-hidden \
  --workers 8 \
  --format json \
  --output report.json \
  -v

Output Examples

Text Output

=== Duplicate Files Report ===

Group 1 (Hash: d2a84f4b8b650937...):
  Count: 3 files
  Size: 12 B per file
  Wasted space: 24 B
  Files:
    - /tmp/test-deduper/file1.txt (modified: 2025-12-19 17:28:05)
    - /tmp/test-deduper/file2.txt (modified: 2025-12-19 17:28:05)
    - /tmp/test-deduper/subdir/file4.txt (modified: 2025-12-19 17:28:12)

=== Summary ===
Total duplicate groups: 1
Total duplicate files: 3
Total wasted space: 24 B

JSON Output

{
  "duplicates": [
    {
      "hash": "d2a84f4b8b650937ec8f73cd8be2c74add5a911ba64df27458ed8229da804a26",
      "files": [
        "/tmp/test-deduper/file1.txt",
        "/tmp/test-deduper/file2.txt",
        "/tmp/test-deduper/subdir/file4.txt"
      ],
      "size": 12,
      "count": 3
    }
  ],
  "summary": {
    "total_files": 3,
    "total_groups": 1,
    "wasted_space": 24
  }
}

CSV Output

Group,Hash,File,Size,Modified
1,d2a84f4b8b650937...,/tmp/test-deduper/file1.txt,12,2025-12-19 17:28:05
1,d2a84f4b8b650937...,/tmp/test-deduper/file2.txt,12,2025-12-19 17:28:05
1,d2a84f4b8b650937...,/tmp/test-deduper/subdir/file4.txt,12,2025-12-19 17:28:12

Architecture

The tool is organized into several packages:

  • scanner: Recursive directory scanning with filtering
  • hasher: SHA-256 and fuzzy hashing implementation
  • deduper: Duplicate detection with worker pool pattern
  • reporter: Report generation in multiple formats
  • tui: Terminal UI using Bubbletea
  • cmd: Command-line interface using Cobra

Performance

The tool uses several optimizations for performance:

  1. Size-based pre-filtering: Only files with identical sizes are compared
  2. Worker pool pattern: Concurrent file hashing with configurable workers
  3. Buffered I/O: Efficient file reading with 64KB buffers
  4. Early termination: Stops processing when no duplicates are possible

Safety Features

  • Dry-run mode: Preview changes before committing
  • Interactive mode: Manual control over deletions
  • Oldest-first preservation: Automatic mode keeps the oldest file
  • Error handling: Continues scanning even if some files are inaccessible

Command-Line Options

Flags:
  -d, --delete            Automatically delete duplicates (keep oldest)
  -n, --dry-run           Don't actually delete or modify files
  -e, --exclude strings   Exclude patterns (glob style)
      --exclude-hidden    Exclude hidden files and directories (default true)
      --follow-symlinks   Follow symbolic links
  -f, --format string     Output format: text, json, csv (default "text")
      --fuzzy             Use fuzzy hashing for similarity detection
      --hard-link         Replace duplicates with hard links
  -h, --help              help for go-smart-deduper
  -i, --interactive       Interactive deletion mode
      --max-size int      Maximum file size in bytes (0 = no limit)
      --min-size int      Minimum file size in bytes
  -o, --output string     Output file (default: stdout)
  -p, --path strings      Paths to scan (can specify multiple) (default [.])
  -t, --tui               Use Terminal UI mode
  -v, --verbose           Verbose output
  -w, --workers int       Number of worker goroutines for hashing (default 4)

Testing

Run the test suite:

go test ./pkg/...

Run tests with coverage:

go test ./pkg/... -cover

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

License

This project is licensed under the MIT License - see the LICENSE file for details.
