
klynastor/dedupe


dedupe

No-frills file deduplication for extent/reflink-based filesystems (e.g. XFS, BTRFS).

dedupe is a high-performance file-deduplication program with low memory usage, fast file comparison using the XXHash library, and kernel-level deduplication using extents. It performs whole-file deduplication only: it hashes only files whose byte sizes match and that do not already share extents. By default, dedupe stores hashes by inode and modification time in /etc/dedupe.db, allowing for a quick startup on future runs.
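The candidate selection described above can be sketched in a few lines of Python. This is only an illustration, not dedupe's actual implementation: the function names are hypothetical, BLAKE2b from the standard library stands in for XXHash, and the check for already-shared extents is omitted.

```python
import hashlib
import os
import stat
from collections import defaultdict

def hash_file(path, chunk_size=1 << 20):
    """Hash a whole file in chunks. BLAKE2b is a stand-in for XXHash (assumption)."""
    h = hashlib.blake2b(digest_size=16)
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            h.update(chunk)
    return h.hexdigest()

def find_duplicate_groups(paths, min_size=4096):
    """Return lists of paths whose sizes and content hashes both match."""
    by_size = defaultdict(list)
    for path in paths:
        st = os.lstat(path)
        # Only regular files at or above the minimum size are candidates.
        if stat.S_ISREG(st.st_mode) and st.st_size >= min_size:
            by_size[st.st_size].append(path)

    groups = []
    for same_size in by_size.values():
        if len(same_size) < 2:
            continue  # a unique size cannot have a duplicate, so it is never hashed
        by_hash = defaultdict(list)
        for path in same_size:
            by_hash[hash_file(path)].append(path)
        groups.extend(g for g in by_hash.values() if len(g) > 1)
    return groups
```

Grouping by size first means that most files are never read at all, which is where the bulk of the time savings comes from.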

Deduplication is performed by issuing the FIDEDUPERANGE ioctl() call to the kernel on a pair of files. This is an atomic filesystem operation in which the kernel reads and compares the two files for equality before merging any extents. It is therefore a safe procedure that will not cause any loss of data, even when those files are being written to concurrently.
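For readers unfamiliar with the ioctl, the following Python sketch shows roughly what a FIDEDUPERANGE request looks like. It is Linux-only, the helper names are hypothetical, and it is not taken from dedupe's source. The constants and structure layout follow <linux/fs.h>. The call only succeeds on a filesystem that supports deduplication.

```python
import fcntl
import os
import struct

# Linux-specific values from <linux/fs.h>.
FIDEDUPERANGE = 0xC0189436        # _IOWR(0x94, 54, struct file_dedupe_range)
FILE_DEDUPE_RANGE_SAME = 0
FILE_DEDUPE_RANGE_DIFFERS = 1

HEADER = struct.Struct("=QQHHI")  # src_offset, src_length, dest_count, reserved1, reserved2
INFO = struct.Struct("=qQQiI")    # dest_fd, dest_offset, bytes_deduped, status, reserved

def pack_request(src_offset, length, dest_fd, dest_offset):
    """Build the ioctl argument: one header followed by one destination entry."""
    return bytearray(HEADER.pack(src_offset, length, 1, 0, 0)
                     + INFO.pack(dest_fd, dest_offset, 0, 0, 0))

def unpack_result(buf):
    """Return (bytes_deduped, status) as filled in by the kernel."""
    _fd, _off, bytes_deduped, status, _res = INFO.unpack_from(buf, HEADER.size)
    return bytes_deduped, status

def dedupe_pair(src_path, dest_path):
    """Ask the kernel to share dest's extents with src; return bytes deduplicated."""
    total = 0
    with open(src_path, "rb") as src, open(dest_path, "r+b") as dest:
        size = os.fstat(src.fileno()).st_size
        if os.fstat(dest.fileno()).st_size != size:
            return 0  # whole-file deduplication only makes sense for equal sizes
        while total < size:
            buf = pack_request(total, size - total, dest.fileno(), total)
            fcntl.ioctl(src.fileno(), FIDEDUPERANGE, buf, True)
            done, status = unpack_result(buf)
            if status < 0:
                raise OSError(-status, os.strerror(-status))
            if status == FILE_DEDUPE_RANGE_DIFFERS or done == 0:
                break  # contents differ, or the kernel made no progress
            total += done
    return total
```

The loop is there because the kernel may deduplicate fewer bytes than requested in a single call. The comparison itself happens inside the kernel, so a file that changed after hashing is reported as differing instead of being merged.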

Program Options

Usage: dedupe [options] [files|directories] ...
Options:
  -c         Clear hash database before beginning
  -d dbfile  Database for saving hashes between runs; default=/etc/dedupe.db
  -f         Force hashing (do not use hash db)
  -n         Do not deduplicate (still updates hash db)
  -q         Be quiet (don't display statistics)
  -r         Recurse into subdirectories
  -s size    Only dedupe files of 'size' or greater; default=4096
  -x         Only scan one filesystem
  -v         Be verbose; maximum level is -vvv
  -V         Display version information and exit

-c – Clear hash database before beginning

This flag skips reading the hash database (/etc/dedupe.db by default) at the start of the run. It can be used to "forget" all of the previous hashes, so that only files scanned during this run are stored in the database at the end.

-d dbfile – Database for saving hashes between runs

This flag specifies the file to use in place of /etc/dedupe.db, for both loading hashes from a previous run and saving new hashes at the end of the run.
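To illustrate why a hash database keyed by inode and modification time speeds up later runs, here is a minimal Python sketch. The JSON file format, the key layout, and all function names are assumptions made for this example; dedupe's real database format is not described here. BLAKE2b again stands in for XXHash.

```python
import hashlib
import json
import os

def load_db(dbfile):
    """Load the cache; a missing file simply means an empty cache."""
    try:
        with open(dbfile) as f:
            return json.load(f)
    except FileNotFoundError:
        return {}

def save_db(dbfile, db):
    """Write to a temporary file first so an interrupted run cannot truncate the cache."""
    tmp = dbfile + ".tmp"
    with open(tmp, "w") as f:
        json.dump(db, f)
    os.replace(tmp, dbfile)

def cached_hash(path, db):
    """Return the file's hash, rehashing only if the inode is new or its mtime changed."""
    st = os.stat(path)
    key = f"{st.st_dev}:{st.st_ino}"  # hypothetical key layout: device and inode
    entry = db.get(key)
    if entry and entry["mtime_ns"] == st.st_mtime_ns:
        return entry["hash"]          # cache hit: the file is not read at all
    with open(path, "rb") as f:
        # Reads the whole file at once for brevity; a real tool would hash in chunks.
        digest = hashlib.blake2b(f.read(), digest_size=16).hexdigest()
    db[key] = {"mtime_ns": st.st_mtime_ns, "hash": digest}
    return digest
```

Entries for deleted files are never looked up again, but nothing removes them either, which is why a database like this grows over time.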

-f – Force hashing (do not use hash db)

This flag bypasses the hash database altogether. No hash database is loaded at the start of the run or saved at the end of the run. This flag is used mainly for deduplicating removable filesystems, whose contents can change between subsequent mounts.

-n – Do not deduplicate (still updates hash db)

This flag performs a dry-run of all of the filesize matching and hashing functionality, except no file deduplication will be performed. The /etc/dedupe.db hash database is still updated at the end of the run (unless the -f flag was specified).

-q – Be quiet (don't display statistics)

This flag omits the output line of statistics (count of files deduplicated, and number of bytes saved) at the end of the run.

-r – Recurse into subdirectories

This flag is required to recurse into subdirectories when a directory is specified as an argument on the command-line. Without this flag, only files in the specified directory are scanned.

-s size – Only dedupe files of 'size' or greater

This flag sets the minimum filesize for consideration when deduplicating files, in bytes. By default, this value is 4096. No file-size units (K, M, G, etc.) are currently accepted.

-x – Only scan one filesystem

This flag, used in conjunction with the -r (recursion) flag, only scans the filesystems containing the directories specified on the command line. It will not recurse into directories that are under a different mount point.
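The combined effect of -r and -x can be approximated with a directory walk that compares device IDs. The Python sketch below is illustrative only: walk_files and its parameters are hypothetical names, and the real program may detect mount points differently.

```python
import os

def walk_files(root, recurse=True, one_filesystem=False):
    """Yield regular-file paths under root, optionally staying on root's filesystem."""
    root_dev = os.lstat(root).st_dev
    for dirpath, dirnames, filenames in os.walk(root):
        if not recurse:
            dirnames[:] = []  # like omitting -r: only files directly in root
        elif one_filesystem:
            # A subdirectory with a different device ID is a mount point: prune it.
            dirnames[:] = [d for d in dirnames
                           if os.lstat(os.path.join(dirpath, d)).st_dev == root_dev]
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.isfile(path) and not os.path.islink(path):
                yield path
```

Editing dirnames in place is what tells os.walk not to descend into the pruned directories.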

-v – Be verbose

This flag increases the verbosity of the output of the program. It can be specified at most 3 times.

Verbosity level 1 displays when each file gets deduplicated (along with the number of bytes saved per file), and it reports when a file has been modified between the initial scan and the deduplication pass.

Verbosity level 2 additionally displays when individual deduplication attempts are rejected by the kernel, such as when a file's contents change between hashing and the deduplication ioctl.

Verbosity level 3 additionally displays one line of output for every file that is hashed, and it displays the final number of hash entries saved to /etc/dedupe.db at the end of the run.

-V – Display version information and exit

This flag displays the program version and release date, then exits.

Build Prerequisites

This package requires the XXHash library to be installed in the system. This can be downloaded from https://xxhash.com.

Build Instructions

Run make to compile the package, or make install to install it in the default location (/usr/local/sbin/dedupe). To install elsewhere, set PREFIX: for example, make install PREFIX=/usr installs it in /usr/sbin/dedupe.

Examples

Use the following command to deduplicate several files by name:

dedupe file1 file2 file3

The following /etc/crontab cron job can be used to deduplicate all filesystems mounted under /, once per week (Sundays at 00:01):

1 0 * * sun root dedupe -qr /

Or specify each filesystem by mount point directly:

1 0 * * sun root dedupe -xqr / /usr /home

Notes

  • The /etc/dedupe.db file will eventually grow and contain stale hashes of files deleted long ago. The -c flag (or removing the database file) can be used periodically to clear stale entries. This can be performed once per year (or once per month depending on filesystem usage). The next time dedupe is run, it will rehash all remaining same-size files and recreate the database upon completion.

    This example cron entry removes the file every December 1st:

    0 0 1 12 * root rm -f /etc/dedupe.db
    

License

This project is licensed under the MIT License. See LICENSE for details.
