No-frills file deduplication for extent/reflink-based filesystems (e.g. ext4, XFS, BTRFS).
dedupe is a high-performance file-deduplication program with low memory usage. It compares files
quickly using the XXHash library and deduplicates them at the kernel level using extents. It
performs whole-file deduplication only: it hashes only files whose byte sizes match and that do not
already share extents. By default, dedupe stores hashes by inode and modification time in
/etc/dedupe.db, which allows a quick startup on future runs.
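The caching rule described above (a stored hash is tied to an inode and a modification time) can be sketched in C as follows. This is an illustration only: the struct layout, the field names, and the function cache_entry_is_fresh are hypothetical and are not taken from dedupe's source or its database format.

```c
#include <stdbool.h>
#include <stdint.h>
#include <sys/stat.h>
#include <sys/types.h>

/* Hypothetical cache record; NOT dedupe's actual on-disk format. */
struct hash_entry {
    ino_t    ino;    /* inode number the hash was computed for */
    time_t   mtime;  /* modification time when the hash was taken */
    uint64_t hash;   /* 64-bit content hash, e.g. one produced by XXHash */
};

/*
 * Return true when the cached hash may be reused for the file described
 * by st: same inode, and the file has not been modified since hashing.
 */
bool cache_entry_is_fresh(const struct hash_entry *e, const struct stat *st)
{
    return e->ino == st->st_ino && e->mtime == st->st_mtime;
}
```

A file whose entry is not fresh would simply be rehashed. A real implementation might also key on the device number, since inode numbers are only unique within one filesystem; that detail is an assumption here.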
Deduplication is performed with the FIDEDUPERANGE ioctl() call, which asks the kernel to operate on
a pair of files. This is an atomic filesystem operation whereby the kernel reads and compares the
two files for equality before merging any extents. In other words, it is a safe procedure that will
not cause any loss of data, even when those files are being written to concurrently.
Usage: dedupe [options] [files|directories] ...
Options:
  -c          Clear hash database before beginning
  -d dbfile   Database for saving hashes between runs; default=/etc/dedupe.db
  -f          Force hashing (do not use hash db)
  -n          Do not deduplicate (still updates hash db)
  -q          Be quiet (don't display statistics)
  -r          Recurse into subdirectories
  -s size     Only dedupe files of 'size' or greater; default=4096
  -x          Only scan one filesystem
  -v          Be verbose; maximum level is -vvv
  -V          Display version information and exit
The -c flag skips reading the /etc/dedupe.db file at the start of the operation. It can be used
to "forget" all of the previous hashes, storing only newly scanned files in /etc/dedupe.db at
the end of the run.
The -d flag specifies the file to use in place of /etc/dedupe.db, both for loading hashes from a
previous run and for saving new hashes at the end of the run.
The -f flag bypasses the /etc/dedupe.db hash database altogether. No hash database is loaded at
the start of the run or saved at the end of the run. This flag is used mainly for deduplicating
removable filesystems, whose contents can change between subsequent mounts.
The -n flag performs a dry run of all of the file-size matching and hashing functionality, but no
file deduplication is performed. The /etc/dedupe.db hash database is still updated at the end of
the run (unless the -f flag was specified).
The -q flag suppresses the line of statistics (count of files deduplicated and number of bytes saved) normally printed at the end of the run.
The -r flag is required to recurse into subdirectories when a directory is specified as an argument on the command line. Without this flag, only files directly inside the specified directory are scanned.
The -s flag sets the minimum file size, in bytes, for a file to be considered for deduplication. The default is 4096. No file-size units (K, M, G, etc.) are currently accepted.
The -x flag, used in conjunction with the -r (recursion) flag, restricts scanning to the filesystems containing the directories specified on the command line. It will not recurse into directories that are under a different mount point.
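One common way to detect a mount-point crossing during recursion is to compare the st_dev field reported for the starting directory and for each entry beneath it. The sketch below illustrates that general technique; same_filesystem is a hypothetical helper, and whether dedupe implements -x this way is an assumption.

```c
#include <stdbool.h>
#include <sys/stat.h>

/*
 * Return true if both paths exist and reside on the same device.
 * lstat() is used so that a symbolic link is judged by where the link
 * itself lives, not by where it points.
 */
bool same_filesystem(const char *root, const char *child)
{
    struct stat a, b;

    if (lstat(root, &a) != 0 || lstat(child, &b) != 0)
        return false;   /* missing or unreadable path: treat as "do not descend" */
    return a.st_dev == b.st_dev;
}
```

A recursive scanner using this check would skip any subdirectory for which the function returns false relative to the directory given on the command line.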
The -v flag increases the verbosity of the program's output. It can be specified up to three times (-vvv).
Verbosity level 1 displays when each file gets deduplicated (along with the number of bytes saved per file), and it reports when a file has been modified between the initial scan and the deduplication pass.
Verbosity level 2 additionally displays when individual deduplication attempts are rejected by the kernel, such as when a file's contents change between hashing and the deduplication ioctl.
Verbosity level 3 additionally displays one line of output for every file that is hashed, and it
displays the final number of hash entries saved to /etc/dedupe.db at the end of the run.
The -V flag displays the program version and release date, then exits.
This package requires the XXHash library to be installed on the system. It can be downloaded from https://xxhash.com.
Simply run make to compile the package, or make install to install it in the default location
(/usr/local/sbin/dedupe), or e.g. make install PREFIX=/usr to install it in /usr/sbin/dedupe.
To deduplicate several files by name:
dedupe file1 file2 file3
The following /etc/crontab cron job can be used to deduplicate all local filesystems on the
system, once per week:
1 0 * * sun root dedupe -qr /
Or specify each filesystem by mount point directly:
1 0 * * sun root dedupe -xqr / /usr /home
The /etc/dedupe.db file will eventually grow and contain stale hashes of files deleted long ago. The -c flag (or removing the database file) can be used periodically to clear stale entries. This can be performed once per year (or once per month, depending on filesystem usage). The next time dedupe is run, it will rehash all remaining same-size files and recreate the database upon completion. This example cron entry removes the file every December 1st:
0 0 1 12 * root rm -f /etc/dedupe.db
This project is licensed under the MIT License. See LICENSE for details.