#duperemove

Matthias Riße (matrss)
2026-03-07

@dbat The same issue exists in the research data management world with / . One thing that I am doing for our storage servers is to regularly run duperemove on them. It requires filesystem support (xfs/btrfs), but it deduplicates on an extent basis, i.e. below the file level. If the difference between two versions only affects a small part of a file, it should be able to help. I wonder if it could be run as a post-commit hook, or something like that.
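A post-commit hook of that shape might look like this (a minimal sketch; the hook file name, cache path, and flags are my assumptions, not a tested setup):

```shell
# Write a hypothetical git post-commit hook that kicks off a
# deduplication pass in the background, so commits are not blocked.
cat > post-commit <<'EOF'
#!/bin/sh
# Recursively dedupe the working tree; keeping hashes in a cache
# file means repeated runs only rehash changed files.
duperemove -r -d -q --hashfile="$HOME/.cache/duperemove.hash" "$PWD" >/dev/null 2>&1 &
EOF
chmod +x post-commit
```

Installed as `.git/hooks/post-commit`, this would fire after every commit; whether that is too aggressive depends on repository size.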

All good here: I'm just running #duperemove.

Load AVG: 30.3 32.36 24.0
Sergei Trofimovich (trofi@fosstodon.org)
2025-03-31

Today's bug is a `duperemove` infinite looping bug: github.com/markfasheh/duperemo

There `duperemove` was not able to dedupe against a NoCOW file:

$ dd if=/dev/urandom bs=8M count=1 > a   # 8 MiB of random data
$ touch b
$ chattr +C b                            # mark 'b' NoCOW before writing to it
$ cat a >> b                             # same contents as 'a', but NoCOW
$ ./duperemove -d -q --batchsize=0 --dedupe-options=partial,same a b
<hangup>

I noticed it about a month ago but got to debug it only today. It's a 0.15 regression. The fix is trivial once bisected.

#duperemove #bug

2024-12-02

I am a little curious after all whether the duperemove process that has been running since 29.11. will ever finish, or whether it will end up as part of my estate.

#duperemove

Sergei Trofimovich (trofi@fosstodon.org)
2023-11-24

Today's `duperemove` bug is a github.com/markfasheh/duperemo.

There `duperemove` crashes when the file being deduped gets truncated down to zero.

And the bug is already fixed!

#duperemove #bug

Sergei Trofimovich (trofi@fosstodon.org)
2023-11-21

`duperemove-0.14` is a lot faster than `duperemove-0.13`!

Unfortunately it crashes sometimes on my input data. It takes about 10 minutes to observe the crash.

I wrote a trivial fuzzer to generate funny filesystem states for `duperemove`. Guess how long it takes to crash `duperemove` with it.

Spoiler: trofi.github.io/posts/305-fuzz
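The actual fuzzer is described in the linked post; purely as an illustration of the idea, a filesystem-state mutator can be as small as this (every detail here is an assumption, not the real fuzzer):

```shell
# Randomly rewrite, append, truncate, and delete files in a small
# pool; a dedupe tool would then be pointed at the result between rounds.
mkdir -p fuzzdir
for i in $(seq 1 200); do
  r=$(od -An -N1 -tu1 /dev/urandom | tr -d ' ')   # random byte 0..255
  f="fuzzdir/f$((r % 8))"                          # pick 1 of 8 files
  case $((r % 4)) in
    0) head -c $((r * 16)) /dev/urandom > "$f" ;;  # rewrite
    1) head -c $((r * 16)) /dev/urandom >> "$f" ;; # append
    2) : > "$f" ;;                                 # truncate to zero
    3) rm -f "$f" ;;                               # delete
  esac
done
```

Looping a dedupe run against such a constantly mutating tree is a cheap way to hit truncation and rescan races.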

#duperemove #bug

Sergei Trofimovich (trofi@fosstodon.org)
2023-11-10

Today's `duperemove` bug is a github.com/markfasheh/duperemo.

There the rather aggressive `--dedupe-options=partial` option used a less optimized `sqlite` query to fetch unique file extents. That caused a full database scan when data was queried for each individual file.

The fix switched the `JOIN` query to a nested `SELECT` query, converting the full scan into an index lookup.
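I don't know the exact queries involved, but the general effect is easy to see with `EXPLAIN QUERY PLAN` on a toy schema (a hypothetical table, not duperemove's; assumes the `sqlite3` CLI is available):

```shell
# Create a toy table resembling per-file extent records.
sqlite3 toy.db "CREATE TABLE extents (fileid INTEGER, off INTEGER, len INTEGER);"
# Without an index, a per-file lookup is a full-table SCAN...
sqlite3 toy.db "EXPLAIN QUERY PLAN SELECT * FROM extents WHERE fileid = 1;"
# ...after adding an index on fileid, the same lookup becomes a SEARCH.
sqlite3 toy.db "CREATE INDEX extents_fileid ON extents(fileid);"
sqlite3 toy.db "EXPLAIN QUERY PLAN SELECT * FROM extents WHERE fileid = 1;"
```

Run per file over a large database, the difference between SCAN and SEARCH is exactly a quadratic-versus-linear split.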

#duperemove #bug

Sergei Trofimovich (trofi@fosstodon.org)
2023-11-09

Today's `duperemove` bug is a minor accounting bug: github.com/markfasheh/duperemo

$ ls -lh /nix/var/nix/db/db.sqlite
1.4G /nix/var/nix/db/db.sqlite

Before the change:

$ ./show-shared-extents /nix/var/nix/db/db.sqlite
/nix/var/nix/db/db.sqlite: 27065321263104 shared bytes

After the change:

$ ./show-shared-extents /nix/var/nix/db/db.sqlite
/nix/var/nix/db/db.sqlite: 1169276928 shared bytes

The size reduction is not as impressive as initially reported :)
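For scale (just arithmetic on the numbers above): the pre-fix figure works out to about 24 TiB of "shared bytes" inside a 1.4G file, which is clearly impossible, while the post-fix figure is a plausible ~1.1 GiB:

```shell
# Convert the reported byte counts to more readable units.
echo "$((27065321263104 / 1024 / 1024 / 1024 / 1024)) TiB"  # before the fix
echo "$((1169276928 / 1024 / 1024)) MiB"                    # after the fix
```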

#duperemove #bug

Sergei Trofimovich (trofi@fosstodon.org)
2023-11-07

Today's bug is a `duperemove` quadratic slowdown: github.com/markfasheh/duperemo

There `duperemove` was struggling to dedupe small files inlined into metadata entries. It kept trying to dedupe all of them as a single set (even when the files' contents did not match).

The fix is a one-liner: just don't track non-dedupable files.

Without the fix the dedupe run never finished on my system; I always had to run it on a subset to get any progress. Now the whole run takes 20 minutes.
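Not duperemove's actual code, but the shape of the fix can be sketched in shell: group contents by hash and drop singleton groups up front, instead of carrying unique files into the dedupe stage (illustrative only; uses `md5sum` purely for grouping):

```shell
# Build a tiny pool: two identical files and one unique file.
mkdir -p pool
printf x > pool/a; printf x > pool/b; printf y > pool/c
# Hash every file; keep only hash groups with more than one member.
candidates=$(md5sum pool/* | sort |
  awk '{n[$1]++; f[$1]=f[$1]" "$2} END {for (h in n) if (n[h] > 1) print f[h]}')
echo "$candidates"
```

Dropping the singletons keeps the candidate set small, so mostly-unique data no longer drags the dedupe stage into pathological behavior.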

#duperemove #bug

Sergei Trofimovich (trofi@fosstodon.org)
2023-11-01

It feels like `duperemove` could work a lot faster than it does today.

What would it take to get a 2x speedup on small files? A one-liner: github.com/markfasheh/duperemo

There is still a ton of low-hanging fruit hiding in there.

#duperemove #bug

Sergei Trofimovich (trofi@fosstodon.org)
2023-10-31

Today's `duperemove` bug is a hangup on a directory with 1 million unique 1KB files: github.com/markfasheh/duperemo

In theory it should take about a minute to hash every file and less than a minute to find out that all the files have unique hashes.

In practice the process gets stuck somewhere in the middle.
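A smaller-scale version of that setup is easy to generate (a sketch with 1000 files rather than a million; the directory name and sizes are my choices, not the exact reproducer):

```shell
# Create many small files, each with unique random contents.
mkdir -p testdir
for i in $(seq 1 1000); do
  head -c 1024 /dev/urandom > "testdir/f$i"
done
```

Scaling the loop count up to a million reproduces the setup from the report.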

#duperemove #bug

Sergei Trofimovich (trofi@fosstodon.org)
2023-09-13

Today's (or rather this month's) bug is a quadratic slowdown of incremental `duperemove` runs.

There, running `duperemove` incrementally over one directory at a time caused `duperemove` to rescan all previously scanned files over and over.

In github.com/markfasheh/duperemo JackSlateur added a `--dedupe-options=norescan_files` option to avoid the rescans.
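Assuming the option works as described, incremental usage would look something like this (the directory names and shared hashfile path are my assumptions; shown as a usage fragment, not a tested run):

```shell
# Each run reuses the same hashfile; with norescan_files, files
# already recorded there are not re-read on later runs.
duperemove -r -d --hashfile=dedupe.db --dedupe-options=norescan_files dir1
duperemove -r -d --hashfile=dedupe.db --dedupe-options=norescan_files dir2
```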

Meanwhile I found a few more aggressive deduping options and keep reporting various failures there. I hope we'll get it working soon.

#duperemove #bug
