@dbat The same issue exists in the research data management world with #DataLad / #gitAnnex. One thing I do on our storage servers is regularly run #duperemove on them. It requires filesystem support (XFS/Btrfs), but it deduplicates on an extent basis, so below the file level. If the difference between two versions only affects a small part of a file, it should be able to help. I wonder whether it could be run as a post-commit hook, or something like that.
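
To make the hook idea a bit more concrete, here is a rough, untested-in-production sketch of what a `.git/hooks/post-commit` script could look like. It assumes the repository is a git-annex one (so the file content lives under `.git/annex/objects`), that `duperemove` is on the `PATH`, and that the repository sits on a filesystem that supports extent deduplication. The helper name `build_duperemove_command` and the hashfile location `.git/duperemove.hash` are made up for illustration; only the `-d`, `-r` and `--hashfile` options are actual duperemove flags.

```python
#!/usr/bin/env python3
"""Hypothetical post-commit hook: deduplicate git-annex objects with duperemove."""
import shutil
import subprocess
import sys
from pathlib import Path


def build_duperemove_command(target, hashfile=None):
    """Return the duperemove argv for recursively deduplicating `target`."""
    # -d submits the dedupe requests (without it duperemove only reports),
    # -r recurses into directories.
    cmd = ["duperemove", "-d", "-r"]
    if hashfile is not None:
        # A persistent hashfile lets later runs skip files that did not change.
        cmd.append(f"--hashfile={hashfile}")
    cmd.append(str(target))
    return cmd


def main():
    if shutil.which("duperemove") is None:
        # Nothing to do on machines without duperemove; never fail the hook.
        return 0

    # Ask git where the repository metadata lives (hooks run in the work tree).
    git_dir = subprocess.run(
        ["git", "rev-parse", "--git-dir"],
        check=True, capture_output=True, text=True,
    ).stdout.strip()
    git_dir = Path(git_dir).resolve()

    objects = git_dir / "annex" / "objects"
    if not objects.is_dir():
        # Not a git-annex repository (or no annexed content yet).
        return 0

    # Assumed hashfile location; any writable path would do.
    hashfile = git_dir / "duperemove.hash"
    result = subprocess.run(build_duperemove_command(objects, hashfile))
    if result.returncode != 0:
        print("duperemove exited with", result.returncode, file=sys.stderr)
    # post-commit cannot abort the commit anyway, so always report success.
    return 0


if __name__ == "__main__":
    sys.exit(main())
```

Scanning the whole object store on every commit may well be too slow for large repositories, which is why the sketch keeps a hashfile around; a periodic cron job or timer might still turn out to be the more practical choice.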







