Say you have a bunch of files with identical content but different file names strewn all over a directory tree. How would you identify the duplicates so that you can work towards eliminating them?

Here is some CLI magic that helps you do so (sourced from CommandLineFu).

find -not -empty -type f -printf "%s\n" | sort -rn | uniq -d | xargs -I{} -n1 find -type f -size {}c -print0 | xargs -0 md5sum | sort | uniq -w32 --all-repeated=separate

The first column is the MD5 hash and the second is the file path, relative to the directory you ran the command in. Rows with the same hash, i.e. files with identical content, are grouped together, and the groups are separated by a blank line.
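
If the one-liner is hard to read, here is a rough sketch of the same pipeline broken into commented steps. The function name find_dupes and its optional directory argument are my own additions, not part of the CommandLineFu original, and I have added -r to the second xargs so that md5sum is skipped when there are no candidates. Like the original, it assumes the GNU versions of find, xargs, md5sum and uniq.

```shell
# find_dupes [DIR]: list groups of files under DIR (default: .) that have
# identical content. Hypothetical wrapper name; assumes GNU tools.
find_dupes() {
    local root="${1:-.}"
    # 1. Print the size in bytes of every non-empty regular file.
    find "$root" -not -empty -type f -printf '%s\n' |
        # 2. Keep only the sizes that occur more than once.
        sort -rn | uniq -d |
        # 3. For each such size, list the files of exactly that size.
        xargs -I{} find "$root" -type f -size {}c -print0 |
        # 4. Hash the candidates (-r: do nothing if there are none).
        xargs -0 -r md5sum |
        # 5. Group lines by the 32-character hash, blank line between groups.
        sort | uniq -w32 --all-repeated=separate
}
```

Comparing sizes first means only files that share a size with some other file get hashed, which is presumably why the original does it that way. Call it as find_dupes path/to/project.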

Oh, so beautiful! Use it effectively to eliminate code and test smells from your project!
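
As a possible starting point for the elimination step, here is a small sketch that reads the grouped output on standard input and prints every path except the first one in each group, i.e. the extra copies. The name list_extra_copies is made up for this example. It assumes the usual md5sum line layout (32 hex characters, two separator characters, then the path) and will not handle file names that md5sum escapes, such as names containing newlines or backslashes. It only prints paths and deletes nothing.

```shell
# list_extra_copies: read "hash  path" groups separated by blank lines on
# stdin and print the path of every file except the first in each group.
# Hypothetical helper; assumes the path starts at column 35 of each line.
list_extra_copies() {
    awk '
        /^$/   { seen = 0; next }          # blank line: a new group starts
        seen++ { print substr($0, 35) }    # skip the first line of a group
    '
}
```

Pipe the output of the command above into list_extra_copies and review the list by hand before removing anything.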

