Find duplicate files

So you know you have a bunch of files with identical content but different names strewn all over a directory tree. How do you identify the duplicates so that you can start eliminating them?

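Here is some CLI magic that helps you do so (sourced from CommandLineFu). It leans on GNU extensions to find and uniq, so expect it to work on a typical Linux box rather than everywhere.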

find -not -empty -type f -printf "%s\n" | sort -rn | uniq -d | xargs -I{} find -type f -size {}c -print0 | xargs -0 md5sum | sort | uniq -w32 --all-repeated=separate

The first column is the MD5 hash and the second is the file path, relative to the directory you ran the command in. Rows are grouped by identical content, with a blank line between groups.
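If the one-liner is hard to read, here is the same pipeline unrolled into a small function with one comment per stage. Treat it as a sketch rather than a drop-in tool: the name find_dupes and its optional directory argument are my own additions, it assumes GNU findutils and coreutils, and the -r flag on the second xargs (skip md5sum when there is nothing to hash) is not part of the original command.

```shell
#!/usr/bin/env bash
# Sketch: list groups of files with identical content under a directory.
# Assumes GNU find, xargs, uniq and md5sum. find_dupes is a hypothetical name.
find_dupes() {
    local dir="${1:-.}"

    # 1. Print the size in bytes of every non-empty regular file.
    find "$dir" -not -empty -type f -printf '%s\n' |
        # 2. Keep only the sizes that occur more than once.
        sort -rn | uniq -d |
        # 3. For each such size, list the files of exactly that size (NUL-separated).
        xargs -I{} find "$dir" -type f -size {}c -print0 |
        # 4. Hash only those candidates; -r skips md5sum if there are none.
        xargs -0 -r md5sum |
        # 5. Sort by hash and print only hashes that repeat (first 32 characters),
        #    with a blank line between groups.
        sort | uniq -w32 --all-repeated=separate
}
```

Call it as find_dupes path/to/project, or with no argument to scan the current directory. Comparing sizes first is what keeps this cheap: only files that share a size with some other file ever get hashed.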

Oh, so beautiful! Use it to eliminate code and test smells from your project!



This entry was posted on Sunday, January 8th, 2012 at 5:58 pm and is filed under Uncategorized.
