Showing many duplicate files
I often backed up photos from a mobile phone without remembering whether I had already backed them up, so I probably have a number of duplicates. File names are usually different, although the contents are identical byte for byte. I found the programme fdupes, which can list duplicates. fdupes -Sr lists duplicates like this:
2167950 bytes each:
./2018-03-09 All/IMG_4175.JPG
./IMG_20171113_170712.JPG

3154136 bytes each:
./2018-03-09 All/IMG_0187.JPG
./IMG_20160715_183705.JPG

2836777 bytes each:
./2018-03-09 All/IMG_0807.JPG
./IMG_20161123_011537.JPG
Still, I have several thousands of duplicates, in different directories, so I need something more readable. I stored the output in a file and then wrote an awk script to run on it, which gives me something like this:
1) .
2) ./2018-03-09 All

3985 set(s) of duplicates:

IMG_20171113_170712.JPG   IMG_4175.JPG
IMG_20160715_183705.JP    IMG_0187.JPG
IMG_20161123_011537.JPG   IMG_0807.JPG
Files are grouped by lists of directories under which they are found (can be multiple times the same directory, can be any number of directories).
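The grouping idea can be tried on its own before running the full script. Below is a minimal sketch, not the script from this post: it only counts how many duplicate sets fall under each combination of directories. The function name count_sets_by_dirs and the " | " separator are made up for the illustration, and it assumes the input is the saved output of fdupes -Sr with no newline characters in file names.

```shell
# Hypothetical helper: count duplicate sets per combination of directories.
# Reads fdupes -Sr output from the given file(s), or from stdin.
count_sets_by_dirs() {
    awk '
    # a record is one set of duplicates; sets are separated by an empty line
    BEGIN { RS = ""; FS = "\n" }
    {
        n = 0
        for (i = 1; i <= NF; i++) {
            # skip the size header printed by the -S option
            if ($i ~ /^[0-9]+ bytes? each:$/) continue
            d = $i
            sub(/\/[^\/]*$/, "", d)        # strip the file name, keep the directory
            dirs[++n] = d
        }
        # insertion sort, so that A,B and B,A produce the same key
        for (i = 2; i <= n; i++) {
            v = dirs[i]
            for (j = i - 1; j >= 1 && (dirs[j] "") > (v ""); j--)
                dirs[j + 1] = dirs[j]
            dirs[j + 1] = v
        }
        key = dirs[1]
        for (i = 2; i <= n; i++) key = key " | " dirs[i]
        count[key]++
    }
    END { for (key in count) printf "%d\t%s\n", count[key], key }
    ' "$@"
}
```

It can be called as count_sets_by_dirs dupes.txt, where dupes.txt is a hypothetical file holding the saved fdupes output. Each output line is a count, a tab, and the sorted directory list.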
I inspect these lists, process them with emacs to make lists of files to delete (I use the kill-region C-w and kill-rectangle C-x r k functions), and when I am confident enough that I have not made any mistake, I run "xargs rm " (note: don't do that if you have file names with spaces or characters interpreted by bash, like * or ?).
My awk script is probably overcomplicated but it seems to work. I tried to put in enough comments to make it readable. You can use it, and I welcome suggestions. While writing this post, I am thinking that one improvement could be to put a \ before space, * and ? characters, to make it safe to run with "xargs rm ".
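An alternative to escaping characters might be to skip xargs and read the list one line at a time. This is only a sketch under assumptions: the function name delete_listed and the file name to_delete.txt are invented here, the list holds exactly one path per line, and no file name contains a newline. As far as I know, xargs by default splits its input on blanks and treats quotes and backslashes specially, which a quoted "$path" avoids.

```shell
# Hypothetical helper: delete every file named in a list, one path per line.
# Usage: delete_listed to_delete.txt
delete_listed() {
    # IFS= and -r keep leading/trailing blanks and backslashes untouched
    while IFS= read -r path; do
        [ -n "$path" ] || continue     # skip empty lines
        # the quotes prevent word splitting and expansion of * and ?
        # "--" stops rm from reading a name starting with "-" as an option
        rm -- "$path"
    done < "$1"
}
```

Replacing rm with echo rm for a first pass prints what would be deleted without touching anything.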
# Sort full paths by dir, ignoring the file name
function compare_by_dir(i1, v1, i2, v2,    dir1, dir2) {
    # i1 and i2 are indexes, they are ignored
    # v1 and v2 are values, they are full file paths
    # extract the directory: it is the longest string of non-newline
    # characters followed by / and at least one non-newline character
    dir1 = gensub(/([^\n]+)\/[^\n]+/, "\\1", "g", v1)
    dir2 = gensub(/([^\n]+)\/[^\n]+/, "\\1", "g", v2)
    if (dir1 < dir2) return -1
    if (dir1 > dir2) return 1
    return 0
}

# a record is a list of lines, separated by an empty line
BEGIN { RS = "" ; FS = "\n" ; max_len = 1 }

{
    record = $0
    # Remove line with size information
    sub(/[0-9]+ bytes each:\n/, "", record)
    # Split each line of the record into path_array
    split(record, path_array, "\n")
    dir_string = ""
    file_string = ""
    # parse directories sorted by name to avoid e.g. getting two dir_strings A:::B and B:::A
    PROCINFO["sorted_in"] = "compare_by_dir"
    for (i in path_array) {
        # extract dir name and file name from full path
        dir = gensub(/([^\n]+)\/[^\n]+/, "\\1", "g", path_array[i])
        file = gensub(/[^\n]+\/([^\n]+)/, "\\1", "g", path_array[i])
        # concatenate dir/file names into dir_string/file_string, with ::: as separator
        if (dir_string == "") {
            dir_string = dir
            file_string = file
        } else {
            dir_string = dir_string ":::" dir
            file_string = file_string ":::" file
        }
    }
    # store the max file_string length (to use as column width for printing)
    if (length(file_string) > max_len) max_len = length(file_string)
    # dir_array[dir_string] is an array
    # dir_array[dir_string][i], where i is an integer from 1 to n, is a file_string
    # dir_array[dir_string]["files"] is n (i.e. highest index value)
    # set dir_array[dir_string]["files"] for the new file_string
    # (the parentheses are needed: "!" binds more tightly than "in")
    if (!(dir_string in dir_array)) dir_array[dir_string]["files"] = 1
    else dir_array[dir_string]["files"]++
    # add the new file_string
    dir_array[dir_string][dir_array[dir_string]["files"]] = file_string
}

END {
    # make a separator, used between dir groups
    sep = "-"
    for (i = 1 ; i < max_len ; i++) sep = sep "-"
    # compare_by_dir is not suitable to parse dir_array, so restore default sort
    PROCINFO["sorted_in"] = "@unsorted"
    for (dir_string in dir_array) {
        # split every string to an array again, so that we can print it on different lines
        # split returns the number of files in a duplicate set
        num_files = split(dir_string, dir_list, ":::")
        # print each dir with a number; loop by index so that the order
        # matches the file columns (unsorted "for ... in" order is arbitrary)
        for (i = 1 ; i <= num_files ; i++) printf "%d) %s\n", i, dir_list[i]
        printf "\n%d set(s) of duplicates:\n\n", dir_array[dir_string]["files"]
        # parse all file_strings
        for (j = 1 ; j <= dir_array[dir_string]["files"] ; j++) {
            # split into elements, to print in columns
            split(dir_array[dir_string][j], file_array, ":::")
            for (k = 1 ; k <= num_files ; k++) printf "%-" max_len "s", file_array[k]
            print ""
        }
        # separate from the next group
        printf "\n"
        for (k = 1 ; k <= num_files ; k++) printf "%s", sep
        printf "\n\n"
    }
}
The way you are doing it is probably safer, but assuming that all these pictures were taken with the same device and not in burst mode, I might use file metadata instead and identify duplicates by the time at which each picture was created. I believe this is what you get in a terminal with: ls -l --full-time
Otherwise, if manually selecting, classifying and sorting all these pictures was not an option, I would most probably dump them to save time and storage space at once.
ls -l --full-time
I looked at that as well, and so far this would have worked. For some photos, however, I found multiple files identical to each other but with slightly different timestamps (30 to 120 s apart).
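To check a suspicious pair, something like the sketch below could help. It is an illustration under assumptions, not a tested tool: compare_pair is an invented name, stat -c %Y is the GNU coreutils way of printing the modification time in seconds, and the time compared is the modification time (which is what ls -l --full-time prints), not necessarily the moment the picture was taken.

```shell
# Hypothetical helper: report whether two files have identical content,
# and how far apart their modification times are, in seconds.
compare_pair() {
    # cmp -s is silent and only sets the exit status
    if cmp -s -- "$1" "$2"; then content=same; else content=different; fi
    # %Y = modification time as seconds since the epoch (GNU stat)
    t1=$(stat -c %Y -- "$1") || return 1
    t2=$(stat -c %Y -- "$2") || return 1
    diff=$((t1 - t2))
    if [ "$diff" -lt 0 ]; then diff=$((0 - diff)); fi
    printf '%s %d\n' "$content" "$diff"
}
```

A result starting with "same" and followed by a non-zero number is the case described above: identical bytes, different timestamps, which a purely time-based comparison would miss.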
Otherwise, if manually selecting, classifying and sorting all these pictures..
I do a manual selection, but with a long delay. For sorting, I usually just move the files of a given date to a directory whose name starts with the date and describes what is in it. This has been a good extension of my memory so far.
Anyway, below is a version with improved output (the column width is adapted for each column, and separately for each set of duplicates). At least I am learning about awk.
# Sort full paths by dir, ignoring the file name
function compare_by_dir(i1, v1, i2, v2,    dir1, dir2) {
    # i1 and i2 are indexes, they are ignored
    # v1 and v2 are values, they are full file paths
    # extract the directory: it is the longest string of non-newline
    # characters followed by / and at least one non-newline character
    dir1 = gensub(/([^\n]+)\/[^\n]+/, "\\1", "g", v1)
    dir2 = gensub(/([^\n]+)\/[^\n]+/, "\\1", "g", v2)
    if (dir1 < dir2) return -1
    if (dir1 > dir2) return 1
    return 0
}

# a record is a list of lines, separated by an empty line
BEGIN { RS = "" ; FS = "\n" }

{
    record = $0
    # Remove line with size information
    sub(/[0-9]+ bytes each:\n/, "", record)
    # Split each line of the record into path_array
    split(record, path_array, "\n")
    dir_string = ""
    file_string = ""
    # parse directories sorted by name to avoid e.g. getting two dir_strings A:::B and B:::A
    PROCINFO["sorted_in"] = "compare_by_dir"
    num_dir = 0
    for (i in path_array) {
        num_dir++
        # extract dir name and file name from full path
        dir = gensub(/([^\n]+)\/[^\n]+/, "\\1", "g", path_array[i])
        file = gensub(/[^\n]+\/([^\n]+)/, "\\1", "g", path_array[i])
        file_len[num_dir] = length(file)
        # concatenate dir/file names into dir_string/file_string, with ::: as separator
        if (dir_string == "") {
            dir_string = dir
            file_string = file
        } else {
            dir_string = dir_string ":::" dir
            file_string = file_string ":::" file
        }
    }
    # dir_array[dir_string] is an array
    # dir_array[dir_string][i], where i is an integer from 1 to n, is a file_string
    # dir_array[dir_string]["files"] is n (i.e. highest index value)
    # dir_array[dir_string]["col" i], where i is an integer from 1 to the number of
    # dirs/files in a dir_string/file_string, is the max length of a file name in column i
    # if this is the first occurrence of this dir_string
    # (the parentheses are needed: "!" binds more tightly than "in")
    if (!(dir_string in dir_array)) {
        # capture the number of file_strings, i.e. 1
        dir_array[dir_string]["files"] = 1
        # capture the number of directories/files in the dir_string/file_string
        dir_array[dir_string]["num_dir"] = num_dir
        # set the length of columns to the file name lengths
        for (i = 1 ; i <= num_dir ; i++) dir_array[dir_string]["col" i] = file_len[i]
    }
    # if this is not the first occurrence
    else {
        # increase the number of file_strings by one
        dir_array[dir_string]["files"]++
        # widen any column for which the new file name is longer
        for (i = 1 ; i <= num_dir ; i++)
            if (file_len[i] > dir_array[dir_string]["col" i])
                dir_array[dir_string]["col" i] = file_len[i]
    }
    # add the new file_string
    dir_array[dir_string][dir_array[dir_string]["files"]] = file_string
}

END {
    # compare_by_dir is not suitable to parse dir_array, so restore default sort
    PROCINFO["sorted_in"] = "@unsorted"
    for (dir_string in dir_array) {
        # split every string to an array again, so that we can print it on different lines
        # split returns the number of files in a duplicate set
        num_files = split(dir_string, dir_list, ":::")
        # print each dir with a number; loop by index so that the order
        # matches the file columns (unsorted "for ... in" order is arbitrary)
        for (i = 1 ; i <= num_files ; i++) printf "%d) %s\n", i, dir_list[i]
        printf "\n%d set(s) of duplicates:\n\n", dir_array[dir_string]["files"]
        # parse all file_strings
        for (j = 1 ; j <= dir_array[dir_string]["files"] ; j++) {
            # split into elements, to print in columns
            split(dir_array[dir_string][j], file_array, ":::")
            printf "%-" dir_array[dir_string]["col1"] "s", file_array[1]
            for (k = 2 ; k <= num_files ; k++)
                printf " %-" dir_array[dir_string]["col" k] "s", file_array[k]
            print ""
        }
        # separate from the next group
        printf "\n"
        for (k = 1 ; k <= num_files ; k++)
            for (l = 1 ; l <= dir_array[dir_string]["col" k] + 1 ; l++) printf "-"
        printf "\n\n"
    }
}