Showing many duplicate files

2 replies
Avron

I am a translator!

Offline
Joined: 08/18/2020

I have often made backups of photos from a mobile phone without remembering whether I had already backed them up before, so it is likely that I have a number of duplicates. File names are usually different, although the contents are identical byte for byte. I found the programme fdupes, which can list duplicates. fdupes -Sr lists them like this:

2167950 bytes each:
./2018-03-09 All/IMG_4175.JPG
./IMG_20171113_170712.JPG

3154136 bytes each:
./2018-03-09 All/IMG_0187.JPG
./IMG_20160715_183705.JPG

2836777 bytes each:
./2018-03-09 All/IMG_0807.JPG
./IMG_20161123_011537.JPG

Still, I have several thousand duplicates, in different directories, so I need something more readable. I stored the output in a file and then wrote an awk script to run on it, which gives me something like this:

1) .
2) ./2018-03-09 All

3985 set(s) of duplicates:

IMG_20171113_170712.JPG                           IMG_4175.JPG                                                                       
IMG_20160715_183705.JPG                           IMG_0187.JPG                                                                       
IMG_20161123_011537.JPG                           IMG_0807.JPG                                                                       

Files are grouped by the list of directories under which they are found (the same directory can appear multiple times, and there can be any number of directories).

I inspect these lists, process them with emacs to make lists of files to delete (I use the kill-region C-w and kill-rectangle C-x r k functions), and when I am confident enough that I did not make any mistake, I run "xargs rm" (note: don't do that if you have file names containing spaces or characters interpreted by bash, like * or ?).
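
(One thing that should sidestep the quoting problem entirely, although I have not tried it yet, would be to separate the names with NUL bytes instead of newlines, e.g. tr '\n' '\0' < files_to_delete.txt | xargs -0 rm -- with GNU xargs; files_to_delete.txt is just whatever name the list of files to delete is saved under.)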

My awk script is probably over-complicated but it seems to work. I tried to put enough comments in to make it readable. You can use it, and I welcome suggestions. While writing this post, it occurred to me that one improvement could be to put a \ before space, * and ? characters, to make the output safe to run through "xargs rm" (see the sketch right below, before the script).
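
Something like this could do that escaping (just a rough, untested sketch; the name shell_escape and the exact list of escaped characters are my own guesses, and it still would not handle newlines in file names):

# sketch: prefix with a backslash every character that bash or xargs
# would otherwise interpret (space, *, ?, quotes, $ and backslash itself)
function shell_escape(name)
{
    return gensub(/([ \\*?"'$])/, "\\\\\\1", "g", name)
}

The file names would then go through shell_escape() just before being printed, at the cost of making the columns a bit less tidy. Anyway, here is the script: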

# Sort full paths by dir, ignoring the file name

function compare_by_dir(i1, v1, i2, v2)
{
    # i1 and i2 are indexes, they are ignored
    # v1 and v2 are values, they are full file paths

    # extract the directory
    # it is the longest string of non-newline characters that is followed by a / and at least one more non-newline character

    dir1 = gensub(/([^\n]+)\/[^\n]+/,"\\1","g",v1)
    dir2 = gensub(/([^\n]+)\/[^\n]+/,"\\1","g",v2)
    
    if (dir1 < dir2)
	return -1
    if (dir1 > dir2)
	return 1
    return 0
}

# with RS = "" and FS = "\n", a record is a block of lines separated by blank lines (one fdupes group) and each line is a field
BEGIN { RS = "" ; FS = "\n" ; max_len = 1 }

{
    record = $0

    # Remove line with size information
    sub(/[0-9]+ bytes each:\n/,"",record)
 
    # Split each line of the record into path_array
    split(record, path_array, "\n")

    dir_string = ""
    file_string = ""

    # parse directories sorted by name to avoid e.g. getting two dir_strings A:::B and B:::A
    PROCINFO["sorted_in"] = "compare_by_dir"

    for (i in path_array)
    {
	# extract dir name and file name from full path
	dir  = gensub(/([^\n]+)\/[^\n]+/,"\\1","g",path_array[i])
	file = gensub(/[^\n]+\/([^\n]+)/,"\\1","g",path_array[i])
	
	# concatenate dir/file names into dir_string/file_string, with ::: as separator
	
	if (dir_string == "") {	    
	    dir_string = dir
	    file_string = file
	}
	else {
	    dir_string = dir_string ":::" dir
	    file_string = file_string ":::" file
	}
    }

    # store the max file name length (to use as column width for printing)
    if (length(file_string) > max_len)
	max_len = length(file_string)
    
    # dir_array[dir_string] is an array
    # dir_array[dir_string][i], where i is an integer from 1 to n, is a file_string
    # dir_array[dir_string]["files"] is n (i.e. highest index value)


    # initialise or increment the number of file_strings stored for this dir_string
    if (!(dir_string in dir_array))
	dir_array[dir_string]["files"] = 1
    else
	dir_array[dir_string]["files"]++

    # add the new file_string
    dir_array[dir_string][dir_array[dir_string]["files"]] = file_string    
}

END {

    # make a separator, used between dir groups
    sep = "-"
    for (i = 1 ; i < max_len ; i++)
	sep = sep "-"
    
    # compare_by_dir expects full file paths as values, so it is not suitable for traversing dir_array; restore the default (arbitrary) order
    PROCINFO["sorted_in"] = "@unsorted"

    for (dir_string in dir_array) {
	
	# split the dir_string back into an array, so that each directory can be
	# printed on its own line with a number; looping over the numeric index
	# (rather than "for (dir in dir_list)") keeps the directories in the same
	# order as the file name columns below
	# num_files is the number of directories, i.e. of files in each duplicate set
	num_files = split(dir_string, dir_list, ":::")

	for (i = 1; i <= num_files; i++)
	    printf "%d) %s\n", i, dir_list[i]
	
	printf "\n%d set(s) of duplicates:\n\n", dir_array[dir_string]["files"]

	# parse all file_strings
	for (j = 1 ; j<=dir_array[dir_string]["files"] ; j++) {

	    # split into elements, to print in columns
	    split (dir_array[dir_string][j],file_array,":::")

	    for (k = 1; k <= num_files ; k++)
		printf "%-" max_len "s", file_array[k]
	    print ""
	}

	# separate from the next group
	printf "\n"
	for (k = 1; k <= num_files ; k++)
	    printf sep
	printf "\n\n"
    }
}
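
(In case it helps: the script uses gensub, PROCINFO["sorted_in"] and arrays of arrays, which are gawk extensions, so it needs gawk rather than a plain awk. I run it with something like "fdupes -Sr . > dup.txt" followed by "gawk -f script.awk dup.txt"; dup.txt and script.awk are just the names I happen to use.)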
prospero
Offline
Joined: 05/20/2022

The way you are doing it is probably safer, but assuming that all these pictures were taken with the same device, and not in burst mode, I might try to use file metadata instead and identify duplicates based on the time at which the picture was created. I believe this is what you get in a terminal with: ls -l --full-time
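
Something along these lines might be a starting point (completely untested; it relies on the GNU ls long listing format, so the field numbers are an assumption, and it only handles file names without spaces):

ls -l --full-time | awk '
    # in GNU ls -l --full-time output: $6 = date, $7 = time, $9 = file name
    NR > 1 {
        ts = $6 " " $7
        names[ts] = (ts in names) ? names[ts] " " $9 : $9
        count[ts]++
    }
    # print only timestamps shared by more than one file
    END {
        for (ts in count)
            if (count[ts] > 1)
                print ts ": " names[ts]
    }'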

Otherwise, if manually selecting, classifying and sorting all these pictures were not an option, I would most probably just dump them all, saving both time and storage space at once.

Avron

I am a translator!

Offline
Joined: 08/18/2020

ls -l --full-time

I looked at that too, and in most cases it would have worked. For some photos, though, I found multiple files identical to each other but with slightly different times (something like a 30 to 120 s difference).

Otherwise, if manually selecting, classifying and sorting all these pictures...

I do a manual selection, but with a long delay. For sorting, I usually just move the files of a given date to a directory whose name starts with that date and says what is in it. This has been a good extension of my memory so far.

Anyway, below is a version with improved output (the column width is adapted to each column, separately for each set of duplicates). At least I am learning about awk.

# Sort full paths by dir, ignoring the file name

function compare_by_dir(i1, v1, i2, v2)
{
    # i1 and i2 are indexes, they are ignored
    # v1 and v2 are values, they are full file paths

    # extract the directory
    # it is the longest string of non-newline characters that is followed by a / and at least one more non-newline character

    dir1 = gensub(/([^\n]+)\/[^\n]+/,"\\1","g",v1)
    dir2 = gensub(/([^\n]+)\/[^\n]+/,"\\1","g",v2)
    
    if (dir1 < dir2)
	return -1
    if (dir1 > dir2)
	return 1
    return 0
}

# with RS = "" and FS = "\n", a record is a block of lines separated by blank lines (one fdupes group) and each line is a field
BEGIN { RS = "" ; FS = "\n" ; max_len = 1 }

{
    record = $0

    # Remove line with size information
    sub(/[0-9]+ bytes each:\n/,"",record)
 
    # Split each line of the record into path_array
    split(record, path_array, "\n")

    dir_string = ""
    file_string = ""

    # parse directories sorted by name to avoid e.g. getting two dir_strings A:::B and B:::A
    PROCINFO["sorted_in"] = "compare_by_dir"

    num_dir = 0
    
    for (i in path_array)
    {
	num_dir++
	
	# extract dir name and file name from full path
	dir  = gensub(/([^\n]+)\/[^\n]+/,"\\1","g",path_array[i])
	file = gensub(/[^\n]+\/([^\n]+)/,"\\1","g",path_array[i])

	file_len[num_dir] = length(file)
	
	# concatenate dir/file names into dir_string/file_string, with ::: as separator
	
	if (dir_string == "") {	    
	    dir_string = dir
	    file_string = file
	}
	else {
	    dir_string = dir_string ":::" dir
	    file_string = file_string ":::" file
	}
    }

    # store the max file name length (to use as column width for printing)
    if (length(file_string) > max_len)
	max_len = length(file_string)
    
    # dir_array[dir_string] is an array
    # dir_array[dir_string][i], where i is an integer from 1 to n, is a file_string
    # dir_array[dir_string]["files"] is n (i.e. the highest index value)
    # dir_array[dir_string]["col" i], where i is an integer from 1 to the number of
    # directories/files in a dir_string/file_string, is the max length of a file name in column i


    # if this is the first occurrence of this dir_string

    if (!(dir_string in dir_array)) {

	# capture the number of file_strings, i.e. 1 
	dir_array[dir_string]["files"] = 1

	# capture the number of directories/files in the dir_string/file_string
	dir_array[dir_string]["num_dir"] = num_dir

	# set the length of columns to the filename lengths
	for (i = 1 ; i <= num_dir ; i++) 
	    dir_array[dir_string]["col" i] = file_len[i]
    }

    # if this is not the first occurrence
    
    else {

	#increase the number of file_strings by one
	dir_array[dir_string]["files"]++
	for (i = 1 ; i <= num_dir ; i++)
	    if (file_len[i] > dir_array[dir_string]["col" i])
		dir_array[dir_string]["col" i] = file_len[i]
    }
    
    # add the new file_string
    dir_array[dir_string][dir_array[dir_string]["files"]] = file_string    
	
}

END {
    
    # compare_by_dir expects full file paths as values, so it is not suitable for traversing dir_array; restore the default (arbitrary) order
    PROCINFO["sorted_in"] = "@unsorted"

    for (dir_string in dir_array) {
	
	# split the dir_string back into an array, so that each directory can be
	# printed on its own line with a number; looping over the numeric index
	# (rather than "for (dir in dir_list)") keeps the directories in the same
	# order as the file name columns below
	# num_files is the number of directories, i.e. of files in each duplicate set
	num_files = split(dir_string, dir_list, ":::")

	for (i = 1; i <= num_files; i++)
	    printf "%d) %s\n", i, dir_list[i]
	
	printf "\n%d set(s) of duplicates:\n\n", dir_array[dir_string]["files"]

	# parse all file_strings
	for (j = 1 ; j<=dir_array[dir_string]["files"] ; j++) {

	    # split into elements, to print in columns
	    split (dir_array[dir_string][j],file_array,":::")

	    printf "%-" dir_array[dir_string]["col1"] "s", file_array[1]

	    for (k = 2; k <= num_files ; k++)
		printf "  %-" dir_array[dir_string]["col" k] "s", file_array[k]
	    print ""
	}

	# separate from the next group
	printf "\n"
	for (k = 1; k <= num_files ; k++)
	    for (l = 1 ; l <= dir_array[dir_string]["col" k]+1 ; l++)
		printf "-"
	printf "\n\n"
    }
}