
I have two old, similar directory trees with MP3 files in them. Using tools like diff and rsync, I have happily identified and merged the files that are only present on one side or are identical. I'm left with a bunch of files that are bitwise different.

Running diff over a pair of actually-different files (with the -a flag to force text analysis) produces incomprehensible gibberish. I have listened to files from both sides, and both seem to play fine (but at nearly 10 minutes per song, listening to each one twice, I haven't done many).

I suspect the differences are due to some player in the past "enhancing" my collection by messing about with ID3 tags, but I can't be certain. Even if I identify differences in the ID3 tags, I would like to confirm that no cosmic-ray or file-copy errors have damaged any of the files.

One method that occurs to me is finding the byte locations of the differences and ignoring all changes in the first ~10 kB of each file, but I don't know how to do this.

I have on the order of a hundred or so files that differ across the directory tree.

I found How to compare mp3, flac audio data in a file, ignoring header data (ID3 tag) etc.? -- but I can't run AllDup, since I'm on Linux, and from the sounds of it, it would only partially solve my problem anyway.

3 Answers


Beyond Compare, according to the linked topic?

Beyond Compare 3 does not run as a console application on Linux; it requires the X Window System.

SUPPORTED LINUX DISTRIBUTIONS

Red Hat Enterprise Linux 4-6

Fedora 4-14

Novell Suse Linux Enterprise Desktop 10

openSUSE 10.3-11.2

Ubuntu 6.06-10.10

Debian 5.04

Mandriva 2010


Beyond Compare (referenced above) looks like a great solution; I've never used it. The bit about X Windows just means that it wants to run in a GUI rather than straight from the command line. If you have a GUI installed, the chances that X is already properly set up on your system are extremely good.

Some ideas on how to proceed:

    cmp -i 10kB file1 file2

will compare two arbitrary files bytewise on Linux, first skipping 10 kB in each file. It even has an option for skipping a different byte count in each file. The -l option lists every differing byte (add -b to also show the byte values), but that can produce very long output, so if you use it, pipe the output into a file or into less. You'd have to decide how many bytes to skip; I don't know that answer. To use it effectively on multiple files, you'd have to write a script in bash or another language; maybe running it as part of a find command with an -exec option would work. A rough sketch follows.
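For example, here is a minimal, untested sketch along those lines; "tree1" and "tree2" are placeholder directory names, and it assumes the two trees share the same relative paths:

    #!/bin/bash
    # Compare every MP3 under tree1 against its counterpart in tree2,
    # skipping the first 10 kB of each file (where ID3v2 tags usually live).
    # cmp -s is silent and reports only via its exit status.
    cd tree1 || exit 1
    find . -name '*.mp3' -print0 | while IFS= read -r -d '' f; do
        other="../tree2/$f"
        [ -f "$other" ] || continue        # no counterpart, nothing to compare
        if ! cmp -s -i 10kB "$f" "$other"; then
            printf 'differs beyond 10 kB: %s\n' "$f"
        fi
    done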

In the future, if looking for duplicate files, check out fdupes. It's a utility designed just for that. I used it when I was still figuring out how to manage photos on my computer and ended up with a bunch of directories with lots of duplicates in them.

https://code.google.com/p/fdupes/

Also, if you look up fdupes on wikipedia, there's a whole raft of Linux file compare programs listed in the entry.
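For reference, a typical fdupes invocation recursing into both trees would look something like this (note that fdupes compares whole files byte-for-byte, so files that differ only in their tags will not be reported as duplicates):

    fdupes -r tree1 tree2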

Just for the heck of it, I had a look at:

http://www.id3.org/id3v2.4.0-structure

which specifies the structure of ID3 tags. It "recommends" that the tag be placed at the start of the file, but also provides for additional tags to be added at the end of the file, so unless nobody uses that option, there may be meta information elsewhere in the file, not just at the beginning. A cursory look at the spec reveals that ID3 tag info is variable in length, so there is no exact byte count guaranteed to skip over it, but the 10 kB originally suggested ought to be far more than enough to skip the initial tags.
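If you wanted an exact skip count rather than a guess, the ID3v2 header itself declares the tag's size. Here is a speculative bash helper (id3v2_size is my own hypothetical name, not a standard tool); it reads the 10-byte header and decodes the 28-bit "syncsafe" size field, and it deliberately ignores any tags appended at the end of the file:

    # Print the total size in bytes of an ID3v2 tag at the start of a file,
    # or 0 if the file doesn't start with one. Assumptions: no appended tags,
    # and the optional 10-byte footer flag is ignored for simplicity.
    id3v2_size() {
        set -- $(head -c 10 "$1" | od -An -tu1)
        # Bytes 1-3 must be "I" "D" "3" (73 68 51); bytes 7-10 hold the
        # tag size as a syncsafe integer: 7 significant bits per byte.
        if [ "${1:-0}" -eq 73 ] && [ "${2:-0}" -eq 68 ] && [ "${3:-0}" -eq 51 ]; then
            echo $(( (($7 << 21) | ($8 << 14) | ($9 << 7) | ${10}) + 10 ))
        else
            echo 0
        fi
    }

    # GNU cmp accepts different skip counts per file as SKIP1:SKIP2:
    cmp -s -i "$(id3v2_size file1.mp3):$(id3v2_size file2.mp3)" file1.mp3 file2.mp3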


As a possible solution, you can use any tool that converts the file into an uncompressed stream (PCM, WAV) with no metadata, and then compare the streams. For the conversion you can use whatever software you have, such as ffmpeg, sox, or avidemux.

For example, here is how I do that with ffmpeg.

Say I have, for this example, two files with different metadata:

    $ diff Original.mp3 Possible-dup.mp3 ; echo $?
    Binary files Original.mp3 and Possible-dup.mp3 differ
    1

The brute-force comparison complains that they differ.

Then we just convert and diff the body:

    $ diff <( ffmpeg -loglevel 8 -i Original.mp3 -map_metadata -1 -f wav - ) <( ffmpeg -loglevel 8 -i Possible-dup.mp3 -map_metadata -1 -f wav - ) ; echo $?
    0

Of course, the ; echo $? part is just for demonstration purposes, to show the return code.

Processing multiple files (traversing directories)

If you want to find duplicates in a collection, it is worth calculating a checksum (CRC, MD5, SHA-2, SHA-256, anything) of the audio data of each file and then just looking for collisions.

1. First, calculate a hash of the audio data in each file (and write it to a file for the next processing step):

        for file in *.mp3; do printf "%s:%s\n" "$( ffmpeg -loglevel 8 -i "$file" -map_metadata -1 -f wav - | sha256sum | cut -d' ' -f1 )" "$file"; done > mp3data.hashes

   In your case you can process multiple directories at once, e.g.:

        find -L orig-dir dir-with-duplicates -name '*.mp3' -print0 | while read -r -d $'\0' file; do printf "%s:%s\n" "$( ffmpeg -loglevel 8 -i "$file" -map_metadata -1 -f wav - | sha256sum | cut -d' ' -f1 )" "$file"; done > mp3data.hashes

The file will look like this:

    $ cat mp3data.hashes
    ad48913a11de29ad4639253f2f06d8480b73d48a5f1d0aaa24271c0ba3998d02:file1.mp3
    54320b708cea0771a8cf71fac24196a070836376dd83eedd619f247c2ece7480:file2.mp3
    1d8627a21bdbf74cc5c7bc9451f7db264c167f7df4cbad7d8db80bc2f347110f:Original.mp3
    8918674499b90ace36bcfb94d0d8ca1bc9f8bb391b166f899779b373905ddbc1:Other-dup.mp3
    8918674499b90ace36bcfb94d0d8ca1bc9f8bb391b166f899779b373905ddbc1:Other.mp3
    1d8627a21bdbf74cc5c7bc9451f7db264c167f7df4cbad7d8db80bc2f347110f:Possible-dup.mp3

Any RDBMS would be very helpful here to aggregate the counts and select the data, but to continue with a pure command-line solution, you can follow the simple steps below.

2. Look at the duplicate hashes, if any (an extra step to show how it works; not needed for finding dupes):

        $ count.by.regexp.awk '([0-9a-f]+):' mp3data.hashes
        [1:54320b708cea0771a8cf71fac24196a070836376dd83eedd619f247c2ece7480]=1
        [1:1d8627a21bdbf74cc5c7bc9451f7db264c167f7df4cbad7d8db80bc2f347110f]=2
        [1:ad48913a11de29ad4639253f2f06d8480b73d48a5f1d0aaa24271c0ba3998d02]=1

3. And, putting it all together, list the files duplicated by content:

        $ grep mp3data.hashes -f <( count.by.regexp.awk '([0-9a-f]+):' mp3data.hashes | grep -oP '(?<=\[1:).{64}(?!]=1$)' ) | sort
        1d8627a21bdbf74cc5c7bc9451f7db264c167f7df4cbad7d8db80bc2f347110f:Original.mp3
        1d8627a21bdbf74cc5c7bc9451f7db264c167f7df4cbad7d8db80bc2f347110f:Possible-dup.mp3
        8918674499b90ace36bcfb94d0d8ca1bc9f8bb391b166f899779b373905ddbc1:Other-dup.mp3
        8918674499b90ace36bcfb94d0d8ca1bc9f8bb391b166f899779b373905ddbc1:Other.mp3

count.by.regexp.awk is a simple awk script that counts regexp patterns.
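Since count.by.regexp.awk is the answerer's own helper rather than a standard tool, a rough substitute for the hash:filename format above (my own suggestion, with plainer output) would be:

    # Count how many files share each hash; any hash with a count > 1
    # marks a set of content-duplicates.
    cut -d: -f1 mp3data.hashes | sort | uniq -c | sort -rn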

P.S. This is a slightly adjusted variant of https://superuser.com/a/1219353/435801.

