How would I diff two mbox files in an intelligent way?

Question

I have two mbox files containing about 6k emails each. They should be more or less identical, though #1 contains about 100 emails which #2 does not contain. I would love to generate a third mbox file containing my 100 messages - a diff so to speak.

I used to automatically forward messages from one inbox into another (server-side), which randomly failed to forward a few messages for some undetermined reason. #2 is the inbox into which emails were forwarded: lots of read and replied-to messages with additional headers containing information on how they were forwarded. #1 is a recent dump, using IMAP, of 6k unread messages.

I am working with Thunderbird under Linux.

s-m-e · Accepted Answer · 2015-10-24 11:24:39Z

2

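The following Python script solves the problem. It collects the Message-ID of every message in #2, then copies each message of #1 whose Message-ID is not among them into a new mbox file:

import mailbox

inbox_2 = mailbox.mbox('inbox_2_file')
inbox_1 = mailbox.mbox('inbox_1_file')

# create=True creates the output file if it does not exist yet
inbox_diff = mailbox.mbox('inbox_diff_file', create=True)

# Collect the Message-IDs of inbox 2 in a set for fast membership tests.
# str() keeps the values hashable even if the parser returns a non-string header object.
inbox_2_ids = set()
for message in inbox_2:
    inbox_2_ids.add(str(message.get('Message-ID')))

# Copy every message of inbox 1 whose Message-ID does not occur in inbox 2
for message in inbox_1:
    if str(message.get('Message-ID')) not in inbox_2_ids:
        inbox_diff.add(message)

# Write pending changes to disk
inbox_diff.flush()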
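
One caveat with comparing raw header values: the same Message-ID could in principle show up with different surrounding or folded whitespace in the two files. That is an assumption, not something the question confirms. A minimal sketch of a normalizer that could be applied to both sides before comparing (`normalize_message_id` is a hypothetical helper name, not part of the `mailbox` module):

```python
import re


def normalize_message_id(raw):
    """Return the Message-ID with all whitespace removed, or None if absent or empty."""
    if raw is None:
        return None
    # str() also accepts non-string header objects; \s+ covers folded line breaks
    cleaned = re.sub(r"\s+", "", str(raw))
    return cleaned or None


print(normalize_message_id(" <abc123@example.com>\n "))
```

Case is deliberately left untouched here, since nothing in the thread suggests the IDs differ in case.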

edited Oct 24, 2015 at 11:24

answered Oct 22, 2015 at 17:58

s-m-e


Miles · Accepted Answer · 2017-09-13 10:08:42Z

1

Thanks so much for posting both your question and answer, @s-m-e. The Python script works beautifully as long as every email message includes a Message-ID. Sadly, that is not always the case: somewhat surprisingly, Message-ID is not a mandatory field.

With a very slight alteration, your script can be used to create a new mbox file containing all of the email messages that are missing the Message-ID field:

import mailbox

inbox_1 = mailbox.mbox('inbox_1_file')

inbox_missing_message_id = mailbox.mbox('inbox_missing_message_id_file', create=True)

for message in inbox_1:
    if message.get('Message-ID') is None:
        inbox_missing_message_id.add(message)

inbox_missing_message_id.flush()

edited Sep 13, 2017 at 10:08

answered Sep 13, 2017 at 10:03

Miles

1

Interesting, I did not know that it was not mandatory. Thanks a lot for pointing this out. The real question then becomes how you would adjust the actual diff code. How can you figure out which email is which in a reliable manner, i.e. how can you safely define an "alternative id"? I thought about using a mix of subject lines and timestamps, but the latter seems to be really unreliable in too many different ways.
– s-m-e
Commented Sep 14, 2017 at 13:23
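
One possible direction for such an "alternative id", sketched under assumptions rather than tested on real mailboxes: hash the fields that forwarding is least likely to touch (here the From and Subject headers plus the decoded body) and leave timestamps out entirely. `fallback_key` is a hypothetical helper, and whether these fields really survive forwarding unchanged depends on the server.

```python
import hashlib
from email.message import Message


def fallback_key(message):
    """Hypothetical stand-in ID: SHA-256 over sender, subject, and body bytes."""
    digest = hashlib.sha256()
    for header in ("From", "Subject"):
        value = str(message.get(header, ""))
        # Collapse whitespace so re-folded headers hash identically
        digest.update(" ".join(value.split()).encode("utf-8", "replace"))
        digest.update(b"\0")
    # Hash every leaf part; multipart containers carry no payload of their own
    for part in message.walk():
        if part.is_multipart():
            continue
        payload = part.get_payload(decode=True) or b""
        digest.update(payload.strip())
    return digest.hexdigest()


# Two copies of the same mail: different Date, one extra forwarding header
original = Message()
original["From"] = "alice@example.com"
original["Subject"] = "Hello"
original["Date"] = "Thu, 22 Oct 2015 17:58:00 +0000"
original.set_payload("Just a test.\n")

forwarded = Message()
forwarded["X-Forwarded-To"] = "bob@example.com"
forwarded["From"] = "alice@example.com"
forwarded["Subject"] = "Hello"
forwarded["Date"] = "Thu, 22 Oct 2015 18:01:00 +0000"
forwarded.set_payload("Just a test.\n")

print(fallback_key(original) == fallback_key(forwarded))
```

Messages without a Message-ID could then be compared by `fallback_key(message)` instead. Two genuinely different mails with the same sender, subject, and body would collide, so this is a heuristic at best.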

Rosario Lombardo · Accepted Answer · 2022-11-26 08:17:56Z

Some five years later, I had to compare two backed-up Thunderbird mailboxes of 306MB and 338MB. So I put all the excellent advice into a self-contained helper script that computes the difference mbox2 - mbox1 (in the naming of the usage text: the messages in the first file that are missing from the second). Along the way, it also creates, for each input, a mailbox holding the emails that lack the Message-ID field. For me this works out of the box on a Mac, producing two additional mailboxes of emails without a Message-ID, 52KB and 10KB in size. More than manageable!

Hope this comes in handy. Thanks for your insights.

#!/usr/bin/env python3

import mailbox
import sys
import os

if len(sys.argv) <= 2:
    print("USAGE: {} <mbox2> <mbox1>".format(os.path.basename(sys.argv[0])))
    print("       <diff> = <mbox2> - <mbox1> is based on 'Message-ID' field")
    print("       E-mails with no 'Message-ID' are copied <mbox[1|2]_no_Message-ID>.")
    print("       Original mailboxes are left unaltered.")
    sys.exit(1)
    
if not os.path.isfile(sys.argv[1]):
    print(" * mbox_file {} does not exist.".format(sys.argv[1]))
    sys.exit(2)
if not os.path.isfile(sys.argv[2]):
    print(" * mbox_file {} does not exist.".format(sys.argv[2]))
    sys.exit(2)

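# Naming caveat: the usage text calls the first argument <mbox2>, while the
# code below calls it mbox1_file. Either way, the result is
# "first argument minus second argument".
mbox1_file = sys.argv[1]
mbox2_file = sys.argv[2]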

outfile = os.path.join(
    os.path.dirname(mbox1_file)
    , '{}2_diff_{}1'.format(os.path.basename(mbox2_file), os.path.basename(mbox1_file))
)
if os.path.isfile(outfile):
    print(" * OUTPUT mbox_file {} already exists.".format(outfile))
    sys.exit(2)

nomsgid1_file = os.path.join(
    os.path.dirname(mbox1_file)
    , '{}_no_Message-ID'.format(os.path.basename(mbox1_file))
)
if os.path.isfile(nomsgid1_file):
    print(" * OUTPUT mbox_file {} already exists.".format(nomsgid1_file))
    sys.exit(2)


nomsgid2_file = os.path.join(
    os.path.dirname(mbox2_file)
    , '{}_no_Message-ID'.format(os.path.basename(mbox2_file))
)
if os.path.isfile(nomsgid2_file):
    print(" * OUTPUT mbox_file {} already exists.".format(nomsgid2_file))
    sys.exit(2)


inbox_1 = mailbox.mbox(mbox1_file)
inbox_2 = mailbox.mbox(mbox2_file)
inbox_diff = mailbox.mbox(outfile, create=True)
inbox1_missing_message_id = mailbox.mbox(nomsgid1_file, create=True)
inbox2_missing_message_id = mailbox.mbox(nomsgid2_file, create=True)


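# Message-IDs of inbox 2, kept in a set for fast membership tests.
# str() keeps the values hashable even if the parser returns a non-string header object.
inbox_2_ids = set()
for message in inbox_2:
    msgid = message.get('Message-ID')
    if msgid is None:
        # No Message-ID: set the message aside instead of comparing it
        inbox2_missing_message_id.add(message)
    else:
        inbox_2_ids.add(str(msgid))

for message in inbox_1:
    msgid = message.get('Message-ID')
    if msgid is None:
        inbox1_missing_message_id.add(message)
    elif str(msgid) not in inbox_2_ids:
        # Present in inbox 1 but not in inbox 2: part of the diff
        inbox_diff.add(message)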

inbox_diff.flush()
inbox2_missing_message_id.flush()
inbox1_missing_message_id.flush()

P.S. I also ran into several encoding issues. Detecting the right encoding of an email automatically is hard, and UTF-8 does not always work. A brutal but working solution was to change every occurrence of the ascii encoding in the mailbox.py library to latin-1 and utf8, as follows:

get_message function:

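try:
  # Try UTF-8 first: a Latin-1 decode accepts any byte sequence and never
  # raises, so it only makes sense as the fallback
  msg.set_from(from_line[5:].decode('utf8'))
except UnicodeDecodeError:
  msg.set_from(from_line[5:].decode('latin-1'))
return msg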

_install_message() function:

author = message.get_from().encode('utf8')
[...]
from_line = from_line.encode('utf8')
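
The order of the two decode attempts matters, which a small standalone sketch can show. It does not touch mailbox.py; `decode_from_line` is a hypothetical helper that only illustrates the fallback idea: strict UTF-8 first, then Latin-1 as a catch-all, because Latin-1 maps every possible byte to a character and therefore never raises.

```python
def decode_from_line(raw):
    """Decode raw bytes as UTF-8, falling back to Latin-1 if that fails."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Latin-1 accepts any byte sequence, so this branch cannot fail
        return raw.decode("latin-1")


# The same sender, stored once as UTF-8 and once as Latin-1
print(decode_from_line("jörg@example.com".encode("utf-8")))
print(decode_from_line("jörg@example.com".encode("latin-1")))
```

With the attempts the other way round, the UTF-8 branch would never be reached and UTF-8 senders would come out garbled.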
