Some five years later, I had to compare two backed-up Thunderbird mailboxes of 306 MB and 338 MB. So I put all the excellent advice here into a self-contained helper script that computes the difference mbox2 - mbox1, matching emails by their Message-ID field. In the process it also creates, for each input mailbox, a companion mailbox holding the emails that lack a Message-ID. For me this works out of the box on a Mac, and the two extra mailboxes of emails without a Message-ID came to only 52 KB and 10 KB. More than manageable!
Hope this comes in handy. Thanks for your insights.
#!/usr/bin/env python3
import mailbox
import os
import sys

if len(sys.argv) <= 2:
    print("USAGE: {} <mbox2> <mbox1>".format(os.path.basename(sys.argv[0])))
    print("  <diff> = <mbox2> - <mbox1> is based on 'Message-ID' field")
    print("  E-mails with no 'Message-ID' are copied to <mbox[1|2]_no_Message-ID>.")
    print("  Original mailboxes are left unaltered.")
    sys.exit(1)

# Argument order follows the usage text: <mbox2> comes first, then <mbox1>.
mbox2_file = sys.argv[1]
mbox1_file = sys.argv[2]

for mbox_file in (mbox2_file, mbox1_file):
    if not os.path.isfile(mbox_file):
        print(" * mbox_file {} does not exist.".format(mbox_file))
        sys.exit(2)

# All output files are written next to their input mailbox.
outfile = os.path.join(
    os.path.dirname(mbox2_file),
    '{}2_diff_{}1'.format(os.path.basename(mbox2_file), os.path.basename(mbox1_file)),
)
nomsgid1_file = os.path.join(
    os.path.dirname(mbox1_file),
    '{}_no_Message-ID'.format(os.path.basename(mbox1_file)),
)
nomsgid2_file = os.path.join(
    os.path.dirname(mbox2_file),
    '{}_no_Message-ID'.format(os.path.basename(mbox2_file)),
)

# Refuse to overwrite anything that already exists.
for out_file in (outfile, nomsgid1_file, nomsgid2_file):
    if os.path.isfile(out_file):
        print(" * OUTPUT mbox_file {} already exists.".format(out_file))
        sys.exit(2)

inbox_1 = mailbox.mbox(mbox1_file)
inbox_2 = mailbox.mbox(mbox2_file)
inbox_diff = mailbox.mbox(outfile, create=True)
inbox1_missing_message_id = mailbox.mbox(nomsgid1_file, create=True)
inbox2_missing_message_id = mailbox.mbox(nomsgid2_file, create=True)

# Collect the Message-IDs of <mbox1>; a set keeps each lookup below cheap.
inbox_1_ids = set()
for message in inbox_1:
    msgid = message.get('Message-ID')
    if msgid is None:
        inbox1_missing_message_id.add(message)
    else:
        inbox_1_ids.add(msgid)

# Keep every message of <mbox2> whose Message-ID does not occur in <mbox1>.
for message in inbox_2:
    msgid = message.get('Message-ID')
    if msgid is None:
        inbox2_missing_message_id.add(message)
    elif msgid not in inbox_1_ids:
        inbox_diff.add(message)

inbox_diff.flush()
inbox2_missing_message_id.flush()
inbox1_missing_message_id.flush()
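
To try the Message-ID matching on something smaller than a real backup, here is a minimal sketch that builds two throwaway mailboxes in a temporary directory and diffs them. The helper names make_mbox, message_ids and diff_message_ids, the file names and the example.com addresses are all made up for this illustration; only the mailbox calls (mbox, add, flush, close, get) are the ones the script itself relies on.

```python
import mailbox
import os
import tempfile
from email.message import Message


def make_mbox(path, ids):
    """Write one tiny message per entry of `ids`; None means 'no Message-ID header'."""
    box = mailbox.mbox(path, create=True)
    for number, msgid in enumerate(ids):
        msg = Message()
        msg["From"] = "sender@example.com"  # hypothetical address
        msg["Subject"] = "test {}".format(number)
        if msgid is not None:
            msg["Message-ID"] = msgid
        msg.set_payload("body {}\n".format(number))
        box.add(msg)
    box.flush()
    box.close()


def message_ids(path):
    """Return the Message-ID of every message in the mbox (None where missing)."""
    box = mailbox.mbox(path, create=False)
    try:
        return [message.get("Message-ID") for message in box]
    finally:
        box.close()


def diff_message_ids(mbox2_path, mbox1_path):
    """Message-IDs found in mbox2 but not in mbox1, in mbox2 order."""
    known = {msgid for msgid in message_ids(mbox1_path) if msgid is not None}
    return [
        msgid
        for msgid in message_ids(mbox2_path)
        if msgid is not None and msgid not in known
    ]


with tempfile.TemporaryDirectory() as tmp:
    newer = os.path.join(tmp, "newer.mbox")
    older = os.path.join(tmp, "older.mbox")
    make_mbox(newer, ["<a@example.com>", "<b@example.com>", None, "<c@example.com>"])
    make_mbox(older, ["<a@example.com>"])
    result = diff_message_ids(newer, older)

print(result)
```

The message without a Message-ID is skipped by the diff, which is why the script above parks such messages in separate mailboxes for manual review.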
P.S.
I also ran into several encoding issues. As you know, it is very hard to detect the right encoding of an email automatically, and utf-8 does not always work.
A brutal but working solution was to change every occurrence of the ascii encoding in the mailbox.py library to latin-1 and utf8, as follows:

get_message function:

    try:
        msg.set_from(from_line[5:].decode('latin-1'))
    except Exception as e:
        msg.set_from(from_line[5:].decode('utf8'))
    return msg

_install_message() function:

    author = message.get_from().encode('utf8')
    [...]
    from_line = from_line.encode('utf8')
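
One caveat on the decoding order: latin-1 maps every possible byte to a character, so decode('latin-1') never raises and the utf8 branch in the try/except is never reached. If you want utf-8 to be honoured when the bytes are valid utf-8, one possible ordering is to try it first and fall back to latin-1. A minimal sketch of that idea follows; decode_with_fallback is a hypothetical helper of mine, not something that exists in mailbox.py.

```python
def decode_with_fallback(raw, encodings=("utf-8", "latin-1")):
    """Decode bytes by trying each encoding in order (hypothetical helper)."""
    for encoding in encodings:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    # Only reached if the caller passes a list without latin-1,
    # because latin-1 accepts any byte sequence.
    return raw.decode("ascii", errors="replace")


utf8_bytes = "café".encode("utf-8")
latin1_bytes = "café".encode("latin-1")
print(decode_with_fallback(utf8_bytes))
print(decode_with_fallback(latin1_bytes))
```

Both calls give back the same text, whereas decoding the utf-8 bytes as latin-1 first would silently produce mojibake instead of an error.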