I have two mbox files containing about 6k emails each. They should be more or less identical, except that #1 contains about 100 emails that #2 does not. I would love to generate a third mbox file containing just those 100 messages, a diff so to speak.

I used to automatically forward messages from one inbox into another (server-side), but for some undetermined reason a few messages were randomly not forwarded. #2 is the inbox into which emails were forwarded: lots of read and replied-to messages with additional headers containing information on how they were forwarded. #1 is a recent dump, via IMAP, of 6k unread messages.


I am working with Thunderbird under Linux.

3 Answers

The following Python script solves the problem:

import mailbox

inbox_2 = mailbox.mbox('inbox_2_file')
inbox_1 = mailbox.mbox('inbox_1_file')

inbox_diff = mailbox.mbox('inbox_diff_file', create=True)

# Collect the Message-IDs of inbox 2 in a set for fast membership tests
inbox_2_ids = set()
for message in inbox_2:
    inbox_2_ids.add(message.get('Message-ID'))

for message in inbox_1:
    if message.get('Message-ID') not in inbox_2_ids:
        inbox_diff.add(message)

inbox_diff.flush()

Thanks so much for posting both your question and answer, @s-m-e. The Python script works beautifully as long as every email message includes a Message-ID. Sadly, that is not always the case: Message-ID is, surprisingly, not a mandatory field (RFC 5322 only says that a message SHOULD have one).

With a very slight alteration, your script can be used to create a new mbox file containing all of the email messages that are missing the Message-ID field:

import mailbox

inbox_1 = mailbox.mbox('inbox_1_file')

inbox_missing_message_id = mailbox.mbox('inbox_missing_message_id_file', create=True)

for message in inbox_1:
    if message.get('Message-ID') is None:
        inbox_missing_message_id.add(message)

inbox_missing_message_id.flush()

    Interesting, I did not know that it was not mandatory. Thanks a lot for pointing this out. The real question then becomes how you would adjust the actual diff code. How can you figure out which email is which in a reliable manner, i.e. how can you safely define an "alternative id"? I thought about using a mix of subject lines and timestamps, but the latter seems to be really unreliable in too many different ways.
    – s-m-e
    Commented Sep 14, 2017 at 13:23
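One possible way to approach the "alternative id" question from the comment above, as a rough sketch rather than a reliable answer: use the Message-ID when it exists and otherwise fall back to a hash over a few headers. The function name message_key and the choice of From, Subject and Date are assumptions made for illustration. The Date header is hashed as the raw string the sender wrote, not as a parsed timestamp, and the whole idea only works if forwarding left those headers untouched.

```python
import hashlib


def message_key(message):
    """Return a comparison key: the Message-ID if present, else a header hash.

    Hypothetical helper: the header choice (From, Subject, Date) is an
    assumption and may need adjusting for a given pair of mailboxes.
    """
    msgid = message.get('Message-ID')
    if msgid is not None and str(msgid).strip():
        return 'id:' + str(msgid).strip()

    digest = hashlib.sha256()
    for header in ('From', 'Subject', 'Date'):
        value = str(message.get(header, ''))
        # Collapse whitespace so that re-folded header lines still match
        normalized = ' '.join(value.split())
        digest.update(normalized.encode('utf-8', 'replace'))
        digest.update(b'\0')  # separator between fields
    return 'hash:' + digest.hexdigest()
```

In the diff scripts, message_key(message) would take the place of message.get('Message-ID'). Two different messages with identical From, Subject and Date would collide, so any matches made through the hash deserve a manual look.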

Some five years later, I had to compare two Thunderbird mailbox backups of 306 MB and 338 MB. So I put all the excellent advice into a self-contained helper script that computes the difference: every message in the first mailbox whose Message-ID does not appear in the second. Along the way it also creates, for each input, a mailbox holding the emails that lack the Message-ID field. For me this works out of the box on a Mac, producing two additional mailboxes of emails without a Message-ID, of 52 KB and 10 KB. More than manageable!

Hope this comes in handy. Thanks for your insights.

#!/usr/bin/env python3

import mailbox
import sys
import os

if len(sys.argv) <= 2:
    # The first argument is mbox1, the second is mbox2 (see the assignments below)
    print("USAGE: {} <mbox1> <mbox2>".format(os.path.basename(sys.argv[0])))
    print("       <diff> = <mbox1> - <mbox2> is based on the 'Message-ID' field.")
    print("       E-mails with no 'Message-ID' are copied to <mbox[1|2]>_no_Message-ID.")
    print("       Original mailboxes are left unaltered.")
    sys.exit(1)

if not os.path.isfile(sys.argv[1]):
    print(" * mbox_file {} does not exist.".format(sys.argv[1]))
    sys.exit(2)
if not os.path.isfile(sys.argv[2]):
    print(" * mbox_file {} does not exist.".format(sys.argv[2]))
    sys.exit(2)

mbox1_file = sys.argv[1]
mbox2_file = sys.argv[2]

# Name the output after the operation actually performed: mbox1 minus mbox2
outfile = os.path.join(
    os.path.dirname(mbox1_file),
    '{}_minus_{}'.format(os.path.basename(mbox1_file), os.path.basename(mbox2_file)),
)
if os.path.isfile(outfile):
    print(" * OUTPUT mbox_file {} already exists.".format(outfile))
    sys.exit(2)

nomsgid1_file = os.path.join(
    os.path.dirname(mbox1_file)
    , '{}_no_Message-ID'.format(os.path.basename(mbox1_file))
)
if os.path.isfile(nomsgid1_file):
    print(" * OUTPUT mbox_file {} already exists.".format(nomsgid1_file))
    sys.exit(2)


nomsgid2_file = os.path.join(
    os.path.dirname(mbox2_file)
    , '{}_no_Message-ID'.format(os.path.basename(mbox2_file))
)
if os.path.isfile(nomsgid2_file):
    print(" * OUTPUT mbox_file {} already exists.".format(nomsgid2_file))
    sys.exit(2)


inbox_1 = mailbox.mbox(mbox1_file)
inbox_2 = mailbox.mbox(mbox2_file)
inbox_diff = mailbox.mbox(outfile, create=True)
inbox1_missing_message_id = mailbox.mbox(nomsgid1_file, create=True)
inbox2_missing_message_id = mailbox.mbox(nomsgid2_file, create=True)


# Collect the Message-IDs of inbox 2 in a set for fast membership tests
inbox_2_ids = set()
for message in inbox_2:
    msgid = message.get('Message-ID')
    if msgid is None:
        inbox2_missing_message_id.add(message)
    else:
        inbox_2_ids.add(msgid)

for message in inbox_1:
    msgid = message.get('Message-ID')
    if msgid is None:
        inbox1_missing_message_id.add(message)
    elif msgid not in inbox_2_ids:
        inbox_diff.add(message)

inbox_diff.flush()
inbox2_missing_message_id.flush()
inbox1_missing_message_id.flush()

P.S. I also encountered several encoding issues. It is very hard to detect an email's encoding automatically, and UTF-8 does not always work. A brutal but working solution was to change every occurrence of the ascii encoding in the mailbox.py library to latin-1 or utf8, as follows:

get_message() function:

try:
    # Try UTF-8 first: decoding as latin-1 never fails, so it has to be the fallback
    msg.set_from(from_line[5:].decode('utf8'))
except UnicodeDecodeError:
    msg.set_from(from_line[5:].decode('latin-1'))
return msg

_install_message() function:

author = message.get_from().encode('utf8')
[...]
from_line = from_line.encode('utf8')
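If editing the standard library feels too invasive, here is a possible alternative for the reading side, sketched under the assumption that the failures come from decoding the "From " envelope line. mailbox.mbox accepts a factory callable, and when one is given, the mailbox hands it the raw message without the "From " line, so that line is never decoded. The helper names parse_raw and message_ids are made up for this example.

```python
import email
import mailbox


def parse_raw(fileobj):
    # mailbox passes a binary file-like object that starts after the
    # "From " envelope line, so no ASCII decoding of that line happens here.
    return email.message_from_bytes(fileobj.read())


def message_ids(path):
    """Return the set of Message-ID values found in the mbox at `path`."""
    box = mailbox.mbox(path, factory=parse_raw, create=False)
    try:
        ids = set()
        for message in box:
            msgid = message.get('Message-ID')
            if msgid is not None:
                ids.add(msgid)
        return ids
    finally:
        box.close()
```

The messages returned this way are plain email.message.Message objects, not mailbox.mboxMessage, so as far as I can tell the original envelope sender is not carried over when such a message is written to another mbox with add(). For merely collecting and comparing Message-IDs that should not matter.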
