62
\$\begingroup\$

So a friend happened to show me how odd and specific the general email syntax rules are. For instance, emails can have "comments". Basically you can put characters in parentheses that are just ignored. So not only is it valid, email(this seems extremely redundant)@email.com is the same email as email@email.com.

Now most email providers have simpler restrictions that are easier to work with (like only ASCII letters, digits, dots and dashes). But I thought it'd be a fun exercise to follow the exact guidelines as best I could. I won't delineate every specific here, as I (hopefully) have made it all clear in the code itself.

I did heavily consult the font of all knowledge, Wikipedia, for its summary of the rules.

I'm particularly interested in feedback on how robust I made this and how I did the testing and separation of functions. In theory this should be a module people could import and call (though I have no idea when someone would actually want to use it), so I'd like reviews to focus on that. Feedback about better or more efficient methods is, of course, welcome.

"""This module will evaluate whether a string is a valid email or not.

It is based on the criteria laid out in RFC documents, summarised here:
https://en.wikipedia.org/wiki/Email_address#Syntax

Many email providers will restrict these further, but this module is primarily
for testing whether an email is syntactically valid or not.

Calling validate() will run all tests in intelligent order.
Any error found will raise an InvalidEmail error, but this also inherits from
ValueError, so errors can be caught with either of them.

If you're using any other functions, note that some of the tests will return
a modified string for the convenience of how the default tests are structured.
Just calling valid_quotes(string) will work fine, just don't use the assigned
value unless you want the quoted sections removed.
Errors will be raised from the function regardless.

>>> validate("local-part@domain")
>>> validate("[email protected]")
>>> validate("[email protected]")
Traceback (most recent call last):
  ...
InvalidEmail: Consecutive periods are not permitted.
>>> validate("[email protected]")
>>> validate("[email protected]")
>>> validate("john.smith(comment)@example.com")
>>> validate("(comment)[email protected]")
>>> validate("(comment)john.smith@example(comment).com")
>>> validate('"abcdefghixyz"@example.com')
>>> validate('abc."defghi"[email protected]')
Traceback (most recent call last):
  ...
InvalidEmail: Local may neither start nor end with a period.
>>> validate('abc."def<>ghi"[email protected]')
Traceback (most recent call last):
  ...
InvalidEmail: Incorrect double quotes formatting.
>>> validate('abc."def<>ghi"[email protected]')
>>> validate('jsmith@[192.168.2.1]')
>>> validate('jsmith@[192.168.12.2.1]')
Traceback (most recent call last):
  ...
InvalidEmail: IPv4 domain must have 4 period separated numbers.
>>> validate('jsmith@[IPv6:2001:db8::1]')
>>> validate('john.smith@(comment)example.com')
"""


import re

from string import ascii_letters, digits


HEX_BASE = 16
MAX_ADDRESS_LEN = 256
MAX_LOCAL_LEN = 64
MAX_DOMAIN_LEN = 253
MAX_DOMAIN_SECTION_LEN = 63

MIN_UTF8_CODE = 128
MAX_UTF8_CODE = 65536
MAX_IPV4_NUM = 256

IPV6_PREFIX = 'IPv6:'
VALID_CHARACTERS = ascii_letters + digits + "!#$%&'*+-/=?^_`{|}~"
EXTENDED_CHARACTERS = VALID_CHARACTERS + r' "(),:;<>@[\]'
DOMAIN_CHARACTERS = ascii_letters + digits + '-.'

# Find quote enclosed sections, but ignore \" patterns.
COMMENT_PATTERN = re.compile(r'\(.*?\)')
QUOTE_PATTERN = re.compile(r'(^(?<!\\)".*?(?<!\\)"$|\.(?<!\\)".*?(?<!\\)"\.)')

class InvalidEmail(ValueError):
    """String is not a valid Email."""

def strip_comments(s):
    """Return s with comments removed.

    Comments in an email address are any characters enclosed in parentheses.
    These are essentially ignored, and do not affect what the address is.

    >>> strip_comments('exam(alammma)ple@e(lectronic)mail.com')
    'example@email.com'"""

    return re.sub(COMMENT_PATTERN, "", s)

def valid_quotes(local):
    """Parse a section of the local part that's in double quotation marks.

    There's an extended range of characters permitted inside double quotes.
    Including: "(),:;<>@[\] and space.
    However " and \ must be escaped by a backslash to be valid.

    >>> valid_quotes('"any special characters <>"')
    ''
    >>> valid_quotes('this."is".quoted')
    'this.quoted'
    >>> valid_quotes('this"wrongly"quoted')
    Traceback (most recent call last):
      ...
    InvalidEmail: Incorrect double quotes formatting.
    >>> valid_quotes('still."wrong"')
    Traceback (most recent call last):
      ...
    InvalidEmail: Incorrect double quotes formatting."""

    quotes = re.findall(QUOTE_PATTERN, local)
    if not quotes and '"' in local:
        raise InvalidEmail("Incorrect double quotes formatting.")

    for quote in quotes:
        if any(char not in EXTENDED_CHARACTERS for char in quote.strip('.')):
            raise InvalidEmail("Invalid characters used in quotes.")

        # Remove valid escape characters, and see if any invalid ones remain
        stripped = quote.replace('\\\\', '').replace('\\"', '"').strip('".')
        if '\\' in stripped:
            raise InvalidEmail('\ must be paired with " or another \.')
        if '"' in stripped:
            raise InvalidEmail('Unescaped " found.')

        # Test if start and end are both periods
        # If so, one of them should be removed to prevent double quote errors
        if quote.endswith('.'):
            quote = quote[:-1]
        local = local.replace(quote, '')
    return local

def valid_period(local):
    """Raise error for invalid period, return local without any periods.

    Raises InvalidEmail if local starts or ends with a period or 
    if local has consecutive periods.

    >>> valid_period('example.email')
    'exampleemail'
    >>> valid_period('.example')
    Traceback (most recent call last):
      ...
    InvalidEmail: Local may neither start nor end with a period."""

    if local.startswith('.') or local.endswith('.'):
        raise InvalidEmail("Local may neither start nor end with a period.")

    if '..' in local:
        raise InvalidEmail("Consecutive periods are not permitted.")

    return local.replace('.', '')

def valid_local_characters(local):
    """Raise error if char isn't in VALID_CHARACTERS or the UTF8 code range"""

    if any(not MIN_UTF8_CODE <= ord(char) <= MAX_UTF8_CODE
           and char not in VALID_CHARACTERS for char in local):
        raise InvalidEmail("Invalid character in local.")

def valid_local(local):
    """Raise error if any syntax rules are broken in the local part."""

    local = valid_quotes(local)
    local = valid_period(local)
    valid_local_characters(local)


def valid_domain_lengths(domain):
    """Raise error if the domain or any section of it is too long.

    >>> valid_domain_lengths('long.' * 52)
    Traceback (most recent call last):
      ...
    InvalidEmail: Domain length must not exceed 253 characters.
    >>> valid_domain_lengths('proper.example.com')"""

    if len(domain.rstrip('.')) > MAX_DOMAIN_LEN:
        raise InvalidEmail("Domain length must not exceed {} characters."
                           .format(MAX_DOMAIN_LEN))

    sections = domain.split('.')
    if any(1 > len(section) > MAX_DOMAIN_SECTION_LEN for section in sections):
        raise InvalidEmail("Invalid section length between domain periods.")

def valid_ipv4(ip):
    """Raise error if ip doesn't match IPv4 syntax rules.

    IPv4 is in the format xxx.xxx.xxx.xxx
    Where each xxx is a number 1 - 256 (with no leading zeroes).

    >>> valid_ipv4('256.12.1.12')
    >>> valid_ipv4('256.12.1.312')
    Traceback (most recent call last):
      ...
    InvalidEmail: IPv4 domain must be numbers 1-256 and periods only"""

    numbers = ip.split('.')
    if len(numbers) != 4:
        raise InvalidEmail("IPv4 domain must have 4 period separated numbers.")
    try:
        if any(0 > int(num) or int(num) > MAX_IPV4_NUM for num in numbers):
            raise InvalidEmail
    except ValueError:
        raise InvalidEmail("IPv4 domain must be numbers 1-256 and periods only")

def valid_ipv6(ip):
    """Raise error if ip doesn't match IPv6 syntax rules.

    IPv6 is in the format xxxx:xxxx::xxxx::xxxx
    Where each xxxx is a hexcode, though they can be 0-4 characters inclusive.

    Additionally there can be empty spaces, and codes can be omitted entirely
    if they are just 0 (or 0000). To accommodate this, validation just checks
    for valid hex codes, and ensures that lengths never exceed max values.
    But no minimums are enforced.

    >>> valid_ipv6('314::ac5:1:bf23:412')
    >>> valid_ipv6('IPv6:314::ac5:1:bf23:412')
    >>> valid_ipv6('314::ac5:1:bf23:412g')
    Traceback (most recent call last):
      ...
    InvalidEmail: Invalid IPv6 domaim: '412g' is invalid hex value.
    >>> valid_ipv6('314::ac5:1:bf23:314::ac5:1:bf23:314::ac5:1:bf23:41241')
    Traceback (most recent call last):
      ...
    InvalidEmail: Invalid IPv6 domain"""

    if ip.startswith(IPV6_PREFIX):
        ip = ip.replace(IPV6_PREFIX, '')
    hex_codes = ip.split(':')
    if len(hex_codes) > 8 or any(len(code) > 4 for code in hex_codes):
        raise InvalidEmail("Invalid IPv6 domain")

    for code in hex_codes:
        try:
            if code:
                int(code, HEX_BASE)
        except ValueError:
            raise InvalidEmail("Invalid IPv6 domaim: '{}' is invalid hex value.".format(code))

def valid_domain_characters(domain):
    """Raise error if any invalid characters are used in domain."""

    if any(char not in DOMAIN_CHARACTERS for char in domain):
        raise InvalidEmail("Invalid character in domain.")

def valid_domain(domain):
    """Raise error if domain is neither a valid domain nor IP.

    Domains (sections after the @) can be either a traditional domain or an IP
    wrapped in square brackets. The IP can be IPv4 or IPv6.
    All these possibilities are accounted for."""

    # Check if it's an IP literal
    if domain.startswith('[') and domain.endswith(']'):
        ip = domain[1:-1]
        if '.' in ip:
            valid_ipv4(ip)
        elif ':' in ip:
            valid_ipv6(ip)
        else:
            raise InvalidEmail("IP domain not in either IPv4 or IPv6 format.")
    else:
        valid_domain_lengths(domain)

def validate(address):
    """Raises an error if address is an invalid email string."""

    try:
        local, domain = strip_comments(address).split('@')
    except ValueError:
        raise InvalidEmail("Address must have one '@' only.")

    if len(local) > MAX_LOCAL_LEN:
        raise InvalidEmail("Only {} characters allowed before the @"
                         .format(MAX_LOCAL_LEN))
    if len(domain) > MAX_ADDRESS_LEN:
        raise InvalidEmail("Only {} characters allowed in address"
                         .format(MAX_ADDRESS_LEN))

    valid_local(strip_comments(local))
    valid_domain(strip_comments(domain))


if __name__ == "__main__":
    import doctest
    doctest.testmod()
    raw_input('>DONE<')
\$\endgroup\$
12
  • 1
    \$\begingroup\$ Unfortunately, I couldn't get your code to work (I get an IndentationError), but I suspect that it might fail even on some of the more simple examples from RFC3696. \$\endgroup\$ Commented Jan 22, 2016 at 15:40
  • 1
    \$\begingroup\$ Your handling of comments isn't strictly correct; quoted-string can only contain FWS between the quotes, not CFWS, so anything that looks like a comment inside a quoted-string isn't a comment, and shouldn't be removed. Something similar is true for domain-literals inside square brackets. Neither is likely to have much real-world impact, but if you want to be absolutely correct you might want to think about how to handle that. \$\endgroup\$
    – hobbs
    Commented Jan 22, 2016 at 17:34
  • 1
    \$\begingroup\$ you may want to take a look at ex-parrot.com/pdw/Mail-RFC822-Address.html \$\endgroup\$
    – njzk2
    Commented Jan 22, 2016 at 19:07
  • 2
    \$\begingroup\$ Well I just tried sending an email to somebody(with_a_comment)@gmail.com and my gmail won't even let me send it. It says "somebody" is invalid. The comment is not even mentioned in the error message. \$\endgroup\$
    – Octopus
    Commented Jan 22, 2016 at 21:58
  • 13
    \$\begingroup\$ "I did heavily consult the font of all knowledge, Wikipedia for its summary on the rules." - there's your problem. If implementing something technical, you should always get the official spec - which is RFC 2822 (and the updates to it) for your case. \$\endgroup\$
    – Bergi
    Commented Jan 23, 2016 at 21:40

6 Answers

33
\$\begingroup\$

"@"@example.com and "\ "@example.com both fail, but they are valid.

" "@example.com passes, but it is, in fact, invalid.*

You probably skipped checking your understanding against the relevant RFCs, which a conforming implementation must follow. While Wikipedia is quite reliable nowadays, it is by no means a normative source.

 

*RFC 5322 describes quoted-string as follows:

quoted-string   =   [CFWS]
                    DQUOTE *([FWS] qcontent) [FWS] DQUOTE
                    [CFWS]

FWS means "folding white space". It is an optional run of whitespace followed by a single CRLF, and that optional sequence (if present) precedes a mandatory part consisting of one or more whitespace characters. While an address' local part can legally begin and end with a space, both spaces need to be separated by at least one character forming qcontent.

\$\endgroup\$
2
  • 6
    \$\begingroup\$ This answer precisely describes why validating valid addresses is a mostly futile exercise. It's far easier to get it wrong than it is to get it right. Back in the day, you could just finger addresses to get an approximation of deliverability, but these days you may as well just send it out and hope for the best. \$\endgroup\$
    – phyrfox
    Commented Jan 23, 2016 at 6:15
  • 2
    \$\begingroup\$ The only way to validate an email address is to try and send an email to it. If it fails it doesn't necessarily mean it's an invalid address, but it means that your methods of sending email can't send it there, so its validity isn't that important (whether or not this means you want to continue using that email library is up to you..) \$\endgroup\$ Commented Jan 24, 2016 at 18:37
24
\$\begingroup\$

Said this in chat already, but @ succeeds even though it is not a valid email address. You should require at least 1 character in the local part and 1 character in the domain.
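A minimal sketch of that check, written for Python 3. The helper name split_address and its two new error messages are made up for illustration; only InvalidEmail and the "one '@' only" message come from the question. Like the question's code, it splits on every @, so it does not handle an @ inside a quoted local part.

```python
class InvalidEmail(ValueError):
    """String is not a valid Email."""


def split_address(address):
    """Split address at its single '@' and insist both halves are non-empty.

    split_address is a hypothetical helper, not part of the question's code.
    """
    try:
        local, domain = address.split('@')
    except ValueError:
        raise InvalidEmail("Address must have one '@' only.")
    if not local:
        raise InvalidEmail("Local part must not be empty.")
    if not domain:
        raise InvalidEmail("Domain must not be empty.")
    return local, domain
```

With this in place a bare @ raises InvalidEmail instead of passing silently.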

\$\endgroup\$
24
\$\begingroup\$

I personally find it hard to fault your code. Actually I'm quite surprised about the absence of code.

Other than a few PEP8 errors there are three changes that I would recommend. You remove both \\ and \" from your quotes, but you do it in an overly verbose way:

stripped = quote.replace('\\\\', '').replace('\\"', '"').strip('".')

Instead you can use re.sub:

stripped = re.sub(r'\\[\\"]', '', quote).strip('".')

This way you read that it can replace both \\ and \" at the same time for slightly better readability. I don't know if there is much of a performance difference however.
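One thing worth checking before swapping the two: they are not strictly equivalent. The chained version turns \" into a bare ", while the regex removes the escaped pair entirely. A small comparison, using a made-up sample string:

```python
import re

# A quoted section containing one escaped quote (\") and one escaped
# backslash (\\). The sample value is invented for this comparison.
quote = r'"a\"b\\c"'

# The question's version: \\ is dropped, \" becomes a bare ".
chained = quote.replace('\\\\', '').replace('\\"', '"').strip('".')

# The regex version: both \\ and \" are dropped in a single pass.
with_regex = re.sub(r'\\[\\"]', '', quote).strip('".')

print(chained)     # a"bc
print(with_regex)  # abc
```

Since the next check in valid_quotes raises on any " left in stripped, the chained version would reject this input, while the regex version would let it through.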


I would also add another function, as currently you will use the validate function as follows:

try:
    validate('')
except InvalidEmail:
    pass  # Handle non-valid email
else:
    pass  # Handle valid email

Personally, I find this a lot of boilerplate if all you wish to know is whether it is valid. For these cases I would recommend that you make an is_valid function. This would change the above to:

if is_valid(''):
    # Handle valid email
else:
    # Handle non-valid email

It can help readability in cases where you don't want to know the error. That probably isn't how you intend it to be used, given all the helpful errors, but it is a way I know I would want to use it.
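A sketch of what is_valid could look like. The validate below is only a stand-in so the example runs on its own; in the real module is_valid would wrap the question's validate unchanged.

```python
class InvalidEmail(ValueError):
    """String is not a valid Email."""


def validate(address):
    """Stand-in for the question's validate(); only checks the '@' count."""
    if address.count('@') != 1:
        raise InvalidEmail("Address must have one '@' only.")


def is_valid(address):
    """Return True if validate() accepts address, False otherwise."""
    try:
        validate(address)
    except InvalidEmail:
        return False
    return True
```

Catching InvalidEmail rather than using a bare except keeps unrelated bugs (a TypeError from passing None, say) visible to the caller.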


All your functions are public which encourages me to do:

import email
email.valid_quotes('joe@domain')

This should be a private function that I shouldn't be using, so you should name it _valid_quotes. While it can still be called the same way, the leading underscore marks it as private by Python convention, which is also how re.py names its internal functions.

And as @Mathias said you should also add __all__ too.


Other than the above three points, you have a few PEP8 errors you may not have picked up on. But they're quite petty:

  1. Surround top-level function and class definitions with two blank lines.

  2. You have too much whitespace around your imports, two blank lines would be enough (Which would still go against PEP8).

  3. You didn't indent enough on two of the errors in validate.

  4. You have one line that goes above 79 chars.

\$\endgroup\$
1
  • 2
    \$\begingroup\$ is_valid is a great idea, especially since there's only really one public function at the moment. \$\endgroup\$ Commented Jan 22, 2016 at 14:18
17
\$\begingroup\$

You seem to have built your module so that validate is its only "public" function. You may want to enforce that by declaring __all__ = ['validate', 'InvalidEmail']. It will affect the way that pydoc and the help builtin display help on your module (they will show only the module docstring, the exception and the validate function) as well as how from the_ultimate_email_validator import * is handled (letting only validate and InvalidEmail leak into the global namespace).
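To see the effect on a star import without creating a file, here is a small Python 3 demonstration. It builds a throwaway module in memory; the module name email_validator_demo and its cut-down body are invented for the example.

```python
import sys
import types

# Cut-down module source: two public names listed in __all__, one helper not.
source = '''
__all__ = ['validate', 'InvalidEmail']


class InvalidEmail(ValueError):
    """String is not a valid Email."""


def valid_quotes(local):
    return local


def validate(address):
    if address.count('@') != 1:
        raise InvalidEmail("Address must have one '@' only.")
'''

# Register the throwaway module so that the import machinery can find it.
module = types.ModuleType('email_validator_demo')
exec(source, module.__dict__)
sys.modules['email_validator_demo'] = module

# Run a star import in a fresh namespace and list what leaked into it.
namespace = {}
exec('from email_validator_demo import *', namespace)
exported = sorted(name for name in namespace if not name.startswith('__'))
print(exported)  # ['InvalidEmail', 'validate']
```

valid_quotes is still reachable as email_validator_demo.valid_quotes; __all__ only controls what the star import and the help output treat as public.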

Other than that, the intended use case of validate closely resembles int or related builtins. As such, it could be useful to rename it to a less passive action (say email) and call it like:

valid_address = email(user_input)

The returned value could be stripped of comments and any parsing issue would raise InvalidEmail the same way guess = int(raw_input()) would raise ValueError. The caller would still be responsible for handling invalid addresses using a try .. except as in your current version.

Speaking of that return value, I guess it would be something along the lines of

return '{}@{}'.format(local, domain)

at the end of validate because comments are already stripped at the first line of the function. But then, why do you call valid_local(strip_comments(local)) and valid_domain(strip_comments(domain)) instead of valid_local(local) and valid_domain(domain)? There doesn't seem to be any case where comments could be left in either local or domain after stripping the entire address.
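Put together, the int-like variant might look roughly like this. It is a sketch only: COMMENT_PATTERN and the error message are copied from the question, and the calls to valid_local and valid_domain are left as a comment so the example stays self-contained.

```python
import re

COMMENT_PATTERN = re.compile(r'\(.*?\)')


class InvalidEmail(ValueError):
    """String is not a valid Email."""


def email(address):
    """Return address with comments removed, or raise InvalidEmail."""
    # Comments are stripped exactly once, before splitting.
    try:
        local, domain = COMMENT_PATTERN.sub('', address).split('@')
    except ValueError:
        raise InvalidEmail("Address must have one '@' only.")
    # valid_local(local) and valid_domain(domain) would be called here.
    return '{}@{}'.format(local, domain)
```

A caller can then write valid_address = email(user_input) and keep the cleaned-up string.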

\$\endgroup\$
1
  • 1
    \$\begingroup\$ Very good tip with __all__, I'd previously just used _ but it felt more unwieldy to do here, this is a great solution! Also you're right about the redundant duplicate of strip_comment, I had previously arranged it so that it wouldn't be called so early and didn't update to match the change. \$\endgroup\$ Commented Jan 22, 2016 at 14:17
12
\$\begingroup\$

The code that you've written is generally really good, but as you seem to have found, parsing non-trivial strings gets complicated quickly and leaves all sorts of room for nasty edge cases.

Your strip_comments function appears not to account for nested comments, which are explicitly allowed by the RFC.

As expected, strip_comments("Hello (new) world") returns "Hello world", but when I ran it, strip_comments("Hello (new (old) ish) world") returned 'Hello ish) world'.

Removing nested comments with regular expressions is hard; indeed, with a purist view of regular expressions, it's impossible. Basically, to do this you need a recursive regex, which seems not to be supported by Python's re engine.

In this particular case, it shouldn't be too hard to roll your own comment remover for the next iteration: all you really need to do is iterate over the string, keeping track of the number of brackets currently open.
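A minimal sketch of that bracket-counting approach. The function name strip_nested_comments is made up, and the sketch deliberately ignores quoted strings and backslash escapes.

```python
def strip_nested_comments(s):
    """Return s with (possibly nested) parenthesised comments removed."""
    depth = 0   # number of brackets currently open
    kept = []
    for char in s:
        if char == '(':
            depth += 1
        elif char == ')':
            if depth == 0:
                raise ValueError("Unbalanced ')' in address.")
            depth -= 1
        elif depth == 0:
            # Only characters outside every comment are kept.
            kept.append(char)
    if depth:
        raise ValueError("Unclosed '(' in address.")
    return ''.join(kept)
```

On "Hello (new (old) ish) world" this removes the whole nested comment and leaves only the two surrounding spaces between the words.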

You may find, though, that this gets unmanageable quite quickly when you try and account for quoted strings and escaped characters - how would you write your parser such that it parses foo"\")"("")@example.com down to foo")@example.com? If you really want to hit as many pathological edge cases as possible, I'd suggest learning about formal languages and parsers, then digging out a parser library for Python to help you build your own. The Python Wiki lists several, and this one in particular looks pretty nice, though I haven't tried to use it myself.
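Before reaching for a parser library, a hand-written scanner can get a little further by tracking three pieces of state: the comment depth, whether it is inside a quoted string, and whether the current character is escaped. The sketch below is an illustration under those assumptions, not a full RFC parser. It leaves quoted strings intact rather than unquoting them, and the function name is made up.

```python
def strip_comments_quote_aware(s):
    """Remove comments, leaving quoted strings (and their escapes) intact."""
    kept = []
    depth = 0           # number of comment brackets currently open
    in_quotes = False   # True while inside a "quoted string"
    i = 0
    while i < len(s):
        char = s[i]
        if char == '\\' and (in_quotes or depth) and i + 1 < len(s):
            # A backslash escapes the next character in quotes and comments.
            if depth == 0:
                kept.append(s[i:i + 2])
            i += 2
            continue
        if char == '"' and depth == 0:
            in_quotes = not in_quotes
            kept.append(char)
        elif char == '(' and not in_quotes:
            depth += 1
        elif char == ')' and not in_quotes:
            if depth == 0:
                raise ValueError("Unbalanced ')' in address.")
            depth -= 1
        elif depth == 0:
            kept.append(char)
        i += 1
    if depth or in_quotes:
        raise ValueError("Unclosed comment or quoted string.")
    return ''.join(kept)
```

For the example above it returns foo"\")"@example.com: the quoted ) survives and the ("") comment is dropped. Unquoting the result down to foo")@example.com would be a separate step.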

\$\endgroup\$
7
\$\begingroup\$

Instead of reviewing your code, I looked at your tests.

You do not seem to support IPv4 mapped IPv6 addresses:

validate('hello@[::FFFF:222.1.41.90]')

(See http://www.tcpipguide.com/free/t_IPv6IPv4AddressEmbedding-2.htm)
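One way to cover that form without hand-parsing is to lean on the standard library's ipaddress module (Python 3), which understands the dotted-quad suffix. This swaps in a different technique from the question's own valid_ipv4 and valid_ipv6; the helper name parse_ip_literal is made up.

```python
import ipaddress

IPV6_PREFIX = 'IPv6:'


class InvalidEmail(ValueError):
    """String is not a valid Email."""


def parse_ip_literal(literal):
    """Parse the text between [ and ] of a domain literal.

    Returns an ipaddress.IPv4Address or ipaddress.IPv6Address.
    """
    if literal.startswith(IPV6_PREFIX):
        literal = literal[len(IPV6_PREFIX):]
    try:
        return ipaddress.ip_address(literal)
    except ValueError:
        raise InvalidEmail("IP domain not in either IPv4 or IPv6 format.")


mapped = parse_ip_literal('::FFFF:222.1.41.90')
print(mapped.version)      # 6
print(mapped.ipv4_mapped)  # 222.1.41.90
```

As a side effect this also rejects out-of-range octets such as 256, which the question's valid_ipv4 currently accepts.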

Furthermore, it turns out the validation of the domain is quite lax.

validate('hello@!^&&%^%#&^%&%$^#%^&%$^%#&^*&^*^%^#$')
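Two things in the posted code seem to let this through: valid_domain never calls valid_domain_characters, and the chained comparison 1 > len(section) > MAX_DOMAIN_SECTION_LEN can never be true. A sketch of a stricter per-section check follows. It assumes plain ASCII hostnames (no internationalised domains), and both the function name and the hyphen message are made up.

```python
from string import ascii_letters, digits

MAX_DOMAIN_SECTION_LEN = 63
SECTION_CHARACTERS = ascii_letters + digits + '-'


class InvalidEmail(ValueError):
    """String is not a valid Email."""


def check_domain_sections(domain):
    """Raise InvalidEmail if any period-separated section is malformed."""
    for section in domain.split('.'):
        # Each section must be between 1 and 63 characters long.
        if not 1 <= len(section) <= MAX_DOMAIN_SECTION_LEN:
            raise InvalidEmail("Invalid section length between domain periods.")
        if any(char not in SECTION_CHARACTERS for char in section):
            raise InvalidEmail("Invalid character in domain.")
        # Hostname sections may not start or end with a hyphen.
        if section.startswith('-') or section.endswith('-'):
            raise InvalidEmail("Domain sections may not start or end with '-'.")
```

Because newline is not in SECTION_CHARACTERS, this also rejects a domain with a line break embedded in it.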

And even worse:

 validate('[email protected]\n')
 validate('hello@exa\nmp\nle.com\n')
 validate('[email protected]\nX-Spam-Score: 0')

(I don't know the proper headers for spam-checkers, but by only using your validator, I would be able to inject headers into an e-mail.)
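A conservative guard against this is to refuse any control character before doing anything else. This is a sketch under the assumption that the input is a bare address string rather than a folded header line; the function name and message are made up.

```python
class InvalidEmail(ValueError):
    """String is not a valid Email."""


def reject_control_characters(address):
    """Raise InvalidEmail if address contains CR, LF or another control code."""
    for char in address:
        # ASCII control characters are 0-31 and 127 (DEL).
        if ord(char) < 32 or ord(char) == 127:
            raise InvalidEmail(
                "Control character {!r} is not allowed.".format(char))
    return address
```

Calling this at the top of validate would stop a line break, and with it any injected header, from ever reaching the later checks.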

\$\endgroup\$
