Validating e-mail addresses

tl;dr: Most likely, you want to validate using the regular expression from the WHATWG (please think about the trade-off you want between practicality and precision); but if you have read the caveats below and still want to validate to RFC 5322, then you want libemailvalidation.

Validating e-mail addresses is hard, and not something which you normally want to do in great detail: while it’s possible to spend a lot of time checking the syntax of an e-mail address, the real measure of whether it’s valid is whether the mail server on that domain accepts it. There is ultimately no way around checking that.

Given that a lot of mail providers implement their own restrictions on the local-part (the bit before the ‘@’) of an e-mail address, an address like !!@gmail.com (which is syntactically valid) probably won’t actually be accepted. So what’s the value in doing syntax checks on e-mail addresses? The value is in catching trivial user mistakes, like pasting the wrong data into an e-mail address field, or making a trivial typo in one.

So, for most use cases, there’s no need to bother with fancy validation: just check that the e-mail address matches the regular expression from the WHATWG. That should catch simple mistakes, accept practically every e-mail address in real-world use, and reject some invalid addresses. (It deliberately does not accept everything RFC 5322 allows, such as comments or quoted local-parts.)
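As a rough sketch of what that check could look like in C, here is the WHATWG pattern run through POSIX regcomp(). The helper name looks_like_email() is made up for this example, and the pattern is my transcription of the one in the HTML specification with its non-capturing groups turned into plain groups (POSIX extended syntax has no non-capturing form), so compare it against the spec before relying on it.

```c
#include <regex.h>
#include <stdbool.h>
#include <stddef.h>

/* WHATWG e-mail pattern, adapted to POSIX extended regular expressions:
 * each (?:...) group from the original is written as a plain (...) group. */
static const char whatwg_email_pattern[] =
    "^[a-zA-Z0-9.!#$%&'*+/=?^_`{|}~-]+"
    "@[a-zA-Z0-9]([a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?"
    "(\\.[a-zA-Z0-9]([a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?)*$";

/* Hypothetical helper (not part of any library): returns true if the
 * nul-terminated string @address matches the pattern above. */
static bool
looks_like_email (const char *address)
{
  regex_t regex;
  bool matches;

  /* REG_NOSUB: only a yes/no answer is needed, not match offsets. */
  if (regcomp (&regex, whatwg_email_pattern, REG_EXTENDED | REG_NOSUB) != 0)
    return false;

  matches = (regexec (&regex, address, 0, NULL, 0) == 0);
  regfree (&regex);

  return matches;
}
```

With this sketch, looks_like_email ("!!@gmail.com") is true even though the provider would probably refuse the address, which is exactly the gap between a syntax check and real deliverability. Compiling the pattern on every call keeps the example short; real code would compile it once and reuse it.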

Why have I been going further? Because Walbottle needs it. I think the case where one RFC references another is one of the few times it’s necessary to fully implement e-mail validation. In this case, Walbottle needs to be able to validate e-mail addresses provided in JSON files, for its email defined format.

So, I’ve just finished writing a small copylib to validate e-mail addresses according to all the RFCs I could get my hands on; mostly RFC 5322, but there is a sprinkling of 5234, 5321, 3629 and 6532 in there too. It’s called libemailvalidation (because naming is hard; typing is easier). Since it’s only about 1000 lines of code, there seems to be little point in building a shared library for it and distributing that; so add it as a git submodule to your code, and use validate.c and validate.h directly. It provides a single function:

bool is_valid;
size_t error_position;

is_valid = emv_validate_email_address (address_to_check,
                                       length_of_address_to_check,
                                       EMV_VALIDATE_FLAGS_NONE,
                                       &error_position);

if (!is_valid)
  fprintf (stderr, "Invalid e-mail address; error at byte %zu\n",
           error_position);

I’ve had fun testing this lot using test cases generated from the ABNF rules taken directly from the RFCs, thanks to abnfgen. If you find any problems, please get in touch!

Fun fact for the day: due to the obs-qp rule, a valid e-mail address can contain a nul byte. So unless you ignore deprecated syntax for e-mail addresses (not an option for programs which need to be interoperable), e-mail addresses cannot be passed around as nul-terminated strings.
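To make that concrete, here is a small sketch of a buffer holding such an address. The data and the helper name contains_nul() are invented for illustration, and the address is valid only under the obsolete obs-qp syntax as I read RFC 5322. The point is that strlen() stops at the embedded nul, so the length has to travel separately from the pointer.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Illustrative data: the bytes of  "a\<NUL>b"@example.com  where <NUL> is a
 * literal nul byte escaped by a backslash inside the quoted local-part. */
static const char address_with_nul[] = "\"a\\\0b\"@example.com";

/* Real length of the address, excluding the terminator the compiler adds. */
static const size_t address_with_nul_length = sizeof address_with_nul - 1;

/* Hypothetical helper: true if the first @length bytes of @address contain
 * a nul byte, i.e. if the buffer cannot be treated as a C string. */
static bool
contains_nul (const char *address, size_t length)
{
  return memchr (address, '\0', length) != NULL;
}
```

Here address_with_nul_length is 18, while strlen (address_with_nul) only reports 3. The call to emv_validate_email_address() shown earlier takes an explicit length alongside the pointer, which fits this constraint.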

8 thoughts on “Validating e-mail addresses”

Mathias December 8, 2016 at 11:08

Actually that nul byte rule gives a strong reason to _not_ strictly follow the published standard, as accepting email addresses containing a nul byte can easily cause harm in other software components that don't know every dirty detail of the spec. I believe it would be responsible to at least have a switch that rejects email addresses that make little sense in the wild but contain potentially dangerous data. Well, or just drop them on the floor immediately, despite the standard allowing them. Since they have no use in real-world applications, you might just be helping criminals every single time you let them through.

Jeffrey Stedfast December 8, 2016 at 16:48

(Note: reposted to properly html-encode angle brackets in my comment text)

It seems odd to me that you chose to implement validation of rfc5322 addresses as opposed to rfc5321 addresses.

rfc5322 is the format for messages but is not what is accepted by SMTP servers and is also not the form that users are likely to enter in a text field. Essentially, rfc5321 is the canonical form and is what you expect people to paste in, say, a textbox that is prompting a user for their email address. No one in their right mind is going to type: "Joe Sixpack" <joe (comment freak)@sixpack (as in abs, not beer!).org>

Besides which, it would require users to actually *know* the email address grammar to get right.

So it just doesn't make sense to validate the rfc5322 form instead of the rfc5321 form.

To add to that, if you validate the rfc5322 form and then try to take that string and feed it to SMTP as part of a MAIL FROM or RCPT TO command, it *will* fail.

What you want (and need!) is the canonical form, so that means your validator is useless unless it outputs the canonical form (at which point, you'd need to make it a parser rather than a validator).

Anyway, all that said, I wrote something similar a few years back based on rfc5321 (amusingly with a nearly identical name): https://github.com/jstedfast/EmailValidation

Hopefully you take my comment as constructive criticism. Based on a look over your code, it looks like it probably does a good job of validating the ABNF grammar that you targeted, I just think you targeted the wrong syntax 🙂

Philip Withnall (post author) December 12, 2016 at 20:08

All good points, and I entirely agree. Thanks for the feedback. 🙂

However, the motivation for this was JSON Schema, which explicitly asks for RFC 5322 validation: http://json-schema.org/latest/json-schema-validation.html#rfc.section.7.3.2.p.1.

Whether the JSON Schema specification is incorrect is another matter, and I will try and find time to query this with them.

Mike Gratton December 12, 2016 at 12:30

What are your thoughts about some of the counter examples listed here: http://www.regular-expressions.info/email.html ?

E.g. "john {at} aol..(.)com" is apparently not valid, but will match the regex in the tl;dr above.

Philip Withnall (post author) December 12, 2016 at 20:14

My thoughts are along the same lines as that article: there’s always a tradeoff between what’s practical and what’s precise. And there’s always the question of what the purpose of the validation is. Are you trying to catch all typos in e-mail address entry? Or are you trying to catch the common mistakes of a) not filling out an e-mail address field or b) not putting an e-mail address in it? The regex I suggested is definitely not the only regex which is suitable for validating e-mail addresses, but I think it’s a reasonable compromise between catching really basic mistakes and not going overboard with regexes.

That said, to properly work out what the purpose of the validation is would require a model of the mistakes people commonly make when entering e-mail addresses. Kind of like the cool stuff Peter has been doing with libinput analysis recently: http://who-t.blogspot.co.uk/2016/12/libinput-touchpad-tap-analysis.html

Bastien December 12, 2016 at 20:21

Would be great to have in GLib, so it can be used in GtkEntry when in
email mode (with "purpose" set to GTK_INPUT_PURPOSE_EMAIL).

Philip Withnall (post author) December 12, 2016 at 20:25

Hmm, maybe. I don’t think this (RFC 5322) validation is what’s needed in GLib, if anything. I also think e-mail addresses are a little too specialised to be in GLib — but support for them would probably be appropriate in GTK+. However, the documentation for GtkInputPurpose explicitly says that “the purpose is not meant to impose a totally strict rule about allowed characters, and does not replace input validation”. I think validation is probably a useful thing to have in GTK+, but it would need a wider discussion about whether to use GtkInputPurpose for that, or whether to add something separate. I don’t have time to get into that discussion now (especially since I’m not working on anything which needs this in GTK+, so I have no motivating use-case). 🙁