This is a free PHP function to validate an email address against the various relevant internet RFCs.
I have also collected together some of the other widely-used validators that are in the public domain, along with their test cases, and compared them below. After a close reading of the RFCs (and their published errata) I disagree with a few of the test cases proposed by other authors. I've listed these exceptions and my reasoning after the results below.
Download my function and the test cases.
You can compare the validators against all the test cases. You can download your chosen validator to use in your project (click on the author's name below).
You can consult the RFCs yourself (good luck): RFC3696, RFC5322, RFC1123, RFC4291, RFC5321. Let me also point out some RFCs that are less relevant, even though they are frequently cited: RFC822 (obsoleted by RFC5322), RFC2822 (obsoleted by RFC5322) and RFC1035 (updated by RFC1123).
If there's an industrial-strength validation function that I've missed please let me know using the contact channels on the left.
Percent correct
100%
100%
74%
73%
68%
See the full results and analysis »
There's some misinformation that is often repeated about email address formats. I'll attempt to quash a few myths here:
There are still some websites that won't allow you to use a plus sign (+) in an email address. In fact it's a perfectly valid character to use and it can be really useful. Gmail, for instance, will pre-tag your incoming emails with the part after the plus sign.
Why is this useful? Well if you register at a particular website as john.doe+dating@example.com then Gmail will label all messages from that website with the dating label.
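To make the plus-sign convention concrete, here is a minimal Python sketch (an illustration only, not the PHP function offered on this page). The helper name split_plus_tag is hypothetical, and it assumes a simple unquoted address; how a mail provider actually treats the tag is up to that provider.

```python
def split_plus_tag(address):
    """Split a simple, unquoted address into (base local part, tag, domain).

    Assumption: the address has no quoted strings or comments, so the last
    "@" separates the local part from the domain.
    """
    local, _, domain = address.rpartition("@")
    base, plus, tag = local.partition("+")
    # tag is None when there is no plus sign in the local part
    return base, (tag if plus else None), domain


print(split_plus_tag("john.doe+dating@example.com"))
```

The tag is only a convention layered on top of a valid address: a validator should accept the plus sign like any other permitted character.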
This figure comes from simple arithmetic: maximum length of a domain (255 characters) + maximum length of a local part (64 characters) + the @ symbol = 320 characters. Wrong.
This canard is actually documented in the original version of RFC3696. It was corrected in the errata, but nobody reads the errata (except me, it would appear).
RFC5321 actually restricts the path element of an SMTP transaction to 256 characters. That figure includes the angle brackets around the email address, so the maximum length of an email address is 254 characters. You read it here first.
This is not a new restriction - it goes right the way back to RFC821 and was also in RFC2821.
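The arithmetic can be written down as a short Python sketch. The constant and function names are hypothetical, and the check covers only the overall length, not the separate limits on the local part or the domain.

```python
# RFC5321 limit on the SMTP path element, as described in the text
MAX_PATH_LENGTH = 256

# The path includes the surrounding angle brackets: <address>
MAX_ADDRESS_LENGTH = MAX_PATH_LENGTH - len("<") - len(">")


def within_length_limit(address):
    """Return True if the address fits inside an SMTP path.

    Assumption: the address is plain ASCII, so characters equal octets.
    """
    return len(address) <= MAX_ADDRESS_LENGTH


print(MAX_ADDRESS_LENGTH)
```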
This is only true if you also put double quotes around the recipient part. Some of the examples given in RFC3696 are actually wrong, and were corrected in the errata.
This is OK: "Abc\@def"@example.com and so is this: "Abc@def"@example.com. But this isn't: Abc\@def@example.com
The range of characters you can use without quotes is as follows: any letter, any digit, any of the following !#$%&'*+-/=?^_`{|}~. You can't use these without quoting them: ()<>[]:;@\, with one exception: you can use parentheses for comments.
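As a rough illustration of the unquoted character set, the following Python sketch checks whether a local part is made only of the permitted characters, with single dots allowed between runs of them (the dot-atom form in RFC5322). The names are hypothetical, and the sketch deliberately ignores comments, quoted strings and length limits.

```python
import re

# Characters allowed without quoting, as listed in the text
ATEXT = r"A-Za-z0-9!#$%&'*+\-/=?^_`{|}~"

# One or more runs of allowed characters, separated by single dots
DOT_ATOM = re.compile(r"[%s]+(?:\.[%s]+)*\Z" % (ATEXT, ATEXT))


def is_unquoted_local_part(local_part):
    """Return True if local_part is valid without any quoting."""
    return DOT_ATOM.match(local_part) is not None


for candidate in ["john.doe+dating", "Abc\\@def", "first..last"]:
    print(candidate, is_unquoted_local_part(candidate))
```

Only the first candidate passes: the second contains a backslash and an @, and the third has two consecutive dots.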
Comments are text surrounded by parentheses (like these). These are OK but don't form part of the address. In other words mail sent to first.last@example.com will go to the same place as first(a).last(b)@example(c).com(d). Strange but true.
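Here is a minimal Python sketch of how comments drop out of an address. The function name is hypothetical, and it makes a simplifying assumption: parentheses inside quoted strings and backslash-escaped parentheses are not handled, so treat it as an illustration rather than a parser.

```python
def strip_comments(address):
    """Remove parenthesised comments, keeping track of nesting depth.

    Assumption: the address contains no quoted strings or escaped
    parentheses, which a full RFC5322 parser would need to handle.
    """
    kept = []
    depth = 0
    for ch in address:
        if ch == "(":
            depth += 1          # entering a comment (possibly nested)
        elif ch == ")" and depth > 0:
            depth -= 1          # leaving the innermost comment
        elif depth == 0:
            kept.append(ch)     # outside any comment: keep the character
    return "".join(kept)


print(strip_comments("first(a).last(b)@example(c).com(d)"))
# prints: first.last@example.com
```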
Nope. Check out http://www.3com.com. This old rule from RFC1035 was changed in RFC1123 in 1989.
More interesting is whether Top Level Domains (TLDs) need to begin with a letter. At the moment they all do, but ICANN is introducing all sorts of new ones. RFC1123 assumes that TLDs will always start with a letter, and this is clearly sensible since there needs to be a systematic way of distinguishing a domain name from an IP address.
Consider 123.123.123.123. If TLDs could be numeric then this might be a valid domain name. If you entered it into your browser, how would it know whether to go straight to the IP address 123.123.123.123 or go via the Domain Name System (DNS)?
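The ambiguity is easy to see in code. This Python sketch uses the standard ipaddress module to test whether a host string parses as an IPv4 address, alongside a check that the final label begins with a letter. The function names are hypothetical, and this is one possible way to draw the line, not a statement of how any browser does it.

```python
import ipaddress


def looks_like_ipv4(host):
    """Return True if host parses as a dotted-quad IPv4 address."""
    try:
        ipaddress.IPv4Address(host)
        return True
    except ValueError:
        return False


def tld_starts_with_letter(host):
    """Return True if the last label of host begins with an ASCII letter."""
    tld = host.rstrip(".").rpartition(".")[2]
    first = tld[:1]
    return first.isascii() and first.isalpha()


for host in ["123.123.123.123", "www.3com.com"]:
    print(host, looks_like_ipv4(host), tld_starts_with_letter(host))
```

As long as every TLD starts with a letter, the two checks can never both succeed for the same string, which is the systematic distinction RFC1123 relies on.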
RFC5322 is a Draft Standard. This trumps the Proposed Standard status of RFC2822.
Are you from, like, the olden days?
I can't, but then I suck at regexes. If you think you can, please feel free to use my test cases to verify your regex.
Here are the test cases I think are wrong (in other words, the expected result given by the author is different to mine):
Doug Lovell's Linux Journal article (see below) casually cites this as an address you could adopt to fool a spambot. Unfortunately it's not a valid address.
Phil gives this as an example of an address he could have if he was perverse enough to want it. By my reading of RFC5322, however, you can't escape an unquoted string like this. In fact John Klensin added an erratum to RFC3696 for exactly this reason. Phil: you need to read the RFCs and the errata!
This is one of Dave Child's test cases and he claims it is a valid address. I don't agree that you can quote part of the string like this. If you read RFC5322 Section 3.4.1 carefully, the local-part must be either a quoted-string or a dot-atom, not both.
Both Dave Child and Cal Henderson kindly wrote to me and pointed out that this form is valid, although obsolete. This is clear if you read RFC5322 right down to the small print at the bottom :-)
Another one of Dave Child's test cases which also turns out to be invalid contrary to his expectation. Looking at RFC3696 Section 2, it says "There is an additional rule that essentially requires that top-level domain names not be all-numeric." I'm not sure what authority the author of RFC3696 is citing here, and it's only an informational RFC, but I think it most unlikely there will ever be all-numeric TLDs.
Dave Child expects this to fail - in other words it is an invalid address. I disagree, citing RFC3696 again: "any ASCII character, including control characters, may appear quoted, or in a quoted string". This includes the backslash, so this is a perfectly good email address (although not one I'd recommend you take).
I've rethought this and I now believe Dave Child was right all along. It all comes down to what RFC3696 means by "quoted" and "quoted string". These are technical terms, defined in RFC5322, and it's clear from there that a backslash must be escaped even in a quoted string. Apologies to Dave Child for doubting him :-).
Cal Henderson weighed in on this example too. Our consensus is that a backslash can escape anything in a quoted string (but it has to escape something - it can't be the last character in the string).
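The consensus rule can be sketched in Python as follows. The function name is hypothetical, and the check covers only the quoting and backslash rules discussed here; it does not vet control characters, non-ASCII text or length.

```python
def is_valid_quoted_string(local_part):
    """Check the quoting rules for a quoted local part.

    Rules applied (a simplification of RFC5322):
      - the string starts and ends with a double quote
      - a backslash escapes the next character, whatever it is
      - a backslash must have something to escape
      - a double quote inside the string must be escaped
    """
    if len(local_part) < 2 or local_part[0] != '"' or local_part[-1] != '"':
        return False
    body = local_part[1:-1]
    i = 0
    while i < len(body):
        ch = body[i]
        if ch == "\\":
            if i + 1 >= len(body):
                return False    # trailing backslash escapes nothing
            i += 2              # skip the backslash and the escaped character
            continue
        if ch == '"':
            return False        # unescaped quote inside the string
        i += 1
    return True


for candidate in [r'"Abc\@def"', '"Abc@def"', r'"abc\"']:
    print(candidate, is_valid_quoted_string(candidate))
```

The first two candidates pass. The third fails because its backslash swallows the closing quote, leaving nothing for it to escape inside the string.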
Doug Lovell published a robust function in Linux Journal in 2007, but I can't use it here because his code is copyright All Rights Reserved by Linux Journal (ironically).
Doug's article is also wrong to claim that domain labels must start with an alphabetic character. This was changed in 1989 by RFC1123 to allow domains such as 3com.com. Doug also cites some invalid examples from RFC3696 that were corrected in the errata.
I've extracted the raw specification of a valid email address in BNF (Backus-Naur form) from RFC5322 and put it here: the BNF from RFC5322 defining the parts of a valid email address. If you try to follow this in the RFC you end up going backwards and forwards and getting confused; it's much easier to understand in the way I've laid it out.