Lamson's Bounce Detection Algorithm
I just finished committing the 0.9.6 code for bounce detection and analysis in Lamson. It’s part of the new mailing list example I’m coding up for the 0.9.6 release, which I’ll be running on a free mailing list site I’m going to release soon. In this blog post I’d like to go through the bounce detection algorithm and get some feedback and samples from people. So far it works great for the samples I have, but I want it to be fairly bulletproof.
What I have here is a quick description of how Lamson does bounce detection, and then how it gives you parsed information to do analysis beyond that. I’m hoping that people who have to deal with bounces from systems like MS Exchange can provide either samples or advice so I can tune the algorithm better.
The State of RFC 3464 And 1893
The two big RFCs for bounce handling are RFC3464 – An Extensible Message Format for Delivery Status Notifications and RFC1893 – Enhanced Mail System Status Codes, which cover the headers and error messages you’ll encounter.
RFC3464 specifies various headers that a bounce message should have, as well as the double and sometimes triple multipart MIME wrapping you need to do to give back the bounced message. A major pain with this standard is that the headers are spread out across the different nested parts, some might be duplicated, and the format is never respected by the various MTAs out there.
With RFC1893 you have a large number of status codes of the form X.Y.Z, but they are combined in weird ways: you have a primary code from the first number, a secondary code from the second, and then a “third” code that’s the second and third numbers combined. Unlike HTTP error codes, where a single number maps to a single status, in RFC1893 it’s this strangely nested first+second+(second+third) thing.
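To make that nesting concrete, here is a minimal Python 3 sketch that splits a status string into the three values described above. The names `STATUS_RE` and `split_status` are made up for illustration and are not part of Lamson’s API; the way the combined code is built (joining the second and third numbers) is my reading of the description, not a quote of Lamson’s source.

```python
import re

# Matches an RFC 1893 style status such as "5.1.1" anywhere in a string.
STATUS_RE = re.compile(r'([0-9]+)\.([0-9]+)\.([0-9]+)')


def split_status(text):
    """Return (primary, secondary, combined) for the first status code
    found in text, or None when no status code is present."""
    match = STATUS_RE.search(text)
    if match is None:
        return None
    primary, secondary, detail = match.groups()
    # The "combined" code is the second and third numbers joined together,
    # so 5.1.1 yields 11 and 4.2.2 yields 22.
    return int(primary), int(secondary), int(secondary + detail)
```

With this sketch, `split_status("5.1.1")` gives `(5, 1, 11)`, and a string with no status code in it gives `None`.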
Adding to this, there are a few headers where these status codes can show up, and in various formats. Here’s a sample of the parsed Diagnostic-Code from a Gmail bounce message:
 'Diagnostic-Code': [(u'smtp',
                      u'550-5.1.1',
                      u"The email account that you tried to reach does...")],
The end result is that if you want to determine whether a message has bounced (never mind hard vs. soft), you have to go trawling through nested attachments looking for headers and parsing their values on the chance they might be important to the bounce.
Soft and Hard
The general consensus is that if the primary status code is greater than 4 (in RFC1893 terms, a 5 or “Permanent Failure”) then the message is a hard bounce, and anything else is a soft bounce. Of course there’s no way this is an absolute truth, and I’m sure there will end up being various complex things people will want to do with a soft vs. a hard bounce.
Since there’s no way Lamson could know what decisions you want to base on a soft vs. hard bounce, it simply tells you which type it is, and then lets you analyze the information it has collected about the bounce.
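As a rough sketch of that rule of thumb (not Lamson’s actual code), the hard vs. soft decision comes down to a single comparison on the primary status code. The function name `bounce_kind` is hypothetical.

```python
def bounce_kind(primary_status):
    """Classify a bounce from its primary status code using the
    'greater than 4 is hard' rule of thumb."""
    # 5 (permanent failure) counts as hard; everything else, such as
    # 4 (persistent transient failure), is treated as soft.
    return "hard" if primary_status > 4 else "soft"
```

Anything smarter than this, such as treating a full mailbox differently from a routing failure, is the kind of decision that is left to your own code.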
Lamson’s Simplification
Lamson already converts the email into a normalized form, so the difficult part of parsing out the bounce information is simply trawling through the parts and finding all these randomly dispersed bounce headers. Then it’s a matter of parsing the values into something that can be used for analysis in code.
To simplify this, Lamson uses a dict of headers and matching regexes to search for in the email:
import re

# Maps each bounce-related header to the regex used to parse its value.
BOUNCE_MATCHERS = {
    'Action': re.compile(r'(failed|delayed|delivered|relayed|expanded)', re.IGNORECASE | re.DOTALL),
    'Content-Description':
        re.compile(r'(Notification|Undelivered Message|Delivery Report)', re.IGNORECASE | re.DOTALL),
    'Diagnostic-Code': re.compile(r'(.+);\s*([0-9\-\.]+)?\s*(.*)', re.IGNORECASE | re.DOTALL),
    'Final-Recipient': re.compile(r'(.+);\s*(.*)', re.IGNORECASE | re.DOTALL),
    'Received': re.compile(r'(.+)', re.IGNORECASE | re.DOTALL),
    'Remote-Mta': re.compile(r'(.+);\s*(.*)', re.IGNORECASE | re.DOTALL),
    'Reporting-Mta': re.compile(r'(.+);\s*(.*)', re.IGNORECASE | re.DOTALL),
    'Status': re.compile(r'([0-9]+)\.([0-9]+)\.([0-9]+)', re.IGNORECASE | re.DOTALL)
}
Each of these headers shows up about once, maybe more, in the parts of a message that carries bounce information, and they rarely or never show up in regular messages. Lamson just walks through each part, finds the matching headers, parses them with the regex, and then produces something like this:
{'Action': [(u'failed',)],
 'Content-Description': [(u'Notification',),
                         (u'Undelivered Message',),
                         (u'Delivery report',)],
 'Diagnostic-Code': [(u'smtp',
                  u'550-5.1.1',
                  u"The email account that you tried to reach does...")],
 'Final-Recipient': [(u'rfc822',
              u'asdfasdfasdfasdfasdfasdfewrqertrtyrthsfgdfgadfqeadvxzvz@gmail.com')],
 'Received': [(u'by mail.zedshaw.com ...',)],
 'Remote-Mta': [(u'dns', u'gmail-smtp-in.l.google.com')],
 'Reporting-Mta': [(u'dns', u'mail.zedshaw.com')],
 'Status': [(u'5', u'1', u'1')]}
I’ve abbreviated some of the messages since they get stupidly long, but this shows you the information you get now. Lamson also provides a nice API through the lamson.bounce.BounceAnalyzer class that you’ll see in a moment.
If you know of additional headers and formats that show up in bounces, let me know.
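For anyone who wants to experiment with samples outside of Lamson, the header walk described above can be approximated with nothing but the standard library. This is a Python 3 sketch under my own assumptions, not Lamson’s implementation: it uses the stock `email` package rather than Lamson’s normalized message, only a trimmed copy of the matcher table, and the names `MATCHERS`, `find_bounce_headers` and `SAMPLE` are invented for the example. The sample message is a made-up, minimal delivery status notification.

```python
import email
import re

FLAGS = re.IGNORECASE | re.DOTALL

# A trimmed copy of the matcher table shown earlier.
MATCHERS = {
    'Action': re.compile(r'(failed|delayed|delivered|relayed|expanded)', FLAGS),
    'Diagnostic-Code': re.compile(r'(.+);\s*([0-9\-\.]+)?\s*(.*)', FLAGS),
    'Final-Recipient': re.compile(r'(.+);\s*(.*)', FLAGS),
    'Status': re.compile(r'([0-9]+)\.([0-9]+)\.([0-9]+)', FLAGS),
}


def find_bounce_headers(raw_message):
    """Walk every MIME part and collect the parsed values of known headers."""
    found = {}
    message = email.message_from_string(raw_message)
    for part in message.walk():
        for name, regex in MATCHERS.items():
            # get_all() is case-insensitive and returns every occurrence.
            for value in part.get_all(name, []):
                match = regex.search(str(value))
                if match:
                    found.setdefault(name, []).append(match.groups())
    return found


# Made-up sample bounce, just enough structure to exercise the walk.
SAMPLE = """\
From: MAILER-DAEMON@mail.example.com
To: sender@example.com
Subject: Undelivered Mail Returned to Sender
MIME-Version: 1.0
Content-Type: multipart/report; report-type=delivery-status; boundary="BOUND"

--BOUND
Content-Description: Notification
Content-Type: text/plain

Delivery failed.

--BOUND
Content-Description: Delivery report
Content-Type: message/delivery-status

Reporting-MTA: dns; mail.example.com

Final-Recipient: rfc822; nobody@example.org
Action: failed
Status: 5.1.1
Diagnostic-Code: smtp; 550-5.1.1 No such user

--BOUND--
"""

if __name__ == "__main__":
    for header, values in sorted(find_bounce_headers(SAMPLE).items()):
        print(header, values)
```

The standard parser exposes each header block inside a message/delivery-status part as its own nested message, which is why a plain `walk()` is enough to reach the Status and Action headers here.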
Bounce Probability
Now we’ve collected up all the different headers you might find in a bounce package, and we’ve attempted to parse the values into something meaningful. That means the more of these headers we find and successfully parse, the more likely the message is a bounce.
Lamson keeps a score while it’s finding the above headers, adding a “point” for each one it finds and parses. It then turns that score into a probability of at most 1.0 that the message is a bounce: the more headers it finds, the closer the value gets to 1.0. So far it looks like any message above 0.3 is most likely a bounce message, with only a few spam messages scoring below that.
Checking for a bounce then looks like this:
from lamson import mail

bm = mail.MailRequest(None, None, None, open("tests/bounce.msg").read())
assert bm.is_bounce()
assert bm.bounce
assert bm.bounce.score == 1.0
assert bm.bounce.probable()
This is from the test suite, and it shows you the various ways to get information about the probability of the message being a bounce.
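The scoring idea itself can be sketched in a few lines of Python 3. Treat this as an illustration under stated assumptions rather than Lamson’s formula: here every header is worth one point and the total is divided by the number of known headers, while the real weighting may differ. `KNOWN_HEADERS`, `bounce_score` and `is_probable_bounce` are hypothetical names; the 0.3 cut-off is the one mentioned above.

```python
# The header names used as keys in BOUNCE_MATCHERS.
KNOWN_HEADERS = ('Action', 'Content-Description', 'Diagnostic-Code',
                 'Final-Recipient', 'Received', 'Remote-Mta',
                 'Reporting-Mta', 'Status')

BOUNCE_THRESHOLD = 0.3  # cut-off taken from the text above


def bounce_score(found_headers):
    """Return the fraction of known bounce headers that were found and
    parsed. Assumes one point per header (an assumption, see above)."""
    points = sum(1 for name in KNOWN_HEADERS if found_headers.get(name))
    return points / len(KNOWN_HEADERS)


def is_probable_bounce(found_headers, threshold=BOUNCE_THRESHOLD):
    """True when the score is above the threshold."""
    return bounce_score(found_headers) > threshold
```

Under this simplified scheme, a regular message that only carries a Received header scores 0.125 and stays below the threshold, while a message with every header present scores 1.0.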
Bounce Header Information
Finally, you’ll need to get at this parsed information to make any final decisions about the bounce. You might want to keep track of which MTAs consistently fail and include them in your blacklists. You might want to tune your retries for different kinds of soft bounces.
Whatever you need to do, you have the original raw headers that Lamson found, but you also have a higher level API you can use:
if msg.is_bounce():
    print "-------"
    print "hard", msg.bounce.is_hard()
    print "soft", msg.bounce.is_soft()
    print 'score', msg.bounce.score
    print 'primary', msg.bounce.primary_status
    print 'secondary', msg.bounce.secondary_status
    print 'combined', msg.bounce.combined_status
    print 'remote_mta', msg.bounce.remote_mta
    print 'reporting_mta', msg.bounce.reporting_mta
    print 'final_recipient', msg.bounce.final_recipient
    print 'diagnostic_codes', msg.bounce.diagnostic_codes
    print 'action', msg.bounce.action
This is a simple sample that prints out the various bits of information available on a bounce message. Here’s a sample of the output from the above:
hard True
soft False
score 1.0
primary (5, u'Permanent Failure')
secondary (1, u'Addressing Status')
combined (11, u'Bad destination mailbox address')
remote_mta gmail-smtp-in.l.google.com
reporting_mta mail.zedshaw.com
final_recipient asdfasdfasdfasdfasdfasdfewrqertrtyrthsfgdfgadfqeadvxzvz@gmail.com
diagnostic_codes (u'550-5.1.1', u"The email account that you tried...")
action failed
This demonstrates how I’ve pulled out all of the error codes mentioned in RFC1893 and put them into Lamson so that you can get meaningful error messages for the various multi-level error codes you’ll encounter. All of these error codes are available in the lamson.bounce module.
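To give a feel for what such a lookup involves, here is a small Python 3 sketch that turns a parsed Status tuple into the same kind of (code, description) pairs shown in the output above. The table names and `describe_status` are hypothetical and are not the names used in lamson.bounce; the tables hold only a handful of entries, with wording taken from RFC1893.

```python
# Partial tables; RFC 1893 defines many more detail codes than these.
PRIMARY = {
    2: 'Success',
    4: 'Persistent Transient Failure',
    5: 'Permanent Failure',
}
SECONDARY = {
    0: 'Other or Undefined Status',
    1: 'Addressing Status',
    2: 'Mailbox Status',
    3: 'Mail System Status',
    4: 'Network and Routing Status',
    5: 'Mail Delivery Protocol Status',
    6: 'Message Content or Media Status',
    7: 'Security or Policy Status',
}
COMBINED = {
    11: 'Bad destination mailbox address',
    12: 'Bad destination system address',
    22: 'Mailbox full',
}


def describe_status(status):
    """Map a parsed status such as ('5', '1', '1') to three
    (code, description) pairs: primary, secondary and combined."""
    primary, secondary, detail = status
    # The combined code joins the second and third numbers: 1 and 1 -> 11.
    combined = int(secondary + detail)
    return (
        (int(primary), PRIMARY.get(int(primary), 'Unknown')),
        (int(secondary), SECONDARY.get(int(secondary), 'Unknown')),
        (combined, COMBINED.get(combined, 'Unknown')),
    )
```

Codes missing from these partial tables come back as 'Unknown' in this sketch, which is a choice made for the example only.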
Comments Or Suggestions
That’s it for the review of Lamson’s bounce detection. It’s in the early stages but already showing a lot of promise. It turned out to be much easier than I anticipated, but only after doing the work in lamson.encoding to clean up emails.
If you can think of situations that should be covered, and if you have samples of horrible bounce messages, then I’d love to have them.
Mailing Lists Coming Back
Tomorrow I’ll be using this to reimplement the mailing list sample and put it back into the bzr repository, then back online. I’ll have a nice status update about that tomorrow as I finish it up.
