Rick van Rein

Fri 20 October 2017


Spam Filtering: Balancing the Odds

Mail was designed as a gullible system, and spammers routinely abuse this. Fighting spam is a balancing act between wrongly accepting and wrongly rejecting mail. With the more refined identity model of the IdentityHub, we have new tools to strike a better balance.

Email is inherently reliable; when mail servers pass email around, they are always very clear about whose responsibility it is to continue handling a message, so no messages can be lost. Unfortunately, abuse through spam has introduced a need to reject messages before they reach their recipients, thereby introducing the risk of falsely rejecting or falsely accepting an email.

Spam Filtering Techniques

There are a few common techniques that help to combat spam. We will first describe those (briefly) and return to what the IdentityHub can add further down.

  • Blacklisting is a technique based on publicly maintained lists of known abusers: spammers, phishers and virus spreaders. Whitelisting is the opposite, and is done to protect well-behaving mail senders from being classified as spammers.

  • Greylisting is a technique to challenge the sender within the constraints of a protocol. In the case of email, it is common to initially report "I am too busy to take your email right now" and rely on the sender to try again later. The principle of greylisting can be extended to other protocols, each in its own way; phone calls may be redirected and messaging may be subjected to standard inquiries, for example. All of these demand more than the simple protocol implementations found in the botnets that spammers use.

  • Standards compliance checks help to raise the bar that a spammer needs to clear. Where it has long been tradition to be lenient, there is now a reason to require things like a proper Date: on each email, an existing sender address, and so on.

  • Pattern recognition is where things get vague and uncertain. The email is scanned for signs that indicate possible spam, like a Subject: with only uppercase characters. Many suspicious factors are collected, and an estimate is made by assigning a weight to each and seeing if the total spam score exceeds a threshold. Setting this threshold raises the balancing question of whether we want to lean towards the risk of missing proper email or towards receiving improper email.

  • Spam training is the process where human feedback is used to tell a spam filter that certain mail is spam or non-spam. Spam training works by changing the weights assigned to the various patterns being recognised. It helps to personalise the separation between spam and non-spam, but it can never lead to a perfect and flawless result.
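The greylisting behaviour from the list above can be sketched in a few lines. This is a minimal illustration, not the behaviour of any particular mail server: the class name `Greylist`, the five-minute delay, and the use of the (client address, sender, recipient) triplet as a key are assumptions made for the example.

```python
import time


class Greylist:
    """Temporarily reject the first delivery attempts of each unknown triplet."""

    def __init__(self, min_delay=300.0, clock=time.time):
        self.min_delay = min_delay  # seconds a sender must wait before a retry is accepted
        self.clock = clock          # injectable clock, which makes testing easy
        self.first_seen = {}        # triplet -> time of the first attempt

    def check(self, client_ip, mail_from, rcpt_to):
        """Return an SMTP-style reply line for one delivery attempt."""
        triplet = (client_ip, mail_from.lower(), rcpt_to.lower())
        now = self.clock()
        first = self.first_seen.setdefault(triplet, now)
        if now - first < self.min_delay:
            # A 4xx reply is a temporary failure: a compliant sender queues and retries.
            return "451 Too busy right now, please try again later"
        return "250 OK"
```

A sender that retries after the delay gets through; a simple bot that never retries does not.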

Spam Prevention Techniques

A few modern approaches exist to make it impossible or highly unlikely that spam can be sent. These usually combine two things:

  • Authentication is used to get certainty about who sent an email. SPF lets the sending domain declare which hosts may send email from that domain. DKIM lets the sending mail server sign email, so it can be validated to have come from the domain's mail server.

  • Reputation is the collected behaviour of a sender, and it works best when combined with authentication. This is used to weigh the probability that a sender is a spammer.

This may sound accurate, but it is in fact a branch of statistics, with its inherent uncertainties. What statistics can help to do is not just express average values, but also how certain these are.
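To make the point about certainty concrete, here is a small sketch of how a reputation could be expressed as a range rather than a single average. It uses the Wilson score interval, which is our choice for illustration and not something the techniques above prescribe; the function name and the 95% level (z = 1.96) are assumptions.

```python
import math


def spam_rate_interval(spam_count, total_count, z=1.96):
    """Wilson score interval for the fraction of spam among observed messages."""
    if total_count == 0:
        return 0.0, 1.0  # no observations yet: any rate is still possible
    p = spam_count / total_count
    denom = 1 + z * z / total_count
    centre = (p + z * z / (2 * total_count)) / denom
    half = z * math.sqrt(
        p * (1 - p) / total_count + z * z / (4 * total_count ** 2)
    ) / denom
    return max(0.0, centre - half), min(1.0, centre + half)
```

A sender with 1 spam message out of 10 and a sender with 100 out of 1000 have the same average, but the interval for the second is much narrower, so we can rely on it more.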

Opportunities for Improvement

The balance to strike is best illustrated by looking both at the probability of something being spam and at the probability of it being non-spam. Let's say we want to have 90% certainty; then we treat cases where we are less than 5% certain that something is spam as non-spam (green), and when we are at least 95% certain that something is spam, we reject it without any misgivings (red).

Classification as spam (red), non-spam (green), and uncertain (yellow)

We can combine the two diagrams, and show a yellow middle portion to indicate what we are uncertain about. To strike the best balance, we should aim for making that area as small as possible. Which of the following two systems looks more attractive as a spam filter?

Preference of a narrow band of uncertainty

Clearly, the narrower case is the ideal; it is certain about more spam classifications, thus rejecting them rightfully, and it is certain about more non-spam classifications, thus passing them rightfully. The yellow portion is where we may need to get involved, and so the narrowest yellow portion is an attractive property of any spam filtering system.
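The three bands can be written down as a small sketch. The function names are ours, and the 5% and 95% bounds are simply the example figures used above.

```python
def classify(spam_probability, accept_below=0.05, reject_from=0.95):
    """Map an estimated spam probability onto the green/yellow/red bands."""
    if not 0.0 <= spam_probability <= 1.0:
        raise ValueError("spam_probability must lie between 0 and 1")
    if spam_probability < accept_below:
        return "green"   # confidently non-spam: deliver
    if spam_probability >= reject_from:
        return "red"     # confidently spam: reject
    return "yellow"      # uncertain: this is the band we want to keep narrow


def uncertain_fraction(probabilities, accept_below=0.05, reject_from=0.95):
    """Fraction of messages that end up in the yellow band."""
    probabilities = list(probabilities)
    if not probabilities:
        return 0.0
    yellow = sum(
        1 for p in probabilities
        if classify(p, accept_below, reject_from) == "yellow"
    )
    return yellow / len(probabilities)
```

Comparing `uncertain_fraction` for two filters on the same set of messages is one way to express which of them has the narrower yellow band.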

The common interpretation of these boundaries between red, yellow and green is in terms of the spam score that a spam filter derives from the weighted evidence of a spam selection.

The rules used in spam filters are selected to have a high probability of finding a good balance. In addition, when a spam filter is trained on email that the user considers non-spam or spam, the weighting of the many rules becomes more accurate. Rules that turn out to be less effective (after training) are blunted in the result, whereas the rules that are most precise are sharpened and have a heightened impact on the selection of what is considered spam or non-spam.
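As a rough sketch of weighted rules and training, consider the following. The three rules, their names, and the perceptron-style update are all assumptions chosen for illustration; real spam filters use many more rules and more refined learning methods.

```python
# Hypothetical rules: each one inspects a message (a plain dict here)
# and reports whether it fires.
RULES = {
    "subject_all_caps": lambda m: m.get("subject", "").isupper(),
    "missing_date": lambda m: not m.get("date"),
    "many_exclamations": lambda m: m.get("subject", "").count("!") >= 3,
}


def fired(message):
    """Names of the rules that match this message."""
    return [name for name, test in RULES.items() if test(message)]


def spam_score(message, weights):
    """Total weight of all rules that fire."""
    return sum(weights[name] for name in fired(message))


def train(message, is_spam, weights, threshold, step=0.5):
    """Adjust weights only when the filter disagreed with the human verdict."""
    predicted_spam = spam_score(message, weights) >= threshold
    if predicted_spam == is_spam:
        return
    delta = step if is_spam else -step
    for name in fired(message):
        weights[name] += delta  # sharpen or blunt the rules involved
```

A rule that keeps firing on mail the user marks as non-spam loses weight with each correction, which is the blunting described above.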

Balancing Spam Score with Expectations

Spam scoring can be influenced on a per-person basis, based on personal preferences. It would be impractical to do it for each alias in our IdentityHub, as that would impose a lot of maintenance work on the user. Distinguishing between pseudonyms is perhaps possible, though even here we might wonder if it is useful.

What we may do, however, is alter our expectations based on the context in which an email arrives. We have a lot of that in the IdentityHub, but the aforementioned techniques also help to raise or lower bars:

  • DKIM and SPF are frameworks for establishing sender identity. When we have expectations about the use of either or both technologies, we can raise the bar for a sender. The techniques are sensitive to forwarding and mailing lists, but email contains a trace of where it has been, which can help to recover from problems. To this end, we're looking into Lenient DKIM, Lenient SPF and Lenient DMARC; some of these may make it into a formal standard, others are not likely to get that far but may still be of use in local configurations using, for example, our IdentityHub.

  • Aliases help us to sort our contacts, and we can set different expectations on each. Where we may be lenient about spam in a naughty alias, we are likely to be less tolerant of it in our formal capacity. Also, aliases offer a good distinction between what we expect senders to do with DKIM and SPF; we may well have an alias for a business contact whose mail server always uses these techniques, for instance, but in return we may want to be light on spam checking. On the other hand, a generally reachable address info@example.com is more prone to receive mail from the general public and may be subjected to intense spam filtering. In short, we have different expectations and so we set different cut-off bounds for spam filtering.

  • Access Control Lists, or ACLs for short, authorise access to an alias. These settings are very useful in the selection process. An ACL entry matching a complete email address is ideal, certainly when the expectations on DKIM and SPF match. A match becomes less precise when the user name is not matched and only the domain is, and it erodes further when the domain match loosens from an exact domain to subdomains, all the way to a wildcard matching any domain. This means that an ACL can help to modify the expectations for each sender individually.

  • Mail Submission is the term for an outgoing mail server. When using IdentityHub, you will send mail through its related server, so the various forms of identity rewriting can be made, like adding an alias or changing to a group member identity. This is also needed to behave properly under DKIM and SPF. Even when using a webmail solution, you would set up the mail submission server for your addresses under an IdentityHub domain. There are various things that we may learn from outgoing mail, and that are commonly reused in reply email. This is a great help when receiving replies. So, we can fill (volatile) databases with information scavenged from outgoing email, and use those databases to help match incoming email.
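One way to picture the volatile database from the Mail Submission item is to remember the Message-ID of each outgoing email for a while, and to look for it in the In-Reply-To and References headers of incoming email. This is only a sketch under our own assumptions: the class name `SentMailIndex`, the in-memory dictionary, and the 30-day lifetime are hypothetical, and since these headers can be forged a match is a hint rather than proof.

```python
import time
from email import message_from_string


class SentMailIndex:
    """Volatile record of Message-IDs that we sent recently."""

    def __init__(self, ttl=30 * 24 * 3600, clock=time.time):
        self.ttl = ttl        # seconds to remember an outgoing Message-ID
        self.clock = clock    # injectable clock, which makes testing easy
        self.expires = {}     # Message-ID -> expiry time

    def record_outgoing(self, raw_message):
        """Remember the Message-ID of an email passing through submission."""
        msg = message_from_string(raw_message)
        message_id = msg.get("Message-ID")
        if message_id:
            self.expires[message_id.strip()] = self.clock() + self.ttl

    def is_reply_to_us(self, raw_message):
        """True if an incoming email refers to something we sent and still remember."""
        msg = message_from_string(raw_message)
        referenced = (
            msg.get("In-Reply-To", "") + " " + msg.get("References", "")
        ).split()
        now = self.clock()
        return any(
            mid in self.expires and self.expires[mid] > now
            for mid in referenced
        )
```

An incoming email for which `is_reply_to_us` holds could be given a more lenient cut-off than mail from a complete stranger.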

The general idea is to raise expectations on spam score depending on these factors. As an example, aliases may allow us to group spam expectations per domain, per sender or other sender attributes.
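A sketch of how aliases and ACL matches might shift the cut-off is shown below. Everything specific in it is hypothetical: the pattern syntax (`user@domain`, `@domain`, `@.domain` for subdomains, and a bare `@` as wildcard), the alias names, and all numbers. A higher cut-off means a message needs a higher spam score before it is rejected, so it means more lenient filtering.

```python
def acl_match_level(sender, pattern):
    """Rate how precisely an ACL pattern matches a sender address.

    3 = full address, 2 = exact domain, 1 = subdomain, 0 = wildcard,
    None = no match at all.
    """
    sender = sender.lower()
    pattern = pattern.lower()
    domain = sender.rpartition("@")[2]
    if pattern == sender:
        return 3
    if pattern == "@" + domain:
        return 2
    if pattern.startswith("@.") and domain.endswith(pattern[1:]):
        return 1
    if pattern == "@":
        return 0
    return None


# Hypothetical per-alias base cut-offs and bonuses per match level.
ALIAS_CUTOFF = {"business": 8.0, "info": 3.0}
LEVEL_BONUS = {3: 2.0, 2: 1.0, 1: 0.5, 0: 0.0}


def cutoff_for(alias, sender, acl, default=5.0):
    """Spam-score cut-off for one sender writing to one alias."""
    base = ALIAS_CUTOFF.get(alias, default)
    levels = [
        level
        for level in (acl_match_level(sender, p) for p in acl)
        if level is not None
    ]
    if not levels:
        return base
    return base + LEVEL_BONUS[max(levels)]  # the most precise match wins
```

With these numbers, a fully matched sender on the business alias is filtered far more lightly than an unknown sender writing to the public info alias.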

Training on spam and non-spam may be slightly different under such circumstances. In a first layer, there is a need to learn about the expectations. This is similar to the learning process for spam filters, but this time the factors influence expectations. Then, after learning these factors, it is possible to train a spam filter on the spam or non-spam that remains after the new classification. This training should not aim for a fixed setting of spam scores, but should instead be relative to the expectations (which is a complicating new idea).

Note that we may succeed in adding the aforementioned aspects as factors to a spam filtering solution, which is in fact common practice in spam filters. That may not work for all forms of computation, however; some things may be more logically arranged through multiplication than through addition, for example. We have an interesting problem ahead of us, but we are optimistic about the ability of the IdentityHub to add value because we do indeed have more information available, which always helps to improve filtering.
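The difference between the two arrangements can be illustrated with a tiny sketch. The function names and the idea of treating multiplicative factors as odds ratios are our assumptions, offered only to show why the choice matters.

```python
import math


def combine_by_addition(score, adjustments):
    """Additive model: each factor shifts the score by a fixed amount."""
    return score + sum(adjustments)


def combine_by_multiplication(odds, ratios):
    """Multiplicative model: each factor scales the odds of being spam."""
    result = odds
    for ratio in ratios:
        result *= ratio
    return result


def log_of_product(odds, ratios):
    """Taking logarithms turns the multiplicative model into an additive one."""
    return math.log(odds) + sum(math.log(r) for r in ratios)
```

An additive adjustment moves a low and a high score by the same amount, whereas a multiplicative one changes them in proportion. Factors that are naturally multiplicative can still be fed into an additive spam score if their logarithms are used as weights.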
