[BBLISA] State of spam filtering?

Tue May 21 19:33:36 EDT 2013

Rich Braun wrote:
> ...with the rise of port-25 filtering by cable/DSL
> providers starting a decade ago I had to supplement this...

Doesn't business-class net service also address this?

Or have you found that the cost increase for business-class service
exceeds the cost of these other services?

(Business class also buys you a lack of port blocking for other
services, and usually no data caps.)

> 1) Inbound forwarders
> For item #1, forwarders, much inbound spam gets caught by the relatively
> lightweight rules of the DNS provider I use, EasyDNS, which also includes
> port-25 remapping as a free service.

What exactly does this service do?

Presumably it accepts mail on standard port 25 and forwards it to your
server on some non-standard port, but is it acting as a typical
store-and-forward MX, or as a real-time proxy? (In the latter case, if
your server is down, then the proxy refuses the message. On the upside,
it lets you implement anti-spam rules on your server and reject messages
without causing backscatter. Anti-spam measures have made
store-and-forward MXs all but obsolete.)

What anti-spam rules does it provide?

Even with a real-time proxy, usually when you put anything between the
client and your anti-spam measures, you decrease their effectiveness,
because you no longer have a direct connection with the client, and thus
don't know their IP address and can't use more sophisticated techniques,
like passive OS fingerprinting. (Not sure anyone actually uses the
latter in production.)

Does it do anything to compensate for the loss of metadata, like passing
on the client's IP address? Usually you can extract that from a message
header, but you've already wasted a pile of resources by the time you've
gotten far enough into a transaction to see the Received: lines in the
message headers. Compare that to refusing the socket accept() call if you
don't like the client's IP address.
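
To make the cost difference concrete, here is a minimal sketch of
screening on the peer address alone, before any SMTP dialog happens. The
BLOCKED_NETWORKS list and the function names are hypothetical, and the
ranges are documentation prefixes rather than real spam sources. Note that
a user-space server generally has to accept() and then close right away,
which is what screen() does.

```python
import ipaddress

# Hypothetical block list (assumption): documentation ranges only.
BLOCKED_NETWORKS = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("2001:db8::/32"),
]


def is_blocked(client_ip: str) -> bool:
    """Return True if client_ip falls inside any blocked network."""
    addr = ipaddress.ip_address(client_ip)
    return any(
        addr.version == net.version and addr in net
        for net in BLOCKED_NETWORKS
    )


def screen(conn, peer_ip: str) -> bool:
    """Close the connection immediately if the peer is blocked.

    conn is any object with a close() method, e.g. the socket returned
    by accept(). Returns True if the connection may proceed.
    """
    if is_blocked(peer_ip):
        conn.close()  # no banner, no SMTP dialog, nothing spooled
        return False
    return True
```

In a server loop this would be called right after
conn, (peer_ip, _port) = listener.accept() on an IPv4 socket, so a blocked
client costs one lookup instead of a full DATA phase.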

> 2) Outbound relays
> For item #2, there are a whole bunch of companies which you can aim your
> outbound postfix/sendmail config at via the smarthost/transport map rules. 

I have been sending mail from a LAN hosted relay for a very long time,
and the only recipient that ever complained (of course these days mail
is just as likely to be silently dropped as rejected) about PTR records
not matching was Craigslist. Common anti-spam advice says to use the
lack of any PTR record as a risk factor, and to accept any PTR record as
OK. Craigslist apparently implemented a rule where the HELO host name
had to exactly match your client IP's PTR record, or the message was
rejected.
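
The two policies can be compared in a short sketch. This is only an
illustration of the rules as described above, not Craigslist's actual
implementation, and the function names and return values are made up for
the example. The PTR names are passed in so the logic stays separate from
the DNS lookup (which could come from something like
socket.gethostbyaddr()).

```python
def normalize(name: str) -> str:
    """Lower-case a host name and drop any trailing root dot."""
    return name.strip().rstrip(".").lower()


def helo_matches_ptr(helo_name: str, ptr_names) -> bool:
    """True if the HELO name equals one of the client's PTR names."""
    helo = normalize(helo_name)
    return any(helo == normalize(ptr) for ptr in ptr_names)


def ptr_verdict(helo_name: str, ptr_names, strict: bool = False) -> str:
    """Apply either the common advice or the strict matching rule.

    Common advice: a missing PTR is a risk factor, any PTR is fine.
    Strict rule: the HELO name must exactly match a PTR name.
    """
    if not ptr_names:
        return "risk"
    if strict and not helo_matches_ptr(helo_name, ptr_names):
        return "reject"
    return "ok"
```

With a generic ISP-assigned PTR such as dsl-1-2-3.example.net, the common
rule returns "ok" while the strict rule returns "reject" for a HELO of
mail.example.com, which is why a customized PTR record matters.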

After many years of requests, Covad eventually complied and provided a
customized PTR record. But this was for business class service with a
static IP.

> Officially, that's not supposed to matter; unofficially, a
> gazillion large webmail providers (read:  yahoo, aol, et al) include any IP
> address they can see in the headers as part of their secret spam-control
> methodology.

So in your experience they are sifting out all the IPs and looking for
any that are on known blocks of IPs that are handed out dynamically to
consumer net connections?

How are they not getting tripped up by users who mail through their
ISP's relay? Are the anti-spam rules smart enough to say, this is OK
because even though it originated from a Comcast dynamic IP, it
traversed a Comcast relay?
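
For reference, the kind of sifting being described might look roughly
like the sketch below. It is a guess at the general technique, not what
any webmail provider actually runs. DYNAMIC_POOLS is a hypothetical list
using a documentation range, and the regex only catches bracketed IPv4
literals in Received headers.

```python
import email
import ipaddress
import re

# Hypothetical list of dynamically assigned consumer ranges (assumption).
DYNAMIC_POOLS = [ipaddress.ip_network("203.0.113.0/24")]

# Matches IPv4 literals written as [a.b.c.d] in a Received header.
IP_IN_BRACKETS = re.compile(r"\[(\d{1,3}(?:\.\d{1,3}){3})\]")


def received_ips(raw_message: str):
    """Return every bracketed IPv4 address found in Received headers,
    in header order (newest hop first)."""
    msg = email.message_from_string(raw_message)
    ips = []
    for header in msg.get_all("Received", []):
        for text in IP_IN_BRACKETS.findall(header):
            try:
                ips.append(ipaddress.ip_address(text))
            except ValueError:
                continue  # looked like an IP but was not valid
    return ips


def dynamic_hits(raw_message: str):
    """Return the addresses that fall inside a known dynamic pool."""
    return [
        str(ip)
        for ip in received_ips(raw_message)
        if any(ip in net for net in DYNAMIC_POOLS)
    ]


SAMPLE = (
    "Received: from relay.example.net (relay.example.net [198.51.100.7])\n"
    "\tby mx.example.org; Tue, 21 May 2013 19:33:36 -0400\n"
    "Received: from home-pc (unknown [203.0.113.42])\n"
    "\tby relay.example.net; Tue, 21 May 2013 19:33:30 -0400\n"
    "Subject: test\n"
    "\n"
    "body\n"
)
```

A naive filter that penalizes every hit would flag SAMPLE even though the
message went through a relay, which is exactly the case the questions
above are asking about.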

What's stopping you from using your ISP provided relay?

> You're reading this through that service now, so the headers of this message
> should show you what I mean by that.

On the copy of the message you sent me directly, the first (oldest)
Received header was added by Google, so it seems Mailjet is stripping
all Received headers, including its own (which I guess would be the only
Received header if that was the first hop from your mail client).

> The reason Spamassassin hasn't been updated since 3.3.2 is simple:  it's a
> mature plugin-based technology that really doesn't need changes.

I haven't looked at Spamassassin since the early 2000s and my impression
was that it was pretty ugly. Have they cleaned up the code and config
since then?

I know Spamassassin does a lot more than statistical classification
(Bayesian probability) that a message body is spam, but back then that
seemed to be the main reason why you'd use it.

Are you using statistical classification at all?

Is Spamassassin invoked to analyze the metadata of a message as it is
being received, and does it have access to the client IP and all the
SMTP commands issued by the client, or is it still being used in a model
where the message gets spooled to disk first, then handed to
Spamassassin for analysis?

It seems like if you've gone that far to accept and study the message,
you've already lost. For the sort of limited access mail server I'm
contemplating, there would be no use for statistical classification.

Google's spam filtering supposedly relies primarily on identifying who
the sender is, and rejecting messages at that level, with only a
minority of "unknown sender" messages being passed on to statistical
classification. A restrictive server would simply reject unknown senders.
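
A restrictive policy of that kind could be as simple as the sketch below.
The KNOWN_SENDERS table, the function names, and the choice of keying on
the envelope sender domain plus client network are all assumptions made
for illustration. Nothing here reflects how Google identifies senders.

```python
import ipaddress

# Hypothetical allow-list (assumption): envelope sender domain mapped to
# the networks that domain is expected to connect from.
KNOWN_SENDERS = {
    "example.org": [ipaddress.ip_network("198.51.100.0/24")],
    "example.net": [ipaddress.ip_network("203.0.113.16/28")],
}


def is_known_sender(mail_from: str, client_ip: str) -> bool:
    """True only if the sender domain is listed and the client IP is
    inside one of that domain's registered networks."""
    domain = mail_from.rpartition("@")[2].strip("<> ").lower()
    networks = KNOWN_SENDERS.get(domain, [])
    addr = ipaddress.ip_address(client_ip)
    return any(addr in net for net in networks)


def smtp_reply(mail_from: str, client_ip: str) -> str:
    """Decide at MAIL FROM time, before any message data is sent."""
    if is_known_sender(mail_from, client_ip):
        return "250 OK"
    return "550 5.7.1 Sender not recognized"
```

Because the decision needs only the client IP and the MAIL FROM argument,
an unknown sender is rejected before the DATA phase and nothing reaches
statistical classification.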

Most of the other stuff, like consulting RBLs or SPF, seems like it
should be doable via lighter-weight policy services in
Postfix. What you lack is the ability to aggregate the various bits of
metadata and put them into a weighted formula to make a fuzzy
determination of ham vs. spam, which of course Spamassassin does.
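
The aggregation step is easy to sketch, even if tuning it is not. The
signal names, weights, and threshold below are invented for the example
and are not Spamassassin's rules or scores. The point is only that each
piece of metadata contributes to a sum instead of being a hard
accept/reject on its own.

```python
# Hypothetical weights (assumption): higher means more spam-like.
WEIGHTS = {
    "on_rbl": 3.0,         # client IP is listed on a DNS block list
    "spf_fail": 2.0,       # SPF check failed for the envelope sender
    "no_ptr": 1.0,         # client IP has no PTR record at all
    "helo_mismatch": 0.5,  # HELO name does not match the PTR name
}

# Hypothetical cut-off (assumption): at or above this, call it spam.
THRESHOLD = 5.0


def spam_score(signals: dict) -> float:
    """Sum the weights of every signal that fired; unknown names are
    ignored."""
    return sum(
        WEIGHTS[name]
        for name, fired in signals.items()
        if fired and name in WEIGHTS
    )


def classify(signals: dict) -> str:
    """Turn the weighted sum into a ham/spam verdict."""
    return "spam" if spam_score(signals) >= THRESHOLD else "ham"
```

For example, a client with no PTR record alone scores 1.0 and passes,
while one that is on an RBL and also fails SPF reaches the threshold.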

If you change your focus from looking for spam-like behavior to strictly
identifying the sender, as Google does, can the problem be greatly
simplified, with the understanding that this would not be a general
solution, as it would reject lots of ham under typical conditions?

 -Tom

-- 
Tom Metro
Venture Logic, Newton, MA, USA
"Enterprise solutions through open source."
Professional Profile: http://tmetro.venturelogic.com/