[BBLISA] Automating anomaly detection, was Re: More on SNMP statistical prediction & problem detection

Paul Beltrani spamgrinder at gmail.com
Mon Aug 19 11:50:53 EDT 2013


Those of you interested in automating anomaly and problem detection
may want to check out some of the work Etsy is doing.

The abstract and slides from their talk at Velocity 2013 in Santa
Clara, CA are available at
http://velocityconf.com/velocity2013/public/schedule/detail/28177
If this interests you, it is worth finding the video of the talk, as
the post-presentation Q&A was useful.

Another version of the slides, from a different conference and with a
link to video, is at
https://speakerdeck.com/astanway/bring-the-noise-continuously-deploying-under-a-hailstorm-of-metrics

The Etsy blog posting introducing Kale (Skyline and Oculus), the tool
discussed in the talk, is available at
http://codeascraft.com/2013/06/11/introducing-kale/

  - Paul Beltrani

On Tue, Aug 13, 2013 at 11:55 AM, Alex Aminoff <alex at basespace.net> wrote:
> On 8/9/2013 4:19 PM, Marc Chiarini (school) wrote:
> >
> > There is a very important academic & practical discussion to be had about
> > this.  In fact Alva Couch and I and others have been examining similar
> > topics for years.  Unfortunately I don't have the bandwidth right now to get
> > into it, perhaps in a few months.  I'll leave you with these two tidbits:
> > thresholds are no good in these circumstances (except as a coarse
> > lower/upper bound)...you need to combine learning (small amounts of
> > hysteresis) and highly reactive management.  Second, one might be able to
> > obtain unrefined but useful estimates of performance in various components
> > (e.g., cpu, disk, network, etc) without an agent -- via analysis of
> > response-time and other statistics...essentially building a black-box model
> > over time of how the system is *expected* to work.
> >
> > Regards,
> > Marc
> >
>
> Thanks for this tidbit.
>
> I read the slides from your 2009 paper,
> http://www.cs.tufts.edu/~couch/publications/mace-09-slides.pdf
>
> Not sure I understood the details, but enough to move forward.
>
> I presume you are aware of the work that Jake Brutlag did and added to
> RRDTool, presented at
>
> https://www.usenix.org/legacy/events/lisa00/full_papers/brutlag/brutlag_html/
>
> He implemented the Holt-Winters algorithm for time-series modeling. I'm
> going to use that because it's already been done for me.
>
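
For anyone who has not looked at it, the idea in Brutlag's paper can be
sketched in a few lines of Python. This is only an illustration of
additive Holt-Winters smoothing with a seasonal deviation band, not
RRDTool's actual code. The function name brutlag_flags, the parameter
defaults, and the two-season warm-up are assumptions made for the
example.

```python
import math
import random


def brutlag_flags(series, period, alpha=0.1, beta=0.01, gamma=0.1, delta=3.0):
    """Flag points that fall outside a Holt-Winters confidence band.

    Hypothetical helper: season 1 seeds level and seasonal terms,
    season 2 seeds the per-slot deviations, and flagging starts in
    season 3.
    """
    if len(series) < 2 * period:
        raise ValueError("need at least two full seasons to warm up")

    level = sum(series[:period]) / period          # baseline
    trend = 0.0                                    # linear trend
    season = [y - level for y in series[:period]]  # seasonal offsets
    dev = [0.0] * period                           # smoothed |error| per slot
    flags = [False] * len(series)

    for t in range(period, len(series)):
        i = t % period
        y = series[t]
        predicted = level + trend + season[i]
        err = abs(y - predicted)

        if t < 2 * period:
            dev[i] = err                           # warm-up: seed deviation
        else:
            # Compare against last season's deviation, then update it.
            flags[t] = err > delta * dev[i]
            dev[i] = gamma * err + (1 - gamma) * dev[i]

        prev_level = level
        level = alpha * (y - season[i]) + (1 - alpha) * (level + trend)
        trend = beta * (level - prev_level) + (1 - beta) * trend
        season[i] = gamma * (y - level) + (1 - gamma) * season[i]

    return flags


if __name__ == "__main__":
    rng = random.Random(0)
    period = 24
    data = [10 + 5 * math.sin(2 * math.pi * t / period) + rng.gauss(0, 0.5)
            for t in range(5 * period)]
    data[100] += 50                                # inject one obvious spike
    print(brutlag_flags(data, period)[100])
```

A single out-of-band point is noisy on its own, which is presumably why
one would look at several violations together rather than alarm on
each one.
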
> So the only thing I'm going to add is a meta-analysis where you collect, say,
> 10 SNMP variables from 10 switches, each of which has 24 ports, for a total
> of 2400 time series, and then ask the question: do enough of these differ
> from their predicted values by enough to indicate a systemic problem?
>
> My question is, does anyone have a suggestion for what statistical method to
> use for the meta-analysis? In your paper, it looks like you were only
> looking at one time-series at a time: has anyone looked at how to sensibly
> combine? Alternatively, I have not looked closely at what you can get from
> the Holt-Winters stuff in RRDTool - has anyone used that for any purpose?
>
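
One simple way to frame the meta-analysis, offered as a sketch rather
than a recommendation: suppose each series independently falls outside
its predicted band with some small baseline probability. Then the
number of out-of-band series at a given time step is binomial, and you
can ask how surprising the observed count is. The names below
(binomial_upper_tail, systemic_alarm, p_false, significance) are
hypothetical, and p_false is a false-positive rate you would have to
estimate from your own data. Independence is a strong assumption, since
ports on the same switch tend to move together, so treat the result as
a rough screen.

```python
import math


def binomial_upper_tail(n, k, p):
    """P(X >= k) for X ~ Binomial(n, p), summed in log space.

    Log space avoids overflow in the binomial coefficients when n is
    in the thousands. Requires 0 < p < 1.
    """
    log_p, log_q = math.log(p), math.log(1.0 - p)
    total = 0.0
    for j in range(k, n + 1):
        log_pmf = (math.lgamma(n + 1) - math.lgamma(j + 1)
                   - math.lgamma(n - j + 1) + j * log_p + (n - j) * log_q)
        total += math.exp(log_pmf)
    return min(total, 1.0)


def systemic_alarm(flags, p_false=0.01, significance=1e-4):
    """Decide whether the count of flagged series looks systemic.

    flags: one boolean per time series for the current time step.
    Returns (alarm, flagged_count, tail_probability).
    """
    n = len(flags)
    k = sum(1 for f in flags if f)
    tail = binomial_upper_tail(n, k, p_false)
    return tail < significance, k, tail


if __name__ == "__main__":
    # 2400 series; with p_false = 0.01 about 24 flags are expected.
    quiet = [True] * 24 + [False] * 2376
    busy = [True] * 60 + [False] * 2340
    print(systemic_alarm(quiet))
    print(systemic_alarm(busy))
```
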
>  - Alex Aminoff
>    BaseSpace.net & NBER
>
