[BBLISA] Monitoring survey

Thu Feb 4 10:32:05 EST 2016

We're straddling multiple tools and doing on-call with PagerDuty.

- AWS systems: CloudWatch + custom metrics -> PagerDuty for alerting
- Traditional datacenters: check_mk (based on Nagios; see omdistro.org)
  covers servers and networking equipment and sends some emails ->
  PagerDuty for alerting
- SoftLayer: Zabbix, because their dev group loves it for some reason ->
  PagerDuty for alerting

We also looked at Datadog and really liked it, but it had a minimal story
for networking equipment. Stackdriver also looked great as a "statsd +
graphite as a service" offering, but like Datadog, it doesn't have the
greatest story for networking equipment.

It seems like if you're doing datacenter compute, datacenter networking,
and cloud monitoring all together, you're stuck in the wasteland of either
running multiple tools or living with visibility gaps. There's another SaaS
called flapjack.io that does event roll-up. I think VictorOps is in that
space too. We just punted and configured multiple tools to roll alerting
into PagerDuty. It all feels bad and lacks cohesion, but it works.
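
To make the "roll everything into PagerDuty" idea concrete, here is a
minimal Python sketch, not a description of any real setup. The function
name, the tool and state strings, and the routing key are all hypothetical.
The event layout assumes the field names of PagerDuty's Events API v2
(routing_key, event_action, dedup_key, payload). It only builds the event;
actually sending it is left out.

```python
import json

# Hypothetical normalized states mapped to PagerDuty severity values.
SEVERITY = {"CRITICAL": "critical", "WARNING": "warning", "OK": "info"}


def to_pagerduty_event(tool, host, check, state, routing_key):
    """Build one Events API v2 style event from a tool-specific alert.

    All argument names are illustrative; each monitoring tool would need
    its own small adapter to produce them.
    """
    return {
        "routing_key": routing_key,
        # An OK state closes the incident, anything else opens or updates it.
        "event_action": "resolve" if state == "OK" else "trigger",
        # Same tool + host + check always maps to the same incident.
        "dedup_key": f"{tool}:{host}:{check}",
        "payload": {
            "summary": f"[{tool}] {check} is {state} on {host}",
            "source": host,
            # Unknown states fall back to "error" rather than being dropped.
            "severity": SEVERITY.get(state, "error"),
        },
    }


if __name__ == "__main__":
    # Made-up alerts from the three tools mentioned above.
    alerts = [
        ("cloudwatch", "web-1", "CPUUtilization", "CRITICAL"),
        ("check_mk", "core-sw-1", "ping", "WARNING"),
        ("zabbix", "db-1", "disk", "OK"),
    ]
    for alert in alerts:
        event = to_pagerduty_event(*alert, routing_key="EXAMPLE_KEY")
        print(json.dumps(event, indent=2))
```

Keying dedup_key on tool, host, and check means a later OK from the same
tool resolves the incident it opened, but the same outage seen by two
tools still shows up twice, which is the cohesion gap described above.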

I've also played with the ELK stack + topbeat, putting logging and
metrics in the same place. It feels weird putting my system metrics in the
same hole as my system logs, but it works surprisingly well. Watcher
(Elasticsearch's commercial offering) does a decent job with alerting, but
is a bear to configure correctly (a mess of JSON and Lucene search queries).

On Thu, Feb 4, 2016 at 8:38 AM, Antony Rudie <antony.rudie at gmail.com> wrote:

> That last sentence is god's own truth.
>
> One more thing you need:  A solid plan for what happens to the alerts.
> Where do they go?  Who deals with them?  How do you know if they've been
> dealt with?  It's not rocket science, but it's really important.  I worked
> in a place where that piece was missing, and frankly, I'm not sure the
> whole monitoring setup added any value.
>
> On Wed, Feb 3, 2016 at 7:52 PM, John Stoffel <john at stoffel.org> wrote:
>
>>
>> One of the things that people seem to miss, or overlook in my opinion
>> is the cost of doing all this monitoring, and the steep learning curve
>> you have for all of it.  It's going to suck in a bunch of time at
>> first, way more than people think, and getting it tuned so that it's
>> not sending out false alarms is a huge task.
>>
>> I've played with Nagios, and we have Solarwinds at $WORK, but neither
>> is well done or really used outside of silos.  I also played around
>> with collectd and graphite, but found it too simplistic in terms of
>> access control for what I/we wanted.  And we have an old instance of
>> WhatsUp running as well for another group.  It's all a hodgepodge.  We
>> really should dedicate someone to doing this work, but we all keep
>> getting pulled in new directions all the time.
>>
>> It might be easier if you're just upgrading from something and you
>> know what you want to monitor, etc.  But it's not a simple drop-in
>> tool like some people make out.  It requires commitment and discipline
>> to use effectively.
>>
>> John
>>
>> _______________________________________________
>> bblisa mailing list
>> bblisa at bblisa.org
>> http://www.bblisa.org/mailman/listinfo/bblisa
>>
>

-- 
Patrick Flaherty | Systems Architect
@platformpatrick    e: patrick.flaherty at weather.com