[BBLISA] Monitoring survey

John Miller johnmill at brandeis.edu
Thu Feb 4 10:35:08 EST 2016


Thanks to everyone for their responses!  Fortunately, I don't think
anyone here thinks that we can just do a drop-in replacement for
Hyperic.  There are so many metrics that you can monitor, and it's
highly unlikely that they'll be configured in the same way (polling
intervals, alert thresholds, etc.) as a previous tool.  Combine that
with alerting schedules, groups, etc., and it's a substantial task.

On the alerting end, we're slowly moving towards getting everyone in
our department to use OpsGenie.  The truth is that you're never going
to have just one monitoring tool these days - I think we have ten or
so in various states of use - so you need something that can aggregate
all of your alerts and notify you in a uniform fashion.  It's way
easier to have your monitoring software say "This is a priority one
alert," then ship it off to OpsGenie (PagerDuty, etc.), and let that
worry about who's on call, how people should be notified, etc.  It
also allows people to choose how they'd like to be notified (SMS,
e-mail, app, phone call, etc.).  No way we could ever go back--it's
just such an awesome tool.
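
As a rough illustration of that split, here is a minimal sketch in
Python.  It is not the OpsGenie or PagerDuty API; every name in it
(Alert, ON_CALL, PREFERENCES, route) is made up for the example.  The
monitoring tool only states a priority; the aggregator decides who is
on call and how each person is reached.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    source: str    # which monitoring tool raised it (hypothetical label)
    priority: int  # 1 = most urgent
    message: str


# Hypothetical on-call rota (priority -> people) and per-person preferences.
ON_CALL = {1: ["alice"], 2: ["bob"]}
PREFERENCES = {"alice": "sms", "bob": "email"}


def route(alert, on_call=ON_CALL, preferences=PREFERENCES):
    """Return one (person, channel, text) tuple per person to notify."""
    text = f"[P{alert.priority}] {alert.source}: {alert.message}"
    people = on_call.get(alert.priority, [])
    # Fall back to e-mail when someone has not chosen a channel.
    return [(person, preferences.get(person, "email"), text) for person in people]


if __name__ == "__main__":
    for notification in route(Alert("nagios", 1, "disk full on db01")):
        print(notification)
```

The point of the design is that none of the ten-odd monitoring tools
needs to know the rota or anyone's contact preferences; they only need
to agree on what a priority means.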

Patrick - I actually remember you giving a lightning talk on Sensu a
couple of years back.  It's got good community support, so it's one of
the things I'm going to demo in the next few days.

Does anyone have experience with LogicMonitor, New Relic, Scout,
Librato, Traverse, etc.?  I could see those tools being a very good
fit for us, provided that metrics are collected in a sane fashion (we
don't need hundreds of devices trying to report off site) and the
billing model is flexible (we want our VMs now, thank you very much).
We've also got strong in-house Windows experience: has anyone used
SCOM to monitor Linux devices?

John

On Thu, Feb 4, 2016 at 8:38 AM, Antony Rudie <antony.rudie at gmail.com> wrote:
> That last sentence is god's own truth.
>
> One more thing you need:  A solid plan for what happens to the alerts.
> Where do they go?  Who deals with them?  How do you know if they've been
> dealt with?  It's not rocket science, but it's really important.  I worked
> in a place where that piece was missing, and frankly, I'm not sure the whole
> monitoring setup added any value.
>
> On Wed, Feb 3, 2016 at 7:52 PM, John Stoffel <john at stoffel.org> wrote:
>>
>>
>> One of the things that people seem to miss, or overlook, in my opinion
>> is the cost of doing all this monitoring, and the steep learning curve
>> you have for all of it.  It's going to suck in a bunch of time at
>> first, way more than people think, and getting it tuned so that it's
>> not sending out false alarms is a huge task.
>>
>> I've played with Nagios, and we have Solarwinds at $WORK, but neither
>> is well done or really used outside of silos.  I also played around
>> with collectd and graphite, but found it too simplistic in terms of
>> access control for what I/we wanted.  And we have an old instance of
>> WhatsUp running as well for another group.  It's all a hodgepodge.  We
>> really should dedicate someone to doing this work, but we all keep
>> getting pulled in new directions all the time.
>>
>> It might be easier if you're just upgrading from something and you
>> know what you want to monitor, etc.  But it's not a simple drop-in
>> tool like some people make out.  It requires commitment and discipline
>> to use effectively.
>>
>> John
>>
>> _______________________________________________
>> bblisa mailing list
>> bblisa at bblisa.org
>> http://www.bblisa.org/mailman/listinfo/bblisa
>
>



-- 
John Miller
Systems Engineer
Brandeis University
johnmill at brandeis.edu
(781) 736-4619


