[BBLISA] Monitoring survey

Thu Feb 4 11:46:24 EST 2016

On 2/4/2016 10:35 AM, John Miller wrote:
> Does anyone have experience with LogicMonitor, New Relic, Scout,
> Librato, Traverse, etc.?  I could see those tools being a very good
> fit for us, provided that metrics are collected in a sane fashion (we
> don't need hundreds of devices trying to report off site) and the
> billing model is flexible (we want our VMs now, thank you very much).
> We've also got strong in-house Windows experience: has anyone used
> SCOM to monitor Linux devices?
>

I use Librato extensively and have used New Relic peripherally.

New Relic was fairly easy to use.  Their first solution was in the 
appserver monitoring space, and although they've grown beyond it, I 
think it still colors them.  On the other hand, they have great 
navigation and summaries for problems in that space.  If I had any need 
for appserver monitoring I'd probably look to them first, though the 
DataDog demos I've seen also look very cool.

Librato is basically graphite-as-a-service, though it's a much improved 
graphite.  It has excellent support for metric ingest from a number of 
sources (including directly from CloudWatch), a very easy-to-use REST 
protocol, and language-specific APIs.  It supports standard threshold 
alerting and integrates with PagerDuty.  I've built more complex 
alerting by extracting data via their API (quite easy) and doing more 
interesting calculations.
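
As a rough illustration of that kind of derived alert (a sketch only, 
not the code described above): assume two series have already been 
pulled from the metrics API as lists of (timestamp, value) pairs.  The 
sample data, the 5% threshold, and the helper names error_ratio and 
should_alert are all hypothetical.

```python
from typing import Dict, List, Tuple

Series = List[Tuple[int, float]]  # (unix_timestamp, value) pairs


def error_ratio(errors: Series, requests: Series) -> Dict[int, float]:
    """Return errors/requests for every timestamp present in both series."""
    req_by_ts = dict(requests)
    ratios = {}
    for ts, err in errors:
        total = req_by_ts.get(ts)
        if total:  # skip missing samples and zero-traffic intervals
            ratios[ts] = err / total
    return ratios


def should_alert(ratios: Dict[int, float], threshold: float,
                 min_breaches: int) -> bool:
    """Alert only if the latest `min_breaches` samples all exceed threshold."""
    recent = [ratios[ts] for ts in sorted(ratios)][-min_breaches:]
    return len(recent) == min_breaches and all(r > threshold for r in recent)


# Hypothetical data; in practice this would be fetched from the vendor's API.
errors = [(60, 1.0), (120, 9.0), (180, 12.0)]
requests = [(60, 100.0), (120, 100.0), (180, 150.0)]

ratios = error_ratio(errors, requests)
alert = should_alert(ratios, threshold=0.05, min_breaches=2)
```

Requiring several consecutive breaches is one simple way to avoid 
paging on a single noisy sample, which a plain per-metric threshold 
cannot express when the signal is a ratio of two metrics.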

I'm currently using it with CloudWatch ingest for infrastructure, 
collectd for guest-level metrics (heavily filtered to avoid duplication 
with CloudWatch and to drop irrelevant metrics), and direct use of their 
API for app-level metrics.
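
For the app-level case, a minimal sketch of what a direct submission 
might look like.  Assumptions: the endpoint URL and the "gauges" payload 
shape follow Librato's legacy metrics API as best I recall it, so check 
them against the current documentation; the metric name, source, and 
credentials are placeholders, and build_gauge_request is a hypothetical 
helper.  The request is only built here, not sent.

```python
import base64
import json
import time
import urllib.request

# Assumption: legacy Librato metrics endpoint; verify before relying on it.
API_URL = "https://metrics-api.librato.com/v1/metrics"


def build_gauge_request(user, token, name, value, source, measure_time=None):
    """Build (but do not send) an HTTPS POST carrying a single gauge sample."""
    if measure_time is None:
        measure_time = time.time()
    payload = {
        "gauges": [{
            "name": name,
            "value": value,
            "source": source,
            "measure_time": int(measure_time),
        }]
    }
    credentials = base64.b64encode(f"{user}:{token}".encode()).decode()
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode(),
        headers={
            "Authorization": f"Basic {credentials}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Placeholder credentials and metric name, for illustration only.
req = build_gauge_request("user@example.com", "not-a-real-token",
                          "app.checkout.latency_ms", 42.5, "web-01",
                          measure_time=1454600000)
body = json.loads(req.data)

# To actually submit: urllib.request.urlopen(req, timeout=10)
```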

Their billing model is per metric with different update rates costing 
different amounts.  I *never* have to think about Librato when adding 
new resources -- it "just works".  I believe it costs us on the order of 
1% of our AWS bill :-)
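
To make the per-metric model concrete, here is a back-of-the-envelope 
sketch.  The prices are invented placeholders, NOT Librato's actual 
rates, and the metric counts are made up; the point is only that cost 
scales with the number of metric streams and their update interval, 
which is why filtering out duplicate metrics matters.

```python
# Hypothetical price in cents per metric stream per month, keyed by the
# update interval in seconds.  These are placeholders, not real rates.
PRICE_CENTS_PER_METRIC = {10: 40, 60: 10, 300: 3}


def monthly_cost_cents(metric_counts):
    """Sum the cost of all metric streams, grouped by update interval."""
    return sum(PRICE_CENTS_PER_METRIC[interval] * count
               for interval, count in metric_counts.items())


# Made-up example: everything reported every 60s, versus a filtered set
# with the less important metrics moved to a 300s interval.
unfiltered = monthly_cost_cents({60: 5000})
filtered = monthly_cost_cents({60: 800, 300: 1200})
savings = unfiltered - filtered
```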

I'm not sure what you mean by "metrics are collected in a sane 
fashion".  It *is* a service and thus your devices do need to report to 
their system over the Internet (TLS, of course).  For our needs (mix of 
public cloud and on-prem) this is sane. I don't recall any time their 
metric collection went down.

I haven't investigated whether they have some kind of relay agent.  We 
did take down their (then beta) display system once by doing evil things 
with lots of people viewing dashboards of composite metrics on large 
metric sets.  However, they were back up within 4 hours (and worked 
with us really well) and their metric ingest never hiccuped.  (Note to 
self:  inform all service vendors in advance of a large product launch.  
Your blast radius is larger than your company.)

--
Dewey