[BBLISA] simpler alternative to Nagios

Dean Anderson dean at av8.com
Wed Sep 1 14:53:15 EDT 2010


It may be a question of semantics.

Being out of process slots only means new processes can't be run. Being
out of file descriptors only means that no more files can be opened.
Neither of these means the box is down; the kernel has not stopped
working; old work proceeds. They are merely resource depletions; when
tasks end and files are closed, new work proceeds.  "Crash" means the
kernel has stopped working; it cannot schedule old work and it will not
fix itself.  It has "hung".  By contrast, merely being out of process
slots or file descriptors is a problem that has the potential to fix
itself.
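
To make the "depletion, not crash" distinction concrete, here is a rough
sketch of how one might check file handle headroom on a Linux box. It
assumes the three-field format of /proc/sys/fs/file-nr (allocated,
allocated-but-unused, maximum). The function names and the 90% warning
threshold are my own illustrative choices, not anything from a
particular monitoring tool.

```python
from pathlib import Path


def parse_file_nr(text):
    """Parse the contents of /proc/sys/fs/file-nr.

    The file holds three whitespace-separated integers: allocated
    handles, allocated-but-unused handles, and the system-wide maximum.
    """
    allocated, unused, maximum = (int(field) for field in text.split())
    return allocated, unused, maximum


def describe(text, warn_fraction=0.9):
    """Classify file handle usage as 'ok', 'nearly depleted' or 'depleted'.

    warn_fraction is a hypothetical threshold picked for illustration.
    """
    allocated, unused, maximum = parse_file_nr(text)
    in_use = allocated - unused
    if in_use >= maximum:
        return "depleted"
    if in_use >= warn_fraction * maximum:
        return "nearly depleted"
    return "ok"


if __name__ == "__main__":
    path = Path("/proc/sys/fs/file-nr")  # Linux-specific location
    if path.exists():
        print(describe(path.read_text()))
```

A "depleted" result here describes a box that is still running and may
recover once files are closed. A hung kernel would never get as far as
running the check at all.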

Whether the system should be rebooted to fix resource depletions is a
matter of situation and judgement, but it isn't a "crash".

Depending on how the system is managed, this might even be a normal
condition. (Well, probably not anymore; large, cheap memory has had the
effect that no one runs batch systems with just enough processes and
file descriptors for the anticipated load.  Indeed, probably few people
remember when the numbers of process slots and file descriptors were
significant tuning factors...)

		--Dean

On Wed, 1 Sep 2010, Brian O'Neill wrote:

> I haven't experienced this, but there are many reasons why a linux box 
> (or solaris, etc.) can respond to a ping but nothing else is working - 
> out of process slots, file descriptors, etc.
> 
> ping only means your network connectivity between here and there works, 
> and the box is at least powered on and the network stack is functioning.
> 
> 
> On 9/1/2010 12:15 PM, Dean Anderson wrote:
> > I haven't checked recently if this is still the case.  Years ago linux
> > changed the interrupt handler to respond to ping.  This greatly improved
> > ping response by cheating, but had some other side effects.  The kernel
> > can crash, and if the crash didn't interfere with the interrupt table,
> > it will still ping while dead.
> >
> > 		--Dean
> >
> > On Fri, 27 Aug 2010, Robert Keyes wrote:
> >
> >>
> >>
> >> On Fri, 27 Aug 2010, David N. Blank-Edelman wrote:
> >>
> >>> Hi Alex-
> >>>   Big Brother's successor, Xymon (formerly Hobbit) at http://sourceforge.net/projects/xymon/ may get you closer to what you seek.
> >>
> >> I am going to jump on the bandwagon here and yell 'Xymon!'
> >>
> >> Looking at your earlier idea about checking with ping, I cringed for a
> >> second, but then recovered. But it might be worth mentioning here that
> >> ping is NOT sufficient to see if a host is alive! I have known of an
> >> organization which used ping to see if its servers were alive, but ping
> >> didn't detect when a DoS attack ran the servers out of filehandles causing
> >> all net services to become unavailable, including SSH, without affecting
> >> the ICMP stack.
> >>
> >> _______________________________________________
> >> bblisa mailing list
> >> bblisa at bblisa.org
> >> http://www.bblisa.org/mailman/listinfo/bblisa
> >>
> >>
> >
> 
> 
> 

-- 
Av8 Internet   Prepared to pay a premium for better service?
www.av8.net         faster, more reliable, better service
617 256 5494



