Building a Network Management System
by Mark Cooper, in Editorials - Sun, Mar 13th 2005 00:00 UTC
This article looks at current NMS offerings and considers how and what
would make a "real" NMS.
Copyright notice: All reader-contributed material on freshmeat.net
is the property and responsibility of its author; for reprint rights, please contact the author
directly.
What is an NMS?
The normal definition of NMS is "Network Management System". This is
nice and easy to say, but very hard to pin down to an exact
specification. What constitutes a well-rounded NMS?
I believe it to consist of at least:
- Up/Downtime Monitoring
- Reporting
- Configuration Change Management
- IP/Asset Management
- Security
- Event Correlation/Root Cause
- Alerting
There are a large number of Free/Open Source Software and commercial
systems that claim to be NMSes, but none come close to covering all this
functionality. Typically, systems fall into either a Network Monitoring
(Up/Down) or Network Reporting role, not both.
NMS Generations
The types of systems available can be crudely categorized into three
distinct generations:
- Pure Up/Down Monitoring.
Typically with just ICMP, but some with applications (DNS, HTTP,
etc.).
- Event correlation.
Polling using SNMP, ICMP, and applications. Alerting on SNMP traps
and syslog.
- Root Cause Analysis.
Advanced event correlation to keep false positive alerts to a
minimum.
Event Correlation/RCA
Event correlation is the core functionality of an NMS. Without it, too
many false positive alerts are generated, which makes the system
ineffective.
Root Cause Analysis takes event correlation a step further. Rather than
just dampening alerts from nodes downstream of an existing problem, it
only alerts on the real cause of a problem, to significantly
reduce the time needed for a fix.
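As a rough illustration of the idea, the sketch below suppresses alerts for nodes whose upstream neighbour is also down and reports only the topmost failures. It assumes a simple tree topology with one upstream parent per node; the node names and the `root_causes` helper are hypothetical, and a real network with redundant paths would need more than this.

```python
from typing import Dict, List, Optional, Set


def root_causes(parent: Dict[str, Optional[str]], down: Set[str]) -> List[str]:
    """Return the down nodes whose upstream parent is not itself down."""
    causes = []
    for node in sorted(down):
        upstream = parent.get(node)
        # A node is a candidate root cause if nothing above it has failed.
        if upstream is None or upstream not in down:
            causes.append(node)
    return causes


# Hypothetical topology: each node maps to the node it is reached through.
topology = {
    "core-rtr": None,
    "dist-sw": "core-rtr",
    "web1": "dist-sw",
    "web2": "dist-sw",
    "mail1": "core-rtr",
}
down_nodes = {"dist-sw", "web1", "web2"}

# Only the switch is reported; the two web servers behind it are suppressed.
print(root_causes(topology, down_nodes))
```

Plain event dampening would stop at hiding the downstream alerts; the point of the sketch is that the operator is handed the one node worth fixing first.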
Efficient/Intelligent Polling
Currently, a typical NMS platform will consist of two main systems, with
one solution doing the Up/Down monitoring, the other the reporting.
This leads to extremely inefficient double polling of devices. Why ping
a host to see if it's up when you've just gathered interface stats from
it? Some systems can be integrated to help reduce this double polling,
but only a single NMS solution will truly provide the most efficient use
of the network.
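One way to picture a poller that avoids the double polling described above is to treat a recent successful statistics collection as proof of life and only fall back to a ping when that evidence is stale. This is a minimal sketch under that assumption; the `Poller` class, its `freshness` window, and the injected `ping` callable are all hypothetical names, not part of any existing NMS.

```python
import time
from typing import Callable, Dict


class Poller:
    """Skip the up/down probe when stats were gathered recently."""

    def __init__(self, freshness: float, ping: Callable[[str], bool],
                 clock: Callable[[], float] = time.monotonic):
        self.freshness = freshness          # seconds a success stays valid
        self.ping = ping                    # fallback reachability probe
        self.clock = clock
        self.last_ok: Dict[str, float] = {}

    def record_stats_ok(self, host: str) -> None:
        """Call this whenever interface stats were collected successfully."""
        self.last_ok[host] = self.clock()

    def is_up(self, host: str) -> bool:
        seen = self.last_ok.get(host)
        if seen is not None and self.clock() - seen <= self.freshness:
            return True                     # fresh evidence, no ping needed
        alive = self.ping(host)
        if alive:
            self.last_ok[host] = self.clock()
        return alive
```

The reporting collector and the up/down monitor share one piece of state here, which is exactly what two separate products cannot easily do.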
To map, or not to map?
The traditional NMS provides a network map for operators to be able to
point and click through to any problems. Some systems have dropped this
functionality, claiming that operators only really need to be told what
the real problems are. These are typically Second Generation event
correlation engines, that just provide a list of problems for the
operator.
However, no matter how advanced the logic is in an NMS, it cannot cover
all problems, and providing a visual representation for operators to
work with can provide major gains. The human brain works best with
visual images rather than the written word. NMSes need a map!
It's all about the Man-Machine Interface!
Aside from alerting (via email, SMS, etc.), how should an NMS interface
with the operators? There are two distinct camps, dedicated GUI and
HTTP. A growing number of HTTP interfaces (typically with some Java
thrown in) are being used.
While this type of interface may have its uses, it is not the best
medium in an operational environment. A dedicated GUI is the only way to
provide a fast, efficient, reliable mechanism for operators to interact
with an NMS.
Putting the M in NMS
The M stands for Management, but what's being managed, exactly? Network
problems, mainly. A single generic management interface is somewhat of a
holy grail that some people have been chasing. Is it achievable?
How far should management be taken? Many vendors have proprietary
management software for their systems to provide an alternative to the
command line. Should an NMS allow full management of a device without
having to resort to a CLI? Some things can be done easily by SNMP, but
what interaction should an NMS have with a device's CLI? RANCID provides an easy
change management system for routers, but also shows the possibilities
of being able to integrate functions into an NMS that typically are done
at the CLI level.
Think of being able to make mass changes (for example, SNMP community
changes) with a few clicks in a GUI, rather than having to log in to
thousands of devices manually.
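A mass change of this kind could be sketched as below. The command strings follow Cisco IOS-style syntax as an assumption (other vendors differ), and the transport is deliberately left as a caller-supplied `send` function, since how commands reach a device (SSH, telnet, a RANCID-like wrapper) is outside what the article specifies. Both function names are hypothetical.

```python
from typing import Callable, Dict, Iterable, List


def community_change_commands(old: str, new: str) -> List[str]:
    """Build an IOS-style command list (assumed syntax) that swaps a read-only community."""
    return [
        "configure terminal",
        f"snmp-server community {new} RO",   # add the new string first
        f"no snmp-server community {old}",   # then retire the old one
        "end",
    ]


def push_to_all(devices: Iterable[str], commands: List[str],
                send: Callable[[str, List[str]], None]) -> Dict[str, bool]:
    """Apply the same commands to every device and record which ones succeeded."""
    results: Dict[str, bool] = {}
    for device in devices:
        try:
            send(device, commands)
            results[device] = True
        except Exception:
            # Keep going; the operator needs the full list of failures.
            results[device] = False
    return results
```

Adding the new community before removing the old one keeps the NMS able to poll the device throughout the change.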
Current Solutions
F/OSS
I'll mention the commercial solutions as well, as they typically have
far better Man-Machine Interfaces. This is a typical problem with F/OSS,
as programmers don't usually make good UI engineers.
Commercial
- HP OpenView
- SMARTS
- Aprisma
- Netcool
- Concord
- Proviso
- InfoVista
Recreation or integration?
The beauty of F/OSS is that we have a huge, growing repository of code.
So, do we start coding the "perfect" NMS from scratch, or use the tools
already provided and just integrate the functionality we require?
Some commercial vendors make big claims about how their code is
multi-threaded and "industrial strength". Producing good, clean,
efficient code that will run on many platforms and is part of a large
system is hard to do! Such a large, complicated system can also be
extremely hard for new coders to get into. Keeping the functionality
compartmentalized into small programs can ease these problems. This ties
in well with using existing toolsets and just concentrating on an
integration issue.
OSSIM is taking the integration approach, and it is well worth watching
how well this works. Obviously, the double polling issue rears its
head here, and would be a serious limiting factor in any large
implementation. Although OSSIM is coming from a security requirements
background, it offers an example for the creation of a "proper" F/OSS
NMS system. How much work is involved in integrating Nagios with
RRDTOOL? Could the cheops-ng GUI be used as the frontend for Nagios?
How do our original NMS requirements map to existing F/OSS projects?
| Up/Down Monitoring: | Nagios, BigBrother |
| Reporting: | MRTG, RRDTool |
| Configuration Change Management: | RANCID |
| IP/Asset Management: | Northstar |
| Security: | Snort, Tripwire |
| Alerting: | Sendmail, etc. |
Most of the functionality is covered across a number of projects. This
only leaves Root Cause Analysis. Unfortunately, this is probably one of
the hardest things to do.
To Poll, or not to Poll?
First generation NMSes like Nagios and Big Brother rely on polling, via
ICMP or an application-specific method (HTTP, FTP, etc.), to do their
up/down monitoring. Unfortunately, this really isn't network
management. It's just node polling, and has major disadvantages.
To poll means there is a polling interval. What is the state of your
network during these intervals? Actively polling the network is also a
major scalability problem. The larger your network, the more polling
required. Active polling systems are fine for monitoring a handful of
systems, but to manage a network, you have to look at other
mechanisms.
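The scaling argument can be put into numbers with a small back-of-the-envelope calculation. The figures used here are illustrative assumptions, not measurements from the article, and the model ignores timeouts and retries.

```python
def probes_per_second(hosts: int, checks_per_host: int, interval_s: float) -> float:
    """Average probe rate the poller must sustain."""
    return hosts * checks_per_host / interval_s


def mean_detection_delay(interval_s: float) -> float:
    """Expected wait until the next poll for a failure at a random point in the cycle."""
    return interval_s / 2.0


# Illustrative only: 10,000 nodes, 5 checks each, polled every 5 minutes.
rate = probes_per_second(10_000, 5, 300)
delay = mean_detection_delay(300)
print(f"{rate:.1f} probes/s, mean detection delay {delay:.0f} s")
```

Shortening the interval to cut the detection delay raises the probe rate in direct proportion, which is the trade-off the paragraph above describes.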
This is where systems such as OpenNMS and JFFNMS come in. These
are realtime event-driven systems. Events are typically from SNMP traps,
but can come from other sources such as syslog. There is no polling
interval as such in these systems. If a node goes down, an SNMP trap is
generated by the switch immediately. You now have true realtime network
monitoring.
Of course, SNMP traps are typically not generated on application
failures. Most NMSes will resort back to polling to monitor
applications.
The Next Generation F/OSS NMS?
It would be nice to see better support for enterprise/carrier-grade
functionality in F/OSS NMSes, such as support for bulkstats, netflow,
and RCA.
However, there is something I have not seen either F/OSS or commercial
systems using: Host/Network sniffing. Having a local host-based sniffer
or a dedicated sniffer on a mirrored switch port could yield enormous
gains for NMSes:
- Network efficiency
  - No polling! No extra traffic is generated, as it relies on seeing
    exactly what's happening on the network.
- Spotting problems immediately
  - It sees TCP RSTs, switch ports losing carrier signal, etc.
- Real graphing
  - Not from graphing host to destination, but actual "real world"
    traffic.
- The ability to track full user QoS
  - Tied into the network authentication platform (RADIUS, et al), it
    can give real world user QoS reporting.
- Extra functionality
  - Massive potential for per-IP-block monitoring/reporting, etc.
  - It's fast, flexible, distributed, and scalable!
Developing an NMS-centric pcap-based sniffer seems like the way forward.
It could be easily integrated with current systems by being developed
separately, and just generating SNMP traps when required.
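To give a feel for what the core of such a sniffer might do, the sketch below inspects one captured Ethernet frame and reports whether it carries a TCP reset. It assumes untagged Ethernet II frames carrying IPv4 (no VLAN tags, no IPv6), and it leaves out the capture loop itself: in practice the frames would come from a pcap binding, and a hit would be turned into an SNMP trap by code not shown here. The function name is hypothetical.

```python
import struct
from typing import Optional, Tuple

ETHERTYPE_IPV4 = 0x0800
IPPROTO_TCP = 6
TCP_FLAG_RST = 0x04


def tcp_reset_endpoints(frame: bytes) -> Optional[Tuple[str, int, str, int]]:
    """Return (src_ip, src_port, dst_ip, dst_port) if the frame is a TCP RST, else None."""
    if len(frame) < 14 + 20 + 20:            # Ethernet + minimal IPv4 + minimal TCP
        return None
    if struct.unpack("!H", frame[12:14])[0] != ETHERTYPE_IPV4:
        return None
    ip = frame[14:]
    version = ip[0] >> 4
    ihl = (ip[0] & 0x0F) * 4                 # IPv4 header length in bytes
    if version != 4 or ihl < 20 or ip[9] != IPPROTO_TCP:
        return None
    if struct.unpack("!H", ip[6:8])[0] & 0x1FFF:
        return None                          # later fragment: no TCP header present
    if len(ip) < ihl + 20:
        return None
    tcp = ip[ihl:]
    if not tcp[13] & TCP_FLAG_RST:           # byte 13 of the TCP header holds the flags
        return None
    src_port, dst_port = struct.unpack("!HH", tcp[0:4])
    src_ip = ".".join(str(b) for b in ip[12:16])
    dst_ip = ".".join(str(b) for b in ip[16:20])
    return (src_ip, src_port, dst_ip, dst_port)
```

A burst of resets from one server port is the kind of signal the list above calls spotting problems immediately, with no probe traffic added to the network.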
Author's bio:
Mark Cooper can be reached at mark@mcooper.demon.co.uk.
Comments
GroundWork Monitor wasn't included here
by Amy Abascal - Oct 10th 2007 15:03:42
GroundWork Monitor Open Source is a great option:
http://www.groundworkopensource.com/
A Real NMS
by MadEyeMoody - Sep 6th 2006 09:26:44
Most Developers out there today are all jacked up about SOA. It's easy to
program, uses SSL / HTTPS for security, and it's becoming very prolific.
When you throw in J2EE and JMS, you now have all your Dev guys
drooling.
Some of this stuff just doesn't work in certain cases. For example, let's
say you have a process that collects performance data on a device in
clumps. Like Netflow data. Data sets are huge. And you want your data
keyed correctly so that it is usable and functional. So, you end up
encoding Netflow data into XML records. This becomes a huge behemoth
across the wire, as not only is each record delimited, it is also escaped.
For example, you use a field called ACMEVALUE. In XML speak, that's:
<ACMEVALUE>1234567890
</ACMEVALUE>
So now, you've added a lot more data to the dataset for the sake of
flexibility. And this really adds up across the wire!
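The size difference is easy to check for the single field in the example. The sketch assumes the value would otherwise travel as a 32-bit unsigned integer in network byte order, which is a choice made here for illustration and not something the comment specifies.

```python
import struct

value = 1234567890

# The field as an XML element (without the line break shown above).
xml_record = f"<ACMEVALUE>{value}</ACMEVALUE>".encode("ascii")

# The same value packed as a 32-bit unsigned integer, network byte order.
binary_record = struct.pack("!I", value)

print(len(xml_record), "bytes as XML,", len(binary_record), "bytes packed")
```

For this one field the tagged form is 33 bytes against 4, and that ratio is paid again for every field of every flow record.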
The second thing is that you take a long time to process. A SOAP
transaction cannot be completed until all of the data is encapsulated in
the SOAP envelope. This may take an inordinate amount of time and blocks
vital resources during the process.
In SOA, when you start using stuff like their publish and subscribe in
near real time, it ends up blocking during the IO phases, which slows down
everything and makes it non-scalable in large environments.
Additionally, everyone is wrapped up around a CMDB concept as introduced
by ITIL. Not to say a CMDB won't work.... In some cases, a CMDB is used...
It's only localized. Think about Windows registries and you get the gist.
Stuff happens too fast on some levels of the data to keep this data in a
centralized spot. You end up having to mix data elements and locations
dependent upon the volatility and usefulness of the data elements
themselves.
SNMP, when you look at it, is a schema for a highly distributed database
where the data access mechanism is accomplished via SNMP versus something
like SQL*Net or ODBC and SQL.
The thought of a Real NMS is evolving very quickly - almost haphazardly.
Yet the technology being used to do the next generation NMS systems lacks
stability and may not be very scalable. It has been said that Corporate
America is spending a huge amount of money to convert all their
applications to SOA and Java only to lose functionality, stability, and
scalability. And worse yet, they are offshoring the coding in many cases,
making it impossible to support in the future with offshoring support as
well!
OpenNMS should be added...
by MadEyeMoody - Sep 6th 2006 07:47:01
OpenNMS is doing very well these days. It should be part of the list.
Solar Winds
by Terrance - Dec 26th 2005 10:13:48
Would you add Solarwinds to this list?
RE: Building a Network Management System
by MadEyeMoody - Apr 4th 2005 12:12:21
I wrote a white paper in 1994 called "Network Management: What It Is
and What It Isn't" that is still somewhat pertinent even after all
these years.
You're correct. Rarely do all of the functions of the FCAPS model find
their way into valid implementations... However, I think that if you work
on the things from which you can get the most value, you can achieve some
level of success without doing the know all - end all - be all NMS
implementation.
RCA - (Another overused term if ever there was one!) For my own
purposes, I define 6 levels of correlation within Management
implementations. I do this so that I can explain a specific function to a
management person without having to deal with FUD presented by
Vendors.
Event Correlation
Device Correlation
Alarm Correlation
System Correlation
Business Correlation
Performance Correlation
RCA is difficult. In fact, even the commercial products that tout
it can be misleading in certain situations. For example, if you base your
Root Cause Analysis on a Topology and you do not programmatically reverify
the topology during an outage, your analysis may be based on facts that are
not true.
For certain polling situations such as polling a service via a spoofed
transaction - This can be VERY Dangerous! I have witnessed some app
developers "tighten up" their transactions for the spoofed
transactions. This can skew the results or even mask issues.
Passive monitoring holds great promise because you can gain the
perspective of the ACTUAL END USER... Not a spoofed user. So if you have a
Catalog / Shopping Cart application, you can see what customers are
ACTUALLY doing versus sending in a secret shopper.
Intermapper
by Mark - Mar 13th 2005 23:52:41
Intermapper is another
interesting commercial package with an emphasis on SNMP monitoring and
automated mapping. They used to provide free demo downloads directly from
their website, and I remember pleasant afternoons mapping out huge networks
with the package.
SmokePing
by X-Nc - Mar 13th 2005 19:29:51
I would add SmokePing to the list of tools.
It has come in very handy many times on the rather large network (the
biggest Intranet on the planet) where I previously worked. It also
comes from Tobi, who made MRTG and RRD.
-- If I actually _could_ spell I'd have spelled it right in the first place.
Re: SmokePing
by gollum - Mar 14th 2005 10:51:18
> I would add SmokePing to the list of
> tools.
Yup, smokeping comes under the general RRDTOOL banner, RRDTOOL being the
data storage method and smokeping et al being the collection/reporting
method. Loadsa RRDTOOL based tools at
http://people.ee.ethz.ch/~oetiker/webtools/rrdtool/rrdworld/index.html
Project Maintenance
by imipak - Mar 13th 2005 17:17:00
It is no good, writing the perfect program, if you then neglect it. Systems
evolve, so the monitoring software must do likewise. Big Brother is a
classic example of "bit rot", for example. There have been no updates to
Big Brother for many years. In that interval, Perl has been largely
resculpted, virtual systems have largely supplanted physical servers and
many organizations are either using, or have adopted, SANs and other
"special interest" network technologies.
In the tradition of Unix tools, there won't be a "one size fits all"
solution. Rather, there will be a large number of specialist tools that can
be integrated. That is inevitable, as that is the only solution that has
proved workable in the long-term.
However, NMS systems don't play well together. Typically, you would need
to use several solutions (Smokeping, MRTG, Big Brother/Big Sister, pchar,
Ganglia, etc) and hope that you've covered the bases. You'd be very lucky
if you did. More likely, you'll have an uneven mix of data that overlaps,
possibly conflicts and likely confuses more than helps. There is no easy
way, for example, to tell MRTG that if Big Brother detects a host as down,
it needn't bother querying it for SNMP data. There is no easy way to tell
Big Brother that, if pchar detects severe latency, it needs to extend its
timeouts accordingly.
As far as I know, none of these programs play nice with ECN, so if the
network detects overload, none of these programs can be instructed to
throttle back. MRTG uses SNMP but I can see no obvious way to take
advantage of SNMPv3 over the earlier variants. MRTG also supports IPv6, but
IPv6 supports mobility and MRTG uses static hostnames. None of them work
with multicasting, to the best of my knowledge. Nor do they support RSVP to
reserve bandwidth for communication.
Frankly, I'm not impressed with the state of NMS at the moment. Too much
overlap, too little useful information, poor integration, poor maintenance
and poor designs. Large infrastructures are hard to maintain, because there
is really nothing to maintain them with. That is not a good situation to be
in.
Re: Project Maintenance
by Simon Clift - Mar 14th 2005 09:05:05
> Big Brother is a classic
> example of "bit rot", for example.
I've deployed a program, BigSister, that uses the BigBrother protocol
(which is lightweight and simple to deploy across diverse systems, in my
case Unix, Windows and VMS). The Perl structure of BigSister is, at least
at first, only translucent, but I was able to make the extensions I
required. My only complaint was speed of update; it is a web-based system
so the browser refresh rate was a problem. The advantage of that is,
however, no client to deploy.
Big Sister on SF
In my experience, setting up a monitoring system is an easily
underestimated task.
Re: Project Maintenance
by gollum - Mar 14th 2005 11:02:37
> It is no good, writing the perfect
> program, if you then neglect it. Systems
> evolve, so the monitoring software must
> do likewise. Big Brother is a classic
> example of "bit rot", for example. There
> have been no updates to Big Brother for
> many years. In that interval, Perl has
> been largely resculpted, virtual
> systems have largely supplanted physical
> servers and many organizations are
> either using, or have adopted, SANs and
> other "special interest" network
> technologies.
I totally agree!
One of the major problems with NMSes is how to keep them 'in-sync' with the
network.
Without the NMS being aware, to a greater or lesser extent, of the
protocols ( layer 3/4 ) and the network ( layer 1/2 ) it's hard to do
decent Root Cause Analysis.
Re: Project Maintenance
by Katja - Sep 4th 2006 17:25:16
> In the tradition of Unix tools, there
> won't be a "one size fits all" solution.
> Rather, there will be a large number of
> specialist tools that can be integrated.
> That is inevitable, as that is the only
> solution that has proved workable in the
> long-term.
I agree with you.
The range of options offered by the suggested NMSes varies very widely.
It takes a lot of time to install, administer, and monitor different
programs.
Katja
Zabbix should have been mentioned!
by welshpjw - Mar 13th 2005 17:10:47
http://www.zabbix.com/ is VERY good and VERY configurable! I recently
found out about it and dropped looking at other options.
Re: Zabbix should have been mentioned!
by Michael Shigorin - Mar 18th 2005 09:59:21
+1
-- Michael Shigorin
mike SOMEWHERE AT altlinux PLUS DOT org
Diverse environments cannot support 1 does all
by Gustaf Gunnarsson - Mar 13th 2005 15:55:45
I think that one has to accept that the word NMS may mean different things
depending on where in the chain of EMs (element managers) you are.
For instance, the top-layer NMS handling alarms, which in a large network
handles well above 100k alarms on average per day, is complicated enough as
it is.
Once you get this big, the configuration management system or the
reporting system will be a system in itself, even if labeled as the same
product, simply because of the number of different kinds of equipment you
will have.
The key is interoperability between systems, and this is what should be
achieved. Trying to make a system which can do it all will give no benefit
but headaches once you need to extend it to support a new kind of element.
What I am saying is that, in my view, the OSS systems which exist today are
fine, and if anybody wants to take this further it should be done by making,
for instance, a reporting system which can easily be configured to get data
from different sources like RRD/SQL databases and then present the
information to the user in a structured and generic way.
The same would apply to the alarm handling system: the focus should be on
handling alarms in a generic way. Don't try to interface with all different
kinds of equipment and verify that they work; there are plenty of systems
already doing this excellently. Just try to receive alarms which somebody
sends, interpret the language and the message, and present this in a generic
fashion to the operator.
-- failure is not an option (f) 2008 bus[iy]ness as usual team
Big Brother
by pitchpoledave - Mar 13th 2005 07:39:58
Hi,
I don't think that you have fully explored the potential for this article.
If you did a full feature comparison, it might shed more light on the
topic.
For example, Big Brother CAN do "real time" monitoring via SNMP traps
and reporting via LARRD or Butter. It can also do SLA reporting, which I
don't think that any of the other OSS products can do.
Also, one HUGE advantage that Big Brother has is that it has agents for
Windows and for *nix systems. This way you can do NETWORK and SERVER
monitoring from the same console. Just checking ports up/down doesn't cut
it any more.
Monitor depth
by Daniel Feenberg - Mar 13th 2005 07:10:52
One thing I rarely see discussed about monitoring is how thoroughly each
server is tested. For instance, for SMTP, testing could stop at any of the
following points:
1) TCP connection
2) SMTP banner received
3) test message accepted
4) test message delivered
For DHCP, testing could stop at:
1) server answers a discover request (DHCP runs over UDP, so there is no
TCP connection to test)
2) get an address from a pool
3) check that address works (in some sense)
In our experience, lots of broken servers
will complete step 1.
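The first two SMTP stopping points can be told apart with very little code. The sketch below is an illustration only: the function name and the numeric depth values are made up here, and the deeper stages (message accepted, message delivered) are left out because they need a real test mailbox.

```python
import socket

DEPTH_NONE = 0      # TCP connection failed
DEPTH_TCP = 1       # connected, but no valid greeting
DEPTH_BANNER = 2    # a 220 greeting was received


def smtp_check_depth(host: str, port: int = 25, timeout: float = 5.0) -> int:
    """Report how far an SMTP server gets: TCP connect only, or a 220 banner as well."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as conn:
            try:
                banner = conn.recv(512)
            except OSError:
                return DEPTH_TCP            # connected, but the server said nothing in time
            # An SMTP greeting starts with reply code 220.
            return DEPTH_BANNER if banner.startswith(b"220") else DEPTH_TCP
    except OSError:
        return DEPTH_NONE
```

A monitor that only records the first stage would report a server as healthy when it accepts connections and then hangs, which is the failure mode the comment warns about.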
Re: Monitor depth
by Bill Carlson - Mar 15th 2005 08:00:13
> One thing I rarely see discussed about
> monitoring, is how thoroughly each server
> is tested. For instance, for SMTP,
Many of the systems discussed, at the very least Nagios, base their
monitoring around an open plugin system. One can certainly write a plugin
to comprehensively test a service, it's just a matter of doing it.
Some plugins just require a little thought to implement. For instance, on
a host with several virtual websites, it's not enough to get to the IP and
speak HTTP, you need to know that a particular site is being accessed. An
easy setup is to use check_http to check for a specific page on each site
and do a critical error on a 404 (page not found). Simple and works.
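The virtual-host check described above can be approximated in a few lines. This is not the Nagios check_http plugin itself, only a sketch of the same idea using the Python standard library; the function name is hypothetical, and treating every non-404 error status as a warning is a choice made here. The return values follow the usual Nagios plugin convention of 0 for OK, 1 for WARNING, and 2 for CRITICAL.

```python
import http.client

OK, WARNING, CRITICAL = 0, 1, 2   # Nagios-style plugin return codes


def check_vhost(address: str, port: int, site: str,
                path: str = "/", timeout: float = 5.0) -> int:
    """Request a known page from one named site on a shared IP address."""
    try:
        conn = http.client.HTTPConnection(address, port, timeout=timeout)
        try:
            # The Host header selects the virtual site on the shared address.
            conn.request("GET", path, headers={"Host": site})
            status = conn.getresponse().status
        finally:
            conn.close()
    except (OSError, http.client.HTTPException):
        return CRITICAL               # no usable HTTP answer at all
    if status == 404:
        return CRITICAL               # server is up, but this site or page is missing
    if 200 <= status < 400:
        return OK
    return WARNING
```

Checking a page that exists only on the site in question is what distinguishes this from a plain "does port 80 answer" test.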
Event Correlation
by Brian E. Seppanen - Mar 13th 2005 06:05:24
The topic is overly broad and the article gives few details.
It mentions Event Correlation, but doesn't mention the open source option,
sec (Simple Event Correlator) at http://kodu.neti.ee/~risto/sec/
Where I work,
Nagios, snmptrapd, sec, perl, mysql, sendmail, procmail, paging and all
those other open source tools have managed to create a wonderful
environment which we're going to scale via management tools.
Our parent company constantly talks of 6-18 months for implementing
anything, which when implemented still doesn't cover simple polling,
service verifications, and is almost all trap based event correlation.
Good for some things, but mail servers and proxy servers don't send out
traps when misconfigured and the customer is impacted.
I would just like to say that people shouldn't assume that a commercial
product is going to do any better than OSS in the right hands. That isn't
to say that a commercial product in the right hands wouldn't be able to do
it all, but you'd have to hope there is a budget for it.
-- Area 54: The Secret Government Disco Labs in Provo Utah
Re: Event Correlation
by gollum - Mar 14th 2005 11:18:47
> It mentions Event Correlation, but
> doesn't mention open source option - sec
> (Simple Event Correlator) at
> http://kodu.neti.ee/~risto/sec/
Event Correlation can be just event dampening with a bit more logic, which
is still a long way from the Root Cause Analysis that is the real
requirement.
RCA is *hard* to do :/
Re: Event Correlation
by Michael Shigorin - Dec 15th 2005 08:41:23
> Event Correlation can be just event
> dampening with a bit more logic which
> is still a long way from the Root Cause
> Analysis that is the real requirement.
Just for the record (as this editorial is bookmarked for NMS stuff): it
seems that inspecting AutoNOC might be useful while implementing RCA in
any of the available free NMS systems.
-- Michael Shigorin
mike SOMEWHERE AT altlinux PLUS DOT org
very interesting...
by Jean-Luc Fontaine - Mar 13th 2005 05:00:55
...and nicely summarized. I have to read it again more thoroughly.
Have you taken a look at moodss (http://moodss.sourceforge.net) which I
hope has a good enough GUI for a free piece?
Thanks for this nice article. Jean-Luc
Polling is very important
by thaig - Mar 13th 2005 04:51:53
Polling is important because it is positive proof that a service is working
for end-users and meeting its Service Level Agreements.
If you don't simulate the end user's access method then you aren't testing
the service's availability. For this reason polling will never go away.
Some quite large networks are polled for HTTP and ICMP by various
commercial solutions and it works well (1000's of polls every 10
minutes).
Re: Polling is very important
by gollum - Mar 14th 2005 11:12:25
> Polling is important because it is
> positive proof that a service is working
> for end-users and meeting its Service
> Level Agreements.
I disagree. As I stated in the article, a pcap based system would actually
*see* the end users' traffic and therefore provide a *real* SLA.
> If you don't simulate the end user's
> access method then you aren't testing
> the service's availability. For this
> reason polling will never go away.
Polling is not real world. Polling from one GigE connected server to
another GigE connected server is not simulating the end user's access
method.
> Some quite large networks are polled for
> HTTP and ICMP by various commercial
> solutions and it works well (1000's of
> polls every 10 minutes).
Hmmm, define 'works well' ? :) I would argue not. Ten minutes before I
know my main authentication platform is down? No, that's not 'works well'
for me ;)
Re: Polling is very important
by Bill Carlson - Mar 14th 2005 14:35:57
> I disagree. As I stated in the article,
> a pcap based system would actually *see*
> the end users' traffic and therefore
> provide a *real* SLA.
Without some kind of polling, you can't say whether a service is actually
working. Depending strictly on generated traps/events is not good enough;
that mechanism can fail and would then be silently dead. Polling is a good
cross check.
> Polling is not real world. Polling from
> one GigE connected server to another
> GigE connected server is not simulating
> the end user's access method
You're picking on certain cases. The argument still stands: polling is
useful.
> Hmmm, define 'works well' ? :) I would
> argue not. Ten minutes before I know my
> main authentication platform is down?
> No, that's not 'works well' for me ;)
Again, a specific case fails to address the argument.
Re: Polling is very important
by gollum - Mar 14th 2005 22:51:41
> Without some kind of polling, you can't
> say whether a service is actually
> working. Depending strictly on generated
> traps/events is not good enough, that
> mechanism can fail and would then be
> silently dead. Polling is a good cross
> check.
Nope. With a sniffer on the host or a mirrored port you would see any
problem immediately, rather than waiting for a polling period.
It would not be 'silently dead'. You would see TCP RSTs or ICMP errors;
the list goes on. Polling would miss IP problems that might only show
up as increased latency. Packet loss, retransmission, TCP window size
changes etc etc - all picked up by a pcap based monitor.
I'm not ruling out polling completely, just saying it should not be the
core way of finding problems on the network.
Re: Polling is very important
by Bill Carlson - Mar 15th 2005 07:52:56
>
> Nope. With a sniffer on the host or a
> mirrored port you would see any problem
> immediately, rather than waiting for a
> polling period.
>
AGAIN, if the host can't communicate back to the NMS, you won't know it's
dead. You need some kind of polling going, at the very least a check that
traps or whatever mechanism gets back to the NMS still works.
> I'm not ruling out polling completely,
> just saying it should not be the core
> way of finding problems on the
> network
>
No, your original point was that polling was worthless. You are now
agreeing with my point, you need to have a polling mechanism somewhere, if
for no other reason than to cross check that certain services are still
active (SNMP traps are being sent and received, event messages are being
sent, received and processed, etc). I'm not talking about the service
itself, you're correct that an active monitor (I don't like the term 'real
time') would be a plus for some services. But you need something to make
sure your reporting infrastructure is working and that means periodic "I'm
alive" messages, i.e. polling.
Re: Polling is very important
by gollum - Mar 15th 2005 11:16:07
> AGAIN, if the host can't communicate
> back to the NMS, you won't know it's
> dead. You need some kind of polling
> going, at the very least a check that
> traps or whatever mechanism gets back to
> the NMS still works.
If the host cannot communicate back to the NMS, you would know instantly (
rather than waiting for a polling period ) as the connection would fail. By
this I mean that the pcap monitoring would be constantly sending info back
to the NMS via either an always on TCP connection or a stream of UDP.
Why did the host suddenly lose connection? The NMS should already know the
answer as it should be able to do root cause and see a switch or some other
fault in the network path, etc etc.
If you are receiving SNMP traps or syslog from a router, and these
suddenly stop, there is obviously a problem.
The more you integrate the NMS into the network, the easier the RCA
becomes. If your entire infrastructure is set-up to do syslog and SNMP trap
back to the NMS, it should already know why the host cannot communicate
back because it's just tracked a network admin log in to a router and
delete a static route by mistake ( for example ).
> No, your original point was that polling
> was worthless. You are now agreeing with
> my point, you need to have a polling
> mechanism somewhere, if for no other
> reason than to cross check that certain
> services are still active (SNMP traps
> are being sent and received, event
> messages are being sent, received and
> processed, etc.). I'm not talking about
> the service itself; you're correct that
> an active monitor (I don't like the term
> 'real time') would be a plus for some
> services. But you need something to make
> sure your reporting infrastructure is
> working, and that means periodic "I'm
> alive" messages, i.e. polling.
I'm proposing you sniff the *actual* traffic. You see *ALL* the traffic.
Why poll a device when you can see traffic going to and from it? You can
see people connecting to port 80, GETting a URL, etc. This establishes
that both the host and the service are up and running, as well as the
routing between the host and the person connecting.
If the webserver dies, you would instantly see TCP RSTs: no waiting for a
poll, instant, 'real-time' :)
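To make the RST idea concrete, here is a minimal sketch of spotting a reset in a captured TCP header. It assumes the sniffer has already stripped the link and IP layers and hands over the raw TCP header bytes; the function name is made up. Note that a reset typically shows a closed port on a host that is still up, while a host that is completely dead sends nothing at all.

```python
TCP_FLAGS_OFFSET = 13  # flags byte position in the TCP header
RST = 0x04             # reset bit


def is_rst(tcp_header: bytes) -> bool:
    """Return True if the RST flag is set in a raw TCP header."""
    if len(tcp_header) < 20:
        raise ValueError("TCP header must be at least 20 bytes")
    return bool(tcp_header[TCP_FLAGS_OFFSET] & RST)
```

A monitor would count resets coming from the server's address and port over a short window and raise an event when they appear where normal replies used to be.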
Re: Polling is very important
by Jason Martin - Apr 25th 2005 14:42:51
Most NMSes don't persist a connection to the central server, so you can't
check for a 'failed connection'. Additionally, most event managers don't
have a comprehensive list of all the hosts sending them events, so they
don't know to look for non-events.
The traffic behavior of your application helps decide whether polling or
sniffing is appropriate. Nobody wants to get up at 2:00am to respond to an
alarm that there is no traffic to a host, only to find that the reason is
that it is Christmas Eve and nobody happens to want to visit the site.
Ideally, the lack of traffic would kick off a poll to perform an
independent check.
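A rough sketch of that "quiet, so go and poll" logic follows. The function names, the `quiet_after` threshold and the status strings are illustrative assumptions; the active check is a plain TCP connect, which is only one of many possible probes.

```python
import socket
from typing import Callable


def tcp_probe(host: str, port: int, timeout: float = 3.0) -> bool:
    """Active check: can a TCP connection be opened to host:port?"""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_service(last_traffic: float, now: float, quiet_after: float,
                  probe: Callable[[], bool]) -> str:
    """Trust recently seen traffic; poll only when the service has gone quiet."""
    if now - last_traffic <= quiet_after:
        return "up (passive)"  # traffic seen recently, no poll needed
    # No traffic for a while: verify independently before alarming anyone.
    return "up (polled)" if probe() else "down"
```

For a web server this might be called as `check_service(last, time.time(), 300, lambda: tcp_probe("www.example.com", 80))`, so the 2:00am alarm only fires when the poll fails too.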
Also, there is a difference between 'Event Management' and 'Network
Monitoring'. EM is largely passive (someone sends in alerts), while NM is
more active (check and verify that it is working).
Re: Polling is very important
by Todd - Aug 2nd 2005 12:50:32
> I'm proposing you sniff the *actual*
> traffic. You see *ALL* the traffic. Why
> poll a device when you can see traffic
> going to and from it? You can see people
> connecting to port 80, GETting a URL,
> etc. This establishes that both the host
> and the service are up and running, as
> well as the routing between the host and
> the person connecting.
>
> If the webserver dies, you would
> instantly see TCP RSTs: no waiting for
> a poll, instant, 'real-time' :)
And what happens if nobody is accessing the web server or SMTP server in
question for 5 or 10 minutes?
Re: Polling is very important
by MadEyeMoody - Sep 6th 2006 07:43:00
> Polling is important because it is
> positive proof that a service is working
> for end-users and meeting its Service
> Level Agreements.
> If you don't simulate the end user's
> access method then you aren't testing
> the service's availability. For this
> reason polling will never go away.
> Some quite large networks are polled for
> HTTP and ICMP by various commercial
> solutions and it works well (1000s of
> polls every 10 minutes).
>
Polling of services can be significantly enhanced through passive
monitoring techniques. If you're watching end users' sessions and these
are working correctly, why poll? However, when you no longer see the
traffic or sessions, or you see session issues, you should poll to
verify.
Intelligent polling in SNMP is vital. First of all, an ICMP ping may not
be a reliable mechanism in your environment. I have seen environments where
pings are rate-limited or blocked. (I even saw one environment where they
attempted to block all of ICMP. Doh!)
I use a technique in Open Service NerveCenter that I call implicit status
determination. In this technique, I use the validity of other SNMP polls
to imply a status for a higher-order object. For example, I treat good
status polls of interfaces via ifEntry as a valid good status for my node
status. When this happens, I use the finite state machine function of
NerveCenter to hold off actual polling until a given interval has passed,
thereby creating a sliding window for status and status polling.
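The sliding window can be sketched in a few lines. This is not NerveCenter's state machine or API, just an approximation of the idea; the class name, the `poll_node` callback and the counter are all hypothetical.

```python
from typing import Callable, Optional


class ImplicitNodeStatus:
    """Node status implied by successful polls of its child objects."""

    def __init__(self, interval: float, poll_node: Callable[[], bool]):
        self.interval = interval      # status interval in seconds
        self.poll_node = poll_node    # explicit node poll (assumed callback)
        self.confirmed_at: Optional[float] = None  # last evidence the node is up
        self.explicit_polls = 0       # how many real node polls were sent

    def interface_poll_ok(self, now: float) -> None:
        # A good answer for any interface implies the node itself is up,
        # so the status window slides forward without a node poll.
        self.confirmed_at = now

    def status(self, now: float) -> str:
        if self.confirmed_at is not None and now - self.confirmed_at < self.interval:
            return "up"  # still inside the window, hold off polling
        # Window expired (or nothing known yet): fall back to an explicit poll.
        self.explicit_polls += 1
        if self.poll_node():
            self.confirmed_at = now
            return "up"
        return "down"
```

As long as interface polls keep succeeding within the interval, `status()` never has to send a node poll of its own, which is where the bandwidth saving comes from.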
When I first deployed this technique, I was managing a series of Centillion
switches employing LANE. I was able to maintain a 20 second status interval
while actual polling was reduced to 1 in 8 on average.
Using this same technique on a 2 minute interval, I was able to benchmark
against HP OpenView polling on a 5 minute status interval, using less than
25% of the bandwidth that OpenView NNM used.