[collectd] Collectd and load average

Discussion:

Valentino Volonghi

2009-04-20 18:30:41 UTC

Hello, I'm running collectd 4.6.0 on a set of machines
(one process per machine monitoring only the
machine it's running on).

Usually the process doesn't increase load and all the servers
remain at reasonable load values. For some reason, though,
every once in a while one of them starts ramping up, with the
1-minute load average climbing above 2.
I've also seen collectd using up to 90% CPU in those conditions,
which seems completely wrong to me.

Collectd is running on EC2 with Ubuntu 8.04
and the 2.6.21.7-2.fc8xen kernel. The installed versions of the needed
plugins are all the ones that ship with that Ubuntu by default. Since
I've read in the past that librrd might be responsible for this,
note that it is version 1.2.19-1ubuntu1. Here's the config:

Hostname "localhost"
Interval 10

LoadPlugin logfile
LoadPlugin syslog

LoadPlugin cpu
LoadPlugin cpufreq
LoadPlugin df
LoadPlugin disk
LoadPlugin interface
LoadPlugin load
LoadPlugin memory
LoadPlugin network
LoadPlugin processes
LoadPlugin rrdtool
LoadPlugin swap
LoadPlugin tcpconns
LoadPlugin vmem

<Plugin df>
Device "/dev/sda1"
MountPoint "/"
FSType "ext3"
IgnoreSelected false
</Plugin>

<Plugin interface>
Interface "eth0"
IgnoreSelected false
</Plugin>

<Plugin rrdtool>
DataDir "/opt/collectd/var/lib/collectd/rrd"
CacheTimeout 120
CacheFlush 900
</Plugin>

<Plugin tcpconns>
ListeningPorts false
LocalPort "80"
RemotePort "*"
</Plugin>

--
Valentino Volonghi aka Dialtone
Now running MacOS X 10.5
Home Page: http://www.twisted.it
http://www.adroll.com

Florian Forster

2009-04-22 21:24:42 UTC

Hi Valentino,

Post by Valentino Volonghi
Usually the process doesn't increase load and all the servers remain
at reasonable load values. For some reason, though, every once in a
while one of them starts ramping up, with the 1-minute load average
climbing above 2.
I've also seen collectd using up to 90% CPU in those conditions, which
seems completely wrong to me.

hm, sounds nasty.. What state is the process in when this condition
occurs? Running (`R' in `top')? Waiting for IO (`D' in `top')? And which
process is using the other 10% of CPU? Is that a multi processor
machine?
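
One way to answer the state question on Linux, and to check whether a sleeping main thread hides busy worker threads, is to read `/proc` directly. The sketch below is only an illustration: the helper names `parse_proc_stat` and `thread_states` are made up here, and it assumes the standard Linux `/proc/<pid>/stat` layout (state in field 3, user and system CPU ticks in fields 14 and 15). collectd runs several threads, so the state shown for the process as a whole may differ from the state of individual threads.

```python
import os


def parse_proc_stat(line):
    """Extract name, state and CPU tick counters from one /proc stat line."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or ')', so locate it via the first '(' and the last ')'.
    start = line.index("(")
    end = line.rindex(")")
    fields = line[end + 1:].split()
    return {
        "comm": line[start + 1:end],
        "state": fields[0],        # field 3: R, S, D, ...
        "utime": int(fields[11]),  # field 14: ticks spent in user mode
        "stime": int(fields[12]),  # field 15: ticks spent in kernel mode
    }


def thread_states(pid):
    """Return parsed stat entries for every thread of `pid` (Linux only)."""
    task_dir = f"/proc/{pid}/task"
    entries = []
    for tid in sorted(os.listdir(task_dir), key=int):
        with open(f"{task_dir}/{tid}/stat") as handle:
            entries.append(parse_proc_stat(handle.read()))
    return entries
```

Calling `thread_states(pid)` with collectd's PID twice, a few seconds apart, and comparing the `utime` and `stime` counters would show which threads, if any, are accumulating CPU time while the main thread reports `S`.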

Maybe you could activate the `processes' plugin with this config:
<Plugin "processes">
Process "collectd"
</Plugin>

You should then get a graph with the user- and system-CPU-time used by
the collectd process. Seeing this graph lined up with the load graph
would be very interesting, I think.

<Plugin tcpconns>
ListeningPorts false
LocalPort "80"
RemotePort "*"
</Plugin>

Probably not related, but `RemotePort "*"' should not work. If it did,
it would mean something like ``create one value for each outgoing
connection'', which is very likely not what you want.

Regards,
-octo
--
Florian octo Forster
Hacker in training
GnuPG: 0x91523C3D
http://verplant.org/

Valentino Volonghi

2009-04-22 23:16:13 UTC

Post by Florian Forster
hm, sounds nasty.. What state is the process in when this condition
occurs? Running (`R' in `top')? Waiting for IO (`D' in `top')? And which
process is using the other 10% of CPU? Is that a multi processor
machine?

From what I can see it's always in S.

The machine is multi-processor, the other 10% isn't necessarily used,
the other process running there uses around 0-5% CPU right now.

Post by Florian Forster
<Plugin "processes">
Process "collectd"
</Plugin>
You should then get a graph with the user- and system-CPU-time used by
the collectd process. Seeing this graph lined up with the load graph
would be very interesting, I think.

Ok, I'm gonna add this.

Post by Florian Forster

Post by Valentino Volonghi
<Plugin tcpconns>
ListeningPorts false
LocalPort "80"
RemotePort "*"
</Plugin>

Probably not related, but `RemotePort "*"' should not work. If it did,
it would mean something like ``create one value for each outgoing
connection'', which is very likely not what you want.

Ok, I'm gonna remove RemotePort then.

In the meantime I changed the values of CacheTimeout and CacheFlush
to be closer to each other, like this:

<Plugin rrdtool>
DataDir "/opt/collectd/var/lib/collectd/rrd"
CacheTimeout 450
CacheFlush 900
</Plugin>

This way I haven't seen the load issue for almost a day, and the load
generally never grows above 1.

Florian Forster

2009-04-23 07:10:15 UTC

Hi Valentino,

Post by Valentino Volonghi

Post by Florian Forster
hm, sounds nasty.. What state is the process in when this condition
occurs? Running (`R' in `top')? Waiting for IO (`D' in `top')? And
which process is using the other 10% of CPU? Is that a multi
processor machine?

From what I can see it's always in S.

`S' is sleeping. The process shouldn't use any CPU time in that state..?
That's weird..

Post by Valentino Volonghi
In the meantime I changed the value of CacheTimeout and CacheFlush to
be closer to each other like this:
<Plugin rrdtool>
DataDir "/opt/collectd/var/lib/collectd/rrd"
CacheTimeout 450
CacheFlush 900
</Plugin>

I don't think `CacheFlush' is involved in this. Setting it to 900
seconds is very likely an unnecessarily low value.

If your load is related to writing RRD files (how many do you have of
those?), you should try setting the `WritesPerSecond' option. Setting it
to `10' will mean that writing 300 RRD files will take 30 seconds, but
you shouldn't be able to see *any* problem in the load then ;)
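
The arithmetic behind that estimate can be written as a small sketch. The function names are illustrative, and the model is deliberately simplified: it assumes every RRD file becomes due for one write per `CacheTimeout` period, which is an approximation and not a description of collectd's internals.

```python
def drain_seconds(rrd_files, writes_per_second):
    """Seconds needed to write each file once at the given rate limit."""
    if writes_per_second <= 0:
        raise ValueError("writes_per_second must be positive")
    return rrd_files / writes_per_second


def keeps_up(rrd_files, writes_per_second, cache_timeout):
    """True if one full pass over the files fits inside one cache period.

    Simplifying assumption: each file needs one write per cache_timeout.
    """
    return drain_seconds(rrd_files, writes_per_second) <= cache_timeout


print(drain_seconds(300, 10))   # 30.0 (300 files at 10 writes/s)
print(keeps_up(300, 10, 120))   # True (30 s fits into CacheTimeout 120)
```

Under this model a lower `WritesPerSecond` spreads the disk writes over a longer window, at the cost of data reaching the RRD files later.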

Regards,
-octo

Valentino Volonghi

2009-04-23 07:31:50 UTC

Post by Florian Forster

I don't think `CacheFlush' is involved in this. Setting it to 900
seconds is very likely an unnecessarily low value.

It was in the defaults of 4.6.0; I'll raise it.

Post by Florian Forster
If your load is related to writing RRD files (how many do you have of
those?), you should try setting the `WritesPerSecond' option. Setting it
to `10' will mean that writing 300 RRD files will take 30 seconds, but
you shouldn't be able to see *any* problem in the load then ;)

Each machine monitors only itself. There are 13 plugins installed (those
in the first mail) and I suppose those are the RRD files that it's
writing.
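
The number of RRD files is often larger than the number of loaded plugins, since a plugin may write one file per type and instance (for example per CPU or per interface). A quick way to count them is sketched below, under the assumption that files live under `<DataDir>/<host>/<plugin>[-<instance>]/<type>[-<instance>].rrd`; `count_rrd_files` is an illustrative name, not part of collectd.

```python
from collections import Counter
from pathlib import Path


def count_rrd_files(data_dir):
    """Count *.rrd files below data_dir, grouped by plugin name."""
    counts = Counter()
    for path in Path(data_dir).rglob("*.rrd"):
        # Assumed layout: the parent directory is "<plugin>" or
        # "<plugin>-<instance>", so keep only the part before the first '-'.
        plugin = path.parent.name.split("-", 1)[0]
        counts[plugin] += 1
    return counts
```

Running `count_rrd_files("/opt/collectd/var/lib/collectd/rrd")` and summing the values gives the total to plug into the `WritesPerSecond` estimate.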
