Monday, April 14, 2008

Spring, time for (round) robin (databases)

On Friday I tried to tantalize you with the ability to take some brutally dry numbers and automate graphing them so we could report on the utilization of all of our boxes. That post showed a rolling 24-hour graph of cpu and memory usage. Admittedly, that's usually what we're most interested in seeing. However, we also create rolling weekly, monthly, and yearly reports on the same site. So what? Well, that detail matters as an introduction to how we generate the graphs.

Storing one day's worth of cpu and memory numbers collected every 5 minutes takes roughly 20K. If we needed to keep a year's worth of files around to do the trending graphs described above, we'd have roughly 7M of data per host (and 730 files -- one each day for both cpu and memory). Instead, using rrdtool (http://oss.oetiker.ch/rrdtool), we store two 47K files for each host (one for cpu and one for memory).

Rrdtool lets us set up aggregation functions in a round robin database. Basically, we set up the database ahead of time, telling it how many data points to keep and what kind of aggregation to apply to that data. For our graphs, we feed data in every 5 minutes. For the rolling 24-hour graph, we display the data as it comes in and store 600 data points. For our "weekly" graph, rrdtool averages 6 of those readings (30 minutes) and stores 700 of those aggregates. The monthly one averages 24 readings (2 hours) and stores 775 averages. The yearly one averages 288 entries (24 hours) and stores 797 averages.
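
To make those retention numbers concrete, here is a minimal shell sketch of the arithmetic: the history an archive covers is the sample interval times the readings per stored row times the number of rows. The function name rra_hours and the STEP variable are made up for this illustration; the step and row counts are the ones quoted above.

```shell
#!/bin/sh
# Sketch: how much history each archive described in the post can hold.
STEP=300   # seconds between samples (5 minutes, per the post)

# rra_hours READINGS_PER_ROW ROWS -> hours of history covered by that archive
rra_hours() {
    echo $(( STEP * $1 * $2 / 3600 ))
}

for rra in "1 600" "6 700" "24 775" "288 797"; do
    set -- $rra   # split "readings rows" into $1 and $2
    hours=$(rra_hours "$1" "$2")
    echo "$1 reading(s) per row x $2 rows: $hours hours (about $(( hours / 24 )) days)"
done
```

The four archives come out at 50, 350, 1550, and 19128 hours, so each one holds roughly twice the span its graph displays.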

Everyone I haven't bored to tears is now asking why we store significantly more data than we show in a graph. The answer is easy. I anticipated the request to compare today to yesterday, this week to last, etc., and I didn't want to have to go to backups to make that comparison.

Enough background. Let's look at some code. Here's an excerpt from my collector.sh script, which iterates over the hosts I report on:


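for i in host1 host2 host3
do
    # sar datafiles modified more recently than this host's rrd
    for j in `/usr/bin/find /var/adm/sa -newer "${i}_sarcpu.rrd" -name 'sa*' -type f -exec /usr/bin/ls -ct {} \;`
    do
        # keep timestamped rows, drop the header row, comma-delimit, feed the collector
        /usr/bin/ssh "$i" sar -f "$j" | grep ':' | grep -v usr | perl -pe 's/ +/,/g' | /usr/local/bin/rrd_cpu_collector.pl -host "$i" -
    done
done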


Translation -- for each host:
  • get the file names of all sar datafiles that have been modified more recently than my rrd database
  • ssh to the host and run sar against the file(s)
  • strip out all of the lines that don't have data (i.e. the headers) and all the extraneous whitespace, making it comma delimited
  • pass the data into rrd_cpu_collector.pl
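
The filtering step in the middle of that pipeline can be tried without ssh or a real sar datafile. The sketch below wraps the same grep filters in a function (sar_to_csv is a name invented here) and swaps the perl one-liner for an equivalent sed expression. The sample input is made up and only approximates the general shape of Solaris sar output, so treat it as an assumption.

```shell
#!/bin/sh
# sar_to_csv: keep only timestamped data rows and make them comma-delimited.
sar_to_csv() {
    grep ':' |          # data rows carry an HH:MM:SS timestamp
    grep -v 'usr' |     # drop the column-header row (it has a timestamp too)
    sed 's/  */,/g'     # turn each run of spaces into a single comma
}

# Made-up sample input, not real measurements.
sar_to_csv <<'EOF'
SunOS host1 5.10 Generic sun4u    04/14/2008

00:00:00    %usr    %sys    %wio   %idle
00:05:00       3       1       0      96
00:10:00       4       2       0      94

Average        4       2       0      95
EOF
```

With this input, only the two timestamped data rows survive, as 00:05:00,3,1,0,96 and 00:10:00,4,2,0,94. The banner, the blank lines, and the Average row have no colon, and the header row is caught by the usr filter.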


So, what's in rrd_cpu_collector.pl? Two main things. First, there's logic to create the rrd if it doesn't exist:

if (-f $rrd_file) {
    debug_msg("$rrd_file exists");
    # timestamp of the newest sample already stored (strip the trailing newline)
    chomp($rrd_last_updated = `rrdtool last $rrd_file`);
} else {
    # no rrd yet: create one that starts at midnight of the day being loaded
    $rrd_time = timelocal(0, 0, 0, $mday, $mon, $year);
    # two 0-100 gauges (user, sys) and four AVERAGE archives, built once
    # so the debug message and the command that runs cannot drift apart
    $create_cmd = "rrdtool create $rrd_file --start $rrd_time "
        . "DS:user:GAUGE:1800:0:100 DS:sys:GAUGE:1800:0:100 "
        . "RRA:AVERAGE:0.5:1:600 RRA:AVERAGE:0.5:6:700 "
        . "RRA:AVERAGE:0.5:24:775 RRA:AVERAGE:0.5:288:797";
    debug_msg($create_cmd);
    `$create_cmd`;
    $rrd_last_updated = $rrd_time;
}

Second, there's logic to update the rrd with the data passed in from sar:

foreach $entry (<>) {
    chomp($entry);
    debug_msg("the entry is $entry");
    ###################################################
    ## update the hours and minutes based on what we ##
    ## get from sar. use this to generate the time   ##
    ## for the rrd command                           ##
    ###################################################
    ($sar_time, $user_cpu, $sys_cpu, $wio, $idle) = split(/,/, $entry);
    ($hours, $min, $sec) = split(/:/, $sar_time);
    debug_msg("sar hours min sec are $hours $min $sec");
    $rrd_time = timelocal(0, $min, $hours, $mday, $mon, $year);
    debug_msg("rrd_time is $rrd_time");
    ####################################################
    ## this check is here to allow us to iterate over ##
    ## the full output of sar several times a day w/o ##
    ## reprocessing an entry.                         ##
    ####################################################
    if ($rrd_time > $rrd_last_updated) {
        debug_msg("rrdtool update $rrd_file $rrd_time:$user_cpu:$sys_cpu");
        `rrdtool update $rrd_file $rrd_time:$user_cpu:$sys_cpu`;
    } else {
        debug_msg("rrd file updated more recently than this entry. skipping");
    }
}
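
The skip-what-we-already-have check is the part that makes re-running the collector safe, so here it is in isolation as a rough shell sketch rather than the Perl above. The function name new_updates and its two arguments are assumptions for this example. It adds hours and minutes to a midnight timestamp instead of calling timelocal, so it ignores seconds (as the Perl does) and would be off on a daylight-saving changeover day.

```shell
#!/bin/sh
# new_updates MIDNIGHT LAST
#   MIDNIGHT: epoch seconds at 00:00:00 of the day the sar data covers
#   LAST:     epoch seconds of the newest sample already in the rrd
# Reads "HH:MM:SS,usr,sys,wio,idle" rows on stdin and prints
# "timestamp:usr:sys" only for rows newer than LAST.
new_updates() {
    midnight=$1
    last=$2
    while IFS=, read -r sar_time usr sys wio idle; do
        hours=${sar_time%%:*}
        rest=${sar_time#*:}
        minutes=${rest%%:*}
        # strip one leading zero so "08" is not parsed as invalid octal
        hours=${hours#0}
        minutes=${minutes#0}
        stamp=$(( $midnight + $hours * 3600 + $minutes * 60 ))
        if [ "$stamp" -gt "$last" ]; then
            echo "$stamp:$usr:$sys"
        fi
    done
}

# Example values only: midnight is 1208131200 and the rrd already holds 00:05.
printf '00:05:00,3,1,0,96\n00:10:00,4,2,0,94\n' | new_updates 1208131200 1208131500
# prints 1208131800:4:2
```

Each printed line has the timestamp:user:sys shape that the rrdtool update call above expects, so feeding the same sar output through twice produces no second update for a row that is already stored.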


Up next, generating the cpu graphs now that the data is in place.

2 comments:

Anonymous said...

What? You don't use Actuate no more for all your graphing needs?

John McDevitt said...

I almost replied, "If I don't support it, I don't use it." But I know how quickly I could be back on Actuate...
