
Realtime Dashboard


Once, sitting at work late in the evening, I decided to make a simple, smart dashboard that would plot errors and other warnings from Apache or Nginx logs. The term "realtime" is a bit flattering: in reality the chart updates every 3 seconds. Such a dashboard is especially useful when a new version is being rolled out to production: you sit and watch it quietly crawl across the servers, changing the direction of the curves on the graph.

First, I just want to show what came out of it.


Server part

On every server that has a log, run:
 tail -f /var/log/log | perl frenko_sender.pl 


frenko_sender.pl
 #!/usr/local/bin/perl -w
 use strict;
 use IO::Socket;

 my ($sock, $server_host, $msg, $PORTNO);

 $PORTNO      = 15555;
 $server_host = '188.xxx';
 $sock = IO::Socket::INET->new(
     Proto    => 'udp',
     PeerPort => $PORTNO,
     PeerAddr => $server_host
 ) or die "Creating socket: $!\n";

 while (<>) { $sock->send($_) }

As you can see, frenko_sender.pl simply sends each line of the log over UDP to the server where the data will be aggregated. By the way, you can do without this script entirely if you teach syslogd to ship logs over UDP to the right place, and, as Dreadatour noted, nobody has canceled nc (netcat) either:

 tail -f /var/log/log | nc -u 188.xxx 15555 

The Perl option was chosen because the plan is to bolt round-robin onto it later, to distribute the load cyclically among the servers that will process the logs.
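The planned round-robin distribution can be sketched like this (a Python sketch, not the author's code; the host addresses are placeholders): cycle through the aggregator addresses and send each log line to the next one in turn.

```python
import itertools
import socket

# Hypothetical aggregator hosts; addresses are placeholders.
SERVERS = [("10.0.0.1", 15555), ("10.0.0.2", 15555), ("10.0.0.3", 15555)]

def round_robin_sender(lines, servers=SERVERS, sock=None):
    """Send each log line over UDP, cycling through the servers in turn."""
    sock = sock or socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    targets = itertools.cycle(servers)
    sent = []
    for line in lines:
        addr = next(targets)
        sock.sendto(line.encode(), addr)  # UDP is fire-and-forget
        sent.append(addr)
    return sent  # returned only so the distribution can be inspected
```

Since UDP is connectionless, losing one aggregator only drops the lines routed to it; the rest keep flowing.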

Now about the server (frenko_listen.pl) that aggregates the statistics for us. Thanks to Dmitry Karasik for IO::Lambda, since that is what I built it on.

frenko_listen.pl line (1-30)
 #!/usr/local/bin/perl -w
 use strict;
 use warnings;
 use Cache::Memcached::Fast;
 use IO::Socket;
 use IO::Lambda qw(:all);
 use IO::Lambda::Socket qw(:all);
 use Getopt::Std;
 use POSIX qw/setsid/;

 my $PORT        = 15555;
 my $LOCALHOST   = '188.xxx';
 my $CONFIG      = './frenko_config';
 my $time_to_die = 0;

 my $memd = new Cache::Memcached::Fast({
     servers         => [ { address => 'localhost:11211', weight => 2.5 } ],
     connect_timeout => 0.2,
     io_timeout      => 0.5,
     close_on_error  => 1,
     max_failures    => 3,
     failure_timeout => 2,
     ketama_points   => 150,
     nowait          => 1,
     utf8            => ($^V >= 5.008001 ? 1 : 0),
 });

$LOCALHOST is the external IP of our server, which listens on UDP port 15555. As you guessed, the counters will be stored in memcached =).

frenko_listen.pl line (32-46)
 my %Regexps = ();
 load_config(\%Regexps);

 sub load_config {
     my $re = shift;
     %{$re} = ();
     open(F, $CONFIG) || die 'cant open config ' . $CONFIG;
     while (<F>) {
         next if /^[\s\n]+$/;
         my ($regexp, $key, undef, $rate) = split("\t", $_, 4);
         $$re{$key}{re}   = $regexp;
         $$re{$key}{rate} = $rate || 1;
     }
     close(F);
 }

Here we load the config with the regular expressions that define the rules for collecting statistics. The config (tab-separated) looks like this:

frenko_config
 errdescr=        9    [E2;Javascript error at web interface - nginx acclog]    1
 errorcode=\d+    10   [E3;Fail upload with errorcode - nginx acclog]           1

The first column is the regular expression; next comes a unique key, under which we will store and plot the data; the third column is a short and full description of the error; and the last column is the probability with which we save the data. For example, if we plot user hits on the same graph as errors, there will be a huge gap between the curves: say, 3 errors versus 3000 hits over the same interval. A logarithmic scale would handle that, but I wanted it to look nice. So for hits we would put, say, 0.1, meaning we increment the hit counter only 10% of the time.
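The sampling rule can be sketched in a few lines (Python here, purely illustrative; the function name is mine):

```python
import random

def should_count(rate, rnd=random.random):
    """Decide whether to count this event, given the sampling rate.

    rate == 1 means count every event; rate == 0.1 means count
    roughly 10% of events. `rnd` is injectable for testing.
    """
    return rate == 1 or rnd() < rate
```

Sampling trades exact counts for comparable curve heights; at 10% sampling, multiplying the stored counter by 10 recovers an estimate of the true hit count.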

frenko_listen.pl line (49-80)
 sub sendstat {
     my ($key, $rate, $memd) = @_;
     # sample: rate 1 means always count, rate 0.1 means count ~10% of events
     if ($rate == 1 || rand() < $rate) {
         saver($key, $_, $memd) for (2, 300, 1200);
     }
 }

 sub saver {
     my ($key, $accur, $memd) = @_;
     my $memc = startpoint($accur);
     $key = $key . '_' . $accur . '_' . $memc;
     my $ttl = 1000;
     if    ($accur == 300)  { $ttl = 7400 }
     elsif ($accur == 1200) { $ttl = 87000 }
     $memd->set($key, 1, $ttl) unless $memd->incr($key);
 }

 sub startpoint {
     my $accur = shift;
     my $t     = time();
     my $x     = $t % $accur;
     return $t - $x;
 }

Here we actually increment the counters for our data. We store the data in 2, 300 and 1200 second cells, for graphs updated every 3 seconds, every 5 minutes and every 20 minutes respectively. The memcached keys, for example for a 2-second cell (for the Javascript error from our frenko_config), look like: 9_2_1338129044, 9_2_1338129046; for a 300-second cell: 9_300_1338121800, 9_300_1338122100, and so on.
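The bucketing scheme behind these keys is just timestamp truncation; a Python sketch of the same arithmetic (the function name is mine):

```python
def bucket_key(key, accur, t):
    """memcached key for the bucket containing timestamp t.

    `accur` is the bucket width in seconds (2, 300 or 1200 in the article);
    the bucket start is the timestamp rounded down to a multiple of accur.
    """
    start = t - t % accur
    return f"{key}_{accur}_{start}"
```

Every event inside a bucket window maps to the same key, so a plain incr per event yields the per-bucket count with no read-modify-write race.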

frenko_listen.pl line (82-128)
 sub udp_server(&@) {
     my ($callback, $port, $host) = @_;
     my $server = IO::Socket::INET->new(
         Proto     => 'udp',
         LocalPort => $port,
         LocalHost => $host,
         Blocking  => 0,
     );
     return $! unless $server;
     return lambda {
         context $server, 65536, 0;
         recv {
             my ($addr, $msg) = @_;
             again;
             context $callback->(), $addr, $msg;
             &tail();
         }
     }
 }

 sub handle_incoming_connection_udp {
     lambda {
         my ($addr, $msg) = @_;
         if (length($msg) > 5) {
             if (!index($msg, 'reconfig')) {    # line starts with 'reconfig'
                 load_config(\%Regexps);
             } else {
                 foreach (keys %Regexps) {
                     if ($msg =~ /$Regexps{$_}{re}/) {
                         sendstat($_, $Regexps{$_}{rate}, $memd);
                         last;
                     }
                 }
             }
         }
     }
 }

 sub runserv {
     my $server = udp_server { handle_incoming_connection_udp } $PORT, $LOCALHOST;
     die $server unless ref $server;
     $server->wait;
 }

Log processing happens inside sub handle_incoming_connection_udp. If a line is longer than 5 characters, we start parsing it, running through the list of regexes loaded from the config.

Imagine a situation: you are listening to logs on three hundred servers and need to add a new regex to the config. If you stop frenko_listen.pl, you lose data. To avoid that, the server has to be taught to reload the config on the fly, which is what was done above: if we receive a line that starts with 'reconfig', we reload the config. And of course we don't forget a wrapper that restarts frenko_sender.pl if it dies.

Here, by the way, is where you need to think about optimizing the matching loop =).

For example, run the rules in a certain order, sorted by frequency. That is, if hits are far more common than the other metrics, the run should start with the regex for hits. Also, instead of some regexes you can use the index() function.
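The substring-prefilter idea can be sketched like this (a Python sketch with made-up rule data; the real rules live in frenko_config): a cheap index()-style substring check gates the expensive regex, and if the rule is a pure literal, the substring check alone decides.

```python
import re

# Hypothetical rule table: `index` is a literal substring used as a cheap
# prefilter; the full regex runs only when the substring is present.
RULES = {
    "9":  {"index": "errdescr=",  "re": re.compile(r"errdescr=")},
    "10": {"index": "errorcode=", "re": re.compile(r"errorcode=\d+")},
}

def match_rule(line, rules=RULES):
    """Return the key of the first rule matching the line, else None."""
    for key, rule in rules.items():
        if rule["index"] in line:                    # fast substring check
            if rule["index"] == rule["re"].pattern:  # literal rule: done
                return key
            if rule["re"].search(line):              # otherwise run the regex
                return key
    return None
```

The win comes from the common case: for lines that match nothing, every rule is rejected by a substring scan and no regex engine is invoked at all.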

At the moment, processing 1042 log lines per second (with 15 regexes in the config), the CPU load is 30-32% on a
  4-core Intel(R) Xeon(R) CPU E5405 @ 2.00GHz 

When you see these numbers, it gets sad. Let's rework the config by adding one more column, a literal substring for index(); that should speed things up.

frenko_config
 errdescr=        9    [E2;Javascript error at web interface - nginx acclog]    1    errdescr=
 errorcode=\d+    10   [E3;Fail upload with errorcode - nginx acclog]           1    errorcode=

And also we will slightly change our function of loading the config and parsing:
 sub load_config {
     my $re = shift;
     %{$re} = ();
     open(F, $CONFIG) || die 'cant open config ' . $CONFIG;
     while (<F>) {
         next if /^[\s\n]+$/;
         my ($regexp, $key, undef, $rate, $index) = split("\t", $_, 5);
         $index =~ s/[\n\t\r]//g;
         $$re{$key}{re}    = $regexp;
         $$re{$key}{index} = $index;
         $$re{$key}{rate}  = $rate || 1;
     }
     close(F);
 }

 sub handle_incoming_connection_udp {
     lambda {
         my ($addr, $msg) = @_;
         if (length($msg) > 5) {
             if (substr($msg, 0, 8) eq 'reconfig') {
                 load_config(\%Regexps);
             } else {
                 foreach my $key (keys %Regexps) {
                     if (index($msg, $Regexps{$key}{index}) >= 0) {
                         if ($Regexps{$key}{index} eq $Regexps{$key}{re}) {
                             sendstat($key, $Regexps{$key}{rate}, $memd);
                             last;
                         } elsif ($msg =~ /$Regexps{$key}{re}/) {
                             sendstat($key, $Regexps{$key}{rate}, $memd);
                             last;
                         }
                     }
                 }
             }
         }
     }
 }

These modifications reduced the CPU load somewhat, to 25-28%. Ideally I'd like to get it down to at least 10%.

The memcached data is as follows:
 STAT uptime 601576
 STAT bytes_read 358221242
 STAT bytes_written 92007777
 STAT bytes 5018318
 STAT curr_items 62269
 STAT total_items 557908
 STAT reclaimed 372969


Moving on.

frenko_listen.pl line (130-157)
 sub signal_handler { $time_to_die = 1; }

 getopts('d:', my $opts = {});
 if (exists $opts->{d}) {
     print "start daemon\n";
     my $pid = fork;
     exit if $pid;
     die 'Couldn\'t fork: ' . $! unless defined($pid);
     for my $handle (*STDIN, *STDOUT, *STDERR) {
         open($handle, '+<', '/dev/null')
             || die 'Can\'t reopen ' . $handle . ' to /dev/null: ' . $!;
     }
     POSIX::setsid() || die 'Can\'t start a new session: ' . $!;
     $time_to_die = 0;
     $SIG{TERM} = $SIG{INT} = $SIG{HUP} = \&signal_handler;
     until ($time_to_die) { runserv(); }
 } else {
     runserv();
 }

Here we taught our server to daemonize with the -d flag.

So we got a nicely scalable server architecture. Only the small stuff remains. On to the client side.

Client part
I reviewed a lot of charting libraries and settled on Highcharts, since it can draw graphs without Flash, dynamically load data for new points, and has a bunch of other goodies. Go to highcharts and do everything as described in the documentation; it's all straightforward there.

The data for the client side is served by a FastCGI script: it goes to memcached and returns the points for a given time period.

The chart makes two kinds of requests: the initial load, and a batch of new points for the graph. The responses look like this:
/frenky.fpl?init=1&accur=2
 {"status":"1","init":[["U1",[[1338188480,19],[1338188482,21],[1338188484,11],[1338188486,19],[1338188488,13],[1338188490,9],[1338188492,13],[1338188494,18],[1338188496,14],[1338188498,21],[1338188500,17],[1338188502,13],[1338188504,21],[1338188506,19],[1338188508,22],[1338188510,23],[1338188512,14],[1338188514,17],[1338188516,24],[1338188518,15],[1338188520,23],[1338188522,19],[1338188524,20]],"User uploaded the file - apache errlog"]]} 

/frenky.fpl?refresh=1&accur=2
 {"status":"1","refresh":[[1338188814,20],[1338188814,1],[1338188814,0],[1338188814,0],[1338188814,0],[1338188814,0],[1338188814,0],[1338188814,8],[1338188814,7],[1338188814,0],[1338188814,0],[1338188814,0]]} 
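How such a response might be assembled from the per-bucket counters, as a Python sketch (the dict stands in for memcached; function and parameter names are illustrative, not the actual frenky.fpl code):

```python
import json

def build_init(key, accur, now, counters, points=3, descr="desc"):
    """Assemble an init-style payload from per-bucket counters.

    `counters` stands in for memcached: a dict "key_accur_start" -> count.
    Buckets with no stored counter come out as 0, matching the zeros
    visible in the refresh example above.
    """
    start = now - now % accur                      # current bucket start
    series = []
    for ts in range(start - (points - 1) * accur, start + accur, accur):
        series.append([ts, counters.get(f"{key}_{accur}_{ts}", 0)])
    return json.dumps({"status": "1", "init": [["U1", series, descr]]})
```

Because the bucket starts are deterministic, the script never has to scan memcached: it generates the expected keys for the requested window and does point lookups.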


One of the problems I ran into: with two running FastCGI processes, I want both to reload their regex config. I haven't found a nice way to reach both processes through the web interface yet, so I just restart Apache.

Download all the sources, including js and fcgi, here rdash.rar

UPDATE
After all the optimizations described in the comments, the figures are:
at a load of 50 thousand lines per second, 80% CPU;
with rule sorting added, 80 thousand lines per second at 90% CPU.

https://github.com/frenkyoptic/rdash

Source: https://habr.com/ru/post/144639/

