
Sometimes an application problem turns into a small nightmare, a puzzle, call it what you like. I want to share our experience as developers using monitoring tools on one of our projects. Habr covers many topics, and there are already dozens of articles about products that help you understand what is going on: cacti -
habrahabr.ru/post/179391; zabbix -
habrahabr.ru/post/137641; collectd -
habrahabr.ru/post/93205; JVM standard tools -
habrahabr.ru/post/147008 (supplement).
I will describe one more small, universal and lightweight product in this category: IxoraRMS.
If you are interested, read on.
Background
We have CentOS 6.5, JDK 1.7, JBoss AS 7.1, the Linter DBMS and an application built on Hibernate / JPA, Seam and RichFaces, and it behaves strangely. After about a week of uptime, the application's performance degrades significantly, users complain about the web interface, and the developers get the blame.
Debugging and profiling tools point to one function, autoFlushIfRequired. Having studied a ton of articles about Hibernate optimization and problems, and having stepped through its code in the debugger, we decided to look at the bigger picture with monitoring.
You could, of course, set up a statistics collection and monitoring system such as Cacti / Munin / Zabbix, but doing it properly is a separate task, and within the organization it is not our responsibility. So we take something simple that requires minimal effort on a local machine. There are dedicated tools for Java profiling and monitoring, but I want to see the whole picture (OS, DBMS, application server, Java and the application itself), since the source of the problem is elusive. Since we love Java, we chose IxoraRMS as the tool (it's Java! Hurray!). A disclaimer right away: I have no connection with the product or its authors. The product is open source, so there is room for improvement. You can look at
www.ixorarms.com ,
code.google.com/p/ixora-rms ,
github.com/danielm777/ixora-rms .
Why we chose IxoraRMS over Munin, Cacti and collectd:
- Lightweight and easy to use (a simple desktop Java + Swing application, with no database and no need for in-depth Linux administration knowledge or building from source).
- Indicators are collected by agents and providers, many of which are ready to use. In many cases, getting data from a system or source can be described in the interface without writing code.
- Examples of monitoring an OS, DBMS, web server and JMX out of the box (though they need some manual tuning for a specific configuration).
- Many graphs and ready-made control panels.
- If a disk, service, EJB component or something else is added or removed, the information on the graphs changes dynamically, with no reconfiguration.
- It runs the same under Linux and Windows (important, since most of the developers on our project work in Windows).
Examples of what can be seen on the screen after setup. The remaining screenshots can be viewed
here.
IxoraRMS in brief
It is a Java application that can be run as a GUI client or as a console application. Statistics can be collected by an agent started on the remote machine (we did not want to install anything on the production server), or locally over a network connection to the remote machine (our case). The latter option may also be useful when you visit a customer to diagnose a problem (yours or someone else's).
What IxoraRMS can do:
- Group all information into control panels and screens.
- Draw graphs of various indicators.
- Display tables and lists of indicators, including non-numeric ones, with filtering.
- Record a log and play it back, including the state of all control panels, graphs, tables and lists over time.
- Provide many preconfigured agents for OSes, DBMSs, web servers, application servers and Java (JMX) that collect indicators, with ready-made control panels for them available in one click.
- Let you build any other agents and providers from Process, SQL or Java templates, through the graphical interface or by hand in XML. If you wish, you can write a provider in Java.
- Process raw indicators with built-in functions and JavaScript.
- Raise alerts when an event occurs (for example, processor load above 80% for 3 minutes).
- Launch tasks (scripts) on a schedule or on monitoring events.
Concept
For each server, an entity tree with indicators is built when monitoring starts (similar to the JMX model). Its nodes are hosts, agents, entities and indicators (counters). Everything is refreshed every 5 seconds by default. The node layout is not fixed: nodes can be created dynamically (for example, one per disk, process or processor) depending on the current configuration of the server and the monitoring settings.
Agents are added to each host to define the sources of information. Each agent has providers, which supply small pieces of information as indicators in the tree. Providers can be defined separately for each version of the monitored software.
There is an initial set of ready-made agents with providers.
Everything else can be built through the interface or by editing XML files in the config folder. Charts, tables and lists are built by selecting indicators or by defining a list of them with regular expressions. Ready-made views and control panels can be created at any level of the tree: globally for the session (all hosts), for the current host, for an agent, and so on down the entity hierarchy. An agent and its providers can optionally expose a different set of entities and indicators depending on the configured monitoring level of detail (maximum, high, medium, low). If necessary, you can write the provider code in Java.
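To make the concept more tangible, here is a minimal Python sketch of such a tree: nodes are created on first use (so a new disk simply appears as a new node), and each node holds its counters. This only illustrates the data model described above; the class and function names, the host name and the path format are our assumptions, not IxoraRMS internals.

```python
from dataclasses import dataclass, field


@dataclass
class Entity:
    name: str
    counters: dict = field(default_factory=dict)   # indicator name -> latest value
    children: dict = field(default_factory=dict)   # child name -> Entity

    def child(self, name):
        """Return the child node, creating it on first use (nodes are dynamic)."""
        return self.children.setdefault(name, Entity(name))


def update(root, path, counter, value):
    """Walk `path` from the root, creating nodes as needed, and set a counter."""
    node = root
    for part in path.split("/"):
        node = node.child(part)
    node.counters[counter] = value
    return node


def walk(node, prefix=""):
    """Yield (full path, counter, value) for every counter in the tree."""
    path = f"{prefix}/{node.name}" if prefix else node.name
    for counter, value in sorted(node.counters.items()):
        yield path, counter, value
    for name in sorted(node.children):
        yield from walk(node.children[name], path)
```

A provider would call something like update(root, "centos-host/Linux/Disks/sda", "util", 12.5) on every tick, and a view would pick its counters out of walk(root).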
After an agent is added, each entity (node) in the tree may have two icons next to it, D (Dashboard) and V (View), which mean that ready-made settings for viewing its indicators exist. This can and should be used to minimize effort. Here is an example after adding the Linux agent.
Dark blue arrows mark nodes whose indicators are currently being updated (handy when debugging new providers and views).
And here is the ready-made panel with indicators (captured after a few minutes of operation and after renaming the panel and view windows):
Now briefly, step by step
Step 1. Download and run the IxoraRMS GUI
In the folder with the installed distribution run:
console.bat
Note: you can also start and stop the local agent (the host manager, hmStart and hmStop respectively) and run in batch mode (batchStart, batchStop).
We create a session and add our host (running CentOS 6.5) to the list of monitored servers.
For our host, select Add Agents from the context menu. It shows which agents are already in the repository. For our tasks we need agents for CentOS (Linux), Linter, Java (HotSpot or JMX) and JBoss 7.1 (JBoss or JMX). Because the product has not been developed since 2011, agents for these products are either outdated or were never available. The next steps are therefore devoted to customizing the application for our needs.
To be able to edit the built-in agents and providers from the interface, we restart the console with the
-Dapplication.dev=true
parameter passed to java. To do this, we edit and run the console.launch.bat file (careful: it is regenerated every time console.bat is launched!).
Step 2. Monitor the OS
For CentOS 6.x a suitable agent exists (Linux), but the only supported versions are RedHat 9 and RedHat AS 3. If you select RedHat 9 and start monitoring, not all graphs come to life and errors appear in the IxoraRMS log. The Linux agent's providers rely on the iostat, mpstat and vmstat utilities, so we install the sysstat and procps packages from the repository on the host.
yum install sysstat procps
Let's use one provider as an example to see how they are configured. To do this:
• open the Tools / Agents Installer menu;
• find Linux and click Edit;
• add “CentOS 6.x” to the System versions list.
When the control panel is displayed for RedHat 9, the log shows errors related to the File system data and Disk data providers. To fix them, we proceed as follows:
• open the Tools / Provider manager menu;
• select the Linux agent and the Disks data provider (RedHat 9, RedHat AS 3);
• create a copy of it with Create Like, specifying the same name and a different version;
• in our case the provider type is Process, which redirects the standard output of the specified command to the parser. The stream comes from running the command locally or via telnet / ssh;
• run the command iostat -d -x 5 in the CentOS host console to see the output format (in the provider, {tick} is a standard parameter holding the current monitoring interval);
• now compare the output on the screen with the parser settings;
• switch to the Parser tab.
Here are the rules by which nodes (entities) and indicators are formed in the tree. Each source data column can define an entity, by specifying the entity path relative to the agent, and an indicator value, by column number. In short, this is where the monitoring data model is defined;
• columns 7 and 8 of the iostat output, “Kilobytes read per second” and “Kilobytes written per second”, are not available on CentOS with the same command line, so we delete them;
• the Descriptors tab contains the indicator types and their descriptions for the legend. In our case, the rows for the deleted indicators simply disappear;
• save the provider.
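What the parser settings express can be sketched in a few lines of Python: each device row becomes an entity, and each column becomes an indicator of that entity. This is only an illustration of the mapping, not the IxoraRMS parser. The sample text assumes the column layout that sysstat's iostat -d -x prints on CentOS 6 (without the kB/s columns); the values are made up, so check the real output on your host.

```python
# Hypothetical sample in the layout of `iostat -d -x` on CentOS 6; values are made up.
SAMPLE = """\
Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
sda               0.10     2.50    1.20    3.40    45.60   120.80    36.17     0.02    4.35   1.10   0.51
sdb               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
"""


def parse_iostat(text, entity_prefix="Disks"):
    """Map each device row to an entity path and each column to an indicator."""
    result = {}
    header = None
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "Device:":
            header = fields[1:]          # column names become indicator names
            continue
        if header is None or len(fields) != len(header) + 1:
            continue                     # skip banner lines and malformed rows
        entity = f"{entity_prefix}/{fields[0]}"
        result[entity] = dict(zip(header, map(float, fields[1:])))
    return result
```

Keying indicators by header name rather than by column number is a design choice of this sketch: it would have survived the missing kB/s columns, whereas a provider configured by column numbers has to be corrected by hand, as described above.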
By comparing the output of the utilities with the configured parameters in this way, we bring the providers into line with CentOS. The same work can be done by hand by editing the files:
• config/agents/agents.linux/agent - general description of the agent;
• config/repository/agents.linux/agent.db - agent-level control panels;
• config/repository/agents.linux/agent.pi - customized provider instances;
• config/repository/agents.linux/entity.db - control panels for entities;
• config/repository/agents.linux/entity.dv - customized views for entities.
These providers use a telnet / SSH connection to the host, over which the utility is launched remotely; after parsing, all of its output is converted into entities and indicators. That is why you have to specify a login and password when adding the agent. The password is stored encrypted in the session description on disk.
The agent as a whole collects many indicators from different utilities, so while it runs there are, in our case, 5-6 open ssh sessions on the host. (We may try HostManager and write about it later; the developers do not recommend that option, so that the agent's preprocessing does not noticeably distort the observed system.) Total resource consumption is under 1%.
After adjusting the agent and providers, remove the Linux agent from the host in the entity tree and add it again, choosing the CentOS 6.x version. We start monitoring and immediately open the ready-made System overview control panel (available in the agent tree). Now we see everything in the OS that interested us.
The final Linux agent file with the version for CentOS 6.x for import can be taken
here (Ixora RMS has import / export).
So the OS is monitored; the DBMS is next.
Step 3. Analyzing the work of the DBMS
There is no ready-made agent for the
DBMS Linter. So in this case the path is slightly longer.
We will monitor the DBMS using the following available information:
• the SYSTEM.$$$CHAN system view with the channel list;
• the SYSTEM.$$$SYSINFO system view with I/O statistics aggregated since the DBMS was started.
To connect to the database we use the JDBC driver from the LINTER distribution kit, jdbc/linjdbc-1.4.jar. It is best to copy it to the <Ixora>/lib folder.
These views give us the following information:
• the number of open channels (Connection + Statements);
• the maximum possible number of open channels;
• the number of blocked channels;
• the number of active (open) transactions by channel (kernel 6.0.16.x and higher);
• the lifetime of the “oldest” active transaction (s) (kernel 6.0.16.x and higher);
• the total number of completed transactions;
• the total number of logical writes (to the cache);
• the total number of logical reads (from the cache);
• the total number of physical writes (to disk);
• the total number of physical reads (from disk);
• the total number of SELECT operations.
We propose splitting these indicators across 3 providers with the following data model:
Connection data: Connection / [Opened, Locked, MaxAvailable]
Transaction data: Transaction / [Total, Active, MaxAge]
Input / Output: Statistics / [Block reads, Logical block reads, Block writes, Logical block writes, Selects]
Now create a new agent as follows:
• open the Tools / Agents Installer menu and press the Install button in the dialog;
• select the Custom agent installation;
• select SQL as the Agent Template;
• fill in the properties of the Linter agent as follows:
• now open the Tools / Provider Manager menu;
• select the Linter agent and click Add to create the Connection Data provider;
• select the SQL provider type and fill in the appropriate fields;
• field values may contain special variables such as {host} and {agent.Username}, which are defined in the session or in the agent properties when the agent is added to a host;
• to obtain the required indicators we use an SQL query of the form:
select * from
  (select count(*) as Opened from $$$CHAN where STATUS<>'') a,
  (select count(*) as Locked from $$$CHAN where STATUS<>'' and LOCKED_BY<>0) b,
  (select count(*)-1 as MaxAvail from $$$CHAN) c
• on the Parser tab, we describe the entities and indicators formed from the column values of the result set;
• on the Descriptors tab, we describe the indicator types and their descriptions.
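The way a one-row query result turns into indicators can be tried without a Linter installation. The Python sketch below uses an in-memory SQLite database as a stand-in: the chan table and its two columns are a hypothetical simplification of the channel view, and the query has the same three-subquery shape as the one above. Each column of the single result row becomes one indicator under the Connection entity.

```python
import sqlite3

# Same shape as the Linter query, but against a hypothetical `chan` table.
QUERY = """
select * from
  (select count(*) as Opened from chan where status <> '') a,
  (select count(*) as Locked from chan where status <> '' and locked_by <> 0) b,
  (select count(*) - 1 as MaxAvail from chan) c
"""


def poll(conn, entity="Connection"):
    """Run the query once and return {entity/column: value} for the single row."""
    cur = conn.execute(QUERY)
    names = [d[0] for d in cur.description]
    row = cur.fetchone()
    return {f"{entity}/{name}": value for name, value in zip(names, row)}


def demo_connection():
    """Build an in-memory stand-in for the channel view (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("create table chan (status text, locked_by integer)")
    conn.executemany(
        "insert into chan values (?, ?)",
        [("", 0), ("active", 0), ("active", 3), ("", 0), ("wait", 0)],
    )
    return conn
```

Calling poll(demo_connection()) yields one value each for Connection/Opened, Connection/Locked and Connection/MaxAvail, which is the mapping that the Parser tab describes for the real provider.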
Similarly, we add 2 more providers based on the following SQL queries:
1.
select * from
  (select TRANSACTIONS_COUNT as Total from $$$sysinfo) a,
  (select count(*) as Active, NVL(DIVTIME(2, min(TRANSACTION_START), SYSDATE),0) as MaxAge from $$$chan where STATUS<>'' and PARENT_CHANNEL!=0 and TRANSACTION_START > '01.01.1900') b
2.
select READ_BLOCKS, READ_LOGICAL_BLOCKS, WRITE_BLOCKS, WRITE_LOGICAL_BLOCKS, SELECT_COUNT from $$$sysinfo
Now you can add an agent to the host by filling in the database connection parameters under the user SYSTEM.
To display indicators, we build a control panel from these providers using the context menu Add / Use Wizards in the Views section.
For large cumulative indicators we use a small trick: instead of absolute values we show the increment of the indicator over the monitoring interval, which lets us observe the intensity of the load. To do this, click the Edit button on the finished view and add a function definition to the XML description of the view.
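The arithmetic behind this trick is simple, and a few lines of Python show it. This is not the IxoraRMS function syntax (that is defined in the view XML), just a sketch of the calculation; the class name is made up, and treating a decreasing total as a counter reset is our assumption.

```python
class DeltaCounter:
    """Turns a cumulative counter into per-interval increments."""

    def __init__(self):
        self.previous = None

    def update(self, total):
        """Return the increment since the last sample (None for the first one)."""
        previous, self.previous = self.previous, total
        if previous is None:
            return None
        if total < previous:
            # Counter was reset (e.g. the DBMS restarted): start over from the new total.
            return total
        return total - previous
```

Feeding it the samples 1000, 1250, 1250 gives None, 250 and 0: a flat line on the delta graph means no load, even though the absolute value stays large.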
IxoraRMS has built-in aggregation functions and JavaScript integration for arbitrary manipulations (see
www.ixorarms.com/documentation/user-guide/functions for details).
You can also set alert events for a view. For example, for the channel pool in the DBMS, the administrator can be notified if more than 90% of it stays in use for 10 seconds.
The mail server used for notifications is configured in the General Configuration / Setting menu.
After some simple work building views with the wizard, we get the following control panel for the DBMS.
To simplify setting up the control panel, its definition is included in the agent description by editing the XML files in the repository folder. Now, to view the control panel for any host, it is enough to attach the agent and activate the ready-made panel.
The final agent for DBMS Linter can be found
here.
Step 4. JVM
For the JVM, I wanted to see the distribution of the Survivor / Eden / Old / PermGen memory pools, garbage collection costs and threads. For monitoring an Oracle / Sun JVM there is a ready-made HotSpotJVM agent that uses a JMX connection. To connect to the virtual machine running JBoss AS you can use the remoting-jmx protocol, which is open by default on port 9999 and supports authorization of users in the ManagementRealm.
To let the HotSpot JVM agent use this protocol, put the jboss-client.jar file from the JBoss AS 7.1 distribution into the /jars folder and replace log4j.jar with version 1.2.12 or higher (for example, the JBoss AS 7.1 distribution includes log4j-1.2.16.jar). Then adjust the agent description in the config\agents\agents.hotspotjvm\agent file by adding jboss-client.jar to it.
<?xml version="1.0" encoding="UTF-8"?>
<agent>
  ….
  <jars>
    <jar>/jars/AgentHotspotJVM.jar</jar>
    <jar>/jars/RMSJMX.jar</jar>
    <jar>/lib/javax77.jar</jar>
    <jar>/jars/jboss-client.jar</jar>
  </jars>
  ….
</agent>
After changing the agent description, we restart IxoraRMS.
To add a user to the ManagementRealm, use the <JBOSS_HOME>/bin/add-user script. Run it and follow the prompts.
Now you can add an agent to the host by specifying the appropriate parameters:
After the agent is added, a control panel is immediately available for it. If we use it, we get the following picture:
Only the JVM Throughput view, which shows performance as the percentage of total time left after subtracting garbage collection time, does not work. Because the calculation uses parameters of specific garbage collectors, the view has to be adjusted for each JVM launch configuration. We did not fix this view.
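For reference, the throughput figure itself is easy to compute by hand. The Python sketch below assumes two inputs that the standard JMX beans expose: the JVM uptime in milliseconds (the Uptime attribute of the Runtime bean) and the accumulated CollectionTime of each garbage collector bean. How the IxoraRMS view does its calculation internally is not shown here; the function names and the collector names in the usage note are only an illustration.

```python
def total_gc_time(collectors):
    """Sum CollectionTime over all collectors; -1 means 'not supported' and is skipped."""
    return sum(t for t in collectors.values() if t >= 0)


def throughput_percent(uptime_ms, gc_time_ms):
    """Share of wall-clock time not spent in garbage collection, in percent."""
    if uptime_ms <= 0:
        raise ValueError("uptime must be positive")
    gc_time_ms = min(max(gc_time_ms, 0), uptime_ms)   # clamp to a sane range
    return 100.0 * (1.0 - gc_time_ms / uptime_ms)
```

With collectors such as {"PS Scavenge": 1200, "PS MarkSweep": 1800} and an uptime of 60000 ms the result is about 95%. Summing over whatever collectors are present, instead of naming them, avoids the dependency on the JVM launch options that broke the ready-made view.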
An agent with our configuration fixes can be downloaded
here .
Step 5. Now it's the application server's turn
What I want to know:
• the number of open web sessions, errors and the data transfer load (bytes received / sent);
• the number of transactions (total and rolled back);
• JMS message queues (current queue size);
• Hibernate (Infinispan) second-level cache statistics (by region: cache size, hit percentage, read / write ratio);
• EJB / MDB component pools (by component: how many instances are created / free / in use);
• web service statistics (by endpoint: total number of requests and number of failed requests).
The ready-made agent for JBoss AS supports only server versions 4.0 and 4.2, through JNDI with access via the jnp protocol. So, to monitor JBoss AS 7.1, you need to use the standard JMX JSR160 agent (see the HotSpot JVM agent example and setup above, including adjusting the agent description to work through remoting-jmx).
Note: the latest Ixora RMS source code already contains JBoss AS 5.x support based on JMX JSR 160, but a compiled version is not yet available.
Specifying values in the Root folder and Classpath fields is not enough, because the protocol is looked up through the SPI mechanism, and the jar files from the agent's Classpath field are loaded by a separate classloader.
Once the agent has connected to the host successfully, the hierarchy of JMX entities (nodes) is immediately displayed in the tree.
As an example of building the graphs and lists we need, consider how to set up monitoring of the second-level cache.
Information about the Infinispan regions is exposed in JMX as the following entities and indicators:
To display the hit count and hit percentage as a table with columns, you can use the XML editor from the context menu on the Views tab. Before that, it is best to select the jboss.infinispan node in the tree so that the view is saved on that node. After editing, we get the following XML:
<rms>
  <view class="com.ixora.rms.ui.dataviewboard.tables.definitions.TableDef">
    <name>Cache regions statistics</name>
    <description>Infinispan cache regions hits statistics</description>
    <query>
      <resource id="region" iname="$1" name="$1" rid="-/-/root/jboss.infinispan/(jboss.infinispan:type=Cache,name=&quot;(.*)&quot;,.*,component=Statistics)"/>
      <resource id="hits" iname="$1/$counter" name="$1/$counter" rid="-/-/root/jboss.infinispan/(jboss.infinispan:type=Cache,name=(.*),.*,component=Statistics)/[hits]"/>
      <resource id="hitRatio" iname="$1/$counter" name="$1/$counter" rid="-/-/root/jboss.infinispan/(jboss.infinispan:type=Cache,name=(.*),.*,component=Statistics)/[hitRatio]"/>
    </query>
    <agentVersions/>
    <author>customer</author>
    <category id="region"/>
    <column id="hits"/>
    <column id="hitRatio"/>
  </view>
</rms>
A few comments on it: a resource is a data source (indicators from the tree) to display. Resources are specified by an absolute or relative path in the tree. The path may include regular expressions that select multiple sources. For each source you can set the iname and name attributes, which define the unique identifier of the resource and its display name. For a detailed description, see the website at
http://www.ixorarms.com/documentation/user-guide/concepts. Various display options are available: graphs, tables with multiple columns, lists of properties, and log views. In our case the simplest one, a table, is used.
We identify the elements we need in the indicator tree for the JBoss AS server and, to simplify access to them, create all views at the level of the jboss.as, jboss.infinispan and jboss.ws entities respectively. To monitor the various services we use the following entities and their indicators, in IxoraRMS notation:
Service | Entities | Indicators
Web server | -/-/root/jboss.as/jboss.as:subsystem=web,connector=http.* | bytesSent, bytesReceived, requestCount, errorCount
Cache | -/-/root/jboss.infinispan/(jboss.infinispan:type=Cache,name=(.*),.*,component=Statistics) | hits, hitRatio, readWriteRatio, numberOfEntries, evictions
Web services | -/-/root/jboss.ws/jboss.ws:context=.*,endpoint=\w+$ | RequestCount, FaultCount
EJB components | -/-/root/jboss.as/jboss.as:deployment=.*,subsystem=ejb3,stateless-session-bean=.* and -/-/root/jboss.as/jboss.as:deployment=.*,subsystem=ejb3,message-driven-bean=.* | poolAvailableCount, poolCurrentSize, poolCreateCount
Message queues | -/-/root/jboss.as/jboss.as:subsystem=messaging,hornetq-server=default,jms-queue=.* and -/-/root/jboss.as/jboss.as:subsystem=messaging,hornetq-server=default,jms-topic=.* | messageCount
Transactions | -/-/root/jboss.as/jboss.as:subsystem=transactions | numberOfTransactions, numberOfInflightTransactions, numberOfCommittedTransactions, numberOfAbortedTransactions, numberOfApplicationRollbacks
Regular expressions let you select whole sets of entities here instead of specific ones.
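How such a pattern picks entities and what ends up in $1 can be checked with any regular expression engine. The Python sketch below applies the cache pattern from the view XML to a few object names. The names are hypothetical examples in the usual Infinispan style, and we assume the pattern must match the whole entity name; the exact matching rules of IxoraRMS may differ.

```python
import re

# The entity part of the `rid` pattern; group 1 corresponds to $1 in the view.
PATTERN = re.compile(
    r'jboss\.infinispan:type=Cache,name="(.*)",.*,component=Statistics'
)

# Hypothetical object names for illustration only.
NAMES = [
    'jboss.infinispan:type=Cache,name="entity(local)",manager="hibernate",component=Statistics',
    'jboss.infinispan:type=Cache,name="local-query(local)",manager="hibernate",component=Statistics',
    'jboss.infinispan:type=Cache,name="entity(local)",manager="hibernate",component=Cache',
    'jboss.infinispan:type=CacheManager,name="hibernate",component=CacheManager',
]


def select(pattern, names):
    """Return {captured region: full name} for every name the pattern fully matches."""
    selected = {}
    for name in names:
        m = pattern.fullmatch(name)
        if m:
            selected[m.group(1)] = name
    return selected
```

Only the two Statistics components are selected, and the captured group gives the region names used as row labels in the table. New cache regions that appear later match the same pattern, so the view does not need to be edited.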
As a result of this setup for JBoss AS, we get the control panel with graphs and tables shown below (after pressing the Start button).
All views are combined into a single control panel and saved at the level of the JMX JSR 160 agent under the name “JBoss AS 7.x Overview”. This is done with the wizard on the Dashboards tab for the entity selected in the tree. All view and control panel settings are saved automatically in the agent settings. A JMX JSR 160 agent with settings for JBoss AS can be downloaded for import
here.
In IxoraRMS, control panels can be placed on different screens, and you can switch between them to monitor different system indicators.
Results of our experiments
Pros:
- we obtained provider settings for our configuration;
- when needed, we can quickly add graphs and tables during monitoring, without restarting;
- we see everything that happens on our server in one place;
- monitoring a test server now takes a minute: just attach the right agent.
And a few drawbacks:
- The project has not been maintained since 2011, so agents and providers have to be updated as needed;
- After several abnormal terminations of the monitoring application, the JMX interface on the JBoss side stops responding altogether (most likely a JBoss problem, but the production server has to be restarted if further monitoring is needed);
- The log player is inconvenient for monitoring intervals of 2-3 days or more: the interface responds slowly. Admittedly, the logs in XML format are about 2-4 GB in size. The initial viewing step offers settings for aggregating indicators by minutes or hours, but playing with them did not improve usability. Note: the latest source code turned out to support writing the log to a database, but we did not have time to try it;
- It is not suitable for permanent monitoring of a system: you can connect occasionally and monitor server performance continuously for up to 1-2 days.
Monitoring of our application did not end there: we also tried to monitor the logic of the application itself, without major changes to the application, according to the following list:
- service methods with the highest total execution time;
- service methods with the highest average execution time;
- the most frequently called service methods;
- SQL queries with the highest total execution time;
- SQL queries with the highest average execution time;
- the most frequently executed SQL queries;
- errors in the log.
But more about that in the next part. Good luck with your debugging!
PS A complete understanding of the original problem never came. The OS indicator Context switches per second clearly climbs into the 8000-10000 range within a week, while the number of threads grows only slightly. Synchronization? Hibernate's autoFlushIfRequired takes a very long time: from fractions of a second up to 30-70 seconds in total on some service calls after degradation. The first-level cache stays the same size for this service request (about 8000 entities), yet the execution time grows. Synchronization inside the Hibernate and Infinispan combination? By hook or by crook we have pushed the time to degradation out to almost two weeks, but I want more.