
MongoDB as a monitoring tool for log files

In this article I will talk about using the non-relational database MongoDB to monitor log files. There are many tools for log monitoring, from shell scripts tied to cron all the way up to an Apache Hadoop cluster.



Monitoring text files with scripts is convenient only in the simplest cases, when, for example, problems are detected by the presence of lines containing ERROR, FAILURE, SEVERE, and so on. To monitor large files it is convenient to use Zabbix, where the Zabbix Agent (active) reads only new data and sends it to the server at regular intervals.

Up to a certain point Zabbix was enough, but for monitoring business processes and analyzing metrics against an SLA, more complex tasks regularly appear, for example:
- determining the number of operations that passed through the system during a given time interval;
- determining the processing time of various operations;
- identifying the percentage of errors and exceptions, broken down by operation and business data;
- collecting statistics on non-monetary transactions and analyzing service levels.
Typically, such tasks become a headache for system administrators, who have to invent sophisticated ways of collecting, aggregating and presenting this data if the system was not originally designed with such collection in mind.

To solve these problems, we had to find a more flexible and universal solution.

Initially, I had to teach the application to write its logs as structured documents to MongoDB via the Java driver. The data model was specifically designed to store unified documents in JSON format.

The advantages of this method are the absence of any delay in obtaining new data (unlike the Apache Hadoop case), built-in replication and, if necessary, sharding, and convenient means of querying and analyzing the data using the MongoDB API, mapReduce and JavaScript functions. But first things first.

Example


Suppose an application writes documents to the database:

{
    _id: ObjectId,
    id: string,
    operation: string,
    time: ISODate,
    data: {
        info: string,
        result: string,
        message: string
    },
    array: [a1, a2, ..., aN]
}
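For illustration, here is a minimal sketch of what one such document might look like when inserted from the mongo shell (the concrete field values are made up for this example):

db.test.insert({
    id: "tx-00042",                      // hypothetical transaction identifier
    operation: "test",
    time: new Date(),                    // stored as ISODate
    data: {
        info: "request accepted",
        result: "SUCCESS",
        message: ""
    },
    array: ["step1", "step2"]
});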

Having a collection of such documents, using a simple query, we can understand how many test operations the application has performed in the last 5 minutes:

 db.test.count({ "operation": "test", "time": {$gte: new Date(new Date() - 1000*60*5)} }) 

Such a query is easy to wrap in a shell script that connects to mongo and prints the result to the terminal. This script can then be hooked into the monitoring system, which is what we did, getting a graph and hanging triggers on out-of-range values.
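A minimal sketch of such a script, assuming it is saved as count_test.js (the file name is illustrative) and invoked by the wrapper as mongo <dbname> count_test.js:

// count_test.js: print the number of "test" operations seen in the last 5 minutes
var fiveMinutesAgo = new Date(new Date() - 1000 * 60 * 5);
var count = db.test.count({ operation: "test", time: { $gte: fiveMinutesAgo } });
print(count);   // the wrapper script passes this value on to the monitoring system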

For the query to run quickly, it is enough to build a single index: db.test.ensureIndex( {time: 1}, {background: true} ) . Then MongoDB will look only at the data for the last 5 minutes rather than at every document in the collection. You can add several more indexes at once, but if there are too many of them, the corresponding B-trees have to be updated on every insert, which creates additional load. And at some point the indexes may no longer fit in RAM, reads will go to disk, and data access will slow down significantly.
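If queries usually filter by both operation and time, a compound index can help, and the shell lets you check how much space the indexes occupy. A sketch, using the same collection and fields as above:

// compound index: queries filtering on operation + a time range can use it directly
db.test.ensureIndex({ operation: 1, time: 1 }, { background: true });

// total size of all indexes on the collection, in bytes
db.test.totalIndexSize();

// per-index sizes are reported by the collection statistics
db.test.stats().indexSizes;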

Work with dates


The time field (in this case, time) is the one I used most often. When starting out with the MongoDB shell, it was easy to find information on composing simple queries, but queries by date caused problems. I will describe several ways to select documents within a specified time interval.

If the application inserts new Date() into the time field during an insert operation, the date will be written in ISODate format. Entering the new Date() command in the mongo shell prints the current date in the format ISODate("YYYY-MM-DDThh:mm:ss[.sss]Z") - you can take this string and substitute it into a query like:

db.test.find({ time: {$gte: ISODate("YYYY-MM-DDThh:mm:ss[.sss]"), $lte: ISODate("YYYY-MM-DDThh:mm:ss[.sss]")} }) , where $gte is greater than or equal to, $lte is less than or equal to, [.sss] is the number of milliseconds.

You can also directly set the date via new Date() :

db.test.find({ time: {$gte: new Date(YYYY, MM, DD, hh, mm, ss, sss) } }) , where the months count (MM) starts at 0.

or

db.test.find({ time: {$gte: new Date( new Date() -1000*300 ) } }) - selects documents from the last 5 minutes. Instead of the expression "-1000*300" you can substitute any interval in milliseconds. You can also pre-define date variables before the query:

var today = new Date();
var yesterday = new Date();
yesterday.setDate(today.getDate() - 1);
db.test.find( {"time": {$gte : yesterday, $lt : today} } );

In some cases it is convenient to work with POSIX time (milliseconds since the epoch), for example:

for (var i = yesterday.getTime(); i < today.getTime(); i = i + 300*1000) {   // getTime() returns POSIX time in milliseconds
    var b = db.test.find({
        "time": { $gte: new Date(i), $lte: new Date(i + 300*1000) }
    }).count();                                       // .count() returns the number of documents in the interval
    var time = new Date(i);                           // start of the current interval
    print(b + "; " + time.toTimeString().split(' ')[0]);   // print the count (b) and the time as hh:mm:ss; split(' ')[0] drops the timezone part
}

After running this loop, the console will show the number of documents that passed through the system, broken down into 5-minute intervals over the last day.

The forEach() method


Since the JSON-like output of aggregation is often inconvenient, the .forEach() method helps: it is applied to a cursor and can process each document in an arbitrary way using JavaScript.

For example, you need to display a list of id for a specified period of time (take the last 5 minutes):

 db.test.aggregate({$match:{time:{$gte:new Date(new Date()-1000*300)}}}).forEach( function(doc) { print( doc.id ) } ) 

Instead of aggregate you can use find or distinct here - the main thing is that forEach() receives an array (or cursor) as input. Since the aggregation output format changed somewhat from version to version, in version 2.6, for example, you should use aggregate({...},...,{...}).result.forEach , since the output had the format "result":[{...},...,{...}] .
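For comparison, a minimal sketch of the same selection done with find() instead of aggregate(), projecting only the id field and iterating over the cursor:

db.test.find(
    { time: { $gte: new Date(new Date() - 1000 * 300) } },   // last 5 minutes
    { _id: 0, id: 1 }                                        // project only the id field
).forEach(function (doc) {
    print(doc.id);
});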

Another example: you need to find the final status of each id and export the result as a table.

db.cbh.aggregate(
    { $match: { time: { $gte: new Date(new Date() - 1000*300) } } },   // documents from the last 5 minutes
    { $group: {
        _id: "$id",                            // group by id
        "count": { $sum: 1 },                  // number of documents for each id
        "result": { $push: "$data.result" }    // collect the results into an array
    } },
    { $match: { "count": 4 } },                // keep only ids that produced 4 documents
    { $match: { "result": { $ne: [] } } }      // and whose result array is not empty
).result.forEach(function(doc) { print( doc._id + ", " + doc.result ) })

Such a query prints to the console results of the form " id, data.result ", which can be imported into Excel or any relational DBMS.

Functions


Functions are convenient for computing metrics, running complex monitoring queries and generating reports. Here is an example of a simple function that calculates the average duration of an operation.

function avgDur(operation, period) {
    var i = 0;
    var sum = 0;
    var avg = 0;
    db.test.aggregate(
        { $match: { "time": { $gte: new Date(new Date() - period*1000) } } },
        { $group: {
            _id: "$id",
            "operation": { $addToSet: "$operation" },
            "time": { $push: "$time" },
            "count": { $sum: 1 }
        } },
        { $match: { "operation": operation, "count": 4 } },
        { $project: { _id: 0, "time": 1 } }
    ).result.forEach(function(op) {
        dur = op.time[3] - op.time[0];   // duration between the first and the last document of the id
        sum = sum + dur;
        i = i + 1;
    });
    avg = sum / i;
    print(avg / 1000);                   // average duration in seconds
}

Before saving the function, it is better to run it in the console as avgDur("test", 300) and check that it works. Then save it:

db.system.js.save({
    _id: "avgDur",
    value: function(operation, period) { ... }
})

After that, run db.loadServerScripts(); and call avgDur("test", 300) .

If you save an invalid function, db.loadServerScripts() will throw an error and you will not be able to access the other functions either, so check carefully before saving.

Pitfalls


The first thing I ran into on a server without replication was the difficulty of freeing up disk space. MongoDB writes documents to disk sequentially, and if you manually delete part of the collection with the db.test.remove({...}) command, the disk space is not freed, because that would cause severe fragmentation. To avoid this, MongoDB leaves the remaining documents in place and keeps track of the resulting gaps for reuse. To actually "compact the collection" you have to run db.repairDatabase() , but this command requires as much free disk space as the database occupies, since the database is first copied to a new location and only then are the files of the old database deleted.
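Before deciding whether a repair is worth it, you can compare how much space the data actually occupies with what has been allocated on disk; a quick sketch in the shell:

// database-level statistics: dataSize is the size of the documents themselves,
// storageSize / fileSize show what is actually allocated on disk
db.stats();

// the same breakdown for a single collection
db.test.stats();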

To avoid such problems, I found several solutions that can be combined:

1. Replication. If there is a backup server, you can always stop the secondary replica, run repairDatabase() on it and bring the server back up. It is even better to set up a TTL (time-to-live) index, so that documents are deleted automatically after a given lifetime, for example, after 30 days (see the sketch after this list). But even then you still have to run repairDatabase() periodically.

2. Create a capped collection (a collection of limited size) right away. When the collection reaches its size limit, the oldest documents start being overwritten by the newest ones.

3. See how much disk space the collection occupies after 30 days of logging, then run convertToCapped on this collection and set the limit with some margin.

4. Write temporary collections to a separate database, since when db.dropDatabase() is executed, the disk space is guaranteed to be freed. Unfortunately, db.collectionName.drop() does not free disk space; the data is simply marked as unavailable.
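Sketches of the first three options, assuming the 30-day retention mentioned above and an illustrative 10 GB size limit:

// option 1: TTL index - documents expire about 30 days after their time value
db.test.ensureIndex({ time: 1 }, { expireAfterSeconds: 60 * 60 * 24 * 30 });

// option 2: create a capped collection with a fixed size limit up front
db.createCollection("test", { capped: true, size: 10 * 1024 * 1024 * 1024 });

// option 3: convert an existing collection to a capped one, with some margin
db.runCommand({ convertToCapped: "test", size: 10 * 1024 * 1024 * 1024 });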

Starting from version 2.6 the allocation strategy changed slightly: the usePowerOf2Sizes option became the default for collections. Space freed by deleting documents is now reused more efficiently, but it is still safer to bound the size of the collection up front.
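On versions where this is not yet the default, the same allocation strategy can be enabled per collection with collMod; a minimal sketch:

// enable power-of-two record allocation for an existing collection
db.runCommand({ collMod: "test", usePowerOf2Sizes: true });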

Aggregation of large amounts of data


Aggregation is not the best tool for processing huge amounts of data: aggregating 100 million documents, as in the example above, will be difficult.

The first constraint you run into is that the size of the resulting document must not exceed 16 MB. However, starting from version 2.6 the result of aggregation is returned as a cursor, and using the cursor.next() method you can retrieve the necessary data sequentially.
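A minimal sketch of iterating over such a cursor in the 2.6+ shell (grouping by operation here is just for illustration):

var cur = db.test.aggregate([
    { $match: { time: { $gte: new Date(new Date() - 1000 * 300) } } },
    { $group: { _id: "$operation", count: { $sum: 1 } } }
]);
while (cur.hasNext()) {     // results are pulled in batches instead of one 16 MB document
    printjson(cur.next());
}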

There is also a 64 MB limit on the buffer used during processing, which can overflow with this many documents. Starting with version 2.6, the {allowDiskUse: true} aggregation option helps avoid this.
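It is passed as the options document of aggregate() in 2.6+, roughly like this:

db.test.aggregate(
    [
        { $sort: { time: 1 } },                               // memory-hungry stage
        { $group: { _id: "$operation", count: { $sum: 1 } } }
    ],
    { allowDiskUse: true }   // spill stage data to temporary files instead of failing
);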

Really large volumes of data are better handled with mapReduce, which can use multi-threaded processing and distribute the load between the servers of a sharded cluster.
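As a hedged sketch (the grouping key and interval are illustrative), counting documents per operation over the last day with mapReduce:

db.test.mapReduce(
    function () { emit(this.operation, 1); },               // map: one count per document
    function (key, values) { return Array.sum(values); },   // reduce: sum the counts
    {
        query: { time: { $gte: new Date(new Date() - 1000 * 60 * 60 * 24) } },
        out: { inline: 1 }                                   // return the result inline instead of writing a collection
    }
);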

But all that is trivial. The real pain begins if the data model does not fit MongoDB, when tasks require joining data from several collections or a recursive approach to searching and aggregating documents.

I will give an example:

Suppose you need to calculate the average completion time of all successful operations. But if the logging approach records documents of the "request" and "response" types, with several such documents per transaction, then in our schema the "data.result" field with the outcome of the transaction appears only in the last document of the series.

So how do you calculate the time in this case? If in the $match stage of the aggregation we search for all documents with "data.result":"SUCCESS" , only the last document of each id will end up in the selection. The mapReduce paradigm will not help here either, since at the map stage MongoDB passes over the collection only once.

You can walk the collection, build an array of all the necessary id values, and then run the aggregation, substituting this array into $match :

ids = [];
// the month boundaries are built from new Date(): getFullYear() gives the YYYY part,
// getMonth()-1 and getMonth() give the MM parts for the beginning and the end of the previous month
var monthBegin = new Date(new Date().getFullYear(), new Date().getMonth() - 1, 1, 0, 0, 0, 0);
var monthEnd = new Date(new Date().getFullYear(), new Date().getMonth(), 1, 0, 0, 0, 0);

db.test.find(
    { time: { $gte: monthBegin, $lt: monthEnd }, "operation": "test", "data.result": "SUCCESS" },
    { "_id": 0, "id": 1 }
).forEach(function(op) {
    ids.push(op.id);                        // collect the id of every successful operation into the array
});

db.test.aggregate(
    { $match: { "id": { $in: ids } } },     // select all documents whose id is in the array
    {...}
)

However, keep in mind that this array stays in memory the whole time, and if it contains several million elements there is a high probability of running out of memory. To avoid memory allocation problems, you can write the "prepared data" into a new collection:

db.getSiblingDB("testDb").test.find({
    "time": { $gte: new Date(monthBegin.getTime()), $lte: new Date(monthEnd.getTime()) },
    "operation": "test",
    "data.result": "SUCCESS"
}).addOption(DBQuery.Option.noTimeout).forEach(      // noTimeout so the cursor does not expire during the long pass
    function(doc) {
        db.test.find({ "id": doc.id }).forEach(      // for each successful id, find all of its documents
            function(row) {
                db.getSiblingDB("anotherDb").newTest.insert({   // and write the id and time into a new collection in another database
                    "id": doc.id,
                    "time": row.time
                });
            }
        );
    }
);

We have thus written a new collection with documents of the form {id: string, time: ISODate(...)} . Next, we run a simple aggregation over the new collection and get the desired result:

i = 0; sum = 0; avg = 0;
db.getSiblingDB("anotherDb").newTest.aggregate(
    { $group: { _id: "$id", "time": { $push: "$time" } } }
).result.forEach(function(op) {
    dur = op.time[op.time.length - 1] - op.time[0];   // duration between the first and the last document of the id
    sum = sum + dur;
    i = i + 1;
});
print(sum / i)     // average duration in milliseconds

In summary, the approach described above made it possible to simplify searching for the necessary records, to set up very precise monitoring at all levels, and to generate reports and statistics based solely on the log data in MongoDB. Managing the data from the application and the ease of deployment save development time.

However, the MongoDB data model does not suit every project. If you know that recursive data retrieval or joins will not be required, most likely there will be no serious problems!

Source: https://habr.com/ru/post/272651/

