
OpenCV 2.4 Flat Object Recognition

Hello to all. I would like to talk about the principles underlying object recognition using OpenCV. Fortunately, for some time I had a chance to work in the computer vision laboratory of the VMK MSU, and I got a little insight into the wisdom of this branch of computer science. The task, which I will consider here, was proposed at the Microsoft Computer Vision School Moscow 2011 at the seminars of Victor Erukhimov, one of the developers of the OpenCV software. Almost in the same form, the code in question can be found in the OpenCV 2.4 demos.

Task formalization


We set the following task: the input is an image of a scene from an ordinary USB camera and an image of a target flat object (for example, a book). The task is to find the target object in the scene image.

Let's start with the camera


First, let's include opencv.hpp and core.hpp, which pull in the OpenCV base classes, and nonfree/features2d.hpp, which defines the classes of the various detectors and descriptor extractors (we will be interested in SURF).

#include <iostream>
#include "opencv2/opencv.hpp"
#include "opencv2/core/core.hpp"
#include "opencv2/nonfree/features2d.hpp"
#include <vector>

using namespace std;
using namespace cv;

void readme(string &message)
{
    cout << message << endl;
}

Next comes the main body. We assume that the executable is given one parameter: the path to the image of the sample (the flat object). The constructor of the VideoCapture class takes a device number (camera) as input; 0 is the default device (usually the integrated camera), while the code below opens device 1. Then the target image is read into img_object.

int main( int argc, char** argv )
{
    if( argc != 2 )
    {
        string message = "Usage: ./cv_test <img_object>";
        readme(message);
        return -1;
    }

    VideoCapture cap(1); // open camera number 1; use 0 for the default device
    if( !cap.isOpened() ) // check that the capture device was opened
    {
        string message = "Could not open the video capture device";
        readme(message);
        return -1;
    }

    Mat img_object = imread( argv[1], CV_LOAD_IMAGE_GRAYSCALE );


In an infinite loop we grab the next frame from the device; this is the input image of the scene in which we look for the target object.

    for(;;)
    {
        Mat frame;
        cap >> frame; // grab a new frame from the camera
        Mat img_scene = frame;

        if( !img_object.data || !img_scene.data ) // check that both images contain data
        {
            string message = "Error reading the images";
            readme(message);
        }


Finding key points


Now we proceed to recognizing the object. The first thing to do is to detect the key points in the image. Simplistically, these are points where the image has sharp gradients along both x and y (corner points). Their detection is based on the autocorrelation matrix and on an image pyramid (for scale invariance). The autocorrelation matrix is built from the derivatives of the image I with respect to x and y:

    M = Σ w(x, y) · | Ix²     Ix·Iy |
                    | Ix·Iy   Iy²   |

The idea is that the corner response metric (the λ are the eigenvalues of M, det is the determinant, trace is the trace of the matrix, α is a constant)

    R = det(M) − α·trace(M)² = λ1·λ2 − α·(λ1 + λ2)²

picks out corner points: at these points there are significant gradients in both x and y, and R reaches a local maximum. The minHessian parameter sets the threshold that decides whether a point is accepted as a key point or not.
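For intuition, the corner response R described above can be computed directly with OpenCV's cornerHarris. This is only an illustration of the metric, not part of the pipeline below, which uses the SURF detector; the image path "object.png" is a placeholder:

// Illustration only: the Harris corner response R = det(M) - alpha * trace(M)^2,
// computed by cv::cornerHarris; not part of the SURF pipeline below.
#include "opencv2/opencv.hpp"
using namespace cv;

int main()
{
    Mat gray = imread( "object.png", CV_LOAD_IMAGE_GRAYSCALE ); // placeholder path
    if( gray.empty() ) return -1;

    Mat R;
    cornerHarris( gray, R, 2, 3, 0.04 ); // blockSize = 2, Sobel aperture = 3, alpha = 0.04

    Mat corners;
    threshold( R, corners, 0.01, 255, THRESH_BINARY ); // keep locations with a strong response
    return 0;
}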

        //-- Step 1: detect the key points using the SURF detector
        int minHessian = 400;
        SurfFeatureDetector detector( minHessian );

        std::vector<KeyPoint> keypoints_object, keypoints_scene;
        detector.detect( img_object, keypoints_object );
        detector.detect( img_scene, keypoints_scene );


Computing descriptors


Next, we need to compute a descriptor for each key point: a vector that encodes the geometry of the local neighbourhood around the point. The basis for this is usually SIFT (SURF is essentially a fast SIFT). The principle is as follows: the patch around the point is divided into a fixed grid of blocks; in each block a histogram of gradient directions weighted by gradient magnitude is computed, and the whole patch is oriented along the dominant gradient direction (rotation invariance). This "gradient pattern" describes the local patch.

        //-- Step 2: compute the descriptors
        SurfDescriptorExtractor extractor;

        Mat descriptors_object, descriptors_scene;
        extractor.compute( img_object, keypoints_object, descriptors_object );
        extractor.compute( img_scene, keypoints_scene, descriptors_scene );
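To get a feel for what the extractor produces, one can inspect the descriptor matrices: each row is one key point's descriptor, and with the default SurfDescriptorExtractor settings each descriptor has 64 components. A small sketch continuing from the variables above:

        // Sketch: inspect the descriptor matrices computed above.
        // Each row is one key point's descriptor; SURF defaults to 64 columns.
        cout << "object descriptors: " << descriptors_object.rows
             << " x " << descriptors_object.cols << endl;
        cout << "scene descriptors: " << descriptors_scene.rows
             << " x " << descriptors_scene.cols << endl;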


Matching descriptors


At the next stage we have to match the descriptor vectors, i.e. find the corresponding points on the target object and in the scene. For this you can use FlannBasedMatcher (suited for large sets of key points) or BruteForceMatcher (for small ones). Then, out of all matched points we keep only those whose descriptor distance is no more than 3 * min_dist, where min_dist is the minimum distance between descriptors.

        //-- Step 3: match the descriptor vectors using the FLANN matcher
        FlannBasedMatcher matcher;
        vector< DMatch > matches;
        matcher.match( descriptors_object, descriptors_scene, matches );

        double max_dist = 0;
        double min_dist = 100;

        //-- Quick calculation of the maximum and minimum distances between matched key points
        for( int i = 0; i < descriptors_object.rows; i++ )
        {
            double dist = matches[i].distance;
            if( dist < min_dist ) min_dist = dist;
            if( dist > max_dist ) max_dist = dist;
        }

        printf("-- Max dist : %f \n", max_dist );
        printf("-- Min dist : %f \n", min_dist );

        //-- Keep only the "good" matches, i.e. those closer than 3 * min_dist
        vector< DMatch > good_matches;
        for( int i = 0; i < descriptors_object.rows; i++ )
        {
            if( matches[i].distance < 3 * min_dist )
            {
                good_matches.push_back( matches[i] );
            }
        }

        Mat img_matches;
        //-- Draw the good matches
        drawMatches( img_object, keypoints_object, img_scene, keypoints_scene,
                     good_matches, img_matches, Scalar::all(-1), Scalar::all(-1),
                     vector<char>(), DrawMatchesFlags::NOT_DRAW_SINGLE_POINTS );
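As noted above, for small sets of key points a brute-force matcher is preferable to FLANN. In OpenCV 2.4 a drop-in alternative is BFMatcher with the L2 norm (appropriate for the float SURF descriptors); a minimal sketch reusing the descriptor matrices from above:

        // Sketch: brute-force alternative to FlannBasedMatcher for small key point sets.
        BFMatcher bf_matcher( NORM_L2 );
        vector< DMatch > bf_matches;
        bf_matcher.match( descriptors_object, descriptors_scene, bf_matches );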


Using homography


In computer vision, any two images of the same flat object in space are related by a homography (assuming a pinhole camera model). In other words, it is a plane-to-plane transformation. That is, having a set of points on the target object and the set of points matched to them in the scene, we can find the correspondence between them in the form of a homography matrix H (and vice versa). This transformation is found with the RANSAC algorithm, which iteratively estimates the homography from randomly selected point correspondences (4 points in the object image and the 4 matched points in the scene).

        //-- Localize the object
        vector<Point2f> obj;
        vector<Point2f> scene;

        for( size_t i = 0; i < good_matches.size(); i++ )
        {
            obj.push_back( keypoints_object[ good_matches[i].queryIdx ].pt );
            scene.push_back( keypoints_scene[ good_matches[i].trainIdx ].pt );
        }

        Mat H = findHomography( obj, scene, CV_RANSAC );
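As a side note, findHomography can also fill a RANSAC inlier mask, which gives a simple sanity check on H before drawing the bounding box below. A minimal sketch in which the call above is replaced; the threshold of 8 inliers is an arbitrary assumption:

        // Sketch (assumption): request the RANSAC inlier mask and count the inliers;
        // this call would replace the findHomography call above.
        vector<unsigned char> inlier_mask;
        Mat H = findHomography( obj, scene, CV_RANSAC, 3, inlier_mask );
        int inliers = countNonZero( inlier_mask );
        if( inliers < 8 ) // arbitrary sanity threshold
            cout << "Homography may be unreliable: only " << inliers << " inliers" << endl;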


Next, we take the 4 corner points of the target object and map them with the found transformation onto the scene image: this gives the bounding quadrilateral of the object in the scene. Note that when drawing the lines we add Point2f( img_object.cols, 0 ) to each point, because in img_matches the target image (left) and the scene (right) are placed side by side.

        //-- Take the corners of the "object" image
        std::vector<Point2f> obj_corners(4);
        obj_corners[0] = cvPoint( 0, 0 );
        obj_corners[1] = cvPoint( img_object.cols, 0 );
        obj_corners[2] = cvPoint( img_object.cols, img_object.rows );
        obj_corners[3] = cvPoint( 0, img_object.rows );
        std::vector<Point2f> scene_corners(4);

        //-- Map the object corners into the scene using the found homography H
        perspectiveTransform( obj_corners, scene_corners, H );

        //-- Draw lines between the mapped corners (the object in the scene)
        line( img_matches, scene_corners[0] + Point2f( img_object.cols, 0),
              scene_corners[1] + Point2f( img_object.cols, 0), Scalar( 0, 255, 0), 4 );
        line( img_matches, scene_corners[1] + Point2f( img_object.cols, 0),
              scene_corners[2] + Point2f( img_object.cols, 0), Scalar( 0, 255, 0), 4 );
        line( img_matches, scene_corners[2] + Point2f( img_object.cols, 0),
              scene_corners[3] + Point2f( img_object.cols, 0), Scalar( 0, 255, 0), 4 );
        line( img_matches, scene_corners[3] + Point2f( img_object.cols, 0),
              scene_corners[0] + Point2f( img_object.cols, 0), Scalar( 0, 255, 0), 4 );

        //-- Show the detected matches
        imshow( "Good Matches & Object detection", img_matches );

        if( waitKey(30) >= 0 ) break;
    } // end of the capture loop


Summary


Unfortunately, many people's expectations somewhat exceed the state of the art in computer vision. I tried this code on the example of recognizing a chocolate bar. I had to turn the bar in my hands for a while before I understood within what range of positions it is recognized stably. Given the variability of real situations, recognition stability is headache No. 1. Still, this example is a basic one and can be modified.



Literature


1. www.vision.ee.ethz.ch/~surf/eccv06.pdf
2. www.sci.utah.edu/~gerig/CS6640-F2010/tutorial2-homographies.pdf
3. engineering.purdue.edu/kak/courses-i-teach/ECE661.08/solution/hw4_s1.pdf

Source: https://habr.com/ru/post/155651/

