
Automatic text recognition in video

This article is a translation of the article "Automatic text recognition in digital videos" by Rainer Lienhart and Frank Stuber, University of Mannheim, Germany.

Abstract


We develop algorithms for the automatic segmentation of characters in films, extracting text from title sequences, captions, and closing credits. Our algorithms exploit the typical characteristics of text in video to improve segmentation quality and, as a result, recognition performance. The output is a set of individual characters extracted from the frames; they can be analyzed by any OCR software. The recognition results for the multiple instances of the same character in consecutive frames are combined to improve recognition quality and to compute the final result. We tested our algorithms in a series of experiments with video clips recorded from television and achieved good segmentation results.

Introduction


In the multimedia era, video is becoming an increasingly important and common way of conveying information. However, most current video data is unstructured, that is, it is stored and displayed only as pixels. There is no additional information about the video content: year of release, cast, directors, costume designers, filming locations, positions and types of scene breaks, and so on. The usefulness of such raw video is therefore limited, since it rules out efficient and productive search. There are thousands of MPEG videos online, and it is rare to find any information about the content and structure of these films beyond a title and a brief description, so searching, for example, for specific scenes or content is a difficult task. We would all like to have more detailed information about video content than we do now.

Usually this information has to be entered manually, but manual annotation of video is very expensive and labor-intensive. Thus, content-based searching and browsing creates a need for automatic video analysis tools for indexing [2][15][16][17]. One important source of information about a video is the text contained in it. We have developed algorithms for the automatic segmentation and recognition of characters in video. These algorithms automatically and reliably extract text from title sequences, captions, and closing credits. The algorithms explicitly exploit the typical characteristics of text produced by video title generators or similar devices and methods to improve segmentation quality and, as a result, recognition performance.
The rest of the article is organized as follows. Section 2 discusses related work on text segmentation and text recognition in video. Section 3 describes the characteristics of characters and text appearing in films, and Section 4 presents our feature-based approach to segmenting candidate character regions, which builds on the characteristics listed in Section 3. Section 5 discusses our recognition algorithms, followed by some notes on the implementation in Section 6. In Section 7 we present empirical results as evidence that our algorithms produce good segmentation results. Finally, we conclude with a summary and an outlook on future work.

Related work


Existing work on text recognition has so far focused mainly on optical character recognition in printed and handwritten documents, since there is great demand for document readers in office automation systems and on the market. Such systems have reached a high degree of maturity [6]. Text recognition work can also be found in industrial applications, most of it focused on a very narrow domain. An example is automatic license plate recognition [13]: the proposed system works only for characters and digits whose background is mostly monochrome and whose position is constrained. Beyond that, little work has been published on recognizing characters in text appearing in video.

Michael A. Smith and Takeo Kanade briefly describe in [12] a method that concentrates on extracting regions of video frames that contain textual information. However, they do not prepare the detected text for standard optical character recognition software; in particular, they do not attempt to determine character outlines or segment individual characters. They keep the bitmaps containing text as they are, leaving it to a human to read them. They characterize text as a "horizontal rectangular structure of grouped sharp edges" [12] and use this feature to identify text segments. We also use this feature in our approach, during the fill-factor stage; unlike in their approach, however, it plays only a minor role in segmenting candidate regions. We additionally exploit the multiple instances of the same text under different conditions to improve segmentation and recognition performance.

Another interesting approach to recognizing text in scene images is that of Jun Ohya, Akio Shio, and Shigeru Akamatsu. Characters in scene images may suffer from many kinds of noise. Text in scene images exists in three-dimensional space, so it can be rotated, tilted, partially hidden, partially shadowed, and captured under uncontrolled illumination [7]. In view of the many possible degrees of freedom of such text, Ohya et al. restrict it to being almost upright, monochromatic, and unconnected in order to facilitate detection. This makes their approach feasible for our purpose as well, even though they focus on still images rather than video streams and therefore do not exploit the characteristics typical of text appearing in video. Moreover, we focus on text produced by video title generators rather than on scene text.

Features of characters in titles and in opening and closing text


Text in video serves several purposes. At the beginning and/or end of a broadcast it informs the audience about the title, director, actors, producers, and so on. Overlaid text also provides important information about the subject currently being covered: text in sports broadcasts often reports scores, while news programs and documentaries show the name and location of the speaker and/or key facts about the topic. Text in advertising conveys a slogan or a product or company name. What these textual overlays have in common is that they are clearly intended for the viewer: they do not appear by accident; they are superimposed on the frame and made to be read.

In addition, text can also appear in scenes as part of them: in a video of a shopping mall, for example, many store names can be seen. Such scene text is difficult to detect or recognize: it can be tilted at any angle, distorted, under any lighting, and on flat or curved surfaces (for example, text on a T-shirt).

We do not deal with scene text in this article; rather, we concentrate solely on text added to the video artificially, in particular by a video title generator. The reason is that text superimposed on a scene differs fundamentally from text contained in the scene, and we did not want to tackle two different problems at once. Thus, in what follows, the words "text" and "character" refer exclusively to video titles produced by title machines or similar devices and methods.

Before words and text can be recognized, the features of their appearance must be analyzed.

Our list includes:

- characters are monochrome;
- characters are rigid: their shape, size, orientation, and color do not change from frame to frame;
- characters obey size restrictions: they are neither too small nor too large;
- characters are either stationary or move linearly (for example, scrolling credits);
- characters contrast with their background, since artificial text is made to be read;
- the same text appears in many consecutive frames;
- characters appear in groups, as words arranged in horizontal lines.

Any method for segmenting and recognizing artificial text should build on these observable features. Next we describe how we use them.

Segmentation of candidate character regions


In theory, the segmentation step extracts all pixels belonging to text appearing in the video. However, this is impossible without knowing where the characters are and what they look like. Thus, the practical goal of the segmentation stage is to divide the pixels of each video frame into two classes:

- regions that may contain text, and
- regions that certainly do not contain text.
Regions that do not contain text are discarded, since they cannot contribute to recognition, while regions that may contain text are retained. We call them candidate regions, because they form a superset (though not a tight one) of the actual character regions. They are passed to the recognition stage for evaluation.

Here we describe the segmentation process. It can be divided into three parts; each part removes from the candidate regions of the previous part further regions that do not contain text, thereby shrinking the candidate regions closer and closer to the actual character regions. First, we process each frame independently of the others. Then we exploit the multiple instances of the same text in successive frames. Finally, we analyze the contrast of the remaining regions in each frame to further reduce the candidate regions and construct the final candidate regions. In each part, we use the character features described in Section 3.

Segmentation of candidate character regions in single frames


Monochromatic


We start with the original frame (Fig. 1). Because characters are assumed to be monochrome, the first processing step partitions the frame into segments of homogeneous gray level. We use the split-and-merge algorithm proposed by Horowitz and Pavlidis [4] to perform the segmentation. It is based on a hierarchical decomposition of the frame. Following Horowitz and Pavlidis, the split process starts with the entire image as the initial segment, which is then divided into quarters. Each quarter is tested against a homogeneity criterion to determine whether the segment is "homogeneous enough". If it is not, the segment is again divided into quarters. This process is applied recursively until only homogeneous segments remain. We use the standard homogeneity criterion: the difference between the highest and lowest gray-level intensities must lie below a certain threshold, which we call max_split_distance. Each homogeneous segment is assigned its mean gray level. In the subsequent merge process, adjacent segments are merged if their mean gray levels differ by less than the parameter max_merge_distance. As a result, every monochrome character appearing in the image should be contained in some monochrome segment. For our example frame, the split-and-merge algorithm produces the image shown in Figure 2.
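To make the split phase concrete, here is a minimal C sketch of the recursive quadtree split with the homogeneity criterion described above; the merge phase, which joins adjacent segments whose mean gray levels differ by less than max_merge_distance, is omitted. The Frame struct, the callback, and all names are illustrative assumptions, not the authors' actual code.

```c
#include <stdint.h>

/* Grayscale frame: width*height pixels, row-major, values 0..255. */
typedef struct { int width, height; const uint8_t *pixels; } Frame;

/* Homogeneity criterion from the text: the difference between the highest
   and lowest gray level inside the segment must stay below the threshold
   max_split_distance. */
static int is_homogeneous(const Frame *f, int x, int y, int w, int h,
                          int max_split_distance)
{
    uint8_t lo = 255, hi = 0;
    for (int r = y; r < y + h; r++)
        for (int c = x; c < x + w; c++) {
            uint8_t v = f->pixels[r * f->width + c];
            if (v < lo) lo = v;
            if (v > hi) hi = v;
        }
    return (hi - lo) < max_split_distance;
}

/* Recursive split phase of "split and merge": divide the region into
   quarters until every remaining segment is homogeneous, then report it
   via the callback (e.g. to assign its mean gray level). */
static void split(const Frame *f, int x, int y, int w, int h,
                  int max_split_distance,
                  void (*emit)(int x, int y, int w, int h, void *ctx),
                  void *ctx)
{
    if (w <= 1 || h <= 1 ||
        is_homogeneous(f, x, y, w, h, max_split_distance)) {
        emit(x, y, w, h, ctx);
        return;
    }
    int w2 = w / 2, h2 = h / 2;
    split(f, x,      y,      w2,     h2,     max_split_distance, emit, ctx);
    split(f, x + w2, y,      w - w2, h2,     max_split_distance, emit, ctx);
    split(f, x,      y + h2, w2,     h - h2, max_split_distance, emit, ctx);
    split(f, x + w2, y + h2, w - w2, h - h2, max_split_distance, emit, ctx);
}
```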

Size restrictions


The segmented image now consists of regions of uniform gray-level intensity. Some regions are too large and others too small to be instances of characters. Consequently, we discard monochrome segments whose width or height exceeds max_size, as well as connected monochrome segments whose aggregate size is less than min_size. The result for our example can be seen in Figure 3 (the deleted segments are shown in black).
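A possible sketch of this size filter, assuming the segments produced by split-and-merge are available with a bounding box and a pixel count (the struct and field names are assumptions; the paper applies min_size to the aggregate size of connected segments, which is simplified here to a per-segment count):

```c
/* Segment produced by split-and-merge: bounding box plus pixel count
   (field names are illustrative). */
typedef struct {
    int min_x, min_y, max_x, max_y;
    int pixel_count;
    int keep;
} Segment;

/* Size restrictions: drop segments whose bounding box is wider or taller
   than max_size, and drop segments whose size falls below min_size. */
static void apply_size_limits(Segment *segs, int n, int max_size, int min_size)
{
    for (int i = 0; i < n; i++) {
        int w = segs[i].max_x - segs[i].min_x + 1;
        int h = segs[i].max_y - segs[i].min_y + 1;
        segs[i].keep = (w <= max_size && h <= max_size &&
                        segs[i].pixel_count >= min_size);
    }
}
```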

Figure 1: Original frame

Figure 2: Figure 1 after the split-and-merge algorithm

Improved segmentation based on consecutive frames


Since we analyze text produced by video title generators, the same text usually appears in a number of consecutive frames. Obviously, the segmentation result can be improved by exploiting these multiple instances of the same text, because each character often looks slightly different from frame to frame due to noise, background changes, and/or position changes. To do so, we must find the corresponding candidate character regions in successive frames.

Motion analysis


As mentioned in Section 3, the text considered here is either stationary or moving linearly, and even stationary text can drift by a few pixels around its original position from frame to frame. Therefore, we perform motion analysis to find the corresponding candidate regions in consecutive frames. Motion is estimated by block matching, which is suitable for rigid objects; characters can be regarded as rigid because their shape, orientation, and color do not change. Moreover, block matching is very popular and is used for motion compensation in international video compression standards such as H.261 and MPEG [3]. Our matching criterion is the minimum mean absolute difference [14]. The mean absolute difference (MAD) is defined as

    MAD(d) = (1 / |R|) · Σ_{(x, y) ∈ R} | I_n(x, y) − I_{n+1}(x + d_x, y + d_y) |

where R denotes the block for which the displacement vector is to be computed, |R| is the number of pixels in R, I_n is the gray-level intensity of frame n, and d = (d_x, d_y) is a candidate displacement. The displacement estimate for block R is the offset d for which the MAD value is minimal. The search range is limited, and the limit follows from the maximum speed of scrolling titles.
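The following C sketch shows MAD-based block matching as described above: an exhaustive search over a limited search range returns the displacement with minimal MAD. The frame layout, function names, and the search_range parameter are assumptions for illustration.

```c
#include <stdint.h>
#include <stdlib.h>
#include <float.h>

/* Mean absolute difference between block R (top-left bx,by, size bw x bh)
   in frame n and the block displaced by (dx,dy) in frame n+1. A displaced
   block that leaves frame n+1 is treated as an invalid candidate. */
static double mad(const uint8_t *cur, const uint8_t *next,
                  int width, int height,
                  int bx, int by, int bw, int bh, int dx, int dy)
{
    long sum = 0;
    for (int y = 0; y < bh; y++)
        for (int x = 0; x < bw; x++) {
            int sx = bx + x + dx, sy = by + y + dy;
            if (sx < 0 || sy < 0 || sx >= width || sy >= height)
                return DBL_MAX;              /* block left the frame */
            sum += abs((int)cur[(by + y) * width + (bx + x)] -
                       (int)next[sy * width + sx]);
        }
    return (double)sum / (bw * bh);
}

/* Exhaustive search over the limited search range: the displacement
   estimate is the offset with minimal MAD. */
static void estimate_motion(const uint8_t *cur, const uint8_t *next,
                            int width, int height,
                            int bx, int by, int bw, int bh,
                            int search_range,        /* assumed bound */
                            int *best_dx, int *best_dy)
{
    double best = DBL_MAX;
    *best_dx = *best_dy = 0;
    for (int dy = -search_range; dy <= search_range; dy++)
        for (int dx = -search_range; dx <= search_range; dx++) {
            double m = mad(cur, next, width, height, bx, by, bw, bh, dx, dy);
            if (m < best) { best = m; *best_dx = dx; *best_dy = dy; }
        }
}
```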

The next question is how to determine the location and size of the blocks used for motion estimation. Obviously, the quality of the displacement estimate depends on the position and size of the block we try to match against its instance in the following frame. For example, if the chosen block is too large, the algorithm may fail to find an equivalent block, since parts of the block may leave the frame (as happens with scrolling credits), or parts of the block in the next frame may be correctly recognized as background although they were not filtered out in the first frame and remained part of the candidate region.

To avoid these problems, we exploit the fact that characters appear as words and therefore line up in rows. We select our block R with the following algorithm: the input image is converted to a two-color image (background = black, everything else = white), and each white pixel is dilated by a given radius. As Figure 4 shows, characters and words now form compact clusters. We enclose each connected cluster in a bounding rectangle and define it as a block R. If the fill factor of the rectangle is above a certain threshold, the block is used for motion analysis; if it is below the threshold, the block is recursively divided into smaller blocks until the fill factor of each resulting block exceeds the threshold. For every resulting block that meets the required fill factor, motion analysis is performed by block matching.
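A minimal sketch of the fill-factor test and the recursive block splitting, assuming the binarized and dilated mask and a cluster's bounding rectangle are already available (the names, the callback, and the min_block stopping bound are illustrative assumptions):

```c
#include <stdint.h>

/* Binary mask after thresholding and dilation: 1 = candidate pixel,
   0 = background, row-major, width*height entries. */

/* Fraction of candidate pixels inside the rectangle (its fill factor). */
static double fill_factor(const uint8_t *mask, int width,
                          int x, int y, int w, int h)
{
    long set = 0;
    for (int r = y; r < y + h; r++)
        for (int c = x; c < x + w; c++)
            set += mask[r * width + c];
    return (double)set / ((double)w * h);
}

/* If a block's fill factor is below the threshold, split it into quarters
   and retry, so that only compact text-like blocks reach motion analysis.
   min_block is an assumed lower bound that stops the recursion. */
static void select_blocks(const uint8_t *mask, int width,
                          int x, int y, int w, int h,
                          double fill_threshold, int min_block,
                          void (*use_for_motion)(int, int, int, int, void *),
                          void *ctx)
{
    if (w < min_block || h < min_block)
        return;                               /* too small, give up on it */
    if (fill_factor(mask, width, x, y, w, h) >= fill_threshold) {
        use_for_motion(x, y, w, h, ctx);      /* block R for block matching */
        return;
    }
    int w2 = w / 2, h2 = h / 2;
    select_blocks(mask, width, x,      y,      w2,     h2,     fill_threshold, min_block, use_for_motion, ctx);
    select_blocks(mask, width, x + w2, y,      w - w2, h2,     fill_threshold, min_block, use_for_motion, ctx);
    select_blocks(mask, width, x,      y + h2, w2,     h - h2, fill_threshold, min_block, use_for_motion, ctx);
    select_blocks(mask, width, x + w2, y + h2, w - w2, h - h2, fill_threshold, min_block, use_for_motion, ctx);
}
```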

Figure 3: Figure 2 after applying the size restrictions

Figure 4: Figure 3 after conversion to a two-color image and dilation; blocks are marked with rectangles

Blocks with no equivalent in the following frame are discarded, as are blocks that do have an equivalent in the following frame but show a significant difference in mean gray-level intensity. The resulting image is passed to the next segmentation stage (Figure 5).

Figure 5: Result of applying motion analysis to two consecutive frames of Figure 3

Improved segmentation of candidate regions using contrast analysis


Contrast analysis


Characters produced by video title generators usually contrast strongly with their background, so strong contrast is also a prerequisite for candidate regions. Each region remaining from the previous segmentation step is therefore checked to see whether at least part of its contour contrasts strongly with the background and/or with other remaining regions. In particular, the dark drop shadows that often lie beneath characters to improve readability should produce very strong contrast between character regions and parts of their surroundings. If no such contrast is found for a region, we conclude that it cannot belong to a character and discard it.

Contrast analysis is performed by the following processing pipeline: we compute a Canny edge map [1] and apply a fairly high threshold (called canny_threshold) to restrict the response to sharp edges. The resulting edge image is dilated by dilation_radius. Then the regions from the motion-analysis stage are discarded if they do not intersect any dilated edge. For our example, this leads to the result shown in Figure 6.
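A sketch of the last two steps of this pipeline in C, assuming a thresholded Canny edge map has already been computed by some edge detector; the dilation uses a square neighborhood as an approximation, and the data layout and names are assumptions.

```c
#include <stdint.h>
#include <string.h>

/* Dilate a binary edge map (1 = strong edge after applying canny_threshold
   to the Canny response) by dilation_radius. */
static void dilate(const uint8_t *edges, uint8_t *out,
                   int width, int height, int dilation_radius)
{
    memset(out, 0, (size_t)width * height);
    for (int y = 0; y < height; y++)
        for (int x = 0; x < width; x++) {
            if (!edges[y * width + x]) continue;
            for (int dy = -dilation_radius; dy <= dilation_radius; dy++)
                for (int dx = -dilation_radius; dx <= dilation_radius; dx++) {
                    int nx = x + dx, ny = y + dy;
                    if (nx >= 0 && ny >= 0 && nx < width && ny < height)
                        out[ny * width + nx] = 1;
                }
        }
}

/* A candidate region survives contrast analysis only if at least one of
   its pixels coincides with a dilated strong edge. `region` is a binary
   mask of the region's pixels (an assumed representation). */
static int region_has_contrast(const uint8_t *region,
                               const uint8_t *dilated_edges,
                               int width, int height)
{
    for (int i = 0; i < width * height; i++)
        if (region[i] && dilated_edges[i])
            return 1;
    return 0;
}
```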

Fill factor and width-to-height ratio


For each remaining candidate region, the blocks and their fill factors are computed again, as described in the motion-analysis section above. If the fill factor is too low, the corresponding regions are discarded. Then the width-to-height ratio of each block is computed; if it falls outside certain limits, i.e. does not lie between min_ratio and max_ratio, the corresponding regions are also discarded. This yields the final segmentation of the image. Figure 7 shows it for our example video frame.
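A compact sketch of this final filter, with an illustrative Block struct holding a precomputed fill factor (the struct, names, and thresholds are assumptions):

```c
/* Final per-block checks: a block is kept only if its fill factor is high
   enough and its width-to-height ratio lies between min_ratio and
   max_ratio (parameter names follow the text). */
typedef struct { int x, y, w, h; double fill; } Block;

static int keep_block(const Block *b, double min_fill,
                      double min_ratio, double max_ratio)
{
    double ratio = (double)b->w / (double)b->h;
    return b->fill >= min_fill && ratio >= min_ratio && ratio <= max_ratio;
}
```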

Segmentation result


At this point the candidate character regions of each frame have been extracted. The regions are written into new frames, thereby creating a new video. In these frames, pixels belonging to candidate regions keep their original gray level; all other pixels are marked as background. Segmentation is complete, and the new video can be analyzed frame by frame with any standard OCR software.

Figure 6: Result of the contrast analysis applied to Figure 5

Figure 7: Final segmentation

Character recognition


The segmentation step yields a video showing the candidate regions. For character recognition, each frame must be analyzed by OCR software. We implemented our own OCR software based on feature-vector classification, as described in [11]. However, this software is far from perfect, and using a commercial software package should lead to higher recognition rates.

Since we analyze video, each character appears in several consecutive frames. Thus, all recognition results for instances of the same character must be combined into a single result. Corresponding characters and groups of characters are identified by the motion analysis described in Section 4.2; this allows us to associate multiple independent recognition results with the same character or word. The most frequent result becomes the final recognition result.
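A minimal C sketch of combining the per-frame recognition results of one character by majority vote, as described above; the representation of the hypotheses as one char per frame, with 0 marking a rejected or unrecognized instance, is an assumption.

```c
#include <stddef.h>

/* Combine the OCR results for one character across consecutive frames by
   majority vote: the most frequently recognized label wins. */
static char combine_recognitions(const char *hypotheses, size_t n)
{
    int counts[256] = {0};
    for (size_t i = 0; i < n; i++)
        counts[(unsigned char)hypotheses[i]]++;
    counts[0] = 0;                          /* 0 marks "no result"; ignore it */
    int best = 1;
    for (int c = 2; c < 256; c++)
        if (counts[c] > counts[best])
            best = c;
    return counts[best] ? (char)best : 0;   /* 0 if nothing was recognized */
}
```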

Implementation


The segmentation algorithms were implemented on a SUN SPARCstation 5 under Solaris 2.4 and on a DEC ALPHA 3000 under Digital Unix 3.2 in 2,300 lines of C code. They are part of the MoCA Workbench [5] and require the Vista 1.3 library as a basis [9][10]. The OCR software is implemented in 1,200 lines of C code and was trained on 14 different PostScript fonts. However, the second part of the character recognition process, i.e. combining all recognition results into one final text output, is still in progress and will be completed soon.

Experimental results


We tested our segmentation approach on 8 digital video samples. The video data was digitized from several German and international television broadcasts in the form of 24-bit JPEG images with a quality factor of 508, a resolution of 384 by 288 pixels, and a rate of 14 frames per second. All JPEG images were decoded as grayscale images only. We have two samples for each of the following classes:

- stationary text in a stationary scene,
- stationary text in a moving scene,
- moving text in a stationary scene,
- moving text in a moving scene.

Moving text means that the text moves across the frame, for example from bottom to top or from right to left. Analogously, a moving scene denotes a scene with substantial motion or, more generally, with dramatic visual change. A stationary scene is either a still image or a very static scene, for example a speaker scene in a newscast. Stationary text remains in a fixed position. Moreover, the characters in the video samples vary in size, color, and shape.

In our experiments we used the following parameter values:


The experimental results are shown in Table 1. The first column identifies the type of video, followed by its length measured in frames. The third column contains the actual number of characters in the corresponding video sample; it was obtained by writing down the entire title text appearing in the sample and counting its characters. Thus, the character count refers to the text appearing in the video, not to the sum of the characters displayed in all frames. The fourth column gives the number and percentage of characters segmented into candidate regions by our segmentation algorithms. The segmentation performance in our experiments is always very high, from 86% to 100%, and thus provides experimental evidence of the quality of our algorithms. For video samples with moving text and/or a moving scene, the segmentation performance even ranges from 97% to 100%. These measurements are consistent with our approach: it cannot benefit from multiple instances in a stationary scene with stationary text, since all instances of the same character then have the same background; therefore the segmentation performance is lower there.

Readers interested in viewing the eight video clips can download them from here. The suitability of the characters in the candidate regions for the recognition process cannot be evaluated here, since we deal only with character segmentation; such an evaluation can only be performed together with OCR software and should be investigated in future experiments.

Another important quality measure of the segmentation process is the average reduction factor of the relevant pixels. It captures how much the number of relevant pixels shrinks during segmentation, reducing the workload of the recognition process. Moreover, the greater the reduction, the fewer non-character regions remain part of the candidate character regions, which in turn reduces misrecognitions by the OCR software. The average reduction factor is defined as

    average reduction factor = (1 / N) · Σ_{n = 1..N} (number of pixels in the candidate regions of frame n) / (total number of pixels in frame n), where N is the number of frames in the video sample.
The last column of Table 1 shows the number of characters per frame for the video samples; it correlates with the average reduction factor.
Table 1: Segmentation results

Video type                        | Frames | Characters | Found in candidate regions | Reduction | Characters per frame
Stationary text, stationary scene |    400 |        137 | 131 (96%)                  |     0.058 | 0.34
Stationary text, stationary scene |    400 |         92 |  79 (86%)                  |     0.028 | 0.23
Stationary text, moving scene     |    116 |         21 |  21 (100%)                 |     0.035 | 0.18
Stationary text, moving scene     |    400 |        148 | 144 (97%)                  |     0.037 | 0.36
Moving text, stationary scene     |    139 |        264 | 264 (100%)                 |     0.065 | 1.90
Moving text, stationary scene     |    190 |        273 | 273 (100%)                 |     0.112 | 1.44
Moving text, moving scene         |    202 |        373 | 372 (99.7%)                |     0.130 | 1.85
Moving text, moving scene         |    400 |        512 | 512 (100%)                 |     0.090 | 1.28

To give empirical evidence of the robustness of our algorithms, we tested them on a ninth video sample containing no text at all. This sample consisted of 500 frames, and its average reduction factor was 0.038. This value is very low compared to the values for the video samples containing text. Thus, our algorithm is also able to detect scenes that probably contain little or no text, although the final decision rests with the OCR tool. Some readers may ask what happens with text that is part of the scene: does it distort the experimental results? In general, scene text is not extracted. But if it has the same features as artificial text, it is extracted; this usually happens for scene text that serves a purpose similar to that of artificial text, for example a close-up shot of a city name sign.

Conclusions and perspectives


We have presented algorithms for the automatic segmentation of characters in motion pictures that automatically and reliably extract text in title sequences, captions, and closing credits. The experimental results on 8 digital video samples, comprising 2,247 frames in total, are very promising: our algorithms extracted between 86% and 100% of all artificially added text in our digital video samples, and for samples with moving text and/or a moving scene the segmentation performance even ranges from 97% to 100%. The resulting candidate regions can easily be analyzed with standard OCR software. Our recognition algorithms combine all recognition results for instances of the same character into a single result.

Currently, our algorithms operate on grayscale images. This makes it hard to detect, for example, yellow text on a gray-blue background, since these colors hardly differ in a grayscale image; consequently, our approach cannot reliably segment such text. We plan to extend the algorithms to work on color images in a suitable color space and to compute the contrast there.

In the future, we also plan to integrate the text segmentation and recognition modules into our automatic video abstracting system, so that movie titles and the most important actors can be extracted, since they are an integral part of an abstract. The algorithms will also be built into our system for automatic recognition of video genres [2] to improve its performance, since certain text can be characteristic of certain genres.

References
[1] John Canny, “A Computational Approach to Edge Detection”, IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 8, No. 6, pp. 679-697, Nov. 1986.
[2] Stefan Fischer, Rainer Lienhart, and Wolfgang Effelsberg, “Automatic Recognition of Film Genres”, Proc. ACM Multimedia 95, San Francisco, CA, Nov. 1995, pp. 295-304.
[3] D. Le Gall, “MPEG: A Video Compression Standard for Multimedia Applications”, Communications of the ACM, Vol. 34, No. 4, April 1991.
[4] SL Horowitz and T. Pavlidis, “Picture Segmentation by a Traversal Algorithm”, Comput. Graphics Image Process. 1, pp. 360-372, 1972.
[5] Rainer Lienhart, Silvia Pfeiffer, and Wolfgang Effelsberg, “The MoCA Workbench”, University of Mannheim, Computer Science Department, Technical Report TR-34-95, November 1996.
[6] Shunji Mori, Ching Y. Suen, Kazuhiko Yamamoto, “Historical Review of OCR Research and Development”, Proceedings of the IEEE, Vol. 80, No. 7, pp. 1029-1058, July 1992.
[7] Jun Ohya, Akio Shio, and Shigeru Akamatsu, “Recognizing Characters in Scene Images”, IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 16, No. 2, pp. 214-220, 1994.
[8] William B. Pennebaker and Joan L. Mitchell, “JPEG Still Image Data Compression Standard”, Van Nostrand Reinhold, New York, 1993.
[9] Arthur R. Pope, Daniel Ko, David G. Lowe, “Introduction to Vista Programming Tools”, Department of Computer Science, University of British Columbia, Vancouver.
[10] Arthur R. Pope and David G. Lowe, “Vista: A Software Environment for Computer Vision Research”, Department of Computer Science, University of British Columbia, Vancouver.
[11] Alois Regl, “Methods of Automatic Character Recognition”, Ph. D. thesis, Johannes Kepler University Linz, Wien 1986 (in German).
[12] Michael A. Smith and Takeo Kanade, “Video Skimming for Quick Browsing Based on Audio and Image Characterization”, Carnegie Mellon University, Technical Report CMU-CS-95-186, July 1995.
[13] M. Takatoo et al., “Gray Scale Image Processing Technology Applied to Vehicle License Number Recognition System”, in Proc. Int. Workshop Industrial Applications of Machine Vision and Machine Intelligence, pp. 76-79, 1987.
[14] A. Murat Tekalp, “Digital Video Processing”, Prentice Hall Signal Processing Series, ISBN 0-13-190075-7, 1995.
[15] Ramin Zabih, Justin Miller, and Kevin Mai, “A Feature-Based Algorithm for Detecting and Classifying Scene Breaks”, Proc. ACM Multimedia 95, San Francisco, CA, pp. 189-200, Nov. 1995.
[16] H. J. Zhang, C. Y. Low, S. W. Smoliar, and J. H. Wu, “Video Parsing, Retrieval and Browsing: An Integrated and Content-Based Solution”, Proc. ACM Multimedia 95, San Francisco, CA, pp. 15-24, Nov. 1995.
[17] Hong Jiang Zhang and Stephen W. Smoliar, “Developing Power Tools for Video Indexing and Retrieval”, Proc. SPIE Conf. on Storage and Retrieval for Image and Video Databases, San Jose, CA, pp. 140-149, 1994.

Source: https://habr.com/ru/post/332840/

