INTERNATIONAL JOURNAL OF ELECTRONICS AND COMMUNICATION ENGINEERING & TECHNOLOGY (IJECET)
ISSN 0976 – 6464(Print)
ISSN 0976 – 6472(Online)
Volume 5, Issue 1, January (2014), pp. 74-81
© IAEME: www.iaeme.com/ijecet.asp
Journal Impact Factor (2013): 5.8896 (Calculated by GISI)
www.jifactor.com
REVIEW OF METHODS OF SCENE TEXT DETECTION AND ITS
CHALLENGES
Ms. Saumya Sucharita Sahoo
M.E. (E&TC), Genba Sopanrao Moze College of Engineering, Pune.
Prof. Smita Tikar
ABSTRACT
Over the last decade, many methods have been presented in the research area of scene text detection using image processing techniques. Automated systems have been presented to address the challenges of detecting and localizing text information in natural scene images. The application areas of such automated systems include keyword-based image search, tourist guides, image indexing using text, image text translation systems, and so on. The text data available in images or videos is important information required for automatic annotation, searching, and structuring. An automated system for extracting text information from images basically consists of phases such as detection, preprocessing, localization, tracking, enhancement, extraction, and recognition. The challenging part of automated scene text detection systems is that scene images often contain text that varies in style, size, alignment, and orientation, frequently against complex backgrounds. Many research methods for the automatic detection of text in natural images have been presented previously, and much research is still going on in this area. The main goal of this paper is to present a survey of the different methods presented for text detection in images. A detailed description of the work done on automatic detection of text from scene images is presented.
Keywords: Text detection, Scene text detection.
I. INTRODUCTION
Current work in the research field of content retrieval from videos and images has identified a wide range of application areas where automated text extraction from natural scene images is required. Recently, a new application has been developed in mobile banking, provided by banks to their customers with the aim of enabling banking users to execute transactions by simply sending an image of their cheque or passbook to the bank's main server. Another application is the tourist guide: image text translation systems help tourists understand display boards written in different languages, and also help visually impaired people. All these major applications are based on the concept of automatic text extraction from scene images. Such an automated system needs to efficiently detect, localize, and extract the text-related information available in natural scene images. Figure 1 below shows the overall processing of such automated systems. As the figure shows, the first phase is image acquisition, which is possible through input videos or cameras; the quality depends on the camera used. Once the image is acquired, the next step is preprocessing: the contrast of the image is enhanced and noise is removed so that detection accuracy improves. The next phase is to detect and localize the text present in the preprocessed image. This is followed by the recognition phase, where the text is extracted and recognized using different methods. The detected text regions are given to an OCR engine, which recognizes the characters and produces the textual output. Preprocessing is mainly required because the input image may differ in size, angle, orientation, alignment, and so on, and hence the image must be smoothed. For each phase of this system, different methods have been suggested by various researchers, each with their own advantages and disadvantages.
Figure 1: Text Extraction Process from Scene Images
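To make the flow concrete, the sketch below chains the acquisition, preprocessing, and recognition phases in Python with OpenCV and Tesseract. It is a minimal illustration, not a method from the surveyed literature; the denoising and contrast parameters, the file name, and the use of the whole image as the text region are assumptions.

# Illustrative sketch of the text-extraction pipeline of Figure 1,
# using OpenCV and pytesseract. Parameter values are assumptions.
import cv2
import pytesseract

def extract_scene_text(image_path):
    # 1. Acquisition: load the camera image.
    image = cv2.imread(image_path)
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

    # 2. Preprocessing: denoise, then enhance contrast with CLAHE.
    gray = cv2.fastNlMeansDenoising(gray, h=10)
    clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
    gray = clahe.apply(gray)

    # 3. Detection/localization would go here; as a placeholder we
    #    pass the whole preprocessed image to the recognizer.
    # 4. Recognition: OCR on the (assumed) text region.
    return pytesseract.image_to_string(gray)

if __name__ == "__main__":
    print(extract_scene_text("scene.jpg"))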
Since the 1990s, research on text detection and localization has been carried out, and many text detection algorithms have been presented. In this paper we aim to present a literature survey of the different techniques proposed for automatic text extraction from scene images, the research challenges, the performance metrics used, and so on. In Section II below we present a literature survey of the different methods presented by various authors for the extraction of text from scene images. In Section III we discuss the different challenges facing these systems. In Section IV we present the performance metrics used for the evaluation of scene text extraction methods.
II. REVIEW OF TEXT DETECTION METHODS
Several approaches for text detection in images and videos have been proposed in the past, including techniques for the automatic detection and translation of text in images and videos.
The distribution of edges, for example, is used in many text detection methods [7, 8, 9]. In these methods the edges are grouped together based on features such as size, color, and aspect ratio [10, 11]. Many researchers working on text detection and thresholding algorithms with various approaches have achieved good performance under certain constraints.
An early histogram-based global thresholding method, Otsu's method, is widely used in many applications [1]. A text detection and binarization method was proposed for Korean signboard images using k-means clustering [2], but finding the best value of k to achieve a good binary image is difficult in images with complex backgrounds and uneven lighting. The linear Niblack method was proposed to extract connected components, and texts were localized using a classifier algorithm [3]. Four different methods were suggested to extract text depending on character size [4].
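As a point of reference for Otsu's method [1], the snippet below shows a minimal global thresholding sketch in Python with OpenCV; the input file name is an assumption.

# Minimal sketch of Otsu's global thresholding with OpenCV.
import cv2

gray = cv2.imread("sign.jpg", cv2.IMREAD_GRAYSCALE)
# Otsu picks the threshold that minimizes intra-class variance;
# the value passed (0) is ignored when THRESH_OTSU is set.
threshold, binary = cv2.threshold(
    gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
print("Otsu threshold:", threshold)
cv2.imwrite("sign_binary.png", binary)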
In the work of Wu et al., a method was proposed to clean up and extract text using a histogram-based binarization algorithm [5]: the local threshold was picked at the first valley of the smoothed intensity histogram and used to achieve a good binarization result. A thresholding method was developed using intensity and saturation features to separate text from the background in color document images [6]. A system using the gray-level values at high-gradient regions as known data to interpolate the threshold surface of a document image was proposed in [7]. A layer-based approach using morphological operations was proposed to detect text in complex natural scene images [8].
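For the local (adaptive) family of methods such as Niblack's [3], a minimal sketch using scikit-image is shown below; the window size and the weight k are assumed values, not the settings used in the cited work.

# Sketch of Niblack local thresholding via scikit-image.
# T(x, y) = mean(x, y) + k * stddev(x, y) over a local window;
# window_size and k below are assumed, not tuned, values.
from skimage import io
from skimage.filters import threshold_niblack

gray = io.imread("sign.jpg", as_gray=True)
local_threshold = threshold_niblack(gray, window_size=25, k=0.2)
binary = gray > local_threshold  # True where pixel exceeds its local threshold
io.imsave("sign_niblack.png", (binary * 255).astype("uint8"))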
However, these methods impose various constraints and show many missed and false positive detections on natural scene images. This confirms that the detection of text from natural scenes is still a challenging issue. In our previous work we proposed a region-based method using the color contrast of the text and its surrounding pixels. Due to the limited color variation between text and its immediate background, finding the right threshold and detecting text patterns are the key issues. Based on the methods used to localize text regions, these approaches can be categorized into two main classes: connected component based methods and texture based methods.
Cai et al. [2] presented a text detection approach based on character features such as edge strength, edge density, and horizontal distribution. First, they apply a color edge detection algorithm in the YUV color space and filter out a range of non-text edges; then a local thresholding technique is used to keep only text regions with high contrast against the background. Finally, projection profile analysis is used to localize the text areas. An approach which operates directly on color images in the RGB color space was proposed by Lienhart and Effelsberg [1].
Character features such as monochromaticity and contrast within the local environment are used to decide whether a pixel is part of a connected component or not, segmenting each frame into suitable objects in this way. Then, regions are merged using the criterion of similar color. Finally, specific ranges of width, height, width-to-height ratio, and compactness of characters are used to discard all non-character regions.
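A geometric filter of this kind can be sketched as follows with OpenCV's connected component statistics; the bounds on area, aspect ratio, and compactness are illustrative assumptions rather than values from the cited works.

# Sketch: discard non-character connected components by geometry.
import cv2

binary = cv2.imread("sign_binary.png", cv2.IMREAD_GRAYSCALE)
n, labels, stats, _ = cv2.connectedComponentsWithStats(binary, connectivity=8)

candidates = []
for i in range(1, n):  # label 0 is the background
    x, y, w, h, area = stats[i]
    aspect = w / float(h)
    # Assumed bounds on size, aspect ratio, and compactness (fill ratio).
    if 10 < area < 5000 and 0.1 < aspect < 2.0 and area / float(w * h) > 0.2:
        candidates.append((x, y, w, h))
print(len(candidates), "character candidates kept")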
Kim [6] has proposed an approach in which local color quantization (LCQ) is performed for each color separately. Each color is assumed to be a text color, without knowing whether it is the real text color or not. Color quantization takes place before processing to reduce the computational load: the input image is converted to a 256-color image. Connected components are then extracted for each color and merged into text line candidates when they show text-like features. Since LCQ is executed for each color, the drawback of this method is its processing time.
Agnihotri and Dimitrova [11] have presented an algorithm which uses only the red component of the RGB color space, with the aim of obtaining high-contrast edges for the most frequent text colors. By means of a convolution process with specific masks they first enhance the image and then detect edges. Non-text areas are removed using a preset fixed threshold. Finally, a connected component analysis (eight-pixel neighborhood) is performed on the edge image in order to group neighboring edge pixels into single connected component structures. The detected text candidates then undergo further processing in order to be ready for an OCR engine.
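A rough sketch of this idea, assuming a Laplacian-style enhancement mask and an arbitrary fixed threshold (the original masks and threshold of [11] are not reproduced here), could look like the following.

# Sketch: enhance edges on the red channel, apply a preset threshold,
# then group edge pixels with 8-connected component analysis.
import cv2
import numpy as np

image = cv2.imread("frame.jpg")
red = image[:, :, 2]  # OpenCV stores channels as BGR

# Assumed edge-enhancement mask (a Laplacian-style kernel).
kernel = np.array([[-1, -1, -1],
                   [-1,  8, -1],
                   [-1, -1, -1]], dtype=np.float32)
edges = cv2.filter2D(red, cv2.CV_32F, kernel)
edges = cv2.convertScaleAbs(edges)

# Preset fixed threshold (assumed value) to drop weak, non-text edges.
_, edge_map = cv2.threshold(edges, 80, 255, cv2.THRESH_BINARY)
n, labels = cv2.connectedComponents(edge_map, connectivity=8)
print(n - 1, "edge components found")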
Garcia and Apostolidis [4] perform an eight-connected component analysis on a binary image, which is obtained as the union of local edge maps produced by applying the Deriche filter on each color band. Jain and Yu [5] first perform a color reduction by bit dropping and color clustering quantization; afterwards, a multi-valued image decomposition algorithm is applied to decompose the input image into multiple foreground and background images. Then, connected component analysis is performed on each of them to localize text candidates. This method can extract only horizontal texts of large sizes.
The second class of approaches [7, 9] regards text as regions with distinct textural properties, such as character components that contrast with the background and at the same time exhibit a periodic horizontal intensity variation due to the horizontal alignment of characters. Texture analysis techniques such as Gabor filtering and spatial variance are used in such approaches to automatically detect text regions. These methods do not function well with varying character font sizes and, besides, are computationally intensive. For example, Li and Doermann [7] scan the image with a small window, usually 16 x 16 pixels, and classify each window as text or non-text using a three-layer neural network. To successfully detect text of different sizes, they use a three-level pyramid approach.
Text regions are extracted at each level and then extrapolated to the original scale. The bounding box of the text area is generated by a connected component analysis of the text windows.
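To give a flavor of such texture filtering, the sketch below computes responses of a small Gabor filter bank with OpenCV; all filter parameters are assumptions, and a real system would feed such responses to a classifier like the neural network described above.

# Sketch: texture responses from a small Gabor filter bank.
# Filter parameters below are assumptions for illustration.
import cv2
import numpy as np

gray = cv2.imread("scene.jpg", cv2.IMREAD_GRAYSCALE).astype(np.float32)

responses = []
for theta in (0, np.pi / 4, np.pi / 2, 3 * np.pi / 4):
    kernel = cv2.getGaborKernel(
        ksize=(16, 16), sigma=4.0, theta=theta, lambd=8.0, gamma=0.5)
    responses.append(cv2.filter2D(gray, cv2.CV_32F, kernel))

# Per-pixel texture energy; text regions tend to respond strongly
# at several orientations due to dense character strokes.
energy = np.mean([np.abs(r) for r in responses], axis=0)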
Wu et al. [9] have proposed an automatic text extraction system in which the image is first processed by a set of filters. Features are then computed from the filtered images to form a feature vector for each pixel, in order to classify the pixels as text or non-text. In a second step, bottom-up methods are applied to extract connected components. A simple histogram-based algorithm is proposed to automatically find the threshold value for each text region, making the text cleaning process more efficient.
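A minimal sketch of a first-valley threshold search of this kind is shown below; the smoothing width and fallback value are assumptions, and this is a simplification rather than the exact algorithm of Wu et al.

# Sketch: pick a binarization threshold at the first valley of the
# smoothed intensity histogram (simplified, not Wu et al.'s exact method).
import cv2
import numpy as np

gray = cv2.imread("region.png", cv2.IMREAD_GRAYSCALE)
hist = cv2.calcHist([gray], [0], None, [256], [0, 256]).ravel()

# Smooth the histogram with a small moving average (assumed width).
smoothed = np.convolve(hist, np.ones(5) / 5.0, mode="same")

# First local minimum = first valley; fall back to mid-gray if none found.
threshold = next(
    (t for t in range(1, 255)
     if smoothed[t] < smoothed[t - 1] and smoothed[t] <= smoothed[t + 1]),
    128)
_, binary = cv2.threshold(gray, threshold, 255, cv2.THRESH_BINARY)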
III. CHALLENGES OF SCENE TEXT DETECTION
First of all, in order to understand the challenges of this field, the new imaging conditions and newly considered scenes need to be detailed:
• Raw sensor image and sensor noise: in low-priced hand-held imaging devices (HIDs), pixels of a raw sensor are interpolated to produce real colors, which can induce degradations. Demosaicing techniques, viewed as complex interpolation techniques, are sometimes required. Moreover, the sensor noise of an HID is usually higher than that of a scanner.
• Angle: scene text and HIDs are not necessarily parallel, creating perspective distortion to correct.
• Blur: during acquisition, some motion blur can appear or be created by a moving object. All other kinds of blur, such as wrong focus, may degrade image quality even further.
• Lighting: in real images, uneven lighting, shadowing, reflections on objects, and inter-reflections between objects may make colors vary drastically and decrease analysis performance.
• Resolution and aliasing: from webcams to professional cameras, the resolution range is large, and images with low resolution must also be taken into account. Resolution may be below 50 dpi, which causes commercial OCR to fail. It may also lead to aliasing, creating fringed artifacts in the image.
The newly considered scenes represent targets such as:
• Outdoor/non-paper objects: different materials cause different surface reflections, leading to various degradations and creating inter-reflections between objects.
• Scene text: backgrounds are not necessarily clean and white, and more complex backgrounds make text extraction difficult. Moreover, scene text such as that seen in advertisements may include artistic fonts.
• Non-planar objects: text embedded on bottles or cans suffers from deformation.
• Unknown layout: a priori information on the structure of the text is not available to detect it efficiently.
• Objects at a distance: the space between the text and the HID can vary, leading to a wide range of character sizes in the same scene.
Figure 2: Samples of natural scene images
The main challenge is to design a system as versatile as possible, able to handle all the variability of daily life: variable targets with unknown layout, scene text, several character fonts and sizes, and variability in imaging conditions with uneven lighting, shadowing, and aliasing. The proposed solutions for each text understanding step must be context independent, i.e., independent of scene, color, lighting, and other conditions. We discuss methods which work reliably across the broadest possible range of natural scene images, such as those displayed in Figure 2.
IV. PERFORMANCE METRICS USED
The ICDAR 2011 Robust Reading Competition (Challenge 2: Reading Text in Scene Images)
dataset [?] is a widely used dataset for benchmarking scene text detection algorithms. The dataset
contains 229 training images and 255 testing images. The proposed system is trained on the training
set and evaluated on the testing set.
It is worth noting that the evaluation scheme of the ICDAR 2011 competition is not the same as that of ICDAR 2003 and ICDAR 2005. The new scheme, the object count/area scheme proposed by Wolf et al. [?], is more complicated but offers several enhancements over the old scheme. Basically, both schemes use the notions of precision, recall, and f-measure.
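In the usual notation (restating the ICDAR 2003 formulation for completeness), let T be the set of ground truth rectangles, E the set of detected (estimated) rectangles, and m(r, R) the best match score between a rectangle r and a set of rectangles R. Then

precision: p = (Σ over r_e in E of m(r_e, T)) / |E|
recall: r = (Σ over r_t in T of m(r_t, E)) / |T|
f-measure: f = 2pr / (p + r)

where f is the harmonic mean of precision and recall.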
The above matching functions only consider one-to-one matches between ground truth and detected rectangles, leaving room for ambiguity between detection quantity and quality [?]. In the new evaluation scheme, the matching functions are redesigned considering detection quality and different matching situations (one-to-one matchings, one-to-many matchings, and many-to-one matchings) between ground truth rectangles and detected rectangles, such that detection quantity and quality can both be observed. The evaluation software DetEval used by the ICDAR 2011 competition is available online and free to use.
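As a sketch of how such quality-aware matching can be scored, the Python function below computes area recall and area precision for one ground truth/detection pair, a simplification of the Wolf et al. scheme; the thresholds t_r = 0.8 and t_p = 0.4 are the commonly quoted defaults, used here as assumptions.

# Sketch: area-based match quality between a ground-truth rectangle
# and a detected rectangle (simplified from the Wolf et al. scheme).
def intersection_area(a, b):
    # Rectangles as (x1, y1, x2, y2).
    w = min(a[2], b[2]) - max(a[0], b[0])
    h = min(a[3], b[3]) - max(a[1], b[1])
    return w * h if w > 0 and h > 0 else 0.0

def area(r):
    return (r[2] - r[0]) * (r[3] - r[1])

def one_to_one_match(gt, det, t_r=0.8, t_p=0.4):
    inter = intersection_area(gt, det)
    area_recall = inter / area(gt)      # how much of the truth is covered
    area_precision = inter / area(det)  # how much of the detection is text
    return area_recall >= t_r and area_precision >= t_p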
The performance of our system, together with Neumann and Matas' method [?], a very recent MSER-based method by Shi et al. [?], and some of the top-scoring methods (Kim's method, Yi's method, the TH-TextLoc system, and Neumann's method) from the ICDAR 2011 competition, is presented in Table 1. As can be seen from Table 1, our method produced much better recall, precision, and f-measure than the other methods on this dataset.
Four of the methods in Table 1 are MSER-based, and Kim's method is the winning method of the ICDAR 2011 Robust Reading Competition.
Apart from the detection quality, the proposed system offers a speed advantage over some of the listed methods. The average processing speed of the proposed system on a Linux laptop with an Intel(R) Core(TM)2 Duo 2.00 GHz CPU is 0.43 s per image. The processing speed of Shi et al.'s method [?] on a PC with an Intel(R) Core(TM)2 Duo 2.33 GHz CPU is 1.5 s per image. The average processing speed of Neumann and Matas' method [?] is 1.8 s per image on a "standard PC". Figure 3 shows some text detection examples produced by our system on the ICDAR 2011 dataset.
Figure 3: Text detection examples on the ICDAR 2011 dataset. Detected text by our system is labeled using red rectangles. Notice the robustness against low contrast, complex backgrounds, and font variations.
TABLE 1: Performance (%) comparison of text detection algorithms on the ICDAR 2011 Robust Reading Competition dataset
To fully appreciate the benefits of text candidates elimination and the MSER pruning algorithm, we further profiled the proposed system on this dataset using the following schemes (see Table 2):
1) Scheme-I: no text candidates elimination performed. As can be seen from Table 2, the absence of text candidates elimination results in a major decrease in precision. The degradation is explained by the fact that a large number of non-text regions are passed to the text candidates classification stage without being eliminated.
2) Scheme-II: using the default parameter setting [?] for the MSER extraction algorithm. The MSER extraction algorithm is controlled by several parameters [?]: Δ controls how the variation is calculated; the maximal variation v+ excludes too-unstable MSERs; the minimal diversity d+ removes duplicate MSERs by measuring the size difference between an MSER and its parent. As can be seen from Table 2, compared with our parameter setting (Δ = 1, v+ = 0.5, d+ = 0.1), the default parameter setting (Δ = 5, v+ = 0.25, d+ = 0.2) results in a major decrease in recall: (1) the MSER algorithm is not able to detect some low-contrast characters (due to v+), and (2) the MSER algorithm tends to miss some regions that are more likely to be characters (due to Δ and d+). Note that the speed loss (from 0.36 seconds to 0.43 seconds) is mostly due to the MSER detection algorithm itself.
TABLE 2: Performance (%) of the proposed method with different components
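For orientation, OpenCV exposes MSER parameters with these meanings; the sketch below applies the tuned setting (Δ = 1, v+ = 0.5, d+ = 0.1), assuming the keyword names of recent OpenCV releases map to Δ, v+, and d+ respectively.

# Sketch: MSER extraction with the tuned parameters quoted in the text,
# assuming they map to OpenCV's arguments of the same meaning.
import cv2

gray = cv2.imread("scene.jpg", cv2.IMREAD_GRAYSCALE)
mser = cv2.MSER_create(delta=1, max_variation=0.5, min_diversity=0.1)
regions, bounding_boxes = mser.detectRegions(gray)
print(len(regions), "MSERs extracted")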
V. CONCLUSION AND FUTURE WORK
In this paper we have reviewed the concepts of automatic text extraction from scene images using different methods. We have discussed the different challenges of research in automatic text extraction from images, as well as the performance metrics used. The many methods presented for automatic extraction of text from images are discussed in the literature review of this paper. Algorithms have been proposed that consider the various properties of text, which help to distinguish text regions in natural scene images or videos. Future work in this research area is to present an improved method with the aim of improving the overall detection accuracy as compared to previous methods.
VI. REFERENCES
1. R. Lienhart and W. Effelsberg, "Automatic Text Segmentation and Text Recognition for Video Indexing", Multimedia Systems, Vol. 8, pp. 69-81, 2000.
2. N. Ezaki, M. Bulacu, L. Schomaker, “Text Detection from Natural Scene Images: Towards a
System for Visually Impaired Persons”, Int. Conf. on Pattern Recognition (ICPR 2004), vol.
II, pp. 683-686.
3. J. Park, G. Lee, E. Kim, J. Lim, S. Kim, H. Yang, M. Lee, S. Hwang, “Automatic detection
and recognition of Korean text in outdoor signboard images”, Pattern Recognition Letters,
2010.
4. T. N. Dinh, J. Park, G. Lee, “Korean Text Detection and Binarization in Color Signboards”,
Proc. of The Seventh Int. Conf. on Advanced Language Processing and Web Information
Technology (ALPIT 2008), pp. 235-240.
5. P. Shivakumara, W. Huang, C. L. Tan, “Efficient Video Text Detection using Edge
Features”, Int. Conf. on Pattern Recognition (ICPR 2008), pp. 1-4.
6. P. Shivakumara, T. Q. Phan, C. L. Tan, “Video text detection based on filters and edge
features”, Int. Conf. on Multimedia & Expo (ICME 2009), pp. 514-517.
7. Q. Yuan and C. Tan, “Text extraction from gray scale document images using edge
information,” Sixth International Conference on Document Analysis and Recognition, 2001,
pp. 302–306.
8. X. Chen, J. Yang, J. Zhang, and A. Waibel, “Automatic detection and recognition of signs
from natural scenes,” IEEE Transactions on Image Processing, vol. 13, pp. 87–99, 2004.
9. N. Ezaki, M. Bulacu, and L. Schomaker, “Text detection from natural scene images: towards a
system for visually impaired persons,” Proceeding of the 17th International Conference on
Pattern Recognition, vol. 2, 2004, pp. 683–686.
10. Hertzmann, A.; Jacobs, C.E.; Oliver, N.; Curless, B. & Salesin, D.H. (2001). Image analogies, Proceedings of ACM SIGGRAPH, Int. Conf. on Computer Graphics and Interactive Techniques.
11. Hild, M. (2004). Color similarity measures for efficient color classification, Jour. of Imaging
Science and Technology, Vol. 15, No. 6, pp. 529-547.
12. ICDAR Competition (2003), http://algoval.essex.ac.uk/icdar; Jung, K.; Kim, K.I. & Jain, A.K. (2004). Text information extraction in images and video: a survey, Pattern Recognition, Vol. 37, No. 5, pp. 977-997.
13. Karatzas, D. & Antonacopoulos, A. (2004). Text extraction from web images based on a split-and-merge segmentation method using color perception, Proceedings of Int. Conf. Pattern Recognition, Vol. 2, pp. 634-637.
14. Kim, I.J. (2005). Keynote presentation of camera-based document analysis and recognition,
http://www.m.cs.osakafu-u.ac.jp/cbdar.
15. Kim, J.; Park, S. & Kim, S. (2005). Text locating from natural scene images using image
intensities, Proceedings of Int. Conf Document Analysis and Recognition, pp. 655-659.
16. Kovesi, P.D. (2006). MATLAB and Octave functions for computer vision and image processing,
School of Computer Science & Software Engineering, The University of Western Australia,
http://www.csse.uwa.edu.au/~pk/research/matlabfns/.
17. Li, H. & Doermann D. (1999). Text enhancement in digital video using multiple frame
integration, Proceedings of ACM Int. Conf. on Multimedia, pp. 19-22.
18. Liang, J.; Doermann, D. & Li, H. (2003). Camera-based analysis of text and documents: a survey, Int. Journal on Document Analysis and Recognition, Vol. 7, No. 2-3, pp. 84-104; Lienhart, R. & Wernicke, A. (2002). Localizing and segmenting text in images, videos and web pages, IEEE Trans. Circuits and Systems for Video Technology, Vol. 12, No. 4, pp. 256-268.
19. Lopresti, D. & Zhou, J. (2000). Locating and recognizing text in WWW images, Information
Retrieval, Vol. 2, pp. 177-206.
20. Lukac, R.; Smolka, B.; Martin, K.; Plataniotis, K.N. & Venetsanopoulos, A.N. (2005). Vector
filtering for color imaging, IEEE Signal Processing, Special Issue on Color Image Processing,
Vol. 22, No. 1, pp. 74-86.
21. Luo, X.-P.; Li, J. & Zhen, L.-X. (2004). Design and implementation of a card reader based on
build-in camera, Proceedings of Int. Conf. Pattern Recognition, pp. 417-420.
22. Mancas-Thillou, C. (2006). Natural scene text understanding, PhD thesis, Faculté Polytechnique de Mons, Belgium; Mancas-Thillou, C. & Gosselin, B. (2006). Spatial and color spaces combination for natural scene text extraction, Proceedings of Int. Conf. Image Processing; Mancas-Thillou, C.; Mancas, M. & Gosselin, B. (2005). Camera-based degraded character segmentation into individual components, Proceedings of Int. Conf. Document Analysis and Recognition, pp. 755-759.
23. Mata, M.; Armingol, J.M.; Escalera, A. & Salichs, M.A. (2001). A visual landmark recognition
system for topologic navigation of mobile robots, Proceedings of Int. Conf. on Robotics and
Automation, pp. 1124-1129.
24. Messelodi, S. & Modena, C.M. (1999). Automatic identification and skew estimation of text lines in real scene images, Pattern Recognition, Vol. 32, No. 5, pp. 791-810.
25. Vilas Naik and Sagar Savalagi, “Textual Query Based Sports Video Retrieval By Embedded Text
Recognition”, International Journal of Computer Engineering & Technology (IJCET), Volume 4,
Issue 4, 2013, pp. 556 - 565, ISSN Print: 0976 – 6367, ISSN Online: 0976 – 6375.
26. Pushpa D. and Dr. H.S. Sheshadri, “Schematic Model for Analyzing Mobility and Detection of
Multiple Object on Traffic Scene”, International Journal of Computer Engineering & Technology
(IJCET), Volume 4, Issue 3, 2013, pp. 32 - 49, ISSN Print: 0976 – 6367, ISSN Online:
0976 – 6375.