Off-line Arabic handwritten words recognition based on hierarchical Bayesian networks Khlifia Jayech 1, 1, Mohamed Ali Mahjoub 1, Najoua Essoukri Ben A mara1 1 Research Unit SAGE, Team Signals and Image Documents, National Engineering School of Sousse [email protected] [email protected] [email protected] Abstract. This paper present new probabilistic graphical model used to model and recognize Arabic Handwriting words representing names of Tunisian cities. In fact, this work is based on Hierarchical Bayesian networks. The aim is to find the best model of Arabic Handwriting to reduce the complexity of recognition process by permitting the partial recognition. Indeed, we propose a hierarchical division of the word and then we extract the characteristics of each block using Zernike and Hu moments, which are invariant to rotation, translation and scaling. Next, we model each block by a latent model and the word by latent hierarchical Bayesian network. In such models, the task of determining the cardinality of the latent variables is difficult, that we solve using K-means. Our approach is tested using IFN / ENIT database, experiments results are very promising. Keywords: Bayesian networks, handwriting recognition, hierarchical latent class model. 1 Introduction Segmentation and recognition of Arabic handwritten words are t wo important areas of research. The recognition of Arabic handwritten words has many applications in bank check reading, mail sorting, forms processing in admin istration and insurance. Off-line Arabic Handwriting Recognition is a difficult task due to the high variab ility and uncertainty of human writ ing (infinite variat ions of shapes resulting from the writing habit, scanning methods, fusion of d iacritical points, overlappin g …). Many works proposed different approaches such as statistical, structural, neural , and Morkov, but all these approaches have many limits. Thus we propose, in this paper, an original method of recognition based on hierarchical Bayesian networks. The originality of our approach also relies on the use of k-means to estimate the cardinality of hidden nodes. In the follo wing section, we expose a state of art of the related work and then the theory of Bayesian networks and hierarchical Bayesian networks, respectively. The developed approach is addressed in section 4. Experiments are the subject of section 5. Conclusion and some perspectives are addressed in the last section. 2 Related works In recent years, many researchers have addressed the recognition o f Arabic text. In [1], Benouareth et al. proposed a MMC model in wh ich the state duration is modeled exp licit ly. They begin by division the handwriting word into vertical band (division may be uniform or not). Then, for each band, they extract a vector (33 features) to be used for training the model. They have shown experimentally that the use of these models DHMM (MMC duration of explicit statements) is more efficient than the use of standard semi-continuous HMMs. They also showed that the division into non uniform bands gives better results than division in uniform bands. Experiments on this model were performed based on IFN / ENIT. For the best system (Gamma to model duration of states and non-uniform division into bands), the authors obtained a recognition rate of 89.79%. In [2], S. Masmoudi Touj proposed a recognition system based on pseudo 2D or MMC planar. This approach is based on a decomposition of writ ing characteristics into five zones which are the upper and lo wer diacrit ics, extensions top and bottom and the middle zone. This decomposition allo ws obtaining homogeneous groups of entities with morphological characteristics specific to each area which can be characterized by appropriate techniques. Each horizontal zone is associated with a super model status MMCP and each super state was modeled by an HMM that adapts well to semi cursive nature of Arabic script. The correlat ion between different horizontal zones is provided by the vertical model o f MMCP. In [16], the author proposed an alphabet of letters body that can take into account the redundancy of forms of letters of the Arab ic alphabet. He has shown that although the points are essential to differentiate the Arabic letters, it is possible to propose a recognition system on an application to average vocabulary size (937 Tunisian town names) without it being necessary to take account the information carried by diacrit ics. The recognition rate of this system evaluated on the basis of IFN / ENIT is 89.98%. He then proposed a mechanis m for dealing with d iacritical marks, and a strategy to combine this information with the recognition results without diacritics. This strategy helped to increase the recognition rate to 92%. In [15], M. Pechwitz and V. Maergner use a sliding window of width three pi xels on a normalized image. On each window, they form a single vector co mprising values grayscale fro m the three bands. They use Hidden Markov Models to model semicontinuous letters. The model word is obtained by concatenation of model letters. This system was tested on IFN / ENIT and gave a recognition rate of 89.5%. 3 Bayesian Networks Bayesian networks (BNs), also known as belief networks, belong to the family of probabilistic graphical models (GMs). A Bayesian network is a directed acyclic graph (DA G) co mposed of nodes and directed links. The nodes correspond to random variables and directed lin ks between nodes represent probabilistic dependencies. The conditioning variables and the dependent variables are called parent nodes and child nodes, respectively. For a link between two variab les, X Y, the overall joint distribution is specified by the product of the prior probability P(X) and the conditional probability P (Y|X). In general, given a set of N variables H1:N =H1 ,…,HN , the joint probability distribution P(H1:N )=P(H1 ,…,HN ) can be factored into a spare set of conditional probabilit ies as follows according to the conditional independency: Where is the set of parent nodes of node in the DA G. 3.1 Hierarchical B ayesian Networks Hierarchical Bayesian networks are a generalization of standard Bayesian networks, where a node in the network may be an aggregate data type. This allows the random variables of the network to represent arbitrarily structured types. Within a single node, there may also be links between co mponents, representing probabilistic dependencies among parts of the structure. Hierarchical Bayesian networks encode conditional probability dependencies the same way as standard Bayesian networks. Hierarchical Bayesian networks are used in various domains such as event recognition of human actions and interactions [17], genetic association studies [18], Discovering Handwriting Strategies of Primary School Children [19]. 3.2 Inference i n Hierarchical B ayesian Networks Inference algorith ms used for standard Bayesian networks can also be applied to the hierarchical Bayesian network model. Given a part ially observed set e of variables referred to as “evidence”, the set of beliefs is established and updated and established to reflect both prior and observed information depending on the evidence: B (h) = P (h|e) Where B(h) is the belief in the value h of variab le H given the evidence E. The most probable explanation h* of the hidden variable H give the evidence e is determined by: The belief update is achieved by a message-passing process distributed along the network through the explo itation of the local dependencies and global independencies of various variables. 4 The developed approach Database Feature extraction Pretreatment Discretization Determination cardinality hidden nodes Classification Inference C1 … Recognition Structure learning Parameter learning Ci Class Cn Fig. 1. Developed system architecture In the next section, we detail the different steps of the developed system: 4.1 Pretreatment of of After the pre-processing: normalization and thinning (fig. 2.), we d ivide the word in a set of blocks using a threshold to avoid having empty blocks . (a) (b) (c) Fig. 2. (a) Original Word (b) word normalization and thinning (c) divided word. 4.2 Feature extraction After the pretreatment of the word, we use the descriptors such as mo ment invariants of Zern ike and Hu descriptors, which are invariant to rotation, t ranslation and scaling, to extract the characteristics of each block. Zernike descriptor: They are constructed using a set of complex polynomials which form a comp lete orthogonal set on the unit disk with : Where m and n define the order of the mo ment and I (x, y) the gray level of a pixel in the image. The Zern ike polynomials are exp ressed in polar coordinates: Where is the radial o rthogonal polynomial: Hu descri ptor: Hu made the first step in the use of mo ment invariants for pattern recognition by offering the seven Hu mo ments which are calculated fro m the normalized mo ments. Hu mo ments are invariant to translation, rotation and scaling. We introduce the normalized mo ments as follows: The seven mo ments of Hu are: For examp le the mo ment invariants of the word “sidi dhaher” are: block 1 0,0003 0,0032 0,6098 .. .. .. .. 0,1760 0,0453 block 2 0,0011 0,0002 0,6861 .. .. .. .. 0,1732 0,0399 block 3 0,0053 0,0004 0,6828 .. .. .. .. 0,1767 0,0555 block 4 0,0001 0,0001 0,7132 .. .. .. .. 0,1718 0,0356 4.3 Discretization The descriptors of mo ment invariants, described in the previous section, give us continuous signatures. However, Bayesian networks require discrete variables. So, we process to the next stage of pre-treat ment, which consist to discretize the obtained variables. We opt for the method proposed in [8], because it allows a significant reduction of the data without loss of informat ion. Th is method consists to represent the laws o f probability in the fo rm of histograms. These histograms must optimally approximate, in the sense of maximu m likelihood and the mean square error, the unknown probability laws of a random process with a single n -sample. To obtain the bin number of histogram and distribution for each bin, the Akaike information criterion (AIC) [9] has been generalized to our p roblem. Specifically, we first construct a histogram with m bins fro m the samp le, to reduce the size of the histogram, we have merged the two ad jacent bins, that maximize the d ifference between AIC before and after merg ing. This process is repeated until the d ifferen ce would be negative. 4.4 Determinati on of cardi nality of hi dden variable Although this task is difficult, some methods have been proposed for estimating the cardinality of the latent variables such as cross-validation method used in [11], which is to maximize a criterion score on the structure obtained for examp le the BIC criterion, the algorith m known as Structural EM [12] which was adapted for the latent hierarchical models to optimize the size of the latent variab les in the learning phase. In our work, we used the clustering method (k-means) for a first attempt to approximate the size of our h idden nodes. 4.5 Learning The construction of a Bayesian network requires the determination of its structure and its parameters.  Structure Learning Structure learn ing is a difficult task, we consider the recognition process as a classification problem and we propose to use latent hierarchical models ( fig.3). These models generalize the latent structure models (fig. 4). Word Model: hierarchical Bayesian Networks 0 Ci=1…8 Fig. 3. Hierarchical Bayesian Networks We model our problem with a hierarchical model where the local structures of each block are latent discrete variables that are independent of each other, but all depend on a global structure that is latent. Block Model: latent model Considering now that each block is represented by its own local model, the task is to determine the local structure between observed variables and latent variables. Several solutions are available to us. One solution is to use conventional algorith ms for structure learning in Bayesian networks. Another solution is to use structures "classic" for classification as naive Bayesian network. A third approach is to determine the structure of knowledge fro m experts. To simp lify and facilitate the recognition process , the first approach is to ignore. We chose to use a Bayesian latent network structure often used in pattern recognition Fig. 4. The latent model, the latent variable is colored in gray  Para meter learning Once the structure and the cardinality of the latent variables fixed, the learning of network parameters (i.e the conditional probabilit ies) fro m inco mplete data and / or latent variables is achieved by the EM algorithm. The EM algorith m introduced by Dempster is an iterat ive approach of maximu m likelihood estimators. Each iteration of the EM algorithm is composed of two main steps: a step of estimat ing (E) and a maximization step (M). The purpose of this algorith m is to maximize the log likelihood function l(λ,Y) = log (L(λ, Y)) where λ is the set of model parameters and Y is the set of observations on the problem. 5 Experime nts and discussion The tests have been done on a corpus of Tunisian city names that is extracted fro m the IFN/ENIT data base. We have selected a subset of 8 models and generated a database of 1600 words, composed of 200 different images per model. Each word of city name corresponds to a model. On this database we have defined various training and test sets of different sizes. We have used the samples of the sub basis “a” and ”b” for the training and those of the sub basis “c” for the validation and those of the sub basis “d” for the test. 100 50 0 1 3 5 7 9 11 13 15 Cardinality of latent variable (×10) Recognition Rate (%) Recognition Rate (%) After ext racted and discretized the continuous variables. The first experiments are to find, using the validation dataset, the cardinality of hidden nodes for the optimal model fo r each class. The criteria used to determine the cardinality of hidden nodes is the recognition rates and the cost of inference in terms of t imes. The fo llo wing figure shows the different results for some classes: 100 50 0 1 50 0 1 3 5 7 9 11 13 15 Cardinality of latent variable (×10) C7 5 7 9 11 13 15 C4 Recognition Rate (%) Recognition Rate (%) C2 100 3 Cardinality of latent variable (×10) 100 50 0 1 3 5 7 9 11 13 15 Cardinality of latent variable (×10) C8 Fig. 5. Recognition rates according to the cardinality of latent variab les for class 2, 4, 7 and 8. For class C2, the value o f the card inality of h idden nodes varies fro m 10 to 150. The results in fig. 5 show that the recognition rates increase with the cardinality of hidden nodes. This rate reaches a maximu m at k=110. There isn’t improvement fo r values of k>110. The optimal value of k is set at 110. For class C7, the recognition rates reaches a maximu m value for k= 40 and 100. The cost of recognition rates, when k=100, is high in terms of memo ry space and time inference. So, we choose k=40 for this class, this value p rovides the best compro mise between computational comp lexity and the recognition rates. All results concerning the optimal cardinality of hidden nodes are illustrated in the following table: Class Cardinality of latent variable Table 1. Optimum cardinality of the latent variable per class C1 C2 C3 C4 C5 C6 C7 80 110 50 90 90 150 40 C8 90 For the following results, learn ing and inference in each model are made taking into account the corresponding optimal cardinality of the latent variable determined in previous experiments. Table 2 shows the recognition rates obtained with the training set: Class Recognition rates (%) Table 2. Recognition rates obtained by training set C1 C2 C3 C4 C5 C6 C7 C8 97 99 95 98 94 98 99 98 Table 3 shows the recognition rates obtained with the test set: Class Recognition rates (%) C1 Table 3. Recognition rates obtained by test set C2 C3 C4 C5 C6 88 88 86 90 90 92 C7 C8 90 86 Comparison with other systems The IFN/ENIT database is available to the scientific co mmunity and this makes system co mparison possible. As mentioned in the abstract, we co mpared the developed system to three other systems tested also on IFN/ ENIT database; the ones of El-hajj [21], Burrow [20], and Masmoudi [2], it may be noted fro m table 4 that the highest accuracy was obtained by El-hajj fo llo wing our system, this is due to the use of a segmentation phase, on the other side our system achieves a good accuracy compared with Burro w’s and Masmoudi’s systems which prove the performance of Hierarchical Bayesian network as classifier. Table 4. Co mparison results Systems Classifiers Co mbin ing rule Accuracy El-Hajj’s system [21] Burro w’s system [20] 3 HMMs Several K-NN Neural Net work Majority vote 94.44 % 74% Masmoudi’s system [2] Our system 6 PHMM Hierarchical Bayesian network Max ru le 70% Max ru le 90% Conclusion and pers pectives In this paper, we proposed a new approach for the offline Arabic handwrit ing words recognition based on hierarchical Bayesian networks. Moreover, the Zernike and Hu mo ments, invariant to rotation, translation and scaling, allowed us to effectively address the major problems of mo rphological Arab ic handwriting. We proposed the k-means method to solve the cardinality problem of the h idden nodes. The experimental results are pro mising and show that our method enables to improve the recognition rates when we find the optimal card inality of hidden nodes. References 1. Benouareth, A., Ennaji, A., Sellami, M .: Arabic handwritten word recognition using HMMs with explicit state duration. EURASIP Journal on Advances in Signal Processing, 2008. 2. M asmoudi Touj, S., Ben Amara, N., Amiri, H.: Arabic handwritten words recognition based on a planar hidden markov model. the International Arab Journal of Information Technology, vol.2, no.4, 2005. 3. M asmoudi Touj S., Ben Amara N., Amiri H. : M odélisation markovienne planaire pour la reconnaissance de l’écriture arabe. Colloque International Francophone sur l’Ecrit et le Document, CIFED, 2004. 4. Cho S.J., Kim J.H.: Bayesian Network M odeling of Hangul Characters for On-line Handwriting Recognition. Proc. of ICDAR, 2003. 5. Tlemsani, R.,Benyettou, A. : Application des réseaux bayésiens dynamiques à la reconnaissance en-ligne des caractères isolés. 4th International Conference: Sciences of Electronic, Technologies of Information and Telecommunications, 2007, tunisia. 6. Hallouli K. : Reconnaissance de caractères par méthodes markoviennes et réseaux bayésiens. PhD thèses. Ecole Nationale supérieure des Télécommunication, mai ,2004. 7. Hallouli, K., Likforman, L., Sigelle, M .: A comparative study between decision fusion and fusion data in M arkovian printed character recognition. Proceedings of 16th International Conference on Pattern Recognition. pp.147-150, 2002. 8. Colot, O., Olivier, C., Courtellemont, P., El-M atouat, A., Brucq, D.: Information criteria and abrupt changes in probability laws. Signal Processing VII : Theories and Applications, vol. 1, pp.387–391, 2004. 9. Barrat, S. : M odèles graphiques probabilistes pour la reconnaissance de formes. thése de doctorat, 2009. 10. Leray, P. : Réseaux bayésiens : apprentissage et modélisation de systèmes complexes. Rapport de recherche, 2006. 11. Zhang, N. L., Nielsen, T.D., Jensen, F.V.: Latent variable discovery in classification models. to appear in Artificial Intelligence in M edicine, 2003. 12. Zhang, N. L.: Structural EM for hierarchical latent class model. Technical report. HKUSTCS03-06, 2003. 13. Peña, J.M ., Lozano, J.A., Larrañaga, P.: An improved Bayesian structural EM algorithm for learning Bayesian networks for clustering. Pattern Recognition Letters. pp.779-786, 2001. 14. Tibshirani, R.: Regression Shrinkage and Selection via the Lasso. Journal of the Royal Statistical Society. Series B (M ethodological), vol. 58, no. 1, pp: 267–288, 1996. 15. Pechwitz, M ., and M argner, V.: HMM Based Approach for Handwritten Arabic Word Recognition Using the IFN/ENIT – Database. In Proc. of the 7th Inter. Conf. on Document Analysis and Recognition (ICDAR), Vol. 2, pp 890– 894, 2003. 16. M enasri, F. : Contributions à la reconnaissance de l’écriture arabe manuscrite, Ph. D. thesis, l’Université Paris Descartes, France 2008 17. Sangho, P., Aggarwal, J.K.: A hierarchical Bayesian network for event recognition of human actions and interactions, Springer-Verlag 2004. 18. M ourad, R., Sinoquet, C., Leray, P. : Apprentissage de réseaux bayésiens hiérarchiques latents pour les études d’association pangénomiques. Proc. JFRB, 2010. 19.Leray, Ph., Zaarour, I., Heutte, L., El Eter, B., Labiche, J., M ellier, D.: A Bayesian Network M odel for Discovering Handwriting Strategies of Primary School Children. Proceedings of 14ème Congrès Francophone Reconnaissance des Formes et Intelligence Artificielle, 2004. 20. Burrow, P.: Arabic Handwriting Recognition. Report of M aster of Science School of Informatics, University of Edinburgh, 2004. 21. El-Hajj, R., M okbel, C., and Likforman-Sulem, L.: Combination of HMM -Based classifiers for the recognition of Arabic handwritten words. Document Analysis and Recognition, 2007.