Text Mining for Open Source Archive Systems
Srikrishna Raamadhurai, Information Systems Specialist
Prof. Timo Honkela
National Library of Finland, Mikkeli and University of Helsinki
Nov 19, 2014

Department of Modern Languages, Language Technology, HELSINKI
Center for Preservation and Digitisation, MIKKELI

Outline
● Types of archived media
● Usability of such archives and challenges posed
● Serving them to consumers
● Text Mining for open source archives
● Brief insight into a few methodologies
● Results and discussions

Types of Archive media
DIGITAL RESOURCES
● Images
● Texts
● Speeches / conversations
● Videos
● Interactive systems
● Numerical data
● Multimedia documents
● Computational models
● Computer software

Consumers and Institutions
[Diagram: DIGITAL RESOURCES at the centre, surrounded by: Museums, Citizens, Archives, Artists, Libraries, Teachers, Researchers, Journalists, Universities, Societies, Media, Companies, Information specialists, Municipalities, Decision makers, State]

Serving Digital resources and Challenges posed...

Challenges posed
● Imposed copyrights
● Quality of materials
● Overwhelming size of the repository
● Different types of media
● Audience and their motivation
Motivation
● Opportunity to unleash an unknown data market
● Availability of efficient methodologies

A subfield called Text Mining
[Diagram labels: Text, Audio, Image, Other media; Text Mining, Information Retrieval, Speech Recognition, Image Processing, ..., Pattern Recognition; Methodologies (Technical), Applications (Practical)]

An example of automatic multimedia content analysis
Acknowledgements:
Finnish Broadcasting Company (YLE)
Jorma Laaksonen
users.ics.aalto.fi/jorma/
scholar.google.com/citations?user=suHzeyIAAAAJ&hl=en
Mikko Kurimo
users.ics.aalto.fi/mikkok/
elec.aalto.fi/en/about/careers/professors/mikko_kurimo/

● Video analysis / scene classification
● Speaker recognition
● Speech recognition (speech to text)

● Video analysis / scene classification
● Speaker recognition
● OCR
● Speech recognition (speech to text)

Serving Newspapers online
http://digi.kansalliskirjasto.fi

Benefits of digitizing Newspapers
● Support research on history
● Power of knowledge to public (more using Translator)
● Compare equality of growth between cities
● Predict trends in market using classifieds
● Use market fluctuations to analyse goodness of a government
● Information extraction for related incidents in the past
● Research on linguistic evolution of Finnish / Swedish
● Perhaps even find out more about Kalevala...

Technical challenges
● Recognizing tokens via OCR
  – Efficient post processing techniques
● Varying page layouts
  – Efficient modeling (Machine learning)
● Imbalance in label distribution
  – Efficient learning algorithms (Statistical balancing)

zzhdysvautki    v, u, p ?
Yhdyspankki     u, n, ll ?

Similarity of Fraktur letter shapes (a self-organizing map)
Kettunen, Honkela, Linden, Kauppinen, Pääkkönen & Kervinen 2014

How hard is the task really?
But why?
1. Bad estimate of the complexity involved
   ● Layout irregularity
   ● Linguistic variability
2. Cost overhead
3. Size of target users

Link to the article
Machine Learning to the rescue
[Diagram labels: Huge data set, Machine learning]
Machine learning simply 'combs the data for hidden patterns and learns them'
A linear-chain Conditional Random Fields (CRF) classifier, incrementally trained
Meaning: a small number of marked-up samples could automatically mark up several thousands/millions of new ones.

A quick intro to Machine Learning
[Diagram labels: Data (split into Training, Validation, Test); Features; Learner; patterns; Model; Predictions; Error feedback]

Results and Discussion
To the HTML outputs...

Thank You!
Any ???
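
Backup sketch for the "A quick intro to Machine Learning" slide: a minimal, assumed illustration of the Data → Training / Validation / Test → Learner → Model → Predictions loop with error feedback and incremental updates. It is not the linear-chain CRF mentioned in the talk; a plain perceptron is swapped in to keep the example short. The feature names, the toy text lines and the labels "headline" / "body" are all hypothetical.

```python
from collections import defaultdict


def features(line):
    """Hypothetical features for one text line of a newspaper page."""
    tokens = line.split()
    return {
        "bias": 1.0,
        "n_tokens<=4": float(len(tokens) <= 4),
        "all_upper": float(line.isupper()),
        "ends_with_period": float(line.endswith(".")),
    }


class Perceptron:
    """Multiclass perceptron that learns one sample at a time."""

    def __init__(self, labels):
        self.labels = sorted(labels)
        self.w = {label: defaultdict(float) for label in self.labels}

    def score(self, feats, label):
        return sum(self.w[label][name] * value for name, value in feats.items())

    def predict(self, feats):
        # Ties go to the first label in sorted order.
        return max(self.labels, key=lambda label: self.score(feats, label))

    def partial_fit(self, feats, gold):
        """Error feedback: adjust weights only when the prediction is wrong."""
        guess = self.predict(feats)
        if guess != gold:
            for name, value in feats.items():
                self.w[gold][name] += value
                self.w[guess][name] -= value
        return guess != gold


def error_rate(model, samples):
    wrong = sum(model.predict(features(text)) != label for text, label in samples)
    return wrong / len(samples)


def train(model, train_set, val_set, max_epochs=10):
    """Pass over the training data until validation error stops improving."""
    best_err = error_rate(model, val_set)
    for _ in range(max_epochs):
        for text, label in train_set:
            model.partial_fit(features(text), label)
        err = error_rate(model, val_set)
        if err >= best_err:  # validation error no longer improves
            break
        best_err = err
    return best_err


# Toy, made-up data: short upper-case lines vs. full sentences.
data = []
for i in range(10):
    data.append((f"NEWS ITEM {i}", "headline"))
    data.append((f"The bank opened office number {i} in Mikkeli.", "body"))

# Fixed split so every subset contains both labels.
train_set, val_set, test_set = data[:12], data[12:16], data[16:]

model = Perceptron({"headline", "body"})
val_err = train(model, train_set, val_set)
test_err = error_rate(model, test_set)
print("validation error:", val_err, "test error:", test_err)

# Incremental training: a newly marked-up sample updates the same model.
model.partial_fit(features("YHDYSPANKKI"), "headline")
```

A real system along the lines of the talk would label whole sequences of lines jointly (which is what a linear-chain CRF adds) and would use far richer layout and linguistic features. The split into training, validation and test data and the feedback loop stay the same.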