Text Mining for Open Source Archive systems

Srikrishna Raamadhurai
Information Systems Specialist
Prof. Timo Honkela
National Library of Finland, Mikkeli
and University of Helsinki
Nov 19, 2014
Department of Modern Languages, Language Technology, HELSINKI
Center for Preservation and Digitisation, MIKKELI
Outline
● Types of Archived media
● Usability of such archives and challenges posed
● Serving them to consumers
● Text Mining for open source archives
● Brief insight into a few Methodologies
● Results and Discussions
Types of Archive media
DIGITAL RESOURCES
● Images
● Texts
● Speeches / conversations
● Videos
● Interactive systems
● Numerical data
● Multimedia documents
● Computational models
● Computer software
Consumers and Institutions
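[Diagram: DIGITAL RESOURCES at the center, surrounded by the following consumers and institutions]
● Museums
● Citizens
● Archives
● Artists
● Libraries
● Teachers
● Researchers
● Journalists
● Universities
● Societies
● Media
● Companies
● Information specialists
● Municipalities
● Decision makers
● State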
Serving Digital resources and Challenges posed...
Challenges posed
● Imposed Copyrights
● Quality of materials
● Overwhelming size of the repository
● Different types of media
● Audience and their motivation
Motivation
● Opportunity to unleash an unknown data market
● Availability of efficient methodologies
A subfield called Text Mining
[Diagram: media types (Text, Audio, Image, Other media) and the related fields (Text Mining, Information Retrieval, Speech Recognition, Image Processing, Pattern Recognition, ...), placed between Methodologies (Technical) and Applications (Practical)]
An example of automatic multimedia content analysis
Acknowledgements: Finnish Broadcasting Company (YLE)
Jorma Laaksonen
users.ics.aalto.fi/jorma/
scholar.google.com/citations?user=suHzeyIAAAAJ&hl=en
Mikko Kurimo
users.ics.aalto.fi/mikkok/
elec.aalto.fi/en/about/careers/professors/mikko_kurimo/
● Video analysis / scene classification
● Speaker recognition
● OCR
● Speech recognition (speech to text)
Serving Newspapers online
http://digi.kansalliskirjasto.fi
Benefits of digitizing Newspapers
● Support research on history
● Power of knowledge to the public (more so using a translator)
● Compare how evenly different cities have grown
● Predict market trends using classified ads
● Use market fluctuations to analyse the performance of a government
● Extract information on related incidents in the past
● Research the linguistic evolution of Finnish / Swedish
● Perhaps even find out more about Kalevala...
Technical challenges
● Recognizing tokens via OCR
– Efficient post processing techniques
● Varying page layouts
– Efficient modeling (Machine learning)
● Imbalance in label distribution
– Efficient learning algorithms (Statistical balancing)
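One common way to handle an imbalanced label distribution is to weight each label inversely to its frequency, so that rare labels count more during training. The sketch below is only an illustration of that idea, not the method used in the project; the label names and counts are hypothetical. The formula matches the "balanced" class-weight heuristic found in libraries such as scikit-learn.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each label by n_samples / (n_classes * count), so rare labels weigh more."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}

# Hypothetical line labels from a newspaper page: body text dominates.
labels = ["text"] * 90 + ["headline"] * 8 + ["caption"] * 2
weights = balanced_class_weights(labels)

for label, weight in sorted(weights.items(), key=lambda item: item[1]):
    print(f"{label:10s} {weight:.2f}")
```

With these weights every label contributes the same total weight to the training objective, whatever its raw frequency.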
zzhdysvautki
v, u, p ?
Yhdyspankki
u, n, ll ?
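A simple form of OCR post-processing is to replace a misread token with the closest entry in a word list. The sketch below assumes that "zzhdysvautki" is an OCR reading of "Yhdyspankki" and uses plain Levenshtein distance; the lexicon is hypothetical, and a real system would more likely weight substitutions by how similar the Fraktur letter shapes are.

```python
def levenshtein(a, b):
    """Edit distance: minimum insertions, deletions and substitutions turning a into b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution = previous[j - 1] + (char_a != char_b)
            current.append(min(previous[j] + 1, current[j - 1] + 1, substitution))
        previous = current
    return previous[-1]

def correct_token(token, lexicon):
    """Return the lexicon entry closest to the OCR token (case-insensitive)."""
    return min(lexicon, key=lambda word: levenshtein(token.lower(), word.lower()))

# Hypothetical word list; a real one would hold many thousands of entries.
lexicon = ["Yhdyspankki", "Yhdistys", "Pankki", "Sanomalehti"]
best = correct_token("zzhdysvautki", lexicon)
print(best)
```

Plain edit distance treats every substitution as equally likely, which is the main simplification here.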
Similarity of Fraktur letter shapes
(a self-organizing map)
Kettunen, Honkela, Linden, Kauppinen, Pääkkönen & Kervinen 2014
How hard is the task really?
But why?
1. Bad estimate of the Complexity involved
● Layout irregularity
● Linguistic variability
2. Cost overhead
3. Size of target Users
Machine Learning to the rescue
Huge data set
Machine learning
Machine learning simply 'combs the data for hidden patterns and learns them'
A linear-chain Conditional Random Fields (CRF) classifier, incrementally trained
Meaning: a small number of marked-up samples can be used to automatically mark up several thousands or millions of new ones.
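To give a feel for how a linear-chain model labels a sequence, the sketch below shows only the decoding step (Viterbi): given per-position label scores and label-to-label transition scores, it finds the best-scoring label sequence. The labels and all scores are hypothetical, hand-set values; in a trained CRF they would come from learned feature weights, and training itself is not shown.

```python
def viterbi(emissions, transitions, labels):
    """Return the highest-scoring label sequence for a linear-chain model.

    emissions[t][y]   : score of label y at position t
    transitions[p][y] : score of moving from label p to label y
    """
    # best[y] = (score of the best path ending in label y, that path)
    best = {y: (emissions[0][y], [y]) for y in labels}
    for scores in emissions[1:]:
        step = {}
        for y in labels:
            prev = max(labels, key=lambda p: best[p][0] + transitions[p][y])
            total = best[prev][0] + transitions[prev][y] + scores[y]
            step[y] = (total, best[prev][1] + [y])
        best = step
    return max(best.values(), key=lambda item: item[0])[1]

labels = ["TITLE", "BODY"]
# Hypothetical scores for four consecutive lines of an article.
emissions = [
    {"TITLE": 2.0, "BODY": 0.5},
    {"TITLE": 0.4, "BODY": 0.6},
    {"TITLE": 1.2, "BODY": 1.0},  # looks slightly like a title when judged alone
    {"TITLE": 0.1, "BODY": 2.0},
]
transitions = {
    "TITLE": {"TITLE": 0.5, "BODY": 1.0},
    "BODY": {"TITLE": -1.0, "BODY": 1.0},
}
path = viterbi(emissions, transitions, labels)
print(path)
```

The third line would be labelled TITLE if each line were classified independently; the transition scores make the chain prefer BODY there, which is the point of modelling the sequence as a whole.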
A quick intro to Machine Learning
[Diagram of the learning pipeline. Labels: Data, split into Training, Validation and Test; training data → Features → Learner → patterns → Model → Predictions; validation and test data; Error, fed back as feedback]
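The training, validation and test stages named on this slide can be sketched with a toy learner. This is one possible reading of the feedback loop, not the setup used in the project: the data, the threshold learner and the 5 % error target are all hypothetical. The training set is grown step by step until the validation error is low enough, and the test set is used only once at the end.

```python
import random

def error_rate(threshold, samples):
    """Fraction of samples whose label disagrees with the prediction x > threshold."""
    return sum((x > threshold) != label for x, label in samples) / len(samples)

def fit_threshold(training):
    """Learner: try midpoints between sorted training values, keep the best one."""
    xs = sorted(x for x, _ in training)
    candidates = [(low + high) / 2 for low, high in zip(xs, xs[1:])]
    return min(candidates, key=lambda t: error_rate(t, training))

# Toy data (hypothetical): one numeric feature, label is True when x >= 50.
data = [(x, x >= 50) for x in range(100)]
random.Random(0).shuffle(data)
training, validation, test = data[:60], data[60:80], data[80:]

# Feedback loop: grow the training set until the validation error is small enough.
history = []
for size in (10, 20, 40, 60):
    model = fit_threshold(training[:size])
    validation_error = error_rate(model, validation)
    history.append((size, validation_error))
    if validation_error <= 0.05:
        break

# Reported once, on data that was never used for fitting or for the feedback loop.
test_error = error_rate(model, test)
print(history, test_error)
```

Keeping the test set out of the loop is what makes the final error an honest estimate of how the model behaves on new material.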
Results and Discussion
To the HTML outputs...
Thank You!
Any questions?