Data Mining

Dariusz Brzeziński
Poznan University of Technology
• KDD & ECML: How-To
• Large-scale processing
• Conformal prediction
• The science of annotation
• Other topics
• Applications
• Summary
Trends in Data Mining and Machine Learning 2014
Data Mining
Big Data Mining
Big Data Science
“Big data mining is not about data mining per se”
Jimmy Lin (Univ. Maryland, Twitter)
• It’s a lot of mundane tasks
• Large-p and large-n
• Understanding and cleaning data
• Good integration over fancy algorithms
• Scaling up over new methods
• Exact data reduction algorithm
• Eliminate attributes without changing the final model
• Key idea: fast Lasso estimation
• Some attributes will zero out
• DPC: Dual Projection onto Convex Set
• Calculating the dual formulation of Lasso is difficult
• Providing a good estimate is easier
• Close underestimate of rejected attributes
$\theta^*(\lambda) \in \Theta$
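The screening idea can be sketched in a few lines. This is a minimal sketch, assuming the basic Dual Polytope Projection (DPP) test from the Wang et al. NIPS 2013 paper listed in this deck, applied to the Lasso objective ½‖y − Xβ‖² + λ‖β‖₁. The function name `dpp_screen` and the toy data are illustrative assumptions, not material from the talk. The test bounds how far the dual optimum θ*(λ) can be from the known dual optimum at λ_max, and rejects every attribute whose correlation with any point in that region stays below 1.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def norm(u):
    return math.sqrt(dot(u, u))

def dpp_screen(columns, y, lam):
    """Return indices of attributes that the basic DPP test proves inactive.

    columns: list of attribute columns x_j (each a list of floats)
    y:       response vector
    lam:     regularization parameter lambda, lam > 0
    """
    # Smallest lambda for which the Lasso solution is all zeros.
    lam_max = max(abs(dot(x, y)) for x in columns)
    if lam >= lam_max:
        return list(range(len(columns)))

    # The dual optimum at lam_max is y / lam_max. Projection onto the dual
    # feasible set is non-expansive, so the dual optimum at lam lies within
    # this radius of it.
    radius = norm(y) * (1.0 / lam - 1.0 / lam_max)

    rejected = []
    for j, x in enumerate(columns):
        # Upper bound on |x_j^T theta*(lam)|; if it is below 1, beta_j = 0.
        if abs(dot(x, y)) / lam_max + norm(x) * radius < 1.0:
            rejected.append(j)
    return rejected

# Toy data with orthonormal columns (hypothetical), where the Lasso solution
# is simply the soft-thresholding of x_j^T y at lam.
columns = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
y = [3.0, 1.0, 0.2]
rejected = dpp_screen(columns, y, lam=2.0)
```

On this toy data the test rejects attributes 1 and 2, which agrees with the closed-form solution: soft-thresholding 1.0 and 0.2 at λ = 2 gives zero for both. The rule is safe but conservative, so it may keep attributes that later turn out to be zero.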
• A similar technique can be used to speed up SVM
Compilation of papers available at: http://www.public.asu.edu/~jwang237/screening.html
Lasso Screening Rules via Dual Polytope Projection, J. Wang, J. Zhou, P. Wonka, J. Ye. NIPS 2013.
Safe Screening with Variational Inequalities and Its Application to Lasso, J. Liu, Z. Zhao, J. Wang, J. Ye, ICML 2014.
Scaling SVM and Least Absolute Deviations via Exact Data Reduction, J. Wang, P. Wonka, J. Ye, ICML 2014.
An Efficient Algorithm for Weak Hierarchical Lasso, Y. Liu, J. Wang, J. Ye, SIGKDD 2014.
• An evaluation framework suitable for high-risk applications
• Confidence intervals for all possible outcomes
• Based on randomness and hypothesis testing
• Can be used with classifiers and regressors
• Developed by Vovk, Shafer, and Gammerman
• Online setting
• Requires a non-conformity (strangeness) measure
• Prediction steps:
– Given a sequence of labeled data S and a test object x
– For each possible label y
• Compute the non-conformity scores for each point in the sequence S ∪ {(x, y)}
• Find the p-value $P_y$
– Include y in the prediction region $\Gamma^{\varepsilon}(S, x)$ iff $P_y > \varepsilon$
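The prediction steps above can be sketched as follows. This is a minimal sketch, assuming a 1-nearest-neighbour strangeness measure (distance to the nearest example with the same label divided by the distance to the nearest example with a different label) and the usual p-value, i.e. the fraction of examples at least as strange as the test example. The function names and the toy data are illustrative assumptions; `math.dist` needs Python 3.8 or later.

```python
import math

def strangeness(i, examples):
    """1-NN non-conformity score of examples[i]; larger means stranger."""
    xi, yi = examples[i]
    same = [math.dist(xi, x) for j, (x, y) in enumerate(examples)
            if j != i and y == yi]
    other = [math.dist(xi, x) for j, (x, y) in enumerate(examples)
             if j != i and y != yi]
    d_same = min(same, default=math.inf)
    d_other = min(other, default=math.inf)
    if d_same == math.inf or d_other == 0.0:
        return math.inf
    return d_same / d_other

def prediction_region(S, x, labels, eps):
    """Return every label y whose p-value exceeds the significance level eps."""
    region = []
    for y in labels:
        extended = S + [(x, y)]  # tentatively assign label y to x
        scores = [strangeness(i, extended) for i in range(len(extended))]
        # p-value: share of examples at least as strange as the test example.
        p_y = sum(score >= scores[-1] for score in scores) / len(extended)
        if p_y > eps:
            region.append(y)
    return region

# Hypothetical one-dimensional data: class "a" near 0, class "b" near 1.
S = [((0.0,), "a"), ((0.1,), "a"), ((0.2,), "a"),
     ((1.0,), "b"), ((1.1,), "b"), ((1.2,), "b")]
region = prediction_region(S, (0.05,), ["a", "b"], eps=0.2)
```

With ε = 0.2 the region contains only "a". Lowering ε to 0.1 asks for more confidence, and the region grows to include "b" as well, because with seven examples the smallest attainable p-value is 1/7.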
Strangeness for k-NN
Strangeness for SVM
• An offline setting requires a calibration set
– Inductive Conformal Prediction
– Cross-conformal prediction
– Bootstrap conformal prediction
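The role of the calibration set can be illustrated with inductive conformal prediction for a regressor. This is a minimal sketch, assuming absolute residuals as the non-conformity score and a least-squares line as the underlying model; any fitted regressor could be substituted. The names `fit_line` and `icp_interval` and the data are hypothetical.

```python
import math

def fit_line(xs, ys):
    """Least-squares fit of y = a + b * x on the proper training set."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
    return mean_y - b * mean_x, b

def icp_interval(model, calibration, x, eps):
    """Prediction interval for x at significance level eps."""
    a, b = model
    # Non-conformity scores of the held-out calibration examples.
    residuals = sorted(abs(y - (a + b * xc)) for xc, y in calibration)
    # Index of the calibration score that bounds the interval.
    k = math.ceil((len(residuals) + 1) * (1 - eps))
    q = residuals[k - 1] if k <= len(residuals) else math.inf
    prediction = a + b * x
    return prediction - q, prediction + q

model = fit_line([0.0, 1.0, 2.0, 3.0], [0.0, 1.0, 2.0, 3.0])
calibration = [(4.0, 4.5), (5.0, 4.8), (6.0, 6.1), (7.0, 7.0)]
low, high = icp_interval(model, calibration, x=10.0, eps=0.5)
```

The model is fitted once and never retrained per test object, which is what makes the inductive variant cheap. With only four calibration examples a level such as ε = 0.1 cannot be met, so the sketch returns an unbounded interval in that case.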
• Applications in sensing, medicine, biology, security, computer vision, civil engineering, and other fields
• Extensions:
– Active Learning
– Model Selection
– Feature Selection
– Anomaly Detection
– Change Detection
– Quality Assessment
Conformal Predictions for Reliable Machine Learning, V. N. Balasubramanian, S. Ho, V. Vovk, ECML 2014.
Tutorial website with references: http://www.iith.ac.in/~vineethnb/cptutorial/index.html
Tutorial slides: https://dl.dropboxusercontent.com/u/16632828/Conformal%20Prediciton%20Tutorial%20SlidesECML2014.pdf
“Algorithms last shorter than what they work on”
Eduard Hovy (Carnegie Mellon University)
• Is annotation the most boring thing in the world?
• Annotation must be:
– Fast… to produce enough material
– Consistent… enough to support learning
– Deep… enough to be interesting
• What is required:
– Simple procedure
– Several people
– Attention to the source theory
• Human annotation services
– Amazon Mechanical Turk
– Crowdflower
– ATLAS.TI
– QDAP
• What UI? How complex a task?
• How many annotators? What price?
• Seven questions of annotation
– Selecting a corpus
– Instantiating the theory
– Designing the interface
– Selecting and training annotators
– Designing and managing the annotation procedure
– Validating results
– Delivering and maintaining the product
Towards a ‘Science’ of Corpus Annotation: A New Methodological Challenge for Corpus Linguistics, E. Hovy, J. Lavid, International Journal of Translation, 22 (1), 2010.
Toward a Science of Annotation, E. Hovy, MLSMA 2014 (Tutorial slides, I’ve got a copy)
• Comparing machine learning and social science
• Predicting vs theory testing
• Correlation vs causation
• How to combine these worlds
• Great talk by Sendhil Mullainathan
http://videolectures.net/kdd2014_mullainathan_machine_learning/
Box drawings
• Regularization on number of boxes
• Slow exact algorithm
• Fast approximation via clustering
Box Drawings for Learning with Imbalanced Data, S. T. Goh, C. Rudin, KDD 2014.
Bayesian Decision Lists
• Posterior distribution over possible decision lists
• Focus on sparsity
Interpretable classifiers using rules and Bayesian analysis: Building a better stroke prediction model, B. Letham, C. Rudin, Tech Report.
Reject classification
• Classifiers with reject option
• Best reject classifier ≠ Best classifier with reject option
Combination of One-Class Support Vector Machines for Classification with Reject Option, B. Hanczar, M. Sebag, ECML 2014.
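For readers unfamiliar with the reject option, the simplest form is a confidence threshold on top of an ordinary classifier. The sketch below assumes that form; it is not the one-class SVM combination proposed in the paper above, and the function name and probabilities are hypothetical.

```python
def classify_with_reject(posteriors, threshold):
    """Return the most probable label, or None (reject) if it is not confident enough.

    posteriors: dict mapping each label to its estimated probability
    threshold:  minimum probability required to commit to a label
    """
    label, confidence = max(posteriors.items(), key=lambda item: item[1])
    return label if confidence >= threshold else None

confident = classify_with_reject({"stroke": 0.9, "no stroke": 0.1}, threshold=0.7)
uncertain = classify_with_reject({"stroke": 0.55, "no stroke": 0.45}, threshold=0.7)
```

The first call commits to "stroke" while the second rejects, leaving the case to a human or a more expensive test. The slide's point is that thresholding the best plain classifier in this way does not necessarily give the best classifier with a reject option.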
Pattern number estimation
• Fast estimation of the expected number of patterns based on min-sup
• Can be used to create a chart based on a range of min-sups
Fast estimation of the pattern frequency spectrum, M. Leeuwen, A. Ukkonen, ECML 2014.
Etsy
• Unique goods marketplace
• Capturing aesthetic preferences
• Topic modeling
– Items as words
– User favorites as documents
– Styles as topics
• Coherent styles without any image processing
• Locality sensitive hashing for nearest neighbor search
• Used for trend detection and recommendation
Style in the Long Tail: Discovering Unique Interests with Latent Variable Models in Large Scale Social E-Commerce, D. Hu, R. Hall, J. Attenberg, KDD 2014.
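The nearest-neighbour step can be illustrated with a small locality sensitive hashing index. This is a minimal sketch, assuming the random-hyperplane (cosine similarity) variant of LSH applied to per-user style-topic weight vectors; the slide does not say which LSH family Etsy uses, and all names and numbers below are hypothetical.

```python
import random
from collections import defaultdict

def make_hyperplanes(dim, n_bits, seed=0):
    """Draw n_bits random hyperplanes through the origin."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]

def signature(vector, hyperplanes):
    """One bit per hyperplane: the side of the hyperplane the vector falls on."""
    return tuple(
        int(sum(h * v for h, v in zip(plane, vector)) >= 0.0)
        for plane in hyperplanes
    )

def build_index(vectors, hyperplanes):
    """Group keys by signature so that similar vectors tend to share a bucket."""
    buckets = defaultdict(list)
    for key, vector in vectors.items():
        buckets[signature(vector, hyperplanes)].append(key)
    return buckets

def candidates(query, buckets, hyperplanes):
    """Keys in the query's bucket; only these need an exact comparison."""
    return buckets.get(signature(query, hyperplanes), [])

# Hypothetical style-topic weights for three users.
users = {
    "u1": [0.5, 0.25, 0.0],
    "u2": [1.0, 0.5, 0.0],     # same direction as u1
    "u3": [-0.5, -0.25, 0.0],  # opposite direction
}
planes = make_hyperplanes(dim=3, n_bits=8)
index = build_index(users, planes)
neighbours = candidates(users["u1"], index, planes)
```

Vectors pointing in the same direction always share a signature, so `u2` lands in the bucket of `u1`, while `u3` does not. In practice several independent hash tables are usually combined, because two similar but not identical vectors can still be split by a single hyperplane.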
Call of Duty
• Game analytics
– 4.6 billion hours played
– 6.5 trillion shots
– 227 billion grenades
– 386 billion kills
– 30 million players
• Map and weapon balancing
• “Boosting” detection and learning
• Feature engineering and process scaling (GBM)
Machine Learning and Data Mining in Call of Duty, Arthur von Eschen, ECML 2014. (Great talk, no slides… the only summary I found is here: http://inside-bigdata.com/2014/05/30/data-science-activision/)
Learning about meetings
• Questions:
– Can we detect when key decisions are made?
– Is there a pattern of interactions?
– How long will the meeting last?
– Do persuasive words exist?
• Dataset: http://groups.inf.ed.ac.uk/ami/download/
Learning about meetings, B. Kim, C. Rudin, ECML 2014.
Mobile application usage
• Which app will the user start next?
• Experiment with Amazon Mechanical Turk
• Dataset: http://www.idiap.ch/dataset/mdc
Conditional Log-linear Models for Mobile Application Usage Prediction, J. Kim, T. Mielikäinen, ECML 2014.
Windflow
• Aircraft aloft
• Planes as sensors
• Much better wind predictions
http://windflow.azurewebsites.net/
PROOF
• Biomarker construction
• Heart, lung and kidney failures
• Quick prevention
• Large-p-small-n problem
https://www.cs.ubc.ca/~rng/researchprojects.html
• GiveDirectly
• DataKind.org
• SumAll.org
• UN Global Pulse
• NYC Analytics
• UNICEF
• Crisis Text Line
• DonorsChoose.org
• Social Networks
• Graphical Models (Weisfeiler-Lehman)
• MOOC Mining
• ADMM (Gradient descent optimization)
• Evaluation
• Active Learning
• Multilabel classification
• Deep learning
• Privacy
• Everything should be large and social
• Data mining is leaning towards data science
• Lots of preprocessing methods
• Sparse screening
• Conformal prediction
• Social applications
http://videolectures.net/kdd2014_newyork/
Thank you!