
Online Learning Methods
for Big Data Analytics
Steven C.H. Hoi*, Peilin Zhao+
*Singapore
Management University
+Institute for Infocomm Research, A*STAR
17 Dec 2014
http://LIBOL.stevenhoi.org/
Agenda
• PART I: Introduction
– Big Data: Opportunities & Challenges
– Online Learning: What and Why
– Online Learning Applications
– Overview of OL Methods
• PART II: Online Learning Methods
– Traditional Linear OL Algorithms
– Non-traditional OL Algorithms
– Kernel-based OL Algorithms
• Discussions and Open Issues
• Summary and Take-Home Messages
Data Science
– Thousands of years ago: Experiment (empirical science)
– Last few hundred years: Theory
– Last few decades: Computation
– Big Data Era: Data-driven science
(Jim Gray, 2007)
Big Data: Popularity
• Google Trends
• Big Hype or Big Hope
What is Big Data
• "Big data is a collection of data sets so large and complex
that it becomes difficult to process using on-hand database
management tools or traditional data processing
applications."
--- Wikipedia
• “Big data refers to datasets whose size is beyond the ability
of typical database software tools to capture, store,
manage, and analyze.”
---McKinsey Global Institute’11
• “Big data refers to data that is too large, complex and
dynamic for any conventional data tools to capture, store,
manage and analyze.”
--- WIPRO
Characteristics of Big Data
• Gartner Analyst, Doug Laney, published a research paper (2001) titled
3D Data Management: Controlling Data Volume, Velocity, and Variety.
Even today, the “3Vs” are generally accepted dimensions of big data.
Volume
Velocity
Variety
Big Data: Volume
1.28 billion users (Mar 2014)
Source: http://www.ibmbigdatahub.com/sites/default/files/infographic_file/4-Vs-of-big-data.jpg
Big Data: Velocity
– 3.5M images per day
– 9B photos per month
– 150M images per month
– 250 years of video per day
Source: http://www.ibmbigdatahub.com/sites/default/files/infographic_file/4-Vs-of-big-data.jpg
Source: http://www.intel.com/content/www/us/en/communications/internet-minute-infographic.html
Big Data: Variety
Data Types:
– Structured
– Semi-structured
– Unstructured
Data Sources:
– Machine-Machine
– Human-Machine
– Human-Human
Source: http://www.ibmbigdatahub.com/sites/default/files/infographic_file/4-Vs-of-big-data.jpg
Big Data: Big Value
Source: McKinsey
Big Data Analytics: Opportunities
(Big data stack: Data, Infrastructure, Analytics, Applications)
Big Data Analytics: Challenges
• Adaptability: be able to adapt to complex and fast-changing environments, dealing with diverse data and evolving concepts
• Scalability: be able to scale up to handle explosively increasing data (e.g., real-time stream data)
• Efficiency: handle vast volumes of data (millions or even billions of instances) with limited computing capacity (CPU/RAM/DISK)
→ Online Learning addresses all three challenges
What is Online Learning?
(Figure: batch/offline learning vs. online learning: the learner receives feedback and updates the predictor sequentially.)
Online Prediction Task
For t = 1, 2, …, T
• Receive an instance $x_t$
• Predict its class label $\hat{y}_t$
• Receive the true class label $y_t$
• Suffer loss $\ell(y_t, \hat{y}_t)$
• Update the prediction model
Goal: to minimize the total loss suffered, $\sum_{t=1}^{T} \ell(y_t, \hat{y}_t)$
Regret Analysis
• Denote by $w^* = \arg\min_{w \in H} \sum_{t=1}^{T} \ell_t(w)$ the optimal hypothesis from H, the class of linear classifiers
• The regret of an online learning algorithm:
  $R_T = \sum_{t=1}^{T} \ell_t(w_t) - \min_{w \in H} \sum_{t=1}^{T} \ell_t(w)$
• We want the regret to be small and bounded (sublinear in T)
– Guarantee to perform nearly as well as the one who observes the entire sequence and chooses the best prediction strategy in hindsight
Online-to-Batch Conversions
• Online learning algorithms for batch learning
– An online algorithm that attains low regret can be converted
into a batch learning algorithm that attains low risk
• Theoretical guarantee
– From Regret Bounds to Risk Bounds (i.i.d. from unknown Q)
• Various conversion techniques
– Last hypothesis, averaging, validation, etc
Why Online Learning?
Avoid re-training when adding new data
High efficiency
Excellent scalability
Strong adaptability to changing environments
Simple to understand
Trivial to implement
Easy to parallelize
Theoretical guarantee
Online-to-Batch Conversions
Online Learning: Applications
Finance
Multimedia
Search
Social
Media
Online
Learning
Cyber
Security
Computer
Vision
Recommender
Systems
Online Learning: Applications
Multimedia
Search
Online
Learning
Online Learning for Multimedia Search
• Web-scale Content-based Multimedia Retrieval
(Examples: image retrieval, "I want to buy a pink Gucci purse…", matched by color and pattern; video search, "I want to watch Obama's Gangnam Style", matched by visual and audio features.)
Online Learning for Multimedia Search
• Web-scale Content-based Multimedia Retrieval
– Interactive Search with online relevance feedback
– Challenges:
• Learn an effective hypothesis for the user's search need, and
• Identify informative examples for soliciting user feedback
– Solution: online active learning algorithms
Online Learning for Multimedia Search
• Collaborative Multimedia Retrieval from Big Data
– Mining massive-scale side information
• Relevance feedback logs
• User click-through data in search engines
• Multimodal data
– Solutions:
• Online distance metric learning
• Online kernel similarity learning
• Online multimodal similarity learning
– Applications
• For improving similarity searching quality in CBIR
• For improving indexing efficacy in CBIR
Online Learning: Applications
Online
Learning
Cyber
Security
Online Learning for Cyber Security
• Online Anomaly Detection (outlier/intrusion/fraud)
• Examples
– Fraudulent credit card transactions
– Malicious web / spam email filtering
– Network intrusion detection systems
Online Learning for Cyber Security
• Challenges
– Handle real-time data and respond instantly
– High class imbalance (#anomalies << #normal)
– Different misclassification costs
– Labeling cost could be expensive
– Anomaly concepts/patterns often evolve over time
• Solutions
– Cost-Sensitive Online Learning
– Online Active Learning
– Cost-Sensitive Online Active Learning
Online Learning: Applications
Online
Learning
Recommender
Systems
Online Learning for Recommendation
• Online Recommender Systems
Online Learning for Recommendation
• Challenges
– Data (user rating) arriving sequentially and rapidly
– Data (user rating matrix) is extremely sparse
– User preferences could evolve over time
• Traditional Approaches
– Collaborative filtering or content-based techniques
– Batch learning approaches
– Suffer from poor re-training with new data
– Fail to adapt to fast-changing environments
• Solutions
– Online Collaborative Filtering/Matrix Factorization
– Sparse Online Learning for High-dimensional data streams
Online Learning: Applications
Online
Learning
Computer
Vision
Online Learning for Computer Vision
• Video Surveillance using Online Learning
– Visual object tracking from video streams
– Detect anomalous objects/events from video streams
– Challenges:
• Velocity: real-time processing of video streams
• Lack of feedback: have to assume weak labels
• Concept drifting
(Basharat et al. 2008)
Online Learning for Computer Vision
• Large-scale Image Classification / Search
– The bag-of-visual-words (BoW) representation is often not optimal
– Online learning can be used to optimize the BoW representation
– Challenges:
• High dimensionality
• Massive training data
Online Learning: Applications
Social
Media
Online
Learning
Online Learning for Social Media
• Online learning for mining social media
streams for business intelligence applications
– Sentiment classification
– Public emotion analytics
– Product sentiment detection
– Track brand sentiments
Online Learning for Social Media
• Microblogging Emotion Prediction
– Limited training data for each person
– Combining all data may not fit each individual
• Collaborative Online Learning (Li et al. 2010)
Online Learning for Social Media
• Mining Social Images for Auto Photo Tagging
– Online learning is used to optimize distance metric for
search-based annotation by mining vast social images
(Example social-image tag sets: {Hawk, Bird, Sky, Eagle, …}, {Sun, Bird, Sky, Blue, …}, {Bird, Fly, White, Cloud, …}, {Sun, Cloud, Hawk, Fly, …})
Online Learning: Applications
Finance
Online
Learning
Online Learning for Finance
• On-line Portfolio Selection
– Goal : To make sequential trading decisions of investing wealth
over a collection of assets
Figure from http://streambase.typepad.com/streambase_stream_process/2007/04/events_in_algo_.html
– Challenge
• Real-time data arrive sequentially while the decision has to be made immediately (e.g., high-frequency trading)
Online Learning for Finance
• On-line Portfolio Selection
– Solution
• Online learning algorithms to optimize strategies
• Exploit the mean reversion principle
– Empirical Results
• NYSE dataset
• 36 stocks
• 22 years of daily data
• Invest $1 on the 1st day
• Baseline: Market (Buy-And-Hold), ~15 times return
– Recent OLPS studies (Li et al, ICML’12, ML’13, CSUR’14, etc)
Agenda
• PART I: Introduction
– Big Data: Opportunities & Challenges
– Online Learning: What and Why
– Online Learning Applications
– Overview of Online Learning Methods
• PART II: Online Learning Methods
– Traditional Linear OL Algorithms
– Non-traditional OL Algorithms
– Kernel-based OL Algorithms
• Discussions and Open Issues
• Summary and Take-Home Messages
Online Learning: Overview
• Online Learning with Full Feedback
• Online Learning with Partial Feedback
  – Bandit problems
  – Reinforcement learning
• Online Learning without Feedback
  – Unsupervised learning from stream data (e.g., online clustering)
Not covered in this tutorial:
– Bandits
  • ACML12 Tutorial: http://www.princeton.edu/~sbubeck/tutorial.html
  • ICML Tutorial: https://sites.google.com/site/banditstutorial/
  • Prediction, Learning, and Games (Nicolo Cesa-Bianchi & Gabor Lugosi)
– Reinforcement learning
  • http://chercheurs.lille.inria.fr/~ghavamza/ICML2012-Tutorial.html
  • http://hunch.net/~jl/projects/RL/RLTheoryTutorial.pdf
Online Learning: Overview
(Taxonomy of online learning methods)
• Traditional Linear Methods (classification, regression, ranking):
  – First-order OL, Second-order OL, Sparse OL, OL with Expert Advice
• Non-Traditional Methods:
  – Online AUC Maximization, Cost-Sensitive OL, Online Transfer Learning, Online Distance Metric Learning, Online Collaborative Filtering
• Non-Linear Methods:
  – Single kernel: Kernel OL, DUOL, Budget OL
  – Multiple kernels: Online MKL, Online MKC, Online MKS
Agenda
• PART I: Introduction
– Big Data: Opportunities & Challenges
– Online Learning: What and Why
– Online Learning Applications
– Overview of Online Learning Methods
• PART II: Online Learning Methods
– Traditional Linear OL Algorithms
– Non-traditional OL Algorithms
– Kernel-based OL Algorithms
• Discussions and Further Topics
• Summary and Take-Home Messages
Notation
Online Learning: Overview
(Roadmap: Traditional Linear Methods, Non-Traditional Methods, and Non-Linear Methods (single kernel / multiple kernels))
Online Learning: Classification Setting
For t = 1, 2, …, T
• Receive an instance $x_t \in \mathbb{R}^d$
• Predict its class label $\hat{y}_t \in \{-1, +1\}$
• Receive the true class label $y_t \in \{-1, +1\}$
• Suffer loss $\ell(y_t, \hat{y}_t)$
• Update the classification model
Objective
• Minimize the total loss: $\sum_{t=1}^{T} \ell(y_t, \hat{y}_t)$
• Loss function $\ell$:
  • Zero-one loss: $\ell(y_t, \hat{y}_t) = \mathbb{I}(y_t \ne \hat{y}_t)$
  • Hinge loss: $\ell(w_t; (x_t, y_t)) = \max\{0,\ 1 - y_t\, w_t^\top x_t\}$
Loss Functions
(Figure: hinge loss vs. zero-one loss as functions of the classification margin.)
Linear Classifiers
• Restrict our discussion to linear classifiers
• Prediction: $\hat{y}_t = \mathrm{sign}(w_t^\top x_t)$
• Confidence: $|w_t^\top x_t|$
Update Rules
• Online algorithms are based on an update rule which defines $w_{t+1}$ from $w_t$ (and possibly other information)
• Linear classifiers: find $w_{t+1}$ from $w_t$ based on the input $(x_t, y_t)$
Algorithms for Update Rules
• First-Order Algorithms
– Perceptron (Rosenblatt, 1958)
– Online Gradient Descent (Zinkevich et al., 2003)
– Passive Aggressive learning (Crammer et al., 2006)
– MIRA: Margin Infused Relaxed Algorithm (Crammer and Singer, 2003)
– NORMA: Naive Online R-reg Minimization Algorithm (Kivinen et al., 2002)
– ROMMA: Relaxed Online Maximum Margin Algorithm (Li and Long, 2002)
– ALMA: A New Approximate Maximal Margin Classification Algorithm (Gentile, 2001)
• Second-Order Algorithms
–
–
–
–
SOP: Second order Perceptron (Cesa-Bianchi et al, 2005)
CW: Confidence Weighted learning (Dredze et al, 2008)
AROW: Adaptive Regularization of Weights (Crammer, 2009)
SCW: Soft Confidence Weighted learning (Wang et al, 2012)
• Sparse Online Learning Algorithms
Perceptron Algorithm (Rosenblatt, 1958)
(Figure: the Perceptron updates its weight vector (w1 → w2 → w3) on mistakes to separate the + and − examples.)
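To make the protocol concrete, here is a minimal Perceptron sketch in Python following the online learning protocol; the data-stream format (pairs of a NumPy feature vector and a label in {-1, +1}) is an assumption for illustration, not part of the original slides.

import numpy as np

def perceptron(stream, dim, learning_rate=1.0):
    """Minimal Perceptron sketch: predict with sign(w.x), update only on mistakes."""
    w = np.zeros(dim)
    mistakes = 0
    for x, y in stream:                      # y in {-1, +1}
        y_hat = 1 if w @ x >= 0 else -1      # predict
        if y_hat != y:                       # mistake-driven update
            w += learning_rate * y * x       # w_{t+1} = w_t + eta * y_t * x_t
            mistakes += 1
    return w, mistakes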
Aggressive Perceptron
• Initialize $w_1 = 0$
• For t = 1, 2, …, T
• Receive an instance $x_t$
• Predict its class label $\hat{y}_t = \mathrm{sign}(w_t^\top x_t)$
• Receive the true class label $y_t$
• If $\ell_t(w_t) = \max\{0,\ 1 - y_t w_t^\top x_t\} > 0$, then $w_{t+1} = w_t + \eta\, y_t x_t$, where $\eta$ is the learning rate
Aggressive: updates the classifier whenever the loss is non-zero (even if it classifies correctly)
Online Gradient Descent
• Online Convex Optimization (Zinkevich et al., 2003)
• Consider a convex objective function $f_t(w)$, where $w \in S$ and $S$ is a bounded convex set
• The update by Online Gradient Descent (OGD) or Stochastic Gradient Descent (SGD):
  $w_{t+1} = \Pi_S\left(w_t - \eta_t \nabla f_t(w_t)\right)$,
  where $\eta_t$ is called the learning rate and $\Pi_S$ denotes projection onto $S$
Online Gradient Descent (OGD) algorithm
• Repeat for t = 1, 2, …
– An unlabeled example $x_t$ arrives
– Make a prediction based on existing weights: $\hat{y}_t = \mathrm{sign}(w_t^\top x_t)$
– Observe the true class label $y_t$
– Update the weights by the OGD rule:
  $w_{t+1} = w_t - \eta_t \nabla \ell(w_t; (x_t, y_t))$, where $\eta_t$ is a learning rate
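A minimal OGD sketch in Python with the hinge loss; the decaying learning-rate schedule eta_t = eta0/sqrt(t) is a common choice and an assumption here, not taken from the slides.

import numpy as np

def ogd_hinge(stream, dim, eta0=0.1):
    """Online gradient descent with the hinge loss (sketch)."""
    w = np.zeros(dim)
    for t, (x, y) in enumerate(stream, start=1):   # y in {-1, +1}
        eta = eta0 / np.sqrt(t)                     # decaying learning rate (assumption)
        if y * (w @ x) < 1:                         # hinge loss is positive
            w += eta * y * x                        # step along -grad of max(0, 1 - y w.x)
        # if the loss is zero, the (sub)gradient is zero and w is unchanged
    return w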
Passive Aggressive Online Learning
• Passive Aggressive learning (Crammer et al., 2006)
– PA:    $w_{t+1} = \arg\min_{w} \frac{1}{2}\|w - w_t\|^2$  s.t.  $\ell_t(w) = 0$
– PA-I:  $w_{t+1} = \arg\min_{w} \frac{1}{2}\|w - w_t\|^2 + C\xi$  s.t.  $\ell_t(w) \le \xi,\ \xi \ge 0$
– PA-II: $w_{t+1} = \arg\min_{w} \frac{1}{2}\|w - w_t\|^2 + C\xi^2$  s.t.  $\ell_t(w) \le \xi$
where $\ell_t(w) = \max\{0,\ 1 - y_t w^\top x_t\}$
Passive Aggressive Online Learning
• Closed-form solutions can be derived: $w_{t+1} = w_t + \tau_t y_t x_t$, with
  $\tau_t = \frac{\ell_t}{\|x_t\|^2}$ (PA),  $\tau_t = \min\left\{C, \frac{\ell_t}{\|x_t\|^2}\right\}$ (PA-I),  $\tau_t = \frac{\ell_t}{\|x_t\|^2 + \frac{1}{2C}}$ (PA-II)
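A minimal sketch of the PA-I update in Python, following the closed-form solution above; the aggressiveness parameter C is an assumption for illustration.

import numpy as np

def pa1_update(w, x, y, C=1.0):
    """Passive-Aggressive (PA-I) update for one example (y in {-1, +1})."""
    loss = max(0.0, 1.0 - y * (w @ x))          # hinge loss on the current model
    if loss > 0:
        tau = min(C, loss / (x @ x))            # PA-I step size (capped by C)
        w = w + tau * y * x                     # move just enough toward zero loss
    return w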
Traditional Linear Online Learning (cont’)
• First-Order methods
– Learn a linear weight vector (first order) of model
• Pros and Cons
☺ Simple and easy to implement
☺ Efficient and scalable for high-dimensional data
☹ Relatively slow convergence rate
Second Order Online Learning methods
• Key idea
– Update the weight vector w by maintaining and exploring second
order information in addition to the first order information
• Some representative methods
– SOP: Second order Perceptron (Cesa-Bianchi et al, 2005)
– CW: Confidence Weighted learning (Dredze et al, 2008)
– AROW: Adaptive Regularization of Weights (Crammer, 2009)
– SCW: Soft Confidence Weighted learning (Wang et al, 2012)
– Others (including but not limited to):
• IELLIP:Online Learning by Ellipsoid Method (Yang et al., 2009)
• NHERD: Gaussian Herding (Crammer & Lee 2010)
• NAROW: New variant of AROW algorithm (Orabona & Crammer 2010)
SOP: Second Order Perceptron
• SOP: Second order Perceptron (Cesa-Bianchi et al. 2005)
• Whitened Perceptron (not incremental!)
  • Uses the correlation matrix of the input instances
  • Simply runs a standard Perceptron on the whitened (transformed) instances
• Online algorithm (an incremental variant of the Whitened Perceptron)
  • Augmented matrix of past mistaken instances
  • Corresponding correlation matrix
SOP: Second Order Perceptron
• SOP: Second order Perceptron (Cesa-Bianchi et al. 2005)
CW: Confidence Weighted learning
• CW: Confidence Weighted learning (Dredze et al. 2008)
– Maintain a Gaussian distribution over weight vectors and draw a parameter vector $w \sim \mathcal{N}(\mu, \Sigma)$
– The margin is viewed as a random variable: $y_t (w \cdot x_t)$
– The probability of a correct prediction is $\Pr[y_t (w \cdot x_t) \ge 0]$
– Optimization of CW: minimize $D_{KL}\left(\mathcal{N}(\mu, \Sigma)\,\|\,\mathcal{N}(\mu_t, \Sigma_t)\right)$ subject to $\Pr[y_t (w \cdot x_t) \ge 0] \ge \eta$
CW: Confidence Weighted learning
The constraint can be written as $y_t(\mu \cdot x_t) \ge \phi^{-1}(\eta)\sqrt{x_t^\top \Sigma\, x_t}$, where $\phi$ is the cumulative distribution function of the standard normal distribution.
Lemma 1: The optimal value
of the Lagrange multiplier is
given by
AROW: Adaptive Regularization of Weights
• AROW (Crammer et al. 2009)
– Extension of CW learning
– Key properties: large margin training, confidence weighting, and the
capacity to handle non-separable data
• Formulations
AROW: Adaptive Regularization of Weights
• AROW algorithm (Crammer et al. 2009)
SCW: Soft Confidence Weighted learning
• SCW (Wang et al. 2012)
– Four salient properties: large margin, non-separable, confidence weighted (2nd order), adaptive margin
– Formulation
• SCW-I
• SCW-II
SCW: Soft Confidence Weighted learning
• SCW Algorithms
Traditional Linear Online Learning (cont’)
• Second-Order Methods
– Learn both first order and second order info
• Pros and Cons
☺ Faster convergence rate
☹ Expensive for high-dimensional data
☹ Relatively sensitive to noise
Traditional Linear Online Learning (cont’)
• Empirical Results (Wang et al., ICML’12)
Online Time Cost
Online Mistake Rate
Sparse Online Learning
• Motivation
– How to induce Sparsity in the weights of online learning
algorithms for high-dimensional data
– Space constraints (memory overflow)
– Test-time constraints (test computational cost)
• Some existing work
– Truncated gradient (Langford et al., 2009)
– FOBOS: Forward Looking Subgradients (Duchi and Singer 2009)
– Dual averaging (Xiao, 2010)
– etc.
Truncated gradient (Langford et al., 2009)
• Main Idea
– Truncated gradient: impose sparsity by modifying
the stochastic gradient descent
• Stochastic Gradient Descent: $w_{t+1} = w_t - \eta\, \nabla_w L(w_t; z_t)$
• Simple Coefficient Rounding: every K steps, set to zero every coefficient whose magnitude falls below a threshold $\theta$
Truncated gradient (Langford et al., 2009)
Simple Coefficient Rounding vs. less aggressive truncation
Illustration of the two truncation functions, T0 and T1
Truncated gradient (Langford et al., 2009)
• The amount of shrinkage is measured by a gravity parameter $g$
• The truncation can be performed every K online steps
• When the gravity parameter is zero, the update rule is identical to standard SGD
• Loss Functions:
– Logistic
– SVM (hinge)
– Least Square
FOBOS (Duchi and Singer 2009)
• Forward-Backward Splitting
The Fobos Algorithm
• Repeat
– I. Unconstrained (stochastic sub) gradient of loss
– II. Incorporate regularization
• Similar to
– Forward-backward splitting (Lions and Mercier 79),
– Composite gradient methods (Wright et al. 09,
Nesterov 07),
– Dual averaging with regularization (Xiao 09).
Fobos: Step I
• Unconstrained (stochastic sub)gradient step on the loss: $w_{t+\frac{1}{2}} = w_t - \eta_t\, g_t$, where $g_t \in \partial \ell_t(w_t)$
Fobos: Step II
• Incorporate regularization: $w_{t+1} = \arg\min_w \left\{\frac{1}{2}\|w - w_{t+\frac{1}{2}}\|^2 + \eta_t\, r(w)\right\}$
Fobos with ℓ1 regularization
• Step II: $w_{t+1} = \arg\min_w \left\{\frac{1}{2}\|w - w_{t+\frac{1}{2}}\|^2 + \eta_t \lambda \|w\|_1\right\}$
• Separable: the problem decomposes over coordinates
• Coordinate-wise update yields sparsity (soft-thresholding):
  $w_{t+1, j} = \mathrm{sign}(w_{t+\frac{1}{2}, j}) \max\left\{0,\ |w_{t+\frac{1}{2}, j}| - \eta_t \lambda\right\}$
• Similar to
  – Truncated gradient (Langford et al. 08), Iterative Shrinkage and Thresholding (Donoho 95, Daubechies et al. 04)
Forward Looking Property
Dual Averaging (Xiao, 2010)
• Goal: The regularized stochastic learning
• SGD lacks capability in exploiting problem structure
and often suffers from large variations
• Extensions of Nesterov’s dual averaging Method
• The Regularized Dual Averaging (RDA) method augments the dual-averaging objective with an auxiliary strongly convex function
The RDA Algorithm
• Step 1: compute a subgradient $g_t \in \partial f_t(w_t)$
• Step 2: update the average subgradient: $\bar{g}_t = \frac{t-1}{t}\bar{g}_{t-1} + \frac{1}{t} g_t$
• Step 3: compute the next weight vector by minimizing the averaged subgradient term plus the regularizer and a strongly convex term
Closed-form solutions exist for common regularizers (e.g., ℓ1)
Comparison of Sparse OL Algorithms
• FOBOS
  – Subgradient
  – Local Bregman divergence
  – Coefficient
  – Equivalent to TG in special cases
• RDA
  – Average subgradient
  – Global proximal function
  – Coefficient
  – Uses a much more aggressive truncation threshold
Comparisons
• Comparison of TG with other baselines
Comparisons
Variants of Sparse Online Learning
• Online Feature Selection (OFS)
– A variant of Sparse Online Learning
– The key difference is that OFS focuses on selecting a
fixed subset of features in online learning process
– Could be used as an alternative tool for batch
feature selection when dealing with big data
• Existing Work
– Online Feature Selection (Hoi et al, 2012) proposed an OFS
scheme by exploring the Sparse Projection to choose a
fixed set of active features in online learning
Online Learning with Expert Advice
• Learning to combine the predictions from
multiple experts (classifiers)
• An ensemble of d experts:
• Combination weights:
• Combined classifier
Hedge Algorithm (Freund & Schapire '97)
• Assume there exists some best expert
• What is the learning strategy that can perform
as well as the best expert?
• Consider it as an on-line allocation task
Hedge Algorithm
Initialize weights $w_{1,i} = 1$ for each expert $i = 1, …, d$
For t = 1, 2, …, T
• Receive a training example $(x_t, y_t)$
• Prediction: weighted majority vote, $\hat{y}_t = \mathrm{sign}\left(\sum_{i=1}^{d} w_{t,i} f_i(x_t)\right)$
• If $\hat{y}_t \ne y_t$, then
  – For i = 1, 2, …, d
    • If expert i made a mistake ($f_i(x_t) \ne y_t$), then $w_{t+1,i} = \beta\, w_{t,i}$, with discount factor $\beta \in (0, 1)$
Hedge Algorithm
• Regret Bounds
– Denote by $L_{T,i} = \sum_{t=1}^{T} \ell_{t,i}$ the cumulative loss of the i-th expert
– By choosing an appropriate discount factor $\beta$, the total loss of Hedge is bounded by $\min_i L_{T,i} + O(\sqrt{T \ln d})$
• No-regret algorithm! Perform as well as the best expert!
Summary of Traditional Linear OL
• Pros
☺ Efficient for computation & memory
☺ Extremely scalable
☺ Theoretical bounds
on the mistake rate
• Cons
☹ Learn Linear prediction models
☹ Optimize the mistake rate only
Online Learning: Overview
(Roadmap: Traditional Linear Methods, Non-Traditional Methods, and Non-Linear Methods (single kernel / multiple kernels))
Non-Traditional Linear OL
• Online AUC Maximization
• Cost-Sensitive Online Learning
• Online Transfer Learning
• Online Distance Metric Learning
• Online Collaborative Filtering
Online AUC Maximization
• Motivation
– The mistake rate (or classification accuracy) measure could
be misleading for many real-world applications
• Example:
Consider a set of 10,000 instances with only 10 “positive” and
9,990 “negative”. A naïve classifier that simply declares every
instance as “negative” has 99.9% accuracy.
• Many applications (e.g., anomaly detection) often adopt
other metrics, e.g., AUC (area under the ROC curve).
Can online learning directly optimize AUC?
Online AUC Maximization
• What is AUC?
– AUC (Area Under the ROC Curve)
– ROC (Receiver Operating Characteristic) curve details the rate of
True Positives (TP) against
False Positives (FP) over
the range of possible thresholds.
– AUC measures the probability
for a randomly drawn positive
instance to have a higher
decision value than a randomly
sampled negative instance
– ROC was first used in World War II
for the analysis of radar signals.
Online AUC Maximization
• Motivation
– To develop an online learning algorithm for training an
online classifier to maximize the AUC metric instead of
mistake rate/accuracy
– “Online AUC Maximization” (Zhao et al., ICML’11)
• Key Challenge
– In math, AUC is expressed as a sum of pairwise losses
between instances from different classes, which is
quadratic in the number of received training examples
– Hard to directly solve the AUC optimization efficiently
Formulation
• A data set $D = D^+ \cup D^-$
• Positive instances $D^+ = \{x_i^+\}_{i=1}^{N^+}$
• Negative instances $D^- = \{x_j^-\}_{j=1}^{N^-}$
• Given a classifier w, its AUC on the dataset D:
  $\mathrm{AUC}(w) = \frac{1}{N^+ N^-} \sum_{i=1}^{N^+} \sum_{j=1}^{N^-} \mathbb{I}\left(w^\top x_i^+ > w^\top x_j^-\right)$
Formulation (cont’)
• Replace the indicator function I with its
convex surrogate, i.e., the hinge loss function
• Find the optimal classifier w by minimizing
(1)
• It is not difficult to show that
Formulation (cont’)
• Re-writing objective function (1) into:
• In online learning task, given (xt , yt ), we may
do online update:
The loss function is related to all received examples.
Have to store all the received training examples!!
Main Idea of OAM
• Cache a small number of received examples;
• Two buffers of fixed size, $B_t^+$ and $B_t^-$, to cache the positive and negative instances;
(Figure: flow of the proposed online AUC maximization process: each incoming example $(x_t, y_t)$ is routed by its label into $B_t^+$ or $B_t^-$ via reservoir sampling, and the classifier is updated (sequentially or by gradient) against the examples cached in the opposite buffer.)
OAM Framework
Update Buffer
• Reservoir Sampling (J. S. Vitter, 1985)
– A family of classical sampling algorithms for randomly
choosing k samples from a data set with n items, where n is
either a very large or unknown number.
– In general, it takes a random sample set of the desired size in
only one pass over the underlying dataset.
– The UpdateBuffer algorithm is simple and very efficient:
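Below is a minimal sketch in Python of a reservoir-sampling buffer update of this kind; the exact UpdateBuffer procedure in OAM may differ in its details.

import random

def update_buffer(buffer, x, num_seen, capacity):
    """Reservoir sampling: keep a uniform random sample of `capacity` items.

    `num_seen` is the number of same-class examples observed so far (including x).
    """
    if len(buffer) < capacity:
        buffer.append(x)                     # buffer not full yet: always keep
    else:
        j = random.randint(0, num_seen - 1)  # uniform index over all items seen so far
        if j < capacity:
            buffer[j] = x                    # replace with probability capacity/num_seen
    return buffer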
Update Classifier
• Algorithm 1: Sequential update by PA:
– Follow the idea of Passive aggressive learning (Crammer et al.’06)
– For each x in buffer B, update the classifier:
• Algorithm 2: Gradient-based update
– Follow the idea of online gradient descent
– For each x in buffer B, update the classifier:
Empirical Results of OAM
• Comparisons
– Traditional algorithms:
• Perceptron, PA, Cost-sensitive PA (CPA), CW
– The proposed OAM algorithms:
(i) OAM-seq, OAM-gra, (ii) OAM-inf (infinite buffer size)
• Evaluation of AUC for Classification tasks
Other Related Work
• Online AUC Maximization problem is a special
case for “Online Learning with Pairwise Loss
Functions”
• “On the Generalization Ability of Online
Learning Algorithms for Pairwise Loss
Functions” (ICML’13)
• Wang, Yuyang, et al. "Online Learning with
Pairwise Loss Functions." arXiv preprint
arXiv:1301.5332 (2013).
Cost-Sensitive Online Learning
• Motivation
– Beyond optimizing the mistake rate or accuracy
– Attempt to optimize the cost-sensitive measures
• Sum: a weighted sum of sensitivity and specificity
• Cost: a weighted sum of the misclassification costs on the positive and negative classes
• Existing Work
– Cost-sensitive Online Gradient Descent (Wang et al. 2012)
– Cost-Sensitive Double Updating Online Learning (Zhao et al. 2013)
Cost-Sensitive Online Gradient Descent
• CSOGD: Cost-Sensitive Online Gradient Descent
– Formulate the cost-sensitive objective functions
• where the loss weights on positive and negative examples are set according to whether the sum or the cost measure is being optimized
– Update by Online Gradient Descent
Cost-Sensitive Online Gradient Descent
Cost-Sensitive Online Active Learning
• Motivation
• Feedback is not always available
• Labeling cost could be expensive
• Learning to Detect Malicious URLs (Zhao et al. KDD’13)
Cost-Sensitive Online Active Learning
• Main idea
– Combining both Cost-Sensitive OL and Active Learning
– Acquire label only when necessary
– Query Probability
• Empirical Results
– CSOAL saves almost 99% of the labeling cost while achieving similar performance
Online Transfer Learning
• Transfer learning (TL)
– Extract knowledge from one or more source tasks
and then apply them to solve target tasks
– Three ways in which transfer might improve learning
– Two Types of TL tasks
• Homogeneous vs Heterogeneous TL
Online Transfer Learning (Zhao and Hoi 2011)
• Online Transfer learning (OTL)
– Assume training data for target domain arrives sequentially
– Assume a classifier was learnt from a source domain
– online algorithms for transferring knowledge from source
domain to target domain
• Settings
– Old/source data space:
– New/target domain:
– A sequence of examples from new/target domain
– OTL on Homogeneous domains
– OTL on heterogeneous domains
Online Transfer Learning (Zhao and Hoi 2011)
• OTL on Homogeneous domains
– Key Ideas: explore ensemble learning by combining
both source and target classifiers
– Update rules using any existing OL algorithms (e.g., PA)
Online Transfer Learning (Zhao and Hoi 2011)
• OTL on Heterogeneous domains
– Assumption: not completely different
– Each instance
in target domain can be
split into two views:
– The key idea is to use a co-regularization principle
for online optimizing two classifiers
– Prediction can be made by
Online Transfer Learning (Zhao and Hoi 2011)
• Heterogeneous OTL algorithm
• Applications: online learning with concept drifting
Online Distance Metric Learning
• Distance Metric Learning (DML) has many applications
in multimedia, especially for content-based image
retrieval and indexing, data clustering, etc.
• Objective
– Instead of learning the classification model,
– The goal of DML is to learn a Mahalanobis distance function
  $d_A(x_1, x_2) = (x_1 - x_2)^\top A\, (x_1 - x_2)$,
  where A is a d×d positive definite matrix for the distance metric, i.e., $A \succ 0$
Online Distance Metric Learning
• Two Types of Data (a.k.a. “Side Information”)
– Pairwise Instances (a.k.a. “Pairwise constraints”)
– Triple Instances (a.k.a. “Triplet constraints”)
• Data sources: relevance feedback in CBIR, query
logs of search engines, social media, etc.
Online DML: Problem Setting
For t=1, 2, …, T
• Receive a pairwise instance
• Predict similarity label
• Receive the true label
• Suffer loss
• Update the distance metric
Online DML: Update Rules
• Regularized Online Learning Framework
• Loss Functions
– Hinge Loss
– Square Loss
Online DML: Regularizers
• Frobenius divergence
Projecting matrix A onto positive semidefinite (PSD) cone
– Examples:
• Pseudo-metric Online Learning Algorithm (POLA)
(Shalev-Shwartz et al, 2004)
• (Online) Regularized Metric Learning (Jin et al, 2010)
Online DML: Regularizers
• LogDet divergence
– Information-theoretic approach
– Examples:
• Info Theo. Metric Learning (ITML) (Davis et al. 2007)
• LogDet Exact Gradient Online (LEGO) (Jain et al., 2009)
Online Similarity Learning
• Consider a parametric similarity function in a bi-linear form: $S_W(x_i, x_j) = x_i^\top W x_j$
• The goal is to find W such that for all triplets $(x, x^+, x^-)$: $S_W(x, x^+) > S_W(x, x^-) + 1$
• For each triplet, define the loss function as $\ell_W(x, x^+, x^-) = \max\{0,\ 1 - S_W(x, x^+) + S_W(x, x^-)\}$
Online Similarity Learning
• Online Algorithm for Scalable Image Similarity Learning
(OASIS) (Chechik et al., 2010)
– Follow the principle of Passive-Aggressive Online Learning
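A minimal sketch in Python of a PA-style triplet update in the spirit of OASIS; the PA-I style step size and the aggressiveness parameter C are assumptions for illustration.

import numpy as np

def oasis_style_update(W, x, x_pos, x_neg, C=0.1):
    """PA-style update of a bilinear similarity S_W(a, b) = a^T W b on one triplet."""
    loss = max(0.0, 1.0 - x @ W @ x_pos + x @ W @ x_neg)    # triplet hinge loss
    if loss > 0:
        V = np.outer(x, x_pos - x_neg)                      # direction that reduces the loss
        tau = min(C, loss / np.sum(V * V))                  # PA-I style step size
        W = W + tau * V                                     # move toward satisfying the triplet
    return W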
Online Collaborative Filtering
• Collaborative Filtering
– Learn User-Item matrix to predict rating/ranking
– Simple in data collection
– Model-based vs Memory-based methods
R =     I1  I2  I3  I4  I5  I6  I7
  u1     5   ?   2   ?   1   ?   4
  u2     ?   4   ?   1   ?   3   ?
  u3     5   ?   1   ?   ?   3   ?
• Model-based Collaborative Filtering
– Matrix factorization is the most widely used
Online Collaborative Filtering
• Problem Setting
– A total of m users, and n items
– A sequence of observed ratings
• Matrix Factorization
– The goal is to find U and V by minimizing
Online Collaborative Filtering
• Online Gradient Descent (OGD) for Online CF
– For t=1,…,T
• Receive a rating observation $r_{ab}$ with respect to the a-th user and the b-th item
• Update the user factor $U_a$ and the item factor $V_b$ by a gradient step on the loss of the observed entry
– OGD may converge slowly
Online Collaborative Filtering
• Online Multi-Task Collaborative Filtering (Wang et al, 2013)
– Equivalence between MF-CF and Multi-task learning
– Instead of only updating one user (row) vector and
one item (column) vector, OMTCF attempts to
update multiple users (tasks) for each observation
– Using a task-interaction matrix
to model
the relationship between the tasks, and using this
matrix to simultaneously update multiple models
Online Collaborative Filtering
• Online Multi-Task Collaborative Filtering (Wang et al, 2013)
– For t=1,…,T
• Receive a rating observation $r_{ab}$ with respect to the a-th user and the b-th item
• Update U and V, using the task-interaction matrix to simultaneously update multiple related user vectors
Online Collaborative Filtering
• Second-Order Online Collaborative Filtering (Lu et al, 2013)
– Attempt to exploit second order information
– Assume
– Following the idea of Confidence-weighted (CW) learning
– CWOCF: Online Update Rules (w.r.t. RMSE loss)
Online Collaborative Filtering
• Open Challenges
– Handle novel sample extension (e.g., new user
added or new item added during learning process)
– Parallelization (OMTCF is easier than CWOCF)
– High dimensionality for second-order methods
– Handle concept drifting or preference evolving
– Handle cold start (e.g., combining content-based)
Online Learning: Overview
(Roadmap: Traditional Linear Methods, Non-Traditional Methods, and Non-Linear Methods (single kernel / multiple kernels))
Kernel-based Online Learning
• Motivation
– Linear classifiers are limited in certain situations
• Objective
– Learn a non-linear model for online classification
tasks using the kernel trick
Kernel-based Online Learning
• Kernel Perceptron
• Related Work
– Double Updating Online Learning (Zhao et al, 2011)
– Others
Double Updating Online Learning (DUOL)
• Motivation
– When a new support vector (SV) is added, the weights of
existing SVs remain unchanged (i.e., only the update is
applied for a single SV )
– How to update the weights of existing SVs in an efficient and effective way
• Main idea
– Update the weight for one more existing SV in addition to
the update of the new SV
• Challenge
– which existing SV should be updated and how to update?
Double Updating Online Learning (DUOL)
• Denote a new Support Vector as:
• Choose an auxiliary example from the existing SVs that is:
– Misclassified by the current hypothesis
– Conflicting most with the new SV
• Update the current hypothesis by adjusting the weights of both the new SV and the auxiliary SV
• How to optimize the weights of the two SVs
• DUOL formulates the problem as a simple QP task of large
margin optimization, and gives closed-form solutions.
Double Updating Online Learning (DUOL)
Kernel-based Online Learning
• Challenge
– The number of support vectors in the kernel-based classification model is often unbounded!
– Non-scalable and inefficient in practice!!
• Question
– Can we bound the number of support vectors?
• Solution
– “Budget Online Learning”
Budget Online Learning
• Problem
– Kernel-based Online Learning by bounding the
number of support vectors for a given budget B
• Related Work in literature
– Randomized Budget Perceptron (Cavallanti et al.,2007)
– Forgetron (Dekel et al.,2005)
– Projectron (Orabona et al.,2008)
– Bounded Online Gradient Descent (Zhao et al 2012)
– Others
RBP: Randomized Budget Perceptron
(Cavallanti et al.,2007)
• Idea: maintaining budget by means of randomization
• Repeat whenever there is a mistake at round t:
• If the number of SVs <= B, then apply Kernel Perceptron
• Otherwise randomly discard one existing support vector
Forgetron (Dekel et al.,2005)
(Figure: the Forgetron operates on the ordered support set 1, 2, 3, …, t-1, t in three steps:)
Step (1) - Perceptron
Step (2) – Shrinking
Step (3) – Remove Oldest
Projectron (Orabona et al., 2008)
• The new hypothesis
is projected onto the space
spanned by
• How to solve the projection?
Bounded Online Gradient Descent
(Zhao et al 2012)
• Limitations of previous work
– Perceptron-based, heuristic or expensive update
• Motivation of BOGD
– Learn the kernel-based model using online gradient
descent by constraining the SV size less than a
predefined budget B
• Challenges
– How to efficiently maintain the budget?
– How to minimize the impact due to the budget
maintenance?
Bounded Online Gradient Descent
(Zhao et al 2012)
• Main idea of the BOGD algorithms
– A stochastic budget maintenance strategy to guarantee
• One existing SV will be discarded by multinomial sampling
• Unbiased estimation with only B SVs;
• Formulation
– Current hypothesis
– Construct an unbiased estimator (so that its expectation equals the original hypothesis); a sampled indicator variable marks the i-th SV selected for removal
Empirical Results of BOGD
• Comparison
– Baseline: Forgetron, RBP, Projectron, Projectron++
– Our algorithms: BOGD (uniform), BOGD++ (non-uniform)
• Evaluation of budget online learning algorithms
Experimental results with varied budget sizes on the cod-rna data set (n = 271,617)
Budget Online Kernel Learning:
Kernel Approximation approaches
• Motivations
– Most existing budget online kernel learning often
adopts different budget maintenance strategies
• Idea of Kernel Approximation
Kernel Approximation
• Two Approaches (Wang et al, 2013)
– Kernel Function Approximation
– Kernel Matrix Approximation
• Kernel Function Approximation
– Fourier Online Gradient Descent Algorithm
– Approximate kernel functions by random Fourier features, and then apply online gradient descent (see the sketch below)
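A minimal sketch in Python of this idea for an RBF kernel: map each instance with random Fourier features, then run linear OGD in the feature space; the feature dimension, bandwidth, and step size are assumptions.

import numpy as np

def fogd(stream, dim, num_features=200, gamma=1.0, eta=0.1, seed=0):
    """Fourier OGD sketch: random Fourier features approximating an RBF kernel + linear OGD."""
    rng = np.random.default_rng(seed)
    W = rng.normal(scale=np.sqrt(2 * gamma), size=(num_features, dim))  # random frequencies
    b = rng.uniform(0, 2 * np.pi, size=num_features)                    # random phases
    w = np.zeros(num_features)

    def features(x):
        return np.sqrt(2.0 / num_features) * np.cos(W @ x + b)

    for x, y in stream:                       # y in {-1, +1}
        z = features(x)
        if y * (w @ z) < 1:                   # hinge loss positive -> gradient step
            w += eta * y * z
    return w, features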
Kernel Approximation
• Kernel Matrix Approximation: applicable to
any types of kernels
• Nystrom Online Gradient Descent (NOGD)
– Approximate the kernel matrix using the Nyström method, and then apply OGD
– Construct a small B×B kernel matrix
Empirical Evaluation
• Comparison to Batch Binary Classification
Summary
• A family of Budget Online Kernel Learning
• Pros
☺ Very efficient due to stochastic
strategy
☺ Rather scalable
☺ State-of-the-art performance, theoretical guarantee
• Cons
☹ Predefined budget size (optimal budget size)?
☹ Only learn with a single kernel
Online Learning: Overview
(Roadmap: Traditional Linear Methods, Non-Traditional Methods, and Non-Linear Methods (single kernel / multiple kernels))
Online Multiple Kernel Learning
• Motivation
– Variety is a key challenge for multimedia data analytics
– Traditional methods assume data in vector space
– Real objects often have diverse representations
– Multiple Kernel Representation
• Each kernel represents one similarity function
• Pyramid matching kernels (vision, multimedia)
• Graph kernels (bio, web/social, etc.)
• Sequence kernels (speech, video, bio, etc.)
• Tree kernels (NLP, etc.)
Multiple Kernel Learning (MKL)
• What is Multiple Kernel Learning (MKL)? (Lanckriet et al., JMLR'04)
– Kernel method by an optimal combination of multiple kernels
• Batch MKL Formulation
• Hard to solve the convex-concave optimization for big data!
Can we avoid solving the batch optimization directly?
Online MKL (Hoi et al., ML’13)
• Objective
– Aims to learn a kernel-based predictor with multiple
kernels from a sequence of (multi-modal) data examples
– Avoid the need of solving complicated optimizations
• Main idea: a two-step online learning
At each iteration, if there is a mistake:
– Step 1: Online learning with each single kernel
• Kernel Perceptron (Rosenblatt, 1958; Freund & Schapire, 1999)
– Step 2: Online update the combination weights
• Hedge algorithm (Freund and Schapire COLT95)
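A compact sketch in Python of this two-step idea, combining per-kernel Perceptrons with Hedge-style weight updates; the data structures and the discount factor are illustrative assumptions, and the details differ from the full OMKC algorithms.

import numpy as np

def omkc_sketch(stream, kernels, beta=0.8):
    """Two-step online multiple kernel classification sketch.

    kernels: list of kernel functions k(x1, x2); each kernel keeps its own Perceptron
    (support vectors with labels), and Hedge weights theta combine the kernel classifiers.
    """
    m = len(kernels)
    theta = np.ones(m)                     # combination weights
    sv = [[] for _ in range(m)]            # (x, y) support vectors per kernel

    def score(i, x):
        return sum(yi * kernels[i](xs, x) for xs, yi in sv[i])

    for x, y in stream:                    # y in {-1, +1}
        preds = np.array([1 if score(i, x) >= 0 else -1 for i in range(m)])
        y_hat = 1 if (theta / theta.sum()) @ preds >= 0 else -1   # weighted combination
        if y_hat != y:                     # on a mistake of the combined classifier
            for i in range(m):
                if preds[i] != y:
                    sv[i].append((x, y))   # Step 1: kernel Perceptron update
                    theta[i] *= beta       # Step 2: Hedge-style weight discount
    return theta, sv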
Online Multiple Kernel Classification
• Deterministic Algorithm for OMKC
OMKC by Stochastic Combination
• To improve the efficiency of Algorithm 1 by selecting
a subset of kernels for prediction.
OMKC by Stochastic Updating
• To improve the learning efficiency of Algorithm 1 by
sampling a subset of kernel classifiers for updating,
based on the weights assigned to kernel classifiers.
OMKC by Stochastic Updating &
Stochastic Combination
Summary of OMKC Variants
• OMKC(D,D) is the most computationally intensive algorithm that
updates and combines all the kernel classifiers at each iteration
• OMKC(S,S) is the most efficient algorithm that selectively updates
and combines a subset of kernel classifiers at each iteration.
• Finally, OMKC(D,S) and OMKC(S,D) are the other two variants of
OMKC algorithms in between these two extremes
Empirical Evaluation of OMKC
We compare the four variants of OMKC algorithms for
classification with the following baselines:
• Perceptron: the well-known Perceptron with a linear
kernel (Rosenblatt 1958; Freund and Schapire 1999);
• Perceptron(u): another Perceptron baseline with an
unbiased/uniform combination of all the kernels;
• Perceptron(*): online validation to search for the best
kernel among the pool (using first 10% training data) and
then apply the Perceptron with the best kernel;
• OM-2: a state-of-the-art online learning algorithm for
MKL (Jie et al. 2010; Orabona et al. 2010)
Evaluation Result of OMKC
Evaluation Result of OMKC
Online MKL for Multimedia Retrieval
• Online Multi-Kernel Similarity Learning (Xia et al TPAMI’14)
– Aim to learn multi-kernel similarity for multimedia retrieval
(Figure: OMKS learns from a stream of side information over multiple feature types (color, texture, local patterns/BoW) for content-based multimedia retrieval.)
Kernel Similarity Learning
• Define Similarity function S as follows:
where
• Formulating the optimization framework of
kernel similarity learning as
Online Kernel Similarity Learning
• Consider an online learning setting, at each
trial t, given triplet
, we solve the
following optimization:
• The optimal solution:
Online Multiple Kernel Similarity
• The multiple kernel similarity function:
• Optimization problem:
OMKS Algorithm
• Time complexity
– OKS: O(T |SV|)
– OMKS: O(T |SV|m)
Multi-modal Image Retrieval
(Figure: example query results comparing OASIS(*), OKS(*), OMKS-U, and OMKS.)
Summary of OMKL
• Pros
☺ Nonlinear models for tough applications
☺ Avoid solving complicated optimization directly
☺ Handle multi-modal
data
☺ Theoretical guarantee
• Cons
☹ Scalability has to be further improved
Agenda
• PART I: Introduction
– Big Data: Opportunities & Challenges
– Online Learning: What and Why
– Online Learning Applications
– Overview of Online Learning Methods
• PART II: Online Learning Methods
– Traditional Linear OL Algorithms
– Non-traditional OL Algorithms
– Kernel-based OL Algorithms
• Discussions and Open Issues
• Summary and Take-Home Messages
Discussions: Notion Comparison
• Online Learning (OL) vs Incremental Learning (IL)
• Similarity
– Both learn in a sequential fashion
• Difference
– OL does not assume input knowledge (could be
adversarial), while IL has complete input knowledge
– OL: single pass, IL: can do multiple passes
– OL: may not solve the identical batch learning
problem, IL: solve the same problem, and often
associated with decremental solutions
Discussions: Notion Comparison
• Online Learning (OL) vs Reinforcement Learning (RL)
• Similarity
– Both (bandit OL and RL) work in a sequential fashion with only partial feedback given to the learner
– Both (bandit OL and RL) attempt to trade off between exploitation and exploration
• Difference
– In machine learning, RL is often focused more on Markov decision processes (MDPs) for learning policies, while OL can address general supervised learning tasks
Discussions: Notion Comparison
• Online Learning (OL) vs Active Learning (AL)
• Similarity
– Both learn repeatedly in a sequential manner
• Difference
– OL assumes feedback (either full or partial) can always
be received passively, while AL has to actively solicit the
feedback from the environment
– OL typically processes a single example at a time, while AL may request labels for multiple examples (a.k.a. batch-mode active learning)
Open Issues
• Challenges of Big Data
– Volume
• Explosive growing data: from million to billion scales
• From a single machine to multiple machines in parallel
– Velocity
• Data arrives extremely fast
• From a normal scheme to a real-time solution
– Variety
• Heterogeneous data and diverse sources
• From centralized approach to distributed solutions
Open Issues
• Parallel & Distributed Online Learning
– Motivation
• Not making significant gains in serial computation speed
• Data no longer fit on a single machine.
– Major Issues
• Synchronization: we can't wait for the slowest machine.
• Communication: we can't transfer all information.
– Parallelism in Online Learning
• Split task across M machines, solve independently, combine
• These allow optimal linear speedups
– Asynchronous Computation
• Updating asynchronously saves a lot of time.
– Reduced Communication
• Parallel with single coordinator versus distributed decentralized
Open Issues
• Other Issues
– High-dimensionality
– Data sparsity
– Structural/semi-structural data
– Noise and incomplete data
– Concept drifting
– Domain adaptation
– Incorporation of background knowledge
– Parallel & distributed computing
• User interaction
– Interactive OL vs Passive OL
– Human computation, crowdsourcing
Open Issues
• Applications of Big Data Analytics
– Web Rich Media Search and Mining
– Social Network and Social Media
– Speech Recognition & Mining (e.g., SIRI)
– Multimedia Information Retrieval
– Computer Vision and Multimedia Understanding
– Medical and Healthcare Informatics
– Financial Engineering (with multimodal data)
– etc
Conclusion
• Introduction of emerging opportunities and challenges for big data mining
• Introduction of online learning, widely applied to various real-world applications with big data mining
• Survey of classical and state-of-the-art online learning techniques: traditional linear, non-traditional, and non-linear (single-kernel and multiple-kernel) methods
Take-Home Message
• Online learning is promising for big data mining
• More challenges and opportunities ahead:
– More smart online learning algorithms
– Handle more real-world challenges, e.g., concept drifting,
noise, sparse data, high-dimensional issues, etc.
– Scale up for mining billions of instances using distributed
computing facilities & parallel programming (e.g., Hadoop)
LIBOL: An open-source Library of Online Learning Algorithms
http://libol.stevenhoi.org
References
• Steven C.H. Hoi, Rong Jin, Tianbao Yang, Peilin Zhao, "Online Multiple Kernel Classification", Machine Learning (ML), 2013.
• Hao Xia, Pengcheng Wu, Steven C.H. Hoi, "Online Multi-modal Distance Learning for Scalable Multimedia Retrieval", ACM Intl. Conf. on Web Search and Data Mining (WSDM), 2013.
• Peilin Zhao, Jialei Wang, Pengcheng Wu, Rong Jin, Steven C.H. Hoi, "Fast Bounded Online Gradient Descent Algorithms for Scalable Kernel-Based Online Learning", The 29th International Conference on Machine Learning (ICML), June 26 - July 1, 2012.
• Jialei Wang, Steven C.H. Hoi, "Exact Soft Confidence-Weighted Learning", The 29th International Conference on Machine Learning (ICML), June 26 - July 1, 2012.
• Bin Li, Steven C.H. Hoi, "On-line Portfolio Selection with Moving Average Reversion", The 29th International Conference on Machine Learning (ICML), June 26 - July 1, 2012.
• Jialei Wang, Peilin Zhao, Steven C.H. Hoi, "Cost-Sensitive Online Classification", IEEE International Conference on Data Mining (ICDM), 2012.
• Bin Li, Peilin Zhao, Steven C.H. Hoi, V. Gopalkrishnan, "PAMR: Passive-Aggressive Mean Reversion Strategy for Portfolio Selection", Machine Learning, vol. 87, no. 2, pp. 221-258, 2012.
• Steven C.H. Hoi, Jialei Wang, Peilin Zhao, Rong Jin, "Online Feature Selection for Big Data Mining", ACM SIGKDD Workshop on Big Data Mining (BigMine), Beijing, China, 2012.
• Peilin Zhao, Steven C.H. Hoi, Rong Jin, "Double Updating Online Learning", Journal of Machine Learning Research (JMLR), 2011.
• Peilin Zhao, Steven C.H. Hoi, Rong Jin, Tianbao Yang, "Online AUC Maximization", The 28th International Conference on Machine Learning (ICML), 2011.
References
• John C. Duchi, Elad Hazan, Yoram Singer, "Adaptive Subgradient Methods for Online Learning and Stochastic Optimization", JMLR, 12:2121-2159, 2011.
• Rong Jin, Steven C.H. Hoi, Tianbao Yang, "Online Multiple Kernel Learning: Algorithms and Mistake Bounds", ALT, pp. 390-404, 2010.
• Peilin Zhao, Steven C.H. Hoi, "OTL: A Framework of Online Transfer Learning", The 27th International Conference on Machine Learning (ICML), Haifa, Israel, 21-24 June, 2010.
• Koby Crammer, Daniel D. Lee, "Learning via Gaussian Herding", NIPS, pp. 345-352, 2010.
• Koby Crammer, Alex Kulesza, Mark Dredze, "Adaptive Regularization of Weight Vectors", NIPS, 2009.
• Mark Dredze, Koby Crammer, Fernando Pereira, "Confidence-Weighted Linear Classification", ICML, pp. 264-271, 2008.
• Ofer Dekel, Shai Shalev-Shwartz, Yoram Singer, "The Forgetron: A Kernel-Based Perceptron on a Budget", SIAM J. Comput., 37(5):1342-1372, 2008.
• Francesco Orabona, Joseph Keshet, Barbara Caputo, "The Projectron: A Bounded Kernel-Based Perceptron", ICML, pp. 720-727, 2008.
• Koby Crammer, Mark Dredze, Fernando Pereira, "Exact Convex Confidence-Weighted Learning", NIPS, pp. 345-352, 2008.
• Giovanni Cavallanti, Nicolo Cesa-Bianchi, Claudio Gentile, "Tracking the Best Hyperplane with a Simple Budget Perceptron", Machine Learning, 69(2-3):143-167, 2007.
• Nicolo Cesa-Bianchi, Gabor Lugosi, "Prediction, Learning, and Games", Cambridge University Press, 2006.
References
• Koby Crammer, Ofer Dekel, Joseph Keshet, Shai Shalev-Shwartz, Yoram Singer, "Online Passive-Aggressive Algorithms", JMLR, 7:551-585, 2006.
• Nicolo Cesa-Bianchi, Alex Conconi, Claudio Gentile, "A Second-Order Perceptron Algorithm", SIAM J. Comput., 34(3):640-668, 2005.
• Koby Crammer, Jaz S. Kandola, Yoram Singer, "Online Classification on a Budget", NIPS, 2003.
• Claudio Gentile, "A New Approximate Maximal Margin Classification Algorithm", JMLR, 2:213-242, 2001.
• Jyrki Kivinen, Alex J. Smola, Robert C. Williamson, "Online Learning with Kernels", NIPS, pp. 785-792, 2001.
• Yoav Freund, Robert E. Schapire, "Large Margin Classification Using the Perceptron Algorithm", Machine Learning, 37(3):277-296, 1999.
• Yi Li, Philip M. Long, "The Relaxed Online Maximum Margin Algorithm", NIPS, pp. 498-504, 1999.
• Yoav Freund, Robert E. Schapire, "A Decision-Theoretic Generalization of On-line Learning and an Application to Boosting", Journal of Computer and System Sciences, 55(1):119-139, 1997.
• F. Rosenblatt, "The Perceptron: A Probabilistic Model for Information Storage and Organization in the Brain", Psychological Review, 65:386-407, 1958.