
University of Technology Sydney
Doctoral Assessment
Active Learning for Crowdsourcing
Author:
Meng Fang
Supervisor:
March 18, 2015
Contents

1 Abstract
2 Introduction
3 Background
4 Research question and contribution
5 Research objective and scope
6 Literature review
7 Proposed research methodology and justification
  7.1 Problem Definition and Overall Framework
  7.2 Modeling Multiple Labelers with Reliability
    7.2.1 Label Inference
    7.2.2 Labeler Modeling
    7.2.3 Maximum Likelihood Estimation
  7.3 Self-Taught Active Learning
    7.3.1 Instance Selection
    7.3.2 Labeler Selection
    7.3.3 Self-Taught Learning between Labelers
  7.4 Experiments
    7.4.1 Experimental settings
    7.4.2 A Real-World Data Set
    7.4.3 Benchmark Data Sets
    7.4.4 Label Error Reduction
8 Ethics and risk consideration
9 Research plan
10 Progress to date
1 Abstract
The emergence of social tagging and crowdsourcing systems provides a labeling platform where a number of weak labelers can form a group to fulfill a labeling task. Yet crowd labelers are often noisy, inaccurate, and equipped with inadequate labeling knowledge; worst of all, they act independently without seeking complementary knowledge from each other to improve labeling performance. Our research focuses on the multiple-annotator scenario where multiple labelers, with varying expertise, are available for querying. In our work, we propose a Self-Taught Active Learning (STAL) paradigm, where a crowd of imperfect labelers are able to learn complementary knowledge from one another to expand their knowledge sets and benefit the underlying active learner. Our main technical challenge is threefold: (1) when making a query to multiple weak labelers, we should properly select the most informative instance to query; (2) the limited labeling budget does not allow all labelers to answer each query, so we need to properly model each labeler's weaknesses and strengths and choose a good labeler for each query; and (3) for each query, we should allow crowd labelers to form a self-taught system that gradually improves labeling performance. We employ a probabilistic model to characterize the knowledge of each labeler. We use four random variables X, A, Y, and Z to model the unlabeled instances, the knowledge of the labelers, the observed labels from the labelers, and the genuine labels of the instances. The generative model explicitly captures their relationships, through which we can answer (a) which unlabeled instance carries the most needed knowledge and should be queried; (b) which labeler has the best knowledge to label the instance; and (c) which labeler has the least knowledge and needs to be taught. Experimental results on both real-world and benchmark data sets demonstrate a clear performance gain of STAL compared to a number of baselines.
2 Introduction
Active learning represents a family of methods that aim to reduce labeling cost while maximizing performance gain. It has been extensively studied for nearly two decades since [11, 18] and has become a powerful tool for addressing the labeling cost problem in many real-world data mining problems. In most data mining problems, obtaining labeling information for instances is a time-consuming and expensive process [1]. Instead of labeling all instances or randomly selecting instances to label, active learning [9, 2, 18] selectively chooses the most informative instances for an oracle to label.
In a traditional active learning setting, an omniscient oracle is required to provide a correct answer to each query. This is, unfortunately, hardly the case for many applications, such as social tagging and crowdsourcing systems, where plenty of users can form abundant weak labeling resources [8]. These emerging web-based applications have raised a new active learning problem which involves multiple non-expert labelers who may provide imperfect labels for the same set of queried instances. Existing omniscient-oracle-based active learning cannot take into account the risk of incorrect information provided by weak labelers. Researchers have observed this interesting problem, and some works on extracting useful labeling information from multiple imperfect labelers have been reported recently [13, 23, 12]. Nevertheless, all existing methods based on multiple weak labelers assume that the imperfect labelers' knowledge sets are fixed and that labelers are unable to learn complementary knowledge from one another. This has motivated us to study a new active learning problem, that is, enabling imperfect labelers to learn labeling knowledge from one another and refine their knowledge sets during the active learning process.
Our research focuses on active learning strategies for crowdsourcing systems. We propose a Self-Taught Active Learning (STAL) paradigm, where a crowd of imperfect labelers form a self-taught learning system and learn complementary knowledge from one another to expand their knowledge sets, so that the underlying active learner benefits. To implement such a self-taught active learning process, we have three challenges to address:
• Instance Selection: Identifying the most informative instance for labeling is difficult, mainly because each weak labeler may provide incorrect/noisy labels for the query. We need to identify the most needed instance by taking all labelers as a whole instead of treating them separately;
• Labeler Selection: For each selected query instance, identifying the
most reliable labeler is difficult. We need to properly characterize the
strength/weakness of each labeler so we can identify the labeler with
the most reliable knowledge for the query instance;
• Self-Taught Learning: While existing methods assume weak labelers are independent individuals, we need to promote self-taught learning among labelers. For a specific piece of knowledge or concept, we should know which labeler is good at it and which labeler needs to learn that knowledge.
In our work, we propose a self-taught active learning framework to address the above challenges. Our framework, STAL, employs a probabilistic knowledge-concept model to explicitly characterize the knowledge of different labelers. We consider that making a query still has a cost in the multiple-labeler setting, so each query only involves the answer from one selected labeler (instead of asking all labelers to provide labels for the selected instance). To properly select the instance-labeler pair in each active learning iteration, we use four random variables X, A, Y, and Z to represent the unlabeled instances, the knowledge of the labelers, the observed labels from the labelers, and the genuine labels of the instances. The probability P(Z|x) captures the global uncertainty of an unlabeled instance x with respect to all labelers, and P(A|x) represents a labeler's knowledge in labeling the instance x. As a result, we can properly select the most informative unlabeled instance for labeling, and also use a query instance x's label obtained from the most reliable labeler to teach the most unreliable labeler (i.e., self-taught learning). Experiments on both real-world and benchmark data sets demonstrate a clear performance gain of STAL compared to a number of baselines.
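To make these selection steps concrete, the following is a minimal sketch of one hypothetical STAL query iteration. It is illustrative only, not the actual implementation: global_label_distribution (standing in for P(Z|x) aggregated over labelers), labeler_knowledge (standing in for P(A|x)), and the annotate/learn methods of a labeler are assumed placeholder interfaces that the generative model of Section 7 would supply.

    import math

    def entropy(dist):
        """Shannon entropy of a discrete distribution given as {label: prob}."""
        return -sum(p * math.log(p) for p in dist.values() if p > 0)

    def stal_iteration(unlabeled, labelers,
                       global_label_distribution, labeler_knowledge):
        """One hypothetical STAL query step (illustrative sketch only).

        global_label_distribution(x) -> {label: P(z | x)} aggregated over labelers
        labeler_knowledge(a, x)      -> score for how well labeler a knows x
        """
        # (1) Instance selection: query the instance whose aggregated label
        #     distribution P(Z | x) is most uncertain over all labelers.
        query = max(unlabeled,
                    key=lambda x: entropy(global_label_distribution(x)))

        # (2) Labeler selection: ask the labeler with the strongest knowledge
        #     P(A | x) for the chosen instance.
        best = max(labelers, key=lambda a: labeler_knowledge(a, query))
        label = best.annotate(query)      # hypothetical labeler interface

        # (3) Self-taught learning: pass the obtained label to the labeler
        #     with the weakest knowledge so its knowledge set expands.
        weakest = min(labelers, key=lambda a: labeler_knowledge(a, query))
        weakest.learn(query, label)       # hypothetical labeler interface

        return query, label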
3 Background
Active learning belongs to the field of artificial intelligence; more specifically, it is a subarea of machine learning. The underlying assumption is that we can select portions of data from the data set and train a better classifier based only on those labeled data. Here we give a simple intuition for why active learning works. For a supervised learning problem, we need to collect a large amount of training data. For some learning problems, labeled data is easy to obtain or very cheap to produce, such as web-based rating data or data collected through social networks by thousands of users. We use such data to train systems that suggest music or images to users, or recommend goods to them. In these problems, labeled data is easy to obtain and labeling an instance is very cheap. However, for other problems, such as medical image annotation, obtaining labeled data is expensive and difficult.

The goal of active learning is to handle this labeling problem: how to pose queries from the unlabeled data set so as to obtain labels from an oracle or labeler. Another goal is to obtain a more accurate learner using as few labeled instances as possible; in other words, to minimize the total cost of labeling data. The motivation in modern machine learning problems is that labeling instances is expensive and time consuming, whereas unlabeled data is easy to obtain and abundant. Note that this notion of active learning is different from the techniques of the same name in the education literature.
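For reference, the generic pool-based active learning loop underlying the strategies reviewed later in Section 6 can be sketched as follows. This is a sketch under simplifying assumptions, not any specific published algorithm: train, predict_proba, and oracle are placeholders for whatever learner and annotator a concrete system uses, and the oracle is assumed perfect, as in the traditional setting.

    def least_confident(probs):
        """Uncertainty score: 1 minus the probability of the most likely label."""
        return 1.0 - max(probs.values())

    def active_learning_loop(pool, oracle, train, predict_proba, budget):
        """Generic pool-based active learning (illustrative sketch only).

        pool          : list of unlabeled instances
        oracle(x)     : returns the label of x (assumed perfect here)
        train(D)      : fits a classifier on labeled pairs D = [(x, y), ...]
        predict_proba : (model, x) -> {label: probability}
        budget        : number of queries we can afford
        """
        labeled = []
        model = None
        for _ in range(budget):
            if not pool:
                break
            # Score every remaining instance; with no model yet, pick arbitrarily.
            if model is None:
                query = pool[0]
            else:
                query = max(pool,
                            key=lambda x: least_confident(predict_proba(model, x)))
            pool.remove(query)
            labeled.append((query, oracle(query)))   # ask the (single) oracle
            model = train(labeled)                   # retrain on all labels so far
        return model, labeled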
4 Research question and contribution
Todo..
5 Research objective and scope
Todo..
6 Literature review
Classical active learning strategies select the most uncertain instance and query the labeler (oracle) for its label [9, 19]. Alternatively, one can select an instance on which a committee of classifiers mostly disagree [?, 7, 6]. Another general uncertainty sampling strategy is based on an information-theoretic measure, selecting the instance that minimizes the posterior entropy [11, 21]. For learners that consider data distributions in their decision models, such as SVMs, one can choose to label instances which are close to the decision boundary [21]. Most of these strategies share the same view as version space reduction [17, 21]. Another family of methods aims to query instances that can reduce future classification error as much as possible (i.e., to directly minimize the empirical risk) [15]. All methods in this category assume that the labeler is omniscient and always answers each query accurately.
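As a concrete instance of the uncertainty view above, a standard entropy-based selection rule (written in the P(.|x) notation used elsewhere in this proposal; one common instantiation rather than the exact criterion of any single cited work) queries

    x* = argmax_{x in U} ( - sum_y P(y|x) log P(y|x) ),

i.e., the instance in the unlabeled pool U whose predicted label distribution is most uncertain under the current model.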
Recently, several works have argued that labelers cannot always behave perfectly in reality [12, 3], and a number of studies on active learning with noisy labelers have been reported [10, 20, 4]. In [4], the authors address the budget-constrained labeling problem by balancing two types of labelers with different costs (high cost yields high labeling quality). They consider the different costs of labelers and select labeler-instance pairs by maximizing the information gain under a pre-defined budget. The study in [20] proposes to use the impact of repeated labeling to model the quality of the resultant labels and achieve better active learning accuracy. While works in the above problem setting have already taken into account the fact that labelers may have different levels of expertise [4, 22] or varying opinions [20], they inherently disregard whether a labeler has the knowledge to label the queried instance and require labelers to label instances which may be beyond their knowledge. If a labeler is not given a chance to decline labeling an instance beyond his/her knowledge, he/she may simply provide a false label. This scenario, unfortunately, has not been considered in previous research.
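As a minimal illustration of repeated labeling in the spirit of [20] (not their exact procedure), collecting several noisy labels for the same instance and integrating them by majority vote already improves the resultant label quality; the agreement ratio can serve as a rough quality estimate for the integrated label.

    from collections import Counter

    def majority_label(votes):
        """Integrate repeated noisy labels for one instance by majority vote."""
        counts = Counter(votes)
        label, _ = counts.most_common(1)[0]
        # The fraction of agreeing votes is a rough quality estimate for the
        # integrated label, in the spirit of repeated-labeling schemes.
        quality = counts[label] / len(votes)
        return label, quality

    # Example: five noisy labelers annotate the same instance.
    print(majority_label(["cat", "dog", "cat", "cat", "dog"]))  # ('cat', 0.6)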
Recent advances in Web 2.0 applications, such as social tagging, have demonstrated the clear value of crowdsourcing systems. Extracting the wisdom of crowds has drawn a lot of research attention recently [14]. The major concern of supervised learning in crowdsourcing scenarios is to estimate the reliability of crowdsourced labels. Some works try to learn the quality of a labeler in tandem with the values of the model parameters [3, 14, 23], while others introduce confidence scores to approximate the reliability. Among these works, [23] is the only one that studies active learning in crowdsourcing scenarios. This work introduces a probabilistic model of the confidences of different labelers by forming a bi-convex optimization problem. It then tries to find the instance from the training set that is closest to the optimal value and the most reliable labeler to query for this instance. Unfortunately, this work cannot model the labelers' knowledge sets, so the learner cannot exploit the ability of each labeler to benefit active learning.
Although active learning research based on multiple weak labelers has emerged, to the best of our knowledge, there is no existing work addressing the active learning problem in which crowds of imperfect labelers can learn complementary knowledge from one another. The important differences between our problem setting and previous active learning problems are that (1) we explicitly model the knowledge of each labeler, and (2) we promote a self-taught active learning paradigm by letting labelers benefit from one another during the active learning process.
7 Proposed research methodology and justification
Todo..
7.1 Problem Definition and Overall Framework
7.2 Modeling Multiple Labelers with Reliability
Todo..
7.2.1 Label Inference
Todo..
7.2.2 Labeler Modeling
Todo..
7.2.3 Maximum Likelihood Estimation
Todo..
7.3 Self-Taught Active Learning
Todo..
7.3.1 Instance Selection
Todo..
7.3.2 Labeler Selection
Todo..
7.3.3 Self-Taught Learning between Labelers
Todo..
7.4 Experiments
Todo..
7.4.1 Experimental settings
Todo..
7.4.2 A Real-World Data Set
Todo..
7.4.3 Benchmark Data Sets
Todo..
7.4.4 Label Error Reduction
Todo..
8 Ethics and risk consideration
Todo..
9 Research plan
Todo..
10 Progress to date
Todo..
References
[1] J. Baldridge and M. Osborne. Active learning and the total cost of
annotation. In Proceedings of EMNLP, pages 9–16, 2004.
[2] David A. Cohn, Zoubin Ghahramani, and Michael I. Jordan. Active
learning with statistical models. J. Artif. Int. Res., 4:129–145, 1996.
[3] Ofer Dekel and Ohad Shamir. Good learners for evil teachers. In Proc.
of ICML, pages 233–240, New York, NY, USA, 2009.
[4] Pinar Donmez and Jaime G. Carbonell. Proactive learning: cost-sensitive active learning with multiple imperfect oracles. In Proc. of CIKM, pages 619–628, New York, NY, USA, 2008.
[5] A. Frank and A. Asuncion. UCI machine learning repository, 2010.
[6] Yoav Freund and Robert E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of
Computer and System Sciences, 55(1):119–139, 1997.
[7] Yoav Freund, H. Sebastian Seung, Eli Shamir, and Naftali Tishby. Selective sampling using the query by committee algorithm. Mach. Learn.,
28:133–168, 1997.
[8] Jeff Howe. Crowdsourcing: Why the Power of the Crowd Is Driving the
Future of Business. Crown Publishing Group, New York, NY, USA, 1
edition, 2008.
[9] David D. Lewis and William A. Gale. A sequential algorithm for training text classifiers. In Proc. of SIGIR, pages 3–12, New York, NY, USA,
1994.
[10] G. Lugosi. Learning with an unreliable teacher. Pattern Recognition,
25(1):79–87, 1992.
[11] David J. C. MacKay. Information-based objective functions for active
data selection. Neural Comput., 4:590–604, 1992.
[12] P. Rashidi and D. Cook. Ask me better questions: Active learning
queries based on rule induction. In Proc. of KDD, 2011.
[13] V.C. Raykar, S. Yu, L.H. Zhao, A. Jerebko, C. Florin, G.H. Valadez,
L. Bogoni, and L. Moy. Supervised learning from multiple experts:
Whom to trust when everyone lies a bit. In Proc. of ICML, pages
889–896. ACM, 2009.
[14] Vikas C. Raykar, Shipeng Yu, Linda H. Zhao, Gerardo Hermosillo
Valadez, Charles Florin, Luca Bogoni, and Linda Moy. Learning from
crowds. J. Mach. Learn. Res., 11:1297–1322, 2010.
[15] Nicholas Roy and Andrew McCallum. Toward optimal active learning
through sampling estimation of error reduction. In Proc. of ICML,
pages 441–448, San Francisco, CA, USA, 2001.
[16] Andrey Rzhetsky, Hagit Shatkay, and W. John Wilbur. How to get the
most out of your curation effort. PLoS Comput Biol, 5(5):e1000391,
2009.
[17] Greg Schohn and David Cohn. Less is more: Active learning with support vector machines. In Proc. of ICML, pages 839–846, San Francisco,
CA, USA, 2000.
[18] B. Settles. Active learning literature survey. University of Wisconsin,
Madison, 2010.
[19] B. Settles and M. Craven. An analysis of active learning strategies
for sequence labeling tasks. In Proc. of EMNLP, pages 1070–1079.
Association for Computational Linguistics, 2008.
[20] Victor S. Sheng, Foster Provost, and Panagiotis G. Ipeirotis. Get another label? improving data quality and data mining using multiple,
noisy labelers. In Proc. of KDD, pages 614–622, New York, NY, USA,
2008.
[21] Simon Tong and Daphne Koller. Support vector machine active learning
with applications to text classification. J. Mach. Learn. Res., 2:45–66,
2002.
[22] Y. Yan, R. Rosales, G. Fung, M. Schmidt, G. Hermosillo, L. Bogoni, L. Moy, and J. Dy. Modeling annotator expertise: Learning when everybody knows a bit of something. In Proc. of AISTATS, 2010.
[23] Yan Yan, Romer Rosales, Glenn Fung, and Jennifer Dy. Active learning
from crowds. In Proc. of ICML, pages 1161–1168, New York, NY, USA,
2011.
[24] Xingquan Zhu and Xindong Wu. Class noise vs. attribute noise: A
quantitative study of their impacts. Artificial Intelligence Review,
22:177–210, 2004.