University of Technology Sydney
Doctoral Assessment

Active Learning for Crowdsourcing

Author: Meng Fang
Supervisor:
March 18, 2015

Contents

1 Abstract
2 Introduction
3 Background
4 Research question and contribution
5 Research objective and scope
6 Literature review
7 Proposed research methodology and justification
  7.1 Problem Definition and Overall Framework
  7.2 Modeling Multiple Labelers with Reliability
    7.2.1 Label Inference
    7.2.2 Labeler Modeling
    7.2.3 Maximum Likelihood Estimation
  7.3 Self-Taught Active Learning
    7.3.1 Instance Selection
    7.3.2 Labeler Selection
    7.3.3 Self-Taught Learning between Labelers
  7.4 Experiments
    7.4.1 Experimental settings
    7.4.2 A Real-World Data Set
    7.4.3 Benchmark Data Sets
    7.4.4 Label Error Reduction
8 Ethics and risk consideration
9 Research plan
10 Progress to date

1 Abstract

The emergence of social tagging and crowdsourcing systems provides a labeling platform where a number of weak labelers can form a group to fulfill a labeling task. Yet crowd labelers are often noisy, inaccurate, and equipped with inadequate labeling knowledge; worst of all, they act independently without seeking complementary knowledge from each other to improve labeling performance. Our research focuses on the multiple-annotator scenario where several labelers, with varying expertise, are available for querying. In our work, we propose a Self-Taught Active Learning (STAL) paradigm, in which a crowd of imperfect labelers learn complementary knowledge from one another to expand their knowledge sets and benefit the underlying active learner. Our main technical challenge is threefold: (1) when making a query to multiple weak labelers, we should properly select the most informative instance to query; (2) the limited labeling budget does not allow all labelers to answer each query, so we need to properly model each labeler's weakness and strength and choose a good labeler for each query; and (3) for each query, we should allow crowd labelers to form a self-taught system that gradually improves the labeling performance. We employ a probabilistic model to characterize the knowledge of each labeler. We use four random variables X, A, Y, and Z to model unlabeled instances, the knowledge of the labelers, the observed labels from the labelers, and the genuine labels of the instances. The generative model explicitly captures their relationship, through which we can answer (a) which unlabeled instance carries the most needed knowledge and should be queried; (b) which labeler has the best knowledge to label the instance; and (c) which labeler has the least knowledge and needs to be taught. Experimental results on both real-world and benchmark data sets demonstrate a clear performance gain of STAL compared to a number of baselines.

2 Introduction

Active learning represents a family of methods that aims to reduce the labeling cost while maximizing the performance gain.
It has been extensively studied for nearly two decades [11, 18] and has become a powerful tool for addressing the labeling cost problem in many real-world data mining tasks. In most data mining problems, obtaining labeling information for instances is a time-consuming and expensive process [1]. Instead of labeling all instances or randomly selecting instances to label, active learning [9, 2, 18] selectively chooses the most informative instances for an oracle to label.

In a traditional active learning setting, an omniscient oracle is required to provide a correct answer to each query. This is, unfortunately, hardly the case for many applications, such as social tagging and crowdsourcing systems, where large numbers of users form abundant weak labeling resources [8]. These emerging web-based applications have raised a new active learning problem that involves multiple non-expert labelers who may provide imperfect labels for the same set of queried instances. Existing omniscient-oracle-based active learning cannot take into account the risk of incorrect information provided by weak labelers. Researchers have observed this interesting problem, and some works on extracting useful labeling information from multiple imperfect labelers have been reported recently [13, 23, 12]. Nevertheless, all existing methods based on multiple weak labelers assume that the imperfect labelers' knowledge sets are fixed and that the labelers are unable to learn complementary knowledge from one another. This has motivated us to study a new active learning problem, namely enabling imperfect labelers to learn labeling knowledge from one another and refine their knowledge sets during the active learning process.

Our research focuses on active learning strategies for crowdsourcing systems. We propose a Self-Taught Active Learning (STAL) paradigm, where a crowd of imperfect labelers form a self-taught learning system and learn complementary knowledge from one another to expand their knowledge sets, so that the underlying active learner benefits. To implement such a self-taught active learning process, we have three challenges to address:

• Instance Selection: Identifying the most informative instance for labeling is difficult, mainly because each weak labeler may provide incorrect/noisy labels for the query. We need to identify the most needed instance by taking all labelers as a whole instead of treating them separately;

• Labeler Selection: For each selected query instance, identifying the most reliable labeler is difficult. We need to properly characterize the strength/weakness of each labeler so we can identify the labeler with the most reliable knowledge for the query instance;

• Self-Taught Learning: While existing methods assume weak labelers are independent individuals, we need to promote self-taught learning among labelers. For a specific piece of knowledge or concept, we should know which labeler is good at it and which labeler needs to be taught that knowledge.

In our work, we propose a self-taught active learning framework to address the above challenges. Our framework, STAL, employs a probabilistic knowledge-concept model to explicitly characterize the knowledge of different labelers. Since making a query still incurs a cost in the multiple-labeler setting, each query involves the answer from only one selected labeler (instead of asking all labelers to provide labels for the selected instance).
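To make the intended query loop concrete, the sketch below simulates it end to end. It is a minimal illustration, not the proposed method itself: the synthetic one-dimensional data, the scikit-learn classifier, the entropy-based uncertainty, the reliability estimate based on agreement with the current model, and the accuracy bump that stands in for "teaching" the weakest labeler are all simplifying assumptions introduced here, and every function and variable name is hypothetical. The actual probabilistic criteria are outlined in the next paragraph and will be detailed in Section 7.

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Hypothetical pool: 1-D instances whose true label is the sign of the feature.
X_pool = rng.normal(size=(300, 1))
z_true = (X_pool[:, 0] > 0).astype(int)

# Three weak labelers with different (hidden) accuracies, standing in for
# limited knowledge sets.
labeler_acc = np.array([0.9, 0.7, 0.55])

def ask(j, i):
    """Labeler j answers instance i, possibly incorrectly."""
    return int(z_true[i]) if rng.random() < labeler_acc[j] else 1 - int(z_true[i])

reliability = np.full(len(labeler_acc), 0.5)    # learner's per-labeler estimate
labeled_idx = [int(np.argmin(X_pool[:, 0])), int(np.argmax(X_pool[:, 0]))]
labels = [int(z_true[i]) for i in labeled_idx]  # assume two seed instances come pre-labeled

clf = LogisticRegression()
for step in range(40):
    clf.fit(X_pool[labeled_idx], labels)
    proba = clf.predict_proba(X_pool)

    # 1. Instance selection: highest predictive entropy over the unlabeled pool.
    ent = -(proba * np.log(proba + 1e-12)).sum(axis=1)
    ent[labeled_idx] = -np.inf
    i_star = int(np.argmax(ent))

    # 2. Labeler selection: query only the labeler currently estimated most reliable.
    j_best = int(np.argmax(reliability))
    y = ask(j_best, i_star)
    labeled_idx.append(i_star)
    labels.append(y)

    # Update the reliability estimate from agreement with the current model.
    agrees = float(y == clf.predict(X_pool[[i_star]])[0])
    reliability[j_best] = 0.9 * reliability[j_best] + 0.1 * agrees

    # 3. Self-taught step: the weakest labeler is taught the obtained label;
    #    simulated here as a small improvement of that labeler's accuracy.
    j_weak = int(np.argmin(reliability))
    labeler_acc[j_weak] = min(1.0, labeler_acc[j_weak] + 0.01)

print("final model accuracy:", (clf.predict(X_pool) == z_true).mean())
print("estimated labeler reliabilities:", np.round(reliability, 2))

On this toy data the loop concentrates its queries on the labeler whose reliability estimate stays highest while gradually improving the weakest labeler, which is the behaviour STAL is designed to exhibit.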
To properly select the instance-labeler pair in each active learning iteration, we use four random variables X, A, Y, and Z to represent unlabeled instances, the knowledge of the labelers, the observed labels from the labelers, and the genuine labels of the instances. The probability P(Z|x) then captures the global uncertainty of an unlabeled instance x with respect to all labelers, and P(A|x) represents a labeler's knowledge in labeling the instance x. As a result, we can properly select the most informative unlabeled instance for labeling, and also use a query instance x's label, obtained from the most reliable labeler, to teach the most unreliable labeler (i.e., self-taught learning). Experiments on both real-world and benchmark data sets demonstrate a clear performance gain of STAL compared to a number of baselines.

3 Background

Active learning is a subarea of machine learning and, more broadly, of artificial intelligence. The underlying assumption is that we can select part of the data set to be labeled and train a better classifier based only on those labeled data. Here we give a simple intuition for why active learning works. For a supervised learning problem, we usually need to collect a large training data set. For some learning problems, labeled data is easy to obtain and labeling is cheap, for example web-based rating data or data collected from thousands of users through social networks; such data can be used to train systems that suggest music or images to users or recommend goods. For other problems, however, obtaining labeled data is expensive and difficult, such as medical image annotation.

One goal of active learning is to address this labeling problem: how to select queries from the unlabeled data set and obtain their labels from an oracle or labeler. Another goal is to obtain a more accurate learner using as few labeled instances as possible, in other words, to minimize the total cost of labeling data. The motivation in modern machine learning problems is that labeling instances is expensive and time-consuming, while unlabeled data is easy to obtain and abundant. Note that this notion of active learning is different from the techniques of the same name in the education literature.

4 Research question and contribution

Todo..

5 Research objective and scope

Todo..

6 Literature review

The classical active learning strategy is to select the most uncertain instance and query the labeler (oracle) for its label [9, 19]. Alternatively, one can select an instance on which a committee of classifiers most disagrees [?, 7, 6]. Another general uncertainty sampling strategy is based on an information-theoretic measure and selects the instance minimizing the posterior entropy [11, 21]. For learners that consider data distributions in their decision models, such as SVMs, one can choose to label instances that are close to the decision boundary [21]. Most of these strategies share the same view of version space reduction [17, 21]. Another family of methods aims to query instances that reduce the future classification error as much as possible (i.e., that directly minimize the empirical risk) [15].
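As a concrete illustration of the uncertainty-based strategies above, the standard entropy criterion can be written as

\[
x^{*} \;=\; \operatorname*{arg\,max}_{x \in \mathcal{U}} \left( - \sum_{y} P_{\theta}(y \mid x) \log P_{\theta}(y \mid x) \right),
\]

where \(\mathcal{U}\) denotes the unlabeled pool and \(P_{\theta}(y \mid x)\) the class posterior of the current model. This is the textbook single-oracle formulation rather than the multi-labeler criterion developed in this proposal, and the notation \(\mathcal{U}\), \(\theta\) is introduced here for illustration only.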
All of the above methods assume that the labeler is omniscient and always answers each query accurately. Recently, several works have argued that labelers cannot always behave perfectly in reality [12, 3], and a number of studies on active learning with noisy labelers have been reported [10, 20, 4]. In [4], the authors address the budget-constrained labeling problem by balancing two types of labelers with different costs (higher cost yields higher labeling quality): they take the differing labeler costs into account and select labeler-instance pairs by maximizing the information gain under a pre-defined budget. The study in [20] proposes to use repeated labeling to model the quality of the resulting labels and thereby achieve better active learning accuracy. While works in the above problem setting already take into account the fact that labelers may have different levels of expertise [4, 22] or varying opinions [20], they inherently disregard whether a labeler has the knowledge to label the queried instance and require labelers to label instances that may be beyond their knowledge. If a labeler is not given the chance to decline labeling an instance beyond his/her knowledge, he/she may simply provide a false label. This scenario, unfortunately, has never been considered in previous research.

Recent advances in Web 2.0 applications, such as social tagging, have demonstrated the clear value of crowdsourcing systems. Extracting the wisdom of crowds has drawn much research attention recently [14]. The major concern of supervised learning in crowdsourcing scenarios is to estimate the reliability of crowdsourced labels. Some works try to learn the quality of a labeler in tandem with learning the values of the model parameters [3, 14, 23], while others introduce confidence scores to approximate the reliability. Among these works, [23] is the only one that studies active learning in crowdsourcing scenarios. It introduces a probabilistic model of the confidences of different labelers by forming a bi-convex optimization problem, then finds the instance in the training set that is closest to the optimal value and the most reliable labeler to query for this instance. Unfortunately, this work does not model the labelers' knowledge sets, so the learner cannot use each labeler's ability to benefit active learning.

Although active learning research based on multiple weak labelers has emerged, to the best of our knowledge there is no existing work addressing the active learning problem in which a crowd of imperfect labelers can learn complementary knowledge from one another. The important differences between our problem setting and previous active learning problems are that (1) we explicitly model the knowledge of each labeler, and (2) we promote a self-taught active learning paradigm by letting labelers benefit from one another during the active learning process.

7 Proposed research methodology and justification

Todo..

7.1 Problem Definition and Overall Framework

7.2 Modeling Multiple Labelers with Reliability

Todo..

7.2.1 Label Inference

Todo..

7.2.2 Labeler Modeling

Todo..

7.2.3 Maximum Likelihood Estimation

Todo..

7.3 Self-Taught Active Learning

Todo..

7.3.1 Instance Selection

Todo..

7.3.2 Labeler Selection

Todo..

7.3.3 Self-Taught Learning between Labelers

Todo..

7.4 Experiments

Todo..

7.4.1 Experimental settings

Todo..

7.4.2 A Real-World Data Set

Todo..

7.4.3 Benchmark Data Sets

Todo..

7.4.4 Label Error Reduction

Todo..
8 Ethics and risk consideration

Todo..

9 Research plan

Todo..

10 Progress to date

Todo..

References

[1] J. Baldridge and M. Osborne. Active learning and the total cost of annotation. In Proc. of EMNLP, pages 9–16, 2004.
[2] David A. Cohn, Zoubin Ghahramani, and Michael I. Jordan. Active learning with statistical models. J. Artif. Int. Res., 4:129–145, 1996.
[3] Ofer Dekel and Ohad Shamir. Good learners for evil teachers. In Proc. of ICML, pages 233–240, New York, NY, USA, 2009.
[4] Pinar Donmez and Jaime G. Carbonell. Proactive learning: cost-sensitive active learning with multiple imperfect oracles. In Proc. of CIKM, pages 619–628, New York, NY, USA, 2008.
[5] A. Frank and A. Asuncion. UCI machine learning repository, 2010.
[6] Yoav Freund and Robert E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1):119–139, 1997.
[7] Yoav Freund, H. Sebastian Seung, Eli Shamir, and Naftali Tishby. Selective sampling using the query by committee algorithm. Mach. Learn., 28:133–168, 1997.
[8] Jeff Howe. Crowdsourcing: Why the Power of the Crowd Is Driving the Future of Business. Crown Publishing Group, New York, NY, USA, 1st edition, 2008.
[9] David D. Lewis and William A. Gale. A sequential algorithm for training text classifiers. In Proc. of SIGIR, pages 3–12, New York, NY, USA, 1994.
[10] G. Lugosi. Learning with an unreliable teacher. Pattern Recognition, 25(1):79–87, 1992.
[11] David J. C. MacKay. Information-based objective functions for active data selection. Neural Comput., 4:590–604, 1992.
[12] P. Rashidi and D. Cook. Ask me better questions: Active learning queries based on rule induction. In Proc. of KDD, 2011.
[13] V. C. Raykar, S. Yu, L. H. Zhao, A. Jerebko, C. Florin, G. H. Valadez, L. Bogoni, and L. Moy. Supervised learning from multiple experts: Whom to trust when everyone lies a bit. In Proc. of ICML, pages 889–896. ACM, 2009.
[14] Vikas C. Raykar, Shipeng Yu, Linda H. Zhao, Gerardo Hermosillo Valadez, Charles Florin, Luca Bogoni, and Linda Moy. Learning from crowds. J. Mach. Learn. Res., 11:1297–1322, 2010.
[15] Nicholas Roy and Andrew McCallum. Toward optimal active learning through sampling estimation of error reduction. In Proc. of ICML, pages 441–448, San Francisco, CA, USA, 2001.
[16] Andrey Rzhetsky, Hagit Shatkay, and W. John Wilbur. How to get the most out of your curation effort. PLoS Comput Biol, 5(5):e1000391, 2009.
[17] Greg Schohn and David Cohn. Less is more: Active learning with support vector machines. In Proc. of ICML, pages 839–846, San Francisco, CA, USA, 2000.
[18] B. Settles. Active learning literature survey. University of Wisconsin, Madison, 2010.
[19] B. Settles and M. Craven. An analysis of active learning strategies for sequence labeling tasks. In Proc. of EMNLP, pages 1070–1079. Association for Computational Linguistics, 2008.
[20] Victor S. Sheng, Foster Provost, and Panagiotis G. Ipeirotis. Get another label? Improving data quality and data mining using multiple, noisy labelers. In Proc. of KDD, pages 614–622, New York, NY, USA, 2008.
[21] Simon Tong and Daphne Koller. Support vector machine active learning with applications to text classification. J. Mach. Learn. Res., 2:45–66, 2002.
[22] Y. Yan, R. Rosales, G. Fung, M. Schmidt, G. Hermosillo, L. Bogoni, L. Moy, J. Dy, and PA Malvern. Modeling annotator expertise: Learning when everybody knows a bit of something. In Proc. of AISTATS, 2010.
[23] Yan Yan, Romer Rosales, Glenn Fung, and Jennifer Dy. Active learning from crowds. In Proc. of ICML, pages 1161–1168, New York, NY, USA, 2011.
[24] Xingquan Zhu and Xindong Wu. Class noise vs. attribute noise: A quantitative study of their impacts. Artificial Intelligence Review, 22:177–210, 2004.