Parameter tuning: Exposing the gap between data curation and effective data analytics

Catherine Blake, [email protected]
Henry A. Gabb, [email protected]
School of Library and Information Science, University of Illinois at Urbana-Champaign

ABSTRACT

The “big data” movement promises to deliver better decisions in all aspects of our lives, from business to science, health, and government, by using computational techniques to identify patterns in large historical collections of data. Although a unified view from curation to analysis has been proposed, current research appears to have polarized into two separate groups: those curating large datasets and those developing computational methods to identify patterns in large datasets. The case study presented here demonstrates the enormous impact that parameter tuning can have on the accuracy, precision, and recall of a computational model generated from data. It also illustrates the vastness of the parameter space that must be searched in order to produce optimal models, and curated in order to avoid redundant experiments. This highlights the need for research that focuses on the gap between collection and analytics if we are to realize the potential of big data.

Keywords

Big data, data curation, data analytics, parameter space.

INTRODUCTION

The “big data” movement promises to deliver better decisions in all aspects of our lives, from business to science, health, and government, by using computational techniques to identify patterns in large historical collections of data. Jim Gray, the first advocate of data-intensive science, argued that we should “support the whole research cycle – from data capture and data curation to data analysis and data visualization” (Hey, Tansley, & Tolle, 2009). Similarly, work on knowledge discovery in databases frames it as a process that moves from data selection through preprocessing, transformation, and data mining to interpretation (Fayyad, Piatetsky-Shapiro, & Smyth, 1996).

Despite Gray’s unified view, current data research appears to have polarized into two separate groups. The first group focuses on how to curate large datasets so that they are poised to take advantage of computational methods, while the second group focuses on how to develop computational methods that identify patterns. The notion that the analytical potential of data will not be realized without understanding its collection and preprocessing has been observed by other researchers, who report that raw data typically require several transformations before reuse is possible, and that limited amounts of data and complex transformations leave “a gap between the potential value of analytics and the actual value achieved” (Kohavi, Mason, Parekh, & Zheng, 2004). Others have found that “careful consideration of the basic factors that govern how data is generated (and collected) can lead to significantly more accurate predictions” (Smyth & Elkan, 2010). In addition to model quality, there are important resource allocation issues associated with data preparation, which accounts for up to 60% of the effort involved in knowledge discovery (Cabena, 1998).
Interestingly, the McKinsey report, one of the most cited papers on the workforce needed for big data, states that “the United States alone faces a shortage of 140,000 to 190,000 people with deep analytics skills,” but that many more people (1.5 million) are required to manage and make decisions based on the models created (Manyika et al., 2011).

This case study of models produced while conducting text mining research clearly demonstrates that parameter tuning can have an enormous impact on model quality (as measured by accuracy, precision, and recall). Despite this impact, complete parameter sets are rarely reported in the scientific literature, in part because the search space is so large. This is also an issue for industry as computational models transition from academe into mainstream tools that target non-experts. Instructions on how to set parameters will become increasingly important, but this is a challenging problem because there is typically no overall optimal parameter set (i.e., the optimal settings depend on the incoming data). Our hope is that this example highlights the need to understand how to explore the parameter space when building models, and the need to report sufficient details about these settings for reproducibility.

METHODS

Our initial project goal was to automatically identify results (also called findings or claims) reported in an empirical study. The problem was framed as a binary classification task, where the classifier was trained to discern a result sentence from a non-result sentence. With respect to preprocessing, full-text articles were manually annotated to identify sentences that contained results and those that did not. With respect to transformations, we built an automated feature selection strategy that drew features from three sources: the individual words in each sentence; the sentence location in the article (abstract, introduction, method, result, or discussion); and whether the sentence is the first or last sentence of a paragraph (results are frequently reported at the beginning or end of a paragraph). All of the models in the present study take sentence location into account. Features were then ranked using three domain-independent selection strategies: information gain, the χ² statistic (CHI), and mutual information (Yang & Pedersen, 1997). All of the models in the present study use CHI terms. Data were stored in an Oracle Database 11g Enterprise Edition (release 11.2.0.1, 64-bit production build).
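To make the feature strategy concrete, the following is a minimal sketch in Python that uses scikit-learn as a stand-in for the Oracle-based pipeline used in the study. The example sentences, section labels, and the choice of the two top-ranked terms are illustrative assumptions only, not the data or settings used here.

```python
# Illustrative sketch (not the pipeline used in the study): combine word
# features with sentence-location features, then rank the word features with
# the chi-squared (CHI) statistic as in Yang & Pedersen (1997).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from scipy.sparse import hstack, csr_matrix
import numpy as np

# Hypothetical annotated sentences: text, article section, first/last-of-paragraph
# flag, and the binary label (1 = contains a result/claim, 0 = does not).
sentences = [
    ("We observed a significant increase in yield.", "result", 1, 1),
    ("Samples were incubated for 24 hours.", "method", 0, 0),
    ("These findings suggest a dose-dependent effect.", "discussion", 1, 1),
    ("Participants were recruited from two clinics.", "method", 1, 0),
]
texts = [s[0] for s in sentences]
labels = np.array([s[3] for s in sentences])

# 1. Word features: a simple bag of words for each sentence.
word_features = CountVectorizer(lowercase=True).fit_transform(texts)

# 2. Rank the word features with the chi-squared statistic and keep the top k
#    terms (k = 2 purely for illustration; the study varied the number of CHI
#    terms, e.g., 40 or 250).
top_word_features = SelectKBest(chi2, k=2).fit_transform(word_features, labels)

# 3. Sentence-location features: article section (one-hot) and whether the
#    sentence opens or closes its paragraph.
section_names = ["abstract", "introduction", "method", "result", "discussion"]
section_onehot = np.array(
    [[1 if s[1] == name else 0 for name in section_names] for s in sentences]
)
first_or_last = np.array([[s[2]] for s in sentences])

# Final design matrix: selected word features plus location features.
X = hstack([top_word_features, csr_matrix(section_onehot), csr_matrix(first_or_last)])
print(X.shape)
```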
The parameter tuning that is the focus of this paper occurs during the data mining step of the knowledge discovery process. We used four classification algorithms: decision tree (DT), general linear model (GLM), support vector machine (SVM), and naïve Bayes (NB), as implemented in Oracle Data Miner 11g (ODM release 1), which is part of Oracle SQL Developer (version 3.2).

RESULTS AND DISCUSSION

The number of parameters that can be tuned for a given classifier in the ODM ranges from 2 to 7 (see Table 1). In addition to the algorithm-specific parameters, the user can also decide whether the final model should take the distribution of classes in the training dataset into account; this can be set automatically by the ODM or manually by the user. Three class weight settings are examined in this study: natural, balanced, and equal. Natural weight refers to the actual proportions of the class variable, in our case whether or not a sentence contains a result. Balanced weight means that the inverse proportions are used so that model predictions are not unduly skewed toward the more commonly occurring class. All of the ODM classifiers use balanced weighting by default. Equal weight simply means that the model treats both classes the same, regardless of their proportions in the training data.

Although the number of parameters for a given classifier may appear manageable, many of them are continuous, which leads to an enormous search space. The size of the search space may explain why only part of the parameter space is typically reported in the literature.

Figure 1. The effect of SVM kernel selection and class weights on model accuracy. The linear kernel with balanced class weights is the default setting, indicated in red.

Table 1. Tuning parameters available to ODM users.

Parameter | Algorithm | Default
Generate Row Diagnostics | GLM | Off
Confidence Level | GLM | 0.95
Reference Class Name | GLM | System
Missing Value Treatment | GLM | Mean
Specify Row Weights | GLM | Off
Enable Ridge Regression | GLM | On
Ridge Value | GLM | System
Singleton Threshold | NB | 0 [1]
Pairwise Threshold | NB | 0 [1]
Kernel Function | SVM | Linear
Tolerance Value | SVM | 0.001
Complexity Factor | SVM | System
Active Learning | SVM | On
Homogeneity Metric | DT | Gini
Maximum Depth | DT | 7
Minimum Records in a Node | DT | 10
Minimum Percent of Records in a Node | DT | 0.05
Minimum Records for a Split | DT | 20
Minimum Percent of Records for a Split | DT | 0.1

[1] The GUI version of the ODM has a default of 0, but the documentation states that the default threshold is 0.01.

The important result is not how many parameters can be tuned, but rather how drastically tuning these parameters can affect the quality of the computational model. To explore this relationship, we measured accuracy (the number of sentences correctly classified divided by the total number of sentences), precision (the number of claim-containing sentences correctly classified divided by the total number of sentences identified as claim-containing), recall (the number of claim-containing sentences correctly classified divided by the total number of claim-containing sentences), and the F1 score (the harmonic mean of precision and recall). There is only enough space here to examine a couple of parameters for the four classifiers, but this is sufficient to demonstrate the effect that just two parameters in the vast parameter space can have on model quality.
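The sketch below illustrates how the three class weight settings could be derived from the label distribution, following the descriptions given above rather than the ODM implementation, and how each resulting model would be scored with accuracy, precision, recall, and F1. scikit-learn's SVC again stands in for the ODM classifiers, and the synthetic data are hypothetical.

```python
# Illustrative sketch of the three class weight schemes (natural, balanced,
# equal) and the four evaluation metrics. Not the code used in the study.
import numpy as np
from collections import Counter
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

rng = np.random.default_rng(0)

# Hypothetical imbalanced data: roughly 20% positive (result) sentences.
X = rng.normal(size=(400, 10))
y = (rng.random(400) < 0.2).astype(int)
X[y == 1] += 0.75  # give the positive class a weak signal
X_train, y_train, X_test, y_test = X[:300], y[:300], X[300:], y[300:]

counts = Counter(y_train)
n = len(y_train)
weight_schemes = {
    # Natural: weight each class by its actual proportion in the training data.
    "natural": {c: counts[c] / n for c in counts},
    # Balanced: inverse proportions, so predictions are not skewed toward the
    # more commonly occurring class.
    "balanced": {c: n / (len(counts) * counts[c]) for c in counts},
    # Equal: treat both classes the same regardless of their proportions.
    "equal": {c: 1.0 for c in counts},
}

for name, weights in weight_schemes.items():
    pred = SVC(kernel="linear", class_weight=weights).fit(X_train, y_train).predict(X_test)
    print(f"{name:9s}"
          f" accuracy={accuracy_score(y_test, pred):.3f}"
          f" precision={precision_score(y_test, pred, zero_division=0):.3f}"
          f" recall={recall_score(y_test, pred, zero_division=0):.3f}"
          f" F1={f1_score(y_test, pred, zero_division=0):.3f}")
```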
The focus here is not on model quality per se, but rather on the impact that parameter settings have on the quality of computational models. Figure 1 shows the effect that kernel selection and class weights have on the accuracy of the SVM classifier. The default settings provide the best or the worst accuracy depending on the number of term features, and the difference between the highest and lowest accuracy is 0.066. The class weight parameter also affects the accuracy of the DT classifier, but the effect on the GLM and NB classifiers is less pronounced than on SVM and DT (Figure 2).

Figure 2. The effect of class weights on model accuracy. The SVM and DT classifiers are affected more strongly than GLM and NB.

Accuracy does not give a complete picture of model performance, so precision, recall, and F1 scores are also examined (Figure 3). In general, models achieve higher F1 scores with balanced class weights, which also favor recall over precision. The opposite is true for models built with natural class weights, where precision is favored over recall. With the exception of DT, weighting the classes equally yields models somewhere in between. The effect of the kernel and class weight parameters is more pronounced on precision and recall (Figure 3). When 40 features are used, the differences between the highest and lowest precision, recall, and F1 score are 0.305, 0.374, and 0.134, respectively.

Figure 3. Precision vs. recall for models built with the default classifier settings, 40 CHI terms, and varying class weights. The F1 score for each model is shown after the class weight setting.

These results clearly show that simply using the ODM default settings may not produce optimal models. Models should ideally have both high precision and high recall, but there is a tradeoff between the two metrics, and one may be preferred over the other depending on user requirements. For example, high recall might be preferred at the expense of precision when it is more important not to miss potentially critical pieces of information. Parameter tuning is therefore an essential part of model building, but as Table 1 indicates, an exhaustive search of the parameter space is difficult. It is nonetheless important to record which regions of the parameter space have been examined so that redundant work is not performed.
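One lightweight way to record which regions of the parameter space have been examined is to log every parameter combination alongside its scores as a sweep runs. The sketch below shows this idea for the two parameters examined here (SVM kernel and class weighting); scikit-learn once more stands in for the ODM, and the synthetic dataset and the file name parameter_log.csv are illustrative assumptions.

```python
# Illustrative sketch: sweep two parameters and log every combination that was
# tried, so the explored region of the parameter space is curated alongside the
# results and redundant experiments can be avoided.
import csv
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 20))
y = (rng.random(500) < 0.2).astype(int)  # hypothetical imbalanced labels
X[y == 1] += 0.5
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=1)

kernels = ["linear", "rbf", "poly"]
# None leaves both classes weighted equally; "balanced" uses inverse proportions.
# Explicit per-class weight dicts (e.g., natural proportions) could be swept and
# logged in exactly the same way.
class_weights = {"equal": None, "balanced": "balanced"}

with open("parameter_log.csv", "w", newline="") as fh:
    writer = csv.writer(fh)
    writer.writerow(["kernel", "class_weight", "accuracy", "precision", "recall", "f1"])
    for kernel in kernels:
        for weight_name, weights in class_weights.items():
            pred = SVC(kernel=kernel, class_weight=weights).fit(X_train, y_train).predict(X_test)
            writer.writerow([
                kernel,
                weight_name,
                round(accuracy_score(y_test, pred), 3),
                round(precision_score(y_test, pred, zero_division=0), 3),
                round(recall_score(y_test, pred, zero_division=0), 3),
                round(f1_score(y_test, pred, zero_division=0), 3),
            ])
```

Such a log is itself a small curated dataset that a later analyst can consult before re-running the same region of the parameter space.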
CONCLUSIONS

The optimal settings for a given modeling problem are data-dependent, so optimal default parameter settings cannot be guaranteed. For example, our results show that the accuracy of the SVM classifier improves by 6% over the default values when the linear kernel and natural class weights are used. The results also show that complex interaction effects exist between different stages of the knowledge discovery process, such as between the data transformations (i.e., the number of terms) and accuracy, and that simply adding more data does not always result in better models. For example, the accuracy of the SVM model is 6% worse than the default setting when using 40 terms, but the default parameters perform 4% better than manually set parameters when using 250 terms.

To date, much of the data curation focus has been on physical objects, with only some papers mentioning “computational data” (Borgman, 2012). Other work on defining a dataset has found the notions of grouping, content, relatedness, and purpose to be important (Renear, Sacchi, & Wickett, 2010). Curated records of parameter settings seem an ideal candidate for such reuse, as the search space is so large that no single researcher could explore all possible settings on all possible datasets. While it might be possible in an academic setting to explore all preprocessing, transformation, and data mining parameters, the size of that search space makes it unlikely that they could be explored in a business setting. Ideally, the academic community should move toward an understanding of how decisions made in earlier parts of the knowledge discovery process (including parameter tuning) should be made with respect to critical characteristics of a dataset. Such an understanding will remain elusive unless academic papers report complete details about the datasets and settings used.

ACKNOWLEDGEMENTS

This research is made possible in part by a grant from the U.S. Institute of Museum and Library Services, Laura Bush 21st Century Librarian Program, Grant Number RE05-12-0054-12, Socio-technical Data Analytics (SODA).

REFERENCES

Borgman, C. L. (2012). The conundrum of sharing research data. Journal of the American Society for Information Science and Technology, 63(6), 1059-1078.

Cabena, P. (1998). Discovering data mining: From concept to implementation. Prentice Hall.

Fayyad, U., Piatetsky-Shapiro, G., & Smyth, P. (1996). The KDD process for extracting useful knowledge from volumes of data. Communications of the ACM, 39(11), 27-34.

Hey, T., Tansley, S., & Tolle, K. (Eds.). (2009). The fourth paradigm: Data-intensive scientific discovery. Redmond, WA: Microsoft Research.

Kohavi, R., Mason, L., Parekh, R., & Zheng, Z. (2004). Lessons and challenges from mining retail e-commerce data. Machine Learning, 57(1), 83-113.

Manyika, J., Chui, M., Brown, B., Bughin, J., Dobbs, R., Roxburgh, C., & Byers, A. H. (2011). Big data: The next frontier for innovation, competition, and productivity. McKinsey Global Institute.

Renear, A. H., Sacchi, S., & Wickett, K. M. (2010). Definitions of dataset in the scientific and technical literature. Paper presented at the ASIST Annual Meeting, Pittsburgh, PA, USA.

Smyth, P., & Elkan, C. (2010). Creativity helps influence prediction precision. Communications of the ACM, 53(4), 88.

Yang, Y., & Pedersen, J. O. (1997). A comparative study on feature selection in text categorization. In Proceedings of the Fourteenth International Conference on Machine Learning (ICML '97).