
Parameter tuning: Exposing the gap between data curation
and effective data analytics
Catherine Blake
[email protected]
Henry A. Gabb
[email protected]
School of Library and Information Science
University of Illinois at Urbana-Champaign
ABSTRACT
The “big data” movement promises to deliver better
decisions in all aspects of our lives from business to science, health, and government by using computational techniques
to identify patterns from large historical collections of data.
Although a unified view from curation to analysis has been
proposed, current research appears to have polarized into
two separate groups: those curating large datasets and those
developing computational methods to identify patterns in
large datasets. The case study presented here demonstrates
the enormous impact that parameter tuning can have on the
resulting accuracy, precision, and recall of a computational
model that is generated from data. It also illustrates the
vastness of the parameter space that must be searched in
order to produce optimal models and curated in order to
avoid redundant experiments. This highlights the need for
research that focuses on the gap between collection and
analytics if we are to realize the potential of big data.
Keywords
Big data, data curation, data analytics, parameter space.
INTRODUCTION
The “big data” movement promises to deliver better
decisions in all aspects of our lives from business to
science, health, and government by using computational
techniques to identify patterns from large historical
collections of data. Jim Gray, the first advocate of data-intensive science, urged that we “support the whole
research cycle – from data capture and data curation to data
analysis and data visualization” (Hey, Tansley, & Tolle,
2009). Similarly, work on knowledge discovery in
databases is framed as a process that starts with data
selection, preprocessing, transformation, data mining and
interpretation (Fayyad, Piatetsky-Shapiro, & Smyth, 1996).
Despite Gray’s unified view, current data research appears
to have polarized into two separate groups. The first group
focuses on how to curate large datasets that are poised to
take advantage of computational methods, while the second
group focuses on how to develop computational methods to
identify patterns. The notion that the analytical potential of
data will not be realized without understanding its
collection and preprocessing has been observed by other
researchers who report that raw data typically requires
several transformations before reuse is possible and that
limited amounts of data and complex transformations leave
“a gap between the potential value of analytics and the
actual value achieved” (Kohavi, Mason, Parekh, & Zheng,
2004). Others have also found that “careful consideration of
the basic factors that govern how data is generated (and
collected) can lead to significantly more accurate
predictions” (Smyth & Elkan, 2010).
In addition to model quality, there are important resource
allocation issues associated with data preparation that
account for up to 60% of the effort involved in knowledge
discovery (Cabena, 1998). Interestingly, the McKinsey
report, one of the most cited papers on the workforce need
for big data, states that “the United States alone faces a
shortage of 140,000 to 190,000 people with deep analytics
skills,” but many more people are required (1.5 million) to
manage and make decisions based on the models created
(Manyika et al., 2011).
This case study of models produced while conducting
research in text mining clearly demonstrates that parameter
tuning can have an enormous impact on model quality (as
measured by accuracy, precision, and recall). Despite this
impact, complete parameter sets are rarely reported in the
scientific literature, in part because the search space is so
large. This is also an issue for industry as computational
models transition from academe into mainstream tools that
target non-experts. Instructions on how to set parameters
will become increasingly important, but this is a
challenging problem because there is typically no overall
optimal parameter set (i.e., the optimum settings depend on
the incoming data). Our hope is that this example highlights
the need to understand how to explore the parameter space
when building models and the need to include sufficient
details about these settings for reproducibility.
METHODS
Our initial project goal was to automatically identify results
(also called findings or claims) from an empirical study.
The problem was framed as a binary classification task,
where the classifier was trained to discern a result from a
non-result sentence.
With respect to preprocessing, full-text articles were
manually annotated to identify sentences that contained
results and those that did not. With respect to
transformations, we built an automated feature selection
strategy that drew features from three places: individual
words in each sentence; the sentence location in the article
(abstract, introduction, method, result, or discussion); and
whether the sentence is the first or last of a paragraph (as
results are frequently reported at the beginning or end of a
paragraph). All of the models in the present study take
sentence location into account. Features were then ranked
using three domain-independent selection strategies:
information gain, χ² statistic (CHI), and mutual information
(Yang & Pedersen, 1997). All of the models in the present
study use CHI terms. Data was stored in an Oracle
Database 11g Enterprise Edition (release 11.2.0.1, 64-bit
production build).
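For readers who wish to reproduce this style of feature construction outside of Oracle, the following minimal Python sketch assembles bag-of-words, sentence-location, and paragraph-position features and ranks them with the χ² statistic using scikit-learn. The example sentences, labels, and variable names are invented for illustration; this is not the pipeline used in the study, which was implemented against an Oracle database.

# Sketch only: bag-of-words + positional features, ranked by the chi-squared statistic.
# The example sentences, sections, flags, and labels below are invented.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import chi2
from scipy.sparse import hstack, csr_matrix
import numpy as np

sentences = ["We observed a significant increase in yield.",
             "Samples were incubated for 24 hours."]
sections  = ["result", "method"]   # abstract / introduction / method / result / discussion
is_first  = [1, 0]                 # first sentence of its paragraph?
is_last   = [0, 1]                 # last sentence of its paragraph?
labels    = [1, 0]                 # 1 = sentence states a result

# Word features from each sentence
vectorizer = CountVectorizer()
X_words = vectorizer.fit_transform(sentences)

# Sentence-location features (one-hot section) and paragraph-position flags
section_names = ["abstract", "introduction", "method", "result", "discussion"]
X_section = csr_matrix([[int(s == name) for name in section_names] for s in sections])
X_flags = csr_matrix(np.array([is_first, is_last]).T)

X = hstack([X_words, X_section, X_flags])

# Rank features by chi-squared, one of the three selection strategies named above
chi2_scores, _ = chi2(X, labels)
ranked = np.argsort(chi2_scores)[::-1]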
The parameter tuning focus of this paper occurs during the
data mining step of the knowledge discovery process. We
used four different classification algorithms: decision trees
(DT), general linear model (GLM), support vector machine
(SVM), and naïve Bayes (NB) as implemented in the
Oracle Data Miner 11g (ODM release 1), which was part of
Oracle SQL Developer (version 3.2).
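As a rough open-source counterpart to the ODM setup, the sketch below instantiates the same four model families in scikit-learn, with logistic regression standing in for the GLM. The tiny feature matrix and labels are invented solely so the snippet runs on its own; the study itself used the ODM implementations.

# Sketch: open-source stand-ins for the four classifiers used in the study.
import numpy as np
from sklearn.tree import DecisionTreeClassifier       # DT
from sklearn.linear_model import LogisticRegression   # stand-in for GLM (binary logistic regression)
from sklearn.svm import SVC                            # SVM
from sklearn.naive_bayes import MultinomialNB          # NB

# Invented feature matrix (rows = sentences, columns = term counts) and labels.
X = np.array([[1, 0, 3], [0, 2, 1], [2, 1, 0], [0, 0, 4]])
y = np.array([1, 0, 1, 0])                             # 1 = result sentence

classifiers = {
    "DT":  DecisionTreeClassifier(max_depth=7),        # 7 is the ODM default depth (Table 1)
    "GLM": LogisticRegression(max_iter=1000),
    "SVM": SVC(kernel="linear"),                        # linear is the ODM default kernel
    "NB":  MultinomialNB(),
}
models = {name: clf.fit(X, y) for name, clf in classifiers.items()}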
RESULTS AND DISCUSSION
The number of parameters that could be tuned for a given
classifier in the ODM ranged between 2 and 7 (see Table
1). In addition to the algorithm-specific parameters, the user can also decide whether the final model should take the distribution of classes in the training dataset into account; the weighting can be set automatically by the ODM or set manually by the user.
Three weight settings are examined in this study: natural,
balanced, and equal. Natural weight refers to the actual
proportions of the class variable; in our case, whether or not
a sentence contains a result. Balanced weight means that the
inverse proportions are used so that model predictions are
not unduly skewed toward the more commonly occurring
class. All of the ODM classifiers use balanced weighting by
default. Equal weight simply means that the model treats
both classes the same, regardless of their proportions in the
training data.
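To make the three weighting schemes concrete, the sketch below computes natural, balanced, and equal weights for an invented 90/10 class split and passes the balanced weights to a scikit-learn SVM. How a given toolkit applies such weights internally differs from ODM, so the dictionaries are shown only to spell out the three schemes.

# Sketch: natural, balanced, and equal class weights for a binary task.
# The 90/10 class split below is invented for illustration.
from collections import Counter
from sklearn.svm import SVC

labels = [0] * 90 + [1] * 10          # hypothetical: 90% non-result, 10% result sentences
counts = Counter(labels)
n = len(labels)

natural  = {c: counts[c] / n for c in counts}                    # actual class proportions
balanced = {c: n / (len(counts) * counts[c]) for c in counts}    # inverse proportions
equal    = {c: 1.0 for c in counts}                              # both classes treated the same

# Example: an SVM weighted with the balanced scheme (the ODM default weighting).
clf = SVC(kernel="linear", class_weight=balanced)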
Although the number of parameters that can be tuned for a given classifier may appear manageable, many of these parameters are continuous, which leads to an enormous search space. The size of this search space may explain why only part of the parameter settings used is typically reported in the literature.
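To give a rough sense of scale, the short sketch below counts the grid points that result if each tunable parameter were naively sampled at ten candidate values and combined with the three class-weight settings; the parameter counts per algorithm come from Table 1, while the choice of ten samples per parameter is arbitrary.

# Sketch: size of a naive grid over the tunable parameters in Table 1,
# assuming (arbitrarily) ten candidate values per parameter and three class-weight settings.
params_per_algorithm = {"GLM": 7, "NB": 2, "SVM": 4, "DT": 6}
samples_per_parameter = 10
for algo, k in params_per_algorithm.items():
    print(algo, 3 * samples_per_parameter ** k)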
Figure 1. The effect of SVM kernel selection and class weights on model accuracy. The linear kernel with balanced
class weights is the default setting, indicated in red.
Parameter                                 Algorithm   Default
Generate Row Diagnostics                  GLM         Off
Confidence Level                          GLM         0.95
Reference Class Name                      GLM         System
Missing Value Treatment                   GLM         Mean
Specify Row Weights                       GLM         Off
Enable Ridge Regression                   GLM         On
Ridge Value                               GLM         System
Singleton Threshold                       NB          0¹
Pairwise Threshold                        NB          0¹
Kernel Function                           SVM         Linear
Tolerance Value                           SVM         0.001
Complexity Factor                         SVM         System
Active Learning                           SVM         On
Homogeneity Metric                        DT          Gini
Maximum Depth                             DT          7
Minimum Records in a Node                 DT          10
Minimum Percent of Records in a Node      DT          0.05
Minimum Records for a Split               DT          20
Minimum Percent of Records for a Split    DT          0.1
Table 1. Tuning parameters available to ODM users.
¹ The GUI version of the ODM has a default of 0, but the documentation states that the default threshold is 0.01.
The important result is not how many parameters can be tuned, but rather how tuning these parameters can drastically affect the quality of the computational model. To explore this relationship, we measured accuracy (the number of sentences correctly classified divided by the total number of sentences), precision (the number of claim-containing sentences correctly classified divided by the total number of sentences identified as claim-containing), recall (the number of claim-containing sentences correctly classified divided by the total number of claim-containing sentences), and F1 score (the harmonic mean of precision and recall). There is only enough space here to examine a couple of parameters for the four classifiers, but this is enough to demonstrate the effect that just two parameters in the vast parameter space can have on model quality.
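Spelled out in code, these four metrics reduce to simple ratios over the confusion matrix; the snippet below uses scikit-learn on invented predictions, with class 1 marking a claim-containing sentence.

# Sketch: accuracy, precision, recall, and F1 from invented predictions.
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 0, 1, 0]      # hypothetical gold labels
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]      # hypothetical classifier output

accuracy  = accuracy_score(y_true, y_pred)     # correct / total
precision = precision_score(y_true, y_pred)    # TP / (TP + FP)
recall    = recall_score(y_true, y_pred)       # TP / (TP + FN)
f1        = f1_score(y_true, y_pred)           # 2 * P * R / (P + R)
print(accuracy, precision, recall, f1)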
The focus here is not on model quality per se, but rather the impact that parameter settings have on the quality of computational models. Figure 1 shows the effect that kernel selection and class weights have on the accuracy of the SVM classifier. The default settings provide the best and the worst accuracy depending on the number of term features. The difference between the highest and lowest accuracy is 0.066. The class weight parameter also affects the accuracy of the DT classifier, but the effect on the GLM and NB classifiers is less pronounced than on SVM and DT (Figure 2).
Accuracy does not give a complete picture of model performance, so precision, recall, and F1 scores are also examined (Figure 3). In general, models achieve higher F1 scores with balanced class weights, which also favor recall over precision. The opposite is true for models built with natural class weights, where precision is favored over recall. With the exception of DT, weighting classes equally yields models somewhere in between.
The effect of the kernel and class weight parameters is more pronounced on precision and recall (Figure 3). When 40 features are used, the differences between the highest and lowest precision, recall, and F1 score are 0.305, 0.374, and 0.134, respectively.
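One way to explore even these two parameters systematically is a small grid search. The sketch below varies the kernel and class-weight settings of a scikit-learn SVM on synthetic data and reports precision, recall, and F1 for each combination; it is only an approximation, since the study varied these settings through the ODM interface and scikit-learn has no direct equivalent of the three-way natural/balanced/equal switch.

# Sketch: a small grid over SVM kernel and class-weight settings on synthetic data.
from itertools import product
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import precision_score, recall_score, f1_score

# Synthetic, imbalanced stand-in for the sentence classification data.
X, y = make_classification(n_samples=400, n_features=40, weights=[0.8, 0.2], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

kernels = ["linear", "rbf", "poly", "sigmoid"]
class_weights = [None, "balanced"]   # None keeps the natural distribution; "balanced" uses inverse proportions

for kernel, weight in product(kernels, class_weights):
    clf = SVC(kernel=kernel, class_weight=weight).fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    print(kernel, weight,
          round(precision_score(y_test, y_pred, zero_division=0), 3),
          round(recall_score(y_test, y_pred, zero_division=0), 3),
          round(f1_score(y_test, y_pred, zero_division=0), 3))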
These results clearly show that simply using the ODM default settings will not necessarily produce optimal models. Models should ideally have high precision and
recall but there is a tradeoff between these metrics. One
metric may be preferred over the other, depending on the
user requirements. For example, high recall might be
preferred at the expense of precision if it is more important
not to miss potentially critical pieces of information.
Therefore, parameter tuning is an essential part of model
building, but as Table 1 indicates, an exhaustive search of
the parameter space is difficult. However, it is important to
record what regions of the parameter space have been
examined so that redundant work is not performed.
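Even a lightweight run log helps here. The sketch below appends each parameter configuration and its scores to a JSON-lines file and checks whether a configuration has already been tried; the file name, field names, and example scores are invented for illustration.

# Sketch: record every explored parameter configuration and its scores so that
# regions of the parameter space already examined are not re-run.
# File name, field names, and scores below are illustrative only.
import json, hashlib, time

def log_run(params, scores, path="parameter_log.jsonl"):
    record = {
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%S"),
        "params": params,
        "params_hash": hashlib.sha1(json.dumps(params, sort_keys=True).encode()).hexdigest(),
        "scores": scores,
    }
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")

def already_run(params, path="parameter_log.jsonl"):
    target = hashlib.sha1(json.dumps(params, sort_keys=True).encode()).hexdigest()
    try:
        with open(path) as f:
            return any(json.loads(line)["params_hash"] == target for line in f)
    except FileNotFoundError:
        return False

# Example usage with placeholder scores (not results from this paper).
params = {"algorithm": "SVM", "kernel": "linear", "class_weight": "balanced", "terms": 40}
if not already_run(params):
    log_run(params, {"accuracy": 0.82, "precision": 0.75, "recall": 0.70, "f1": 0.72})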
CONCLUSIONS
The optimal settings for a given modeling problem are data-dependent, so optimal default parameter settings cannot be guaranteed. For example, results show that the accuracy of an SVM classifier improves by 6% over the default values when the linear kernel is combined with natural class weights. Results also show that complex interaction effects exist between different stages of the knowledge discovery process, such as between the data transformations (i.e., the number of terms) and the resulting accuracy, and that simply adding more data does not always result in better models. For example, the accuracy of the SVM model is
6% worse than the default setting when using 40 terms, but
the default parameters perform 4% better than manually set
parameters when using 250 terms.
Figure 2. The effect of class weights on model accuracy. The SVM and DT classifiers are affected more strongly than GLM and NB.
To date, much of the data curation focus has been on physical objects, with only some papers mentioning “computational data” (Borgman, 2012). Other work on defining a dataset has found the terms ‘grouping, content, relatedness, and purpose’ important (Renear, Sacchi, & Wickett, 2010). Records of which parameter settings have been explored seem like an ideal candidate for such curation and reuse, as the search space is so large that no single researcher could explore all possible settings on all possible datasets.
While it might be possible in an academic setting to explore
all preprocessing, transformation, and data mining
parameters, the size of that search space makes it unlikely
that these could be explored in a business setting. Ideally,
the academic community should be moving towards an
understanding of how decisions made in earlier parts of the
knowledge discovery process (including parameter tuning)
should be made with respect to critical characteristics of a
dataset. Such an understanding will remain elusive unless
academic papers report complete details about the dataset
and settings used.
Figure 3. Precision vs. recall for models built with the default classifier settings with 40 CHI terms and varying class
weights. The F1 score for each model is shown after the class weight setting.
ACKNOWLEDGEMENTS
This research is made possible in part by a grant from the
U.S. Institute of Museum and Library Services, Laura Bush 21st Century Librarian Program, Grant Number RE05-12-0054-12, Socio-technical Data Analytics (SODA).
REFERENCES
Borgman, C. L. (2012). The conundrum of sharing research data. Journal of the American Society for Information Science and Technology, 63(6), 1059-1078.
Cabena, P. (1998). Discovering data mining: From concept to implementation. Prentice Hall.
Fayyad, U., Piatetsky-Shapiro, G., & Smyth, P. (1996). The KDD process for extracting useful knowledge from volumes of data. Communications of the ACM, 39(11), 27-34.
Hey, T., Tansley, S., & Tolle, K. (Eds.). (2009). The fourth paradigm: Data-intensive scientific discovery. Redmond, WA: Microsoft Research.
Kohavi, R., Mason, L., Parekh, R., & Zheng, Z. (2004). Lessons and challenges from mining retail e-commerce data. Machine Learning, 57(1), 83-113.
Manyika, J., Chui, M., Brown, B., Bughin, J., Dobbs, R., Roxburgh, C., & Byers, A. H. (2011). Big data: The next frontier for innovation, competition, and productivity. McKinsey Global Institute.
Renear, A. H., Sacchi, S., & Wickett, K. M. (2010). Definitions of dataset in the scientific and technical literature. Paper presented at the ASIST Annual Meeting, Pittsburgh, PA, USA.
Smyth, P., & Elkan, C. (2010). Creativity helps influence prediction precision. Communications of the ACM, 53(4), 88.
Yang, Y., & Pedersen, J. O. (1997). A comparative study on feature selection in text categorization. Paper presented at the Fourteenth International Conference on Machine Learning (ICML '97).