
Parties, Models, Mobsters: A New Implementation of Model-Based Recursive Partitioning in R

Achim Zeileis (Universität Innsbruck)
Torsten Hothorn (Universität Zürich)
Abstract
MOB is a generic algorithm for model-based recursive partitioning (Zeileis, Hothorn,
and Hornik 2008). Rather than fitting one global model to a dataset, it estimates local
models on subsets of data that are “learned” by recursively partitioning. It proceeds in the
following way: (1) fit a parametric model to a data set, (2) test for parameter instability
over a set of partitioning variables, (3) if there is some overall parameter instability, split
the model with respect to the variable associated with the highest instability, (4) repeat
the procedure in each of the resulting subsamples. It is discussed how these steps of the
conceptual algorithm are translated into computational tools in an object-oriented manner, allowing the user to plug in various types of parametric models. For representing the
resulting trees, the R package partykit is employed and extended with generic infrastructure for recursive partitions where nodes are associated with statistical models. Compared
to the previously available implementation in the party package, the new implementation
supports more inference options, is easier to extend to new models, and provides more
convenience features.
Keywords: parametric models, object-orientation, recursive partitioning.
1. Overview
To implement the model-based recursive partitioning (MOB) algorithm of Zeileis et al. (2008)
in software, infrastructure for three aspects is required: (1) statistical “models”, (2) recursive
“party”tions, and (3) “mobsters” carrying out the MOB algorithm.
Along with Zeileis et al. (2008), an implementation of all three steps was provided in the
party package (Hothorn, Hornik, Strobl, and Zeileis 2014) for the R system for statistical
computing (R Core Team 2013). This provided one very flexible mob() function combining
party’s S4 classes for representing trees with binary splits and the S4 model wrapper functions
from modeltools (Hothorn, Leisch, and Zeileis 2013). However, while this supported many
applications of interest, it was somewhat limited in several directions: (1) The S4 wrappers for
the models were somewhat cumbersome to set up. (2) The tree infrastructure was originally
designed for ctree() and somewhat too narrowly focused on it. (3) Writing new “mobster”
interfaces was not easy because it relied on unexported S4 classes.
Hence, a leaner and more flexible interface (based on S3 classes) is now provided in partykit
(Hothorn and Zeileis 2014): (1) New models are much easier to provide in a basic version and
customization does not require setting up an additional S4 class-and-methods layer anymore.
(2) The trees are built on top of partykit’s flexible ‘party’ objects, inheriting many useful
methods and providing new ones dealing with the fitted models associated with the tree’s
nodes. (3) New “mobsters” dedicated to specific models, e.g., lmtree() and glmtree() for
MOBs of (generalized) linear models, are readily provided.
The remainder of this vignette is organized as follows: Section 2 very briefly reviews the
original MOB algorithm of Zeileis et al. (2008) and also highlights relevant subsequent work.
Section 3 introduces the new mob() function in partykit in detail, discussing how all steps
of the MOB algorithm are implemented and which options for customization are available.
For illustration logistic-regression-based recursive partitioning is applied to the Pima Indians
diabetes data set from the UCI machine learning repository (Bache and Lichman 2013).
Sections 4 and 5 present further illustrative examples (including replications from Zeileis et al.
2008) before Section 6 provides some concluding remarks.
2. MOB: Model-based recursive partitioning
First, the theory underlying the MOB (model-based recursive partitioning) is briefly reviewed;
a more detailed discussion is provided by Zeileis et al. (2008). To fix notation, consider a
parametric model M(Y, θ) with (possibly vector-valued) observations Y and a k-dimensional
vector of parameters θ. This model could be a (possibly multivariate) normal distribution for
Y , a psychometric model for a matrix of responses Y , or some kind of regression model when
Y = (y, x) can be split up into a dependent variable y and regressors x. An example for the
latter could be a linear regression model y = x⊤ θ or a generalized linear model (GLM) or a
survival regression.
Given n observations Yi (i = 1, …, n) the model can be fitted by minimizing some objective
function ∑_{i=1}^{n} Ψ(Yi, θ), e.g., a residual sum of squares or a negative log-likelihood leading to
ordinary least squares (OLS) or maximum likelihood (ML) estimation, respectively.
If a global model for all n observations does not fit well and further covariates Z1 , . . . , Zℓ are
available, it might be possible to partition the n observations with respect to these variables
and find a fitting local model in each cell of the partition. The MOB algorithm tries to
find such a partition adaptively using a greedy forward search. The reasons for looking
at such local models might be different for different types of models: First, the detection
of interactions and nonlinearities in regression relationships might be of interest just like in
standard classification and regression trees (see e.g., Zeileis et al. 2008). Additionally, however,
this approach allows explanatory variables to be added to models that otherwise do not have
regressors or where the link between the regressors and the parameters of the model is unclear
(this idea is pursued by Strobl, Wickelmaier, and Zeileis 2011). Finally, the model-based tree
can be employed as a thorough diagnostic test of the parameter stability assumption (also
termed measurement invariance in psychometrics). If the tree does not split at all, parameter
stability (or measurement invariance) cannot be rejected while a tree with splits would indicate
in which way the assumption is violated (Strobl, Kopf, and Zeileis 2013, employ this idea in
psychometric item response theory models).
The basic idea is to grow a tree in which every node is associated with a model of type M. To
assess whether splitting of the node is necessary a fluctuation test for parameter instability is
performed. If there is significant instability with respect to any of the partitioning variables
Zj, the node is split into B locally optimal segments (the current version of the software
has B = 2 as the default and as the only option for numeric/ordered variables) and then the
procedure is repeated in each of the B children. If no more significant instabilities can be
found, the recursion stops. More precisely, the steps of the algorithm are
1. Fit the model once to all observations in the current node.
2. Assess whether the parameter estimates are stable with respect to every partitioning
variable Z1 , . . . , Zℓ . If there is some overall instability, select the variable Zj associated
with the highest parameter instability, otherwise stop.
3. Compute the split point(s) that locally optimize the objective function Ψ.
4. Split the node into child nodes and repeat the procedure until some stopping criterion
is met.
This conceptual framework is extremely flexible and can be adapted to various tasks by
choosing particular models, tests, and methods in each of the steps:
1. Model estimation: The original MOB introduction (Zeileis et al. 2008) discussed regression models: OLS regression, GLMs, and survival regression. Subsequently, Grün,
Kosmidis, and Zeileis (2012) have also adapted MOB to beta regression for limited response variables. Furthermore, MOB provides a generic way of adding covariates to
models that otherwise have no regressors: this can either serve as a check whether the
model is indeed independent from regressors or leads to local models for subsets. Both
views are of interest when employing MOB to detect parameter instabilities in psychometric models for item responses such as the Bradley-Terry or the Rasch model (see
Strobl et al. 2011, 2013, respectively).
2. Parameter instability tests: To assess the stability of all model parameters across a
certain partitioning variable, the general class of score-based fluctuation tests proposed
by Zeileis and Hornik (2007) is employed. Based on the empirical score function observations (i.e., empirical estimating functions or contributions to the gradient), ordered
with respect to the partitioning variable, the fluctuation or instability in the model’s parameters can be tested. From this general framework the Andrews’ supLM test is used
for assessing numerical partitioning variables and a χ2 test for categorical partitioning
variables (see Zeileis 2005 and Merkle and Zeileis 2013 for unifying views emphasizing
regression and psychometric models, respectively). Furthermore, the test statistics for
ordinal partitioning variables suggested by Merkle, Fan, and Zeileis (2013) have been
added as further options (but are not used by default as the simulation of p-values is
computationally demanding).
3. Partitioning: As the objective function Ψ is additive, it is easy to compute a single
optimal split point (or cut point or break point). For each conceivable split, the model is
estimated on the two resulting subsets and the resulting objective functions are summed.
The split that optimizes this segmented objective function is then selected as the optimal
split. For optimally splitting the data into B > 2 segments, the same idea can be used
and a full grid search can be avoided by employing a dynamic programming algorithm
(Hawkins 2001; Bai and Perron 2003) but at the moment the latter is not implemented
in the software. Optionally, however, categorical partitioning variables can be split into
all of their categories (i.e., in that case B is set to the number of levels without searching
for optimal groupings).
4. Pruning: For determining the optimal size of the tree, one can either use a pre-pruning
or post-pruning strategy. For the former, the algorithm stops when no significant parameter instabilities are detected in the current node (or when the node becomes too
small). For the latter, one would first grow a large tree (subject only to a minimal
node size requirement) and then prune back splits that did not improve the model, e.g.,
judging by information criteria such as AIC or BIC (see e.g., Su, Wang, and Fan 2004).
Currently, pre-pruning is used by default (via Bonferroni-corrected p-values from the
score-based fluctuation tests) but AIC/BIC-based post-pruning is optionally available
(especially for large data sets where traditional significance levels are not useful).
In the following it is discussed how most of the options above are embedded in a common
computational framework using R’s facilities for model estimation and object orientation.
3. A new implementation in R
This section introduces a new implementation of the general model-based recursive partitioning (MOB) algorithm in R. Along with Zeileis et al. (2008), a function mob() had been introduced to the party package (Hothorn et al. 2014) which continues to work but turned out
to be somewhat inflexible for certain applications/extensions. Hence, the partykit package
(Hothorn and Zeileis 2014) provides a completely rewritten (and not backward compatible)
implementation of a new mob() function along with convenience interfaces lmtree() and
glmtree() for fitting linear model and generalized linear model trees, respectively. The function mob() itself is intended to be the workhorse function that can also be employed to quickly
explore new models – whereas lmtree() and glmtree() will be the typical user interfaces
facilitating applications.
The new mob() function has the following arguments:
mob(formula, data, subset, na.action, weights, offset,
fit, control = mob_control(), ...)
All arguments in the first line are standard for modeling functions in R with a formula
interface. They are employed by mob() to do some data preprocessing (described in detail
in Section 3.1) before the fit function is used for parameter estimation on the preprocessed
data. The fit function has to be set up in a certain way (described in detail in Section 3.2)
so that mob() can extract all information that is needed in the MOB algorithm for parameter
instability tests (see Section 3.3) and partitioning/splitting (see Section 3.4), i.e., the estimated
parameters, the associated objective function, and the score function contributions. A list of
control options can be set up by the mob_control() function, including options for pruning
(see Section 3.5). Additional arguments ... are passed on to the fit function.
The result is an object of class ‘modelparty’ inheriting from ‘party’. The info element of
the overall ‘party’ and the individual ‘node’s contain various information about the models.
Details are provided in Section 3.6.
Finally, a wide range of standard (and some extra) methods are available for working with
‘modelparty’ objects, e.g., for extracting information about the models, for visualization,
computing predictions, etc. The standard set of methods is introduced in Section 3.7. However, as will be discussed there, it may take some effort by the user to efficiently compute
certain pieces of information. Hence, convenience interfaces such as lmtree() or glmtree()
can alleviate these obstacles considerably, as illustrated in Section 3.8.
3.1. Formula processing and data preparation
The formula processing within mob() is essentially done in “the usual way”, i.e., there is a
formula and data and optionally further arguments such as subset, na.action, weights,
and offset. These are processed into a model.frame from which the response and the
covariates can be extracted and passed on to the actual fit function.
As there are possibly three groups of variables (response, regressors, and partitioning variables), the Formula package (Zeileis and Croissant 2010) is employed for processing these
three parts. Thus, the formula can be of type y ~ x1 + ... + xk | z1 + ... + zl where
the variables on the left of the | specify the data Y and the variables on the right specify
the partitioning variables Zj . As pointed out above, the Y can often be split up into response (y in the example above) and regressors (x1, . . . , xk in the example above). If there
are no regressors and just constant fits are employed, then the formula can be specified as
y ~ 1 | z1 + ... + zl.
So far, this formula representation is really just a specification of groups of variables and does
not imply anything about the type of model that is to be fitted to the data in the nodes of the
tree. The mob() function does not know anything about the type of model and just passes
(subsets of) the y and x variables on to the fit function. Only the partitioning variables z
are used internally for the parameter instability tests (see Section 3.3).
As different fit functions will require the data in different formats, mob_control() allows setting
set the ytype and xtype. The default is to assume that y is just a single column of the model
frame that is extracted as a ytype = "vector". Alternatively, it can be a "data.frame" of
all response variables or a "matrix" set up via model.matrix(). The options "data.frame"
and "matrix" are also available for xtype with the latter being the default. Note that if
"matrix" is used, then transformations (e.g., logs, square roots etc.) and dummy/interaction
codings are computed and turned into columns of a numeric matrix while for "data.frame"
the original variables are preserved.
By specifying the ytype and xtype, mob() is also provided with the information on how to
correctly subset y and x when recursively partitioning data. In each step, only the subset of
y and x pertaining to the current node of the tree is passed on to the fit function. Similarly,
subsets of weights and offset are passed on (if specified).
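For example, if a fit function is written to work with the original (untransformed) variables rather than a model matrix, data frames can be requested for both parts. A minimal sketch (the particular combination shown is just an example):

R> ctrl <- mob_control(ytype = "data.frame", xtype = "data.frame")

The resulting ctrl object would then be supplied to mob() via its control argument.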
Illustration
For illustration, we employ the popular benchmark data set on diabetes among Pima Indian
women that is provided by the UCI machine learning repository (Bache and Lichman 2013)
and available in the mlbench package (Leisch and Dimitriadou 2012):
R> data("PimaIndiansDiabetes", package = "mlbench")
Following Zeileis et al. (2008) we want to fit a model for diabetes employing the plasma
glucose concentration glucose as a regressor. As the influence of the remaining variables on
diabetes is less clear, their relationship should be learned by recursive partitioning. Thus,
we employ the following formula:
R> pid_formula <- diabetes ~ glucose | pregnant + pressure + triceps +
+    insulin + mass + pedigree + age
Before passing this to mob(), a fit function is needed and a logistic regression function is set
up in the following section.
3.2. Model fitting and parameter estimation
The mob() function does not actually carry out any parameter estimation itself,
but assumes that one of the many R functions available for model estimation is supplied.
However, to be able to carry out the steps of the MOB algorithm, mob() needs to be able to
extract certain pieces of information: especially the estimated parameters, the corresponding
objective function, and associated score function contributions. Also, the interface of the
fitting function clearly needs to be standardized so that mob() knows how to invoke the
model estimation.
Currently, two possible interfaces for the fit function can be employed:
1. The fit function can take the following inputs
fit(y, x = NULL, start = NULL, weights = NULL, offset = NULL, ...,
estfun = FALSE, object = FALSE)
where y, x, weights, offset are (the subset of) the preprocessed data. In start starting
values for the parameter estimates may be supplied and ... is passed on from the mob()
function. The fit function then has to return an output list with the following elements:
- coefficients: Estimated parameters θ̂.
- objfun: Value of the minimized objective function ∑i Ψ(yi, xi, θ̂).
- estfun: Empirical estimating functions (or score function contributions) Ψ′(yi, xi, θ̂).
  Only needed if estfun = TRUE, otherwise optionally NULL.
- object: A model object for which further methods could be available (e.g., predict(),
  or fitted(), etc.). Only needed if object = TRUE, otherwise optionally NULL.
By making estfun and object optional, the fitting function might be able to save
computation time by only optimizing the objective function but avoiding further computations (such as setting up covariance matrix, residuals, diagnostics, etc.).
2. The fit function can also have a simpler interface with only the following inputs:
fit(y, x = NULL, start = NULL, weights = NULL, offset = NULL, ...)
The meaning of all arguments is the same as above. However, in this case fit needs to
return a classed model object for which methods to coef(), logLik(), and estfun()
(see Zeileis 2006) are available for extracting the parameter estimates, the maximized
log-likelihood, and associated empirical estimating functions (or score contributions),
respectively. Internally, a function of type (1) is set up by mob() in case a function of
type (2) is supplied. However, as pointed out above, a function of type (1) might be
useful to save computation time.
In either case the fit function can, of course, choose to ignore any arguments that are not
applicable, e.g., if there are no regressors x in the model or if starting values are not supported.
The fit function of type (2) is typically convenient to quickly try out a new type of model
in recursive partitioning. However, when writing a new mob() interface such as lmtree()
or glmtree(), it will typically be better to use type (1). Similarly, supporting
starting values in fit is then recommended to save computation time.
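To give a flavor of the type (1) interface, the following sketch sets up a logistic regression fit function based on glm.fit() that returns the required list directly. It is only meant as an illustration (glmtree() provides a more refined version of the same idea) and assumes a binary response with the canonical logit link, for which the score contributions are simply the prior weights times the difference between response and fitted probabilities times the regressor matrix:

logit1 <- function(y, x, start = NULL, weights = NULL, offset = NULL, ...,
  estfun = FALSE, object = FALSE)
{
  ## fit the binomial GLM without the formula overhead of glm()
  m <- glm.fit(x, y, family = binomial(), start = start,
    weights = weights, offset = offset)
  ## for a binary response the deviance is -2 times the log-likelihood,
  ## so deviance/2 serves as the negative log-likelihood objective function
  list(
    coefficients = m$coefficients,
    objfun = m$deviance / 2,
    estfun = if(estfun) m$prior.weights * (m$y - m$fitted.values) * x else NULL,
    object = if(object) m else NULL
  )
}

A function like this could then be supplied via mob(..., fit = logit1) in the same way as the simpler type (2) interface illustrated below.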
Illustration
For recursively partitioning the diabetes ~ glucose relationship (as already set up in the
previous section), we employ a logistic regression model. An interface of type (2) can be easily
set up:
R> logit <- function(y, x, start = NULL, weights = NULL, offset = NULL, ...) {
+    glm(y ~ 0 + x, family = binomial, start = start, ...)
+  }
Thus, y, x, and the starting values are passed on to glm() which returns an object of class
‘glm’ for which all required methods (coef(), logLik(), and estfun()) are available.
Using this fit function and the formula already set up above the MOB algorithm can be
easily applied to the PimaIndiansDiabetes data:
R> pid_tree <- mob(pid_formula, data = PimaIndiansDiabetes, fit = logit)
The result is a logistic regression tree with three terminal nodes that can be easily visualized
via plot(pid_tree) (see Figure 1) and printed:
R> pid_tree
Model-based recursive partitioning (logit)

Model formula:
diabetes ~ glucose | pregnant + pressure + triceps + insulin +
    mass + pedigree + age

Fitted party:
[1] root
|   [2] mass <= 26.3: n = 167
|       x(Intercept)     xglucose
|           -9.95151      0.05871
|   [3] mass > 26.3
|   |   [4] age <= 30: n = 304
|   |       x(Intercept)     xglucose
|   |           -6.70559      0.04684
|   |   [5] age > 30: n = 297
|   |       x(Intercept)     xglucose
|   |           -2.77095      0.02354

Number of inner nodes:    2
Number of terminal nodes: 3
Number of parameters per node: 2
Objective function: 355.5
The tree finds three groups of Pima Indian women:
#2 Women with low body mass index that have on average a low risk of diabetes, however
this increases clearly with glucose level.
#4 Women with average and high body mass index, younger than 30 years, that have a
higher average risk that also increases with glucose level.
#5 Women with average and high body mass index, older than 30 years, that have a high
average risk that increases only slowly with glucose level.
Note that the example above is used for illustration here and glmtree() is recommended over
using mob() plus manually setting up a logit() function. The same tree structure can be
found via:
R> pid_tree2 <- glmtree(diabetes ~ glucose | pregnant +
+    pressure + triceps + insulin + mass + pedigree + age,
+    data = PimaIndiansDiabetes, family = binomial)
However, glmtree() is slightly faster as it avoids many unnecessary computations, see Section 3.8 for further details.
Here, we only point out that plot(pid_tree2) produces a nicer visualization (see Figure 2).
As y is diabetes, a binary variable, and x is glucose, a numeric variable, a spinogram is
chosen automatically for visualization (using the vcd package, Meyer, Zeileis, and Hornik
2006). The x-axis breaks in the spinogram are the five-point summary of glucose on the full
data set. The fitted lines are the mean predicted probabilities in each group.
3.3. Testing for parameter instability
In each node of the tree, first the parametric model is fitted to all observations in that node
(see Section 3.2). Subsequently it is of interest to find out whether the model parameters
are stable over each particular ordering implied by the partitioning variables Zj or whether
splitting the sample with respect to one of the Zj might capture instabilities in the parameters
and thus improve the fit. The tests used in this step belong to the class of generalized M-fluctuation tests (Zeileis 2005; Zeileis and Hornik 2007). Depending on the class of each of
the partitioning variables in z different types of tests are chosen by mob():
1. For numeric partitioning variables (of class ‘numeric’ or ‘integer’) the supLM statistic
is used which is the maximum over all single-split LM statistics. Associated p-values
can be obtained from a table of critical values (based on Hansen 1997) stored within
the package.
Figure 1: Logistic-regression-based tree for the Pima Indians diabetes data. The plots in the
leaves report the estimated regression coefficients.
Figure 2: Logistic-regression-based tree for the Pima Indians diabetes data. The plots in the
leaves give spinograms for diabetes versus glucose.
If there are ties in the partitioning variable, the sequence of LM statistics (and hence
their maximum) is not unique and hence the results by default may depend to some
degree on the ordering of the observations. To explore this, the option breakties =
TRUE can be set in mob_control() which breaks ties randomly. If there are only a
few ties, the influence is often negligible. If there are many ties (say only a dozen
unique values of the partitioning variable), then the variable may be better treated as
an ordered factor (see below).
2. For categorical partitioning variables (of class ‘factor’), a χ2 statistic is employed which
captures the fluctuation within each of the categories of the partitioning variable. This
is also an LM -type test and is asymptotically equivalent to the corresponding likelihood
ratio test. However, it is somewhat cheaper to compute the LM statistic because the
model just has to be fitted once in the current node and not separately for each category
of each possible partitioning variable. See also Merkle et al. (2013) for a more detailed
discussion.
3. For ordinal partitioning variables (of class ‘ordered’ inheriting from ‘factor’) the same
χ2 as for the unordered categorical variables is used by default (as suggested by Zeileis
et al. 2008). Although this test is consistent for any parameter instabilities across
ordered variables, it does not exploit the ordering information.
Recently, Merkle et al. (2013) proposed an adapted maxLM test for ordered variables
and, alternatively, a weighted double maximum test. Both are optionally available in
the new mob() implementation by setting ordinal = "L2" or ordinal = "max" in
mob_control(), respectively. Unfortunately, computing p-values from both tests is
more costly than for the default ordinal = "chisq". For "L2" suitable tables of critical
values have to be simulated on the fly using ordL2BB() from the strucchange package
(Zeileis, Leisch, Hornik, and Kleiber 2002). For "max" a multivariate normal probability
has to be computed using the mvtnorm package (Genz et al. 2014).
All of the parameter instability tests above can be computed in an object-oriented manner as
only the “estfun” has to be available for the fitted model object (either by computing it in the
fit function directly or by providing an estfun() extractor, see Section 3.2).
All tests also require an estimate of the corresponding variance-covariance matrix of the
estimating functions. The default is to compute this using an outer-product-of-gradients
(OPG) estimator. Alternatively, the corresponding information matrix or sandwich matrix
can be used if: (a) the estimating functions are actually maximum likelihood scores, and
(b) a vcov() method (based on an estimate of the information) is provided for the fitted
model objects. The corresponding option in mob_control() is to set vcov = "info" or vcov
= "sandwich" rather than vcov = "opg" (the default).
For each of the j = 1, . . . , ℓ partitioning variables in z the test selected in the control options
is employed and the corresponding p-value pj is computed. To adjust for multiple testing,
the p-values can be Bonferroni-adjusted (which is the default). To determine whether there
is some overall instability, it is checked whether the minimal p-value pj∗ = minj=1,…,ℓ pj falls
below a pre-specified significance level α (by default α = 0.05) or not. If there is significant
instability, the variable Zj ∗ associated with the minimal p-value is used for splitting the node.
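All of these choices are collected in mob_control(); a small sketch combining the options discussed above (the particular values are only examples, not recommendations):

R> tctrl <- mob_control(alpha = 0.01, bonferroni = TRUE,
+    ordinal = "max", vcov = "info")

This lowers the significance level to 0.01, keeps the Bonferroni adjustment (the default), uses the weighted double maximum test for ordinal partitioning variables, and employs the information matrix rather than the default OPG estimator.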
Illustration
In the logistic-regression-based MOB pid_tree computed above, the parameter instability tests can be extracted using the sctest() function from the strucchange package (for
structural change test). In the first node, the test statistics and Bonferroni-corrected p-values
are:
R> library("strucchange")
R> sctest(pid_tree, node = 1)
           pregnant pressure  triceps insulin      mass pedigree
statistic 2.989e+01   7.5024 15.94095  6.5969 4.881e+01 18.33476
p.value   9.779e-05   0.9104  0.06474  0.9701 8.317e-09  0.02253
                age
statistic 4.351e+01
p.value   1.183e-07
Thus, the body mass index has the lowest p-value and is highly significant and hence used for
splitting the data. In the second node, no further significant parameter instabilities can be
detected and hence partitioning stops in that branch.
R> sctest(pid_tree, node = 2)
          pregnant pressure triceps insulin    mass pedigree    age
statistic  10.3924   4.3537  5.9112   3.786 10.4749    3.626 6.0979
p.value     0.4903   0.9998  0.9869   1.000  0.4785    1.000 0.9818
In the third node, however, there is still significant instability associated with the age variable
and hence partitioning continues.
R> sctest(pid_tree, node = 3)
           pregnant pressure triceps insulin   mass pedigree
statistic 2.674e+01   6.1758  7.3468   7.896 9.1546 17.96439
p.value   4.434e-04   0.9845  0.9226   0.870 0.7033  0.02647
                age
statistic 3.498e+01
p.value   8.099e-06
Because no further instabilities can be found in the fourth and fifth node, the recursive
partitioning stops there.
3.4. Splitting
In this step, the observations in the current node are split with respect to the chosen partitioning variable Zj ∗ into B child nodes. As pointed out above, the mob() function currently
only supports binary splits, i.e., B = 2. For determining the split point, an exhaustive search
procedure is adopted: For each conceivable split point in Zj ∗ , the two subset models are fit
and the split associated with the minimal value of the objective function Ψ is chosen.
Computationally, this means that the fit function is applied to the subsets of y and x for
each possible binary split. The “objfun” values are simply summed up (because the objective
function is assumed to be additive) and its minimum across splits is determined. In a search
over a numeric partitioning variable, this means that typically a lot of subset models have to
be fitted and often these will not vary a lot from one split to the next. Hence, the parameter
estimates “coefficients” from the previous split are employed as start values for estimating
the coefficients associated with the next split. Thus, if the fit function makes use of these
starting values, the model fitting can often be sped up.
Illustration
For the Pima Indians diabetes data, the split points found for pid_tree are displayed both
in the print output and the visualization (see Figures 1 and 2).
3.5. Pruning
By default, the size of mob() trees is determined only by the significance tests, i.e., when
there is no more significant parameter instability (by default at level α = 0.05) the tree
stops growing. Additional stopping criteria are only the minimal node size (minsize) and the
maximum tree depth (maxdepth, by default infinity).
However, for very large sample sizes traditional significance levels are typically not useful
because even tiny parameter instabilities can be detected. To avoid overfitting in such a
situation, one would either have to use much smaller significance levels or add some form
of post-pruning to reduce the size of the tree afterwards. Similarly, one could wish to first
grow a very large tree (using a large α level) and then prune it afterwards, e.g., using some
information criterion like AIC or BIC.
To accommodate such post-pruning strategies, mob_control() has an argument prune that
can be a function(objfun, df, nobs) that either returns TRUE if a node should be pruned
or FALSE if not. The arguments supplied are a vector of objective function values in the
current node and the sum of all child nodes, a vector of corresponding degrees of freedom,
and the number of observations in the current node and in total. For example, if the objective
function used is the negative log-likelihood, then for BIC-based pruning the prune function
is: (2 * objfun[1] + log(nobs[2]) * df[1]) < (2 * objfun[2] + log(nobs[2]) * df[2]).
As the negative log-likelihood is the default objective function when using the “simple” fit
interface, prune can also be set to "AIC" or "BIC" and then suitable functions will be set up
internally. Note, however, that for other objective functions this strategy is not appropriate
and the functions would have to be defined differently (which lmtree() does for example).
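As a sketch, such a BIC rule could also be supplied explicitly as a function, which corresponds (up to the exact form used internally) to what prune = "BIC" sets up for a negative log-likelihood objective:

R> bic_prune <- function(objfun, df, nobs)
+    (2 * objfun[1] + log(nobs[2]) * df[1]) < (2 * objfun[2] + log(nobs[2]) * df[2])
R> pctrl <- mob_control(alpha = 0.9, prune = bic_prune)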
In the literature, there is no clear consensus as to how many degrees of freedom should be
assigned to the selection of a split point. Hence, mob_control() allows setting dfsplit which
by default is dfsplit = TRUE and then as.integer(dfsplit) (i.e., 1 by default) degrees
of freedom per split are used. This can be modified to dfsplit = FALSE (or equivalently
dfsplit = 0) or dfsplit = 3 etc. for lower or higher penalization of additional splits.
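The effect of this penalty can be inspected directly via the logLik() method discussed in Section 3.7, e.g., by comparing the degrees of freedom with and without counting the splits for the tree fitted above (output omitted):

R> logLik(pid_tree)
R> logLik(pid_tree, dfsplit = FALSE)

The second call should report two degrees of freedom less because the two splits of pid_tree are then no longer counted.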
Illustration
With n = 768 observations, the sample size is quite reasonable for using the classical significance level of α = 0.05 which is also reflected by the moderate tree size with three terminal
nodes. However, if we wished to explore further splits, a conceivable strategy could be the
following:
R> pid_tree3 <- mob(pid_formula, data = PimaIndiansDiabetes,
+    fit = logit, control = mob_control(verbose = TRUE,
+    minsize = 50, maxdepth = 4, alpha = 0.9, prune = "BIC"))
This first grows a large tree until the nodes become too small (minimum node size: 50 observations) or the tree becomes too deep (maximum depth 4 levels) or the significance levels
come very close to one (larger than α = 0.9). Here, this large tree has eleven nodes which
are subsequently pruned based on whether or not they improve the BIC of the model. For
this data set, the resulting BIC-pruned tree is in fact identical to the pre-pruned pid_tree
considered above.
3.6. Fitted ‘party’ objects
The result of mob() is an object of class ‘modelparty’ inheriting from ‘party’. See the other
vignettes in the partykit package (Hothorn and Zeileis 2014) for more details of the general
‘party’ class. Here, we just point out that the main difference between a ‘modelparty’ and
a plain ‘party’ is that additional information about the data and the associated models is
stored in the info elements: both of the overall ‘party’ and the individual ‘node’s. The
details are exemplified below.
Illustration
In the info of the overall ‘party’ the following information is stored for pid_tree:
R> names(pid_tree$info)
[1] "call"
[7] "dots"
"formula" "Formula" "terms"
"nreg"
"fit"
"control"
The call contains the mob() call. The formula and Formula are virtually the same but are
simply stored as plain ‘formula’ and extended ‘Formula’ (Zeileis and Croissant 2010) objects,
respectively. The terms contain separate lists of terms for the response (and regressor)
and the partitioning variables. The fit, control and dots are the arguments that were
provided to mob() and nreg is the number of regressor variables.
Furthermore, each node of the tree contains the following information:
R> names(pid_tree$node$info)
[1] "coefficients" "objfun"
[5] "p.value"
"test"
"object"
"nobs"
The coefficients, objfun, and object are the results of the fit function for that node.
In nobs and p.value the number of observations and the minimal p-value are provided,
respectively. Finally, test contains the parameter instability test results. Note that the
object element can also be suppressed through mob_control() to save memory space.
3.7. Methods
There is a wide range of standard methods available for objects of class ‘modelparty’. The
standard print(), plot(), and predict() build on the corresponding methods for ‘party’
objects but provide some more special options. Furthermore, methods are provided that
are typically available for models with formula interfaces: formula() (optionally one can
set extended = TRUE to get the ‘Formula’), getCall(), model.frame(), weights(). Finally, there is a standard set of methods for statistical model objects: coef(), residuals(),
logLik() (optionally setting dfsplit = FALSE suppresses counting the splits in the degrees
of freedom, see Section 3.5), deviance(), fitted(), and summary().
Some of these methods rely on reusing the corresponding methods for the individual model
objects in the terminal nodes. Functions such as coef(), print(), summary() also take a
node argument that can specify the node IDs to be queried.
Two methods have non-standard arguments to allow for additional flexibility when dealing
with model trees. Typically, “normal” users do not have to use these arguments directly but
they are very flexible and facilitate writing convenience interfaces such as glmtree() (see
Section 3.8).
- The predict() method has the following arguments: predict(object, newdata =
NULL, type = "node", ...). The argument type can either be a function or a character string. More precisely, if type is a function it should be a function (object,
newdata = NULL, ...) that returns a vector or matrix of predictions from a fitted
model object either with or without newdata. If type is a character string, such a
function is set up internally as predict(object, newdata = newdata, type = type,
...), i.e., it relies on a suitable predict() method being available for the fitted models
associated with the ‘party’ object.
- The plot() method has the following arguments: plot(x, terminal_panel = NULL,
FUN = NULL). This simply calls the plot() method for ‘party’ objects with a custom
panel function for the terminal panels. By default, node_terminal is used to include
some short text in each terminal node. This text can be set up by FUN with the default
being the number of observations and estimated parameters. However, more elaborate
terminal panel functions can be written, as done for the convenience interfaces.
Finally, two S3-style functions are provided without the corresponding generics (as these
reside in packages that partykit does not depend on). The sctest.modelparty can be used
in combination with the sctest() generic from strucchange as illustrated in Section 3.3. The
refit.modelparty function extracts (or refits if necessary) the fitted model objects in the
specified nodes (by default from all nodes).
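For example, the fitted ‘glm’ objects of the terminal nodes could be retrieved for further inspection; a small sketch (assuming the node models were stored, which is the default):

R> pid_glms <- refit.modelparty(pid_tree, node = c(2, 4, 5))
R> lapply(pid_glms, coef)

The resulting coefficients should simply repeat those returned by coef(pid_tree) in the illustration below.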
Illustration
The plot() and print() methods have already been illustrated for the pid_tree above.
However, here we add that the print() method can also be used to show more detailed
information about particular nodes instead of showing the full tree:
R> print(pid_tree, node = 3)
Model-based recursive partitioning (logit)

-- Node 3 --
Estimated parameters:
x(Intercept)     xglucose
    -4.61015      0.03426

Objective function:
344.2

Parameter instability tests:
           pregnant pressure triceps insulin   mass pedigree
statistic 2.674e+01   6.1758  7.3468   7.896 9.1546 17.96439
p.value   4.434e-04   0.9845  0.9226   0.870 0.7033  0.02647
                age
statistic 3.498e+01
p.value   8.099e-06
Information about the model and coefficients can for example be extracted by:
R> coef(pid_tree)
  x(Intercept) xglucose
2       -9.952  0.05871
4       -6.706  0.04684
5       -2.771  0.02354

R> coef(pid_tree, node = 1)

x(Intercept)     xglucose
    -5.35008      0.03787
R> summary(pid_tree, node = 1)
Call:
glm(formula = y ~ 0 + x, family = binomial, start = start)

Deviance Residuals:
   Min     1Q Median     3Q    Max
-2.110 -0.784 -0.536  0.857  3.273

Coefficients:
             Estimate Std. Error z value Pr(>|z|)
x(Intercept) -5.35008    0.42083   -12.7   <2e-16 ***
xglucose      0.03787    0.00325    11.7   <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1064.67 on 768 degrees of freedom
Residual deviance:  808.72 on 766 degrees of freedom
AIC: 812.7

Number of Fisher Scoring iterations: 4
As the coefficients pertain to a logistic regression, they can be easily interpreted as odds ratios
by taking the exp():
R> exp(coef(pid_tree)[,2])
    2     4     5
1.060 1.048 1.024
i.e., the odds increase by 6%, 4.8% and 2.4% with respect to glucose in the three terminal
nodes.
Log-likelihoods and information criteria are available (which by default also penalize the
estimation of splits):
R> logLik(pid_tree)
'log Lik.' -355.5 (df=8)
R> AIC(pid_tree)
[1] 726.9
R> BIC(pid_tree)
[1] 764.1
Mean squared residuals (or deviances) can be extracted in different ways:
R> mean(residuals(pid_tree)^2)
[1] 0.9257
R> deviance(pid_tree)/sum(weights(pid_tree))
[1] 0.9257
R> deviance(pid_tree)/nobs(pid_tree)
[1] 0.9257
Predicted nodes can also be easily obtained:
R> pid <- head(PimaIndiansDiabetes)
R> predict(pid_tree, newdata = pid, type = "node")
1 2 3 4 5 6
5 5 2 4 5 2
More predictions, e.g., predicted probabilities within the nodes, can also be obtained but
require some extra coding if only mob() is used. However, with the glmtree() interface this
is very easy as shown below.
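To sketch what such extra coding could look like when only mob() was used: the predicted node IDs can be combined with the node coefficients shown above and the inverse logit link plogis() applied manually. This is just one possible approach, tailored to the simple intercept-plus-glucose structure of pid_tree (output omitted):

R> node_id <- predict(pid_tree, newdata = pid, type = "node")
R> cf <- coef(pid_tree)
R> plogis(cf[as.character(node_id), "x(Intercept)"] +
+    cf[as.character(node_id), "xglucose"] * pid$glucose)

The resulting probabilities should essentially match those obtained from predict(pid_tree2, newdata = pid, type = "response") in Section 3.8.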
Finally, several methods for ‘party’ objects are, of course, also available for ‘modelparty’
objects, e.g., querying width and depth of the tree:
R> width(pid_tree)
[1] 3
R> depth(pid_tree)
[1] 2
Also subtrees can be easily extracted:
R> pid_tree[3]
Model-based recursive partitioning (logit)

Model formula:
diabetes ~ glucose | pregnant + pressure + triceps + insulin +
    mass + pedigree + age

Fitted party:
[3] root
|   [4] age <= 30: n = 304
|       x(Intercept)     xglucose
|           -6.70559      0.04684
|   [5] age > 30: n = 297
|       x(Intercept)     xglucose
|           -2.77095      0.02354

Number of inner nodes:    1
Number of terminal nodes: 2
Number of parameters per node: 2
Objective function: 325.2
The subtree is again a completely valid ‘modelparty’ and hence it could also be visualized
via plot(pid_tree[3]) etc.
3.8. Extensions and convenience interfaces
As illustrated above, dealing with MOBs can be carried out in a very generic and object-oriented way. Almost all information required for dealing with recursively partitioned models
can be encapsulated in the fit() function and mob() does not require more information on
what type of model is actually used.
However, for certain tasks more detailed information about the type of model used or the
type of data it can be fitted to can (and should) be exploited. Notable examples for this are
visualizations of the data along with the fitted model or model-based predictions in the leaves
of the tree. To conveniently accommodate such specialized functionality, the partykit package provides
two convenience interfaces lmtree() and glmtree() and encourages other packages to do
the same (e.g., raschtree() and bttree() in psychotree). Furthermore, such a convenience
interface can avoid duplicated formula processing in both mob() and its fit function.
Specifically, lmtree() and glmtree() interface lm.fit(), lm.wfit(), and glm.fit(), respectively, and then provide some extra computations to return valid fitted ‘lm’ and ‘glm’
objects in the nodes of the tree. The resulting ‘modelparty’ object gains an additional class
‘lmtree’/‘glmtree’ to dispatch to its enhanced plot() and predict() methods.
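The additional class can be checked directly for the tree fitted earlier:

R> class(pid_tree2)

which should list "glmtree" in addition to "modelparty" and "party", so that the enhanced methods are dispatched automatically.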
Illustration
The pid_tree2 object was already created above with glmtree() (instead of mob() as for
pid_tree) to illustrate the enhanced plotting capabilities in Figure 2. Here, the enhanced
predict() method is used to obtain predicted means (i.e., probabilities) and associated linear
predictors (on the logit scale) in addition to the previously available predicted node IDs.
R> predict(pid_tree2, newdata = pid, type = "node")
1 2 3 4 5 6
5 5 2 4 5 2
R> predict(pid_tree2, newdata = pid, type = "response")
      1       2       3       4       5       6
0.67092 0.31639 0.68827 0.07330 0.61146 0.04143
R> predict(pid_tree2, newdata = pid, type = "link")
      1       2       3       4       5       6
 0.7123 -0.7704  0.7920 -2.5371  0.4535 -3.1414
4. Illustrations
In the remainder of the vignette, further empirical illustrations of the MOB method are
provided, including replications of the results from Zeileis et al. (2008):
1. An investigation of the price elasticity of the demand for economics journals across covariates describing the type of journal (e.g., its price, its age, and whether it is published
by a society, etc.)
2. Prediction of house prices in the well-known Boston Housing data set, also taken from
the UCI machine learning repository.
3. Explore how teaching ratings and beauty scores of professors are associated and how
this association changes across further explanatory variables such as age, gender, and
native speaker status of the professors.
4. Assessment of differences in the preferential treatment of women/children (“women and
children first”) across subgroups of passengers on board of the ill-fated maiden voyage
of the RMS Titanic.
5. Modeling of breast cancer survival by capturing heterogeneity in certain (treatment)
effects across patients.
6. Modeling of paired comparisons of topmodel candidates by capturing heterogeneity in
their attractiveness scores across participants in a survey.
More details about several of the underlying data sets, previous studies exploring the data,
and the results based on MOB can be found in Zeileis et al. (2008).
Here, we focus on using the partykit package to replicate the analysis and explore the resulting
trees. The first three illustrations employ the lmtree() convenience function, the fourth is
based on logistic regression using glmtree(), and the fifth uses survreg() from survival
(Therneau 2014) in combination with mob() directly. The sixth and last illustration is deferred
to a separate section and shows in detail how to set up new “mobster” functionality from
scratch.
4.1. Demand for economic journals
The price elasticity of the demand for 180 economic journals is assessed by an OLS regression in log-log form: The dependent variable is the logarithm of the number of US library
subscriptions and the regressor is the logarithm of price per citation. The data are provided
by the AER package (Kleiber and Zeileis 2008) and can be loaded and transformed via:
R> data("Journals", package = "AER")
R> Journals <- transform(Journals,
+    age = 2000 - foundingyear,
+    chars = charpp * pages)
Subsequently, the stability of the price elasticity across the remaining variables can be assessed
using MOB. Below, lmtree() is used with the following partitioning variables: raw price and
citations, age of the journal, number of characters, and whether the journal is published by a
scientific society or not. A minimal segment size of 10 observations is employed and by setting
verbose = TRUE detailed progress information about the recursive partitioning is displayed
while growing the tree:
R> j_tree <- lmtree(log(subs) ~ log(price/citations) | price + citations +
+    age + chars + society, data = Journals, minsize = 10, verbose = TRUE)
-- Node 1 ---------------------------------
Number of observations: 180

Parameter instability tests:
           price citations       age  chars society
statistic 6.5617    5.2614 4.220e+01 4.5638  3.2797
p.value   0.9218    0.9881 1.629e-07 0.9977  0.6599

Best splitting variable: age
Perform split? yes
Selected split: <= 18 | > 18

-- Node 2 ---------------------------------
Number of observations: 53

Parameter instability tests:
           price citations    age  chars society
statistic 3.3415    3.7259 5.6132 6.0400  0.6495
p.value   0.9996    0.9984 0.9354 0.8979  0.9984

Best splitting variable: chars
Perform split? no

-- Node 3 ---------------------------------
Number of observations: 127

Parameter instability tests:
          price citations    age  chars society
statistic  3.37    6.8391 5.9868 3.6769  0.6083
p.value    1.00    0.8944 0.9598 0.9999  0.9988

Best splitting variable: citations
Perform split? no
The resulting tree just has one split and two terminal nodes for young journals (with a somewhat larger price elasticity) and old journals (with an even lower price elasticity), respectively.
Figure 3 can be obtained by plot(j_tree) and the corresponding printed representation is
shown below.
R> j_tree
Figure 3: Linear-regression-based tree for the journals data. The plots in the leaves show
linear regressions of log(subscriptions) by log(price/citation).
Linear model tree

Model formula:
log(subs) ~ log(price/citations) | price + citations + age +
    chars + society

Fitted party:
[1] root
|   [2] age <= 18: n = 53
|       (Intercept) log(price/citations)
|            4.3528               -0.6049
|   [3] age > 18: n = 127
|       (Intercept) log(price/citations)
|             5.011                -0.403

Number of inner nodes:    1
Number of terminal nodes: 2
Number of parameters per node: 2
Objective function (residual sum of squares): 77.05
Finally, various quantities of interest such as the coefficients, standard errors and test statistics, and the associated parameter instability tests could be extracted by the following code.
The corresponding output is suppressed here.
R> coef(j_tree, node = 1:3)
R> summary(j_tree, node = 1:3)
R> sctest(j_tree, node = 1:3)
4.2. Boston housing data
The Boston housing data are a popular and well-investigated empirical basis for illustrating
nonlinear regression methods both in machine learning and statistics. We follow previous
analyses by segmenting a bivariate linear regression model for the house values.
The data set is available in package mlbench and can be obtained and transformed via:
R> data("BostonHousing", package = "mlbench")
R> BostonHousing <- transform(BostonHousing,
+    chas = factor(chas, levels = 0:1, labels = c("no", "yes")),
+    rad = factor(rad, ordered = TRUE))
It provides n = 506 observations of the median value of owner-occupied homes in Boston (in
USD 1000) along with 14 covariates including in particular the number of rooms per dwelling
(rm) and the percentage of lower status of the population (lstat). A segment-wise linear
relationship between the value and these two variables is very intuitive, whereas the shape
of the influence of the remaining covariates is rather unclear and hence should be learned
from the data. Therefore, a linear regression model for median value explained by rm^2 and
log(lstat) is employed and partitioned with respect to all remaining variables. Choosing
appropriate transformations of the dependent variable and the regressors that enter the linear
regression model is important to obtain a well-fitting model in each segment and we follow in
our choice the recommendations of Breiman and Friedman (1985). Monotonic transformations
of the partitioning variables do not affect the recursive partitioning algorithm and hence do
not have to be performed. However, it is important to distinguish between numerical and
categorical variables for choosing an appropriate parameter stability test. The variable chas is
a dummy indicator variable (for tract bounds with Charles river) and thus needs to be turned
into a factor. Furthermore, the variable rad is an index of accessibility to radial highways
and takes only 9 distinct values. Hence, it is most appropriately treated as an ordered factor.
Note that both transformations only affect the parameter stability test chosen (step 2), not
the splitting procedure (step 3).
R> bh_tree <- lmtree(medv ~ log(lstat) + I(rm^2) | zn + indus + chas + nox +
+    age + dis + rad + tax + crim + b + ptratio, data = BostonHousing)
R> bh_tree
Linear model tree

Model formula:
medv ~ log(lstat) + I(rm^2) | zn + indus + chas + nox + age +
    dis + rad + tax + crim + b + ptratio

Fitted party:
[1] root
|   [2] tax <= 432
|   |   [3] ptratio <= 15.2: n = 72
|   |       (Intercept) log(lstat) I(rm^2)
|   |            9.2349    -4.9391  0.6859
|   |   [4] ptratio > 15.2
|   |   |   [5] ptratio <= 19.6
|   |   |   |   [6] tax <= 265: n = 63
|   |   |   |       (Intercept) log(lstat) I(rm^2)
|   |   |   |            3.9637    -2.7663  0.6881
|   |   |   |   [7] tax > 265: n = 162
|   |   |   |       (Intercept) log(lstat) I(rm^2)
|   |   |   |           -1.7984    -0.2677  0.6539
|   |   |   [8] ptratio > 19.6: n = 56
|   |   |       (Intercept) log(lstat) I(rm^2)
|   |   |           17.5865    -4.6190  0.3387
|   [9] tax > 432: n = 153
|       (Intercept) log(lstat) I(rm^2)
|           68.2971   -16.3540 -0.1478

Number of inner nodes:    4
Number of terminal nodes: 5
Number of parameters per node: 3
Objective function (residual sum of squares): 6090
The corresponding visualization is shown in Figure 4. It shows partial scatter plots of the
dependent variable against each of the regressors in the terminal nodes. Each scatter plot
also shows the fitted values, i.e., a projection of the fitted hyperplane.
From this visualization, it can be seen that in the nodes 4, 6, 7 and 8 the increase of value with
the number of rooms dominates the picture (upper panel) whereas in node 9 the decrease with
the lower status population percentage (lower panel) is more pronounced. Splits are performed
in the variables tax (property-tax rate) and ptratio (pupil-teacher ratio).
For summarizing the quality of the fit, we could compute the mean squared error, log-likelihood or AIC:
R> mean(residuals(bh_tree)^2)
[1] 12.04
R> logLik(bh_tree)
'log Lik.' -1311 (df=24)
R> AIC(bh_tree)
[1] 2669
Figure 4: Linear-regression-based tree for the Boston housing data. The plots in the leaves
give partial scatter plots for rm (upper panel) and lstat (lower panel).
4.3. Teaching ratings data
Hamermesh and Parker (2005) investigate the correlation of beauty and teaching evaluations
for professors. They provide data on course evaluations, course characteristics, and professor
characteristics for 463 courses for the academic years 2000–2002 at the University of Texas
at Austin. It is of interest how the average teaching evaluation per course (on a scale 1–5)
depends on a standardized measure of beauty (as assessed by a committee of six persons
based on photos). Hamermesh and Parker (2005) employ a linear regression, weighted by the
number of students per course and adjusting for several further main effects: gender, whether
or not the teacher is from a minority, a native speaker, or has tenure, respectively, and whether
the course is taught in the upper or lower division. Additionally, the age of the professors
is available as a regressor but not considered by Hamermesh and Parker (2005) because the
corresponding main effect is not found to be significant in either linear or quadratic form.
Here, we employ a similar model but use the available regressors differently. The basic model
is again a linear regression for teaching rating by beauty, estimated by weighted least squares
(WLS). All remaining explanatory variables are not used as regressors but as partitioning
variables because we argue that it is unclear how they influence the correlation between
teaching rating and beauty. Hence, MOB is used to adaptively incorporate these further
variables and determine potential interactions.
First, the data are loaded from the AER package (Kleiber and Zeileis 2008) and the subset of single-credit courses is excluded.
R> data("TeachingRatings", package = "AER")
R> tr <- subset(TeachingRatings, credits == "more")
The single-credit courses include elective modules that are quite different from the remaining
courses (e.g., yoga, aerobics, or dance) and are hence omitted from the main analysis.
WLS estimation of the null model (with only an intercept) and the main effects model is
carried out in a first step:
R> tr_null <- lm(eval ~ 1, data = tr, weights = students)
R> tr_lm <- lm(eval ~ beauty + gender + minority + native + tenure + division,
+    data = tr, weights = students)
Then, the model-based tree can be estimated with lmtree() using only beauty as a regressor
and all remaining variables for partitioning. For WLS estimation, we set weights = students
and caseweights = FALSE (because the weights are only proportionality factors and do not
signal exactly replicated observations).
R> tr_tree <- lmtree(eval ~ beauty | minority + age + gender + division +
+    native + tenure, data = tr, weights = students, caseweights = FALSE)
The resulting tree can be visualized by plot(tr_tree) and is shown in Figure 5. This shows
that despite age not having a significant main effect (as reported by Hamermesh and Parker
2005), it clearly plays an important role: While the correlation of teaching rating and beauty
score is rather moderate for younger professors, there is a clear correlation for older professors
(with the cutoff age somewhat lower for female professors).
[Figure 5 about here: root split on gender (p < 0.001), further splits on age (p = 0.008 and p = 0.014, at 40 and 50 years) and division (p = 0.019); leaf panels Node 3 (n = 113), Node 4 (n = 137), Node 6 (n = 69), Node 8 (n = 81), and Node 9 (n = 36).]
Figure 5: WLS-based tree for the teaching ratings data. The plots in the leaves show
scatterplots for teaching rating by beauty score.
R> coef(tr_lm)[2]
beauty
0.2826
R> coef(tr_tree)[, 2]
      3       4       6       8       9 
 0.1292  0.5028  0.1222 -0.1976  0.4033 
The R² of the tree is also clearly improved over the main effects model:
R> 1 - c(deviance(tr_lm), deviance(tr_tree))/deviance(tr_null)
[1] 0.2713 0.3820
4.4. Titanic survival data
To illustrate how differences in treatment effects can be captured by MOB, the Titanic survival
data is considered: The question is whether “women and children first” is applied in the same
way for all subgroups of the passengers of the Titanic. Or, in other words, whether the
effectiveness of preferential treatment for women/children differed across subgroups.
The Titanic data is provided in base R as a contingency table and transformed here to a
‘data.frame’ for use with glmtree():
R> data("Titanic", package = "datasets")
R> ttnc <- as.data.frame(Titanic)
R> ttnc <- ttnc[rep(1:nrow(ttnc), ttnc$Freq), 1:4]
R> names(ttnc)[2] <- "Gender"
R> ttnc <- transform(ttnc, Treatment = factor(
+    Gender == "Female" | Age == "Child", levels = c(FALSE, TRUE),
+    labels = c("Male&Adult", "Female|Child")))
The data provides factors Survived (yes/no), Class (1st, 2nd, 3rd, crew), Gender (male,
female), and Age (child, adult). Additionally, a factor Treatment is added that distinguishes
women/children from male adults.
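As a quick plausibility check of this recoding, one can cross-tabulate the derived factor against the original variables (output not shown); for example:
R> xtabs(~ Treatment + Gender + Age, data = ttnc)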
To investigate how the preferential treatment effect (Survived ~ Treatment) differs across
the remaining explanatory variables, the following logistic-regression-based tree is considered.
The significance level of alpha = 0.01 is employed here to avoid overfitting and separation
problems in the logistic regression.
R> ttnc_tree <- glmtree(Survived ~ Treatment | Class + Gender + Age,
+    data = ttnc, family = binomial, alpha = 0.01)
R> ttnc_tree
Generalized linear model tree (family: binomial)
Model formula:
Survived ~ Treatment | Class + Gender + Age
Fitted party:
[1] root
|   [2] Class in 3rd: n = 706
|           (Intercept) TreatmentFemale|Child
|                -1.641                 1.327
|   [3] Class in 1st, 2nd, Crew
|   |   [4] Class in 2nd: n = 285
|   |           (Intercept) TreatmentFemale|Child
|   |                -2.398                 4.477
|   |   [5] Class in 1st, Crew: n = 1210
|   |           (Intercept) TreatmentFemale|Child
|   |                -1.152                 4.318

Number of inner nodes:    2
Number of terminal nodes: 3
Number of parameters per node: 2
Objective function (negative log-likelihood): 1061
[Figure 6 about here: root split on Class (p < 0.001) separating 3rd class from 1st/2nd/Crew, then a further split separating 2nd class from 1st/Crew; leaf panels Node 2 (n = 706), Node 4 (n = 285), and Node 5 (n = 1210).]
Figure 6: Logistic-regression-based tree for the Titanic survival data. The plots in the leaves
give spinograms for survival status versus preferential treatment (women or children).
This shows that the treatment effect differs strongly across passenger classes; see also the result of plot(ttnc_tree) in Figure 6. The treatment effect is much lower in the 3rd class where women/children still have higher survival rates than adult men but the odds ratio is much lower compared to all remaining classes. The split between the 2nd and the remaining two classes (1st, crew) is due to a lower overall survival rate (intercepts of -2.4 and -1.15, respectively) while the coefficients associated with the preferential treatment, i.e., the log odds ratios, are rather similar (4.48 and 4.32, respectively).
Another option for assessing the class effect would be to immediately split into all four
classes rather than using recursive binary splits. This can be obtained by setting catsplit =
"multiway" in the glmtree() call above. This yields a tree with just a single split but four
kid nodes.
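A minimal sketch of such a call, reusing the formula and settings from above (the object name ttnc_tree2 is only illustrative):
R> ttnc_tree2 <- glmtree(Survived ~ Treatment | Class + Gender + Age,
+    data = ttnc, family = binomial, alpha = 0.01, catsplit = "multiway")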
4.5. German breast cancer data
To illustrate that the MOB approach can also be used beyond (generalized) linear regression
models, it is employed in the following to analyze censored survival times among German
women with positive node breast cancer. The data is available in the TH.data package and
the survival time is transformed from days to years:
R> data("GBSG2", package = "TH.data")
R> GBSG2$time <- GBSG2$time/365
For regression, a parametric Weibull regression based on the survreg() function from the
survival package (Therneau 2014) is used. A fitting function for mob() can be easily set up:
R> library("survival")
R> wbreg <- function(y, x, start = NULL, weights = NULL, offset = NULL, ...) {
+    survreg(y ~ 0 + x, weights = weights, dist = "weibull", ...)
+ }
As the survival package currently does not provide a logLik() method for ‘survreg’ objects,
this needs to be added here:
R> logLik.survreg <- function(object, ...)
+    structure(object$loglik[2], df = sum(object$df), class = "logLik")
Without the logLik() method, mob() would not know how to extract the optimized objective
function from the fitted model.
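As a quick sanity check (output not shown), one can verify that this method is picked up on a directly fitted Weibull model, for example:
R> logLik(survreg(Surv(time, cens) ~ horTh + pnodes, data = GBSG2,
+    dist = "weibull"))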
With the functions above available, a censored Weibull-regression-tree can be fitted: The
dependent variable is the censored survival time and the two regressor variables are the main
risk factor (number of positive lymph nodes) and the treatment variable (hormonal therapy).
All remaining variables are used for partitioning: age, tumor size and grade, progesterone and
estrogen receptor, and menopausal status. The minimal segment size is set to 80.
R> gbsg2_tree <- mob(Surv(time, cens) ~ horTh + pnodes | age + tsize +
+    tgrade + progrec + estrec + menostat, data = GBSG2,
+    fit = wbreg, control = mob_control(minsize = 80))
Based on progesterone receptor, a tree with two leaves is found:
R> gbsg2_tree
Model-based recursive partitioning (wbreg)

Model formula:
Surv(time, cens) ~ horTh + pnodes | age + tsize + tgrade + progrec +
    estrec + menostat

Fitted party:
[1] root
|   [2] progrec <= 24: n = 299
|       x(Intercept)    xhorThyes      xpnodes
|            1.77331      0.17364     -0.06535
|   [3] progrec > 24: n = 387
|       x(Intercept)    xhorThyes      xpnodes
|             1.9730       0.4451      -0.0302

Number of inner nodes:    1
Number of terminal nodes: 2
Number of parameters per node: 3
Objective function: 809.9
[Figure 7 about here: split on progrec (p < 0.001) at 24, with the estimated parameters reported in nodes 2 (n = 299) and 3 (n = 387).]
Figure 7: Censored Weibull-regression-based tree for the German breast cancer data. The
plots in the leaves report the estimated regression coefficients.
[Figure 8 about here: censored and uncensored survival times by number of positive lymph nodes in nodes 2 (n = 299) and 3 (n = 387), with fitted median survival curves.]
Figure 8: Censored Weibull-regression-based tree for the German breast cancer data. The
plots in the leaves depict censored (hollow) and uncensored (solid) survival time by number
of positive lymph nodes along with fitted median survival for patients with (dashed line) and
without (solid line) hormonal therapy.
R> coef(gbsg2_tree)
  x(Intercept) xhorThyes  xpnodes
2        1.773    0.1736 -0.06535
3        1.973    0.4451 -0.03020
R> logLik(gbsg2_tree)
'log Lik.' -809.9 (df=9)
The visualization produced by plot(gbsg2_tree) is shown in Figure 7. A nicer graphical
display using a scatter plot (with indication of censoring) and fitted regression curves is shown
in Figure 8. This uses a custom panel function whose code is too long and elaborate to be
shown here. Interested readers are referred to the R code underlying the vignette.
The visualization shows that for higher progesterone receptor levels: (1) survival times are
higher overall, (2) the treatment effect of hormonal therapy is higher, and (3) the negative
effect of the main risk factor (number of positive lymph nodes) is less severe.
5. Setting up a new mobster
To conclude this vignette, we present another illustration that shows how to set up new
mobster functionality from scratch. To do so, we implement the Bradley-Terry tree suggested
by Strobl et al. (2011) “by hand”. The psychotree package already provides an easy-to-use
mobster called bttree() but as an implementation exercise we recreate its functionality here.
The only inputs required are a suitable data set with paired comparisons (Topmodel2007 from
psychotree), a parametric model for paired comparison data (btReg.fit() from psychotools,
implementing the Bradley-Terry model), and an extractor method for the corresponding empirical estimating functions (which are computed by default in btReg.fit() already and just
need to be extracted):
R> data("Topmodel2007", package = "psychotree")
R> library("psychotools")
R> estfun.btReg <- function(x, ...) x$estfun
The Bradley-Terry (or Bradley-Terry-Luce) model is a standard model for paired comparisons
in social sciences. It parametrizes the probability πij for preferring some object i over another
object j in terms of corresponding “ability” or “worth” parameters θi:

    πij = θi / (θi + θj)
    logit(πij) = log(θi) − log(θj)

This model can be easily estimated by maximum likelihood as a logistic or log-linear GLM.
This is the approach used internally by btReg.fit().
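To make the equivalence of the two formulations concrete, a tiny numerical check with made-up worth parameters shows that both yield the same preference probability of 2/3:
R> theta_i <- 2; theta_j <- 1               ## hypothetical worth parameters
R> theta_i / (theta_i + theta_j)            ## 0.6667
R> plogis(log(theta_i) - log(theta_j))      ## 0.6667, identical via the logit form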
The Topmodel2007 data provide paired comparisons of attractiveness among the six finalists
of the TV show Germany’s Next Topmodel 2007: Barbara, Anni, Hana, Fiona, Mandy, Anja.
The data were collected in a survey with 192 respondents at Universität Tübingen and the
available covariates comprise gender, age, and familiarity with the TV show. The latter is
assessed by three yes/no questions: (1) Do you recognize the women?/Do you know the
show? (2) Did you watch it regularly? (3) Did you watch the final show?/Do you know who
won?
To fit the Bradley-Terry tree to the data, the available building blocks just have to be tied
together. First, we set up the basic/simple model fitting interface described in Section 3.2:
R> btfit1 <- function(y, x = NULL, start = NULL, weights = NULL,
+    offset = NULL, ...) btReg.fit(y, ...)
The function btfit1() simply calls btReg.fit() ignoring all arguments except y as the
others are not needed here. No more processing is required because ‘btReg’ objects come
with a coef() and logLik() method and an estfun() method was already defined above.
Hence, we can call mob() now specifying the response and the partitioning variable (and no
regressors because there are no regressors in this model).
R> system.time(bt1 <- mob(preference ~ 1 | gender + age + q1 + q2 + q3,
+    data = Topmodel2007, fit = btfit1))
   user  system elapsed 
  1.355   0.000   1.357 
An alternative way to fit the exact same tree somewhat more quickly would be to employ
the extended interface described in Section 3.2: Still btReg.fit() is employed for fitting the
model but the quantities estfun and vcov are only computed if they are really required. This
saves some computation time (about 20% on the authors’ machine) when computing the tree:
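(The definition of btfit2() itself does not appear in this excerpt. A minimal sketch, assuming the extended fitting interface expects a list with components coefficients, objfun, and, only on demand, estfun and object, could look like this.)
R> btfit2 <- function(y, x = NULL, start = NULL, weights = NULL,
+    offset = NULL, ..., estfun = FALSE, object = FALSE) {
+    rval <- btReg.fit(y, ...)
+    list(
+      coefficients = rval$coefficients,           ## worth parameters on the logit scale
+      objfun = -rval$loglik,                      ## negative log-likelihood to be minimized
+      estfun = if (estfun) rval$estfun else NULL, ## scores only when requested
+      object = if (object) rval else NULL         ## full fitted model only when requested
+    )
+ }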
R> system.time(bt2 <- mob(preference ~ 1 | gender + age + q1 + q2 + q3,
+    data = Topmodel2007, fit = btfit2))
   user  system elapsed 
  1.335   0.000   1.337 
The speed-up is not huge but comes almost for free: just a few additional lines of code in
btfit2() are required. For other models where it is more costly to set up a full model (with
variance-covariance matrix, predictions, residuals, etc.) larger speed-ups are also possible.
Both trees, bt1 and bt2, are equivalent (except for the details of the fitting function). Hence,
in the following we only explore bt2. However, the same code can be applied to bt1 as well.
Many tools come completely for free and are inherited from the general ‘modelparty’/‘party’.
For example, both printing (print(bt2)) and plotting (plot(bt2)) show the estimated parameters in the terminal nodes, which can also be extracted by the coef() method:
R> bt2
Model-based recursive partitioning (btfit2)

Model formula:
preference ~ 1 | gender + age + q1 + q2 + q3

Fitted party:
[1] root
|   [2] age <= 52
|   |   [3] q2 in yes: n = 35
|   |       Barbara   Anni   Hana  Fiona  Mandy
|   |        1.3378 1.2318 2.0499 0.8339 0.6217
|   |   [4] q2 in no
|   |   |   [5] gender in male: n = 71
|   |   |       Barbara    Anni    Hana   Fiona    Mandy
|   |   |       0.43866 0.08877 0.84629 0.69424 -0.10003
|   |   |   [6] gender in female: n = 56
|   |   |       Barbara   Anni   Hana  Fiona   Mandy
|   |   |        0.9475 0.7246 0.4452 0.6350 -0.4965
|   [7] age > 52: n = 30
|       Barbara    Anni    Hana   Fiona   Mandy
|        0.2178 -1.3166 -0.3059 -0.2591 -0.2357

Number of inner nodes:    3
Number of terminal nodes: 4
Number of parameters per node: 5
Objective function: 1829

R> coef(bt2)
  Barbara     Anni    Hana   Fiona   Mandy
3  1.3378  1.23183  2.0499  0.8339  0.6217
5  0.4387  0.08877  0.8463  0.6942 -0.1000
6  0.9475  0.72459  0.4452  0.6350 -0.4965
7  0.2178 -1.31663 -0.3059 -0.2591 -0.2357
The corresponding visualization is shown in the upper panel of Figure 9. In all cases, the
estimated coefficients on the logit scale omitting the fixed zero reference level (Anja) are
reported. To show the corresponding worth parameters θi including the reference level, one
can simply provide a small panel function worthf(). This applies the worth() function from
psychotools to the fitted-model object stored in the info slot of each node, yielding the lower
panel in Figure 9.
R> worthf <- function(info) paste(info$object$labels,
+    format(round(worth(info$object), digits = 3)), sep = ": ")
R> plot(bt2, FUN = worthf)
To show a graphical display of these worth parameters rather than printing their numerical values, one can use a simple glyph-style plot. A simple way to generate these is to use the plot() method for ‘btReg’ objects from psychotools and to apply it via nodeapply() to all terminal nodes (see Figure 10):
[Figure 9 about here: the Bradley-Terry tree shown twice, once with the estimated coefficients and once with the worth parameters in the terminal nodes.]
Figure 9: Bradley-Terry-based tree for the topmodel attractiveness data. The default plot (upper panel) reports the estimated coefficients on the log scale while the adapted plot (lower panel) shows the corresponding worth parameters.
[Figure 10 about here: worth parameter plots for terminal nodes 3, 5, 6, and 7.]
Figure 10: Worth parameters in the terminal nodes of the Bradley-Terry-based tree for the
topmodel attractiveness data.
R> par(mfrow = c(2, 2))
R> nodeapply(bt2, ids = c(3, 5, 6, 7), FUN = function(n)
+    plot(n$info$object, main = n$id, ylim = c(0, 0.4)))
Alternatively, one could set up a proper panel-generating function in grid that allows one to create the glyphs within the terminal nodes of the tree (see Figure 11). As the code for this panel-generating function node_btplot() is too complicated to display within the vignette, we refer interested readers to the underlying code. Given this panel-generating function, Figure 11 can be generated with
R> plot(bt2, drop = TRUE, tnex = 2,
+    terminal_panel = node_btplot(bt2, abbreviate = 1, yscale = c(0, 0.5)))
Finally, to illustrate how different predictions can be easily computed, we set up a small data
set with three new individuals:
  age gender q1  q2 q3
1  60   male no  no no
2  25 female no  no no
3  35 female no yes no
[Figure 11 about here: the Bradley-Terry tree with worth-parameter glyphs in nodes 3 (n = 35), 5 (n = 71), 6 (n = 56), and 7 (n = 30).]
Figure 11: Bradley-Terry-based tree for the topmodel attractiveness data with visualizations
of the worth parameters in the terminal nodes.
For these we can easily compute (1) the predicted node ID, (2) the corresponding worth
parameters, (3) the associated ranks.
R> tm
  age gender q1  q2 q3
1  60   male no  no no
2  25 female no  no no
3  35 female no yes no
R> predict(bt2, tm, type = "node")
1 2 3
7 3 5
R> predict(bt2, tm, type = function(object) t(worth(object)))
  Barbara    Anni   Hana  Fiona   Mandy    Anja
1  0.2585 0.05573 0.1531 0.1605 0.16427 0.20792
2  0.1889 0.16993 0.3851 0.1142 0.09232 0.04958
3  0.1746 0.12305 0.2625 0.2254 0.10188 0.11259
R> predict(bt2, tm, type = function(object) t(rank(-worth(object))))
  Barbara Anni Hana Fiona Mandy Anja
1       1    6    5     4     3    2
2       2    3    1     4     5    6
3       3    4    1     2     6    5
This completes the tour of fitting, printing, plotting, and predicting the Bradley-Terry tree
model. Convenience interfaces that employ code like shown above can be easily defined and
collected in new packages such as psychotree.
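For illustration only (this is not necessarily how bttree() in psychotree is implemented), such a convenience wrapper could be as small as fixing the fitting function; a hypothetical sketch:
R> my_bttree <- function(formula, data, ...) {
+    mob(formula, data = data, fit = btfit2, ...)  ## reuse the fitting function from above
+ }
R> bt3 <- my_bttree(preference ~ 1 | gender + age + q1 + q2 + q3,
+    data = Topmodel2007)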
6. Conclusion
The function mob() in the partykit package provides a new flexible and object-oriented implementation of the general algorithm for model-based recursive partitioning using partykit’s
recursive partytioning infrastructure. New model fitting functions can be easily provided,
especially if standard extractor functions (such as coef(), estfun(), logLik(), etc.) are
available. The resulting model trees can then be learned, visualized, investigated, and employed
for predictions. To gain some efficiency in the computations and to allow for further customization (in particular specialized visualization and prediction methods), convenience interfaces
lmtree() and glmtree() are provided for recursive partitioning based on (generalized) linear
models.
References
Bache K, Lichman M (2013). “UCI Machine Learning Repository.” URL http://archive.ics.uci.edu/ml/.
Bai J, Perron P (2003). “Computation and Analysis of Multiple Structural Change Models.”
Journal of Applied Econometrics, 18, 1–22.
Breiman L, Friedman JH (1985). “Estimating Optimal Transformations for Multiple Regression and Correlation.” Journal of the American Statistical Association, 80(391), 580–598.
Genz A, Bretz F, Miwa T, Mi X, Leisch F, Scheipl F, Hothorn T (2014). mvtnorm: Multivariate Normal and t Distributions. R package version 1.0-0, URL http://CRAN.R-project.org/package=mvtnorm.
Grün B, Kosmidis I, Zeileis A (2012). “Extended Beta Regression in R: Shaken, Stirred, Mixed, and Partitioned.” Journal of Statistical Software, 48(11), 1–25. URL http://www.jstatsoft.org/v48/i11/.
Hamermesh DS, Parker A (2005). “Beauty in the Classroom: Instructors’ Pulchritude and
Putative Pedagogical Productivity.” Economics of Education Review, 24, 369–376.
Hansen BE (1997). “Approximate Asymptotic p Values for Structural-Change Tests.” Journal
of Business & Economic Statistics, 15, 60–67.
Hawkins DM (2001). “Fitting Multiple Change-Point Models to Data.” Computational Statistics & Data Analysis, 37, 323–341.
Hothorn T, Hornik K, Strobl C, Zeileis A (2014). party: A Laboratory for Recursive Partytioning. R package version 1.0-16, URL http://CRAN.R-project.org/package=party.
Hothorn T, Leisch F, Zeileis A (2013). modeltools: Tools and Classes for Statistical Models.
R package version 0.2-21, URL http://CRAN.R-project.org/package=modeltools.
Hothorn T, Zeileis A (2014). partykit: A Toolkit for Recursive Partytioning. R package
version 0.8-1, URL http://CRAN.R-project.org/package=partykit.
Kleiber C, Zeileis A (2008). Applied Econometrics with R. Springer-Verlag, New York. URL
http://CRAN.R-project.org/package=AER.
Leisch F, Dimitriadou E (2012). mlbench: Machine Learning Benchmark Problems. R
package version 2.1-1, URL http://CRAN.R-project.org/package=mlbench.
Merkle EC, Fan J, Zeileis A (2013). “Testing for Measurement Invariance with Respect to an
Ordinal Variable.” Psychometrika. doi:10.1007/s11336-013-9376-7. Forthcoming.
Merkle EC, Zeileis A (2013). “Tests of Measurement Invariance without Subgroups: A Generalization of Classical Methods.” Psychometrika, 78(1), 59–82.
Meyer D, Zeileis A, Hornik K (2006). “The Strucplot Framework: Visualizing Multi-Way Contingency Tables with vcd.” Journal of Statistical Software, 17(3), 1–48. URL http://www.jstatsoft.org/v17/i03/.
R Core Team (2013). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. URL http://www.R-project.org/.
Strobl C, Kopf J, Zeileis A (2013). “A New Method for Detecting Differential Item Functioning
in the Rasch Model.” Psychometrika. doi:10.1007/s11336-013-9388-3. Forthcoming.
Strobl C, Wickelmaier F, Zeileis A (2011). “Accounting for Individual Differences in Bradley-Terry Models by Means of Recursive Partitioning.” Journal of Educational and Behavioral
Statistics, 36(2), 135–153.
Su X, Wang M, Fan J (2004). “Maximum Likelihood Regression Trees.” Journal of Computational and Graphical Statistics, 13, 586–598.
Therneau TM (2014). survival: A Package for Survival Analysis in S. R package version
2.37-7, URL http://CRAN.R-project.org/package=survival.
Zeileis A (2005). “A Unified Approach to Structural Change Tests Based on ML Scores,
F Statistics, and OLS Residuals.” Econometric Reviews, 24, 445–466.
Zeileis A (2006). “Object-Oriented Computation of Sandwich Estimators.” Journal of Statistical Software, 16(9), 1–16. URL http://www.jstatsoft.org/v16/i09/.
Zeileis A, Croissant Y (2010). “Extended Model Formulas in R: Multiple Parts and Multiple Responses.” Journal of Statistical Software, 34(1), 1–13. URL http://www.jstatsoft.org/v34/i01/.
Zeileis A, Hornik K (2007). “Generalized M-Fluctuation Tests for Parameter Instability.”
Statistica Neerlandica, 61(4), 488–508.
Zeileis A, Hothorn T, Hornik K (2008). “Model-Based Recursive Partitioning.” Journal of
Computational and Graphical Statistics, 17(2), 492–514.
Zeileis A, Leisch F, Hornik K, Kleiber C (2002). “strucchange: An R Package for Testing
for Structural Change in Linear Regression Models.” Journal of Statistical Software, 7(2),
1–38. URL http://www.jstatsoft.org/v07/i02/.
Affiliation:
Achim Zeileis
Department of Statistics
Faculty of Economics and Statistics
Universität Innsbruck
Universitätsstr. 15
6020 Innsbruck, Austria
E-mail: [email protected]
URL: http://eeecon.uibk.ac.at/~zeileis/
Torsten Hothorn
Institut für Sozial- und Präventivmedizin, Abteilung Biostatistik
Universität Zürich
Hirschengraben 84
CH-8001 Zürich, Switzerland
E-mail: [email protected]
URL: http://user.math.uzh.ch/hothorn/