Full technical report

ENsEN:
Machine Learning For Sentence Selection
Mazen Alsarem
INSA-Lyon,
69100 Villeurbanne, France
Abstract. As described in our approach ENsEN, we select a sentence (excerpt) from the document and show it in our snippet. This excerpt (we call it the main-sentence) gives the user an idea of the content of the document and of how it is related to her query. In addition, when we show the top-k resources (sometimes we use the term concepts) in this snippet, we associate each resource with a sentence that justifies its selection as a top-k resource. This sentence is the most relevant one in the document, considering the query, the corresponding resource, the text and the other resources; we call it a concept-sentence.
Here we propose a machine learning process to select these two types of sentences from a web page. We represent whether a sentence is relevant or not by a binary class (true, false), both for main-sentence and for concept-sentence selection.
Keywords: ENsEN, Machine learning, Linked Data, Feature selection
1 Objective
The ultimate objective is to build two classifiers (one for each dataset). These classifiers predict whether a sentence is relevant (or not) for use in the ENsEN snippets.
2 Experiment Setup
In order to train the classifiers, we need sample data with the corresponding
relevance values (i.e. pre-labelled data). The datasets used to train our models
consist of sentences extracted from web-pages retrieved by Google using 20 different queries, and annotated using DBpedia Spotlight [3].
We built an evaluation system, where evaluators can manually label sentences as relevant or not (see figure 1). The evaluation has two objectives (two datasets):
(1) evaluate whether a sentence is relevant for the whole document (i.e. can be used as a textual snippet for this document, as a main-sentence);
(2) evaluate whether a sentence is relevant for a resource (i.e. can be used as a description of a resource, as a concept-sentence).
The evaluators must consider the context of each sentence, i.e. the document, the query and the annotated resources. For each sentence in a document, evaluators can say whether "it is appropriate or not", and for each resource annotated in it whether "it is appropriate for this resource or not". The results of this manual evaluation are two datasets, one for main-sentence selection and one for concept-sentence selection.
Fig. 1. Example of evaluating a sentence
We extracted 5 web-pages per query (182 HTML documents), and we used some search filters in Google such as: English documents only, HTML documents only (no PDF, no doc, ...). We extracted the text from each HTML document using Boilerplate Detection [2], and we used ICU BreakIterator (see http://site.icu-project.org/home and http://icu-project.org/apiref/icu4j/com/ibm/icu/text/BreakIterator.html) to split the text into sentences. We pre-processed the sentences in order to eliminate those without any annotated resources; figure 2 shows some statistics over the training datasets.
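To make this step concrete, below is a minimal sketch of sentence splitting with ICU4J's BreakIterator (the class name, locale and example text are ours, not taken from the ENsEN implementation):

import com.ibm.icu.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Minimal sketch: split boilerplate-free text into sentences with ICU4J. */
public class SentenceSplitter {

    /** Returns the sentences of the given text, in document order. */
    public static List<String> split(String text) {
        List<String> sentences = new ArrayList<>();
        BreakIterator it = BreakIterator.getSentenceInstance(Locale.ENGLISH);
        it.setText(text);
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String sentence = text.substring(start, end).trim();
            if (!sentence.isEmpty()) {
                sentences.add(sentence);
            }
        }
        return sentences;
    }

    public static void main(String[] args) {
        // Toy input standing in for the boilerplate-free text of a web page.
        String text = "Lyon is a city in France. It lies at the confluence of the Rhone and the Saone.";
        split(text).forEach(System.out::println);
    }
}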
Fig. 2. Example of evaluated dataset
We came up with 3206 sentences, with 1.54 resources on average per sentence (see figure 3). These sentences were manually labelled by two of our researchers (evaluators). We thus obtained the two datasets (see example in figure 2):
Fig. 3. Statistics over the used queries
(1) main-sentence dataset with 3206 instances,
(2) concept-sentence dataset with 4884 instances.
3 Features
3.1 Features of main-sentence dataset
The selection of a sentence to be the document's main-sentence is related to many factors (a small sketch showing how a few of these features can be computed follows this list):
1. The annotated resources in this sentence:
– (AR) how many resources (annotated by DBpedia) are there?
– (MR) how many top-k resources are there? (The top-k resources are the top 10 resources selected by our algorithm, LDSVD, as the most important in the document's RDF graph, using the graph structure and the text analysis; see the LDSVD section.)
– (AS) the average of the annotation scores returned by the annotation tool, i.e. DBpedia Spotlight. (This score is defined by DBpedia Spotlight as the "score from comparing the context representation of an entity with the text (e.g. cosine similarity with tf-icf weights)".)
– (ARL) how many terms are shared between this sentence and the labels of the annotated resources?
– (MRL) how many terms are shared between this sentence and the labels of the top-k resources?
– (CS) the average score of the annotated resources, calculated by our LDSVD algorithm; each resource score is normalized by the maximum score: CS = (Σ_i (Score_i / MaxScore)) / AR.
2. The query
– (QR) how many resources are shared between this sentence and the query
(we annotate the query using the same tool)?
– (QMR) how many top-k resources are shared between this sentence and
the query?
– (QT) how many terms are shared between this sentence and the query?
3. The RDF graph
– (CNS) the number of links (arcs) between sentence resources in the document’s graph.
– (PT) how many top-predicates exist in this sentence? The top-predicates are selected by our second algorithm: after applying the tensor decomposition (see the tensor decomposition section), we generate groups of triples; in each group we have a set of predicates ranked by importance, and we take the most important predicates (score over 0.1) of each group and put them together to build the top-predicates set.
4. The sentence’s text
– (LEN) the sentence length.
– (SSS) how many small sub-sentences are there? We consider a sub-sentence (separated by ',') small when it has fewer than four words.
– (TT) how many terms are shared between this sentence and the page's title?
– (SS, SE) the sentence start (character, number, other) and the sentence end ('.' or ',' or ...).
– (HL) does the text contain a URL or "HTTP"?
– (D) does the text contain a date or not?
– (PHP) the sentence position in the document, normalized by the number of sentences in the document: PHP = i * 100 / m, where i is the sentence order and m is the number of sentences in the document.
– (FPH) is it the first sentence in the document?
– (LPH) is it the last sentence in the document?
– (NCH) the ratio of non-alphabetic to alphabetic characters.
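To make the feature definitions concrete, here is a small illustrative sketch (our own simplified code, with naive whitespace tokenization, not the exact ENsEN implementation) computing three of the features above, QT, LEN and PHP, for a single sentence:

import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

/** Illustrative computation of a few main-sentence features (QT, LEN, PHP). */
public class MainSentenceFeatures {

    /** Naive tokenization into lower-cased terms. */
    static Set<String> terms(String text) {
        return new HashSet<>(Arrays.asList(text.toLowerCase().split("\\W+")));
    }

    /** QT: number of terms shared between the sentence and the query. */
    static int qt(String sentence, String query) {
        Set<String> shared = terms(sentence);
        shared.retainAll(terms(query));
        return shared.size();
    }

    /** LEN: sentence length, here measured in words (characters could also be used). */
    static int len(String sentence) {
        return sentence.trim().split("\\s+").length;
    }

    /** PHP: position of the i-th sentence (1-based), normalized by the m sentences of the document. */
    static double php(int i, int m) {
        return i * 100.0 / m;
    }

    public static void main(String[] args) {
        String query = "barcelona fc players";
        String sentence = "The players of FC Barcelona won the match.";
        System.out.printf("QT=%d LEN=%d PHP=%.1f%n", qt(sentence, query), len(sentence), php(3, 40));
    }
}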
3.2 Features of concept-sentence dataset
The selection of a sentence (or more) to describe an annotated resource in the
document is related to these factors:
1. All the features of main-sentence dataset.
2. The resource itself
– (TCT) the shared terms between the sentence and the selected resource’s
label.
– (CA) the shared terms between the sentence and the selected resource’s
abstract.
– (CN) how many of the resource's neighbours (in the RDF graph) are annotated in this sentence? (We do not normalize this score, in order not to penalize highly connected resources.)
– (CB) how many of the resource's brothers are annotated in this sentence? (A resource's brothers are the top resources selected by our second algorithm.)
– (CNL) the number of terms shared between this sentence and the labels of the resource's neighbours.
– (CBL) the number of terms shared between this sentence and the labels of the resource's brothers.
– (CP) this resource's position in the sentence, normalized by the sentence length: CP = (LEN − Position) / LEN.
– (CSC) this resource's score from the LDSVD algorithm, normalized by the maximum score: CSC = Score_i / MaxScore.
– (CAS) this resource's annotation score (returned by DBpedia Spotlight).
4 Classification
We merged the manually evaluated datasets with the generated features in order to build complete machine learning training datasets, stored them in ARFF files, and employed the Weka machine learning toolkit to train different classifiers. We chose classifiers that output confidence values; we use these values to order the sentences by their probability of relevance, and then select the top sentence as the main-sentence (or concept-sentence).
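A minimal sketch of this step using the Weka Java API is shown below; the ARFF file name, the choice of J48 and the class value names ("true"/"false") are illustrative assumptions, and any of the classifiers in section 4.1 could be plugged in:

import java.util.Random;
import weka.classifiers.Classifier;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instance;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

/** Sketch: train a classifier on an ARFF dataset and rank sentences by confidence. */
public class SentenceClassifierSketch {

    public static void main(String[] args) throws Exception {
        // Load the pre-labelled dataset (e.g. the main-sentence ARFF file).
        Instances data = DataSource.read("main-sentence.arff");
        data.setClassIndex(data.numAttributes() - 1); // last attribute = true/false class

        // Train any Weka classifier that outputs class probabilities.
        Classifier classifier = new J48();
        classifier.buildClassifier(data);

        // 10-fold cross-validation to measure prediction performance.
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(classifier, data, 10, new Random(1));
        int trueIndex = data.classAttribute().indexOfValue("true");
        System.out.println("F-True = " + eval.fMeasure(trueIndex));

        // At snippet-generation time: order candidate sentences by P(true) and keep the top one.
        double bestConfidence = -1;
        Instance best = null;
        for (int i = 0; i < data.numInstances(); i++) {
            double pTrue = classifier.distributionForInstance(data.instance(i))[trueIndex];
            if (pTrue > bestConfidence) {
                bestConfidence = pTrue;
                best = data.instance(i);
            }
        }
        System.out.println("Selected sentence features: " + best);
    }
}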
4.1 Classification algorithms
The selected classifier algorithms are: Logistic (Regression), Naive Bayes, J48,
RBFClassifier and SVM (LIBSVM). A brief description of each of these classifiers
is provided below:
Logistic (Regression) : Regression classifier models are used to predict the probability of occurrence of an event by trying to fit the data to a logistic (linear,
polynomial, etc.) curve in a multidimensional space. Based on the location of the
data point in the multidimensional space (relative to the logistic curve), a label
is assigned to it. For example, in the case of linear regression, all points lying on
one side of a line in the 2-dimensional space are given one label, and those on the
other side are given another label. Logistic regression is mainly used when there
are two classes of data (where the classes can be separated by a line in the point
space), but multinomial versions also exist [7].
The Naive Bayes Classifier technique is based on the so-called Bayesian theorem
and is particularly suited when the dimensionality of the inputs is high. Despite
its simplicity, Naive Bayes can often outperform more sophisticated classification
methods. The Naive Bayes classifier keeps the estimation tractable by making a conditional independence assumption that dramatically reduces the number of parameters to be estimated when modeling P(X|Y) [4].
C4.5 is an algorithm used to generate a decision tree developed by Ross Quinlan.
It is descended from an earlier system called ID3 and is followed in turn by
C5.0. C4.5 builds decision trees from a set of training data using the concept
of information entropy. J48 is an open source Java implementation of the C4.5
algorithm in the Weka data mining tool [6].
Radial basis function networks were introduced into the neural network literature
by Broomhead and Lowe (1988). The RBF network model is motivated by the
locally tuned response observed in biological neurons. The theoretical basis of the
RBF approach lies in the field of interpolation of multivariate functions.
Support Vector Machines (SVMs) are a popular machine learning method for
classification, regression, and other learning tasks. They have been shown to be
highly effective at traditional text categorization. They are large-margin, rather than probabilistic, classifiers; the basic idea behind the training procedure is to find a hyperplane, represented by a vector, that not only separates the document vectors in one class from those in the other, but for which the separation, or margin, is as large as possible. This search corresponds to a constrained optimization problem. LIBSVM is an open source SVM library that can be used from within the Weka data mining tool [5].
5 Unbalanced classes
We notice that the "true" class (i.e. the is-relevant class) is far less present in our two datasets than the "false" class (see table 1). We therefore decided to apply a pre-processing filter that over-samples the smaller class in order to overcome the unbalanced-classes problem. We used the SMOTE [1] oversampling filter to obtain roughly 50% for each class.
Class          | Main-sentence dataset | Concept-sentence dataset
"true" class   | 9.84%                 | 13.4%
"false" class  | 90.16%                | 86.6%
Table 1. Unbalanced classes
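Below is a sketch of how the SMOTE filter can be applied in Weka before training (it assumes the optional SMOTE filter package is installed; the oversampling percentage needed to reach roughly 50/50 depends on the original class ratio and is only an example value here):

import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.supervised.instance.SMOTE;

/** Sketch: oversample the minority ("true") class with SMOTE before training. */
public class SmoteSketch {

    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("concept-sentence.arff");
        data.setClassIndex(data.numAttributes() - 1);

        SMOTE smote = new SMOTE();
        // Percentage of synthetic minority instances to create;
        // tune it so that both classes end up at about 50% each.
        smote.setPercentage(500.0);
        smote.setInputFormat(data);

        Instances balanced = Filter.useFilter(data, smote);
        System.out.println("Before: " + data.numInstances()
                + " instances, after: " + balanced.numInstances());
    }
}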
We carried out a classification study using the classifiers listed above, applying 10-fold cross-validation to measure the prediction performance of each of them (see the next section for the results).
6 Experimental Results
The next tables (figure 4) show the performance results of multiple classifiers over our two datasets. The F-True measure is the F-measure of the "true" class, i.e. the harmonic mean of precision and recall over this class: F-True = 2 * Precision * Recall / (Precision + Recall).
In these experiments we use all features; the best performance was obtained using Naive Bayes for the main-sentence dataset, and J48 for the concept-sentence dataset. As we believe that we can obtain better results (or at least the same results) using fewer features, we carried out a feature selection study, presented in the next section.
Fig. 4. Results without feature selection
7 Feature selection
As mentioned in section 3, we generated 23 features for the main-sentence dataset and 30 features for the concept-sentence dataset. Considering this large number of features, we decided to carry out a feature selection study, whose main objective is to improve the prediction performance and to provide faster and more cost-effective prediction by selecting a subset of features. This selection not only improves the performance, but also makes the prediction more robust against overfitting. Feature selection methods divide into filters and wrappers.
7.1 Filters approach
The methods of this approach select a subset of features as a pre-processing step, independently of the chosen predictor: they score and rank each feature on a particular metric, and then we select the top k features as the new machine learning features. The best-known method is the information gain method, which uses the difference in information entropy:
InfoGain(Class, Attribute) = H(Class) − H(Class|Attribute)
Filters can be used to reduce the dimensionality of the space and to limit overfitting; they are faster than wrappers and provide a generic selection.
On the other hand, filter methods have two main weaknesses: (1) they do not consider feature redundancy, i.e. two redundant features will have the same score and will be selected (or not) together; (2) they do not detect dependencies between features.
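For illustration, here is a minimal sketch of InfoGain ranking with Weka's attribute-selection API (the file name is an assumption; in our study the cut is made at the "nosedive" drop rather than at a fixed number of features):

import weka.attributeSelection.AttributeSelection;
import weka.attributeSelection.InfoGainAttributeEval;
import weka.attributeSelection.Ranker;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

/** Sketch: rank features by information gain and list them in decreasing order. */
public class InfoGainSketch {

    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("main-sentence.arff");
        data.setClassIndex(data.numAttributes() - 1);

        AttributeSelection selection = new AttributeSelection();
        selection.setEvaluator(new InfoGainAttributeEval());
        selection.setSearch(new Ranker()); // ranks all attributes by InfoGain score
        selection.SelectAttributes(data);

        // Print the ranked attribute names, skipping the class attribute.
        for (int index : selection.selectedAttributes()) {
            if (index != data.classIndex()) {
                System.out.println(data.attribute(index).name());
            }
        }
    }
}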
7.2 Wrappers approach
This type of method uses the learning machine itself to score subsets of features; the score used is the prediction performance of each subset. The main idea is to build a space of all possible combinations of features and to search this space using the prediction performance as a guide. Many search strategies can be used (best-first, branch-and-bound, ...). Validation is usually done by cross-validation in order to assess the selected subset.
Wrapper methods come in two types: (1) forward selection, where we start with a small subset of features and progressively add more; (2) backward elimination, where we start with all features in the same set and then eliminate the weak ones. Wrappers are better suited to selecting features in the context of the other features. Their main drawback is performance: since the feature space (all possible subsets) is huge, most methods search for a locally optimal solution using a greedy algorithm.
There is a special type of wrappers, the embedded methods, which integrate feature selection into the training phase.
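For illustration, here is a sketch of a backward wrapper search with Weka's attribute-selection API; it uses WrapperSubsetEval with Naive Bayes as one possible configuration, not necessarily the exact evaluator (ClassifierSubsetEvaluator) and options used in our study (section 7.4):

import weka.attributeSelection.AttributeSelection;
import weka.attributeSelection.GreedyStepwise;
import weka.attributeSelection.WrapperSubsetEval;
import weka.classifiers.bayes.NaiveBayes;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

/** Sketch: backward wrapper feature selection guided by a classifier's CV performance. */
public class WrapperSelectionSketch {

    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("main-sentence.arff");
        data.setClassIndex(data.numAttributes() - 1);

        // The wrapper evaluates each candidate subset by cross-validating a classifier on it.
        WrapperSubsetEval evaluator = new WrapperSubsetEval();
        evaluator.setClassifier(new NaiveBayes());
        evaluator.setFolds(10);

        // Greedy search starting from the full feature set and removing weak features.
        GreedyStepwise search = new GreedyStepwise();
        search.setSearchBackwards(true);

        AttributeSelection selection = new AttributeSelection();
        selection.setEvaluator(evaluator);
        selection.setSearch(search);
        selection.SelectAttributes(data);

        for (int index : selection.selectedAttributes()) {
            if (index != data.classIndex()) {
                System.out.println(data.attribute(index).name());
            }
        }
    }
}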
7.3 How many features?
With wrapper methods, there is no need to define the number of features: these methods search the solution space for the best combination of features, i.e. the best feature subset, with no predefined size for this subset. On the other hand, filter methods such as InfoGain rank the features but do not propose an ideal subset, leaving this task to the user.
7.4 Our strategy
In order to select the most effective feature selection method, we carried out the following experiments, applying the different methods and taking multiple parameters into account:
1. The features selection method (filters, wrappers)
2. For filters, we use the InfoGain method with a "nosedive" threshold, i.e. a big drop in values between two successive ranks (see the sketch after this list).
3. For wrappers, we use ClassifierSubsetEvaluator as the feature selector, with multiple machine learning algorithms (one per run):
– J48
– NaiveBayes
– Logistic regression
– SVM
– RBFClassifier
4. For wrappers, there are two modes of applying the methods (forward and backward). In this study we use the backward mode, except when we combine a wrapper with a filter; in that case we use the forward mode with a start subset defined by the InfoGain filter.
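The "nosedive" cut mentioned in point 2 can be computed very simply; the sketch below (our own illustrative code, with toy scores) keeps the features ranked before the largest drop between two successive InfoGain scores:

/** Sketch: cut a ranked list of InfoGain scores at the largest drop between two successive ranks. */
public class NosediveCut {

    /** Returns how many top-ranked features to keep, given scores sorted in decreasing order. */
    static int cutPoint(double[] rankedScores) {
        int cut = rankedScores.length;
        double biggestDrop = 0.0;
        for (int i = 1; i < rankedScores.length; i++) {
            double drop = rankedScores[i - 1] - rankedScores[i];
            if (drop > biggestDrop) {
                biggestDrop = drop;
                cut = i; // keep features 0 .. i-1
            }
        }
        return cut;
    }

    public static void main(String[] args) {
        // Toy InfoGain scores in decreasing order; the nosedive is between 0.21 and 0.04.
        double[] scores = {0.35, 0.31, 0.27, 0.21, 0.04, 0.03, 0.01};
        System.out.println("Keep the top " + cutPoint(scores) + " features.");
    }
}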
PS: when we say InfoGain(top), we mean applying the InfoGain filter and then taking the top features, cutting at the point where there is a nosedive drop in the ranking score. The InfoGain(top) cut for the first (main-sentence) dataset was at the 8th feature, and at the 11th feature for the second (concept-sentence) dataset. The study results are shown in the next tables:
Fig. 5. Main-sentence feature selection study results (values represent the F-True)
Fig. 6. Concept-sentence feature selection study results (values represent the F-True)
7.5 Discussion
In figure 5 (i.e. feature selection over the main-sentence dataset), Naive Bayes gives the best results (0.927 F-True); applying the InfoGain filter reduces the number of used features from 22 to 8 and keeps the same performance.
We tried to enhance the feature selection by combining the two methods (filters and wrappers): we use the InfoGain filter to select a small (base) subset of features, then use these features as the start subset in a forward wrapper. We tried different subset sizes; the results (figure 7) show that we can get the same prediction performance using fewer features, but this methodology cannot achieve better results than applying InfoGain alone (the same performance with only 8 features).
Fig. 7. Main-sentence feature selection: results of combining filters and wrappers
The 8 selected features are:
– All features related to the query: (1) QT, sentence’s terms shared with the
query, (2) QR, sentence’s resources shared with the query, (3) QMR, sentence’s top-k resources shared with the query.
– Features related to the resources in this sentence: (4) MR, the number of
top-k resources, (5) AR, the number of annotated resources.
– Features related to the labels of the resources: (6) ARL, shared terms between this sentence and the labels of the annotated resources, (7) MRL,
shared terms between this sentence and the labels of the top-k resources.
– Features related to the web page: (8) TT, shared terms between this sentence
and the page’s title.
PS1: the 13 features selected by InfoGain followed by a wrapper (over a 6-feature start subset) are the same as the 8 features listed above, but with MR replaced by: a) features related to the LDSVD results: CS, the average resource score returned by the LDSVD algorithm; b) features related to the sentence's text: LEN, the length of the sentence; SS, the starting character; SE, the ending character; HL, whether the sentence contains links or not; and finally AS, the average annotation score.
PS2: it is interesting that we can obtain good results (0.807 F-True) using only 3 features: (1) QT, the terms shared between the sentence and the query; (2) HL, whether the sentence contains links or not; (3) D, whether the sentence contains a date or not.
In figure 8 (i.e. feature selection over the concept-sentence dataset), J48 gives the best results (0.883 F-True); applying the InfoGain filter reduces the number of used features from 29 to 11 and gives slightly better performance (0.886).
Fig. 8. Concept-sentence feature selection: results of combining filters and wrappers
As for the first dataset, we tried to enhance the feature selection by combining the two methods (filters and wrappers): we used the InfoGain filter to select a small (base) subset of features, then used these features as the start subset in the wrapper, trying different subset sizes. The results (figure 8) show that we can get better prediction performance using fewer features. We can thus use InfoGain to select 10 features as the start subset in the wrapper, which gives us a subset of 17 features; this subset produces better results.
The results obtained using a 6-feature start subset are also good, with the advantage of using fewer features.
The 17 selected features are (see section 3 for the feature descriptions):
– Features related to the query: (1) QT, sentence’s terms shared with the
query, (2) QR, sentence’s resources shared with the query.
– Features related to the resources in this sentence: (3) MR, the number of
top-k resources, (4) AR, the number of annotated resources.
– Features related to the labels of the resources: (5) ARL, shared terms between this sentence and the labels of the annotated resources, (6) MRL,
shared terms between this sentence and the labels of the top-k resources.
– Features related to the LDSVD results: (7) CSC, the resource's score returned by the LDSVD algorithm.
– Features related to the second algorithm: (8) CB, (9) CBL.
– Features related to the selected resource: (10) CA, (11) CP.
– Textual features: (12) LEN, (13) HL, (14) NCH, (15) SSS.
– And others: (16) PT, (17) TT.
PS1: the 13 selected features (using an InfoGain start subset of 6 features) are: TCT, CB, CBL, CP, AR, ARL, MRL, CAS, LEN, TT, CSC, NCH, PHP.
References
1. Chawla, N.V., Bowyer, K.W., Hall, L.O., Kegelmeyer, W.P.: Smote: synthetic minority over-sampling technique. arXiv preprint arXiv:1106.1813 (2011)
2. Kohlschütter, C., Fankhauser, P., Nejdl, W.: Boilerplate detection using shallow
text features. In: Proceedings of the third ACM international conference on Web
search and data mining. pp. 441–450. ACM (2010)
3. Mendes, P.N., Jakob, M., García-Silva, A., Bizer, C.: Dbpedia spotlight: shedding
light on the web of documents. In: Proceedings of the 7th International Conference
on Semantic Systems. pp. 1–8. I-Semantics ’11, ACM (2011)
4. Mitchell, T.M.: Machine Learning. McGraw Hill, Burr Ridge, IL (1997)
5. Pang, B., Lee, L., Vaithyanathan, S.: Thumbs up?: sentiment classification using
machine learning techniques. In: Proceedings of the ACL-02 conference on Empirical methods in natural language processing-Volume 10. pp. 79–86. Association for
Computational Linguistics (2002)
6. Quinlan, J.R.: C4.5: Programs for machine learning, vol. 1. Morgan Kaufmann (1993)
7. Witten, I.H., Frank, E.: Data Mining: Practical machine learning tools and techniques. Morgan Kaufmann (2005)