ArnetMiner – Extraction and Mining of Academic Social Networks
Jie Tang (1), Jing Zhang (1), Limin Yao (1), Juanzi Li (1), Li Zhang (2), and Zhong Su (2)
(1) Knowledge Engineering Group, Dept. of Computer Science and Technology, Tsinghua University
(2) IBM, China Research Lab
August 25th 2008
1
Motivation
“The information need is not only about publications…”
“Academic search is treated as document search, but ignores semantics”
2
Examples – Expertise search
• When starting work on a new research topic;
• Or brainstorming for novel ideas.
Researcher A
• Who are experts in this field?
• What are the top conferences in
the field?
• What are the best papers?
• What are the top research labs?
3
Examples – Citation network analysis
Researcher B
• an in-depth understanding of the research field?
[Figure: citation network of papers on inverted indexes and text retrieval. Each paper is tagged with a topic and each citation edge with a relationship type.]
Papers shown: An Inverted Index Implementation; Introduction of Modern Information Retrieval; Filtered Document Retrieval with Frequency-Sorted Indexes; Parameterised Compression for Sparse Bitmaps; Memory Efficient Ranking; Signature files: An access Method for Documents and its Analytical Performance Evaluation; Self-Indexing Inverted Files for Fast Text Retrieval; A Document-centric Approach to Static Index Pruning in Text Retrieval Systems; Vector-space Ranking with Effective Early Termination; Efficient Document Retrieval in Main Memory; Static Index Pruning for Information Retrieval Systems
Topics: Topic 31: Ranking and Inverted Index; Topic 1: Theory; Topic 27: Information retrieval; Topic 23: Index method; Topic 21: Framework; Topic 34: Parallel computing; Topic 22: Compression; Other
Citation Relationship Type: Basic theory; Comparable work; Other
4
Examples – Conference Suggestion
authors
Which conference should we submit the paper to?
Researcher C
content
5
Examples – Reviewer Suggestion
KDD Committee
Who are the best matching reviewers for each paper?
conference
Paper content
6
Topic Browser
7
2
1
8
Academic Network Extraction
in ArnetMiner
= Researcher Profiling
+ Name Disambiguation
9
Motivating Example
[Figure: a researcher homepage and a DBLP publication list (left) are mapped to a structured researcher profile (right).]

Homepage text, with annotated blocks:
Contact Information
Ruud Bolle
Office: 1S-D58
Letters: IBM T.J. Watson Research Center, P.O. Box 704, Yorktown Heights, NY 10598 USA
Packages: IBM T.J. Watson Research Center, 19 Skyline Drive, Hawthorne, NY 10532 USA
Email: [email protected]
Educational history
Ruud M. Bolle was born in Voorburg, The Netherlands. He received the Bachelor's Degree in Analog Electronics in 1977 and the Master's Degree in Electrical Engineering in 1980, both from Delft University of Technology, Delft, The Netherlands. In 1983 he received the Master's Degree in Applied Mathematics and in 1984 the Ph.D. in Electrical Engineering from Brown University, Providence, Rhode Island. In 1984 he became a Research Staff Member at the IBM Thomas J. Watson Research Center in the Artificial Intelligence Department of the Computer Science Department. In 1988 he became manager of the newly formed Exploratory Computer Vision Group which is part of the Math Sciences Department.
Research_Interest
Currently, his research interests are focused on video database indexing, video processing, visual human-computer interaction and biometrics applications.
Academic services
Ruud M. Bolle is a Fellow of the IEEE and the AIPR. He is Area Editor of Computer Vision and Image Understanding and Associate Editor of Pattern Recognition. Ruud M. Bolle is a Member of the IBM Academy of Technology.

Publications
DBLP: Ruud Bolle
2006
50 EE Nalini K. Ratha, Jonathan Connell, Ruud M. Bolle, Sharat Chikkerur: Cancelable Biometrics: A Case Study in Fingerprints. ICPR (4) 2006: 370-373
49 EE Sharat Chikkerur, Sharath Pankanti, Alan Jea, Nalini K. Ratha, Ruud M. Bolle: Fingerprint Representation Using Localized Texture Features. ICPR (4) 2006: 521-524
48 EE Andrew Senior, Arun Hampapur, Ying-li Tian, Lisa Brown, Sharath Pankanti, Ruud M. Bolle: Appearance models for occlusion handling. Image Vision Comput. 24(11): 1233-1243 (2006)
2005
47 EE Ruud M. Bolle, Jonathan H. Connell, Sharath Pankanti, Nalini K. Ratha, Andrew W. Senior: The Relation between the ROC Curve and the CMC. AutoID 2005: 15-20
46 EE Sharat Chikkerur, Venu Govindaraju, Sharath Pankanti, Ruud M. Bolle, Nalini K. Ratha: Novel Approaches for Minutiae Verification in Fingerprint Images. WACV. 2005: 111-116
...

Extracted profile:
Name: Ruud Bolle
Photo
Homepage: http://researchweb.watson.ibm.com/ecvg/people/bolle.html
Position: Research Staff
Affiliation: IBM T.J. Watson Research Center
Address: IBM T.J. Watson Research Center, P.O. Box 704, Yorktown Heights, NY 10598 USA
Address: IBM T.J. Watson Research Center, 19 Skyline Drive, Hawthorne, NY 10532 USA
Email: [email protected]
Phddate: 1984
Phduniv: Brown University
Phdmajor: Electrical Engineering
Msdate: 1980
Msuniv: Delft University of Technology
Msmajor: Electrical Engineering
Msmajor: Applied Mathematics
Bsdate: 1977
Bsuniv: Delft University of Technology
Bsmajor: Analog Electronics
Research_Interest: video database indexing; video processing; visual human-computer interaction; biometrics applications
Publication 1#: Title: Cancelable Biometrics: A Case Study in Fingerprints; Venue: ICPR; Date: 2006; Start_page: 370; End_page: 373
Publication 2#: Title: Fingerprint Representation Using Localized Texture Features; Venue: ICPR; Date: 2006; Start_page: 521; End_page: 524
Publication #3, Publication #5, ...
Co-author nodes are linked through coauthor edges and carry their own attributes (e.g. affiliation: UIUC; position: Professor)
10
Motivating Example
Two key issues:
• How to accurately extract the researcher profile information from the Web?
• How to integrate the information from different sources?
11
Researcher Network Extraction
[Figure: schema of the researcher network. Researcher: Name, Person Photo, Homepage, Position, Affiliation, Address, Email, Phone, Fax, Research_Interest, Bsdate, Bsuniv, Bsmajor, Msdate, Msuniv, Msmajor, Phddate, Phduniv, Phdmajor. Publication: Title, Publication_venue, Date, Start_page, End_page. Relations: Authored, Coauthor.]
70.60% of the researchers have at least one homepage/introducing page
• 71.9% are homepages; 28.1% are introducing pages
• 85.6% from universities; 14.4% from companies
• 40% are in lists and tables; 60% are natural language text
There are a large number of person names having the ambiguity problem
• 300 most common male names are used by 1 billion+ people (78.74%) in USA
• Even 3 “Yi Li” graduated from the author’s lab
• 70% moved at least one time
12
Our Approach Picture
– based on Markov Random Field
Markov Property:
$P(Y_i \mid Y_j, Y_j \neq Y_i) = P(Y_i \mid Y_j, Y_j \sim Y_i)$
Special cases:
- Conditional Random Fields (Researcher Profiling)
- Hidden Markov Random Fields (Name Disambiguation)
[Figure: a Markov random field over nodes Ya to Yf, and a hidden Markov random field in which hidden labels y1 to y11 (e.g. y1=1, y4=2, y9=3) are connected by coauthor, cite, co-conference and τ-coauthor edges, each label emitting an observation x1 to x11.]
13
CRFs
- Green nodes are hidden vars, - Purple nodes are observations
[Figure: linear-chain CRF over the tokens "He is a Professor at UIUC"; each token has candidate labels ADR, AFF, POS, OTH, …]
$$p(y \mid x) = \frac{1}{Z(x)} \exp\Big( \sum_{e \in E,\, j} \lambda_j\, t_j(e, y|_e, x) + \sum_{v \in V,\, k} \mu_k\, s_k(v, y|_v, x) \Big)$$
14
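The CRF probability can be illustrated on the slide's example sentence with a brute-force sketch. The label subset, the feature weights and the helper names (`score`, `crf_probability`, `best_labeling`) are all hypothetical choices made for illustration; a real profiler would learn the weights from labeled pages and use dynamic programming rather than enumerating every labeling.

```python
import itertools
import math

# Hypothetical label subset; the slide also shows ADR and others.
LABELS = ["AFF", "POS", "OTH"]

def score(tokens, labels, trans_w, state_w):
    """Unnormalised log-score: weighted vertex features plus weighted edge features."""
    total = 0.0
    for i, (tok, lab) in enumerate(zip(tokens, labels)):
        total += state_w.get((tok, lab), 0.0)                # mu_k * s_k(v, y|v, x)
        if i > 0:
            total += trans_w.get((labels[i - 1], lab), 0.0)  # lambda_j * t_j(e, y|e, x)
    return total

def crf_probability(tokens, labels, trans_w, state_w):
    """p(y|x) = exp(score(y)) / Z(x), where Z(x) sums over every labeling."""
    z = sum(math.exp(score(tokens, cand, trans_w, state_w))
            for cand in itertools.product(LABELS, repeat=len(tokens)))
    return math.exp(score(tokens, labels, trans_w, state_w)) / z

def best_labeling(tokens, trans_w, state_w):
    """Exhaustive search for the highest-scoring labeling (fine for short inputs)."""
    return max(itertools.product(LABELS, repeat=len(tokens)),
               key=lambda cand: score(tokens, cand, trans_w, state_w))

tokens = ["He", "is", "a", "Professor", "at", "UIUC"]
# Hand-picked illustrative weights, not learned values.
state_w = {("Professor", "POS"): 2.0, ("UIUC", "AFF"): 2.0,
           ("He", "OTH"): 1.0, ("is", "OTH"): 1.0,
           ("a", "OTH"): 1.0, ("at", "OTH"): 1.0}
trans_w = {("OTH", "POS"): 0.5, ("OTH", "AFF"): 0.5, ("OTH", "OTH"): 0.5}

print(best_labeling(tokens, trans_w, state_w))
# expected: ('OTH', 'OTH', 'OTH', 'POS', 'OTH', 'AFF')
```

Dropping `trans_w` from the score corresponds to the "without transition features" setting, in which each token is labeled independently of its neighbours.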
Token Definitions
Standard word: Words in natural language
Special word: Including several general ‘special words’, e.g. email address, IP address, URL, date, number, money, percentage, unnecessary tokens (e.g. ‘===’ and ‘###’), etc.
Image token: <IMAGE src="defaul3.jpg" alt=""/>
Term: base NP, like “Computer Science”
Punctuation marks: Including period, question mark, and exclamation mark
15
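A minimal sketch of how tokens might be assigned to these types, assuming simple regular expressions. The patterns and the `token_type` helper are illustrative assumptions, not the system's actual rules; dates are not covered, and the "Term" type is omitted because detecting base noun phrases needs a chunker.

```python
import re

# Hypothetical patterns for a few of the 'special word' kinds listed on the slide.
SPECIAL_PATTERNS = [
    ("email", re.compile(r"^[\w.+-]+@[\w-]+(\.[\w-]+)+$")),
    ("url", re.compile(r"^(https?://|www\.)\S+$", re.IGNORECASE)),
    ("ip", re.compile(r"^\d{1,3}(\.\d{1,3}){3}$")),
    ("percentage", re.compile(r"^\d+(\.\d+)?%$")),
    ("money", re.compile(r"^[$€£]\d+(\.\d+)?$")),
    ("number", re.compile(r"^\d+(\.\d+)?$")),
    ("unnecessary", re.compile(r"^([=#*_-])\1{2,}$")),  # e.g. '===' or '###'
]
IMAGE_RE = re.compile(r"^<IMAGE\b[^>]*>$", re.IGNORECASE)
PUNCTUATION = {".", "?", "!"}

def token_type(token):
    """Return a coarse token type; the first matching rule wins."""
    if IMAGE_RE.match(token):
        return "image"
    if token in PUNCTUATION:
        return "punctuation"
    for name, pattern in SPECIAL_PATTERNS:
        if pattern.match(token):
            return "special:" + name
    if token.isalpha():
        return "standard"
    return "other"

for tok in ["Professor", "===", "http://arnetminer.org", '<IMAGE src="defaul3.jpg" alt=""/>']:
    print(tok, "->", token_type(tok))
```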
Feature Definition
• Content features
Standard Word
  Word features: Whether the current token is a word
  Morphological features: Whether the word is capitalized
Image Token
  Image size: The size of the image
  Image height/width ratio: The value of height/width. The value of a person photo is often larger than 1
  Image format: JPG or BMP
  Image color: The number of the “unique color” used in the image and the number of bits used per pixel, i.e. 32, 24, 16, 8, 1
  Face recognition: Whether the current image contains a person face
  Image filename: Whether the filename contains (partially) the researcher name
  Image “ALT”: Whether the “alt” of the image contains (partially) the researcher name
  Image positive keywords: “myself”, “biology”
  Image negative keywords: “ads”, “banner”, “logo”
16
Profiling Experiments
• Dataset
– 1,000 researchers from ArnetMiner.org
• Baseline
– Amilcare
– Support Vector Machines
– Unified_NT (CRFs without transition features)
• Evaluation measures
– Precision, Recall, F1
17
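The evaluation measures listed above can be computed as in the following sketch. It assumes that extracted and gold-standard profile entries are compared as exact (attribute, value) pairs; the matching criterion actually used in the experiments is not stated on the slide, and the example pairs are made up.

```python
def precision_recall_f1(predicted, gold):
    """Precision, recall and F1 over sets of (attribute, value) pairs."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Illustrative data only: one wrong Phduniv and one missed Bsdate.
gold = {("Position", "Research Staff"), ("Affiliation", "IBM T.J. Watson Research Center"),
        ("Phduniv", "Brown University"), ("Bsdate", "1977")}
predicted = {("Position", "Research Staff"), ("Affiliation", "IBM T.J. Watson Research Center"),
             ("Phduniv", "Delft University of Technology")}

p, r, f1 = precision_recall_f1(predicted, gold)
print(f"P={p:.4f} R={r:.4f} F1={f1:.4f}")  # P=0.6667 R=0.5000 F1=0.5714
```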
Profiling Results – 5-fold cross validation
Profiling Task   Unified   Unified_NT   SVM     Amilcare
Photo            89.11     88.64        88.86   31.62
Position         69.44     64.70        64.68   56.48
Affiliation      83.52     72.16        73.86   46.65
Phone            91.10     78.72        79.71   83.33
Fax              90.83     64.28        64.17   86.88
Email            80.35     75.47        79.37   78.70
Address          86.34     75.15        77.04   66.24
Bsuniv           67.38     57.56        59.54   47.17
Bsmajor          64.20     59.18        60.75   58.67
Bsdate           53.49     40.59        28.49   52.34
Msuniv           57.55     47.49        49.78   45.00
Msmajor          63.35     61.92        62.10   57.14
Msdate           48.96     41.27        30.07   56.00
Phduniv          63.73     53.11        57.01   59.42
Phdmajor         67.92     59.30        59.67   57.93
Phddate          57.75     42.49        41.44   61.19
Overall          83.37     72.09        73.57   62.30
18
19
Name Disambiguation
Name: Jing Zhang (26)
Affiliation: Shanghai Jiao Tong Univ.; Yunnan Univ.; Tsinghua Univ.; Alabama Univ.; Univ. of California, Davis; Carnegie Mellon University; Henan Institute of Education
20
Our Method to Name Disambiguation
Proposal of a semi-supervised framework
• A hidden Markov Random Field model
• Hidden Variables Y represent the labels of publications
• Observable Variables X represent publications
• Paper relationships define the dependencies over hidden variables
[Figure: hidden Markov random field with hidden labels y1 to y11 (e.g. y1=1, y4=2, y9=3) connected by coauthor, cite, co-conference and τ-coauthor edges; each label y_i is attached to an observed publication x_i.]
21
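Objective Function
$$\text{maximize } P(Y \mid X) \propto P(Y)\, P(X \mid Y)$$
$$P(Y) = \frac{1}{Z_1} \exp(-V(Y)) = \frac{1}{Z_1} \exp\Big(-\sum_{N_i \in N} V_{N_i}(Y)\Big) = \frac{1}{Z_1} \exp\Big(-\sum_i \sum_j V(i, j)\Big)$$
$$\phantom{P(Y)} = \frac{1}{Z_1} \exp\Big(-\sum_i \sum_j \Big\{ D(x_i, x_j)\, I(y_i \neq y_j) \sum_{c_k \in C} [w_k c_k(y_i, y_j)] \Big\}\Big)$$
$$P(X \mid Y) = \frac{1}{Z_2} \exp\Big(-\sum_{x_i \in X} D(x_i, y_i)\Big)$$
Taking the negative logarithm of $P(Y)\, P(X \mid Y)$, with $Z = Z_1 Z_2$:
$$\text{minimize } f_{obj} = \sum_i \sum_j \Big\{ D(x_i, x_j)\, I(y_i \neq y_j) \sum_{c_k \in C} [w_k c_k(y_i, y_j)] \Big\} + \sum_{x_i \in X} D(x_i, y_i) + \log Z$$
22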
Relationship Definition
C    W    Relationship    Description
c1   w1   Co-Conference   pi.pubvenue = pj.pubvenue
c2   w2   CoAuthor        r, s > 0, ai(r) = aj(s)
c3   w3   Citation        pi cites pj or pj cites pi
c4   w4   Constraints     Feedbacks supplied by users
c5   w5   τ-CoAuthor      one common author in τ extension
Example for τ-CoAuthor:
p1: A, B, C
p2: A, B
p3: A, D
p4: C, D
[Figure: matrices Mp(0), Mp(1), Mp(2), Mp(3) over p1, p2, p3 illustrating the τ extension; the individual entries are not recoverable from the extracted text.]
23
Parameterized Distance Function
• We define the distance function as follows (Basu, 04):
$$D(x_i, x_j) = 1 - \frac{x_i^T A x_j}{\|x_i\|_A \|x_j\|_A}, \quad \text{where } \|x_i\|_A = \sqrt{x_i^T A x_i}$$
• We can see that $\|x_i\|_A$ actually maps each vector $x_i$ into another new space, i.e. $A^{1/2} x_i$
• To simplify our question, we define A as a diagonal matrix
24
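With a diagonal A, the distance reduces to a weighted cosine distance. The sketch below stores only the diagonal entries as a list `a`; the function names and the convention of returning 1.0 for a zero vector are assumptions made for the example.

```python
import math

def a_norm(x, a):
    """||x||_A = sqrt(x^T A x) for a diagonal matrix A with diagonal entries a."""
    return math.sqrt(sum(w * v * v for w, v in zip(a, x)))

def distance(xi, xj, a):
    """D(xi, xj) = 1 - (xi^T A xj) / (||xi||_A * ||xj||_A)."""
    dot = sum(w * u * v for w, u, v in zip(a, xi, xj))
    denom = a_norm(xi, a) * a_norm(xj, a)
    if denom == 0.0:
        return 1.0  # assumed convention: a zero vector is maximally distant
    return 1.0 - dot / denom

xi, xj = [1.0, 1.0], [1.0, 0.0]
print(distance(xi, xj, [1.0, 1.0]))  # plain cosine distance, about 0.293
print(distance(xi, xj, [4.0, 1.0]))  # up-weighting the shared dimension gives about 0.106
```

Raising the weight of a dimension that both vectors share pulls them closer, which is why fitting the diagonal of A adapts the metric to the data.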
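EM Framework
Here $l_i$ denotes the cluster label of $x_i$ and $y_h$ the centroid of cluster $h$.
• Initialization
  • use constraints to generate initial k clusters
• E-Step: assign each $x_i$ to the cluster $h$ that minimizes
$$f_{obj}(x_i, y_h) = \sum_j \Big\{ D(x_i, x_j)\, I(h \neq l_j) \sum_{c_k \in C} [w_k c_k(p_i, p_j)] \Big\} + D(x_i, y_h)$$
• M-Step
  • Update cluster centroid
$$y_h = \frac{\sum_{i: l_i = h} x_i}{\big\| \sum_{i: l_i = h} x_i \big\|_A}$$
  • Update parameter matrix A, using the gradient with respect to each diagonal entry $a_m$
$$\frac{\partial f_{obj}}{\partial a_m} = \sum_i \sum_j \Big\{ \frac{\partial D(x_i, x_j)}{\partial a_m}\, I(l_i \neq l_j) \sum_{c_k \in C} [w_k c_k(p_i, p_j)] \Big\} + \sum_{x_i \in X} \frac{\partial D(x_i, y_{l_i})}{\partial a_m}$$
$$\frac{\partial D(x_i, x_j)}{\partial a_m} = -\,\frac{x_{im} x_{jm} \|x_i\|_A \|x_j\|_A - x_i^T A x_j\, \dfrac{x_{im}^2 \|x_j\|_A^2 + x_{jm}^2 \|x_i\|_A^2}{2 \|x_i\|_A \|x_j\|_A}}{\|x_i\|_A^2 \|x_j\|_A^2}$$
25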
Disambiguation Experiments
• Data set:
Abbr. Name       #Publications   #Actual Person
Cheng Chang      12              3
Wen Gao          286             4
Yi Li            42              21
Jie Tang         21              2
Bin Yu           66              12
Gang Wu          40              16
Jing Zhang       54              25
Kuo Zhang        6               2
Hui Fang         15              3
Lei Wang         109             40
Rakesh Kumar     61              5
Michael Wagner   44              12
Bing Liu         130             11
Jim Smith        33              5
26
Our Approach vs. Baseline
                             Baseline (Tan, 2006)     Our Approach             Our Approach (w/o relation)
Data Set    Person Name      Prec.   Rec.    F1       Prec.   Rec.    F1       Prec.   Rec.    F1
Abbr. Name  Cheng Chang      100.0   100.0   100.0    100.0   100.0   100.0    72.73   64.00   68.09
            Wen Gao          96.60   62.64   76.00    99.29   98.59   98.94    96.17   33.53   49.72
            Yi Li            86.64   95.12   90.68    70.91   97.50   82.11    20.97   31.71   25.25
            Jie Tang         100.0   100.0   100.0    100.0   100.0   100.0    88.68   54.65   67.63
            Gang Wu          97.54   97.54   97.54    71.86   98.36   83.05    57.69   36.89   45.00
            Jing Zhang       85.0    69.86   76.69    83.91   100.0   91.25    10.79   20.55   14.15
            Kuo Zhang        100.0   100.0   100.0    100.0   100.0   100.0    66.67   40.00   50.00
            Hui Fang         100.0   100.0   100.0    100.0   100.0   100.0    64.71   70.97   67.69
Real Name   Bin Yu           67.22   50.25   57.51    86.53   53.00   65.74    26.50   23.25   24.77
            Lei Wang         68.45   41.12   51.38    88.64   89.06   88.85    24.13   25.16   24.63
            Rakesh Kumar     63.36   92.41   75.18    99.14   96.91   98.01    67.11   43.04   52.45
            Michael Wagner   18.35   60.26   28.13    85.19   76.16   80.42    36.89   50.33   42.57
            Bing Liu         84.88   43.16   57.22    88.25   86.49   87.36    66.14   19.51   30.13
            Jim Smith        92.43   86.80   89.53    95.81   93.56   94.67    60.94   59.39   60.16
Avg.                         82.89   78.51   80.64    90.68   92.12   91.39    54.29   40.93   46.67
27
Contribution of Relationships
[Figure: bar chart of Pre., Rec. and F1 (axis from 0.00 to 100.00) for five settings: w/o Relationship, +CoConference, +Citation, +CoAuthor, All.]
29
Distribution Analysis
(1) All methods can achieve
good performance
(2) Our method can achieve good
performance
(3) Our method obtains reasonable results, but still needs further improvement
30
Modeling the Academic Network
and Applications
31
The Academic Network
[Figure: example academic network linking persons (Dr. Tang, Limin, Prof. Wang, Prof. Li), papers (Association..., SVM..., Tree CRF..., EOS..., Semantic..., Annotation...) and venues (IJCAI, WWW, ISWC) through write, publish, cite, coauthor and PC member edges.]
Heterogeneous objects:
Paper, Person, Conf./Journal
Relationships:
•Conf./Journal publishes paper
•Paper cites paper
•Person writes paper
•Person is PC member of Conf./Journal
•Person is coauthor of person
Challenges:
- How to model the heterogeneous objects in a unified approach?
- How to apply the modeling approach to different applications?
32
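Modeling the Academic Network
[Figure: plate diagrams of three models, ACT1, ACT2 and ACT3, over words (w), authors (ad), conference (c), topic (z) and author assignment (x), with plates Nd, D, A (AC in ACT2) and T, and parameters α, β, θ, Φ, plus μ and ψ in ACT1 and η, σ2 in ACT3.]
33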
Generative Story of ACT1 Model
• Generative process
[Figure: the paper "Latent Dirichlet Co-clustering" by Shafiei and Milios, with conference stamps ICDM and NIPS. Each author has a distribution over topics (DM, ML, NLP, IR), and each topic has a conference distribution P(c|z) and a word distribution P(w|z).]
One topic: P(c|z): ICDM 0.23, KDD 0.19, …; P(w|z): mining 0.23, clustering 0.19, classification 0.17, …
Another topic: P(c|z): ICML 0.23, NIPS 0.19, …; P(w|z): model 0.23, learning 0.19, boost 0.17, …
Paper text: We present a generative model for clustering documents and terms. Our model is a four hierarchical bayesian model. We present efficient inference techniques based on Markov Chain Monte Carlo. We report results in document modeling, document and terms clustering …
34
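ACT Model 1
Generative process:
[Figure: ACT1 plate diagram with authors (ad), author assignment (x), topic (z), words (w) and conference (c) inside plates Nd and D; θ on plate A with prior α; Φ on plate T with prior β; ψ on plate T with prior μ.]
35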
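One way to read the ACT1 plate diagram is as the sampling procedure sketched below: for each word position, pick one of the paper's authors, draw a topic from that author's topic distribution, then draw a word and a conference stamp from that topic. This reading, the uniform choice of author, the θ values and the assignment of the slide's example probabilities to topics named "DM" and "ML" are assumptions made for illustration; the listed probabilities are truncated on the slide, so they are used here only as relative weights.

```python
import random

# theta: hypothetical author-topic distributions (not given on the slides).
THETA = {"Shafiei": {"DM": 0.7, "ML": 0.3}, "Milios": {"DM": 0.4, "ML": 0.6}}
# phi, psi: example entries from the slide, used as unnormalised weights.
PHI = {"DM": {"mining": 0.23, "clustering": 0.19, "classification": 0.17},
       "ML": {"model": 0.23, "learning": 0.19, "boost": 0.17}}
PSI = {"DM": {"ICDM": 0.23, "KDD": 0.19}, "ML": {"ICML": 0.23, "NIPS": 0.19}}

def draw(dist, rng):
    """Sample a key of dist with probability proportional to its weight."""
    keys = list(dist)
    return rng.choices(keys, weights=[dist[k] for k in keys])[0]

def generate_paper(authors, n_words, rng):
    """Return one (author x, topic z, word w, conference c) tuple per word position."""
    tokens = []
    for _ in range(n_words):
        x = rng.choice(authors)   # author responsible for this word
        z = draw(THETA[x], rng)   # topic drawn from the author's distribution
        w = draw(PHI[z], rng)     # word drawn from P(w|z)
        c = draw(PSI[z], rng)     # conference stamp drawn from P(c|z)
        tokens.append((x, z, w, c))
    return tokens

rng = random.Random(0)
for token in generate_paper(["Shafiei", "Milios"], 5, rng):
    print(token)
```

Inference runs this story in reverse, estimating θ, Φ and ψ from observed words, authors and conferences.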
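ACT Model 2
Generative process:
[Figure: ACT2 plate diagram with authors (ad), author assignment (x), topic (z), conference (c) and words (w) inside plates Nd and D; θ on plate AC with prior α; Φ on plate T with prior β.]
36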
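ACT Model 3
Generative process:
[Figure: ACT3 plate diagram with authors (ad), author assignment (x), topic (z) and words (w) inside plate Nd, and conference (c) inside plate D with parameters η, σ2; θ on plate A with prior α; Φ on plate T with prior β.]
37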
Applications
[Figure: the ACT1 plate diagram at the centre, surrounded by the applications it supports.]
Expertise search
Association search
Researcher interests
Hot topic on a conference
Topic browser
38
Expertise Search
• Calculate the relevance of query q and different objects (i.e., papers, authors, and conferences)
• E.g.,
$$P(q \mid d) = \prod_{w \in q} P(w \mid d)$$
$$P(w \mid d) = P_{LM}(w \mid d) \times P_{ACT}(w \mid d)$$
$$P_{LM}(w \mid d) = \frac{N_d}{N_d + \lambda} \cdot \frac{tf(w, d)}{N_d} + \Big(1 - \frac{N_d}{N_d + \lambda}\Big) \cdot \frac{tf(w, D)}{N_D}$$
$$P_{ACT}(w \mid d) = \sum_{z=1}^{T} \sum_{x=1}^{A_d} P(w \mid z, \phi_z)\, P(z \mid x, \theta_x)\, P(x \mid d)$$
39
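The smoothed language-model term and the query likelihood can be sketched as follows. The toy documents, the value of λ and the function names are illustrative assumptions; `p_act` is a hypothetical callable standing in for the topic-model term P_ACT(w|d), which would come from a trained ACT model, and is simply left out when it is `None`.

```python
from collections import Counter

def p_lm(word, doc_tokens, collection_tf, collection_len, lam):
    """Document language model smoothed with the collection model."""
    n_d = len(doc_tokens)
    weight = n_d / (n_d + lam)
    p_doc = Counter(doc_tokens)[word] / n_d if n_d else 0.0
    p_coll = collection_tf.get(word, 0) / collection_len
    return weight * p_doc + (1 - weight) * p_coll

def p_query(query, doc_tokens, collection_tf, collection_len, lam, p_act=None):
    """P(q|d) as a product over query words of P_LM(w|d), times P_ACT(w|d) if given."""
    prob = 1.0
    for w in query:
        p = p_lm(w, doc_tokens, collection_tf, collection_len, lam)
        if p_act is not None:
            p *= p_act(w, doc_tokens)
        prob *= p
    return prob

# Toy collection, for illustration only.
d1 = ["data", "mining", "clustering", "data"]
d2 = ["information", "retrieval", "index"]
collection_tf = Counter(d1 + d2)
collection_len = len(d1) + len(d2)
lam = 3.5  # assumed smoothing parameter (here the average document length)

print(p_lm("data", d1, collection_tf, collection_len, lam))  # about 0.4
print(p_query(["data", "mining"], d1, collection_tf, collection_len, lam))
print(p_query(["data", "mining"], d2, collection_tf, collection_len, lam))
```

Because of the collection term, a document that never mentions a query word still receives a small non-zero probability, so d2 is ranked below d1 rather than scored as zero.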
Expertise Search Results
Arnetminer data:
14,134 authors
10,716 papers
1,434 confs/journals
Evaluation measures:
pooled relevance
+ human judgement
Baselines:
- Language Model (LM)
- LDA
- Author Topic (AT)
40
ArnetMiner Today
42
ArnetMiner Today
* Arnetminer data:
> 0.5 M researcher profiles
> 2M papers
> 8M citation relationships
> 4K conferences
* Visits come from more than 165
countries
* Visits keep increasing by 20% per month
* Currently, more than 1,500 unique-IP visits per day.
43
Top 10 countries
1. USA
2. China
3. Germany
4. India
5. UK
6. Canada
7. Japan
8. France
9. Taiwan
10. Italy
Person Search
Basic Info.
Research Interests
Social Network
Publications
44
Expertise
Search
Finding experts,
expertise conferences,
and expertise papers
for “data mining”
45
Association Search
Finding associations
between persons
- high efficiency
- Top-K associations
Usage:
- to find a partner
- to find a person with
same interests
47
Survey Paper Finding
Survey papers
48
Topic Browser
200 topics have been
discovered automatically
from the academic network
49
Acknowledgements
• National Science Foundation of China (NSFC)
• National 985 Funding
• Chinese Young Faculty Research Funding
• Minnesota-China Collaboration Project
• IBM CRL
• Tsinghua-Google Joint Research Project
• National Foundation Science Research (973)
50
Thanks!
Q&A & Demo
HP: http://keg.cs.tsinghua.edu.cn/persons/tj/
Online URL: http://arnetminer.org
If you want to know more technical details,
please come to our poster session tomorrow night.
51