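Introduction to Chemoinformatics (Kagoshima University Graduate School of Medical and Dental Sciences)

Introduction to Chemoinformatics
Hiroto Saigo, Kyushu Institute of Technology
About me (Saigo)
•  1997-2001
   –  Bachelor's degree (Electrical and Electronics Engineering), Sophia University
•  2001-2006
   –  Ph.D. (Informatics), Kyoto University
•  2006-2008
   –  Researcher, Max Planck Institute for Biological Cybernetics, Germany
•  2008-2010
   –  Researcher, Max Planck Institute for Informatics, Germany
•  2010-
   –  Associate Professor, Kyushu Institute of Technology
Faculty of Computer Science and Systems Engineering, Kyushu Institute of Technology
View from Kyutech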
Examples of compounds
Caffeine
Aspirin
Oseltamivir
Sildenafil
Serotonin
The growing number of compounds and chemistry in the computer age
Flood of Information
•  18 million (2000)
[Figure: number of structures (0 to 30,000,000) plotted against year (1965 to 2000). ©Alexandre Varek]
4,000 publications / day?
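The role of computers in chemistry
•  Databases
   –  Storing and searching compound data
•  Screening
   –  Narrowing down to compounds with the desired properties, using mathematical formulas, artificial intelligence and similar tools
Representing compounds in the computer 1: MDL format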
The MDL file for aspirin
  -ISIS-  09270222202D
 13 13  0  0  0  0  0  0  0  0999 V2000
   -3.4639   -1.5375    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -3.4651   -2.3648    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -2.7503   -2.7777    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -2.0338   -2.3644    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -2.0367   -1.5338    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -2.7521   -1.1247    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -2.7545   -0.2997    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -2.0413    0.1149    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
   -3.4702    0.1107    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
   -1.3238   -1.1186    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
   -0.6125   -1.5292    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.6167   -2.3542    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
    0.1000   -1.1125    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
  1  2  2  0  0  0  0
  6  7  1  0  0  0  0
  3  4  2  0  0  0  0
  7  8  1  0  0  0  0
  7  9  2  0  0  0  0
  4  5  1  0  0  0  0
  5 10  1  0  0  0  0
  2  3  1  0  0  0  0
 10 11  1  0  0  0  0
  5  6  2  0  0  0  0
 11 12  2  0  0  0  0
  6  1  1  0  0  0  0
 11 13  1  0  0  0  0
M  END
Annotations in the figure:
•  Counts line: number of atoms (13) and number of bonds (13).
•  Atom block: the first three numbers are the x, y and z coordinates of the atom; the first atom is a carbon.
•  Bond block: the first bond is between atoms 1 and 2 and has order 2.
©Leach and Gillet, Springer 2007. Figure 1-3. The connection table for aspirin in the MDL format (hydrogen-suppressed form). The numbering of the atoms is as shown in the chemical diagram.
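As a rough illustration of how the counts line, atom block and bond block of the aspirin connection table can be read, here is a minimal Python sketch. The names `Atom`, `ASPIRIN_MOL` and `parse_molfile` are assumptions made for this example, and splitting atom and bond lines on whitespace is a simplification that happens to work for this small file; the MDL format itself is defined by fixed-width columns.

```python
from typing import List, NamedTuple, Tuple


class Atom(NamedTuple):
    x: float
    y: float
    z: float
    symbol: str


# Aspirin connection table (hydrogen-suppressed), as in Figure 1-3.
# The string starts with a blank title line, then the program line,
# then a blank comment line, then the counts line.
ASPIRIN_MOL = """
  -ISIS-  09270222202D

 13 13  0  0  0  0  0  0  0  0999 V2000
   -3.4639   -1.5375    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -3.4651   -2.3648    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -2.7503   -2.7777    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -2.0338   -2.3644    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -2.0367   -1.5338    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -2.7521   -1.1247    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -2.7545   -0.2997    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -2.0413    0.1149    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
   -3.4702    0.1107    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
   -1.3238   -1.1186    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
   -0.6125   -1.5292    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.6167   -2.3542    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
    0.1000   -1.1125    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
  1  2  2  0  0  0  0
  6  7  1  0  0  0  0
  3  4  2  0  0  0  0
  7  8  1  0  0  0  0
  7  9  2  0  0  0  0
  4  5  1  0  0  0  0
  5 10  1  0  0  0  0
  2  3  1  0  0  0  0
 10 11  1  0  0  0  0
  5  6  2  0  0  0  0
 11 12  2  0  0  0  0
  6  1  1  0  0  0  0
 11 13  1  0  0  0  0
M  END
"""


def parse_molfile(text: str) -> Tuple[List[Atom], List[Tuple[int, int, int]]]:
    """Return (atoms, bonds); each bond is (first atom, second atom, order)."""
    lines = text.splitlines()
    counts = lines[3]  # the fourth line of the file is the counts line
    n_atoms, n_bonds = int(counts[0:3]), int(counts[3:6])

    atoms = []
    for line in lines[4:4 + n_atoms]:
        x, y, z, symbol = line.split()[:4]
        atoms.append(Atom(float(x), float(y), float(z), symbol))

    bonds = []
    for line in lines[4 + n_atoms:4 + n_atoms + n_bonds]:
        first, second, order = (int(field) for field in line.split()[:3])
        bonds.append((first, second, order))
    return atoms, bonds


atoms, bonds = parse_molfile(ASPIRIN_MOL)
print(len(atoms), len(bonds))      # number of atoms, number of bonds
print(atoms[0].symbol, bonds[0])   # first atom symbol, first bond
```

The first two counts are taken by column position because the counts line can contain fused fields such as "0999"; a production parser would read every field by column.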
MDL file: basic information
•  The counts line ("13 13 0 ...") gives the number of atoms and the number of bonds; aspirin has 13 of each in hydrogen-suppressed form.
©Leach and Gillet, Springer 2007. Figure 1-3. The connection table for aspirin in the MDL format (hydrogen-suppressed form).
MDL file: atom information
•  Each line of the atom block starts with the x, y and z coordinates of the atom, followed by its element symbol; the first atom is a carbon.
©Leach and Gillet, Springer 2007. Figure 1-3. The connection table for aspirin in the MDL format (hydrogen-suppressed form).
MDL file: bond information
•  Each line of the bond block names the two bonded atoms and then the bond order; the first bond is between atoms 1 and 2 and has order 2.
©Leach and Gillet, Springer 2007. Figure 1-3. The connection table for aspirin in the MDL format (hydrogen-suppressed form).
Representing compounds in the computer 2: Fingerprint
[Figure: a query substructure and two database molecules, A and B, each shown with its bitstring.]
Figure 1-7. The bitstring representation of a query substructure is illustrated together with the corresponding bitstrings of two database molecules. Molecule A passes the screening stage since all the bits set to "1" in the query are also set in its bitstring. Molecule B, however, does not pass the screening stage since the bit representing the presence of O is set to "0" in its bitstring. ©Leach and Gillet, Springer 2007
Examples of fingerprints
Name       Size (bits)
Daylight   1028
Pubchem    881
MACCS      960
Pubchem fingerprint
PubChem Substructure Fingerprint V1.3, http://pubchem.ncbi.nlm.nih.gov
•  Section 1: Hierarchic Element Counts. These bits test for the presence or count of individual chemical atoms represented by their atomic symbol.
   –  Example: bit 0, ">= 4 H"
•  Section 2: Rings in a canonic Extended Smallest Set of Smallest Rings (ESSSR) ring set. These bits test for the presence or count of the described chemical ring system. An ESSSR ring is any ring which does not share three consecutive atoms with any other ring in the chemical structure. For example, naphthalene has three ESSSR rings (two phenyl fragments and the 10-membered envelope), while biphenyl will yield a count of only two ESSSR rings.
   –  Example: bit 115, ">= 1 any ring size 3"
•  Section 3: Simple atom pairs. These bits test for the presence of patterns of bonded atom pairs, regardless of bond order or count.
   –  Example: bit 263, "Li-H"
•  Section 4: Simple atom nearest neighbors. These bits test for the presence of atom nearest neighbor patterns, regardless of bond order (denoted by "~") or count, but where bond aromaticity (denoted by ":") is significant.
   –  Example: bit 327, "C(~Br)(~C)"
•  Section 5: Detailed atom neighborhoods. These bits test for the presence of detailed atom neighborhood patterns, regardless of count, but where bond orders are specific, bond aromaticity matches both single and double bonds, and where "-", "=", and "#" matches a single bond, double bond, and triple bond order, respectively.
   –  Example: bit 416, "C=C"
•  Section 6: Simple SMARTS patterns. These bits test for the presence of simple SMARTS patterns, regardless of count, but where bond orders are specific and bond aromaticity matches both single and double bonds.
   –  Example: bit 460, "C-C-C#C"
•  Section 7: Complex SMARTS patterns. These bits test for the presence of complex SMARTS patterns, regardless of count, but where bond orders and bond aromaticity are specific.
   –  Example: bit 713, "Cc1ccc(C)cc1"
https://pubchem.ncbi.nlm.nih.gov/help.html
MACCS keys
  1:('?',0), # ISOTOPE
  #2:('[#103,#104,#105,#106,#107,#106,#109,#110,#111,#112]',0), # ISOTOPE Not complete
  2:('[#103,#104]',0), # ISOTOPE Not complete
  3:('[Ge,As,Se,Sn,Sb,Te,Tl,Pb,Bi]',0), # Group IVa,Va,VIa Periods 4-6 (Ge...) *NOTE* spec wrong
  4:('[Ac,Th,Pa,U,Np,Pu,Am,Cm,Bk,Cf,Es,Fm,Md,No,Lr]',0), # actinide
  5:('[Sc,Ti,Y,Zr,Hf]',0), # Group IIIB,IVB (Sc...) *NOTE* spec wrong
  6:('[La,Ce,Pr,Nd,Pm,Sm,Eu,Gd,Tb,Dy,Ho,Er,Tm,Yb,Lu]',0), # Lanthanide
  7:('[V,Cr,Mn,Nb,Mo,Tc,Ta,W,Re]',0), # Group VB,VIB,VIIB (V...) *NOTE* spec wrong
  8:('[!#6;!#1]1~*~*~*~1',0), # QAAA@1
  9:('[Fe,Co,Ni,Ru,Rh,Pd,Os,Ir,Pt]',0), # Group VIII (Fe...)
  10:('[Be,Mg,Ca,Sr,Ba,Ra]',0), # Group IIa (Alkaline earth)
  11:('*1~*~*~*~1',0), # 4M Ring
  12:('[Cu,Zn,Ag,Cd,Au,Hg]',0), # Group IB,IIB (Cu..)
  13:('[#8]~[#7](~[#6])~[#6]',0), # ON(C)C
  14:('[#16]-[#16]',0), # S-S
  15:('[#8]~[#6](~[#8])~[#8]',0), # OC(O)O
Representing compounds in the computer 3: SMILES format
The simplest SMILES is probably that for methane: C. Note that all four attached hydrogens are implied. Ethane is CC, propane is CCC and 2-methyl propane is CC(C)C (note the branch point). Cyclohexane illustrates the use of ring closure integers; the SMILES is C1CCCCC1. Benzene is c1ccccc1 (note the use of lower case to indicate aromatic atoms). Acetic acid is CC(=O)O. The SMILES for a selection of more complex molecules are provided in Figure 1-4.
•  succinic acid: OC(=O)CCC(=O)O
•  cubane: C1(C2C3C14)C5C2C3C45
•  serotonin: NCCc1c[nH]c2ccc(O)cc12
•  trimethoprim: COc1cc(Cc2cnc(N)nc2N)cc(OC)c1OC
•  progesterone: CC(=O)C1CCC2C3CCC4=CC(=O)CCC4(C)C3CCC12C
Figure 1-4. Some examples of the SMILES strings for a variety of molecules.
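©Leach and Gillet, Springer 2007
Problems with the SMILES and MDL formats
•  SMILES representations of aspirin
   –  OC(=O)c1ccccc1OC(=O)C
   –  c1cccc(OC(=O)C)c1C(=O)O
•  Because the notation for a compound is not unique, comparing the strings directly cannot tell us whether two compounds are identical.
•  The MDL format has the same problem: the connection table differs depending on which atom it starts from.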
Unique naming of compounds (1)
•  The Morgan index method
   –  Assign a number to each atom (the figure uses its number of neighbouring atoms).
   –  Repeatedly replace each atom's number with the sum of its neighbours' numbers.
   –  Start the connection table from the atom with the largest number.
[Figure: the Morgan procedure applied to aspirin; the number of distinct atom values n is 3, 6, 8, 11 and 11 over successive iterations.]
©Leach and Gillet, Springer 2007 (Representation and Manipulation of 2D Molecular Structures)
Unique naming of compounds (2)
•  Definition by IUPAC (International Union of Pure and Applied Chemistry)
   –  Example for aspirin: 2-acetoxybenzoic acid
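Returning to the Morgan index method of naming slide (1), the iteration can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: the starting value of each atom is its number of neighbours, the loop stops as soon as the number of distinct values no longer grows, and the names `ASPIRIN_BONDS` and `morgan_values` are invented for this example. The bond list is the hydrogen-suppressed aspirin connection table from the MDL slides.

```python
from collections import defaultdict

# Bonds of hydrogen-suppressed aspirin (atom numbers as in the MDL file).
ASPIRIN_BONDS = [(1, 2), (6, 7), (3, 4), (7, 8), (7, 9), (4, 5), (5, 10),
                 (2, 3), (10, 11), (5, 6), (11, 12), (6, 1), (11, 13)]


def morgan_values(bonds):
    """Return (values, class_counts) for the Morgan iteration.

    values       : atom number -> value from the last iteration that still
                   increased the number of distinct values
    class_counts : number of distinct values after each iteration
    """
    neighbours = defaultdict(list)
    for a, b in bonds:
        neighbours[a].append(b)
        neighbours[b].append(a)

    # Start from the connectivity (number of neighbours) of each atom.
    values = {atom: len(nbrs) for atom, nbrs in neighbours.items()}
    class_counts = [len(set(values.values()))]

    while True:
        # Each atom's new value is the sum of its neighbours' current values.
        updated = {atom: sum(values[n] for n in nbrs)
                   for atom, nbrs in neighbours.items()}
        n_classes = len(set(updated.values()))
        class_counts.append(n_classes)
        if n_classes <= class_counts[-2]:
            break  # no further discrimination between atoms
        values = updated
    return values, class_counts


values, class_counts = morgan_values(ASPIRIN_BONDS)
first_atom = max(values, key=values.get)
print(class_counts)  # [3, 6, 8, 11, 11], matching the n values in the figure
print(first_atom)    # 6: the atom with the largest value starts the table
```

Atoms that are topologically equivalent (for example the two oxygens on atom 7) keep equal values, so a complete canonical numbering needs additional tie-breaking rules that this sketch does not cover.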
Searching databases
Substructure search
[Figure: a dopamine-derived query (top left) and hits including Olmidine, Adrenaline, Mefeclorazine, Fenoldopam, Apomorphine and Morphine.]
Figure 1-6. An illustration of the range of hits that can be obtained using a substructure search. In this case the dopamine-derived query at the top left was used to search the World Drug Index (WDI) [WDI]. Note that in the case of fenoldopam the query can match the molecule in more than one way. ©Leach and Gillet, Springer 2007 (Representation and Manipulation of 2D Molecular Structures)
Structure-function similarity of compounds (1)
•  Compounds with similar structures often have similar functions.
... popular, particularly when dealing with large numbers of molecules. However, methods that are based on 2D structure will tend to identify molecules with common substructures, whereas the aim is often to identify structurally different molecules. As we have already indicated in Chapter 2, it is well known that molecular recognition depends on the 3D structure and properties (e.g. electrostatics and shape) of a molecule rather than the underlying substructure(s). An illustration of this is provided in Figure 5-5, which shows the three opioid ligands from ...
[Figure: Morphine (reference); Codeine, 0.99 similar; Heroin, 0.95 similar; Methadone, 0.20 similar.]
Figure 5-5. Similarities to morphine calculated using Daylight fingerprints and the Tanimoto coefficient. ©Leach and Gillet, Springer 2007
Structure-function similarity of compounds (2)
•  Compounds with similar structures tend to bind to the same target proteins or DNA.
[Figure: Testosterone, Estrogen, Dioxin.]
©Leach and Gillet, Springer 2007
Substructure search with fingerprints
•  Compound A is a hit because every bit that is 1 in the query is also 1 in A.
•  Compound B is not a hit because its bit for the O atom is 0.
[Figure: the query and database molecules A and B with their bitstrings.]
Figure 1-7. The bitstring representation of a query substructure is illustrated together with the corresponding bitstrings of two database molecules. ©Leach and Gillet, Springer 2007
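The screening rule on this slide (a molecule is a hit only if every bit set in the query is also set in the molecule) can be written as a single bitwise test. The sketch below is illustrative only: the function name `passes_screen` and the 5-bit strings are hypothetical and do not reproduce the bit layout of Figure 1-7.

```python
def passes_screen(query_bits: str, molecule_bits: str) -> bool:
    """Return True if every '1' in the query is also '1' in the molecule."""
    query = int(query_bits, 2)
    molecule = int(molecule_bits, 2)
    # AND-ing keeps only the bits set in both; the molecule passes when
    # that result still contains every bit of the query.
    return (query & molecule) == query


# Hypothetical 5-bit fingerprints, one bit per structural feature.
QUERY = "01100"
MOLECULE_A = "11101"   # has both query bits set (plus others)
MOLECULE_B = "01010"   # is missing one of the query bits

print(passes_screen(QUERY, MOLECULE_A))  # True
print(passes_screen(QUERY, MOLECULE_B))  # False
```

Passing the screen only shows that a molecule may contain the query substructure; the caption of Figure 1-7 calls this the screening stage, and molecules that pass still need an exact atom-by-atom comparison.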
Similarity of fingerprints
A:  1 0 1 1 1 0 1 1 0 0 1 1   (a = 8)
B:  0 0 1 1 0 0 1 0 1 0 1 1   (b = 6)
Bits set in both A and B: c = 5
Let a be the number of 1 bits in compound A, b the number of 1 bits in compound B, and c the number of 1 bits common to A and B. Then
$S_{AB} = \frac{c}{a + b - c} = \frac{5}{8 + 6 - 5} \approx 0.56$
Figure 5-2. Calculating similarity using binary vector representations and the Tanimoto coefficient.
Formulas
•  Euclidean distance: $\sqrt{a + b - 2c}$
•  Hamming distance: $a + b - 2c$
•  Cosine coefficient: $\frac{c}{\sqrt{ab}}$
•  Tanimoto coefficient: $\frac{c}{a + b - c}$
Example similarity calculation
... identical molecules) and a value of zero indicates that there is no similarity (i.e. there are no bits in common between the two molecules). Figure 5-2 provides a simple hypothetical example of the calculation of the Tanimoto similarity coefficient.
A:  1 0 1 1 1 0 1 1 0 0 1 1   (a = 8)
B:  0 0 1 1 0 0 1 0 1 0 1 1   (b = 6)
Bits set in both A and B: c = 5
$S_{AB} = \frac{5}{8 + 6 - 5} \approx 0.56$
Figure 5-2. Calculating similarity using binary vector representations and the Tanimoto coefficient.
Worked example
•  Euclidean distance: $\sqrt{a + b - 2c} = \sqrt{8 + 6 - 10} = 2$
•  Hamming distance: $a + b - 2c = 8 + 6 - 10 = 4$
•  Cosine coefficient: $\frac{c}{\sqrt{ab}} = \frac{5}{\sqrt{48}} \approx 0.72$
•  Tanimoto coefficient: $\frac{c}{a + b - c} = \frac{5}{8 + 6 - 5} = \frac{5}{9} \approx 0.56$
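The four measures in the worked example can be checked with a short Python sketch. The two bit vectors are the A and B of Figure 5-2; the variable and function names are assumptions made for this illustration.

```python
import math

# Bit vectors A and B from Figure 5-2.
A = [1, 0, 1, 1, 1, 0, 1, 1, 0, 0, 1, 1]
B = [0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 1]


def bit_counts(fp_a, fp_b):
    """Return (a, b, c): bits set in A, bits set in B, bits set in both."""
    a = sum(fp_a)
    b = sum(fp_b)
    c = sum(x & y for x, y in zip(fp_a, fp_b))
    return a, b, c


a, b, c = bit_counts(A, B)

hamming = a + b - 2 * c                # number of positions that differ
euclidean = math.sqrt(a + b - 2 * c)   # square root of the Hamming distance
cosine = c / math.sqrt(a * b)          # similarity: 1 means identical
tanimoto = c / (a + b - c)             # similarity: 1 means identical

print(a, b, c)              # 8 6 5
print(hamming, euclidean)   # 4 2.0
print(round(cosine, 2))     # 0.72
print(round(tanimoto, 2))   # 0.56
```

Note that the Euclidean and Hamming values are distances (larger means less alike), whereas the cosine and Tanimoto values are similarity coefficients (larger means more alike).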
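General substructure search
•  Storing every possible substructure in a fingerprint would require a bit string of unbounded length.
•  Deciding from an MDL or SMILES representation whether a given substructure is present is a hard problem.
   –  It is known as an NP-hard problem (subgraph isomorphism).
Maximum common subgraph (MCS)
•  Given several compounds, finding their maximum common subgraph (MCS) is also known to be NP-hard.
•  Because the MCS is easy to interpret, it is widely used in computational chemistry.
[Figure: molecules A and B and their maximum common subgraph MCS_AB.]
©Leach and Gillet, Springer 2007
Other feature representations of compounds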
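•  Log P: the (logarithm of the) octanol-water partition coefficient
   –  An indicator of how readily a compound dissolves in water rather than in a lipid-like phase.
      •  Atom-contribution schemes such as the Ghose-Crippen method are well known for estimating it; the Gasteiger-Marsili method computes partial atomic charges.
•  Polar Surface Area
   –  The part of a molecule's surface area that belongs to its polar atoms (mainly N and O and the hydrogens attached to them).
Lipinski's rule of five
•  A drug that can be given orally (a medicine taken by mouth) violates no more than one of the following rules:
   –  No more than 5 hydrogen-bond donors (the number of NH and OH groups).
   –  No more than 10 hydrogen-bond acceptors (the number of N and O atoms).
   –  A molecular weight of 500 daltons or less.
   –  Log P not greater than 5.
•  Every threshold is a multiple of 5.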
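The rule of five amounts to counting violations of four thresholds. The sketch below shows one way to express it; the function names are hypothetical, the descriptor values must be supplied by the caller (nothing is computed from a structure here), and the aspirin numbers are approximate values used only for illustration.

```python
def lipinski_violations(h_donors: int, h_acceptors: int,
                        mol_weight: float, log_p: float) -> int:
    """Count how many of the four rule-of-five thresholds are exceeded."""
    rules = [
        h_donors > 5,       # more than 5 hydrogen-bond donors (NH, OH)
        h_acceptors > 10,   # more than 10 hydrogen-bond acceptors (N, O)
        mol_weight > 500,   # molecular weight above 500 daltons
        log_p > 5,          # Log P above 5
    ]
    return sum(rules)


def passes_rule_of_five(h_donors, h_acceptors, mol_weight, log_p) -> bool:
    """A compound passes if it violates at most one rule."""
    return lipinski_violations(h_donors, h_acceptors, mol_weight, log_p) <= 1


# Approximate descriptor values for aspirin (illustrative only).
print(passes_rule_of_five(h_donors=1, h_acceptors=4,
                          mol_weight=180.16, log_p=1.2))

# A hypothetical large, polar compound that breaks three rules.
print(passes_rule_of_five(h_donors=6, h_acceptors=12,
                          mol_weight=650.0, log_p=3.0))
```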
Summary: types of "descriptors" that characterise compounds
•  Binary (0/1)
   –  Fingerprint
   –  SMILES
•  Real-valued
   –  LogP, PSA, etc.
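Predicting activity from descriptors
[Table: each row is a compound; the columns are 0/1 descriptors for the presence of drawn substructures (carbon chains and fragments containing Cl or O), Log P, and the effect.]
Descriptor values   Effect
1 1 1               drug
0 1 0               drug
0 1 0               drug
0 1 0               poison
1 1 0               poison
$f(\text{compound}) = \alpha_1 x_1 + \alpha_2 x_2 + \alpha_3 x_3 + \dots$, where $x_i$ is the descriptor for the i-th substructure drawn on the slide.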
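Least-squares regression
•  The horizontal axis is a compound descriptor and the vertical axis is the activity. The goal is to draw the line that passes as closely as possible among the data points.
[Figure: plot with axes x (descriptor) and y (activity). From An Introduction to Chemoinformatics.]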
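For a single descriptor, the least-squares line has a closed-form solution. The following sketch fits a slope and intercept to made-up data; the function name `fit_line` and the data points are assumptions for illustration, not values from the lecture.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x, and of co-deviations of x and y.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical descriptor values (x) and activities (y).
descriptor = [0.0, 1.0, 2.0, 3.0]
activity = [1.0, 3.0, 5.0, 7.0]

slope, intercept = fit_line(descriptor, activity)
print(slope, intercept)
```

With several descriptors the same idea leads to the weighted sum of descriptors shown on the activity-prediction slide, with one coefficient per descriptor.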
Quantitative structure-activity relationship analysis
•  QSAR (Quantitative Structure-Activity Relationship)
•  Methods based on structural information (structure-based) vs. methods not based on structural information (ligand-based)
Structure-based methods: docking
•  Dock, Gold, Disco, ...
[Figure: a ligand matched to sphere centres in an active site.]
Figure 8-3. Operation of the DOCK algorithm. A set of overlapping spheres is used to create a "negative image" of the active site. Ligand atoms are matched to the sphere centres, thereby enabling the ligand conformation to be oriented within the active site for subsequent scoring. ©Leach and Gillet, Springer 2007
More recent algorithms that take the ligand conformational degrees of freedom ...
Minimising the score / energy function
Figure 8-5. Two simple scoring functions used in docking. On the left is the basic scoring scheme used by the DOCK program [Desjarlais et al. 1988]. On the right is the piecewise linear potential with the parameters shown being those used to calculate steric interactions [Gehlhaar et al. 1995]. (Adapted from Leach 2001.) From An Introduction to Chemoinformatics.
Clustering
•  A method for dividing compounds into groups with similar properties
[Figure: structure diagrams of a large set of compounds.]
Hierarchical clustering
•  The procedure for grouping seven compounds
Figure 6-2. A dendrogram representing a hierarchical clustering of seven compounds. From An Introduction to Chemoinformatics.
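A dendrogram like the one in Figure 6-2 records the order in which clusters are merged. The sketch below builds such a merge history with agglomerative clustering. Several choices here are assumptions, since the slide does not specify them: single linkage is used, the distance is one minus the Tanimoto coefficient, fingerprints are given as non-empty sets of "on" bit positions, and the four compounds A to D are hypothetical.

```python
def tanimoto_distance(bits_a: set, bits_b: set) -> float:
    """1 - Tanimoto coefficient for fingerprints given as sets of on-bits."""
    return 1.0 - len(bits_a & bits_b) / len(bits_a | bits_b)


def single_linkage(fingerprints: dict):
    """Merge the two closest clusters until one remains.

    Returns a list of (members of cluster 1, members of cluster 2, distance),
    one entry per merge, in the order the merges happen.
    """
    clusters = [frozenset([name]) for name in fingerprints]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage: distance between the closest pair of members.
                d = min(tanimoto_distance(fingerprints[p], fingerprints[q])
                        for p in clusters[i] for q in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((sorted(clusters[i]), sorted(clusters[j]), d))
        merged = clusters[i] | clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges


# Hypothetical fingerprints: sets of bit positions that are set to 1.
FINGERPRINTS = {
    "A": {1, 2, 3, 4},
    "B": {1, 2, 3, 5},
    "C": {7, 8, 9},
    "D": {7, 8, 10},
}

merges = single_linkage(FINGERPRINTS)
for left, right, distance in merges:
    print(left, right, round(distance, 2))
```

Reading the merge list from first to last corresponds to reading a dendrogram from the leaves upward; cutting it at a chosen distance gives the groups.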
Other analysis methods: principal component analysis (PCA)
•  Principal Component Analysis (PCA)
   –  Used mainly for visualising data.
Software
•  Free
   –  Openbabel (http://openbabel.org)
      •  Conversion between formats such as MOL, SDF and SMILES
      •  2D coordinate calculation, generation of png (image) files
   –  Chemsketch (http://www.acdlabs.com/resources/freeware/chemsketch/)
      •  Drawing compounds
•  Commercial
   –  MOE (http://www.rsi.co.jp/kagaku/cs/ccg/)
      •  Integrated package
   –  Chemdraw (https://www.hulinks.co.jp/software/chembiodraw/index.html)
      •  Drawing compounds
References
•  Andrew R. Leach and Valerie J. Gillet, "An Introduction to Chemoinformatics", Springer
•  J. Gasteiger and T. Engel, "Chemoinformatics", Wiley
Contents
•  Representation of chemical compounds in computer
   –  2D
   –  3D
   –  MOL/SDF format
   –  SMILES, Fingerprint
•  Database
•  Software
   –  Openbabel
   –  Chemsketch
   –  Chemdraw
   –  Pubchem
   –  Dock, Gold
•  Similarity search
   –  Compound retrieval
   –  Maximum common subgraph
   –  Tanimoto index
•  Virtual screening (modeling)
   –  Lipinski's rule of five
   –  QSAR/QSPR
   –  ADMET