Reinforcement Learning in Computational Neuroscience: The Metalearning Hypothesis of Neuromodulator Systems

The 16th Annual Conference of Japanese Society for Artificial Intelligence, 2002
2A1-4
Reinforcement Learning and Computational Neuroscience
- Possible Functions of Neuromodulators in Metalearning
Kenji Doya*1,*2
*1 ATR Human Information Science Laboratories
*2 CREST, Japan Science and Technology Corporation
The framework of reinforcement learning captures an essential function of the nervous system: to realize behaviors for
acquisition of reward. Thus the architectures and algorithms of reinforcement learning can provide important clues as to the
organization and functions of the nervous system. Here I report three such examples: 1) a model of the basal ganglia as the
circuit for reinforcement learning; 2) understanding of the specialization and collaboration of the cerebellum, the basal
ganglia, and the cerebral cortex; 3) working hypotheses about the roles neuromodulators in regulating the metaparameters of
reinforcement learning. The concept of reinforcement learning can provide a common ground for interdisciplinary studies.
Reinforcement learning is the framework in which an agent learns, through trial and error, behaviors for acquiring rewards such as food. This problem setting captures one of the most basic features of behavioral learning in animals and humans; indeed, the term "reinforcement" was borrowed from the literature of psychology and behavioral science in the first place. The algorithms and architectures developed in engineering research on reinforcement learning therefore provide important clues for understanding the brain mechanisms of behavioral learning in humans and animals.
1. A model of the basal ganglia as the circuit for reinforcement learning
The basal ganglia, located between the cerebral cortex and the brainstem, had long been known to be involved in some way in motor control, judging from the symptoms of disorders caused by their damage, such as Parkinson's disease and Huntington's chorea, but their function in the healthy brain remained a puzzle of neuroscience. A clue to this puzzle came from physiological experiments on the dopamine neurons that project from the midbrain to the basal ganglia and the frontal cortex [1]. When a monkey was trained to press a lever upon the lighting of an LED in order to obtain juice, dopamine neurons at first fired in response to the juice reward. Once the behavior had been learned, however, they came to fire at the onset of the LED and no longer fired at the juice itself. This closely matches the behavior of the temporal difference (TD) error, which in reinforcement learning theory represents the deviation of the actual reward from its prediction:
δt = rt + γV(st) − V(st−1)
The TD error serves as a common learning signal for reward prediction and action learning, and this finding prompted models that explain the functions of the basal ganglia and dopamine neurons within the framework of reinforcement learning (Figure 1) [1,2]. That dopaminergic drugs act to reinforce behavior had been known for a long time.
Contact: Kenji Doya, ATR Human Information Science Laboratories, 2-2-2 Hikaridai, Seika-cho, Soraku-gun, Kyoto 619-0288, Japan. Fax: 0774-95-1259, E-mail: [email protected], http://www.atr.co.jp/his/~doya
Figure 1: A reinforcement learning model of the basal ganglia. Based on the state representation st in the cerebral cortex, the state value function V(s) and the action value function Q(s,a) are computed in the striosome and matrix compartments of the striatum, and the action at is selected through the pathway from the substantia nigra pars reticulata (SNr) and the globus pallidus (GP) via the thalamus to the motor systems. The dopamine neurons of the substantia nigra pars compacta (SNc) feed the TD error δt back to the striatum, and the value functions are learned through dopamine-dependent synaptic plasticity.
The regulation of synaptic plasticity in the basal ganglia by dopamine has also been confirmed experimentally, and reward prediction and action selection based upon it have come to be understood as basic functions of the basal ganglia.
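To make the model concrete, here is a minimal sketch of TD(0) learning in the spirit of Figure 1, with a tabular value function standing in for the striatum and the TD error standing in for the dopamine signal. The chain environment, its size, and all parameter values are illustrative assumptions, not taken from the models cited above.

    import numpy as np

    # Minimal TD(0) sketch of the circuit in Figure 1 (illustrative only):
    # a 5-state chain with a reward of 1 on reaching the final state.
    n_states = 5
    V = np.zeros(n_states)      # state value function V(s), cf. striatum
    gamma, alpha = 0.9, 0.1     # discount factor, learning rate

    for episode in range(200):
        s = 0
        while s < n_states - 1:
            s_next = s + 1                          # deterministic step
            r = 1.0 if s_next == n_states - 1 else 0.0
            # TD error (the hypothesized dopamine signal):
            # delta_t = r_t + gamma * V(s_t) - V(s_{t-1})
            delta = r + gamma * V[s_next] - V[s]
            V[s] += alpha * delta                   # dopamine-gated plasticity
            s = s_next

    print(np.round(V, 3))       # approaches [0.729, 0.81, 0.9, 1.0, 0.0]

The learned values approach γ^k as the distance k to the reward grows, reproducing the backward shift of the prediction signal from the reward itself to its earliest predictor, as in the LED experiment above.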
2. Specialization and collaboration of the cerebellum, the basal ganglia, and the cerebral cortex
Measurements of human brain activity by PET, fMRI, and related methods have shown that the cerebellum and the basal ganglia, which had traditionally been regarded as mainly motor structures, are also involved in non-motor functions such as mental calculation and the manipulation of mental images [3]. Advances in anatomical techniques for tracing neural connections have further shown that both the cerebellum and the basal ganglia influence the prefrontal cortex, regarded as the seat of higher cognitive functions. A new conceptual framework, not limited to motor control, is therefore needed for the roles of the cerebellum and the basal ganglia.
The model of the basal ganglia as a circuit for reinforcement learning, moreover, offers a useful perspective for understanding the division of functions across the brain as a whole.
Previous research has explained learning in the cerebellum within the framework of supervised learning based on error signals, and the firing properties of neurons in the cerebral cortex within the framework of unsupervised learning based on the statistical properties of their inputs. Taken together, these findings suggest a framework in which the cerebellum, the basal ganglia, and the cerebral cortex are neural circuits specialized for three different classes of learning algorithms: supervised learning, reinforcement learning, and unsupervised learning, respectively (Figure 2) [4].
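For concreteness, the following sketch contrasts the three update rules in their simplest linear forms: a delta rule for supervised learning (cerebellum), a TD(0) rule for reinforcement learning (basal ganglia), and Oja's Hebbian rule for unsupervised learning (cerebral cortex). The toy data, targets, and parameter values are illustrative assumptions.

    import numpy as np

    rng = np.random.default_rng(0)
    eta = 0.05      # common learning rate for all three toy examples

    # 1) Supervised learning (cerebellum): delta rule, driven by an
    #    explicit error between a teacher's target and the output.
    w_sup = np.zeros(2)
    for _ in range(500):
        x = rng.normal(size=2)
        target = 1.5 * x[0] - 0.5 * x[1]    # assumed teacher signal
        w_sup += eta * (target - w_sup @ x) * x

    # 2) Reinforcement learning (basal ganglia): TD(0) on a two-step
    #    chain, driven by a scalar reward rather than a target.
    V = np.zeros(2)
    gamma = 0.9
    for _ in range(500):
        V[0] += eta * (0.0 + gamma * V[1] - V[0])   # step 0 -> 1, no reward
        V[1] += eta * (1.0 - V[1])                  # step 1 -> end, reward 1

    # 3) Unsupervised learning (cerebral cortex): Oja's rule, driven only
    #    by input statistics; extracts the first principal component.
    w_uns = rng.normal(size=2)
    L = np.linalg.cholesky(np.array([[1.0, 0.8], [0.8, 1.0]]))
    for _ in range(2000):
        x = L @ rng.normal(size=2)          # correlated inputs
        y = w_uns @ x
        w_uns += eta * y * (x - y * w_uns)  # Hebbian term + normalization

    print(w_sup, np.round(V, 2), w_uns)

The three rules differ only in their teaching signal: an explicit target, a scalar reward prediction error, or none at all.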
Figure 2: The cerebellum, the basal ganglia, and the cerebral cortex have circuit structures and synaptic plasticity mechanisms suited to supervised learning, reinforcement learning, and unsupervised learning, respectively. The cerebellum learns from an explicit error between its output and a target, delivered via the inferior olive; the basal ganglia learn from reward signals delivered via the substantia nigra; the cerebral cortex learns from the statistics of its inputs.

Figure 3: The major neuromodulator systems: the dopamine system of the substantia nigra pars compacta (SNc) and the ventral tegmental area (VTA), the serotonin system of the dorsal raphe nucleus (DR), the noradrenaline system of the locus coeruleus (LC), and the acetylcholine system of the septum (S) and the nucleus of Meynert (M).
For example, reinforcement learning in a partially observable environment requires, beyond the basic machinery of reinforcement learning with value functions, an operation that updates a belief state over the hidden states. Such a computation should be realizable by connecting a predictive model of the environment acquired by supervised learning in the cerebellum and a value function acquired by reinforcement learning in the basal ganglia through a state representation acquired by unsupervised learning in the cerebral cortex. Hypotheses of this kind about the global division and integration of functions in the brain can provide an important bridge from the kind of data obtained by fMRI and related methods, namely which part of the brain is active, to an understanding of what computation is being performed there.
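As a minimal sketch of how such a combination might work (the two-state world, its probabilities, and every other detail are illustrative assumptions, not the paper's model): a transition model and an observation model, taken as already acquired by supervised learning, update a belief state, and a value function that is linear in the belief is learned by TD.

    import numpy as np

    # Illustrative belief-state sketch: a two-hidden-state world.
    T = np.array([[0.9, 0.1],     # T[s, s'] = P(next state s' | state s)
                  [0.1, 0.9]])
    O = np.array([[0.8, 0.2],     # O[o, s] = P(observation o | state s)
                  [0.2, 0.8]])
    w = np.zeros(2)               # value weights: V(b) = w @ b
    gamma, alpha = 0.9, 0.1
    rng = np.random.default_rng(0)

    for episode in range(300):
        s = rng.integers(2)               # true hidden state
        b = np.array([0.5, 0.5])          # initial belief (cortical state)
        for t in range(20):
            s = rng.choice(2, p=T[s])             # world makes a transition
            o = rng.choice(2, p=O[:, s])          # noisy observation
            b_next = O[o] * (T.T @ b)             # predict with the model,
            b_next /= b_next.sum()                # then apply Bayes rule
            r = 1.0 if s == 0 else 0.0            # reward in hidden state 0
            delta = r + gamma * (w @ b_next) - w @ b   # TD error on beliefs
            w += alpha * delta * b                     # value update
            b = b_next

    print(np.round(w, 2))   # more value for belief mass on the rewarded state

Here the transition and observation models play the cerebellar role, the belief update the cortical role, and the TD-trained value weights the basal ganglia role.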
3. The metalearning hypothesis of the neuromodulator systems
What one learns keenly from applying reinforcement learning to problems such as robot control [5,6] is that, for the robot to learn anything, the experimenter is forced to do a good deal of learning as well. In reinforcement learning algorithms, the appropriate settings of metaparameters, such as the time scale of future reward prediction (the discount factor γ), the inverse temperature (β) that determines the randomness of exploration, and the speed of learning (the learning rate α), depend on the learning task and the environmental conditions, and in most cases they have to be tuned by the experimenter through trial and error. This is a major reason why learning robots that succeed in the laboratory are of little use outside it.
Our brains, on the other hand, can autonomously learn appropriate behaviors in unknown environments without anyone tuning their learning metaparameters. The brain, in other words, appears to be equipped with a metalearning mechanism that adjusts the metaparameters of its own learning.
A natural candidate for the substrate of metalearning in the brain is the set of neuromodulator systems, which exert broad and sustained influences from the brainstem on the cerebral cortex, the basal ganglia, and the cerebellum (Figure 3). The accumulation of pharmacological and physiological studies has produced a vast amount of data on the major neuromodulators: their distributions and receptors in the brain, their actions at the cellular level, and the effects on behavior of their depletion, excess, and receptor knockouts. On the basis of these experimental findings and theoretical models, the following working hypothesis can be proposed [7]:
1) dopamine represents the error in reward prediction (the TD error δ);
2) serotonin controls the time scale of reward prediction (the discount factor γ);
3) noradrenaline controls the randomness of action selection (the inverse temperature β);
4) acetylcholine controls the speed of memory update (the learning rate α).
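The following sketch shows, under illustrative assumptions (a two-armed bandit treated as a continuing task; all numbers are arbitrary), where the four hypothesized quantities appear in a standard reinforcement learning agent:

    import numpy as np

    alpha = 0.1    # learning rate       -- hypothesis 4: acetylcholine
    beta = 2.0     # inverse temperature -- hypothesis 3: noradrenaline
    gamma = 0.9    # discount factor     -- hypothesis 2: serotonin
    Q = np.zeros(2)                      # action values Q(a)
    rng = np.random.default_rng(0)

    for t in range(1000):
        # Softmax action selection: beta sets the randomness of actions.
        p = np.exp(beta * Q)
        p /= p.sum()
        a = rng.choice(2, p=p)
        r = rng.normal(loc=0.5 if a == 0 else 1.0, scale=0.1)
        # TD error -- hypothesis 1: dopamine. gamma weights the value of
        # future choices against the immediate reward.
        delta = r + gamma * Q.max() - Q[a]
        Q[a] += alpha * delta            # alpha sets the speed of update

    print(np.round(Q, 2))

Raising β trades exploration for exploitation, raising γ extends how far into the future rewards matter, and raising α trades stability for learning speed; the hypothesis is that the corresponding neuromodulators tune exactly these trade-offs in the brain.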
To test this hypothesis, we are pursuing, as a CREST research project on the brain, an integrated program that combines the theory of learning, experiments in rats and monkeys, measurements of human brain activity, and robot experiments.
4. Conclusion
The problem setting of reinforcement learning is a framework common to engineering that seeks greater autonomy and to research aiming to understand biological adaptation and human behavior, and it can thus provide a foundation for new research that crosses the boundaries of these different disciplines.
References
[1] Schultz, W., Dayan, P., and Montague, P.R.: A neural
substrate of prediction and reward. Science, 275, 1593-1599
(1997).
[2] Houk, J.C., Adams, J.L., and Barto, A.G.: A model of how
the basal ganglia generate and use neural signals that predict
reinforcement. In J.C. Houk, et al. Eds: Models of
Information Processing in the Basal Ganglia, pp. 249-270.
MIT Press (1995).
[3] Doya, K.: Complementary roles of basal ganglia and
cerebellum in learning and motor control. Current Opinion in
Neurobiology, 10, 732-739 (2000).
[4] Doya, K.: What are the computations of the cerebellum, the basal ganglia, and the cerebral cortex? Neural Networks, 12, 961-974 (1999).
[5] Doya, K.: Reinforcement learning in continuous time and
space. Neural Computation, 12, 219-245 (2000).
[6] Doya, K., Kimura, H., and Kawato, M.: Neural mechanisms
of learning and control. IEEE Control Systems Magazine,
21(4), 42-54 (2001).
[7] Doya, K.: Metalearning and neuromodulation. Neural Networks, 15(4) (2002).