Reinforcement Learning in Computational Neuroscience: "The Metalearning Hypothesis of Neuromodulator Systems"

The 16th Annual Conference of Japanese Society for Artificial Intelligence, 2002
2A1-4
Reinforcement Learning and Computational Neuroscience
- Possible Functions of Neuromodulators in Metalearning -
Kenji Doya*1,*2
*1 ATR Human Information Science Laboratories
*2 CREST, Japan Science and Technology Corporation
The framework of reinforcement learning captures an essential function of the nervous system: to realize behaviors for
acquisition of reward. Thus the architectures and algorithms of reinforcement learning can provide important clues as to the
organization and functions of the nervous system. Here I report three such examples: 1) a model of the basal ganglia as the
circuit for reinforcement learning; 2) understanding of the specialization and collaboration of the cerebellum, the basal
ganglia, and the cerebral cortex; 3) working hypotheses about the roles of neuromodulators in regulating the metaparameters of
reinforcement learning. The concept of reinforcement learning can provide a common ground for interdisciplinary studies.
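The problem setting of reinforcement learning, in which an agent learns by exploration the behaviors that yield rewards such as food, captures the most basic form of behavioral learning in animals and humans. The term "reinforcement" itself was borrowed from psychology and biology. The algorithms and architectures developed in engineering research on reinforcement learning in turn offer important clues for understanding the brain mechanisms of behavioral learning in humans and animals.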
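Contact: Kenji Doya, ATR Human Information Science Laboratories, 2-2-2 Hikaridai, Seika-cho, Soraku-gun, Kyoto 619-0288, Japan. Fax: 0774-95-1259, E-mail: [email protected] http://www.atr.co.jp/his/~doya

1. Reinforcement Learning Model of the Basal Ganglia

The basal ganglia lie between the cerebral cortex and the brainstem. Symptoms of Parkinson's disease and Huntington's disease, which arise from lesions of the basal ganglia, have long indicated that they are involved in motor control in some way, but their normal function remained a mystery of neuroscience for many years.

A clue came from recordings of the activity of neurons that release dopamine and project from the midbrain to the basal ganglia and the frontal cortex [1]. When a monkey is trained to pull a lever when an LED turns on, dopamine neurons initially respond to the juice delivered as reward. Once learning has progressed, they respond to the onset of the LED and no longer respond to the juice. This closely resembles the behavior of the TD error in reinforcement learning theory, which expresses the deviation from the predicted reward:

$$\delta_t = r_t + \gamma V(s_t) - V(s_{t-1})$$

The TD error serves as the learning signal for both reward prediction and action learning. Prompted by this finding, models have been proposed that explain the functions of the basal ganglia and dopamine neurons within the framework of reinforcement learning (Figure 1) [1,2].

[Figure 1 diagram. Labels: sensory input; cerebral cortex (state s_t); striatum, with striosome (state value V(s)) and matrix (action value Q(s,a)); SNc dopamine neurons (TD error δ_t); reward r_t; SNr, GP; thalamus; action a_t; motor output.]
Figure 1: Reinforcement learning model of the basal ganglia. Based on the state representation s_t in the cerebral cortex, the striatum computes the state value function V(s) and the action value function Q(s,a). An action a_t is selected through the outputs that pass via the substantia nigra pars reticulata (SNr) and the globus pallidus (GP) to the thalamus and the motor centers of the brainstem. The substantia nigra pars compacta (SNc) feeds the TD error δ_t back to the striatum, and the value functions are learned through dopamine-dependent synaptic plasticity.

It has long been known that dopaminergic drugs lead to the reinforcement of behavior, as in addiction. It has also been confirmed experimentally that synaptic plasticity in the basal ganglia is controlled by dopamine, and reward prediction and the action selection based on it have come to be understood as the basic function of the basal ganglia.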
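The shift of the dopamine response from the reward to the cue can be reproduced with a minimal TD(0) sketch. The code below is an illustration under stated assumptions, not the model of [1,2]: the function name `simulate_td_trials`, the chain of time-step states, the reward size of 1, and the parameter values are all hypothetical, and the state before the cue is held at zero value on the assumption that cue timing is unpredictable. It uses the standard TD error $\delta_t = r_t + \gamma V(s_t) - V(s_{t-1})$.

```python
def simulate_td_trials(n_steps=5, n_trials=500, alpha=0.3, gamma=0.9):
    """Return the TD error at every time step of every trial.

    States 0..n_steps form a chain: state 0 is the moment before the cue,
    state 1 is cue onset, and the reward arrives on entering state n_steps.
    """
    values = [0.0] * (n_steps + 1)  # V(s); V of the final state stays 0
    history = []
    for _ in range(n_trials):
        deltas = []
        for t in range(1, n_steps + 1):
            reward = 1.0 if t == n_steps else 0.0
            # TD error: delta_t = r_t + gamma * V(s_t) - V(s_{t-1})
            delta = reward + gamma * values[t] - values[t - 1]
            # Assumption: the cue arrives unpredictably, so the pre-cue
            # state (index 0) cannot acquire predictive value.
            if t - 1 > 0:
                values[t - 1] += alpha * delta
            deltas.append(delta)
        history.append(deltas)
    return history


if __name__ == "__main__":
    history = simulate_td_trials()
    print("first trial:", [round(d, 3) for d in history[0]])
    print("last trial: ", [round(d, 3) for d in history[-1]])
```

On the first trial the only nonzero TD error occurs at reward delivery. After learning, the error at reward delivery vanishes and a positive error appears at cue onset, mirroring the recordings described in this section.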
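2. Specialization and Collaboration of the Cerebellum, the Basal Ganglia, and the Cerebral Cortex

Measurements of human brain activity with PET and fMRI have shown that the cerebellum and the basal ganglia, previously thought to be involved simply in movement, also take part in non-motor functions such as language processing, mental arithmetic, mental imagery, and the recognition of facial expressions [3]. Advances in techniques for tracing neural connections have further revealed that both the cerebellum and the basal ganglia send outputs to the prefrontal cortex, which is regarded as the seat of higher cognitive functions. A new framework is therefore needed for thinking about the roles of the cerebellum and the basal ganglia beyond motor control.

The model of the basal ganglia as a circuit for reinforcement learning also offers important suggestions about the principles of functional specialization across the whole brain. Previous studies have explained learning in the cerebellum within the framework of supervised learning based on error signals, and the response properties of neurons in the cerebral cortex within the framework of unsupervised learning based on the statistical properties of the input. Taken together, these suggest a framework in which the cerebellum, the basal ganglia, and the cerebral cortex are neural circuits specialized for different types of learning algorithms: supervised learning, reinforcement learning, and unsupervised learning, respectively (Figure 2) [4].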
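The three types of learning differ mainly in the signal that drives each update. The sketch below contrasts one representative update rule per type. The choice of rules (the delta rule, TD(0), and Oja's rule) and all function names are assumptions made for illustration; the text does not commit to these particular algorithms.

```python
def supervised_step(weights, inputs, target, lr=0.1):
    """Delta rule: an explicit target supplies a signed error signal."""
    output = sum(w * x for w, x in zip(weights, inputs))
    error = target - output
    new_weights = [w + lr * error * x for w, x in zip(weights, inputs)]
    return new_weights, error


def reinforcement_step(value_prev, value_next, reward, lr=0.1, gamma=0.9):
    """TD(0): only a scalar reward is given; the TD error drives learning."""
    delta = reward + gamma * value_next - value_prev
    return value_prev + lr * delta, delta


def unsupervised_step(weights, inputs, lr=0.1):
    """Oja's rule: a Hebbian update shaped only by input statistics."""
    output = sum(w * x for w, x in zip(weights, inputs))
    return [w + lr * output * (x - output * w) for w, x in zip(weights, inputs)]
```

Only the first rule needs a teacher, only the second needs a reward, and the third needs neither.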
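[Figure 2 diagram. Labels: cerebral cortex, unsupervised learning (input, output); basal ganglia, reinforcement learning (input, output, reward via the substantia nigra), thalamus; cerebellum, supervised learning (input, output, target, error via the inferior olive).]
Figure 2: The cerebellum, the basal ganglia, and the cerebral cortex have circuit architectures and synaptic plasticity mechanisms suited to supervised learning, reinforcement learning, and unsupervised learning, respectively.

Figure 3: Representative neuromodulator systems: the dopamine system of the substantia nigra pars compacta (SNc) and the ventral tegmental area (VTA), the serotonin system of the dorsal raphe nucleus (DR), the noradrenaline system of the locus coeruleus (LC), and the acetylcholine system of the septum (S) and the nucleus of Meynert (M).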
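For example, reinforcement learning under partial observation uses, in addition to the basic elements of reinforcement learning based on value functions, a mechanism that updates the degree of confidence in hidden states (the belief state). Such processing should be realizable by linking a predictive model of the environment, acquired in the cerebellum by supervised learning, and a value function, acquired in the basal ganglia by reinforcement learning, through a state representation acquired in the cerebral cortex by unsupervised learning. Hypotheses of this kind about the global functional specialization and integration of the brain can provide important clues for connecting data on where in the brain activity occurs, as obtained with fMRI and similar methods, to an understanding of what processing is carried out there.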
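The belief-state update mentioned above can be sketched as a standard Bayes filter over discrete hidden states. This is a generic textbook formulation offered as an illustration, not the specific mechanism proposed in [4]; the function name `update_belief` and the table layouts are assumptions, and the dependence of the transition on the chosen action is omitted for brevity.

```python
def update_belief(belief, transition, observation_prob, observation):
    """Return the posterior belief over hidden states after one observation.

    belief[s]                -- current probability of hidden state s
    transition[s][s2]        -- P(next state s2 | current state s)
    observation_prob[s2][o]  -- P(observation o | state s2)
    """
    n_states = len(belief)
    # Prediction step: propagate the belief through the environment model.
    predicted = [
        sum(belief[s] * transition[s][s2] for s in range(n_states))
        for s2 in range(n_states)
    ]
    # Correction step: weight each state by how well it explains the observation.
    unnormalized = [
        predicted[s2] * observation_prob[s2][observation]
        for s2 in range(n_states)
    ]
    total = sum(unnormalized)
    if total == 0.0:
        raise ValueError("observation has zero probability under the model")
    return [u / total for u in unnormalized]
```

In the picture drawn by the text, the `transition` table plays the role of the predictive model of the environment, and the resulting belief vector is the state representation on which a value function would operate.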
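3. Metalearning Hypothesis of Neuromodulator Systems

What one realizes keenly when applying reinforcement learning to robot control and similar tasks [5,6] is that, more than the robot, it is the experimenter who is made to learn how learning works. Reinforcement learning algorithms involve metaparameters such as the discount factor for future reward prediction (γ), the inverse temperature that determines the randomness of exploration (β), and the learning rate (α). Appropriate settings for these metaparameters depend on the task and the environmental conditions, so in most cases they must be tuned by the experimenter through trial and error. This is a major reason why learning robots that succeed at the laboratory level have difficulty making their way out into the real world.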
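The roles of β and γ can be made concrete with a short sketch. The softmax (Boltzmann) action selection rule and the discounted sum of rewards used here are the standard textbook forms; the function names are hypothetical and the numbers are arbitrary.

```python
import math


def softmax_policy(q_values, beta):
    """Action probabilities proportional to exp(beta * Q).

    beta = 0 gives uniform random exploration; a large beta approaches
    greedy selection of the highest-valued action.
    """
    top = max(q_values)  # subtract the maximum for numerical stability
    weights = [math.exp(beta * (q - top)) for q in q_values]
    total = sum(weights)
    return [w / total for w in weights]


def discounted_return(rewards, gamma):
    """Sum of rewards weighted by gamma ** delay.

    A small gamma makes the agent short-sighted; gamma near 1 makes
    distant rewards count almost as much as immediate ones.
    """
    total = 0.0
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


if __name__ == "__main__":
    for beta in (0.0, 1.0, 10.0):
        print(beta, softmax_policy([1.0, 2.0], beta))
    for gamma in (0.5, 0.9):
        print(gamma, discounted_return([0.0, 0.0, 1.0], gamma))
```

The same action values thus produce very different behavior depending on β, and the same reward sequence is valued very differently depending on γ, which is why these settings usually have to be matched to the task.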
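Our brains, in contrast, can learn a variety of behaviors autonomously in unknown environments without anyone tuning the metaparameters of their learning. In other words, the brain appears to be equipped with a metalearning mechanism that regulates its own learning metaparameters.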
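A likely medium for metalearning in the brain is the set of neuromodulator systems, which exert broad and sustained effects from the brainstem on the cerebral cortex, the basal ganglia, and the cerebellum (Figure 3). Thanks to pharmacological and molecular biological techniques, a vast amount of data is available on the distribution of the various neuromodulators and their receptors in the brain, their actions at the cellular level, and the behavioral effects of drugs and gene knockouts that target them. On the basis of these experimental findings and theoretical models, the following hypotheses are proposed [7]:
1) Dopamine represents the deviation from reward prediction, the TD error δ.
2) Serotonin controls the time scale of reward prediction, the discount factor γ.
3) Noradrenaline controls the randomness of action selection, the inverse temperature β.
4) Acetylcholine controls the speed of memory update, the learning rate α.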
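To show where each of the four signals enters a reinforcement learning algorithm, the sketch below annotates one learning step with the hypothesized correspondences. It is an illustration only: SARSA with softmax action selection is chosen here as a convenient example algorithm, and the class and function names (`Metaparameters`, `select_action`, `sarsa_update`) and the default values are assumptions, not part of the hypothesis in [7].

```python
import math
import random
from dataclasses import dataclass


@dataclass
class Metaparameters:
    alpha: float = 0.1  # learning rate (hypothesis 4: acetylcholine)
    beta: float = 2.0   # inverse temperature (hypothesis 3: noradrenaline)
    gamma: float = 0.9  # discount factor (hypothesis 2: serotonin)


def select_action(q_row, beta, rng):
    """Sample an action index from a softmax over the action values."""
    top = max(q_row)
    weights = [math.exp(beta * (q - top)) for q in q_row]
    threshold = rng.random() * sum(weights)
    cumulative = 0.0
    for action, weight in enumerate(weights):
        cumulative += weight
        if threshold < cumulative:
            return action
    return len(q_row) - 1  # guard against floating-point round-off


def sarsa_update(q, state, action, reward, next_state, next_action, meta):
    """Apply one SARSA update in place and return the TD error."""
    # TD error (hypothesis 1: dopamine)
    delta = reward + meta.gamma * q[next_state][next_action] - q[state][action]
    q[state][action] += meta.alpha * delta
    return delta


if __name__ == "__main__":
    meta = Metaparameters()
    rng = random.Random(0)
    q = [[0.0, 0.0], [0.0, 0.0]]  # two states, two actions
    a = select_action(q[0], meta.beta, rng)
    a_next = select_action(q[1], meta.beta, rng)
    print(sarsa_update(q, 0, a, 1.0, 1, a_next, meta), q)
```

A metalearning mechanism in this picture would adjust the fields of `Metaparameters` during learning rather than leave them fixed by the experimenter.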
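To test these hypotheses, collaborative research is under way as a project of the CREST "Creating the Brain" program, combining research on learning theory, physiological experiments in rats and monkeys, measurements of human brain activity, and robot experiments.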
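4. Conclusion

The problem setting of reinforcement learning is a common framework both for engineering, which seeks higher performance, and for research into the principles of biological adaptation and human behavior. It can provide a foundation for new research that integrates different disciplines.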
References
[1] Schultz, W., Dayan, P., and Montague, P.R.: A neural
substrate of prediction and reward. Science, 275, 1593-1599
(1997).
[2] Houk, J.C., Adams, J.L., and Barto, A.G.: A model of how
the basal ganglia generate and use neural signals that predict
reinforcement. In J.C. Houk, et al. Eds: Models of
Information Processing in the Basal Ganglia, pp. 249-270.
MIT Press (1995).
[3] Doya, K.: Complementary roles of basal ganglia and
cerebellum in learning and motor control. Current Opinion in
Neurobiology, 10, 732-739 (2000).
[4] Doya, K.: What are the computations of the cerebellum, the
basal ganglia, and the cerebral cortex? Neural Networks, 12,
961-974 (1999).
[5] Doya, K.: Reinforcement learning in continuous time and
space. Neural Computation, 12, 219-245 (2000).
[6] Doya, K., Kimura, H., and Kawato, M.: Neural mechanisms
of learning and control. IEEE Control Systems Magazine,
21(4), 42-54 (2001).
[7] Doya, K.: Metalearning and neuromodulation. Neural
Networks, 15(4) (2002).