Cross-media Translation Based on Mental Image

MIDST and IMAGES-M
Masao Yokota
Fukuoka Institute of Technology
Background & motivation
Intelligent systems should be more human-friendly, considering:
• Floods of multimedia information
• Growth of highly matured (aging) societies
• Development of robots for practical use
• Others
Solution
Integrated Multimedia Understanding System
IMAGES-M
IMAGES-M
Knowledge Base (KB)
Text Processing Unit (TPU)
Picture Processing Unit (PPU)
Speech Processing Unit (SPU)
Inference Engine (IE)
Sensory Data Processing Unit (SDPU)
Action Data Processing Unit (APU)
Demonstration of IMAGES-M
--- Collaboration of TPU and PPU ---
(Phase 1) Text-to-Picture translation
Input : Text
Output : Pictorial interpretation
(Phase 2) Q-A about Picture by Text
Input : Query Text
Output: Answer Text
Input text (Japanese/ English/ Chinese)
The lamp above the chair is small.
The red pot is 1m to the left of the chair.
The blue big box is 3m to the right of the chair.
Output picture
Input picture
Output text
The octagon is to the upper right of the triangle.
The octagon is above the quadrangle.
The triangle is to the lower left of the octagon.
…
Input sentence: Taro ga kubi wo furu (=Taro shakes his head).
Output animation:
Cross-reference between picture and text
Q: Where does Miwadai Street meet National Route 495?
   (美和台通りは国道495号線とどこで出会いますか。)
A: They meet at the Shimowajiro intersection.
   (下和白交差点で出会います。)
Integrated Multimedia
Understanding based on Lmd
Picture
Text
Speech
Animation
Action
Intermediate
Representation
Sensory data
……
Descriptive power and Computability
of Meta Language Lmd
for Intermediate Representation
Mental Image Directed Semantic Theory
(MIDST)
proposed by Yokota,M.
• Information processing by intelligent entities
  = mental image processing
• Mental images
  Sensory images = sensations coded by sensors
  Conceptual images = sensory images processed by brains
  (e.g., word concepts)
Multimedia Description Language
Lmd based on Mental Image
Directed Semantic Theory (MIDST)
• Syntax
Many-sorted predicate logic with a special
predicate constant L called “Atomic Locus”
• Semantics
Interpretation in association with an
omnisensual mental image model, so-called "loci
in attribute spaces"
Omnisensual Mental Image Model
Sensation (= sensory event)
= spatio-temporal distribution of stimuli
(e.g., in the attribute spaces LOCATION, SHAPE, COLOR)
Coded Sensations ⇒ Loci in Attribute Spaces
Atomic Locus

L(x, y, p, q, a, g, k)

[Figure: the value of attribute 'a' of y moving from p to q over the
time interval [ti, tj] (g = Gt), or over a spatial extent (g = Gs)]

g = Gt : temporal event
g = Gs : spatial event

"Matter 'x' causes Attribute 'a' of Matter 'y' to keep or change its
value temporally or spatially over a time interval, where the values
'p' and 'q' are relative to Standard 'k'."
Terms of Atomic Locus
L(1, 2, 3, 4, 5, 6, 7)

Term  Type Name   Semantic Role
1     Matter      Event Causer (EC)
2     Matter      Attribute Carrier (AC)
3     Value       Beginning of Locus
4     Value       Ending of Locus
5     Attribute   Domain of Attribute Value
6     Event Type  Relation between AC and FAO
7     Standard    Unit, Origin, Scale, etc. for Values
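The seven terms of an atomic locus can be carried around as a simple record. A minimal Python sketch (the class name `AtomicLocus` and the sample token values are illustrative, not part of IMAGES-M):

```python
from dataclasses import dataclass
from typing import Any

@dataclass(frozen=True)
class AtomicLocus:
    """One atomic locus L(x, y, p, q, a, g, k).

    x: Event Causer (EC)        y: Attribute Carrier (AC)
    p: beginning of locus       q: ending of locus
    a: attribute (e.g. 'A12')   g: event type ('Gt' or 'Gs')
    k: standard (unit/origin/scale for the values)
    """
    x: Any
    y: Any
    p: Any
    q: Any
    a: str
    g: str
    k: Any

# (S1)-style temporal event: a bus itself moves from Tokyo to Osaka.
bus_run = AtomicLocus("bus1", "bus1", "Tokyo", "Osaka", "A12", "Gt", "k0")
```

Freezing the dataclass keeps loci hashable, so they can later be collected into sets when loci are compared against postulates.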
Event types
[Figure: temporal event -- the AC (bus) itself moves from Tokyo to Osaka;
spatial event -- the FAO sweeps along the AC (road) from Tokyo to Osaka]

(S1) The bus runs from Tokyo to Osaka.
(∃x, y, k) L(x, y, Tokyo, Osaka, A12, Gt, k) ∧ bus(y)
(S2) The road runs from Tokyo to Osaka.
(∃x, y, k) L(x, y, Tokyo, Osaka, A12, Gs, k) ∧ road(y)
A12 : Physical Location
Attributes
Table 1 Attributes
Standards
Table 2 Standards

Category of standard    Remarks
Rigid Standard          Objective standards such as denoted by measuring units (meter, gram, etc.).
Species Standard        The attribute value ordinary for a species: a short train is ordinarily longer than a long pencil.
Proportional Standard   'Oblong' means that the width is greater than the height at a physical object.
Individual Standard     Much money for one person can be too little for another.
Purposive Standard      One room large enough for a person's sleeping must be too small for his jogging.
Declarative Standard    The origin of an order such as 'next' must be declared explicitly, just as in 'next to him'.
Tempo-logical connectives
χ1 ρi χ2 ⇔ (χ1 ρ χ2) ∧ τi(χ1, χ2)
ρi : tempo-logical connective
χj : locus
ρ : binary logical connective (i.e., ∧, ∨, ⊃, ≡)
∧ : 'AND'
τi : temporal relation between loci, such as 'before', 'during', etc.
Definition of τi
The durations of χ1 and χ2 are [t11, t12] and [t21, t22], respectively.
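The temporal relations τi over locus durations can be classified directly from the interval endpoints. A small Python sketch (function name and the subset of relations covered are my choices; the slide only names 'before' and 'during'):

```python
def temporal_relation(d1, d2):
    """Classify tau_i for two loci whose durations are the intervals
    d1 = [t11, t12] and d2 = [t21, t22].
    Only a handful of the interval relations are covered here."""
    (t11, t12), (t21, t22) = d1, d2
    if t12 < t21:
        return "before"
    if t11 == t21 and t12 == t22:
        return "equals"
    if t11 == t21 and t12 < t22:
        return "starts"
    if t21 < t11 and t12 == t22:
        return "finishes"
    if t21 < t11 and t12 < t22:
        return "during"
    return "other"

print(temporal_relation((0, 1), (2, 3)))   # → before
print(temporal_relation((1, 2), (0, 3)))   # → during
```

The branches are mutually exclusive, so each pair of intervals maps to exactly one relation name.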
Conceptualization of sensory events
[Figure: Event 1 … Event N -- an object x moving together with an
object y along the Location (A12) axis over time]
Conceptualization ⇒ Formalization
… L(x, x, p, q, A12, Gt, k) Π L(x, y, p, q, A12, Gt, k) ∧ x≠y ∧ p≠q …
SAND and CAND
Π : Simultaneous AND (SAND)
• : Consecutive AND (CAND)

Image of 'x fetches y':
(∃x, y, p1, p2, k) L(x, x, p1, p2, A12, Gt, k)
• (L(x, x, p2, p1, A12, Gt, k) Π L(x, y, p2, p1, A12, Gt, k))
∧ x≠y ∧ p1≠p2

[Figure: locus in the A12 (Location) space -- x goes from p1 to p2 over
[t1, t2], then x and y return together from p2 to p1 over [t2, t3]]
A13 : Direction
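The SAND/CAND composition above can be checked mechanically once each locus carries its duration. A toy Python encoding (a locus is a `(terms, interval)` pair; the encoding is mine, not the system's):

```python
# Each locus is paired with its duration (t_start, t_end).
def sand(l1, l2):
    """Pi (SAND): the two loci hold over the same time interval."""
    return l1[1] == l2[1]

def cand(l1, l2):
    """• (CAND): the second locus begins exactly when the first ends."""
    return l1[1][1] == l2[1][0]

# 'x fetches y': x goes p1 -> p2, then returns p2 -> p1 carrying y.
go    = (("L", "x", "x", "p1", "p2", "A12"), (1, 2))
back  = (("L", "x", "x", "p2", "p1", "A12"), (2, 3))
carry = (("L", "x", "y", "p2", "p1", "A12"), (2, 3))
print(cand(go, back), sand(back, carry))   # → True True
```

The 'fetch' image is then exactly `go • (back Π carry)`: the return leg and the carrying leg share one interval, consecutive to the outbound leg.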
Description of Discrete Spatial Relations
The square is between the circle and the triangle.
The circle, square and triangle are in a line.
(∃u, x, y, z, p)(L(z, u, x, y, A12, Gs) • L(z, u, y, z, A12, Gs))
Π L(z, u, p, p, A13, Gs) ∧ isr(u) ∧ C(x) ∧ S(y) ∧ T(z)
isr : imaginary space region
Description of spatial events
associated with temporal loci in attribute spaces

[Figure: map -- a road running 10 km east from A to B, meeting at C a
street with a sidewalk]

(∃x, y, z, p, q)(L(_, x, A, B, A12, Gs, _) Π L(_, x, 0, 10km, A17, Gs, _)
Π L(_, x, Point, Line, A15, Gs, _) Π L(_, x, East, East, A13, Gs, _))
• (L(_, x, p, C, A12, Gs, _) Π L(_, y, q, C, A12, Gs, _)
Π L(_, z, y, y, A12, Gs, _)) ∧ road(x) ∧ street(y) ∧ sidewalk(z) ∧ p≠q

The road runs 10 km straight east from A to B and, after a while, at C
it meets the street with the sidewalk.
Event Patterns about Location (A12)
[Figure: loci in the A12 space for the event patterns 'start', 'stop',
'meet', 'separate', 'carry' and 'return']
Event Patterns about Color (A32)
Word meaning description
Mw ≡ [Cp : Up]   (Cp : Concept Part, Up : Unification Part)

Mw(red) = [ (image: color of X is red) : ARG(Gov, X) ]
Color of X is red; the 'governor' is X.

Mw(box) = [ (image: shape of Y is like this) : ___ ]
Shape of Y is like this; Up is 'empty'.

'red box' : the two concept parts unify with Y = X.
Mutual projection between surface and conceptual structures
using word meaning descriptions and surface dependency structures.

The robot carries the book.   (Surface Structure)

Surface Dependency Structure:
carries --Dep1--> robot (--> the)
carries --Dep2--> book (--> the)

Conceptual Structure:
(∃x, y, p1, p2, k) L(x, x, p1, p2, A12, Gt, k)
Π L(x, y, p1, p2, A12, Gt, k) ∧ robot(x) ∧ book(y)
∧ x≠y ∧ p1≠p2
Example (1): 'carry (verb)'
Dep.1 CARRY Dep.2.
Mw(carry) ≡ [(∃x, y, p1, p2, k)
L(x, x, p1, p2, A12, Gt, k)
Π L(x, y, p1, p2, A12, Gt, k) ∧ x≠y ∧ p1≠p2 :
ARG(Dep.1, x); ARG(Dep.2, y);]

Example (2): 'desk (noun)'
Mw(desk) ≡ [(λx) desk(x) : ___ ;]
, where
(λx) desk(x) ⇔ (λx)(… Π L*(_, x, /, /, A29, Gt, _)
Π … Π L*(_, x, /, /, A39, Gt, _) Π …)
'At any time, a desk has no taste (A29), …, no vitality (A39), …'
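Applying Mw(carry) to a concrete dependency structure instantiates its concept part with the bound dependents. A Python sketch (the function name `mw_carry` and the tuple encoding of loci are ad hoc illustrations):

```python
def mw_carry(dep1, dep2):
    """Sketch of applying Mw(carry): the unification part binds the
    dependents (ARG(Dep.1, x); ARG(Dep.2, y)), and the concept part
    instantiates the two SAND-coupled loci with those bindings."""
    x, y = dep1, dep2
    return [
        ("L", x, x, "p1", "p2", "A12", "Gt", "k"),  # the carrier moves itself
        ("L", x, y, "p1", "p2", "A12", "Gt", "k"),  # ...and the carried object
    ]

# "The robot carries the book." -> Dep.1 = robot, Dep.2 = book
loci = mw_carry("robot", "book")
print(loci[1][1], loci[1][2])   # → robot book
```

Both loci share the same value pair (p1, p2), which is precisely what makes 'carry' a joint movement of carrier and carried object.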
Fundamental Semantic Processing
on texts by IMAGES-M
Detection of
• Semantic anomalies
• Semantic ambiguities
• Paraphrase relations
Postulates about the world
X ∧ Y* .⊃. X Π Y,
where Y* denotes that Y holds true over any time-interval.
L(x, y, p, q, a, g, k) Π L(z, y, r, s, a, g, k) .⊃. p=r ∧ q=s
Detection of Semantic Anomalies
by using postulates
(Postulate 1)
L(x, y, p1, q1, a, g, k) Π L(z, y, p2, q2, a, g, k) .⊃.
p1=p2 ∧ q1=q2
'A matter never has different values of an attribute at a time.'

Example (1)
Tom stays with the guest from Spain.  (two dependency readings, D1 and D2)
M(stay) = [(∃x, y, p1, p2, k) L(x, y, p1, p2, A12, Gt, k)
∧ x≠y ∧ p1=p2 : …… ]
M(from) = [(∃x, y, p1, p2, k) L(x, y, p1, p2, A12, Gt, k)
∧ p1≠p2 : …… ]
D2 violates Postulate 1.
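Postulate 1 turns into a mechanical check once loci are explicit: collect loci that share an attribute carrier, attribute, and time interval, and flag differing value pairs. A Python sketch (dict keys and the simplistic equal-interval overlap test are my assumptions):

```python
def violates_postulate_1(loci):
    """Postulate 1: a matter never has two different values of the same
    attribute at a time.  Each locus is a dict with keys
    ac (attribute carrier), attr, p, q, t (time interval)."""
    for i, a in enumerate(loci):
        for b in loci[i + 1:]:
            if (a["ac"] == b["ac"] and a["attr"] == b["attr"]
                    and a["t"] == b["t"]          # simplistic overlap test
                    and (a["p"], a["q"]) != (b["p"], b["q"])):
                return True
    return False

# Reading D2 of Example (1): 'stay' keeps Tom's location constant while
# 'from' makes that same location change -- a Postulate 1 violation.
stay = {"ac": "Tom", "attr": "A12", "p": "p0", "q": "p0", "t": (0, 1)}
frm  = {"ac": "Tom", "attr": "A12", "p": "Spain", "q": "p0", "t": (0, 1)}
print(violates_postulate_1([stay, frm]))   # → True
```

Under reading D1, the 'from' locus attaches to the guest instead of Tom, the carriers differ, and no violation is reported.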
Example (2)
I drank the coffee on the desk, which was sweet.  (two dependency
readings, D1 and D2)
D1 violates Postulate 1:
L(x, y, sweet, sweet, A29, Gt, k) ∧ desk(y)
⊃ L(x, y, sweet, sweet, A29, Gt, k) Π L(z, y, /, /, A29, Gt, k)
⊃ 'sweet' = /
Detection of Semantic Ambiguities
Tom follows Jim with the stick.  (two dependency readings, D1 and D2)
[Figure: pictorial interpretations Pr(D1) and Pr(D2) -- in one, the
stick (s) is with Tom (T); in the other, with Jim (J)]
Paraphrasing based on understanding
(Input) The girl fetches the book from the village to the town.
(Output) The girl goes to the village from the town, and then carries
the book from the village to the town.
(∃x1, x2, p1, p2, k)
L(x1, x1, p1, p2, A12, Gt, k) • (L(x1, x1, p2, p1, A12, Gt, k)
Π L(x1, x2, p2, p1, A12, Gt, k))
∧ girl(x1) ∧ book(x2) ∧ town(p1) ∧ village(p2)
Why is cross-media translation (CMT) important?
--- Problem ---
I have one chair, one flower-pot, one box, one lamp and one cat in my room.
The chair is 1m to the right of the flower-pot.
The flower-pot is 4m to the left of the box.
The red lamp hangs above the chair.
The black cat lies under the chair.
Systematic CMT
Explicit algorithms for:
(C1) translating source representations into target ones, for contents
describable by both source and target media.
(C2) filtering out contents that are describable by the source medium
but not by the target one.
(C3) supplementing default contents, that is, contents that need to be
described in target representations but are not explicitly described
in source representations.
(C4) replacing default contents by definite ones given in the
following contexts.
Realization of systematic CMT
Algorithms for:
(C1) translating source representations into target ones, for contents
describable by both source and target media  ⇒ APRs
(C2) filtering out contents that are describable by the source medium
but not by the target one  ⇒ APRs
(C3) supplementing default contents, that is, contents that need to be
described in target representations but are not explicitly described
in source representations  ⇒ default reasoning
(C4) replacing default contents by definite ones given in the
following contexts  ⇒ only to memorize the processing history
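Steps (C3) and (C4) amount to default filling with a remembered history. A minimal Python sketch of that bookkeeping (function names, attribute names and default values are illustrative, not from IMAGES-M):

```python
def supplement_defaults(described, defaults):
    """C3: fill in default contents for target attributes the source
    text leaves unspecified, remembering which were defaulted (for C4)."""
    picture = {**defaults, **described}
    defaulted = set(defaults) - set(described)
    return picture, defaulted

def replace_defaults(picture, defaulted, context):
    """C4: definite values arriving from later context overwrite only
    the remembered defaults, never explicitly described contents."""
    for attr, value in context.items():
        if attr in defaulted:
            picture[attr] = value
            defaulted.discard(attr)
    return picture

# A first sentence mentions only the shape; color and volume are
# defaulted (C3), then a later sentence supplies definite values (C4).
pic, dflt = supplement_defaults({"shape": "cube"},
                                {"color": "grey", "volume": "medium"})
pic = replace_defaults(pic, dflt, {"color": "red", "volume": "large"})
# color and volume replaced from context; shape kept as described
```

Memorizing only which attributes were defaulted is exactly the "processing history" the slide refers to: it is the minimal state needed to know what later context may overwrite.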
Formalization of cross-media translation
Y(Smt) = Φ(X(Sms))
In the case of text-to-picture CMT:
Sms = all the attributes in the previous table.
Smt = the visual attributes marked by * in the previous table.
Φ is defined by a set of APRs shown in the next table.

CMT between Text and Picture
Text = the omnisensual world specified by Sms
Text Meaning Representation = X(Sms)
Φ = APRs and default reasoning
Picture Meaning Representation = Y(Smt)
Picture = the visual world specified by Smt
Attribute Paraphrasing Rules (APRs)
Table 4 Attribute paraphrasing rules for text-to-picture translation

APR     Correspondence of      Value conversion      Interpretation of the schema
        attributes             schema
        (Text : Picture)       (Text → Picture)
APR-01  A12 : A12              p → p'                'position' into 2D coordinates (within the display area).
APR-02  {A12, A13, A17} : A12  {p, d, l} → p'+l'd'   {'position', 'direction', 'distance'} into 2D coordinates.
APR-03  {A11, A10} : A11       {s, v} → v's'         {'shape', 'volume'} into a set of outlines of the object.
APR-04  A32 : A32              c → c'                'color' into 3D coordinates of the color solid.
APR-05  {A12, A44} : A12       {pa, m} → {pa', pb'}  {'position', 'topology'} into a pair of 2D coordinates.

For example, APR-02 is for such a sentence as "The box is 3 meters
to the left of the chair."
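APR-02's conversion schema {p, d, l} → p' + l'd' is just vector displacement in picture coordinates. A Python sketch (the function name and the coordinate conventions, e.g. 'left' as (-1, 0), are my assumptions):

```python
def apr_02(position, direction, distance):
    """APR-02 sketch, {p, d, l} -> p' + l'd': displace the reference
    object's 2D picture position by the stated distance along the
    unit direction vector."""
    px, py = position
    dx, dy = direction          # assumed to be a unit vector
    return (px + distance * dx, py + distance * dy)

# "The box is 3 meters to the left of the chair":
# chair drawn at (0, 0), 'left' taken as (-1, 0), distance 3.
print(apr_02((0, 0), (-1, 0), 3))   # → (-3, 0)
```

The text-side triple thus collapses into a single picture-side position, which is why the correspondence maps three text attributes onto the one visual attribute A12.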
S1 = There is a hard cubic object.
P1 = { shape = cube            ← C1
       hardness = indescribable ← C2
       color = default          ← C3
       volume = default         ← C3 }
S2 = The object is large and red.
P2 = { color = red             ← C4
       volume = large           ← C4 }
Discussions and conclusions
· The cross-references between texts in several
languages (Japanese, Chinese, Albanian and
English) and pictorial patterns like maps were
successfully implemented on our intelligent
system IMAGES-M.
· To the best of our knowledge, no other system
can perform cross-media reference in such a
seamless way as ours.
Future works
• Automatic acquisition of word meanings
from sensory data.
• Human-robot communication by natural
language in real environments
• etc.