requirements on analysis tools for "Big Data"

especially text analysis tools, of course
jussi karlgren
gavagai
stockholm nlp meetup, october 2014
gavagai
text analysis on internet scale since 2008
applied primarily to media monitoring
right here in Stockholm
jussi karlgren
founder
researcher in text stylistics and evaluation of information access systems
adjunct professor of language technology at kth
big data is not just slightly larger data
it’s not a scalability issue
situation awareness, not search
@tripbirds at instead. Ping @jonasl @jocke
Internetdagarna. #ind11 (@ Stockholm Waterfront Congre
Centre w/ 7 others) http://t.co/lf3eLdOT
@RonnieRitter ska på #ind11! Eller menar du den här jädra
omg, YITAS
Werd
så-gullig-marsplan? :)
and my rents caught me
@LennartBon @mansj gameon! #ind11
Gr8 work, rly gr8 work!
new text needs robust analysis
wos the otha
place we gt
chased out of?hehe!tht
wos
so funny!
@per_p ja! Kul att du kommer!
där hela
the yr 5's Blir
enit idu
wos callin
ya but veckan?
u were 2 busy #in
*grins evil*
sup
lol well.....dunno lol
wat
i wud tap that
u lik?
rly?
thx stranger
thank u hehe :)ssssssshhhhhh! ..... i wish i wos eatin
a
all da cakes lol.
member. fantage_girl08 8 months ago. we hate each!
other bcuz ...
we r the sissters adn we wil cill u al bcuz u dun
ppl think you need to grow up x
beliv our story!!!!
por ke sabes tantO...
is bcuz she is dead she tried 2 come bak then and
soposedly got out ..... of 2 sexy 4 u? and all the
q's and a's are similar to each other. ...
Dragon Ball Z: Budokai Tenkaichi 4 might feature !a
character ...
-MORE MORE MORE ATTACKS, like goku for example, in
bt3 he he has two .... In yor msg
bjr, Adele!! koi29?
lut, ca va?
heeh
had a kewl day morph?
=)
Greeting
i came home 2 am ..and went up at 9
sum bastard woke me up @ 2pm
=P
heh naah.
Irritation i h8 dat guy
lol
phoner on the sms?
:)
the internet is not in english any more
4 h of news feed
43171 English
15135 Chinese
9939 German
7526 Spanish
4899 French
3611 Russian
3167 Japanese
2879 Italian
2722 Korean
2432 Swedish
1904 Dutch
1795 Portuguese
1752 Turkish
1389 Arabic
1362 Hungarian
...
5 Kannada
4 Galician
3 Maltese
...
自らの作品制作を行うかたわら、芸術イベント『GEISAI』プロジェクトのチェアマンを務め、アーティスト集団『カイカイ・キキ(Kaikai Kiki)』を主宰し、若手アーティストのプロデュースを行うなど、活発な活動を展開している。同集団は、アメリカのニューヨークにも版権を管理するエージェントオフィスをもつ。(
日本アニメポップ的な作風の裏には、日本画の浮世絵や琳派の構成に影響されている部分も強く、日本画のフラット感、オタクの文脈とのリンクなど現代文化のキーワードが含まれている。中でもアニメ、フィギュアなどいわゆるサブカルチャーであるオタク系の題材を用いた作品が有名。アニメ風の美少女キャラクターをモチーフとした作品は中原浩大の「ナディア」に影響を受けたと本人も認めている。アニメーター・金田伊功の影響を強く受けており、自分の作品は金田の功績を作例として表現しているだけと話したこともある。
この村上の一連の創作活動について漫画家の細野不二彦は自身の作品『ギャラリーフェイク』を通して「既成文化の盗作に過ぎない。日本のオタク文化に詳しくない外国人が、これらの作品の引用的要素をオリジナリティと勘違いして高く評価するのは当たり前」と非難している(細野はかつてスタジオぬえのスタッフだった)。また、漫画原作者である大塚英志は教授として就任した大学のトークショーにおいて「現代美術のパチモノの村上隆は尊敬はしないし、潰していく。我々の言うむらかみたかしは4コマまんがの村上たかしのことだ」と強く非難している。大塚は現代美術家がサブカルを安易に取り上げることや後述のリトルボーイ展の戦後日本人のメンタリティを無視した展示内容に強い不快感を持っている。また映画評論家の町山智浩も自身のブログで、「本来好きでもないのに、『電通的なマーケティング』でアニメ的手法を用いているのが許せない」「村上自身は『自分には表現すべきものがない』と言っているそうだが、本当は『自分は偉い』ということだけがテーマなのだ」と、村上を痛烈に批判している。精神科医の斎藤環は、「村上隆は日本のオタク文化のいいとこどりをしただけ」といった批判は根本的に
noise is not something you wash off
change is normal
learning, not training
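A minimal sketch of what "learning, not training" could look like in code, assuming a toy co-occurrence model: every new utterance updates the model immediately, so an unseen spelling such as "kewl" gets a representation the moment it appears, with no separate training phase. The class name, the window size, and the sample utterances are illustrative assumptions, not a description of any particular system.

```python
from collections import Counter, defaultdict


class IncrementalContextModel:
    """Toy model that keeps learning from a text stream (hypothetical sketch)."""

    def __init__(self, window: int = 2) -> None:
        self.window = window  # assumed context window, in tokens
        self.contexts = defaultdict(Counter)  # token -> counts of nearby tokens

    def observe(self, text: str) -> None:
        """Fold one new utterance into the model; nothing is retrained."""
        tokens = text.lower().split()
        for i, token in enumerate(tokens):
            lo = max(0, i - self.window)
            hi = min(len(tokens), i + self.window + 1)
            for j in range(lo, hi):
                if j != i:
                    self.contexts[token][tokens[j]] += 1

    def neighbours(self, token: str, n: int = 3) -> list:
        """Most frequent context tokens seen so far for this token."""
        counts = self.contexts.get(token.lower(), Counter())
        return [word for word, _ in counts.most_common(n)]


model = IncrementalContextModel()
model.observe("gr8 work rly gr8 work")
print(model.neighbours("kewl"))   # nothing known yet
model.observe("had a kewl day")
print(model.neighbours("kewl"))   # learned from a single observation
```

The design choice is that vocabulary is never fixed in advance: noisy or novel forms are treated as ordinary input rather than as errors to be cleaned away first.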
it's not search any more
it's not about the needle in the haystack
it's about the shape of the field over time
it's about situation awareness, not misplaced items
from features to tasks - where do we (as nlp professionals) want to contribute?
what is our expertise?
fiddling with parameters?
formulating features?
building tools?
understanding people?
understanding language?
note 1
any model we wish to deploy in practical use must prove its worth by improving something
• how to evaluate?
• quantitatively?
what do we want to achieve?
• product output quality?
• better coverage?
• product agility?
• explanatory power?
evaluation? how? what are our objective target functions? make them explicit!
research: explanatory height!
engineering: scalability!
industrial application: convenience and scale!
sales: revenue!
evaluation by
research: gold standards
engineering: unit tests, performance profiling, and SLA
sales: revenue
industrial application: profit
gold standard
test subjects
data
use case
system
practitioners
task
context
features
algorithm
parameters
clients
how can we stop our resource from becoming a wet blanket?
note 2
difficult vs easy tasks
why use computational methods and machinery for information access?
1 amount of data is overwhelming → reduce data complexity
let’s call these “simple” tasks
2 signal is weak and complex → peer closer into data
let’s call these “simple” tasks
data
note 3:
meteorology as a model
measure process, rather than outcome
a case in point
sentiment analysis
sentiment analysis is difficult and challenging
And the sound quality - my God!
Raymond left no room for error on his recordings and it shows.
Definitely one of the better tracks on the album.
Wow, could have been a expansion pack.
I loved The Spy Who Came In From The Cold but the movie is a bit dated in a way the book never will be.
Meat is more environmentally friendly than seafood.
I am unsure about the feasibility of this knitting pattern.
I love the Samsung B2710 but I would not recommend it to my colleagues.
I don't know if I should call her up – I liked her when I met her last weekend.
This is true.
this is why it's fun
but is it any good?
(in terms of the above discussion)
hypothesis
since human emotion is (likely to be) better represented by a dimensional model than by a categorical model, textual attitude should also be modelled dimensionally
can this be tested with our setup?
well-established basic emotions
anger, fear, sadness, enjoyment, disgust, surprise, contempt (recent addition)
candidate basic emotions
amusement, relief, excitement, shame, pride in achievement, guilt, embarrassment, contentment, awe, sensory pleasure
example of criticism:
where are jealousy and paternal love?
arousal
strength
valence
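A minimal sketch of the difference between the two representations, assuming the three axes named above (arousal, strength, valence) as coordinates. A categorical label can always be derived from a point in the space, but not the other way round. The value ranges and the prototype coordinates are invented for illustration and carry no empirical weight.

```python
from dataclasses import dataclass
import math


@dataclass(frozen=True)
class Attitude:
    """A point in a dimensional attitude space (hypothetical scales)."""
    valence: float   # negative .. positive, assumed range [-1, 1]
    arousal: float   # calm .. excited, assumed range [0, 1]
    strength: float  # weak .. strong, assumed range [0, 1]

    def distance(self, other: "Attitude") -> float:
        """Euclidean distance between two attitudes."""
        return math.dist(
            (self.valence, self.arousal, self.strength),
            (other.valence, other.arousal, other.strength),
        )


def nearest_category(attitude: Attitude, prototypes: dict) -> str:
    """Collapse a dimensional reading into the closest categorical label."""
    return min(prototypes, key=lambda name: attitude.distance(prototypes[name]))


# Invented prototype positions for three of the basic emotions listed earlier.
PROTOTYPES = {
    "anger": Attitude(valence=-0.8, arousal=0.9, strength=0.8),
    "sadness": Attitude(valence=-0.7, arousal=0.2, strength=0.5),
    "enjoyment": Attitude(valence=0.8, arousal=0.6, strength=0.6),
}

observed = Attitude(valence=-0.6, arousal=0.8, strength=0.7)
print(nearest_category(observed, PROTOTYPES))
```

Two texts that receive the same label can still sit far apart in the space, which is the information a purely categorical model discards.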
technology enabler: the big data stack
a semantic base technology?
is this an example of that?
are these two the same?
has this changed? how?
what is the relation of this and that?
is this a new way of saying that?
are these or those more like this?
is this typical or strange?
can we trust this?
does the author believe this to be true?
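Several of these questions ("are these two the same?", "has this changed?") can be read as similarity comparisons between usage profiles. A minimal sketch, assuming a bag-of-words context vector per term and cosine similarity as the comparison; both are stand-in choices for illustration, and the function names and sample texts are hypothetical.

```python
from collections import Counter
import math


def context_vector(term: str, texts: list) -> Counter:
    """Count the words that co-occur with `term` within the same text."""
    vector = Counter()
    for text in texts:
        tokens = text.lower().split()
        if term in tokens:
            vector.update(t for t in tokens if t != term)
    return vector


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity of two sparse count vectors; 0.0 if either is empty."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


# Invented sample texts, for illustration only.
texts = [
    "great work on the album",
    "gr8 work on the album",
    "meat is friendly",
]

# "are these two the same?": compare two terms over the same texts
print(cosine(context_vector("great", texts), context_vector("gr8", texts)))

# "has this changed?": compare one term over two time periods
earlier = ["gr8 work on the album"]
later = ["gr8 day at the beach"]
print(cosine(context_vector("gr8", earlier), context_vector("gr8", later)))
```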
evaluating marketing campaigns
giveaways to bloggers
traditional marketing campaign
giveaways to cosmetics subscribers
tracking violence in the world
kazakhstan stands out
unreported in western media, a violent altercation took place this week in 2012 in a court where protestors were sentenced to prison
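One way to make "stands out" concrete is to compare each region's latest count with its own baseline rather than with other regions. A minimal sketch, assuming weekly counts of violence-related mentions per region; the region labels, the counts, the threshold, and the floor on the spread are all invented for illustration.

```python
from statistics import mean, pstdev


def standout_score(history: list, current: int) -> float:
    """Deviation of the current count from the region's own baseline,
    in units of the historical spread (floored at 1.0 mention, an assumption)."""
    baseline = mean(history)
    spread = max(pstdev(history), 1.0)
    return (current - baseline) / spread


def standouts(weekly_counts: dict, threshold: float = 3.0) -> list:
    """Regions whose latest week lies far above their own history.

    Each list needs at least two weeks: the last entry is the current
    week, everything before it is the baseline.
    """
    flagged = []
    for region, counts in weekly_counts.items():
        *history, current = counts
        if standout_score(history, current) >= threshold:
            flagged.append(region)
    return flagged


# Invented counts: the last number in each list is the current week.
weekly_counts = {
    "high_volume_region": [40, 42, 38, 41, 43],
    "quiet_region": [2, 1, 3, 2, 15],
}
print(standouts(weekly_counts))
```

The quiet region is flagged even though its absolute volume is far lower, which is the point of monitoring the shape of the field rather than ranking by size.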
what are now our requirements?
learn, don't expect teaching
answer the right questions
(first, formulate those questions)
embrace change, analogy, and homeosemy
model what is similar between languages, not what is specific to them
aim for situation awareness, not classification as primary task
adjust evaluation metrics accordingly
measure process, not outcome on gold standard
note that sales figures are but one aspect of evaluation