Deep Distillation from Text

Naveen Ashish
University of Southern California & Cognie Inc.
March 18th 2014
This is about …
§ “DEEP TEXT DISTILLATION”
§ The hard nut of having computers “understand” natural language (text) …
§ Pushing the boundaries of what we can achieve …

"It's (the problem of computers understanding natural language) ambitious ... in fact there's no more important project than understanding intelligence and recreating it."
- Ray Kurzweil (2013)

"Alan Turing based the Turing Test entirely on written language … To really master natural language … that’s the key to the Turing Test–to a human requires the full scope of human intelligence. … So the point is that natural language is a very profound domain to do artificial intelligence in."
- Ray Kurzweil (2013)

Why …
search
big data analytics
health informatics
text analytics
social-media intelligence
§ the problem is far from solved … !

Introduction
§ About myself
  § Associate Professor (Informatics), Keck School of Medicine, University of Southern California
  § Cognie Inc.
§ Work leverages
  § Information extraction work and systems developed at UC Irvine
    § XAR, UCI-PEP
  § Advisory consulting engagements with several companies and start-ups

Outline
§ Deep distillation: What is and why
§ State-of-the-art
§ Fundamentals
§ Approach
§ Details
  § Expressions, Entities, Sentiment
§ Case studies
  § Retail, Health, Risk assessment
§ Conclusions

What is “Deep” text distillation?
Deep Distillation
I think you need better chefs → SUGGESTION
I used to take Lipitor for … → PERSONAL EXPERIENCE
The mocha is too sweet → NEGATIVE
The dim lights have a cozy effect … → AMBIENCE
§ The abstract, not explicitly mentioned!
§ What falls in this category
  § Expressions
  § Contextual sentiment
  § Aspect classification

A Common Intersection
§ Distill at sentence level
§ Aggregate to entire feedback, post, comment or thread
§ Three primary elements
  § Expression/Intent
  § Entities/Aspects (and Classes)
  § Sentiment

Why Deeper?
§ Goal: Get actionable insights from data!
§ Hypothesis: Deeper extraction → Better insights!
The top advice items advised for skin rash are aloe vera, vitamin E oil and oatmeal
Complaints comprise 36% of the overall feedback with top issues being slow service, drinks and coffee
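The step from sentence-level distillation to an aggregate insight such as "complaints comprise 36% of the overall feedback" can be sketched as below. This is only an illustration: the `DistilledSentence` record, its field names, the `aggregate` helper and the hand-assigned labels are all assumptions made for the example, not the actual data model of the platform.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class DistilledSentence:
    """One sentence plus the three primary elements (field names are illustrative)."""
    text: str
    expression: str                               # e.g. "COMPLAINT", "SUGGESTION"
    aspects: list = field(default_factory=list)   # e.g. ["service"]
    sentiment: str = "NEUTRAL"


def aggregate(sentences):
    """Roll sentence-level labels up into expression shares and aspect counts."""
    total = len(sentences)
    counts = Counter(s.expression for s in sentences)
    shares = {expr: n / total for expr, n in counts.items()} if total else {}
    aspects_by_expression = {}
    for s in sentences:
        aspects_by_expression.setdefault(s.expression, Counter()).update(s.aspects)
    return shares, aspects_by_expression


# Hand-labelled toy feedback; the labels are assumed, not classifier output.
feedback = [
    DistilledSentence("Service is slow", "COMPLAINT", ["service"], "NEGATIVE"),
    DistilledSentence("The mocha is too sweet", "COMPLAINT", ["drinks"], "NEGATIVE"),
    DistilledSentence("Wait time is over an hour", "COMPLAINT", ["service"], "NEGATIVE"),
    DistilledSentence("I think you need better chefs", "SUGGESTION", ["food"]),
]

shares, aspects = aggregate(feedback)
print(shares["COMPLAINT"])                   # share of complaints in the feedback
print(aspects["COMPLAINT"].most_common(1))   # most frequent complaint aspect
```

Keeping the per-sentence records separate from the roll-up means the same distilled output can feed different summaries (by expression, by aspect, by sentiment) without re-running extraction.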
Context
[Diagram labels: UCI-PEP, XAR, COGNIE™, SHIP, Survey Analytics, Risk Assessment, Retail Analytics]
§ COGNIE™: A PLATFORM for text analytics

Expressions
§ Beyond entities and sentiment: EXPRESSIONS
§ EXPRESSIONS
  § Introduced in [Ashish et al, 2011]

Expressions
HEALTH
You should try Vitamin E oil … → ADVICE
..I have had arthritis since 1991… → EXPERIENCE
..for me lipitor worked like a charm… → OUTCOME
Expressions
RETAIL/ENTERPRISE
…showers had no hot water !… → COMPLAINT
..you should have more veggie options… → SUGGESTION
..meats on special this weekend… → ANNOUNCEMENT
..this is the best store on the west side… → ADVOCACY
RISK ASSESSMENT
There is hardly any evidence to suggest a link between salt and diabetes →
These results confirm that high intake of salt leads to increase in BP → +
The Landscape
Text Analytics Spectrum
§ Wide offering of
  § Text analytics engines
  § Text analysis tools, many open-source
§ Largely still for “spotting things”
  § entities, concepts, sentiment, topics, emotions …
§ Going deeper
  § Luminoso
  § Attensity (Intents)
  § Deep Learning for Sentiment
    § Stanford
    § Recursive Neural Networks

Approach
Approach
machine learning
natural language processing
semantics
Architecture: COGNIE™ Platform
Knowledge Engineering
Existing (DMOZ, SNOMED, UMLS)
Creation
NLP
Declarative
Segmentation
POS Tagging
Entity extraction
Anaphora
Parsing
Gram analysis
Machine Learning
Naïve-Bayes
MaxEnt
TFIDF
CRF
RNN Deep Learning
ENSEMBLE
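The architecture lists several classifiers feeding an ENSEMBLE stage, but the slides do not say how their outputs are combined. As one possible reading, a simple majority vote could look like the sketch below; the `ensemble_vote` function and the classifier names in the usage are hypothetical and chosen only for illustration.

```python
from collections import Counter


def ensemble_vote(predictions):
    """Return (label, vote_share) by majority vote over per-classifier labels.

    Majority voting is an assumption here; the slides only name an ENSEMBLE stage.
    """
    if not predictions:
        raise ValueError("need at least one prediction")
    counts = Counter(predictions)
    label, votes = counts.most_common(1)[0]
    return label, votes / len(predictions)


# Hypothetical outputs of three classifiers for "…showers had no hot water !…"
predictions = {
    "naive_bayes": "COMPLAINT",
    "maxent": "COMPLAINT",
    "crf": "SUGGESTION",
}
label, share = ensemble_vote(list(predictions.values()))
print(label, round(share, 2))
```

The vote share gives a rough agreement score alongside the winning label; weighted voting or stacking would be equally plausible alternatives.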
The Indicators: “Give Aways”
…showers had no hot water !…
COMPLAINT
(You) should have more veggie options…
SUGGESTION
..i have been on lipitor…
EXPERIENCE
..this is the best store on the west side…
ADVOCACY
§ A combination of multiple types of elements!

Approach: Given Indicators
§ NLP
  § Identification of individual elements
    § Unsupervised
  § Relationships between elements
§ Semantics
  § Identification of individual elements
    § Knowledge driven
§ Machine Learning Classification
  § Combine elements → classify

Natural Language Processing
§ UIMA and GATE
§ Stanford NLP Tools
  § POS tagging
  § Parsing
  § NE Recognizer
  § Geo-tagger
  § …

Natural Language Processing
§ Text Segmentation
  § In many cases the “unit” of distillation is a sentence
§ Segmentation
  § UIMA (or GATE)
  § Custom
§ Complex sentence segmentation
  § Breakup into individual clauses

NLP
§ Part-of-speech tags are key indicators
  § Expression distillation
§ Entity extraction
  § Names, Locations, Organizations
§ Parsing
  § If required
§ Anaphora

NGram Analysis
§ Unigram and Bigram analysis
§ Obtain
  § Grams
  § Frequency
  § Entropy
§ Grams of tokens as well as POS Patterns
  § VB VBD

Before Automated Classification: Manual Patterns
§ SoL: Sequences of Labels
§ Labels
  § LEX-FOODADJ
    § spicy
  § LEX-EXCESS
    § too, very
  § ONT-FOOD
  § POS-NOUN
§ Sequences (Patterns)
  § ANY LEX-EXCESS LEX-FOODADJ ANY →
  § POS-VB POS-MD …

Classification: Machine Learning
§ Classification tasks
  § Expression
  § (Contextual) Sentiment
  § Aspect category
§ Frameworks
  § Weka
  § Mallet

Baseline Classifiers
§ Mallet and Weka
  § NaiveBayes
  § MaxEnt
  § CRF
§ Gram-based
  § Uni, Bi and Trigram features
§ Baseline
  § ~ 10% accuracy

Expression Classification: Features
§ Features
  § Polar words
  § Punctuations
  § Ngrams
  § POS patterns
  § Length
  § Beginning
  § Ontology
  § …

Classifiers
§ Trees
  § Decision Tree (J48)
§ Functions
  § Logistic Regression
  § SVM
§ Sequence Tagging
  § CRF: Conditional Random Fields

Expression Classification: Results
§ Have achieved 75% precision and recall for all expressions considered
§ Factors
  § Feature engineering
  § Classifier selection
  § Knowledge engineering

Contextual Sentiment
The mocha is too sweet
Wait time is over an hour
Aisles are too narrow
Service is slow
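Examples like "The mocha is too sweet" are the kind of sentence the manual "Sequences of Labels" patterns from the earlier slide target. A minimal sketch of matching the pattern `ANY LEX-EXCESS LEX-FOODADJ ANY` is given below. Several things are assumptions: `ANY` is treated as a wildcard for zero or more tokens, the lexicon entries other than "spicy", "too" and "very" are invented for the example, and because the slide leaves the pattern's output label blank, the sketch only reports whether the pattern matches.

```python
# Tiny label lexicon. "spicy", "too" and "very" come from the slide;
# "sweet" and "mocha" are hypothetical additions for this example.
LEXICON = {
    "LEX-EXCESS": {"too", "very"},
    "LEX-FOODADJ": {"spicy", "sweet"},
    "ONT-FOOD": {"mocha"},
}


def labels_for(token):
    """Return the set of labels that apply to a token (case-insensitive)."""
    word = token.lower()
    return {label for label, words in LEXICON.items() if word in words}


def matches(pattern, tokens):
    """True if the label sequence `pattern` covers the whole token list.

    ANY is assumed to match zero or more arbitrary tokens.
    """
    if not pattern:
        return not tokens
    head, rest = pattern[0], pattern[1:]
    if head == "ANY":
        return any(matches(rest, tokens[i:]) for i in range(len(tokens) + 1))
    return bool(tokens) and head in labels_for(tokens[0]) and matches(rest, tokens[1:])


PATTERN = ["ANY", "LEX-EXCESS", "LEX-FOODADJ", "ANY"]

print(matches(PATTERN, "The mocha is too sweet".split()))   # excess word + food adjective
print(matches(PATTERN, "The mocha is sweet".split()))       # no excess word
```

Here the first sentence matches and the second does not, even though both contain the same adjective: the signal comes from the combination of labels rather than from a single polar word.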
§ (Just) polar words can be misleading!
§ Polar words may not be present at all!
§ Combination of elements

Semantics: Ontologies
§ Health
  § Drugs
  § Conditions
  § Procedures
  § Symptoms
  § …
§ Retail (Dining)
  § Food/Entrees
  § Service
  § Ambience
  § …

Leverage Existing Knowledge Sources
§ Health informatics
  § UMLS
  § NCI Thesaurus
  § SNOMED
§ Retail
  § DMOZ
§ Many other
  § Freebase
  § Wikipedia, DBPedia
  § OpenData
    § data.gov

Knowledge Engineering Tools
§ “Mini” ontology creation
§ API access
  § Freebase
  § BioPortal
§ Wrappers
  § DMOZ, …

Practical Requirements
§ Confidence Measures
  § Below threshold routed to manual transcription teams
§ Polarity
§ Snippets

Open-Source Leverage
COGNIE™: Open Source Tools
§ Framework
  § UIMA
§ Classification
  § Weka
  § Mallet
§ NLP
  § Stanford tools
§ Indexing
  § Lucene
§ Databases
  § MySQL, MongoDB
§ Knowledge Engineering
  § Protégé

Select Case Studies
Case Study: Health Informatics
Insights from, for, by Patients

Distillation

Case Study: Retail & Survey Analytics
…food was awesome, service needs improvement …
you need to be open longer !
§ Feedback
  § Direct, device collected
  § Social-media
§ Typically short, few sentences
§ Strong requirement for aspect classification
  § [Food, Service, Ambience, Pricing, Other]
§ Negative: “Immediate” vs “Long Term” classification

Case Study: Risk Assessment
§ Biomedical Literature Abstracts
  § Correlation direction (+ -)
  § Subject
  § Article type
§ Features
  § Clauses
  § Negation and Triggers
§ Semantic Heterogeneity

Performance
MapReduce
§ Throughput can be an issue
  § Complex language processing algorithms
  § Large ontologies in some cases
§ Hadoop MapReduce
  § [Kahn and Ashish, 2014]

Conclusions
Conclusions
§ Deeper distillation from text is important
§ Can be achieved by
  § Detecting and combining multiple elements in text
  § Feature engineering
  § Knowledge engineering
  § Classifier selection
§ Does not have to be perfect
§ Every domain, dataset has its nuances

thank you !