Deep Distillation from Text
Naveen Ashish
University of Southern California & Cognie Inc., March 18th 2014

This is about ...
§ “DEEP TEXT DISTILLATION”
§ The hard nut of having computers “understand” natural language (text) ...
§ Pushing the boundaries of what we can achieve ...

"It's (the problem of computers understanding natural language) ambitious ... in fact there's no more important project than understanding intelligence and recreating it."
- Ray Kurzweil (2013)

"Alan Turing based the Turing Test entirely on written language ... To really master natural language ... that’s the key to the Turing Test–to a human requires the full scope of human intelligence. ... So the point is that natural language is a very profound domain to do artificial intelligence in."
- Ray Kurzweil (2013)

Why ...
search, big data analytics, health informatics, text analytics, social-media intelligence
§ the problem is far from solved ... !!!!

Introduction
§ About myself
  § Associate Professor (Informatics), Keck School of Medicine, University of Southern California
  § Cognie Inc.
§ Work leverages
  § Information extraction work and systems developed at UC Irvine
    § XAR, UCI-PEP
  § Advisory consulting engagements with several companies and start-ups

Outline
§ Deep distillation: What is and why
§ State-of-the-art
§ Fundamentals
  § Approach
§ Details
  § Expressions, Entities, Sentiment
§ Case studies
  § Retail, Health, Risk assessment
§ Conclusions

What is “Deep” text distillation?

Deep Distillation
I think you need better chefs → SUGGESTION
I used to take Lipitor for ... → PERSONAL EXPERIENCE
The mocha is too sweet → NEGATIVE
The dim lights have a cozy effect ... → AMBIENCE
§ The abstract, not explicitly mentioned!
§ What falls in this category
  § Expressions
  § Contextual sentiment
  § Aspect classification

A Common Intersection
§ Distill at sentence level
  § Aggregate to entire feedback, post, comment or thread
§ Three primary elements
  § Expression/Intent
  § Entities/Aspects (and Classes)
  § Sentiment

Why Deeper?
§ Goal: Get actionable insights from data!
§ Hypothesis: Deeper extraction → Better insights!
The top advice items advised for skin rash are aloe vera, vitamin E oil and oatmeal
Complaints comprise 36% of the overall feedback with top issues being slow service, drinks and coffee

Context
[Diagram labels: UCI-PEP, XAR, COGNIE TM, SHIP, SURVEY ANALYTICS, RISK ASSESSMENT, RETAIL ANALYTICS]
§ COGNIE TM: A PLATFORM for text analytics

Expressions
§ Beyond entities and sentiment: EXPRESSIONS
§ EXPRESSIONS
  § Introduced in [Ashish et al, 2011]

Expressions: HEALTH
You should try Vitamin E oil ... → ADVICE
..I have had arthritis since 1991... → EXPERIENCE
..for me lipitor worked like a charm... → OUTCOME

Expressions: RETAIL/ENTERPRISE
...showers had no hot water !... → COMPLAINT
..you should have more veggie options... → SUGGESTION
..meats on special this weekend... → ANNOUNCEMENT
..this is the best store on the west side... → ADVOCACY

RISK ASSESSMENT
There is hardly any evidence to suggest a link between salt and diabetes →
These results confirm that high intake of salt leads to increase in BP → +

The Landscape

Text Analytics Spectrum
§ Wide offering of
  § Text analytics engines
  § Text analysis tools (many open-source)
§ Largely still for “spotting things”
  § entities, concepts, sentiment, topics, emotions ...
§ Going deeper
  § Luminoso
  § Attensity (Intents)
§ Deep Learning for Sentiment
  § Stanford
  § Recursive Neural Networks

Approach
machine learning, natural language processing, semantics

Architecture: COGNIE TM Platform
[Diagram labels, grouped by component]
Knowledge Engineering: Existing (DMOZ, SNOMED, UMLS), Creation
NLP: Declarative, Segmentation, POS Tagging, Entity extraction, Anaphora, Parsing, Gram analysis
Machine Learning: Naïve-Bayes, MaxEnt, TFIDF, CRF, RNN, Deep Learning
ENSEMBLE

The Indicators: “Give Aways”
...showers had no hot water !... COMPLAINT
(You) should have more veggie options... SUGGESTION
..i have been on lipitor... EXPERIENCE
..this is the best store on the west side... ADVOCACY
§ A combination of multiple types of elements!
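
The point that an expression is given away by a combination of element types can be illustrated with a minimal sketch. It assumes small hand-written word lists standing in for real POS tags, ontologies and trained classifiers; the lexicon contents and the names `LEXICON`, `RULES`, `indicators` and `classify` are hypothetical and do not come from the talk or the COGNIE platform.

```python
import re

# Hypothetical mini-lexicons for illustration only; they are not from the talk.
LEXICON = {
    "MODAL": {"should", "could", "ought"},
    "NEGATOR": {"no", "not", "never", "hardly"},
    "SUPERLATIVE": {"best", "greatest", "finest"},
    "FIRST_PERSON": {"i", "we", "my"},
    "PERFECT_AUX": {"have", "had", "been"},
}

# Each rule fires only when ALL of its indicator types are present.
# Rules are checked in order and the first match wins.
RULES = [
    ("SUGGESTION", {"MODAL"}),
    ("EXPERIENCE", {"FIRST_PERSON", "PERFECT_AUX"}),
    ("ADVOCACY", {"SUPERLATIVE"}),
    ("COMPLAINT", {"NEGATOR"}),
]


def indicators(sentence):
    """Return the set of indicator types whose words occur in the sentence."""
    tokens = set(re.findall(r"[a-z']+", sentence.lower()))
    return {name for name, words in LEXICON.items() if tokens & words}


def classify(sentence):
    """Label a sentence with the first rule whose indicators are all found."""
    found = indicators(sentence)
    for label, required in RULES:
        if required <= found:
            return label
    return "OTHER"


if __name__ == "__main__":
    examples = [
        "showers had no hot water !",
        "(You) should have more veggie options",
        "i have been on lipitor",
        "this is the best store on the west side",
    ]
    for text in examples:
        print(f"{text!r} -> {classify(text)}")
```

The EXPERIENCE rule is the one that needs two indicator types at once: "had" alone also appears in the complaint sentence, so only its co-occurrence with a first-person word separates the two. Hand-written rules like these are brittle, which is one reason to combine such indicators as features in a trained classifier.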
Approach: Given Indicators
§ NLP
  § Identification of individual elements
  § Unsupervised
  § Relationships between elements
§ Semantics
  § Identification of individual elements
  § Knowledge driven
§ Machine Learning Classification
  § Combine elements → classify

Natural Language Processing
§ UIMA and GATE
§ Stanford NLP Tools
  § POS tagging
  § Parsing
  § NE Recognizer
  § Geo-tagger
  § ...

Natural Language Processing
§ Text Segmentation
  § In many cases the “unit” of distillation is a sentence
§ Segmentation
  § UIMA (or GATE)
  § Custom
§ Complex sentence segmentation
  § Break up into individual clauses

NLP
§ Part-of-speech tags are key indicators
  § Expression distillation
§ Entity extraction
  § Names, Locations, Organizations
§ Parsing
  § If required
§ Anaphora

NGram Analysis
§ Unigram and Bigram analysis
§ Obtain
  § Grams
  § Frequency
  § Entropy
§ Grams of tokens as well as POS patterns
  § VB VBD

Before Automated Classification: Manual Patterns
§ SoL: Sequences of Labels
§ Labels
  § LEX-FOODADJ
    § spicy
  § LEX-EXCESS
    § too, very
  § ONT-FOOD
  § POS-NOUN
§ Sequences (Patterns)
  § ANY LEX-EXCESS LEX-FOODADJ ANY →
  § POS-VB POS-MD ...

Classification: Machine Learning
§ Classification tasks
  § Expression
  § (Contextual) Sentiment
  § Aspect category
§ Frameworks
  § Weka
  § Mallet

Baseline Classifiers
§ Mallet and Weka
  § NaiveBayes
  § MaxEnt
  § CRF
§ Gram-based
  § Uni, Bi and Trigram features
§ Baseline
  § ~ 10% accuracy

Expression Classification: Features
§ Features
  § Polar words
  § Punctuation
  § Ngrams
  § POS patterns
  § Length!
  § Beginning
  § Ontology
  § ...

Classifiers
§ Trees
  § Decision Tree (J48)
§ Functions
  § Logistic Regression
  § SVM
§ Sequence Tagging
  § CRF: Conditional Random Fields

Expression Classification: Results
§ Have achieved 75% precision and recall for all expressions considered
§ Factors
  § Feature engineering
  § Classifier selection
  § Knowledge engineering

Contextual Sentiment
The mocha is too sweet
Wait time is over an hour
Aisles are too narrow
Service is slow
§ (Just) polar words can be misleading!
§ Polar words may not be present at all!
§ Combination of elements

Semantics: Ontologies
§ Health
  § Drugs
  § Conditions
  § Procedures
  § Symptoms
  § ...
§ Retail (Dining)
  § Food/Entrees
  § Service
  § Ambience
  § ...

Leverage Existing Knowledge Sources
§ Health informatics
  § UMLS
  § NCI Thesaurus
  § SNOMED
§ Retail
  § DMOZ
§ Many others
  § Freebase
  § Wikipedia, DBpedia
§ OpenData
  § data.gov

Knowledge Engineering Tools
§ “Mini” ontology creation
§ API access
  § Freebase
  § BioPortal
§ Wrappers
  § DMOZ, ...

Practical Requirements
§ Confidence Measures
  § Below threshold routed to manual transcription teams
§ Polarity
§ Snippets

Open-Source Leverage

COGNIE TM: Open Source Tools
§ Framework
  § UIMA
§ Classification
  § Weka
  § Mallet
§ NLP
  § Stanford tools
§ Indexing
  § Lucene
§ Databases
  § MySQL, MongoDB
§ Knowledge Engineering
  § Protégé

Select Case Studies

Case Study: Health Informatics
Insights from, for, by Patients
Distillation

Case Study: Retail & Survey Analytics
...food was awesome, service needs improvement ... you need to be open longer!
§ Feedback
  § Direct, device collected
  § Social-media
§ Typically short, few sentences
§ Strong requirement for aspect classification
  § [Food, Service, Ambience, Pricing, Other]
§ Negative: “Immediate” vs “Long Term” classification

Case Study: Risk Assessment
§ Biomedical Literature Abstracts
  § Correlation direction (+ -)
  § Subject
  § Article type
§ Features
  § Clauses
  § Negation and Triggers
§ Semantic Heterogeneity

Performance

MapReduce
§ Throughput can be an issue
  § Complex language processing algorithms
  § Large ontologies in some cases
§ Hadoop MapReduce
  § [Kahn and Ashish, 2014]

Conclusions
§ Deeper distillation from text is important
§ Can be achieved by
  § Detecting and combining multiple elements in text
  § Feature engineering
  § Knowledge engineering
  § Classifier selection
§ Does not have to be perfect
§ Every domain, dataset has its nuances

thank you!