Presentation Transcript (Added: August 27, 2007. License: All Rights Reserved.)
Slide1: Natural Language Processing at the limit between symbolic and numeric processing
Gregory Grefenstette
RFIA'2002, 13e Congrès Francophone AFRIF-AFIA de Reconnaissance des Formes et Intelligence Artificielle, Angers, 8-10 January 2002
Slide2: Two approaches to treating information from language
Great Divide
AI and Natural Language Processing
Information Retrieval systems and Language
Recent Marriage
Two IR systems incorporating AI
Question Answering: retrieving the right answer
Affect Analysis: retrieving emotive content
Noble Goal: How to Improve Yield?
Improve harvesting
Improve grain
Optimizing Yield: Scale
Slide5: [Diagram labels: Computational Linguistics; DB; AI stream; IR stream; Expert systems; Search Engines; Machine translation; Summarization; Classification]
Slide6: [Diagram labels: Computational Linguistics; DB; AI stream; IR stream; Expert systems; Search Engines; Machine translation; Summarization; Classification; Question Answering]
Two Approaches to Meaning: Origins
Artificial Intelligence Paradigm
Information Retrieval Paradigm
AI's Paradigm: Robot
See, Hear, Move, Manipulate, React, Understand
Role of Language in AI:
Command and Control
Correct analysis
Logical Inference
World Model
Database interface
SHRDLU (Winograd, 1972):
world model
SHRDLU (Winograd, 1972):
Put the pyramid on the block on the blue block
reasoning
Conceptual Dependency (Schank, 1975):
John shot Mary with a gun
knowledge representation
Logical Inference:
ontology
Expert Systems (MYCIN, 1972):
Computer can answer questions about its reasoning
AI presumptions for NLP:
Every word indicates part of a logical structure
Mapping between relations in text and logic
Language eases programming and control
Precision, accuracy, completeness
Ontology, Knowledge Representation, Inference
Symbolic Processing
IR's Paradigm: Library
Human, Info Need, Search, Redundancy, Relevance
IR presumptions for NLP:
Every word is a new feature
Every word describes an independent axis in space
Words are weighted by numbers
Some words are useless
Syntax does not matter
One technique applies to all documents
Statistical Processing
Ontology, Knowledge Representation, Inference
Morphology and role are meaningless:
Suffixes stripped to a single stem (e.g. SMART)
computer, compute, computerization, computerize, computation, computational, computor -> comput
Semantics comes from Frequency:
Term Weighting
Rarer words should have more weight: inverse document frequency $\log(N/n)$
N -- number of documents in the collection
n -- number of documents in which the term appears
e.g. suppose 300,000 documents
word 'most' appears in 93,000 documents -- weight 1.2
word 'negative' appears in 5,000 documents -- weight 4.1
Words appearing more frequently in query and document should be more important: term frequency
Tempered by raw frequency, or square root, or log
Similarity: Cosine measure, not Inferred
Each document is indexed by stems: Document $D_i = (d_{i1}, d_{i2}, d_{i3}, \ldots, d_{it})$
The current query is reduced to stems: Query $Q_j = (q_{j1}, q_{j2}, q_{j3}, \ldots, q_{jt})$
Imagine each stem as an axis
distance along axis = frequency of term
the query vector and document vector describe points
distance (similarity) between points:
$\mathrm{cosine}(D_i, Q_j) = \dfrac{\sum_k d_{ik}\, q_{jk}}{\sqrt{\left(\sum_k d_{ik}^2\right)\left(\sum_k q_{jk}^2\right)}}$
Meaning is a Position in Space:
[Figure: doc1 and doc2 plotted as points in a space whose axes are words (axis labels word2, word3)]
Texts become bags of words:
doc1 -> …ant… ant… bee
doc2 -> …bee …hog …ant… dog
doc3 -> …cat …pig ….eel ...dog …eel … fox
words  ant  bee  cat  dog  eel  fox  pig  hog
doc1   2    1    0    0    0    0    0    0
doc2   1    1    0    1    0    0    0    1
doc3   0    0    1    1    2    1    1    0
Example (continued):
       doc1  doc2  doc3
doc1   1     0.71  0
doc2   0.71  1     0.22
doc3   0     0.22  1
Similarity of documents in example: Similarity measures the occurrences of terms, but no other characteristics of the documents.
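The log(N/n) weighting and the three-document example above can be checked with a short sketch. This is an illustration, not the SMART implementation: the function names (`idf`, `cosine`, `binarize`) are hypothetical, and the natural logarithm is an assumption chosen because it reproduces the slide's weights of 1.2 and 4.1. The similarity figures on the slide (0.71 and 0.22) come out when each document is reduced to presence/absence of terms; with the raw counts from the table the same formula gives 0.67 and 0.18.

```python
import math
from collections import Counter


def idf(total_docs: int, docs_with_term: int) -> float:
    """Inverse document frequency log(N/n); natural log is an assumption."""
    return math.log(total_docs / docs_with_term)


def cosine(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two bag-of-words vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values()) *
                     sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def binarize(bag: Counter) -> Counter:
    """Keep only presence/absence of each term (weight 1 per distinct term)."""
    return Counter({term: 1 for term in bag})


# The three toy documents from the slide, as bags of words.
docs = {
    "doc1": Counter("ant ant bee".split()),
    "doc2": Counter("bee hog ant dog".split()),
    "doc3": Counter("cat pig eel dog eel fox".split()),
}

if __name__ == "__main__":
    print(round(idf(300_000, 93_000), 1))   # 'most'     -> 1.2
    print(round(idf(300_000, 5_000), 1))    # 'negative' -> 4.1

    binary = {name: binarize(bag) for name, bag in docs.items()}
    print(round(cosine(binary["doc1"], binary["doc2"]), 2))  # 0.71
    print(round(cosine(binary["doc2"], binary["doc3"]), 2))  # 0.22
    print(round(cosine(docs["doc1"], docs["doc2"]), 2))      # 0.67 with raw counts
```

Note that nothing in the computation looks at word order or syntax, which is the point the slides make about the IR view of text.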
IR, an Experimental Science:
Basic Premises in a free text IR experiment
Query: natural language expression of interest
Document Data Base
Relevance: binary, 'this document is relevant to this query' (not usefulness)
Objective IR Testbed:
ftp://ftp.cs.cornell.edu/pub/smart/med
queries:
.I 6
.W
ventricular septal defect occurring in association with aortic regurgitation
.I 7
.W
radioisotopes in heart scanning. mainly used in diagnosis of pericardial effusions. also used to study tumors, heart enlargement, aneurysms and pericardial thickening. technetium, rihsa, radioactive hippurate, cholegraffin are used.
.I 8
.W
the effects of drugs on the bone marrow of man and animals, specifically the effect of pesticides. also, the significance of bone marrow changes.
qrels (query number, relevant document number):
5 332, 5 333
6 112, 6 115, 6 116, 6 118, 6 122, 6 238, 6 239, 6 242, 6 260, 6 309, 6 320, 6 321, 6 323
7 92, 7 121, 7 189, 7 247, 7 261, 7 382, 7 385, 7 386, 7 387, 7 388, 7 389, 7 390, 7 391, 7 392, 7 393
8 52, 8 60
documents (truncated on the slide):
conditions .
.I 237
cisternal fluid oxygen ... using a beckman micro-oxyg.. tension simultaneously in the.. and in arterial blood under.. that the cisternal oxygen.. oxygen tension of the surroun. the available free oxygen... duration in the cerebral...
.I 238
ventricular septal defect obstruction . a case of ventricular... lesion and infundibular... coronary cusp of the aortic.. septal defect, was demonstra.. as a polyp-like mass in the... catheterization and angiocard ventricular outflow obstr...
.I 239
functional adaptations of the congenital heart disease .... functional adaptations in... been discussed in relation to stenosis...
recall/precision:
ftp://ftp.cs.cornell.edu/pub/smart/med
queries: (same queries 6-8 as on the previous slide)
qrels:
5 332, 5 333
6 112, 6 115, 6 116, 6 118, 6 122, 6 238, 6 239, 6 242, 6 260, 6 309, 6 320, 6 323
7 92, 7 121, 7 189, 7 247, 7 261, 7 382, 7 385, 7 386, 7 387, 7 388, 7 389, 7 390, 7 391, 7 392, 7 393
8 52, 8 60
documents returned by sample IR system for query 6:
score  query  document
0.99   6      238
0.96   6      98
0.92   6      115
0.87   6      117
0.78   6      242
0.78   6      323
0.45   6      122
0.36   6      350
0.35   6      259
0.23   6      118
0.16   6      256
at 50% recall (6 out of 12 found), 60% precision (6 out of 10 correct)
Recall and Precision Plot:
[Plot: precision (0 to 1.00) against recall (0 to 1.00)]
recall  precision
.10     0.737
.20     0.654
.30     0.639
.40     0.610
.50     0.570
.60     0.517
.70     0.478
.80     0.394
.90     0.261
Avg     0.540
Most Successful IR Improvement Technique:
Not use of thesauri
Not use of natural language processing
Not use of conceptual typing
Winner: Relevance Feedback
Run query over collection
Retain top 20 (relevant) documents
Fold new words back into original query
Re-run expanded query
TREC -- Text REtrieval Conferences, IR motor:
1992: National Institute of Standards and Technology (NIST) and Defense Advanced Research Projects Agency (DARPA)
increased research in IR, on large-scale test collections
increased communication academia-industry-government
tech transfer between research labs and commercial products
state-of-art showcase of retrieval methods
improved evaluation techniques
Common Task Method
training data given for a few months
new task data, results returned after a few weeks
independent evaluation
results revealed at conference time
only participating systems attend conference (pay-to-play)
TREC Databases - Real World Sizes:
collection                    MBytes  Average Nb Terms  Total Records
Federal Register (94)         283     456               55,554
IR Digest                     7       2383              455
News Groups                   237     340               102,598
Associated Press (88,90)      482     472               158,240
Federal Register (88)         211     1398              19,860
Ziff Communications           552     315               75,180
San Jose Mercury News (91)    290     285               90,257
U. S. Patents (93)            245     4777              6,711
Wall Street Journal (90-92)   247     377               74,520
Fifty New Queries a year -- Average Length 60 full words
Slide31: [Diagram labels: Bag of words; Interface; Scale; Influence; AI]
TREC sponsored tracks drive Information Retrieval
Information Retrieval finally welcomes AI (timidly):
Question Answering Track
Structure of Test
Typical Systems
QALC
Microsoft
Entity ontologies, relations
Future of typing
Typing Lexicons
Example, Affect Analysis
QA Track:
Number of Documents: 979,000
Megabytes of Text: 3033
Document Sources: AP, WSJ, Financial Times, San Jose Mercury News, LA Times, FBIS
Number of Questions: 682
Question Sources: Encarta log, Excite log
Sample Questions:
How much folic acid should an expectant mother get daily?
Who invented the paper clip?
What university was Woodrow Wilson president of?
Where is Rider College located?
Name a film in which Jude Law acted.
Where do lobsters like to live?
TREC Question Answering track:
Types of Questions:
Factoids: When did X?
Lists: What countries did X visit?
Definitions: What is X?
Evaluation over Real-world Testbed
Slide35:
Slide36:
Slide37:
QA Evaluation measure:
Reciprocal Ranking Scheme: the score for a question is 1/R, where R is the rank of the first correct answer in the list.
Q: What is the capital of Texas?
A1: Dallas
A2: Austin
A3: Fort Worth
A4: Denton
A5: Austin
The score for question Q would be 1/2.
Typical Systems:
Query Typing (what are we looking for?)
Traditional IR on rest of query
Select Best Passages in most relevant documents
Search for Desired Answer Entity Types
Score Entities found by: Frequency; Closeness to other query words
Return Entities Ranked
Slide40:
Query/Answer Typing (QALC, Ferret, 2001):
Question: Who developed the Macintosh Computer? Named Entity List = PERSON, ORGANIZATION
Question: What metal has the highest melting point? General Type = metal
Question: What is the name of the chocolate company in San Francisco? Named Entity List = ORGANIZATION; General Type = company
Question: What does a defibrillator do? Category = WhatDoNP
Question: When was Rosa Parks born? Category = WhenBePNborn
Query/Answer Typing (QALC, Ferret, 2001):
Question: What metal has the highest melting point? General Type = metal
After typing, Traditional IR on rest of query:
Question: What metal has the highest melting point? General Type = metal
Entity Extraction from Best Passages:
…new element which they called rhenium in honor of the Rhine River. A year after its discovery they prepared the first gram of the new metal from 660,000 grams of molybdenite ore. Rhenium, with a melting point of 3180 ºC, has the highest melting point next to tungsten. Only osmium, iridium, and platinum exceed its density of 21.04 g/cc. Because of its high melting point, rhenium is a refractory metal. In that classification, rhenium is unique. It is the only refractory metal that does not form carbides. Its crystallographic structure is hexagonal close-packed (hcp), while other refractory metals have a body centered cubic (bcc) structure. Rhenium does not have a ductile-to-brittle transition temperature. In other words it maintains its ductility from absolute zero all the way to its melting point. Rhenium also has a high modulus of elasticity.
rhenium, tungsten, osmium, iridium, platinum, density, classification, carbides
Entity Type of Desired Answer:
Question: What metal has the highest melting point? General Type = metal
rhenium, tungsten, osmium, iridium, platinum, density, classification, carbides
Entity Typing Lexicon and Ontology (WordNet):
rhenium, Re, atomic number 75
=> metallic element, metal
=> chemical element, element
=> substance, matter
=> object, physical object
=> entity, something
tungsten, wolfram, W, atomic number 74
=> metallic element, metal
=> chemical element, element
=> substance, matter
=> object, physical object
=> entity, something
Entity Scoring, Ranking:
tungsten 105
platinum 35
osmium 7
rhenium 5
iridium 4
Altavista Advanced Search: 'highest melting point' near X
rhenium, tungsten, osmium, iridium, platinum, density, classification, carbides
QALC:
Statistical: Frequency, Independence
Symbolic: Ontology, Syntax
Microsoft Approach - Web based QA:
Question Transformation
What is the capital of Texas
'is the capital of Texas'
'the capital of Texas is'
'capital of Texas is the'
'of Texas is the capital'
'Texas is the capital of'
Answer gathering
Gather summary text
Eliminate stopwords
Count 1-, 2-, 3-grams
Question/Answer typing
[Diagram label: WWW]
Slide50:
Question Answering Systems:
Best Question Answering Systems
Richest Question typing
Richest Typed Lexicons
Richest Ontology
--- Most AI
Second IR system with AI traits:
Another application of Typed Lexicons
Another AI dream
Recognizing Emotion
Affect Analysis
Assessing Affect (In N Dimensions):
Affect Lexicon:
Affect Thesaurus:
admiration sn attraction 0.80 0.50
admire vb attraction 0.80 0.50
…
dazzle vb attraction 0.80 0.90
...
magnetism sn attraction 1.00 0.50
adoration sn love 0.90 1.00
adore vb love 0.90 1.00
…
dazzle vb love 0.90 1.00
...
passionate adj love 0.70 0.90
attraction love 0.80
Slide56:
Luis Bunuel's The Exterminating Angel (1962) is a macabre comedy, a mordant view of human nature that suggests we harbor savage instincts and unspeakable secrets. Take a group of prosperous dinner guests and pen them up long enough, he suggests, and they'll turn on one another like rats in an overpopulation study. Bunuel begins with small, alarming portents. The cook and the servants suddenly put on their coats and escape, just as the dinner guests are arriving. The hostess is furious; she planned an after-dinner entertainment involving a bear and two sheep. Now it will have to be canceled. It is typical of Bunuel that such surrealistic touches are dropped in without comment. The dinner party is a success. The guests whisper slanders about each other, their eyes playing across the faces of their fellow guests with greed, lust and envy. After dinner, they stroll into the drawing room, where we glimpse a woman's purse, filled with chicken feathers and rooster claws.
Affect Tagging (same review text, followed by the matching lexicon entries):
macabre,adj,death,0.50,0.60
macabre,adj,horror,0.90,0.60
...
savage,adj,violence,1.00,1.00
...
secret,sn,slyness,0.50,0.50
secret,sn,deception,0.50,0.50
prosperous,adj,surfeit,0.50,0.50
rat,sn,disloyalty,0.30,0.90
rat,sn,horror,0.20,0.60
rat,sn,repulsion,0.60,0.70
...
portent,sn,promise,0.70,0.90
portent,sn,warning,1.00,0.80
...
surrealistic,adj,absurdity,0.80,0.50
surrealistic,adj,creation,0.30,0.40
surrealistic,adj,insanity,0.50,0.30
surrealistic,adj,surprise,0.30,0.30
success,sn,success,1.00,0.60
whisper,vb,slyness,0.40,0.50
whisper,vb,slander,0.40,0.40
...
greed,sn,desire,0.60,1.00
greed,sn,greed,1.00,0.70
lust,vb,desire,0.80,0.90
envy,sn,desire,0.7,0.6
envy,sn,greed,0.7,0.6
envy,sn,inferiority,0.4,0.4
envy,sn,lack,0.5,0.5
envy,sn,slyness,0.5,0.6
fill,sn,surfeit,0.70,0.40
Resulting affect scores:
violence 1.0, humor 1.0, warning 1.0, anger 1.0, success 1.0, slander 1.0, greed 1.0
horror 0.90, aversion 0.90
absurdity 0.80, excitement 0.80, desire 0.80
pleasure 0.70, promise 0.70, surfeit 0.70
repulsion 0.60, fear 0.60
lack 0.50, death 0.50, slyness 0.50, intelligence 0.50, deception 0.50, insanity 0.50
clarity 0.40, innocence 0.40, inferiority 0.40
pain 0.30, disloyalty 0.30, failure 0.30, creation 0.30, surprise 0.30
Slide57: Affect Typing: Visualization (same review text)
Slide58: Analyzing Movie Categories
Slide59:
London train death toll jumps to 26
Search crews working through night to pull out bodies
LONDON, Oct. 5 - The death toll in Tuesday's train collision in London jumped to 26, police said, as more bodies were pulled from the wreckage near Paddington Station. Some 160 people were injured, 24 of them seriously. The last survivors were freed from the wreckage some five hours after the morning crash, NBC's Charles Sabine reported from the scene. SEARCH CREWS brought in floodlights and cranes to sift through the derailed and smashed passenger cars throughout the night, but police said it could take at least 24 hours before all of the bodies were removed. Survivors said there was a fireball immediately after the collision and then a rush to flee. One woman said it was 'mass panic' as passengers rushed the doors in the car she was on. The collision occurred about two miles from Paddington, near Ladbroke Grove. Another passenger said her wagon 'went up into flames' and tipped over. 'There were really badly hurt people, badly burnt people,' echoed another commuter. 'Some people have been impaled by seats.' And one of the passengers who saw the fireball recalled how he wondered if he and others would perish in the flames. SIDE CRASH? Passenger Mark Rogers said he 'was reading a book and found myself crashing into the person opposite me. The train was going over and over and over, and people were thrown onto the floor.'
'People were screaming, a person pretty clearly dead, a woman who was thrown out of the train,' he added. The accident happened at an intersection on the busy rail line, and might not have been head on but rather from the side. 'I think we hit on an angle, on the side,' said BBC radio editor Phil Longman, who was on board the inbound train. An engine and a front car were on their sides, he said, and another was pointing at the sky. One of the train drivers survived the crash, but he could not confirm the fate of the other one. The cause was not yet known, but it comes as public dissatisfaction with the railway system's performance is at an all-time high. Consumer groups and regulators say the system, privatized two years ago, cannot cope with passenger traffic that is growing faster than forecast. They are calling for more investment for train maintenance. The accident happened on the same line as a 1997 train crash that killed seven people and injured 150. EIGHT WAGONS DAMAGED Reuters journalist Wolfgang Waehner-Schmidt, who was on one of the trains, an inter-city Great Western Trains service from Cheltenham to Paddington, said the collision was with a smaller local train. The other train was headed away from London, toward Wiltshire. It had left Paddington Station about five minutes before the accident happened shortly at 8:11 a.m. local time. Waehner-Schmidt said about eight wagons were damaged and smoke was coming from some of them. 'We were in one of the last carriages. We got out immediately, smashed the window and jumped out of the train,' he added. 'AMAZED WE ARE ALIVE' Andrew Hoskin, who lives near the scene of the crash, said: 'It is a terrible mess. One train is completely off the rails.' Danny Firth, a passenger on the Great Western train described the crash as 'an almighty bang and everything that was in front of me came flying forward. There was fire outside. It was general chaos. People were walking around with burns and bruises.' 
'I am amazed we are alive,' said a 21-year-old woman sobbing with shock and relief after clambering out of a twisted carriage. 'The first I knew there was a sudden brake. The train flipped over on to its side. There were sparks and screams and seats falling all apart and lots of glass.'
Train Crash near Paddington, October 99.
Conclusion:
AI approach to Natural language processing
Command
Task oriented
Closed World Model, Ontology, Knowledge Representation
Logic, Inference
IR approach
One size fits all
If you lose some information, that's OK
Open World answers, Evaluable
Question Answering, Affect Analysis -- IR borrows from AI
Entities belong to an ontology
Relations important
gg Amateur Download Post to : URL : Related Presentations : Share Add to Flag Embed Email Send to Blogs and Networks Add to Channel Uploaded from authorPOINT Insert YouTube videos in PowerPont slides with aS Desktop Copy embed code: (To copy code, click on the text box) Embed: URL: Thumbnail: WordPress Embed Customize Embed The presentation is successfully added In Your Favorites. Views: 77 Category: News & Reports.. License: All Rights Reserved Like it (0) Dislike it (0) Added: August 27, 2007 This Presentation is Public Favorites: 0 Presentation Description No description available. Comments Posting comment... Premium member Presentation Transcript Slide1: Natural Language Processing at the limit between symbolic and numeric processing Gregory Grefenstette RFIA’2002 13e Congrès Francophone AFRIF-AFIA de Reconnaissance des Formes et Intelligence Artificielle Angers 8-10 Janvier 2002 Slide2: Two approaches to treating information from language -- Great Divide AI and Natural Language Processing Information Retrieval systems and Language Recent Marriage Two IR systems incorporating AI Question Answering Retrieving right answer Affect Analysis Retrieving emotive content Noble Goal: How to Improve Yield?: Noble Goal: How to Improve Yield? 
Improve harvesting Improve grain Optimizing Yield: Scale: Optimizing Yield: Scale Slide5: Expert systems Search Engines Machine translation Summarization Classification AI stream IR stream Computational Linguistics DB Slide6: Computational Linguistics DB Expert systems Search Engines Machine translation Summarization Classification AI stream IR stream Question Answering Two Approaches to Meaning : Origins: Two Approaches to Meaning : Origins Artificial Intelligence Paradigm Information Retrieval Paradigm AI’s Paradigm: Robot: AI’s Paradigm: Robot See Hear Move Manipulate React Understand Role of Language in AI: Role of Language in AI Command and Control Correct analysis Logical Inference World Model Database interface SHRDLU (Winograd, 1972): SHRDLU (Winograd, 1972) world model SHRDLU (Winograd, 1972): SHRDLU (Winograd, 1972) Put the pyramid on the the block on the blue block reasoning Conceptual Dependency (Schank, 1975): Conceptual Dependency (Schank, 1975) John shot Mary with a gun knowledge representation Logical Inference: Logical Inference ontology Expert Systems (MYCIN, 1972): Expert Systems (MYCIN, 1972) Computer can answer questions about its reasoning AI presumptions for NLP : AI presumptions for NLP Every word indicates part of a logical structure Mapping between relations in text and logic Language eases programming and control Precision, accuracy, completeness Ontology Knowledge Representation Inference Symbolic Processing IR’s Paradigm: Library: IR’s Paradigm: Library Human Info Need Search Redundancy Relevance IR presumptions for NLP: IR presumptions for NLP Every word is a new feature Every word describes a independent axis in space Words are weighted by numbers Some words are useless Syntax does not matter One technique applies to all documents Statistical Processing Ontology Knowledge Representation Inference Morphology and role are meaningless: Morphology and role are meaningless Suffixes stripped to single stem (eg. 
SMART) computer compute computerization computerize computation computational computor comput Semantics comes from Frequency: Semantics comes from Frequency Term Weighting Rarer words should have more weight inverse document frequency log (N/n) N -- number of documents in the collection n -- number of documents in which terms appears eg. suppose 300,000 documents word ‘most’ appears 93,000 times -- weight 1.2 word ‘negative’ appears 5,000 times -- weight 4.1 Words appearing more frequently in query and document should be more important term frequency Tempered by raw frequency, or square root, or log Similarity: Cosine measure, not Inferred: Similarity: Cosine measure, not Inferred Each document is indexed by stems Document Di = (di1,di2,di3,...,dit) The current query is reduced to stems Query Qj = (qj1,qj2,qj3,...,qjt) Imagine each stem as an axis distance along axis = frequency of term the query vector and document vector describe points distance (similarity) between points cosine(Di,Qj) = Sk(dik*qjk) / ((Skdik2)(Skqjk2))1/2 Meaning is a Position in Space: Meaning is a Position in Space word2 word3 doc1 doc2 word3 Texts become bags of words: Texts become bags of words doc1 -andgt; …ant… ant… bee doc2 -andgt; …bee …hog …ant… dog doc3 -andgt; …cat …pig ….eel ...dog …eel … fox words ant bee cat dog eel fox pig hog doc1 2 1 doc2 1 1 1 1 doc3 1 1 2 1 1 Example (continued): Example (continued) doc1 doc2 doc3 doc1 1 0.71 0 doc2 0.71 1 0.22 doc3 0 0.22 1 Similarity of documents in example: Similarity measures the occurrences of terms, but no other characteristics of the documents. 
IR, an Experimental Science: IR, an Experimental Science Basic Premises in a free text IR experiment Query Natural language expression of interest Document Data Base Relevance Binary: 'this document is relevant to this query' (Not usefulness) Objective IR Testbed: Objective IR Testbed ftp://ftp.cs.cornell.edu/pub/smart/med .I 6 .W ventricular septal defect occurring in association with aortic regurgitation .I 7 .W radioisotopes in heart scanning. mainly used in diagnosis of pericardial effusions. also used to study tumors, heart enlargement, aneurysms and pericardial thickening. technetium, rihsa, radioactive hippurate, cholegraffin are used. .I 8 .W the effects of drugs on the bone marrow of man and animals, specifically the effect of pesticides. also, the significance of bone marrow changes. 5 332 5 333 6 112 6 115 6 116 6 118 6 122 6 238 6 239 6 242 6 260 6 309 6 320 6 321 6 323 7 92 7 121 7 189 7 247 7 261 7 382 7 385 7 386 7 387 7 388 7 389 7 390 7 391 7 392 7 393 8 52 8 60 conditions . .I 237 cisternal fluid oxygen ... using a beckman micro-oxyg.. tension simultaneously in the.. and in arterial blood under.. that the cisternal oxygen.. oxygen tension of the surroun. the available free oxygen... duration in the cerebral... .I 238 ventricular septal defect obstruction . a case of ventricular... lesion and infundibular... coronary cusp of the aortic.. septal defect, was demonstra.. as a polyp-like mass in the... catheterization and angiocard ventricular outflow obstr... .I 239 functional adaptations of the congenital heart disease .... functional adaptations in... been discussed in relation to stenosis... queries qrels documents recall/precision: recall/precision ftp://ftp.cs.cornell.edu/pub/smart/med .I 6 .W ventricular septal defect occurring in association with aortic regurgitation .I 7 .W radioisotopes in heart scanning. mainly used in diagnosis of pericardial effusions. also used to study tumors, heart enlargement, aneurysms and pericardial thickening. 
technetium, rihsa, radioactive hippurate, cholegraffin are used.

.I 8
.W
the effects of drugs on the bone marrow of man and animals, specifically the effect of pesticides. also, the significance of bone marrow changes.

5 332   5 333
6 112   6 115   6 116   6 118   6 122   6 238   6 239   6 242   6 260   6 309   6 320   6 323
7 92    7 121   7 189   7 247   7 261   7 382   7 385   7 386   7 387   7 388   7 389   7 390   7 391   7 392   7 393
8 52    8 60

Panel labels: queries, qrels

Documents returned by sample IR system for query 6 (score, query, document):
0.99   6 238
0.96   6 98
0.92   6 115
0.87   6 117
0.78   6 242
0.78   6 323
0.45   6 122
0.36   6 350
0.35   6 259
0.23   6 118
0.16   6 256
At 50% recall (6 out of 12 found), 60% precision (6 out of 10 correct)

Recall and Precision Plot:
Axis range: 0 to 1.00 on both axes.
recall   precision
.10      0.737
.20      0.654
.30      0.639
.40      0.610
.50      0.570
.60      0.517
.70      0.478
.80      0.394
.90      0.261
Avg      0.540

Most Successful IR Improvement Technique:
Not use of thesauri
Not use of natural language processing
Not use of conceptual typing
Winner: Relevance Feedback
  Run query over collection
  Retain top 20 (relevant) documents
  Fold new words back into original query
  Re-run expanded query

TREC -- Text REtrieval Conferences, IR motor:
1992: National Institute of Standards & Technology (NIST) and Defense Advanced Research Projects Agency (DARPA)
  increased research in IR, on large-scale test collections
  increased communication academia-industry-government
  tech transfer between research labs and commercial products
  state-of-art showcase of retrieval methods
  improved evaluation techniques
Common Task Method
  training data given for a few months
  new task data, results returned after a few weeks
  independent evaluation
  results revealed at conference time
  only participating systems attend conference (pay-to-play)

TREC Databases – Real World Sizes:
collection                       MBytes   Average Nb Terms   Total Records
Federal Register (94)               283                456          55,554
IR Digest                             7               2383             455
News Groups                         237                340         102,598
Associated Press (88,90)            482                472         158,240
Federal Register (88)               211               1398          19,860
Ziff Communications                 552                315          75,180
San Jose Mercury News (91)          290                285          90,257
U. S. Patents (93)                  245               4777           6,711
Wall Street Journal (90-92)         247                377          74,520
Fifty new queries a year; average length 60 full words

Slide31:
Diagram labels: Bag of words (repeated four times), Interface, Scale, Influence, AI
TREC sponsored tracks drive Information Retrieval

Information Retrieval finally welcomes AI (timidly):
Question Answering Track
Structure of Test
Typical Systems
QALC
Microsoft
Entity ontologies, relations
Future of typing
Typing Lexicons
Example, Affect Analysis

QA Track:
Number of Documents: 979,000
Megabytes of Text: 3033
Document Sources: AP, WSJ, Financial Times, San Jose Mercury News, LA Times, FBIS
Number of Questions: 682
Question Sources: Encarta log, Excite log
Sample Questions:
  How much folic acid should an expectant mother get daily?
  Who invented the paper clip?
  What university was Woodrow Wilson president of?
  Where is Rider College located?
  Name a film in which Jude Law acted.
  Where do lobsters like to live?

TREC Question Answering track:
Types of Questions
  Factoids: When did X ?
  Lists: What countries did X visit ?
  Definitions: What is X ?
Evaluation over Real-world Testbed

QA Evaluation measure:
Reciprocal Ranking Scheme: the score for a question is 1/R, where R is the rank of the first correct answer in the list.
Q: What is the capital of Texas?
  A1: Dallas
  A2: Austin
  A3: Fort Worth
  A4: Denton
  A5: Austin
The score for question Q would be 1/2.

Typical Systems:
Query Typing (what are we looking for?)
Traditional IR on rest of query
Select Best Passages in most relevant documents
Search for Desired Answer Entity Types
Score Entities found by
  Frequency
  Closeness to other query words
Return Entities Ranked

Query/Answer Typing (QALC, Ferret, 2001):
Question: Who developed the Macintosh Computer?
  Named Entity List = PERSON, ORGANIZATION
Question: What metal has the highest melting point?
  General Type = metal
Question: What is the name of the chocolate company in San Francisco?
  Named Entity List = ORGANIZATION
  General Type = company
Question: What does a defibrillator do?
  Category = WhatDoNP
Question: When was Rosa Parks born?
  Category = WhenBePNborn

After typing, Traditional IR on rest of query:
Question: What metal has the highest melting point?
  General Type = metal

Entity Extraction from Best Passages:
…new element which they called rhenium in honor of the Rhine River. A year after its discovery they prepared the first gram of the new metal from 660,000 grams of molybdenite ore. Rhenium with a melting point of 3180 °C, has the highest melting point next to tungsten. Only osmium, iridium, and platinum exceed its density of 21.04 g/cc. Because of its high melting point, rhenium is a refractory metal. In that classification, rhenium is unique. It is the only refractory metal that does not form carbides. Its crystallographic structure is hexagonal close-packed (hcp), while other refractory metals have a body centered cubic (bcc) structure. Rhenium does not have a ductile-to-brittle transition temperature. In other words it maintains its ductility from absolute zero all the way to its melting point. Rhenium also has a high modulus of elasticity.
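The question typing step shown in the Query/Answer Typing slide can be approximated with a few hand-written patterns. The sketch below is only an illustration: the slides do not give QALC's actual rules, so the regular expressions and the function name `type_question` are assumptions, while the output labels (PERSON, ORGANIZATION, WhatDoNP, WhenBePNborn, General Type) are the ones shown in the slide.

```python
import re

# Hypothetical typing rules, tried in order. These are NOT QALC's real rules;
# they only reproduce the labels the slide shows for its example questions.
WHO = re.compile(r"^who\b", re.IGNORECASE)
WHAT_DO = re.compile(r"^what (?:does|do) .+ do\b", re.IGNORECASE)
WHEN_BORN = re.compile(r"^when (?:was|were|is) .+ born\b", re.IGNORECASE)
# "What <noun> has/is ..." : the word right after "what" names the general type.
WHAT_NOUN = re.compile(r"^what (\w+) (?:has|have|is|are|was|were)\b", re.IGNORECASE)


def type_question(question):
    """Return a dict describing the expected answer type, or {} if no rule fires."""
    q = question.strip()
    if WHO.search(q):
        return {"named_entities": ["PERSON", "ORGANIZATION"]}
    if WHAT_DO.search(q):
        return {"category": "WhatDoNP"}
    if WHEN_BORN.search(q):
        return {"category": "WhenBePNborn"}
    match = WHAT_NOUN.search(q)
    if match:
        return {"general_type": match.group(1).lower()}
    return {}


if __name__ == "__main__":
    for q in [
        "Who developed the Macintosh Computer?",
        "What metal has the highest melting point?",
        "What does a defibrillator do?",
        "When was Rosa Parks born?",
    ]:
        print(q, "->", type_question(q))
```

These patterns do not cover the chocolate company question, which needs both a named entity list and a general type; a real system would presumably rely on syntactic analysis rather than surface patterns.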
rhenium tungsten osmium iridium platinum density classification carbides

Entity Type of Desired Answer:
Question: What metal has the highest melting point?
  General Type = metal
rhenium tungsten osmium iridium platinum density classification carbides

Entity Typing Lexicon and Ontology (WordNet):
rhenium, Re, atomic number 75
  => metallic element, metal
    => chemical element, element
      => substance, matter
        => object, physical object
          => entity, something
tungsten, wolfram, W, atomic number 74
  => metallic element, metal
    => chemical element, element
      => substance, matter
        => object, physical object
          => entity, something

Entity Scoring, Ranking:
tungsten   105
platinum    35
osmium       7
rhenium      5
iridium      4
Altavista Advanced Search 'highest melting point' near X
rhenium tungsten osmium iridium platinum density classification carbides

QALC:
Statistical: Frequency, Independence
Symbolic: Ontology, Syntax

Microsoft Approach – Web based QA:
Question Transformation
  What is the capital of Texas
  'is the capital of Texas'
  'the capital of Texas is'
  'capital of Texas is the'
  'of Texas is the capital'
  'Texas is the capital of'
Answer gathering
  Gather summary text
  Eliminate stopwords
  Count 1-, 2-, 3-grams
Question/Answer typing
WWW

Question Answering Systems:
Best Question Answering Systems
  Richest Question typing
  Richest Typed Lexicons
  Richest Ontology
--- Most AI

Second IR system with AI traits:
Another application of Typed Lexicons
Another AI dream
Recognizing Emotion
Affect Analysis

Assessing Affect (In N Dimensions):

Affect Lexicon:

Affect Thesaurus:
admiration   sn    attraction   0.80   0.50
admire       vb    attraction   0.80   0.50
…
dazzle       vb    attraction   0.80   0.90
...
magnetism    sn    attraction   1.00   0.50
adoration    sn    love         0.90   1.00
adore        vb    love         0.90   1.00
…
dazzle       vb    love         0.90   1.00
...
passionate   adj   love         0.70   0.90
attraction   love  0.80

Slide56:
Luis Bunuel's The Exterminating Angel (1962) is a macabre comedy, a mordant view of human nature that suggests we harbor savage instincts and unspeakable secrets. Take a group of prosperous dinner guests and pen them up long enough, he suggests, and they'll turn on one another like rats in an overpopulation study. Bunuel begins with small, alarming portents. The cook and the servants suddenly put on their coats and escape, just as the dinner guests are arriving. The hostess is furious; she planned an after-dinner entertainment involving a bear and two sheep. Now it will have to be canceled. It is typical of Bunuel that such surrealistic touches are dropped in without comment. The dinner party is a success. The guests whisper slanders about each other, their eyes playing across the faces of their fellow guests with greed, lust and envy. After dinner, they stroll into the drawing room, where we glimpse a woman's purse, filled with chicken feathers and rooster claws.

Affect Tagging
macabre,adj,death,0.50,0.60
macabre,adj,horror,0.90,0.60
...
savage,adj,violence,1.00,1.00
...
secret,sn,slyness,0.50,0.50
secret,sn,deception,0.50,0.50
prosperous,adj,surfeit,0.50,0.50
rat,sn,disloyalty,0.30,0.90
rat,sn,horror,0.20,0.60
rat,sn,repulsion,0.60,0.70
...
portent,sn,promise,0.70,0.90
portent,sn,warning,1.00,0.80
...
surrealistic,adj,absurdity,0.80,0.50
surrealistic,adj,creation,0.30,0.40
surrealistic,adj,insanity,0.50,0.30
surrealistic,adj,surprise,0.30,0.30
success,sn,success,1.00,0.60
whisper,vb,slyness,0.40,0.50
whisper,vb,slander,0.40,0.40
...
greed,sn,desire,0.60,1.00
greed,sn,greed,1.00,0.70
lust,vb,desire,0.80,0.90
envy,sn,desire,0.7,0.6
envy,sn,greed,0.7,0.6
envy,sn,inferiority,0.4,0.4
envy,sn,lack,0.5,0.5
envy,sn,slyness,0.5,0.6
fill,sn,surfeit,0.70,0.40

violence 1.0   humor 1.0   warning 1.0   anger 1.0   success 1.0   slander 1.0   greed 1.0
horror 0.90   aversion 0.90
absurdity 0.80   excitement 0.80   desire 0.80
pleasure 0.70   promise 0.70   surfeit 0.70
repulsion 0.60   fear 0.60
lack 0.50   death 0.50   slyness 0.50   intelligence 0.50   deception 0.50   insanity 0.50
clarity 0.40   innocence 0.40   inferiority 0.40
pain 0.30   disloyalty 0.30   failure 0.30   creation 0.30   surprise 0.30

Slide57: Affect Typing: Visualization

Slide58: Analyzing Movie Categories

Slide59:
London train death toll jumps to 26
Search crews working through night to pull out bodies
LONDON, Oct. 5 - The death toll in Tuesday's train collision in London jumped to 26, police said, as more bodies were pulled from the wreckage near Paddington Station. Some 160 people were injured, 24 of them seriously. The last survivors were freed from the wreckage some five hours after the morning crash, NBC's Charles Sabine reported from the scene. SEARCH CREWS brought in floodlights and cranes to sift through the derailed and smashed passenger cars throughout the night, but police said it could take at least 24 hours before all of the bodies were removed. Survivors said there was a fireball immediately after the collision and then a rush to flee. One woman said it was 'mass panic' as passengers rushed the doors in the car she was on. The collision occurred about two miles from Paddington, near Ladbroke Grove. Another passenger said her wagon 'went up into flames' and tipped over. 'There were really badly hurt people, badly burnt people,' echoed another commuter. 'Some people have been impaled by seats.' And one of the passengers who saw the fireball recalled how he wondered if he and others would perish in the flames. SIDE CRASH? Passenger Mark Rogers said he 'was reading a book and found myself crashing into the person opposite me. The train was going over and over and over, and people were thrown onto the floor.'
'People were screaming, a person pretty clearly dead, a woman who was thrown out of the train,' he added. The accident happened at an intersection on the busy rail line, and might not have been head on but rather from the side. 'I think we hit on an angle, on the side,' said BBC radio editor Phil Longman, who was on board the inbound train. An engine and a front car were on their sides, he said, and another was pointing at the sky. One of the train drivers survived the crash, but he could not confirm the fate of the other one. The cause was not yet known, but it comes as public dissatisfaction with the railway system's performance is at an all-time high. Consumer groups and regulators say the system, privatized two years ago, cannot cope with passenger traffic that is growing faster than forecast. They are calling for more investment for train maintenance. The accident happened on the same line as a 1997 train crash that killed seven people and injured 150. EIGHT WAGONS DAMAGED Reuters journalist Wolfgang Waehner-Schmidt, who was on one of the trains, an inter-city Great Western Trains service from Cheltenham to Paddington, said the collision was with a smaller local train. The other train was headed away from London, toward Wiltshire. It had left Paddington Station about five minutes before the accident happened shortly at 8:11 a.m. local time. Waehner-Schmidt said about eight wagons were damaged and smoke was coming from some of them. 'We were in one of the last carriages. We got out immediately, smashed the window and jumped out of the train,' he added. 'AMAZED WE ARE ALIVE' Andrew Hoskin, who lives near the scene of the crash, said: 'It is a terrible mess. One train is completely off the rails.' Danny Firth, a passenger on the Great Western train described the crash as 'an almighty bang and everything that was in front of me came flying forward. There was fire outside. It was general chaos. People were walking around with burns and bruises.' 
'I am amazed we are alive,' said a 21-year-old woman sobbing with shock and relief after clambering out of a twisted carriage. 'The first I knew there was a sudden brake. The train flipped over on to its side. There were sparks and screams and seats falling all apart and lots of glass.'

Train Crash near Paddington, October 99.

Conclusion:
AI approach to natural language processing
  Command
  Task oriented
  Closed World
  Model, Ontology, Knowledge Representation
  Logic, Inference
IR approach
  One size fits all
  If you lose some information, that’s ok
  Open World answers, Evaluable
Question Answering, Affect Analysis: IR borrows from AI
  Entities belong to an ontology
  Relations important
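As a rough illustration of the Affect Tagging slide, the sketch below turns lexicon entries into a per-category profile. It rests on two assumptions that the slides do not state: that the first number in each entry is the one reported in the summary list, and that a category's score is the maximum over the words that evoke it. For the categories whose entries are fully listed (horror, desire, surfeit, slyness), this does reproduce the values shown on the slide. The function name `affect_profile` is hypothetical.

```python
# A subset of entries copied from the Affect Tagging slide:
# word, part of speech, affect category, first score, second score.
LEXICON_LINES = [
    "macabre,adj,death,0.50,0.60",
    "macabre,adj,horror,0.90,0.60",
    "rat,sn,horror,0.20,0.60",
    "rat,sn,repulsion,0.60,0.70",
    "secret,sn,slyness,0.50,0.50",
    "whisper,vb,slyness,0.40,0.50",
    "envy,sn,slyness,0.5,0.6",
    "greed,sn,desire,0.60,1.00",
    "lust,vb,desire,0.80,0.90",
    "envy,sn,desire,0.7,0.6",
    "prosperous,adj,surfeit,0.50,0.50",
    "fill,sn,surfeit,0.70,0.40",
]


def affect_profile(lines):
    """Map each affect category to the highest first score among its entries.

    Assumption: max-aggregation over the first numeric field; the slides
    show the resulting numbers but do not name the fields or the rule.
    """
    profile = {}
    for line in lines:
        word, pos, category, first_score, second_score = line.split(",")
        score = float(first_score)
        if score > profile.get(category, 0.0):
            profile[category] = score
    return profile


if __name__ == "__main__":
    # Print categories from strongest to weakest, as in the slide's summary.
    ranked = sorted(affect_profile(LEXICON_LINES).items(), key=lambda kv: -kv[1])
    for category, score in ranked:
        print(f"{category} {score:.2f}")
```

Categories such as slander, whose summary value is higher than any entry visible on the slide, presumably draw on entries elided by the "..." lines, so they are left out of this subset.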