logging in or signing up DTCIII Haylee Download Post to : URL : Related Presentations : Share Add to Flag Embed Email Send to Blogs and Networks Add to Channel Uploaded from authorPOINTLite Insert YouTube videos in PowerPont slides with aS Desktop Copy embed code: (To copy code, click on the text box) Embed: URL: Thumbnail: WordPress Embed Customize Embed The presentation is successfully added In Your Favorites. Views: 196 Category: Entertainment License: All Rights Reserved Like it (2) Dislike it (0) Added: November 26, 2007 This Presentation is Public Favorites: 1 Presentation Description No description available. Comments Posting comment... Premium member Presentation Transcript Automatic Text Classification: Automatic Text Classification Yutaka Sasaki NaCTeM School of Computer Science ﴀ ©2007 Yutaka Sasaki, University of ManchesterSlide2: Introduction ﴀ ©2007 Yutaka Sasaki, University of ManchesterIntroduction: Introduction Text Classification is the task: to classify documents into predefined classes Text Classification is also called Text Categorization Document Classification Document Categorization Two approaches manual classification and automatic classification ﴀ ©2007 Yutaka Sasaki, University of ManchesterRelevant technologies: Relevant technologies Text Clustering Create clusters of documents without any external information Information Retrieval (IR) Retrieve a set of documents relevant to a query Information Filtering Filter out irrelevant documents through interactions Information Extraction (IE) Extract fragments of information, e.g., person names, dates, and places, in documents Text Classification No query, interactions, external information Decide topics of documents ﴀ ©2007 Yutaka Sasaki, University of ManchesterExamples of relevant technologies: Examples of relevant technologies ﴀ ©2007 Yutaka Sasaki, University of Manchester web documentsExample of clustering: Example of clustering web documents ﴀ ©2007 Yutaka Sasaki, University of ManchesterExamples of information retrieval: Examples of information retrieval x web documents ﴀ ©2007 Yutaka Sasaki, University of ManchesterExamples of information filtering: Examples of information filtering web documents ﴀ ©2007 Yutaka Sasaki, University of ManchesterExamples of information extraction: Examples of information extraction web documents about accidents Date: 04/12/03 Place: London Type: traffic Casualty: 5 Key information on accidents ﴀ ©2007 Yutaka Sasaki, University of ManchesterExamples of text classification: Examples of text classification web documents ﴀ ©2007 Yutaka Sasaki, University of Manchester sports economicsText Classification Applications: Text Classification Applications E-mail spam filtering Categorize newspaper articles and newswires into topics Organize Web pages into hierarchical categories Sort journals and abstracts by subject categories (e.g., MEDLINE, etc.) Assigning international clinical codes to patient clinical records ﴀ ©2007 Yutaka Sasaki, University of ManchesterSimple text classification example: Simple text classification example You want to classify documents into 4 classes: economics, sports, science, life. There are two approaches that you can take: rule-based approach write a set of rules that classify documents machine learning-based approach using a set of sample documents that are classified into the classes (training data), automatically create classifiers based on the training data ﴀ ©2007 Yutaka Sasaki, University of ManchesterComparison of Two Approaches (1): Comparison of Two Approaches (1) Rule-based classification Pros: very accurate when rules are written by experts classification criteria can be easily controlled when the number of rules are small. Cons: sometimes, rules conflicts each other maintenance of rules becomes more difficult as the number of rules increases the rules have to be reconstructed when a target domain changes low coverage because of a wide variety of expressions ﴀ ©2007 Yutaka Sasaki, University of ManchesterComparison of Two Approaches (2): Comparison of Two Approaches (2) Machine Learning-based approach Pros: domain independent high predictive performance Cons: not accountable for classification results training data required ﴀ ©2007 Yutaka Sasaki, University of ManchesterFormal Definition: Formal Definition Given: A set of documents D = {d1, d2,…, dm} A fixed set of topics T = {t1, t2,…, tn} Determine: The topic of d: t(d) T, where t(x) is a classification function whose domain is D and whose range is T. ﴀ ©2007 Yutaka Sasaki, University of ManchesterRule-based approach: Rule-based approach Example: Classify documents into sports “ball” must be a word that is frequently used in sports Rule 1: “ball” d t(d) = sports But there are other meanings of “ball” Def.2-1 : a large formal gathering for social dancing (WEBSTER) Rule 2: “ball” d & “dance” d t(d) = sports Def.2-2 : a very pleasant experience : a good time (WEBSTER) Rule 3: “ball” d & “dance” d & “game” d & “play” d t(d) = sports Natural language has a rich variety of expressions: e.g., “Many people have a ball when they play a bingo game.” ﴀ ©2007 Yutaka Sasaki, University of ManchesterMachine Learning Approach: Machine Learning Approach Prepare a set of training data Attach topic information to the documents in a target domain. Create a classifier (model) Apply a Machine Learning tool to the data Support Vector Machine (SVM), Maximum Entropy Models (MEM) Classify new documents by the classifier sports science life classifier sports science life classifier … life sports Training data ﴀ ©2007 Yutaka Sasaki, University of ManchesterCloser look atMachine Learning-based approach: Closer look at Machine Learning-based approach f1 f2 f3 f4 game play ball dance Classifier c(·|·) c(sports|x) document d c(science|x) c(economics|x) c(y|x) x=(f1, f2, f3, f4) features feature extraction feature vector (input vector) c(life|x) Select the best classification result ﴀ ©2007 Yutaka Sasaki, University of ManchesterRule-based vs. Machine Learning-based[Creecy at al., 1992]: Rule-based vs. Machine Learning-based [Creecy at al., 1992] Data: US Census Bureau Decennial Census 1990 22 million natural language responses 232 industry categories and 504 occupation categories It costs about $15 million if fully done by hand Define classification rules manually: Expert System AIOCS Development time: 192 person-months (2 people, 8 years) Accuracy = 57%(industry), 37%(occupation) Learn classification function Machine Learning-based System PACE Development time: 4 person-months Accuracy = 63%(industry), 57%(occupation) ﴀ ©2007 Yutaka Sasaki, University of ManchesterSlide20: Evaluation ﴀ ©2007 Yutaka Sasaki, University of ManchesterCommon Evaluation Metrics: Common Evaluation Metrics Accuracy Precision Recall F-measure harmonic mean of recall and precision micro-average F1 global calculation of F1 regardless of topics macro-average F1: average on F1 scores of all the topics ﴀ ©2007 Yutaka Sasaki, University of ManchesterAccuracy: Accuracy The rate of correctly predicted topics system’s prediction correct answer true positive false positive (Type I error, false alarm) false negative (Type II error, missed alarm) (TP) (FP) (FN) Accuracy = TP + TN TP + FP + FN + TN true negative (TN) ﴀ ©2007 Yutaka Sasaki, University of ManchesterAccuracy: Accuracy Example: classify docs into spam or not spam Accuracy = = = 0.4 TP+TN TP+FP+FN+TN d1 d2 d3 Y Y N system’s prediction correct answer N Y Y TP FP FN TN 1 1 1 1 1 d4 N 1+1 1+2+1+1 N d5 N Y ﴀ ©2007 Yutaka Sasaki, University of Manchester Issue in Accuracy: Issue in Accuracy When a certain topic (e.g., not-spam) is a majority, the accuracy easily reaches a high percentage. Accuracy = = = 0.99 TP+TN TP+FP+FN+TN d1 … N N system’s prediction correct answer Y Y TP FP FN TN 1 1 990 d11- d1000 990 1000 N … N d10 … … … … N … N ﴀ ©2007 Yutaka Sasaki, University of Manchester Precision: Precision The rate of correctly predicted topics system’s prediction correct answer true positive false positive (Type I error, false alarm) false negative (Type II error, missed alarm) (TP) (FP) (FN) Precision = TP TP + FP true negative (TN) ﴀ ©2007 Yutaka Sasaki, University of ManchesterPrecision: Precision Example: classify docs into spam or not spam Precision = = = 0.333 TP TP+FP d1 d2 d3 Y Y N system’s prediction correct answer N Y Y TP FP FN TN 1 1 1 1 1 d4 N 1 1+2 N d5 N Y ﴀ ©2007 Yutaka Sasaki, University of Manchester Issue in Precision: Issue in Precision When a system outputs only confident topics, the precision easily reaches a high percentage. Accuracy = = = 1 TP TP+FP d1 … N N system’s prediction correct answer Y N TP FP FN TN 1 1 1 1 1 Y d999 … …(Y or N)… … … Y d1000 ﴀ ©2007 Yutaka Sasaki, University of Manchester Recall: Recall The rate of correctly predicted topics system’s prediction correct answer true positive false positive (Type I error, false alarm) false negative (Type II error, missed alarm) (TP) (FP) (FN) Precision = TP TP + FN true negative (TN) ﴀ ©2007 Yutaka Sasaki, University of ManchesterRecall: Recall Example: classify docs into spam or not spam Precision = = = 0.5 TP TP+FN d1 d2 d3 Y Y N system’s prediction correct answer N Y Y TP FP FN TN 1 1 1 1 1 d4 N 1 1+1 N d5 N Y ﴀ ©2007 Yutaka Sasaki, University of Manchester Issue in Recall: Issue in Recall When a system outputs loosely, the recall easily reaches a high percentage. Accuracy = = = 1 TP TP+FN d1 … Y Y system’s prediction correct answer Y N TP FP FN TN 1 1 1 n n Y d999 … …(Y or N)… … … Y d1000 ﴀ ©2007 Yutaka Sasaki, University of Manchester F-measure: F-measure Harmonic mean of recall and precision Since there is a trade-off between recall and precision, F-measure is widely used to evaluate text classification system. Micro-average F1: Global calculation of F1 regardless of topics Macro-evarage F1: Average on F1 scores of all topics 2 · Precision · Recall Precision + Recall ﴀ ©2007 Yutaka Sasaki, University of ManchesterF-measure: F-measure Example: classify docs into spam or not spam F = = = 0.4 2·Recall·Precision Recall+Precision d1 d2 d3 Y Y N system’s prediction correct answer N Y Y TP FP FN TN 1 1 1 1 1 d4 N 2·1/3·1/2 1/3 + 1/2 N d5 N Y ﴀ ©2007 Yutaka Sasaki, University of Manchester Summary: Evaluation Metrics: Summary: Evaluation Metrics Accuracy Precision Recall F-measure Micro F1: Global average of F1 regardless of topics Macro F1: Average on F1 scores of all topics Cost-Sensitive Accuracy Measure (*) Multi-Topic Accuracy (*) TP (# system's correct predictions) TP+FP (# system’s outputs) TP (# system's correct predictions) TP+FN (# correct answers) 2 * Recall * Precision Recall + Precision ﴀ ©2007 Yutaka Sasaki, University of ManchesterSlide34: Feature Extraction: from Text to Data ﴀ ©2007 Yutaka Sasaki, University of ManchesterBasic Approach (1): Basic Approach (1) Bag-of-Word approach a document is regarded as a set of words regardless of the word order and grammar. The brown fox jumps over the lazy dog. brown fox jumps over the lazy dog The ﴀ ©2007 Yutaka Sasaki, University of ManchesterBasic Approach (2): Basic Approach (2) Bi-grams, tri-grams, n-grams Extract all of two, three, or n words in a row in the text The brown fox jumps over the lazy dog. Bi-grams: the brown, brown fox, fox jumps, jumps over, the lazy, lazy dog Tri-grams: the brown fox, brown fox jumps, fox jumps over, jumps over the, the lazy dog ﴀ ©2007 Yutaka Sasaki, University of ManchesterBasic Approach (3): Basic Approach (3) Normalization Convert words into a normalized forms down-case, e.g, The the, NF-kappa B nf-kappa b lemmatization: to basic forms, e.g., jumps jump stemming: mechanically remove/change suffixes e.g., yi, s , “the brown fox jump over the lazi dog.” the Porter’s Stemmer is widely used. Stop-word removal ignore predefined common words, e.g., the, a, to, with, that … the SMART Stop List is widely used ﴀ ©2007 Yutaka Sasaki, University of ManchesterFrom Symbols to Numeric: From Symbols to Numeric Term occurrence: occur (1) or not-occur (0) Term Frequency tfi = the number of times where word/n-gram wi appears in a document. Inverse document frequency the inverted rate of documents that contain word/n-gram wi against a whole set of documents idfi = | D | / | {d | wi d D }|. tf-idf tf-idfi = tfi · idfi frequent words that appear only in a small number of documents achieve high value. ﴀ ©2007 Yutaka Sasaki, University of ManchesterCreate Feature Vectors: Create Feature Vectors a an …brown,.. dog … fox jump lazi over the ( 0, 0,…,0, 1, ,0,…,0, 1,0,…,0, 1,0,…,0,1, 0,…,0,1,0,…,0,1, 2, 0, ..) 1. enumerate all word/n-grams in a whole set of documents 2. remove duplications and sort the words/n-grams 3. convert each word into its value, e.g., tf, idf, or tf-idf. 4. create a vector whose i-th value is the value of i-th term The brown fox jumps over the lazy dog. Generally, feature vectors are very sparse, i.e., most of the values are 0. feature vector with tf weights: ﴀ ©2007 Yutaka Sasaki, University of ManchesterSlide40: Multi-Topic Text Classification ﴀ ©2007 Yutaka Sasaki, University of ManchesterMulti-topic Text Classification: Multi-topic Text Classification <TOPICS>ship</TOPICS> The Panama Canal Commission, a U.S. government agency, said in its daily operations report that there was a backlog of 39 ships waiting to enter the canal early today. <TOPICS>crude</TOPICS> Diamond Shamrock Corp said that effective today it had cut its contract prices for crude oil by 1.50 dlrs a barrel. <TOPICS>crude:ship</TOPICS> The port of Philadelphia was closed when a Cypriot oil tanker, Seapride II, ran aground after hitting a 200-foot tower supporting power lines across the river, a Coast Guard spokesman said. (Excerpt from Ruters-21578) One single document belongs to multiple topics An interesting and important research theme that is not nicely solved yet. Topic A&B is not always a mixture of A and B ﴀ ©2007 Yutaka Sasaki, University of ManchesterA View on Multi-topic Text Classification: A View on Multi-topic Text Classification Open Topic Assumption (OTA) (conventional view) A document has multiple topics The topics other than the given topics are neutral. Closed Topic Assumption (CTA) A document has multiple topics The other topics are considered to be explicitly excluded. E.g., if there exist three topics A,B,C and a text d is given the topic A, then this assignment is regarded that d belongs to A but does not belong to B and C. B A C A A but neither B nor C CTA OTA A ﴀ ©2007 Yutaka Sasaki, University of ManchesterSlide43: ﴀ ©2007 Yutaka Sasaki, University of ManchesterSlide44: Case Studies ﴀ ©2007 Yutaka Sasaki, University of ManchesterExperiments: Experiments Objective compare the performance of approaches based on Closed Topic Assumption and Open Topic Assumption. Data 1 (Clinical records) Training: about 986 documents Test: 984 documents Data 2 (Reuters newswires) Training: 9,603 documents Test: 3,299 documents Machine Learning methods SVM: Support Vector Machines MEM: Maximum Entropy Models Approaches BC: Binary Class Classification MC: Multi Class Classification ﴀ ©2007 Yutaka Sasaki, University of ManchesterClassification of Clinical Records: Classification of Clinical Records Medical NLP Challenge (Computational Medicine Centre) Classify anonymized real clinical records into International Clinical Codes (ICD-9-CM) 44 research institutes participated Sample Record: # Clinical History This is a patient with meningomyelocele and neurogenic bladder. # Impression Normal renal ultrasound in a patient with neurogenic bladder. Correct codes (possibly multiple codes): 596.54 (Neurogenic bladder NOS) 741.90 (Without mention of hydrocephalus) ﴀ ©2007 Yutaka Sasaki, University of ManchesterSlide47: Document Predicted codes (multi-topics) Correct codes Top 5 Candidates Significance of each feature ﴀ ©2007 Yutaka Sasaki, University of ManchesterEvaluation Metrics: Evaluation Metrics AC: multi-labelling accuracy Cost-Sensitive Accuracy Measure (for clinical data) Precision Recall F1 Micro F1: Global calculation of F1 regardless of topics Macro F1: Average on F1 scores of all topics # system correct labeling # system output # system correct labeling # correct labeling 2 * Precison * Recall Precision + Recall ﴀ ©2007 Yutaka Sasaki, University of ManchesterSlide49: Classification Experiments on Clinical Records ﴀ ©2007 Yutaka Sasaki, University of ManchesterSlide50: Experimental Results on Clinical Records (cont.) ﴀ ©2007 Yutaka Sasaki, University of ManchesterExperimental Results on Reuters: Experimental Results on Reuters CTA OTA CTA OTA ﴀ ©2007 Yutaka Sasaki, University of ManchesterMulti-topic accuracy (Reuters): Multi-topic accuracy (Reuters)Micro-average F1 (Reuters): Micro-average F1 (Reuters) (Excerpt from Ruters-21578) ﴀ ©2007 Yutaka Sasaki, University of ManchesterMacro-average F1 (Reuters): Macro-average F1 (Reuters) ﴀ ©2007 Yutaka Sasaki, University of ManchesterReferences: References Rule-based vs. Machine Learning Based Text Classification Robert H. Creecy, Brij M. Masand, Stephen J. Smith, David L. Walt, Trading MIPS and memory for knowledge engineerring, Communications of the ACM, Vol. 35, Isuue 8, pp. 48-64, 1992. Review paper on Text Classification Fabrizio Sebastiani, Machine Learning in Automated Text Categorization, ACM Computing Surveys, Vol. 34, No.1, pp.1-47, 2002. CMC Medical NLP Challenge 2007 http://www.computationalmedicine.org/challenge/index.php Clinical Text Classification Yutaka Sasaki, Brian Rea, Sophia Ananiadou, Multi-Topic Aspects in Clinical Text Classification, IEEE International Conference on Bioinformatics and Biomedicine 2007 (IEEE BIBM-07), Silicon Valley, Nov. 2-7, 2007. Selected papers on Text Classification S. T. Dumais, J. Platt, D. Heckerman, and M. Sahami, Inductive Learning Algorithms and Representations for Text Categorization, Prof. CIKM '98, pp.148-155, 1998. Thorsten Joachims, Text Categorization with Support Vector Machines: Learning with Many Relevant Features, Proc. of 10th European Conference on Machine Learning (ECML-98)}, pp.137-142, 1998. A. McCallum, Multi-label Text Classification with a Mixture Model Trained by EM, AAAI-99 Workshop on Text Learning, 1999. K. Nigam, J. Lafferty, A. McCallum, Using Maximum Entropy for Text Classification, IJCAI-99 Workshop on Machine Learning for Information Filtering, pp.61-67, 1999. John C. Platt, Nello Cristianini, John Shawe-Taylor, Large Margin DAGs for Multiclass Classification, Proc. of NIPS-1999, pp. 547-553, 1999. RE Schapire and Y Singer, BoosTexter: A Boosting-based System for Text Categorization, Machine Learning, Springer, Vol. 39, pp.135-168, 2000. ﴀ ©2007 Yutaka Sasaki, University of ManchesterSlide56: Thank you ﴀ ©2007 Yutaka Sasaki, University of Manchester You do not have the permission to view this presentation. In order to view it, please contact the author of the presentation.
DTCIII Haylee Download Post to : URL : Related Presentations : Share Add to Flag Embed Email Send to Blogs and Networks Add to Channel Uploaded from authorPOINTLite Insert YouTube videos in PowerPont slides with aS Desktop Copy embed code: (To copy code, click on the text box) Embed: URL: Thumbnail: WordPress Embed Customize Embed The presentation is successfully added In Your Favorites. Views: 196 Category: Entertainment License: All Rights Reserved Like it (2) Dislike it (0) Added: November 26, 2007 This Presentation is Public Favorites: 1 Presentation Description No description available. Comments Posting comment... Premium member Presentation Transcript Automatic Text Classification: Automatic Text Classification Yutaka Sasaki NaCTeM School of Computer Science ﴀ ©2007 Yutaka Sasaki, University of ManchesterSlide2: Introduction ﴀ ©2007 Yutaka Sasaki, University of ManchesterIntroduction: Introduction Text Classification is the task: to classify documents into predefined classes Text Classification is also called Text Categorization Document Classification Document Categorization Two approaches manual classification and automatic classification ﴀ ©2007 Yutaka Sasaki, University of ManchesterRelevant technologies: Relevant technologies Text Clustering Create clusters of documents without any external information Information Retrieval (IR) Retrieve a set of documents relevant to a query Information Filtering Filter out irrelevant documents through interactions Information Extraction (IE) Extract fragments of information, e.g., person names, dates, and places, in documents Text Classification No query, interactions, external information Decide topics of documents ﴀ ©2007 Yutaka Sasaki, University of ManchesterExamples of relevant technologies: Examples of relevant technologies ﴀ ©2007 Yutaka Sasaki, University of Manchester web documentsExample of clustering: Example of clustering web documents ﴀ ©2007 Yutaka Sasaki, University of ManchesterExamples of information retrieval: Examples of information retrieval x web documents ﴀ ©2007 Yutaka Sasaki, University of ManchesterExamples of information filtering: Examples of information filtering web documents ﴀ ©2007 Yutaka Sasaki, University of ManchesterExamples of information extraction: Examples of information extraction web documents about accidents Date: 04/12/03 Place: London Type: traffic Casualty: 5 Key information on accidents ﴀ ©2007 Yutaka Sasaki, University of ManchesterExamples of text classification: Examples of text classification web documents ﴀ ©2007 Yutaka Sasaki, University of Manchester sports economicsText Classification Applications: Text Classification Applications E-mail spam filtering Categorize newspaper articles and newswires into topics Organize Web pages into hierarchical categories Sort journals and abstracts by subject categories (e.g., MEDLINE, etc.) Assigning international clinical codes to patient clinical records ﴀ ©2007 Yutaka Sasaki, University of ManchesterSimple text classification example: Simple text classification example You want to classify documents into 4 classes: economics, sports, science, life. There are two approaches that you can take: rule-based approach write a set of rules that classify documents machine learning-based approach using a set of sample documents that are classified into the classes (training data), automatically create classifiers based on the training data ﴀ ©2007 Yutaka Sasaki, University of ManchesterComparison of Two Approaches (1): Comparison of Two Approaches (1) Rule-based classification Pros: very accurate when rules are written by experts classification criteria can be easily controlled when the number of rules are small. Cons: sometimes, rules conflicts each other maintenance of rules becomes more difficult as the number of rules increases the rules have to be reconstructed when a target domain changes low coverage because of a wide variety of expressions ﴀ ©2007 Yutaka Sasaki, University of ManchesterComparison of Two Approaches (2): Comparison of Two Approaches (2) Machine Learning-based approach Pros: domain independent high predictive performance Cons: not accountable for classification results training data required ﴀ ©2007 Yutaka Sasaki, University of ManchesterFormal Definition: Formal Definition Given: A set of documents D = {d1, d2,…, dm} A fixed set of topics T = {t1, t2,…, tn} Determine: The topic of d: t(d) T, where t(x) is a classification function whose domain is D and whose range is T. ﴀ ©2007 Yutaka Sasaki, University of ManchesterRule-based approach: Rule-based approach Example: Classify documents into sports “ball” must be a word that is frequently used in sports Rule 1: “ball” d t(d) = sports But there are other meanings of “ball” Def.2-1 : a large formal gathering for social dancing (WEBSTER) Rule 2: “ball” d & “dance” d t(d) = sports Def.2-2 : a very pleasant experience : a good time (WEBSTER) Rule 3: “ball” d & “dance” d & “game” d & “play” d t(d) = sports Natural language has a rich variety of expressions: e.g., “Many people have a ball when they play a bingo game.” ﴀ ©2007 Yutaka Sasaki, University of ManchesterMachine Learning Approach: Machine Learning Approach Prepare a set of training data Attach topic information to the documents in a target domain. Create a classifier (model) Apply a Machine Learning tool to the data Support Vector Machine (SVM), Maximum Entropy Models (MEM) Classify new documents by the classifier sports science life classifier sports science life classifier … life sports Training data ﴀ ©2007 Yutaka Sasaki, University of ManchesterCloser look atMachine Learning-based approach: Closer look at Machine Learning-based approach f1 f2 f3 f4 game play ball dance Classifier c(·|·) c(sports|x) document d c(science|x) c(economics|x) c(y|x) x=(f1, f2, f3, f4) features feature extraction feature vector (input vector) c(life|x) Select the best classification result ﴀ ©2007 Yutaka Sasaki, University of ManchesterRule-based vs. Machine Learning-based[Creecy at al., 1992]: Rule-based vs. Machine Learning-based [Creecy at al., 1992] Data: US Census Bureau Decennial Census 1990 22 million natural language responses 232 industry categories and 504 occupation categories It costs about $15 million if fully done by hand Define classification rules manually: Expert System AIOCS Development time: 192 person-months (2 people, 8 years) Accuracy = 57%(industry), 37%(occupation) Learn classification function Machine Learning-based System PACE Development time: 4 person-months Accuracy = 63%(industry), 57%(occupation) ﴀ ©2007 Yutaka Sasaki, University of ManchesterSlide20: Evaluation ﴀ ©2007 Yutaka Sasaki, University of ManchesterCommon Evaluation Metrics: Common Evaluation Metrics Accuracy Precision Recall F-measure harmonic mean of recall and precision micro-average F1 global calculation of F1 regardless of topics macro-average F1: average on F1 scores of all the topics ﴀ ©2007 Yutaka Sasaki, University of ManchesterAccuracy: Accuracy The rate of correctly predicted topics system’s prediction correct answer true positive false positive (Type I error, false alarm) false negative (Type II error, missed alarm) (TP) (FP) (FN) Accuracy = TP + TN TP + FP + FN + TN true negative (TN) ﴀ ©2007 Yutaka Sasaki, University of ManchesterAccuracy: Accuracy Example: classify docs into spam or not spam Accuracy = = = 0.4 TP+TN TP+FP+FN+TN d1 d2 d3 Y Y N system’s prediction correct answer N Y Y TP FP FN TN 1 1 1 1 1 d4 N 1+1 1+2+1+1 N d5 N Y ﴀ ©2007 Yutaka Sasaki, University of Manchester Issue in Accuracy: Issue in Accuracy When a certain topic (e.g., not-spam) is a majority, the accuracy easily reaches a high percentage. Accuracy = = = 0.99 TP+TN TP+FP+FN+TN d1 … N N system’s prediction correct answer Y Y TP FP FN TN 1 1 990 d11- d1000 990 1000 N … N d10 … … … … N … N ﴀ ©2007 Yutaka Sasaki, University of Manchester Precision: Precision The rate of correctly predicted topics system’s prediction correct answer true positive false positive (Type I error, false alarm) false negative (Type II error, missed alarm) (TP) (FP) (FN) Precision = TP TP + FP true negative (TN) ﴀ ©2007 Yutaka Sasaki, University of ManchesterPrecision: Precision Example: classify docs into spam or not spam Precision = = = 0.333 TP TP+FP d1 d2 d3 Y Y N system’s prediction correct answer N Y Y TP FP FN TN 1 1 1 1 1 d4 N 1 1+2 N d5 N Y ﴀ ©2007 Yutaka Sasaki, University of Manchester Issue in Precision: Issue in Precision When a system outputs only confident topics, the precision easily reaches a high percentage. Accuracy = = = 1 TP TP+FP d1 … N N system’s prediction correct answer Y N TP FP FN TN 1 1 1 1 1 Y d999 … …(Y or N)… … … Y d1000 ﴀ ©2007 Yutaka Sasaki, University of Manchester Recall: Recall The rate of correctly predicted topics system’s prediction correct answer true positive false positive (Type I error, false alarm) false negative (Type II error, missed alarm) (TP) (FP) (FN) Precision = TP TP + FN true negative (TN) ﴀ ©2007 Yutaka Sasaki, University of ManchesterRecall: Recall Example: classify docs into spam or not spam Precision = = = 0.5 TP TP+FN d1 d2 d3 Y Y N system’s prediction correct answer N Y Y TP FP FN TN 1 1 1 1 1 d4 N 1 1+1 N d5 N Y ﴀ ©2007 Yutaka Sasaki, University of Manchester Issue in Recall: Issue in Recall When a system outputs loosely, the recall easily reaches a high percentage. Accuracy = = = 1 TP TP+FN d1 … Y Y system’s prediction correct answer Y N TP FP FN TN 1 1 1 n n Y d999 … …(Y or N)… … … Y d1000 ﴀ ©2007 Yutaka Sasaki, University of Manchester F-measure: F-measure Harmonic mean of recall and precision Since there is a trade-off between recall and precision, F-measure is widely used to evaluate text classification system. Micro-average F1: Global calculation of F1 regardless of topics Macro-evarage F1: Average on F1 scores of all topics 2 · Precision · Recall Precision + Recall ﴀ ©2007 Yutaka Sasaki, University of ManchesterF-measure: F-measure Example: classify docs into spam or not spam F = = = 0.4 2·Recall·Precision Recall+Precision d1 d2 d3 Y Y N system’s prediction correct answer N Y Y TP FP FN TN 1 1 1 1 1 d4 N 2·1/3·1/2 1/3 + 1/2 N d5 N Y ﴀ ©2007 Yutaka Sasaki, University of Manchester Summary: Evaluation Metrics: Summary: Evaluation Metrics Accuracy Precision Recall F-measure Micro F1: Global average of F1 regardless of topics Macro F1: Average on F1 scores of all topics Cost-Sensitive Accuracy Measure (*) Multi-Topic Accuracy (*) TP (# system's correct predictions) TP+FP (# system’s outputs) TP (# system's correct predictions) TP+FN (# correct answers) 2 * Recall * Precision Recall + Precision ﴀ ©2007 Yutaka Sasaki, University of ManchesterSlide34: Feature Extraction: from Text to Data ﴀ ©2007 Yutaka Sasaki, University of ManchesterBasic Approach (1): Basic Approach (1) Bag-of-Word approach a document is regarded as a set of words regardless of the word order and grammar. The brown fox jumps over the lazy dog. brown fox jumps over the lazy dog The ﴀ ©2007 Yutaka Sasaki, University of ManchesterBasic Approach (2): Basic Approach (2) Bi-grams, tri-grams, n-grams Extract all of two, three, or n words in a row in the text The brown fox jumps over the lazy dog. Bi-grams: the brown, brown fox, fox jumps, jumps over, the lazy, lazy dog Tri-grams: the brown fox, brown fox jumps, fox jumps over, jumps over the, the lazy dog ﴀ ©2007 Yutaka Sasaki, University of ManchesterBasic Approach (3): Basic Approach (3) Normalization Convert words into a normalized forms down-case, e.g, The the, NF-kappa B nf-kappa b lemmatization: to basic forms, e.g., jumps jump stemming: mechanically remove/change suffixes e.g., yi, s , “the brown fox jump over the lazi dog.” the Porter’s Stemmer is widely used. Stop-word removal ignore predefined common words, e.g., the, a, to, with, that … the SMART Stop List is widely used ﴀ ©2007 Yutaka Sasaki, University of ManchesterFrom Symbols to Numeric: From Symbols to Numeric Term occurrence: occur (1) or not-occur (0) Term Frequency tfi = the number of times where word/n-gram wi appears in a document. Inverse document frequency the inverted rate of documents that contain word/n-gram wi against a whole set of documents idfi = | D | / | {d | wi d D }|. tf-idf tf-idfi = tfi · idfi frequent words that appear only in a small number of documents achieve high value. ﴀ ©2007 Yutaka Sasaki, University of ManchesterCreate Feature Vectors: Create Feature Vectors a an …brown,.. dog … fox jump lazi over the ( 0, 0,…,0, 1, ,0,…,0, 1,0,…,0, 1,0,…,0,1, 0,…,0,1,0,…,0,1, 2, 0, ..) 1. enumerate all word/n-grams in a whole set of documents 2. remove duplications and sort the words/n-grams 3. convert each word into its value, e.g., tf, idf, or tf-idf. 4. create a vector whose i-th value is the value of i-th term The brown fox jumps over the lazy dog. Generally, feature vectors are very sparse, i.e., most of the values are 0. feature vector with tf weights: ﴀ ©2007 Yutaka Sasaki, University of ManchesterSlide40: Multi-Topic Text Classification ﴀ ©2007 Yutaka Sasaki, University of ManchesterMulti-topic Text Classification: Multi-topic Text Classification <TOPICS>ship</TOPICS> The Panama Canal Commission, a U.S. government agency, said in its daily operations report that there was a backlog of 39 ships waiting to enter the canal early today. <TOPICS>crude</TOPICS> Diamond Shamrock Corp said that effective today it had cut its contract prices for crude oil by 1.50 dlrs a barrel. <TOPICS>crude:ship</TOPICS> The port of Philadelphia was closed when a Cypriot oil tanker, Seapride II, ran aground after hitting a 200-foot tower supporting power lines across the river, a Coast Guard spokesman said. (Excerpt from Ruters-21578) One single document belongs to multiple topics An interesting and important research theme that is not nicely solved yet. Topic A&B is not always a mixture of A and B ﴀ ©2007 Yutaka Sasaki, University of ManchesterA View on Multi-topic Text Classification: A View on Multi-topic Text Classification Open Topic Assumption (OTA) (conventional view) A document has multiple topics The topics other than the given topics are neutral. Closed Topic Assumption (CTA) A document has multiple topics The other topics are considered to be explicitly excluded. E.g., if there exist three topics A,B,C and a text d is given the topic A, then this assignment is regarded that d belongs to A but does not belong to B and C. B A C A A but neither B nor C CTA OTA A ﴀ ©2007 Yutaka Sasaki, University of ManchesterSlide43: ﴀ ©2007 Yutaka Sasaki, University of ManchesterSlide44: Case Studies ﴀ ©2007 Yutaka Sasaki, University of ManchesterExperiments: Experiments Objective compare the performance of approaches based on Closed Topic Assumption and Open Topic Assumption. Data 1 (Clinical records) Training: about 986 documents Test: 984 documents Data 2 (Reuters newswires) Training: 9,603 documents Test: 3,299 documents Machine Learning methods SVM: Support Vector Machines MEM: Maximum Entropy Models Approaches BC: Binary Class Classification MC: Multi Class Classification ﴀ ©2007 Yutaka Sasaki, University of ManchesterClassification of Clinical Records: Classification of Clinical Records Medical NLP Challenge (Computational Medicine Centre) Classify anonymized real clinical records into International Clinical Codes (ICD-9-CM) 44 research institutes participated Sample Record: # Clinical History This is a patient with meningomyelocele and neurogenic bladder. # Impression Normal renal ultrasound in a patient with neurogenic bladder. Correct codes (possibly multiple codes): 596.54 (Neurogenic bladder NOS) 741.90 (Without mention of hydrocephalus) ﴀ ©2007 Yutaka Sasaki, University of ManchesterSlide47: Document Predicted codes (multi-topics) Correct codes Top 5 Candidates Significance of each feature ﴀ ©2007 Yutaka Sasaki, University of ManchesterEvaluation Metrics: Evaluation Metrics AC: multi-labelling accuracy Cost-Sensitive Accuracy Measure (for clinical data) Precision Recall F1 Micro F1: Global calculation of F1 regardless of topics Macro F1: Average on F1 scores of all topics # system correct labeling # system output # system correct labeling # correct labeling 2 * Precison * Recall Precision + Recall ﴀ ©2007 Yutaka Sasaki, University of ManchesterSlide49: Classification Experiments on Clinical Records ﴀ ©2007 Yutaka Sasaki, University of ManchesterSlide50: Experimental Results on Clinical Records (cont.) ﴀ ©2007 Yutaka Sasaki, University of ManchesterExperimental Results on Reuters: Experimental Results on Reuters CTA OTA CTA OTA ﴀ ©2007 Yutaka Sasaki, University of ManchesterMulti-topic accuracy (Reuters): Multi-topic accuracy (Reuters)Micro-average F1 (Reuters): Micro-average F1 (Reuters) (Excerpt from Ruters-21578) ﴀ ©2007 Yutaka Sasaki, University of ManchesterMacro-average F1 (Reuters): Macro-average F1 (Reuters) ﴀ ©2007 Yutaka Sasaki, University of ManchesterReferences: References Rule-based vs. Machine Learning Based Text Classification Robert H. Creecy, Brij M. Masand, Stephen J. Smith, David L. Walt, Trading MIPS and memory for knowledge engineerring, Communications of the ACM, Vol. 35, Isuue 8, pp. 48-64, 1992. Review paper on Text Classification Fabrizio Sebastiani, Machine Learning in Automated Text Categorization, ACM Computing Surveys, Vol. 34, No.1, pp.1-47, 2002. CMC Medical NLP Challenge 2007 http://www.computationalmedicine.org/challenge/index.php Clinical Text Classification Yutaka Sasaki, Brian Rea, Sophia Ananiadou, Multi-Topic Aspects in Clinical Text Classification, IEEE International Conference on Bioinformatics and Biomedicine 2007 (IEEE BIBM-07), Silicon Valley, Nov. 2-7, 2007. Selected papers on Text Classification S. T. Dumais, J. Platt, D. Heckerman, and M. Sahami, Inductive Learning Algorithms and Representations for Text Categorization, Prof. CIKM '98, pp.148-155, 1998. Thorsten Joachims, Text Categorization with Support Vector Machines: Learning with Many Relevant Features, Proc. of 10th European Conference on Machine Learning (ECML-98)}, pp.137-142, 1998. A. McCallum, Multi-label Text Classification with a Mixture Model Trained by EM, AAAI-99 Workshop on Text Learning, 1999. K. Nigam, J. Lafferty, A. McCallum, Using Maximum Entropy for Text Classification, IJCAI-99 Workshop on Machine Learning for Information Filtering, pp.61-67, 1999. John C. Platt, Nello Cristianini, John Shawe-Taylor, Large Margin DAGs for Multiclass Classification, Proc. of NIPS-1999, pp. 547-553, 1999. RE Schapire and Y Singer, BoosTexter: A Boosting-based System for Text Categorization, Machine Learning, Springer, Vol. 39, pp.135-168, 2000. ﴀ ©2007 Yutaka Sasaki, University of ManchesterSlide56: Thank you ﴀ ©2007 Yutaka Sasaki, University of Manchester