lecture19 Cubemiddle
Added: December 05, 2007

Presentation Transcript

I256: Applied Natural Language Processing
Marti Hearst
Nov 8, 2006

Today:
Comparing term clustering and category output
Clustering in Weka
Data mining from blogs

LDA:
Latent Dirichlet Allocation
Blei, Ng, Jordan, JMLR 03.
LDA is a hierarchical probabilistic model of documents.
“LDA allows you to analyze [a] corpus, and extract the topics that combined to form its documents.”
http://www.cs.princeton.edu/~blei/lda-c/
Not really clustering, but in the “soft clustering” ballpark.

LDA on Recipes (two slides):
http://orange.sims.berkeley.edu/cgi-bin/flamenco.cgi/recipes-newblei/Flamenco

CastaNet:
(Semi)automated facet creation
Stoica & Hearst
Build up from WordNet
Algorithm is fully automatic, but we think you can improve results manually afterwards.

CastaNet on Recipes (two slides):
http://orange.sims.berkeley.edu/cgi-bin/flamenco.cgi/recipes-automated/Flamenco

TopicSeek on Enron Email:
Technique: pLSI (probabilistic LSI, Hofmann 99)
Hand-picked example for website
http://topicseek.com/enron.html

TopicSeek on Medline:
Technique: pLSI (probabilistic LSI, Hofmann 99)
Hand-picked example for website
http://topicseek.com/pubmed.html

CastaNet on Medline Journal Titles:
http://orange.sims.berkeley.edu/cgi-bin/flamenco.cgi/medicine-automated/Flamenco

Clustering in Weka

Looking at Clustering Results:
Weka lets you save cluster results to an ARFF file.
I wrote some Python code to process this file and pull out the Subject headings for each newsgroup posting in each cluster.

15-way clustering

Cobweb clustering

Blog Analysis:
What’s special about blogs?

Blog analysis sites:
http://dijest.com/bc/ Called blogcount; lots of stats and news about blogs
http://blogcensus.net/?page=tools Language, location, market share
http://www.perseus.com/blogsurvey/ Stats about biggest blogs, demographics
http://www.weblogs.com/ Notify when new content posted
http://blogpulse.com/ Trends and recent popular topics

Blogs vs. Newsgroups:
Posting about products … what can we tell?
Blog:
Newsgroup:
Example from Glance, Hurst, and Tomokiyo ‘04

Analyzing Blogs for Market Data:
Figure from Glance, Hurst, Nigam, Siegler, Stockton, & Tomokiyo, KDD’05
Idea: examine comments about a product (or a product’s competition or market) in an automated fashion.
Application area: handheld electronic devices.

Analyzing Blogs for Market Data:
Figure from Glance, Hurst, Nigam, Siegler, Stockton, & Tomokiyo, KDD’05

Technology used:
Post segmentation
Important phrases
  Foreground vs. background corpus
  Background: text about product
  Foreground: certain negative paragraphs about product
Sentiment classification
  What do people talk about when saying negative things about product X?
Social network analysis (on discussion boards)
  What does this group of people talk about when saying negative things about product X?
Author dispersion
  Many people talking about it, or just a few?

Example:
What common phrases do people use when saying negative things about product X?

Example (two slides):
What do people in this group say when saying negative things about product X?

Predicting Film Sales:
Idea:
  Use discussion before a film to predict its opening weekend box office scores
  Use discussion afterwards to predict longer-term sales
  Separate out topic labels from sentiment labels
Outcome: Good predictor for opening weekend, but not for longer term
Observation: the nature of discussion gets (and thus harder to analyze) after the film has been out a while.
Example from Mishne & Glance, 2006

Predicting Film Sales (three more slides):
Example from Mishne & Glance, 2006

Analyzing Political Blogs:
Analyze:
  Who links to whom
  What the popularity profile looks like
    A power law/Zipf/Pareto, of course
  Look at structure of topic-specific blogs
    By number of inbound links
Image from blogosphere ecosystem via Shirky

Analyzing Political Blogs:
Earlier work examined books bought together in pairs at major retailers
Krebs, Divided we Stand??? http://www.orgnet.com/leftright.html
In other domains the groupings are more distributed.

Slide36: http://www.orgnet.com/booknet.html
Slide37: http://www.orgnet.com/leftright.html from Jan 2003
Slide38: http://www.orgnet.com/divided.html from 2004 election

Analyzing Political Blogs:
Study by Adamic and Glance, 2005
Analyzed the 40 most popular political blogs over the 2 months preceding the 2004 US presidential election
Also studied 1000 political blogs in a one-day snapshot
Findings for the latter:
  Liberal and conservative blogs had distinct lists of favorite news sources, people, and topics, with some overlap on current news
  Use labels from aggregator sources
  Linking patterns were indeed pretty internal (91% stayed within political leaning)
  More, and more frequent, linking among conservatives
  82% of conservative blogs linked out vs. 74% of liberal blogs

Analyzing Political Blogs:
For the 40 most popular blogs:
Looked for “echo chamber” effect
  The conservative blogs are more tightly interlinked. Question: do they repeat the same concepts more?
  Measured textual similarity among blog posts
  Slightly stronger within a political leaning than between, but not one orientation more than the other.
Looked for interaction with “mainstream” media
  Found strong distinctions between which sources were cited

Slide41 to Slide46: Image from Adamic & Glance 200

Next Time:
Sentiment and Opinion Analysis
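
The post-processing step mentioned under "Looking at Clustering Results" (reading a saved ARFF file and listing Subject headings per cluster) could look roughly like the sketch below. This is not the original course script. It assumes a hypothetical export with a `subject` string attribute and a `Cluster` attribute, attribute names without spaces, and dense (non-sparse) data rows; the sample data is invented for illustration.

```python
import csv
from collections import defaultdict

# Invented sample in the assumed layout: a quoted subject plus a cluster label.
SAMPLE = """@relation newsgroups_clustered
@attribute subject string
@attribute Cluster {cluster0,cluster1}
@data
'Re: Shuttle launch delayed',cluster0
'Goalie stats, 1993 season',cluster1
'Re: What\\'s the orbit of Mir?',cluster0
"""


def parse_arff(text):
    """Return (attribute_names, rows) for a simple dense ARFF string."""
    names, data_lines, in_data = [], [], False
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("%"):  # skip blanks and ARFF comments
            continue
        lower = line.lower()
        if in_data:
            data_lines.append(line)
        elif lower.startswith("@attribute"):
            # Second token is the attribute name (assumed to contain no spaces).
            names.append(line.split(None, 2)[1].strip("'\""))
        elif lower.startswith("@data"):
            in_data = True
    # Single-quoted strings with backslash escapes, parsed via the csv module.
    reader = csv.reader(data_lines, quotechar="'", escapechar="\\",
                        skipinitialspace=True)
    rows = [dict(zip(names, row)) for row in reader]
    return names, rows


def subjects_by_cluster(text, subject_attr="subject", cluster_attr="Cluster"):
    """Group subject lines by the cluster label assigned to each posting."""
    _, rows = parse_arff(text)
    groups = defaultdict(list)
    for row in rows:
        groups[row[cluster_attr]].append(row[subject_attr])
    return dict(groups)


if __name__ == "__main__":
    for cluster, subjects in sorted(subjects_by_cluster(SAMPLE).items()):
        print(cluster)
        for subject in subjects:
            print("   ", subject)
```

Reading the subjects cluster by cluster is a quick way to judge by eye whether a cluster corresponds to a single newsgroup or mixes several.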
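
The "Foreground vs. background corpus" idea from the "Technology used" slide can be illustrated with a small sketch. The slides do not say how phrases were scored, so the smoothed log-ratio of relative frequencies used here is an assumption chosen for simplicity, bigrams stand in for "phrases", the background is treated as a separate sample of text about the product, and both tiny corpora are invented.

```python
import math
from collections import Counter


def bigrams(text):
    """Lowercase, whitespace-tokenize, and return adjacent word pairs."""
    tokens = text.lower().split()
    return [" ".join(pair) for pair in zip(tokens, tokens[1:])]


def phrase_scores(foreground_docs, background_docs, smoothing=1.0):
    """Score each foreground phrase by log(P_foreground / P_background).

    Add-one style smoothing keeps phrases that never occur in the
    background from producing a division by zero.
    """
    fg = Counter(p for doc in foreground_docs for p in bigrams(doc))
    bg = Counter(p for doc in background_docs for p in bigrams(doc))
    vocab_size = len(set(fg) | set(bg))
    fg_total = sum(fg.values()) + smoothing * vocab_size
    bg_total = sum(bg.values()) + smoothing * vocab_size
    return {
        phrase: math.log((fg[phrase] + smoothing) / fg_total)
        - math.log((bg[phrase] + smoothing) / bg_total)
        for phrase in fg
    }


# Invented examples: general product text vs. negative paragraphs.
BACKGROUND = [
    "the battery life is fine for a weekend trip",
    "the screen is bright and the battery life is decent",
]
FOREGROUND = [
    "the battery died after an hour and the screen cracked",
    "battery died again so I returned it",
]

if __name__ == "__main__":
    scores = phrase_scores(FOREGROUND, BACKGROUND)
    for phrase, score in sorted(scores.items(), key=lambda kv: -kv[1])[:5]:
        print(f"{score:6.3f}  {phrase}")
```

Phrases with a high score are over-represented in the negative paragraphs relative to general discussion of the product, which is one way to approach the question "what do people talk about when saying negative things about product X?".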