logging in or signing up FA05 cs294 5 lecture 6 final Clown Download Post to : URL : Related Presentations : Share Add to Flag Embed Email Send to Blogs and Networks Add to Channel Uploaded from authorPOINTLite Insert YouTube videos in PowerPont slides with aS Desktop Copy embed code: (To copy code, click on the text box) Embed: URL: Thumbnail: WordPress Embed Customize Embed The presentation is successfully added In Your Favorites. Views: 95 Category: Entertainment License: All Rights Reserved Like it (0) Dislike it (0) Added: November 20, 2007 This Presentation is Public Favorites: 0 Presentation Description No description available. Comments Posting comment... Premium member Presentation Transcript CS 294-5: StatisticalNatural Language Processing: CS 294-5: Statistical Natural Language Processing Text Clustering, EM Lecture 6: 9/19/05 Guest Lecturer: Teg Grenager, Stanford UniversityOverview: Overview So far: Classification Applications: text categorization, language identification, word sense disambiguation Generative models: Naïve Bayes Discriminative models: maximum entropy models (a.k.a. logistic regression) “Supervised” learning paradigm Today: Clustering “Unsupervised” learning: no class labels to learn from Magic: discovers hidden patterns in the data Useful in a range of NLP tasks: IR, smoothing, data mining, exploratory data analysis Please interrupt me (I hear you’re good at that!)Ambiguous web queries: Ambiguous web queries Web queries are often truly ambiguous: jaguar NLP paris hilton Seems like word sense ambiguation should help Different senses of jaguar: animal, car, OS X… In practice WSD doesn’t help for web queries Disambiguation is either impossible (“jaguar”) or trivial (“jaguar car”) Better to let the user decide “Cluster” the results into useful groupings Demo: Meet “Clusty”: Demo: Meet “Clusty”How’d they do that?: How’d they do that? Text categorization Label data and build a MaxEnt classifier for every major disambiguation decision Expensive, impractical for open domain Many clustering methods have been developed Most start with a pairwise distance function Most can be interpreted probabilistically (with some effort) Axes: flat / hierarchical, agglomerative / divisive, incremental / iterative, probabilistic / graph theoretic / linear algebraic Our focus: “model-based” vs. “model-free” Model-Free: Define a notion of “page similarity”, and put similar things together in clusters (heuristic, agglomerative) Model-Based: Define a generative probabilistic model over the pages and their clusters, and search for parameters which maximize data likelihood (probabilistic, generative) Point Clustering: Point Clustering Task: group points into clusters Here we illustrate with simple two-dimensional point examples Warning: quite different from text clustering Featural representations of text will typically have a large number of dimensions (103 - 106) Euclidean distance isn’t necessarily the best distance metric for featural representations of text Two Views of Documents: Two Views of Documents Probabilistic A document is a collection of words sampled from some distribution, an empirical distribution Correlations between words flows through hidden model structure Distance: divergences Vector Space A document is a point in a high-dimensional vector space Correlations between words reflects low rank of valid document subspace Distance: Euclidean / cosineHigh-Dimensional Data: High-Dimensional Data Both of these pictures are totally misleading! Documents are zero in almost all axes Most document pairs are very far apart (i.e. not strictly orthogonal, but only share very common words and a few scattered others) In classification terms: virtually all document sets are separable, for most any classification Model-Based Clustering: Model-Based Clustering Document clustering with probabilistic models: Find C and to maximize P(X,C|) LONDON -- Soccer team wins match… NEW YORK – Stocks close up 3%… Investing in the stock market has… The first game of the world series… Observed (X) c1 c2 c2 c1 Unobserved (C)k-Means Clustering: The simplest model-based technique Procedure: Failure Cases: k-Means ClusteringMixture Models: Mixture Models Consider models of the form: Example: generating points in 2D with Gaussian The observed data instances The clusters they belong to Prior probability of cluster i Prob of cluster generating data instance i Learning with EM: Learning with EM Recall that in supervised learning, we search for model parameters which maximize data likelihood Not guaranteed to work well, but it’s a reasonable thing to do and we know how to do it Maximum likelihood estimation is trivial in a generative model: can compute in closed form from data counts Can we do that here? We could if we knew the cluster labels ci Iterative procedure (Expectation-Maximization): Guess some initial parameters for the model Use model to make best guesses of ci (E-step) Use the new complete data to learn better model (M-step) Repeat steps 2 and 3 until convergencek-Means is Hard EM: k-Means is Hard EM Iterative procedure (Expectation-Maximization): Guess some initial parameters for the model Use model to make best guesses of ci (E-step) Use the new complete data to learn better model (M-step) Repeat steps 2 and 3 until convergence EM in Detail: EM in Detail Expectation step Using current model parameters, do probabilistic inference to compute the probability of the cluster labels c These Q’s can viewed as “soft completions” of the data Note: k-Means approximates this Q function with the max Maximization step Compute the model parameters which maximize the log likelihood of the “completed” data (can do in closed form)EM Properties: EM Properties EM is a general technique for learning anytime we have incomplete data (x,y) Convenience Scenario: we want P(x), including y just makes the model simpler (e.g. mixing weights) Induction Scenario: we actually want to know y (e.g. clustering) You’ll see it again in this course! Each step of EM is guaranteed to increase data likelihood - a hill climbing procedure Not guaranteed to find global maximum of data likelihood Data likelihood typically has many local maxima for a general model class and rich feature set Many “patterns” in the data that we can fit our model to…EM Monotonicity Proof: EM Monotonicity Proof Jensen’s inequality for concave function f: f(E[x]) E[f(x)] Multiply by 1 We had chosen (t) to be the max, so any other is worse. Uhoh! Jensen’s would go the wrong way! where EM For Text Clustering: EM For Text Clustering Remember, we care about documents, not points How to model probability of a document given a class? Probabilistic: Naïve Bayes Doesn’t represent differential feature weighting Vector Space: Gaussian Euclidean distance assumption isn’t quite right Agglomerative Clustering: Agglomerative Clustering Most popular heuristic clustering methods Big idea: pick up similar documents and stick them together, repeat Point Example (single link): You get a cluster hierarchy for freeAgglomerative Choices: Agglomerative Choices Choice of distance metric between instances: Euclidean distance (L2-norm) - equivalent to vector space model KL-divergence - equivalent to probabilistic model Choice of distance metric between clusters: Single-link: distance between closest instances in clusters Complete-link: distance between furthest instances in clusters Average-link: average distance between instances in clusters Ward’s method: difference between sum squared error to centroid of combined cluster and separate clustersSingle-Link Clustering: Single-Link Clustering Procedure: Failure Cases Fails when clusters are not well separated (often!) Model Form: Corresponds to fitting a model where instances in each cluster were generated by a random walk though the space Complete-Link Clustering: Complete-Link Clustering Procedure: Failure Cases Fails when clusters aren’t spherical, or of uniform size Model Form Corresponds to fitting a model where instances in each cluster are generated in uniform spheres around a centroid Clustering Demo: Clustering DemoClustering Method Summary: Clustering Method Summary Agglomerative methods: Pro: easy to code Pro: you get a hierarchy of clusters for free Pro/Con: you don’t have to explicitly propose a model (but your distance metrics imply one anyway) Con: runtime > n2, which becomes prohibitive Model-based methods: Pro/Con: you’re forced to propose an explicit model Pro: usually quick to converge Con: very sensitive to initialization Con: how many clusters?Clustering vs. Classification: Clustering vs. Classification Classification: we specify which pattern we want, features uncorrelated with pattern are idle Clustering: clustering procedure locks on to whichever pattern is most salient P(content words | class) will learn topics P(length, function words | class) will learn style P(characters | class) will learn “language”Multiple Patterns: Multiple Patterns Even with the same model class, there are multiple patterns in the data…Multiple Patterns: Multiple Patterns Model Parameterizations Data LikelihoodMultiple Patterns: Multiple Patterns Ways to deal with it Change the data itself Change the search procedure (including smart initialization) Change the model classMultiple Patterns: Multiple Patterns Change Data 1D Projection Examples: Remove stopwords from documents Use dimensionality reduction techniques to change featural representationMultiple Patterns: Multiple Patterns Change Search Examples: Smart initialization of the search Search a subspace by only reestimating some of the model parameters in the M-stepMultiple Patterns: Multiple Patterns Change Model Examples: Add heuristic feature weighting such as inverse document frequency (IDF) Add a hierarchical emission model to Naïve Bayes Limit the form of the covariance matrix in a Gaussian Clustering Problems: Clustering Problems There are multiple patterns in the data, basic approach will just give you the most salient one Relationship between the data representation and the model class is complex and not well understood Data likelihood isn’t usually what you want to maximize Can’t find the global maximum anyway Practical Advice: Practical Advice What can go wrong: Bad initialization (more on this later) Bad interaction between data representation and model bias Can learn some salient pattern that is not what you wanted What can you do? Get used to disappointment Look at errors! Understand what the model family can (and can’t) learn Change data representation Change model structure or estimators …or change objective function [Smith and Eisner, ACL 05] Semi-Supervised Learning: Semi-Supervised Learning A middle ground: semi-supervised methods Use a small labeled training set and a large unlabeled extension set Use labeled data to lock onto the desired patterns Use unlabeled data to flesh out model parameters Some approaches Constrained clustering Self-training Adaptation / anchoring Also: active learningSummary: Summary Clustering Clustering is cool It’s easy to find the most salient pattern It’s quite hard to find the pattern you want It’s hard to know how to fix when broken EM is a useful optimization technique you should understand well if you don’t already Next time: Part of speech tagging You do not have the permission to view this presentation. In order to view it, please contact the author of the presentation.
FA05 cs294 5 lecture 6 final Clown Download Post to : URL : Related Presentations : Share Add to Flag Embed Email Send to Blogs and Networks Add to Channel Uploaded from authorPOINTLite Insert YouTube videos in PowerPont slides with aS Desktop Copy embed code: (To copy code, click on the text box) Embed: URL: Thumbnail: WordPress Embed Customize Embed The presentation is successfully added In Your Favorites. Views: 95 Category: Entertainment License: All Rights Reserved Like it (0) Dislike it (0) Added: November 20, 2007 This Presentation is Public Favorites: 0 Presentation Description No description available. Comments Posting comment... Premium member Presentation Transcript CS 294-5: StatisticalNatural Language Processing: CS 294-5: Statistical Natural Language Processing Text Clustering, EM Lecture 6: 9/19/05 Guest Lecturer: Teg Grenager, Stanford UniversityOverview: Overview So far: Classification Applications: text categorization, language identification, word sense disambiguation Generative models: Naïve Bayes Discriminative models: maximum entropy models (a.k.a. logistic regression) “Supervised” learning paradigm Today: Clustering “Unsupervised” learning: no class labels to learn from Magic: discovers hidden patterns in the data Useful in a range of NLP tasks: IR, smoothing, data mining, exploratory data analysis Please interrupt me (I hear you’re good at that!)Ambiguous web queries: Ambiguous web queries Web queries are often truly ambiguous: jaguar NLP paris hilton Seems like word sense ambiguation should help Different senses of jaguar: animal, car, OS X… In practice WSD doesn’t help for web queries Disambiguation is either impossible (“jaguar”) or trivial (“jaguar car”) Better to let the user decide “Cluster” the results into useful groupings Demo: Meet “Clusty”: Demo: Meet “Clusty”How’d they do that?: How’d they do that? Text categorization Label data and build a MaxEnt classifier for every major disambiguation decision Expensive, impractical for open domain Many clustering methods have been developed Most start with a pairwise distance function Most can be interpreted probabilistically (with some effort) Axes: flat / hierarchical, agglomerative / divisive, incremental / iterative, probabilistic / graph theoretic / linear algebraic Our focus: “model-based” vs. “model-free” Model-Free: Define a notion of “page similarity”, and put similar things together in clusters (heuristic, agglomerative) Model-Based: Define a generative probabilistic model over the pages and their clusters, and search for parameters which maximize data likelihood (probabilistic, generative) Point Clustering: Point Clustering Task: group points into clusters Here we illustrate with simple two-dimensional point examples Warning: quite different from text clustering Featural representations of text will typically have a large number of dimensions (103 - 106) Euclidean distance isn’t necessarily the best distance metric for featural representations of text Two Views of Documents: Two Views of Documents Probabilistic A document is a collection of words sampled from some distribution, an empirical distribution Correlations between words flows through hidden model structure Distance: divergences Vector Space A document is a point in a high-dimensional vector space Correlations between words reflects low rank of valid document subspace Distance: Euclidean / cosineHigh-Dimensional Data: High-Dimensional Data Both of these pictures are totally misleading! Documents are zero in almost all axes Most document pairs are very far apart (i.e. not strictly orthogonal, but only share very common words and a few scattered others) In classification terms: virtually all document sets are separable, for most any classification Model-Based Clustering: Model-Based Clustering Document clustering with probabilistic models: Find C and to maximize P(X,C|) LONDON -- Soccer team wins match… NEW YORK – Stocks close up 3%… Investing in the stock market has… The first game of the world series… Observed (X) c1 c2 c2 c1 Unobserved (C)k-Means Clustering: The simplest model-based technique Procedure: Failure Cases: k-Means ClusteringMixture Models: Mixture Models Consider models of the form: Example: generating points in 2D with Gaussian The observed data instances The clusters they belong to Prior probability of cluster i Prob of cluster generating data instance i Learning with EM: Learning with EM Recall that in supervised learning, we search for model parameters which maximize data likelihood Not guaranteed to work well, but it’s a reasonable thing to do and we know how to do it Maximum likelihood estimation is trivial in a generative model: can compute in closed form from data counts Can we do that here? We could if we knew the cluster labels ci Iterative procedure (Expectation-Maximization): Guess some initial parameters for the model Use model to make best guesses of ci (E-step) Use the new complete data to learn better model (M-step) Repeat steps 2 and 3 until convergencek-Means is Hard EM: k-Means is Hard EM Iterative procedure (Expectation-Maximization): Guess some initial parameters for the model Use model to make best guesses of ci (E-step) Use the new complete data to learn better model (M-step) Repeat steps 2 and 3 until convergence EM in Detail: EM in Detail Expectation step Using current model parameters, do probabilistic inference to compute the probability of the cluster labels c These Q’s can viewed as “soft completions” of the data Note: k-Means approximates this Q function with the max Maximization step Compute the model parameters which maximize the log likelihood of the “completed” data (can do in closed form)EM Properties: EM Properties EM is a general technique for learning anytime we have incomplete data (x,y) Convenience Scenario: we want P(x), including y just makes the model simpler (e.g. mixing weights) Induction Scenario: we actually want to know y (e.g. clustering) You’ll see it again in this course! Each step of EM is guaranteed to increase data likelihood - a hill climbing procedure Not guaranteed to find global maximum of data likelihood Data likelihood typically has many local maxima for a general model class and rich feature set Many “patterns” in the data that we can fit our model to…EM Monotonicity Proof: EM Monotonicity Proof Jensen’s inequality for concave function f: f(E[x]) E[f(x)] Multiply by 1 We had chosen (t) to be the max, so any other is worse. Uhoh! Jensen’s would go the wrong way! where EM For Text Clustering: EM For Text Clustering Remember, we care about documents, not points How to model probability of a document given a class? Probabilistic: Naïve Bayes Doesn’t represent differential feature weighting Vector Space: Gaussian Euclidean distance assumption isn’t quite right Agglomerative Clustering: Agglomerative Clustering Most popular heuristic clustering methods Big idea: pick up similar documents and stick them together, repeat Point Example (single link): You get a cluster hierarchy for freeAgglomerative Choices: Agglomerative Choices Choice of distance metric between instances: Euclidean distance (L2-norm) - equivalent to vector space model KL-divergence - equivalent to probabilistic model Choice of distance metric between clusters: Single-link: distance between closest instances in clusters Complete-link: distance between furthest instances in clusters Average-link: average distance between instances in clusters Ward’s method: difference between sum squared error to centroid of combined cluster and separate clustersSingle-Link Clustering: Single-Link Clustering Procedure: Failure Cases Fails when clusters are not well separated (often!) Model Form: Corresponds to fitting a model where instances in each cluster were generated by a random walk though the space Complete-Link Clustering: Complete-Link Clustering Procedure: Failure Cases Fails when clusters aren’t spherical, or of uniform size Model Form Corresponds to fitting a model where instances in each cluster are generated in uniform spheres around a centroid Clustering Demo: Clustering DemoClustering Method Summary: Clustering Method Summary Agglomerative methods: Pro: easy to code Pro: you get a hierarchy of clusters for free Pro/Con: you don’t have to explicitly propose a model (but your distance metrics imply one anyway) Con: runtime > n2, which becomes prohibitive Model-based methods: Pro/Con: you’re forced to propose an explicit model Pro: usually quick to converge Con: very sensitive to initialization Con: how many clusters?Clustering vs. Classification: Clustering vs. Classification Classification: we specify which pattern we want, features uncorrelated with pattern are idle Clustering: clustering procedure locks on to whichever pattern is most salient P(content words | class) will learn topics P(length, function words | class) will learn style P(characters | class) will learn “language”Multiple Patterns: Multiple Patterns Even with the same model class, there are multiple patterns in the data…Multiple Patterns: Multiple Patterns Model Parameterizations Data LikelihoodMultiple Patterns: Multiple Patterns Ways to deal with it Change the data itself Change the search procedure (including smart initialization) Change the model classMultiple Patterns: Multiple Patterns Change Data 1D Projection Examples: Remove stopwords from documents Use dimensionality reduction techniques to change featural representationMultiple Patterns: Multiple Patterns Change Search Examples: Smart initialization of the search Search a subspace by only reestimating some of the model parameters in the M-stepMultiple Patterns: Multiple Patterns Change Model Examples: Add heuristic feature weighting such as inverse document frequency (IDF) Add a hierarchical emission model to Naïve Bayes Limit the form of the covariance matrix in a Gaussian Clustering Problems: Clustering Problems There are multiple patterns in the data, basic approach will just give you the most salient one Relationship between the data representation and the model class is complex and not well understood Data likelihood isn’t usually what you want to maximize Can’t find the global maximum anyway Practical Advice: Practical Advice What can go wrong: Bad initialization (more on this later) Bad interaction between data representation and model bias Can learn some salient pattern that is not what you wanted What can you do? Get used to disappointment Look at errors! Understand what the model family can (and can’t) learn Change data representation Change model structure or estimators …or change objective function [Smith and Eisner, ACL 05] Semi-Supervised Learning: Semi-Supervised Learning A middle ground: semi-supervised methods Use a small labeled training set and a large unlabeled extension set Use labeled data to lock onto the desired patterns Use unlabeled data to flesh out model parameters Some approaches Constrained clustering Self-training Adaptation / anchoring Also: active learningSummary: Summary Clustering Clustering is cool It’s easy to find the most salient pattern It’s quite hard to find the pattern you want It’s hard to know how to fix when broken EM is a useful optimization technique you should understand well if you don’t already Next time: Part of speech tagging