logging in or signing up Shen CRF Dabby Download Post to : URL : Related Presentations : Share Add to Flag Embed Email Send to Blogs and Networks Add to Channel Uploaded from authorPOINT Insert YouTube videos in PowerPont slides with aS Desktop Copy embed code: (To copy code, click on the text box) Embed: URL: Thumbnail: WordPress Embed Customize Embed The presentation is successfully added In Your Favorites. Views: 751 Category: Product Traini.. License: All Rights Reserved Like it (0) Dislike it (0) Added: September 25, 2007 This Presentation is Public Favorites: 0 Presentation Description No description available. Comments Posting comment... Premium member Presentation Transcript ICML 2001 Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data: ICML 2001 Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data John Lafferty, Andrew McCallum, Fernando Pereira Presentation by Rongkun Shen Nov. 20, 2003 Sequence Segmenting and Labeling: Sequence Segmenting and Labeling Goal: mark up sequences with content tags Application in computational biology DNA and protein sequence alignment Sequence homolog searching in databases Protein secondary structure prediction RNA secondary structure analysis Application in computational linguistics andamp; computer science Text and speech processing, including topic segmentation, part-of-speech (POS) tagging Information extraction Syntactic disambiguation Example: Protein secondary structure prediction: Example: Protein secondary structure prediction Conf: 977621015677468999723631357600330223342057899861488356412238 Pred: CCCCCCCCCCCCCEEEEEEECCCCCCCCCCCCCHHHHHHHHHHHHHHHCCCCEEEEHHCC AA: EKKSINECDLKGKKVLIRVDFNVPVKNGKITNDYRIRSALPTLKKVLTEGGSCVLMSHLG 10 20 30 40 50 60 Conf: 855764222454123478985100010478999999874033445740023666631258 Pred: CCCCCCCCCCCCCCCCCCCCCCCCCCHHHHHHHHHHHHHCCCCCCCCCCCCHHHHHHCCC AA: RPKGIPMAQAGKIRSTGGVPGFQQKATLKPVAKRLSELLLRPVTFAPDCLNAADVVSKMS 70 80 90 100 110 120 Conf: 874688611002343044310017899999875053355212244334552001322452 Pred: CCCEEEECCCHHHHHHCCCCCHHHHHHHHHHHHHCCEEEECCCCCCCCCCCCCCCCHHHH AA: PGDVVLLENVRFYKEEGSKKAKDREAMAKILASYGDVYISDAFGTAHRDSATMTGIPKIL 130 140 150 160 170 180 Generative Models: Generative Models Hidden Markov models (HMMs) and stochastic grammars Assign a joint probability to paired observation and label sequences The parameters typically trained to maximize the joint likelihood of train examples Generative Models (cont’d): Generative Models (cont’d) Difficulties and disadvantages Need to enumerate all possible observation sequences Not practical to represent multiple interacting features or long-range dependencies of the observations Very strict independence assumptions on the observations Conditional Models: Conditional Models Conditional probability P(label sequence y | observation sequence x) rather than joint probability P(y, x) Specify the probability of possible label sequences given an observation sequence Allow arbitrary, non-independent features on the observation sequence X The probability of a transition between labels may depend on past and future observations Relax strong independence assumptions in generative models Discriminative ModelsMaximum Entropy Markov Models (MEMMs): Discriminative Models Maximum Entropy Markov Models (MEMMs) Exponential model Given training set X with label sequence Y: Train a model θ that maximizes P(Y|X, θ) For a new data sequence x, the predicted label y maximizes P(y|x, θ) Notice the per-state normalization MEMMs (cont’d): MEMMs (cont’d) MEMMs have all the advantages of Conditional Models Per-state normalization: all the mass that arrives at a state must be distributed among the possible successor states ('conservation of score mass') Subject to Label Bias Problem Bias toward states with fewer outgoing transitions Label Bias Problem: Label Bias Problem P(1 and 2 | ro) = P(2 | 1 and ro)P(1 | ro) = P(2 | 1 and o)P(1 | r) P(1 and 2 | ri) = P(2 | 1 and ri)P(1 | ri) = P(2 | 1 and i)P(1 | r) Since P(2 | 1 and x) = 1 for all x, P(1 and 2 | ro) = P(1 and 2 | ri) In the training data, label value 2 is the only label value observed after label value 1 Therefore P(2 | 1) = 1, so P(2 | 1 and x) = 1 for all x However, we expect P(1 and 2 | ri) to be greater than P(1 and 2 | ro). Per-state normalization does not allow the required expectation Consider this MEMM: Solve the Label Bias Problem: Solve the Label Bias Problem Change the state-transition structure of the model Not always practical to change the set of states Start with a fully-connected model and let the training procedure figure out a good structure Prelude the use of prior, which is very valuable (e.g. in information extraction) Random Field: Random Field Conditional Random Fields (CRFs): Conditional Random Fields (CRFs) CRFs have all the advantages of MEMMs without label bias problem MEMM uses per-state exponential model for the conditional probabilities of next states given the current state CRF has a single exponential model for the joint probability of the entire sequence of labels given the observation sequence Undirected acyclic graph Allow some transitions 'vote' more strongly than others depending on the corresponding observations Definition of CRFs: Definition of CRFs X is a random variable over data sequences to be labeled Y is a random variable over corresponding label sequences Example of CRFs: Example of CRFs Graphical comparison among HMMs, MEMMs and CRFs: Graphical comparison among HMMs, MEMMs and CRFs HMM MEMM CRF Conditional Distribution: Conditional Distribution Conditional Distribution (cont’d): Conditional Distribution (cont’d) CRFs use the observation-dependent normalization Z(x) for the conditional distributions: Z(x) is a normalization over the data sequence x Parameter Estimation for CRFs: Parameter Estimation for CRFs The paper provided iterative scaling algorithms It turns out to be very inefficient Prof. Dietterich’s group applied Gradient Descendent Algorithm, which is quite efficient Training of CRFs (From Prof. Dietterich): Training of CRFs (From Prof. Dietterich) Then, take the derivative of the above equation For training, the first 2 items are easy to get. For example, for each lk, fk is a sequence of Boolean numbers, such as 00101110100111. is just the total number of 1’s in the sequence. The hardest thing is how to calculate Z(x) Training of CRFs (From Prof. Dietterich) (cont’d): Training of CRFs (From Prof. Dietterich) (cont’d) Maximal cliques Modeling the label bias problem: Modeling the label bias problem In a simple HMM, each state generates its designated symbol with probability 29/32 and the other symbols with probability 1/32 Train MEMM and CRF with the same topologies A run consists of 2,000 training examples and 500 test examples, trained to convergence using Iterative Scaling algorithm CRF error is 4.6%, and MEMM error is 42% MEMM fails to discriminate between the two branches CRF solves label bias problem MEMM vs. HMM: MEMM vs. HMM The HMM outperforms the MEMM MEMM vs. CRF: MEMM vs. CRF CRF usually outperforms the MEMM CRF vs. HMM: CRF vs. HMM Each open square represents a data set with α andlt; 1/2, and a solid circle indicates a data set with α ≥ 1/2; When the data is mostly second order (α ≥ 1/2), the discriminatively trained CRF usually outperforms the HMM POS tagging Experiments: POS tagging Experiments POS tagging Experiments (cont’d): POS tagging Experiments (cont’d) Compared HMMs, MEMMs, and CRFs on Penn treebank POS tagging Each word in a given input sentence must be labeled with one of 45 syntactic tags Add a small set of orthographic features: whether a spelling begins with a number or upper case letter, whether it contains a hyphen, and if it contains one of the following suffixes: -ing, -ogy, -ed, -s, -ly, -ion, -tion, -ity, -ies oov = out-of-vocabulary (not observed in the training set) Summary: Summary Discriminative models are prone to the label bias problem CRFs provide the benefits of discriminative models CRFs solve the label bias problem well, and demonstrate good performance Thanks for your attention!Special thanks to Prof. Dietterich & Tadepalli!: Thanks for your attention! Special thanks to Prof. Dietterich andamp; Tadepalli! You do not have the permission to view this presentation. In order to view it, please contact the author of the presentation.
Shen CRF Dabby Download Post to : URL : Related Presentations : Share Add to Flag Embed Email Send to Blogs and Networks Add to Channel Uploaded from authorPOINT Insert YouTube videos in PowerPont slides with aS Desktop Copy embed code: (To copy code, click on the text box) Embed: URL: Thumbnail: WordPress Embed Customize Embed The presentation is successfully added In Your Favorites. Views: 751 Category: Product Traini.. License: All Rights Reserved Like it (0) Dislike it (0) Added: September 25, 2007 This Presentation is Public Favorites: 0 Presentation Description No description available. Comments Posting comment... Premium member Presentation Transcript ICML 2001 Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data: ICML 2001 Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data John Lafferty, Andrew McCallum, Fernando Pereira Presentation by Rongkun Shen Nov. 20, 2003 Sequence Segmenting and Labeling: Sequence Segmenting and Labeling Goal: mark up sequences with content tags Application in computational biology DNA and protein sequence alignment Sequence homolog searching in databases Protein secondary structure prediction RNA secondary structure analysis Application in computational linguistics andamp; computer science Text and speech processing, including topic segmentation, part-of-speech (POS) tagging Information extraction Syntactic disambiguation Example: Protein secondary structure prediction: Example: Protein secondary structure prediction Conf: 977621015677468999723631357600330223342057899861488356412238 Pred: CCCCCCCCCCCCCEEEEEEECCCCCCCCCCCCCHHHHHHHHHHHHHHHCCCCEEEEHHCC AA: EKKSINECDLKGKKVLIRVDFNVPVKNGKITNDYRIRSALPTLKKVLTEGGSCVLMSHLG 10 20 30 40 50 60 Conf: 855764222454123478985100010478999999874033445740023666631258 Pred: CCCCCCCCCCCCCCCCCCCCCCCCCCHHHHHHHHHHHHHCCCCCCCCCCCCHHHHHHCCC AA: RPKGIPMAQAGKIRSTGGVPGFQQKATLKPVAKRLSELLLRPVTFAPDCLNAADVVSKMS 70 80 90 100 110 120 Conf: 874688611002343044310017899999875053355212244334552001322452 Pred: CCCEEEECCCHHHHHHCCCCCHHHHHHHHHHHHHCCEEEECCCCCCCCCCCCCCCCHHHH AA: PGDVVLLENVRFYKEEGSKKAKDREAMAKILASYGDVYISDAFGTAHRDSATMTGIPKIL 130 140 150 160 170 180 Generative Models: Generative Models Hidden Markov models (HMMs) and stochastic grammars Assign a joint probability to paired observation and label sequences The parameters typically trained to maximize the joint likelihood of train examples Generative Models (cont’d): Generative Models (cont’d) Difficulties and disadvantages Need to enumerate all possible observation sequences Not practical to represent multiple interacting features or long-range dependencies of the observations Very strict independence assumptions on the observations Conditional Models: Conditional Models Conditional probability P(label sequence y | observation sequence x) rather than joint probability P(y, x) Specify the probability of possible label sequences given an observation sequence Allow arbitrary, non-independent features on the observation sequence X The probability of a transition between labels may depend on past and future observations Relax strong independence assumptions in generative models Discriminative ModelsMaximum Entropy Markov Models (MEMMs): Discriminative Models Maximum Entropy Markov Models (MEMMs) Exponential model Given training set X with label sequence Y: Train a model θ that maximizes P(Y|X, θ) For a new data sequence x, the predicted label y maximizes P(y|x, θ) Notice the per-state normalization MEMMs (cont’d): MEMMs (cont’d) MEMMs have all the advantages of Conditional Models Per-state normalization: all the mass that arrives at a state must be distributed among the possible successor states ('conservation of score mass') Subject to Label Bias Problem Bias toward states with fewer outgoing transitions Label Bias Problem: Label Bias Problem P(1 and 2 | ro) = P(2 | 1 and ro)P(1 | ro) = P(2 | 1 and o)P(1 | r) P(1 and 2 | ri) = P(2 | 1 and ri)P(1 | ri) = P(2 | 1 and i)P(1 | r) Since P(2 | 1 and x) = 1 for all x, P(1 and 2 | ro) = P(1 and 2 | ri) In the training data, label value 2 is the only label value observed after label value 1 Therefore P(2 | 1) = 1, so P(2 | 1 and x) = 1 for all x However, we expect P(1 and 2 | ri) to be greater than P(1 and 2 | ro). Per-state normalization does not allow the required expectation Consider this MEMM: Solve the Label Bias Problem: Solve the Label Bias Problem Change the state-transition structure of the model Not always practical to change the set of states Start with a fully-connected model and let the training procedure figure out a good structure Prelude the use of prior, which is very valuable (e.g. in information extraction) Random Field: Random Field Conditional Random Fields (CRFs): Conditional Random Fields (CRFs) CRFs have all the advantages of MEMMs without label bias problem MEMM uses per-state exponential model for the conditional probabilities of next states given the current state CRF has a single exponential model for the joint probability of the entire sequence of labels given the observation sequence Undirected acyclic graph Allow some transitions 'vote' more strongly than others depending on the corresponding observations Definition of CRFs: Definition of CRFs X is a random variable over data sequences to be labeled Y is a random variable over corresponding label sequences Example of CRFs: Example of CRFs Graphical comparison among HMMs, MEMMs and CRFs: Graphical comparison among HMMs, MEMMs and CRFs HMM MEMM CRF Conditional Distribution: Conditional Distribution Conditional Distribution (cont’d): Conditional Distribution (cont’d) CRFs use the observation-dependent normalization Z(x) for the conditional distributions: Z(x) is a normalization over the data sequence x Parameter Estimation for CRFs: Parameter Estimation for CRFs The paper provided iterative scaling algorithms It turns out to be very inefficient Prof. Dietterich’s group applied Gradient Descendent Algorithm, which is quite efficient Training of CRFs (From Prof. Dietterich): Training of CRFs (From Prof. Dietterich) Then, take the derivative of the above equation For training, the first 2 items are easy to get. For example, for each lk, fk is a sequence of Boolean numbers, such as 00101110100111. is just the total number of 1’s in the sequence. The hardest thing is how to calculate Z(x) Training of CRFs (From Prof. Dietterich) (cont’d): Training of CRFs (From Prof. Dietterich) (cont’d) Maximal cliques Modeling the label bias problem: Modeling the label bias problem In a simple HMM, each state generates its designated symbol with probability 29/32 and the other symbols with probability 1/32 Train MEMM and CRF with the same topologies A run consists of 2,000 training examples and 500 test examples, trained to convergence using Iterative Scaling algorithm CRF error is 4.6%, and MEMM error is 42% MEMM fails to discriminate between the two branches CRF solves label bias problem MEMM vs. HMM: MEMM vs. HMM The HMM outperforms the MEMM MEMM vs. CRF: MEMM vs. CRF CRF usually outperforms the MEMM CRF vs. HMM: CRF vs. HMM Each open square represents a data set with α andlt; 1/2, and a solid circle indicates a data set with α ≥ 1/2; When the data is mostly second order (α ≥ 1/2), the discriminatively trained CRF usually outperforms the HMM POS tagging Experiments: POS tagging Experiments POS tagging Experiments (cont’d): POS tagging Experiments (cont’d) Compared HMMs, MEMMs, and CRFs on Penn treebank POS tagging Each word in a given input sentence must be labeled with one of 45 syntactic tags Add a small set of orthographic features: whether a spelling begins with a number or upper case letter, whether it contains a hyphen, and if it contains one of the following suffixes: -ing, -ogy, -ed, -s, -ly, -ion, -tion, -ity, -ies oov = out-of-vocabulary (not observed in the training set) Summary: Summary Discriminative models are prone to the label bias problem CRFs provide the benefits of discriminative models CRFs solve the label bias problem well, and demonstrate good performance Thanks for your attention!Special thanks to Prof. Dietterich & Tadepalli!: Thanks for your attention! Special thanks to Prof. Dietterich andamp; Tadepalli!