Presentation Transcript

Discovery and Reconciliation of Entity Type Taxonomies: 

Discovery and Reconciliation of Entity Type Taxonomies
Soumen Chakrabarti, IIT Bombay
www.cse.iitb.ac.in/~soumen

Searching with types and entities: 

Searching with types and entities
Answer types:
"How far is it from Rome to Paris?" → type=distance#n#1 NEAR words={Rome, Paris}
Restrictions on match conditions:
"How many movies did 20th Century Fox release in 1923?"
"…by 1950, the forest had only 1923 foxes left…" would be a spurious match: here "1923" is a count, not a year
type={number#n#1, hasDigit} NEAR … year=1923, organization="20th Century Fox"
Corpus = set of docs; doc = token sequence; tokens connected to lexical networks
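A minimal sketch of how such a typed match condition might be evaluated over a type-annotated token sequence. The function and the (token, types) representation are my own illustration, not the speaker's system:

```python
# Hypothetical sketch: find tokens whose lexical-network type matches the
# query's answer type, with all selector words occurring in a nearby window.

def find_answers(doc_tokens, answer_type, near_words, window=10):
    """doc_tokens: list of (token, set_of_type_ids); near_words: set of strings."""
    hits = []
    for i, (tok, types) in enumerate(doc_tokens):
        if answer_type not in types:
            continue
        context = {t for t, _ in doc_tokens[max(0, i - window): i + window + 1]}
        if near_words <= context:          # every selector word is in the window
            hits.append((i, tok))
    return hits

# e.g. find_answers(tokens, "distance#n#1", {"Rome", "Paris"})
```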

Searching personal info networks: 

Searching personal info networks
No clean schema; data changes rapidly
Lots of generic "graph proximity" information
[Diagram: an info network linking a file (LastMod time), emails (EmailDate, EmailTo edges), the person "A. U. Thor", and "ECML", all tied to a canonical node]
Want to quickly find this file

Building blocks: 

Building blocks
Structured "seeds" of type info: WordNet, Wikipedia, OpenCYC, …
Semi-structured sources: list of faculty members in a department, catalog of products at an ecommerce site
Unstructured open-domain text: email bodies, text of papers, blog text, Web pages, …
Discovery and extension of type attachments: Hearst patterns, list extraction, NE taggers
Reconciling federated type systems: schema and data integration
Query execution engines using type catalogs

Hearst patterns and enhancements: 

Hearst patterns and enhancements
Hearst, 1992; KnowItAll (Etzioni+ 2004)
"T such as x", "x and other Ts", "x or other Ts", "T x", "x is a T", "x is the only T", …
C-PANKOW (Cimiano and Staab 2005)
Suitable for unformatted natural language
Generally high precision, low recall
If few possible Ts, use a named entity tagger
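For concreteness, a rough sketch of Hearst-pattern extraction (my own illustration, not the speaker's code); only three of the patterns above are shown, and real systems add many more and filter candidates with an NE tagger:

```python
import re

# Each pattern yields (instance, type) candidate pairs from plain text.
PATTERNS = [
    (re.compile(r"\b(\w+) such as (\w+)"),  lambda m: (m.group(2), m.group(1))),
    (re.compile(r"\b(\w+) and other (\w+)"), lambda m: (m.group(1), m.group(2))),
    (re.compile(r"\b(\w+) is a (\w+)"),      lambda m: (m.group(1), m.group(2))),
]

def extract_isa(text):
    pairs = set()
    for pattern, to_pair in PATTERNS:
        for m in pattern.finditer(text):
            pairs.add(to_pair(m))          # (instance, type)
    return pairs

print(extract_isa("cities such as Rome"))  # {('Rome', 'cities')}
```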

Set extraction: 

Set extraction
Each node of a graph is a word
An edge connects words w_i and w_j if they occur together in more than k docs; edge weight depends on #docs
Use Apriori-style searches to enumerate the co-occurring word sets
Given a set of words Q, set up a Pagerank computation:
With probability d, jump to some element of Q
With probability 1−d, walk to a neighbor
Present the nodes (words) with the largest Pagerank
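A compact sketch of the random-surfer scoring just described, assuming the word graph is given as an adjacency map; the function name and representation are illustrative:

```python
# Personalized Pagerank over the word co-occurrence graph.
# graph: {word: {neighbor: edge_weight}}, closed over neighbors; Q: seed words.

def set_expand(graph, Q, d=0.15, iters=50):
    words = list(graph)
    rank = {w: 1.0 / len(words) for w in words}
    jump = {w: 1.0 / len(Q) if w in Q else 0.0 for w in words}
    for _ in range(iters):
        nxt = {w: d * jump[w] for w in words}      # w.p. d, jump to some seed
        for w in words:
            if not graph[w]:
                continue                            # dangling word: drop its mass
            total = sum(graph[w].values())
            for nbr, wt in graph[w].items():
                # w.p. 1-d, walk to a neighbor in proportion to edge weight
                nxt[nbr] += (1 - d) * rank[w] * wt / total
        rank = nxt
    return sorted(words, key=rank.get, reverse=True)  # best candidates first
```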



List extraction: 

List extraction
Given a current set of candidate Ts (loop sketched below):
Limit to candidates having high confidence
Select a random subset of k=4 candidates
Generate a query from the selected candidates
Download the response documents
Look for lists containing candidate mentions
Extract more instances from the lists found
Boosts the extraction rate 2–5 fold
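A high-level sketch of this loop; web_search, fetch, and find_lists are hypothetical placeholders for a search-engine API, a page downloader, and an HTML list extractor:

```python
import random

def grow_set(candidates, confidence, rounds=10, k=4):
    """confidence(c) -> score in [0,1]; returns the enlarged instance set."""
    instances = set(candidates)
    for _ in range(rounds):
        trusted = [c for c in instances if confidence(c) > 0.8]
        if len(trusted) < k:
            break
        probe = random.sample(trusted, k)           # e.g. 4 random candidates
        for url in web_search(" ".join(probe)):     # query built from the probes
            page = fetch(url)
            for lst in find_lists(page):            # items of <ul>/<ol>/<table>
                if sum(c in lst for c in probe) >= 2:   # list mentions the probes
                    instances.update(lst)           # harvest the co-listed items
        # newly found instances get confidence scores before the next round
    return instances
```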

Wrapping, scraping, tagging: 

Wrapping, scraping, tagging
HTML formatting clues help extract records and fields
Extensive work in the DB, KDD, ML communities
[Example record graph: "P167" is-a Paper, has-author "Gerhard Pass" and "Gregor Heinrich", has-title "Investigating word correlation…"]

Reconciling type systems: 

Reconciling type systems
WordNet: small and precise
Wikipedia: much larger, less controlled
Collect into a common is-a database

Mapping between taxonomies: 

Mapping between taxonomies
Each type has a set of instances:
Assoc Prof: K. Burn, R. Cook
Synset: lemmas from leaf instances
Wikipedia concept: list of instances
Yahoo topic: set of example Web pages
Goal: establish connections between types
Connections could be "soft" or probabilistic


Cross-training: 

Cross-training
Set of labels or types B, partly related but not identical to type set A
A=Dmoz topics, B=Yahoo topics
A=Personal bookmark topics, B=Yahoo topics
Training docs now come in two flavors:
Fully labeled with both A and B labels (rare)
Half-labeled with either an A or a B label (the corpora D_A and D_B)
Can B make classification for A more accurate (and vice versa)?
Inductive transfer, multi-task learning (Sarawagi+ 2003)


Motivation: 

Motivation
Symmetric taxonomy mapping
Ecommerce catalogs: A=distributor, B=retailer
Web directories: A=Dmoz, B=Yahoo
Incomplete taxonomies, small training sets (bookmark taxonomy vs. Yahoo)
Cartesian label spaces
[Diagram: a Region hierarchy (Top > Regional > UK, USA, …) crossed with a Topic hierarchy (Top > Sports > Baseball, Cricket); each (Region, Topic) label pair conditions its own term distribution]

Labels as features: 

Labels as features
A-label known, estimate the B-label
Suppose we have an A+B labeled training set
Append a discrete-valued "label column" to the term features
Most text classifiers cannot balance the importance of very heterogeneous features
We do not have fully-labeled data, so we must guess the label column (use soft scores instead of 0/1)
[Diagram: term feature values plus the label column form an augmented feature vector for predicting the target label]
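A tiny sketch of that augmented representation, assuming soft A-label scores come from some A-classifier; names and the weighting knob are illustrative:

```python
import numpy as np

def augment(term_vec, a_scores, label_weight=1.0):
    """term_vec: term feature values; a_scores: soft Pr(alpha|d) for each alpha in A."""
    # label_weight lets the learner trade off the heterogeneous feature groups,
    # which plain text classifiers otherwise balance poorly.
    return np.concatenate([term_vec, label_weight * np.asarray(a_scores)])
```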

SVM-CT: Cross-trained SVM: 

SVM-CT: Cross-trained SVM
[Diagram: successive cross-trained SVM stages S(A,1), S(B,2), S(A,2), …, each stage's scores feeding the next]

SVM-CT anecdotes: 

SVM-CT anecdotes
The discriminant reveals relations between A and B labels: one-to-one, many-to-one, related, antagonistic
However, accuracy gains are meager

EM1D: Info from unlabeled docs: 

EM1D: Info from unlabeled docs
Use training docs to induce an initial classifier for taxonomy B, say
Repeat until the classifier is satisfactory:
Estimate Pr(β|d) for each unlabeled doc d and each β ∈ B
Reweigh d by factor Pr(β|d) and add it to the training set for label β
Retrain the classifier
EM1D: expectation maximization with one label set B (Nigam et al.); a compact sketch follows
Ignores labels from the other taxonomy A
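A compact sketch of EM1D with a multinomial naïve Bayes model, in the spirit of Nigam et al. (my reconstruction, not the slide's code); X rows are term-count vectors:

```python
import numpy as np

def em1d(X_lab, y_lab, X_unl, n_labels, iters=10, smooth=1.0):
    def train(X, resp):                       # resp: per-doc label weights
        prior = resp.sum(axis=0) + smooth
        prior /= prior.sum()
        theta = resp.T @ X + smooth           # label x term counts, smoothed
        theta /= theta.sum(axis=1, keepdims=True)
        return prior, theta

    def posterior(X, prior, theta):           # Pr(beta|d) for every doc
        logp = np.log(prior) + X @ np.log(theta).T
        logp -= logp.max(axis=1, keepdims=True)
        p = np.exp(logp)
        return p / p.sum(axis=1, keepdims=True)

    resp_lab = np.eye(n_labels)[y_lab]        # hard labels for training docs
    prior, theta = train(X_lab, resp_lab)
    for _ in range(iters):
        resp_unl = posterior(X_unl, prior, theta)         # E-step: Pr(beta|d)
        prior, theta = train(np.vstack([X_lab, X_unl]),   # M-step: reweighed docs
                             np.vstack([resp_lab, resp_unl]))
    return prior, theta
```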

Stratified EM1D: 

Stratified EM1D
Target labels = B
B-labeled docs are labeled training instances
Consider A-labeled docs labeled α: these are unlabeled for taxonomy B
Run EM1D for each row α
A test instance has α known: invoke the semi-supervised model for row α to classify
[Diagram: grid of A topics × B topics; docs in D_A − D_B labeled α occupy row α]

EM2D: Cartesian product EM: 

EM2D: Cartesian product EM
Initialize with fully labeled docs, which go to a specific (α, β) cell
Smear each half-labeled training doc across its label row or column
Uniform smear could be bad; use a naïve Bayes classifier to seed
Parameters extended from EM1D:
π_{α,β}: prior probability for label pair (α, β)
θ_{α,β,t}: multinomial term probability for term t under (α, β)
[Diagram: grid with labels in A as rows and labels in B as columns; an A-labeled doc smears across a row, a B-labeled doc across a column]
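These parameters suggest the standard multinomial naïve Bayes model extended to label pairs; assuming that reading (the slide's own formula was lost), the joint likelihood of a label pair and a document d with term counts n(d,t) is:

```latex
\Pr(\alpha, \beta, d) \;=\; \pi_{\alpha,\beta} \prod_{t} \theta_{\alpha,\beta,t}^{\,n(d,t)}
```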

EM2D updates: 

EM2D updates
E-step for an A-labeled document, and the M-step (a plausible reconstruction of both is given below)
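Assuming the naïve Bayes model above, a plausible reconstruction of the updates, with Laplace smoothing over the |A|·|B| label pairs and the vocabulary V, and |D| the number of docs:

```latex
% E-step for a document d whose A-label \alpha is known:
% smear d over its row, i.e. over pairs (\alpha, \beta) only.
\Pr(\alpha,\beta \mid d) \;=\;
  \frac{\pi_{\alpha,\beta} \prod_{t} \theta_{\alpha,\beta,t}^{\,n(d,t)}}
       {\sum_{\beta'} \pi_{\alpha,\beta'} \prod_{t} \theta_{\alpha,\beta',t}^{\,n(d,t)}}

% M-step:
\pi_{\alpha,\beta} \;=\; \frac{1 + \sum_{d} \Pr(\alpha,\beta \mid d)}{|A|\,|B| + |D|}
\qquad
\theta_{\alpha,\beta,t} \;=\;
  \frac{1 + \sum_{d} \Pr(\alpha,\beta \mid d)\, n(d,t)}
       {|V| + \sum_{t'} \sum_{d} \Pr(\alpha,\beta \mid d)\, n(d,t')}
```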

Applying EM2D to a test doc: 

Applying EM2D to a test doc
Mapping a B-labeled test doc d to an A label (e-commerce catalogs): given β, find argmax_α Pr(α, β | d)
Classifying a document d with no labels to an A label:
Aggregation: for each α compute Σ_β Pr(α, β | d), pick the best α
Guessing (EM2D-G): guess the best β* using a B-classifier, then find argmax_α Pr(α, β* | d)
EM pitfalls: damping factor, early stopping


Experiments: 

Experiments
Selected 5 Dmoz and Yahoo subtree pairs
Compare EM2D against:
Naïve Bayes (with the best #features and smoothing)
EM1D (ignores labels from the other taxonomy; treats those docs as unlabeled)
Stratified EM1D
Tasks: map a test doc with an A-label to a B-label (or vice versa); classify a zero-labeled test doc
Accuracy = fraction with correct labels

Accuracy benefits in mapping: 

Accuracy benefits in mapping
EM1D and NB are close, because training set sizes for each taxonomy are not too small
EM2D > Stratified EM1D > NB: 2d transfer of model information seems important
Improvement over NB: 30% best, 10% average

Asymmetric setting: 

Asymmetric setting
Few (only 300) bookmarked URLs (taxonomy B, the target)
Many Yahoo URLs, with a larger number of classes (taxonomy A)
Need to control the damping factor (the relative importance of labeled vs. unlabeled docs) to tackle population skew

Zero-labeled test documents: 

Zero-labeled test documents
EM1D improves accuracy only for 12 train docs
EM2D with guessing improves beyond EM1D; in fact, better than aggregating scores to 1d
The choice of unlabeled:labeled damping ratio L may be important to get benefits

Robustness to initialization: 

Robustness to initialization
Seeding choices: hard (best class), NB scores, uniform
Smear a fraction uniformly, the rest by NB scores
EM2D is robust to a wide range of smear fractions
Fully uniform smearing can fail (local optima)
[Plot: accuracy under uniform smear vs. naïve Bayes smear]

Handling constraints: 

Handling constraints
Type systems are often hierarchical
Neighborhood constraints:
Two nodes match if their children also match
Two nodes match if their parents match and some of their descendants also match
If all children of node X match node Y, then X also matches Y
If a node in the neighborhood of node X matches ASSOCIATE-PROFESSOR, then the chance that X matches PROFESSOR is increased
Three-stage approach (Doan+ 2002) …

Three-stage reconciliation: 

Three-stage reconciliation
Use EM2D to build a distribution estimator
For every type pair A, B, find, over the domain of entities, Pr(A,B), Pr(A,¬B), Pr(¬A,B), Pr(¬A,¬B)
From these, compute a local similarity measure, e.g. Jaccard (shown below)
Perform relaxation labeling to find a good mapping (Chakrabarti+ 1998, Doan+ 2002)
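The Jaccard similarity over these four probabilities takes its standard set-overlap form:

```latex
\mathrm{Jaccard}(A,B) \;=\; \frac{\Pr(A \cap B)}{\Pr(A \cup B)}
\;=\; \frac{\Pr(A,B)}{\Pr(A,B) + \Pr(A,\neg B) + \Pr(\neg A,B)}
```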

Relaxation labeling: 

Relaxation labeling
[Diagram: each label in O1 is mapped to a label in O2, with everything else known]
Markovian assumption: the update for node X depends only on features of X's immediate neighborhood
Initialize some reasonable mapping distribution f and reassign via Bayes rule until f stabilizes (a sketch follows)
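A schematic sketch of that loop; neighbor_score is a hypothetical stand-in for the Bayes-rule update over features of a node's immediate neighborhood, and local_sim is the measure from the previous slide:

```python
def relax(nodes, labels, neighbors, local_sim, neighbor_score,
          iters=20, tol=1e-4):
    """Returns f, where f[x][l] is the belief that node x maps to label l."""
    f = {x: {l: local_sim(x, l) for l in labels} for x in nodes}
    for _ in range(iters):
        delta = 0.0
        for x in nodes:
            new = {}
            for l in labels:
                # Markovian update: depends only on x's immediate neighborhood
                new[l] = local_sim(x, l) * neighbor_score(x, l, neighbors(x), f)
            z = sum(new.values()) or 1.0
            for l in labels:
                new[l] /= z                   # renormalize to a distribution
                delta = max(delta, abs(new[l] - f[x][l]))
            f[x] = new
        if delta < tol:                       # f has stabilized
            break
    return f
```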

Sample results: 

Sample results


Summary: 

Summary
To realize the semantic Web vision we must:
Assemble type schemas from diverse sources
Mine type instances automatically
Annotate large corpora efficiently
Build indices integrating text and annotations
Support schema-agnostic query languages (when did you last type XQuery into Google?)
Design high-performance query execution engines
New family of ranking functions
