Presentation Transcript
Conceptual Fuzzy Sets and Context Sensitive Information Retrieval: Tomohiro Takagi
Meiji University, UC Berkeley
Outline: coping with context dependent meanings
toward conceptual fuzzy sets
from IR point of view
Trial 1: TREC Novelty Track
Trial 2: TREC Web track
Trial 3: Enhancing Google Image Search
Trial 4: Detection of illegal websites
Sparse Coding Model
Coping with context dependent meanings: Ordinary approach: Cases
cases × cases × cases × … Impossible
Ex) heavy: (elephant or human or dog or cat or mouse or …) × (old or middle or young or child or baby …) × (Europe or Asia or Africa or …) × …
Humans do not memorize things in that way.
Slide4: Our approach: Fusion of fractions of knowledge
“Heavy” means bigger weight than usual.
Usually “middle” and “young” are bigger than “child”.
Usually “baby” is smaller than “child”.
…
Fractions of knowledge
“meaning representation from use” proposed by Wittgenstein: According to Wittgenstein, the various meanings of a label (word) can be represented by other labels (words) in its use.
In this spirit, conceptual fuzzy sets are proposed, in which the meaning of a word is represented by the distribution of the activation of other words depending on context.
toward conceptual fuzzy sets: [Diagram: “Heavy” shown with Word A, Word B, Word C, Word D, Word E and Word F, each with a grade]
toward conceptual fuzzy sets: Word A to Word F, each with a grade, form the conceptual fuzzy set (possibility distribution supporting the concept “heavy”)
But how can a possibility distribution reflecting context be generated?
Meanings of JAVA in three different contexts: Java as coffee, island, and programming language
Activated Fraction of Knowledge: coffee, island
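The idea that context activates only some fractions of knowledge can be sketched in a few lines. This is a minimal illustration, not the method used in the talk: the fractions, the words, the grades, the averaging rule for activation strength and the fuzzy union (max) used for fusion are all assumptions made up for the example.

```python
# Hypothetical fractions of knowledge: each maps related words to grades in [0, 1].
FRACTIONS = {
    "coffee": {"java": 0.8, "mocha": 0.9, "cup": 0.7, "bean": 0.6},
    "island": {"java": 0.7, "indonesia": 0.9, "travel": 0.6, "volcano": 0.4},
    "programming": {"java": 0.9, "software": 0.8, "applet": 0.7, "windows": 0.5},
}


def activate(query_words):
    """Return a conceptual fuzzy set (word -> grade) fused from activated fractions."""
    cfs = {}
    for fraction in FRACTIONS.values():
        # Assumed rule: a fraction is activated by the average grade of the query words in it.
        strength = sum(fraction.get(w, 0.0) for w in query_words) / len(query_words)
        for word, grade in fraction.items():
            # Assumed fusion: fuzzy union (max) of the scaled grades.
            cfs[word] = max(cfs.get(word, 0.0), strength * grade)
    return cfs


coffee_sense = activate(["java", "mocha"])      # context word "mocha"
software_sense = activate(["java", "windows"])  # context word "windows"
```

With these made-up grades, "java" plus "mocha" ranks "cup" above "software", while "java" plus "windows" reverses the order, which is the context dependence the slide describes.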
Slide12: Topics (Concepts)
Clustering words (word – document matrix)
+ optimization using corpus
Dmoz (ODP)
[Diagram labels: Artificial Brain; Word vector; Context; Word vector; IN; OUT]
Slide13: [Figure: results for “Java & Mocha” and “Java & Windows” using the relational matrix, CFS with 3 prototype vectors, and CFS with 15 prototype vectors; categories: coffee, S/W, travel, H/W]
Simulations using actual home pages: Randomly selected 45 home pages
Extracted 247 words from the pages
Built a 247 × 247 relational matrix based on co-occurrence
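A relational matrix of this kind could be built roughly as follows. The talk only says the matrix is "based on co-occurrence", so the page-level counting and the row normalization below are assumptions, and the three sample pages are invented.

```python
from itertools import combinations


def build_relational_matrix(pages):
    """pages: list of word lists. Returns (vocab, matrix) with matrix[i][j] in [0, 1]."""
    vocab = sorted({w for page in pages for w in page})
    index = {w: i for i, w in enumerate(vocab)}
    n = len(vocab)
    counts = [[0] * n for _ in range(n)]
    for page in pages:
        present = sorted(set(page))
        for w in present:
            counts[index[w]][index[w]] += 1  # number of pages containing w
        for a, b in combinations(present, 2):
            counts[index[a]][index[b]] += 1  # pages containing both a and b
            counts[index[b]][index[a]] += 1
    # Assumed normalization: share of the pages containing word i that also contain word j.
    matrix = [[counts[i][j] / counts[i][i] for j in range(n)] for i in range(n)]
    return vocab, matrix


sample_pages = [["java", "coffee"], ["java", "applet"], ["coffee", "mocha"]]
vocab, matrix = build_relational_matrix(sample_pages)
```

With 247 extracted words the same function would return a 247 × 247 matrix; note that this normalization makes the matrix asymmetric.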
Slide15: Results expanded from the keyword input “java & application” [Figure: categories coffee, computer, travel; panels: co-occurrence, 3 iterations, 10 iterations]
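The iteration counts suggest that activation is propagated through the relational matrix repeatedly. The sketch below shows one plausible form of such a loop; the update rule, the rescaling step and the tiny 4-word matrix are assumptions for illustration, not the procedure or data from the talk.

```python
def expand(matrix, activation, iterations):
    """Repeatedly propagate word activation through a relational matrix."""
    n = len(activation)
    for _ in range(iterations):
        spread = [sum(activation[i] * matrix[i][j] for i in range(n)) for j in range(n)]
        peak = max(spread) or 1.0
        activation = [v / peak for v in spread]  # rescale so the strongest word has grade 1
    return activation


WORDS = ["java", "application", "coffee", "computer"]
# Hypothetical relational matrix over WORDS (rows and columns in the same order).
RELATION = [
    [1.0, 0.5, 0.5, 0.5],
    [0.5, 1.0, 0.0, 0.8],
    [0.5, 0.0, 1.0, 0.0],
    [0.5, 0.8, 0.0, 1.0],
]

# Keyword input "java & application": both words start fully activated.
result = expand(RELATION, [1.0, 1.0, 0.0, 0.0], iterations=3)
```

In this toy setting "computer" ends up with a higher grade than "coffee", because "application" pulls the activation toward the computing sense of "java".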
Clustering 60 web pages: [Figure: clustering with co-occurrence vs. CFSs; legend: Movie, Music, Travel, Cooking]
from IR point of view:
Exact word matching: Un-match
Expansion: Soft match .. but low precision
Context aware focused expansion: Better quality match
From both points of view: Fuzzy sets, Information retrieval
Information retrieval based on meaning representation using CFS, which is a possibility distribution of words reflecting context.
TRIAL 1: 10,000 words
800 fractions = 800 clusters
Optimized weights
Slide22: [Figure: input x is compared with each fraction c1, c2, …, cm through Similarity(x, c1), Similarity(x, c2), …, Similarity(x, cm); weights a11, a12, …, a1n, …, am1, …, amn]
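One way to read the Slide22 figure is that the expanded vector is a similarity-weighted sum of the weight rows, y_j = Σ_i Similarity(x, c_i) · a_ij. The sketch below implements that reading. It is an interpretation rather than a confirmed description of the system: cosine similarity, the function names and the toy vectors are all assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity, assumed here as the Similarity(x, c) function."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def expand_query(x, prototypes, weights):
    """y[j] = sum_i Similarity(x, c_i) * a[i][j] over m prototypes and n index words."""
    sims = [cosine(x, c) for c in prototypes]
    n = len(weights[0])
    return [sum(s * row[j] for s, row in zip(sims, weights)) for j in range(n)]


# Toy setting: 3 index words, 2 fractions; the weights simply reuse the prototypes.
prototypes = [[1.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
weights = prototypes
expanded = expand_query([1.0, 0.0, 0.0], prototypes, weights)
```

Here the query contains only the first word, yet the second word also receives a nonzero grade because both belong to the same fraction, while the unrelated third word stays at zero.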
Examples of expansion: WORLD SPORTS AT 0000 GMT WORLD CUP. PARIS _ FIFA bans Laurent Blanc for two games, confirming that the French defender is out of Sunday's World Cup final against Brazil.
TREC Novelty Track: Tasks
Relevancy Detection
Novelty Detection
Learning data
Reuters (TREC 2002) corpus
810,000 documents
Indexed words: 10,000
Prototypes: 800
Relevancy Detection System
Result of Task 1 and Task 3
Task 1, Relevant and Novel F Scores
TRIAL 2: Case 1: 120,000 fractions = docs
Case 2: 70,000 fractions = clusters of docs.
TREC Web track Topic Distillation Task: Gov collection (1.2 million HTML docs.)
[Diagram labels: document, modified vector; query, modified vector; matching; output]
Example of Expansion: Physical fitness
(0.0392 → 0.1362)
0.111806 fit
0.107622 physic
0.031421 sport
0.023926 exercis
0.020036 aerob
0.018505 heart
0.018082 obes
0.017206 particl
0.015366 walk
Computer virus
(0.0105 → 0.0982)
0.098488 viru
0.086169 comput
0.036659 softwar
0.031507 encrypt
0.029903 vulner
0.027442 hacker
0.026835 virus
0.024170 intrus
0.024154 secur
0.022238 password
Results (R-Precision):
Case 2 0.1733
The best last year 0.1636
Case 1 0.1612
2nd best last year 0.1485
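For reference, R-precision is the precision of a ranked list at cutoff R, where R is the number of relevant documents for the topic. A minimal sketch of the standard definition follows; the document identifiers are invented and this is not the official TREC evaluation tool.

```python
def r_precision(ranked_doc_ids, relevant_ids):
    """Precision at cutoff R, where R is the number of relevant documents."""
    relevant = set(relevant_ids)
    r = len(relevant)
    if r == 0:
        return 0.0
    hits = sum(1 for doc in ranked_doc_ids[:r] if doc in relevant)
    return hits / r


# Three relevant documents, so only the top three ranks are inspected.
score = r_precision(["d3", "d1", "d7", "d2"], {"d1", "d2", "d3"})
```

In the example two of the top three results are relevant, so the score is 2/3; a system-level figure such as those above would be the mean over all topics.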
TRIAL 3: Enhancing Google Image Search - 20,000 index words - 60,000 prototypes
Slide33: Experimental results [Figure labels: “gates”; Bill Gates]
Slide34: [Diagram labels: Query; User relevance feedback; Meaning representation using CFS; Query refinement; Focus reflecting context]
Slide35: Experimental result - 1: Query = cat [Figure: with feedback vs. without feedback]
Slide36: Experimental result - 2: Query = apple [Figure: with feedback vs. without feedback]
Slide38: Next Step of Image Search [Diagram labels: Text based Image Search; Content (Image) based Search; Enhanced Image Search]
TRIAL 4: Detection of illegal websites
Illegal sites:
Warez: Illegal distribution and sale of commercial software
Emulation: Illegal distribution of software, such as video games
Music: Distribution of music data that infringes on copyrights
Adult: Pornographic depictions and expressions
Hacking & Cracking: Distribution of illegal hacking and cracking software; sharing of technical know-how
Drugs & Guns: Sale of drugs and guns; sharing of acquisition routes
Killing: Descriptions of murder and other violent acts
Illustration of illegal site:
Many suspicious words
Many commercial software names
High link rate to compressed files
Looks like illegal distribution and sale of commercial software
Concept Description of “Warez”, “Music” and “Emulation”: [Diagram: concepts Warez, Music and Emulator; feature labels: suspicious words, commercial software, software maker, compressed file types, URL list]
CFS System:
HTML document
TF-IDF values
Types of linked files and URLs
Names (software, makers, music, artists)
Support Vector Machine
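The TF-IDF values that feed the classifier can be computed in several ways, and the talk does not say which variant was used. The sketch below uses one common form, relative term frequency times log(N / document frequency), on invented token lists; the Support Vector Machine step is left out.

```python
import math
from collections import Counter


def tf_idf(docs):
    """docs: list of token lists. Returns one {word: weight} dict per document."""
    n = len(docs)
    # Document frequency: number of documents that contain each word.
    df = Counter(w for doc in docs for w in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({w: (c / len(doc)) * math.log(n / df[w]) for w, c in tf.items()})
    return vectors


# Hypothetical tokenized pages.
pages = [["warez", "crack", "download"], ["download", "music"], ["news", "download"]]
vectors = tf_idf(pages)
```

A word that occurs on every page ("download" here) gets weight 0, while a word specific to one page ("warez") gets a positive weight. In a system like the one outlined above, such weights would be combined with the link-type and name features before classification.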
Evaluation: Randomly selected 300 actual Web sites
(including 85 illegal sites)
Compared CFS system with plain TF-IDF system
Results (300 pages):
                 CFS system   TF-IDF system
illegal pages
  precision      0.9878       1.0000
  recall         0.9529       0.8706
  E measure      0.0299       0.0692
legal pages
  precision      0.9817       0.9556
  recall         0.9953       1.0000
  E measure      0.0115       0.0227
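The slides do not define the E measure, but the reported values are consistent with van Rijsbergen's E with equal weighting, that is E = 1 − F1, where F1 is the harmonic mean of precision and recall. The sketch below recomputes E from the reported precision and recall under that assumption; small differences in the last digit come from using the rounded inputs.

```python
def e_measure(precision, recall):
    """Assumed definition: E = 1 - F1 (lower is better)."""
    if precision + recall == 0:
        return 1.0
    f1 = 2 * precision * recall / (precision + recall)
    return 1.0 - f1


# (precision, recall, reported E) copied from the results above.
reported = {
    ("illegal", "CFS"): (0.9878, 0.9529, 0.0299),
    ("illegal", "TF-IDF"): (1.0000, 0.8706, 0.0692),
    ("legal", "CFS"): (0.9817, 0.9953, 0.0115),
    ("legal", "TF-IDF"): (0.9556, 1.0000, 0.0227),
}

recomputed = {key: e_measure(p, r) for key, (p, r, _) in reported.items()}
```

Under this reading the CFS system roughly halves the error of the plain TF-IDF system on both classes, mainly by recovering illegal pages that TF-IDF misses.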
CFS based on Sparse Coding: Training corpus:
200,000 Reuters news articles
(1996/08/20 - 1997/08/19)
Sparse Coding: In the human brain
One piece of information : one neuron (grandmother cell)
One piece of information : several neurons (cell assembly)
[Figure: Information A, Information B and Information C, shown once for the grandmother cell scheme and once for the cell assembly scheme]
Interconnection based on Mutual Information: Term Layer, Context Layer
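The talk does not say which form of mutual information sets the connection strengths. As an illustration only, the sketch below uses pointwise mutual information (PMI) between a term and a context, estimated from co-occurrence counts; the function name, the log base and the counts are assumptions.

```python
import math


def pmi(joint_count, term_count, context_count, total):
    """Pointwise mutual information log2(p(t, c) / (p(t) * p(c))) from raw counts."""
    if joint_count == 0:
        return float("-inf")  # the pair was never observed together
    return math.log2((joint_count * total) / (term_count * context_count))


# Hypothetical counts over 1,000 articles.
strong = pmi(joint_count=50, term_count=100, context_count=100, total=1000)
independent = pmi(joint_count=10, term_count=100, context_count=100, total=1000)
```

A positive value means the term and context co-occur more often than chance would predict; a value of 0 means they look independent, so a sparse network could simply drop such connections.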
Slide49: The meaning of a word is encoded as an activation pattern of neurons.
Fractions of knowledge are encoded as interconnections of neurons.
Get the most appropriate word as a result.
Operating cell assembly:
1. term input
2. propagating activation to related context
3. detecting the context
4. propagating activation to related word
5. term output
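The five steps can be traced in a small sketch. Everything concrete here is hypothetical: the connection weights, the context names, the winner-take-all rule used for "detecting the context" and the output words are invented for illustration and are not the results shown in the talk.

```python
# Hypothetical term-to-context connection weights (for example, derived from mutual information).
WEIGHTS = {
    "child": {"family": 0.8, "car safety": 0.6},
    "seat": {"theater": 0.8, "politics": 0.6, "car safety": 0.5},
    "belt": {"car safety": 0.9},
    "parent": {"family": 0.9},
    "ticket": {"theater": 0.8},
    "election": {"politics": 0.9},
}


def operate(input_terms, top_k=1):
    # Steps 1-2: activate the input terms and propagate activation to related contexts.
    context_act = {}
    for term in input_terms:
        for ctx, w in WEIGHTS.get(term, {}).items():
            context_act[ctx] = context_act.get(ctx, 0.0) + w
    if not context_act:
        return []
    # Step 3: detect the context (assumed rule: keep only the most activated one).
    best = max(context_act, key=context_act.get)
    # Step 4: propagate activation from the detected context back to related words.
    term_act = {
        term: ctxs[best] * context_act[best]
        for term, ctxs in WEIGHTS.items()
        if best in ctxs and term not in input_terms
    }
    # Step 5: output the most activated terms.
    return sorted(term_act, key=term_act.get, reverse=True)[:top_k]
```

With these made-up weights, `operate(["child"])` returns `["parent"]`, `operate(["seat"])` returns `["ticket"]`, and `operate(["child", "seat"])` returns `["belt"]`: the combined input selects a context that neither word selects on its own.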
Examples of expansion: [Figure panels: Input “child”; Input “seat”; Input “child” + “seat”]