approx matches

Uploaded from authorPOINT
Views:
 
     
 

Presentation Description

No description available.

Comments

Presentation Transcript

Search for Approximate Matchesin Large Databases: 

Search for Approximate Matches in Large Databases Eugene Fink Jaime Carbonell Aaron Goldstein Philip Hayes

Motivation: 

Motivation Fast identification of approximate matches in large sets of records. Applications: Medical databases Customer records National security

Outline: 

Outline Records and queries Search for matches Experimental results

Table of records: 

Table of records We specify a table of records by a list of attributes. Example We can describe patients in a hospital by their sex, age, and diagnosis.

Records and queries: 

Records and queries A record includes a specific value for each attribute. A query may include lists of values and numeric ranges. Query Sex: male, female Age: 20..40 Dx: asthma, flu

Query types: 

Query types

Exact matches: 

Exact matches A record is an exact match for a query if every value in the record belongs to the respective range in the query.

Approximate matches: 

Approximate matches A record is an approximate match for a query if it is 'close' to the query region. Record

Approximate queries: 

Approximate queries An approximate query includes: Point or region Distance function Number of matches Distance limit

Outline: 

Outline Records and queries Search for matches Experimental results

Indexing structure: 

Indexing structure male female 30 40 50 30 asthma flu fracture ulcer asthma flu Maintain a PATRICIA tree of records

Search for matches: 

Search for matches male female 30 40 50 30 asthma flu fracture ulcer asthma flu Depth-first search for exact matches Best-first search for approximate matches

Outline: 

Outline Records and queries Search for matches Experimental results

Performance: 

Performance : Twenty-one attributes 1.6 million records Experiments with a database of all patients admitted to Massachusetts hospitals from October 2000 to September 2002 Use of a Pentium computer: 2.4 GHz CPU 1 Gbyte memory 400 MHz bus

Variables: 

Variables Control variables: Number of records Memory size Query type Measurements: Retrieval time

Small memory: 

Small memory Number of records: 100 to 1,672,016 Memory size: 4 MByte

Large memory: 

Large memory Number of records: 1,672,016 Memory size: 64 to 1,024 MByte

Summary: 

Summary Retrieval time grows as fractional power (about 0.5) of database size If we extrapolate this growth rate, retrieval times are reasonable for very large databases

Summary: 

Summary Retrieval time grows as fractional power (about 0.5) of database size If we extrapolate this growth rate, retrieval times are reasonable for very large databases: