07 document file demirel
Added: October 04, 2007
Category: News & Reports
License: All Rights Reserved

Presentation Transcript

Building an Index
  By: Ryan Knowles
  “building the automatic index is as important as any other component of search engine development”

Building an Index Requires Two Lengthy Steps
  Document analysis and purification
  Token analysis or term extraction

Example
  There once was a searcher named Hanna, (1)
  Who needed some info on manna. (2)
  She put “rye” and “wheat” in her query (3)
  Along with “potato” or “cranbeery,” (4)
  But no mention of “sourdough” or “banana.” (5)
  Instead of rye, cranberry, or wheat, (6)
  The results had more spiritual meat. (7)
  So Hanna was not pleased, (8)
  Nor was her hunger eased, (9)
  ‘Cause she was looking for something to eat. (10)

Document Analysis and Purification
  Why is document analysis needed?
    Hypertext documents are more than just text (photos, tables, charts, audio clips).
  Looks at how each document is organized and what it is composed of.
  Decides what information will be indexed and what will not.

Token Analysis or Term Extraction
  Decides which words should be used to represent the meaning of documents.
  Why would it not be necessary to extract every word?
    Stop words: able, about, after, allow, became, been, before, certainly, clearly, enough…
    Stemming: removing suffixes and sometimes prefixes to reduce a word to its root form

Example: Terms Extracted
  Doc No.   Terms/Keywords
  1         searcher, Hanna
  2         manna
  3         rye, wheat, query
  4         potato, cranbeery -> cranb
  5         sourdough, banana
  6         rye, cranberry -> cranb, wheat
  7         spiritual, meat
  8         Hanna
  9         hunger
  10        No terms

Manual Indexing
  Why is this no longer practical?
  What are some upsides to this strategy?
  Do you think any companies still do this?
    Yahoo 2002
    Small companies
    National Library of Medicine
    H.W. Wilson Company
    CINAHL

Automatic Indexing
  The dominant method for processing documents from large web databases
  Why is this more efficient?
  What are some downsides?
    Spamming
    Intent of searcher

Item Normalization
  Taking the smallest unit of the document and constructing searchable data structures
  What needs to be done in order to create an inverted file structure?
  Why is this normalization necessary?

Inverted File Structures
  The document file
    Each doc is given a unique ID
    All terms identified
  The dictionary
    Sorted list of all the unique terms
  The inversion list
    Points from term to which docs contain it

Example: Dictionary List
  Term        Count
  Banana      1
  Cranb       2
  Hanna       2
  Hunger      1
  Manna       1
  Meat        1
  Potato      1
  Query       1
  Rye         2
  Sourdough   1
  Spiritual   1
  Wheat       2

Example: Inversion List
  Term        (Doc No., Word Position)
  Banana      (5,7)
  Cranb       (4,5); (6,4)
  Hanna       (1,7); (8,2)
  Hunger      (9,4)
  Manna       (2,6)
  Meat        (7,6)
  Potato      (4,3)
  Query       (3,8)
  Rye         (3,3); (6,3)
  Sourdough   (5,5)
  Spiritual   (7,5)
  Wheat       (3,5); (6,6)

Other File Structures
  Signature Files
    Eliminates all non-matches rather than matching the query with the term

Other Questions
  How frequently should crawlers go through a certain page?
    A question that is still being looked into
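
The term extraction and inverted file slides can be tied together with a short sketch. The Python below is only an illustration under stated assumptions: it treats each limerick line as one document, extends the slide's stop word examples with a handful of assumed function words (the presentation does not give its full list), and uses a toy suffix rule that exists only to map "cranbeery" and "cranberry" to "cranb". The names `STOP_WORDS`, `stem`, and `build_inverted_file` are hypothetical. Because the stop list is a guess, the result contains some terms the slide tables omit (for example "needed" or "info"), but the entries the slides do list come out with the same (doc no., word position) pairs.

```python
import re
from collections import defaultdict

# Each line of the limerick is treated as one document (IDs 1..10).
DOCS = [
    "There once was a searcher named Hanna,",
    "Who needed some info on manna.",
    'She put "rye" and "wheat" in her query',
    'Along with "potato" or "cranbeery,"',
    'But no mention of "sourdough" or "banana."',
    "Instead of rye, cranberry, or wheat,",
    "The results had more spiritual meat.",
    "So Hanna was not pleased,",
    "Nor was her hunger eased,",
    "'Cause she was looking for something to eat.",
]

# First ten entries are the slide's examples; the rest are ASSUMED
# function words, since the full stop list is not given in the slides.
STOP_WORDS = {
    "able", "about", "after", "allow", "became", "been", "before",
    "certainly", "clearly", "enough",
    "a", "along", "and", "but", "cause", "for", "had", "her", "in",
    "instead", "more", "no", "nor", "not", "of", "on", "once", "or",
    "she", "so", "some", "something", "the", "there", "to", "was",
    "who", "with",
}


def tokenize(text):
    """Lowercase the text and return its alphabetic words in order."""
    return re.findall(r"[a-z]+", text.lower())


def stem(word):
    """Toy stemmer (an assumption): strips two suffixes so that
    'cranbeery' and 'cranberry' both reduce to 'cranb'."""
    for suffix in ("eery", "erry"):
        if word.endswith(suffix):
            return word[: -len(suffix)]
    return word


def build_inverted_file(docs):
    """Return (dictionary, inversion list).

    dictionary: term -> number of postings, in sorted term order.
    inversion:  term -> list of (doc_id, word_position), both 1-based.
    """
    inversion = defaultdict(list)
    for doc_id, text in enumerate(docs, start=1):
        for position, word in enumerate(tokenize(text), start=1):
            if word in STOP_WORDS:
                continue  # positions still count stop words
            inversion[stem(word)].append((doc_id, position))
    dictionary = {term: len(posts) for term, posts in sorted(inversion.items())}
    return dictionary, dict(inversion)


dictionary, inversion = build_inverted_file(DOCS)
print(inversion["hanna"])   # [(1, 7), (8, 2)]
print(inversion["cranb"])   # [(4, 5), (6, 4)]
print(dictionary["rye"])    # 2
```

Word positions are counted before stop words are dropped, which is what makes the pairs line up with the slide's inversion list.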
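
The signature file idea listed under Other File Structures can also be sketched briefly. The slides give no details, so the Python below assumes one common variant, superimposed coding: each term sets a few bits, a document's signature is the OR of its terms' bits, and any document whose signature lacks a bit of the query signature is eliminated. The signature width, bits per term, hash choice, and all function names are assumptions made for illustration. Survivors are only candidates, because unrelated terms can set the same bits, so a final check against the actual terms is included.

```python
import hashlib

SIGNATURE_BITS = 64  # assumed signature width
BITS_PER_TERM = 3    # assumed number of bits set per term

# Terms per document, taken from the "Terms Extracted" slide (stemmed forms).
TERMS = {
    1: ["searcher", "hanna"],
    2: ["manna"],
    3: ["rye", "wheat", "query"],
    4: ["potato", "cranb"],
    5: ["sourdough", "banana"],
    6: ["rye", "cranb", "wheat"],
    7: ["spiritual", "meat"],
    8: ["hanna"],
    9: ["hunger"],
    10: [],
}


def term_signature(term):
    """Hash a term to a bit pattern with up to BITS_PER_TERM bits set."""
    signature = 0
    for i in range(BITS_PER_TERM):
        digest = hashlib.sha256(f"{i}:{term}".encode("utf-8")).digest()
        bit = int.from_bytes(digest[:4], "big") % SIGNATURE_BITS
        signature |= 1 << bit
    return signature


def doc_signature(terms):
    """Superimpose (bitwise OR) the signatures of all terms."""
    signature = 0
    for term in terms:
        signature |= term_signature(term)
    return signature


def candidates(query_terms, signatures):
    """Eliminate every document that is missing a query bit.

    No true match is ever eliminated; false positives may remain.
    """
    query_sig = doc_signature(query_terms)
    return [doc_id for doc_id, sig in signatures.items()
            if (sig & query_sig) == query_sig]


def verify(query_terms, doc_ids, terms_by_doc):
    """Confirm candidates against the real terms to drop false positives."""
    return [d for d in doc_ids
            if all(t in terms_by_doc[d] for t in query_terms)]


signatures = {doc_id: doc_signature(terms) for doc_id, terms in TERMS.items()}
query = ["rye", "wheat"]
survivors = candidates(query, signatures)
print(verify(query, survivors, TERMS))  # [3, 6]
```

The filter step only ever inspects fixed-width bit patterns, which is the sense in which non-matches are eliminated rather than terms being matched one by one.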