IRS_UNIT_II-II

Views:
 
     
 

Presentation Description

IRS II unit

Comments

Presentation Transcript

Arial:

“In a day, when you don't come across any problems - you can be sure that you are travelling in a wrong path” ―   Swami Vivekananda Prepared by : Ms. M. Nagalakshmi Asst. Prof., Kshatriya College of Engg., Armoor, Nizamabad.

Calibri:

Topics Introduction to Data Structures Stemming Algorithms Inverted File Structure N-Gram Data Structure PAT Data Structure Signature File Structure Hypertext and XML Data Structures

Lucida Sans Unicode:

Data structures The knowledge of data structure gives an insight into the capabilities available to the system . Each data structure has a set of associated capabilities . Ability to represent the concept and their r/s. Supports location of those concepts Introduction Two major data structures in any IRS: One structure stores and manages received items in their normalized form is called document manger

Wingdings:

2. The other data structure contains processing tokens and associated data to support search Item normalization Document Search manager Document manager Document File creation Original document file Processing token search file Major data structure

Arial Unicode MS:

Result of a search are references to the items that satisfy the search statement which are passed to the document manager for retrieval. Focus : on data structure that support search function Stemming : is the transformation often applied to data before placing it in the searchable data structure Stemming represents concept(word) to a canonical ( authorized; recognized; accepted ) morphological ( the patterns of word formation in a particular language ) representation . Risk with stemming : concept discrimination information may be lost in the process. Causing decrease in performance. Advantage : has a potential to increase recall.

Courier New:

Stemming algorithms Stemming algorithm is used to improve the efficiency of IRS and improve recall. Conflation ( the process or result of fusing items into one entity; fusion; amalgamation) is a term that is used to refer mapping multiple morphological variants to single representation(stem). Stem carries the meaning of the concept associated with the word and the affixes(ending) introduce subtle (slight) modification of the concept. Terms with a common stem will usually have similar meanings, for example:

Symbol:

Ex: Terms with a common stem will usually have similar meanings, for example: CONNECT CONNECTED CONNECTING CONNECTION CONNECTIONS Frequently, the performance of an IR system will be improved if term groups such as this are conflated into a single term . This may be done by removal of the various suffixes -ED, -ING, -ION, IONS to leave the single term CONNECT In addition, the suffix stripping process will reduce the total number of terms in the IR system, and hence reduce the size and complexity of the data in the system, which is always advantageous.

Default Design:

Major usage of stemming is to improve recall. Important for a system to categories a word prior to making the decision to stem. Proper names and acronyms (A word formed from the initial letters of a name say JNTU …) should not have stemming applied. Stemming can also cause problems for natural language processing NPL systems by causing loss of information .

“In a day, when you don't come across any problems - you can be sure that you are travelling in a wrong path” ― Swami Vivekananda:

Porter stemming algorithm Based on a set condition of the stem A consonant in a word is a letter other than A, E, I, O or U, some important stem conditions are The measure m of a stem is a function of sequence of vowels (V) followed by a sequence of consonant ( C ) . C (VC) m V. m is number VC repeats

Topics:

The case m = 0 covers the null word. 2 . *<X> - stem ends with a letter X 3 .*v* - stem contains a vowel 4 . *d - stem ends in double consonant (e.g. -TT, -SS). 5 . *o - stem ends in consonant vowel sequence where the final consonant is not w,x,y(e.g. -WIL, -HOP). Suffix cond.s takes the form current _suffix = = pattern Actions are in the form old_suffix ->. New_suffix Rules are divided into steps to define the order for applying the rule.

Data structures :

Step Condition Suffix Replacement Example 1a Null Sses Ss Stresses -> stress 1b *v* Ing Null Making -> mak 1b1 Null At Ate Inflated-> inflate 1c *v* Y I Happy->happi 2 m>0 aliti al Formaliti-> formal 3 m>0 Icate Ic Duplicate->duplie 4 m>1 Able Null Adjustable -> adjust 5a m>1 e Null Inflate-> inflat 5b m>1 and *d Null Single letter Control -> control Examples of the rules

PowerPoint Presentation:

2. Dictionary look up stemmers Use of dictionary look up. The original term or stemmed version of the term is looked up in a dictionary and replaced by the stem that best represents it. This technique has been implemented in INQUERY and Retrieval ware systems- INQUERY system uses the technique called Kstem. Kstem is a morphological analyzer that conflates words variants to a root form. It requires a word to be in the dictionary

PowerPoint Presentation:

Kstem uses 6 major data files to control and limit the stemming process. Dictionary of words (lexicon) Supplemental list of words for dictionary Exceptional list of words that should retain a ‘e’ at the end (e.g., “suites” to “suite” but “suited” to “suit”). Direct _conflation - word pairs that override stemming algorithm. County_nationality _conflation ( British maps to Britain ) Proper nouns -- that should not be stemmed New words that are not special forms (e.g., dates, phone numbers) are located in the dictionary to determine simpler forms by stripping off suffixes and respelling plurals as defined in the dictionary.

Stemming algorithms :

3. Successor stemmers : Based on length of prefixes . The smallest unit of speech that distinguishes on word from another The process uses successor varieties for a word . Uses information to divide a word into segments and selects on of the segments to stem. b n r a g Symbol tree for term bag , barn

PowerPoint Presentation:

Successor variety of words are used to segment a word by applying one of the following four methods. Cutoff method : a cut of value is selected to define the stem length. Peak and plateau : a segment break is made after a character whose successor variety exceeds that of the character. Complete word method : break on boundaries of complete words. Entropy method: uses the distribution method of successor variety letters. Let |Dak| be the number of words beginning with k length sequence of letters a . Let |Dakj| be the number of words in Dak with successor j. The probability that a member of Dak has the successor j is given as |Dakj| / |Dak|

PowerPoint Presentation:

The entropy of |Dak| is 26 Hak =  -( |Dakj| / |Dak| )(log(|Dakj| / |Dak| )) p=1 After a word has been segmented the segment to be used as stem must be selected. Hafer and Weiss selected the following rule If ( first segment occurs in <=12 words in database) First segment is stem Else (second segment is stem)

Porter stemming algorithm :

Data structures: Inverted file structure N-Gram data structures PAT data structures Signature file structure Hypertext and XML data structures ---- class study

PowerPoint Presentation:

Inverted file structure Most common data structure Inverted file structures are composed of three files The document file The inversion list (Posting List) Dictionary The inverted file : based on the methodology of storing an inversion of documents. For each word a list of documents in which the word is found is stored. ( inversion of document ) Each document is given a unique the numerical identifier that is stored in inversion list .

PowerPoint Presentation:

Dictionary is used to located the inversion list for a particular word. Which is a sorted list( processing tokens) in the system and a pointer to the location of its inversion list. Dictionary can also store other information used in query optimization such as length of inversion lists to increase the precision. Doc#1, computer, bit, byte Doc #2 memory, byte Doc #3 computer, bit, memory Doc#4 byte, computer Bit(2) Byte(3) Computer(3) memory Document Dictionary Bit -1, 3 Byte1,2,4 Computer-1,3,4 Memory –2,3 Inverted list

PowerPoint Presentation:

Use zoning to improve precision and Restrict entries. Inversion list consists of document identifier for each document in which the word is found . Ex: bit 1(10),1(12) 1(18) is in 10,12, 18 position of the word bit in the document #1. When a search is performed, the inversion lists for the terms in the query are locate and appropriate logic is applied between inversion lists. Weights can also be stored in the inversion list. Inversion list are used to store concept and their relationship. Words with special characteristics can be stored in their own dictionary. Ex: Date … which require date ranging and numbers. Systems that support ranking are re-organized in ranked order.

PowerPoint Presentation:

B trees can also be used for inversion instead of dictionary. The inversion lists may be at the leaf level or referenced in higher level pointers. A B-tree of order m is defined as: A root node with between 2 and 2m keys All other internal nodes have between m and 2m keys All keys are kept in order from smaller to larger. All leaves are at the same level or differ by at most one level.

PowerPoint Presentation:

N-gram data structure N-Grams can be viewed as a special technique for conflation (stemming) and as a unique data structure in information systems. N-Grams are a fixed length consecutive series of “n” characters. Unlike stemming that generally tries to determine the stem of a word that represents the semantic meaning of the word, n-grams do not care about semantics. The searchable data structure is transformed into overlapping n-grams, which are then used to create the searchable database.

PowerPoint Presentation:

Examples of bigrams, trigrams and pentagrams for the word phrase “sea colony.” se ea co ol lo on ny Bigrams (no interword symbols) sea col olo lon ony Trigrams (no interword symbols) #se sea ea# #co col olo lon ony ny# Trigrams (with interword symbol #) #sea# #colo colon olony lony# Pentagrams (with interword symbol #) The symbol # is used to represent the interword symbol which is anyone of a set of symbols (e.g., blank, period, semicolon, colon, etc.).

PowerPoint Presentation:

The symbol # is used to represent the interword symbol which is anyone of a set of symbols (e.g., blank, period, semicolon, colon, etc.). Each of the n-grams created becomes a separate processing tokens and are searchable. It is possible that the same n-gram can be created multiple times from a single word. Uses : Widely used as cryptography in world war II Spelling errors detection and correction Use bigrams for conflating terms. N-grams as a potential erroneous words.

PowerPoint Presentation:

Damerau specified 4 categories of errors: Error Category Example single char insertion compuuter single char deletion compter single char substitution compiter Transposition of 2 adjacent comp tu er chars Zamora showed trigram analysis provided a viable data structure for identifying misspellings and transposed characters.

PowerPoint Presentation:

This impacts information systems as a possible basis for identifying potential input errors for correction as a procedure within the normalization process. Frequency of occurrence of n-gram patterns can also be used for identifying the language of an item. Trigrams have been used for text compression and to manipulate the length of index terms. To encode profiles for the Selective Dissemination of Information. To store the searchable document file for retrospective search databases.

PowerPoint Presentation:

Advantage: They place a finite limit on the number of searchable token MaxSeg n =( ) n maximum number of unique n grams that can be generated. “ n” is the length of n-grams number of process able symbols Disadvantage: longer the n gram the size of inversion list increase. Performance has 85 % precision .

PowerPoint Presentation:

PAT data structure (practical algorithm to retrieve information coded in alphanumeric) PAT structure or PAT tree or PAT array : continuous text input data structures(string like N-Gram data structure). The input stream is transformed into a searchable data structure consisting of substrings, all substrings are unique. Each position in a input string is a anchor point for a sub string. In creation of PAT trees each position in the input string is the anchor point for a sub-string that starts at that point and includes all new text up to the end of the input. Binary tree, most common class for prefix search,But Pat trees are sorted logically which facilitate range search, and more accurate then inversion file . PAT trees provide alternate structure if supporting strings search.

PowerPoint Presentation:

A PAT tree is an unbalanced, binary digital tree defined by the sistrings. The individual bits of the sistrings decide the branching patterns with zeros branching left and ones branching right. PAT trees also allow each node in the tree to specify which bit is used to determine the branching via bit position or the number of bits to skip from the parent node. This is useful in skipping over levels that do not require branching.

N-gram data structure :

Examples of sistrings Text Economics for Warsaw is complex. ----------------------------------------------------------- sistring 1 Economics for Warsaw is complex. sistring 2 conomics for Warsaw is complex. sistring 5 omics for Warsaw is complex. sistring 10 for Warsaw is complex. sistring 20 w is complex. sistring 30 ex.

PowerPoint Presentation:

The key values are stored at the leaf nodes (bottom nodes) in the PAT Tree. For a text input of size “n” there are “n” leaf nodes and “n-1” at most higher level nodes. It is possible to place additional constraints on sistrings for the leaf nodes. If the binary representations of “h” is (100), “o” is (110), “m” is (001) and “e” is (101) then the word “home” produces the input 100110001101.... Using the sistrings.

PowerPoint Presentation:

INPUT 100110001101 sistring 1 1001.... sistring 2 001100... sistring 3 01100.... sistring 4 11....... sistring 5 1000... sistring 6 000..... sistring 7 001101 sistring 8 01101 The full PAT binary tree is

PowerPoint Presentation:

Skipped final version of PAT tree The value in the intermediate nodes (indicated by rectangles) is the number of bits to skip until the next bit to compare that causes differences between similar terms.

PowerPoint Presentation:

Signature file structure The coding is based upon words in the code. The words are mapped into word signatures . A word signature is fixed length code with a fixed number of bits set to 1. The bit positions that are set to one are determined via a hash function of the word. The word signatures are Or ed together to create signature of an item.. Partitioning of words is done in block size ,Which is nothing but set of words, Code length is 16 bits . Search is accomplished by template matching on the bit position . Provide a practical solution applied in parallel processing , distributed environment etc.

PowerPoint Presentation:

To avoid signatures being too dense with “1”s, a maximum number of words is specified and an item is partitioned into blocks of that size. The block size is set at five words, the code length is 16 bits and the number of bits that are allowed to be “1” for each word is five. TEXT: Computer Science graduate students study (assume block size is five words)

PowerPoint Presentation:

WORD Signature ----------------------------------------------------------------- computer 0001 0110 0000 0110 Science 1001 0000 1110 0000 graduate 1000 0101 0100 0010 students 0000 0111 1000 0100 study 0000 0110 0110 0100 ---------------------------------------------------------------- Block Signature 1001 0111 1110 0110 ---------------------------------------------------------------- Superimposed Coding

Examples of sistrings :

Applicatain(s)/Advantage(s) Signature files provide a practical solution for storing and locating information in a number of different situations. Signature files have been applied as medium size databases, databases with low frequency of terms, WORM devices, parallel processing machines, and distributed environments

PowerPoint Presentation:

Hypertext and XML data structures The advent of the Internet and its exponential growth and wide acceptance as a new global information network has introduced new mechanisms for representing information. This structure is called hypertext and differs from traditional information storage data structures in format and use. The hypertext is Hypertext is stored in HTML format and XML . Bot of these languages provide detailed descriptions for subsets of text similar to the zoning. Hypertext allows one item to reference another item via a embedded pointer . HTML defines internal structure for information exchange over WWW on the internet. XML: defined by DTD, DOM, XSL, etc.

PowerPoint Presentation:

Over

authorStream Live Help