Web Mining: An Application of Soft Computing


*WEB MINING* APPLICATION OF SOFT COMPUTING:

PREPARED BY: TRIPTI
PRESENTED TO: SWATI AGGARWAL

INDEX:

- INTRODUCTION
- WEB MINING
- WEB MINING COMPONENTS & METHODOLOGIES
- LIMITATIONS OF EXISTING WEB MINING METHODS
- INTRODUCTION TO SOFT COMPUTING & ITS RELEVANCE
- FUZZY LOGIC & ITS RELEVANCE TO WEB MINING
- NEURAL NETWORKS & THEIR RELEVANCE TO WEB MINING

INTRODUCTION:

- Explosive growth in the information available on the WWW.
- Web browsers provide easy access to data and text.
- Finding the desired information is still not an easy task.
- The profusion of resources prompted the need for web mining.
- SOFT WEB MINING is a good candidate for developing automated tools that find, extract, and evaluate the information a user wants from unlabeled, heterogeneous data.

WEB MINING:

- Discovery and analysis of useful information from the WWW.
- Data can be collected at the server side, at the client side, or at proxy servers.
- Data can also be obtained from an organizational database.
- Characteristics of web data:
  - Unlabeled
  - Distributed
  - Heterogeneous
  - Semistructured
  - Time varying

WEB MINING COMPONENTS & METHODOLOGIES:

Depending on the web mining category and the objective, the different phases take on different roles and importance.

Web data -> Information Retrieval -> Information Extraction -> Generalization -> Analysis -> Knowledge

INFORMATION RETRIEVAL:

- Deals with the automatic retrieval of all relevant documents while fetching as few non-relevant documents as possible.
- The IR process mainly includes:
  - Document representation
    - By Fürnkranz [36]: bag of words and hyperlink information
    - By Soderland [40]: sentences, phrases, and named entities
  - Indexing (a collection of terms with pointers to the places where the documents can be found)
    - Popular indexes: AltaVista, WebCrawler (can scan millions of documents and store an index of the words in the documents)
  - Searching for documents
    - Search engines are used (programs written to query and retrieve information)
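The indexing and searching steps can be sketched with a tiny inverted index. This is only an illustration under stated assumptions: the tokenizer, the sample documents, and the names `build_index` and `search` are hypothetical and not taken from the slides or from any real search engine.

```python
import re
from collections import defaultdict

def build_index(docs):
    """Map each lowercase term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in re.findall(r"[a-z0-9]+", text.lower()):
            index[term].add(doc_id)
    return index

def search(index, query):
    """Return ids of documents that contain every query term (boolean AND)."""
    terms = re.findall(r"[a-z0-9]+", query.lower())
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result

# Hypothetical documents, invented for illustration.
docs = {
    "d1": "Web mining discovers useful information from the web",
    "d2": "Fuzzy logic handles imprecise queries",
    "d3": "Neural networks support web personalization",
}
index = build_index(docs)
hits = search(index, "web mining")
```

Because the match is an exact boolean AND, a query term that is misspelled or absent returns nothing, which is the kind of rigidity the later fuzzy-logic slides address.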

INFORMATION SELECTION/EXTRACTION & PREPROCESSING:

- The task of identifying the specific fragments of a single document that constitute its core semantic content.
- Methods used:
  - Writing wrappers (hand coding), which map the documents to some data model.
  - Treating the various sites as knowledge sources and extracting information from them. To do so, the system processes the site's documents to extract relevant text fragments.
  - Extracting information from hypertext: each page is approached with a set of questions, so the problem reduces to identifying the text fragments that answer those specific questions.

INFORMATION SELECTION/EXTRACTION & PREPROCESSING:

- LSI (Latent Semantic Indexing): a preprocessing technique for IE.
- When a user requests a web page, a variety of files is accessed:
  - Images
  - Sound
  - Video
  - HTML pages
- The server therefore holds both relevant and irrelevant entries; the irrelevant ones need to be removed by preprocessing.
- LSI transforms the original documents to a lower-dimensional space by analyzing the correlation structure of terms.
- As a result, similar documents that do not share the same terms are still placed in the same category.

GENERALIZATION:

- Uses pattern recognition and machine learning techniques.
- Machine learning systems learn about the user's interests rather than about the web itself.
- The major obstacle when learning about the web is the labelling problem:
  - Data mining techniques require inputs labelled as (+ve) or (-ve).
  - For example, given a large set of web pages labelled as (+ve) or (-ve) examples of a homepage, we can design a classifier that predicts whether an unknown page is a homepage or not. Unfortunately, web pages are not labelled.
  - Clustering techniques do not require labelled inputs and outputs.
- Association rule mining (an integral part of this phase):
  - A rule has the form X => Y, where X and Y are sets of items.
  - It expresses that whenever a transaction T contains X, then T probably contains Y also.
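The strength of a rule X => Y is commonly quantified with support and confidence. The sketch below shows both measures on a handful of made-up page-visit transactions; the data and the function names are assumptions for illustration only.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(1 for t in transactions if items <= t) / len(transactions)

def confidence(transactions, x, y):
    """Estimate of P(Y in T | X in T): support(X and Y) / support(X)."""
    support_x = support(transactions, x)
    if support_x == 0:
        return 0.0
    return support(transactions, set(x) | set(y)) / support_x

# Hypothetical sessions: each transaction is the set of pages one user visited.
transactions = [
    {"home", "products", "cart"},
    {"home", "products"},
    {"home", "about"},
    {"products", "cart"},
]

rule_support = support(transactions, {"products", "cart"})
rule_confidence = confidence(transactions, {"products"}, {"cart"})
```

Here the rule {products} => {cart} holds in two of the three transactions that contain "products", so its confidence is 2/3, which is what "T probably contains Y" means in practice.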

ANALYSIS:

- A data-driven problem.
- Presumes that sufficient data is available to extract and analyse useful information.
- Important for the validation and interpretation of mined patterns.
- Uses Online Analytical Processing (OLAP) techniques.
- WEBMINER proposes an SQL-like querying mechanism for querying the discovered knowledge.

WEB MINING CATEGORIES:

- WCM (Web Content Mining)
- WSM (Web Structure Mining)
- WUM (Web Usage Mining)

WCM (WEB CONTENT MINING):

- Deals with the discovery of useful information from web contents, data, documents, and services.
- Web content includes text, audio, video, symbolic data, metadata, and hyperlinked data.
- Web text data (3 types):
  1) unstructured data (free text)
  2) semistructured data (HTML)
  3) fully structured data (tables or databases)

WSM (WEB STRUCTURE MINING):

- Mines the structure of hyperlinks within the web itself.
- The structure represents the graph of the links in a site or between sites.
- Reveals more information than just the information contained in the documents.
- Rather than collecting the whole index, it focuses only on the links that are relevant and avoids irrelevant regions.

Links pointing to a document indicate the popularity of the document, while links coming out of a document indicate the variety of documents
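Both quantities can be read directly off the link graph as in-degree and out-degree. A minimal sketch, assuming a hypothetical list of (source, target) hyperlinks:

```python
from collections import Counter

# Hypothetical hyperlinks as (source page, target page) pairs.
links = [("a", "b"), ("a", "c"), ("b", "c"), ("d", "c")]

# Links pointing to a page: a rough popularity signal.
in_degree = Counter(target for _, target in links)

# Links leaving a page: a rough signal of the variety of documents it refers to.
out_degree = Counter(source for source, _ in links)

most_linked_page, most_linked_count = in_degree.most_common(1)[0]
```

In this toy graph page "c" is the most popular, with three incoming links, while page "a" points to the widest variety of documents.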

WUM (WEB USAGE MINING):

- Mines the secondary data generated by users' interaction with the web.
- Also known as web log mining.
- Works on user profiles, user access patterns, and navigation paths.
- Plays a key role in personalizing the web space, which is the need of the hour.
- Uses techniques like:
  - Association rules
  - Clustering
  - Sequential patterns
  - Rough sets
  - Fuzzy logic

LIMITATIONS OF EXISTING WEB MINING METHODS:

INFORMATION RETRIEVAL
- Subjectivity, imprecision, and uncertainty
- Deduction
- Page ranking
- Dynamism, scale, and heterogeneity

INFORMATION EXTRACTION
- Based on the “wrapper” technique.
- Limitation: each wrapper is an IE system customized for a particular site and is not universally applicable. Ad hoc formatting conventions used in one site are rarely relevant elsewhere.

LIMITATIONS(CONTD…):

GENERALIZATION
- Clustering
- Outliers
- Association rule mining

ANALYSIS
- Discovering knowledge from the information is a challenge for analysts.
- The output of knowledge mining algorithms is not suitable for direct human interpretation.
- The patterns discovered are mainly in mathematical form.

SOFT COMPUTING & ITS RELEVANCE:

SOFT COMPUTING: a collection of methodologies that provide information-processing capabilities for handling ambiguous real-life situations. Its components include:
- FUZZY LOGIC
- ARTIFICIAL NEURAL NETWORKS

FUZZY LOGIC AND WEB MINING:

- FUZZY SETS: their elements possess degrees of membership.
- Classically, the membership of an element in a set is bivalent:
  - The element belongs to the set (1)
  - The element does not belong to the set (0)
- A fuzzy set is designated by a pair (A, m), where A is a set and m : A -> [0,1] is its membership function.
- Values strictly between 0 and 1 represent fuzzy members.
- In fuzzy logic, the degree of truth of a statement can range between 0 and 1. It is not restricted to the 2 truth values true (1) and false (0).
- Fuzzy logic deals with reasoning that is approximate rather than precise.
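The difference between bivalent and fuzzy membership can be sketched for the set of "long documents". The thresholds (1000, 500, and 1500 words) and the function names are assumptions chosen only to make the example concrete.

```python
def crisp_long(words, threshold=1000):
    """Classical set: a document is either long (1) or not (0)."""
    return 1 if words >= threshold else 0

def fuzzy_long(words, low=500, high=1500):
    """Fuzzy set: membership rises linearly from 0 at `low` to 1 at `high`."""
    if words <= low:
        return 0.0
    if words >= high:
        return 1.0
    return (words - low) / (high - low)

# Standard fuzzy connectives: min for AND, max for OR, 1 - m for NOT.
def fuzzy_and(m1, m2):
    return min(m1, m2)

def fuzzy_not(m):
    return 1.0 - m

membership = fuzzy_long(1000)  # halfway between the two thresholds
```

A 999-word document is simply "not long" for the crisp set, whereas the fuzzy set gives it a membership just under 0.5, so small differences in the input no longer flip the answer.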

FUZZY LOGIC AND WEB MINING:

INFORMATION RETRIEVAL
- Yager described an IR language that enables a user to specify the interrelationships between the desired attributes of the documents sought, using linguistic quantifiers, e.g. “at least”, “most”, “about half”.
  - Q -> linguistic expression for a quantifier such as “most”
  - Q is represented by a fuzzy subset over I = [0,1].
  - For any proportion r belonging to I, Q(r) -> degree to which r satisfies the concept indicated by the quantifier Q.
- The model proposed by Kóczy & Gedeon helps in retrieving documents when it cannot be guaranteed that the user's queries include the actual words that occur in the documents to be retrieved.
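A quantifier such as “most” can be sketched as a membership function Q over proportions r in [0,1]. The piecewise-linear shape and the breakpoints 0.3 and 0.8 below are assumptions for illustration, and the proportion-based scoring is a simplification rather than Yager's full query language.

```python
def most(r, low=0.3, high=0.8):
    """Q(r): degree to which the proportion r satisfies the quantifier 'most'."""
    if r <= low:
        return 0.0
    if r >= high:
        return 1.0
    return (r - low) / (high - low)

def satisfied_proportion(document_attrs, desired_attrs):
    """Proportion r of the desired attributes that the document actually has."""
    desired = set(desired_attrs)
    if not desired:
        return 0.0
    return len(desired & set(document_attrs)) / len(desired)

# Hypothetical query: the user wants documents with "most" of four attributes.
document = {"fuzzy", "retrieval", "web"}
desired = {"fuzzy", "retrieval", "web", "neural"}

r = satisfied_proportion(document, desired)
degree = most(r)
```

A document that has three of the four desired attributes still receives a high degree of match instead of being rejected outright.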

FUZZY LOGIC AND WEB MINING:

Model proposed by Bordogna and Pasi for semi-structured (e.g. HTML) document retrieval:
- $R_d$ -> representation of document $d \in D$
- $D$ -> archive (set) of documents
- $t \in T$, where $T$ is the set of index terms
- The membership function of $R_d$ is $F_s(d,t) = tf_{dst} \cdot IDF_t$ -> significance of term $t$ in section $s$ of document $d$
- $IDF_t$ -> inverse document frequency of term $t$
- $tf_{dst}$ -> normalized term frequency, defined as $tf_{dst} = \dfrac{OCC_{dst}}{MAXOCC_{sd}}$, where
  - $OCC_{dst}$ -> number of occurrences of term $t$ in section $s$ of document $d$
  - $MAXOCC_{sd}$ -> normalization parameter depending on the section's length
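The significance of a term in a section can be sketched as a normalized term frequency multiplied by an inverse document frequency. The slide does not give the exact IDF expression, so the log-scaled version below, rescaled to [0,1] so that the product stays a valid membership degree, is an assumption, as are all function names and numbers.

```python
import math

def normalized_tf(occurrences, normalization):
    """Occurrences of the term in the section, divided by a normalization
    parameter that depends on the section's length."""
    return occurrences / normalization if normalization else 0.0

def idf(num_docs, docs_with_term):
    """Assumed IDF: log(N / n_t), divided by log(N) to stay within [0, 1]."""
    if docs_with_term == 0 or num_docs <= 1:
        return 0.0
    return math.log(num_docs / docs_with_term) / math.log(num_docs)

def significance(occurrences, normalization, num_docs, docs_with_term):
    """Membership degree of a term in one section of one document."""
    return normalized_tf(occurrences, normalization) * idf(num_docs, docs_with_term)

# Hypothetical numbers: the term occurs 3 times in a title section whose
# normalization parameter is 6, and appears in 10 of 100 archived documents.
degree = significance(3, 6, 100, 10)
```

Computing one degree per section lets a retrieval system weight a hit in a title differently from a hit in the body text.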

COMMERCIALLY AVAILABLE SYSTEMS:

- NZsearch
  - Search engine based on fuzzy logic.
  - It considers the entire phrase rather than individual words for the purpose of matching.
- DNS Search
  - Uses FL to find the closest DNS entry to the URL you typed.
  - E.g. if you type www.gogle.com, the system suggests possible close URLs.
- Finder
  - Uses multidimensional optimization to display the best or “most suited” matches to the query.
  - Existing search engines provide only exact matches to the query. Finder goes beyond the “yes” or “no” criterion used by SQL or Btrieve.
  - Uses a SCORING MODEL. E.g. if one is looking for a blue car but the car in the database is red, it will not ignore the entry altogether but will give it a lower score.
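The idea behind such a scoring model can be sketched as a weighted partial match. This is not Finder's actual algorithm, which the slides do not describe; the attribute weights, the records, and the `score` function are hypothetical.

```python
def score(record, wanted, weights):
    """Weighted fraction of the wanted attributes that the record matches."""
    total = sum(weights.values())
    matched = sum(
        weight
        for attribute, weight in weights.items()
        if record.get(attribute) == wanted.get(attribute)
    )
    return matched / total

# Hypothetical database entries and query.
cars = [
    {"id": 1, "type": "car", "color": "blue"},
    {"id": 2, "type": "car", "color": "red"},
    {"id": 3, "type": "truck", "color": "green"},
]
wanted = {"type": "car", "color": "blue"}
weights = {"type": 2.0, "color": 1.0}  # assumed: vehicle type matters more than color

ranked = sorted(cars, key=lambda car: score(car, wanted, weights), reverse=True)
ranked_ids = [car["id"] for car in ranked]
```

The red car is ranked below the blue one but above the truck, instead of being dropped as it would be by an exact-match query.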

FL – Prospective areas of application:

- Provides human-like deductive capability to the search engine.
- Can be used for term matching, at the cost of a slight compromise on precision.
- For “page ranking”, the degree of closeness of hits in a document can be used, e.g. through linguistic variables like “close”, “far”, “nearby”.

NEURAL NETWORK AND WEB MINING:

- A parallel interconnected network of simple processing elements, intended to interact with the objects of the real world in the same way as biological systems do.
- Designated by:
  - Network topology
  - Weights
  - Node characteristics
  - Status updating rules
- Characteristics:
  - Generalization capability
  - Adaptivity to new data/information
  - Speed due to massively parallel architecture
  - Robustness to missing, confusing, ill-defined/noisy data
  - Capability for modeling non-linear decision boundaries

NEURAL NETWORK AND WEB MINING:

Neural networks are applied to web mining in three areas: INFORMATION RETRIEVAL, SELF ORGANIZATION, and PERSONALIZATION.

INFORMATION RETRIEVAL
- WISCONSIN ADAPTIVE WEB ASSISTANT (WAWA-IE+IR) system, suggested by Shavlik.
- Uses 2 network models:
  - SCORE LINK: uses unsupervised learning.
  - SCORE PAGE: uses supervised learning in the form of advice from users.
- The system uses knowledge-based neural nets (KBNNs) as its knowledge base to encode the initial knowledge of users, which is then refined.

SLIDING WINDOW:

- A rule like the following can be mapped into ScorePage: when “Professor ?Firstname ?Lastname” then show page.
- The sliding window maps large web pages into fixed-size NNs.
- It parses each page three words at a time; HTML tags like <p>, </p>, <br> act as window breakers.
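The windowing step can be sketched as follows. The regular expressions, the choice to treat every HTML tag as a window breaker, and the rule check are simplifying assumptions; this is not the WAWA implementation.

```python
import re

TAG = re.compile(r"<[^>]+>")          # any HTML tag acts as a window breaker (assumption)
WORD = re.compile(r"[A-Za-z0-9']+")

def windows(html, size=3):
    """Slide a fixed-size word window over each tag-delimited text segment."""
    result = []
    for segment in TAG.split(html):
        words = WORD.findall(segment)
        for start in range(len(words) - size + 1):
            result.append(tuple(words[start:start + size]))
    return result

def matches_professor_rule(window):
    """Toy version of the rule 'Professor ?Firstname ?Lastname'."""
    return len(window) == 3 and window[0] == "Professor"

# Hypothetical page fragment.
page = "<p>Professor Jane Doe teaches</p><br>Office hours"
all_windows = windows(page)
hits = [w for w in all_windows if matches_professor_rule(w)]
```

No window spans "teaches Office hours", because the tags between the two segments break the window. Each window has the same size, which is what lets pages of any length feed a fixed-size network input.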

NEURAL NETWORK AND WEB MINING:

SELF ORGANIZATION
- Uses visual map-like displays, e.g. WEBSOM.
- WEBSOM organizes very large, high-dimensional collections of text documents onto 2-dimensional map displays.
- The map forms a document landscape, in which similar documents appear close to each other at points of a regular map grid.
- The landscape can be labeled with automatically identified descriptive words that convey the properties of each area and also act as landmarks during page exploration.

PERSONALIZATION:

- Content and search results are tailored to the user's interests and habits.
- Neural networks can be used to learn user profiles, with training data collected from users or from systems.
- User profiles are highly non-linear functions, and NNs seem to be an effective tool for learning them.
- E.g. SYSKILL & WEBERT
  - An agent that learns user profiles using a “Bayesian classifier”.
  - Learns a new profile and makes suggestions about pages to visit quickly.
  - Once the HTML is analyzed, it annotates each link on the page with an icon indicating the user's rating.

PROSPECTIVE AREAS OF NEURAL NETWORKS:

PERSONALIZED PAGE RANKING
- PAGE RANKS are used to rate a page with relevance to user queries.
- The Google search engine computes the rank of a page “a” using

  $$Pr(a) = (1 - d) + d \sum_{i=1}^{n} \frac{Pr(T_i)}{C(T_i)}$$

  - d -> damping factor
  - Pr(a) -> rank of page a, which has pages T1, T2, ..., Tn pointing to it
  - C(a) -> number of outgoing links from page a
- Google takes care of:
  - Popularity of the page (reputation of incoming links)
  - Richness of page content (number of outgoing links)
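The rank computation can be sketched as a fixed-point iteration over the classic formula Pr(a) = (1 - d) + d * sum of Pr(T) / C(T) over the pages T pointing to a. The three-page graph, the damping factor 0.85, and the iteration count are assumptions for illustration, and the sketch ignores refinements used by real search engines.

```python
def pagerank(links, d=0.85, iterations=100):
    """Iterate Pr(a) = (1 - d) + d * sum(Pr(T) / C(T)) over pages T linking to a.

    `links` maps each page to the list of pages it links to.
    """
    pages = set(links) | {target for targets in links.values() for target in targets}
    rank = {page: 1.0 for page in pages}
    # Precompute, for every page, the pages that point to it.
    incoming = {
        page: [source for source in links if page in links[source]]
        for page in pages
    }
    for _ in range(iterations):
        rank = {
            page: (1 - d) + d * sum(rank[src] / len(links[src]) for src in incoming[page])
            for page in pages
        }
    return rank

# Hypothetical link graph.
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(links)
```

In this graph page "c" ends up with the highest rank because both "a" and "b" point to it, and "b" passes on its whole rank through its single outgoing link.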

PROSPECTIVE AREAS OF NEURAL NETWORKS:

- Google ignores factors like:
  - User preference (a link may or may not match the preferences of the user established from his/her history)
  - Validity (whether the link is currently valid or not)
  - Interestingness (whether the page is of overall interest to the user or not)
- ANNs can be exploited for determining:
  - User preference (incorporated by training an NN based on user history)
  - Interestingness (ANNs can model non-linear functions and learn from examples)

NEURO-FUZZY IR:

- FL can be used.
- Total relevance = relevance to query + relevance to user
- 3 categories:
  - Learning from user history
  - Clustering of users into homogeneous groups
  - Using relevance feedback

FUTURE SCOPE:

- Web mining algorithms can be used for handling multimedia data. Currently, queries are in the form of keywords; advanced search engines may support visual queries.
- Multilingual search engines and IR systems that can identify languages, translate, perform thematic classification, and provide summaries automatically are now being developed. Soft computing may be used to increase the efficiency of such systems.
- E-commerce is an important application of soft computing. Fuzzy sets may be used to impart human-like interaction in e-shopping portals.
- New knowledge visualization techniques for effective user interfaces may also be developed with soft computing.

THANK YOU