dm11 Web Mining

Presentation Transcript

Web Mining : 

Web Mining
Ahmed M. Zeki

Introduction : 

The World Wide Web is a rich source of knowledge that can be useful to many applications.
Source? Billions of web pages and billions of visitors and contributors.
What knowledge? For example, the hyperlink structure and the diversity of languages.
Purpose? To improve users' efficiency and effectiveness in searching for information on the Web, and to support decision-making or business management.

Introduction : 

The Web's characteristics:
- Large size
- Unstructured
- Different data types: text, images, hyperlinks, and user usage information
- Dynamic content
- Time dimension
- Multilingual
Hence data mining (DM) is a significant subfield of this area. The various activities and efforts in this area are referred to as Web mining.

Slide 5: 

Information extraction: techniques designed to identify useful information from text documents automatically.
Named-entity extraction: the automatic identification, from text documents, of the names of entities of interest.
Machine learning-based entity extraction systems rely on algorithms rather than human-created rules to extract knowledge or identify patterns from texts. Approaches include neural networks, decision trees, Hidden Markov Models, and entropy maximization.

Slide 6: 

Relevance feedback helps users conduct searches iteratively and reformulate search queries based on their evaluation of previously retrieved documents. Using relevance feedback, a model can learn the common characteristics of a set of relevant documents in order to estimate the probability of relevance for the remaining documents. Various machine learning algorithms, such as genetic algorithms, have been used in relevance feedback applications.

Slide 7: 

Information filtering techniques try to learn about users' interests based on their evaluations and actions, and then use this information to analyze new documents. Many personalization and collaborative systems have been implemented as software agents to help users in information systems.

Slide 8: 

Text classification: the classification of textual documents into predefined categories (supervised learning). One example is the Support Vector Machine (SVM), a statistical method that tries to find the hyperplane that best separates two classes.
Text clustering: groups documents into categories that are not predefined but are defined dynamically based on the documents' similarities (unsupervised learning). One example is Kohonen's Self-Organizing Map (SOM), a type of neural network that produces a 2-dimensional grid representation of n-dimensional features and has been widely applied in IR.
Machine learning is the basis of most text classification and clustering applications.

Slide 9: 

Web spiders: software programs that traverse the WWW by following hypertext links and retrieving Web documents via HTTP. They are used:
- To build the databases of search engines
- To perform personal searches
- To archive Web sites or even the whole Web
- To collect Web statistics
Intelligent Web spiders: some spiders have been developed that use more advanced algorithms during the search process. For example, the Itsy Bitsy Spider searches the Web using a best-first search and a genetic algorithm approach.

Slide 10: 

In order to extract non-English knowledge from the Web, Web mining systems have to deal with issues in language-specific text processing. The base algorithms behind most machine learning systems are language-independent. Most algorithms, e.g., text classification and clustering, need only a set of features (a vector of keywords) for the learning process. However, the algorithms usually depend on phrase segmentation and extraction programs to generate the set of features or keywords that represents each Web document, and those programs are language-specific.
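The "vector of keywords" idea can be made concrete with a short sketch. The tokenizer below is an assumption made for illustration (it simply splits on non-letter characters, which only suits languages written with spaces between words); the similarity function itself does not depend on the language, which is the point the slide makes. The document names and texts are hypothetical.

```python
import math
import re
from collections import Counter

def keyword_vector(text):
    """Language-specific step: lowercase the text and count each word token."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(tokens)

def cosine_similarity(a, b):
    """Language-independent step: cosine between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Hypothetical documents
docs = {
    "d1": "web mining applies data mining to the web",
    "d2": "data mining discovers patterns in data",
    "d3": "the spider follows hypertext links",
}
vectors = {name: keyword_vector(text) for name, text in docs.items()}
sim_12 = cosine_similarity(vectors["d1"], vectors["d2"])
sim_13 = cosine_similarity(vectors["d1"], vectors["d3"])
```

Here d1 comes out closer to d2 than to d3 because they share "data" and "mining". For a language without word delimiters, only `keyword_vector` would need to be replaced by a proper segmentation program.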

Slide 11: 

Web visualization tools have been used to help users maintain a "big picture" of the retrieval results from search engines, web sites, a subset of the Web, or even the whole Web. The best-known example of using the tree metaphor for Web browsing is the hyperbolic tree developed by Xerox PARC.

Slide 12: 

Semantic Web technology tries to add metadata to describe data and information on the Web, based on standards like XML. Machine learning can play three roles in the Semantic Web:
- It can be used to automatically create the markup or metadata for existing unstructured textual documents on the Web.
- It can be used to create, merge, update, and maintain ontologies.
- It can understand and perform reasoning on the metadata provided by the Semantic Web in order to extract knowledge from the Web more effectively.

Web Mining : 

Web mining is the application of data mining techniques to discover patterns from the Web. The term was coined by Etzioni (1996).
How is Web mining different from classical DM?
- The Web is not a relation: it combines textual information and linkage structure.
- Usage data is huge and growing rapidly: Google's usage logs are bigger than their web crawl, and the data generated per day is comparable to the largest conventional data warehouses.
- It offers the ability to react in real time to usage patterns, with no human in the loop.

Benefits of Web Data Mining : 

- Match your available resources to visitor interests
- Increase the value of each visitor
- Improve the visitor's experience at the website
- Perform targeted resource management
- Collect information in new ways
- Test the relevance of content and web site architecture

Web Mining : 

According to the analysis target, web mining can be divided into three types:
1. Web usage mining
2. Web content mining
3. Web structure mining

1. Web Usage Mining : 

Web usage mining is the application of data mining to analyze and discover interesting patterns in users' usage data on the Web. The usage data records a user's behavior when the user browses or makes transactions on a web site; it is mined in order to better understand and serve the needs of users or Web-based applications. It is an activity that involves the automatic discovery of patterns from one or more Web servers.

1. Web Usage Mining : 

Organizations often generate and collect large volumes of data; most of this information is generated automatically by Web servers and collected in server logs. Analyzing such data can help these organizations determine:
- the value of particular customers
- cross-marketing strategies across products
- the effectiveness of promotional campaigns, etc.

1. Web Usage Mining : 

The first web analysis tools simply provided mechanisms to report user activity as recorded in the servers. Using such tools, it was possible to determine information such as:
- the number of accesses to the server
- the times or time intervals of visits
- the domain names and the URLs of users of the Web server
These tools provide little or no analysis of data relationships among the accessed files and directories within the Web space. More sophisticated techniques for the discovery and analysis of patterns are now emerging. These tools fall into two main categories:
- Pattern discovery tools
- Pattern analysis tools

1. Web Usage Mining : 

Web servers, Web proxies, and client applications can quite easily capture Web usage data. A Web server log records every visit to the pages: which files were requested and when, the IP address of the request, the error code, the number of bytes sent to the user, the type of browser used, and so on. By analyzing the Web usage data, web mining systems can discover useful knowledge about a system's usage characteristics and the users' interests, which has various applications:
- Personalization and collaboration in Web-based systems
- Marketing
- Web site design and evaluation
- Decision support

1. Web Usage Mining : 

Web usage mining has been used for various purposes:
- As a knowledge discovery process for mining marketing intelligence information from Web data.
- Web traffic patterns can be extracted from Web usage logs in order to improve the performance of a Web site.
- Search engine transaction logs provide valuable knowledge about user behavior in Web searching. Such information is very useful for a better understanding of users' Web searching and information-seeking behavior and can improve the design of Web search systems.

1. Web Usage Mining : 

One of the major goals of Web usage mining is to reveal interesting trends and patterns, which can often provide important knowledge about the users of a system. The framework for Web usage mining has three stages:
- Preprocessing: data cleansing
- Pattern discovery
- Pattern analysis
Generic machine learning and data mining techniques, such as association rule mining, classification, and clustering, can often be applied.
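As an illustration of the pattern discovery stage, the following minimal sketch mines page-pair association rules from sessions. It assumes preprocessing has already grouped the log into per-visitor sessions; the session data, the function name `pair_rules`, and the 0.5 support threshold are all hypothetical choices, not something the slides specify.

```python
from collections import Counter
from itertools import combinations

# Hypothetical, already-cleaned sessions: the set of pages each visitor viewed.
sessions = [
    {"/jobs/", "/news/", "/courses/"},
    {"/jobs/", "/news/"},
    {"/jobs/", "/software/"},
    {"/news/", "/courses/"},
]

def pair_rules(sessions, min_support=0.5):
    """Return {(a, b): (support, confidence)} for rules "a implies b".

    support    = fraction of sessions containing both a and b
    confidence = fraction of sessions containing a that also contain b
    """
    n = len(sessions)
    page_counts = Counter(page for s in sessions for page in s)
    pair_counts = Counter(
        pair for s in sessions for pair in combinations(sorted(s), 2)
    )
    rules = {}
    for (a, b), count in pair_counts.items():
        support = count / n
        if support < min_support:
            continue  # pattern analysis would discard infrequent pairs
        rules[(a, b)] = (support, count / page_counts[a])
        rules[(b, a)] = (support, count / page_counts[b])
    return rules

rules = pair_rules(sessions)
```

In this toy data every visitor who viewed /courses/ also viewed /news/, so that rule has confidence 1.0, while the reverse rule is weaker. A production system would use a full frequent-itemset algorithm rather than pairs only.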

1. Web Usage Mining : 

Many Web applications aim to provide personalized information and services to users. Web usage data provide an excellent way to learn about users' interests. Web usage mining on Web logs can help identify users who have accessed similar Web pages. The patterns that emerge can be very useful in collaborative Web searching and filtering. Amazon.com uses collaborative filtering to recommend books to potential customers based on the preferences of other customers with similar interests or purchasing histories. Huang et al. (2002) used a Hopfield net to model user interests and product profiles in an online bookstore in Taiwan.
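A very small sketch of the underlying idea (finding users who accessed similar pages and recommending from them) is given below. It uses Jaccard similarity with a single nearest neighbour, which is a simplification chosen for illustration; it is not the method used by Amazon.com or by Huang et al. The user names and page sets are hypothetical.

```python
# Hypothetical usage data: the set of pages each user has accessed.
visits = {
    "alice": {"/jobs/", "/news/", "/courses/"},
    "bob": {"/jobs/", "/news/", "/software/"},
    "carol": {"/meetings/"},
}

def jaccard(a, b):
    """Overlap between two page sets: |a and b| / |a or b|."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def recommend(user, visits):
    """Return (most similar user, pages that user saw which `user` has not).

    Returns (None, []) when no other user shares any page with `user`.
    """
    others = [u for u in visits if u != user]
    if not others:
        return None, []
    best = max(others, key=lambda u: jaccard(visits[user], visits[u]))
    if jaccard(visits[user], visits[best]) == 0:
        return None, []
    return best, sorted(visits[best] - visits[user])

neighbour, suggestions = recommend("alice", visits)
```

With this data, alice's nearest neighbour is bob (two shared pages out of four), so /software/ is suggested to her; carol shares nothing with anyone and gets no suggestion.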

Web Server Log : 

[Figure: a user requests http://www.kdnuggets.com/jobs/ from the KDnuggets.com server.]

Web Server Log – A Sample : 

152.152.98.11 - - [16/Nov/2005:16:32:50 -0500] "GET /jobs/ HTTP/1.1" 200 15140 "http://www.google.com/search?q=salary+for+data+mining&hl=en&lr=&start=10&sa=N" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)"

Web log fields : 

IP: 152.152.98.11. The IP address; it can be converted to a host name, such as xyz.example.com.
Name: the name of the remote user (usually omitted and replaced by a dash "-").
Login: the login of the remote user (also usually omitted and replaced by a dash "-").
Date/Time/TZ: 16/Nov/2005:16:32:50 -0500.
Remaining fields: request, status code, object size, referrer, and user agent.
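The field layout above can be turned into a parser with a single regular expression. The sketch below is written against the sample line from the previous slide; the pattern, the group names, and the function name `parse_log_line` are illustrative assumptions, and real logs may contain lines this pattern does not cover.

```python
import re
from datetime import datetime

# One named group per field, in the order the fields appear in the sample line.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) (?P<name>\S+) (?P<login>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\d+|-) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)

def parse_log_line(line):
    """Return a dict of fields, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    entry = match.groupdict()
    # Month abbreviations such as "Nov" are parsed with the current locale.
    entry["time"] = datetime.strptime(entry["time"], "%d/%b/%Y:%H:%M:%S %z")
    entry["status"] = int(entry["status"])
    entry["size"] = 0 if entry["size"] == "-" else int(entry["size"])
    return entry

sample = (
    '152.152.98.11 - - [16/Nov/2005:16:32:50 -0500] "GET /jobs/ HTTP/1.1" 200 15140 '
    '"http://www.google.com/search?q=salary+for+data+mining&hl=en&lr=&start=10&sa=N" '
    '"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)"'
)
entry = parse_log_line(sample)
```

After parsing, `entry["referrer"]` holds the Google search URL, so the query that led the visitor to /jobs/ ("salary for data mining") can be recovered from it.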

Web Usage Mining - Basic : 

Totals for each component:
- Hits: total number of requests
- Files: number of GETs
- Pages: number of HTML pages
- Sites: unique IP addresses
- Response codes
- Kbytes: total Kbytes transferred
- User agents

Web Log Analysis Programs : 

Free: Analog, AWStats, Webalizer, Google Analytics
Commercial: WebTrends, WebSideStory, …

Example: KDnuggets.com Nov 2005 totals : 

Monthly statistics (from Webalizer)
Q: What is the difference between Hits and Files?
A: The difference between Hits and Files is the number of requests whose status code is not 200.

Example: KDnuggets.com Nov 2005 totals : 

Q: What is the meaning of the difference between Files and Pages?
A: The difference between Files and Pages is the number of non-HTML files (e.g., images, JavaScript, etc.).
In the November 2005 KDnuggets log, HTML files were about 1/3 of all requests. However, this data does not separate out bot requests (which are heavily weighted towards HTML pages).
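The relationship between Hits, Files, Pages, and Sites can be sketched in a few lines. The records below are hypothetical (only the first mirrors the sample log line), and the rule for what counts as an HTML page (`PAGE_SUFFIXES`) is an assumption; following the answers above, Files are taken to be requests with status 200 and Pages the HTML subset of Files. Actual log analyzers may count these slightly differently.

```python
# Hypothetical, already-parsed log records (only the fields needed here).
records = [
    {"ip": "152.152.98.11", "path": "/jobs/", "status": 200, "size": 15140},
    {"ip": "152.152.98.11", "path": "/images/logo.gif", "status": 200, "size": 2048},
    {"ip": "10.0.0.7", "path": "/index.html", "status": 304, "size": 0},
    {"ip": "10.0.0.7", "path": "/missing.html", "status": 404, "size": 512},
    {"ip": "10.0.0.8", "path": "/news/2005/n22.html", "status": 200, "size": 8192},
]

# Assumption: a request counts as an HTML page if its path ends like this.
PAGE_SUFFIXES = (".html", ".htm", "/")

def monthly_totals(records):
    """Compute the basic totals described on the slides."""
    files = [r for r in records if r["status"] == 200]
    pages = [r for r in files if r["path"].endswith(PAGE_SUFFIXES)]
    return {
        "hits": len(records),                       # every request
        "files": len(files),                        # requests answered with 200
        "pages": len(pages),                        # HTML subset of files
        "sites": len({r["ip"] for r in records}),   # unique IP addresses
        "kbytes": sum(r["size"] for r in records) / 1024,
    }

totals = monthly_totals(records)
```

In this toy log, Hits minus Files is 2 (the 304 and 404 responses) and Files minus Pages is 1 (the GIF image), matching the two answers above.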

2. Web Content Mining : 

Web content mining is the process of discovering useful information from the content of a web page. Web content may consist of:
- Text
- Images
- Audio
- Video
Web content mining is sometimes called web text mining, because text content is the most widely researched area. The technologies normally used in web content mining are:
- Natural Language Processing (NLP)
- Information Retrieval (IR)

Text Mining : 

Text mining is the process of deriving high-quality information from text. It is an interdisciplinary field that draws on information retrieval, data mining, machine learning, statistics, and computational linguistics. As most information (over 80%) is currently stored as text, text mining is believed to have high commercial potential. High-quality information is typically derived by discovering patterns and trends through means such as statistical pattern learning.

Text Mining : 

Text mining usually involves the process of:
- Structuring the input text, by parsing, adding some derived linguistic features and removing others, and subsequently inserting the result into a database
- Deriving patterns within the structured data
- Evaluating and interpreting the output
"High quality" in text mining refers to some combination of:
- Relevance
- Novelty
- Interestingness

Text Mining : 

Typical text mining tasks include:
- Text categorization
- Text clustering
- Concept/entity extraction
- Sentiment analysis
- Document summarization
- Entity relation modeling (i.e., learning relations between named entities)

3. Web Structure Mining : 

Web structure mining is the process of using graph theory to analyze the node and connection structure of a web site. It can be divided into two kinds:
- Extracting patterns from hyperlinks in the Web. A hyperlink is a structural component that connects a web page to a different location.
- Mining the document structure. This uses the tree-like structure of a page to analyze and describe the HTML or XML tags within it.

3. Web Structure Mining : 

Web structure mining has been largely influenced by research in:
- Social network analysis
- Citation analysis (bibliometrics)
In-links are the hyperlinks pointing to a page; out-links are the hyperlinks found in a page. Usually, the larger the number of in-links, the better a page is considered to be. By analyzing the pages that contain a URL, we can also obtain anchor text: how other Web page authors annotate a page, which can be useful in predicting the content of the target page.
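The notions of out-links, in-links, and anchor text can be illustrated with Python's standard `html.parser` module. The three-page site below is hypothetical, and the `LinkExtractor` class is a minimal sketch that ignores relative-URL resolution and malformed markup.

```python
from collections import defaultdict
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect (href, anchor text) pairs from the <a> tags of one page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None   # href of the <a> tag currently open, if any
        self._text = []     # text fragments seen inside that tag

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None

# Hypothetical three-page site.
pages = {
    "a.html": '<p>See <a href="b.html">job listings</a> and <a href="c.html">news</a>.</p>',
    "b.html": '<a href="c.html">latest news</a>',
    "c.html": "<p>No links here.</p>",
}

out_links = {}                 # page -> URLs it links to
in_links = defaultdict(list)   # page -> (linking page, anchor text)
for url, html in pages.items():
    parser = LinkExtractor()
    parser.feed(html)
    parser.close()
    out_links[url] = [href for href, _ in parser.links]
    for href, text in parser.links:
        in_links[href].append((url, text))
```

Here c.html has the most in-links (two), and both anchor texts mention "news", which hints at its content even though c.html itself contains no such word.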

3. Web Structure Mining : 

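PageRank is computed by weighting each in-link to a page in proportion to the quality of the page containing the in-link. The qualities of these referring pages are themselves determined by PageRank. Thus, the PageRank of a page p is calculated recursively. In its standard formulation:

PageRank(p) = (1 - d) + d × Σ PageRank(q) / c(q), summed over all pages q that link to p

where d is a damping factor between 0 and 1 and c(q) is the number of out-going links in page q.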
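The recursion can be approximated by simple iteration. The sketch below assumes the commonly cited form PageRank(p) = (1 - d) + d * sum(PageRank(q) / c(q)) over the pages q linking to p, with c(q) the number of out-links of q. The damping factor 0.85, the iteration count, and the three-page graph are illustrative assumptions, and the code assumes every page has at least one out-link.

```python
def pagerank(out_links, d=0.85, iterations=50):
    """Iteratively compute PageRank for a graph given as {page: [out-links]}.

    Assumes every page has at least one out-link (no dangling pages).
    """
    pages = list(out_links)
    rank = {p: 1.0 for p in pages}  # start every page at the same score
    for _ in range(iterations):
        new_rank = {}
        for p in pages:
            # Each page q linking to p passes on an equal share of its rank.
            incoming = sum(
                rank[q] / len(out_links[q])
                for q in pages
                if p in out_links[q]
            )
            new_rank[p] = (1 - d) + d * incoming
        rank = new_rank
    return rank

# Hypothetical graph: a links to b and c, b links to c, c links back to a.
graph = {
    "a": ["b", "c"],
    "b": ["c"],
    "c": ["a"],
}
ranks = pagerank(graph)
```

Page c ends up ranked highest because it receives links from both other pages, and a comes second because its single in-link is from the highly ranked c. This illustrates that the quality of the referring page matters, not just the number of in-links.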

Ads vs. search results : 

Search advertising is the revenue model: a multi-billion-dollar industry in which advertisers pay for clicks on their ads.
Interesting problems:
- How to pick the top 10 results for a search from 2,230,000 matching pages?
- What ads to show for a search?
- If I'm an advertiser, which search terms should I bid on, and how much should I bid?

Web Mining vs. Information Access : 

Text data mining involves extracting "nuggets" and/or overall patterns from a collection of textual information, independent of a user's information need. Information access is the process of helping users find, create, use, re-use, and understand information to satisfy an information need. In other words, data mining is opportunistic, whereas information access is goal-driven.

Search Engine Components : 

Spider (crawler/robot): builds the corpus
- Collects web pages recursively: for each known URL, fetch the page, parse it, and extract new URLs; repeat
- Additional pages come from direct submissions and other sources
Indexer: creates inverted indexes
- Various policies with respect to which words are indexed, capitalization, support for Unicode, stemming, support for phrases, etc.
Query processor: serves query results
- Front end: query reformulation, word stemming, capitalization, optimization of Booleans, etc.
- Back end: finds matching documents and ranks them
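The three components can be sketched end to end on a toy scale. Everything below is a simplified assumption: the `WEB` dictionary stands in for real HTTP fetching and HTML parsing, the URLs are placeholders, the indexer lowercases words but does no stemming, and the query processor answers Boolean AND queries without any ranking.

```python
import re
from collections import defaultdict, deque

# Hypothetical in-memory "web": URL -> (page text, outgoing URLs).
WEB = {
    "http://example.com/": (
        "Web mining tutorial",
        ["http://example.com/usage", "http://example.com/content"],
    ),
    "http://example.com/usage": (
        "Web usage mining analyzes server logs",
        ["http://example.com/"],
    ),
    "http://example.com/content": ("Content mining analyzes text", []),
}

def crawl(seed):
    """Spider: fetch every page reachable from the seed, breadth first."""
    corpus, queue, seen = {}, deque([seed]), {seed}
    while queue:
        url = queue.popleft()
        text, links = WEB[url]  # stands in for an HTTP fetch plus parsing
        corpus[url] = text
        for link in links:
            if link in WEB and link not in seen:
                seen.add(link)
                queue.append(link)
    return corpus

def build_index(corpus):
    """Indexer: map each lowercased word to the set of URLs containing it."""
    index = defaultdict(set)
    for url, text in corpus.items():
        for word in re.findall(r"[a-z]+", text.lower()):
            index[word].add(url)
    return index

def search(index, query):
    """Query processor: return URLs containing every query word (no ranking)."""
    words = re.findall(r"[a-z]+", query.lower())
    if not words:
        return []
    matches = set.intersection(*(index.get(w, set()) for w in words))
    return sorted(matches)

corpus = crawl("http://example.com/")
index = build_index(corpus)
results = search(index, "mining analyzes")
```

A real back end would replace the final `sorted` call with a ranking step, for example one that combines term statistics with a link-based score.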

Application Areas of Web Mining : 

- E-commerce
- Search engines
- Personalization
- Website design

Application Areas of Web Mining : 

E-tailers: the ability to find new cross-sell opportunities, enable comprehensive prospect profiling, and improve customer satisfaction.
Examples: B2B and B2C ventures.

Application Areas of Web Mining : 

Advertising-based sites: when revenue is advertising-based, blindly serving ads to visitors will not result in a large click-through rate. Instead, ads must be intelligently targeted to the user, offering visitors products and services that they are interested in.
Examples: entertainment sites, media portals, advertising providers.

Application Areas of Web Mining : 

Information repositories: information overload is a problem that grows larger every day. Indexing, summarization, and other metadata tasks are time-consuming. Semantic text analyzers are capable of automating these tasks and of creating user navigation systems on the fly.
Examples: libraries, technical support sites, media sites, content providers.

Application Areas of Web Mining : 

Security applications: one of the largest text mining applications that exists is probably the classified ECHELON surveillance system.
Software and applications: research and development departments of major companies, including IBM and Microsoft, are researching text mining techniques and developing programs to further automate the mining and analysis processes.

Application Areas of Web Mining : 

Academic applications: text mining is of importance to publishers who hold large databases of information requiring indexing for retrieval.

Conclusion : 

Major limitations of Web mining research:
- Lack of suitable test collections that can be reused by researchers.
- Difficulty of collecting Web usage data across different Web sites.
Future research directions:
- Multimedia data mining: a picture is worth a thousand words.
- Multilingual knowledge extraction: Web page translations.
- Wireless Web: WML and HDML.
- The Hidden Web: forms, dynamically generated Web pages.
- Semantic Web.

* This presentation is reproduced from the attached articles.
