Web Mining
Ahmed M. Zeki

Introduction
The World Wide Web is a rich source of knowledge that can be useful to many applications.
Billions of web pages and billions of visitors and contributors.
e.g., the hyperlink structure and diversity of languages.
To improve users’ efficiency and effectiveness in searching for information on the web.
Decision-making support or business management.

Web’s Characteristics:
Different data types: text, image, hyperlinks and user usage information
Hence DM is a significant subfield of this area.
The various activities and efforts in this area are referred to as Web Mining.

Information extraction: techniques designed to identify useful information from text documents automatically.
Named-entity extraction: automatic identification from text documents of the names of entities of interest.
Machine learning-based entity extraction systems rely on algorithms rather than human-created rules to extract knowledge or identify patterns from texts.
Hidden Markov Model
Entropy maximization

Relevance feedback helps users conduct searches iteratively and reformulate search queries based on evaluation of previously retrieved documents.
Using relevance feedback, a model can learn the common characteristics of a set of relevant documents in order to estimate the probability of relevance for the remaining documents.
Various machine learning algorithms, such as genetic algorithms, have been used in relevance feedback applications.

Information filtering techniques try to learn about users’ interests based on their evaluations and actions, and then use this information to analyze new documents.
Many personalization and collaborative systems have been implemented as software agents to help users in information systems.

Text classification: classification of textual documents into predefined categories (supervised learning).
E.g., Support Vector Machine (SVM), a statistical method that tries to find a hyperplane that best separates two classes.
Text clustering: groups documents into non-predefined categories, which are dynamically defined based on their similarities (unsupervised learning).
Kohonen’s Self-Organizing Map (SOM), a type of neural network that produces a 2-dimensional grid representation for n-dimensional features, has been widely applied in IR.
Machine learning is the basis of most text classification and clustering applications.

Web spiders: software programs that traverse the WWW by following hypertext links and retrieving Web documents via HTTP.
To build the databases of search engines
To perform personal search
To archive Web sites or even the whole Web
To collect Web statistics
Intelligent Web Spiders: some spiders that use more advanced algorithms during the search process have been developed.
E.g., the Itsy Bitsy Spider searches the Web using a best-first search and a genetic algorithm approach.

In order to extract non-English knowledge from the web, Web Mining systems have to deal with issues in language-specific text processing.
The base algorithms behind most machine learning systems are language-independent. Most algorithms, e.g., text classification and clustering, need only a set of features (a vector of keywords) for the learning process.
However, the algorithms usually depend on some phrase segmentation and extraction programs to generate a set of features or keywords to represent Web documents.

Web visualization tools have been used to help users maintain a "big picture" of the retrieval results from search engines, web sites, a subset of the Web, or even the whole Web.
The best-known example of using the tree metaphor for Web browsing is the hyperbolic tree developed by Xerox PARC.

Semantic Web technology tries to add metadata to describe data and information on the Web, based on standards like XML.
Machine learning can play three roles in the Semantic Web:
can be used to automatically create the markup or metadata for existing unstructured textual documents on the Web.
can be used to create, merge, update, and maintain Ontologies.
can understand and perform reasoning on the metadata provided by the Semantic Web in order to extract knowledge from the Web more effectively.

Web Mining
Web mining is the application of data mining techniques to discover patterns from the Web.
The term was coined by Etzioni (1996).
How is Web mining different from classical DM?
The web is not a relation
Textual information and linkage structure
Usage data is huge and growing rapidly
Google’s usage logs are bigger than their web crawl
Data generated per day is comparable to largest conventional data warehouses
Ability to react in real-time to usage patterns
No human in the loop

Benefits of Web Data Mining
Match your available resources to visitor interests
Increase the value of each visitor
Improve the visitor's experience at the website
Perform targeted resource management
Collect information in new ways
Test the relevance of content and web site architecture

Web Mining
According to analysis targets, web mining can be divided into three different types:
Web usage mining
Web content mining
Web structure mining

1. Web Usage Mining
The application of data mining to analyze and discover interesting patterns in users’ usage data on the web.
The usage data records users’ behavior as they browse or make transactions on the web site; it is analyzed in order to better understand and serve the needs of users or Web-based applications.
It is an activity that involves the automatic discovery of patterns from one or more Web servers.

Organizations often generate and collect large volumes of data; most of this information is usually generated automatically by Web servers and collected in server logs. Analyzing such data can help these organizations to determine:
the value of particular customers
cross marketing strategies across products
the effectiveness of promotional campaigns, etc.

The first web analysis tools simply provided mechanisms to report user activity as recorded in the servers. Using such tools, it was possible to determine such information as:
the number of accesses to the server
the times or time intervals of visits
the domain names and the URLs of users of the Web server.
These tools provide little or no analysis of data relationships among the accessed files and directories within the Web space.
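
The kind of reporting these early tools offered can be sketched in a few lines. This is only an illustration: the `requests` records and the `basic_report` helper are hypothetical, and the sketch assumes the log has already been parsed into (host, timestamp, URL) tuples.

```python
from collections import Counter
from datetime import datetime

# Hypothetical, already-parsed requests: (client host, timestamp, requested URL).
requests = [
    ("host-a.example.com", datetime(2005, 11, 1, 9, 15), "/index.html"),
    ("host-b.example.com", datetime(2005, 11, 1, 9, 40), "/jobs/"),
    ("host-a.example.com", datetime(2005, 11, 1, 14, 5), "/jobs/"),
]


def basic_report(records):
    """Summarize accesses with plain counts, without relating files to each other."""
    return {
        "accesses": len(records),                             # number of accesses
        "by_hour": Counter(ts.hour for _, ts, _ in records),  # time intervals of visits
        "by_host": Counter(host for host, _, _ in records),   # domain names of users
        "by_url": Counter(url for _, _, url in records),      # requested URLs
    }


report = basic_report(requests)
print(report["accesses"], dict(report["by_hour"]))
```

Each count is computed independently of the others, which is exactly the limitation noted above: nothing here relates one accessed file to another.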
More sophisticated techniques for discovery and analysis of patterns are now emerging. These tools fall into two main categories:
Pattern Discovery Tools
Pattern Analysis Tools

Web servers, Web proxies, and client applications can quite easily capture Web usage data.
Web server log: records every visit to a page, including which files were requested and when, the IP address of the request, the error code, the number of bytes sent to the user, and the type of browser used.
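
As a rough sketch of how such a log entry can be read programmatically, the snippet below parses one line in the widely used Combined Log Format. The format choice is an assumption (servers can be configured differently), and the sample line, including its IP address, timestamp, and sizes, is made up for illustration.

```python
import re

# Combined Log Format (assumed):
# host ident authuser [time] "request" status bytes "referrer" "user agent"
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)


def parse_line(line):
    """Return a dict of log fields, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    entry = match.groupdict()
    entry["status"] = int(entry["status"])
    # A dash means that no bytes were sent.
    entry["size"] = 0 if entry["size"] == "-" else int(entry["size"])
    return entry


# Hypothetical sample line, not taken from a real log.
sample = (
    '192.0.2.10 - - [01/Nov/2005:09:15:02 -0500] "GET /jobs/ HTTP/1.1" 200 5120 '
    '"http://www.example.com/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"'
)
entry = parse_line(sample)
print(entry["host"], entry["request"], entry["status"], entry["size"])
```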
By analyzing the Web usage data, web mining systems can discover useful knowledge about a system’s usage characteristics and the users’ interests which has various applications:
Personalization and Collaboration in Web-based systems
Web site design and evaluation
Decision support

Web usage mining has been used for various purposes:
A knowledge discovery process for mining marketing intelligence information from Web data.
Web traffic patterns also can be extracted from Web usage logs in order to improve the performance of a Web site.
Search engine transaction logs also provide valuable knowledge about user behavior on Web searching.
Such information is very useful for a better understanding of users’ Web searching and information seeking behavior and can improve the design of Web search systems.

One of the major goals of Web usage mining is to reveal interesting trends and patterns which can often provide important knowledge about the users of a system.
The Framework for Web usage mining.
Preprocessing: Data cleansing
Pattern analysis: generic machine learning and data mining techniques, such as association rule mining, classification, and clustering, often can be applied.

Many Web applications aim to provide personalized information and services to users. Web usage data provide an excellent way to learn about users’ interests.
Web usage mining on Web logs can help identify users who have accessed similar Web pages. The patterns that emerge can be very useful in collaborative Web searching and filtering.
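
One simple way to identify users who have accessed similar pages is to compare the sets of pages each user visited. The sketch below uses Jaccard similarity for this; the choice of measure, the 0.5 threshold, and the `visits` data are assumptions made for illustration, not a method named in the source.

```python
from itertools import combinations

# Hypothetical page sets per user, as they might be derived from a Web log.
visits = {
    "user1": {"/jobs/", "/news/", "/courses/"},
    "user2": {"/jobs/", "/news/"},
    "user3": {"/software/"},
}


def jaccard(a, b):
    """Share of pages visited by both users among pages visited by either."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def similar_users(page_sets, threshold=0.5):
    """Return (user, user, score) for every pair at or above the threshold."""
    pairs = []
    for u, v in combinations(sorted(page_sets), 2):
        score = jaccard(page_sets[u], page_sets[v])
        if score >= threshold:
            pairs.append((u, v, score))
    return pairs


pairs = similar_users(visits)
print(pairs)
```

Pairs found this way could then feed a recommender: pages seen by one user in a pair but not the other are natural candidates to suggest.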
Amazon.com uses collaborative filtering to recommend books to potential customers based on the preferences of other customers having similar interests or purchasing histories.
Huang et al. (2002) used a Hopfield net to model user interests and product profiles in an online bookstore in Taiwan.

Web Server Log
[Figure: a user requests http://www.kdnuggets.com/jobs/ from the KDnuggets.com server]

Web Server Log – A Sample
126.96.36.199
"GET /jobs/ HTTP/1.1"
"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)"

Web log fields
IP address - can be converted to host name, such as xyz.example.com
The name of the remote user (usually omitted and replaced by a dash “-”)
Login of the remote user (also usually omitted and replaced by a dash “-”)
Request, Status code, Object size, Referrer, User agent

Web Usage Mining - Basic
Totals for each component
Hits – total number of requests
Files – number of GETs
Pages – number of HTML pages
Sites – unique IP addresses
Kbytes – total Kbytes transferred
User Agents

Web Log Analysis Programs
Free
Analog, awstats, webalizer
WebTrends, WebSideStory, …

Example: KDnuggets.com Nov 2005 totals
Monthly Statistics (from webalizer)
Q: What is the difference between Hits and Files?
Answer: the difference between Hits and Files is the number of requests with a status code other than 200.
Q: What is the meaning of the difference between Files and Pages?
In the November 2005 KDnuggets log, HTML files were about 1/3 of all requests.
However, this data does not separate bot requests (which are heavily weighted towards HTML pages).

2. Web Content Mining
The process of discovering useful information from the content of a web page.
The type of the web content may consist of
Web content mining sometimes is called web text mining, because the text content is the most widely researched area.
The technologies that are normally used in web content mining are
Natural Language Processing (NLP)
Information Retrieval (IR)

Text Mining
The process of deriving high quality information from text.
Text mining is an interdisciplinary field which draws on information retrieval, data mining, machine learning, statistics, and computational linguistics. As most information (over 80%) is currently stored as text, text mining is believed to have a high commercial potential value.
High quality information is typically derived through the devising of patterns and trends through means such as statistical pattern learning.

Text mining usually involves the process of:
Structuring the input text by
Addition of some derived linguistic features and the removal of others
Subsequent insertion into a database
Deriving patterns within the structured data
Evaluation and interpretation of the output.
'High quality' in text mining refers to some combination of:
Interestingness

Typical text mining tasks include:
Entity relation modeling (i.e., learning relations between named entities).

3. Web Structure Mining
The process of using graph theory to analyze the node and connection structure of a web site. Web structure mining can be divided into two kinds:
Extract patterns from hyperlinks in the web. A hyperlink is a structural component that connects the web page to a different location.
Mining the document structure: using the tree-like structure to analyze and describe the HTML or XML tags within the web page.

Web structure mining has been largely influenced by research in
Social network analysis
Citation analysis (bibliometrics).
in-links: the hyperlinks pointing to a page
out-links: the hyperlinks found in a page.
Usually, the larger the number of in-links, the better a page is considered to be.
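
Counting in-links and out-links is straightforward once the hyperlinks are available as (source, target) pairs. The following sketch assumes such an edge list; the page names and the `link_counts` helper are hypothetical.

```python
from collections import defaultdict

# Hypothetical hyperlinks as (source page, target page) pairs.
links = [
    ("a.html", "b.html"),
    ("a.html", "c.html"),
    ("b.html", "c.html"),
    ("d.html", "c.html"),
]


def link_counts(edges):
    """Build in-link and out-link sets for every page in the edge list."""
    in_links = defaultdict(set)
    out_links = defaultdict(set)
    for src, dst in edges:
        out_links[src].add(dst)  # hyperlinks found in src
        in_links[dst].add(src)   # hyperlinks pointing to dst
    return in_links, out_links


in_links, out_links = link_counts(links)

# Rank pages by number of in-links, most linked-to first.
ranked = sorted(in_links, key=lambda page: len(in_links[page]), reverse=True)
print(ranked)
```

A raw in-link count treats every referring page as equally important, which is a simplification.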
By analyzing the pages containing a URL, we can also obtain
Anchor text: how other Web page authors annotate a page; it can be useful in predicting the content of the target page.

PageRank is computed by weighting each in-link to a page proportionally to the quality of the page containing the in-link.
The qualities of these referring pages are also determined by PageRank. Thus, the PageRank of a page p is calculated recursively from the PageRank of the pages that link to it.

Ads vs. search results
Search advertising is the revenue model
Advertisers pay for clicks on their ads
How to pick the top 10 results for a search from 2,230,000 matching pages?
What ads to show for a search?
If I’m an advertiser, which search terms should I bid on, and how much should I bid?

Web Mining vs. Information Access
Text data mining involves extracting “nuggets” and/or overall patterns from a collection of textual information, independent of a user's information need.
Information access is the process of helping users find, create, use, re-use, and understand information to satisfy an information need.
In other words, data mining is opportunistic, whereas information access is goal-driven.

Search Engine Components
Spider (crawler/robot) – builds corpus
Collects web pages recursively
For each known URL, fetch the page, parse it, and extract new URLs
Additional pages from direct submissions & other sources
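
The fetch, parse, and extract loop described above can be sketched as follows. To keep the example self-contained it crawls a tiny in-memory "Web" instead of issuing HTTP requests: `PAGES` and `fetch` are hypothetical stand-ins, and a queue is used in place of literal recursion so the traversal is breadth-first.

```python
from collections import deque
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collect the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


# Hypothetical in-memory pages standing in for real Web documents.
PAGES = {
    "/": '<a href="/a">A</a> <a href="/b">B</a>',
    "/a": '<a href="/b">B</a> <a href="/c">C</a>',
    "/b": '<a href="/">home</a>',
    "/c": "no links here",
}


def fetch(url):
    """Stand-in for an HTTP GET; returns None for unknown URLs."""
    return PAGES.get(url)


def crawl(seed, max_pages=100):
    """For each known URL: fetch the page, parse it, and queue new URLs."""
    corpus = {}
    frontier = deque([seed])
    seen = {seed}
    while frontier and len(corpus) < max_pages:
        url = frontier.popleft()
        html = fetch(url)
        if html is None:
            continue
        corpus[url] = html
        parser = LinkExtractor()
        parser.feed(html)
        for link in parser.links:
            if link not in seen:  # avoid fetching the same page twice
                seen.add(link)
                frontier.append(link)
    return corpus


corpus = crawl("/")
print(list(corpus))
```

A real spider would also need URL normalization, politeness delays, and robots.txt handling, all of which are left out here.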
The indexer – creates inverted indexes
Various policies wrt which words are indexed, capitalization, support for Unicode, stemming, support for phrases, etc.
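
A minimal sketch of an inverted index is shown below, under a deliberately simple indexing policy: lowercase everything, split on non-alphanumeric characters, and apply no stemming or phrase support. The `docs` collection and the helper names are hypothetical.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Indexing policy assumed here: lowercase alphanumeric tokens only."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return the ids of documents containing every query term (Boolean AND)."""
    terms = tokenize(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


# Hypothetical document collection.
docs = {
    1: "Web mining discovers patterns",
    2: "Text mining of Web pages",
    3: "Data warehouses",
}
index = build_index(docs)
print(search(index, "web mining"))
```

Because the same `tokenize` function is applied to documents and queries, capitalization is handled consistently on both sides.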
Query processor – serves query results
Front end – query reformulation, word stemming, capitalization, optimization of Booleans, etc.
Back end – finds matching documents and ranks them

Application Areas of Web Mining
E-commerce
Website Design

E-tailers
The ability to find new cross-sell opportunities, enable comprehensive prospect profiling, and improve customer satisfaction.
B2B and B2C Ventures

Advertising-Based Sites
When revenue is advertising-based, blindly serving ads to visitors will not result in a high click-through rate. Instead, ads must be intelligently targeted, providing visitors with products and services that they are interested in.
Advertising Providers

Information Repositories
Information overload is a problem that grows larger every day. Indexing, summarization, and other metadata tasks are time consuming. Semantic text analyzers can automate these tasks and create user navigation systems on the fly.
Technical Support Sites
Content Providers

Security applications
One of the largest text mining applications that exists is probably the classified ECHELON surveillance system.
Software and Applications
Research and development departments of major companies, including IBM and Microsoft, are researching text mining techniques and developing programs to further automate the mining and analysis processes.

Academic applications
The issue of text mining is of importance to publishers who hold large databases of information requiring indexing for retrieval.

Conclusion
Major limitations of Web mining research:
Lack of suitable test collections that can be reused by researchers.
Difficult to collect Web usage data across different Web sites.
Future research directions:
Multimedia data mining: a picture is worth a thousand words.
Multilingual knowledge extraction: Web page translations
Wireless Web: WML and HDML.
The Hidden Web: forms, dynamically generated Web pages.
Semantic Web

* This presentation is reproduced from the attached articles.