logging in or signing up web communities ozturk Download Post to : URL : Related Presentations : Share Add to Flag Embed Email Send to Blogs and Networks Add to Channel Uploaded from authorPOINTLite Insert YouTube videos in PowerPont slides with aS Desktop Copy embed code: (To copy code, click on the text box) Embed: URL: Thumbnail: WordPress Embed Customize Embed The presentation is successfully added In Your Favorites. Views: 258 Category: Entertainment License: All Rights Reserved Like it (0) Dislike it (0) Added: November 21, 2007 This Presentation is Public Favorites: 0 Presentation Description No description available. Comments Posting comment... Premium member Presentation Transcript Web CommunitiesAnalysis and Construction: Web Communities Analysis and Construction Yanchun Zhang · Jeffrey Xu Yu · Jingyu HouOutline: Introduction: Problems & Challenges Theoretical Background Web Search: Ranking Web Pages Web Community Construction Mining Web: Content, structure, usage and Web recommendation Web Community and Web Usage Mining Web Community Finding and Analysis A Web Recommendation Technique based on PLSA Future Work OutlineSlide3: 1.IntroductionBackground: Background WWW has become an important and popular application platform for disseminating and searching for information as well as conducting business Due to the lack of uniform schema for Web documents, the low precision of most search engines and the information explosion on the World Wide Web, the user is often flooded with a huge amount of information Web features: Web features The amount of data on the Web is enormous. The data on the Web is distributed The data on the Web is heterogeneous. The data on the Web is unstructured. The data on the Web is dynamic. The data on the Web is hyperlinked. Problems & Challenges: Problems & Challenges Web data is neither raw data nor very strictly typed as in conventional database systems. Web information retrieval and Web data management are becoming a challenging problem. Approach: Approach The Web community a set of Web-based objects (documents and users) has its own logical structures another effective and efficient approach to reorganize Web-based objects, support information retrieval and implement various applications Slide8: 2. Theoretical BackgroundMatrix Expression of Hyperlinks : Matrix Expression of Hyperlinks Matrix model has been widely used in many areas to model various actual situations, such as documents and keywords in IR… In context of Web Community, hyperlink matrix is adjacency matrix: A = [aij]mn if there is a hyperlink from page i to page j (i j), then the value of aij is 1, otherwise is 0. Matrix Expression of Hyperlinks: Matrix Expression of Hyperlinks In hyperlink analysis, this matrix expression can also be extended to represent semantics of hyperlinks. In this case, the values of the matrix elements are not required to be either 1 or 0, depends on the particular situations For example, the correlations between pages can be expressed in a matrix, where the value of a matrix element aij, which is between 0 and 1… Basic Requirements for Matrix Model: Basic Requirements for Matrix Model A data (information) space is constructed Two sets of information entities (objects), denoted as E1 and E2, within the constructed data space are identified For example, E1 could be a set of documents; E2 could be a set of keywords Original correlation expression between entities is defined and modeled into a matrix Mathematical Operations on Matrix : Mathematical Operations on Matrix Each Observation (e.g. page) is considered as a row or column of the matrix (vector model) Eigenvalues and eigenvectors computing, singular value decomposition Web page clustering through matrix permutation and partitioning Eigenvalue and Eigenvector of Matrix : Eigenvalue and Eigenvector of Matrix Two commonly used concepts in matrix model application An eigenvalue of A is a number with the property that for some vector v, we have Av = v. Such a vector v is called an eigenvector of A associated with the eigenvalue . Eigenvalue and Eigenvector: Eigenvalue and Eigenvector The set of all eigenvectors associated with a given eigenvalue forms a subspace of Rn, and the dimension of this space will be referred to as the multiplicity of . If A is a symmetric matrix, then A has at most n distinct eigenvalues, each of them is a real number and the sum of their multiplicities is exactly n. We denote these eigenvalues of matrix A by 1(A), 2(A), …, n(A), listing each a number of times equal to it multiplicity. Eigenvalue and Eigenvector: Eigenvalue and Eigenvector A matrix M is orthogonal if MTM = I If A is a symmetric nn matrix, is the diagonal matrix with diagonal entries 1(A), 2(A), …, n(A), and M is the matrix with column equal to v1(A), v2(A), …, vn(A). Then it is easy to verify that M is an orthogonal matrix and we have MMT = A the eigenvalues and eigenvetors provide a useful “normal form” representation for symmetric square matrix in terms of orthogonal and diagonal matrices Norms and Lipschitz Continuous Function : Norms and Lipschitz Continuous Function A matrix norm, denoted by ||||, is a measurement of a matrix. ||A-B|| can be understood as the distance between these two matrices A and B. A matrix norm ||A|| is a nonnegative number associated with A meeting the following properties ||A|| > 0 when A 0 and ||A|| = 0 iif A = 0, ||kA|| = |k| ||A|| for any scalar k, ||A+B|| ||A|| + ||B|| ||AB|| ||A|| ||B|| Norms: Norms Frobenius norm of a matrix A ||A||F = 1-norm of a matrix A |A||1= -norm of a matrix A ||A|| = 2-norm of a matrix A ||A||2 = (maximum eigenvalue of ATA)1/2 Lipschitz continuous function: Lipschitz continuous function Lipschitz continuous function is a function f (x) that for all x, y, we have |f (x) –f (y)| L |x-y|. L is called Lipschitz constant This is certainly satisfied if f has a first derivative bounded by L. Singular Value Decomposition : Singular Value Decomposition let A =[aij]m×n be a real matrix, rank(A) = r. Then there exist orthogonal matrices Um×m, Vn×n and such that where UTU=Im, VTV=In, ∑1=diag(σ1,…,σm), σi≥σi+1>0, 1ir-1, σj=0 for j≥r+1 Example of SVD: Example of SVD Let the SVD of A is where Basis of SVD: Basis of SVD Used to extract certain important properties relating to the structure of a matrix, such as the number of independent columns or rows, eigenvalues, approximation matrix and so on. Since the singular values of A are in non-increasing order, it is possible to choose a proper parameter k such that the last r-k singular values are much smaller than the first k singular values, and these k singular values dominate the decomposition. Theorem: Theorem Let the SVD of A be given and U = [u1 , u2 , … , um], V = [v1 , v2 , … , vn] with 0<r=rank(A)min(m,n), where ui, 1im is an m-vector, vj, 1 j n is an n-vector and Let and define Then rank(Ak) = k Advantage of SVD: Advantage of SVD Ak captures the main structure information of A and minor factors in A are filtered. Since kr and only partial matrix elements are involved in constructing Ak, the computation cost of an algorithm based on Ak could be reduced. Construction of Approximation : Construction of Approximation Similarity in Vector Space Model : Similarity in Vector Space Model We discuss the similarity here within a framework of vector space model, i.e, the concerned objects such as documents and web pages are represented as vectors. These vectors could be rows / columns of a matrix, or just individual vectors. The vector representation naturally suggests numerical similarity metrics for objects based on Euclidean distance or the vector inner product Cosine Similarity: Cosine Similarity The representative metric is the cosine measure where Graph Theory Basis : Graph Theory Basis A graph G = (V, E) consists of a set of nodes V and a set of edges E. Each element e E represents a connection between an unordered node pair (u, v) in V A graph is connected if the node set cannot be partitioned into components such that there are no edges whose connected nodes occur in different components Graph Theory Basis (cont.): Graph Theory Basis (cont.) A bipartite graph G = (V1, V2, E) consists of two disjoint sets of nodes V1, V2 such that each edge e E has one node from V1 and the other node from V2. A bipartite graph is complete if each node in V1 is connected to every node in V2. A matching is a subset of edges such that for each edge in the matching, there is no other edge that shares a node with it. A maximum matching is a matching of largest cardinality Graph Theory Basis (cont.): Graph Theory Basis (cont.) A weighted graph is a graph with a (non-negative) weight we for every edge e. A directed graph is a graph with an edge being an ordered pair of nodes (u, v), representing a connection from u to v. A directed acyclic graph (DAG) is a directed graph with no directed cycles. In a DAG, a sink node is one with no directed path to any other node. Markov Model : Markov Model A (homogeneous) Markov chain for a system is specified by a set of states S = {1, 2, …, n} and an n×n non-negative, stochastic matrix M. The system begins in some start state in S and at each step moves from one state to another state. This transition is guided by M: at each step, if the system is in state i, it moves to state j with probability Mij. Markov Model (cont.): Markov Model (cont.) If the current state is given as a probability distribution, the probability distribution of the next state is given by the product of the vector representing the current state distribution and M. In practical use, a random walk model based on a graph can also be represented as a Markov chain model under some assumptions. For example, a web surfer surfs the web along hyperlinks between pages. 3. Web Search/Search Engine: 3. Web Search/Search EngineStatistics: Statistics In 1994, one of the first web search engines, the World Wide Web Worm (WWWW), had an index of 110,000 web pages and web accessible documents. Up to 09/2005 Google indexes 8,200,000,000 web pages. Search Engines : Search Engines A search engine is a system which collects, organizes & presents a way to select Web documents based on certain words, phrases, or patterns within documents Model the Web as a full-text DB Index a portion of the Web docs Search Web documents using user-specified words/patterns in a text Search Engines: Search Engines Two categories of search engine general-purpose search engine, e.g. Yahoo!, AltaVista and Google special-purpose search engines (or Internet Portals), e.g. LinuxStart (www.linuxstart.com)Search Engines: Search Engines Two main components of a search engine: web crawler (spider), which collects massive Web pages. large database, which stores and indexes collected Web pages. Ranking has to be performed without accessing the text, just the index Ranking algorithms: all information is “top secret;” it is almost impossible to measure recall as the number of relevant pages can be quite large for simple queries Search Engine: Search Engine Information Retrieval (IR) is a key to search engine or Web Search. Most commonly – used models: Boolean Model Vector Space Model (VSM) Probability Model their variations Search Engine: Search Engine The Google is a representative general-purpose search engine (successful, takes many techniques such as use of both web page link structure and text content, which fully takes advantage of the web page features). The PageRank in Google is defined as follow: Assume page A has pages P1...Pn which point to it. The parameter d is a damping factor which can be set between 0 and 1. Also C(Pi) is defined as the number of links going out of page Pi. The PageRank of a page A is given as follows: PR(A) = (1-d) + d (PR(P1)/C(P1) + ... + PR(Pn)/C(Pn)) Usually the parameter d is set to 0.85. PageRank or PR(A) can be calculated using a simple iterative algorithm. Other features: anchor text processing, location information management and various data structures, which fully make use of the features of the web. Ranking result pages: Ranking result pages Based on content Based on content Number of occurrences of the search terms Based on link structure Backlink count PageRank And more. (http://www.cs.duke.edu/~junyang/courses/cps296.1-2002-spring/)Problems with content-based ranking: Problems with content-based ranking Many pages containing search terms may be of poor quality or irrelevant Example: a page with just a line “search engine”. Many high-quality or relevant pages do not even contain the search terms Example: Google homepage Page containing more occurrences of the search terms are ranked higher; spamming is easy Example: a page with line “search engine” Repeated many timesBased on link structure: Based on link structure Hyperlinks among web pages provide new web search opportunities. Our focus in this lecture PageRank HITSBacklink: Backlink A backlink of a page p is a link that points to p A page with more backlinks is ranked higher Intuition: Each backlink is a “vote” for the page’s importance Based on local link structure; still easy to spam Create lots of pages that point to a particular page PageRank: PageRank Page et al., "The PageRank Citation Ranking: Brining Order to the Web.", 1998 Main idea: Pages pointed by high-ranking pages are ranked higher Definition is recursive by design Based on global link structure; hard to spamPageRank: PageRank Web can be viewed as a huge directed graph G(V, E) where V is the set of web pages (vertices) and E is the set of hyperlinks (directed edges). Each page may have a number of outgoing edges (forward links) and a number of incoming links (backlinks). Each backlink of a page represents a citation to the page. PageRank is a measure of global web page importance based on the backlinks of web pages.Basic ideas: Basic ideas A page is likely to be important if it is linked by many pages. A page is likely to be important If it is linked to by important pages even though it may not necessarily be linked by a large number of pages. The importance of a page is divided evenly and propagated to the pages pointed to by it. PageRank Definition: PageRank Definition N(p): number of outgoing links from page p B(p): set of pages that point to p PageRank(p) = Σq∈B(p) (PageRank(q) / N(q)) Intuition Each page q evenly distributes its importance to all pages that q points to Each page p gets a boost of its importance from each page that points to pComputing PageRank : Computing PageRank Create a stochastic matrix M for the link structure Each page i corresponds to row i and column i If page j has n outgoing links, then the M(i, j) = 1 Ú n if page j Solve by setting all PageRank.s to 1 initially, and thenapplying M repeatedly until the values converge Convergence: Convergence If the ranks converge, i.e., there is a rank vector R such that R = M R, R is the eigenvector of matrix M with eigenvalue being 1. Convergence is guaranteed only if M is aperiodic (the Web graph is not a big cycle). This is practically guaranteed for Web. M is irreducible (the Web graph is strongly connected). This is usually not true. Example: Example The sum of all PageRank’s is the total number of pagesRandom surfer model: Random surfer model A random surfer Starts with a random page Randomly selects a link on the page to visit next Never uses the .back. button PageRank(p) measures the probability that a random surfer visits page pRank leak : Rank leak Pages with no outgoing links Called a rank leak because eventually all importance will leak. out of the WebRank sink: Rank sink A page or a group of pages is a rank sink if they can receive rank propagation from its parents but cannot propagate rank to other pages. Causes the loss of total ranks. Eventually accumulate all importance of the Web An example of rank sink: An example of rank sink p1 p2 p3Solution: Solution d: damping factor PageRank(p) =d · Σq∈B(p) (PageRank(q) /N(q)) + (1 -d ) Intuition in the random surfer model A surfer occasionally gets bored and jump to a random page on the Web instead of following a random link Another intuition Make the graph fully connected An example: An example p1 p2 p3 the sum of all PageRank’s is still the total number of pages. Slide56: 4. Web Community Construction Problem with Web Searching: Problem with Web Searching Current Web search tools such as Yahoo!, AltaVista, Google, … return more information than those required Example: search “Java Programming”, Google returns 1,330,000 (more than 1 million) Web pages, AltaVista returns 16,921,862 (more than 16 million) Web pages. Users only concern about a small and interesting portion of the returned search results. Necessary to build web communities. Related research work: Constructing good Web communities. Identifying related Web pages. Clustering Web pages. Web Community: Web Community A web community is defined as a set of web pages that link to more web pages within the set than to web pages outside the community (Flake/Lawrence/Giles 2000) Hyperlink analysis if a web page A has a hyperlink to page B, then the author of page A usually considers that page B contains valuable information that is related to page A; hyperlinks convey a considerable amount of latent human judgment; Some recent works on Web Communities: Some recent works on Web Communities Noise page elimination Related page finding Web page clustering Authority and Hub Pages : Authority and Hub Pages A page is a good authoritative page with respect to a given query if it is referenced (i.e., pointed to) by many (good hub) pages that are related to the query. A page is a good hub page with respect to a given query if it points to many good authoritative pages with respect to the query. Good authoritative pages (authorities) and good hub pages (hubs) reinforce each other. Constructing Web Community: Given a query/topic, a web community consists of two distinct, but interrelated, types of pages: authority pages (authorities) and hub pages (hubs). Constructing Web Community Hubs Authorities Figure 2. A skeletal web communitySlide62: Construct Good Quality Web Page Communities World Wide Web Concerned Info. Space Web Page Community Hubs Authorities Search Engines ? ?Slide63: Kleinberg’s HITS Algorithm (Hypertext Induced Topic Search) Concerned Info. Space Calculate Authority/Hub values Web Page Community R : root set of pages B : base set of pages R construction: collecting r highest-ranked pages from the search results. B construction: adding to R more pages that are pointed by or pointing to the pages in R.HITS Algorithm: HITS Algorithm Hub and authority value computation "p q" denotes "page p has a hyperlink to page q". Normalize the a and h vectors after each iteration. Improved HITS algorithm (Bharat and Henzinger, 1998) , Slide65: Problems with various HITS algorithms Concerned Info. Space Community Construction Algorithms Web Page Community R : root set of pages B : base set of pages R construction: collecting r highest-ranked pages from the search results. B construction: adding to R more pages that are pointed by or pointing to the pages in R. Topic Drift Problem Noise pagesExample: Example Noise page --- contains no query terms Experiment Query Term: Harvard Search Engine: AltaVista Size of Base Set: 8064 Size of Root Set: 200 Slide67: Ten arbitrarily selected noise pagesSlide68: Ten arbitrarily selected topic-related pagesSlide69: Authorities & Hubs (by HITS algorithm)Slide70: Noise Page Elimination Concerned Info. Space Community Construction Algorithms Good Quality Web Page Community Noise Page Elimination Algorithms !Hyperlink Analysis via SVD: Hyperlink hyperlink matrix Hyperlink matrix singular value decomposition getting intrinsic relationships between the pages Hyperlink Analysis via SVD SVD of a Matrix: SVD of a Matrix Definition of SVD then where Slide73: Approximation Property is the best approximation to A with rank k ( ). Noise Page Elimination Algorithm: Noise Page Elimination Algorithm Main Procedure of the Algorithm Hyperlinks Matrix A SVD of A Approximation Matrix Ak Ak captures main structure information of A Ak filters minor factors in A For a proper k, computing cost can be reduced Slide75: Noise Page Elimination Algorithm B R Matrix A = (aij) n×n Matrix S=(sij)(m-n)×n Size(R)= n Size(B)= m î í ì = ® ® = otherwise j i i j j i s ij 0 or or 1Slide76: Matrix A SVD of A Choose parameter k such that Approximation matrix of A Rows of matrix Coordinate vector of page i in k-dim. space. page i B - R i = 1, 2, …, n. Slide77: Matrix S Each row s in S, s =(g1, g2, …, gn) in new k-space Pages in A and S are mapped into the vectors in the same k-dimensional subspace, it is possible to measure the similarity (relevance degree) between a page in S and a page in A. Yes No Keep page i in B Eliminate page i from B Choice of Parameter c: Choice of Parameter c Algorithm is called c = avgAlgo Algorithm c = maxAlgo Algorithm c = minAlgo Algorithm Slide79: Authorities & Hubs after Elimination (NPEA and HITS algorithm)Finding Related Web Pages: Given a URL or web page, find a set of pages that address the same topic of the original page. Co-citation algorithm: the sibling pages of the given URL u that have the most number of common parent pages with u are the related pages. HITS and Improved HITS algorithms: authority pages with the highest authority values are related pages. Finding Related Web PagesSlide81: Finding related pages Co-Citation Algorithm Given page (URL) u, page source S is constructed in the following way: chooses up to B (e.g. 2000) arbitrary parents of u; for each of these parents p, adds to S up to BF (e.g. 8) children of p that surround the link from p to u. The elements of S are siblings. for each s in S, co-citation degree of s and u is determined; the algorithm returns the 10 pages that have the highest co-citation degrees with u as the relevant pages.Slide82: Extended Co-Citation Algorithm Given page u, Choose up to B parent pages of u, and add up to BF children pages (different from u) of each parent page. choose up to F children pages of u, and add up to FB parent pages (diff from u) of each child page. B, F, BF and FB are used to keep the page source to a reasonable size. In practice, choose B = FB = 200, F = BF = 40. Web Page Clustering: With a set of pages, the key is to find a similarity between the pages Text content based algorithms Regard a web page as a set of words or phrases; Each page is represented as a feature vector; The similarity is measured as where wik is the k-th component of the feature vector of the i-th page, and L is the length of the feature vectors. Web Page Clustering Slide84: Hyperlink based algorithms Co-citation algorithm Linkage vector algorithm Each page P is represented as two linkage vectors: Pout and Pin ; The similarity is defined as : dot product of vectors Pout and Qout . Web Page Clustering : Page Source Construction Given a user’s query topic, page source S = R BV FV with hyperlinks. Web Page Clustering Slide86: Page Similarity Assume: size(R) = m, size(V) = n, V = BV FV Hyperlinks Adjacency Matrix C Slide87: (Out-link relationship of page i in R) (In-link relationship of page i in R) Out-link similarity of pages i and j In-link similarity of pages i and jSlide88: Similarity between any two pages i and j in R: Similarity Matrix SM = (smi,j)mm , where MODij = || rowi || + || rowj || + || coli || + || colj || Slide89: Matrix-Based Clustering Algorithms Three main steps: Similarity Matrix Permutation Purpose: put closely related pages together in the similarity matrix SM. Matrix Partition Purpose: cluster pages at one level with the partition. Hierarchical Clustering Recursively apply the above two steps to the existing clusters.Slide90: Matrix-based hierarchical clustering diagram: Similarity Matrix mm Page Source SM CL1 CL2 Matrix Partition Hierarchical Clusters Permuted Similarity Matrix mm SM’ Slide91: An example of matrix permutation and partition 1 0 0 0 .47 0 .6 0 0 0 1 0 0 0 0 .49 0 0 0 0 1 0 0 0 0 .7 .66 0 0 0 1 0 0 0 0 .42 .47 0 0 0 1 0 .9 0 0 0 0 0 0 0 1 0 .92 .45 .6 .49 0 0 .9 0 1 0 0 0 0 .7 0 0 .92 0 1 0 0 0 .66 .42 0 .45 0 0 1 P1 P2 P3 P4 P5 P6 P7 P8 P9 P1 P2 P3 P4 P5 P6 P7 P8 P9 (a) A similarity matrix (b) Permuted matrix and partition 1 .47 .6 0 0 0 0 0 0 .47 1 .9 0 0 0 0 0 0 .6 .9 1 .49 0 0 0 0 0 0 0 .49 1 0 0 0 0 0 0 0 0 0 1 .42 0 0 0 0 0 0 0 .42 1 .66 0 .45 0 0 0 0 0 .66 1 .7 0 0 0 0 0 0 0 .7 1 .92 0 0 0 0 0 .45 0 .92 1 D P1 P5 P7 P2 P4 P9 P3 P8 P6 P1 P5 P7 P2 P4 P9 P3 P8 P6 Cluster1 = { P1 , P5 , P7 , P2 } , Cluster2 = { P4 , P9 , P3 , P8 , P6 }Slide92: Examples of Some Major Clusters 5. Mining Web: Content, structure, usage and Web recommendation: 5. Mining Web: Content, structure, usage and Web recommendationWeb mining: Web mining Web mining utilize data mining methods to induce and extract useful information from web data and service. Web content mining Web structure mining Web usage mining Web Content mining: Web Content mining Web content mining tries to discover valuable information from web contents (i.e. web documents or web pages), alternatively termed as text mining sometimes Web structure mining: Web structure mining Web structure mining involves modeling web site in terms of link structures. The mutual linkage information obtained could be used to cluster the web pages or find relevant pages based on the similarity or relevance between different web pages. Web usage mining: Web usage mining Web usage mining tries to reveal the underlying access patterns from web user transactions User pattern can be utilized to recommend or personalize web contents to users. Capturing the web user access interest can help for better understanding user navigational behavior, and for improving web site structure design. Slide98: Web usage mining is utilized to make recommendation: Firstly, extract the informative knowledge from web log files and identify the underlying user functional interests; Secondly, create user profiles for representing common user navigational behavior based on observation usage data; Finally, present the desired web content in personalized style to user by matching the current active user session with the discovered patterns. Web recommendation: Web recommendation Web recommendation is a process that recommends customized web content to users according to their specific preference. Two common approaches: content-based filtering and collaborative filtering. Web recommendation: Web recommendation Content-based filtering systems, such as WebWatcher and client-side agent Letizia generally generate recommendation based on the pre-constructed user profiles by measuring the similarity of web content to these profiles Collaborative filtering systems make recommendation by utilizing the rating of current user for objects via other users’ preference.Slide101: 6. Web Community and Web Usage Mining Web Community and Web Usage Mining : Web Community and Web Usage Mining Pierrakos et al presented a way of constructing web communities using web logs archived in a proxy server The web logs contain the basic information, such as web pages accessed and IP addresses. Second, a web page can be abstracted as a feature vector where a feature indicates whether the corresponding word appears in the web page. The page categories can then be obtained using a hierarchical clustering to cluster web pages being accessed. Community Directory Miner (CDM): Community Directory Miner (CDM) The clustering result in a tree structure where a leaf node represents a cluster of web pages which share high similarity. The internal node represents a larger cluster which includes smaller clusters as represented by the child nodes of the internal node. The root node represents all web pages as a top largest cluster. A page category is a node in the tree. There exist parent/child relationships among page categories Community Directory Miner (CDM): Community Directory Miner (CDM) Basic Node Weights: Suppose that there exist n web pages in a page category. The weight of a feature i is denoted wi for the page category, and is computed as Community Directory Miner (CDM): Community Directory Miner (CDM) Basic Edge Weights: Suppose that there exist n user access sessions that access both page categories, vi and vk. The weight of the edge between vi and vk is denoted wik, and is computed as Community Directory Miner (CDM): Community Directory Miner (CDM) Accumulated Node/Edge Weights: A weight of a feature in a page category, wi, is the weight of itself plus all the weight of its child page categories. In a similar fashion, the weight of an edge between page categories, pi and pj, is accumulated by the weight of the edge between pi and pj as well as all the weights of the edges between pm and pn if pm is a child of pi and pn is a child of pj. The accumulated node/edge weights make it include more page categories as a part of web communities. Community Directory Miner (CDM): Community Directory Miner (CDM) The above shows how to construct a weighted undirected graph. The resulting weighted undirected graph may have high connectivities. Some edges of the graph are pruned using a threshold. After pruning edges, the CDM algorithm converts the weighted undirected graph into an undirected graph, Then find maximal cliques. The found web communities can be used to build a web community directory Slide108: 7. Web Community Finding and Analysis Web Community : Web Community The web communities can be viewed as containing a core of central, authoritative pages linked together by hub pages; and they exhibit a natural type of hierarchical topic generalization that can be inferred directly from the pattern of linkage. By David Gibson, Jon Kleinberg and Prabbakar Raghavan Understand the global structure of the ways in which independent users build connections to one another in a distributed fashion. Given a user-specified query, HITS (Hyperlink-Induced Topic Search) finds a root set of pages. Let the core of a community consists of top 10 authoritative and top 10 hub pages. Can HITS discover communities? Does the choice of the root set affect the communities discovered? Small World Phenomenon : Small World Phenomenon In social networks, it is known that the average number of the length of the chains of people was between 5 and 6 (Six degrees of Separation). Experiments: from Nebraska to Massachusetts (Stanley Milgran et al) Watts and Strogatz showed that small world phenomenon is pervasive in a range of networks (Nature 393, 1998). A fundamental ingredient in the evolution of the World Wide Web. Short paths are constructed based on users’ local information collectively. Is there a decentralized algorithm operating with local information only to construct short paths in networks (WWW)? Kleinberg’s Algorithmic Perspective (1) : Kleinberg’s Algorithmic Perspective (1) Model a network that exhibits the richness of local connections and a few random long-range connections over a grid with directed edges. The distance between two points is . For local connections, a parameter p (≥1) is introduced. For long-range connections, q and r are introduced. q specifies how many long-range connections a point has, and r specifies how they are linked using independent random trials. The i-th directed edge from p1 to p2 is with probability proportional to for r ≥ 0. The reverse r-th power distribution. When r = 0 , uniform distribution. If r increases, the long-range contacts of a node become more clustered. Kleinberg’s Algorithmic Perspective (2) : Kleinberg’s Algorithmic Perspective (2) The decentralized algorithm follows a simple strategy, to transmit a message from s to t in as few steps as possible. Every message-holder chooses a contact that is as close to the target as possible in terms of the distance measurement. The message-holder does not know anything about how the messages have been delivered so far. Theorem 2: There is a decentralized algorithm and a constant α, independent of n, so that when r = 2 and p=q=1, the expected delivery time of the algorithm is at most Two Types of Web Communities : Two Types of Web Communities Explicitly-defined communities Yahoo and AOL. Implicitly-defined communities The number of such web communities is large. They may appear as emerging web communities, and may disappear. Web Community: A Complete Directed Bipartite Graph : Web Community: A Complete Directed Bipartite Graph What does a web community look like in WWW? If we consider a web community as a subgraph in a web graph, how do we define such a subgraph? Define a web community as a complete directed bipartite graph, G(V, E) where and . Vf is a set of fans (hub). Vc is a set of centers (authority). An example of a web community Trawling the Web : Trawling the Web As the first attempt, Kumar et al conducted a two week experimental studies using a 18 month and 200 million web-page web archive in 1999. Generating fans and centers Identify the potential fans as specialized hubs using HITS. With human interactions, they determined that a potential fan has links to at least 6 different web sites, and maintained the number of potential centers as roughly 30 times of the number of potential fans. Some pruning strategies are used. The web communities do exist. tested , for f is taken from {3,4,5,6} and c is taken from {3,5,7,9}. , , and . Modeling the Web (1) : Modeling the Web (1) Kumar et al also attempted to graph model for web. A class of graph models are specified by four stochastic processes at a time step t. Node-creation and node-deletion are with probability and Edge-creation is with a probability β. Consider a node, v, to be randomly selected to add edges. With probability β, it adds k edges from v to other nodes independently and uniformly selected. With probability 1- β, it chooses another node u randomly, and copies k out-going edges of u to v. Edge deletion is modeled as to randomly delete a node with a probability . Modeling the Web (2): Modeling the Web (2) Both in-degree distribution of nodes and out-degree distribution of nodes are of Zipfian distribution, when , , and Let pi,t and qi,t be the fraction of nodes at time t with in-degree i and out-degree i, then Here, α and β are two probabilities for creating a new edge (u, v). When α=0.52 and β=0.58, the probability of a node that has in-degree i is i-2.1 and probability of a node that has out-degree i is i-2.38. They remarked that both match the reality of web. Web Community: A Dense Directed Bipartite Graph : Web Community: A Dense Directed Bipartite Graph Are the condition of Complete Bipartite Graph too strong? Do we miss any interesting web communities? A dense directed bipartite graph, GD(V, E) , is a directed bipartite graph, denoted A node in Vf must link to at least nodes in Vc. At least nodes in Vf link to every node in Vc. An example A web community hierarchy can be determined on top of dense directed bipartite graphs where centers at level i become fans at level i+1. Finding Dense Directed Bipartite Graphs : Finding Dense Directed Bipartite Graphs The cocited relationship can help to find complete bipartite graphs. But is too strong for dense directed bipartite graphs. The node vi is Cocited with V in a directed graph G, if , where . The main difference in finding dense directed bipartite graphs is to use Cocited instead of cocited. Web Community in Any Shapes : Web Community in Any Shapes Both Complete/Dense Directed Bipartite Graphs need to specify two numbers, f (fans) and c (centers). Is it possible to find web communities of arbitrary shapes or diameters? Flake et al define a web community as a collection of web pages such that each member page has more hyperlinks within the community than outside of the community. Network Flow : Network Flow Given a flow network . A flow is a function, denoted as , for every edge satisfying two conditions. . For all nodes , . The s-t maximum flow problem over is to find the maximum value of . The s-t minimum cut over N is to find the minimum value of a cut set, . The above two problems are identical over N. Ford-Fulkerson Method Web Community Based on Network Flow : Web Community Based on Network Flow A web community is a set of nodes over graph such as every node v in Vc has many edges connecting to the other nodes in Vc than the nodes in V- Vc. Let #(s) be the number of edges from s to any other nodes in Vc \ {s}. Let #(t) be the number of edges from t to any other nodes in V \ Vc \ {s}. Condition: and . How to select s and t? Selecting s and t: Selecting s and t Consider s and t as additional virtual nodes. Given a set of seed nodes (pages) as in the web community, S . The s connects to all nodes in S, with infinite capacity. It makes possible that all S will be selected in the web community. Intuitively, it is difficult to determine t. Flake et al formalize the problem of choosing t as whether the cut set remains the same given two different terminals, the best, t, and the heuristic Selecting s and t: Selecting s and t The best t is the graph center. The virtual terminal, , is connected with all nodes in V, except the seeds, S and itself, with the minimum capacity 1. Given two flow networks, and . Let C1, and C2 be the minimum cut sets for and . If holds, then such as c’ only includes the edges for Two Algorithms: Two Algorithms Exact-Flow-Community algorithm, which takes three inputs, a set of seed nodes, S, and the entire web graph, G, and edge capacity k (integer). Approximate-Flow-Community algorithm, which takes three inputs, a set of seed nodes, S, and two integers, d and m. Approximate-Flow-Community does not have the entire web as an input into the algorithm. Exact-Flow-Community algorithm : Exact-Flow-Community algorithm Approximate-Flow-Community algorithm : Approximate-Flow-Community algorithm How to determine the edge capacities? : How to determine the edge capacities? The edge capacities are important in the work of Flake et al. Flake et al did not show how to assign edge capacities. Imafuji and Kitsuregawa show the impacts of the edge capacity on the quality of the web community being found by Flake et al. Increasing edge capacity can increase the size of web community. It is difficult to determine proper edge capacity. When the edge capacity is less than the quantum jump point, the web community, based on a set of seed nodes, is too small. When the edge capacity is greater than or equal to the quantum jump, the whole graph will be the web community The noises in the web community cannot be easily removed by controlling the edge capacity. Estimation of the final quantum jump q: Estimation of the final quantum jump q Let all the edge capacities be q-1 Consider a node that has the largest degree, Then Edge Capacity Assignment : Edge Capacity Assignment The edge capacity of an edge is set as avl and hvk are the authority score and hub score. rh =max(avk)/max(hvl). ra =q. Web Community Charts : Web Community Charts A web community is a set of pages densely connected by symmetric derivation relationships. Two communities are related if there is an asymmetric derivation relationship between members of them. A web community chart is a graph where is a set of web communities and is a set of edges from a web community to another web community. From Web Community Chart to Web Community Evolution : From Web Community Chart to Web Community Evolution Let a web community chart, G(V,E), and a web community, v∈G, at time t, be denoted as G(t) and v(t) ∈G(t), respectively. Consider two web community charts, G(tk) and G(tl) where tk <tl, and tk and tl are close enough to detect web community changes. There are five cases, at time tl, in comparison to the web communities at time tk (<tl). emerged web communities, dissolved web communities, grown or shrunk communities, split communities, and merged communities The key issue is to how to determine a web community, v(tk), that corresponds to a web community v(tl). From Web Community Chart to Web Community Evolution : From Web Community Chart to Web Community Evolution When can a web community be unique? : When can a web community be unique? Given a subgraph Gh in a graph G, a web community of Gh is the largest subgraph of the maximal edge-connectivity containing Gh, denoted Gc (⊆ G), such as Gh ⊆ Gc, conn(Gc) ≥ conn(Gd) if Gh ⊆ Gd ⊆ G, and Gd ⊆ Gc if Gh ⊆ Gd ⊆ G and conn(Gd)=conn(Gc). The conn(G) is the weight of a minimal cut of G. Such a web community is unique Discovering Web Communities Using Co-occurrence : Discovering Web Communities Using Co-occurrence Identify a web community , and enlarge Let be the set of all web pages that co-refer to all the pages in ; While do let be a set of web pages linked by all web pages in ; sort based on the co-occurrence; select the top-ranked web page in , and add it into ; be the set of all web pages that co-refer to all the pages in ; end while return ; Do newly added centers really contain the similar contents like those in ? Generate a set of initial inputs, where . In total pairs. Enlarge each individually. Select the top-ranked ones. Can we find the boundary between centers and non-centers? Web Community and Formal Concept Analysis (1) : Web Community and Formal Concept Analysis (1) A context is a triple (G,M,I) where Gis the domain of objects, and M is the domain of attributes, Iis a binary relation such as I⊆ G× M. A concept in the context is a pair (CG,CM), where CG (⊆ G) is a set of objects that have all the attributes in CM (⊆ M). The ordered set of all concepts of the context (G,M,I) forms a concept lattice. Rome and Haralick in considered a context (G,M,I)as a web graph. Here, both Gand M are web pages, and I is a set links that connect web pages. Rome and Haralick considered to merge small web communities as a large web community by using the existing techniques to coalesce concepts. Web Community and Formal Concept Analysis (2) : Web Community and Formal Concept Analysis (2) Web Community and Formal Concept Analysis (3) : Web Community and Formal Concept Analysis (3) A large web communities is, Vf = {1, 3, 4, 5, 7} and Vc = {2, 6, 9}, by adding (4, 2) and (5, 9) edges. Finding Geographical Scopes of Web Resources : Finding Geographical Scopes of Web Resources The geographical scope of a web resource is the geographical area that the creator of intends to reach. Let be a hierarchy for geographical scopes. Two conditions, for a web resource . The fraction states that there are significant web pages in a geographical scope containing links to . The uniformity states that the links to are distributed smoothly across the geographical scope . Discovering Unexpected Information from Competitors : Discovering Unexpected Information from Competitors Given two web sites, Wb and Wc, where Wc is the web site for the competitor, and Wb is the web site, as the basis, to find the unexpected information from Wc. The Unexpected information in Wc is the information which is relevant to Wb but is unknown. Several methods were proposed to compare Wb with Wc in. 8. A Web Recommendation Technique based on Probabilistic Latent Semantic Analysis : 8. A Web Recommendation Technique based on Probabilistic Latent Semantic Analysis 1. Introduction: 1. Introduction Internet has become a huge data repository for retrieving information, discovering knowledge Web users usually suffer from the information overload problem One approach addressed to the information overload is the recommendation system 1. Introduction (cont.): 1. Introduction (cont.) Web recommender system focuses on identifying Web users or objects, collecting users’ preference information adapting its service to satisfy the users’ needs Web mining technique is to extract textual, topological and usage information on Web1. Introduction (cont.): 1. Introduction (cont.) Web users may exhibit various types of behavior pattern For example, within an automatics web site one type of customers intend to browse specific category vehicle, whereas another is interested in one particular famous brand.1. Introduction (cont.): 1. Introduction (cont.) Clickstream data (usage) convey access pattern information. Web usage mining technique is to uncover the user access pattern. Then, usage pattern information provide benefit for Web recommending. 1. Introduction (cont.): 1. Introduction (cont.) Many advanced techniques in data mining K-Nearest Neighbor (KNN) Web clustering Association rule mining Sequence pattern mining. Successful progress benefits, e.g. Adaptive Web Design Web Recommender1. Introduction (cont.): 1. Introduction (cont.) Related work Content-based Filtering, measuring content similarity Collaborative Filtering, referring others’ preference similar to current user Web Usage Mining (WUM) Latent Semantic Analysis (LSA) and its application Variant of LSA, Probabilistic Latent Semantic Analysis (PLSA)2. PLSA Model : 2. PLSA Model Data sessionalization process Pageview identification: P = {p1, p2,…,pn} User sessionization: S = {s1, s2, …. , sm} T Usage data in matrix expression: considered as an n-dimensional page vector: si = {ai1, ai2…, ain}, where aij denotes the weight for pageview pj in si user session Session-pageview matrix: SPmn Matrix expressionEntry (weight) is determined by visiting duration or hit numbers: Matrix expression Entry (weight) is determined by visiting duration or hit numbersAn snapshot example: An snapshot example 1) Main Movies: 20sec Movies News: 15sec NewsBox: 43sec Box-Office Evita: 52sec News Argentina:31 sec Evita: 44sec 2) Music Box: llsec Box-Office Crucible: 12sec Crucible Book: 13sec Books: 19sec 3) Main Movies: 33sec Movies Box: 21sec Boxoffice Evita: 44sec News Box: 53sec Box-office Evita: 61 sec Evita : 31sec 4) Main Movies: 19sec Movies News: 21sec News box: 38sec Box-Office Evita:61 sec News Evita:24sec Evita News: 31 sec News Argentina: 19sec Evita: 39sec 5) Movies Box: 32sec Box-Office News: 17sec News Jordan: 64sec Box-Office Evita: 19sec Evita: 50sec 6) Main Box: 17sec Box-Office Evita: 33sec News Box: 41 sec Box-Office Evita: 54sec Evita News: 56sec News: 47sec PLSA Model : PLSA Model For the given aspect model, suppose that there is a latent factor variable space zk ∈ Z = {z1, z2, · · · , zl} associated with co-occurrence observation data (i.e. session-pageview data). Each co-occurrence observation data (si, pj) is associated with the factor zk∈ Z by varying degree to zk . Probabilities definitions: Probabilities definitions P(si) denotes the probability that a particular user session si will be observed in the occurrences data P(zk|si) denotes a user session-specific probability distribution on the unobserved class factor zk explained above, P(pj|zk) denotes the class-conditional probability distribution of pages over a specific latent variable zk. Slide153: Goal: maximize Li where Re-parameterized version Total likelihood Li as Formulation of PLSA (1) (2) (3)Expectation & Maximization (EM) algorithm : Expectation & Maximization (EM) algorithm Firstly, given the randomized initial values of P(zk), P(si|zk), P(pj|zk) Maximization (M) step: Expectation (E) step : Implementation of EM algorithm: Implementation of EM algorithm The E-step and M-step are iterating until a local optimal limit of Li is approached. We can obtain the conditional probability estimates P(zk), P(si|zk), P(pj|zk) corresponding to local maximum likelihood Li , and further P(zk|si), P(zk|pj) respectively. They will be utilized to cluster web session to induce user access pattern and identify factor (task) space.Generated conditional-probabilities: Generated conditional-probabilitiesScheme of WUM and WR: Scheme of WUM and WRAlgorithms for WUM and WR(1): Algorithms for WUM and WR(1) Algorithm 1 Characterizing latent factor Aim: identify the “dominant” page sets that are considered to contribute significantly to the associated latent factors. Procedure: probability inference. Algorithms for WUM and WR(2): Algorithms for WUM and WR(2) Algorithm 2 Clustering Web Session Aim: partition Web sessions into task-oriented group to represent user access pattern Procedure: employ clustering algorithm on transferred session space and build centroid of cluster as user profile Algorithms for WUM and WR(3): Algorithms for WUM and WR(3) Algorithm 3 Web Recommendation Aim: recommend user potentially visited pages Procedure: measure the similarity between active user session and generated user profile to determined most matching usage pattern, then predict the top N pages upon the selected pattern.Experimental Results : Experimental Results Real world data sets The data set is downloaded from KDDCUP including 9308 user sessions and 69 pageviews, (KDD dataset). The second data set is a 2-week web log file from university website containing 13745 sessions and 683 pages (CTI dataset). Slide162: 9. Future WorkFuture directions: Future directions What we can additionally provide to users on top of the Web communities and Web clusters found. How we can build up recommendation systems for Web users based on all we can possibly obtain including Web clusters, Web communities, as well as users viewpointsResearch questions: Q1. What model is suitable for capturing the relationships between Web documents and Web usage information? Q2. How do we discover Web communities and user communities based on such a model? Q3. How do we make recommendations based on the Web communities and user communities being found? Q4. How do we automatically build user profiles for a large number of users? Q5. How do we make online recommendations? Research questionsReference: Reference Y. Zhang, J.X. Yu and J Hou, Web Communities: Analysis and Construction, Springer 2005 (ISBN: 3-540-27737-4) You do not have the permission to view this presentation. In order to view it, please contact the author of the presentation.
web communities ozturk Download Post to : URL : Related Presentations : Share Add to Flag Embed Email Send to Blogs and Networks Add to Channel Uploaded from authorPOINTLite Insert YouTube videos in PowerPont slides with aS Desktop Copy embed code: (To copy code, click on the text box) Embed: URL: Thumbnail: WordPress Embed Customize Embed The presentation is successfully added In Your Favorites. Views: 258 Category: Entertainment License: All Rights Reserved Like it (0) Dislike it (0) Added: November 21, 2007 This Presentation is Public Favorites: 0 Presentation Description No description available. Comments Posting comment... Premium member Presentation Transcript Web CommunitiesAnalysis and Construction: Web Communities Analysis and Construction Yanchun Zhang · Jeffrey Xu Yu · Jingyu HouOutline: Introduction: Problems & Challenges Theoretical Background Web Search: Ranking Web Pages Web Community Construction Mining Web: Content, structure, usage and Web recommendation Web Community and Web Usage Mining Web Community Finding and Analysis A Web Recommendation Technique based on PLSA Future Work OutlineSlide3: 1.IntroductionBackground: Background WWW has become an important and popular application platform for disseminating and searching for information as well as conducting business Due to the lack of uniform schema for Web documents, the low precision of most search engines and the information explosion on the World Wide Web, the user is often flooded with a huge amount of information Web features: Web features The amount of data on the Web is enormous. The data on the Web is distributed The data on the Web is heterogeneous. The data on the Web is unstructured. The data on the Web is dynamic. The data on the Web is hyperlinked. Problems & Challenges: Problems & Challenges Web data is neither raw data nor very strictly typed as in conventional database systems. Web information retrieval and Web data management are becoming a challenging problem. Approach: Approach The Web community a set of Web-based objects (documents and users) has its own logical structures another effective and efficient approach to reorganize Web-based objects, support information retrieval and implement various applications Slide8: 2. Theoretical BackgroundMatrix Expression of Hyperlinks : Matrix Expression of Hyperlinks Matrix model has been widely used in many areas to model various actual situations, such as documents and keywords in IR… In context of Web Community, hyperlink matrix is adjacency matrix: A = [aij]mn if there is a hyperlink from page i to page j (i j), then the value of aij is 1, otherwise is 0. Matrix Expression of Hyperlinks: Matrix Expression of Hyperlinks In hyperlink analysis, this matrix expression can also be extended to represent semantics of hyperlinks. In this case, the values of the matrix elements are not required to be either 1 or 0, depends on the particular situations For example, the correlations between pages can be expressed in a matrix, where the value of a matrix element aij, which is between 0 and 1… Basic Requirements for Matrix Model: Basic Requirements for Matrix Model A data (information) space is constructed Two sets of information entities (objects), denoted as E1 and E2, within the constructed data space are identified For example, E1 could be a set of documents; E2 could be a set of keywords Original correlation expression between entities is defined and modeled into a matrix Mathematical Operations on Matrix : Mathematical Operations on Matrix Each Observation (e.g. page) is considered as a row or column of the matrix (vector model) Eigenvalues and eigenvectors computing, singular value decomposition Web page clustering through matrix permutation and partitioning Eigenvalue and Eigenvector of Matrix : Eigenvalue and Eigenvector of Matrix Two commonly used concepts in matrix model application An eigenvalue of A is a number with the property that for some vector v, we have Av = v. Such a vector v is called an eigenvector of A associated with the eigenvalue . Eigenvalue and Eigenvector: Eigenvalue and Eigenvector The set of all eigenvectors associated with a given eigenvalue forms a subspace of Rn, and the dimension of this space will be referred to as the multiplicity of . If A is a symmetric matrix, then A has at most n distinct eigenvalues, each of them is a real number and the sum of their multiplicities is exactly n. We denote these eigenvalues of matrix A by 1(A), 2(A), …, n(A), listing each a number of times equal to it multiplicity. Eigenvalue and Eigenvector: Eigenvalue and Eigenvector A matrix M is orthogonal if MTM = I If A is a symmetric nn matrix, is the diagonal matrix with diagonal entries 1(A), 2(A), …, n(A), and M is the matrix with column equal to v1(A), v2(A), …, vn(A). Then it is easy to verify that M is an orthogonal matrix and we have MMT = A the eigenvalues and eigenvetors provide a useful “normal form” representation for symmetric square matrix in terms of orthogonal and diagonal matrices Norms and Lipschitz Continuous Function : Norms and Lipschitz Continuous Function A matrix norm, denoted by ||||, is a measurement of a matrix. ||A-B|| can be understood as the distance between these two matrices A and B. A matrix norm ||A|| is a nonnegative number associated with A meeting the following properties ||A|| > 0 when A 0 and ||A|| = 0 iif A = 0, ||kA|| = |k| ||A|| for any scalar k, ||A+B|| ||A|| + ||B|| ||AB|| ||A|| ||B|| Norms: Norms Frobenius norm of a matrix A ||A||F = 1-norm of a matrix A |A||1= -norm of a matrix A ||A|| = 2-norm of a matrix A ||A||2 = (maximum eigenvalue of ATA)1/2 Lipschitz continuous function: Lipschitz continuous function Lipschitz continuous function is a function f (x) that for all x, y, we have |f (x) –f (y)| L |x-y|. L is called Lipschitz constant This is certainly satisfied if f has a first derivative bounded by L. Singular Value Decomposition : Singular Value Decomposition let A =[aij]m×n be a real matrix, rank(A) = r. Then there exist orthogonal matrices Um×m, Vn×n and such that where UTU=Im, VTV=In, ∑1=diag(σ1,…,σm), σi≥σi+1>0, 1ir-1, σj=0 for j≥r+1 Example of SVD: Example of SVD Let the SVD of A is where Basis of SVD: Basis of SVD Used to extract certain important properties relating to the structure of a matrix, such as the number of independent columns or rows, eigenvalues, approximation matrix and so on. Since the singular values of A are in non-increasing order, it is possible to choose a proper parameter k such that the last r-k singular values are much smaller than the first k singular values, and these k singular values dominate the decomposition. Theorem: Theorem Let the SVD of A be given and U = [u1 , u2 , … , um], V = [v1 , v2 , … , vn] with 0<r=rank(A)min(m,n), where ui, 1im is an m-vector, vj, 1 j n is an n-vector and Let and define Then rank(Ak) = k Advantage of SVD: Advantage of SVD Ak captures the main structure information of A and minor factors in A are filtered. Since kr and only partial matrix elements are involved in constructing Ak, the computation cost of an algorithm based on Ak could be reduced. Construction of Approximation : Construction of Approximation Similarity in Vector Space Model : Similarity in Vector Space Model We discuss the similarity here within a framework of vector space model, i.e, the concerned objects such as documents and web pages are represented as vectors. These vectors could be rows / columns of a matrix, or just individual vectors. The vector representation naturally suggests numerical similarity metrics for objects based on Euclidean distance or the vector inner product Cosine Similarity: Cosine Similarity The representative metric is the cosine measure where Graph Theory Basis : Graph Theory Basis A graph G = (V, E) consists of a set of nodes V and a set of edges E. Each element e E represents a connection between an unordered node pair (u, v) in V A graph is connected if the node set cannot be partitioned into components such that there are no edges whose connected nodes occur in different components Graph Theory Basis (cont.): Graph Theory Basis (cont.) A bipartite graph G = (V1, V2, E) consists of two disjoint sets of nodes V1, V2 such that each edge e E has one node from V1 and the other node from V2. A bipartite graph is complete if each node in V1 is connected to every node in V2. A matching is a subset of edges such that for each edge in the matching, there is no other edge that shares a node with it. A maximum matching is a matching of largest cardinality Graph Theory Basis (cont.): Graph Theory Basis (cont.) A weighted graph is a graph with a (non-negative) weight we for every edge e. A directed graph is a graph with an edge being an ordered pair of nodes (u, v), representing a connection from u to v. A directed acyclic graph (DAG) is a directed graph with no directed cycles. In a DAG, a sink node is one with no directed path to any other node. Markov Model : Markov Model A (homogeneous) Markov chain for a system is specified by a set of states S = {1, 2, …, n} and an n×n non-negative, stochastic matrix M. The system begins in some start state in S and at each step moves from one state to another state. This transition is guided by M: at each step, if the system is in state i, it moves to state j with probability Mij. Markov Model (cont.): Markov Model (cont.) If the current state is given as a probability distribution, the probability distribution of the next state is given by the product of the vector representing the current state distribution and M. In practical use, a random walk model based on a graph can also be represented as a Markov chain model under some assumptions. For example, a web surfer surfs the web along hyperlinks between pages. 3. Web Search/Search Engine: 3. Web Search/Search EngineStatistics: Statistics In 1994, one of the first web search engines, the World Wide Web Worm (WWWW), had an index of 110,000 web pages and web accessible documents. Up to 09/2005 Google indexes 8,200,000,000 web pages. Search Engines : Search Engines A search engine is a system which collects, organizes & presents a way to select Web documents based on certain words, phrases, or patterns within documents Model the Web as a full-text DB Index a portion of the Web docs Search Web documents using user-specified words/patterns in a text Search Engines: Search Engines Two categories of search engine general-purpose search engine, e.g. Yahoo!, AltaVista and Google special-purpose search engines (or Internet Portals), e.g. LinuxStart (www.linuxstart.com)Search Engines: Search Engines Two main components of a search engine: web crawler (spider), which collects massive Web pages. large database, which stores and indexes collected Web pages. Ranking has to be performed without accessing the text, just the index Ranking algorithms: all information is “top secret;” it is almost impossible to measure recall as the number of relevant pages can be quite large for simple queries Search Engine: Search Engine Information Retrieval (IR) is a key to search engine or Web Search. Most commonly – used models: Boolean Model Vector Space Model (VSM) Probability Model their variations Search Engine: Search Engine The Google is a representative general-purpose search engine (successful, takes many techniques such as use of both web page link structure and text content, which fully takes advantage of the web page features). The PageRank in Google is defined as follow: Assume page A has pages P1...Pn which point to it. The parameter d is a damping factor which can be set between 0 and 1. Also C(Pi) is defined as the number of links going out of page Pi. The PageRank of a page A is given as follows: PR(A) = (1-d) + d (PR(P1)/C(P1) + ... + PR(Pn)/C(Pn)) Usually the parameter d is set to 0.85. PageRank or PR(A) can be calculated using a simple iterative algorithm. Other features: anchor text processing, location information management and various data structures, which fully make use of the features of the web. Ranking result pages: Ranking result pages Based on content Based on content Number of occurrences of the search terms Based on link structure Backlink count PageRank And more. (http://www.cs.duke.edu/~junyang/courses/cps296.1-2002-spring/)Problems with content-based ranking: Problems with content-based ranking Many pages containing search terms may be of poor quality or irrelevant Example: a page with just a line “search engine”. Many high-quality or relevant pages do not even contain the search terms Example: Google homepage Page containing more occurrences of the search terms are ranked higher; spamming is easy Example: a page with line “search engine” Repeated many timesBased on link structure: Based on link structure Hyperlinks among web pages provide new web search opportunities. Our focus in this lecture PageRank HITSBacklink: Backlink A backlink of a page p is a link that points to p A page with more backlinks is ranked higher Intuition: Each backlink is a “vote” for the page’s importance Based on local link structure; still easy to spam Create lots of pages that point to a particular page PageRank: PageRank Page et al., "The PageRank Citation Ranking: Brining Order to the Web.", 1998 Main idea: Pages pointed by high-ranking pages are ranked higher Definition is recursive by design Based on global link structure; hard to spamPageRank: PageRank Web can be viewed as a huge directed graph G(V, E) where V is the set of web pages (vertices) and E is the set of hyperlinks (directed edges). Each page may have a number of outgoing edges (forward links) and a number of incoming links (backlinks). Each backlink of a page represents a citation to the page. PageRank is a measure of global web page importance based on the backlinks of web pages.Basic ideas: Basic ideas A page is likely to be important if it is linked by many pages. A page is likely to be important If it is linked to by important pages even though it may not necessarily be linked by a large number of pages. The importance of a page is divided evenly and propagated to the pages pointed to by it. PageRank Definition: PageRank Definition N(p): number of outgoing links from page p B(p): set of pages that point to p PageRank(p) = Σq∈B(p) (PageRank(q) / N(q)) Intuition Each page q evenly distributes its importance to all pages that q points to Each page p gets a boost of its importance from each page that points to pComputing PageRank : Computing PageRank Create a stochastic matrix M for the link structure Each page i corresponds to row i and column i If page j has n outgoing links, then the M(i, j) = 1 Ú n if page j Solve by setting all PageRank.s to 1 initially, and thenapplying M repeatedly until the values converge Convergence: Convergence If the ranks converge, i.e., there is a rank vector R such that R = M R, R is the eigenvector of matrix M with eigenvalue being 1. Convergence is guaranteed only if M is aperiodic (the Web graph is not a big cycle). This is practically guaranteed for Web. M is irreducible (the Web graph is strongly connected). This is usually not true. Example: Example The sum of all PageRank’s is the total number of pagesRandom surfer model: Random surfer model A random surfer Starts with a random page Randomly selects a link on the page to visit next Never uses the .back. button PageRank(p) measures the probability that a random surfer visits page pRank leak : Rank leak Pages with no outgoing links Called a rank leak because eventually all importance will leak. out of the WebRank sink: Rank sink A page or a group of pages is a rank sink if they can receive rank propagation from its parents but cannot propagate rank to other pages. Causes the loss of total ranks. Eventually accumulate all importance of the Web An example of rank sink: An example of rank sink p1 p2 p3Solution: Solution d: damping factor PageRank(p) =d · Σq∈B(p) (PageRank(q) /N(q)) + (1 -d ) Intuition in the random surfer model A surfer occasionally gets bored and jump to a random page on the Web instead of following a random link Another intuition Make the graph fully connected An example: An example p1 p2 p3 the sum of all PageRank’s is still the total number of pages. Slide56: 4. Web Community Construction Problem with Web Searching: Problem with Web Searching Current Web search tools such as Yahoo!, AltaVista, Google, … return more information than those required Example: search “Java Programming”, Google returns 1,330,000 (more than 1 million) Web pages, AltaVista returns 16,921,862 (more than 16 million) Web pages. Users only concern about a small and interesting portion of the returned search results. Necessary to build web communities. Related research work: Constructing good Web communities. Identifying related Web pages. Clustering Web pages. Web Community: Web Community A web community is defined as a set of web pages that link to more web pages within the set than to web pages outside the community (Flake/Lawrence/Giles 2000) Hyperlink analysis if a web page A has a hyperlink to page B, then the author of page A usually considers that page B contains valuable information that is related to page A; hyperlinks convey a considerable amount of latent human judgment; Some recent works on Web Communities: Some recent works on Web Communities Noise page elimination Related page finding Web page clustering Authority and Hub Pages : Authority and Hub Pages A page is a good authoritative page with respect to a given query if it is referenced (i.e., pointed to) by many (good hub) pages that are related to the query. A page is a good hub page with respect to a given query if it points to many good authoritative pages with respect to the query. Good authoritative pages (authorities) and good hub pages (hubs) reinforce each other. Constructing Web Community: Given a query/topic, a web community consists of two distinct, but interrelated, types of pages: authority pages (authorities) and hub pages (hubs). Constructing Web Community Hubs Authorities Figure 2. A skeletal web communitySlide62: Construct Good Quality Web Page Communities World Wide Web Concerned Info. Space Web Page Community Hubs Authorities Search Engines ? ?Slide63: Kleinberg’s HITS Algorithm (Hypertext Induced Topic Search) Concerned Info. Space Calculate Authority/Hub values Web Page Community R : root set of pages B : base set of pages R construction: collecting r highest-ranked pages from the search results. B construction: adding to R more pages that are pointed by or pointing to the pages in R.HITS Algorithm: HITS Algorithm Hub and authority value computation "p q" denotes "page p has a hyperlink to page q". Normalize the a and h vectors after each iteration. Improved HITS algorithm (Bharat and Henzinger, 1998) , Slide65: Problems with various HITS algorithms Concerned Info. Space Community Construction Algorithms Web Page Community R : root set of pages B : base set of pages R construction: collecting r highest-ranked pages from the search results. B construction: adding to R more pages that are pointed by or pointing to the pages in R. Topic Drift Problem Noise pagesExample: Example Noise page --- contains no query terms Experiment Query Term: Harvard Search Engine: AltaVista Size of Base Set: 8064 Size of Root Set: 200 Slide67: Ten arbitrarily selected noise pagesSlide68: Ten arbitrarily selected topic-related pagesSlide69: Authorities & Hubs (by HITS algorithm)Slide70: Noise Page Elimination Concerned Info. Space Community Construction Algorithms Good Quality Web Page Community Noise Page Elimination Algorithms !Hyperlink Analysis via SVD: Hyperlink hyperlink matrix Hyperlink matrix singular value decomposition getting intrinsic relationships between the pages Hyperlink Analysis via SVD SVD of a Matrix: SVD of a Matrix Definition of SVD then where Slide73: Approximation Property is the best approximation to A with rank k ( ). Noise Page Elimination Algorithm: Noise Page Elimination Algorithm Main Procedure of the Algorithm Hyperlinks Matrix A SVD of A Approximation Matrix Ak Ak captures main structure information of A Ak filters minor factors in A For a proper k, computing cost can be reduced Slide75: Noise Page Elimination Algorithm B R Matrix A = (aij) n×n Matrix S=(sij)(m-n)×n Size(R)= n Size(B)= m î í ì = ® ® = otherwise j i i j j i s ij 0 or or 1Slide76: Matrix A SVD of A Choose parameter k such that Approximation matrix of A Rows of matrix Coordinate vector of page i in k-dim. space. page i B - R i = 1, 2, …, n. Slide77: Matrix S Each row s in S, s =(g1, g2, …, gn) in new k-space Pages in A and S are mapped into the vectors in the same k-dimensional subspace, it is possible to measure the similarity (relevance degree) between a page in S and a page in A. Yes No Keep page i in B Eliminate page i from B Choice of Parameter c: Choice of Parameter c Algorithm is called c = avgAlgo Algorithm c = maxAlgo Algorithm c = minAlgo Algorithm Slide79: Authorities & Hubs after Elimination (NPEA and HITS algorithm)Finding Related Web Pages: Given a URL or web page, find a set of pages that address the same topic of the original page. Co-citation algorithm: the sibling pages of the given URL u that have the most number of common parent pages with u are the related pages. HITS and Improved HITS algorithms: authority pages with the highest authority values are related pages. Finding Related Web PagesSlide81: Finding related pages Co-Citation Algorithm Given page (URL) u, page source S is constructed in the following way: chooses up to B (e.g. 2000) arbitrary parents of u; for each of these parents p, adds to S up to BF (e.g. 8) children of p that surround the link from p to u. The elements of S are siblings. for each s in S, co-citation degree of s and u is determined; the algorithm returns the 10 pages that have the highest co-citation degrees with u as the relevant pages.Slide82: Extended Co-Citation Algorithm Given page u, Choose up to B parent pages of u, and add up to BF children pages (different from u) of each parent page. choose up to F children pages of u, and add up to FB parent pages (diff from u) of each child page. B, F, BF and FB are used to keep the page source to a reasonable size. In practice, choose B = FB = 200, F = BF = 40. Web Page Clustering: With a set of pages, the key is to find a similarity between the pages Text content based algorithms Regard a web page as a set of words or phrases; Each page is represented as a feature vector; The similarity is measured as where wik is the k-th component of the feature vector of the i-th page, and L is the length of the feature vectors. Web Page Clustering Slide84: Hyperlink based algorithms Co-citation algorithm Linkage vector algorithm Each page P is represented as two linkage vectors: Pout and Pin ; The similarity is defined as : dot product of vectors Pout and Qout . Web Page Clustering : Page Source Construction Given a user’s query topic, page source S = R BV FV with hyperlinks. Web Page Clustering Slide86: Page Similarity Assume: size(R) = m, size(V) = n, V = BV FV Hyperlinks Adjacency Matrix C Slide87: (Out-link relationship of page i in R) (In-link relationship of page i in R) Out-link similarity of pages i and j In-link similarity of pages i and jSlide88: Similarity between any two pages i and j in R: Similarity Matrix SM = (smi,j)mm , where MODij = || rowi || + || rowj || + || coli || + || colj || Slide89: Matrix-Based Clustering Algorithms Three main steps: Similarity Matrix Permutation Purpose: put closely related pages together in the similarity matrix SM. Matrix Partition Purpose: cluster pages at one level with the partition. Hierarchical Clustering Recursively apply the above two steps to the existing clusters.Slide90: Matrix-based hierarchical clustering diagram: Similarity Matrix mm Page Source SM CL1 CL2 Matrix Partition Hierarchical Clusters Permuted Similarity Matrix mm SM’ Slide91: An example of matrix permutation and partition 1 0 0 0 .47 0 .6 0 0 0 1 0 0 0 0 .49 0 0 0 0 1 0 0 0 0 .7 .66 0 0 0 1 0 0 0 0 .42 .47 0 0 0 1 0 .9 0 0 0 0 0 0 0 1 0 .92 .45 .6 .49 0 0 .9 0 1 0 0 0 0 .7 0 0 .92 0 1 0 0 0 .66 .42 0 .45 0 0 1 P1 P2 P3 P4 P5 P6 P7 P8 P9 P1 P2 P3 P4 P5 P6 P7 P8 P9 (a) A similarity matrix (b) Permuted matrix and partition 1 .47 .6 0 0 0 0 0 0 .47 1 .9 0 0 0 0 0 0 .6 .9 1 .49 0 0 0 0 0 0 0 .49 1 0 0 0 0 0 0 0 0 0 1 .42 0 0 0 0 0 0 0 .42 1 .66 0 .45 0 0 0 0 0 .66 1 .7 0 0 0 0 0 0 0 .7 1 .92 0 0 0 0 0 .45 0 .92 1 D P1 P5 P7 P2 P4 P9 P3 P8 P6 P1 P5 P7 P2 P4 P9 P3 P8 P6 Cluster1 = { P1 , P5 , P7 , P2 } , Cluster2 = { P4 , P9 , P3 , P8 , P6 }Slide92: Examples of Some Major Clusters 5. Mining Web: Content, structure, usage and Web recommendation: 5. Mining Web: Content, structure, usage and Web recommendationWeb mining: Web mining Web mining utilize data mining methods to induce and extract useful information from web data and service. Web content mining Web structure mining Web usage mining Web Content mining: Web Content mining Web content mining tries to discover valuable information from web contents (i.e. web documents or web pages), alternatively termed as text mining sometimes Web structure mining: Web structure mining Web structure mining involves modeling web site in terms of link structures. The mutual linkage information obtained could be used to cluster the web pages or find relevant pages based on the similarity or relevance between different web pages. Web usage mining: Web usage mining Web usage mining tries to reveal the underlying access patterns from web user transactions User pattern can be utilized to recommend or personalize web contents to users. Capturing the web user access interest can help for better understanding user navigational behavior, and for improving web site structure design. Slide98: Web usage mining is utilized to make recommendation: Firstly, extract the informative knowledge from web log files and identify the underlying user functional interests; Secondly, create user profiles for representing common user navigational behavior based on observation usage data; Finally, present the desired web content in personalized style to user by matching the current active user session with the discovered patterns. Web recommendation: Web recommendation Web recommendation is a process that recommends customized web content to users according to their specific preference. Two common approaches: content-based filtering and collaborative filtering. Web recommendation: Web recommendation Content-based filtering systems, such as WebWatcher and client-side agent Letizia generally generate recommendation based on the pre-constructed user profiles by measuring the similarity of web content to these profiles Collaborative filtering systems make recommendation by utilizing the rating of current user for objects via other users’ preference.Slide101: 6. Web Community and Web Usage Mining Web Community and Web Usage Mining : Web Community and Web Usage Mining Pierrakos et al presented a way of constructing web communities using web logs archived in a proxy server The web logs contain the basic information, such as web pages accessed and IP addresses. Second, a web page can be abstracted as a feature vector where a feature indicates whether the corresponding word appears in the web page. The page categories can then be obtained using a hierarchical clustering to cluster web pages being accessed. Community Directory Miner (CDM): Community Directory Miner (CDM) The clustering result in a tree structure where a leaf node represents a cluster of web pages which share high similarity. The internal node represents a larger cluster which includes smaller clusters as represented by the child nodes of the internal node. The root node represents all web pages as a top largest cluster. A page category is a node in the tree. There exist parent/child relationships among page categories Community Directory Miner (CDM): Community Directory Miner (CDM) Basic Node Weights: Suppose that there exist n web pages in a page category. The weight of a feature i is denoted wi for the page category, and is computed as Community Directory Miner (CDM): Community Directory Miner (CDM) Basic Edge Weights: Suppose that there exist n user access sessions that access both page categories, vi and vk. The weight of the edge between vi and vk is denoted wik, and is computed as Community Directory Miner (CDM): Community Directory Miner (CDM) Accumulated Node/Edge Weights: A weight of a feature in a page category, wi, is the weight of itself plus all the weight of its child page categories. In a similar fashion, the weight of an edge between page categories, pi and pj, is accumulated by the weight of the edge between pi and pj as well as all the weights of the edges between pm and pn if pm is a child of pi and pn is a child of pj. The accumulated node/edge weights make it include more page categories as a part of web communities. Community Directory Miner (CDM): Community Directory Miner (CDM) The above shows how to construct a weighted undirected graph. The resulting weighted undirected graph may have high connectivities. Some edges of the graph are pruned using a threshold. After pruning edges, the CDM algorithm converts the weighted undirected graph into an undirected graph, Then find maximal cliques. The found web communities can be used to build a web community directory Slide108: 7. Web Community Finding and Analysis Web Community : Web Community The web communities can be viewed as containing a core of central, authoritative pages linked together by hub pages; and they exhibit a natural type of hierarchical topic generalization that can be inferred directly from the pattern of linkage. By David Gibson, Jon Kleinberg and Prabbakar Raghavan Understand the global structure of the ways in which independent users build connections to one another in a distributed fashion. Given a user-specified query, HITS (Hyperlink-Induced Topic Search) finds a root set of pages. Let the core of a community consists of top 10 authoritative and top 10 hub pages. Can HITS discover communities? Does the choice of the root set affect the communities discovered? Small World Phenomenon : Small World Phenomenon In social networks, it is known that the average number of the length of the chains of people was between 5 and 6 (Six degrees of Separation). Experiments: from Nebraska to Massachusetts (Stanley Milgran et al) Watts and Strogatz showed that small world phenomenon is pervasive in a range of networks (Nature 393, 1998). A fundamental ingredient in the evolution of the World Wide Web. Short paths are constructed based on users’ local information collectively. Is there a decentralized algorithm operating with local information only to construct short paths in networks (WWW)? Kleinberg’s Algorithmic Perspective (1) : Kleinberg’s Algorithmic Perspective (1) Model a network that exhibits the richness of local connections and a few random long-range connections over a grid with directed edges. The distance between two points is . For local connections, a parameter p (≥1) is introduced. For long-range connections, q and r are introduced. q specifies how many long-range connections a point has, and r specifies how they are linked using independent random trials. The i-th directed edge from p1 to p2 is with probability proportional to for r ≥ 0. The reverse r-th power distribution. When r = 0 , uniform distribution. If r increases, the long-range contacts of a node become more clustered. Kleinberg’s Algorithmic Perspective (2) : Kleinberg’s Algorithmic Perspective (2) The decentralized algorithm follows a simple strategy, to transmit a message from s to t in as few steps as possible. Every message-holder chooses a contact that is as close to the target as possible in terms of the distance measurement. The message-holder does not know anything about how the messages have been delivered so far. Theorem 2: There is a decentralized algorithm and a constant α, independent of n, so that when r = 2 and p=q=1, the expected delivery time of the algorithm is at most Two Types of Web Communities : Two Types of Web Communities Explicitly-defined communities Yahoo and AOL. Implicitly-defined communities The number of such web communities is large. They may appear as emerging web communities, and may disappear. Web Community: A Complete Directed Bipartite Graph : Web Community: A Complete Directed Bipartite Graph What does a web community look like in WWW? If we consider a web community as a subgraph in a web graph, how do we define such a subgraph? Define a web community as a complete directed bipartite graph, G(V, E) where and . Vf is a set of fans (hub). Vc is a set of centers (authority). An example of a web community Trawling the Web : Trawling the Web As the first attempt, Kumar et al conducted a two week experimental studies using a 18 month and 200 million web-page web archive in 1999. Generating fans and centers Identify the potential fans as specialized hubs using HITS. With human interactions, they determined that a potential fan has links to at least 6 different web sites, and maintained the number of potential centers as roughly 30 times of the number of potential fans. Some pruning strategies are used. The web communities do exist. tested , for f is taken from {3,4,5,6} and c is taken from {3,5,7,9}. , , and . Modeling the Web (1) : Modeling the Web (1) Kumar et al also attempted to graph model for web. A class of graph models are specified by four stochastic processes at a time step t. Node-creation and node-deletion are with probability and Edge-creation is with a probability β. Consider a node, v, to be randomly selected to add edges. With probability β, it adds k edges from v to other nodes independently and uniformly selected. With probability 1- β, it chooses another node u randomly, and copies k out-going edges of u to v. Edge deletion is modeled as to randomly delete a node with a probability . Modeling the Web (2): Modeling the Web (2) Both in-degree distribution of nodes and out-degree distribution of nodes are of Zipfian distribution, when , , and Let pi,t and qi,t be the fraction of nodes at time t with in-degree i and out-degree i, then Here, α and β are two probabilities for creating a new edge (u, v). When α=0.52 and β=0.58, the probability of a node that has in-degree i is i-2.1 and probability of a node that has out-degree i is i-2.38. They remarked that both match the reality of web. Web Community: A Dense Directed Bipartite Graph : Web Community: A Dense Directed Bipartite Graph Are the condition of Complete Bipartite Graph too strong? Do we miss any interesting web communities? A dense directed bipartite graph, GD(V, E) , is a directed bipartite graph, denoted A node in Vf must link to at least nodes in Vc. At least nodes in Vf link to every node in Vc. An example A web community hierarchy can be determined on top of dense directed bipartite graphs where centers at level i become fans at level i+1. Finding Dense Directed Bipartite Graphs : Finding Dense Directed Bipartite Graphs The cocited relationship can help to find complete bipartite graphs. But is too strong for dense directed bipartite graphs. The node vi is Cocited with V in a directed graph G, if , where . The main difference in finding dense directed bipartite graphs is to use Cocited instead of cocited. Web Community in Any Shapes : Web Community in Any Shapes Both Complete/Dense Directed Bipartite Graphs need to specify two numbers, f (fans) and c (centers). Is it possible to find web communities of arbitrary shapes or diameters? Flake et al define a web community as a collection of web pages such that each member page has more hyperlinks within the community than outside of the community. Network Flow : Network Flow Given a flow network . A flow is a function, denoted as , for every edge satisfying two conditions. . For all nodes , . The s-t maximum flow problem over is to find the maximum value of . The s-t minimum cut over N is to find the minimum value of a cut set, . The above two problems are identical over N. Ford-Fulkerson Method Web Community Based on Network Flow : Web Community Based on Network Flow A web community is a set of nodes over graph such as every node v in Vc has many edges connecting to the other nodes in Vc than the nodes in V- Vc. Let #(s) be the number of edges from s to any other nodes in Vc \ {s}. Let #(t) be the number of edges from t to any other nodes in V \ Vc \ {s}. Condition: and . How to select s and t? Selecting s and t: Selecting s and t Consider s and t as additional virtual nodes. Given a set of seed nodes (pages) as in the web community, S . The s connects to all nodes in S, with infinite capacity. It makes possible that all S will be selected in the web community. Intuitively, it is difficult to determine t. Flake et al formalize the problem of choosing t as whether the cut set remains the same given two different terminals, the best, t, and the heuristic Selecting s and t: Selecting s and t The best t is the graph center. The virtual terminal, , is connected with all nodes in V, except the seeds, S and itself, with the minimum capacity 1. Given two flow networks, and . Let C1, and C2 be the minimum cut sets for and . If holds, then such as c’ only includes the edges for Two Algorithms: Two Algorithms Exact-Flow-Community algorithm, which takes three inputs, a set of seed nodes, S, and the entire web graph, G, and edge capacity k (integer). Approximate-Flow-Community algorithm, which takes three inputs, a set of seed nodes, S, and two integers, d and m. Approximate-Flow-Community does not have the entire web as an input into the algorithm. Exact-Flow-Community algorithm : Exact-Flow-Community algorithm Approximate-Flow-Community algorithm : Approximate-Flow-Community algorithm How to determine the edge capacities? : How to determine the edge capacities? The edge capacities are important in the work of Flake et al. Flake et al did not show how to assign edge capacities. Imafuji and Kitsuregawa show the impacts of the edge capacity on the quality of the web community being found by Flake et al. Increasing edge capacity can increase the size of web community. It is difficult to determine proper edge capacity. When the edge capacity is less than the quantum jump point, the web community, based on a set of seed nodes, is too small. When the edge capacity is greater than or equal to the quantum jump, the whole graph will be the web community The noises in the web community cannot be easily removed by controlling the edge capacity. Estimation of the final quantum jump q: Estimation of the final quantum jump q Let all the edge capacities be q-1 Consider a node that has the largest degree, Then Edge Capacity Assignment : Edge Capacity Assignment The edge capacity of an edge is set as avl and hvk are the authority score and hub score. rh =max(avk)/max(hvl). ra =q. Web Community Charts : Web Community Charts A web community is a set of pages densely connected by symmetric derivation relationships. Two communities are related if there is an asymmetric derivation relationship between members of them. A web community chart is a graph where is a set of web communities and is a set of edges from a web community to another web community. From Web Community Chart to Web Community Evolution : From Web Community Chart to Web Community Evolution Let a web community chart, G(V,E), and a web community, v∈G, at time t, be denoted as G(t) and v(t) ∈G(t), respectively. Consider two web community charts, G(tk) and G(tl) where tk <tl, and tk and tl are close enough to detect web community changes. There are five cases, at time tl, in comparison to the web communities at time tk (<tl). emerged web communities, dissolved web communities, grown or shrunk communities, split communities, and merged communities The key issue is to how to determine a web community, v(tk), that corresponds to a web community v(tl). From Web Community Chart to Web Community Evolution : From Web Community Chart to Web Community Evolution When can a web community be unique? : When can a web community be unique? Given a subgraph Gh in a graph G, a web community of Gh is the largest subgraph of the maximal edge-connectivity containing Gh, denoted Gc (⊆ G), such as Gh ⊆ Gc, conn(Gc) ≥ conn(Gd) if Gh ⊆ Gd ⊆ G, and Gd ⊆ Gc if Gh ⊆ Gd ⊆ G and conn(Gd)=conn(Gc). The conn(G) is the weight of a minimal cut of G. Such a web community is unique Discovering Web Communities Using Co-occurrence : Discovering Web Communities Using Co-occurrence Identify a web community , and enlarge Let be the set of all web pages that co-refer to all the pages in ; While do let be a set of web pages linked by all web pages in ; sort based on the co-occurrence; select the top-ranked web page in , and add it into ; be the set of all web pages that co-refer to all the pages in ; end while return ; Do newly added centers really contain the similar contents like those in ? Generate a set of initial inputs, where . In total pairs. Enlarge each individually. Select the top-ranked ones. Can we find the boundary between centers and non-centers? Web Community and Formal Concept Analysis (1) : Web Community and Formal Concept Analysis (1) A context is a triple (G,M,I) where Gis the domain of objects, and M is the domain of attributes, Iis a binary relation such as I⊆ G× M. A concept in the context is a pair (CG,CM), where CG (⊆ G) is a set of objects that have all the attributes in CM (⊆ M). The ordered set of all concepts of the context (G,M,I) forms a concept lattice. Rome and Haralick in considered a context (G,M,I)as a web graph. Here, both Gand M are web pages, and I is a set links that connect web pages. Rome and Haralick considered to merge small web communities as a large web community by using the existing techniques to coalesce concepts. Web Community and Formal Concept Analysis (2) : Web Community and Formal Concept Analysis (2) Web Community and Formal Concept Analysis (3) : Web Community and Formal Concept Analysis (3) A large web communities is, Vf = {1, 3, 4, 5, 7} and Vc = {2, 6, 9}, by adding (4, 2) and (5, 9) edges. Finding Geographical Scopes of Web Resources : Finding Geographical Scopes of Web Resources The geographical scope of a web resource is the geographical area that the creator of intends to reach. Let be a hierarchy for geographical scopes. Two conditions, for a web resource . The fraction states that there are significant web pages in a geographical scope containing links to . The uniformity states that the links to are distributed smoothly across the geographical scope . Discovering Unexpected Information from Competitors : Discovering Unexpected Information from Competitors Given two web sites, Wb and Wc, where Wc is the web site for the competitor, and Wb is the web site, as the basis, to find the unexpected information from Wc. The Unexpected information in Wc is the information which is relevant to Wb but is unknown. Several methods were proposed to compare Wb with Wc in. 8. A Web Recommendation Technique based on Probabilistic Latent Semantic Analysis : 8. A Web Recommendation Technique based on Probabilistic Latent Semantic Analysis 1. Introduction: 1. Introduction Internet has become a huge data repository for retrieving information, discovering knowledge Web users usually suffer from the information overload problem One approach addressed to the information overload is the recommendation system 1. Introduction (cont.): 1. Introduction (cont.) Web recommender system focuses on identifying Web users or objects, collecting users’ preference information adapting its service to satisfy the users’ needs Web mining technique is to extract textual, topological and usage information on Web1. Introduction (cont.): 1. Introduction (cont.) Web users may exhibit various types of behavior pattern For example, within an automatics web site one type of customers intend to browse specific category vehicle, whereas another is interested in one particular famous brand.1. Introduction (cont.): 1. Introduction (cont.) Clickstream data (usage) convey access pattern information. Web usage mining technique is to uncover the user access pattern. Then, usage pattern information provide benefit for Web recommending. 1. Introduction (cont.): 1. Introduction (cont.) Many advanced techniques in data mining K-Nearest Neighbor (KNN) Web clustering Association rule mining Sequence pattern mining. Successful progress benefits, e.g. Adaptive Web Design Web Recommender1. Introduction (cont.): 1. Introduction (cont.) Related work Content-based Filtering, measuring content similarity Collaborative Filtering, referring others’ preference similar to current user Web Usage Mining (WUM) Latent Semantic Analysis (LSA) and its application Variant of LSA, Probabilistic Latent Semantic Analysis (PLSA)2. PLSA Model : 2. PLSA Model Data sessionalization process Pageview identification: P = {p1, p2,…,pn} User sessionization: S = {s1, s2, …. , sm} T Usage data in matrix expression: considered as an n-dimensional page vector: si = {ai1, ai2…, ain}, where aij denotes the weight for pageview pj in si user session Session-pageview matrix: SPmn Matrix expressionEntry (weight) is determined by visiting duration or hit numbers: Matrix expression Entry (weight) is determined by visiting duration or hit numbersAn snapshot example: An snapshot example 1) Main Movies: 20sec Movies News: 15sec NewsBox: 43sec Box-Office Evita: 52sec News Argentina:31 sec Evita: 44sec 2) Music Box: llsec Box-Office Crucible: 12sec Crucible Book: 13sec Books: 19sec 3) Main Movies: 33sec Movies Box: 21sec Boxoffice Evita: 44sec News Box: 53sec Box-office Evita: 61 sec Evita : 31sec 4) Main Movies: 19sec Movies News: 21sec News box: 38sec Box-Office Evita:61 sec News Evita:24sec Evita News: 31 sec News Argentina: 19sec Evita: 39sec 5) Movies Box: 32sec Box-Office News: 17sec News Jordan: 64sec Box-Office Evita: 19sec Evita: 50sec 6) Main Box: 17sec Box-Office Evita: 33sec News Box: 41 sec Box-Office Evita: 54sec Evita News: 56sec News: 47sec PLSA Model : PLSA Model For the given aspect model, suppose that there is a latent factor variable space zk ∈ Z = {z1, z2, · · · , zl} associated with co-occurrence observation data (i.e. session-pageview data). Each co-occurrence observation data (si, pj) is associated with the factor zk∈ Z by varying degree to zk . Probabilities definitions: Probabilities definitions P(si) denotes the probability that a particular user session si will be observed in the occurrences data P(zk|si) denotes a user session-specific probability distribution on the unobserved class factor zk explained above, P(pj|zk) denotes the class-conditional probability distribution of pages over a specific latent variable zk. Slide153: Goal: maximize Li where Re-parameterized version Total likelihood Li as Formulation of PLSA (1) (2) (3)Expectation & Maximization (EM) algorithm : Expectation & Maximization (EM) algorithm Firstly, given the randomized initial values of P(zk), P(si|zk), P(pj|zk) Maximization (M) step: Expectation (E) step : Implementation of EM algorithm: Implementation of EM algorithm The E-step and M-step are iterating until a local optimal limit of Li is approached. We can obtain the conditional probability estimates P(zk), P(si|zk), P(pj|zk) corresponding to local maximum likelihood Li , and further P(zk|si), P(zk|pj) respectively. They will be utilized to cluster web session to induce user access pattern and identify factor (task) space.Generated conditional-probabilities: Generated conditional-probabilitiesScheme of WUM and WR: Scheme of WUM and WRAlgorithms for WUM and WR(1): Algorithms for WUM and WR(1) Algorithm 1 Characterizing latent factor Aim: identify the “dominant” page sets that are considered to contribute significantly to the associated latent factors. Procedure: probability inference. Algorithms for WUM and WR(2): Algorithms for WUM and WR(2) Algorithm 2 Clustering Web Session Aim: partition Web sessions into task-oriented group to represent user access pattern Procedure: employ clustering algorithm on transferred session space and build centroid of cluster as user profile Algorithms for WUM and WR(3): Algorithms for WUM and WR(3) Algorithm 3 Web Recommendation Aim: recommend user potentially visited pages Procedure: measure the similarity between active user session and generated user profile to determined most matching usage pattern, then predict the top N pages upon the selected pattern.Experimental Results : Experimental Results Real world data sets The data set is downloaded from KDDCUP including 9308 user sessions and 69 pageviews, (KDD dataset). The second data set is a 2-week web log file from university website containing 13745 sessions and 683 pages (CTI dataset). Slide162: 9. Future WorkFuture directions: Future directions What we can additionally provide to users on top of the Web communities and Web clusters found. How we can build up recommendation systems for Web users based on all we can possibly obtain including Web clusters, Web communities, as well as users viewpointsResearch questions: Q1. What model is suitable for capturing the relationships between Web documents and Web usage information? Q2. How do we discover Web communities and user communities based on such a model? Q3. How do we make recommendations based on the Web communities and user communities being found? Q4. How do we automatically build user profiles for a large number of users? Q5. How do we make online recommendations? Research questionsReference: Reference Y. Zhang, J.X. Yu and J Hou, Web Communities: Analysis and Construction, Springer 2005 (ISBN: 3-540-27737-4)