logging in or signing up Santa Fe WWW July 2005 Misree Download Post to : URL : Related Presentations : Share Add to Flag Embed Email Send to Blogs and Networks Add to Channel Uploaded from authorPOINTLite Insert YouTube videos in PowerPont slides with aS Desktop Copy embed code: (To copy code, click on the text box) Embed: URL: Thumbnail: WordPress Embed Customize Embed The presentation is successfully added In Your Favorites. Views: 35 Category: Entertainment License: All Rights Reserved Like it (0) Dislike it (0) Added: November 21, 2007 This Presentation is Public Favorites: 0 Presentation Description No description available. Comments Posting comment... Premium member Presentation Transcript On Search, Ranking, and Matchmaking in Information Networks: On Search, Ranking, and Matchmaking in Information Networks Sergei Maslov Brookhaven National LaboratoryInformation networks: WWW and beyond: Information networks: WWW and beyond First part of my talk: 1010 webpages in the world: need to search and rank!! Second part: opinion networks WWW can be thought of as a network of opinions (hyperlinks – positive votes) Our choices and opinions on products, services and each other – a much larger opinion network! Very incomplete (sparse) one could use intelligent “matchmaking” to match users to new products or each otherRanking webpages: Ranking webpages Assign an “importance factor” Gi to every webpage Given a keyword (say “jaguar”) find all the pages that have it in their text and display them in the order of descending Gi. One solution still used in scientific publishing is Gi=Kin(i) (the number of incoming links), but: Too democratic: It doesn’t take into account the importance of nodes sending links Easy to trick and artificially boost the rankingHow Google works: How Google works Google’s recipe (circa 1998) is to simulate the behavior of many virtual “random surfers” PageRank: Gi ~ the number of virtual hits the page gets. It is also ~ the steady state number of random surfers at a given page Popular pages send more surfers your way PageRank ~ Kin weighted by the popularity of a source of each hyperlink Surfers get bored following links with probability =0.15 at any timestep a surfer jumps to a randomly selected page (not following any hyperlinks) Last rule also solves the ergodicity problemMathematics of the Google: Mathematics of the Google To calculate the PageRank Google solves a self-consistent Eq.: Gi ~ ji Gj /Kout (j) To account for random jumps: Gi = (1-) ji Gj /Kout (j) + j Gj/N = (1-) ji Gj /Kout (j) + (uses normalization: <G>=j Gj/N =1) Pages with Kout (j)=0 are removedMatrix formulation: Matrix formulation Equivalent to finding the principal eigenvector (with =1) of the matrix (1-) T + U, where Tij= 1/Kout (j) if j i and 0 otherwise, and Uij=1/N Could be easily solved iteratively by starting with Gi(0)=1 and repeating G(n+1)= (1-) T G(n)+ All Gi > Slide7: How Communities in the WWW influence the Google ranking H. Xie, K.-K. Yan, SM, cond-mat/0409087How do WWW communities influence their average Gi?: How do WWW communities influence their average Gi? Pages in a web-community preferentially link to each other. Examples: Pages from the same organization (e.g. SFI) Pages devoted to a common topic (e.g. physics) Pages in the same geographical location (e.g Santa Fe) Naïve argument: communities tend to “trap” random surfers to spend more time inside them they should increase the Google ranking of individual webpages in the communityTest of a naïve argument : log10(<G>c) # of intra-community links Test of a naïve argument Naïve argument is wrong! The effect could go either way Community #1 Community #2Slide10: Ecc EwwSlide11: Gc – average Google rank of pages in the community; Gw=1 – in the outside world Ecw Gc/<Kout>c – current from C to W It must be equal to: Ewc Gw/<Kout>w – current from W to C Thus Gc depends on the ratio between Ecw and Ewc – the number of edges (hyperlinks) between the community and the worldBalancing currents for nonzero : Balancing currents for nonzero Jcw=(1- ) Ecw Gc/<Kout>c + Gc Nc – current from C to W It must be equal to: Jcw=(1- ) Ewc Gw/<Kout> + Gw Nw(Nc/Nw) – current from W to CWhat are the consequences?: For very isolated communities (Ecw/E(r)cw< and Ewc/E(r)wc<) one has Gc=1. Their Google rank is decoupled from the outside world! Overall range: <Gc<1/ What are the consequences?WWW - the empirical data: WWW - the empirical data We have data for ~10 US universities (+ all UK and Australian Universities) Looked closely at Long Island University 4 large campuses 45,000 webpages and 160,000 hyperlinks After removing Kout=0 left with ~15,000 webpages and 90,000 links Can do a mini-Google PageRank on this set aloneLIU communities: LIU communities LI University has 4 campuses. We looked at one of them (CWP Campus) Ecw=1393; E(r)cw 16,000; Ecw/E (r) cw ~ 0.09 <=0.15 Ewc=336; E(r)wc 12,500; Ewc/E(r)wc ~ 0.03<=0.15 This community should be decoupled from the outside worldSlide16: Ratios are renormalized to Ecw/E (r) cw ~0.01 and Ewc/E(r)wc ~0.005 But: the community effect could be also strong!: But: the community effect could be also strong!Slide18: =0.001 Abnormally high PageRankTop PageRank LIU websites for =0.001 don’t make sense: Top PageRank LIU websites for =0.001 don’t make sense #1 www.cwpost.liu.edu/cwis/cwp/edu/edleader/higher_ed/ hear.html' #5 …/higher_ed/ index.html #9 …/higher_ed/courses.html World Strongly connected component Collaborators and postdoc info:: Collaborators and postdoc info: Collaborators: Huafeng Xie – City University of NY Koon-Kiu Yan - Stony Brook U. Looking for a postdoc to work in my group at Brookhaven National Laboratory in New York starting Fall/Winter 2005 or even 2006 Topics: Large-scale properties of (mostly) bionetworks (partially supported by a NIH/NSF grant with Ariadne Genomics) Internet/Google/Opinion networks E-mail CV and 3 letters of recommendation to: maslov@bnl.gov; See www.cmth.bnl.gov/~maslovPart 2: Opinion networks: Part 2: Opinion networks "Extracting Hidden Information from Knowledge Networks", S. Maslov, and Y-C. Zhang, Phys. Rev. Lett. (2001). "Exploring an opinion network for taste prediction: an empirical study", M. Blattner, Y.-C. Zhang, and S. Maslov, in preparation.Predicting customers’ tastes from their opinions on products: Predicting customers’ tastes from their opinions on products Each of us has personal tastes Information about them is contained in our opinions on products Matchmaking: opinions of customers with tastes similar to mine could be used to forecast my opinions on untested products Internet allows to do it on large scale (see amazon.com and many others)Opinion networks: Opinion networks Webapges Other webpages 2 1 3 4 1 2 3 WWW Opinions of movie-goers on movies Customers Movies opinion 2 1 3 4 1 2 3 Storing opinions: Storing opinions Matrix of opinions IJ Network of opinions 9 8 8 1 2 Customers 2 1 3 4 1 2 3 MoviesUsing correlations to reconstruct customer’s tastes: Using correlations to reconstruct customer’s tastes Similar opinions similar tastes Simplest model: Movie-goers M-dimensional vector of tastes TI Movies M-dimensional vector of features FJ Opinions scalar product: IJ= TIFJ 9 8 8 1 2 Customers 2 1 3 4 1 2 3 MoviesLoop correlation: Loop correlation Predictive power 1/M(L-1)/2 One needs many loops to best reconstruct unknown opinions L=5 known opinions: Predictive power of an unknown opinion is 1/M2 An unknown opinionMain parameter: density of edges: Main parameter: density of edges The larger is the density of edges p the easier is the prediction At p1 1/N (N=Ncostomers+Nmovies) macroscopic prediction becomes possible. Nodes are connected but vectors TI and FJ are not fixed: ordinary percolation threshold At p2 2M/N > p1 all tastes and features (TI and FJ) can be uniquely reconstructed: rigidity percolation thresholdSlide29: Real empirical data (EachMovie dataset) on opinions of customers on movies: 5-star ratings of 1600 movies by 73000 users 1.6 million opinions!Spectral properties of : Spectral properties of For M<N the matrix IJ has N-M zero eigenvalues and M positive ones: = R R+. Using SVD one can “diagonalize” R = U D V+ such that matrices V and U are orthogonal V+ V = 1, U U+ = 1, and D is diagonal. Then = U D2 U+ The amount of information contained in : NM-M(M-1)/2 << N(N-1)/2 - the # of off-diagonal elementsRecursive algorithm for the prediction of unknown opinions: Recursive algorithm for the prediction of unknown opinions Start with 0 where all unknown elements are filled with <> (zero in our case) Diagonalize and keep only M largest eigenvalues and eigenvectors In the resulting truncated matrix ’0 replace all known elements with their exact values and go to step 1Convergence of the algorithm: Convergence of the algorithm Above p2 the algorithm exponentially converges to the exact values of unknown elements The rate of convergence scales as (p-p2)2Reality check: sources of errors: Reality check: sources of errors Customers are not rational! IJ= rIbJ + IJ(idiosyncrasy) Opinions are delivered to the matchmaker through a narrow channel: Binary channel SIJ = sign(IJ) : 1 or 0 (liked or not) Experience rated on a scale 1 to 5 or 1 to 10 at best If number of edges K, and size N are large, while M is small these errors could be reducedHow to determine M?: How to determine M? In real systems M is not fixed: there are always finer and finer details of tastes Given the number of known opinions K one should choose Meff K/(Nreaders+Nbooks) so that systems are below the second transition p2 tastes should be determined hierarchically Avoid overfitting: Avoid overfitting Divide known votes into training and test sets Select Meff so that to avoid overfitting !!! Knowledge networks in biology: Knowledge networks in biology Interacting biomolecules: key and lock principle Matrix of interactions (binding energies) IJ= kIlJ+ lIkJ Matchmaker (bioinformatics researcher) tries to guess yet unknown interactions based on the pattern of known ones Many experiments measure SIJ =(IJ-th) k(1) k(2) l(2) l(1)Collaborators:: Collaborators: Yi-Cheng Zhang – U. of Fribourg Marcel Blattner – U. of Fribourg Postdoc position: Postdoc position Looking for a postdoc to work in my group at Brookhaven National Laboratory in New York starting Fall 2005 Topic - large-scale properties of (mostly) bionetworks (partially supported by a NIH/NSF grant with Ariadne Genomics) E-mail CV and 3 letters of recommendation to: maslov@bnl.gov See www.cmth.bnl.gov/~maslovTHE END: THE ENDInformation networks: Information networks Why the research into properties of complex networks is so active lately? Biology: lots of large-scale experimental data is generated in the last 10 years: most of it is on the level of networks The explosive growth of information networks (WWW and the Internet) is what fuels it all (directly or indirectly)!Analysis: Analysis Derived for =0 Uses a strong mean field approximation that nodes that send current to and from the community have average Gi for the outside world (Gw=1) and community (Gc) In a true community both Ecw and Ewc are smaller than in randomized network but the effect depends on the competition between themSlide45: =0.15Networks with artificial communities: Networks with artificial communities To test we generate a scale-free network with an artificial community of Nc pre-selected nodes Use Metropolis Algorithm with H=-(# of intra-community nodes) and some inverse temperature Detailed balance: Slide47: Modules in networks and how to detect them using the Random walks/diffusion K. Eriksen, I. Simonsen, SM, K. Sneppen, PRL (2003)What is a module?: What is a module? Nodes in a given module (or community group or functional unit) tend to connect with other nodes in the same module Biology: proteins of the same function or sub-cellular localization WWW – websites on a common topic Internet – geography or organization (e.g. military)Do you see any modules here?: Do you see any modules here?Random walkers on a network: Random walkers on a network Study the behavior of many VIRTUAL random walkers on a network At each time step each random walker steps on a randomly selected neighbor They equilibrate to a steady state ni ~ ki (solid state physics: ni = const) Slow modes allow to detect modules and extreme edges Matrix formalism: Matrix formalismEigenvectors of the transfer matrix Tij: Eigenvectors of the transfer matrix TijSlide53: US Military Russia Slide54: 2 0.9626 RU RU RU RU CA RU RU ?? ?? US US US US ?? (US Department of Defence) 3 0.9561 ?? FR FR FR ?? FR ?? RU RU RU ?? ?? RU ?? 4 0.9523 US ?? US ?? ?? ?? ?? (US Navy) NZ NZ NZ NZ NZ NZ NZ 5. 0.9474 KR KR KR KR KR ?? KR UA UA UA UA UA UA UASlide55: Hacked site You do not have the permission to view this presentation. In order to view it, please contact the author of the presentation.
Santa Fe WWW July 2005 Misree Download Post to : URL : Related Presentations : Share Add to Flag Embed Email Send to Blogs and Networks Add to Channel Uploaded from authorPOINTLite Insert YouTube videos in PowerPont slides with aS Desktop Copy embed code: (To copy code, click on the text box) Embed: URL: Thumbnail: WordPress Embed Customize Embed The presentation is successfully added In Your Favorites. Views: 35 Category: Entertainment License: All Rights Reserved Like it (0) Dislike it (0) Added: November 21, 2007 This Presentation is Public Favorites: 0 Presentation Description No description available. Comments Posting comment... Premium member Presentation Transcript On Search, Ranking, and Matchmaking in Information Networks: On Search, Ranking, and Matchmaking in Information Networks Sergei Maslov Brookhaven National LaboratoryInformation networks: WWW and beyond: Information networks: WWW and beyond First part of my talk: 1010 webpages in the world: need to search and rank!! Second part: opinion networks WWW can be thought of as a network of opinions (hyperlinks – positive votes) Our choices and opinions on products, services and each other – a much larger opinion network! Very incomplete (sparse) one could use intelligent “matchmaking” to match users to new products or each otherRanking webpages: Ranking webpages Assign an “importance factor” Gi to every webpage Given a keyword (say “jaguar”) find all the pages that have it in their text and display them in the order of descending Gi. One solution still used in scientific publishing is Gi=Kin(i) (the number of incoming links), but: Too democratic: It doesn’t take into account the importance of nodes sending links Easy to trick and artificially boost the rankingHow Google works: How Google works Google’s recipe (circa 1998) is to simulate the behavior of many virtual “random surfers” PageRank: Gi ~ the number of virtual hits the page gets. It is also ~ the steady state number of random surfers at a given page Popular pages send more surfers your way PageRank ~ Kin weighted by the popularity of a source of each hyperlink Surfers get bored following links with probability =0.15 at any timestep a surfer jumps to a randomly selected page (not following any hyperlinks) Last rule also solves the ergodicity problemMathematics of the Google: Mathematics of the Google To calculate the PageRank Google solves a self-consistent Eq.: Gi ~ ji Gj /Kout (j) To account for random jumps: Gi = (1-) ji Gj /Kout (j) + j Gj/N = (1-) ji Gj /Kout (j) + (uses normalization: <G>=j Gj/N =1) Pages with Kout (j)=0 are removedMatrix formulation: Matrix formulation Equivalent to finding the principal eigenvector (with =1) of the matrix (1-) T + U, where Tij= 1/Kout (j) if j i and 0 otherwise, and Uij=1/N Could be easily solved iteratively by starting with Gi(0)=1 and repeating G(n+1)= (1-) T G(n)+ All Gi > Slide7: How Communities in the WWW influence the Google ranking H. Xie, K.-K. Yan, SM, cond-mat/0409087How do WWW communities influence their average Gi?: How do WWW communities influence their average Gi? Pages in a web-community preferentially link to each other. Examples: Pages from the same organization (e.g. SFI) Pages devoted to a common topic (e.g. physics) Pages in the same geographical location (e.g Santa Fe) Naïve argument: communities tend to “trap” random surfers to spend more time inside them they should increase the Google ranking of individual webpages in the communityTest of a naïve argument : log10(<G>c) # of intra-community links Test of a naïve argument Naïve argument is wrong! The effect could go either way Community #1 Community #2Slide10: Ecc EwwSlide11: Gc – average Google rank of pages in the community; Gw=1 – in the outside world Ecw Gc/<Kout>c – current from C to W It must be equal to: Ewc Gw/<Kout>w – current from W to C Thus Gc depends on the ratio between Ecw and Ewc – the number of edges (hyperlinks) between the community and the worldBalancing currents for nonzero : Balancing currents for nonzero Jcw=(1- ) Ecw Gc/<Kout>c + Gc Nc – current from C to W It must be equal to: Jcw=(1- ) Ewc Gw/<Kout> + Gw Nw(Nc/Nw) – current from W to CWhat are the consequences?: For very isolated communities (Ecw/E(r)cw< and Ewc/E(r)wc<) one has Gc=1. Their Google rank is decoupled from the outside world! Overall range: <Gc<1/ What are the consequences?WWW - the empirical data: WWW - the empirical data We have data for ~10 US universities (+ all UK and Australian Universities) Looked closely at Long Island University 4 large campuses 45,000 webpages and 160,000 hyperlinks After removing Kout=0 left with ~15,000 webpages and 90,000 links Can do a mini-Google PageRank on this set aloneLIU communities: LIU communities LI University has 4 campuses. We looked at one of them (CWP Campus) Ecw=1393; E(r)cw 16,000; Ecw/E (r) cw ~ 0.09 <=0.15 Ewc=336; E(r)wc 12,500; Ewc/E(r)wc ~ 0.03<=0.15 This community should be decoupled from the outside worldSlide16: Ratios are renormalized to Ecw/E (r) cw ~0.01 and Ewc/E(r)wc ~0.005 But: the community effect could be also strong!: But: the community effect could be also strong!Slide18: =0.001 Abnormally high PageRankTop PageRank LIU websites for =0.001 don’t make sense: Top PageRank LIU websites for =0.001 don’t make sense #1 www.cwpost.liu.edu/cwis/cwp/edu/edleader/higher_ed/ hear.html' #5 …/higher_ed/ index.html #9 …/higher_ed/courses.html World Strongly connected component Collaborators and postdoc info:: Collaborators and postdoc info: Collaborators: Huafeng Xie – City University of NY Koon-Kiu Yan - Stony Brook U. Looking for a postdoc to work in my group at Brookhaven National Laboratory in New York starting Fall/Winter 2005 or even 2006 Topics: Large-scale properties of (mostly) bionetworks (partially supported by a NIH/NSF grant with Ariadne Genomics) Internet/Google/Opinion networks E-mail CV and 3 letters of recommendation to: maslov@bnl.gov; See www.cmth.bnl.gov/~maslovPart 2: Opinion networks: Part 2: Opinion networks "Extracting Hidden Information from Knowledge Networks", S. Maslov, and Y-C. Zhang, Phys. Rev. Lett. (2001). "Exploring an opinion network for taste prediction: an empirical study", M. Blattner, Y.-C. Zhang, and S. Maslov, in preparation.Predicting customers’ tastes from their opinions on products: Predicting customers’ tastes from their opinions on products Each of us has personal tastes Information about them is contained in our opinions on products Matchmaking: opinions of customers with tastes similar to mine could be used to forecast my opinions on untested products Internet allows to do it on large scale (see amazon.com and many others)Opinion networks: Opinion networks Webapges Other webpages 2 1 3 4 1 2 3 WWW Opinions of movie-goers on movies Customers Movies opinion 2 1 3 4 1 2 3 Storing opinions: Storing opinions Matrix of opinions IJ Network of opinions 9 8 8 1 2 Customers 2 1 3 4 1 2 3 MoviesUsing correlations to reconstruct customer’s tastes: Using correlations to reconstruct customer’s tastes Similar opinions similar tastes Simplest model: Movie-goers M-dimensional vector of tastes TI Movies M-dimensional vector of features FJ Opinions scalar product: IJ= TIFJ 9 8 8 1 2 Customers 2 1 3 4 1 2 3 MoviesLoop correlation: Loop correlation Predictive power 1/M(L-1)/2 One needs many loops to best reconstruct unknown opinions L=5 known opinions: Predictive power of an unknown opinion is 1/M2 An unknown opinionMain parameter: density of edges: Main parameter: density of edges The larger is the density of edges p the easier is the prediction At p1 1/N (N=Ncostomers+Nmovies) macroscopic prediction becomes possible. Nodes are connected but vectors TI and FJ are not fixed: ordinary percolation threshold At p2 2M/N > p1 all tastes and features (TI and FJ) can be uniquely reconstructed: rigidity percolation thresholdSlide29: Real empirical data (EachMovie dataset) on opinions of customers on movies: 5-star ratings of 1600 movies by 73000 users 1.6 million opinions!Spectral properties of : Spectral properties of For M<N the matrix IJ has N-M zero eigenvalues and M positive ones: = R R+. Using SVD one can “diagonalize” R = U D V+ such that matrices V and U are orthogonal V+ V = 1, U U+ = 1, and D is diagonal. Then = U D2 U+ The amount of information contained in : NM-M(M-1)/2 << N(N-1)/2 - the # of off-diagonal elementsRecursive algorithm for the prediction of unknown opinions: Recursive algorithm for the prediction of unknown opinions Start with 0 where all unknown elements are filled with <> (zero in our case) Diagonalize and keep only M largest eigenvalues and eigenvectors In the resulting truncated matrix ’0 replace all known elements with their exact values and go to step 1Convergence of the algorithm: Convergence of the algorithm Above p2 the algorithm exponentially converges to the exact values of unknown elements The rate of convergence scales as (p-p2)2Reality check: sources of errors: Reality check: sources of errors Customers are not rational! IJ= rIbJ + IJ(idiosyncrasy) Opinions are delivered to the matchmaker through a narrow channel: Binary channel SIJ = sign(IJ) : 1 or 0 (liked or not) Experience rated on a scale 1 to 5 or 1 to 10 at best If number of edges K, and size N are large, while M is small these errors could be reducedHow to determine M?: How to determine M? In real systems M is not fixed: there are always finer and finer details of tastes Given the number of known opinions K one should choose Meff K/(Nreaders+Nbooks) so that systems are below the second transition p2 tastes should be determined hierarchically Avoid overfitting: Avoid overfitting Divide known votes into training and test sets Select Meff so that to avoid overfitting !!! Knowledge networks in biology: Knowledge networks in biology Interacting biomolecules: key and lock principle Matrix of interactions (binding energies) IJ= kIlJ+ lIkJ Matchmaker (bioinformatics researcher) tries to guess yet unknown interactions based on the pattern of known ones Many experiments measure SIJ =(IJ-th) k(1) k(2) l(2) l(1)Collaborators:: Collaborators: Yi-Cheng Zhang – U. of Fribourg Marcel Blattner – U. of Fribourg Postdoc position: Postdoc position Looking for a postdoc to work in my group at Brookhaven National Laboratory in New York starting Fall 2005 Topic - large-scale properties of (mostly) bionetworks (partially supported by a NIH/NSF grant with Ariadne Genomics) E-mail CV and 3 letters of recommendation to: maslov@bnl.gov See www.cmth.bnl.gov/~maslovTHE END: THE ENDInformation networks: Information networks Why the research into properties of complex networks is so active lately? Biology: lots of large-scale experimental data is generated in the last 10 years: most of it is on the level of networks The explosive growth of information networks (WWW and the Internet) is what fuels it all (directly or indirectly)!Analysis: Analysis Derived for =0 Uses a strong mean field approximation that nodes that send current to and from the community have average Gi for the outside world (Gw=1) and community (Gc) In a true community both Ecw and Ewc are smaller than in randomized network but the effect depends on the competition between themSlide45: =0.15Networks with artificial communities: Networks with artificial communities To test we generate a scale-free network with an artificial community of Nc pre-selected nodes Use Metropolis Algorithm with H=-(# of intra-community nodes) and some inverse temperature Detailed balance: Slide47: Modules in networks and how to detect them using the Random walks/diffusion K. Eriksen, I. Simonsen, SM, K. Sneppen, PRL (2003)What is a module?: What is a module? Nodes in a given module (or community group or functional unit) tend to connect with other nodes in the same module Biology: proteins of the same function or sub-cellular localization WWW – websites on a common topic Internet – geography or organization (e.g. military)Do you see any modules here?: Do you see any modules here?Random walkers on a network: Random walkers on a network Study the behavior of many VIRTUAL random walkers on a network At each time step each random walker steps on a randomly selected neighbor They equilibrate to a steady state ni ~ ki (solid state physics: ni = const) Slow modes allow to detect modules and extreme edges Matrix formalism: Matrix formalismEigenvectors of the transfer matrix Tij: Eigenvectors of the transfer matrix TijSlide53: US Military Russia Slide54: 2 0.9626 RU RU RU RU CA RU RU ?? ?? US US US US ?? (US Department of Defence) 3 0.9561 ?? FR FR FR ?? FR ?? RU RU RU ?? ?? RU ?? 4 0.9523 US ?? US ?? ?? ?? ?? (US Navy) NZ NZ NZ NZ NZ NZ NZ 5. 0.9474 KR KR KR KR KR ?? KR UA UA UA UA UA UA UASlide55: Hacked site