logging in or signing up msr_talk shwang5 Download Post to : URL : Related Presentations : Share Add to Flag Embed Email Send to Blogs and Networks Add to Channel Uploaded from authorPOINT lite Insert YouTube videos in PowerPont slides with aS Desktop Copy embed code: (To copy code, click on the text box) Embed: URL: Thumbnail: WordPress Embed Customize Embed The presentation is successfully added In Your Favorites. Views: 80 Category: Entertainment License: All Rights Reserved Like it (0) Dislike it (0) Added: June 30, 2010 This Presentation is Public Favorites: 0 Presentation Description No description available. Comments Posting comment... Premium member Presentation Transcript Software Development Meetsthe Wisdom of Crowds : Software Development Meetsthe Wisdom of Crowds Seung-won Hwang (POSTECH) Joint work w/ Jinhan Kim, Mu-woong Lee, Sanghoon Lee, Jong-won Roh (POSTECH) Sunghun Kim(HKUST) Talk Outline : Talk Outline Search (in general) On finding the “right” code example [AAAI10] On making code search “fast” [FSE10] [AAAI10] Jinhan Kim, Sanghoon Lee, Seung-won Hwang, Sunghun Kim: Towards an Intelligent Code Search Engine. AAAI 2010 [FSE10] Mu-woong Lee, Jongwon Roh, Seung-won Hwang, Sunghun Kim: Instant Code Clone Search. FSE 2010 Search : Search Search aims at finding the “right” needle in the haystack “right away” (~0.1sec) Commercial search engines build on haystacks of documents software code (today’s talk) On Defining “Right” Results : On Defining “Right” Results There exist dual (possibly conflicting) goals: Personalization – maximize the satisfaction of a particular user. Diversification – minimize the dissatisfaction of varying user intents. 4 Nikon D3x Good!! Good?? Nikon D3x On “Fast” Retrieval of “Right” Results : On “Fast” Retrieval of “Right” Results Personalization Mining hidden intents, personalized rank processing Diversification Categorization, clustering, algorithms 5 Example Better than Words: Product Search meets Wisdom of Crowds : For details, find us at DemoFest! (joint work w/ MSR Asia) Example Better than Words: Product Search meets Wisdom of Crowds 6 Showing related products with various relationship types and strength Type from Wisdom of Data: Mined from product features (85 GB tables extracted from the Web) Strength from Wisdom of Crowds: Mined from product co-occurrences (1.1M+ reviews) Going Back to:Software Development Meets Wisdom of Crowds : Search (in general) On finding the “right” code example [AAAI10] On making code search “fast” [FSE10] [AAAI10] Jinhan Kim, Sanghoon Lee, Seung-won Hwang, Sunghun Kim: Towards an Intelligent Code Search Engine. AAAI 2010 [FSE10] Mu-woong Lee, Jongwon Roh, Seung-won Hwang, Sunghun Kim: Instant Code Clone Search. FSE 2010 Going Back to:Software Development Meets Wisdom of Crowds Motivation:Developer Resources on the Web : Motivation:Developer Resources on the Web API Documents Example better than words Manual Compilation Examples:MSDN, books,… : Manual Compilation Examples:MSDN, books,… Labor intensive! Search Engine Harvesting (MSDN-quality) Examples from Code Corpus : Search Engine Harvesting (MSDN-quality) Examples from Code Corpus API name Open-source codebase Slide 11: Google code search, Koders.com They fail to show good summary, but rather highlight Comments Import statements Constructors Partial string matching results Baseline: Existing Search Engines Connection prepareStatement Slide 12: Example of MSDN List::insert Summarization+Ranking(instructive/concise) Ground-truth: MSDN Examples Slide 13: eXoa JavaDoc Repository Candidates API Query Query Result Summarized codes Summarization eXoa System Ranked Codes Features Representation Ranking Embedding Diversification EXample-Oriented API Docs Slide 14: Summarization Baseline: textual vicinity Ours: parse into AST and keep only the semantically relevant lines Declaring the input arguments of API Changing the values of the arguments Declaring the class of API EXample-Oriented API Docs Slide 15: Diversification Semantic-aware similarity function AST tree similarity (expensive) Tree similarity approximation using vector similarity [1] Adaptivity to the number of clusters EXample-Oriented API Docs … … characteristic vectors [1] { 2, 1, 2, 2} [1] L. Jiang et. al, DECKARD: Scalable and Accurate Tree-based Detection of Code Clone Slide 16: Ranking Aims at finding distinctive andrepresentative code example from each cluster Features used Representative : similarity to the centroid Conciseness : line of code Correctness : class type and argument type checking Will be extended to leverage large-scale user feedbacks (more on later) EXample-Oriented API Docs http://exoa.postech.ac.kr : http://exoa.postech.ac.kr How Does eXoa Compare to Others? : How Does eXoa Compare to Others? eXoaDocs vs. JavaDocs eXoaDocs vs. Google/Koders eXoaDocs vs. Human ranking Is eXoa Useful for Developers? : Is eXoa Useful for Developers? 24 subjects 12 subjects used eXoaDocs 12 subjects used JavaDocs 4 tasks Establish a connection to the database Create SQL statements Execute the SQL statements Present the query results Slide 20: Productivity Average completion time Average number of document lookup Is eXoa Useful for Developers? Talk Outline : Talk Outline Search (in general) On finding the “right” code example [AAAI10] On making code search “fast” [FSE10] [AAAI10] Jinhan Kim, Sanghoon Lee, Seung-won Hwang, Sunghun Kim: Towards an Intelligent Code Search Engine. AAAI 2010 [FSE10] Mu-woong Lee, Jongwon Roh, Seung-won Hwang, Sunghun Kim: Instant Code Clone Search. FSE 2010 Motivation : Motivation [2] J. Brandt et. al, Example-centric programming: integrating Web search into the development environments showing how (instant) availability of online code examples can change programming practices [2] Programming Context Enhancing Search? : Programming Context Enhancing Search? Current state-of-the-art Clone Detector Structural Similarity Off-line Code Search Engine Text similarity On-line “Post-mortem” analysis vs. preventive/personalized approach Slide 24: … … AST Characteristic vectors Each sub-tree containing at least T nodes Code Representation System Overview (Top-k Clones) : System Overview (Top-k Clones) 25 Characteristic vectors(High dimensional) Code corpus Dimension-reduced vectors Query code Reduced version clones Multi-dimensional index Curse of dimensionality [3] C. Faloutsos et. al, GEMINI: Generalized Multimedia Indexing candidates Step 1: Candidate filtering Step 2: Candidate ranking (original) Dimensionality Reduction : Goals Correctness: Candidates should include all the results Efficiency: Reduced space should preserve the distance relations between two vectors as much as possible NP-Hard Greedy approximation (see paper) Dimensionality Reduction 26 Indexing : R-tree 2D example (node capacity: 4) Indexing 27 Technical Contributions (Summary) : Technical Contributions (Summary) Index building (bulk loading) algorithm optimized for software search workloads Index traversal algorithms Filtering-then-ranking Interleaved + avoid redundant index traversals - accessing a larger number of candidates (loose search bound) + candidate access cost amortized with vector packing + I/O scheduling Approximation (x2~30 faster) Experimental Evaluation : Experimental Evaluation Environment Pentium IV 3.2 GHz CPU 1GB memory P-ATA HDD Linux, gcc Dataset JDK 1.6.0 update 13 7,195 java files, 2,075,573 LOC 400 Java open source projects Hosted on SourceForge, Tigris.org, and GoogleCode 288,846 java files, 54,709,384 LOC DECKARD characteristic vector generator Experimental Evaluation : Experimental Evaluation Index building time Experimental Evaluation : Experimental Evaluation Querying time over varying # of vectors (OSP) k = 20 Experimental Evaluation : Experimental Evaluation Querying time over varying k (OSP) # of vectors = 1,697K Experimental Evaluation : Experimental Evaluation Querying time over range R # of vectors = 784K Conclusion : Conclusion Search-driven approaches can contribute to software development by extracting high-quality code examples for 75%+ Java APIs by interleaving instant search during code development Open questions extracting/indexing more semantics from code interacting with other software haystacks (bugs, tests, specs, people) code completion? Questions? : Questions? http://www.postech.ac.kr/~swhwang You do not have the permission to view this presentation. In order to view it, please contact the author of the presentation.
msr_talk shwang5 Download Post to : URL : Related Presentations : Share Add to Flag Embed Email Send to Blogs and Networks Add to Channel Uploaded from authorPOINT lite Insert YouTube videos in PowerPont slides with aS Desktop Copy embed code: (To copy code, click on the text box) Embed: URL: Thumbnail: WordPress Embed Customize Embed The presentation is successfully added In Your Favorites. Views: 80 Category: Entertainment License: All Rights Reserved Like it (0) Dislike it (0) Added: June 30, 2010 This Presentation is Public Favorites: 0 Presentation Description No description available. Comments Posting comment... Premium member Presentation Transcript Software Development Meetsthe Wisdom of Crowds : Software Development Meetsthe Wisdom of Crowds Seung-won Hwang (POSTECH) Joint work w/ Jinhan Kim, Mu-woong Lee, Sanghoon Lee, Jong-won Roh (POSTECH) Sunghun Kim(HKUST) Talk Outline : Talk Outline Search (in general) On finding the “right” code example [AAAI10] On making code search “fast” [FSE10] [AAAI10] Jinhan Kim, Sanghoon Lee, Seung-won Hwang, Sunghun Kim: Towards an Intelligent Code Search Engine. AAAI 2010 [FSE10] Mu-woong Lee, Jongwon Roh, Seung-won Hwang, Sunghun Kim: Instant Code Clone Search. FSE 2010 Search : Search Search aims at finding the “right” needle in the haystack “right away” (~0.1sec) Commercial search engines build on haystacks of documents software code (today’s talk) On Defining “Right” Results : On Defining “Right” Results There exist dual (possibly conflicting) goals: Personalization – maximize the satisfaction of a particular user. Diversification – minimize the dissatisfaction of varying user intents. 4 Nikon D3x Good!! Good?? Nikon D3x On “Fast” Retrieval of “Right” Results : On “Fast” Retrieval of “Right” Results Personalization Mining hidden intents, personalized rank processing Diversification Categorization, clustering, algorithms 5 Example Better than Words: Product Search meets Wisdom of Crowds : For details, find us at DemoFest! (joint work w/ MSR Asia) Example Better than Words: Product Search meets Wisdom of Crowds 6 Showing related products with various relationship types and strength Type from Wisdom of Data: Mined from product features (85 GB tables extracted from the Web) Strength from Wisdom of Crowds: Mined from product co-occurrences (1.1M+ reviews) Going Back to:Software Development Meets Wisdom of Crowds : Search (in general) On finding the “right” code example [AAAI10] On making code search “fast” [FSE10] [AAAI10] Jinhan Kim, Sanghoon Lee, Seung-won Hwang, Sunghun Kim: Towards an Intelligent Code Search Engine. AAAI 2010 [FSE10] Mu-woong Lee, Jongwon Roh, Seung-won Hwang, Sunghun Kim: Instant Code Clone Search. FSE 2010 Going Back to:Software Development Meets Wisdom of Crowds Motivation:Developer Resources on the Web : Motivation:Developer Resources on the Web API Documents Example better than words Manual Compilation Examples:MSDN, books,… : Manual Compilation Examples:MSDN, books,… Labor intensive! Search Engine Harvesting (MSDN-quality) Examples from Code Corpus : Search Engine Harvesting (MSDN-quality) Examples from Code Corpus API name Open-source codebase Slide 11: Google code search, Koders.com They fail to show good summary, but rather highlight Comments Import statements Constructors Partial string matching results Baseline: Existing Search Engines Connection prepareStatement Slide 12: Example of MSDN List::insert Summarization+Ranking(instructive/concise) Ground-truth: MSDN Examples Slide 13: eXoa JavaDoc Repository Candidates API Query Query Result Summarized codes Summarization eXoa System Ranked Codes Features Representation Ranking Embedding Diversification EXample-Oriented API Docs Slide 14: Summarization Baseline: textual vicinity Ours: parse into AST and keep only the semantically relevant lines Declaring the input arguments of API Changing the values of the arguments Declaring the class of API EXample-Oriented API Docs Slide 15: Diversification Semantic-aware similarity function AST tree similarity (expensive) Tree similarity approximation using vector similarity [1] Adaptivity to the number of clusters EXample-Oriented API Docs … … characteristic vectors [1] { 2, 1, 2, 2} [1] L. Jiang et. al, DECKARD: Scalable and Accurate Tree-based Detection of Code Clone Slide 16: Ranking Aims at finding distinctive andrepresentative code example from each cluster Features used Representative : similarity to the centroid Conciseness : line of code Correctness : class type and argument type checking Will be extended to leverage large-scale user feedbacks (more on later) EXample-Oriented API Docs http://exoa.postech.ac.kr : http://exoa.postech.ac.kr How Does eXoa Compare to Others? : How Does eXoa Compare to Others? eXoaDocs vs. JavaDocs eXoaDocs vs. Google/Koders eXoaDocs vs. Human ranking Is eXoa Useful for Developers? : Is eXoa Useful for Developers? 24 subjects 12 subjects used eXoaDocs 12 subjects used JavaDocs 4 tasks Establish a connection to the database Create SQL statements Execute the SQL statements Present the query results Slide 20: Productivity Average completion time Average number of document lookup Is eXoa Useful for Developers? Talk Outline : Talk Outline Search (in general) On finding the “right” code example [AAAI10] On making code search “fast” [FSE10] [AAAI10] Jinhan Kim, Sanghoon Lee, Seung-won Hwang, Sunghun Kim: Towards an Intelligent Code Search Engine. AAAI 2010 [FSE10] Mu-woong Lee, Jongwon Roh, Seung-won Hwang, Sunghun Kim: Instant Code Clone Search. FSE 2010 Motivation : Motivation [2] J. Brandt et. al, Example-centric programming: integrating Web search into the development environments showing how (instant) availability of online code examples can change programming practices [2] Programming Context Enhancing Search? : Programming Context Enhancing Search? Current state-of-the-art Clone Detector Structural Similarity Off-line Code Search Engine Text similarity On-line “Post-mortem” analysis vs. preventive/personalized approach Slide 24: … … AST Characteristic vectors Each sub-tree containing at least T nodes Code Representation System Overview (Top-k Clones) : System Overview (Top-k Clones) 25 Characteristic vectors(High dimensional) Code corpus Dimension-reduced vectors Query code Reduced version clones Multi-dimensional index Curse of dimensionality [3] C. Faloutsos et. al, GEMINI: Generalized Multimedia Indexing candidates Step 1: Candidate filtering Step 2: Candidate ranking (original) Dimensionality Reduction : Goals Correctness: Candidates should include all the results Efficiency: Reduced space should preserve the distance relations between two vectors as much as possible NP-Hard Greedy approximation (see paper) Dimensionality Reduction 26 Indexing : R-tree 2D example (node capacity: 4) Indexing 27 Technical Contributions (Summary) : Technical Contributions (Summary) Index building (bulk loading) algorithm optimized for software search workloads Index traversal algorithms Filtering-then-ranking Interleaved + avoid redundant index traversals - accessing a larger number of candidates (loose search bound) + candidate access cost amortized with vector packing + I/O scheduling Approximation (x2~30 faster) Experimental Evaluation : Experimental Evaluation Environment Pentium IV 3.2 GHz CPU 1GB memory P-ATA HDD Linux, gcc Dataset JDK 1.6.0 update 13 7,195 java files, 2,075,573 LOC 400 Java open source projects Hosted on SourceForge, Tigris.org, and GoogleCode 288,846 java files, 54,709,384 LOC DECKARD characteristic vector generator Experimental Evaluation : Experimental Evaluation Index building time Experimental Evaluation : Experimental Evaluation Querying time over varying # of vectors (OSP) k = 20 Experimental Evaluation : Experimental Evaluation Querying time over varying k (OSP) # of vectors = 1,697K Experimental Evaluation : Experimental Evaluation Querying time over range R # of vectors = 784K Conclusion : Conclusion Search-driven approaches can contribute to software development by extracting high-quality code examples for 75%+ Java APIs by interleaving instant search during code development Open questions extracting/indexing more semantics from code interacting with other software haystacks (bugs, tests, specs, people) code completion? Questions? : Questions? http://www.postech.ac.kr/~swhwang