msr_talk

Views:
 
Category: Entertainment
     
 

Presentation Description

No description available.

Comments

Presentation Transcript

Software Development Meetsthe Wisdom of Crowds : 

Software Development Meetsthe Wisdom of Crowds Seung-won Hwang (POSTECH) Joint work w/ Jinhan Kim, Mu-woong Lee, Sanghoon Lee, Jong-won Roh (POSTECH) Sunghun Kim(HKUST)

Talk Outline : 

Talk Outline Search (in general) On finding the “right” code example [AAAI10] On making code search “fast” [FSE10] [AAAI10] Jinhan Kim, Sanghoon Lee, Seung-won Hwang, Sunghun Kim: Towards an Intelligent Code Search Engine. AAAI 2010 [FSE10] Mu-woong Lee, Jongwon Roh, Seung-won Hwang, Sunghun Kim: Instant Code Clone Search. FSE 2010

Search : 

Search Search aims at finding the “right” needle in the haystack “right away” (~0.1sec) Commercial search engines build on haystacks of documents software code (today’s talk)

On Defining “Right” Results : 

On Defining “Right” Results There exist dual (possibly conflicting) goals: Personalization – maximize the satisfaction of a particular user. Diversification – minimize the dissatisfaction of varying user intents. 4 Nikon D3x Good!! Good?? Nikon D3x

On “Fast” Retrieval of “Right” Results : 

On “Fast” Retrieval of “Right” Results Personalization Mining hidden intents, personalized rank processing Diversification Categorization, clustering, algorithms 5

Example Better than Words: Product Search meets Wisdom of Crowds : 

For details, find us at DemoFest! (joint work w/ MSR Asia) Example Better than Words: Product Search meets Wisdom of Crowds 6 Showing related products with various relationship types and strength Type from Wisdom of Data: Mined from product features (85 GB tables extracted from the Web) Strength from Wisdom of Crowds: Mined from product co-occurrences (1.1M+ reviews)

Going Back to:Software Development Meets Wisdom of Crowds : 

Search (in general) On finding the “right” code example [AAAI10] On making code search “fast” [FSE10] [AAAI10] Jinhan Kim, Sanghoon Lee, Seung-won Hwang, Sunghun Kim: Towards an Intelligent Code Search Engine. AAAI 2010 [FSE10] Mu-woong Lee, Jongwon Roh, Seung-won Hwang, Sunghun Kim: Instant Code Clone Search. FSE 2010 Going Back to:Software Development Meets Wisdom of Crowds

Motivation:Developer Resources on the Web : 

Motivation:Developer Resources on the Web API Documents Example better than words

Manual Compilation Examples:MSDN, books,… : 

Manual Compilation Examples:MSDN, books,… Labor intensive!

Search Engine Harvesting (MSDN-quality) Examples from Code Corpus : 

Search Engine Harvesting (MSDN-quality) Examples from Code Corpus API name Open-source codebase

Slide 11: 

Google code search, Koders.com They fail to show good summary, but rather highlight Comments Import statements Constructors Partial string matching results Baseline: Existing Search Engines Connection prepareStatement

Slide 12: 

Example of MSDN List::insert Summarization+Ranking(instructive/concise) Ground-truth: MSDN Examples

Slide 13: 

eXoa JavaDoc Repository Candidates API Query Query Result Summarized codes Summarization eXoa System Ranked Codes Features Representation Ranking Embedding Diversification EXample-Oriented API Docs

Slide 14: 

Summarization Baseline: textual vicinity Ours: parse into AST and keep only the semantically relevant lines Declaring the input arguments of API Changing the values of the arguments Declaring the class of API EXample-Oriented API Docs

Slide 15: 

Diversification Semantic-aware similarity function AST tree similarity (expensive) Tree similarity approximation using vector similarity [1] Adaptivity to the number of clusters EXample-Oriented API Docs … … characteristic vectors [1] { 2, 1, 2, 2} [1] L. Jiang et. al, DECKARD: Scalable and Accurate Tree-based Detection of Code Clone

Slide 16: 

Ranking Aims at finding distinctive andrepresentative code example from each cluster Features used Representative : similarity to the centroid Conciseness : line of code Correctness : class type and argument type checking Will be extended to leverage large-scale user feedbacks (more on later) EXample-Oriented API Docs

http://exoa.postech.ac.kr : 

http://exoa.postech.ac.kr

How Does eXoa Compare to Others? : 

How Does eXoa Compare to Others? eXoaDocs vs. JavaDocs eXoaDocs vs. Google/Koders eXoaDocs vs. Human ranking

Is eXoa Useful for Developers? : 

Is eXoa Useful for Developers? 24 subjects 12 subjects used eXoaDocs 12 subjects used JavaDocs 4 tasks Establish a connection to the database Create SQL statements Execute the SQL statements Present the query results

Slide 20: 

Productivity Average completion time Average number of document lookup Is eXoa Useful for Developers?

Talk Outline : 

Talk Outline Search (in general) On finding the “right” code example [AAAI10] On making code search “fast” [FSE10] [AAAI10] Jinhan Kim, Sanghoon Lee, Seung-won Hwang, Sunghun Kim: Towards an Intelligent Code Search Engine. AAAI 2010 [FSE10] Mu-woong Lee, Jongwon Roh, Seung-won Hwang, Sunghun Kim: Instant Code Clone Search. FSE 2010

Motivation : 

Motivation [2] J. Brandt et. al, Example-centric programming: integrating Web search into the development environments showing how (instant) availability of online code examples can change programming practices [2]

Programming Context Enhancing Search? : 

Programming Context Enhancing Search? Current state-of-the-art Clone Detector Structural Similarity Off-line Code Search Engine Text similarity On-line “Post-mortem” analysis vs. preventive/personalized approach

Slide 24: 

… … AST Characteristic vectors Each sub-tree containing at least T nodes Code Representation

System Overview (Top-k Clones) : 

System Overview (Top-k Clones) 25 Characteristic vectors(High dimensional) Code corpus Dimension-reduced vectors Query code Reduced version clones Multi-dimensional index Curse of dimensionality [3] C. Faloutsos et. al, GEMINI: Generalized Multimedia Indexing candidates Step 1: Candidate filtering Step 2: Candidate ranking (original)

Dimensionality Reduction : 

Goals Correctness: Candidates should include all the results Efficiency: Reduced space should preserve the distance relations between two vectors as much as possible NP-Hard  Greedy approximation (see paper) Dimensionality Reduction 26

Indexing : 

R-tree 2D example (node capacity: 4) Indexing 27

Technical Contributions (Summary) : 

Technical Contributions (Summary) Index building (bulk loading) algorithm optimized for software search workloads Index traversal algorithms Filtering-then-ranking Interleaved + avoid redundant index traversals - accessing a larger number of candidates (loose search bound) + candidate access cost amortized with vector packing + I/O scheduling Approximation (x2~30 faster)

Experimental Evaluation : 

Experimental Evaluation Environment Pentium IV 3.2 GHz CPU 1GB memory P-ATA HDD Linux, gcc Dataset JDK 1.6.0 update 13 7,195 java files, 2,075,573 LOC 400 Java open source projects Hosted on SourceForge, Tigris.org, and GoogleCode 288,846 java files, 54,709,384 LOC DECKARD characteristic vector generator

Experimental Evaluation : 

Experimental Evaluation Index building time

Experimental Evaluation : 

Experimental Evaluation Querying time over varying # of vectors (OSP) k = 20

Experimental Evaluation : 

Experimental Evaluation Querying time over varying k (OSP) # of vectors = 1,697K

Experimental Evaluation : 

Experimental Evaluation Querying time over range R # of vectors = 784K

Conclusion : 

Conclusion Search-driven approaches can contribute to software development by extracting high-quality code examples for 75%+ Java APIs by interleaving instant search during code development Open questions extracting/indexing more semantics from code interacting with other software haystacks (bugs, tests, specs, people) code completion?

Questions? : 

Questions? http://www.postech.ac.kr/~swhwang