Large repositories of source code create new challenges and opportunities for statistical machine learning. Here we first develop Sourcerer, an infrastructure for the automated crawling, parsing, and database storage of open source software. Sourcerer allows us to gather Internet-scale source code. For instance, in one experiment, we gather 4,632 Java projects from SourceForge and Apache totaling over 38 million lines of code from 9,250 developers. Simple statistical analyses of the data first reveal robust power-law behavior for package, SLOC, and lexical containment distributions. We then develop and apply unsupervised, probabilistic author-topic models to automatically discover the topics embedded in the code and extract topic-word and a...
Language component plays an important role in data/information retrieval. Data retrieval in ...
Mining source code has become a common task for researchers and yielded significant benefits for th...
In today’s software-centric world, ultra-large-scale software repositories, e.g. SourceForge, GitHub...
Large repositories of source code create new challenges and opportunities for statist...
Software repositories contain a vast wealth of information about software development. Mining these ...
Sourcerer is a search engine for open source code that extracts fine-grained structural information ...
Software repositories, such as source code, email archives, and bug databases, contain unstructured ...
Program understanding aims at discovering human-readable properties of a softw...
A large number of open source projects are hosted on the Internet by popular repository sites like G...
The advancements in machine learning techniques have encouraged researchers to apply these technique...
This dataset contains 703 anonymized developers extracted from 17 open-source projects from GitHub. ...
Mining software repositories provides developers and researchers a chance to learn from previous dev...
Exploring linguistic topics in source code is a program comprehension activity that shows ...
The primary goal of software development is to deliver Optimal Software, i.e., software produced at...
Unigram is a fundamental element of n-gram in natural language processing. However, unigrams collect...