Public Datasets for Code Understanding Tasks
December 2021

Labeled data is typically scarce for code-understanding tasks, both for training and for evaluation. In this post, I share a number of datasets that ML/SE researchers have collected and (kindly) shared. I will keep this post up to date as I find new datasets. Please email me if a dataset is missing.

Python

  1. QuixBugs
    https://jkoppel.github.io/QuixBugs/
    https://jkoppel.github.io/QuixBugs/quixbugs.pdf
  2. BugSwarm
    http://www.bugswarm.org/dataset/
    https://web.cs.ucdavis.edu/~rubio/includes/icse19.pdf
  3. BugsInPy
    https://github.com/soarsmu/BugsInPy
    https://dl.acm.org/doi/pdf/10.1145/3368089.3417943
  4. Refactory
    https://github.com/githubhuyang/refactory
    https://ieeexplore.ieee.org/abstract/document/8952522
  5. Misc.
    https://www.kaggle.com/saitejaponugoti/deepbugs-for-python

Java

  1. XCorpus
    https://bitbucket.org/jensdietrich/xcorpus/src/master
    http://www.jot.fm/issues/issue_2017_04/article1.pdf