Public Datasets for Code Understanding Tasks
December, 2021
Labeled data is typically scarce for code-understanding tasks, both for training and for evaluation. In this post, I share a number of datasets that ML/SE researchers have collected and (kindly) shared. I will keep this post up to date as I find new datasets. Please email me if a dataset is missing.
Python
- QuixBugs
https://jkoppel.github.io/QuixBugs/
https://jkoppel.github.io/QuixBugs/quixbugs.pdf
- BugSwarm
http://www.bugswarm.org/dataset/
https://web.cs.ucdavis.edu/~rubio/includes/icse19.pdf
- BugsInPy
https://github.com/soarsmu/BugsInPy
https://dl.acm.org/doi/pdf/10.1145/3368089.3417943
- refactory
https://github.com/githubhuyang/refactory
https://ieeexplore.ieee.org/abstract/document/8952522
- Misc.
https://www.kaggle.com/saitejaponugoti/deepbugs-for-python
Java
- XCorpus
https://bitbucket.org/jensdietrich/xcorpus/src/master
http://www.jot.fm/issues/issue_2017_04/article1.pdf