Public Datasets for Code Understanding Tasks
December 2021
Labeled data is typically scarce for code-understanding tasks, both for training and for evaluation. In this post, I share a number of datasets that ML/SE researchers have collected and (kindly) shared. I will keep this post up to date as I find new datasets. Please email me if a dataset is missing.
Python
- QuixBugs
 https://jkoppel.github.io/QuixBugs/
 https://jkoppel.github.io/QuixBugs/quixbugs.pdf
- BugSwarm
 http://www.bugswarm.org/dataset/
 https://web.cs.ucdavis.edu/~rubio/includes/icse19.pdf
- BugsInPy
 https://github.com/soarsmu/BugsInPy
 https://dl.acm.org/doi/pdf/10.1145/3368089.3417943
- refactory
 https://github.com/githubhuyang/refactory
 https://ieeexplore.ieee.org/abstract/document/8952522
- Misc.
 https://www.kaggle.com/saitejaponugoti/deepbugs-for-python
Java
- XCorpus
 https://bitbucket.org/jensdietrich/xcorpus/src/master
 http://www.jot.fm/issues/issue_2017_04/article1.pdf