Random Notes

Quick Deep Learning Notes January, 2022

When I started exploring deep learning, I had to look up different sources for each topic; there are so many of them. Now that I have come to a conclusion about sources that work for me, I share some of these (hand-written) notes in case somebody is looking for fast-paced yet A-to-Z tutorials on Deep Learning and PyTorch. Although typing all the notes is a thing now, I still prefer hand-written notes. So, please excuse my scribbles and doodles!

References:


Neurips 2021 ML4Code-related Papers January, 2022

Many interesting papers appear every year at Neurips. Only a small subset are considered ml4code or of interest to ml4code community. So, to make life easier, I made a list.


Deep Learning for Code: a Collection of Papers December, 2021

In this post, I try to list every ML/DL paper that target code understanding, code representation, and bug finding. Despite my efforts to compile an exhaustive list, there are definitely ones that I have missed. Please email me if you find a paper that is missing. I will keep this post up-to-date as I continue my studies.

Every top-level category consists of the papers that were published in a specific year. Under some of the paper descriptions, I have added a few keywords about the ideas and tasks.

2021

  1. Language-agnostic representation learning of source code from structure and context
    Zugner, Daniel and Kirschstein, Tobias and Catasta, Michele and Leskovec, Jure and Gunnemann, Stephan
    arXiv preprint arXiv:2103.11318
    ICLR
    {model: , tasks: , representation: , highlights: }
  2. Evaluating large language models trained on code
    Chen, Mark and Tworek, Jerry and Jun, Heewoo and Yuan, Qiming and Ponde, Henrique and Kaplan, Jared and Edwards, Harri and Burda, Yura and Joseph, Nicholas and Brockman, Greg and others
    arXiv preprint arXiv:2107.03374
  3. More with less: Exploring how to use deep learning effectively through semi-supervised learning for automatic bug detection in student code.
    Shi, Yang and Mao, Ye and Barnes, Tiffany and Chi, Min and Price, Thomas W
    In Proceedings of the 14th International Conference on Educational Data Mining (EDM)
  4. Fast and memory-efficient neural code completion
    Svyatkovskiy, Alexey and Lee, Sebastian and Hadjitofi, Anna and Riechert, Maik and Franco, Juliana Vicente and Allamanis, Miltiadis
    2021 IEEE/ACM 18th International Conference on Mining Software Repositories (MSR)
    {model: , tasks: code completion, representation: embeds subtokens, highlights: }
  5. CCMC: Code Completion with a Memory Mechanism and a Copy Mechanism
    Yang, Hao and Kuang, Li
    Evaluation and Assessment in Software Engineering
    {model: transformer-XL, tasks: , representation: sequence of AST nodes in in-order DFS fashion, highlights: long-range dependencies but consumes a lot of memory and compute resources}
  6. Studying the usage of text-to-text transfer transformer to support code-related tasks
    Mastropaolo, Antonio and Scalabrino, Simone and Cooper, Nathan and Palacio, David Nader and Poshyvanyk, Denys and Oliveto, Rocco and Bavota, Gabriele
    2021 IEEE/ACM 43rd International Conference on Software Engineering (ICSE)
  7. InferCode: Self-Supervised Learning of Code Representations by Predicting Subtrees
    Bui, Nghi DQ and Yu, Yijun and Jiang, Lingxiao
    2021 IEEE/ACM 43rd International Conference on Software Engineering (ICSE)
    {model: Tree-based CNN, tasks: , representation: subtrees of an AST--traverses AST and for specific nodes such as stmts extracts subtree rooted at the visited node then makes a vocab of subtrees-->so the model's job would be to predict subtrees given an AST, highlights: unsupervised so helps with scarcity of labeled data, pretraining}
  8. PSIMiner: A Tool for Mining Rich Abstract Syntax Trees from Code
    Spirin, Egor and Bogomolov, Egor and Kovalenko, Vladimir and Bryksin, Timofey
    arXiv preprint arXiv:2103.12778
  9. A Survey on Software Defect Prediction Using Deep Learning
    Akimova, Elena N and Bersenev, Alexander Yu and Deikov, Artem A and Kobylkin, Konstantin S and Konygin, Anton V and Mezentsev, Ilya P and Misilov, Vladimir E
    Multidisciplinary Digital Publishing Institute
  10. A large-scale benchmark for few-shot program induction and synthesis
    Alet, Ferran and Lopez-Contreras, Javier and Koppel, James and Nye, Maxwell and Solar-Lezama, Armando and Lozano-Perez, Tomas and Kaelbling, Leslie and Tenenbaum, Joshua
    International Conference on Machine Learning
  11. On the Effectiveness of Deep Vulnerability Detectors to Simple Stupid Bug Detection
    Hua, Jiayi and Wang, Haoyu
    2021 IEEE/ACM 18th International Conference on Mining Software Repositories (MSR)
  12. BERT2Code: Can Pretrained Language Models be Leveraged for Code Search?
    Ishtiaq, Abdullah Al and Hasan, Masum and Haque, Md and Anjum, Mahim and Mehrab, Kazi Sajeed and Muttaqueen, Tanveer and Hasan, Tahmid and Iqbal, Anindya and Shahriyar, Rifat
    arXiv preprint arXiv:2104.08017
    {model: code2vec/codebert to embed source code and a simple NN with 2 hidden layers to embed NL query, tasks: , representation: , highlights: learns a mapping between NL embeddings and code embeddings}
  13. On the generalizability of Neural Program Models with respect to semantic-preserving program transformations
    Rabin, Md Rafiqul Islam and Bui, Nghi DQ and Wang, Ke and Yu, Yijun and Jiang, Lingxiao and Alipour, Mohammad Amin
    Information and Software Technology
    {model: , tasks: , representation: , highlights: robustness study}
  14. Do Transformers Really Perform Bad for Graph Representation?
    Ying, Chengxuan and Cai, Tianle and Luo, Shengjie and Zheng, Shuxin and Ke, Guolin and He, Di and Shen, Yanming and Liu, Tie-Yan
    arXiv preprint arXiv:2106.05234
    {model: transformer for graphs, tasks: , representation: , highlights: they call it Graphormer}
  15. TFix: Learning to Fix Coding Errors with a Text-to-Text Transformer
    Berabi, Berkay and He, Jingxuan and Raychev, Veselin and Vechev, Martin
    International Conference on Machine Learning (PMLR)
    {model: transformer, tasks: fixing code errors, representation: sequence of tokens, features: }
  16. Generating Adversarial Computer Programs using Optimized Obfuscations
    Srikant, Shashank and Liu, Sijia and Mitrovska, Tamara and Chang, Shiyu and Fan, Quanfu and Zhang, Gaoyuan and O'Reilly, Una-May
    arXiv preprint arXiv:2103.11882
    {model: , tasks: , representation: , highlights: focuses on adversarial robustness}
  17. Self-Supervised Bug Detection and Repair
    Allamanis, Miltiadis and Jackson-Flux, Henry and Brockschmidt, Marc
    arXiv preprint arXiv:2105.12787
    {model: GNN, tasks: , representation: AST augmented with control and dataflow edges, features: defines the graph as entities and relations, works better than CuBERT and GREAT on real bugs, self-supervised.}
  18. How could Neural Networks understand Programs?
    Peng, Dinglan and Zheng, Shuxin and Li, Yatao and Ke, Guolin and He, Di and Liu, Tie-Yan
    arXiv preprint arXiv:2105.04297
    {model: transformer, tasks: , representation: control flow of LLVM IR, highlights: }
  19. Code prediction by feeding trees to transformers
    Kim, Seohyun and Zhao, Jinman and Tian, Yuchi and Chandra, Satish
    2021 IEEE/ACM 43rd International Conference on Software Engineering (ICSE)
    {model: transformer, tasks: , representation: 1) token sequence or 2) AST node sequence in pre-order fashion or 3) root-to-leaf paths, highlights: makes the transformer aware of the syntactic structure of code i.e. improves self attention block similar to GREAT}
  20. CLSEBERT: Contrastive Learning for Syntax Enhanced Code Pre-Trained Model
    Wang, Xin and Wang, Yasheng and Zhou, Pingyi and Xiao, Meng and Wang, Yadao and Li, Li and Liu, Xiao and Wu, Hao and Liu, Jin and Jiang, Xi
    arXiv preprint arXiv:2108.04556
    {model: , tasks: , representation: AST as sequence, highlights: pretraining, noise invariant code representation using contrastive learning by introducing noise into input sequence at training time, focus on robustness}
  21. CoTexT: Multi-task Learning with Code-Text Transformer
    Phan, Long and Tran, Hieu and Le, Daniel and Nguyen, Hieu and Anibal, James and Peltekian, Alec and Ye, Yanfang
    arXiv preprint arXiv:2105.08645
    {model: transformer, tasks: , representation: sequence of tokens, highlights: focuses on NL-PL tasks, pretraining}
  22. A Mocktail of Source Code Representations
    Vagavolu, Dheeraj and Swarna, Karthik Chandra and Chimalakonda, Sridhar
    arXiv preprint arXiv:2106.10918
    {model: , tasks: , representation: AST+CFG+PDG, highlights: an extension of code2vec}
  23. Program Synthesis with Large Language Models
    Austin, Jacob and Odena, Augustus and Nye, Maxwell and Bosma, Maarten and Michalewski, Henryk and Dohan, David and Jiang, Ellen and Cai, Carrie and Terry, Michael and Le, Quoc and Sutton, Charles
    arXiv preprint arXiv:2108.07732
  24. Automatic Code Generation using Pre-Trained Language Models
    Ottens, Lizi and Perez, Luis and Viswanathan, Sudharshan
    arXiv preprint arXiv:2102.10535
  25. SySeVR: A framework for using deep learning to detect software vulnerabilities
    Li, Zhen and Zou, Deqing and Xu, Shouhuai and Jin, Hai and Zhu, Yawei and Chen, Zhaoxuan
    IEEE Transactions on Dependable and Secure Computing

2020

  1. Structural language models of code
    Alon, Uri and Sadaka, Roy and Levy, Omer and Yahav, Eran
    International Conference on Machine Learning
    {model: , tasks: code generation, representation: paths from the root and leaves in AST, features: copy mechanism}
  2. DL-Droid: Deep learning based android malware detection using real devices
    Alzaylaee, Mohammed K and Yerima, Suleiman Y and Sezer, Sakir
    Computers & Security, Elsevier
    {model: , tasks: malware detection, representation: , features: hand-engineered and heuristic-based}
  3. DRAST--A Deep Learning and AST Based Approach for Bug Localization
    Sangle, Shubham and Muvva, Sandeep and Chimalakonda, Sridhar and Ponnalagu, Karthikeyan and Venkoparao, Vijendran Gopalan
    arXiv preprint arXiv:2011.03449
    {model: , tasks: , representation: AST, highlights: }
  4. Backdoors in neural models of source code
    Ramakrishnan, Goutham and Albarghouthi, Aws
    arXiv preprint arXiv:2006.06841
    {model: , tasks: , representation: , highlights: adversarial robustness}
  5. Adversarial examples for models of code
    Yefet, Noam and Alon, Uri and Yahav, Eran
    Proceedings of the ACM on Programming Languages (OOPSLA)
    {model: , tasks: , representation: , highlights: adversarial robustness}
  6. Semantic robustness of models of source code
    Ramakrishnan, Goutham and Henkel, Jordan and Wang, Zi and Albarghouthi, Aws and Jha, Somesh and Reps, Thomas
    arXiv preprint arXiv:2002.03043
    {model: , tasks: , representation: , highlights: focuses on semantic robustness and training with semantic-preserving code transformations}
  7. Software vulnerability detection using deep neural networks: A survey
    Lin, Guanjun and Wen, Sheng and Han, Qing-Long and Zhang, Jun and Xiang, Yang
    Proceedings of the IEEE
  8. Approaches for Representing Software as Graphs for Machine Learning Applications
    Romanov, Vitaly and Ivanov, Vladimir and Succi, Giancarlo
    2020 International Computer Symposium (ICS)
  9. TranS^3: A transformer-based framework for unifying code summarization and code search
    Wang, Wenhua and Zhang, Yuqun and Zeng, Zhengran and Xu, Guandong
    arXiv preprint arXiv:2003.03238
    {model: transformer, tasks: code search and summarization, representation: AST, highlights: unifying framework for both code searching and summarization}
  10. Multi-task learning based pre-trained language model for code completion
    Liu, Fang and Li, Ge and Zhao, Yunfei and Jin, Zhi
    Proceedings of the 35th IEEE/ACM International Conference on Automated Software Engineering
    {model: transformer, tasks: code completion, representation: , highlights: pretraining}
  11. Language Modelling for Source Code with Transformer-XL
    Dowdell, Thomas and Zhang, Hongyu
    arXiv preprint arXiv:2007.15813
    {model: transformer-XL, tasks: , representation: , highlights: language modeling for source code, increase context size}
  12. Big code != big vocabulary: Open-vocabulary models for source code
    Karampatsis, Rafael-Michael and Babii, Hlib and Robbes, Romain and Sutton, Charles and Janes, Andrea
    2020 IEEE/ACM 42nd International Conference on Software Engineering (ICSE)
  13. Scelmo: Source code embeddings from language models
    Karampatsis, Rafael-Michael and Sutton, Charles
    arXiv preprint arXiv:2004.13214
  14. DeepVS: an efficient and generic approach for source code modelling usage
    Hussain, Yasir and Huang, Zhiqiu and Zhou, Yu and Wang, Senzhang
    Electronics Letters, Wiley Online Library
  15. Dlfix: Context-based code transformation learning for automated program repair
    Li, Yi and Wang, Shaohua and Nguyen, Tien N
    Proceedings of the ACM/IEEE 42nd International Conference on Software Engineering
    {model: tree-based LSTM, tasks: , representation: AST, highlights: learning transformations to fix code instead of seq2seq}
  16. Compiler-based graph representations for deep learning models of code
    Brauckmann, Alexander and Goens, Andr{
    'es and Ertel, Sebastian and Castrillon, Jeronimo
    Proceedings of the 29th International Conference on Compiler Construction
    {model: GNN, tasks: , representation: AST and control flow edges, highlights: }
  17. Deep learning for source code modeling and generation: Models, applications, and challenges
    Le, Triet HM and Chen, Hao and Babar, Muhammad Ali
    ACM Computing Surveys (CSUR)
  18. Duplicate bug report detection and classification system based on deep learning technique
    Kukkar, Ashima and Mohana, Rajni and Kumar, Yugal and Nayyar, Anand and Bilal, Muhammad and Kwak, Kyung-Sup
    IEEE Access
  19. A self-attentional neural architecture for code completion with multi-task learning
    Liu, Fang and Li, Ge and Wei, Bolin and Xia, Xin and Fu, Zhiyi and Jin, Zhi
    Proceedings of the 28th International Conference on Program Comprehension
    {model: , tasks: code completion, representation: AST nodes as an ordered sequences to root, highlights: }
  20. A transformer-based approach for source code summarization
    Ahmad, Wasi Uddin and Chakraborty, Saikat and Ray, Baishakhi and Chang, Kai-Wei
    arXiv preprint arXiv:2005.00653
    {model: transformer, tasks: code summarization, representation: pairwise relationships between tokens based on AST, features: long-range}
  21. Software defect prediction via transformer
    Zhang, Qihang and Wu, Bin
    2020 IEEE 4th Information Technology, Networking, Electronic and Automation Control Conference (ITNEC)
    {model: transformer, tasks: , representation: AST, highlights: }
  22. Adversarial robustness for code
    Bielik, Pavol and Vechev, Martin
    International Conference on Machine Learning (PMLR)
    {model: , tasks: , representation: , highlights: focus on adversarial robustness of code}
  23. Modular tree network for source code representation learning
    Wang, Wenhan and Li, Ge and Shen, Sijie and Xia, Xin and Jin, Zhi
    ACM Transactions on Software Engineering and Methodology (TOSEM)
    {model: tree-LSTM, tasks: , representation: AST, highlights: it is a modular tree network extracted from AST}
  24. IntelliCode Compose: Code generation using transformer
    Svyatkovskiy, Alexey and Deng, Shao Kun and Fu, Shengyu and Sundaresan, Neel
    Proceedings of the 28th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering
    {model: transformer, tasks: code generation, representation: , highlights: they call the model GPT-C}
  25. Automated vulnerability detection in source code using minimum intermediate representation learning
    Li, Xin and Wang, Lu and Xin, Yang and Yang, Yixian and Chen, Yuling
    Applied Sciences, Multidisciplinary Digital Publishing Institute
  26. Hoppity: Learning graph transformations to detect and fix bugs in programs
    Dinella, Elizabeth and Dai, Hanjun and Li, Ziyang and Naik, Mayur and Song, Le and Wang, Ke
    International Conference on Learning Representations (ICLR)
    {model: GNN, tasks: fixing bugs, representation: AST with subtoken cache, highlights: }
  27. CodeBERT: A pre-trained model for programming and natural languages
    Feng, Zhangyin and Guo, Daya and Tang, Duyu and Duan, Nan and Feng, Xiaocheng and Gong, Ming and Shou, Linjun and Qin, Bing and Liu, Ting and Jiang, Daxin and others
    arXiv preprint arXiv:2002.08155
    {model: transformer, tasks: code search, representation: , highlights: bimodal pretrained model for PL/NL}
  28. GraphcodeBERT: Pre-training code representations with data flow
    Guo, Daya and Ren, Shuo and Lu, Shuai and Feng, Zhangyin and Tang, Duyu and Liu, Shujie and Zhou, Long and Duan, Nan and Svyatkovskiy, Alexey and Fu, Shengyu and others
    arXiv preprint arXiv:2009.08366
    {model: transformer, tasks: code refinement, representation: , highlights: pretraining, similar to codebert but uses dataflow info at pretraining}

2019

  1. Synthetic datasets for neural program synthesis
    Shin, Richard and Kant, Neel and Gupta, Kavi and Bender, Christopher and Trabucco, Brandon and Singh, Rishabh and Song, Dawn
    arXiv preprint arXiv:1912.12345
    2019
  2. PathMiner: a library for mining of path-based representations of code
    Kovalenko, Vladimir and Bogomolov, Egor and Bryksin, Timofey and Bacchelli, Alberto
    2019 IEEE/ACM 16th International Conference on Mining Software Repositories (MSR)
    2019
  3. A hybrid deep learning image-based analysis for effective malware detection
    Venkatraman, Sitalakshmi and Alazab, Mamoun and Vinayakumar, R
    Journal of Information Security and Applications, Elsevier
    2019
  4. Pre-trained language model representations for language generation
    Edunov, Sergey and Baevski, Alexei and Auli, Michael
    arXiv preprint arXiv:1903.09722
    2019
  5. Multi-modal attention network learning for semantic source code retrieval
    Wan, Yao and Shu, Jingdong and Sui, Yulei and Xu, Guandong and Zhao, Zhou and Wu, Jian and Yu, Philip S
    arXiv preprint arXiv:1909.13516
    2019
  6. A zero-positive learning approach for diagnosing software performance regressions
    Alam, Mejbah and Gottschlich, Justin and Tatbul, Nesime and Turek, Javier S and Mattson, Tim and Muzahid, Abdullah
    Advances in Neural Information Processing Systems
    2019
  7. code2vec: Learning distributed representations of code
    Alon, Uri and Zilberstein, Meital and Levy, Omer and Yahav, Eran
    Proceedings of the ACM on Programming Languages
    {model: , tasks: , representation: pairwise paths between AST terminal nodes, highlights: }
  8. Deep-autocoder: Learning to complete code precisely with induced code tokens
    Hu, Xing and Men, Rui and Li, Ge and Jin, Zhi
    2019 IEEE 43rd Annual Computer Software and Applications Conference (COMPSAC)
    {model: LSTM, tasks: , representation: AST, highlights: learn language models over code corpus}
  9. Pythia: AI-assisted code completion system
    Svyatkovskiy, Alexey and Zhao, Ying and Fu, Shengyu and Sundaresan, Neel
    Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining
    {model: LSTM, tasks: code completion, representation: serialized AST, highlights: }
  10. Maybe deep neural networks are the best choice for modeling source code
    Karampatsis, Rafael-Michael and Sutton, Charles
    arXiv preprint arXiv:1903.05734
  11. Structural language models for any-code generation
    Alon, Uri and Sadaka, Roy and Levy, Omer and Yahav, Eran
  12. A novel neural source code representation based on abstract syntax tree
    Zhang, Jian and Wang, Xu and Zhang, Hongyu and Sun, Hailong and Wang, Kaixuan and Liu, Xudong
    2019 IEEE/ACM 41st International Conference on Software Engineering (ICSE)
    {model: RNN, tasks: , representation: AST split into a sequence of small stmt-level subtrees, highlights: }
  13. IdBench: Evaluating Semantic Representations of Identifier Names in Source Code
    Wainakh, Yaza and Rauf, Moiz and Pradel, Michael
    arXiv preprint arXiv:1910.05177
  14. Neural program repair by jointly learning to localize and repair
    Vasic, Marko and Kanade, Aditya and Maniatis, Petros and Bieber, David and Singh, Rishabh
    arXiv preprint arXiv:1904.01720
  15. Open vocabulary learning on source code with a graph-structured cache
    Cvitkovic, Milan and Singh, Badal and Anandkumar, Animashree
    International Conference on Machine Learning (PMLR)
    {model: GNN, tasks: , representation: augmented AST, highlights: add a graph-structural vocabulary cache to the graph--add edges from a subtoken vocab to terminal nodes}
  16. A literature study of embeddings on source code
    Chen, Zimin and Monperrus, Martin
    arXiv preprint arXiv:1904.03061
  17. Devign: Effective vulnerability identification by learning comprehensive program semantics via graph neural networks
    Zhou, Yaqin and Liu, Shangqing and Siow, Jingkai and Du, Xiaoning and Liu, Yang
    arXiv preprint arXiv:1909.03496
    {model: GNN, tasks: , representation: AST, highlights: }
  18. Global Relational Models of Source Code
    Hellendoorn, Vincent J and Sutton, Charles and Singh, Rishabh and Maniatis, Petros and Bieber, David
    International conference on learning representations
    {model: transformer with attention bias, tasks: varmisuse, representation: sequence of tokens but incorporating semantically meaningful relations, highlights: longer-range dependencies compared to transformer but still limited by context-size}
  19. Learning a static bug finder from data
    Wang, Yu and Gao, Fengjuan and Wang, Linzhang and Wang, Ke
    arXiv preprint arXiv:1907.05579
    {model: GNN, tasks: , representation: augmented AST, highlights: split the code graph into multiple disjoint ones, suitable for more complex bugs such as null pointer deref, less accurate when handling large programs (large = 1000 nodes)}
  20. Deep learning for bug-localization in student programs
    Gupta, Rahul and Kanade, Aditya and Shevade, Shirish
    arXiv preprint arXiv:1905.12454
    {model: tree convolutional neural network, tasks: bug localization, representation: AST, highlights: }
  21. A comprehensive study on deep learning bug characteristics
    Islam, Md Johirul and Nguyen, Giang and Pan, Rangeet and Rajan, Hridesh
    Proceedings of the 2019 27th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering
  22. 2018

  23. Deepbugs: A learning approach to name-based bug detection
    Pradel, Michael and Sen, Koushik
    Proceedings of the ACM on Programming Languages (OOPSLA)
    {model: , tasks: , representation: embed only program identifiers in a list, highlights: }
  24. code2seq: Generating sequences from structured representations of code
    Alon, Uri and Brody, Shaked and Levy, Omer and Yahav, Eran
    arXiv preprint arXiv:1808.01400
    {model: , tasks: code captioning, representation: all pairwise paths between terminal nodes in AST, features: }

2017

  1. Learning to Represent Student Knowledge on Programming Exercises Using Deep Learning
    Wang, Lisa and Sy, Angela and Liu, Larry and Piech, Chris
    International Educational Data Mining Society
  2. Pallas: Semantic-aware checking for finding deep bugs in fast path
    Huang, Jian and Allen-Bond, Michael and Zhang, Xuechen
    Proceedings of the Twenty-Second International Conference on Architectural Support for Programming Languages and Operating Systems
  3. Learning to represent programs with graphs
    Allamanis, Miltiadis and Brockschmidt, Marc and Khademi, Mahmoud
    arXiv preprint arXiv:1711.00740
    {model: GGNN, tasks: varmisuse, representation: AST augmented with control and dataflow, features: }
  4. DeepFix: Fixing common C language errors by deep learning
    Gupta, Rahul and Pal, Soham and Kanade, Aditya and Shevade, Shirish
    Thirty-First AAAI Conference on Artificial Intelligence
    {model: multi-layered seq2seq neural network with attention, tasks: bug finding and fixing, representation: token sequence, highlights: }
  5. Inductive representation learning on large graphs
    Hamilton, William L and Ying, Rex and Leskovec, Jure
    Proceedings of the 31st International Conference on Neural Information Processing Systems
  6. Code completion with neural attention and pointer networks
    Li, Jian and Wang, Yue and Lyu, Michael R and King, Irwin
    arXiv preprint arXiv:1711.09573
    {model: , tasks: , representation: flattened AST, highlights: }
  7. Program synthesis from natural language using recurrent neural networks
    Lin, Xi Victoria and Wang, Chenglong and Pang, Deric and Vu, Kevin and Ernst, Michael D
    University of Washington Department of Computer Science and Engineering, Seattle, WA, USA, Tech. Rep. UW-CSE-17-03-01
    {model: RNN encoder-decoder, tasks: program synthesis from NL, representation: , highlights: generates a program template from NL sentence}

2016

  1. Program synthesis using natural language
    Desai, Aditya and Gulwani, Sumit and Hingorani, Vineet and Jain, Nidhi and Karkare, Amey and Marron, Mark and Roy, Subhajit
    Proceedings of the 38th International Conference on Software Engineering
  2. Neuro-symbolic program synthesis
    Parisotto, Emilio and Mohamed, Abdel-rahman and Singh, Rishabh and Li, Lihong and Zhou, Dengyong and Kohli, Pushmeet
    arXiv preprint arXiv:1611.01855
  3. Summarizing source code using a neural attention model
    Iyer, Srinivasan and Konstas, Ioannis and Cheung, Alvin and Zettlemoyer, Luke
    Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

2013

  1. Mantis: Automatic performance prediction for smartphone applications
    Kwon, Yongin and Lee, Sangmin and Yi, Hayoon and Kwon, Donghyun and Yang, Seungjun and Chun, Byung-Gon and Huang, Ling and Maniatis, Petros and Naik, Mayur and Paek, Yunheung
    USENIX Annual Technical Conference 13

Public Datasets for Code Understanding Tasks December, 2021

Labeled data is typically scarce for code-understanding tasks, both for training and evaluation purposes. In this post, I share a number of datasets that ML/SE researchers have collected and (kindly) shared. I will keep this post up-to-date as I find new datsets. Please email me if some dataset is missing.

Python

  1. QuickBugs
    https://jkoppel.github.io/QuixBugs/
    https://jkoppel.github.io/QuixBugs/quixbugs.pdf
  2. BugSwarm
    http://www.bugswarm.org/dataset/
    https://web.cs.ucdavis.edu/~rubio/includes/icse19.pdf
  3. BugsInPy
    https://github.com/soarsmu/BugsInPy
    https://dl.acm.org/doi/pdf/10.1145/3368089.3417943
  4. refactory
    https://github.com/githubhuyang/refactory
    https://ieeexplore.ieee.org/abstract/document/8952522
  5. Misc.
    https://www.kaggle.com/saitejaponugoti/deepbugs-for-python

Java

  1. XCorpus
    https://bitbucket.org/jensdietrich/xcorpus/src/master
    http://www.jot.fm/issues/issue_2017_04/article1.pdf

Perceptron, MLP, and Backpropagation October 10, 2021

Perceptron

Perceptron is a small unit in neural networks. It takes n inputs, computes a weighted sum of them, and applies a step function or one of its variants on the weighted sum. The goal of training is to find weights that minimize the prediction error.

We are going to create a simple classifier using Perceptron that predicts the language (among C, C#, C++, D, Haskell, Java, JS, PHP, Python, and Rust) in which a program is written. For this simple example I just use the number of occurences of 22 tokens as features.

You can download the test and train data from the following links: proglang_test proglang_train

The following lines of code, train a multi-class classifier and evaluate it.

import pandas
from sklearn.linear_model import Perceptron

data_test = pandas.read_csv('proglang_test.csv')
data_train = pandas.read_csv('proglang_train.csv')

# feature columns
X_test = data_test.iloc[:, 0:-2]
X_train = data_train.iloc[:, 0:-2]

# labels
y_test = data_test.iloc[:, -2] 
y_train = data_train.iloc[:, -2]

# train the classifier
classifier = Perceptron()
classifier.fit(X_train, y_train)

# evaluate the classifier
print('Mean accuracy:', classifier.score(X_test, y_test))
Mean accuracy: 0.95

Multi-layer Perceptron

A multi-layer perceptron has 1) one input layer, 2) a number of hidden layers, and 3) an output layer. The building blocks in this kind if network is the Perceptron. Each layer is fully connected to the next layer. The message passes in one direction (from lower layer to upper layers) so it is conviniently called feed forward network. If there are several hidden layers then it is called a deep network.

Backpropagation

After the weights of the model is set to random values, one small batch of samples are run through the network in forward direction until it makes the final predictions. Then, the error is measured in reverse---or backwards---for each layer with the goal of finding how much each connection contributes to the error. Finally, the weights are re-adjusted based on the computed values. As I mentioned earlier, in the original Perceptron, a step function was used at the final step. In modern MLPs, however, this step function is replaced by other activation functions such as tanh and relu.

Reference

  • Géron, Aurélien. Hands-on machine learning with Scikit-Learn, Keras, and TensorFlow: Concepts, tools, and techniques to build intelligent systems. O'Reilly Media, 2019.

Hardening C/C++ Code with Clang Sanitizers May 10, 2020

Many common programming errors manifest as violations of general safety properties such as memory safety (e.g. null pointer dereferencing) and concurrency safety (e.g. data races). Certain languages such as Rust aim to statically check such properties but impose additional burden on developers. Managed languages such as Java and C# dynamically check such properties and throw exceptions when they are violated. But most languages in widespread use, notably C and C++, offer very limited checking at compile-time and run-time. Unchecked safety violations in C/C++ programs are notorious for ill effects ranging from hard-to-debug runtime crashes to security exploits.

Sanitizers are a wide-ranging solution to these problems. They bring C/C++ closer to safer languages like Java and Rust. Major C/C++ compilers such as GCC and Clang provide sanitizers. In this article, we will learn about Clang sanitizers.

Clang sanitizers are safety checkers supported by Clang. They are compiler add-ons that insert checks by instrumenting the code in order to detect common programming errors at runtime. Four of these sanitizers are widely used in practice and have been successfully applied to large applications such as Google Chrome and Mozilla Firefox. These are: Address Sanitizer, Thread Sanitizer, Undefined Behavior Sanitizer, and Memory Sanitizer. The remainder of this article provides an overview of each of these sanitizers and demonstrates how you can apply them to harden C/C++ programs.

An overview of the LLVM workflow is illustrated in the following block diagram. The Clang Sanitizer pass is located between the compiler front-end and the back-end, transforming an unsafe program to a safe one by instrumenting the program IR. This framework supports multiple frontends (e.g., C/C++ and Objective C/C++) as well as multiple backends (e.g., x86, x86_64, ARM, MIPS, etc.)

Major Clang Sanitizers

1. Address Sanitizer (ASan)

Address Sanitizer is a fast memory bug checker with a typical slowdown of 2x. It can detect the following kinds of bugs:

  • Heap buffer overflow: A buffer overflow where the buffer is allocated in the heap portion of memory.
  • Stack buffer overflow: A condition when a program accesses a memory address on the program’s call stack outside of the intended data structure.
  • Use after free: An attempt to access memory after it has been freed.
  • Double/invalid free: An attempt to deallocate freed memory.
  • Memory leaks: Failure to release memory that is no longer needed.

2. Thread Sanitizer (TSan)

Thread Sanitizer is a checker that detects data races. A data race happens when two threads access the same resource concurrently and at least one of the accesses is a write. Data races are one of the most challenging bugs to detect in concurrent systems. Thread Sanitizer incurs an average slowdown of 2x-20x and an average memory overhead of 5x-10x.

3. Undefined Behavior Sanitizer (UBSan)

Undefined Behavior Sanitizer is a fast undefined behavior detector. Some examples of undefined behavior that it catches are as follows:

  1. Misaligned pointer or reference
  2. Out of bounds array indexing where the array bound can be statically determined
  3. Implicit and explicit conversions between floating-points and integers
  4. Division by zero
  5. Performing pointer arithmetic which overflows
  6. Signed and unsigned integer overflow
  7. If control flow reaches an unreachable program point

4. Memory Sanitizer (MSan)

Memory Sanitizer is a checker for uninitialized memory reads by tracking uninitialized bits in the memory. Uninitialized reads happen when the allocated memory is read before it is written. MSan tracks uninitialized bits. This sanitizer imposes an average slowdown of 3x, and an average memory overhead of 2x.

Quick Guide to Run Clang Sanitizers

1. Address Sanitizer (ASan)

Create a new file asan.cc, and copy the following program into it:

// asan.cc
1: int main(int argc, char **argv) {
2:   int *array = new int[100];
3:   delete [] array;
4:   return array[argc];
5: }

Next, compile and link the program with the -fsanitize=address flag:

$ clang++ -g -fsanitize=address asan.cc

For better performance, add the -O1 optimization flag, and to get a nicer stack trace in the error message, add the -fno-omit-frame-pointer flag:

$ clang++ -O1 -g -fsanitize=address -fno-omit-frame-pointer asan.cc

Next, simply run the compiled binary:

$ ./a.out

If you follow the steps successfully, you will see the following error message printed to stderr, and the program will exit with a non-zero exit code.

==37==ERROR: AddressSanitizer: heap-use-after-free on address 0x614000000044 at pc 0x0000004f8d5f bp 0x7ffe692d6990 sp 0x7ffe692d6988
READ of size 4 at 0x614000000044 thread T0
    #0 0x4f8d5e in main /sans/asan.cc:4:10
    #1 0x7f0500759b96 in __libc_start_main /build/glibc-OTsEL5/glibc-2.27/csu/../csu/libc-start.c:310
    #2 0x41af59 in _start (/sans/a.out+0x41af59)
...

In this example, the program is trying to access an array element after it has been deleted. So the sanitizer reports a heap-use-after-free error at line 4.

2. Thread Sanitizer (TSan)

Create a new file tsan.c, and copy the following program into it:

// tsan.c
 1: #include <pthread.h>
 2: int Global;
 3: void *Thread1(void *x) {
 4:   Global = 42;
 5:   return x;
 6: }
 7: int main() {
 8:   pthread_t t;
 9:   pthread_create(&t, NULL, Thread1, NULL);
10:   Global = 43;
11:   pthread_join(t, NULL);
12:   return Global;
13: }

Next, compile and link the program with the -fsanitize=thread flag:

$ clang -fsanitize=thread -g -O1 tsan.c

Next, simply run the compiled binary. You might need to run it multiple times to encounter a data race. Later in the course, you will see how a test input generation tool such as AFL or LibFuzzer can automate this repetitive task.

$ ./a.out

When the bug is encountered, the following error message will be printed to stderr:

WARNING: ThreadSanitizer: data race (pid=145)
  Write of size 4 at 0x000001108278 by main thread:
    #0 main /sans/tsan.c:10:10 (a.out+0x4ac64e)

  Previous write of size 4 at 0x000001108278 by thread T1:
    #0 Thread1 /sans/tsan.c:4:10 (a.out+0x4ac607)

  Location is global 'Global' of size 4 at 0x000001108278 (a.out+0x000001108278)

  Thread T1 (tid=147, finished) created by main thread at:
    #0 pthread_create <null> (a.out+0x422fe5)
    #1 main /sans/tsan.c:9:3 (a.out+0x4ac644)

SUMMARY: ThreadSanitizer: data race /sans/tsan.c:10:10 in main
==================
ThreadSanitizer: reported 1 warnings

3. Undefined Behavior Sanitizer (UBSan)

Create a new file ubsan.c, and copy the following program into it:

// ubsan.c
 1: #include <inttypes.h>
 2: #include <stdint.h>
 3: #include <stdio.h>
 4: #include <string.h>
 5: 
 6: int32_t convert(const uint8_t *restrict p) {
 7:   uint32_t x = (256*p[1] + 256*256*p[2] + 256*256*256*p[3]);
 8:   if (x > INT32_MAX) return (x - INT32_MAX) - 1;
 9:   else return (((int32_t)x + (int32_t)-INT32_MAX) - 1);
10: }
11: 
12: int main() {
13:   uint32_t value;
14:   uint8_t buf[sizeof(uint32_t)];
15:   while (scanf("%" SCNx32, &value) == 1) {
16:     memcpy(buf, &value, sizeof(buf));
17:     printf("%08" PRIx32 "\n", convert(buf));
18:   }
19:   return 0;
20: }

Next, compile and link the program with the -fsanitize=undefined flag:

$ clang -g -fsanitize=undefined ubsan.c

Then, run the compiled binary with the following input:

$ ./a.out
80000001

You should be able to witness the following runtime error:

ubsan.c:7:54: runtime error: signed integer overflow: 16777216 * 128 cannot be represented in type 'int'

After reading 80000001, UBSan detects an error and prints an error message. It points out the signed integer overflow in 256 * 256 * 256 * p[3]. p[3] is an unsigned char which will convert to a signed integer between 0 to 255. Then, it is multiplied by 256 * 256 * 256. In some cases, such as the case where the input is 80000001, a signed integer overflow occurs which is considered an undefined behavior in C/C++.

4. Memory Sanitizer (MSan)

Create a new file msan.cc, and copy the following program into it:

// msan.cc
1: #include <stdio.h>
2: 
3: int main(int argc, char** argv) {
4:   int* a = new int[10];
5:   a[5] = 0;
6:   if (a[argc])
7:     printf("xx\n");
8:   return 0;
9: }

Next, compile and link the program with the -fsanitize=memory flag:

$ clang++ -g -fsanitize=memory msan.cc

For better performance, add the -O2 optimization flag, and to get a stack trace in the error message, add the -fno-omit-frame-pointer flag:

$ clang++ -g -O2 -fno-omit-frame-pointer -fsanitize=memory msan.cc

Next, simply run the compiled binary:

$ ./a.out

When the bug is encountered, the following error message will be printed to stderr:

==52==WARNING: MemorySanitizer: use-of-uninitialized-value
    #0 0x495d7e in main /cis547vm/msan.cc:6:8
    #1 0x7f1d69194b96 in __libc_start_main /build/glibc-OTsEL5/glibc-2.27/csu/../csu/libc-start.c:310
    #2 0x41b7e9 in _start (/cis547vm/j+0x41b7e9)

SUMMARY: MemorySanitizer: use-of-uninitialized-value /cis547vm/msan.cc:6:8 in main

In this example, an uninitialized value is accessed in line 6, column 8.


The Checker Framework April 21, 2020

If you are interested in preventing bugs in your Java project, look no further. The Checker Framework improves Java’s type system through annotations to detect and prevent errors in Java programs. It includes compiler checkers that find bugs or verify the absence of them. The Checker Framework is widely used at Google, Amazon, and Uber.

The built-in type checker in Java, does provide some reasonable type checking and error prevention. However, the Checker Framework lets you define new type checkers and extend the type system and run them as a part of the javac compiler. The checker framework allows the developers to use the available type systems in the framework, or design their own. In this article, we will learn about the most popular checkers in this framework.

Nullness Checker

The most popular checker in the Checker Framework is the Nullness Checker. If this checker does not issue any warnings at compile time, it is guaranteed that the program will never throw a null pointer exception. The Nullness Checker is a standard part of the build system at Google.

The most important annotations supported by the Nullness Checker are the following:

  • @Nullable: this annotation indicates a type that has null value in its possible set of values. For example, the possible set of values for a Boolean is {true, false, null}.
  • @NonNull: this annotation indicates a type that does not have null in its possible set of values. For example, a variable of type @NonNull Boolean only has true and false in its possible set of values.

This checker emits warnings in the following cases:

  • When you dereference an expression that is not @NonNull.
  • When the value of a @NonNull expression evaluates to null.
  • When you assign null to a @NonNull variable.
  • When a field of type @NonNull is not initialized in a constructor.

Example
@Nullable Object object;
@NonNull Object nonNullObject;
…
object.toString(); // warning: possible null dereference
nonNullObject = object; // warning: nonNullObject might become null
if (nonNullObject == null) doSomething; // warning: redundant check

GUI Effect Checker

When checking graphical user interfaces, the most prevalent bug is invalid thread access by a background process. Well-known GUI frameworks, such as Android and Swing, spawn a single main thread which is known as the UI event thread that is responsible for handling all the events and updates. Any other operation, such as expensive computations, is offloaded to background threads which are known as worker threads. The worker thread should send a request to the main thread to perform any access on its behalf. If any of these background threads directly access a UI element, the framework throws an exception and terminates the application.

This is a challenging issue to debug since it is difficult to remember which methods are permitted to be called on which threads. The GUI Effect Checker solves this problem by allowing the programmer to annotate each method to indicate whether it accesses no UI elements, or whether it may access UI elements. The former method is said to have the safe effect and the latter is said to have the UI effect. A method with the safe effect is prohibited from calling a method with the UI effect.

The most important annotations supported by the GUI Effect Checker are the following:

  • @SafeEffect: this annotation indicates that a method is not allowed to access UI elements.
  • @UIEffect: this annotation indicates that a method may access UI elements. Note that it is always safe to call a @SafeEffect method whenever it is permitted to call a @UIEffect method.

This checker complains in the following cases:

  • When a @UIEffect method is invoked by a @SafeEffect method.
  • When a supertype declares a @SafeEffect method, and its subtype is overridden as @UIEffect.
  • When a method implements or overrides a method in two supertypes and they have different GUI Effect annotations.

Example
@SafeEffect
public void foo(JTextField tf) {
  tf.setText(“first”); // error since setText is a @UIEffect method,
                       // and is called from a @SafeEffect method.
  Display.syncExec(new Runnable {
    @UIEffect
    public void run() {
      tf.setText(“second”);
    }
  });
}

Tainting Checker

The Tainting Checker prevents specific kinds of errors that arise from untrusted or tainted values. These untrusted values come from possibly malicious sources such as user input or unvalidated data. Trust-related errors might cause the application to crash, corrupt data, or leak information. If the checker does not issue any warnings for a correctly-tainted program, no tainted values ever flow to a sensitive sink.

A program must sanitize any untrusted value before using it at any sensitive point in the program. In general, there are two ways to sanitize a value: 1. checking if it is valid, or 2. transforming the value to become valid. An example of the former is to check if the value contains no characters that can be interpreted as SQL commands. An example of the latter is to quote all the characters that can be interpreted as SQL commands.

The Tainting Checker is not aware of the internal semantics of the application, so it cannot warn you correctly if you mis-annotate. The mis-annotation happens in two ways: 1. a sensitive sink annotated as @Tainted data, or 2. external data annotated as @Untainted. However, as long as you correctly annotate the program points, the checker will ensure there is no undesired information flow.

Some example of the purposes that the Tainting Checker can serve is as follows: Prevent SQL injection attacks: external input is @Tainted, and @Untainted is checked for SQL syntax. Prevent cross-site scripting attacks: external input is @Tainted, and @Untainted is checked for JavaScript syntax. Prevent information leak: secret data is @Tainted, and @Untainted may be displayed to the user.

Example
void processRequest() {
  @Tainted String input = getUserInput();
  executeQuery(input); // error: pass tainted string to executeQuery
}

public String validateInput(String userInput) {
  // Do some validation here.
  @Untainted String result = userInput;
  return result;
}

The most important annotations supported by the Tainting Checker are the following:

  • @Untainted: this annotation indicates a type that includes only trusted values.
  • @Tainted: this annotation indicates a type that may include untrusted values. It is a supertype of @Untainted.

Checker Framework in Action

Follow the step-by-step instructions to install and use the Checker Framework to find a security error using the Tainting Checker.

Step 1: Install

First, install Java JDK and ANT build tools.

$ apt-get install openjdk-8-jdk
$ apt-get install ant

Then, follow the instructions to install the checker framework:

$ wget https://checkerframework.org/checker-framework-3.3.0.zip
$ unzip checker-framework-3.3.0.zip
$ cd checker-framework-3.3.0
$ pwd

Copy the value that is printed to the terminal after running pwd. We will need it later!

Step 2: Download

Download the source files from the checker framework website, and unzip it. You will see various examples under the src directory. We are going to work with the personal blog demo in this section.

$ wget https://checkerframework.org/tutorial/sourcefiles.zip
$ unzip sourcefiles
$ cd src/personalblog-demo
Step 3: Run the Checker

For building the project, and running the checker at the same time, we use ANT build system to make the process easier. Go ahead and take a look at src/personalblog-demo/build.xml. On the 4th line of this file, change the value to the value you copied in the first step. So it will look like this:

<property name=”checker-framework” value=”/checker-framework-3.3.0”/>

On line 18, the location of the Checker jar file is specified. Then, on line 43, the kind of the checker that we want to use is specified. In this example, we are going to use the Tainting Checker:

<compilearg value=”org.checkerframework.checker.tainting.TaintingChecker”/>

Now, for building this project with ANT, run the following command:

$ ant

The checker will emit an error complaining about an untainted string on line 162 of PersonalBlogService.java:

[jsr308.javac]  + "%' order by post.created desc");
[jsr308.javac]    ^
[jsr308.javac]  found   : @Tainted String
[jsr308.javac]  required: @Untainted String

BUILD FAILED

Now that we have identified where the problem is, we can fix the issue in PersonalBlogService.java by forcing the client to only pass an @Untainted String to the function getpostsByCategory().

Step 4: Re-run the Checker

Using ANT, re-build the modified project. You will see another error on line 55 of ReadAction.java:

[jsr308.javac]  ...getPostsByCategory(reqCategory));
[jsr308.javac]                        ^
[jsr308.javac]  found   : @Tainted String
[jsr308.javac]  required: @Untainted String

BUILD FAILED

This time, we are going to correct the tainting error by validating the string. There is already a validate() function in the code that, given any string, validates the string and returns an @Untainted one. Use this function to validate reqCategory. After fixing the bug, re-build the project. This time, if corrected properly, the checker will not complain and you should expect the following message:

BUILD SUCCESSFUL

A Fuzzing Lesson (AFL) April 13, 2020

Do not be misled by the title; AFL does not stand for A Fuzzing Lesson. It stands for American Fuzzy Lop: one of the cutest animals out there. You might wonder what this rabbit is doing here since this is not a biology class. In fact, the state-of-the-art fuzzer that you are going to learn about is named after this cute rabbit.

AFL is a security-oriented fuzzer written in C that combines compile-time instrumentation with a genetic algorithm to discover crash-inducing inputs. In contrast to other fuzzers, AFL’s focus is practicality: low-performance overhead, very few configuration options, and handling real-world programs.

Design Goals of AFL

We start out by discussing the design goals of AFL which make it widely usable.

1. AFL is Fast. AFL uses a combination of tricks that make it faster compared to other fuzzers. First, it takes advantage of low-level instrumentation which imposes negligible overhead. Second, it prioritizes mutation strategies that lead to discovering more paths. Finally, it re-evaluates the queue of generated inputs on specific intervals using a fast algorithm that selects a smaller subset of test cases that still cover every exercised tuple of the form (branch_source, branch_destination) so far. Every queue entry is assigned a score proportional to its execution latency and file size. Then, the candidate with the lowest score is selected for each tuple.

2. AFL Focuses on Code Coverage. Coverage-guided fuzzing (aka grey-box fuzzing) tries to maximize the code coverage of a program such that more and more code branches are exercised. AFL instrumentation captures branch coverage and hit counts. It maintains an array shared_mem which is a 64 KB region passed to the instrumented binary. It keeps the information regarding the branch source and branch destination for each branch as tuples. So, it distinguishes between the following execution paths and tries to discover new paths in each iteration. For example, it can distinguish between these execution paths:

(1) A --> B --> C --> D
(2) A --> B --> D --> C

since the recorded tuples are different:

(1) (A,B) (B,C) (C,D)
(2) (A,B) (B,D) (D,C)

3. AFL is Easy to Use. AFL is designed to be highly practical. Compared to other fuzzers, AFL requires virtually no configuration and fine-tuning. You can jump to the end of this article for a quick guide describing how to run AFL. Thereafter, you will be set to start using AFL to hunt bugs in programs!

4. AFL Can be Chained to Other Tools. Although AFL can be used without additional options, one can use additional tools to enhance the effectiveness of AFL and find bugs that might go unnoticed when using the vanilla fuzzer. For instance, AddressSanitizer can be added as a compiler option to enable detecting invalid memory accesses.

5. AFL Can Minimize Generated Inputs. AFL provides a tool, afl-tmin and afl-cmin, for test case minimization. It takes an input file and tries to trim as much data as possible while producing the same crashing state or instrumentation output. In other words, it minimizes the input without altering the execution path. This is especially useful when reporting or investigating a crash.

Kinds of Input Mutations

AFL is based on a set of mutation strategies that are shown to utilize CPU efficiently and generate interesting test cases. Here are some of the main mutation strategies for generating new inputs that make AFL sophisticated:

1. Bit Flips: This simple strategy involves sequential, ordered flips for 1-4 bits. The observed results demonstrate 10 to 70 new additional paths per million inputs generated.

AFL Rocks! ==> AFL Rhcks!
3/div>

Although the number of the additional paths seems small, since AFL usually explores around 1000 inputs per second per core, such paths appear fairly quickly. For instance, on an 8-core machine, it takes 125 seconds to explore a million inputs. Moreover, discovering one unseen path can lead to discovering more unseen paths, as AFL uses the newly discovered paths as feedback to quickly exercise unseen branches.

2. Byte Flips: A natural extension to the previous strategy is flipping bytes. This strategy involves flipping 8-, 16-, or 32-bit wide bitflip. Nearly 30 additional paths are discovered per million generated inputs. Note that this strategy is much cheaper compared to flipping bits due to the underlying hardware support for flipping bytes. However, it limits the opportunities to find additional paths.

C > AFL Rocks! ==> AFL Rhcks!
Pv>

3. Simple Arithmetic: For triggering more complex conditions deterministically, AFL increments or decrements constant integers by 1-35. The effectiveness of this strategy varies depending on the input format. For instance, in JPEG, it helps discover around 2 additional paths per million inputs.

4. Known Integers: Another deterministic strategy in AFL involves replacing the numbers in the input file with a hardcoded set of integer numbers that are likely to trigger edge, crashing cases. Examples of such integers include 0, -1, MAX_INT, and MIN_INT-1. This strategy leads to discovering 2-5 additional paths per one million generated inputs.

char *t                       ==>     char *t
= malloc(128*sizeof(char));           = malloc(MAX_INT*sizeof(char));

5. Stacked Tweaks: This strategy randomly stacks some (or all) of the following operations and applies them: single bit-flip and byte-flip, block duplication, block deletion, etc. The number of operations that are stacked is randomly chosen as a power-of-two between 1 and 64, and the block size for block operations is usually 1 KB.

6. Splicing: In contrast to previous strategies, this one relies on two distinct inputs. Then, they are spliced in a random location. This strategy sometimes discovers 20% additional paths.

Real Bugs Found by AFL

AFL has been able to find many interesting bugs to date, around 500 of which have been security vulnerabilities, and [have been officially reported](https://github.com/mrash/afl-cve). Here is a sampling of the kinds of bugs together with links to the corresponding vulnerabilities found by AFL:

  1. Assertion violation [example: CVE-2016-9399]
  2. Buffer overflow [example: CVE-2015-8126]
  3. Integer overflow [example: CVE-2017-6839]
  4. Null dereference [example: CVE-2017-7475]
  5. Infinite loop [example: CVE-2017-5852]

Quick Guide to Run AFL

Part 1: Running the Fuzzer

Follow the step-by-step instructions to install and run AFL.

1. Install AFL:

$ apt-get install -y afl

2. Download fuzzgoat source code, extract it, and build it with afl-clang. This step will inject instrumentations into the target program.

$ git clone https://github.com/fuzzstati0n/fuzzgoat
$ cd fuzzgoat
$ export CC=afl-clang
$ make -j8

3. Next, create an input folder where the seed file will reside, and an output folder where the generated crashing inputs will go.

$ mkdir input output

4. We are going to use a valid ELF file which is already on the system as the seed input.

$ cp /bin/ps input/

5. Finally, run the fuzzer:

$ afl-fuzz -i input -o output -- ./fuzzgoat @@

You will immediately see the following GUI starting in the terminal. After a few seconds, you should see some crashing inputs. You can find them under the output/crashes folder. You can find more details about the status screen in AFL’s documentation.

Part 2: Minimizing the Crashing Inputs

Next, we are going to try out afl-tmin that performs test minimizing. This tool minimizes the test case to the bare minimum that results in the same execution path. It iterates over the actual bytes in the test case and removes smaller and smaller blocks of bytes. You can observe this tool in action using the following command.

$ afl-tmin -i output/crashes/[crash-file] -o test.min -- ./fuzzgoat