Back to homepage

Publications

Testing Deep Learning Libraries via Neurosymbolic Constraint Learning
M M Abid Naziri^*, Shinhae Kim^*, Feiran Qin, Saikat Dutta, Marcelo d’Amorim
IEEE/ACM International Conference on Software Engineering (accpt. 21.8% [321/1469])
(ICSE 2026), Rio de Janeiro, Brazil, April 2026.
[Preprint] [PDF]

Summary
This paper introduces Centaur, a novel neurosymbolic technique to test Deep Learning library APIs by dynamically learning their input constraints. By uniquely combining a grammar-guided Large Language Model with an SMT solver, Centaur generates more valid and diverse test inputs than prior approaches. Our method significantly improves API and code coverage and has already found 26 new bugs in PyTorch and TensorFlow, 18 of which have been confirmed.
Misbehavior Forecasting for Focused Autonomous Driving Systems Testing
M M Abid Naziri, Stefano Carlo Lambertenghi, Andrea Stocco, Marcelo d’Amorim
IEEE/ACM International Conference on Software Engineering (accpt. 21.8% [321/1469])
(ICSE 2026), Rio de Janeiro, Brazil, April 2026.
[Preprint] [PDF]

Summary
This paper introduces Foresee, a testing technique for autonomous driving systems that identifies potential failures by forecasting and fuzzing "near-miss" events in simulation. By using a misbehavior forecaster to target high-risk scenarios, our approach makes testing more efficient and effective. In our evaluation using the CARLA simulator, Foresee finds up to 128% more failures than baselines while being up to 2.49x faster, and improves the bug-finding capability of state-of-the-art fuzzers by over 93%.
BugsInDLLs : A Database of Reproducible Bugs in Deep Learning Libraries to Enable Systematic Evaluation of Testing Techniques
M M Abid Naziri, Aman Kumar Singh, Feiran Qin, Benjamin Wu, Saikat Dutta, Marcelo d’Amorim
International Symposium on Software Testing and Analysis (Tool Demos)
(ISSTA 2025, Tool Demonstration), Trondheim, Norway, June 2025.
[PDF] [Tool]

Summary
We introduce BugsInDLLs, a curated database of 112 reproducible bugs from popular deep learning libraries like TensorFlow and PyTorch. This benchmark provides the research community with a standard resource to systematically evaluate and improve bug-finding techniques.
Improving Deep Learning Library Testing with Machine Learning
Facundo Molina, M M Abid Naziri, Feiran Qin, Alessandra Gorla, Marcelo d’Amorim
ACM/IEEE International Conference on Automation of Software Test (accpt. 40% [18/45])
(AST 2026), Rio de Janeiro, Brazil, April 2026.
[PDF]

Summary
This paper demonstrates the effectiveness of ML classifiers as an efficient validity checker for DL Library inputs. Once trained with enough data, different out-of-the-box classifiers can achieve high accuracy in predicting whether an input will be valid or not without executing the input on an API. We demonstrate that the classifiers can predict the validity of an input with an accuracy of 91%. We also improve ACETest, a prominent DL Library API fuzzing tool, by complementing the technique with ML classifiers. The validity ratio of the tool increases from 29% to 61% after the integration, demonstrating the effectiveness of ML classifiers as a filter for the DL Library API inputs.
Evaluating the Effectiveness of Coverage-Guided Fuzzing for Testing Deep Learning Library APIs
Feiran Qin, M M Abid Naziri, Saikat Dutta, Marcelo d’Amorim
Submitted
[Preprint] [PDF]

Summary
This work presents the first in-depth study confirming the effectiveness of Coverage-Guided Fuzzing (CGF) for testing Deep Learning library APIs. We introduce FlashFuzz, a novel tool that makes this possible by using Large Language Models (LLMs) to automatically synthesize and repair the required test harnesses. Our approach vastly outperforms state-of-the-art fuzzers in code coverage (up to +212%) and speed (up to 1182x), leading to the discovery of 42 new bugs in PyTorch and TensorFlow.

^*Equal contribution