Scientific workflows orchestrate complex computational pipelines across distributed infrastructure, but failures and anomalies can waste millions of compute hours. My research develops AI-driven methods to monitor, detect, and respond to anomalies in real time.
Graph Neural Networks for Workflows: Representing workflow executions as graphs enables learning structural patterns that distinguish normal from anomalous behavior. Our GNN-based detectors achieve state-of-the-art performance on workflow anomaly benchmarks and scale to production scientific workflows.
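As a rough illustration of the graph view described above, the sketch below encodes a small workflow as a DAG and runs one round of mean neighbor aggregation, the basic step a GNN layer builds on. The job names, the runtime feature, and the helper functions are hypothetical, and there are no learned weights here; this is a simplification for intuition, not the detector from the publications.

```python
# Hypothetical workflow: job name -> list of downstream jobs (a DAG).
edges = {
    "stage_in": ["align_a", "align_b"],
    "align_a": ["merge"],
    "align_b": ["merge"],
    "merge": [],
}

# Hypothetical per-job feature: runtime in seconds.
runtime = {"stage_in": 12.0, "align_a": 95.0, "align_b": 410.0, "merge": 30.0}


def neighbors(edges):
    """Build an undirected adjacency map from the DAG's edges."""
    adj = {job: set() for job in edges}
    for src, dsts in edges.items():
        for dst in dsts:
            adj[src].add(dst)
            adj[dst].add(src)
    return adj


def aggregate(features, adj):
    """One round of mean aggregation: average each job's feature with its neighbors'."""
    out = {}
    for job, nbrs in adj.items():
        values = [features[job]] + [features[n] for n in nbrs]
        out[job] = sum(values) / len(values)
    return out


adj = neighbors(edges)
embedding = aggregate(runtime, adj)
```

After one round, each job's value reflects its local neighborhood, so an unusually slow job such as `align_b` also shifts the values of the jobs it connects to. A trained model would replace the plain average with learned transformations and stack several such rounds.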
Large Language Models: Exploring how LLMs can interpret workflow logs and execution traces to provide explainable anomaly detection through in-context learning, without extensive retraining. This work enables rapid adaptation to new workflow types and failure modes.
Active Learning: Developing human-in-the-loop systems that efficiently query operators for labels on uncertain cases, dramatically reducing annotation costs while maintaining detection accuracy. This is critical for deploying ML in operational HPC environments.
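One common way to pick "uncertain cases" is uncertainty sampling: ask the operator about the samples whose predicted anomaly score sits closest to the decision boundary. The sketch below assumes scores in [0, 1] with a boundary at 0.5; the function name, the scores, and the budget are hypothetical, and the published systems may use a different query strategy.

```python
def select_queries(scores, budget):
    """Return indices of the `budget` samples whose score is closest to 0.5.

    Assumes each score is a predicted anomaly probability in [0, 1],
    so distance from 0.5 is a simple proxy for model uncertainty.
    """
    ranked = sorted(range(len(scores)), key=lambda i: abs(scores[i] - 0.5))
    return ranked[:budget]


# Hypothetical anomaly scores for five workflow jobs.
scores = [0.02, 0.48, 0.91, 0.55, 0.30]
queries = select_queries(scores, budget=2)
```

Here the jobs scored 0.48 and 0.55 are sent to the operator, while the confidently normal (0.02) and confidently anomalous (0.91) jobs are left unlabeled, which is how the labeling budget is concentrated where it is most informative.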
Distributed Scheduling: Novel consensus-based and bio-inspired algorithms for job scheduling in federated computing environments, including ant colony optimization approaches that balance solution quality with computational efficiency.
This work is part of the SWARM project, reimagining scientific workflow management for distributed, federated research infrastructure across DOE national laboratories.
Publications
- SWARM: Reimagining Scientific Workflow Management Systems in a Distributed World - IJHPCA 2025
- Advancing Anomaly Detection in Computational Workflows with Active Learning - Future Generation Computer Systems 2025
- Large Language Models for Anomaly Detection in Computational Workflows - SC24
- Graph Neural Networks for Detecting Anomalies in Scientific Workflows - IJHPCA 2023
- Workflow Anomaly Detection with Graph Neural Networks - WORKS 2022
- Self-Supervised Learning for Anomaly Detection in Computational Workflows - arXiv 2023
- Flow-Bench: A Dataset for Computational Workflow Anomaly Detection - arXiv 2023
- A Greedy Consensus-Based Approach to Distributed Job Selection - CCGrid 2025
- Bridging Speed and Optimality in Job Scheduling: A Hybrid Ant Colony Optimization Approach - SC25 Workshop
- DISTRI: Development and Integration of Simulation Tools for Resilient Infrastructure - IEEE BigData 2024
- DGRO: Diameter-Guided Ring Optimization for Integrated Research Infrastructure - arXiv 2024