Workflow Management & Anomaly Detection

Scientific workflows orchestrate complex computational pipelines across distributed infrastructure, but failures and anomalies can waste millions of compute hours. My research develops AI-driven methods to monitor, detect, and respond to anomalies in real time.

Graph Neural Networks for Workflows: Representing workflow executions as graphs enables learning structural patterns that distinguish normal from anomalous behavior. Our GNN-based detectors achieve state-of-the-art performance on workflow anomaly benchmarks and scale to production scientific workflows.
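To illustrate the graph view, here is a minimal sketch, not the detectors described above. It encodes a small workflow as a dependency graph and runs one round of mean neighbor aggregation, the basic operation a GNN layer builds on. The job names, the two features (runtime in seconds, CPU utilization), and the helper names `build_parents` and `aggregate` are all hypothetical.

```python
from collections import defaultdict


def build_parents(edges):
    """Map each job to the list of jobs it depends on, from (parent, child) edges."""
    parents = defaultdict(list)
    for parent, child in edges:
        parents[child].append(parent)
        parents.setdefault(parent, [])  # make sure root jobs also appear
    return dict(parents)


def aggregate(features, parents):
    """Concatenate each job's features with the mean of its parents' features."""
    out = {}
    for job, feats in features.items():
        preds = parents.get(job, [])
        if preds:
            mean = [sum(features[p][i] for p in preds) / len(preds)
                    for i in range(len(feats))]
        else:
            mean = [0.0] * len(feats)  # root jobs have no upstream context
        out[job] = feats + mean
    return out


# Hypothetical diamond-shaped workflow; all values are illustrative only.
edges = [("stage_in", "process_a"), ("stage_in", "process_b"),
         ("process_a", "merge"), ("process_b", "merge")]
features = {
    "stage_in": [10.0, 0.25],
    "process_a": [120.0, 0.75],
    "process_b": [480.0, 0.25],  # unusually slow sibling
    "merge": [30.0, 0.5],
}

embeddings = aggregate(features, build_parents(edges))
print(embeddings["merge"])  # own features followed by mean parent features
```

After aggregation, the vector for `merge` carries information about its upstream jobs, so a downstream classifier can react to a slow or idle parent even when the job's own metrics look normal.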

Large Language Models: We explore how LLMs can interpret workflow logs and execution traces, providing explainable anomaly detection through in-context learning without extensive retraining. This work enables rapid adaptation to new workflow types and failure modes.

Active Learning: We develop human-in-the-loop systems that query operators for labels only on the cases where the detector is uncertain, reducing annotation costs while maintaining detection accuracy. This is critical for deploying ML in operational HPC environments.
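One common way to pick which cases to send to an operator is uncertainty sampling. The sketch below assumes a detector that outputs an anomaly probability per run and selects the runs closest to 0.5. It is a generic illustration, not the query strategy used in this work. The run ids, scores, and the `select_queries` helper are hypothetical.

```python
def select_queries(scores, budget):
    """Return the ids of the `budget` runs whose anomaly probability is closest to 0.5."""
    ranked = sorted(scores, key=lambda run_id: abs(scores[run_id] - 0.5))
    return ranked[:budget]


# Hypothetical detector outputs: run id -> predicted anomaly probability.
scores = {
    "run-01": 0.02,
    "run-02": 0.47,
    "run-03": 0.91,
    "run-04": 0.58,
    "run-05": 0.99,
}

to_label = select_queries(scores, budget=2)
print(to_label)  # ['run-02', 'run-04']
```

Confident predictions such as `run-01` and `run-05` are never sent for labeling, so operator time goes to the cases that are likely to change the model most.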

Distributed Scheduling: We design consensus-based and bio-inspired algorithms for job scheduling in federated computing environments, including ant colony optimization approaches that balance solution quality with computational efficiency.
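For readers unfamiliar with ant colony optimization, the following is a textbook-style sketch applied to a toy problem: assigning jobs to sites so that the makespan (the largest per-site load) is small. It is not the project's scheduler. The function `aco_schedule`, its parameters, and the job runtimes are assumptions made for illustration, and runtimes are assumed to be positive.

```python
import random


def aco_schedule(runtimes, n_sites, n_ants=20, n_iters=50, evaporation=0.1, seed=0):
    """Assign each job to a site, trying to minimize the makespan (max site load)."""
    rng = random.Random(seed)
    n_jobs = len(runtimes)
    # pheromone[job][site]: learned preference for placing `job` on `site`.
    pheromone = [[1.0] * n_sites for _ in range(n_jobs)]
    best_assign, best_makespan = None, float("inf")

    for _ in range(n_iters):
        for _ in range(n_ants):
            loads = [0.0] * n_sites
            assign = []
            for job, runtime in enumerate(runtimes):
                # Combine pheromone with a heuristic favoring lightly loaded sites.
                weights = [pheromone[job][s] / (1.0 + loads[s]) for s in range(n_sites)]
                site = rng.choices(range(n_sites), weights=weights)[0]
                assign.append(site)
                loads[site] += runtime
            makespan = max(loads)
            if makespan < best_makespan:
                best_assign, best_makespan = assign, makespan

        # Evaporate all trails, then reinforce the best assignment found so far.
        for job in range(n_jobs):
            for s in range(n_sites):
                pheromone[job][s] *= 1.0 - evaporation
            pheromone[job][best_assign[job]] += 1.0 / best_makespan

    return best_assign, best_makespan


# Hypothetical job runtimes (hours) scheduled across two sites.
runtimes = [4.0, 3.0, 3.0, 2.0, 2.0, 2.0]
assignment, makespan = aco_schedule(runtimes, n_sites=2)
print(assignment, makespan)
```

The trade-off mentioned above shows up in `n_ants` and `n_iters`: larger values explore more candidate schedules and tend to find better ones, at the cost of more computation per scheduling decision.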

This work is part of the SWARM project, which reimagines scientific workflow management for distributed, federated research infrastructure across DOE national laboratories.


Publications