Research

Our research programme centres on AI evaluation, safety, and human-AI interaction and collaboration. We combine mathematical rigour with empirical investigation to understand how large language models process information, solve tasks, and interact with humans in real-world contexts.

Benchmarks and Evaluation

We study the science of LLM evaluation, using systematic reviews and statistical modelling to ground and quantify the measurement validity of LLM benchmarks. We develop novel evaluation settings and frameworks for probing the limits of LLM reasoning in adversarial domains, including low-resource languages and interactive scenarios.
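
As a minimal illustration of the kind of statistical treatment involved (a sketch, not a depiction of our actual pipeline), the snippet below computes a percentile-bootstrap confidence interval for mean accuracy over per-item benchmark scores; the bootstrap_ci helper and the simulated scores are hypothetical.

    import random

    def bootstrap_ci(scores, n_resamples=10_000, alpha=0.05, seed=0):
        # Percentile bootstrap for mean accuracy over per-item 0/1 scores.
        rng = random.Random(seed)
        n = len(scores)
        means = sorted(sum(rng.choices(scores, k=n)) / n
                       for _ in range(n_resamples))
        lower = means[int(n_resamples * alpha / 2)]
        upper = means[int(n_resamples * (1 - alpha / 2)) - 1]
        return sum(scores) / n, (lower, upper)

    # Hypothetical data: 0/1 correctness on a 200-item benchmark.
    rng = random.Random(1)
    scores = [int(rng.random() < 0.7) for _ in range(200)]
    acc, (lower, upper) = bootstrap_ci(scores)
    print(f"accuracy = {acc:.3f}, 95% CI = [{lower:.3f}, {upper:.3f}]")

Even this simple interval shows that a single leaderboard number computed over a few hundred items carries non-trivial sampling uncertainty, which is one reason point estimates alone are a weak basis for comparing models.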

Agentic AI for Science

We develop agentic AI systems that automate and augment key steps of the scientific process, including literature discovery, evidence synthesis, hypothesis generation, and decision support. A core focus is building reliable, transparent, and domain-grounded agents for real-world scientific and policy-relevant applications.

AI Safety

We study the risks that advanced AI may pose to society, focusing on evaluating harms, developing technical mitigations, and strengthening AI governance. Our work spans this spectrum, from bias and toxicity in language models to misalignment in agentic systems.

Human-AI Interaction

We conduct large-scale empirical studies of how humans use AI systems in decision-making, including our landmark 1,300-participant study of LLM use for medical self-diagnosis and other healthcare applications.