The Big Picture
A team of specialized AI helpers coordinated by a central controller makes interview scoring more accurate, more resistant to manipulation, and easier to audit—without retraining models.
The Evidence
Breaking the interview workflow into four focused agents (question writer, security guard, scorer, and summarizer) plus a central controller leads to clearer, more consistent evaluations [Hierarchical Multi-Agent Pattern]. In real university admission tests, the multi-agent setup outperformed single-agent systems and matched expert reviewers on reliability and explainability. Layered security checks cut successful prompt-injection attacks compared with single-agent designs. Trade-offs include higher compute and coordination overhead and reliance on preset rubrics for fairness.
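The paper does not publish the Security agent's internals, but the layered idea can be sketched as a chain of independent checks, each of which can veto a response before it reaches the scorer. The pattern list and character limit below are illustrative assumptions, not the system's actual rules:

```python
import re

# Hypothetical injection signatures; a real deployment would use a far
# richer detector (or a dedicated classifier model) per layer.
INJECTION_PATTERNS = [
    r"ignore (all|previous) instructions",
    r"you are now",
    r"system prompt",
]

def pattern_layer(text: str) -> bool:
    """Reject responses matching known prompt-injection phrasings."""
    return not any(re.search(p, text, re.IGNORECASE) for p in INJECTION_PATTERNS)

def length_layer(text: str, max_chars: int = 4000) -> bool:
    """Reject empty or oversized responses."""
    return 0 < len(text) <= max_chars

def screen(text: str, layers=(length_layer, pattern_layer)) -> bool:
    """A response passes only if every layer approves it."""
    return all(layer(text) for layer in layers)
```

Because each layer is independent, an attack must evade every check at once, which is why layering tends to reduce the successful-attack rate relative to a single monolithic filter.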
Data Highlights
55 candidate participants were used in real admissions-style evaluations, giving practical coverage across backgrounds.
A panel of 10 senior professors provided ground-truth reviews for scoring and consistency checks.
The system was tested with 3 different backbone models (GPT-5-mini, Qwen-plus, Kimi-K2) to show model-agnostic adaptability.
What This Means
Engineering teams building AI-driven interviewing or assessment tools should care because modular specialization improves fault tolerance and makes evaluations auditable. Hiring managers and university admissions officers can use the approach to reduce subjective bias and to produce clearer feedback for candidates. Researchers tracking safe AI deployment will find the layered security and traceable logs useful starting points [Agent Registry Pattern].
Key Figures

Figure 1: Overview of CoMAI, a collaborative multi-agent interview framework that orchestrates specialized agents through a centralized controller.

Figure 2: Process overview of the CoMAI framework. The system retrieves a candidate’s resume from the database, which triggers the Question Generation agent to formulate interview questions. Responses are first screened by the Security agent; if approved, they are evaluated by the Scoring agent and archived in the internal memory. Feedback from the Scoring agent informs subsequent question generation. Upon completion of the interview, the Summary agent consolidates all information into a final report, which is stored in the database along with the raw records.

Figure 3: CoMAI dynamically asks follow-up questions to probe the interviewee’s reasoning process.

Figure 4: Categories of intercepted prompt-word attacks.
Considerations
The study used a relatively small real-world sample (55 participants), so results may vary at larger scale or in different cultural contexts. The system depends on structured scoring rubrics, so bias in rubrics or training data can still produce skewed outcomes without continuous auditing. Modular coordination adds latency and compute cost, which may limit adoption for high-volume, low-stakes screening tasks [Evaluation-Driven Development (EDDOps)].
Methodology & More
A central controller orchestrates four specialized AI agents: a Question Generation agent that crafts targeted prompts from a candidate’s resume, a Security agent that filters out malicious or confusing inputs, a Scoring agent that applies rubric-guided quantitative and qualitative evaluation, and a Summarization agent that compiles audit-ready reports and stores episodic memory. Agents communicate through a standardized protocol and a deterministic finite-state controller manages flow and state, making every step traceable for audits and reviews [Supervisor Pattern]. The design avoids retraining by working as a plug-and-play layer on top of existing models.
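The controller described above can be sketched as a deterministic loop over the agent pipeline. Everything here is a schematic assumption (the agents are stubs passed in as callables, and CoMAI's actual protocol and state machine are not published), but it shows how scoring feedback can steer the next question and how every step lands in a traceable memory log:

```python
def run_interview(resume, agents, max_rounds=3):
    """Deterministic controller loop: question -> collect -> security ->
    scoring, with the final summary built from the accumulated memory.
    `agents` maps role names to callables (all hypothetical stubs here)."""
    state = {"resume": resume, "memory": [], "feedback": None}
    for _ in range(max_rounds):
        # Question generation is conditioned on the resume and prior feedback.
        question = agents["question"](state["resume"], state["feedback"])
        answer = agents["collect"](question)  # candidate's response
        if not agents["security"](answer):
            # Flagged responses are logged but never reach the scorer.
            state["memory"].append({"q": question, "flagged": True})
            continue
        score = agents["scoring"](question, answer)
        state["memory"].append({"q": question, "a": answer, "score": score})
        state["feedback"] = score  # informs the next question
    # The summary agent compiles an audit-ready report from episodic memory.
    return agents["summary"](state["memory"])
```

Because the loop is deterministic and every transition writes to `memory`, an auditor can replay exactly which question, screen decision, and score produced each line of the final report.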
Validated with 55 participants and judged against a panel of 10 senior professors, the multi-agent setup showed clearer, more explainable decisions than single-agent baselines and public single-agent interview systems. Security checks reduced vulnerabilities to prompt-injection style attacks, and rubric-driven scoring enabled adaptive difficulty and more interpretable feedback. Key trade-offs are higher computational cost and added coordination complexity; future work should focus on efficiency, human-in-the-loop calibration, multimodal signals (like video), and continuous fairness auditing to reduce rubric or data-driven bias [LLM-as-Judge Pattern].
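Rubric-driven scoring is what makes the feedback interpretable: each criterion is rated separately and the weighted total is reported alongside the per-criterion breakdown. The criteria and weights below are invented for illustration; the paper does not publish its rubric:

```python
# Hypothetical rubric: criterion -> weight (weights sum to 1.0).
RUBRIC = {
    "correctness": 0.4,
    "reasoning": 0.35,
    "communication": 0.25,
}

def score_answer(ratings: dict) -> dict:
    """Combine per-criterion ratings (1-5) into a weighted total,
    keeping the breakdown so the score stays explainable."""
    missing = set(RUBRIC) - set(ratings)
    if missing:
        raise ValueError(f"unrated criteria: {sorted(missing)}")
    total = sum(RUBRIC[c] * ratings[c] for c in RUBRIC)
    return {"total": round(total, 2), "breakdown": dict(ratings)}
```

Returning the breakdown rather than only the total is what lets a candidate (or auditor) see which criterion drove the result, and it gives a fairness audit a concrete per-criterion signal to monitor.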
Credibility Assessment:
arXiv preprint with no affiliations listed; the authors have very low h-indices (h = 1 for the lead author). Limited reputation signals; emerging work.