Copilot Studio Agent Builder: a PDS comparison tool and the eval that humbled it

Copilot Studio’s agent builder makes it trivially easy to wrap a set of documents in an AI interface and share it across Teams. A year on from building Clear PDS in ChatGPT, I rebuilt the concept with a different platform, a more structured approach — and ran into a harder question about what it actually means to evaluate one of these tools.

Pat Reen https://www.linkedin.com/in/patrick-reen/
2026-01-15

Background

…What if you built the same tool twice, a year apart, with different platforms — and the interesting finding was neither tool?…

A year ago I built Clear PDS — a custom GPT with ten life insurance Product Disclosure Statements uploaded as its knowledge base, designed for comparative analysis across the Australian market. That post ended on a specific note about its limitations: retrieval over long, dense documents was imperfect; the model would sometimes surface adjacent content and present it as an answer; and instructions about uncertainty needed to be explicit and tight.

The problem is the same — compare Australian retail life insurance PDSs on demand.

This time: Copilot Studio’s Agent Builder. A more structured instruction set. A cleaner knowledge base. A proper attempt at evaluation. And, as it turned out, a collaboration with the people who actually use this kind of tool daily — which produced the most useful finding of the whole exercise.


What is Copilot Studio Agent Builder?

Copilot Studio is Microsoft’s low-code platform for building AI agents integrated with Microsoft 365. The Agent Builder is the no-code entry point: you configure an agent through a form-based UI — description, instructions, knowledge sources — and it’s deployed to Teams or other Microsoft 365 surfaces without writing a line of code.

Structurally it’s similar to custom GPTs. The key inputs are the same: a description and purpose that define what the agent is for, a set of instructions that constrain its behaviour and shape its outputs, and a knowledge base of uploaded files the agent can retrieve from at query time. Under the hood it runs on OpenAI’s GPT models, the same family that powers ChatGPT.

The distinction is integration depth. A custom GPT is shared via a link and lives in the ChatGPT interface. A Copilot Studio agent can be pinned to a Teams channel, embedded in a SharePoint page, or exposed as a standalone web chat. For teams that already work in Microsoft 365, this means the agent can live where the work already happens — not in a separate tool that requires a context switch.

The Agent Builder also exposes Topics and Tools. Topics are conversation branches that trigger on specific phrases — you can route “what’s the suicide exclusion?” differently from a general PDS query, applying specific instructions or injecting context for that branch. Tools connect the agent to actions beyond its knowledge base, such as connectors and flows.

The knowledge base

The PDS corpus covers seven insurers: ClearView, TAL, AIA, Acenda (MLC Life), Zurich, Neos, and NobleOak — all current PDSs at the time of build, uploaded as PDFs. Each PDF is indexed against the agent at configuration time and retrieved via embedding similarity at query time. How the documents are chunked and embedded, however, is not something the builder controls.
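
Under the hood this is the standard embedding-retrieval pattern. A minimal sketch of that pattern follows, with a toy hashing embedding standing in for the platform’s real model; none of this is Copilot Studio’s actual code, which isn’t exposed to the builder.

```python
import numpy as np

def embed(text: str, dim: int = 256) -> np.ndarray:
    """Toy bag-of-words hashing embedding; stands in for the platform's real model."""
    v = np.zeros(dim)
    for tok in text.lower().split():
        v[hash(tok) % dim] += 1.0
    n = np.linalg.norm(v)
    return v / n if n else v

def retrieve(query: str, chunks: list[str], k: int = 3) -> list[str]:
    q = embed(query)
    # Rank every indexed chunk by cosine similarity to the query; keep the top k.
    ranked = sorted(chunks, key=lambda c: float(embed(c) @ q), reverse=True)
    return ranked[:k]
```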

The instruction problem

The first version of the instructions was too restrictive.

The intent was sound: for a tool used to inform product decisions in a regulated industry, preventing the agent from fabricating or extrapolating beyond the documents is genuinely important. The instructions led with “use only information explicitly stated in the PDS documents” and “do not infer, assume, or extend beyond the PDS.”

The problem was that, in the agent’s behaviour, this collapsed retrieval-and-synthesis into the same category as fabrication. Comparing two insurers’ income protection terms requires the agent to retrieve relevant chunks from two documents, extract the substantive content, and assemble a comparison. That’s synthesis, not inference — but the tight wording treated them the same. The result: the agent hedged rather than synthesised, refused to answer outright, or returned an error.

A second issue was PDS identification. When the agent retrieved a chunk, it wasn’t always confident about which document it came from. Responses would contain the right information but attribute it vaguely to “the relevant PDS” rather than naming the insurer.

The revised instructions

The revised version distinguishes clearly between the constraint that matters (ground your answer in the documents; don’t describe system behaviour or administrative practice) and the synthesis that should be unrestricted (assemble, compare, and present what the documents say). Key changes:

State values directly, not abstractly. The instruction: “if the PDS provides explicit values, options, formulas, or lists, include them explicitly rather than describing them abstractly.” This is the single most impactful change. It’s the difference between “the benefit period depends on occupation category” and “2 years, 5 years, or to age 65.”

Silence is an answer. When the PDS doesn’t address a question, say so: “if the PDS does not specify this information, state that directly.” Hedging toward thematically adjacent content when the document is silent is a more dangerous failure mode than admitting the document doesn’t cover it.

Separate insurers clearly. Every response must handle each insurer independently. Merged or blended answers across insurers are wrong in a way that can’t be fixed by the user; they’d have to know which parts of the answer came from which document.

Minimal responses. “Keep responses minimal and bounded; avoid unnecessary explanation.” For a tool used by people who know the product, over-explaining is friction. The rubric for length: answer the question, cite the source, stop.

Structured output. Each response follows a fixed format: the insurer named explicitly, the direct answer with values stated, a typed conditions list (each condition prefixed with its type: [exclusion], [cap], [timing], and so on), and the PDS section it came from.

The typed conditions list is the most useful structural addition. Requiring the agent to classify each condition forces it to be specific about what kind of thing it’s stating — a trigger for a benefit, an exclusion, a cap on the amount — rather than presenting a flat list of facts. It also makes responses machine-parseable if you want to run automated checks against them.
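
As a sketch of what machine-parseable buys you, assuming the typed-prefix convention described above; the condition text in the example is illustrative, not quoted from any PDS:

```python
import re

# Pulls typed conditions like "[exclusion] ..." out of an agent response,
# relying on the prefix convention the instructions require.
CONDITION = re.compile(
    r"^\[(?P<type>exclusion|cap|timing|trigger)\]\s*(?P<text>.+)$",
    re.MULTILINE,
)

def parse_conditions(response: str) -> list[dict]:
    return [m.groupdict() for m in CONDITION.finditer(response)]

# Illustrative response text, not quoted from any PDS:
response = (
    "[trigger] Benefit payable once the waiting period has ended\n"
    "[exclusion] No benefit for intentional self-inflicted injury\n"
    "[cap] Monthly benefit capped at a percentage of pre-disability income"
)
assert len(parse_conditions(response)) == 3
```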

Topics, Tools, and what didn’t work

The Copilot Studio Agent Builder exposes two mechanisms for additional control beyond the main instructions.

Topics let you define conversation branches that trigger on recognised phrases. The intent was to give the agent a separate branch for sensitive topics — mental health exclusions and suicide clauses specifically — with instructions tuned for those queries rather than applying the general instructions. The agent was inconsistent on these topics: it would sometimes hedge or decline to engage.

The net result: the agent with well-tuned general instructions and a structured output format outperformed the agent with Topics on most queries. Added configuration is only worth it if the base instruction layer has a genuine gap it can’t fill.

Teams integration

The distribution via Teams is the clearest win. The agent is embedded in a product and pricing team channel, so questions get asked where the product discussion is already happening, with no switch to a separate tool.

The eval

Building the tool is the easy part. Knowing whether it works is harder.

I put together an initial evaluation set using Claude and ChatGPT — fourteen questions covering income protection, TPD, trauma, and cross-insurer comparisons, each with an expected answer grounded in the PDS content. Reasonable starting coverage.

A product team professional reviewed them. The verdict was direct: “Your test queries are way too clean.”

Of the fourteen queries, four were genuinely good: the suicide exclusion, activities of daily living for TPD, any occupation versus own occupation definitions, and the notification timeframe for a claim. The remaining ten were described as “a mix of unhelpful and downright evasive.”

Three examples make the gap concrete:

Benefit period options. The eval’s expected answer: “The benefit period for Income Protection cover is selected at application and shown on the policy schedule. Available benefit periods vary by occupation category and policy structure.” A useful answer: “2 years, 5 years, or to age 65. For policies with a benefit period to age 65, the disability definition changes from regular occupation to suited occupation after 30 months.” The first answer describes where the answer can be found. The second is the answer.

Partial disability calculation. The eval expected: “calculated by reference to the insured’s pre-disability income and the income they earn while partially disabled, using the formula and conditions set out in the Income Protection section.” The formula is in the PDS: (A − B) ÷ A × monthly benefit amount, where A is pre-disability earnings and B is earnings while partially disabled. (For illustration: $10,000 pre-disability earnings and $4,000 earned while partially disabled gives (10,000 − 4,000) ÷ 10,000 = 60% of the monthly benefit.) Describing the existence of a formula while declining to state it is the same failure mode as hedging — technically correct, practically useless.

Waiting periods across insurers. The eval expected: “waiting period options are selected at application and shown on the policy schedule.” The actual answer: ClearView offers 30, 60, or 90 days; TAL uses weekly increments (4, 8, 13, or 26 weeks) by product variant and occupation category; AIA uses the same weekly structure, varying by option and whether the cover is inside or outside superannuation. The eval’s expected answer is the shape of a response, not a response.

Why LLM-generated eval is systematically biased

The pattern across all three is the same. When you ask a language model to generate evaluation questions, it generates questions it can answer well — and expected answers it would produce. LLMs hedge toward hedged answers. They describe the location of specific information rather than stating it. They produce technically defensible non-answers.

An agent that scores well against this eval has been measured on whether it can match a hedged, evasive style. An agent that scores well against the professional’s eval has been measured on whether it actually answers the question correctly.

The product professional also flagged a dimension the initial eval missed entirely: questions where the PDS is deliberately silent. A good agent needs to be able to say, confidently, that a question cannot be answered from the documents — and this is specifically what you should be testing. Several of the real-world queries are designed to catch an agent that invents rather than admits the limit of its knowledge.

What good eval looks like

The rubric the professional offered for evaluating responses in practice:

  1. Does the agent answer the actual question — including shorthand, assumption-heavy questions from people who know the product?
  2. Does it avoid inventing when the PDS is silent?
  3. Does it separate product rule from system behaviour when they differ?
  4. Does it set appropriate scope boundaries (tax advice, adviser responsibility) without over-explaining?
  5. Is the response the minimal necessary length? Over-explaining is a failure mode.

The professional also suggested tagging each eval query with two dimensions: answer_type (direct_fact, permission, taxonomy, policy_vs_system, pds_silent) and expected_docs (PDS, Adviser Guide, or “outside PDS”). This turns the eval from a list of questions into a structured test suite — you can measure not just aggregate accuracy but whether the system breaks down specifically on pds_silent queries, or on cross-insurer comparisons, or on questions where the right answer is to refer to an adviser.
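
As a data structure, that suggestion is compact. A sketch follows, using the tag values named above; the sample entries are illustrative rather than lifted from the real eval set.

```python
from dataclasses import dataclass, field

@dataclass
class EvalQuery:
    question: str
    expected: str
    answer_type: str  # direct_fact | permission | taxonomy | policy_vs_system | pds_silent
    expected_docs: list[str] = field(default_factory=list)  # ["PDS"], ["Adviser Guide"], or [] for "outside PDS"

queries = [
    EvalQuery(
        question="What are the IP benefit period options?",
        expected="2 years, 5 years, or to age 65",
        answer_type="direct_fact",
        expected_docs=["PDS"],
    ),
    EvalQuery(
        question="Can the premium be paid annually in arrears?",  # illustrative
        expected="The PDS does not specify this; state the silence directly.",
        answer_type="pds_silent",
    ),
]

# Slicing by tag turns aggregate accuracy into per-failure-mode accuracy:
silent = [q for q in queries if q.answer_type == "pds_silent"]
```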

The more sophisticated eval set developed alongside the custom RAG version of this tool takes these properties seriously. It includes difficulty ratings (easy / medium / hard), explicit must_contain and must_not_contain constraints per query, notes flagging known traps (the mental health misconception, the claims notification timing that is “as soon as reasonably possible” rather than any specific number of days), and answer-type tagging. This is an eval designed to catch specific failure modes, not just confirm the agent can locate text.
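
The must_contain / must_not_contain constraints reduce to a simple check. A sketch; the specific constraint strings below are assumptions in the spirit of the eval notes, not entries from the actual set.

```python
def check(response: str, must_contain: list[str], must_not_contain: list[str]) -> dict:
    """Return the constraints a response misses and the forbidden strings it includes."""
    r = response.lower()
    return {
        "missing": [s for s in must_contain if s.lower() not in r],
        "forbidden": [s for s in must_not_contain if s.lower() in r],
    }

# The claims-notification trap from the eval notes: the right answer is
# "as soon as reasonably possible", and any specific day count is an invention.
result = check(
    response="Notify the insurer as soon as reasonably possible after the event.",
    must_contain=["as soon as reasonably possible"],
    must_not_contain=["30 days", "90 days"],
)
assert not result["missing"] and not result["forbidden"]
```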

What comes next

Sharing the agent in Teams opened a collaboration that’s changed how I think about both the tool and the eval.

The plan is to work through the eval set with the product team — identifying where the agent’s responses are useful versus evasive, where it’s confident when it shouldn’t be, and where the instruction set needs to be sharpened. There are also parts of the Agent Builder I haven’t exhausted: further Topics configuration for routing sensitive-topic queries, and the broader Tool framework.

At the same time, I’m building a custom RAG pipeline over the same corpus. The Copilot Studio agent is deployable today; the custom RAG approach is designed to fix the retrieval limitations the Copilot Studio experience made concrete. Better chunking at section boundaries, structural metadata (section name, insurer, document version), and metadata-filtered retrieval before similarity scoring. For PDSs — long but consistently structured, with identifiable sections for benefits, exclusions, definitions, and conditions — this produces materially better retrieval quality, especially for cross-insurer queries where the relevant chunk might be a specific subsection from two different documents.
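
A sketch of that metadata-filtered retrieval step, with the Chunk shape and field names as assumptions about the pipeline rather than its actual code:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Chunk:
    text: str
    insurer: str   # e.g. "TAL" — structural metadata attached at chunking time
    section: str   # e.g. "Income Protection > Waiting Periods"
    version: str   # PDS version the chunk came from

def filtered_retrieve(chunks: list[Chunk], score: Callable[[Chunk], float],
                      insurer: str | None = None, section_prefix: str | None = None,
                      k: int = 5) -> list[Chunk]:
    # Filter on exact metadata first, then rank the survivors by similarity.
    # For a cross-insurer query, run this once per insurer so each named
    # document is guaranteed to contribute chunks to the comparison.
    pool = [c for c in chunks
            if (insurer is None or c.insurer == insurer)
            and (section_prefix is None or c.section.startswith(section_prefix))]
    return sorted(pool, key=score, reverse=True)[:k]
```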

The eval set is the shared asset across both. Running the same queries against the Copilot Studio agent and the custom RAG pipeline gives a direct comparison on the metrics that matter: not whether they can find text in a document, but whether they state the formula when asked about a calculation, whether they correctly flag when the PDS is silent, and whether they attribute responses to the right source document. That comparison is still to come.
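
Mechanically, the comparison is a loop over a shared query list. A sketch, assuming each system is wrapped behind a simple question-in, answer-out callable (ask_copilot and ask_rag below are hypothetical names):

```python
from typing import Callable

def compare(questions: list[str], systems: dict[str, Callable[[str], str]]) -> dict[str, list[str]]:
    """Ask every question of every system; return responses keyed by system name."""
    return {name: [ask(q) for q in questions] for name, ask in systems.items()}

# Hypothetical wiring — each callable fronts whatever client code reaches the system:
# responses = compare(questions, {"copilot_studio": ask_copilot, "custom_rag": ask_rag})
# Score each response set with the constraint check above, sliced by answer_type.
```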

If you were building this

The patterns that transfer:

Don’t let instructions around uncertainty collapse into instructions against synthesis. “Use only information explicitly stated” and “do not infer beyond the PDS” are different constraints. The first is about grounding; the second, taken too literally, prevents the agent from assembling retrieved facts into a useful response. Separate them in your instructions.

Require explicit values, not abstract references. The instruction “state values directly rather than describing them abstractly” was the single most impactful change to output quality. Apply the same principle when writing eval expected answers: if the PDS gives a formula, your expected answer should contain the formula.

Typed output structure adds discipline. Requiring the agent to produce a typed conditions list (each condition prefixed with [exclusion], [cap], [timing], etc.) is harder to satisfy vaguely than a free-form response. Structure in the output format reduces hedging in the content.

Get a domain expert to review your eval before trusting your metrics. LLM-generated eval is biased toward questions that LLMs can answer well. Real-world questions are shorthand, assumption-heavy, and often at the boundary of what the document covers. Measuring against the first set while deploying into the second gives you a false picture of how the tool performs.

Treat “the PDS is silent on this” as a first-class answer. This is harder than it sounds and should be explicitly tested. Include queries where the correct answer is a bounded non-answer — and verify that your agent states the limit confidently rather than inventing adjacent content.

Teams integration changes the distribution story for internal tools. If your audience is already in Microsoft 365, a Copilot Studio agent in a Teams channel is meaningfully lower-friction than a shared link. The tool is where the work already is, and that changes adoption and the feedback loop.

Eval quality is a multiplier on everything else. A retrieval system that scores well against a bad eval is optimised for the wrong thing. Build the eval before you start optimising — and get someone who does the real-world work to tell you whether your questions are actually hard enough.