{"id":20923,"date":"2025-03-06T11:00:04","date_gmt":"2025-03-06T11:00:04","guid":{"rendered":"https:\/\/www.equalexperts.com\/?p=20923"},"modified":"2025-03-27T12:53:56","modified_gmt":"2025-03-27T12:53:56","slug":"engineering-genai-systems-a-systematic-approach-through-healthcare-documentation-part-2","status":"publish","type":"post","link":"https:\/\/www.equalexperts.com\/blog\/data-ai\/engineering-genai-systems-a-systematic-approach-through-healthcare-documentation-part-2\/","title":{"rendered":"Engineering GenAI systems: A systematic approach through healthcare documentation (Part 2)"},"content":{"rendered":"<p>&#8220;Look how well it extracted the patient\u2019s history!&#8221;<\/p>\n<p id=\"c49a\" data-selectable-paragraph=\"\">The physician\u2019s enthusiasm was palpable as they reviewed our initial prototype. At that moment, I witnessed a common pattern in enterprise AI implementations: the seductive power of early demos overshadowing the engineering rigour needed for production systems. While our proof-of-concept (discussed in <a href=\"https:\/\/www.equalexperts.com\/blog\/ai\/engineering-genai-systems-a-systematic-approach-through-healthcare-documentation\/\">Part 1<\/a>) demonstrated the foundational capability of large language models to parse clinical conversations, the path from a promising demo to a production-ready system requires sophisticated validation infrastructure.<\/p>\n<p id=\"31b8\" data-selectable-paragraph=\"\">Continuing the systematic approach from Part 1, we now face a core engineering task: building an evaluation framework robust enough to stress-test our solution across diverse clinical scenarios. 
Each variation introduces potential failure modes that must be systematically validated, from edge cases in medical terminology to speciality-specific documentation patterns.<\/p>\n<p id=\"cdb2\" data-selectable-paragraph=\"\">Our evaluation framework comprises three components: golden datasets that accurately represent the diversity and complexity of real-world clinical documentation; comprehensive metrics that quantify technical and clinical quality aspects; and evaluations, often called Evals, which validate system behaviour.<\/p>\n<p id=\"a9a8\" data-selectable-paragraph=\"\">Let\u2019s see how these components work together in practice.<\/p>\n<h2 id=\"ee62\">Building a golden dataset for SOAP note generation<\/h2>\n<p id=\"64f4\" data-selectable-paragraph=\"\">At the core of any robust evaluation framework lies the golden dataset \u2014 a curated collection of inputs and their corresponding reference outputs that serves as ground truth for system validation. While the concept sounds straightforward, creating a reliable golden dataset often reveals a complex interplay between theoretical requirements and practical constraints.<\/p>\n<p id=\"366b\" data-selectable-paragraph=\"\">In our medical documentation system, we identified a solid foundation: a diverse set of patient-doctor dialogues. However, the critical missing piece was the corresponding reference SOAP notes, the \u201cgold standard\u201d outputs that would validate our system\u2019s performance. Generating these reference notes (again) required collaboration with a physician, given the domain expertise needed.<\/p>\n<p id=\"14fb\" data-selectable-paragraph=\"\">Rather than pursuing the traditional (and time-intensive) route of manual annotation, I developed a hybrid approach that leveraged AI capabilities. 
I used a simple prompt to generate the initial SOAP note drafts:<\/p>\n<p data-selectable-paragraph=\"\"><img decoding=\"async\" class=\"aligncenter wp-image-20927 size-large\" src=\"https:\/\/www.equalexperts.com\/wp-content\/uploads\/2025\/02\/1-1200x314.webp\" alt=\"The prompt to generate initial SOAP notes drafts - the prompt reads &quot;You are an experienced medical professional tasked with converting patient-doctor dialogues into standardized SOAP notes.\" width=\"1200\" height=\"314\" \/><\/p>\n<p data-selectable-paragraph=\"\">I combined this prompt with a structured JSON schema:<\/p>\n<p data-selectable-paragraph=\"\"><img decoding=\"async\" class=\"aligncenter wp-image-20926 size-large\" src=\"https:\/\/www.equalexperts.com\/wp-content\/uploads\/2025\/02\/2-1200x962.webp\" alt=\"An image of the JSON schema, as described under the heading &quot;Building a golden dataset for SOAP note generation&quot;.\" width=\"1200\" height=\"962\" \/><\/p>\n<p id=\"584e\" data-selectable-paragraph=\"\">The schema handles LLM output formatting (using structured outputs from OpenAI). Together, the prompt and schema provided a low-effort starting point for physician review.<\/p>\n<p id=\"430c\" data-selectable-paragraph=\"\">While production systems typically demand hundreds of validated examples, we strategically limited our initial dataset to 25 entries for this exploration. However, even with this concentrated scope, we maintained rigorous selection criteria:<\/p>\n<ol>\n<li id=\"55b9\" data-selectable-paragraph=\"\">Representational diversity: Included varied dialogue patterns and medical conditions to stress-test the system\u2019s adaptability<\/li>\n<li id=\"3135\" data-selectable-paragraph=\"\">Edge case coverage: Deliberately incorporated complex scenarios that challenge common assumptions<\/li>\n<\/ol>\n<p id=\"c17d\" data-selectable-paragraph=\"\">The process revealed why golden datasets, despite their critical importance, often become a bottleneck in AI system development. 
They demand a rare combination of domain expertise, technical precision, and substantial time investment \u2014 it took a few hours to review the 25 entries, even with the synthetic drafts. Yet, this foundation proves invaluable when scaling systems from promising prototypes to production-ready solutions.<\/p>\n<p id=\"8c84\" data-selectable-paragraph=\"\">This golden dataset would serve multiple critical functions in our use case. The question became: how do we leverage this carefully curated foundation to build a comprehensive evaluation framework? This leads us to our next challenge: implementing systematic evaluation strategies.<\/p>\n<h2 id=\"baf0\">Evals<\/h2>\n<p id=\"771a\" data-selectable-paragraph=\"\">Evals are fundamentally tests, but they represent an evolution in testing methodology driven by the unique challenges of LLM-based systems. While traditional software tests verify deterministic behaviours against fixed expectations, Evals handle language model outputs&#8217; inherent variability and contextual nature. 
There are two types of evals:<\/p>\n<p id=\"6cc3\" data-selectable-paragraph=\"\">Deterministic Evals:\u00a0these provide unambiguous pass\/fail signals for structured fields:<\/p>\n<ul>\n<li id=\"7be5\" data-selectable-paragraph=\"\">Boolean fields (smoking status, drug use)<\/li>\n<li id=\"5f8b\" data-selectable-paragraph=\"\">Enumerated values (alcohol consumption frequency)<\/li>\n<li id=\"32c6\" data-selectable-paragraph=\"\">Required field presence<\/li>\n<li id=\"6617\" data-selectable-paragraph=\"\">Format compliance<\/li>\n<\/ul>\n<p id=\"a955\" data-selectable-paragraph=\"\">Non-Deterministic Evals:\u00a0these address qualitative aspects requiring nuanced assessment:<\/p>\n<ul>\n<li id=\"86a7\" data-selectable-paragraph=\"\">Semantic accuracy of chief complaints<\/li>\n<li id=\"102e\" data-selectable-paragraph=\"\">Completeness of medical histories<\/li>\n<\/ul>\n<p id=\"35f3\" data-selectable-paragraph=\"\">While deterministic evals are similar to common tests, non-deterministic evals require us to define the criteria for evaluation, usually composed of a few metrics. 
Continuing with our example, we identified three metrics as a starting point:<\/p>\n<ul>\n<li id=\"3a6b\" data-selectable-paragraph=\"\">Completeness: Measures whether all the clinical information is present regardless of the order or the way it\u2019s described<\/li>\n<li id=\"0013\" data-selectable-paragraph=\"\">Accuracy: All the present information should match the reference<\/li>\n<li id=\"92fc\" data-selectable-paragraph=\"\">No hallucination: Restating\/rephrasing existing information is acceptable, but the inference of a plan, such as a diagnosis, should not happen.<\/li>\n<\/ul>\n<p id=\"b189\" data-selectable-paragraph=\"\">Taking this into account, we can jump into the implementation.<\/p>\n<h2 id=\"61c2\">Practical deep dive<\/h2>\n<p id=\"aae5\" data-selectable-paragraph=\"\">For practical implementation, I leveraged\u00a0<a href=\"https:\/\/github.com\/promptfoo\/promptfoo\" target=\"_blank\" rel=\"noopener ugc nofollow\">promptfoo<\/a>, an open-source framework that brings software engineering rigour to prompt evaluation, to illustrate how evals could look in practice.<\/p>\n<p id=\"d862\" data-selectable-paragraph=\"\">Here\u2019s a concrete example of how we can structure a deterministic test:<\/p>\n<p data-selectable-paragraph=\"\"><img decoding=\"async\" class=\"aligncenter wp-image-20928 size-large\" src=\"https:\/\/www.equalexperts.com\/wp-content\/uploads\/2025\/02\/3-1200x630.webp\" alt=\"An example structure of a deterministic test as described under the heading &quot;Practical deep dive&quot;.\" width=\"1200\" height=\"630\" \/><\/p>\n<p id=\"bc63\" data-selectable-paragraph=\"\">In the prompts section, we can declare one or more prompts to be evaluated; in the providers section, we can declare one or more providers to be evaluated. The tests section contains multiple tests, and each test is composed of vars (input variables) and asserts that compare the output with the reference. 
Promptfoo has many types of assertions; for this specific example, I\u2019m using a Python assert, which lets me write the assertion as Python code.<\/p>\n<p id=\"2c90\" data-selectable-paragraph=\"\">To cover the whole golden dataset, I wrote a script that generates the YAML config from it: 25 tests like the previous one, one for each entry. With the config file in place, we can run the evaluation and get the following results:<\/p>\n<p data-selectable-paragraph=\"\"><img decoding=\"async\" class=\"aligncenter wp-image-20930 size-large\" src=\"https:\/\/www.equalexperts.com\/wp-content\/uploads\/2025\/02\/4-1200x795.webp\" alt=\"Results of the prompt as described under the heading &quot;Practical deep dive.&quot;\" width=\"1200\" height=\"795\" \/><\/p>\n<p id=\"41db\" data-selectable-paragraph=\"\">For this particular case, the initial prompt behaves quite well in both GPT-4o and GPT-4o-mini for the smoking field. Implementing the remaining deterministic tests is now straightforward.<\/p>\n<p id=\"2128\" data-selectable-paragraph=\"\">For the non-deterministic tests, we will use a model-graded assert from promptfoo that allows us to pass a rubric prompt where we can describe how a generated field should be evaluated against a reference field.<\/p>\n<p data-selectable-paragraph=\"\"><img decoding=\"async\" class=\"aligncenter wp-image-20929 size-large\" src=\"https:\/\/www.equalexperts.com\/wp-content\/uploads\/2025\/02\/5-1200x849.webp\" alt=\"An example of a rubric prompt where we can describe how a reference field should be evaluated against a generated field as described under the heading &quot;Practical deep dive.&quot;\" width=\"1200\" height=\"849\" \/><\/p>\n<p id=\"b2ab\" data-selectable-paragraph=\"\">While promptfoo streamlines our evaluation process by running a rubric using GPT-4o as the default model evaluator, this introduces an interesting technical paradox: how do we validate the validator? 
Using a non-deterministic language model to evaluate another language model\u2019s output creates a potential blind spot in our evaluation framework.<\/p>\n<p id=\"964c\" data-selectable-paragraph=\"\">The solution emerges from applying recursive validation patterns \u2014 a common approach in complex system verification. Just as we validate our SOAP note generation, we must also systematically validate our rubric prompt.<\/p>\n<p id=\"662e\" data-selectable-paragraph=\"\">This requires creating a secondary golden dataset specifically for evaluating our rubric prompts:<\/p>\n<ol>\n<li id=\"b5a0\" data-selectable-paragraph=\"\">Input: Pairs of reference\/generated SOAP notes<\/li>\n<li id=\"153f\" data-selectable-paragraph=\"\">Output: Binary classification (1\/0) indicating evaluation correctness<\/li>\n<li id=\"46a3\" data-selectable-paragraph=\"\">Coverage: Diverse scenarios, including edge cases and boundary conditions<\/li>\n<\/ol>\n<p id=\"698e\" data-selectable-paragraph=\"\">Since our rubric returns a binary value, we can evaluate it with deterministic tests.<\/p>\n<p id=\"7001\" data-selectable-paragraph=\"\">After crafting this rubric golden dataset and the tests in promptfoo, the first iteration of the rubric prompt revealed gaps in evaluation coverage. Critical aspects of medical documentation quality weren\u2019t being captured consistently, and there were a few hallucinations. 
This led to a comprehensive redesign of the rubric around the metrics described above, resulting in the following structured rubric:<\/p>\n<p data-selectable-paragraph=\"\"><img decoding=\"async\" class=\"aligncenter wp-image-20925 size-large\" src=\"https:\/\/www.equalexperts.com\/wp-content\/uploads\/2025\/02\/6-1185x1024.webp\" alt=\"The redesigned structured rubric prompt as described under the heading &quot;Practical deep dive.&quot;\" width=\"1185\" height=\"1024\" \/><\/p>\n<p id=\"9686\" data-selectable-paragraph=\"\">This version worked flawlessly on the (small) rubric golden dataset.<\/p>\n<p id=\"c057\" data-selectable-paragraph=\"\">This experience reinforces a crucial lesson in AI system development: evaluation frameworks themselves must evolve through iterative refinement, guided by concrete validation metrics (deterministic if possible) and real-world usage patterns.<\/p>\n<h2 id=\"cbe7\">The reality check<\/h2>\n<p id=\"5081\" data-selectable-paragraph=\"\">We now have all the tools we need to create the evals for our initial use case: we have the golden dataset, we know how to do deterministic evals and non-deterministic evals, and last but not least, we have a degree of trust in our non-deterministic evals. To do this, I created a Python script that outputs a promptfoo config file with all the testing scenarios for all the golden dataset records. 
This is the result of running it with our current implementation described above:<\/p>\n<p data-selectable-paragraph=\"\"><img decoding=\"async\" class=\"aligncenter wp-image-20924 size-large\" src=\"https:\/\/www.equalexperts.com\/wp-content\/uploads\/2025\/02\/7-1200x1000.webp\" alt=\"The results of the current implementation as described under the heading &quot;The reality check.&quot;\" width=\"1200\" height=\"1000\" \/><\/p>\n<p id=\"ece6\" data-selectable-paragraph=\"\">Running our evaluation framework across the 25-case dataset provided good insights: only one SOAP note passed all evaluation criteria, although 309 out of 350 tests (about 88%) passed. While initially disappointing, this outcome offers valuable lessons about the complexity of medical documentation and the importance of comprehensive evaluation.<\/p>\n<p id=\"1dec\" data-selectable-paragraph=\"\">This result doesn\u2019t indicate failure \u2014 rather, it demonstrates the value of rigorous evaluation in surfacing areas requiring refinement. Each failing test provides specific, actionable feedback for improving our system\u2019s reliability.<\/p>\n<h2 id=\"39a4\">Engineering insights and trade-offs<\/h2>\n<p id=\"0c96\" data-selectable-paragraph=\"\">Through implementing this evaluation framework for SOAP note generation, several key architectural decisions emerged that warrant deeper examination. Let\u2019s explore the trade-offs and practical implications of these choices.<\/p>\n<h3 id=\"bdc2\">Golden datasets as living artefacts<\/h3>\n<p id=\"93a1\" data-selectable-paragraph=\"\">While our initial golden dataset provides a foundation for development and validation, production environments demand an evolutionary approach. Golden datasets must function as living artefacts that grow and adapt alongside the system they validate (e.g. 
incorporate new edge cases).<\/p>\n<h3 id=\"880f\">Granularity of evaluations<\/h3>\n<p id=\"d98f\" data-selectable-paragraph=\"\">Evaluating SOAP notes field-by-field rather than as complete documents proved instrumental in managing system complexity. This approach delivers clearer error signals and enables targeted improvements for specific components. However, it introduces significant computational overhead \u2014 our modest golden dataset of 25 cases spawned 350 distinct tests, with most requiring LLM inference for evaluation. While this granularity provides excellent debugging capabilities during development, it raises important considerations for production scaling.<\/p>\n<h3 id=\"4670\">Rubric design strategy<\/h3>\n<p id=\"b214\" data-selectable-paragraph=\"\">The rubric prompt unified key metrics (completeness, accuracy, hallucination, format compliance) into one rubric. While this consolidated approach served well for initial validation with our limited dataset, production deployment would benefit from decomposition into separate rubric evaluations. This separation would enable:<\/p>\n<ul>\n<li id=\"97c6\" data-selectable-paragraph=\"\">More precise performance monitoring<\/li>\n<li id=\"59bf\" data-selectable-paragraph=\"\">Targeted optimisation of specific quality dimensions<\/li>\n<li id=\"de0a\" data-selectable-paragraph=\"\">Clearer attribution of failure modes<\/li>\n<li id=\"fcf5\" data-selectable-paragraph=\"\">Enhanced maintainability of evaluation logic<\/li>\n<\/ul>\n<h3 id=\"e680\">Rubric score<\/h3>\n<p id=\"643a\" data-selectable-paragraph=\"\">The decision to implement binary scoring (1\/0) for our rubric evaluations, rather than nuanced scoring scales, emerged from lessons in production AI systems. This approach might seem reductionist at first glance. 
Still, it delivers several critical advantages: deterministic decision boundaries (making the rubric easy to evaluate), a rubric golden dataset that is easier to create, and less room for subjective interpretation.<\/p>\n<h3 id=\"fd76\">Evals cost<\/h3>\n<p id=\"58ed\" data-selectable-paragraph=\"\">Running evals at scale introduces non-trivial operational costs that demand strategic consideration. Our small golden dataset already costs several euros per run, a cost that multiplies significantly with larger datasets and continuous integration pipelines. Practical mitigation strategies include implementing staged evaluations (not every change requires full validation) and deploying dedicated infrastructure. These architectural decisions become crucial for maintaining system reliability and operational efficiency at scale.<\/p>\n<h2 id=\"896a\">Moving forward<\/h2>\n<p id=\"7a71\" data-selectable-paragraph=\"\">Our evaluation framework has provided clear visibility into current system performance. With only one SOAP note passing a full evaluation, we have a data-driven foundation for systematic improvement. The path forward involves targeted refinements across multiple dimensions:<\/p>\n<h3 id=\"20f9\">Prompt engineering evolution<\/h3>\n<p id=\"1487\" data-selectable-paragraph=\"\">The (intentionally) simple baseline prompt is a starting point rather than a final solution. 
With our evaluation framework in place, we can now:<\/p>\n<ul>\n<li id=\"ac62\" data-selectable-paragraph=\"\">Systematically test prompt variations (few-shot prompting, chain-of-thought reasoning, structured guidance)<\/li>\n<li id=\"46c3\" data-selectable-paragraph=\"\">Measure their specific impact on evaluation metrics<\/li>\n<\/ul>\n<h3 id=\"c467\">Architectural refinement<\/h3>\n<p id=\"b30f\" data-selectable-paragraph=\"\">If prompt engineering alone proves insufficient, the next step is to consider decomposing the system into specialised components:<\/p>\n<ul>\n<li id=\"9cdb\" data-selectable-paragraph=\"\">Breaking SOAP generation into discrete, independent stages<\/li>\n<li id=\"8b60\" data-selectable-paragraph=\"\">Implementing targeted models for high-complexity sections (e.g. for the smoking field, GPT-4o-mini proved good enough, but for complex sections, a stronger model is needed)<\/li>\n<li id=\"f37b\" data-selectable-paragraph=\"\">Building feedback loops (self-reflection \/ output refining)<\/li>\n<\/ul>\n<p id=\"dc84\" data-selectable-paragraph=\"\">The key insight here isn\u2019t just about improving accuracy \u2014 it\u2019s about building systems that evolve systematically based on quantifiable metrics. Our evaluation framework now enables this disciplined, evidence-based approach to production readiness.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>&#8220;Look how well it extracted the patient\u2019s history!&#8221; The physician\u2019s enthusiasm was palpable as they reviewed our initial prototype. At that moment, I witnessed a common pattern in enterprise AI implementations: the seductive power of early demos overshadowing the engineering rigour needed for production systems. 
While our proof-of-concept (discussed in Part 1) demonstrated the foundational [&hellip;]<\/p>\n","protected":false},"author":164,"featured_media":20934,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"content-type":"","inline_featured_image":false,"footnotes":""},"categories":[806],"tags":[],"location":[],"class_list":["post-20923","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-data-ai"],"acf":[],"yoast_head":"<!-- This site is optimized with the Yoast SEO Premium plugin v25.9 (Yoast SEO v25.9) - https:\/\/yoast.com\/wordpress\/plugins\/seo\/ -->\n<title>Engineering GenAI for healthcare: evaluation strategies | Equal Experts<\/title>\n<meta name=\"description\" content=\"How to evaluate GenAI systems for healthcare, ensure accuracy, reliability, and efficiency in medical documentation with systematic assessment methods.\" \/>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/www.equalexperts.com\/blog\/data-ai\/engineering-genai-systems-a-systematic-approach-through-healthcare-documentation-part-2\/\" \/>\n<meta property=\"og:locale\" content=\"en_GB\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Engineering GenAI systems: A systematic approach through healthcare documentation (Part 2)\" \/>\n<meta property=\"og:description\" content=\"How to evaluate GenAI systems for healthcare, ensure accuracy, reliability, and efficiency in medical documentation with systematic assessment methods.\" \/>\n<meta property=\"og:url\" content=\"https:\/\/www.equalexperts.com\/blog\/data-ai\/engineering-genai-systems-a-systematic-approach-through-healthcare-documentation-part-2\/\" \/>\n<meta property=\"og:site_name\" content=\"Equal Experts\" \/>\n<meta property=\"article:published_time\" 
content=\"2025-03-06T11:00:04+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2025-03-27T12:53:56+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/www.equalexperts.com\/wp-content\/uploads\/2025\/02\/Blog_Lead-52.png\" \/>\n\t<meta property=\"og:image:width\" content=\"1200\" \/>\n\t<meta property=\"og:image:height\" content=\"514\" \/>\n\t<meta property=\"og:image:type\" content=\"image\/png\" \/>\n<meta name=\"author\" content=\"Cl\u00e1udio Diniz\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:creator\" content=\"@EqualExperts\" \/>\n<meta name=\"twitter:site\" content=\"@EqualExperts\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"Cl\u00e1udio Diniz\" \/>\n\t<meta name=\"twitter:label2\" content=\"Estimated reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"10 minutes\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\/\/schema.org\",\"@graph\":[{\"@type\":\"Article\",\"@id\":\"https:\/\/www.equalexperts.com\/blog\/data-ai\/engineering-genai-systems-a-systematic-approach-through-healthcare-documentation-part-2\/#article\",\"isPartOf\":{\"@id\":\"https:\/\/www.equalexperts.com\/blog\/data-ai\/engineering-genai-systems-a-systematic-approach-through-healthcare-documentation-part-2\/\"},\"author\":{\"name\":\"Cl\u00e1udio Diniz\",\"@id\":\"https:\/\/www.equalexperts.com\/#\/schema\/person\/28ff89d676b184c93ab62bc91b0af11e\"},\"headline\":\"Engineering GenAI systems: A systematic approach through healthcare documentation (Part 
2)\",\"datePublished\":\"2025-03-06T11:00:04+00:00\",\"dateModified\":\"2025-03-27T12:53:56+00:00\",\"mainEntityOfPage\":{\"@id\":\"https:\/\/www.equalexperts.com\/blog\/data-ai\/engineering-genai-systems-a-systematic-approach-through-healthcare-documentation-part-2\/\"},\"wordCount\":1833,\"publisher\":{\"@id\":\"https:\/\/www.equalexperts.com\/#organization\"},\"image\":{\"@id\":\"https:\/\/www.equalexperts.com\/blog\/data-ai\/engineering-genai-systems-a-systematic-approach-through-healthcare-documentation-part-2\/#primaryimage\"},\"thumbnailUrl\":\"https:\/\/www.equalexperts.com\/wp-content\/uploads\/2025\/02\/Blog_Lead-52.png\",\"articleSection\":[\"Data &amp; AI\"],\"inLanguage\":\"en-GB\"},{\"@type\":\"WebPage\",\"@id\":\"https:\/\/www.equalexperts.com\/blog\/data-ai\/engineering-genai-systems-a-systematic-approach-through-healthcare-documentation-part-2\/\",\"url\":\"https:\/\/www.equalexperts.com\/blog\/data-ai\/engineering-genai-systems-a-systematic-approach-through-healthcare-documentation-part-2\/\",\"name\":\"Engineering GenAI for healthcare: evaluation strategies | Equal Experts\",\"isPartOf\":{\"@id\":\"https:\/\/www.equalexperts.com\/#website\"},\"primaryImageOfPage\":{\"@id\":\"https:\/\/www.equalexperts.com\/blog\/data-ai\/engineering-genai-systems-a-systematic-approach-through-healthcare-documentation-part-2\/#primaryimage\"},\"image\":{\"@id\":\"https:\/\/www.equalexperts.com\/blog\/data-ai\/engineering-genai-systems-a-systematic-approach-through-healthcare-documentation-part-2\/#primaryimage\"},\"thumbnailUrl\":\"https:\/\/www.equalexperts.com\/wp-content\/uploads\/2025\/02\/Blog_Lead-52.png\",\"datePublished\":\"2025-03-06T11:00:04+00:00\",\"dateModified\":\"2025-03-27T12:53:56+00:00\",\"description\":\"How to evaluate GenAI systems for healthcare, ensure accuracy, reliability, and efficiency in medical documentation with systematic assessment 
methods.\",\"breadcrumb\":{\"@id\":\"https:\/\/www.equalexperts.com\/blog\/data-ai\/engineering-genai-systems-a-systematic-approach-through-healthcare-documentation-part-2\/#breadcrumb\"},\"inLanguage\":\"en-GB\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\/\/www.equalexperts.com\/blog\/data-ai\/engineering-genai-systems-a-systematic-approach-through-healthcare-documentation-part-2\/\"]}]},{\"@type\":\"ImageObject\",\"inLanguage\":\"en-GB\",\"@id\":\"https:\/\/www.equalexperts.com\/blog\/data-ai\/engineering-genai-systems-a-systematic-approach-through-healthcare-documentation-part-2\/#primaryimage\",\"url\":\"https:\/\/www.equalexperts.com\/wp-content\/uploads\/2025\/02\/Blog_Lead-52.png\",\"contentUrl\":\"https:\/\/www.equalexperts.com\/wp-content\/uploads\/2025\/02\/Blog_Lead-52.png\",\"width\":1200,\"height\":514},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\/\/www.equalexperts.com\/blog\/data-ai\/engineering-genai-systems-a-systematic-approach-through-healthcare-documentation-part-2\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\/\/www.equalexperts.com\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"Engineering GenAI systems: A systematic approach through healthcare documentation (Part 2)\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\/\/www.equalexperts.com\/#website\",\"url\":\"https:\/\/www.equalexperts.com\/\",\"name\":\"Equal Experts\",\"description\":\"Making Software. 
Better.\",\"publisher\":{\"@id\":\"https:\/\/www.equalexperts.com\/#organization\"},\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\/\/www.equalexperts.com\/?s={search_term_string}\"},\"query-input\":{\"@type\":\"PropertyValueSpecification\",\"valueRequired\":true,\"valueName\":\"search_term_string\"}}],\"inLanguage\":\"en-GB\"},{\"@type\":\"Organization\",\"@id\":\"https:\/\/www.equalexperts.com\/#organization\",\"name\":\"Equal Experts\",\"url\":\"https:\/\/www.equalexperts.com\/\",\"logo\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-GB\",\"@id\":\"https:\/\/www.equalexperts.com\/#\/schema\/logo\/image\/\",\"url\":\"https:\/\/www.equalexperts.com\/wp-content\/uploads\/2018\/08\/Equal_Experts_Logo_CMYK_Colour.jpg\",\"contentUrl\":\"https:\/\/www.equalexperts.com\/wp-content\/uploads\/2018\/08\/Equal_Experts_Logo_CMYK_Colour.jpg\",\"width\":719,\"height\":340,\"caption\":\"Equal Experts\"},\"image\":{\"@id\":\"https:\/\/www.equalexperts.com\/#\/schema\/logo\/image\/\"},\"sameAs\":[\"https:\/\/x.com\/EqualExperts\",\"https:\/\/www.linkedin.com\/company\/equal-experts\/?viewAsMember=true\"]},{\"@type\":\"Person\",\"@id\":\"https:\/\/www.equalexperts.com\/#\/schema\/person\/28ff89d676b184c93ab62bc91b0af11e\",\"name\":\"Cl\u00e1udio Diniz\",\"image\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-GB\",\"@id\":\"https:\/\/www.equalexperts.com\/#\/schema\/person\/image\/\",\"url\":\"https:\/\/secure.gravatar.com\/avatar\/d70fbe38b0540d312610b719e2e75bc9f302aafe3264bf1eb8174eb191c4879d?s=96&d=mm&r=g\",\"contentUrl\":\"https:\/\/secure.gravatar.com\/avatar\/d70fbe38b0540d312610b719e2e75bc9f302aafe3264bf1eb8174eb191c4879d?s=96&d=mm&r=g\",\"caption\":\"Cl\u00e1udio Diniz\"}}]}<\/script>\n<!-- \/ Yoast SEO Premium plugin. 
-->"}
Diniz"}}]}},"_links":{"self":[{"href":"https:\/\/www.equalexperts.com\/wp-json\/wp\/v2\/posts\/20923","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.equalexperts.com\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.equalexperts.com\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.equalexperts.com\/wp-json\/wp\/v2\/users\/164"}],"replies":[{"embeddable":true,"href":"https:\/\/www.equalexperts.com\/wp-json\/wp\/v2\/comments?post=20923"}],"version-history":[{"count":0,"href":"https:\/\/www.equalexperts.com\/wp-json\/wp\/v2\/posts\/20923\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.equalexperts.com\/wp-json\/wp\/v2\/media\/20934"}],"wp:attachment":[{"href":"https:\/\/www.equalexperts.com\/wp-json\/wp\/v2\/media?parent=20923"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.equalexperts.com\/wp-json\/wp\/v2\/categories?post=20923"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.equalexperts.com\/wp-json\/wp\/v2\/tags?post=20923"},{"taxonomy":"location","embeddable":true,"href":"https:\/\/www.equalexperts.com\/wp-json\/wp\/v2\/location?post=20923"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}