/* global window */
// Content object for the article renderer (blog/blog-article.jsx).
// Produced by the Writer→Editor→Humanize→Proofreader chain; draft in blog/pipeline/drafts/what-counts-evidence.md
window.__ARTICLE__ = {
  slug: "what-counts-evidence",
  laneLabel: "EVIDENCE INDEX",
  kicker: "METHODOLOGY",
  readMins: 8,
  dateLabel: "Apr 24, 2026",
  title: "What counts as evidence? A framework for AI research outputs",
  deck: "Five layers separate a defensible finding from a confident guess — a checklist you can run against any research output, whoever generated it.",
  tags: ["methodology", "ai-research", "evidence-trail"],
  toc: [
    { id: "claim", num: "01 · THE PROBLEM", title: "The claim on the slide" },
    { id: "sounds-right", num: "02 · THE TEST", title: "Why 'it sounds right' fails" },
    { id: "clip", num: "03 · LAYER ONE", title: "The source clip" },
    { id: "transcript", num: "04 · LAYER TWO", title: "The transcript moment" },
    { id: "behavior", num: "05 · LAYER THREE", title: "The behavioral signal" },
    { id: "segment", num: "06 · LAYER FOUR", title: "The segment pattern" },
    { id: "confidence", num: "07 · LAYER FIVE", title: "The confidence indicator" },
    { id: "chain", num: "08 · WORKED EXAMPLE", title: "A claim, all the way down" },
    { id: "checklist", num: "09 · THE CHECKLIST", title: "The checklist you can steal" },
  ],
  body: [
    { t: "h2", id: "claim", num: "01 · THE PROBLEM", text: "The claim on the slide" },
    { t: "p", html: `A line shows up on a slide: "Users find the pricing page confusing." Maybe an AI tool wrote it. Maybe a junior researcher did. Maybe you did, at 11 p.m., from a vague memory of three sessions that blurred together. The sentence reads the same either way. That's the problem.` },
    { t: "p", html: `A claim and the evidence for a claim look identical on a slide. Both are short declarative sentences in the same font. The reader can't tell, from the text alone, whether "users find the pricing page confusing" rests on eleven timestamped clips across three segments or on one model's pattern-match against its training data. The packaging hides the difference. And the more fluent the writing, the more the packaging flatters the claim.` },
    { t: "p", html: `This matters more now than it did three years ago, because the cost of producing a plausible-sounding finding has dropped to roughly zero. A language model will write you a clean research summary in seconds, complete with confident recommendations and a tidy three-bullet rationale. It will sound like a senior researcher wrote it. Sometimes the underlying signal is real. Often it isn't, and you have no way to tell by reading.` },
    { t: "p", html: `So we need a test that doesn't depend on how the sentence sounds. We need to ask what sits underneath it. NeroView's answer is five layers, and a finding is defensible to the degree it exposes all five. Strip them away and what's left is a guess wearing the costume of a conclusion.` },
    { t: "pullquote", text: "A claim and the evidence for a claim look identical on a slide. The only way to tell them apart is to pull on the thread underneath." },

    { t: "h2", id: "sounds-right", num: "02 · THE TEST", text: `Why "it sounds right" is not a test` },
    { t: "p", html: `The instinct to trust fluent text is exactly the instinct that gets teams burned, and the AI era has made the trap sharper.` },
    { t: "p", html: `Start with how wrong confident-sounding outputs can be. When Stanford researchers tested general-purpose language models against more than 200,000 verifiable legal questions, hallucination rates ran from <strong>69% to 88%</strong> on specific legal queries (Stanford HAI, 2024). These weren't models mumbling that they weren't sure. They produced fluent, authoritative, wrong answers, and the same study found the models often couldn't tell when they'd erred (Stanford HAI, 2024). Purpose-built tools did better but not clean: a separate test of dedicated legal-research products clocked hallucination rates of <strong>17%</strong> for one and <strong>33%</strong> for another (Magesh et al., 2024). Law is a domain with crisp right answers and enormous stakes. Customer research is fuzzier and softer, which means a confabulated insight is harder to catch, not easier.` },
    { t: "p", html: `The second trap is that the model's own confidence won't save you. Across <strong>48 language models</strong>, self-reported confidence is poorly calibrated, with models clustering their stated certainty near the top of the range no matter how well they actually performed (Xiong et al., 2024). Even the best-calibrated systems stay overconfident. A model telling you it's 90% sure is not evidence that it should be 90% sure.` },
    { t: "p", html: `Put those together and you get the core risk: an output that reads as authoritative, asserts more confidence than it has earned, and offers no way to check. The fix isn't better prose. It's a structural requirement that every claim carry its receipts. Researchers have a name for the underlying principle. They call it <em>triangulation</em>: corroborating a finding across multiple independent sources and methods so it doesn't rest on any single fragile thread, a strategy Norman Denzin formalized for social science in 1978 (Denzin, 1978, via Delve). The five layers are triangulation made operational for an AI research output.` },

    { t: "h2", id: "clip", num: "03 · LAYER ONE", text: "Layer 1 — the source clip" },
    { t: "p", html: `The first question a skeptic asks is the oldest one in any investigation: <em>where did this come from?</em>` },
    { t: "p", html: `A source clip is the raw moment the claim points back to. The video, the audio, the recorded session, timestamped and replayable. Not a paraphrase of the moment. The moment itself. When a finding says "users hesitate at the pricing page," the source clip is the eight seconds where you watch a specific person stall, scroll back up, and frown at the plan grid.` },
    { t: "p", html: `This is chain of custody, borrowed straight from forensics. NIST defines it as the documented process that tracks evidence from collection through analysis, recording who handled it and when, so its origin can be traced and trusted (NIST CSRC). In data terms it's <em>provenance</em>: the trail of where information came from and how it moved (NIST CSRC). The whole reason these disciplines exist is that evidence with a broken custody chain gets thrown out. It might be true, but you can't trust it, because you can't trace it. The same standard now drives industry efforts like the C2PA provenance specification, which cryptographically signs the origin and edit history of media so a viewer can verify what they're looking at (C2PA).` },
    { t: "p", html: `For a research output the test is concrete. Can you get from the claim to the original recording in one click? If the answer is "the clips are in a folder somewhere" or "we'd have to go find it," you don't have a source layer. You have a memory of one.` },

    { t: "h2", id: "transcript", num: "04 · LAYER TWO", text: "Layer 2 — the transcript moment" },
    { t: "p", html: `The second question: <em>did they actually say that, in those words?</em>` },
    { t: "p", html: `The transcript moment is the participant's exact language, not cleaned up, not summarized, not smoothed into corporate prose. "I'd probably look around for a cheaper option here" is a transcript moment. "Participants expressed price sensitivity" is a summary of one, and the summary is where meaning quietly leaks out. The model that wrote "users find the pricing page confusing" chose the word "confusing." The participant might have said "overwhelming," or "I don't trust this," or "wait, which plan am I even on." Those point at three different fixes.` },
    { t: "p", html: `UX research has long leaned on verbatim quotes for exactly this reason. The Nielsen Norman Group notes that participant quotes carry a high level of credibility because they give stakeholders a direct glimpse of the session in the user's own voice, rather than the researcher's interpretation of it (Nielsen Norman Group). The verbatim is harder to argue with than the paraphrase, because the paraphrase already smuggled in a judgment call.` },
    { t: "p", html: `The test: pull any claim and ask to see the underlying quotes, unedited. If what comes back is the model's restatement rather than the human's words, layer two is missing, and you're trusting a translation you can't check against the original.` },

    { t: "h2", id: "behavior", num: "05 · LAYER THREE", text: "Layer 3 — the behavioral signal" },
    { t: "p", html: `Third question: <em>can you show it happened, not just that someone said it would?</em>` },
    { t: "p", html: `The behavioral signal is the measurable thing the participant did. The cursor stall, the scroll-back, the rage-click, the forty-second dwell on a field that should take five. It matters because what people say and what people do come apart constantly. The Nielsen Norman Group draws the line between attitudinal research (what users report) and behavioral research (what users actually do), and stresses that the two routinely diverge (Nielsen Norman Group). Market researchers have a blunter name for it: the say-do gap, the well-documented gulf between stated intent and observed action, usually unconscious rather than dishonest (CloudArmy).` },
    { t: "p", html: `This layer is the one synthetic and self-report methods structurally cannot give you. A survey captures the attitude. A language model predicts the plausible attitude. Neither watched anyone abandon a task. When a stated preference ("I'd pay for this") and a behavioral signal (they bounced at the price field) disagree, the behavior usually wins, and a finding that has only the stated side is missing its strongest witness.` },
    { t: "p", html: `The test: behind the claim, is there a measurable behavior, or only a reported opinion? Both can be valid. But a finding dressed up as behavioral when it's really just attitudinal is one of the most common ways a research output overstates itself.` },

    { t: "h2", id: "segment", num: "06 · LAYER FOUR", text: "Layer 4 — the segment pattern" },
    { t: "p", html: `Fourth question: <em>is this a real pattern, or did you build a story out of one vivid moment?</em>` },
    { t: "p", html: `The segment pattern is the breadth. How many participants showed the behavior, and which audience segments it cuts across. One person stalling at the pricing page is an anecdote. Eleven people stalling, eight of them in your mid-market segment, is a pattern with a shape you can act on. The number is load-bearing, and a single compelling clip is the easiest thing in the world to over-weight, because vividness reads as significance even when the n is one.` },
    { t: "p", html: `This is triangulation at the level of cases rather than methods: a finding gets sturdier as independent instances of it converge (Denzin, 1978, via Delve). It's also where AI summaries tend to go quietly wrong. A model will write "several users mentioned" or "many participants felt" without ever counting, because "several" and "many" are linguistic gestures, not measurements. The vagueness is doing work. It implies a pattern the data may not support.` },
    { t: "p", html: `The test: does the claim carry a count and a segment, or a hedge word? "Several users" is a flag, not a finding. Ask how many, and out of how many, in which segment. If the output can't answer, it didn't establish a pattern. It generalized from a moment.` },

    { t: "h2", id: "confidence", num: "07 · LAYER FIVE", text: "Layer 5 — the confidence indicator" },
    { t: "p", html: `Last question: <em>how sure are you, and where is this weak?</em>` },
    { t: "p", html: `The confidence indicator is an honest, human-reviewable read on the strength of the cluster, including where it might be soft. Not the model's self-graded certainty, which we already established runs hot (Xiong et al., 2024). A real confidence layer accounts for sample size, signal consistency, and the cases that don't fit. It's the layer that says "86% confidence, but it's thinner for enterprise users, where we only have two sessions."` },
    { t: "p", html: `The reason this can't be the model's own number is the calibration problem. Language models report high confidence almost regardless of whether they're right, which makes their stated certainty close to useless as a guide (Xiong et al., 2024). Worse, they almost never say "I don't know," even where real respondents do. A confidence indicator earns trust only when a human can inspect how it was derived and a skeptic can find the soft spots it admits to.` },
    { t: "p", html: `The test: does the finding state its own uncertainty and name where it's weakest? A claim that admits no doubt isn't more rigorous. It's less, because the absence of stated uncertainty usually means nobody measured it.` },

    { t: "h2", id: "chain", num: "08 · WORKED EXAMPLE", text: "A claim, all the way down" },
    { t: "p", html: `Here's the framework on one finding, top to bottom. Start with the line a CFO actually reads in the deck.` },
    { t: "figure",
      fig: { key: "insight", props: { accent: true, title: `Surface "what's included" above the pricing CTA`, clips: 14, participants: 8, segments: 3, confidence: 0.86, body: `Across 8 participants, concentrated in mid-market, hesitation spiked when the page failed to surface plan inclusions before the CTA. Trace: 14 clips, 3 segments.` } },
      ref: "FIGURE 01",
      caption: "RECOMMENDATION · the line a CFO reads, with every chip source-linked" },
    { t: "p", html: `Now chain it down through the five layers.` },
    { t: "table",
      headers: ["Layer", "What backs the claim"],
      rows: [
        ["<strong>Source clip</strong>", `The 02:14 moment where a participant scrolls back up, hunting for what the plan includes, then stalls on the CTA. Timestamped, replayable.`],
        ["<strong>Transcript moment</strong>", `Their words: "wait — what do I actually get on this one?" Not "expressed confusion." The actual sentence.`],
        ["<strong>Behavioral signal</strong>", `A measured scroll-back and a long dwell on the CTA before the participant disengaged. Observed, not reported.`],
        ["<strong>Segment pattern</strong>", `The same behavior across 8 participants, concentrated in the mid-market segment, traced through 14 clips. A pattern, not a moment.`],
        ["<strong>Confidence indicator</strong>", `86%, with the soft spot named: thinner evidence for enterprise, where the session count is low.`],
      ] },
    { t: "p", html: `The recommendation at the top is identical in wording to one a model could have invented from nothing. The difference is everything below the line. A skeptical stakeholder can pull any single thread and follow it to a real human at 02:14. When someone challenges the finding, you don't defend the reasoning. You play the tape.` },
    { t: "p", html: `Compare that to the shape an unsourced AI summary usually takes.` },
    { t: "blockquote", html: `"Participants generally found the pricing page confusing, with several mentioning that the team plan was unclear. We recommend revisiting plan descriptions and considering a clearer value proposition above the CTA."` },
    { t: "p", html: `There's nothing wrong with what it says. The recommendation might even be correct. The problem is that every layer underneath it is absent. No clip, no verbatim, no measured behavior, no count, no confidence. A stakeholder can't pull a single thread, so they do one of two things: accept it on faith, or quietly discount it. Both are bad outcomes, and in neither case did the fluent prose earn any trust.` },
    { t: "pullquote", text: "When someone challenges the finding, you don't defend the reasoning. You play the tape." },

    { t: "h2", id: "checklist", num: "09 · THE CHECKLIST", text: "The checklist you can steal" },
    { t: "p", html: `You don't need NeroView to run this. The five layers are a portable rubric, and you can hold any research output against it. Take the last recommendation your team shipped and grade it, one layer at a time.` },
    { t: "ol", items: [
      `<strong>Source.</strong> Can you get from the claim to the original recording in one click? Or is the source "in a folder somewhere"?`,
      `<strong>Transcript.</strong> Can you see the participant's exact words, unedited? Or only the model's restatement of them?`,
      `<strong>Behavior.</strong> Is there a measured action behind the claim, or only a reported opinion dressed up as one?`,
      `<strong>Segment.</strong> Does the finding carry a count and a segment, or a hedge word like "several" doing the work of a number?`,
      `<strong>Confidence.</strong> Does it state its own uncertainty and name where it's weakest, or assert clean certainty it never measured?`,
    ] },
    { t: "p", html: `A finding that clears all five is defensible. You can put it in front of a board and survive the questions. A finding that clears two or three is a hypothesis worth validating, not a conclusion worth shipping. A finding that clears none, however fluent, is a confident guess, and the fluency is the most dangerous thing about it.` },
    { t: "p", html: `The grade isn't about which tool produced the output. Run the rubric on a junior researcher's deck and an AI summary the same way. The standard is indifferent to the author. Evidence either chains to a source, or it doesn't.` },
    { t: "p", html: `That's the whole discipline. Not "trust AI" or "distrust AI," but: make every output show its work, and refuse to act on the ones that can't. In a year when anything can write a plausible sentence, the plausible sentence is worth less than it ever was. What it points to is worth more.` },

    { t: "references", items: [
      { n: 1, html: `Stanford HAI / Dahl, M., Magesh, V., Suzgun, M., &amp; Ho, D. E. (2024). "Hallucinating Law: Legal Mistakes with Large Language Models Are Pervasive." Hallucination rates of 69–88% on specific legal queries, measured over 200,000+ queries. <a href="https://hai.stanford.edu/news/hallucinating-law-legal-mistakes-large-language-models-are-pervasive" target="_blank" rel="noopener">hai.stanford.edu</a>` },
      { n: 2, html: `Magesh, V., Surani, F., Dahl, M., Suzgun, M., Manning, C. D., &amp; Ho, D. E. (2024). "Hallucination-Free? Assessing the Reliability of Leading AI Legal Research Tools." Dedicated legal-research tools hallucinated 17% and 33% of the time. <a href="https://arxiv.org/abs/2405.20362" target="_blank" rel="noopener">arxiv.org/abs/2405.20362</a>` },
      { n: 3, html: `Xiong, M., et al. (2024). "Overconfidence is Key: Verbalized Uncertainty Evaluation in Large Language and Vision-Language Models." Across 48 models, self-reported confidence is poorly calibrated and clusters near the top of the range. <a href="https://arxiv.org/abs/2405.02917" target="_blank" rel="noopener">arxiv.org/abs/2405.02917</a>` },
      { n: 4, html: `Denzin, N. K. (1978). On triangulation as corroboration across multiple data sources and methods (via Delve, "Triangulation in Qualitative Research"). <a href="https://delvetool.com/blog/triangulation-qualitative-research" target="_blank" rel="noopener">delvetool.com</a>` },
      { n: 5, html: `NIST CSRC. "Data provenance" and "Chain of custody" glossary entries. Provenance traces the origin and movement of information; chain of custody documents handling from collection through analysis. <a href="https://csrc.nist.gov/glossary/term/chain_of_custody" target="_blank" rel="noopener">csrc.nist.gov</a>` },
      { n: 6, html: `C2PA — Coalition for Content Provenance and Authenticity. Open standard attaching cryptographically signed provenance metadata to media so origin and edit history can be verified. <a href="https://c2pa.org/" target="_blank" rel="noopener">c2pa.org</a>` },
      { n: 7, html: `Nielsen Norman Group. "Using Quotes to Share UX Research" and "Attitudinal vs. Behavioral Research in UX." Verbatim quotes carry high credibility; attitudinal (what users say) and behavioral (what users do) data routinely diverge. <a href="https://www.nngroup.com/articles/attitudinal-behavioral/" target="_blank" rel="noopener">nngroup.com</a>` },
      { n: 8, html: `CloudArmy. "Why Stated Preferences Fail: The Say/Do Gap in Market Research." The documented gap between stated intent and observed behavior, usually unconscious. <a href="https://cloud.army/why-stated-preferences-fail-the-saydo-gap-in-market/" target="_blank" rel="noopener">cloud.army</a>` },
    ] },
  ],
  related: [
    { href: "/blog/post.html", title: "Stop trusting AI summaries. Start trusting the evidence trail.", meta: "9 min · Index" },
    { href: "/blog/synthetic-vs-human.html", title: "Synthetic vs. human research: when each one wins", meta: "7 min · Index" },
  ],
};