How to Write Evaluation Questions That Actually Measure Something
Most survey questions are designed for humans who pick numbers on a scale. That works fine when someone is sitting in front of a screen making choices — but it falls apart when the goal is to measure what people mean, not just what they click. The difference matters more than you think.

The problem with picking a number
Traditional Likert surveys have a well-known set of flaws. People satisfice — they pick 4 out of 5 because it’s easy, not because it reflects how they feel. Anchoring bias, scale compression, and straight-lining plague every panel.
But there’s a deeper problem that market researchers have grappled with for decades: people overstate their likelihood to buy. When asked directly “Would you purchase this product?”, respondents consistently say yes at rates far higher than their actual behavior. Studies have shown stated purchase intent can run 3 to 5 times higher than real-world conversion. It’s called hypothetical bias, and it’s baked into every traditional survey that asks people to predict their own future behavior. People aren’t lying — they genuinely believe they’d buy. They’re just bad at predicting what they’ll actually do when their wallet is in their hand.
When you move to synthetic research — AI personas evaluating your product or ad — you have an opportunity to do something fundamentally different. Instead of asking a persona to pick a number, you can let it speak naturally and then measure the meaning of what it said.
That’s the core of SSR (Semantic Similarity Rating).
Why measuring meaning beats collecting numbers
SSR works by having each persona write a genuine, conversational response to your question — no scales, no checkboxes. Then it measures how semantically close that response is to a set of calibrated reference statements that define each point on the scale.
This sidesteps the three biggest reliability problems in traditional research:
Hypothetical bias disappears. Because the persona responds conversationally rather than selecting a number, there’s no stated-intent scale to inflate. The response is natural language — “Yeah, I’d probably grab this if I saw it on sale” — and the score is derived from the meaning of what was said, not from a self-reported prediction. The indirectness of the measurement produces more honest signals.
Satisficing can’t happen. Human panelists get fatigued. By question 15, they’re clicking 4s down the line. With SSR, every question gets a unique free-text response, so each score reflects genuine semantic content — not checkbox fatigue.
Scale interpretation is consistent. One person’s “4 out of 5” is another person’s “3 out of 5.” SSR bypasses this entirely because scores are anchored against calibrated reference statements with fixed semantic meaning. A 4.1 means the same thing every time, regardless of who’s responding.
How it works under the hood
For those curious about the mechanics: each evaluation uses a set of reference statements — typically five — that define the full spectrum from strongly negative to strongly positive. Think of them as semantic mile markers.
When a persona responds to your question, that response gets converted into a mathematical representation of its meaning (a vector embedding). The system then measures how close that meaning is to each of the five reference statements using cosine similarity. Those similarity scores get normalized into a probability distribution across the scale.
So a response like “I’d definitely pick this up, the price is right” would land closest to the positive end of a purchase intent scale — not because anyone labeled it a 5, but because the meaning of the words aligns with positive purchase intent.
The key insight: the persona never picks a number. It speaks naturally, and the meaning determines the score.
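To make the mechanics concrete, here is a minimal sketch of this kind of semantic-similarity scoring in Python. It assumes the open-source sentence-transformers library; the reference statements, model name, and softmax temperature are illustrative placeholders, not the calibrated anchors or exact normalization used in production.

```python
# Minimal SSR-style scoring sketch (illustrative, not Gustoso's implementation).
# Assumes the open-source sentence-transformers package; the reference
# statements, model name, and softmax temperature below are placeholders.
import numpy as np
from sentence_transformers import SentenceTransformer

# Five reference statements acting as "semantic mile markers" on a 1-5
# purchase-intent scale (illustrative wording).
REFERENCES = [
    "I would never buy this.",        # 1
    "I probably wouldn't buy this.",  # 2
    "I might or might not buy this.", # 3
    "I would probably buy this.",     # 4
    "I would definitely buy this.",   # 5
]

model = SentenceTransformer("all-MiniLM-L6-v2")

def ssr_score(response: str, temperature: float = 0.1) -> tuple[np.ndarray, float]:
    """Return (distribution over the 5 scale points, expected score)."""
    # Embed the persona's free-text response and the five anchor statements.
    vectors = model.encode([response] + REFERENCES)
    resp_vec, ref_vecs = vectors[0], vectors[1:]

    # Cosine similarity between the response and each reference statement.
    sims = ref_vecs @ resp_vec / (
        np.linalg.norm(ref_vecs, axis=1) * np.linalg.norm(resp_vec)
    )

    # One simple way to turn similarities into a probability distribution:
    # a temperature-scaled softmax across the five scale points.
    logits = sims / temperature
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()

    # Expected value over scale points 1..5 gives a continuous score like 4.1.
    expected = float(probs @ np.arange(1, 6))
    return probs, expected

probs, score = ssr_score("Yeah, I'd probably grab this if I saw it on sale")
print(probs.round(2), round(score, 1))
```

Run on a clearly positive response like the one above, most of the probability mass should land on the 4 and 5 anchors, so the expected score comes out near the top of the scale without the persona ever having selected a number.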
Your question shapes the response. The response shapes the score.
Because SSR scores the meaning of a natural language response, your question needs to elicit a response that’s rich enough to measure. Vague questions produce vague responses, which produce noisy scores.
Three principles to follow:
Be specific about what you’re evaluating. “What do you think of this?” is too open. “Would you buy this for yourself at this price?” gives the persona something concrete to react to. The more focused the prompt, the more the response clusters around a measurable intent.
Ask in a way that invites a natural reaction. Don’t ask “Rate your satisfaction.” Ask “How do you feel about this after seeing it?” You want the persona to talk, not perform a rating task.
Match your question to what you’re measuring. If you care about purchase intent, ask about buying. If you care about recommendation, ask about sharing. The question and the scoring dimension need to pull in the same direction, or your scores will be noisy.
Turning any research question into something measurable
Not every question you want answered comes naturally as a scored evaluation. Here’s how to take common research questions and reshape them so they produce reliable, comparable scores.
Brand perception — Don’t ask: “What do you think of Brand X?” Ask instead: “If a friend asked about Brand X, what would you tell them?” This invites a natural recommendation-style response that maps cleanly to scored evaluation.
Ad effectiveness — Don’t ask: “Is this a good ad?” Ask instead: “After seeing this ad, how do you feel about trying this product?” This shifts from judging the creative to expressing behavioral intent — which is what you actually care about measuring.
Price sensitivity — Don’t ask: “Is this too expensive?” Ask instead: “Would you buy this at this price, or would you wait for a deal?” This forces a concrete purchase decision rather than an abstract value judgment.
Emotional resonance — Don’t ask: “Does this make you feel something?” Ask instead: “Describe your gut reaction when you see this.” Open enough to capture genuine emotion, specific enough to generate meaningful semantic content.
Competitive preference — Don’t ask: “Do you prefer Product A or B?” Ask instead: “If both were on the shelf, which would you reach for and why?” This creates a scenario the persona can inhabit, producing richer language than a binary choice.
Notice the pattern — every rewrite puts the persona in a situation rather than asking them to evaluate in the abstract. Situational prompts produce more authentic language, and more authentic language produces more reliable scores.
Make it personal with qualification criteria
The question is only half the equation — who’s answering matters just as much. Gustoso lets you add qualification criteria to your personas — things like “Homeowner,” “Has used competitor products,” or “Shops online weekly.” These get woven into each persona’s backstory, so when you ask “Would you buy this at this price?”, a persona who’s been shopping for a home renovation answers differently than one who just graduated college. The more relevant the persona’s context, the more meaningful the response — and the sharper the score.
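For illustration only, here is one way qualification criteria might be woven into a persona’s backstory before the question is asked. The prompt structure, field names, and wording are assumptions made for this sketch, not Gustoso’s actual format.

```python
# Purely illustrative sketch of folding qualification criteria into a persona
# prompt before asking the evaluation question. Structure and wording are
# assumptions, not Gustoso's actual prompt format.
from dataclasses import dataclass, field

@dataclass
class Persona:
    backstory: str
    qualifications: list[str] = field(default_factory=list)

    def prompt(self, question: str) -> str:
        # Weave each qualification into the backstory as a plain sentence.
        quals = " ".join(f"You {q}." for q in self.qualifications)
        return (
            f"{self.backstory} {quals}\n"
            f"Answer in your own words, conversationally: {question}"
        )

persona = Persona(
    backstory="You are a 38-year-old project manager renovating your first home.",
    qualifications=["are a homeowner", "have used competitor products"],
)
print(persona.prompt("Would you buy this cordless drill at this price?"))
```

The point of the sketch is simply that the criteria become part of the persona’s lived context, so the free-text answer (and therefore the score) reflects someone who actually fits your target segment.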
Where Gustoso delivers the strongest results
Purchase intent — The original use case from the underlying research. Synthetic panels consistently produce intent distributions that track real-world behavior more closely than stated-intent surveys, precisely because the measurement is indirect.
Recommendation likelihood — “Would you tell a friend?” is one of the most natural conversational prompts, which makes it one of the most reliable scoring dimensions.
Concept and creative testing — A/B comparisons where you need to know which version resonates more, not just whether people “like” it.
Audience segmentation — Combining demographic targeting with qualification criteria to find which segments respond strongest — and why.
Ready to see the difference? Start a study and let the responses speak for themselves.