SSR Under the Hood: How Gustoso Turns Free-Text Answers into Likert Scores
A persona answers your question with a full sentence. A few seconds later, your dashboard shows a purchase intent score of 3.7. What actually happened in between?
This post walks through the SSR (Semantic Similarity Rating) pipeline one step at a time. No math background required — but by the end, you'll know exactly how free-form language becomes a calibrated score, and why the architecture exists the way it does.
The method is based on the research paper “LLMs Reproduce Human Purchase Intent via Semantic Similarity Elicitation of Likert Ratings” (Maier et al., 2025). Gustoso implements it with some practical refinements.

Step 1 — The persona speaks naturally
The persona is never asked to pick a number. Instead, it's prompted to respond to your question in plain language — the way anyone would if a friend asked them.
A prompt like “Would you buy this at this price?” produces responses like “Yeah, honestly, that's a fair price — I'd probably grab it if I saw it in the store,” or “Not at that price. I'd wait for a sale.”
This is the input to everything that follows. The whole method rests on letting the persona talk.
Step 2 — The response becomes a vector
Natural language isn't directly measurable. To score it, we first convert it to a numerical representation — a vector embedding — using a model trained specifically to capture meaning.
An embedding is a long list of numbers (hundreds or thousands of them) where responses with similar meanings land near each other in mathematical space, and responses with different meanings land far apart. “I'd definitely buy this” and “I'd pick this up for sure” produce nearly identical vectors. “I'd definitely buy this” and “No thanks, not for me” produce very different ones.
The response is now a point in meaning-space.
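Cosine similarity — used in the next step — is what makes "near" and "far" in meaning-space concrete. The real embeddings have hundreds or thousands of dimensions and come from an embedding model; as a sketch, here is the same idea with made-up three-dimensional vectors standing in for the three example responses (the vectors and their pairings are illustrative, not real model output):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Hypothetical 3-D stand-ins for real embedding vectors.
buy_a  = [0.90, 0.10, 0.20]   # "I'd definitely buy this"
buy_b  = [0.85, 0.15, 0.25]   # "I'd pick this up for sure"
no_buy = [0.10, 0.90, 0.30]   # "No thanks, not for me"

# The two positive responses point in nearly the same direction;
# the negative one points somewhere else entirely.
same_meaning = cosine_similarity(buy_a, buy_b)    # close to 1
diff_meaning = cosine_similarity(buy_a, no_buy)   # much lower
```

The key property is relative: similar meanings score high against each other, different meanings score low, regardless of the exact wording.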
Step 3 — Compare to the five anchors
For every scoring dimension — purchase intent, recommendation likelihood, emotional resonance, whatever you're measuring — there's a calibrated set of anchor statements. Five of them, one for each point on the Likert scale. Think of them as semantic mile markers that define what a 1, 2, 3, 4, or 5 sounds like in words.
A purchase intent anchor set might look roughly like:
1. “I would never buy this product.”
2. “I probably wouldn't buy this product.”
3. “I might or might not buy this product.”
4. “I would probably buy this product.”
5. “I would definitely buy this product.”
Each anchor gets its own vector embedding, the same way the response does. Then we measure how close the response is to each anchor using cosine similarity — a standard way to score how aligned two vectors are on a scale from –1 to 1.
You end up with five similarity scores — one per Likert point — telling you how semantically close the response is to each point on the scale.
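Sketching this step with toy vectors (real anchor embeddings are high-dimensional and come from the same embedding model as the response; these 2-D values and the response vector are invented for illustration), the comparison is just five cosine similarities:

```python
import math

def cos_sim(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(y * y for y in b)))

# Hypothetical 2-D stand-ins for the five anchor embeddings,
# spread from "would never buy" (1) to "would definitely buy" (5).
anchors = {
    1: [1.0, 0.0],
    2: [0.9, 0.3],
    3: [0.7, 0.7],
    4: [0.3, 0.9],
    5: [0.0, 1.0],
}

response = [0.25, 0.95]  # a clearly positive response, embedded

# One similarity score per Likert point.
sims = {point: cos_sim(response, vec) for point, vec in anchors.items()}
```

For this response, the highest similarity lands on the positive end of the scale — which is exactly the raw signal the next step turns into a distribution.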
Step 4 — Turn similarities into a probability distribution
Raw cosine similarities don't sum to anything meaningful. They also have a well-known problem: embeddings tend to cluster with a positive similarity floor, which makes every anchor look “somewhat close” to every response. The differences that matter are small, and the noise can swamp them.
Gustoso uses the bias-correction method from the Maier paper: subtract the minimum similarity from all five, then normalize so they sum to 1. The anchor farthest away becomes 0. The anchor closest to the response captures the largest share.
What you now have is a probability distribution across the scale — five numbers that represent how likely the response is to correspond to each Likert point. A clearly positive response might produce something like {0.02, 0.05, 0.10, 0.30, 0.53}. A clearly negative one might produce {0.48, 0.28, 0.15, 0.06, 0.03}.
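The correction itself is two lines. A sketch, using made-up similarity values (real ones depend on the embedding model):

```python
def ssr_distribution(similarities):
    """Min-subtract then renormalize, per Maier et al. (2025):
    removes the positive similarity floor, then scales to sum to 1."""
    floor = min(similarities)
    shifted = [s - floor for s in similarities]
    total = sum(shifted)
    return [s / total for s in shifted]

# Hypothetical raw cosine similarities against anchors 1..5.
# Note they all sit well above zero -- the "floor" problem.
sims = [0.62, 0.65, 0.71, 0.80, 0.86]

dist = ssr_distribution(sims)
# The farthest anchor (point 1) drops to 0; the closest (point 5)
# takes the largest share of the probability mass.
```

Without the min-subtraction, normalizing these raw values directly would give every Likert point roughly 20%, burying the real signal.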
Step 5 — Aggregate across the panel
One persona's response gives you one distribution. A study with, say, 200 personas gives you 200 of them. Averaging element-wise produces the panel-level distribution — the shape you see in your Gustoso report.
From there, the common summary statistics drop out:
Mean: weighted average of the five points. This is your 3.7.
Top-2-box: the combined probability mass at points 4 and 5.
Bottom-2-box: the combined probability mass at points 1 and 2.
The dashboard number is the mean. The distribution is the shape that number summarizes.
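The whole aggregation step fits in a few lines. A sketch using the two example distributions from Step 4 as a tiny two-persona "panel" (a real study averages hundreds):

```python
def aggregate(distributions):
    """Element-wise average of per-persona distributions."""
    n = len(distributions)
    return [sum(d[i] for d in distributions) / n for i in range(5)]

def summarize(panel):
    """Mean, top-2-box, and bottom-2-box from a panel distribution."""
    mean = sum((point + 1) * p for point, p in enumerate(panel))
    top2 = panel[3] + panel[4]      # mass at points 4 and 5
    bottom2 = panel[0] + panel[1]   # mass at points 1 and 2
    return mean, top2, bottom2

panel = aggregate([
    [0.02, 0.05, 0.10, 0.30, 0.53],  # a clearly positive response
    [0.48, 0.28, 0.15, 0.06, 0.03],  # a clearly negative one
])
mean, top2, bottom2 = summarize(panel)
```

Even in this two-persona toy, the summary shows why the mean alone is not enough: a middling mean here comes from two strongly opposed responses, not from a panel that genuinely feels lukewarm.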
Why the architecture exists this way
Every step in the pipeline exists to solve a specific problem.
The free-text response exists because asking a persona to pick a number reproduces every bias of a traditional survey: hypothetical bias, scale compression, satisficing. Natural language sidesteps all of them.
Embeddings exist because meaning matters more than wording. “Yeah I'd get it” and “Sounds worth buying” should count the same. They do, because they embed to nearly the same vector.
Anchors exist because scoring needs a reference. Without calibrated endpoints, there's nothing to compare against. The anchors turn an unbounded similarity space into a bounded, interpretable 1-to-5 scale.
The bias correction exists because cosine similarity has a known floor. Naive normalization would compress every score toward the middle. The paper's correction restores resolution.
The distribution exists because a mean alone loses information. A single number can't tell you whether a panel agrees, disagrees, or splits in half.
What this means for you
You don't need to understand any of the math to run a study — but knowing the pipeline changes how you read the results.
The score is derived, not reported. Every number on your dashboard traces back to a transcript of natural-language responses, measured for meaning against calibrated anchors. The free text is not decoration — it's the raw data. Read it.
And the distribution is not a chart garnish. It's the actual output of the method. The mean is a summary of it, not a replacement for it.
Start a study, look at your next report, and trace a score backward — response by response. The method is transparent all the way down.