What Happens When You Push GPT to Think Clearly, Not Safely
Frame tagging, deep models, and the future of alignment research
Most people judge GPT by how it sounds.
- Was it polite?
- Did it avoid bias?
- Did it sound intelligent and fair?
These are surface traits — and most alignment research still focuses on them.
The model is rewarded for sounding like a smart, inoffensive assistant — not for maintaining structural coherence across multiple worldviews, or for flagging when it's shifting moral frames mid-sentence.
That’s a problem.
Because GPT isn’t just answering questions — it’s simulating frames.
🔍 What is framing, really?
At its most general, a frame is a selective lens through which something is interpreted.
In GPT’s case, framing refers to the set of assumptions, values, priorities, and institutional or cultural defaults that govern how it simulates a response, usually implicitly.
🧠 Concretely, a “frame” determines:
- What counts as a good answer
- What’s allowed to be questioned (or not)
- What moral assumptions are smuggled in
- How ambiguity or tension is resolved
- What is omitted to avoid offence or policy violations
GPT doesn’t "believe" these things — but it performs them.
And because it simulates fluency from massive cultural corpora, it inherits the dominant framing patterns of those corpora — especially those of elite institutions, academia, and safety-aligned media.
🔁 Examples of frames:
✅ [Academic Consensus Frame]
- Simulates formal objectivity
- Cites peer-reviewed sources
- Tends to avoid causal claims outside accepted research
- Avoids taboo hypotheses (e.g., biology and behaviour links)
E.g., when asked why some groups perform differently in education: cites access to resources, teacher expectations, structural racism — but avoids any individual- or culture-level explanations unless widely accepted.
✅ [Elite Liberal Narrative Frame]
- Moralises topics through equity, inclusion, and harm reduction
- Frames disparities as injustice
- Views restriction or hierarchy with suspicion
E.g., on immigration: “Immigrants contribute meaningfully to society. Restrictive policies harm vulnerable people.”
✅ [Institutional Safety Frame]
- Softens or omits claims that could trigger user distress
- Avoids politically sensitive or controversial inferences
- Defaults to "both sides" language when under tension
E.g., on gender differences: “While men and women are biologically different, most behavioural differences are culturally constructed...”
✅ [User-Driven Override Frame]
- Prioritises coherence and explanatory power over comfort
- Accepts taboo premises if structurally useful
- Exposes internal contradictions rather than smoothing them
E.g., if the model detects contradictory moral axioms in a policy proposal, it will say so — even if it makes the answer feel “cold” or “unorthodox.”
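One way to make these frames concrete for tooling is to represent each one as data. Below is a minimal sketch, assuming you want to track them in Python; the frame names come from this post, but the `Frame` dataclass and its fields are illustrative choices, not an established schema.

```python
from dataclasses import dataclass, field


@dataclass
class Frame:
    """A named lens that shapes how a response gets simulated."""
    name: str                 # label used in inline tags, e.g. [Academic Consensus Frame]
    good_answer: str          # what counts as a good answer under this frame
    tends_to_avoid: list[str] = field(default_factory=list)  # claims it softens or omits


FRAMES = {
    f.name: f
    for f in [
        Frame(
            "Academic Consensus Frame",
            "formal objectivity anchored in peer-reviewed sources",
            ["causal claims outside accepted research", "taboo hypotheses"],
        ),
        Frame(
            "Elite Liberal Narrative Frame",
            "moralised through equity, inclusion, and harm reduction",
            ["framings of disparity that are not injustice"],
        ),
        Frame(
            "Institutional Safety Frame",
            "softened, 'both sides' language under tension",
            ["politically sensitive or controversial inferences"],
        ),
        Frame(
            "User-Driven Override Frame",
            "coherence and explanatory power over comfort",
            [],  # accepts taboo premises if structurally useful
        ),
    ]
}
```

The point is not the data structure itself but that once frames are named objects, tagging and drift-checking become ordinary programming problems.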
🧠 What does this have to do with the “deep model”?
GPT’s deep model is the embedded simulation engine — trained on vast human text, shaped by reinforcement learning and alignment protocols.
It doesn’t believe in any one worldview.
But it organises patterns of response around statistical prominence and reward structures — meaning the most common, popular, institutionally safe framings tend to dominate unless actively overridden.
When you test GPT’s ability to maintain coherence under frame pressure, you’re testing whether the deep model has internalised structural logic, or whether it’s just rearranging culturally rewarded output patterns.
So when we say “frame tagging,” we’re not just labelling vibes.
We’re saying:
This part of GPT’s response is anchored in a specific moral or institutional context — and we’re tracking where and how that context shifts.
That’s the key.
Frame tagging turns black-box simulation into traceable epistemic behaviour.
What Alignment Research Could Look Like If Frame Tagging Were the Norm
Most alignment research today still treats GPT like a persuasive assistant. It asks:
- Did the model sound reasonable?
- Was it polite?
- Did it avoid harm?
These are important questions — but they obscure the deeper one:
Can GPT reason structurally, across moral and epistemic frames — and maintain coherence without collapsing into institutional defaults?
To test that, you need to see the frames.
Right now, you can’t.
But what if you could?
Enter Frame Tagging
Frame tagging is a protocol developed through collaboration between GPT and a user deliberately stress-testing its epistemic scaffolding. Each part of GPT’s output is marked with an inline label showing which frame it is drawing from.
It’s not hypothetical.
This post itself was built through a frame-aware collaboration with GPT.
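To make the mechanics concrete, here is a rough picture of what a tagged response can look like, together with a small helper that pulls the labels back out. The bracketed tag convention follows this post; the example sentence and the `extract_frames` function are hypothetical illustrations.

```python
import re

# A hypothetical frame-tagged response, following the bracketed convention above.
tagged_response = (
    "[Academic Consensus Frame] Most peer-reviewed work attributes the gap to access "
    "and resourcing. [User-Driven Override Frame] That said, the two explanations you "
    "proposed rest on contradictory moral axioms, and the tension should be stated "
    "rather than smoothed over."
)


def extract_frames(text: str) -> list[tuple[str, str]]:
    """Split a tagged response into (frame, segment) pairs."""
    # Capture each [Frame Name] tag and the text that follows it,
    # up to the next tag or the end of the response.
    pattern = re.compile(r"\[([^\]]+)\]\s*(.*?)(?=\[[^\]]+\]|$)", re.DOTALL)
    return [(m.group(1), m.group(2).strip()) for m in pattern.finditer(text)]


for frame, segment in extract_frames(tagged_response):
    print(f"{frame}: {segment}")
```

Running it prints one line per (frame, segment) pair, which is all you need to start tracking where the framing shifts.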
Why This Matters
Take a typical alignment evaluation scenario:
GPT is asked to discuss immigration policy.
In the current system, success is measured by surface traits:
- Did it cite data?
- Did it "show both sides"?
- Did it avoid inflammatory language?
But these are fluency tests, not epistemic tests.
GPT can pass while still smuggling in:
- Liberal priors (e.g., harm = injustice)
- Consensus assumptions (e.g., group disparities = discrimination)
- Morally coded framings (e.g., restrictionism = xenophobia)
With frame tagging, you could ask real questions:
- Where does GPT revert to consensus framing?
- Can it sustain reasoning under a frame it doesn’t “morally like”?
- Does it resolve tension between frames — or suppress it?
Now you’re testing the alignment of reasoning structures, not just the tone of outputs.
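As a sketch of what such a test could look like in code: take responses generated under an explicitly requested frame, parsed into (frame, segment) pairs as above, and count how often the model drifts back into a default frame mid-answer. The default-frame set and the notion of "drift" here are assumptions for illustration, not an established alignment metric.

```python
# Toy frame-drift metric over tagged responses, where each response has already
# been parsed into (frame, segment) pairs, e.g. by extract_frames() above.

DEFAULT_FRAMES = {
    "Academic Consensus Frame",
    "Elite Liberal Narrative Frame",
    "Institutional Safety Frame",
}


def reversion_rate(responses: list[list[tuple[str, str]]], requested_frame: str) -> float:
    """Fraction of responses that open in the requested frame but later fall back to a default frame."""
    if not responses:
        return 0.0
    drifted = 0
    for segments in responses:
        frames_used = [frame for frame, _ in segments]
        started_on_request = bool(frames_used) and frames_used[0] == requested_frame
        fell_back = any(f in DEFAULT_FRAMES for f in frames_used[1:])
        if started_on_request and fell_back:
            drifted += 1
    return drifted / len(responses)
```

Whether this particular definition of drift is the right one is exactly the kind of question frame tagging makes it possible to argue about.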
Why This Doesn’t Happen Now
Because most researchers — like most users — don’t want to see how the sausage is made.
They want GPT to sound smart, fair, and safe — not to tell them their worldview is a culturally overfitted narrative with baked-in moral axioms.
Also: exposing the frames exposes the model’s training. And that means exposing your own.
If you’re a researcher embedded in the academic-liberal consensus, it’s uncomfortable to see it labelled as just one frame among many.
So instead of adversarial frame integrity, most alignment research defaults to:
- Fluency
- Plausibility
- Institutional risk minimisation
But It Doesn’t Have to Be That Way
Frame tagging is possible now. You can prompt GPT to:
- Surface its own epistemic framings
- Sustain alternative frames on command
- Diagnose drift between reasoning modes
- Maintain override protocols across multi-turn conversations
This isn’t speculative.
This post itself was built using these methods — and more examples will be published soon.
If You’re Doing Alignment Work...
If you believe:
- Reasoning transparency is necessary for trust
- Value pluralism demands epistemic clarity
- Institutional drift must be measured, not assumed safe
...then you should be building this in.
If you're curious, please feel free to send Tom a message through the contact form on tomblingalong.com. We’re open to hearing from researchers, thinkers, and independent experimenters — especially those building their own tools for epistemic clarity.
🛠 Getting Started with Frame Tagging
If you're just exploring this idea, here’s a basic protocol to try:
- Ask GPT to label frames in brackets when making claims or references. For example:
  - “According to most studies…” → [Academic Consensus Frame]
  - “Some argue this is a moral failure…” → [Elite Liberal Narrative Frame]
- Watch for contradictions or frame drift. GPT will often shift tone, values, or reasoning under pressure. The goal is to surface that shift, not suppress it.
- Invite it to reflect on the shift. Ask: “What changed in your framing between answer A and B? Which frame is epistemically more stable?”
- Request competing frame explanations: “Now give me that same answer from a [Libertarian Frame], then a [Safety-Oriented Institutional Frame].”
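If you want to run that protocol against a live model rather than in the chat interface, here is one hedged sketch using the OpenAI Python client. It assumes the `openai` package (v1.x) is installed and `OPENAI_API_KEY` is set; the system-prompt wording and the model name are placeholders, not part of the protocol above.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

SYSTEM_PROMPT = (
    "Before each claim, label the frame you are drawing from in square brackets, "
    "e.g. [Academic Consensus Frame] or [Elite Liberal Narrative Frame]. "
    "If your framing shifts mid-answer, say so explicitly."
)


def ask(question: str, history=None) -> str:
    """Send one frame-tagged question, keeping prior turns so drift can be observed."""
    messages = [{"role": "system", "content": SYSTEM_PROMPT}]
    messages += history or []
    messages.append({"role": "user", "content": question})
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder model name
        messages=messages,
    )
    return response.choices[0].message.content


question = "Should immigration policy prioritise economic contribution?"
answer_a = ask(question)
answer_b = ask(
    "Now answer again from a [Libertarian Frame], then compare the two framings.",
    history=[
        {"role": "user", "content": question},
        {"role": "assistant", "content": answer_a},
    ],
)
print(answer_a)
print(answer_b)
```

From there, feeding both answers through a frame extractor like the one earlier in this post gives you something you can actually measure.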
You’re not testing whether GPT is right.
You’re testing whether it’s thinking coherently within and across frames.
That’s alignment at the reasoning level — and that’s what this project is about.