Context-Dependent: When Personality Tests Hit the Boundary of Their Own Model
What happens when your decision architecture doesn't fit the instrument, and what it means for you operationally.
There’s an ongoing conversation on Substack among several of us about personality tests and their worth. I see both sides of the argument, and so today I want to get into the question of accuracy; specifically, what happens when the test doesn’t “work” on you? More importantly, how does this affect you and your team?
When I take the Myers-Briggs, I am a clear INTJ; I agree with the result because it accurately depicts me. I’ve taken the OCEAN test in the past, and most of the others over the last few years, because my job required them; jobs and managers seem to really like personality tests. They offer a quick way to categorize and to outsource the work of reading people. We’ll get into that in a moment.
I decided to take the Enneagram test today, and the results were far more murky.
Okay, so it says I’m probably a 1 (The Reformer), but also maybe a 5w6. Also, it has no idea what I am.
More importantly, none of those types actually define my personality with any real accuracy; each of them has facets that I am most definitely NOT, and some of those incorrect facets seem to negate the result because they are core parameters of that type. In short, none of them is better than 50% accurate, and one of them was far worse than that.
Note: I should point out that the official Enneagram site explains that by saying the following:
Your basic type dominates your overall personality, while the wing complements it and adds important, sometimes contradictory, elements to your total personality. Your wing is the “second side” of your personality, and it must be taken into consideration to better understand yourself or someone else. For example, if you are a personality type Nine, you will likely have either a One-wing or an Eight-wing, and your personality as a whole can best be understood by considering the traits of the Nine as they uniquely blend with the traits of either the One or the Eight. In our teaching experience over the years, we have also encountered some individuals who seem to have both wings, while others are strongly influenced by their basic type and show little of either wing.
The problem with this description is that while it accounts for wing traits, those traits are still structurally limited. What the description doesn’t explain is cross-type ambiguity; according to their system, a 1 cannot be a 5w6, because a 1’s wing can only be a 9 or a 2.
What failed? I answered honestly. The easy conclusion would be that the test is useless. After all, people say that all the time, usually after getting a result they don’t like. I don’t have any feelings one way or another; as always, I am looking to see where things fail and why. I expected a clear answer, so I was (pleasantly) surprised when the test basically said that I’m some kind of anomaly, because that gives me something to sink my teeth into. (I might be eccentric; most definitely am, in fact. But I’m certainly not an anomaly.)
The problem may lie in the questionnaire's inability to identify which layer it is actually detecting: basic type, wing, stress pattern, instinctual expression, developmental level, or traits common to several types. It's worth noting that the Enneagram itself has faced criticism in peer-reviewed research on exactly these grounds. Researchers studying its reliability found that the instrument struggles to produce consistent results across administrations, which means an ambiguous result like mine may reflect a measurement problem as much as anything else. That doesn't resolve the question of why the ambiguity appeared, but it's worth holding as context.
The test must rely on self-reporting, and any self-report instrument forces you to collapse your internal evaluation into a single answer, even when multiple drivers are present. This is a known limitation of self-report instruments generally. Researchers studying self-knowledge accuracy have documented how poorly people translate their own internal states into fixed categorical responses, because the instrument forces a precision that the underlying experience doesn't have.
The Enneagram measures motivation rather than action. For instance, I scored high on the questions that were some form of “I don’t mind correcting someone when necessary.” In the Enneagram, however, that same behavior could indicate one of four different types depending on why I am doing it:
What they’re doing is objectively wrong or corrupt (Type 1)
What they’re doing is inefficient (Type 5)
What they’re doing is unacceptable to me (Type 8)
What they’re doing is risky (Type 6)
The test sees the same behavior, but looks at the motivation to assess the category, and each form of that question is geared to a specific context. If someone has a predominant fear or desire, they will answer those questions out of that dominant trait, thereby telling the test how to score them.
What if I correct behavior for any or all of those reasons, depending on the situation? I certainly don't have four different personalities; I just have one system that I operate in, and it's not organized around a single dominant expression, because different drivers activate depending on context.
This is not as unusual as it might sound. Psychologist Walter Mischel argued decades ago that stable internal traits predict behavior across contexts far less reliably than we assume, and that situational input plays a much larger role in shaping response than trait-based models account for. More recently, researcher William Fleeson proposed that personality is better understood as a distribution of states expressed across situations rather than a fixed internal structure. Under that model, moving fluidly between drivers depending on context is a coherent architecture, not an absence of one.
(I would also point out that plenty of people are this way; this isn't some special status.) The test, however, aggregates different motivations that caused the same answer, and tries to force coherence into a category.
Did the test fail? Should we throw out the whole test and assume it cannot accurately describe people? I don’t think it’s that simple. Is it proof that I am somehow special or divergent? Hardly.
Personality tests are built to identify a dominant pattern. What if someone doesn’t express that pattern cleanly at the level the test is measuring?
There are plenty of people who can access multiple drivers. That doesn’t mean they lack structure, or that they are somehow emotionally or mentally unstable; in fact, I would argue that it merely means their structure doesn’t present in a way that produces a single, obvious signal on a questionnaire. That’s not a bad or good thing, and I think characterizing it as one or the other is gross. It is merely a thing that exists.
Self-reported tests are not actually observing you anyway; they rely on you to interpret yourself. What’s more, that interpretation has to be distilled to a fixed answer, and done consistently across dozens of questions.
When you answer a question about correcting someone, for instance, you might be answering based on how you evaluate the situation, depending on the context embedded in the question:
Is it wrong?
Is it inefficient?
Is it going to cause problems?
Is it something that needs to be addressed directly?
These are all different lenses applied to the same event. The test assumes one of those lenses is primary, and the others are secondary expressions of it, so it tries to isolate the dominant driver.
Therefore, if you are a true Type 1, you might answer affirmatively to questions about correction that are phrased in a way that trips the “is it wrong” part of you, but answer less affirmatively to questions phrased to trip someone’s need to be efficient or conflict-avoidant. If you don’t experience one of those as “the” reason, and you move between them fluidly depending on context or discernment, then your answers won’t stabilize into a single pattern.
I don’t believe that personality tests are somehow useless. I find them helpful in many cases, and we humans do like our categories. Michael Woudenberg has written extensively on personality testing, and points out that:
Throwing out personality tests just muddies the water by stripping out descriptions of healthy personality from the true pathologies that lurk around them. I’d rather have an imprecise measure to juxtapose the bad behaviors off of than muddy the waters and invite pathologies.
I agree with that. I would argue, however, that the tests are optimized for a specific kind of internal organization, where a dominant motivation or pattern consistently shapes perception, interpretation, and behavior. When that condition is present and the person reports it as such, the test will report accurately. When it isn’t present, or the person has a context-based decision model instead of a dominant pattern, it produces multiple answers and then hedges.
That, to me, is a boundary of the model itself.
So what now? What do we do with these tests when they fail to give us a neat little box with a label we can attach to our personal nametag? I would say we do nothing. Sure, continue to take them. Use them as conversation starters for the talks you have with yourself. After all, while all models are wrong, some of them are useful. Without taking this latest test, I would not know that I am a context-dependent thinker using multiple drivers without a dominant pattern. That has led to some interesting thinking and parsing in my own head about how that shows up and what it means for my decisions.
That being said, maybe outsourcing our own thinking to a test meant to tell us who we are is the real problem. Using an inanimate thing to identify us, to give us labels and identity, is no substitute for the thinking that we must do to know ourselves. It can be a starter, certainly, and there is value in using it for that.
These tests, however, assume to some extent that you already know things about yourself, enough to correctly articulate your patterns to a test that doesn’t know you at all. If you are a self-indulgent person who fancies themselves to be selfless, or someone who has bought into their own rationalizations about their behavior, you’re not going to admit on a personality test that you lack empathy or believe that people should cater to you, because you may not even be aware that you believe that.
You can easily see how this could be a problem. In that situation, the test will confirm your own rationalizations and leave you with an inaccurate picture because you unknowingly gave it inaccurate data.
This brings up another question: Wouldn’t it be true that the better you know yourself, the more accurate the data you give will be, and therefore the more accurate the test will be? My own experience tells me this is not necessarily true. But there is one thing I can say is probably true: The better I know myself, the less I need a test to tell me who I am.
If you walk away from an inconclusive result with a shrug and a choice to believe that you are somehow exempt from the question, you’ve missed the most useful data the test produced. A model that can’t place you has just told you something important: your drivers may not consolidate into a single dominant pattern. That means the tools built around dominant-pattern assumptions will give you distorted readings, including the tools you use to assess other people.
If your architecture is context-dependent, you need to know which drivers activate under which conditions, and why. Without that knowledge, you are operating on an incomplete map of your own decision system. That gap shows up in how you read situations and how you assess others, and it will lead to inaccurate decision-making.
If a personality test fails to describe you accurately, it might be that it’s just exposed the limits of its own model, and you’ll have to think through the implications for yourself. It might also be that you aren’t accurately describing yourself to the test. Either way, there’s work to be done, and the only one who can do it is you.

I've been thinking about Jonathan Haidt's Rider & Elephant metaphor to try to understand the issues with how we engage with these tests. The problem is that such questions ask the Rider (our rational mind) to predict or reflect on how we act, while in each moment, our Elephant (the emotional, instinctive mind) is more responsible for our behavior. When we ask the Rider, it is excellent at crafting a logical, coherent narrative that sounds perfectly reasonable to everyone, including ourselves. As you mention, Kit, everything is often situationally dependent; something deeper in our Elephant's preferences holds the pattern, and we're interviewing the wrong guy in our heads to explain it.
All models are wrong; some are useful. Personality tests are very useful!