The P-Value Problem in Astrology Research

Why does astrology research keep producing inconclusive results? Part of the answer is structural — p-value thresholds, researcher degrees of freedom, and the file drawer problem affect this literature the same way they affect the rest of social science.

The Line at .05

A p-value is the probability of observing a result at least as extreme as the one you got, assuming there’s actually no effect — assuming the “null hypothesis” is true. By long-standing convention in most of the social and biomedical sciences, a p-value below 0.05 is called “statistically significant,” and a p-value above 0.05 is not. A result with p = .04 gets reported as a finding. A result with p = .054 gets reported as “no significant effect.”
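To make the definition concrete, here is a minimal sketch of an exact one-sided p-value for a three-option matching task, where guessing succeeds one time in three. The trial count and hit counts are hypothetical, chosen only to show how the number is computed; they are not taken from any study discussed here, and `binomial_tail_p` is an illustrative helper name.

```python
from math import comb


def binomial_tail_p(hits: int, trials: int, chance: float) -> float:
    """One-sided p-value: P(X >= hits) when X ~ Binomial(trials, chance).

    This is the probability of doing at least this well by guessing alone,
    i.e. assuming the null hypothesis of no effect is true.
    """
    return sum(
        comb(trials, k) * chance**k * (1 - chance) ** (trials - k)
        for k in range(hits, trials + 1)
    )


# Hypothetical numbers, for illustration only.
trials = 100
chance = 1 / 3  # three options, one correct

for hits in (38, 40, 42, 44):
    p = binomial_tail_p(hits, trials, chance)
    label = "significant" if p < 0.05 else "not significant"
    print(f"{hits}/{trials} correct: p = {p:.4f} ({label} at .05)")
```

The label printed on each line is the convention at work: the underlying probability changes smoothly as the hit count rises, but the verdict flips at a single point.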

The arbitrariness of this line is well-documented and widely acknowledged even by researchers who use it constantly. There’s no principled reason 0.05 is the right threshold rather than 0.06, or 0.04, or any other value — it’s a convention, originating with Ronald Fisher’s early-twentieth-century work, that became entrenched through decades of institutional practice (journal standards, statistical training, tenure committees) rather than through any argument that it’s the uniquely correct cutoff for distinguishing “real” from “not real.”

This matters enormously for how research gets reported and remembered, because a result on one side of .05 and a nearly identical result on the other side get treated, in publication and in public discussion, as categorically different — one a discovery, the other a non-finding — despite representing almost the same underlying evidence.
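One way to see how close the two results are is to convert each p-value back into the test statistic that would produce it. The sketch below assumes a two-sided z-test and, for the comparison step, two independent estimates with equal standard errors. Both are simplifying assumptions made for illustration, not a description of any particular study's analysis.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()


def z_from_two_sided_p(p: float) -> float:
    """Size of the z-statistic that yields a two-sided p-value of p."""
    return STD_NORMAL.inv_cdf(1 - p / 2)


def p_for_difference(z_a: float, z_b: float) -> float:
    """Two-sided p-value for the gap between two independent z-statistics.

    Assumes both estimates have the same standard error, so the gap has
    standard deviation sqrt(2) on the z scale.
    """
    z_diff = abs(z_a - z_b) / sqrt(2)
    return 2 * (1 - STD_NORMAL.cdf(z_diff))


z_sig = z_from_two_sided_p(0.04)   # reported as "a finding"
z_not = z_from_two_sided_p(0.054)  # reported as "no significant effect"

print(f"z for p = .04:  {z_sig:.2f}")
print(f"z for p = .054: {z_not:.2f}")
print(f"p-value for the difference between them: {p_for_difference(z_sig, z_not):.2f}")
```

Under these assumptions the two z-statistics come out at roughly 2.05 and 1.93, and a test of the gap between them is nowhere near significant. The "significant" and "non-significant" results are not detectably different from each other.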

Carlson, Revisited

The most consequential application of this dynamic in astrology research involves the study most commonly cited as astrology’s definitive empirical refutation: Shawn Carlson’s 1985 double-blind test, published in Nature.

Carlson’s study asked professional astrologers to match natal charts to personality profiles generated by the California Psychological Inventory, under conditions designed to prevent any cues other than the chart-profile correspondence itself. The paper’s conclusion — that astrologers performed no better than chance — was widely reported as a definitive, “devastating” result, and has been cited as such in skeptical literature for the four decades since.

A reanalysis published in 2023 went back to Carlson’s actual data and reporting, and found something more complicated. Carlson’s study included multiple tests with different formats. One — a three-way forced-choice test — produced a result of p = .054, just above the conventional significance threshold. Under standard reporting conventions, this gets characterized as “not significant” — a non-result. Another test, using a different assessment method (a 10-point rating scale rather than forced choice), produced p = .04 on the same underlying question — just below the threshold, and therefore reportable as a “significant” finding, had the framing of the paper emphasized it.

The reanalysis’s point isn’t that these results prove astrology works; the authors are explicit that they don’t consider the results sufficient to “deem astrology as empirically verified.” The point is narrower and, in a sense, more damaging to how the original study has been used. A paper whose headline conclusion was a clean null result actually contained results on both sides of the conventional significance line. The framing that emphasized the null side became the canonical story, repeated for decades, while the marginally significant side received essentially no attention.

Why This Isn’t Specific to Astrology

It would be a mistake to read this as evidence of bias specifically against astrology research — though that’s a charge that has been made, including by some of the researchers involved. The underlying dynamic — results near the p = .05 threshold getting sorted into “finding” or “non-finding” bins that overstate how different they actually are — is a structural feature of how significance testing works across all of social science, and it’s a major contributor to what’s become known as the replication crisis.

The “garden of forking paths” problem, described by statisticians including Andrew Gelman, captures part of this. A dataset can usually be analyzed in multiple reasonable ways (different outcome measures, different subgroups, different statistical tests), and researchers, without any deliberate misconduct, tend to settle on the analysis that produces the cleanest, most reportable result. When that happens, the reported p-value understates how much “researcher freedom” went into producing it. A p-value of .04 from one of several reasonable analyses that could have been run doesn’t tell you what a p-value of .04 is supposed to tell you, because the formal calculation assumes a single pre-specified analysis, not a search across multiple candidates that happened to land on this one.

Astrology research is, if anything, especially exposed to this problem, because of how many “reasonable analyses” a birth chart supports. A natal chart contains dozens of placements, aspects, and derived points. A study testing “does astrology work” has to choose which of these to test, against which outcome measures, for which population. The space of possible analyses is enormous, which means the probability that at least one of them produces p < .05 by chance alone is also large, even if the underlying phenomenon being tested has no real effect at all.
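The size of this effect can be sketched with a small simulation. Each simulated "study" below runs several analyses on pure noise, so there is no effect to find, and counts as a hit if any one of them reaches p < .05. The analyses are assumed to be independent, which real chart features are not, so the true inflation for correlated analyses would be smaller than the figures printed here while still exceeding the nominal 5%. The function names, sample sizes, and the z-test with known variance are all illustrative assumptions.

```python
import random
from statistics import NormalDist, fmean

STD_NORMAL = NormalDist()


def two_sided_p(sample: list[float]) -> float:
    """z-test of 'mean = 0' with known unit variance (a simplification)."""
    z = fmean(sample) * len(sample) ** 0.5
    return 2 * (1 - STD_NORMAL.cdf(abs(z)))


def share_with_any_hit(
    n_analyses: int,
    n_studies: int = 1000,
    n_obs: int = 25,
    alpha: float = 0.05,
    seed: int = 0,
) -> float:
    """Fraction of null studies where at least one analysis has p < alpha."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_studies):
        # Every analysis looks at fresh noise: no real effect anywhere.
        p_values = [
            two_sided_p([rng.gauss(0, 1) for _ in range(n_obs)])
            for _ in range(n_analyses)
        ]
        if min(p_values) < alpha:
            hits += 1
    return hits / n_studies


for k in (1, 5, 20):
    simulated = share_with_any_hit(k)
    expected = 1 - (1 - 0.05) ** k  # closed form for independent tests
    print(f"{k:>2} analyses: simulated {simulated:.2f}, expected {expected:.2f}")
```

With a single pre-specified analysis the false-positive rate stays near the nominal 5%. With twenty independent candidates to choose from, the closed-form rate is about 64%.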

The File Drawer

A related structural problem is what’s called the “file drawer” effect: studies that find null results are less likely to be submitted for publication, and less likely to be accepted if submitted, than studies that find significant effects — not because of fraud, but because null results are perceived (often correctly, in terms of career incentives) as less interesting and less publishable.

This creates a systematic bias in the published literature on any topic: it overrepresents positive findings relative to the true underlying rate, because negative findings disproportionately stay in researchers’ file drawers rather than reaching publication. For a topic like astrology, where many informal and small-scale studies have likely been conducted by practitioners and researchers without the resources or institutional backing for large-scale, pre-registered research, the file drawer effect could plausibly be especially severe — an unknown number of small studies finding nothing, never written up, against a smaller number of studies finding something, which get discussed and cited.
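A rough simulation shows how strong this selection can be. The sketch below assumes every study tests a true null, that significant results always get published, and that null results get published only 10% of the time. That 10% figure is an arbitrary illustrative assumption, not an estimate from any literature.

```python
import random


def published_share_significant(
    n_studies: int = 10_000,
    alpha: float = 0.05,
    null_publish_prob: float = 0.1,
    seed: int = 0,
) -> float:
    """Share of published studies that are 'significant' when no effect exists."""
    rng = random.Random(seed)
    published = 0
    published_significant = 0
    for _ in range(n_studies):
        # Under a true null, p-values are uniformly distributed on [0, 1].
        p = rng.random()
        significant = p < alpha
        # Significant results are always published; nulls only sometimes.
        if significant or rng.random() < null_publish_prob:
            published += 1
            published_significant += significant
    return published_significant / published


share = published_share_significant()
expected = 0.05 / (0.05 + 0.95 * 0.1)
print(f"simulated share of published studies with p < .05: {share:.2f}")
print(f"expected share under these assumptions:           {expected:.2f}")
```

Under these assumptions roughly a third of the published record reports a significant result, even though only 5% of the studies actually run produced one and none of them reflects a real effect.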

The file drawer problem cuts in a direction that should make skeptics, not just believers, cautious: if you only ever hear about the studies that found something — whether that something supports or undermines astrology — your sense of “what the research shows” is being shaped by a selection process that has nothing to do with the actual rate at which effects appear in unselected data.

What a More Honest Research Program Would Look Like

The structural problems described here aren’t unique to astrology, and the fixes proposed for them in the broader replication crisis literature apply here too, in principle: pre-registration (specifying exactly which analysis will be run, and on which outcome measures, before seeing the data, which closes off most of the forking paths), larger sample sizes (which estimate effects more precisely, so that a study’s conclusion depends less on which side of the threshold a borderline p-value happens to fall), and a publication culture that treats null results as informative rather than uninteresting.

Some of the better astrology research — including aspects of the Gauquelin work discussed in the companion piece on meta-analyses — has approached these standards more closely than the popular “studies show astrology doesn’t work” narrative usually credits. Some of the most-cited “definitive” results, including Carlson’s, turn out on close inspection to be less clean than their reputation suggests — not because they secretly support astrology, but because the reporting conventions that produced their reputation are the same conventions that produce overstated certainty throughout social science.

The p = .05 line doesn’t know anything about astrology. It doesn’t know anything about anything — it’s a historical accident that calcified into a standard, and it sorts results into “yes” and “no” bins with a confidence that the underlying statistics don’t actually support, for any topic it’s applied to. Astrology research has been on the receiving end of this for forty years, in both directions, and the line has been treated, throughout, as if it were sharper than it is.
