
World Sports Advocate

Opinion: Why the IAAF’s latest testosterone study won’t help them at CAS

The deadline for the International Association of Athletics Federations to submit evidence in support of its Hyperandrogenism Regulation, which was suspended by the Court of Arbitration for Sport two years ago, is the end of September 2017. Katrina Karkazis, Senior Research Scholar at the Center for Biomedical Ethics at Stanford University, and Gideon Meyerowitz-Katz, a chronic disease epidemiologist in Australia, dissect the evidence published by the IAAF in support of its testosterone regulation. They share their opinion on whether the findings meet the CAS’s requirement: that the IAAF must show that female athletes with higher T have a performance difference that approximates what male athletes typically have over female athletes, which CAS places at 10-12%.

The 2017 World Championships for track and field started in London on 5 August 2017 and will no doubt be accompanied by a familiar and specious media squall over women with naturally high testosterone levels and whether they have the right to compete. 

To be clear: they do. This legal claim was upheld two years ago, owing to Indian sprinter Dutee Chand’s successful challenge at the Court of Arbitration for Sport (‘CAS’), in which the International Association of Athletics Federations’ (‘IAAF’) testosterone regulation was suspended for two years due to a lack of scientific evidence to support it[1]. The regulation, which was put in place for the 2011 World Championships in Daegu, South Korea, placed an upper limit on natural T levels for women, arguing that women with higher levels have an unfair advantage over their peers. If a woman’s T levels were deemed by the IAAF to give her an ‘unfair’ advantage, she was barred from competing in the female category until she lowered them via surgery or drugs. (The IOC implemented an analogous policy in 2012 and suspended its regulation following the CAS ruling against the IAAF.)

The CAS ruling, issued on 24 July 2015, could have spelled the end of this regulation - and essentially decades of so-called ‘sex testing’ of women athletes - but it didn’t. In the past year, both on the eve of the 2016 Rio Olympics, and before the 2017 Asian Athletics Championships (held in Chand’s hometown of Bhubaneswar, India), IAAF President Sebastian Coe expressed the IAAF’s intention to seek to reinstate the regulation. 

The deadline for the IAAF to submit evidence in support of the regulation is the end of September 2017[2]. IAAF policymakers recently published two studies in the British Journal of Sports Medicine intended as evidence for CAS. The first study[3] was published to little fanfare. The second study[4], released three days before the Asian Athletics Championships, on 3 July 2017, was accompanied by an IAAF press release[5] and two Guardian articles, all supporting the regulation. One of the newspaper pieces was an op-ed[6] by a ‘witness for the IAAF at the Chand trial’ who admitted to being ‘not entirely impartial.’ The other was a news article[7]. This coverage generated yet more articles that regurgitated and magnified the IAAF’s position and spin.

The IAAF is heralding this study as major and important evidence. It isn’t. 

From the outset, CAS has been clear about what evidence it requires in order to uphold the regulation. The IAAF must show that female athletes with higher total T have a performance difference that approximates what male athletes typically have over female athletes, not that female athletes with higher T have any competitive advantage over their peers. In other words, it has to be a big performance difference, which CAS put in the 10-12% range.

What the study found is nothing near this.

For the study, the authors looked at the performances of female athletes in 21 track and field events competing in the 2011 and 2013 IAAF World Athletic Championships (17.3% were sampled at both events). They divided the women into three groups (tertiles) according to whether they judged their free T (fT) concentrations to be low, medium, or high.

They then did a number of statistical tests comparing the performances of women in the highest fT group with those in the lowest fT group. They found a statistically significant correlation between fT and performance in five events: 400m, 400m hurdles, 800m, hammer throw, and pole vault. Women athletes with fT in the highest tertile performed ‘significantly better’ in these events than women in the lowest fT tertile, by margins ranging from 1.78% to 4.53%, with 800m showing the smallest difference and hammer throw the largest.

This is effectively the same 1-3% difference in performance between female athletes with hyperandrogenism and their peers that the IAAF presented as evidence of unfair advantage to the CAS, and which CAS rejected as insufficient evidence. 

But that didn't stop both the study and the press release[8] from engaging in speculation that the study cannot support.

The authors argue that female athletes in the group with the highest fT performed ‘significantly better’ in those five events than those in the group with the lowest fT - this means that the p (or probability) value was less than the critical threshold of 0.05 for statistical significance. This is then recast as the claim that women with higher fT have a ‘significant competitive advantage’ in the paper’s conclusion. 

Something that the study’s authors and promoters have failed to mention, however, is that the IAAF regulation is based on total T levels, not free T concentrations. 

Moreover, the findings don’t all go in one direction. In the 100m, 100m hurdles, 200m, 1,500m, 3,000m SC, 5,000m, 10,000m, javelin, triple jump, and 20km RW, women with the lowest fT actually outperformed those with highest fT. None of these results were statistically significant, but they nevertheless show that ‘significant competitive advantage’ was not evidenced across the board.

The strategic use of the word ‘significant’ makes the findings appear more consequential than they are, when all they really say is that these selective findings are statistically significant. Contrary to what has been reported in The Guardian[9], statistical significance is not actually a very high bar. In fact, finding a p-value less than 0.05 is the lowest of all bars in scientific research. The American Statistical Association[10] has said that a p-value alone ‘does not provide a good measure of evidence regarding a model or hypothesis,’ because in and of itself a p-value tells you little about the size of the effect. The IAAF were not asked by CAS to find statistical significance; they were charged with finding a performance difference (effect size) in the 10 to 12% range. Nowhere in the paper or press release do they acknowledge that they found no performance difference of this magnitude.

Worse, even the statistical significance in question is on shaky ground for a few reasons.

First, if you just compute the study’s p-values, most of them are between 0.04 and 0.05. That’s technically significant (if you use p < 0.05), but when the p-value is this close to the threshold, the inclusion or exclusion of every single athlete is of consequence. Remove one person and significance disappears; add another, and it’s there. This is why the study’s inclusion and exclusion criteria are important (e.g., including in the sample athletes who have doped). 

Here’s the second reason. Say you want to see if there is a relationship between fT and performance among athletes in this sample. Your first idea might be to test this separately for each event, using a level of significance like p < 0.05. This might not seem like a bad idea at first glance, but running a statistical test is a bit like rolling a 20-sided die: do enough tests and you are bound to come up with a few 20s. 

The authors ran the tests over and over and over - trying 43 times (22 for men and 21 for women) to reveal a relationship between the highest fT levels and performance. Just as when a die is rolled, the more times the analysis was run, the higher the probability of getting at least one significant result became - simply by chance, even if no real relationship exists. This is why this type of analysis - it’s called multiple comparisons - must be corrected for chance, or it runs the risk of showing what’s known as a ‘false positive.’
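To put rough numbers on the die analogy, here is a minimal Python sketch of how the chance of at least one false positive grows with the number of tests. It assumes every test is independent and that no real effect exists, which is a simplification given the overlap among competitors across events. The function name is ours, chosen for illustration; the test counts (21, 22, 43) and the 0.05 threshold come from the text above.

```python
def familywise_error_rate(num_tests: int, alpha: float = 0.05) -> float:
    """Chance of at least one false positive across `num_tests` tests.

    Assumes the tests are independent and that there is no true effect,
    so each test has probability `alpha` of a spurious 'significant' result.
    """
    return 1.0 - (1.0 - alpha) ** num_tests


# Test counts taken from the article: 21 women's events, 22 men's, 43 in all.
for label, n in [("women's events", 21), ("men's events", 22), ("all tests", 43)]:
    print(f"{label} ({n} tests): {familywise_error_rate(n):.1%}")
```

Under these assumptions, 21 tests give roughly a two-in-three chance of at least one spurious 'significant' result, and 43 tests give close to nine in ten.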

And in the words of the authors: ‘These different athletic events were considered as distinct independent analyses and adjustment for multiple comparisons was not required.’

Well, you can’t really do that. One reason is that there's too much crossover among the group of competitors. 

Common methods to correct for multiple comparisons include the Bonferroni correction and the Benjamini-Hochberg procedure. If you apply these corrections to their data, none of the results are likely to be statistically significant[11], undermining all of the study’s conclusions.
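For readers who want to see how these two corrections work, the sketch below implements both in plain Python. The p-values in it are hypothetical, invented purely for illustration: five sit just under 0.05 and sixteen are clearly non-significant, to mimic the general shape of 21 tests described above. They are not the study's actual values, and the function names are our own.

```python
def bonferroni(p_values, alpha=0.05):
    """Reject a test only if its p-value is at most alpha divided by the number of tests."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


def benjamini_hochberg(p_values, alpha=0.05):
    """Step-up procedure controlling the false discovery rate at `alpha`."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k whose sorted p-value satisfies p_(k) <= (k / m) * alpha.
    max_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            max_rank = rank
    # Reject every hypothesis ranked at or below that k.
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= max_rank:
            rejected[idx] = True
    return rejected


# HYPOTHETICAL p-values for 21 tests (not the study's data).
hypothetical_p = [0.030, 0.041, 0.044, 0.046, 0.049,
                  0.08, 0.12, 0.15, 0.21, 0.27, 0.33, 0.38, 0.44,
                  0.51, 0.58, 0.63, 0.69, 0.74, 0.81, 0.88, 0.95]

print("Uncorrected (p < 0.05):", sum(p < 0.05 for p in hypothetical_p))
print("Bonferroni:", sum(bonferroni(hypothetical_p)))
print("Benjamini-Hochberg:", sum(benjamini_hochberg(hypothetical_p)))
```

With these made-up inputs, five tests pass the uncorrected threshold and none survive either correction, because the Bonferroni cut-off for 21 tests is 0.05 / 21, or about 0.0024.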

We noticed that the authors did an analysis using total T in the supplementary file and found a statistically significant relationship for the same three running disciplines for female athletes, but not for the hammer throw and pole vault. After correction, these findings are highly unlikely to be significant.

The authors don’t stop there, however. They suggest their study can speak to things it can’t, all of which are useful for the case. As one example, they extrapolate from their findings to suggest that higher fT levels will result in even better performance. That’s what they mean when they talk about a ‘dose-response relationship’:

‘In female athletes, a high fT concentration appears to confer a 1.8–2.8% competitive advantage in long sprint and 800 m races. Taking into account a linear dose–response relationship between serum androgen levels and athletic capacity, it is possible that the magnitude of this advantage will be even greater for an androgen-sensitive female athlete with T or fT concentrations within the normal male range.’

The lead author was even looser with his words in the IAAF press release:

‘If, as the study shows, in certain events female athletes with higher testosterone levels can have a competitive advantage of between 1.8-4.5% over female athletes with lower testosterone levels, imagine the magnitude of the advantage for female athletes with testosterone levels in the normal male range.’

But this study does not and cannot speak to any degree of performance advantage; neither the right type of comparison nor the right type of statistical analysis was made to support such a claim. Let’s parse this for a minute.

The study is a group-level analysis, comparing women in the lowest fT tertile with women in the highest fT tertile in each discipline. The above quotes, however, make claims suggesting they have done an athlete-level analysis. To make a claim like this - the higher the T, the better the performance - a linear regression would have to be performed. If they had done a linear regression we would also be able to see a regression coefficient for T, giving us insight into whether and how T correlates with performance.
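As a rough illustration of what such an athlete-level analysis produces, the sketch below fits a simple least-squares line in plain Python and reports the coefficient for T. The fT values and race times are hypothetical numbers invented for this example, not data from the study, and a real analysis would also need standard errors, confidence intervals and adjustment for other factors.

```python
def ols_slope_intercept(x, y):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# HYPOTHETICAL athletes: free T (arbitrary units) and 400m time in seconds.
free_t = [2.0, 4.0, 6.0, 8.0, 10.0]
time_400m = [52.4, 52.9, 52.1, 52.6, 52.2]

slope, intercept = ols_slope_intercept(free_t, time_400m)
# The slope is the regression coefficient: the estimated change in race time
# per one-unit increase in fT across individual athletes.
print(f"coefficient for fT: {slope:.3f} seconds per unit")
```

The coefficient, together with its uncertainty, is what would be needed to support a statement of the form 'the higher the T, the better the performance'; a comparison of two tertiles does not supply it.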

What’s more, you’d need way more than five significant findings to draw the conclusions they do. The most they can say is that there is a relationship between fT and performance in some events; they cannot claim that higher T, even much higher T, results in better performance. In fact, the corollary to their main conclusion - that in five events women with higher fT levels have a significant advantage - is that in the vast majority of events higher fT levels were not correlated with performance, which doesn't fit their claims that women with high T ‘may have a massive androgenic advantage’[12] or that there is a ‘strong scientific consensus’[13] that endogenous testosterone levels are a significant determinant of female athletes’ performance.

The authors also made problematic choices about whom to include in the female sample, including women who were known to have doped. Conflation between endogenous and exogenous T (natural versus doping) is not new; indeed, the paper opens with quotes implying that the two are ‘identical.’ Yet in an earlier study[14] these authors excluded women who doped from their sample, on the basis that doping was a confounder. We do not know which events included the women who doped. We do know that at least nine women from the group in the study, possibly more, were found to have been doping. It could be that the statistically significant findings were in events that had women who had doped. We don't know.

The authors also included women with T in the so-called ‘male-typical’ range, and they were correct to have done so. But doing so renders moot the statement above about imagining the boost women with T in that range might have. We don't have to imagine at all - they were included in the analysis.

The IAAF press release points to other evidence they marshalled for CAS: an opinion piece[15] by the lead IAAF policymaker, and an earlier study by IAAF-affiliated researchers (mentioned above). That study was not accompanied by a press release, probably because it ‘found no correlation between serum T and physical performance.’

The IAAF has had six years to assemble evidence for this regulation. Currently, they have presented one opinion piece and these two studies. Even some who have supported the regulation agree these studies are not persuasive or pivotal.

But all this scientific parsing and sleight of hand distracts from the very real impact of this regulation on women, women who have undergone anything from invasive exams to serious and irreversible medical interventions, including hormone replacement and clitoral surgery[16], all to keep competing. Both the studies and the media attention that has followed ignore the lives that have been harmed and the careers ruined by these regulations.

So, what do we have here? We have a regulation that is ostensibly intended to level the playing field among women - except that it remains bereft of evidence. Some argue that no regulation will be the end of women’s sport. Meanwhile, there has been no regulation for two years and women’s sport looks like it always has: women competing against women.


Katrina Karkazis, PhD, MPH is a Senior Research Scholar at the Center for Biomedical Ethics at Stanford University. Her research on this regulation has appeared in Science, BMJ, and the American Journal of Bioethics. She also testified as an expert witness in Dutee Chand's case. 


Gideon Meyerowitz-Katz, MPH is an epidemiologist working in diabetes prevention and management in Western Sydney, Australia. He writes a regular public health blog at Gid M-K: Health Nerd, covering public health in all its intricacies and how it is often misconstrued.


















