I saw one of those Instagram advertisements that make wild claims for a new probiotic (“the reason you’re not losing weight is because you don’t have any Akkermansia”) the other day, and obviously I immediately slagged it off on my stories.
But then someone sent me a published trial the company had done on glucose, and said “wait - this company actually does legit science, you should check it out”. So I did. I don’t think the paper shows what the company are claiming it does but on the face of it, the paper might seem convincing. So I thought it might be worth reviewing it here. WARNING - THIS IS QUITE A WONKISH REVIEW.
The company in question is called Pendulum, and it seems they manufacture and sell what look like proprietary probiotics or mixtures of probiotics. Probiotics are bacteria that we have in our gut, and the theory is that by replenishing the types of bacteria we might be lacking, we can get a “better” microbiome and improve our health. Of course, no matter how compelling the theory, or how much in vitro work has been done, the real litmus test for any product or claim is “does it actually have any meaningful impact we can measure in a human?”.
The positive news is that it looks like the company have actually tested their products in some double-blinded randomised controlled trials. The paper the person sent me is this one:
The Study
This was a 16-week placebo-controlled study to evaluate the safety and impact of two probiotic products: WBF-010 and WBF-011 (both are tablets containing mixes of microbes) in people with type 2 diabetes. The primary endpoints were safety, glucose AUC (area under the curve, so it’s a measure of the total exposure to glucose) after a standard meal, and C-reactive protein (a marker of inflammation).
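As an aside, for anyone unfamiliar with AUC: here’s a minimal Python sketch of how a total glucose AUC over a meal test is typically computed, using the trapezoidal rule. The glucose readings below are invented for illustration - they are not from the paper.

```python
# Total AUC for a post-meal glucose curve via the trapezoidal rule.
# Glucose values below are invented for illustration, not from the paper.

times = [0, 30, 60, 90, 120, 180]          # minutes after the standard meal
glucose = [140, 195, 210, 190, 170, 150]   # mg/dL (hypothetical readings)

def total_auc(t, y):
    """Area under the curve (mg/dL x min) using the trapezoidal rule."""
    return sum((t[i + 1] - t[i]) * (y[i] + y[i + 1]) / 2 for i in range(len(t) - 1))

print(f"total AUC = {total_auc(times, glucose):.0f} mg/dL x min over {times[-1]} min")
```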
Now let me start by highlighting how many things this company and this study do right:
It’s double-blinded, has a suitable placebo, and is randomised.
They registered a primary outcome* in a clinical trial registry.
They published the findings in a peer-reviewed journal.
They even included a statistical analysis plan along with the main paper.
They seem to be doing a bunch of different trials for their products, some of which are recruiting via their website.
Where they strayed from good science, IMO, is in how they interpreted their results.
They registered their primary metabolic outcome as the glucose total area under the curve (AUC), and report a significant reduction in this outcome in the paper.
But I am not convinced that their data show a significant, clinically relevant reduction in glucose that is due to the supplement alone.
Let’s examine the paper in depth:
Statistical issues
They had a slightly unusual approach to their analysis. Since they had two primary outcomes and were making multiple comparisons (WBF-010 vs placebo and WBF-011 vs placebo), the investigators had to take steps to reduce the likelihood that any result they found might be a false positive.
I am not a statistician but let me try to explain. The more outcomes you measure in a study, the greater the likelihood that you get a false positive (we call this a type 1 error). Any single statistical test run at the conventional threshold will yield a false positive 5% of the time when there is no real effect - that’s where the p value cut-off for statistical significance (p<0.05) comes from. Lots of people think that 5% is too high, which is why you often see investigators set statistical significance at p<0.01 or p<0.001. And remember, this is for a single statistical test (i.e. comparing the effect of vitamin C vs placebo on fasting glucose concentration) - what if you are doing multiple statistical tests in one study? As I mentioned, in this study they had 3 groups and two primary outcomes (glucose tAUC and CRP - both of which were compared between WBF-011 and placebo and between WBF-010 and placebo).
One way of approaching this that I am more familiar with is a Bonferroni correction. This adjusts the significance threshold for the number of comparisons (statistical tests) you are making. So if you were running 3 comparisons, you divide the threshold by 3, which would tell you that your statistical significance needs to be set at p<0.017 (0.05/3 = 0.0167). So effectively you’re saying “we acknowledge that by comparing multiple things, we’re more likely to get more ‘hits’, so to account for this we are setting our accepted significance level at a stricter (lower) cut-off”.
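If it helps to see the numbers, here’s a quick Python sketch of the arithmetic (the test counts are just for illustration):

```python
# Family-wise error rate: the chance of at least one false positive when
# you run k independent tests, each at alpha = 0.05.
alpha = 0.05
for k in range(1, 5):
    fwer = 1 - (1 - alpha) ** k
    print(f"{k} test(s): P(>=1 false positive) = {fwer:.1%}")

# Bonferroni correction: divide the significance threshold by the number
# of comparisons, e.g. 3 comparisons -> p < 0.0167.
k = 3
print(f"Bonferroni-corrected threshold for {k} comparisons: {alpha / k:.4f}")
```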
They don’t do this in this study. Instead they use a sequential testing procedure which I am less familiar with (so I might be totally wrong!). Essentially this approach runs statistical tests at the 0.05 level in a pre-determined order (they state this order in their statistical plan) until they encounter the first test that is not significant, at which point they stop. My concern with this approach is that they consider the study-wide type 1 error rate to be 0.1. In other words, taking all the tests they are running into account, they accept a 10% chance that the study produces at least one false positive. I think this should be closer to 20% because they’re running 4 tests (with four independent tests at p<0.05 each, the chance of at least one false positive is 1 - 0.95^4 ≈ 19%). Now, some types of sequential testing can adjust the significance level for subsequent tests if the first test is positive, so maybe that’s what they did here. But given the importance of adjusting for multiple comparisons, I don’t think the authors have made a clear enough case that they have controlled for this.
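For what it’s worth, here is my reading of how a fixed-sequence procedure like theirs works, as a small Python sketch. The ordering below is my guess at a plausible sequence over their four comparisons, and the p-values are invented - neither is taken from their statistical plan.

```python
# My reading of a fixed-sequence (hierarchical) testing procedure:
# test pre-ranked hypotheses at alpha = 0.05 in a pre-specified order,
# and stop at the first non-significant result. The order and p-values
# below are invented for illustration.

alpha = 0.05
ordered_tests = [
    ("WBF-011 vs placebo: glucose tAUC", 0.020),
    ("WBF-011 vs placebo: CRP",          0.300),
    ("WBF-010 vs placebo: glucose tAUC", 0.040),
    ("WBF-010 vs placebo: CRP",          0.010),
]

for name, p in ordered_tests:
    if p < alpha:
        print(f"{name}: p = {p} -> significant, continue down the sequence")
    else:
        print(f"{name}: p = {p} -> not significant, STOP")
        break  # later hypotheses are never formally tested
```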
Could the changes in glucose just be natural variation?
Post-prandial glucose is extremely variable. In other words, if you measure someone’s post-prandial glucose one week and then measure it again the next, very often it will change to a clinically significant degree. Because of this, any studies aimed at detecting changes in post-prandial glucose will typically need quite large sample sizes (unless the effect size is very large).
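To give a sense of scale, here’s a back-of-the-envelope sample-size calculation for a two-arm comparison. All the SDs and effect sizes here are assumptions I’ve made up to illustrate the point, not values from the paper:

```python
# Back-of-the-envelope sample size per arm for a two-sample comparison
# of mean change in glucose tAUC. SDs and effect sizes are assumptions
# for illustration, not values from the paper.

z_alpha = 1.96  # two-sided alpha = 0.05
z_beta = 0.84   # 80% power

def n_per_group(sd, delta):
    """Approximate n per arm for a two-sample comparison of means."""
    return 2 * ((z_alpha + z_beta) * sd / delta) ** 2

for sd, delta in [(40, 20), (60, 20), (80, 20)]:
    print(f"SD = {sd}, true effect = {delta}: ~{n_per_group(sd, delta):.0f} per group")
```

The noisier the outcome relative to the effect you’re chasing, the more people you need - far more than the ~20 per arm in this study.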
For example, if you look in the table below from the paper, the total AUC for glucose increased in the placebo group (+21.2 mg/dL/180 min) by more than it decreased in the WBF-011 group (-15 mg/dL/180 min). While you expect some increase in glucose in people with diabetes who are in a control arm, it’s quite unusual to see it be larger than the change in the other direction from the intervention. So instinctively, I think what they’re seeing might just be natural variation.
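To make that intuition concrete, here’s a rough simulation: if there were no true effect at all, how often would an arm of ~20 people show a mean shift as big as the ones in the table, purely from week-to-week noise? The SD I’ve assumed is invented for illustration, not estimated from the paper.

```python
import random

# If there were NO true effect and the week-to-week SD of each person's
# change in glucose tAUC were 60 mg/dL/180 min (an assumption, not a
# value from the paper), how often would an arm of 20 people show a
# mean change of 15 mg/dL/180 min or more, purely by chance?

random.seed(0)
n, sd, trials = 20, 60, 10_000
big_shifts = 0
for _ in range(trials):
    changes = [random.gauss(0, sd) for _ in range(n)]  # per-person change, no real effect
    if abs(sum(changes) / n) >= 15:
        big_shifts += 1
print(f"P(|mean change| >= 15) under the null: {big_shifts / trials:.1%}")
```

Under those (made-up) assumptions, arm-level swings of that size happen a fair chunk of the time with no treatment effect at all.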
Sample size
I’m being quite tough on the investigators so far, and to be fair, their study was a proof-of-principle study, not a fully-powered long-term trial. This is the type of study researchers might do if they’re not really sure whether what they’re testing will work. So they’ll often do a shorter, smaller study initially, rather than spending a ton on a fully powered long-term trial. Accordingly, they didn’t have a sample size calculation for this trial, and they ended up testing very small numbers (they did a per-protocol analysis on n=16 in the placebo group, and n=21 in each of the intervention groups).
The problem with small sample sizes is that it only takes a few people in one of the arms changing their behaviours to really confound your findings. For example, if a few individuals lost 5-6 kg in the WBF-011 group whereas there were no changes in weight in the placebo arm, this might well be enough to push the needle towards statistically significant changes in glucose. But remember, if this scenario happens, the reduction in glucose is not caused by the agent being tested, but by a confounding factor (weight). Unfortunately the investigators don’t report weight changes, so we can’t see whether change in weight could have played a role here.
Clinical relevance on glucose
And finally, it’s always really important to consider whether the effect size - even if something is statistically significant - is clinically relevant. The effect size is the magnitude of change.
In this study, there was a claimed statistically significant effect with WBF-011 on the total AUC for glucose, but the effect on fasting glucose or HbA1c was not statistically significant. The latter suggests this type of probiotic may not improve glucose to a degree that’s clinically relevant in people with diabetes. In the population studied, post-prandial glucose is a bigger contributor to HbA1c than fasting glucose - so the fact that this probiotic may lower post-prandial glucose but does not affect HbA1c would suggest to me that the effect size is not large.
There’s also a curious issue with what happens to insulin. Typically if glucose comes down, you might expect insulin to come down as well - particularly in people with early type 2 diabetes. But in this study, there was no reduction in post-prandial insulin. Now, the authors only show the AUC for 180 mins, so we can’t see whether it’s a case of just early insulin increasing (which could be a good thing?) or potentially excessive insulin release for the prevailing concentration of glucose. HOMA-IR did not change either, which suggests WBF-011 does not improve insulin sensitivity.
Bottom line
All in all, I don’t interpret their results as showing the formulation WBF-011 lowers glucose or improves glucose homeostasis in any convincing way.
I think getting into the nitty gritty of this paper really demonstrates how difficult it is to determine whether scientific claims are “legit” or not. On the surface the amount of information and data they provide in their paper is really impressive. But a closer examination IMO reveals there’s not much here to get excited about**.
*They actually have two primary outcomes, but for this proof-of-concept study their choice seems fine.
**Or pay 70 dollars a month for.
Thank you for clarifying this. I heard about this on Peter Attia’s podcast and wondered what the clinical impact was for the price patients have to pay out of pocket for something that may not provide much benefit.
A sequential approach to controlling type 1 error is adequate. In fact, most clinical trials with approvable clinical endpoints use it in their statistical analysis plan. In short, you are saying that you will test your first endpoint, and only if that is statistically significant will you move to your ranked secondary endpoint, and so forth. If the tested endpoint is not statistically significant, you do not test the other endpoints.