We don’t know how to fix science

If you are reading this, you will probably have read about ways to improve the institutions we use to advance science. Perhaps you have come across the occasional call for abolishing pre-publication peer review, increasing transparency and reproducibility, or funding people rather than projects. But this conversation is not backed by strong evidence—we don’t actually know if these will work. Running more experiments could change this.

As an example of the problem, consider the idea of “funding people, not projects.” In this proposal, scientists would spend less time writing proposals detailing what they want to do in order to seek funding. Instead, a funding agency would pick excellent scientists and fund them, regardless of what exactly they want to study. DARPA, the Howard Hughes Medical Institute (HHMI), and before them the Rockefeller Foundation have historically operated in this way.

One of the main papers in the “fund people” literature, by Pierre Azoulay and colleagues, reported some intriguing results when it compared the funding model of the HHMI with that of the National Institutes of Health (NIH). The importance of these results has been stressed by Patrick Collison and Tyler Cowen, in the article that launched the Progress Studies movement:

“Similarly, while science generates much of our prosperity, scientists and researchers themselves do not sufficiently obsess over how it should be organized. In a recent paper, Pierre Azoulay and co-authors concluded that Howard Hughes Medical Institute’s long-term grants to high-potential scientists made those scientists 96 percent more likely to produce breakthrough work. If this finding is borne out, it suggests that present funding mechanisms are likely to be far from optimal, in part because they do not focus enough on research autonomy and risk taking.”

But before we rush to remake NIH in HHMI’s image, we have to ask: is this effect real? Can we really double the odds that a scientist will produce breakthrough work just by changing the way they are funded? As I have written before, there are reasons for skepticism about this particular result.

And even if we do think that the HHMI funding model is better, can we scale it? Azoulay himself thinks not: in his view, only a handful of elite scientists could take advantage of a program like this. But we have no way to know for sure. Nor do we have a good idea of whether this putative lack of scalability is even a problem. Maybe funding a small number of elite scientists would get most of the science we need done anyway. Either way, this is thin evidence for a complete restructuring of our scientific institutions.

Here’s another example. In many science-funding settings, a committee decides how to allocate funding. Typically, it will use some measure of agreement to decide what to fund. What if that leads to overly safe, conservative work? What if instead we use disagreement to select potentially groundbreaking work? Perhaps if the experts cannot agree on whether a grant proposal is good or bad, that should suggest to us there is something interesting in it.

Adrian Barnett and colleagues looked at this question in their paper “Do funding applications where peer reviewers disagree have higher citations? A cross-sectional study.” They found the answer to be negative, that disagreement does not predict success. But if you were an advocate of disagreement-driven funding, would you give up based on this? Probably not. It is a single study looking at one metric (citations), and maybe a larger sample will find different results. Perhaps it only works in certain fields or for certain kinds of work. Without more evidence, we can’t settle the question.

So we have a lot of promising ideas to try, but little evidence about what we should do. And it gets worse: it’s often difficult to measure the outcome we’re trying to achieve in the first place.

A carefully run randomized controlled trial of a particular drug has a clearly defined outcome, but with science reform the objectives, and thus the metrics used to measure success, vary widely. Some reforms may aim to improve the quality of life of scientists, others to improve the translation of basic research into commercially useful knowledge, and others to make research more accessible or robust by mandating open access and reproducible protocols. Even if these succeed in a narrow sense, it may be difficult to judge whether they have led to an increased stock of knowledge, let alone an improvement in social welfare.

This lack of clarity is widespread in the meta-science literature. There is little clear experimental data that would allow us to cleanly compare different policies, which leads to progressively more sophisticated econometric techniques to squeeze causal claims out of the data, and continued calls for more experimentation.

We may never be able to alleviate the intrinsic difficulties of figuring out the “best” science policy, but doing more actual experiments could at least get us closer to that objective.

Consider funding mechanisms as one possible area for reform where we lack solid evidence and so cannot make solid proposals. Now imagine two kinds of experiments that funding agencies could introduce.

First, funding agencies could randomly allocate scientists to different funding mechanisms that already exist. Given the availability of scientific databases, tracing the career of a given scientist, funded or not, would be easy. A few years into the experiment, the agency could examine each group. Maybe they would find that scientists are successful (or not) regardless of how they are funded. Or perhaps they would find that any applicant who gains their support, even if that support was randomly given, goes on to become a highly successful scientist. This would suggest that scientists’ career-long success owes more to the support that past successes generate, known as the Matthew effect, than to their actual ability or skills.
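To make the first kind of experiment concrete, here is a minimal sketch in Python of how an agency might randomize scientists across existing mechanisms and later check whether outcomes differ between two groups. Everything named here is an assumption for illustration: the mechanism labels, the outcome measure (any single number per scientist, such as a citation count), and the choice of a permutation test, which is only one reasonable way to compare groups.

```python
import random
from statistics import mean


def assign_mechanisms(scientist_ids, mechanisms, seed=0):
    """Randomly assign each scientist to one funding mechanism.

    Scientists are shuffled and then dealt out round-robin, so group
    sizes differ by at most one. The mechanism labels are hypothetical.
    """
    rng = random.Random(seed)
    shuffled = list(scientist_ids)
    rng.shuffle(shuffled)
    return {sid: mechanisms[i % len(mechanisms)] for i, sid in enumerate(shuffled)}


def permutation_p_value(group_a, group_b, n_permutations=10_000, seed=0):
    """Two-sided permutation test on the difference in mean outcome.

    Repeatedly reshuffles the pooled outcomes and counts how often a
    random split produces a gap at least as large as the observed one.
    """
    rng = random.Random(seed)
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            extreme += 1
    # The add-one correction keeps the estimate strictly positive.
    return (extreme + 1) / (n_permutations + 1)
```

A small p-value here would only say that the groups differ on the chosen metric. Whether that metric captures what we care about is a separate question.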

Second, they could introduce totally new kinds of funding mechanisms. In the meta-science literature, approaches are proposed ranging from funding lotteries—where chance, not merit, decides what project goes ahead—to highly selective programs that fund for substantially longer, enabling scientists to think of their careers with longer-term horizons. Each is the result of different background beliefs about the extent to which we can predict success in science. At one extreme, we can’t know anything about the future and we should fund at random. At the other, a small group of elite scientists is identified and funded: they are tasked with leading their fields and are given the resources and time to do so.

Advocates of lotteries make two key critiques: a) the current system forces researchers to spend a lot of time preparing grants; and b) peer reviewers cannot reliably identify “good” grant applications. They claim that a lottery system would reduce the time spent on review (because reviewers would mostly skim the proposals to check for minimal scientific robustness) as well as the time spent on preparing proposals (because there would be less of an incentive to meticulously craft proposals, given that no matter how detailed and well written they are, they are going to be chosen at random).

The downside is that good work would be less likely to be funded relative to the status quo if reviewers can actually identify good work. Proponents of lotteries argue that reviewers cannot reliably do this, but reviewers do seem to do a better job than chance, especially if one cares about funding the best (by citation count) work. This does not mean that the status quo is necessarily superior to lotteries, but it means there are legitimate reasons not to replace the current system with lotteries overnight. We need more evidence, and we need to do experiments to get it. At worst, we may find that actually peer reviewers were doing a very valuable job. At best, we save billions of dollars and, more importantly, scientists’ time for decades to come.
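The trade-off can be illustrated with a toy simulation. This is a sketch under strong assumptions, not a model of any real agency: each proposal is given a hypothetical "true quality" drawn from a standard normal distribution, reviewers see that quality plus random noise, and `noise_sd` (an invented parameter) controls how unreliable they are. The function compares the average true quality of proposals funded by ranking on review scores with the average for proposals funded by lottery.

```python
import random
from statistics import mean


def simulate_funding(n_proposals=1000, n_funded=100, noise_sd=1.0, seed=0):
    """Return (mean quality under peer review, mean quality under lottery).

    All quantities are simulated; 'quality' is an assumed latent value
    that no real funder can observe directly.
    """
    rng = random.Random(seed)
    quality = [rng.gauss(0, 1) for _ in range(n_proposals)]
    # Reviewers observe true quality plus independent noise.
    scores = [q + rng.gauss(0, noise_sd) for q in quality]

    # Peer review: fund the proposals with the highest review scores.
    ranked = sorted(range(n_proposals), key=lambda i: scores[i], reverse=True)
    review_funded = ranked[:n_funded]

    # Lottery: fund the same number of proposals uniformly at random.
    lottery_funded = rng.sample(range(n_proposals), n_funded)

    return (
        mean(quality[i] for i in review_funded),
        mean(quality[i] for i in lottery_funded),
    )
```

As `noise_sd` grows, the advantage of peer review in this toy model shrinks, which is the lottery advocates’ claim in miniature. The simulation cannot tell us how noisy real reviewers are. Only experiments can.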

There is one more argument in favor of trying more things out through this experimental approach: it will increase the diversity of funding mechanisms available at any given time. By most measures, the US innovation ecosystem is the world’s leading engine of technical and scientific progress. Part of this success may be due to the diversity of funding: rather than coordinating or planning the entire nation’s scientific investments centrally, the US has historically enabled a menagerie of entities to thrive, from philanthropies and privately run, federally funded research centers to university and industrial labs. This makes it easier for a given researcher to find a research home that suits her and her ideas. Diversity could be further pursued: a large agency like NIH, or one of its member institutes like the National Cancer Institute, could be split into two or more funding mechanisms internally, and their performance could be assessed every few years.

A possible argument against this experimental approach is that for an experiment to be useful, there has to be a clearly defined metric of success. How would we know if any particular reform is actually making things better?

Ideally, we’d like to measure the benefit provided by a study to society. We might ask: had this piece of research not been funded, would a given invention have been delayed? If so, for how long? And what was the impact of that on societal welfare? We could also try to estimate intermediate metrics of research usefulness, like whether a given basic research paper will end up being used as input to an application. These questions are hard to answer, which is an argument for humility, not epistemic nihilism.

But the difficulty is worth grappling with. In fact, it is one of the best arguments in favor of using lotteries as a major mechanism for allocating funding: even if we could see which piece of research is going to be successful (e.g., be highly cited), it is even harder to see if it will end up being useful. But while assessing the success of a specific scientist or proposal in the future is hard, it is easier to assess these mechanisms retrospectively. We can use a range of metrics to measure success, from citations (or variations thereof, like field-adjusted citations, or counting only highly cited work) to the number of funded researchers who went on to get highly prestigious awards. We could even simply have peers evaluate portfolios of work without knowing which funding mechanism supported them, and have them decide which portfolios were best. We could also survey funded scientists to find out what they thought about the way their work was being funded.
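As an illustration of one such metric, the sketch below computes a simple field-adjusted citation score: each paper’s citation count divided by the mean citation count of papers in the same field. This is only one possible normalization, assumed here for illustration (real bibliometric indicators often also adjust for publication year and document type), and the record layout with `id`, `field`, and `citations` keys is hypothetical.

```python
from collections import defaultdict
from statistics import mean


def field_adjusted_citations(papers):
    """Map each paper id to citations divided by its field's mean citations.

    `papers` is a list of dicts with hypothetical keys 'id', 'field',
    and 'citations'. A score of 1.0 means "average for the field".
    """
    by_field = defaultdict(list)
    for paper in papers:
        by_field[paper["field"]].append(paper["citations"])
    field_mean = {field: mean(counts) for field, counts in by_field.items()}

    adjusted = {}
    for paper in papers:
        baseline = field_mean[paper["field"]]
        # Guard against fields where no paper has been cited at all.
        adjusted[paper["id"]] = paper["citations"] / baseline if baseline else 0.0
    return adjusted
```

Under this normalization, a mathematics paper with 6 citations in a field averaging 4 scores the same as a biology paper with 150 citations in a field averaging 100, which is the point of adjusting by field.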

This does not mean we should wait decades before implementing any change. Waiting for strong, crystal-clear evidence to act would be engaging in the same flawed thinking that led to the claim that “there is no evidence that masks work” which we heard last year—the costs of delays or inaction would be high. Demands for open access, “red teaming” science, or running multiple-lab reproducibility studies (be it in the social or life sciences, or elsewhere) shouldn’t get stalled by the lack of RCTs. Where there are strong theoretical considerations, indirect evidence, and broad agreement that a proposal will improve science, without a serious cost if we’re wrong, we should just go ahead in at least some cases, and assess the benefits afterwards.

Lastly, there is the question of why we don’t see experimentation more often. If experimenting with funding policy is so great, how come governments don’t do it? Multiple reports agree on the need for this kind of approach to policy, but the problem is not particular to science funding: in general, governments tend to roll out policy in an all-or-nothing fashion, without incremental or randomized rollouts. At best, we tend to get quasi-experimental data when different cities, provinces, or states try different policies (albeit without experimentation in mind) and we can then compare like with like.

Statistician Adrian Barnett reports having talked to Australian funding agencies asking them about using a lottery to allocate their funding. The reply didn’t involve, as one might have expected, lack of belief in the effectiveness of lotteries. Rather, the answer he got was that “It would make it look like we [the agency] don’t know what we’re doing.” The agency’s fear of social or political judgement, to be sure, is not the only reason. Many scientists perceive lotteries as an intrusion from “well-meaning but scientifically inept bureaucrats” and think that “academic research will suffer as fruitful ideas are arbitrarily stalled” if lotteries are introduced.

These arguments do have merit. It’s not hard to imagine what it would feel like to be a researcher in that situation: knowing that, however good a job you think you are doing, your funding depends on chance rather than merit. Lottery advocates would argue that this is the situation right now, that there are already many brilliant scientists with great proposals who don’t get funding after having spent hundreds of hours working on them. Implementing a funding lottery would just make this problem explicit. But it would be only a first step. Lotteries should be part of a broader conversation: perhaps if universities paid full salary to professors rather than relying on grants for the bulk of their compensation, or if lottery-awarded funding ran for 15 years instead of the 4–5 more usual now, those concerns would be more effectively addressed.

Scientists seem to be open to a more limited, experimental rollout of funding lotteries, for example by using them only after reaching a particular threshold of quality as well as funding those proposals that are obviously groundbreaking. This might be driven by the “messy middle” model of science funding where some “obviously good” and “obviously bad” proposals are thought to exist and are readily identifiable, leaving a vast number of proposals in the middle that are decent and apt to be funded at random instead of requiring deliberation by a grant giver.
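A partial lottery of this kind is simple to state as a procedure. The sketch below is one possible reading of the “messy middle” model, with assumed details: review scores are plain numbers, `fund_above` and `reject_below` are hypothetical thresholds an agency would have to choose, and any grants left over after funding the obviously good proposals are drawn at random from the middle band.

```python
import random


def messy_middle_allocation(scores, n_grants, fund_above, reject_below, seed=0):
    """Fund clear winners outright, then draw the rest from the middle band.

    `scores` maps proposal id -> review score. Proposals scoring at or
    above `fund_above` are funded first (best first, up to `n_grants`);
    those below `reject_below` are never funded; the rest enter a lottery.
    """
    rng = random.Random(seed)
    ranked = sorted(scores, key=scores.get, reverse=True)

    # "Obviously good" proposals are funded without a lottery.
    obvious = [p for p in ranked if scores[p] >= fund_above][:n_grants]

    # The "messy middle": good enough to fund, not clearly outstanding.
    middle = [p for p in ranked if reject_below <= scores[p] < fund_above]

    remaining = n_grants - len(obvious)
    drawn = rng.sample(middle, min(remaining, len(middle)))
    return obvious + drawn
```

For example, with four grants, a funding threshold of 9.0, and a rejection threshold of 5.0, the two proposals scoring above 9.0 would be funded outright and the other two grants would be drawn from proposals scoring between 5.0 and 9.0.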

Adopting a more experimentally minded way of thinking would have another benefit: it would make other meta-science experiments more likely to occur as well. Substantial changes to the status quo based on unclear evidence can be controversial and are likely to cause division and protracted arguing, resulting in stasis. Running smaller trials, with the aim of verifying what works and what doesn’t, will make this kind of approach more likely to be permitted.

A few years ago, there was a debate around whether NIH should cap funding for individual researchers, on the grounds that there are decreasing marginal returns to concentrating funding on a single investigator. Opponents argued that such a policy would unfairly penalize successful investigators leading large labs that are doing highly impactful work. The proposal was ultimately scrapped. It’s not relevant whether such a proposal would have worked: both sides had reasonable arguments. What is important is that at no point did NIH think of randomizing or trialing this policy at a smaller scale—they designed it from the outset as a policy to affect the entirety of NIH’s budget.

That is the kind of thinking that we need to change. Instead, NIH should have considered selecting a subset of investigators at random, applying a cap to them, and then comparing their results a decade later with those of investigators who were left to accumulate funding in the traditional way.

Those interested in meta-science may disagree about what the best way to reform science is, but all of us can agree that we need more evidence about the proposals being made. We have many interesting, reasonable ideas ready to be tried. It is a glaring irony that the very same institutions that enable practitioners of the scientific method to do their work don’t apply that same method to themselves. It is time to change that.

The conversation around science is full of ideas for reform, but how do we know which ones will be effective? To find out what works, we need to apply the scientific method to science itself.
