How big data created the modern dairy cow

No matter where you are in the world, there’s a good chance the milk or cheese you’re buying is the product of the US dairy industry. Even if it didn’t come from American cattle, the cow that produced the milk could well have been inseminated by an American bull.

The United States has been the world’s largest supplier of cattle genetics since at least 1992. In 2022, the US exported $295 million in bovine semen, 47 percent of the world’s exports. Few countries even come close to this market share: the next biggest exporters are Canada at 14 percent and the Netherlands at seven.

America’s cows are now extraordinarily productive. In 2024, just 9.3 million cows will produce 226 billion pounds of milk (about 100 million tons) – enough milk to provide ten percent of 333 million insatiable Americans’ diets, and export for good measure.

And that’s despite the fact that none of the cattle breeds the US exports are indigenous to the country. The world’s most popular dairy cow breed, the Holstein-Friesian, hails from the border between the Netherlands and Germany; the Jersey and Guernsey dairy breeds both originate from islands in the English Channel.

In many low-income countries, livestock, including dairy cows, are critical for providing both nutrition and farming livelihoods. As a result, the US’s role in the global livestock genetics market lends it an outsize role not only in the genetic improvement of cattle but as an arbiter of rural development worldwide.

How did the US achieve this? In a word: data. This is the story of how the power of big data, combined with an ambitious public-private partnership between dairy farmers and the US Department of Agriculture, enabled the US to engineer the modern dairy cow and transform the dairy industry.

When looks aren’t everything

Like many things in the United States, the dairy industry was an import. Dutch settlers were importing their own breeds of cattle to the United States as early as 1621 and the first documented importation of a Holstein-Friesian cow was in 1852. Similarly, English immigrants introduced Jersey and Guernsey cattle to the US between 1840 and 1850.

Over time, crossbreeding made the lines between breeds increasingly unclear. However, many dairy farmers wanted to preserve the traits of the breeds they had brought over, so that the characteristics they liked would not change over time. For farmers at the time, the best way to do this was to make crosses only within the breed.

Enter the breed association. Breed associations set the breed standard: what a cow should look like to belong to that breed. As an example, a University of Illinois extension publication as late as 1942 reports that any color other than black and white would disqualify a cow from being considered Holstein-Friesian. These associations tracked the lineage of each cow in herd books: records of each cow’s parents and offspring, which were made freely available to members of the association. The herd book also functioned as a certification process, preventing unscrupulous breeders from inflating the pedigree of their cattle (as happened, for example, in 1789, when French breeders shipped cattle to the island of Jersey, then sold them on to England as ‘Jersey cattle’).

As the crossbreeding of cattle continued, breed associations began to emphasize the importance of purebred cattle, or crosses of the same breed. Cattle crossed between multiple breeds were considered inferior and referred to as scrub cows. For centuries, this was the standard of genetic improvement: purebreds over scrubs. To maintain the purity of the breed, farmers were encouraged to cross only within the cattle families that the breed association deemed acceptable.

To determine the best animals within a particular breed, the associations also needed a way to appraise their physical characteristics. Thus began the institution that is the cattle show, where experts appraise cattle relative to the breed association’s standard. Dairy cattle would be judged on their body width (important for birthing calves), size (important for a high volume of milk production), and even the shape of their udders (important to ensure the udders would stay off the ground as the cow aged so it could milk longer).

With the benefit of modern genetic science, we now know that this is not a particularly effective strategy for improving dairy cattle. Allowing breeding within only a few families can lead to inbreeding and the proliferation of dangerous recessive traits. Holstein breeders would learn this lesson in the 1990s when a prolific bull named Star, born in the 1960s, spread two diseases, complex vertebral malformation and bovine leukocyte adhesion deficiency, through recessive traits, leading to calf deaths as far as three generations later. Appraising physical traits is also, we now know, only a crude proxy for how profitable a cow might be. Bigger cattle, for example, may produce more milk, but they also eat more, and can therefore be less profitable than a small cow that produces a proportionately greater amount of milk while requiring less feed.

For these reasons, the geneticist Arend Hagedoorn in 1946 called animal breeding ‘remarkably speculative and economically wasteful’ compared to plant breeding, a field that was far more methodical and scientifically motivated. In plant breeding, scientists were much more attuned to the importance of hybrid vigor, or the genetic benefits of outbreeding. Arguably, one of the most important innovations of crop genetics was the development of hybrid corn, a cross of two inbred strains of corn whose adoption caused yields to grow substantially throughout the twentieth century.

But plant breeding had something that animal breeding did not: a way to collect data. With the establishment of the land grant universities in 1862, plant breeders in the US were given ample land to plant new crop varieties at experiment plots to determine their yield – information the state experiment stations would then communicate to farmers in the state. To have their own scientific revolution, animal scientists would need their own way of collecting data. Fortunately, such a method already existed, in another part of the world.

Dairy and the Danes

In 1890, Denmark was a dairy powerhouse with a more technologically advanced dairy sector than the United States. At this time, Denmark was growing its dairy herd rapidly; its milk yields were among the highest in Europe and it was the number one supplier of butter to Britain. Historians typically attribute Danish dairy success to two factors. First, the Danes had already begun adopting steam-powered centrifuges in creameries, which allowed the separation of the butterfat from the milk at a much larger scale. Second, Danish dairy farmers were connected to one another through cooperatives, of which there were 1,000 by 1900. Cooperatives, the argument goes, allowed them to share new techniques and technology with each other.

From 1880 to 1900, the US experienced the biggest influx of Danish immigration in any two-decade period. As Danish immigrants moved to the US, those from dairy farming backgrounds brought both their techniques and their ideas with them. One recent economics study found that US counties with more Danish immigrants developed more dairy farms and had more employees working in the industrialized dairy sector (in creameries or milk processing plants, for example) earlier than other counties, potentially due to the transmission of knowledge from Danish dairy farmers to their American counterparts.

It is no surprise, then, that the data collection system for the dairy sector that was eventually implemented in the US was also a Danish import. In 1906, a Danish immigrant named Helmer Rabild was working for the Michigan Department of Agriculture when he decided to organize the country’s first cow-testing cooperative. The purpose of the cooperative was to hire a technician to go to each dairy farm and test the butterfat percentage of each farmer’s cows to determine which ones produced milk most valued by dairy processors making cheese and butter. Such cooperatives were already widespread in Rabild’s home country: by the time he organized the first cooperative in Newaygo County, Michigan, Denmark already had more than 400.

The innovation of the cow-testing cooperative was itself a response to an earlier innovation: the butterfat test. In 1878, the Danish scientist Niels Johannes Fjord had developed a centrifugal cream tester to determine butterfat from milk samples, which allowed creameries to pay farmers for butterfat instead of the weight of the milk. This innovation did not appear to catch on in US creameries, however, due to its lack of accuracy for some types of milk and lower adoption of centrifugal cream separators.

In 1890, American agricultural chemist Stephen M. Babcock developed a new method to determine how much of a milk sample was butterfat versus water that was lower cost, more accurate, and could be done easily on the farm. The Babcock test works by adding sulfuric acid to milk in a long-necked bottle. The sulfuric acid dissolves everything but the butterfat, the proportion of which can be estimated by spinning the bottle in a centrifuge and measuring the volume of bubbles that rise to the top of the neck. This was a crucial innovation for US dairy processors, who had previously paid dairy farmers based on weight, which led to farmers watering down their milk. The adoption of the Babcock test gave dairy processors an objective measure of the quality of the milk they were buying. As one Wisconsin politician put it, the Babcock test ‘made more dairymen honest than the Bible ever had’.

From then on, dairy farmers would be paid based on their butterfat production instead of the weight of their milk. At the time, processors highly prioritized the butterfat content of the milk over its protein content – even though dairy farmers today are paid for both the fat and protein content of their milk – because the main priority was butter making, for which fat content matters most. By gathering data on cows’ individual butterfat production, the cow-testing cooperatives allowed dairy farmers to cull their lowest-producing cows and instead breed only their highest-producing animals. The information was a revelation for farmers, who had previously judged their cattle on looks alone. In a report to the USDA, Rabild noted that ‘an expression often heard among members of cow-testing associations during the first year is, “the cow I thought was my best cow was the poorest”’.

The USDA quickly caught on to the power of these cooperatives and employed Rabild to help form more of them across the country. From 1906 to 1920, 500 cooperatives sprang up throughout the United States, mostly in the upper Midwest and the Northeast, which already had large numbers of dairy farms. Participation remained modest, however, at just one to two percent of the US dairy herd before 1920.

Moneyballing milk

It was fairly straightforward to determine which cows produced the most milk (and, with the advent of the butterfat test, which produced milk with the highest butterfat content). But since such traits couldn’t be observed directly in a bull, it was more difficult to discern which bulls dairy farmers should breed with to increase butterfat production. Instead, they had to rely on observing the productivity of the bulls’ female offspring. In the absence of solid data, farmers largely had to determine a bull’s quality by asking other farmers whether a particular bull’s daughters appeared to be healthy and productive.

The cattle show mindset of the breed associations was another impediment to selecting the best bulls. Without actual performance data, breed associations could only rely on bulls’ physical traits and whether the bull was purebred or a scrub. To use a baseball analogy, dairy cow breeders were very much stuck in the mindset of baseball scouts before the popularization of baseball statistics: assessing physical traits believed to correlate to performance rather than performance itself.

Much as the field of sabermetrics brought a statistical revolution to baseball, the data collected in Rabild’s cooperatives transformed how dairy farmers assessed their bulls. Animal breeding’s answer to Bill James was Jay Lush, a scientist at Iowa State University often considered to be the founder of quantitative genetics. Lush advocated for breeding based on data rather than physical characteristics. Dairy bulls should be ranked not on their appearance, he argued, but on their daughters’ milk production. But in order to help farmers do this, the scientists would need a lot more data.

The cow-testing cooperatives, later renamed dairy herd improvement associations (DHIAs), provided the data dairy breeders needed to put Lush’s ideas into action. By 1935, there were over 800 DHIAs in the US, together collecting data on the milk and butterfat yields of more than 350,000 dairy cows nationwide. Scientists at the USDA saw an opportunity to use that data to realize Lush’s vision at a national scale.

The first metric they came up with was the daughter-dam comparison, which assessed the difference between the milk production of the mother (the dam) and that of the daughter. Scientists could then use this comparison to isolate the father’s contribution to the daughter’s milk production. Since the breed associations kept detailed records of cattle lineage, scientists could calculate the daughter-dam difference for any bull whose offspring had been tested at a dairy herd improvement association. This method of assessing a bull’s milk production ability became known as proving a bull, and a bull’s estimated production potential was known as a bull proof.
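To make the arithmetic concrete, here is a minimal sketch of a daughter-dam comparison in Python. It assumes the simplest possible version of the metric, the average difference between each daughter’s yield and her dam’s; the USDA’s actual calculations may have included adjustments not described here. The bull names and butterfat figures are invented for illustration.

```python
from collections import defaultdict
from statistics import mean


def daughter_dam_differences(records):
    """Return the average (daughter - dam) yield difference for each bull.

    `records` is a list of (bull, dam_yield, daughter_yield) tuples,
    one per tested daughter-dam pair. This is a simplified illustration,
    not the historical USDA formula.
    """
    diffs = defaultdict(list)
    for bull, dam_yield, daughter_yield in records:
        diffs[bull].append(daughter_yield - dam_yield)
    return {bull: mean(values) for bull, values in diffs.items()}


# Hypothetical annual butterfat yields in pounds (made-up numbers).
records = [
    ("Bull A", 300.0, 340.0),
    ("Bull A", 320.0, 350.0),
    ("Bull B", 310.0, 300.0),
    ("Bull B", 290.0, 295.0),
]

proofs = daughter_dam_differences(records)

# Rank bulls from the largest average improvement to the smallest.
for bull, diff in sorted(proofs.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{bull}: {diff:+.1f} lb butterfat")
```

In this toy data set, Bull A’s daughters out-produce their dams on average while Bull B’s do not, so a farmer reading the list would favor Bull A regardless of how either bull looked in the show ring.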

The USDA published its first public list of daughter-dam comparisons for dairy bulls in 1937. It included over 1,500 bulls, mostly Holstein and Jersey, with estimates of their potential milk yield, fat yield, and fat percentage. For the first time, farmers could choose bulls based on estimated productivity rather than physical appearance, introducing a new paradigm in the world of dairy cattle breeding: out with the purebred bull, in with the proved bull.

The USDA’s first proved sires list from 1937.

Source: USDA Misc Publication No. 277, June 1937, pg. 33

To keep the data flowing from the associations to the USDA, the department entered into an official agreement with the DHIAs and breed associations called the National Cooperative Dairy Herd Improvement Program (NCDHIP). The program stipulated that the USDA would have access to milk production data from the DHIAs and lineage data from the breed associations to run their calculations. In exchange, the USDA would publish their calculations in circulars that would be accessible to all dairy farmers, either through membership in the association or through their university’s extension program.

In many ways, this model was ahead of its time. Because testing every dairy bull in the country themselves was patently infeasible, USDA scientists had, by necessity, stumbled into a novel cooperative arrangement that made dairy farmers partners in the innovation process. It also centralized data from dairy farms across the country: by joining a DHIA, a dairy farmer was not only obtaining information about their own herd but also providing valuable data to dairy farmers all around the country. While understanding the exact role of DHIAs in the evolution of the dairy sector is an ongoing project, the percentage of cows enrolled in DHIA increased at around the same time that dairy farmers saw huge increases in milk yield.

The data explosion

To accurately estimate each bull’s productivity, USDA scientists would need as many data points as possible. They would also need to observe a bull’s offspring in multiple locations, in order to disentangle the impact of a bull’s genetics from other potential contributors like the management style of the dairy farm (what cattle were fed, for example, or how the farm managed health conditions) and the local climate.

But in the 1930s, biological limitations restricted the number of data points available to scientists. For one thing, a single bull could only father so many daughters in a herd. For another, a bull could only spread its genetics as far as it could be physically transported, and transporting bulls over long distances was expensive or otherwise infeasible. As a result, a bull’s offspring would rarely be found much farther than the farm where the bull was kept. As helpful as the USDA’s proved sire list was, dairy farmers could only choose bulls that were close to them. A particularly productive bull in, say, New York would be of no help to a dairy farmer in Wisconsin.

Population of dairy cows by county in the US in 1910.

Over the next two decades, however, the commercialization of artificial insemination and cryogenic preservation technologies would change that. When artificial insemination – collecting semen from a bull and later using it to inseminate a cow – became commercially viable for the dairy sector around 1937, the number of offspring a given bull could produce increased dramatically, from 12 or 13 female offspring in a bull’s lifetime to over 5,000. Using artificial insemination, a bull could thus be much more prolific within its geographic area.

Still, a bull’s semen could only be transported so far before becoming unusable. But with the commercialization of cryopreservation for genetic material in the 1950s, breeders could use freezing agents like liquid nitrogen to preserve a bull’s semen over long distances. Now, a dairy breeder with an especially productive bull could benefit not just their immediate vicinity but the entire country. The Wisconsin dairy farmer could now harness the superior genetics of a productive bull in New York.

These developments were a boon for scientists, as they made it possible to estimate each bull’s contribution to milk production more precisely. They also meant scientists could apply newer statistical approaches to the data. In 1948, the statistician CR Henderson – a student of the animal scientist Lanoy Hazel, who was himself a student of Jay Lush – developed the Henderson mixed model, a linear regression approach using random and fixed effects that became the workhorse model of dairy bull proving. The modern bull-proving model in use today was made possible by the efforts of USDA scientists and the plentiful data on bull offspring made available through the DHIAs.
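For readers who want to see the idea in miniature, the sketch below sets up and solves a small system of mixed model equations in Python, treating herds as fixed effects and sires as random effects. It is an illustration under strong assumptions, not the USDA’s procedure: the records are invented, sires are treated as unrelated to one another, and the variance ratio `lam` is set to 15, the value implied by an assumed heritability of 0.25 in a simple sire model. The function names are hypothetical.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def sire_proofs(records, n_herds, n_sires, lam):
    """Estimate herd (fixed) and sire (random) effects from daughter records.

    `records` holds (herd_index, sire_index, daughter_yield) tuples.
    `lam` is the ratio of residual variance to sire variance (assumed known).
    """
    size = n_herds + n_sires
    lhs = [[0.0] * size for _ in range(size)]
    rhs = [0.0] * size
    for herd, sire, y in records:
        h, s = herd, n_herds + sire
        lhs[h][h] += 1.0
        lhs[s][s] += 1.0
        lhs[h][s] += 1.0
        lhs[s][h] += 1.0
        rhs[h] += y
        rhs[s] += y
    # Adding lam to the sire diagonal shrinks sire estimates toward zero.
    for s in range(n_herds, size):
        lhs[s][s] += lam
    solution = solve(lhs, rhs)
    return solution[:n_herds], solution[n_herds:]


# Hypothetical butterfat yields (lb): three sires, one daughter each in two herds.
records = [
    (0, 0, 360.0), (0, 1, 330.0), (0, 2, 345.0),
    (1, 0, 420.0), (1, 1, 400.0), (1, 2, 410.0),
]
herd_effects, sire_effects = sire_proofs(records, n_herds=2, n_sires=3, lam=15.0)
for sire, effect in enumerate(sire_effects):
    print(f"sire {sire}: {effect:+.2f} lb")
```

Sire 0’s daughters beat their herd averages by 12.5 pounds on average, yet with only two daughters on record the model credits him with a far smaller effect. That caution fades as more daughters are observed in more herds, which is why the flood of data from artificial insemination mattered so much to the accuracy of bull proofs.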

A productivity revolution in dairy

The advent of cryopreservation coincided with some of the largest structural changes to the dairy industry in the twentieth century. First, the number of farms declined, and the size of their herds grew. In 1940, there were over four million farming operations with dairy cattle around the country; by 1982, there were only 277,000 operations, a 93 percent decrease. While in 1940 a farm would have an average of six dairy cows, the average herd size in 1980 was 32 cows per farm. Second, as dairy cows became more productive, fewer of them were needed to serve the US market. In 1945, there were over 25 million dairy cattle in the United States; by 1980, the number had dropped by more than half to just over ten million. And the dairy industry as a whole became more productive. Between 1940 and 1982, the total supply of milk increased by a third, even as the number of dairy cows in the US halved. Before 1950, the average cow’s milk yield grew by about 0.8 percent per year; after 1950, milk yield began to grow at a rate of about three percent a year.
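The gap between those two growth rates is larger than it may look, because growth compounds. As a rough illustration, the Python sketch below converts each annual rate into a doubling time. It assumes a constant growth rate within each period, which is a simplification, and the function name is chosen only for this example.

```python
import math


def doubling_time(annual_growth):
    """Years for a quantity to double at a constant annual growth rate."""
    return math.log(2) / math.log(1 + annual_growth)


# Growth rates quoted in the text: 0.8 percent and about three percent a year.
for label, rate in [("before 1950", 0.008), ("after 1950", 0.03)]:
    print(f"{label}: yield doubles in about {doubling_time(rate):.0f} years")
```

At 0.8 percent a year, doubling the average cow’s yield takes close to nine decades; at three percent, it takes under a quarter of a century.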

What role did genetic improvements play in these productivity changes? It’s hard to estimate precisely, because the diffusion of superior genetics also coincided with improvements in management and milking technology on US farms. For example, keeping cattle indoors (where they expend less energy, are not exposed to harsh weather, and can be fed a higher-protein diet instead of primarily eating grass and hay), as well as more efficient milking machines, more nutritious feed, and antibiotic treatments for common infections like mastitis also contributed to rising milk yields during this period.

The data available through the NCDHIP makes it possible to estimate any cow’s milk production based on their lineage. With each generation of dairy cattle, the average genetic potential – that is, each animal’s potential milk yield – goes up, because the best producers are bred and the lowest producers are not. So one way to estimate the contribution of improved cattle genetics is to look at how much the milk yields would have grown based on these estimates alone, absent other improvements. Taking this approach, the US Council on Dairy Cattle Breeding estimates that genetic improvement was responsible for about 50 percent of the growth in milk yield in the past two decades.

It wouldn’t be long before these advancements would benefit not just the US dairy market but also the rest of the world. From 2000 to 2020, the quantity and dollar value of US bovine semen exports more than doubled: adjusted for inflation, total exports increased from about $78 million in 2000 to $230 million in 2020, a roughly threefold increase. According to the National Association of Animal Breeders, more than four fifths of the semen exports from association members are from dairy bulls. Since 1992, the US has been the world’s top exporter of dairy bull semen.

Looming challenges for the dairy cattle genetics market

The revolution in cattle genetics that began with the cow-testing associations of the early 1900s, and grew more sophisticated with the formation of the NCDHIP in the 1930s and the advent of artificial insemination and cryogenics in the 1950s, has continued in the decades since.

One especially important innovation was genomic testing. Since 2008, instead of predicting a bull’s production potential based on its daughters’ milk yields, it’s possible to estimate productivity at the genomic level, relying on observed correlations between individual genes and performance. Breeders now only need a sample of the bull’s DNA to estimate its milk production.

Once genomic bulls became available on the market, the predicted rate of genetic improvement increased significantly. Before 2010, the first year genomic bulls were on the market, the potential genetic milk yield grew 2.53 percent a year. After 2010, it grew 4.26 percent a year.

Despite this growth in the genetic potential for milk yield, the rate of increase in the national milk yield did not actually change between 2010 and 2020. It’s unclear why the benefits of genomic testing have not yet affected the national trends; it may be due to lack of adoption of new genetics, or other factors such as climate pushing yields down as genetic improvement pulls them up.

Yet the dairy genetics industry faces at least two major challenges. The first is a challenge entirely of its own making: inbreeding. Because dairy breeders rely on only the most productive bulls, and some farmers continue to follow the cattle show mindset of breeding from famous lines, inbreeding has been on the rise. Worryingly, inbreeding appears to have significantly increased since genomic testing became available. In other words, the very technology that has delivered such significant productivity gains may also be one of the most potent drivers of an eventual decrease in productivity.

Inbreeding is detrimental to the industry because it depresses productivity growth – appropriately, this is known as inbreeding depression – and increases the likelihood of dangerous recessive traits infiltrating the genetic pool. Thus, in pursuit of short-term gain, the industry has decreased the resilience of the genetic pool, with potentially dire long-term consequences. Dealing with this problem may require a coordinated industry-wide strategy aimed at incentivizing breeders to embrace hybrid vigor and abandon the purebred mindset. For example, geneticist John Cole has suggested changing how the industry calculates a bull’s genetic traits in order to better inform farmers about how inbreeding will affect their bottom line.

The second challenge is the same one that plagues a host of other industries: climate change. Cattle emit significant amounts of methane, which, while not as long-lasting as carbon dioxide, is even more potent in terms of its warming potential.

The good news is that addressing cattle emissions can have a significant impact on climate change, and dairy cattle breeders will have a role to play in this process. Currently, cattle contribute between 11 and 19 percent of greenhouse gas emissions worldwide. Research is underway to uncover genetic traits that drive greenhouse gas emissions, which would then enable dairy farmers to breed cattle that emit less methane. Similarly, breeding for smaller cattle, which require less feed for the amount of milk they produce, can be a meaningful way to reduce the amount of emissions produced per cow. However, both of these strategies only work under some conditions. Scientists may identify genetic traits linked to lower methane emissions, but dairy farmers will need to find it worthwhile to select for them. Cows may become more efficient with feed, but this only affects total emissions if farmers similarly find it profitable to use these cows.

As dairy cattle become more productive, it may also be possible to reduce the overall herd size without harming the global milk supply, as happened from 1950 to 1980. However, since 1980, the number of cows in the US has barely changed due to the ever-growing global demand for dairy products. Between 2010 and 2019, per capita consumption of dairy products grew nine percent and it is projected to grow more over the coming decade. This is one of the greatest trade-offs the dairy sector will have to face. Dairy products are a relatively effective way to deliver calories and nutrients to people who lack reliable access to food; as long as the world continues to demand dairy products, we’re unlikely to see cow populations drop anytime soon.

Democratizing cattle genetics

The evolution and refinement of dairy cattle genetics in the US was a sea change in the history of agriculture. Helmer Rabild’s cow-testing associations formed the bedrock for the data-driven approach to animal breeding pioneered by scientists like Jay Lush and made possible by the USDA’s cooperation with dairy herd improvement associations around the country. This, in turn, enabled dramatic increases in US dairy industry productivity, and allowed the US to become a global powerhouse for dairy cattle genetics – despite having no native dairy cattle populations.

In a sense, the data revolution in US dairy cattle breeding has enabled a genetic meritocracy. Dairy breeders can now use productivity data to select cattle rather than the nebulous and unscientific designation of purebred or scrub. And recent improvements in genomic testing have made it even easier to identify new and productive genes, including, potentially, genes that could help combat the industry’s impact on the climate.

Nevertheless, a variation of the purebred mindset still permeates the dairy industry. As we’ve seen, genomics not only failed to decrease inbreeding but in fact appears to have increased it. As it stands now, the US dairy cattle population is the most inbred it’s ever been. Given the US’s position as the dominant exporter of cattle genetics, this is a global issue.

But while genomics may have exacerbated the problem, it could also be the solution. By harnessing the genomic associations between specific genes and productive traits, the industry could begin to identify bulls from other breeds that have historically been less used in modern dairy production. Making crosses between breeds like Jersey and Holstein-Friesian and other, less used breeds could provide the genetic diversity the industry needs without sacrificing production. This has the potential to further democratize the breeding process and should lead to less inbreeding over time.

As the past 20 years have shown, however, inbreeding will not be solved by technology alone. Instead, US dairy breeders may have to learn from the dairy herd improvement associations and the USDA and embrace cooperation. More than a century ago, dairy was a sector where farmers regularly adulterated their product to get paid more. After Rabild brought DHIAs to the US, dairy farmers were collaborating to improve milk quality and provide valuable data for the industry. In this new era, left to act alone, the industry may pursue short-sighted gains without considering looming challenges like inbreeding and climate change. Working together, the industry can learn new ways to promote genetic diversity and reduce the climate impact of its cattle.
