Predicting Heart Disease Mortality – Towards Data Science

Building a machine learning model that can identify high-risk states in 2019.

According to the Centers for Disease Control and Prevention (CDC), “About 610,000 people die of heart disease in the United States every year–that’s 1 in every 4 deaths.” It is unlikely that anyone reading this hasn’t been affected by this disease in some way. I lost a family member to it earlier this year, at just 57 years old. The causes are well documented and understood, yet heart disease remains the leading cause of death in the United States. Can changes in public policy help save lives?

By building a machine learning model to predict heart disease mortality rates across states, we should be able to identify states that have effectively reduced those rates. If we can, the hope is that the policies behind those reductions can be adapted for other states.

Historical Data

The CDC publishes annual data on heart disease mortality and other leading causes of death going back to 1999. By tracking the number of deaths relative to the population of each state, we can see that mortality rates have been trending downward over the last 20 years.

As we can see, the trend seems to level out in 2011, likely due to an aging population.
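The "deaths relative to population" calculation can be sketched as below. This is a minimal illustration, not the article's actual code: the state name and every number are made-up placeholders rather than CDC figures, and scaling to deaths per 100,000 residents is an assumption.

```python
def mortality_rate(deaths, population, per=100_000):
    """Crude mortality rate: deaths per `per` residents (no age adjustment)."""
    if population <= 0:
        raise ValueError("population must be positive")
    return deaths / population * per


# Hypothetical records: (state, year, deaths, population). Placeholder numbers only.
records = [
    ("State A", 1999, 12_000, 4_000_000),
    ("State A", 2009, 10_500, 4_200_000),
    ("State A", 2016, 10_400, 4_300_000),
]

# One rate per (state, year), which makes year-over-year trends comparable
# even when a state's population changes.
rates = {(state, year): mortality_rate(d, p) for state, year, d, p in records}

for (state, year), rate in sorted(rates.items()):
    print(f"{state} {year}: {rate:.1f} deaths per 100,000")
```

Dividing by population matters here because raw death counts would mostly track how large a state is, not how risky it is.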

Leading Causes of Heart Disease

After conducting some general research, four primary factors were identified that have a substantial impact on heart disease:

By running a simple linear regression of the target on each predictor variable, one at a time, we can start to get a sense of which factors have the most significant impact.
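A one-predictor-at-a-time regression of this kind might look like the following sketch. The variable names and all values are hypothetical toy data chosen only to show the mechanics; they are not the article's data or results.

```python
def simple_linear_regression(x, y):
    """Fit y = intercept + slope * x by ordinary least squares.

    Returns (slope, intercept, r_squared).
    """
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    if sxx == 0 or syy == 0:
        raise ValueError("x and y must each vary")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = sxy**2 / (sxx * syy)
    return slope, intercept, r_squared


# Hypothetical toy data: one value per state. Placeholder numbers only.
target = [180.0, 200.0, 220.0, 240.0, 260.0]  # mortality rate
predictors = {
    "smoking_rate": [15.0, 17.0, 20.0, 22.0, 26.0],
    "pct_male": [50.5, 50.1, 49.6, 49.2, 48.8],
}

results = {name: simple_linear_regression(values, target)
           for name, values in predictors.items()}

# Rank predictors by how much variance each explains on its own.
for name, (slope, intercept, r2) in sorted(
        results.items(), key=lambda item: item[1][2], reverse=True):
    print(f"{name}: slope={slope:.2f}, r_squared={r2:.3f}")
```

The sign of each slope gives the direction of the relationship, and the r-squared gives a rough ranking of predictors. The numbers printed here come from the toy data above and say nothing about the real analysis.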

The regression results suggest that population demographics have the largest impact on heart disease, with the 75–79 age bracket being the leading predictor. Perhaps more surprising is that as the percentage of males in the population increases, heart disease mortality rates decrease. Does this mean that males are less susceptible to heart disease? Not necessarily, but it does seem to suggest that women are more likely to die from it. The CDC cites awareness as a primary factor in this regard:

“Despite increases in awareness over the past decades, only about half (56%) of women recognize that heart disease is their number 1 killer.”

What’s more, women tend to experience a broader array of symptoms, which may cause them not to recognize they are having heart attacks.

Another surprising result was that alcohol consumption was not as strongly correlated with heart disease mortality as expected. In fact, an initial review of the data appears to suggest that increased wine consumption is correlated with lower mortality rates. The caveat is that wine consumption was also found to be negatively correlated with smoking rates. Since smoking has a strong, positive correlation with heart disease, it stands to reason that the lower smoking rates, not the wine, account for the lower mortality rates.
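One way to probe the wine and smoking question is a partial correlation, which measures how two variables relate after the linear effect of a third is removed. This is an illustrative technique, not necessarily what the original analysis used. The data below is hypothetical and deliberately constructed so that mortality depends on smoking alone.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def partial_correlation(x, y, z):
    """Correlation of x and y after removing the linear effect of z."""
    r_xy, r_xz, r_yz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (r_xy - r_xz * r_yz) / sqrt((1 - r_xz**2) * (1 - r_yz**2))


# Hypothetical per-state values (placeholder numbers only), built so that
# mortality is driven by smoking, while wine merely moves opposite to smoking.
smoking = [14, 16, 18, 20, 22, 24, 26, 28]
wine = [27, 23, 21, 21, 19, 15, 13, 13]
mortality = [172, 182, 188, 198, 212, 222, 228, 238]

raw = pearson(wine, mortality)
controlled = partial_correlation(wine, mortality, smoking)

print(f"wine vs. mortality: {raw:.2f}")
print(f"wine vs. mortality, controlling for smoking: {controlled:.2f}")
```

In this constructed example the raw correlation between wine and mortality is strongly negative, yet it vanishes once smoking is held fixed. That is the pattern a confounder produces. Real data would rarely be this clean.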

Defining the Variables

Naturally, predictions about heart disease need to be based on prior data with substantial lead time. In other words, if we want to predict results for 2019, the prediction needs to be based on data from 2018 and prior. For purposes of this analysis, 3- to 5-year time lags were used. For example, when predicting for 2016, we use data from 2011–2013.
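The lag window can be expressed in a few lines. The sketch below assumes one value per variable per year; the helper names (`lagged_years`, `build_features`) and the smoking figures are hypothetical and used only for illustration.

```python
def lagged_years(target_year, min_lag=3, max_lag=5):
    """Return the feature years for a target year, oldest first."""
    if not 0 < min_lag <= max_lag:
        raise ValueError("expected 0 < min_lag <= max_lag")
    return list(range(target_year - max_lag, target_year - min_lag + 1))


def build_features(series_by_year, target_year):
    """Collect lagged values into one row keyed by lag, e.g. 'smoking_lag5'.

    `series_by_year` maps a variable name to {year: value}.
    A missing year raises KeyError rather than being silently filled.
    """
    row = {}
    for name, by_year in series_by_year.items():
        for year in lagged_years(target_year):
            row[f"{name}_lag{target_year - year}"] = by_year[year]
    return row


# Hypothetical smoking rates (percent) for one state. Placeholder values only.
smoking = {2011: 24.0, 2012: 23.0, 2013: 22.5, 2014: 21.0, 2015: 20.0, 2016: 19.5}

print(lagged_years(2016))
print(build_features({"smoking": smoking}, 2016))
print(build_features({"smoking": smoking}, 2019))
```

Keying the features by lag rather than by calendar year means a model trained on earlier target years can be applied to 2019, where the same lags point at 2014–2016.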

Risk levels are based on historical heart disease rates in aggregate. If a predicted rate is in the lowest third of all observed rates, we mark it as low-risk. Conversely, if it is in the highest third, it is high-risk. Rates in between are medium-risk.
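A tercile-based labeling rule along these lines could be sketched as follows. The observed rates are hypothetical placeholders. How ties at a cut point are handled, and the exact quantile method (the default of Python's `statistics.quantiles`), are assumptions rather than details from the analysis.

```python
from statistics import quantiles


def tercile_cutoffs(observed_rates):
    """Return the two cut points that split the observed rates into thirds."""
    low_cut, high_cut = quantiles(observed_rates, n=3)
    return low_cut, high_cut


def risk_level(predicted_rate, low_cut, high_cut):
    """Label a predicted rate relative to the historical cut points."""
    if predicted_rate <= low_cut:
        return "low-risk"
    if predicted_rate >= high_cut:
        return "high-risk"
    return "medium-risk"


# Hypothetical historical rates (deaths per 100,000). Placeholder values only.
observed = [150, 160, 170, 180, 190, 200, 210, 220, 230]
low_cut, high_cut = tercile_cutoffs(observed)

for rate in (155, 195, 225):
    print(rate, risk_level(rate, low_cut, high_cut))
```

Computing the cut points once from all historical rates, rather than per year, keeps the meaning of "high-risk" fixed over time, so a state can change category only by changing its rate.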

Building a Machine Learning Model

With the information in place, we can now start building a model. In total, 1,413 variants of six model architectures were trained on an 80% subset of the data using 5-fold cross-validation. Applying each model to the held-out testing data (the remaining 20%), we can measure the accuracy and compare the effectiveness of the different architectures.
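As a rough sketch of this workflow for a single architecture, the snippet below uses scikit-learn to run a small grid search over support vector machine settings with 5-fold cross-validation on an 80% training split, then scores the best variant on the remaining 20%. The synthetic data, the tiny parameter grid, the feature scaling step, the stratified split, and the weighted F1 averaging are all assumptions made for illustration. The actual analysis covered 1,413 variants across six architectures.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the real features and the three risk classes.
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=5, n_redundant=0,
    n_classes=3, n_clusters_per_class=1, random_state=0,
)

# Hold out 20% for testing; cross-validation only ever sees the other 80%.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0,
)

# Each combination in the grid is one model "variant".
pipeline = make_pipeline(StandardScaler(), SVC())
param_grid = {"svc__C": [0.1, 1, 10], "svc__kernel": ["linear", "rbf"]}

search = GridSearchCV(pipeline, param_grid, cv=5, scoring="f1_weighted")
search.fit(X_train, y_train)

# Score the best variant once on the untouched testing data.
test_f1 = f1_score(y_test, search.predict(X_test), average="weighted")
print(search.best_params_, round(test_f1, 3))
```

Because the data here is synthetic, the printed score is not comparable to any result reported in this article.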

Here, we can see that the support vector machine model achieved an F1 score of 91.7%. The map below shows the 2019 predictions, with nine states identified as high-risk.

Most striking is the clustering of high-risk states in the middle of the country, though it could not be determined whether or not this is coincidental.

It is worth noting that the model predicted six states would have a change in risk level from 2016 to 2019, as illustrated by the infographic below.

As we can see, Oklahoma, Michigan, and Pennsylvania are no longer predicted to be high-risk states. However, Tennessee moves in the opposite direction from medium-risk to high-risk. But there is one state of particular interest.


In 2016, Oklahoma’s observed heart disease mortality rate put it in the high-risk category, yet our model predicts it to be low-risk in 2019. When I first saw this, I thought it was a mistake. How could a state so drastically reduce its risk level in such a short time frame? As it turns out, there is strong evidence to support this prediction.

On closer inspection, it became apparent that the reason for this change is that Oklahoma’s smoking rate dropped from 25.5% in 2010 to an all-time low of around 19% in 2015.

The state’s secretary of Health and Human Services, Terry Cline, attributes this decline to smoking bans on state property. Combined with the Certified Healthy Oklahoma program, which incentivizes commercial properties to do the same, the results come into focus.


The findings of this research point to two simple strategies for states that want to reduce heart disease mortality.

  • Follow Oklahoma’s lead and implement smoking bans in public spaces.
  • Run heart disease awareness campaigns that target women and retirees.

Awareness campaigns in cities are likely to be most effective because women tend to make up a higher percentage of the population there. Though there is much more that can be done, my hope is that this analysis provides some ideas for initial steps forward.


This piece was adapted from a more technical post I wrote on this topic several months ago. Please refer to it if you’d like more specifics on the code and strategy.

This analysis was performed using Jupyter Notebooks and relevant libraries. The repository containing all code and data can be found on GitHub.
