# Chapter 5 The heart attack data set (I)

The heart attack data set (http://statland.org/R/RC/tables4R.htm), included in the ActivStats1 CD, contains all 12,844 cases of hospital discharges of the patients admitted for heart attack but did not have surgery in New York State in 1993. This information is essential for the interpretation of our results, as this is a purely observational study. It is not a random sample or controlled experiment.

The data set is formatted as a table (Table 5.1) with rows representing cases and columns representing characteristics, which is a typical format for many datasets. If you download and open the file in NotePad++ or Excel, you can see that the columns are separated by tabs, and there are missing values noted by “NA.” Four columns (DIAGNOSIS, SEX, DRG, DIED) contain nominal values, representing labels of categories. See an excellent explanation of data types here. DIAGNOSIS column contains codes defined in the International Classification of Diseases (IDC), 9th Edition. This is the code that your doctor sends to your insurance company for billing. The numbers, such as 41041, actually code for which part of the heart was affected. Although these are numbers, it does not make any sense to add or subtract or compute the mean. If I have a room of 30 students each with student ids, such as 732324, it does not make any sense to compute the average of these numbers. Such categorical data needs to be recognized as factors in R. Similarly, DRG column has three possible numbers, 121 for survivors with cardiovascular complications, 122 for survivors without complications, and 123 for patients who died. Moreover, DIED also codes for prognosis, it is 1 if the patient passed away, and 0 if survived.

sep = "\t",
colClasses = c("character", "factor", "factor", "factor",
"factor", "numeric", "numeric", "numeric"))
Table 5.1: First 15 rows of the heart attack dataset
Patient DIAGNOSIS SEX DRG DIED CHARGES LOS AGE
1 41041 F 122 0 4752.00 10 79
2 41041 F 122 0 3941.00 6 34
3 41091 F 122 0 3657.00 5 76
4 41081 F 122 0 1481.00 2 80
5 41091 M 122 0 1681.00 1 55
6 41091 M 121 0 6378.64 9 84
7 41091 F 121 0 10958.52 15 84
8 41091 F 121 0 16583.93 15 70
9 41041 M 121 0 4015.33 2 76
10 41041 F 123 1 1989.44 1 65
11 41041 F 121 0 7471.63 6 52
12 41091 M 121 0 3930.63 5 72
13 41091 F 122 0 NA 9 83
14 41091 F 122 0 4433.93 4 61
15 41041 M 122 0 3318.21 2 53

Take a look at this dataset in Excel, and consider these questions. What type of people are more likely to suffer from heart attacks? Which patient is more likely to survive a heart attack? Suppose you have a friend who was just admitted to hospital for heart attack. She is a 65 years old with DIAGNOSIS code of 41081. What is the odds that she survives without complication? Also, consider yourself as a CEO of an insurance company, and you want to know what types of patients incur more charges and whether a particular subgroup of people, such as men, or people over 70, should pay a higher premium.

To answer these questions, we need to do is:

• Import data files into R

• Exploratory data analysis (EDA)

• Statistical modeling (regression)

x <- heartatk4R  # Make a copy of the data for manipulation, call it x.
str(x)  # structure of data object, data types for each column
## 'data.frame':    12844 obs. of  8 variables:
##  $Patient : chr "1" "2" "3" "4" ... ##$ DIAGNOSIS: Factor w/ 9 levels "41001","41011",..: 5 5 9 8 9 9 9 9 5 5 ...
##  $SEX : Factor w/ 2 levels "F","M": 1 1 1 1 2 2 1 1 2 1 ... ##$ DRG      : Factor w/ 3 levels "121","122","123": 2 2 2 2 2 1 1 1 1 3 ...
##  $DIED : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 2 ... ##$ CHARGES  : num  4752 3941 3657 1481 1681 ...
##  $LOS : num 10 6 5 2 1 9 15 15 2 1 ... ##$ AGE      : num  79 34 76 80 55 84 84 70 76 65 ...

## 5.1 Begin your analysis by examining each column separately

If you are single and meet someone in a bar, you typically start with small talks and learn some basic information about him/her. We should do the same thing with data. But too often, we go right to the business of building models or testing hypothesis without exploring our data.

As a first step, we are going to examine each column separately. This can be very basic things such as mean, median, ranges, distributions, and normality. This is important because sometimes the data is so skewed or far from normal distribution that we need to use non-parametric tests, or transformate the raw data using log transformation, or more generally box-cox transformation, before conducting other analyses.

Exercise 5.1

Perform the following analysis for the heartatk4R dataset. If you forgot the R commands, refer to our previous learning materials. You may also find these commands faster by asking Dr. Google.

1. Graphical EDA: Plot distribution of charges using box plot, histogram, qqplot, lag plot, sequence plot. And interpret your results in PLAIN English. Note that there are missing values in this column that may cause some problems for some plots. You can remove missing values by defining a new variable by running temp = CHARGES [ ! is.na (CHARGES) ] and then run your plot on temp.
2. Quantitative EDA: test of normality, and confidence interval. Note that if the Shapiro-Wilk normality test cannot handle the 12,000 data points, you can either try to find other tests in the nortest library or sample randomly by running temp = sample( CHARGES, 4000)

SEX <- heartatk4R$SEX counts <- _____________(DIAGNOSIS, SEX) # Take SEX as columns barplot(_____________________________________________________) legend(“topleft,” legend = rownames(counts), fill = rainbow(9), ncol = 3, cex = 0.75) mosaicplot(_____________________________________________) ## 5.4 Associations between a categorical and a numeric variables? Do women stay longer in the hospital? Does the charges differ for people with different diagnosis (part of the heart affected)? We should know by now how to answer these questions with T-test, and more generally ANOVA, following our examples commands used in our analysis of the Iris flower data set. For data visualization, boxplot is the most straight forward way. But beyond boxplot, we can use the ggplot2 package for more detailed examination of distribution of variables in two or more groups. library(ggplot2) ggplot(heartatk4R, aes(x = AGE, group = SEX, y = c(..count..[..group.. == 1]/sum(..count..[..group.. == 1]), ..count..[..group.. == 2]/sum(..count..[..group.. == 2])) * 100)) + geom_histogram(binwidth = 6, colour = "black", fill = "lightblue") + facet_grid(SEX ~ .) + #split a graph according to panels defined by SEX labs(y = "Percent of Total") Now the two histograms are arranged, and it is very easy to see that women’s age are more skewed to the right, meaning women are considerably older than men. I am surprised at first by this huge difference, as the average age of women if bigger by 11. Further research shows that for women the symptoms of heart attacks are milder and often go unnoticed. We can further divide the population according to survival status by adding another factor: heartatk4R$GROUP <- paste(DIED, SEX, sep = "-")
ggplot(heartatk4R, aes(x = AGE, group = GROUP,
y = c(..count..[..group.. == 1]/sum(..count..[..group.. == 1]),
..count..[..group.. == 2]/sum(..count..[..group.. == 2]),
..count..[..group.. == 3]/sum(..count..[..group.. == 3]),
..count..[..group.. == 4]/sum(..count..[..group.. == 4])) * 100))+
geom_histogram(binwidth = 5, colour = "black", fill = "lightblue") +
facet_grid(DIED ~ SEX) +   #split a graph according to matrix of panels defined by DIED*SEX
labs(y = "Percent of Total")

We can see that patients who did not survive heart attack tend to be older, for both men and women. This is perhaps better illustrated with density plot:

ggplot(heartatk4R, aes(x = AGE, fill = SEX)) + geom_density(alpha = .3)

The result is similar to Figure 5.8, but as a density plot.

Now for each gender, we further divide the patients by their survival status. Instead of splitting into multiple panels, the curves are overlaid.

ggplot(heartatk4R, aes(x = AGE, fill = DIED)) + geom_density(alpha = .3) + facet_grid(SEX ~ .)

Exercise 5.5

Use the ggplot2 package to compare the distribution of lengths of stay among patients who survived and those who did not. Use both histogram and density plot. Interpret your results.

Exercise 5.6

Use the ggplot2 package to compare the distribution of lengths of stay among patients who survived and those who did not, but compare men and women separately (similar to Figure 5.11).

Exercise 5.7

Use student’s t-test, boxplot, histogram and density plots to compare the age distribution between survived and those who didn’t.

Exercise 5.8

Use ANOVA, boxplot, and histogram and density plots to compare the charges among people who have different DRG codes.

## 5.5 Associations between multiple columns?

We can use the ggplot2 package to investigate correlations among multiple columns by figures with multiple panels.

ggplot(heartatk4R, aes(x = DRG, y = AGE)) + geom_boxplot(color = "blue") + facet_grid(SEX ~ .)

Recall that 121 indicate survivors with complication, 122 survivors with no complication, and 123, died. As you could see this clearly indicate our previous observation people who died in hospital are older than survivors and that patients who developed complications seems to be older than those that did not. Did people with complications stayed longer in the hospital?

Exercise 5.9

Are the surviving women younger than the women who died? Similar question can be asked for men. Produce a figure that compares, in a gender-specific way, age distribution between patients who died in the hospital and those who survived.

Exercise 5.10

Use the ggplot2 package to produce boxplots to compare the length of stage of men vs. women for each of the DRG categories indicating complication status. You should produce a plot similar to Figure 5.13. Offer interpretation.
# scatterplot of LOS vs. AGE
ggplot(heartatk4R, aes(x = DRG, y = AGE)) + geom_point()
# scatterplot of LOS vs. AGE, divided by SEX
ggplot(heartatk4R, aes(x = DRG, y = AGE)) + geom_point() + facet_grid(SEX ~ .)
# scatterplot colored by DIED
ggplot(heartatk4R, aes(x = AGE, y = LOS, color = DIED)) + geom_point() + facet_grid(SEX ~ .)

Here we only show the figure for third code. Note that ggplot(heartatk4R, aes(x = DRG, y = AGE)) + geom_point() + facet_grid(SEX ~ .) generates multiple scatterplots of LOS ~ AGE according to different values of SEX, while color = DIED will add these two color-coded scatter plots into the same figure.

Figure 5.14 seems to suggest that the positive association between AGE and LOS is noticeable in patients who did not die in hospital, regardless of sex. This is a statistician’s language. Try this instead that could be understood by both the statistician and his/her grandmother. Older patients tend to stay longer in the hospital after surviving a heart attack. This is true for both men and women.

Another way to visualize complex correlation is bubble plot. Bubble plot is an extension of scatter plot. It uses an additional dimension of data to determine the size of the symbols. Interesting video using bubble plot: http://youtu.be/jbkSRLYSojo

y <- x[sample(1:12844, 200), ]   # randomly sample 200 patients
plot(y$AGE, y$LOS, cex = y$CHARGES / 6000, col = rainbow(2)[y$SEX], xlab = "AGE", ylab = "LOS")
legend("topleft", levels(y\$SEX), col = rainbow(2), pch = 1)

Figure 5.15 is a busy plot. Female patients are shown in red while males in blue. Size of the plot is proportional to charges. So on this plot we are visualizing 4 columns of data!

Other common methods we can use to detect complex correlations and structures include principal component analysis (PCA), Multidimensional scaling (MDS), hierarchical clustering etc.

All of these techniques we introduced so far enable us to LEARN about your dataset without any of priori hypothesis, ideas, and judgments. Many companies claim that they want to know their customers first as individuals and then do business. Same thing applies to data mining. You need to know you dataset as it is before making predictions, classifications etc.

You should also INTERACT with your data by asking questions based on domain knowledge and common sense. Generates lots and lots of plots to support or reject hypothesis you may have. I demonstrated this by using the heart attack dataset in the last few pages. You should do the same thing when you have a new dataset. Sometimes, the thing that you discovered is more important than the initial objectives.

1. Velleman, P. F.; Data Description Inc. ActivStats, 2000-2001 release.; A.W. Longman,: Glenview, IL, 2001.