# Chapter 2 Visualizing the iris flower data set

Learning objectives:

- Basic concepts of R graphics
- Install and load R packages (ggplot2 and pheatmap)
- Basic plots with ggplot2
- Different ways to visualize the iris flower dataset

## 2.1 Basic concepts of R graphics

We start with base R graphics. The first import distinction should be made about high- and low-level graphics functions in base R. See this table below. High-level graphics functions initiate new plots, to which new elements could be added using the low-level functions.

Function | Memo |
---|---|

plot | Generic plotting function |

boxplot | boxplot. Can be applied to multiple columns of a matrix, or use equations boxplot( y ~ x) |

hist | histogram |

qqnorm | Quantile-quantile (Q-Q) plot to check for normality |

curve | Graph an arithmetic function |

barplot | Barplot |

mosaicplot | Using mosaics to represent the frequencies of tabulated counts. |

heatmap | Using colors to visualize a matrix of numeric values. |

In contrast, low-level graphics functions do not wipe out the existing plot; they add elements to it.

Function | Memo |
---|---|

points | Add points to a plot |

lines | Add lines to a plot |

abline | Add a straight line |

segments | Add line segments |

text | Add text to a plot |

legend | Add a legend to a plot |

arrows | Add arrows to a plot |

Sometimes we generate many graphics quickly for exploratory data analysis (EDA)
to get some sense of what the data looks like. We can achieve this by using
plotting functions with **default settings** to quickly generate a lot of
“plain” plots. R is a very powerful EDA tool. However, you have to know what
types of graphs are suitable for the data.

Occasionally we want to generate really “cool”-looking graphics for papers and presentations. Making such plots typically requires a bit more coding, as you have to customize different parameters. For me, it usually involves some Google searches of example codes, and then I revise it via trial-and-error. If I cannot make it work, I read the help document.

One of the open secrets of R programming is that you can start from a plain
figure and keep trying to **refine it step by step** while Googling. Let’s take
an example using the base R graphics. This produces a basic scatter plot with
the petal length on the x-axis and petal width on the y-axis. Since iris is a
data frame, we will use the `iris$Petal.Length`

to refer to the Petal.Length
column. This avoids using the *attach* function.

```
<- iris$Petal.Length
PL <- iris$Petal.Width
PW plot(PL, PW)
```

Let’s change the marker type.

`plot(PL, PW, pch = 2) # change to triangle`

The **pch** parameter can take values from 0 to 25, each corresponding to different types of marker. See a list here.

There are many other parameters to the **plot** function in R. You can get these
from the documentation

` ? plot`

We can also change the color of the data points easily with the **col** parameter.

`plot(PL, PW, pch = 2, col = "green") # change to green`

Next, we want to use different markers according to species. First, lets get the information.

```
<- as.numeric(iris$Species)
speciesID speciesID
```

```
## [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [38] 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [75] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3 3
## [112] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [149] 3 3
```

This is possible because Species is a **factor** with 3 predefined levels and the individual elements take 1, 2 or 3. Now we can change the symbols based on species with this:

`plot(PL, PW, pch = speciesID, col = "green") `

What happens here is that the 150 integers stored in the types vector are used to alter symbols. The first 50 data points (setosa) are represented by open circles. The next 50 (versicolor) are represented by triangles, while the last 50 in crosses.

Follow the same thinking, we also change the color according to species. We generate 3 colors with rainbow() function and then alternate using the species.

`plot(PL, PW, pch = speciesID, col = rainbow(3)[speciesID])`

This figure starts to looks nice, as the three species are easily separated by color and shape. But we are not done. Let us change the x- and y-labels, and also add a main title.

```
plot(PL, PW, #x and y
pch = speciesID, #marker type
col = rainbow(3)[speciesID], # color
xlab = "Petal length (cm)", # x label
ylab = "Petal width (cm)", # y label
main = "Petal width vs. length") # main title
```

Note that this command spans many lines. Very long lines make it hard to read. The benefit of multiple lines is that we can clearly see each line contain a parameter. When you are typing in the Console window, R knows that you are not done and will be waiting for the second parenthesis.

It is essential to write your code so that it could be easily understood by others (or your future self). See breif and detailed style guides. This is like checking the dressing code before going to an event.

A true perfectionist never settles. We notice a strong linear correlation between petal length and width. Let’s add a trend line using **abline**, a low level graphics function.

```
plot(PL, PW,
pch = speciesID,
col = rainbow(3)[speciesID],
xlab = "Petal length (cm)",
ylab = "Petal width (cm)",
main = "Petal width vs. length")
abline( lm(PW ~ PL) ) # the order is reversed as we need y ~ x.
```

The `lm(PW ~ PL)`

generates a linear model (lm) of petal width as a function petal
length. y ~ x is **formula notation** that used in many different situations.
This linear regression model is used to plot the trend line.

We calculate the Pearson’s correlation coefficient and mark it to the plot.

```
<- cor(PW, PL) # Pearson's correlation coefficient
PCC <- round(PCC, 2) # round to the 2nd place after decimal point.
PCC PCC
```

`## [1] 0.96`

The **text** function is used to place strings created with the *paste* function at the location of coordinate of (x=5, y=0.5). To finish up this plot, we add a legend. Here is how we have everything together.

```
<- iris$Petal.Length
PL <- iris$Petal.Width
PW plot(PL, PW,
pch = speciesID, # marker type
col = rainbow(3)[speciesID], # color
xlab = "Petal length (cm)",
ylab = "Petal width (cm)",
main = "Petal width vs. length")
abline( lm(PW ~ PL) ) # regression line
text(5, 0.5, paste("R=", PCC) ) # add text annotation
legend("topleft",
levels(iris$Species),
pch = 1:3,
col = rainbow(3))
```

While **plot** is a high-level graphics function that starts a new plot, **abline**, **text**, and **legend** are all low-level functions that add to existing plot. This chunk of code is impressive! You should be proud of yourself if you are able to generate this plot. We know that we build it up step-by-step with some trial-and-error.

Another open secret of coding is that we frequently steal each other’s ideas and
code. The textbook *R Graphics Cookbook* includes all kinds of R plots and
example code. Some are online: http://www.cookbook-r.com/Graphs/. There are
also websites lists all sorts of R graphics and example codes that you can use.
http://www.r-graph-gallery.com/ contains more than 200 such examples. Another
one is here: http://bxhorn.com/r-graphics-gallery/. If you know what types of
graphs you want, it is very easy to start with the template code and swap out
the dataset.

## 2.2 Scatter plots matrix

While data frames can have a mixture of numbers and characters in different
columns, a **matrix often only contains numbers**. Let’s extract the first 4
columns from the data frame iris and convert to a matrix:

```
<- as.matrix(iris[, 1:4]) # convert to matrix
ma colMeans(ma) # column means for matrix
```

```
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## 5.843333 3.057333 3.758000 1.199333
```

`colSums(ma)`

```
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## 876.5 458.6 563.7 179.9
```

The same thing can be done with rows via **rowMeans(x)** and **rowSums(x)**.

We can generate a matrix of scatter plots simply by:

`pairs(ma) `

`pairs(ma, col = rainbow(3)[iris$Species]) # Figure 2.4`

Look at this large plot for a moment. What do you see? Provide interpretation of these scatter plots.

Exercise 2.1

## 2.3 Star and segment diagrams

**Star plot uses stars to visualize multidimensional data.** Radar chart is a useful way to display multivariate observations with an arbitrary number of variables. Each observation is represented as a star-shaped figure with one ray for each variable. For a given observation, the length of each ray is made proportional to the size of that variable. The star plot was first used by Georg von Mayr in 1877!

```
<- iris[, 1:4]
df stars(df) # do I see any diamonds?
stars(df, key.loc = c(17,0)) # What does this tell you?
```

Based on the star plot, what is your overall impression regarding the differences among these 3 species of flowers?

Exercise 2.2

The stars() function can also be used to generate segment diagrams, where each variable is used to generate colorful segments. The sizes of the segments are proportional to the measurements.

`stars(df, key.loc = c(20,0.5), draw.segments = TRUE ) `

Produce the segments diagram of the state data (state.x77) and offer some interpretation regarding South Dakota compared with other states. Hints: Convert the matrix to data frame using df.state.x77 <- as.data.frame(state.x77), then attach df.state.x77.

Exercise 2.3

## 2.4 The ggplot2 package is intuitive and powerful

In addition to the graphics functions in base R, there are many other packages we can use to create graphics. The most widely used are lattice and ggplot2, both implemented in additional packages. Together with base R graphics, sometimes these are referred to as the three independent paradigms of R graphics. The lattice package extends base R graphics and enables the creating of graphs in multiple facets. The ggplot2 is developed based on a Grammar of Graphics (hence the “gg”), a modular approach that builds complex graphics by adding layers. Intuitive yet powerful, ggplot2 is becoming increasingly popular.

The ggplot2 functions is not included in the base distribution of R. These are available as an additional **package**, on the CRAN website.
They need to be downloaded and installed. One of the main advantages of R is that it
is open, and users can contribute their code as packages. If you are using
**Rstudio**, you can choose **Tools->Install packages** from the main menu, and
then enter the name of the package. If you are using R software, you can install
additional packages, by clicking **Packages** in the main menu, and select a
mirror site. These mirror sites all work the same, but some may be faster. After
choosing a mirror and clicking “OK,” you can scroll down the long list to find
your package. Alternatively, you can type this command to install packages.

ggplot2 is a modular, intuitive system for plotting, as we use different functions to refine different aspects of a chart step-by-step:

**Aesthetic mapping**from variables to visual characteristics**Geometric shapes**(points, lines, boxes, bars, histograms, maps, etc)**Scales**and**statistical transformations**(log, reverse, count, etc)**coordinate systems**

**facets**( multiple plots)**Labels**and**annotations**

Detailed tutorials on ggplot2 can be find here and by its author. Here we present a quick demo.

```
# Install the package. You can also do it through the Packages Tab
install.packages ("ggplot2")
```

Packages only need to be installed once. But every time you need to use the **functions or data** in a package, you have to **load** it from your hard drive into memory.

`library(ggplot2) # load the ggplot2 package`

```
# define a canvas
ggplot(data = iris)
```

```
# map data to x and y coordinates
ggplot(data = iris ) +
aes(x = Petal.Length, y = Petal.Width)
```

```
# add data points
ggplot(data = iris) +
aes(x = Petal.Length, y = Petal.Width) +
geom_point()
```

```
# change color & marker type
ggplot(data = iris) +
aes(x = Petal.Length, y = Petal.Width) +
geom_point( aes(color = Species, shape = Species) )
```

```
# adds trend line
ggplot(data = iris) +
aes(x = Petal.Length, y = Petal.Width) +
geom_point( aes(color = Species, shape = Species) ) +
geom_smooth(method =lm)
```

```
# adds annotation text
ggplot(data = iris) +
aes(x = Petal.Length, y = Petal.Width) +
geom_point( aes(color = Species, shape = Species) ) +
geom_smooth(method =lm) +
annotate("text", x = 5, y = 0.5, label = "R=0.96")
```

Some ggplot2 commands span multiple lines. The first line defines the plotting space. The ending “+” signifies that another layer ( data points) of plotting is added. Also, the ggplot2 package handles a lot of the details for us. We can easily generate many different types of plots. It is also much easier to generate a plot like Figure 2.1.

```
ggplot(iris) +
aes(x = Petal.Length, y = Petal.Width) + # define space
geom_point( aes(color = Species, shape = Species) ) + # add points
geom_smooth(method =lm) + # add trend line
annotate("text", x = 5, y = 0.5, label = "R=0.96") + # annotate with text
xlab("Petal length (cm)") + # x-axis labels
ylab("Petal width (cm)") + # y-axis labels
ggtitle( "Correlation between petal length and width") # title
```

`## `geom_smooth()` using formula 'y ~ x'`

As you can see, data visualization using ggplot2 is similar to painting: we first find a blank canvas, paint background, sketch outlines, and then add details.

Create a scatter plot of sepal length vs sepal width, change colors and shapes with species, and add trend line.

Exercise 2.4

Use boxplot, multiple panel histograms, and density plots to investigate whether petal width is the same among three subspecies.

Exercise 2.5

## 2.5 Other types of plots with ggplot2

Box plots can be generated by ggplot.

```
library(ggplot2)
ggplot(data = iris) +
aes(x = Species, y = Sepal.Length, color = Species) +
geom_boxplot()
```

Here we use Species, a categorical variable, as x-coordinate. Essentially, we use it to define three groups of data. The y-axis is the sepal length, whose distribution we are interested in. The benefit of using ggplot2 is evident as we can easily refine it.

```
ggplot(data = iris) +
aes(x = Species, y = Sepal.Length, color = Species) +
geom_boxplot() +
geom_jitter(position = position_jitter(0.2))
```

On top of the boxplot, we add another layer representing the raw data points for each of the species. Since lining up data points on a straight line is hard to see, we jittered the relative x-position within each subspecies randomly. We also color-coded three species simply by adding “color = Species.” Many of the low-level graphics details are handled for us by ggplot2 as the legend is generated automatically.

```
ggplot(data = iris) +
aes(x = Petal.Length, fill = Species) +
geom_density( alpha = 0.3)
```

```
ggplot(data = iris) +
aes(x = Petal.Length, fill = Species) +
geom_density( alpha = 0.3) +
facet_wrap(~ Species, nrow = 3)
```

The bar plot with error bar in (fig:22-1) we generated above is called dynamite plots for its similarity. **“The dynamite plots must die!”**, argued renowned statistician Rafael Irizarry in his blog. Dynamite plots give very little information; the mean and standard errors just could be printed out. The outliers and overall distribution is hidden. Many scientists have chosen to use this boxplot with jittered points.

## 2.6 Hierarchical clustering and heat map

Hierarchical clustering summarizes observations into trees representing the overall similarities.

```
<- as.matrix(iris[, 1:4]) # convert to matrix
ma <- dist(ma)
disMatarix plot( hclust(disMatarix) )
```

We first calculate a distance matrix using the **dist** function with the default Euclidean distance method. The distance matrix is then used by the **hclust** function to generate a hierarchical clustering tree with the default complete linkage method, which is then plotted in a nested command.

The 150 samples of flowers are organized in this cluster dendrogram based on their Euclidean distance, which is labeled vertically by the bar to the left side. Highly similar flowers are grouped together in smaller branches, and their distances can be found according to the vertical position of the branching point. We are often more interested in looking at the overall structure of the dendrogram. For example, we see two big clusters.

First, each of the flower samples is treated as a cluster. The algorithm joins
the two most similar clusters based on a distance function. This is performed
iteratively until there is just a single cluster containing all 150 flowers. At
each iteration, the distances between clusters are recalculated according to one
of the methods—*Single linkage, complete linkage, average linkage, and so on*.
In the single-linkage method, the distance between two clusters is defined by
the smallest distance among the all possible object pairs. This approach puts
‘friends of friends’ into a cluster. On the contrary, the complete linkage
method defines the distance as the largest distance between object pairs. It
finds similar clusters. Between these two extremes, there are many options in
between. The linkage method I found the most robust is the average linkage
method, which uses the average of all distances. However, the default seems to
be the complete linkage. Thus we need to change that in our final version.

**Heat maps** with hierarchical clustering are my favorite way of visualizing data matrices. The rows and columns are reorganized based on hierarchical clustering, and the values in the matrix are coded by colors. Heat maps can directly visualize millions of numbers in one plot. The hierarchical trees also show the similarity among rows and columns.

```
heatmap(ma,
scale = "column",
RowSideColors = rainbow(3)[iris$Species] )
```

Scaling is handled by the **scale** function, which subtracts the mean from each
column and then divides by the standard division. Afterward, all the columns
have the same mean of approximately 0 and standard deviation of 1. This is also
called standardization. The default color scheme codes bigger numbers in yellow
and smaller numbers in red. The color bar on the left codes for different
species. But we still miss a legend and many other things can be polished.
Instead of going down the rabbit hole of adjusting dozens of parameters to
heatmap function (and it’s improved version heatmap.2 in the ggplots package), We
will refine this plot using another R package called pheatmap.

`install.packages("pheatmap")`

```
library(pheatmap)
<- as.matrix(iris[, 1:4]) # convert to matrix
ma row.names(ma) <- row.names(iris) # assign row names in the matrix
pheatmap(ma,
scale = "column",
clustering_method = "average", # average linkage
annotation_row = iris[, 5, drop=FALSE], # the 5th column as color bar
show_rownames = FALSE )
```

First, we convert the first 4 columns of the iris data frame into a matrix. Then
the row names are assigned to be the same, namely, “1” to “150.” This is
required because row names are used to match with the column annotation
information, specified by the annotation_row parameter. Even though we only
need the 5th column, i.e., Species, this has to be a data frame. To prevent R
from automatically converting a one-column data frame into a vector, we used
*drop = FALSE* option. Multiple columns can be contained in the column
annotation data frame to display multiple color bars. The rows could be
annotated the same way.

More information about the pheatmap function can be obtained by reading the help document. But most of the times, I rely on the online tutorials. Beyond the official documents prepared by the author, there are many documents created by R users across the world. The R user community is uniquely open and supportive. I was researching heatmap.2, a more refined version of heatmap part of the gplots package and landed on Dave Tang’s blog, which mentioned that there is a more user-friendly package called pheatmap described in his other blog. To figure out the code chuck above, I tried several times and also used Kamil Slowikowski’s blog.

We can gain many insights from Figure 2.13. The 150 flowers in the rows are organized into different clusters. *I. Setosa* samples obviously formed a unique cluster, characterized by smaller (blue) petal length, petal width, and sepal length. The other two subspecies are not clearly separated but we can notice that some *I. Virginica* samples form a small subcluster showing bigger petals. The columns are also organized into dendrograms, which clearly suggest that petal length and petal width are highly correlated.

## 2.7 Principal component analysis (PCA)

This section can be skipped, as it contains more statistics than R programming. To visualize high-dimensional data, we use PCA to map data to l

PCA is a linear dimension-reduction method. As illustrated in Figure 2.14, it tries to define a new set of orthogonal coordinates to represent the data such that the new coordinates can be ranked by the amount of variation or information it captures in the dataset. After running PCA, you get many pieces of information:

- How the new coordinates are defined,
- The percentage of variances captured by each of the new coordinates,
- A representation of all the data points onto the new coordinates.

Here’s an example of running PCA on the first 4 columns of the iris data. Note that “scale = TRUE” in the following command means that the data is normalized before conduction PCA so that each variable has unit variance.

```
= prcomp(iris[, 1:4], scale = TRUE)
pca # Have a look at the results. pca
```

```
## Standard deviations (1, .., p=4):
## [1] 1.7083611 0.9560494 0.3830886 0.1439265
##
## Rotation (n x k) = (4 x 4):
## PC1 PC2 PC3 PC4
## Sepal.Length 0.5210659 -0.37741762 0.7195664 0.2612863
## Sepal.Width -0.2693474 -0.92329566 -0.2443818 -0.1235096
## Petal.Length 0.5804131 -0.02449161 -0.1421264 -0.8014492
## Petal.Width 0.5648565 -0.06694199 -0.6342727 0.5235971
```

The first principal component is positively correlated with Sepal length, petal length, and petal width. Recall that these three variables are highly correlated. Sepal width is the variable that is almost the same across three species with small standard deviation. PC2 is mostly determined by sepal width, less so by sepal length.

`plot(pca) # plot the amount of variance each principal components captures.`

`str(pca) # this shows the structure of the object, listing all parts. `

```
## List of 5
## $ sdev : num [1:4] 1.708 0.956 0.383 0.144
## $ rotation: num [1:4, 1:4] 0.521 -0.269 0.58 0.565 -0.377 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:4] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width"
## .. ..$ : chr [1:4] "PC1" "PC2" "PC3" "PC4"
## $ center : Named num [1:4] 5.84 3.06 3.76 1.2
## ..- attr(*, "names")= chr [1:4] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width"
## $ scale : Named num [1:4] 0.828 0.436 1.765 0.762
## ..- attr(*, "names")= chr [1:4] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width"
## $ x : num [1:150, 1:4] -2.26 -2.07 -2.36 -2.29 -2.38 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : NULL
## .. ..$ : chr [1:4] "PC1" "PC2" "PC3" "PC4"
## - attr(*, "class")= chr "prcomp"
```

`head(pca$x) # the new coordinate values for each of the 150 samples`

```
## PC1 PC2 PC3 PC4
## [1,] -2.257141 -0.4784238 0.12727962 0.024087508
## [2,] -2.074013 0.6718827 0.23382552 0.102662845
## [3,] -2.356335 0.3407664 -0.04405390 0.028282305
## [4,] -2.291707 0.5953999 -0.09098530 -0.065735340
## [5,] -2.381863 -0.6446757 -0.01568565 -0.035802870
## [6,] -2.068701 -1.4842053 -0.02687825 0.006586116
```

To plot the PCA results, we first construct a data frame with all information, as required by ggplot2.

```
<- as.data.frame(pca$x[, 1:2]) # extract first two columns and convert to data frame
pcaData <- cbind(pcaData, iris$Species) # bind by columns
pcaData colnames(pcaData) <- c("PC1", "PC2", "Species") # change column names
library(ggplot2)
ggplot(pcaData) +
aes(PC1, PC2, color = Species, shape = Species) + # define plot area
geom_point(size = 2) # adding data points
```

Now we have a basic plot. We are adding elements one by one using the “+” sign at the end of the first line. We will add details to this plot.

```
<- round(100 * summary(pca)$importance[2, 1:2], 0) # compute % variances
percentVar ggplot(pcaData, aes(PC1, PC2, color = Species, shape = Species)) + # starting ggplot2
geom_point(size = 2) + # add data points
xlab(paste0("PC1: ", percentVar[1], "% variance")) + # x label
ylab(paste0("PC2: ", percentVar[2], "% variance")) + # y label
ggtitle("Principal component analysis (PCA)") + # title
theme(aspect.ratio = 1) # width and height ratio
```

The result (Figure 2.15) is a projection of the 4-dimensional iris flowering data on 2-dimensional space using the first two principal components. We can see that the first principal component alone is useful in distinguishing the three species. We could use simple rules like this: If PC1 < -1, then Iris setosa. If PC1 > 1.5 then Iris virginica. If -1 < PC1 < 1, then Iris versicolor.

Create PCA plot of the state.x77 data set (convert matrix to data frame). Use the state.region information to color code the states. Interpret your results. Hint: do not forget normalization using the scale option.

Exercise 2.6

## 2.8 Classification by predicting the odds of binary outcomes

This section can be skipped, as it contains more statistics than R programming.

It is easy to distinguish *I. setosa* from the other two species, just based on
petal length alone. Here we focus on building a predictive model that can
predict between *I. versicolor* and *I. virginica*. For this we use the logistic
regression to model the odds ratio of being *I. virginica* as a function of all
of the 4 measurements:

\[ln(odds)=ln(\frac{p}{1-p}) =a×Sepal.Length + b×Sepal.Width + c×Petal.Length + d×Petal.Width+c+e.\]

```
<- iris[51:150, ] # removes the first 50 samples, which represent I. setosa
df <- droplevels(df) # removes setosa, an empty levels of species.
df <- glm(Species ~ . , # Model: Species as a function of other variables
model family = binomial(link = 'logit'),
data = df)
summary(model)
```

```
##
## Call:
## glm(formula = Species ~ ., family = binomial(link = "logit"),
## data = df)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.01105 -0.00541 -0.00001 0.00677 1.78065
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -42.638 25.707 -1.659 0.0972 .
## Sepal.Length -2.465 2.394 -1.030 0.3032
## Sepal.Width -6.681 4.480 -1.491 0.1359
## Petal.Length 9.429 4.737 1.991 0.0465 *
## Petal.Width 18.286 9.743 1.877 0.0605 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 138.629 on 99 degrees of freedom
## Residual deviance: 11.899 on 95 degrees of freedom
## AIC: 21.899
##
## Number of Fisher Scoring iterations: 10
```

Sepal length and width are not useful in distinguishing versicolor from
*virginica*. The most significant (P=0.0465) factor is Petal.Length. One unit
increase in petal length will increase the log-odds of being *virginica* by
9.429. A marginally significant effect is found for Petal.Width.

If you do not fully understand the mathematics behind linear regression or
logistic regression, do not worry about it too much. In this class, I
just want to show you how to do these analyses in R and interpret the results. I
do not understand how computers work. Yet I **use** it every day.

Exercise 2.7So far, we used a variety of techniques to investigate the iris flower dataset. Recall that in the very beginning, I asked you to eyeball the data and answer two questions:

Review all the analysis we did, examine the raw data, and answer the above questions. Write a paragraph and provide evidence of your thinking. Do more analysis if needed.

- What distinguishes these three species?
- If we have a flower with sepals of 6.5cm long and 3.0cm wide, petals of 6.2cm long, and 2.2cm wide, which species does it most likely belong to?

References: 1 Beckerman, A. (2017). Getting started with r second edition. New York, NY, Oxford University Press.