Preface

Aimed at total beginners, this book is based on the philosophy that people learn faster when they are shown examples and case studies. Instead of explaining the rules, the book centers on the analysis of several datasets from the very beginning, which makes it an alternative to traditional, more rigorous textbooks on R programming. We start with small, clean datasets and gradually transition to big, messy ones. With each dataset, we hope to tell a story through the analysis. We invite you, our courageous reader, to take this journey with us. Motivated readers, such as biologists, can work their way through this book and learn on their own. I encourage you to type in the example code, look at the output, and then work on the challenges and exercises.

This book started as material for 2-hour hands-on workshops intended to give a quick introduction and demonstration to students and researchers who are totally new to R. The workshop has been given many times to audiences ranging from high-school students to mathematicians. For a 2-hour session, I have to keep it gentle, interactive, and fun, sometimes at the expense of rigor. Instead of explaining all the rules, grammar, and syntax, I found it easier to focus on one dataset and walk the audience through some of the analyses possible with R. This material later evolved into a one-credit online class and then a three-credit class. We stick with the unconventional approach of focusing on datasets and examples.

Another feature of this book is that we review the statistical concepts involved. R is a language for statistical computing and thus cannot be detached from that context.

Coding and cooking.

If this is your first time coding, think of it as writing a recipe. Your goal is to provide clear, step-by-step instructions to help a 10-year-old turn raw ingredients (data) into delicious pasta (results). A good recipe can be used again to make the same pasta from the same ingredients, just as a computer program can process any data that meets the same specifications. That is why millions of people share their code on repositories like GitHub.

Free, powerful, and welcoming, the R programming environment is a wonderful kitchen. It is interactive and easy for beginners to learn. In this kitchen, you can find simple tools (a knife) and complex appliances (a stove); in R, we call them functions. They were created by others (sometimes painstakingly over many years) and are ready to be used to process data. We need to learn the commonly used R functions, just like we need to know how to use a knife. Each kitchen tool has its instruction manual, but people rarely read them. The same goes for R functions, as most people learn from example code provided by others on sites like StackOverflow.

Suppose you want to make a veggie smoothie, but the recipe requires a fancy blender not available in your home kitchen. You can go to a marketplace such as Amazon.com to buy one. Similarly, with the R programming language, we can download additional R packages from The Comprehensive R Archive Network (CRAN), a FREE marketplace where tens of thousands of people contribute. The open and collaborative user community is uniquely productive. People build on each other's work, providing increasingly complex functionality through simple interfaces. There are R packages that can help you create complex charts, write a book (like this one), host a website, or even find a girlfriend (just kidding). Imagine a free kitchen appliance that can turn uncooked chicken, vegetables, oil, and spices into delicious Kung Pao Chicken! That is precisely how I feel every time I use other people’s R packages to analyze genomic data.
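As a minimal sketch of what "downloading a package" looks like in practice, the lines below check whether a package is installed and load it if so. The package name ggplot2 is only an illustrative choice on our part; the install line is commented out because it needs an internet connection.

```r
# Install once from CRAN (uncomment to run; requires internet access)
# install.packages("ggplot2")

# Check whether the package is available on this computer
has_pkg <- requireNamespace("ggplot2", quietly = TRUE)

if (has_pkg) {
  # Load the package so its functions can be used in this session
  library(ggplot2)
  message("ggplot2 is ready to use")
} else {
  message("ggplot2 is not installed yet; run install.packages(\"ggplot2\")")
}
```

Installing is like buying the blender: you do it once. Loading with library() is like taking it out of the cupboard: you do it every time you start cooking.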

In your kitchen, you also find jars, dishes, salt dispensers, pots, and so on; we use suitable containers for different ingredients or foods. Even though some containers are only needed to store intermediate products, it is essential to know what kinds of containers there are before starting to cook. In programming, we likewise have several pre-defined data types. A scalar variable can hold one number, some text (a string), or just a true/false indicator (a logical value). A vector holds a sequence of scalars of the same kind. With rows and columns like an Excel spreadsheet, a data frame can be thought of as multiple vectors of the same length. In computer programming, we need to learn these data structures. Common R data structures include scalars, vectors, matrices, data frames, and lists.
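To make these containers concrete, here is a small sketch that creates one of each. All the variable names and values are made up for illustration. Note that R has no separate scalar type: a scalar is simply a vector of length one.

```r
# Scalars: in R these are vectors of length one
height <- 1.75          # a number
name <- "Ada"           # text (a string)
is_student <- TRUE      # a logical value

# A vector holds several values of the same kind
scores <- c(90, 85, 77)

# A matrix arranges values of one kind in rows and columns
m <- matrix(1:6, nrow = 2)

# A data frame is a set of equal-length vectors, one per column
students <- data.frame(
  name = c("Ada", "Ben", "Cy"),
  score = scores
)

# A list can mix objects of different kinds and lengths
record <- list(name = name, scores = scores, enrolled = is_student)

class(height)      # "numeric"
length(scores)     # 3
dim(m)             # 2 3
str(students)      # compact overview of the data frame
```

Functions such as class(), length(), dim(), and str() are the quickest way to find out what kind of container you are holding.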

When I first started cooking, I always hated it when people or recipes said something like “a little olive oil,” like this recipe in the picture above. Without any experience, I had no idea whether that meant one drop, a teaspoon, or a cup of olive oil! The 10-year-old we want to write a recipe for might not even know how small “small pieces” are or even what “boiling water” looks like. Computers are stupid machines that run calculations faithfully and quickly. They have no common sense whatsoever, unlike the “computers” in history, who were people who could calculate, either mentally or with mechanical calculators. When programming, we need to (1) provide clear, specific commands at each step and (2) define the correct sequence of operations, considering exceptional scenarios such as data that are zero or missing.
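The following sketch shows what "considering exceptional scenarios" can look like. The distance and time values are invented for illustration, with one missing value (NA) and one zero placed there on purpose.

```r
# Hypothetical measurements: one time is missing (NA) and one is zero
distance <- c(10, 20, 30, 40)
time <- c(2, NA, 0, 8)

# Without specific instructions, R cannot average over a missing value
mean(time)                 # NA

# Tell R explicitly to skip missing values
mean(time, na.rm = TRUE)   # (2 + 0 + 8) / 3

# Dividing by zero does not stop the program; it quietly gives Inf
speed <- distance / time   # 5 NA Inf 5

# Spell out what to do in the exceptional cases
safe_speed <- ifelse(is.na(time) | time == 0, NA_real_, distance / time)
safe_speed                 # 5 NA NA 5
```

R did exactly what it was told in every line. It is up to us to decide, and to state, what should happen when a value is missing or zero.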

Just like writing a recipe, the process of programming can be frustrating. Patience and trial and error are the only solution. Suppose you ask your 6-year-old sister to help peel the carrots. Before moving on to the chopping step, you need to look at these carrots to see if they are peeled properly. One of the main things you can do in debugging is to stop and look at the intermediate products. The previous steps might not have been carried out correctly, even though you think your instructions are clear and correct. Sometimes we have typos in the code or forget to pass on the right inputs. We can print out the data and take a look. If the data is large, we examine the first few rows or even just the number of rows and columns. As you execute your code, the intermediate data objects are stored in the computer's memory. Coding is the process of creating and modifying data objects in memory, step by step.
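Here is a short sketch of "stopping to look at the carrots." It uses iris, a small example dataset that ships with R; the object names flowers and setosa are our own choices for this illustration.

```r
# iris is a small example dataset that ships with R
flowers <- iris

head(flowers)     # print the first six rows
dim(flowers)      # 150 5  (rows, columns)
str(flowers)      # type of each column

# An intermediate step: keep only one species
setosa <- flowers[flowers$Species == "setosa", ]

# Stop and look at the intermediate product before moving on
nrow(setosa)                   # 50
summary(setosa$Sepal.Length)   # quick numeric summary of one column

# List the data objects currently stored in memory
ls()
```

If nrow(setosa) had returned 0, we would know right away that something went wrong in the filtering step, perhaps a typo in the species name, before building anything else on top of it.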

Many students have contributed to this material. Notably, Quazi Irfan, who worked as a teaching assistant, fixed many errors and gave constructive feedback. In the fall of 2018, a group of highly motivated students in the STAT 442 Exploratory Data Analysis class worked on some of the datasets presented here. They are Samuel Ivanecky, Kory Heier, Audrey Bunge, Jacie McDonald, Shae Olson, Nathan Thirsten, and Alex Wieseler. Some of the plots in this book were inspired by their work.

Any comments and suggestions to make this book better are welcome, including reports of typos, errors, and organizational issues. The best place to reach out is the GitHub issues page. If you would rather not create yet another account, you can email us.