Preface

Aimed at total beginners, this book is based on the philosophy that people learn faster from examples. Instead of explaining the rules, the book centers on analyzing several datasets from the very beginning, making it an alternative to traditional, more rigorous textbooks on R programming. We start with small, clean datasets and gradually transition to big, messy ones. With each dataset, we hope to tell a story through the analysis. We invite you, our courageous reader, to take this journey with us. Motivated readers, such as biologists, can work through this book quickly and learn by themselves. I encourage you to type in the example code and look at the output, then work on the challenges and exercises.

This book started as material for 2-hour hands-on workshops intended to give a quick introduction and demonstration to students and researchers new to R. The workshop has been given many times to audiences ranging from high-school students to mathematicians. For a 2-hour session, I have to keep it gentle, interactive, and fun, sometimes at the expense of rigor. Instead of explaining all the rules, grammar, and syntax, I found it easier to focus on one dataset and walk the audience through some of the analyses possible with R. This material later evolved into a one-credit online class and then a three-credit course. We stick with the unconventional approach of focusing on datasets and examples.

Another feature of this book is that we review the statistical concepts involved. R is a language for statistical computing, and thus cannot be detached from its statistical context.

Coding and cooking.

If this is your first time coding, consider it a process of writing a recipe. Your goal is to provide clear, step-by-step instructions to help a 10-year-old turn raw ingredients (data) into delicious pasta (results). A good recipe can be used again to make pasta from the same ingredients, just as a computer program can be reused to process any data that follows the same specifications. Millions of people share their code on repositories like GitHub.

Free, powerful, and welcoming, the R programming environment is a wonderful kitchen. It is interactive and easy for beginners to learn. In this kitchen, you can find many simple tools (a knife) and complex appliances (a stove); in R, we call them functions, created by others (sometimes painstakingly over many years) and ready to be used to process data. We need to learn the commonly used R functions, just like we need to know how to use a knife. Each kitchen tool has its own instruction manual, but people rarely read them. The same goes for R functions, as most people learn from example code provided by others on sites like StackOverflow.
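As a small illustration of using ready-made functions, the sketch below applies a few built-in R functions to a made-up vector of numbers. The variable names and values are invented for this example; `mean()`, `sd()`, and `max()` ship with R.

```r
# A small numeric vector of made-up measurements (illustrative data only)
heights <- c(150, 160, 170, 180)

# Built-in functions are ready-made tools: pass data in, get a result back
avg <- mean(heights)      # arithmetic mean
spread <- sd(heights)     # sample standard deviation
tallest <- max(heights)   # largest value

print(avg)
print(spread)
print(tallest)

# The "instruction manual" of a function is one command away.
# Uncomment either line in an interactive session to open the help page:
# ?mean
# help("sd")
```

Reading the help page is optional; trying a function on a tiny vector like this is often the fastest way to see what it does.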

If you want to make a veggie smoothie but the recipe requires a fancy blender, you can go to a marketplace such as Amazon to buy one. Similarly, with the R programming language, we can download additional R packages from The Comprehensive R Archive Network (CRAN), a FREE marketplace where tens of thousands of people contribute. The open and collaborative user community is uniquely productive. People build on top of each other's work, providing increasingly complex functionality with simple interfaces. There are R packages that can help you create complex charts, write a book (like this one), host a website, or even find a girlfriend (just kidding). Imagine a free appliance that can turn uncooked chicken, vegetables, oil, and spices into delicious Kung Pao Chicken! That is precisely how I feel every time I use other people’s R packages to analyze genomic data.
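Getting a package from CRAN usually takes two steps: install it once, then load it in each session. The sketch below uses ggplot2 purely as an example package (an assumption of this sketch, not something the text above prescribes), and the install line is commented out so the code runs without an internet connection.

```r
# Packages are installed once per computer, then loaded in each session.
# install.packages("ggplot2")   # downloads from CRAN; uncomment to run

# Check whether the package is already installed before trying to load it
has_ggplot2 <- requireNamespace("ggplot2", quietly = TRUE)

if (has_ggplot2) {
  library(ggplot2)   # makes the package's functions available
  message("ggplot2 is ready to use")
} else {
  message("ggplot2 is not installed yet; run install.packages(\"ggplot2\")")
}
```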

In your kitchen, you also find jars, dishes, salt dispensers, pots, and so on; we use suitable containers for different ingredients or foods. Even though some containers are only needed to store intermediate products, it is essential to know what kinds of containers there are before starting to cook. In programming, we have different pre-defined data types. A scalar variable can hold one number, some text (a string), or just a true/false indicator (a logical value). A vector contains a sequence of scalars of the same kind. With rows and columns like an Excel spreadsheet, a data frame can be considered multiple vectors of the same length. In computer programming, we need to learn these data structures. Common R data structures include scalars, vectors, matrices, data frames, and lists.
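The containers described above can be created in a few lines. This is a minimal sketch with invented names and values; note that R technically stores a scalar as a vector of length one.

```r
# Scalars: a single number, a string, or a logical value
age <- 10
name <- "Ada"
likes_pasta <- TRUE

# Vector: a sequence of values of the same kind
weights <- c(1.5, 2.0, 3.2)

# Matrix: values of one kind arranged in rows and columns
m <- matrix(1:6, nrow = 2, ncol = 3)

# Data frame: several vectors of equal length, one per column
recipes <- data.frame(
  dish = c("pasta", "smoothie", "salad"),
  minutes = c(20, 5, 10)
)

# List: a container that can mix items of different types and lengths
pantry <- list(owner = name, weights = weights, table = recipes)

# str() gives a compact description of what a container holds
str(recipes)
str(pantry)
```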

When I first started cooking, I always hated it when people or recipes said something like “a little” olive oil, like this recipe in the picture above. Without any experience, I had no idea whether that meant one drop, a teaspoon, or a cup of olive oil! The 10-year-old we want to write a recipe for might not even know how small “small pieces” are or even what “boiling water” looks like. Computers are stupid machines that run calculations faithfully and quickly. They have no common sense whatsoever, unlike the “computers” of history, who were people who calculated, either mentally or with mechanical calculators. When programming, we need to (1) provide clear, specific commands at each step and (2) define the correct sequence of operations, considering exceptional scenarios such as data being zero or missing.
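To make the point about exceptional scenarios concrete, here is a small sketch of how missing and zero values can be handled explicitly. The data and the helper `safe_ratio()` are hypothetical, written only for this illustration; `NA` is R's marker for a missing value.

```r
# Made-up test scores, one of which is missing (NA)
scores <- c(82, 95, NA, 70)

# Without instructions for the missing value, R returns NA
mean(scores)

# Tell R explicitly to drop missing values before averaging
mean(scores, na.rm = TRUE)

# A hypothetical helper that refuses to divide by zero or by a missing value
safe_ratio <- function(numerator, denominator) {
  if (is.na(denominator) || denominator == 0) {
    return(NA_real_)   # signal "no sensible answer" instead of Inf or NaN
  }
  numerator / denominator
}

safe_ratio(10, 2)   # ordinary case
safe_ratio(10, 0)   # exceptional case: returns NA
```

Returning `NA` here is one design choice among several; stopping with an error message would be equally reasonable when a zero denominator means something went wrong upstream.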

Just like writing a recipe, the process of programming can be frustrating. Patience and trial and error are the only solutions. Suppose you asked your 6-year-old sister to help peel the carrots. Before moving on to the chopping step, you need to look at these carrots to see if they are properly peeled. One of the main things you can do in debugging is to stop and look at the intermediate products. The previous steps might not have been carried out correctly, even though you think your instructions are clear and correct. Sometimes we have typos in the code or forget to pass on the right inputs. We can print out the data and take a look. If the data is large, we examine the first few rows or even just the number of rows and columns. The intermediate data objects are created in the computer's memory as you execute your code. The coding process is the step-by-step creation and modification of data objects in memory.
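Checking an intermediate product might look like the sketch below. It uses `iris`, a dataset that ships with R, as a stand-in (an assumption of this example); the object names `flowers` and `big` are invented.

```r
# Start from a built-in example dataset (150 rows, 5 columns)
flowers <- iris

# Intermediate step: keep only flowers with long sepals
big <- flowers[flowers$Sepal.Length > 7, ]

# Stop and look at the intermediate product before moving on
head(big)   # first few rows
dim(big)    # number of rows and columns
str(big)    # column names and types

# List the data objects currently held in memory
ls()
```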

Many students have contributed to this material. Notably, Quazi Irfan, who worked as a teaching assistant, fixed many errors and gave constructive feedback. In the fall of 2018, a group of highly motivated students in the STAT 442 Exploratory Data Analysis class worked on some of the datasets presented here. They are Samuel Ivanecky, Kory Heier, Audrey Bunge, Jacie McDonald, Shae Olson, Nathan Thirsten, and Alex Wieseler. Some of the plots in this book are inspired by their work.

The best way to learn R is to use it. This is similar to learning a foreign language. After the first two chapters, it is entirely possible to start working on your own dataset. You do not need to learn everything first. That is nearly impossible, as the R community produces extremely useful and cool packages every day. You do not need to finish all the chapters of this book. Feel free to search for and steal some example code. Even programmers with 20 years of experience acknowledge that they still google basic functions daily. If you do not have a dataset, you can find one from sources such as TidyTuesday.

Chapter 3 contains more traditional R programming content that explains the data objects and common functions. If you are learning by yourself, you do not need to feel guilty about skipping it. You can go through the later chapters of the book quickly. There is no need to memorize the functions. Many students, and even the authors, keep coming back to the later chapters to recall how particular plots are generated. Once you have gone through this book, you can use it as a reference.

This is still a work in progress. The later chapters, in particular, need to be extensively edited. Any comments and suggestions to make this draft better are welcome, including reports of typos, errors, and organizational issues. The best place to reach out is the GitHub issues page. If you would rather not create yet another account, you can email us.