1 Introduction

1.1 Welcome to ggplot2

ggplot2 is an R package for producing statistical, or data, graphics, but it is unlike most other graphics packages because it has a deep underlying grammar. This grammar, based on the Grammar of Graphics (Wilkinson 2005), is made up of a set of independent components that can be composed in many different ways. This makes ggplot2 very powerful because you are not limited to a set of pre-specified graphics, but you can create new graphics that are precisely tailored for your problem. This may sound overwhelming, but because there is a simple set of core principles and very few special cases, ggplot2 is also easy to learn (although it may take a little time to forget your preconceptions from other graphics tools).

ggplot2 provides beautiful, hassle-free plots that take care of fiddly details like drawing legends. In fact, its carefully chosen defaults mean that you can produce publication-quality graphics in seconds. However, if you do have special formatting requirements, ggplot2’s comprehensive theming system makes it easy to do what you want. Ultimately, this means that rather than spending your time making your graph look pretty, you can instead focus on creating the graph that best reveal the message in your data.

ggplot2 is designed to work iteratively. You can start with a layer showing the raw data then add layers of annotations and statistical summaries. It allows you to produce graphics using the same structured thinking that you use to design an analysis, reducing the distance between a plot in your head and one on the page. It is especially helpful for students who have not yet developed the structured approach to analysis used by experts.

Learning the grammar not only will help you create graphics that you know about now, but will also help you to think about new graphics that would be even better. Without the grammar, there is no underlying theory, so most graphics packages are just a big collection of special cases. For example, in base R, if you design a new graphic, it’s composed of raw plot elements like points and lines, and it’s hard to design new components that combine with existing plots. In ggplot2, the expressions used to create a new graphic are composed of higher-level elements like representations of the raw data and statistical transformations, and can easily be combined with new datasets and other plots.

This book provides a hands-on introduction to ggplot2 with lots of example code and graphics. It also explains the grammar on which ggplot2 is based. Like other formal systems, ggplot2 is useful even when you don’t understand the underlying model. However, the more you learn about it, the more effectively you’ll be able to use ggplot2. This book assumes some basic familiarity with R, to the level described in the first chapter of Dalgaard’s Introductory Statistics with R.

This book will introduce you to ggplot2 as a novice, unfamiliar with the grammar; teach you the basics so that you can re-create plots you are already familiar with; show you how to use the grammar to create new types of graphics; and eventually turn you into an expert who can build new components to extend the grammar.

1.2 What is the grammar of graphics?

Wilkinson (2005) created the grammar of graphics to describe the deep features that underlie all statistical graphics. The grammar of graphics is an answer to a question: what is a statistical graphic? The layered grammar of graphics (Wickham 2009) builds on Wilkinson’s grammar, focussing on the primacy of layers and adapting it for embedding within R. In brief, the grammar tells us that a statistical graphic is a mapping from data to aesthetic attributes (colour, shape, size) of geometric objects (points, lines, bars). The plot may also contain statistical transformations of the data and is drawn on a specific coordinate system. Facetting can be used to generate the same plot for different subsets of the dataset. It is the combination of these independent components that make up a graphic.

As the book progresses, the formal grammar will be explained in increasing detail. The first description of the components follows below. It introduces some of the terminology that will be used throughout the book and outlines the basic responsibilities of each component. Don’t worry if it doesn’t all make sense right away: you will have many more opportunities to learn about the pieces and how they fit together.

All plots are composed of:

  • Data that you want to visualise and a set of aesthetic mappings describing how variables in the data are mapped to aesthetic attributes that you can perceive.

  • Layers made up of geometric elements and statistical transformation. Geometric objects, geoms for short, represent what you actually see on the plot: points, lines, polygons, etc. Statistical transformations, stats for short, summarise data in many useful ways. For example, binning and counting observations to create a histogram, or summarising a 2d relationship with a linear model.

  • The scales map values in the data space to values in an aesthetic space, whether it be colour, or size, or shape. Scales draw a legend or axes, which provide an inverse mapping to make it possible to read the original data values from the plot.

  • A coordinate system, coord for short, describes how data coordinates are mapped to the plane of the graphic. It also provides axes and gridlines to make it possible to read the graph. We normally use a Cartesian coordinate system, but a number of others are available, including polar coordinates and map projections.

  • A faceting specification describes how to break up the data into subsets and how to display those subsets as small multiples. This is also known as conditioning or latticing/trellising.

  • A theme which controls the finer points of display, like the font size and background colour. While the defaults in ggplot2 have been chosen with care, you may need to consult other references to create an attractive plot. A good starting place is Tufte’s early works (Tufte 1990, 1997, 2001).

It is also important to talk about what the grammar doesn’t do:

  • It doesn’t suggest what graphics you should use to answer the questions you are interested in. While this book endeavours to promote a sensible process for producing plots of data, the focus of the book is on how to produce the plots you want, not knowing what plots to produce. For more advice on this topic, you may want to consult Robbins (2013), Cleveland (1993b), Chambers et al. (1983), and Tukey (1977).

  • It does not describe interactivity: the grammar of graphics describes only static graphics and there is essentially no benefit to displaying them on a computer screen as opposed to a piece of paper. ggplot2 can only create static graphics, so for dynamic and interactive graphics you will have to look elsewhere (perhaps at ggvis, described below). Cook and Swayne (2007) provides an excellent introduction to the interactive graphics package GGobi. GGobi can be connected to R with the rggobi package (Wickham et al. 2008).

1.3 How does ggplot2 fit in with other R graphics?

There are a number of other graphics systems available in R: base graphics, grid graphics and trellis/lattice graphics. How does ggplot2 differ from them?

  • Base graphics were written by Ross Ihaka based on experience implementing the S graphics driver and partly looking at Chambers et al. (1983). Base graphics has a pen on paper model: you can only draw on top of the plot, you cannot modify or delete existing content. There is no (user accessible) representation of the graphics, apart from their appearance on the screen. Base graphics includes both tools for drawing primitives and entire plots. Base graphics functions are generally fast, but have limited scope. If you’ve created a single scatterplot, or histogram, or a set of boxplots in the past, you’ve probably used base graphics.

  • The development of “grid” graphics, a much richer system of graphical primitives, started in 2000. Grid is developed by Paul Murrell, growing out of his PhD work (Murrell 1998). Grid grobs (graphical objects) can be represented independently of the plot and modified later. A system of viewports (each containing its own coordinate system) makes it easier to lay out complex graphics. Grid provides drawing primitives, but no tools for producing statistical graphics.

  • The lattice package, developed by Deepayan Sarkar, uses grid graphics to implement the trellis graphics system of Cleveland (1993b) and is a considerable improvement over base graphics. You can easily produce conditioned plots and some plotting details (e.g., legends) are taken care of automatically. However, lattice graphics lacks a formal model, which can make it hard to extend. Lattice graphics are explained in depth in Sarkar (2008).

  • ggplot2, started in 2005, is an attempt to take the good things about base and lattice graphics and improve on them with a strong underlying model which supports the production of any kind of statistical graphic, based on the principles outlined above. The solid underlying model of ggplot2 makes it easy to describe a wide range of graphics with a compact syntax, and independent components make extension easy. Like lattice, ggplot2 uses grid to draw the graphics, which means you can exercise much low-level control over the appearance of the plot.

  • Work on ggvis, the successor to ggplot2, started in 2014. It takes the foundational ideas of ggplot2 but extends them to the web and interactive graphics. The syntax is similar, but it’s been re-designed from scratch to take advantage of what I’ve learned in the 10 years since creating ggplot2. The most exciting thing about ggvis is that it’s interactive and dynamic, so plots automatically re-draw themselves when the underlying data or plot specification changes. However, ggvis is work in progress and currently can create only a fraction of the plots in ggplot2 can. Stay tuned for updates!

  • htmlwidgets, http://www.htmlwidgets.org, provides a common framework for accessing web visualisation tools from R. Packages built on top of htmlwidgets include leaflet (https://rstudio.github.io/leaflet/, maps), dygraph (http://rstudio.github.io/dygraphs/, time series) and networkD3 (http://christophergandrud.github.io/networkD3/, networks). htmlwidgets is to ggvis what the many specialised graphic packages are to ggplot2: it provides graphics honed for specific purposes.

Many other R packages, such as vcd (Meyer, Zeileis, and Hornik 2006), plotrix (Lemon et al. 2008) and gplots (Warnes 2007), implement specialist graphics, but no others provide a framework for producing statistical graphics. A comprehensive list of all graphical tools available in other packages can be found in the graphics task view at http://cran.r-project.org/web/views/Graphics.html.

1.4 About this book

The first chapter, Chapter 2, describes how to quickly get started using ggplot2 to make useful graphics. This chapter introduces several important ggplot2 concepts: geoms, aesthetic mappings and facetting. Chapters 4 to 8 dive into more details, giving you a toolbox designed to solve a wide range of problems.

Chapter 10 describes the layered grammar of graphics which underlies ggplot2. The theory is illustrated in Chapter 11 which demonstrates how to add additional layers to your plot, exercising full control over the geoms and stats used within them.

Understanding how scales work is crucial for fine-tuning the perceptual properties of your plot. Customising scales gives fine control over the exact appearance of the plot and helps to support the story that you are telling. Chapter 12 will show you what scales are available, how to adjust their parameters, and how to control the appearance of axes and legends.

Coordinate systems and facetting control the position of elements of the plot. These are described in Chapter 11.7. Facetting is a very powerful graphical tool as it allows you to rapidly compare different subsets of your data. Different coordinate systems are less commonly needed, but are very important for certain types of data.

To polish your plots for publication, you will need to learn about the tools described in Chapter 15. There you will learn about how to control the theming system of ggplot2 and how to save plots to disk.

1.5 Installation

To use ggplot2, you must first install it. Make sure you have a recent version of R (at least version 3.2.0) from http://r-project.org and then run the following code to download and install ggplot2:

1.6 Other resources

This book teaches you the elements of ggplot2’s grammar and how they fit together, but it does not document every function in complete detail. You will need additional documentation as your use of ggplot2 becomes more complex and varied.

The best resource for specific details of ggplot2 functions and their arguments will always be the built-in documentation. This is accessible online, http://docs.ggplot2.org/, and from within R using the usual help syntax. The advantage of the online documentation is that you can see all the example plots and navigate between topics more easily.

If you use ggplot2 regularly, it’s a good idea to sign up for the ggplot2 mailing list, http://groups.google.com/group/ggplot2. The list has relatively low traffic and is very friendly to new users. Another useful resource is stackoverflow, http://stackoverflow.com. There is an active ggplot2 community on stackoverflow, and many common questions have already been asked and answered. In either place, you’re much more likely to get help if you create a minimal reproducible example. The reprex package by Jenny Bryan provides a convenient way to do this, and also include advice on creating a good example. The more information you provide, the easier it is for the community to help you.

The number of functions in ggplot2 can be overwhelming, but RStudio provides some great cheatsheets to jog your memory at http://www.rstudio.com/resources/cheatsheets/.

Finally, the complete source code for the book is available online at https://github.com/hadley/ggplot2-book. This contains the complete text for the book, as well as all the code and data needed to recreate all the plots.

1.7 Colophon

This book was written in R Markdown inside RStudio. knitr and pandoc converted the raw Rmarkdown to html and pdf. The complete source is available from github. This version of the book was built with:

devtools::session_info(c("ggplot2", "dplyr", "broom"))
#> ─ Session info ──────────────────────────────────────────────────────────
#>  setting  value                       
#>  version  R version 3.6.1 (2017-01-27)
#>  os       Ubuntu 16.04.6 LTS          
#>  system   x86_64, linux-gnu           
#>  ui       X11                         
#>  language en_US.UTF-8                 
#>  collate  en_US.UTF-8                 
#>  ctype    en_US.UTF-8                 
#>  tz       UTC                         
#>  date     2019-08-22                  
#> 
#> ─ Packages ──────────────────────────────────────────────────────────────
#>  package      * version  date       lib source        
#>  assertthat     0.2.1    2019-03-21 [1] CRAN (R 3.6.1)
#>  backports      1.1.4    2019-04-10 [1] CRAN (R 3.6.1)
#>  BH             1.69.0-1 2019-01-07 [1] CRAN (R 3.6.1)
#>  broom          0.5.2    2019-04-07 [1] CRAN (R 3.6.1)
#>  cli            1.1.0    2019-03-19 [1] CRAN (R 3.6.1)
#>  colorspace     1.4-1    2019-03-18 [1] CRAN (R 3.6.1)
#>  crayon         1.3.4    2017-09-16 [1] CRAN (R 3.6.1)
#>  digest         0.6.20   2019-07-04 [1] CRAN (R 3.6.1)
#>  dplyr        * 0.8.3    2019-07-04 [1] CRAN (R 3.6.1)
#>  ellipsis       0.2.0.1  2019-07-02 [1] CRAN (R 3.6.1)
#>  fansi          0.4.0    2018-10-05 [1] CRAN (R 3.6.1)
#>  generics       0.0.2    2018-11-29 [1] CRAN (R 3.6.1)
#>  ggplot2      * 3.2.1    2019-08-10 [1] CRAN (R 3.6.1)
#>  glue           1.3.1    2019-03-12 [1] CRAN (R 3.6.1)
#>  gtable         0.3.0    2019-03-25 [1] CRAN (R 3.6.1)
#>  labeling       0.3      2014-08-23 [1] CRAN (R 3.6.1)
#>  lattice        0.20-38  2018-11-04 [3] CRAN (R 3.6.1)
#>  lazyeval       0.2.2    2019-03-15 [1] CRAN (R 3.6.1)
#>  magrittr       1.5      2014-11-22 [1] CRAN (R 3.6.1)
#>  MASS           7.3-51.4 2019-03-31 [3] CRAN (R 3.6.1)
#>  Matrix         1.2-17   2019-03-22 [3] CRAN (R 3.6.1)
#>  mgcv           1.8-28   2019-03-21 [3] CRAN (R 3.6.1)
#>  munsell        0.5.0    2018-06-12 [1] CRAN (R 3.6.1)
#>  nlme           3.1-140  2019-05-12 [3] CRAN (R 3.6.1)
#>  pillar         1.4.2    2019-06-29 [1] CRAN (R 3.6.1)
#>  pkgconfig      2.0.2    2018-08-16 [1] CRAN (R 3.6.1)
#>  plogr          0.2.0    2018-03-25 [1] CRAN (R 3.6.1)
#>  plyr           1.8.4    2016-06-08 [1] CRAN (R 3.6.1)
#>  purrr          0.3.2    2019-03-15 [1] CRAN (R 3.6.1)
#>  R6             2.4.0    2019-02-14 [1] CRAN (R 3.6.1)
#>  RColorBrewer   1.1-2    2014-12-07 [1] CRAN (R 3.6.1)
#>  Rcpp           1.0.2    2019-07-25 [1] CRAN (R 3.6.1)
#>  reshape2       1.4.3    2017-12-11 [1] CRAN (R 3.6.1)
#>  rlang          0.4.0    2019-06-25 [1] CRAN (R 3.6.1)
#>  scales         1.0.0    2018-08-09 [1] CRAN (R 3.6.1)
#>  stringi        1.4.3    2019-03-12 [1] CRAN (R 3.6.1)
#>  stringr        1.4.0    2019-02-10 [1] CRAN (R 3.6.1)
#>  tibble         2.1.3    2019-06-06 [1] CRAN (R 3.6.1)
#>  tidyr        * 0.8.3    2019-03-01 [1] CRAN (R 3.6.1)
#>  tidyselect     0.2.5    2018-10-11 [1] CRAN (R 3.6.1)
#>  utf8           1.1.4    2018-05-24 [1] CRAN (R 3.6.1)
#>  vctrs          0.2.0    2019-07-05 [1] CRAN (R 3.6.1)
#>  viridisLite    0.3.0    2018-02-01 [1] CRAN (R 3.6.1)
#>  withr          2.1.2    2018-03-15 [1] CRAN (R 3.6.1)
#>  zeallot        0.1.0    2018-01-28 [1] CRAN (R 3.6.1)
#> 
#> [1] /home/travis/R/Library
#> [2] /usr/local/lib/R/site-library
#> [3] /home/travis/R-bin/lib/R/library
getOption("width")
#> [1] 75

References

Chambers, John, William Cleveland, Beat Kleiner, and Paul Tukey. 1983. Graphical Methods for Data Analysis. Wadsworth.

Cleveland, William. 1993b. Visualizing Data. Hobart Press.

Cook, Dianne, and Deborah F. Swayne. 2007. Interactive and Dynamic Graphics for Data Analysis: With Examples Using R and Ggobi. Springer.

Lemon, Jim, Ben Bolker, Sander Oom, Eduardo Klein, Barry Rowlingson, Hadley Wickham, Anupam Tyagi, et al. 2008. Plotrix: Various Plotting Functions.

Meyer, David, Achim Zeileis, and Kurt Hornik. 2006. “The Strucplot Framework: Visualizing Multi-Way Contingency Tables with Vcd.” Journal of Statistical Software 17 (3): 1–48. http://www.jstatsoft.org/v17/i03/.

Murrell, Paul. 1998. “Investigations in Graphical Statistics.” PhD thesis, The University of Auckland.

Robbins, Naomi. 2013. Creating More Effective Graphs. Chart House.

Sarkar, Deepayan. 2008. Lattice: Multivariate Data Visualization with R. Springer.

Tufte, Edward R. 1990. Envisioning Information. Graphics Press.

Tufte, Edward R. 1997. Visual Explanations. Graphics Press.

Tufte, Edward R. 2001. The Visual Display of Quantitative Information. Second. Graphics Press.

Tukey, John W. 1977. Exploratory Data Analysis. Addison–Wesley.

Warnes, Gregory. 2007. Gplots: Various R Programming Tools for Plotting Data.

Wickham, Hadley. 2009. “A Layered Grammar of Graphics.” Journal of Computational and Graphical Statistics.

Wickham, Hadley, Michael Lawrence, Duncan Temple Lang, and Deborah F Swayne. 2008. “An Introduction to Rggobi.” R-News 8 (2): 3–7. http://CRAN.R-project.org/doc/Rnews/Rnews_2008-2.pdf.

Wilkinson, Leland. 2005. The Grammar of Graphics. 2nd ed. Statistics and Computing. Springer.