17 ggplot2 internals

Throughout this book the focus has been on how to use ggplot2 as a user. The nature of the API is so that how it’s used is very different from how it works, and for the most part this is not a problem. After all, as a user there is no reason to spend time figuring out how ggplot2 translates your declarations into a plot.

The decoupling of the API and the machinery underneath can hit hard when you as a user begins to transition into an extension developer, where understanding of the machinery becomes paramount. As extending ggplot2 becomes more and more common, so does the frustration related to understanding how it all fits together.

This chapter is dedicated to providing a description of how ggplot2 works behind the curtains. The focus will not be on the technical aspects of the implementation, but rather on the design, and its implications for how it all fits together. I’ll start by describing what happens as you plot your ggplot object from a high-level perspective, and then proceeding to describe how the data you are plotting flows through this whole process and ends up as visual elements in your plot.

17.1 The plot() method

Almost everything related to converting your ggplot2 code into a plot happens once you print it, not while you construct the plot. This makes sense as it means nothing has to be re-calculated every time you add new elements to the plot. It also means that to properly understand the mechanics of ggplot2, you have to follow the plot function down the rabbit hole. So, how does it look?

Many of the function calls above may seem foreign, and most are not really relevant to understanding ggplot2. The calls of interest are set_last_plot(x) (to some extent), data <- ggplot_build(x), and gtable <- ggplot_gtable(data). The set_last_plot(x) stores the plot internally so that it is retrivable with last_plot(). The two remaining calls are what makes up the rendering stack of ggplot2. ggplot_build is where the data for each layer is prepared for plotting (with everything that entails) and ggplot_gtable takes the prepared data and turns it into graphic elements stored in a gtable (we’ll come back to what that is later). What may come as a surprise is that ggplot2 itself does not do any actual drawing. It’s responsibility stops after the gtable object has been created. The gtable package which implements the gtable class does not do any drawing either. Drawing is performed by the grid package in unison with the active graphic device. This is an important point, as it means that ggplot2 and, by extension, any extensions to ggplot2, do not need to concern themselves with the nitty gritty of creating the visual output. The responsibility is solemnly on converting the user data to one or more graphic primitives such as polygons, lines, points, etc. While it is thus stricktly not true, we will continue to call this conversion into graphic primitives the rendering process.

17.2 Follow the data

As may be apparent from the section above, the main actor in the rendering process is the layer data, and the rendering process is really a long progression of steps to convert the data from the format supplied by the user, to a format that fits with the graphic primitives needed to create the desired visual elements. This also means that to gain an understanding of the mechanics of ggplot2 we must understand how data flows through the mechanics and how it transforms along the way.

17.2.1 The build step

ggplot_build(), as discussed above, takes the declarative representation constructed with the public API and augments it by preparing the data for conversion to graphic primitives.

17.2.1.1 Data preparation

The first part of the processing is to get the data associated with each layer and get it into a predictable format. A layer can either provide it’s own data, inherit from the global data, or provide a function that is applied to the global data and returns a new data.frame. Once this is done the data is passed to the plot layout which orchestrates coordinate systems and facets. Within the layout the data is passed in turn to the plot coordinate system which may change it (it usually don’t) and then to the facet which inspects the data to figure out how many panels the plot should have and how they should be organised. During this process each leayer data will be augmented with a PANEL column. This column will (must) be kept throughout the rendering process and will link each data row to a specific facet panel in the final plot.

The last part of the data preparation is to convert the layer data into calculated aesthetic values. This involves evaluating all the aesthetic expression from aes() on the layer data. Further, if not given explicitly, the group aesthetic is calculated from the interaction of all non-continuous aesthetics. The group aesthetic is, like PANEL a special column that must be kept throughtout the processing.

17.2.1.2 Data transformation

Once the layer data has been extracted and converted to a predictable format it undergoes a range of transformations until it gets to the format that the layer geometry expects.

The first step is to apply any scale transformations to the columns in the data. This is where any argument to trans in a scale has an effect. The remainder of the rendering will work in this transformed space. This is the underlying reason for the difference in setting a position transform in the scale vs in the coordinate system. Setting it in the scale will force all calculations to happen in transformed space, while setting it in the coordinate system will have all calculations happen in untransformed space and then apply the transformation to the outcome.

After this the position aesthetics are mapped based on the position scales. For continuous positions this simply means applying the oob() function (defaults to censor()) and removing NA rows. For discrete positions the change is more radical as the values are matched to the limits (breaks) of the scale and converted to integer positions For binned position scales the continuous data is first cut into bins based on the breaks specification and then set the position to the midpoint of their respective bin. This means that no matter what type of position scale is used, it will look continuous to the stat and geom computations. This is important because otherwise computations such as dodging and jitter would fail for discrete positions.

Now the data is ready to be handed to the layer stat where any statistical transformation takes place. The setup is that the stat first gets to inspect the data and modify its parameters, then do a one off preparation of the data. After that the data is split by PANEL, then group and statistics are calculated before the data is reassembled. It is possible for a stat to circumvent this splitting by overwritting specific compute_*() methods and thus do some optimisation. After the data has been reassembled in its new form it goes through a new aesthetic mapping. This is where aesthetics that has been delayed using stat() (or the old ..var.. notation) gets added to the data. This is why stat() expressions cannot target the original data as it simply doesn’t exist at this point anymore.

At this point the geom takes over from the stat (almost). The first action it takes is to inspect the data, update its parameters and possibly make a first pass modification of the data (same setup as for stat). This is possibly where some of the columns gets reparameterised e.g. x+width gets changed to xmin+xmax. After this the position adjustment gets applied, so that e.g. overlapping bars are stacked, etc.

Now, perhaps surprisingly, the position scales are reset, retrained, and applied to the data. Thinking about it, this is absolutely necessary as e.g. stacking can change the range of one of the axes dramatically. Even more, sometimes one of the position aesthetics is not available until after the stat computations and if the scales were not retrained it would never get trained.

The last part of the data transformation is to train and map all non-positional aesthetics, i.e. convert whatever discrete or continuous input that is mapped to graphical parameters such as colours, linetypes, sizes etc. Further, any default aesthetics from the geom is added so that the data is now in a predicatable state for the geom. In the end, both the stat and the facet gets a last chance to modify the data in its final mapped form with their finish_data() methods before the build step is done.

17.2.1.3 Output

The return value of ggplot_build() is a list structure with the ggplot_built class. It contains the computed data, as well as a Layout object holding information about the trained coordinate system and faceting. Further it holds a copy of the original plot object, but now with trained scales.

17.2.2 The gtable step

The purpose of ggplot_gtable() is to take the output of the build step and turn it into a single gtable object that can be plotted using grid. At this point the main elements responsible for further computations are the geoms, the coordinate system, the facet, and the theme. The stats and position adjustments have all played their part already.

17.2.2.1 Rendering the panels

The first thing that happens is that the data is converted into its graphical representation. This happens in two steps. First, each layer is converted into a list of graphical objects (grobs). As with stats the conversion happens by splitting the data, first by PANEL, and then by group, with the possibility of the geom intercepting this splitting for performance reasons. While a lot of the data preparation has been performed already it is not uncommon that the geom does some additional transformation of the data during this step. A crucial part is to transform and normalise the position data. This is done by the coordinate system and while it often simply means that the data is normalised based on the limits of the coordinate system, it can also include radical transformations such as converting the positions into polar coordinates. The output of this is for each layer a list of gList objects corresponding to each panel in the facet layout. After this the facet takes over and assembles the panels. It does this by first collectiong the grobs for each panel from the layers, along with rendering strips, backgrounds, gridlines,and axes based on the theme and combines all of this into a single gList for each panel. It then proceeds to arranging all these panels into a gtable based on the calculated panel layout. For most plots this is simple as there is only a single panel, but for e.g. plots using facet_wrap() it can be quite complicated. The output is the basis of the final gtable object.

17.2.2.2 Adding guides

There are two types of guides in ggplot2: axes and legends. at this point the axes has already been rendered and assembled together with the panels, while the legends are still missing. Rendering the legends is a complicated process that first trains a guide for each scale. Then, potentially multiple guides are merged if their mapping allows it before the layers that contribute to the legend is asked for key grobs for each key in the legend. These key grobs are then assembled across layers and combined to the final legend in a process that is quite reminiscent of how layers gets combined into the gtable of panels. In the end the output is a gtable that holds each legend box arranged and styled according to the theme and guide specifications. Once created the guide gtable is then added to the main gtable according to the legend.position theme setting.

17.2.2.3 Adding adornment

The only thing remaining is to add title, subtitle, caption, and tag as well as add background and margins, at which point the final gtable is done.

17.2.2.4 Output

The end result of ggplot_gtable() is, as described above, a gtable. What is less obvious is that the dimensions of the object is unpredictable and will depend on both the faceting, legend placement, and which titles are drawn. It is thus not advised to depend on row and column placement in your code, should you want to further modify the gtable. All elements of the gtable are named though, so it is still possible to reliably retrieve, e.g. the grob holding the top-left y-axis with a bit of work.

17.3 ggproto

ggplot2 has undergone a couple of rewrites during its long life. A few of these have introduced new class systems to the underlying code. While there is still a small amount of leftover from older class systems, the code has more or less coalesced around the ggproto class system introduced in ggplot2 v2.0.0. ggproto is a custom build class system made specifically for ggplot2 to facilitate portable extension classes. Like the more well-known R6 system it is a system using reference semantics, allowing inheritance and access to methods from parent classes. On top of the ggproto is a set of design principles that, while not enforced by ggproto, is essential to how the system is used in ggplot2.

17.3.1 ggproto syntax

A ggproto object is created using the ggproto() function, which takes a class name, a parent class and a range of fields and methods:

As can be seen, fields and methods are not differentiated in the construction, and they are not treated differently from a user perspective. Methods can take a first argment self which gives the method access to its own fields and methods, but it won’t be part of the final method signature. One surprising quirk if you come from other reference based object systems in R is that ggproto() does not return a class contructor; it returns an object. New instances of the class is constructed by subclassing the object without giving a new class name:

When subclassing and overwriting methods, the parent class and its methods are available through the ggproto_parent() function:

For reasons that we’ll discuss below, the use of ggproto_parent() is not that prevalent in the ggplot2 source code.

All in all ggproto is a minimal class system that is designed to accomodate ggplot2 and nothing else. It’s structure is heavily guided by the proto class system used in early versions of ggplot2 in order to reduce the required changes to the ggplot2 source code during the switch, and its features are those required by ggplot2 and nothing more.

17.3.2 ggproto style guide

While ggproto is flexible enough to be used in many ways, it is used in ggplot2 in a very delibarete way. As you are most likely to use ggproto in the context of extending ggplot2 you will need to understand these ways.

17.3.2.1 ggproto classes are used selectively

The use of ggproto in ggplot2 is not all-encompassing. Only select functionality is based on ggproto and it is not expected, nor advised to create new ggproto classes to encapsulate logic in your extensions. This means that you, as an extension developer, will never create ggproto objects from scratch but rather subclass one of the main ggproto classes provided by ggplot2. Later chapters will go into detail on how exactly to do that.

17.3.2.2 ggproto classes are stateless

Except for a few select internal classes used to orchestrate the rendering, ggproto classes in ggplot2 are stateless. This means that after they are constructed they will not change. This breaks a common expectation for reference based classes where methods will alter the state of the object, but it is paramount that you adhere to this principle. If e.g. some of your Stat or Geom extensions changed state during rendering, plotting a saved ggplot object would affect all instances of that object as all copies would point to the same ggproto objects. State is imposed in two ways in ggplot2. At creation, which is ok because this state should be shared between all instances anyway, and through a params object managed elsewhere. As you’ll see later, most ggproto classes have a setup_params() method where data can be inspected and specific properties calculated and stored.

17.3.2.3 ggproto classes have simple inheritance

Because ggproto class instances are stateless it is relatively safe to call methods from other classes inside a method, instead of inheriting directly from the class. Because of this it is relatively common to borrow functionality from other classes without creating an explicit inheritance. As an example, the setup_params() method in GeomErrorbar is defined as:

While we have seen that parent methods can be called using ggproto_parent() this pattern is quite rare to find in the ggplot2 source code, as the pattern shown above is often clearer and just as safe.