Plotting geospatial data is a common visualisation task, and one that requires specialised tools. Typically the problem can be decomposed into two problems: using one data source to draw a map, and adding metadata from another information source to the map. This chapter will help you tackle both problems. I’ve structured the chapter in the following way: Section 7.1 outlines a simple way to draw maps using
geom_polygon(), which is followed in Section 7.2 by a modern “simple features” (sf) approach using
geom_sf(). Next, Sections 7.3 and 7.4 discuss how to work with map projections and the underlying sf data structure. Finally, Section 7.5 discusses how to draw maps based on raster data.
7.1 Polygon maps
Perhaps the simplest approach to drawing maps is to use
geom_polygon() to draw boundaries for different regions. For this example we take data from the maps package using
ggplot2::map_data(). The maps package isn’t particularly accurate or up-to-date, but it’s built into R so it’s an easy place to start. Here’s a data set specifying the county boundaries for Michigan:
In this data set we have four variables:
long specify the latitude and longitude of a vertex (i.e. a corner of the polygon),
id specifies the name of a region, and
group provides a unique identifier for contiguous areas within a region (e.g. if a region consisted of multiple islands). To get a better sense of what the data contains, we can plot
geom_point(), as shown in the left panel below. In this plot, each row in the data frame is plotted as a single point, producing a scatterplot that shows the corners of every county. To turn this scatterplot into a map, we use
geom_polygon() instead, which draws each county as a distinct polygon. This is illustrated in the right panel below.
In both plots I use
coord_quickmap() to adjust the axes to ensure that longitude and latitude are rendered on the same scale. Chapter 13 discusses coordinate systems in ggplot2 in more general terms, but as we’ll see below, geospatial data often require a more exacting approach. For this reason, ggplot2 provides
coord_sf() to handle spatial data specified in simple features format.
7.2 Simple features maps
There are a few limitations to the approach outlined above, not least of which is the fact that the simple “longitude-latitude” data format is not typically used in real world mapping. Vector data for maps are typically encoded using the “simple features” standard produced by the Open Geospatial Consortium. The sf package (Pebesma 2018) developed by Edzer Pebesma https://github.com/r-spatial/sf provides an excellent toolset for working with such data, and the
coord_sf() functions in ggplot2 are designed to work together with the sf package.
To introduce these functions, we rely on the ozmaps package by Michael Sumner https://github.com/mdsumner/ozmaps/ which provides maps for Australian state boundaries, local government areas, electoral boundaries, and so on (Sumner 2019). To illustrate what an sf data set looks like, we import a data set depicting the borders of Australian states and territories:
library(ozmaps) library(sf) #> Linking to GEOS 3.7.1, GDAL 2.2.2, PROJ 4.9.2 oz_states <- ozmaps::ozmap_states oz_states #> Simple feature collection with 9 features and 1 field #> geometry type: MULTIPOLYGON #> dimension: XY #> bbox: xmin: 106 ymin: -43.6 xmax: 168 ymax: -9.23 #> epsg (SRID): 4283 #> proj4string: +proj=longlat +ellps=GRS80 +towgs84=0,0,0,0,0,0,0 +no_defs #> # A tibble: 9 x 2 #> NAME geometry #> <chr> <MULTIPOLYGON [°]> #> 1 New South Wal… (((151 -35.1, 151 -35.1, 151 -35.1, 151 -35.1, 151 -35.2, 151 … #> 2 Victoria (((147 -38.7, 147 -38.7, 147 -38.7, 147 -38.7, 147 -38.7)), ((… #> 3 Queensland (((149 -20.3, 149 -20.4, 149 -20.4, 149 -20.3)), ((149 -20.9, … #> 4 South Austral… (((137 -34.5, 137 -34.5, 137 -34.5, 137 -34.5, 137 -34.5, 137 … #> 5 Western Austr… (((126 -14, 126 -14, 126 -14, 126 -14, 126 -14)), ((124 -16.1,… #> 6 Tasmania (((148 -40.3, 148 -40.3, 148 -40.3, 148 -40.3)), ((147 -39.5, … #> # … with 3 more rows
This output shows some of the metadata associated with the data (discussed momentarily), and tells us that the data is essentially a tibble with 11 rows and 4 columns. One advantage to sf data is immediately apparent, we can easily see the overall structure of the data: Australia is comprised of six states, four territories and Macquarie Island, which is politically part of Tasmania. There are 11 distinct geographical units, so there are 11 rows in this tibble (cf.
mi_counties data where there is one row per polygon vertex).
The most important column is
geometry, which specifies the spatial geometry for each of the states and territories. Each element in the
geometry column is a multipolygon object which, as the name suggests, contains data specifying the vertices of one or more polygons that demark the border of a region. Given data in this format, we can use
coord_sf() to draw a serviceable map without specifying any parameters or even explicitly declaring any aesthetics:
To understand why this works, note that
geom_sf() relies on a
geometry aesthetic that is not used elsewhere in ggplot2. This aesthetic can be specified in one of three ways:
In the simplest case (illustrated above) when the user does nothing,
geom_sf()will attempt to map it to a column named
dataargument is an sf object then
geom_sf()can automatically detect a geometry column, even if it’s not called
You can specify the mapping manually in the usual way with
aes(geometry = my_column). This is useful if you have multiple geometry columns.
coord_sf() function governs the map projection, discussed in Section 7.3.
7.2.1 Layered maps
In some instances you may want to overlay one map on top of another. The ggplot2 package supports this by allowing you to add multiple
geom_sf() layers to a plot. As an example, I’ll use the
oz_states data to draw the Australian states in different colours, and the overlay this plot with the boundaries of Australian electoral regions. To do this, there are two preprocessing steps to perform. First, I’ll use
dplyr::filter() to remove the “Other Territories” from the state boundaries.
The code below draws a plot with two map layers: the first uses
oz_states to fill the states in different colours, and the second uses
oz_votes to draw the electoral boundaries. Second, I’ll extract the electoral boundaries in a simplified form using the
ms_simplify() function from the rmapshaper package (Teucher and Russell 2018). This is generally a good idea if the original data set (in this case
ozmaps::abs_ced) is stored at a higher resolution than your plot requires, in order to reduce the time taken to render the plot.
Now that I have data sets
oz_votes to represent the state borders and electoral borders respectively, the desired plot can be constructed by adding two
geom_sf() layers to the plot:
It is worth noting that the first layer to this plot maps the
fill aesthetic in onto a variable in the data. In this instance the
NAME variable is a categorical variable and does not convey any additional information, but the same approach can be used to visualise other kinds of area metadata. For example, if
oz_states had an additional column specifying the unemployment level in each state, we could map the
fill aesthetic to that variable.
7.2.2 Labelled maps
Adding labels to maps is an example of annotating plots (Chapter 8) and is supported by
geom_sf_text(). For example, while an Australian audience might be reasonably expected to know the names of the Australian states (and are left unlabelled in the plot above) few Australians would know the names of different electorates in the Sydney metropolitan region. In order to draw an electoral map of Sydney, then, we would first need to extract the
map data for the relevant elextorates, and then add the label. The plot below zooms in on the Sydney region by specifying
coord_sf() and then uses
geom_sf_label() to overlay each electorate with a label:
# filter electorates in the Sydney metropolitan region sydney_map <- ozmaps::abs_ced %>% filter(NAME %in% c( "Sydney", "Wentworth", "Warringah", "Kingsford Smith", "Grayndler", "Lowe", "North Sydney", "Barton", "Bradfield", "Banks", "Blaxland", "Reid", "Watson", "Fowler", "Werriwa", "Prospect", "Parramatta", "Bennelong", "Mackellar", "Greenway", "Mitchell", "Chifley", "McMahon" )) # draw the electoral map of Sydney ggplot(sydney_map) + geom_sf(aes(fill = NAME), show.legend = FALSE) + coord_sf(xlim = c(150.97, 151.3), ylim = c(-33.98, -33.79)) + geom_sf_label(aes(label = NAME), label.padding = unit(1, "mm")) #> Warning in st_point_on_surface.sfc(sf::st_zm(x)): st_point_on_surface may not #> give correct results for longitude/latitude data
The warning message is worth noting. Internally
geom_sf_label() uses the function
st_point_on_surface() from the sf package to place labels, and the warning message
occurs because most algorithms used by sf to compute geometric quantities (e.g., centroids,
interior points) are based on an assumption that the points lie in on a flat two dimensional
surface and parameterised with Cartesian co-ordinates. This assumption is not strictly
warranted, and in some cases (e.g., regions near the poles) calculations that treat
longitude and latitude in this way will give erroneous answers. For this reason, the sf
package produces warning messages when it relies on this approximation.
7.2.3 Adding other geoms
geom_sf() is special in some ways, it nevertheless behaves in much the same fashion as any other geom, allowing additional data to be plotted on a map with standard geoms. For example, we may wish to plot the locations of the Australian capital cities on the map using
geom_point(). The code below illustrates how this is done:
oz_capitals <- tibble::tribble( ~city, ~lat, ~lon, "Sydney", -33.8688, 151.2093, "Melbourne", -37.8136, 144.9631, "Brisbane", -27.4698, 153.0251, "Adelaide", -34.9285, 138.6007, "Perth", -31.9505, 115.8605, "Hobart", -42.8821, 147.3272, "Canberra", -35.2809, 149.1300, "Darwin", -12.4634, 130.8456, ) ggplot() + geom_sf(data = oz_votes) + geom_sf(data = oz_states, colour = "black", fill = NA) + geom_point(data = oz_capitals, mapping = aes(x = lon, y = lat), colour = "red") + coord_sf()
In this example
geom_point is used only to specify the locations of the capital cities, but the basic idea can be extended to handle point metadata more generally. For example if the oz_capitals data were to include an additional variable specifying the number of electorates within each metropolitan area, we could encode that data using the
7.3 Map projections
At the start of the chapter I drew maps by plotting longitude and latitude on a Cartesian plane, as if geospatial data were no different to other kinds of data one might want to plot. To a first approximation this is okay, but it’s not good enough if you care about accuracy. There are two fundamental problems with the approach.
The first issue is the shape of the planet. The Earth is neither a flat plane, nor indeed is it a perfect sphere. As a consequence, to map a co-ordinate value (longitude and latitude) to a location we need to make assumptions about all kinds of things. How ellipsoidal is the Earth? Where is the centre of the planet? Where is the origin point for longitude and latitude? Where is the sea level? How do the tectonic plates move? All these things are relevant, and depending on what assumptions one makes the same co-ordinate can be mapped to locations that are many meters apart. The set of assumptions about the shape of the Earth is referred to as the geodetic datum and while it might not matter for some data visualisations, for others it is critical. There are several different choices one might consider: if your focus is North America the “North American Datum” (NAD83) is a good choice, whereas if your perspective is global the “World Geodetic System” (WGS84) is probably better.
The second issue is the shape of your map. The Earth is approximately ellipsoidal, but in most instances your spatial data need to be drawn on a two dimensional plane. It is not possible to map the surface of an ellipsoid to a plane without some distortion or cutting, and you will have to make choices about what distortions you are prepared to accept when drawing a map. This is the job of the map projection.
Map projections are often classified in terms of the geometric properties that they preserve, e.g.
Area-preserving projections ensure that regions of equal area on the globe are drawn with equal area on the map.
Shape-preserving (or conformal) projections ensure that the local shape of regions is preserved.
And unfortunately, it’s not possible for any projection to be shape-preserving and area-preserving. This makes it a little beyond the scope of this book to discuss map projections in detail, other than to note that the simple features specification allows you to indicate which map projection you want to use. For more information on map projections, see Geocomputation with R https://geocompr.robinlovelace.net/ (Lovelace, Nowosad, and Muenchow 2019).
Taken together, the geodetic datum (e.g, WGS84), the type of map projection (e.g., Mercator) and the parameters of the projection (e.g., location of the origin) specify a coordinate reference system, or CRS, a complete set of assumptions used to translate the latitude and longitude information into a two dimensional map. An sf object often includes a default CRS, as illustrated below:
As this output illutrates, the CRS can be described in two different ways. You can either use a numeric “EPSG code” (see http://www.epsg-registry.org/) or you can use the “proj4string”, a more verbose and (slightly) more human-readable format.
In ggplot2, the CRS is controlled by
coord_sf(), which ensures that every layer in the plot uses the same projection. By default,
coord_sf() uses the CRS associated with the geometry column of the data1. Because sf data typically supply a sensible choice of CRS, this process usually unfolds invisibly, requiring no intervention from the user. However, should you need to set the CRS yourself, you can specify the
crs parameter. Some careful thought is required to do so, however, as the plot below illustrates:
In this map I have used the WGS84 datum (quite reasonably) and Lambert conformal conic (LCC) projection, which is often used in aeronautical applications because straight lines on the map are approximate “great circles” on the globe, and it is generally considered a good projection for regional maps in the middle latitudes. However, the map looks terrible because I haven’t set the parameters very wisely. To fix this, I centre the map at longitude 140 and latitude -25, and fix the two standard parallels (lines at which there are no distortions) at latitudes -18 and -36.
7.4 Working with sf data
As noted earlier, maps created using
coord_sf() rely heavily on tools
provided by the sf package, and indeed the sf package contains many more useful tools for
manipulating simple features data. In this section I provide an introduction to a few
such tools; more detailed coverage can be found on the sf package website
To get started, recall that one advantage to simple features over other representations of spatial data is that geographical units can have complicated structure. A good example of this in the Australian maps data is the electoral district of Eden-Monaro, plotted below:
As this illustrates, Eden-Monaro is defined in terms of two distinct polygons, a large one on the Australian mainland and a small island. However, the large region has a hole in the middle (the hole exists because the Australian Capital Territory is a distinct political unit that is wholly contained within Eden-Monaro, and as illustrated above, electoral boundaries in Australia do not cross state lines). In sf terminology this is an example of a
MULTIPOLYGON geometry. In this section I’ll talk about the structure of these objects and how to work with them.
First, let’s use dplyr to grab only the geometry object:
The metadata for the edenmonaro object can accessed using helper functions. For example,
st_geometry_type() extracts the geometry type (e.g.,
st_dimension() extracts the number of dimensions (2 for XY data, 3 for XYZ),
st_bbox() extracts the bounding box as a numeric vector, and
st_crs() extracts the CRS as a list with two components, one for the EPSG code and the other for the proj4string. For example:
Normally when we print the
edenmonaro object the output would display all the additional information (dimension, bounding box, geodetic datum etc) but for the remainder of this section I will show only the relevant lines of the output. In this case edenmonaro is defined by a MULTIPOLYGON geometry containing one feature:
However, we can “cast” the MULTIPOLYGON into the two distinct POLYGON geometries from which it is constructed using
To illustrate when this might be useful, consider the Dawson electorate, which consists of 69 islands in addition to a coastal region on the Australian mainland.
Suppose, however, our interest is only in mapping the islands. If so, we can first use the
st_cast() function to break the Dawson electorate into the constituent polygons, and then use
st_area() to calculate the area of each polygon:
dawson <- st_cast(dawson, "POLYGON") st_area(dawson) #> Units: [m^2] #>  1.91e+05 5.88e+05 2.64e+07 1.76e+05 1.13e+06 3.00e+05 2.09e+05 4.58e+05 #>  1.47e+05 9.13e+05 1.17e+06 1.21e+06 2.50e+05 2.81e+05 7.97e+05 1.75e+05 #>  3.23e+05 2.68e+06 3.49e+06 2.34e+05 1.89e+05 4.84e+05 2.02e+06 4.76e+05 #>  1.90e+05 6.62e+06 5.35e+07 4.55e+06 1.54e+07 6.38e+06 3.97e+06 8.19e+05 #>  6.88e+05 1.36e+05 2.93e+06 6.17e+05 3.81e+05 4.20e+05 2.70e+06 2.39e+06 #>  7.66e+05 1.06e+08 1.03e+07 7.60e+06 4.43e+06 4.44e+05 1.96e+05 2.15e+06 #>  6.84e+05 1.54e+06 8.11e+05 2.11e+06 1.90e+05 1.41e+05 1.09e+07 1.47e+06 #>  1.69e+06 5.32e+06 4.41e+06 1.03e+07 6.49e+06 3.92e+05 1.09e+06 4.00e+05 #>  3.43e+05 4.23e+05 4.17e+06 1.08e+05 1.11e+10 3.42e+07
The large mainland region corresponds to the 69th polygon within Dawson. Armed with this knowledge, we can draw a map showing only the islands:
7.5 Raster maps
Instead of displaying context with vector boundaries, you might want to draw a traditional map underneath. This is called a raster image. The easiest way to get a raster map of a given area is to use the ggmap package, which allows you to get data from a variety of online mapping sources including OpenStreetMap and Google Maps. Downloading the raster data is often time consuming so it’s a good idea to cache it in a rds file, as illustrated here:
Note that finding the appropriate
scale in this example required a lot of manual tweaking, so it may take some effort to obtain the data you want. Once you have the raster data, the
ggmap() function from the ggmap package allows you to create a plot drawing the map, as illustrated in the left panel below:
The nice thing about the
ggmap() function is that the output is a ggplot object, so adding metadata to the plot is accomplished in the same way one would add layers to any other plot. This is illustrated in the right side plot above.
7.6 Data sources
The USAboundaries package, https://github.com/ropensci/USAboundaries contains state, county and zip code data for the US (Mullen and Bratt 2018). As well as current boundaries, it also has state and county boundaries going back to the 1600s.
The tigris package, https://github.com/walkerke/tigris, makes it easy to access the US Census TIGRIS shapefiles (Walker 2019). It contains state, county, zipcode, and census tract boundaries, as well as many other useful datasets.
The rnaturalearth package (South 2017) bundles up the free, high-quality data from http://naturalearthdata.com/. It contains country borders, and borders for the top-level region within each country (e.g. states in the USA, regions in France, counties in the UK).
The osmar package, https://cran.r-project.org/package=osmar wraps up the OpenStreetMap API so you can access a wide range of vector data including individual streets and buildings (Eugster and Schlesinger 2010)
If you have your own shape files (
.shp) you can load them into R with
Eugster, Manuel J. A., and Thomas Schlesinger. 2010. “Osmar: OpenStreetMap and R.” R Journal. http://osmar.r-forge.r-project.org/RJpreprint.pdf.
Lovelace, Robin, Jakub Nowosad, and Jannes Muenchow. 2019. Geocomputation with R. CRC Press.
Mullen, Lincoln A., and Jordan Bratt. 2018. “USAboundaries: Historical and Contemporary Boundaries of the United States of America.” Journal of Open Source Software 3 (23): 314. https://doi.org/10.21105/joss.00314.
Pebesma, Edzer. 2018. “Simple Features for R: Standardized Support for Spatial Vector Data.” The R Journal 10 (1): 439–46. https://doi.org/10.32614/RJ-2018-009.
South, Andy. 2017. Rnaturalearth: World Map Data from Natural Earth. https://CRAN.R-project.org/package=rnaturalearth.
Sumner, Michael. 2019. Ozmaps: Australia Maps. https://CRAN.R-project.org/package=ozmaps.
Teucher, Andy, and Kenton Russell. 2018. Rmapshaper: Client for ’Mapshaper’ for ’Geospatial’ Operations. https://CRAN.R-project.org/package=rmapshaper.
Walker, Kyle. 2019. Tigris: Load Census Tiger/Line Shapefiles. https://CRAN.R-project.org/package=tigris.
If there are multiple data sets with a different associated CRS, it uses the CRS from the first layer.↩