-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathintroduction.qmd
350 lines (226 loc) · 17.7 KB
/
introduction.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
# Introduction {#sec-introduction .unnumbered}
```{r}
#| echo: false
#| message: false
#| results: asis
source("warning.R")
```
Data visualization is one of those things that can look easy when you see someone doing it well. Then you try it yourself and it’s easy to feel like you know nothing.
I wanted to share the tools that make it easier to create those beautiful visualizations. And to show you how to use those tools together, not only with **ggplot2** [@wickham2024a] but with each other.
One bit about terminology before we go any further. *Chart*, *diagram*, *graph*, *plot*, and *visualization* are all terms that people have tried to give strict definitions to. But in reality, they get used more or less interchangeably. Is it a bar chart or a bar graph? Yes.
In the context of this book, I try to use the terms consistently in this way:
- **Chart / Diagram / Graph**: Type of visualization (e.g. bar chart, Venn diagram, or line graph)
- **Plot**: Output of some code to create a visualization. It can contain many charts, diagrams, and/or graphs. But sometimes it's also used to mean a type of visualization (e.g. scatter plot)
- **Visualization**: The most general of the terms. It can be used to talk about visualization as a concept. But sometimes it's also used for a more specific case.
## Purpose and scope {#sec-purpose-and-scope}
A very basic reason why I started writing this book was the fact that one didn’t already exist. “Someone had to do it”, as they say.
I know there is information out there about the different ggplot2 extension packages. There are the package documentation, but there are also some blogs around (see @sec-further-resources).
But that information is quite scattered. There were many great packages I hadn’t even heard about before I started doing research for the book.
I’ve tried my best to test and write about the most useful among them. After all, there are hundreds of packages to choose from. And I don’t think everyone should have to go through them all. Don’t try this at home!
### What you will learn {#sec-what-you-will-learn}
This book continues from where **ggplot2: Elegant Graphics for Data Analysis** [@wickham2016] left off. In his book, Hadley does mention some extension packages by name. But until now there hasn’t been a comprehensive guide to those extension packages. And how to use them together.
**Basics** works as kind of a "Previously, on **ggplot2**...". A reminder of the basic function(alitie)s of the *core* package.
**Layers** goes through the layers that make up *the grammar of graphics*: **Data**, geometric objects (**Geoms**), statistical transformations (**Stats**), titles and labels (**Annotations**), color, position and size of visual elements (**Scales and guides**), **Coordinate systems**, small multiples (**Facets**), and **Themes**. They are in a logical order. As are the rest of the chapters. The same order that you would build your visualization in.
**Special cases** hosts not-so-typical visualizations: **Maps**, **networks**, **interactive plots**, **animated plots**, and graphs integrated in **tables**.
**Shortcuts** contains different kinds of **helpers** and **interactive tools**. Designed to make it easier to create your visualizations.
At some point, we will be **getting ready to publish** our visualization. You'll want to give them **finishing touches**, but it's also good to know about **arranging plots**.
And while ggplot2 is an R package, its impact and influence go **beyond R**. So we'll talk about **Python** a bit.
To finish off, we go through some **case studies**. We build visualizations, step-by-step, from data to publication.
### What you won’t learn {#sec-what-you-wont-learn}
These are some of the interesting topics that I had to leave out of the book. It's long as it is and I decided to concentrate on the topics and extensions that everyone can use. Regardless of the field you're in.
**Combining visualizations and BI tools**
Using R in general and ggplot2 visualizations in specific within BI tools like Power BI is cool. Interested? You can start by reading Luca Zavarella's book [Extending Power BI with Python and R](https://www.packtpub.com/en-us/product/extending-power-bi-with-python-and-r-9781837639533).
**Generative art**
Creating art with R - color me impressed! It's called Rtistry or aRt, because why not? There isn’t enough room in the book for a proper chapter on this, but here are some links to get you started on the topic:
- [artpack by Meghan Harris](https://meghansaha.github.io/artpack/)
- [aRtsy by Koen Derks](https://koenderks.github.io/aRtsy/)
- [Generative Art by Nicola Rennie](https://nrennie.rbind.io/projects/generative-art/)
**Large language models (LLM)**
I know, it would be trendy to write about AI. And I do think they can play a part in the data visualization process. But there isn’t anything that meaningful to write about. Whatever you do, whether you use AI or not, do trust your senses!
A machine doesn't need to visualize data to use it. So, in a way, data visualization is one of the last bastions of human-to-human communication. By humans, for humans, am I right?
**Specialized packages in specialized fields like bioinformatics**
This one should be self-evident. "Write what you know". I know nothing about bioinformatics. I'm not even sure I understand what the term means.
Are you interested in bioinformatics? I mean, I don't want to leave you hanging. There is a website and a suite of packages called [Bioconductor](https://bioconductor.org/) that should get you started!
**Sports**
It's the same as with generative art. Interesting topic, but not enough space. Still, here's another short list of packages related to sports:
**Baseball**
- [baseballr by Saiem Gilani](https://billpetti.github.io/baseballr/)
- [mlbplotR by Camden Kay](https://camdenk.github.io/mlbplotR/)
**Football**
- [nflplotR by Sebastian Carl](https://nflplotr.nflverse.com/)
**Soccer**
- [ggshakeR by Abhishek Amol Mishra](https://abhiamishra.github.io/ggshakeR/)
- [ggsoccer by Ben Torvaney](https://torvaney.github.io/ggsoccer/)
- [soccerAnimate by Ismael Gómez Schmidt](https://github.com/Dato-Futbol/soccerAnimate)
**Various**
- [sportyR by Ross Drucker](https://sportyr.sportsdataverse.org/)
## Who is the book for? {#sec-who-is-the-book-for}
I assume that you’re reading this because you’re interested in data visualization, R, or both.
I already wrote about this in the Welcome section. This book is for people who already have a working knowledge of ggplot2 and *the grammar of graphics*. I’m quite adamant on that. You should learn the foundations first. And there are already excellent books and websites for that.
Other than that, the tools and techniques in the book aren’t all that advanced (at least most of them). You don’t have to be a visualization wizard to find the book useful. Each of the chapters will contain code that you can use almost verbatim.
It doesn’t even matter if R isn’t your *native* language. You might find an inspiring R package and then go on and find a similar package in your preferred language. Or write that package, if one doesn’t already exist. It’s all good!
## Who am I to write this book? {#sec-who-am-i-to-write-this-book}
Hi! I’m Antti Rask, a data analyst if you had to choose one label. I’ve lived many lives and had many careers before finding my truest calling in data.
I’ve liked data visualizations longer than I’ve been using R (which I started in late 2019/early 2020). I made my first visualizations with Excel. At some point, I switched to Power BI. Which is still my main tool for visualization at work.
You could say that I am an artistic type. Music has been the main avenue for my artistic self-expression for most of my life. But I’ve also liked to take photos of street art. I come across them on my adventures in Helsinki, or when I’m abroad.
In visualizing data I find myself leaning more towards the **Stephen Few** school of thought. Less is usually more and you should avoid [chartjunk](https://en.wikipedia.org/wiki/Chartjunk) at all costs. Still, I’m not a purist, by any means.
I’ve co-authored the **RandomWalker** [@sanderson2024] R package with **Steven Paul Sanderson II** ([\@spsanderson](https://github.com/spsanderson)). Specifically the function `RandomWalker::visualize_walks()`. It uses ggplot2 and some of the extensions found in this book.
I also like to tinker with R Shiny and recently made an app called [TuneTeller](https://youcanbeapirate.shinyapps.io/TuneTeller/). It received an *honorable mention* in the 2024 [Shiny Contest](https://posit.co/blog/winners-of-the-2024-shiny-contest/).
## R {#sec-r}
All the code in this book is in **R**. Except for @sec-python. If you have never used R, [R for Data Science](https://r4ds.hadley.nz/) [@wickham2023a] is a better place to start.
Don't have R yet? You can get the latest version from **CRAN** (the comprehensive R archive network): <https://www.r-project.org/>.
## IDE {#sec-ide}
**RStudio** is my integrated development environment (IDE) of choice. It's free and open-source. It's the environment where you can try out the code you find in this book. And it's also where I wrote this book.
You can download the latest version from the **Posit** website: <https://posit.co/download/rstudio-desktop/>. Posit is the company that develops RStudio and many R packages. Like the **tidyverse** [@wickham2023b] (see [@sec-ggplot2-tidyverse]).
There are other alternatives like [VSCode](https://code.visualstudio.com/), [Positron](https://positron.posit.co/), and [JupyterLab](https://jupyter.org/). This book does operate under the assumption that you are using RStudio.
## Packages {#sec-packages}
### ggplot2 & tidyverse {#sec-ggplot2-tidyverse}
The tidyverse is a collection of R packages. ggplot2 is one of them. The others are **dplyr**, **forcats**, **lubridate**, **purrr**, **readr**, **stringr**, **tibble**, and **tidyr**.
You could install the packages you need one by one, but you might as well install the whole tidyverse:
```{r}
#| eval: false
install.packages("tidyverse")
```
To highlight each of the packages used, I've decided to load the individual packages in the code examples. If you want to, you can use the following:
```{r}
#| results: asis
library(tidyverse)
```
The previous message mentions **conflicted** [@wickham2023d]. It's not part of the core tidyverse, but it’s one of my favorite packages. And I do recommend loading it whenever you’re writing new code.
It checks for conflicting function names and forces you to choose the one you want to use. Either use case by use case (for example `dplyr::filter()`) or session-wide (for example `conflicts_prefer(dplyr::filter())`).
To install conflicted:
```{r}
#| eval: false
install.packages("conflicted")
```
All you need to do is load the package at the beginning of your script:
```{r}
#| eval: false
library(conflicted)
```
It will then let you know if a conflict emerges. One fewer reason to have your code act weird when running it.
### ggplot2 extensions {#sec-ggplot2-extensions}
Here are the main characters of this book.
Some of them you can install from CRAN:
```{r}
#| eval: false
install.packages(
c(
"DataExplorer",
"ExPanDaR",
"GGally",
"ggcorrplot",
"ggplot2movies",
"ggquickeda",
"gt",
"gtExtras",
"naniar",
"radiant"
)
)
```
Others you will have to install from GitHub:
```{r}
#| eval: false
# For fully updating radiant
options(repos = c(RSM = "https://radiant-rstats.github.io/minicran", CRAN = "https://cloud.r-project.org"))
install.packages("remotes")
remotes::install_github("radiant-rstats/radiant.update", upgrade = "never")
```
### Others {#sec-others}
Here are the packages that are not part of the tidyverse. They aren't considered to be ggplot2 extensions either. They are still useful for what we want to do and worth highlighting instead of hiding.
```{r}
#| eval: false
install.packages(
c(
"janitor",
"scales"
)
)
```
## Next steps? {#sec-next-steps}
If you are reading these chapters in chronological order, the next step will be to read the rest of the book.
Of course, I recommend you don't only read the book. I've learned the most from reading these technical books when I've also gone through the code examples. Debugged them (of course I hope the code I provide isn't broken...) when needed. Even translated them to another language (Python to R) or *dialect* (base R to tidyverse).
Next, it's practice, practice, practice. Recreate your favorite data visualizations using only ggplot2 and its extensions. Create something completely new. Either using your data (fitness, music, social media, etc.) or even something from [@sec-data]. There are also concepts like **TidyTuesday** (see [@sec-further-resources]).
## Further resources {#sec-further-resources}
### Blogs {#sec-blogs}
- [Albert Rapp](https://albert-rapp.de/ggplot-series/)
- [Bruno Mioto](https://www.brunomioto.com/)
- [Cédric Scherer](https://www.cedricscherer.com/)
- [Dominic Royé](https://dominicroye.github.io/en/)
- [Georgios Karamanis](https://karaman.is/)
- [Matt Dancho](https://www.business-science.io/blog/index.html)
- [Nicola Rennie](https://nrennie.rbind.io/blog/)
### Books {#sec-books}
- [BBC Visual and Data Journalism cookbook for R graphics](https://bbc.github.io/rcookbook/)
- [Data visualisation using R, for researchers who don’t use R by Emily Nordmann et al.](https://psyteachr.github.io/introdataviz/index.html)
- [ggplot2: Elegant Graphics for Data Analysis (3e) by Hadley Wickham et al.](https://ggplot2-book.org/)
- [An Introduction to ggplot2 by Ozancan Ozdemir](https://bookdown.org/ozancanozdemir/introduction-to-ggplot2/)
- [Modern Data Visualization with R by Robert Kabacoff](https://rkabacoff.github.io/datavis/)
- [R Gallery Book by Kyle W. Brown](https://bookdown.org/content/b298e479-b1ab-49fa-b83d-a57c2b034d49/)
- [R Graphics Cookbook, 2nd edition by Winston Chang](https://r-graphics.org/)
- [Solutions to ggplot2: Elegant Graphics for Data Analysis by Howard Baek](https://ggplot2-book-solutions-3ed.netlify.app/)
- [The Missing Book by Nicholas Tierney & Allison Horst](https://tmb.njtierney.com/)
### GitHub Repos {#sec-github-repos}
- [Awesome ggplot2 by Erik Gahner](https://github.com/erikgahner/awesome-ggplot2/)
- [Awesome R Dataviz by Krzysztof Joachimiak](https://github.com/krzjoa/awesome-r-dataviz/)
- [ggplot2 extenders by Gina Reynolds & Teun van den Brand](https://ggplot2-extenders.github.io/ggplot-extension-club/)
- [ggplot Tricks by Teun van den Brand](https://github.com/teunbrand/ggplot_tricks/)
- [TidyTuesday by Tom Mock](https://github.com/rfordatascience/tidytuesday)
### Websites {#sec-websites}
- [Packages (ggplot2) by Antti Rask](https://tinyurl.com/ggplot2packages/)
- [R for Artists and Designers by Arvind Venkatadri](https://r-for-artists.netlify.app/)
- [R for the Rest of Us by David Keys](https://rfortherestofus.com/blog/)
- [R Graph Gallery by Yan Holtz](https://r-graph-gallery.com/)
- [Statistics Globe by Joachim Schork](https://statisticsglobe.com/graphics-in-r)
### YouTube - Channels {#sec-youtube-channels}
- [Albert Rapp](https://www.youtube.com/@rappa753/)
- [Statistics Globe by Joachim Schork](https://www.youtube.com/@StatisticsGlobe/)
### YouTube - Videos {#sec-youtube-videos}
- [ggplot2 extenders - Hadley Wickham: Thoughts on the ggplot2 extension ecosystem](https://www.youtube.com/watch?v=kjjcgdkowXs)
## Acknowledgments {#sec-introduction-acknowledgments}
[Eevamaija Virtanen](https://www.linkedin.com/in/eevamaijavirtanen/), [Juha Korpela](https://www.linkedin.com/in/jkorpela/), and [Säde Haveri](https://www.linkedin.com/in/sade-haveri/) for founding [Helsinki Data Week](https://www.helsinkidataweek.com/) with me. For helping me dream big! And for making me realize a big project like this is possible when you keep making constant progress.
**Marc Eixarch** ([\@Marceix](https://github.com/Marceix/)) and **Vicent Boned Riera** ([\@eivicent](https://github.com/eivicent/)) from the [R User Group Finland](https://www.meetup.com/r-user-group-finland/). Learning R has been more fun thanks to your efforts in keeping the local R scene active in Helsinki, Finland.
[Joe Reis](https://www.linkedin.com/in/josephreis/), [Ole Olesen-Bagneux](https://www.linkedin.com/in/ole-olesen-bagneux/), and [Vin Vashishta](https://www.linkedin.com/in/vineetvashishta/) for your generosity and time. And for setting a positive example for someone like me to follow.
**Hadley Wickham** ([\@hadley](https://github.com/hadley/)) for all the books and packages. **R for Data Science** [@wickham2023a] got me started with R. **ggplot2: Elegant Graphics for Data Analysis** [@wickham2016] helped me understand ggplot2 and *the grammar of graphics* on a deeper level.
To be continued...
## Colophon {#sec-colophon}
This book was written in [RStudio](https://posit.co/products/open-source/rstudio/) using [Quarto](https://quarto.org/). The website is hosted with [Netlify](https://www.netlify.com/). And it's automatically updated after every commit by [Github Actions](https://github.com/features/actions). The complete source is available from [GitHub](https://github.com/AnttiRask/ggplot2-extended-book/).
This version of the book was built with `r R.version.string` and the following packages:
```{r}
#| echo: false
pkgs <- c(
"conflicted",
"DataExplorer",
"dplyr",
"ExPanDaR",
"GGally",
"ggcorrplot",
"ggplot2",
"ggplot2movies",
"ggquickeda",
"gt",
"gtExtras",
"janitor",
"naniar",
"purrr",
"Radiant",
"readr",
"rmarkdown",
"scales",
"tibble",
"tidyr"
)
```
```{r}
#| echo: false
#| message: false
#| results: asis
library(dplyr)
library(rmarkdown)
library(sessioninfo)
package_info(pkgs, dependencies = FALSE) %>%
as_tibble() %>%
select(package, version = ondiskversion, source) %>%
paged_table()
```