# Reusable data visualization

## Radovan Bast ([fosstodon.org/@radovan](https://fosstodon.org/@radovan))

### UiT The Arctic University of Norway

Text: CC-BY 4.0

---

## About me

.left-column30[
<img src="https://cdn.jsdelivr.net/gh/bast/data-visualization@4aa0586a42b54fd74ca4a16e8784f97826fd8ca7/img/avatar.jpg" style="width: 80%;"/>
]

- I write research software, teach programming to researchers, and lead the
  [CodeRefinery project](https://coderefinery.org).

- I lead the [high-performance computing group](https://hpc.uit.no) and the
  [research software engineering group](https://research-software.uit.no) at UiT.

---

## CodeRefinery

We teach all the **essential tools** which are usually skipped in academic
education so everyone can make full use of software, computing, and data.

.left-column50[
<img src="https://cdn.jsdelivr.net/gh/bast/data-visualization@4aa0586a42b54fd74ca4a16e8784f97826fd8ca7/img/map.jpg" style="width: 100%;"/>

- https://coderefinery.org
- https://coderefinery.org/workshops/past/
]
.right-column50[
<img src="https://cdn.jsdelivr.net/gh/bast/data-visualization@4aa0586a42b54fd74ca4a16e8784f97826fd8ca7/img/coderefinery.png" style="padding-left: 30px; width: 60%;"/>
]

---

## Goals for this course/lesson

### Our focus

- Data visualization for .emph[publications and presentations] within and outside academia

- .emph[Practical] recommendations

- .emph[Reproducibility] **for you** and others

- Know which tools exist -> .emph[good starting points]

### What I will not focus on

- Programming languages and technical details of tools

- Data visualization for the general public (newspapers, television)

---

.quote["One thing I have learned over the years is that automation is your
friend. I think figures should be autogenerated as part of the data analysis
pipeline (which should also be automated), and they should come out of the
pipeline ready to be sent to the printer, no manual post-processing needed."]

---

## 2 take-home messages

### Prefer tools that can be automated/scripted

- If data or requirements change, somebody will have to update figures.

- Automation makes it a bit easier.

### Optimize for comprehension and accessibility

- So that we don't have to study the plot for 20 minutes with eyes hurting to
  get the message.

- Font size, colors, suitable representation, good title, and caption.

---

# Why visualize data?

---

## Anscombe's quartet

.left-column60[
<img src="https://cdn.jsdelivr.net/gh/bast/data-visualization@4aa0586a42b54fd74ca4a16e8784f97826fd8ca7/jupyter/quartet/quartet.png" alt="Anscombe's quartet" width="80%">
]

.right-column40[
All four plots have the .emph[same] mean of *x* and *y*, sample variance of *x* and
*y*, correlation between *x* and *y*, linear regression line, and *R²* coefficient.
]
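
The claim can be checked numerically. The following is a minimal sketch that uses only the Python standard library; the data values are the ones commonly published for Anscombe's quartet, and `pearson_r` is a helper written for this illustration, not part of any library.

```python
from math import sqrt
from statistics import mean, variance

# Anscombe's quartet as commonly published; sets I-III share the same x values.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def pearson_r(xs, ys):
    """Pearson correlation coefficient (illustrative helper)."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    spread = sqrt(sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys))
    return cov / spread


# Summary statistics per data set: means, sample variances, correlation.
summary = {
    name: {
        "mean_x": mean(xs),
        "mean_y": mean(ys),
        "var_x": variance(xs),
        "var_y": variance(ys),
        "r": pearson_r(xs, ys),
    }
    for name, (xs, ys) in quartet.items()
}

for name, stats in summary.items():
    print(name, {key: round(value, 2) for key, value in stats.items()})
```

The four printed rows agree to two decimals, although the plots look nothing alike.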

---

## Same Stats, Different Graphs

.cite[[A. Cairo, "Datasaurus: Never trust summary statistics alone; always visualize your data"](http://www.thefunctionalart.com/2016/08/download-datasaurus-never-trust-summary.html)]

.cite[[J. Matejka, G. Fitzmaurice, "Same Stats, Different Graphs: Generating Datasets with Varied Appearance and Identical Statistics through Simulated Annealing"](https://www.autodeskresearch.com/publications/samestats)]

---

# How many 5s?

464418163541729611394089491019

103214981928889407852268902875

389879353920237244649469321810

290602004777144868218046078720

522890797338149835404330684291

.cite[Inspired by https://courses.cs.washington.edu/courses/cse512/23sp/, in turn inspired by J. Stasko]

---

# How many 5s?

464418163.red[5]41729611394089491019

1032149819288894078.red[5]226890287.red[5]

3898793.red[5]3920237244649469321810

290602004777144868218046078720

.red[5]2289079733814983.red[5]404330684291

.cite[Inspired by https://courses.cs.washington.edu/courses/cse512/23sp/, in turn inspired by J. Stasko]

---

Data visualization is a

## "Visual representation and presentation of data to facilitate understanding"

### Data visualizations map .emph[data values] onto .emph[aesthetics/channels]

- position
- length
- shape
- size
- color
- line width
- line type
- (there exist [many more](https://altair-viz.github.io/user_guide/api.html#encoding-channels))

---

## Why visualize data?

### More insight into data: easier to see patterns and problems

- Both calculations and graphs will contribute to understanding

- Presentations/papers: facilitate understanding

- Communication with the public

.quote[reflect on how important and powerful data visualization is: COVID-19, politics, climate change, ...]

- And we often copy the style and culture

---

# How do you read a paper?

# How do you read posters during a poster session?

(reflect on the value of a good visualization)

---

# What is your design process?

---

## How I design plots

- Sometimes: Sketch with pen and paper

- Browse directories/galleries for inspiration:
  [Vega-Altair](https://altair-viz.github.io/gallery/index.html),
  [Matplotlib](https://matplotlib.org/gallery.html),
  [Seaborn](https://seaborn.pydata.org/examples/index.html),
  [Plotly](https://plotly.com/python/),
  [Bokeh](https://demo.bokeh.org/),
  [ggplot](https://yhat.github.io/ggpy/),
  [PyNGL](https://www.pyngl.ucar.edu/Examples/gallery.shtml),
  [K3D](https://k3d-jupyter.org/showcase/),
  [ggplot2](https://ggplot2.tidyverse.org/),
  [Shiny](https://shiny.rstudio.com/),
  [Data-Driven Documents](https://d3js.org/), ...

- Take an example that is close to what I want

- Try to rerun it with original example data

- Try to replace example data with my own data

- Tweak and refine

---

## Checklist for good visual communication

[This list is adapted from a similar list in a presentation by **L. Garrison,
"Share Your Science: Visualization for Communication"**]

- Define your goals

- Show the data (go beyond summary statistics)

- Be honest with your visuals

- Consider accessibility

- Avoid taxing working memory

- Tell a story

- Reflect on uncertainty and unknowns

---

## Define your goals

- "Before you start, define your goals in 1-3 sentences"
  .cite[L. Garrison, "Share Your Science: Visualization for Communication"]

- Audience?

- Time constraints

---

## Show the data: strip-plot vs box-plot vs violin-plot

---

## Be honest with your visuals

## The principle of "proportional ink"

Examples with a disproportionate data/ink ratio:

---

## Be honest with your visuals

## Another bad example

---

## Accessibility: Avoid 3D plots (unless it's a 3D object)

... unless you are plotting something inherently 3D (molecular structures,
structure of an enzyme, a 3D relief of a terrain)

---

# Accessibility: Colors

"We need five colors for the plot: black ... red ... green ... blue ... ... ... orange?"

---

## Colors

### Consider color vision deficiencies (CVD)

.left-column50[
<img src="https://cdn.jsdelivr.net/gh/bast/data-visualization@4aa0586a42b54fd74ca4a16e8784f97826fd8ca7/img/ishihara.png" alt="ishihara color test plate" width="80%">
]

- View your color figures under [CVD simulations](https://www.color-blindness.com/coblis-color-blindness-simulator/)

- Use color scales designed to be CVD-friendly

---

## Color scales: 3 types

- .emph[Discrete/qualitative] color scales: designed to distinguish

.cite[[Okabe, M., and K. Ito. 2008. "Color Universal Design (CUD): How to Make Figures and Presentations That Are Friendly to Colorblind People."](https://jfly.uni-koeln.de/color/)]

- .emph[Sequential/continuous] color scales: represent data values

- .emph[Diverging] color scales: visualize deviation of data values relative to a neutral midpoint
.cite[ColorBrewer pink to yellow-green]

---

## Discrete/qualitative color scales: designed to distinguish

.left-column50[
<img src="https://cdn.jsdelivr.net/gh/bast/data-visualization@4aa0586a42b54fd74ca4a16e8784f97826fd8ca7/img/colors-okabe-ito.png" alt="okabe ito color scale" height="70px">

- Great for scatter-plots.

- What if you need more than 8 colors? Use direct labeling instead.
]

.right-column50[
<img src="https://cdn.jsdelivr.net/gh/bast/data-visualization@4aa0586a42b54fd74ca4a16e8784f97826fd8ca7/jupyter/colors/colors-scatter.png" alt="scatter plot" width="100%">
]
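
As a small illustration of the 8-color limit, the sketch below assigns Okabe-Ito colors to categories and refuses to continue when there are more categories than colors. The hex values are the ones commonly published for this palette; `assign_colors` and the example species list are assumptions made for this illustration.

```python
# Okabe-Ito palette: hex values as commonly published for this color scale.
OKABE_ITO = {
    "orange": "#E69F00",
    "sky blue": "#56B4E9",
    "bluish green": "#009E73",
    "yellow": "#F0E442",
    "blue": "#0072B2",
    "vermillion": "#D55E00",
    "reddish purple": "#CC79A7",
    "black": "#000000",
}


def assign_colors(categories):
    """Map each distinct category to one palette color, in first-seen order."""
    distinct = list(dict.fromkeys(categories))  # drops repeats, keeps order
    if len(distinct) > len(OKABE_ITO):
        raise ValueError(
            f"{len(distinct)} categories but only {len(OKABE_ITO)} colors: "
            "consider direct labeling instead"
        )
    return dict(zip(distinct, OKABE_ITO.values()))


# Hypothetical category column, with one repeated value.
species = ["arctic fox", "walrus", "reindeer", "polar bear", "seal", "walrus"]
colors = assign_colors(species)
print(colors)
```

Failing loudly is a design choice here: silently recycling colors would make two categories indistinguishable.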

---

## Sequential/continuous color scales: represent data values

.left-column50[
<img src="https://cdn.jsdelivr.net/gh/bast/data-visualization@4aa0586a42b54fd74ca4a16e8784f97826fd8ca7/img/colors-blues.png" alt="blues color scale" height="70px">
<img src="https://cdn.jsdelivr.net/gh/bast/data-visualization@4aa0586a42b54fd74ca4a16e8784f97826fd8ca7/img/colors-rocket.png" alt="rocket color scale" height="70px">

- Great for choropleth plots (here plotting unemployment rate).

- Color vision deficiencies are less of a concern for this type.

- Avoid rainbow scales.
]

.right-column50[
<img src="https://cdn.jsdelivr.net/gh/bast/data-visualization@4aa0586a42b54fd74ca4a16e8784f97826fd8ca7/jupyter/colors/colors-choropleth.png" alt="choropleth plot" width="100%">
]

---

## Diverging color scales: visualize deviation of data values relative to a neutral midpoint

.left-column50[
<img src="https://cdn.jsdelivr.net/gh/bast/data-visualization@4aa0586a42b54fd74ca4a16e8784f97826fd8ca7/img/colors-divergent.png" alt="divergent color scale" height="70px">

- Great for heatmaps.
]

.right-column50[
<img src="https://cdn.jsdelivr.net/gh/bast/data-visualization@4aa0586a42b54fd74ca4a16e8784f97826fd8ca7/jupyter/colors/colors-divergent.png" alt="heatmap plot" width="100%">
]

---

## Colors

### Great resources

- https://clauswilke.com/dataviz/color-pitfalls.html
- https://blog.datawrapper.de/beautifulcolors/
- [Okabe, M., and K. Ito. 2008. "Color Universal Design (CUD): How to Make Figures and Presentations That Are Friendly to Colorblind People."](https://jfly.uni-koeln.de/color/)
- https://seaborn.pydata.org/tutorial/color_palettes.html
- https://colorbrewer2.org/
- https://www.fabiocrameri.ch/colourmaps/
- https://venngage.com/tools/accessible-color-palette-generator

---

## Categories

- So that we know what to search for
- Source of inspiration

### Good overviews

- https://clauswilke.com/dataviz/directory-of-visualizations.html

- https://datavizcatalogue.com/search.html

- https://depictdatastudio.com/charts/

- https://github.com/ft-interactive/chart-doctor/tree/master/visual-vocabulary

- http://chartmaker.visualisingdata.com/

---

# Problematic plots

See also: https://viz.wtf

---

## Example 1

---

## Example 2

---

## Example 3

---

## Example 4

---

## Example 5

---

## Example 6

.left-column50[
<img src="https://cdn.jsdelivr.net/gh/bast/data-visualization@4aa0586a42b54fd74ca4a16e8784f97826fd8ca7/img/graph-crimes/banana.jpg" alt="problematic plot" width="100%">
]

---

## Example 7

---

## Example 8

---

## Example 9

---

## Example 10

.cite[Example taken from ["Fundamentals of Data Visualization", C. O. Wilke](https://clauswilke.com/dataviz/)]

---

## Example 11

.cite[Example taken from ["Fundamentals of Data Visualization", C. O. Wilke](https://clauswilke.com/dataviz/)]

---

## Example 12

.cite[Example taken from ["Fundamentals of Data Visualization", C. O. Wilke](https://clauswilke.com/dataviz/)]

---

# Tell a story

---

## Minard's Visualization Of Napoleon's 1812 March

- Another great example: [1854 Broad Street cholera outbreak](https://en.wikipedia.org/wiki/1854_Broad_Street_cholera_outbreak#Investigation_by_John_Snow)

---

## There is a story in here: can you improve the text?

---

# Reproducibility and FAIR principles

---

## Reproducibility and FAIR principles

.cite[(c) [Scriberia](http://www.scriberia.co.uk) for [The Turing Way](https://the-turing-way.netlify.com), CC-BY]

---

## FAIR: Which problems can you anticipate?

### Findable

### Accessible

### Interoperable

### Reusable

---

# Data formats

---

## What problems can arise when storing data like this?

---

## What problems can arise when storing data like this?

- .emph[Format]: Limited interoperability with other programs
- .emph[Error prone] (see e.g. [this famous example](https://www.washingtonpost.com/news/wonk/wp/2013/04/16/is-the-best-evidence-for-austerity-based-on-an-excel-spreadsheet-error/))
- Difficult to parse ("understand") by scripts: .emph[difficult to automate]
- Not in *tidy format* (more about this later): .emph[difficult to extend/modify]

---

## How should we arrange the data?

.left-column50[
<img src="https://cdn.jsdelivr.net/gh/bast/data-visualization@4aa0586a42b54fd74ca4a16e8784f97826fd8ca7/img/svalbard-compact.png" alt="compact table" height="150px">

<img src="https://cdn.jsdelivr.net/gh/bast/data-visualization@4aa0586a42b54fd74ca4a16e8784f97826fd8ca7/img/svalbard-transposed.png" alt="table wide format transposed" height="120px">
]

How can these 3 examples be problematic for .emph[automated data visualization]?

- In the compact structure we need to split each cell at the comma
- If we add more species or more observation sites, we need to adapt the visualization pipeline

---

## "Tidy data"

.left-column40[
<img src="https://cdn.jsdelivr.net/gh/bast/data-visualization@4aa0586a42b54fd74ca4a16e8784f97826fd8ca7/img/svalbard-tidy.png" alt="table tidy format" width="100%">
]

- Rows are observations/measurements

- "Long form"

- Order does not matter

- .emph[Easy to extend] with more species and more sites

- .emph[Structure for storing data] - this does not mean that this is ideal
  for tables in presentations or publications
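
A minimal sketch of the conversion from a wide table to tidy rows, using only the Python standard library. The wide layout below (one row per species, one column per observation site, `None` for "no entry") is an assumption made for this illustration, and `wide_to_tidy` is a hypothetical helper.

```python
from collections import Counter

# Assumed wide layout: one row per species, one column per observation site.
WIDE_HEADER = ["Species", "A", "B", "C"]
WIDE_ROWS = [
    ["arctic fox", 3, 1, None],
    ["walrus", None, 1, 1],
    ["reindeer", None, 10, 1],
    ["polar bear", 1, None, 1],
    ["seal", 2, 1, 2],
]


def wide_to_tidy(header, rows):
    """Return one dict per observation (species, site, count)."""
    _, *sites = header  # first column holds the species name
    tidy = []
    for species, *counts in rows:
        for site, count in zip(sites, counts):
            if count is not None:  # skip empty cells
                tidy.append({
                    "Species": species,
                    "Observation site": site,
                    "Number of sightings": count,
                })
    return tidy


tidy = wide_to_tidy(WIDE_HEADER, WIDE_ROWS)

# Downstream code never mentions "A", "B", or "C": new sites or species
# only add rows, so this aggregation keeps working unchanged.
per_site = Counter()
for row in tidy:
    per_site[row["Observation site"]] += row["Number of sightings"]
print(dict(per_site))
```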

---

## Standard data formats

```csv
Species,Observation site,Number of sightings
arctic fox,A,3
arctic fox,B,1
walrus,B,1
walrus,C,1
reindeer,B,10
reindeer,C,1
polar bear,A,1
polar bear,C,1
seal,A,2
seal,B,1
seal,C,2
```

- CSV is often a good choice
- Most visualization tools can read CSV data

- [JSON](https://en.wikipedia.org/wiki/JSON)
- [XML](https://en.wikipedia.org/wiki/XML)
- [GeoJSON](https://geojson.org/)
- [NPY (NumPy arrays)](https://numpy.org/doc/stable/reference/generated/numpy.lib.format.html)
- [HDF5](https://en.wikipedia.org/wiki/Hierarchical_Data_Format)
- [SQL](https://en.wikipedia.org/wiki/SQL)
- Many domain-specific formats (such as [NetCDF](https://www.unidata.ucar.edu/software/netcdf/))
- .emph[Use standard formats, don't invent your own]
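
One reason standard formats pay off is that reading them takes only a few lines. The sketch below parses the CSV table from this slide with Python's standard `csv` module, sums sightings per species, and converts the rows to JSON. The data is embedded as a string only to keep the example self-contained; in a real pipeline it would live in a file.

```python
import csv
import io
import json
from collections import Counter

# Same table as on the slide, embedded so the sketch runs without a file.
CSV_TEXT = """Species,Observation site,Number of sightings
arctic fox,A,3
arctic fox,B,1
walrus,B,1
walrus,C,1
reindeer,B,10
reindeer,C,1
polar bear,A,1
polar bear,C,1
seal,A,2
seal,B,1
seal,C,2
"""

# DictReader uses the header line as keys; all values arrive as strings.
rows = list(csv.DictReader(io.StringIO(CSV_TEXT)))

totals = Counter()
for row in rows:
    totals[row["Species"]] += int(row["Number of sightings"])
print(dict(totals))

# Moving between standard formats is a one-liner.
as_json = json.dumps(rows, indent=2)
```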

---

## Data cleaning

- Often we want to visualize data sets with inconsistent or missing entries:

```csv
Date,Organization,Number of participants
2020-09-27,UiT,20
Oct 10 2020,UiT Norges arktiske universitet,15
"Nov. 11, 2020",UiT The Arctic University of Norway,40
2020-12-12,UiT The Arctic University of Norway,-
```

Data cleaning is a bit outside the scope of this course, but it is still good to know:
- There are tools to clean and merge inconsistent data sets (e.g. [OpenRefine](https://openrefine.org/), see also
  [this Data Carpentry lesson](https://datacarpentry.org/OpenRefine-ecology-lesson/))
- This does not have to be done manually
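
For a table this small, a short script can do the cleaning. The sketch below normalizes the example above with the Python standard library. The list of date formats, the alias table that treats all three organization spellings as the same institution, and the rule that `-` means "missing" are assumptions made for this illustration; month abbreviations are parsed assuming an English locale.

```python
import csv
import io
from datetime import datetime

RAW = """Date,Organization,Number of participants
2020-09-27,UiT,20
Oct 10 2020,UiT Norges arktiske universitet,15
"Nov. 11, 2020",UiT The Arctic University of Norway,40
2020-12-12,UiT The Arctic University of Norway,-
"""

# Assumption: these three formats cover every date spelling in the data.
DATE_FORMATS = ("%Y-%m-%d", "%b %d %Y", "%b. %d, %Y")

# Assumption: all spellings refer to the same organization.
CANONICAL_ORG = "UiT The Arctic University of Norway"
ORG_ALIASES = {
    "UiT": CANONICAL_ORG,
    "UiT Norges arktiske universitet": CANONICAL_ORG,
}


def parse_date(text):
    """Try each known format and return an ISO 8601 date string."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"unrecognized date: {text!r}")


def parse_count(text):
    """Return an int, or None for placeholders such as '-'."""
    text = text.strip()
    return int(text) if text.isdigit() else None


cleaned = [
    {
        "Date": parse_date(row["Date"]),
        "Organization": ORG_ALIASES.get(row["Organization"], row["Organization"]),
        "Number of participants": parse_count(row["Number of participants"]),
    }
    for row in csv.DictReader(io.StringIO(RAW))
]

for row in cleaned:
    print(row)
```

Unknown date spellings raise an error instead of being guessed, so new inconsistencies surface early.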

---

# Choosing the right tools

---

## Choosing the right tools: scriptable

### There is no single perfect language or library for everything

- You will have to choose what fits you and your group best

- We will show examples using .emph[Python, R, and JavaScript]

### No manual post-processing

- Manual post-processing will bite you when you need to regenerate
  50 figures one day before the submission deadline, or regenerate a set of figures
  after the person who created them has left the group.

- Use software that can be scripted: batch processing and reproducibility (more about that in the next section).
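
To make "batch processing" concrete, here is a minimal sketch of a script that regenerates every figure from data in one call. To stay dependency-free it writes very simple SVG bar charts by hand; in practice `bar_chart_svg` would be replaced by a call to your plotting library. The function names, the output directory, and the per-site grouping of the sightings data are assumptions made for this illustration.

```python
import tempfile
from html import escape
from pathlib import Path


def bar_chart_svg(title, values, bar_width=60, height=120):
    """Return a minimal SVG bar chart for a {label: number} mapping."""
    top = max(values.values())
    width = bar_width * len(values)
    parts = [
        f'<svg xmlns="http://www.w3.org/2000/svg" width="{width}" height="{height + 40}">',
        f"<title>{escape(title)}</title>",
    ]
    for i, (label, value) in enumerate(values.items()):
        bar = height * value / top  # scale the tallest bar to full height
        x = i * bar_width + 5
        parts.append(
            f'<rect x="{x}" y="{height - bar + 20}" width="{bar_width - 10}" '
            f'height="{bar}" fill="#0072B2"/>'
        )
        parts.append(f'<text x="{x}" y="{height + 35}" font-size="10">{escape(label)}</text>')
    parts.append("</svg>")
    return "\n".join(parts)


def regenerate_all(datasets, out_dir):
    """Write one figure per data set; rerun whenever the data changes."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for name, values in datasets.items():
        path = out_dir / f"{name}.svg"
        path.write_text(bar_chart_svg(name, values), encoding="utf-8")
        written.append(path)
    return written


# Sightings grouped per observation site (hypothetical grouping).
datasets = {
    "site_A": {"arctic fox": 3, "polar bear": 1, "seal": 2},
    "site_B": {"arctic fox": 1, "walrus": 1, "reindeer": 10, "seal": 1},
    "site_C": {"walrus": 1, "reindeer": 1, "polar bear": 1, "seal": 2},
}

figures = regenerate_all(datasets, Path(tempfile.mkdtemp()) / "figures")
print([path.name for path in figures])
```

Whether there are 3 or 50 figures, regenerating them after a data change is the same single command.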

---

## Choosing the right tools: free

### Use free software and free tools

- Even if the university pays for a license, what happens after you leave the university
  or after they stop paying? How can other groups build on your work?

- .emph[Python and R are free] and popular for "notebook"-based pipelines, and a number of
  .emph[JavaScript frameworks] exist as well, especially for maps.

- Plain text files for small datasets.

- Standard formats instead of proprietary formats.

### For any academic discipline it will be a good investment to learn a bit of Python or R if you want to do data visualization

---

## Visualization libraries (incomplete list)

Two main families: procedural (e.g. Matplotlib) and declarative.

### Python

- [Vega-Altair](https://altair-viz.github.io/gallery/index.html): declarative visualization
- [Matplotlib](https://matplotlib.org/gallery.html): MATLAB users will feel at home
- [Seaborn](https://seaborn.pydata.org/examples/index.html): statistical functions built in
- [Plotly](https://plotly.com/python/): interactive graphs
- [Bokeh](https://demo.bokeh.org/): also good for interactivity
- [ggplot](https://yhat.github.io/ggpy/): R users will feel more at home
- [PyNGL](https://www.pyngl.ucar.edu/Examples/gallery.shtml): used in the weather forecast community
- [K3D](https://k3d-jupyter.org/showcase/): Jupyter notebook extension for 3D visualization

### R

- [ggplot2](https://ggplot2.tidyverse.org/): system for declaratively creating graphics, based on the grammar of graphics
- [Shiny](https://shiny.rstudio.com/): interactive graphs and notebooks

### JavaScript

- [Data-Driven Documents](https://d3js.org/)

---

# Data visualization using Python

https://coderefinery.github.io/data-visualization-python/

(co-created by the author of these slides)

---

# Reproducible and reusable plots

---

.cite[Juliette Taka, Logilab and the OpenDreamKit project (2017), https://opendreamkit.org/2017/11/02/use-case-publishing-reproducible-notebooks/]

---

## .emph[Demo]: visualization pipeline on [Binder](https://mybinder.org/)

- Python/[Altair](https://altair-viz.github.io/) on [Jupyter](https://jupyter.org/) served via [Binder](https://mybinder.org/):
  https://github.com/bast/jupyter-binder-example

- R/[ggplot2](https://ggplot2.tidyverse.org/) on [RStudio](https://rstudio.com/)/[R Markdown](https://rmarkdown.rstudio.com/) served via [Binder](https://mybinder.org/):
  https://github.com/bast/rstudio-binder-example

### Other fantastic tools which I will not demonstrate

- [Data-Driven Documents](https://d3js.org/) with [gallery of examples](https://observablehq.com/@d3/gallery)

- Interactive plots with [Shiny](https://shiny.rstudio.com/gallery/)

---

## [Zenodo](https://zenodo.org/) can give you a persistent identifier (DOI) and make your pipeline citable

Rather than specifying a GitHub repository when launching
[Binder](https://mybinder.org/), you can instead use a Zenodo DOI.

---

## Progression

- Start with a working example and try adapting it
- Learn the very basics
  - Learn a bit of Python
  - Or R
- It can be a good idea to start learning right away in a notebook
  - Python: Jupyter
  - R: R Markdown in RStudio
  - [Quarto](https://quarto.org/)
- Later try Binder
- Later learn how to get a DOI for your Binder
- Now your plotting recipe can be cited and is reproducible

### This takes time and it is OK to take time

.quote[If I had six hours to chop down a tree, I’d spend the first four hours sharpening the axe.]
.cite[Abraham Lincoln]

---

## Summary

- Don't forget to **tell a story**

- FAIR principles and reproducibility will be good for you (and for others)

- Document all tools and dependencies used .emph[with versions]

- Prefer .emph[free tools]

- "Data visualization clinic" next week
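
Recording versions can itself be scripted. The sketch below collects the Python version and the installed versions of a few packages with the standard library's `importlib.metadata`. The package names are only examples, and `environment_report` is a hypothetical helper; packages that are not installed are reported as `None`.

```python
import json
import platform
from importlib import metadata


def environment_report(packages):
    """Collect the Python version and installed package versions."""
    report = {"python": platform.python_version(), "packages": {}}
    for name in packages:
        try:
            report["packages"][name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            report["packages"][name] = None  # not installed in this environment
    return report


# Example package list; replace with what your pipeline actually imports.
report = environment_report(["altair", "matplotlib", "pandas"])
print(json.dumps(report, indent=2))
```

Saving this output next to the generated figures documents which environment produced them.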

---

### Books

- ["Fundamentals of Data Visualization", C. O. Wilke](https://clauswilke.com/dataviz/)
- ["Data Visualization: A practical introduction", K. Healy](https://socviz.co/)
- ["Data Visualisation: A Handbook for Data Driven Design", A. Kirk](https://www.visualisingdata.com/book/)

### Papers

- [N. P. Rougier, M. Droettboom, P. E. Bourne, "Ten Simple Rules for Better Figures", PLoS Comput Biol 10(9): e1003833 (2014)](https://doi.org/10.1371/journal.pcbi.1003833)

### Courses/talks

- https://coderefinery.github.io/data-visualization-python/
- https://courses.cs.washington.edu/courses/cse512/23sp/
- https://swcarpentry.github.io/visualization-novice/
- https://www.ub.uio.no/english/courses-events/events/all-libraries/2020/research-bazaar/visualisation.html
- https://ajstewartlang.github.io/SIPS_2019/SIPS_presentation.html