class: center, middle, gray-background # Reusable data visualization ## Radovan Bast ([fosstodon.org/@radovan](https://fosstodon.org/@radovan)) ### UiT The Arctic University of Norway
Text: CC-BY 4.0 --- ## About me .left-column30[
] .right-column70[ - Theoretical chemist turned research software engineer. - I write research software and teach programming to researchers and lead the [CodeRefinery project](https://coderefinery.org). - I lead the [high-performance computing group](https://hpc.uit.no) and the [research software engineering group](https://research-software.uit.no) at UiT. ] --- ## CodeRefinery We teach all the **essential tools** which are usually skipped in academic education so everyone can make full use of software, computing, and data. .left-column50[
- https://coderefinery.org - https://coderefinery.org/workshops/past/ ] .right-column50[
] --- ## Goals for this course/lesson ### Our focus - Data visualization for .emph[publications and presentations] within and outside academia - .emph[Practical] recommendations - .emph[Reproducibility] **for you** and others - Know which tools exist -> .emph[good starting points] ### What I will not focus on - Programming languages and technical details of tools - Data visualization for the general public (newspapers, television) --- .quote["One thing I have learned over the years is that automation is your friend. I think figures should be autogenerated as part of the data analysis pipeline (which should also be automated), and they should come out of the pipeline ready to be sent to the printer, no manual post-processing needed."] .cite[["Fundamentals of Data Visualization", C. O. Wilke](https://clauswilke.com/dataviz/)]
.cite[https://twitter.com/kara_woo/status/1134878080567091200]

---

## 2 take-home messages

### Prefer tools that can be automated/scripted

- If data or requirements change, somebody will have to update figures.
- Automation makes it a bit easier.

### Optimize for comprehension and accessibility

- So that we don't have to study the plot for 20 minutes with eyes hurting to get the message.
- Font size, colors, suitable representation, good title, and caption.

---
class: center, middle, inverse

# Why visualize data?

---

## Anscombe's quartet

.left-column60[
] .right-column40[ All four plots have the .emph[same] mean of x and y, sample variance of *x* and *y*, correlation between *x* and *y*, linear regression line, and *R^2* coefficient. .cite[https://en.wikipedia.org/wiki/Anscombe%27s_quartet] .cite[https://seaborn.pydata.org/examples/anscombes_quartet.html] ] --- ## Same Stats, Different Graphs
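Anscombe's quartet from the previous slide already shows the effect with eleven points per dataset. A minimal sketch using only the Python standard library; the values are the first two of the four datasets as listed on the Wikipedia page cited there, and the helper names `pearson` and `summary` are illustrative:

```python
from statistics import mean, variance

# First two datasets of Anscombe's quartet; both share the same x values.
x = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y1 = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
y2 = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]


def pearson(a, b):
    """Pearson correlation coefficient of two equally long sequences."""
    mean_a, mean_b = mean(a), mean(b)
    cov = sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b))
    norm_a = sum((p - mean_a) ** 2 for p in a) ** 0.5
    norm_b = sum((q - mean_b) ** 2 for q in b) ** 0.5
    return cov / (norm_a * norm_b)


def summary(xs, ys):
    """Mean of y, sample variance of y, and correlation, rounded to 2 decimals."""
    return round(mean(ys), 2), round(variance(ys), 2), round(pearson(xs, ys), 2)


summary_1 = summary(x, y1)
summary_2 = summary(x, y2)
print(summary_1)
print(summary_2)
```

Both summaries agree to two decimals although the `y` values differ, which is why it takes a plot to tell the datasets apart.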
.cite[[A. Cairo, "Datasaurus: Never trust summary statistics alone; always visualize your data"](http://www.thefunctionalart.com/2016/08/download-datasaurus-never-trust-summary.html)]

.cite[[J. Matejka, G. Fitzmaurice, "Same Stats, Different Graphs: Generating Datasets with Varied Appearance and Identical Statistics through Simulated Annealing"](https://www.autodeskresearch.com/publications/samestats)]

---

# How many 5s?

464418163541729611394089491019
103214981928889407852268902875
389879353920237244649469321810
290602004777144868218046078720
522890797338149835404330684291

.cite[Inspired by https://courses.cs.washington.edu/courses/cse512/23sp/, in turn inspired by J. Stasko]

---

# How many 5s?

464418163.red[5]41729611394089491019
1032149819288894078.red[5]226890287.red[5]
3898793.red[5]3920237244649469321810
290602004777144868218046078720
.red[5]2289079733814983.red[5]404330684291

.cite[Inspired by https://courses.cs.washington.edu/courses/cse512/23sp/, in turn inspired by J. Stasko]

---

Data visualization is a

## "Visual representation and presentation of data to facilitate understanding"

.cite[["Fundamentals of Data Visualization", C. O. Wilke](https://clauswilke.com/dataviz/)]

### Data visualizations map .emph[data values] onto .emph[aesthetics/channels]

- position
- length
- shape
- size
- color
- line width
- line type
- (there exist [many more](https://altair-viz.github.io/user_guide/api.html#encoding-channels))

---

## Why visualize data?

### More insight into data: easier to see patterns and problems

- Both calculations and graphs will contribute to understanding

.left-column50[
### Communicating insight

- Presentations/papers: facilitate understanding
- Communication with the public

.quote[reflect on how important and powerful data visualization is: COVID-19, politics, climate change, ...]
]
.right-column50[
### Because others do it or tell us to

- And we often copy the style and culture
]

---
class: center, middle, inverse

# How do you read a paper?

# How do you read posters during a poster session?

(reflect on the value of a good visualization)

---
class: center, middle, inverse

# What is your design process?

---

## How I design plots

- Sometimes: Sketch with pen and paper
- Browse directories/galleries for inspiration: [Vega-Altair](https://altair-viz.github.io/gallery/index.html), [Matplotlib](https://matplotlib.org/gallery.html), [Seaborn](https://seaborn.pydata.org/examples/index.html), [Plotly](https://plotly.com/python/), [Bokeh](https://demo.bokeh.org/), [ggplot](https://yhat.github.io/ggpy/), [PyNGL](https://www.pyngl.ucar.edu/Examples/gallery.shtml), [K3D](https://k3d-jupyter.org/showcase/), [ggplot2](https://ggplot2.tidyverse.org/), [Shiny](https://shiny.rstudio.com/), [Data-Driven Documents](https://d3js.org/), ...
- Take an example that is close to what I want
- Try to rerun it with original example data
- Try to replace example data with my own data
- Tweak and refine

---
---
--- ## Checklist for good visual communication [This list is adapted from a similar list in a presentation by **L. Garrison, "Share Your Science: Visualization for Communication"**] - Define your goals - Show the data (go beyond summary statistics) - Be honest with your visuals - Consider accessibility - Avoid taxing working memory - Tell a story - Reflect on uncertainty and unknowns --- ## Define your goals - "Before you start, define your goals in 1-3 sentences" .cite[L. Garrison, "Share Your Science: Visualization for Communication"] - Audience? - Time constraints --- ## Show the data: strip-plot vs box-plot vs violin-plot
.cite[[J. Matejka, G. Fitzmaurice, "Same Stats, Different Graphs: Generating Datasets with Varied Appearance and Identical Statistics through Simulated Annealing"](https://www.autodeskresearch.com/publications/samestats)] --- ## Be honest with your visuals ## The principle of "proportional ink" Examples with disproportional data/ink ratio:
.cite[Both figures from https://www.callingbullshit.org/tools/tools_proportional_ink.html] --- ## Be honest with your visuals ## Another bad example
.cite[Citation needed]

---

## Accessibility: Avoid 3D plots (unless it's a 3D object)

... unless you are plotting something inherently 3D (molecular structures, structure of an enzyme, a 3D relief of a terrain)
.cite[https://matplotlib.org/3.1.1/gallery/mplot3d/scatter3d.html] --- class: center, middle, inverse # Accessibility: Colors "We need five colors for the plot: black ... red ... green ... blue ... ... ... orange?" --- ## Colors ### Consider color vision deficiencies (CVD) .left-column50[
] .right-column50[ - 4% of the population is affected - View your color figures under [CVD simulations](https://www.color-blindness.com/coblis-color-blindness-simulator/) - Use color scales designed to be CVD-friendly ] --- ## Color scales: 3 types - .emph[Discrete/qualitative] color scales: designed to distinguish
.cite[[Okabe, M., and K. Ito. 2008. "Color Universal Design (CUD): How to Make Figures and Presentations That Are Friendly to Colorblind People."](https://jfly.uni-koeln.de/color/)] - .emph[Sequential/continuous] color scales: represent data values
- .emph[Diverging] color scales: visualize deviation of data values relative to a neutral midpoint .cite[ColorBrewer pink to yellow-green]
--- ## Discrete/qualitative color scales: designed to distinguish .left-column50[
- Great for scatter-plots. - What if you need more than 8 colors? Use direct labeling instead. .cite[[Okabe, M., and K. Ito. 2008](https://jfly.uni-koeln.de/color/)] ] .right-column50[
.cite[
] ] --- ## Sequential/continuous color scales: represent data values .left-column50[
- Great for choropleth plots (here plotting unemployment rate). - Color vision deficiencies less of a concern for this type. - Avoid rainbow scales. ] .right-column50[
.cite[
] ] --- ## Diverging color scales: visualize deviation of data values relative to a neutral midpoint .left-column50[
- Great for heatmaps. .cite[ColorBrewer pink to yellow-green] ] .right-column50[
.cite[
] ] --- ## Colors ### Great resources - https://clauswilke.com/dataviz/color-pitfalls.html - https://blog.datawrapper.de/beautifulcolors/ - [Okabe, M., and K. Ito. 2008. "Color Universal Design (CUD): How to Make Figures and Presentations That Are Friendly to Colorblind People."](https://jfly.uni-koeln.de/color/) - https://seaborn.pydata.org/tutorial/color_palettes.html - https://colorbrewer2.org/ - https://www.fabiocrameri.ch/colourmaps/ - https://venngage.com/tools/accessible-color-palette-generator --- ## Categories - So that we know what to search for - Source of inspiration ### Good overviews - https://clauswilke.com/dataviz/directory-of-visualizations.html - https://datavizcatalogue.com/search.html - https://depictdatastudio.com/charts/ - https://github.com/ft-interactive/chart-doctor/tree/master/visual-vocabulary - http://chartmaker.visualisingdata.com/ --- class: center, middle, inverse # Problematic plots See also: https://viz.wtf --- ## Example 1
.cite[Figure from https://twitter.com/GraphCrimes] --- ## Example 2
.cite[Figure from https://www.callingbullshit.org/tools/tools_proportional_ink.html] --- ## Example 3
.cite[Figure from https://twitter.com/GraphCrimes] --- ## Example 4
.cite[Figure from https://www.callingbullshit.org/tools/tools_proportional_ink.html] --- ## Example 5
.cite[Figure from https://twitter.com/GraphCrimes] --- ## Example 6 .left-column50[
] .right-column50[ .cite[Figure from https://twitter.com/GraphCrimes] ] --- ## Example 7
.cite[Figure from https://twitter.com/GraphCrimes] --- ## Example 8
.cite[Figure from https://twitter.com/GraphCrimes] --- ## Example 9
.cite[Figure from https://twitter.com/GraphCrimes] --- ## Example 10
.cite[Example taken from ["Fundamentals of Data Visualization", C. O. Wilke](https://clauswilke.com/dataviz/)] --- ## Example 11
.cite[Example taken from ["Fundamentals of Data Visualization", C. O. Wilke](https://clauswilke.com/dataviz/)] --- ## Example 12
.cite[Example taken from ["Fundamentals of Data Visualization", C. O. Wilke](https://clauswilke.com/dataviz/)] --- class: center, middle, inverse # Tell a story --- ## Minard's Visualization Of Napoleon's 1812 March
.cite[https://www.edwardtufte.com/tufte/minard] - Another great example: [1854 Broad Street cholera outbreak](https://en.wikipedia.org/wiki/1854_Broad_Street_cholera_outbreak#Investigation_by_John_Snow) --- ## There is a story in here: can you improve the text?
--- class: center, middle, inverse # Reproducibility and FAIR principles --- ## Reproducibility and FAIR principles
.cite[(c) [Scriberia](http://www.scriberia.co.uk) for [The Turing Way](https://the-turing-way.netlify.com), CC-BY] ---
.cite[Heidi Seibold, CC-BY 4.0, https://twitter.com/HeidiBaya/status/1579385587865649153]

---

## FAIR: Which problems can you anticipate?

### Findable

.quote["On which of my external hard drives is my script?"]

### Accessible

.quote["Can you please give me access to your plotting scripts?"]

### Interoperable

.quote["How can I convert this file format?"]

### Reusable

.quote["I wish I could reuse this for my new data!"]

---
class: center, middle, inverse

# Data formats

---

## What problems can arise when storing data like this?
---

## What problems can arise when storing data like this?
- .emph[Format]: Limited interoperability with other programs - .emph[Error prone] (see e.g. [this famous example](https://www.washingtonpost.com/news/wonk/wp/2013/04/16/is-the-best-evidence-for-austerity-based-on-an-excel-spreadsheet-error/)) - Difficult to parse ("understand") by scripts: .emph[difficult to automate] - Not in *tidy format* (more about this later): .emph[difficult to extend/modify] --- ## How should we arrange the data? .left-column50[
] -- .right-column40[ For the moment let us not focus on the tool, but the .emph[data structure] How can these 3 examples be problematic for .emph[automated data visualization]? - In the compact structure we need to divide at the comma - If we add more species or more observation sites, we need to adapt the visualization pipeline ] --- ## "Tidy data" .left-column40[
]
.right-column60[
- Columns are variables
- Rows are observations/measurements
- "Long form"
- Order does not matter
- .emph[Easy to extend] with more species and more sites
- .emph[Structure for storing data] - this does not mean that this is ideal for tables in presentations or publications

.cite[[H. Wickham, "Tidy Data"](http://vita.had.co.nz/papers/tidy-data.pdf)]
]

---

## Standard data formats

.left-column50[
### Comma-separated values (CSV)

```csv
Species,Observation site,Number of sightings
arctic fox,A,3
arctic fox,B,1
walrus,B,1
walrus,C,1
reindeer,B,10
reindeer,C,1
polar bear,A,1
polar bear,C,1
seal,A,2
seal,B,1
seal,C,2
```

- CSV is often a good choice
- Most visualization tools can read CSV data
]
.right-column50[
### There are many more formats

- [JSON](https://en.wikipedia.org/wiki/JSON)
- [XML](https://en.wikipedia.org/wiki/XML)
- [GeoJSON](https://geojson.org/)
- [NPY (NumPy arrays)](https://numpy.org/doc/stable/reference/generated/numpy.lib.format.html)
- [HDF5](https://en.wikipedia.org/wiki/Hierarchical_Data_Format)
- [SQL](https://en.wikipedia.org/wiki/SQL)
- Many domain-specific formats (such as [NetCDF](https://www.unidata.ucar.edu/software/netcdf/))
- .emph[Use standard formats, don't invent your own]
]

---

## Data cleaning

- Often we want to visualize data sets with inconsistent or missing entries:

```csv
Date,Organization,Number of participants
2020-09-27,UiT,20
Oct 10 2020,UiT Norges arktiske universitet,15
"Nov. 11, 2020",UiT The Arctic University of Norway,40
2020-12-12,UiT The Arctic University of Norway,-
```

Data cleaning is a bit outside the scope of this course but still good to know:

- There are tools to clean and merge inconsistent data sets (e.g.
[OpenRefine](https://openrefine.org/), see also [this Data Carpentry lesson](https://datacarpentry.org/OpenRefine-ecology-lesson/))
- This does not have to be done manually

---
class: center, middle, inverse

# Choosing the right tools

---

## Choosing the right tools: scriptable

### There is no single perfect language and no single perfect library for everything

- You will have to choose what fits you and your group best
- We will show examples using .emph[Python, R, and JavaScript]

### No manual post-processing

- This will bite you when you need to regenerate 50 figures one day before a submission deadline or regenerate a set of figures after the person who created them has left the group.
- Use software that can be scripted: batch processing and reproducibility (more about that in the next section).

---

## Choosing the right tools: free

### Use free software and free tools

- Even if the university pays for a license, what happens after you leave the university or after they stop paying? How can other groups build on your work?
- .emph[Python and R are free] and popular for "notebook"-based pipelines, but a number of .emph[JavaScript frameworks] also exist, especially for maps.
- Plain text files for small datasets.
- Standard formats instead of proprietary formats.

### For any academic discipline it will be a good investment to learn a bit of Python or R if you want to do data visualization

---

## Visualization libraries (incomplete list)

Two main families: procedural (e.g. Matplotlib) and declarative.
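To make "declarative" concrete: a declarative library receives a description of the plot (which column maps to which channel) instead of a sequence of drawing commands. A minimal sketch that writes such a description by hand as a Vega-Lite specification, the JSON format that Vega-Altair generates, using only the Python standard library. The rows are a subset of the CSV example shown earlier; the variable names are illustrative and the schema version in the URL is an assumption:

```python
import json

# Tidy data: one observation per row (subset of the sightings table).
rows = [
    {"Species": "arctic fox", "Observation site": "A", "Number of sightings": 3},
    {"Species": "arctic fox", "Observation site": "B", "Number of sightings": 1},
    {"Species": "reindeer", "Observation site": "B", "Number of sightings": 10},
    {"Species": "reindeer", "Observation site": "C", "Number of sightings": 1},
    {"Species": "seal", "Observation site": "A", "Number of sightings": 2},
    {"Species": "seal", "Observation site": "C", "Number of sightings": 2},
]

# Declarative: state WHAT to show (a mark plus a mapping of columns to
# channels), not HOW to draw it.
spec = {
    "$schema": "https://vega.github.io/schema/vega-lite/v5.json",  # assumed version
    "data": {"values": rows},
    "mark": "bar",
    "encoding": {
        "x": {"field": "Species", "type": "nominal"},
        "y": {"field": "Number of sightings", "type": "quantitative"},
        "color": {"field": "Observation site", "type": "nominal"},
    },
}

spec_json = json.dumps(spec, indent=2)
print(spec_json)
```

Adding a species or a site only adds rows to `rows`; the specification itself stays unchanged, which is the practical benefit of combining tidy data with a declarative library.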
.left-column50[
### Python

- [Vega-Altair](https://altair-viz.github.io/gallery/index.html): declarative visualization
- [Matplotlib](https://matplotlib.org/gallery.html): MATLAB users will feel at home
- [Seaborn](https://seaborn.pydata.org/examples/index.html): statistical functions built in
- [Plotly](https://plotly.com/python/): interactive graphs
- [Bokeh](https://demo.bokeh.org/): also good for interactivity
- [ggplot](https://yhat.github.io/ggpy/): R users will feel more at home
- [PyNGL](https://www.pyngl.ucar.edu/Examples/gallery.shtml): used in the weather forecast community
- [K3D](https://k3d-jupyter.org/showcase/): Jupyter notebook extension for 3D visualization
]
.right-column40[
### R

- [ggplot2](https://ggplot2.tidyverse.org/): system for declaratively creating graphics, based on the grammar of graphics
- [Shiny](https://shiny.rstudio.com/): interactive graphs and notebooks

### JavaScript

- [Data-Driven Documents](https://d3js.org/)
]

---
class: center, middle, inverse

# Data visualization using Python

https://coderefinery.github.io/data-visualization-python/

(co-created by the author of these slides)

---
class: center, middle, inverse

# Reproducible and reusable plots

---
class: center, middle
.cite[Juliette Taka, Logilab and the OpenDreamKit project (2017), https://opendreamkit.org/2017/11/02/use-case-publishing-reproducible-notebooks/] --- ## .emph[Demo]: visualization pipeline on [Binder](https://mybinder.org/) - Python/[Altair](https://altair-viz.github.io/) on [Jupyter](https://jupyter.org/) served via [Binder](https://mybinder.org/): https://github.com/bast/jupyter-binder-example - R/[ggplot2](https://ggplot2.tidyverse.org/) on [RStudio](https://rstudio.com/)/[R Markdown](https://rmarkdown.rstudio.com/) served via [Binder](https://mybinder.org/): https://github.com/bast/rstudio-binder-example ### Other fantastic tools which I will not demonstrate - [Data-Driven Documents](https://d3js.org/) with [gallery of examples](https://observablehq.com/@d3/gallery) - Interactive plots with [Shiny](https://shiny.rstudio.com/gallery/) --- ## [Zenodo](https://zenodo.org/) can give you a persistent identifier (DOI) and make your pipeline citable Rather than specifying a GitHub repository when launching [Binder](https://mybinder.org/), you can instead use a Zenodo DOI.
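A minimal sketch of the two kinds of launch link, assuming the URL patterns that mybinder.org uses at the time of writing (`/v2/gh/...` for a GitHub repository and `/v2/zenodo/...` for a Zenodo DOI). The function names are illustrative and the DOI is a placeholder, not a real record:

```python
BINDER = "https://mybinder.org/v2"


def binder_url_github(owner: str, repo: str, ref: str = "HEAD") -> str:
    """Launch link for a GitHub repository at a branch, tag, or commit."""
    return f"{BINDER}/gh/{owner}/{repo}/{ref}"


def binder_url_zenodo(doi: str) -> str:
    """Launch link for an archived snapshot identified by a Zenodo DOI."""
    return f"{BINDER}/zenodo/{doi}/"


# Repository from the demo slide; its content can change after publication.
github_link = binder_url_github("bast", "jupyter-binder-example")

# Placeholder DOI (assumption): replace with the DOI that Zenodo assigns.
zenodo_link = binder_url_zenodo("10.5281/zenodo.XXXXXXX")

print(github_link)
print(zenodo_link)
```

The GitHub link follows whatever the chosen reference currently points to, while the DOI link refers to one fixed, archived version, which is what makes the pipeline citable.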
---

## Progression

- Start with a working example and try adapting it
- Learn the very basics
  - Learn a bit of Python
  - Or R
- It can be a good idea to start learning right away in a notebook
  - Python: Jupyter
  - R: R Markdown in RStudio
  - [Quarto](https://quarto.org/)
- Later try Binder
- Later learn how to get a DOI for your Binder
- Now your plotting recipe can be cited and is reproducible

### This takes time and it is OK to take time

.quote[If I had six hours to chop down a tree, I’d spend the first four hours sharpening the axe.]

.cite[Abraham Lincoln]

---

## Summary

- Don't forget to **tell a story**
- FAIR principles and reproducibility will be good for you (and for others)
- Document all tools and dependencies used .emph[with versions]
- Prefer .emph[free tools]
- "Data visualization clinic" next week

---

### Books

- ["Fundamentals of Data Visualization", C. O. Wilke](https://clauswilke.com/dataviz/)
- ["Data Visualization: A practical introduction", K. Healy](https://socviz.co/)
- ["Data Visualisation: A Handbook for Data Driven Design", A. Kirk](https://www.visualisingdata.com/book/)

### Papers

- [N. P. Rougier, M. Droettboom, P. E. Bourne, "Ten Simple Rules for Better Figures", PLoS Comput Biol 10(9): e1003833 (2014)](https://doi.org/10.1371/journal.pcbi.1003833)

### Courses/talks

- https://coderefinery.github.io/data-visualization-python/
- https://courses.cs.washington.edu/courses/cse512/23sp/
- https://swcarpentry.github.io/visualization-novice/
- https://www.ub.uio.no/english/courses-events/events/all-libraries/2020/research-bazaar/visualisation.html
- https://ajstewartlang.github.io/SIPS_2019/SIPS_presentation.html