HistData

Data Sets from the History of Statistics and Data Visualization

Version 1.0.1 (2025-12-01)

The HistData package provides a collection of small data sets that are interesting and important in the history of statistics and data visualization. The goal of the package is to make these available, both for instructional use (as examples, problem sets or projects) and for historical research (extending or criticizing a previous analysis). Some of these present interesting challenges, or opportunities to “show off”, with graphics or analysis in R.

Many of the data sets contained here have examples which reproduce an historical graph or analysis. These are meant mainly as simple starters for more extensive re-analysis or graphical elaboration. If you are interested in any of these problems or data sets, there is lots of room to do better!

Some of these have been featured in social media calls for participation, such as the 30 Day Chart Challenge.

This is part of a program of research called statistical historiography (Friendly, 2007; Friendly & Denis, 2001; Friendly et al., 2016), meaning the use of statistical methods to study problems and questions in the history of statistics and graphics. A main aspect of this is the increased understanding of historical problems in science and data analysis through the process of trying to reproduce a graph or analysis using modern methods. I call this “Re-visioning”, meaning to see again, hopefully in a new light.

These data sets are also used in our book, A History of Data Visualization and Graphic Communication (Friendly & Wainer, 2021), Harvard University Press, ISBN 9780674975231. See also the companion website for this book.

If you are looking more widely for datasets to use for examples, teaching or research, check out Vincent Arel-Bundock’s Rdatasets package, with over 2500 datasets from various R packages, with this list of Available datasets.

Data science

There is another R aspect of the HistData project that should be noted here:

A great deal of “data sciency” work was involved in constructing this package. This may not be evident in the resulting HistData package you see, but for those interested in this form of History Data Mining, here are some of the aspects involved:

  • In some cases, data had to be extracted from dusty historical documents, using a variety of techniques (web scraping, OCR of PDF files followed by conversion to a data set). Each problem had its own toolbox, inside R or outside. In many cases, transcription errors had to be corrected with code or manually. This is much simpler today with many tools, and the possibility to enlist an AI companion. But, whatever you use, you’d better check the work. Some of the more interesting stories of statistical historiography involve uncovering transcription errors.

  • Digitization of data from an image. When the data is only available in an image, you can extract it with more modern tools than were available when I started this in 2007. Among the newer ones: WebPlotDigitizer, PlotDigitizer. But, where accuracy is important, you should think about the reliability and validity of what you got from an image (Drevon et al., 2016).

  • Conversion of text-based data sets to a CSV file and then to an .RData file with proper column names. Have you ever seen a Unix .shar (shell archive) file? Well, I have. What about an old codebook prepared for a survey? Modern R packages can read data in a much wider variety of formats today, but older material still presents challenges.

  • Cleaning variable names: Some of this can now be done using, e.g., janitor::clean_names(), or, in some cases, by manually editing an Excel file you got. What to do about very_long_variable_names or those abbrvd 2 shrt?

  • Type conversion: Converting, e.g., chr to factor or ordered required thinking about what a variable represented. In some cases, appropriate contrasts for factors were constructed to facilitate re-analysis. Pay attention to stringsAsFactors.

  • Tidying data.frames: Initially, most of this was done in base R, but tidyverse tools, as they came along, made these tasks much easier, e.g., conversion of long <–> wide, separating implicit columns, like “1800-1850”, abbreviations of character string labels, …

  • Documentation: The thankless task? No – considerable effort was made to give detailed descriptions, notes on methods, executable examples, references to original sources and analyses, …

  • Dataset documentation: Originally, I wrote most of the documentation for datasets using utils::prompt(dataset). There isn’t anything like this to generate roxygen documentation for a dataset, so I wrote a script, use_data_doc.R, to do this.

  • Documentation style: In this release, all the documentation was first converted from manually written .Rd files to roxygen format using Rd2roxygen. This package does a reasonable job, but gets a few things wrong. For example, one error required a complicated regex to convert \item{list("VarName")} to the correct format, \item{\code{VarName}}. Documentation is even easier to edit and maintain in markdown format, so it was converted again using roxygen2md.

  • Ask AI: A few tasks were aided by Claude Sonnet 4.5. I wanted to add @concept descriptors to datasets reflecting the statistical and graphical concepts that each dataset can be used to illustrate. It made a list, which I heavily edited, and also an R script to add these to the documentation.
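
Several of the steps listed above (cleaning names, separating an implicit range column, type conversion, and wide-to-long reshaping) can be sketched in a few lines of base R. This is only an illustration under stated assumptions: the data frame raw, its column names and its values are invented here and do not come from any HistData data set. Packages such as janitor and tidyr offer more convenient versions of the same steps.

```r
# A made-up table standing in for a freshly transcribed source.
# All names and values are hypothetical.
raw <- data.frame(
  "Period "        = c("1800-1850", "1851-1900", "1901-1950"),
  "Quality Rating" = c("low", "high", "medium"),
  "Count (A)"      = c(12, 30, 21),
  "Count (B)"      = c(7, 15, 18),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# 1. Clean variable names: lower-case, runs of other characters
#    become "_", and leading or trailing "_" are trimmed.
clean <- tolower(names(raw))
clean <- gsub("[^a-z0-9]+", "_", clean)
clean <- gsub("^_+|_+$", "", clean)
names(raw) <- clean

# 2. Separate the implicit columns hidden in a range like "1800-1850".
bounds <- do.call(rbind, strsplit(raw$period, "-", fixed = TRUE))
raw$start <- as.integer(bounds[, 1])
raw$end   <- as.integer(bounds[, 2])

# 3. Type conversion: the rating is ordinal, so store it as an
#    ordered factor with explicitly stated levels.
raw$quality_rating <- factor(raw$quality_rating,
                             levels = c("low", "medium", "high"),
                             ordered = TRUE)

# 4. Wide -> long: one row per period and group.
long <- reshape(raw,
                direction = "long",
                varying   = c("count_a", "count_b"),
                v.names   = "count",
                timevar   = "group",
                times     = c("a", "b"),
                idvar     = "period")
rownames(long) <- NULL

str(long)
```

Stating the factor levels explicitly is the part that needs thought about what the variable represents; the default would order them alphabetically.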

Installation

Get the released version from CRAN

install.packages("HistData")

The development version can be installed to your R library directly from GitHub or my R-universe via:

install.packages('HistData', repos = 'https://friendly.r-universe.dev')
remotes::install_github("friendly/HistData")
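
Once the package is installed, a data set can be loaded by name. The following is a minimal sketch rather than one of the package's documented examples: it assumes the Arbuthnot data set with its Year, Males and Females columns, and the name sex_ratio is introduced here only for illustration.

```r
# Assumes HistData is installed; the block is skipped quietly otherwise.
if (requireNamespace("HistData", quietly = TRUE)) {
  data(Arbuthnot, package = "HistData")

  # Ratio of male to female christenings in each year
  sex_ratio <- Arbuthnot$Males / Arbuthnot$Females

  # Plot the yearly ratio, with a reference line at equal numbers
  plot(Arbuthnot$Year, sex_ratio, type = "b",
       xlab = "Year", ylab = "Males / Females",
       main = "Arbuthnot's christening data")
  abline(h = 1, lty = 2)

  # In how many years did male christenings exceed female ones?
  sum(sex_ratio > 1)
}
```

The help page for each data set (for example, ?Arbuthnot) gives the fuller examples that reproduce historical graphs or analyses.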

Data sets

Here are the data sets in the package, with links to their documentation. Some topics are represented by two or more data sets.

# link dataset to pkgdown doc
refurl <- "https://friendly.github.io/HistData/reference/"

dsets <- vcdExtra::datasets("HistData") |> 
  dplyr::select(Item, Title) |> 
  dplyr::mutate(Item = glue::glue("[{Item}]({refurl}{Item}.html)")) 

library(tinytable)
# tt(dsets) |>
#   format_tt(j = 1, markdown = TRUE) |>
#   style_tt(j = 1, bootstrap_css = "width: 30%;") |>
#   style_tt(j = 2, bootstrap_css = "width: 70%;")
tt(dsets, width = c(.2, .8)) |> 
    format_tt(j = 1, markdown = TRUE) 
Item Title
Arbuthnot Arbuthnot’s Data on Male and Female Birth Ratios
Armada La Felicisima Armada
Bowley Bowley’s data on values of British and Irish trade, 1855-1899
Breslau Halley’s Breslau Life Table
Cavendish Cavendish’s Determinations of the Density of the Earth
ChestSizes Chest measurements of Scottish Militiamen
ChestStigler Chest measurements of Scottish Militiamen
Cholera William Farr’s Data on Cholera in London, 1849
CholeraDeaths1849 Daily Deaths from Cholera and Diarrhaea in England, 1849
CushnyPeebles Cushny-Peebles Data: Soporific Effects of Scopolamine Derivatives
CushnyPeeblesN Cushny-Peebles Data: Soporific Effects of Scopolamine Derivatives
Dactyl Edgeworth’s counts of dactyls in Virgil’s Aeneid
DrinksWages Elderton and Pearson’s (1910) data on drinking and wages
EdgeworthDeaths Edgeworth’s Data on Death Rates in British Counties
Fingerprints Waite’s data on Patterns in Fingerprints
Galton Galton’s data on the heights of parents and their children
GaltonFamilies Galton’s data on the heights of parents and their children, by child
Guerry Data from A.-M. Guerry, “Essay on the Moral Statistics of France”
HalleyLifeTable Halley’s Life Table
Jevons W. Stanley Jevons’ data on Numerical Discrimination
Langren.all van Langren’s Data on Longitude Distance between Toledo and Rome
Langren1644 van Langren’s Data on Longitude Distance between Toledo and Rome
Macdonell Macdonell’s Data on Height and Finger Length of Criminals, used by Gosset (1908)
MacdonellDF Macdonell’s Data on Height and Finger Length of Criminals, used by Gosset (1908)
Mayer Mayer’s Data on the Libration of the Moon.
Michelson Michelson’s Determinations of the Velocity of Light
MichelsonSets Michelson’s Determinations of the Velocity of Light
Minard.cities Data from Minard’s famous graphic map of Napoleon’s march on Moscow
Minard.temp Data from Minard’s famous graphic map of Napoleon’s march on Moscow
Minard.troops Data from Minard’s famous graphic map of Napoleon’s march on Moscow
Nightingale Florence Nightingale’s data on deaths in the Crimean War
OldMaps Latitudes and Longitudes of 39 Points in 11 Old Maps
PearsonLee Pearson and Lee’s data on the Heights of Parents and Children by Gender
Playfair1824 Playfair’s Linear Chronology
PolioTrials Polio Field Trials Data
Pollen Pollen Data Challenge
Prostitutes Parent-Duchatelet’s time-series data on the number of prostitutes in Paris
Pyx Trial of the Pyx
Quarrels Statistics of Deadly Quarrels
Saturn Laplace’s Saturn data.
Snow.dates John Snow’s Map and Data on the 1854 London Cholera Outbreak
Snow.deaths John Snow’s Map and Data on the 1854 London Cholera Outbreak
Snow.deaths2 John Snow’s Map and Data on the 1854 London Cholera Outbreak
Snow.polygons John Snow’s Map and Data on the 1854 London Cholera Outbreak
Snow.pumps John Snow’s Map and Data on the 1854 London Cholera Outbreak
Snow.streets John Snow’s Map and Data on the 1854 London Cholera Outbreak
Virginis John F. W. Herschel’s Data on the Orbit of the Twin Stars gamma Virginis
Virginis.interp John F. W. Herschel’s Data on the Orbit of the Twin Stars gamma Virginis
Wheat Playfair’s Data on Wages and the Price of Wheat
Wheat.monarchs Playfair’s Data on Wages and the Price of Wheat
Yeast Student’s (1906) Yeast Cell Counts
YeastD.mat Student’s (1906) Yeast Cell Counts
ZeaMays Darwin’s Heights of Cross- and Self-fertilized Zea May Pairs
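
The table above was built with vcdExtra::datasets(). Where vcdExtra is not available, a similar listing can be sketched with base R's data() function. The names info and dsets_base are hypothetical, and entries in the Item column may carry a file name in parentheses when several objects share one data file, so this is not an exact replacement.

```r
# Assumes HistData is installed; the block is skipped quietly otherwise.
if (requireNamespace("HistData", quietly = TRUE)) {
  # data() returns an object whose $results matrix includes
  # "Item" and "Title" columns for each data set in the package.
  info <- data(package = "HistData")$results

  # Keep just the name and title, as in the table above
  dsets_base <- as.data.frame(info[, c("Item", "Title"), drop = FALSE],
                              stringsAsFactors = FALSE)
  head(dsets_base)
}
```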

See also

  • The Horsekicks package contains the classic data from von Bortkiewicz, “Death by the kick of a horse in the Prussian Army”, with additional data on deaths by falling from a horse and by drowning.

  • Rdatasets is a collection of over 2500 datasets culled from CRAN packages to make them broadly accessible for teaching and statistical software development.

  • The lattice package contains the Minnesota barley data used by Cleveland (1993) in developing Trellis graphics and the ideas behind conditioning (faceted) plots. Wright (2013) reanalysed this and extended the datasets in the agridat package, which also contains a wide variety of other datasets from agricultural experiments, some of historical interest.

Contributors

Please note that the HistData project is released with a Contributor Code of Conduct. By contributing to this project, you agree to abide by its terms.

Over the years, many people have contributed new data sets, offered corrections, suggestions, or documentation examples. They are appreciatively listed below:

David Bellhouse, Brian Clair, Stephane Dray, Luiz Droubi, Antoine de Falguerolles, Monique Graf, James Hanley, Ivan Lokhov, Peter Li, Dennis Murphy, Jim Oeppen, James Riley, John Russell, Neville Verlander, Hadley Wickham.

References

Cleveland, William S. (1993) Visualizing Data. Hobart Press, Summit, New Jersey.

Drevon, D., Fursa, S. R., & Malcolm, A. L. (2016). Intercoder Reliability and Validity of WebPlotDigitizer in Extracting Graphed Data. Behavior Modification, 41(2), 323–339. https://doi.org/10.1177/0145445516673998

Friendly, M. (2007). A Brief History of Data Visualization. In Chen, C., Härdle, W. & Unwin, A. (eds.), Handbook of Computational Statistics: Data Visualization, Springer-Verlag, III, Ch. 1, 1-34. Preprint

Friendly, M. & Denis, D. (2001). Milestones in the history of thematic cartography, statistical graphics, and data visualization. Web site: http://datavis.ca/milestones/

Friendly, M., Sigal, M. & Harnanansingh, D. (2016). “The Milestones Project: A Database for the History of Data Visualization,” In Kostelnick, C. & Kimball, M. (eds.), Visible Numbers: The History of Data Visualization, Ashgate Press, Chapter 10. Preprint

Friendly, M. & Wainer, H. (2021). A History of Data Visualization and Graphic Communication, Harvard University Press, ISBN 9780674975231. Companion web site

Wright, K. (2013). Revisiting Immer’s Barley Data. The American Statistician, 67(3), 129–133.
