Precision Public Health: Tidy & Messy Datasets

This guide documents the initiative to develop an infrastructure for precision public health at UF.

Why is Tidy Data Important?

A 2015 research report by Experian Data Quality found that, on average, U.S. organizations believe 32 percent of their data is inaccurate. Ninety-one percent of respondents believe that inaccurate data affects revenue through wasted resources, lost productivity, or wasted marketing and communications. According to the Harvard Business Review article Bad Data Costs the U.S. $3 Trillion Per Year, poor data quality is costly because decision makers, managers, knowledge workers, data scientists, and others must accommodate it or develop workarounds in their everyday work, which is both time-consuming and expensive.

Data tidying uses a standardized method of structuring datasets to facilitate analysis. This standard makes the initial exploration and analysis of data easier and reduces the time needed to clean datasets before their first use.


What is Tidy Data?

The journal article Tidy Data explains that tidy data is a standardized way of mapping the meaning of a dataset to its structure. A dataset is either messy or tidy, depending on how its rows, columns, and tables are matched with observations, variables, and types.

Five Elements of Messy Data

  1. Column headers are values, not variable names.
  2. Multiple variables are stored in one column.
  3. Variables are stored in both rows and columns.
  4. Multiple types of observational units are stored in the same table.
  5. A single observational unit is stored in multiple tables.

See the Tidy Data article for methods to fix messy data.
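
As an illustration of the first problem in the list above (column headers that are values), the following is a minimal sketch in plain Python of reshaping such a table into tidy form. The table, the column names (county, year, cases), and the melt helper are all hypothetical and invented for this example; they are not taken from the Tidy Data article or from any real dataset.

```python
# Hypothetical messy table: the headers "2019" and "2020" are values of a
# "year" variable, not variable names. All numbers are invented.
messy = [
    {"county": "County A", "2019": 12, "2020": 15},
    {"county": "County B", "2019": 7, "2020": 9},
]


def melt(rows, id_column, variable_name, value_name):
    """Turn every non-identifier column into its own (variable, value) row."""
    tidy_rows = []
    for row in rows:
        for header, value in row.items():
            if header == id_column:
                continue  # the identifier is repeated on each output row
            tidy_rows.append({
                id_column: row[id_column],
                variable_name: header,
                value_name: value,
            })
    return tidy_rows


tidy = melt(messy, id_column="county", variable_name="year", value_name="cases")
for record in tidy:
    print(record)
```

After reshaping, each row holds one county-year observation, and year and cases are ordinary columns rather than being spread across headers.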

Three Elements of Tidy Data

Tidy data has three elements:

1. Each variable forms a column.

2. Each observation forms a row.

3. Each type of observational unit forms a table.
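
To suggest why this layout facilitates analysis, here is a small sketch in plain Python. It assumes a hypothetical tidy table in which each row is one observation and each key (county, year, cases) is one variable; the values are invented for illustration.

```python
from collections import defaultdict

# Hypothetical tidy table: one row per observation, one key per variable.
tidy = [
    {"county": "County A", "year": 2019, "cases": 12},
    {"county": "County A", "year": 2020, "cases": 15},
    {"county": "County B", "year": 2019, "cases": 7},
    {"county": "County B", "year": 2020, "cases": 9},
]

# Because "year" is an ordinary column, summarizing by year is a single pass
# over the rows with no reshaping step.
totals_by_year = defaultdict(int)
for row in tidy:
    totals_by_year[row["year"]] += row["cases"]

print(dict(totals_by_year))
```

The same loop works for any grouping variable, since every variable is reachable by name on every row.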

This follows the same principles as Codd's third normal form, with a few variations: the constraints are framed in statistical language, and the focus is on a single dataset rather than on connected datasets such as those in relational databases. By contrast, messy data is any other arrangement of the data. For a more in-depth perspective on Codd's normalization process for database structures, see the article Normalized data base structure: a brief tutorial.
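
The connection between the third element of tidy data and normalization can be sketched with a table that mixes two types of observational unit. In the hypothetical example below, patient-level facts (birth_year) are repeated on every visit-level row; splitting the table gives one table per type of unit. The column names and records are invented for illustration, and the split shown is only a simplified analogue of database normalization.

```python
# Hypothetical combined table: patient attributes repeat on every visit row.
combined = [
    {"patient_id": 1, "birth_year": 1980, "visit_date": "2021-01-05", "systolic_bp": 120},
    {"patient_id": 1, "birth_year": 1980, "visit_date": "2021-03-10", "systolic_bp": 118},
    {"patient_id": 2, "birth_year": 1975, "visit_date": "2021-02-01", "systolic_bp": 135},
]

patients_by_id = {}  # one entry per patient (first observational unit)
visits = []          # one entry per visit (second observational unit)

for row in combined:
    # Patient-level facts are stored once, keyed by patient_id.
    patients_by_id[row["patient_id"]] = {
        "patient_id": row["patient_id"],
        "birth_year": row["birth_year"],
    }
    # Visit-level facts keep patient_id so the two tables can be joined later.
    visits.append({
        "patient_id": row["patient_id"],
        "visit_date": row["visit_date"],
        "systolic_bp": row["systolic_bp"],
    })

patients = list(patients_by_id.values())
print(patients)
print(visits)
```

Each fact now lives in exactly one place, so correcting a patient's birth year means changing a single row.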

Creative Commons License
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License.