A 2015 research report from Experian Data Quality revealed that, on average, U.S. organizations believe 32 percent of their data is inaccurate. Ninety-one percent of respondents believe this inaccurate data affects revenue through wasted resources, lost productivity, or wasted marketing and communications spend. According to the Harvard Business Review article Bad Data Costs the U.S. $3 Trillion Per Year, poor data quality has a high cost because decision makers, managers, knowledge workers, data scientists, and others must accommodate it or develop workarounds in their everyday work, which is both time-consuming and expensive.
Data tidying uses a standardized method of structuring datasets to facilitate analysis. This standard speeds up the initial exploration and analysis of data and reduces the time needed to clean datasets before first use.
The journal article Tidy Data explains that tidy data is a standardized method of mapping the meaning of a dataset to its structure. A dataset can be classified as messy or tidy. Whether a dataset is messy or tidy depends on how rows, columns, and tables are matched with observations, variables, and observational unit types.
See the Tidy Data article for methods to fix messy data.
Tidy data has three elements:
1. Each variable forms a column.
2. Each observation forms a row.
3. Each type of observational unit forms a table.
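The three principles above can be sketched with a small example in plain Python. The data here is hypothetical, purely for illustration: a "messy" wide table stores one column per treatment, and tidying it produces one row per observation with the variables person, treatment, and result each in their own column.

```python
# Messy layout: one row per person, one column per treatment.
# (Hypothetical example data, not from the source article.)
messy = [
    {"person": "John Smith", "treatment_a": None, "treatment_b": 2},
    {"person": "Jane Doe", "treatment_a": 16, "treatment_b": 11},
]

# Tidy layout: each variable (person, treatment, result) forms a column,
# and each observation -- one (person, treatment) measurement -- forms a row.
tidy = [
    {"person": row["person"], "treatment": t, "result": row[f"treatment_{t}"]}
    for row in messy
    for t in ("a", "b")
]

for observation in tidy:
    print(observation)
```

Because every row is now a single observation, operations such as filtering by treatment or grouping by person become straightforward; libraries such as pandas (`melt`) and R's tidyr automate this kind of reshaping.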
This follows the same principles as Codd's third normal form, with a few variations: the constraints are framed in statistical language, and the focus is on a single dataset rather than the connected datasets found in relational databases. By contrast, messy data is any other arrangement of the data. For a more in-depth perspective on Codd's normalization process for database structures, see the article Normalized data base structure: a brief tutorial.