- Understand the concept of “raw” vs “cleaned” data
- Understand the different forms of data
- Understand basic tenets for organizing data
27 January 2025
Data are a set of values of qualitative or quantitative variables collected through observations
Raw data have not been “cleaned” to remove outliers, instrument/observation errors, or data entry errors
Data may be raw to you, but they may have been pre-processed by someone prior to you receiving them
Data chain of custody refers to a process whereby every analyst should
This process ensures reproducibility
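There are many different data formats, including plain text files such as `.csv` and `.txt`
Be consistent
- categories: use `female` and not `Female` or `F`
- missing values: use `NA` and not `N/A` or `-999`
- dates: use `2021-01-22` and not `January 22, 2021`
Choose good names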
Good name | Alternative | Avoid |
---|---|---|
fish_mass_g | FishMassG | Fish mass (g) |
sex | Sex | M/F |
obs_01 | first_obs | 1st Obs. |
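The consistency and naming rules above can be sketched in a few small helper functions. This is a minimal illustration using only the Python standard library, not a definitive implementation; the lookup tables (`SEX_CODES`, `MISSING_CODES`, `DATE_FORMATS`) and all function names are assumptions made for this example, and the month-name parsing assumes an English locale.

```python
import re
from datetime import datetime

# Hypothetical lookup tables; extend them to match your own data.
SEX_CODES = {"female": "female", "f": "female", "male": "male", "m": "male"}
MISSING_CODES = {"", "na", "n/a", "-999"}
DATE_FORMATS = ("%Y-%m-%d", "%B %d, %Y")  # ISO first, then "January 22, 2021"


def clean_name(name):
    """Convert a column header like 'Fish mass (g)' to 'fish_mass_g'."""
    name = re.sub(r"[^0-9A-Za-z]+", "_", name.strip())
    return name.strip("_").lower()


def clean_missing(value):
    """Map the various missing-value codes to the single string 'NA'."""
    value = value.strip()
    return "NA" if value.lower() in MISSING_CODES else value


def clean_sex(value):
    """Map 'Female', 'F', etc. to one consistent lowercase label."""
    value = clean_missing(value)
    return SEX_CODES.get(value.lower(), value)


def clean_date(value):
    """Rewrite a recognized date as ISO 8601 (YYYY-MM-DD)."""
    value = clean_missing(value)
    if value == "NA":
        return value
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(value, fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"unrecognized date: {value!r}")
```

Raising an error on an unrecognized date, rather than silently passing it through, is a deliberate choice here: it forces new formats to be handled explicitly.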
No empty cells – use `NA` or some other placeholder
One thing per cell, for example
- split `lat-lon` to `lat` and `lon`
- avoid notes in cells such as “0 (below detection limit)”
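As a rough sketch of "one thing per cell", the helpers below split a combined coordinate into two values and separate a measurement from its note. The function names, the comma separator, and the example coordinates are assumptions for illustration only.

```python
import re


def split_lat_lon(value, sep=","):
    """Split a combined 'lat,lon' string into two floats (assumes one separator)."""
    lat, lon = value.split(sep)
    return float(lat), float(lon)


def split_value_note(cell):
    """Split '0 (below detection limit)' into a value and a separate note."""
    match = re.fullmatch(r"\s*(\S+)\s*\((.*)\)\s*", cell)
    if match:
        return match.group(1), match.group(2).strip()
    # No note present: keep the value and fill the note column with NA.
    return cell.strip(), "NA"


print(split_lat_lon("47.65,-122.31"))
print(split_value_note("0 (below detection limit)"))
```

The value and the note would then go in two separate columns, so the measurement column stays purely numeric.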
Use a tidy format where
These are not tidy
2 tables with linked IDs
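To make the idea concrete, here is a small sketch that reshapes a "wide" table into a long one, assuming the usual meaning of tidy (one row per observation, one column per variable). The table contents, column names, and the `to_long` helper are all hypothetical.

```python
# Hypothetical wide table: one column per year, which is not tidy.
wide = [
    {"site": "A", "2020": 12, "2021": 15},
    {"site": "B", "2020": 7, "2021": 9},
]


def to_long(rows, id_col, var_name, value_name):
    """Turn every non-ID column into its own row of (id, variable, value)."""
    long_rows = []
    for row in rows:
        for key, value in row.items():
            if key == id_col:
                continue
            long_rows.append({id_col: row[id_col], var_name: key, value_name: value})
    return long_rows


tidy = to_long(wide, "site", "year", "count")
for record in tidy:
    print(record)
```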
Create a data dictionary, which includes info like
Example data dictionary
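A data dictionary can also be kept in a machine-readable form and used to check a data file. The sketch below is only an illustration: the fields (`description`, `units`, `type`), the entries, and the `undocumented_columns` helper are assumptions, not a prescribed layout.

```python
# Hypothetical data dictionary: one entry per column in the data file.
DATA_DICTIONARY = {
    "fish_mass_g": {"description": "mass of the fish", "units": "g", "type": "numeric"},
    "sex": {"description": "sex of the fish", "units": "NA", "type": "categorical"},
}


def undocumented_columns(header, dictionary):
    """Return the column names that have no entry in the data dictionary."""
    return [name for name in header if name not in dictionary]


print(undocumented_columns(["fish_mass_g", "sex", "obs_01"], DATA_DICTIONARY))
```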
Be very cautious of Excel workbooks and Google Sheets
- cells may contain formulas rather than values (e.g., `C1=A1+B1`)
How many issues can you identify?
For data entry in Excel, consider data validation options
(see how to do this here)
Save the data in plain text
- `.csv` files are generally preferred
- `.txt` files with tab or space delimiters are more popular in languages where a “,” indicates a decimal
A reproducible analysis should include step-by-step instructions for moving from raw to clean data
Items to include:
We’ll learn about reading data from different sources
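As a preview, the step from a raw plain-text file to a clean `.csv` can be written as a short script, so that the raw file is never edited by hand. This is a minimal sketch under stated assumptions: the raw text, the semicolon delimiter, the decimal comma, and the names `RAW_TEXT` and `raw_to_clean` are all made up for illustration.

```python
import csv
import io

# Hypothetical raw data: semicolon-delimited, with "," as the decimal mark.
RAW_TEXT = "fish_id;fish_mass_g\n1;12,5\n2;NA\n"


def raw_to_clean(raw_file, clean_file, delimiter=";", decimal_cols=("fish_mass_g",)):
    """Read delimited raw text and write comma-separated text with '.' decimals."""
    reader = csv.DictReader(raw_file, delimiter=delimiter)
    writer = csv.DictWriter(clean_file, fieldnames=reader.fieldnames, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        for col in decimal_cols:
            # Only the named numeric columns are touched; text columns are left alone.
            row[col] = row[col].replace(",", ".")
        writer.writerow(row)


clean = io.StringIO()
raw_to_clean(io.StringIO(RAW_TEXT), clean)
print(clean.getvalue())
```

In practice the two `io.StringIO` objects would be replaced by files opened with `open(..., newline="")`, keeping the raw file read-only.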