8 January 2025

Goals for today

  • Outline the steps to a data science analysis
  • Understand the software requirements for the course

Credits

This content draws upon material developed by Jeff Leek & Roger Peng for their Advanced Data Science Course at the Johns Hopkins Bloomberg School of Public Health.

Steps to an analysis

  1. Define the question
  2. Get the data
  3. Clean the data
  4. Explore the data
  5. Fit statistical models
  6. Communicate the results

Steps to an analysis

  1. Define the question
  2. Get the data
  3. Clean the data
  4. Explore the data
  5. Fit statistical models
  6. Communicate the results


This should ALL be reproducible

Steps to an analysis

1. Define the question of interest

What makes a good Q?

  • It should be specific
  • It should be answerable with available data
  • It should come with a definition of success
  • It should not be designed to discriminate or cause harm

Steps to an analysis

2. Get the data

Data sources (in order of confidence):

  • You collected your own
  • Collaborators
  • Public repositories

Steps to an analysis

3. Clean the data

This is usually the most time intensive step

It’s also the most important & the hardest to teach

Steps to an analysis

4. Explore the data

Do the data meet your expectations?

  • Make judicious use of plots
  • Also consider correlation matrics (eg, {corrplot})
  • Beware of data mining!

Steps to an analysis

5. Fit statistical models

Start with first principles

  • Type of data (discrete, continuous)
  • Distributions of data (normal, Poisson)
  • Start simple and build from it!

Steps to an analysis

6. Communicate the results

Who’s your audience? What’s your medium?

  • Lab group
  • Social media, blog posts
  • Presentation
  • Publication

Steps to an analysis

Make your analysis reproducible

Consider these real scenarios

  • Your collaborator asks, “Can you re-run the analysis with these new data?”
  • “Your closest collaborator is you six months ago, but you don’t reply to emails.” (Karl Broman)

Types of analyses

Types of analysis

Raw

The goal is to explore the data

This is your “sandbox”

Types of analysis

Informal

Soften the edges of the raw results

  • Focus the analysis or “refine the arc”
  • Improve the presentation
  • Narrow your conclusions

 

 

Types of analysis

Formal

A refined analysis and presentation

  • Code in supplementary materials (GitHub, Zenodo, figshare)
  • High quality figures

A rubric for analyses

Here are some useful questions to ask yourself

A rubric for analyses

Answering the question posed

  • Could the question be answered with the available data?
  • Did you understand the context for the question and the scientific application?
  • Did you record the experimental design?

A rubric for analyses

Cleaning the data

  • Are the data “tidy”?
  • Did you record the recipe for moving from raw to tidy data?
  • Did you record all parameters, units, and functions applied to the data?

A rubric for analyses

Exploratory analysis

  • Did you make univariate plots (histograms, density plots, boxplots)?
  • Did you consider correlations between variables (scatterplots)?
  • Did you check the units of all data points to make sure they are in the right range?
  • Did you try to identify any errors or miscoding of variables?

A rubric for analyses

Statistical analysis

  • Did you identify the quantities of interest in your model?
  • How did you select your model(s) (eg, information theory, cross validation)?
  • Is this a correlative or causative study?

A rubric for analyses

Communication

  • Did you describe the data set, experimental design, and question you are answering?
  • Did you specify the model you are fitting?
  • Does each table/figure communicate an important piece of information?

A rubric for analyses

Reproducibility

  • Did you create a script that reproduces all your analyses?
  • Did you save the raw and processed versions of your data?
  • Did you record all versions of the software you used to process the data?
  • Did you try to have someone else run your analysis code?

What’s next?

We’ll learn about research compendia and style conventions for coding and naming