6 January 2025

What is data science?

Data science is…

  • statistics on a Mac
  • 80% preparing data, 20% complaining about preparing data

Data science motto


If at first you don’t succeed,
call it version 1.0!

Data science is interdisciplanary

  • Math and statistics

  • Computer science

  • Knowledge about the system

  • Communication

Data science process

  1. Define the question of interest
  2. Get the data
  3. Clean the data
  4. Explore the data
  5. Fit statistical models
  6. Communicate the results

Data science process

  1. Define the question of interest
  2. Get the data
  3. Clean the data
  4. Explore the data
  5. Fit statistical models
  6. Communicate the results


This should ALL be reproducible

Data science time

  1. Define the question of interest
  2. Get the data
  3. Clean the data
  4. Explore the data
  5. Fit statistical models
  6. Communicate the results

 

 

“Good data science is distinguished from bad data science primarily by a repeatable, thoughtful, skeptical application of an analytic process to data in order to arrive at supportable conclusions.”


Jeff Leek, Vice President and Chief Data Officer, Fred Hutch

Questions in data science

Leek & Peng (2015) Science

Leek & Peng (2015) Science

Descriptive

Summarizes the information in a single data set without further interpretation

ex: the US Census

Leek & Peng (2015) Science

Exploratory

Searches for trends, correlations, etc among multiple variables

ex: “data mining”

Leek & Peng (2015) Science

Inferential

Generalizes to the population from a sample

ex: wearing a mask reduces the spread of COVID

Predictive

Uses a subset of the data to parameterize a model and predicts out-of-sample data

ex: using weather to predict recreational fishing effort

Leek & Peng (2015) Science

Causal

Seeks to understand the average magnitude and direction of a response

ex: experimental analysis of temperature effects on oyster growth

Mechanistic

Seeks to understand the specific magnitude and direction of relationships

ex: engineering studies

Storytelling in data science

Communicating an analysis

Although you may have spent weeks/months on an analysis, most people only want the bottom line

The key is weaving everything into a nice story to achieve both trust and belief

Trust

Do you accept the analysis per se?

Were the data analyzed properly and thoroughly?


Trust is specific to the analysis/analyst (internal)

Belief

Do you believe the analysis?


Belief is more broadly related to previous work and other factors (external)

3 parts to an analysis

  1. The things you did and presented

  2. The things you did but did not present

  3. The things you did not do


The trick is to strike a balance to achieve trust and belief

Things we’ll emphasize

  • Thinking about data separately from research
  • Increasing efficiency & reproducibility
  • Facilitating collaboration with others
  • Finding answers from the community
  • Communicating our workflows & results

Things we won’t emphasize

  • Specific options for cleaning & tidying data (eg, tidyverse)
  • Specific analytical techniques (eg, machine learning)
  • Specific plotting techniques (eg, ggplot)

How are we going to do this?

  • Use lots of online resources
  • Use open/live coding
  • Rely on our community for help