Motivation

Today we are going to talk about reading data, examples of common data formats, and useful tools to access data.


Working directories

One of the keys to creating a reproducible workflow is maintaining the proper folder/directory structures across computers, platforms, etc.

Absolute paths

In the early days of R programming, many people worked with absolute paths. When you open up someone’s R code or analysis, you would often see the setwd() function being used. This explicitly tells R to change the absolute path to the folder/directory where work is being done. Although this may work fine for an individual who works by themselves on a single computer, it will quickly lead to headaches and long trails of work when projects are shared with others.

For example, if Mark had a series of scripts scattered throughout a repo, and each of them began with something like

setwd("~/Users/Mark/Documents/folder/that/only/Mark/has")

you would need to go into each script and edit every single one of those absolute paths, such that they worked on your own computer.

Relative paths

An alternative to using absolute paths is to use relative paths. For example, the following command would set the working directory to the /data/ folder in someone’s root directory.

setwd("~/data/")

RStudio .Rproj files

We have already seen how to set up projects in RStudio and make use of the relative file paths associated with a project’s .Rproj file. When you start a project by opening a .Rproj file, RStudio automatically changes the path to the folder/directory where the .Rproj file is located.

The {here} package

Within R, the {here} package will automatically identify the top-level directory of a Git repo, and then direct all other paths relative to that. For more on project-oriented workflows, check out this blog post by Jenny Bryan.

Tip: The here::here() function creates a path relative to the folder/directory where the .Rproj file exists.

Instead of using setwd() at the top your .R or .Rmd file, Jenny suggests:

  • Organize each logical project into a folder on your computer (i.e., use a research compendium).

  • Make sure the top-level folder advertises itself as such. This can be as simple as having an empty file named .here. If you use RStudio and/or Git, those both leave characteristic files behind that will get the job done.

  • Use the here() function from the {here} package to build the path when you read or write a file. Create paths relative to the top-level directory.

  • Whenever you work on this project, launch the R process from the project’s top-level directory. If you launch R from the shell, use cd to move to the correct folder before beginning.

Tip: You can use getwd() to see your current working directory and you can inspect the files available using list.file() or dir().

## what is the working directory?
getwd()
## [1] "/Users/mark/Documents/GitHub/FISH549/website/lectures/week_04"
## what files exist in the working directory?
list.files()
##  [1] "data"                      "fontawesome.css"          
##  [3] "images"                    "lec_10_getting_data_cache"
##  [5] "lec_10_getting_data.html"  "lec_10_getting_data.Rmd"  
##  [7] "lec_11_getting_help.html"  "lec_11_getting_help.Rmd"  
##  [9] "lec_12_kara_woo.html"      "lec_12_kara_woo.Rmd"      
## [11] "lecture_inst.css"          "readr_cheatsheet.pdf"     
## [13] "solid.css"
dir()
##  [1] "data"                      "fontawesome.css"          
##  [3] "images"                    "lec_10_getting_data_cache"
##  [5] "lec_10_getting_data.html"  "lec_10_getting_data.Rmd"  
##  [7] "lec_11_getting_help.html"  "lec_11_getting_help.Rmd"  
##  [9] "lec_12_kara_woo.html"      "lec_12_kara_woo.Rmd"      
## [11] "lecture_inst.css"          "readr_cheatsheet.pdf"     
## [13] "solid.css"

Our current location is in the week_04 sub-folder buried relatively deeply within the course website repository.

Using here() returns the following information:

## load here package
library(here)
## here() starts at /Users/mark/Documents/GitHub/FISH549/website
## show files in base directory
list.files(here())
##  [1] "_list_of_hw.Rmd"         "_list_of_lectures.Rmd"  
##  [3] "_site.yml"               "_syllabus.Rmd"          
##  [5] "_timeline.Rmd"           "callout_code.html"      
##  [7] "docs"                    "FISH549_hex_sticker.png"
##  [9] "homework"                "homework.Rmd"           
## [11] "index.Rmd"               "lectures"               
## [13] "lectures.Rmd"            "LICENSE"                
## [15] "pkg_reviews.Rmd"         "README.md"              
## [17] "references"              "references.Rmd"         
## [19] "syllabus.Rmd"            "webfonts"               
## [21] "website.Rproj"
## show files in references directory
list.files(here("references"))
##  [1] "BES_guide_to_reproducible_code.pdf"      
##  [2] "Broman_Woo_2018_data_in_spreadsheets.pdf"
##  [3] "Bryan_2017_version_control.pdf"          
##  [4] "Ellis_Leek_2017_how_to_share_data.pdf"   
##  [5] "Fegraus_2005_intro_EML.pdf"              
##  [6] "fontawesome.css"                         
##  [7] "git_install"                             
##  [8] "git_install_windows.html"                
##  [9] "git_install_windows.Rmd"                 
## [10] "github_pat_creation.html"                
## [11] "github_pat_creation.Rmd"                 
## [12] "github_signup.html"                      
## [13] "github_signup.Rmd"                       
## [14] "images"                                  
## [15] "Jones_2006_new_bioinformatics.pdf"       
## [16] "lecture_inst.css"                        
## [17] "Leek_Peng_2015_what_is_the_Q.pdf"        
## [18] "Marwick_etal_2018_research_compendia.pdf"
## [19] "Michener_2005_meta_info_concepts.pdf"    
## [20] "Parker_2017_analysis_development.pdf"    
## [21] "readr_cheatsheet.pdf"                    
## [22] "Ross_etal_2017_tidy_tools.pdf"           
## [23] "rstudio_cheatsheet.pdf"                  
## [24] "set_global_gitignore.html"               
## [25] "set_global_gitignore.Rmd"                
## [26] "Silberzahn_2018_crowdsourcing_paper.pdf" 
## [27] "solid.css"

Tip: We can also use here() to check the path for a specific file.

here("data", "palmer_penguins.csv")
## [1] "/Users/mark/Documents/GitHub/FISH549/website/data/palmer_penguins.csv"

Checking for local files

You may run into situations where you’d like to

  1. check if a file already exists, and

  2. if not, perhaps create a new folder/directory before downloading or saving the file.

Tip: The function file.exists() will test for the presence of a file and return TRUE or FALSE.

For example,

file.exists(here("my", "relative", "path", "filename.csv"))

will look for the filename.csv within the folder/directory /my/relative/path.

Why or when would someone use this? Fitting statistical models in JAGS or Stan can take a lot of time (hours), and I don’t want to have to refit the same models when doing an analysis. Therefore, I often use file.exists() within an if statement to first check if the folder/file exists, and if not, create the folder and save the file.

## set analysis directory
analysis_dir <- here("analysis")

## check if the `/analysis/cache` folder exists
## if not, create it
if(!file.exists(here("analysis_dir", "cache"))) {
  dir.create(here("analysis_dir", "cache"))
}

## check for saved file named `fit_ipm_JAGS.rds`
## if it doesn't exists, fit the model and save it
if(!file.exists(here("analysis_dir", "cache", "fit_ipm_JAGS.rds"))) {
  ## fit the model
  fit_ipm_JAGS <- rjags(...)
  ## save the model fit
  saveRDS(fit_ipm_JAGS, here("analysis_dir", "cache", "fit_ipm_JAGS.rds"))
}

Getting data

Downloading files

One option for reading data is to scrape files directly from a website. Here is an example of downloading some river flow data from the U.S. Geological Survey National Water Information System.

We will use the direct link to the gauge data from the Skagit River near Mount Vernon, WA (gauge #12178000). Before beginning, we need to specify the first and last years, and gauge’s ID.

## first & last years of flow data
yr_frst <- 2001
yr_last <- 2020

## flow gauge ID
flow_site <- 12178000

Here I break the URL into component pieces so its easier to see how and where the user input data are used.

## get URL for flow data from USGS
flow_url <- paste0("https://waterdata.usgs.gov/nwis/dv",
                   "?cb_00060=on",
                   "&format=rdb",
                   "&site_no=", flow_site,
                   "&begin_date=", yr_frst, "-01-01",
                   "&end_date=", yr_last, "-12-31")

Next we retrieve the raw data file and print its metadata.

## load `readr` package
library("readr")

## raw flow data from USGS
flow_raw <- read_lines(flow_url)

## lines with metadata
hdr_flow <- which(lapply(flow_raw, grep, pattern = "\\#") == 1,
                  arr.ind = TRUE)

## print flow metadata
print(flow_raw[hdr_flow], quote = FALSE)
##  [1] # ---------------------------------- WARNING ----------------------------------------     
##  [2] # Some of the data that you have obtained from this U.S. Geological Survey database       
##  [3] # may not have received Director's approval. Any such data values are qualified           
##  [4] # as provisional and are subject to revision. Provisional data are released on the        
##  [5] # condition that neither the USGS nor the United States Government may be held liable     
##  [6] # for any damages resulting from its use.                                                 
##  [7] #                                                                                         
##  [8] # Additional info: https://help.waterdata.usgs.gov/policies/provisional-data-statement    
##  [9] #                                                                                         
## [10] # File-format description:  https://help.waterdata.usgs.gov/faq/about-tab-delimited-output
## [11] # Automated-retrieval info: https://help.waterdata.usgs.gov/faq/automated-retrievals      
## [12] #                                                                                         
## [13] # Contact:   gs-w_waterdata_support@usgs.gov                                              
## [14] # retrieved: 2023-01-25 09:14:56 EST       (vaww02)                                       
## [15] #                                                                                         
## [16] # Data for the following 1 site(s) are contained in this file                             
## [17] #    USGS 12178000 SKAGIT RIVER AT NEWHALEM, WA                                           
## [18] # -----------------------------------------------------------------------------------     
## [19] #                                                                                         
## [20] # Data provided for site 12178000                                                         
## [21] #            TS   parameter     statistic     Description                                 
## [22] #        149338       00060     00003     Discharge, cubic feet per second (Mean)         
## [23] #                                                                                         
## [24] # Data-value qualification codes included in this output:                                 
## [25] #                                                                                         
## [26] #     A  Approved for publication -- Processing and review completed.                     
## [27] #     e  Value has been estimated.                                                        
## [28] #

Lastly, we extract the actual flow data for the years of interest and inspect the file contents.

## flow data for years of interest
dat_flow <-  read_tsv(flow_url,
                      col_names = FALSE,
                      col_types = "ciDdc",
                      skip = max(hdr_flow) + 2)

## revise column names
colnames(dat_flow) <- unlist(strsplit(tolower(flow_raw[max(hdr_flow) + 1]),
                                      split = "\\s+"))
## inspect tibble
head(dat_flow)
## # A tibble: 6 × 5
##   agency_cd  site_no datetime   `149338_00060_00003` `149338_00060_00003_cd`
##   <chr>        <int> <date>                    <dbl> <chr>                  
## 1 USGS      12178000 2001-01-01                 2520 A                      
## 2 USGS      12178000 2001-01-02                 3270 A                      
## 3 USGS      12178000 2001-01-03                 2850 A                      
## 4 USGS      12178000 2001-01-04                 2640 A                      
## 5 USGS      12178000 2001-01-05                 2680 A                      
## 6 USGS      12178000 2001-01-06                 2630 A

We only need the 3rd and 4th columns, which contain the date (datetime) and daily flow measurements (149338_00060_00003). We will rename them to date and flow, respectively, and convert the flow units from “cubic feet per second” to “cubic meters per second”.

## keep only relevant columns
dat_flow <- dat_flow[c("datetime",
                       grep("[0-9]$", colnames(dat_flow), value = TRUE))]

## nicer column names
colnames(dat_flow) <- c("date", "flow")

## convert cubic feet to cubic meters
dat_flow$flow <- dat_flow$flow / 35.3147

## flow by year & month
dat_flow$year <- as.integer(format(dat_flow$date, "%Y"))
dat_flow$month <- as.integer(format(dat_flow$date, "%m"))
dat_flow <- dat_flow[, c("year", "month", "flow")]

## inspect the file
head(dat_flow)
## # A tibble: 6 × 3
##    year month  flow
##   <int> <int> <dbl>
## 1  2001     1  71.4
## 2  2001     1  92.6
## 3  2001     1  80.7
## 4  2001     1  74.8
## 5  2001     1  75.9
## 6  2001     1  74.5

Reading .csv files

Base functions

The read.table() function, and its specific variants read.csv() and read.delim(), is one of the most commonly used functions for reading data into R. The help file for read.table() provides a lot of useful information (see ?read.table).

The read.table() function has several important arguments:

  • file: the name of a file, or a connection
  • header: a logical value indicating if the file has a header line
  • sep: a string indicating how the columns are separated
  • colClasses: a character vector indicating the class of each column in the dataset
  • nrows: the number of rows in the dataset (by default read.table() reads an entire file)
  • comment.char: a character string indicating the comment character, which defaults to #. If there are no commented lines in your file, it’s worth setting this to be the empty string "".
  • skip: the number of lines to skip from the beginning
  • stringsAsFactors: should character variables be coded as factors? Note that as of R version 4.0, this defaults to FALSE

Tip: For small to moderately sized files, you can usually call read.table() without specifying any other arguments.

For example, the following line

raw_data <- read.table("flat_file.txt")

would

  • skip any lines that begin with a #
  • figure out how many rows there are (and how much memory needs to be allocated)
  • figure what type of variable is in each column of the table

On the other hand, explicitly specifying all of these arguments directly (i.e., “defensive programming”) makes R run faster and more efficiently.

Tip: The read.csv() function is identical to read.table() except that some of the defaults are set differently (e.g., sep argument).

There is a palmer_penguins.csv file located deep in our working directory structure.

## read penguin data
data_raw <- read.csv(here("lectures", "week_04", "data", "palmer_penguins.csv"))

## inspect its content
head(data_raw)
##   studyName Sample.Number                             Species Region    Island
## 1   PAL0708             1 Adelie Penguin (Pygoscelis adeliae) Anvers Torgersen
## 2   PAL0708             2 Adelie Penguin (Pygoscelis adeliae) Anvers Torgersen
## 3   PAL0708             3 Adelie Penguin (Pygoscelis adeliae) Anvers Torgersen
## 4   PAL0708             4 Adelie Penguin (Pygoscelis adeliae) Anvers Torgersen
## 5   PAL0708             5 Adelie Penguin (Pygoscelis adeliae) Anvers Torgersen
## 6   PAL0708             6 Adelie Penguin (Pygoscelis adeliae) Anvers Torgersen
##                Stage Individual.ID Clutch.Completion Date.Egg
## 1 Adult, 1 Egg Stage          N1A1               Yes 11/11/07
## 2 Adult, 1 Egg Stage          N1A2               Yes 11/11/07
## 3 Adult, 1 Egg Stage          N2A1               Yes 11/16/07
## 4 Adult, 1 Egg Stage          N2A2               Yes 11/16/07
## 5 Adult, 1 Egg Stage          N3A1               Yes 11/16/07
## 6 Adult, 1 Egg Stage          N3A2               Yes 11/16/07
##   Culmen.Length..mm. Culmen.Depth..mm. Flipper.Length..mm. Body.Mass..g.    Sex
## 1               39.1              18.7                 181          3750   MALE
## 2               39.5              17.4                 186          3800 FEMALE
## 3               40.3              18.0                 195          3250 FEMALE
## 4                 NA                NA                  NA            NA   <NA>
## 5               36.7              19.3                 193          3450 FEMALE
## 6               39.3              20.6                 190          3650   MALE
##   Delta.15.N..o.oo. Delta.13.C..o.oo.                       Comments
## 1                NA                NA Not enough blood for isotopes.
## 2           8.94956         -24.69454                           <NA>
## 3           8.36821         -25.33302                           <NA>
## 4                NA                NA             Adult not sampled.
## 5           8.76651         -25.32426                           <NA>
## 6           8.66496         -25.29805                           <NA>

Tidyverse

The newer Tidyverse collection of packages includes the {readr} package for reading in data files and assigning them the tibble object class. Click here to see the cheat sheet.

Task: Read the Palmer penguins data into R with the readr package and then inspect its contents.

## read penguin data
data_raw <- read_csv(here("lectures", "week_04", "data", "palmer_penguins.csv"))
## Rows: 344 Columns: 17
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (10): studyName, Species, Region, Island, Stage, Individual ID, Clutch C...
## dbl  (7): Sample Number, Culmen Length (mm), Culmen Depth (mm), Flipper Leng...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## inspect its content
head(data_raw)
## # A tibble: 6 × 17
##   studyName `Sample Number` Species          Region Island Stage `Individual ID`
##   <chr>               <dbl> <chr>            <chr>  <chr>  <chr> <chr>          
## 1 PAL0708                 1 Adelie Penguin … Anvers Torge… Adul… N1A1           
## 2 PAL0708                 2 Adelie Penguin … Anvers Torge… Adul… N1A2           
## 3 PAL0708                 3 Adelie Penguin … Anvers Torge… Adul… N2A1           
## 4 PAL0708                 4 Adelie Penguin … Anvers Torge… Adul… N2A2           
## 5 PAL0708                 5 Adelie Penguin … Anvers Torge… Adul… N3A1           
## 6 PAL0708                 6 Adelie Penguin … Anvers Torge… Adul… N3A2           
## # … with 10 more variables: `Clutch Completion` <chr>, `Date Egg` <chr>,
## #   `Culmen Length (mm)` <dbl>, `Culmen Depth (mm)` <dbl>,
## #   `Flipper Length (mm)` <dbl>, `Body Mass (g)` <dbl>, Sex <chr>,
## #   `Delta 15 N (o/oo)` <dbl>, `Delta 13 C (o/oo)` <dbl>, Comments <chr>

Reading Excel files

Excel is very popular for entering and storing data, although we’ve already seen some of the pitfalls that you might encounter. There is a great package called {readxl} that allows you to read Excel files directly into R. Here’s an example that uses an Excel file called world_bank_climate_change.xls from the World Bank’s website. It looks like this:


Tip: We can get a list of the individual worksheets within the workbook with excel_sheets().

## load readxl
library(readxl)

## inpect the sheet names
excel_sheets("data/world_bank_climate_change.xls")
## [1] "Data"    "Country" "Series"

We can read in a specific worksheet with read_excel().

## read in the "Data" worksheet
cc_data <- read_excel("data/world_bank_climate_change.xls", sheet = "Data")

## inspect it
head(cc_data)
## # A tibble: 6 × 28
##   `Country code` `Country name`       `Series code` `Series name` SCALE Decimals
##   <chr>          <chr>                <chr>         <chr>         <dbl>    <dbl>
## 1 ABW            Aruba                AG.LND.EL5M.… Land area be…     0        1
## 2 ADO            Andorra              AG.LND.EL5M.… Land area be…     0        1
## 3 AFG            Afghanistan          AG.LND.EL5M.… Land area be…     0        1
## 4 AGO            Angola               AG.LND.EL5M.… Land area be…     0        1
## 5 ALB            Albania              AG.LND.EL5M.… Land area be…     0        1
## 6 ARE            United Arab Emirates AG.LND.EL5M.… Land area be…     0        1
## # … with 22 more variables: `1990` <chr>, `1991` <chr>, `1992` <chr>,
## #   `1993` <chr>, `1994` <chr>, `1995` <chr>, `1996` <chr>, `1997` <chr>,
## #   `1998` <chr>, `1999` <chr>, `2000` <chr>, `2001` <chr>, `2002` <chr>,
## #   `2003` <chr>, `2004` <chr>, `2005` <chr>, `2006` <chr>, `2007` <chr>,
## #   `2008` <chr>, `2009` <chr>, `2010` <chr>, `2011` <chr>

Tip: If you are only interested in a certain subset of the data, the range argument can be used to select which cells to read.

For example, to read in the columns related to cereal yield for the years 1990 through 1995 for country names that start with “B” in the Data tab, we would only want to read in the rectangle of cells from G481 to L498.

Note: The authors of this file decided to use .. to represent missing values, rather than something more conventional like NA or N/A, so hence the argument na = "..".

cc_data <- read_excel("data/world_bank_climate_change.xls",
                      sheet     = "Data",
                      range     = "G481:L498",
                      col_names = as.character(c(1990:1995)),
                      na        = "..")
head(cc_data)
## # A tibble: 6 × 6
##   `1990` `1991` `1992` `1993` `1994` `1995`
##    <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
## 1  1348.  1369.  1391.  1387.  1289.  1332.
## 2    NA     NA     NA     NA     NA     NA 
## 3   848.   878.   913.   919.   957.  1086.
## 4   600.   876.   872.   936.   788.   851 
## 5  2490.  2586.  2626.  2648.  2476.  2593.
## 6  3954.  3990   2954.  2512.  2799.  3069.

Workflow example

Tip: If you’d like to see an example of how one might use the {readxl} package in a workflow, check out Mark’s project on estimating trends in the density of bull trout across the Pacific Northwest here.

Reading a Google Sheet

The {googlesheets4} package allows you to inspect and read Google Sheets.

Tip: The gs_ls() function returns a data frame of the sheets listed in your Google Sheets home screen located at https://docs.google.com/spreadsheets/. This will include sheets that you own and sheets owned by others that you have permission to access.

Task: To begin, you need to use gs4_deauth() to authorize googlesheets4 to access the contents of your Google Drive where the Sheets are stored (here I’m allowing only access to public sheets).

## load the packge
library(googlesheets4)

## authorize for reading public-only sheets
gs4_deauth()
  Tip: We can read in a sheet by one of two methods:
  1. specify the entire URL
  2. specify only the sheet ID

Option 1: Specify the entire URL

## specify URL
gs_url <- "https://docs.google.com/spreadsheets/d/1ct3MMMzEX82BeqJcUw3uYD5b4CJrKUmRYCK-wVA3yig/edit#gid=0"

## read from URL
read_sheet(gs_url)
## ✔ Reading from "FISH549_example".
## ✔ Range 'Sheet1'.
## # A tibble: 34 × 5
##     year flow_wtr flow_sum  npgo h_releases
##    <dbl>    <dbl>    <dbl> <dbl>      <dbl>
##  1  1978     234.     29.4 -0.75     340416
##  2  1979     391.     28.3 -0.22     274090
##  3  1980     490.     41.9 -0.69     216537
##  4  1981     253.     40.5 -0.09     300942
##  5  1982     248.     39.6  0.62     241931
##  6  1983     306.     40.8 -0.39     256238
##  7  1984     199.     45.9 -0.73     319220
##  8  1985     205.     42.8  0.43     125280
##  9  1986     207      45.0  1.32     229013
## 10  1987     190.     36.5  0.61     217948
## # … with 24 more rows
## assign it to an object
gs_raw <- read_sheet(gs_url)
## ✔ Reading from "FISH549_example".
## ✔ Range 'Sheet1'.

Option 2: Specify the sheet ID

## specify ID
gs_id <- as_sheets_id(gs_url)

## read from ID
read_sheet(gs_id)
## ✔ Reading from "FISH549_example".
## ✔ Range 'Sheet1'.
## # A tibble: 34 × 5
##     year flow_wtr flow_sum  npgo h_releases
##    <dbl>    <dbl>    <dbl> <dbl>      <dbl>
##  1  1978     234.     29.4 -0.75     340416
##  2  1979     391.     28.3 -0.22     274090
##  3  1980     490.     41.9 -0.69     216537
##  4  1981     253.     40.5 -0.09     300942
##  5  1982     248.     39.6  0.62     241931
##  6  1983     306.     40.8 -0.39     256238
##  7  1984     199.     45.9 -0.73     319220
##  8  1985     205.     42.8  0.43     125280
##  9  1986     207      45.0  1.32     229013
## 10  1987     190.     36.5  0.61     217948
## # … with 24 more rows
## assign it to an object
gs_raw <- read_sheet(gs_id)
## ✔ Reading from "FISH549_example".
## ✔ Range 'Sheet1'.

Tip: We can browse the sheet’s metadata with gs4_browse().

## browse sheet contents
gs4_get(gs_id)
## Spreadsheet name: FISH549_example
##               ID: 1ct3MMMzEX82BeqJcUw3uYD5b4CJrKUmRYCK-wVA3yig
##           Locale: en_US
##        Time zone: America/Los_Angeles
##      # of sheets: 1
## 
## (Sheet name): (Nominal extent in rows x columns)
##       Sheet1: 1001 x 26

Tip: We can also read in just a subset of cells by specifying the range argument.

read_sheet(gs_id, range = "A1:B10")
## ✔ Reading from "FISH549_example".
## ✔ Range 'A1:B10'.
## # A tibble: 9 × 2
##    year flow_wtr
##   <dbl>    <dbl>
## 1  1978     234.
## 2  1979     391.
## 3  1980     490.
## 4  1981     253.
## 5  1982     248.
## 6  1983     306.
## 7  1984     199.
## 8  1985     205.
## 9  1986     207

Reading a JSON file

What is JSON?

JSON (or JavaScript Object Notation) is a file format that stores information in an organized, logical, easy-to-access manner (i.e., “human readable”). For example, here is what a JSON file looks like:

var mark = {
  "city" : "Seattle",
  "state" : "WA", 
  "hobbies" : {
    "hobby1" : "cycling",
    "hobby2" : "skiing"
  }
  "bikes" : {
    "bike1" : "Ridley Helium",
    "bike2" : "Niner RLT",
    "bike3" : "Santa Cruz Tallboy"
  }
  "skis" : {
    "skis1" : "K2 Hardside",
    "skis2" : "Black Crows Corvus Freebird"
  }
}

Some features about JSON object:

  • JSON objects are surrounded by curly braces {}
  • JSON objects are written in key : value pairs (separated by a colon)
  • Keys must be strings
  • Values must be a valid JSON data type (string, number, object, array, boolean)
  • Each key : value pair is separated by a comma

Using the GitHub API

Many organizations provide an application programming interface (API) for accessing their data. GitHub has an API that allows you to explore information about users. Let’s use this API to explore some of Mark’s GitHub activities.

To do so, we’ll use the fromJSON() function from the {jsonlite} package to convert the data from a JSON object to a data frame.

Task: Begin by reading in the JSON file located at https://api.github.com/users/mdscheuerell/repos.

## load the pckage
library(jsonlite)

## set the URL
github_url = "https://api.github.com/users/mdscheuerell/repos"

## get the JSON data
data_json <- fromJSON(github_url)

Note: The function fromJSON() will convert the JSON file into a data frame.

Tip: You can inspect the field names in the data frame with names().

## all of the JSON names
names(data_json)
##  [1] "id"                          "node_id"                    
##  [3] "name"                        "full_name"                  
##  [5] "private"                     "owner"                      
##  [7] "html_url"                    "description"                
##  [9] "fork"                        "url"                        
## [11] "forks_url"                   "keys_url"                   
## [13] "collaborators_url"           "teams_url"                  
## [15] "hooks_url"                   "issue_events_url"           
## [17] "events_url"                  "assignees_url"              
## [19] "branches_url"                "tags_url"                   
## [21] "blobs_url"                   "git_tags_url"               
## [23] "git_refs_url"                "trees_url"                  
## [25] "statuses_url"                "languages_url"              
## [27] "stargazers_url"              "contributors_url"           
## [29] "subscribers_url"             "subscription_url"           
## [31] "commits_url"                 "git_commits_url"            
## [33] "comments_url"                "issue_comment_url"          
## [35] "contents_url"                "compare_url"                
## [37] "merges_url"                  "archive_url"                
## [39] "downloads_url"               "issues_url"                 
## [41] "pulls_url"                   "milestones_url"             
## [43] "notifications_url"           "labels_url"                 
## [45] "releases_url"                "deployments_url"            
## [47] "created_at"                  "updated_at"                 
## [49] "pushed_at"                   "git_url"                    
## [51] "ssh_url"                     "clone_url"                  
## [53] "svn_url"                     "homepage"                   
## [55] "size"                        "stargazers_count"           
## [57] "watchers_count"              "language"                   
## [59] "has_issues"                  "has_projects"               
## [61] "has_downloads"               "has_wiki"                   
## [63] "has_pages"                   "has_discussions"            
## [65] "forks_count"                 "mirror_url"                 
## [67] "archived"                    "disabled"                   
## [69] "open_issues_count"           "license"                    
## [71] "allow_forking"               "is_template"                
## [73] "web_commit_signoff_required" "topics"                     
## [75] "visibility"                  "forks"                      
## [77] "open_issues"                 "watchers"                   
## [79] "default_branch"

Now we can query these data to answer some questions about Mark’s GitHub profile and activities. (Note that the following results are only based upon the first 30 repos listed in Mark’s account.)

Task: How many of Mark’s repos have been forked?

## table of the number of forks
table(data_json$forks)
## 
##  0  1  2 
## 23  6  1

Only a total of 7 of the firt 30 repos have been forked.

Task: What are the languages used in these repos?

## table number of the different languages
table(data_json$language)
## 
##  CSS HTML  Lua    R  TeX 
##    1    8    1    8    3

Task: How many repos have open issues that Mark should address?

## how many repos have open issues? 
table(data_json$open_issues_count)
## 
##  0  2  3  5 13 
## 26  1  1  1  1

Not too bad, but clearly he’s behind on some things. One of his repos has 13 open issues!


Other good R packages