Today we are going to talk about reading data, examples of common data formats, and useful tools to access data.
One of the keys to creating a reproducible workflow is maintaining the proper folder/directory structures across computers, platforms, etc.
In the early days of R programming, many people worked with absolute paths. When you opened up someone's R code or analysis, you would often see the setwd() function being used. This explicitly tells R to change the absolute path to the folder/directory where work is being done. Although this may work fine for an individual working alone on a single computer, it quickly leads to headaches when projects are shared with others.
For example, if Mark had a series of scripts scattered throughout a repo, and each of them began with something like

setwd("/Users/Mark/Documents/folder/that/only/Mark/has")

you would need to go into each script and edit every one of those absolute paths so that they worked on your own computer.
An alternative to using absolute paths is to use relative paths. For example, the following command would set the working directory to the data folder in someone's home directory.

setwd("~/data/")
.Rproj files

We have already seen how to set up projects in RStudio and make use of the relative file paths associated with a project's .Rproj file. When you start a project by opening a .Rproj file, RStudio automatically changes the working directory to the folder/directory where the .Rproj file is located.
The {here} package

Within R, the {here} package will automatically identify the top-level directory of a Git repo, and then direct all other paths relative to that. For more on project-oriented workflows, check out this blog post by Jenny Bryan.
Tip: The here::here() function creates a path relative to the folder/directory where the .Rproj file exists.
Instead of using setwd() at the top of your .R or .Rmd file, Jenny suggests:
1. Organize each logical project into a folder on your computer (i.e., use a research compendium).

2. Make sure the top-level folder advertises itself as such. This can be as simple as having an empty file named .here. If you use RStudio and/or Git, those both leave characteristic files behind that will get the job done.

3. Use the here() function from the {here} package to build the path when you read or write a file (see the sketch after this list). Create paths relative to the top-level directory.

4. Whenever you work on this project, launch the R process from the project's top-level directory. If you launch R from the shell, use cd to move to the correct folder before beginning.
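For example, here is a minimal sketch of steps 3 and 4 in practice (the file and folder names are hypothetical):

## load here package
library(here)
## read raw data from the project's `data` folder
dat_raw <- read.csv(here("data", "my_raw_data.csv"))
## ... clean or summarize the data ...
## write results to the project's `output` folder
write.csv(dat_raw, here("output", "my_clean_data.csv"), row.names = FALSE)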
Tip: You can use getwd() to see your current working directory, and you can inspect the files available using list.files() or dir().
## what is the working directory?
getwd()
## [1] "/Users/mark/Documents/GitHub/FISH549/website/lectures/week_04"
## what files exist in the working directory?
list.files()
## [1] "data" "fontawesome.css"
## [3] "images" "lec_10_getting_data_cache"
## [5] "lec_10_getting_data.html" "lec_10_getting_data.Rmd"
## [7] "lec_11_getting_help.html" "lec_11_getting_help.Rmd"
## [9] "lec_12_kara_woo.html" "lec_12_kara_woo.Rmd"
## [11] "lecture_inst.css" "readr_cheatsheet.pdf"
## [13] "solid.css"
dir()
## [1] "data" "fontawesome.css"
## [3] "images" "lec_10_getting_data_cache"
## [5] "lec_10_getting_data.html" "lec_10_getting_data.Rmd"
## [7] "lec_11_getting_help.html" "lec_11_getting_help.Rmd"
## [9] "lec_12_kara_woo.html" "lec_12_kara_woo.Rmd"
## [11] "lecture_inst.css" "readr_cheatsheet.pdf"
## [13] "solid.css"
Our current location is in the week_04
sub-folder buried
relatively deeply within the course website
repository.
Using here()
returns the following information:
## load here package
library(here)
## here() starts at /Users/mark/Documents/GitHub/FISH549/website
## show files in base directory
list.files(here())
## [1] "_list_of_hw.Rmd" "_list_of_lectures.Rmd"
## [3] "_site.yml" "_syllabus.Rmd"
## [5] "_timeline.Rmd" "callout_code.html"
## [7] "docs" "FISH549_hex_sticker.png"
## [9] "homework" "homework.Rmd"
## [11] "index.Rmd" "lectures"
## [13] "lectures.Rmd" "LICENSE"
## [15] "pkg_reviews.Rmd" "README.md"
## [17] "references" "references.Rmd"
## [19] "syllabus.Rmd" "webfonts"
## [21] "website.Rproj"
## show files in references directory
list.files(here("references"))
## [1] "BES_guide_to_reproducible_code.pdf"
## [2] "Broman_Woo_2018_data_in_spreadsheets.pdf"
## [3] "Bryan_2017_version_control.pdf"
## [4] "Ellis_Leek_2017_how_to_share_data.pdf"
## [5] "Fegraus_2005_intro_EML.pdf"
## [6] "fontawesome.css"
## [7] "git_install"
## [8] "git_install_windows.html"
## [9] "git_install_windows.Rmd"
## [10] "github_pat_creation.html"
## [11] "github_pat_creation.Rmd"
## [12] "github_signup.html"
## [13] "github_signup.Rmd"
## [14] "images"
## [15] "Jones_2006_new_bioinformatics.pdf"
## [16] "lecture_inst.css"
## [17] "Leek_Peng_2015_what_is_the_Q.pdf"
## [18] "Marwick_etal_2018_research_compendia.pdf"
## [19] "Michener_2005_meta_info_concepts.pdf"
## [20] "Parker_2017_analysis_development.pdf"
## [21] "readr_cheatsheet.pdf"
## [22] "Ross_etal_2017_tidy_tools.pdf"
## [23] "rstudio_cheatsheet.pdf"
## [24] "set_global_gitignore.html"
## [25] "set_global_gitignore.Rmd"
## [26] "Silberzahn_2018_crowdsourcing_paper.pdf"
## [27] "solid.css"
Tip: We can also use here()
to check
the path for a specific file.
here("data", "palmer_penguins.csv")
## [1] "/Users/mark/Documents/GitHub/FISH549/website/data/palmer_penguins.csv"
You may run into situations where you’d like to
check if a file already exists, and
if not, perhaps create a new folder/directory before downloading or saving the file.
Tip: The function file.exists() will test for the presence of a file and return TRUE or FALSE.

For example,

file.exists(here("my", "relative", "path", "filename.csv"))

will look for filename.csv within the folder my/relative/path relative to the project's top-level directory.
Why or when would someone use this? Fitting statistical models in
JAGS or Stan can take a lot of time
(hours), and I don’t want to have to refit the same models when doing an
analysis. Therefore, I often use file.exists()
within an
if
statement to first check if the folder/file exists, and
if not, create the folder and save the file.
## set analysis directory
analysis_dir <- here("analysis")
## check if the `analysis/cache` folder exists
## if not, create it
if(!file.exists(file.path(analysis_dir, "cache"))) {
  dir.create(file.path(analysis_dir, "cache"))
}
## check for saved file named `fit_ipm_JAGS.rds`
## if it doesn't exist, fit the model and save it
if(!file.exists(file.path(analysis_dir, "cache", "fit_ipm_JAGS.rds"))) {
  ## fit the model (placeholder for the actual JAGS call)
  fit_ipm_JAGS <- rjags(...)
  ## save the model fit
  saveRDS(fit_ipm_JAGS, file.path(analysis_dir, "cache", "fit_ipm_JAGS.rds"))
}
One option for reading data is to scrape files directly from a website. Here is an example of downloading some river flow data from the U.S. Geological Survey National Water Information System.
We will use the direct link to the gauge data from the Skagit River at Newhalem, WA (gauge #12178000). Before beginning, we need to specify the first and last years of data, and the gauge's ID.
## first & last years of flow data
yr_frst <- 2001
yr_last <- 2020
## flow gauge ID
flow_site <- 12178000
Here I break the URL into its component pieces so it's easier to see how and where the user input data are used.
## get URL for flow data from USGS
flow_url <- paste0("https://waterdata.usgs.gov/nwis/dv",
"?cb_00060=on",
"&format=rdb",
"&site_no=", flow_site,
"&begin_date=", yr_frst, "-01-01",
"&end_date=", yr_last, "-12-31")
Next we retrieve the raw data file and print its metadata.
## load `readr` package
library("readr")
## raw flow data from USGS
flow_raw <- read_lines(flow_url)
## lines with metadata
hdr_flow <- which(lapply(flow_raw, grep, pattern = "\\#") == 1,
arr.ind = TRUE)
## print flow metadata
print(flow_raw[hdr_flow], quote = FALSE)
## [1] # ---------------------------------- WARNING ----------------------------------------
## [2] # Some of the data that you have obtained from this U.S. Geological Survey database
## [3] # may not have received Director's approval. Any such data values are qualified
## [4] # as provisional and are subject to revision. Provisional data are released on the
## [5] # condition that neither the USGS nor the United States Government may be held liable
## [6] # for any damages resulting from its use.
## [7] #
## [8] # Additional info: https://help.waterdata.usgs.gov/policies/provisional-data-statement
## [9] #
## [10] # File-format description: https://help.waterdata.usgs.gov/faq/about-tab-delimited-output
## [11] # Automated-retrieval info: https://help.waterdata.usgs.gov/faq/automated-retrievals
## [12] #
## [13] # Contact: gs-w_waterdata_support@usgs.gov
## [14] # retrieved: 2023-01-25 09:14:56 EST (vaww02)
## [15] #
## [16] # Data for the following 1 site(s) are contained in this file
## [17] # USGS 12178000 SKAGIT RIVER AT NEWHALEM, WA
## [18] # -----------------------------------------------------------------------------------
## [19] #
## [20] # Data provided for site 12178000
## [21] # TS parameter statistic Description
## [22] # 149338 00060 00003 Discharge, cubic feet per second (Mean)
## [23] #
## [24] # Data-value qualification codes included in this output:
## [25] #
## [26] # A Approved for publication -- Processing and review completed.
## [27] # e Value has been estimated.
## [28] #
Lastly, we extract the actual flow data for the years of interest and inspect the file contents.
## flow data for years of interest
dat_flow <- read_tsv(flow_url,
col_names = FALSE,
col_types = "ciDdc",
skip = max(hdr_flow) + 2)
## revise column names
colnames(dat_flow) <- unlist(strsplit(tolower(flow_raw[max(hdr_flow) + 1]),
split = "\\s+"))
## inspect tibble
head(dat_flow)
## # A tibble: 6 × 5
## agency_cd site_no datetime `149338_00060_00003` `149338_00060_00003_cd`
## <chr> <int> <date> <dbl> <chr>
## 1 USGS 12178000 2001-01-01 2520 A
## 2 USGS 12178000 2001-01-02 3270 A
## 3 USGS 12178000 2001-01-03 2850 A
## 4 USGS 12178000 2001-01-04 2640 A
## 5 USGS 12178000 2001-01-05 2680 A
## 6 USGS 12178000 2001-01-06 2630 A
We only need the 3rd and 4th columns, which contain the date (datetime) and daily flow measurements (149338_00060_00003). We will rename them to date and flow, respectively, and convert the flow units from "cubic feet per second" to "cubic meters per second".
## keep only relevant columns
dat_flow <- dat_flow[c("datetime",
grep("[0-9]$", colnames(dat_flow), value = TRUE))]
## nicer column names
colnames(dat_flow) <- c("date", "flow")
## convert cubic feet to cubic meters
dat_flow$flow <- dat_flow$flow / 35.3147
## flow by year & month
dat_flow$year <- as.integer(format(dat_flow$date, "%Y"))
dat_flow$month <- as.integer(format(dat_flow$date, "%m"))
dat_flow <- dat_flow[, c("year", "month", "flow")]
## inspect the file
head(dat_flow)
## # A tibble: 6 × 3
## year month flow
## <int> <int> <dbl>
## 1 2001 1 71.4
## 2 2001 1 92.6
## 3 2001 1 80.7
## 4 2001 1 74.8
## 5 2001 1 75.9
## 6 2001 1 74.5
.csv files

The read.table() function, along with its variants read.csv() and read.delim(), is one of the most commonly used functions for reading data into R. The help file for read.table() provides a lot of useful information (see ?read.table).
The read.table() function has several important arguments:

- file: the name of a file, or a connection
- header: a logical value indicating if the file has a header line
- sep: a string indicating how the columns are separated
- colClasses: a character vector indicating the class of each column in the dataset
- nrows: the number of rows in the dataset (by default read.table() reads an entire file)
- comment.char: a character string indicating the comment character, which defaults to #. If there are no commented lines in your file, it's worth setting this to be the empty string "".
- skip: the number of lines to skip from the beginning
- stringsAsFactors: should character variables be coded as factors? Note that as of R version 4.0, this defaults to FALSE
Tip: For small to moderately sized files, you can usually call read.table() without specifying any other arguments. For example, the following line

raw_data <- read.table("flat_file.txt")

would read the file into R, with R automatically skipping any lines that begin with #, figuring out how many rows there are, and determining what type of variable is in each column. On the other hand, explicitly specifying all of these arguments directly (i.e., "defensive programming") makes R run faster and more efficiently.
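By way of illustration, here is a minimal sketch of a fully specified call; the file name and column classes are hypothetical:

## read a tab-delimited file with all arguments stated explicitly
raw_data <- read.table("flat_file.txt",
                       header = TRUE,
                       sep = "\t",
                       colClasses = c("character", "integer", "numeric"),
                       nrows = 100,
                       comment.char = "",
                       skip = 0,
                       stringsAsFactors = FALSE)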
Tip: The read.csv() function is identical to read.table() except that some of the defaults are set differently (e.g., the sep argument).
There is a palmer_penguins.csv
file located deep in our
working directory structure.
## read penguin data
data_raw <- read.csv(here("lectures", "week_04", "data", "palmer_penguins.csv"))
## inspect its content
head(data_raw)
## studyName Sample.Number Species Region Island
## 1 PAL0708 1 Adelie Penguin (Pygoscelis adeliae) Anvers Torgersen
## 2 PAL0708 2 Adelie Penguin (Pygoscelis adeliae) Anvers Torgersen
## 3 PAL0708 3 Adelie Penguin (Pygoscelis adeliae) Anvers Torgersen
## 4 PAL0708 4 Adelie Penguin (Pygoscelis adeliae) Anvers Torgersen
## 5 PAL0708 5 Adelie Penguin (Pygoscelis adeliae) Anvers Torgersen
## 6 PAL0708 6 Adelie Penguin (Pygoscelis adeliae) Anvers Torgersen
## Stage Individual.ID Clutch.Completion Date.Egg
## 1 Adult, 1 Egg Stage N1A1 Yes 11/11/07
## 2 Adult, 1 Egg Stage N1A2 Yes 11/11/07
## 3 Adult, 1 Egg Stage N2A1 Yes 11/16/07
## 4 Adult, 1 Egg Stage N2A2 Yes 11/16/07
## 5 Adult, 1 Egg Stage N3A1 Yes 11/16/07
## 6 Adult, 1 Egg Stage N3A2 Yes 11/16/07
## Culmen.Length..mm. Culmen.Depth..mm. Flipper.Length..mm. Body.Mass..g. Sex
## 1 39.1 18.7 181 3750 MALE
## 2 39.5 17.4 186 3800 FEMALE
## 3 40.3 18.0 195 3250 FEMALE
## 4 NA NA NA NA <NA>
## 5 36.7 19.3 193 3450 FEMALE
## 6 39.3 20.6 190 3650 MALE
## Delta.15.N..o.oo. Delta.13.C..o.oo. Comments
## 1 NA NA Not enough blood for isotopes.
## 2 8.94956 -24.69454 <NA>
## 3 8.36821 -25.33302 <NA>
## 4 NA NA Adult not sampled.
## 5 8.76651 -25.32426 <NA>
## 6 8.66496 -25.29805 <NA>
The newer Tidyverse collection of packages includes
the {readr}
package for reading in data files and assigning them the
tibble
object class. Click here to see the cheat sheet.
Task: Read the Palmer penguins data into
R with the readr
package and then inspect
its contents.
## read penguin data
data_raw <- read_csv(here("lectures", "week_04", "data", "palmer_penguins.csv"))
## Rows: 344 Columns: 17
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (10): studyName, Species, Region, Island, Stage, Individual ID, Clutch C...
## dbl (7): Sample Number, Culmen Length (mm), Culmen Depth (mm), Flipper Leng...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## inspect its content
head(data_raw)
## # A tibble: 6 × 17
## studyName `Sample Number` Species Region Island Stage `Individual ID`
## <chr> <dbl> <chr> <chr> <chr> <chr> <chr>
## 1 PAL0708 1 Adelie Penguin … Anvers Torge… Adul… N1A1
## 2 PAL0708 2 Adelie Penguin … Anvers Torge… Adul… N1A2
## 3 PAL0708 3 Adelie Penguin … Anvers Torge… Adul… N2A1
## 4 PAL0708 4 Adelie Penguin … Anvers Torge… Adul… N2A2
## 5 PAL0708 5 Adelie Penguin … Anvers Torge… Adul… N3A1
## 6 PAL0708 6 Adelie Penguin … Anvers Torge… Adul… N3A2
## # … with 10 more variables: `Clutch Completion` <chr>, `Date Egg` <chr>,
## # `Culmen Length (mm)` <dbl>, `Culmen Depth (mm)` <dbl>,
## # `Flipper Length (mm)` <dbl>, `Body Mass (g)` <dbl>, Sex <chr>,
## # `Delta 15 N (o/oo)` <dbl>, `Delta 13 C (o/oo)` <dbl>, Comments <chr>
Excel is very popular for entering and storing data, although we've already seen some of the pitfalls that you might encounter. There is a great package called {readxl} that allows you to read Excel files directly into R. Here's an example that uses an Excel file called world_bank_climate_change.xls downloaded from the World Bank's website.
Tip: We can get a list of the individual worksheets
within the workbook with excel_sheets()
.
## load readxl
library(readxl)
## inspect the sheet names
excel_sheets("data/world_bank_climate_change.xls")
## [1] "Data" "Country" "Series"
We can read in a specific worksheet with
read_excel()
.
## read in the "Data" worksheet
cc_data <- read_excel("data/world_bank_climate_change.xls", sheet = "Data")
## inspect it
head(cc_data)
## # A tibble: 6 × 28
## `Country code` `Country name` `Series code` `Series name` SCALE Decimals
## <chr> <chr> <chr> <chr> <dbl> <dbl>
## 1 ABW Aruba AG.LND.EL5M.… Land area be… 0 1
## 2 ADO Andorra AG.LND.EL5M.… Land area be… 0 1
## 3 AFG Afghanistan AG.LND.EL5M.… Land area be… 0 1
## 4 AGO Angola AG.LND.EL5M.… Land area be… 0 1
## 5 ALB Albania AG.LND.EL5M.… Land area be… 0 1
## 6 ARE United Arab Emirates AG.LND.EL5M.… Land area be… 0 1
## # … with 22 more variables: `1990` <chr>, `1991` <chr>, `1992` <chr>,
## # `1993` <chr>, `1994` <chr>, `1995` <chr>, `1996` <chr>, `1997` <chr>,
## # `1998` <chr>, `1999` <chr>, `2000` <chr>, `2001` <chr>, `2002` <chr>,
## # `2003` <chr>, `2004` <chr>, `2005` <chr>, `2006` <chr>, `2007` <chr>,
## # `2008` <chr>, `2009` <chr>, `2010` <chr>, `2011` <chr>
Tip: If you are only interested in a certain subset
of the data, the range
argument can be used to select which
cells to read.
For example, to read in the columns related to cereal yield for the years 1990 through 1995 for country names that start with “B” in the Data tab, we would only want to read in the rectangle of cells from G481 to L498.
Note: The authors of this file decided to use .. to represent missing values, rather than something more conventional like NA or N/A; hence the argument na = "..".
cc_data <- read_excel("data/world_bank_climate_change.xls",
sheet = "Data",
range = "G481:L498",
col_names = as.character(c(1990:1995)),
na = "..")
head(cc_data)
## # A tibble: 6 × 6
## `1990` `1991` `1992` `1993` `1994` `1995`
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1348. 1369. 1391. 1387. 1289. 1332.
## 2 NA NA NA NA NA NA
## 3 848. 878. 913. 919. 957. 1086.
## 4 600. 876. 872. 936. 788. 851
## 5 2490. 2586. 2626. 2648. 2476. 2593.
## 6 3954. 3990 2954. 2512. 2799. 3069.
Tip: If you’d like to see an example of how one
might use the {readxl}
package in a workflow, check out
Mark’s project on estimating trends in the density of bull trout across
the Pacific Northwest here.
The {googlesheets4} package

The {googlesheets4} package allows you to inspect and read Google Sheets.

Tip: The gs4_find() function (which replaces gs_ls() from the retired {googlesheets} package) returns a data frame of the sheets listed in your Google Sheets home screen located at https://docs.google.com/spreadsheets/. This will include sheets that you own and sheets owned by others that you have permission to access.
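For example, a minimal sketch (this assumes you first authorize {googlesheets4} interactively via gs4_auth(), so it is not run here):

## load the package
library(googlesheets4)
## authorize googlesheets4 to access your Google Drive
gs4_auth()
## list the sheets you can access (returned as a data frame)
gs4_find()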
Task: To begin, use gs4_deauth() to tell {googlesheets4} that no authorization is needed, which limits access to public sheets only (here I'm reading a public sheet, so no Google login is required).
## load the package
library(googlesheets4)
## authorize for reading public-only sheets
gs4_deauth()
## specify URL
gs_url <- "https://docs.google.com/spreadsheets/d/1ct3MMMzEX82BeqJcUw3uYD5b4CJrKUmRYCK-wVA3yig/edit#gid=0"
## read from URL
read_sheet(gs_url)
## ✔ Reading from "FISH549_example".
## ✔ Range 'Sheet1'.
## # A tibble: 34 × 5
## year flow_wtr flow_sum npgo h_releases
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1978 234. 29.4 -0.75 340416
## 2 1979 391. 28.3 -0.22 274090
## 3 1980 490. 41.9 -0.69 216537
## 4 1981 253. 40.5 -0.09 300942
## 5 1982 248. 39.6 0.62 241931
## 6 1983 306. 40.8 -0.39 256238
## 7 1984 199. 45.9 -0.73 319220
## 8 1985 205. 42.8 0.43 125280
## 9 1986 207 45.0 1.32 229013
## 10 1987 190. 36.5 0.61 217948
## # … with 24 more rows
## assign it to an object
gs_raw <- read_sheet(gs_url)
## ✔ Reading from "FISH549_example".
## ✔ Range 'Sheet1'.
## specify ID
gs_id <- as_sheets_id(gs_url)
## read from ID
read_sheet(gs_id)
## ✔ Reading from "FISH549_example".
## ✔ Range 'Sheet1'.
## # A tibble: 34 × 5
## year flow_wtr flow_sum npgo h_releases
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1978 234. 29.4 -0.75 340416
## 2 1979 391. 28.3 -0.22 274090
## 3 1980 490. 41.9 -0.69 216537
## 4 1981 253. 40.5 -0.09 300942
## 5 1982 248. 39.6 0.62 241931
## 6 1983 306. 40.8 -0.39 256238
## 7 1984 199. 45.9 -0.73 319220
## 8 1985 205. 42.8 0.43 125280
## 9 1986 207 45.0 1.32 229013
## 10 1987 190. 36.5 0.61 217948
## # … with 24 more rows
## assign it to an object
gs_raw <- read_sheet(gs_id)
## ✔ Reading from "FISH549_example".
## ✔ Range 'Sheet1'.
Tip: We can retrieve a sheet's metadata with gs4_get().

## get the sheet's metadata
gs4_get(gs_id)
## Spreadsheet name: FISH549_example
## ID: 1ct3MMMzEX82BeqJcUw3uYD5b4CJrKUmRYCK-wVA3yig
## Locale: en_US
## Time zone: America/Los_Angeles
## # of sheets: 1
##
## (Sheet name): (Nominal extent in rows x columns)
## Sheet1: 1001 x 26
Tip: We can also read in just a subset of cells by
specifying the range
argument.
read_sheet(gs_id, range = "A1:B10")
## ✔ Reading from "FISH549_example".
## ✔ Range 'A1:B10'.
## # A tibble: 9 × 2
## year flow_wtr
## <dbl> <dbl>
## 1 1978 234.
## 2 1979 391.
## 3 1980 490.
## 4 1981 253.
## 5 1982 248.
## 6 1983 306.
## 7 1984 199.
## 8 1985 205.
## 9 1986 207
JSON (or JavaScript Object Notation) is a file format that stores information in an organized, logical, easy-to-access manner (i.e., “human readable”). For example, here is what a JSON file looks like:
var mark = {
  "city" : "Seattle",
  "state" : "WA",
  "hobbies" : {
    "hobby1" : "cycling",
    "hobby2" : "skiing"
  },
  "bikes" : {
    "bike1" : "Ridley Helium",
    "bike2" : "Niner RLT",
    "bike3" : "Santa Cruz Tallboy"
  },
  "skis" : {
    "skis1" : "K2 Hardside",
    "skis2" : "Black Crows Corvus Freebird"
  }
}
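To see how R represents such a structure, here is a minimal sketch (assuming the {jsonlite} package, which is introduced more formally below) that parses a small JSON string into a named list:

## load jsonlite
library(jsonlite)
## parse a small JSON string
mark <- fromJSON('{"city": "Seattle", "hobbies": {"hobby1": "cycling", "hobby2": "skiing"}}')
## extract a nested element
mark$hobbies$hobby1
## [1] "cycling"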
Some features of a JSON object:

- data are enclosed in curly braces {}
- data are stored as key : value pairs (separated by a colon)
- each key : value pair is separated by a comma

Many organizations provide an application programming interface (API) for accessing their data. GitHub has an API that allows you to explore information about users. Let's use this API to explore some of Mark's GitHub activities.
To do so, we’ll use the fromJSON()
function from the {jsonlite}
package to convert the data from a JSON object to a data frame.
Task: Begin by reading in the JSON file located at https://api.github.com/users/mdscheuerell/repos.
## load the package
library(jsonlite)
## set the URL
github_url <- "https://api.github.com/users/mdscheuerell/repos"
## get the JSON data
data_json <- fromJSON(github_url)
Note: By default, fromJSON() simplifies the JSON array of repos into a data frame.
Tip: You can inspect the field names in the data
frame with names()
.
## all of the JSON names
names(data_json)
## [1] "id" "node_id"
## [3] "name" "full_name"
## [5] "private" "owner"
## [7] "html_url" "description"
## [9] "fork" "url"
## [11] "forks_url" "keys_url"
## [13] "collaborators_url" "teams_url"
## [15] "hooks_url" "issue_events_url"
## [17] "events_url" "assignees_url"
## [19] "branches_url" "tags_url"
## [21] "blobs_url" "git_tags_url"
## [23] "git_refs_url" "trees_url"
## [25] "statuses_url" "languages_url"
## [27] "stargazers_url" "contributors_url"
## [29] "subscribers_url" "subscription_url"
## [31] "commits_url" "git_commits_url"
## [33] "comments_url" "issue_comment_url"
## [35] "contents_url" "compare_url"
## [37] "merges_url" "archive_url"
## [39] "downloads_url" "issues_url"
## [41] "pulls_url" "milestones_url"
## [43] "notifications_url" "labels_url"
## [45] "releases_url" "deployments_url"
## [47] "created_at" "updated_at"
## [49] "pushed_at" "git_url"
## [51] "ssh_url" "clone_url"
## [53] "svn_url" "homepage"
## [55] "size" "stargazers_count"
## [57] "watchers_count" "language"
## [59] "has_issues" "has_projects"
## [61] "has_downloads" "has_wiki"
## [63] "has_pages" "has_discussions"
## [65] "forks_count" "mirror_url"
## [67] "archived" "disabled"
## [69] "open_issues_count" "license"
## [71] "allow_forking" "is_template"
## [73] "web_commit_signoff_required" "topics"
## [75] "visibility" "forks"
## [77] "open_issues" "watchers"
## [79] "default_branch"
Now we can query these data to answer some questions about Mark’s GitHub profile and activities. (Note that the following results are only based upon the first 30 repos listed in Mark’s account.)
Task: How many of Mark’s repos have been forked?
## table of the number of forks
table(data_json$forks)
##
## 0 1 2
## 23 6 1
Only 7 of the first 30 repos have been forked.
Task: What are the languages used in these repos?
## table number of the different languages
table(data_json$language)
##
## CSS HTML Lua R TeX
## 1 8 1 8 3
Task: How many repos have open issues that Mark should address?
## how many repos have open issues?
table(data_json$open_issues_count)
##
## 0 2 3 5 13
## 26 1 1 1 1
Not too bad, but clearly he’s behind on some things. One of his repos has 13 open issues!
Here are some other packages worth checking out:

- {httr} for tools to work with URLs and HTTP (see the sketch below)
- {googledrive} to interact with your Google Drive