Overview

These notes will walk you through the process of connecting to a remote database and running some simple queries against it using the verb-like functions in some common R packages. If you’d like to review the process of creating a new PostgreSQL database, adding some files, and querying the database using tools other than R, please look here.

The following figure, courtesy of Posit, shows a high-level overview of our process. Recall that a database is a collection of tables and views created and manipulated using Structured Query Language (SQL). Specifically, we use a relational database management system (e.g., PostgreSQL) to create relational databases via SQL. One could then use SQL to access and wrangle the data directly within the database or, as we’ll see here, one could use R packages to connect to the database and interact with it entirely from within RStudio.



Remote databases

As we discussed previously, most databases are hosted on a server to which users connect remotely. To do so, one could use a dedicated IDE, such as pgAdmin shown here. Alternatively, as we’ll see here, you can use various R packages within RStudio to connect to a remote database.

Connecting to a database

To connect to our remote database, we’ll use two packages:

  1. {DBI}
  2. {RPostgres}
## load packages for db connection
library(DBI)
library(RPostgres)

When connecting remotely to a database, the user typically needs to provide several pieces of information. These include:

  • Hostname
  • Port
  • Database name
  • Username
  • Password

Here is the information you’ll need to connect to the remote Lake Washington database:

Hostname: p.nxh57cg2izb6dbv27dykuuxztq.db.postgresbridge.com
Port:     5432
Database: lake_wa
Username: postgres
Password: FGE9jBmVIuizl0ImuR9N9rPer5hXAfuIifvkAazJC59EG5zhf8bIZUU3LTrN13Eo
## connect to the remote db
db_con <- dbConnect(drv = Postgres(),
                    dbname = "lake_wa",
                    host = "p.nxh57cg2izb6dbv27dykuuxztq.db.postgresbridge.com",
                    port = "5432",
                    user = "postgres",
                    password = "FGE9jBmVIuizl0ImuR9N9rPer5hXAfuIifvkAazJC59EG5zhf8bIZUU3LTrN13Eo")

Interacting with a db

We saw previously how to interact with a database using regular SQL commands. Alternatively, we can use the {dplyr} package, whose verb-like functions are translated into SQL queries behind the scenes, allowing you to interact with the database directly from within R. Other advantages of this approach are:

  • The data transformations are run via SQL within the database itself, so you don’t have to pull the full dataset into R’s memory.

  • All of your code is written in R. Because {dplyr} is used to communicate with the database, there is no need to alternate between languages or tools to perform the data exploration.

The following figure courtesy of Posit shows conceptually our process of using {dplyr} to interact with a remote SQL database.



In addition, we can use the related {dbplyr} package, which converts our R commands into actual SQL, or we can simply pass raw SQL if we want a more sophisticated query. So, to interact with our remote database, we’ll use two related packages:

  1. {dplyr}
  2. {dbplyr}
## load packages for db queries
library(dplyr)
library(dbplyr)
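
As an aside, nothing stops us from writing the SQL ourselves: DBI::dbGetQuery() sends a raw query over an open connection and returns the result as a data frame. For example, once connected we could query the limno table we’ll meet shortly:

## pass a raw SQL query over the existing connection
dbGetQuery(db_con, "SELECT * FROM limno LIMIT 5")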

List of tables

Let’s take a look at the tables contained in our Lake Washington database so we can begin thinking about how to approach an exploratory analysis. To do so, we can use DBI::dbListTables() to list the names of remote tables accessible through our connection.

## list the available tables
dbListTables(db_con)
## [1] "pdo"                     "pg_stat_statements_info"
## [3] "pg_stat_statements"      "limno"

Here we can see that there are 4 tables in our database, which is a bit odd because we only added 2 of them: limno and pdo.

What’s up with the others beginning with pg_stat? Those tables track statistics with respect to all SQL statements executed on the server.
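
Relatedly, if you want to see the fields (columns) in a particular table without retrieving it, DBI::dbListFields() will list them:

## list the fields in the `limno` table
dbListFields(db_con, "limno")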

Retrieve a table

Conveniently, we can use dplyr::tbl() to retrieve a table from a remote database and assign it to an object, if we’d like. Let’s inspect the 2 tables containing the SQL statistics.

## inspect metadata tables
tbl(db_con, from = "pg_stat_statements_info")
## # Source:   table<pg_stat_statements_info> [1 x 2]
## # Database: postgres  [postgres@p.nxh57cg2izb6dbv27dykuuxztq.db.postgresbridge.com:5432/lake_wa]
##   dealloc stats_reset        
##   <int64> <dttm>             
## 1       0 2025-02-02 01:13:10
tbl(db_con, from = "pg_stat_statements")
## # Source:   table<pg_stat_statements> [?? x 49]
## # Database: postgres  [postgres@p.nxh57cg2izb6dbv27dykuuxztq.db.postgresbridge.com:5432/lake_wa]
##    userid  dbid toplevel queryid query       plans total_plan_time min_plan_time
##     <int> <int> <lgl>    <int64> <chr>       <int>           <dbl>         <dbl>
##  1  16385     1 TRUE       2 e18 "select $1"     0               0             0
##  2     10     5 TRUE       3.e18 "/*pga4das…     0               0             0
##  3  16385     1 TRUE       3 e18 "SELECT ro…     0               0             0
##  4     10     5 TRUE      -6.e18 "SELECT\n …     0               0             0
##  5     10 16489 TRUE       8 e18 "SET clien…     0               0             0
##  6  16385     1 TRUE      -1 e18 "ALTER DEF…     0               0             0
##  7     10     5 TRUE      -7 e17 "SELECT\n …     0               0             0
##  8     10     5 TRUE       3 e16 "select (s…     0               0             0
##  9     10 16489 TRUE       4 e18 "SELECT at…     0               0             0
## 10     10 16489 TRUE       3.e18 "SELECT na…     0               0             0
## # ℹ more rows
## # ℹ 41 more variables: max_plan_time <dbl>, mean_plan_time <dbl>,
## #   stddev_plan_time <dbl>, calls <int64>, total_exec_time <dbl>,
## #   min_exec_time <dbl>, max_exec_time <dbl>, mean_exec_time <dbl>,
## #   stddev_exec_time <dbl>, rows <int64>, shared_blks_hit <int64>,
## #   shared_blks_read <int64>, shared_blks_dirtied <int64>,
## #   shared_blks_written <int64>, local_blks_hit <int64>, …

Exploratory analysis

Lake Washington

Let’s take an exploratory look at some of the limnological¹ data from Lake Washington, an urban lake forming the eastern boundary of Seattle (see map here). We can also check out the monthly time series of the Pacific Decadal Oscillation.

Task: Get the limnology table from our database and assign it to limno.

## get the `limno` table from the db
limno <- tbl(db_con, from = "limno")

Task: Take a look at the data contained in limno.

## inspect the `limno` table
limno
## # Source:   table<limno> [?? x 20]
## # Database: postgres  [postgres@p.nxh57cg2izb6dbv27dykuuxztq.db.postgresbridge.com:5432/lake_wa]
##     year month temp  tp    ph    cryptomonas diatoms greens bluegreens unicells
##    <int> <int> <chr> <chr> <chr> <chr>       <chr>   <chr>  <chr>      <chr>   
##  1  1962     1 7.6   60.2  7.4   0           8295    0      1301440    7336    
##  2  1962     2 7.5   NA    7.5   0           60414   1632   302189     6657    
##  3  1962     3 7.7   57.7  7.6   0           12848   0      290080     30137   
##  4  1962     4 10.1  52.4  7.6   0           27250   2250   448650     30600   
##  5  1962     5 12.8  57    7.9   0           48155   0      2105040    97313   
##  6  1962     6 16    82.5  8.4   NA          NA      NA     NA         NA      
##  7  1962     7 18.8  69.5  8.1   0           13511   0      4370310    40628   
##  8  1962     8 16.1  44.6  8.5   0           1292    13619  3450907    117579  
##  9  1962     9 14.7  67.1  8     0           1531    0      1940400    277424  
## 10  1962    10 13.5  60.2  7.2   0           1219    210    1344875    33431   
## # ℹ more rows
## # ℹ 10 more variables: other_algae <chr>, conochilus <chr>, cyclops <chr>,
## #   daphnia <chr>, diaptomus <chr>, epischura <chr>, leptodora <chr>,
## #   neomysis <chr>, non_daphnid_cladocerans <chr>, non_colonial_rotifers <chr>

The columns (“fields” in database parlance) are as follows:

  • year: calendar year
  • month: month of year
  • temp: water temperature (C)
  • tp: concentration of total phosphorus (micrograms/L)
  • ph: pH of the water
  • cryptomonas: counts of the phytoplankton Cryptomonas
  • diatoms: counts of diatoms
  • greens: counts of green algae
  • bluegreens: counts of bluegreen algae (cyanobacteria)
  • unicells: counts of the unicellular phytoplankton
  • other_algae: counts of all other phytoplankton combined
  • conochilus: counts of the colonial rotifer Conochilus
  • cyclops: counts of the cyclopoid copepod Cyclops
  • daphnia: counts of Daphnia
  • diaptomus: counts of the calanoid copepod Diaptomus
  • epischura: counts of the calanoid copepod Epischura
  • leptodora: counts of the predatory crustacean Leptodora
  • neomysis: counts of the predatory crustacean Neomysis
  • non_daphnid_cladocerans: counts of non-Daphnid cladocerans
  • non_colonial_rotifers: counts of non-colonial rotifers
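
For a compact view of all 20 fields and their classes, dplyr::glimpse() also works on remote tables:

## compact summary of the fields in `limno`
glimpse(limno)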

Task: Find out the range of years in the database.

## unique years of data
unique(limno$year)
## NULL

What happened here? We can clearly see values (of class integer) listed in our view, but the above call returns NULL.

Because we’re working with a remote database, all operations are “lazy”: limno is only a reference to the remote table, not a data frame in R’s memory, so `$` has nothing local to extract and nothing is retrieved until we formally request the data.
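
That said, we don’t always have to copy a table locally to answer a question; {dbplyr} can run the computation inside the database itself. For example, here is a remote-friendly version of the query above, along with the SQL it generates:

## distinct years, computed in the database
limno |>
  distinct(year) |>
  show_query()

## execute the query and pull back only the result
limno |>
  distinct(year) |>
  collect()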

Task: Assign limno to a local R object with collect().

## create a local copy of the table
limno <- collect(limno)

Task: Check again for the range of years in the database.

## unique years of data
unique(limno$year)
##  [1] 1962 1963 1964 1965 1966 1967 1968 1969 1970 1971 1972 1973 1974 1975 1976
## [16] 1977 1978 1979 1980 1981 1982 1983 1984 1985 1986 1987 1988 1989 1990 1991
## [31] 1992 1993 1994
## how many years of data?
unique(limno$year) |> length()
## [1] 33
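
Since the table is now a local tibble, we could also get the first and last years directly:

## first and last year of data
range(limno$year)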

Convert field classes

If you look closely at the classes of all the non-date fields, you’ll see that they’re currently all characters. This can be traced back to the NAs in the database itself. Before we can do any real operations on them, we’ll need to convert them to class numeric, which is easily done using the {tidyverse}.

Task: Use mutate_if() to convert character to numeric.

Note: R will respond with a warning about NAs introduced by coercion, which you can ignore.

## convert data types to `numeric`
limno <- limno |>
  mutate_if(is.character, as.numeric)
## Warning: There were 18 warnings in `mutate()`.
## The first warning was:
## ℹ In argument: `temp = .Primitive("as.double")(temp)`.
## Caused by warning:
## ! NAs introduced by coercion
## ℹ Run `dplyr::last_dplyr_warnings()` to see the 17 remaining warnings.
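
Note: mutate_if() still works, but it has been superseded in newer versions of {dplyr}; the equivalent call with across() is:

## equivalent conversion using `across()`
limno <- limno |>
  mutate(across(where(is.character), as.numeric))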

Time series of temperature

Let’s extract and plot the monthly time series of water temperature for the 20 years from 1970-1989. To begin, we’ll create a new date column from year and month so the series plots nicely with ggplot(). To do so, we’ll make use of the {lubridate} package.

Task: Use lubridate::make_date() to create a new date field.

## if necessary, install {lubridate}
# install.packages("lubridate")
library(lubridate)
## 
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union
## create new `date` field; assume first day of the month
limno <- limno |>
  mutate(date = make_date(year, month))
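
A quick sanity check that the new field looks right:

## inspect the new `date` field
limno |>
  select(year, month, date) |>
  head()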

Now we can filter the table to choose the correct timespan and plot the temperature.

Task: Filter the table and plot the time series of temperature.

## load ggplot2
library(ggplot2)

limno |>
  # choose years 1970-1989
  filter(year >= 1970 & year <= 1989) |>
  # plot the time series of temperature
  ggplot(aes(x = date, y = temp)) +
  geom_line() + 
  xlab("\nTime") +
  ylab("Temperature (C)\n") +
  theme_gray(base_size = 16)

Pacific Decadal Oscillation

Let’s make a similar plot of the Pacific Decadal Oscillation. To begin, we’ll get the remote table and collect it into a local R object.

Task: Get the PDO table from our database and assign it to pdo.

## get the `pdo` table from the db & collect it
pdo <- tbl(db_con, from = "pdo") |>
  collect()

Task: Inspect the data in the pdo table.

## inspect the top of the `pdo` table
pdo
## # A tibble: 2,004 × 2
##      date   pdo
##     <int> <dbl>
##  1 185401  0.11
##  2 185402 -0.24
##  3 185403 -0.4 
##  4 185404 -0.44
##  5 185405 -0.54
##  6 185406 -0.3 
##  7 185407 -0.1 
##  8 185408 -1.24
##  9 185409 -1   
## 10 185410 -2.23
## # ℹ 1,994 more rows

Note: The date field here appears a bit odd; it’s of the form YYYYMM.

Once again we can use the {lubridate} package to construct a proper date object. Specifically, we’ll use the ym() function, which parses dates in YYYYMM format into the ISO-standard YYYY-MM-DD format.

Task: Use lubridate::ym() to create a new date field.

## if necessary, install {lubridate}
# install.packages("lubridate")

## create new `date` field; assume first day of the month
pdo <- pdo |>
  mutate(date = ym(date)) |>
  select(date, pdo)

Time series of PDO

Let’s look at the time series of PDO values from 1970-1989.

Task: Filter the PDO table and plot the time series.

pdo |>
  # choose years 1970-1989
  filter(year(date) >= 1970 & year(date) <= 1989) |>
  # plot the time series of PDO
  ggplot(aes(x = date, y = pdo)) +
  geom_line() + 
  xlab("\nTime") +
  ylab("PDO\n") +
  theme_gray(base_size = 16)
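
Finally, when we’re all done working with the database, it’s good practice to close the connection with DBI::dbDisconnect():

## close the database connection
dbDisconnect(db_con)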


  1. If you didn’t already know, “limnology” refers to the study of inland waters. Some people mistakenly define it as the study of fresh water, but limnologists certainly study saline lakes as well (e.g., Great Salt Lake).↩︎