These notes will walk you through the process of connecting to a remote database and running some simple queries on the database using the verb-like functions in some common R packages. If you’d like to review the process of creating a new PostgreSQL database, adding some files, and querying the database using tools other than R, please look here.
The following figure, courtesy of Posit, shows a high-level overview of our process. Recall that a database is a collection of tables and views created and manipulated using Structured Query Language (SQL). Specifically, we use a relational database management system (e.g., PostgreSQL) to create relational databases using SQL. One could then use SQL to access and wrangle data directly within the database or, as we’ll see here, one could use R packages to connect to a database and interact with it completely within RStudio.
As we discussed previously, most databases are hosted on a server to which users connect remotely. To do so, one could use a dedicated IDE, such as pgAdmin shown here. Alternatively, as we’ll see here, you can use various R packages within RStudio to connect to a remote database.
To connect to our remote database, we’ll use two packages:
## load packages for db connection
library(DBI)
library(RPostgres)
When connecting remotely to a database, the user typically needs to provide several pieces of information. These include:

- the hostname (or IP address) of the server
- the port on which the server is listening
- the name of the database
- a username
- a password
Here is the information you’ll need to connect to the remote Lake Washington database:
Hostname: p.nxh57cg2izb6dbv27dykuuxztq.db.postgresbridge.com
Port: 5432
Database: lake_wa
Username: postgres
Password: FGE9jBmVIuizl0ImuR9N9rPer5hXAfuIifvkAazJC59EG5zhf8bIZUU3LTrN13Eo
## connect to the remote db
db_con <- dbConnect(drv = Postgres(),
                    dbname = "lake_wa",
                    host = "p.nxh57cg2izb6dbv27dykuuxztq.db.postgresbridge.com",
                    port = "5432",
                    user = "postgres",
                    password = "FGE9jBmVIuizl0ImuR9N9rPer5hXAfuIifvkAazJC59EG5zhf8bIZUU3LTrN13Eo")
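Note that hardcoding a password in a script, as above, is fine for a classroom demo but risky in general. A safer sketch reads the credentials from environment variables instead (e.g., ones you’ve defined in your `~/.Renviron` file); the variable names below are hypothetical examples, not part of the database setup.

```r
## read credentials from environment variables rather than the script itself;
## the LAKE_WA_* variable names are hypothetical examples
db_con <- dbConnect(drv = Postgres(),
                    dbname = Sys.getenv("LAKE_WA_DBNAME"),
                    host = Sys.getenv("LAKE_WA_HOST"),
                    port = Sys.getenv("LAKE_WA_PORT"),
                    user = Sys.getenv("LAKE_WA_USER"),
                    password = Sys.getenv("LAKE_WA_PASSWORD"))
```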
We saw previously how to interact with a database using regular SQL commands. Alternatively, we can use the {dplyr} package to interact with databases directly by translating its verb-like functions into SQL queries. This allows you to interact directly with the database from within R. Other advantages of this approach are:

- The data transformations are run via SQL within the database itself, so there is no need to pull all of the raw data into R.
- All of your code is written in R. Because {dplyr} is used to communicate with the database, there is no need to alternate between languages or tools to perform the data exploration.
The following figure courtesy of Posit shows conceptually our process of using {dplyr} to interact with a remote SQL database.
In addition, we can use the related {dbplyr} package to convert R commands into actual SQL, or simply pass raw SQL if we want a more sophisticated query. So, to interact with our remote database, we’ll use two related packages:
## load packages for db queries
library(dplyr)
library(dbplyr)
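As a quick sanity check on the connection, and to see the raw-SQL route mentioned above, we can pass a query string directly to the server with DBI::dbGetQuery(), which returns an ordinary data frame. This example simply asks the server for its version, so it doesn’t depend on any of our tables.

```r
## send a raw SQL query via DBI; the result comes back as a data frame
dbGetQuery(db_con, "SELECT version();")
```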
Let’s take a look at the tables contained in our Lake Washington database so we can begin thinking about how to approach an exploratory analysis. To do so, we can use DBI::dbListTables() to list the names of remote tables accessible through our connection.
## list the available tables
dbListTables(db_con)
## [1] "pdo" "pg_stat_statements_info"
## [3] "pg_stat_statements" "limno"
Here we can see that there are 4 tables in our database, which is a bit odd because we only added 2 of them: limno and pdo. What’s up with the others beginning with pg_stat? Those tables track statistics with respect to all SQL statements executed on the server.
Conveniently, we can use dplyr::tbl() to retrieve a table from a remote database and assign it to an object, if we’d like. Let’s inspect the 2 tables containing the SQL statistics.
## inspect metadata tables
tbl(db_con, from = "pg_stat_statements_info")
## # Source: table<pg_stat_statements_info> [1 x 2]
## # Database: postgres [postgres@p.nxh57cg2izb6dbv27dykuuxztq.db.postgresbridge.com:5432/lake_wa]
## dealloc stats_reset
## <int64> <dttm>
## 1 0 2025-02-02 01:13:10
tbl(db_con, from = "pg_stat_statements")
## # Source: table<pg_stat_statements> [?? x 49]
## # Database: postgres [postgres@p.nxh57cg2izb6dbv27dykuuxztq.db.postgresbridge.com:5432/lake_wa]
## userid dbid toplevel queryid query plans total_plan_time min_plan_time
## <int> <int> <lgl> <int64> <chr> <int> <dbl> <dbl>
## 1 16385 1 TRUE 2 e18 "select $1" 0 0 0
## 2 10 5 TRUE 3.e18 "/*pga4das… 0 0 0
## 3 16385 1 TRUE 3 e18 "SELECT ro… 0 0 0
## 4 10 5 TRUE -6.e18 "SELECT\n … 0 0 0
## 5 10 16489 TRUE 8 e18 "SET clien… 0 0 0
## 6 16385 1 TRUE -1 e18 "ALTER DEF… 0 0 0
## 7 10 5 TRUE -7 e17 "SELECT\n … 0 0 0
## 8 10 5 TRUE 3 e16 "select (s… 0 0 0
## 9 10 16489 TRUE 4 e18 "SELECT at… 0 0 0
## 10 10 16489 TRUE 3.e18 "SELECT na… 0 0 0
## # ℹ more rows
## # ℹ 41 more variables: max_plan_time <dbl>, mean_plan_time <dbl>,
## # stddev_plan_time <dbl>, calls <int64>, total_exec_time <dbl>,
## # min_exec_time <dbl>, max_exec_time <dbl>, mean_exec_time <dbl>,
## # stddev_exec_time <dbl>, rows <int64>, shared_blks_hit <int64>,
## # shared_blks_read <int64>, shared_blks_dirtied <int64>,
## # shared_blks_written <int64>, local_blks_hit <int64>, …
Let’s take an exploratory look at some of the limnological[1] data from Lake Washington, an urban lake forming the eastern boundary of Seattle (see map here). We can also check out the monthly time series of the Pacific Decadal Oscillation.
Task: Get the limnology table from our database and assign it to limno.
## get the `limno` table from the db
limno <- tbl(db_con, from = "limno")
Task: Take a look at the data contained in limno.
## inspect the `limno` table
limno
## # Source: table<limno> [?? x 20]
## # Database: postgres [postgres@p.nxh57cg2izb6dbv27dykuuxztq.db.postgresbridge.com:5432/lake_wa]
## year month temp tp ph cryptomonas diatoms greens bluegreens unicells
## <int> <int> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 1962 1 7.6 60.2 7.4 0 8295 0 1301440 7336
## 2 1962 2 7.5 NA 7.5 0 60414 1632 302189 6657
## 3 1962 3 7.7 57.7 7.6 0 12848 0 290080 30137
## 4 1962 4 10.1 52.4 7.6 0 27250 2250 448650 30600
## 5 1962 5 12.8 57 7.9 0 48155 0 2105040 97313
## 6 1962 6 16 82.5 8.4 NA NA NA NA NA
## 7 1962 7 18.8 69.5 8.1 0 13511 0 4370310 40628
## 8 1962 8 16.1 44.6 8.5 0 1292 13619 3450907 117579
## 9 1962 9 14.7 67.1 8 0 1531 0 1940400 277424
## 10 1962 10 13.5 60.2 7.2 0 1219 210 1344875 33431
## # ℹ more rows
## # ℹ 10 more variables: other_algae <chr>, conochilus <chr>, cyclops <chr>,
## # daphnia <chr>, diaptomus <chr>, epischura <chr>, leptodora <chr>,
## # neomysis <chr>, non_daphnid_cladocerans <chr>, non_colonial_rotifers <chr>
The columns (“fields” in database parlance) are as follows:

- year: calendar year
- month: month of year
- temp: water temperature (C)
- tp: concentration of total phosphorus (micrograms/L)
- ph: pH of the water
- cryptomonas: counts of the phytoplankton Cryptomonas
- diatoms: counts of diatoms
- greens: counts of green algae
- bluegreens: counts of bluegreen algae (cyanobacteria)
- unicells: counts of the unicellular phytoplankton
- other_algae: counts of all other phytoplankton combined
- conochilus: counts of the colonial rotifer Conochilus
- cyclops: counts of the cyclopoid copepod Cyclops
- daphnia: counts of Daphnia
- diaptomus: counts of the calanoid copepod Diaptomus
- epischura: counts of the calanoid copepod Epischura
- leptodora: counts of the predatory crustacean Leptodora
- neomysis: counts of the predatory crustacean Neomysis
- non_daphnid_cladocerans: counts of non-daphnid cladocerans
- non_colonial_rotifers: counts of non-colonial rotifers

Task: Find out the range of years in the database.
## unique years of data
unique(limno$year)
## NULL
What happened here? We can clearly see values (of class integer) listed in our view, but the above call returns NULL.
Because we’re working with a remote database, all operations are “lazy”: limno is merely a reference to the remote table, so nothing is computed or transferred until we formally request the data.
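One way to see this laziness, sketched below: limno is a lazy table object rather than a data frame, and the {dplyr} verbs we apply to it are translated into SQL that we can inspect with show_query() before anything is executed on the server.

```r
## `limno` is a lazy reference to the remote table, not a data frame
class(limno)

## dplyr verbs are translated to SQL; view the query without running it
limno |>
  distinct(year) |>
  show_query()
```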
Task: Assign limno to a local R object with collect().
## create a local copy of the table
limno <- collect(limno)
Task: Check again for the range of years in the database.
## unique years of data
unique(limno$year)
## [1] 1962 1963 1964 1965 1966 1967 1968 1969 1970 1971 1972 1973 1974 1975 1976
## [16] 1977 1978 1979 1980 1981 1982 1983 1984 1985 1986 1987 1988 1989 1990 1991
## [31] 1992 1993 1994
## how many years of data?
unique(limno$year) |> length()
## [1] 33
If you look closely at the classes for all of the non-date related fields, you’ll see that they’re currently all of class character. This can be traced back to the NA’s in the database itself. Before we can do any real operations on them, we’ll need to convert them to class numeric. This is easily done using the {tidyverse}.
Task: Use mutate_if() to convert character to numeric.

Note: R will respond with a warning about NAs introduced by coercion, which you can ignore.
## convert data types to `numeric`
limno <- limno |>
mutate_if(is.character, as.numeric)
## Warning: There were 18 warnings in `mutate()`.
## The first warning was:
## ℹ In argument: `temp = .Primitive("as.double")(temp)`.
## Caused by warning:
## ! NAs introduced by coercion
## ℹ Run `dplyr::last_dplyr_warnings()` to see the 17 remaining warnings.
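As an aside, mutate_if() still works but has been superseded in recent versions of {dplyr}. The equivalent conversion with the newer across() syntax (assuming dplyr >= 1.0) would look like this:

```r
## same conversion using across() instead of mutate_if()
limno |>
  mutate(across(where(is.character), as.numeric))
```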
Let’s extract and plot the monthly time series of water temperature for the 20 years from 1970-1989. To begin, we’ll create a new date column based on year and month so we can better use ggplot(). To do so, we’ll make use of the {lubridate} package.
Task: Use lubridate::make_date() to create a new date field.
## if necessary, install {lubridate}
# install.packages("lubridate")
library(lubridate)
##
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
## create new `date` field; assume first day of month
limno <- limno |>
mutate(date = make_date(year, month))
Now we can filter the table to choose the correct timespan and plot the temperature.
Task: Filter the table and plot the time series of temperature.
## load ggplot2
library(ggplot2)
limno |>
# choose years 1970-1989
filter(year >= 1970 & year <= 1989) |>
# plot the time series of temperature
ggplot(aes(x = date, y = temp)) +
geom_line() +
xlab("\nTime") +
ylab("Temperature (C)\n") +
theme_gray(base_size = 16)
Let’s make a similar plot of the Pacific Decadal Oscillation. To begin, we’ll get the remote table and collect it into a local R object.
Task: Get the PDO table from our database and assign it to pdo.
## get the `pdo` table from the db & collect it
pdo <- tbl(db_con, from = "pdo") |>
collect()
Task: Inspect the data in the PDO table.
## inspect the top of the `pdo` table
pdo
## # A tibble: 2,004 × 2
## date pdo
## <int> <dbl>
## 1 185401 0.11
## 2 185402 -0.24
## 3 185403 -0.4
## 4 185404 -0.44
## 5 185405 -0.54
## 6 185406 -0.3
## 7 185407 -0.1
## 8 185408 -1.24
## 9 185409 -1
## 10 185410 -2.23
## # ℹ 1,994 more rows
Note: The date field here appears a bit odd; it’s of the form YYYYMM.
Once again we can use the {lubridate} package to construct a proper date object. Specifically, we’ll use the ym() function, which converts a date in YYYYMM format to the ISO-standard YYYY-MM-DD.

Task: Use lubridate::ym() to create a new date field.
## if necessary, install {lubridate}
# install.packages("lubridate")
## create new `date` field; assume first day of month
pdo <- pdo |>
mutate(date = ym(date)) |>
select(date, pdo)
Let’s look at the time series of PDO values from 1970-1989.
Task: Filter the PDO table and plot the time series.
pdo |>
# choose years 1970-1989
filter(year(date) >= 1970 & year(date) <= 1989) |>
# plot the time series of temperature
ggplot(aes(x = date, y = pdo)) +
geom_line() +
xlab("\nTime") +
ylab("PDO\n") +
theme_gray(base_size = 16)
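When you’re finished working with a remote database, it’s good practice to close the connection so you don’t leave idle sessions open on the server.

```r
## close the connection to the remote db
dbDisconnect(db_con)
```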
[1] If you didn’t already know, “limnology” refers to the study of inland waters. Some people mistakenly define it as the study of fresh water, but limnologists certainly study saline lakes as well (e.g., Great Salt Lake).