We saw previously how to
create a PostgreSQL database for staging our data. We also created a
staging
schema in our database where we brought in the
limnology data from the lwa_limno.csv
file. The
staging
schema provides a clear place in our workflow where
we can now manipulate and clean up the limnology data before writing it
into the final table form with numeric data types in the
public
schema. Recall that we used the Convert CSV to SQL
website to generate the CREATE TABLE
and
INSERT INTO
SQL statements we then ran in
pgAdmin to write our csv data into the database.
Here we will learn how to:
Cast data into new data types
Clean up the data
Modify the data
Write data between tables
Join our tables together in a query
Create a view of our joined query
SQL scripts for each step are available here. As you’ve seen elsewhere in class, all of the SQL scripts have number prefixes to show the order in which they should be run.
One of the essential elements necessary to join two data tables
together is for both tables to have a common field upon which to join
them. Both of our data tables contain monthly data, so we will join them
using a year-month ocombination. When we look at our example data, the
lwa_limno.csv
has one field for Year
and a
second field for Month
, but in the
pdo_data.csv
we have only one field called
Date
, which has the year and month concatenated
together.
Year/month format in limno
year | month
------------
1962 | 1
1962 | 2
Year/month format in pdo
Date
------
196201
196202
We could convert either data set to match the other, but here we’ll
convert the limno
data to a concatenated year-month value
to match pdo
.
To standardize between the two tables, we are going to create a new
field in the limno
table where we can write a new
year_month
concatenation. To add a field to an existing
table, we use the SQL ALTER TABLE
command. Note that fields
added to tables via the ALTER TABLE
command are appended to
the end of the existing table fields.
Tools > Query Tool
to open up the query editing
tool
ALTER TABLE staging.limno
ADD COLUMN year_month text;
1_alter_table_staging_limno.sql
and then run
itNow that we have a field we can write our new year-month
concatenation into, we need to write some SQL to concatenate two fields
into one field. The PostgreSQL concatenation operator is double pipes
||
. Because the year
and month
fields in the table are integers, we need to cast their data types to
text so that we can concatenate them together. To cast a field to a
different data type in PostgreSQL, you follow the field name with double
colons ::
and then the data type (eg,
Year::text
).
If we query limno
to test our concatenation, we can run
the following to check if that gives us what we need for the
concatenated field.
SELECT year::text || month::text FROM staging.limno
That looks pretty good, but all of the Date
values in
pdo_data.csv
are six digits–for single digit months the
month is written 01
instead of 1
–and our
limno
concatenation doesn’t match that. Therefore, we need
to pad the month with a 0
in the concatenation, but only
for January through September (ie, single digit months).
To write conditional logic in SQL we use the CASE
statement. The CASE
statement context is
CASE WHEN XX THEN YY ELSE ZZ END
Note that we can make CASE
statements more complicated
by nesting additional WHEN
clauses like
CASE WHEN XX THEN YY WHEN AA THEN BB ELSE ZZ END
We can check our conditional logic statement with the following SQL code:
SELECT
year::text || CASE WHEN month < 10 THEN '0' ELSE '' END || month::text
FROM staging.limno;
That all looks good, so now we can add the new
year_month
column to our table with the UPDATE
command
0
for single digit months,
copy/paste the following SQl code into the editor and press playUPDATE staging.limno
SET year_month = year::text || CASE WHEN month < 10 THEN '0'
ELSE '' END || month::text;
2_update_staging_limno.sql
and then run
itNA
to NULL
Recall that the notion of a missing value in a SQL database is
different than in R. In R we use
NA
to designate a missing values,but in SQL we use
NULL
. Therefore, we need to clean up the limno
table by replacing the text NA
with NULL
.
Because we are updating data in a table, we again will use the SQL
UPDATE
command. We can string all of our
UPDATE
commands together in one script and change all of
the NA
values for each field.
/*
to
delineate the start/end a comment in SQL) and then run it/*
column headings in limno table are:
year, month, temp, tp, ph, cryptomonas, diatoms, greens, bluegreens,
unicells, other_algae, conochilus, cyclops, daphnia, diaptomus, epischura,
leptodora, neomysis, non_daphnid_cladocerans, non_colonial_rotifers
*/
UPDATE staging.limno SET temp = NULL WHERE temp = 'NA';
UPDATE staging.limno SET tp = NULL WHERE tp = 'NA';
UPDATE staging.limno SET ph = NULL WHERE ph = 'NA';
UPDATE staging.limno SET cryptomonas = NULL WHERE cryptomonas = 'NA';
UPDATE staging.limno SET diatoms = NULL WHERE diatoms = 'NA';
UPDATE staging.limno SET greens = NULL WHERE greens = 'NA';
UPDATE staging.limno SET bluegreens = NULL WHERE bluegreens = 'NA';
UPDATE staging.limno SET unicells = NULL WHERE unicells = 'NA';
UPDATE staging.limno SET other_algae = NULL WHERE other_algae = 'NA';
UPDATE staging.limno SET conochilus = NULL WHERE conochilus = 'NA';
UPDATE staging.limno SET cyclops = NULL WHERE cyclops = 'NA';
UPDATE staging.limno SET daphnia = NULL WHERE daphnia = 'NA';
UPDATE staging.limno SET diaptomus = NULL WHERE diaptomus = 'NA';
UPDATE staging.limno SET epischura = NULL WHERE epischura = 'NA';
UPDATE staging.limno SET leptodora = NULL WHERE leptodora = 'NA';
UPDATE staging.limno SET neomysis = NULL WHERE neomysis = 'NA';
UPDATE staging.limno SET non_daphnid_cladocerans = NULL WHERE non_daphnid_cladocerans = 'NA';
UPDATE staging.limno SET non_colonial_rotifers = NULL WHERE non_colonial_rotifers = 'NA';
3_clean_up_NA_in_limno.sql
and then run
it.Now that the staging.limno
data are cleaned up and ready
to go, we need to make our final data table where the cleaned up data
will reside. This will be an otherwise identical table to
staging.limno
except we will
Drop the year
and month
fields and
instead include our newly created year_month
field
Set year_month
as the primary key
Change the data fields from text
to
numeric
We’ll use CREATE TABLE
in SQL to do this.
CREATE TABLE public.limno
(
year_month integer primary key,
temp numeric,
tp numeric,
ph numeric,
cryptomonas numeric,
diatoms numeric,
greens numeric,
bluegreens numeric,
unicells numeric,
other_algae numeric,
conochilus numeric,
cyclops numeric,
daphnia numeric,
diaptomus numeric,
epischura numeric,
leptodora numeric,
neomysis numeric,
non_daphnid_cladocerans numeric,
non_colonial_rotifers numeric
)
TABLESPACE pg_default;
ALTER TABLE public.limno
OWNER to lake_wa;
4_create_table_public_limno.sql
and then run
itOur last step for the limno
data is to write the data
from staging.limno
to public.limno
. For this
step, we’ll use the INSERT INTO
command in SQL. Copy the
following code into the query editor and then run it.
INSERT INTO public.limno(
year_month, temp, tp, ph, cryptomonas, diatoms, greens, bluegreens,
unicells, other_algae, conochilus, cyclops, daphnia, diaptomus, epischura,
leptodora, neomysis, non_daphnid_cladocerans, non_colonial_rotifers)
SELECT year_month::integer,
temp::numeric,
tp::numeric,
ph::numeric,
cryptomonas::numeric,
diatoms::numeric,
greens::numeric,
bluegreens::numeric,
unicells::numeric,
other_algae::numeric,
conochilus::numeric,
cyclops::numeric,
daphnia::numeric,
diaptomus::numeric,
epischura::numeric,
leptodora::numeric,
neomysis::numeric,
non_daphnid_cladocerans::numeric,
non_colonial_rotifers::numeric
FROM staging.limno;
5_write_clean_data_into_public_limno.sql
and
then run itlimno
and pdo
We now have both the pdo
and limno
tables
cleaned up, so we can join them together into a single result set for
future analysis. To begin, let’s run the following script to view the
result set and verify that everything looks right.
SELECT
LEFT(year_month::text, 4) AS year,
RIGHT(year_month::text, 2) AS month,
pdo,
temp,
tp,
ph,
cryptomonas,
diatoms,
greens,
bluegreens,
unicells,
other_algae,
conochilus,
cyclops,
daphnia,
diaptomus,
epischura,
leptodora,
neomysis,
non_daphnid_cladocerans,
non_colonial_rotifers
FROM public.limno JOIN public.pdo ON limno.year_month = pdo.date;
This looks good, so now we can use CREATE VIEW
to create
a View that we can use to access the joined data.
CREATE VIEW pdo_and_limno as
SELECT
LEFT(year_month::text, 4) AS year,
RIGHT(year_month::text, 2) AS month,
pdo,
temp,
tp,
ph,
cryptomonas,
diatoms,
greens,
bluegreens,
unicells,
other_algae,
conochilus,
cyclops,
daphnia,
diaptomus,
epischura,
leptodora,
neomysis,
non_daphnid_cladocerans,
non_colonial_rotifers
FROM public.limno JOIN public.pdo ON limno.year_month = pdo.date;
6_join_pdo_limno.sql
and then run itWe can now check in the pgAdmin browser to see that
our newly created view called pdo_and_limno
is there.