2 Downloading the data

Before downloading and working with these data, load the R packages listed below. The ‘install.packages’ commands are commented out; if a package is not yet installed, un-comment its line before running the code.

# install.packages("dplyr")
library(dplyr)
# install.packages("rerddap")
library(rerddap)
# install.packages("leaflet")
library(leaflet)
# install.packages("RColorBrewer")
library(RColorBrewer)
# install.packages("ggplot2")
library(ggplot2)
# install.packages("rmarkdown")
library(rmarkdown)
# install.packages("knitr")
library(knitr)
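
If you would rather not un-comment lines by hand, the check for missing packages can be automated. The following is a minimal sketch and not part of the original workflow: the helper name `missing_packages` is hypothetical, and the actual `install.packages` call is left commented out so nothing is installed unless you choose to.

```r
# Packages used in this chapter (same list as above).
pkgs <- c("dplyr", "rerddap", "leaflet", "RColorBrewer",
          "ggplot2", "rmarkdown", "knitr")

# Hypothetical helper: return the names of packages that are not installed.
# requireNamespace() returns FALSE (without an error) when a package is absent.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

to_install <- missing_packages(pkgs)

# Un-comment to install whatever is missing:
# if (length(to_install) > 0) install.packages(to_install)
```

After any missing packages are installed, the `library` calls above load them as usual.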

2.1 Using ERDDAP to download data

This section shows how to get data directly from ERDDAP (Environmental Research Division's Data Access Program) using the ‘rerddap’ package in R.

To see the meanings and valid values for all of the fields in the National Database, download an Excel version of the full data dictionary from the following location: LINK

Use the ‘info’ function from the ‘rerddap’ package to get information about the variables (fields) in the database. Note that the ‘Latitude’ and ‘Longitude’ fields are in ERDDAP as ‘latitude’ and ‘longitude’, with no capital letters. These are the only field names that differ from what is in the data dictionary.

info <- info(datasetid='deep_sea_corals', 
              url = "https://www.ncei.noaa.gov/erddap/")

Now take a look at the first six field names using the ‘head’ function.

head(info$variables)

##          variable_name data_type  actual_range
## 1              AphiaID       int -999, 1419323
## 2  AssociatedSequences    String              
## 3       AssociatedTaxa    String              
## 4        CatalogNumber       int    1, 1045199
## 5 CategoricalAbundance    String              
## 6             Citation    String

To list all variables in the NOAA National Database for Deep Sea Corals and Sponges, use the following:

info$variables
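
The full variable table is long, so it can help to search it rather than read it top to bottom. The sketch below uses base R only and runs offline: `vars` is a small stand-in data frame copied from the ‘head’ output above, and with a live connection `info$variables` would take its place. The names `vars`, `int_vars`, and `taxa_vars` are illustrative.

```r
# Stand-in for info$variables, copied from the head() output shown earlier.
vars <- data.frame(
  variable_name = c("AphiaID", "AssociatedSequences", "AssociatedTaxa",
                    "CatalogNumber", "CategoricalAbundance", "Citation"),
  data_type     = c("int", "String", "String", "int", "String", "String"),
  actual_range  = c("-999, 1419323", "", "", "1, 1045199", "", ""),
  stringsAsFactors = FALSE
)

# Fields stored as integers (candidates for numeric constraints).
int_vars <- vars$variable_name[vars$data_type == "int"]

# Case-insensitive search for fields whose name contains a keyword.
taxa_vars <- vars$variable_name[grepl("taxa", vars$variable_name,
                                      ignore.case = TRUE)]

int_vars
taxa_vars
```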

After reviewing the data dictionary and deciding which variables (fields) you want, download them with the ‘tabledap’ function from the ‘rerddap’ package. The code below is an example that sets geographic constraints on latitude and longitude and selects a particular Vessel. You can set any numeric or character-based constraint on your downloads using this basic syntax. If you set no constraints, you will download all of the data, which takes more time, so be patient.

You can also filter your data after downloading, rather than using constraints in the ‘tabledap’ function. This may be preferable if you want to initially download all of the data and all of the fields for exploration, or if you use the database regularly in your work. The example below contains constraints. Note: Character constraints must be enclosed in double quotes, so the syntax for a constraint is: ‘Vessel="Okeanos Explorer R/V"’. Fields can be listed in any order.

library(rerddap)
d <- tabledap("deep_sea_corals", 'longitude<50', 'latitude>20', 'latitude<30', 
              'Vessel="Okeanos Explorer R/V"',
              fields=c('DatabaseVersion', 'CatalogNumber', 'latitude', 'longitude', 'ScientificName', 'ImageURL', 
                       'Vessel', 'RecordType', 'DatasetID', 'SurveyID', 'SampleID', 'TrackingID',
                       'Station', 'Locality', 'ObservationYear', 'Genus', 'Phylum', 'TaxonRank',
                       'DepthInMeters'),
              url = "https://www.ncei.noaa.gov/erddap/")
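
When constraints come from variables rather than being typed by hand, the quoting rule for character values is easy to get wrong. The following is a minimal sketch of building the constraint strings programmatically. The helpers `chr_constraint` and `num_constraint` are hypothetical (they are not part of ‘rerddap’), and the code only builds strings, so it runs without a connection.

```r
# Hypothetical helper: character constraint, value wrapped in double quotes.
chr_constraint <- function(field, value) {
  sprintf('%s="%s"', field, value)
}

# Hypothetical helper: numeric constraint with a basic comparison operator.
num_constraint <- function(field, op, value) {
  stopifnot(op %in% c("<", ">", "<=", ">=", "="))
  paste0(field, op, value)
}

constraints <- c(
  num_constraint("latitude", ">", 20),
  num_constraint("latitude", "<", 30),
  chr_constraint("Vessel", "Okeanos Explorer R/V")
)
constraints
```

Each element of `constraints` has the same form as the hand-written strings in the ‘tabledap’ call above, so the elements can be supplied to ‘tabledap’ as its unnamed constraint arguments.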

Check the size of the data frame that you just downloaded using the ‘dim’ function.

x <- dim(d)
x

## [1] 30562    19

The dataframe that you downloaded contains 30562 rows (occurrences) by 19 fields.

Now list the names of the fields to check that you downloaded the fields that you wanted.

names(d)

##  [1] "DatabaseVersion" "CatalogNumber"   "latitude"       
##  [4] "longitude"       "ScientificName"  "ImageURL"       
##  [7] "Vessel"          "RecordType"      "DatasetID"      
## [10] "SurveyID"        "SampleID"        "TrackingID"     
## [13] "Station"         "Locality"        "ObservationYear"
## [16] "Genus"           "Phylum"          "TaxonRank"      
## [19] "DepthInMeters"

Now keep only the records that have an image, then look at the first rows of the data frame with ‘head’.

d <- d %>% filter(!is.na(ImageURL))
head(d)

## <ERDDAP tabledap> deep_sea_corals
##    Path: [C:\Users\ROBERT~1.MCG\AppData\Local/Cache/R/rerddap/2f86982e23e0cc5e2ae70a1af066cd14.csv]
##    Last updated: [2021-03-11 07:28:50]
##    File size:    [8.72 mb]
## # A tibble: 6 x 19
##   DatabaseVersion CatalogNumber latitude longitude
##   <chr>                   <int> <chr>    <chr>    
## 1 20201021-0             419497 26.46665 -84.77803
## 2 20201021-0             419498 26.46664 -84.77803
## 3 20201021-0             419499 26.46664 -84.77804
## 4 20201021-0             419500 26.46662 -84.77801
## 5 20201021-0             419501 26.46662 -84.778  
## 6 20201021-0             419502 26.46664 -84.778  
## # ... with 15 more variables: ScientificName <chr>,
## #   ImageURL <chr>, Vessel <chr>, RecordType <chr>,
## #   DatasetID <chr>, SurveyID <chr>, SampleID <chr>,
## #   TrackingID <chr>, Station <chr>, Locality <chr>,
## #   ObservationYear <int>, Genus <chr>, Phylum <chr>,
## #   TaxonRank <chr>, DepthInMeters <chr>
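
Note that in the output above ‘latitude’, ‘longitude’, and ‘DepthInMeters’ are typed as character (`<chr>`). If you plan to filter or plot on these fields after downloading, converting them to numeric first avoids text-based comparisons. The sketch below shows one way to do this in base R on a toy data frame (`d_toy`, an illustrative name) built from three rows of the output above; the same `lapply` step could be applied to the matching columns of `d`.

```r
# Toy rows copied from the head() output above; coordinates arrive as text.
d_toy <- data.frame(
  CatalogNumber = c(419497L, 419498L, 419499L),
  latitude      = c("26.46665", "26.46664", "26.46664"),
  longitude     = c("-84.77803", "-84.77803", "-84.77804"),
  stringsAsFactors = FALSE
)

# Convert the coordinate columns from character to numeric.
num_cols <- c("latitude", "longitude")
d_toy[num_cols] <- lapply(d_toy[num_cols], as.numeric)

# Numeric comparisons now behave as expected.
south <- d_toy[d_toy$latitude < 26.466645, ]
nrow(south)
```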

Use the following loop to browse the first five images in the set. The ‘ImageURL’ variable contains the URL of each hosted image. The ‘browseURL’ function opens each URL in your default web browser. Warning: Be careful when using this function in a loop, as it will open as many URLs as you feed it, which could overwhelm your computer.

for (i in head(d$ImageURL, 5)){
  browseURL(i)
}
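
One way to guard against opening too many pages is to wrap the loop in a function that enforces a hard cap. This is a minimal sketch under stated assumptions: `open_first_n` is a hypothetical helper, the `opener` argument exists only so the function can be tried without launching a browser, and the URLs in the dry run are made-up placeholders.

```r
# Hypothetical helper: open at most `n` URLs, skipping missing or empty ones.
# `opener` defaults to utils::browseURL; pass another function for a dry run.
open_first_n <- function(urls, n = 5, opener = utils::browseURL) {
  urls <- urls[!is.na(urls) & nzchar(urls)]  # drop NA and "" entries
  urls <- utils::head(urls, n)               # enforce the cap
  for (u in urls) opener(u)
  invisible(urls)
}

# Dry run with placeholder URLs: print a message instead of opening anything.
demo_urls <- c("https://example.com/a.jpg", NA, "", "https://example.com/b.jpg")
opened <- open_first_n(demo_urls, n = 5,
                       opener = function(u) message("would open: ", u))
```

With the downloaded data, `open_first_n(d$ImageURL)` would open no more than five images regardless of how many rows `d` contains.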

2.2 Get data using a CSV

Alternatively, if you already have a subset of the National Database on your local machine, you can load it directly from a CSV file into R.

setwd("C:/your/working/directory")
d <- read.csv("data_records.csv", header = TRUE)
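
If you prefer not to change the working directory, the file can be read by its full path instead. The sketch below is self-contained: it writes a tiny stand-in CSV to a temporary folder and reads it back. The folder, the object names `csv_path` and `d_local`, and the two placeholder records are all illustrative; with real data, `data_dir` would point to wherever your extract is saved.

```r
# Stand-in for a local extract: write a tiny CSV to a temporary folder.
data_dir <- tempdir()
csv_path <- file.path(data_dir, "data_records.csv")
write.csv(
  data.frame(CatalogNumber  = c(419497L, 419498L),
             ScientificName = c("Placeholder one", "Placeholder two")),
  csv_path, row.names = FALSE
)

# Read by full path; stop early if the file is not where it is expected.
stopifnot(file.exists(csv_path))
d_local <- read.csv(csv_path, header = TRUE, stringsAsFactors = FALSE)
str(d_local)
```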