-
Notifications
You must be signed in to change notification settings - Fork 1
Expand file tree
/
Copy pathREADME.Rmd
More file actions
142 lines (103 loc) · 6.75 KB
/
README.Rmd
File metadata and controls
142 lines (103 loc) · 6.75 KB
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
---
output:
html_document:
keep_md: yes
editor_options:
chunk_output_type: console
---
```{r, include = FALSE}
suppressPackageStartupMessages(library(tidyverse))
```
# BioticExplorer-Server
**R Package for Downloading Server-Side Data for BioticExplorer**
This package can be used to download IMR Biotic database and to place it into a [duckdb](https://duckdb.org/docs/api/r.html) database, providing access to all of the institute's Biotic data from R.
## Installation
```{r, eval = FALSE}
remotes::install_github("DeepWaterIMR/BioticExplorerServer")
```
## Usage
### Download the IMR biotic database
The BioticExplorerServer package (BES) downloads and compiles the IMR database into a [duckdb](https://cran.r-project.org/package=duckdb) database. The download requires stable intranet access (VPN, cable within the institute web or HI-adm WiFi). The **database requires more than 2 Gb of disk space**. Make sure to modify the `dbPath` argument to choose an appropriate location for the database. **Do not store it in a folder that is synced to the cloud** due to its size, and be mindful of the [institute's data policy](https://www.hi.no/resources/Data-policy-HI.pdf). While most of the data is licensed under NLOD (the governmental version of CCBY), **some external data within the database must not be shared outside the institute**. **By using this package, you accept the responsibility of handling the IMR Biotic data according to the licenses and regulations**. It is advisable to run the download command in a separate R session or in a screen session in the terminal on Unix machines, as downloading the database takes several hours and requires a stable internet connection. If the connection is unstable, the function may return an error. In such cases, ensure that the connection is stable and rerun the command. The function should continue downloading from where it left off.
```{r, eval = FALSE}
library(BioticExplorerServer)
compileDatabase(dbPath = "~/IMR_biotic_BES_database") # default dbPath, written out to show it
```
### Update the database
Currently, the entire database must be re-downloaded to update it because IMR Biotic database does not have last modified tags. To update the database, you can do the following:
```{r, eval = FALSE}
library(BioticExplorerServer)
compileDatabase(dbPath = "~/IMR_biotic_BES_database",
dbName = "bioticexplorer-next")
unlink(normalizePath("~/IMR_biotic_BES_database/bioticexplorer.duckdb"))
file.rename(
normalizePath("~/IMR_biotic_BES_database/bioticexplorer-next.duckdb"),
normalizePath("~/IMR_biotic_BES_database/bioticexplorer.duckdb",
mustWork = FALSE))
```
This process first downloads the database to a file named `bioticexplorer-next.duckdb`, then deletes the old database and renames `bioticexplorer-next.duckdb` to `bioticexplorer.duckdb`. Since downloading takes time, you can continue using the database in another R session while the download is in progress. If it is not a priority, you can also overwrite the existing database:
```{r, eval = FALSE}
compileDatabase(dbPath = "~/IMR_biotic_BES_database", overwrite = TRUE)
```
### Uninstall the database
If you want to remove the database from your computer, simply delete the folder at the specified `dbPath`. Remember to empty your trash bin as well.
### Control the database through R
Once the database has been downloaded and saved to a [duckdb](https://cran.r-project.org/package=duckdb) database, you can use standard [DBI](https://cran.r-project.org/package=DBI) or [dplyr](https://cran.r-project.org/package=dplyr) functions to access it:
```{r, message = FALSE}
# Packages required to replicate the example:
packages <- c("tidyverse", "data.table", "DBI", "duckdb")
# Install packages not yet installed
installed_packages <- packages %in% rownames(installed.packages())
if (any(installed_packages == FALSE)) {
install.packages(packages[!installed_packages])
}
# Load the packages
invisible(lapply(packages, library, character.only = TRUE, quietly = TRUE))
# Connect to the database (assuming you used standard dbPath and name)
con_db <- "~/IMR_biotic_BES_database/bioticexplorer.duckdb" %>%
normalizePath() %>%
duckdb::duckdb(read_only = TRUE) %>%
DBI::dbConnect()
## Create the data objects
stnall <- dplyr::tbl(con_db, "stnall") # station-based data
indall <- dplyr::tbl(con_db, "indall") # individual-based data
ageall <- dplyr::tbl(con_db, "ageall") # age data
mission <- dplyr::tbl(con_db, "mission") # information on
meta <- dplyr::tbl(con_db, "metadata") %>% # time of download
collect() %>% mutate_all(as.POSIXct)
csindex <- dplyr::tbl(con_db, "csindex") # cruise series index
gearlist <- dplyr::tbl(con_db, "gearindex") %>% collect() # gear index
```
These data objects can now be used in R:
```{r}
head(mission)
```
The [dplyr package can also be used with databases](https://solutions.posit.co/connections/db/r-packages/dplyr/). The only difference from normal use is that you'll need to [`collect()`](https://dbplyr.tidyverse.org/reference/collapse.tbl_sql.html) the data from the database after filtering. Note that you are handling large amounts of data, and using the collect function incorrectly may cause your computer to crash due to insufficient RAM. Therefore, always filter before collecting and consider using [`compute()`](https://dbplyr.tidyverse.org/reference/collapse.tbl_sql.html) or use the [data.table](https://cran.r-project.org/web/packages/data.table/index.html) package, if you'll need to handle very large proportions of the IMR Biotic database.
```{r}
stnall %>%
filter(!is.na(cruise)) %>%
collect() %>%
group_by(startyear) %>%
reframe(n = length(unique(paste(cruise, platformname, serialnumber)))) %>%
ggplot(aes(x = startyear, y = n)) +
geom_col() +
labs(x = "Year", y = "Number of stations",
title = "Number of sampling stations in IMR survey data over years") +
theme_classic()
```
The duckdb database contains following data tables:
```{r}
DBI::dbListTables(con_db)
```
### Explore the database using Biotic Explorer shiny app
Once downloaded, you can also use the database through the [Biotic Explorer](https://github.com/DeepWaterIMR/BioticExplorer) shiny app.
## Troubleshooting
If you get an error something like:
> Error:
> ! error in evaluating the argument 'drv' in selecting a method for function 'dbConnect':
> rapi_startup: Failed to open database: {"exception_type":"IO","exception_message":"Could not set lock on file
> ...
You likely have database connection with write privileges open elsewhere within your R session. DuckDB supports only one write connection, but multiple simultaneus read only connections can be established. Close the connections to the database:
```{r}
DBI::dbDisconnect(con_db)
```
and connect to the database in read only mode: `duckdb::duckdb(read_only = TRUE)`.