We will be using the following packages:
library(ggplot2) # For beautiful plots
library(maps) # For a map of the Fennoscandian countries
library(dplyr) # For data wrangling
library(ggmap) # For getting a higher resolution map of Trondheim
library(cowplot) # For making a grid of ggplots
For downloading all data sets used here, follow these links:
NORS: https://ntnu.box.com/shared/static/bn44f3aulesciijjb9dq4o72add25vyt.zip
Kautokeino: https://gbif.vm.ntnu.no/ipt/resource?r=kautokeino_fish_inventory (we don’t really have this data right now)
Trondheim: https://gbif.vm.ntnu.no/ipt/resource?r=freshwater_survey_occurrences_trondheim_municipality
Nordic presence absence: https://gbif.vm.ntnu.no/ipt/resource?r=fish_status_survey_of_nordic_lakes
Transcriptions of Norwegian gillnet test-fishing: https://gbif.vm.ntnu.no/ipt/resource?r=transcribed_gillnet_test_fishing_data_norway
Citizen science observations: (Norwegian and Swedish species observation services) To download from GBIF, use dataset key b124e1e0-4755-430f-9eab-894f25a9b59c for Norway, or 38b4c89f-584c-41bb-bd8f-cd1def33e92f for Sweden. Note that this gives all species, not only fish, so filters for specific species of interest should also be provided. See the script downloading_CS_observations.R.
Lake polygons Fennoscandia: https://bird.unit.no/resources/9b27e8f0-55dd-442c-be73-26781dad94c8/content (click on “Innhold”-tab at the bottom of the page to download only selected sets of lakes)
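For the citizen science observations listed above, a download restricted to one species could look roughly like the sketch below. This is not the content of downloading_CS_observations.R; the function name `download_cs_fish` and its arguments are hypothetical, and the sketch assumes the rgbif package and a GBIF user account.

```r
# Hypothetical helper: request one species from one GBIF dataset.
# Assumes the rgbif package is installed and that you have GBIF credentials.
download_cs_fish <- function(dataset_key, species_name, user, pwd, email) {
  # Look up the GBIF backbone taxon key for the species name
  taxon_key <- rgbif::name_backbone(name = species_name)$usageKey

  # Request only records from this dataset and this taxon
  request <- rgbif::occ_download(
    rgbif::pred("datasetKey", dataset_key),
    rgbif::pred("taxonKey", taxon_key),
    format = "SIMPLE_CSV",
    user = user, pwd = pwd, email = email
  )

  # Wait until GBIF has prepared the file, then fetch and read it
  rgbif::occ_download_wait(request)
  archive <- rgbif::occ_download_get(request)
  rgbif::occ_download_import(archive)
}

# Example call (not run, needs credentials and network access):
# pike_no <- download_cs_fish("b124e1e0-4755-430f-9eab-894f25a9b59c",
#                             "Esox lucius", "my_user", "my_pwd", "me@example.org")
```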
For some of the data sets, we need to merge the occurrence and event files. This is easily done by merging on eventID once both files are loaded into R:
merged_data <- merge(occurrence, event, by = "eventID")
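One thing to keep in mind is that `merge()` by default keeps only rows whose eventID appears in both files. The sketch below uses small made-up data frames (all values are invented for illustration) to show how to spot occurrences that would be dropped.

```r
# Toy event and occurrence tables; all values are invented
event <- data.frame(
  eventID = c("e1", "e2", "e3"),
  eventDate = c("1996-07-01", "1996-07-02", "1997-06-15"),
  stringsAsFactors = FALSE
)
occurrence <- data.frame(
  occurrenceID = c("o1", "o2", "o3", "o4"),
  eventID = c("e1", "e1", "e2", "e4"),
  scientificName = c("Esox lucius", "Perca fluviatilis",
                     "Esox lucius", "Salmo trutta"),
  stringsAsFactors = FALSE
)

# Default merge: occurrence o4 disappears because event e4 does not exist
merged_inner <- merge(occurrence, event, by = "eventID")

# Keep every occurrence, filling event columns with NA where no event matches
merged_left <- merge(occurrence, event, by = "eventID", all.x = TRUE)

# eventIDs used by occurrences but missing from the event file
unmatched <- setdiff(occurrence$eventID, event$eventID)
```

Comparing `nrow(merged_inner)` with `nrow(occurrence)` is a cheap check that nothing was lost in the merge.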
See details in the file merging_occ_and_event.R. I store the merged datasets in data folders within each dataset’s individual folder. This is what my folder structure looks like:
NORS <- readRDS("../NORS/data/merged.rds")
PIKE <- readRDS("../PIKE/data/merged.rds")
#kautokeino_data <-
trondheim_data <- read.table("../Trondheim_freshwater_survey/data/occurrence.txt", sep = "\t", header = TRUE)
nordic_survey <- readRDS("../Fish_status_survey_of_nordic_lakes/data/merged.rds")
transcribed_gillnet <- readRDS("../Transcribed_gillnet_test_fishing_data_norway/data/merged.rds")
artsobsNO <- readRDS("../Nordic_Species_Observation_Services/data/GBIF_download_Norge.rds")
artobsSE <- readRDS("../Nordic_Species_Observation_Services/data/GBIF_download_Sverige.rds")
Finally! Now we can go on to look at these datasets.
First, let us look at when observations are made in the different datasets.
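A plot of this kind could be produced along the following lines. This is a minimal sketch on made-up years, not the code behind the original figure; the data frame `years_long` and the dataset names in it are invented for illustration.

```r
library(ggplot2)

# Toy data: one row per observation, with the dataset it came from
years_long <- rbind(
  data.frame(dataset = "survey_a", year = c(1995, 1996, 1996, 1996, 2001)),
  data.frame(dataset = "survey_b", year = c(1890, 1950, 2017, 2017, NA))
)

# Drop missing years and observations from before 1900
years_recent <- subset(years_long, !is.na(year) & year >= 1900)

# One histogram per dataset, each with its own y-axis
year_plot <- ggplot(years_recent, aes(x = year)) +
  geom_histogram(binwidth = 1) +
  facet_wrap(~ dataset, scales = "free_y") +
  labs(x = "Year", y = "Number of observations")
```

Setting `scales = "free_y"` is what lets each panel use its own y-axis range.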
Note that since the number of observations varies so widely between years and datasets, the y-axis differs between the mini-plots above. I have also excluded observations from before 1900 in the plot, since they are too few to be visible. Some of the datasets do have earlier observations, though, as the table below shows.
dataset | earliest_year | latest_year | most_common_year |
---|---|---|---|
NORS | 1952 | 2015 | 1996 |
PIKE | 1600 | 2010 | 1896 |
trondheim_data | 2014 | 2014 | 2014 |
nordic_survey | 19 | 1996 | 1996 |
transcribed_gillnet | 1970 | 1998 | 1977 |
artsobsNO | 1946 | 2020 | 2017 |
artobsSE | 1871 | 2020 | NA |
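A summary like the one in the table can be computed with a small helper. The sketch below is an assumption about how this might be done, not the original code; `year_summary` is a hypothetical name. It counts NA as a possible most common value, which would reproduce the NA shown for artobsSE. An earliest_year of 19, as reported for nordic_survey, looks like a possible data-entry issue, and a summary of this kind is one way to spot such values.

```r
# Hypothetical helper: earliest, latest and most common year of a vector
year_summary <- function(years) {
  # Count each year, treating NA as its own category
  counts <- table(years, useNA = "ifany")
  data.frame(
    earliest_year = min(years, na.rm = TRUE),
    latest_year = max(years, na.rm = TRUE),
    # names() of the NA category is NA, so this yields NA if NA is most common
    most_common_year = as.integer(names(counts)[which.max(counts)])
  )
}

# Toy example with invented years
toy_summary <- year_summary(c(1996, 1996, 2001, 1952, NA))
```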
For some reason, the observations from the Swedish observation service have a lot of NA entries for the year variable. The fraction of entries with year = NA is:
nrow(filter(artobsSE, is.na(year))) / nrow(artobsSE)
## [1] 0.3648923
That is slightly odd; I will look into this further in the map section.
The datasets may differ in the type of data available. Some are presence/absence, some give only presences, while others count number of individuals observed.
To get a quick overview, the function check_obs_type() shows the levels of the variable occurrenceStatus and the number of different levels of the variable organismQuantity:
check_obs_type <- function(dataset){
  cat("Dataset:", deparse(substitute(dataset)), "\n")
  cat("occurrenceStatus has levels: ",
      unique(dataset$occurrenceStatus))
  cat("\nNumber of different organism quantities reported:",
      length(unique(dataset$organismQuantity)))
}
## Dataset: NORS
## occurrenceStatus has levels: 1
## Number of different organism quantities reported: 1
## Dataset: PIKE
## occurrenceStatus has levels: 4 2 1 3 5
## Number of different organism quantities reported: 1493
## Dataset: trondheim_data
## occurrenceStatus has levels: 1
## Number of different organism quantities reported: 193
## Dataset: nordic_survey
## occurrenceStatus has levels: 1 2
## Number of different organism quantities reported: 5
## Dataset: transcribed_gillnet
## occurrenceStatus has levels: 1
## Number of different organism quantities reported: 792
## Dataset: artsobsNO
## occurrenceStatus has levels: present absent
## Number of different organism quantities reported: 1
For some reason some of the datasets seem to have levels “1” and “2”, while others have “present” and “absent”, while in reality they all have only “present”/“absent” (or only “present”). Let’s check the classes of the variables for the different datasets:
class(nordic_survey$occurrenceStatus)
## [1] "factor"
class(artsobsNO$occurrenceStatus)
## [1] "character"
So the reason is just that most of the datasets store occurrenceStatus as a factor variable, while artsobsNO stores it as a character. When a factor is passed to cat(), its underlying integer codes are printed rather than its labels.
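The difference is easy to reproduce on a toy vector. The sketch below shows the behaviour and one possible way around it; `status_labels` is a hypothetical helper, not part of the original code.

```r
# The same values stored as a factor and as a character vector
status_factor <- factor(c("present", "absent", "present"))
status_char <- c("present", "absent", "present")

# cat() prints the integer codes of a factor ...
cat(unique(status_factor), "\n")
# ... but the actual strings of a character vector
cat(unique(status_char), "\n")

# Hypothetical helper: always return the labels, whatever the storage type
status_labels <- function(x) as.character(unique(x))

cat(status_labels(status_factor), "\n")
```

Using `as.character()` before printing would make the output of check_obs_type() comparable across datasets.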
We also see that some of the datasets are clearly reporting abundance, while others are not. Some, like the Nordic survey, have mostly just reported presences, but seem to have an organismQuantity for a few of the observations. Let’s take a closer look at that:
unique(nordic_survey$organismQuantity)
## [1] <NA> unknown sparse ordinary abundant
## Levels: abundant ordinary sparse unknown
Another strange thing is that there seem to be presence/absence observations in the citizen science data from the species observation services, where we would have expected presence-only data. However, there is actually just one observation with “absent”.
nrow(filter(artsobsNO, occurrenceStatus == "absent"))
## [1] 1
Dataset | Type of observations |
---|---|
NORS | Presence only. |
PIKE | Presence/absence/rare/doubtful/irregular, with abundance as number. |
Kautokeino | something |
Trondheim | Presence only, with abundance as number. |
Nordic survey | Presence/absence, with abundance as one of “unknown”, “sparse”, “ordinary” or “abundant” |
Transcribed gillnet | Presence only, with abundance as number. |
Species observation services | Presence only. (note: there may be a few absences in the datasets) |
Next, let’s look at the observation points for each of the datasets.
plot_obs <- function(dataset){
  # Outline of Norway (excluding Svalbard), Sweden and Finland
  nosefi <- map_data("world", region = c("Norway(?!:Svalbard)",
                                         "Sweden", "Finland"))
  p <- ggplot(dataset) +
    geom_polygon(data = nosefi, aes(long, lat, group = group),
                 color="#2b2b2b", fill = "white") +
    geom_point(aes(x = decimalLongitude, y = decimalLatitude),
               color = 'hotpink4', alpha = 0.6, size = 0.5) +
    theme(axis.title.x = element_blank(), axis.title.y = element_blank()) +
    guides(colour = guide_legend(override.aes = list(size=2))) +
    # Use the name of the object passed in as the plot title
    ggtitle(deparse(substitute(dataset)))
  return(p)
}
NORS_plot <- plot_obs(NORS)
PIKE_plot <- plot_obs(PIKE)
#kautokeino_data <- plot_obs(kautokeino_data)
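# Bounding box for Trondheim: left, bottom, right, top
trond_loc <- c(10.1, 63.34, 10.5, 63.46)
# Note: recent ggmap releases serve the former Stamen tiles through Stadia Maps,
# so this call may need updating (and an API key) with a newer ggmap version.
trond_map <- get_map(location=trond_loc,
                     source="stamen", maptype="watercolor", crop=FALSE)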
trondheim_plot <- ggmap(trond_map) +
geom_point(data = trondheim_data, aes(x = decimalLongitude,
y = decimalLatitude),
alpha = 0.6, size = 1, color = 'hotpink4') +
ggtitle("trondheim_data")
nordic_survey_plot <- plot_obs(nordic_survey)
transcribed_gillnet_plot <- plot_obs(transcribed_gillnet)
artsobsNO_plot <- plot_obs(artsobsNO)
artobsSE_plot <- plot_obs(artobsSE)
plot_grid(NORS_plot, PIKE_plot, trondheim_plot, nordic_survey_plot, transcribed_gillnet_plot, artsobsNO_plot, artobsSE_plot)
Recall that 36% of the observations from the Swedish species observation service lack a value for the year variable. Let us look at where these observations are located compared to the observations that do have a year.
So there is a blob right by the Norwegian border that doesn’t have year information. Let’s explore this a little more:
not_all_na <- function(x) any(!is.na(x))
artobsSE.NA <- artobsSE %>% filter(is.na(year)) %>% select_if(not_all_na)
rightsHolder_counts <- count(artobsSE.NA, rightsHolder, sort = TRUE)
varmland <- artobsSE.NA %>% filter(rightsHolder == as.character(rightsHolder_counts[1,1]))
plot_obs(varmland)
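The comparison of records with and without a year described above could be drawn by flagging each record and splitting the map into two panels. The sketch below uses a few invented coordinates and leaves out the country outlines; the column `has_year` is a hypothetical addition, not a field in the GBIF download.

```r
library(ggplot2)

# Toy observations; coordinates and years are invented
obs <- data.frame(
  decimalLongitude = c(12.5, 13.1, 18.0, 20.2),
  decimalLatitude = c(59.8, 60.1, 59.3, 63.8),
  year = c(NA, NA, 2015, 2019)
)

# Flag whether each record has a year
obs$has_year <- ifelse(is.na(obs$year), "year missing", "year recorded")

# One panel per group, so spatial clusters of missing years stand out
na_plot <- ggplot(obs, aes(x = decimalLongitude, y = decimalLatitude)) +
  geom_point(color = "hotpink4", alpha = 0.6) +
  facet_wrap(~ has_year) +
  coord_quickmap()
```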
A quick look at who made the Norwegian citizen science observations:
recordedBy_count <- count(artsobsNO, recordedBy, sort = TRUE)
institutionCode_count <- count(artsobsNO, institutionCode, sort = TRUE)
collectionCode_count <- count(artsobsNO, collectionCode, sort = TRUE)
datasetName_count <- count(artsobsNO, datasetName, sort = TRUE)
head(recordedBy_count)
## # A tibble: 6 x 2
## recordedBy n
## <chr> <int>
## 1 Ole-Håkon Heier 1244
## 2 Øyvind Nyvold Larsen 197
## 3 Line Selvaag 190
## 4 Roar Pettersen 189
## 5 Knut Eie 105
## 6 Øystein Engen 90
institutionCode_count
## # A tibble: 1 x 2
## institutionCode n
## <chr> <int>
## 1 nzf 4557
collectionCode_count
## # A tibble: 1 x 2
## collectionCode n
## <chr> <int>
## 1 so2-fishes 4557
datasetName_count
## # A tibble: 28 x 2
## datasetName n
## <chr> <int>
## 1 "" 4267
## 2 "Forsvarsbygg" 191
## 3 "Kartleggingsmidler Sabima" 21
## 4 "NØF-øvrige arter" 12
## 5 "Artsjakten!" 10
## 6 "Gjeddtjørna" 8
## 7 "Statens naturoppsyn" 8
## 8 "Besøkssenter våtmark Nordre Øyeren - generelle observasjoner" 7
## 9 "Viltkartlegging i våtmarksbiotopar langs Surna" 5
## 10 "Abbor i Stavsjøen" 3
## # ... with 18 more rows
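The counts above suggest that a few people contribute a large part of the records: the most active recorder accounts for 1244 of the 4557 observations, or about 27%. A share column makes this kind of skew easier to read. The sketch below runs on invented names and is only one way to do it.

```r
library(dplyr)

# Toy data; recorder names are invented
obs <- data.frame(
  recordedBy = c("A", "A", "A", "B", "C"),
  stringsAsFactors = FALSE
)

# Count records per recorder and add each recorder's share of the total
recorder_share <- obs %>%
  count(recordedBy, sort = TRUE) %>%
  mutate(share = n / sum(n))
```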