1 Data preparation

We will be using the following packages:

library(ggplot2) # For beautiful plots
library(maps) # For a map of the Fennoscandian countries
library(dplyr) # For data wrangling
library(ggmap) # For getting a higher resolution map of Trondheim
library(cowplot) # For making a grid of ggplots

1.1 Downloading

For downloading all data sets used here, follow these links:

1.2 Merging events and occurrences

For some of the data sets, we need to merge occurrence- and event-files. This is done easily by merging by eventID once both files are loaded into R:

merged_data <- merge(occurrence, event, by = "eventID")
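
Note that merge() with only the by argument performs an inner join, so occurrences whose eventID is missing from the event file are silently dropped. The following is a minimal sketch of how one might check for that, using small made-up tables rather than the real files (all values below are hypothetical):

```r
# Toy stand-ins for the occurrence and event tables (hypothetical data).
occurrence <- data.frame(
  occurrenceID = c("o1", "o2", "o3"),
  eventID      = c("e1", "e1", "e3"),
  stringsAsFactors = FALSE
)
event <- data.frame(
  eventID   = c("e1", "e2"),
  eventDate = c("1996-07-01", "1996-07-02"),
  stringsAsFactors = FALSE
)

# Inner join, as in the text: occurrences without a matching event are dropped.
merged_data <- merge(occurrence, event, by = "eventID")

# eventIDs that occur in the occurrence table but not in the event table.
unmatched <- setdiff(occurrence$eventID, event$eventID)

# all.x = TRUE keeps those occurrences and fills the event columns with NA.
merged_keep_all <- merge(occurrence, event, by = "eventID", all.x = TRUE)
```

If unmatched is empty for the real files, the two variants give the same rows.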

See details in the file merging_occ_and_event.R. I store the merged datasets in data folders within each dataset’s individual folder. This is what my folder structure looks like:

  • Fish_status_survey_of_nordic_lakes
    • data
  • Nordic_Species_Observation_Services
    • data
  • NORS
    • data
  • PIKE
    • data
  • R
    • download_lakefish_dataset.R
    • match_to_lake.R
    • merging_occ_and_event.R
  • Transcribed_gillnet_test_fishing_data_norway
    • data
  • Trondheim_freshwater_survey
    • data
  • reports
    • Exploration_of_freshwater_fish_datasets.rmd (this report!!)

With this structure in place, the datasets are loaded using paths relative to the reports folder:

NORS <- readRDS("../NORS/data/merged.rds")
PIKE <- readRDS("../PIKE/data/merged.rds")
#kautokeino_data <- 
trondheim_data <- read.table("../Trondheim_freshwater_survey/data/occurrence.txt", sep = "\t", header = TRUE)
nordic_survey <- readRDS("../Fish_status_survey_of_nordic_lakes/data/merged.rds")
transcribed_gillnet <- readRDS("../Transcribed_gillnet_test_fishing_data_norway/data/merged.rds")
artsobsNO <- readRDS("../Nordic_Species_Observation_Services/data/GBIF_download_Norge.rds")
artobsSE <- readRDS("../Nordic_Species_Observation_Services/data/GBIF_download_Sverige.rds")

Finally! Now we can go on to look at these datasets.

2 Exploring the data

2.1 Distribution across years

First, let us look at when observations are made in the different datasets.

Note that since the number of observations varies so wildly between years and datasets, the y-axis differs between the mini-plots above. I have also excluded observations from before 1900 from the plot, since there are so few that they would not be visible. Some of the datasets do have earlier observations, though, as the table below shows.

dataset              earliest_year  latest_year  most_common_year
NORS                 1952           2015         1996
PIKE                 1600           2010         1896
trondheim_data       2014           2014         2014
nordic_survey        19             1996         1996
transcribed_gillnet  1970           1998         1977
artsobsNO            1946           2020         2017
artobsSE             1871           2020         NA
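
A table like this can be computed in a few lines. The sketch below is not the code used for the report. It assumes that every dataset has a numeric year column (only confirmed for artobsSE in this report), and the helper names most_common() and summarise_years() are made up for illustration. Missing years are counted as a value of their own when finding the most common year, which would explain an NA in that column.

```r
# Most frequent value of x, counting NA as a value of its own.
most_common <- function(x) {
  counts <- table(x, useNA = "ifany")
  as.numeric(names(counts)[which.max(counts)])
}

# One summary row per dataset in a named list of data frames.
summarise_years <- function(datasets) {
  rows <- lapply(names(datasets), function(name) {
    year <- datasets[[name]]$year
    data.frame(
      dataset          = name,
      earliest_year    = min(year, na.rm = TRUE),
      latest_year      = max(year, na.rm = TRUE),
      most_common_year = most_common(year),
      stringsAsFactors = FALSE
    )
  })
  do.call(rbind, rows)
}

# Hypothetical toy data standing in for the real datasets.
toy <- list(
  a = data.frame(year = c(1990, 1996, 1996, NA)),
  b = data.frame(year = c(2014, 2014))
)
year_table <- summarise_years(toy)
```

With the real data, the call would look something like summarise_years(list(NORS = NORS, PIKE = PIKE)), extended with the remaining datasets.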

For some reason, the observations from the Swedish observation service have a lot of NA entries for the year variable. The fraction of entries with year = NA is:

nrow(filter(artobsSE, is.na(year))) / nrow(artobsSE)
## [1] 0.3648923

That is slightly odd. I will look into this further in the section on observation locations (2.3).
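
The same fraction can also be written as the mean of a logical vector, which makes it easy to compare several datasets at once. A small sketch on made-up data (the helper frac_na() and the toy list are hypothetical):

```r
# Fraction of missing values in a vector. For a data-frame column this is
# equivalent to nrow(filter(df, is.na(col))) / nrow(df).
frac_na <- function(x) mean(is.na(x))

# Hypothetical toy data standing in for the real datasets.
toy <- list(
  with_gaps = data.frame(year = c(2001, NA, 2003, NA)),
  complete  = data.frame(year = c(1996, 1997))
)

# Named vector with one fraction per dataset.
na_fractions <- sapply(toy, function(d) frac_na(d$year))
```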

2.2 Observation types

The datasets may differ in the type of data available. Some are presence/absence, some give only presences, while others count number of individuals observed.

To get a quick overview, the function check_obs_type() shows the unique values of the variable occurrenceStatus and the number of different values of the variable organismQuantity:

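check_obs_type <- function(dataset){
  # Print the name of the object that was passed in
  cat("Dataset:", deparse(substitute(dataset)), "\n")
  # Unique values of occurrenceStatus
  # (cat() prints a factor as its underlying integer codes)
  cat("occurrenceStatus has levels: ",
      unique(dataset$occurrenceStatus))
  # Number of distinct values of organismQuantity (NA counts as one value)
  cat("\nNumber of different organism quantities reported:",
      length(unique(dataset$organismQuantity)))
}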
## Dataset: NORS 
## occurrenceStatus has levels:  1
## Number of different organism quantities reported: 1
## Dataset: PIKE 
## occurrenceStatus has levels:  4 2 1 3 5
## Number of different organism quantities reported: 1493
## Dataset: trondheim_data 
## occurrenceStatus has levels:  1
## Number of different organism quantities reported: 193
## Dataset: nordic_survey 
## occurrenceStatus has levels:  1 2
## Number of different organism quantities reported: 5
## Dataset: transcribed_gillnet 
## occurrenceStatus has levels:  1
## Number of different organism quantities reported: 792
## Dataset: artsobsNO 
## occurrenceStatus has levels:  present absent
## Number of different organism quantities reported: 1

For some reason, some of the datasets seem to have numeric levels such as “1” and “2”, while artsobsNO has “present” and “absent”. In reality, the occurrence status is recorded as text such as “present”/“absent” (or only “present”) in all of them. Let’s check the classes of the variables for the different datasets:

class(nordic_survey$occurrenceStatus)
## [1] "factor"
class(artsobsNO$occurrenceStatus)
## [1] "character"

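So the reason is just that most of the datasets have occurrenceStatus as a factor variable, while artsobsNO has it as a character. Since cat() prints a factor as its underlying integer codes, the factor levels show up as numbers in the output above.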
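
One way to get comparable output for all datasets is to convert the variable to character before listing its values. The following is a minimal sketch on made-up data, where status_labels() is a hypothetical helper and not part of the report’s code:

```r
# Status labels as text, whether occurrenceStatus is a factor or a character.
status_labels <- function(dataset) {
  sort(unique(as.character(dataset$occurrenceStatus)))
}

# Hypothetical toy data: the same statuses stored in the two different ways.
as_factor <- data.frame(
  occurrenceStatus = factor(c("present", "absent", "present"))
)
as_text <- data.frame(
  occurrenceStatus = c("present", "absent"),
  stringsAsFactors = FALSE
)

# cat() on the factor prints integer codes rather than labels.
cat(unique(as_factor$occurrenceStatus), "\n")

# After conversion, both give the same labels.
labels_factor <- status_labels(as_factor)
labels_text   <- status_labels(as_text)
```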

We also see that some of the datasets are clearly reporting abundance, while others are not. Some, like the Nordic survey, have mostly just reported presences, but seem to have an organismQuantity for a few of the observations. Let’s take a closer look at that:

unique(nordic_survey$organismQuantity)
## [1] <NA>     unknown  sparse   ordinary abundant
## Levels: abundant ordinary sparse unknown
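
unique() shows which categories exist, but not how often each one is used. To see how many observations actually carry a quantity, one could tabulate the variable including missing values. A sketch on a made-up vector (the values below are hypothetical, not taken from nordic_survey):

```r
# Hypothetical stand-in for nordic_survey$organismQuantity.
quantity <- factor(c(NA, "unknown", "sparse", NA, "ordinary", "abundant", NA))

# Counts per category, with NA as a category of its own.
quantity_counts <- table(quantity, useNA = "ifany")

# Number of observations with any quantity reported.
reported <- sum(!is.na(quantity))
```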

Another strange thing is that there seem to be presence/absence observations in the citizen science data from the species observation services, where we would have expected presence-only data. However, there is actually just one observation with “absent”:

nrow(filter(artsobsNO, occurrenceStatus == "absent"))
## [1] 1

Dataset                       Type of observations
NORS                          Presence only.
PIKE                          Presence/absence/rare/doubtful/irregular, with abundance as number.
Kautokeino                    something
Trondheim                     Presence only, with abundance as number.
Nordic survey                 Presence/absence, with abundance as one of “unknown”, “sparse”, “ordinary” or “abundant”.
Transcribed gillnet           Presence only, with abundance as number.
Species observation services  Presence only. (note: there may be a few absences in the datasets)

2.3 Observation locations

Next, let’s look at the observation points for each of the datasets.

plot_obs <- function(dataset){
  # Outlines of Norway (excluding Svalbard), Sweden and Finland
  nosefi <- map_data("world", region = c("Norway(?!:Svalbard)",
                                         "Sweden", "Finland"))
  p <- ggplot(dataset) +
    # Country polygons as background
    geom_polygon(data = nosefi, aes(long, lat, group = group),
                 color = "#2b2b2b", fill = "white") +
    # One semi-transparent point per observation
    geom_point(aes(x = decimalLongitude, y = decimalLatitude),
               color = 'hotpink4', alpha = 0.6, size = 0.5) +
    theme(axis.title.x = element_blank(), axis.title.y = element_blank()) +
    guides(colour = guide_legend(override.aes = list(size = 2))) +
    # Use the name of the object passed in as the plot title
    ggtitle(deparse(substitute(dataset)))
  return(p)
}

NORS_plot <- plot_obs(NORS)
PIKE_plot <- plot_obs(PIKE)
#kautokeino_data <- plot_obs(kautokeino_data)

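# Bounding box for Trondheim: left, bottom, right, top
trond_loc <- c(10.1, 63.34, 10.5, 63.46)
# Note: Stamen tiles have moved to Stadia Maps, so recent ggmap versions
# may need get_stadiamap() and an API key instead of this call.
trond_map <- get_map(location = trond_loc,
                     source = "stamen", maptype = "watercolor", crop = FALSE)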
trondheim_plot <- ggmap(trond_map) +
  geom_point(data = trondheim_data, aes(x = decimalLongitude, 
                                        y = decimalLatitude), 
             alpha = 0.6, size = 1, color = 'hotpink4') + 
  ggtitle("trondheim_data")

nordic_survey_plot <- plot_obs(nordic_survey)
transcribed_gillnet_plot <- plot_obs(transcribed_gillnet)
artsobsNO_plot <- plot_obs(artsobsNO)
artobsSE_plot <- plot_obs(artobsSE)

plot_grid(NORS_plot, PIKE_plot, trondheim_plot, nordic_survey_plot, transcribed_gillnet_plot, artsobsNO_plot, artobsSE_plot)

Now, recall that 36% of the observations from the Swedish species observation service have no value for the year variable. Let us look at where we find these observations compared to the observations that do have a year.

So there is a blob right by the Norwegian border that doesn’t have year information. Let’s explore this a little more:

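# TRUE for columns with at least one non-missing value
not_all_na <- function(x) any(!is.na(x))
# Observations without a year, keeping only columns that are not entirely NA
artobsSE.NA <- artobsSE %>% filter(is.na(year)) %>% select_if(not_all_na)

# Rights holders of these observations, most frequent first
rightsHolder_counts <- count(artobsSE.NA, rightsHolder, sort = TRUE)
top_rightsHolder <- as.character(rightsHolder_counts$rightsHolder[1])
varmland <- artobsSE.NA %>% filter(rightsHolder == top_rightsHolder)
plot_obs(varmland)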
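
To judge how much of the missing-year problem comes from that single rights holder, it helps to compute its share of the year-less records. A base R sketch on made-up data (the holder names and years are hypothetical):

```r
# Hypothetical toy data standing in for artobsSE.
obs <- data.frame(
  year = c(NA, NA, NA, 2019, 2020),
  rightsHolder = c("Holder A", "Holder A", "Holder B", "Holder A", "Holder C"),
  stringsAsFactors = FALSE
)

# Records without a year, and the rights holders behind them.
missing_year  <- obs[is.na(obs$year), ]
holder_counts <- sort(table(missing_year$rightsHolder), decreasing = TRUE)

# The most frequent rights holder among the year-less records.
top_holder <- names(holder_counts)[1]
top_rows   <- missing_year[missing_year$rightsHolder == top_holder, ]

# Share of the year-less records that come from that rights holder.
top_share <- nrow(top_rows) / nrow(missing_year)
```

A share close to 1 would suggest that the missing years trace back to one data provider rather than being spread across the whole dataset.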

2.4 Miscellanea

A quick look at who made the Norwegian citizen science observations:

recordedBy_count <- count(artsobsNO, recordedBy, sort = TRUE)
institutionCode_count <- count(artsobsNO, institutionCode, sort = TRUE)
collectionCode_count <- count(artsobsNO, collectionCode, sort = TRUE)
datasetName_count <- count(artsobsNO, datasetName, sort = TRUE)

head(recordedBy_count)
## # A tibble: 6 x 2
##   recordedBy                n
##   <chr>                 <int>
## 1 Ole-Håkon Heier        1244
## 2 Øyvind Nyvold Larsen    197
## 3 Line Selvaag            190
## 4 Roar Pettersen          189
## 5 Knut Eie                105
## 6 Øystein Engen            90
institutionCode_count
## # A tibble: 1 x 2
##   institutionCode     n
##   <chr>           <int>
## 1 nzf              4557
collectionCode_count
## # A tibble: 1 x 2
##   collectionCode     n
##   <chr>          <int>
## 1 so2-fishes      4557
datasetName_count
## # A tibble: 28 x 2
##    datasetName                                                        n
##    <chr>                                                          <int>
##  1 ""                                                              4267
##  2 "Forsvarsbygg"                                                   191
##  3 "Kartleggingsmidler Sabima"                                       21
##  4 "NØF-øvrige arter"                                                12
##  5 "Artsjakten!"                                                     10
##  6 "Gjeddtjørna"                                                      8
##  7 "Statens naturoppsyn"                                              8
##  8 "Besøkssenter våtmark Nordre Øyeren - generelle observasjoner"     7
##  9 "Viltkartlegging i våtmarksbiotopar langs Surna"                   5
## 10 "Abbor i Stavsjøen"                                                3
## # ... with 18 more rows