15 August, 2025

Packages required

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(stringr)

Introduction

This is the primary file for plantpopnet data cleaning. It is designed to run on the compiled data set (.csv) and has been tested with updated data sets.

Each error is assigned a unique identifier. Once the diagnostic identifies an error, the solution can be called from a script of the same identifier name and the error resolved before moving on to the next one. Error logs should be recorded as annotated text within this document. Error types were initially diagnosed and solved for a static test compiled data file (ca. 7000 plants). Once this script was complete and had been tested successfully on the test data on both Mac and PC platforms, a new static compiled data file (complete data from 2014 to May 2019) was used for further testing and error diagnosis.

Create an R project in a new folder and copy all files provided to this working directory, including this file.

# this has been removed for anonymisation, static files are provided with the manuscript files for review

Load in Data

The demographic census file (year 0 (Y0) only) is read in as ‘data_orig’, which is kept unchanged for reference. All changes are made to the copy, ‘mydata’. If the data are supplied as an Excel file, ensure that sheet 1 is the datasheet. NOTE: the code below reads a CSV file; using the “xlsx” package instead does not require the data file to be in CSV format. The site description file is read in as ‘sitedata’, as it is used in the diagnosis and cleaning of some variables.

data_orig <- read.csv("fullPPN_datasetY0_2020-11-18_.csv", header = TRUE, na.strings = c("NA", "", "Na", "na", "nA"))
mydata <- data_orig 

sitedata <- read.csv("Coordinates_Oct2020_site_level.csv", header = TRUE)
#str(mydata)

# convert site code, survey year and survival to factors
mydata$site_code <- as.factor(mydata$site_code)
mydata$s_year <- as.factor(mydata$s_year)
mydata$survival <- as.factor(mydata$survival)
## tidy up structure of sitedata 
str(sitedata)
## 'data.frame':    65 obs. of  10 variables:
##  $ site_code   : chr  "AC" "ACR" "AL1" "AL2" ...
##  $ native      : chr  "native" "non_native" "native" "native" ...
##  $ demographics: chr  "Y" "Y" "Y" "Y" ...
##  $ genetics    : chr  "Y" "Y" "N" "N" ...
##  $ round       : int  1 2 NA NA 1 1 1 2 NA 2 ...
##  $ country     : chr  "Spain" "USA" "Finland" "Finland" ...
##  $ location    : chr  "Archuchurria" "California" "Aland island" "Aland island" ...
##  $ region      : chr  "Europe" "Nth_America" "Europe" "Europe" ...
##  $ latitude    : num  41.3 39.7 60.2 60.1 60.7 ...
##  $ longitude   : num  1.95 -123.65 19.54 19.98 6.34 ...
sitedata$site_code <- as.factor(sitedata$site_code)

Global Fixes

Global1 - blank rows

This script defines a function that removes rows (cases) in which every variable is NA, and then applies it to the dataset. It takes ‘mydata’ as input and outputs the cleaned ‘mydata’.

source("Global1_remove_na_rows_func.R")
## Warning in rows_rem(data_orig, mydata): 0 rows containing only NA's have been
## removed
#1428 NA rows removed (Oct 2020)
#3467 NA rows removed (March 2021)
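
The contents of Global1_remove_na_rows_func.R are not shown in this document. As a rough sketch of the technique only, a function that drops all-NA rows could look like the following; the name `remove_all_na_rows` and the toy data are hypothetical and are not taken from the project scripts.

```r
# Hypothetical helper (not the project's own function):
# drop rows in which every column is NA.
remove_all_na_rows <- function(df) {
  all_na <- rowSums(!is.na(df)) == 0   # TRUE where no column holds a value
  message(sum(all_na), " rows containing only NAs have been removed")
  df[!all_na, , drop = FALSE]
}

# Toy data: the second row is entirely NA, the third is only partly NA.
toy <- data.frame(site_code = c("AC", NA, "BG"),
                  no_leaves = c(6, NA, NA),
                  stringsAsFactors = FALSE)
cleaned <- remove_all_na_rows(toy)
```

Rows that are only partly NA are kept, so genuine records with missing measurements are not lost.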

Global2 - Rename rosette_number column

The names of the separate variables ‘no_of_rosettes’ (total number of rosettes recorded for a unique plant) and ‘rosette_number’ (disambiguation of rosettes within a plant, typically not retained in consecutive years) were confusing. We rename ‘rosette_number’ to ‘rosette_ID’, but note that this ID can change from year to year as individual rosettes are not usually tagged.

names(mydata)[names(mydata) == "rosette_number"] <- "rosette_ID"

Global3 - Operators & functions used in multiple scripts

A “not in” operator is defined.

## create "not in" operator
'%nin%' = Negate('%in%')
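
As a brief illustration of how such an operator behaves, the sketch below applies it to two made-up vectors of site codes (the vectors are assumptions for the example, not project data).

```r
# Define the operator as above, then apply it to toy vectors.
`%nin%` <- Negate(`%in%`)

codes <- c("AC", "LK", "BG")
known <- c("AC", "BG", "LK1")

not_known <- codes %nin% known      # TRUE where a code is absent from 'known'
unmatched <- codes[not_known]       # the codes that have no match
```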

Variables

Code chunks should be run in the sequence below. Variable numbers correspond to column numbers in mydata. Some variables appear out of numerical order in this document because they must be cleaned before a variable that precedes them numerically.

Var1 Site code

  1. Identify empty site codes (NA) and resolve them
  2. Check that the site codes in mydata match the site description codes in sitedata
# levels(mydata$site_code)

source("var1_site_code.R")
## [1] "If nothing printed out until now, all rows had site code information (no NA's), or their site could be reattributed! Well done"
## [1] "The script will now see whether all site codes do match those listed in the summary site data."
## [1] "The script will automatically change the spelling of the codes CDF and LK to match the spelling on the summary sheet."
## [1] "If another site code appears listed here as a warning, please add a case to correct the spelling of the site code"
## Warning in eval(ei, envir): the following sites are in the demographic dataframe but 
##                  don't match the site description data:
## [1] "LK"
## Warning in eval(ei, envir): the following sites are in the site description dataframe but 
##                  don't match the sites in demographic dataframe
## [1] "JR"  "LK1"
## [1] "LK1 fixed"
## [1] "Site JR is featured in the Genomic analysis but never recorded demographic data so it is not included in mydata"
## [1] "Sites included in mydata: "
##  [1] "AC"    "ACR"   "AG"    "AL1"   "AL2"   "ARH"   "BG"    "BHU"   "BI"   
## [10] "BL"    "CDF"   "CH"    "CP"    "CPA"   "DP"    "EE"    "EL"    "GB"   
## [19] "GH"    "GU"    "HAS"   "HO"    "HR"    "HU"    "HUFZ"  "HV"    "IO"   
## [28] "JE"    "JSJ"   "KM"    "MACD"  "MN"    "NRM"   "OR_SS" "PA"    "PC"   
## [37] "PER"   "PM"    "RO"    "RO_IS" "RUSC"  "SBK"   "SC"    "SI"    "STB"  
## [46] "STR"   "SW242" "SW729" "TG"    "TJ"    "TNC"   "TNM"   "TO"    "TRU"  
## [55] "TUE"   "TW"    "TY"    "UC"    "UR"    "VA"    "WIN"   "ZG"    "ZM"   
## [64] "LK1"
# levels(mydata$site_code)

#notes: giving false warning messages about all site codes 
## fixed in var1_site_code.R

Var 2:5 GPS Coordinates.

Lat and Lon must be stored in decimal degrees. The measurements package can be used to convert other formats.
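
For reference, the arithmetic behind a conversion from degrees, minutes and seconds to decimal degrees is shown in the base R sketch below. The helper `dms_to_decimal` is hypothetical and is offered only as an illustration; it is not part of the project scripts or of the measurements package.

```r
# Hypothetical helper: degrees, minutes, seconds -> decimal degrees.
# Southern and western hemispheres are returned as negative values.
dms_to_decimal <- function(deg, min = 0, sec = 0, hemisphere = "N") {
  sign <- ifelse(hemisphere %in% c("S", "W"), -1, 1)
  sign * (abs(deg) + min / 60 + sec / 3600)
}

lat <- dms_to_decimal(41, 18, 0, "N")    # 41 degrees 18 minutes north
lon <- dms_to_decimal(123, 39, 0, "W")   # 123 degrees 39 minutes west
```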

This script contains a number of issues and checks throughout, and a lot of manual checking and fixing is required. Additional input may be needed when new sites are added: refer to the comments within the script, and run the commented-out diagnostic code if necessary.

** NOTE (20/10/2020): this code is buggy, but it is not essential for the rest of the scripts to work. See the site summary file for site-level coordinates; do not use the transect coordinates. **

Solution

Sys.setlocale('LC_ALL','C') ## needed to deal with some encoding issues (Mac to PC) if all ok print-out will read "[1] "C/C/C/C/C/en_IE.UTF-8" " . No action necessary
## [1] "C"
source("var2-5_coordinate_locations.R")
## [1] "ARH"   "CPA"   "PM"    "SBK"   "SW242" "SW729" "TG"    "TJ"    "ZM"
## Warning in eval(ei, envir): Error 1: the sites above have no transect
## coordinates - contact site coordinator. No fix applied.
## [1] "Error 3 fixed for PC"
## [1] EE  HAS
## Levels: EE HAS
## Warning in eval(ei, envir): N included in transect latitude start for the sites
## above
## Warning in eval(ei, envir): site IO fixed here, W removed from coordinates
## Warning in eval(ei, envir): Error 5 fixed: all letters and spaces removed from
## coordinates columns

## Warning in eval(ei, envir): Error 5 fixed: all letters and spaces removed from
## coordinates columns

## Warning in eval(ei, envir): Error 5 fixed: all letters and spaces removed from
## coordinates columns

## Warning in eval(ei, envir): Error 5 fixed: all letters and spaces removed from
## coordinates columns

## Warning in eval(ei, envir): Error 5 fixed: all letters and spaces removed from
## coordinates columns
## `summarise()` has grouped output by 'site_code'. You can override using the
## `.groups` argument.
## `summarise()` has grouped output by 'c_year'. You can override using the
## `.groups` argument.
## `summarise()` has grouped output by 'c_year'. You can override using the
## `.groups` argument.
## `summarise()` has grouped output by 'c_year'. You can override using the
## `.groups` argument.
## `summarise()` has grouped output by 'c_year'. You can override using the
## `.groups` argument.
## `summarise()` has grouped output by 'c_year'. You can override using the
## `.groups` argument.
## `summarise()` has grouped output by 'c_year'. You can override using the
## `.groups` argument.
## `summarise()` has grouped output by 'c_year'. You can override using the
## `.groups` argument.
## Warning in eval(ei, envir): These sites require more information before proper
## GPS fixes can be applied:
## Warning in eval(ei, envir):
## `summarise()` has grouped output by 'site_code'. You can override using the
## `.groups` argument.
## Warning in rm(tocheck2, TW_LonStp, TW_LatStr, TW, TNM, TNC, SCT2, SC, CDF, :
## object 'TW_LonStp' not found
## Warning in rm(tocheck2, TW_LonStp, TW_LatStr, TW, TNM, TNC, SCT2, SC, CDF, :
## object 'TW_LatStr' not found
## Warning in rm(tocheck2, TW_LonStp, TW_LatStr, TW, TNM, TNC, SCT2, SC, CDF, :
## object 'BG_2016_T2' not found
## Warning in rm(tocheck2, TW_LonStp, TW_LatStr, TW, TNM, TNC, SCT2, SC, CDF, :
## object 'x2' not found
## Warning in rm(tocheck2, TW_LonStp, TW_LatStr, TW, TNM, TNC, SCT2, SC, CDF, :
## object 'BG_2016_T1' not found
## Warning in rm(tocheck2, TW_LonStp, TW_LatStr, TW, TNM, TNC, SCT2, SC, CDF, :
## object 'BG_2016' not found
## Warning in rm(tocheck2, TW_LonStp, TW_LatStr, TW, TNM, TNC, SCT2, SC, CDF, :
## object 'BG' not found

Var 6_7 Transect and Plot

levels(mydata$plot)

  1. Plots a, b, c and d are from the HV site, which was surveyed using a 1m squared quadrat. We divided each quadrat into 4 plots of 50cm squared, as per the PPN protocol. These are to be left as is for now.

  2. Plot numbers are duplicated as P01-P09 and P1-P9. Remove the leading 0s for consistency.

  3. Plot numbers should range between 1 and 20. One site continues to P43. This is ignored for now, as each plot number is unique and should not cause any issues.

levels(mydata$transect)

  1. Replicate of T2 with a space.
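
The label issues listed above (leading zeros, lower-case prefixes, missing prefixes and stray spaces) can be normalised with regular expressions. The sketch below uses toy labels and assumed patterns; it is not the code in var6_7_transect_plot_numbers.R.

```r
# Toy plot labels showing the issues listed above.
plots <- c("P01", "P1", "p09", "P10", "7", "P1d")

# Strip whitespace, force an upper-case "P" prefix and drop leading zeros.
# Trailing letters (the HV a-d sub-plots) are left untouched.
plots_clean <- sub("^[pP]?0*([0-9]+)", "P\\1", trimws(plots))

# Toy transect labels: trailing space, lower case, missing prefix.
transects <- c("T2", "T2 ", "t1", "3")
transects_clean <- sub("^[tT]?", "T", trimws(transects))
```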

Solution

source("var6_7_transect_plot_numbers.R")
## Warning in eval(ei, envir): Clean up lower case t and transects without a T
## character(0)
## [1] "values for site BG are potentially solved by running var_9 plant ID script"
## Warning in eval(ei, envir): Errors. One site (HV) uses a,b,c,d (this site has 1m plots divided in 4 to make 4 50cm plots), P1-P9 and P01-P09, 
##         lower case p & values with no letter P,numbers above 20 (should only be 20 plots per transect, but will ignore for now.
## Warning in eval(ei, envir): NAs present in plot names
##  [1] "BG_T1_NA_Y0" "BG_T1_NA_Y0" "BG_T1_NA_Y0" "BG_T1_NA_Y0" "BG_T1_NA_Y0"
##  [6] "BG_T1_NA_Y0" "BG_T1_NA_Y0" "BG_T2_NA_Y0" "BG_T2_NA_Y0" "BG_T2_NA_Y0"
## [11] "BG_T2_NA_Y0" "BG_T2_NA_Y0" "BG_T2_NA_Y0" "BG_T2_NA_Y0" "BG_T2_NA_Y0"
## [16] "BG_T2_NA_Y0" "BG_T2_NA_Y0" "BG_T2_NA_Y0" "BG_T2_NA_Y0" "BG_T2_NA_Y0"
## [21] "BG_T2_NA_Y0" "BG_T2_NA_Y0" "BG_T2_NA_Y0" "BG_T2_NA_Y0" "BG_T2_NA_Y0"
## [26] "BG_T2_NA_Y0" "BG_T2_NA_Y0"
## [1] "values for site BG are potentially solved by running var_9 plant ID script"
## `summarise()` has grouped output by 'site_code'. You can override using the
## `.groups` argument.
## [1] "transect and plot are fixed. YAY!"

Var 8 No_seedlings

The variable is a factor containing the text ‘na’; it needs to be numerical.

Solution

source("var8_number_seedling.R")
## [1] "The number of seedling for the following plots was set to the maximal annotated value for... "
## [1] "RO_IS_T1_P1_Y0" "RO_IS_T1_P2_Y0" "RO_IS_T1_P3_Y0" "RO_IS_T2_P1_Y0"
## [1] "MACD_T1_P14_Y0"
## [1] "MACD_T1_P19_Y0"
## [1] "PC_T1_P19_Y0"
## [1] "PC_T1_P18_Y0"
## [1] "PC_T1_P11_Y0"
## [1] "RUSC_T1_P2_Y0"
## [1] "STB_T1_P2_Y0"
## [1] "STB_T1_P4_Y0"
## [1] "STB_T1_P5_Y0"
## [1] "STB_T3_P3_Y0"
## [1] "STB_T5_P1_Y0"
## [1] "STB_T5_P3_Y0"
## [1] "STB_T5_P4_Y0"
## [1] "UR_T1_P10_Y0"
## [1] "UR_T3_P4_Y0"
## [1] "This is not a warning, the plots are flagged only for transparancy. The error was dealt with"
## [1] "Some number of seedlings might be outliers; they are not removed or changed in anyway, but flagged on the following plot"
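
Converting a factor that contains the text ‘na’ into numbers needs an intermediate character step, because `as.numeric()` applied directly to a factor returns the internal level codes rather than the recorded values. A minimal sketch with toy values (not the contents of var8_number_seedling.R):

```r
# Toy seedling counts stored as a factor, including the text "na".
seedlings <- factor(c("13", "na", "2", NA, "23"))

# Go through character, set textual "na" to a real NA, then convert.
seedlings_chr <- as.character(seedlings)
seedlings_chr[tolower(seedlings_chr) %in% "na"] <- NA
number_seedlings <- as.numeric(seedlings_chr)
```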

Var 9 Plant ID

Fixes issues with missing and non-unique plant IDs, and creates a new variable called plant_unique_id.

source("var9_PlantID.R")
## [1] "Number of Plant IDs that have NAs in them:  57"
## [1] "In BG_T2_P8_Y0 lines were deleted as they only record extra floral stems."
## [1] "Plot  PC_T1_P6_Y0 , These plots have no plant IDs and only contain seedlings. And replicate instances of these rows have been removed where they exist"
## [1] "The lines above were dealt with, only printed for transparancy."
## [1] "If a new set of years is added, please check that the following lines only contain information that you are happy to see in the final data set (no stems with no plant id, for instance). if they are ok, do nothing. Else, worry"
##      site_code transect_Lat_start transect_Lon_start transect_Lat_stop
## 1026        BG           61.44803           7.480639          61.44807
## 5924        PC           38.51770        -121.764322          38.51768
##      transect_Lon_stop transect plot number_seedlings plant_id x_coord y_coord
## 1026          7.480614       T2   P8                2     <NA>    <NA>    <NA>
## 5924       -121.764265       T1   P6               23     <NA>    <NA>    <NA>
##      suspected_clone survival no_rosettes rosette_ID no_leaves leaf_length
## 1026            <NA>     <NA>        <NA>         NA        NA          NA
## 5924            <NA>     <NA>        <NA>         NA        NA          NA
##      leaf_width no_fl_stems fl_stem_height inflor_length inflor_phenology
## 1026         NA        <NA>           <NA>          <NA>             <NA>
## 5924         NA        <NA>           <NA>          <NA>             <NA>
##      disease..yes.no. disease_comments herbivory..yes.no. herbivory_comments
## 1026             <NA>             <NA>               <NA>               <NA>
## 5924             <NA>             <NA>               <NA>               <NA>
##      other_comments c_year s_year unique_plot_id plant_unique_id
## 1026           <NA>   2015     Y0    BG_T2_P8_Y0     BG_T2_P8_NA
## 5924           <NA>   2016     Y0    PC_T1_P6_Y0     PC_T1_P6_NA
## [1] "Number of Plant IDs that have NAs in them:  2"
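
Judging from the values printed above (for example BG_T2_P8_NA), plant_unique_id appears to join site, transect, plot and plant ID with underscores. The sketch below builds such an ID on toy rows and lists the rows that need attention; it is an assumption about the format, not the code in var9_PlantID.R.

```r
# Toy rows; the first has no plant ID.
toy <- data.frame(site_code = c("BG", "PC", "AC"),
                  transect  = c("T2", "T1", "T1"),
                  plot      = c("P8", "P6", "P1"),
                  plant_id  = c(NA, "12", "3"),
                  stringsAsFactors = FALSE)

# paste() turns a missing plant ID into the text "NA".
toy$plant_unique_id <- paste(toy$site_code, toy$transect,
                             toy$plot, toy$plant_id, sep = "_")

missing_id <- toy[is.na(toy$plant_id), ]                          # rows to inspect
dup_ids    <- toy$plant_unique_id[duplicated(toy$plant_unique_id)] # non-unique IDs
```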

Go back and run Transect & Plot

Run the Transect & Plot script again to fix problems with NAs in the BG transect/plot names.

source("var6_7_transect_plot_numbers.R")
## Warning in eval(ei, envir): Clean up lower case t and transects without a T
## character(0)
## [1] "values for site BG are potentially solved by running var_9 plant ID script"
## Warning in eval(ei, envir): Errors. One site (HV) uses a,b,c,d (this site has 1m plots divided in 4 to make 4 50cm plots), P1-P9 and P01-P09, 
##         lower case p & values with no letter P,numbers above 20 (should only be 20 plots per transect, but will ignore for now.
## character(0)
## [1] "values for site BG are potentially solved by running var_9 plant ID script"
## `summarise()` has grouped output by 'site_code'. You can override using the
## `.groups` argument.
## [1] "transect and plot are fixed. YAY!"

Var 10 & 11 Within plot Coordinates (x,y)

The variable is a factor; it should be numerical and between 0 and 50.

##levels(mydata$x_coord)
##levels(mydata$y_coord)

Solution

source("var10_11_xy_coords.R")
## Warning in eval(ei, envir): NAs introduced by coercion

## Warning in eval(ei, envir): NAs introduced by coercion
######### Note Values still range above 50
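
The two checks involved here (values that fail numeric conversion, and values outside the 0 to 50 range) can be sketched as follows. The toy values are assumptions, and var10_11_xy_coords.R may handle these cases differently.

```r
# Toy within-plot x coordinates as recorded (text).
x_raw <- c("30", "19", "0", "55", "12?", NA)

# Coerce to numeric; entries that cannot be parsed become NA.
x_num <- suppressWarnings(as.numeric(x_raw))

# Entries that held text but failed conversion (worth keeping as notes).
failed <- !is.na(x_raw) & is.na(x_num)

# Numeric entries outside the expected 0 to 50 range.
out_of_range <- !is.na(x_num) & (x_num < 0 | x_num > 50)
```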

Var 12 suspected_clone

Tasks accomplished in this script:

  1. suspected_clone_binary column created
  2. suspected_clone column renamed and preserved as suspected_clone_notes

Solution

source("var12_suspected_clone_binary.R")
## character(0)
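
A minimal sketch of these two steps on toy data is given below. The rule that any entry beginning with "y" counts as "yes" is an assumption made for the example; the rules in var12_suspected_clone_binary.R are not shown in this document.

```r
# Toy free-text entries for suspected_clone.
toy <- data.frame(suspected_clone = c("no", "yes, clone of 12", NA, "Y", "n"),
                  stringsAsFactors = FALSE)

# Preserve the original text under a new column name.
names(toy)[names(toy) == "suspected_clone"] <- "suspected_clone_notes"

# Assumed rule: entries starting with "y" are "yes", other text is "no",
# and missing entries stay missing.
notes <- tolower(trimws(toy$suspected_clone_notes))
toy$suspected_clone_binary <- ifelse(is.na(notes), NA_character_,
                                     ifelse(grepl("^y", notes), "yes", "no"))
```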

Var 28_29 c_year & s_year

This script must be run before “survival”. Any errors found will be printed with a warning. The script is diagnostic only and does not make changes to the source data.

source("var28_29_c_year_and_s_year.R")

Var 13 survival

Ensuring the survival column is binary (yes/no)

source("var13_survival.R")
## Warning in `[<-.factor`(`*tmp*`, grep("([yY])", survival_fix, ignore.case =
## TRUE), : invalid factor level, NA generated
## Warning in `[<-.factor`(`*tmp*`, grep("1", survival_fix, ignore.case = TRUE), :
## invalid factor level, NA generated
## Warning in `[<-.factor`(`*tmp*`, grep("survive", survival_fix, ignore.case =
## TRUE), : invalid factor level, NA generated
## Warning in `[<-.factor`(`*tmp*`, grep("([c])", survival_fix, ignore.case =
## TRUE), : invalid factor level, NA generated
## Warning in `[<-.factor`(`*tmp*`, grep("revived", survival_fix, ignore.case =
## TRUE), : invalid factor level, NA generated
## Warning in `[<-.factor`(`*tmp*`, grep("([n])", survival_fix, ignore.case =
## TRUE), : invalid factor level, NA generated
## Warning in `[<-.factor`(`*tmp*`, grep("dead", survival_fix, ignore.case =
## TRUE), : invalid factor level, NA generated
## Warning in `[<-.factor`(`*tmp*`, grep("died", survival_fix, ignore.case =
## TRUE), : invalid factor level, NA generated
## Warning in `[<-.factor`(`*tmp*`, grep("0", survival_fix, ignore.case = TRUE), :
## invalid factor level, NA generated
## Warning in `[<-.factor`(`*tmp*`, grep("([?])", survival_fix, ignore.case =
## TRUE), : invalid factor level, NA generated
## Warning in `[<-.factor`(`*tmp*`, grep("([tamp])", survival_fix, ignore.case =
## TRUE), : invalid factor level, NA generated
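
The "invalid factor level, NA generated" warnings above are what R emits when a value that is not an existing level is assigned into a factor; the assigned elements then become NA. One way to avoid this, sketched below on toy values, is to recode a character copy and only then convert to a factor. The accepted spellings are assumptions and this is not the code in var13_survival.R.

```r
# Toy survival entries in mixed formats.
survival <- factor(c("Y", "1", "survived", "n", "dead", "0", "?", NA))

# Work on a lower-case character copy rather than on the factor itself.
survival_chr <- tolower(trimws(as.character(survival)))

# Recode recognised entries; anything else stays NA.
survival_fix <- rep(NA_character_, length(survival_chr))
survival_fix[survival_chr %in% c("y", "yes", "1", "survived")] <- "yes"
survival_fix[survival_chr %in% c("n", "no", "0", "dead", "died")] <- "no"

# Convert to a factor once the values are final.
survival_fix <- factor(survival_fix, levels = c("no", "yes"))
```

Exact matching with `%in%` is used here rather than pattern matching, so that a value is recoded only when the whole entry is recognised.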

Var 14 No. of rosettes

Checking the no_rosettes column, which should contain numeric values.

source("var14_no_rosettes.r")  
## 106 plants have NA in no_rosettes  new plants are assumed to have 1 rosette, this is fixed'no plant' has been changed to NA as it contains no useful informationAll 'not found' & 'removed' comments in no_rosettes changed to NA. Where no_leaves is NA for 'not found' or 'removed' rosettes survival has been changed to no
## Warning in eval(ei, envir): incorrect number of rosettes
## Warning in eval(ei, envir): incorrect number of rosettes for
## 104 unique plants[1] "Sites HV 2014, TO 2015, AC 2015 have all recorded multiple no_rosettes but only measures 1 rosette. These account for 43 of the mismatches listed in chex2"
## Warning in eval(ei, envir): mismatches still present, recheck chex2,
## 61 individuals need to be corrected in the master data[1] "site MACD has measured every flowering stem and recorded as duplicated rows. These will be cleaned below"
## Warning in eval(ei, envir): additional
## -214 rows have been removed. Consult 'Chec' object to ensure these have not mistakenly been removed

SURVIVAL SCRIPT NEEDS TO BE RE-RUN
The plant survival script needs to be re-run to ensure that the correct values are recorded in plant_survival, because tidying no_rosettes changes some survival values.

source("var13_survival.R")
## Warning in `[<-.factor`(`*tmp*`, grep("([yY])", survival_fix, ignore.case =
## TRUE), : invalid factor level, NA generated
## Warning in `[<-.factor`(`*tmp*`, grep("1", survival_fix, ignore.case = TRUE), :
## invalid factor level, NA generated
## Warning in `[<-.factor`(`*tmp*`, grep("survive", survival_fix, ignore.case =
## TRUE), : invalid factor level, NA generated
## Warning in `[<-.factor`(`*tmp*`, grep("([c])", survival_fix, ignore.case =
## TRUE), : invalid factor level, NA generated
## Warning in `[<-.factor`(`*tmp*`, grep("revived", survival_fix, ignore.case =
## TRUE), : invalid factor level, NA generated
## Warning in `[<-.factor`(`*tmp*`, grep("([n])", survival_fix, ignore.case =
## TRUE), : invalid factor level, NA generated
## Warning in `[<-.factor`(`*tmp*`, grep("dead", survival_fix, ignore.case =
## TRUE), : invalid factor level, NA generated
## Warning in `[<-.factor`(`*tmp*`, grep("died", survival_fix, ignore.case =
## TRUE), : invalid factor level, NA generated
## Warning in `[<-.factor`(`*tmp*`, grep("0", survival_fix, ignore.case = TRUE), :
## invalid factor level, NA generated
## Warning in `[<-.factor`(`*tmp*`, grep("([?])", survival_fix, ignore.case =
## TRUE), : invalid factor level, NA generated
## Warning in `[<-.factor`(`*tmp*`, grep("([tamp])", survival_fix, ignore.case =
## TRUE), : invalid factor level, NA generated

Var 16:21 Outlier detection

Searching for outliers in the numerical columns.

source("var16-21_outlier_detection.R")
## tibble [9,557 x 36] (S3: tbl_df/tbl/data.frame)
##  $ site_code             : Factor w/ 64 levels "AC","ACR","AG",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ transect_Lat_start    : num [1:9557] 41.3 41.3 41.3 41.3 41.3 ...
##  $ transect_Lon_start    : num [1:9557] 1.95 1.95 1.95 1.95 1.95 ...
##  $ transect_Lat_stop     : num [1:9557] 41.3 41.3 41.3 41.3 41.3 ...
##  $ transect_Lon_stop     : num [1:9557] 1.95 1.95 1.95 1.95 1.95 ...
##  $ transect              : chr [1:9557] "T1" "T1" "T1" "T1" ...
##   ..- attr(*, "levels")= chr(0) 
##  $ plot                  : Factor w/ 59 levels "P1","P10","P11",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ number_seedlings      : num [1:9557] 13 13 13 13 13 13 13 13 13 13 ...
##  $ plant_id              : chr [1:9557] "1" "2" "3" "4" ...
##  $ x_coord               : num [1:9557] 30 19 0 31 35 45 14 8 18 25 ...
##  $ y_coord               : num [1:9557] 8 40 24 32 34 46 30 32 12 12 ...
##  $ suspected_clone_binary: chr [1:9557] "no" "no" "no" "no" ...
##  $ survival              : Factor w/ 0 levels: NA NA NA NA NA NA NA NA NA NA ...
##  $ no_rosettes           : num [1:9557] 1 7 3 1 1 1 1 1 1 1 ...
##  $ rosette_ID            : int [1:9557] 1 1 1 1 1 1 1 1 1 1 ...
##  $ no_leaves             : int [1:9557] 6 17 19 3 3 5 2 2 5 1 ...
##  $ leaf_length           : num [1:9557] 103 38 96 72 82 90 65 65 80 71 ...
##  $ leaf_width            : num [1:9557] 9 4 7 5 7 6 3 4 4 3 ...
##  $ no_fl_stems           : chr [1:9557] "3" "10" "9" "0" ...
##  $ fl_stem_height        : chr [1:9557] "205" "222" "177" NA ...
##  $ inflor_length         : chr [1:9557] "5" "13" "12" NA ...
##  $ inflor_phenology      : chr [1:9557] "seeds dispersed" "seeds dispersed" "seeds dispersed" NA ...
##  $ disease..yes.no.      : chr [1:9557] NA NA NA NA ...
##  $ disease_comments      : chr [1:9557] NA NA NA NA ...
##  $ herbivory..yes.no.    : chr [1:9557] NA NA NA NA ...
##  $ herbivory_comments    : chr [1:9557] "0" "0" "0" "no" ...
##  $ other_comments        : chr [1:9557] NA "0" "0" "0" ...
##  $ c_year                : int [1:9557] 2015 2015 2015 2015 2015 2015 2015 2015 2015 2015 ...
##  $ s_year                : Factor w/ 1 level "Y0": 1 1 1 1 1 1 1 1 1 1 ...
##  $ unique_plot_id        : chr [1:9557] "AC_T1_P1_Y0" "AC_T1_P1_Y0" "AC_T1_P1_Y0" "AC_T1_P1_Y0" ...
##  $ plant_unique_id       : Factor w/ 7153 levels "ACR_T1_P10_151",..: 169 174 175 176 177 178 179 180 181 170 ...
##  $ x_coord_notes         : chr [1:9557] "30" "19" "0" "31" ...
##  $ y_coord_notes         : chr [1:9557] "8" "40" "24" "32" ...
##  $ suspected_clone_notes : chr [1:9557] "no" "no" "no" "no" ...
##  $ plant_survival        : Factor w/ 0 levels: NA NA NA NA NA NA NA NA NA NA ...
##  $ plant_survival2       : int [1:9557] NA NA NA NA NA NA NA NA NA NA ...
## [1] "no_leaves"
## character(0)
## [1] "leaf_length"
## character(0)
## [1] "leaf_width"
## character(0)
## [1] "no_fl_stems"
## [1] "no" "no" "no"
## [1] "fl_stem_height"
##  [1] "grazed"  "damaged" "damaged" "damaged" "damaged" "damaged" "see"    
##  [8] "see"     "see"     "see"     "see"     "see"     "see"     "see"    
## [15] "see"     "see"     "see"     "see"    
## [1] "inflor_length"
##  [1] "inflorescence" "inflorescence" "inflorescence" "inflorescence"
##  [5] "inflorescence" "inflorescence" "MISSING"       "MISSING"      
##  [9] "MISSING"       "MISSING"       "MISSING"       "MISSING"      
## [13] "MISSING"       "MISSING"       "MISSING"       "MISSING"      
## [17] "inflorescence" "inflorescence" "inflorescence" "inflorescence"
## [21] "inflorescence" "damaged"       "damaged"       "damaged"      
## [25] "damaged"       "damaged"       "see"           "broken"
## Warning in eval(ei, envir): Non numeric values present
## [1] "no_leaves"
## character(0)
## [1] "leaf_length"
## character(0)
## [1] "leaf_width"
## character(0)
## [1] "no_fl_stems"
## character(0)
## [1] "fl_stem_height"
## character(0)
## [1] "inflor_length"
## character(0)
## Warning in FUN(newX[, i], ...): NAs introduced by coercion
## Warning in FUN(newX[, i], ...): NAs introduced by coercion
## tibble [9,557 x 6] (S3: tbl_df/tbl/data.frame)
##  $ no_leaves     : num [1:9557] 6 17 19 3 3 5 2 2 5 1 ...
##  $ leaf_length   : num [1:9557] 103 38 96 72 82 90 65 65 80 71 ...
##  $ leaf_width    : num [1:9557] 9 4 7 5 7 6 3 4 4 3 ...
##  $ no_fl_stems   : num [1:9557] 3 10 9 0 0 0 0 0 1 0 ...
##  $ fl_stem_height: num [1:9557] 205 222 177 NA NA NA NA NA 122 NA ...
##  $ inflor_length : num [1:9557] 5 13 12 NA NA NA NA NA 8 NA ...
## [1] "no_leaves"
## character(0)
## [1] "leaf_length"
## character(0)
## [1] "leaf_width"
## character(0)
## [1] "no_fl_stems"
## character(0)
## [1] "fl_stem_height"
## character(0)
## [1] "inflor_length"
## character(0)
## Warning: Using one column matrices in `filter()` was deprecated in dplyr 1.1.0.
## i Please use one dimensional logical vectors instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## Warning in rm(x1, x2, x3, x4, x5, Y0, Y1, Y2, outliers, x, y, infl,
## infl_height, : object 'Y1' not found
## Warning in rm(x1, x2, x3, x4, x5, Y0, Y1, Y2, outliers, x, y, infl,
## infl_height, : object 'Y2' not found
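
The output above shows two stages: listing non-numeric entries, then examining the numeric values. The sketch below illustrates both on a toy column. The 1.5 x IQR rule is an illustrative choice only, since the criterion used in var16-21_outlier_detection.R is not stated in this document.

```r
# Toy flowering stem heights, recorded as text with some comments mixed in.
fl_stem_height <- c("205", "222", "grazed", NA, "177", "damaged", "2500")

# Stage 1: list entries that are present but not numeric.
as_num <- suppressWarnings(as.numeric(fl_stem_height))
non_numeric <- fl_stem_height[!is.na(fl_stem_height) & is.na(as_num)]

# Stage 2: flag values beyond 1.5 x IQR from the quartiles (assumed rule).
q <- unname(quantile(as_num, c(0.25, 0.75), na.rm = TRUE))
fence <- 1.5 * (q[2] - q[1])
flagged <- !is.na(as_num) & (as_num < q[1] - fence | as_num > q[2] + fence)
```

Flagged values are only marked for inspection here; nothing is removed or altered.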

Var 23_24 Disease_yes_no

Ensuring binary yes/no values in the column.

Solution

source("var23_disease_yes_no.R")
## factor()
levels(mydata$disease..yes.no.)
## NULL

Var 25_26 Herbivory

source("var25_herbivory_yes_no.R")
## factor()
levels(mydata$herbivory..yes.no.)
## NULL

Var 15 Rosette ID

Site AC recorded only 1 rosette for multi-rosette plants. The no_leaves column is the total number for the whole plant, not just for the rosette from which leaf and inflorescence measurements were taken. The script below only diagnoses issues with the number of rosettes and rosette ID; no changes are made to the data.

source("var15_rosette_id.r")  
## Warning in eval(ei, envir): NA's for
## 117 lines of data
## Warning in eval(ei, envir): incorrect rosette ID's for
## AC_T1_P1_2 0 AC_T1_P1_3 0 AC_T1_P10_62 0 AC_T1_P11_66 0 AC_T1_P14_73 0 AC_T1_P14_74 0 AC_T1_P15_79 0 AC_T1_P2_16 0 AC_T1_P2_18 0 AC_T1_P3_46 0 AC_T1_P7_61 0 AC_T2_P2_87 0 AC_T2_P2_96 0 AC_T2_P3_97 0 HUFZ_T1_P2_25 0 HV_T1_P1d_27 0 HV_T1_P2b_29 0 HV_T1_P3b_42 0 HV_T1_P5d_82 0 HV_T1_P5c_87 0 HV_T1_P6b_94 0 HV_T1_P6b_97 0 HV_T1_P6b_99 0 HV_T1_P6b_100 0 HV_T1_P6b_101 0 HV_T1_P6c_102 0 HV_T1_P6c_105 0 HV_T1_P6c_110 0 JE_T1_P1_8 0 JE_T1_P1_8 0 PC_T1_P19_5 0 PC_T1_P19_5 0 PC_T1_P8_66 0 PC_T1_P8_99 0 PC_T1_P8_99 0 PC_T1_P8_99 0 PC_T1_P8_99 0 PC_T1_P8_99 0 PC_T1_P8_111 0 PC_T1_P8_111 0 PC_T1_P8_111 0 PC_T1_P11_127 0 PC_T1_P11_128 0 PER_T1_P1_13 0 SW729_T1_P1_11 0 SW729_T1_P6_26 0 TO_T1_P2_54 0 TO_T1_P2_73 0 TO_T1_P2_91 0 TO_T1_P3_54 0 TO_T1_P3_73 0 TO_T1_P3_91 0 TO_T1_P3_109 0 TO_T1_P3_116 0 TO_T1_P3_167 0 TO_T1_P4_109 0 TO_T1_P4_116 0 TO_T1_P4_130 0 TO_T1_P4_158 0 TO_T1_P4_167 0 TO_T1_P5_130 0 TO_T1_P5_158 0 TRU_T1_P11_67 0 UC_T1_P2_10 0 UC_T1_P2_10 0 UC_T1_P2_10 0 UC_T1_P2_10 0 UC_T1_P2_10 0
## Warning in rm(check_rosette_ID): object 'check_rosette_ID' not found
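
One possible diagnostic of this kind compares the recorded number of rosettes with the number of rosette rows actually present for each plant. The sketch below shows the idea on toy data; it is an assumption about what such a check might look like and does not reproduce var15_rosette_id.r.

```r
# Toy data: one row per measured rosette.
toy <- data.frame(
  plant_unique_id = c("AC_T1_P1_2", "AC_T1_P1_2", "AC_T1_P1_3", "HV_T1_P1d_27"),
  no_rosettes     = c(2, 2, 3, 1),
  rosette_ID      = c(1, 2, 1, 1),
  stringsAsFactors = FALSE
)

# Rows present per plant versus the recorded number of rosettes.
rows_per_plant <- table(toy$plant_unique_id)
recorded <- tapply(toy$no_rosettes, toy$plant_unique_id, max)

# Plants where the two disagree (diagnosis only, no data are changed).
plants <- names(recorded)
mismatch <- plants[as.vector(recorded) != as.vector(rows_per_plant[plants])]
```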

Production of Y0 data product

mydataY0 <- subset(mydata, s_year == "Y0")

Export Y0

output_name <- paste("PLANTPOPNET_Y0_V1.4",Sys.Date(), ".csv",  sep = "_") 

write.csv(mydataY0, output_name, row.names = FALSE)
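
After exporting, it can be worth confirming that the file reads back as expected. The sketch below does this with a toy data frame written to a temporary directory; the use of `paste0()` for the file name and the toy data are assumptions made for the example.

```r
# Build a dated file name in a temporary directory (example only).
output_name <- file.path(tempdir(),
                         paste0("PLANTPOPNET_Y0_V1.4_", Sys.Date(), ".csv"))

# Toy stand-in for the Y0 data product.
toy <- data.frame(site_code = c("AC", "BG"),
                  s_year    = c("Y0", "Y0"),
                  stringsAsFactors = FALSE)

# Write, then read back and compare dimensions.
write.csv(toy, output_name, row.names = FALSE)
check <- read.csv(output_name)
same_shape <- identical(dim(check), dim(toy))
```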