This package provides access to water quality and hydrological data from the dbhydro database provided by the South Florida Water Management District. Although the package has a regional focus, the database could be used to address more general ecological or environmental management questions. Having a package to more easily acquire the database will no doubt facilitate use and reach of the data.

The package is lightweight and is built around two main functions. The documentation and examples could be improved in some cases (see below), but overall I do not have any major concerns about package structure or functionality. However, access to metadata for the database is a serious limitation that needs to be addressed. Sure I could go online to find what I need but it seems like the package should facilitate that process in some way. The package does a good job retrieving data if the station id and test name are known but it would be much more helpful if there was some way to query the available information. Here are some options to consider:

  • The vignette has an example of retrieving data using a wild card for the station id. Some more examples like this would be very helpful.

** I believe that wild-cards only work with station-id(s). Are you thinking that test-names would also be exposed to wildcard selection?**

  • Including more detailed metadata in the appendix of the vignette, e.g., stations, date ranges, etc.

I added a link to the available test-name listing in the getwq docs. I think there may be too many stations to include a managable listing in the vignette but see response to the next comment.

  • Including more functions to return and search metadata. My SWMPr package has similar functions (e.g., SWMPRr::site_codes or SWMPr::site_codes_ind) but I have the advantage of much fewer stations than those in dbhydro, so I’m not sure the best approach.

Thank you for the suggestion. I am working on wrapping the station-search utility at http://my.sfwmd.gov/dbhydroplsql/water_quality_data.show_group_station_characters. For now, I tried to make map-based selection via the ArcGIS online or Google Earth interface more prominent in the vignette and README.

https://github.com/SFWMD/dbhydroR/commit/11c57f3a4f31b3804244ab269926798713f8aaa6

Build/install

  • I was having problems building the vignette when running devtools::check(). The file dbhydroR.tex was not compiling and returning an error “support pakage l3kernel too old”. Updating the package in my LaTeX distribution fixed the issue. I’m guessing you didn’t have this issue since the package is already on CRAN but I wanted to mention it in case the problem occurs in the future. Otherwise, the devtools build on my Windows machine and the CRAN Windows builder (devtools::build_win()) passed with no notes, warnings, or errors.

Examples

  • The following error was returned running devtools::run_examples(pkg = ‘.’, run = FALSE)):

cleanhydro(gethydro(dbkey = “15081”, date_min = “2013-01-01”, date_max = “2013-02-02”))
Error: is.response(x) is not TRUE

I ran the check to include examples in \dontrun tags, as for the above. It makes sense to not run the example if it’s supposed to fail but there is no description exaplaining what the example shows. Either fix the example above or provide some explanation as to why it fails. All other examples ran without issues.

The example has been fixed and now runs without error

https://github.com/SFWMD/dbhydroR/commit/ca676ef38c598ac6038dcd4217df15745ef67260 https://github.com/SFWMD/dbhydroR/commit/86436e321bce70501a1aa04e5f5e0a3b890b3940

Documentation/vignette

  • Maybe you should include some details (@details) for getwq and gethydro. The documentation for each is minimal, so perhaps add some of the text in the vignette to the help files for each function.

Great suggestion! More detail has been added to the docs for getwq and gethydro.

https://github.com/SFWMD/dbhydroR/commit/8b67b622a96df66f4bab968f0cf17787de7792ce

  • As noted above under documentation requirements, the README does not include a ‘statement of need clearly stating problems the software is designed to solve and its target audience’. The text from the onboarding issue above would be a good addition, i.e., ‘dbhydroR provides scripted access to the South Florida Water Management District’s DBHYDRO database which holds over 35 million water quality and hydrologic data records from the Florida Everglades and surrounding areas. The target audience is anyone interested in water quality and hydrologic data from the Florida Everglades - ecologists, engineers, meteorologists, hydrogeologists, hydrologists, etc.’

This has been added to the README and vignette.

https://github.com/SFWMD/dbhydroR/commit/6a277de115df38dd4ff214f2f2b7e5a81ccfb272 https://github.com/SFWMD/dbhydroR/commit/7a7d4142192d62a55d6757dd5510889b21055f28

  • For the vignette (or even on the README), I think it would be useful to provide a more user-friendly view of the available stations for getwq. The link in the vignette is to a KMZ file, which is not easily viewed. I tried installing the KML/KMZ Viewer for Chrome but it was blocked on my goverment computer, so I suspect others would have problems. Another option is to include an attachment to the vignette with a table or map showing the stations.

Links to the equivalent ArcGIS Online station map have been added to the vignette and README, and function docs. Hopefully this will make it more generally accessible.

https://github.com/SFWMD/dbhydroR/commit/11c57f3a4f31b3804244ab269926798713f8aaa6 https://github.com/SFWMD/dbhydroR/commit/3750b847e029373a2c8a2d8f5d9d3863e3a2bed0 https://github.com/SFWMD/dbhydroR/commit/e66be8f4773e50f7f29505895e785fa80c42de54

Functions

  • I’m wondering if cleanhydro and cleanwq should be exported since they are only used within getwq and gethydro. I say this because the documentation is sparse (no details, no returned value description, minimal examples). Maybe it’s better not to export. Would a user ever have a separate need for these functions?

I think it is best to have these functions exported. I envision a use-case where dplyr/SQL-like data would be preferable to the default format. Exposing these functions would allow for the ability to switch to non-dplyr/SQL-like data without re-pulling the data. Also, I added more detail to the clean* docs and refactored cleanhydro to be more consistent with cleanwq

https://github.com/SFWMD/dbhydroR/commit/86436e321bce70501a1aa04e5f5e0a3b890b3940

  • For cleanwq, the vignette says that rows with missing data are removed. What happens with these rows if multiple stations or ‘test_names’ are queried? Is the entire row removed even if data are available for the other columns? I think some clarification is needed.

Hmm, possibly you are referring to the “QA blanks” discussion? This refers not to missing data but to analytical samples run to check for equipment/process contamination. I have added text to the vignette to this effect.

https://github.com/SFWMD/dbhydroR/commit/5d4fe25aec7b4c0b5aaf4ecffc0832196d57ca9a

Contributing

  • Might be worth including a contributing.md file on the repo, or something more minimal in the README. Maybe modify this for your needs?

Compliance with rOpenSci Packaging Guide

My comments above address most of the compliance concerns for rOpenSci. Below are some minor, additional points.

Function variable naming

I guess the CRAN gatekeepers don’t care about this but I was having a mental block with getwq given it’s similarity to getwd. You might consider changing the function name to make it more distinct.

README

The ropensci footer will have to be added eventually: ropensci_footer

Code of conduct

Also consider adding the code of conduct badge with devtools::use_code_of_conduct.

A COC has been added.

https://github.com/SFWMD/dbhydroR/commit/0758c0cc3763f22cfcec2090ca3ed784699e870c

Hopefully these comments are helpful. Let me know if you have any questions.

Review Comments (@aappling-usgs)

This package provides scriptable access to an extensive Florida water quality/quantity database, DBHYDRO. It’s also noticeably faster than the browser-based interface. dbhydroR will enhance the usability of the database and the reproducibility of analyses of DBHYDRO data.

I agree with @fawda123 that the largest room for improvement is in the documentation. I have also made suggestions below for increasing the flexibility of the data-cleaning functions to serve several common use cases, and I’ve pointed out a few edge cases where the package’s functionality might surprise a user.

Documentation

As @fawda123 noted, it can be hard to identify the station IDs and tests you want if you don’t know what the options are. There are a few ways you could helper users:

  • You have several great links in the vignette that could be ported to the corresponding help files.

Great suggestion. I ported these links and added links to the ArcGIS Online Station Map.

https://github.com/SFWMD/dbhydroR/commit/8b67b622a96df66f4bab968f0cf17787de7792ce

  • The Google Earth kmz file is excellent for interactively identifying sites and timeseries. However, http://my.sfwmd.gov/KMLEXT/CUSTOMKMLS/DBHydro/DBHydroKML/DBHYDRO_KML.kmz doesn’t open a browser window as a user might expect - maybe include some suggestions for how to best open the file, like on p. 56 of the DBHYDRO manual.
  • If you’re feeling ambitious, functions that return lists of options (sites, tests) would enhance users’ ability to search sites programatically. I also see a simplicity/maintainability argument for avoiding this route.

I am working on wrapping the station-search utility at http://my.sfwmd.gov/dbhydroplsql/water_quality_data.show_group_station_characters. For now, I tried to make map-based selection via the ArcGIS online or Google Earth interface more prominent in the vignette and README.

Questions that arose on my first encounter with the package/dataset:

  • What does a dbkey represent? A site, a timeseries, …? A dbkey represents a unique site x variable time series. I added some more info about dbkeys to the gethydro docs.

https://github.com/SFWMD/dbhydroR/commit/8b67b622a96df66f4bab968f0cf17787de7792ce

  • How should I cite DBHYDRO and/or the dbhydroR package?

A notice about citation() has been added to the README

https://github.com/SFWMD/dbhydroR/commit/bbf0246398ef1446feca5be4fde0e98126e39b7b

Other documentation thoughts

  • The description of “cleaned output” in the vignette could also be copied right over to the help files for getwd and gethydro (and their internal cleaning functions if those continue to be exported).

I expanded the docs for the get* and clean* functions to describe what goes on in the respective functions. I also refactored cleanhydro to be more consistent with cleanwq (they both operate on data.frame objects now).

https://github.com/SFWMD/dbhydroR/commit/8b67b622a96df66f4bab968f0cf17787de7792ce https://github.com/SFWMD/dbhydroR/commit/86436e321bce70501a1aa04e5f5e0a3b890b3940

  • Contribution guidelines could be added to the README or a CONTRIBUTING file.

The contributing guidelines of the other rOpenSci repos appear to be fairly specific to rOpenSci. I wonder if I should hold off on this until we are closer to acceptance?

Build/install

  • I had no trouble installing/building the package from CRAN, GitHub, or a local clone of the repository.
  • The GitHub repository name is dbhydro while the package name is dbhydroR. You might consider switching to the same name in both places to make it ever-so-slightly easier to find the package on GitHub. (E.g., my first try was devtools::install_github(‘SFWMD/dbhydroR’) because I knew the package name and only glanced at the URL.)

The repo name has been switched to match the package name.

Examples/tests

  • I encountered the same error as @fawda123 on running the example for cleanhydro:

devtools::run_examples(run=FALSE)

cleanhydro(gethydro(dbkey = “15081”, date_min = “2013-01-01”, date_max = “2013-02-02”))
Error: is.response(x) is not TRUE

  • devtools::check() runs cleanly with 0 errors, 0 warnings, and 0 notes.

The example has been fixed and now runs without error

https://github.com/SFWMD/dbhydroR/commit/ca676ef38c598ac6038dcd4217df15745ef67260 https://github.com/SFWMD/dbhydroR/commit/86436e321bce70501a1aa04e5f5e0a3b890b3940

Possibly unnecessary files & code

  • The NEWS file has output: pdf_document at the top. Are you using this feature?

These lines have been removed.

https://github.com/SFWMD/dbhydroR/commit/4cc1b2d14357502d7f248a3e78558690e83cd357

  • Do you need all of the files in dbhydro/vignettes? DFlowInterpR.bib~, Thumbs.db, and Untitled Document~ (and maybe others) might not be getting used.

These files have been removed.

https://github.com/SFWMD/dbhydroR/commit/088884276228f665e2b42d48a2ad318d71d8fea0 https://github.com/SFWMD/dbhydroR/commit/9186979e9ca2a5694d68bcdf863c326d98f304b6

Functions

getwq + gethydro

These functions have a clean, intuitive interface. A few thoughts:

  • It would be helpful to more thoroughly explain dbkey in the documentation for gethydro.

I included some additional information about dbkey in the gethydro docs.

https://github.com/SFWMD/dbhydroR/commit/8b67b622a96df66f4bab968f0cf17787de7792ce

  • Is it / could it be possible to pass dbkey into getwq, as it is for gethydro? One dbkey might be easier than site+test for some programming applications, and it looks like water quality data do have dbkeys, e.g.:

getdbkey(stationid = “C111%”, stat = ‘MEAN’, category = “WQ”)
Search results for ‘C111% WQ’
Dbkey Group Data Type Freq Recorder Start Date End Date
1 38097 C111WETL H2OT DA ???? 04-OCT-2003 10-OCT-2011
2 38104 C111WETL SALIN DA ???? 04-OCT-2003 10-OCT-2011
3 38100 C111WETL SCOND DA ???? 04-OCT-2003 10-OCT-2011

I realize that this is a bit confusing. The getwq function pulls data from a seperate set of DBHYDRO tables than gethydro. Only the gethydro tables contain dbkeys. Then, confusingly, the gethydro tables have a category called "WQ". The getwq tables do not support selection by dbkey. The getwq water quality data is generated from laboratory tests while the gethydro "WQ" data is generated from things like automated sondes.

  • Two things only occur in getwq if raw==FALSE which I expected to be conditional: (1) the message “No data found” doesn’t occur even if there are no data, and (2) MDL handling doesn’t occur even if I specify mdl_handling. My favorite solution would be to make these things happen regardless of the value of raw, but another option would be to document them prominently in the ?getwq help file.

getwq(“FLAB01”, “2014-09-14”, “2014-09-18”, “NITRATE+NITRITE-N”, raw = TRUE)$Value
[1] -0.01
getwq(“FLAB01”, “2014-09-14”, “2014-09-18”, “NITRATE+NITRITE-N”, raw = TRUE, mdl_handling=‘half’)$Value
[1] -0.01

I refactored getwq and cleanwq so that MDL handling occurs regardless of how raw is set. Also, the “No data found” message should now appear regardless of how raw is set.

https://github.com/SFWMD/dbhydroR/commit/8a629183c03003175918640937ed18666a56f5eb

getdbkey

Though not advertised in the README, the getdbkey function looks like a useful way to identify datasets to download. With that in mind, I have a few suggestions that are mostly illustrated by this example:

getdbkey(stationid = “C111%”, stat = ‘MEAN’, category = “WQ”, detail.level=“full”)
Search results for ‘C111% WQ’
Get Data Dbkey \nStation\n Group Data Type Freq Stat Recorder Agency
1 38097 C111WETLND C111WETL H2OT DA MEAN ???? USGS
2 38104 C111WETLND C111WETL SALIN DA MEAN ???? USGS
3 38100 C111WETLND C111WETL SCOND DA MEAN ???? USGS
Start Date End Date Strata County Op Num Latitude Longitude X Coord
1 04-OCT-2003 10-OCT-2011 0 DAD Â 251740 803112 814755.394
2 04-OCT-2003 10-OCT-2011 0 DAD Â 251740 803112 814755.394
3 04-OCT-2003 10-OCT-2011 0 DAD Â 251740 803112 814755.394
Y Coord Basin Struct Section Township Range
1 349549.418 C111_CO Â 15 59 38
2 349549.418 C111_CO Â 15 59 38
3 349549.418 C111_CO Â 15 59 38

I added some advertisement for getdbkey to the README.

  • Calls with detail.level=‘full’ return a column named “\nStation\n”. I’d prefer “Station” if there’s no reason to stick with the extra newlines.
  • The values of Latitude and Longitude are unconventionally formatted. Would you consider converting them to decimal degrees?

I removed the extra newlines from “Station” and formatted Latitude and Longitude as decimal degrees.

https://github.com/SFWMD/dbhydroR/commit/3df0ad17f4e1b4f78f765b5bd182b1b3d4d34446

  • Are the values in the “Struct” column being parsed correctly? I don’t know what to make of Â. It seems to appear in other columns, too, in other queries.

I am not getting  printed in any of the columns. Maybe I fixed it as a side-effect of another commit? The following line prints a reasonable (non-blank) value of Struct for me:

getdbkey(category = “SW”, stationid = “ALLIGAT”, param = “STG”, detail.level = “full”)

  • Does anything useful ever appear in the “Get Data” column?

Good catch. No it doesn’t. This is an artifact from the web interface. I removed this column.

https://github.com/SFWMD/dbhydroR/commit/b68cc5ed0c3814455bd104c22ff828687031546a

  • Consider using stringsAsFactors=FALSE so that Station, Dbkey, etc. come back as strings.

Good suggestion. I made this change.

https://github.com/SFWMD/dbhydroR/commit/3df0ad17f4e1b4f78f765b5bd182b1b3d4d34446

# we’d expect >100 dbkeys in this query
nrow(getdbkey(category = “GW”, detail.level = “full”))
Search results for ‘NA GW’
[1] 100
# because we easily get >100 sites combined in these two queries:
dba <- getdbkey(stationid = ‘A%’, category = “GW”, detail.level = “full”)
Search results for ‘A% GW’
dbi <- getdbkey(stationid = ‘I%’, category = “GW”, detail.level = “full”)
Search results for ‘I% GW’
nrow(dba) + nrow(dbi)
[1] 134

I circumvented this limitation.

https://github.com/SFWMD/dbhydroR/commit/b68cc5ed0c3814455bd104c22ff828687031546a

cleanwq + cleanhydro

h <- gethydro(dbkey = “IY639”, date_min = “2009-01-30”, date_max = “2009-01-31”)
Warning messages:
1: In cleanhydro(res) : Column headings missing. Guessing…
2: In cleanhydro(res) : Instantaneous data detected
head(h$date)
[1] “2009-01-30 00:40:00 MST” “2009-01-30 00:55:00 MST” “2009-01-30 01:10:00 MST”
[4] “2009-01-30 01:25:00 MST” “2009-01-30 01:40:00 MST” “2009-01-30 01:55:00 MST”

I forced hydro and wq date/times to return in EST.

https://github.com/SFWMD/dbhydroR/commit/4d1dd1552c08568bc429959dd6b52898438d25b5 https://github.com/SFWMD/dbhydroR/commit/239c03857d721d363fc373396c0e3e442a920d59

  • In cleanhydro, why is “Instantaneous data detected” a warning rather than a message? Is it possible for instantaneous data to make trouble?

This has been changed to a message rather tha a warning. The raw server response for instantaneous data returns no column headers. These are assumed, hence the warning/message. The column assignments seem pretty reliable. Maybe I can drop these messages.

https://github.com/SFWMD/dbhydroR/commit/2d9902d75d2ed2ec5ef90d7a0bda18b8b778525d

  • At first glance, I agreed with @fawda123 that it would make sense not to export the cleanwq and cleanhydro functions because they are wrapped by getwd and gethydro. After more thought, though, I decided that I might often want to retain the Station.ID column and the long format (e.g., for dplyr-style analyses of multiple sites or tests) but might also want to specify the mdl_handling and/or get back parsed timestamps. I see no one-line option for these use cases, and it makes me wonder if the cleaning functionality could be split into things you’d almost always want (parsed timestamps, connecting metadata to headers, maybe move the Value column closer to the left) and things you might want to specify in various ways (MDL handling, selecting specific columns, switching between wide and long, renaming the value column). The former functionality could be the default in getwd and gethydro while the latter could live in (a) exported functions that you’d call separately or (b) new arguments to getwd and gethydro.

Maybe I have captured the spirit of this comment by standardizing the behavior of the clean* functions, and forcing mdl_handling to occur regardless of the value of raw? Also, timestamps are parsed to the date column regardless of the value of raw.

  • If you decide that cleanwq and cleanhydro should continue to be exported functions, they should be used in examples in the README and vignette.

I aded examples to the README and vignette.

https://github.com/SFWMD/dbhydroR/commit/cf5ebd7d134a5eaf9107f190aa931d8af25f1226

Signing off

This package is already a nice contribution - thanks for developing it. Let me know if you’d like to clarify/discuss any of the above points. Best, Alison (@aappling-usgs)