Hi @cderv and @pegeler, thank you both very much for your helpful comments and suggestions, and sorry for not responding more quickly.
I find it somewhat necessary because you also use list-columns in tibbles. These are not easy for newcomers to handle; even for dplyr users who do not use them often, they are not straightforward. Adding an example to the vignettes showing how to retrieve an embedded data.frame inside a tibble's list-column could be a good idea. To illustrate:
```r
library(nomisr)
library(dplyr, warn.conflicts = FALSE)

y <- nomis_data_info("NM_893_1")

y$annotations.annotation %>% class()
#> [1] "list"
y$annotations.annotation[[1]] %>% class()
#> [1] "data.frame"

y %>% pull(annotations.annotation) %>% class()
#> [1] "list"
y %>% pull(annotations.annotation) %>% .[[1]] %>% class()
#> [1] "data.frame"
y %>% pull(annotations.annotation) %>% purrr::pluck(1) %>% class()
#> [1] "data.frame"
y %>% pull(annotations.annotation) %>% purrr::map(1) %>% class()
#> [1] "list"

y %>% tidyr::unnest(annotations.annotation) %>% glimpse()
#> Observations: 14
#> Variables: 11
#> $ agencyid ...
#> $ annotationtitle ...
```
Response: I’ve added a similar example to the vignette; it is close enough to yours that I feel I should cite you as a contributor, if that is okay.
The following are my review comments and suggestions for improvement:
Response: Added a CONTRIBUTING.md in https://github.com/evanodell/nomisr/commit/e4bbb0229709a861b1ef2141e9ecf6da6ff8fc10
Response: Now only using Authors@R
see the sections below for details
Response: For reasons I’m not entirely clear on, the package was failing on Travis without the curl suggestion, I think due to a bug in jsonlite.
Response: I’ve added more details to the README, including more details on Nomis itself, a simple example and citation information.
These improvements are the reason I did not check the vignettes box, even though there is a vignette in the GitHub repo and docs. If you need help with any of this, do not hesitate to ask.
Response: As mentioned above, I’ve added your suggestions on exploring list-columns to the vignette, and fixed the online formatting issues. I’ve changed echo to TRUE to show the code.
Response: I’ve changed \dontrun{} to \donttest{}
I ran all the examples with devtools::run_examples(run = FALSE) (in several passes, due to timeout constraints). Everything works. However, your examples systematically assign to a variable (named x or y — room for improvement) but never print it to show the head of the results. I feel that is missing.
Response: I have changed the names in the examples to be more descriptive of the data they are actually retrieving, and added a print call so the head is actually displayed.
Response: I’ve removed the very long examples from nomis_get_data - I considered wrapping them in \dontrun{} but I think dropping them is the better option.
I saw you made a pkgdown website. This is great!! And a nice theme!
I did not find the source code of this website. I would have thought it lived with the package source code, but there is no docs folder or gh-pages branch. I find it useful to have access, so that people can open PRs for typos and other fixes, and so the whole site can be rebuilt from your package source. However, you may have good reasons.
Response: I keep all the source code for my documentation websites in my docs repo. I did this so I could host on Netlify with a custom domain and HTTPS encryption, which, at least when I set it up, was not available through GitHub Pages. I’m not sure if that is the best option or not.
Response: I’ve turned the API URL into its own variable.
Response: I’ve dropped suppressMessages() and used col_types = readr::cols(.default = readr::col_character()) to make columns characters by default.
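As an aside, the same default-to-character behaviour can be sketched in base R with read.csv (this is just to illustrate the idea; the package itself uses readr):

```r
# Read every column as character, analogous to
# readr::cols(.default = readr::col_character())
csv_text <- "a,b\n1,2\n3,4\n"
df <- read.csv(text = csv_text, colClasses = "character")
sapply(df, class)
#> both columns are "character"
```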
[minor] I don't know what the good practice is on this one. I just feel that you have a dplyr dependency in Imports for three functions, bind_rows, case_when and if_else, that can easily be replaced with rbind or if/else. While the former are very useful interactively in an analysis, using them in a package adds an unnecessary dependency on a "big" package, which can cost you effort later if anything changes in dplyr. More of a personal point of view on this one.
However, it depends on how you build your query URL:
```r
query <- dplyr::if_else(
  keywords == FALSE,
  paste0("/def.sdmx.json?search=", search, ""),
  paste0("/def.sdmx.json?search=keywords-", search, "*")
)
```
could be rewritten without dplyr::if_else
```r
base_query <- "/def.sdmx.json?search="
if (keyword) {
  keyword_term <- "keywords-"
} else {
  keyword_term <- "*"
}
```
It would remove a non-essential dependency.
Response: I’ve dropped the use of case_when and if_else in favour of the style you described above. bind_rows is used to combine a single tibble with a list of one or more tibbles, and using rbind in place of bind_rows would require an extra step. If there are changes to bind_rows in the future, it is only used in one location, so it should be easy to adjust for.
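To illustrate the extra step in base R (a sketch with made-up data; dplyr::bind_rows(first_df, list_of_dfs) takes a list directly):

```r
# The base equivalent of bind_rows on a list of data frames
# needs the list flattened via do.call first
first_df <- data.frame(x = 1)
list_of_dfs <- list(data.frame(x = 2), data.frame(x = 3))
combined <- do.call(rbind, c(list(first_df), list_of_dfs))
nrow(combined)
#> [1] 3
```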
The style of the code could be improved in some places. You can follow a style guide to help you, like the one in the rOpenSci Guidelines (which links to Advanced R) or, more recently, the tidyverse style guide. There are even tools to help, like styler and lintr. Examples of improvements:
put spaces after for or if, and also around =, ==, +, >, ...
indent code (in RStudio, Ctrl+I is useful). For example, in if () { } else { }, lines of code inside the braces should be indented.
use {} with if when it does not fit on one line
More importantly, be consistent in all your script with the guideline you choose.
Know that when you want to stop on a condition, there is also stopifnot(). For user experience, you can sometimes return clearer errors with stop("API request did not return any results", call. = FALSE), for example to not show the call.
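A small sketch of the difference (check_results() is a hypothetical helper, not a function from the package):

```r
# call. = FALSE keeps the error message clean:
#   "Error: API request did not return any results"
# instead of
#   "Error in check_results(df): API request did not return any results"
check_results <- function(df) {
  if (nrow(df) == 0) {
    stop("API request did not return any results", call. = FALSE)
  }
  df
}

# stopifnot() is terser but produces a less friendly message:
# stopifnot(nrow(df) > 0)
```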
A good practice is also to use meaningful names in your internal code, even for temporary variables. There are a lot of x, a, q, etc. This will improve readability and make it easier for others to collaborate on your code (including future you).
Response: I’ve added call. = FALSE to the various stop messages.
Response: I’ve gone through the code and changed temporary variable names to be more meaningful, and formatted the code to improve indentation, spacing after if and for, around ==, etc.
```
Retrieving additional pages 1 of 208888
Retrieving additional pages 2 of 208888
```
I don’t know if other API packages do something to help the user, or if it is entirely the user’s responsibility to know what they are doing (users will presumably already be NOMIS data users).
```r
a <- nomis_codes("NM_1_1")
```
You could use some tryCatch mechanism (some useful info in Advanced R).
Response: I’ve used tryCatch to hopefully provide more useful information in the case of timeouts. I’ve also written, but not implemented, a warning and yes/no option in interactive sessions for requests over 350,000 cells, which are likely to cause rate-limiting issues.
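The tryCatch pattern can be sketched roughly as follows. Here fetch_page() is a hypothetical stand-in that fails the way a timed-out download would; it is not the package's actual download function:

```r
# Simulate a download call that times out
fetch_page <- function(url) stop("Timeout was reached", call. = FALSE)

result <- tryCatch(
  fetch_page("https://www.nomisweb.co.uk/api/v01/..."),
  error = function(e) {
    if (grepl("Timeout", conditionMessage(e))) {
      message("The request timed out; very large queries can hit rate ",
              "limits. Consider narrowing the query or retrying later.")
    }
    NULL  # return a sentinel instead of crashing
  }
)
is.null(result)
#> [1] TRUE
```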
About the SDMX format on NOMIS
Do you know about the rsdmx package? It is on GitHub and CRAN, and there is a wiki. I did not find NOMIS in their listing, so I do not know if it works for you or how it might overlap with your package.
rsdmx is definitely a more generic package than yours. I just wanted to point it out in case you did not know about it.
Response: I was not aware of it, but it is something I should explore for the future, thank you for pointing me towards it.
Package Review for nomisr
Onboarding submission: ropensci/onboarding#190
Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide
Documentation
The package includes all the following forms of documentation:
A statement of need clearly stating problems the software is designed to solve and its target audience in README
Installation instructions: for the development version of package and any non-standard dependencies in README
Vignette(s) demonstrating major functionality that runs successfully locally
Function Documentation: for all exported functions in R help
Examples for all exported functions in R Help that run successfully locally
Community guidelines including contribution guidelines in the README or CONTRIBUTING, and DESCRIPTION with URL, BugReports and Maintainer (which may be autogenerated via Authors@R).
For packages co-submitting to JOSS
The package has an obvious research application according to JOSS's definition
The package contains a paper.md matching JOSS's requirements with:
A short summary describing the high-level functionality of the software
Authors: A list of authors with their affiliations
A statement of need clearly stating problems the software is designed to solve and its target audience.
References: with DOIs for all those that have one (e.g. papers, datasets, software).
Functionality
Installation: Installation succeeds as documented.
Functionality: Any functional claims of the software been confirmed.
Performance: Any performance claims of the software been confirmed.
Automated tests: Unit tests cover essential functions of the package and a reasonable range of inputs and conditions. All tests pass on the local machine.
Packaging guidelines: The package conforms to the rOpenSci packaging guidelines
Final approval (post-review)
Estimated hours spent reviewing: 10

Review Comments
Excellent work! This is a fine example of the level of quality and professionalism that I have come to expect from rOpenSci contributions. This package is a great addition to the complement of R functionality and provides a facile portal to a useful data source.
Many of the comments that I had have been addressed as a result of @cderv’s prompt and thorough review. In addition, development has continued to be very active over the past few weeks! Issues are fixed as quickly as I find them! Below are a few more comments that I would like to add.

Basic Package Structure Overview

README.md
The README.md does not contain citation information, per rOpenSci packaging guide. I am not sure how necessary that is since there is a CITATION file in the inst/ directory, however, it should be an easy add.
I would concur with @cderv in suggesting additional usage examples.

DESCRIPTION
The DESCRIPTION file does not currently specify a minimum version of R. I would
recommend doing so. A minor release is preferred over a patch release, e.g.
Depends: R (>= 3.4.0)
is better than
Depends: R (>= 3.4.1)
I am not seeing where curl is used even though it is in the Suggests list.
Response: I’ve added a minimum version of R (>= 3.4.0) to the DESCRIPTION.
Response: Not including curl was causing errors with jsonlite on Travis — I’m not sure why, as jsonlite is supposed to import curl itself.
CITATION
The CITATION file uses meta$Date to dynamically fill in the date for the citation. However, the Date entry was removed from the DESCRIPTION file per the gp suggestion, so the date is NULL.
Since it is considered good practice to omit the Date entry because it may be prone to falling out of date, it seems that hard-coding the date in the citation would be silly. I briefly considered the use of a different field to dynamically update the citation date, such as Date/Publication or Packaged. However, those will only work for users who are downloading from a CRAN repository. Therefore I’m a little bit at a loss as to what the most elegant solution might be. I might suggest breaking with good practice here and adding the Date field back in, since there is a reason to do so. Suggestions are welcome.
The CITATION file should also end in a newline character :)
Response: I’ve hardcoded 2018 into the CITATION file, which I think is the easiest solution, as it’s the year the package was first released, although I’m open to other suggestions as well.
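For reference, a hard-coded inst/CITATION entry could look something like the sketch below (the title and author fields here are illustrative, not copied from the package):

```r
# Illustrative CITATION entry with a hard-coded year instead of meta$Date
cit <- bibentry(
  bibtype = "Manual",
  title   = "nomisr: Access Nomis UK Labour Market Data",  # illustrative title
  author  = person("Evan", "Odell"),
  year    = "2018"  # hard-coded: the year of first release
)
print(cit)
```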
Documentation
The man page for the package is named nmisr, not nomisr. This appears to be a typo.
@seealso can link to the other package functions using \link{}.
It appears that some descriptions may be intended to be multi-paragraph. Without explicitly
using the @description and @details tags, the third code block is assumed to be the 'details'
block by default. Therefore, using explicit roxygen tags might be helpful here.
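A sketch of the explicit tags (the function name and text are purely illustrative):

```r
#' Example title
#'
#' @description The overview paragraph that should appear in the
#'   description section.
#'
#' @details A later paragraph that roxygen would otherwise assign to
#'   the details section implicitly; the explicit tag makes the
#'   intent clear.
#'
#' @param x An example argument.
example_fun <- function(x) x
```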
Nomis data comes packaged in very complex data structures. Please don't be afraid of 'over-explaining'.
The more information in the documentation and vignettes, the better for the user, especially if they do
not have prior experience with Nomis data.
'Nomis' in the title for nomis_search() should be capitalized to be consistent with the rest of the package.
Response: I’ve expanded the documentation and used roxygen tags to improve organisation. I’ve also added an expanded explanation to the package vignette.
Examples
The example for nomis_get_data failed due to lack of data. I recommend finding a specific
date range that will return results and then hard-code that into the example.
Although \donttest{} used to be perfect for API packages because it would not be run by
R CMD check but would be run by example() and would be included in the man page, it appears
that the functionality has changed.
Now, according to Hadley's book, it appears that code enclosed
in \donttest{} is indeed checked (though I have yet to confirm). Therefore,
for long-running API calls, it might make sense to change back to \dontrun{} if you are
concerned about time-out. I recall
reading some best practice advice on CRAN's website at some point, but cannot find the reference
anymore.
The examples often show multiple ways to use the function, which is excellent. However,
it might be a nice enhancement to add a little extra code to glimpse the data for the users to get
a better feel for what they are getting out of the API call. This is especially helpful since
the data that is returned can differ greatly depending on what args are passed.
Response: I’ve added glimpsing to the examples in different functions.
Vignettes
I like the switch to echo = TRUE for vignette code chunks. This is the user's opportunity
to see the code in action for the first time. Including the code used to generate the results can
aid in understanding and readability.
Vignettes are not currently built. This is not a big deal. However, it would
be a convenience for those installing from GitHub or local repos. This is especially true since
the API calls can cause vignettes to take a long time to build. (Currently I cannot build vignettes
because I keep getting hung up).
The addition of the theme and TOC is a nice touch.
It looks like the vignette might still be a little incomplete. I see that the final paragraph of 'Querying data variables' is not finished.
It's unnecessary to hard-code the results in code chunks. The results will be output when the vignette is knit.
Response: I’ve now pre-built the vignette.
Functionality
nomis_search()
[suggestion] Currently, users can either search based on generic search terms or search for specific keywords.
Looking at the API docs, I think there is an opportunity to improve functionality by taking advantage
of some of the other search categories (e.g., name, description, content type, units). This would
be doubly useful since content type and units are not searched when performing a generic search.
[suggestion] Rather than automatically inserting wildcards (*), I would recommend informing the user of the wildcard
character so that they can have finer control over how wildcards are used in their search.
It would be nice to mention in the documentation that several terms can be searched at once using
a comma. Alternatively, the parameter could accept a character vector that could be
collapsed with commas.
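The collapse approach might look something like this (the parameter handling is illustrative, not the package's actual code):

```r
# Collapse a character vector of search terms into the
# comma-separated form the API expects
search_terms <- c("population", "employment")
query <- paste0("search=", paste(search_terms, collapse = ","))
query
#> [1] "search=population,employment"
```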
[suggestion] The function could potentially accept character vectors from each of the categories (as well as
a generic search term field) and then the appropriate search string could be created.
Response: I have re-written nomis_search in keeping with your suggestions. It now accepts five parameters - name, description, keywords, content_type, units - and can search in any or all of them at the same time, with or without wildcards. I’ve also added a nomis_content_type() function to assist in looking up specific content types, which may be used with nomis_search().
nomis_codes()
Generally, it is helpful to mirror the language of the source material when possible.
The nomis_codes() function accepts a dimension (concept) and the function returns codes for that dimension.
Therefore, the argument code might be renamed to dimension or concept, since the
function searches for codes available within a dimension.
Related to above, I think the @title could be more descriptive.
[suggestion] The query is currently built using a hierarchal structure. E.g., /api/v01/dataset/<ID>/<DIMENSION>/<CODE>/def.sdmx.json. However,
I think there is an opportunity to make this query a little more flexible and extensible. Instead of enforcing the hierarchy,
the function could use the GET verb to provide a little more control over how the data is filtered. A user could
query on criteria from different dimensions than the one which they are interested in results from. For example,
even if the user is interested in the 'item' dimension, they may want to filter on 'geography' and 'sex'. Using GET would allow
for this. As an example: /api/v01/dataset/NM_1_1/item.def.sdmx.json?geography=2038432081&sex=7. This would also expose the user
to options for specifying times and dates, which cannot otherwise be done. I fully acknowledge that this might be over-building,
since this function only provides a support role and will probably not be used that way.
This is a minor concern, but I feel it's cleaner, when using ifelse(), to test the positive condition first. The code ifelse(is.null(type), "", paste0("/", type)) reads more cleanly than ifelse(is.null(type) == FALSE, paste0("/", type), ""), IMO.
nomis_data_info()
The documentation could be more explicit in telling the user that the data returned by this function is metadata.
Response: I’ve updated the documentation to be more explicit about what the function actually does, and improved the code readability. I changed the code parameter to concept, as you suggested, to be more in keeping with the syntax used in the original API. I’ve also added a specific search parameter, and a generic additional_queries parameter, as used in nomis_get_data().
[suggestion] Relating to the comment above, it may be a little clearer to rename the function something to the effect of nomis_get_metadata().
This naming scheme is more in line with the pattern set forth by nomis_get_data naming scheme.
Response: I’ve renamed the nomis_codes() function to nomis_get_metadata(), which I agree is more descriptive of what the function actually does. I’ve also added the nomis_overview() function, which returns a more general overview of all available metadata, without the flexibility and specificity offered by nomis_get_metadata().

nomis_get_data()
I see that the API documentation states, "you have a 25000 cell limit on the number of data values downloaded in these formats."
It appears that the code currently breaks the fetch into multiple steps if the record count is greater than or equal to 25k records.
I am curious if this might be an oversight. It might be useful to test the edge case where the query returns more than 25k cells but
fewer than 25k records to see if everything still works.
Response: The 25k limit refers to the number of observation values; the documentation is rather confusing on that point.
Including a select parameter may be a good enhancement for the future. Right now, that can be accomplished with
additional_queries but is probably important enough to be standalone. This would allow for additional documentation
and sugar to be added for it.
Response: I’ve added a select parameter as an enhancement, as it is straightforward to do and reduces the amount of information that could possibly need to be included under additional_queries.
Conclusion
This is a great package that will be valuable to anyone interested in analysis of this data. Nice work!
I appreciate the opportunity to review this package and hope that my comments are constructive and useful. This package needs just a few tweaks (namely getting the examples up and running) and then I will gladly approve!