The C6H6 NMR repository: An integral solution to control the flow of your data from the magnet to the public

NMR is a mature technique that is well established and adopted in a wide range of research facilities from laboratories to hospitals. This accounts for large amounts of valuable experimental data that may be readily exported into a standard and open format. Yet the publication of these data faces an important issue: Raw data are not made available; instead, the information is slimmed down into a string of characters (the list of peaks). Although historical limitations of technology explain this practice, it is not acceptable in the era of the Internet. The idea of modernizing the strategy for sharing NMR data is not new, and some repositories exist, but sharing raw data is still not an established practice. Here, we present a powerful toolbox built on recent technologies that runs inside the browser and provides a means to store, share, analyse, and interact with original NMR data. Stored spectra can be streamlined into the publication pipeline, to improve the revision process for instance. The set of tools is still basic but is intended to be extended. The project is open source under the Massachusetts Institute of Technology (MIT) licence.


| INTRODUCTION
Concern is growing, across different disciplines, over the current state of scholarly data. Contemporary publication practices prevent us from taking advantage of research outcomes and from seamlessly integrating them in new knowledge discovery enterprises. To overcome this problem, the FAIR Data Principles [1] (proposed by a collective of 47 researchers around the world) attempt to set the standard for scientific data sharing. FAIR is an acronym for Findability (data are to be stored with rich metadata and registered or indexed in a searchable source), Accessibility (all data and metadata should be accessible through open, free, and standard communication protocols), Interoperability (data and tools from non-cooperating resources should be able to integrate with minimum effort), and Reusability. The FAIR data initiative acknowledges the role of computer agents working for human agents and the importance of making data FAIR for both.
The NMR community has recognized similar challenges and asks for completeness, accessibility, and both human- and machine-readability of published NMR data. [2][3][4] Surprisingly, although modern spectrometers already meet these requirements, generating full sets of experimental data that can be seamlessly shared and read both by machines and (through readily available software) by humans, our scientific publication practices break this achievement. [4] The full spectrum is reduced to a peak list, a time-consuming task that summarizes the findings while dismissing important parts of the experimental data (see Figure 1). This peak list may or may not be accompanied by an illustration of the spectra, rendered as an image that is a cryptogram to any algorithm attempting to interpret the underlying spectrum. [5,6] The final result is presented as a PDF, with data that are ultimately insufficient to seriously referee the publication, replicate it, or expand on it. [7] Peer reviewers are thus doomed to vain attempts at checking the quality of the publication without access to the original data; readers have to "resurrect" [8] the spectra from the peak list to a more human-readable format; and developers have to parse and interpret these incomplete data to feed their algorithms. This absurd cycle engenders a workflow that is not only slow but also prone to errors. [7][9][10][11] In this sense, one could even argue that only the (potentially wrong) results of the author's interpretation of the spectrum are being published, whereas the actual experimental data remain hidden. This is the opposite of what is expected of an empirical science.
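As a rough illustration of what such a "resurrection" involves, the following JavaScript sketch rebuilds a spectrum from a peak list by summing one Lorentzian line per listed peak. The peak-list fields (`delta`, `intensity`), the function names, and the single fixed line width are assumptions made for this example and are not taken from any particular software; multiplicities and coupling constants are not expanded.

```javascript
// Lorentzian line of unit height centred at x0 with full width at half maximum fwhm.
function lorentzian(x, x0, fwhm) {
  const halfWidth = fwhm / 2;
  return (halfWidth * halfWidth) / ((x - x0) ** 2 + halfWidth * halfWidth);
}

// Rebuild ("resurrect") a spectrum on a regular grid from a list of peaks.
// All option names are hypothetical and chosen for this sketch only.
function resurrectSpectrum(peaks, { from, to, nbPoints, fwhm }) {
  const step = (to - from) / (nbPoints - 1);
  const x = new Float64Array(nbPoints);
  const y = new Float64Array(nbPoints);
  for (let i = 0; i < nbPoints; i++) {
    x[i] = from + i * step;
    for (const peak of peaks) {
      y[i] += peak.intensity * lorentzian(x[i], peak.delta, fwhm);
    }
  }
  return { x, y };
}

// Invented peak list: chemical shift in ppm and relative intensity.
const peaks = [
  { delta: 1.25, intensity: 3 },
  { delta: 7.26, intensity: 1 },
];
const spectrum = resurrectSpectrum(peaks, {
  from: 0,
  to: 10,
  nbPoints: 1001,
  fwhm: 0.02,
});
```

Whatever was not written into the peak list (impurity signals, real line shapes, baseline) cannot reappear in `spectrum`, which is the loss of information discussed above.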
The community has not been oblivious to this contradiction. Ten years ago, on the widely read chemistry blog of Peter Murray, accompanying a low-resolution image of an NMR spectrum taken from a scientific publication, one could read: "If an article costs USD 3000 then the scientific community deserves better. How many chemists have cursed the unreadability of numeric data mangled by graphics tools? There is no technical reason why the digital data shouldn't be deposited with the publisher, the institution, the department." [6] A decade ago, it was already clear that no representation could ever replace the original information and that technology was mature enough for that task.
As NMR techniques become part of the daily experimentalist's workflow and more and more structures are published, there is a growing concern about how to ensure the quality and veracity of the published assignments and structures. [7,11] In addition, it is not a simple task to define what is the minimum information that should be made available; indeed, this can keep scientists from different research fields talking for years. Therefore, it does seem very reasonable to, at least, ask for raw data to be stored and shared. [4,10] Note that publishing the raw spectra is a "less is more" kind of solution: Because the spectrum already comes in an optimal shape right out of the spectrometer, we just need to publish it "as it is." This is not a new idea. Some authors endeavour to make their raw data available, either as Supporting Information or via a website. [13] The first repositories of raw spectra appeared at least a decade ago. [14] Their importance is increasingly recognized for various applications and strategies, such as fingerprinting [15]; computer-assisted spectra analysis [16][17][18][19][20][21][22][23][24][25][26]; design of QSAR/QSPR descriptors to predict properties from spectra [27]; and identification of putative metabolites. [28,29] Nevertheless, it appears that availability of raw NMR data is still a significant issue.
FIGURE 1 Illustration of the information lost during the current publication scheme. Raw data (top) are peak-picked and integrated, and signal multiplicities are determined by the authors. This information is then summarized as a list of picked signals, sometimes referred to as "NMR text," [12] and published (middle). The "spectrum" at the bottom shows the information available to the reader after resurrection, that is, after software is used to display the information contained in the NMR text. Although all relevant information is retained for the aliphatic region, the same does not happen in the aromatic region. Clearly, such a loss of information can no longer be justified by limitations of the available technology.
Several dedicated databases exist [18,30] that provide spectroscopic assignments, but they are just starting to include the original data. [18] ChemSpider [31] is a free but not open compound database that provides NMR raw files as subsidiary data. In a similar vein, metabolomics databases such as HMDB [32,33] and BMRB [34,35] include NMR data; some even provide access to the raw spectra. [32,33] Although they are not meant to become universal NMR repositories, it is worth noting that HMDB presents several features we consider desirable in a general-purpose repository (e.g., access to both raw data and assigned data and peak search capabilities). NMRb, [14] a repository of raw NMR datasets for the biosciences, seems to have disappeared. The SPECTRa project for sharing raw NMR data seems to have fallen into oblivion as well. [36] Lastly, in OSDB, we found a recent effort that has not yet reached maturity. [37] The issue with these raw data NMR repositories has been that they have not given enough importance to interactivity. Offering little more than a download link to the JCAMP-DX files might not be enough. NMR spectroscopy produces complex data, requiring different forms of processing, visualization, and searching.
A repository needs to help different users interact with these data in the ways they want to; ways that may be as varied as the disciplines where NMR spectroscopy has earned a spot. For instance, there is the view of the organic chemist, to whom NMR spectra are a means to elucidate molecular structures, and then, there is the metabolomics view, in which the spectrum is a fingerprint. Organic chemists will browse the data looking for similar structures and assignment tables, whereas metabolomics researchers will browse the data looking for signals in a particular region of the spectra. Their different interests demand different capabilities from the repository. Other features such as peak-picking are equally crucial to both of them. We have not found an NMR repository that offers all these possibilities. Moreover, even that would not be enough, because there will always be somebody who wants to process NMR data in a previously unthought-of way. This means that extensibility is a must. What we are lacking, then, is a repository that provides extensible tools to extract useful information and to then convert that information into knowledge.
Here, we present a repository that is intended to sit in between the spectrometer and the public, providing a set of efficient tools to manipulate spectra online and to browse the database. The whole system is open, and its code is shared under the MIT licence, as an invitation for others to join the project.

| DATA AND METHODS
The repository is composed of three parts, depicted in Figure 2a.
1. A storage component where raw NMR data and additional information can be stored from any web page, from third-party software, or, if desired, directly from a spectrometer (see Figure 2b). This component consists of the following:
a. A data management system implemented in CouchDB. [38] We chose CouchDB because it offers several advantages for building the repository: It is document-based, it has been designed to work in a distributed manner, it is easily replicable, it scales horizontally, and it keeps revisions of the data.
b. A data structure for chemical information, described in the Supporting Information. The proposed data model has been defined using the JavaScript Object Notation [39] (JSON). The main advantages of this approach are that the natural text representation of the data is human readable and that it is supported by libraries in almost any modern programming language; in particular, it is natively supported by JavaScript in any web browser. Spectra are stored as attachments in JCAMP-DX format. Original data from spectrometer manufacturers may be accepted in the future, provided that the format is well described. Molecules are stored separately as SDF files.
c. A RESTful API (Application Programming Interface) called rest-on-couch [40] that exposes the data to the web and allows the control of permissions on the documents. This API has been developed in JavaScript.
All the related sources are available on GitHub. [41]
2. A toolbox of JavaScript libraries for data manipulation (processing and analysis). [42,43] It is built on more than 60 libraries (some developed in-house and some borrowed from other open-source projects) that enable it to perform a wide range of operations on different kinds of information, ranging from image analysis, Fourier transformation, multiplet analysis, and spectra prediction and simulation to data mining and multivariate statistical analysis. Full details on the methods implemented in these libraries can be found in the projects' documentation; here, we will just refer to those specifically concerning the NMR repository, which are available on GitHub from the cheminfo-js [44] and mljs [45] organizations:
a. Structure search uses the algorithms of DataWarrior. [46]
b. 1D NMR peak-picking uses the Global Spectra Deconvolution method described by Cobas et al. [47] 2D peak detection is performed by using the watershed algorithm for image segmentation [48] and by the identification of centroids of closed regions on the Laplacian of Gaussian of the 2D spectra. [49] If several spectra are available for the peak-picking process (e.g., both 1D and 2D 1H), then a validation is performed, which can identify spurious or missing peaks by comparing their patterns.
c. 13C-NMR chemical shifts are predicted using NMRshiftDB, [17,18] whereas 1H-NMR chemical shifts and coupling constants are predicted with Spinus. [22,23] 1D spectra are simulated from predicted shifts using the method of Castillo et al. [25] Spin couplings in 2D spectra are predicted by calculating n-length paths between active nuclei in the corresponding molecule; for example, a COSY cross-peak pair is drawn for each pair of protons separated by up to three bonds. Cross-peaks coming from long-range couplings, such as COSY couplings at >3 bonds, are included in the 2D spectrum when Spinus predicts a coupling constant >2 Hz.

3. A visualization tool called Visualizer, [50] developed in JavaScript and HTML5. Using an interface written in a programming language supported by all modern web browsers enables access to the application without having to install software on the client: Everything runs in the browser and always uses the latest updates. It is built in a modular manner that allows its behavior to be modified directly from the browser, by executing code written inside the tool. This is similar to extending TopSpin's functionalities using AU programs [51] or Jython [52] scripts.
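To give a flavour of the kind of routine found in such a toolbox, the following sketch picks peaks in a 1D spectrum. It is deliberately not the Global Spectra Deconvolution method cited above: it is a much cruder stand-in that keeps local maxima above a user-supplied threshold and refines their position by three-point parabolic interpolation on a regular grid. The function names, the `threshold` option, and the synthetic test spectrum are assumptions made for illustration.

```javascript
// Simplified 1D peak-picking (NOT Global Spectra Deconvolution):
// keep local maxima above `threshold` and refine each position by
// fitting a parabola through the maximum and its two neighbours.
function pickPeaks(x, y, options = {}) {
  const { threshold = 0 } = options;
  const peaks = [];
  for (let i = 1; i < y.length - 1; i++) {
    if (y[i] > threshold && y[i] > y[i - 1] && y[i] >= y[i + 1]) {
      // The curvature is strictly negative at such a point, so no division by zero.
      const curvature = y[i - 1] - 2 * y[i] + y[i + 1];
      const offset = (0.5 * (y[i - 1] - y[i + 1])) / curvature; // in grid units
      peaks.push({ x: x[i] + offset * (x[i + 1] - x[i]), y: y[i], index: i });
    }
  }
  return peaks;
}

// Synthetic spectrum used only to exercise the function.
function gaussian(ppm, center, height, width) {
  const u = (ppm - center) / width;
  return height * Math.exp(-(u * u));
}

const x = [];
const y = [];
for (let i = 0; i <= 1000; i++) {
  const ppm = i * 0.01;
  x.push(ppm);
  y.push(gaussian(ppm, 2.0, 1, 0.05) + gaussian(ppm, 7.5, 0.5, 0.05));
}
const picked = pickPeaks(x, y, { threshold: 0.05 });
```

A real implementation also has to cope with noise, overlapping signals, and line widths, which is precisely what the cited methods address.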
The whole system is available on GitHub [53] and can be easily deployed using container technology, [54] which makes it possible to install and configure the service without having to install and configure each of its components individually.
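The path-based prediction of COSY correlations mentioned in the description of the toolbox can be sketched as a breadth-first search over the molecular graph: starting from each proton, every other proton reached within three bonds yields a candidate cross-peak. The molecule representation (an array of element symbols plus a list of bonded atom-index pairs) and the function names are assumptions made for this example; the long-range couplings selected through predicted coupling constants are not covered.

```javascript
// Build an adjacency list from a list of [atomIndex, atomIndex] bonds.
function buildAdjacency(nbAtoms, bonds) {
  const adjacency = Array.from({ length: nbAtoms }, () => []);
  for (const [a, b] of bonds) {
    adjacency[a].push(b);
    adjacency[b].push(a);
  }
  return adjacency;
}

// Return every pair of hydrogens separated by at most `maxBonds` bonds.
function protonPairsWithin(molecule, maxBonds) {
  const { atoms, bonds } = molecule;
  const adjacency = buildAdjacency(atoms.length, bonds);
  const pairs = [];
  atoms.forEach((element, start) => {
    if (element !== 'H') return;
    // Breadth-first search: the first visit gives the shortest bond distance.
    const distance = new Map([[start, 0]]);
    const queue = [start];
    while (queue.length > 0) {
      const current = queue.shift();
      const d = distance.get(current);
      if (d === maxBonds) continue; // do not walk further than maxBonds
      for (const next of adjacency[current]) {
        if (distance.has(next)) continue;
        distance.set(next, d + 1);
        queue.push(next);
      }
    }
    for (const [atom, d] of distance) {
      // atom > start reports each pair once and skips the starting proton.
      if (atom > start && atoms[atom] === 'H') {
        pairs.push({ from: start, to: atom, bonds: d });
      }
    }
  });
  return pairs;
}

// Ethanol with explicit hydrogens: C0(H3,H4,H5)-C1(H6,H7)-O2(H8).
const ethanol = {
  atoms: ['C', 'C', 'O', 'H', 'H', 'H', 'H', 'H', 'H'],
  bonds: [[0, 1], [1, 2], [0, 3], [0, 4], [0, 5], [1, 6], [1, 7], [2, 8]],
};
const cosyPairs = protonPairsWithin(ethanol, 3);
```

In this example the methyl and hydroxyl protons are four bonds apart, so no pair is reported between them; a predictor would then merge pairs of equivalent protons and place cross-peaks at the predicted chemical shifts.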

| RESULTS
Putting all the elements described in the previous section together allowed us to build C6H6, an NMR repository with several features intended to make it worth the effort that users invest in sharing their data. A working implementation of our application can be accessed at www.c6h6.org. The concept behind C6H6 is as follows: a database of spectra (1H, 13C, COSY, HMBC, and HSQC are currently supported) and other sample data (ID, origin, name, and physical constants; see Supporting Information) is kept on secured servers. The webtool interacts with the database through the RESTful API, allowing the user to search, view, modify, and download the data (see Figure 2b). Results of the query are processed on the client using JavaScript, and the final result (e.g., a plot of the queried spectrum) is presented in the browser interface.
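To make the idea of a JSON sample record with attached spectra more tangible, the sketch below shows what such a document and a small client-side query could look like. The actual data model is the one described in the Supporting Information; every field name here (`name`, `origin`, `physical`, `spectra`, `jcamp`, and so on) is a hypothetical placeholder, and only `_id` and the `_attachments` stub layout follow general CouchDB conventions.

```javascript
// Hypothetical sample record. Field names are illustrative assumptions,
// except `_id` and `_attachments`, which follow CouchDB conventions.
const sampleDocument = {
  _id: 'sample-0001',
  name: 'benzene',
  origin: 'commercial',
  physical: { meltingPointCelsius: 5.5, boilingPointCelsius: 80.1 },
  molfile: 'structure.sdf',
  spectra: [
    { experiment: '1H', frequencyMHz: 400, solvent: 'CDCl3', jcamp: 'spectra/1h.jdx' },
    { experiment: '13C', frequencyMHz: 100.6, solvent: 'CDCl3', jcamp: 'spectra/13c.jdx' },
  ],
  _attachments: {
    'structure.sdf': { content_type: 'chemical/x-mdl-sdfile', stub: true },
    'spectra/1h.jdx': { content_type: 'chemical/x-jcamp-dx', stub: true },
    'spectra/13c.jdx': { content_type: 'chemical/x-jcamp-dx', stub: true },
  },
};

// Client-side processing of a query result: select spectra of one experiment type.
function listSpectra(doc, experiment) {
  return doc.spectra.filter((spectrum) => spectrum.experiment === experiment);
}

// Consistency check: every file the record refers to must exist as an attachment.
function missingAttachments(doc) {
  const referenced = [doc.molfile, ...doc.spectra.map((spectrum) => spectrum.jcamp)];
  return referenced.filter((name) => !(name in doc._attachments));
}

const protonSpectra = listSpectra(sampleDocument, '1H');
const missing = missingAttachments(sampleDocument);
// JSON is plain text, so the record survives a round trip through the network.
const roundTrip = JSON.parse(JSON.stringify(sampleDocument));
```

Keeping the bulky spectra as attachments and the searchable metadata in the JSON body is one plausible reading of the design described above; it lets the client fetch a JCAMP-DX file only when a spectrum actually has to be displayed.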
FIGURE 2 (b) Synchronization with the laboratory: C6H6 may be directly linked to the spectrometer via synchronization/replication with a local laboratory information management system (LIMS); alternatively, data may be imported from third-party software or submitted online. Efforts are underway to implement an experiment configuration and request queue, thus turning C6H6 into a full-featured LIMS. (c) Insertion in the publication system: Data are readily shared via a URL; they are kept private during research and peer review, then made public once the paper is accepted.
The user is asked to log in before using the application; this is done in order to manage permissions and ensure the safety and security of the user's data. The submitted data are private, unless the owner decides otherwise. Users may choose, for each record, whether to make it available to the public. Public data can be accessed without login, but they cannot be modified by anyone but the owner. Figure 3 shows the main interface. On the left, we have the list of samples available to the user, a basic search utility, and the Add sample button. On the right, we can access the tool set. Upon entering or opening a sample, the user is taken to a new tab (Figure 4) where sample attributes (structure, name, physical constants, etc.) can be edited and spectra can be uploaded. Similarly, clicking on a tool takes the user to a new tab where an interface to perform the corresponding task is presented (Figure 5).
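The access rules stated in this paragraph can be summarized in a few lines of code. This is only a restatement of those rules under simplifying assumptions: the `owner` and `isPublic` fields and both function names are hypothetical, an anonymous visitor is represented by `null`, and the finer-grained permissions that the API layer can manage are ignored.

```javascript
// Private by default: a record is readable by its owner, or by anyone once public.
function canRead(record, user) {
  return record.isPublic === true || (user !== null && user === record.owner);
}

// Only the owner may modify a record, whether it is public or not.
function canWrite(record, user) {
  return user !== null && user === record.owner;
}

// Invented records and users for demonstration.
const privateRecord = { id: 'sample-0001', owner: 'alice@example.org', isPublic: false };
const publicRecord = { id: 'sample-0002', owner: 'alice@example.org', isPublic: true };
const anonymous = null;
```

For example, `canRead(publicRecord, anonymous)` is true while `canWrite(publicRecord, anonymous)` is false, which matches the behaviour described for public data.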
As can be seen in these snapshots, support for other techniques such as mass spectrometry and infrared spectroscopy is being developed. Therefore, the necessary non-proprietary formats, such as NetCDF, [55,56] will be ported to our application. In addition, work is in progress to fully support and test a recently proposed standard for the inclusion of NMR assignments as associated data items in SDF files. [3] This new format may ease the import of data analysed using third-party software (see Figure 2b).

| DISCUSSION
The issue we want to address with C6H6 is one of data availability and usability. NMR spectroscopy is a powerful technique that has rightfully attracted the attention of people with different backgrounds and interests. Making NMR data available means that the published data must serve the interests of all these parties. As discussed above, no optimal solution in this sense has yet been achieved by existing repositories. With C6H6, we thus intend to appeal to all disciplines where NMR is of relevance.
For the organic chemist, an NMR spectrum is, above anything else, a means to determine or confirm the identity of a compound of interest. This community is probably the most familiar with the data generated by the technique but is also probably the least interested in directly reading the whole output of the spectrometer (though there will be times when they will absolutely want to!). Yet they probably will not be satisfied with just reading the peak list, either. First, because this is not the representation of the spectrum they are most adept at reading; that would be the standard 2D plot. But, most importantly, because peak lists are not "true" experimental data. Indeed, they have already gone through a peak-picking and assignment process with which the reader may not agree and that may well hide the presence of impurities in the sample. In fact, checking and challenging the author's choices in this regard is a key objective of peer reviewing in this area. Overall, researchers in organic chemistry and related disciplines seek the ability to graphically navigate the spectroscopic data at different stages of processing, from fully processed, peak-picked, and assigned spectra to, in extraordinary cases, the rawest data provided by the spectrometer. For these users, C6H6 packages a basic set of computer-assisted spectra processing tools and plotting capabilities along with the raw data repository.
In many other fields, an NMR spectrum is first and foremost a fingerprint. Metabolomics is a key example. In this community, NMR is used to characterize the composition profile of complex samples of biological origin (e.g., a blood sample). The ultimate goal is to measure metabolic responses to different stimuli by detecting statistically significant changes in the NMR profiles of intervened and control samples. In this setting, assignment of signals is often not a priority and only required to confirm already identified biomarkers. Instead, other necessities arise. Access to full-resolution spectra is an obvious must: Even if it may be trimmed later during data processing, the researcher wants to start with the full fingerprint to make sure that no relevant features are missed. Knowledge of the precise conditions under which the spectrum was taken is equally important, because 1H-NMR spectra vary significantly (for fingerprinting purposes) with parameters such as static magnetic field intensity, buffer or solvent composition, and temperature. C6H6 allows such information to be stored and searched, for example, to retrieve spectra recorded at a particular field intensity. Tools such as spectra superposition and signal search may also be useful to the metabolomics community.
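A search of this kind can be sketched as a simple filter over spectrum metadata. The record layout (`frequencyMHz`, `temperatureK`, `peaks` as a list of chemical shifts in ppm), the query options, and the example entries are all invented for illustration and do not reflect the repository's actual schema or search interface.

```javascript
// Filter spectra by spectrometer frequency and, optionally, by the presence
// of at least one signal inside a chemical-shift window [from, to] in ppm.
function searchSpectra(spectra, query = {}) {
  const { frequencyMHz, toleranceMHz = 1, signalRange } = query;
  return spectra.filter((spectrum) => {
    if (
      frequencyMHz !== undefined &&
      Math.abs(spectrum.frequencyMHz - frequencyMHz) > toleranceMHz
    ) {
      return false;
    }
    if (signalRange !== undefined) {
      const [from, to] = signalRange;
      return spectrum.peaks.some((delta) => delta >= from && delta <= to);
    }
    return true;
  });
}

// Invented metadata records used only to exercise the filter.
const library = [
  { id: 'serum-01', frequencyMHz: 600, temperatureK: 298, peaks: [1.33, 3.24, 5.23] },
  { id: 'serum-02', frequencyMHz: 400, temperatureK: 298, peaks: [1.33, 4.11] },
  { id: 'urine-01', frequencyMHz: 600, temperatureK: 310, peaks: [2.05, 3.04] },
];

const at600 = searchSpectra(library, { frequencyMHz: 600 });
const withSignal = searchSpectra(library, { frequencyMHz: 600, signalRange: [5.1, 5.3] });
```

The same pattern extends to other acquisition parameters, such as temperature or solvent, provided they are stored alongside the spectrum.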
For the developer, data availability means access to raw, full-resolution data, which are key to the development of new methods and algorithms used to process and analyse spectra. Indeed, reliable datasets are needed to test and validate new methods. This is true not only for NMR, and one can look into other fields to realize what could happen to computer-assisted NMR analysis once data availability is given the importance it deserves. For example, in the field of visual media analysis, ImageCLEF [58] provides developers with curated datasets of images that they use for training and testing their algorithms. The impact of this initiative is not to be underestimated, as ImageCLEF has become a major driving force behind a rapidly developing field. We believe a similar initiative could be equally valuable to the development of automatic NMR analysis, though, surprisingly, previous attempts [14] seem to have met with limited success.
FIGURE 4 Sample editing tab. Any kind of NMR data can be uploaded using the drag & drop area. 1D and 2D spectra can be displayed and zoomed in (lower right). Accompanying structures may be drawn in the JSME editor [57] (left) or input directly as an SDF file. Additional information such as molecular formula and molecular weight is calculated on the fly while drawing. Similarly, more complex processes can be triggered automatically when data are uploaded, such as automatic peak-picking and assignment.
But beyond the importance of testing sets to validate new methods for NMR analysis is the requirement of larger datasets allowing the method itself to function. Chemical shift prediction is the clearest example: All state-of-the-art chemical shift predictors need a database of assigned spectra to work. [17][18][19][20][21][22][23][24][26] For all we know, this is probably the way it will always be. [26] Chemists often complain about the high cost of commercial suites for computer-assisted NMR analysis, but this cost acknowledges the enormous value of the spectra databases that actors in the private sector have managed to amass. If public science wants to compete with these corporations and produce its own, freely available applications, it needs its own, freely available NMR databases.
It must be emphasized that C6H6 is not just a web page or web service: It is a full-fledged application designed to run in the browser. The "web page" is intended to perform as traditional software; all the code necessary to perform the task is downloaded and executed within the browser. This solves the issue of operating system compatibility that often makes downloading and installing new software a troublesome task. Furthermore, the application is designed to be readily extensible: The JavaScript code necessary to perform new tasks can be stored inside the tool or implemented as an external library and called directly by the visualizer. In this manner, using the JavaScript language allows us to benefit from the biggest and fastest growing development community in the world. Finally, the same code can be executed either on the server side or directly in the browser (client side), a very powerful argument in favour of the JavaScript programming language that makes it very suitable for mobile devices.
Ongoing efforts are currently focused on converting C6H6 into a full LIMS (see Figure 2b) that provides a request and queueing system to encourage users to provide at least a minimal set of data to describe their experiment. This queue can be configured to automatically set up and trigger experiments in order to ensure that most experiments are performed with optimal parameters. Once the experiment is finished, the data are automatically sent back to the server, thereby ensuring that all the data are correctly stored, thus improving the traceability of the data. Validated data can then be shared via a URL and streamlined into the publication pipeline (see Figure 2c).
FIGURE 5 Peak-picking interface. Peak-picking can be performed either manually (by right-clicking on the spectra) or automatically. Assignment can be performed by selecting a signal (green rectangle on top of each peak) and then selecting a proton in the molecule.
From a decade of running NMR facilities, we understand that sharing data is not a spontaneous act. It should be encouraged either by lowering the effort required to share and by providing added value, which are goals that we attempted to achieve, or by including this as a requirement within the publication pipeline (see Figure 2c). We appeal to the NMR community and publishers to include depositing raw data as a necessary condition for publication. We insist that sharing of raw data may be key to the future of public science for fundamental reasons. Over the past years, reproducibility of scientific studies has emerged as a major concern. Providing open access to the raw data, experimental designs, and source code has been proposed as one avenue to mitigate this problem. Therefore, sharing electronic laboratory notebooks represents a true step towards incremental and reproducible science. [59] LIMS and repositories are key elements in the construction of such new ways to share and publish research outcomes. Although the technology is there, the main challenge remains.

| CONCLUSIONS
We believe the repository presented here enables a sharing and publication model that can warrant the quality, traceability, transferability, and agility required by contemporary NMR-related research; an ideal that the current publication system has not properly achieved.
But the mere existence of a repository does not guarantee that this ideal will be achieved. In order to ensure the comprehensiveness and correctness of the data stored, the repository needs to be inserted into the peer-review and publication pipeline, which in turn demands the collaboration of researchers and publishers. We invite the NMR community to participate in this joint effort.