rdflib
vignettes/articles/storage.Rmd
storage.Rmd
By default, rdflib
stores triples in memory, which can be read in from and written out to files on disk using a variety of serializations (rdfxml
, json-ld
, ntriples
, nquads
or turtle
). This is analgous to reading in a data.frame
object from a csv
, tsv
, or other text-delimited file, which requires the full data object to be read into memory before a user can manipulate it. This approach runs into limitations with very large datasets, which may exceed the memory available. Just as R packages such as dplyr
offer the ability to perform manipulations with very large data through a connection to a database backend such as SQLite, MySQL or PostgreSQL, rdflib
can rely on these and other databases to provide a disk-based backend.
Unfortunately, at this time, support for these storage devices is not included in the prebuild Mac and Windows binaries of the redland
R package. Users wanting to take advantage of disk-based storage must instead build and install the redland
R package from source; and in some cases, also build the redland
C libraries from source. This package provides a Dockerfile containing a portable recipe for building the library with support for the 5 main backend storage devices: SQLite, MySQL, PostgreSQL, Virtuoso, and Berkeley DB. This vignette documents the installation and use of these devices.
Also see the official documentation for the redland C library discussing all supported storage devices.
Here we show examples of reading in a modest set of 190,000 triples from an nquads
file and executing a simple SPARQL query using each the five backends supported by rdflib
. First we load the libraries and prepare an example file by triple-ifying 10,000 rows of the flights
data.frame:
library(rdflib)
library(nycflights13)
example <- flights[1e4,]
system.time(
write_nquads(example, "example.nq", prefix = "flights")
)
#> user system elapsed
#> 0.015 0.000 0.014
system.time(
write_nquads(flights, "flights.nq", prefix = "flights")
)
#> user system elapsed
#> 23.119 1.063 24.720
Because the dataset is small enough to easily fit in memory, the default in-memory option is an obvious choice with excellent performance. Note that this option will not be possible with larger triplestores (e.g. with millions or more triples). Our testing has found that even on machines with 100GB+ of RAM, the redland in-memory backend is not able to take advantage of that additional memory and disk-based backends are required.
triplestore <- rdf()
system.time(
read_nquads("example.nq", rdf = triplestore) # smaller set
)
#> user system elapsed
#> 0.003 0.000 0.003
query <-
'SELECT ?carrier ?flight ?origin ?dep_delay
WHERE {
?flight <flights:carrier> ?carrier .
?flight <flights:dep_delay> ?dep_delay .
?flight <flights:origin> ?origin
}'
system.time(
df <- rdf_query(triplestore, query)
)
#> user system elapsed
#> 0.024 0.003 0.027
rdf_free(triplestore)
The Berkeley DB is a simple key-value store
It is the most mature and primary persistent store and suitable for large models, tested in the 2-3 million range.
Berkeley DB is a simple disk-based storage option. Install both the redland libraries and the berkeley-db (e.g. bd-dev
on Debian/Ubuntu) libraries, and then install redland
from source. BDB is relatively fast for data too large for memory.
triplestore <- rdf(storage="BDB", new_db = TRUE)
system.time(
read_nquads("flights.nq", rdf = triplestore)
)
#> user system elapsed
#> 89.913 27.217 117.540
system.time(
df <- rdf_query(triplestore, query)
)
#> user system elapsed
#> 23.473 0.483 23.970
rdf_free(triplestore)
## Becuase BDB is just a hash table, redland needs three separate files:
unlink("rdflib-po2s.db")
unlink("rdflib-so2p.db")
unlink("rdflib-sp2o.db")
Unlike the other backends which use general purpose relational databases or key-value stores, Virtuoso dedicated to RDF-based data. Virtuoso is a popular open source database for RDF with a rich set of built-in interfaces and features, but we can also interact with it directly through the redland
bindings just like we do any other backend in rdflib
. Virtuoso setup may be slightly more involved for individuals unfamiliar with it, but will probably provide the best performance in the case of very large triplestores. The example below shows a new database setup, but rdflib
can also connect to any existing Virtuoso database.
triplestore <- rdf(storage = "virtuoso",
user = "dba",
password = "dba",
dsn = "Local Virtuoso"
)
system.time(
read_nquads("flights.nq", rdf = triplestore)
)
#> user system elapsed
#> 214.767 182.883 6930.584
system.time(
df <- rdf_query(triplestore, query)
)
#> user system elapsed
#> 44.730 29.497 396.329
df <- rdf_query(triplestore, "SELECT ?s ?p ?o WHERE{ ?s ?p ?o }")
#> Warning in rbind(names(probs), probs_f): number of columns of result is not
#> a multiple of vector length (arg 1)
#> Warning: 1391311 parsing failures.
#> row # A tibble: 5 x 5 col row col expected actual file expected <int> <chr> <chr> <chr> <chr> actual 1 1338850 o an integer _:r1011167 literal data file 2 1338851 o an integer _:r1011168 literal data row 3 1338852 o an integer _:r1011169 literal data col 4 1338853 o an integer _:r1011170 literal data expected 5 1338854 o an integer _:r1012106 literal data
#> ... ................. ... .................................................. ........ .................................................. ...... .................................................. .... .................................................. ... .................................................. ... .................................................. ........ ..................................................
#> See problems(...) for more details.
df
#> # A tibble: 6,398,744 x 3
#> s p o
#> <chr> <chr> <int>
#> 1 flights:1 flights:year 2013
#> 2 flights:2 flights:year 2013
#> 3 flights:3 flights:year 2013
#> 4 flights:4 flights:year 2013
#> 5 flights:5 flights:year 2013
#> 6 flights:6 flights:year 2013
#> 7 flights:7 flights:year 2013
#> 8 flights:8 flights:year 2013
#> 9 flights:9 flights:year 2013
#> 10 flights:10 flights:year 2013
#> # ... with 6,398,734 more rows
rdf_free(triplestore)
Or on remote virtuoso:
triplestore <- rdf(storage = "virtuoso",
user = "dba",
password = "dba",
host = "virtuoso:1111"
)
system.time(
read_nquads("flights.nq", rdf = triplestore)
)
#> user system elapsed
#> 346.051 418.774 5738.233
Postgres and MySQL are ubiquitious relational databases. This backend requires the redland
binaries are built from source with this support enabled, which is not the case for pre-built Mac or Linux binaries.
triplestore <- rdf(storage = "postgres",
host = "postgres",
user = "postgres",
password = "rdflib",
new_db = TRUE)
#> Warning in rdf_storage(storage, world, host, port, user, password,
#> database, : postgres driver not found. Falling back on in-memory storage
system.time(
read_nquads("flights.nq", rdf = triplestore)
)
#> user system elapsed
#> 91.684 66.828 68784.126
system.time(
df <- rdf_query(triplestore, query)
)
#> user system elapsed
#> 6.975 0.000 6.977
rdf_free(triplestore)
triplestore <- rdf(storage = "mysql",
host = "mariadb",
user = "root",
password = "rdflib",
database = "mysql",
new_db = TRUE
)
read_nquads("flights.nq", rdf = triplestore)
#> Total of 6398744 triples, stored in mysql
#> -------------------------------
#>
#> (preview supressed for performance)
system.time(df <- rdf_query(triplestore, query))
#> user system elapsed
#> 29.511 26.074 340.782
rdf_free(triplestore)
SQLite is relatively easy to set up, but appears to have rather poor overall performance. Requires SQLite development libraries installed (should work ‘out-of-the-box’ with Mac binaries for redland
package).
triplestore <- rdf(storage="sqlite", new_db = TRUE, name="rdflib.sqlite")
system.time(
read_nquads("flights.nq", rdf = triplestore)
)
#> user system elapsed
#> 9309.997 2184.927 32132.714
system.time(
df <- rdf_query(triplestore, query)
)
#> user system elapsed
#> 92.374 0.239 99.388
rdf_free(triplestore)
Getting full support for the above backend databases through the redland
R package is not trivial. The redland
R package provides bindings to the redland libraries. Unfortunately, commonly available binary versions of those libraries, such as librdf0-dev
on Debian, redland
on Mac OSX brew, and the statically linked versions for Mac and Windows shipping in the R package, do not build those libraries with the optional support for all backends. (NOTE: it is the C library itself which must be built from source with these options, not just the R package source). Consequently, users must build librdf
from the original sources, https://github.com/dajobe/librdf with all backend linking libraries available, and then also build the redland
R package from source, to be able to access all of these backends. On a Debian or Ubuntu system this looks like the following:
apt-get update && apt-get -yq install \
libxml2-dev \
libcurl4-openssl-dev \
libssl-dev \
git \
automake \
libtool \
gtk-doc-tools \
bison \
flex \
libgmp-dev \
libmhash-dev \
libgcrypt20-dev \
libpcre3-dev \
libv8-dev \
libjq-dev \
libpq-dev \
libdb-dev \
libsqlite3-dev \
libmariadbclient-dev \
librdf-storage-virtuoso \
virtuoso-server \
unixodbc-dev
Now we can build raptor
(parsers), rasqal
(sparql queries) and rdflib
from source:
git clone git://github.com/dajobe/raptor.git && \
cd raptor && \
./autogen.sh && \
make && \
make install && \
cd .. && \
git clone git://github.com/dajobe/rasqal.git && \
cd rasqal && \
./autogen.sh && \
make && \
make install && \
cd .. && \
git clone git://github.com/dajobe/librdf.git && \
cd librdf && \
./autogen.sh && \
make && \
make install
A See the Dockerfile in inst/docker
for an example of this, or simply use the Rocker-based image ropensci/rdflib
.
rdflib
uses Circle-CI to test database backends using docker-compose
. docker-compose
pulls down dedicated docker containers for postgres
and mariadb
, along with the ropensci/rdflib
container, which includes a version of redland
compiled with support for all five major backend storage systems. See the Dockerfile in inst/docker
and associated [docker-compose.yml]((https://github.com/ropensci/rdflib/tree/master/docker-compose.yml) used in testing for an example of this configuration. You can also pull the Docker image ropensci/rdflib
from Docker Hub for testing with all these libraries installed.
The badge below (also on the package README) indicates that these dedicated tests are passing.