Unarchive a list of compressed tsv files into a database

Usage

unark(files, db_con, streamable_table = streamable_base_tsv(),
  lines = 50000L, overwrite = "ask", ...)

Arguments

files

vector of filenames to be read in. Files must be in tsv format; at present they may optionally be compressed in bzip2, gzip, zip, or xz format.

db_con

a database src (src_dbi object from dplyr)

streamable_table

interface for serializing/deserializing in chunks

lines

number of lines to read in a chunk.

overwrite

should any existing tables of the same name be overwritten? The default, "ask", will ask for confirmation in an interactive session and will overwrite in a non-interactive script. TRUE will always overwrite, FALSE will always skip such tables. (A fully spelled-out call is sketched after this argument list.)

...

additional arguments passed to the streamable_table$read() method.
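For illustration, a fully spelled-out call might look like the following sketch, where my_files and con are hypothetical stand-ins for a vector of compressed tsv paths and a database src:

## A minimal sketch; `my_files` and `con` are hypothetical names
unark(my_files, con,
  streamable_table = streamable_base_tsv(),
  lines = 100000L,    # larger chunks: faster transfer, more memory
  overwrite = FALSE)  # never overwrite tables that already exist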

Value

the database connection (invisibly)

Details

unark will read in files in chunks and write them into a database. This is essential for processing large compressed tables that may be too large to read into memory before writing into a database. In general, increasing the lines parameter results in a faster total transfer, but requires more free memory to work with the larger chunks.
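To make the chunking strategy concrete, the following is a conceptual sketch only, not arkdb's actual implementation; stream_tsv_to_db is a hypothetical helper, and con is assumed to be a plain DBI connection rather than a dplyr src:

library(DBI)

stream_tsv_to_db <- function(file, con, tablename, lines = 50000L) {
  inp <- gzfile(file, "r")  # gzfile() transparently reads gzip, bzip2, xz, or plain text
  on.exit(close(inp))
  col_names <- strsplit(readLines(inp, n = 1L), "\t")[[1]]  # header row
  repeat {
    chunk <- readLines(inp, n = lines)  # read one chunk of raw lines
    if (length(chunk) == 0L) break
    df <- read.delim(text = chunk, header = FALSE, col.names = col_names)
    DBI::dbWriteTable(con, tablename, df, append = TRUE)  # append chunk; full table never held in memory
  }
  invisible(con)
}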

If using a readr-based streamable table, you can suppress the progress bar by setting options(readr.show_progress = FALSE) before reading in large files.
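For example, with the readr-based streamer (streamable_readr_tsv() is provided by arkdb; files and new_db here refer to the objects created in the Examples below):

options(readr.show_progress = FALSE)  # silence readr's progress bar
unark(files, new_db, streamable_table = streamable_readr_tsv())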

Examples

## Setup: create an archive.
library(dplyr)
dir <- tempdir()
db <- dbplyr::nycflights13_sqlite(tempdir())

## database -> .tsv.bz2
ark(db, dir)
#> Warning: overwriting airlines.tsv.bz2
#> Exporting airlines in 50000 line chunks:
#> ...Done! (in 0.004136801 secs)
#> Warning: overwriting airports.tsv.bz2
#> Exporting airports in 50000 line chunks:
#> ...Done! (in 0.02498388 secs)
#> Warning: overwriting flights.tsv.bz2
#> Exporting flights in 50000 line chunks:
#> ...Done! (in 14.67398 secs)
#> Warning: overwriting planes.tsv.bz2
#> Exporting planes in 50000 line chunks:
#> ...Done! (in 0.04430103 secs)
#> Warning: overwriting weather.tsv.bz2
#> Exporting weather in 50000 line chunks:
#> ...Done! (in 0.9905419 secs)
## list all files in archive (full paths)
files <- list.files(dir, "[.]tsv\\.bz2$", full.names = TRUE)

## Read archived files into a new database (another sqlite in this case)
new_db <- src_sqlite(file.path(dir, "local.sqlite"), create = TRUE)
unark(files, new_db)
#> Importing /var/folders/y8/0wn724zs10jd79_srhxvy49r0000gn/T//RtmpvRHOJK/airlines.tsv.bz2 in 50000 line chunks:
#> ...Done! (in 0.01972699 secs)
#> Importing /var/folders/y8/0wn724zs10jd79_srhxvy49r0000gn/T//RtmpvRHOJK/airports.tsv.bz2 in 50000 line chunks:
#> ...Done! (in 0.03218198 secs)
#> Importing /var/folders/y8/0wn724zs10jd79_srhxvy49r0000gn/T//RtmpvRHOJK/flights.tsv.bz2 in 50000 line chunks:
#> ...Done! (in 9.822594 secs)
#> Importing /var/folders/y8/0wn724zs10jd79_srhxvy49r0000gn/T//RtmpvRHOJK/planes.tsv.bz2 in 50000 line chunks:
#> ...Done! (in 0.04933119 secs)
#> Importing /var/folders/y8/0wn724zs10jd79_srhxvy49r0000gn/T//RtmpvRHOJK/weather.tsv.bz2 in 50000 line chunks:
#> ...Done! (in 0.534965 secs)
## Prove table is returned successfully.
tbl(new_db, "flights")
#> # Source:   table<flights> [?? x 19]
#> # Database: sqlite 3.22.0
#> #   [/var/folders/y8/0wn724zs10jd79_srhxvy49r0000gn/T//RtmpvRHOJK/local.sqlite]
#>     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
#>    <int> <int> <int>    <int>          <int>     <int>    <int>          <int>
#>  1  2013     1     1      517            515         2      830            819
#>  2  2013     1     1      533            529         4      850            830
#>  3  2013     1     1      542            540         2      923            850
#>  4  2013     1     1      544            545        -1     1004           1022
#>  5  2013     1     1      554            600        -6      812            837
#>  6  2013     1     1      554            558        -4      740            728
#>  7  2013     1     1      555            600        -5      913            854
#>  8  2013     1     1      557            600        -3      709            723
#>  9  2013     1     1      557            600        -3      838            846
#> 10  2013     1     1      558            600        -2      753            745
#> # ... with more rows, and 11 more variables: arr_delay <int>, carrier <chr>,
#> #   flight <int>, tailnum <chr>, origin <chr>, dest <chr>, air_time <int>,
#> #   distance <int>, hour <int>, minute <int>, time_hour <dbl>
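When finished, it is common to disconnect from the database and remove the temporary files. A minimal clean-up sketch (not part of the captured output above):

## Clean up: `new_db$con` assumes a src_dbi object, which stores its DBI connection in $con
DBI::dbDisconnect(new_db$con)
unlink(dir, TRUE)  # remove the temporary archive directory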