Unarchive a list of compressed tsv files into a database
unark(files, db_con, streamable_table = streamable_base_tsv(), lines = 50000L, overwrite = "ask", ...)
| files | vector of filenames to be read in. Must be |
|---|---|
| db_con | a database src ( |
| streamable_table | interface for serializing/deserializing in chunks |
| lines | number of lines to read in a chunk. |
| overwrite | should any existing text files of the same name be overwritten? default is "ask", which will ask for confirmation in an interactive session, and overwrite in a non-interactive script. TRUE will always overwrite, FALSE will always skip such tables. |
| ... | additional arguments to |
the database connection (invisibly)
unark will read in a files in chunks and
write them into a database. This is essential for processing
large compressed tables which may be too large to read into
memory before writing into a database. In general, increasing
the lines parameter will result in a faster total transfer
but require more free memory for working with these larger chunks.
If using readr-based streamable-table, you can suppress the progress bar
by using options(readr.show_progress = FALSE) when reading in large
files.
## Setup: create an archive. library(dplyr) dir <- tempdir() db <- dbplyr::nycflights13_sqlite(tempdir()) ## database -> .tsv.bz2 ark(db, dir)#> Warning: overwriting airlines.tsv.bz2#>#>#> Warning: overwriting airports.tsv.bz2#>#>#> Warning: overwriting flights.tsv.bz2#>#>#> Warning: overwriting planes.tsv.bz2#>#>#> Warning: overwriting weather.tsv.bz2#>#>## list all files in archive (full paths) files <- list.files(dir, "[.]tsv\\.bz2$", full.names = TRUE) ## Read archived files into a new database (another sqlite in this case) new_db <- src_sqlite(file.path(dir, "local.sqlite"), create=TRUE) unark(files, new_db)#>#>#>#>#>#>#>#>#>#>#> # Source: table<flights> [?? x 19] #> # Database: sqlite 3.22.0 #> # [/var/folders/y8/0wn724zs10jd79_srhxvy49r0000gn/T//RtmpvRHOJK/local.sqlite] #> year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time #> <int> <int> <int> <int> <int> <int> <int> <int> #> 1 2013 1 1 517 515 2 830 819 #> 2 2013 1 1 533 529 4 850 830 #> 3 2013 1 1 542 540 2 923 850 #> 4 2013 1 1 544 545 -1 1004 1022 #> 5 2013 1 1 554 600 -6 812 837 #> 6 2013 1 1 554 558 -4 740 728 #> 7 2013 1 1 555 600 -5 913 854 #> 8 2013 1 1 557 600 -3 709 723 #> 9 2013 1 1 557 600 -3 838 846 #> 10 2013 1 1 558 600 -2 753 745 #> # ... with more rows, and 11 more variables: arr_delay <int>, carrier <chr>, #> # flight <int>, tailnum <chr>, origin <chr>, dest <chr>, air_time <int>, #> # distance <int>, hour <int>, minute <int>, time_hour <dbl>