Immunarch data format

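immunarch comes with its own data format: a tab-delimited table. The example output further down this page shows the following columns: Clones, Proportion, CDR3.nt, CDR3.aa, V.name, D.name, J.name, V.end, D.start, D.end, J.start, VJ.ins, VD.ins, DJ.ins, and Sequence.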

Input / output

The package provides several IO functions:

repLoad has the .format argument that sets the format of the input repertoire files. Omit it if you want immunarch to detect the formats and parse the files automatically. To force the package to parse the data in a specific format, pass the format name to this argument, for example .format = "immunoseq".

These parsers will be available soon.

Please contact us if there are more file formats you want to be supported.

To parse IgBLAST results, process the data with MigMap first.

You can load the data either from a single file or from a folder with repertoire files. A single file can be loaded as follows:

# To load the data from a single file without forcing the data format:
immdata <- repLoad("path/to/your/folder/immunoseq_1.txt")

# To load the data from a single ImmunoSEQ file go with:
immdata <- repLoad("path/to/your/folder/immunoseq_1.txt", .format = "immunoseq")

When you load a folder, you may want to provide a metadata file and place it in that folder. The file must be named exactly “metadata.txt”.

# For instance, suppose your folder has the following structure:
# >_ ls
# immunoseq_1.txt
# immunoseq_2.txt
# immunoseq_3.txt
# metadata.txt

If the metadata file is present, repLoad returns a list with two elements, data and meta. The repertoires are then accessible from immdata$data.

Otherwise, repLoad returns a list with one element per repertoire file. The repertoires are then accessible directly from immdata.

# To load the whole folder with every file in it type:

immdata <- repLoad("path/to/your/folder/")

# To get the "data" and "meta" elements, your folder must contain a metadata file named
# exactly "metadata.txt".

# In R, when you load your data:
# > immdata <- repLoad("path/to/your/folder/")
# > names(immdata)
# [1] "data" "meta"

# Suppose you do not have "metadata.txt":
# > immdata <- repLoad("path/to/your/folder/")
# > names(immdata)
# [1] "immunoseq_1" "immunoseq_2" "immunoseq_3"
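
Because the shape of the result depends on whether "metadata.txt" was found, downstream code may want to handle both cases. Below is a minimal sketch in base R that uses mock lists in place of real repLoad output. The helper name get_repertoires is hypothetical and not part of immunarch.

```r
# Hypothetical helper (not part of immunarch): return the list of repertoires
# whether or not a metadata file was found when loading.
get_repertoires <- function(x) {
  # With metadata, the result has exactly the elements "data" and "meta"
  if (is.list(x) && setequal(names(x), c("data", "meta"))) {
    x$data
  } else {
    x
  }
}

# Mock objects standing in for repLoad output (toy data, for illustration only)
reps <- list(
  immunoseq_1 = data.frame(Clones = c(3, 1)),
  immunoseq_2 = data.frame(Clones = 2)
)
with_meta <- list(data = reps, meta = data.frame(Sample = names(reps)))
without_meta <- reps

names(get_repertoires(with_meta))
names(get_repertoires(without_meta))
```

Both calls return the same repertoire names, so later steps do not need to know which layout was loaded.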

The metadata has to be a tab-delimited file whose first column is named “Sample”, followed by any number of additional columns with arbitrary names. The first column should contain the base names, without extensions, of the repertoire files in your folder.

Sample Sex Age Status
immunoseq_1 M 1 C
immunoseq_2 M 2 C
immunoseq_3 F 3 A
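
A metadata file of this shape can be produced and checked with base R alone. The sketch below is an illustration under stated assumptions: the folder and the three empty placeholder files are created in a temporary directory purely for the example, and the column values are the ones from the table above.

```r
# Toy folder with three empty placeholder files (illustration only)
folder <- file.path(tempdir(), "repertoires")
dir.create(folder, showWarnings = FALSE)
files <- file.path(folder, paste0("immunoseq_", 1:3, ".txt"))
invisible(file.create(files))

# Sample names are the file base names without extensions
samples <- tools::file_path_sans_ext(basename(files))

meta <- data.frame(
  Sample = samples,
  Sex = c("M", "M", "F"),
  Age = c(1, 2, 3),
  Status = c("C", "C", "A"),
  stringsAsFactors = FALSE
)

# Write a tab-delimited file named exactly "metadata.txt"
meta_path <- file.path(folder, "metadata.txt")
write.table(meta, meta_path, sep = "\t", quote = FALSE, row.names = FALSE)

# Read it back and check the two requirements: the first column is "Sample",
# and every sample name matches a file base name
check <- read.delim(meta_path, stringsAsFactors = FALSE)
stopifnot(names(check)[1] == "Sample")
stopifnot(all(check$Sample %in% samples))
```

Deriving the Sample column from the file names, rather than typing it by hand, avoids mismatches between the metadata and the repertoire files.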

To import data from an external database, create a connection to the database and then load the data. Make sure that the table format in your database matches immunarch’s format.

To illustrate the use of an external database, here is an example that writes repertoires to a local MonetDB database and loads them back:

# Your list of repertoires in immunarch's format
DATA
# Metadata data frame
META

# Create a temporary directory
dbdir = tempdir()

# Create a DBI connection to MonetDB in the temporary directory.
con = DBI::dbConnect(MonetDBLite::MonetDBLite(), embedded = dbdir)

# Write each repertoire to MonetDB. Each table is named after the corresponding element of DATA
for (i in seq_along(DATA)) {
  DBI::dbWriteTable(con, names(DATA)[i], DATA[[i]], overwrite=TRUE)
}

# Create a source in the temporary directory with MonetDB
ms = MonetDBLite::src_monetdblite(dbdir = dbdir)
res_db = list()

# Load the data from MonetDB into dplyr tables
for (i in seq_along(DATA)) {
  res_db[[names(DATA)[i]]] = dplyr::tbl(ms, names(DATA)[i])
}

# Your data is ready to use
list(data = res_db, meta = META)

You might want to build a list with two elements: data, a list of repertoire tables, and meta, a table with the metadata:

# Load the data into the immdata variable. The metadata file "metadata.txt" will be found automatically.
immdata = repLoad("your_folder", "optionally_your_format")
# Repertoires
immdata$data
# Metadata
immdata$meta

immunarch is compatible with following sources:

Basic data manipulations with dplyr and immunarch

You can find the introduction to dplyr here: https://cran.rstudio.com/web/packages/dplyr/vignettes/introduction.html

Get the most abundant clonotypes

The top function returns the most abundant clonotypes for the given repertoire:

top(immdata$data[[1]])
## # A tibble: 11 x 15
##    Clones Proportion CDR3.nt CDR3.aa V.name D.name J.name V.end D.start
##     <dbl>      <dbl> <chr>   <chr>   <chr>  <chr>  <chr>  <int>   <int>
##  1    206    0.0206  TGCGCC… CASSQE… TRBV4… TRBD1  TRBJ2…    16      18
##  2    193    0.0193  TGCGCC… CASSYR… TRBV4… TRBD1  TRBJ2…    11      13
##  3     79    0.0079  TGTGCC… CATSTN… TRBV15 TRBD1  TRBJ2…    11      16
##  4     64    0.0064  TGTGCC… CATSIG… TRBV15 TRBD2  TRBJ2…    11      19
##  5     62    0.0062  TGCGCC… CASQGD… TRBV4… TRBD1  TRBJ1…     8      13
##  6     56    0.0056  TGTGCC… CASSPW… TRBV27 TRBD1  TRBJ1…    11      16
##  7     54    0.0054  TGCGCC… CASSQD… TRBV4… TRBD1  TRBJ2…    16      21
##  8     35    0.0035  TGTGCC… CASSEE… TRBV2  TRBD1  TRBJ1…    15      17
##  9     34    0.0034  TGCGCC… CASSQP… TRBV4… TRBD1  TRBJ2…    14      23
## 10     29    0.00290 TGTGCC… CASSWV… TRBV6… TRBD1  TRBJ2…    12      20
## 11     29    0.00290 TGTGCC… CASSFR… TRBV27 TRBD1  TRBJ2…    13      16
## # … with 6 more variables: D.end <int>, J.start <int>, VJ.ins <dbl>,
## #   VD.ins <dbl>, DJ.ins <dbl>, Sequence <lgl>
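
To make the idea concrete, here is a minimal base R sketch of selecting the most abundant clonotypes from a toy table. It is not the implementation of top: the function name top_sketch, the default of three rows, and the data values are all assumptions made for illustration.

```r
# Toy repertoire with a few of the columns shown above (values are made up)
toy <- data.frame(
  Clones = c(5, 40, 12, 40, 1),
  CDR3.aa = c("CASSA", "CASSB", "CASSC", "CASSD", "CASSE"),
  stringsAsFactors = FALSE
)
toy$Proportion <- toy$Clones / sum(toy$Clones)

# Hypothetical helper: sort by clone count, most abundant first,
# and keep the first n rows
top_sketch <- function(df, n = 3) {
  head(df[order(df$Clones, decreasing = TRUE), ], n)
}

top_sketch(toy)
```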

Filter functional / non-functional / in-frame / out-of-frame clonotypes

Conveniently, the filtering functions are vectorised over lists of data frames: coding(immdata$data) returns a list of data frames with coding sequences. The example below applies coding to a single repertoire:

coding(immdata$data[[1]])

The noncoding function operates in a similar fashion:

noncoding(immdata$data[[1]])

Counting the in-frame clonotypes is then straightforward:

nrow(inframes(immdata$data[[1]]))

And for the out-of-frame clonotypes:

nrow(outofframes(immdata$data[[1]]))
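
The pattern of a filter that accepts either one data frame or a list of data frames can be sketched in base R as follows. This is not immunarch's implementation: inframe_sketch is a hypothetical function, and it assumes the common convention that an in-frame CDR3 nucleotide sequence has a length divisible by three. The toy sequences are made up.

```r
# Hypothetical filter (not immunarch's implementation): keep clonotypes whose
# CDR3 nucleotide sequence length is a multiple of three.
inframe_sketch <- function(x) {
  if (is.data.frame(x)) {
    x[nchar(x$CDR3.nt) %% 3 == 0, , drop = FALSE]
  } else {
    # A plain list of data frames: apply the filter to each element
    lapply(x, inframe_sketch)
  }
}

toy <- list(
  A = data.frame(CDR3.nt = c("TGCGCCAGC", "TGCGCCAG"), stringsAsFactors = FALSE),
  B = data.frame(CDR3.nt = c("TGTGCC", "TGTGCCAGT", "TGT"), stringsAsFactors = FALSE)
)

# Number of retained clonotypes per repertoire
sapply(inframe_sketch(toy), nrow)
```

Combining the filter with sapply and nrow gives per-repertoire counts in one call, which mirrors how the counts above could be computed for every repertoire at once.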

Get subset of clonotypes with a specific V gene

It is simple to subset a data frame by the values in a given column with dplyr's filter. In the example, the resulting data frame contains only records with the ‘TRBV10-1’ V gene:

filter(immdata$data[[1]], V.name == 'TRBV10-1')
## # A tibble: 26 x 15
##    Clones Proportion CDR3.nt CDR3.aa V.name D.name J.name V.end D.start
##     <dbl>      <dbl> <chr>   <chr>   <chr>  <chr>  <chr>  <int>   <int>
##  1      2     0.0002 TGCGCC… CASSES… TRBV1… TRBD2  TRBJ2…    16      20
##  2      2     0.0002 TGCGCC… CASSDG… TRBV1… TRBD1  TRBJ2…    13      15
##  3      2     0.0002 TGCGCC… CASSGD… TRBV1… TRBD2  TRBJ2…     8      10
##  4      1     0.0001 TGCGCC… CASSEA… TRBV1… TRBD2  TRBJ2…    14      21
##  5      1     0.0001 TGCGCC… CATLRS… TRBV1… TRBD1  TRBJ2…     6       7
##  6      1     0.0001 TGCGCC… CASSES… TRBV1… TRBD2  TRBJ2…    16      20
##  7      1     0.0001 TGCGCC… CASSES… TRBV1… TRBD2  TRBJ2…    16      17
##  8      1     0.0001 TGCGCC… CASRAS… TRBV1… TRBD2  TRBJ2…    10      13
##  9      1     0.0001 TGCGCC… CASRGS… TRBV1… TRBD1  TRBJ2…    10      11
## 10      1     0.0001 TGCGCC… CASRRD… TRBV1… TRBD1  TRBJ2…     8      13
## # … with 16 more rows, and 6 more variables: D.end <int>, J.start <int>,
## #   VJ.ins <dbl>, VD.ins <dbl>, DJ.ins <dbl>, Sequence <lgl>
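
The same kind of subsetting can be written in base R, which may help when dplyr is not attached. The sketch below uses a made-up toy table that has only the Clones and V.name columns.

```r
# Toy repertoire (values are made up for illustration)
toy <- data.frame(
  Clones = c(2, 1, 7, 3),
  V.name = c("TRBV10-1", "TRBV15", "TRBV10-1", "TRBV27"),
  stringsAsFactors = FALSE
)

# Base R equivalent of filter(toy, V.name == "TRBV10-1")
one_gene <- toy[toy$V.name == "TRBV10-1", ]

# Keep clonotypes matching any of several V genes
several <- toy[toy$V.name %in% c("TRBV15", "TRBV27"), ]
```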

Downsampling

ds = repSample(immdata$data, "downsample", 100)
sapply(ds, nrow)
## A2-i129 A2-i131 A2-i133 A2-i132 A4-i191 A4-i192     MS1     MS2     MS3 
##      89      99      97     100      88      91      84      99      85 
##     MS4     MS5     MS6 
##      99      92     100

ds = repSample(immdata$data, "sample", .n = 10)
sapply(ds, nrow)
## A2-i129 A2-i131 A2-i133 A2-i132 A4-i191 A4-i192     MS1     MS2     MS3 
##      10      10      10      10      10      10      10      10      10 
##     MS4     MS5     MS6 
##      10      10      10
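
In the outputs above, "downsample" with 100 yields at most 100 rows per repertoire, while "sample" with .n = 10 yields exactly 10 rows. One reading consistent with these numbers, stated here as an assumption rather than as the package's implementation, is that "downsample" draws individual clones and "sample" draws clonotypes (rows). The base R sketch below illustrates that distinction on made-up data; both function names are hypothetical.

```r
set.seed(42)  # make the toy example reproducible

toy <- data.frame(
  CDR3.aa = c("CASSA", "CASSB", "CASSC", "CASSD"),
  Clones = c(50, 30, 15, 5),
  stringsAsFactors = FALSE
)

# Sketch of "downsample": draw n clones without replacement, then count how
# many of the drawn clones belong to each clonotype.
downsample_sketch <- function(df, n) {
  pool <- rep(seq_len(nrow(df)), times = df$Clones)  # one entry per clone
  drawn <- sample(pool, n)
  counts <- tabulate(drawn, nbins = nrow(df))
  out <- df[counts > 0, , drop = FALSE]
  out$Clones <- counts[counts > 0]
  out
}

# Sketch of "sample": draw n clonotypes (rows) without replacement
sample_sketch <- function(df, n) {
  df[sample(nrow(df), n), , drop = FALSE]
}

ds <- downsample_sketch(toy, 20)
sum(ds$Clones)               # total number of drawn clones, equal to n
nrow(ds)                     # number of distinct clonotypes drawn, at most 4
nrow(sample_sketch(toy, 2))  # number of sampled rows, equal to n
```

Under this reading, the row count after drawing clones can fall below the requested number because several drawn clones may belong to the same clonotype.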