_____________________

SUBSETTING 10X DATA
_____________________


Table of Contents
_________________

1. Extract Headers and Sequence from a FASTQ Barcode file
2. Extract 10X barcodes from the sequence file
3. Rank barcodes by frequency
4. Subselect a wide range of good to low quality barcodes
5. Map the barcodes back to the sequences and then back to the header they came from
6. Filter original FASTQ with wanted headers
7. Result


1 Extract Headers and Sequence from a FASTQ Barcode file
========================================================

Header and Sequence lines are a 1:1 mapping, so both output files
should have an identical number of lines.

,----
| zcat $file | awk 'NR%4==1' > header_only.txt
| zcat $file | awk 'NR%4==2' > sequence_only.txt
`----


2 Extract 10X barcodes from the sequence file
=============================================

This is CR3 data, so each read starts with a 16 bp cell barcode (CB)
followed by a 10 bp UMI at the end.

,----
| cut -b 1-16 sequence_only.txt > sequence_only.1_16CB.txt
`----


3 Rank barcodes by frequency
============================

We want to capture cells of good quality, so we rank barcodes by how
often they occur.

,----
| sort sequence_only.1_16CB.txt | uniq -c | sort -rnk1 > sequence_only.1_16CB.freqs.txt
`----


4 Subselect a wide range of good to low quality barcodes
========================================================

If we simply took the top 4,000 most prolific barcodes, they would
still account for 90% of the data, so we wouldn't save on anything.
Instead we take the top 40,000 and pick one barcode every N lines.

,----
| # $hed is the number of top barcodes kept (here 40000), $ev the stride (here 8).
| # \$2 must be escaped so that awk, not the shell, expands it.
| head -$hed sequence_only.1_16CB.freqs.txt | awk "NR%$ev==1 {print \$2}" > sequence_only.1_16CB.freqs.subselected.txt
`----


5 Map the barcodes back to the sequences and then back to the header they came from
===================================================================================

With our list of wanted barcode sequences, we map them back to the
original sequence file, and then use the 1:1 mapping of sequence to
header to collect the headers that match our wanted barcodes.
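As a toy illustration of the ranking and subselection logic in steps 3
and 4, the same idea can be sketched in Python (the function name and
the 4 bp barcodes below are made up for the example):

```python
from collections import Counter

def subselect_barcodes(barcodes, top_n, every_n):
    """Rank barcodes by frequency (descending) and keep every
    `every_n`-th one among the `top_n` most frequent, mirroring
    the sort | uniq -c | sort -rnk1 | head | awk pipeline."""
    ranked = [bc for bc, _ in Counter(barcodes).most_common(top_n)]
    # awk 'NR%ev==1' keeps lines 1, ev+1, 2*ev+1, ... (1-based),
    # i.e. indices 0, ev, 2*ev, ... in 0-based Python slicing.
    return ranked[::every_n]

# Made-up reads: four barcodes with distinct frequencies plus a singleton.
reads = ["AAAA"] * 5 + ["CCCC"] * 4 + ["GGGG"] * 3 + ["TTTT"] * 2 + ["ACGT"]
print(subselect_barcodes(reads, top_n=4, every_n=2))  # -> ['AAAA', 'GGGG']
```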
,----
| #!/usr/bin/python3
| import sys
|
| # Load the wanted barcodes into a lookup table.
| wanted_file = "sequence_only.1_16CB.freqs.subselected.txt"
| wanted_map = {}
| with open(wanted_file, 'r') as f:
|     for line in f:
|         barc = line.rstrip("\n")
|         if barc in wanted_map:
|             print("Error: duplicate barcode", barc)
|             sys.exit(1)
|         wanted_map[barc] = -1
|
| # Record the (0-based) line numbers whose barcode is wanted.
| raw_seqfile = "sequence_only.1_16CB.txt"
| line_wanted = {}
| with open(raw_seqfile, 'r') as f:
|     for i, line in enumerate(f):
|         if line.rstrip("\n") in wanted_map:
|             line_wanted[i] = True
|
| # Percentage of reads kept, out of the 30,462,643 total in this run.
| print(len(line_wanted) * 100 / 30462643)
|
| # Emit only the headers on wanted lines.
| raw_headfile = "header_only.txt"
| with open(raw_headfile, 'r') as f, open("header_only.wanted.txt", 'w') as g:
|     for i, line in enumerate(f):
|         if i in line_wanted:
|             print(line, sep="", end="", file=g)
`----

I took the top 40k most prevalent cell barcodes and selected 1 line
out of every 8 of them, leading to an 88% reduction in size.


6 Filter original FASTQ with wanted headers
===========================================

,----
| #!/usr/bin/env python3
|
| hfile = 'header_only.wanted.txt'
|
| # Store just the read ID (the first whitespace-separated token) of
| # each wanted header line.
| headers = {}
| with open(hfile, 'r') as h:
|     for line in h:
|         headers[line.split()[0]] = True
|
|
| def makeSubset(fname):
|     # Assumes standard 4-line FASTQ records; the extra readline()
|     # calls consume the sequence, '+' and quality lines of a match.
|     with open(fname, 'r') as r1, \
|          open(fname.split('.fastq')[0] + '.wanted.fastq', 'w') as g:
|         matched = 0
|         for line in r1:
|             if line.startswith("@"):
|                 match = line.split()[0]
|                 if match in headers:
|                     print(line, file=g, end="")
|                     print(r1.readline(), file=g, end="")
|                     print(r1.readline(), file=g, end="")
|                     print(r1.readline(), file=g, end="")
|                     matched += 1
|                     if matched % 1000 == 0:
|                         print(matched, end=" \r")
|     print(fname, matched)
|
|
| # rf1 and rf2 are the paths to the original R1/R2 FASTQ files.
| makeSubset(rf1)
| makeSubset(rf2)
`----


7 Result
========

This gave a good STARSolo run of 15 minutes, and a good run of the
default drops method in DropletUtils, but it failed at the emptyDrops
stage due to bad minimum thresholds.
For the more custom emptyDrops part of the tutorial, we will use the matrices generated from the full source files.