_____________________

SUBSETTING 10X DATA
_____________________


Table of Contents
_________________

1. Extract Headers and Sequence from a FASTQ Barcode file
2. Extract 10X barcodes from the sequence file
3. Rank barcodes by frequency
4. Subselect a wide range of good to low quality barcodes
5. Map the barcodes back to the sequences and then back to the header they came from
6. Filter original FASTQ with wanted headers
7. Result


1 Extract Headers and Sequence from a FASTQ Barcode file
========================================================

Header and Sequence lines are a 1:1 mapping, so both output files
should have an identical number of lines.

,----
| zcat $file | awk 'NR%4==1' > header_only.txt
| zcat $file | awk 'NR%4==2' > sequence_only.txt
`----


2 Extract 10X barcodes from the sequence file
=============================================

This is CR3 data, so each read starts with a 16 bp cell barcode (CB)
followed by a 10 bp UMI at the end.

,----
| cut -b 1-16 sequence_only.txt > sequence_only.1_16CB.txt
`----


3 Rank barcodes by frequency
============================

We want to capture cells of good quality, so we rank barcodes by how
often they occur.

,----
| sort sequence_only.1_16CB.txt | uniq -c | sort -rnk1 > sequence_only.1_16CB.freqs.txt
`----


4 Subselect a wide range of good to low quality barcodes
========================================================

If we simply took the top 4,000 most prolific barcodes, they would
still account for 90% of the data, so we wouldn't save on anything.
Instead we take the top 40,000 and pick one barcode every N lines.

,----
| # $hed is the number of top barcodes kept (here 40000), $ev the stride (here 8).
| # \$2 must be escaped so that awk, not the shell, expands it.
| head -$hed sequence_only.1_16CB.freqs.txt | awk "NR%$ev==1 {print \$2}" > sequence_only.1_16CB.freqs.subselected.txt
`----


5 Map the barcodes back to the sequences and then back to the header they came from
===================================================================================

With our list of wanted barcode sequences, we map them back to the
original sequence file, and then use the 1:1 mapping of sequence to
header to collect the headers that match our wanted barcodes.
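As a toy illustration of the ranking and subselection logic in steps 3
and 4, the same idea can be sketched in Python (the function name and
the 4 bp barcodes below are made up for the example):

```python
from collections import Counter

def subselect_barcodes(barcodes, top_n, every_n):
    """Rank barcodes by frequency (descending) and keep every
    `every_n`-th one among the `top_n` most frequent, mirroring
    the sort | uniq -c | sort -rnk1 | head | awk pipeline."""
    ranked = [bc for bc, _ in Counter(barcodes).most_common(top_n)]
    # awk 'NR%ev==1' keeps lines 1, ev+1, 2*ev+1, ... (1-based),
    # i.e. indices 0, ev, 2*ev, ... in 0-based Python slicing.
    return ranked[::every_n]

# Made-up reads: four barcodes with distinct frequencies plus a singleton.
reads = ["AAAA"] * 5 + ["CCCC"] * 4 + ["GGGG"] * 3 + ["TTTT"] * 2 + ["ACGT"]
print(subselect_barcodes(reads, top_n=4, every_n=2))  # -> ['AAAA', 'GGGG']
```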
,----
| #!/usr/bin/python3
| import sys
|
| # Load the wanted barcodes into a lookup table.
| wanted_file = "sequence_only.1_16CB.freqs.subselected.txt"
| wanted_map = {}
| with open(wanted_file, 'r') as f:
|     for line in f:
|         barc = line.rstrip("\n")
|         if barc in wanted_map:
|             print("Error: duplicate barcode", barc)
|             sys.exit(1)
|         wanted_map[barc] = -1
|
| # Record the (0-based) line numbers whose barcode is wanted.
| raw_seqfile = "sequence_only.1_16CB.txt"
| line_wanted = {}
| with open(raw_seqfile, 'r') as f:
|     for i, line in enumerate(f):
|         if line.rstrip("\n") in wanted_map:
|             line_wanted[i] = True
|
| # Percentage of reads kept, out of the 30,462,643 total in this run.
| print(len(line_wanted) * 100 / 30462643)
|
| # Emit only the headers on wanted lines.
| raw_headfile = "header_only.txt"
| with open(raw_headfile, 'r') as f, open("header_only.wanted.txt", 'w') as g:
|     for i, line in enumerate(f):
|         if i in line_wanted:
|             print(line, sep="", end="", file=g)
`----

I took the top 40k most prevalent cell barcodes and selected 1 line
out of every 8 of them, leading to an 88% reduction in size.


6 Filter original FASTQ with wanted headers
===========================================

,----
| #!/usr/bin/env python3
|
| hfile = 'header_only.wanted.txt'
|
| # Store just the read ID (the first whitespace-separated token) of
| # each wanted header line.
| headers = {}
| with open(hfile, 'r') as h:
|     for line in h:
|         headers[line.split()[0]] = True
|
|
| def makeSubset(fname):
|     # Assumes standard 4-line FASTQ records; the extra readline()
|     # calls consume the sequence, '+' and quality lines of a match.
|     with open(fname, 'r') as r1, \
|          open(fname.split('.fastq')[0] + '.wanted.fastq', 'w') as g:
|         matched = 0
|         for line in r1:
|             if line.startswith("@"):
|                 match = line.split()[0]
|                 if match in headers:
|                     print(line, file=g, end="")
|                     print(r1.readline(), file=g, end="")
|                     print(r1.readline(), file=g, end="")
|                     print(r1.readline(), file=g, end="")
|                     matched += 1
|                     if matched % 1000 == 0:
|                         print(matched, end=" \r")
|     print(fname, matched)
|
|
| # rf1 and rf2 are the paths to the original R1/R2 FASTQ files.
| makeSubset(rf1)
| makeSubset(rf2)
`----


7 Result
========

This gave a good STARSolo run of 15 minutes, and a good run of the
default drops method in DropletUtils, but it failed at the emptyDrops
stage due to bad minimum thresholds.
For the more custom emptyDrops part of the tutorial, we will use the matrices generated from the full source files.