# FICTURE real data results

This repository contains pixel-level factor inference results from FICTURE for five datasets. Except for the Seq-scope injured colon dataset, raw data were acquired from public sources; see our paper for more information.

For each dataset, there is one tab-delimited text file with the pixel-level factor inference result and one image file visualizing all factors.

## Datasets

### Seq-scope injured colon
```
SeqScope_InjCol.decode.prj_24.r_4_5.pixel.sorted.tsv.gz
SeqScope_InjCol.decode.prj_24.r_4_5.pixel.png
```

### Stereo-seq E16.5 mouse embryo E1S3
```
StereoSeq_E16.5_E1S3.nF48.d_15.decode.r_5_10.pixel.sorted.tsv.gz
StereoSeq_E16.5_E1S3.nF48.d_15.decode.r_5_10.pixel.png
```

### Xenium human breast cancer
```
Xenium_human_breast_cancer.nF20.d_12.decode.prj_12.r_4_5.pixel.sorted.tsv.gz
Xenium_human_breast_cancer.nF20.d_12.decode.prj_12.r_4_5.pixel.png
```

### Xenium human lung preview, non-disease sample
```
Xenium_human_lung_preview.nF36.d_12.decode.prj_12.r_4_5.pixel.sorted.tsv.gz
Xenium_human_lung_preview.nF36.d_12.decode.prj_12.r_4_5.pixel.png
```

### Vizgen MERSCOPE mouse liver
```
MERSCOPE_MouseLiver1Slice1.nF24.d_12.decode.prj_12.r_4_5.pixel.sorted.tsv.gz
MERSCOPE_MouseLiver1Slice1.nF24.d_12.decode.prj_12.r_4_5.pixel.png
```

## Data format

We store the top 3 factors and their corresponding posterior probabilities for each pixel in tab-delimited text files, compressed with `bgzip`.
As a temporary hack for faster access to specific regions in large datasets, we divided the data into blocks along one axis (X or Y), sorted the pixels within each block by the other axis, and indexed the file with `tabix`.
The first 3 lines of the file, starting with `##`, are metadata; the 4th line, starting with `#`, contains the column names.
To use the file as plain text, you can ignore this complication and read the file from the 4th line.

Taking `MERSCOPE_MouseLiver1Slice1.nF24.d_12.decode.prj_12.r_4_5.pixel.sorted.tsv.gz` as an example, the first few lines of the file are as follows:

```
##K=24;TOPK=3
##BLOCK_SIZE=500;BLOCK_AXIS=X;INDEX_AXIS=Y
##OFFSET_X=-31;OFFSET_Y=-97;SIZE_X=10329;SIZE_Y=9711;SCALE=100
#BLOCK  X       Y       K1      K2      K3      P1      P2      P3
0       42000   60050   1       11      23      8.27e-01        1.73e-01        1.05e-10
0       42100   60050   1       11      10      8.22e-01        1.78e-01        1.08e-09
0       42150   60050   1       11      10      7.68e-01        2.32e-01        5.00e-10
```

The 4th line contains the column names. From the 5th line on, each line contains the information for one pixel with coordinates `(X, Y)`, the top 3 factors indicated by `K1, K2, K3` and their corresponding posterior probabilities `P1, P2, P3`. Factors are 0-indexed.
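
The layout described above can be read with the Python standard library alone. The following is a minimal sketch, not part of FICTURE: the function names `parse_header_line`, `read_pixels`, and `read_pixel_file` are hypothetical, and it assumes every metadata line has the `##KEY=VALUE;KEY=VALUE` form shown in the example.

```python
import gzip


def _to_number(text):
    """Convert a field to int if possible, otherwise to float."""
    try:
        return int(text)
    except ValueError:
        return float(text)


def parse_header_line(line):
    """Parse one '##KEY=VALUE;KEY=VALUE' metadata line into a dict."""
    meta = {}
    for item in line.lstrip("#").strip().split(";"):
        key, value = item.split("=", 1)
        try:
            meta[key] = _to_number(value)
        except ValueError:
            meta[key] = value  # non-numeric values such as BLOCK_AXIS=X
    return meta


def read_pixels(lines):
    """Split an iterable of text lines into (metadata, column names, rows)."""
    meta, columns, rows = {}, [], []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        if line.startswith("##"):
            meta.update(parse_header_line(line))
        elif line.startswith("#"):
            columns = line.lstrip("#").split("\t")
        else:
            values = [_to_number(v) for v in line.split("\t")]
            rows.append(dict(zip(columns, values)))
    return meta, columns, rows


def read_pixel_file(path):
    """Read a whole bgzip-compressed result file (bgzip output is gzip-compatible)."""
    with gzip.open(path, "rt") as handle:
        return read_pixels(handle)
```

`read_pixel_file` loads every pixel into memory, which may be impractical for the larger files; in that case, iterating over the file handle and keeping only the rows of interest is a safer pattern.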

The 1st line indicates that the data is from a model with 24 factors (`K=24`) and that we store the top 3 factors for each pixel (`TOPK=3`).
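
Since only the top 3 of the `K` factors are kept, some downstream tools may prefer a dense vector per pixel. The sketch below is one possible way to build it; the function name `dense_posterior` is hypothetical, and setting the unstored factors to 0 is an assumption (their true posterior mass is simply not recorded in the file).

```python
def dense_posterior(factors, probs, n_factors):
    """Place the stored top-k probabilities into a length-n_factors list.

    Factors are 0-indexed, so valid indices are 0 .. n_factors - 1.
    Factors that were not stored are set to 0.0 (an assumption).
    """
    dense = [0.0] * n_factors
    for factor, prob in zip(factors, probs):
        dense[factor] = prob
    return dense


# First pixel of the example file: K1, K2, K3 = 1, 11, 23 with K=24
vector = dense_posterior([1, 11, 23], [8.27e-01, 1.73e-01, 1.05e-10], 24)
```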

The 2nd line indicates that the data is split into blocks along the X axis (`BLOCK_AXIS=X`) with a block size of 500 $\mu m$ (`BLOCK_SIZE=500`), and that within each block the data is sorted by the Y axis (`INDEX_AXIS=Y`).
The block IDs (first column in the file) are integer multiples of the block size (in $\mu m$), i.e. the 1st block, with $X \in [0, 500)$, has block ID 0, the 2nd block, with $X \in [500, 1000)$, has block ID 500, etc. Here $X$ is in $\mu m$, that is, the stored value divided by `SCALE`.
With this layout, the file can be (and has been) indexed with `tabix -f -S4 -s1 -b3 -e3`: skip the 4 header lines, use column 1 (the block ID) as the sequence name, and use column 3 (Y) as the position.
For example, to access pixels in a rectangle with $X \in [1000, 2000)$ and $Y \in [1200, 1400)$ (both in $\mu m$), we can use `tabix MERSCOPE_MouseLiver1Slice1.nF24.d_12.decode.prj_12.r_4_5.pixel.sorted.tsv.gz 1000:120000-140000 1500:120000-140000`. There is one region per block, and the Y range is given in stored units (`SCALE=100`).
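
The region strings passed to `tabix` above can be generated programmatically. The following is a hedged sketch rather than a FICTURE utility: `tabix_regions` is a hypothetical helper, and it assumes that block IDs are the lower bound of each block in $\mu m$, that the query bounds are given in $\mu m$ in the file's shifted frame (stored value divided by `SCALE`), and that the file is blocked along X and indexed along Y.

```python
def tabix_regions(x_min, x_max, y_min, y_max, block_size=500, scale=100):
    """Build tabix region strings for X in [x_min, x_max), Y in [y_min, y_max).

    All bounds are in micrometers. The sequence name is the block ID and
    the position range is Y in stored units (micrometers times scale).
    """
    y_lo = int(y_min * scale)
    y_hi = int(y_max * scale)
    regions = []
    # Start at the block that contains x_min, then step one block at a time.
    block = int(x_min // block_size) * block_size
    while block < x_max:
        regions.append(f"{block}:{y_lo}-{y_hi}")
        block += block_size
    return regions


regions = tabix_regions(1000, 2000, 1200, 1400)
# regions == ["1000:120000-140000", "1500:120000-140000"]
```

Two caveats apply. A block always spans the full block size, so when the X bounds are not multiples of `BLOCK_SIZE` the returned rows still need to be filtered on X. In addition, `tabix` treats the end of a range as inclusive, so rows with Y exactly equal to the upper bound are returned as well.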

The 3rd line describes the translation between the stored coordinates and the physical coordinates in $\mu m$.
If `(X, Y)` are the pixel coordinates read from the file, the physical coordinates in $\mu m$ are `(X / SCALE + OFFSET_X, Y / SCALE + OFFSET_Y)`.
In the above example, the raw Vizgen MERSCOPE mouse liver data contains negative coordinates, but for convenience we shifted all coordinates to be positive. `SIZE_X` and `SIZE_Y` record the size of the raw data in $\mu m$.
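
The conversion formula can be written as a pair of small helpers. This is an illustrative sketch: the function names `to_physical` and `to_stored` are hypothetical, and the `meta` dictionary below is copied by hand from the 3rd header line of the example file.

```python
# Values from the 3rd header line of the MERSCOPE example file
meta = {"OFFSET_X": -31, "OFFSET_Y": -97, "SCALE": 100}


def to_physical(x, y, meta):
    """Convert stored pixel coordinates to physical coordinates in micrometers."""
    return (x / meta["SCALE"] + meta["OFFSET_X"],
            y / meta["SCALE"] + meta["OFFSET_Y"])


def to_stored(px, py, meta):
    """Inverse of to_physical: physical micrometers back to stored units."""
    return (round((px - meta["OFFSET_X"]) * meta["SCALE"]),
            round((py - meta["OFFSET_Y"]) * meta["SCALE"]))


# First pixel of the example file, stored as (42000, 60050)
physical = to_physical(42000, 60050, meta)  # (389.0, 503.5)
```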
