﻿Setting Up A Subject List
==========================
Overview
--------
For C-PAC to run an analysis, it must have at least one pipeline configuration file and one subject list.  The creation of a pipeline configuration file is treated in the next section.  This section will deal with how to create a C-PAC subject list, which specifies a set of subjects, where to find each subject's associated image files, and (optionally) scan parameter information for use during :doc:`Slice Timing Correction </func>`.  There are two ways to set up such a list:

* Using a text editor (useful for servers where using the C-PAC GUI is not possible or impractical)
* Using the subject list builder interface in the C-PAC GUI

The first method requires you to create a data configuration file, which encodes settings that you would normally enter through the subject list builder GUI (such as a template file path used to grab T1 anatomical data; see *Defining Anatomical and Functional File Path Templates* below).  Both methods will result in the creation of a list that can be re-used for future runs with different pipelines.

Using a Text Editor
-------------------
Data configuration files are stored as YAML files, which are text files meant to encode data in a human-readable markup (see here  for more details).  Each of the parameters used by C-PAC to assemble your subject list can be specified as key-value pairs, so a data configuration YAML might have multiple lines of the form::

    key : value

An example of such a YAML file can be found here .  You can create your own configuration file using your favorite text editor and the table of keys and their potential values in the table below.

.. csv-table::
    :header: "Key","Description","Potential Values"
    :widths: 5,30,15
    :file: _static/params/data_config.csv

Generating Subject Lists with the Data Configuration File
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
When you are done setting up the data configuration file, navigate in a terminal to the directory where you would like to store your subject list.  If you are not running slice timing correction, or all subjects within a site have the same acquisition order, type the following command:

.. code-block:: bash

    cpac_setup.py /path/to/data_config

If you are running slice timing correction and subjects within a site have differing acquisition orders open iPython by typing :file:`ipython` and run the following command from the iPython prompt::

  import CPAC
  CPAC.utils.extract_data_multiscan_params.run('/path/to/data_config.yml')

Either of these methods will produce the subject list in your current directory, as well as two other files needed to run group-level analyses:

* :file:`CPAC_subject_list_<name>.yml` - The subject list used when running preprocessing and individual-level analyses. Contains subject IDs and paths to their associated data files.  Subject lists are also stored as YAMLs.
* :file:`template_phenotypic.csv` - A template phenotypic file for use with the :doc:`group-level analysis builder</group_analysis>`.
* :file:`subject_list_group_analysis.txt` - The subject list used when running group-level analyses.

Note: An alternate version of the phenotype template and the group analysis subject list will be created that are pre-formatted for use in repeated measures or within-subject group-level analysis.

For INDI releases, we have available :doc:`preconfigured data configuration files </files>`, to generate subject lists. You will need to make sure that the file templates have been modified to match the specific location on your system for these files.

Using the GUI
--------------
First, open the C-PAC GUI by entering the command ``cpac_gui`` in a terminal window. You will be presented with the main C-PAC window.

.. figure:: /_images/main_gui.png

Now, to generate a subject list file, click on the *New* button next to the *Subject Lists* box.  This will bring up the following dialog window, which will allow you to define the parameters of the data configuration YAML:

.. figure:: /_images/subject_list_gui.png

#. **Data format - [BIDS, Custom]:** Select if data is organized using the BIDS  standard or a custom format (see *Defining Anatomical and Functional File Path Templates* below for more information on forming templates for custom data organization).

#. **BIDS Base Directory - [path]:** The base directory of BIDS-organized data if you are using BIDS.

#. **Anatomical File Path Template - [text]:** A file path template for anatomical scans (see below for how to define templates).  Used when a custom data organization scheme is enabled.

#. **Functional File Path Template - [text]:** A file path template for functional scans (see below for how to define templates).  Used when a custom data organization scheme is enabled.

#. **Subjects to Include (Optional) - [text/path]:** An optional comma-separated list of subjects to include (if you wish to include only a subset of subjects whose scans match the templates).  A value of 'None' means all subjects will be run. This list can also be read from a text file whose path is given in this field.

#. **Subjects to Exclude (Optional) - [text/path]:** An optional comma-separated list of subjects to exclude (if you wish to exclude only a subset of subjects whose scans match the templates).  A value of 'None' means all subjects will be run. This list can also be read from a text file whose path is given in this field.

#. **Sites to Include (Optional) - [text/path]:** An optional comma-separated list of sites to include (if you wish to include only a subset of sites whose scans match the templates).  A value of 'None' means all sites will be run. This list can also be read from a text file whose path is given in this field.

#. **Scan Parameters File (Optional) - [path]:**  Path to a CSV specifying the slice time acquisition parameters for scans.  If set to 'None', these parameters will either be defined by the NifTI headers or by an explicit slice order specified in the pipeline configuration builder. Instructions for creating such a file can be found :doc:`here </func>`.  Note that a scan parameters file will also override the settings used in the slice timing screen of the pipeline configuration builder.

#. **AWS Credentials File (Optional)- [path]:** Required if downloading data from a non-public S3  bucket on Amazon Web Services instead of using local files.

#. **Output Directory - [path]:** The directory where the subject list will be generated.

#. **Subject List Name - [text]:** A name for the subject list.

#. **Multiscan Data [checkbox]:** Check this box only if the scans have different slice timing information.

Defining Anatomical and Functional File Path Templates
------------------------------------------------------
C-PAC has been designed to process large, complex data sets, and supports processing of multiple scans per participant, multiple scan sessions, and multiple data acquisition sites with differing scan acquisition parameters. Because the file directory structures resulting from such data sets can be complex, it is necessary to define the location of the image files to be processed for each participant.  If you are using the BIDS  directory structure, the location of image files is defined via the BIDS specification, and C-PAC is able to automatically find all of the files needed to run the pipeline.  Otherwise, you will need to manually define path templates using the conventions described below.

File path templates are defined by adding placeholder variables nested within curly brackets to the file path of each type of image file to be processed (anatomical and functional).  The possible placeholders are `site`, `participant` and `session`.  `session` is optional, while the other two placeholders are not.  To illustrate, if the full paths to the image files for a hypothetical :file:`participant_1` were:: 

  /home/data/site_1/participant_1/anat/mprage.nii.gz
  /home/data/site_1/participant_1/func/rest.nii.gz

Then the anatomical and functional file path templates would would be::

   Anatomical Template:  /home/data/{site}/{participant}/anat/mprage.nii.gz
   Functional Template:  /home/data/{site}/{participant}/func/rest.nii.gz

It should be noted that C-PAC currently requires participant directories to be within a site level directory. If your data set does not contain scans from multiple acquisition sites, we recommend creating a dummy :file:`site_1` directory and placing participant files inside this directory. This peculiarity will be fixed in future versions of C-PAC.

In cases where file paths differ in more than just the site, session and subject directories, asterisks can be used. This is useful for data sets containing multiple sessions in a single scan date or other subject-specific information in file or folder names. You may use as many asterisks as necessary to to define your file path templates. For example, if the paths to functional image files for a subject were::

    /home/data/site_1/subject_1/scan_1/session_1/rest_1.nii.gz
    /home/data/site_1/subject_1/scan_1/session_2/rest_2.nii.gz
    /home/data/site_1/subject_1/scan_2/session_1/rest_1.nii.gz
    /home/data/site_1/subject_1/scan_2/session_2/rest_2.nii.gz

The file path template would be::

    /home/data/{site}/{participant}/scan_*/{session}/rest_*.nii.gz

Example File Path Templates
^^^^^^^^^^^^^^^^^^^^^^^^^^^
Here are the file path templates used for the 1000 Functional Connectomes  data release, as well as an illustration of the directory structure used for the release::

   Anatomical Template:  /path/to/data/{site}/{participant}/anat/mprage_anonymized.nii.gz
   Functional Template:  /path/to/data/{site}/{participant}/func/rest.nii.gz

.. figure:: /_images/fcon_structure.png
  
Another example is the file structure used by the ABIDE  releases::

   Anatomical Template:  /path/to/data/{site}/{participant}/{session}/anat_*/mprage.nii.gz
   Functional Template:  /path/to/data/{site}/{participant}/{session}/rest_*/rest.nii.gz

.. figure:: /_images/abide_adhd_structure.png

A final example is the file structure used by the Enhanced Nathan Kline Institute-Rockland Sample ::

   Anatomical Template:  /path/to/data/{site}/{participant}/anat/mprage.nii.gz
   Functional Template:  /path/to/data/{site}/{participant}/{session}/RfMRI_*/rest.nii.gz

.. figure:: /_images/nki-rs_template.png

Users experiencing difficulties defining file path templates may want to re-organize their data to match one of the examples above. If you manually define a file path template and encounter an error when attempting to generate participant lists, please :doc:`contact us </help>` and we will be happy to help.

Subject List YAML Fields
^^^^^^^^^^^^^^^^^^^^^^^^^
The subject list builder GUI or the command line utility will produce a YAML file containing all of the subjects and various properties associated with that subject, such as its ID, session number, the location of its resting-state/functional and anatomical scans.  Before each subject definition there is a single line with a dash, which indicates that start of the property defintions.  Subject properties are indented under this dash.  To illustrate, see the sample subject definition below:

.. code-block:: yaml

    -
        subject_id: 'subj_1'
        unique_id: 'session_1'
        anat: '/data/subj_1/session_1/anat_1/mprage.nii.gz'
        rest: 
          rest_1_rest: '/data/subj_1/session_1/rest_1/rest.nii.gz'
          rest_2_rest: '/data/subj_1/session_1/rest_2/rest.nii.gz'
        scan_parameters:
            tr: '2.5'
            acquisition: 'seq+z'
            reference: '24'
            first_tr: ''
            last_tr: ''

Note that more than one resting state scan is defined under the `rest` key (i.e., multiple sessions), and that individual scan parameters can be defined to override the settings used in the C-PAC pipeline configuration GUI.  Be careful not to name the multiple sessions under the `rest` key using dashes or periods, as this input will not be parsed by C-PAC and may cause errors.
