### Script snippet to manipulate the dataset

The PSSM files generated by PSI-BLAST were parsed to retrieve the relative frequencies. The output matrices were saved as numpy arrays for efficient storage. The whole training set of 1832414 protein sequences was then split randomly into equal batches, each holding 10k proteins. Each batch is stored as a dictionary in which the key is the Uniref50 header of a sequence and the value is the corresponding numpy array of relative frequencies. Each batch was then serialized (converted to a byte stream) and saved to disk.
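
The batching and serialization steps described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the script used to build the dataset: the function name `split_into_batches`, the `batch_N` naming scheme, the fixed seed, and the toy matrices are all hypothetical, and plain nested lists stand in for the numpy arrays so that the sketch needs only the standard library.

```python
import pickle
import random


def split_into_batches(labels, batch_size, seed=0):
    """Randomly split a {header: matrix} dict into batches of at most batch_size."""
    headers = list(labels)
    random.Random(seed).shuffle(headers)  # fixed seed, so the split is reproducible
    batches = {}
    for start in range(0, len(headers), batch_size):
        # Hypothetical naming scheme: batch_0, batch_1, ...
        name = f"batch_{start // batch_size}"
        batches[name] = {h: labels[h] for h in headers[start:start + batch_size]}
    return batches


# Toy stand-in for the parsed PSSMs: one row of 20 relative frequencies per residue.
toy_labels = {f"UniRef50_toy{i}": [[0.05] * 20 for _ in range(3)] for i in range(25)}

batches = split_into_batches(toy_labels, batch_size=10)

# Serialize each batch to a byte stream. Writing these bytes to a file
# opened in 'wb' mode would store the batch on disk.
serialized = {name: pickle.dumps(batch) for name, batch in batches.items()}

# Deserializing gives back the same dictionary.
restored = pickle.loads(serialized["batch_0"])
```

In this sketch the final batch simply holds the remainder when the number of entries is not a multiple of `batch_size`.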

The final set of input sequences is also stored as a dictionary, in which the key is the batch name and the value is a sub-dictionary with two entries: the key "Headers" maps to a list of 10k Uniref50 headers, and the key "Sequences" maps to the list of protein sequences in the same order as the list of headers. In other words, the sequence for the header at position 50 is found at the same index, 50, in the list of sequences.
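
To make the layout concrete, here is a small sketch of the nested dictionary and of the index-based lookup it implies. The batch name, headers, and sequences are made-up toy values, and the helper names `sequence_for` and `as_mapping` are assumptions introduced only for illustration.

```python
# Toy version of the sequence dictionary (batch name and entries are made up).
toy_sequences = {
    "batch_0": {
        "Headers": ["UniRef50_toyA", "UniRef50_toyB", "UniRef50_toyC"],
        "Sequences": ["MKTAYIAK", "MSDNE", "MVLSPADKT"],
    }
}


def sequence_for(data, batch_name, header):
    """Return the sequence stored at the same index as the given header."""
    batch = data[batch_name]
    index = batch["Headers"].index(header)  # position of the header in its list
    return batch["Sequences"][index]


def as_mapping(data, batch_name):
    """Build a header -> sequence dict for faster repeated lookups."""
    batch = data[batch_name]
    return dict(zip(batch["Headers"], batch["Sequences"]))


print(sequence_for(toy_sequences, "batch_0", "UniRef50_toyB"))
```

`list.index` scans the list on every call, so for many lookups in a 10k-protein batch a mapping built once with `as_mapping` is likely the more practical choice.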

The same structure is applied to the validation and test sets, though they contain only 879 protein sequences each.

#### 1) Import pickle to deserialize and load the data

In [None]:
import pickle

#### 2) Load input sequences of the training set

In [None]:
with open('./Dataset/train_set/train_set_sequences/training_set_seqs_short_batchname.data', 'rb') as filehandle:
    # read the data as binary data stream
    training_data = pickle.load(filehandle)
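
Once the dictionary is loaded, a quick sanity check can confirm that every batch has the expected shape. The sketch below assumes only the "Headers"/"Sequences" layout described earlier; the helper name `summarize_batches` and the toy data are hypothetical.

```python
def summarize_batches(data):
    """Return {batch_name: number_of_proteins}, checking that the lists align."""
    summary = {}
    for batch_name, batch in data.items():
        n_headers = len(batch["Headers"])
        n_sequences = len(batch["Sequences"])
        if n_headers != n_sequences:
            raise ValueError(
                f"{batch_name}: {n_headers} headers but {n_sequences} sequences"
            )
        summary[batch_name] = n_headers
    return summary


# Toy data with the same layout as the loaded dictionary (values are made up).
toy_data = {
    "batch_0": {"Headers": ["h1", "h2"], "Sequences": ["MKT", "MSD"]},
    "batch_1": {"Headers": ["h3"], "Sequences": ["MVL"]},
}

summary = summarize_batches(toy_data)
print(summary)
```

Calling `summarize_batches(training_data)` on the loaded dictionary would report the number of proteins per batch in the same way.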

#### 3) Example displaying the header, sequence, and corresponding labels (i.e. the relative frequencies matrix stored as a numpy array) for the first batch of 10k proteins

In [None]:
# Take only the first batch from the training dictionary
batch_name, batch_data = next(iter(training_data.items()))
with open('./Dataset/train_set/train_set_labels/' + batch_name + '_labels.data', 'rb') as filehandle2:
    # read the data as binary data stream
    training_labels = pickle.load(filehandle2)
# Headers and sequences share the same order, so iterate over them in parallel
for header, seq in zip(batch_data['Headers'], batch_data['Sequences']):
    print("> " + header)
    print(seq)
    # The labels dictionary is keyed by the Uniref50 header
    print(training_labels[header])