Title: Innate antiviral systems are major defensome components that influence prophage distribution in Acinetobacter baumannii

Authors: Antonio Moreno-Rodríguez, Alejandro Rubio, Andrés Garzón, Younes Smani & Antonio J. Pérez-Pulido

In this project, we analysed the defensome of Acinetobacter baumannii to profile the defense systems associated with particular prophage profiles and to predict which systems are most effective, and against which specific phages. To do so, we used machine learning techniques to associate prophages with defense systems, both positively and negatively.

#######################################################################

# A. baumannii genomes have 81 different defense systems

Defense system frequencies were calculated from the output of the defense-finder tool ['defense_finder_systems_wored100.tsv'], run on the Acinetobacter baumannii pangenome.
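
As an illustration of this frequency calculation, the following minimal Python sketch counts the fraction of genomes that carry each defense system type. It assumes a tab-separated table with one row per detected system and columns named `genome` and `type`; these column names and the toy rows are assumptions for illustration and may not match the actual layout of 'defense_finder_systems_wored100.tsv'.

```python
import csv
import io
from collections import defaultdict


def system_frequencies(handle, genome_col="genome", type_col="type"):
    """Return {system type: fraction of listed genomes carrying it at least once}."""
    genomes_by_type = defaultdict(set)
    all_genomes = set()
    for row in csv.DictReader(handle, delimiter="\t"):
        all_genomes.add(row[genome_col])
        genomes_by_type[row[type_col]].add(row[genome_col])
    # Note: genomes without any detected system never appear in the table,
    # so the denominator here only covers genomes with at least one system.
    total = len(all_genomes)
    return {t: len(g) / total for t, g in genomes_by_type.items()}


# Toy input standing in for the real table (hypothetical values).
toy = io.StringIO(
    "genome\ttype\n"
    "g1\tRM\n"
    "g1\tRM\n"
    "g2\tRM\n"
    "g2\tSspBCDE\n"
    "g3\tCas\n"
    "g4\tRM\n"
)
freqs = system_frequencies(toy)
for name, freq in sorted(freqs.items(), key=lambda kv: -kv[1]):
    print(f"{name}\t{freq:.2f}")
```

A system found several times in the same genome is counted once, so the values are per-genome prevalences rather than raw system counts.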

To plot panels B and C of Figure 1, we used the files 'cluster_defsys_count.tsv' and 'defsys_strains_count.tsv', which are obtained in the 'fig1.R' script by calculating the frequencies of pangenome genes and defense systems, respectively.
For panel D, we used 'subtypes_count.tsv', a file that collects the frequencies of the subtypes of the main defense systems.

The sequences from the main defense systems in A. baumannii were clustered. The distribution of these major variants is shown in panel A of Figure 2.
Additionally, to check for horizontal gene transfer of defense systems, the phylogenetic distance was compared with the sequence distance of some defense systems; this comparison is shown in panel B of Figure 2.

# Different innate defense systems do not usually appear together in the same genome

We also used the defense-finder output to calculate the frequencies of defense system co-appearance.

To show these co-appearances, we built a correlation matrix using the script 'coocurr_matrix.py', which we used to plot panel 2A. To show the overlap between defense system types, we used their correlation frequencies to plot Figure 2B.
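
The idea of a co-appearance correlation matrix can be sketched as follows. This is not the code of 'coocurr_matrix.py'; it is a minimal example that assumes each defense system is described by a 0/1 presence vector over the same ordered list of genomes, and it uses the phi coefficient (Pearson correlation on binary data) as the correlation measure, which is an assumption since the coefficient is not specified here. The toy vectors are hypothetical.

```python
from math import sqrt


def phi(x, y):
    """Pearson correlation of two equal-length 0/1 vectors (phi coefficient)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        # Undefined when a system is present in all genomes or in none.
        return float("nan")
    return cov / sqrt(var_x * var_y)


def cooccurrence_matrix(presence):
    """Return (sorted system names, square matrix of pairwise phi values)."""
    names = sorted(presence)
    return names, [[phi(presence[a], presence[b]) for b in names] for a in names]


# Hypothetical presence/absence of three systems across six genomes.
presence = {
    "RM":      [1, 1, 0, 0, 1, 0],
    "SspBCDE": [0, 0, 1, 1, 0, 1],
    "Cas":     [1, 0, 1, 0, 1, 0],
}
names, matrix = cooccurrence_matrix(presence)
for name, row in zip(names, matrix):
    print(name, " ".join(f"{v:+.2f}" for v in row))
```

Negative values indicate systems that tend to exclude each other (in the toy data, RM and SspBCDE never share a genome), while positive values indicate systems that tend to appear together.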

# R-M systems appear to be more efficient than SspBCDE in avoiding phage integration

In this section, we add predicted prophages to the analysis. These prophages were obtained using the phage prediction tool Phigaro. For some figures we used total phages (those prophages present in at least 1% of the genomes) ['cluster_ab_phigaro_90_def_1prcst_cl.txt' & 'st_phage_phigaro_cl_nored100.tsv'], and for others we used only the most frequent prophages per clonal group ['cluster_ab_phigaro_90_def_freqmlst8_cl.txt' & 'st_phage_phigaro_cl_mlst8_nored100.tsv'], which are obtained from the script 'freq_phages_bymlst.py'.
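
The two prophage selections described above can be illustrated with a short Python sketch. It is not the code of 'freq_phages_bymlst.py': the input structures (a mapping from genome to prophage clusters and a mapping from genome to clonal group), the function names and the `top_n` parameter are assumptions made for illustration, and the toy data are hypothetical.

```python
from collections import Counter, defaultdict


def frequent_prophages(genome_to_phages, min_fraction=0.01):
    """Prophage clusters present in at least `min_fraction` of all genomes."""
    counts = Counter()
    for phages in genome_to_phages.values():
        counts.update(set(phages))  # count each cluster once per genome
    n_genomes = len(genome_to_phages)
    return {p for p, c in counts.items() if c / n_genomes >= min_fraction}


def top_prophages_per_group(genome_to_phages, genome_to_group, top_n=1):
    """The `top_n` most frequent prophage clusters within each clonal group."""
    per_group = defaultdict(Counter)
    for genome, phages in genome_to_phages.items():
        per_group[genome_to_group[genome]].update(set(phages))
    return {g: [p for p, _ in c.most_common(top_n)] for g, c in per_group.items()}


# Hypothetical toy data: 200 genomes split into two clonal groups.
genome_to_phages, genome_to_group = {}, {}
for i in range(200):
    genome = f"g{i}"
    phages = []
    if i % 2 == 0:
        phages.append("phA")   # 50% of genomes
    if i < 3:
        phages.append("phB")   # 1.5% of genomes, kept
    if i == 0:
        phages.append("phC")   # 0.5% of genomes, dropped
    if i >= 100:
        phages.append("phD")   # every genome of the second group
    genome_to_phages[genome] = phages
    genome_to_group[genome] = "ST2" if i < 100 else "ST1"

frequent = frequent_prophages(genome_to_phages)
top = top_prophages_per_group(genome_to_phages, genome_to_group)
print(sorted(frequent))
print(top)
```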

To check which defense systems the phage-free genomes carried, we used the script 'binary_matrix.py' to build a binary presence-absence matrix of defense systems using only genomes with 0 prophages ['defense_finder_systems_wophages_nored100.tsv']. This binary matrix was used to create Figure 3B.
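
A binary presence-absence matrix restricted to phage-free genomes can be built along the following lines. This is a minimal sketch rather than the code of 'binary_matrix.py'; the input dictionaries and their contents are hypothetical.

```python
def presence_absence(genome_to_systems, genomes):
    """Build a 0/1 matrix: one row per genome, one column per defense system."""
    systems = sorted({s for g in genomes for s in genome_to_systems.get(g, ())})
    rows = {
        g: [1 if s in genome_to_systems.get(g, ()) else 0 for s in systems]
        for g in genomes
    }
    return systems, rows


# Hypothetical toy data: defense systems and prophage counts per genome.
genome_to_systems = {
    "g1": {"RM", "Cas"},
    "g2": {"SspBCDE"},
    "g3": {"RM"},
    "g4": set(),
}
prophage_counts = {"g1": 0, "g2": 3, "g3": 0, "g4": 0}

# Keep only genomes without any predicted prophage.
phage_free = [g for g, n in prophage_counts.items() if n == 0]
systems, rows = presence_absence(genome_to_systems, phage_free)

# Print the matrix as a tab-separated table.
print("genome\t" + "\t".join(systems))
for genome in phage_free:
    print(genome + "\t" + "\t".join(str(v) for v in rows[genome]))
```

Columns are limited to systems seen in at least one phage-free genome, so a system found only in prophage-carrying genomes (SspBCDE in the toy data) does not appear.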

We used the isolation site metadata and the clonal group data from the files 'metadata_ab_is.tsv' and 'mlst_ab_freq_wored100.tsv', respectively, to analyse whether there is any correlation between ST and isolation source.
  
For the phylogeny of Acinetobacter baumannii, we used the 'assembly_seq.pl' and 'uniq_sl.pl' scripts to build the initial multifasta containing only the core genes, which serves as input to MAFFT. The resulting MSA is processed with ClipKIT to remove gaps and keep the most informative regions, and the processed MSA is then used as input to IQ-TREE to generate the tree.
In addition, we generated presence-absence matrices of defense systems and prophages using 'defsys_pres_ann.py' and 'pres_aus_matrix_cl.py', respectively, to add them to the phylogeny in Figure 3D.
The latter script can be run for all phages or for the frequent phages in each clonal group; each option is used for a particular part of the figure.
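
The alignment, trimming and tree-building steps above can be chained as in the following Python sketch. The exact options used in this project are not stated here, so the command lines are assumptions: `--auto` for MAFFT, the default ClipKIT mode, `-m MFP` (automatic model selection) for IQ-TREE, the `iqtree2` executable name, and all file names are illustrative only. By default the sketch only prints the commands; running them requires the three tools to be installed.

```python
import shlex
import subprocess


def phylogeny_commands(core_fasta, prefix="core"):
    """Return (argv, stdout file or None) tuples for the three pipeline steps."""
    aln = f"{prefix}.aln.fasta"
    trimmed = f"{prefix}.clipkit.fasta"
    return [
        (["mafft", "--auto", core_fasta], aln),           # MAFFT writes the MSA to stdout
        (["clipkit", aln, "-o", trimmed], None),          # trim poorly informative sites
        (["iqtree2", "-s", trimmed, "-m", "MFP"], None),  # infer the tree
    ]


def run(commands, dry_run=True):
    """Print each command and, unless dry_run is set, execute it."""
    for argv, stdout_path in commands:
        shown = shlex.join(argv) + (f" > {stdout_path}" if stdout_path else "")
        print(shown)
        if dry_run:
            continue
        if stdout_path:
            with open(stdout_path, "w") as out:
                subprocess.run(argv, stdout=out, check=True)
        else:
            subprocess.run(argv, check=True)


# 'core_genes.fasta' is a hypothetical name for the core-gene multifasta.
commands = phylogeny_commands("core_genes.fasta")
run(commands)  # dry run: only prints the commands
```

Calling `run(commands, dry_run=False)` would execute the steps in order and stop at the first failing tool.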

We used the clonal group and prophage information to show correlations between phage infection and phylogenetic relationships in Figure 3E.

To analyse the prophage profile of each clonal group, we created a matrix of relative and absolute frequencies for each of them, using the script 'matrix_mlst_phages_freq.py'. 
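
The absolute and relative frequencies per clonal group can be derived as in the sketch below. It is not the code of 'matrix_mlst_phages_freq.py'; the input mappings, the function name and the toy values are assumptions for illustration. Here the relative frequency is taken as the fraction of genomes in a clonal group that carry a given prophage.

```python
from collections import Counter, defaultdict


def group_frequency_matrices(genome_to_phages, genome_to_group):
    """Return (absolute, relative) prophage frequencies per clonal group."""
    group_sizes = Counter(genome_to_group[g] for g in genome_to_phages)
    absolute = defaultdict(Counter)
    for genome, phages in genome_to_phages.items():
        # Count each prophage once per genome.
        absolute[genome_to_group[genome]].update(set(phages))
    relative = {
        group: {p: n / group_sizes[group] for p, n in counts.items()}
        for group, counts in absolute.items()
    }
    return {g: dict(c) for g, c in absolute.items()}, relative


# Hypothetical toy data.
genome_to_phages = {"g1": ["phA", "phB"], "g2": ["phA"], "g3": ["phB"], "g4": []}
genome_to_group = {"g1": "ST2", "g2": "ST2", "g3": "ST1", "g4": "ST1"}

absolute, relative = group_frequency_matrices(genome_to_phages, genome_to_group)
for group in sorted(absolute):
    for phage in sorted(absolute[group]):
        print(group, phage, absolute[group][phage], f"{relative[group][phage]:.2f}")
```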

Differences in total prophage number between group 1 and group 2 were obtained using total prophages.

# The presence of certain defense systems is associated with specific prophages

For Figures 4A and 4B, we used the total phage matrix 'matrix_90_ab_ml_1prcst_nored100_cl_mlst.tsv', which lists, in addition to the prophages of each genome, the defense systems and the clonal group associated with each strain. This matrix was generated using the script 'matrix_presaus_ml.py'.

For Figure 4C, Circos plots were drawn from files generated by 'prepareForCircos2.pl'. This script uses 'defsys_presaus_ann.tsv', 'logical_viruses.tsv' and a list of the genomes in each MLST group to create the input file for the figure. These files are also provided.

# Prophage combinations allow prediction of which defense systems a bacterium carries 

The XGBoost model used to derive the importance of each prophage for predicting the defense system of each strain was generated using the script 'xgb_model.R'. The figure was plotted with the 'fig5.R' script, using the importance data from the model and the relative abundance of each phage in each genome group.

The structural model was predicted using AlphaFold 3, and the structural alignment was performed with the USalign tool, using 'USalign.sh'.
