The Plan

  1. Make sure you have all tools installed. mafft, trimal, iqtree at the very least.
  2. Get a set of sequences to build a tree of.
  3. Align these to each other.
  4. Trim the alignment, removing gaps.
  5. Build a fast phylogentic tree.
  6. Build a thorough phylogenetic tree.

Composing your fasta

Let's look at what we have

In [ ]:
tree
In [115]:
inseq=1kP_LAR_orthogroup
(phylogenetics) 

In [2]:
head data/$inseq.fasta
>KTWL-2000328-Kaliphora_madagascariensis
GGTGYIGKFLVEASAKAGHPTFAFAREITVSDPLKGKLVESFKNSGVTLVHGDLYDHSSLVKAIKQVDVVISTVGQGQLNDQGKIIDAIKEAGNVKKFYPSEFGVDVDRQHAVEPAKFTFTGKAQFRRAIEAAGIPHTFLVSNAFAGYVLPTLGQPGVSAPPHDKVVILGDGNAKAVVNEEHDIGTFTIKTVDDPRTLNKIVYIRPPKNTVSFNEIVAIWERTIGKTLEKEYVSEEQLLKQIKEAPPPLNYMLSIYHAYLVKGDSANFAIEPSGVEASELYPDVNYTTVEEYI
>QSNJ-2064790-Taiwania_cryptomerioides
KSKVLIFGATGYMGHFIAIASLDSDHPTFVFIRPSSLEDPARAQILSNLRSSGIHFITGSLEDHESLVRAIQGMDVVISVLGGGQIMEQLKIIDAIKQVGTVKRFLPSEFGHDVDRAEPVEPALGFYESKRRIRRATEEAKIGYTYICCNSIAGWPYHYHTHPSKMFPPTDKIHIYGNGNIKAYFMTGKDIGSYTMRAIEDPRTLNKKLHFRPADNFLTMNELAVMWQKQLGKKLPMAVISEEDLLKCARANVMPASIVAALTHDIFIKGCQYNFSIEGDEDEEACQLYPDIKYTKLEDYFSSF
>URCP-2006931-Athyrium_filix_femina
MASKSKVLVIGATGYIGQHIAQAGPAQGHPTFVLVRPSSLSSKATLIHSFKLAGITILEGSLDDYESLVAAIKQVDVVISALGSNTLDQLKIVEAIKEVGTIKRFFPSEFGNDVDRVKALEPASQVFGLKVQVRRAIEEAGIPYTYVVSNGFAGYFLRSLLQPGLQSPPRDKVSIYGSGDVKFITASEEDIGAYTIKAVDDPRTLNKTLHIRPAANIVTINSLVELWEKLIGKTLEKVTVTAEDIVKQIQETPFPNNIFPAILHDIVIQGDHCNFELGPKDVEATTLYPDHKYTTSEEYLAKFV
>ROLB-2007609-Oenothera_elata-multiple_tissue_pool
NKQRFLPSEFGNDVDRVNAVGPAKLAFGIKANIRRAIETEGIPYTYVVNNFFAGYFLPTLLQPGATAPPRDQVVIPGNGNPKAIFNKEEDIGTYTIKAVDDPRTLNKTYTIKAVDDPRTLNKILYLRPPKNIYSFNELVALWENKIGHTLKKIYVLEDQILKDIQEASDGRFRSLTMNAFQSINHSVFVKGDQTNFDIEPAFGVEASQ
>RFSD-2005689-RFSD-Lactuca_graminifolia-2_samples_combined
EEEIALTTIKTVDDRRTLNKALIFRPPGNTLSFNEIVSIWESKIGKTLVKTYVSAEQLLKNIQEAPFGLSIELSVLHSVLVNGSSTNFDIEPSFGVEASELYPEVKYTTIDEYLT
In [3]:
grep '>' data/$inseq.fasta -c
7448

Selection subsetting

this is the new route opposed to the random subsetting.

In [29]:
grep -o '^>[A-Z]\{4\}-' data/$inseq.fasta | cut -c 2-5 | sort |  uniq -c
     10 AALA
      5 AAXJ
      1 ABCD
     12 ABEH
      4 ABIJ
      7 ABSS
      7 ACFP
      8 ACWS
      9 ADHK
      6 AEPI
     10 AFLV
      7 AFPO
      6 AFQQ
      3 AHRN
     20 AIGO
      6 AIOU
     11 AJFN
     11 AJJE
      1 AKCR
     11 AKXB
      5 ALUC
     13 ALVQ
      2 ANON
     11 AQFM
      6 AQGE
      8 AQZD
     18 AREG
      8 ARYD
      4 ASMV
      6 ATFX
      9 ATYL
     14 AUDE
      4 AUGV
     18 AUIP
      7 AVJK
      8 AWJM
      3 AWOI
     23 AWQB
      3 AXAF
     10 AXBO
      9 AXNH
      5 AXPJ
      6 AYIY
     12 AYMT
      7 AZBL
     17 BAHE
      2 BAZF
      6 BBBA
      8 BBDD
      7 BCAA
      6 BCGB
      8 BDJQ
     13 BEFC
      8 BEGM
      5 BEKN
      5 BERS
      8 BFJL
      1 BGXB
      7 BGZG
     10 BHYC
      6 BIDT
      6 BJKT
      6 BJSW
      6 BKQU
      6 BLAJ
      8 BLVL
     11 BMIF
      7 BMJR
      7 BMSE
      8 BNCU
     14 BNDE
     11 BNTL
      8 BOLZ
     13 BPKH
      4 BRUD
      5 BSEY
      9 BSNI
     10 BSTR
      2 BSVG
     17 BTTS
     18 BUWV
      9 BVOF
      5 BWRK
      7 BXAY
      5 BXBF
      9 BYNZ
      6 BYPY
      4 BYQM
      9 BZDF
      1 CAPN
     12 CAQZ
      4 CBAE
      6 CBJR
      5 CCID
     10 CDFR
      8 CFRN
     13 CGDN
      1 CGGO
     16 CIAC
      4 CIEA
      6 CJGZ
      6 CJNT
     12 CKDK
      7 CKKR
      8 CLMX
      8 CLNU
      7 CLRW
      8 CMCY
      3 CMEQ
      2 CMFF
      2 CNTZ
      7 COAQ
      8 COBX
      3 COCP
      6 CPKP
      3 CPLT
      5 CPOC
      6 CQMG
     10 CQPW
      4 CRNC
      6 CSSK
      8 CSUV
      7 CTSS
      8 CTYH
      5 CUTE
      5 CVDF
      4 CVEG
      7 CWYJ
      7 CWZU
      7 CZPV
      8 DAAD
     11 DAYQ
      5 DCCI
      6 DCDT
      7 DDRL
     11 DESP
      4 DFHO
      4 DFYF
      9 DGXS
      4 DHAW
      8 DHPO
     10 DIHD
      3 DJSE
      7 DKFZ
      4 DLAI
      7 DLJZ
      6 DMIN
      9 DMLT
      4 DNQA
      4 DOVJ
     12 DPFW
     11 DRIL
      4 DSUV
      8 DTNC
      8 DUNJ
      9 DUQG
      5 DVXD
      2 DWZT
      4 DXOU
      9 DXQW
     11 DZLN
      1 DZPJ
     22 DZQM
      8 DZTK
      9 EAAA
      1 EATP
     12 EAVM
      8 ECTD
      9 EDBB
      4 EDEQ
     11 EDHN
      6 EDIT
      9 EDXZ
      5 EEAQ
      5 EEMJ
     11 EFCZ
      7 EFMS
     13 EGLZ
      7 EGOS
     12 EHNF
      7 EILE
      4 EJBY
      8 EJCM
     13 EMAL
      4 EMBR
      8 EMJJ
      1 ENAU
      4 ENQF
      8 EQYT
      9 ERIA
      3 ERWT
     16 ETCJ
      2 ETGN
      4 EVOD
     19 EWXK
      6 EYKJ
      7 EYRD
     10 EZGR
      8 EZZT
      7 FAKD
      7 FALI
      5 FAMO
     12 FCBJ
      6 FCCA
      7 FCEL
      4 FCHS
     11 FDMM
      8 FEDW
     10 FFFY
      1 FFPD
     11 FGDU
      4 FGRF
     19 FHST
      4 FLTD
     12 FMWZ
      6 FNEN
      1 FOYQ
      2 FPCO
     12 FQGQ
      2 FQLP
      6 FROP
     15 FRPM
      6 FUMQ
      7 FUPX
      4 FVXD
      4 FWBF
      8 FWCQ
      7 FXGI
      7 FYSJ
      9 FYTP
      4 FYUH
     12 FZJL
      6 FZQN
     17 GAMH
     11 GANB
      3 GAON
      7 GBVZ
      9 GCFE
      8 GCYL
      4 GDKK
      7 GDZS
      8 GEHT
      5 GETL
     16 GGEA
      5 GGJD
      6 GHLP
      6 GIOY
      7 GIPR
      8 GIWN
      5 GJNX
      3 GJPF
      2 GKAG
     11 GKCZ
     10 GLVK
     16 GMHZ
      9 GNPX
     13 GNQG
      7 GNRI
      9 GOWD
      7 GRFT
      3 GRRW
      3 GSXD
      8 GSZA
      4 GTHK
      2 GTSV
     11 GUMF
      9 GVCB
      1 GXBM
      3 HABV
      9 HAEU
      6 HANM
      5 HATH
      5 HBGV
      8 HBHB
      8 HBUQ
      6 HDSY
      8 HDWF
      5 HEGQ
     13 HENI
      2 HERT
      7 HGSM
     15 HILW
      6 HJMP
      6 HKMQ
     10 HLJG
      1 HMHL
      7 HNCF
      5 HNDZ
     15 HOKG
      8 HPXA
     11 HQOM
     15 HQRJ
      5 HRUR
      8 HTDC
      7 HTFH
      7 HTIP
      8 HUQC
      5 HUSX
      8 HVBQ
     11 HWUP
      8 HXCD
      6 HXJE
      6 HYZL
      6 HZTS
      5 IADP
     21 IAJW
      7 IANR
      5 ICNN
      7 IDAU
      5 IDGE
      9 IEPQ
     11 IFLI
      2 IGUH
      8 IHCQ
      7 IHPC
     13 IIOL
      8 IKFD
      2 ILBQ
     10 IMZV
     13 INQX
     10 INSP
     19 IOVS
      5 IPWB
      6 IRAF
      4 IRBN
      1 IRYH
      9 IUSR
      6 IWIS
      6 IWMW
      5 IXEM
      5 IXVJ
      9 IYDF
     16 IZGN
      6 IZLO
      2 IZNU
      1 JADL
     10 JAFJ
      9 JBLI
     22 JBND
      9 JCLQ
      5 JCMU
     13 JDQB
      6 JDTY
      8 JEPE
     12 JETM
     14 JEXA
      6 JGAB
     11 JGYZ
     12 JHCN
      6 JHFI
      5 JHUL
      7 JKAA
      1 JKHA
      1 JKKI
      9 JKNQ
      4 JLOV
      2 JMTE
      8 JNKW
      8 JNUB
      5 JNVS
      3 JOIS
      6 JPDJ
      1 JPYU
     12 JRNA
      7 JSAG
     11 JSZD
     16 JTQQ
      7 JTRM
     13 JUWL
      7 JVBR
      7 JVSZ
      5 JWEY
      9 JYMN
     10 JZVE
      7 KAWQ
      4 KAYP
      5 KBRW
      3 KBXS
      9 KCPT
      6 KDCH
     15 KEGA
      8 KFZY
      8 KGJF
      3 KIIX
     11 KJAA
      6 KJYC
      1 KJZG
     10 KKDQ
     15 KLGF
     11 KNMB
      3 KOFB
     10 KPTE
      7 KPUM
      5 KRJP
      4 KRUQ
      9 KTAR
      8 KTWL
     10 KUXM
      4 KVAY
      7 KVFU
      2 KWGC
      9 KXSK
     16 KYAD
      4 KZED
      7 LAPO
      6 LDEL
      1 LDME
      7 LELS
      4 LEMW
      7 LFOG
      6 LGDQ
      4 LGOW
      8 LHLE
      3 LJQF
      7 LKKX
      7 LLQV
      1 LMVB
     15 LNER
      1 LNSF
      4 LPGY
      5 LQJY
      9 LRRR
      4 LRTN
      7 LSJW
      6 LSKK
      6 LTZF
      7 LVUS
      6 LWCK
      9 LWDA
      6 LYPZ
     11 MAQO
      5 MDJK
     18 MEKP
      7 MFEA
      8 MFIN
     21 MFTM
      5 MGVU
     14 MHGD
     13 MHYG
      1 MIRS
     10 MIXZ
      3 MKZR
      9 MQIV
      7 MRKX
     13 MROH
     10 MTGC
      6 MTHW
      8 MTII
      6 MUMD
      8 MUNP
      5 MVRF
      6 MVSE
     13 MWYQ
      3 MXFG
     12 MYMP
      8 MYVH
      2 MYZV
     12 MZLD
      4 MZOB
      2 NAUM
      7 NBMW
      7 NCVK
     10 NEBM
      7 NFXV
     13 NGRR
      2 NGTD
      4 NHAG
      7 NHCM
      4 NHIX
     12 NHUA
      7 NIGS
      4 NJLF
     18 NKIN
      2 NMDZ
      4 NMGG
     11 NNGU
      6 NNOK
     10 NOKI
      4 NPND
     21 NPRL
      6 NRWZ
     19 NRXL
      6 NSPR
      5 NTEO
      9 NUZN
     15 NVGZ
      5 NVSO
      5 NWMY
     13 NWWI
      8 NXOH
      7 NXTS
      7 OAGK
     12 OBPL
      7 OBTI
      9 OCTM
      5 OCWZ
      4 OCZL
     11 ODDO
      7 OEKO
      5 OFTV
     16 OHAE
      7 OHKC
      9 OINM
      5 OLES
      9 OLXF
     14 OMDH
      5 OMYK
      4 ONBE
      3 ONLQ
      8 OODC
      8 OOSO
     11 OOVX
      8 OPDF
      4 OQBM
      7 OQHZ
     11 OQWW
     12 ORJE
      2 ORKS
      3 OSHQ
      4 OSIP
      8 OSMU
      4 OTAN
      6 OUER
     11 OVIJ
      6 OWAS
     17 OWFC
     17 OXGJ
     13 OXYP
      6 PAWA
      3 PAZJ
     12 PBUU
      6 PCGJ
      9 PCNH
      7 PDQH
     19 PEZP
      9 PGKL
      5 PHCE
      4 PIUF
     10 PIVW
     13 PJSX
      7 PKMO
      9 PLBZ
     15 PMTB
     17 PNZO
      1 POOW
     10 POPJ
     11 POZS
      8 PPPZ
      9 PPQR
      1 PQED
      6 PQTO
      5 PRFO
      2 PRIQ
      6 PSHB
     11 PSJT
      7 PSKY
      6 PTBJ
      7 PTFA
      6 PTLU
      6 PUCW
      6 PUDI
      7 PWSG
      6 PXYR
      6 PZRT
      6 QACK
      9 QAUE
     12 QCGM
      8 QCOU
     16 QDVW
      7 QEHE
     12 QFAE
     15 QHBI
     12 QIAD
     10 QICX
      9 QIEH
      8 QIKZ
      5 QKMG
      1 QKQO
     10 QNGJ
      3 QNOC
      5 QNPH
      6 QOXT
      1 QRTH
      6 QSKP
      3 QSLH
     17 QSNJ
     10 QTJY
      6 QURC
      9 QUTB
      7 QVMR
      6 QXWF
      1 QYXY
      4 QZXQ
      5 QZZU
     13 RBYC
      4 RCAH
      7 RCBT
     10 RCUX
      2 RDOO
      3 RDYY
     17 RFMZ
      8 RFRB
     10 RFSD
      2 RGKI
      5 RHAU
      7 RICC
      9 RJIM
     12 RJNQ
     10 RKFX
     11 RKLL
     14 RMMV
      6 RMVB
      6 RMWJ
      2 RNBN
      5 ROEI
     10 ROLB
     16 ROWR
     10 RPPC
      5 RQNK
      7 RQUG
     11 RQZP
     10 RSCE
      1 RSOF
      6 RSPO
     16 RTMU
     10 RTNA
      9 RTTY
      5 RUUB
      7 RVGH
      4 RWKR
      7 RXEN
      9 RZTJ
      4 SALZ
      6 SART
     10 SBZH
      8 SCAO
     12 SCEB
      4 SERM
      1 SFKQ
      7 SGTW
      1 SHEZ
      8 SIBR
      8 SIIK
      4 SILJ
      6 SIZE
      6 SJAN
     14 SJEV
      3 SKNL
      6 SKQD
      2 SKYV
      6 SLOI
      8 SLYR
      7 SMMC
      3 SMUR
      6 SNNC
      4 SOHV
      7 SQCF
      7 SSDU
      2 STKY
      9 SUAK
     11 SUVN
      6 SVTS
     10 SVVG
      9 SWGX
      6 SWOH
      6 SWPE
      8 SXCE
      6 SXML
     11 SYHW
      8 SZPD
      4 SZUO
      7 TAGM
      1 TAVP
      7 TCBC
      6 TCYS
      7 TDTF
      9 TEZA
     11 TFDQ
      3 TFYI
     14 TGKW
      6 THDM
      6 THEW
      9 THHD
     12 TIUZ
      4 TJES
      7 TJLC
      8 TJQY
      5 TKEK
     11 TLCA
      1 TMAJ
     11 TOKV
     10 TORX
     12 TOXE
      7 TPEM
      5 TQKZ
      4 TRPJ
      8 TRRQ
      5 TTRG
      8 TVCU
      9 TVSH
      3 TWFZ
      6 TWUW
      8 TXMP
      6 TXVB
      7 TZNS
      7 TZWR
      9 UAXP
      8 UBLN
      5 UCNM
      1 UDHA
      7 UDUT
     15 UEVI
      8 UFHF
      8 UFJN
      4 UGJI
      7 UGNK
     16 UHBY
      3 UHJR
      7 UHLI
     11 UJGI
      6 UJTT
      5 UJWU
      1 UKUC
      4 ULKT
      7 UMUL
      6 UOEL
     12 UOMY
      5 UOYN
      6 UPMJ
      7 UPOG
      5 UPZX
      7 UQCB
     11 URCP
      3 URDJ
     10 UTQR
      4 UUHD
     12 UUJS
      8 UVDC
      5 UVQL
     14 UWFU
     16 UWOD
      8 UXCS
      6 UYED
     16 UZWG
      6 UZXL
      1 VALZ
      6 VBHQ
     10 VCIN
      4 VDAO
      6 VDKG
     12 VFFP
     12 VFYZ
      3 VGHH
     12 VGSX
     10 VGVI
     10 VHZV
     13 VIBO
      9 VITX
      7 VJPU
     13 VKJD
      9 VLNB
      8 VMNH
      2 VMXJ
     12 VNMY
      9 VQFW
      5 VQYB
     16 VSRH
      4 VTLJ
      5 VTUS
     15 VUSY
      7 VVPY
      8 VVRN
      4 VVVV
      3 VXKB
      8 VXOD
      4 VYDM
      6 VYGG
      3 VYLQ
     11 VZCI
      3 WAFT
      8 WAIL
     10 WAXR
     11 WBIB
     11 WBOD
      8 WCLG
     11 WCOR
      1 WDWX
     11 WEAC
      5 WEEQ
      7 WEQK
      7 WFBF
      4 WGET
     10 WGTU
      7 WHNV
      9 WIGA
      1 WJLO
     10 WKCY
      9 WKSU
     12 WLIC
      3 WMLW
     11 WMUK
      4 WNGH
      5 WOBD
      2 WOGB
     17 WOHL
      9 WPHN
      8 WQML
      2 WQRD
      8 WQUF
      8 WRPP
      4 WTDE
      2 WTKZ
      5 WVEF
      2 WVMY
     10 WVWN
      6 WWKL
     12 WWQZ
     15 WWSS
      5 WXNT
     10 WXVX
     16 WYIG
      5 WZFE
      7 WZYK
      3 XAYK
      8 XBKS
      3 XDDT
      7 XDVM
      5 XFFT
     14 XFJG
      7 XGFU
      7 XHHU
      4 XHKT
     18 XIRK
      3 XISJ
      1 XIVI
     13 XKPS
     17 XLGK
      2 XMBA
     11 XMGP
      7 XMQO
      4 XMVD
      9 XNLP
      2 XNXF
      1 XONJ
      5 XOOE
      9 XPAF
      8 XPBC
      5 XQRV
     17 XQSG
     10 XQWC
     13 XRCX
      8 XRLM
      5 XSSD
     11 XSZI
     16 XTZO
      8 XUAB
     10 XVJB
     11 XVRU
      7 XXHP
      5 XXYA
      7 XZME
      6 XZUY
      4 YADI
      8 YBML
      9 YBQN
      6 YCKE
      9 YFGP
      8 YFQX
     10 YFZK
     11 YGAT
     12 YGCX
     12 YHFG
      8 YHLF
      6 YHZW
      6 YIXP
      7 YJJY
      9 YJUG
     13 YKFU
     10 YKQR
      6 YKZB
      6 YLJA
     11 YLPM
      4 YLWW
      1 YMES
      8 YNFJ
     10 YNUE
      8 YOWV
      6 YPIC
      2 YQEC
      7 YQIJ
      9 YRBQ
      5 YRHD
      7 YSRZ
      8 YUOM
      2 YWNF
      6 YXNR
     16 YYPE
      3 YZGX
     14 YZRI
      8 YZVJ
      1 ZACW
      6 ZBPY
      6 ZBTA
     12 ZBVT
     10 ZCDJ
     10 ZCUA
      1 ZDIZ
      4 ZENX
      8 ZETY
      7 ZFGK
      7 ZFRE
      3 ZGQD
      7 ZHEE
     12 ZHMB
      8 ZINQ
      1 ZIWB
      7 ZJRC
      6 ZJUL
      4 ZKPF
      5 ZMGN
      1 ZNUM
      8 ZPKK
     20 ZQVF
      7 ZQWM
      7 ZQYU
      6 ZRAV
      5 ZRIN
      6 ZSAB
      3 ZSGF
      9 ZSSR
      1 ZTHV
      8 ZTLR
      9 ZUHO
     10 ZXJO
     15 ZYAX
      5 ZYCD
      5 ZZEI
      4 ZZOL
In [53]:
grep '^>' data/$inseq.fasta | grep -v '[A-Z]\{4\}-' | cut -c 2-6 | sort | uniq -c
     16 Ambtr
      9 Aquco
      8 Arath
      5 Betvu
      6 Carpa
      1 Chlre
      2 Chlva
     22 Elagu
     56 Eucgr
      4 Klefl
     20 Manes
      2 Micpu
      6 Mimgu
      8 Musac
     11 Nelnu
     14 Orysa
     19 Phavu
     62 Pinta
     22 Poptr
     15 Prupe
     10 Selmo
      9 Solly
      8 Sorbi
      6 Spipo
     10 Theca
     21 Vitvi
In [60]:
mkdir data/LAR_orthogroup_selection_v1

Using the codes above, and the online sample list of the 1kPlants project, I made an overview table containing extra information about the samples to

  1. choose a balanced proporties of different major clades
  2. annotate the tree afterwards

the summary looks like this: image.png

And the full table is online here: https://docs.google.com/spreadsheets/d/1v2igxY_nr7ETMoUdbqpY0QKVxJ-KYiRiO2lLoyOABsw/edit?usp=sharing

I made a selection in

./data/LAR_orthogroup_selection_v1/1kP_LAR_RNAselection

and

./data/LAR_orthogroup_selection_v1/1kP_LAR_DNAselection

then there is Erbils guide v4

./data/Erbilsguide_v3.fasta supplemented with ./data/Erbils_verified_sequences_v2_nonredundantwithv3.fasta

, and extra Azolla sequences.

Azfi PIP-likes V1.txt & Azfi PIP-likes V2.txt

In [59]:
ls data -l
total 2812
-rw-rw-r-- 1 laura laura 2132281 May  4 21:27  1kP_LAR_orthogroup.fasta
-rw-rw-r-- 1 laura laura   21727 May  5 10:40  1kP_LAR_orthogroup_manual-selection-1.fasta
-rw-rw-r-- 1 laura laura       0 May  5 10:42  1kP_LAR_orthogroup_manual-selection-1_guidev4cat
-rw-rw-r-- 1 laura laura   40650 May  5 10:44  1kP_LAR_orthogroup_manual-selection-1_guidev4.fasta
-rw-rw-r-- 1 laura laura    2832 May  5 10:38  1kP_LAR_orthogroup_manual-selection-1.txt
-rw-rw-r-- 1 laura laura  187401 May  4 21:27  1kP_LAR_orthogroup_random700.fasta
-rw-rw-r-- 1 laura laura  194663 May  1 17:21  1kP_LAR_orthogroup_random700_guidev3.fasta
-rw-rw-r-- 1 laura laura  225247 May  5 10:42  1kP_LAR_orthogroup_random700_guidev4.fasta
-rw-rw-r-- 1 laura laura   23808 May  1 17:20  1kP_LAR_orthogroup_random700.txt
drwxrwxr-x 2 laura laura    4096 May  5 10:45  alignments_raw
drwxrwxr-x 3 laura laura    4096 May  5 10:45  alignments_trimmed
-rw-r--r-- 1 laura laura    1789 May  7 11:44 'Azfi PIP-likes V1.txt'
-rw-r--r-- 1 laura laura    3618 May  7 11:44 'Azfi PIP-likes V2.txt'
-rw-rw-r-- 1 laura laura   12902 May  1 16:54  Erbils_guide_v3.fasta
-rw-rw-r-- 1 laura laura    6018 May  4 21:59  Erbils_verified_sequences_v2_nonredundantwithv3.fasta

Now let's compose a fasta combining all these subsets.

In [116]:
grep -f ./data/LAR_orthogroup_selection_v1/1kP_LAR_DNAselection data/$inseq.fasta | grep '^>' > ./data/LAR_orthogroup_selection_v1/1kP_LAR_selection.names
grep -f ./data/LAR_orthogroup_selection_v1/1kP_LAR_RNAselection data/$inseq.fasta | grep '^>' >> ./data/LAR_orthogroup_selection_v1/1kP_LAR_selection.names
(phylogenetics) (phylogenetics) 

In [117]:
grep '>' -c ./data/LAR_orthogroup_selection_v1/1kP_LAR_selection.names
grep '>'    ./data/LAR_orthogroup_selection_v1/1kP_LAR_selection.names | sort | uniq | wc -l
743
(phylogenetics) 743
(phylogenetics) 

In [118]:
head  ./data/LAR_orthogroup_selection_v1/1kP_LAR_selection.names
>Pinta_v2_0-PITA_000013176_RA-Pinus_taeda
>Pinta_v2_0-PITA_000002308_RA-Pinus_taeda
>Aquco_v1_1-Aquca_022_00106_1-Aquilegia_coerulea
>Klefl_v1_0-kfl00402_0160-Klebsormidium_flaccidum
>Manes_v4_1-cassava4_1_029219m-Manihot_esculenta
>Pinta_v2_0-PITA_000022482_RA-Pinus_taeda
>Sorbi_v2_1-Sobic_008G106800_1-Sorghum_bicolor
>Poptr_v3_0-Potri_001G133200_1-Populus_trichocarpa
>Aquco_v1_1-Aquca_010_00523_1-Aquilegia_coerulea
>Pinta_v2_0-PITA_000093978_RA-Pinus_taeda
(phylogenetics) 

In [119]:
rm ./data/LAR_orthogroup_selection_v1/1kP_LAR_selection_v1.fasta 2> /dev/null
/opt/rnaseq/bin/extract_sequence_by_id.pl --ID=./data/LAR_orthogroup_selection_v1/1kP_LAR_selection.names \
                                          --fileformat=fasta \
                                          --sequences=data/$inseq.fasta \
                                          --outfile=./data/LAR_orthogroup_selection_v1/1kP_LAR_selection_v1.fasta
(phylogenetics) (phylogenetics) 

In [120]:
grep '>' ./data/LAR_orthogroup_selection_v1/1kP_LAR_selection_v1.fasta | sort | wc -l 
grep '>' ./data/LAR_orthogroup_selection_v1/1kP_LAR_selection_v1.fasta | sort | uniq | wc -l
743
(phylogenetics) 743
(phylogenetics) 

Now we have our fasta with the selected kP sequences, let's add extra Azolla and guide sequences.

In [121]:
cat ./data/Erbils_guide_v3.fasta \
    ./data/Erbils_verified_sequences_v2_nonredundantwithv3.fasta \
   './data/LAR_orthogroup_selection_v1/Azfi PIP-likes V1.txt' \
   './data/LAR_orthogroup_selection_v1/Azfi PIP-likes V2.txt' | grep '>' | sort | uniq -c
      1 >Afi_v2_s1241G000080.2 (IFR)
      1 >Afi_v2_s1241G000100.1 (PCBER)
      1 >Afi_v2_s1241G000110.6 (PCBER)
      1 >Afi_v2_s1281G000010.2 (IFR)
      1 >Afi_v2_s1281G000030.1 (PCBER)
      1 >Afi_v2_s19G000320.3 (no BLAST result)
      1 >Afi_v2_s22G002760.1 (divinyl chlorophyllide a 8-vinyl-reductase)
      1 >Afi_v2_s74G000210.2 (characterized LAR)
      1 >Afi_v2_s74G000210.2 (PCBER)
      1 >Azfi_s0004.g008890 (no BLAST result)
      1 >Azfi_s0056.g034071 (divinyl chlorophyllide a 8-vinyl-reductase)
      1 >Azfi_s0197.g057377 (characterized LAR)
      1 >Azfi_s0435.g069924 (no BLAST result)
      1 >Azolla filiculoides LAR
      1 >Chlamydomonas reinhardtii PLR-like (379876)
      1 >Chlorella variabilis PLR-like (42998)
      1 >Cicer arietinum IFR (Q00016)
      1 >Clarkia breweri EGS1 (ABR24113)
      1 >Clarkia breweri EGS2 (ABR24114)
      1 >Clarkia breweri IGS (ABR24112)
      1 >Coccomyxa subellipsoidea PLR-like (23909)
      1 >Desmodium uncinatum LAR (Q84V83)
      1 >EGS Gymnadenia conopsea (Koeduka et al. 2018)
      1 >EGS Gymnadenia odoratissima (Koeduka et al. 2018)
      1 >EGS Ocimum basillicum (Koeduka et al. 2018)
      1 >EGS Ocimum gratissimum (Koeduka et al. 2018)
      1 >EGS Ocimum kilimandscharicum (Koeduka et al. 2018)
      1 >Forsythia intermedia PCBER (AAF64174)
      1 >Forsythia intermedia PLR (AAC49608)
      1 >Fragaria ananassa EGS1 (AGV02007)
      1 >Fragaria ananassa EGS2 (AGV02008)
      1 >Fragaria ananassa LAR (ABH07785)
      1 >Gymnadenia densiflora EGS (AKB11749)
      1 >IFR-LIKE Ginkgo biloba (Hua et al. 2013)
      1 >IFR-LIKE Oryza sativa (Kim et al. 2009)
      1 >Klebsormidium nitens PLR-like (kfl00017_0610)
      1 >LAR Azolla caroliniana (CVEG-2016335)
      1 >LAR Desmodium uncinatum
      1 >LAR Fragaria x ananassa (Koeduka et al. 2018)
      1 >LAR Malus domestica (Koeduka et al. 2018)
      1 >LAR Picea abies
      1 >LAR Pinus taeda
      1 >Larrea tridentata APS (AHA90804)
      1 >Larrea tridentata PPS (AHA90806)
      1 >LAR Vitis vinifera (Koeduka et al. 2018)
      1 >Linum album PLR (CAH60857)
      1 >Lotus japonicus PTR (BAF34844)
      1 >Medicago sativa IFR (CAA41106)
      1 >Nicotiana tabacum PCBER (BAG84267)
      1 >Ocimum tenuiflorum EGS (AOC97444)
      1 >Petunia hybrida EGS (ABR24115)
      1 >Petunia hybrida IGS (ABD17322)
      1 >Pimpinella anisum AIS (ACL13526)
      1 >Pinus taeda PCBER (AAC32591)
      1 >Piper regnellii APS (AHA90807)
      1 >Piper regnellii PPS (AHA90809)
      1 >Pisum sativum IFR (P52576)
      1 >P(L)R Arabidopsis thaliana (Koeduka et al. 2018)
      1 >PLR Linum perenne (Koeduka et al. 2018)
      1 >Populus trichorpa PCBER (CAA06707)
      1 >Selaginella moellendorffii PLR-like1 (438560)
      1 >Selaginella moellendorffii PLR-like2 (230858)
      1 >Thuja plicata PLR (AAF63509)
      1 >Tsuga heterophylla PCBER (AAF64179)
      1 >Tsuga heterophylla PLR (AAF64184)
(phylogenetics) 

In [122]:
cat ./data/Erbils_guide_v3.fasta \
    ./data/Erbils_verified_sequences_v2_nonredundantwithv3.fasta \
   './data/LAR_orthogroup_selection_v1/Azfi PIP-likes V1.txt' \
   './data/LAR_orthogroup_selection_v1/Azfi PIP-likes V2.txt' > ./data/LAR_orthogroup_selection_v1/1kP_LAR_guide_v5.fasta
(phylogenetics) 

In [191]:
cat ./data/LAR_orthogroup_selection_v1/1kP_LAR_guide_v5.fasta \
    ./data/LAR_orthogroup_selection_v1/1kP_LAR_selection_v1.fasta \
    > ./data/1kP_LAR_selectionv1_guide_v5.fasta
inseq=1kP_LAR_selectionv1_guide_v5
echo $inseq
(phylogenetics) (phylogenetics) 1kP_LAR_selectionv1_guide_v5
(phylogenetics) 

In [237]:
grep '>' data/$inseq.fasta -c
808
(phylogenetics) 

random subsetting

Skipping this section for now

In [ ]:
n=700
grep '>' data/$inseq.fasta | tr -d '>' | shuf | head -n $n > data/"$inseq"_random"$n".txt
In [ ]:
grep '>'  ../LAR_tree/alignments_raw/LAR_orthogroup_LARselectionExtraFernsAndLycophytes_guide2_aligned-mafft-linsi.fasta | tr -d '>' | tail -n +49 > data/"$inseq"_random"$n".txt
In [ ]:
wc -l ./data/"$inseq"_random700.txt

Let's add all Azolla stuff in there, but first see what is there already

In [ ]:
grep -i azolla data/"$inseq"_random"$n".txt
In [ ]:
grep -i azolla data/"$inseq".fasta

Remove double guide sequence AZOLLAFILICULOIDES

In [ ]:
sed -i '/AZOLLAFILICULOIDES/d' data/"$inseq".fasta
In [ ]:
head data/"$inseq"_random"$n".txt
In [ ]:
/opt/rnaseq/bin/extract_sequence_by_id.pl --ID=<(sort data/"$inseq"_random"$n".txt | uniq) \
                                          --fileformat=txt \
                                          --sequences=<(cat data/$inseq.fasta | tr -d '.')\
                                          --outfile=data/"$inseq"_random"$n".fasta
In [ ]:
grep '>' -c data/"$inseq"_random"$n".fasta
In [ ]:
n=700
grep -i azolla data/"$inseq"_random"$n".fasta
In [ ]:
ls data
In [ ]:
head data/1kP_LAR_orthogroup_random700.fasta
In [ ]:
head data/Erbils_guide_v3.fasta
In [ ]:
grep '>' data/Erbils_verified_sequences_v2_nonredundantwithv3.fasta
In [ ]:
grep '>' data/Erbils_guide_v3.fasta
In [ ]:
cat data/Erbils_guide_v3.fasta > data/1kP_LAR_orthogroup_random700_guidev4.fasta
In [ ]:
cat data/Erbils_verified_sequences_v2_nonredundantwithv3.fasta >> data/1kP_LAR_orthogroup_random700_guidev4.fasta
In [ ]:
echo "\n" >> data/1kP_LAR_orthogroup_random700_guidev4.fasta
In [ ]:
cat data/1kP_LAR_orthogroup_random700.fasta >> data/1kP_LAR_orthogroup_random700_guidev4.fasta
In [ ]:
grep '>' -c ./data/1kP_LAR_orthogroup_random700_guidev4.fasta
In [ ]:
inseq="$inseq"_random"$n"_guidev4
echo $inseq
In [192]:
head data/$inseq.fasta
>Ocimum tenuiflorum EGS (AOC97444)
MEENGMKSKILIFGGTGYIGNHMVKGSLKLGHPTYVFTRPNSTKTALLNEFQSMGAIIVKGEMGEHEKLV
RLMKKVDVVISAVAIPQVLEQLKIVEAIKVAGNIKRFLPSDFGVEEDRISALPPFEACIEKKRIIRRAIE
EANIPYTYVSANCFASYFINYLLHPSDPKDEITVYGTGEVKFAMNYEQDIGLYTIKVATDPRTLNRVVIY
RPSTNITTQLELISKWEKKIGRNFKKIHIPEEEIVALTKELPEPENIPVAILHCVFIKGVTMSYDFKEND
VEASTLYPELKFTTIDELLDIFLHDPPPPASAAF

>Clarkia breweri IGS (ABR24112)
MEKIIIYGGTGYIGKFMVRASLSFSHPTFIYARPLTPDSTPSSVQLREEFRSMGVTIIEGEMEEHEKMVS
VLRQVDVVISALSVPMYPSQLLIIDAIKAAGNIKRFLPSEFGSEEDRIKPLPPFESVLEKKRIIRRAIEA
(phylogenetics) 

Aligning

Either run everything one by one...

First with MAFFT

In [319]:
mafft --help
------------------------------------------------------------------------------
  MAFFT v7.455 (2019/Dec/7)
  https://mafft.cbrc.jp/alignment/software/
  MBE 30:772-780 (2013), NAR 30:3059-3066 (2002)
------------------------------------------------------------------------------
High speed:
  % mafft in > out
  % mafft --retree 1 in > out (fast)

High accuracy (for <~200 sequences x <~2,000 aa/nt):
  % mafft --maxiterate 1000 --localpair  in > out (% linsi in > out is also ok)
  % mafft --maxiterate 1000 --genafpair  in > out (% einsi in > out)
  % mafft --maxiterate 1000 --globalpair in > out (% ginsi in > out)

If unsure which option to use:
  % mafft --auto in > out

--op # :         Gap opening penalty, default: 1.53
--ep # :         Offset (works like gap extension penalty), default: 0.0
--maxiterate # : Maximum number of iterative refinement, default: 0
--clustalout :   Output: clustal format, default: fasta
--reorder :      Outorder: aligned, default: input order
--quiet :        Do not report progress
--thread # :     Number of threads (if unsure, --thread -1)
--dash :         Add structural information (Rozewicki et al, submitted)
(phylogenetics) 

This is probably the most acurate mafft setting, which is turned off by default in normal or auto mafft for alignments bigger than 200 sequences. Since We're moving to a publication quality tree, let's do is this way and see.

In [318]:
conda activate phylogenetics
(phylogenetics) 

In [193]:
rm "./data/alignments_raw/$inseq"_aligned-mafft-linsi.fasta
if    [ ! -d ./data/alignments_raw/ ]
then  mkdir  ./data/alignments_raw
fi
if    [ ! -f "./data/alignments_raw/$inseq"_aligned-mafft.fasta ]
then  linsi --thread 12 data/$inseq.fasta > "./data/alignments_raw/$inseq"_aligned-mafft-linsi.fasta
fi
(phylogenetics) (phylogenetics) outputhat23=16
treein = 0
compacttree = 0
stacksize: 8192 kb
rescale = 1
All-to-all alignment.
tbfast-pair (aa) Version 7.455
alg=L, model=BLOSUM62, 2.00, -0.10, +0.10, noshift, amax=0.0
12 thread(s)

outputhat23=16
Loading 'hat3.seed' ... 
done.
Writing hat3 for iterative refinement
rescale = 1
Gap Penalty = -1.53, +0.00, +0.00
tbutree = 1, compacttree = 0
Constructing a UPGMA tree ... 
  800 / 808
done.

Progressive alignment ... 
STEP   629 /807 (thread    0) 
Reallocating (by thread 6) ..done. *alloclen = 2231
STEP   758 /807 (thread    2) 
Reallocating (by thread 8) ..done. *alloclen = 3401
STEP   807 /807 (thread    3) 
done.
tbfast (aa) Version 7.455
alg=A, model=BLOSUM62, 1.53, -0.00, -0.00, noshift, amax=0.0
12 thread(s)

minimumweight = 0.000010
autosubalignment = 0.000000
nthread = 8
randomseed = 0
blosum 62 / kimura 200
poffset = 0
niter = 16
sueff_global = 0.100000
nadd = 16
Loading 'hat3' ... done.
rescale = 1

  800 / 808
Segment   1/  1    1-2592
010-1612-0 (thread    6) worse         
Converged2.
done
dvtditr (aa) Version 7.455
alg=A, model=BLOSUM62, 1.53, -0.00, -0.00, noshift, amax=0.0
8 thread(s)


Strategy:
 L-INS-i (Probably most accurate, very slow)
 Iterative refinement method (<16) with LOCAL pairwise alignment information

If unsure which option to use, try 'mafft --auto input > output'.
For more information, see 'mafft --help', 'mafft --man' and the mafft page.

The default gap scoring scheme has been changed in version 7.110 (2013 Oct).
It tends to insert more gaps into gap-rich regions than previous versions.
To disable this change, add the --leavegappyregion option.

(phylogenetics) 

In [150]:
ls ./data/alignments_raw
1kP_LAR_orthogroup_manual-selection-1_guidev4_aligned-mafft-linsi.fasta
1kP_LAR_orthogroup_random700_guidev3_aligned-mafft-linsi.fasta
1kP_LAR_orthogroup_random700_guidev4_aligned-mafft-linsi.fasta
1kP_LAR_selectionv1_guide_v5_aligned-mafft-linsi.fasta

In [185]:
head ./data/alignments_raw/"$inseq"_aligned-mafft-linsi.fasta
exit1146-0 (thread    8) identical     
Terminated
Restarting Bash

Alignment trimming

odds are, your alignment is quite gappy which may confuse tree building algorithms. Often it is better to remove gappy columns in your alignment. Let's have a look at this with trimAl. Short for 'trim alignment' No Artificial intelegence stuff going on here.

In [194]:
trimal -h
trimAl v1.4.rev15 build[2013-12-17]. 2009-2013. Salvador Capella-Gutierrez and Toni Gabaldón.

trimAl webpage: http://trimal.cgenomics.org

This program is free software: you can redistribute it and/or modify 
it under the terms of the GNU General Public License as published by 
the Free Software Foundation, the last available version.

Please cite:
		trimAl: a tool for automated alignment trimming in large-scale phylogenetic analyses.
		Salvador Capella-Gutierrez; Jose M. Silla-Martinez; Toni Gabaldon.
		Bioinformatics 2009, 25:1972-1973.

Basic usage
	trimal -in <inputfile> -out <outputfile> -(other options).

Common options (for a complete list please see the User Guide or visit http://trimal.cgenomics.org):

    -h                       Print this information and show some examples.
    --version                Print the trimAl version.

    -in <inputfile>          Input file in several formats (clustal, fasta, NBRF/PIR, nexus, phylip3.2, phylip).

    -compareset <inputfile>  Input list of paths for the files containing the alignments to compare.
    -forceselect <inputfile> Force selection of the given input file in the files comparison method.

    -backtrans <inputfile>   Use a Coding Sequences file to get a backtranslation for a given AA alignment
    -ignorestopcodon         Ignore stop codons in the input coding sequences
    -splitbystopcodon        Split input coding sequences up to first stop codon appearance

    -matrix <inpufile>       Input file for user-defined similarity matrix (default is Blosum62).

    -out <outputfile>        Output alignment in the same input format (default stdout). (default input format)
    -htmlout <outputfile>    Get a summary of trimal's work in an HTML file.

    -keepheader              Keep original sequence header including non-alphanumeric characters.
                             Only available for input FASTA format files. (future versions will extend this feature)

    -nbrf                    Output file in NBRF/PIR format
    -mega                    Output file in MEGA format
    -nexus                   Output file in NEXUS format
    -clustal                 Output file in CLUSTAL format

    -fasta                   Output file in FASTA format
    -fasta_m10               Output file in FASTA format. Sequences name length up to 10 characters.

    -phylip                  Output file in PHYLIP/PHYLIP4 format
    -phylip_m10              Output file in PHYLIP/PHYLIP4 format. Sequences name length up to 10 characters.
    -phylip_paml             Output file in PHYLIP format compatible with PAML
    -phylip_paml_m10         Output file in PHYLIP format compatible with PAML. Sequences name length up to 10 characters.
    -phylip3.2               Output file in PHYLIP3.2 format
    -phylip3.2_m10           Output file in PHYLIP3.2 format. Sequences name length up to 10 characters.

    -complementary           Get the complementary alignment.
    -colnumbering            Get the relationship between the columns in the old and new alignment.

    -selectcols { n,l,m-k }  Selection of columns to be removed from the alignment. Range: [0 - (Number of Columns - 1)]. (see User Guide).
    -selectseqs { n,l,m-k }  Selection of sequences to be removed from the alignment. Range: [0 - (Number of Sequences - 1)]. (see User Guide).

    -gt -gapthreshold <n>    1 - (fraction of sequences with a gap allowed). Range: [0 - 1]
    -st -simthreshold <n>    Minimum average similarity allowed. Range: [0 - 1]
    -ct -conthreshold <n>    Minimum consistency value allowed.Range: [0 - 1]
    -cons <n>                Minimum percentage of the positions in the original alignment to conserve. Range: [0 - 100]

    -nogaps                  Remove all positions with gaps in the alignment.
    -noallgaps               Remove columns composed only by gaps.
    -keepseqs                Keep sequences even if they are composed only by gaps.

    -gappyout                Use automated selection on "gappyout" mode. This method only uses information based on gaps' distribution. (see User Guide).
    -strict                  Use automated selection on "strict" mode. (see User Guide).
    -strictplus              Use automated selection on "strictplus" mode. (see User Guide).
                             (Optimized for Neighbour Joining phylogenetic tree reconstruction).

    -automated1              Use a heuristic selection of the automatic method based on similarity statistics. (see User Guide). (Optimized for Maximum Likelihood phylogenetic tree reconstruction).

    -terminalonly            Only columns out of internal boundaries (first and last column without gaps) are 
                             candidated to be trimmed depending on the applied method
    -block <n>               Minimum column block size to be kept in the trimmed alignment. Available with manual and automatic (gappyout) methods

    -resoverlap              Minimum overlap of a positions with other positions in the column to be considered a "good position". Range: [0 - 1]. (see User Guide).
    -seqoverlap              Minimum percentage of "good positions" that a sequence must have in order to be conserved. Range: [0 - 100](see User Guide).

    -clusters <n>            Get the most Nth representatives sequences from a given alignment. Range: [1 - (Number of sequences)]
    -maxidentity <n>         Get the representatives sequences for a given identity threshold. Range: [0 - 1].

    -w <n>                   (half) Window size, score of position i is the average of the window (i - n) to (i + n).
    -gw <n>                  (half) Window size only applies to statistics/methods based on Gaps.
    -sw <n>                  (half) Window size only applies to statistics/methods based on Similarity.
    -cw <n>                  (half) Window size only applies to statistics/methods based on Consistency.

    -sgc                     Print gap scores for each column in the input alignment.
    -sgt                     Print accumulated gap scores for the input alignment.
    -ssc                     Print similarity scores for each column in the input alignment.
    -sst                     Print accumulated similarity scores for the input alignment.
    -sfc                     Print sum-of-pairs scores for each column from the selected alignment
    -sft                     Print accumulated sum-of-pairs scores for the selected alignment
    -sident                  Print identity scores for all sequences in the input alignment. (see User Guide).

Some Examples:

1) Removes all positions in the alignment with gaps in 10% or more of
   the sequences, unless this leaves less than 60% of original alignment. 
   In such case, print the 60% best (with less gaps) positions.

   trimal -in <inputfile> -out <outputfile> -gt 0.9 -cons 60

2) As above but, the gap score is averaged over a window starting
   3 positions before and ending 3 positions after each column.

   trimal -in <inputfile> -out <outputfile> -gt 0.9 -cons 60 -w 3

3) Use an automatic method to decide optimal thresholds, based in the gap scores
   from input alignment. (see User Guide for details).

   trimal -in <inputfile> -out <outputfile> -gappyout

4) Use automatic methods to decide optimal thresholds, based on the combination 
   of gap and similarity scores. (see User Guide for details).

   trimal -in <inputfile> -out <outputfile> -strictplus

5) Use an heuristic to decide the optimal method for trimming the alignment. 
   (see User Guide for details).

   trimal -in <inputfile> -out <outputfile> -automated1

6) Use residues and sequences overlap thresholds to delete some sequences from the 
   alignemnt. (see User Guide for details).

   trimal -in <inputfile> -out <outputfile> -resoverlap 0.8 -seqoverlap 75

7) Selection of columns to be deleted from the alignment. The selection can 
   be a column number or a column number interval. Start from 0

   trimal -in <inputfile> -out <outputfile> -selectcols { 0,2,3,10,45-60,68,70-78 }

8) Get the complementary alignment from the alignment previously trimmed.

   trimal -in <inputfile> -out <outputfile> -selectcols { 0,2,3,10,45-60,68,70-78 } -complementary

9) Selection of sequences to be deleted from the alignment. Start in 0

   trimal -in <inputfile> -out <outputfile> -selectseqs { 2,4,8-12 } 

10) Select the 5 most representative sequences from the alignment

   trimal -in <inputfile> -out <outputfile> -clusters 5 

(phylogenetics) 

I'm trying some manual trimming as alternative

In [204]:
mkdir data/alignments_trimmed 2> /dev/null

trimappendix='trim-auto'

for a in "data/alignments_raw/$inseq"_aligned*.fasta
do  appendix=$(echo $a | cut -d '/' -f 3- | sed "s/$inseq\_//" | sed "s/.fasta//")
    if    [ ! -f data/alignments_trimmed/"$inseq"_"$appendix"_"$trimappendix".fasta ]
    then  echo "trimming alignment $a"
          sed -i 's/ /_/g' $a
          trimal -in $a   \
                 -out data/alignments_trimmed/"$inseq"_"$appendix"_"$trimappendix".fasta \
                 -automated1 \
                 -htmlout data/alignments_trimmed/"$inseq"_"$appendix"_"$trimappendix".html &
    fi
done
(phylogenetics) (phylogenetics) (phylogenetics) (phylogenetics) trimming alignment data/alignments_raw/1kP_LAR_selectionv1_guide_v5_aligned-mafft-linsi.fasta
[1] 10376
(phylogenetics) 

In [227]:
mkdir data/alignments_trimmed 2> /dev/null

trimappendix='trim-gt4-seq90-res8'

for a in "data/alignments_raw/$inseq"_aligned*.fasta
do  appendix=$(echo $a | cut -d '/' -f 3- | sed "s/$inseq\_//" | sed "s/.fasta//")
    if    [ ! -f data/alignments_trimmed/"$inseq"_"$appendix"_"$trimappendix".fasta ]
    then  echo "trimming alignment $a"
          sed -i 's/ /_/g' $a
          trimal -in $a   \
                 -out data/alignments_trimmed/"$inseq"_"$appendix"_"$trimappendix".fasta \
                 -gt .4 \
                 -seqoverlap 80 \
                 -resoverlap 0.7 \
                 -htmlout data/alignments_trimmed/"$inseq"_"$appendix"_"$trimappendix".html
    fi
done
(phylogenetics) (phylogenetics) (phylogenetics) (phylogenetics) (phylogenetics) 

In [215]:
ls data/alignments_trimmed
1kP_LAR_orthogroup_manual-selection-1_guidev4_aligned-mafft-linsi_trim-gt6-seq80.fasta
1kP_LAR_orthogroup_manual-selection-1_guidev4_aligned-mafft-linsi_trim-gt6-seq80.html
1kP_LAR_orthogroup_random700_guidev3_aligned-mafft-linsi_trim-gt6.html
1kP_LAR_orthogroup_random700_guidev3_aligned-mafft-linsi_trim-gt6-seq80.fasta
1kP_LAR_orthogroup_random700_guidev3_aligned-mafft-linsi_trim-gt6-seq80.html
1kP_LAR_orthogroup_random700_guidev3_aligned-mafft-linsi_trim-gt6-seq90.html
1kP_LAR_orthogroup_random700_guidev3_aligned-mafft-linsi_trim-gt7.html
1kP_LAR_orthogroup_random700_guidev3_aligned-mafft-linsi_trim-gt8.html
1kP_LAR_orthogroup_random700_guidev4_aligned-mafft-linsi_trim-gt6-seq80.fasta
1kP_LAR_orthogroup_random700_guidev4_aligned-mafft-linsi_trim-gt6-seq80.html
1kP_LAR_selectionv1_guide_v5_aligned-mafft-linsi_trim-auto.fasta
1kP_LAR_selectionv1_guide_v5_aligned-mafft-linsi_trim-auto.html
1kP_LAR_selectionv1_guide_v5_aligned-mafft-linsi_trim-gt6-seq20-res6.fasta
1kP_LAR_selectionv1_guide_v5_aligned-mafft-linsi_trim-gt6-seq20-res6.html
1kP_LAR_selectionv1_guide_v5_aligned-mafft-linsi_trim-gt6-seq80-res6.fasta
1kP_LAR_selectionv1_guide_v5_aligned-mafft-linsi_trim-gt6-seq80-res6.html
(phylogenetics) 

In [320]:
conda deactivate

Let's look at the stats

auto:

Selected Sequences: 808 /Selected Residues: 145

Deleted Sequences: 0 /Deleted Residues: 2346

t6-seq80-res6

Selected Sequences: 808 /Selected Residues: 303

Deleted Sequences: 0 /Deleted Residues: 2188

First verdict, the auto is defenitely too strict for my taste. I do like the selection of columns in the manual parameters but think that the selection of sequences can be more strict to filter out banding like this:

image.png

gt6-seq20-res6

Selected Sequences: 808 /Selected Residues: 303

Deleted Sequences: 0 /Deleted Residues: 2188

So that was pointless, let's try 90 instead:

gt6-seq90-res6

Selected Sequences: 800 /Selected Residues: 303

Deleted Sequences: 8 /Deleted Residues: 2188

Marginally better, lets keep trying

trim-gt6-seq95-res7

Selected Sequences: 584 /Selected Residues: 303

Deleted Sequences: 224 /Deleted Residues: 2188

image.png

In the end I choose this one: trim-gt4-seq95-res7

It has 584 sequences, discarding 224, and keeping 306 columns of information. horizontal banding looks like above.

Tree building

We'll make fast trees (not so acurate, no bootstraps, but fast) (optional) and we'll make "propper trees" using the amazing iqtree

fasttree

In [ ]:
rm -rf ./data/LAR_orthogroup_fasttrees
In [ ]:
ls analyses/1kP_LAR_orthogroup_random700_guidev3_fasttrees
In [ ]:
for a in data/alignments_trimmed/"$inseq"_aligned*.fasta
do echo "making a fasttree of file $a"
   appendix=$(echo $a | cut -d '/' -f 3- | sed "s/$inseq\_//" | sed "s/.fasta//")
   echo $appendix
   if   [ ! -d analyses/"$inseq"_fasttrees ]
   then mkdir  analyses/"$inseq"_fasttrees
   fi
   if   [ ! -d analyses/"$inseq"_fasttrees/"$appendix" ]
   then mkdir  analyses/"$inseq"_fasttrees/"$appendix" 
        fasttree -log analyses/"$inseq"_fasttrees/"$appendix"/"$inseq"_"$appendix"_fasttree.log \
                $a \
                > analyses/"$inseq"_fasttrees/"$appendix"/"$inseq"_"$appendix"_fasttree.tree \
                2> analyses/"$inseq"_fasttrees/"$appendix"/"$inseq"_"$appendix"_fasttree.stderr &
    fi
done
In [ ]:
tail analyses/"$inseq"_fasttrees/"$appendix"/"$inseq"_"$appendix"_fasttree.log
In [ ]:
ls analyses/"$inseq"_fasttrees/"$appendix"/

IQtree

Now let's loop over all made aligments, making a tree. Choose your parameters wisely, this can take a long time

  • Are doing a first testrun? Have a look at the -fast option
  • No time to wait for propper bootstraps, use the ultrafast bootstrap option, -bb 1000 (Don't believe nodes with less than 95% bootstrap
  • Writing for your amazing manuscript? Do propper bootstraps! -b 1000

Model finder, this is the best feature of iqtree. Evolution can happen in a lot of ways, and iqtree takes this into account. Use -m TEST to use modelfinder, or -m MFP for extended modelfinding (for your publication quality tree). Modelfinder, especially the extended version, can take a long time to calculate. So if you have done it once for a specific alignment, don't do it twice. ;)

Now have a look at these examples and the manual:

iqtree ultra fast bootstrap

In [ ]:
ls data/alignments_trimmed/"$inseq"_aligned*gt*.fasta
In [ ]:
grep 'Best-fit model'  ./analyses/1kP_LAR_orthogroup_random700_guidev3_trees/aligned-mafft-linsi_trim-gt6-seq80/1kP_LAR_orthogroup_random700_guidev3_aligned-mafft-linsi_trim-gt6-seq80_iqtree-bb2000-shalrt2000
In [230]:
iqpendix='iqtree-bb2000-alrt2000'

#for a in data/alignments_trimmed/"$inseq"_aligned*trim*.fasta
for a in data/alignments_trimmed/"$inseq"_aligned*trim-gt4-seq95-res7.fasta
do echo "making a tree of file $a"
   appendix=$(echo $a | cut -d '/' -f 3- | sed "s/$inseq\_//" | sed "s/.fasta//")
   echo $appendix
   if   [ ! -d analyses/"$inseq"_trees ]
   then mkdir  analyses/"$inseq"_trees
   fi
   if   [ ! -d analyses/"$inseq"_trees/"$appendix" ]
   then mkdir  analyses/"$inseq"_trees/"$appendix" 
   fi
   if   [ ! -f analyses/"$inseq"_trees/"$appendix"/"$inseq"_"$appendix"_"$iqpendix".tree ]
   then     nice iqtree -s $a \
            -bb 2000 \
            -alrt 2000 \
            -nt AUTO \
            -ntmax 12  \
            -pre analyses/"$inseq"_trees/"$appendix"/"$inseq"_"$appendix"_"$iqpendix" \
            2>  analyses/"$inseq"_trees/"$appendix"/"$inseq"_"$appendix"_"$iqpendix"-b1000 \
            >  analyses/"$inseq"_trees/"$appendix"/"$inseq"_"$appendix"_"$iqpendix" \
            -m MFP && \
            cat analyses/"$inseq"_trees/"$appendix"/"$inseq"_"$appendix"_"$iqpendix".log | mail -s LAR_selectionv1_guidev5 laura.w.dijkhuizen@gmail.com &
   fi
done
(phylogenetics) (phylogenetics) (phylogenetics) making a tree of file data/alignments_trimmed/1kP_LAR_selectionv1_guide_v5_aligned-mafft-linsi_trim-gt4-seq95-res7.fasta
aligned-mafft-linsi_trim-gt4-seq95-res7
[1] 17056
(phylogenetics) 

In [289]:
iqpendix='iqtree-b200'

#for a in data/alignments_trimmed/"$inseq"_aligned*trim*.fasta
for a in data/alignments_trimmed/"$inseq"_aligned*trim-gt4-seq95-res7.fasta
do echo "making a tree of file $a"
   appendix=$(echo $a | cut -d '/' -f 3- | sed "s/$inseq\_//" | sed "s/.fasta//")
   echo $appendix
   if   [ ! -d analyses/"$inseq"_trees ]
   then mkdir  analyses/"$inseq"_trees
   fi
   if   [ ! -d analyses/"$inseq"_trees/"$appendix" ]
   then mkdir  analyses/"$inseq"_trees/"$appendix" 
   fi
   if   [ ! -f analyses/"$inseq"_trees/"$appendix"/"$inseq"_"$appendix"_"$iqpendix".tree ]
   then     nice iqtree -s $a \
            -b 200 \
            -nt AUTO \
            -ntmax 12  \
            -pre analyses/"$inseq"_trees/"$appendix"/"$inseq"_"$appendix"_"$iqpendix" \
            2>  analyses/"$inseq"_trees/"$appendix"/"$inseq"_"$appendix"_"$iqpendix"-b1000 \
            >  analyses/"$inseq"_trees/"$appendix"/"$inseq"_"$appendix"_"$iqpendix" \
            -m LG+R7 && \
            cat analyses/"$inseq"_trees/"$appendix"/"$inseq"_"$appendix"_"$iqpendix".log | mail -s LAR_selectionv1_guidev5 laura.w.dijkhuizen@gmail.com &
   fi
done
making a tree of file data/alignments_trimmed/1kP_LAR_selectionv1_guide_v5_aligned-mafft-linsi_trim-gt4-seq95-res7.fasta
aligned-mafft-linsi_trim-gt4-seq95-res7
[1] 10412
In [290]:
ls analyses/"$inseq"_trees/aligned*/
1kP_LAR_selectionv1_guide_v5_aligned-mafft-linsi_trim-gt4-seq95-res7_iqtree-b200
1kP_LAR_selectionv1_guide_v5_aligned-mafft-linsi_trim-gt4-seq95-res7_iqtree-b200-b1000
1kP_LAR_selectionv1_guide_v5_aligned-mafft-linsi_trim-gt4-seq95-res7_iqtree-b200.boottrees
1kP_LAR_selectionv1_guide_v5_aligned-mafft-linsi_trim-gt4-seq95-res7_iqtree-b200.ckp.gz
1kP_LAR_selectionv1_guide_v5_aligned-mafft-linsi_trim-gt4-seq95-res7_iqtree-b200.log
1kP_LAR_selectionv1_guide_v5_aligned-mafft-linsi_trim-gt4-seq95-res7_iqtree-bb2000-alrt2000
1kP_LAR_selectionv1_guide_v5_aligned-mafft-linsi_trim-gt4-seq95-res7_iqtree-bb2000-alrt2000-b1000
1kP_LAR_selectionv1_guide_v5_aligned-mafft-linsi_trim-gt4-seq95-res7_iqtree-bb2000-alrt2000.bionj
1kP_LAR_selectionv1_guide_v5_aligned-mafft-linsi_trim-gt4-seq95-res7_iqtree-bb2000-alrt2000.ckp.gz
1kP_LAR_selectionv1_guide_v5_aligned-mafft-linsi_trim-gt4-seq95-res7_iqtree-bb2000-alrt2000.contree
1kP_LAR_selectionv1_guide_v5_aligned-mafft-linsi_trim-gt4-seq95-res7_iqtree-bb2000-alrt2000.iqtree
1kP_LAR_selectionv1_guide_v5_aligned-mafft-linsi_trim-gt4-seq95-res7_iqtree-bb2000-alrt2000.log
1kP_LAR_selectionv1_guide_v5_aligned-mafft-linsi_trim-gt4-seq95-res7_iqtree-bb2000-alrt2000.mldist
1kP_LAR_selectionv1_guide_v5_aligned-mafft-linsi_trim-gt4-seq95-res7_iqtree-bb2000-alrt2000.model.gz
1kP_LAR_selectionv1_guide_v5_aligned-mafft-linsi_trim-gt4-seq95-res7_iqtree-bb2000-alrt2000.splits.nex
1kP_LAR_selectionv1_guide_v5_aligned-mafft-linsi_trim-gt4-seq95-res7_iqtree-bb2000-alrt2000.treefile
1kP_LAR_selectionv1_guide_v5_aligned-mafft-linsi_trim-gt4-seq95-res7_iqtree-bb2000-alrt2000.uniqueseq.phy
In [321]:
tail -n 20 analyses/"$inseq"_trees/"$appendix"/"$inseq"_"$appendix"_"$iqpendix".log
Iteration 20 / LogL: -125634.698 / Time: 0h:3m:58s
Finish initializing candidate tree set (20)
Current best tree score: -125320.382 / CPU time: 201.466
Number of iterations: 20
--------------------------------------------------------------------
|               OPTIMIZING CANDIDATE TREE SET                      |
--------------------------------------------------------------------
Estimate model parameters (epsilon = 0.100)
BETTER TREE FOUND at iteration 21: -125311.000
Estimate model parameters (epsilon = 0.100)
BETTER TREE FOUND at iteration 22: -125303.691
Estimate model parameters (epsilon = 0.100)
BETTER TREE FOUND at iteration 23: -125276.323
Estimate model parameters (epsilon = 0.100)
BETTER TREE FOUND at iteration 24: -125273.840
Estimate model parameters (epsilon = 0.100)
BETTER TREE FOUND at iteration 26: -125254.052
Estimate model parameters (epsilon = 0.100)
BETTER TREE FOUND at iteration 30: -125246.818
Iteration 30 / LogL: -125246.818 / Time: 0h:5m:24s (0h:18m:37s left)
In [325]:
grep 'START' analyses/"$inseq"_trees/"$appendix"/"$inseq"_"$appendix"_"$iqpendix".log
===> START BOOTSTRAP REPLICATE NUMBER 1
===> START BOOTSTRAP REPLICATE NUMBER 2
===> START BOOTSTRAP REPLICATE NUMBER 3
===> START BOOTSTRAP REPLICATE NUMBER 4
===> START BOOTSTRAP REPLICATE NUMBER 5
===> START BOOTSTRAP REPLICATE NUMBER 6
===> START BOOTSTRAP REPLICATE NUMBER 7
===> START BOOTSTRAP REPLICATE NUMBER 8
===> START BOOTSTRAP REPLICATE NUMBER 9
===> START BOOTSTRAP REPLICATE NUMBER 10
===> START BOOTSTRAP REPLICATE NUMBER 11
===> START BOOTSTRAP REPLICATE NUMBER 12
===> START BOOTSTRAP REPLICATE NUMBER 13
===> START BOOTSTRAP REPLICATE NUMBER 14
===> START BOOTSTRAP REPLICATE NUMBER 15
===> START BOOTSTRAP REPLICATE NUMBER 16
===> START BOOTSTRAP REPLICATE NUMBER 17
===> START BOOTSTRAP REPLICATE NUMBER 18
===> START BOOTSTRAP REPLICATE NUMBER 19
===> START BOOTSTRAP REPLICATE NUMBER 20
===> START BOOTSTRAP REPLICATE NUMBER 21
===> START BOOTSTRAP REPLICATE NUMBER 22
===> START BOOTSTRAP REPLICATE NUMBER 23
===> START BOOTSTRAP REPLICATE NUMBER 24
===> START BOOTSTRAP REPLICATE NUMBER 25
===> START BOOTSTRAP REPLICATE NUMBER 26
===> START BOOTSTRAP REPLICATE NUMBER 27
===> START BOOTSTRAP REPLICATE NUMBER 28
===> START BOOTSTRAP REPLICATE NUMBER 29
===> START BOOTSTRAP REPLICATE NUMBER 30
===> START BOOTSTRAP REPLICATE NUMBER 31
===> START BOOTSTRAP REPLICATE NUMBER 32
===> START BOOTSTRAP REPLICATE NUMBER 33
===> START BOOTSTRAP REPLICATE NUMBER 34
===> START BOOTSTRAP REPLICATE NUMBER 35
===> START BOOTSTRAP REPLICATE NUMBER 36
===> START BOOTSTRAP REPLICATE NUMBER 37
===> START BOOTSTRAP REPLICATE NUMBER 38
===> START BOOTSTRAP REPLICATE NUMBER 39
===> START BOOTSTRAP REPLICATE NUMBER 40
===> START BOOTSTRAP REPLICATE NUMBER 41
===> START BOOTSTRAP REPLICATE NUMBER 42
===> START BOOTSTRAP REPLICATE NUMBER 43
===> START BOOTSTRAP REPLICATE NUMBER 44
===> START BOOTSTRAP REPLICATE NUMBER 45
===> START BOOTSTRAP REPLICATE NUMBER 46
===> START BOOTSTRAP REPLICATE NUMBER 47
===> START BOOTSTRAP REPLICATE NUMBER 48
===> START BOOTSTRAP REPLICATE NUMBER 49
===> START BOOTSTRAP REPLICATE NUMBER 50
===> START BOOTSTRAP REPLICATE NUMBER 51
===> START BOOTSTRAP REPLICATE NUMBER 52
===> START BOOTSTRAP REPLICATE NUMBER 53
===> START BOOTSTRAP REPLICATE NUMBER 54
===> START BOOTSTRAP REPLICATE NUMBER 55
===> START BOOTSTRAP REPLICATE NUMBER 56
===> START BOOTSTRAP REPLICATE NUMBER 57
===> START BOOTSTRAP REPLICATE NUMBER 58
===> START BOOTSTRAP REPLICATE NUMBER 59
===> START BOOTSTRAP REPLICATE NUMBER 60
===> START BOOTSTRAP REPLICATE NUMBER 61
===> START BOOTSTRAP REPLICATE NUMBER 62
===> START BOOTSTRAP REPLICATE NUMBER 63
===> START BOOTSTRAP REPLICATE NUMBER 64
===> START BOOTSTRAP REPLICATE NUMBER 65
===> START BOOTSTRAP REPLICATE NUMBER 66
===> START BOOTSTRAP REPLICATE NUMBER 67
===> START BOOTSTRAP REPLICATE NUMBER 68
===> START BOOTSTRAP REPLICATE NUMBER 69
===> START BOOTSTRAP REPLICATE NUMBER 70
===> START BOOTSTRAP REPLICATE NUMBER 71
===> START BOOTSTRAP REPLICATE NUMBER 72
===> START BOOTSTRAP REPLICATE NUMBER 73
===> START BOOTSTRAP REPLICATE NUMBER 74
===> START BOOTSTRAP REPLICATE NUMBER 75
===> START BOOTSTRAP REPLICATE NUMBER 76
===> START BOOTSTRAP REPLICATE NUMBER 77
===> START BOOTSTRAP REPLICATE NUMBER 78
===> START BOOTSTRAP REPLICATE NUMBER 79

tree storage

find the trees in iTOL: https://itol.embl.de/shared/lauradijkhuizen

Ferns LAR and WannebeLAR clades

We have identified roughly three clades of fern LAR and LAR likes, here I select sequences from fern species that have representative sequences in each of these three clades.

LAR

UGNK-2002432-Marattia attenuata

MEKP-2009607-Dipteris conjugata

VIBO-2075523-Osmunda javanica

VIBO-2074404-Osmunda javanica

MEKP-2006126-Dipteris conjugata

PNZO-2008392-Culcita macrocarpa

PNZO-2022483-Culcita macrocarpa

MEKP-2104936-Dipteris conjugata

CQPW-2095554-Anemia tomentosa

PNZO-2020054-Culcita macrocarpa

CVEG-2005709-Azolla cf. caroliniana

Azolla filiculoides LAR

Azfi s0197.g057377 characterized LAR

YJJY-2005734-Woodsia scopulina

ZXJO-2003107-Hemionitis arifolia

ZXJO-2003105-Hemionitis arifolia

PIVW-2004875-Ceratopteris thalictroides

PIVW-2004244-Ceratopteris thalictroides

PIVW-2004243-Ceratopteris thalictroides

ZXJO-2004441-Hemionitis arifolia

ZXJO-2004443-Hemionitis arifolia

ZXJO-2004442-Hemionitis arifolia

WLAR1

CAPN-2038587-Equisetum diffusum

JVSZ-2129506-Equisetum hymale

JVSZ-2001809-Equisetum hymale

JVSZ-2001808-Equisetum hymale

JVSZ-2001810-Equisetum hymale

JVSZ-2001811-Equisetum hymale

UGNK-2017109-Marattia attenuata

UGNK-2021182-Marattia attenuata

MEKP-2006924-Dipteris conjugata

EWXK-2119988-Thyrsopteris elegans

PNZO-2007161-Culcita macrocarpa

PNZO-2007160-Culcita macrocarpa

PNZO-2013494-Culcita macrocarpa

EWXK-2121103-Thyrsopteris elegans

VIBO-2007835-Osmunda javanica

MEKP-2104845-Dipteris conjugata

EWXK-2121424-Thyrsopteris elegans

PNZO-2150800-Culcita macrocarpa

VIBO-2006262-Osmunda javanica

VIBO-2006264-Osmunda javanica

CQPW-2096458-Anemia tomentosa

MEKP-2001942-Dipteris conjugata

CQPW-2014224-Anemia tomentosa

PNZO-2149017-Culcita macrocarpa

EWXK-2024179-Thyrsopteris elegans

EWXK-2022152-Thyrsopteris elegans

PNZO-2026447-Culcita macrocarpa

CVEG-2016335-Azolla cf. caroliniana

CVEG-2016336-Azolla cf. caroliniana

Afi v2 s1241G000110.6 PCBER

Afi v2 s1241G000080.2 IFR

PIVW-2013241-Ceratopteris thalictroides

PIVW-2091514-Ceratopteris thalictroides

ZXJO-2007558-Hemionitis arifolia

YJJY-2003866-Woodsia scopulina

YJJY-2003867-Woodsia scopulina

WLAR2

MEKP-2105811-Dipteris conjugata

VIBO-2008406-Osmunda javanica

MEKP-2014651-Dipteris conjugata

MEKP-2104929-Dipteris conjugata

CQPW-2024177-Anemia tomentosa

PIVW-2009700-Ceratopteris thalictroides

ZXJO-2008167-Hemionitis arifolia

ZXJO-2008168-Hemionitis arifolia

EWXK-2012291-Thyrsopteris elegans

EWXK-2012289-Thyrsopteris elegans

YJJY-2008533-Woodsia scopulina

UGNK-2006528-Marattia attenuata

ALVQ-2018495-Tmesipteris parva

QVMR-2016059-Psilotum nudum

ALVQ-2000101-Tmesipteris parva

ALVQ-2000100-Tmesipteris parva

ALVQ-2007082-Tmesipteris parva

QVMR-2017404-Psilotum nudum

UGNK-2021731-Marattia attenuata

PNZO-2016614-Culcita macrocarpa

PNZO-2016613-Culcita macrocarpa

QVMR-2013662-Psilotum nudum

ALVQ-2004508-Tmesipteris parva

ALVQ-2004507-Tmesipteris parva

ALVQ-2004506-Tmesipteris parva

EWXK-2000842-Thyrsopteris elegans

EWXK-2000843-Thyrsopteris elegans

EWXK-2000844-Thyrsopteris elegans

UGNK-2020446-Marattia attenuata

MEKP-2101772-Dipteris conjugata

MEKP-2006789-Dipteris conjugata

MEKP-2006790-Dipteris conjugata

In [245]:
mkdir analyses/WLARs
In [246]:
ls analyses/WLARs
fernLARs.txt  fernWLAR1.txt  fernWLAR2.txt
In [250]:
wc -l ./analyses/WLARs/fern*LAR*.txt
  21 ./analyses/WLARs/fernLARs.txt
  36 ./analyses/WLARs/fernWLAR1.txt
  31 ./analyses/WLARs/fernWLAR2.txt
  88 total
In [257]:
#sed -i 's/ /_/g' ./analyses/WLARs/fern*LAR*.txt
In [276]:
/opt/rnaseq/bin/extract_sequence_by_id.pl --ID=./analyses/WLARs/fernLARs.txt  \
                                          --fileformat=txt \
                                          --sequences=./data/1kP_LAR_selectionv1_guide_v5.fasta \
                                          --outfile=./analyses/WLARs/fernLARs.fasta
In [281]:
/opt/rnaseq/bin/extract_sequence_by_id.pl --ID=./analyses/WLARs/fernWLAR1.txt  \
                                          --fileformat=txt \
                                          --sequences=./data/1kP_LAR_selectionv1_guide_v5.fasta \
                                          --outfile=./analyses/WLARs/fernWLAR1.fasta
In [261]:
/opt/rnaseq/bin/extract_sequence_by_id.pl --ID=./analyses/WLARs/fernWLAR2.txt  \
                                          --fileformat=txt \
                                          --sequences=./data/1kP_LAR_selectionv1_guide_v5.fasta \
                                          --outfile=./analyses/WLARs/fernWLAR2.fasta
In [282]:
grep '>' -c ./analyses/WLARs/fern*LAR*.fasta
./analyses/WLARs/fernLARs.fasta:20
./analyses/WLARs/fernWLAR1.fasta:34
./analyses/WLARs/fernWLAR2.fasta:32
In [278]:
diff <(sort analyses/WLARs/fernLARs.txt) <(grep '>' analyses/WLARs/fernLARs.fasta | tr -d '>' | sort)
1,2d0
< Azfi_s0197.g057377 (characterized LAR)
< Azolla filiculoides LAR

In [283]:
diff <(sort analyses/WLARs/fernWLAR1.txt) <(grep '>' analyses/WLARs/fernWLAR1.fasta | tr -d '>' | sort)
1,2d0
< Afi_v2_s1241G000080.2 (IFR)
< Afi_v2_s1241G000110.6 (PCBER)

In [279]:
diff <(sort analyses/WLARs/fernWLAR2.txt) <(grep '>' analyses/WLARs/fernWLAR2.fasta | tr -d '>' | sort)
In [287]:
cat analyses/WLARs/fernLARs.fasta | sed 's/^>/>LAR_/g'>  analyses/WLARs/fernLARclades.fasta
cat analyses/WLARs/fernWLAR1.fasta | sed 's/^>/>WLAR1_/g'>> analyses/WLARs/fernLARclades.fasta
cat analyses/WLARs/fernWLAR2.fasta | sed 's/^>/>WLAR2_/g'>> analyses/WLARs/fernLARclades.fasta
In [288]:
head ./analyses/WLARs/fernLARclades.fasta
>LAR_PIVW-2004875-Ceratopteris_thalictroides
MGFKSKVLIFGGTGYIGKHIVRASVAEGHPTSVFVRSSTLQTKAELVGSFKELGVSIIEGSLDDHTGLVASIKEADIVISAVGGPCVPEQEKIIAAITEAGNVKRFLPSEFGTDVDHAKPLEPLKTIYGRKVAIRRKIEEAGVPYTYISSNCFAGYTLSNLMQFGKPAPPRDKVVIYGTGDKKAIFLKEEDIGTYTIKCADDPRTLNKVVYLRPPQNILTVNEVVSLWEKKIGQSLDKEFVSEEDMIEQIKAASIPKNIVLSTVHNIFVMGDQYNFKVGEKGVDSNELFPDVKYTTASEYLDKFV
>LAR_YJJY-2005734-Woodsia_scopulina
MGGKSRVLIFGATGYIGKHIARASVAEGHPTSVFVRPSTLQTKAELVESFKALGITFVEGSLDDHAGLVAAIKEVDVVISALGGPAVPEQEKIIAAIKEAGTVKRFLPSEFGNDVDHARALEPVNTMYGKKAAIRRKVEEAGIPFTYVSSNAFAGYTLSNLVQFGKPAPPRDRVVIYGTGDVKAIYVKEEDIGTYTVKTIDDPRTLNKVVYLRPSANILSVNEVVALWEKKIGKTLEKEYISEADLVEQIKTSPIPKNIVLGTVHNIFVKGDQTNFEVEEKKGVEASQLYPDVKYTTASEYLDKYV
>LAR_MEKP-2006126-Dipteris_conjugata
MGLKSRVLIIGATGYIGKHVARACVAQGHPTSILIRPPTLISKKELVDSFTELGISFIHGSLDDHASLVAAIKEVDVVISTVGGPAILEQEKIIIAIKEVGTVKRFLPSEFGNDVDRAKALEPVNTMYGKKASIRRKIEAAGIPYTYISSNAFAGYSLANLVQFGKPAPPRDKVTIYGSGEAKAIFVKEEDIGTYTVKTIDDPRTLNKIVYIRPPANILSVNEVVTMWEKKIGKTVEREYVSEQEMIEQIKTSPIPKNIVLATVHNIFVKGDQYNFEIGSDGEEASSLYPEVKYTSASEYLDKYL
>LAR_ZXJO-2004443-Hemionitis_arifolia
MGVKGRVLIFGGTGYIGRHIVRASVAEGHPTSVFIRSSTLQSKAELVSSFKELGVSFVEGSLDDHAGLVAAIKEVDVVISAVGGPCIPEQEKIIAAIKEAGTVKRFLPSEFGSDVDHAKPLEPVKTMYGKKVAIRRKIEEAGIPYTYVSANAFAGYTLSNLMQFGKPAPPRDKVVFYGTGDVKAIFLKEEDIGTYTIKSIDDPRALNKVLYLRPPQNILTANEVATLWEKKIGHPLEKEYVSEEDMIEQIKTSPIPKNIVLATVHNIFVMGDQFNFKIGEKGVDSSELYPDVKYTTASEFLDNFV
>LAR_PNZO-2008392-Culcita_macrocarpa
KSRVLIIGATGYIGKHVARASIAEGHSTSILVRPSTLTSKAELVESFKSLGITLVEGSLDDHASLVAAIKEVDVVISTVGGPAILEQEKIIIAIKEVGTVKRFLPSEFGNDVDRAKALEPVNTMYGKKVTIRRKVEAAGIPFTYISSNAFAGYSLANLVQFGKPAPPKDKVVIYGSGDAKAIFLKEEDIGTYTIKTIDDPRTLNKVVYLRPPANILSVNELVTLWEKKIGQTLEKEYVSEEVMIELIKTSPIPKNIVLATVHNIFVKGDQYNFKIGAEGVEASELYPDVKYTTASEYLDK

then add these

WLAR1_Afi_v2_s1241G000080.2 MAGGEGESKRVLVLGATGYIGKFIALAGPSLGHPTFALIRPSTIASKPDIVQSLQSAGITILQGSLDDHESLVAAFKQVDVVISAVGGAQLKDQLKVLEA IKEAGTIKRFIPSEFGNDVDRTHSLEPAQSLFKGKIEVRRSIEDAGIPYTYVVSNGFAGYFLSNLLQEGHTSPPRDKVTIYGSGDVKAIAVHEEDIGTYT IKAAFDPRALNKTLHIRPPANIITLNELVDKWEKKIGKTLEKITVTEEEFVKKIESMSLNPFFFSLDLFSLSFVCFLLVKHSTDVMSARCRYSISREHFS IYFAWHCLQRGANKFRAWTQ WLAR1_Afi_v2_s1241G000110.6 MIHPSIHSIMHLDCDDSFIPSLHVPQSQQTQSSSMNDSADRTWESRERETDRQRGMAGGGEGESKRILILGATGYIGKFIALAGPSLGHPTFALIRPSTI ASKPDLVQSLRSAGISILQGSLDDHESLVAAFKQVDVVISAVGEAQLKDQLKILDAIKEVGTIKRFIPSEFGSDVNHSQGLGPAQSLFKAKVEIRRSIED AGIPYMYVVANGFAGYFLSSLLQEGHTSPPRDKVTIYGSGDVKVIAVYEEDVGTYTIKAAFDPRTLNKTLHIRPPANIVTFNELVDKWEKKIGKSLEKIT VTEEEFVKKIEGTPFPGNLFLSLLHGIVFKGDQTNFELGPNDVEATSLYPDVKYTSVDDYLDRFV LAR_Azolla filiculoides LAR MGVKSRVLIIGATGYIGKHVARASVAEGHPTSILIRPSTLTTKAELVTSFKDLGITLVEGS LDDHAGLVAAIKEVDVVISTVGGPAIPEQEKIIAAIKEAGNVLRFLPSEFGNDVDHAK ALEPVNTMYGKKVTIRRKIEEAGIPYTYISSNAFAGYSLSNLVQFGKPSPPRDKVTIYG SGDAKAIFLKEEDIGLFTIKTIDDPRTLNKIVYLRPPGNILSVNEVVSLWESKIGAKLER EYVSEEDMIVLIKTSPIPKNIVLATVHNIFVRGDQYNFEIGEKGVEASTLYPDVKYTTAS EYLDKFV LAR_Azfi_s0197.g057377 MGVKSRVLIIGATGYIGKHVARASVAEGHPTSILIRPSTLTTKAELVTSFKDLGITLVEGSLDDHAGLVA AIKEVDVSSRPSVGLPSPSKRRSLLLLKKQGMFLPSEFGNDVDHAKALEPVNTMYGKKVTIRRKIEEAGI PYTYISSNAFAGYSLSNLVQFGKPSPPRDKVTIYGSGDAKAIFLKEEDIGLFTIKTIDDPRTLNKIVYLR PPGNILSVNEVVSLWESKIGAKLEREYVSEEDMIVLIKTSPIPKNIVLATVHNIFVRGDQYNFEIGEKGV EASTLYPDVKYTTASEYLDKFV

In [299]:
conda activate phylogenetics
linsi --thread 6 analyses/WLARs/fernLARclades.fasta > analyses/WLARs/fernLARclades_aligned-mafft-linsi.fasta
conda deactivate
(phylogenetics) outputhat23=16
treein = 0
compacttree = 0
stacksize: 8192 kb
rescale = 1
All-to-all alignment.
tbfast-pair (aa) Version 7.455
alg=L, model=BLOSUM62, 2.00, -0.10, +0.10, noshift, amax=0.0
6 thread(s)

outputhat23=16
Loading 'hat3.seed' ... 
done.
Writing hat3 for iterative refinement
rescale = 1
Gap Penalty = -1.53, +0.00, +0.00
tbutree = 1, compacttree = 0
Constructing a UPGMA tree ... 
   80 / 90
done.

Progressive alignment ... 
STEP    81 /89 (thread    0) 
Reallocating (by thread 1) ..done. *alloclen = 1732
STEP    89 /89 (thread    5) 
done.
tbfast (aa) Version 7.455
alg=A, model=BLOSUM62, 1.53, -0.00, -0.00, noshift, amax=0.0
6 thread(s)

minimumweight = 0.000010
autosubalignment = 0.000000
nthread = 6
randomseed = 0
blosum 62 / kimura 200
poffset = 0
niter = 16
sueff_global = 0.100000
nadd = 16
Loading 'hat3' ... done.
rescale = 1

   80 / 90
Segment   1/  1    1- 480
004-0176-0 (thread    4) identical     
Converged.

Converged2.
done
dvtditr (aa) Version 7.455
alg=A, model=BLOSUM62, 1.53, -0.00, -0.00, noshift, amax=0.0
6 thread(s)


Strategy:
 L-INS-i (Probably most accurate, very slow)
 Iterative refinement method (<16) with LOCAL pairwise alignment information

If unsure which option to use, try 'mafft --auto input > output'.
For more information, see 'mafft --help', 'mafft --man' and the mafft page.

The default gap scoring scheme has been changed in version 7.110 (2013 Oct).
It tends to insert more gaps into gap-rich regions than previous versions.
To disable this change, add the --leavegappyregion option.

(phylogenetics) 
In [305]:
mkdir ./analyses/WLARs/fernLARclades_tree
nice iqtree -s ./analyses/WLARs/fernLARclades_aligned-mafft-linsi.fasta \
            -b 1000  \
            -nt AUTO \
            -ntmax 6  \
            -pre ./analyses/WLARs/fernLARclades_tree/fernLARclades_aligned-mafft-linsi_iqtree-b1000 \
            2> ./analyses/WLARs/fernLARclades_tree/fernLARclades_aligned-mafft-linsi_iqtree-b1000.stderr \
            >  ./analyses/WLARs/fernLARclades_tree/fernLARclades_aligned-mafft-linsi_iqtree-b1000.stdout \
            -m 'JTTDCMut+I+G4' && \
            cat ./analyses/WLARs/fernLARclades_tree/fernLARclades_aligned-mafft-linsi_iqtree-b1000.log | mail -s WLAR_tree laura.w.dijkhuizen@gmail.com &
mkdir: cannot create directory ‘./analyses/WLARs/fernLARclades_tree’: File exists
[2]+  Done                    nice iqtree -s ./analyses/WLARs/fernLARclades_aligned-mafft-linsi.fasta -bb 2000 -alrt 2000 -nt AUTO -ntmax 12 -pre ./analyses/WLARs/fernLARclades_tree/fernLARclades_aligned-mafft-linsi_iqtree-bb2000-alrt2000 -m MFP 2> ./analyses/WLARs/fernLARclades_tree/fernLARclades_aligned-mafft-linsi_iqtree-bb2000-alrt2000.stderr > ./analyses/WLARs/fernLARclades_tree/fernLARclades_aligned-mafft-linsi_iqtree-bb2000-alrt2000.stdout && cat ./analyses/WLARs/fernLARclades_tree/fernLARclades_aligned-mafft-linsi_iqtree-bb2000-alrt2000.log | mail -s WLAR_tree laura.w.dijkhuizen@gmail.com
[2] 18561
In [316]:
tail ./analyses/WLARs/fernLARclades_tree/fernLARclades_aligned-mafft-linsi_iqtree-b1000.stdout -n 40
Estimate model parameters (epsilon = 0.100)
1. Initial log-likelihood: -21555.439
2. Current log-likelihood: -21207.645
3. Current log-likelihood: -21196.585
4. Current log-likelihood: -21194.970
5. Current log-likelihood: -21194.577
6. Current log-likelihood: -21194.457
Optimal log-likelihood: -21194.413
Proportion of invariable sites: 0.015
Gamma shape alpha: 1.542
Parameters optimization took 6 rounds (2.019 sec)
Computing ML distances based on estimated model parameters... 2.679 sec
Computing BIONJ tree...
0.006 seconds
Log-likelihood of BIONJ tree: -21152.638
--------------------------------------------------------------------
|             INITIALIZING CANDIDATE TREE SET                      |
--------------------------------------------------------------------
Generating 98 parsimony trees... 5.373 second
Computing log-likelihood of 98 initial trees ... 13.320 seconds
Current best score: -21100.602

Do NNI search on 20 best initial trees
Estimate model parameters (epsilon = 0.100)
BETTER TREE FOUND at iteration 1: -21066.366
Estimate model parameters (epsilon = 0.100)
BETTER TREE FOUND at iteration 9: -21065.968
Iteration 10 / LogL: -21076.774 / Time: 0h:0m:35s
Estimate model parameters (epsilon = 0.100)
BETTER TREE FOUND at iteration 11: -21065.300
Iteration 20 / LogL: -21078.465 / Time: 0h:0m:49s
Finish initializing candidate tree set (20)
Current best tree score: -21065.300 / CPU time: 46.436
Number of iterations: 20
--------------------------------------------------------------------
|               OPTIMIZING CANDIDATE TREE SET                      |
--------------------------------------------------------------------
Estimate model parameters (epsilon = 0.100)
BETTER TREE FOUND at iteration 23: -21065.040
Iteration 30 / LogL: -21065.089 / Time: 0h:1m:8s (0h:3m:41s left)