Option 1: Start with the genome.fa and genome.gff3 files

bash genome2cdspep.sh genome.fa genome.gff3 genome_abbr key_str

key_str is the separator character in transcript IDs that distinguishes different transcripts of the same gene, usually . or -

Output files for KK4D:

genome_abbr.pep
genome_abbr.cds

For example, part of Ath.gff3:

1       araport11       gene    3631    5899    .       +       .       ID=gene:AT1G01010;Name=NAC001;biotype=protein_coding;description=NAC domain-containing protein 1 [Source:UniProtKB/Swiss-Prot%3BAcc:Q0WV96];gene_id=AT1G01010;logic_name=araport11
1       araport11       mRNA    3631    5899    .       +       .       ID=AT1G01010.1;Parent=gene:AT1G01010;biotype=protein_coding;transcript_id=AT1G01010.1
1       araport11       five_prime_UTR  3631    3759    .       +       .       Parent=AT1G01010.1
1       araport11       exon    3631    3913    .       +       .       Parent=AT1G01010.1;Name=AT1G01010.1.exon1;constitutive=1;ensembl_end_phase=1;ensembl_phase=-1;exon_id=AT1G01010.1.exon1;rank=1
1       araport11       CDS     3760    3913    .       +       0       ID=CDS:AT1G01010.1;Parent=AT1G01010.1;protein_id=AT1G01010.1
1       araport11       exon    3996    4276    .       +       .       Parent=AT1G01010.1;Name=AT1G01010.1.exon2;constitutive=1;ensembl_end_phase=0;ensembl_phase=1;exon_id=AT1G01010.1.exon2;rank=2
1       araport11       CDS     3996    4276    .       +       2       ID=CDS:AT1G01010.1;Parent=AT1G01010.1;protein_id=AT1G01010.1

The lines above are part of the content of Ath.gff3. Note the attribute protein_id=AT1G01010.1: in the ID AT1G01010.1, different transcripts of the same gene are distinguished by the suffix after the last "." character, so key_str is "." here. The command to obtain the cds and pep files of Ath should therefore be similar to the following:

bash genome2cdspep.sh Ath.genome.fa Ath.gff3 Ath .

The script outputs Ath.pep and Ath.cds, which contain the protein sequences and CDS sequences used as input to KK4D.
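To choose key_str for another genome, it can help to list the transcript IDs in the GFF3 and check which character separates the gene name from the transcript suffix. The following is only a sketch and is not part of KK4D: transcript_ids and gene_of are hypothetical helper names, and it assumes a tab-separated GFF3 whose mRNA features carry an ID= attribute, as in the Ath.gff3 excerpt.

```shell
# Hypothetical helpers for inspecting a GFF3 before running genome2cdspep.sh.

# Print the ID= attribute of every mRNA feature (reads the given files, or stdin).
transcript_ids() {
    awk -F'\t' '$3 == "mRNA" {
        n = split($9, attrs, ";")
        for (i = 1; i <= n; i++)
            if (attrs[i] ~ /^ID=/) print substr(attrs[i], 4)
    }' "$@"
}

# Strip everything from the last occurrence of key_str onward.
# If key_str does not occur, the ID is printed unchanged.
gene_of() {
    local id=$1 key=$2
    printf '%s\n' "${id%"$key"*}"
}

# One mRNA line from the excerpt above, fed through both helpers.
printf '1\taraport11\tmRNA\t3631\t5899\t.\t+\t.\tID=AT1G01010.1;Parent=gene:AT1G01010\n' |
    transcript_ids          # prints AT1G01010.1
gene_of AT1G01010.1 .       # prints AT1G01010
```

On a real file, transcript_ids Ath.gff3 | head shows the first few IDs. If stripping the suffix with gene_of gives the expected gene names, the chosen key_str is probably right.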

Option 2: Start with the genome.gff3, genome.pep, and genome.cds files

Input: Copy config.ini to your working directory and edit it to match your own configuration, then use -c config.ini to specify the location of the configuration file. Alternatively, pass the parameters directly on the command line.

Get the config.ini file

KK4D.sh init

This will create a config.ini file in your current working directory.

For coline (collinearity) analysis

KK4D.sh coline -c /path/to/config.ini

To get all of the above results for 1 or 2 species from the gff3, cds.fa, and protein.fa files

KK4D.sh all -c /path/to/config.ini

For gene and protein analysis of chromosome 1 of the A.trichopoda and M.domestica genomes

(This is for testing purposes only; in general, the whole genome should be analyzed.)

KK4D.sh all -group 2 -cpu 32 -key ID ID -type mRNA mRNA -sample A.trichopoda M.domestica -abbr Ath Mdo -gff3 Ath.chr1.gff3 Mdo.chr1.gff3 -protein Ath.pep.fa.gz Mdo.genome.protein.fa -cds Ath.cds.fa.gz Mdo.cds.fa -chrnum 1 1
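The README does not say how the chromosome 1 test inputs were prepared. One possible way to produce such a GFF3 subset is sketched below. This is not part of KK4D: subset_chr is a hypothetical helper, and it assumes a tab-separated GFF3 in which chromosome 1 is named exactly 1 in the first column, as in the Ath.gff3 excerpt.

```shell
# Hypothetical helper: keep header/comment lines and features on one chromosome.
subset_chr() {
    local chr=$1
    shift
    awk -F'\t' -v chr="$chr" '/^#/ || $1 == chr' "$@"
}

# Small demonstration on three lines; only the header and the chromosome 1 feature remain.
printf '##gff-version 3\n1\taraport11\tgene\t3631\t5899\t.\t+\t.\tID=gene:AT1G01010\n2\taraport11\tgene\t1025\t3173\t.\t+\t.\tID=gene:AT2G01008\n' |
    subset_chr 1
```

On a real annotation this would be run as subset_chr 1 Ath.gff3 > Ath.chr1.gff3. The protein and CDS FASTA files would need a matching filter, which is not shown here.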

For whole-genome analysis of M.domestica

KK4D.sh all -group 1 -cpu 24 -key ID -type mRNA -sample M.domestica -abbr Mdo -gff3 gene_models_20170612.gff3.gz -protein /share/home/Mdo.pep.fa -cds /share/home/Mdo.cds.fa -chrnum 17