The rbims package also includes functions to read and explore InterProScan annotations. This section shows how to use them.

First, load the rbims package.
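
A minimal way to do this, assuming rbims is already installed. The `requireNamespace()` guard and the `rbims_available` variable are illustrative additions, not part of rbims itself:

```r
# Check whether rbims is installed before attaching it.
rbims_available <- requireNamespace("rbims", quietly = TRUE)

if (rbims_available) {
  library(rbims)
} else {
  # Installation steps depend on your setup, so they are not shown here.
  message("rbims is not installed; install it before running the examples.")
}
```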

read_interpro(): Parse InterProScan Outputs

The read_interpro() function parses and formats raw output files from the InterProScan annotation tool. Additionally, it can parse KEGG ID information, but the KEGG analysis is only possible if InterProScan was run with the -pa option.

Input Requirements

  • File Path: Provide the path to the directory containing your InterProScan files. Even if there is just one file, the path must point to the directory, not to the file itself.

  • File Extension: The function specifically looks for files ending in *.tsv.
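
Before calling the function, it can help to confirm which files in a directory match these requirements. The sketch below uses only base R and a throwaway directory with made-up file names; it does not show how read_interpro() itself discovers files:

```r
# Hypothetical directory and file names, created only for this illustration.
interpro_dir <- file.path(tempdir(), "01.iprscan")
dir.create(interpro_dir, showWarnings = FALSE)
invisible(file.create(file.path(interpro_dir,
                                c("bin_1.tsv", "bin_2.tsv", "notes.txt"))))

# Keep only files whose names end in ".tsv".
tsv_files <- list.files(interpro_dir, pattern = "\\.tsv$", full.names = TRUE)
basename(tsv_files)
```

With your own data, replace `interpro_dir` with the path to your InterProScan output directory.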

Data Structure

The InterProScan input contains up to 15 columns (the last two are optional):

  • Protein accession (e.g. P51587)

  • Sequence MD5 digest (e.g. 14086411a2cdf1c4cba63020e1622579)

  • Sequence length (e.g. 3418)

  • Analysis (e.g. Pfam / PRINTS / Gene3D)

  • Signature accession (e.g. PF09103 / G3DSA:2.40.50.140)

  • Signature description (e.g. BRCA2 repeat profile)

  • Start location

  • Stop location

  • Score: the e-value (or score) of the match, as reported by the member database method (e.g. 3.1E-52)

  • Status: the status of the match (T: true)

  • Date: the date of the run

  • InterPro annotations - accession (e.g. IPR002093)

  • InterPro annotations - description (e.g. BRCA2 repeat)

  • GO annotations with their source(s), e.g. GO:0005515(InterPro)|GO:0006302(PANTHER)|GO:0007195(InterPro,PANTHER). This is an optional column; it is only displayed if the --goterms option is switched on

  • Pathways annotations, e.g. REACT_71. This is an optional column; it is only displayed if the --pathways option is switched on
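
To inspect a raw file with this layout outside of rbims, a base R sketch like the following may be useful. The column names are illustrative labels chosen here (InterProScan TSV files have no header row), and the example row combines the sample values listed above with placeholder start, stop and date values:

```r
# Illustrative column names (an assumption; not defined by InterProScan or rbims).
ipr_cols <- c("protein_accession", "md5", "seq_length", "analysis",
              "signature_accession", "signature_description",
              "start", "stop", "score", "status", "date",
              "interpro_accession", "interpro_description",
              "go_terms", "pathways")

# One example row; start, stop and date are placeholders made up for this sketch.
example_row <- c("P51587", "14086411a2cdf1c4cba63020e1622579", "3418", "Pfam",
                 "PF09103", "BRCA2 repeat profile", "1", "100", "3.1E-52",
                 "T", "01-01-2020", "IPR002093", "BRCA2 repeat",
                 "GO:0005515(InterPro)|GO:0006302(PANTHER)", "REACT_71")

tsv_file <- tempfile(fileext = ".tsv")
writeLines(paste(example_row, collapse = "\t"), tsv_file)

# Read everything as character so that the status "T" is not converted to TRUE.
ipr <- read.delim(tsv_file, header = FALSE, col.names = ipr_cols,
                  colClasses = "character", quote = "")

# Convert the numeric columns explicitly.
ipr$seq_length <- as.integer(ipr$seq_length)
ipr$score <- as.numeric(ipr$score)
str(ipr)
```

Reading all columns as character first is a deliberate choice: with automatic type conversion, base R would interpret the status value T as the logical TRUE.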

If you want to follow this example, you can download the raw data here.

To obtain a wide output, use the argument profile = T. The database argument selects which annotation source is summarized. The example below uses “Pfam”; to explore more databases, the argument can also take the following options:

  • “INTERPRO”

  • “TIGRFAM”

  • “SUPERFAMILY”

  • “SMART”

  • “SFLD”

  • “ProSiteProfiles”

  • “ProSitePatterns”

  • “ProDom”

  • “PRINTS”

  • “PIRSF”

  • “MobiDBLite”

  • “Hamap”

  • “Gene3D”

  • “Coils”

  • “CDD”

interpro_profile_T <- read_interpro(data_interpro = "../test/results/01.iprscan",
                                    database = "Pfam",
                                    profile = T,
                                    write = F)
head(interpro_profile_T)
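Table 1. InterProScan Profile Overview (profile = T)

| Pfam | domain_name | 5mSIPHEX1_0 | 5mSIPHEX1_1 | 5mSIPHEX1_10 | 5mSIPHEX1_11 | 5mSIPHEX1_13 | 5mSIPHEX1_15 | 5mSIPHEX1_18 | 5mSIPHEX1_19 | 5mSIPHEX1_2 | 5mSIPHEX1_25 | 5mSIPHEX1_26 | 5mSIPHEX1_32 | 5mSIPHEX1_33 | 5mSIPHEX1_37 | 5mSIPHEX1_8 | 5mSIPHEX1_9 | 5mSIPHEX2_10 | 5mSIPHEX2_14 | 5mSIPHEX2_16 | 5mSIPHEX2_18 | 5mSIPHEX2_25 | 5mSIPHEX2_3 | 5mSIPHEX2_5 | 5mSIPHEX2_7 | 700mSIPHEX1_0 | 700mSIPHEX1_1 | 700mSIPHEX1_12 | 700mSIPHEX1_15 | 700mSIPHEX1_17 | 700mSIPHEX1_18 | 700mSIPHEX1_2 | 700mSIPHEX1_20 | 700mSIPHEX1_3 | 700mSIPHEX1_8 | 700mSIPHEX2_13 | 700mSIPHEX2_14 | 700mSIPHEX2_16 | 700mSIPHEX2_21 | 700mSIPHEX2_22 | 700mSIPHEX2_23 | 700mSIPHEX2_24 | 700mSIPHEX2_9 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| PF00005 | ABC transporter | 74 | 42 | 76 | 30 | 54 | 21 | 28 | 26 | 27 | 30 | 20 | 28 | 21 | 61 | 100 | 20 | 69 | 78 | 16 | 31 | 91 | 60 | 29 | 54 | 41 | 98 | 36 | 29 | 29 | 36 | 32 | 33 | 76 | 44 | 44 | 29 | 36 | 29 | 33 | 34 | 34 | 83 |
| PF08352 | Oligopeptide/dipeptide transporter, C-terminal region | 13 | 4 | 3 | 1 | 7 | 1 | 1 | 0 | 0 | 1 | 0 | 1 | 1 | 6 | 10 | 0 | 12 | 3 | 0 | 1 | 8 | 6 | 0 | 7 | 6 | 11 | 1 | 1 | 1 | 1 | 1 | 1 | 3 | 3 | 3 | 1 | 1 | 1 | 1 | 1 | 2 | 10 |
| PF06242 | Transcriptional cell cycle regulator TrcR | 1 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| PF00586 | AIR synthase related protein, N-terminal domain | 3 | 1 | 2 | 3 | 2 | 4 | 3 | 1 | 3 | 3 | 2 | 4 | 2 | 2 | 2 | 2 | 3 | 2 | 1 | 3 | 2 | 2 | 2 | 2 | 1 | 3 | 3 | 3 | 2 | 3 | 2 | 3 | 2 | 0 | 0 | 2 | 3 | 3 | 2 | 3 | 2 | 3 |
| PF02769 | AIR synthase related protein, C-terminal domain | 3 | 1 | 3 | 3 | 2 | 4 | 4 | 1 | 3 | 3 | 2 | 4 | 2 | 2 | 2 | 3 | 3 | 3 | 2 | 3 | 2 | 2 | 2 | 2 | 2 | 3 | 4 | 4 | 3 | 4 | 3 | 3 | 3 | 0 | 0 | 3 | 4 | 4 | 3 | 4 | 3 | 3 |
| PF02021 | Uncharacterised protein family UPF0102 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 2 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| PF01654 | Cytochrome bd terminal oxidase subunit I | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 1 | 0 | 0 | 2 | 1 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 2 | 0 | 1 | 1 | 2 | 3 | 1 | 1 | 0 | 0 | 1 | 2 | 1 | 3 | 0 | 1 | 2 |
| PF00528 | Binding-protein-dependent transport system inner membrane component | 54 | 30 | 39 | 10 | 31 | 3 | 7 | 8 | 4 | 10 | 0 | 9 | 2 | 40 | 78 | 2 | 53 | 39 | 0 | 10 | 67 | 36 | 9 | 31 | 31 | 59 | 2 | 6 | 2 | 8 | 6 | 12 | 39 | 29 | 29 | 2 | 8 | 6 | 7 | 2 | 18 | 52 |
| PF00920 | Dehydratase family | 4 | 2 | 3 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 0 | 2 | 0 | 2 | 5 | 2 | 4 | 3 | 0 | 3 | 4 | 2 | 1 | 2 | 2 | 3 | 1 | 2 | 1 | 2 | 2 | 2 | 3 | 1 | 1 | 1 | 2 | 2 | 2 | 1 | 2 | 3 |
| PF01458 | SUF system FeS cluster assembly, SufBD | 2 | 2 | 2 | 2 | 2 | 2 | 4 | 0 | 2 | 2 | 2 | 2 | 2 | 0 | 2 | 0 | 2 | 2 | 2 | 2 | 2 | 0 | 2 | 2 | 1 | 2 | 2 | 2 | 2 | 0 | 2 | 2 | 2 | 0 | 0 | 2 | 0 | 2 | 2 | 2 | 0 | 2 |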

Or print a long table with profile = F.

interpro_profile_F <- read_interpro(data_interpro = "../test/results/01.iprscan",
                                    database = "Pfam",
                                    profile = F,
                                    write = F)
head(interpro_profile_F)
Table 2. InterProScan Profile Overview (profile = F)

| Scaffold_name | Bin_name | Pfam | domain_name | Abundance |
| --- | --- | --- | --- | --- |
| 5mSIPHEX1_0_scaffold_3_c1_201 | 5mSIPHEX1_0 | PF00005 | ABC transporter | 74 |
| 5mSIPHEX1_0_scaffold_3_c1_201 | 5mSIPHEX1_0 | PF08352 | Oligopeptide/dipeptide transporter, C-terminal region | 13 |
| 5mSIPHEX1_0_scaffold_12_c2_232 | 5mSIPHEX1_0 | PF06242 | Transcriptional cell cycle regulator TrcR | 1 |
| 5mSIPHEX1_0_scaffold_9_c1_171 | 5mSIPHEX1_0 | PF00586 | AIR synthase related protein, N-terminal domain | 3 |
| 5mSIPHEX1_0_scaffold_9_c1_171 | 5mSIPHEX1_0 | PF02769 | AIR synthase related protein, C-terminal domain | 3 |
| 5mSIPHEX1_0_scaffold_12_c2_272 | 5mSIPHEX1_0 | PF02021 | Uncharacterised protein family UPF0102 | 1 |
| 5mSIPHEX1_0_scaffold_4_c1_367 | 5mSIPHEX1_0 | PF01654 | Cytochrome bd terminal oxidase subunit I | 1 |
| 5mSIPHEX1_0_scaffold_4_c1_39 | 5mSIPHEX1_0 | PF00528 | Binding-protein-dependent transport system inner membrane component | 54 |
| 5mSIPHEX1_0_scaffold_62_c1_152 | 5mSIPHEX1_0 | PF00920 | Dehydratase family | 4 |
| 5mSIPHEX1_0_scaffold_9_c1_116 | 5mSIPHEX1_0 | PF01458 | SUF system FeS cluster assembly, SufBD | 2 |
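
The relationship between the long and wide layouts can be sketched with base R. This is only an illustration of the reshaping idea on a four-row toy table (values taken from the PF00005 and PF08352 rows of the wide table); it is not necessarily how read_interpro() builds its output internally:

```r
# Toy long table: one row per (bin, Pfam) pair.
long <- data.frame(
  Bin_name  = c("5mSIPHEX1_0", "5mSIPHEX1_0", "5mSIPHEX1_1", "5mSIPHEX1_1"),
  Pfam      = c("PF00005", "PF08352", "PF00005", "PF08352"),
  Abundance = c(74, 13, 42, 4)
)

# Cross-tabulate: Pfam IDs become rows, bins become columns.
wide <- xtabs(Abundance ~ Pfam + Bin_name, data = long)

# Optional: convert the contingency table to a plain data frame.
wide_df <- as.data.frame.matrix(wide)
wide_df
```

The wide layout is convenient for heatmaps and ordination, while the long layout is easier to filter and summarize by scaffold or bin.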

You can export this profile like this:

write.table(interpro_profile_T, "Interpro_profile_T.tsv", 
            quote = F, 
            sep = "\t", 
            row.names = F, 
            col.names = T)

Or set write = T in the read_interpro() function.
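
When reading an exported profile back into R, note that bin names starting with a digit (such as 5mSIPHEX1_0) are altered by base R's default name checking. The round trip below uses a small toy data frame and a temporary file, both made up for this sketch:

```r
# Toy profile with one bin column whose name starts with a digit.
profile_demo <- data.frame(
  Pfam = c("PF00005", "PF08352"),
  domain_name = c("ABC transporter",
                  "Oligopeptide/dipeptide transporter, C-terminal region"),
  `5mSIPHEX1_0` = c(74, 13),
  check.names = FALSE
)

out_file <- file.path(tempdir(), "Interpro_profile_demo.tsv")
write.table(profile_demo, out_file,
            quote = FALSE, sep = "\t", row.names = FALSE, col.names = TRUE)

# check.names = FALSE keeps the column names exactly as written in the file.
reloaded <- read.delim(out_file, check.names = FALSE, quote = "")
names(reloaded)
```

Without check.names = FALSE, read.delim() would turn the bin column name into a syntactically valid R name by prefixing it with "X".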