vignettes/04_Create_Interpro_profile.Rmd
Rbims also includes functions to read and explore InterProScan annotations. The steps below show how to do this.
First, load the rbims package.
read_interpro(): Parse InterProScan Outputs
The read_interpro function is designed to parse and
format raw output files from the InterProScan annotation tool.
Additionally, this function can parse KEGG IDs,
but the KEGG analysis is only possible if InterProScan
was run with the -pa option.
File Path: Provide the path to the directory containing your InterProScan files. Even if there is only one file, the path must point to the directory, not to the file itself.
File Extension: The function specifically looks for files ending
in *.tsv.
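Before calling read_interpro(), it can help to check which files in the directory carry the expected extension. The sketch below uses only base R. The directory and file names are made up for illustration, and the regular expression is an assumption about how a `*.tsv` filter can be expressed, not a description of the function's internals.

```r
# Hypothetical directory with mixed files, created in a temporary location
input_dir <- file.path(tempdir(), "iprscan_example")
dir.create(input_dir, showWarnings = FALSE)
file.create(file.path(input_dir, c("bin_1.tsv", "bin_2.tsv", "notes.txt")))

# List only the files whose names end in .tsv
tsv_files <- list.files(input_dir, pattern = "\\.tsv$")
print(tsv_files)
```

Files that do not end in `.tsv`, such as `notes.txt` here, are left out of the listing.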
The processed input contains 15 key columns:
Protein accession (e.g. P51587)
Sequence MD5 digest (e.g. 14086411a2cdf1c4cba63020e1622579)
Sequence length (e.g. 3418)
Analysis (e.g. Pfam / PRINTS / Gene3D)
Signature accession (e.g. PF09103 / G3DSA:2.40.50.140)
Signature description (e.g. BRCA2 repeat profile)
Start location
Stop location
Score - is the e-value (or score) of the match reported by member database method (e.g. 3.1E-52)
Status - is the status of the match (T: true)
Date - is the date of the run
InterPro annotations - accession (e.g. IPR002093)
InterPro annotations - description (e.g. BRCA2 repeat)
GO annotations with their source(s), e.g. GO:0005515(InterPro)|GO:0006302(PANTHER)|GO:0007195(InterPro,PANTHER). This is an optional column; only displayed if the --goterms option is switched on
Pathways annotations, e.g. REACT_71. This is an optional column; only displayed if the --pathways option is switched on
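To inspect a raw file by hand before passing it to read_interpro(), the 15 columns can be read with base R. This is a minimal sketch, not part of rbims: the column names are chosen for this example, and the single record is assembled from the example values in the list, with start, stop, and date invented for illustration.

```r
# Column names for the 15 fields listed above (names chosen for this example)
ipr_cols <- c("protein_accession", "md5", "seq_length", "analysis",
              "signature_accession", "signature_description",
              "start", "stop", "score", "status", "date",
              "interpro_accession", "interpro_description",
              "go_annotations", "pathways")

# One made-up record; start, stop, and date are invented values
record <- c("P51587", "14086411a2cdf1c4cba63020e1622579", "3418", "Pfam",
            "PF09103", "BRCA2 repeat profile", "100", "200", "3.1E-52",
            "T", "01-01-2024", "IPR002093", "BRCA2 repeat",
            "GO:0005515(InterPro)", "REACT_71")
tsv_path <- tempfile(fileext = ".tsv")
writeLines(paste(record, collapse = "\t"), tsv_path)

# The TSV has no header row. fill = TRUE tolerates rows where the optional
# trailing columns are absent, and reading everything as text keeps values
# such as the status flag "T" verbatim instead of converting them to TRUE.
raw <- read.delim(tsv_path, header = FALSE, col.names = ipr_cols,
                  fill = TRUE, quote = "", colClasses = "character")

# Count the rows contributed by each member database
print(table(raw$analysis))
```

Tabulating the analysis column this way gives a quick view of which member databases are present in a file.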
If you want to follow this example, you can download the raw data here.
To obtain a wide output, use the argument profile = T.
To explore more databases, the argument database can
include the following options:
“INTERPRO”
“TIGRFAM”
“SUPERFAMILY”
“SMART”
“SFLD”
“ProSiteProfiles”
“ProSitePatterns”
“ProDom”
“PRINTS”
“PIRSF”
“MobiDBLite”
“Hamap”
“Gene3D”
“Coils”
“CDD”
interpro_profile_T <- read_interpro(data_interpro = "../test/results/01.iprscan",
                                    database = "Pfam",
                                    profile = T,
                                    write = F)
head(interpro_profile_T)

| Pfam | domain_name | 5mSIPHEX1_0 | 5mSIPHEX1_1 | 5mSIPHEX1_10 | 5mSIPHEX1_11 | 5mSIPHEX1_13 | 5mSIPHEX1_15 | 5mSIPHEX1_18 | 5mSIPHEX1_19 | 5mSIPHEX1_2 | 5mSIPHEX1_25 | 5mSIPHEX1_26 | 5mSIPHEX1_32 | 5mSIPHEX1_33 | 5mSIPHEX1_37 | 5mSIPHEX1_8 | 5mSIPHEX1_9 | 5mSIPHEX2_10 | 5mSIPHEX2_14 | 5mSIPHEX2_16 | 5mSIPHEX2_18 | 5mSIPHEX2_25 | 5mSIPHEX2_3 | 5mSIPHEX2_5 | 5mSIPHEX2_7 | 700mSIPHEX1_0 | 700mSIPHEX1_1 | 700mSIPHEX1_12 | 700mSIPHEX1_15 | 700mSIPHEX1_17 | 700mSIPHEX1_18 | 700mSIPHEX1_2 | 700mSIPHEX1_20 | 700mSIPHEX1_3 | 700mSIPHEX1_8 | 700mSIPHEX2_13 | 700mSIPHEX2_14 | 700mSIPHEX2_16 | 700mSIPHEX2_21 | 700mSIPHEX2_22 | 700mSIPHEX2_23 | 700mSIPHEX2_24 | 700mSIPHEX2_9 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| PF00005 | ABC transporter | 74 | 42 | 76 | 30 | 54 | 21 | 28 | 26 | 27 | 30 | 20 | 28 | 21 | 61 | 100 | 20 | 69 | 78 | 16 | 31 | 91 | 60 | 29 | 54 | 41 | 98 | 36 | 29 | 29 | 36 | 32 | 33 | 76 | 44 | 44 | 29 | 36 | 29 | 33 | 34 | 34 | 83 |
| PF08352 | Oligopeptide/dipeptide transporter, C-terminal region | 13 | 4 | 3 | 1 | 7 | 1 | 1 | 0 | 0 | 1 | 0 | 1 | 1 | 6 | 10 | 0 | 12 | 3 | 0 | 1 | 8 | 6 | 0 | 7 | 6 | 11 | 1 | 1 | 1 | 1 | 1 | 1 | 3 | 3 | 3 | 1 | 1 | 1 | 1 | 1 | 2 | 10 |
| PF06242 | Transcriptional cell cycle regulator TrcR | 1 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| PF00586 | AIR synthase related protein, N-terminal domain | 3 | 1 | 2 | 3 | 2 | 4 | 3 | 1 | 3 | 3 | 2 | 4 | 2 | 2 | 2 | 2 | 3 | 2 | 1 | 3 | 2 | 2 | 2 | 2 | 1 | 3 | 3 | 3 | 2 | 3 | 2 | 3 | 2 | 0 | 0 | 2 | 3 | 3 | 2 | 3 | 2 | 3 |
| PF02769 | AIR synthase related protein, C-terminal domain | 3 | 1 | 3 | 3 | 2 | 4 | 4 | 1 | 3 | 3 | 2 | 4 | 2 | 2 | 2 | 3 | 3 | 3 | 2 | 3 | 2 | 2 | 2 | 2 | 2 | 3 | 4 | 4 | 3 | 4 | 3 | 3 | 3 | 0 | 0 | 3 | 4 | 4 | 3 | 4 | 3 | 3 |
| PF02021 | Uncharacterised protein family UPF0102 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 2 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| PF01654 | Cytochrome bd terminal oxidase subunit I | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 1 | 0 | 0 | 2 | 1 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 2 | 0 | 1 | 1 | 2 | 3 | 1 | 1 | 0 | 0 | 1 | 2 | 1 | 3 | 0 | 1 | 2 |
| PF00528 | Binding-protein-dependent transport system inner membrane component | 54 | 30 | 39 | 10 | 31 | 3 | 7 | 8 | 4 | 10 | 0 | 9 | 2 | 40 | 78 | 2 | 53 | 39 | 0 | 10 | 67 | 36 | 9 | 31 | 31 | 59 | 2 | 6 | 2 | 8 | 6 | 12 | 39 | 29 | 29 | 2 | 8 | 6 | 7 | 2 | 18 | 52 |
| PF00920 | Dehydratase family | 4 | 2 | 3 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 0 | 2 | 0 | 2 | 5 | 2 | 4 | 3 | 0 | 3 | 4 | 2 | 1 | 2 | 2 | 3 | 1 | 2 | 1 | 2 | 2 | 2 | 3 | 1 | 1 | 1 | 2 | 2 | 2 | 1 | 2 | 3 |
| PF01458 | SUF system FeS cluster assembly, SufBD | 2 | 2 | 2 | 2 | 2 | 2 | 4 | 0 | 2 | 2 | 2 | 2 | 2 | 0 | 2 | 0 | 2 | 2 | 2 | 2 | 2 | 0 | 2 | 2 | 1 | 2 | 2 | 2 | 2 | 0 | 2 | 2 | 2 | 0 | 0 | 2 | 0 | 2 | 2 | 2 | 0 | 2 |
Or print a long table with profile = F.
interpro_profile_F <- read_interpro(data_interpro = "../test/results/01.iprscan",
                                    database = "Pfam",
                                    profile = F,
                                    write = F)
head(interpro_profile_F)

| Scaffold_name | Bin_name | Pfam | domain_name | Abundance |
|---|---|---|---|---|
| 5mSIPHEX1_0_scaffold_3_c1_201 | 5mSIPHEX1_0 | PF00005 | ABC transporter | 74 |
| 5mSIPHEX1_0_scaffold_3_c1_201 | 5mSIPHEX1_0 | PF08352 | Oligopeptide/dipeptide transporter, C-terminal region | 13 |
| 5mSIPHEX1_0_scaffold_12_c2_232 | 5mSIPHEX1_0 | PF06242 | Transcriptional cell cycle regulator TrcR | 1 |
| 5mSIPHEX1_0_scaffold_9_c1_171 | 5mSIPHEX1_0 | PF00586 | AIR synthase related protein, N-terminal domain | 3 |
| 5mSIPHEX1_0_scaffold_9_c1_171 | 5mSIPHEX1_0 | PF02769 | AIR synthase related protein, C-terminal domain | 3 |
| 5mSIPHEX1_0_scaffold_12_c2_272 | 5mSIPHEX1_0 | PF02021 | Uncharacterised protein family UPF0102 | 1 |
| 5mSIPHEX1_0_scaffold_4_c1_367 | 5mSIPHEX1_0 | PF01654 | Cytochrome bd terminal oxidase subunit I | 1 |
| 5mSIPHEX1_0_scaffold_4_c1_39 | 5mSIPHEX1_0 | PF00528 | Binding-protein-dependent transport system inner membrane component | 54 |
| 5mSIPHEX1_0_scaffold_62_c1_152 | 5mSIPHEX1_0 | PF00920 | Dehydratase family | 4 |
| 5mSIPHEX1_0_scaffold_9_c1_116 | 5mSIPHEX1_0 | PF01458 | SUF system FeS cluster assembly, SufBD | 2 |
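The relationship between the long and wide layouts can be illustrated with base R. The sketch below is not how read_interpro() builds its output. It only pivots a small long table into one column per bin, using the abundances of PF00005 and PF08352 in the bins 5mSIPHEX1_0 and 5mSIPHEX1_1 as they appear in the wide table; the Scaffold_name and domain_name columns are left out to keep the example short.

```r
# Small long-format table: one row per bin and Pfam domain
long <- data.frame(
  Bin_name  = c("5mSIPHEX1_0", "5mSIPHEX1_0", "5mSIPHEX1_1", "5mSIPHEX1_1"),
  Pfam      = c("PF00005", "PF08352", "PF00005", "PF08352"),
  Abundance = c(74, 13, 42, 4)
)

# Sum abundances for each Pfam (rows) and bin (columns);
# combinations that do not occur in the long table become 0
wide <- xtabs(Abundance ~ Pfam + Bin_name, data = long)

# Convert the contingency table to a plain data frame, keeping the layout
wide <- as.data.frame.matrix(wide)
print(wide)
```

In this form each row is a Pfam accession and each column is a bin, which is the layout that profile = T returns.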
You can export this profile like this:
write.table(interpro_profile_T, "Interpro_profile_T.tsv",
quote = F,
sep = "\t",
row.names = F,
col.names = T)

Alternatively, set write = T in the function read_interpro().