InterProScan profile creation • rbims

A common way to annotate the genes of a new or reconstructed genome is to run InterProScan. rbims provides functions to explore that output.

First, load the rbims package.

library(rbims)

The function that reads InterProScan output is read_interpro. It can parse the PFAM, INTERPRO, and KEGG ids. The KEGG analysis is only possible if InterProScan was run with the -pa option. Two output formats are available: a wide profile or a long table.

The database argument selects which database's annotations to parse. This example explores the PFAM output.
The profile argument sets the output format. When profile = T, a wide profile is returned; when profile = F, a long table is returned.
The write argument saves the formatted table as a .tsv file. When write = F, the function returns the output but does not save the table in your current directory.

If you want to follow the example, you can download and use the rbims test file.

interpro_pfam_profile <- read_interpro(data_interpro = "../inst/extdata/Interpro_test.tsv", database = "Pfam", profile = T)

head(interpro_pfam_profile)
#> # A tibble: 6 × 8
#>   PFAM    domain_name                   Bin_10 Bin_12 Bin_56 Bin_113 Bin_1 Bin_2
#>   <chr>   <chr>                          <int>  <int>  <int>   <int> <int> <int>
#> 1 PF03595 Voltage-dependent anion chan…      1      1      1       0     0     0
#> 2 PF00440 Bacterial regulatory protein…      0      0      0       1     1     0
#> 3 PF13305 WHG domain                         0      0      0       1     1     0
#> 4 PF01131 DNA topoisomerase                  1      0      0       0     0     1
#> 5 PF08272 Topoisomerase I zinc-ribbon-…      1      0      0       0     0     1
#> 6 PF01751 Toprim domain                      1      0      0       0     0     1
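
A wide profile like this one can be summarised with base R. The sketch below is only an illustration: it rebuilds the six rows printed above by hand as a toy data frame (the name toy_profile and the n_bins column are assumptions made for this example, not part of rbims), then counts in how many bins each domain appears and how many domains each bin carries.

```r
# Toy wide profile typed in from the six rows printed above.
# It is NOT read from the rbims test file; toy_profile is a hypothetical name.
toy_profile <- data.frame(
  PFAM    = c("PF03595", "PF00440", "PF13305", "PF01131", "PF08272", "PF01751"),
  Bin_10  = c(1L, 0L, 0L, 1L, 1L, 1L),
  Bin_12  = c(1L, 0L, 0L, 0L, 0L, 0L),
  Bin_56  = c(1L, 0L, 0L, 0L, 0L, 0L),
  Bin_113 = c(0L, 1L, 1L, 0L, 0L, 0L),
  Bin_1   = c(0L, 1L, 1L, 0L, 0L, 0L),
  Bin_2   = c(0L, 0L, 0L, 1L, 1L, 1L),
  stringsAsFactors = FALSE
)

# Columns that hold per-bin counts.
bin_cols <- grep("^Bin_", names(toy_profile), value = TRUE)

# Number of bins in which each domain was detected at least once.
toy_profile$n_bins <- rowSums(toy_profile[, bin_cols] > 0)

# Number of distinct domains detected in each bin.
domains_per_bin <- colSums(toy_profile[, bin_cols] > 0)

print(toy_profile[, c("PFAM", "n_bins")])
print(domains_per_bin)
```

The same two lines should work on interpro_pfam_profile itself, assuming its bin columns all start with "Bin_" as in the printed output.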

To obtain a long table instead, set profile = F.

interpro_pfam_long <- read_interpro("../inst/extdata/Interpro_test.tsv", database = "Pfam", profile = F)

head(interpro_pfam_long)
#> # A tibble: 6 × 5
#>   Bin_name Scaffold_name      PFAM    domain_name                      Abundance
#>   <chr>    <chr>              <chr>   <chr>                                <int>
#> 1 Bin_10   scaffold_441_c1_24 PF03595 Voltage-dependent anion channel          1
#> 2 Bin_12   scaffold_69_c1_124 PF03595 Voltage-dependent anion channel          1
#> 3 Bin_56   scaffold_71_c1_69  PF03595 Voltage-dependent anion channel          1
#> 4 Bin_113  scaffold_145_c1_85 PF00440 Bacterial regulatory proteins, …         1
#> 5 Bin_113  scaffold_145_c1_85 PF13305 WHG domain                               1
#> 6 Bin_1    scaffold_146_c1_1  PF00440 Bacterial regulatory proteins, …         1
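
To see how the long and wide formats relate, the sketch below cross-tabulates a toy long table into a wide one with base R's xtabs. This is an assumption-laden illustration, not a description of how read_interpro builds its profile internally: the toy_long data frame is typed in from the six rows printed above, and the domain_name and Scaffold_name columns are left out for brevity.

```r
# Toy long table typed in from the six rows printed above
# (hypothetical name; NOT read from the rbims test file).
toy_long <- data.frame(
  Bin_name  = c("Bin_10", "Bin_12", "Bin_56", "Bin_113", "Bin_113", "Bin_1"),
  PFAM      = c("PF03595", "PF03595", "PF03595", "PF00440", "PF13305", "PF00440"),
  Abundance = c(1L, 1L, 1L, 1L, 1L, 1L),
  stringsAsFactors = FALSE
)

# Sum Abundance for every PFAM x bin combination.
# Combinations that never occur in the long table are filled with 0.
toy_wide <- xtabs(Abundance ~ PFAM + Bin_name, data = toy_long)

# Convert the contingency table to a data frame: one row per PFAM id,
# one column per bin.
toy_wide_df <- as.data.frame.matrix(toy_wide)
print(toy_wide_df)
```

Only the bins and domains present in these six rows appear in the result, so it is smaller than the wide profile produced from the full test file.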

You can export the long table to a tab-separated file like this:

write.table(interpro_pfam_long, "Interpro.tsv", quote = F, sep = "\t", row.names = F, col.names = T)

Alternatively, set write = T when calling read_interpro.
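
If you export the table manually, a quick round trip can confirm that the file was written as expected. The following is a minimal sketch using a two-row toy table and a temporary directory; the names toy_table, out_file, and check are assumptions for this example only, and the write.table arguments mirror the call shown above.

```r
# Two-row toy table standing in for interpro_pfam_long (hypothetical data frame).
toy_table <- data.frame(
  Bin_name  = c("Bin_10", "Bin_12"),
  PFAM      = c("PF03595", "PF03595"),
  Abundance = c(1L, 1L),
  stringsAsFactors = FALSE
)

# Write to a temporary directory so the working directory stays untouched.
out_file <- file.path(tempdir(), "Interpro.tsv")
write.table(toy_table, out_file, quote = FALSE, sep = "\t",
            row.names = FALSE, col.names = TRUE)

# Read the file back; read.delim expects a header and tab separators by default.
check <- read.delim(out_file, stringsAsFactors = FALSE)

# The column names and dimensions should match the table that was written.
print(names(check))
print(dim(check))
```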