A common way to annotate the genes of a new or reconstructed genome is to run InterProScan. Here are some functions to explore that output.
First, load the rbims package.
library(rbims)
The function that reads this information is read_interpro. It can parse the PFAM, INTERPRO, and KEGG IDs. The KEGG analysis is only possible if InterProScan was run with the -pa option. Two output formats are available: a wide profile or a long table.
The database argument selects which database to parse. In this example, I will explore the PFAM output.
The output format is chosen with the profile argument. When profile = T, a wide output is obtained.
If you want to follow the example, you can download the rbims test file.
interpro_pfam_profile<-read_interpro(data_interpro = "Interpro_test.tsv", database="PFAM", profile = T)
head(interpro_pfam_profile)
#> # A tibble: 6 x 8
#> PFAM domain_name Bin_10 Bin_12 Bin_56 Bin_113 Bin_1 Bin_2
#> <chr> <chr> <int> <int> <int> <int> <int> <int>
#> 1 PF03595 Voltage-dependent anion chan… 1 1 1 0 0 0
#> 2 PF00440 Bacterial regulatory protein… 0 0 0 1 1 0
#> 3 PF13305 WHG domain 0 0 0 1 1 0
#> 4 PF01131 DNA topoisomerase 1 0 0 0 0 1
#> 5 PF08272 Topoisomerase I zinc-ribbon-… 1 0 0 0 0 1
#> 6 PF01751 Toprim domain 1 0 0 0 0 1
Alternatively, set profile = F to obtain a long table.
interpro_pfam_long<-read_interpro("Interpro_test.tsv", database="PFAM", profile = F)
head(interpro_pfam_long)
#> # A tibble: 6 x 5
#> Bin_name Scaffold_name PFAM domain_name Abundance
#> <chr> <chr> <chr> <chr> <int>
#> 1 Bin_10 scaffold_441_c1_… PF03595 Voltage-dependent anion channel 1
#> 2 Bin_12 scaffold_69_c1_1… PF03595 Voltage-dependent anion channel 1
#> 3 Bin_56 scaffold_71_c1_69 PF03595 Voltage-dependent anion channel 1
#> 4 Bin_113 scaffold_145_c1_… PF00440 Bacterial regulatory proteins, t… 1
#> 5 Bin_113 scaffold_145_c1_… PF13305 WHG domain 1
#> 6 Bin_1 scaffold_146_c1_1 PF00440 Bacterial regulatory proteins, t… 1
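The wide profile and the long table carry the same information in different shapes. As a rough illustration only (this is not how read_interpro is implemented), the sketch below rebuilds a wide profile from a small hand-typed long table using base R's xtabs. The toy_long data frame is hypothetical: it copies only the column names and a few values from the output shown above.

```r
# Toy long table with the same column names as the long output above.
# Values are typed by hand for illustration, not read from Interpro_test.tsv.
toy_long <- data.frame(
  Bin_name  = c("Bin_10", "Bin_12", "Bin_56", "Bin_113", "Bin_113", "Bin_1"),
  PFAM      = c("PF03595", "PF03595", "PF03595", "PF00440", "PF13305", "PF00440"),
  Abundance = c(1L, 1L, 1L, 1L, 1L, 1L),
  stringsAsFactors = FALSE
)

# Sum Abundance for every PFAM x bin combination; missing pairs become 0.
toy_counts <- xtabs(Abundance ~ PFAM + Bin_name, data = toy_long)

# Turn the contingency table into a data frame: one row per PFAM, one column per bin.
toy_wide <- as.data.frame.matrix(toy_counts)
print(toy_wide)
```

Each row of toy_wide is a PFAM ID and each column is a bin, which is the same layout as the profile = T output, minus the domain_name column.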
You can export the long table to a tab-separated file like this.
write.table(interpro_pfam_long, "Interpro.tsv", quote = F, sep = "\t", row.names = F, col.names = T)
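To confirm that an export of this kind worked, the file can be read back with base R. This is a minimal sketch that assumes a small hypothetical data frame (toy) and a temporary file in place of the real rbims output, so it runs without the package or the test file.

```r
# Hypothetical stand-in for the long table; only the column names mimic the real output.
toy <- data.frame(
  Bin_name  = c("Bin_10", "Bin_12"),
  PFAM      = c("PF03595", "PF03595"),
  Abundance = c(1L, 1L),
  stringsAsFactors = FALSE
)

# Write to a temporary file with the same write.table() options used above.
out_file <- tempfile(fileext = ".tsv")
write.table(toy, out_file, quote = FALSE, sep = "\t",
            row.names = FALSE, col.names = TRUE)

# Read the file back and compare dimensions and column names with the original.
round_trip <- read.delim(out_file, stringsAsFactors = FALSE)
stopifnot(nrow(round_trip) == nrow(toy),
          identical(names(round_trip), names(toy)))
```

Because row.names = FALSE leaves out the row index, the file contains only the header line and the data rows, which keeps it easy to open in a spreadsheet or another tool.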