Another tool you can use is dbCAN. Here is the function to explore this type of file:
First, load the rbims package.
The function to use is read_dbcan3. It parses the information in dbCAN3 output files.
The input should be the path to the directory where the dbCAN output files are stored; the files should have the extension overview.txt. The dbCAN output data should have 6 columns: the bin names, followed by the genes found by each algorithm (HMMER, Hotpep, DIAMOND), a ‘Signalp’ column indicating whether a signal peptide was found, and a ‘#ofTools’ column indicating the number of algorithms that found the gene.
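Before running the function, it can help to check which files in a directory carry the expected extension. The following is a minimal sketch in base R; the directory and file names are hypothetical and created only for illustration, and this is not necessarily how read_dbcan3 locates its input internally.

```r
# Hypothetical directory and file names, created only for this illustration.
dbcan_dir <- file.path(tempdir(), "dbcan_example")
dir.create(dbcan_dir, showWarnings = FALSE)
file.create(file.path(dbcan_dir,
                      c("binA_overview.txt", "binB_overview.txt", "notes.csv")))

# Keep only the files whose names end in "overview.txt".
overview_files <- list.files(dbcan_dir, pattern = "overview\\.txt$",
                             full.names = TRUE)
basename(overview_files)
```

If the resulting vector is empty, the path or the file extension is probably not what the function expects.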
The output format is chosen with the profile argument. When profile = T, a wide output is obtained.
The write argument saves the formatted table as a .tsv file. When write = F, the function returns the output but does not save the table in your current directory.
If you want to follow the example, you can download the rbims test file.
dbcan_profile <- read_dbcan3(dbcan_path = "../inst/extdata/test_data/", profile = T, write = F)
#> Warning: Expected 1 pieces. Additional pieces discarded in 12 rows [1, 2, 3, 7, 9, 13,
#> 16, 17, 22, 24, 31, 36].
#> Warning: Expected 1 pieces. Additional pieces discarded in 44 rows [1, 2, 3, 4, 5, 6, 7,
#> 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, ...].
#> [1] "Input Genes = 123"
#> [1] "Remained Genes after filtering = 44"
#> [1] "Percentage of genes remained = 36%"
#> [1] "Number of genes with signals = 2"
#> [1] "Number of genes with signals that passed filtering = 2"
head(dbcan_profile)
#> # A tibble: 6 × 4
#>   dbCAN_family domain_name                       htn_bins_104_sub htn_bins_108
#>   <chr>        <chr>                                        <dbl>        <dbl>
#> 1 CBM2         carbohydrate-binding module [CBM]                1            0
#> 2 CE12         carbohydrate esterases [CEs]                     1            0
#> 3 CE4          carbohydrate esterases [CEs]                     1            0
#> 4 GH0          glycoside hydrolases [GHs]                       1            0
#> 5 GH13         glycoside hydrolases [GHs]                       1            4
#> 6 GH140        glycoside hydrolases [GHs]                       1            0
Or print a long table with profile = F.
dbcan_profile <- read_dbcan3(dbcan_path = "../inst/extdata/test_data/",
                             profile = F, write = F)
#> Warning: Expected 1 pieces. Additional pieces discarded in 12 rows [1, 2, 3, 7, 9, 13,
#> 16, 17, 22, 24, 31, 36].
#> Warning: Expected 1 pieces. Additional pieces discarded in 44 rows [1, 2, 3, 4, 5, 6, 7,
#> 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, ...].
#> [1] "Input Genes = 123"
#> [1] "Remained Genes after filtering = 44"
#> [1] "Percentage of genes remained = 36%"
#> [1] "Number of genes with signals = 2"
#> [1] "Number of genes with signals that passed filtering = 2"
head(dbcan_profile)
#> # A tibble: 6 × 5
#> # Groups:   Bin_name, dbCAN_family, domain_name [6]
#>   Bin_name         dbCAN_family domain_name                    signalp Abundance
#>   <chr>            <chr>        <chr>                          <chr>       <int>
#> 1 htn_bins_104_sub CBM2         carbohydrate-binding module [… N               1
#> 2 htn_bins_104_sub CE12         carbohydrate esterases [CEs]   N               1
#> 3 htn_bins_104_sub CE4          carbohydrate esterases [CEs]   N               1
#> 4 htn_bins_104_sub GH0          glycoside hydrolases [GHs]     N               1
#> 5 htn_bins_104_sub GH13         glycoside hydrolases [GHs]     N               1
#> 6 htn_bins_104_sub GH140        glycoside hydrolases [GHs]     N               1
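The wide and long outputs hold the same counts in two shapes: the long table has one row per bin and family, while the wide table has one column per bin. As a rough sketch of that relationship, the base R snippet below pivots a small toy long table into a wide one. The toy values are illustrative (modeled on the output above), and xtabs is used here only for demonstration; it is not claimed to be what read_dbcan3 does internally.

```r
# Toy long-format table modeled on the output above (values are illustrative).
long <- data.frame(
  Bin_name     = c("htn_bins_104_sub", "htn_bins_104_sub", "htn_bins_108"),
  dbCAN_family = c("CBM2", "GH13", "GH13"),
  Abundance    = c(1L, 1L, 4L),
  stringsAsFactors = FALSE
)

# Cross-tabulate: one row per family, one column per bin.
# Family/bin pairs that are absent from the long table become 0.
wide <- as.data.frame.matrix(
  xtabs(Abundance ~ dbCAN_family + Bin_name, data = long)
)
wide
```

In the result, a family missing from a bin (here CBM2 in htn_bins_108) appears as 0, which matches how zeros show up in the wide profile.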
Notice that in both cases the function prints several lines summarizing the information recovered from the input files: the total number of genes, the genes remaining after filtering, and the number of genes that have signals and passed the filtering.
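The reported percentage can be checked by hand from the two gene counts. This short sketch assumes the value is the rounded ratio of remaining genes to input genes; the variable names are hypothetical, and the numbers are taken from the messages printed above.

```r
# Counts taken from the messages printed by the example above.
input_genes    <- 123
remained_genes <- 44

# Assumed calculation: share of genes kept, rounded to a whole percent.
pct_remained <- round(remained_genes / input_genes * 100)
paste0("Percentage of genes remained = ", pct_remained, "%")
```

Here 44 / 123 is about 35.8%, which rounds to the 36% shown in the output.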
You can export this to a table like this:
write.table(dbcan_profile, "dbcan.tsv", quote = F, sep = "\t", row.names = F, col.names = T)
Or by setting write = T.
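To confirm that an exported table can be loaded again, you can read the .tsv file back with read.delim. The sketch below uses a small toy table and a temporary file so it runs on its own; the table contents and the file path are hypothetical, and only the write.table arguments mirror the call shown above.

```r
# Toy table standing in for dbcan_profile (values are illustrative).
toy_profile <- data.frame(
  dbCAN_family     = c("CBM2", "GH13"),
  htn_bins_104_sub = c(1L, 1L),
  htn_bins_108     = c(0L, 4L),
  stringsAsFactors = FALSE
)

# Write with the same arguments as in the export example, to a temporary file.
tsv_file <- tempfile(fileext = ".tsv")
write.table(toy_profile, tsv_file, quote = F, sep = "\t",
            row.names = F, col.names = T)

# Read the tab-separated file back in, keeping the column names unchanged.
reloaded <- read.delim(tsv_file, check.names = FALSE, stringsAsFactors = FALSE)
reloaded
```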