Another supported tool is dbCAN. This section shows the function used to explore its output files.

First, load the rbims package with library(rbims).

The function for this is read_dbcan3, which parses the dbCAN3 output files.

  • The input should be the path to a directory where the dbCAN output files are stored; the files should have the extension overview.txt. Each file should have 6 columns: the bin names, followed by the genes obtained by each algorithm (HMMER, Hotpep, DIAMOND), a ‘Signalp’ column indicating whether a signal peptide was found, and a ‘#ofTools’ column indicating the number of algorithms that found the gene.

  • The output format is chosen with the profile argument. When profile = T, a wide table is returned; when profile = F, a long table is returned.

  • The write argument controls whether the formatted table is saved as a .tsv file. When write = F, the function returns the output but does not save the table in your current directory.
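
To make the difference between the two output shapes concrete, here is a minimal base R sketch of a long table being turned into a wide one. It is not how read_dbcan3 works internally; the bin names, families, and abundances are made up for illustration, and the reshaping uses xtabs rather than any rbims function.

```r
# Toy long-format table: one row per bin/family pair.
# All names and values here are hypothetical.
long_tbl <- data.frame(
  Bin_name     = c("bin_A", "bin_A", "bin_B"),
  dbCAN_family = c("GH13", "CE4", "GH13"),
  Abundance    = c(1L, 1L, 4L),
  stringsAsFactors = FALSE
)

# Wide format: one row per family, one column per bin.
# Bin/family pairs that never occur are filled with 0.
wide_mat <- xtabs(Abundance ~ dbCAN_family + Bin_name, data = long_tbl)
wide_tbl <- as.data.frame.matrix(wide_mat)
wide_tbl
```

The wide shape is convenient for heatmaps and matrix-style comparisons across bins, while the long shape is easier to filter and group.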

If you want to follow the example, you can download and use the rbims test file.

dbcan_profile <- read_dbcan3(dbcan_path = "../inst/extdata/test_data/", profile = T, write = F)
#> Warning: Expected 1 pieces. Additional pieces discarded in 12 rows [1, 2, 3, 7, 9, 13,
#> 16, 17, 22, 24, 31, 36].
#> Warning: Expected 1 pieces. Additional pieces discarded in 44 rows [1, 2, 3, 4, 5, 6, 7,
#> 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, ...].
#> [1] "Input Genes = 123"
#> [1] "Remained Genes after filtering = 44"
#> [1] "Percentage of genes remained = 36%"
#> [1] "Number of genes with signals = 2"
#> [1] "Number of genes with signals that passed filtering = 2"
head(dbcan_profile)
#> # A tibble: 6 × 4
#>   dbCAN_family domain_name                       htn_bins_104_sub htn_bins_108
#>   <chr>        <chr>                                        <dbl>        <dbl>
#> 1 CBM2         carbohydrate-binding module [CBM]                1            0
#> 2 CE12         carbohydrate esterases [CEs]                     1            0
#> 3 CE4          carbohydrate esterases [CEs]                     1            0
#> 4 GH0          glycoside hydrolases [GHs]                       1            0
#> 5 GH13         glycoside hydrolases [GHs]                       1            4
#> 6 GH140        glycoside hydrolases [GHs]                       1            0

Or obtain a long table by setting profile = F.

dbcan_profile <- read_dbcan3(dbcan_path = "../inst/extdata/test_data/",
                             profile = F, write = F)
#> Warning: Expected 1 pieces. Additional pieces discarded in 12 rows [1, 2, 3, 7, 9, 13,
#> 16, 17, 22, 24, 31, 36].
#> Warning: Expected 1 pieces. Additional pieces discarded in 44 rows [1, 2, 3, 4, 5, 6, 7,
#> 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, ...].
#> [1] "Input Genes = 123"
#> [1] "Remained Genes after filtering = 44"
#> [1] "Percentage of genes remained = 36%"
#> [1] "Number of genes with signals = 2"
#> [1] "Number of genes with signals that passed filtering = 2"
head(dbcan_profile)
#> # A tibble: 6 × 5
#> # Groups:   Bin_name, dbCAN_family, domain_name [6]
#>   Bin_name         dbCAN_family domain_name                    signalp Abundance
#>   <chr>            <chr>        <chr>                          <chr>       <int>
#> 1 htn_bins_104_sub CBM2         carbohydrate-binding module [… N               1
#> 2 htn_bins_104_sub CE12         carbohydrate esterases [CEs]   N               1
#> 3 htn_bins_104_sub CE4          carbohydrate esterases [CEs]   N               1
#> 4 htn_bins_104_sub GH0          glycoside hydrolases [GHs]     N               1
#> 5 htn_bins_104_sub GH13         glycoside hydrolases [GHs]     N               1
#> 6 htn_bins_104_sub GH140        glycoside hydrolases [GHs]     N               1

Notice that in both cases the function prints several lines summarizing the information recovered from the input files: the total number of genes, the genes remaining after filtering, and the number of genes that have signals and passed filtering.
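
As a quick sanity check, the reported percentage can be reproduced from the two gene counts. This is only a sketch: rounding to a whole percent is an assumption about how the summary is formatted, not something confirmed from the rbims source.

```r
# Counts copied from the summary lines printed above.
input_genes    <- 123
remained_genes <- 44

# Share of genes kept after filtering, rounded to a whole percent
# (the rounding rule is an assumption).
pct_remained <- round(100 * remained_genes / input_genes)
paste0("Percentage of genes remained = ", pct_remained, "%")
```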

You can export this to a table like this:

write.table(dbcan_profile, "dbcan.tsv", quote = F, sep = "\t", row.names = F, col.names = T)

Or by setting write = T.
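
If you want to confirm that an exported table reads back cleanly, a round trip like the following can help. This is a minimal sketch using a made-up stand-in for dbcan_profile and a temporary file, so it runs without rbims or the test data.

```r
# Toy stand-in for a wide dbcan_profile table; the values are hypothetical.
toy_profile <- data.frame(
  dbCAN_family = c("CE4", "GH13"),
  bin_A = c(1, 1),
  bin_B = c(0, 4)
)

# Write to a temporary file with the same options as the write.table call above.
out_file <- tempfile(fileext = ".tsv")
write.table(toy_profile, out_file, quote = F, sep = "\t",
            row.names = F, col.names = T)

# Read the file back and check that rows and columns survived.
check <- read.delim(out_file, check.names = FALSE)
dim(check)
```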