vignettes/02_Install_conda_environment.Rmd
Rbims can read annotations generated by KofamScan, InterProScan, dbCAN, and MEROPS. You can create an environment containing these programs or load each output independently if you already have them.
Get the databases and configure them. (These instructions were taken from the official site.)
To run the HMM tool correctly, you need to download the databases locally.
We use wget to download the database files from the original website. To prevent the command from halting and to avoid tying up your session, we recommend running it in the background with nohup and &:
cd DBs/kegg/
nohup wget https://www.genome.jp/ftp/db/kofam/ko_list.gz &
nohup wget https://www.genome.jp/ftp/db/kofam/profiles.tar.gz &
Once the databases are fully downloaded, extract the information:
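The extraction step can be sketched as follows. This is a minimal example, not the official procedure: the helper name `extract_kofam_db` is hypothetical, and it assumes both downloads finished and sit in the same directory (here `DBs/kegg`, as used above).

```bash
# Hypothetical helper: unpack the two KofamScan downloads inside one database directory.
extract_kofam_db() {
    db_dir="$1"
    # ko_list.gz -> ko_list (gunzip replaces the .gz file with the unpacked file)
    gunzip -f "$db_dir/ko_list.gz"
    # Unpack the profile archive next to it; the archive itself is kept
    tar -xzf "$db_dir/profiles.tar.gz" -C "$db_dir"
}

# Usage (assumes the downloads above finished in DBs/kegg/):
# extract_kofam_db DBs/kegg
```

After this step the directory should contain the `ko_list` file and the unpacked profiles that the configuration file points to.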
To run KofamScan, you need a configuration file that tells it where to find the databases. Write the paths explicitly in that file:
# Path to your KO-HMM database
# A database can be a .hmm file, a .hal file or a directory in which
# .hmm files are. Omit the extension if it is .hal or .hmm file
profile: /PATH to/USER/DBs/kegg/profiles
# Path to the KO list file
ko_list: /PATH to/USER/DBs/kegg/ko_list
# Path to an executable file of hmmsearch
# You do not have to set this if it is in your $PATH
hmmsearch: /PATH to/USER/.conda/envs/rbimsenv/bin/hmmsearch
# Path to an executable file of GNU parallel
# You do not have to set this if it is in your $PATH
parallel: /PATH to/USER/.conda/envs/rbimsenv/bin/parallel
# Number of hmmsearch processes to be run parallelly
cpu: 8
Install Java v11
cd DBs/iprscan/
wget https://ftp.ebi.ac.uk/pub/software/unix/iprscan/5/5.77-108.0/interproscan-5.77-108.0-64-bit.tar.gz
wget https://ftp.ebi.ac.uk/pub/software/unix/iprscan/5/5.77-108.0/interproscan-5.77-108.0-64-bit.tar.gz.md5
# Recommended checksum to confirm the download was successful:
md5sum -c interproscan-5.77-108.0-64-bit.tar.gz.md5
# Must return *interproscan-5.77-108.0-64-bit.tar.gz: OK*
# If not, try downloading the file again, as it may be a corrupted copy.
tar -pxvzf interproscan-5.77-108.0-*-bit.tar.gz
cd
mkdir -p $CONDA_PREFIX/etc/conda/activate.d
mkdir -p $CONDA_PREFIX/etc/conda/deactivate.d
# export iprscan to activate environment
echo 'export PATH=/PATH to USER Interpro binary directory/DBs/iprscan/interproscan-5.77-108.0:$PATH' > $CONDA_PREFIX/etc/conda/activate.d/interproscan_activate.sh
# export iprscan to deactivate environment
echo 'export PATH=$(echo $PATH | sed -e "s|/PATH to USER Interpro binary directory/DBs/iprscan/interproscan-5.77-108.0:||")' > $CONDA_PREFIX/etc/conda/deactivate.d/interproscan_deactivate.sh
# Install run_dbcan
conda install dbcan -c conda-forge -c bioconda
# Move to cazy dir
cd DBs/cazy
# Clone the official dbCAN repository
git clone https://github.com/linnabrown/run_dbcan.git
# Move into the directory
cd run_dbcan
# Install dependencies in the current Conda environment
pip install .
# Test if dbcan_build works
dbcan_build --help
# Build the dbCAN database (adjust --db-dir to your preferred path)
dbcan_build --cpus 8 --db-dir ./db --clean
⚠️ Important Note: The following scripts are practical examples to guide the usage of each tool in this workflow. They do not replace the official documentation or user manuals of each program. Please refer to the official manuals for advanced options and detailed explanations, and cite these programs in your work.
🧩 Example: Run KofamScan (exec_annotation) on multiple .faa files
# Create the input directory (place your .faa files here) and the output directory for KofamScan results
mkdir -p test/data/faa
mkdir -p test/results/02.kofam
# Run KofamScan on each .faa file in the input directory
for faa in test/data/faa/*.faa; do
    echo "Processing $faa ..."
    exec_annotation -o "test/results/02.kofam/$(basename "$faa").txt" \
        "$faa" \
        --report-unannotated \
        --tmp-dir "test/results/02.kofam/$(basename "$faa").tmp" \
        --cpu 28
done
echo "KofamScan annotation completed. Results are in test/results/02.kofam/"
# Remove the temporary directories (optional)
# rm -r test/results/02.kofam/*.tmp/
🧩 Example: Running InterProScan
Process all .faa protein files from the test dataset and save the results in the specified output directory.
Note: Protein files from annotations previously run with Prokka sometimes have an asterisk at the end of each entry. InterProScan does not accept these asterisks, so remove them from your *.faa files before running it.
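One way to strip the asterisks is sketched here with sed. The helper name `strip_trailing_asterisks` is hypothetical, and the sketch assumes the asterisk is the last character of a sequence line; header lines starting with `>` are left untouched, and a `.bak` copy of each file is kept in case you need to roll back.

```bash
# Hypothetical helper: strip a trailing "*" from sequence lines in every .faa file of a directory.
strip_trailing_asterisks() {
    for f in "$1"/*.faa; do
        [ -e "$f" ] || continue   # skip if the glob matched nothing
        # Only non-header lines are edited; the original is saved as <file>.bak
        sed -i.bak -e '/^>/!s/\*$//' "$f"
    done
}

# Usage:
# strip_trailing_asterisks test/data/faa
```

Once you have checked the cleaned files, the `.bak` copies can be deleted.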
# Create output directory for InterProScan results
mkdir -p test/results/01.iprscan
# Run InterProScan on each .faa file in the input directory
for faa in test/data/faa/*.faa; do
    echo "Processing $faa ..."
    interproscan.sh -i "$faa" -cpu 28 -d test/results/01.iprscan
done
echo "InterProScan analysis completed. Results are in test/results/01.iprscan/"
🔧 Parameters:
-i: Input protein FASTA file (.faa)
-cpu 28: Number of CPU threads (adjust according to your machine)
-d: Output directory for the results
Output formats:
gff3: Gene feature format
json: JSON formatted output
tsv: Tab-separated values (summary table, useful for downstream analysis)
xml: XML formatted output
💡 Tip: Adjust the number of CPUs depending on your available resources. Make sure the output directory exists before running the script, or use mkdir -p to create it.
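As a rough sketch of how the CPU count could be chosen automatically instead of hard-coding 28: the helper `pick_cpus` below is hypothetical, and it assumes `getconf _NPROCESSORS_ONLN` is available (it is on most Linux and macOS systems). It reserves a couple of cores for the rest of the system and never returns less than 1.

```bash
# Hypothetical helper: print the number of CPUs to use, leaving some cores free.
pick_cpus() {
    reserve="${1:-2}"   # cores to leave unused (default: 2)
    total=$(getconf _NPROCESSORS_ONLN 2>/dev/null || echo 1)
    n=$((total - reserve))
    [ "$n" -ge 1 ] || n=1   # always use at least one CPU
    echo "$n"
}

# Usage:
# interproscan.sh -i "$faa" -cpu "$(pick_cpus)" -d test/results/01.iprscan
```

On shared servers, a scheduler or administrator limit may be a better guide than the detected core count.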
🧩 Example: Run dbCAN (run_dbcan) on multiple .faa files
# Create output directory for dbCAN results
mkdir -p test/results/03.dbcan
# Define database directory
db="DBs/cazy/run_dbcan/db"
# Run dbCAN on each .faa file in the input directory
for faa in test/data/faa/*.faa; do
locustag=$(basename "$faa" .faa)
    echo "Processing $locustag ..."
    run_dbcan "$faa" protein \
        --dia_cpu 20 --hmm_cpu 20 \
        --tf_cpu 20 --stp_cpu 20 \
        --out_pre "$locustag" \
        --out_dir "test/results/03.dbcan/$locustag" \
        --db_dir "$db" --tools all \
        --use_signalP=TRUE
done
echo "dbCAN analysis completed. Results are in test/results/03.dbcan/"
🔧 Parameters:
--dia_cpu, --hmm_cpu, --tf_cpu, --stp_cpu: Number of CPU threads for each dbCAN module
--out_pre: Prefix for output files
--out_dir: Output directory
--db_dir: Path to the dbCAN database
--tools all: Run all available tools
--use_signalP=TRUE: Enable SignalP for signal peptide prediction
Output files:
dbcan-sub.hmm.out: Sub HMM output for CAZy families
diamond.out: DIAMOND alignment output
hmmer.out: HMMER output
overview.txt: Summary of annotations
signalp.out: SignalP prediction results
uniInput: Unified input file for dbCAN pipeline
💡 Tip: The parameters used in this script are optional and provide a more complete annotation. However, for the purposes of rbims, the following simpler command is sufficient:
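A minimal sketch of what such a simpler call could look like, assuming it keeps only the input file, the `protein` mode, the output directory, and the database directory from the example above. The wrapper name `run_dbcan_minimal` is hypothetical; check `run_dbcan --help` for the options your installed version accepts.

```bash
# Hypothetical wrapper: run_dbcan with only input, mode, output directory and database directory.
run_dbcan_minimal() {
    faa="$1"; out_dir="$2"; db_dir="$3"
    run_dbcan "$faa" protein --out_dir "$out_dir" --db_dir "$db_dir"
}

# Usage inside the loop shown above:
# run_dbcan_minimal "$faa" "test/results/03.dbcan/$locustag" "$db"
```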
Adjust the number of CPUs and paths according to your environment. Make sure to replace DBs/cazy/run_dbcan/db with the actual path to your dbCAN database. The script will automatically process all .faa files in the input folder.
🧩 Example: Run MEROPS (BLASTp) on multiple .faa files
This script runs BLASTp using the MEROPS protease database on all .faa protein files in your input directory, and outputs the results to a dedicated directory.
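The BLAST database has to exist before BLASTp can search it. A minimal sketch of preparing it with makeblastdb follows; the helper name `build_merops_db` and the FASTA file name in the usage line are placeholders, so substitute the MEROPS sequence file you actually downloaded.

```bash
# Hypothetical helper: build a protein BLAST database from a MEROPS FASTA file.
build_merops_db() {
    fasta="$1"; db_prefix="$2"
    # Make sure the target directory exists before makeblastdb writes its index files
    mkdir -p "$(dirname "$db_prefix")"
    makeblastdb -in "$fasta" -dbtype prot -out "$db_prefix"
}

# Usage (the FASTA file name is a placeholder):
# build_merops_db DBs/merops/merops_sequences.fasta DBs/merops/merops_db
```

The second argument is the same prefix that the script passes to `-db`.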
# Create output directory for MEROPS results
mkdir -p test/results/04.merops
# Define MEROPS database path (make sure it is prepared with makeblastdb)
db="DBs/merops/merops_db"
# Run BLASTp on each .faa file in the input directory
for faa in test/data/faa/*.faa; do
locustag=$(basename "$faa" .faa)
    echo "Processing $locustag ..."
    blastp -query "$faa" \
        -db "$db" \
        -num_threads 32 \
        -out "test/results/04.merops/${locustag}.txt" \
        -outfmt "6 qseqid sseqid stitle pident evalue bitscore"
done
echo "MEROPS BLASTp analysis completed. Results are in test/results/04.merops/"
🧩 Example: Running PICRUSt2 Pipeline
If you have a Feature Table (.biom) and Representative Sequences (.fna) exported from QIIME2, you can predict the functional profiles with PICRUSt2 to be used in rbims.
# Create output directory for PICRUSt2 results
mkdir -p test/results/05.picrust2
# Run the full pipeline
picrust2_pipeline.py -i test/data/16S/feature_table.biom \
    -s test/data/16S/representative_seqs.fna \
    -o test/results/05.picrust2/ \
    -p 8
🔧 Parameters: