2. Create the conda environment

Rbims can read annotations generated by KofamScan, InterProScan, dbCAN, and MEROPS. You can create an environment containing these programs or load each output independently if you already have them.

Create environment

Create database directory

mkdir -p DBs/{kegg,cazy,merops,iprscan}

Create the environment

conda create -n rbimsenv -c conda-forge -c bioconda -c defaults hmmer parallel python=3.8
conda activate rbimsenv

Install and download all the databases

Install KofamScan

Get the databases and configure them. (These instructions were taken from the official site.)

conda install -c conda-forge ruby
conda install -c conda-forge -c bioconda -c defaults kofamscan

To run the HMM search correctly, you need to download the databases locally:

Download database

We use wget to download the database from the original website. So that the command is not interrupted if your session closes, and so that it does not tie up your terminal, we recommend running it in the background with nohup and &.

cd DBs/kegg/
nohup wget https://www.genome.jp/ftp/db/kofam/ko_list.gz &
nohup wget https://www.genome.jp/ftp/db/kofam/profiles.tar.gz &

Once the databases are fully downloaded, extract the information:

gunzip ko_list.gz
tar -xvzf profiles.tar.gz
cd ../../
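
Because the downloads run in the background, it can be hard to tell whether they finished cleanly. The following is a minimal sketch, not part of the official KofamScan instructions, of how each archive could be tested before it is extracted. check_gz is a hypothetical helper name. gzip -t only tests the integrity of the compressed stream, so it catches truncated or corrupt files but does not compare them against an official checksum.

```shell
# Hypothetical helper (not part of KofamScan): test a .gz archive before extracting it.
check_gz() {
  # gzip -t decompresses to nowhere and fails on truncated or corrupt data
  if gzip -t "$1" 2>/dev/null; then
    echo "OK: $1"
  else
    echo "Incomplete or corrupt: $1" >&2
    return 1
  fi
}

# Possible usage from inside DBs/kegg/, once both wget jobs have exited:
#   check_gz ko_list.gz      && gunzip ko_list.gz
#   check_gz profiles.tar.gz && tar -xzf profiles.tar.gz
```

If a file is reported as incomplete, downloading it again is usually the simplest fix.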

Configure database

KofamScan needs a configuration file to locate the databases. Open config.yml and set the paths explicitly:

nano /PATH to USER or SYSTEM/.conda/envs/rbimsenv/bin/config.yml

# Path to your KO-HMM database
# A database can be a .hmm file, a .hal file or a directory in which
# .hmm files are. Omit the extension if it is .hal or .hmm file
profile: /PATH to/USER/DBs/kegg/profiles

# Path to the KO list file
ko_list: /PATH to/USER/DBs/kegg/ko_list

# Path to an executable file of hmmsearch
# You do not have to set this if it is in your $PATH
hmmsearch: /PATH to/USER/.conda/envs/rbimsenv/bin/hmmsearch

# Path to an executable file of GNU parallel
# You do not have to set this if it is in your $PATH
parallel: /PATH to/USER/.conda/envs/rbimsenv/bin/parallel

# Number of hmmsearch processes to be run parallelly
cpu: 8
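
As an alternative to editing the file by hand, the same settings could be written from the shell. This is a sketch under assumptions: write_kofam_config is a hypothetical helper, and it writes only the profile, ko_list, and cpu keys, leaving out hmmsearch and parallel because the template above says they are optional when both programs are in your $PATH.

```shell
# Hypothetical helper: write a minimal KofamScan config.yml.
#   $1 = directory that contains profiles/ and ko_list (for example DBs/kegg)
#   $2 = config file to write
#   $3 = number of CPUs (defaults to 8)
write_kofam_config() (
  db_dir=$(cd "$1" && pwd) || exit 1   # resolve to an absolute path
  cat > "$2" <<EOF
profile: ${db_dir}/profiles
ko_list: ${db_dir}/ko_list
cpu: ${3:-8}
EOF
)

# Possible usage, assuming the rbimsenv environment is active:
#   write_kofam_config DBs/kegg "$CONDA_PREFIX/bin/config.yml" 8
```

Resolving the database directory to an absolute path means the configuration keeps working no matter which directory exec_annotation is later run from.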

Install InterProScan

Install Java v11

conda install -c conda-forge openjdk=11

Download database

cd DBs/iprscan/
wget https://ftp.ebi.ac.uk/pub/software/unix/iprscan/5/5.77-108.0/interproscan-5.77-108.0-64-bit.tar.gz
wget https://ftp.ebi.ac.uk/pub/software/unix/iprscan/5/5.77-108.0/interproscan-5.77-108.0-64-bit.tar.gz.md5
# Recommended checksum to confirm the download was successful:
md5sum -c interproscan-5.77-108.0-64-bit.tar.gz.md5
# Must return *interproscan-5.77-108.0-64-bit.tar.gz: OK*
# If not - try downloading the file again as it may be a corrupted copy.
tar -pxvzf interproscan-5.77-108.0-*-bit.tar.gz

Create index models

cd interproscan-5.77-108.0/
python3 setup.py -f interproscan.properties

Configure database

cd
mkdir -p $CONDA_PREFIX/etc/conda/activate.d
mkdir -p $CONDA_PREFIX/etc/conda/deactivate.d

# export iprscan to activate environment 
echo 'export PATH=/PATH to USER Interpro binary directory/DBs/iprscan/interproscan-5.77-108.0:$PATH' > $CONDA_PREFIX/etc/conda/activate.d/interproscan_activate.sh

# export iprscan to deactivate environment
echo 'export PATH=$(echo $PATH | sed -e "s|/PATH to USER Interpro binary directory/DBs/iprscan/interproscan-5.77-108.0:||")' > $CONDA_PREFIX/etc/conda/deactivate.d/interproscan_deactivate.sh
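
The deactivate hook above removes the InterProScan directory by substituting the text "directory:" out of $PATH. A slightly more defensive variant, shown here only as a sketch, removes whole PATH entries that match exactly, wherever they sit in the list. path_remove and IPR_DIR are hypothetical names introduced for illustration.

```shell
# Hypothetical helper: print a PATH-style string with every exact copy of one directory removed.
#   $1 = directory to remove
#   $2 = colon-separated path string (for example "$PATH")
path_remove() {
  printf '%s\n' "$2" | tr ':' '\n' | grep -vxF -- "$1" | paste -sd ':' -
}

# Possible use in a deactivate hook (IPR_DIR is a placeholder you would set yourself):
#   export PATH=$(path_remove "$IPR_DIR" "$PATH")
```

Because grep -x matches complete lines only, a directory whose name merely starts with the same text is left in place.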

Install dbCAN

Download database

# Install run_dbcan
conda install dbcan -c conda-forge -c bioconda

# Move to cazy dir
cd DBs/cazy

# Clone the official dbCAN repository
git clone https://github.com/linnabrown/run_dbcan.git

# Move into the directory
cd run_dbcan

# Install dependencies in the current Conda environment
pip install .

# Test if dbcan_build works
dbcan_build --help

# Build the dbCAN database (adjust --db-dir to your preferred path)
dbcan_build --cpus 8 --db-dir ./db --clean

Install MEROPS

Download database

# Install BLAST+ (used to search the MEROPS database)
conda install bioconda::blast
# Download the database (run this from the directory that contains DBs/)
cd DBs/merops/
wget https://ftp.ebi.ac.uk/pub/databases/merops/current_release/protease.lib
makeblastdb -in protease.lib -dbtype prot -out merops_db

Install PICRUSt2 (for 16S Analysis)

Additionally, this step allows rbims to handle functional predictions derived from 16S rRNA gene sequences (e.g., from QIIME2). PICRUSt2 predicts the presence of functional genes (KEGG Orthologs), which rbims can then analyze.

# Install PICRUSt2
conda install -c bioconda -c conda-forge picrust2

# Verify installation
picrust2_pipeline.py --help

Install rbims

Rscript -e "devtools::install_github('mirnavazquez/RbiMs')"

Run the annotations

⚠️ Important Note: The following scripts are practical examples to guide the usage of each tool in this workflow. They do not replace the official documentation or user manuals of each program. Please refer to the official manuals for advanced options and detailed explanations, and cite these programs when you use them.

KofamScan

🧩 Example: Run KofamScan (exec_annotation) on multiple .faa files

# Create the input and output directories for KofamScan
mkdir -p test/data/faa
mkdir -p test/results/02.kofam

# Run KofamScan on each .faa file in the input directory
for faa in test/data/faa/*.faa; do
  echo "Processing $faa ..."
  exec_annotation -o "test/results/02.kofam/$(basename "$faa").txt" \
                  "$faa" \
                  --report-unannotated \
                  --tmp-dir "test/results/02.kofam/$(basename "$faa").tmp" \
                  --cpu 28
done

echo "KofamScan annotation completed. Results are in test/results/02.kofam/"

#remove tmp dir (optional)
#rm -r test/results/02.kofam/*.tmp/

🔧 Parameters:

-o: Output file for annotation results
--report-unannotated: Include unannotated sequences in the output
--tmp-dir: Temporary directory for intermediate files
--cpu 28: Number of CPU threads (adjust according to your machine)

Output formats:

txt: Annotation results per protein FASTA file
tmp/: Temporary directory with intermediate KofamScan files
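
To get a quick sense of how many proteins received an assignment, the result file can be summarized from the shell. This sketch assumes KofamScan's default "detail" output, in which rows whose score passes the KO threshold start with an asterisk followed by the gene name. If you use another output format, the test in the awk script would need to change. count_significant is a hypothetical helper name.

```shell
# Hypothetical helper: count distinct genes with at least one significant hit.
# Assumes the default "detail" format, where significant rows start with "*".
count_significant() {
  awk '$1 == "*" { genes[$2] = 1 }
       END { n = 0; for (g in genes) n++; print n }' "$1"
}

# Possible usage on one of the result files produced above:
#   count_significant test/results/02.kofam/sample.faa.txt
```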

InterProScan

🧩 Example: Running InterProScan

Process all .faa protein files from the test dataset and save the results in the specified output directory.

Note: Protein files previously produced by Prokka sometimes carry an asterisk at the end of each sequence. InterProScan does not accept these, so remove them from your *.faa files with the following command:

# Run from the project root so the relative paths in the next steps stay valid
mkdir -p test/data/faa
find test/data/faa -type f -name '*.faa' -exec sed -i 's/\*//g' {} +
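
The sed command above deletes every asterisk in the files, including any that appear in header lines. If you would rather touch only the sequence lines, something like the following sketch could be used instead. strip_stop_asterisk is a hypothetical helper; it prints the cleaned file to standard output, so the original stays unchanged until you redirect the result yourself.

```shell
# Hypothetical helper: drop a trailing "*" from sequence lines only;
# FASTA header lines (starting with ">") are printed unchanged.
strip_stop_asterisk() {
  sed '/^>/!s/\*$//' "$1"
}

# Small demonstration with a made-up two-record file
demo=$(mktemp)
printf '>seq1 example*\nMKVLA*\n>seq2\nMSTNP\n' > "$demo"
strip_stop_asterisk "$demo"
rm -f "$demo"
```

Only an asterisk at the very end of a line is removed, so internal asterisks inside a sequence would still need to be handled separately.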

# Create output directory for InterProScan results
mkdir -p test/results/01.iprscan

# Run InterProScan on each .faa file in the input directory
for faa in test/data/faa/*.faa; do
  echo "Processing $faa ..."
  interproscan.sh -i "$faa" -cpu 28 -d test/results/01.iprscan
done

echo "InterProScan analysis completed. Results are in test/results/01.iprscan/"

🔧 Parameters:

-i: Input protein FASTA file (.faa)
-cpu 28: Number of CPU threads (adjust according to your machine)
-d: Output directory for the results

Output formats:

gff3: Gene feature format
json: JSON formatted output
tsv: Tab-separated values (summary table, useful for downstream analysis)
xml: XML formatted output

💡 Tip: Adjust the number of CPUs depending on your available resources. Make sure the output directory exists before running the script, or use mkdir -p to create it.
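
One way to avoid hard-coding a thread count that your machine cannot provide is to cap the requested value at the number of CPUs the system reports. The sketch below assumes that either nproc (GNU coreutils) or getconf is available. pick_threads is a hypothetical helper name.

```shell
# Hypothetical helper: cap a requested thread count at the number of available CPUs.
pick_threads() {
  # nproc is part of GNU coreutils; fall back to getconf where it is missing
  avail=$(nproc 2>/dev/null || getconf _NPROCESSORS_ONLN)
  if [ "$1" -gt "$avail" ]; then
    echo "$avail"
  else
    echo "$1"
  fi
}

# Possible usage inside the loop above:
#   interproscan.sh -i "$faa" -cpu "$(pick_threads 28)" -d test/results/01.iprscan
```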

dbCAN

🧩 Example: Run dbCAN (run_dbcan) on multiple .faa files

# Create output directory for dbCAN results
mkdir -p test/results/03.dbcan

# Define database directory
db="DBs/cazy/run_dbcan/db"

# Run dbCAN on each .faa file in the input directory
for faa in test/data/faa/*.faa; do
  locustag=$(basename "$faa" .faa)
  echo "Processing $locustag ..."
  
  run_dbcan "$faa" protein \
    --dia_cpu 20 --hmm_cpu 20 \
    --tf_cpu 20 --stp_cpu 20 \
    --out_pre "$locustag" \
    --out_dir "test/results/03.dbcan/$locustag" \
    --db_dir "$db" --tools all \
    --use_signalP=TRUE
done
echo "dbCAN analysis completed. Results are in test/results/03.dbcan/"

🔧 Parameters:

--dia_cpu, --hmm_cpu, --tf_cpu, --stp_cpu: Number of CPU threads for each dbCAN module
--out_pre: Prefix for output files
--out_dir: Output directory
--db_dir: Path to the dbCAN database
--tools all: Run all available tools
--use_signalP=TRUE: Enable SignalP for signal peptide prediction

Output formats:

dbcan-sub.hmm.out: Sub HMM output for CAZy families
diamond.out: DIAMOND alignment output
hmmer.out: HMMER output
overview.txt: Summary of annotations
signalp.out: SignalP prediction results
uniInput: Unified input file for dbCAN pipeline

💡 Tip: The parameters used in this script are optional and provide a more complete annotation. However, for the purposes of rbims, the following simpler command is sufficient:

run_dbcan <input.faa> protein --out_dir <output_directory> --use_signalP=TRUE

Adjust the number of CPUs and paths according to your environment. Make sure to replace DBs/cazy/run_dbcan/db with the actual path to your dbCAN database. The script will automatically process all .faa files in the input folder.
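
After a long batch run it can be useful to confirm that every sample produced a summary table. The sketch below lists the sample directories that lack one. It assumes the summary file name ends in overview.txt (possibly preceded by the prefix given with --out_pre), which you should verify against your own output. missing_overviews is a hypothetical helper name.

```shell
# Hypothetical helper: print the name of each sample directory without an overview file.
#   $1 = results directory holding one subdirectory per sample
missing_overviews() {
  for d in "$1"/*/; do
    [ -d "$d" ] || continue              # skip when the glob matched nothing
    if ! ls "$d"*overview.txt >/dev/null 2>&1; then
      basename "$d"
    fi
  done
}

# Possible usage:
#   missing_overviews test/results/03.dbcan
```

An empty result means every sample directory contains a matching file. It does not check that the file itself is complete.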

MEROPS

🧩 Example: Run MEROPS (BLASTp) on multiple .faa files

This script runs BLASTp using the MEROPS protease database on all .faa protein files in your input directory, and outputs the results to a dedicated directory.

# Create output directory for MEROPS results
mkdir -p test/results/04.merops

# Define MEROPS database path (make sure it is prepared with makeblastdb)
db="DBs/merops/merops_db"

# Run BLASTp on each .faa file in the input directory
for faa in test/data/faa/*.faa; do
  locustag=$(basename "$faa" .faa)
  echo "Processing $locustag ..."
  
  blastp -query "$faa" \
    -db "$db" \
    -num_threads 32 \
    -out "test/results/04.merops/${locustag}.txt" \
    -outfmt "6 qseqid sseqid stitle pident evalue bitscore"
done

echo "MEROPS BLASTp analysis completed. Results are in test/results/04.merops/"

🔧 Parameters:

-query: Input protein FASTA file
-db: Path to the prepared MEROPS BLAST database
-num_threads: Number of CPU threads
-out: Output file for results
-outfmt 6: Tabular output format for easy downstream parsing

Output:

<locustag>.txt: Tabular BLASTp hits, one file per input .faa (for example, test/results/04.merops/sampleID.txt)

💡 Tip: Adjust the number of CPUs according to your available resources. The script will automatically process all .faa files in the input folder.
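
BLASTp usually reports several hits per protein. If you only want the strongest match for each query, the tabular output can be reduced with standard tools. The sketch below relies on the column order requested above (qseqid, sseqid, stitle, pident, evalue, bitscore), so the bit score is in column 6. best_hits is a hypothetical helper, and whether rbims expects filtered or unfiltered input is not covered here.

```shell
# Hypothetical helper: keep the highest-bitscore hit for each query.
# Assumes tab-separated columns: qseqid sseqid stitle pident evalue bitscore
best_hits() {
  tab=$(printf '\t')
  # Sort by query, then by bit score (column 6) in descending order,
  # and let awk keep the first line it sees for every query ID.
  LC_ALL=C sort -t "$tab" -k1,1 -k6,6nr "$1" | awk -F '\t' '!seen[$1]++'
}

# Possible usage:
#   best_hits test/results/04.merops/sample.txt > sample.best.txt
```

Splitting on tabs rather than on any whitespace matters here because the stitle column can contain spaces.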

PICRUSt2

🧩 Example: Running PICRUSt2 Pipeline

If you have a Feature Table (.biom) and Representative Sequences (.fna) exported from QIIME2, you can predict the functional profiles with PICRUSt2 to be used in rbims.

# Create only the parent directory: picrust2_pipeline.py creates the output
# directory itself and stops if it already exists
mkdir -p test/results

# Run the full pipeline
picrust2_pipeline.py -i test/data/16S/feature_table.biom \
                     -s test/data/16S/representative_seqs.fna \
                     -o test/results/05.picrust2/ \
                     -p 8

🔧 Parameters:

-i: Feature table in BIOM format, e.g. study_observations.biom (exported from QIIME2)
-s: Representative sequences in FASTA format, e.g. study_seqs.fna
-o: Output directory for results
-p: Number of processes to run in parallel
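
Since the pipeline can run for a long time, it may help to confirm that both input files exist and are not empty before starting it. The following is a minimal sketch of such a check. require_files is a hypothetical helper and is not part of PICRUSt2.

```shell
# Hypothetical helper: succeed only if every listed file exists and is non-empty.
require_files() {
  rc=0
  for f in "$@"; do
    if [ ! -s "$f" ]; then
      echo "Missing or empty input: $f" >&2
      rc=1
    fi
  done
  return "$rc"
}

# Possible usage before launching the pipeline:
#   require_files test/data/16S/feature_table.biom \
#                 test/data/16S/representative_seqs.fna \
#     && echo "Inputs look fine, ready to run picrust2_pipeline.py"
```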