PhenDB is an automated pipeline for the prediction of microbial phenotypes based on comparative genomics.
First, Prodigal (Hyatt et al., 2010) predicts protein-coding genes in the DNA sequences of the uploaded metagenomic bins or genomes.
HMMER (Eddy, 2001) then assigns the resulting protein sequences to orthologous groups using the eggNOG database (Huerta-Cepas et al., 2015).
Finally, PICA (Feldbauer et al., 2015) predicts whether the organism possesses a particular trait. It does so with support vector machine (SVM)-based models, trained on genomes with known phenotypes, applied to the list of orthologous groups of proteins present in a bin/genome.
Currently, predictions for 44 different traits are calculated.
Please note that PhenDB is still in an early stage of development, and for several models we are still working on improving the training data.
We therefore recommend paying attention to the "balanced_accuracy" values reported with each prediction.
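As a rough illustration of the final prediction step, the sketch below shows how a linear SVM-style decision function could turn a list of orthologous groups into a YES/NO call. The function name, weights, bias and group identifiers are invented for this example; PICA's actual models and feature handling may differ.

```python
def predict_trait(present_groups, weights, bias):
    """Return (verdict, score) from a linear decision function.

    present_groups: orthologous group IDs found in the bin/genome
    weights:        hypothetical per-group model weights
    bias:           hypothetical model intercept
    """
    # Presence/absence features: each group counts once, unknown groups add 0.
    score = bias + sum(weights.get(g, 0.0) for g in set(present_groups))
    return ("YES" if score > 0 else "NO"), score


# Made-up model for a made-up trait
toy_weights = {"COG0001": 1.2, "COG0002": -0.7, "COG0003": 0.4}
toy_bias = -0.5

verdict, score = predict_trait(["COG0001", "COG0003", "COG9999"], toy_weights, toy_bias)
print(verdict, round(score, 2))
```

A positive score maps to "YES" and a non-positive score to "NO"; the probability and balanced accuracy values reported by PhenDB are not modelled here.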
How to Use PhenDB
You may either upload a single metagenomic bin/genome in FASTA format (.gz-compressed or uncompressed), or an archive (.tar.gz or .zip) containing several bins/genomes.
Please note that the archive must have a flat file structure, i.e. it must not contain subfolders.
The current maximum file size for an upload is 1 GB; the maximum file size per bin is 30 MB.
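Before uploading, it can help to check an archive against these limits locally. The following is a minimal sketch, not part of PhenDB: the helper names are hypothetical, and it assumes that the limits are binary units (1 GB = 1024^3 bytes) and that the per-bin limit refers to the uncompressed size recorded in the archive.

```python
import tarfile

MAX_UPLOAD_BYTES = 1024 ** 3       # 1 GB (assumed binary units)
MAX_BIN_BYTES = 30 * 1024 ** 2     # 30 MB per bin (assumed uncompressed)


def find_upload_problems(members, upload_size):
    """members: iterable of (name, size_in_bytes); upload_size: archive size in bytes."""
    problems = []
    if upload_size > MAX_UPLOAD_BYTES:
        problems.append("upload exceeds 1 GB")
    for name, size in members:
        # Ignore a leading "./", which some tar tools add to top-level files.
        clean = name[2:] if name.startswith("./") else name
        if "/" in clean:
            problems.append(f"{name}: nested path, archive is not flat")
        if size > MAX_BIN_BYTES:
            problems.append(f"{name}: larger than 30 MB")
    return problems


def tar_members(path):
    """List (name, size) for the regular files in a .tar.gz archive."""
    with tarfile.open(path, "r:gz") as tar:
        return [(m.name, m.size) for m in tar if m.isfile()]
```

For a local archive, `find_upload_problems(tar_members(path), os.path.getsize(path))` would return an empty list when no problem is detected (this call additionally needs `import os`).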
Duplicate .fasta files (determined by file content) will be silently dropped from the analysis.
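Because duplicates are dropped without a message, you may want to detect them yourself before uploading. The sketch below groups files with identical content by hashing them. How PhenDB compares file contents internally is not documented here, so the use of SHA-256 and the function name are assumptions made for illustration.

```python
import hashlib


def group_duplicates(files):
    """files: mapping of file name -> file content as bytes.

    Returns a list of name groups whose content is byte-for-byte identical.
    """
    by_digest = {}
    for name, content in files.items():
        digest = hashlib.sha256(content).hexdigest()
        by_digest.setdefault(digest, []).append(name)
    # Only groups with more than one member are duplicates.
    return [names for names in by_digest.values() if len(names) > 1]


example = {
    "a.fasta": b">seq1\nACGT\n",
    "b.fasta": b">seq1\nACGT\n",   # same content as a.fasta
    "c.fasta": b">seq2\nGGCC\n",
}
print(group_duplicates(example))
```

Note that this only catches exact byte-level duplicates; two files with the same sequences but different line wrapping would not be grouped.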
Balanced accuracy is a confidence measure computed from the completeness/contamination of the uploaded bin and the model's predictive power. Predictions with a balanced accuracy below the chosen cutoff value (range: 0.5 to 1) are omitted from the result.
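The effect of the cutoff can be pictured with a small sketch. The record layout and field names below are hypothetical and only mirror the description above: predictions whose balanced accuracy lies below the cutoff are left out.

```python
def apply_cutoff(predictions, cutoff):
    """Keep only predictions with balanced accuracy at or above the cutoff."""
    if not 0.5 <= cutoff <= 1.0:
        raise ValueError("cutoff must lie between 0.5 and 1")
    return [p for p in predictions if p["balanced_accuracy"] >= cutoff]


# Invented example records; model names and values are placeholders.
example_predictions = [
    {"model": "trait_A", "prediction": "YES", "balanced_accuracy": 0.92},
    {"model": "trait_B", "prediction": "NO", "balanced_accuracy": 0.61},
    {"model": "trait_C", "prediction": "YES", "balanced_accuracy": 0.75},
]

kept = apply_cutoff(example_predictions, 0.75)
print([p["model"] for p in kept])
```

A higher cutoff yields fewer but more reliable predictions; a cutoff of 0.5 keeps everything the models report.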
After submission, your job is queued and waits for completion of any previously submitted jobs.
When computation starts, expect about 1-1.5 min of calculation time per bin.
To receive a notification upon job completion, enter an email address during your submission.
Alternatively, you may save the URL to your submission to retrieve your results later.
Results are stored by PhenDB for 30 days.
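For planning larger submissions, the figures above translate into simple arithmetic. The sketch below is only an estimate under stated assumptions: it ignores time spent waiting in the queue, and it assumes the 30-day storage period starts when the job finishes, which is not specified here.

```python
from datetime import datetime, timedelta


def estimate_runtime_minutes(n_bins):
    """Return (lower, upper) estimate in minutes at 1 to 1.5 min per bin."""
    return (n_bins * 1.0, n_bins * 1.5)


def results_expiry(finished_at):
    """Date until which results are kept, assuming 30 days from completion."""
    return finished_at + timedelta(days=30)


low, high = estimate_runtime_minutes(40)
print(f"40 bins: about {low:.0f} to {high:.0f} minutes of computation")
print(results_expiry(datetime(2024, 1, 1)))
```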
PhenDB provides the output files in a .zip compressed folder named after your Job ID (i.e. the key after ../results/ in the URL). This folder contains:
- the folder individual_results, containing:
- a .results.txt file for every valid uploaded bin/genome.
This file contains the model names, predictions (YES/NO/NA) along with probability and balanced accuracy values.
- the folder "summaries", which contains:
- "summary_matrix.results.tsv": A summary file that shows for each model how many bins/genomes were predicted as "YES", "NO" or "N/A"
- "per_bin_matrix.results.tsv": A summary file that shows the verdict for each bin and each model as a matrix
- "invalid_input_files.log.txt": If one or more of your uploaded files were invalid (e.g. not in FASTA format), a warning will prompt you to check this file. If all files were valid, this file is empty.
- "PICA_trait_descriptions.txt": Contains the model names and the traits they are testing for.
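Since the output folder is named after the Job ID, i.e. the key after ../results/ in the URL, the ID can be recovered from a saved results URL. The following is a minimal sketch; the function name and the example host are placeholders, not PhenDB's real address.

```python
from urllib.parse import urlparse


def job_id_from_url(url):
    """Return the path segment that follows 'results' in a results URL."""
    parts = [p for p in urlparse(url).path.split("/") if p]
    if "results" not in parts:
        raise ValueError("URL does not contain a /results/ segment")
    idx = parts.index("results")
    if idx + 1 >= len(parts):
        raise ValueError("no Job ID found after /results/")
    return parts[idx + 1]


# Hypothetical URL for illustration only
print(job_id_from_url("https://phendb.example/results/abc123/"))
```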
Hyatt, Doug, et al. "Prodigal: prokaryotic gene recognition and translation initiation site identification." BMC bioinformatics 11.1 (2010): 119.
Eddy, Sean R. "HMMER: Profile hidden Markov models for biological sequence analysis." (2001).
Huerta-Cepas, Jaime, et al. "eggNOG 4.5: a hierarchical orthology framework with improved functional annotations for eukaryotic, prokaryotic and viral sequences." Nucleic acids research 44.D1 (2015): D286-D293.
Feldbauer, Roman, et al. "Prediction of microbial phenotypes based on comparative genomics." BMC bioinformatics 16.14 (2015): S1.