Introduction
BioMedGPS Explainer is a toolkit for drug discovery analysis using knowledge graph embeddings. This guide walks you through the full process, from installation to generating analysis reports.
What You'll Learn
- How to set up and install the toolkit
- How to prepare and validate your data
- How to run drug discovery analysis
- How to filter and visualize results
- How to interpret the outputs
- How to troubleshoot common issues
Installation
System Requirements
- Python: 3.8 or higher
- Memory: At least 8GB RAM (16GB recommended for large datasets)
- Storage: At least 5GB free space for model files and results
- Operating System: Windows, macOS, or Linux
Step-by-Step Installation
1. Clone the Repository
git clone <repository-url>
cd biomedgps-explainer
2. Create Virtual Environment (Recommended)
python -m venv biomedgps_env
# On Windows
biomedgps_env\Scripts\activate
# On macOS/Linux
source biomedgps_env/bin/activate
3. Install the Package
pip install -e .
4. Verify Installation
biomedgps-explainer --help
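If the help command fails, a quick check of the prerequisites can narrow down the cause. The following is a minimal sketch, not part of the toolkit: it checks the Python version against the 3.8 minimum listed under System Requirements and looks for the biomedgps-explainer command on PATH. The function name check_environment is hypothetical.

```python
import shutil
import sys

MIN_PYTHON = (3, 8)  # minimum version from the System Requirements above


def check_environment(cli_name="biomedgps-explainer"):
    """Return a dict describing whether the basic prerequisites are met."""
    return {
        "python_ok": sys.version_info >= MIN_PYTHON,
        "python_version": ".".join(map(str, sys.version_info[:3])),
        # shutil.which returns None when the command is not on PATH,
        # which usually means the virtual environment is not activated
        # or the package is not installed in it.
        "cli_on_path": shutil.which(cli_name) is not None,
    }


if __name__ == "__main__":
    for key, value in check_environment().items():
        print(f"{key}: {value}")
```

If cli_on_path is False, activate the virtual environment from step 2 and repeat step 3.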
Getting Started
Quick Start Example
1. Model Preparation
The toolkit automatically downloads pre-trained BioMedGPS model files from Weights & Biases (wandb) when you run the analysis. No manual model file preparation is required!
2. Validate Data (Optional)
python3 examples/run_data_validation.py
3. Run Complete Analysis
python3 examples/run_full_example.py
4. View Results
- Check the results/ directory for output files
- Open results/visualization_report/analysis_report.html for the interactive report
Model Preparation
Automatic Model Download
The toolkit automatically downloads pre-trained BioMedGPS model files from Weights & Biases (wandb) when you run the analysis. This includes:
- annotated_entities.tsv - Entity annotations
- knowledge_graph.tsv - Knowledge graph triples
- entity_embeddings.tsv - Entity embeddings
- relation_type_embeddings.tsv - Relation embeddings
File Formats (Reference)
The downloaded model files follow these formats:
Entity File Format
id label name
MONDO:0004979 Disease asthma
CHEBI:12345 Compound aspirin
Knowledge Graph Format
source_id source_type source_name target_id target_type target_name relation_type
CHEBI:12345 Compound aspirin MONDO:0004979 Disease asthma GNBR::T::Compound:Disease
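To make the triple format concrete, here is a small sketch that reads rows in this layout and lists the compounds linked to a disease by a given relation type. It assumes the columns are tab-separated (as the .tsv extension suggests) and uses an inline copy of the sample row above rather than the real file. The function name compounds_for_disease is hypothetical and not part of the toolkit.

```python
import csv
import io

# Inline sample mirroring the header and row shown above (assumed tab-separated).
SAMPLE_KG = (
    "source_id\tsource_type\tsource_name\ttarget_id\ttarget_type\t"
    "target_name\trelation_type\n"
    "CHEBI:12345\tCompound\taspirin\tMONDO:0004979\tDisease\tasthma\t"
    "GNBR::T::Compound:Disease\n"
)


def compounds_for_disease(handle, disease_id,
                          relation_type="GNBR::T::Compound:Disease"):
    """Return (source_id, source_name) pairs linked to disease_id."""
    reader = csv.DictReader(handle, delimiter="\t")
    return [
        (row["source_id"], row["source_name"])
        for row in reader
        if row["target_id"] == disease_id
        and row["relation_type"] == relation_type
    ]


hits = compounds_for_disease(io.StringIO(SAMPLE_KG), "MONDO:0004979")
print(hits)  # [('CHEBI:12345', 'aspirin')]
```

To run it against a downloaded file, pass an open file handle (opened with newline="") instead of the io.StringIO sample.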
Embeddings Format
entity_id entity_type embedding
MONDO:0004979 Disease 0.1|0.2|0.3|0.4|...
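The embedding column stores each vector as a pipe-delimited string. The sketch below shows one way to turn such a string into a list of floats and compare two vectors. The second vector is made up for illustration, and cosine similarity is used only as a common way to compare embeddings; this guide does not state which similarity measure the toolkit itself uses.

```python
import math


def parse_embedding(field, sep="|"):
    """Convert a pipe-delimited embedding string into a list of floats."""
    return [float(value) for value in field.strip().split(sep) if value]


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# First four values from the sample row above; the "..." in the sample
# stands for the remaining dimensions, which are not shown in this guide.
asthma = parse_embedding("0.1|0.2|0.3|0.4")
other = parse_embedding("0.4|0.3|0.2|0.1")  # hypothetical second entity

print(len(asthma))
print(round(cosine_similarity(asthma, other), 3))
```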
Data Validation (Optional)
You can optionally validate the downloaded data:
python3 examples/run_data_validation.py
This will:
- Check file existence and format
- Automatically decompress ZIP files if needed
- Verify data integrity
- Provide detailed error messages if issues are found
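The checks listed above can be approximated by hand if you want to inspect a model directory yourself. The following sketch is not the bundled validation script. It only tests that the four files exist and that the three files whose headers are shown in this guide contain those columns, assuming tab-separated headers. The function name validate_model_dir is hypothetical.

```python
import csv
import tempfile
from pathlib import Path

# Expected header columns, taken from the format examples in this guide.
# The header of relation_type_embeddings.tsv is not shown here, so that
# file is only checked for existence (None means "skip the column check").
EXPECTED = {
    "annotated_entities.tsv": ["id", "label", "name"],
    "knowledge_graph.tsv": [
        "source_id", "source_type", "source_name",
        "target_id", "target_type", "target_name", "relation_type",
    ],
    "entity_embeddings.tsv": ["entity_id", "entity_type", "embedding"],
    "relation_type_embeddings.tsv": None,
}


def validate_model_dir(model_dir):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    for filename, columns in EXPECTED.items():
        path = Path(model_dir) / filename
        if not path.is_file():
            problems.append(f"{filename}: file not found")
            continue
        if columns is None:
            continue
        with path.open(newline="", encoding="utf-8") as handle:
            header = next(csv.reader(handle, delimiter="\t"), [])
        missing = [name for name in columns if name not in header]
        if missing:
            problems.append(f"{filename}: missing columns {missing}")
    return problems


# Demo on a temporary directory that contains only the entity file.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "annotated_entities.tsv").write_text(
        "id\tlabel\tname\nMONDO:0004979\tDisease\tasthma\n", encoding="utf-8"
    )
    problems = validate_model_dir(tmp)

for problem in problems:
    print(problem)
```

In the demo, the three files that were not created are reported as missing, while the entity file passes the header check.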
Basic Usage
Command Line Interface
1. Run Complete Analysis
biomedgps-explainer run --disease-id MONDO:0004979 --output-dir results/ --model-run-id 6vlvgvfq
Parameters:
- --disease-id: Disease ID (required)
- --output-dir: Output directory (required)
- --model-run-id: Model run ID (default: 6vlvgvfq)
- --top-n-diseases: Number of similar diseases (default: 100)
- --threshold: Drug filtering threshold (default: 0.5)
- --relation-type: Relation type (default: GNBR::T::Compound:Disease)
- --top-n-drugs: Number of drugs to interpret (default: 1000)
2. Filter Results
biomedgps-explainer filter \
--input-file results/annotated_drugs.xlsx \
--expression "score > 0.6 and existing == False" \
--output-file results/filtered_drugs.xlsx
3. Generate Visualizations
biomedgps-explainer visualize \
--input-file results/filtered_drugs.xlsx \
--output-dir results/visualizations/ \
--disease-id MONDO:0004979 \
--disease-name "asthma"
4. Run Complete Pipeline
biomedgps-explainer pipeline \
--disease-id MONDO:0004979 \
--model-run-id 6vlvgvfq \
--output-dir results/ \
--filter-expression "score > 0.6 and existing == False"
Description: Executes the complete workflow (run → filter → visualize) in a single command.
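When the CLI is driven from a script (for example, to loop over several disease IDs), it helps to build the argument list programmatically. The sketch below assembles a run command from the flags and defaults documented above. The helper build_run_command is hypothetical and not part of the toolkit, and the example only prints the command instead of executing it.

```python
import shlex


def build_run_command(disease_id, output_dir, model_run_id="6vlvgvfq",
                      top_n_diseases=100, threshold=0.5,
                      relation_type="GNBR::T::Compound:Disease",
                      top_n_drugs=1000):
    """Build the argument list for `biomedgps-explainer run`."""
    options = {
        "--disease-id": disease_id,
        "--output-dir": output_dir,
        "--model-run-id": model_run_id,
        "--top-n-diseases": top_n_diseases,
        "--threshold": threshold,
        "--relation-type": relation_type,
        "--top-n-drugs": top_n_drugs,
    }
    cmd = ["biomedgps-explainer", "run"]
    for flag, value in options.items():
        cmd += [flag, str(value)]
    return cmd


cmd = build_run_command("MONDO:0004979", "results/")
# shlex.join renders the list as a single shell-safe string for logging.
print(shlex.join(cmd))
# To execute it: subprocess.run(cmd, check=True)
```

Passing a list to subprocess.run avoids shell quoting problems with values such as the relation type, which contains colons.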
Python API
Basic Workflow
from drugs4disease.core import DrugDiseaseCore
from drugs4disease.filter import DrugFilter
from drugs4disease.visualizer import Visualizer
# Initialize components
core = DrugDiseaseCore()
filter_tool = DrugFilter()
visualizer = Visualizer(disease_id="MONDO:0004979", disease_name="asthma")
# Run analysis
core.run_full_pipeline(
disease_id="MONDO:0004979",
output_dir="results/",
top_n_diseases=50,
top_n_drugs=100
)
# Filter results
filter_tool.filter_drugs(
input_file="results/annotated_drugs.xlsx",
expression="score > 0.7 and num_of_shared_genes_in_path >= 1",
output_file="results/filtered_drugs.xlsx"
)
# Generate report
visualizer.generate_report(
data_file="results/filtered_drugs.xlsx",
output_file="results/analysis_report.html",
title="Drug Discovery Analysis Report"
)
Advanced Features
Advanced Filtering
The toolkit supports complex logical expressions for drug filtering:
Filter Expression Examples
# High-scoring new drugs
"score > 0.7 and existing == False"
# Drugs with shared genes and pathways
"num_of_shared_genes_in_path >= 2 and num_of_shared_pathways >= 1"
# Network-central drugs
"drug_degree > 10 and num_of_key_genes >= 3"
# Complex combination
"score > 0.6 and existing == False and (num_of_shared_genes_in_path >= 1 or num_of_shared_pathways >= 1)"
Custom Parameters
You can customize various parameters for different analysis scenarios:
- Model Selection: Choose between different KGE models
- Threshold Adjustment: Fine-tune prediction thresholds
- Network Analysis: Configure centrality calculations
- Pathway Analysis: Set enrichment parameters
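To see how the filter expressions under Advanced Filtering combine conditions, the sketch below evaluates the "complex combination" expression against a few made-up rows. This only illustrates the boolean logic. How the toolkit evaluates expressions internally is not described in this guide, and the drug names and values here are hypothetical. The eval-based helper is suitable only for expressions you wrote yourself, never for untrusted input.

```python
# Hypothetical rows using column names that appear in the expressions above.
rows = [
    {"drug": "drug_a", "score": 0.82, "existing": False,
     "num_of_shared_genes_in_path": 2, "num_of_shared_pathways": 0},
    {"drug": "drug_b", "score": 0.91, "existing": True,
     "num_of_shared_genes_in_path": 3, "num_of_shared_pathways": 2},
    {"drug": "drug_c", "score": 0.65, "existing": False,
     "num_of_shared_genes_in_path": 0, "num_of_shared_pathways": 0},
    {"drug": "drug_d", "score": 0.40, "existing": False,
     "num_of_shared_genes_in_path": 1, "num_of_shared_pathways": 1},
]


def apply_filter(rows, expression):
    """Keep rows for which the expression is true, using row keys as variables."""
    code = compile(expression, "<filter>", "eval")
    # Builtins are disabled; each row's keys are the only names available.
    return [row for row in rows if eval(code, {"__builtins__": {}}, dict(row))]


expression = (
    "score > 0.6 and existing == False and "
    "(num_of_shared_genes_in_path >= 1 or num_of_shared_pathways >= 1)"
)
kept = [row["drug"] for row in apply_filter(rows, expression)]
print(kept)  # ['drug_a']
```

Here drug_b is dropped because it is an existing drug, drug_c because it shares no genes or pathways, and drug_d because its score is below 0.6.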
Visualization Guide
Output Files
The toolkit generates comprehensive output files:
Main Results
- annotated_drugs.xlsx - Complete drug analysis with all annotations
- filtered_drugs.xlsx - Filtered drug candidates based on criteria
Visualization Reports
- analysis_report.html - Interactive HTML report with all visualizations
- Individual chart files (PNG/JSON) for each analysis type
Analysis Components
- predicted_drugs.xlsx - Initial drug predictions
- shared_genes_pathways.xlsx - Gene and pathway overlap analysis
- shared_diseases.xlsx - Disease similarity analysis
- network_annotations.xlsx - Network centrality features
Visualization Types
The toolkit generates 12 types of visualizations, including:
- Score Distribution - Predicted score distribution of candidate drugs
- Predicted Score Boxplot - Score distribution by knowledge graph inclusion
- Disease Similarity Heatmap - Drug similarity based on shared diseases
- Network Centrality - Drug network centrality analysis
- Shared Genes and Pathways - Comprehensive gene/pathway overlap analysis
- Drug Similarity Network - Interactive drug relationship network
- Shared Gene Count - Distribution of shared genes between drugs and diseases
- Score vs Degree - Relationship between network degree and predicted scores
Troubleshooting
Common Issues
Memory Issues
Symptoms: Out of memory errors, slow performance
Solutions:
- Reduce the top_n_diseases and top_n_drugs parameters
- Use smaller model files if available
- Close other applications to free up memory
File Not Found Errors
Symptoms: "File not found" or "No such file or directory" errors
Solutions:
- Verify file paths and names
- Run the data validation script (python3 examples/run_data_validation.py)
- Check file permissions
Import Errors
Symptoms: ModuleNotFoundError or ImportError
Solutions:
- Ensure virtual environment is activated
- Reinstall the package: pip install -e .
- Check that your Python version is 3.8 or higher
FAQ
Q: What disease IDs are supported?
A: The toolkit supports MONDO disease IDs (e.g., MONDO:0004979 for asthma). You can find disease IDs in the MONDO ontology or use the entity file to look up available diseases.
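As a sketch of the entity-file lookup mentioned in this answer, the snippet below searches rows in the entity format shown earlier for diseases whose name contains a query string. It assumes tab-separated columns and uses an inline copy of the two sample rows. The function name find_diseases is hypothetical and not part of the toolkit.

```python
import csv
import io

# Inline sample mirroring the entity format shown earlier (assumed tab-separated).
SAMPLE_ENTITIES = (
    "id\tlabel\tname\n"
    "MONDO:0004979\tDisease\tasthma\n"
    "CHEBI:12345\tCompound\taspirin\n"
)


def find_diseases(handle, query):
    """Return (id, name) pairs for Disease entities whose name contains query."""
    query = query.lower()
    reader = csv.DictReader(handle, delimiter="\t")
    return [
        (row["id"], row["name"])
        for row in reader
        if row["label"] == "Disease" and query in row["name"].lower()
    ]


matches = find_diseases(io.StringIO(SAMPLE_ENTITIES), "asth")
print(matches)  # [('MONDO:0004979', 'asthma')]
```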
Q: How long does analysis take?
A: Analysis time depends on dataset size and parameters. Small analyses (100 drugs) typically take 5-10 minutes, while large analyses (1000+ drugs) can take 30-60 minutes.
Q: Can I use my own KGE model?
A: Yes, the toolkit is designed to work with any KGE model that follows the specified file formats. See the Model Usage guide for details.
Q: How do I interpret the results?
A: Results include multiple metrics and visualizations. Higher scores indicate stronger predicted associations. Use the filtering tools to focus on the most promising candidates.
Q: What if I get no results?
A: Try lowering the threshold parameter or increasing the number of similar diseases. Also check that your disease ID exists in the knowledge graph.