Introduction

BioMedGPS Explainer is a toolkit for drug discovery analysis using knowledge graph embeddings. This guide walks you through the complete process, from installation to generating analysis reports.

What You'll Learn

  • How to set up and install the toolkit
  • How to prepare and validate your data
  • How to run drug discovery analysis
  • How to filter and visualize results
  • How to interpret the outputs
  • How to troubleshoot common issues

Installation

System Requirements

  • Python: 3.8 or higher
  • Memory: At least 8GB RAM (16GB recommended for large datasets)
  • Storage: At least 5GB free space for model files and results
  • Operating System: Windows, macOS, or Linux
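
The Python version and free-storage requirements above can be checked with a short script before installing. This is a minimal sketch using only the standard library. The function name check_requirements is hypothetical and not part of the toolkit, and RAM is not checked because the standard library has no portable call for it.

```python
import shutil
import sys


def check_requirements(path=".", min_python=(3, 8), min_free_gb=5):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    # Compare the running interpreter against the minimum version.
    if sys.version_info[:2] < min_python:
        problems.append(
            f"Python {min_python[0]}.{min_python[1]}+ required, "
            f"found {sys.version_info[0]}.{sys.version_info[1]}"
        )
    # shutil.disk_usage reports bytes; convert free space to gigabytes.
    free_gb = shutil.disk_usage(path).free / 1024**3
    if free_gb < min_free_gb:
        problems.append(
            f"{min_free_gb} GB free space required, found {free_gb:.1f} GB"
        )
    return problems


if __name__ == "__main__":
    for problem in check_requirements():
        print("WARNING:", problem)
```

Run it from the directory where you plan to store model files and results, since free space is measured on the disk that holds the given path.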

Step-by-Step Installation

1. Clone the Repository

git clone <repository-url>
cd biomedgps-explainer

2. Create Virtual Environment (Recommended)

python -m venv biomedgps_env

# On Windows
biomedgps_env\Scripts\activate

# On macOS/Linux
source biomedgps_env/bin/activate

3. Install the Package

pip install -e .

4. Verify Installation

biomedgps-explainer --help

Getting Started

Quick Start Example

1. Model Preparation

The toolkit automatically downloads pre-trained BioMedGPS model files from Weights & Biases (wandb) when you run the analysis, so no manual model file preparation is required.

2. Validate Data (Optional)

python3 examples/run_data_validation.py

3. Run Complete Analysis

python3 examples/run_full_example.py

4. View Results

  • Check the results/ directory for output files
  • Open results/visualization_report/analysis_report.html for the interactive report

Model Preparation

Automatic Model Download

The toolkit automatically downloads pre-trained BioMedGPS model files from Weights & Biases (wandb) when you run the analysis. This includes:

  1. annotated_entities.tsv - Entity annotations
  2. knowledge_graph.tsv - Knowledge graph triples
  3. entity_embeddings.tsv - Entity embeddings
  4. relation_type_embeddings.tsv - Relation embeddings

File Formats (Reference)

The downloaded model files follow these formats:

Entity File Format

id             label     name
MONDO:0004979  Disease   asthma
CHEBI:12345    Compound  aspirin

Knowledge Graph Format

source_id  source_type  source_name  target_id  target_type  target_name  relation_type
CHEBI:12345  Compound  aspirin  MONDO:0004979  Disease  asthma  GNBR::T::Compound:Disease

Embeddings Format

entity_id  entity_type  embedding
MONDO:0004979  Disease  0.1|0.2|0.3|0.4|...
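
If you want to inspect an embedding yourself, the row format above can be parsed in a few lines. The sketch below assumes the columns are separated by tabs (as the .tsv extension suggests) and that the embedding column holds floats joined by "|". The function name parse_embedding_row is hypothetical and not part of the toolkit.

```python
def parse_embedding_row(line):
    """Split one tab-separated row into (entity_id, entity_type, vector)."""
    entity_id, entity_type, embedding = line.rstrip("\n").split("\t")
    # The embedding column stores floats joined by "|".
    vector = [float(value) for value in embedding.split("|")]
    return entity_id, entity_type, vector


# Shortened sample row; real embeddings have many more dimensions.
row = "MONDO:0004979\tDisease\t0.1|0.2|0.3|0.4"
entity_id, entity_type, vector = parse_embedding_row(row)
print(entity_id, entity_type, len(vector))
```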

Data Validation (Optional)

You can optionally validate the downloaded data:

python3 examples/run_data_validation.py

This will:

  • Check file existence and format
  • Automatically decompress ZIP files if needed
  • Verify data integrity
  • Provide detailed error messages if issues are found

Basic Usage

Command Line Interface

1. Run Complete Analysis

biomedgps-explainer run --disease-id MONDO:0004979 --output-dir results/ --model-run-id 6vlvgvfq

Parameters:

  • --disease-id: Disease ID (required)
  • --output-dir: Output directory (required)
  • --model-run-id: Model run ID (default: 6vlvgvfq)
  • --top-n-diseases: Number of similar diseases (default: 100)
  • --threshold: Drug filtering threshold (default: 0.5)
  • --relation-type: Relation type (default: GNBR::T::Compound:Disease)
  • --top-n-drugs: Number of drugs to interpret (default: 1000)

2. Filter Results

biomedgps-explainer filter \
  --input-file results/annotated_drugs.xlsx \
  --expression "score > 0.6 and existing == False" \
  --output-file results/filtered_drugs.xlsx

3. Generate Visualizations

biomedgps-explainer visualize \
  --input-file results/filtered_drugs.xlsx \
  --output-dir results/visualizations/ \
  --disease-id MONDO:0004979 \
  --disease-name "asthma"

4. Run Complete Pipeline

biomedgps-explainer pipeline \
  --disease-id MONDO:0004979 \
  --model-run-id 6vlvgvfq \
  --output-dir results/ \
  --filter-expression "score > 0.6 and existing == False"

Description: Executes the complete workflow (run → filter → visualize) in a single command.
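
To run the pipeline for several diseases from a script, you can build the command shown above in Python. This is a sketch rather than part of the toolkit: the helper build_pipeline_command is hypothetical, it uses only the flags documented in this guide, and it assumes --filter-expression may be omitted.

```python
import shlex


def build_pipeline_command(disease_id, output_dir, model_run_id="6vlvgvfq",
                           filter_expression=None):
    """Return the argument list for `biomedgps-explainer pipeline`."""
    command = [
        "biomedgps-explainer", "pipeline",
        "--disease-id", disease_id,
        "--model-run-id", model_run_id,
        "--output-dir", output_dir,
    ]
    # Assumption: the filter expression is optional, so add it only if given.
    if filter_expression is not None:
        command += ["--filter-expression", filter_expression]
    return command


command = build_pipeline_command(
    "MONDO:0004979", "results/",
    filter_expression="score > 0.6 and existing == False",
)
# shlex.join quotes the expression so the line can be pasted into a shell.
print(shlex.join(command))
# To execute it: import subprocess; subprocess.run(command, check=True)
```

Passing the arguments as a list avoids shell quoting problems with expressions that contain spaces or comparison operators.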

Python API

Basic Workflow

from drugs4disease.core import DrugDiseaseCore
from drugs4disease.filter import DrugFilter
from drugs4disease.visualizer import Visualizer

# Initialize components
core = DrugDiseaseCore()
filter_tool = DrugFilter()
visualizer = Visualizer(disease_id="MONDO:0004979", disease_name="asthma")

# Run analysis
core.run_full_pipeline(
    disease_id="MONDO:0004979",
    output_dir="results/",
    top_n_diseases=50,
    top_n_drugs=100
)

# Filter results
filter_tool.filter_drugs(
    input_file="results/annotated_drugs.xlsx",
    expression="score > 0.7 and num_of_shared_genes_in_path >= 1",
    output_file="results/filtered_drugs.xlsx"
)

# Generate report
visualizer.generate_report(
    data_file="results/filtered_drugs.xlsx",
    output_file="results/analysis_report.html",
    title="Drug Discovery Analysis Report"
)

Advanced Features

Advanced Filtering

The toolkit supports complex logical expressions for drug filtering:

Filter Expression Examples

# High-scoring new drugs
"score > 0.7 and existing == False"

# Drugs with shared genes and pathways
"num_of_shared_genes_in_path >= 2 and num_of_shared_pathways >= 1"

# Network-central drugs
"drug_degree > 10 and num_of_key_genes >= 3"

# Complex combination
"score > 0.6 and existing == False and (num_of_shared_genes_in_path >= 1 or num_of_shared_pathways >= 1)"

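To see how such an expression selects rows, the sketch below evaluates the "complex combination" example against a few made-up records. It illustrates the semantics only: how the toolkit evaluates expressions internally is not covered in this guide, and the helper filter_rows, the drug names, and all values are hypothetical.

```python
def filter_rows(rows, expression):
    """Keep the rows (dicts) for which the expression evaluates to true."""
    # Compile once; each row's columns are exposed as variable names.
    code = compile(expression, "<filter>", "eval")
    # Empty __builtins__ limits what the expression can call. This is not a
    # security boundary; only evaluate expressions you wrote yourself.
    return [row for row in rows if eval(code, {"__builtins__": {}}, row)]


# Made-up records using column names from the examples above.
drugs = [
    {"drug": "drug_a", "score": 0.82, "existing": False,
     "num_of_shared_genes_in_path": 0, "num_of_shared_pathways": 2},
    {"drug": "drug_b", "score": 0.91, "existing": True,
     "num_of_shared_genes_in_path": 3, "num_of_shared_pathways": 1},
    {"drug": "drug_c", "score": 0.65, "existing": False,
     "num_of_shared_genes_in_path": 0, "num_of_shared_pathways": 0},
    {"drug": "drug_d", "score": 0.40, "existing": False,
     "num_of_shared_genes_in_path": 2, "num_of_shared_pathways": 1},
]

expression = (
    "score > 0.6 and existing == False and "
    "(num_of_shared_genes_in_path >= 1 or num_of_shared_pathways >= 1)"
)
selected = [row["drug"] for row in filter_rows(drugs, expression)]
print(selected)
```

Only drug_a passes: drug_b is already an existing treatment, drug_c shares no genes or pathways, and drug_d scores below the cutoff.
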
Custom Parameters

You can customize various parameters for different analysis scenarios:

  • Model Selection: Choose between different KGE models
  • Threshold Adjustment: Fine-tune prediction thresholds
  • Network Analysis: Configure centrality calculations
  • Pathway Analysis: Set enrichment parameters

Visualization Guide

Output Files

The toolkit writes the following output files:

Main Results

  • annotated_drugs.xlsx - Complete drug analysis with all annotations
  • filtered_drugs.xlsx - Filtered drug candidates based on criteria

Visualization Reports

  • analysis_report.html - Interactive HTML report with all visualizations
  • Individual chart files (PNG/JSON) for each analysis type

Analysis Components

  • predicted_drugs.xlsx - Initial drug predictions
  • shared_genes_pathways.xlsx - Gene and pathway overlap analysis
  • shared_diseases.xlsx - Disease similarity analysis
  • network_annotations.xlsx - Network centrality features

Visualization Types

The toolkit generates multiple types of visualizations, including:

  1. Score Distribution - Predicted score distribution of candidate drugs
  2. Predicted Score Boxplot - Score distribution by knowledge graph inclusion
  3. Disease Similarity Heatmap - Drug similarity based on shared diseases
  4. Network Centrality - Drug network centrality analysis
  5. Shared Genes and Pathways - Comprehensive gene/pathway overlap analysis
  6. Drug Similarity Network - Interactive drug relationship network
  7. Shared Gene Count - Distribution of shared genes between drugs and diseases
  8. Score vs Degree - Relationship between network degree and predicted scores

Troubleshooting

Common Issues

Memory Issues

Symptoms: Out of memory errors, slow performance

Solutions:

  • Reduce the top_n_diseases and top_n_drugs parameters (--top-n-diseases and --top-n-drugs on the command line)
  • Use smaller model files if available
  • Close other applications to free up memory

File Not Found Errors

Symptoms: "File not found" or "No such file or directory" errors

Solutions:

  • Verify file paths and names
  • Run the data validation script: python3 examples/run_data_validation.py
  • Check file permissions

Import Errors

Symptoms: ModuleNotFoundError or ImportError

Solutions:

  • Ensure the virtual environment is activated
  • Reinstall the package: pip install -e .
  • Check Python version compatibility
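
When an import fails, it helps to confirm which interpreter is running and whether it can see the package. The sketch below uses only the standard library. The function name diagnose_import is hypothetical, and the module name drugs4disease is taken from the Python API example in this guide.

```python
import importlib.util
import sys


def diagnose_import(module_name="drugs4disease"):
    """Report whether a top-level module is visible to this interpreter."""
    # find_spec returns None when a top-level module cannot be located.
    spec = importlib.util.find_spec(module_name)
    return {
        "interpreter": sys.executable,
        # Inside a venv, sys.prefix differs from the base installation.
        "in_virtualenv": sys.prefix != sys.base_prefix,
        "module_found": spec is not None,
    }


if __name__ == "__main__":
    for key, value in diagnose_import().items():
        print(f"{key}: {value}")
```

If in_virtualenv is False or the interpreter path is not inside biomedgps_env, activate the environment and reinstall the package there.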

FAQ

Q: What disease IDs are supported?

A: The toolkit supports MONDO disease IDs (e.g., MONDO:0004979 for asthma). You can find disease IDs in the MONDO ontology or use the entity file to look up available diseases.
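
As an illustration of looking up diseases in the entity file, the sketch below searches by name using the id, label, and name columns from the Entity File Format section. It assumes the file is tab-separated. The function name find_diseases is hypothetical, and the location of the downloaded annotated_entities.tsv is not specified in this guide, so adjust the path to your setup.

```python
import csv
import io


def find_diseases(tsv_file, name_contains):
    """Yield (id, name) for Disease entities whose name contains the text."""
    reader = csv.DictReader(tsv_file, delimiter="\t")
    needle = name_contains.lower()
    for row in reader:
        if row["label"] == "Disease" and needle in row["name"].lower():
            yield row["id"], row["name"]


# In-memory sample with the documented columns. For real data, pass
# open("annotated_entities.tsv", newline="", encoding="utf-8") instead.
sample = io.StringIO(
    "id\tlabel\tname\n"
    "MONDO:0004979\tDisease\tasthma\n"
    "CHEBI:12345\tCompound\taspirin\n"
)
matches = list(find_diseases(sample, "asth"))
print(matches)
```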

Q: How long does analysis take?

A: Analysis time depends on dataset size and parameters. Small analyses (100 drugs) typically take 5-10 minutes, while large analyses (1000+ drugs) can take 30-60 minutes.

Q: Can I use my own KGE model?

A: Yes, the toolkit is designed to work with any KGE model that follows the specified file formats. See the Model Usage guide for details.

Q: How do I interpret the results?

A: Results include multiple metrics and visualizations. Higher scores indicate stronger predicted associations. Use the filtering tools to focus on the most promising candidates.

Q: What if I get no results?

A: Try lowering the threshold parameter or increasing the number of similar diseases. Also check that your disease ID exists in the knowledge graph.
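
One way to check that a disease ID exists in the knowledge graph is to count the triples that mention it. This sketch assumes knowledge_graph.tsv is tab-separated with the header shown in the Knowledge Graph Format section. The function name count_triples is hypothetical and not part of the toolkit.

```python
import csv
import io


def count_triples(tsv_file, entity_id):
    """Count knowledge graph rows where the entity is source or target."""
    reader = csv.DictReader(tsv_file, delimiter="\t")
    return sum(
        1 for row in reader
        if entity_id in (row["source_id"], row["target_id"])
    )


# In-memory sample using the documented header and example triple. For real
# data, pass open("knowledge_graph.tsv", newline="", encoding="utf-8").
SAMPLE = (
    "source_id\tsource_type\tsource_name\ttarget_id\ttarget_type\t"
    "target_name\trelation_type\n"
    "CHEBI:12345\tCompound\taspirin\tMONDO:0004979\tDisease\tasthma\t"
    "GNBR::T::Compound:Disease\n"
)
print(count_triples(io.StringIO(SAMPLE), "MONDO:0004979"))
```

A count of zero means the ID does not appear in any triple, in which case changing the threshold will not help.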