User Guide - BioMedGPS Explainer

Introduction

BioMedGPS Explainer is a toolkit for drug discovery analysis using knowledge graph embeddings. This guide walks you through the full process, from installation to generating analysis reports.

What You'll Learn

How to set up and install the toolkit
How to prepare and validate your data
How to run drug discovery analysis
How to filter and visualize results
How to interpret the outputs
How to troubleshoot common issues

Installation

System Requirements

Python: 3.8 or higher
Memory: At least 8GB RAM (16GB recommended for large datasets)
Storage: At least 5GB free space for model files and results
Operating System: Windows, macOS, or Linux
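
Before installing, the Python requirement above can be confirmed with a short check. This is a minimal sketch that covers only the interpreter version; the name meets_python_requirement is chosen here for illustration and is not part of the toolkit.

```python
import sys

# Minimum interpreter version listed under System Requirements.
MIN_PYTHON = (3, 8)


def meets_python_requirement(version_info=None, minimum=MIN_PYTHON):
    """Return True if the given (or running) interpreter is at least `minimum`."""
    if version_info is None:
        version_info = sys.version_info
    return tuple(version_info[:2]) >= tuple(minimum)


if __name__ == "__main__":
    found = sys.version.split()[0]
    if meets_python_requirement():
        print("Python version OK:", found)
    else:
        print("Python 3.8 or higher is required; found", found)
```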

Step-by-Step Installation

1. Clone the Repository

git clone <repository-url>
cd biomedgps-explainer

2. Create Virtual Environment (Recommended)

python -m venv biomedgps_env

# On Windows
biomedgps_env\Scripts\activate

# On macOS/Linux
source biomedgps_env/bin/activate

3. Install the Package

pip install -e .

4. Verify Installation

biomedgps-explainer --help

Getting Started

Quick Start Example

1. Model Preparation

The toolkit automatically downloads pre-trained BioMedGPS model files from Weights & Biases (wandb) when you run the analysis. No manual preparation of model files is required.

2. Validate Data (Optional)

python3 examples/run_data_validation.py

3. Run Complete Analysis

python3 examples/run_full_example.py

4. View Results

Check the results/ directory for output files
Open results/visualization_report/analysis_report.html for the interactive report

Model Preparation

Automatic Model Download

The toolkit automatically downloads pre-trained BioMedGPS model files from Weights & Biases (wandb) when you run the analysis. This includes:

annotated_entities.tsv - Entity annotations
knowledge_graph.tsv - Knowledge graph triples
entity_embeddings.tsv - Entity embeddings
relation_type_embeddings.tsv - Relation embeddings

File Formats (Reference)

The downloaded model files follow these formats:

Entity File Format

id  label   name
MONDO:0004979  Disease  asthma
CHEBI:12345    Compound aspirin
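
The entity file can be searched with Python's standard csv module, for example to find the ID of a disease by name. The sketch below assumes the file is tab-separated with the header row shown above (suggested by the .tsv extension); load_entities and find_entities are illustrative helper names, not toolkit functions.

```python
import csv
import io


def load_entities(handle):
    """Read entity rows (columns: id, label, name) from a tab-separated file object."""
    return list(csv.DictReader(handle, delimiter="\t"))


def find_entities(entities, label, name_contains):
    """Return entities of the given label whose name contains the text (case-insensitive)."""
    needle = name_contains.lower()
    return [
        row for row in entities
        if row["label"] == label and needle in row["name"].lower()
    ]


# Inline sample mirroring the rows shown above. A real run would instead use
# open("annotated_entities.tsv", newline="", encoding="utf-8").
sample = io.StringIO(
    "id\tlabel\tname\n"
    "MONDO:0004979\tDisease\tasthma\n"
    "CHEBI:12345\tCompound\taspirin\n"
)
entities = load_entities(sample)
matches = find_entities(entities, label="Disease", name_contains="asthma")
print([row["id"] for row in matches])
```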

Knowledge Graph Format

source_id  source_type  source_name  target_id  target_type  target_name  relation_type
CHEBI:12345  Compound  aspirin  MONDO:0004979  Disease  asthma  GNBR::T::Compound:Disease

Embeddings Format

entity_id  entity_type  embedding
MONDO:0004979  Disease  0.1|0.2|0.3|0.4|...
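
The embedding column stores each vector as a pipe-delimited string, so it has to be split and converted before any numeric use. The sketch below parses such a string and compares two vectors with cosine similarity. Cosine similarity is used here purely as an illustration; this guide does not state which similarity measure the toolkit applies internally, and the vectors are made up.

```python
import math


def parse_embedding(text):
    """Convert a pipe-delimited string such as '0.1|0.2|0.3' into a list of floats."""
    return [float(value) for value in text.split("|")]


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Made-up vectors for illustration only; real embeddings have many more dimensions.
disease_vec = parse_embedding("0.1|0.2|0.3|0.4")
other_vec = parse_embedding("0.2|0.4|0.6|0.8")
print(cosine_similarity(disease_vec, other_vec))
```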

Data Validation (Optional)

You can optionally validate the downloaded data:

python3 examples/run_data_validation.py

This will:

Check file existence and format
Automatically decompress ZIP files if needed
Verify data integrity
Provide detailed error messages if issues are found

Basic Usage

Command Line Interface

1. Run Complete Analysis

biomedgps-explainer run --disease-id MONDO:0004979 --output-dir results/ --model-run-id 6vlvgvfq

Parameters:

--disease-id: Disease ID (required)
--output-dir: Output directory (required)
--model-run-id: Model run ID (default: 6vlvgvfq)
--top-n-diseases: Number of similar diseases (default: 100)
--threshold: Drug filtering threshold (default: 0.5)
--relation-type: Relation type (default: GNBR::T::Compound:Disease)
--top-n-drugs: Number of drugs to interpret (default: 1000)
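
When the run command has to be launched repeatedly, for instance once per disease, it can help to assemble it from a script. The sketch below builds the argument list using only the flags documented above; build_run_command and run_command are hypothetical helper names, and flags left unset are simply omitted so the CLI defaults apply.

```python
import shlex
import subprocess


def build_run_command(disease_id, output_dir, model_run_id=None, top_n_diseases=None,
                      threshold=None, relation_type=None, top_n_drugs=None):
    """Assemble the argument list for `biomedgps-explainer run` from the documented flags."""
    command = ["biomedgps-explainer", "run",
               "--disease-id", disease_id,
               "--output-dir", output_dir]
    optional = {
        "--model-run-id": model_run_id,
        "--top-n-diseases": top_n_diseases,
        "--threshold": threshold,
        "--relation-type": relation_type,
        "--top-n-drugs": top_n_drugs,
    }
    for flag, value in optional.items():
        if value is not None:  # omitted flags fall back to the CLI defaults
            command += [flag, str(value)]
    return command


def run_command(command):
    """Execute the assembled command; requires the toolkit to be installed."""
    return subprocess.run(command, check=True)


command = build_run_command("MONDO:0004979", "results/", top_n_diseases=50, threshold=0.6)
print(shlex.join(command))  # show the command without executing it
```

Passing the arguments as a list avoids shell quoting problems with values such as GNBR::T::Compound:Disease.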

2. Filter Results

biomedgps-explainer filter \
  --input-file results/annotated_drugs.xlsx \
  --expression "score > 0.6 and existing == False" \
  --output-file results/filtered_drugs.xlsx

3. Generate Visualizations

biomedgps-explainer visualize \
  --input-file results/filtered_drugs.xlsx \
  --output-dir results/visualizations/ \
  --disease-id MONDO:0004979 \
  --disease-name "asthma"

4. Run Complete Pipeline

biomedgps-explainer pipeline \
  --disease-id MONDO:0004979 \
  --model-run-id 6vlvgvfq \
  --output-dir results/ \
  --filter-expression "score > 0.6 and existing == False"

Description: Executes the complete workflow (run → filter → visualize) in a single command.

Python API

Basic Workflow

from drugs4disease.core import DrugDiseaseCore
from drugs4disease.filter import DrugFilter
from drugs4disease.visualizer import Visualizer

# Initialize components
core = DrugDiseaseCore()
filter_tool = DrugFilter()
visualizer = Visualizer(disease_id="MONDO:0004979", disease_name="asthma")

# Run analysis
core.run_full_pipeline(
    disease_id="MONDO:0004979",
    output_dir="results/",
    top_n_diseases=50,
    top_n_drugs=100
)

# Filter results
filter_tool.filter_drugs(
    input_file="results/annotated_drugs.xlsx",
    expression="score > 0.7 and num_of_shared_genes_in_path >= 1",
    output_file="results/filtered_drugs.xlsx"
)

# Generate report
visualizer.generate_report(
    data_file="results/filtered_drugs.xlsx",
    output_file="results/analysis_report.html",
    title="Drug Discovery Analysis Report"
)

Advanced Features

Advanced Filtering

The toolkit supports complex logical expressions for drug filtering:

Filter Expression Examples

# High-scoring new drugs
"score > 0.7 and existing == False"

# Drugs with shared genes and pathways
"num_of_shared_genes_in_path >= 2 and num_of_shared_pathways >= 1"

# Network-central drugs
"drug_degree > 10 and num_of_key_genes >= 3"

# Complex combination
"score > 0.6 and existing == False and (num_of_shared_genes_in_path >= 1 or num_of_shared_pathways >= 1)"

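To see how expressions like these select rows, the sketch below evaluates one against a few made-up records. It is not the toolkit's implementation: it assumes each row is a dict keyed by column name and uses Python's eval with builtins removed, which is acceptable only for expressions you wrote yourself. The drug column and all values are hypothetical.

```python
def filter_rows(rows, expression):
    """Keep the rows (dicts keyed by column name) for which `expression` is true.

    eval() keeps the sketch short; never pass it untrusted input.
    """
    compiled = compile(expression, "<filter>", "eval")
    # Builtins are emptied so the expression can reference only the row's columns.
    return [row for row in rows if eval(compiled, {"__builtins__": {}}, row)]


# Made-up rows for illustration; "drug" is a hypothetical column name.
rows = [
    {"drug": "drug_a", "score": 0.82, "existing": False,
     "num_of_shared_genes_in_path": 2, "num_of_shared_pathways": 0},
    {"drug": "drug_b", "score": 0.91, "existing": True,
     "num_of_shared_genes_in_path": 3, "num_of_shared_pathways": 1},
    {"drug": "drug_c", "score": 0.65, "existing": False,
     "num_of_shared_genes_in_path": 0, "num_of_shared_pathways": 0},
]

expression = (
    "score > 0.6 and existing == False and "
    "(num_of_shared_genes_in_path >= 1 or num_of_shared_pathways >= 1)"
)
kept = filter_rows(rows, expression)
print([row["drug"] for row in kept])
```

Here drug_b is dropped because it is an existing treatment, and drug_c because it shares neither genes nor pathways.
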
Custom Parameters

You can customize various parameters for different analysis scenarios:

Model Selection: Choose between different KGE models
Threshold Adjustment: Fine-tune prediction thresholds
Network Analysis: Configure centrality calculations
Pathway Analysis: Set enrichment parameters

Visualization Guide

Output Files

The toolkit generates the following output files:

Main Results

annotated_drugs.xlsx - Complete drug analysis with all annotations
filtered_drugs.xlsx - Filtered drug candidates based on criteria

Visualization Reports

analysis_report.html - Interactive HTML report with all visualizations
Individual chart files (PNG/JSON) for each analysis type

Analysis Components

predicted_drugs.xlsx - Initial drug predictions
shared_genes_pathways.xlsx - Gene and pathway overlap analysis
shared_diseases.xlsx - Disease similarity analysis
network_annotations.xlsx - Network centrality features

Visualization Types

The toolkit generates 12 different types of visualizations, including the following:

Score Distribution - Predicted score distribution of candidate drugs
Predicted Score Boxplot - Score distribution by knowledge graph inclusion
Disease Similarity Heatmap - Drug similarity based on shared diseases
Network Centrality - Drug network centrality analysis
Shared Genes and Pathways - Comprehensive gene/pathway overlap analysis
Drug Similarity Network - Interactive drug relationship network
Shared Gene Count - Distribution of shared genes between drugs and diseases
Score vs Degree - Relationship between network degree and predicted scores

Troubleshooting

Common Issues

Memory Issues

Symptoms: Out-of-memory errors, slow performance

Solutions:

Reduce the top_n_diseases and top_n_drugs parameters (--top-n-diseases and --top-n-drugs on the command line)
Use smaller model files if available
Close other applications to free up memory

File Not Found Errors

Symptoms: "File not found" or "No such file or directory" errors

Solutions:

Verify file paths and names
Run the data validation script (python3 examples/run_data_validation.py)
Check file permissions
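
A quick way to narrow down a missing-file error is to list which of the expected model files are absent. The sketch below checks for the four file names listed under Automatic Model Download. The directory is an assumption: this guide does not say where the downloaded files are stored, so "data/" is only a placeholder.

```python
from pathlib import Path

# File names listed under "Automatic Model Download".
EXPECTED_FILES = [
    "annotated_entities.tsv",
    "knowledge_graph.tsv",
    "entity_embeddings.tsv",
    "relation_type_embeddings.tsv",
]


def missing_model_files(model_dir, expected=EXPECTED_FILES):
    """Return the expected file names that are absent from `model_dir`."""
    directory = Path(model_dir)
    return [name for name in expected if not (directory / name).is_file()]


if __name__ == "__main__":
    # "data/" is a placeholder; point this at wherever the files were downloaded.
    missing = missing_model_files("data/")
    print("All model files found." if not missing else f"Missing: {missing}")
```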

Import Errors

Symptoms: ModuleNotFoundError or ImportError

Solutions:

Ensure the virtual environment is activated
Reinstall the package: pip install -e .
Check Python version compatibility
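
The three checks above can be combined into a small diagnostic script. This is a sketch, not a toolkit utility: it reports which interpreter is running (it should live inside the virtual environment), its version, and whether the drugs4disease package used in the Python API examples can be found.

```python
import importlib.util
import sys


def is_importable(module_name):
    """Return True if the top-level `module_name` can be found on the import path."""
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    print("Interpreter:", sys.executable)  # should point inside the virtual environment
    print("Python version:", sys.version.split()[0])
    print("drugs4disease importable:", is_importable("drugs4disease"))
```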

FAQ

Q: What disease IDs are supported?

A: The toolkit supports MONDO disease IDs (e.g., MONDO:0004979 for asthma). You can find disease IDs in the MONDO ontology, or look up the available diseases in the entity file (annotated_entities.tsv).

Q: How long does analysis take?

A: Analysis time depends on dataset size and parameters. Small analyses (100 drugs) typically take 5-10 minutes, while large analyses (1000+ drugs) can take 30-60 minutes.

Q: Can I use my own KGE model?

A: Yes, the toolkit is designed to work with any KGE model that follows the specified file formats. See the Model Usage guide for details.

Q: How do I interpret the results?

A: Results include multiple metrics and visualizations. Higher scores indicate stronger predicted associations. Use the filtering tools to focus on the most promising candidates.

Q: What if I get no results?

A: Try lowering the --threshold parameter or increasing the number of similar diseases (--top-n-diseases). Also check that your disease ID exists in the knowledge graph.
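
Checking whether a disease ID occurs in the knowledge graph can be done directly against the triples file. The sketch below assumes knowledge_graph.tsv is tab-separated with the column names shown in the Knowledge Graph Format section; disease_in_graph is an illustrative helper, and the inline sample stands in for the real file.

```python
import csv
import io


def disease_in_graph(handle, disease_id):
    """Return True if `disease_id` appears as a source or target in the triples."""
    reader = csv.DictReader(handle, delimiter="\t")
    return any(
        disease_id in (row["source_id"], row["target_id"])
        for row in reader
    )


# Inline sample mirroring the documented format. A real run would instead use
# open("knowledge_graph.tsv", newline="", encoding="utf-8").
SAMPLE_TSV = (
    "source_id\tsource_type\tsource_name\ttarget_id\ttarget_type\ttarget_name\trelation_type\n"
    "CHEBI:12345\tCompound\taspirin\tMONDO:0004979\tDisease\tasthma\tGNBR::T::Compound:Disease\n"
)

print(disease_in_graph(io.StringIO(SAMPLE_TSV), "MONDO:0004979"))
```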