Model Usage Guide - BioMedGPS Explainer

Overview

BioMedGPS Explainer uses Knowledge Graph Embedding (KGE) models to predict potential drug-disease associations. This guide explains how to work with different model types, understand their parameters, and optimize performance for your specific use case.

What are Knowledge Graph Embeddings?

Knowledge Graph Embeddings represent entities (drugs, diseases, genes) and relationships as vectors in a continuous space. This allows the model to learn complex patterns and make predictions about potential associations between drugs and diseases.

Model Setup

Automatic Model Download

BioMedGPS Explainer automatically downloads pre-trained model files and configuration from Weights & Biases (wandb) when you run the analysis. No manual setup is required!

What's Downloaded Automatically?

The toolkit automatically retrieves:

Pre-trained KGE model
Model configuration (config.json) with optimal parameters
Entity annotations
Entity embeddings
Knowledge graph data
Relation embeddings

Model Selection

You can choose different pre-trained models using the --model-run-id parameter, which corresponds to run IDs from the wandb project.

Using Different Models

# CLI usage with specific model run ID
biomedgps-explainer run --disease-id MONDO:0004979 --model-run-id 6vlvgvfq --output-dir results/

# Python API usage with specific model run ID
# (import DrugDiseaseCore from the toolkit package before running this)
core = DrugDiseaseCore()
core.run_full_pipeline(
    disease_id="MONDO:0004979",
    model_run_id="6vlvgvfq",  # specify wandb run ID
    output_dir="results/"
)

Finding Model Run IDs

Browse available pre-trained models at wandb.ai/yjcyxky/biomedgps-kge-v1 to find different model run IDs. Each run represents a different model configuration or training setup.

Supported Model Types

TransE

Translation-based model that treats each relationship as a translation in the embedding space: for a true (head, relation, tail) triple, head + relation should land close to tail.

Pros:

Simple and interpretable
Fast training
Good for 1-to-1 relationships

Cons:

Limited for complex relationships
May struggle with 1-to-many relationships
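
As a minimal sketch of the translation idea, the snippet below scores a triple by the L2 distance between head + relation and tail, where a smaller distance means a more plausible triple. The function name and the three-dimensional toy vectors are illustrative assumptions, not the toolkit's API or real embeddings.

```python
import math


def transe_l2_distance(head, relation, tail):
    """Return ||head + relation - tail|| (L2 norm); smaller is more plausible."""
    return math.sqrt(
        sum((h + r - t) ** 2 for h, r, t in zip(head, relation, tail))
    )


# Toy vectors chosen so that aspirin + treats lands on asthma (illustrative only).
aspirin = [0.1, 0.2, 0.3]
treats = [0.4, 0.1, -0.1]
asthma = [0.5, 0.3, 0.2]
unrelated = [2.0, 2.0, 2.0]

print(transe_l2_distance(aspirin, treats, asthma))     # approximately 0
print(transe_l2_distance(aspirin, treats, unrelated))  # clearly larger
```

Because a single translation vector maps each head to one point, several valid tails for the same head and relation are pushed toward the same location, which is why 1-to-many relationships are harder for this model.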

TransH

Hyperplane-based translation model that projects entities onto a relation-specific hyperplane before applying the translation.

Pros:

Better for complex relationships
Handles 1-to-many relationships
More flexible than TransE

Cons:

More complex training
Higher computational cost
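
The projection step can be sketched as follows, assuming the standard TransH formulation in which each relation has a hyperplane normal and a translation vector. All names and values here are hypothetical and only illustrate why two different tails can both fit the same head.

```python
import math


def project_to_hyperplane(vec, normal):
    """Remove the component of vec that lies along the hyperplane normal."""
    norm = math.sqrt(sum(n * n for n in normal))
    unit = [n / norm for n in normal]
    dot = sum(v * u for v, u in zip(vec, unit))
    return [v - dot * u for v, u in zip(vec, unit)]


def transh_distance(head, tail, normal, translation):
    """L2 distance between the projected head plus translation and the projected tail."""
    h_p = project_to_hyperplane(head, normal)
    t_p = project_to_hyperplane(tail, normal)
    return math.sqrt(
        sum((h + d - t) ** 2 for h, d, t in zip(h_p, translation, t_p))
    )


normal = [0.0, 0.0, 1.0]       # hypothetical relation-specific normal
translation = [1.0, 1.0, 0.0]  # hypothetical translation within the hyperplane
head = [1.0, 0.0, 5.0]
tail_a = [2.0, 1.0, 0.0]
tail_b = [2.0, 1.0, 9.0]       # differs from tail_a only along the normal

print(transh_distance(head, tail_a, normal, translation))
print(transh_distance(head, tail_b, normal, translation))
```

Both tails receive the same distance because they differ only in the direction that the projection discards, which is how the model accommodates 1-to-many relationships.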

RotatE

Rotation-based model that treats each relationship as an element-wise rotation in complex vector space.

Pros:

Models symmetric, antisymmetric, inverse, and compositional relationships
Handles complex patterns
Good theoretical foundation

Cons:

Complex implementation
Slower training
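
A minimal sketch of the rotation idea, assuming the usual RotatE formulation in which each relation is a vector of phase angles applied element-wise to complex-valued entity embeddings. The function name and toy values are illustrative, not taken from the toolkit.

```python
import cmath
import math


def rotate_distance(head, phases, tail):
    """Sum of |h_i * e^(i * phase_i) - t_i| over all dimensions; smaller is more plausible."""
    return sum(
        abs(h * cmath.exp(1j * p) - t) for h, p, t in zip(head, phases, tail)
    )


# A rotation by pi is its own inverse, so the relation holds in both directions.
symmetric = [math.pi, math.pi]
a = [1 + 0j, 0 + 1j]
b = [-1 + 0j, 0 - 1j]
print(rotate_distance(a, symmetric, b))  # approximately 0
print(rotate_distance(b, symmetric, a))  # approximately 0

# A rotation by pi/2 is not its own inverse, so the reverse direction scores poorly.
one_way = [math.pi / 2]
print(rotate_distance([1 + 0j], one_way, [0 + 1j]))  # approximately 0
print(rotate_distance([0 + 1j], one_way, [1 + 0j]))  # clearly larger
```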

Data Format Requirements

Required Files

The toolkit requires four main data files in TSV (Tab-Separated Values) format:

1. Entity Annotations (annotated_entities.tsv)

id  label   name
MONDO:0004979  Disease  asthma
CHEBI:12345    Compound aspirin
HGNC:1234      Gene     TNF

Columns:

id: Unique entity identifier
label: Entity type (Disease, Compound, Gene, etc.)
name: Human-readable entity name

2. Knowledge Graph (knowledge_graph.tsv)

source_id  source_type  source_name  target_id  target_type  target_name  relation_type
CHEBI:12345  Compound  aspirin  MONDO:0004979  Disease  asthma  GNBR::T::Compound:Disease
HGNC:1234    Gene      TNF      MONDO:0004979  Disease  asthma  GNBR::T::Gene:Disease

Columns:

source_id: Source entity identifier
source_type: Source entity type
source_name: Source entity name
target_id: Target entity identifier
target_type: Target entity type
target_name: Target entity name
relation_type: Type of relationship
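
To illustrate the column layout above, here is a small sketch that reads knowledge graph rows with Python's standard csv module and lists the compounds linked to a disease. The helper names are hypothetical, and the in-memory sample simply mirrors the two example rows; in practice you would open knowledge_graph.tsv instead.

```python
import csv
import io

# In-memory stand-in for knowledge_graph.tsv (tab-separated, same columns as above).
SAMPLE = (
    "source_id\tsource_type\tsource_name\ttarget_id\ttarget_type\ttarget_name\trelation_type\n"
    "CHEBI:12345\tCompound\taspirin\tMONDO:0004979\tDisease\tasthma\tGNBR::T::Compound:Disease\n"
    "HGNC:1234\tGene\tTNF\tMONDO:0004979\tDisease\tasthma\tGNBR::T::Gene:Disease\n"
)


def read_triples(handle):
    """Read all rows as dictionaries keyed by the header columns."""
    return list(csv.DictReader(handle, delimiter="\t"))


def compounds_linked_to(triples, disease_id):
    """Names of Compound sources whose target is the given disease."""
    return [
        row["source_name"]
        for row in triples
        if row["target_id"] == disease_id and row["source_type"] == "Compound"
    ]


triples = read_triples(io.StringIO(SAMPLE))
print(compounds_linked_to(triples, "MONDO:0004979"))
```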

3. Entity Embeddings (entity_embeddings.tsv)

entity_id  entity_type  embedding
MONDO:0004979  Disease  0.1|0.2|0.3|0.4|...
CHEBI:12345    Compound 0.5|0.6|0.7|0.8|...

Columns:

entity_id: Entity identifier
entity_type: Entity type
embedding: Vector representation (pipe-separated)
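
The pipe-separated embedding column needs to be split into floats before use. The sketch below shows one way to do that and to check that all vectors share the same dimension. The function names are hypothetical, and the four-value vectors are shortened stand-ins for real embeddings.

```python
import csv
import io

# In-memory stand-in for entity_embeddings.tsv with shortened toy vectors.
SAMPLE = (
    "entity_id\tentity_type\tembedding\n"
    "MONDO:0004979\tDisease\t0.1|0.2|0.3|0.4\n"
    "CHEBI:12345\tCompound\t0.5|0.6|0.7|0.8\n"
)


def parse_embedding(field):
    """Turn '0.1|0.2|...' into a list of floats."""
    return [float(value) for value in field.split("|")]


def read_entity_embeddings(handle):
    """Map entity_id to its vector, rejecting files with mixed dimensions."""
    embeddings = {}
    for row in csv.DictReader(handle, delimiter="\t"):
        embeddings[row["entity_id"]] = parse_embedding(row["embedding"])
    dims = {len(vec) for vec in embeddings.values()}
    if len(dims) > 1:
        raise ValueError(f"inconsistent embedding dimensions: {sorted(dims)}")
    return embeddings


embeddings = read_entity_embeddings(io.StringIO(SAMPLE))
print(len(embeddings["MONDO:0004979"]))
```

The same parse_embedding helper would apply to the embedding column of the relation embeddings file, since it uses the same pipe-separated encoding.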

4. Relation Embeddings (relation_type_embeddings.tsv)

relation_type  embedding
GNBR::T::Compound:Disease  0.1|0.2|0.3|0.4|...
GNBR::T::Gene:Disease      0.5|0.6|0.7|0.8|...

Columns:

relation_type: Relationship type
embedding: Vector representation (pipe-separated)

Model Selection Guide

Choosing the Right Model

The choice of KGE model depends on your specific use case and data characteristics:

For Simple Drug-Disease Associations

Recommended: TransE_l2 (TransE with L2 distance)

Use when you have straightforward drug-disease relationships and want fast, reliable predictions.

For Complex Biological Networks

Recommended: TransH or RotatE

Use when dealing with complex multi-entity relationships and biological pathways.

For Large-Scale Analysis

Recommended: TransE_l2

Use for large datasets where computational efficiency is important.

Model Comparison

Model	Training Speed	Prediction Accuracy	Memory Usage	Best For
TransE_l2	Fast	Good	Low	General use, large datasets
TransH	Medium	Better	Medium	Complex relationships
RotatE	Slow	Best	High	Research, complex patterns

Model Parameters

Key Parameters

gamma (Margin Parameter)

Default: 12.0

Range: 1.0 - 50.0

Description: Controls the margin between positive and negative samples during training. Higher values make the model more discriminative but may reduce generalization.

Recommendation: Start with 12.0 and adjust based on validation performance.
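
To make the role of the margin concrete, here is a sketch of the margin ranking loss from the classic TransE formulation. This guide does not state which loss the pre-trained models were trained with, so treat this as an illustration of how a margin behaves rather than a description of the toolkit's training code.

```python
def margin_ranking_loss(pos_distance, neg_distance, gamma=12.0):
    """Zero once the negative sample is at least gamma farther away than the positive."""
    return max(0.0, gamma + pos_distance - neg_distance)


# Negative is far enough away: no penalty.
print(margin_ranking_loss(pos_distance=1.0, neg_distance=20.0))

# Negative is too close for gamma=12.0: the model is penalized.
print(margin_ranking_loss(pos_distance=1.0, neg_distance=5.0))

# A smaller margin accepts the same pair without penalty.
print(margin_ranking_loss(pos_distance=1.0, neg_distance=5.0, gamma=2.0))
```

A larger gamma demands a wider gap between true and corrupted triples, which is the sense in which higher values are more discriminative.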

threshold (Prediction Threshold)

Default: 0.5

Range: 0.0 - 1.0

Description: Minimum score threshold for considering a drug-disease association as positive.

Recommendation: Use 0.5-0.7 for balanced precision and recall; raise it to favor precision, or lower it to favor recall.
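
The precision/recall trade-off can be seen on a toy example. The scores and labels below are invented for illustration, and precision_recall is a hypothetical helper rather than part of the toolkit.

```python
def precision_recall(scored, threshold):
    """scored is a list of (score, is_known_association) pairs."""
    predicted = [label for score, label in scored if score >= threshold]
    true_positives = sum(predicted)
    all_positives = sum(label for _, label in scored)
    # Report 0.0 when nothing is predicted or nothing is positive.
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / all_positives if all_positives else 0.0
    return precision, recall


# Invented scores with known labels (True = real association).
scored = [(0.9, True), (0.8, True), (0.6, False), (0.55, True), (0.3, False)]

for threshold in (0.5, 0.7):
    print(threshold, precision_recall(scored, threshold))
```

On this toy data, raising the threshold from 0.5 to 0.7 removes the one false positive but also drops one true association.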

top_n_diseases

Default: 100

Range: 10 - 1000

Description: Number of similar diseases to consider for drug prediction.

Recommendation: Use 50-100 for most cases, increase for rare diseases.
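
As a sketch of what selecting the top N similar diseases could look like, the snippet below ranks entities by cosine similarity of their embeddings. The similarity measure the toolkit actually uses is not stated in this guide, so cosine similarity, the function names, and the placeholder IDs are all assumptions.

```python
import heapq
import math


def cosine(u, v):
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def most_similar(query_id, embeddings, top_n):
    """IDs of the top_n entities most similar to query_id, best first."""
    query = embeddings[query_id]
    candidates = (
        (cosine(query, vec), entity_id)
        for entity_id, vec in embeddings.items()
        if entity_id != query_id
    )
    return [entity_id for _, entity_id in heapq.nlargest(top_n, candidates)]


# Placeholder IDs and two-dimensional toy vectors.
disease_embeddings = {
    "DISEASE:A": [1.0, 0.0],
    "DISEASE:B": [0.9, 0.1],
    "DISEASE:C": [0.0, 1.0],
    "DISEASE:D": [-1.0, 0.0],
}
print(most_similar("DISEASE:A", disease_embeddings, top_n=2))
```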

top_n_drugs

Default: 1000

Range: 100 - 10000

Description: Maximum number of drugs to analyze.

Recommendation: Use 500-1000 for focused analysis, higher for comprehensive screening.

Parameter Optimization

Start with Defaults

Begin with the default parameter values to establish a performance baseline.

Adjust Threshold

Fine-tune the prediction threshold based on your precision/recall requirements.

Optimize Gamma

Experiment with different gamma values to find the optimal margin for your data.

Validate Results

Use cross-validation or holdout sets to validate parameter choices.
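
One simple way to validate a threshold on a holdout set is to pick the candidate with the best F1 score. The sketch below assumes you have held-out (score, label) pairs; the data, function names, and candidate values are illustrative only.

```python
def f1_at(scored, threshold):
    """F1 score when everything scoring at or above the threshold is predicted positive."""
    tp = sum(1 for s, label in scored if s >= threshold and label)
    fp = sum(1 for s, label in scored if s >= threshold and not label)
    fn = sum(1 for s, label in scored if s < threshold and label)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def best_threshold(holdout, candidates):
    """Candidate threshold with the highest F1 on the holdout pairs."""
    return max(candidates, key=lambda t: f1_at(holdout, t))


# Invented holdout pairs: (score, is_known_association).
holdout = [(0.9, True), (0.8, True), (0.6, False), (0.55, True), (0.3, False)]
candidates = [0.5, 0.7, 0.85]

for t in candidates:
    print(t, round(f1_at(holdout, t), 3))
print("selected:", best_threshold(holdout, candidates))
```

F1 weights precision and recall equally; if your use case favors one of them, a different selection criterion would be more appropriate.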

Performance Optimization

Computational Requirements

Memory Usage

Small datasets: 4-8 GB RAM
Medium datasets: 8-16 GB RAM
Large datasets: 16+ GB RAM

Processing Time

100 drugs: 5-10 minutes
500 drugs: 15-30 minutes
1000+ drugs: 30-60 minutes

Storage

Model files: 2-5 GB
Results: 100-500 MB
Total: 3-6 GB

Performance Tips

Use SSD Storage

Solid-state drives significantly improve I/O performance for large model files.

Optimize Parameters

Start with smaller values for parameters such as top_n_drugs and top_n_diseases, then scale up based on your computational resources.

Batch Processing

Process multiple diseases in batches to optimize memory usage and processing time.
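
A minimal batching sketch follows. The analyze callable is a placeholder for whatever runs one disease (for example, a function that wraps the pipeline call shown earlier in this guide). The batch size and the extra disease IDs are made up for illustration.

```python
def batched(items, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def run_in_batches(disease_ids, analyze, batch_size=2):
    """Call analyze(disease_id) batch by batch and collect the results."""
    results = {}
    for batch in batched(disease_ids, batch_size):
        for disease_id in batch:
            results[disease_id] = analyze(disease_id)
        # A natural point to write results to disk or release large objects.
    return results


# Placeholder IDs; only the first one appears elsewhere in this guide.
disease_ids = ["MONDO:0004979", "DISEASE:B", "DISEASE:C"]
print(list(batched(disease_ids, 2)))
print(run_in_batches(disease_ids, analyze=lambda d: f"analyzed {d}"))
```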

Monitor Resources

Use system monitoring tools to track memory and CPU usage during analysis.
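
Alongside system tools, you can take rough in-process measurements with Python's standard library. The measure helper below is a hypothetical wrapper. Note that tracemalloc only sees allocations made through Python's allocator, so memory held by native libraries may not be counted.

```python
import time
import tracemalloc


def measure(func, *args, **kwargs):
    """Run func and return (result, elapsed_seconds, peak_traced_bytes)."""
    tracemalloc.start()
    start = time.perf_counter()
    try:
        result = func(*args, **kwargs)
    finally:
        elapsed = time.perf_counter() - start
        _, peak = tracemalloc.get_traced_memory()
        tracemalloc.stop()
    return result, elapsed, peak


# Stand-in workload; replace with your own analysis call.
result, elapsed, peak = measure(lambda: [0] * 100_000)
print(f"elapsed: {elapsed:.4f} s, peak traced memory: {peak / 1e6:.2f} MB")
```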

Using Custom Models

Custom Model Integration

BioMedGPS Explainer supports custom KGE models that follow the specified data format. Here's how to integrate your own models:

Prepare Your Data

Ensure your model outputs follow the required TSV format for entity and relation embeddings.

Organize Files

Place your model files in the appropriate data directory structure.

Validate Format

Run the data validation script (examples/run_data_validation.py) to ensure compatibility.

Test Integration

Run a small test analysis to verify your model works correctly.
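
Before running the full validation, a quick header check can catch obvious format problems. The sketch below is not the toolkit's validation script. It is a hypothetical pre-check that compares each file's header row against the column names listed in the Data Format Requirements section.

```python
import io

# Expected header columns, taken from the format descriptions in this guide.
EXPECTED_COLUMNS = {
    "annotated_entities.tsv": ["id", "label", "name"],
    "knowledge_graph.tsv": [
        "source_id", "source_type", "source_name",
        "target_id", "target_type", "target_name", "relation_type",
    ],
    "entity_embeddings.tsv": ["entity_id", "entity_type", "embedding"],
    "relation_type_embeddings.tsv": ["relation_type", "embedding"],
}


def missing_columns(filename, handle):
    """Return expected columns that are absent from the file's header row."""
    header = handle.readline().rstrip("\r\n").split("\t")
    return [col for col in EXPECTED_COLUMNS[filename] if col not in header]


# In-memory examples; in practice, pass open(path, encoding="utf-8").
good = io.StringIO("relation_type\tembedding\n")
bad = io.StringIO("id\tname\n")
print(missing_columns("relation_type_embeddings.tsv", good))
print(missing_columns("annotated_entities.tsv", bad))
```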

Model Validation

Before using a custom model, validate it using the built-in validation tools:

# Validate custom model files
python3 examples/run_data_validation.py --model-dir path/to/your/model

# Test with small dataset
python3 examples/run_full_example.py --disease MONDO:0004979 --top-n-drugs 10

Best Practices for Custom Models

Consistent Format: Ensure all files follow the exact TSV format specifications
Entity Coverage: Make sure your model covers all entities in your knowledge graph
Embedding Quality: Validate embedding quality using similarity metrics
Documentation: Document your model's training process and parameters
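
The entity coverage check above can be sketched as a set difference between the entity IDs used in the knowledge graph and the IDs that have embeddings. The function name and inputs are hypothetical; you would build them from your own knowledge_graph.tsv and entity_embeddings.tsv.

```python
def uncovered_entities(triples, embedded_ids):
    """Entity IDs that appear in (source_id, target_id) pairs but have no embedding."""
    used = set()
    for source_id, target_id in triples:
        used.add(source_id)
        used.add(target_id)
    return sorted(used - set(embedded_ids))


# IDs taken from the example rows in this guide; the missing embedding is contrived.
triples = [
    ("CHEBI:12345", "MONDO:0004979"),
    ("HGNC:1234", "MONDO:0004979"),
]
embedded_ids = {"CHEBI:12345", "MONDO:0004979"}
print(uncovered_entities(triples, embedded_ids))
```

An empty result means every entity referenced in the graph has an embedding, while any listed ID points to a gap to fix before running an analysis.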