Overview

BioMedGPS Explainer uses Knowledge Graph Embedding (KGE) models to predict potential drug-disease associations. This guide explains how to work with different model types, understand their parameters, and optimize performance for your specific use case.

What are Knowledge Graph Embeddings?

Knowledge Graph Embeddings represent entities (drugs, diseases, genes) and relationships as vectors in a continuous space. This allows the model to learn complex patterns and make predictions about potential associations between drugs and diseases.

Model Setup

Automatic Model Download

BioMedGPS Explainer automatically downloads pre-trained model files and configuration from Weights & Biases (wandb) when you run the analysis. No manual setup is required!

What's Downloaded Automatically?

The toolkit automatically retrieves:

  • Pre-trained KGE model and embeddings
  • Model configuration (config.json) with optimal parameters
  • Entity annotations and embeddings
  • Knowledge graph data
  • Relation embeddings

Model Selection

You can choose different pre-trained models using the --model-run-id parameter, which corresponds to run IDs from the wandb project.

Using Different Models

# CLI usage with specific model run ID
biomedgps-explainer run --disease-id MONDO:0004979 --model-run-id 6vlvgvfq --output-dir results/

# Python API usage with specific model run ID
# (import DrugDiseaseCore from the installed toolkit package first)
core = DrugDiseaseCore()
core.run_full_pipeline(
    disease_id="MONDO:0004979",
    model_run_id="6vlvgvfq",  # specify wandb run ID
    output_dir="results/"
)

Finding Model Run IDs

Browse available pre-trained models at wandb.ai/yjcyxky/biomedgps-kge-v1 to find different model run IDs. Each run represents a different model configuration or training setup.

Supported Model Types

TransE

Translation-based model that treats relationships as translations in the embedding space.

Pros:

  • Simple and interpretable
  • Fast training
  • Good for 1-to-1 relationships

Cons:

  • Limited for complex relationships
  • May struggle with 1-to-many relationships
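The translation idea can be sketched in a few lines. This is a minimal illustration of the standard TransE scoring rule with an L2 distance, not the toolkit's own code; the function name transe_l2_distance and the toy vectors are invented for the example.

```python
import math


def transe_l2_distance(head, relation, tail):
    """Return ||head + relation - tail||; a smaller distance means a more plausible triple."""
    return math.sqrt(sum((h + r - t) ** 2 for h, r, t in zip(head, relation, tail)))


# Toy 3-dimensional vectors, invented for illustration only.
drug = [0.1, 0.2, 0.3]
treats = [0.4, 0.1, -0.1]
disease = [0.5, 0.3, 0.2]      # equals drug + treats, so the triple fits well
unrelated = [-0.5, 0.9, 0.0]   # far from drug + treats

good_fit = transe_l2_distance(drug, treats, disease)
poor_fit = transe_l2_distance(drug, treats, unrelated)
```

Because a relation is a single translation vector, two different tails for the same head and relation are pulled toward the same point, which is one way to see why 1-to-many relationships are harder for TransE.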

TransH

Hyperplane-based translation model that projects entities onto relation-specific hyperplanes.

Pros:

  • Better for complex relationships
  • Handles 1-to-many relationships
  • More flexible than TransE

Cons:

  • More complex training
  • Higher computational cost
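A rough sketch of the hyperplane projection may help here. It follows the usual TransH formulation (project head and tail onto the relation's hyperplane, then translate); the function names and vectors are illustrative assumptions, and in a trained model the hyperplane normal is a learned parameter.

```python
import math


def project_onto_hyperplane(vector, normal):
    """Remove the component of vector that lies along the hyperplane's normal."""
    length = math.sqrt(sum(n * n for n in normal))
    unit = [n / length for n in normal]
    along = sum(v * u for v, u in zip(vector, unit))
    return [v - along * u for v, u in zip(vector, unit)]


def transh_distance(head, relation, tail, normal):
    """Translation distance measured after projecting head and tail."""
    h_proj = project_onto_hyperplane(head, normal)
    t_proj = project_onto_hyperplane(tail, normal)
    return math.sqrt(sum((h + r - t) ** 2 for h, r, t in zip(h_proj, relation, t_proj)))


# Two heads that differ only along the normal direction (toy values).
normal = [0.0, 0.0, 1.0]
relation = [0.0, 1.0, 0.0]
tail = [1.0, 1.0, 2.0]
head_a = [1.0, 0.0, 0.0]
head_b = [1.0, 0.0, 5.0]

distance_a = transh_distance(head_a, relation, tail, normal)
distance_b = transh_distance(head_b, relation, tail, normal)
```

Both heads get the same distance even though their raw embeddings differ, which is how the projection lets several distinct entities fit the same relation and partner.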

RotatE

Rotation-based model that treats relationships as rotations in complex space.

Pros:

  • Excellent for symmetric relationships, and can also model antisymmetry, inversion, and composition
  • Handles complex patterns
  • Good theoretical foundation

Cons:

  • Complex implementation
  • Slower training
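To make the rotation idea concrete, here is a small sketch using Python's built-in complex numbers. It assumes the common formulation in which each relation dimension is a phase angle and the distance is the sum of per-dimension complex moduli; the names and values are illustrative, not taken from the toolkit.

```python
import cmath
import math


def rotate_distance(head, phases, tail):
    """Rotate each complex head coordinate by its phase and compare with the tail."""
    return sum(abs(h * cmath.exp(1j * p) - t) for h, p, t in zip(head, phases, tail))


# A relation whose phases are all pi is its own inverse, so it is symmetric.
symmetric_relation = [math.pi, math.pi]
entity_a = [1 + 0j, 0 + 1j]
entity_b = [-1 + 0j, 0 - 1j]   # entity_a rotated by pi in every dimension

forward = rotate_distance(entity_a, symmetric_relation, entity_b)
backward = rotate_distance(entity_b, symmetric_relation, entity_a)
```

Both directions give a near-zero distance, which is the symmetric case; other phase values give antisymmetric relations, and adding phases composes relations.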

Data Format Requirements

Required Files

The toolkit requires four main data files in TSV (Tab-Separated Values) format:

1. Entity Annotations (annotated_entities.tsv)

id  label   name
MONDO:0004979  Disease  asthma
CHEBI:12345    Compound aspirin
HGNC:1234      Gene     TNF

Columns:

  • id: Unique entity identifier
  • label: Entity type (Disease, Compound, Gene, etc.)
  • name: Human-readable entity name
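If you want to inspect this file yourself, a sketch using only the standard library could look like the following. The helper name read_entities is hypothetical, and the sample rows are the ones shown above.

```python
import csv
import io

REQUIRED_COLUMNS = {"id", "label", "name"}

# In-memory stand-in for annotated_entities.tsv, using the sample rows above.
SAMPLE_ENTITIES = (
    "id\tlabel\tname\n"
    "MONDO:0004979\tDisease\tasthma\n"
    "CHEBI:12345\tCompound\taspirin\n"
    "HGNC:1234\tGene\tTNF\n"
)


def read_entities(handle):
    """Return {entity id: row dict}, failing early if a required column is missing."""
    reader = csv.DictReader(handle, delimiter="\t")
    missing = REQUIRED_COLUMNS - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    return {row["id"]: row for row in reader}


entities = read_entities(io.StringIO(SAMPLE_ENTITIES))
```

For a file on disk, pass open("annotated_entities.tsv", newline="", encoding="utf-8") instead of the StringIO object.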

2. Knowledge Graph (knowledge_graph.tsv)

source_id  source_type  source_name  target_id  target_type  target_name  relation_type
CHEBI:12345  Compound  aspirin  MONDO:0004979  Disease  asthma  GNBR::T::Compound:Disease
HGNC:1234    Gene      TNF      MONDO:0004979  Disease  asthma  GNBR::T::Gene:Disease

Columns:

  • source_id: Source entity identifier
  • source_type: Source entity type
  • source_name: Source entity name
  • target_id: Target entity identifier
  • target_type: Target entity type
  • target_name: Target entity name
  • relation_type: Type of relationship
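As a quick way to explore the graph file, the sketch below groups the entities that point at a given target by their type. The helper name neighbors_by_type is an assumption for illustration; the sample rows are the ones shown above.

```python
import csv
import io
from collections import defaultdict

# In-memory stand-in for knowledge_graph.tsv, using the sample rows above.
SAMPLE_KG = (
    "source_id\tsource_type\tsource_name\ttarget_id\ttarget_type\ttarget_name\trelation_type\n"
    "CHEBI:12345\tCompound\taspirin\tMONDO:0004979\tDisease\tasthma\tGNBR::T::Compound:Disease\n"
    "HGNC:1234\tGene\tTNF\tMONDO:0004979\tDisease\tasthma\tGNBR::T::Gene:Disease\n"
)


def neighbors_by_type(handle, target_id):
    """Group the names of source entities linked to target_id by source type."""
    grouped = defaultdict(list)
    for row in csv.DictReader(handle, delimiter="\t"):
        if row["target_id"] == target_id:
            grouped[row["source_type"]].append(row["source_name"])
    return dict(grouped)


asthma_neighbors = neighbors_by_type(io.StringIO(SAMPLE_KG), "MONDO:0004979")
```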

3. Entity Embeddings (entity_embeddings.tsv)

entity_id  entity_type  embedding
MONDO:0004979  Disease  0.1|0.2|0.3|0.4|...
CHEBI:12345    Compound 0.5|0.6|0.7|0.8|...

Columns:

  • entity_id: Entity identifier
  • entity_type: Entity type
  • embedding: Vector representation (pipe-separated)
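The pipe-separated embedding column can be turned into numeric vectors with a few lines of standard-library code. This is a sketch only: the helper names are hypothetical, and the sample vectors are shortened to four made-up dimensions.

```python
import csv
import io

# In-memory stand-in for entity_embeddings.tsv with short, made-up vectors.
SAMPLE_EMBEDDINGS = (
    "entity_id\tentity_type\tembedding\n"
    "MONDO:0004979\tDisease\t0.1|0.2|0.3|0.4\n"
    "CHEBI:12345\tCompound\t0.5|0.6|0.7|0.8\n"
)


def parse_embedding(text):
    """Split a pipe-separated string into a list of floats."""
    return [float(value) for value in text.split("|")]


def load_embeddings(handle, key_column):
    """Return {key: vector} and check that all vectors share one dimension."""
    vectors = {}
    for row in csv.DictReader(handle, delimiter="\t"):
        vectors[row[key_column]] = parse_embedding(row["embedding"])
    dimensions = {len(vector) for vector in vectors.values()}
    if len(dimensions) > 1:
        raise ValueError(f"inconsistent embedding dimensions: {sorted(dimensions)}")
    return vectors


entity_vectors = load_embeddings(io.StringIO(SAMPLE_EMBEDDINGS), "entity_id")
```

The same function should work for the relation embeddings file by passing "relation_type" as the key column.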

4. Relation Embeddings (relation_type_embeddings.tsv)

relation_type  embedding
GNBR::T::Compound:Disease  0.1|0.2|0.3|0.4|...
GNBR::T::Gene:Disease      0.5|0.6|0.7|0.8|...

Columns:

  • relation_type: Relationship type
  • embedding: Vector representation (pipe-separated)

Model Selection Guide

Choosing the Right Model

The choice of KGE model depends on your specific use case and data characteristics:

For Simple Drug-Disease Associations

Recommended: TransE_l2 (TransE with an L2 distance)

Use when you have straightforward drug-disease relationships and want fast, reliable predictions.

For Complex Biological Networks

Recommended: TransH or RotatE

Use when dealing with complex multi-entity relationships and biological pathways.

For Large-Scale Analysis

Recommended: TransE_l2

Use for large datasets where computational efficiency is important.

Model Comparison

Model      Training Speed  Prediction Accuracy  Memory Usage  Best For
TransE_l2  Fast            Good                 Low           General use, large datasets
TransH     Medium          Better               Medium        Complex relationships
RotatE     Slow            Best                 High          Research, complex patterns

Model Parameters

Key Parameters

gamma (Margin Parameter)

Default: 12.0

Range: 1.0 - 50.0

Description: Controls the margin between positive and negative samples during training. Higher values make the model more discriminative but may reduce generalization.

Recommendation: Start with 12.0 and adjust based on validation performance.
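To see what the margin does numerically, consider one convention used by margin-based KGE toolkits such as DGL-KE: a triple's score is gamma minus its embedding distance, and a sigmoid maps that score into the 0-1 range. Whether BioMedGPS Explainer applies exactly this convention is an assumption here, and the function names are illustrative.

```python
import math


def margin_score(distance, gamma=12.0):
    """Assumed convention: positive when the distance is inside the margin."""
    return gamma - distance


def to_probability(score):
    """Logistic sigmoid, mapping any real score into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-score))


inside_margin = to_probability(margin_score(10.0))    # distance below gamma
outside_margin = to_probability(margin_score(14.0))   # distance above gamma
```

Under this convention, gamma acts as the distance at which the probability crosses 0.5, so raising it makes more triples look plausible and lowering it makes the model stricter.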

threshold (Prediction Threshold)

Default: 0.5

Range: 0.0 - 1.0

Description: Minimum score threshold for considering a drug-disease association as positive.

Recommendation: Use 0.5-0.7 for balanced precision/recall, higher for precision, lower for recall.
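The trade-off can be checked on a small labelled set. The sketch below computes precision and recall at two thresholds; the scores and labels are toy values invented for the example, and precision_recall is a hypothetical helper.

```python
def precision_recall(scores, labels, threshold):
    """Treat scores at or above the threshold as positive predictions."""
    predicted = [score >= threshold for score in scores]
    true_pos = sum(1 for p, y in zip(predicted, labels) if p and y)
    false_pos = sum(1 for p, y in zip(predicted, labels) if p and not y)
    false_neg = sum(1 for p, y in zip(predicted, labels) if not p and y)
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    return precision, recall


# Toy scores with known labels (1 = true association, 0 = not).
scores = [0.9, 0.8, 0.6, 0.55, 0.3]
labels = [1, 1, 0, 1, 0]

balanced = precision_recall(scores, labels, threshold=0.5)
strict = precision_recall(scores, labels, threshold=0.7)
```

On this toy data, the higher threshold removes the one false positive but also drops a true association, which is the precision-for-recall exchange described above.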

top_n_diseases

Default: 100

Range: 10 - 1000

Description: Number of similar diseases to consider for drug prediction.

Recommendation: Use 50-100 for most cases, increase for rare diseases.
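A sketch of what selecting the top N similar diseases might involve is shown below. It assumes similarity is measured by cosine similarity between disease embeddings, which the toolkit may or may not do internally; the entity IDs and vectors are placeholders.

```python
import heapq
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_n_similar(query_id, embeddings, n):
    """Return the n entity IDs most similar to query_id, best first."""
    query = embeddings[query_id]
    candidates = (
        (cosine(query, vector), entity_id)
        for entity_id, vector in embeddings.items()
        if entity_id != query_id
    )
    return [entity_id for _, entity_id in heapq.nlargest(n, candidates)]


# Placeholder disease embeddings (2-dimensional for readability).
disease_vectors = {
    "disease_a": [1.0, 0.0],
    "disease_b": [0.9, 0.1],
    "disease_c": [0.0, 1.0],
    "disease_d": [-1.0, 0.0],
}
closest = top_n_similar("disease_a", disease_vectors, n=2)
```

A larger N pulls in more distant diseases, which can help when a rare disease has few close neighbors but also adds noise.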

top_n_drugs

Default: 1000

Range: 100 - 10000

Description: Maximum number of drugs to analyze.

Recommendation: Use 500-1000 for focused analysis, higher for comprehensive screening.

Parameter Optimization

1. Start with Defaults

Begin with the default parameter values to establish a baseline performance.

2. Adjust Threshold

Fine-tune the prediction threshold based on your precision/recall requirements.

3. Optimize Gamma

Experiment with different gamma values to find the optimal margin for your data.

4. Validate Results

Use cross-validation or holdout sets to validate parameter choices.
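The steps above can be combined into a small grid search. This is a generic sketch: grid_search is a hypothetical helper, and toy_evaluate stands in for whatever holdout metric you compute from a real run.

```python
from itertools import product


def grid_search(evaluate, gammas, thresholds):
    """Try every (gamma, threshold) pair and keep the best-scoring one.

    evaluate is a caller-supplied function returning a holdout metric
    where higher is better.
    """
    best = None
    for gamma, threshold in product(gammas, thresholds):
        score = evaluate(gamma=gamma, threshold=threshold)
        if best is None or score > best[0]:
            best = (score, gamma, threshold)
    return best


def toy_evaluate(gamma, threshold):
    """Stand-in metric that peaks at gamma=12.0 and threshold=0.6."""
    return -abs(gamma - 12.0) - abs(threshold - 0.6)


best_score, best_gamma, best_threshold = grid_search(
    toy_evaluate, gammas=[6.0, 12.0, 24.0], thresholds=[0.5, 0.6, 0.7]
)
```

Keeping the defaults in the grid means the baseline from step 1 is always one of the candidates being compared.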

Performance Optimization

Computational Requirements

Memory Usage

  • Small datasets: 4-8 GB RAM
  • Medium datasets: 8-16 GB RAM
  • Large datasets: 16+ GB RAM

Processing Time

  • 100 drugs: 5-10 minutes
  • 500 drugs: 15-30 minutes
  • 1000+ drugs: 30-60 minutes

Storage

  • Model files: 2-5 GB
  • Results: 100-500 MB
  • Total: 3-6 GB

Performance Tips

Use SSD Storage

Solid-state drives significantly improve I/O performance for large model files.

Optimize Parameters

Start with smaller values of top_n_drugs and top_n_diseases, then scale up as your computational resources allow.

Batch Processing

Process multiple diseases in batches to optimize memory usage and processing time.
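One way to structure batch runs is sketched below. The run_one callable is a placeholder for whatever performs a single-disease analysis in your setup, and the second and third disease IDs are dummy values.

```python
def batched(items, size):
    """Yield consecutive slices of items with at most size elements each."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def run_in_batches(disease_ids, run_one, batch_size=2):
    """Call run_one for each disease, one batch at a time."""
    results = {}
    for batch in batched(disease_ids, batch_size):
        for disease_id in batch:
            results[disease_id] = run_one(disease_id)
        # End of a batch: a natural point to write results to disk and free memory.
    return results


# Placeholder analysis that only records the ID it was given.
disease_ids = ["MONDO:0004979", "EXAMPLE:0001", "EXAMPLE:0002"]
batch_results = run_in_batches(disease_ids, run_one=lambda d: f"analysed {d}")
```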

Monitor Resources

Use system monitoring tools to track memory and CPU usage during analysis.
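For a lightweight check from inside Python, the standard library can time a call and report peak memory. This is a sketch with a hypothetical measure helper; note that tracemalloc only sees memory allocated through Python, so native allocations made by numerical libraries may not be fully counted.

```python
import time
import tracemalloc


def measure(func, *args, **kwargs):
    """Run func and return (result, elapsed seconds, peak traced memory in MB)."""
    tracemalloc.start()
    started = time.perf_counter()
    try:
        result = func(*args, **kwargs)
    finally:
        elapsed = time.perf_counter() - started
        _, peak_bytes = tracemalloc.get_traced_memory()
        tracemalloc.stop()
    return result, elapsed, peak_bytes / 1e6


# Example with a stand-in workload.
total, seconds, peak_mb = measure(sum, range(1000))
```

System-level tools remain the better choice for whole-process figures; this helper is mainly useful for comparing two parameter settings in the same session.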

Using Custom Models

Custom Model Integration

BioMedGPS Explainer supports custom KGE models that follow the specified data format. Here's how to integrate your own models:

1. Prepare Your Data

Ensure your model outputs follow the required TSV format for entity and relation embeddings.

2. Organize Files

Place your model files in the appropriate data directory structure.

3. Validate Format

Run the data validation script to ensure compatibility.

4. Test Integration

Run a small test analysis to verify your model works correctly.

Model Validation

Before using a custom model, validate it using the built-in validation tools:

# Validate custom model files
python3 examples/run_data_validation.py --model-dir path/to/your/model

# Test with small dataset
python3 examples/run_full_example.py --disease MONDO:0004979 --top-n-drugs 10

Best Practices for Custom Models

  • Consistent Format: Ensure all files follow the exact TSV format specifications
  • Entity Coverage: Make sure your model covers all entities in your knowledge graph
  • Embedding Quality: Validate embedding quality using similarity metrics
  • Documentation: Document your model's training process and parameters
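The entity coverage point lends itself to a quick check of your own before running the validation script. The sketch below lists knowledge graph entities that have no embedding; missing_entities is a hypothetical helper and the rows reuse the sample data from this guide.

```python
def missing_entities(kg_rows, embedding_ids):
    """Return sorted IDs that appear in the graph but have no embedding."""
    kg_ids = set()
    for row in kg_rows:
        kg_ids.add(row["source_id"])
        kg_ids.add(row["target_id"])
    return sorted(kg_ids - set(embedding_ids))


# Sample graph rows and the IDs present in the entity embeddings file.
kg_rows = [
    {"source_id": "CHEBI:12345", "target_id": "MONDO:0004979"},
    {"source_id": "HGNC:1234", "target_id": "MONDO:0004979"},
]
embedding_ids = ["MONDO:0004979", "CHEBI:12345"]

uncovered = missing_entities(kg_rows, embedding_ids)
```

An empty result means every entity in the graph has a vector; anything listed would need an embedding or would have to be dropped from the graph.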