Overview
BioMedGPS Explainer uses Knowledge Graph Embedding (KGE) models to predict potential drug-disease associations. This guide explains how to work with different model types, understand their parameters, and optimize performance for your specific use case.
What are Knowledge Graph Embeddings?
Knowledge Graph Embeddings represent entities (drugs, diseases, genes) and relationships as vectors in a continuous space. This allows the model to learn complex patterns and make predictions about potential associations between drugs and diseases.
Model Setup
Automatic Model Download
BioMedGPS Explainer automatically downloads pre-trained model files and configuration from Weights & Biases (wandb) when you run the analysis. No manual setup is required!
What's Downloaded Automatically?
The toolkit automatically retrieves:
- Pre-trained KGE model and embeddings
- Model configuration (config.json) with optimal parameters
- Entity annotations and embeddings
- Knowledge graph data
- Relation embeddings
Model Selection
You can choose different pre-trained models using the --model-run-id parameter, which corresponds to run IDs from the wandb project.
Using Different Models
# CLI usage with specific model run ID
biomedgps-explainer run --disease-id MONDO:0004979 --model-run-id 6vlvgvfq --output-dir results/
# Python API usage with specific model run ID
# (assumes DrugDiseaseCore has already been imported from the toolkit package)
core = DrugDiseaseCore()
core.run_full_pipeline(
    disease_id="MONDO:0004979",
    model_run_id="6vlvgvfq",  # specify wandb run ID
    output_dir="results/"
)
Finding Model Run IDs
Browse available pre-trained models at wandb.ai/yjcyxky/biomedgps-kge-v1 to find different model run IDs. Each run represents a different model configuration or training setup.
Supported Model Types
TransE
Translation-based model that treats relationships as translations in the embedding space.
Pros:
- Simple and interpretable
- Fast training
- Good for 1-to-1 relationships
Cons:
- Limited for complex relationships
- May struggle with 1-to-many, many-to-1, and many-to-many relationships
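As a minimal sketch of the translation idea (not the toolkit's own implementation), the function below scores a triple by the L2 distance between head + relation and tail, where a smaller distance means a more plausible triple. The function name and the toy vectors are assumptions made for illustration.

```python
import math

def transe_distance(head, relation, tail):
    """L2 distance ||head + relation - tail||; lower means more plausible."""
    return math.sqrt(sum((h + r - t) ** 2 for h, r, t in zip(head, relation, tail)))

# Toy 3-dimensional vectors, made up for illustration only.
drug = [0.25, 0.5, 0.0]
treats = [0.25, 0.0, 0.5]
disease = [0.5, 0.5, 0.5]
unrelated = [2.0, 2.0, 2.0]

print(transe_distance(drug, treats, disease))    # exact translation: 0.0
print(transe_distance(drug, treats, unrelated))  # larger distance: less plausible
```

Because a single relation vector maps each head to exactly one point, two different tails for the same head and relation cannot both sit at distance zero, which is why 1-to-many relationships are harder for this model.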
TransH
Hyperplane-based translation model that projects entities onto relation-specific hyperplanes.
Pros:
- Better for complex relationships
- Handles 1-to-many relationships
- More flexible than TransE
Cons:
- More complex training
- Higher computational cost
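The hyperplane projection can be sketched as follows, assuming a unit-length normal vector per relation. This is an illustrative version of the general TransH idea with made-up names and vectors, not code from the toolkit.

```python
import math

def project_to_hyperplane(vec, normal):
    """Remove the component of vec along the unit normal of the hyperplane."""
    dot = sum(v * n for v, n in zip(vec, normal))
    return [v - dot * n for v, n in zip(vec, normal)]

def transh_distance(head, translation, tail, normal):
    """L2 distance between projected head + translation and projected tail."""
    h = project_to_hyperplane(head, normal)
    t = project_to_hyperplane(tail, normal)
    return math.sqrt(sum((a + d - b) ** 2 for a, d, b in zip(h, translation, t)))

# Toy example: the normal is the z-axis, so z-components are projected away.
normal = [0.0, 0.0, 1.0]
translation = [1.0, 0.0, 0.0]
head = [0.0, 1.0, 5.0]
tail_a = [1.0, 1.0, -3.0]
tail_b = [1.0, 1.0, 7.0]

# The two tails differ only along the normal, so both fit the same head.
print(transh_distance(head, translation, tail_a, normal))  # 0.0
print(transh_distance(head, translation, tail_b, normal))  # 0.0
```

Distinct tails that share a projection can all match one head, which is how the projection step accommodates 1-to-many relationships.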
RotatE
Rotation-based model that treats relationships as rotations in complex space.
Pros:
- Models symmetric, antisymmetric, inverse, and compositional relationships
- Handles complex patterns
- Good theoretical foundation
Cons:
- Complex implementation
- Slower training
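A rough sketch of the rotation idea, assuming complex-valued embeddings and one phase angle per dimension for the relation. The distance here is the sum of per-dimension moduli, which is one common choice. All names and values are illustrative assumptions.

```python
import cmath
import math

def rotate_distance(head, phases, tail):
    """Sum over dimensions of |h_i * exp(i * phase_i) - t_i|."""
    return sum(abs(h * cmath.exp(1j * p) - t) for h, p, t in zip(head, phases, tail))

# Toy 2-dimensional complex embeddings, made up for illustration.
head = [1 + 0j, 0 + 1j]
tail = [-1 + 0j, 0 - 1j]
half_turn = [math.pi, math.pi]  # rotate every dimension by pi

# A half-turn applied twice is the identity, so the relation fits both directions.
print(rotate_distance(head, half_turn, tail) < 1e-9)  # True
print(rotate_distance(tail, half_turn, head) < 1e-9)  # True
```

A relation whose phases are all 0 or pi behaves symmetrically, as in this example, while other phase values give direction-dependent relations.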
Data Format Requirements
Required Files
The toolkit requires four main data files in TSV (Tab-Separated Values) format:
1. Entity Annotations (annotated_entities.tsv)
id label name
MONDO:0004979 Disease asthma
CHEBI:12345 Compound aspirin
HGNC:1234 Gene TNF
Columns:
- id: Unique entity identifier
- label: Entity type (Disease, Compound, Gene, etc.)
- name: Human-readable entity name
2. Knowledge Graph (knowledge_graph.tsv)
source_id source_type source_name target_id target_type target_name relation_type
CHEBI:12345 Compound aspirin MONDO:0004979 Disease asthma GNBR::T::Compound:Disease
HGNC:1234 Gene TNF MONDO:0004979 Disease asthma GNBR::T::Gene:Disease
Columns:
- source_id: Source entity identifier
- source_type: Source entity type
- source_name: Source entity name
- target_id: Target entity identifier
- target_type: Target entity type
- target_name: Target entity name
- relation_type: Type of relationship
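To get a quick overview of a knowledge graph file in this layout, the rows can be read with Python's csv module and tallied by relation type. The sketch below uses an in-memory sample built from the two example rows. The helper name is an assumption, not part of the toolkit.

```python
import csv
import io
from collections import Counter

def count_relation_types(handle):
    """Count knowledge-graph rows per relation_type."""
    reader = csv.DictReader(handle, delimiter="\t")
    return Counter(row["relation_type"] for row in reader)

# In-memory stand-in for knowledge_graph.tsv (tab-separated).
sample = io.StringIO(
    "source_id\tsource_type\tsource_name\ttarget_id\ttarget_type\ttarget_name\trelation_type\n"
    "CHEBI:12345\tCompound\taspirin\tMONDO:0004979\tDisease\tasthma\tGNBR::T::Compound:Disease\n"
    "HGNC:1234\tGene\tTNF\tMONDO:0004979\tDisease\tasthma\tGNBR::T::Gene:Disease\n"
)
counts = count_relation_types(sample)
print(counts["GNBR::T::Compound:Disease"])  # 1
```

For a file on disk, pass an open file handle, for example `open("knowledge_graph.tsv", newline="")`, in place of the in-memory sample.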
3. Entity Embeddings (entity_embeddings.tsv)
entity_id entity_type embedding
MONDO:0004979 Disease 0.1|0.2|0.3|0.4|...
CHEBI:12345 Compound 0.5|0.6|0.7|0.8|...
Columns:
- entity_id: Entity identifier
- entity_type: Entity type
- embedding: Vector representation (pipe-separated)
4. Relation Embeddings (relation_type_embeddings.tsv)
relation_type embedding
GNBR::T::Compound:Disease 0.1|0.2|0.3|0.4|...
GNBR::T::Gene:Disease 0.5|0.6|0.7|0.8|...
Columns:
- relation_type: Relationship type
- embedding: Vector representation (pipe-separated)
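The pipe-separated embedding column used by both embedding files can be parsed along these lines. This is a minimal sketch with a hypothetical helper name and short made-up vectors. Real embeddings have many more dimensions.

```python
import csv
import io

def load_embeddings(handle, key_column):
    """Map each row's key column to its embedding as a list of floats."""
    reader = csv.DictReader(handle, delimiter="\t")
    return {
        row[key_column]: [float(x) for x in row["embedding"].split("|")]
        for row in reader
    }

# In-memory stand-in for entity_embeddings.tsv (values are made up).
sample = io.StringIO(
    "entity_id\tentity_type\tembedding\n"
    "MONDO:0004979\tDisease\t0.1|0.2|0.3|0.4\n"
    "CHEBI:12345\tCompound\t0.5|0.6|0.7|0.8\n"
)
entity_vectors = load_embeddings(sample, key_column="entity_id")
print(entity_vectors["MONDO:0004979"])  # [0.1, 0.2, 0.3, 0.4]
```

The same helper works for relation embeddings by passing `key_column="relation_type"`.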
Model Selection Guide
Choosing the Right Model
The choice of KGE model depends on your specific use case and data characteristics:
For Simple Drug-Disease Associations
Recommended: TransE_l2 (TransE with the L2 distance)
Use when you have straightforward drug-disease relationships and want fast, reliable predictions.
For Complex Biological Networks
Recommended: TransH or RotatE
Use when dealing with complex multi-entity relationships and biological pathways.
For Large-Scale Analysis
Recommended: TransE_l2
Use for large datasets where computational efficiency is important.
Model Comparison
| Model | Training Speed | Prediction Accuracy | Memory Usage | Best For |
|---|---|---|---|---|
| TransE_l2 | Fast | Good | Low | General use, large datasets |
| TransH | Medium | Better | Medium | Complex relationships |
| RotatE | Slow | Best | High | Research, complex patterns |
Model Parameters
Key Parameters
gamma (Margin Parameter)
Default: 12.0
Range: 1.0 - 50.0
Description: Controls the margin between positive and negative samples during training. Higher values make the model more discriminative but may reduce generalization.
Recommendation: Start with 12.0 and adjust based on validation performance.
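This guide does not state the exact training loss, so the following is only a generic margin ranking formulation showing how a margin such as gamma separates positive from negative samples. The function name and the distances are illustrative assumptions.

```python
def margin_ranking_loss(pos_distance, neg_distance, gamma=12.0):
    """Zero once the negative is at least gamma farther away than the positive."""
    return max(0.0, gamma + pos_distance - neg_distance)

print(margin_ranking_loss(1.0, 20.0))            # 0.0: margin already satisfied
print(margin_ranking_loss(1.0, 5.0))             # 8.0: negative is too close
print(margin_ranking_loss(1.0, 5.0, gamma=2.0))  # 0.0: smaller margin is easier to meet
```

A larger gamma keeps penalizing negatives that are already fairly far away, which pushes the two groups further apart.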
threshold (Prediction Threshold)
Default: 0.5
Range: 0.0 - 1.0
Description: Minimum score threshold for considering a drug-disease association as positive.
Recommendation: Use 0.5-0.7 for balanced precision/recall, higher for precision, lower for recall.
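The effect of the threshold can be illustrated with a small filter over scored candidates. The sketch assumes scores lie in the 0.0 to 1.0 range and that the boundary is inclusive. The drug names and scores are made up.

```python
def filter_by_threshold(scores, threshold=0.5):
    """Keep (name, score) pairs scoring at least the threshold, best first."""
    kept = [(name, score) for name, score in scores.items() if score >= threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)

predictions = {"drug_a": 0.91, "drug_b": 0.42, "drug_c": 0.77, "drug_d": 0.5}
print(filter_by_threshold(predictions))       # [('drug_a', 0.91), ('drug_c', 0.77), ('drug_d', 0.5)]
print(filter_by_threshold(predictions, 0.7))  # [('drug_a', 0.91), ('drug_c', 0.77)]
```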
top_n_diseases
Default: 100
Range: 10 - 1000
Description: Number of similar diseases to consider for drug prediction.
Recommendation: Use 50-100 for most cases, increase for rare diseases.
top_n_drugs
Default: 1000
Range: 100 - 10000
Description: Maximum number of drugs to analyze.
Recommendation: Use 500-1000 for focused analysis, higher for comprehensive screening.
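Both top_n parameters cap a ranked list. A minimal sketch of that kind of truncation, assuming candidates and their scores are held in a dictionary (the helper and the data are hypothetical):

```python
import heapq

def top_n(scores, n):
    """Return the n highest-scoring (name, score) pairs, best first."""
    return heapq.nlargest(n, scores.items(), key=lambda pair: pair[1])

drug_scores = {"drug_a": 0.91, "drug_b": 0.42, "drug_c": 0.77, "drug_d": 0.65}
print(top_n(drug_scores, 2))  # [('drug_a', 0.91), ('drug_c', 0.77)]
```

`heapq.nlargest` avoids sorting the full candidate list, which helps when n is small relative to the number of candidates.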
Parameter Optimization
Start with Defaults
Begin with the default parameter values to establish a performance baseline.
Adjust Threshold
Fine-tune the prediction threshold based on your precision/recall requirements.
Optimize Gamma
Experiment with different gamma values to find the optimal margin for your data.
Validate Results
Use cross-validation or holdout sets to validate parameter choices.
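One way to carry out the threshold adjustment and validation steps is to compute precision and recall on a holdout set at several thresholds. The sketch below assumes a dictionary of predicted scores and a set of known positive drugs. All names and numbers are made up.

```python
def precision_recall(scores, known_positives, threshold):
    """Precision and recall of the drugs scoring at least the threshold."""
    predicted = {name for name, score in scores.items() if score >= threshold}
    true_pos = len(predicted & known_positives)
    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(known_positives) if known_positives else 0.0
    return precision, recall

# Hypothetical holdout data for illustration.
holdout_scores = {"drug_a": 0.9, "drug_b": 0.75, "drug_c": 0.6, "drug_d": 0.3}
holdout_positives = {"drug_a", "drug_c"}

for threshold in (0.5, 0.7, 0.8):
    p, r = precision_recall(holdout_scores, holdout_positives, threshold)
    print(f"threshold={threshold}: precision={p:.2f} recall={r:.2f}")
```

In this toy data, raising the threshold from 0.5 to 0.8 lifts precision from 0.67 to 1.00 while recall drops from 1.00 to 0.50, which is the trade-off the threshold controls.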
Performance Optimization
Computational Requirements
Memory Usage
- Small datasets: 4-8 GB RAM
- Medium datasets: 8-16 GB RAM
- Large datasets: 16+ GB RAM
Processing Time
- 100 drugs: 5-10 minutes
- 500 drugs: 15-30 minutes
- 1000+ drugs: 30-60 minutes
Storage
- Model files: 2-5 GB
- Results: 100-500 MB
- Total: 3-6 GB
Performance Tips
Use SSD Storage
Solid-state drives significantly improve I/O performance for large model files.
Optimize Parameters
Start with smaller values for size-related parameters such as top_n_diseases and top_n_drugs, then scale up as your computational resources allow.
Batch Processing
Process multiple diseases in batches to optimize memory usage and processing time.
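Batching can be as simple as slicing the list of disease IDs into fixed-size chunks. The helper below is a hypothetical sketch. The IDs are placeholders and the per-batch analysis call is left as a comment.

```python
def batched(items, batch_size):
    """Yield consecutive slices of items with at most batch_size elements each."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

# Placeholder IDs; substitute real disease identifiers such as MONDO:0004979.
disease_ids = ["DISEASE_1", "DISEASE_2", "DISEASE_3", "DISEASE_4", "DISEASE_5"]

for batch in batched(disease_ids, batch_size=2):
    # Run the analysis for each disease in the batch here, then release
    # per-batch results before continuing so memory use stays bounded.
    print(batch)
```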
Monitor Resources
Use system monitoring tools to track memory and CPU usage during analysis.
Using Custom Models
Custom Model Integration
BioMedGPS Explainer supports custom KGE models that follow the specified data format. Here's how to integrate your own models:
Prepare Your Data
Ensure your model outputs follow the required TSV format for entity and relation embeddings.
Organize Files
Place your model files in the appropriate data directory structure.
Validate Format
Run the data validation script to ensure compatibility.
Test Integration
Run a small test analysis to verify your model works correctly.
Model Validation
Before using a custom model, validate it using the built-in validation tools:
# Validate custom model files
python3 examples/run_data_validation.py --model-dir path/to/your/model
# Test with small dataset
python3 examples/run_full_example.py --disease MONDO:0004979 --top-n-drugs 10
Best Practices for Custom Models
- Consistent Format: Ensure all files follow the exact TSV format specifications
- Entity Coverage: Make sure your model covers all entities in your knowledge graph
- Embedding Quality: Validate embedding quality using similarity metrics
- Documentation: Document your model's training process and parameters
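The entity coverage and embedding quality checks can be approximated with a few lines of standard-library Python. This sketch is independent of the built-in validation script. The helper names are assumptions and the vectors are made up.

```python
import math

def missing_entities(kg_rows, embedding_ids):
    """Return knowledge-graph entity IDs that have no embedding, sorted."""
    kg_ids = set()
    for row in kg_rows:
        kg_ids.add(row["source_id"])
        kg_ids.add(row["target_id"])
    return sorted(kg_ids - set(embedding_ids))

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

# Toy data following the column names of the required files.
kg_rows = [
    {"source_id": "CHEBI:12345", "target_id": "MONDO:0004979"},
    {"source_id": "HGNC:1234", "target_id": "MONDO:0004979"},
]
embeddings = {
    "MONDO:0004979": [0.1, 0.2, 0.3, 0.4],
    "CHEBI:12345": [0.5, 0.6, 0.7, 0.8],
}

print(missing_entities(kg_rows, embeddings))  # ['HGNC:1234']
print(cosine_similarity(embeddings["MONDO:0004979"], embeddings["CHEBI:12345"]))
```

An empty list from the coverage check means every knowledge graph entity has an embedding. Spot-checking cosine similarity between entities expected to be related is one simple way to sanity-check embedding quality.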