API Documentation - BioMedGPS Explainer

Overview

This document describes the API of the BioMedGPS Explainer toolkit: its core classes and their methods, the command line interface, and the input and output data formats.

Core Classes

DrugDiseaseCore

The main class responsible for drug discovery analysis using knowledge graph embeddings.

Constructor

DrugDiseaseCore()

DrugDiseaseCore Methods

run_full_pipeline()

run_full_pipeline(
    disease_id: str,
    entity_file: Optional[str] = None,
    knowledge_graph: Optional[str] = None,
    entity_embeddings: Optional[str] = None,
    relation_embeddings: Optional[str] = None,
    output_dir: Optional[str] = None,
    model: str = 'TransE_l2',
    top_n_diseases: int = 100,
    gamma: float = 12.0,
    threshold: float = 0.5,
    relation_type: str = 'GNBR::T::Compound:Disease',
    top_n_drugs: int = 1000
) -> None

Description: Runs the complete analysis pipeline in a single call and writes all analysis results to annotated_drugs.xlsx.

Parameters:

disease_id: Disease identifier (e.g., "MONDO:0004979")
entity_file: Path to entity annotations file (optional, automatically downloaded from wandb if not specified)
knowledge_graph: Path to knowledge graph file (optional, automatically downloaded from wandb if not specified)
entity_embeddings: Path to entity embeddings file (optional, automatically downloaded from wandb if not specified)
relation_embeddings: Path to relation embeddings file (optional, automatically downloaded from wandb if not specified)
output_dir: Output directory for results (optional)
model: KGE model type (default: 'TransE_l2')
top_n_diseases: Number of similar diseases to consider (default: 100)
gamma: Margin parameter of the KGE model (default: 12.0)
threshold: Drug filtering threshold (default: 0.5)
relation_type: Relation type for drug-disease associations (default: 'GNBR::T::Compound:Disease')
top_n_drugs: Number of drugs to analyze (default: 1000)

Returns:

None

Example:

core = DrugDiseaseCore()
core.run_full_pipeline(
    disease_id="MONDO:0004979",
    output_dir="results/",
    model='TransE_l2',
    top_n_diseases=50,
    gamma=12.0,
    threshold=0.5,
    top_n_drugs=100
)
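
For orientation on the model and gamma parameters: in the usual TransE formulation with an L2 norm (the variant DGL-KE calls TransE_l2), a triple (head, relation, tail) is scored as gamma minus the L2 distance between head + relation and tail. The sketch below illustrates that formula in plain Python. Whether the toolkit scores candidates exactly this way is an assumption, and the function name and toy vectors are hypothetical.

```python
import math
from typing import Sequence


def transe_l2_score(head: Sequence[float], relation: Sequence[float],
                    tail: Sequence[float], gamma: float = 12.0) -> float:
    """Return gamma - ||head + relation - tail||_2 (higher means more plausible)."""
    if not (len(head) == len(relation) == len(tail)):
        raise ValueError("head, relation and tail must have the same dimension")
    distance = math.sqrt(
        sum((h + r - t) ** 2 for h, r, t in zip(head, relation, tail))
    )
    return gamma - distance


# Toy 2-dimensional vectors, made up for illustration only.
compound = [1.0, 2.0]
treats = [2.0, 2.0]
disease = [0.0, 0.0]
score = transe_l2_score(compound, treats, disease)
print(score)
```

Under this formulation a larger gamma shifts every score upward by the same amount, so it affects any fixed cutoff applied to the scores but not the ranking of candidates.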

predict_drugs()

predict_drugs(
    disease_id: str,
    entity_file: str,
    knowledge_graph: str,
    entity_embeddings: str,
    relation_embeddings: str,
    model: str,
    top_n_diseases: int,
    gamma: float,
    threshold: float,
    relation_type: str,
    output_file: str
) -> None

Description: Generate potential drug list using KGE models.

Parameters:

disease_id: Target disease identifier
entity_file: Path to entity annotations file
knowledge_graph: Path to knowledge graph file
entity_embeddings: Path to entity embeddings file
relation_embeddings: Path to relation embeddings file
model: KGE model type
top_n_diseases: Number of similar diseases
gamma: Margin parameter
threshold: Prediction threshold
relation_type: Relation type for drug-disease associations
output_file: Output Excel file path

get_disease_name()

get_disease_name(disease_id: str, entity_file: str) -> str

Description: Get disease name from disease ID.

Parameters:

disease_id: Disease identifier
entity_file: Path to entity annotations file

Returns:

Disease name as string

get_drug_names()

get_drug_names(drug_ids: List[str], entity_file: str) -> List[str]

Description: Get drug names from drug IDs.

Parameters:

drug_ids: List of drug identifiers
entity_file: Path to entity annotations file

Returns:

List of drug names

DrugFilter

Class for filtering drug candidates based on various criteria.

Constructor

DrugFilter()

filter_drugs()

filter_drugs(
    input_file: str,
    expression: str,
    output_file: str,
    sheet_names: Tuple[str, str] = ("annotated_drugs", "filtered_drugs")
) -> None

Description: Filter drugs based on logical expressions.

Parameters:

input_file: Input Excel file path
expression: Filter expression (e.g., "score > 0.6 and existing == False")
output_file: Output Excel file path
sheet_names: Tuple of (input_sheet, output_sheet) names

Example:

drug_filter = DrugFilter()
drug_filter.filter_drugs(
    input_file="results/annotated_drugs.xlsx",
    expression="score > 0.7 and num_of_shared_genes_in_path >= 1",
    output_file="results/filtered_drugs.xlsx"
)

Supported Filter Expressions:

Numerical comparisons: >, <, >=, <=, ==, !=
Logical operators: and, or, not
Boolean fields: existing, is_key_gene
String matching and pattern matching
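
To make the expression grammar concrete, the following sketch evaluates an expression of the kind listed above against a single row represented as a dict. It is not the toolkit's implementation: the helper name row_matches and the AST whitelist are assumptions, and only numerical comparisons and and/or/not are covered, not string or pattern matching.

```python
import ast
from typing import Any, Dict

# Node types permitted in an expression; anything else is rejected.
_ALLOWED_NODES = (
    ast.Expression, ast.BoolOp, ast.And, ast.Or, ast.UnaryOp, ast.Not,
    ast.Compare, ast.Gt, ast.Lt, ast.GtE, ast.LtE, ast.Eq, ast.NotEq,
    ast.Name, ast.Load, ast.Constant,
)


def row_matches(expression: str, row: Dict[str, Any]) -> bool:
    """Evaluate a restricted boolean expression against one row of column values."""
    tree = ast.parse(expression, mode="eval")
    for node in ast.walk(tree):
        if not isinstance(node, _ALLOWED_NODES):
            raise ValueError(f"unsupported syntax: {type(node).__name__}")
        if isinstance(node, ast.Name) and node.id not in row:
            raise ValueError(f"unknown column: {node.id}")
    # Column names resolve from the row; builtins are disabled.
    code = compile(tree, "<filter>", "eval")
    return bool(eval(code, {"__builtins__": {}}, dict(row)))


# Hypothetical rows using column names that appear in this document.
rows = [
    {"score": 0.82, "existing": False, "num_of_shared_genes_in_path": 2},
    {"score": 0.65, "existing": False, "num_of_shared_genes_in_path": 0},
    {"score": 0.91, "existing": True, "num_of_shared_genes_in_path": 3},
]
expression = "score > 0.7 and num_of_shared_genes_in_path >= 1"
kept = [row for row in rows if row_matches(expression, row)]
print(len(kept))
```

Validating the parsed tree before evaluation means that function calls, attribute access and other syntax outside the listed operators are rejected with a ValueError rather than executed.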

Visualizer

Class for generating comprehensive visualizations and reports.

Constructor

Visualizer(disease_id: str, disease_name: str, embed_images: bool = True)

Parameters:

disease_id: Disease identifier
disease_name: Disease name
embed_images: Whether to embed images in HTML report (default: True)

create_visualization()

create_visualization(
    data_file: str,
    viz_type: str,
    output_file: str,
    sheet_names: Tuple[str, str] = ("annotated_drugs", "filtered_drugs")
) -> str

Description: Generate specific visualization chart.

Parameters:

data_file: Input data file path
viz_type: Visualization type (see supported types below)
output_file: Output file path
sheet_names: Tuple of sheet names for input data

Supported Visualization Types:

score_distribution - Score distribution histogram
score_boxplot - Score distribution by existing status
disease_similarity - Disease similarity heatmap
network_centrality - Network centrality analysis
shared_genes_pathways - Gene and pathway overlap
drug_similarity_network - Drug similarity network
shared_gene_count - Shared gene count distribution
score_vs_degree - Score vs network degree

generate_report()

generate_report(
    data_file: str,
    output_file: str,
    title: str = "Drug Discovery Analysis Report"
) -> str

Description: Generate comprehensive HTML report with all visualizations.

Parameters:

data_file: Input data file path
output_file: Output HTML file path
title: Report title

Returns:

Path to generated HTML report

Command Line Interface

Main Commands

Run Analysis

biomedgps-explainer run [OPTIONS]

Options:

--disease-id: Disease ID (required)
--output-dir: Output directory (required)
--model-run-id: Model run ID (default: 6vlvgvfq)
--top-n-diseases: Number of similar diseases (default: 100)
--threshold: Drug filtering threshold (default: 0.5)
--relation-type: Relation type (default: GNBR::T::Compound:Disease)
--top-n-drugs: Number of drugs to analyze (default: 1000)

Filter Drugs

biomedgps-explainer filter [OPTIONS]

Options:

--input-file: Input Excel file (required)
--expression: Filter expression (required)
--output-file: Output Excel file (required)

Generate Visualizations

biomedgps-explainer visualize [OPTIONS]

Options:

--input-file: Input Excel file (required)
--output-dir: Output directory (required)
--viz-type: Visualization type (default: all)
--disease-id: Disease ID (required)
--disease-name: Disease name (required)

Run Complete Pipeline

biomedgps-explainer pipeline [OPTIONS]

Description: Execute complete workflow (run → filter → visualize) in a single command.

Options:

--disease-id: Disease ID (required)
--model-run-id: Model run ID (default: 6vlvgvfq)
--filter-expression: Filter expression (optional)
--output-dir: Output directory (default: results)
--top-n-diseases: Number of similar diseases (default: 100)
--threshold: Drug filtering threshold (default: 0.5)
--relation-type: Relation type (default: GNBR::T::Compound:Disease)
--top-n-drugs: Number of drugs to interpret (default: 100)
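
Filter expressions contain spaces and comparison operators, so they need careful quoting when the pipeline command is launched from a script. The sketch below assembles the argument list in Python using only options listed above. The helper name build_pipeline_command is hypothetical.

```python
import shlex
from typing import List, Optional


def build_pipeline_command(disease_id: str, output_dir: str = "results",
                           filter_expression: Optional[str] = None,
                           top_n_drugs: Optional[int] = None) -> List[str]:
    """Assemble argv for `biomedgps-explainer pipeline` from the documented options."""
    argv = [
        "biomedgps-explainer", "pipeline",
        "--disease-id", disease_id,
        "--output-dir", output_dir,
    ]
    if filter_expression is not None:
        argv += ["--filter-expression", filter_expression]
    if top_n_drugs is not None:
        argv += ["--top-n-drugs", str(top_n_drugs)]
    return argv


argv = build_pipeline_command(
    "MONDO:0004979",
    filter_expression="score > 0.7 and existing == False",
)
# shlex.join quotes the expression so it reaches a POSIX shell as one argument.
print(shlex.join(argv))
```

Passing the list form to subprocess.run(argv, check=True) avoids shell quoting altogether; the joined string is useful mainly for logging or for pasting into a terminal.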

Data Structures

Input Data Format

Entity Annotations (annotated_entities.tsv)

id             label     name
MONDO:0004979  Disease   asthma
CHEBI:12345    Compound  aspirin
HGNC:1234      Gene      TNF
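
As an illustration of how this file can be consumed, the sketch below builds an id-to-name index with the standard csv module. It assumes the columns are separated by tabs, as the .tsv extension suggests, and it is not a description of how get_disease_name or get_drug_names work internally. The function name load_entity_names is hypothetical.

```python
import csv
import io
from typing import Dict, TextIO

# Sample rows copied from the example above, written with tab separators.
SAMPLE_ENTITIES = (
    "id\tlabel\tname\n"
    "MONDO:0004979\tDisease\tasthma\n"
    "CHEBI:12345\tCompound\taspirin\n"
    "HGNC:1234\tGene\tTNF\n"
)


def load_entity_names(handle: TextIO) -> Dict[str, str]:
    """Build an id -> name index from a tab-separated entity annotations file."""
    reader = csv.DictReader(handle, delimiter="\t")
    return {row["id"]: row["name"] for row in reader}


names = load_entity_names(io.StringIO(SAMPLE_ENTITIES))
print(names["MONDO:0004979"])
```

For a file on disk, open it with open(path, newline="", encoding="utf-8") and pass the handle to the same function.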

Knowledge Graph (knowledge_graph.tsv)

source_id    source_type  source_name  target_id      target_type  target_name  relation_type
CHEBI:12345  Compound     aspirin      MONDO:0004979  Disease      asthma       GNBR::T::Compound:Disease
HGNC:1234    Gene         TNF          MONDO:0004979  Disease      asthma       GNBR::T::Gene:Disease
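
The relation_type column is what the relation_type parameter selects on. As a minimal sketch, assuming tab-separated columns with the header shown above, the edges for one relation type can be extracted as follows. The function name edges_with_relation is hypothetical.

```python
import csv
import io
from typing import Iterator, TextIO, Tuple

# Sample rows copied from the example above, written with tab separators.
SAMPLE_KG = (
    "source_id\tsource_type\tsource_name\ttarget_id\ttarget_type\ttarget_name\trelation_type\n"
    "CHEBI:12345\tCompound\taspirin\tMONDO:0004979\tDisease\tasthma\tGNBR::T::Compound:Disease\n"
    "HGNC:1234\tGene\tTNF\tMONDO:0004979\tDisease\tasthma\tGNBR::T::Gene:Disease\n"
)


def edges_with_relation(
    handle: TextIO, relation_type: str = "GNBR::T::Compound:Disease"
) -> Iterator[Tuple[str, str]]:
    """Yield (source_id, target_id) for every edge with the given relation type."""
    for row in csv.DictReader(handle, delimiter="\t"):
        if row["relation_type"] == relation_type:
            yield row["source_id"], row["target_id"]


pairs = list(edges_with_relation(io.StringIO(SAMPLE_KG)))
print(pairs)
```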

Entity Embeddings (entity_embeddings.tsv)

entity_id      entity_type  embedding
MONDO:0004979  Disease      0.1|0.2|0.3|0.4|...
CHEBI:12345    Compound     0.5|0.6|0.7|0.8|...
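
The embedding column packs the vector into a single field with | as the separator. A minimal parsing sketch, assuming tab-separated columns and using only the four components visible in the truncated example above, could look like this. The function name parse_embedding_row is hypothetical.

```python
from typing import List, Tuple


def parse_embedding_row(line: str) -> Tuple[str, str, List[float]]:
    """Split one data row into (entity_id, entity_type, vector of floats)."""
    entity_id, entity_type, embedding = line.rstrip("\n").split("\t")
    vector = [float(component) for component in embedding.split("|")]
    return entity_id, entity_type, vector


# Real rows carry the full vector; this one is shortened for illustration.
row = parse_embedding_row("MONDO:0004979\tDisease\t0.1|0.2|0.3|0.4\n")
print(row)
```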

Output Data Format

Annotated Drugs (annotated_drugs.xlsx)

Excel file with multiple sheets containing:

annotated_drugs: Main results with all annotations
predicted_drugs: Initial drug predictions
shared_genes_pathways: Gene and pathway overlap analysis
shared_diseases: Disease similarity analysis
network_annotations: Network centrality features

Error Handling

Common Exceptions

FileNotFoundError

Cause: Required data files not found

Solution: Verify file paths and run data validation

ValueError

Cause: Invalid parameters or data format

Solution: Check parameter values and data format

MemoryError

Cause: Insufficient memory for large datasets

Solution: Reduce dataset size or increase system memory

Error Handling Best Practices

Always validate data before running analysis
Use try/except blocks for file operations
Check system resources before large computations
Implement proper logging for debugging
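
The practices above can be combined in a small wrapper that checks input files first, runs one analysis step, and logs the exceptions listed in this section. This is a sketch using only the standard library. The names missing_files and run_safely and the logger name are hypothetical.

```python
import logging
from pathlib import Path
from typing import Callable, Iterable, List

logger = logging.getLogger("biomedgps_explainer_example")


def missing_files(paths: Iterable[str]) -> List[str]:
    """Return the subset of paths that do not exist as regular files."""
    return [path for path in paths if not Path(path).is_file()]


def run_safely(step: Callable[[], None], required_files: Iterable[str]) -> bool:
    """Validate inputs, run one step, and log failures; return True on success."""
    missing = missing_files(required_files)
    if missing:
        logger.error("Missing input files: %s", ", ".join(missing))
        return False
    try:
        step()
    except (FileNotFoundError, ValueError) as exc:
        logger.error("Analysis failed: %s", exc)
        return False
    except MemoryError:
        logger.error("Out of memory: reduce dataset size or increase system memory")
        return False
    return True


ok = run_safely(lambda: None, required_files=[])
print(ok)
```

A pipeline call can be wrapped the same way, for example by passing a lambda that invokes run_full_pipeline together with the paths of any locally supplied data files.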