Overview

This document is the API reference for the BioMedGPS Explainer toolkit. It covers the core classes and their methods, the command line interface, the input and output data formats, and error handling.

Core Classes

DrugDiseaseCore

The main class responsible for drug discovery analysis using knowledge graph embeddings.

Constructor

DrugDiseaseCore()

DrugDiseaseCore Methods

run_full_pipeline()

run_full_pipeline(
    disease_id: str,
    entity_file: Optional[str] = None,
    knowledge_graph: Optional[str] = None,
    entity_embeddings: Optional[str] = None,
    relation_embeddings: Optional[str] = None,
    output_dir: Optional[str] = None,
    model: str = 'TransE_l2',
    top_n_diseases: int = 100,
    gamma: float = 12.0,
    threshold: float = 0.5,
    relation_type: str = 'GNBR::T::Compound:Disease',
    top_n_drugs: int = 1000
) -> None

Description: Runs the complete analysis pipeline in a single call and writes annotated_drugs.xlsx, which contains all analysis results.

Parameters:

  • disease_id: Disease identifier (e.g., "MONDO:0004979")
  • entity_file: Path to entity annotations file (optional, automatically downloaded from wandb if not specified)
  • knowledge_graph: Path to knowledge graph file (optional, automatically downloaded from wandb if not specified)
  • entity_embeddings: Path to entity embeddings file (optional, automatically downloaded from wandb if not specified)
  • relation_embeddings: Path to relation embeddings file (optional, automatically downloaded from wandb if not specified)
  • output_dir: Output directory for results
  • model: KGE model type (default: 'TransE_l2')
  • top_n_diseases: Number of similar diseases to consider (default: 100)
  • gamma: Margin parameter for KGE training (default: 12.0)
  • threshold: Drug filtering threshold (default: 0.5)
  • relation_type: Relation type for drug-disease associations (default: 'GNBR::T::Compound:Disease')
  • top_n_drugs: Number of drugs to analyze (default: 1000)

Returns:

None

Example:

core = DrugDiseaseCore()
core.run_full_pipeline(
    disease_id="MONDO:0004979",
    output_dir="results/",
    model='TransE_l2',
    top_n_diseases=50,
    gamma=12.0,
    threshold=0.5,
    top_n_drugs=100
)

predict_drugs()

predict_drugs(
    disease_id: str,
    entity_file: str,
    knowledge_graph: str,
    entity_embeddings: str,
    relation_embeddings: str,
    model: str,
    top_n_diseases: int,
    gamma: float,
    threshold: float,
    relation_type: str,
    output_file: str
) -> None

Description: Generate potential drug list using KGE models.

Parameters:

  • disease_id: Target disease identifier
  • entity_file: Path to entity annotations file
  • knowledge_graph: Path to knowledge graph file
  • entity_embeddings: Path to entity embeddings file
  • relation_embeddings: Path to relation embeddings file
  • model: KGE model type
  • top_n_diseases: Number of similar diseases
  • gamma: Margin parameter
  • threshold: Prediction threshold
  • relation_type: Relation type for drug-disease associations
  • output_file: Output Excel file path
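
To make the role of gamma more concrete, here is a minimal sketch of how a TransE-style model with an L2 norm commonly scores a (compound, relation, disease) triple: score = gamma - ||head + relation - tail||. This assumes that 'TransE_l2' follows that usual formulation; it is not taken from the toolkit's source, and the function name, the toy vectors, and the drug labels are all hypothetical.

```python
import math

def transe_l2_score(head, relation, tail, gamma=12.0):
    """Return gamma - ||head + relation - tail||_2 (higher means more plausible)."""
    distance = math.sqrt(
        sum((h + r - t) ** 2 for h, r, t in zip(head, relation, tail))
    )
    return gamma - distance

# Toy 2-dimensional embeddings, made up for illustration only.
treats = [1.0, 0.0]     # stands in for the relation embedding
disease = [1.0, 1.0]    # stands in for the target disease embedding
drugs = {
    "drug_a": [0.0, 1.0],  # drug_a + treats lands exactly on the disease vector
    "drug_b": [3.0, 5.0],  # ends up at distance 5 from the disease vector
}

# Rank candidate drugs from most to least plausible.
ranked = sorted(
    drugs,
    key=lambda name: transe_l2_score(drugs[name], treats, disease),
    reverse=True,
)
```

Under this formulation a larger gamma shifts every score upward by the same amount, so it changes absolute scores but not the ranking. How threshold is applied to the scores is not specified here and is left out of the sketch.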

get_disease_name()

get_disease_name(disease_id: str, entity_file: str) -> str

Description: Get disease name from disease ID.

Parameters:

  • disease_id: Disease identifier
  • entity_file: Path to entity annotations file

Returns:

Disease name as string

get_drug_names()

get_drug_names(drug_ids: List[str], entity_file: str) -> List[str]

Description: Get drug names from drug IDs.

Parameters:

  • drug_ids: List of drug identifiers
  • entity_file: Path to entity annotations file

Returns:

List of drug names
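
The two lookup methods above can be pictured as a dictionary lookup over the entity annotations file. The sketch below is an illustration only, assuming a tab-separated file with the id, label, and name columns shown under Input Data Format; load_names and lookup_names are hypothetical helpers, not part of the toolkit, and the fallback for unknown IDs is an arbitrary choice.

```python
import csv
import io

# In-memory stand-in for annotated_entities.tsv (tab-separated, assumed layout).
SAMPLE_TSV = (
    "id\tlabel\tname\n"
    "MONDO:0004979\tDisease\tasthma\n"
    "CHEBI:12345\tCompound\taspirin\n"
)

def load_names(handle):
    """Build an {entity id: entity name} mapping from an open TSV file."""
    reader = csv.DictReader(handle, delimiter="\t")
    return {row["id"]: row["name"] for row in reader}

def lookup_names(ids, names):
    """Return names in the same order as ids; unknown ids fall back to the id."""
    return [names.get(entity_id, entity_id) for entity_id in ids]

names = load_names(io.StringIO(SAMPLE_TSV))
drug_names = lookup_names(["CHEBI:12345", "CHEBI:99999"], names)
```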

DrugFilter

Class for filtering drug candidates based on various criteria.

Constructor

DrugFilter()

filter_drugs()

filter_drugs(
    input_file: str,
    expression: str,
    output_file: str,
    sheet_names: Tuple[str, str] = ("annotated_drugs", "filtered_drugs")
) -> None

Description: Filter drugs based on logical expressions.

Parameters:

  • input_file: Input Excel file path
  • expression: Filter expression (e.g., "score > 0.6 and existing == False")
  • output_file: Output Excel file path
  • sheet_names: Tuple of (input_sheet, output_sheet) names

Example:

# Use a name other than `filter` so the Python built-in is not shadowed.
drug_filter = DrugFilter()
drug_filter.filter_drugs(
    input_file="results/annotated_drugs.xlsx",
    expression="score > 0.7 and num_of_shared_genes_in_path >= 1",
    output_file="results/filtered_drugs.xlsx"
)

Supported Filter Expressions:

  • Numerical comparisons: >, <, >=, <=, ==, !=
  • Logical operators: and, or, not
  • Boolean fields: existing, is_key_gene
  • String matching and pattern matching
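
To illustrate how expressions built from these comparison and logical operators select rows, the sketch below evaluates an expression against plain dictionaries. It is not the toolkit's implementation: compile_filter and apply_filter are hypothetical helpers, the "drug" column is invented for the example, and string or pattern matching is not covered.

```python
import ast

# Only comparisons, and/or/not, column names, and literals are accepted.
ALLOWED_NODES = (
    ast.Expression, ast.BoolOp, ast.And, ast.Or, ast.UnaryOp, ast.Not,
    ast.Compare, ast.Gt, ast.Lt, ast.GtE, ast.LtE, ast.Eq, ast.NotEq,
    ast.Name, ast.Load, ast.Constant,
)

def compile_filter(expression):
    """Parse the expression and reject any syntax outside the whitelist."""
    tree = ast.parse(expression, mode="eval")
    for node in ast.walk(tree):
        if not isinstance(node, ALLOWED_NODES):
            raise ValueError(f"Unsupported syntax in filter: {type(node).__name__}")
    return compile(tree, "<filter>", "eval")

def apply_filter(rows, expression):
    """Keep the rows (dicts keyed by column name) for which the expression is true."""
    code = compile_filter(expression)
    return [row for row in rows if eval(code, {"__builtins__": {}}, dict(row))]

rows = [
    {"drug": "drug_a", "score": 0.82, "existing": False, "num_of_shared_genes_in_path": 2},
    {"drug": "drug_b", "score": 0.91, "existing": True, "num_of_shared_genes_in_path": 0},
    {"drug": "drug_c", "score": 0.55, "existing": False, "num_of_shared_genes_in_path": 3},
]

kept = apply_filter(rows, "score > 0.7 and num_of_shared_genes_in_path >= 1")
```

Checking the parsed expression against a whitelist before evaluating it keeps function calls and attribute access out of user-supplied filters.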

Visualizer

Class for generating comprehensive visualizations and reports.

Constructor

Visualizer(disease_id: str, disease_name: str, embed_images: bool = True)

Parameters:

  • disease_id: Disease identifier
  • disease_name: Disease name
  • embed_images: Whether to embed images in HTML report (default: True)

create_visualization()

create_visualization(
    data_file: str,
    viz_type: str,
    output_file: str,
    sheet_names: Tuple[str, str] = ("annotated_drugs", "filtered_drugs")
) -> str

Description: Generate specific visualization chart.

Parameters:

  • data_file: Input data file path
  • viz_type: Visualization type (see supported types below)
  • output_file: Output file path
  • sheet_names: Tuple of sheet names for input data

Supported Visualization Types:

  • score_distribution - Score distribution histogram
  • score_boxplot - Score distribution by existing status
  • disease_similarity - Disease similarity heatmap
  • network_centrality - Network centrality analysis
  • shared_genes_pathways - Gene and pathway overlap
  • drug_similarity_network - Drug similarity network
  • shared_gene_count - Shared gene count distribution
  • score_vs_degree - Score vs network degree
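
One way to produce every chart listed above is to loop over the type names and call create_visualization() once per type. The sketch below assumes an already constructed Visualizer is passed in; render_all is a hypothetical helper, and the ".png" extension and the one-file-per-type naming are assumptions rather than documented behavior.

```python
import os

# The visualization type names documented above.
VIZ_TYPES = [
    "score_distribution",
    "score_boxplot",
    "disease_similarity",
    "network_centrality",
    "shared_genes_pathways",
    "drug_similarity_network",
    "shared_gene_count",
    "score_vs_degree",
]

def render_all(visualizer, data_file, output_dir, extension=".png"):
    """Call create_visualization() for each type and collect the returned strings."""
    results = {}
    for viz_type in VIZ_TYPES:
        output_file = os.path.join(output_dir, viz_type + extension)
        results[viz_type] = visualizer.create_visualization(
            data_file=data_file,
            viz_type=viz_type,
            output_file=output_file,
        )
    return results
```

A call might look like render_all(Visualizer("MONDO:0004979", "asthma"), "results/annotated_drugs.xlsx", "results/plots"), with sheet_names left at its default.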

generate_report()

generate_report(
    data_file: str,
    output_file: str,
    title: str = "Drug Discovery Analysis Report"
) -> str

Description: Generate comprehensive HTML report with all visualizations.

Parameters:

  • data_file: Input data file path
  • output_file: Output HTML file path
  • title: Report title

Returns:

Path to generated HTML report

Command Line Interface

Main Commands

Run Analysis

biomedgps-explainer run [OPTIONS]

Options:
  • --disease-id: Disease ID (required)
  • --output-dir: Output directory (required)
  • --model-run-id: Model run ID (default: 6vlvgvfq)
  • --top-n-diseases: Number of similar diseases (default: 100)
  • --threshold: Drug filtering threshold (default: 0.5)
  • --relation-type: Relation type (default: GNBR::T::Compound:Disease)
  • --top-n-drugs: Number of drugs to analyze (default: 1000)

Filter Drugs

biomedgps-explainer filter [OPTIONS]

Options:
  • --input-file: Input Excel file (required)
  • --expression: Filter expression (required)
  • --output-file: Output Excel file (required)

Generate Visualizations

biomedgps-explainer visualize [OPTIONS]

Options:
  • --input-file: Input Excel file (required)
  • --output-dir: Output directory (required)
  • --viz-type: Visualization type (default: all)
  • --disease-id: Disease ID (required)
  • --disease-name: Disease name (required)

Run Complete Pipeline

biomedgps-explainer pipeline [OPTIONS]

Description: Execute complete workflow (run → filter → visualize) in a single command.

Options:
  • --disease-id: Disease ID (required)
  • --model-run-id: Model run ID (default: 6vlvgvfq)
  • --filter-expression: Filter expression (optional)
  • --output-dir: Output directory (default: results)
  • --top-n-diseases: Number of similar diseases (default: 100)
  • --threshold: Drug filtering threshold (default: 0.5)
  • --relation-type: Relation type (default: GNBR::T::Compound:Disease)
  • --top-n-drugs: Number of drugs to interpret (default: 100)
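
When the pipeline command is launched from a script, building the argument list programmatically avoids quoting mistakes in the filter expression. The sketch below uses only flags documented above; build_pipeline_command is a hypothetical helper, and the sketch stops short of actually running the command.

```python
import shlex

def build_pipeline_command(disease_id, output_dir="results",
                           filter_expression=None, top_n_drugs=None):
    """Return the argv list for `biomedgps-explainer pipeline`."""
    argv = [
        "biomedgps-explainer", "pipeline",
        "--disease-id", disease_id,
        "--output-dir", output_dir,
    ]
    # Optional flags are added only when a value is supplied.
    if filter_expression is not None:
        argv += ["--filter-expression", filter_expression]
    if top_n_drugs is not None:
        argv += ["--top-n-drugs", str(top_n_drugs)]
    return argv

argv = build_pipeline_command(
    "MONDO:0004979",
    filter_expression="score > 0.6 and existing == False",
)
# Shell-quoted form, convenient for logging or pasting into a terminal.
command_line = shlex.join(argv)
```

The list can be passed to subprocess.run(argv, check=True), which hands each argument to the program unchanged, so the spaces and operators in the expression need no extra escaping.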

Data Structures

Input Data Format

Entity Annotations (annotated_entities.tsv)

id             label     name
MONDO:0004979  Disease   asthma
CHEBI:12345    Compound  aspirin
HGNC:1234      Gene      TNF

Knowledge Graph (knowledge_graph.tsv)

source_id    source_type  source_name  target_id      target_type  target_name  relation_type
CHEBI:12345  Compound     aspirin      MONDO:0004979  Disease      asthma       GNBR::T::Compound:Disease
HGNC:1234    Gene         TNF          MONDO:0004979  Disease      asthma       GNBR::T::Gene:Disease

Entity Embeddings (entity_embeddings.tsv)

entity_id      entity_type  embedding
MONDO:0004979  Disease      0.1|0.2|0.3|0.4|...
CHEBI:12345    Compound     0.5|0.6|0.7|0.8|...
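
The embedding column packs the whole vector into one field, with values separated by "|". A minimal parsing sketch follows, assuming tab-separated columns as the .tsv extension suggests; parse_embedding_line is a hypothetical helper, and the toy rows use complete 4-dimensional vectors because the sample above elides the remaining values.

```python
def parse_embedding_line(line):
    """Split one data row into (entity_id, entity_type, list of floats)."""
    entity_id, entity_type, raw_vector = line.rstrip("\n").split("\t")
    vector = [float(value) for value in raw_vector.split("|")]
    return entity_id, entity_type, vector

# Toy data rows (header line omitted), invented for illustration.
rows = [
    "MONDO:0004979\tDisease\t0.1|0.2|0.3|0.4",
    "CHEBI:12345\tCompound\t0.5|0.6|0.7|0.8",
]

# Index the vectors by entity id for later lookups.
embeddings = {}
for row in rows:
    entity_id, _entity_type, vector = parse_embedding_line(row)
    embeddings[entity_id] = vector
```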

Output Data Format

Annotated Drugs (annotated_drugs.xlsx)

Excel file with multiple sheets containing:

  • annotated_drugs: Main results with all annotations
  • predicted_drugs: Initial drug predictions
  • shared_genes_pathways: Gene and pathway overlap analysis
  • shared_diseases: Disease similarity analysis
  • network_annotations: Network centrality features

Error Handling

Common Exceptions

FileNotFoundError

Cause: Required data files not found

Solution: Verify file paths and run data validation

ValueError

Cause: Invalid parameters or data format

Solution: Check parameter values and data format

MemoryError

Cause: Insufficient memory for large datasets

Solution: Reduce dataset size or increase system memory

Error Handling Best Practices

  • Always validate data before running analysis
  • Use try/except blocks for file operations
  • Check system resources before large computations
  • Implement proper logging for debugging
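
The practices above can be combined in a small wrapper that checks input files first, then runs an analysis step and logs the exceptions listed in this section. This is a sketch under stated assumptions: run_safely and the logger name are hypothetical, and the step argument stands for any callable, such as a lambda that calls run_full_pipeline().

```python
import logging
import os

logger = logging.getLogger("biomedgps_explainer_example")

def run_safely(step, required_files=()):
    """Validate inputs, run step(), and log known failures. Return True on success."""
    # Validate before running: report every missing file at once.
    missing = [path for path in required_files if not os.path.exists(path)]
    if missing:
        logger.error("Missing input files: %s", ", ".join(missing))
        return False
    try:
        step()
    except FileNotFoundError as exc:
        logger.error("Required data file not found: %s", exc)
        return False
    except ValueError as exc:
        logger.error("Invalid parameter or data format: %s", exc)
        return False
    except MemoryError:
        logger.error("Out of memory; reduce the dataset size and retry")
        return False
    return True
```

Returning a boolean lets a calling script decide whether to continue with filtering and visualization; unexpected exception types are deliberately left to propagate so they are not silently hidden.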