Operational Ontology using Semantic Knowledge Graph in Snowflake
A validated architecture enabling scientists to ask complex questions that combine experimental data with biological knowledge—entirely within Snowflake
What We Built
The Problem
Pharmaceutical R&D generates experimental data (drug efficacy, cell line studies, gene expression) that needs to be analyzed in the context of biological knowledge (cell type hierarchies, gene functions). These two data domains are typically siloed—a query for "epithelial-derived cancers" cannot find liver cancer studies because the experimental data doesn't explicitly label cells as epithelial.
The Solution
A system that lets scientists ask questions like "Show drug efficacy for all epithelial-derived cancers"and automatically discovers that liver cells (hepatocytes), lung cells, and breast cells are all epithelial-derived because it understands the biological hierarchy—even though this information isn't in the experimental data itself.
Key Innovation: Closure Tables
Biological hierarchies can be very deep (hepatocyte → epithelial cell → eukaryotic cell → cell). Instead of searching through these chains every time, we pre-calculate all connections using closure tables. This converts expensive recursive graph traversal into simple table lookups, enabling sub-second query response times.
System Components
Data Layer
Ontology Tables (KG_NODE, KG_EDGE): Store entities and relationships from OWL/RDF ontologies. We ingested Cell Ontology (~2,700 cell types) and Gene Ontology (~155,000 terms), resulting in 158,395 nodes and 409,132 edges.
Data Product Tables: Store experimental results with foreign key references to ontology nodes. Includes TREATMENTS, CELL_LINE_CATALOG, IN_VIVO_STUDIES, IN_VITRO_STUDIES, and RNA_SEQ_RESULTS.
Graph Traversal Layer
Closure Tables: Pre-computed tables storing all ancestor-descendant pairs. KG_CELLTYPE_ANCESTORS contains 53,879 rows mapping each cell type to all its ancestors with hop count. KG_ANATOMY_PARTS_OF contains 22,168 rows for anatomical relationships.
This converts recursive graph traversal into simple WHERE clause filtering—no CTEs needed at query time.
Query Interface Layer
Semantic View (PHARMA_KG_SEMANTIC_VIEW): Native Snowflake semantic view defining tables, relationships, dimensions, and facts. Enables Cortex Analyst to generate SQL from natural language.
Cortex Search Service: Indexes all 158,395 ontology terms for semantic similarity search.
Cortex Agent: Orchestrates between Analyst (structured queries) and Search (concept discovery).
Data Used in the Proof of Concept
Ontologies Ingested
We ingested two standard biomedical ontologies from the OBO Foundry, representing biological knowledge as entities (nodes) and relationships (edges).
Cell Ontology (CL)
Hierarchical classification of cell types, enabling queries like "find all epithelial-derived cells"
Gene Ontology (GO)
Gene functions, biological processes, and cellular components for gene-level queries
Synthetic Data Products
We created synthetic pharmaceutical R&D data to simulate experimental results. Each table links to the ontology via NODE_ID foreign keys.
💊 TREATMENTS
Drug/compound registry with gene targets
- ADCs: ADC-X1, ADC-X2, ADC-Y1 (targeting HER2, TROP2, EGFR)
- Small Molecules: KinaseInhib-A/B, ApoptosisInd-1
- Biologics: CheckpointAb-1/2 (PD-1, PD-L1)
TARGET_GENE_NODE_ID → Gene Ontology🧫 CELL_LINE_CATALOG
Cell lines and xenograft models
- Cell Lines: MCF7, MDA-MB-231, A549, H1975, HepG2
- CDX Models: MCF7-CDX, A549-CDX
- PDX Models: BR-PDX-001, LU-PDX-001, LI-PDX-001
- Knockouts: MCF7-BRCA1-KO, A549-TP53-KO, HepG2-MYC-KO
CELL_TYPE_NODE_ID → Cell Ontology📊 RNA_SEQ_RESULTS
Gene expression data
- Baseline vs treated expression
- Fold change and p-values
- Knockout cell line expression
- Multiple timepoints (0h, 24h, 48h, 72h)
GENE_NODE_ID → Gene Ontology🐁 IN_VIVO_STUDIES
Tumor growth efficacy studies
- Tumor Growth Inhibition (TGI) %
- Response categories (CR, PR, SD, PD)
- Survival days
- Treatment schedules (Q7D, BIW, QD)
MODEL_ID → Cell Line Catalog🧪 IN_VITRO_STUDIES
Cell-based assay results
- Cell viability assays
- Dose-response data
- IC50 values
- Apoptosis assay results
CELL_LINE_ID → Cell Line CatalogValidation Results
Ontology Ingestion
Successfully loaded Cell Ontology and Gene Ontology
158,395 nodes • 409,132 edgesClosure Table Generation
Pre-computed all ancestry paths
53,879 cell type paths • 22,168 anatomy pathsData Product Linkage
All cell lines linked to ontology via NODE_ID foreign keys
~1,000 experimental records linkedLineage-Based Queries
Correctly discovers hepatocyte as epithelial-derived
Sub-second response timeNatural Language to SQL
Cortex Analyst generates correct joins automatically
Semantic View with relationshipsSemantic Search
Finds related ontology terms by meaning
158K terms indexedExample Questions That Now Work
- "Show drug efficacy for all epithelial-derived cancers" — Automatically finds liver, lung, breast, kidney models
- "Which drugs have the best TGI in PDX models?" — Queries across patient-derived xenograft studies
- "Compare ADC performance to small molecules" — Groups drugs by class and calculates averages
- "What is a hepatocyte?" — Searches biological knowledge base for definitions
- "Find all cell lines derived from leukocytes" — Traverses immune cell hierarchy
Concerns and Mitigations
Closure Table Size Explosion
Closure tables grow quadratically with hierarchy depth and branching factor. A naive implementation for large ontologies could produce billions of rows.
Closure Table Maintenance
When ontologies are updated (new terms, changed relationships), closure tables become stale. Full rebuild can be expensive for large ontologies.
Multiple Relationship Types
Ontologies contain various edge types (subClassOf, part_of, regulates). A single closure table conflates these semantics.
Query Accuracy for Novel Questions
Cortex Analyst may generate incorrect SQL for questions outside its training distribution or for complex multi-hop reasoning.
Scale Considerations
Conclusion
This proof-of-concept demonstrates that an operational ontology using semantic knowledge graph can be implemented in Snowflake using closure tables for graph traversal, semantic views for natural language access, and Cortex services for AI-powered querying. Scientists can now ask questions that span experimental data and biological knowledge without manual cross-referencing.