How Do You Connect 10,000 Diseases to Everything That Matters?
The question that defines most of what KISHO does isn't "how do we generate a report?" It's "how do we know which data belongs to which disease?"
That sounds simple. It is not.
KISHO covers 10,888+ rare diseases. Every day, new clinical trials are registered on ClinicalTrials.gov. News outlets publish articles about drug approvals, company acquisitions, and research breakthroughs. The FDA grants orphan drug designations. State legislatures introduce bills affecting newborn screening panels. Researchers publish in journals. Patient assistance programs change their eligibility criteria.
None of that content shows up neatly labeled with a disease identifier. A news article about a gene therapy approval might mention the brand name of the drug, the company that makes it, the condition it treats by one of its three common names, and a gene target. A clinical trial listing might use a synonym that doesn't match the canonical disease name. A policy bill might reference "lysosomal storage disorders" as a category without naming a single specific condition.
The job is to take all of that, across every source, and connect it to the right disease records with enough confidence to be useful and enough caution to not be misleading.
Why search doesn't solve this.
The instinct is to treat this as a search problem. Take a disease name. Search for it across your sources. Return the matches.
That breaks almost immediately at scale. Rare diseases are notorious for having multiple names, outdated names, names that overlap with common English words, names that are subsets of other disease names, and names that changed when the underlying genetics were better understood. A disease might be known by its eponym in clinical settings, by its gene in research contexts, and by a completely different name in patient communities.
A basic keyword search for "Fabry disease" misses articles that reference "alpha-galactosidase A deficiency" or "Anderson-Fabry disease" or just "GLA mutations" in a clinical context. And a search broad enough to catch all of those will also catch articles about dozens of unrelated conditions that happen to share a gene pathway or a symptom overlap.
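To make that failure mode concrete, here's a toy sketch. The synonym set below is illustrative (in practice the synonyms come from the ontology, and "GLA deficiency" is a hypothetical alias added for the example):

```python
# Toy synonym set for Fabry disease; real synonym lists come from MONDO.
FABRY_SYNONYMS = {
    "fabry disease",
    "anderson-fabry disease",
    "alpha-galactosidase a deficiency",
    "gla deficiency",  # hypothetical gene-based alias, for illustration
}

def mentions_disease(text: str, synonyms: set[str]) -> bool:
    """Return True if any known synonym appears in the text."""
    lowered = text.lower()
    return any(name in lowered for name in synonyms)

headline = "Trial opens for adults with alpha-galactosidase A deficiency"

# Searching for the canonical name alone misses this headline;
# matching against the full synonym set does not.
print(mentions_disease(headline, {"fabry disease"}))  # False
print(mentions_disease(headline, FABRY_SYNONYMS))     # True
```

The real pipeline is far more careful about word boundaries and ambiguous aliases, but the asymmetry is the point: one name is never enough, and the full set only exists if something maintains it.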
This is why KISHO is built on a medical ontology, not a search index.
The ontology is the spine.
KISHO uses the MONDO Disease Ontology as its foundational data structure. Every disease in the platform has a MONDO identifier that maps to its synonyms, its parent and child classifications, and its cross-references to other medical databases (OMIM, Orphanet, GARD, ICD-10, SNOMED). That identifier is the anchor point. Everything else connects through it.
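A disease record anchored on its MONDO identifier looks roughly like this. The field names are illustrative, not KISHO's actual schema; the identifiers shown are the published MONDO, OMIM, and Orphanet entries for Fabry disease:

```python
from dataclasses import dataclass, field

# Hypothetical shape of an ontology-anchored disease record.
@dataclass
class DiseaseRecord:
    mondo_id: str                 # the anchor point everything links through
    canonical_name: str
    synonyms: list[str] = field(default_factory=list)
    parents: list[str] = field(default_factory=list)    # broader MONDO classes
    xrefs: dict[str, str] = field(default_factory=dict)  # cross-database ids

fabry = DiseaseRecord(
    mondo_id="MONDO:0010526",
    canonical_name="Fabry disease",
    synonyms=["Anderson-Fabry disease", "alpha-galactosidase A deficiency"],
    parents=[],  # parent class ids omitted in this sketch
    xrefs={"OMIM": "301500", "Orphanet": "ORPHA:324"},
)
```

The cross-references matter as much as the synonyms: a trial record keyed to an Orphanet code and a news article using a clinical eponym both resolve to the same anchor.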
When a piece of content arrives, whether it's a news article, a trial record, an FDA designation, or a policy bill, the system doesn't just search for the disease name. It resolves the content against the full ontology: canonical names, synonyms, cross-database identifiers, and in some cases, the genes and phenotypes associated with a condition. The matching runs at different confidence levels depending on the source and the specificity of the reference, and every link carries a confidence score.
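A minimal sketch of that tiered matching, assuming a tiny in-memory ontology. The tiers and the scores attached to them are illustrative, not KISHO's actual weights:

```python
# Toy ontology: one disease, with canonical name, synonyms, and genes.
ONTOLOGY = {
    "MONDO:0010526": {
        "canonical": "fabry disease",
        "synonyms": {"anderson-fabry disease", "alpha-galactosidase a deficiency"},
        "genes": {"GLA"},
    },
}

def resolve(text: str) -> list[tuple[str, float]]:
    """Return (mondo_id, confidence) pairs for diseases referenced in text."""
    lowered = text.lower()
    matches = []
    for mondo_id, entry in ONTOLOGY.items():
        if entry["canonical"] in lowered:
            matches.append((mondo_id, 0.95))  # exact canonical name
        elif any(s in lowered for s in entry["synonyms"]):
            matches.append((mondo_id, 0.85))  # known synonym
        elif any(g.lower() in lowered.split() for g in entry["genes"]):
            matches.append((mondo_id, 0.50))  # gene-only mention: ambiguous
    return matches

print(resolve("New data on GLA mutations in classic phenotype patients"))
```

A gene-only mention lands at a deliberately low confidence because one gene can map to several conditions; that score travels with the link and decides what happens to it downstream.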
This is not something AI invented for us. Ontology-based entity resolution is an established approach in biomedical informatics. What AI does in our pipeline is handle the messy middle: the cases where a news article references a condition obliquely, or a trial listing uses terminology that doesn't map cleanly to any single disease, or a policy bill affects a category of conditions rather than a named one.
Where AI earns its place, and where it doesn't.
I'm deliberate about where AI sits in this pipeline, because the default in the industry right now is to throw a language model at every problem and call it innovation. That's how you get hallucinated disease associations and fabricated trial data.
In KISHO, AI does not fetch data. It does not decide what's true. It does not fill in gaps from its training memory. Every factual claim in every report, every disease page, every alert traces back to a structured data source with a timestamp and a provenance chain.
AI does two things here. First, it handles the fuzzy classification work that would be impossible to do with rules alone at this scale. When a news article arrives, determining which of 10,888 diseases it's relevant to, what category of news it represents, and how important it is to the people tracking that disease requires understanding natural language in a medical context. That's what language models are good at.
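Even that classification step is constrained. A sketch of the guardrail, assuming the model returns JSON (the field names and categories here are illustrative): the output is validated against the ontology rather than trusted as-is.

```python
import json

# Illustrative allow-lists; in practice these come from the ontology
# and the platform's taxonomy, not hardcoded sets.
KNOWN_IDS = {"MONDO:0010526"}
NEWS_CATEGORIES = {"approval", "trial_update", "research", "policy", "business"}

def validate_classification(raw_model_output: str) -> dict:
    """Parse the model's JSON and reject any disease id or category
    the ontology does not recognize, rather than trusting the model."""
    parsed = json.loads(raw_model_output)
    if parsed["mondo_id"] not in KNOWN_IDS:
        raise ValueError(f"unknown disease id: {parsed['mondo_id']}")
    if parsed["category"] not in NEWS_CATEGORIES:
        raise ValueError(f"unknown category: {parsed['category']}")
    return parsed

ok = validate_classification(
    '{"mondo_id": "MONDO:0010526", "category": "approval", "importance": "high"}'
)
```

A model that invents a disease identifier gets rejected at this boundary instead of polluting the graph.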
Second, AI synthesizes. Once the data is assembled, once the trials and genes and FDA designations and policy activity and news are all linked to a disease through the ontology, turning that structured dataset into a readable intelligence brief is a synthesis task. The model writes from a verified data packet. It doesn't go looking for information. It works with what it's given, and if the data isn't there, the report says so explicitly rather than papering over the gap.
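The "says so explicitly" part is mechanical, not a matter of model behavior. A sketch, with hypothetical section names, of how a data packet's gaps become statements rather than silence:

```python
def render_brief(packet: dict) -> list[str]:
    """Turn a verified data packet into brief sections; absent data
    is stated explicitly rather than omitted."""
    sections = []
    trials = packet.get("trials", [])
    if trials:
        sections.append(f"Active trials: {len(trials)} identified.")
    else:
        sections.append("No active clinical trials identified.")
    designations = packet.get("fda_designations", [])
    if designations:
        sections.append(f"FDA designations: {', '.join(designations)}.")
    else:
        sections.append("No FDA designations on record.")
    return sections

print(render_brief({"trials": [], "fda_designations": ["orphan drug"]}))
```

In the real pipeline the model writes the prose, but the gap statements are driven by the packet: an empty field produces an explicit "none identified" line, never an invitation to improvise.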
This distinction matters more than any other architectural decision we've made. The AI is the engine, not the source of truth. The ontology and the structured data pipelines are the source of truth. The AI makes it usable.
The edges are the hard part.
I won't pretend this is a solved problem. The long tail of rare diseases includes conditions with almost no published literature, no active trials, and no patient organizations. For those diseases, the content mapping is thin because the content itself is thin. KISHO will tell you that. A disease page that says "no active clinical trials identified" is more honest than one that doesn't mention trials at all.
There are also ongoing challenges with disambiguation (two diseases that share a gene but have completely different clinical presentations), with temporal accuracy (making sure a trial that closed last month isn't still showing as active), and with source coverage (not every relevant news outlet has a structured feed we can ingest).
We handle these with layers: automated matching, confidence scoring, validation checks, and in some cases, human review. The system is designed to be cautious. A low-confidence match gets flagged rather than published. A data source that returns nothing gets noted as a gap rather than ignored.
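The routing step at the end of those layers is simple to state. A sketch, with an assumed threshold value rather than KISHO's real setting:

```python
PUBLISH_THRESHOLD = 0.80  # illustrative cutoff, not the production value

def route(match: tuple[str, float]) -> str:
    """Publish confident matches; flag the rest for review."""
    mondo_id, confidence = match
    return "publish" if confidence >= PUBLISH_THRESHOLD else "flag_for_review"

print(route(("MONDO:0010526", 0.95)))  # publish
print(route(("MONDO:0010526", 0.50)))  # flag_for_review
```

The design choice is that the failure mode is a delay, not a wrong association on a live disease page.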
Why this matters more than the AI.
Every week, someone launches a new "AI-powered" health platform. Most of them are a language model with a prompt. Ask a question, get an answer, hope it's right.
That works for casual queries. It does not work for a PAG leader making decisions about which clinical trial to promote to their community. It does not work for a pharma team running competitive intelligence on a therapeutic area. It does not work for a genetic counselor handing a family a disease summary that needs to be accurate.
The unglamorous work (the ontology mapping, the entity resolution, the confidence scoring, the source pipelines, the validation layers) is what makes the AI output trustworthy. It's also what makes it defensible. You can't replicate a continuously updated, cross-domain, ontology-linked data layer with a clever prompt.
That's the infrastructure KISHO is built on. Everything else (the reports, the alerts, the disease pages, the API) is a product of it.