Why Similarity Search Requires Representation Learning
Introduction
In the previous post, I discussed why forensic dental AI depends on more than just models.
Before a system can learn anything useful, it must first have access to:
- Structured dental records
- Reliable data governance
- Expert-reviewed information
- Trustworthy data pipelines
However, another challenge remains.
Even with a well-designed database, how can a system identify records that are similar but not identical?
This question leads directly to representation learning.
1. The Problem with Exact Matching
Traditional databases are designed around exact values.
Consider the following example.
Ante-Mortem (AM)
- Missing Tooth #36
- Crown on #11
Post-Mortem (PM)
- Missing Tooth #36
- Crown on #11
- Additional restoration on #24
A forensic expert might consider these records highly compatible.
The restoration on #24 could simply represent treatment performed after the AM record was created.
However, a strict database query may treat them as different records.
This illustrates a fundamental issue.
Databases typically reason about equality.
Human experts reason about similarity.
Those are not the same thing.
2. How Human Experts Actually Compare Records
Forensic identification rarely depends on finding perfect matches.
Instead, experts compare patterns.
Typical questions include:
- Which findings remain stable over time?
- Which findings are expected to change?
- Which characteristics are highly distinctive?
- Which differences are clinically acceptable?
The question is usually not:
Are these records identical?
Instead, it is:
Are these records sufficiently similar to support identification?
This is fundamentally a similarity problem.
3. Why Rule-Based Similarity Has Limits
One possible approach is to assign manual scores.
For example:
1
2
3
4
Crown = 5 points
Implant = 10 points
Bridge = 15 points
Denture = 12 points
The system could then calculate similarity using predefined weights.
While simple, this approach becomes difficult to maintain.
Questions quickly emerge:
- Is an implant always worth twice as much as a crown?
- How should combinations of findings be weighted?
- How should treatment changes be handled?
- How should rare anatomical features be represented?
As the system grows, manually designed rules become increasingly difficult to manage.
4. What Is Representation Learning?
Representation learning approaches the problem differently.
Instead of manually defining every rule, the system learns useful representations directly from data.
Conceptually:
1
2
3
4
5
6
7
Dental Record
↓
Encoder
↓
Embedding Vector
↓
Similarity Search
The embedding vector becomes a numerical representation of a dental record.
Records with similar characteristics are positioned closer together within the representation space.
Records with different characteristics move farther apart.
The goal is not to memorize records.
The goal is to learn meaningful relationships between records.
5. Why Embeddings Matter
Traditional database queries often ask:
Do all fields match exactly?
A similarity-based system asks a different question:
How close are these records within a learned representation space?
This distinction is important.
Two records may have:
- Different treatment dates
- Different restoration histories
- Different documentation styles
Yet they may still belong to the same individual.
Embedding-based retrieval makes this type of reasoning computationally possible.
6. Potential Sources of Representation
A future forensic dental AI system may learn from multiple sources of information.
Examples include:
Structured Data
- Odontograms
- Dental findings
- Treatment history
- Prosthetic information
Imaging Data
- Panoramic radiographs
- Periapical radiographs
- CBCT images
Clinical Context
- Clinical notes
- Expert annotations
- Case metadata
Each source captures different aspects of dental identity.
No single source is likely sufficient on its own.
7. Toward Multimodal Representations
Future systems will likely combine multiple information sources.
1
2
3
4
5
6
7
8
9
Odontogram
+
Radiographic Data
+
Clinical Documentation
↓
Shared Embedding Space
↓
Similarity Retrieval
In this architecture:
- Structured records provide explicit information
- Images provide anatomical information
- Clinical documentation provides contextual information
The objective is not replacement.
The objective is integration.
8. Why Infrastructure Still Comes First
Representation learning may sound like a modeling problem.
In reality, it remains heavily dependent on infrastructure.
If data quality is poor:
- Embeddings become unreliable
- Similarity scores become misleading
- Evaluation becomes difficult
This means representation learning still depends on:
- Structured records
- Standardized schemas
- Governance frameworks
- Expert review
- Secure data pipelines
The quality of the learned representation can never exceed the quality of the underlying data.
9. A Practical Engineering Roadmap
From an engineering perspective, the progression may look like:
1
2
3
4
5
6
7
8
9
10
11
Structured Dental Records
↓
Searchable Database
↓
Candidate Retrieval
↓
Similarity Search
↓
Representation Learning
↓
Multimodal AI
This sequence is important.
A useful retrieval system can exist before advanced AI models.
In many healthcare domains, infrastructure matures before intelligence.
Forensic dental identification is likely no exception.
Conclusion
Exact matching works well when records are complete and perfectly standardized.
Real forensic cases rarely satisfy those conditions.
As forensic dental systems evolve, identification will increasingly depend on similarity rather than equality.
This creates a new challenge.
The system must learn meaningful representations of dental information.
Representation learning offers one possible path toward that goal.
However, embeddings do not emerge in isolation.
They depend on structured records, governance frameworks, expert oversight, and trustworthy infrastructure.
Before a system can learn similarity, it must first learn from data that deserves to be trusted.
In the next post, I will discuss why building a dental search engine may be a more practical first step than building an AI model.