Why Similarity Search Requires Representation Learning

Posted May 26, 2026

By Dong Jun Lee

4 min read

Introduction

In the previous post, I discussed why forensic dental AI depends on more than just models.

Before a system can learn anything useful, it must first have access to:

Structured dental records
Reliable data governance
Expert-reviewed information
Trustworthy data pipelines

However, another challenge remains.

Even with a well-designed database, how can a system identify records that are similar but not identical?

This question leads directly to representation learning.

1. The Problem with Exact Matching

Traditional databases are designed around exact values.

Consider the following example.

Ante-Mortem (AM)

Missing Tooth #36
Crown on #11

Post-Mortem (PM)

Missing Tooth #36
Crown on #11
Additional restoration on #24

A forensic expert might consider these records highly compatible.

The restoration on #24 could simply represent treatment performed after the AM record was created.

However, a strict database query may treat them as different records.

This illustrates a fundamental issue.

Databases typically reason about equality.

Human experts reason about similarity.

Those are not the same thing.

2. How Human Experts Actually Compare Records

Forensic identification rarely depends on finding perfect matches.

Instead, experts compare patterns.

Typical questions include:

Which findings remain stable over time?
Which findings are expected to change?
Which characteristics are highly distinctive?
Which differences are clinically acceptable?

The question is usually not:

Are these records identical?

Instead, it is:

Are these records sufficiently similar to support identification?

This is fundamentally a similarity problem.

3. Why Rule-Based Similarity Has Limits

One possible approach is to assign manual scores.

For example:

Crown      = 5 points
Implant    = 10 points
Bridge     = 15 points
Denture    = 12 points

The system could then calculate similarity using predefined weights.

While simple, this approach becomes difficult to maintain.

Questions quickly emerge:

Is an implant always worth twice as much as a crown?
How should combinations of findings be weighted?
How should treatment changes be handled?
How should rare anatomical features be represented?

As the system grows, manually designed rules become increasingly difficult to manage.

4. What Is Representation Learning?

Representation learning approaches the problem differently.

Instead of manually defining every rule, the system learns useful representations directly from data.

Conceptually:

Dental Record
      ↓
    Encoder
      ↓
Embedding Vector
      ↓
Similarity Search

The embedding vector becomes a numerical representation of a dental record.

Records with similar characteristics are positioned closer together within the representation space.

Records with different characteristics move farther apart.

The goal is not to memorize records.

The goal is to learn meaningful relationships between records.

5. Why Embeddings Matter

Traditional database queries often ask:

Do all fields match exactly?

A similarity-based system asks a different question:

How close are these records within a learned representation space?

This distinction is important.

Two records may have:

Different treatment dates
Different restoration histories
Different documentation styles

Yet they may still belong to the same individual.

Embedding-based retrieval makes this type of reasoning computationally possible.

6. Potential Sources of Representation

A future forensic dental AI system may learn from multiple sources of information.

Examples include:

Structured Data

Odontograms
Dental findings
Treatment history
Prosthetic information

Imaging Data

Panoramic radiographs
Periapical radiographs
CBCT images

Clinical Context

Clinical notes
Expert annotations
Case metadata

Each source captures different aspects of dental identity.

No single source is likely sufficient on its own.

7. Toward Multimodal Representations

Future systems will likely combine multiple information sources.

Odontogram
      +
Radiographic Data
      +
Clinical Documentation
      ↓
Shared Embedding Space
      ↓
Similarity Retrieval

In this architecture:

Structured records provide explicit information
Images provide anatomical information
Clinical documentation provides contextual information

The objective is not replacement.

The objective is integration.

8. Why Infrastructure Still Comes First

Representation learning may sound like a modeling problem.

In reality, it remains heavily dependent on infrastructure.

If data quality is poor:

Embeddings become unreliable
Similarity scores become misleading
Evaluation becomes difficult

This means representation learning still depends on:

Structured records
Standardized schemas
Governance frameworks
Expert review
Secure data pipelines

The quality of the learned representation can never exceed the quality of the underlying data.

9. A Practical Engineering Roadmap

From an engineering perspective, the progression may look like:

Structured Dental Records
          ↓
Searchable Database
          ↓
Candidate Retrieval
          ↓
Similarity Search
          ↓
Representation Learning
          ↓
Multimodal AI

This sequence is important.

A useful retrieval system can exist before advanced AI models.

In many healthcare domains, infrastructure matures before intelligence.

Forensic dental identification is likely no exception.

Conclusion

Exact matching works well when records are complete and perfectly standardized.

Real forensic cases rarely satisfy those conditions.

As forensic dental systems evolve, identification will increasingly depend on similarity rather than equality.

This creates a new challenge.

The system must learn meaningful representations of dental information.

Representation learning offers one possible path toward that goal.

However, embeddings do not emerge in isolation.

They depend on structured records, governance frameworks, expert oversight, and trustworthy infrastructure.

Before a system can learn similarity, it must first learn from data that deserves to be trusted.

In the next post, I will discuss why building a dental search engine may be a more practical first step than building an AI model.

AI, Dentistry

This post is licensed under CC BY 4.0 by the author.