Terminology, retrieval bias, and field definition in forensic genetic genealogy

Bibliometric analyses are valuable tools for depicting the development, structure, and influence of scientific disciplines. The validity of such analyses, however, is contingent on the completeness and accuracy of the underlying corpus. In evolving interdisciplinary areas of study, where terminology is not yet standardized, the design of the literature search strategy is a critical determinant of the resulting dataset and of whether the data obtained provide an accurate picture of the field.

In a recent bibliometric analysis of “investigative genetic genealogy” published in Forensic Science International: Synergy, Pellegrino and Stasi present a structured overview of the field using a Scopus-derived corpus of 147 publications, without apparent triangulation across complementary indexing sources. While this work provides some characterization of growth trends and thematic clustering, the extent to which broader conclusions can be drawn from the analysis is inherently constrained by methodological choices made during corpus construction, particularly in relation to terminology and search strategy design.

While advanced DNA sequencing technologies have been available for nearly two decades, continued improvements have expanded capabilities. Since 2018, notable advances in DNA sequencing, genetic genealogy database search capabilities, and genealogical reconstruction have contributed to the application of forensic genetic genealogy across forensic science, law enforcement, and humanitarian identification efforts. At the same time, the terminology used to describe these approaches has continued to vary and is still being refined and debated.

In this context, incomplete or inconsistent use of terminology at the point of literature retrieval can introduce systematic bias that propagates through all subsequent stages of analysis. We further discuss how terminology in this field reflects deeper conceptual differences regarding its classification and scope, and how these factors should be considered in the design and interpretation of bibliometric studies.

The terminology used to describe DNA-based identification using genealogical and population genetic methods remains unsettled. Commonly used terms include “investigative genetic genealogy,” “forensic genetic genealogy,” and “forensic investigative genetic genealogy.” These terms are sometimes used interchangeably in the literature, but they are not strictly synonymous.

The lack of standardization reflects the interdisciplinary nature of the field, which spans forensic science, genomics, bioinformatics, genealogy, legal scholarship, ethics, and academia. Different stakeholder groups have adopted terminology that aligns with their respective perspectives. As a result, no single term fully captures the breadth of relevant work.

In such contexts, reliance on a single or limited set of applicable terms for literature retrieval can lead to systematic exclusion of relevant work. These challenges are compounded when literature retrieval is restricted to a single database and not cross-validated through direct searches of core journals representative of the field, as coverage varies across disciplines and publication types.

Given that terminology in this field is both evolving and inconsistently applied across disciplines, search strategies that do not explicitly account for this variation introduce bias at the point of corpus construction. As a result, subsequent analyses may reflect the structure of the search strategy rather than the structure of the field itself.

One methodological concern in bibliometric analyses of this field is asymmetric inclusion of terminology during data retrieval. In the work by Pellegrino and Stasi, the primary Boolean query includes “investigative genetic genealogy” but does not appear to include “forensic genetic genealogy,” despite the authors acknowledging variability in terminology. This approach risks systematically underrepresenting portions of the literature that use alternative but widely accepted terminology.

This bias is introduced at the point of corpus construction and cannot be corrected through downstream preprocessing steps such as keyword harmonization. Even when multiple terms are acknowledged conceptually, failure to include them explicitly in a search query limits recall and shapes the perceived structure of the field.
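As a minimal sketch of what including each variant explicitly at the query stage might look like, the following Python snippet assembles an exact-phrase OR query from the three terms named in this commentary. The function name `build_phrase_query` and the generic quoted-phrase OR syntax are illustrative assumptions; each indexing database has its own field codes and operators, which would need to be applied according to that database's documentation.

```python
def build_phrase_query(terms):
    """Join exact-phrase terms with OR, dropping duplicates while keeping order."""
    seen = set()
    phrases = []
    for term in terms:
        # Collapse whitespace and lowercase so trivially different spellings merge.
        normalized = " ".join(term.lower().split())
        if normalized and normalized not in seen:
            seen.add(normalized)
            phrases.append(f'"{normalized}"')
    return " OR ".join(phrases)


# Term variants taken from the discussion above; extend as terminology evolves.
FIELD_TERMS = [
    "investigative genetic genealogy",
    "forensic genetic genealogy",
    "forensic investigative genetic genealogy",
]

query = build_phrase_query(FIELD_TERMS)
print(query)
```

Keeping the term list as data rather than hard-coding the query string makes the search strategy easy to report alongside the results and to rerun when new terms enter use.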

An additional issue arises from inclusion of terms that are not conceptually equivalent. The search strategy employed by Pellegrino and Stasi includes the term “forensic genealogy”. While this term has been used loosely by some, it describes a distinct and different discipline involving traditional genealogical research in legal contexts such as probate and heir tracing, rather than DNA-based identification in forensics.

Including such terms in a search strategy introduces literature that is unrelated to SNP-based (as well as lineage marker-based) identification methods, while the same strategy simultaneously excludes relevant work that uses more appropriate terminology. Both false positives and false negatives are thus generated, resulting in a corpus that is neither comprehensive nor conceptually coherent and, importantly, does not adequately represent the field.
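The effect of false positives and false negatives on a corpus can be expressed with the standard precision and recall measures. The sketch below is illustrative only: the record identifiers are hypothetical placeholders, not data from any actual search, and the "relevant" set stands in for a manually curated reference list that an analyst would have to assemble.

```python
def retrieval_metrics(retrieved, relevant):
    """Compare a retrieved corpus against a curated reference set."""
    retrieved, relevant = set(retrieved), set(relevant)
    true_pos = retrieved & relevant    # relevant records the search found
    false_pos = retrieved - relevant   # off-topic records the search pulled in
    false_neg = relevant - retrieved   # relevant records the search missed
    precision = len(true_pos) / len(retrieved) if retrieved else 0.0
    recall = len(true_pos) / len(relevant) if relevant else 0.0
    return {
        "false_positives": sorted(false_pos),
        "false_negatives": sorted(false_neg),
        "precision": precision,
        "recall": recall,
    }


# Hypothetical identifiers: "x1" plays the role of an unrelated record
# (for example, a probate genealogy paper); "p4" and "p5" are relevant
# records indexed under a term the query omitted.
retrieved_ids = ["p1", "p2", "p3", "x1"]
relevant_ids = ["p1", "p2", "p3", "p4", "p5"]

metrics = retrieval_metrics(retrieved_ids, relevant_ids)
print(metrics)
```

In this toy case precision is 0.75 and recall is 0.6, showing how a single query can be both over-inclusive and under-inclusive at the same time.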

Because bibliometric outputs such as author rankings, collaboration networks, and thematic clusters depend entirely on the composition of the underlying dataset, biases introduced during retrieval propagate through all subsequent analyses. Apparent patterns in the data therefore may reflect artifacts of the search strategy rather than the actual structure of the field. This risk is further exacerbated when analyses rely on a single indexing database without cross-validation and when no validation step is performed to confirm that key publications and contributors are captured within the dataset. For example, complementary use of openly accessible resources such as PubMed can facilitate cross-validation and improve transparency and reproducibility.
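One simple form of the validation step described above is to compare record identifiers exported from two indexing sources and to check whether a list of known key publications appears in either. The following sketch assumes each source can be reduced to a list of DOIs; the function names and the DOIs shown are hypothetical placeholders, not real publications.

```python
def normalize_doi(doi):
    """Lowercase a DOI and strip common URL or 'doi:' prefixes."""
    doi = doi.strip().lower()
    for prefix in ("https://doi.org/", "http://doi.org/", "doi:"):
        if doi.startswith(prefix):
            doi = doi[len(prefix):]
    return doi.strip()


def cross_validate(source_a, source_b, seed_dois):
    """Report records unique to each source and seed papers found in neither."""
    a = {normalize_doi(d) for d in source_a}
    b = {normalize_doi(d) for d in source_b}
    seeds = {normalize_doi(d) for d in seed_dois}
    return {
        "only_in_a": sorted(a - b),
        "only_in_b": sorted(b - a),
        "missing_seeds": sorted(seeds - (a | b)),
    }


# Placeholder DOIs for illustration only.
export_a = ["10.0000/example.1", "https://doi.org/10.0000/EXAMPLE.2"]
export_b = ["doi:10.0000/example.2", "10.0000/example.3"]
key_papers = ["10.0000/example.1", "10.0000/example.4"]

report = cross_validate(export_a, export_b, key_papers)
print(report)
```

A non-empty `missing_seeds` list would signal that the search strategy should be revisited before any downstream ranking or clustering is interpreted.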

The diversity of terminology in this field is not solely a product of linguistic drift. It also reflects deliberate efforts by different stakeholders to define the scope and nature of the work.

For example, some authors have adopted the term “investigative genetic genealogy” to emphasize that the method produces investigative leads rather than direct evidence of identity. This framing has been used to support the position that the practice is distinct from forensic science and should not be subject to the same accountability and accreditation frameworks. In this view, the argument is that genealogical analysis concludes with the generation of a hypothesis, which must then be confirmed through conventional forensic DNA testing.

At the same time, many established forensic processes, including database searches such as those performed within CODIS, are also characterized as generating investigative leads that require confirmation and are consistent with established forensic workflows. The designation of a method as “lead-generating” has not historically excluded it, and should not exclude it, from the domain of forensic science. Indeed, the outputs of many forensic analyses can be understood as investigative leads that require confirmation through additional evidence. As such, the use of a modifier such as “investigative” for one discipline does not meaningfully distinguish it from other established forensic practices and may introduce ambiguity in how the field is characterized. This framing could be interpreted as implying that different standards of scrutiny or oversight apply, an interpretation that is not appropriate for forensic genetic genealogy.

Furthermore, recent developments suggest increasing convergence rather than separation of scientific accountability. Standards organizations and collaborative initiatives have begun to incorporate these methods into existing forensic validation and accreditation frameworks. At the same time, courts have evaluated and scrutinized these approaches in legal proceedings, subjecting them to admissibility standards and practices of reliability, transparency, and supportability consistent with quality forensic requirements.

These observations suggest that terminology in this field reflects competing conceptual definitions of the work itself or requirements of accountability, rather than a shared or settled understanding. The choice of terminology is therefore not neutral and has implications for how the field is perceived, regulated, and studied.

Ongoing discussion about terminology is a normal feature of emerging scientific fields and does not necessarily reflect an immature science. Similar patterns have been observed in other domains. For example, what is now widely referred to as cloud computing was previously described using terms such as “remote servers” or “hosted infrastructure.” Terminology evolves alongside advances in technology and shifts in conceptual understanding fostered by collaborative exchange among various stakeholders.

A similar process is occurring in this field. While current terminology emphasizes genealogical methods, the underlying capability extends beyond genealogical reconstruction alone. Genome-wide SNP analysis supports a broader range of identity-related applications, including pairwise kinship inference, population-based analysis, direct one-to-one comparisons, and inference of biogeographical ancestry and outwardly visible characteristics, all of which can generate investigative leads without requiring full genealogical tree construction.

As the field continues to mature and terminology evolves, new terms may emerge that more accurately reflect its expanding scope. For example, genomic identification, or some similar term, would encompass genome-wide approaches to identity determination regardless of whether genealogical reconstruction is involved. Looking forward, even broader frameworks such as identity inference may emerge, reflecting a transition from match-based paradigms toward inference-based approaches capable of deriving identity in the absence of direct reference comparisons.

Recognizing this trajectory is important for how the field is described and studied. As terminology continues to evolve, bibliometric analyses must account for these shifts to ensure that the resulting representations reflect the full scope of the underlying capability and progression of the field.

It is long past time to standardize the language of this field, either by aligning on a primary term or by adopting a broader, more inclusive framework.
