A Comparative Analysis of Open and Commercial Bibliographic Infrastructures: Scale, Metadata Standardization, and Implications for Bibliometric Evaluation


Recommended citation:

De-Moya-Anegón, Félix; Sánchez-Jiménez, Rodrigo; Halevi, Gali; Guerrero-Bote, Vicente P.; Guerrero-Castillo, Pablo; Rivadeneyra, Federico (2026). A Comparative Analysis of Open and Commercial Bibliographic Infrastructures: Scale, Metadata Standardization, and Implications for Bibliometric Evaluation. Granada: Ediciones Profesionales de la Información, 48 pp. ISBN: 978-84-125757-8-1

 

https://doi.org/10.3145/aca

Executive summary

This report evaluates the structural viability of open bibliographic infrastructures for research assessment purposes, with a particular focus on how leading open databases compare with Scopus in terms of coverage, metadata quality, transparency, interoperability, and suitability for research evaluation workflows.

While recent policy frameworks such as the Coalition for Advancing Research Assessment (CoARA) and the Barcelona Declaration mandate a transition toward open research data, an empirical analysis reveals a critical bottleneck: a structural trade-off between scale and metadata standardization. OpenAIRE, which aggregates more than 150 million records, and OpenAlex and The Lens, each with over 200 million records, significantly surpass the publication volume covered by commercial curated databases, most notably Scopus, across the 1996–2024 period analyzed.

However, this aggregation model prioritizes recall over structural consistency, which can lead to metadata gaps that compromise direct bibliometric application. The massive ingestion capabilities of open platforms are counterbalanced by substantial limitations in key metadata fields. Affiliation data are absent in more than 55% of records, severely constraining the feasibility of institutional evaluations, and key identifiers such as ISSNs and DOIs exhibit significantly lower levels of completeness than in Scopus. Document type classification also frequently lacks editorial rigor, relying heavily on algorithmic labeling that does not consistently standardize the categorization of scholarly outputs.

Furthermore, the analysis of citation flows reveals a markedly asymmetric dynamic: the expansive long tail of open databases functions primarily as a reference feeder that reinforces the impact indicators of the already established commercial core, rather than substantially redistributing measured impact across the broader scholarly corpus. In this way, the additional literature that open sources seek to incorporate ultimately serves to strengthen the prominence of the publications already represented in commercial databases. This finding points to a structural paradox in open scholarly infrastructures and raises important questions that warrant further reflection and investigation.

Geographic and editorial analyses reveal persistent asymmetries. Within the Global South, representation trajectories diverge: while regions such as Africa and Latin America have improved their visibility, significant coverage gaps, reaching up to 25%, remain in Asia and the Middle East. Additionally, deficits persist in specialized humanities monographs and complex publication structures like conference proceedings. Consequently, the theoretical advantage of the open “long tail” cannot currently be leveraged to offset these geographic and editorial biases, as its source-level metadata remains structurally incomplete or absent.

This operational friction stems from a fundamentally bifurcated data reality. Within the core literature that overlaps with Scopus, open infrastructures achieve high metadata completeness in fields essential for research evaluation. However, the extended literature outside this overlapping core suffers from profound structural deficiencies, including empty essential fields, duplication, and incomplete source data.
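The split between the overlapping core and the extended literature can be illustrated with a small completeness calculation. The following is a minimal sketch, not the report's method: the record layout, the field names, and the `in_core` flag (marking overlap with Scopus) are hypothetical, and the toy records are invented purely for illustration.

```python
from collections import defaultdict

# Hypothetical toy records; field names are illustrative and do not
# correspond to the schema of any specific platform.
records = [
    {"doi": "10.1000/a1", "issn": "1234-5678", "affiliations": ["Univ A"], "in_core": True},
    {"doi": "10.1000/a2", "issn": "1234-5678", "affiliations": ["Univ B"], "in_core": True},
    {"doi": None, "issn": None, "affiliations": [], "in_core": False},
    {"doi": "10.1000/a3", "issn": "", "affiliations": [], "in_core": False},
]

FIELDS = ("doi", "issn", "affiliations")


def completeness_by_segment(records, fields=FIELDS):
    """Return {segment: {field: share of records with a non-empty value}}."""
    totals = defaultdict(int)
    filled = defaultdict(lambda: defaultdict(int))
    for rec in records:
        # Assumption: a boolean flag tells us whether the record overlaps the core.
        segment = "core" if rec.get("in_core") else "extended"
        totals[segment] += 1
        for field in fields:
            if rec.get(field):  # None, "" and [] all count as missing
                filled[segment][field] += 1
    return {
        seg: {f: filled[seg][f] / n for f in fields}
        for seg, n in totals.items()
    }


rates = completeness_by_segment(records)
for segment, by_field in rates.items():
    print(segment, by_field)
```

Reporting the rates per segment rather than for the whole corpus keeps a well-populated core from masking empty fields in the extended literature, which is the distinction the paragraph above draws.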

The corpus derived from Scopus’s editorial processes exhibits a structural consistency without a direct equivalent in open platforms. While all databases use normalization methods, open infrastructures depend heavily on algorithmic procedures, which are especially prominent in OpenAlex. Scopus, by contrast, combines automated processes with author and institutional feedback to refine data disambiguation. Although the data indicate that Scopus captures a higher number of affiliations per document, this study does not include an empirical comparison of the effectiveness of the respective disambiguation systems.

Each open platform faces its own structural trade-offs. The Lens struggles with global metadata standardization, reporting the lowest global rates of ISSN and DOI presence and a 71.67% deficit in capturing conference proceedings. OpenAlex relies heavily on unstructured source data, with 41.5% of its records that have a source lacking an ISSN, and faces potential analytical bias due to algorithmic over-labeling of documents as “articles”. Finally, OpenAIRE presents important technical anomalies, including over one million duplicated DOIs and the highest rate of unclassified documents (23.1%) within the curated core, resulting in the lowest overall citation impact ratio of the group.
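Duplicated DOIs of the kind mentioned above can be surfaced with a simple count over normalized identifiers. The sketch below is an assumption about how such a check might look, not a description of how the report's figures were obtained. The helper names and the sample values are hypothetical. DOIs are compared case-insensitively, so the sketch lowercases them and strips common resolver prefixes before counting.

```python
from collections import Counter

# Prefixes commonly found in front of a bare DOI in aggregated metadata.
PREFIXES = (
    "https://doi.org/",
    "http://doi.org/",
    "https://dx.doi.org/",
    "http://dx.doi.org/",
    "doi:",
)


def normalize_doi(raw):
    """Lowercase a DOI and strip a resolver prefix; return None if empty."""
    if not raw:
        return None
    doi = raw.strip().lower()
    for prefix in PREFIXES:
        if doi.startswith(prefix):
            doi = doi[len(prefix):].strip()
            break
    return doi or None


def duplicated_dois(raw_dois):
    """Return {normalized DOI: count} for DOIs that occur more than once."""
    counts = Counter(d for d in map(normalize_doi, raw_dois) if d)
    return {doi: n for doi, n in counts.items() if n > 1}


# Invented sample values for illustration only.
sample = [
    "10.1000/XYZ",
    "https://doi.org/10.1000/xyz",
    "doi:10.1000/abc",
    None,
    "10.1000/def",
]
dupes = duplicated_dois(sample)
print(dupes)
```

Normalizing before counting matters because the same DOI often arrives in several surface forms from different sources; counting raw strings would understate the duplication.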

Despite the structural limitations observed in their extended corpora, open bibliographic infrastructures present advantages when applied to targeted use cases. The Lens, with over 215 million records, integrates scholarly outputs with patent data, making it highly effective for mapping technology transfer while maintaining a 96.1% citable document density within its core overlap. OpenAlex demonstrates the highest absolute alignment with commercial standards, capturing 63.8 million Scopus-indexed records and showing the highest citation density in that core among the three open databases. Finally, OpenAIRE offers the highest coverage of persistent identifiers (73.2% for DOIs and 59.7% for ISSNs) and the lowest rate of missing institutional affiliations (40.55%) among the open platforms.

The high structural availability of open data must not be uniformly equated with evaluative viability. Uncritical adoption of the full open dataset in its raw state risks introducing new, systemic biases into the global science policy landscape, imposing significant methodological compromises. Nevertheless, these infrastructures have evolved considerably. While direct aggregation currently complicates standard institutional evaluation, these platforms can deliver highly functional solutions for specialized bibliometric analyses, provided that institutions commit to investing in rigorous data normalization and disambiguation processes. Consequently, the transition toward open research assessment requires a technical shift from mere data accessibility to active data validation.

 

 
