Researchers Develop Open-Access Raman Spectral Library to Identify Biomolecules with AI Precision

November 12, 2025

A team led by data engineer and researcher Marcelo Terán from the Universitat Oberta de Catalunya (UOC) has developed an open-access Raman spectral database designed to help scientists identify key biomolecules with greater precision. Working in collaboration with the Institute of Photonic Sciences (ICFO), Terán and his colleagues have compiled a collection of 140 biomolecules, including nucleic acids, proteins, lipids, and carbohydrates.

Terán, M., Ruiz, J. J., Loza-Alvarez, P., Masip, D., & Merino, D. (2025). Open Raman spectral library for biomolecule identification. Chemometrics and Intelligent Laboratory Systems, 264, 105476. https://doi.org/10.1016/j.chemolab.2025.105476

Raman spectroscopy, which analyzes the interaction of light and matter to determine chemical composition and molecular structure, has long been recognized as a powerful non-invasive tool for studying biological materials. However, its broader use in biomedical and analytical engineering has been slowed by the absence of standardized, openly available spectral reference data. The UOC and ICFO project addresses this gap by building a unified and accessible database that can serve as a benchmark for future studies and clinical research.

Terán explained why open data matters for the field:

“It is still unusual for scientific articles to share data openly, especially in the field of Raman spectroscopy. This lack of access to data limits biomedical research considerably. If AI is to be successfully applied, it needs large volumes of reliable and accessible data, and this is where open science projects play a key role.”

To create the database, the researchers gathered Raman spectra from previously published work and developed custom computer-vision tools to automatically extract spectral information from figures and graphs. These data were then processed, corrected, and standardized for consistency. The resulting library includes both full spectra and peak data along with metadata describing the molecule type and experimental conditions.
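The article does not describe the exact processing steps, so the following is only a minimal sketch of what standardizing extracted spectra could look like: resampling each spectrum onto a shared wavenumber grid and rescaling its intensities to a common range. The function names `resample` and `normalize`, and the choice of linear interpolation and min-max scaling, are assumptions for illustration and are not taken from the study.

```python
from bisect import bisect_left


def resample(wavenumbers, intensities, grid):
    """Linearly interpolate a spectrum onto a shared wavenumber grid.

    Assumes `wavenumbers` is strictly increasing. Grid points outside the
    measured range are clamped to the nearest endpoint (an illustrative choice).
    """
    out = []
    for x in grid:
        if x <= wavenumbers[0]:
            out.append(intensities[0])
        elif x >= wavenumbers[-1]:
            out.append(intensities[-1])
        else:
            i = bisect_left(wavenumbers, x)  # first index with wavenumbers[i] >= x
            x0, x1 = wavenumbers[i - 1], wavenumbers[i]
            y0, y1 = intensities[i - 1], intensities[i]
            out.append(y0 + (y1 - y0) * (x - x0) / (x1 - x0))
    return out


def normalize(intensities):
    """Min-max scale intensities to [0, 1] so spectra from different sources are comparable."""
    lo, hi = min(intensities), max(intensities)
    if hi == lo:
        return [0.0] * len(intensities)
    return [(y - lo) / (hi - lo) for y in intensities]


# Toy values, not real Raman data.
resampled = resample([400, 500, 600], [0, 10, 20], [450, 500, 700])
scaled = normalize([2, 4, 6])
```

Putting every spectrum on the same grid and scale is what makes point-by-point comparison between spectra from different publications meaningful.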

Two algorithms were implemented to enhance the accuracy of molecular identification. The first compares full spectra using similarity metrics, while the second focuses on matching the key peaks of Raman signatures. When tested on pure biomolecule samples, both approaches achieved 100 percent accuracy in identifying not only the specific molecule but also its general category, such as protein or lipid. This demonstrates a level of precision that could help standardize Raman analysis across laboratories and reduce dependence on subjective visual interpretation of spectra.
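The two strategies can be illustrated with a short sketch. The article does not say which similarity metric or peak-matching rule the researchers used, so cosine similarity, the 5 cm⁻¹ tolerance, and the names `cosine_similarity`, `peak_match_score`, and `identify` are all assumptions chosen for illustration. The library entries are made-up toy values, not real spectra.

```python
import math


def cosine_similarity(a, b):
    """Full-spectrum comparison: cosine of the angle between two intensity vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def peak_match_score(query_peaks, ref_peaks, tol=5.0):
    """Peak comparison: fraction of reference peaks that have a query peak
    within `tol` wavenumber units (the tolerance is a hypothetical default)."""
    if not ref_peaks:
        return 0.0
    hits = sum(1 for r in ref_peaks if any(abs(r - q) <= tol for q in query_peaks))
    return hits / len(ref_peaks)


def identify(query, library, score):
    """Return the library entry whose reference scores highest against the query."""
    return max(library, key=lambda name: score(query, library[name]))


# Toy library: spectra already resampled onto the same four-point grid.
spectra = {"molecule_a": [0.0, 1.0, 0.0, 0.0], "molecule_b": [0.0, 0.0, 1.0, 0.0]}
best_by_spectrum = identify([0.1, 0.9, 0.0, 0.1], spectra, cosine_similarity)

# Toy library: peak positions only.
peaks = {"molecule_a": [1003, 1450, 1655], "molecule_b": [520, 850, 1100]}
best_by_peaks = identify([1001, 1452, 1660], peaks, peak_match_score)
```

Full-spectrum comparison uses every data point and requires spectra on a common grid, while peak matching needs only a list of peak positions, which can make it more tolerant of spectra reconstructed from figures.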

For biomedical and engineering applications, the database represents a step toward more automated and objective molecular analysis. Raman spectroscopy’s ability to examine samples without altering them is particularly useful in medical diagnostics, where precision and non-invasiveness are essential. The researchers emphasize that accessible, high-quality data are vital for training artificial intelligence models capable of interpreting biological spectra and identifying subtle changes linked to disease processes.

Despite its strengths, the project still faces challenges. The current library covers 140 biomolecules, only a fraction of the thousands found in biological systems. Many of the collected spectra were reconstructed from published figures rather than raw measurements, which limits data quality. Furthermore, while the algorithms perform well on pure biomolecules, complex biological mixtures, in which the signals of many molecules overlap, may be harder to separate and analyze. Continued collaboration across the scientific community will be key to expanding the dataset and improving its accuracy.

The team envisions the library growing into a collaborative, community-driven resource. They hope that other researchers will contribute new spectra and validation data, helping the library become a widely recognized reference for molecular spectroscopy. This open-science approach could also accelerate the development of machine-learning models capable of interpreting spectral data in real time, advancing both engineering applications and biomedical research.

The creation of this Raman spectral library aligns with a broader movement in analytical science toward transparency, reproducibility, and open data. Similar efforts have emerged across materials science and biotechnology, highlighting the importance of accessible spectral databases for machine-learning integration and sensor development. For engineers, the library provides a standardized tool to benchmark Raman-based instrumentation and design new diagnostic and sensing systems grounded in consistent molecular data.

According to Terán, the long-term goal is to enable Raman spectroscopy to become a faster and more objective method for identifying molecular components in biological and chemical systems. As the database expands, it could form the backbone for artificial intelligence models that interpret spectra automatically, paving the way for new applications in diagnostics, monitoring, and real-time molecular sensing.

The study marks a milestone for open science in spectroscopy, providing the foundation for future tools that blend optics, data engineering, and artificial intelligence to better understand the molecular world.
