Data sources and curation
We used the species list available in the SALVE database as input to acquire the genomic data. The SALVE database comprises 15,821 species (as of October 6th, 2025), encompassing the taxonomic groups Amphibia, Aves, Mammalia, Reptilia, and invertebrates (see Fig. 3F). So far, the GenRefBR web portal has incorporated 9,869 vertebrate species, including Actinopterygii (n = 4,944), Amphibia (n = 1,102), Aves (n = 2,037), Elasmobranchii (n = 184), Holocephali (n = 6), Mammalia (n = 750), Myxini (n = 5), Reptilia (n = 840), and Sarcopterygii (n = 1). Actinopterygii, Elasmobranchii, Holocephali, Myxini, and Sarcopterygii were grouped under the category “Fishes” (n = 5,140). In addition to the focal taxonomic groups, the list also includes information on species endemism, updated taxonomic classifications, IUCN Red List categories, and biome-level distribution data.
Subsequently, the BioPython Entrez module was used to access the NCBI API with the species list obtained from SALVE. For mitochondrial data, a filter was applied within the Nucleotide database (Fig. 2). The corresponding accession IDs were retrieved, and the associated GenBank files were downloaded. An in-house Python script was used to parse all the information contained in these files, including the submission date, sequence topology (linear or circular), sequence completeness, and nucleotide sequences for coding sequences (CDS), ribosomal RNA (rRNA), and transfer RNA (tRNA). The names of CDS (atp6, atp8, cox1-cox3, nad1-nad6, nad4L, and cytb), rRNA (rrnS and rrnL), and tRNA (trn[amino acid code], e.g., trnA) were standardized to enable consistent counting and grouping of all these elements. Based on this information, it was possible to identify which organisms possess specific genomic features and whether the mitogenome is complete or a draft. A mitogenome is considered complete if the sequence contains more than 10,000 base pairs (bp), 13 CDS, 22 tRNAs, and two rRNA genes, and is deposited as a circular molecule. A draft mitogenome is defined as a sequence exceeding 10,000 bp and containing more than 10 CDS. In addition to data retrieved from NCBI, mitochondrial barcode sequences were obtained from the BOLD Systems database via the public data packages available at https://boldsystems.org/data/data-packages (as of March 28th, 2025). All geographic coordinate data for species occurrences were retrieved from the Global Biodiversity Information Facility (GBIF) via https://www.gbif.org/occurrence/search?occurrence_status=present. For both databases, we used the same species list and retained only relevant species by applying a Python-based filtering script.
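The completeness rules and the gene-name standardization described above can be sketched as follows. The function names, alias table, and thresholds are our illustrative reconstruction of the stated criteria, not the actual GenRefBR parsing script.

```python
# Map a few common GenBank feature labels onto standardized names
# (table is illustrative, not exhaustive).
GENE_ALIASES = {
    "COI": "cox1", "COX1": "cox1", "COB": "cytb", "CYTB": "cytb",
    "ND1": "nad1", "ATPASE6": "atp6", "12S": "rrnS", "16S": "rrnL",
}

def standardize(name):
    """Return the standardized gene name, lowercasing unknown labels."""
    return GENE_ALIASES.get(name.upper(), name.lower())

def classify_mitogenome(length_bp, n_cds, n_trna, n_rrna, topology):
    """Classify a record as 'complete', 'draft', or 'partial' per the
    criteria stated in the text."""
    if (length_bp > 10_000 and n_cds >= 13 and n_trna >= 22
            and n_rrna >= 2 and topology == "circular"):
        return "complete"
    if length_bp > 10_000 and n_cds > 10:
        return "draft"
    return "partial"
```

For example, a 16.5 kb circular record with all 13 CDS, 22 tRNAs, and both rRNAs is classified as complete, while a linear 15.8 kb record with 12 CDS is a draft.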
GenRefBR uses the NCBI Genome database to acquire nuclear genome data. The table available at https://ftp.ncbi.nlm.nih.gov/genomes/GENOME_REPORTS/eukaryotes.txt was used as a reference. This file is part of a set of summary reports provided by NCBI that organize genome assembly information submitted to the International Nucleotide Sequence Database Collaboration (INSDC). Genome assembly accessions, along with basic assembly metrics, were extracted from this table. The corresponding BioSample IDs were then used to retrieve metadata on sequencing platforms and library preparation methods from the NCBI Sequence Read Archive (SRA) database. In addition, the SRA database was queried to identify available sequencing data for the initial species list.
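A minimal sketch of filtering the genome report by a species list with pandas. The column names follow the report's header as of writing and should be verified against the live file; the two-row sample below is fabricated in memory to stand in for the downloaded table.

```python
import io
import pandas as pd

# Fabricated two-row stand-in for eukaryotes.txt (tab-separated; real file
# has many more columns and rows).
sample = (
    "#Organism/Name\tTaxID\tSize (Mb)\tAssembly Accession\tStatus\tBioSample Accession\n"
    "Species one\t1\t2500.0\tGCA_000001.1\tScaffold\tSAMN0000001\n"
    "Species two\t2\t2700.0\tGCA_000002.1\tChromosome\tSAMN0000002\n"
)

# Restrict the report to the species of interest and pull assembly accessions.
species_of_interest = {"Species one"}
reports = pd.read_csv(io.StringIO(sample), sep="\t")
filtered = reports[reports["#Organism/Name"].isin(species_of_interest)]
accessions = filtered["Assembly Accession"].tolist()
```

In the real pipeline, the `BioSample Accession` column of the filtered frame would then drive the SRA metadata queries.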
The taxonomic framework used for constructing the GenRefBR database was primarily obtained from the SALVE platform. However, in cases where taxonomic information was unavailable, such as for subspecies, taxonomic data were obtained from the NCBI Taxonomy database. This taxonomic database was also used to identify species with taxonomic differences relative to the SALVE database.
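The SALVE-first, NCBI-fallback merge can be expressed as a small pure function. The data structures here (name mapped to family) are simplified stand-ins for full lineages, and the species entries are only examples.

```python
def resolve_taxonomy(name, salve_tax, ncbi_tax):
    """Prefer SALVE's classification; fall back to NCBI Taxonomy
    (e.g., for subspecies absent from SALVE)."""
    return salve_tax.get(name) or ncbi_tax.get(name)

def taxonomic_conflicts(salve_tax, ncbi_tax):
    """Names classified differently by the two sources."""
    return {n for n in salve_tax
            if n in ncbi_tax and salve_tax[n] != ncbi_tax[n]}

# Toy example: the subspecies exists only in the NCBI-derived mapping.
salve = {"Chiroxiphia caudata": "Pipridae"}
ncbi = {"Chiroxiphia caudata": "Pipridae",
        "Chiroxiphia caudata caudata": "Pipridae"}
```

`taxonomic_conflicts` corresponds to the step of flagging species whose classification differs between NCBI Taxonomy and SALVE.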
System Architecture and Configuration
GenRefBR was developed in Python using the Dash framework (https://plotly.com/dash/), a tool designed for building interactive web dashboards. Dash is built on the Flask web framework (https://flask.palletsprojects.com/), the Plotly.js data visualization library (https://plotly.com/javascript/), and the React JavaScript library (https://react.dev/), which composes the user-interface elements. Data loading and manipulation from .csv files are handled using the Pandas library (https://pandas.pydata.org/), enabling efficient and flexible analysis of tabular data. The visual style is developed using the Tailwind CSS framework (https://tailwindcss.com/). The application is internally structured into multiple pages, with layout, callbacks, and visualization files organized separately.
Each page layout is composed of Dash components: Dash HTML Components, which structure the interface with elements such as titles, text, dividers, and containers, and Dash Core Components, which provide interactive elements such as drop-down menus, sliders, and input fields. Together these keep the interface consistent, modern, and responsive. Data visualizations — such as bar charts, line charts, and scatter plots — are generated using the Plotly library and integrated as dynamic components into the page layouts. These elements, along with inputs and other layout components, can be dynamically updated through callbacks, which define the interaction logic by monitoring changes in input components and triggering real-time updates across the interface.
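The callback pattern above can be sketched with a pure function for the update logic; the Dash decorator wiring is shown only as a comment, since the component ids (`group-dropdown`, `species-count`) are hypothetical and not taken from the GenRefBR source.

```python
import pandas as pd

def count_species_by_group(df, selected_group):
    """Return the number of species matching the group chosen in a
    dropdown — the kind of logic a callback body would run."""
    return int((df["group"] == selected_group).sum())

# In a Dash app this function would be registered roughly as:
# @app.callback(Output("species-count", "children"),
#               Input("group-dropdown", "value"))
# def update_count(selected_group):
#     return count_species_by_group(SPECIES_DF, selected_group)

# Toy table standing in for the portal's species data.
species_df = pd.DataFrame({
    "species": ["A", "B", "C"],
    "group": ["Aves", "Aves", "Mammalia"],
})
```

Keeping the computation in a plain function, separate from the callback registration, matches the project's separation of layout, callback, and visualization files.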
Deployment
The system architecture is containerized for production, consisting of a single Docker container managed via a docker-compose.yml file. This setup allows the application and its dependencies to be launched with a single command, streamlining infrastructure management and deployment automation. The back-end container runs the Dash application using the Gunicorn WSGI server (https://gunicorn.org/), providing scalability and robustness for handling concurrent HTTP requests in a production environment. As an external service, Microsoft Azure Application Gateway manages HTTP/HTTPS traffic. It acts as a reverse proxy, forwarding requests to the application container while also handling SSL termination, request routing, and optional features such as authentication and load balancing.
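A docker-compose.yml of the kind described might look like the fragment below. The service name, port, and worker count are assumptions for illustration; `app:server` is the conventional Gunicorn entry point for a Dash app, where `server` is the underlying Flask instance.

```yaml
# Hypothetical sketch, not the project's actual configuration.
services:
  genrefbr:
    build: .
    command: gunicorn --workers 4 --bind 0.0.0.0:8050 app:server
    ports:
      - "8050:8050"
    restart: unless-stopped
```

TLS is not configured here because, as described above, SSL termination is handled upstream by Azure Application Gateway.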
Dependency management and version control
Python dependency management and environment isolation are handled using Poetry (https://python-poetry.org/). Poetry allows precise version control of dependencies and virtual environments through a declarative configuration (pyproject.toml). It simplifies the installation, updating, and locking of packages, ensuring reproducibility across systems.
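An illustrative pyproject.toml fragment of the kind Poetry consumes; the package versions are placeholders, not the project's pinned dependencies.

```toml
[tool.poetry]
name = "genrefbr"
version = "0.1.0"

[tool.poetry.dependencies]
python = "^3.11"
dash = "^2.17"
pandas = "^2.2"
biopython = "^1.83"
```

Running `poetry install` against such a file resolves the constraints and records exact versions in poetry.lock, which is what makes the environment reproducible across systems.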
A .gitignore file is configured to exclude unnecessary or environment-specific files from the Git repository. This includes cache files, log files, local configurations, and other artifacts that should not be included in version control. This practice helps keep the repository clean and focused.
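A short excerpt showing the kind of entries involved; these patterns are typical examples, not the project's actual .gitignore.

```
__pycache__/
*.log
.env
.venv/
data/cache/
```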
Automated code quality checks
A robust pre-commit hook system is integrated into the repository to enforce code quality standards automatically before commits are accepted. This includes: blocking commits to protected branches, detecting large or binary files unintentionally staged, validating script executability and proper shebang usage, preventing the inclusion of private keys or unresolved merge conflicts, detecting leftover debugging statements, ensuring consistent formatting and line endings in code and configuration files, validating and formatting structured data (JSON, YAML, TOML, XML), and enforcing Python coding standards. These checks are version-controlled and based on widely used community hooks, ensuring predictable and consistent behavior in every environment.
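The listed checks map closely onto hook ids from the community pre-commit-hooks repository, as sketched below; the `rev` is a placeholder, and a separate repo entry (e.g., a linter such as flake8 or a formatter such as black) would cover the Python coding standards.

```yaml
# Sketch of a .pre-commit-config.yaml covering the checks listed above.
repos:
  - repo: https://github.com/pre-commit/pre-commit-hooks
    rev: v4.6.0  # placeholder revision
    hooks:
      - id: no-commit-to-branch            # protected branches
      - id: check-added-large-files        # large/binary files
      - id: check-executables-have-shebangs
      - id: detect-private-key
      - id: check-merge-conflict
      - id: debug-statements               # leftover breakpoints/pdb calls
      - id: mixed-line-ending              # consistent line endings
      - id: check-json
      - id: check-yaml
      - id: check-toml
      - id: check-xml
```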