Data sources and curation
We used the species list available in the SALVE database as input to acquire the genomic data. The SALVE database comprises 15,821 species (as of October 6th, 2025), encompassing the taxonomic groups Amphibia, Aves, Mammalia, Reptilia, and invertebrates (see Fig. 3F). So far, the GenRefBR web portal has incorporated 9,869 vertebrate species, including Actinopterygii (n = 4,944), Amphibia (n = 1,102), Aves (n = 2,037), Elasmobranchii (n = 184), Holocephali (n = 6), Mammalia (n = 750), Myxini (n = 5), Reptilia (n = 840), and Sarcopterygii (n = 1). Actinopterygii, Elasmobranchii, Holocephali, Myxini, and Sarcopterygii were grouped under the category “Fishes” (n = 5,140). In addition to the focal taxonomic groups, the list also includes information on species endemism, updated taxonomic classifications, IUCN Red List categories, and biome-level distribution data.
Subsequently, the BioPython Entrez module was used to access the NCBI API with the species list obtained from SALVE. For mitochondrial data, a filter was applied within the Nucleotide database (Fig. 2). The corresponding accession IDs were retrieved, and the associated GenBank files were downloaded. An in-house Python script was used to parse all the information contained in these files, including the submission date, sequence topology (linear or circular), sequence completeness, and nucleotide sequences for coding sequences (CDS), ribosomal RNA (rRNA), and transfer RNA (tRNA). The names of CDS (atp6, atp8, cox1-cox3, nad1-nad6, nad4L, and cytb), rRNA (rrnS and rrnL), and tRNA (trn[amino acid code], e.g., trnA) were standardized to enable consistent counting and grouping of all these elements. Based on this information, it was possible to identify which organisms possess specific genomic features and whether the mitogenome is complete or a draft. A mitogenome is considered complete if the sequence contains more than 10,000 base pairs (bp), 13 CDS, 22 tRNAs, and two rRNA genes, and is deposited as a circular molecule. A draft mitogenome is defined as a sequence exceeding 10,000 bp and containing more than 10 CDS. In addition to data retrieved from NCBI, mitochondrial barcode sequences were obtained from the BOLD Systems database via the public data packages available at https://boldsystems.org/data/data-packages (as of March 28th, 2025). All geographic coordinate data for species occurrences were retrieved from the Global Biodiversity Information Facility (GBIF) via https://www.gbif.org/occurrence/search?occurrence_status=present. For both databases, we used the same species list and retained only relevant species by applying a Python-based filtering script.
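The completeness rules and the gene-name standardization described above can be sketched as follows. The function names, alias table, and thresholds are our illustrative reconstruction of the stated criteria, not the actual GenRefBR parsing script.

```python
# Map a few common GenBank feature labels onto standardized names
# (table is illustrative, not exhaustive).
GENE_ALIASES = {
    "COI": "cox1", "COX1": "cox1", "COB": "cytb", "CYTB": "cytb",
    "ND1": "nad1", "ATPASE6": "atp6", "12S": "rrnS", "16S": "rrnL",
}

def standardize(name):
    """Return the standardized gene name, lowercasing unknown labels."""
    return GENE_ALIASES.get(name.upper(), name.lower())

def classify_mitogenome(length_bp, n_cds, n_trna, n_rrna, topology):
    """Classify a record as 'complete', 'draft', or 'partial' per the
    criteria stated in the text."""
    if (length_bp > 10_000 and n_cds >= 13 and n_trna >= 22
            and n_rrna >= 2 and topology == "circular"):
        return "complete"
    if length_bp > 10_000 and n_cds > 10:
        return "draft"
    return "partial"
```

For example, a 16.5 kb circular record with all 13 CDS, 22 tRNAs, and both rRNAs is classified as complete, while a linear 15.8 kb record with 12 CDS is a draft.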
GenRefBR uses the NCBI Genome database to acquire nuclear genome data. The table available at https://ftp.ncbi.nlm.nih.gov/genomes/GENOME_REPORTS/eukaryotes.txt was used as a reference. This file is part of a set of summary reports provided by NCBI that organize genome assembly information submitted to the International Nucleotide Sequence Database Collaboration (INSDC). Genome assembly accessions, along with basic assembly metrics, were extracted from this table. The corresponding BioSample IDs were then used to retrieve metadata on sequencing platforms and library preparation methods from the NCBI Sequence Read Archive (SRA) database. In addition, the SRA database was queried to identify available sequencing data for the initial species list.
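A minimal sketch of filtering the genome report by a species list with pandas. The column names follow the report's header as of writing and should be verified against the live file; the two-row sample below is fabricated in memory to stand in for the downloaded table.

```python
import io
import pandas as pd

# Fabricated two-row stand-in for eukaryotes.txt (tab-separated; real file
# has many more columns and rows).
sample = (
    "#Organism/Name\tTaxID\tSize (Mb)\tAssembly Accession\tStatus\tBioSample Accession\n"
    "Species one\t1\t2500.0\tGCA_000001.1\tScaffold\tSAMN0000001\n"
    "Species two\t2\t2700.0\tGCA_000002.1\tChromosome\tSAMN0000002\n"
)

# Restrict the report to the species of interest and pull assembly accessions.
species_of_interest = {"Species one"}
reports = pd.read_csv(io.StringIO(sample), sep="\t")
filtered = reports[reports["#Organism/Name"].isin(species_of_interest)]
accessions = filtered["Assembly Accession"].tolist()
```

In the real pipeline, the `BioSample Accession` column of the filtered frame would then drive the SRA metadata queries.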
The taxonomic framework used for constructing the GenRefBR database was primarily obtained from the SALVE platform. However, in cases where taxonomic information was unavailable, such as for subspecies, taxonomic data were obtained from the NCBI Taxonomy database. This taxonomic database was also used to identify species with taxonomic differences relative to the SALVE database.
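The SALVE-first, NCBI-fallback merge can be expressed as a small pure function. The data structures here (name mapped to family) are simplified stand-ins for full lineages, and the species entries are only examples.

```python
def resolve_taxonomy(name, salve_tax, ncbi_tax):
    """Prefer SALVE's classification; fall back to NCBI Taxonomy
    (e.g., for subspecies absent from SALVE)."""
    return salve_tax.get(name) or ncbi_tax.get(name)

def taxonomic_conflicts(salve_tax, ncbi_tax):
    """Names classified differently by the two sources."""
    return {n for n in salve_tax
            if n in ncbi_tax and salve_tax[n] != ncbi_tax[n]}

# Toy example: the subspecies exists only in the NCBI-derived mapping.
salve = {"Chiroxiphia caudata": "Pipridae"}
ncbi = {"Chiroxiphia caudata": "Pipridae",
        "Chiroxiphia caudata caudata": "Pipridae"}
```

`taxonomic_conflicts` corresponds to the step of flagging species whose classification differs between NCBI Taxonomy and SALVE.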
System Architecture and Configuration
GenRefBR was developed in Python using the Dash framework (https://plotly.com/dash/), a tool designed for building interactive web dashboards. Dash is built on the Flask web framework (https://flask.palletsprojects.com/), the Plotly.js data visualization library (https://plotly.com/javascript/), and the React JavaScript library (https://react.dev/), which composes the user-interface elements. Data loading and manipulation from .csv files are handled using the Pandas library (https://pandas.pydata.org/), enabling efficient and flexible analysis of tabular data. The visual style is developed using the Tailwind CSS framework (https://tailwindcss.com/). The application is internally structured into multiple pages, with layout, callbacks, and visualization files organized separately.
Each page layout is composed of Dash components: Dash HTML Components, which structure the interface with elements such as titles, text, dividers, and containers, and Dash Core Components, which provide interactive elements such as drop-down menus, sliders, and input fields. Together these keep the interface consistent, modern, and responsive. Data visualizations — such as bar charts, line charts, and scatter plots — are generated using the Plotly library and integrated as dynamic components into the page layouts. These elements, along with inputs and other layout components, can be dynamically updated through callbacks, which define the interaction logic by monitoring changes in input components and triggering real-time updates across the interface.
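The callback pattern above can be sketched with a pure function for the update logic; the Dash decorator wiring is shown only as a comment, since the component ids (`group-dropdown`, `species-count`) are hypothetical and not taken from the GenRefBR source.

```python
import pandas as pd

def count_species_by_group(df, selected_group):
    """Return the number of species matching the group chosen in a
    dropdown — the kind of logic a callback body would run."""
    return int((df["group"] == selected_group).sum())

# In a Dash app this function would be registered roughly as:
# @app.callback(Output("species-count", "children"),
#               Input("group-dropdown", "value"))
# def update_count(selected_group):
#     return count_species_by_group(SPECIES_DF, selected_group)

# Toy table standing in for the portal's species data.
species_df = pd.DataFrame({
    "species": ["A", "B", "C"],
    "group": ["Aves", "Aves", "Mammalia"],
})
```

Keeping the computation in a plain function, separate from the callback registration, matches the project's separation of layout, callback, and visualization files.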
Deployment
The system architecture is containerized for production, consisting of a single Docker container managed via a docker-compose.yml file. This setup allows the application and its dependencies to be launched with a single command, streamlining infrastructure management and deployment automation. The back-end container runs the Dash application using the Gunicorn WSGI server (https://gunicorn.org/), providing scalability and robustness for handling concurrent HTTP requests in a production environment. As an external service, Microsoft Azure Application Gateway manages HTTP/HTTPS traffic. It acts as a reverse proxy, forwarding requests to the application container while also handling SSL termination, request routing, and optional features such as authentication and load balancing.
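A docker-compose.yml of the kind described might look like the fragment below. The service name, port, and worker count are assumptions for illustration; `app:server` is the conventional Gunicorn entry point for a Dash app, where `server` is the underlying Flask instance.

```yaml
# Hypothetical sketch, not the project's actual configuration.
services:
  genrefbr:
    build: .
    command: gunicorn --workers 4 --bind 0.0.0.0:8050 app:server
    ports:
      - "8050:8050"
    restart: unless-stopped
```

TLS is not configured here because, as described above, SSL termination is handled upstream by Azure Application Gateway.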
Dependency management and version control
Python dependency management and environment isolation are handled using Poetry (https://python-poetry.org/). Poetry allows precise version control of dependencies and virtual environments through a declarative configuration (pyproject.toml). It simplifies the installation, updating, and locking of packages, ensuring reproducibility across systems.
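An illustrative pyproject.toml fragment of the kind Poetry consumes; the package versions are placeholders, not the project's pinned dependencies.

```toml
[tool.poetry]
name = "genrefbr"
version = "0.1.0"

[tool.poetry.dependencies]
python = "^3.11"
dash = "^2.17"
pandas = "^2.2"
biopython = "^1.83"
```

Running `poetry install` against such a file resolves the constraints and records exact versions in poetry.lock, which is what makes the environment reproducible across systems.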
A .gitignore file is configured to exclude unnecessary or environment-specific files from the Git repository. This includes cache files, log files, local configurations, and other artifacts that should not be included in version control. This practice helps keep the repository clean and focused.
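A short excerpt showing the kind of entries involved; these patterns are typical examples, not the project's actual .gitignore.

```
__pycache__/
*.log
.env
.venv/
data/cache/
```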
Automated code quality checks
A robust pre-commit hook system is integrated into the repository to enforce code quality standards automatically before commits are accepted. This includes: blocking commits to protected branches, detecting large or binary files unintentionally staged, validating script executability and proper shebang usage, preventing the inclusion of private keys or unresolved merge conflicts, detecting leftover debugging statements, ensuring consistent formatting and line endings in code and configuration files, validating and formatting structured data (JSON, YAML, TOML, XML), and enforcing Python coding standards. These checks are version-controlled and based on widely used community hooks, ensuring predictable and consistent behavior in every environment.
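The listed checks map closely onto hook ids from the community pre-commit-hooks repository, as sketched below; the `rev` is a placeholder, and a separate repo entry (e.g., a linter such as flake8 or a formatter such as black) would cover the Python coding standards.

```yaml
# Sketch of a .pre-commit-config.yaml covering the checks listed above.
repos:
  - repo: https://github.com/pre-commit/pre-commit-hooks
    rev: v4.6.0  # placeholder revision
    hooks:
      - id: no-commit-to-branch            # protected branches
      - id: check-added-large-files        # large/binary files
      - id: check-executables-have-shebangs
      - id: detect-private-key
      - id: check-merge-conflict
      - id: debug-statements               # leftover breakpoints/pdb calls
      - id: mixed-line-ending              # consistent line endings
      - id: check-json
      - id: check-yaml
      - id: check-toml
      - id: check-xml
```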