bdc

A toolkit for standardizing, integrating, and cleaning biodiversity data.
https://github.com/brunobrr/bdc

Category: Biosphere
Sub Category: Biodiversity Data Cleaning and Standardization

Keywords

bdc biodiversity-data workflow

Last synced: about 8 hours ago
Repository metadata

Check out the vignettes with detailed documentation on each module of the bdc package

README.Rmd

---
output: github_document
editor_options: 
  markdown: 
    wrap: 80
---



```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "man/figures/README-",
  out.width = "100%"
)
```

# ***bdc*** 

## **A toolkit for standardizing, integrating, and cleaning biodiversity data**



[![CRAN
status](https://www.r-pkg.org/badges/version/bdc)](https://CRAN.R-project.org/package=bdc)
[![downloads](https://cranlogs.r-pkg.org/badges/grand-total/bdc)](https://cranlogs.r-pkg.org:443/badges/grand-total/bdc)


[![R-CMD-check](https://github.com/brunobrr/bdc/actions/workflows/R-CMD-check.yaml/badge.svg)](https://github.com/brunobrr/bdc/actions/workflows/R-CMD-check.yaml)
[![Codecov test
coverage](https://codecov.io/gh/brunobrr/bdc/branch/master/graph/badge.svg?token=9AUF86G9LJ)](https://app.codecov.io/gh/brunobrr/bdc)
[![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.6450390.svg)](https://doi.org/10.5281/zenodo.6450390)
[![License](https://img.shields.io/badge/license-GPL%20(%3E=%203)-lightgrey.svg?style=flat)](http://www.gnu.org/licenses/gpl-3.0.html)



#### **Overview**

Handling biodiversity data from several different sources is not an easy task.
Here, we present the **B**iodiversity **D**ata **C**leaning (*bdc*), an R
package to address quality issues and improve the fitness-for-use of biodiversity
datasets. *bdc* contains functions to harmonize and integrate data from
different sources following common standards and protocols, and implements
various tests and tools to flag, document, clean, and correct taxonomic,
spatial, and temporal data.

Compared to other available R packages, the main strength of the *bdc* package
is that it brings together available tools – and a series of new ones – for
assessing the quality of different dimensions of biodiversity data in a single,
flexible toolkit. The functions can be applied to a multitude of taxonomic
groups and datasets (including regional or local repositories), at the scale of
individual countries or worldwide.

#### **Structure of *bdc***

The *bdc* toolkit is organized in thematic modules related to different
biodiversity dimensions.

--------------------------------------------------------------------------------

> :warning: The modules illustrated, and **functions** within, **were linked to
> form** a proposed reproducible **workflow** (see
> [**vignettes**](https://brunobrr.github.io/bdc/)). However, all functions
> **can also be executed independently**.

--------------------------------------------------------------------------------

#### ![](https://raw.githubusercontent.com/brunobrr/bdc/master/inst/extdata/icon_vignettes/Figure1.png)


#### 1. [**Merge databases**](https://brunobrr.github.io/bdc/articles/integrate_datasets.html)

Standardization and integration of different datasets into a standard database.

- `bdc_standardize_datasets()` Standardization and integration of different
  datasets into a new dataset with column names following Darwin Core
  terminology

#### 2. [**Pre-filter**](https://brunobrr.github.io/bdc/articles/prefilter.html)

Flagging and removal of invalid or non-interpretable information, followed by
data amendments (e.g., correct transposed coordinates and standardize country
names).

- `bdc_scientificName_empty()` Identification of records lacking names or with
  uninterpretable names
- `bdc_coordinates_empty()` Identification of records lacking information on
  latitude or longitude
- `bdc_coordinates_outOfRange()` Identification of records with out-of-range
  coordinates (latitude \> 90 or \< -90; longitude \> 180 or \< -180)
- `bdc_basisOfRecords_notStandard()` Identification of records from doubtful
  sources (e.g., fossil or machine observation) that are impossible to interpret
  and not compatible with the Darwin Core recommended vocabulary
- `bdc_country_from_coordinates()` Derivation of country names from valid
  geographic coordinates
- `bdc_country_standardized()` Standardization of country names and retrieval of
  country codes
- `bdc_coordinates_transposed()` Identification of records with potentially
  transposed latitude and longitude
- `bdc_coordinates_country_inconsistent()` Identification of coordinates that
  fall in other countries or farther than a specified distance from the coast of
  a reference country (i.e., in the ocean)
- `bdc_coordinates_from_locality()` Identification of records lacking
  coordinates but with a detailed locality description from which coordinates
  can be derived

#### 3. [**Taxonomy**](https://brunobrr.github.io/bdc/articles/taxonomy.html)

Cleaning, parsing, and harmonization of scientific names against multiple
taxonomic references.

- `bdc_clean_names()` Name-checking routines to clean and split a taxonomic name
  into its binomial and authority components
- `bdc_query_names_taxadb()` Harmonization of scientific names by correcting
  spelling errors and converting nomenclatural synonyms to currently accepted
  names
- `bdc_filter_out_names()` Filtering of records according to their taxonomic
  status as recorded in the column "notes". For example, to keep only valid
  names categorized as "accepted"

#### 4. [**Space**](https://brunobrr.github.io/bdc/articles/space.html)

Flagging of erroneous, suspicious, and low-precision geographic coordinates.

- `bdc_coordinates_precision()` Identification of records with a coordinate
  precision below a specified number of decimal places
- `clean_coordinates()` (From the *CoordinateCleaner* package and part of the
  data-cleaning workflow). Identification of potentially problematic geographic
  coordinates based on geographic gazetteers and metadata. Includes tests for
  flagging records: around country capitals or country or province centroids,
  duplicated, with equal coordinates, around biodiversity institutions, within
  urban areas, with plain zeros in the coordinates, and suspected geographic
  outliers

#### 5. [**Time**](https://brunobrr.github.io/bdc/articles/time.html)

Flagging and, whenever possible, correction of inconsistent collection dates.

- `bdc_eventDate_empty()` Identification of records lacking information on event
  date (i.e., when a record was collected or observed)
- `bdc_year_outOfRange()` Identification of records with an illegitimate or
  potentially imprecise collecting year. The year provided can be out of range
  (e.g., in the future) or earlier than a year specified by the user (e.g.,
  1900)
- `bdc_year_from_eventDate()` Extraction of the four-digit year from
  unambiguously interpretable collecting dates

#### [**Other functions**](https://brunobrr.github.io/bdc/reference/index.html)

These functions aim to facilitate the **documentation, visualization, and
interpretation** of the results of data-quality tests. The package contains
functions for documenting the results of the data-cleaning tests, including
functions for saving i) records needing further inspection, ii) figures, and
iii) data-quality reports.

- `bdc_create_report()` Creation of data-quality reports documenting the results
  of data-quality tests and the taxonomic harmonization process
- `bdc_create_figures()` Creation of figures (i.e., bar plots and maps)
  reporting the results of data-quality tests
- `bdc_filter_out_flags()` Removal of columns containing the results of
  data-quality tests (i.e., columns starting with ".") or other specified
  columns
- `bdc_quickmap()` Creation of a map of points using ggplot2. Helpful in
  inspecting the results of data-cleaning tests
- `bdc_summary_col()` Creation or update of the column summarizing the results
  of data-quality tests (i.e., the column ".summary")

#### **Installation**

Install the released version from CRAN with:

```{r eval=FALSE}
install.packages("bdc")
library(bdc)
```

Or install the development version from
[GitHub](https://github.com/brunobrr/bdc) with:

```{r, message=FALSE, warning=FALSE,echo=TRUE,eval=FALSE}
install.packages("remotes")
remotes::install_github("brunobrr/bdc")
```

Load the package with:

```{r, message=FALSE, warning=FALSE,echo=TRUE,eval=TRUE}
library(bdc)
```

#### **Package website**

See the *bdc* package website (<https://brunobrr.github.io/bdc/>) for a detailed
explanation of each module.

#### **Getting help**

> If you encounter a clear bug, please file an issue
> [**here**](https://github.com/brunobrr/bdc/issues). For questions or
> suggestions, please send us an email (ribeiro.brr\@gmail.com).

#### **Citation**

Ribeiro, BR; Velazco, SJE; Guidoni-Martins, K; Tessarolo, G; Jardim, L;
Bachman, SP; Loyola, R (2022). bdc: A toolkit for standardizing, integrating,
and cleaning biodiversity data. Methods in Ecology and Evolution.
[doi.org/10.1111/2041-210X.13868](https://doi.org/10.1111/2041-210X.13868)
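
#### **Illustrative sketch of the flag-column convention**

The pre-filter and time modules described above flag records by adding columns
whose names start with ".", and `bdc_summary_col()` combines them into
".summary". The base-R sketch below only mimics that idea on toy data so the
logic is easy to follow. It is not the package's implementation: the helper
functions (`flag_coordinates_empty()`, `flag_coordinates_out_of_range()`,
`flag_year_out_of_range()`), the example records, the flag-column names, and
the convention that `TRUE` means "passed the test" are all assumptions made for
illustration.

```r
# Illustrative base-R sketch; NOT the bdc implementation.
# Toy records (hypothetical data) with Darwin Core-style column names.
records <- data.frame(
  scientificName   = c("Panthera onca", "Puma concolor",
                       "Tapirus terrestris", "Cerdocyon thous"),
  decimalLatitude  = c(-15.6, 95.2, -3.1, NA),
  decimalLongitude = c(-47.9, -60.0, -190.4, -45.0),
  year             = c(2015, 1998, 2003, 1850)
)

# Assumed convention: TRUE = record passes the test, FALSE = record is flagged.

# Flag records lacking latitude or longitude.
flag_coordinates_empty <- function(lat, lon) {
  !(is.na(lat) | is.na(lon))
}

# Flag latitude outside [-90, 90] or longitude outside [-180, 180].
# Comparisons that evaluate to NA (missing values) are not flagged here;
# the "empty" test above covers them.
flag_coordinates_out_of_range <- function(lat, lon) {
  in_range <- abs(lat) <= 90 & abs(lon) <= 180
  in_range[is.na(in_range)] <- TRUE
  in_range
}

# Flag years in the future or earlier than a user-supplied threshold.
flag_year_out_of_range <- function(year, year_threshold = 1900) {
  current_year <- as.integer(format(Sys.Date(), "%Y"))
  in_range <- year >= year_threshold & year <= current_year
  in_range[is.na(in_range)] <- TRUE
  in_range
}

# One flag column per test, each name starting with ".".
records$.coordinates_empty <- flag_coordinates_empty(
  records$decimalLatitude, records$decimalLongitude
)
records$.coordinates_outOfRange <- flag_coordinates_out_of_range(
  records$decimalLatitude, records$decimalLongitude
)
records$.year_outOfRange <- flag_year_out_of_range(
  records$year, year_threshold = 1900
)

# Summary column: TRUE only if the record passed every test.
flag_cols <- grep("^\\.", names(records), value = TRUE)
records$.summary <- Reduce(`&`, records[flag_cols])

# Keep the passing records and drop every flag column.
clean <- records[records$.summary, !startsWith(names(records), "."),
                 drop = FALSE]

print(records[, c("scientificName", flag_cols, ".summary")])
print(clean)
```

In the package itself, the corresponding checks are `bdc_coordinates_empty()`,
`bdc_coordinates_outOfRange()`, `bdc_year_outOfRange()`, `bdc_summary_col()`,
and `bdc_filter_out_flags()`; see the function reference for their actual
arguments and return values.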

Committers metadata

Last synced: 9 days ago

Total Commits: 907
Total Committers: 10
Avg Commits per committer: 90.7
Development Distribution Score (DDS): 0.494

Commits in past year: 68
Committers in past year: 7
Avg Commits per committer in past year: 9.714
Development Distribution Score (DDS) in past year: 0.618

Name Email Commits
Bruno R. Ribeiro r****r@g****m 459
Karlo Guidoni Martins k****s@g****m 248
lucas-jardim l****9@g****m 86
Santiago Velazco s****o@g****m 45
sjevelazco s****c@g****m 37
Geiziane g****s@g****m 26
Zander z****o@g****m 2
brunobrr b****o@M****l 2
Your Namebrunobrr y****u@e****m 1
Ronald Bergmann i****o@b****t 1

Issue and Pull Request metadata

Last synced: 1 day ago

Total issues: 31
Total pull requests: 85
Average time to close issues: 4 months
Average time to close pull requests: about 2 hours
Total issue authors: 20
Total pull request authors: 7
Average comments per issue: 3.29
Average comments per pull request: 0.04
Merged pull request: 81
Bot issues: 0
Bot pull requests: 0

Past year issues: 1
Past year pull requests: 2
Past year average time to close issues: N/A
Past year average time to close pull requests: less than a minute
Past year issue authors: 1
Past year pull request authors: 2
Past year average comments per issue: 0.0
Past year average comments per pull request: 0.0
Past year merged pull request: 1
Past year bot issues: 0
Past year bot pull requests: 0

More stats: https://issues.ecosyste.ms/repositories/lookup?url=https://github.com/brunobrr/bdc

Top Issue Authors

  • GilbertAlarcon-Cruz (4)
  • Barros56789 (3)
  • kguidonimartins (3)
  • black-snow (3)
  • lucas-jardim (2)
  • brunobrr (2)
  • julianastropp (1)
  • jt-tbc (1)
  • max-sfeeri (1)
  • sjevelazco (1)
  • fredtaka (1)
  • oliveirab (1)
  • rsbivand (1)
  • brunomioto (1)
  • paschatz (1)

Top Pull Request Authors

  • sjevelazco (37)
  • brunobrr (23)
  • kguidonimartins (20)
  • Geiziane (2)
  • matthewsrogan (1)
  • black-snow (1)
  • andrew-1234 (1)

Top Issue Labels

  • bug (3)
  • dependency (3)
  • faq (3)

Package metadata

cran.r-project.org: bdc

Biodiversity Data Cleaning

  • Homepage: https://brunobrr.github.io/bdc/ (website) https://github.com/brunobrr/bdc
  • Documentation: http://cran.r-project.org/web/packages/bdc/bdc.pdf
  • Licenses: GPL (≥ 3)
  • Latest release: 1.1.5 (published 4 months ago)
  • Last Synced: 2025-04-29T14:08:42.171Z (1 day ago)
  • Versions: 6
  • Dependent Packages: 1
  • Dependent Repositories: 1
  • Downloads: 388 Last month
  • Rankings:
    • Forks count: 7.999%
    • Stargazers count: 11.875%
    • Average: 18.78%
    • Downloads: 21.874%
    • Dependent repos count: 24.3%
    • Dependent packages count: 27.852%
  • Maintainers (1)

Dependencies

DESCRIPTION cran
  • CoordinateCleaner * imports
  • DT * imports
  • dplyr * imports
  • foreach * imports
  • fs * imports
  • ggplot2 * imports
  • here * imports
  • magrittr * imports
  • purrr * imports
  • qs * imports
  • readr * imports
  • rgnparser * imports
  • rnaturalearth * imports
  • sf >= 1.0.5 imports
  • stringdist * imports
  • stringi * imports
  • stringr * imports
  • taxadb >= 0.1.3 imports
  • tibble * imports
  • tidyselect * imports
  • DBI * suggests
  • contentid >= 0.0.15 suggests
  • countrycode * suggests
  • covr * suggests
  • cowplot * suggests
  • doParallel * suggests
  • duckdb >= 0.3.2 suggests
  • knitr >= 1.31 suggests
  • maps * suggests
  • markdown * suggests
  • rangeBuilder * suggests
  • rappdirs * suggests
  • raster * suggests
  • remotes * suggests
  • rlang >= 1.0.1 suggests
  • rmarkdown * suggests
  • rnaturalearthdata * suggests
  • rvest * suggests
  • sp * suggests
  • testthat >= 3.0.0 suggests
  • xml2 * suggests
.github/workflows/R-CMD-check.yaml actions
  • actions/cache v2 composite
  • actions/checkout v2 composite
  • actions/upload-artifact v2 composite
  • r-lib/actions/setup-pandoc v1 composite
  • r-lib/actions/setup-r v1 composite
.github/workflows/pkgdown.yaml actions
  • actions/cache v2 composite
  • actions/checkout v2 composite
  • r-lib/actions/setup-pandoc v1 composite
  • r-lib/actions/setup-r v1 composite

Score: 11.672490034641568