BioCube

Contains the code used to engineer BioCube: A Multimodal Dataset for Biodiversity Research.
https://github.com/biodt/bfm-data

Category: Biosphere
Sub Category: Biodiversity Data Access and Management

Keywords

biodiversity dataset engineering machine-learning multimodal


Repository metadata

BioCube: A Multimodal Dataset for Biodiversity

README.md

BioCube: Engineering a Multimodal Dataset for Biodiversity


Description

This repository contains the code used to engineer BioCube: A Multimodal Dataset for Biodiversity Research. The produced dataset can be found on BioCube's Hugging Face page, together with detailed descriptions of the modalities it contains.

This codebase offers the following core functionalities:

  • Download
  • Ingestion
  • Preprocessing
  • File handling & storage
  • Dataset creation
  • Batch creation

Getting started

Running the code, adding libraries, and generally managing the application is done with Poetry.

poetry run run-app      # for running the code
poetry add ...          # for adding dependencies

Download (New) Data

Currently you can download data by calling the respective modality function from src/main.py. The workflow is driven by command-line arguments, so you can use it without any code changes, or you can run the functions manually from main. The instructions for the parameters can be found in src/main.py.

era5(mode = 'range', start_year = '2020', end_year = '2024')
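
ERA5 data comes from the Copernicus Climate Data Store (the cdsapi package is listed as a dependency). As a rough, illustrative sketch of what a year-range download could look like, the snippet below loops over the years and issues one CDS request each; the dataset name, variable list, and output paths are assumptions, not the exact request built by the repository's era5 function.

import cdsapi

def download_era5_range(start_year: int, end_year: int, out_dir: str = "era5") -> None:
    # Illustrative only: one CDS request per year (dataset, variables and paths are assumptions)
    client = cdsapi.Client()  # reads credentials from ~/.cdsapirc
    for year in range(start_year, end_year + 1):
        client.retrieve(
            "reanalysis-era5-single-levels",
            {
                "product_type": "reanalysis",
                "variable": ["2m_temperature", "total_precipitation"],
                "year": str(year),
                "month": [f"{m:02d}" for m in range(1, 13)],
                "day": [f"{d:02d}" for d in range(1, 32)],
                "time": ["00:00", "12:00"],
                "format": "netcdf",
            },
            f"{out_dir}/era5_{year}.nc",
        )

download_era5_range(2020, 2024)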

Ingest Data

The relevant scripts can be found at src/data_ingestion. These scripts ingest data from CSV files that have been obtained manually. Running them, for example for the indicators of a region or of the whole world, creates a new CSV containing the countries, the bounding box of each country, and the values.

# To process all the agriculture files and create new CSVs
run_agriculture_data_processing(region = 'Europe', global_mode = False, irrigated = True, arable = True, cropland = True)

# And then to merge them into one file (/data/storage/data/Agriculture/Europe_combined_agriculture_data.csv)
run_agriculture_merging()
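
For illustration, the merging step can be thought of as an outer join of the per-type CSVs on the shared country columns. The sketch below uses pandas; the file names and column names are assumptions, not the exact ones used by run_agriculture_merging.

import functools
import pandas as pd

# Hypothetical per-type outputs of run_agriculture_data_processing
files = [
    "/data/storage/data/Agriculture/Europe_irrigated.csv",
    "/data/storage/data/Agriculture/Europe_arable.csv",
    "/data/storage/data/Agriculture/Europe_cropland.csv",
]
frames = [pd.read_csv(path) for path in files]

# Outer-join on the shared country / bounding-box columns (assumed names)
merged = functools.reduce(
    lambda left, right: left.merge(right, on=["country", "bounding_box"], how="outer"),
    frames,
)
merged.to_csv("/data/storage/data/Agriculture/Europe_combined_agriculture_data.csv", index=False)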

Preprocess Data

The scripts for the preprocessing workflows can be found at src/data_preprocessing. The script src/data_preprocessing/preprocessing.py combines all the preprocessing functions, which are then used to create the species dataset parquet file.
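
As a rough illustration of how such a combined preprocessing step can be composed, the sketch below chains two hypothetical cleaning functions over a pandas DataFrame; the function and column names are invented for the example and do not correspond to the actual contents of preprocessing.py.

import pandas as pd

def drop_incomplete_rows(df: pd.DataFrame) -> pd.DataFrame:
    # Hypothetical rule: drop records missing coordinates or timestamps
    return df.dropna(subset=["latitude", "longitude", "timestamp"])

def normalise_species_names(df: pd.DataFrame) -> pd.DataFrame:
    # Hypothetical rule: tidy the scientific names
    df["species"] = df["species"].str.strip().str.lower()
    return df

def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    # Apply the steps in sequence, the same way preprocessing.py combines its functions
    for step in (drop_incomplete_rows, normalise_species_names):
        df = step(df)
    return df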

Create the Dataset

First we have to create the species dataset. At this point we do not put all the images and sounds inside it. All the species data are located under /data/projects/data/Life. Just run:

create_species_dataset(root_folder="/data/projects/data/Life", filepath="/data/projects/processed_data/species_dataset.parquet", start_year=2000, end_year=2020)

Once the species parquet is created, it holds all the species data. Together with the CSVs for the indicators, the Red List and NDVI, we are ready to create the data batches.
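
As a quick sanity check, the resulting parquet can be inspected with pandas (pyarrow and fastparquet are both project dependencies):

import pandas as pd

species = pd.read_parquet("/data/projects/processed_data/species_dataset.parquet")
print(species.shape)    # number of records and columns
print(species.columns)  # attributes / modalities present
print(species.head())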

Create the Batch

At this point, we can select any kind of modalities and slice them for specific coordinates or timestamps, producing a unified representation we define as Batch. The structure is very flexible and easy to use for any kind of downstream task or use case, especially for Foundation Model training. A visualisation is given below.

[Figure: visualisation of the Batch structure]
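
As a rough, illustrative sketch (not the repository's actual class), a Batch can be thought of as a dictionary of per-modality arrays sliced to a shared spatial window and time range; the field names and shapes below are assumptions.

from dataclasses import dataclass
import numpy as np

@dataclass
class Batch:
    # Illustrative only: one spatio-temporal slice across the selected modalities
    lat_range: tuple[float, float]
    lon_range: tuple[float, float]
    timestamps: list[str]
    modalities: dict[str, np.ndarray]  # e.g. {"era5": ..., "species": ..., "ndvi": ...}

batch = Batch(
    lat_range=(35.0, 60.0),
    lon_range=(-10.0, 30.0),
    timestamps=["2020-01-01", "2020-01-02"],
    modalities={"era5": np.zeros((2, 4, 100, 160))},  # assumed (time, variable, lat, lon) layout
)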

Creating Batches can be done in two settings based on the sampling frequency (daily or monthly) and requires that you have downloaded BioCube and set up the path variables appropriately.

Daily

To create daily Batches, just call the function:

create_dataset(
    species_file="/data/projects/vector_db/species_dataset.parquet",
    era5_directory=paths.ERA5_DIR,
    agriculture_file=paths.AGRICULTURE_COMBINED_FILE,
    land_file=paths.LAND_COMBINED_FILE,
    forest_file=paths.FOREST_FILE,
    species_extinction_file=paths.SPECIES_EXTINCTION_FILE,
    load_type="day-by-day",
)

Monthly

To download BioCube and create monthly Batches, just run the script below:

bfm_data/dataset_creation/batch_creation/create_batches.sh

Or a step-by-step workflow:

# First run
python bfm_data/dataset_creation/batch_creation/scan_biocube.py --root biocube_data/data --out catalog_report.parquet
# Then run
python bfm_data/dataset_creation/batch_creation/build_batches_monthly.py
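
Before building the monthly Batches, the intermediate catalog_report.parquet can be inspected, for example to check how many files the scan found per modality; the 'modality' column name is an assumption about the report schema.

import pandas as pd

catalog = pd.read_parquet("catalog_report.parquet")
print(len(catalog), "files scanned")
if "modality" in catalog.columns:  # column name is an assumption
    print(catalog["modality"].value_counts())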

You can inspect the created Batches with streamlit run batch_viewer.py --data_dir ./batches; the viewer script is located in the same folder as the previous scripts.

To produce statistics from the Batches that can be used for downstream tasks (e.g. normalization), just run python batch_stats.py --batch_dir batches --out batches_stats.json
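
As an example of how such statistics can be consumed downstream, the sketch below standardises a variable with per-variable means and standard deviations; the JSON layout (a "mean"/"std" pair per variable) is an assumption about what batch_stats.py writes.

import json
import numpy as np

with open("batches_stats.json") as f:
    stats = json.load(f)

def normalise(values: np.ndarray, variable: str) -> np.ndarray:
    # Assumed layout: stats[variable] = {"mean": ..., "std": ...}
    return (values - stats[variable]["mean"]) / stats[variable]["std"]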

Storage

The Data folder contains the raw data.

The Dataset_files folder contains the CSV, TXT, and JSON files from the sources that we need in order to extract the data and save it to the Data folder.

The Modality folders contain TXT files recording which folders contain which modalities. These are produced by a one-off terminal command (there is no code for it) and should be kept up to date.

The Processed_data folder contains the label mappings, the timestamps extracted from the species dataset, and the species dataset itself.

Extra Information

For more detailed information about the available workflow settings, have a look at the documentation.

License

See LICENSE.txt.

Acknowledgments

This study has received funding from the European Union's Horizon Europe research and innovation programme under grant agreement No 101057437 (BioDT project, https://doi.org/10.3030/101057437). Views and opinions expressed are those of the author(s) only and do not necessarily reflect those of the European Union or the European Commission. Neither the European Union nor the European Commission can be held responsible for them.
This publication is part of the project Biodiversity Foundation Model of the research programme Computing Time on National Computing Facilities that is (co-) funded by the Dutch Research Council (NWO). We acknowledge NWO for providing access to Snellius, hosted by SURF through the Computing Time on National Computer Facilities call for proposals.
This work used the Dutch national e-infrastructure with the support of the SURF Cooperative using grant no. EINF-10148.

Citation

If you find our work useful, please consider citing us!

@article{stasinos2025biocube,
  title={BioCube: A Multimodal Dataset for Biodiversity Research},
  author={Stasinos, Stylianos and Mensio, Martino and Lazovik, Elena and Trantas, Athanasios},
  journal={arXiv preprint arXiv:2505.11568},
  year={2025}
}

Useful commands

Copy files between clusters: cluster_1 = a cluster, cluster_2 = SURF Snellius

1. SSH to cluster_1.
2. Generate an SSH key: ssh-keygen -t ed25519
3. Copy your public key (cat ~/.ssh/id_ed25519.pub) to https://portal.cua.surf.nl/user/keys
4. From cluster_1, SSH to cluster_2 to test the connection: ssh USERNAME@snellius.surf.nl, then exit.

# One of these two, find which one is better / faster: --ignore-existing skips files that already exist on the receiver, while --update only skips files that are newer on the receiver
rsync -a --ignore-existing --info=progress2 --info=name0 /data/projects/biodt/storage/ USERNAME@snellius.surf.nl:/projects/data/projects/biodt/storage
rsync -a --update --info=progress2 --info=name0 /data/projects/biodt/storage/ USERNAME@snellius.surf.nl:/projects/data/projects/biodt/storage


Committers metadata


Total Commits: 103
Total Committers: 4
Avg Commits per committer: 25.75
Development Distribution Score (DDS): 0.32

Commits in past year: 57
Committers in past year: 3
Avg Commits per committer in past year: 19.0
Development Distribution Score (DDS) in past year: 0.175

Name Email Commits
Martino Mensio m****o@t****l 70
Stylianos Stasinos s****s@t****l 22
Thanasis Trantas t****s@t****l 10
StasinosStylianos s****s@g****m 1



Issue and Pull Request metadata


Total issues: 0
Total pull requests: 1
Average time to close issues: N/A
Average time to close pull requests: 6 minutes
Total issue authors: 0
Total pull request authors: 1
Average comments per issue: 0
Average comments per pull request: 0.0
Merged pull request: 1
Bot issues: 0
Bot pull requests: 0

Past year issues: 0
Past year pull requests: 1
Past year average time to close issues: N/A
Past year average time to close pull requests: 6 minutes
Past year issue authors: 0
Past year pull request authors: 1
Past year average comments per issue: 0
Past year average comments per pull request: 0.0
Past year merged pull request: 1
Past year bot issues: 0
Past year bot pull requests: 0

More stats: https://issues.ecosyste.ms/repositories/lookup?url=https://github.com/biodt/bfm-data

Top Pull Request Authors

  • DjAzDeck (1)


Dependencies

poetry.lock pypi
  • 313 dependencies
pyproject.toml pypi
  • memray ^1.15.0 develop
  • pytest ^8.2.2 develop
  • apache-airflow (>=2.10.2)
  • apache-airflow-providers-celery (>=3.8.3)
  • apache-airflow-providers-postgres (>=5.13.1)
  • cachetools (>=5.5.0)
  • cartopy (>=0.24.1)
  • cdsapi (>=0.7.2)
  • country-bounding-boxes (>=0.2.3)
  • country-converter (>=1.2)
  • dask [dataframe] (>=2024.9.0)
  • earthengine-api (>=1.1.4)
  • fastparquet (>=2024.5.0)
  • ffmpeg (>=1.4)
  • flake8-pyproject (>=1.2.3)
  • geopandas (>=1.0.1)
  • geopy (>=2.4.1)
  • h5netcdf (>=1.6.1,<2.0.0)
  • librosa (>=0.10.2.post1)
  • matplotlib (>=3.9.1)
  • netcdf4 (>=1.7.1.post2)
  • nltk (>=3.9.1)
  • numpy (>=2.0.0)
  • opencv-python (>=4.10.0.84)
  • pandas (>=2.2.2)
  • path (>=16.14.0)
  • pds4-tools (>=1.3)
  • piexif (>=1.1.3)
  • pillow (>=10.4.0)
  • pre-commit (>=3.7.1)
  • pvl (>=1.3.2)
  • pyarrow (>=17.0.0)
  • pycountry (>=24.6.1)
  • pydub (>=0.25.1)
  • pygbif (>=0.6.4)
  • pyhdf (>=0.11.4)
  • python-dotenv (>=1.0.1)
  • rasterio (>=1.4.1)
  • requests (>=2.32.3)
  • reverse-geocode (>=1.6.5)
  • rich (>=13.7.1)
  • scikit-bio (>=0.6.2)
  • scikit-image (>=0.24.0)
  • streamlit (>=1.45.1,<2.0.0)
  • textblob (>=0.18.0.post0)
  • torch (>=2.5.1)
  • torchaudio (>=2.5.1)
  • torchtext (>=0.18.0)
  • torchvision (>=0.20.1)
  • transformers (>=4.44.2)
  • typer (>=0.13.1)
  • xarray (>=2024.7.0)
  • xenopy (>=0.0.4)

Score: 2.995732273553991