BIOSCAN-5M

A comprehensive multi-modal dataset comprised of over 5 million specimens, 98% of which are insects.
https://github.com/bioscan-ml/bioscan-5m

Category: Biosphere
Sub Category: Biodiversity Data Access and Management

Repository metadata

BIOSCAN-5M: A Multimodal Dataset for Insect Biodiversity

Host: GitHub
URL: https://github.com/bioscan-ml/bioscan-5m
Owner: bioscan-ml
License: other
Created: 2024-04-12T06:31:01.000Z (about 1 year ago)
Default Branch: main
Last Pushed: 2025-06-09T01:22:42.000Z (19 days ago)
Last Synced: 2025-06-17T19:02:48.995Z (11 days ago)
Language: Python
Homepage: https://biodiversitygenomics.net/projects/5m-insects/
Size: 99.6 MB
Stars: 14
Watchers: 5
Forks: 1
Open Issues: 1
Releases: 0
Metadata Files:
- Readme: README.md
- License: LICENSE

BIOSCAN-5M

Overview

This repository contains the code and data related to the BIOSCAN-5M
project.
BIOSCAN-5M is a comprehensive multi-modal dataset comprised of over 5 million specimens, 98% of which are insects.
Every record has both image and DNA data.

If you make use of the BIOSCAN-5M dataset and/or this code repository, please cite the following paper:

@inproceedings{gharaee2024bioscan5m,
    title={{BIOSCAN-5M}: A Multimodal Dataset for Insect Biodiversity},
    booktitle={Advances in Neural Information Processing Systems},
    author={Zahra Gharaee and Scott C. Lowe and ZeMing Gong and Pablo Millan Arias
        and Nicholas Pellegrino and Austin T. Wang and Joakim Bruslund Haurum
        and Iuliia Zarubiieva and Lila Kari and Dirk Steinke and Graham W. Taylor
        and Paul Fieguth and Angel X. Chang
    },
    editor={A. Globerson and L. Mackey and D. Belgrave and A. Fan and U. Paquet and J. Tomczak and C. Zhang},
    pages={36285--36313},
    publisher={Curran Associates, Inc.},
    year={2024},
    volume={37},
    url={https://proceedings.neurips.cc/paper_files/paper/2024/file/3fdbb472813041c9ecef04c20c2b1e5a-Paper-Datasets_and_Benchmarks_Track.pdf},
}

Getting Started with BIOSCAN-5M

I. Environment Setup

To set up the BIOSCAN-5M project, create and activate the required environment using the provided bioscan5m.yaml file.
Run the following commands:

conda env create -f bioscan5m.yaml

conda activate bioscan5m

II. Dataset Quick Start

Quickly access the BIOSCAN-5M dataset by installing the dataset package and initializing the data loader. Use the following commands:

pip install bioscan-dataset

from bioscan_dataset import BIOSCAN5M

ds = BIOSCAN5M("~/Datasets/bioscan-5m", download=True)

For more detailed information, please visit the BIOSCAN-5M Dataset Package.

III. Task-Specific Settings

Please note that to work with all modules connected to this repository,
you may need to install additional dependencies specific to each module (if any).
Be sure to follow the instructions provided within each module's folder for further setup details.

Dataset

We present the BIOSCAN-5M dataset to the machine learning community.
We hope this dataset will facilitate the development of tools to automate aspects of the monitoring of global insect biodiversity.

Each record of the BIOSCAN-5M dataset contains six primary attributes:

- RGB image
  - Metadata field: processid
- DNA barcode sequence
  - Metadata field: dna_barcode
- Barcode Index Number (BIN)
  - Metadata field: dna_bin
- Biological taxonomic classification
  - Metadata fields: phylum, class, order, family, subfamily, genus, species, taxon
- Geographical information
  - Metadata fields: country, province_state, latitude, longitude
- Specimen size
  - Metadata fields: image_measurement_value, area_fraction, scale_factor

Dataset Access

All dataset image packages and metadata files are accessible for download through the
Google Drive folder.
Additionally, the dataset is available on research and data sharing platforms such as Zenodo,
Kaggle, and Hugging Face.

Dataset Browser

The BIOSCAN-5M Dataset Browser is an interactive tool designed to explore the BIOSCAN-5M dataset efficiently.
It allows you to navigate through taxonomic ranks, visualize specimens, and analyze DNA barcode sequences.
The browser supports advanced filtering, sorting, and visualization capabilities to facilitate in-depth data exploration for researchers and developers.

Metadata

The dataset metadata file BIOSCAN_5M_Insect_Dataset_metadata contains biological, geographic, and
size information about the organisms. We provide this metadata in both CSV and JSON-LD formats.
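
As a rough illustration of how the flat metadata columns map onto the attribute groups listed earlier, the sketch below reads CSV rows with Python's standard `csv` module and groups the fields. The group names (`image`, `dna`, and so on), the helper functions, and the sample rows are all hypothetical; only the column names come from this README. Treating the five numeric-looking fields as floats is also an assumption.

```python
import csv
import io

# Column names are from the README; the group keys are illustrative only.
FIELD_GROUPS = {
    "image": ["processid"],
    "dna": ["dna_barcode", "dna_bin"],
    "taxonomy": ["phylum", "class", "order", "family",
                 "subfamily", "genus", "species", "taxon"],
    "geography": ["country", "province_state", "latitude", "longitude"],
    "size": ["image_measurement_value", "area_fraction", "scale_factor"],
}
# Assumption: these fields hold numbers and can be parsed as floats.
NUMERIC_FIELDS = {"latitude", "longitude", "image_measurement_value",
                  "area_fraction", "scale_factor"}


def group_record(row):
    """Split one flat CSV row into attribute groups; empty or missing -> None."""
    grouped = {}
    for group, fields in FIELD_GROUPS.items():
        values = {}
        for field in fields:
            raw = (row.get(field) or "").strip()
            if raw == "":
                values[field] = None
            elif field in NUMERIC_FIELDS:
                values[field] = float(raw)
            else:
                values[field] = raw
        grouped[group] = values
    return grouped


def read_metadata(handle):
    """Read every row from an open CSV file handle."""
    return [group_record(row) for row in csv.DictReader(handle)]


# Made-up sample with only a few columns; real files contain many more.
SAMPLE_CSV = (
    "processid,dna_bin,order,country,latitude,longitude,area_fraction\n"
    "EXAMPLE001,EXAMPLE-BIN-1,Diptera,Canada,43.5,-80.2,0.25\n"
    "EXAMPLE002,EXAMPLE-BIN-2,,,,,\n"
)
records = read_metadata(io.StringIO(SAMPLE_CSV))
print(records[0]["geography"])
```

For the real file, the same `read_metadata` function could be given a handle from `open(path, newline="")`, where `path` points at the downloaded CSV.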

RGB Image

The BIOSCAN-5M dataset comprises resized and cropped images.
We have provided various packages of the BIOSCAN-5M dataset, each tailored for specific purposes.

Cropped images

Building on the cropping tool introduced in BIOSCAN-1M, we trained a model on examples from this dataset so that it can automatically generate a bounding box around the insect in each image.
We used this tool to crop each image down to only the region of interest.

Image packages

BIOSCAN_5M_original_full: The raw images of the dataset.
BIOSCAN_5M_cropped: Images after cropping with our cropping tool.
BIOSCAN_5M_original_256: Original images resized to 256 on their shorter side.
BIOSCAN_5M_cropped_256: Cropped images resized to 256 on their shorter side.
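
To make "resized to 256 on their shorter side" concrete, here is a small sketch of the size arithmetic. It assumes the aspect ratio is preserved and the result is rounded to whole pixels; the function name is hypothetical, and the actual resizing code used to build the packages may round or interpolate differently.

```python
def resize_shorter_side(width, height, target=256):
    """Return (new_width, new_height) with the shorter side equal to `target`.

    Assumes the aspect ratio is preserved and sizes are rounded to whole pixels.
    """
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    shorter = min(width, height)
    new_width = max(1, round(width * target / shorter))
    new_height = max(1, round(height * target / shorter))
    return new_width, new_height


print(resize_shorter_side(1024, 768))  # (341, 256)
```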

Geographical Information

The BIOSCAN-5M dataset provides geographic information associated with the collection sites of the organisms.
The following geographic data is presented in the country, province_state, latitude, and
longitude fields of the metadata file(s):

Latitude and Longitude coordinates
Country
Province or State

Size Information

The BIOSCAN-5M dataset provides information about the size of the organisms.
The size data is presented in the image_measurement_value, area_fraction, and
scale_factor fields of the metadata file(s):

Image measurement value: Total number of pixels occupied by the organism

Furthermore, using our cropping tool, we calculated the following information about the size of the organisms:

Area fraction: Fraction of the original image that the cropped image comprises.
Scale factor: Ratio of the cropped image to the cropped and resized image.
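
The two derived quantities can be illustrated with a short sketch. It rests on two assumptions that this README does not confirm: that the area fraction is a ratio of pixel areas, and that the scale factor is the linear ratio between the shorter side of the cropped image and the 256-pixel shorter side of the cropped and resized image. The function names are hypothetical.

```python
def area_fraction(original_size, crop_size):
    """Fraction of the original image area covered by the crop.

    Both sizes are (width, height) tuples in pixels.
    """
    orig_w, orig_h = original_size
    crop_w, crop_h = crop_size
    return (crop_w * crop_h) / (orig_w * orig_h)


def scale_factor(crop_size, target=256):
    """Assumed linear ratio of the crop's shorter side to the resized shorter side."""
    return min(crop_size) / target


original = (1024, 768)   # made-up original image size
crop = (512, 384)        # made-up bounding-box size
print(area_fraction(original, crop))  # 0.25
print(scale_factor(crop))             # 1.5
```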

Benchmark Experiments

Data Partitions

We partitioned the BIOSCAN-5M dataset into splits for both closed-world and open-world machine learning problems.
To use the partitions we propose, see the split field of the metadata file(s).

- The closed-world classification task uses samples labelled with a scientific name for their species
  (train, val, and test partitions).
  - This task requires the model to correctly classify new images and DNA barcodes across a known set of species labels that were seen during training.
- The open-world classification task uses samples whose species name is a placeholder name,
  and whose genus name is a scientific name
  (key_unseen, val_unseen, and test_unseen partitions).
  - This task requires the model to correctly group together new species that were not seen during training.
  - In the retrieval paradigm, this task can be performed using test_unseen records as queries against keys from the key_unseen records.
  - Alternatively, this data can be evaluated at the genus level by classification via the species in the train partition.
- Samples labelled with placeholder species names, and whose genus name is not a scientific name, are placed in the other_heldout partition.
  - This data can be used to train an unseen species novelty detector.
- Samples without species labels are placed in the pretrain partition, which comprises 90% of the data.
  - This data can be used for self-supervised or semi-supervised training.
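
The partition rules above can be read as a small decision procedure. The sketch below encodes it with boolean flags, because this README does not say how a scientific name is told apart from a placeholder name. The group labels, function names, and toy records are hypothetical; only the split names come from the text. In practice the precomputed split field should be used rather than re-deriving the assignment.

```python
# Split names are from the README; the group keys are illustrative labels.
PARTITION_SPLITS = {
    "closed_world": ("train", "val", "test"),
    "open_world": ("key_unseen", "val_unseen", "test_unseen"),
    "other_heldout": ("other_heldout",),
    "pretrain": ("pretrain",),
}


def partition_group(has_species_label,
                    species_is_scientific=False,
                    genus_is_scientific=False):
    """Return the partition group a sample falls into under the stated rules."""
    if not has_species_label:
        return "pretrain"
    if species_is_scientific:
        return "closed_world"
    if genus_is_scientific:
        return "open_world"
    return "other_heldout"


def select_split(records, split):
    """Keep only the metadata records whose split field equals `split`."""
    return [record for record in records if record.get("split") == split]


# Toy records standing in for metadata rows.
toy_records = [
    {"processid": "EXAMPLE001", "split": "train"},
    {"processid": "EXAMPLE002", "split": "test_unseen"},
    {"processid": "EXAMPLE003", "split": "pretrain"},
]
queries = select_split(toy_records, "test_unseen")
print(partition_group(True, species_is_scientific=False, genus_is_scientific=True))
```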

Task-I: DNA-based taxonomic classification

The proposed semi-supervised learning set-up, based on BarcodeBERT, has two stages:

Pretraining: DNA sequences are tokenized using non-overlapping k-mers, and 50% of the tokens are masked for the masked language modelling (MLM) task.
Tokens are encoded and fed into a transformer model. The output embeddings are used for token-level classification.
Fine-tuning: All DNA sequences in a dataset are tokenized using non-overlapping k-mer tokenization, and all tokenized sequences, without masking, are passed through the pretrained transformer model. Global mean-pooling is applied over the token-level embeddings, and the output is used for taxonomic classification.
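
The tokenization and masking steps of the pretraining stage can be sketched in a few lines of standard-library Python. This is not the repository's tokenizer: the value k=4, the `[MASK]` token string, dropping a trailing fragment shorter than k, and the function names are all assumptions made for illustration.

```python
import random


def kmer_tokenize(sequence, k=4):
    """Split a DNA string into non-overlapping k-mers.

    Assumption: a trailing fragment shorter than k is dropped.
    """
    seq = sequence.upper()
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, k)]


def mask_tokens(tokens, mask_ratio=0.5, mask_token="[MASK]", seed=None):
    """Replace a random `mask_ratio` share of tokens with `mask_token`.

    Returns the masked token list and the sorted masked positions.
    """
    rng = random.Random(seed)
    n_mask = int(len(tokens) * mask_ratio)
    positions = sorted(rng.sample(range(len(tokens)), n_mask))
    masked = list(tokens)
    for pos in positions:
        masked[pos] = mask_token
    return masked, positions


tokens = kmer_tokenize("ACGTACGTTTGACCAA", k=4)
masked, positions = mask_tokens(tokens, mask_ratio=0.5, seed=0)
print(tokens)  # ['ACGT', 'ACGT', 'TTGA', 'CCAA']
print(masked)
```

During fine-tuning, only `kmer_tokenize` would apply, since sequences are passed through the model without masking.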

Results

The performance of the taxonomic classification using DNA barcode sequences of the BIOSCAN-5M dataset is summarized as follows:

Performance of DNA-based sequence models in closed- and open-world settings.
For the closed-world setting, we show the species-level accuracy (%) for predicting seen species.
For the open-world setting, we show genus-level accuracy (%) for unseen species, while using seen species to fit the model.
Bold values indicate the best result, and italicized values indicate the second best.

Model	Architecture	SSL-Pretraining	Tokens Seen	Fine-tuned Seen: Species	Linear Probe Seen: Species	1NN-Probe Unseen: Genus
CNN baseline	CNN	--	--	97.70	--	29.88
NT	Transformer	Multi-Species	300 B	98.99	52.41	21.67
DNABERT-2	Transformer	Multi-Species	512 B	99.23	67.81	17.99
DNABERT-S	Transformer	Multi-Species	~1,000 B	98.99	95.50	17.70
HyenaDNA	SSM	Human DNA	5 B	98.71	54.82	19.26
BarcodeBERT	Transformer	DNA barcodes	5 B	98.52	91.93	23.15
Ours	Transformer	DNA barcodes	7 B	99.28	94.47	47.03

Task-II: Zero-shot transfer learning

We follow the experimental setup recommended by zero-shot clustering,
expanded to operate on multiple modalities.

Take pretrained encoders.
Extract feature vectors from the stimuli by passing them through the pretrained encoder.
Reduce the embeddings with UMAP.
Cluster the reduced embeddings with Agglomerative Clustering.
Evaluate against the ground-truth annotations with Adjusted Mutual Information.

Results

The performance of the zero-shot transfer learning experiments on the BIOSCAN-5M dataset is summarized as follows:

Task-III: Multimodal retrieval learning

Our experiments using CLIBD are conducted in two steps.

Training: Multiple modalities, including RGB images, textual taxonomy, and DNA sequences, are encoded separately
and trained using a contrastive loss function.
Inference: An image or DNA embedding is used as a query and compared to the embeddings obtained from a database of image,
DNA, and text keys. The cosine similarity is used to find the closest key embedding, and the corresponding taxonomic label is used to classify the query.
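
The inference step amounts to nearest-neighbour lookup under cosine similarity. The sketch below shows that idea with plain Python lists. It is not the CLIBD implementation; the function names, the two-dimensional toy embeddings, and the labels are made up for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def classify_by_nearest_key(query, keys, key_labels):
    """Return the label of the key most similar to the query, and that similarity."""
    similarities = [cosine_similarity(query, key) for key in keys]
    best = max(range(len(keys)), key=lambda i: similarities[i])
    return key_labels[best], similarities[best]


# Toy embeddings: in practice these would come from the trained encoders.
keys = [[1.0, 0.0], [0.0, 1.0]]
key_labels = ["Diptera", "Coleoptera"]
label, similarity = classify_by_nearest_key([0.9, 0.1], keys, key_labels)
print(label)  # Diptera
```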

Results

The performance of the multimodal retrieval learning experiments on the BIOSCAN-5M dataset is summarized as follows:

Copyright and License

The images and metadata included in the BIOSCAN-5M dataset available through this repository are subject to the following copyright
and licensing restrictions:

Copyright Holder: CBG Photography Group
Copyright Institution: Centre for Biodiversity Genomics (email:[email protected])
Photographer: CBG Robotic Imager
Copyright License: Creative Commons Attribution 3.0 Unported (CC BY 3.0)
Copyright Contact: [email protected]

Owner metadata

Name: BIOSCAN
Login: bioscan-ml
Email: [email protected]
Kind: organization
Description: Illuminating biodiversity with DNA-based identification systems
Website: https://biodiversitygenomics.net/research/bioscan/
Location:
Twitter:
Company:
Icon url: https://avatars.githubusercontent.com/u/175227258?v=4
Repositories: 1
Last synced at: 2024-10-29T05:17:09.632Z
Profile URL: https://github.com/bioscan-ml

GitHub Events

Total

Issues event: 1
Watch event: 10
Delete event: 6
Issue comment event: 5
Push event: 100
Pull request event: 11
Fork event: 1
Create event: 7

Last Year

Issues event: 1
Watch event: 10
Delete event: 6
Issue comment event: 5
Push event: 100
Pull request event: 11
Fork event: 1
Create event: 7

Committers metadata

Last synced: 2 days ago

Total Commits: 431
Total Committers: 4
Avg Commits per committer: 107.75
Development Distribution Score (DDS): 0.232

Commits in past year: 121
Committers in past year: 4
Avg Commits per committer in past year: 30.25
Development Distribution Score (DDS) in past year: 0.066

Name	Email	Commits
zahrag	z**e@g**m	331
Scott Lowe	s**e@g**m	61
zmgong	m**4@g**m	35
Pablo	p**7@g**m	4

Committer domains:

Issue and Pull Request metadata

Last synced: 2 days ago

Total issues: 1
Total pull requests: 11
Average time to close issues: N/A
Average time to close pull requests: about 18 hours
Total issue authors: 1
Total pull request authors: 5
Average comments per issue: 1.0
Average comments per pull request: 0.45
Merged pull request: 7
Bot issues: 0
Bot pull requests: 0

Past year issues: 1
Past year pull requests: 8
Past year average time to close issues: N/A
Past year average time to close pull requests: about 20 hours
Past year issue authors: 1
Past year pull request authors: 5
Past year average comments per issue: 1.0
Past year average comments per pull request: 0.63
Past year merged pull request: 5
Past year bot issues: 0
Past year bot pull requests: 0

More stats: https://issues.ecosyste.ms/repositories/lookup?url=https://github.com/bioscan-ml/bioscan-5m

Top Issue Authors

Wz1h1NG (1)

Top Pull Request Authors

scottclowe (3)
millanp95 (3)
zmgong (3)
annavik (1)
zahrag (1)

Top Issue Labels

Top Pull Request Labels

Score: 4.0943445622221
