CLIBD

A model that uses contrastive learning to map biological images, DNA barcodes, and textual taxonomic labels to the same latent space.
https://github.com/bioscan-ml/clibd

Category: Biosphere
Sub Category: Biodiversity Analysis and Metrics

Last synced: about 7 hours ago

Repository metadata

Host: GitHub
URL: https://github.com/bioscan-ml/clibd
Owner: bioscan-ml
License: mit
Created: 2024-05-27T00:16:21.000Z (about 1 year ago)
Default Branch: main
Last Pushed: 2025-06-13T07:03:41.000Z (15 days ago)
Last Synced: 2025-06-26T17:49:31.063Z (1 day ago)
Language: Python
Size: 14.7 MB
Stars: 15
Watchers: 3
Forks: 4
Open Issues: 2
Releases: 1
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- License: LICENSE

CLIBD: Bridging Vision and Genomics for Biodiversity Monitoring at Scale

This is the official implementation for "CLIBD: Bridging Vision and Genomics for Biodiversity Monitoring at Scale".
Links: website | paper

Overview

Taxonomically classifying organisms at scale is crucial for monitoring biodiversity, understanding ecosystems, and preserving sustainability. Organisms can be classified taxonomically from either their image or their DNA barcode. While DNA barcodes identify species precisely, they are less readily available than images. We therefore investigate whether DNA barcodes can be used to improve image-based taxonomic classification.

We introduce CLIBD, a model that uses contrastive learning to map biological images, DNA barcodes, and textual taxonomic labels to the same latent space. The model is initialized using pretrained encoders for images (vit-base-patch16-224), DNA barcodes (BarcodeBERT), and textual taxonomic labels (BERT-small), and the weights of the encoders are fine-tuned using LoRA.
The aligned image-DNA embedding space improves image-based taxonomic classification and enables cross-modal retrieval from image to DNA. We train CLIBD on the BIOSCAN-1M and BIOSCAN-5M insect datasets. These datasets provide paired images of insects and their DNA barcodes, along with their taxonomic labels.
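As a rough illustration of the contrastive objective described above, the following sketch implements a CLIP-style symmetric InfoNCE loss in plain Python. The temperature value, the function names, and the toy embeddings are assumptions for illustration; the loss actually used in this repository may differ in its details.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def cross_entropy_diag(logits):
    """Mean cross-entropy where the correct column for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def contrastive_loss(emb_a, emb_b, temperature=0.07):
    """Symmetric InfoNCE loss between two batches of paired embeddings.

    emb_a[i] and emb_b[i] are assumed to describe the same specimen
    (for example its image and its DNA barcode). The temperature is an
    illustrative value, not one taken from the repository.
    """
    a = [l2_normalize(v) for v in emb_a]
    b = [l2_normalize(v) for v in emb_b]
    # Pairwise cosine similarities, scaled by the temperature.
    logits = [[sum(x * y for x, y in zip(va, vb)) / temperature for vb in b] for va in a]
    logits_t = [list(col) for col in zip(*logits)]
    # Average the a-to-b and b-to-a directions.
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(logits_t))


# Toy batch: two specimens with 2-D embeddings (made-up numbers).
image_emb = [[1.0, 0.0], [0.0, 1.0]]
dna_aligned = [[0.9, 0.1], [0.1, 0.9]]   # matches the image ordering
dna_swapped = [[0.1, 0.9], [0.9, 0.1]]   # pairs are mismatched

print(contrastive_loss(image_emb, dna_aligned))
print(contrastive_loss(image_emb, dna_swapped))
```

The loss is small when paired embeddings point in the same direction and large when the pairing is wrong, which is what pulls matching images and barcodes together in the shared space. With three modalities, one plausible option (an assumption here) is to sum this loss over each pair of modalities.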

Setup environment

CLIBD was developed using Python 3.10 and PyTorch 2.0.1. We recommend the use of GPU and CUDA for efficient training and inference. Our models were developed with CUDA 11.7 and 12.4.
We also recommend the use of miniconda for managing your environments.

To set up the environment with the necessary dependencies, run the following commands:

conda create -n CLIBD python=3.10 -y
conda activate CLIBD
conda install pytorch=2.0.1 torchvision=0.15.2 torchtext=0.15.2 pytorch-cuda=11.7 -c pytorch -c nvidia -y
pip install -r requirements.txt
pip install -e .
pip install git+https://github.com/Baijiong-Lin/LoRA-Torch

Depending on your GPU and CUDA version, you may need to change the PyTorch version in the conda command above and adjust other package versions in requirements.txt.
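Before adjusting versions, it can help to see what is actually installed. The helper below is a small sketch (the function name and the report keys are made up for this example) that reports the Python version and, if PyTorch is importable, its version and CUDA build.

```python
import importlib.util
import sys


def environment_report():
    """Collect the interpreter version and, when available, PyTorch/CUDA details."""
    report = {
        "python": f"{sys.version_info.major}.{sys.version_info.minor}",
        "torch": None,           # stays None when PyTorch is not installed
        "cuda_build": None,      # CUDA version PyTorch was built against
        "cuda_available": False,
    }
    if importlib.util.find_spec("torch") is not None:
        try:
            import torch
        except ImportError:
            return report  # installed but not importable; leave defaults
        report["torch"] = torch.__version__
        report["cuda_build"] = torch.version.cuda
        report["cuda_available"] = torch.cuda.is_available()
    return report


print(environment_report())
```

If `cuda_build` does not match the CUDA version supported by your driver, that is usually the first thing to change in the install command.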

Pretrained embeddings and models

We provide pretrained embeddings and model weights. We evaluate our models by encoding the query image or DNA barcode and taking the taxonomic labels of the closest matching embedding (either an image or a DNA barcode embedding). See Download dataset and Running experiments for how to get the data and how to train and evaluate the models.
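The nearest-embedding lookup described above can be sketched in a few lines. This is a minimal illustration that assumes cosine similarity as the closeness measure; the function names, the toy embeddings, and the placeholder species names are hypothetical and are not taken from the repository.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(u, v))
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(y * y for y in v))
    return dot / (nu * nv) if nu and nv else 0.0


def predict_labels(query_emb, key_embs, key_labels):
    """Return the taxonomic labels of the key whose embedding is closest to the query."""
    best = max(range(len(key_embs)), key=lambda i: cosine(query_emb, key_embs[i]))
    return key_labels[best]


# Keys may be image embeddings or DNA barcode embeddings (toy 2-D values).
key_embs = [[1.0, 0.0], [0.0, 1.0]]
key_labels = [
    {"order": "Diptera", "species": "species_A"},      # placeholder species name
    {"order": "Coleoptera", "species": "species_B"},   # placeholder species name
]

print(predict_labels([0.8, 0.2], key_embs, key_labels))
```

Because images and barcodes share one embedding space, the same function covers image-to-image, DNA-to-DNA, and image-to-DNA lookup; only the choice of keys changes.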

NOTE: The checkpoints and config files these links point to have currently expired. We are working to update them as soon as possible and apologize for any inconvenience.

Training data	Aligned modalities	Embeddings	Model	Config
BIOSCAN-1M	None	Embedding	N/A	Link
BIOSCAN-1M	Image + DNA	Embedding	Link	Link
BIOSCAN-1M	Image + DNA + Tax	Embedding	Link	Link
BIOSCAN-5M	None	Embedding	N/A	Link
BIOSCAN-5M	Image + DNA	Embedding	Link	Link
BIOSCAN-5M	Image + DNA + Tax	Embedding	Link	Link

We also provide checkpoints trained with LoRA layers. You can download them from this Link.

Quick start

Instead of running a full training, you can download pre-trained models or pre-extracted embeddings for evaluation from the table above. You may need to place the downloaded checkpoints and extracted features in the locations expected by the config file.

Download dataset

For BIOSCAN-1M, we partition the dataset for our CLIBD experiments into a training set for contrastive learning, and validation and test partitions. The training set contains records without any species labels as well as records from a set of seen species. The validation and test sets include both seen and unseen species. These images are further split into subpartitions of queries and keys for evaluation.

For BIOSCAN-5M, we use the dataset partitioning established in the BIOSCAN-5M paper.

For training and reproducing our experiments, we provide HDF5 files with BIOSCAN-1M and BIOSCAN-5M images. See DATA.md for format details. We also provide scripts for generating the HDF5 files directly from the BIOSCAN-1M and BIOSCAN-5M data.

Download BIOSCAN-1M data (79.7 GB)

# From project folder
mkdir -p data/BIOSCAN_1M/split_data
cd data/BIOSCAN_1M/split_data
wget https://aspis.cmpt.sfu.ca/projects/bioscan/clip_project/data/version_0.2.1/BioScan_data_in_splits.hdf5

Download BIOSCAN-5M data (190.4 GB)

# From project folder
mkdir -p data/BIOSCAN_5M
cd data/BIOSCAN_5M
wget https://aspis.cmpt.sfu.ca/projects/bioscan/BIOSCAN_CLIP_for_downloading/BIOSCAN_5M.hdf5

For more information about the hdf5 files, please check DATA.md.
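To get a quick overview of a downloaded file before reading DATA.md, the sketch below lists every dataset with its shape and dtype. It relies only on the dict-like interface of h5py groups, so it makes no assumption about the group or dataset names inside the BIOSCAN files; the function names are made up for this example.

```python
def summarize_hdf5(node, prefix=""):
    """Recursively list datasets under an h5py File/Group (or any nested mapping).

    Group-like nodes expose .items(); dataset-like leaves expose .shape and .dtype.
    """
    lines = []
    for name, child in node.items():
        path = f"{prefix}/{name}"
        if hasattr(child, "items"):
            # Group-like node: descend into it.
            lines.extend(summarize_hdf5(child, path))
        else:
            # Dataset-like leaf: report its shape and dtype if present.
            shape = getattr(child, "shape", None)
            dtype = getattr(child, "dtype", None)
            lines.append(f"{path}  shape={shape}  dtype={dtype}")
    return lines


def print_summary(hdf5_path):
    """Open an HDF5 file read-only and print one line per dataset.

    Requires h5py, which is listed in requirements.txt.
    """
    import h5py

    with h5py.File(hdf5_path, "r") as f:
        for line in summarize_hdf5(f):
            print(line)


# Example call from the project folder, after the download has finished:
# print_summary("data/BIOSCAN_5M/BIOSCAN_5M.hdf5")
```

Opening the file in read-only mode avoids accidentally modifying a download of this size.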

You can also download the processed data from our Hugging Face repo.

Download data for generating hdf5 files

You can download the tsv files from BIOSCAN-1M and BIOSCAN-5M, although they are not strictly necessary.

Running experiments

We recommend using Weights & Biases (wandb) to track and log experiments.

Activate Wandb

Register/Login for a free wandb account

wandb login
# Paste your wandb's API key

Note: To enable wandb, you also need to modify /bioscanclip/config/global_config and set:

debug_flag: false

Checkpoints

Download the checkpoints for BarcodeBERT and bioscan_clip and place them under ckpt.

pip install -U "huggingface_hub[cli]"
# From project folder
huggingface-cli download bioscan-ml/clibd --include "ckpt/*" --local-dir .

You can also check this link to download the files manually.

Train

Use train_cl.py with the appropriate model_config to train CLIBD.

# From project folder
python scripts/train_cl.py 'model_config={config_name}'

To train the full model (I+D+T) using BIOSCAN-1M:

# From project folder
python scripts/train_cl.py 'model_config=for_bioscan_1m/final_experiments/image_dna_text_seed_42.yaml'

For multi-GPU training, you may need to set NCCL_P2P_LEVEL to control how the GPUs communicate with each other:

NCCL_P2P_LEVEL=NVL python scripts/train_cl.py 'model_config=for_bioscan_1m/final_experiments/image_dna_text_seed_42.yaml'
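If you prefer to launch training from Python (for example inside a job script), the same command and environment variable can be assembled as follows. This is only a sketch: the helper names are hypothetical, and the launch line is left commented out so nothing runs by accident.

```python
import os
import subprocess
import sys


def build_train_command(model_config, script="scripts/train_cl.py"):
    """Build the argument list for the training script with a model_config override."""
    return [sys.executable, script, f"model_config={model_config}"]


def build_env(p2p_level=None, base_env=None):
    """Copy the environment and optionally set NCCL_P2P_LEVEL for multi-GPU runs."""
    env = dict(os.environ if base_env is None else base_env)
    if p2p_level is not None:
        env["NCCL_P2P_LEVEL"] = p2p_level
    return env


cmd = build_train_command("for_bioscan_1m/final_experiments/image_dna_text_seed_42.yaml")
env = build_env("NVL")
print(cmd[1:], env["NCCL_P2P_LEVEL"])

# Run from the project folder to actually start training:
# subprocess.run(cmd, env=env, check=True)
```

Passing the arguments as a list avoids shell quoting issues with the `model_config=...` override.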

For example, the following command loads the pre-trained ViT-B, BarcodeBERT, and BERT-small encoders and fine-tunes them through contrastive learning. Note that this training updates only their LoRA layers, not all of the parameters.

python scripts/train_cl.py 'model_config=for_bioscan_5m/lora_vit_lora_barcode_bert_lora_bert_5m_no_loading.yaml'
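To make the "only the LoRA layers are updated" point concrete, the sketch below shows the general LoRA idea in plain Python: a frozen weight matrix W is combined with a trainable low-rank update, W + (alpha / r) * B A. The matrix sizes, rank, and scaling values are illustrative assumptions, and the code does not use the API of the LoRA-Torch package installed above.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_effective_weight(w, a, b, alpha, r):
    """Return W + (alpha / r) * B @ A.

    w: frozen weight, shape d_out x d_in
    b: trainable, shape d_out x r
    a: trainable, shape r x d_in
    """
    delta = matmul(b, a)
    scale = alpha / r
    return [
        [w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
        for w_row, d_row in zip(w, delta)
    ]


d_out, d_in, r = 4, 4, 1  # toy sizes; real encoder layers are much larger
w = [[1.0 if i == j else 0.0 for j in range(d_in)] for i in range(d_out)]  # frozen
a = [[0.5] * d_in for _ in range(r)]    # trainable
b = [[0.0] * r for _ in range(d_out)]   # trainable, zero-initialised

# With B initialised to zero, the adapted layer starts out identical to the frozen one.
unchanged = lora_effective_weight(w, a, b, alpha=1.0, r=r) == w

trainable = r * (d_in + d_out)   # parameters in A and B
frozen = d_in * d_out            # parameters in W
print(unchanged, trainable, frozen)
```

The benefit grows with layer size: the number of trainable parameters scales with r * (d_in + d_out) rather than d_in * d_out, which is why fine-tuning three large encoders at once stays affordable.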

Evaluation

During evaluation, we use the trained encoders to obtain embeddings for the input image or DNA barcode, find the closest matching image or DNA embedding, and use its taxonomic labels as the predicted labels. We report both micro and class-averaged accuracy for seen and unseen species.
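The difference between the two reported metrics is easiest to see on a small example. The sketch below uses the standard definitions (micro: fraction of all samples that are correct; class-averaged: mean of the per-class accuracies); the function names and toy labels are made up for illustration.

```python
from collections import defaultdict


def micro_accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label equals the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def class_averaged_accuracy(y_true, y_pred):
    """Mean of per-class accuracies, so every class counts equally."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        correct[t] += int(t == p)
    return sum(correct[c] / total[c] for c in total) / len(total)


# Imbalanced toy example: class "A" is common, class "B" is rare and always missed.
y_true = ["A", "A", "A", "A", "B"]
y_pred = ["A", "A", "A", "A", "A"]

print(micro_accuracy(y_true, y_pred))           # 0.8
print(class_averaged_accuracy(y_true, y_pred))  # 0.5
```

Species counts in insect datasets tend to be heavily imbalanced, so the class-averaged score exposes failures on rare species that the micro score hides. Seen and unseen species would simply be scored as two separate subsets with the same functions.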

To run evaluation for BIOSCAN-1M:

# From project folder
python scripts/inference_and_eval.py 'model_config=for_bioscan_1m/final_experiments/image_dna_text_seed_42.yaml'

To run evaluation for BIOSCAN-5M:

python scripts/inference_and_eval.py 'model_config=for_bioscan_5m/final_experiments/image_dna_text_seed_42.yaml'

BZSL experiment with the INSECT dataset

To download the unprocessed INSECT dataset, refer to BZSL:

mkdir -p data/INSECT
cd data/INSECT
# Download the images and metadata here.


# Note that we need to get the other three labels because the INSECT dataset only has the species label.
# For that, please edit get_all_species_taxo_labels_dict_and_save_to_json.py, change Entrez.email = None to your email 
pip install biopython
python get_all_species_taxo_labels_dict_and_save_to_json.py

# Then, generate CSV and hdf5 file for the dataset.
python process_insect_dataset.py

The downloaded data should be organized in this way:

data
├── INSECT
│   ├── att_splits.mat
│   ├── res101.mat
│   ├── images
│   │   ├── Abax parallelepipedus
│   │   │   ├── BC_ZSM_COL_02878+1311934584.jpg
│   │   │   ├── BC_ZSM_COL_05487+1338577126.JPG
│   │   │   ├── ...
│   │   ├── Abax parallelus
│   │   ├── Acordulecera dorsalis
│   │   ├── ...

You can also download the processed file with:

wget https://aspis.cmpt.sfu.ca/projects/bioscan/BIOSCAN_CLIP_for_downloading/INSECT_data/processed_data.zip
unzip processed_data.zip

Train CLIBD with INSECT dataset

python scripts/train_cl.py 'model_config=for_bioscan_1m/lora_vit_lora_barcode_bert_lora_bert_ssl_on_insect.yaml'

Extract image and DNA features of the INSECT dataset

To fine-tune with contrastive learning on the INSECT dataset:

python scripts/train_cl.py 'model_config=for_bioscan_1m/fine_tune_on_INSECT_dataset/image_dna_text_seed_42_on_INSECT_dataset.yaml'

To perform supervised fine-tuning of the image encoder on the INSECT dataset:

python scripts/BZSL/fine_tune_bioscan_clip_image_on_insect.py 'model_config=for_bioscan_1m/final_experiments/image_dna_text_seed_42.yaml'

For feature extraction:

python scripts/extract_feature_for_insect_dataset.py 'model_config=for_bioscan_1m/fine_tune_on_INSECT_dataset/image_dna_text_seed_42_on_INSECT_dataset.yaml'

Then move the extracted features to the BZSL folder, or download the pre-extracted features.

mkdir -p Fine-Grained-ZSL-with-DNA/data/INSECT/embeddings_from_bioscan_clip
cp extracted_embedding/INSECT/dna_embedding_from_bioscan_clip.csv Fine-Grained-ZSL-with-DNA/data/INSECT/embeddings_from_bioscan_clip/dna_embedding_from_bioscan_clip.csv
cp extracted_embedding/INSECT/image_embedding_from_bioscan_clip.csv Fine-Grained-ZSL-with-DNA/data/INSECT/embeddings_from_bioscan_clip/image_embedding_from_bioscan_clip.csv

Run BZSL for evaluation.

cd Fine-Grained-ZSL-with-DNA/BZSL-Python
python Demo.py --using_bioscan_clip_image_feature --datapath ../data --side_info dna_bioscan_clip --alignment --tuning

Flatten the `results.csv`.

python scripts/flattenCsv.py -i PATH_TO_RESULTS_CSV -o PATH_TO_FLATTEN_CSV

Citing CLIBD

If you use CLIBD in your research, please cite:

@inproceedings{gong2025clibd,
    title={{CLIBD}: Bridging Vision and Genomics for Biodiversity Monitoring at Scale},
    author={ZeMing Gong and Austin Wang and Xiaoliang Huo and Joakim Bruslund Haurum
        and Scott C. Lowe and Graham W. Taylor and Angel X Chang
    },
    booktitle={The Thirteenth International Conference on Learning Representations},
    year={2025},
    url={https://openreview.net/forum?id=d5HUnyByAI},
}

Version log

Version 1.0 (Current version)

Release for the initial ICLR camera-ready submission
Support loading the checkpoint from Hugging Face when no checkpoint is found in the local path

Acknowledgements

We would like to express our gratitude for the use of the INSECT dataset, which played a pivotal role in the completion of our experiments. Additionally, we acknowledge the use and modification of code from the Fine-Grained-ZSL-with-DNA repository, which facilitated part of our experimental work. The contributions of these resources have been invaluable to our project, and we appreciate the efforts of all developers and researchers involved.

This research was supported by the Government of Canada’s New Frontiers in Research Fund (NFRF) [NFRFT-2020-00073],
Canada CIFAR AI Chair grants, and the Pioneer Centre for AI (DNRF grant number P1).
This research was also enabled in part by support provided by the Digital Research Alliance of Canada (alliancecan.ca).

Owner metadata

Name: BIOSCAN
Login: bioscan-ml
Email: [email protected]
Kind: organization
Description: Illuminating biodiversity with DNA-based identification systems
Website: https://biodiversitygenomics.net/research/bioscan/
Location:
Twitter:
Company:
Icon url: https://avatars.githubusercontent.com/u/175227258?v=4
Repositories: 1
Last synced at: 2024-10-29T05:17:09.632Z
Profile URL: https://github.com/bioscan-ml

GitHub Events

Total

Create event: 16
Release event: 1
Issues event: 19
Watch event: 5
Delete event: 10
Member event: 1
Issue comment event: 7
Push event: 317
Pull request review comment event: 3
Pull request review event: 6
Pull request event: 29
Fork event: 3

Last Year

Create event: 16
Release event: 1
Issues event: 19
Watch event: 5
Delete event: 10
Member event: 1
Issue comment event: 7
Push event: 317
Pull request review comment event: 3
Pull request review event: 6
Pull request event: 29
Fork event: 3

Committers metadata

Last synced: 6 days ago

Total Commits: 599
Total Committers: 8
Avg Commits per committer: 74.875
Development Distribution Score (DDS): 0.187

Commits in past year: 571
Committers in past year: 8
Avg Commits per committer in past year: 71.375
Development Distribution Score (DDS) in past year: 0.18

Name	Email	Commits
zmgong	m**4@g**m	487
rust-in	h**n@1**m	42
angelxuanchang	a**x@g**m	20
Chuanqi	1**0@q**m	17
Scott Lowe	s**e@g**m	15
mga113	m**3@c**a	13
Austin Wang	a**g@g**m	4
cta156	c**6@c**a	1
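The summary figures above can be reproduced from the committer table. The sketch below assumes that the Development Distribution Score is defined as one minus the top committer's share of all commits; that definition is an assumption, though it does reproduce the listed value.

```python
# Commit counts copied from the committer table above.
commits = {
    "zmgong": 487,
    "rust-in": 42,
    "angelxuanchang": 20,
    "Chuanqi": 17,
    "Scott Lowe": 15,
    "mga113": 13,
    "Austin Wang": 4,
    "cta156": 1,
}

total = sum(commits.values())
average = total / len(commits)
# Assumed definition: DDS = 1 - (commits by top committer / total commits).
dds = 1 - max(commits.values()) / total

print(total, average, round(dds, 3))  # 599 74.875 0.187
```

A score this close to zero indicates that most commits come from a single contributor.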

Issue and Pull Request metadata

Last synced: 2 days ago

Total issues: 10
Total pull requests: 35
Average time to close issues: 5 days
Average time to close pull requests: 3 days
Total issue authors: 3
Total pull request authors: 3
Average comments per issue: 1.1
Average comments per pull request: 0.09
Merged pull request: 34
Bot issues: 0
Bot pull requests: 0

Past year issues: 6
Past year pull requests: 35
Past year average time to close issues: 3 days
Past year average time to close pull requests: 3 days
Past year issue authors: 2
Past year pull request authors: 3
Past year average comments per issue: 0.5
Past year average comments per pull request: 0.09
Past year merged pull request: 34
Past year bot issues: 0
Past year bot pull requests: 0

More stats: https://issues.ecosyste.ms/repositories/lookup?url=https://github.com/bioscan-ml/clibd

Top Issue Authors

zmgong (5)
jane-pyc (4)
Thundermean-sky (1)

Top Pull Request Authors

zmgong (25)
scottclowe (6)
charlie1587 (4)

Top Issue Labels

bug (2)
documentation (2)

Top Pull Request Labels

Dependencies

requirements.txt pypi

chardet *
edgegpt *
ftfy ==6.1.1
h5py *
hydra-core *
matplotlib *
numpy *
omegaconf *
pandas *
pillow *
plotly ==5.18.0
safetensors *
scikit-learn *
scipy *
seaborn *
timm *
tqdm *
transformers ==4.29.2
umap-learn ==0.5.5
wandb *

setup.py pypi

Score: 4.9126548857360515