CLIBD
A model that uses contrastive learning to map biological images, DNA barcodes, and textual taxonomic labels to the same latent space.
https://github.com/bioscan-ml/clibd
Category: Biosphere
Sub Category: Biodiversity Analysis and Metrics
Last synced: about 7 hours ago
Repository metadata
- Host: GitHub
- URL: https://github.com/bioscan-ml/clibd
- Owner: bioscan-ml
- License: mit
- Created: 2024-05-27T00:16:21.000Z (about 1 year ago)
- Default Branch: main
- Last Pushed: 2025-06-13T07:03:41.000Z (15 days ago)
- Last Synced: 2025-06-26T17:49:31.063Z (1 day ago)
- Language: Python
- Size: 14.7 MB
- Stars: 15
- Watchers: 3
- Forks: 4
- Open Issues: 2
- Releases: 1
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- License: LICENSE
README.md
CLIBD: Bridging Vision and Genomics for Biodiversity Monitoring at Scale
This is the official implementation for "CLIBD: Bridging Vision and Genomics for Biodiversity Monitoring at Scale".
Links: website | paper
Overview
Taxonomically classifying organisms at scale is crucial for monitoring biodiversity, understanding ecosystems, and preserving sustainability. It is possible to taxonomically classify organisms based on their image or their DNA barcode. While DNA barcodes are precise at species identification, they are less readily available than images. Thus, we investigate whether we can use DNA barcodes to improve taxonomic classification using images.
We introduce CLIBD, a model that uses contrastive learning to map biological images, DNA barcodes, and textual taxonomic labels to the same latent space. The model is initialized using pretrained encoders for images (vit-base-patch16-224), DNA barcodes (BarcodeBERT), and textual taxonomic labels (BERT-small), and the weights of the encoders are fine-tuned using LoRA.
The aligned image-DNA embedding space improves taxonomic classification using images and allows us to do cross-modal retrieval from image to DNA. We train CLIBD on the BIOSCAN-1M and BIOSCAN-5M insect datasets. These datasets provide paired images of insects and their DNA barcodes, along with their taxonomic labels.
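To make the idea of aligning two modalities in one latent space more concrete, here is a minimal pure-Python sketch of a CLIP-style symmetric contrastive loss over a batch of paired image and DNA embeddings. It is an illustration only: the function names, the temperature of 0.07, and the toy embeddings are assumptions, and the loss actually used in the repository may differ (for example in how the text modality is combined with the other two).

```python
import math


def normalize(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct class of row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def symmetric_contrastive_loss(emb_a, emb_b, temperature=0.07):
    """Pull matching pairs (same index) together and push the rest apart.

    temperature=0.07 is an assumed value, not taken from the CLIBD configs.
    """
    a = [normalize(v) for v in emb_a]
    b = [normalize(v) for v in emb_b]
    logits = [[sum(x * y for x, y in zip(u, v)) / temperature for v in b] for u in a]
    logits_t = [list(col) for col in zip(*logits)]
    # Average the a-to-b and b-to-a directions.
    return 0.5 * (cross_entropy_diagonal(logits) + cross_entropy_diagonal(logits_t))


# Toy, made-up embeddings: row i of each list belongs to the same specimen.
image_emb = [[1.0, 0.0], [0.0, 1.0]]
dna_emb = [[0.9, 0.1], [0.2, 0.8]]

aligned_loss = symmetric_contrastive_loss(image_emb, dna_emb)
mismatched_loss = symmetric_contrastive_loss(image_emb, dna_emb[::-1])
print(aligned_loss < mismatched_loss)
```

With three modalities, one plausible extension is to sum this pairwise loss over the image-DNA, image-text, and DNA-text pairs, but that is a design assumption rather than something stated here.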
Setup environment
CLIBD was developed using Python 3.10 and PyTorch 2.0.1. We recommend the use of GPU and CUDA for efficient training and inference. Our models were developed with CUDA 11.7 and 12.4.
We also recommend the use of miniconda for managing your environments.
To set up the environment with the necessary dependencies, run the following commands:
conda create -n CLIBD python=3.10 -y
conda activate CLIBD
conda install pytorch=2.0.1 torchvision=0.15.2 torchtext=0.15.2 pytorch-cuda=11.7 -c pytorch -c nvidia -y
pip install -r requirements.txt
pip install -e .
pip install git+https://github.com/Baijiong-Lin/LoRA-Torch
Depending on your GPU version, you may have to modify the torch version and other package versions in requirements.txt.
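If you do change package versions, a quick check of what is actually installed can help when debugging. The sketch below uses only the standard library and compares installed versions against the pins from the conda command above; the helper name `check_versions` is hypothetical and not part of the repository.

```python
from importlib import metadata

# Versions pinned in the conda install command above.
PINNED = {"torch": "2.0.1", "torchvision": "0.15.2", "torchtext": "0.15.2"}


def check_versions(pinned=PINNED):
    """Return {package: (installed_version_or_None, matches_pin)}."""
    report = {}
    for name, wanted in pinned.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None
        # Builds may carry a local tag such as "+cu117"; compare the base version.
        matches = installed is not None and installed.split("+")[0] == wanted
        report[name] = (installed, matches)
    return report


for name, (installed, ok) in check_versions().items():
    status = "ok" if ok else "differs from pin"
    print(f"{name}: installed={installed} ({status})")
```

A mismatch is not necessarily a problem; it only indicates that the environment differs from the one the models were developed with.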
Pretrained embeddings and models
We provide pretrained embeddings and model weights. We evaluate our models by encoding the image or DNA barcode, and using the taxonomic labels from the closest matching embedding (either using image or DNA barcode). See Download dataset and Running Experiments for how to get the data, and to train and evaluate the models.
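The evaluation idea described above, taking the taxonomic labels of the closest matching embedding, can be sketched as a nearest-neighbour lookup. This is a simplified illustration assuming cosine similarity; the function names, embeddings, and labels below are made up for the example and are not the repository's API.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)


def predict_labels(query_emb, key_embs, key_labels):
    """Return the labels attached to the key most similar to the query."""
    best = max(range(len(key_embs)), key=lambda i: cosine(query_emb, key_embs[i]))
    return key_labels[best]


# Keys can be image or DNA embeddings with known taxonomy (toy values here).
key_embs = [[1.0, 0.0], [0.0, 1.0]]
key_labels = [
    {"order": "Diptera", "family": "Chironomidae"},
    {"order": "Coleoptera", "family": "Carabidae"},
]

prediction = predict_labels([0.8, 0.3], key_embs, key_labels)
print(prediction)
```

Because images and DNA barcodes share one embedding space, the same lookup works for image-to-image, DNA-to-DNA, and image-to-DNA retrieval by swapping which modality provides the keys.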
NOTE: The checkpoints and config files these links point to have currently expired. We are working to update them as soon as possible and apologize for any inconvenience.
Training data | Aligned modalities | Embeddings | Model | Config |
---|---|---|---|---|
BIOSCAN-1M | None | Embedding | N/A | Link |
BIOSCAN-1M | Image + DNA | Embedding | Link | Link |
BIOSCAN-1M | Image + DNA + Tax | Embedding | Link | Link |
BIOSCAN-5M | None | Embedding | N/A | Link |
BIOSCAN-5M | Image + DNA | Embedding | Link | Link |
BIOSCAN-5M | Image + DNA + Tax | Embedding | Link | Link |
We also provide checkpoints trained with LoRA layers. You can download them from this link.
Quick start
Instead of running a full training, you can download pretrained models or pre-extracted embeddings from the table above for evaluation. You may need to place the downloaded checkpoints and extracted features in the locations specified by the config file.
Download dataset
For BIOSCAN-1M, we partition the dataset for our CLIBD experiments into a training set for contrastive learning, and validation and test partitions. The training set has records without any species labels as well as a set of seen species. The validation and test sets include seen and unseen species. These images are further split into subpartitions of queries and keys for evaluation.
For BIOSCAN-5M, we use the dataset partitioning established in the BIOSCAN-5M paper.
For training and reproducing our experiments, we provide HDF5 files with BIOSCAN-1M and BIOSCAN-5M images. See DATA.md for format details. We also provide scripts for generating the HDF5 files directly from the BIOSCAN-1M and BIOSCAN-5M data.
Download BIOSCAN-1M data (79.7 GB)
# From project folder
mkdir -p data/BIOSCAN_1M/split_data
cd data/BIOSCAN_1M/split_data
wget https://aspis.cmpt.sfu.ca/projects/bioscan/clip_project/data/version_0.2.1/BioScan_data_in_splits.hdf5
Download BIOSCAN-5M data (190.4 GB)
# From project folder
mkdir -p data/BIOSCAN_5M
cd data/BIOSCAN_5M
wget https://aspis.cmpt.sfu.ca/projects/bioscan/BIOSCAN_CLIP_for_downloading/BIOSCAN_5M.hdf5
For more information about the hdf5 files, please check DATA.md.
You can also download the processed data from our Hugging Face repo.
Download data for generating hdf5 files
You can check BIOSCAN-1M and BIOSCAN-5M to download the TSV files, but they are not required.
Running experiments
We recommend the use of Weights & Biases (wandb) to track and log experiments.
Activate Wandb
Register/Login for a free wandb account
wandb login
# Paste your wandb's API key
Note: To enable wandb, you also need to modify /bioscanclip/config/global_config and set:
debug_flag: false
Checkpoints
Download the checkpoints for BarcodeBERT and bioscan_clip and place them under ckpt.
pip install -U "huggingface_hub[cli]"
# From project folder
huggingface-cli download bioscan-ml/clibd --include "ckpt/*" --local-dir .
You can also check this link to download the files manually.
Train
Use train_cl.py with the appropriate model_config to train CLIBD.
# From project folder
python scripts/train_cl.py 'model_config={config_name}'
To train the full model (I+D+T) using BIOSCAN-1M:
# From project folder
python scripts/train_cl.py 'model_config=for_bioscan_1m/final_experiments/image_dna_text_seed_42.yaml'
For multi-GPU training, you may need to specify the communication transport between GPUs using NCCL_P2P_LEVEL:
NCCL_P2P_LEVEL=NVL python scripts/train_cl.py 'model_config=for_bioscan_1m/final_experiments/image_dna_text_seed_42.yaml'
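If you launch several runs from a Python script rather than from the shell, the same command and environment variable can be assembled with the standard library. This is a sketch only: the helper names are hypothetical, and it assumes the script is run from the project folder like the commands above.

```python
import os
import subprocess
import sys


def build_train_command(config_name):
    """Argument list equivalent to: python scripts/train_cl.py 'model_config=...'."""
    # Passing a list avoids shell quoting issues around the override string.
    return [sys.executable, "scripts/train_cl.py", f"model_config={config_name}"]


def build_env(p2p_level=None):
    """Copy the current environment, optionally setting NCCL_P2P_LEVEL."""
    env = dict(os.environ)
    if p2p_level is not None:
        env["NCCL_P2P_LEVEL"] = p2p_level
    return env


def launch(config_name, p2p_level=None):
    """Run training and raise CalledProcessError if the script fails."""
    return subprocess.run(
        build_train_command(config_name), env=build_env(p2p_level), check=True
    )


config = "for_bioscan_1m/final_experiments/image_dna_text_seed_42.yaml"
print(build_train_command(config))
# To start training with the NVLink setting shown above:
# launch(config, p2p_level="NVL")
```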
For example, using the following command, you can load the pre-trained ViT-B, BarcodeBERT, and BERT-small and fine-tune them through contrastive learning. Note that this training will only update their LoRA layers, not all the parameters.
python scripts/train_cl.py 'model_config=for_bioscan_5m/lora_vit_lora_barcode_bert_lora_bert_5m_no_loading.yaml'
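To illustrate what "only updating the LoRA layers" means, the following pure-Python sketch shows the general LoRA idea: a frozen weight matrix W plus a trainable low-rank update B·A. It is a generic illustration of the technique, not the LoRA-Torch implementation the repository installs; the class name, rank, scaling, and initialization values are assumptions.

```python
import random


def matvec(matrix, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, x)) for row in matrix]


class LoRALinear:
    """Linear layer computing W x + (alpha / rank) * B (A x), with W frozen."""

    def __init__(self, weight, rank, alpha=1.0, seed=0):
        rng = random.Random(seed)
        out_dim, in_dim = len(weight), len(weight[0])
        self.weight = weight  # pretrained weights, never updated
        self.scale = alpha / rank
        # A starts small and random, B starts at zero, so the layer initially
        # behaves exactly like the pretrained one.
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(in_dim)] for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(out_dim)]

    def forward(self, x):
        base = matvec(self.weight, x)
        update = matvec(self.B, matvec(self.A, x))
        return [b + self.scale * u for b, u in zip(base, update)]

    def trainable_parameters(self):
        """Only A and B would receive gradients during fine-tuning."""
        return len(self.A) * len(self.A[0]) + len(self.B) * len(self.B[0])


layer = LoRALinear([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]], rank=1)
output = layer.forward([1.0, 1.0, 1.0])
print(output, layer.trainable_parameters())
```

For a weight matrix of shape d×k, the trainable parameter count drops from d·k to r·(d+k), which is what makes fine-tuning three pretrained encoders at once more affordable.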
Evaluation
During evaluation, we use the trained encoders to obtain embeddings for the input image or DNA barcode, find the closest matching image or DNA embedding, and use the corresponding taxonomic labels as the predicted labels. We report both micro and class-averaged accuracy for seen and unseen species.
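The difference between the two reported accuracy types can be shown with a small example. This sketch assumes the usual definitions (micro: fraction of correct samples; class-averaged: mean of per-class accuracies); the function names and toy labels are illustrative and are not taken from the evaluation script.

```python
from collections import defaultdict


def micro_accuracy(y_true, y_pred):
    """Fraction of samples predicted correctly, regardless of class."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def class_averaged_accuracy(y_true, y_pred):
    """Mean of per-class accuracies, so rare classes count as much as common ones."""
    hits = defaultdict(int)
    counts = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        counts[t] += 1
        hits[t] += int(t == p)
    return sum(hits[c] / counts[c] for c in counts) / len(counts)


# Toy labels: species "A" is common and always right, species "B" is rare and missed.
y_true = ["A", "A", "A", "B"]
y_pred = ["A", "A", "A", "C"]

micro = micro_accuracy(y_true, y_pred)
macro = class_averaged_accuracy(y_true, y_pred)
print(micro, macro)
```

The gap between the two numbers grows with class imbalance, which is why both are useful for long-tailed species distributions.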
To run evaluation for BIOSCAN-1M:
# From project folder
python scripts/inference_and_eval.py 'model_config=for_bioscan_1m/final_experiments/image_dna_text_seed_42.yaml'
To run evaluation for BIOSCAN-5M:
python scripts/inference_and_eval.py 'model_config=for_bioscan_5m/final_experiments/image_dna_text_seed_42.yaml'
BZSL experiment with the INSECT dataset
To download the unprocessed INSECT dataset, you can refer to BZSL:
mkdir -p data/INSECT
cd data/INSECT
# Download the images and metadata here.
# Note that we need to get the other three labels because the INSECT dataset only has the species label.
# For that, please edit get_all_species_taxo_labels_dict_and_save_to_json.py, change Entrez.email = None to your email
pip install biopython
python get_all_species_taxo_labels_dict_and_save_to_json.py
# Then, generate CSV and hdf5 file for the dataset.
python process_insect_dataset.py
The downloaded data should be organized in this way:
data
├── INSECT
│   ├── att_splits.mat
│   ├── res101.mat
│   ├── images
│   │   ├── Abax parallelepipedus
│   │   │   ├── BC_ZSM_COL_02878+1311934584.jpg
│   │   │   ├── BC_ZSM_COL_05487+1338577126.JPG
│   │   │   ├── ...
│   │   ├── Abax parallelus
│   │   ├── Acordulecera dorsalis
│   │   ├── ...
You can also download the processed file with:
wget https://aspis.cmpt.sfu.ca/projects/bioscan/BIOSCAN_CLIP_for_downloading/INSECT_data/processed_data.zip
unzip processed_data.zip
Train CLIBD with INSECT dataset
python scripts/train_cl.py 'model_config=for_bioscan_1m/lora_vit_lora_barcode_bert_lora_bert_ssl_on_insect.yaml'
Extract image and DNA features of INSECT dataset.
To fine-tune on the INSECT dataset with contrastive learning:
python scripts/train_cl.py 'model_config=for_bioscan_1m/fine_tune_on_INSECT_dataset/image_dna_text_seed_42_on_INSECT_dataset.yaml'
To perform supervised fine-tuning of the image encoder on the INSECT dataset:
python scripts/BZSL/fine_tune_bioscan_clip_image_on_insect.py 'model_config=for_bioscan_1m/final_experiments/image_dna_text_seed_42.yaml'
To extract features:
python scripts/extract_feature_for_insect_dataset.py 'model_config=for_bioscan_1m/fine_tune_on_INSECT_dataset/image_dna_text_seed_42_on_INSECT_dataset.yaml'
Then, you can move the extracted features to the BZSL folder or download the pre-extracted features.
mkdir -p Fine-Grained-ZSL-with-DNA/data/INSECT/embeddings_from_bioscan_clip
cp extracted_embedding/INSECT/dna_embedding_from_bioscan_clip.csv Fine-Grained-ZSL-with-DNA/data/INSECT/embeddings_from_bioscan_clip/dna_embedding_from_bioscan_clip.csv
cp extracted_embedding/INSECT/image_embedding_from_bioscan_clip.csv Fine-Grained-ZSL-with-DNA/data/INSECT/embeddings_from_bioscan_clip/image_embedding_from_bioscan_clip.csv
Run BZSL for evaluation.
cd Fine-Grained-ZSL-with-DNA/BZSL-Python
python Demo.py --using_bioscan_clip_image_feature --datapath ../data --side_info dna_bioscan_clip --alignment --tuning
Flatten the results.csv
python scripts/flattenCsv.py -i PATH_TO_RESULTS_CSV -o PATH_TO_FLATTEN_CSV
Citing CLIBD
If you use CLIBD in your research, please cite:
@inproceedings{gong2025clibd,
title={{CLIBD}: Bridging Vision and Genomics for Biodiversity Monitoring at Scale},
author={ZeMing Gong and Austin Wang and Xiaoliang Huo and Joakim Bruslund Haurum
and Scott C. Lowe and Graham W. Taylor and Angel X Chang
},
booktitle={The Thirteenth International Conference on Learning Representations},
year={2025},
url={https://openreview.net/forum?id=d5HUnyByAI},
}
Version log
Version 1.0 (Current version)
- Release for the initial ICLR camera-ready submission
- Support loading the checkpoint from Hugging Face when no checkpoint is found in the local path
Acknowledgements
We would like to express our gratitude for the use of the INSECT dataset, which played a pivotal role in the completion of our experiments. Additionally, we acknowledge the use and modification of code from the Fine-Grained-ZSL-with-DNA repository, which facilitated part of our experimental work. The contributions of these resources have been invaluable to our project, and we appreciate the efforts of all developers and researchers involved.
This research was supported by the Government of Canada’s New Frontiers in Research Fund (NFRF) [NFRFT-2020-00073],
Canada CIFAR AI Chair grants, and the Pioneer Centre for AI (DNRF grant number P1).
This research was also enabled in part by support provided by the Digital Research Alliance of Canada (alliancecan.ca).
Owner metadata
- Name: BIOSCAN
- Login: bioscan-ml
- Email: [email protected]
- Kind: organization
- Description: Illuminating biodiversity with DNA-based identification systems
- Website: https://biodiversitygenomics.net/research/bioscan/
- Location:
- Twitter:
- Company:
- Icon url: https://avatars.githubusercontent.com/u/175227258?v=4
- Repositories: 1
- Last synced at: 2024-10-29T05:17:09.632Z
- Profile URL: https://github.com/bioscan-ml
GitHub Events
Total
- Create event: 16
- Release event: 1
- Issues event: 19
- Watch event: 5
- Delete event: 10
- Member event: 1
- Issue comment event: 7
- Push event: 317
- Pull request review comment event: 3
- Pull request review event: 6
- Pull request event: 29
- Fork event: 3
Last Year
- Create event: 16
- Release event: 1
- Issues event: 19
- Watch event: 5
- Delete event: 10
- Member event: 1
- Issue comment event: 7
- Push event: 317
- Pull request review comment event: 3
- Pull request review event: 6
- Pull request event: 29
- Fork event: 3
Committers metadata
Last synced: 6 days ago
Total Commits: 599
Total Committers: 8
Avg Commits per committer: 74.875
Development Distribution Score (DDS): 0.187
Commits in past year: 571
Committers in past year: 8
Avg Commits per committer in past year: 71.375
Development Distribution Score (DDS) in past year: 0.18
Name | Email | Commits |
---|---|---|
zmgong | m****4@g****m | 487 |
rust-in | h****n@1****m | 42 |
angelxuanchang | a****x@g****m | 20 |
Chuanqi | 1****0@q****m | 17 |
Scott Lowe | s****e@g****m | 15 |
mga113 | m****3@c****a | 13 |
Austin Wang | a****g@g****m | 4 |
cta156 | c****6@c****a | 1 |
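The summary figures above can be reproduced from the per-committer table. The sketch below assumes that the Development Distribution Score is defined as one minus the top committer's share of all commits; that definition is an assumption here, although it does reproduce the reported 0.187.

```python
# Commit counts copied from the committer table above.
commits = {
    "zmgong": 487,
    "rust-in": 42,
    "angelxuanchang": 20,
    "Chuanqi": 17,
    "Scott Lowe": 15,
    "mga113": 13,
    "Austin Wang": 4,
    "cta156": 1,
}

total_commits = sum(commits.values())
avg_per_committer = total_commits / len(commits)
# Assumed definition: DDS = 1 - (commits by top committer / total commits).
dds = 1 - max(commits.values()) / total_commits

print(total_commits, avg_per_committer, round(dds, 3))
```

A score near 0 means one person authored almost all commits, while a score near 1 means work is spread evenly across committers.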
Committer domains:
- cs-venus-08.cmpt.sfu.ca: 1
- cs-3dlg-01.cmpt.sfu.ca: 1
- qq.com: 1
- 163.com: 1
Issue and Pull Request metadata
Last synced: 2 days ago
Total issues: 10
Total pull requests: 35
Average time to close issues: 5 days
Average time to close pull requests: 3 days
Total issue authors: 3
Total pull request authors: 3
Average comments per issue: 1.1
Average comments per pull request: 0.09
Merged pull request: 34
Bot issues: 0
Bot pull requests: 0
Past year issues: 6
Past year pull requests: 35
Past year average time to close issues: 3 days
Past year average time to close pull requests: 3 days
Past year issue authors: 2
Past year pull request authors: 3
Past year average comments per issue: 0.5
Past year average comments per pull request: 0.09
Past year merged pull request: 34
Past year bot issues: 0
Past year bot pull requests: 0
Top Issue Authors
- zmgong (5)
- jane-pyc (4)
- Thundermean-sky (1)
Top Pull Request Authors
- zmgong (25)
- scottclowe (6)
- charlie1587 (4)
Top Issue Labels
- bug (2)
- documentation (2)
Top Pull Request Labels
Dependencies
- chardet *
- edgegpt *
- ftfy ==6.1.1
- h5py *
- hydra-core *
- matplotlib *
- numpy *
- omegaconf *
- pandas *
- pillow *
- plotly ==5.18.0
- safetensors *
- scikit-learn *
- scipy *
- seaborn *
- timm *
- tqdm *
- transformers ==4.29.2
- umap-learn ==0.5.5
- wandb *
Score: 4.9126548857360515