{"id":305512,"name":"BIOSCAN-5M","description":"A comprehensive multi-modal dataset comprised of over 5 million specimens, 98% of which are insects.","url":"https://github.com/bioscan-ml/bioscan-5m","last_synced_at":"2026-05-16T21:30:20.974Z","repository":{"id":244860799,"uuid":"785566713","full_name":"bioscan-ml/BIOSCAN-5M","owner":"bioscan-ml","description":"A multimodal dataset of 5M insect specimens for biodiversity research.","archived":false,"fork":false,"pushed_at":"2026-05-02T02:31:14.000Z","size":104509,"stargazers_count":21,"open_issues_count":16,"forks_count":1,"subscribers_count":5,"default_branch":"main","last_synced_at":"2026-05-11T19:03:23.578Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"https://biodiversitygenomics.net/projects/5m-insects/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/bioscan-ml.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2024-04-12T06:31:01.000Z","updated_at":"2026-05-02T02:31:13.000Z","dependencies_parsed_at":"2024-12-19T05:01:32.524Z","dependency_job_id":"e50accaf-48f5-41c7-b41c-954cd0b110a2","html_url":"https://github.com/bioscan-ml/BIOSCAN-5M","commit_stats":null,"previous_names":["zahrag/bioscan-5m","bioscan-ml/bioscan-5m"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/bioscan-ml/BIOSCAN-5M","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bioscan-ml%2FBIOSCAN-5M","tags_url":"https://repos.e
cosyste.ms/api/v1/hosts/GitHub/repositories/bioscan-ml%2FBIOSCAN-5M/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bioscan-ml%2FBIOSCAN-5M/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bioscan-ml%2FBIOSCAN-5M/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/bioscan-ml","download_url":"https://codeload.github.com/bioscan-ml/BIOSCAN-5M/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bioscan-ml%2FBIOSCAN-5M/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":33080345,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-15T20:25:35.270Z","status":"ssl_error","status_checked_at":"2026-05-15T20:25:34.732Z","response_time":103,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.5:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"owner":{"login":"bioscan-ml","name":"BIOSCAN","uuid":"175227258","kind":"organization","description":"Illuminating biodiversity with DNA-based identification 
systems","email":"contact@bioscancanada.org","website":"https://biodiversitygenomics.net/research/bioscan/","location":null,"twitter":null,"company":null,"icon_url":"https://avatars.githubusercontent.com/u/175227258?v=4","repositories_count":1,"last_synced_at":"2024-10-29T05:17:09.632Z","metadata":{"has_sponsors_listing":false},"html_url":"https://github.com/bioscan-ml","funding_links":[],"total_stars":10,"followers":3,"following":0,"created_at":"2024-10-29T05:17:09.653Z","updated_at":"2024-10-29T05:17:09.653Z","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/bioscan-ml","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/bioscan-ml/repositories"},"packages":[],"commits":{"id":7539531,"full_name":"bioscan-ml/bioscan-5m","default_branch":"main","total_commits":439,"total_committers":6,"total_bot_commits":0,"total_bot_committers":0,"mean_commits":73.16666666666667,"dds":0.24601366742596809,"past_year_total_commits":10,"past_year_total_committers":4,"past_year_total_bot_commits":0,"past_year_total_bot_committers":0,"past_year_mean_commits":2.5,"past_year_dds":0.4,"last_synced_at":"2026-05-13T20:02:00.095Z","last_synced_commit":"918bd6e18b1ab61f1db3d2b125f3e6e366050bcd","created_at":"2024-12-12T00:08:08.048Z","updated_at":"2026-05-13T20:01:50.709Z","committers":[{"name":"zahrag","email":"zahra.gharaee@gmail.com","login":"zahrag","count":331},{"name":"Scott Lowe","email":"scott.code.lowe@gmail.com","login":"scottclowe","count":62},{"name":"zmgong","email":"ming2280089874@gmail.com","login":"zmgong","count":35},{"name":"Anna Viklund","email":"annamariaviklund@gmail.com","login":"annavik","count":6},{"name":"Pablo","email":"pablito9507@gmail.com","login":"millanp95","count":4},{"name":"Copilot","email":"198982749+Copilot","login":"Copilot","count":1}],"past_year_committers":[{"name":"Anna 
Viklund","email":"annamariaviklund@gmail.com","login":"annavik","count":6},{"name":"zahrag","email":"zahra.gharaee@gmail.com","login":"zahrag","count":2},{"name":"Scott Lowe","email":"scott.code.lowe@gmail.com","login":"scottclowe","count":1},{"name":"Copilot","email":"198982749+Copilot","login":"Copilot","count":1}],"commits_url":"https://commits.ecosyste.ms/api/v1/hosts/GitHub/repositories/bioscan-ml%2Fbioscan-5m/commits","host":{"name":"GitHub","url":"https://github.com","kind":"github","last_synced_at":"2026-05-15T00:00:35.990Z","repositories_count":6234725,"commits_count":894549360,"contributors_count":34908088,"owners_count":1153531,"icon_url":"https://github.com/github.png","host_url":"https://commits.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://commits.ecosyste.ms/api/v1/hosts/GitHub/repositories"}},"issues_stats":{"full_name":"bioscan-ml/bioscan-5m","html_url":"https://github.com/bioscan-ml/bioscan-5m","last_synced_at":"2026-05-03T15:01:32.048Z","status":"active","issues_count":19,"pull_requests_count":20,"avg_time_to_close_issue":14456.0,"avg_time_to_close_pull_request":75864.94736842105,"issues_closed_count":2,"pull_requests_closed_count":19,"pull_request_authors_count":6,"issue_authors_count":3,"avg_comments_per_issue":0.9473684210526315,"avg_comments_per_pull_request":0.65,"merged_pull_requests_count":14,"bot_issues_count":13,"bot_pull_requests_count":0,"past_year_issues_count":18,"past_year_pull_requests_count":6,"past_year_avg_time_to_close_issue":14456.0,"past_year_avg_time_to_close_pull_request":59546.8,"past_year_issues_closed_count":2,"past_year_pull_requests_closed_count":5,"past_year_pull_request_authors_count":3,"past_year_issue_authors_count":2,"past_year_avg_comments_per_issue":0.9444444444444444,"past_year_avg_comments_per_pull_request":1.3333333333333333,"past_year_bot_issues_count":13,"past_year_bot_pull_requests_count":0,"past_year_merged_pull_requests_count":5,"created_at":"2024-12-12T00:08:08.436Z","updated_at":"2026-05-
03T15:01:32.048Z","repository_url":"https://issues.ecosyste.ms/api/v1/hosts/GitHub/repositories/bioscan-ml%2Fbioscan-5m","issues_url":"https://issues.ecosyste.ms/api/v1/hosts/GitHub/repositories/bioscan-ml%2Fbioscan-5m/issues","issue_labels_count":{"report":13,"image":11,"documentation":4,"label":1},"pull_request_labels_count":{"documentation":1},"issue_author_associations_count":{"NONE":14,"COLLABORATOR":5},"pull_request_author_associations_count":{"COLLABORATOR":10,"MEMBER":9,"CONTRIBUTOR":1},"issue_authors":{"bioscan-browser[bot]":13,"gwtaylor":5,"Wz1h1NG":1},"pull_request_authors":{"zmgong":6,"scottclowe":5,"annavik":4,"millanp95":2,"zahrag":2,"Copilot":1},"host":{"name":"GitHub","url":"https://github.com","kind":"github","last_synced_at":"2026-05-15T00:00:53.591Z","repositories_count":14606319,"issues_count":34196153,"pull_requests_count":111971963,"authors_count":11263262,"icon_url":"https://github.com/github.png","host_url":"https://issues.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://issues.ecosyste.ms/api/v1/hosts/GitHub/repositories","owners_url":"https://issues.ecosyste.ms/api/v1/hosts/GitHub/owners","authors_url":"https://issues.ecosyste.ms/api/v1/hosts/GitHub/authors"},"past_year_issue_labels_count":{"report":13,"image":11,"documentation":4,"label":1},"past_year_pull_request_labels_count":{"documentation":1},"past_year_issue_author_associations_count":{"NONE":13,"COLLABORATOR":5},"past_year_pull_request_author_associations_count":{"MEMBER":5,"CONTRIBUTOR":1},"past_year_issue_authors":{"bioscan-browser[bot]":13,"gwtaylor":5},"past_year_pull_request_authors":{"annavik":4,"Copilot":1,"scottclowe":1},"maintainers":[{"login":"zmgong","count":6,"url":"https://issues.ecosyste.ms/api/v1/hosts/GitHub/authors/zmgong"},{"login":"gwtaylor","count":5,"url":"https://issues.ecosyste.ms/api/v1/hosts/GitHub/authors/gwtaylor"},{"login":"scottclowe","count":5,"url":"https://issues.ecosyste.ms/api/v1/hosts/GitHub/authors/scottclowe"},{"login":"annavik","coun
t":4,"url":"https://issues.ecosyste.ms/api/v1/hosts/GitHub/authors/annavik"},{"login":"millanp95","count":2,"url":"https://issues.ecosyste.ms/api/v1/hosts/GitHub/authors/millanp95"},{"login":"zahrag","count":2,"url":"https://issues.ecosyste.ms/api/v1/hosts/GitHub/authors/zahrag"}],"active_maintainers":[{"login":"gwtaylor","count":5,"url":"https://issues.ecosyste.ms/api/v1/hosts/GitHub/authors/gwtaylor"},{"login":"annavik","count":4,"url":"https://issues.ecosyste.ms/api/v1/hosts/GitHub/authors/annavik"},{"login":"scottclowe","count":1,"url":"https://issues.ecosyste.ms/api/v1/hosts/GitHub/authors/scottclowe"}]},"events":{"total":{"DeleteEvent":7,"PullRequestEvent":15,"ForkEvent":1,"IssuesEvent":6,"WatchEvent":13,"IssueCommentEvent":20,"PushEvent":108,"PullRequestReviewEvent":1,"CreateEvent":8},"last_year":{"DeleteEvent":1,"PullRequestEvent":4,"IssuesEvent":5,"WatchEvent":4,"IssueCommentEvent":16,"PushEvent":10,"PullRequestReviewEvent":1,"CreateEvent":2}},"keywords":[],"dependencies":[],"score":5.402677381872279,"created_at":"2024-12-12T00:08:01.375Z","updated_at":"2026-05-16T21:30:20.975Z","avatar_url":"https://github.com/bioscan-ml.png","language":"Python","category":"Biosphere","sub_category":"Biodiversity Data Access and Management","monthly_downloads":0,"total_dependent_repos":0,"total_dependent_packages":0,"readme":"BIOSCAN-5M\n==========\n\u003cdiv align=\"center\"\u003e\n  \u003cimg src=\"BIOSCAN_images/repo_images/fig1.png\" alt=\"An example record from the BIOSCAN-5M dataset\" /\u003e\n  \u003cp\u003e\u003cb\u003eFigure 1:\u003c/b\u003e A BIOSCAN-5M dataset sample with multimodal data types.\n\u003c/div\u003e\n\n\nOverview\n--------\nThis repository contains the code and data related to the [BIOSCAN-5M](https://biodiversitygenomics.net/5M-insects/)\nproject.\nBIOSCAN-5M is a comprehensive multi-modal dataset comprised of over 5 million specimens, 98% of which are insects.\nEvery record has **both image and DNA** data.\n\nIf you make use of the BIOSCAN-5M 
dataset and/or this code repository, please cite the following [paper](https://arxiv.org/abs/2406.12723):\n\n```bibtex\n@inproceedings{gharaee2024bioscan5m,\n    title={{BIOSCAN-5M}: A Multimodal Dataset for Insect Biodiversity},\n    booktitle={Advances in Neural Information Processing Systems},\n    author={Zahra Gharaee and Scott C. Lowe and ZeMing Gong and Pablo Millan Arias\n        and Nicholas Pellegrino and Austin T. Wang and Joakim Bruslund Haurum\n        and Iuliia Zarubiieva and Lila Kari and Dirk Steinke and Graham W. Taylor\n        and Paul Fieguth and Angel X. Chang\n    },\n    editor={A. Globerson and L. Mackey and D. Belgrave and A. Fan and U. Paquet and J. Tomczak and C. Zhang},\n    pages={36285--36313},\n    publisher={Curran Associates, Inc.},\n    year={2024},\n    volume={37},\n    url={https://proceedings.neurips.cc/paper_files/paper/2024/file/3fdbb472813041c9ecef04c20c2b1e5a-Paper-Datasets_and_Benchmarks_Track.pdf},\n}\n```\n\n## Getting Started with BIOSCAN-5M\n\n### I. Environment Setup\nTo set up the BIOSCAN-5M project, create and activate the required environment using the provided `bioscan5m.yaml` file. \nRun the following commands:\n\n```bash\nconda env create -f bioscan5m.yaml\n```\n\n```bash\nconda activate bioscan5m\n```\n\n### II. Dataset Quick Start\n\nQuickly access the BIOSCAN-5M dataset by installing the dataset package and initializing the data loader. Use the following commands:\n\n```bash\npip install bioscan-dataset\n```\n\n```python\nfrom bioscan_dataset import BIOSCAN5M\n\nds = BIOSCAN5M(\"~/Datasets/bioscan-5m\", download=True)\n```\nFor more detailed information, please visit the [BIOSCAN-5M Dataset Package](https://github.com/bioscan-ml/dataset).\n\n### III. 
Task-Specific Settings\nPlease note that to work with all modules connected to this repository, \nyou may need to install additional dependencies specific to each module (if any).\nBe sure to follow the instructions provided within each module's folder for further setup details.\n\n\nDataset\n-------\nWe present the BIOSCAN-5M dataset to the machine learning community.\nWe hope this dataset will facilitate the development of tools to automate aspects of the monitoring of global insect biodiversity.\n\nEach record of the BIOSCAN-5M dataset contains six primary attributes:\n* RGB image\n  * Metadata field: \u003ccode\u003eprocessid\u003c/code\u003e\n* DNA barcode sequence\n  * Metadata field: \u003ccode\u003edna_barcode\u003c/code\u003e\n* Barcode Index Number (BIN)\n  * Metadata field: \u003ccode\u003edna_bin\u003c/code\u003e\n* Biological taxonomic classification\n  * Metadata fields: \u003ccode\u003ephylum\u003c/code\u003e, \u003ccode\u003eclass\u003c/code\u003e, \u003ccode\u003eorder\u003c/code\u003e, \u003ccode\u003efamily\u003c/code\u003e, \u003ccode\u003esubfamily\u003c/code\u003e, \u003ccode\u003egenus\u003c/code\u003e, \u003ccode\u003especies\u003c/code\u003e, \u003ccode\u003etaxon\u003c/code\u003e\n* Geographical information \n  * Metadata fields: \u003ccode\u003ecountry\u003c/code\u003e, \u003ccode\u003eprovince_state\u003c/code\u003e, \u003ccode\u003elatitude\u003c/code\u003e, \u003ccode\u003elongitude\u003c/code\u003e\n* Specimen size\n  * Metadata fields: \u003ccode\u003eimage_measurement_value\u003c/code\u003e, \u003ccode\u003earea_fraction\u003c/code\u003e, \u003ccode\u003escale_factor\u003c/code\u003e\n\n\n### Dataset Access\nAll dataset image packages and metadata files are accessible for download through the\n[Google Drive](https://drive.google.com/drive/u/1/folders/1Jc57eKkeiYrnUBc9WlIp-ZS_L1bVlT-0) folder.\nAdditionally, the dataset is available on research and data sharing platforms such as 
[Zenodo](https://zenodo.org/records/11973457),\n[Kaggle](https://www.kaggle.com/datasets/zahragharaee/bioscan-5m), and [HuggingFace](https://huggingface.co/datasets/Gharaee/BIOSCAN-5M).\n \n### Dataset Browser\nThe [BIOSCAN Browser](https://browser.bioscan-ml.org/) is an interactive tool designed to explore the BIOSCAN-5M dataset efficiently. \nIt allows you to navigate through taxonomic ranks, visualize specimens, and analyze DNA barcode sequences. \nThe browser supports advanced filtering, sorting, and visualization capabilities to facilitate in-depth data exploration for researchers and developers.\n\n### Metadata \nThe dataset metadata file **BIOSCAN_5M_Insect_Dataset_metadata** contains biological information, geographic information as well as \nsize information of the organisms. We provide this metadata in both CSV and JSONLD file types.\n\n\n### RGB Image \nThe BIOSCAN-5M dataset comprises resized and cropped images.\nWe have provided various packages of the BIOSCAN-5M dataset, each tailored for specific purposes.\n\n\u003cdiv align=\"center\"\u003e\n  \u003cimg src=\"BIOSCAN_images/repo_images/images_n.png\" alt=\"An array of example insect images from the BIOSCAN-5M dataset.\" /\u003e\n  \u003cp\u003e\u003cb\u003eFigure 2:\u003c/b\u003e Examples of the original images of the BIOSCAN-5M dataset.\n\u003c/div\u003e\n\n#### Cropped images\nWe trained a model on examples from this dataset in order to create a tool introduced in [BIOSCAN-1M](https://github.com/zahrag/BIOSCAN-1M), which can automatically generate bounding boxes around the insect.\nWe used this to crop each image down to only the region of interest.\n\n#### Image packages\n* **BIOSCAN_5M_original_full**: The raw images of the dataset.\n* **BIOSCAN_5M_cropped**: Images after cropping with our cropping tool.\n* **BIOSCAN_5M_original_256**: Original images resized to 256 on their shorter side. 
\n* **BIOSCAN_5M_cropped_256**: Cropped images resized to 256 on their shorter side.\n \n\n\u003ctable\u003e\n  \u003cthead\u003e\n    \u003ctr style=\"background-color: #f2f2f2;\"\u003e\n      \u003cth\u003eBIOSCAN_5M_original_full\u003c/th\u003e\n      \u003cth\u003eBIOSCAN_5M_cropped\u003c/th\u003e\n    \u003c/tr\u003e\n  \u003c/thead\u003e\n  \u003ctbody\u003e\n    \u003ctr\u003e\n      \u003ctd\u003e\n        \u003cul\u003e\n          \u003cli\u003eBIOSCAN_5M_original_full.01.zip\u003c/li\u003e\n          \u003cli\u003eBIOSCAN_5M_original_full.02.zip\u003c/li\u003e\n          \u003cli\u003eBIOSCAN_5M_original_full.03.zip\u003c/li\u003e\n          \u003cli\u003eBIOSCAN_5M_original_full.04.zip\u003c/li\u003e\n          \u003cli\u003eBIOSCAN_5M_original_full.05.zip\u003c/li\u003e\n        \u003c/ul\u003e\n      \u003c/td\u003e\n      \u003ctd\u003e\n        \u003cul\u003e\n          \u003cli\u003eBIOSCAN_5M_cropped.01.zip\u003c/li\u003e\n          \u003cli\u003eBIOSCAN_5M_cropped.02.zip\u003c/li\u003e\n        \u003c/ul\u003e\n      \u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr style=\"background-color: #f2f2f2;\"\u003e\n      \u003cth\u003eBIOSCAN_5M_original_256\u003c/th\u003e\n      \u003cth\u003eBIOSCAN_5M_cropped_256\u003c/th\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n      \u003ctd\u003e\n        \u003cul\u003e\n          \u003cli\u003eBIOSCAN_5M_original_256.zip\u003c/li\u003e\n          \u003cli\u003eBIOSCAN_5M_original_256_pretrain.zip\u003c/li\u003e\n          \u003cli\u003eBIOSCAN_5M_original_256_train.zip\u003c/li\u003e\n          \u003cli\u003eBIOSCAN_5M_original_256_eval.zip\u003c/li\u003e\n        \u003c/ul\u003e\n      \u003c/td\u003e\n      \u003ctd\u003e\n        \u003cul\u003e\n          \u003cli\u003eBIOSCAN_5M_cropped_256.zip\u003c/li\u003e\n          \u003cli\u003eBIOSCAN_5M_cropped_256_pretrain.zip\u003c/li\u003e\n          \u003cli\u003eBIOSCAN_5M_cropped_256_train.zip\u003c/li\u003e\n          
\u003cli\u003eBIOSCAN_5M_cropped_256_eval.zip\u003c/li\u003e\n        \u003c/ul\u003e\n      \u003c/td\u003e\n    \u003c/tr\u003e\n  \u003c/tbody\u003e\n\u003c/table\u003e\n\n\n### Geographical Information\nThe BIOSCAN-5M dataset provides geographic information associated with the collection sites of the organisms. \nThe following geographic data is presented in the \u003ccode\u003ecountry\u003c/code\u003e, \u003ccode\u003eprovince_state\u003c/code\u003e, \u003ccode\u003elatitude\u003c/code\u003e, and \n\u003ccode\u003elongitude\u003c/code\u003e fields of the metadata file(s):\n* Latitude and longitude coordinates\n* Country\n* Province or State\n\n\u003cfigure style=\"text-align: center;\"\u003e\n  \u003cimg src=\"BIOSCAN_images/repo_images/BIOSCAN_5M_Insect_Dataset_lat_lon_map.png\" alt=\"World map overlaid with the distribution of sample collection sites and their frequencies.\" /\u003e\n  \u003cfigcaption\u003e\u003cb\u003eFigure 3:\u003c/b\u003e Locations obtained from latitude and longitude coordinates associated with the sites of collection.\u003c/figcaption\u003e\n\u003c/figure\u003e\n\n\u003cbr\u003e\u003cbr\u003e\n\n\u003cdiv align=\"center\"\u003e\n  \u003cimg src=\"BIOSCAN_images/repo_images/map_supplement3.png\" alt=\"World map overlaid with the number of samples collected per country.\" /\u003e\n  \u003cp\u003e\u003cb\u003eFigure 4:\u003c/b\u003e Countries associated with the sites of collection.\n\u003c/div\u003e\n\n### Size Information\nThe BIOSCAN-5M dataset provides information about the size of the organisms. 
\nThe size data is presented in the \u003ccode\u003eimage_measurement_value\u003c/code\u003e, \u003ccode\u003earea_fraction\u003c/code\u003e, and \n\u003ccode\u003escale_factor\u003c/code\u003e fields of the metadata file(s):\n\n* Image measurement value: Total number of pixels occupied by the organism.\n\nFurthermore, using our cropping tool, we calculated the following information about the size of the organisms:\n* Area fraction: Fraction of the original image that the cropped image comprises.\n* Scale factor: Ratio of the cropped image to the cropped and resized image.\n\n\u003cfigure style=\"text-align: center;\"\u003e\n  \u003cimg src=\"BIOSCAN_images/repo_images/images_masks.png\" alt=\"Example pixel masks of the organism.\" /\u003e\n  \u003cfigcaption\u003e\u003cb\u003eFigure 5:\u003c/b\u003e Examples of original images (top) and their corresponding masks (bottom) depicting pixels occupied by the organism.\u003c/figcaption\u003e\n\u003c/figure\u003e\n\n\nBenchmark Experiments\n---------------------\n\n### Data Partitions\nWe partitioned the BIOSCAN-5M dataset into splits for both closed-world and open-world machine learning problems. 
\nTo use the partitions we propose, see the \u003ccode\u003esplit\u003c/code\u003e field of the metadata file(s).\n\n\n* The **closed-world** classification task uses samples labelled with a scientific name for their species\n(\u003ccode\u003etrain\u003c/code\u003e, \u003ccode\u003eval\u003c/code\u003e, and \u003ccode\u003etest\u003c/code\u003e partitions).\n  * This task requires the model to correctly classify new images and DNA barcodes across a known set of species labels that were seen during training.\n\n* The **open-world** classification task uses samples whose species name is a placeholder name,\nand whose genus name is a scientific name\n(\u003ccode\u003ekey_unseen\u003c/code\u003e, \u003ccode\u003eval_unseen\u003c/code\u003e, and \u003ccode\u003etest_unseen\u003c/code\u003e partitions).\n  * This task requires the model to correctly group together new species that were not seen during training.\n  * In the retrieval paradigm, this task can be performed using \u003ccode\u003etest_unseen\u003c/code\u003e records as queries against keys from the \u003ccode\u003ekey_unseen\u003c/code\u003e records.\n  * Alternatively, this data can be evaluated at the genus level by classification via the species in the \u003ccode\u003etrain\u003c/code\u003e partition.\n\n* Samples labelled with placeholder species names and whose genus name is not a scientific name are placed in the \u003ccode\u003eother_heldout\u003c/code\u003e partition.\n  * This data can be used to train an unseen species novelty detector.\n\n* Samples without species labels are placed in the \u003ccode\u003epretrain\u003c/code\u003e partition, which comprises 90% of the data.\n  * This data can be used for self-supervised or semi-supervised training. 
\n\n\u003cdiv align=\"center\" style=\"display: flex; justify-content: center; gap: 20px;\"\u003e\n  \u003cdiv\u003e\n    \u003cimg src=\"BIOSCAN_images/repo_images/bioscan5m_split_seen.png\" \n    alt=\"a\" \n         style=\"max-width: 300px; height: 300px;\" /\u003e\n    \u003cimg src=\"BIOSCAN_images/repo_images/bioscan5m_split_unseen.png\" \nalt=\"b\" \n         style=\"max-width: 300px; height: 300px;\" /\u003e\n  \u003c/div\u003e\n\u003c/div\u003e\n\u003cdiv align=\"left\"\u003e\n  \u003cp\u003e\u003cb\u003eFigure 6: Distribution of class- and (Insecta) order-level taxa\u003c/b\u003e for seen and unseen data partitions. The distributions reflect that of known and newly-discovered species, respectively.\u003c/p\u003e\n\u003c/div\u003e\n\n\n\n### Task-I: DNA-based taxonomic classification \nTwo stages of the proposed semi-supervised learning set-up based on [BarcodeBERT](https://arxiv.org/abs/2311.02401). \n1. Pretraining: DNA sequences are tokenized using non-overlapping k-mers and 50% of the tokens are masked for the MLM task. \nTokens are encoded and fed into a transformer model. The output embeddings are used for token-level classification. \n2. Fine-tuning: All DNA sequences in a dataset are tokenized using non-overlapping $k$-mer tokenization and all tokenized sequences, without masking, are passed through the pretrained transformer model.  Global mean-pooling is applied over the token-level embeddings and the output is used for taxonomic classification.\n\n\u003cdiv align=\"center\"\u003e\n  \u003cimg src=\"BIOSCAN_images/repo_images/barcode_bert_n2.png\" alt=\"Methodology for BarcodeBERT experiments.\" /\u003e\n  \u003cp\u003e\u003cb\u003eFigure 7:\u003c/b\u003e BarcodeBERT model architecture.\n\u003c/div\u003e\n\n#### Results\nThe performance of the taxonomic classification using DNA barcode sequences of the BIOSCAN-5M dataset is summarized as follows:\n\n**Performance of DNA-based sequence models** in closed- and open-world settings.  
\nFor the closed-world setting, we show the species-level accuracy (%) for predicting seen species.  \nFor the open-world setting, we show genus-level accuracy (%) for unseen species, while using seen species to fit the model.  \n_Bold values indicate the best result, and italicized values indicate the second best._\n\n\n| Model        | Architecture | SSL-Pretraining | Tokens Seen | Fine-tuned Seen: Species | Linear Probe Seen: Species  | 1NN-Probe Unseen: Genus |\n|--------------|--------------|-----------------|-------------|--------------------------|-----------------------------|-------------------------|\n| CNN baseline | CNN          | --              | --          | 97.70                    | --                          | *29.88*                 |\n| NT           | Transformer  | Multi-Species   | 300 B       | 98.99                    | 52.41                       | 21.67                   |\n| DNABERT-2    | Transformer  | Multi-Species   | 512 B       | *99.23*                  | 67.81                       | 17.99                   |\n| DNABERT-S    | Transformer  | Multi-Species   | ~1,000 B    | 98.99                    | **95.50**                   | 17.70                   |\n| HyenaDNA     | SSM          | Human DNA       | 5 B         | 98.71                    | 54.82                       | 19.26                   |\n| BarcodeBERT  | Transformer  | DNA barcodes    | 5 B         | 98.52                    | 91.93                       | 23.15                   |\n| **Ours**     | Transformer  | DNA barcodes    | 7 B         | **99.28**                | *94.47*                     | **47.03**               |\n\n\n### Task-II: Zero-shot transfer learning \nWe follow the experimental setup recommended by [zero-shot clustering](https://arxiv.org/abs/2406.02465),\nexpanded to operate on multiple modalities.\n1. Take pretrained encoders.\n2. Extract feature vectors from the stimuli by passing them through the pretrained encoder.\n3. 
Reduce the embeddings with UMAP.\n4. Cluster the reduced embeddings with Agglomerative Clustering.\n5. Evaluate against the ground-truth annotations with Adjusted Mutual Information.\n\n\u003cdiv align=\"center\"\u003e\n  \u003cimg src=\"BIOSCAN_images/repo_images/bioscan_zsc_n1.png\" alt=\"Methodology for zero-shot clustering experiments.\" /\u003e\n  \u003cp\u003e\u003cb\u003eFigure 8:\u003c/b\u003e BIOSCAN-ZSC model architecture.\n\u003c/div\u003e\n\n#### Results\nThe performance of the zero-shot transfer learning experiments on the BIOSCAN-5M dataset is summarized as follows:\n\n\u003cdiv align=\"center\" style=\"display: flex; justify-content: space-between; gap: 20px; flex-wrap: nowrap;\"\u003e\n  \u003cdiv\u003e\n    \u003cimg src=\"BIOSCAN_images/repo_images/bioscan5m_zsc_image.png\" \n         alt=\"a\" \n         style=\"max-width: 350px; height: 350px;\" /\u003e\n    \u003cimg src=\"BIOSCAN_images/repo_images/bioscan5m_zsc_dna.png\" \n         alt=\"b\" \n         style=\"max-width: 350px; height: 350px;\" /\u003e\n  \u003c/div\u003e\n\u003c/div\u003e\n\u003cdiv align=\"left\"\u003e\n\u003cp\u003e\u003cb\u003eFigure 9: Zero-shot clustering AMI (\\%) performance\u003c/b\u003e across taxonomic ranks.\n    For images (left), pretrained encoders only capture coarse-grained information, but with DNA barcodes (right), clustering yields high performance down to the species level, even without model retraining.\u003c/div\u003e\n\n### Task-III: Multimodal retrieval learning \nOur experiments using [CLIBD](https://github.com/bioscan-ml/clibd) are conducted in two steps.\n\n\n1. Training: Multiple modalities, including RGB images, textual taxonomy, and DNA sequences, are encoded separately, \nand trained using a contrastive loss function. \n2. Inference: An image or DNA embedding is used as a query and compared to the embeddings obtained from a database of image, \nDNA, and text (keys). 
The cosine similarity is used to find the closest key embedding, and the corresponding taxonomic label is used to classify the query.\n\n\u003cdiv align=\"center\"\u003e\n  \u003cimg src=\"BIOSCAN_images/repo_images/bioscan_clip.png\" alt=\"Methodology for BIOSCAN-CLIP experiments.\" /\u003e\n  \u003cp\u003e\u003cb\u003eFigure 10:\u003c/b\u003e CLIBD model architecture.\n\u003c/div\u003e\n\n#### Results\nThe performance of the multimodal retrieval learning experiments on the BIOSCAN-5M dataset is summarized as follows:\n\n\u003cdiv align=\"center\" style=\"display: flex; justify-content: center; gap: 20px;\"\u003e\n  \u003cdiv\u003e\n    \u003cimg src=\"BIOSCAN_images/repo_images/bioscan5m_clibd_noalign.png\" \n    alt=\"a\" \n    style=\"max-width: 250px; height: 250px;\" /\u003e\n    \u003cimg src=\"BIOSCAN_images/repo_images/bioscan5m_clibd_idt.png\" \n     alt=\"b\" \n     style=\"max-width: 250px; height: 250px;\"/\u003e\n  \u003c/div\u003e\n\u003c/div\u003e\n\u003cdiv align=\"left\"\u003e\n  \u003cp\u003e\u003cb\u003eFigure 11: Multimodal retrieval accuracy (\\%)\u003c/b\u003e on seen and unseen species across different methods of retrieval (image-to-image, image-to-DNA, and DNA-to-DNA).\n    \u003cb\u003eLeft\u003c/b\u003e: retrieval accuracy before alignment of encoders. 
\u003cb\u003eRight\u003c/b\u003e: retrieval accuracy after aligning images, DNA, and taxonomic labels.\u003c/div\u003e\n\n\n## Copyright and License \nThe images and metadata included in the BIOSCAN-5M dataset available through this repository are subject to copyright \nand licensing restrictions shown in the following:\n\n - Copyright Holder: CBG Photography Group\n - Copyright Institution: Centre for Biodiversity Genomics (email: cbg.analytics@uoguelph.ca)\n - Photographer: CBG Robotic Imager\n - Copyright License: Creative Commons Attribution 3.0 Unported ([CC BY 3.0](https://creativecommons.org/licenses/by/3.0/))\n - Copyright Contact: cbg.collections@uoguelph.ca\n - Copyright Year: 2021\n","funding_links":[],"readme_doi_urls":[],"works":{},"citation_counts":{},"total_citations":0,"keywords_from_contributors":[],"project_url":"https://ost.ecosyste.ms/api/v1/projects/305512","html_url":"https://ost.ecosyste.ms/projects/305512"}