{"id":106826,"name":"gbifdb","description":"Provide a relational database interface to a parquet based serializations of gbif's AWS snapshots of its public data.","url":"https://github.com/ropensci/gbifdb","last_synced_at":"2026-06-01T01:01:19.581Z","repository":{"id":41255554,"uuid":"425330108","full_name":"ropensci/gbifdb","owner":"ropensci","description":":package: A High Performance Interface to GBIF","archived":false,"fork":false,"pushed_at":"2025-09-14T18:53:02.000Z","size":407,"stargazers_count":40,"open_issues_count":0,"forks_count":5,"subscribers_count":4,"default_branch":"main","last_synced_at":"2026-05-17T18:32:30.495Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"https://docs.ropensci.org/gbifdb/","language":"R","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/ropensci.png","metadata":{"files":{"readme":"README.Rmd","changelog":"NEWS.md","contributing":".github/CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":"codemeta.json","zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2021-11-06T19:25:41.000Z","updated_at":"2025-10-05T17:27:56.000Z","dependencies_parsed_at":"2023-10-05T05:08:42.462Z","dependency_job_id":"f38ef0f9-7859-4600-9cab-50fce18d1fc1","html_url":"https://github.com/ropensci/gbifdb","commit_stats":{"total_commits":81,"total_committers":3,"mean_commits":27.0,"dds":0.06172839506172845,"last_synced_commit":"35b60fded47c800c26c908b1289e98b6f2edef78"},"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/ropensci/gbifdb","repository_url":"https://repos.ecosyste.ms/api/v1/hosts
/GitHub/repositories/ropensci%2Fgbifdb","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ropensci%2Fgbifdb/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ropensci%2Fgbifdb/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ropensci%2Fgbifdb/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/ropensci","download_url":"https://codeload.github.com/ropensci/gbifdb/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ropensci%2Fgbifdb/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":33412082,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-23T18:09:33.147Z","status":"ssl_error","status_checked_at":"2026-05-23T18:09:31.380Z","response_time":53,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.6:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"owner":{"login":"ropensci","name":"rOpenSci","uuid":"1200269","kind":"organization","description":"","email":"info@ropensci.org","website":"https://ropensci.org/","location":"Berkeley, 
CA","twitter":"rOpenSci","company":null,"icon_url":"https://avatars.githubusercontent.com/u/1200269?v=4","repositories_count":307,"last_synced_at":"2023-03-10T20:30:59.242Z","metadata":{"has_sponsors_listing":false},"html_url":"https://github.com/ropensci","funding_links":[],"total_stars":null,"followers":null,"following":null,"created_at":"2022-11-02T19:23:08.224Z","updated_at":"2023-03-10T20:30:59.305Z","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/ropensci","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/ropensci/repositories"},"packages":[{"id":4367271,"name":"gbifdb","ecosystem":"cran","description":"High Performance Interface to 'GBIF'","homepage":"https://docs.ropensci.org/gbifdb/","licenses":"Apache License (≥ 2)","normalized_licenses":["Apache-1.1"],"repository_url":"https://github.com/ropensci/gbifdb","keywords_array":[],"namespace":null,"versions_count":2,"first_release_published_at":"2022-05-21T00:00:00.000Z","latest_release_published_at":"2023-10-19T00:00:00.000Z","latest_release_number":"1.0.0","last_synced_at":"2026-05-28T00:07:03.478Z","created_at":"2022-05-24T11:00:40.721Z","updated_at":"2026-05-28T00:07:03.478Z","registry_url":"https://cran.r-project.org/package=gbifdb","install_command":null,"documentation_url":"http://cran.r-project.org/web/packages/gbifdb/gbifdb.pdf","metadata":{},"repo_metadata":{"id":41255554,"uuid":"425330108","full_name":"ropensci/gbifdb","owner":"ropensci","description":":package: A High Performance Interface to 
GBIF","archived":false,"fork":false,"pushed_at":"2024-02-05T23:52:40.000Z","size":400,"stargazers_count":35,"open_issues_count":0,"forks_count":4,"subscribers_count":6,"default_branch":"main","last_synced_at":"2024-10-29T22:29:20.953Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"https://docs.ropensci.org/gbifdb/","language":"R","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/ropensci.png","metadata":{"files":{"readme":"README.Rmd","changelog":null,"contributing":".github/CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null}},"created_at":"2021-11-06T19:25:41.000Z","updated_at":"2024-09-24T12:48:42.000Z","dependencies_parsed_at":"2023-10-05T05:08:42.462Z","dependency_job_id":"492c8e1d-456c-4030-bac6-8f30f13faacb","html_url":"https://github.com/ropensci/gbifdb","commit_stats":{"total_commits":81,"total_committers":3,"mean_commits":27.0,"dds":0.06172839506172845,"last_synced_commit":"35b60fded47c800c26c908b1289e98b6f2edef78"},"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ropensci%2Fgbifdb","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ropensci%2Fgbifdb/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ropensci%2Fgbifdb/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ropensci%2Fgbifdb/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/ropensci","download_url":"https://codeload.github.com/ropensci/gbifdb/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count
":222155692,"owners_count":16940391,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"},"owner_record":{"login":"ropensci","name":"rOpenSci","uuid":"1200269","kind":"organization","description":"","email":"info@ropensci.org","website":"https://ropensci.org/","location":"Berkeley, CA","twitter":"rOpenSci","company":null,"icon_url":"https://avatars.githubusercontent.com/u/1200269?v=4","repositories_count":307,"last_synced_at":"2023-03-10T20:30:59.242Z","metadata":{"has_sponsors_listing":false},"html_url":"https://github.com/ropensci","funding_links":[],"total_stars":null,"followers":null,"following":null,"created_at":"2022-11-02T19:23:08.224Z","updated_at":"2023-03-10T20:30:59.305Z","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/ropensci","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/ropensci/repositories"},"tags":[]},"repo_metadata_updated_at":"2024-10-30T02:58:40.129Z","dependent_packages_count":0,"downloads":178,"downloads_period":"last-month","dependent_repos_count":0,"rankings":{"downloads":79.2730891342463,"dependent_repos_count":35.45467469080226,"dependent_packages_count":29.796711368051938,"stargazers_count":11.325403970999384,"forks_count":14.855707719281618,"average":34.1411173766763},"purl":"pkg:cran/gbifdb","advisories":[],"docker_usage_url":"https://docker.ecosyste.ms/usage/cran/gbifdb","docker_dependents_count":null,"docker_downloads_count":null,"usage_url":"https://repos.ecosyste.ms/usage/cran/gbifdb","dependent_repositories_url":"https://repos.ecosyste.ms/api/v1/usage/cran/gbifdb/dependencies","status":null,"funding_links":[],"crit
ical":null,"issue_metadata":{"last_synced_at":"2024-10-29T21:00:22.007Z","issues_count":4,"pull_requests_count":8,"avg_time_to_close_issue":314567.0,"avg_time_to_close_pull_request":342874.5,"issues_closed_count":4,"pull_requests_closed_count":8,"pull_request_authors_count":3,"issue_authors_count":3,"avg_comments_per_issue":4.0,"avg_comments_per_pull_request":0.5,"merged_pull_requests_count":6,"bot_issues_count":0,"bot_pull_requests_count":0,"past_year_issues_count":2,"past_year_pull_requests_count":1,"past_year_avg_time_to_close_issue":557523.0,"past_year_avg_time_to_close_pull_request":2400051.0,"past_year_issues_closed_count":2,"past_year_pull_requests_closed_count":1,"past_year_pull_request_authors_count":1,"past_year_issue_authors_count":1,"past_year_avg_comments_per_issue":6.0,"past_year_avg_comments_per_pull_request":2.0,"past_year_bot_issues_count":0,"past_year_bot_pull_requests_count":0,"past_year_merged_pull_requests_count":0,"issues_url":"https://issues.ecosyste.ms/api/v1/hosts/GitHub/repositories/ropensci%2Fgbifdb/issues","maintainers":[{"login":"cboettig","count":7,"url":"https://issues.ecosyste.ms/api/v1/hosts/GitHub/authors/cboettig"}],"active_maintainers":[]},"versions_url":"https://packages.ecosyste.ms/api/v1/registries/cran.r-project.org/packages/gbifdb/versions","version_numbers_url":"https://packages.ecosyste.ms/api/v1/registries/cran.r-project.org/packages/gbifdb/version_numbers","latest_version_url":"https://packages.ecosyste.ms/api/v1/registries/cran.r-project.org/packages/gbifdb/latest_version","dependent_packages_url":"https://packages.ecosyste.ms/api/v1/registries/cran.r-project.org/packages/gbifdb/dependent_packages","related_packages_url":"https://packages.ecosyste.ms/api/v1/registries/cran.r-project.org/packages/gbifdb/related_packages","codemeta_url":"https://packages.ecosyste.ms/api/v1/registries/cran.r-project.org/packages/gbifdb/codemeta","maintainers":[{"uuid":"cboettig@gmail.com","login":null,"name":"Carl 
Boettiger","email":"cboettig@gmail.com","url":null,"packages_count":26,"html_url":null,"role":null,"created_at":"2022-11-14T17:26:30.886Z","updated_at":"2022-11-14T17:26:30.886Z","packages_url":"https://packages.ecosyste.ms/api/v1/registries/cran.r-project.org/maintainers/cboettig@gmail.com/packages"}],"registry":{"name":"cran.r-project.org","url":"https://cran.r-project.org","ecosystem":"cran","default":true,"packages_count":28538,"maintainers_count":15849,"namespaces_count":0,"keywords_count":0,"github":"r-project-org","metadata":{"icon_url":"https://cran.r-project.org/CRANlogo.png"},"icon_url":"https://cran.r-project.org/CRANlogo.png","created_at":"2022-04-06T16:32:25.637Z","updated_at":"2026-04-27T18:20:16.286Z","packages_url":"https://packages.ecosyste.ms/api/v1/registries/cran.r-project.org/packages","maintainers_url":"https://packages.ecosyste.ms/api/v1/registries/cran.r-project.org/maintainers","namespaces_url":"https://packages.ecosyste.ms/api/v1/registries/cran.r-project.org/namespaces"}}],"commits":{"id":1396334,"full_name":"ropensci/gbifdb","default_branch":"main","total_commits":82,"total_committers":4,"total_bot_commits":0,"total_bot_committers":0,"mean_commits":20.5,"dds":0.07317073170731703,"past_year_total_commits":1,"past_year_total_committers":1,"past_year_total_bot_commits":0,"past_year_total_bot_committers":0,"past_year_mean_commits":1.0,"past_year_dds":0.0,"last_synced_at":"2026-05-27T23:02:41.572Z","last_synced_commit":"12490ac39abc2f859937a2dc2bbfff3f378959ac","created_at":"2023-10-01T00:07:02.549Z","updated_at":"2026-05-27T23:02:28.661Z","committers":[{"name":"Carl","email":"cboettig@gmail.com","login":"cboettig","count":76},{"name":"Carl Boettiger","email":"cobettig@gmail.com","login":null,"count":4},{"name":"beausoleilmo","email":"beausoleilmo","login":"beausoleilmo","count":1},{"name":"Daniel 
Noesgaard","email":"dnoesgaard@gbif.org","login":"dnoesgaard","count":1}],"past_year_committers":[{"name":"beausoleilmo","email":"beausoleilmo","login":"beausoleilmo","count":1}],"commits_url":"https://commits.ecosyste.ms/api/v1/hosts/GitHub/repositories/ropensci%2Fgbifdb/commits","host":{"name":"GitHub","url":"https://github.com","kind":"github","last_synced_at":"2026-05-30T00:00:22.241Z","repositories_count":6247850,"commits_count":883524063,"contributors_count":34986510,"owners_count":1160224,"icon_url":"https://github.com/github.png","host_url":"https://commits.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://commits.ecosyste.ms/api/v1/hosts/GitHub/repositories"}},"issues_stats":{"full_name":"ropensci/gbifdb","html_url":"https://github.com/ropensci/gbifdb","last_synced_at":"2026-05-17T18:03:58.734Z","status":"error","issues_count":5,"pull_requests_count":11,"avg_time_to_close_issue":317803.2,"avg_time_to_close_pull_request":524526.4,"issues_closed_count":5,"pull_requests_closed_count":10,"pull_request_authors_count":4,"issue_authors_count":4,"avg_comments_per_issue":2.8,"avg_comments_per_pull_request":0.7272727272727273,"merged_pull_requests_count":7,"bot_issues_count":0,"bot_pull_requests_count":0,"past_year_issues_count":1,"past_year_pull_requests_count":2,"past_year_avg_time_to_close_issue":330748.0,"past_year_avg_time_to_close_pull_request":102217.0,"past_year_issues_closed_count":1,"past_year_pull_requests_closed_count":1,"past_year_pull_request_authors_count":1,"past_year_issue_authors_count":1,"past_year_avg_comments_per_issue":2.0,"past_year_avg_comments_per_pull_request":1.0,"past_year_bot_issues_count":0,"past_year_bot_pull_requests_count":0,"past_year_merged_pull_requests_count":1,"created_at":"2023-05-10T23:05:40.587Z","updated_at":"2026-05-17T18:03:58.735Z","repository_url":"https://issues.ecosyste.ms/api/v1/hosts/GitHub/repositories/ropensci%2Fgbifdb","issues_url":"https://issues.ecosyste.ms/api/v1/hosts/GitHub/repositories/ropensci%2Fg
bifdb/issues","issue_labels_count":{},"pull_request_labels_count":{},"issue_author_associations_count":{"NONE":3,"MEMBER":1,"CONTRIBUTOR":1},"pull_request_author_associations_count":{"MEMBER":6,"NONE":4,"CONTRIBUTOR":1},"issue_authors":{"jtmiller28":2,"Pakillo":1,"cboettig":1,"beausoleilmo":1},"pull_request_authors":{"cboettig":6,"beausoleilmo":2,"nealrichardson":2,"dnoesgaard":1},"host":{"name":"GitHub","url":"https://github.com","kind":"github","last_synced_at":"2026-05-21T00:00:41.637Z","repositories_count":14655412,"issues_count":34146579,"pull_requests_count":111718417,"authors_count":11268624,"icon_url":"https://github.com/github.png","host_url":"https://issues.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://issues.ecosyste.ms/api/v1/hosts/GitHub/repositories","owners_url":"https://issues.ecosyste.ms/api/v1/hosts/GitHub/owners","authors_url":"https://issues.ecosyste.ms/api/v1/hosts/GitHub/authors"},"past_year_issue_labels_count":{},"past_year_pull_request_labels_count":{},"past_year_issue_author_associations_count":{"CONTRIBUTOR":1},"past_year_pull_request_author_associations_count":{"NONE":2},"past_year_issue_authors":{"beausoleilmo":1},"past_year_pull_request_authors":{"beausoleilmo":2},"maintainers":[{"login":"cboettig","count":7,"url":"https://issues.ecosyste.ms/api/v1/hosts/GitHub/authors/cboettig"}],"active_maintainers":[]},"events":{"total":{"PullRequestEvent":2,"IssuesEvent":2,"WatchEvent":2,"IssueCommentEvent":2,"PushEvent":1},"last_year":{"PullRequestEvent":2,"IssuesEvent":2,"WatchEvent":1,"IssueCommentEvent":2,"PushEvent":1}},"keywords":[],"dependencies":[],"score":10.262559621074582,"created_at":"2023-10-01T00:03:50.016Z","updated_at":"2026-06-01T01:01:19.741Z","avatar_url":"https://github.com/ropensci.png","language":"R","category":"Biosphere","sub_category":"Biodiversity Data Access and Management","monthly_downloads":178,"total_dependent_repos":0,"total_dependent_packages":0,"readme":"---\noutput: github_document\n---\n\n\u003c!-- 
README.md is generated from README.Rmd. Please edit that file --\u003e\n\n```{r, include = FALSE}\nknitr::opts_chunk$set(\n  collapse = TRUE,\n  comment = \"#\u003e\",\n  warning = FALSE,\n  message = FALSE,\n  fig.path = \"man/figures/README-\",\n  out.width = \"100%\"\n)\n\nSys.unsetenv(\"AWS_ACCESS_KEY_ID\")\nSys.unsetenv(\"AWS_SECRET_ACCESS_KEY\")\n\nSys.setenv(\"GBIF_HOME\"=\"/home/shared-data/gbif\")\n\n```\n\n# gbifdb\n\n\u003c!-- badges: start --\u003e\n[![CRAN status](https://www.r-pkg.org/badges/version/gbifdb)](https://CRAN.R-project.org/package=gbifdb)\n[![R-CMD-check](https://github.com/cboettig/gbifdb/actions/workflows/R-CMD-check.yaml/badge.svg)](https://github.com/cboettig/gbifdb/actions/workflows/R-CMD-check.yaml)\n\u003c!-- badges: end --\u003e\n\nThe goal of `gbifdb` is to provide a relational database interface to `parquet`-based serializations of `gbif`'s AWS snapshots of its public data [^1].\nInstead of requiring custom functions for filtering and selecting data from the central GBIF server (as in `rgbif`), `gbifdb` users can take advantage of the full array of `dplyr` and `tidyr` functions which can be automatically translated to SQL by `dbplyr`.\nUsers already familiar with SQL can construct SQL queries directly with `DBI` instead. \n`gbifdb` sends these queries to [`duckdb`](https://duckdb.org), a high-performance, column-oriented database engine which runs entirely inside the client\n(unlike server-client databases such as MySQL or Postgres, no additional setup is needed outside of installing `gbifdb`).\n`duckdb` is able to execute these SQL queries directly on-disk against the Parquet data files, side-stepping limitations of available RAM or the need to import the data. 
\nIts highly optimized implementation can be even faster than in-memory operations in `dplyr`.\n`duckdb` supports the full set of SQL instructions, including windowed operations like `group_by`+`summarise` as well as table joins.\n\n\n[^1]: all CC0 and CC-BY licensed data in GBIF that have coordinates which passed automated quality checks,\n[see GBIF docs](https://github.com/gbif/occurrence/blob/master/aws-public-data.md)\n\n`gbifdb` has two mechanisms for providing database connections: one in which the Parquet snapshot of GBIF must first be downloaded locally, \nand a second where the GBIF parquet snapshot can be accessed directly from an Amazon Public Data Registry S3 bucket without downloading a copy.\nThe latter approach will be faster for one-off operations and is also suitable when using a cloud-based computing provider in the same region.\n\n\n\n\n\n## Installation\n\n\u003c!--\n\nYou can install the released version of `gbifdb` from [CRAN](https://CRAN.R-project.org) with:\n\n``` r\ninstall.packages(\"gbifdb\")\n```\n\n--\u003e\n\nAnd the development version from [GitHub](https://github.com/) with:\n\n``` r\n# install.packages(\"devtools\")\ndevtools::install_github(\"ropensci/gbifdb\")\n```\n\n`gbifdb` has few dependencies: `arrow`, `duckdb` and `DBI` are required.  
\n\n## Getting Started\n\n```{r message=FALSE}\nlibrary(gbifdb)\nlibrary(dplyr)  # optional, for dplyr-based operations\n```\n\n### Remote data access\n\nTo begin working with GBIF data directly without downloading the data first, simply establish a remote connection using `gbif_remote()`.\n\n```{r remote1}\ngbif \u003c- gbif_remote()\n```\n\nWe can now perform most `dplyr` operations:\n\n```{r remote2}\ngbif %\u003e%\n  filter(phylum == \"Chordata\", year \u003e 1990) %\u003e%\n  count(class, year) %\u003e%\n  collect()\n```\n\nBy default, this relies on an `arrow` connection, which currently lacks support for some more complex windowed operations in `dplyr`.\nA user can specify the option `to_duckdb = TRUE` in `gbif_remote()` (or simply pass the connection to `arrow::to_duckdb()`) to create\na `duckdb` connection. This is slightly slower at this time. \nKeep in mind that as with any database connection, to use non-`dplyr` functions the user will generally need to call `dplyr::collect()`,\nwhich pulls the data into working memory.  \nBe sure to subset the data appropriately first (e.g. with `filter`, `summarise`, etc.), as attempting to `collect()` a large\ntable will probably exceed available RAM and crash your R session!\n\nWhen using a `gbif_remote()` connection, \nall I/O operations will be conducted over the network instead of your local disk,\nwithout downloading the full dataset first.\nConsequently, this mechanism will perform best on platforms with faster network connections.\nThese operations will be considerably slower than they would be if you downloaded the entire dataset first \n(see below), unless you are on an AWS cloud instance in the same region as the remote host, but this does\navoid the download step altogether, which may be necessary if you do not have 100+ GB free storage space\nor the time to download the whole dataset first (e.g. 
for one-off queries).\n\n### Local data\n\nFor extended analysis of GBIF, users may prefer to download the entire GBIF parquet data first.\nThis requires over 100 GB free disk space, and will be a time-consuming process the first time.\nHowever, once downloaded, future queries will run much faster, particularly if you are network-limited.\nUsers can download the current release of GBIF to local storage like so:\n\n```{r download, eval=FALSE}\ngbif_download()\n```\n\nBy default, this will download to the directory given by `gbif_dir()`.  \nAn alternative directory can be provided by setting the environment variable `GBIF_HOME`,\nor providing the path to the directory containing the parquet files directly.\n\nOnce you have downloaded the parquet-formatted GBIF data, \n`gbif_local()` will establish a connection to these local parquet files. \n\n```{r local}  \ngbif \u003c- gbif_local()\ngbif\n```\n\n```{r colnames}\ncolnames(gbif)\n```\n\nNow, we can use `dplyr` to perform standard queries: \n\n```{r growth}\ngrowth \u003c- gbif %\u003e%\n  filter(phylum == \"Chordata\", year \u003e 1990) %\u003e%\n  count(class, year) %\u003e% arrange(year)\ngrowth\n```\n\nRecall that when using database connections in `dplyr`, the data remains in the database (i.e. on disk, not in working RAM).  \nThis is fine for any further operations using `dplyr`/`tidyr` functions which can be translated into SQL.  
\nUsing such functions we can usually reduce our resulting table to something much smaller, \nwhich can then be pulled into memory in R for further analysis using `collect()`:\n\n```{r plot}\nlibrary(ggplot2)\nlibrary(forcats)\n# GBIF: the global bird information facility?\ngrowth %\u003e%\n  collect() %\u003e%\n  mutate(class = fct_lump_n(class, 6)) %\u003e%\n  ggplot(aes(year, n, fill=class)) + geom_col() +\n  ggtitle(\"GBIF observations of vertebrates by class\")\n```\n\n\n## Visualizing all of GBIF\n\nDatabase operations such as rounding provide an easy way to \"rasterize\" the data for spatial visualizations.\nHere we quickly generate a global map where color intensity reflects the log of the occurrence count in each pixel:\n\n```{r}\nlibrary(terra)\nlibrary(viridisLite)\n\ndf \u003c- gbif |\u003e\n  mutate(latitude = round(decimallatitude,1),\n         longitude = round(decimallongitude,1)) |\u003e \n  count(longitude, latitude) |\u003e \n  collect() |\u003e \n  mutate(n = log(n)) |\u003e\n  filter(!is.na(n))\n\nr \u003c- rast(df, crs=\"epsg:4326\")\nplot(r, col= viridis(1e3), legend=FALSE, maxcell=6e6, colNA=\"black\", axes=FALSE)\n```\n\n## Performance notes\n\nBecause `parquet` is a column-oriented format, performance can be improved by including a `select()` call at the end of a `dplyr` function chain to only return the columns you actually need. This can be particularly helpful on remote connections using `gbif_remote()`.\n\n\n```{r include=FALSE}\nSys.unsetenv(\"GBIF_HOME\")\ncodemeta::write_codemeta()\n```\n\n\n","funding_links":[],"readme_doi_urls":[],"works":{},"citation_counts":{},"total_citations":0,"keywords_from_contributors":[],"project_url":"https://ost.ecosyste.ms/api/v1/projects/106826","html_url":"https://ost.ecosyste.ms/projects/106826"}