{"id":298523,"name":"BEETHOVEN","description":"Building an extensible, reproducible, test-driven, harmonised, open-source, versioned, ensemble model for air quality.","url":"https://github.com/niehs/beethoven","last_synced_at":"2026-04-20T10:30:19.897Z","repository":{"id":181096612,"uuid":"666215399","full_name":"NIEHS/beethoven","owner":"NIEHS","description":"BEETHOVEN is: Building an Extensible, rEproducible, Test-driven, Harmonized, Open-source, Versioned, ENsemble model for air quality ","archived":false,"fork":false,"pushed_at":"2026-03-11T14:07:50.000Z","size":682544,"stargazers_count":7,"open_issues_count":10,"forks_count":2,"subscribers_count":5,"default_branch":"main","last_synced_at":"2026-04-15T08:03:04.940Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"https://niehs.github.io/beethoven/","language":"R","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/NIEHS.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"contributing_guide.qmd","funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2023-07-14T01:45:02.000Z","updated_at":"2025-10-08T19:09:15.000Z","dependencies_parsed_at":"2023-09-27T13:46:24.917Z","dependency_job_id":"c621c424-176d-45f1-a221-76c0e09bc3ae","html_url":"https://github.com/NIEHS/beethoven","commit_stats":{"total_commits":861,"total_committers":21,"mean_commits":41.0,"dds":0.6155632984901278,"last_synced_commit":"524fcd75353d359d0ffb8f78cc634e8c632098cd"},"previous_names":["spatiotemporal-exposures-and-toxicology/nrt-ap-model","spatiotemporal-exposures
-and-toxicology/nrtapmodel","spatiotemporal-exposures-and-toxicology/beethoven","kyle-messier/beethoven","niehs/beethoven"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/NIEHS/beethoven","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/NIEHS%2Fbeethoven","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/NIEHS%2Fbeethoven/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/NIEHS%2Fbeethoven/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/NIEHS%2Fbeethoven/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/NIEHS","download_url":"https://codeload.github.com/NIEHS/beethoven/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/NIEHS%2Fbeethoven/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":32001064,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-04-18T20:23:30.271Z","status":"online","status_checked_at":"2026-04-19T02:00:07.110Z","response_time":55,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"owner":{"login":"NIEHS","name":"National Institute of Environmental Health Science","uuid":"22938605","kind":"organization","description":"The mission of the National Institute of Environmental Health Sciences is to discover how the environment affects people in order to promote 
healthier lives.","email":null,"website":"https://www.niehs.nih.gov/","location":"Durham, NC","twitter":null,"company":null,"icon_url":"https://avatars.githubusercontent.com/u/22938605?v=4","repositories_count":55,"last_synced_at":"2023-08-13T07:43:51.753Z","metadata":{"has_sponsors_listing":false},"html_url":"https://github.com/NIEHS","funding_links":[],"total_stars":null,"followers":null,"following":null,"created_at":"2022-11-07T03:50:33.323Z","updated_at":"2023-08-13T07:43:52.290Z","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/NIEHS","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/NIEHS/repositories"},"packages":[],"commits":{"id":1642203,"full_name":"niehs/beethoven","default_branch":"main","total_commits":1116,"total_committers":21,"total_bot_commits":0,"total_bot_committers":0,"mean_commits":53.142857142857146,"dds":0.703405017921147,"past_year_total_commits":90,"past_year_total_committers":3,"past_year_total_bot_commits":0,"past_year_total_bot_committers":0,"past_year_mean_commits":30.0,"past_year_dds":0.19999999999999996,"last_synced_at":"2026-04-19T09:26:05.685Z","last_synced_commit":"aaad05d453e6d9426dabf2711d8627f99700ff6f","created_at":"2024-08-17T00:13:51.443Z","updated_at":"2026-04-19T09:23:36.981Z","committers":[{"name":"Insang Song","email":"insang.song@nih.gov","login":null,"count":331},{"name":"mitchellmanware","email":"memanware@gmail.com","login":"mitchellmanware","count":315},{"name":"{SET}group","email":"127860447+Spatiotemporal-Exposures-and-Toxicology","login":"Spatiotemporal-Exposures-and-Toxicology","count":194},{"name":"Kyle Messier","email":"messierkp@ehshpclp143.hpc.niehs.nih.gov","login":null,"count":83},{"name":"Insang Song","email":"sigmafelix@hotmail.com","login":"sigmafelix","count":44},{"name":"kyle-messier","email":"kyle.messier@nih.gov","login":"kyle-messier","count":43},{"name":"Eva 
Marques","email":"marquesel@ehshpclp143.hpc.niehs.nih.gov","login":null,"count":40},{"name":"Spatiotemporal-Exposures-and-Toxicology","email":"messierkp@almbp02184136.local.niehs.nih.gov","login":null,"count":15},{"name":"Mitchell Manware","email":"mitchellmanware@Mitchells-MacBook-Pro.local","login":null,"count":13},{"name":"Spatiotemporal-Exposures-and-Toxicology","email":"messierkp@almbp02184136.niehs.nih.gov","login":null,"count":8},{"name":"Eva Marques","email":"marquesel@cn040603.hpc.niehs.nih.gov","login":null,"count":7},{"name":"Eva Marques","email":"eva0marques@gmail.com","login":"eva0marques","count":4},{"name":"Messier","email":"messierkp@almbp02121578.niehs.nih.gov","login":null,"count":4},{"name":"Mariana Kassien","email":"kassienma@ehshpclp143.hpc.niehs.nih.gov","login":null,"count":4},{"name":"Ranadeep Daw","email":"36753043+dawranadeep","login":"dawranadeep","count":3},{"name":"Eva Marques","email":"marquesel@gn040801.hpc.niehs.nih.gov","login":null,"count":2},{"name":"dzilber","email":"daszilber@gmail.com","login":"dzilber","count":2},{"name":"Daniel Zilber","email":"daniel.zilber@nih.gov","login":null,"count":1},{"name":"Mitchell Manware","email":"manwareme@ehshpclp143.hpc.niehs.nih.gov","login":null,"count":1},{"name":"Spatiotemporal-Exposures-and-Toxicology","email":"messierkp@almbp02184136.local","login":null,"count":1},{"name":"Spatiotemporal-Exposures-and-Toxicology","email":"messierkp@almbp02184136.localdomain","login":null,"count":1}],"past_year_committers":[{"name":"mitchellmanware","email":"memanware@gmail.com","login":"mitchellmanware","count":72},{"name":"kyle-messier","email":"kyle.messier@nih.gov","login":"kyle-messier","count":10},{"name":"Insang 
Song","email":"sigmafelix@hotmail.com","login":"sigmafelix","count":8}],"commits_url":"https://commits.ecosyste.ms/api/v1/hosts/GitHub/repositories/niehs%2Fbeethoven/commits","host":{"name":"GitHub","url":"https://github.com","kind":"github","last_synced_at":"2026-04-19T00:00:13.908Z","repositories_count":6214217,"commits_count":900062046,"contributors_count":34915039,"owners_count":1143434,"icon_url":"https://github.com/github.png","host_url":"https://commits.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://commits.ecosyste.ms/api/v1/hosts/GitHub/repositories"}},"issues_stats":{"full_name":"niehs/beethoven","html_url":"https://github.com/niehs/beethoven","last_synced_at":"2026-02-03T21:00:39.480Z","status":"active","issues_count":101,"pull_requests_count":119,"avg_time_to_close_issue":10751506.625,"avg_time_to_close_pull_request":215984.96396396396,"issues_closed_count":64,"pull_requests_closed_count":111,"pull_request_authors_count":6,"issue_authors_count":7,"avg_comments_per_issue":2.4158415841584158,"avg_comments_per_pull_request":0.9831932773109243,"merged_pull_requests_count":91,"bot_issues_count":0,"bot_pull_requests_count":0,"past_year_issues_count":21,"past_year_pull_requests_count":38,"past_year_avg_time_to_close_issue":1443906.7777777778,"past_year_avg_time_to_close_pull_request":230965.34375,"past_year_issues_closed_count":9,"past_year_pull_requests_closed_count":32,"past_year_pull_request_authors_count":3,"past_year_issue_authors_count":3,"past_year_avg_comments_per_issue":2.380952380952381,"past_year_avg_comments_per_pull_request":0.631578947368421,"past_year_bot_issues_count":0,"past_year_bot_pull_requests_count":0,"past_year_merged_pull_requests_count":22,"created_at":"2024-08-17T00:13:58.295Z","updated_at":"2026-02-03T21:00:39.480Z","repository_url":"https://issues.ecosyste.ms/api/v1/hosts/GitHub/repositories/niehs%2Fbeethoven","issues_url":"https://issues.ecosyste.ms/api/v1/hosts/GitHub/repositories/niehs%2Fbeethoven/issues","issue_labe
ls_count":{"Covariate development":10,"development":8,"documentation":7,"models":7,"enhancement":5,"covariates":4,"bug":4,"Production":3,"Test-Driven-Development":3,"Refactor":2,"test-driven development":2,"Exploratory":1,"refactor":1,"AQS data":1,"help wanted":1,"question":1},"pull_request_labels_count":{"development":2,"documentation":2,"test-driven development":2,"enhancement":1},"issue_author_associations_count":{"COLLABORATOR":99,"MEMBER":1,"CONTRIBUTOR":1},"pull_request_author_associations_count":{"COLLABORATOR":117,"MEMBER":1,"CONTRIBUTOR":1},"issue_authors":{"kyle-messier":41,"sigmafelix":27,"mitchellmanware":21,"eva0marques":7,"MAKassien":3,"Sanisha003":1,"dawranadeep":1},"pull_request_authors":{"mitchellmanware":46,"kyle-messier":33,"sigmafelix":31,"eva0marques":7,"dawranadeep":1,"MAKassien":1},"host":{"name":"GitHub","url":"https://github.com","kind":"github","last_synced_at":"2026-04-17T00:00:09.649Z","repositories_count":14294729,"issues_count":34555309,"pull_requests_count":113089886,"authors_count":11236671,"icon_url":"https://github.com/github.png","host_url":"https://issues.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://issues.ecosyste.ms/api/v1/hosts/GitHub/repositories","owners_url":"https://issues.ecosyste.ms/api/v1/hosts/GitHub/owners","authors_url":"https://issues.ecosyste.ms/api/v1/hosts/GitHub/authors"},"past_year_issue_labels_count":{"covariates":3,"development":3,"bug":3,"models":2,"documentation":1,"enhancement":1,"refactor":1},"past_year_pull_request_labels_count":{},"past_year_issue_author_associations_count":{"COLLABORATOR":15,"MEMBER":1},"past_year_pull_request_author_associations_count":{"COLLABORATOR":19,"MEMBER":1},"past_year_issue_authors":{"kyle-messier":10,"mitchellmanware":4,"sigmafelix":2},"past_year_pull_request_authors":{"mitchellmanware":11,"sigmafelix":8,"kyle-messier":1},"maintainers":[{"login":"kyle-messier","count":74,"url":"https://issues.ecosyste.ms/api/v1/hosts/GitHub/authors/kyle-messier"},{"login":"mit
chellmanware","count":67,"url":"https://issues.ecosyste.ms/api/v1/hosts/GitHub/authors/mitchellmanware"},{"login":"sigmafelix","count":58,"url":"https://issues.ecosyste.ms/api/v1/hosts/GitHub/authors/sigmafelix"},{"login":"eva0marques","count":14,"url":"https://issues.ecosyste.ms/api/v1/hosts/GitHub/authors/eva0marques"},{"login":"MAKassien","count":4,"url":"https://issues.ecosyste.ms/api/v1/hosts/GitHub/authors/MAKassien"},{"login":"Sanisha003","count":1,"url":"https://issues.ecosyste.ms/api/v1/hosts/GitHub/authors/Sanisha003"}],"active_maintainers":[{"login":"mitchellmanware","count":15,"url":"https://issues.ecosyste.ms/api/v1/hosts/GitHub/authors/mitchellmanware"},{"login":"kyle-messier","count":11,"url":"https://issues.ecosyste.ms/api/v1/hosts/GitHub/authors/kyle-messier"},{"login":"sigmafelix","count":10,"url":"https://issues.ecosyste.ms/api/v1/hosts/GitHub/authors/sigmafelix"}]},"events":{"total":{"DeleteEvent":20,"PullRequestEvent":41,"ForkEvent":2,"IssuesEvent":72,"WatchEvent":6,"IssueCommentEvent":103,"PushEvent":261,"PullRequestReviewCommentEvent":6,"PullRequestReviewEvent":15,"GollumEvent":1,"CreateEvent":23},"last_year":{"DeleteEvent":10,"PullRequestEvent":12,"ForkEvent":1,"IssuesEvent":22,"IssueCommentEvent":44,"PushEvent":81,"PullRequestReviewCommentEvent":5,"PullRequestReviewEvent":4,"CreateEvent":8}},"keywords":[],"dependencies":[{"ecosystem":"actions","filepath":".github/workflows/codecov.yaml","sha":null,"kind":"manifest","created_at":"2023-09-27T13:46:24.248Z","updated_at":"2023-09-27T13:46:24.248Z","repository_link":"https://github.com/NIEHS/beethoven/blob/main/.github/workflows/codecov.yaml","dependencies":[{"id":13977280405,"package_name":"actions/checkout","ecosystem":"actions","requirements":"v2","direct":true,"kind":"composite","optional":false},{"id":13977280406,"package_name":"r-lib/actions/setup-r","ecosystem":"actions","requirements":"v2","direct":true,"kind":"composite","optional":false},{"id":13977280407,"package_name":"actions/cache",
"ecosystem":"actions","requirements":"v2","direct":true,"kind":"composite","optional":false},{"id":13977280408,"package_name":"codecov/codecov-action","ecosystem":"actions","requirements":"v3","direct":true,"kind":"composite","optional":false}]},{"ecosystem":"actions","filepath":".github/workflows/learn-github-actions.yml","sha":null,"kind":"manifest","created_at":"2023-09-27T13:46:24.489Z","updated_at":"2023-09-27T13:46:24.489Z","repository_link":"https://github.com/NIEHS/beethoven/blob/main/.github/workflows/learn-github-actions.yml","dependencies":[{"id":13977280604,"package_name":"actions/checkout","ecosystem":"actions","requirements":"v3","direct":true,"kind":"composite","optional":false},{"id":13977280607,"package_name":"actions/setup-node","ecosystem":"actions","requirements":"v3","direct":true,"kind":"composite","optional":false}]},{"ecosystem":"actions","filepath":".github/workflows/test-coverage.yaml","sha":null,"kind":"manifest","created_at":"2023-09-27T13:46:24.611Z","updated_at":"2023-09-27T13:46:24.611Z","repository_link":"https://github.com/NIEHS/beethoven/blob/main/.github/workflows/test-coverage.yaml","dependencies":[{"id":13977280734,"package_name":"actions/checkout","ecosystem":"actions","requirements":"v3","direct":true,"kind":"composite","optional":false},{"id":13977280735,"package_name":"r-lib/actions/setup-r","ecosystem":"actions","requirements":"v2","direct":true,"kind":"composite","optional":false},{"id":13977280736,"package_name":"r-lib/actions/setup-r-dependencies","ecosystem":"actions","requirements":"v2","direct":true,"kind":"composite","optional":false},{"id":13977280737,"package_name":"actions/upload-artifact","ecosystem":"actions","requirements":"v3","direct":true,"kind":"composite","optional":false}]},{"ecosystem":"cran","filepath":"DESCRIPTION","sha":null,"kind":"manifest","created_at":"2023-09-27T13:46:24.710Z","updated_at":"2023-09-27T13:46:24.710Z","repository_link":"https://github.com/NIEHS/beethoven/blob/main/DESCRIPTION","depe
ndencies":[{"id":13977280759,"package_name":"covr","ecosystem":"cran","requirements":"*","direct":true,"kind":"suggests","optional":false},{"id":13977280760,"package_name":"knitr","ecosystem":"cran","requirements":"*","direct":true,"kind":"suggests","optional":false},{"id":13977280761,"package_name":"rmarkdown","ecosystem":"cran","requirements":"*","direct":true,"kind":"suggests","optional":false},{"id":13977280762,"package_name":"testthat","ecosystem":"cran","requirements":"\u003e= 3.0.0","direct":true,"kind":"suggests","optional":false},{"id":13977280763,"package_name":"terra","ecosystem":"cran","requirements":"*","direct":true,"kind":"suggests","optional":false},{"id":13977280764,"package_name":"sf","ecosystem":"cran","requirements":"*","direct":true,"kind":"suggests","optional":false},{"id":13977280765,"package_name":"sftime","ecosystem":"cran","requirements":"*","direct":true,"kind":"suggests","optional":false}]}],"score":5.87773578177964,"created_at":"2024-08-17T00:13:50.465Z","updated_at":"2026-04-20T10:30:19.934Z","avatar_url":"https://github.com/NIEHS.png","language":"R","category":"Natural Resources","sub_category":"Air Quality","monthly_downloads":0,"total_dependent_repos":0,"total_dependent_packages":0,"readme":"# Building an Extensible, rEproducible, Test-driven, Harmonized, Open-source, Versioned, ENsemble model for air quality \u003ca href=\"https://niehs.github.io/beethoven\"\u003e\u003cimg align=\"right\" src=\"man/figures/beethoven-logo.png\" width=\"168px\" alt=\"two hexagons with distributed tan, orange, and teal with geometric symbols placed. 
Two hexagons are diagonally placed from the top left to the bottom right\" /\u003e\u003c/a\u003e\n\n\n\u003cp\u003e\n \n[![R-CMD-check](https://github.com/NIEHS/beethoven/actions/workflows/check-standard.yaml/badge.svg)](https://github.com/NIEHS/beethoven/actions/workflows/check-standard.yaml)\n[![cov](https://NIEHS.github.io/beethoven/badges/coverage.svg)](https://github.com/NIEHS/beethoven/actions/workflows/test-coverage.yaml)\n[![lint](https://github.com/NIEHS/beethoven/actions/workflows/lint.yaml/badge.svg)](https://github.com/NIEHS/beethoven/actions/workflows/lint.yaml)\n[![Lifecycle:\nexperimental](https://img.shields.io/badge/lifecycle-experimental-orange.svg)](https://lifecycle.r-lib.org/articles/stages.html#experimental)\n\n\nGroup Project for the Spatiotemporal Exposures and Toxicology group with help from friends :smiley: :cowboy_hat_face: :earth_americas: \n\n\n\u003c/p\u003e\n    \n## Installation\n\n```r\nremotes::install_github(\"NIEHS/beethoven\")\n```\n\n## Overall Project Workflow\n\nTargets: Make-like Reproducible Analysis Pipeline\n 1) AQS Data\n 2) Generate Covariates\n 3) Fit Base Learners\n 4) Fit Meta Learners\n 5) Predictions\n 6) Summary Stats\n\n```mermaid\ngraph TD\n\n    subgraph AQS data with `amadeus`\n        AQS[PM2.5 download]--\u003eAQS2[PM2.5 Process]\n    end\n\n    AQS2 --\u003e Cov1\n    AQS2 --\u003e Cov2\n    AQS2 --\u003e Cov3\n    AQS2 --\u003e Cov4\n    subgraph Covariate Calculation with `amadeus`\n        Cov1[Meteorology]\n        Cov2[NLCD]\n        Cov3[...]\n        Cov4[MERRA2]\n    end\n\n    subgraph Processed Covariates\n        PC[Baselearner Input]\n    end\n\n    Cov1 --\u003e PC\n    Cov2 --\u003e PC\n    Cov3 --\u003e PC\n    Cov4 --\u003e PC\n\n    PC --\u003e A1\n    PC --\u003e A2\n    PC --\u003e A3\n    subgraph MLP Baselearner   \n        A1[ M_i is a 30% random sample of N] --\u003e B1A[Spatial CV]\n        A1[ M_i is a 30% random sample of N] --\u003e B1B[Temporal CV]\n        A1[ M_i is a 30% random 
sample of N] --\u003e B1C[Space/Time CV]\n        B1A --\u003e C1[3. M_i is fit with a MLP model]\n        B1B --\u003e C1\n        B1C --\u003e C1\n    end\n\n    subgraph LightGBM Baselearner\n        A2[ M_i is a 30% random sample of N] --\u003e B2A[Spatial CV]\n        A2[ M_i is a 30% random sample of N] --\u003e B2B[Temporal CV]\n        A2[ M_i is a 30% random sample of N] --\u003e B2C[Space/Time CV]\n        B2A --\u003e C2[3. M_i is fit with a LightGBM model]\n        B2B --\u003e C2\n        B2C --\u003e C2\n    end\n    subgraph Elastic-Net Baselearner\n        A3[ M_i is a 30% random sample of N] --\u003e B3A[Spatial CV]\n        A3[ M_i is a 30% random sample of N] --\u003e B3B[Temporal CV]\n        A3[ M_i is a 30% random sample of N] --\u003e B3C[Space/Time CV]\n        B3A --\u003e C3[3. M_i is fit with a glmnet model]\n        B3B --\u003e C3\n        B3C --\u003e C3\n    end\n    C1 --\u003e D1[Elastic-Net Meta-Learner]\n    C2 --\u003e D1[Elastic-Net Meta-Learner]\n    C3 --\u003e D1[Elastic-Net Meta-Learner]\n\n    subgraph Meta-Learner Phase\n        D1 --\u003e E1[Perform 50% column-wise subsampling K times]\n        E1 --\u003e E1b[Proper Scoring CRPS CV with 1 of 3 categories with equal probability, Spatial, Temporal, or Space/Time]\n        E1b --\u003e M1[Elastic-Net Model 1]\n        E1b --\u003e M2[Elastic-Net Model 2]\n        E1b --\u003e M3[Elastic-Net Model 3]\n        E1b --\u003e M4[Elastic-Net Model K-1]\n        E1b --\u003e M5[Elastic-Net Model K]\n    end\n\n\n    subgraph Posterior Summary\n        M1 --\u003e P1[Complete Posterior Summary at daily, 1-km]\n        M2 --\u003e P1\n        M3 --\u003e P1\n        M4 --\u003e P1\n        M5 --\u003e P1\n        P1 --\u003e P5[Version and Deploy with Vetiver]\n        P1 --\u003e P2[Spatial and Temporal Average Summaries]\n        P2 --\u003e P5\n    end\n\n   style A1 fill:#d3d3d3,stroke:#000,stroke-width:2px\n    style B1A fill:#d3d3d3,stroke:#000,stroke-width:2px\n    style B1B 
fill:#d3d3d3,stroke:#000,stroke-width:2px\n    style B1C fill:#d3d3d3,stroke:#000,stroke-width:2px\n    style C1 fill:#d3d3d3,stroke:#000,stroke-width:2px     \n\n    style A2 fill:#62C6F2,stroke:#000,stroke-width:2px\n    style B2A fill:#62C6F2,stroke:#000,stroke-width:2px\n    style B2B fill:#62C6F2,stroke:#000,stroke-width:2px\n    style B2C fill:#62C6F2,stroke:#000,stroke-width:2px\n    style C2 fill:#62C6F2,stroke:#000,stroke-width:2px     \n\n    style A3 fill:#ffb09c,stroke:#000,stroke-width:2px\n    style B3A fill:#ffb09c,stroke:#000,stroke-width:2px\n    style B3B fill:#ffb09c,stroke:#000,stroke-width:2px\n    style B3C fill:#ffb09c,stroke:#000,stroke-width:2px\n    style C3 fill:#ffb09c,stroke:#000,stroke-width:2px     \n\n    style P1 fill:#abf7b1,stroke:#000,stroke-width:2px\n    style P2 fill:#abf7b1,stroke:#000,stroke-width:2px\n    style P5 fill:#abf7b1,stroke:#000,stroke-width:2px      \n\n```\n \n**Placeholder for up-to-date rendering of targets**\n\n```r\ntar_visnetwork(targets)\n```\n    \n\n## Project Organization\n\nHere, we describe the structure of the project and the naming conventions used. The most up to date file paths and names are recorded here for reference.\n\n### File Structure\n\n#### Folder Structure\n\n- `R/` This is where the main R code (e.g. .R files) lives. Nothing else but .R files should be in here. i.e. Target helper functions, model fitting and post-processing, plotting and summary functions. \n- `tests/` This is where the unit and integration tests reside. The structure is based off the standard practices of the [testthat](https://testthat.r-lib.org/) R package for unit testing.\n    - `testthat` Unit and integration tests for CI/CD reside here\n    - `testdata` Small test datasets including our small (in size) complete pipeline testing. 
\n    - `testthat.R` Special script created and maintained by testthat\n- `man/` This sub-directory contains .Rd and othe files created by the [roxygen2](https://roxygen2.r-lib.org/) package for assisted documentation of R packages\n- `vignettes/` Rmd (and potentially Qmd) narrative text and code files. These are rendered into the **Articles** for the package website created by [pkgdown](https://pkgdown.r-lib.org/) \n- `inst/` Is a sub-directory for arbitrary files outside of the main `R/` directory\n     - `targets` which include the important pipeline file `_targets.R`\n- `.github/workflows/` This hidden directory is where the GitHub CI/CD yaml files reside\n\n##### The following sub-directories are not including the package build and included only in the source code here\n- `tools/` This sub-directory is dedicated to educational or demonstration material (e.g. Rshiny).\n  \n#### Relevant files \n\n- LICENSE\n- DESCRIPTION\n- NAMESPACE \n- README.md\n\n### Naming Conventions\n\nNaming things is hard and somewhat subjective. Nonetheless, consistent naming conventions make for better reproducibility, interpretability, and future extensibility. \nHere, we provide the `beethoven` naming conventions for objects as used in `targets` and for naming functions within the package (i.e. **R/**). \nFor `tar_target` functions, we use the following naming conventions:\n\n\nNaming conventions for `targets` objects. We are motivated by the [Compositional Forecast](https://cfconventions.org/Data/cf-standard-names/docs/guidelines.html) (CF) model naming conventions:\n\ne.g. [surface] [component] standard_name [at surface] [in medium] [due to process] [assuming condition]\nIn CF, the entire process can be known from the required and optional naming pieces. \n\nHere, we use the following naming convention:\n\n**[R object type]\\_[role-suffix]\\_[stage]\\_[source]\\_[spacetime]**\n\n Each section is in the brackets [] and appears in this order. 
For some objects, not all naming sections are required. If two keywords in a section apply, then they are appended with a `-`.\n\nExamples: 1) `sf_PM25_log10-fit_AQS_siteid` is an `sf` object for `PM25` data that is log-transformed and ready for base-learner fitting, derived from AQS data and located at the siteid locations. \n2) `SpatRast_process_MODIS` is a terra `SpatRaster` object that has been processed from MODIS.\n\n\n#### Naming section definitions:\n\n- **R object type**: chr (character), list, sf, dt (datatable), tibble, SpatRaster, SpatVector\n\n- **role:** Detailed description of the role of the object in the pipeline. Allowable keywords:\n\n  - PM25\n  - feat (feature) (i.e. geographic covariate) \n  - base_model\n    - base_model suffix types: linear, random_forest, lgb (lightGBM), xgb (xgboost), mlp (neural network, multilayer perceptron) etc.\n  - meta_model \n  - prediction\n  - plot\n    - plot suffix types: scatter, map, time_series, histogram, density etc. \n  \n- **stage**: the stage of the pipeline the object is used in. Object transformations\nare also articulated here. Allowable keywords: \n\n  - raw\n  - calc: results from processing-calculation chains\n  - fit: Ready for base/meta learner fitting\n  - result: Final result\n  - log\n  - log10 \n\n- **source:** the original data source\n\n  - AQS\n  - MODIS\n  - GMTED \n  - NLCD\n  - NARR\n  - GEOSCF\n  - TRI\n  - NEI\n  - KOPPENGEIGER\n  - HMS\n  - gROADS\n  - POPULATION\n  - [Note, we can add and/or update these sources as needed] \n\n- **spacetime:** relevant spatial or temporal information \n\n  - spatial: \n    - siteid\n    - censustract\n    - grid\n  - time: \n    - daily  [optional YYYYMMDD]\n    - annual  [optional YYYY]\n\n\n### Function Naming Conventions \n\nWe have adopted the same naming conventions for functions in this package as well as in `amadeus`, which is a key input package. 
\n\n**[High-Level-Process]\\_[Source]\\_[Object]**\n\n- **High-Level-Process**\n     - download\n     - process\n     - calc\n\n- **source:** the original data source. Same as source section for tar objects\n\n- **Object** An object that the function may be acting on\n     - base_model (base)\n     - meta_model (meta)\n     - feature (feat) \n\n\n \n### To run the pipeline\n#### Post-checkout hook setting\nAs safeguard measures, we limit the write permission of `_targets.R` to authorized users. To activate post-checkout hook, run `setup_hook.sh` at the project root.\n\n```shell\n. setup_hook.sh\n```\n\nThe write privilege lock is applied immediately. Users will be able to run the pipeline with the static `_targets.R` file to (re-)generate outputs from the pipeline.\n\n#### User settings\n`beethoven` pipeline is configured for SLURM with defaults for NIEHS HPC settings. For adapting the settings to users' environment, consult with the documentation of your platform and edit the `_targets.R` and `inst/targets/targets_calculate.R` (i.e., resource management) accordingly.\n\n#### Setting `_targets.R`\nFor general users, all `targets` objects and `meta` information can be saved in a directory other than the pipeline default by changing `store` value in `tar_config_set()` at `_targets.R` in project root.\n\n```r\n# replacing yaml file.\ntar_config_set(\n  store = \"__your_directory__\"\n)\n```\n\nUsers could comment out the three lines to keep targets in `_targets` directory under the project root. Common arguments are generated in the earlier lines in `_targets.R` file. Details of the function generating the arguments, `set_args_calc`, are described in the following.\n\n\n#### Using `set_args_calc`\n`set_args_calc` function exports or returns common parameters that are used repeatedly throughout the calculation process. 
The default commands are as below:\n\n```r\nset_args_calc(\n  char_siteid = \"site_id\",\n  char_timeid = \"time\",\n  char_period = c(\"2018-01-01\", \"2022-12-31\"),\n  num_extent = c(-126, -62, 22, 52),\n  char_user_email = paste0(Sys.getenv(\"USER\"), \"@nih.gov\"),\n  export = FALSE,\n  path_export = \"inst/targets/calc_spec.qs\",\n  path_input = \"input\",\n  nthreads_nasa = 14L,\n  nthreads_tri = 5L,\n  nthreads_geoscf = 10L,\n  nthreads_gmted = 4L,\n  nthreads_narr = 24L,\n  nthreads_groads = 3L,\n  nthreads_population = 3L\n)\n```\n\nAll arguments except for `char_siteid` and `char_timeid` should be carefully set to match users' environment. `export = TRUE` is recommended if there is no pre-generated qs file for calculation parameters. For more details, consult `?set_args_calc` after loading `beethoven` in your R interactive session.\n\n#### Running the pipeline\nAfter switching to the project root directory (in terminal, `cd [project_root]`, replace `[project_root]` with the proper path), users can run the pipeline.\n\n\u003e [!NOTE]\n\u003e With `export = TRUE`, it will take some time to proceed to the next because it will recursively search hdf file paths. The time is affected by the number of files to search or the length of the period (`char_period`).\n\n\u003e [!WARNING]\n\u003e Please make sure that you are at the project root before proceeding to the following. 
The HPC example requires additional edits related to SBATCH directives and the project root directory.\n\n```shell\nRscript inst/targets/targets_start.R \u0026\n```\n\nOr, on the NIEHS HPC, modify several lines to match your user environment:\n\n```shell\n# ...\n#SBATCH --output=YOUR_PATH/pipeline_out.out\n#SBATCH --error=YOUR_PATH/pipeline_err.err\n# ...\n# The --mail-user flag is optional\n#SBATCH --mail-user=MYACCOUNT@nih.gov\n# ...\nUSER_PROJDIR=/YOUR/PROJECT/ROOT\nnohup nice -4 Rscript $USER_PROJDIR/inst/targets/targets_start.R\n```\n\n`YOUR_PATH`, `MYACCOUNT`, and `/YOUR/PROJECT/ROOT` should be changed. Finally, run the following command:\n\n```shell\nsbatch inst/targets/run.sh\n```\n\nThe script will submit a job whose SLURM-level directives, defined by lines starting with `#SBATCH`, allocate CPU threads and memory from the specified partition.\n\n`inst/targets/run.sh` includes several lines exporting environment variables to bind GDAL/GEOS/PROJ versions newer than the system default, geospatial packages built upon these libraries, and the user library location where required packages are installed. These environment variables will need to be updated if the NIEHS HPC system changes in the future.\n\n\u003e [!WARNING]\n\u003e The `set_args_*` family for downloading and summarizing prediction outcomes will be added in a future version.\n\n\n\n\n# Developer's guide\n\n\n## Preamble\nThe objective of this document is to provide developers with the current implementation of the `beethoven` pipeline as of version 0.3.9.\n\nWe assume potential users have basic knowledge of the `targets` and `tarchetypes` packages as well as functional and meta-programming. Reading the relevant chapters of Advanced R (by Hadley Wickham) is recommended for these topics.\n\n\n## Pipeline component and basic implementation\nThe pipeline is based on the `targets` package. 
All targets are **stored** in a designated store, which can be either a directory path or a URL when one uses cloud storage or web servers. Here we classify the components into three groups:\n\n1. Pipeline execution components: the highest-level script to run the pipeline.\n2. Pipeline configuration components: function arguments that are injected into the functions in each target.\n3. Pipeline target components: definitions of each target, essentially lists of `targets::tar_target()` calls classified by pipeline step\n\n\nLet's take a moment to be a user. You should consult a specific file when:\n\n- `_targets.R`: you need to modify, or see errors about, library locations, targets storage locations, or required libraries\n  - Check the `set_args_*()` function parts when you encounter a \"file or directory not found\" error\n- `run_slurm.sh`: \"the pipeline status is not reported to my email address.\"\n- `inst/targets/targets_*.R` files: any errors related to running targets except for lower-level issues in `beethoven` or `amadeus` functions\n\n\u003e [!NOTE]\n\u003e Please expand the toggle below to display function trees for `inst/targets/targets_*.R` files. 
Only functions that are directly called in each file are displayed due to screen real estate and readability concerns.

<details>
<summary>`targets_*.R` file function tree</summary>

```mermaid
graph LR

    %% Define styles for the target files
    style arglist fill:#ffcccc,stroke-width:2px,stroke:#000000,opacity:0.5
    style baselearner fill:#ccffcc,stroke-width:2px,stroke:#000000,opacity:0.5
    style calculateF fill:#ccccff,stroke-width:2px,stroke:#000000,opacity:0.5
    style download fill:#ffccff,stroke-width:2px,stroke:#000000,opacity:0.5
    style initialize fill:#ccffff,stroke-width:2px,stroke:#000000,opacity:0.5
    style metalearner fill:#ffffcc,stroke-width:2px,stroke:#000000,opacity:0.5
    style predict fill:#ffcc99,stroke-width:2px,stroke:#000000,opacity:0.5

    %% Define the target files as nodes
    arglist["**inst/targets/targets_arglist.R**"]
    baselearner["**inst/targets/targets_baselearner.R**"]
    calculateF["**inst/targets/targets_calculate.R**"]
    download["**inst/targets/targets_download.R**"]
    initialize["**inst/targets/targets_initialize.R**"]
    metalearner["**inst/targets/targets_metalearner.R**"]
    predict["**inst/targets/targets_predict.R**"]

    %% Define the branches with arrowhead connections
    fargdown["`set_args_download`"] ---|`set_args_download`| arglist
    fargcalc["`set_args_calc`"] ---|`set_args_calc`| arglist
    fraw["`feature_raw_download`"] ---|`feature_raw_download`| download
    readlocs["`read_locs`"] ---|`read_locs`| initialize
    fitbase["`fit_base_learner`"] ---|`fit_base_learner`| baselearner
    switchmodel["`switch_model`"] ---|`switch_model`| baselearner
    makesub["`make_subdata`"] ---|`make_subdata`| baselearner
    covindexrset["`convert_cv_index_rset`"] ---|`convert_cv_index_rset`| baselearner
    attach["`attach_xy`"] ---|`attach_xy`| baselearner
    gencvsp["`generate_cv_index_sp`"] ---|`generate_cv_index_sp`| baselearner
    gencvts["`generate_cv_index_ts`"] ---|`generate_cv_index_ts`| baselearner
    gencvspt["`generate_cv_index_spt`"] ---|`generate_cv_index_spt`| baselearner
    switchrset["`switch_generate_cv_rset`"] ---|`switch_generate_cv_rset`| baselearner
    fcalc["`calculate`"] ---|`calculate`| calculateF
    fcalcinj["`inject_calculate`"] ---|`inject_calculate`| calculateF
    fcalcinjmod["`inject_modis_par`"] ---|`inject_modis_par`| calculateF
    fcalcinjgmted["`inject_gmted`"] ---|`inject_gmted`| calculateF
    fcalcinjmatch["`inject_match`"] ---|`inject_match`| calculateF
    fcalcgeos["`calc_geos_strict`"] ---|`calc_geos_strict`| calculateF
    fcalcgmted["`calc_gmted_direct`"] ---|`calc_gmted_direct`| calculateF
    fcalcnarr2["`calc_narr2`"] ---|`calc_narr2`| calculateF
    fparnarr["`par_narr`"] ---|`par_narr`| calculateF
    fmetalearn["`fit_meta_learner`"] ---|`fit_meta_learner`| metalearner
    G["`pred`"] ---|`pred`| predict

    %% Apply thin solid dark grey lines to the branches
    classDef branchStyle stroke-width:1px,stroke:#333333
    class fargdown,fargcalc,fraw,readlocs,fitbase,switchmodel,makesub,covindexrset,attach,gencvsp,gencvts,gencvspt,switchrset,fcalc,fcalcinj,fcalcinjmod,fcalcinjgmted,fcalcinjmatch,fcalcgeos,fcalcgmted,fcalcnarr2,fparnarr,fmetalearn,G branchStyle
```

</details>

![](man/figures/pipeline-code-relations.svg)

The details of argument injection are illustrated below. The specific arguments to inject are loaded from QS files, which must be saved in the `inst/targets` directory. Each QS file contains a nested list object in which the function arguments for downloading raw data and calculating features are defined and stored.

#### `inst/targets/download_spec.qs`
This file is generated by the `beethoven` function `set_args_download()`.
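For orientation, each spec file is simply a serialized nested list of function arguments. A minimal sketch of the round trip, using base R's `saveRDS()`/`readRDS()` as a stand-in for the `qs` format used in the pipeline, with illustrative (not actual) field names:

```r
# Hypothetical nested argument list; the field names below are
# illustrative only, not beethoven's actual spec fields.
arglist <- list(
  aqs  = list(directory_to_save = "input/aqs", year = c(2018, 2022)),
  narr = list(variables = c("air.sfc", "omega"))
)

# Round-trip through a serialized file (the pipeline uses qs::qsave()
# and qs::qread(); RDS is shown here to stay in base R).
f <- tempfile(fileext = ".rds")
saveRDS(arglist, f)
identical(readRDS(f), arglist)
```

Targets downstream read such a file once and pull the per-dataset argument sublists out of it, which keeps the pipeline definition free of hard-coded argument values.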
In `_targets.R`, generation of this file can be skipped if the raw data download has already been done or is unnecessary.

```r
generate_list_download <- FALSE

arglist_download <-
  set_args_download(
    char_period = c("2018-01-01", "2022-12-31"),
    char_input_dir = "input",
    nasa_earth_data_token = NULL, # Sys.getenv("NASA_EARTHDATA_TOKEN"),
    export = generate_list_download,
    path_export = "inst/targets/download_spec.qs"
  )
```

#### `inst/targets/calc_spec.qs`
The `set_args_calc()` function generates this file. The file name can be changed (`path_export = "inst/targets/calc_spec.qs"`), but it must start with `calc_`, as this prefix is used to find the QS files that manage different periods. As with `download_spec.qs`, whether or not to run this function is controlled by a logical variable named `generate_list_calc` in `_targets.R`.

```r
generate_list_calc <- FALSE

arglist_common <-
  set_args_calc(
    char_siteid = "site_id",
    char_timeid = "time",
    char_period = c("2018-01-01", "2022-12-31"),
    num_extent = c(-126, -62, 22, 52),
    char_user_email = paste0(Sys.getenv("USER"), "@nih.gov"),
    export = generate_list_calc,
    path_export = "inst/targets/calc_spec.qs",
    char_input_dir = "/ddn/gs1/group/set/Projects/NRT-AP-Model/input"
  )
```
QUESTION: Where (which function calls) and when is `inst/targets/init_target.sh` used?

![](man/figures/pipeline-schema.svg)

As a compromise between the layouts of standard R packages and `targets` pipelines, we mainly keep `tar_target()` definitions in `inst/targets/`, whereas the components required by `targets` are stored in the project root. All targets are recorded in the `_targets/` directory by default; this can be changed by specifying an external directory via the `store` argument of `tar_config_set()` in `_targets.R`.
If you change that part in `_targets.R`, you should run `init_targets_storage.sh` **in the project root** to create the specified directory.

```shell
. init_targets_storage.sh
```

```r
# replacing yaml file.
tar_config_set(
  store = "/__your__desired__location__"
)
```

## Before running the pipeline
For future releases and tests on various environments, one should check several lines across the R and shell script files:

- Shell scripts
  - `/run_interactive.sh`: this file runs the host `targets` process **in an interactive session**. All system variables, including `PATH` and `LD_LIBRARY_PATH`, are set to align with the current development system environment. The lines in the provided file are set for the NIEHS HPC. Note that it may stall if too many other processes are running on the interactive node.
  - `/run_slurm.sh`: this file runs the host `targets` process **on SLURM via an SBATCH script**, meaning one should run `sbatch run_slurm.sh`. The working directory is set in this bash script to the root of your project (i.e., the `beethoven` clone root):

```
      # Modify this into the proper directory path, and the
      # output/error paths in the #SBATCH directives
      USER_PROJDIR=/ddn/gs1/home/$USER/projects

      nohup nice -4 Rscript $USER_PROJDIR/beethoven/inst/targets/targets_start.R
```
- R scripts
  - `/_targets.R`: Lines 10-12, `tar_config_set(store = ...)`, should be reviewed to ensure they do not overwrite successfully built targets.
  - `/_targets.R`: the `set_args_download()` and `set_args_calc()` calls, i.e., the `char_input_dir` and `char_period` arguments.
  - `/_targets.R`: the `library` argument value in `tar_option_set()`, to match the current system environment

## Basic structure of branches
We use "grand target" to refer to the set of branches produced when any branching technique is applied at a target.

When a target is branched out, the grand target should be a list, either nested or plain, depending on the context and the command run inside each branch. Branch names include an automatic hash appended to the grand target name as a suffix; users may add their own suffixes for legibility. Branches have the advantage of a succinct network layout (i.e., in the interactive plot generated by `tar_visnetwork(targets_only = TRUE)`), but they complicate debugging. It is strongly advised that the unit function applied to each branch be fully tested.

## Branching in beethoven
Branching is actively employed in most parts of `beethoven`. Here we walk through which targets are branched out and the rationale for branching each of them.

### Downloading raw data from the source
Download targets are separated from the calculation-model fitting sequence and operate a bit differently from other targets. Arguments stored in one or more QS files (`inst/targets/download_*.qs`) are injected into `amadeus::download_data()`, which initiates building the raw data download targets. The target is rigorously branched out and is thus represented as one square node when one runs `targets::tar_visnetwork()`.
Building the target named `lgl_rawdir_download` downloads the raw data from the internet; under the current setting, this is performed **sequentially**.

Users may bypass the downloading targets by setting a temporary system variable, `Sys.setenv("BTV_DOWNLOAD_PASS" = "TRUE")`, which is handled in `_targets.R`.

```r
# bypass option ("TRUE" skips the download targets)
Sys.setenv("BTV_DOWNLOAD_PASS" = "FALSE")

# abridged for display...

# nullify download target if bypass option is set
if (Sys.getenv("BTV_DOWNLOAD_PASS") == "TRUE") {
  target_download <- NULL
}
```


### `list_feat_calc_base`
Per the `beethoven` target naming convention, this object is a list; it has seven elements at the first level. We say "first level" because the list is nested. This also relates to maintaining `list_feat_calc_base_flat` in the following target. The seven elements are defined in the preceding target `chr_iter_calc_features`:

```r
tar_target(
  chr_iter_calc_features,
  command = c("hms", "tri", "nei",
              "ecoregions", "koppen", "population", "groads"),
  iteration = "list",
  description = "Feature calculation"
)
```

Using the `inject_calculate()` function and the argument lists generated by `set_args_calc()`, `chr_iter_calc_features` is passed to `amadeus` functions for calculation. Please note that the pattern of `list_feat_calc_base` is not simply `map(chr_iter_calc_features)` but rather `cross(file_prep_calc_args, chr_iter_calc_features)`, for potential expansion to keep multiple argument files in the future.

Each element in `chr_iter_calc_features` is iterated as a list, so `list_feat_calc_base` will be a nested list.
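The nested structure and the subsequent merge can be sketched in base R (toy data standing in for the actual feature targets; the pipeline uses `data.table` merges):

```r
# Outer level: one element per argument file; inner level: one
# data.frame per dataset group (toy values, not actual features).
nested <- list(
  list(
    data.frame(site_id = c("A", "B"), hms = c(1, 2)),
    data.frame(site_id = c("A", "B"), tri = c(3, 4))
  )
)

# Merge the inner data.frames of each outer element into one
# data.frame, yielding a non-nested list of data.frames.
flat <- lapply(nested, function(e) {
  Reduce(function(x, y) merge(x, y, by = "site_id"), e)
})

names(flat[[1]])  # "site_id" "hms" "tri"
```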
`list_feat_calc_base_flat` then merges the nested elements into one `data.frame` (a `data.table`, in fact) apiece, resulting in a non-nested `list` in which each element is a `data.frame`.

### `list_feat_calc_nlcd`
From version 0.3.10, the NLCD target is separated from `list_feat_calc_base` due to runtime concerns. Here we take a nested parallelization strategy: each `amadeus::calc_nlcd()` run with a different year and buffer size is parallelized, and each run uses 10 threads. In the initial study period, we have six combinations (two NLCD years, 2019 and 2021, and three radii of 1, 10, and 50 kilometers). Thus, the NLCD target will use 60 threads, though not necessarily concurrently. Each combination gets its own slot in the resulting list target, and the following `dt_feat_calc_nlcd` target is created by pivoting the `data.frame`.

### `list_feat_calc_nasa`
MODIS-VIIRS product processing is a bit more complex than the others, since many preprocessing steps are involved for this raw data. Please note that `chr_iter_calc_nasa` divides the MOD19A2 product by spatial resolution, since differences in the spatial resolution of raster layers make it difficult to stack them, which would otherwise be advantageous for processing speed. The branching itself is simple, iterating over a character vector of length 7, but there is a different avenue that may introduce complexity in terms of computational infrastructure and the implementation of parallel processing.

We introduced nested parallelization to expedite MODIS/VIIRS processing: `tar_make_future()` submits one job per MODIS/VIIRS product code via SLURM `batchtools`, and multiple threads are used in each job.
If one wants to transition to a `crew`-based pipeline operation in the future, this part will require a tremendous amount of refactoring, not only in `beethoven` functions but also in `amadeus` functions, considering that the features of `crew`/`mirai` workers differ from those of `future`.

### `list_feat_calc_geoscf`
We use a character vector of length 2 to distinguish the chm and aqc products. A modified version of `amadeus::calc_geos()`, `calc_geos_strict()`, is employed to calculate features. The key modification fixes the radius argument at zero and removes the top-level `radius` argument from the function.

### `list_feat_calc_gmted`
Here we use the custom function `calc_gmted_direct()`, whose logic differs from that of `amadeus::calc_gmted()`. `inject_gmted()` uses this function to parallelize the calculation by radius length.

### `list_feat_calc_narr`
Again, the modified functions `process_narr2()` and `calc_narr2()` are applied, and the parallelization for NARR data is done by `par_narr()`. Here we did not branch out by NARR variable name: the variable vector is fairly long (46 elements), so each dispatched branch would add SLURM job submission overhead per variable.

## Merge branches

Functions with the prefix `post_calc_` merge branches, which have various internal structures. Most branches are lists of depth 1, meaning each list element is a `data.frame` or `data.table` object; others are lists of depth 2.

### Tackling space-time discrepancy

Each data source has a different temporal resolution and update frequency. This leads to different dimensions across targets, a consequence of measures taken to save computation time. For example, NLCD targets get N (number of sites) times 2 rows (for 2019 and 2021, per the initial study period as of August 2024), whereas NARR targets get $N \times |D|$ rows (where $D$ is the set of dates), which equals the full set of site-date combinations during the study period.
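To put toy numbers on the size difference (assuming the 2018-2022 study period and, say, two sites):

```r
# Row counts per target type for a hypothetical two-site setup.
n_sites <- 2
dates <- seq(as.Date("2018-01-01"), as.Date("2022-12-31"), by = "day")

n_sites * length(dates)  # NARR-like: full site-date combinations (2 * 1826)
n_sites * 2              # NLCD-like: site-year rows for 2019 and 2021 only
```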
To tackle this discrepancy across the calculated targets, an automatic expansion strategy is implemented by inferring the temporal resolution of each target. Automatic expansion starts by resolving the native temporal resolution of each target, then adds a provisional year field derived from the date, which is removed after all required join operations are completed. Most of the time, the date-to-year conversion is performed internally in the `expand` functions in `beethoven`, and the full space-time `data.frame` is prioritized as the left side when joining the multiple targets.

### Value filling strategies

Temporal resolution discrepancies introduce `NA` values into the joined `data.frame`s. Among the MODIS/VIIRS targets, NDVI (a subdataset of the MOD13A1 product) is on a 16-day cycle, unlike the other products, which are on a daily cycle. We consider the reported date of the 16-day cycle to be the **last day** of the cycle.

* **MODIS/VIIRS**: Therefore, the `NA` values introduced by joining `data.frame`s on the date field are filled in `impute_all` using `data.table::setnafill` with the next observation carried backward (`type = "nocb"`) option.
* MODIS/VIIRS targets may have `NaN` values where nonexistent values were assigned as replacements. These values are replaced with `NA` first, then with zeros.
* Other nonignorable `NA`s in the joined target are imputed by missForest (the name of the original method; we actually use the `missRanger` package for efficiency).

### Autojoin functions

The automatic join function `post_calc_autojoin` is one of the most complex functions in the `beethoven` codebase; it encapsulates the effort to resolve all sorts of space-time discrepancies across targets. Full and coarse site-date combinations and full and coarse site-year combinations are automatically resolved in the function. The coarse site-year combination is a challenge, since some years fall outside the study period and such *anchor* years should be repeated so that there are no gaps in the joined data.
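The anchor-year idea can be sketched in base R (a hypothetical helper, not the actual `post_calc_year_expand()`): each requested year is mapped to the nearest available year, so coarse site-year rows can be repeated to cover the full period.

```r
# Map each requested year to its nearest available "anchor" year
# (hypothetical logic illustrating the expansion; ties resolve to
# the earlier year because which.min() returns the first minimum).
expand_years <- function(year_start, year_end, years_available) {
  vapply(
    seq(year_start, year_end),
    function(y) years_available[which.min(abs(years_available - y))],
    numeric(1)
  )
}

expand_years(2018, 2022, c(2019, 2021))
# 2019 2019 2019 2021 2021
```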
The `post_calc_df_year_expand()` function and its upstream `post_calc_year_expand()` repeat coarse site-year `data.frame`s appropriately to ensure that no year is left with missing values.

```r
post_calc_autojoin <-
  function(
    df_fine,
    df_coarse,
    field_sp = "site_id",
    field_t = "time",
    year_start = 2018L,
    year_end = 2022L
  ) {
    # Dataset specific preprocessing
    if (any(grepl("population", names(df_coarse)))) {
      df_coarse <- df_coarse[, -c("time"), with = FALSE]
    }

    # Detect common field names
    common_field <- intersect(names(df_fine), names(df_coarse))

    # Clean inputs to retain necessary fields
    df_fine <- data.table::as.data.table(df_fine)
    df_coarse <- data.table::as.data.table(df_coarse)
    df_fine <- post_calc_drop_cols(df_fine)
    df_coarse <- post_calc_drop_cols(df_coarse)

    # Take strategy depending on the length of common field names
    # Length 1 means that `site_id` is the only intersecting field
    if (length(common_field) == 1) {
      print(common_field)
      if (common_field == field_sp) {
        joined <- data.table::merge.data.table(
          df_fine, df_coarse,
          by = field_sp,
          all.x = TRUE
        )
      }
    }
    # When space-time join is requested,
    if (length(common_field) == 2) {
      if (all(common_field %in% c(field_sp, field_t))) {
        # Type check to characters
        df_fine[[field_t]] <- as.character(df_fine[[field_t]])
        df_coarse[[field_t]] <- as.character(df_coarse[[field_t]])

        # When `time` field contains years, `as.Date` call will return error(s)
        t_coarse <- try(as.Date(df_coarse[[field_t]][1]))
        # If an error is detected, print information
        if (inherits(t_coarse, "try-error")) {
          message(
            "The time field includes years. Trying different join strategy."
          )
          coarse_years <- sort(unique(unlist(as.integer(df_coarse[[field_t]]))))

          # coarse site-year combination is expanded
          df_coarse2 <- post_calc_df_year_expand(
            df_coarse,
            time_start = year_start,
            time_end = year_end,
            time_available = coarse_years
          )
          joined <-
            post_calc_join_yeardate(df_coarse2, df_fine, field_t, field_t)
        } else {
          # site-date data.frames are joined as they are, regardless of
          # coarseness; left join is enforced
          joined <- data.table::merge.data.table(
            df_fine, df_coarse,
            by = c(field_sp, field_t),
            all.x = TRUE
          )
        }
      }
    }
    return(joined)
  }
```

### Managing calculated features

There can be multiple calculation configuration files, which means there can also be multiple calculated feature targets. The `dt_feat_calc_cumulative` target operates differently depending on whether a `*.qs` file exists in the `output/qs` directory. If there is any `*.qs` file there, the `dt_feat_calc_design` target is appended (i.e., `rbind()`-ed) to the contents of the `*.qs` files. The first run will assign a file name string to `dt_feat_calc_cumulative`.

```r
append_predecessors(
  path_qs = "output/qs",
  period_new = arglist_common$char_period,
  input_new = dt_feat_calc_design,
  nthreads = arglist_common$nthreads_append
)
```

### Imputation

The calculated features contain a fair number of `NA`s or `NaN`s, depending on the raw dataset. We distinguish these into "true zeros" and "true missing" for the subsequent imputation process. For imputation, `missRanger` is used.
The `missRanger` arguments can be adjusted in the `impute_all()` function.

- True zeros: TRI features include many `NA`s because the raw data is a long `data.frame` keyed by source location-chemical pairs. This structure requires long-to-wide pivoting, resulting in a sparse `data.frame` with `NA`s where no chemicals were reported at certain locations. These `NA`s are therefore considered true zeros.

- Missing: daily satellite-derived features, except for the 16-day NDVI, are considered to include missing values. Such missing values mainly come from intermittent data transmission disruptions or planned maintenance. `NA`s in the 16-day NDVI field are filled by carrying the next observation backward, consistent with treating the reported date as the last day of each cycle. `NaN` values in the others are replaced with `NA` and passed to the imputation function.

## Base learners

For efficiency, GPU-enabled versions of `xgboost`/`lightgbm` and `brulee` are recommended. These packages need to be installed manually, with modifications to system environment variables. Developers should consult the official `lightgbm` documentation for building the package by hand, the `xgboost` GitHub release page for installing the CUDA version manually, and the `brulee` GitHub repository (specifically the `gpu` branch) to install the proper version of each package, with careful consideration of the computing infrastructure. "GPU" here refers to CUDA-enabled devices produced by NVIDIA Corporation. This does not mean that this package, as part of U.S. government work, endorses NVIDIA Corporation or its products in any way.

> [!WARNING]
> As of version 0.3.10, `xgboost` < v2.1.0 should be used due to breaking changes in v2.1.0 in handling additional arguments in `xgb.DMatrix` (cf. [xgboost pull record](https://github.com/dmlc/xgboost/pull/9862)), which break the `parsnip::boost_tree()` function call.

### tidymodels infrastructure

We want to actively adopt evolving packages in the `tidymodels` ecosystem while keeping the dependency tree as minimal as possible. In this package, the major `tidymodels` packages used in the base and meta learners include:

* `parsnip`
* `recipes`
* `rsample`
* `spatialsample`
* `tune`
* `workflows`

### Branching
With rigorous branching, we maintain the base learner fitting targets as one node with 900 branches: $3\ \text{(base learners)} \times 3\ \text{(CV strategies)} \times 100\ \text{(resamples)}$. LightGBM and multilayer perceptron models run on GPUs, while elastic net models are fit on CPUs.

### Cross validation

By `rsample` design, each cross-validation fold includes an **actual** `data.frame` (`tibble`) object. This is good for self-contained modeling practices that easily guarantee reproducibility; however, it is limiting with large data in a `targets` pipeline, as `targets` **stores** such objects on disk. This characteristic inflates the disk space required for base and meta learner training: ten-fold cross-validation sets from a 900K-row by 3.2K-column `data.frame` take $9\,\mathrm{M} \times 3.2\,\mathrm{K} \times 8\ \mathrm{bytes} \approx 230\,\mathrm{GB}$. Randomization schemes for the model ensemble increase that size tenfold or more, equivalent to 2.3 TB or more uncompressed. The current development version modifies the original `rsample` `rset` design to store *row indices* of the joined `data.frame` target, reducing the data size on disk.

#### Use `rset` objects as a last resort

The `rset` object is a powerful tool to ensure that all cross-validation sets "flow" through the modeling process, but it has a limitation in large-scale modeling with `targets`: storage.
When one stores `rset` objects in the pipeline, even with mild randomization (e.g., the 30% row sampling in the base learner step of the `beethoven` pipeline), the total disk space required to keep the `rset` objects easily exceeds several times that of the original `data.frame` object. Thus, we prefer to keep *row indices* and to restore the `rset` object *inside* each base learner fitting function. The row indices here are derived from the row subsamples for the base learners. `targets` will only store the row indices bound to each subsample, so total storage usage is reduced significantly. Besides the disk space savings, this also reduces the overhead and I/O of compressing massive `data.frame` (actually, `tibble`) objects.

- `restore_*` functions restore an `rset` object from row indices and their upstream `data.frame`
- `generate_*` functions generate row indices from an input `data.frame` according to the user-defined cross-validation strategy.

`fit_base_learner()` is a quite long and versatile function that accepts a dozen arguments, so developers should be aware of each component in the function. The current implementation separates the `parsnip` and `tune` parts from `fit_base_learner()`.
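The `generate_*`/`restore_*` division of labor described above can be sketched in base R (hypothetical helpers named after the convention; the real functions build and restore `rsample` `rset` objects):

```r
# generate_*: derive lightweight fold assignments (row indices) from
# the data. Here, a simple random k-fold split (illustrative only).
generate_cv_index <- function(df, k = 5L) {
  split(sample(seq_len(nrow(df))), rep_len(seq_len(k), nrow(df)))
}

# restore_*: rebuild one fold's analysis/assessment split from the
# stored indices plus the upstream data.frame.
restore_fold <- function(df, index, fold) {
  list(
    analysis   = df[-index[[fold]], , drop = FALSE],
    assessment = df[index[[fold]], , drop = FALSE]
  )
}

df <- data.frame(x = seq_len(100))
set.seed(1)
index <- generate_cv_index(df, k = 5L)
fold1 <- restore_fold(df, index, 1L)
nrow(fold1$analysis) + nrow(fold1$assessment)  # 100
```

Only `index` (five small integer vectors) would need to be stored by `targets`; the full analysis/assessment `data.frame`s are rebuilt on demand inside the fitting function.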
The flowchart of `fit_base_learner()` is displayed below.

```mermaid
graph TD
    %% Define the target files as nodes
    frecipe["minimal data"]
    fittune["tuning results"]
    fmodel["parsnip model definition"]
    ftune["tuning functions"]
    bestmodel["best model from tuning"]
    bestworkflow["workflow of the best model"]
    fitmodel["fitted best model with full data"]
    bestfit["predicted values from one base learner"]

    %% Define the branches with arrowhead connections
    frecipe ---|recipes::recipe()| fittune
    fmodel ---|`switch_model()`| fittune
    ftune ---|`tune_*()`| fittune
    fittune ---|tune::select_best()| bestmodel
    bestmodel ---|tune::finalize_workflow()| bestworkflow
    bestworkflow ---|parsnip::fit()| fitmodel
    fitmodel ---|predict()| bestfit
```

## Containerization
- TODO: build GPU-enabled Apptainer image
- TODO: make a new branch to replace `container-engine`