chopin

Computation for Climate and Health research On Parallelized infrastructure is a Scalable GIS methods for environmental and climate data analysis.
https://github.com/niehs/chopin

Last synced: 3 months ago
JSON representation

Acceptance Criteria

Revelant topics? false
External users? true
Open source license? true
Active? true
Fork? false

Repository metadata

Computation of Spatial Data by Hierarchical and Objective Partitioning of Inputs for Parallel Processing

Host: GitHub
URL: https://github.com/niehs/chopin
Owner: ropensci
License: other
Created: 2023-08-18T19:39:33.000Z (almost 2 years ago)
Default Branch: main
Last Pushed: 2024-10-12T02:34:59.000Z (9 months ago)
Last Synced: 2024-10-30T04:49:04.714Z (8 months ago)
Language: R
Homepage: https://niehs.github.io/chopin/
Size: 80.9 MB
Stars: 12
Watchers: 2
Forks: 3
Open Issues: 8
Releases: 0
Metadata Files:
- Readme: README.Rmd
- Changelog: NEWS.md
- Contributing: .github/CONTRIBUTING.md
- License: LICENSE
- Citation: CITATION.cff
- Codemeta: codemeta.json

README.Rmd

          ---
output:
  github_document:
    html_preview: false
always_allow_html: true
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "man/figures/README-",
  out.width = "100%"
)
```

# Computation of Spatial Data by Hierarchical and Objective Partitioning of Inputs for Parallel Processing 



![Coverage](https://github.com/ropensci/chopin/actions/workflows/test-coverage-local.yaml/coverage.svg)
[![R-CMD-check](https://github.com/ropensci/chopin/actions/workflows/check-standard.yaml/badge.svg)](https://github.com/ropensci/chopin/actions/workflows/check-standard.yaml)
[![Status at rOpenSci Software Peer Review](https://badges.ropensci.org/638_status.svg)](https://github.com/ropensci/software-review/issues/638)
[![runiverse-package](https://ropensci.r-universe.dev/badges/chopin?scale=1&color=pink&style=round)](https://ropensci.r-universe.dev/chopin)
[![Docs](https://img.shields.io/badge/docs-release-blue.svg)](https://docs.ropensci.org/chopin/)
[![Lifecycle:
stable](https://img.shields.io/badge/lifecycle-stable-blue.svg)](https://lifecycle.r-lib.org/articles/stages.html#stable)



## Objective
This package automates [parallelization](https://en.wikipedia.org/wiki/Parallel_computing) in spatial operations with `chopin` functions as well as [sf](https://github.com/r-spatial/sf)/[terra](https://github.com/rspatial/terra) functions. With [GDAL](https://gdal.org)-compatible files and database tables, `chopin` functions help to calculate spatial variables from vector and raster data with no external software requirements. All who need to perform geospatial operations with large datasets may find this package useful to accelerate the covariate calculation process for further analysis and modeling may find the main functions useful. We assume that users have basic knowledge of [geographic information system data models](https://r.geocompx.org/spatial-class), [coordinate systems and transformations](https://doi.org/10.22224/gistbok/2023.1.2), [spatial operations](https://r.geocompx.org/spatial-operations), and [raster-vector overlay](https://r.geocompx.org/raster-vector).


## Overview

`chopin` encapsulates the parallel processing of spatial computation into three steps. __First,__ users will define the parallelization strategy, which is one of many supported in `future` and `future.mirai` packages. Users always need to register parallel workers with `future` before running the `par_*()` functions that will be introduced below.

```r
future::plan(future.mirai::mirai_multisession, workers = 4L)
# future::multisession, future::cluster are available,
# See future.batchtools and future.callr for other options
# the number of workers are up to users' choice
```

__Second,__ users choose the proper data parallelization configuration by creating a grid partition of the processing extent, defining the field name with values that are hierarchically coded, or entering multiple raster file paths into `par_multirasters()`. __Finally,__ users run `par_*()` function with the configurations set above to compute spatial variables from input data in parallel:

- `par_grid`: parallelize over artificial grid polygons that are generated from the maximum extent of inputs. `par_pad_grid` is used to generate the grid polygons before running this function.

- `par_hierarchy`: parallelize over hierarchy coded in identifier fields (for example, census blocks in each county in the US)

- `par_multirasters`: parallelize over multiple raster files

- Each of the `par_*` functions introduced above has `mirai` version with a suffix `_mirai` after the function names: `par_grid_mirai`, `par_hierarchy_mirai`, and `par_multirasters`. These functions will work properly after creating daemons with `mirai::daemons`.

```r
mirai::daemons(4L, dispatcher = "process")
```

For grid partitioning, the entire study area will be divided into partly overlapped grids. We suggest two flowcharts to help which function to use for parallel processing below. The upper flowchart is raster-oriented and the lower is vector-oriented. They are supplementary to each other. When a user follows the raster-oriented one, they might visit the vector-oriented flowchart at each end of the raster-oriented flowchart.


Processing functions accept [terra](https://github.com/rspatial/terra)/[sf](https://github.com/r-spatial/sf) classes for spatial data. Raster-vector overlay is done with `exactextractr`. Three helper functions encapsulate multiple geospatial data calculation steps over multiple CPU threads.

- `extract_at`: extract raster values with point buffers or polygons with or without kernel weights

- `summarize_sedc`: calculate sums of [exponentially decaying contributions](https://mserre.sph.unc.edu/BMElab_web/SEDCtutorial/index.html)

- `summarize_aw`: area-weighted covariates based on target and reference polygons



### Function selection guide

We provide two flowcharts to help users choose the right function for parallel processing. The raster-oriented flowchart is for users who want to start with raster data, and the vector-oriented flowchart is for users with large vector data.

In **raster-oriented selection**, we suggest four factors to consider:

- Number of raster files: for multiple files, `par_multirasters` is recommended. When there are multiple rasters that share the same extent and resolution, consider stacking the rasters into multilayer SpatRaster object by calling `terra::rast(filenames)`.
- Raster resolution: We suggest 100 meters as a threshold. Rasters with resolution coarser than 100 meters and a few layers would be better for the direct call of `exactextractr::exact_extract()`.
- Raster extent: Using `SpatRaster` in `exactextractr::exact_extract()` is often minimally affected by the raster extent.
- Memory size: `max_cells_in_memory` argument value of `exactextractr::exact_extract()`, raster resolution, and the number of layers in `SpatRaster` are multiplicatively related to the memory usage.

![](man/figures/README-flowchart-raster.png)


For **vector-oriented selection**, we suggest three factors to consider:

- Number of features: When the number of features is over 100,000, consider using `par_grid` or `par_hierarchy` to split the data into smaller chunks.
- Hierarchical structure: If the data has a hierarchical structure, consider using `par_hierarchy` to parallelize the operation.
- Data grouping: If the data needs to be grouped in similar sizes, consider using `par_pad_balanced` or `par_pad_grid` with `mode = "grid_quantile"`.

![](man/figures/README-flowchart-vector.png)


## Installation

`chopin` can be installed using `remotes::install_github` (also possible with `pak::pak` or `devtools::install_github`).

```r
rlang::check_installed("remotes")
remotes::install_github("ropensci/chopin")
```

or you can also set `repos` in `install.packages()` as ROpenSci repository:

```r
install.packages("chopin", repos = "https://ropensci.r-universe.dev")
```

## Examples
Examples will navigate `par_grid`, `par_hierarchy`, and `par_multirasters` functions in `chopin` to parallelize geospatial operations.

```{r load-packages}
# check and install packages to run examples
pkgs <- c("chopin", "dplyr", "sf", "terra", "future", "future.mirai", "mirai")
# install packages if anything is unavailable
rlang::check_installed(pkgs)

library(chopin)
library(dplyr)
library(sf)
library(terra)
library(future)
library(future.mirai)
library(mirai)

# disable spherical geometries
sf::sf_use_s2(FALSE)

# parallelization-safe random number generator
set.seed(2024, kind = "L'Ecuyer-CMRG")
```


### `par_grid`: parallelize over artificial grid polygons
Please refer to a small example below for extracting mean altitude values at circular point buffers and census tracts in North Carolina. Before running code chunks below, set the cloned `chopin` repository as your working directory with `setwd()`


```{r read-nc}
ncpoly <- system.file("shape/nc.shp", package = "sf")
ncsf <- sf::read_sf(ncpoly)
ncsf <- sf::st_transform(ncsf, "EPSG:5070")
plot(sf::st_geometry(ncsf))
```


#### Generate random points in NC
Ten thousands random point locations were generated inside the counties of North Carolina.
```{r gen-ncpoints}
ncpoints <- sf::st_sample(ncsf, 1e4)
ncpoints <- sf::st_as_sf(ncpoints)
ncpoints$pid <- sprintf("PID-%05d", seq(1, 1e4))
plot(sf::st_geometry(ncpoints))
```

#### Target raster dataset: [Shuttle Radar Topography Mission](https://www.usgs.gov/centers/eros/science/usgs-eros-archive-digital-elevation-shuttle-radar-topography-mission-srtm-1)
We use an elevation dataset with and a moderate spatial resolution (approximately 400 meters or 0.25 miles).

```{r load-srtm}
# data preparation
srtm_path <- file.path(tempdir(check = TRUE), "nc_srtm15_otm.tif")
srtm_url <-
  paste0(
    "https://raw.githubusercontent.com/",
    "ropensci/chopin/refs/heads/0.9.0-cran/",
    "tests/testdata/nc_srtm15_otm.tif"
  )
download.file(srtm_url, srtm_path, mode = "wb")

# terra SpatRaster objects are wrapped when exported to rds file
srtm_ras <- terra::rast(srtm_path)
terra::crs(srtm_ras) <- "EPSG:5070"
srtm_ras
terra::plot(srtm_ras)
```


```{r srtm-extract-single}
# ncpoints_tr <- terra::vect(ncpoints)
system.time(
  ncpoints_srtm <-
    chopin::extract_at(
      x = srtm_ras,
      y = ncpoints,
      id = "pid",
      mode = "buffer",
      radius = 1e4L  # 10,000 meters (10 km)
    )
)

```

#### Generate regular grid computational regions
`chopin::par_pad_grid()` takes a spatial dataset to generate regular grid polygons with `nx` and `ny` arguments with padding. Users will have both overlapping (by the degree of `radius`) and non-overlapping grids, both of which will be utilized to split locations and target datasets into sub-datasets for efficient processing.

```{r gen-compregions}
compregions <-
  chopin::par_pad_grid(
    ncpoints,
    mode = "grid",
    nx = 2L,
    ny = 2L,
    padding = 1e4L
  )
```

`compregions` is a list object with two elements named `original` (non-overlapping grid polygons) and `padded` (overlapping by `padding`). The figures below illustrate the grid polygons with and without overlaps.

```{r compare-compregions, fig.width = 8, fig.height = 8}
names(compregions)

oldpar <- par()
par(mfrow = c(2, 1))
terra::plot(
  terra::vect(compregions$original),
  main = "Original grids"
)
terra::plot(
  terra::vect(compregions$padded),
  main = "Padded grids"
)

```


#### Parallel processing
Using the grid polygons, we distribute the task of averaging elevations at 10,000 circular buffer polygons, which are generated from the random locations, with 10 kilometers radius by `chopin::par_grid()`. Users always need to **register** multiple CPU threads (logical cores) for parallelization. `chopin::par_*()` functions are flexible in terms of supporting generic spatial operations in `sf` and `terra`, especially where two datasets are involved. Users can inject generic functions' arguments (parameters) by writing them in the ellipsis (`...`) arguments, as demonstrated below:

```{r}
future::plan(future.mirai::mirai_multisession, workers = 4L)

system.time(
  ncpoints_srtm_mthr <-
    par_grid(
      grids = compregions,
      fun_dist = extract_at,
      x = srtm_ras,
      y = ncpoints,
      id = "pid",
      radius = 1e4L,
      .standalone = FALSE
    )
)

ncpoints_srtm <-
  extract_at(
    x = srtm_ras,
    y = ncpoints,
    id = "pid",
    radius = 1e4L
  )

future::plan(future::sequential)
mirai::daemons(0L)
```


```{r compare-single-multi}
colnames(ncpoints_srtm_mthr)[2] <- "mean_par"
ncpoints_compar <- merge(ncpoints_srtm, ncpoints_srtm_mthr)
# Are the calculations equal?
all.equal(ncpoints_compar$mean, ncpoints_compar$mean_par)
```

```{r plot results}
ncpoints_s <-
  merge(ncpoints, ncpoints_srtm)
ncpoints_m <-
  merge(ncpoints, ncpoints_srtm_mthr)

plot(ncpoints_s[, "mean"], main = "Single-thread", pch = 19, cex = 0.33)
plot(ncpoints_m[, "mean_par"], main = "Multi-thread", pch = 19, cex = 0.33)
```

The same workflow operates on `mirai` dispatchers.

```{r demo-par-grid-mirai}
mirai::daemons(n = 4L, dispatcher = "process")

system.time(
  ncpoints_srtm_mthr <-
    par_grid_mirai(
      grids = compregions,
      fun_dist = extract_at,
      x = srtm_ras,
      y = ncpoints,
      id = "pid",
      radius = 1e4L,
      .standalone = FALSE
    )
)

# remove mirai::daemons
mirai::daemons(0L)

```


### `chopin::par_hierarchy()`: parallelize geospatial computations using intrinsic data hierarchy

We usually have nested/exhaustive hierarchies in real-world datasets. For example, land is organized by administrative/jurisdictional borders where multiple levels exist. In the U.S. context, a state consists of several counties, counties are split into census tracts, and they have a group of block groups. `chopin::par_hierarchy()` leverages such hierarchies to parallelize geospatial operations, which means that a group of lower-level geographic units in a higher-level geography is assigned to a process. A demonstration below shows that census tracts are grouped by their counties then each county will be processed in a CPU thread.


#### Read data
```{r}
# nc_hierarchy.gpkg includes two layers: county and tracts
hierarchy_path <- file.path(tempdir(check = TRUE), "nc_hierarchy.gpkg")
hierarchy_url <-
  paste0(
    "https://raw.githubusercontent.com/",
    "ropensci/chopin/refs/heads/0.9.0-cran/",
    "tests/testdata/nc_hierarchy.gpkg"
  )
download.file(hierarchy_url, hierarchy_path, mode = "wb")

nc_data <- hierarchy_path
nc_county <- sf::st_read(nc_data, layer = "county")
nc_tracts <- sf::st_read(nc_data, layer = "tracts")

# reproject to Conus Albers Equal Area
nc_county <- sf::st_transform(nc_county, "EPSG:5070")
nc_tracts <- sf::st_transform(nc_tracts, "EPSG:5070")
nc_tracts$COUNTY <- substr(nc_tracts$GEOID, 1, 5)
```


#### Extract average SRTM elevations by single and multiple threads
```{r compare-runtime-hierarchy}
future::plan(future.mirai::mirai_multisession, workers = 4L)

# single-thread
system.time(
  nc_elev_tr_single <-
    chopin::extract_at(
      x = srtm_ras,
      y = nc_tracts,
      id = "GEOID"
    )
)

# hierarchical parallelization
system.time(
  nc_elev_tr_distr <-
    chopin::par_hierarchy(
      regions = nc_county, # higher level geometry
      regions_id = "GEOID", # higher level unique id
      fun_dist = extract_at,
      x = srtm_ras,
      y = nc_tracts, # lower level geometry
      id = "GEOID", # lower level unique id
      func = "mean"
    )
)

```


### `par_multirasters()`: parallelize over multiple rasters

There is a common case of having a large group of raster files at which the same operation should be performed. `chopin::par_multirasters()` is for such cases. An example below demonstrates where we have five elevation raster files to calculate the average elevation at counties in North Carolina.

```{r prep-multiraster}
# nccnty <- sf::st_read(nc_data, layer = "county")
ncelev <- terra::rast(srtm_path)
terra::crs(ncelev) <- "EPSG:5070"
names(ncelev) <- c("srtm15")
tdir <- tempdir(check = TRUE)

terra::writeRaster(ncelev, file.path(tdir, "test1.tif"), overwrite = TRUE)
terra::writeRaster(ncelev, file.path(tdir, "test2.tif"), overwrite = TRUE)
terra::writeRaster(ncelev, file.path(tdir, "test3.tif"), overwrite = TRUE)
terra::writeRaster(ncelev, file.path(tdir, "test4.tif"), overwrite = TRUE)
terra::writeRaster(ncelev, file.path(tdir, "test5.tif"), overwrite = TRUE)

# check if the raster files were exported as expected
testfiles <- list.files(tdir, pattern = "*.tif$", full.names = TRUE)
testfiles
```

```{r run-multiraster}
system.time(
  res <-
    chopin::par_multirasters(
      filenames = testfiles,
      fun_dist = extract_at,
      x = ncelev,
      y = nc_county,
      id = "GEOID",
      func = "mean"
    )
)
knitr::kable(head(res))

# remove temporary raster files
file.remove(testfiles)
```




## Parallelization of a generic geospatial operation
Other than `chopin` processing functions, `chopin::par_*()` functions support generic geospatial operations. An example below uses `terra::nearest()`, which gets the nearest feature's attributes, inside `chopin::par_grid()`.

```{r prep-par-generic}
ncrd1_path <- file.path(tempdir(check = TRUE), "ncroads_first.gpkg")
ncrd1_url <-
  paste0(
    "https://raw.githubusercontent.com/",
    "ropensci/chopin/refs/heads/0.9.0-cran/",
    "tests/testdata/ncroads_first.gpkg"
  )
download.file(ncrd1_url, ncrd1_path, mode = "wb")

# Generate 5000 random points
pnts <- sf::st_sample(nc_county, 5000)
pnts <- sf::st_as_sf(pnts)
# assign identifiers
pnts$pid <- sprintf("RPID-%04d", seq(1, 5000))
rd1 <- sf::st_read(ncrd1_path)

# reproject
pntst <- sf::st_transform(pnts, "EPSG:5070")
rd1t <- sf::st_transform(rd1, "EPSG:5070")

# generate grids
nccompreg <-
  chopin::par_pad_grid(
    input = pntst,
    mode = "grid",
    nx = 4L,
    ny = 2L,
    padding = 5e4L
  )

```

The figure below shows the padded grids (50 kilometers), primary roads, and points. Primary roads will be selected by a padded grid per iteration and used to calculate the distance from each point to the nearest primary road. Padded grids and their overlapping areas will look different according to `padding` argument in `chopin::par_pad_grid()`.



```{r map-all}
# plot
terra::plot(nccompreg$padded, border = "orange")
terra::plot(terra::vect(ncsf), add = TRUE)
terra::plot(rd1t, col = "blue", add = TRUE)
terra::plot(pntst, add = TRUE, cex = 0.3)
legend(1.02e6, 1.72e6,
       legend = c("Computation grids (50km padding)", "Major roads"),
       lty = 1, lwd = 1, col = c("orange", "blue"),
       cex = 0.5)
```


```{r compare-generic}
# terra::nearest run
system.time(
  restr <- terra::nearest(x = terra::vect(pntst), y = terra::vect(rd1t))
)

pnt_path <- file.path(tdir, "pntst.gpkg")
sf::st_write(pntst, pnt_path)

# we use four threads that were configured above
system.time(
  resd <-
    chopin::par_grid(
      grids = nccompreg,
      fun_dist = nearest,
      x = pnt_path,
      y = ncrd1_path,
      pad_y = TRUE
    )
)
```

- We will compare the results from the single-thread and multi-thread calculation.

```{r compare-distance}
resj <- merge(restr, resd, by = c("from_x", "from_y"))
all.equal(resj$distance.x, resj$distance.y)
```

Users should be mindful of caveats in the parallelization of nearest feature search, which may result in no or excess distance depending on the distribution of the target dataset to which the nearest feature is searched. For example, when one wants to calculate the nearest interstate from rural homes with fine grids, some grids may have no interstates then homes in such grids will not get any distance to the nearest interstate. Such problems can be avoided by choosing `nx`, `ny`, and `padding` values in `par_pad_grid()` meticulously.


## Map function arguments for `chopin` parallelization

If users' custom function with two main raster/vector arguments whose names are not `x` or `y`, users can use `par_convert_f` to map two arguments to `x` and `y`.

```r
sf_intersection_buffer <- function(a, b, buffer_dist = 100) {
  intersected <- st_intersection(a, b)
  buffered <- st_buffer(intersected, dist = buffer_dist)
  return(buffered)
}

# Convert the function to remap arguments dynamically
custom_intersect_buffer <- par_convert_f(sf_intersection_buffer, x = "a", y = "b", dist = "buffer_dist")

```

## Caveats

### Why parallelization is slower than the ordinary function run?
Parallelization may underperform when the datasets are too small to take advantage of divide-and-compute approach, where parallelization overhead is involved. Overhead here refers to the required amount of computational resources for transferring objects to multiple processes. Since the demonstrations above use quite small datasets, the advantage of parallelization was not as noticeable as it was expected. Should a large amount of data (spatial/temporal resolution or number of files, for example) be processed, users could find the efficiency of this package. A vignette in this package demonstrates use cases extracting various climate/weather datasets.


### Notes on data restrictions
`chopin` works best with **two-dimensional** (**planar**) geometries. Users should disable `s2` spherical geometry mode in `sf` by setting `sf::sf_use_s2(FALSE)`. Running any `chopin` functions at spherical or three-dimensional (e.g., including M/Z dimensions) geometries may produce incorrect or unexpected results.

Citation (CITATION.cff)

cff-version: 1.2.0
title: chopin
message: >-
  If you use this software, please cite it using the
  metadata from this file.
type: software
authors:
  - given-names: Insang
    family-names: Song
    email: [email protected]
    affiliation: 'Seoul National University'
    orcid: 'https://orcid.org/0000-0001-8732-3256'
  - given-names: Kyle P.
    family-names: Messier
    email: [email protected]
    affiliation: NIEHS
    orcid: 'https://orcid.org/0000-0001-9508-9623'
repository-code: 'https://github.com/ropensci/chopin'
keywords:
  - spatial data
  - parallel processing
license: MIT

Owner metadata

Name: rOpenSci
Login: ropensci
Email: [email protected]
Kind: organization
Description:
Website: https://ropensci.org/
Location: Berkeley, CA
Twitter: rOpenSci
Company:
Icon url: https://avatars.githubusercontent.com/u/1200269?v=4
Repositories: 307
Last ynced at: 2023-03-10T20:30:59.242Z
Profile URL: https://github.com/ropensci

GitHub Events

Total

Issues event: 3
Watch event: 9
Delete event: 4
Issue comment event: 10
Push event: 55
Pull request event: 9
Create event: 4

Last Year

Issues event: 3
Watch event: 9
Delete event: 4
Issue comment event: 10
Push event: 55
Pull request event: 9
Create event: 4

Name	Email	Commits
Insang Song	i**g@n**v	238
Insang Song	s**x@h**m	32
{SET}group	1****y	4

Issue and Pull Request metadata

Last synced: 9 months ago

Total issues: 52
Total pull requests: 74
Average time to close issues: about 1 month
Average time to close pull requests: about 16 hours
Total issue authors: 4
Total pull request authors: 2
Average comments per issue: 2.04
Average comments per pull request: 0.76
Merged pull request: 69
Bot issues: 0
Bot pull requests: 0

Past year issues: 43
Past year pull requests: 67
Past year average time to close issues: about 1 month
Past year average time to close pull requests: about 17 hours
Past year issue authors: 4
Past year pull request authors: 2
Past year average comments per issue: 1.86
Past year average comments per pull request: 0.61
Past year merged pull request: 63
Past year bot issues: 0
Past year bot pull requests: 0

More stats: https://issues.ecosyste.ms/repositories/lookup?url=https://github.com/niehs/chopin

Top Issue Authors

sigmafelix (45)
kyle-messier (4)
cole-brokamp (2)
Aariq (1)

Top Pull Request Authors

sigmafelix (71)
kyle-messier (3)

Top Issue Labels

enhancement (18)
documentation (12)
release (12)
longterm (9)
function (4)
bug (4)
refactoring (2)
question (1)

Top Pull Request Labels

documentation (8)
enhancement (4)
release (3)

Score: 4.0943445622221

ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Open Sustainable Technology