SDGmapR

R functions and datasets related to the mapping of text to the United Nations 17 Sustainable Development Goals.
https://github.com/CMUSustainability/SDGmapR

Category: Sustainable Development
Sub Category: Sustainable Development Goals


Repository metadata

R functions and datasets related to the mapping of text to the United Nations 17 Sustainable Development Goals (SDGs).

Host: GitHub
URL: https://github.com/CMUSustainability/SDGmapR
Owner: CMUSustainability
License: mit
Created: 2021-10-03T00:46:41.000Z (over 3 years ago)
Default Branch: main
Last Pushed: 2022-05-12T08:16:11.000Z (about 3 years ago)
Last Synced: 2025-05-09T18:32:06.573Z (about 2 months ago)
Language: R
Size: 3.07 MB
Stars: 12
Watchers: 2
Forks: 1
Open Issues: 0
Releases: 0
Metadata Files:
- Readme: README.Rmd
- License: LICENSE

README.Rmd

---
output: github_document
---



```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "man/figures/README-",
  out.width = "100%"
)
```

# SDGmapR




The goal of `SDGmapR` is to provide an open-source foundation for systematically mapping text
to the United Nations Sustainable Development Goals (SDGs). This R package contains publicly available [SDG keyword datasets](https://github.com/pwu97/SDGmapR/tree/main/datasets) in the `tidy` data format,
the [UN Official SDG color scheme](https://www.un.org/sustainabledevelopment/wp-content/uploads/2019/01/SDG_Guidelines_AUG_2019_Final.pdf) and [SDG Descriptions](https://github.com/pwu97/SDGmapR/blob/main/datasets/sdg_desc_cleaned.csv), and several functions for
mapping text to particular sets of keywords.

## Installation

You can install the development version from [GitHub](https://github.com/) with:

``` r
# install.packages("devtools")
devtools::install_github("CMUSustainability/SDGmapR")
```

## Publicly Available SDG Keywords

The table below lists SDG keyword lists that have been published online. Some
of the lists assign a weight to every keyword, while others do not. In the
`SDGmapR` package, every keyword is assigned an equal weight of one when the source does not provide weights.
The `SDG17` column indicates whether the dataset includes keywords
related to SDG17.

```{r, echo=FALSE, example}
library(knitr)
sdg_table <- data.frame(
  "Source" = c("[Core Elsevier (Work in Progress)](https://data.mendeley.com/datasets/87txkw7khs/1)", 
               "[Improved Elsevier Top 100](https://data.mendeley.com/datasets/9sxdykm8s4/2)", 
               "[SDSN](https://ap-unsdsn.org/regional-initiatives/universities-sdgs/)", 
               "[CMU Top 250 Words](https://www.cmu.edu/leadership/the-provost/provost-priorities/sustainability-initiative/sdg-definitions.html)",
               "[CMU Top 500 Words](https://www.cmu.edu/leadership/the-provost/provost-priorities/sustainability-initiative/sdg-definitions.html)",
               "[CMU Top 1000 Words](https://www.cmu.edu/leadership/the-provost/provost-priorities/sustainability-initiative/sdg-definitions.html)",
               "[University of Auckland (Work in Progress)](https://www.sdgmapping.auckland.ac.nz/)", "[University of Toronto (Work in Progress)](https://data.utoronto.ca/sustainable-development-goals-sdg-report/sdg-report-appendix/)"),
  "Dataset" = c("`elsevier_keywords`",
             "`elsevier100_keywords`",
             "`sdsn_keywords`",
             "`cmu250_keywords`",
             "`cmu500_keywords`",
             "`cmu1000_keywords`",
             "`auckland_keywords`",
             "`toronto_keywords`"),
  "CSV" = c("[Link](https://github.com/pwu97/SDGmapR/blob/main/datasets/elsevier_keywords_cleaned.csv)", "[Link](https://github.com/pwu97/SDGmapR/blob/main/datasets/elsevier100_keywords_cleaned.csv)", "[Link](https://github.com/pwu97/SDGmapR/blob/main/datasets/sdsn_keywords_cleaned.csv)", 
"[Link](https://github.com/pwu97/SDGmapR/blob/main/datasets/cmu250_keywords_cleaned.csv)",
"[Link](https://github.com/pwu97/SDGmapR/blob/main/datasets/cmu500_keywords_cleaned.csv)",
"[Link](https://github.com/pwu97/SDGmapR/blob/main/datasets/cmu1000_keywords_cleaned.csv)", "", ""),
  "SDG17" = c("No", "No", "Yes", "No", "No", "No", "Yes", "Yes")
)
kable(sdg_table)
```
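
The equal-weight convention described above can be sketched in base R. This is only an illustration: the keyword list is invented, and the package's own cleaning scripts may handle unweighted sources differently.

```r
# Hypothetical keyword list published without weights (invented for illustration)
unweighted <- data.frame(
  goal = c(1, 1, 7),
  keyword = c("poverty", "basic services", "renewable energy"),
  stringsAsFactors = FALSE
)

# When the source provides no weight column, give every keyword a weight of one
if (!"weight" %in% names(unweighted)) {
  unweighted$weight <- 1
}

unweighted
```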

## Example SDGmapR Usage

We can map text to one SDG with the `count_sdg_weights` function, which adds up the
weights of the keywords found. We can find the matched keywords for one SDG with the
`tabulate_sdg_keywords` function, which returns the words as a vector; applying
`unnest()` to the result lets us view them in the `tidy` format.

```{r, warning=FALSE, message=FALSE}
library(tidyverse)
library(SDGmapR)

# Load first 1000 #tidytuesday tweets
tweets <- readRDS(url("https://github.com/rfordatascience/tidytuesday/blob/master/data/2019/2019-01-01/tidytuesday_tweets.rds?raw=true")) %>%
  select(text) %>%
  head(1000) %>%
  mutate(text = str_to_lower(text))

# Map to SDG 1 using Improved Elsevier Top 100 Keywords
tweets_sdg1 <- tweets %>%
  mutate(sdg_1_weight = count_sdg_weights(text, 1),
         sdg_1_words = tabulate_sdg_keywords(text, 1)) %>%
  arrange(desc(sdg_1_weight)) %>%
  select(text, sdg_1_weight, sdg_1_words)

# View SDG 1 matched keywords
tweets_sdg1 %>%
  unnest(sdg_1_words)
```
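
To make the idea of "adding up the weights of the keywords found" concrete, here is a minimal base R sketch. It is not the package's implementation: the functions `toy_tabulate_keywords` and `toy_count_weights`, the keyword table, and the choice to count each keyword once per text are all assumptions made for illustration.

```r
# Toy keyword table for one goal; keywords and weights are invented
toy_keywords <- data.frame(
  keyword = c("poverty", "income", "welfare"),
  weight  = c(2.0, 1.0, 0.5),
  stringsAsFactors = FALSE
)

# Return the keywords that appear as whole words in a single lower-cased text
toy_tabulate_keywords <- function(text, keywords) {
  patterns <- paste0("\\b", keywords$keyword, "\\b")
  found <- vapply(patterns, grepl, logical(1), x = text, perl = TRUE)
  keywords$keyword[found]
}

# Sum the weights of the matched keywords (each keyword counted once)
toy_count_weights <- function(text, keywords) {
  matched <- toy_tabulate_keywords(text, keywords)
  sum(keywords$weight[keywords$keyword %in% matched])
}

text <- "rising income does not always reduce poverty"
toy_tabulate_keywords(text, toy_keywords)
toy_count_weights(text, toy_keywords)
```

Here "poverty" and "income" match, so the toy weight is 2.0 + 1.0 = 3.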

We can map to a different set of keywords by passing an additional argument to
our function. Here we use the `cmu250` (CMU Top 250 Keywords) dataset of SDG keywords instead of the default `elsevier100` dataset.

```{r}
# Map to SDG 3 using CMU Top 250 keywords
tweets %>%
  mutate(sdg_weight = count_sdg_weights(text, 3, "cmu250")) %>%
  select(text, sdg_weight) %>%
  arrange(desc(sdg_weight))

# Map to SDG 5 using CMU Top 250 keywords
tweets %>%
  mutate(sdg_weight = count_sdg_weights(text, 5, "cmu250")) %>%
  select(text, sdg_weight) %>%
  arrange(desc(sdg_weight))

# Map to SDG 7 using CMU Top 250 keywords
tweets %>%
  mutate(sdg_weight = count_sdg_weights(text, 7, "cmu250")) %>%
  select(text, sdg_weight) %>%
  arrange(desc(sdg_weight))
```

We can map course descriptions as well. Below, we show the package being used to map the CMU course descriptions from Fall 2022 to the SDGs.

```{r}
# Create dataframe of CMU course descriptions from Fall 2022
classes <- readxl::read_excel("datasets/cmu_f22_course_info.xlsx") %>%
  rename(semester = `Semester`,
         course_title = `Course Title`,
         course_num = `Course Number`,
         course_desc = `Course Description`) %>% 
  mutate(course_dept = substr(course_num, 1, 2),
         course_level = substr(course_num, 3, 5),
         course_level_specific = substr(course_num, 3, 3)) %>%
  mutate(text = paste(str_to_lower(course_title), str_to_lower(course_desc))) %>%
  # Clean the punctuation
  mutate(text = gsub("[^[:alnum:]['-]", " ", text)) %>%
  arrange(desc(semester)) %>%
  distinct(course_num, .keep_all = TRUE) %>%
  # Only select 5% of courses for the purposes of this Markdown file
  sample_frac(0.05)

# Perform the mapping
all_sdg_keywords <- data.frame()
for (goal_num in 1:17) {
  classes %>%
    mutate(goal = goal_num,
           keyword = tabulate_sdg_keywords(text, goal_num, keywords = "cmu250")) %>%
    unnest(keyword) -> cur_sdg_keywords
  
  all_sdg_keywords <- rbind(all_sdg_keywords, cur_sdg_keywords) 
}
all_sdg_keywords %>%
  left_join(cmu250_keywords, by = c("goal", "keyword")) %>%
  select(keyword, weight, semester, course_num, goal, color) %>%
  arrange(course_num) -> all_sdg_keywords

# View mapped keywords dataset
all_sdg_keywords
```

## Frequently Asked Questions (FAQs)

Q: What are the `cmu1000`, `cmu500`, and `cmu250` datasets? Why 250, 500, and 1000?

A: These are SDG keyword datasets created by Carnegie Mellon University (CMU). The number indicates approximately how many words are in each SDG for that dataset. For instance, for the `cmu500` dataset, we would expect roughly 500 words in SDG6. We initially created the dataset `cmu1000` to represent the dataset with roughly 1000 words for each SDG, and then we took the top 250 and 500 words based on keyword weight to generate `cmu250` and `cmu500`.
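
As a rough illustration of taking the top keywords per SDG by weight, the base R sketch below keeps the `n` highest-weighted keywords within each goal. The helper `top_n_per_goal` and the toy data are hypothetical; the actual CMU datasets may have been derived with different tooling or tie-breaking.

```r
# Toy "full" keyword table with invented keywords and weights
full <- data.frame(
  goal = c(6, 6, 6, 6, 7, 7, 7),
  keyword = c("water", "sanitation", "river", "pipe", "energy", "solar", "grid"),
  weight = c(9, 7, 3, 1, 8, 6, 2),
  stringsAsFactors = FALSE
)

# Keep the n highest-weighted keywords within each goal
top_n_per_goal <- function(keywords, n) {
  # Sort by goal, then by descending weight
  ordered <- keywords[order(keywords$goal, -keywords$weight), ]
  # Number the rows within each goal and keep the first n
  rank_in_goal <- ave(seq_len(nrow(ordered)), ordered$goal, FUN = seq_along)
  ordered[rank_in_goal <= n, ]
}

top2 <- top_n_per_goal(full, 2)
top2
```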

Q: Is there an easy way to customize the SDG keyword dataset and add my own assessment of the keyword weights?

A: Yes! Instead of passing in one of the known SDG keyword datasets, you can directly pass in your own SDG keyword dataset. All you have to do is ensure that the columns match up with `goal`, `keyword`, `pattern`, `weight`, and `color`.
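
A custom dataset with the five columns named above might look like the sketch below. The keywords, regular-expression patterns, weights, and hex colors are illustrative placeholders, and the commented-out call is an assumption about how the dataset would be passed in, based on the `keywords =` argument used earlier in this README.

```r
# Hypothetical custom SDG keyword dataset (all values are placeholders)
my_keywords <- data.frame(
  goal    = c(1, 7),
  keyword = c("food bank", "heat pump"),
  pattern = c("\\bfood banks?\\b", "\\bheat pumps?\\b"),
  weight  = c(1.5, 2.0),
  color   = c("#E5243B", "#FCC30B"),
  stringsAsFactors = FALSE
)

# Check that the expected columns are all present before using the dataset
required_cols <- c("goal", "keyword", "pattern", "weight", "color")
stopifnot(all(required_cols %in% names(my_keywords)))

# Assumed usage, left commented out because the exact interface for custom
# datasets is not shown in this README:
# count_sdg_weights("we installed a heat pump", 7, keywords = my_keywords)
```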

Q: How were the weights generated for each keyword?

A: Very loosely, they were interpolated from the [Elsevier SDG Keyword weights](https://elsevier.digitalcommonsdata.com/datasets/9sxdykm8s4/2). Using Google's Word2Vec, we assigned the weight of each word to be a weighted proportion of defined Elsevier keywords, or keywords that were in Word2Vec's dataset, based on how often they were among the 100 nearest neighbors in terms of semantic similarity.

Q: Why didn't you use compound expressions like "poverty AND economic resources" or "poverty AND (disaster OR disaster area)"?

A: We attempted to use compound expressions for SDG mapping, but found that in practice, texts matching the specific compound expressions were few and far between. For instance, when we tried to use compound expressions for SDG mapping using [Elsevier's newly released dataset](https://figshare.com/articles/dataset/Keywords_and_search_strings_for_generating_SDG_training_sets/17294255), we found that very few course descriptions matched any specific compound expression. Thus, we used keyword weights instead.
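
For readers unfamiliar with compound expressions, the two examples from the question can be written as boolean combinations of substring checks. This base R sketch is only meant to show why such expressions match rarely; the sample texts are invented, and it does not reproduce Elsevier's search strings.

```r
# Invented sample texts, already lower-cased
texts <- c(
  "poverty after a natural disaster",
  "poverty and access to economic resources",
  "introduction to statistics"
)

# TRUE where the literal term occurs in each text
has <- function(term) grepl(term, texts, fixed = TRUE)

# "poverty AND economic resources"
compound_1 <- has("poverty") & has("economic resources")

# "poverty AND (disaster OR disaster area)"
compound_2 <- has("poverty") & (has("disaster") | has("disaster area"))

data.frame(texts, compound_1, compound_2)
```

Each compound expression requires several terms to co-occur, so only one of the three toy texts satisfies each of them.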

Q: Words like "student", "semester", and "homework" seem like very general SDG4 keywords. When mapping course descriptions, wouldn't this tag almost every course with SDG4?

A: Yes. For that reason, we filtered out words that were too general among course descriptions. The words we excluded from SDG4 when mapping course descriptions are: "education", "educational", "school", "schools", "student", "students", "teaching", "learning", "apprenticeship", "skill", "skills", "curriculum", "teachers", "trainees", "trainee", "teacher", "classroom", "educators", "math", "classrooms", "educator", "graduates", "diploma", "undergraduates", "undergrad", "course", "mathematics", "achievement", "courses", "elementary", "academic", "training", "pupils", "undergraduate", "college", "colleges", "learners", "algebra", "reading", "comprehension", "achievements", "universities", "faculty", "internship", "principal", "internships", "career", "maths", "adult", "principals", "curricula", "grad", "biology", "university", "semester", "scholars", "literacy", "exam", "exams", "tutoring", "syllabus", "instructor", "instructors", "degree", "classes", "language", "science", "instruction", "campus", "homework", "instructional", "curricular", "humanities", "mentoring", "teach", "employment", "qualifications", "coursework", "graduate".
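
One way to apply such an exclusion list is to drop the matching SDG4 rows from the keyword dataset before mapping. The sketch below uses a toy keyword table and a shortened exclusion list; it is an assumption about how the filtering could be done, not a description of the code used for the CMU mapping.

```r
# Toy keyword table; the same word may appear under more than one goal
toy_keywords <- data.frame(
  goal = c(4, 4, 4, 3),
  keyword = c("student", "homework", "early childhood", "student"),
  weight = c(5, 2, 4, 1),
  stringsAsFactors = FALSE
)

# Shortened version of the exclusion list, for illustration only
excluded_sdg4 <- c("student", "homework", "semester")

# Drop excluded words for SDG4 only, leaving other goals untouched
drop <- toy_keywords$goal == 4 & toy_keywords$keyword %in% excluded_sdg4
filtered <- toy_keywords[!drop, ]
filtered
```

Restricting the filter to `goal == 4` keeps a word such as "student" available for any other goal that lists it.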

## Acknowledgements

Thank you to Jingwen Mu and Kevin Kang from the University of Auckland for discussions and insights about regular expression matching with the SDG keywords.

Owner metadata

Name:
Login: CMUSustainability
Email:
Kind: user
Description:
Website:
Location:
Twitter:
Company:
Icon url: https://avatars.githubusercontent.com/u/91693153?v=4
Repositories: 1
Last synced at: 2023-03-27T11:54:18.531Z
Profile URL: https://github.com/CMUSustainability


Committers metadata

Last synced: 4 days ago

Total Commits: 50
Total Committers: 2
Avg Commits per committer: 25.0
Development Distribution Score (DDS): 0.1

Commits in past year: 0
Committers in past year: 0
Avg Commits per committer in past year: 0.0
Development Distribution Score (DDS) in past year: 0.0

Name	Email	Commits
pwu97	4****7	45
CMUSustainability	9****y	5


Issue and Pull Request metadata

Last synced: 1 day ago

Total issues: 0
Total pull requests: 0
Average time to close issues: N/A
Average time to close pull requests: N/A
Total issue authors: 0
Total pull request authors: 0
Average comments per issue: 0
Average comments per pull request: 0
Merged pull request: 0
Bot issues: 0
Bot pull requests: 0

Past year issues: 0
Past year pull requests: 0
Past year average time to close issues: N/A
Past year average time to close pull requests: N/A
Past year issue authors: 0
Past year pull request authors: 0
Past year average comments per issue: 0
Past year average comments per pull request: 0
Past year merged pull request: 0
Past year bot issues: 0
Past year bot pull requests: 0

More stats: https://issues.ecosyste.ms/repositories/lookup?url=https://github.com/CMUSustainability/SDGmapR


Dependencies

DESCRIPTION cran

R >= 2.10 depends

Score: 3.1780538303479458
