SDGmapR
R functions and datasets related to the mapping of text to the United Nations 17 Sustainable Development Goals.
https://github.com/CMUSustainability/SDGmapR
Category: Sustainable Development
Sub Category: Sustainable Development Goals
Last synced: about 11 hours ago
JSON representation
Repository metadata
R functions and datasets related to the mapping of text to the United Nations 17 Sustainable Development Goals (SDGs).
- Host: GitHub
- URL: https://github.com/CMUSustainability/SDGmapR
- Owner: CMUSustainability
- License: mit
- Created: 2021-10-03T00:46:41.000Z (over 3 years ago)
- Default Branch: main
- Last Pushed: 2022-05-12T08:16:11.000Z (almost 3 years ago)
- Last Synced: 2024-10-29T20:59:55.795Z (6 months ago)
- Language: R
- Size: 3.07 MB
- Stars: 12
- Watchers: 2
- Forks: 1
- Open Issues: 0
- Releases: 0
-
Metadata Files:
- Readme: README.Rmd
- License: LICENSE
README.Rmd
--- output: github_document --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>", fig.path = "man/figures/README-", out.width = "100%" ) ``` # SDGmapR The goal of `SDGmapR` is to provide an open-source foundation for the systematic mapping to the United Nations Sustainable Development Goals (SDGs). In this R package one can find publicly available [SDG keyword datasets](https://github.com/pwu97/SDGmapR/tree/main/datasets) in the `tidy` data format, the [UN Official SDG color scheme](https://www.un.org/sustainabledevelopment/wp-content/uploads/2019/01/SDG_Guidelines_AUG_2019_Final.pdf) and [SDG Descriptions](https://github.com/pwu97/SDGmapR/blob/main/datasets/sdg_desc_cleaned.csv), and several functions related to the mapping of text to particular sets of keywords. ## Installation You can install the development version from [GitHub](https://github.com/) with: ``` r # install.packages("devtools") devtools::install_github("CMUSustainability/SDGmapR") ``` ## Publicly Available SDG Keywords The table below lists publicly available SDG keywords that have been published online. Some of the lists have weights associated with every keyword, while some do not. For the purposes of the `SDGmapR` package, we will assign an equal weight of one to every word if weights are not given. Note that the column for `SDG17` will represent whether the dataset has keywords related to SDG17. ```{r, echo=FALSE, example} library(knitr) sdg_table <- data.frame( "Source" = c("[Core Elsevier (Work in Progress)](https://data.mendeley.com/datasets/87txkw7khs/1)", "[Improved Elsevier Top 100](https://data.mendeley.com/datasets/9sxdykm8s4/2)", "[SDSN](https://ap-unsdsn.org/regional-initiatives/universities-sdgs/)", "[CMU Top 250 Words](https://www.cmu.edu/leadership/the-provost/provost-priorities/sustainability-initiative/sdg-definitions.html)", "[CMU Top 500 Words](https://www.cmu.edu/leadership/the-provost/provost-priorities/sustainability-initiative/sdg-definitions.html)", "[CMU Top 1000 Words](https://www.cmu.edu/leadership/the-provost/provost-priorities/sustainability-initiative/sdg-definitions.html)", "[University of Auckland (Work in Progress)](https://www.sdgmapping.auckland.ac.nz/)", "[University of Toronto (Work in Progress)](https://data.utoronto.ca/sustainable-development-goals-sdg-report/sdg-report-appendix/)"), "Dataset" = c("`elsevier_keywords`", "`elsevier100_keywords`", "`sdsn_keywords`", "`cmu250_keywords`", "`cmu500_keywords`", "`cmu1000_keywords`", "`auckland_keywords`", "`toronto_keywords`"), "CSV" = c("[Link](https://github.com/pwu97/SDGmapR/blob/main/datasets/elsevier_keywords_cleaned.csv)", "[Link](https://github.com/pwu97/SDGmapR/blob/main/datasets/elsevier100_keywords_cleaned.csv)", "[Link](https://github.com/pwu97/SDGmapR/blob/main/datasets/sdsn_keywords_cleaned.csv)", "[Link](https://github.com/pwu97/SDGmapR/blob/main/datasets/cmu250_keywords_cleaned.csv)", "[Link](https://github.com/pwu97/SDGmapR/blob/main/datasets/cmu500_keywords_cleaned.csv)", "[Link](https://github.com/pwu97/SDGmapR/blob/main/datasets/cmu1000_keywords_cleaned.csv)", "", ""), "SDG17" = c("No", "No", "Yes", "No", "No", "No", "Yes", "Yes") ) kable(sdg_table) ``` ## Example SDGMapR Usage We can map to one SDG with the `count_sdg_keywords` function that adds up the weights of the keywords found. We can find the keywords for one SDG with the `tabulate_sdg_keywords` that returns the words as a vector, which we can view in the `tidy` format by applying `unnest()` to our result. ```{r, warning=FALSE, message=FALSE} library(tidyverse) library(SDGmapR) # Load first 1000 #tidytuesday tweets tweets <- readRDS(url("https://github.com/rfordatascience/tidytuesday/blob/master/data/2019/2019-01-01/tidytuesday_tweets.rds?raw=true")) %>% select(text) %>% head(1000) %>% mutate(text = str_to_lower(text)) # Map to SDG 1 using Improved Elsevier Top 100 Keywords tweets_sdg1 <- tweets %>% mutate(sdg_1_weight = count_sdg_weights(text, 1), sdg_1_words = tabulate_sdg_keywords(text, 1)) %>% arrange(desc(sdg_1_weight)) %>% select(text, sdg_1_weight, sdg_1_words) # View SDG 1 matched keywords tweets_sdg1 %>% unnest(sdg_1_words) ``` We can map to a different set of keywords by adding an additional input into our function, using the `cmu250` (CMU Top 250 Keywords) dataset of SDG keywords instead of the default `elsevier1000` dataset of SDG keywords. ```{r} # Map to SDG 3 using Elsevier Core keywords tweets %>% mutate(sdg_weight = count_sdg_weights(text, 3, "cmu250")) %>% select(text, sdg_weight) %>% arrange(desc(sdg_weight)) # Map to SDG 5 using Elsevier Core keywords tweets %>% mutate(sdg_weight = count_sdg_weights(text, 5, "cmu250")) %>% select(text, sdg_weight) %>% arrange(desc(sdg_weight)) # Map to SDG 7 using Elsevier Core keywords tweets %>% mutate(sdg_weight = count_sdg_weights(text, 7, "cmu250")) %>% select(text, sdg_weight) %>% arrange(desc(sdg_weight)) ``` We can map course descriptions as well. Below, we show the package being used to map the CMU course descriptions from Fall 2022 to the SDGs. ```{r} # Create dataframe of CMU course descriptions from Fall 2022 classes <- readxl::read_excel("datasets/cmu_f22_course_info.xlsx") %>% rename(semester = `Semester`, course_title = `Course Title`, course_num = `Course Number`, course_desc = `Course Description`) %>% mutate(course_dept = substr(course_num, 1, 2), course_level = substr(course_num, 3, 5), course_level_specific = substr(course_num, 3, 3)) %>% mutate(text = paste(str_to_lower(course_title), str_to_lower(course_desc))) %>% # Clean the punctuation mutate(text = gsub("[^[:alnum:]['-]", " ", text)) %>% arrange(desc(semester)) %>% distinct(course_num, .keep_all = TRUE) %>% # Only select 5% of courses for the purposes of this Markdown file sample_frac(0.05) # Perform the mapping all_sdg_keywords <- data.frame() for (goal_num in 1:17) { classes %>% mutate(goal = goal_num, keyword = tabulate_sdg_keywords(text, goal_num, keywords = "cmu250")) %>% unnest(keyword) -> cur_sdg_keywords all_sdg_keywords <- rbind(all_sdg_keywords, cur_sdg_keywords) } all_sdg_keywords %>% left_join(cmu250_keywords, by = c("goal", "keyword")) %>% select(keyword, weight, semester, course_num, goal, color) %>% arrange(course_num) -> all_sdg_keywords # View mapped keywords dataset all_sdg_keywords ``` ## Frequently Asked Questions (FAQs) Q: What are the `cmu1000`, `cmu500`, and `cmu250` datasets? Why 250, 500, and 1000? A: These are SDG keyword datasets created by Carnegie Mellon University (CMU). The number indicates approximately how many words are in each SDG for that dataset. For instance, for the `cmu500` dataset, we would expect roughly 500 words in SDG6. We initially created the dataset `cmu1000` to represent the dataset with roughly 1000 words for each SDG, and then we took the top 250 and 500 words based on keyword weight to generate `cmu250` and `cmu500`. Q: Is there any easy way to customize the SDG keyword dataset and add in and my own assessment of their weights? A: Yes! Instead of passing in one of the known SDG keyword datasets, you can directly pass in your own SDG keyword dataset. All you have to do is ensure that the columns match up with `goal`, `keyword`, `pattern`, `weight`, and `color`. Q: How were the weights generated for each keyword? A: Very loosely, they were interpolated from the [Elsevier SDG Keyword weights](https://elsevier.digitalcommonsdata.com/datasets/9sxdykm8s4/2). Using Google's Word2Vec, we assigned the weight of each word to be a weighted proportion of defined Elsevier keywords, or keywords that were in Word2Vec's dataset, based on how often they were a 100 nearest neighbors in terms of semantic similarity. Q: Why didn't you use compound expressions like "poverty AND economic resources or "poverty AND (disaster OR disaster area)"? A: We have attempted to use compound expressions for SDG mapping, but found that in practice, the specific compound expressions for SDG mapping were few and far between. For instance, when we tried to use compound expressions for SDG mapping using [Elsevier's newly released dataset](https://figshare.com/articles/dataset/Keywords_and_search_strings_for_generating_SDG_training_sets/17294255), we found that very few course descriptions had specific compound expression matchings. Thus, we used keyword weights instead. Q: Words like "student", "semester", and "homework" seem like very general SDG4 keywords when mapping to SDG4. When mapping to course descriptions, wouldn't this tag almost every course with SDG4? A: Yes. Thus, we filtered out words that were too general among course descriptions. The specific list of words we excluded for SDG4 mapping in mapping to course descriptions are: "education", "educational", "school", "schools", "student", "students", "teaching", "learning", "apprenticeship", "skill", "skills", "curriculum", "teachers", "trainees", "trainee", "teacher", "classroom", "educators", "math", "classrooms", "educator", "graduates", "diploma", "undergraduates", "undergrad", "course", "mathematics", "achievement", "courses", "elementary", "academic", "training", "pupils", "undergraduate", "college", "colleges", "learners", "algebra", "reading", "comprehension", "achievements", "universities", "faculty", "internship", "principal", "internships", "career", "maths", "adult", "principals", "curricula", "grad", "biology", "university", "semester", "scholars", "literacy", "exam", "exams", "tutoring", "literacy", "syllabus", "instructor", "instructors", "degree", "classes", "language", "science", "instruction", "campus", "homework", "instructional", "curricular", "humanities", "mentoring", "teach", "employment", "qualifications", "coursework", "graduate". ## Acknowledgements Thank you to Jingwen Mu and Kevin Kang from the University of Auckland for discussions and insights about regular expression matchings with the SDG keywords.
Owner metadata
- Name:
- Login: CMUSustainability
- Email:
- Kind: user
- Description:
- Website:
- Location:
- Twitter:
- Company:
- Icon url: https://avatars.githubusercontent.com/u/91693153?v=4
- Repositories: 1
- Last ynced at: 2023-03-27T11:54:18.531Z
- Profile URL: https://github.com/CMUSustainability
GitHub Events
Total
Last Year
Committers metadata
Last synced: 7 days ago
Total Commits: 50
Total Committers: 2
Avg Commits per committer: 25.0
Development Distribution Score (DDS): 0.1
Commits in past year: 0
Committers in past year: 0
Avg Commits per committer in past year: 0.0
Development Distribution Score (DDS) in past year: 0.0
Name | Commits | |
---|---|---|
pwu97 | 4****7 | 45 |
CMUSustainability | 9****y | 5 |
Committer domains:
Issue and Pull Request metadata
Last synced: 1 day ago
Total issues: 0
Total pull requests: 0
Average time to close issues: N/A
Average time to close pull requests: N/A
Total issue authors: 0
Total pull request authors: 0
Average comments per issue: 0
Average comments per pull request: 0
Merged pull request: 0
Bot issues: 0
Bot pull requests: 0
Past year issues: 0
Past year pull requests: 0
Past year average time to close issues: N/A
Past year average time to close pull requests: N/A
Past year issue authors: 0
Past year pull request authors: 0
Past year average comments per issue: 0
Past year average comments per pull request: 0
Past year merged pull request: 0
Past year bot issues: 0
Past year bot pull requests: 0
Top Issue Authors
Top Pull Request Authors
Top Issue Labels
Top Pull Request Labels
Dependencies
- R >= 2.10 depends
Score: 3.1780538303479458