https://github.com/allenai/dolma
Data and tools for generating and inspecting OLMo pre-training data.
Keywords
data-processing large-language-models llm machile-learning nlp
Keywords from Contributors
transformer language-model measur archiving mapper prompt generic optimize preprocessing animals
Last synced: 11 months ago
Acceptance Criteria
- Relevant topics? true
- External users? true
- Open source license? true
- Active? true
- Fork? false
Repository metadata
- Host: GitHub
- URL: https://github.com/allenai/dolma
- Owner: allenai
- License: apache-2.0
- Created: 2023-06-20T20:37:39.000Z (almost 2 years ago)
- Default Branch: main
- Last Pushed: 2024-05-22T15:36:29.000Z (12 months ago)
- Last Synced: 2024-05-22T16:54:38.832Z (12 months ago)
- Topics: data-processing, large-language-models, llm, machile-learning, nlp
- Language: Python
- Homepage: https://allenai.github.io/dolma/
- Size: 54.1 MB
- Stars: 800
- Watchers: 17
- Forks: 78
- Open Issues: 21
- Releases: 0
- Metadata Files:
  - Readme: README.md
  - License: LICENSE
  - Citation: CITATION.cff
Owner metadata
- Name: AI2
- Login: allenai
- Email: [email protected]
- Kind: organization
- Description:
- Website: http://www.allenai.org
- Location: Seattle, WA
- Twitter:
- Company:
- Icon url: https://avatars.githubusercontent.com/u/5667695?v=4
- Repositories: 454
- Last synced at: 2024-04-14T22:06:46.803Z
- Profile URL: https://github.com/allenai
GitHub Events
Total
- Fork event: 21
- Create event: 63
- Issues event: 60
- Release event: 9
- Watch event: 272
- Delete event: 42
- Member event: 1
- Issue comment event: 87
- Public event: 1
- Push event: 480
- Pull request review event: 42
- Pull request review comment event: 30
- Pull request event: 121
Last Year
- Create event: 63
- Delete event: 42
- Fork event: 21
- Issue comment event: 87
- Issues event: 60
- Member event: 1
- Public event: 1
- Pull request event: 121
- Pull request review comment event: 30
- Pull request review event: 42
- Push event: 480
- Release event: 9
- Watch event: 272
Committers metadata
Last synced: over 1 year ago
Total Commits: 233
Total Committers: 13
Avg Commits per committer: 17.923
Development Distribution Score (DDS): 0.682
Commits in past year: 233
Committers in past year: 13
Avg Commits per committer in past year: 17.923
Development Distribution Score (DDS) in past year: 0.682
Name | Email | Commits |
---|---|---|
Luca Soldaini | l****a@s****t | 74 |
chris-ha458 | h****9@g****m | 69 |
Luca Soldaini | l****s@a****g | 64 |
kyleclo | k****o@u****u | 12 |
Niklas Muennighoff | n****f@g****m | 3 |
dependabot[bot] | 4****] | 3 |
Peter Bjørn Jørgensen | p****n@g****m | 2 |
Ben Bogin | b****9@g****m | 1 |
Dirk Groeneveld | d****g@a****g | 1 |
Ishan Anand | g****b@i****g | 1 |
Rodney Kinney | r****k@a****g | 1 |
Dustin Schwenk | d****k | 1 |
Ian Magnusson | 4****n | 1 |
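
The summary figures in this section can be cross-checked against the commit table. The sketch below is a minimal check, not the listing site's own code: it assumes the Development Distribution Score is defined as one minus the top committer's share of all commits. The page does not state the formula, but this definition reproduces the reported 0.682.

```python
# Commit counts copied from the committers table, one entry per row.
commits = [74, 69, 64, 12, 3, 3, 2, 1, 1, 1, 1, 1, 1]

total_commits = sum(commits)
total_committers = len(commits)
avg_commits = total_commits / total_committers

# Assumed definition (not stated on the page):
# DDS = 1 - (commits by the top committer / total commits)
dds = 1 - max(commits) / total_commits

print(total_commits, total_committers)  # totals reported as 233 and 13
print(round(avg_commits, 3))            # reported as 17.923
print(round(dds, 3))                    # reported as 0.682
```

Rows are keyed by email address, so Luca Soldaini appears twice and is counted as two committers. Merging those two rows would raise the top committer's share and lower the score under the same assumed formula.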
Committer domains:
- allenai.org: 3
- ishan.org: 1
- uw.edu: 1
- soldaini.net: 1
Issue and Pull Request metadata
Last synced: 12 months ago
Total issues: 87
Total pull requests: 135
Average time to close issues: 15 days
Average time to close pull requests: 8 days
Total issue authors: 26
Total pull request authors: 21
Average comments per issue: 1.62
Average comments per pull request: 0.47
Merged pull request: 118
Bot issues: 0
Bot pull requests: 9
Past year issues: 87
Past year pull requests: 135
Past year average time to close issues: 15 days
Past year average time to close pull requests: 8 days
Past year issue authors: 26
Past year pull request authors: 21
Past year average comments per issue: 1.62
Past year average comments per pull request: 0.47
Past year merged pull request: 118
Past year bot issues: 0
Past year bot pull requests: 9
Top Issue Authors
- hannahzacharski55 (38)
- soldni (11)
- peterbjorgensen (9)
- chris-ha458 (4)
- codefly13 (2)
- dustinwloring1988 (2)
- jtalmi (2)
- Vedaad-Shakib (1)
- TTTTao725 (1)
- tokenizer-decode (1)
- Tendo33 (1)
- suolyer (1)
- simonw (1)
- silverriver (1)
- RohitRathore1 (1)
Top Pull Request Authors
- soldni (78)
- chris-ha458 (10)
- dependabot[bot] (9)
- kyleclo (7)
- peterbjorgensen (6)
- Muennighoff (4)
- rodneykinney (3)
- IanMagnusson (3)
- ianand (2)
- Whattabatt (2)
- undfined (1)
- simonw (1)
- RohitRathore1 (1)
- drschwenk (1)
- jacob-morrison (1)
Top Issue Labels
- enhancement (9)
Top Pull Request Labels
- dependencies (9)
Package metadata
- Total packages: 1
- Total downloads:
  - pypi: 17,721 last month
- Total dependent packages: 0
- Total dependent repositories: 0
- Total versions: 17
- Total maintainers: 2
pypi.org: dolma
Data filters
- Homepage: https://github.com/allenai/dolma
- Documentation: https://dolma.readthedocs.io/
- Licenses: Apache-2.0
- Latest release: 1.0.3 (published about 1 year ago)
- Last Synced: 2024-05-22T16:57:42.693Z (12 months ago)
- Versions: 17
- Dependent Packages: 0
- Dependent Repositories: 0
- Downloads: 17,721 Last month
- Rankings:
  - Stargazers count: 4.879%
  - Downloads: 5.226%
  - Dependent packages count: 7.49%
  - Forks count: 10.613%
  - Average: 19.603%
  - Dependent repos count: 69.808%
- Maintainers (2)
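
The "Average" entry in the rankings appears to be the plain arithmetic mean of the other five percentile figures. This is an inference from the numbers, not something the page states; the short check below reproduces the reported value under that assumption.

```python
# Percentile rankings copied from the list above (the "Average" entry excluded).
rankings = {
    "stargazers_count": 4.879,
    "downloads": 5.226,
    "dependent_packages_count": 7.49,
    "forks_count": 10.613,
    "dependent_repos_count": 69.808,
}

# Assumption: "Average" is the unweighted mean of the five rankings.
average = sum(rankings.values()) / len(rankings)

print(round(average, 3))  # reported as 19.603
```

The mean is pulled upward almost entirely by the dependent repos ranking, since no dependent repositories are recorded for the package.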
Dependencies
- PyO3/maturin-action v1 composite
- actions-rs/toolchain v1 composite
- actions/checkout v1 composite
- actions/checkout v3 composite
- actions/download-artifact v3 composite
- actions/setup-python v2 composite
- actions/setup-python v4 composite
- actions/upload-artifact v3 composite
- 233 dependencies
- anyascii >=0.3.2
- blingfire ==0.1.8
- boto3 *
- cached-path ==1.3.4
- detect-secrets ==1.4.0
- fasttext-wheel ==0.9.2
- fsspec *
- msgspec >=0.14.2
- nltk ==3.8.1
- omegaconf >=2.3.0
- presidio_analyzer ==2.2.32
- pycld2 ==0.41
- pyyaml *
- requests *
- rich *
- s3fs *
- smart-open *
- tokenizers >=0.13.3,<1.0.0
- tqdm *
- uniseg *
- apache-beam *
- datasets *
- jsonlines *
- pyarrow *
Score: 19.058090976969105