https://github.com/allenai/dolma

Data and tools for generating and inspecting OLMo pre-training data.

Keywords

data-processing large-language-models llm machine-learning nlp

Keywords from Contributors

transformer language-model measur archiving mapper prompt generic optimize preprocessing animals

Last synced: 11 months ago

Repository metadata

Data and tools for generating and inspecting OLMo pre-training data.


Committers metadata

Last synced: over 1 year ago

Total Commits: 233
Total Committers: 13
Avg Commits per committer: 17.923
Development Distribution Score (DDS): 0.682

Commits in past year: 233
Committers in past year: 13
Avg Commits per committer in past year: 17.923
Development Distribution Score (DDS) in past year: 0.682

Name Email Commits
Luca Soldaini l****a@s****t 74
chris-ha458 h****9@g****m 69
Luca Soldaini l****s@a****g 64
kyleclo k****o@u****u 12
Niklas Muennighoff n****f@g****m 3
dependabot[bot] 4****] 3
Peter Bjørn Jørgensen p****n@g****m 2
Ben Bogin b****9@g****m 1
Dirk Groeneveld d****g@a****g 1
Ishan Anand g****b@i****g 1
Rodney Kinney r****k@a****g 1
Dustin Schwenk d****k 1
Ian Magnusson 4****n 1
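
The summary figures above can be reproduced from the per-committer commit counts in the table. The sketch below assumes the common definition of the Development Distribution Score, DDS = 1 - (commits by the top committer / total commits); that formula is not stated on this page, but it agrees with the published value. The variable names are illustrative only.

```python
# Commit counts per committer, copied row by row from the table above.
commits = [74, 69, 64, 12, 3, 3, 2, 1, 1, 1, 1, 1, 1]

total_commits = sum(commits)
total_committers = len(commits)
avg_commits = total_commits / total_committers

# Assumed definition: share of commits NOT made by the most active committer.
dds = 1 - max(commits) / total_commits

print(total_commits, total_committers)  # 233 13
print(round(avg_commits, 3))            # 17.923
print(round(dds, 3))                    # 0.682
```

Note that the table lists Luca Soldaini twice under different email addresses, and the published figures count those rows as separate committers. Under the same assumed formula, merging the two rows (138 commits) would give a DDS of about 0.408.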


Issue and Pull Request metadata

Last synced: 12 months ago

Total issues: 87
Total pull requests: 135
Average time to close issues: 15 days
Average time to close pull requests: 8 days
Total issue authors: 26
Total pull request authors: 21
Average comments per issue: 1.62
Average comments per pull request: 0.47
Merged pull requests: 118
Bot issues: 0
Bot pull requests: 9

Past year issues: 87
Past year pull requests: 135
Past year average time to close issues: 15 days
Past year average time to close pull requests: 8 days
Past year issue authors: 26
Past year pull request authors: 21
Past year average comments per issue: 1.62
Past year average comments per pull request: 0.47
Past year merged pull requests: 118
Past year bot issues: 0
Past year bot pull requests: 9

More stats: https://issues.ecosyste.ms/repositories/lookup?url=https://github.com/allenai/dolma

Top Issue Authors

  • hannahzacharski55 (38)
  • soldni (11)
  • peterbjorgensen (9)
  • chris-ha458 (4)
  • codefly13 (2)
  • dustinwloring1988 (2)
  • jtalmi (2)
  • Vedaad-Shakib (1)
  • TTTTao725 (1)
  • tokenizer-decode (1)
  • Tendo33 (1)
  • suolyer (1)
  • simonw (1)
  • silverriver (1)
  • RohitRathore1 (1)

Top Pull Request Authors

  • soldni (78)
  • chris-ha458 (10)
  • dependabot[bot] (9)
  • kyleclo (7)
  • peterbjorgensen (6)
  • Muennighoff (4)
  • rodneykinney (3)
  • IanMagnusson (3)
  • ianand (2)
  • Whattabatt (2)
  • undfined (1)
  • simonw (1)
  • RohitRathore1 (1)
  • drschwenk (1)
  • jacob-morrison (1)

Top Issue Labels

  • enhancement (9)

Top Pull Request Labels

  • dependencies (9)

Package metadata

pypi.org: dolma

Data filters

  • Homepage: https://github.com/allenai/dolma
  • Documentation: https://dolma.readthedocs.io/
  • Licenses: Apache-2.0
  • Latest release: 1.0.3 (published about 1 year ago)
  • Last Synced: 2024-05-22T16:57:42.693Z (12 months ago)
  • Versions: 17
  • Dependent Packages: 0
  • Dependent Repositories: 0
  • Downloads: 17,721 last month
  • Rankings:
    • Stargazers count: 4.879%
    • Downloads: 5.226%
    • Dependent packages count: 7.49%
    • Forks count: 10.613%
    • Average: 19.603%
    • Dependent repos count: 69.808%
  • Maintainers (2)
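
The "Average" ranking listed above appears to be the plain arithmetic mean of the five individual percentile rankings. A minimal sketch, assuming equal weighting (the page does not state how the average is computed) and using illustrative dictionary keys:

```python
# Percentile rankings copied from the list above.
rankings = {
    "stargazers_count": 4.879,
    "downloads": 5.226,
    "dependent_packages_count": 7.49,
    "forks_count": 10.613,
    "dependent_repos_count": 69.808,
}

# Assumption: unweighted mean across all five rankings.
average = sum(rankings.values()) / len(rankings)

print(round(average, 3))  # 19.603
```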

Dependencies

.github/workflows/CI.yml actions
  • PyO3/maturin-action v1 composite
  • actions-rs/toolchain v1 composite
  • actions/checkout v1 composite
  • actions/checkout v3 composite
  • actions/download-artifact v3 composite
  • actions/setup-python v2 composite
  • actions/setup-python v4 composite
  • actions/upload-artifact v3 composite
Cargo.lock cargo
  • 233 dependencies
pyproject.toml pypi
  • anyascii >=0.3.2
  • blingfire ==0.1.8
  • boto3 *
  • cached-path ==1.3.4
  • detect-secrets ==1.4.0
  • fasttext-wheel ==0.9.2
  • fsspec *
  • msgspec >=0.14.2
  • nltk ==3.8.1
  • omegaconf >=2.3.0
  • presidio_analyzer ==2.2.32
  • pycld2 ==0.41
  • pyyaml *
  • requests *
  • rich *
  • s3fs *
  • smart-open *
  • tokenizers >=0.13.3,<1.0.0
  • tqdm *
  • uniseg *
.github/workflows/ISSUE_TEMPLATE/bug_report.yml actions
.github/workflows/ISSUE_TEMPLATE/documentation.yml actions
.github/workflows/ISSUE_TEMPLATE/feature_request.yml actions
.github/workflows/ISSUE_TEMPLATE/question.yml actions
Cargo.toml cargo
sources/reddit/atomic_content_v3/requirements.txt pypi
  • apache-beam *
  • jsonlines *
sources/reddit/atomic_content_v3/setup.py pypi
  • jsonlines *
sources/reddit/atomic_content_v5/requirements.txt pypi
  • apache-beam *
  • jsonlines *
sources/reddit/atomic_content_v5/setup.py pypi
  • jsonlines *
sources/reddit/comment_threads_v1/requirements.txt pypi
  • apache-beam *
  • datasets *
  • jsonlines *
sources/reddit/comment_threads_v1/setup.py pypi
  • jsonlines *
sources/reddit/comment_threads_v2/requirements.txt pypi
  • apache-beam *
  • datasets *
  • jsonlines *
sources/reddit/comment_threads_v2/setup.py pypi
  • jsonlines *
sources/reddit/complete_threads_codelike_v4/requirements.txt pypi
  • apache-beam *
  • datasets *
  • jsonlines *
sources/reddit/complete_threads_codelike_v4/setup.py pypi
  • jsonlines *
sources/starcoder/requirements.txt pypi
  • pyarrow *

Score: 19.058090976969105