https://github.com/libertem/libertem-blobfinder

LiberTEM correlation and refinement library
data-processing electron-microscopy image-processing python
Added: about 1 year ago - Last Synced: 11 months ago - Created: February 05, 2020

  • Relevant topics? true
  • External users? true
  • Open source license? true
  • Active? true
  • Fork? false
  • Main Language: Jupyter Notebook
  • Commits: 85
  • Committers: 2
  • Issues: 29
  • Pull Requests: 68
  • Owner: LiberTEM
  • Stars: 5
  • Forks: 3
  • Packages: 1
  • Downloads: 217
https://github.com/alirezatheh/perke

A keyphrase extractor for Persian
data-mining data-processing information-retrieval keyphrase keyphrase-extraction keyphrase-extractor keyword keyword-extraction keyword-extractor machine-learning ml natural-language-processing nlp persian persian-language python text-mining text-processing unsupervised-learning
Added: over 1 year ago - Last Synced: 11 months ago - Created: February 03, 2020

  • Relevant topics? true
  • External users? true
  • Open source license? true
  • Active? true
  • Fork? false
  • Main Language: Python
  • Commits: 87
  • Committers: 4
  • Issues: 0
  • Pull Requests: 4
  • Owner: AlirezaTheH
  • Stars: 68
  • Forks: 7
  • Packages: 1
  • Downloads: 77
https://github.com/bbva/mercury-dataschema

Utility package that, given a Pandas DataFrame, uses the DataSchema class to auto-infer feature types and automatically calculate different statistics depending on those types.
analytics data data-cleaning data-processing data-science feature-engineering
Added: over 1 year ago - Last Synced: 11 months ago - Created: March 09, 2023

  • Relevant topics? true
  • External users? true
  • Open source license? true
  • Active? true
  • Fork? false
  • Main Language: Python
  • Commits: 11
  • Committers: 5
  • Issues: 0
  • Pull Requests: 2
  • Owner: BBVA
  • Stars: 11
  • Forks: 1
  • Packages: 1
  • Downloads: 199
https://github.com/asyml/forte

Forte is a flexible and powerful ML workflow builder. This is part of the CASL project: http://casl-project.ai/
data-processing deep-learning information-retrieval machine-learning natural-language natural-language-processing pipeline python text-data
Added: over 1 year ago - Last Synced: 11 months ago - Created: August 09, 2019

  • Relevant topics? true
  • External users? true
  • Open source license? true
  • Active? true
  • Fork? false
  • Main Language: Python
  • Commits: 1028
  • Committers: 53
  • Issues: 55
  • Pull Requests: 49
  • Owner: asyml
  • Stars: 236
  • Forks: 60
  • Packages: 1
  • Downloads: 173
https://github.com/plato-solutions/artifician

Artifician is an event-driven framework designed to simplify and accelerate the process of preparing datasets for Artificial Intelligence models.
artificial-intelligence data-processing data-processing-pipelines dataset-preparation machine-learning python
Added: over 1 year ago - Last Synced: 11 months ago - Created: August 15, 2022

  • Relevant topics? true
  • External users? true
  • Open source license? true
  • Active? true
  • Fork? false
  • Main Language: Python
  • Commits: 168
  • Committers: 4
  • Issues: 0
  • Pull Requests: 32
https://github.com/flow-php/etl-adapter-parquet

PHP ETL Adapter: Parquet
data-engineering data-processing etl flow-php parquet
Added: over 1 year ago - Last Synced: 11 months ago - Created: May 07, 2022

  • Relevant topics? true
  • External users? true
  • Open source license? true
  • Active? true
  • Fork? false
  • Main Language: PHP
  • Commits: 109
  • Committers: 3
  • Issues: 0
  • Pull Requests: 8
  • Owner: flow-php
  • Stars: 5
  • Forks: 0
  • Packages: 1
  • Downloads: 2,340
https://github.com/streamnative/pulsar-spark

Spark Connector to read and write with Pulsar
apache-pulsar apache-spark batch-processing data-processing data-science flink spark spark-sql stream-processing structured-streaming
Added: over 1 year ago - Last Synced: 11 months ago - Created: July 01, 2019

  • Relevant topics? true
  • External users? true
  • Open source license? true
  • Active? true
  • Fork? false
  • Main Language: Scala
  • Commits: 189
  • Committers: 22
  • Issues: 93
  • Pull Requests: 142
https://github.com/code-rhapsodie/ezdataflow-bundle

Import/export bundle for eZ Platform / Ibexa Content based on Code-Rhapsodie Dataflow
data-processing dataflow export ez-platform ez-publish ibexa ibexa-content ibexa-platform ibexadxp import portphp
Added: over 1 year ago - Last Synced: 11 months ago - Created: October 08, 2019

  • Relevant topics? true
  • External users? true
  • Open source license? true
  • Active? true
  • Fork? false
  • Main Language: PHP
  • Commits: 41
  • Committers: 6
  • Issues: 10
  • Pull Requests: 32
  • Owner: code-rhapsodie
  • Stars: 7
  • Forks: 2
  • Packages: 1
  • Downloads: 31,981
https://github.com/flow-php/etl-adapter-xml

PHP ETL Adapter: XML
data-engineering data-processing etl flow-php xml
Added: over 1 year ago - Last Synced: 11 months ago - Created: April 18, 2021

  • Relevant topics? true
  • External users? true
  • Open source license? true
  • Active? true
  • Fork? false
  • Main Language: PHP
  • Commits: 179
  • Committers: 4
  • Issues: 1
  • Pull Requests: 102
  • Owner: flow-php
  • Stars: 4
  • Forks: 2
  • Packages: 1
  • Downloads: 30,730
https://github.com/hstreamdb/hstream

HStreamDB is an open-source, cloud-native streaming database for IoT and beyond. Modernize your data stack for real-time applications.
data-processing database distributed-database distributed-systems financial-analysis haskell hstreamdb iot iot-database kafka materialized-view real-time realtime-database scale sql stream-processing streaming streaming-data streaming-database
Added: over 1 year ago - Last Synced: 11 months ago - Created: August 31, 2020

  • Relevant topics? true
  • External users? true
  • Open source license? true
  • Active? true
  • Fork? false
  • Main Language: Haskell
  • Commits: 1717
  • Committers: 22
  • Issues: 9
  • Pull Requests: 455
  • Owner: hstreamdb
  • Stars: 691
  • Forks: 56
  • Packages: 2
  • Downloads: 78
https://github.com/lithops-cloud/lithops

A multi-cloud framework for big data analytics and embarrassingly parallel jobs that provides a universal API for building parallel applications in the cloud ☁️🚀
big-data big-data-analytics cloud-computing data-processing distributed kubernetes multicloud multiprocessing object-storage parallel python serverless serverless-computing serverless-functions
Added: over 1 year ago - Last Synced: 11 months ago - Created: April 23, 2018

  • Relevant topics? true
  • External users? true
  • Open source license? true
  • Active? true
  • Fork? false
  • Main Language: Python
  • Commits: 2852
  • Committers: 52
  • Issues: 76
  • Pull Requests: 286
  • Owner: lithops-cloud
  • Stars: 289
  • Forks: 91
  • Packages: 1
  • Downloads: 2,672
https://github.com/neuro-ml/connectome

A library for datasets containing heterogeneous data
data-processing pipelines python
Added: over 1 year ago - Last Synced: 11 months ago - Created: June 29, 2020

  • Relevant topics? true
  • External users? true
  • Open source license? true
  • Active? true
  • Fork? false
  • Main Language: Python
  • Commits: 474
  • Committers: 11
  • Issues: 61
  • Pull Requests: 40
  • Owner: neuro-ml
  • Stars: 12
  • Forks: 1
  • Packages: 1
  • Downloads: 246
https://github.com/bytewax/bytewax

Python Stream Processing
data-engineering data-processing data-science dataflow machine-learning python rust stream-processing streaming-data
Added: over 1 year ago - Last Synced: 11 months ago - Created: February 04, 2022

  • Relevant topics? true
  • External users? true
  • Open source license? true
  • Active? true
  • Fork? false
  • Main Language: Python
  • Commits: 1487
  • Committers: 22
  • Issues: 66
  • Pull Requests: 263
  • Owner: bytewax
  • Stars: 1257
  • Forks: 56
  • Packages: 2
  • Downloads: 6,485
https://github.com/asyml/texar

Toolkit for Machine Learning, Natural Language Processing, and Text Generation, in TensorFlow. This is part of the CASL project: http://casl-project.ai/
bert casl-project data-processing deep-learning dialog-systems gpt-2 machine-learning machine-translation natural-language-processing python tensorflow texar text-data text-generation xlnet
Added: over 1 year ago - Last Synced: 11 months ago - Created: July 22, 2017

  • Relevant topics? true
  • External users? true
  • Open source license? true
  • Active? true
  • Fork? false
  • Main Language: Python
  • Commits: 1384
  • Committers: 40
  • Issues: 53
  • Pull Requests: 48
  • Owner: asyml
  • Stars: 2381
  • Forks: 371
  • Packages: 2
  • Downloads: 52
https://github.com/tomwright/dasel

Select, put and delete data from JSON, TOML, YAML, XML and CSV files with a single tool. Supports conversion between formats and can be used as a Go package.
cli config configuration data-processing data-structures data-wrangling devops-tools go golang json json-processing parser query selector toml update xml yaml yaml-processor
Added: over 1 year ago - Last Synced: 11 months ago - Created: September 22, 2020

  • Relevant topics? true
  • External users? true
  • Open source license? true
  • Active? true
  • Fork? false
  • Main Language: Go
  • Commits: 582
  • Committers: 26
  • Issues: 91
  • Pull Requests: 90
  • Owner: TomWright
  • Stars: 4877
  • Forks: 112
  • Packages: 7
  • Downloads: 285
https://github.com/vortex-exoplanet/vip

VIP is a Python package/library for angular, reference star and spectral differential imaging for exoplanet/disk detection through high-contrast imaging.
data-processing extrasolar-planets-disks high-contrast-imaging image-processing low-rank-approximation mcmc pca
Added: over 1 year ago - Last Synced: 11 months ago - Created: May 24, 2015

  • Relevant topics? true
  • External users? true
  • Open source license? true
  • Active? true
  • Fork? false
  • Main Language: Python
  • Commits: 2129
  • Committers: 51
  • Issues: 9
  • Pull Requests: 128
https://github.com/luckylittle/blinkist-m4a-downloader

Grabs all of the audio files from all of the Blinkist books
audiobooks blinkist books crawler data-archiving data-mining data-processing go golang scraper spider
Added: over 1 year ago - Last Synced: 11 months ago - Created: January 08, 2019

  • Relevant topics? true
  • External users? true
  • Open source license? true
  • Active? true
  • Fork? false
  • Main Language: Go
  • Commits: 16
  • Committers: 5
  • Issues: 14
  • Pull Requests: 4
https://github.com/flow-php/etl

PHP - ETL (Extract Transform Load) data processing library
data-engineering data-processing etl flow-php
Added: over 1 year ago - Last Synced: 11 months ago - Created: October 26, 2020

  • Relevant topics? true
  • External users? true
  • Open source license? true
  • Active? true
  • Fork? false
  • Main Language: PHP
  • Commits: 818
  • Committers: 12
  • Issues: 8
  • Pull Requests: 93
  • Owner: flow-php
  • Stars: 341
  • Forks: 21
  • Packages: 1
  • Downloads: 267,060
https://github.com/dashbitco/broadway

Concurrent and multi-stage data ingestion and data processing with Elixir
broadway concurrent data-ingestion data-processing elixir genstage
Added: over 1 year ago - Last Synced: 11 months ago - Created: November 05, 2018

  • Relevant topics? true
  • External users? true
  • Open source license? true
  • Active? true
  • Fork? false
  • Main Language: Elixir
  • Commits: 389
  • Committers: 82
  • Issues: 47
  • Pull Requests: 63
  • Owner: dashbitco
  • Stars: 2313
  • Forks: 152
  • Packages: 3
  • Downloads: 6,718,579
https://github.com/ml6team/fondant

Production-ready data processing made easy and shareable
data-processing fine-tuning foundation-models machine-learning pipeline python
Added: over 1 year ago - Last Synced: 11 months ago - Created: March 02, 2023

  • Relevant topics? true
  • External users? true
  • Open source license? true
  • Active? true
  • Fork? false
  • Main Language: Python
  • Commits: 529
  • Committers: 24
  • Issues: 151
  • Pull Requests: 200
  • Owner: ml6team
  • Stars: 316
  • Forks: 24
  • Packages: 1
  • Downloads: 704
https://github.com/giacbrd/smartpipeline

A framework for rapid development of robust data pipelines following a simple design pattern
data-analysis data-analytics data-mining data-pipelines data-processing data-science dataops design-patterns etl machine-learning mlops pipeline pipeline-framework pipelines reproducibility task-queue workflow
Added: over 1 year ago - Last Synced: 11 months ago - Created: September 03, 2018

  • Relevant topics? true
  • External users? true
  • Open source license? true
  • Active? true
  • Fork? false
  • Main Language: Python
  • Commits: 275
  • Committers: 3
  • Issues: 0
  • Pull Requests: 3
  • Owner: giacbrd
  • Stars: 22
  • Forks: 2
  • Packages: 1
  • Downloads: 56
https://github.com/yord/pxi

🧚 pxi (pixie) is a small, fast, and magical command-line data processor similar to jq, mlr, and awk.
csv data-processing deserializer dsv json marshaller parser pixie pxi serializer ssv tsv
Added: over 1 year ago - Last Synced: 11 months ago - Created: November 27, 2019

  • Relevant topics? true
  • External users? true
  • Open source license? true
  • Active? true
  • Fork? false
  • Main Language: JavaScript
  • Commits: 404
  • Committers: 1
  • Issues: 7
  • Pull Requests: 1
  • Owner: Yord
  • Stars: 266
  • Forks: 3
  • Packages: 1
  • Downloads: 66
https://github.com/lherman-cs/go-rosbag

Rosbag parser written in pure Go
analytics cli cloud data-processing decoder parser robotics ros rosbag
Added: over 1 year ago - Last Synced: 11 months ago - Created: December 25, 2020

  • Relevant topics? true
  • External users? true
  • Open source license? true
  • Active? true
  • Fork? false
  • Main Language: Go
  • Commits: 70
  • Committers: 1
  • Issues: 2
  • Pull Requests: 1
  • Owner: lherman-cs
  • Stars: 17
  • Forks: 1
  • Packages: 1
https://github.com/changchunhe/pyvaspflow

vasp calculation flow
data-processing defect-formation-energy vasp
Added: over 1 year ago - Last Synced: 11 months ago - Created: March 08, 2019

  • Relevant topics? true
  • External users? true
  • Open source license? true
  • Active? true
  • Fork? false
  • Main Language: Python
  • Commits: 539
  • Committers: 4
  • Issues: 0
  • Pull Requests: 20
  • Owner: ChangChunHe
  • Stars: 17
  • Forks: 18
  • Packages: 1
  • Downloads: 9
https://github.com/zazuko/barnard59

An intuitive and flexible RDF pipeline solution designed to simplify and automate ETL processes for efficient data management.
data-integration data-pipeline data-processing etl json-ld linked-data pipeline rdf semantic-web
Added: over 1 year ago - Last Synced: 11 months ago - Created: October 25, 2018

  • Relevant topics? true
  • External users? true
  • Open source license? true
  • Active? true
  • Fork? false
  • Main Language: JavaScript
  • Commits: 1159
  • Committers: 19
  • Issues: 147
  • Pull Requests: 282
  • Owner: zazuko
  • Stars: 21
  • Forks: 2
  • Packages: 14
  • Downloads: 100,567
https://github.com/svenkreiss/pysparkling

A pure Python implementation of Apache Spark's RDD and DStream interfaces.
apache-spark data-processing data-science python
Added: over 1 year ago - Last Synced: 11 months ago - Created: May 09, 2015

  • Relevant topics? true
  • External users? true
  • Open source license? true
  • Active? true
  • Fork? false
  • Main Language: Python
  • Commits: 1454
  • Committers: 10
  • Issues: 20
  • Pull Requests: 80
  • Owner: svenkreiss
  • Stars: 260
  • Forks: 44
  • Packages: 2
  • Downloads: 11,565
https://github.com/asyml/fortehealth

The project is in the incubation stage and still under development. ForteHealth is a flexible and powerful ML workflow builder for biomedical and clinical scenarios. This is part of the CASL project: http://casl-project.ai/
biomedical-named-entity-recognition clinical-nlp clinical-text-processing data-processing deep-learning information-retrieval machine-learning natural-language natural-language-processing python
Added: over 1 year ago - Last Synced: 11 months ago - Created: February 04, 2022

  • Relevant topics? true
  • External users? true
  • Open source license? true
  • Active? true
  • Fork? false
  • Main Language: Python
  • Commits: 383
  • Committers: 10
  • Issues: 48
  • Pull Requests: 43
  • Owner: asyml
  • Stars: 10
  • Forks: 5
  • Packages: 1
  • Downloads: 13
https://github.com/AtomGraph/Processor

Ontology-driven Linked Data processor and server for SPARQL backends. Apache License.
appengine crud data-driven data-processing declarative docker-image framework generic hypermedia knowledge-graph ldt linked-data linked-data-templates ontology-driven-development rdf rest semantic-web server sparql
Added: over 1 year ago - Last Synced: 11 months ago - Created: April 06, 2015

  • Relevant topics? true
  • External users? true
  • Open source license? true
  • Active? true
  • Fork? false
  • Main Language: Java
  • Commits: 852
  • Committers: 3
  • Issues: 29
  • Pull Requests: 1
  • Owner: AtomGraph
  • Stars: 58
  • Forks: 6
  • Packages: 1
https://github.com/apache/incubator-wayang

Apache Wayang (incubating) is the first cross-platform data processing system.
apache big-data cross-platform data-management-platform data-processing distributed-system hadoop java jdbc middleware open-source performance scala spark
Added: over 1 year ago - Last Synced: 11 months ago - Created: December 16, 2020

  • Relevant topics? true
  • External users? true
  • Open source license? true
  • Active? true
  • Fork? false
  • Main Language: Java
  • Commits: 1685
  • Committers: 45
  • Issues: 491
  • Pull Requests: 454
  • Owner: apache
  • Stars: 170
  • Forks: 71
  • Packages: 38
https://github.com/kfultz07/go-dataframe

A simple package to abstract away the process of creating usable DataFrames for data analytics. This package is heavily inspired by the amazing Python library, Pandas.
data-analysis data-analytics data-processing data-science dataframe go golang pandas
Added: over 1 year ago - Last Synced: 11 months ago - Created: January 03, 2022

  • Relevant topics? true
  • External users? true
  • Open source license? true
  • Active? true
  • Fork? false
  • Main Language: Go
  • Commits: 130
  • Committers: 2
  • Issues: 0
  • Pull Requests: 2
  • Owner: kfultz07
  • Stars: 71
  • Forks: 6
  • Packages: 1
https://github.com/flow-php/etl-adapter-json

PHP ETL Adapter: JSON
data-engineering data-processing etl flow-php json
Added: over 1 year ago - Last Synced: 11 months ago - Created: April 10, 2021

  • Relevant topics? true
  • External users? true
  • Open source license? true
  • Active? true
  • Fork? false
  • Main Language: PHP
  • Commits: 214
  • Committers: 4
  • Issues: 1
  • Pull Requests: 99
  • Owner: flow-php
  • Stars: 5
  • Forks: 3
  • Packages: 1
  • Downloads: 150,417
https://github.com/skuschel/generatorpipeline

Parallelize your data-processing pipelines with just a decorator.
data-processing data-science hacktoberfest python
Added: over 1 year ago - Last Synced: 11 months ago - Created: November 21, 2019

  • Relevant topics? true
  • External users? true
  • Open source license? true
  • Active? true
  • Fork? false
  • Main Language: Python
  • Commits: 129
  • Committers: 5
  • Issues: 21
  • Pull Requests: 25
  • Owner: skuschel
  • Stars: 2
  • Forks: 3
  • Packages: 1
  • Downloads: 8
https://github.com/flow-php/etl-adapter-doctrine

PHP ETL Adapter: Doctrine
data-engineering data-processing dbal doctrine elt flow-php
Added: over 1 year ago - Last Synced: 11 months ago - Created: March 22, 2021

  • Relevant topics? true
  • External users? true
  • Open source license? true
  • Active? true
  • Fork? false
  • Main Language: PHP
  • Commits: 201
  • Committers: 6
  • Issues: 1
  • Pull Requests: 99
  • Owner: flow-php
  • Stars: 3
  • Forks: 2
  • Packages: 1
  • Downloads: 123,092
https://github.com/flow-php/etl-adapter-logger

PHP ETL Adapter: Logger
data-engineering data-processing etl flow-php logger monolog
Added: over 1 year ago - Last Synced: 11 months ago - Created: May 17, 2021

  • Relevant topics? true
  • External users? true
  • Open source license? true
  • Active? true
  • Fork? false
  • Main Language: PHP
  • Commits: 124
  • Committers: 4
  • Issues: 1
  • Pull Requests: 99
  • Owner: flow-php
  • Stars: 3
  • Forks: 1
  • Packages: 1
  • Downloads: 114,312
https://github.com/flow-php/etl-adapter-csv

PHP ETL Adapter: CSV
csv data-engineering data-processing etl flow-php
Added: over 1 year ago - Last Synced: 11 months ago - Created: March 27, 2021

  • Relevant topics? true
  • External users? true
  • Open source license? true
  • Active? true
  • Fork? false
  • Main Language: PHP
  • Commits: 242
  • Committers: 5
  • Issues: 1
  • Pull Requests: 99
  • Owner: flow-php
  • Stars: 4
  • Forks: 2
  • Packages: 1
  • Downloads: 123,767
https://github.com/nvidia/dali

A GPU-accelerated library containing highly optimized building blocks and an execution engine for data processing to accelerate deep learning training and inference applications.
audio-processing data-augmentation data-processing deep-learning fast-data-pipeline gpu gpu-tensorflow image-augmentation image-processing machine-learning mxnet neural-network paddle python pytorch
Added: over 1 year ago - Last Synced: 11 months ago - Created: June 01, 2018

  • Relevant topics? true
  • External users? true
  • Open source license? true
  • Active? true
  • Fork? false
  • Main Language: C++
  • Commits: 3570
  • Committers: 107
  • Issues: 428
  • Pull Requests: 939
  • Owner: NVIDIA
  • Stars: 4939
  • Forks: 605
  • Packages: 12
  • Downloads: 30,223
https://github.com/numaproj/numaflow

Kubernetes-native platform to run massively parallel data/streaming jobs
data-processing hacktoberfest k8s kubernetes map-reduce pipeline stream-processing
Added: over 1 year ago - Last Synced: 11 months ago - Created: May 20, 2022

  • Relevant topics? true
  • External users? true
  • Open source license? true
  • Active? true
  • Fork? false
  • Main Language: Go
  • Commits: 917
  • Committers: 65
  • Issues: 220
  • Pull Requests: 228
  • Owner: numaproj
  • Stars: 913
  • Forks: 88
  • Packages: 4
  • Downloads: 64
https://github.com/PatrickTourniaire/ror

Python library that provides simple interfaces to programmatically create pipelines for data processing and ML, with good separation of concerns.
data-processing machine-learning pipelines
Added: over 1 year ago - Last Synced: 11 months ago - Created: June 12, 2023

  • Relevant topics? true
  • External users? true
  • Open source license? true
  • Active? true
  • Fork? false
  • Main Language: Python
  • Commits: 60
  • Committers: 3
  • Issues: 20
  • Pull Requests: 9
https://github.com/piotr-ku/yaml-runner-go

YAML Runner Go is an application that executes commands based on the rules defined in a YAML file. It provides the flexibility to run commands either once or as a daemon at specific intervals.
administration-tools alerting continous-delivery continous-deployment continuous-integration data-processing go golang monitoring-automation runner scheduled-tasks system-administration yaml
Added: over 1 year ago - Last Synced: 11 months ago - Created: June 13, 2023

  • Relevant topics? true
  • External users? true
  • Open source license? true
  • Active? true
  • Fork? false
  • Main Language: Go
  • Commits: 35
  • Committers: 2
  • Issues: 0
  • Pull Requests: 19
  • Owner: piotr-ku
  • Stars: 2
  • Forks: 0
  • Packages: 1
https://github.com/roboto-ai/robologs-ros-actions

A collection of actions for working with ROS data
data-analysis data-processing data-transformation mcap robotics ros ros2 rosbag sensors
Added: over 1 year ago - Last Synced: 11 months ago - Created: June 20, 2023

  • Relevant topics? true
  • External users? true
  • Open source license? true
  • Active? true
  • Fork? false
  • Main Language: Shell
  • Commits: 41
  • Committers: 3
  • Issues: 0
  • Pull Requests: 1
  • Owner: roboto-ai
  • Stars: 9
  • Forks: 2
  • Packages: 0
https://github.com/allenai/dolma

Data and tools for generating and inspecting OLMo pre-training data.
data-processing large-language-models llm machile-learning nlp
Added: over 1 year ago - Last Synced: 11 months ago - Created: June 20, 2023

  • Relevant topics? true
  • External users? true
  • Open source license? true
  • Active? true
  • Fork? false
  • Main Language: Python
  • Commits: 233
  • Committers: 13
  • Issues: 87
  • Pull Requests: 135
  • Owner: allenai
  • Stars: 800
  • Forks: 78
  • Packages: 1
  • Downloads: 17,721
https://github.com/crate-workbench/cratedb-toolkit

CrateDB Toolkit.
cratedb cratedb-client cratedb-driver data-expiration data-processing data-retention database-adapter expiration materialized-view materialized-views olap olap-database retention retention-policies retention-policy sqlalchemy toolkit
Added: over 1 year ago - Last Synced: 11 months ago - Created: June 27, 2023

  • Relevant topics? true
  • External users? true
  • Open source license? true
  • Active? true
  • Fork? false
  • Main Language: Python
  • Commits: 162
  • Committers: 4
  • Issues: 27
  • Pull Requests: 220
https://github.com/stevapple/elasticsearch-utils

Asynchronous data processing and import/export for Elasticsearch, written in Python.
data-analysis data-processing elasticsearch python
Added: over 1 year ago - Last Synced: 11 months ago - Created: July 13, 2023

  • Relevant topics? true
  • External users? true
  • Open source license? true
  • Active? true
  • Fork? false
  • Main Language: Python
  • Commits: 9
  • Committers: 1
  • Issues: 6
  • Pull Requests: 8
  • Owner: stevapple
  • Stars: 0
  • Forks: 0
  • Packages: 0
https://github.com/basemax/dataprocessingapachekafka

Welcome to the Real-time Data Processing using Apache Kafka project! In this project, we will explore the capabilities of Apache Kafka, a powerful and distributed streaming platform, to build a real-time data processing system. The project is aimed at beginners as well as readers with some Kafka experience.
apache-kafka api data-processing go go-data go-data-processing golang kafka nats restful
Added: over 1 year ago - Last Synced: 11 months ago - Created: August 16, 2023

  • Relevant topics? true
  • External users? true
  • Open source license? true
  • Active? true
  • Fork? false
  • Main Language: Go
  • Commits: 3
  • Committers: 3
  • Issues: 0
  • Pull Requests: 5
  • Owner: BaseMax
  • Stars: 1
  • Forks: 0
  • Packages: 0
https://github.com/kkrusere/nhanes-pytool-api

The NHANES Data 'API' is a Python tool that simplifies access to the National Health and Nutrition Examination Survey (NHANES) dataset. This project provides an easy-to-use API to retrieve NHANES data, helping researchers, data scientists, health professionals, and other stakeholders access these valuable datasets.
data-engineering-pipeline data-mining data-processing health-data health-data-analysis health-data-science nhanes
Added: over 1 year ago - Last Synced: 11 months ago - Created: October 13, 2023

  • Relevant topics? true
  • External users? true
  • Open source license? true
  • Active? true
  • Fork? false
  • Main Language: Python
  • Commits: 98
  • Committers: 2
  • Issues: 1
  • Pull Requests: 1
  • Owner: kkrusere
  • Stars: 1
  • Forks: 2
  • Packages: 1
  • Downloads: 30
https://github.com/getstrm/pace

Data policy IN, dynamic view OUT: PACE is the Policy As Code Engine. It helps you to programmatically create and apply a data policy to a processing platform like Databricks, Snowflake or BigQuery, with definitions imported from Collibra, Datahub, ODD and the like.
bigquery data-catalog data-contracts data-governance data-processing databricks policy-enforcement snowflake
Added: over 1 year ago - Last Synced: 11 months ago - Created: October 18, 2023

  • Relevant topics? true
  • External users? true
  • Open source license? true
  • Active? true
  • Fork? false
  • Main Language: Kotlin
  • Commits: 575
  • Committers: 15
  • Issues: 57
  • Pull Requests: 172
  • Owner: getstrm
  • Stars: 19
  • Forks: 0
  • Packages: 1
https://github.com/alexliap/roll_rate_analysis

Roll Rate Analysis Python package. Both month-over-month and snapshot roll rate functionality is supported. It uses the Polars library for optimization and speed.
credit-risk data-processing polars roll-rate-analysis
Added: over 1 year ago - Last Synced: 11 months ago - Created: November 14, 2023

  • Relevant topics? true
  • External users? true
  • Open source license? true
  • Active? true
  • Fork? false
  • Main Language: Python
  • Commits: 116
  • Committers: 4
  • Issues: 0
  • Pull Requests: 10
  • Owner: alexliap
  • Stars: 3
  • Forks: 0
  • Packages: 1
  • Downloads: 28
https://github.com/n0rdy/pippin

Go library to create and manage data pipelines on your machine
async asynchronous data data-engineering data-pipeline data-processing go golang golang-library golang-package goroutines pipeline
Added: over 1 year ago - Last Synced: 11 months ago - Created: November 18, 2023

  • Relevant topics? true
  • External users? true
  • Open source license? true
  • Active? true
  • Fork? false
  • Main Language: Go
  • Commits: 20
  • Committers: 3
  • Issues: 0
  • Pull Requests: 5
  • Owner: n0rdy
  • Stars: 14
  • Forks: 0
  • Packages: 1
https://github.com/wq/itertable

⇔ IterTable is a Pythonic API for iterating through tabular data formats, including CSV, XLSX, XML, and JSON.
csv data-processing excel export import iterable json openpyxl pandas pythonic spreadsheet tabular-data xml
Added: over 1 year ago - Last Synced: 11 months ago - Created: August 22, 2012

  • Relevant topics? true
  • External users? true
  • Open source license? true
  • Active? true
  • Fork? false
  • Main Language: Python
  • Commits: 200
  • Committers: 1
  • Issues: 12
  • Pull Requests: 0
  • Owner: wq
  • Stars: 51
  • Forks: 13
  • Packages: 1
  • Downloads: 2,661
https://github.com/helmholtz-analytics/heat/

Distributed tensors and Machine Learning framework with GPU and MPI acceleration in Python
array-api data-analytics data-processing data-science distributed gpu hpc machine-learning massive-datasets mpi mpi4py multi-gpu multi-node-cluster numpy parallelism python pytorch tensors
Added: over 1 year ago - Last Synced: 11 months ago - Created: May 17, 2018

  • Relevant topics? true
  • External users? true
  • Open source license? true
  • Active? true
  • Fork? false
  • Main Language: Python
  • Commits: 4590
  • Committers: 62
  • Issues: 218
  • Pull Requests: 354
https://github.com/guancecloud/platypus

Platypus is a programming language for Observability Data Pipeline
data-processing dsl go observability
Added: over 1 year ago - Last Synced: 11 months ago - Created: September 20, 2022

  • Relevant topics? true
  • External users? true
  • Open source license? true
  • Active? true
  • Fork? false
  • Main Language: Go
  • Commits: 43
  • Committers: 5
  • Issues: 17
  • Pull Requests: 32
https://github.com/johnkerl/miller

Miller is like awk, sed, cut, join, and sort for name-indexed data such as CSV, TSV, and tabular JSON
command-line command-line-tools csv csv-format data-cleaning data-processing data-reduction data-regression devops devops-tools json json-data miller statistical-analysis statistics streaming-algorithms streaming-data tabular-data tsv unix-toolkit
Added: over 1 year ago - Last Synced: 11 months ago - Created: May 03, 2015

  • Relevant topics? true
  • External users? true
  • Open source license? true
  • Active? true
  • Fork? false
  • Main Language: Go
  • Commits: 8374
  • Committers: 60
  • Issues: 217
  • Pull Requests: 217
  • Owner: johnkerl
  • Stars: 8658
  • Forks: 204
  • Packages: 5
  • Downloads: 227