https://github.com/libertem/libertem-blobfinder
LiberTEM correlation and refinement library
data-processing
electron-microscopy
image-processing
python
Added: about 1 year ago - Last Synced: 11 months ago
- Created: February 05, 2020
- Relevant topics? true
- External users? true
- Open source license? true
- Active? true
- Fork? false
- Main Language: Jupyter Notebook
- Commits: 85
- Committers: 2
- Issues: 29
- Pull Requests: 68
- Owner: LiberTEM
- Stars: 5
- Forks: 3
- Packages: 1
- Downloads: 217

https://github.com/alirezatheh/perke
A keyphrase extractor for Persian
data-mining
data-processing
information-retrieval
keyphrase
keyphrase-extraction
keyphrase-extractor
keyword
keyword-extraction
keyword-extractor
machine-learning
ml
natural-language-processing
nlp
persian
persian-language
python
text-mining
text-processing
unsupervised-learning
Added: over 1 year ago - Last Synced: 11 months ago
- Created: February 03, 2020
- Relevant topics? true
- External users? true
- Open source license? true
- Active? true
- Fork? false
- Main Language: Python
- Commits: 87
- Committers: 4
- Issues: 0
- Pull Requests: 4
- Owner: AlirezaTheH
- Stars: 68
- Forks: 7
- Packages: 1
- Downloads: 77

https://github.com/bbva/mercury-dataschema
Utility package that, given a Pandas DataFrame, uses the DataSchema class to auto-infer feature types and automatically calculate different statistics depending on those types.
analytics
data
data-cleaning
data-processing
data-science
feature-engineering
Added: over 1 year ago - Last Synced: 11 months ago
- Created: March 09, 2023

https://github.com/asyml/forte
Forte is a flexible and powerful ML workflow builder. This is part of the CASL project: http://casl-project.ai/
data-processing
deep-learning
information-retrieval
machine-learning
natural-language
natural-language-processing
pipeline
python
text-data
Added: over 1 year ago - Last Synced: 11 months ago
- Created: August 09, 2019

https://github.com/plato-solutions/artifician
Artifician is an event-driven framework designed to simplify and accelerate the process of preparing datasets for Artificial Intelligence models.
artificial-intelligence
data-processing
data-processing-pipelines
dataset-preparation
machine-learning
python
Added: over 1 year ago - Last Synced: 11 months ago
- Created: August 15, 2022
- Relevant topics? true
- External users? true
- Open source license? true
- Active? true
- Fork? false
- Main Language: Python
- Commits: 168
- Committers: 4
- Issues: 0
- Pull Requests: 32
- Owner: Plato-solutions
- Stars: 10
- Forks: 0
- Packages: 1
- Downloads: 77

https://github.com/flow-php/etl-adapter-parquet
PHP ETL Adapter: Parquet
data-engineering
data-processing
etl
flow-php
parquet
Added: over 1 year ago - Last Synced: 11 months ago
- Created: May 07, 2022

https://github.com/streamnative/pulsar-spark
Spark Connector to read and write with Pulsar
apache-pulsar
apache-spark
batch-processing
data-processing
data-science
flink
spark
spark-sql
stream-processing
structured-streaming
Added: over 1 year ago - Last Synced: 11 months ago
- Created: July 01, 2019
- Relevant topics? true
- External users? true
- Open source license? true
- Active? true
- Fork? false
- Main Language: Scala
- Commits: 189
- Committers: 22
- Issues: 93
- Pull Requests: 142
- Owner: streamnative
- Stars: 109
- Forks: 48
- Packages: 2

https://github.com/code-rhapsodie/ezdataflow-bundle
Import/export bundle for eZ Platform / Ibexa Content based on Code-Rhapsodie Dataflow
data-processing
dataflow
export
ez-platform
ez-publish
ibexa
ibexa-content
ibexa-platform
ibexadxp
import
portphp
Added: over 1 year ago - Last Synced: 11 months ago
- Created: October 08, 2019
- Relevant topics? true
- External users? true
- Open source license? true
- Active? true
- Fork? false
- Main Language: PHP
- Commits: 41
- Committers: 6
- Issues: 10
- Pull Requests: 32
- Owner: code-rhapsodie
- Stars: 7
- Forks: 2
- Packages: 1
- Downloads: 31,981

https://github.com/flow-php/etl-adapter-xml
PHP ETL Adapter: XML
data-engineering
data-processing
etl
flow-php
xml
Added: over 1 year ago - Last Synced: 11 months ago
- Created: April 18, 2021

https://github.com/hstreamdb/hstream
HStreamDB is an open-source, cloud-native streaming database for IoT and beyond. Modernize your data stack for real-time applications.
data-processing
database
distributed-database
distributed-systems
financial-analysis
haskell
hstreamdb
iot
iot-database
kafka
materialized-view
real-time
realtime-database
scale
sql
stream-processing
streaming
streaming-data
streaming-database
Added: over 1 year ago - Last Synced: 11 months ago
- Created: August 31, 2020

https://github.com/lithops-cloud/lithops
A multi-cloud framework for big data analytics and embarrassingly parallel jobs that provides a universal API for building parallel applications in the cloud ☁️🚀
big-data
big-data-analytics
cloud-computing
data-processing
distributed
kubernetes
multicloud
multiprocessing
object-storage
parallel
python
serverless
serverless-computing
serverless-functions
Added: over 1 year ago - Last Synced: 11 months ago
- Created: April 23, 2018
- Relevant topics? true
- External users? true
- Open source license? true
- Active? true
- Fork? false
- Main Language: Python
- Commits: 2852
- Committers: 52
- Issues: 76
- Pull Requests: 286
- Owner: lithops-cloud
- Stars: 289
- Forks: 91
- Packages: 1
- Downloads: 2,672

https://github.com/neuro-ml/connectome
A library for datasets containing heterogeneous data
data-processing
pipelines
python
Added: over 1 year ago - Last Synced: 11 months ago
- Created: June 29, 2020

https://github.com/bytewax/bytewax
Python Stream Processing
data-engineering
data-processing
data-science
dataflow
machine-learning
python
rust
stream-processing
streaming-data
Added: over 1 year ago - Last Synced: 11 months ago
- Created: February 04, 2022

https://github.com/asyml/texar
Toolkit for Machine Learning, Natural Language Processing, and Text Generation, in TensorFlow. This is part of the CASL project: http://casl-project.ai/
bert
casl-project
data-processing
deep-learning
dialog-systems
gpt-2
machine-learning
machine-translation
natural-language-processing
python
tensorflow
texar
text-data
text-generation
xlnet
Added: over 1 year ago - Last Synced: 11 months ago
- Created: July 22, 2017

https://github.com/tomwright/dasel
Select, put and delete data from JSON, TOML, YAML, XML and CSV files with a single tool. Supports conversion between formats and can be used as a Go package.
cli
config
configuration
data-processing
data-structures
data-wrangling
devops-tools
go
golang
json
json-processing
parser
query
selector
toml
update
xml
yaml
yaml-processor
Added: over 1 year ago - Last Synced: 11 months ago
- Created: September 22, 2020

https://github.com/vortex-exoplanet/vip
VIP is a Python package/library for angular, reference star and spectral differential imaging for exoplanet/disk detection through high-contrast imaging.
data-processing
extrasolar-planets-disks
high-contrast-imaging
image-processing
low-rank-approximation
mcmc
pca
Added: over 1 year ago - Last Synced: 11 months ago
- Created: May 24, 2015
- Relevant topics? true
- External users? true
- Open source license? true
- Active? true
- Fork? false
- Main Language: Python
- Commits: 2129
- Committers: 51
- Issues: 9
- Pull Requests: 128
- Owner: vortex-exoplanet
- Stars: 68
- Forks: 56
- Packages: 1
- Downloads: 549

https://github.com/luckylittle/blinkist-m4a-downloader
Grabs all of the audio files from all of the Blinkist books
audiobooks
blinkist
books
crawler
data-archiving
data-mining
data-processing
go
golang
scraper
spider
Added: over 1 year ago - Last Synced: 11 months ago
- Created: January 08, 2019
- Relevant topics? true
- External users? true
- Open source license? true
- Active? true
- Fork? false
- Main Language: Go
- Commits: 16
- Committers: 5
- Issues: 14
- Pull Requests: 4
- Owner: luckylittle
- Stars: 121
- Forks: 22
- Packages: 1

https://github.com/flow-php/etl
PHP - ETL (Extract Transform Load) data processing library
data-engineering
data-processing
etl
flow-php
Added: over 1 year ago - Last Synced: 11 months ago
- Created: October 26, 2020

https://github.com/dashbitco/broadway
Concurrent and multi-stage data ingestion and data processing with Elixir
broadway
concurrent
data-ingestion
data-processing
elixir
genstage
Added: over 1 year ago - Last Synced: 11 months ago
- Created: November 05, 2018

https://github.com/ml6team/fondant
Production-ready data processing made easy and shareable
data-processing
fine-tuning
foundation-models
machine-learning
pipeline
python
Added: over 1 year ago - Last Synced: 11 months ago
- Created: March 02, 2023

https://github.com/giacbrd/smartpipeline
A framework for rapid development of robust data pipelines following a simple design pattern
data-analysis
data-analytics
data-mining
data-pipelines
data-processing
data-science
dataops
design-patterns
etl
machine-learning
mlops
pipeline
pipeline-framework
pipelines
reproducibility
task-queue
workflow
Added: over 1 year ago - Last Synced: 11 months ago
- Created: September 03, 2018

https://github.com/yord/pxi
🧚 pxi (pixie) is a small, fast, and magical command-line data processor similar to jq, mlr, and awk.
csv
data-processing
deserializer
dsv
json
marshaller
parser
pixie
pxi
serializer
ssv
tsv
Added: over 1 year ago - Last Synced: 11 months ago
- Created: November 27, 2019
- Relevant topics? true
- External users? true
- Open source license? true
- Active? true
- Fork? false
- Main Language: JavaScript
- Commits: 404
- Committers: 1
- Issues: 7
- Pull Requests: 1
- Owner: Yord
- Stars: 266
- Forks: 3
- Packages: 1
- Downloads: 66

https://github.com/lherman-cs/go-rosbag
Rosbag parser written in pure Go
analytics
cli
cloud
data-processing
decoder
parser
robotics
ros
rosbag
Added: over 1 year ago - Last Synced: 11 months ago
- Created: December 25, 2020
- Relevant topics? true
- External users? true
- Open source license? true
- Active? true
- Fork? false
- Main Language: Go
- Commits: 70
- Committers: 1
- Issues: 2
- Pull Requests: 1
- Owner: lherman-cs
- Stars: 17
- Forks: 1
- Packages: 1

https://github.com/changchunhe/pyvaspflow
VASP calculation flow
data-processing
defect-formation-energy
vasp
Added: over 1 year ago - Last Synced: 11 months ago
- Created: March 08, 2019
- Relevant topics? true
- External users? true
- Open source license? true
- Active? true
- Fork? false
- Main Language: Python
- Commits: 539
- Committers: 4
- Issues: 0
- Pull Requests: 20
- Owner: ChangChunHe
- Stars: 17
- Forks: 18
- Packages: 1
- Downloads: 9

https://github.com/zazuko/barnard59
An intuitive and flexible RDF pipeline solution designed to simplify and automate ETL processes for efficient data management.
data-integration
data-pipeline
data-processing
etl
json-ld
linked-data
pipeline
rdf
semantic-web
Added: over 1 year ago - Last Synced: 11 months ago
- Created: October 25, 2018
- Relevant topics? true
- External users? true
- Open source license? true
- Active? true
- Fork? false
- Main Language: JavaScript
- Commits: 1159
- Committers: 19
- Issues: 147
- Pull Requests: 282
- Owner: zazuko
- Stars: 21
- Forks: 2
- Packages: 14
- Downloads: 100,567

https://github.com/svenkreiss/pysparkling
A pure Python implementation of Apache Spark's RDD and DStream interfaces.
apache-spark
data-processing
data-science
python
Added: over 1 year ago - Last Synced: 11 months ago
- Created: May 09, 2015
- Relevant topics? true
- External users? true
- Open source license? true
- Active? true
- Fork? false
- Main Language: Python
- Commits: 1454
- Committers: 10
- Issues: 20
- Pull Requests: 80
- Owner: svenkreiss
- Stars: 260
- Forks: 44
- Packages: 2
- Downloads: 11,565

https://github.com/asyml/fortehealth
The project is in the incubation stage and still under development. ForteHealth is a flexible and powerful ML workflow builder for biomedical and clinical scenarios. This is part of the CASL project: http://casl-project.ai/
biomedical-named-entity-recognition
clinical-nlp
clinical-text-processing
data-processing
deep-learning
information-retrieval
machine-learning
natural-language
natural-language-processing
python
Added: over 1 year ago - Last Synced: 11 months ago
- Created: February 04, 2022

https://github.com/AtomGraph/Processor
Ontology-driven Linked Data processor and server for SPARQL backends. Apache License.
appengine
crud
data-driven
data-processing
declarative
docker-image
framework
generic
hypermedia
knowledge-graph
ldt
linked-data
linked-data-templates
ontology-driven-development
rdf
rest
semantic-web
server
sparql
Added: over 1 year ago - Last Synced: 11 months ago
- Created: April 06, 2015

https://github.com/apache/incubator-wayang
Apache Wayang (incubating) is the first cross-platform data processing system.
apache
big-data
cross-platform
data-management-platform
data-processing
distributed-system
hadoop
java
jdbc
middleware
open-source
performance
scala
spark
Added: over 1 year ago - Last Synced: 11 months ago
- Created: December 16, 2020

https://github.com/kfultz07/go-dataframe
A simple package to abstract away the process of creating usable DataFrames for data analytics. This package is heavily inspired by the amazing Python library, Pandas.
data-analysis
data-analytics
data-processing
data-science
dataframe
go
golang
pandas
Added: over 1 year ago - Last Synced: 11 months ago
- Created: January 03, 2022

https://github.com/flow-php/etl-adapter-json
PHP ETL Adapter: JSON
data-engineering
data-processing
etl
flow-php
json
Added: over 1 year ago - Last Synced: 11 months ago
- Created: April 10, 2021

https://github.com/skuschel/generatorpipeline
Parallelize your data-processing pipelines with just a decorator.
data-processing
data-science
hacktoberfest
python
Added: over 1 year ago - Last Synced: 11 months ago
- Created: November 21, 2019

https://github.com/flow-php/etl-adapter-doctrine
PHP ETL Adapter: Doctrine
data-engineering
data-processing
dbal
doctrine
elt
flow-php
Added: over 1 year ago - Last Synced: 11 months ago
- Created: March 22, 2021

https://github.com/flow-php/etl-adapter-logger
PHP ETL Adapter: Logger
data-engineering
data-processing
etl
flow-php
logger
monolog
Added: over 1 year ago - Last Synced: 11 months ago
- Created: May 17, 2021

https://github.com/flow-php/etl-adapter-csv
PHP ETL Adapter: CSV
csv
data-engineering
data-processing
etl
flow-php
Added: over 1 year ago - Last Synced: 11 months ago
- Created: March 27, 2021

https://github.com/nvidia/dali
A GPU-accelerated library containing highly optimized building blocks and an execution engine for data processing to accelerate deep learning training and inference applications.
audio-processing
data-augmentation
data-processing
deep-learning
fast-data-pipeline
gpu
gpu-tensorflow
image-augmentation
image-processing
machine-learning
mxnet
neural-network
paddle
python
pytorch
Added: over 1 year ago - Last Synced: 11 months ago
- Created: June 01, 2018

https://github.com/numaproj/numaflow
Kubernetes-native platform to run massively parallel data/streaming jobs
data-processing
hacktoberfest
k8s
kubernetes
map-reduce
pipeline
stream-processing
Added: over 1 year ago - Last Synced: 11 months ago
- Created: May 20, 2022

https://github.com/PatrickTourniaire/ror
Python library which provides simple interfaces to programmatically create pipelines for data processing and ML, with good separation of concerns.
data-processing
machine-learning
pipelines
Added: over 1 year ago - Last Synced: 11 months ago
- Created: June 12, 2023
- Relevant topics? true
- External users? true
- Open source license? true
- Active? true
- Fork? false
- Main Language: Python
- Commits: 60
- Committers: 3
- Issues: 20
- Pull Requests: 9
- Owner: PatrickTourniaire
- Stars: 0
- Forks: 0
- Packages: 1
- Downloads: 13

https://github.com/piotr-ku/yaml-runner-go
YAML Runner Go is an application that executes commands based on the rules defined in a YAML file. It provides the flexibility to run commands either once or as a daemon at specific intervals.
administration-tools
alerting
continous-delivery
continous-deployment
continuous-integration
data-processing
go
golang
monitoring-automation
runner
scheduled-tasks
system-administration
yaml
Added: over 1 year ago - Last Synced: 11 months ago
- Created: June 13, 2023

https://github.com/roboto-ai/robologs-ros-actions
A collection of actions for working with ROS data
data-analysis
data-processing
data-transformation
mcap
robotics
ros
ros2
rosbag
sensors
Added: over 1 year ago - Last Synced: 11 months ago
- Created: June 20, 2023

https://github.com/allenai/dolma
Data and tools for generating and inspecting OLMo pre-training data.
data-processing
large-language-models
llm
machile-learning
nlp
Added: over 1 year ago - Last Synced: 11 months ago
- Created: June 20, 2023

https://github.com/crate-workbench/cratedb-toolkit
CrateDB Toolkit.
cratedb
cratedb-client
cratedb-driver
data-expiration
data-processing
data-retention
database-adapter
expiration
materialized-view
materialized-views
olap
olap-database
retention
retention-policies
retention-policy
sqlalchemy
toolkit
Added: over 1 year ago - Last Synced: 11 months ago
- Created: June 27, 2023
- Relevant topics? true
- External users? true
- Open source license? true
- Active? true
- Fork? false
- Main Language: Python
- Commits: 162
- Committers: 4
- Issues: 27
- Pull Requests: 220
- Owner: crate-workbench
- Stars: 3
- Forks: 3
- Packages: 1
- Downloads: 1,722

https://github.com/stevapple/elasticsearch-utils
Asynchronous data processing and import/export for Elasticsearch, written in Python.
data-analysis
data-processing
elasticsearch
python
Added: over 1 year ago - Last Synced: 11 months ago
- Created: July 13, 2023

https://github.com/basemax/dataprocessingapachekafka
Welcome to the Real-time Data Processing using Apache Kafka project! In this project, we will explore the capabilities of Apache Kafka, a powerful and distributed streaming platform, to build a real-time data processing system. Whether you're a beginner or have some experience with Kafka.
apache-kafka
api
data-processing
go
go-data
go-data-processing
golang
kafka
nats
restful
Added: over 1 year ago - Last Synced: 11 months ago
- Created: August 16, 2023

https://github.com/kkrusere/nhanes-pytool-api
The NHANES Data 'API' is a Python tool that simplifies access to the National Health and Nutrition Examination Survey (NHANES) dataset. This project provides an easy-to-use API to retrieve NHANES data, helping researchers, data scientists, health professionals, and other stakeholders access these valuable datasets.
data-engineering-pipeline
data-mining
data-processing
health-data
health-data-analysis
health-data-science
nhanes
Added: over 1 year ago - Last Synced: 11 months ago
- Created: October 13, 2023

https://github.com/getstrm/pace
Data policy IN, dynamic view OUT: PACE is the Policy As Code Engine. It helps you to programmatically create and apply a data policy to a processing platform like Databricks, Snowflake or BigQuery, with definitions imported from Collibra, Datahub, ODD and the like.
bigquery
data-catalog
data-contracts
data-governance
data-processing
databricks
policy-enforcement
snowflake
Added: over 1 year ago - Last Synced: 11 months ago
- Created: October 18, 2023

https://github.com/alexliap/roll_rate_analysis
Roll Rate Analysis Python package. Both month-over-month and snapshot roll rate functionalities are supported. It uses the Polars library for optimization and speed.
credit-risk
data-processing
polars
roll-rate-analysis
Added: over 1 year ago - Last Synced: 11 months ago
- Created: November 14, 2023

https://github.com/n0rdy/pippin
Go library to create and manage data pipelines on your machine
async
asynchronous
data
data-engineering
data-pipeline
data-processing
go
golang
golang-library
golang-package
goroutines
pipeline
Added: over 1 year ago - Last Synced: 11 months ago
- Created: November 18, 2023

https://github.com/wq/itertable
⇔ IterTable is a Pythonic API for iterating through tabular data formats, including CSV, XLSX, XML, and JSON.
csv
data-processing
excel
export
import
iterable
json
openpyxl
pandas
pythonic
spreadsheet
tabular-data
xml
Added: over 1 year ago - Last Synced: 11 months ago
- Created: August 22, 2012

https://github.com/helmholtz-analytics/heat/
Distributed tensors and Machine Learning framework with GPU and MPI acceleration in Python
array-api
data-analytics
data-processing
data-science
distributed
gpu
hpc
machine-learning
massive-datasets
mpi
mpi4py
multi-gpu
multi-node-cluster
numpy
parallelism
python
pytorch
tensors
Added: over 1 year ago - Last Synced: 11 months ago
- Created: May 17, 2018
- Relevant topics? true
- External users? true
- Open source license? true
- Active? true
- Fork? false
- Main Language: Python
- Commits: 4590
- Committers: 62
- Issues: 218
- Pull Requests: 354
- Owner: helmholtz-analytics
- Stars: 193
- Forks: 56
- Packages: 1

https://github.com/guancecloud/platypus
Platypus is a programming language for Observability Data Pipeline
data-processing
dsl
go
observability
Added: over 1 year ago - Last Synced: 11 months ago
- Created: September 20, 2022
- Relevant topics? true
- External users? true
- Open source license? true
- Active? true
- Fork? false
- Main Language: Go
- Commits: 43
- Committers: 5
- Issues: 17
- Pull Requests: 32
- Owner: GuanceCloud
- Stars: 20
- Forks: 6
- Packages: 2

https://github.com/johnkerl/miller
Miller is like awk, sed, cut, join, and sort for name-indexed data such as CSV, TSV, and tabular JSON
command-line
command-line-tools
csv
csv-format
data-cleaning
data-processing
data-reduction
data-regression
devops
devops-tools
json
json-data
miller
statistical-analysis
statistics
streaming-algorithms
streaming-data
tabular-data
tsv
unix-toolkit
Added: over 1 year ago - Last Synced: 11 months ago
- Created: May 03, 2015
