Open source repositories tagged with #big-data, ranked by health score.
Web-based notebook that enables data-driven, interactive data analytics and collaborative documents with SQL, Scala and more.
Fluid, elastic data abstraction and acceleration for BigData/AI applications in the cloud. (Project under CNCF)
Scalable, reliable, distributed storage system optimized for data analytics and object store workloads.
Arkime is an open source, large scale, full packet capturing, indexing, and database system.
Apache DataFusion Ballista Distributed Query Engine
The world's fastest open query engine for sub-second analytics both on and off the data lakehouse. With the flexibility to support nearly any scenario, StarRocks provides best-in-class performance for multi-dimensional analytics, real-time analytics, and ad-hoc queries. A Linux Foundation project.
AI + Data, online. https://vespa.ai
Apache IoTDB
Apache Paimon is a lake format that enables building a Realtime Lakehouse Architecture with Flink and Spark for both streaming and batch operations.
A fast, scalable, high performance Gradient Boosting on Decision Trees library, used for ranking, classification, regression and other machine learning tasks for Python, R, Java, C++. Supports computation on CPU and GPU.
ClickHouse® is a real-time analytics database management system.
Drop-in Apache Spark replacement written in Rust, unifying batch processing, stream processing, and compute-intensive AI workloads.
Official repository of Trino, the distributed SQL query engine for big data, formerly known as PrestoSQL (https://trino.io)
Apache DataFusion SQL Query Engine
YTsaurus is a scalable and fault-tolerant open-source big data platform.
Apache Beam is a unified programming model for Batch and Streaming data processing.