Show HN: Lance – Open Lakehouse Format for Multimodal AI Datasets


Hacker News

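Lance is an open lakehouse format designed for multimodal AI datasets, offering 100x faster random access, vector indexing, and data versioning. It is compatible with popular data science tools such as Pandas, DuckDB, and PyTorch, with more integrations planned.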


Open Lakehouse Format for Multimodal AI. Convert from Parquet in 2 lines of code for 100x faster random access, vector index, and data versioning. Compatible with Pandas, DuckDB, Polars, PyArrow, and PyTorch, with more integrations coming.


lance-format/lance


The Open Lakehouse Format for Multimodal AI
High-performance vector search, full-text search, random access, and feature engineering capabilities for the lakehouse.
Compatible with Pandas, DuckDB, Polars, PyArrow, Ray, Spark, and more integrations on the way.


Lance is an open lakehouse format for multimodal AI. It contains a file format, table format, and catalog spec that together allow you to build a complete lakehouse on top of object storage to power your AI workflows. Lance is perfect for:

The key features of Lance include:

Expressive hybrid search: Combine vector similarity search, full-text search (BM25), and SQL analytics on the same dataset with accelerated secondary indices.

Lightning-fast random access: 100x faster than Parquet or Iceberg for random access without sacrificing scan performance.

Native multimodal data support: Store images, videos, audio, text, and embeddings in a single unified format with efficient blob encoding and lazy loading.

Data evolution: Efficiently add columns with backfilled values without full table rewrites, perfect for ML feature engineering.

Zero-copy versioning: ACID transactions, time travel, and automatic versioning without needing extra infrastructure.

Rich ecosystem integrations: Apache Arrow, Pandas, Polars, DuckDB, Apache Spark, Ray, Trino, Apache Flink, and open catalogs (Apache Polaris, Unity Catalog, Apache Gravitino).

For more details, see the full Lance format specification.

Tip

Lance is in active development and we welcome contributions. Please see our contributing guide for more information.

Quick Start

Installation

To install a preview release:

Note

Versions prior to 1.0.0-beta.4 are available at https://pypi.fury.io/lancedb/pylance

Tip

Preview releases are released more often than full releases and contain the
latest features and bug fixes. They receive the same level of testing as full releases.
We guarantee they will remain published and available for download for at
least 6 months. When you want to pin to a specific version, prefer a stable release.

Converting to Lance

Reading Lance data

Pandas

DuckDB

Vector search

Download the sift1m subset

Convert it to Lance

Build the index

Search the dataset

Directory structure

Benchmarks

Vector search

We benchmarked vector search on the SIFT dataset, using 1M vectors of 128 dimensions.

Vs. Parquet

We created a Lance dataset from the Oxford Pet dataset for some preliminary performance testing of Lance against Parquet and the raw image/XML files. For analytics queries, Lance is 50-100x faster than reading the raw metadata. For batched random access, Lance is 100x faster than both Parquet and raw files.


Why Lance for AI/ML workflows?

The machine learning development cycle involves multiple stages:

Traditional lakehouse formats were designed for SQL analytics and struggle with AI/ML workloads that require:

While existing formats (Parquet, Iceberg, Delta Lake) excel at SQL analytics, they require additional specialized systems for AI capabilities. Lance brings these AI-first features directly into the lakehouse format.

A comparison of different formats across ML development stages:
