Lance File Format

File Structure

Each .lance file is the container for the actual data.

Format Overview

At the tail of the file, ColumnMetadata protobuf blocks describe the encoding of the columns in the file.

message ColumnMetadata {

  // This describes a page of column data.
  message Page {
    // The file offsets for each of the page buffers
    //
    // The number of buffers is variable and depends on the encoding.  There
    // may be zero buffers (e.g. constant encoded data) in which case this
    // could be empty.
    repeated uint64 buffer_offsets = 1;
    // The size (in bytes) of each of the page buffers
    //
    // This field will have the same length as `buffer_offsets` and
    // may be empty.
    repeated uint64 buffer_sizes = 2;
    // Logical length (e.g. # rows) of the page
    uint64 length = 3;
    // The encoding used to encode the page
    Encoding encoding = 4;
    // The priority of the page
    //
    // For tabular data this will be the top-level row number of the first row
    // in the page (and top-level rows should not split across pages).
    uint64 priority = 5;
  }
  // Encoding information about the column itself.  This typically describes
  // how to interpret the column metadata buffers.  For example, it could
  // describe how statistics or dictionaries are stored in the column metadata.
  Encoding encoding = 1;
  // The pages in the column
  repeated Page pages = 2;   
  // The file offsets of each of the column metadata buffers
  //
  // There may be zero buffers.
  repeated uint64 buffer_offsets = 3;
  // The size (in bytes) of each of the column metadata buffers
  //
  // This field will have the same length as `buffer_offsets` and
  // may be empty.
  repeated uint64 buffer_sizes = 4;

}
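To make the `buffer_offsets` / `buffer_sizes` pairing concrete, here is a minimal Python sketch. It is not Lance's reader: the `Page` dataclass is a hypothetical plain-Python stand-in for the protobuf message (the `encoding` field is omitted), and `read_page_buffers` is an illustrative helper name. It only shows that each buffer is located by an absolute file offset and a byte size, and that a page may legitimately have zero buffers.

```python
import io
from dataclasses import dataclass, field
from typing import BinaryIO, List


@dataclass
class Page:
    """Hypothetical stand-in for ColumnMetadata.Page (encoding omitted)."""
    buffer_offsets: List[int] = field(default_factory=list)
    buffer_sizes: List[int] = field(default_factory=list)
    length: int = 0      # logical length, e.g. number of rows
    priority: int = 0    # top-level row number of the first row


def read_page_buffers(f: BinaryIO, page: Page) -> List[bytes]:
    """Read every buffer of a page using its (offset, size) pairs."""
    if len(page.buffer_offsets) != len(page.buffer_sizes):
        raise ValueError("buffer_offsets and buffer_sizes must have the same length")
    buffers = []
    for offset, size in zip(page.buffer_offsets, page.buffer_sizes):
        f.seek(offset)
        data = f.read(size)
        if len(data) != size:
            raise ValueError("buffer extends past the end of the file")
        buffers.append(data)
    return buffers


# Usage with an in-memory "file"; a constant-encoded page could have no buffers.
demo_file = io.BytesIO(b"0123456789abcdef")
demo_page = Page(buffer_offsets=[2, 10], buffer_sizes=[3, 4], length=7)
demo_buffers = read_page_buffers(demo_file, demo_page)
```

How the returned bytes are interpreted depends on the page's `encoding`, which this sketch deliberately leaves out.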

A Footer describes the overall layout of the file. The entire file layout is described here:

// Note: the number of buffers (BN) is independent of the number of columns (CN)
//       and pages.
//
//       Buffers often need to be aligned.  64-byte alignment is common when
//       working with SIMD operations.  4096-byte alignment is common when
//       working with direct I/O.  In order to ensure these buffers are aligned
//       writers may need to insert padding before the buffers.
//       
//       If direct I/O is required then most (but not all) fields described
//       below must be sector aligned.  We have marked these fields with an
//       asterisk for clarity.  Readers should assume there will be optional
//       padding inserted before these fields.
//
//       All footer fields are unsigned integers written with little-endian
//       byte order.
//
// ├──────────────────────────────────┤
// | Data Pages                       |
// |   Data Buffer 0*                 |
// |   ...                            |
// |   Data Buffer BN*                |
// ├──────────────────────────────────┤
// | Column Metadatas                 |
// | |A| Column 0 Metadata*           |
// |     Column 1 Metadata*           |
// |     ...                          |
// |     Column CN Metadata*          |
// ├──────────────────────────────────┤
// | Column Metadata Offset Table     |
// | |B| Column 0 Metadata Position*  |
// |     Column 0 Metadata Size       |
// |     ...                          |
// |     Column CN Metadata Position  |
// |     Column CN Metadata Size      |
// ├──────────────────────────────────┤
// | Global Buffers Offset Table      |
// | |C| Global Buffer 0 Position*    |
// |     Global Buffer 0 Size         |
// |     ...                          |
// |     Global Buffer GN Position    |
// |     Global Buffer GN Size        |
// ├──────────────────────────────────┤
// | Footer                           |
// | A u64: Offset to column meta 0   |
// | B u64: Offset to CMO table       |
// | C u64: Offset to GBO table       |
// |   u32: Number of global bufs     |
// |   u32: Number of columns         |
// |   u16: Major version             |
// |   u16: Minor version             |
// |   "LANC"                         |
// ├──────────────────────────────────┤
//
// File Layout-End
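The footer in the diagram can be decoded with Python's standard `struct` module. The sketch below assumes, based only on the diagram, that the footer fields are packed back to back with no padding and that the `"LANC"` magic occupies the last four bytes of the file, which gives a 40-byte footer. The names `Footer`, `parse_footer`, and the field names are illustrative, not part of any Lance API.

```python
import struct
from typing import NamedTuple

# "<" = little endian, no padding; 3 x u64, 2 x u32, 2 x u16, 4-byte magic.
FOOTER_FORMAT = "<QQQIIHH4s"
FOOTER_SIZE = struct.calcsize(FOOTER_FORMAT)
MAGIC = b"LANC"


class Footer(NamedTuple):
    column_meta_start: int   # A: offset to column metadata 0
    cmo_table_offset: int    # B: offset to the column metadata offset table
    gbo_table_offset: int    # C: offset to the global buffers offset table
    num_global_buffers: int
    num_columns: int
    major_version: int
    minor_version: int


def parse_footer(file_bytes: bytes) -> Footer:
    """Decode the trailing footer of a file held entirely in memory."""
    if len(file_bytes) < FOOTER_SIZE:
        raise ValueError("file is too small to contain a footer")
    *fields, magic = struct.unpack(FOOTER_FORMAT, file_bytes[-FOOTER_SIZE:])
    if magic != MAGIC:
        raise ValueError("missing LANC magic bytes")
    return Footer(*fields)


# Usage: a synthetic file made of 100 filler bytes followed by a footer.
demo_bytes = b"\x00" * 100 + struct.pack(FOOTER_FORMAT, 10, 50, 82, 1, 2, 2, 0, MAGIC)
demo_footer = parse_footer(demo_bytes)
```

A real reader would typically fetch only the tail of the file rather than the whole thing, then follow offsets B and C to the two offset tables.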

File Version

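The Lance file format has gone through a number of changes, including a breaking change from version 1 to version 2. Several APIs allow the file version to be specified. Using a newer version of the file format gives better compression and/or performance. However, older software versions may not be able to read newer files.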

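In addition, the latest version of the file format (next) is unstable and should not be used in production. Breaking changes may be made to unstable encodings, which would mean that files written with those encodings can no longer be read by any newer version of Lance. The next version should only be used for experimentation and for benchmarking upcoming features.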

The following values are supported:

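Version	Minimum Lance Version	Maximum Lance Version	Description
0.1	Any	Any	This is the initial Lance format.
2.0	0.16.0	Any	Rework of the Lance file format that removed row groups and introduced null support for lists, fixed-size lists, and primitives.
2.1 (unstable)	None	Any	Enhances integer and string compression, adds support for nulls in struct fields, and improves random access performance for nested fields.
legacy	N/A	N/A	Alias for 0.1
stable	N/A	N/A	Alias for the latest stable version (currently 2.0)
next	N/A	N/A	Alias for the latest unstable version (currently 2.1)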
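The alias rows in the table can be read as a simple lookup. The following sketch is a snapshot of the table as written here, not Lance's own resolution code; `resolve_file_version` and `is_unstable` are hypothetical helper names, and the alias targets will change when new versions are stabilized.

```python
# Snapshot of the table above; "stable" and "next" move over time.
ALIASES = {"legacy": "0.1", "stable": "2.0", "next": "2.1"}
KNOWN_VERSIONS = {"0.1", "2.0", "2.1"}
UNSTABLE_VERSIONS = {"2.1"}


def resolve_file_version(value: str) -> str:
    """Map an alias or explicit version string to a concrete version."""
    version = ALIASES.get(value, value)
    if version not in KNOWN_VERSIONS:
        raise ValueError(f"unsupported file version: {value!r}")
    return version


def is_unstable(value: str) -> bool:
    """True if the resolved version should not be used in production."""
    return resolve_file_version(value) in UNSTABLE_VERSIONS


# Usage
resolved_stable = resolve_file_version("stable")
resolved_next = resolve_file_version("next")
```

Checking `is_unstable` before writing is one way a tool could refuse to produce production data in a format that may later become unreadable.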

File Encodings

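Lance supports a variety of encodings for different data types. The encodings are chosen to give good performance for both random access and scans. Encodings are added over time and may be extended in the future. The manifest records a maximum format version, which controls which encodings will be used. This allows a gradual migration to a new data format, so that old readers can still read new data while the migration is in progress.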

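Encodings are divided into "field encodings" and "array encodings". A field encoding is consistent across an entire field of data, while array encodings are used for individual pages of data within a field. Array encodings can nest other array encodings (for example, a dictionary encoding can bit-pack its indices), but array encodings cannot nest field encodings. For this reason, data types such as Dictionary<UInt8, List<String>> are not yet supported (since there is no dictionary field encoding).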
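The nesting of array encodings mentioned above (a dictionary encoding whose indices are bit-packed) can be illustrated with a small pure-Python sketch. This shows the general technique only; the bit layout used here (values packed from the least significant bit, little-endian bytes) is an assumption for illustration and is not claimed to match Lance's on-disk format. All function names are hypothetical.

```python
from typing import List, Sequence, Tuple


def dictionary_encode(values: Sequence[str]) -> Tuple[List[str], List[int]]:
    """Outer encoding: replace each value with an index into a dictionary."""
    dictionary: List[str] = []
    positions = {}
    indices: List[int] = []
    for value in values:
        if value not in positions:
            positions[value] = len(dictionary)
            dictionary.append(value)
        indices.append(positions[value])
    return dictionary, indices


def bit_pack(ints: Sequence[int], bit_width: int) -> bytes:
    """Inner encoding: store each integer using exactly bit_width bits."""
    acc = 0
    for i, value in enumerate(ints):
        if value < 0 or value >> bit_width:
            raise ValueError(f"{value} does not fit in {bit_width} bits")
        acc |= value << (i * bit_width)
    n_bytes = (len(ints) * bit_width + 7) // 8
    return acc.to_bytes(n_bytes, "little")


def bit_unpack(data: bytes, bit_width: int, count: int) -> List[int]:
    """Inverse of bit_pack."""
    acc = int.from_bytes(data, "little")
    mask = (1 << bit_width) - 1
    return [(acc >> (i * bit_width)) & mask for i in range(count)]


# Usage: three unique strings need only 2 bits per index.
demo_values = ["a", "b", "a", "c", "b", "a"]
demo_dict, demo_indices = dictionary_encode(demo_values)
demo_width = max(1, (len(demo_dict) - 1).bit_length())
demo_packed = bit_pack(demo_indices, demo_width)
demo_decoded = [demo_dict[i] for i in bit_unpack(demo_packed, demo_width, len(demo_values))]
```

Both layers here operate on a single page of values, which is why they compose freely; a field encoding, by contrast, applies to the whole field and so cannot sit inside a per-page array encoding.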

Encodings Available

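Encoding Name	Encoding Type	What it does	Supported Versions	When it is applied
Basic struct	Field encoding	Encodes non-nullable struct data	>= 2.0	Default encoding for structs
List	Field encoding	Encodes lists (nullable or non-nullable)	>= 2.0	Default encoding for lists
Basic primitive	Field encoding	Encodes primitive data types using a separate validity array	>= 2.0	Default encoding for primitive data types
Value	Array encoding	Encodes a single vector of fixed-width values	>= 2.0	Fallback encoding for fixed-width types
Binary	Array encoding	Encodes a single vector of variable-width data	>= 2.0	Fallback encoding for variable-width types
Dictionary	Array encoding	Encodes data using a dictionary array and an indices array, which is useful for large data types with few unique values	>= 2.0	Used on string pages with fewer than 100 unique elements
Packed struct	Array encoding	Encodes a struct with fixed-width fields in a row-major format, making random access more efficient	>= 2.0	Only used on struct types if the field metadata attribute `"packed"` is set to `"true"`
Fsst	Array encoding	Compresses binary data by identifying common substrings (of 8 bytes or less) and encoding them as symbols	>= 2.1	Used on string pages that are not dictionary encoded
Bitpacking	Array encoding	Encodes a single vector of fixed-width values using bit packing, which is useful for integer types that do not span the full range of values	>= 2.1	Used on integer types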
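Read literally, the "When it is applied" column implies a simple decision rule for string pages. The sketch below encodes only what the table states (dictionary below 100 unique elements, Fsst for non-dictionary string pages from 2.1, Binary as the fallback); the real writer may weigh other factors, and `choose_string_page_encoding` is a hypothetical name rather than a Lance function.

```python
from typing import Sequence, Tuple

DICTIONARY_UNIQUE_LIMIT = 100  # "fewer than 100 unique elements" per the table


def _parse_version(version: str) -> Tuple[int, ...]:
    """Turn a string such as "2.1" into a comparable tuple (2, 1)."""
    return tuple(int(part) for part in version.split("."))


def choose_string_page_encoding(page: Sequence[str], file_version: str = "2.0") -> str:
    """Pick an array encoding for one page of strings, following the table only."""
    version = _parse_version(file_version)
    if version < (2, 0):
        raise ValueError("the table only covers file versions >= 2.0")
    if len(set(page)) < DICTIONARY_UNIQUE_LIMIT:
        return "dictionary"
    if version >= (2, 1):
        return "fsst"
    return "binary"


# Usage: a low-cardinality page and a high-cardinality page.
low_cardinality = ["red", "green", "blue"] * 1000
high_cardinality = [f"user-{i}" for i in range(5000)]
choice_low = choose_string_page_encoding(low_cardinality, "2.0")
choice_high_20 = choose_string_page_encoding(high_cardinality, "2.0")
choice_high_21 = choose_string_page_encoding(high_cardinality, "2.1")
```

Because the choice is made per page, different pages of the same string column can end up with different array encodings.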