Parsing 400 MB in a Second, in Less than 80 Lines of Code
So what is so hard about parsing CSV? Isn't it just .split("\n") and .split(",")? It is, if you don't have to comply with RFC 4180. RFC 4180 states that a field can be enclosed in quotes ("), in which case it is legal for line breaks and commas to appear inside the field. So .split("\n") and .split(",") will no longer work.
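To make this concrete, here is a small made-up CSV snippet. Per RFC 4180 it contains a header row and two records with two fields each, yet the second field of the first record holds both a comma and a line break, so splitting on newlines yields four lines instead of three rows, and splitting the second line on commas cuts the quoted field in half:

id,comment
1,"Hello, world
still the same field"
2,plain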
Back in 2019 I ported FlexBuffers to C# (FlexBuffers-CSharp) because I wanted to use it in a project I was working on. As the initial data came as CSV, I also implemented a CSV to FlexBuffers converter.
Now I am exploring a new programming language called Mojo. After giving it a try by implementing a function that counts the characters in a UTF-8 string, I decided to go a step further, brush up on my CSV parsing knowledge, and see how well it works out in Mojo.
First, I ported the converter to Mojo with only minor modifications:
from String import *
from Vector import DynamicVector

struct CsvTable:
    var inner_string: String
    var starts: DynamicVector[Int]
    var ends: DynamicVector[Int]
    var column_count: Int

    fn __init__(inout self, owned s: String):
        self.inner_string = s
        self.starts = DynamicVector[Int](10)
        self.ends = DynamicVector[Int](10)
        self.column_count = -1
        self.parse()

    @always_inline
    fn parse(inout self):
        let QUOTE = ord('"')
        let COMMA = ord(',')
        let LF = ord('\n')
        let CR = ord('\r')
        let length = len(self.inner_string.buffer)
        var offset = 0
        var in_double_quotes = False
        self.starts.push_back(offset)
        # Scan the string byte by byte and record field boundaries
        while offset < length:
            let c = self.inner_string.buffer[offset]
            if c == QUOTE:
                in_double_quotes = not in_double_quotes
                offset += 1
            elif not in_double_quotes and c == COMMA:
                self.ends.push_back(offset)
                offset += 1
                self.starts.push_back(offset)
            elif not in_double_quotes and c == LF:
                self.ends.push_back(offset)
                if self.column_count == -1:
                    self.column_count = len(self.ends)
                offset += 1
                self.starts.push_back(offset)
            elif not in_double_quotes and c == CR and length > offset + 1 and self.inner_string.buffer[offset + 1] == LF:
                # CRLF line ending: skip both bytes
                self.ends.push_back(offset)
                if self.column_count == -1:
                    self.column_count = len(self.ends)
                offset += 2
                self.starts.push_back(offset)
            else:
                offset += 1
        self.ends.push_back(length)

    fn get(self, row: Int, column: Int) -> String:
        if column >= self.column_count:
            return ""
        let index = self.column_count * row + column
        if index >= len(self.ends):
            return ""
        return self.inner_string[self.starts[index]:self.ends[index]]
The code is fairly simple. I define a struct called CsvTable which holds the CSV text as a string. When an instance of the struct is created, it determines the start and end position of every field in the table, as well as the number of columns. The computation happens byte by byte in the parse function, which checks for the " (quote), , (comma), \n and \r characters specified in RFC 4180 and reacts accordingly. The parse function assumes the CSV is well formed, meaning that every row has the same number of columns. There is also a get function which retrieves a field based on the provided row and column index. If an invalid index is provided, the function returns an empty string, although some might argue that raising an exception would be more appropriate.
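As a quick usage sketch (the CSV literal and the printed values are only illustrative, assuming standard string-literal escapes in Mojo):

let csv = String("name,note\nAlice,\"Hi, there\"\nBob,ok\n")
let table = CsvTable(csv)
print(table.column_count)  # 2
print(table.get(1, 0))     # Alice
print(table.get(1, 1))     # "Hi, there"  (quotes are kept, see the list of shortcomings below)
print(table.get(9, 9))     # empty string, because the column index is out of range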
This CsvTable implementation has some shortcomings and lacks a few obvious features, including:
- Type inference for fields
- Removing quotes when a field is returned by the get function (see the sketch after this list)
- A flag to deviate from RFC 4180 and ignore quotes inside fields
- The ability to set a different character as the column separator
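The second point, for example, could be covered by a thin wrapper around get. This is only a sketch of one possible approach, not part of the implementation above, reusing the byte access and slicing the struct already relies on (it also ignores the doubled-quote escaping that RFC 4180 allows inside quoted fields):

fn get_unquoted(self, row: Int, column: Int) -> String:
    let QUOTE = ord('"')
    let field = self.get(row, column)
    let n = len(field)
    # Strip one pair of surrounding double quotes, if present
    if n >= 2 and field.buffer[0] == QUOTE and field.buffer[n - 1] == QUOTE:
        return field[1:n - 1]
    return field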
While I may implement these features in the future, my current focus is on SIMD (Single Instruction, Multiple Data) and the performance gains it brings.
Back in 2019, a paper by Geoff Langdale and Daniel Lemire called "Parsing Gigabytes of JSON per Second" made quite a splash. They showed how SIMD instructions can speed up JSON parsing significantly. The same authors also have a solution for parsing CSV files. I was curious and wanted to try implementing their approach in Mojo. Unfortunately I ran into some problems, as the current Mojo standard library does not expose all the necessary intrinsics. I could have used MLIR interop, but instead I decided to build the parser with only the standard library's own structs and functions. Despite these limitations, I managed to create a very efficient and, in my opinion, simple and elegant parser.
In the following sections, I will walk through my implementation, provide performance benchmarks, and highlight the differences between my solution and the one proposed by Geoff Langdale and Daniel Lemire.
So, let's first have a look at the SimdCsvTable:
from DType import DType
from Functional import vectorize
from Intrinsics import compressed_store
from Math import iota, any_true, reduce_bit_count
from Memory import *
from Pointer import DTypePointer
from String import String, ord
from TargetInfo import dtype_simd_width
from Vector import DynamicVector

alias simd_width_u8 = dtype_simd_width[DType.ui8]()

struct SimdCsvTable:
    var inner_string: String
    var starts: DynamicVector[Int]
    var ends: DynamicVector[Int]
    var column_count: Int

    fn __init__(inout self, owned s: String):
        self.inner_string = s
        self.starts = DynamicVector[Int](10)
        self.ends = DynamicVector[Int](10)
        self.column_count = -1
        self.parse()

    @always_inline
    fn parse(inout self):
        let QUOTE = ord('"')
        let COMMA = ord(',')
        let LF = ord('\n')
        let CR = ord('\r')
        let p = DTypePointer[DType.si8](self.inner_string.buffer.data)
        let string_byte_length = len(self.inner_string)
        var in_quotes = False
        self.starts.push_back(0)

        @always_inline
        @parameter
        fn find_indexies[simd_width: Int](offset: Int):
            # Load a chunk of bytes and mark quotes, commas and line feeds
            let chars = p.simd_load[simd_width](offset)
            let quotes = chars == QUOTE
            let commas = chars == COMMA
            let lfs = chars == LF
            let all_bits = quotes | commas | lfs
            # Compress the positions of the marked characters into a list on the stack
            let offsets = iota[simd_width, DType.ui8]()
            let sp: DTypePointer[DType.ui8] = stack_allocation[simd_width, UI8, simd_width]()
            compressed_store(offsets, sp, all_bits)
            let all_len = reduce_bit_count(all_bits)
            let crs_ui8 = (chars == CR).cast[DType.ui8]()
            let lfs_ui8 = lfs.cast[DType.ui8]()
            for i in range(all_len):
                let index = sp.load(i).to_int()
                if quotes[index]:
                    in_quotes = not in_quotes
                    continue
                if in_quotes:
                    continue
                let current_offset = index + offset
                # For CRLF line endings, end the field one byte before the line feed
                self.ends.push_back(current_offset - (lfs_ui8[index] * crs_ui8[index - 1]).to_int())
                self.starts.push_back(current_offset + 1)
                if self.column_count == -1 and lfs[index]:
                    self.column_count = len(self.ends)

        vectorize[simd_width_u8, find_indexies](string_byte_length)
        self.ends.push_back(string_byte_length)

    fn get(self, row: Int, column: Int) -> String:
        if column >= self.column_count:
            return ""
        let index = self.column_count * row + column
        if index >= len(self.ends):
            return ""
        return self.inner_string[self.starts[index]:self.ends[index]]
The struct looks very similar to CsvTable; the main difference lies in the implementation of the parse function. To access the underlying string byte buffer with SIMD, we need to represent it as a DTypePointer[DType.si8]. This lets us use the simd_load method, enabling SIMD operations on the byte vector. The find_indexies function plays a crucial role in the parsing process. With the help of the vectorize function, we apply find_indexies to chunks of the inner_string, effectively turning them into SIMD vectors. Subsequently, we create three bit-sets indicating whether an element in the vector is equal to " (quote), , (comma) or \n (line feed):
let chars = p.simd_load[simd_width](offset)
let quotes = chars == QUOTE
let commas = chars == COMMA
let lfs = chars == LF
let all_bits = quotes | commas | lfs
The all_bits bit-set (a SIMD vector of bools) represents the combination of the markers.
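As a small worked example, assume a hypothetical chunk of just 8 bytes, a,"b,c" followed by a line feed (the real simd_width depends on the target CPU):

# chunk:    a   ,   "   b   ,   c   "   \n
# quotes:   F   F   T   F   F   F   T   F
# commas:   F   T   F   F   T   F   F   F
# lfs:      F   F   F   F   F   F   F   T
# all_bits: F   T   T   F   T   F   T   T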
Now we need to transform this SIMD vector of bools into a list of offsets, which we can do with the following four lines:
let offsets = iota[simd_width, DType.ui8]()
let sp: DTypePointer[DType.ui8] = stack_allocation[simd_width, UI8, simd_width]()
compressed_store(offsets, sp, all_bits)
let all_len = reduce_bit_count(all_bits)
First we use the iota function to produce a vector of increasing integers starting from zero. Side note: AFAIK the iota function has its roots in APL and is also called an index generator.
Moving forward, our objective is to "compress" the offsets using the all_bits vector. This involves removing all instances in offsets where the corresponding values in all_bits are False. The resulting list's size will correspond to the number of True values in the all_bits vector. We allocate memory on the stack for the compressed_store function's result and determine its length by invoking reduce_bit_count(all_bits).
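Continuing the hypothetical 8-byte chunk from above: iota produces [0, 1, 2, 3, 4, 5, 6, 7], all_bits is [F, T, T, F, T, F, T, T], so compressed_store writes [1, 2, 4, 6, 7] into the stack buffer and reduce_bit_count returns 5.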
Given the length and the compressed list, we can iterate over the special characters in the current text chunk and push start and end offsets whenever they are not inside quotes.
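For the same hypothetical chunk, the loop visits positions 1, 2, 4, 6 and 7: position 1 is a comma outside quotes, so an end (1) and a start (2) are pushed; position 2 is a quote, so in_quotes flips to True; position 4 is a comma but is skipped because we are inside quotes; position 6 flips in_quotes back to False; and position 7 is a line feed, so another end/start pair is pushed and column_count is set to 2 on this first line.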
Now it is time to do some benchmarking. I picked the same CSV files that are provided in the simdcsv repo.
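A measuring harness along the following lines could produce output of the kind shown below. This is only a sketch: it assumes the Benchmark module that was available in the Mojo playground at the time, and it assumes the file contents have already been loaded into a String (for example via Python interop), since plain Mojo had no file I/O yet:

from Benchmark import Benchmark

fn benchmark_parsers(csv_text: String):
    @always_inline
    @parameter
    fn bench_simd():
        let t = SimdCsvTable(csv_text)
        _ = t.get(0, 0)

    @always_inline
    @parameter
    fn bench_non_simd():
        let t = CsvTable(csv_text)
        _ = t.get(0, 0)

    # Benchmark().run returns the minimum measured time in nanoseconds
    let simd_ns = Benchmark().run[bench_simd]()
    let non_simd_ns = Benchmark().run[bench_non_simd]()
    print("SIMD min parse time in nanoseconds:")
    print(simd_ns)
    print("Non SIMD min parse time in nanoseconds:")
    print(non_simd_ns)
    print("Difference")
    print(F64(non_simd_ns) / F64(simd_ns))
    print("SIMD bytes parsed per nanosecond")
    print(F64(len(csv_text)) / F64(simd_ns))
    print("Non SIMD bytes parsed per nanosecond")
    print(F64(len(csv_text)) / F64(non_simd_ns))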
For the nfl.csv, I got the following result:
SIMD min parse time in nanoseconds:
3135722.0
Non SIMD min parse time in nanoseconds:
4992006.0
Difference
1.5919797737171855
SIMD bytes parsed per nanosecond
0.43519738038002093
Non SIMD bytes parsed per nanosecond
0.27336866181651226
And for EDW.TEST_CAL_DT.csv the numbers are a bit lower:
SIMD min time in nanoseconds:
2007207.0
Non SIMD min time in nanoseconds:
2446398.0
Difference
1.2188070288714616
SIMD byte per nanosecond
0.25521333873387247
Non SIMD byte per nanosecond
0.20939601814586178
It’s worth noting that at present, Mojo code can only be run through a cloud-based Jupyter Notebook. This means that the absolute performance numbers may not carry as much weight, but the relative difference between the SIMD and non-SIMD parsers should still hold true. In our analysis, we found that the SIMD parser was around 60% faster for the nfl dataset, while for the second dataset it showed a modest improvement of about 20%.
To be frank, the performance boost achieved was not as substantial as I had anticipated. However, it still represents a notable improvement, which I consider a win.
Now, let’s discuss why I had to deviate from the solution proposed by Geoff Langdale and Daniel Lemire. In Mojo, when we compare a SIMD vector of integers with an integer, we obtain a SIMD vector of booleans, which is indeed a useful feature. However, I encountered a challenge when trying to perform a shift operation on the boolean vector. In contrast, performing a shift on a bit-set represented as a 64-bit unsigned integer is a straightforward operation. This limitation hindered my ability to replicate certain aspects of the original solution.
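To illustrate the first point with a tiny made-up example (values and vector width chosen only for readability):

let v = SIMD[DType.ui8, 4](97, 44, 98, 44)  # the bytes of "a,b,"
let commas = v == 44                        # a SIMD vector of bools: [False, True, False, True]
# A shift such as `commas << 1` is what the original bitmask-based approach relies on,
# but it is not available on this bool vector, whereas it is trivial on a 64-bit integer mask.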
Another obstacle I encountered was the need to check if a separator is within quotation marks, which requires the use of carry-less multiplication. Geoff Langdale wrote a blog post explaining the advantages of this approach. Unfortunately, the Mojo standard library does not currently provide access to this rather specialized operation. While it may be possible to expose carry-less multiplication through MLIR inter-op, I opted not to pursue that route for now.
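For context, what carry-less multiplication of the quote bitmask by an all-ones word computes is essentially a prefix XOR: every bit position that lies between an opening and a closing quote ends up set. A scalar sketch of that idea (not the CLMUL instruction itself, and not part of my implementation) could look like this:

# Given a 64-bit mask with a 1 at every quote position, return a mask with a 1
# at every position that falls inside a quoted region.
fn quote_region_mask(quote_bits: Int) -> Int:
    var inside: Int = 0
    var carry: Int = 0
    for i in range(64):
        carry ^= (quote_bits >> i) & 1  # toggle the state on every quote
        inside |= carry << i            # record whether position i is inside quotes
    return inside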
Therefore, due to these limitations in the Mojo standard library, I had to find alternative approaches to address these specific requirements in my implementation, which introduced some branching in the code.
That’s all from my side. Thank you for reading until the end. Feel free to leave a comment. Take care, and until next time!
Reposted from: https://mzaks.medium.com/simple-csv-parser-in-mojo-3555c13fb5c8