DolphinDB Text Data Loading Tutorial

This page is a tutorial on loading text data (such as CSV) into DolphinDB; it also records the author and publication date.

Source: https://dolphindb.cn/blogs/66

What this page covers

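Overview of the text import functions and their uses (loadText/ploadText/loadTextEx/textChunkDS).
How field names and data types are detected automatically on import, and the limits of that detection.
Using the schema parameter to specify the names, types, formats, and indices of the columns to import.
How skipRows skips the first rows of a file, and what to watch for.
Parallel import: examples and constraints for multi-threaded loading of a single file and for parallel writes of multiple files into a database.
The transform preprocessing mechanism applied before data is written to a database, with examples.
Other notes, including encoding, numeric parsing, and CSV quote handling.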

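DolphinDB Text Data Loading Tutorial (title/author/date) product_overview

This section gives the tutorial title, author byline, and publication date.

The author is credited as Junxi.
The publication date is 2021-08-05.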

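Overview of text import functions and performance claims product_overview

This section lists the four text import and processing functions and their uses, and includes an import performance comparison claim and a statement of the tutorial's scope.

DolphinDB provides four functions for importing text data.
loadText imports a text file as an in-memory table.
ploadText imports a text file in parallel as a partitioned in-memory table.
loadTextEx imports a text file into a database.
textChunkDS splits a file into data sources for processing with mr.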

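1. Automatic detection of the data format how_it_works

This section explains how field names and data types are detected automatically on import, notes the limits of that detection, and shows a loadText example along with how to inspect the resulting schema.

Field names and data types can be detected automatically when text is imported.
If no column in the first row starts with a digit, the system treats that row as the header containing the field names.
Types are inferred from a small sample, so some columns may be assigned the wrong type.
Automatic detection of UUID and IPADDR is not currently supported.
schema() shows a table's structure (field names and data types).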
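
The header rule above can be modelled in a few lines of Python. This is only an illustrative sketch of the stated rule, not DolphinDB's implementation; the function name first_row_is_header is hypothetical.

```python
def first_row_is_header(fields):
    """Return True when no field in the first row starts with a digit."""
    return not any(f.strip()[:1].isdigit() for f in fields)


print(first_row_is_header(["symbol", "date", "open"]))        # True
print(first_row_is_header(["000001", "2018.01.02", "9.13"]))  # False
```

A file whose genuine column names begin with a digit would fail this check, which is one case where supplying an explicit schema is the safer route.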

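2. Specifying the import format (schema parameter) how_it_works

This section explains that the schema parameter can specify column names, types, formats, and column indices, and gives the required structure of the schema table with an example.

The schema parameter accepts a table that describes how each column is imported.
The schema table can contain four columns: name, type, format, and col.
name and type are required and must be the first two columns.
format and col are optional and may appear in any order.
col gives the indices of the columns to import and must be in ascending order.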
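
The layout rules for the schema table can be expressed as a small checker. The Python below is a sketch under assumptions: the schema table is represented by its list of column names plus the values of its col column, and check_schema_layout is a hypothetical helper, not a DolphinDB function.

```python
ALLOWED = {"name", "type", "format", "col"}


def check_schema_layout(columns, col_values=None):
    """Return a list of layout problems for a schema table (empty if none)."""
    problems = []
    if list(columns[:2]) != ["name", "type"]:
        problems.append("name and type must be the first two columns")
    extra = set(columns) - ALLOWED
    if extra:
        problems.append(f"unexpected columns: {sorted(extra)}")
    if col_values is not None and list(col_values) != sorted(col_values):
        problems.append("col indices must be ascending")
    return problems


print(check_schema_layout(["name", "type", "col", "format"], [0, 2, 5]))  # []
print(check_schema_layout(["name", "type", "col"], [3, 1]))
```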

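2.1 Extracting the schema of a text file how_it_works

This section introduces extractTextSchema for obtaining field names and types, and explains that the detected types can be changed in the result.

extractTextSchema returns the schema of a text file (field names and data types).
If the automatically detected types are not what you expect, the returned schema table can be modified.
SQL statements can be used to modify the schema table.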
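
To see why an extracted schema sometimes needs editing, the Python sketch below infers types from a small sample and then overrides one of them. It is a simplified model with assumed names (infer_type, extract_schema) and a deliberately tiny sample size; it does not reproduce DolphinDB's inference rules.

```python
def infer_type(values):
    """Guess a column type from sample values: INT, DOUBLE, or STRING."""
    for type_name, caster in (("INT", int), ("DOUBLE", float)):
        try:
            for v in values:
                caster(v)
            return type_name
        except ValueError:
            continue
    return "STRING"


def extract_schema(header, rows, sample_size=2):
    """Build {name, type} records from the first sample_size rows only."""
    sample = rows[:sample_size]
    return [
        {"name": name, "type": infer_type([r[i] for r in sample])}
        for i, name in enumerate(header)
    ]


header = ["symbol", "volume"]
rows = [["600000", "100"], ["600004", "250"], ["AAPL", "1.5"]]
schema = extract_schema(header, rows)

# The sample saw only digits in "symbol", so it was guessed as INT.
for rec in schema:
    if rec["name"] == "symbol":
        rec["type"] = "SYMBOL"  # manual override, analogous to editing the schema table
print(schema)
```

The third row never reached the sample, so both guesses were made without it; the override corrects the symbol column before the schema is used for loading.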

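2.2 Specifying field names and types how_it_works

This section explains that modifying the schema and passing it to loadText through the schema parameter controls the imported field names and types, and mentions how date and time parsing is handled.

When automatic detection does not meet your needs, specify the schema parameter explicitly.
The schema parameter sets the field name and data type of each column.
Date and time columns need type and format to be set together (covered in a later subsection).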

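2.3 Specifying the format of date and time types how_it_works

This section uses an example to show that a date or time column needs both type and format specified before it is parsed and imported as intended.

If a date or time column is parsed as an unexpected type, set the target type in schema.type.
The parsing format for a date or time column is given in schema.format (for example "MM/dd/yyyy").
Setting the type without setting format may not parse the date or time as intended.
schema.format tells the parser how to read the date or time string.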
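
A short Python illustration shows why a format string is needed: the same text can be read as two different dates. The mapping from "MM/dd/yyyy" to Python's "%m/%d/%Y" is an assumed rough equivalence used only for this sketch.

```python
from datetime import datetime

raw = "01/02/2018"

# Assumed rough equivalents: "MM/dd/yyyy" ~ "%m/%d/%Y", "dd/MM/yyyy" ~ "%d/%m/%Y".
as_month_first = datetime.strptime(raw, "%m/%d/%Y").date()
as_day_first = datetime.strptime(raw, "%d/%m/%Y").date()

print(as_month_first)  # 2018-01-02
print(as_day_first)    # 2018-02-01
```

Both readings are valid dates, so the type alone cannot tell the parser which one is meant.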

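2.4 Importing selected columns how_it_works

This section explains that schema.col can restrict the import to selected columns, that column indices start at 0, and that the column order cannot be changed during import.

schema.col can be used to import only some of the columns.
Column indices start at 0.
The column order of the source file cannot be changed during import.
To change the column order, import first and then use reorderColumns!.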
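
The two-step pattern (select columns in file order, reorder afterwards) can be sketched in Python. The helpers load_columns and reorder are hypothetical stand-ins for the import step and for reorderColumns!; they only mirror the constraints stated above.

```python
import csv
import io


def load_columns(text, wanted):
    """Keep only the columns at the ascending 0-based indices in wanted."""
    if list(wanted) != sorted(wanted):
        raise ValueError("column indices must be ascending")
    return [[row[i] for i in wanted] for row in csv.reader(io.StringIO(text))]


def reorder(rows, new_order):
    """Rearrange already-loaded columns; new_order indexes the loaded rows."""
    return [[row[i] for i in new_order] for row in rows]


text = "symbol,date,open,close\nA,2018.01.02,1.0,1.1\n"
loaded = load_columns(text, [0, 2, 3])  # symbol, open, close
swapped = reorder(loaded, [0, 2, 1])    # symbol, close, open
print(swapped)  # [['symbol', 'close', 'open'], ['A', '1.1', '1.0']]
```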

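2.5 Skipping the first rows of text data (skipRows) how_it_works

This section explains that skipRows skips the first n rows (at most 1024), how it affects column-name detection, and how to keep the column names.

skipRows skips the first n rows of a file during import.
The maximum value of skipRows is 1024.
All four loading functions covered in the tutorial support this parameter.
If the first row holds the column names, skipRows skips it as well, and the columns receive default names (such as col0).
To keep the column names, first extract the schema with extractTextSchema, then pass both schema and skipRows to loadText.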
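
The interaction between skipping rows and column names can be modelled as follows. This Python sketch is a simplification with assumed names (load_rows, MAX_SKIP_ROWS): it treats any skipped load as header-less, whereas the real functions apply their own detection to whatever row comes first.

```python
MAX_SKIP_ROWS = 1024  # limit stated in the tutorial


def load_rows(lines, skip_rows=0, names=None):
    """Skip the first skip_rows lines, then name the columns."""
    if not 0 <= skip_rows <= MAX_SKIP_ROWS:
        raise ValueError("skip_rows must be between 0 and 1024")
    rows = [line.split(",") for line in lines[skip_rows:]]
    if names is None:
        if skip_rows == 0:
            # Nothing skipped: use the first line as the header.
            names, rows = rows[0], rows[1:]
        else:
            # The header line was skipped: fall back to default names.
            names = [f"col{i}" for i in range(len(rows[0]))]
    return names, rows


lines = ["symbol,price", "A,1.0", "B,2.0", "C,3.0"]
saved_names = lines[0].split(",")  # stands in for extracting the schema first

print(load_rows(lines, skip_rows=2))
# (['col0', 'col1'], [['B', '2.0'], ['C', '3.0']])
print(load_rows(lines, skip_rows=2, names=saved_names))
# (['symbol', 'price'], [['B', '2.0'], ['C', '3.0']])
```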

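3. Importing data in parallel how_it_works

This section covers two parallel scenarios, multi-threaded loading of a single file into memory and parallel import of multiple files into a database, and gives example data and performance results.

ploadText uses multiple threads to load a single text file into memory as a partitioned in-memory table.
The degree of parallelism depends on the number of CPU cores and the node's localExecutors setting.
The example generates a text file of about 4GB for the performance comparison.
In the example, loadText takes 12629.492 ms.
In the example, ploadText takes 2669.702 ms.
The example reports ploadText as about 4.5 times as fast as loadText (the two timings listed here give a ratio nearer 4.7).
loadTextEx can import text into a distributed database, a local disk database, or an in-memory database.
A partitioned table does not allow multiple threads to write to the same partition at the same time.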
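
The speedup can be checked directly from the two timings quoted in this section. The arithmetic below uses only those two numbers; the variable names are illustrative.

```python
load_text_ms = 12629.492   # loadText timing from the example
pload_text_ms = 2669.702   # ploadText timing from the example

speedup = load_text_ms / pload_text_ms
print(round(speedup, 2))  # 4.73
```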

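4. Preprocessing before import into a database (transform) how_it_works

This section explains that loadTextEx provides a transform parameter for preprocessing the unpartitioned in-memory table before it is written to the database, and gives examples of type conversion and null filling.

Only loadTextEx provides a transform parameter for preprocessing before the write.
transform accepts a function that takes exactly one argument.
The input to transform is an unpartitioned in-memory table, and its output is also an unpartitioned in-memory table.
Inside a custom transform function, in-place functions (those ending in "!") are recommended for better performance.
In the example, after transform is applied the time column is stored as TIME rather than the INT found in the text file.
In the example, after transform is applied the tradingDay column is stored as MONTH rather than the DATE found in the text file.
Partial application can turn a built-in function that takes several arguments into a single-argument function usable as transform.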
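
Partial application, as used to produce a one-argument transform, has a close analogue in Python's functools.partial. The sketch below is an analogy only: null_fill and load_with_transform are hypothetical stand-ins, and a dict of lists plays the role of the in-memory table.

```python
from functools import partial


def null_fill(table, fill_value):
    """Replace None entries in every column of a dict-of-lists table, in place."""
    for column in table.values():
        for i, v in enumerate(column):
            if v is None:
                column[i] = fill_value
    return table


def load_with_transform(table, transform):
    """Stand-in for a loader that applies transform before writing the table."""
    return transform(table)


fill_zero = partial(null_fill, fill_value=0)  # now a one-argument function
result = load_with_transform({"price": [1.0, None], "qty": [None, 5]}, fill_zero)
print(result)  # {'price': [1.0, 0], 'qty': [0, 5]}
```

Fixing the fill value in advance leaves the table as the only remaining argument, which is the shape a transform function must have.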

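5. Customizing data import with Map-Reduce how_it_works

This section introduces chunk-by-chunk Map-Reduce import and custom processing with textChunkDS and mr, and covers the constraint on parallel writes and an example that loads only the first and last chunks.

Data can be split by rows and imported in a customized way using Map-Reduce.
textChunkDS splits the file, and mr writes the chunks to the database.
The data can be processed flexibly before it is written.
The example splits the file into 300MB chunks; a file of about 1GB yields 4 chunks.
If several chunks may write to the same partition, set parallel=false in mr to avoid the exception raised by concurrent writes to one partition.
To load only the first and last chunks of a large file, select ds.head() and ds.tail() and union the results.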
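
The chunk arithmetic and the head/tail selection can be sketched in Python. This is a model under assumptions: "about 1GB" is taken as 1024MB, chunks are represented by plain labels, and chunk_count, head_and_tail, and run_sequentially are hypothetical helpers rather than DolphinDB calls.

```python
import math


def chunk_count(file_mb, chunk_mb):
    """Number of chunks needed to cover a file of file_mb megabytes."""
    return math.ceil(file_mb / chunk_mb)


def head_and_tail(chunks):
    """Return the first and last chunk (a single chunk is returned once)."""
    return chunks[:1] if len(chunks) < 2 else [chunks[0], chunks[-1]]


def run_sequentially(chunks, write):
    """Process chunks one at a time, the safe choice when chunks may share a partition."""
    return [write(c) for c in chunks]


print(chunk_count(1024, 300))                      # 4
print(head_and_tail(["c0", "c1", "c2", "c3"]))     # ['c0', 'c3']
print(run_sequentially(["c0", "c3"], str.upper))   # ['C0', 'C3']
```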

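6. Other notes limitations

This section covers encoding (the UTF-8 requirement and the conversion functions), numeric parsing rules, and the handling of double quotes around CSV fields.

Because DolphinDB strings are UTF-8, files to be loaded must be UTF-8 encoded.
convertEncode, fromUTF8, and toUTF8 are provided for converting encodings after import.
Numeric parsing recognizes plain digits, thousands separators, decimals, and scientific notation.
Letters and symbols surrounding a number are ignored when it is parsed.
A field that contains no digits is parsed as NULL.
When a CSV field is wrapped in double quotes, the enclosing quotes are stripped automatically.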
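
The numeric parsing rules listed above can be approximated with a regular expression. The Python below is a sketch of those rules as stated, not DolphinDB's parser; parse_number is a hypothetical name, and edge cases the tutorial does not describe (for example a field containing several numbers) are resolved here by taking the first match.

```python
import re

# Optional sign, digits with optional thousands separators,
# optional decimal part, optional exponent.
NUMBER = re.compile(r"[-+]?\d[\d,]*(?:\.\d+)?(?:[eE][-+]?\d+)?")


def parse_number(field):
    """Extract the first number in field, ignoring surrounding text; None if absent."""
    match = NUMBER.search(field)
    if match is None:
        return None
    return float(match.group().replace(",", ""))


print(parse_number("100,000"))      # 100000.0
print(parse_number("1.23E5"))       # 123000.0
print(parse_number("$1,000.5abc"))  # 1000.5
print(parse_number("abc"))          # None
```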

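Appendix: example data file link misc

This section provides the download link for the tutorial's example data file, candle_201801.csv.

The example data file is candle_201801.csv.
The link points to GitHub (referenced through a Zhihu redirect link).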

Facts Index

Entity	Attribute	Value	Confidence
DolphinDB文本数据加载教程	publication_date	2021-08-05	high
DolphinDB文本数据加载教程	author	Junxi	high
DolphinDB	provides_text_import_functions	Provides 4 functions for importing text data into memory or databases: loadText, ploadText, loadTextEx, textChunkDS.	high
loadText	purpose	Imports a text file as an in-memory table.	high
ploadText	purpose	Imports a text file in parallel as a partitioned in-memory table; faster than loadText.	high
loadTextEx	purpose	Imports a text file into a database (distributed database, local disk database, or in-memory database).	high
textChunkDS	purpose	Splits a text file into multiple small data sources and then uses mr for flexible data processing.	high
DolphinDB text import performance	comparison_claim	Compared with ClickHouse, MemSQL, Druid, and Pandas, single-thread import is faster (up to an order-of-magnitude advantage), and the advantage of multi-thread parallel import is more pronounced.	low
DolphinDB text import	automatic_format_detection	Can automatically recognize data format when importing text, including field name recognition and data type recognition.	high
DolphinDB automatic header detection	rule	If the first row has no column starting with a digit, the system treats the first row as a header containing field names.	high
DolphinDB auto type inference	accuracy_note	Types are inferred from a small sample; some columns may be misidentified.	medium
DolphinDB data type auto-recognition	unsupported_types	Does not currently support auto-recognition of UUID and IPADDR types; planned for future versions.	high
schema()	purpose	schema function can view table structure including field names and data types.	high
schema parameter for text loading functions	schema_table_columns	schema parameter can be a table containing columns: name (string, column name), type (string, data type), format (string, date/time format), col (int, index of columns to load, must be ascending).	high
schema parameter for text loading functions	required_columns_order	name and type columns are required and must be the first two columns; format and col are optional and can appear in any order.	high
extractTextSchema	purpose	Gets schema of a text file, including field names and data types.	high
extractTextSchema output	editable	After obtaining schemaTB, if auto-parsed types are not as expected, you can modify the schema table with SQL statements.	high
loadText schema override	capability	If auto-detected field names/types are not as required, you can specify schema parameter to set field names and data types for each column.	high
date/time parsing with schema	format_needed	For date/time columns, if the parsed type is not expected, you must set the desired type in schema.type and specify the format in schema.format (e.g., "MM/dd/yyyy").	high
Import selected columns	capability	You can use schema.col to import only specified columns from a text file.	high
schema.col	indexing_rule	Column indices start from 0.	high
Import selected columns	order_constraint	Cannot change the order of columns during import; to adjust order, load first then use reorderColumns!	high
skipRows parameter	purpose_and_limit	skipRows can skip the first n rows when importing; maximum value is 1024; supported by all four loading functions described.	high
skipRows behavior	header_skipped_note	If the first row contains column names, it will be skipped when using skipRows, causing default column names (col0, col1, etc.).	high
skipRows with preserved column names	method	To preserve column names while skipping rows, first extract schema with extractTextSchema and pass schema when calling loadText with skipRows.	high
ploadText	behavior	Uses multi-threading to load a single text file into memory, creating an in-memory partitioned table; parallelism depends on CPU core count and node localExecutors configuration.	high
loadText vs ploadText performance example	test_file_size	Generated a text file of about 4GB for performance comparison.	medium
loadText vs ploadText performance example	cpu_spec	Example node uses a 6-core, 12-hyperthread CPU.	high
loadText vs ploadText performance example	loadText_time_ms	12629.492 ms	high
loadText vs ploadText performance example	ploadText_time_ms	2669.702 ms	high
loadText vs ploadText performance example	speedup_claim	ploadText performance is about 4.5x of loadText under that configuration.	high
loadTextEx	database_targets	Can import text into distributed database, local disk database, or in-memory database.	high
loadTextEx	implementation_note	When importing into a distributed database, data is first loaded into memory then written to the database; both steps are done by one function for efficiency.	high
Multi-file import example	generated_files_and_size	Generated 100 files totaling about 778MB, including 10 million records.	high
Partitioned table concurrent writes	constraint	DolphinDB partitioned tables do not allow multiple threads to write to the same partition simultaneously; ensure no concurrent writes to the same partition when designing concurrent read/write.	high
getRecentJobs	purpose	getRecentJobs can obtain the status of the most recent n batch jobs on the local node.	high
Multi-file parallel import example	parallel_level	parallelLevel=10 threads used for parallel import.	high
Multi-file parallel import example	parallel_elapsed_ms	1590 ms (approx. 1.59 s) on 6-core/12-hyperthread CPU (computed as max(endTime)-min(startTime)).	high
Multi-file single-thread import example	single_thread_elapsed_ms	8647.645 ms (approx. 8.65 s)	high
Multi-file import example	speedup_claim	Parallel import with 10 threads is about 5.5x faster than single-thread sequential import under that configuration.	high
Imported table record count example	count	10000000	high
loadTextEx transform parameter	availability	Only loadTextEx provides a transform parameter for preprocessing before importing into database.	high
transform parameter	function_signature	transform accepts a function that takes exactly one parameter; input is an unpartitioned in-memory table and output is also an unpartitioned in-memory table.	high
transform function performance	recommendation	Within custom transform functions, prefer local in-place modifications (functions with '!') to improve performance.	medium
transform usage example (INT time to TIME)	result_claim	After using transform foo, the time column is stored as TIME type rather than INT type from the text file.	high
transform usage example (DATE to MONTH)	result_claim	After using transform fee, the tradingDay column is stored as MONTH type rather than DATE type from the text file.	high
Partial application	use_case_in_transform	When a built-in function requires multiple parameters, partial application can convert it into a one-parameter function for use as transform (example: nullFill!{,0}).	high
Map-Reduce custom import	capability	DolphinDB supports using Map-Reduce to customize data import by splitting data by rows and importing via Map-Reduce.	high
textChunkDS + mr	workflow	Use textChunkDS to split a file into small data sources and use mr to write into a database; users can perform flexible processing before writing.	high
textChunkDS split example	chunk_size_mb	Split file by 300MB per chunk, resulting in 4 parts (for ~1GB file example).	high
mr parallel writes to partitions	constraint	If chunks may contain the same partition, set mr parallel=false because DolphinDB disallows concurrent writes to the same partition; otherwise an exception occurs.	high
textChunkDS head/tail loading	capability	Can load only the first and last chunks of a large file by selecting ds.head() and ds.tail() and unioning results.	high
Text file encoding for DolphinDB strings	encoding_requirement	Because DolphinDB strings use UTF-8, files to be loaded must be UTF-8; other encodings can be converted after import.	high
Encoding conversion functions	provided_functions	Provides convertEncode, fromUTF8, and toUTF8 functions for converting string encodings after import.	high
Numeric parsing (when schema specifies numeric types)	recognized_formats	Recognizes numeric values in forms: plain digits (e.g., 123), comma-separated (e.g., 100,000), decimals (e.g., 1.231), scientific notation (e.g., 1.23E5).	high
Numeric parsing behavior	symbol_ignoring_and_null_rule	During import, DolphinDB ignores letters and other symbols around numbers; if no digits appear, parses as NULL.	high
CSV import quoting behavior	double_quote_handling	Automatically strips surrounding double quotes from text fields when processing CSV fields that are quoted.	high
Tutorial example data file	download_link	candle_201801.csv available at https://github.com/dolphindb/Tutorials_CN/blob/master/data/candle_201801.csv (linked via zhihu redirect).	high