DolphinDB Text Data Loading Tutorial

This page is a tutorial on loading text data (such as CSV) into DolphinDB, and it includes author and publication-date information.

Source: https://dolphindb.cn/blogs/66

What this page covers

Skills Certification Bootcamp, Session 2 registration promotion cta

The top of the page contains bootcamp registration promotion material and a sign-up link.

DolphinDB Text Data Loading Tutorial (title/author/date) product_overview

This part gives the tutorial title, author byline, and publication date.

Overview of text-import functions and performance claims product_overview

This part lists 4 text import/processing functions and their purposes, along with import-performance comparison claims and a statement of the tutorial's scope.

1. Automatic recognition of data format how_it_works

This part explains how field names and data types are recognized automatically on import, notes the limitations, and shows a loadText example plus how to inspect the resulting schema.
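A minimal sketch of the auto-detection flow described above; the file path is an assumption, and candle_201801.csv is the tutorial's sample file:

```dolphindb
dataFilePath = "/home/data/candle_201801.csv"   // hypothetical local path to the sample file
t = loadText(dataFilePath)       // field names and column types are inferred automatically
t.schema().colDefs               // inspect the inferred column names and types
```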

2. Specifying the import format (the schema parameter) how_it_works

This part explains that the schema parameter can specify column names, types, date/time formats, and column indices, and gives the structural requirements for the schema table together with examples.
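Following the requirements stated above (name and type are mandatory and must be the first two columns; format and col are optional), a schema table might be built like this. The column names and types are illustrative, assuming a file with exactly these three columns:

```dolphindb
dataFilePath = "/home/data/candle_201801.csv"      // hypothetical path
// name and type must be the first two columns of the schema table
schemaTB = table(`symbol`tradingDay`open as name,  // column names (illustrative)
                 `SYMBOL`DATE`DOUBLE as type)      // target DolphinDB types
t = loadText(dataFilePath, , schemaTB)             // delimiter argument left at its default
```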

2.1 Extracting a text file's schema how_it_works

This part introduces extractTextSchema for obtaining field names and data types, and notes that the parsed types in the result can be modified.
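A sketch of the extraction step (the path is an assumption):

```dolphindb
dataFilePath = "/home/data/candle_201801.csv"   // hypothetical path
schemaTB = extractTextSchema(dataFilePath)      // returns a table with name and type columns
schemaTB                                        // review the inferred schema before loading
```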

2.2 Specifying field names and types how_it_works

This part explains that import field names and types can be controlled by modifying the schema table and passing it as the schema parameter of loadText, and mentions how date/time parsing is handled.
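A sketch of the modify-then-load pattern described above; the path and the column name are assumptions:

```dolphindb
dataFilePath = "/home/data/candle_201801.csv"             // hypothetical path
schemaTB = extractTextSchema(dataFilePath)
// suppose the symbol column was auto-inferred as INT but should be SYMBOL
update schemaTB set type = "SYMBOL" where name = "symbol"
t = loadText(dataFilePath, , schemaTB)                    // load with the corrected schema
```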

2.3 Specifying formats for date and time types how_it_works

This part shows by example that date/time columns need both type and format specified in the schema for values to be parsed and imported as expected.
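A sketch of setting both type and format, assuming a tradingDay column stored in the file as "MM/dd/yyyy" text (path and column name are assumptions):

```dolphindb
dataFilePath = "/home/data/candle_201801.csv"               // hypothetical path
schemaTB = extractTextSchema(dataFilePath)
update schemaTB set type = "DATE" where name = "tradingDay"
schemaTB[`format] = ""                                      // add an empty format column
update schemaTB set format = "MM/dd/yyyy" where name = "tradingDay"
t = loadText(dataFilePath, , schemaTB)                      // tradingDay parsed as DATE
```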

2.4 Importing selected columns how_it_works

This part explains that schema.col can be used to import only the specified columns; column indices start from 0, and column order cannot be changed during import.
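A sketch of column selection via schema.col; the path and column names are assumptions. Tagging each row with its original index before filtering keeps col ascending, as required:

```dolphindb
dataFilePath = "/home/data/candle_201801.csv"     // hypothetical path
schemaTB = extractTextSchema(dataFilePath)
schemaTB[`col] = rowNo(schemaTB.name)             // record each column's original 0-based index
// keep only the desired columns; col stays ascending, so file order is preserved
schemaTB = select * from schemaTB where name in `symbol`tradingDay`open`close
t = loadText(dataFilePath, , schemaTB)
// to change the column order, reorder after loading:
reorderColumns!(t, `tradingDay`symbol`open`close)
```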

2.5 Skipping the first rows of a text file (skipRows) how_it_works

This part explains that skipRows skips the first n rows (up to 1024), describes its effect on column-name recognition, and gives a way to preserve column names.
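A sketch of the name-preserving approach described above (path and row count are assumptions): extract the schema before skipping, so the header row's names survive even though the header line itself is skipped.

```dolphindb
dataFilePath = "/home/data/candle_201801.csv"   // hypothetical path
// extract the schema first so the header row's names are retained
schemaTB = extractTextSchema(dataFilePath)
t = loadText(dataFilePath, , schemaTB, 1000)    // skip the first 1000 rows (limit: 1024)
```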

3. Importing data in parallel how_it_works

This part covers two parallel scenarios, multi-threaded loading of a single file into memory and parallel import of multiple files into a database, and gives sample data and performance results.
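A sketch of the multi-file scenario: one background job per file via submitJob, with loadTextEx partially applied. The database URI, table name, partition column, and directory are all assumptions; each file must target distinct partitions, since DolphinDB disallows concurrent writes to the same partition.

```dolphindb
db = database("dfs://stockDB")                    // assumes this partitioned database exists
dataDir = "/home/data/multiImport/"               // hypothetical directory of CSV files
for (f in (exec filename from files(dataDir)))
    submitJob("load_" + f, "parallel csv import",
              loadTextEx{db, "stockData", `tradingDay, dataDir + f})
getRecentJobs()                                   // check the status of the batch jobs
```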

4. Preprocessing before importing into a database (transform) how_it_works

This part explains the transform parameter of loadTextEx, which preprocesses the unpartitioned in-memory table before it is written to the database, and gives examples of type conversion and null filling.
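A preprocessing sketch in the spirit of the tutorial's type-conversion example: convert an INT time column into TIME before the data is written. The database, table name, partition column, file path, and function name are assumptions; the transform function must take exactly one in-memory table and return one.

```dolphindb
// hypothetical transform: turn an INT column like 93000000 into a TIME value
def i2t(mutable t) {
    return t.replaceColumn!(`time, time(t.time / 10))   // in-place (!) for performance
}
db = database("dfs://stockDB")                           // assumed existing database
dataFilePath = "/home/data/candle_201801.csv"            // hypothetical path
loadTextEx(db, "stockData", `tradingDay, dataFilePath, , , , i2t)
```

A multi-parameter built-in can likewise be adapted into a one-parameter transform via partial application, e.g. passing nullFill!{,0} as the transform argument.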

5. Customizing data import with Map-Reduce how_it_works

This part introduces chunk-wise Map-Reduce import and custom processing using textChunkDS and mr, including the constraint on parallel writes and an example of loading only the first and last chunks.
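A sketch of the textChunkDS + mr workflow: split a large file into roughly 300 MB chunks and append each chunk to a partitioned table. The file path, database URI, and table name are assumptions.

```dolphindb
dataFilePath = "/home/data/bigFile.csv"          // hypothetical large file (~1 GB)
ds = textChunkDS(dataFilePath, 300)              // chunk size in MB
// parallel=false because different chunks may contain rows of the same partition,
// and DolphinDB disallows concurrent writes to one partition
mr(ds = ds, mapFunc = append!{loadTable("dfs://stockDB", "stockData")}, parallel = false)
```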

6. Other notes limitations

This part covers encoding (the UTF-8 requirement and conversion functions), numeric parsing rules, and the handling of double-quoted CSV fields.
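A sketch of the post-import conversion path for a non-UTF-8 file, since DolphinDB strings use UTF-8 internally. The file path and column name are assumptions:

```dolphindb
dataFilePath = "/home/data/gbkFile.csv"          // hypothetical GBK-encoded file
t = loadText(dataFilePath)
// convert the string column from GBK to UTF-8 after loading
t.replaceColumn!(`name, convertEncode(t.name, "gbk", "utf-8"))
```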

Appendix: sample data file link misc

This part provides the download link for the tutorial's sample data file candle_201801.csv.

Facts Index

| Entity | Attribute | Value | Confidence |
| --- | --- | --- | --- |
| DolphinDB Text Data Loading Tutorial | publication_date | 2021-08-05 | high |
| DolphinDB Text Data Loading Tutorial | author | Junxi | high |
| DolphinDB | provides_text_import_functions | Provides 4 functions for importing text data into memory or databases: loadText, ploadText, loadTextEx, textChunkDS. | high |
| loadText | purpose | Imports a text file as an in-memory table. | high |
| ploadText | purpose | Imports a text file in parallel as a partitioned in-memory table; faster than loadText. | high |
| loadTextEx | purpose | Imports a text file into a database (distributed, local disk, or in-memory). | high |
| textChunkDS | purpose | Splits a text file into multiple small data sources, then uses mr for flexible data processing. | high |
| DolphinDB text import performance | comparison_claim | Compared with ClickHouse, MemSQL, Druid, and Pandas, single-threaded import is faster (up to an order-of-magnitude advantage), and the advantage of multi-threaded parallel import is more pronounced. | low |
| DolphinDB text import | automatic_format_detection | Automatically recognizes the data format on import, including field names and data types. | high |
| DolphinDB automatic header detection | rule | If no column in the first row starts with a digit, the system treats the first row as a header containing field names. | high |
| DolphinDB auto type inference | accuracy_note | Types are inferred from a small sample; some columns may be misidentified. | medium |
| DolphinDB data type auto-recognition | unsupported_types | Does not currently auto-recognize UUID and IPADDR types; planned for future versions. | high |
| schema() | purpose | The schema function shows a table's structure, including field names and data types. | high |
| schema parameter (text loading functions) | schema_table_columns | Can be a table with columns: name (string, column name), type (string, data type), format (string, date/time format), and col (int, indices of columns to load, must be ascending). | high |
| schema parameter (text loading functions) | required_columns_order | name and type are required and must be the first two columns; format and col are optional and may appear in any order. | high |
| extractTextSchema | purpose | Gets the schema of a text file, including field names and data types. | high |
| extractTextSchema output | editable | If the auto-parsed types in the returned schema table are not as expected, the table can be modified with SQL statements. | high |
| loadText schema override | capability | If auto-detected field names or types are not as required, the schema parameter can set each column's name and data type. | high |
| date/time parsing with schema | format_needed | For date/time columns whose parsed type is not as expected, set the desired type in schema.type and specify the format in schema.format (e.g., "MM/dd/yyyy"). | high |
| Import selected columns | capability | schema.col can be used to import only specified columns from a text file. | high |
| schema.col | indexing_rule | Column indices start from 0. | high |
| Import selected columns | order_constraint | Column order cannot be changed during import; to adjust the order, load first, then use reorderColumns!. | high |
| skipRows parameter | purpose_and_limit | Skips the first n rows on import; maximum value is 1024; supported by all four loading functions described. | high |
| skipRows behavior | header_skipped_note | If the first row contains column names, it is skipped along with the others, yielding default column names (col0, col1, etc.). | high |
| skipRows with preserved column names | method | To keep column names while skipping rows, first extract the schema with extractTextSchema and pass it to loadText together with skipRows. | high |
| ploadText | behavior | Loads a single text file into memory with multiple threads, producing an in-memory partitioned table; parallelism depends on the CPU core count and the node's localExecutors configuration. | high |
| loadText vs ploadText performance example | test_file_size | A generated text file of about 4 GB. | medium |
| loadText vs ploadText performance example | cpu_spec | The example node uses a 6-core, 12-hyperthread CPU. | high |
| loadText vs ploadText performance example | loadText_time_ms | 12629.492 ms | high |
| loadText vs ploadText performance example | ploadText_time_ms | 2669.702 ms | high |
| loadText vs ploadText performance example | speedup_claim | ploadText is about 4.5x faster than loadText under that configuration. | high |
| loadTextEx | database_targets | Can import text into a distributed, local disk, or in-memory database. | high |
| loadTextEx | implementation_note | When importing into a distributed database, data is first loaded into memory and then written to the database; one function performs both steps for efficiency. | high |
| Multi-file import example | generated_files_and_size | 100 generated files totaling about 778 MB and 10 million records. | high |
| Partitioned table concurrent writes | constraint | DolphinDB partitioned tables do not allow multiple threads to write to the same partition simultaneously; concurrent read/write designs must avoid concurrent writes to the same partition. | high |
| getRecentJobs | purpose | Returns the status of the most recent n batch jobs on the local node. | high |
| Multi-file parallel import example | parallel_level | parallelLevel=10 threads used for parallel import. | high |
| Multi-file parallel import example | parallel_elapsed_ms | 1590 ms (approx. 1.59 s) on the 6-core/12-hyperthread CPU, computed as max(endTime) - min(startTime). | high |
| Multi-file single-thread import example | single_thread_elapsed_ms | 8647.645 ms (approx. 8.65 s) | high |
| Multi-file import example | speedup_claim | Parallel import with 10 threads is about 5.5x faster than single-threaded sequential import under that configuration. | high |
| Imported table record count example | count | 10000000 | high |
| loadTextEx transform parameter | availability | Only loadTextEx provides a transform parameter for preprocessing before writing to the database. | high |
| transform parameter | function_signature | Accepts a function taking exactly one parameter; both its input and its output are unpartitioned in-memory tables. | high |
| transform function performance | recommendation | Within custom transform functions, prefer local in-place modifications (functions ending in !) to improve performance. | medium |
| transform usage example (INT time to TIME) | result_claim | After using transform foo, the time column is stored as TIME rather than the INT type from the text file. | high |
| transform usage example (DATE to MONTH) | result_claim | After using transform fee, the tradingDay column is stored as MONTH rather than the DATE type from the text file. | high |
| Partial application | use_case_in_transform | A built-in function requiring multiple parameters can be converted into a one-parameter function via partial application for use as transform (e.g., nullFill!{,0}). | high |
| Map-Reduce custom import | capability | Supports customized data import by splitting data by rows and importing via Map-Reduce. | high |
| textChunkDS + mr | workflow | textChunkDS splits a file into small data sources and mr writes them into a database; users can perform flexible processing before writing. | high |
| textChunkDS split example | chunk_size_mb | Splitting a roughly 1 GB file at 300 MB per chunk yields 4 parts. | high |
| mr parallel writes to partitions | constraint | If chunks may contain the same partition, set mr's parallel=false, because DolphinDB disallows concurrent writes to the same partition; otherwise an exception occurs. | high |
| textChunkDS head/tail loading | capability | Only the first and last chunks of a large file can be loaded, by selecting ds.head() and ds.tail() and unioning the results. | high |
| Text file encoding for DolphinDB strings | encoding_requirement | Because DolphinDB strings use UTF-8, files to be loaded should be UTF-8; other encodings can be converted after import. | high |
| Encoding conversion functions | provided_functions | convertEncode, fromUTF8, and toUTF8 convert string encodings after import. | high |
| Numeric parsing (when schema specifies numeric types) | recognized_formats | Recognizes numeric values as plain digits (e.g., 123), comma-separated (e.g., 100,000), decimals (e.g., 1.231), and scientific notation (e.g., 1.23E5). | high |
| Numeric parsing behavior | symbol_ignoring_and_null_rule | Ignores letters and other symbols around numbers during import; if no digits appear, the value is parsed as NULL. | high |
| CSV import quoting behavior | double_quote_handling | Automatically strips surrounding double quotes from quoted CSV fields. | high |
| Tutorial example data file | download_link | candle_201801.csv available at https://github.com/dolphindb/Tutorials_CN/blob/master/data/candle_201801.csv (linked via zhihu redirect). | high |
| Skills Certification Bootcamp, Session 2 | registration_link | https://www.qingsuyun.com/h5/e/217471/5/ | high |