Machine Learning with DolphinDB
This page is the header information and overview of a tutorial article, including the topic, author, and publication date, and it outlines DolphinDB's machine learning capabilities and the version on which the examples are based.
Source: https://dolphindb.cn/blogs/71
What this page covers
- Article information for the DolphinDB machine learning tutorial and an overview of its capabilities.
- An example workflow: random forest classification on the UCI wine dataset.
- An example workflow: logistic regression training and feature engineering on distributed data.
- An example of dimensionality reduction with PCA, with the results fed into subsequent training and prediction.
- Plugin-based machine learning training, prediction, and model persistence, demonstrated with the XGBoost plugin.
- Appendix: charts/lists of machine learning training functions, utility functions, and supported plugins.
Machine Learning with DolphinDB
This section gives the article title and header information, and outlines DolphinDB's machine learning capabilities and the version on which the tutorial examples are based.
- The byline credits Junxi as the author.
- The publication date is 2021-08-05.
- DolphinDB implements common machine learning algorithms to support regression, classification, and clustering tasks.
- The tutorial examples are based on DolphinDB 1.10.9.
First Example: Classifying a Small Dataset
This section uses the UCI wine dataset to demonstrate training and evaluating a random forest classifier in DolphinDB, covering data import, preprocessing, training, prediction, evaluation, and model persistence.
- The example data is the wine dataset (wine.data) from the UCI Machine Learning Repository.
- loadText imports the locally downloaded data into DolphinDB with a specified schema.
- randomForestClassifier requires class labels to be integers in the interval [0, classNum).
- The example remaps the original labels 1, 2, 3 to 0, 1, 2 (Label = Label - 1).
- trainTestSplit splits the data into training and test sets at a 7:3 ratio (testRatio=0.3).
- The required parameters of randomForestClassifier are ds, yColName, xColNames, and numClasses.
- model.predict(wineTest) predicts on the test set.
- The example saves the model with saveModel and reloads it with loadModel for prediction.
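Taken together, the steps above correspond to a short DolphinDB script. The sketch below is reconstructed from the facts listed here rather than copied from the article; the file paths, the `wineSchema` schema table, and the feature column names are placeholders, and `trainTestSplit` is the split helper as used in the tutorial.

```dolphindb
// Hedged sketch of the wine workflow; paths, wineSchema, and feature names are placeholders.
wine = loadText("/data/wine.data", schema=wineSchema)

// randomForestClassifier expects integer labels in [0, classNum): remap 1,2,3 -> 0,1,2
update wine set Label = Label - 1

// 7:3 train/test split (testRatio=0.3)
wineTrain, wineTest = trainTestSplit(wine, 0.3)

// required parameters: ds, yColName, xColNames, numClasses
model = randomForestClassifier(
    sqlDS(<select * from wineTrain>),
    yColName=`Label,
    xColNames=`Alcohol`MalicAcid,   // placeholder for the 13 feature columns
    numClasses=3)

// predict on the test set and compute accuracy
predicted = model.predict(wineTest)
accuracy = sum(predicted == wineTest.Label) \ wineTest.size()   // 0.925926 reported

// persist the model and reload it for later prediction
saveModel(model, "/model/wineModel.bin")
model = loadModel("/model/wineModel.bin")
```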
Distributed Machine Learning
This section performs feature engineering on OHLC data partitioned by stock ticker in a distributed DolphinDB environment, then trains and evaluates a logistic regression classification model.
- This part emphasizes that DolphinDB's built-in machine learning algorithms are designed for distributed environments.
- The example data is a distributed database partitioned by stock ticker, holding daily OHLC data from 2010 to 2018.
- Features include Open, High, Low, Close, plus derived columns and indicators (e.g., 10-day moving average, correlation coefficient, RSI).
- The target is whether the next trading day's Close is greater than the current day's Close.
- ffill fills missing values in the raw data.
- After computing the 10-day moving average and RSI, the first 10 rows become null and must be removed (slice [10:]).
- The required parameters of logisticRegression are ds, yColName, and xColNames.
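The pipeline above can be sketched in DolphinDB script. This is a reconstruction from the facts listed, not the article's exact code; in particular, the operands and window of mcorr and the 0/1 encoding of Target are assumptions.

```dolphindb
use ta   // DolphinDBModules ta module, for ta::rsi

def preprocess(t) {
    // forward-fill missing OHLC values
    ohlc = select ffill(Open) as Open, ffill(High) as High,
                  ffill(Low) as Low, ffill(Close) as Close from t
    update ohlc set OpenClose = Open - prev(Close), OpenOpen = Open - prev(Open)
    update ohlc set S_10 = mavg(Close, 10), RSI = ta::rsi(Close, 10)
    update ohlc set Corr = mcorr(Close, S_10, 10)             // operands/window assumed
    update ohlc set Target = iif(next(Close) > Close, 1, 0)   // next day's Close > today's
    return ohlc[10:]   // drop the first 10 rows nulled by the 10-day windows
}

// data source over the distributed table, with preprocessing attached per partition
ohlc = database("dfs://trades").loadTable("ohlc")
ds = sqlDS(<select * from ohlc>).transDS!(preprocess)

// required parameters: ds, yColName, xColNames
model = logisticRegression(ds, yColName=`Target,
    xColNames=`Open`High`Low`Close`OpenClose`OpenOpen`S_10`RSI`Corr)
```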
Dimensionality Reduction with PCA
This section explains what PCA is used for and, using the wine example, demonstrates computing principal components, selecting a subset of them, and feeding the reduced data into subsequent training and prediction.
- PCA reduces dimensionality by mapping data to a lower-dimensional space while minimizing information loss.
- The example's input has 13 features (xColNames lists 13 columns).
- The pca call uses normalize=true to normalize the data.
- The example keeps the first three principal components (components = pcaRes.components.transpose()[:3]).
- principalComponents applies the component matrix to the data and returns a result table that preserves the y column.
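The PCA steps above can be sketched as follows. This is a reconstruction from the facts listed: the pca keyword arguments besides normalize are assumptions, the feature names are placeholders, and principalComponents is shown as a small helper consistent with the description (projection via matrix(t[xColNames]).dot(components), y column preserved).

```dolphindb
// xColNames: the 13 wine feature columns (placeholder names here)
xColNames = `Alcohol`MalicAcid   // ... 13 columns in total

// run PCA with normalization on the training set
pcaRes = pca(sqlDS(<select * from wineTrain>), colNames=xColNames, normalize=true)
pcaRes.explainedVarianceRatio    // inspect each component's variance contribution

// the first three components carry most of the variance; keep them
components = pcaRes.components.transpose()[:3]

// project a table onto the components and keep the label column
def principalComponents(t, components, xColNames, yColName) {
    pc = matrix(t[xColNames]).dot(components)
    result = table(pc[0] as col0, pc[1] as col1, pc[2] as col2)
    result[yColName] = t[yColName]
    return result
}

wineTrainPC = principalComponents(wineTrain, components, xColNames, `Label)
model = randomForestClassifier(sqlDS(<select * from wineTrainPC>),
    yColName=`Label, xColNames=`col0`col1, numClasses=3)
```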
Machine Learning with DolphinDB Plugins
This section introduces DolphinDB's machine learning plugins using the XGBoost plugin as an example: obtaining and loading the plugin, training and prediction, parameter notes, model persistence, and incremental training.
- DolphinDB provides plugins for calling third-party machine learning libraries; the XGBoost plugin serves as the example.
- The example requires downloading the precompiled XGBoost plugin from the DolphinDBPlugin GitHub page.
- loadPlugin loads the XGBoost plugin via the path to PluginXgboost.txt.
- The signature of xgboost::train is xgboost::train(Y, X, [params], [numBoostRound=10], [xgbModel]).
- In the training inputs, Y is the Label column of wineTrain and X is the remaining feature columns.
- The example parameters include objective="multi:softmax" and num_class=3.
- The example persists and reloads the model with xgboost::saveModel and xgboost::loadModel.
- xgboost::train can continue training from an existing model via the xgbModel parameter.
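The plugin workflow above can be sketched as follows. This is a reconstruction from the facts listed; the plugin path is a placeholder, and xgboost::predict is assumed as the plugin's prediction function.

```dolphindb
loadPlugin("/path/to/PluginXgboost.txt")

// Y: the Label column of wineTrain; X: the remaining feature columns
Y = wineTrain.Label
X = select * from wineTrain
X.dropColumns!(`Label)

// multi-class parameters as listed in the facts
params = {objective: "multi:softmax", num_class: 3, max_depth: 5, eta: 0.1, subsample: 0.9}
model = xgboost::train(Y, X, params)            // numBoostRound defaults to 10

// evaluate on the test set (xgboost::predict assumed)
testX = select * from wineTest
testX.dropColumns!(`Label)
predicted = xgboost::predict(model, testX)
accuracy = sum(predicted == wineTest.Label) \ wineTest.size()   // 0.962963 reported

// persist, reload, and continue training from an existing model
xgboost::saveModel(model, "xgboost001.mdl")
model = xgboost::loadModel("xgboost001.mdl")
model = xgboost::train(Y, X, params, , model)   // fifth argument: xgbModel
```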
Appendix: DolphinDB Machine Learning Functions
This appendix references charts/lists summarizing DolphinDB's machine learning training functions, utility functions, and supported plugins.
- An appendix image shows the list of machine learning training functions with their categories and distributed-computation support.
- An appendix image lists the utility functions loadModel, saveModel, and predict.
- An appendix image shows the supported machine learning plugins, including algorithms based on XGBoost and libsvm.
Facts Index
| Entity | Attribute | Value | Confidence |
|---|---|---|---|
| Article | published_date | 2021-08-05 | high |
| Junxi | role | Author (byline) | high |
| DolphinDB | machine_learning_support | Implements common ML algorithms such as least squares regression, random forest, and K-means, enabling regression, classification, and clustering tasks. | medium |
| Tutorial examples | product_version | Based on DolphinDB 1.10.9 | high |
| Example 1 dataset | source | UCI Machine Learning Repository wine dataset (wine.data) | high |
| loadText | purpose | Import local downloaded dataset into DolphinDB using a specified schema. | high |
| randomForestClassifier | label_requirement | Class labels must be integers in [0, classNum). | high |
| wine dataset labels | original_labels | 1, 2, 3 | high |
| wine dataset labels | updated_labels | Updated to 0, 1, 2 by setting Label = Label - 1 | high |
| trainTestSplit function | split_ratio | Splits data into train/test with 7:3 (testRatio=0.3). | high |
| wineTrain | size | 124 | high |
| wineTest | size | 54 | high |
| randomForestClassifier | required_parameters | ds, yColName, xColNames, numClasses | high |
| sqlDS | purpose_in_example | Used to generate the input data source (ds) for randomForestClassifier from wineTrain. | high |
| Random forest wine model | numClasses | 3 | high |
| Model prediction | method | model.predict(wineTest) | high |
| Random forest wine classification | accuracy_on_test | 0.925926 (sum(predicted==wineTest.Label)/wineTest.size()) | high |
| saveModel | purpose | Save trained model to disk. | high |
| loadModel | purpose | Load a model from disk for prediction. | high |
| DolphinDB machine learning vs common libraries | positioning | Designed for distributed environments; many built-in ML algorithms support distributed environments well. | medium |
| Distributed ML example | model_type | Logistic regression classification model training on DolphinDB distributed database. | high |
| Distributed dataset | description | A DolphinDB distributed database partitioned by stock ticker, storing daily OHLC data from 2010 to 2018. | high |
| Distributed example features | predictors | Open, High, Low, Close, Open-prev(Close), Open-prev(Open), 10-day moving average, correlation coefficient, RSI. | high |
| Distributed example target | definition | Whether next day's Close is greater than today's Close. | high |
| ffill | purpose_in_example | Fill missing values in raw data. | high |
| 10-day moving average and RSI computation | data_handling | First 10 rows become null and need to be removed (slice [10:]). | high |
| transDS! | purpose_in_example | Apply preprocessing steps to a data source generated from the raw table. | high |
| DolphinDB ta module | purpose_in_example | Used to compute RSI via ta::rsi(Close, 10); module usage referenced to DolphinDBModules repository. | high |
| preprocess function | engineered_columns | OpenClose, OpenOpen, S_10 (mavg), RSI, Target, Corr (mcorr) and forward-filled OHLC columns. | high |
| Distributed data loading | code_path | ohlc = database("dfs://trades").loadTable("ohlc"); ds = sqlDS(<select * from ohlc>).transDS!(preprocess) | high |
| logisticRegression | required_parameters | ds, yColName, xColNames | high |
| logisticRegression in example | xColNames | Open, High, Low, Close, OpenClose, OpenOpen, S_10, RSI, Corr | high |
| logisticRegression in example | yColName | Target | high |
| AAPL prediction example | reported_accuracy | 0.756522 (as shown in code comment) | medium |
| PCA | purpose | Reduce dimensionality by mapping high-dimensional data to lower-dimensional space while minimizing information loss; also useful for visualization. | medium |
| wine dataset in PCA example | dimension | 13 input variables/features (xColNames lists 13 columns). | high |
| pca function usage | normalize_parameter | normalize=true to perform data normalization. | high |
| pcaRes.explainedVarianceRatio (wineTrain) | values | [0.209316,0.201225,0.121788,0.088709,0.077805,0.075314,0.058028,0.045604,0.038463,0.031485,0.021256,0.018073,0.012934] | high |
| PCA result selection | interpretation | First three components have large variance contribution; compressing to 3 dimensions is sufficient for training (as stated). | medium |
| Principal components selection | operation | components = pcaRes.components.transpose()[:3] (keep first three principal components). | high |
| principalComponents function | purpose | Apply PCA components matrix to dataset (matrix(t[xColNames]).dot(components)) and return as table with y column preserved. | high |
| randomForestClassifier after PCA | xColNames | Uses `col0 and `col1 (as shown) with numClasses=3 in the provided code snippet. | high |
| DolphinDB plugins for ML | statement | DolphinDB provides plugins enabling calling third-party libraries for machine learning; XGBoost plugin used as example. | medium |
| XGBoost plugin setup | source | Download compiled XGBoost plugin from DolphinDBPlugin GitHub page. | high |
| loadPlugin | purpose_in_example | Load XGBoost plugin using path to PluginXgboost.txt. | high |
| xgboost::train | signature | xgboost::train(Y, X, [params], [numBoostRound=10], [xgbModel]) | high |
| XGBoost training inputs in example | Y_and_X | Y is Label column from wineTrain; X is remaining feature columns selected from wineTrain. | high |
| XGBoost multi-class params | objective | multi:softmax | high |
| XGBoost multi-class params | num_class | 3 | high |
| XGBoost parameter guidance | booster_options | booster can be "gbtree" or "gblinear". | high |
| XGBoost parameter guidance | eta | Range [0,1], default 0.3. | high |
| XGBoost parameter guidance | gamma | Range [0,∞], default 0. | high |
| XGBoost parameter guidance | max_depth | Range [0,∞], default 6. | high |
| XGBoost parameter guidance | subsample | Range (0,1], default 1. | high |
| XGBoost parameter guidance | lambda | Default 0. | high |
| XGBoost parameter guidance | alpha | Default 0. | high |
| XGBoost parameter guidance | seed | Default 0. | high |
| XGBoost parameters documentation | reference | Links to official XGBoost parameter documentation. | high |
| Example XGBoost params used | params | { objective: "multi:softmax", num_class: 3, max_depth: 5, eta: 0.1, subsample: 0.9 } | high |
| XGBoost wine classification | accuracy_on_test | 0.962963 (sum(predicted==wineTest.Label)/wineTest.size()) | high |
| xgboost::saveModel | purpose | Persist XGBoost model to file (example: "xgboost001.mdl"). | high |
| xgboost::loadModel | purpose | Load XGBoost model from file (example: "xgboost001.mdl"). | high |
| xgboost::train (incremental training) | xgbModel_parameter | Can specify xgbModel to continue training from an existing model (model = xgboost::train(Y, X, params, , model)). | high |
| Appendix A image | content_description | Image shows a table/list of DolphinDB machine learning training functions, including function names, categories (classification/regression/clustering/dimensionality reduction), descriptions, and whether distributed computation is supported. | medium |
| Appendix B image | content_description | Image lists ML tool functions: loadModel (load from disk), saveModel (persist), predict (predict on new data). | medium |
| Appendix C image | content_description | Image shows supported ML plugins, including XGBoost-based gradient boosting algorithms and libsvm-based algorithms (e.g., SVM). | medium |