Machine Learning with DolphinDB
This page is the header information and overview of a tutorial article, including the topic, author, and publication date, and it outlines DolphinDB's machine learning capabilities and the version on which the examples are based.
Source: https://dolphindb.cn/blogs/71
What this page covers
- Article information for the DolphinDB machine learning tutorial and an overview of its capabilities.
- An example workflow: random forest classification on the UCI wine dataset.
- An example workflow: logistic regression training and feature engineering on distributed data.
- An example of dimensionality reduction with PCA, with the results fed into subsequent training and prediction.
- Plugin-based machine learning training, prediction, and model persistence, demonstrated with the XGBoost plugin.
- Appendix: charts/lists of machine learning training functions, utility functions, and supported plugins.
Machine Learning with DolphinDB
This section gives the article title and header information, and outlines DolphinDB's machine learning capabilities and the version on which the tutorial examples are based.
- The byline credits Junxi as the author.
- The publication date is 2021-08-05.
- DolphinDB implements common machine learning algorithms to support regression, classification, and clustering tasks.
- The tutorial examples are based on DolphinDB 1.10.9.
First Example: Classifying a Small Dataset
This section uses the UCI wine dataset to demonstrate training and evaluating a random forest classifier in DolphinDB, covering data import, preprocessing, training, prediction, evaluation, and model persistence.
- The example data is the wine dataset (wine.data) from the UCI Machine Learning Repository.
- loadText imports the locally downloaded data into DolphinDB with a specified schema.
- randomForestClassifier requires class labels to be integers in the interval [0, classNum).
- The example remaps the original labels 1, 2, 3 to 0, 1, 2 (Label = Label - 1).
- trainTestSplit splits the data into training and test sets at a 7:3 ratio (testRatio=0.3).
- The required parameters of randomForestClassifier are ds, yColName, xColNames, and numClasses.
- model.predict(wineTest) predicts on the test set.
- The example saves the model with saveModel and reloads it with loadModel for prediction.
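Taken together, the steps above correspond to a short DolphinDB script. The sketch below is reconstructed from the facts listed here rather than copied from the article; the file paths, the `wineSchema` schema table, and the feature column names are placeholders, and `trainTestSplit` is the split helper as used in the tutorial.

```dolphindb
// Hedged sketch of the wine workflow; paths, wineSchema, and feature names are placeholders.
wine = loadText("/data/wine.data", schema=wineSchema)

// randomForestClassifier expects integer labels in [0, classNum): remap 1,2,3 -> 0,1,2
update wine set Label = Label - 1

// 7:3 train/test split (testRatio=0.3)
wineTrain, wineTest = trainTestSplit(wine, 0.3)

// required parameters: ds, yColName, xColNames, numClasses
model = randomForestClassifier(
    sqlDS(<select * from wineTrain>),
    yColName=`Label,
    xColNames=`Alcohol`MalicAcid,   // placeholder for the 13 feature columns
    numClasses=3)

// predict on the test set and compute accuracy
predicted = model.predict(wineTest)
accuracy = sum(predicted == wineTest.Label) \ wineTest.size()   // 0.925926 reported

// persist the model and reload it for later prediction
saveModel(model, "/model/wineModel.bin")
model = loadModel("/model/wineModel.bin")
```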
Distributed Machine Learning
This section performs feature engineering on OHLC data partitioned by stock ticker in a distributed DolphinDB environment, then trains and evaluates a logistic regression classification model.
- This part emphasizes that DolphinDB's built-in machine learning algorithms are designed for distributed environments.
- The example data is a distributed database partitioned by stock ticker, holding daily OHLC data from 2010 to 2018.
- Features include Open, High, Low, Close, plus derived columns and indicators (e.g., 10-day moving average, correlation coefficient, RSI).
- The target is whether the next trading day's Close is greater than the current day's Close.
- ffill fills missing values in the raw data.
- After computing the 10-day moving average and RSI, the first 10 rows become null and must be removed (slice [10:]).
- The required parameters of logisticRegression are ds, yColName, and xColNames.
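The pipeline above can be sketched in DolphinDB script. This is a reconstruction from the facts listed, not the article's exact code; in particular, the operands and window of mcorr and the 0/1 encoding of Target are assumptions.

```dolphindb
use ta   // DolphinDBModules ta module, for ta::rsi

def preprocess(t) {
    // forward-fill missing OHLC values
    ohlc = select ffill(Open) as Open, ffill(High) as High,
                  ffill(Low) as Low, ffill(Close) as Close from t
    update ohlc set OpenClose = Open - prev(Close), OpenOpen = Open - prev(Open)
    update ohlc set S_10 = mavg(Close, 10), RSI = ta::rsi(Close, 10)
    update ohlc set Corr = mcorr(Close, S_10, 10)             // operands/window assumed
    update ohlc set Target = iif(next(Close) > Close, 1, 0)   // next day's Close > today's
    return ohlc[10:]   // drop the first 10 rows nulled by the 10-day windows
}

// data source over the distributed table, with preprocessing attached per partition
ohlc = database("dfs://trades").loadTable("ohlc")
ds = sqlDS(<select * from ohlc>).transDS!(preprocess)

// required parameters: ds, yColName, xColNames
model = logisticRegression(ds, yColName=`Target,
    xColNames=`Open`High`Low`Close`OpenClose`OpenOpen`S_10`RSI`Corr)
```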
Dimensionality Reduction with PCA
This section explains what PCA is used for and, using the wine example, demonstrates computing principal components, selecting a subset of them, and feeding the reduced data into subsequent training and prediction.
- PCA reduces dimensionality by mapping data to a lower-dimensional space while minimizing information loss.
- The example's input has 13 features (xColNames lists 13 columns).
- The pca call uses normalize=true to normalize the data.
- The example keeps the first three principal components (components = pcaRes.components.transpose()[:3]).
- principalComponents applies the component matrix to the data and returns a result table that preserves the y column.
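The PCA steps above can be sketched as follows. This is a reconstruction from the facts listed: the pca keyword arguments besides normalize are assumptions, the feature names are placeholders, and principalComponents is shown as a small helper consistent with the description (projection via matrix(t[xColNames]).dot(components), y column preserved).

```dolphindb
// xColNames: the 13 wine feature columns (placeholder names here)
xColNames = `Alcohol`MalicAcid   // ... 13 columns in total

// run PCA with normalization on the training set
pcaRes = pca(sqlDS(<select * from wineTrain>), colNames=xColNames, normalize=true)
pcaRes.explainedVarianceRatio    // inspect each component's variance contribution

// the first three components carry most of the variance; keep them
components = pcaRes.components.transpose()[:3]

// project a table onto the components and keep the label column
def principalComponents(t, components, xColNames, yColName) {
    pc = matrix(t[xColNames]).dot(components)
    result = table(pc[0] as col0, pc[1] as col1, pc[2] as col2)
    result[yColName] = t[yColName]
    return result
}

wineTrainPC = principalComponents(wineTrain, components, xColNames, `Label)
model = randomForestClassifier(sqlDS(<select * from wineTrainPC>),
    yColName=`Label, xColNames=`col0`col1, numClasses=3)
```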
Machine Learning with DolphinDB Plugins
This section introduces DolphinDB's machine learning plugins using the XGBoost plugin as an example: obtaining and loading the plugin, training and prediction, parameter notes, model persistence, and incremental training.
- DolphinDB provides plugins for calling third-party machine learning libraries; the XGBoost plugin serves as the example.
- The example requires downloading the precompiled XGBoost plugin from the DolphinDBPlugin GitHub page.
- loadPlugin loads the XGBoost plugin via the path to PluginXgboost.txt.
- The signature of xgboost::train is xgboost::train(Y, X, [params], [numBoostRound=10], [xgbModel]).
- In the training inputs, Y is the Label column of wineTrain and X is the remaining feature columns.
- The example parameters include objective="multi:softmax" and num_class=3.
- The example persists and reloads the model with xgboost::saveModel and xgboost::loadModel.
- xgboost::train can continue training from an existing model via the xgbModel parameter.
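The plugin workflow above can be sketched as follows. This is a reconstruction from the facts listed; the plugin path is a placeholder, and xgboost::predict is assumed as the plugin's prediction function.

```dolphindb
loadPlugin("/path/to/PluginXgboost.txt")

// Y: the Label column of wineTrain; X: the remaining feature columns
Y = wineTrain.Label
X = select * from wineTrain
X.dropColumns!(`Label)

// multi-class parameters as listed in the facts
params = {objective: "multi:softmax", num_class: 3, max_depth: 5, eta: 0.1, subsample: 0.9}
model = xgboost::train(Y, X, params)            // numBoostRound defaults to 10

// evaluate on the test set (xgboost::predict assumed)
testX = select * from wineTest
testX.dropColumns!(`Label)
predicted = xgboost::predict(model, testX)
accuracy = sum(predicted == wineTest.Label) \ wineTest.size()   // 0.962963 reported

// persist, reload, and continue training from an existing model
xgboost::saveModel(model, "xgboost001.mdl")
model = xgboost::loadModel("xgboost001.mdl")
model = xgboost::train(Y, X, params, , model)   // fifth argument: xgbModel
```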
Appendix: DolphinDB Machine Learning Functions
This appendix references charts/lists summarizing DolphinDB's machine learning training functions, utility functions, and supported plugins.
- An appendix image shows the list of machine learning training functions with their categories and distributed-computation support.
- An appendix image lists the utility functions loadModel, saveModel, and predict.
- An appendix image shows the supported machine learning plugins, including algorithms based on XGBoost and libsvm.
Facts Index
| Entity | Attribute | Value | Confidence |
|---|---|---|---|
| Article | published_date | 2021-08-05 | high |
| Junxi | role | Author (byline) | high |
| DolphinDB | machine_learning_support | Implements common ML algorithms such as least squares regression, random forest, and K-means, enabling regression, classification, and clustering tasks. | medium |
| Tutorial examples | product_version | Based on DolphinDB 1.10.9 | high |
| Example 1 dataset | source | UCI Machine Learning Repository wine dataset (wine.data) | high |
| loadText | purpose | Import local downloaded dataset into DolphinDB using a specified schema. | high |
| randomForestClassifier | label_requirement | Class labels must be integers in [0, classNum). | high |
| wine dataset labels | original_labels | 1, 2, 3 | high |
| wine dataset labels | updated_labels | Updated to 0, 1, 2 by setting Label = Label - 1 | high |
| trainTestSplit function | split_ratio | Splits data into train/test with 7:3 (testRatio=0.3). | high |
| wineTrain | size | 124 | high |
| wineTest | size | 54 | high |
| randomForestClassifier | required_parameters | ds, yColName, xColNames, numClasses | high |
| sqlDS | purpose_in_example | Used to generate the input data source (ds) for randomForestClassifier from wineTrain. | high |
| Random forest wine model | numClasses | 3 | high |
| Model prediction | method | model.predict(wineTest) | high |
| Random forest wine classification | accuracy_on_test | 0.925926 (sum(predicted==wineTest.Label)/wineTest.size()) | high |
| saveModel | purpose | Save trained model to disk. | high |
| loadModel | purpose | Load a model from disk for prediction. | high |
| DolphinDB machine learning vs common libraries | positioning | Designed for distributed environments; many built-in ML algorithms support distributed environments well. | medium |
| Distributed ML example | model_type | Logistic regression classification model training on DolphinDB distributed database. | high |
| Distributed dataset | description | A DolphinDB distributed database partitioned by stock ticker, storing daily OHLC data from 2010 to 2018. | high |
| Distributed example features | predictors | Open, High, Low, Close, Open-prev(Close), Open-prev(Open), 10-day moving average, correlation coefficient, RSI. | high |
| Distributed example target | definition | Whether next day's Close is greater than today's Close. | high |
| ffill | purpose_in_example | Fill missing values in raw data. | high |
| 10-day moving average and RSI computation | data_handling | First 10 rows become null and need to be removed (slice [10:]). | high |
| transDS! | purpose_in_example | Apply preprocessing steps to a data source generated from the raw table. | high |
| DolphinDB ta module | purpose_in_example | Used to compute RSI via ta::rsi(Close, 10); module usage referenced to DolphinDBModules repository. | high |
| preprocess function | engineered_columns | OpenClose, OpenOpen, S_10 (mavg), RSI, Target, Corr (mcorr) and forward-filled OHLC columns. | high |
| Distributed data loading | code_path | ohlc = database("dfs://trades").loadTable("ohlc"); ds = sqlDS(<select * from ohlc>).transDS!(preprocess) | high |
| logisticRegression | required_parameters | ds, yColName, xColNames | high |
| logisticRegression in example | xColNames | Open, High, Low, Close, OpenClose, OpenOpen, S_10, RSI, Corr | high |
| logisticRegression in example | yColName | Target | high |
| AAPL prediction example | reported_accuracy | 0.756522 (as shown in code comment) | medium |
| PCA | purpose | Reduce dimensionality by mapping high-dimensional data to lower-dimensional space while minimizing information loss; also useful for visualization. | medium |
| wine dataset in PCA example | dimension | 13 input variables/features (xColNames lists 13 columns). | high |
| pca function usage | normalize_parameter | normalize=true to perform data normalization. | high |
| pcaRes.explainedVarianceRatio (wineTrain) | values | [0.209316,0.201225,0.121788,0.088709,0.077805,0.075314,0.058028,0.045604,0.038463,0.031485,0.021256,0.018073,0.012934] | high |
| PCA result selection | interpretation | First three components have large variance contribution; compressing to 3 dimensions is sufficient for training (as stated). | medium |
| Principal components selection | operation | components = pcaRes.components.transpose()[:3] (keep first three principal components). | high |
| principalComponents function | purpose | Apply PCA components matrix to dataset (matrix(t[xColNames]).dot(components)) and return as table with y column preserved. | high |
| randomForestClassifier after PCA | xColNames | Uses `col0 and `col1 (as shown) with numClasses=3 in the provided code snippet. | high |
| DolphinDB plugins for ML | statement | DolphinDB provides plugins enabling calling third-party libraries for machine learning; XGBoost plugin used as example. | medium |
| XGBoost plugin setup | source | Download compiled XGBoost plugin from DolphinDBPlugin GitHub page. | high |
| loadPlugin | purpose_in_example | Load XGBoost plugin using path to PluginXgboost.txt. | high |
| xgboost::train | signature | xgboost::train(Y, X, [params], [numBoostRound=10], [xgbModel]) | high |
| XGBoost training inputs in example | Y_and_X | Y is Label column from wineTrain; X is remaining feature columns selected from wineTrain. | high |
| XGBoost multi-class params | objective | multi:softmax | high |
| XGBoost multi-class params | num_class | 3 | high |
| XGBoost parameter guidance | booster_options | booster can be "gbtree" or "gblinear". | high |
| XGBoost parameter guidance | eta | Range [0,1], default 0.3. | high |
| XGBoost parameter guidance | gamma | Range [0,∞], default 0. | high |
| XGBoost parameter guidance | max_depth | Range [0,∞], default 6. | high |
| XGBoost parameter guidance | subsample | Range (0,1], default 1. | high |
| XGBoost parameter guidance | lambda | Default 0. | high |
| XGBoost parameter guidance | alpha | Default 0. | high |
| XGBoost parameter guidance | seed | Default 0. | high |
| XGBoost parameters documentation | reference | Links to official XGBoost parameter documentation. | high |
| Example XGBoost params used | params | { objective: "multi:softmax", num_class: 3, max_depth: 5, eta: 0.1, subsample: 0.9 } | high |
| XGBoost wine classification | accuracy_on_test | 0.962963 (sum(predicted==wineTest.Label)/wineTest.size()) | high |
| xgboost::saveModel | purpose | Persist XGBoost model to file (example: "xgboost001.mdl"). | high |
| xgboost::loadModel | purpose | Load XGBoost model from file (example: "xgboost001.mdl"). | high |
| xgboost::train (incremental training) | xgbModel_parameter | Can specify xgbModel to continue training from an existing model (model = xgboost::train(Y, X, params, , model)). | high |
| Appendix A image | content_description | Image shows a table/list of DolphinDB machine learning training functions, including function names, categories (classification/regression/clustering/dimensionality reduction), descriptions, and whether distributed computation is supported. | medium |
| Appendix B image | content_description | Image lists ML tool functions: loadModel (load from disk), saveModel (persist), predict (predict on new data). | medium |
| Appendix C image | content_description | Image shows supported ML plugins, including XGBoost-based gradient boosting algorithms and libsvm-based algorithms (e.g., SVM). | medium |