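Machine Learning with DolphinDB

This page is the header and overview of a tutorial article. It gives the topic, author, and publication date, and summarizes DolphinDB's machine learning capabilities and the product version the examples are based on.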

Source: https://dolphindb.cn/blogs/71

What this page covers

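Skills Certification Training Camp, Session 2, Now Open (Limited-Time Registration)

This section promotes the second session of the skills certification training camp, which is open for registration for a limited time. It includes a call to action and mentions exclusive benefits and discounts.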

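Machine Learning with DolphinDB

This section gives the article title and header information, and summarizes DolphinDB's machine learning capabilities and the version the tutorial examples are based on.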

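First Example: Classifying a Small Dataset

This section uses the UCI wine dataset to demonstrate training and evaluating a random forest classifier in DolphinDB. It covers data import, preprocessing, training, prediction, evaluation, and model persistence.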
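The preprocessing and evaluation arithmetic in this example can be sketched in plain Python. This is not DolphinDB script and not the article's code: the function names `train_test_split` and `accuracy` are hypothetical, and the label list is a stand-in for the real wine table. The row count of 178 is inferred from the 124 + 54 split reported in the facts index, and "50 of 54 correct" is an inference from the reported accuracy of 0.925926.

```python
import random

def train_test_split(rows, test_ratio, seed=0):
    """Shuffle rows and split them; the first part is the training set."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    # Truncating the training size reproduces the 124/54 split for 178 rows.
    train_size = int(len(shuffled) * (1 - test_ratio))
    return shuffled[:train_size], shuffled[train_size:]

def accuracy(predicted, actual):
    """Fraction of predictions that equal the true label."""
    hits = sum(1 for p, a in zip(predicted, actual) if p == a)
    return hits / len(actual)

# Hypothetical stand-in for the wine table: labels only, cycling 1, 2, 3.
original_labels = [1 + (i % 3) for i in range(178)]

# The classifier expects integer labels in [0, numClasses), so shift 1..3 to 0..2.
labels = [label - 1 for label in original_labels]

train, test = train_test_split(labels, test_ratio=0.3)

# Made-up predictions: copy the true test labels, then flip four of them
# to a different class so that 50 of 54 remain correct.
predicted = list(test)
for i in range(4):
    predicted[i] = (predicted[i] + 1) % 3

test_accuracy = accuracy(predicted, test)
print(len(train), len(test), round(test_accuracy, 6))
```

The accuracy helper mirrors the expression quoted in the facts index: the count of matching labels divided by the test set size.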

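Distributed Machine Learning

This section shows how to work in a DolphinDB distributed environment: it performs feature engineering on OHLC data partitioned by stock ticker, then trains and evaluates a logistic regression classification model.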
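The feature engineering described here can be illustrated with a small plain-Python sketch. It is an assumption-laden illustration, not the article's `preprocess` function: it covers only the previous-day differences, the 10-day moving average, and the next-day target, and it omits forward fill, RSI, and the correlation feature. The helper names and the sample prices are hypothetical. Dropping the final row, which has no next-day close, is a choice made in this sketch.

```python
def prev(values):
    """Shift a series down by one row; the first row has no predecessor."""
    return [None] + values[:-1]

def moving_average(values, window):
    """Trailing mean over `window` rows; None until a full window exists."""
    out = []
    for i in range(len(values)):
        if i + 1 < window:
            out.append(None)
        else:
            out.append(sum(values[i + 1 - window:i + 1]) / window)
    return out

def build_rows(opens, closes, window=10):
    """Build feature rows for one ticker from its daily open and close prices."""
    prev_open, prev_close = prev(opens), prev(closes)
    s_10 = moving_average(closes, window)
    rows = []
    for i in range(len(closes)):
        next_close = closes[i + 1] if i + 1 < len(closes) else None
        rows.append({
            "OpenClose": None if prev_close[i] is None else opens[i] - prev_close[i],
            "OpenOpen": None if prev_open[i] is None else opens[i] - prev_open[i],
            "S_10": s_10[i],
            # Target: 1 if the next day's close is above today's close.
            "Target": None if next_close is None else int(next_close > closes[i]),
        })
    # Drop the warm-up rows (mirrors the [10:] slice) and the last row,
    # which has no next-day close to compare against.
    return rows[window:-1]

# Hypothetical prices for one ticker: a steadily rising series of 15 days.
closes = [10.0 + i for i in range(15)]
opens = [c - 0.5 for c in closes]

rows = build_rows(opens, closes)
print(len(rows), rows[0])
```

In the article the same preprocessing is attached to a data source so that it runs per partition; the sketch above only shows what each engineered column means for a single ticker.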

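Dimensionality Reduction with PCA

This section explains what PCA is used for. Using the wine example, it demonstrates computing the principal components, selecting a subset of them, and using the reduced data for subsequent training and prediction.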
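One way to reason about how many components to keep is to accumulate the explained variance ratios. The Python sketch below uses the thirteen `explainedVarianceRatio` values reported in the facts index; the helper `components_needed` and the 0.9 threshold are hypothetical and are not part of the article.

```python
from itertools import accumulate

# explainedVarianceRatio values reported for wineTrain in the facts index.
explained_variance_ratio = [
    0.209316, 0.201225, 0.121788, 0.088709, 0.077805, 0.075314, 0.058028,
    0.045604, 0.038463, 0.031485, 0.021256, 0.018073, 0.012934,
]

# Running total of variance explained by the first k components.
cumulative = list(accumulate(explained_variance_ratio))

def components_needed(threshold):
    """Smallest number of leading components whose cumulative ratio reaches threshold."""
    for k, total in enumerate(cumulative, start=1):
        if total >= threshold:
            return k
    return len(cumulative)

first_three = cumulative[2]
print(round(first_three, 6), components_needed(0.9))
```

With these values the first three components account for roughly 53% of the variance, so keeping three is the article's judgment call rather than the result of a fixed variance threshold.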

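Machine Learning with DolphinDB Plugins

This section introduces DolphinDB's machine learning plugins, using the XGBoost plugin as the example. It covers obtaining and loading the plugin, training and prediction, parameter descriptions, model persistence, and incremental training.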
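The parameter ranges listed for the XGBoost example lend themselves to a quick sanity check before training. The Python sketch below is a hypothetical helper, not part of the DolphinDB plugin or of XGBoost: `PARAM_RANGES` and `check_params` are names invented here, the ranges are the ones given in the facts index, and the parameter set is the one used in the article's wine example.

```python
import math

# Ranges as listed in the facts index: (low, high, low_inclusive).
PARAM_RANGES = {
    "eta": (0.0, 1.0, True),
    "gamma": (0.0, math.inf, True),
    "max_depth": (0, math.inf, True),
    "subsample": (0.0, 1.0, False),  # (0, 1]: zero is excluded
}

def check_params(params):
    """Return a message for every parameter outside its listed range."""
    problems = []
    for name, (low, high, low_inclusive) in PARAM_RANGES.items():
        if name not in params:
            continue  # not set, so nothing to check
        value = params[name]
        above_low = value >= low if low_inclusive else value > low
        if not (above_low and value <= high):
            problems.append(f"{name}={value} is outside the listed range")
    return problems

# Parameter set used in the article's wine example.
params = {
    "objective": "multi:softmax",
    "num_class": 3,
    "max_depth": 5,
    "eta": 0.1,
    "subsample": 0.9,
}

problems = check_params(params)
print(problems)
```

A check like this only guards against out-of-range values; it says nothing about whether a given combination of settings is a good choice for the data.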

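Appendix: DolphinDB Machine Learning Functions

This appendix refers to figures and lists that summarize DolphinDB's machine learning training functions, utility functions, and supported plugins.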

Facts Index

Entity | Attribute | Value | Confidence
Skills Certification Training Camp, Session 2 | status | Now open | high
Skills Certification Training Camp, Session 2 registration | availability | Limited-time registration (registration link provided) | high
Skills Certification Training Camp, Session 2 | benefit | Exclusive benefits and discounts | low
Article | published_date | 2021-08-05 | high
Junxi | role | Author (byline) | high
DolphinDB | machine_learning_support | Implements common ML algorithms such as least squares regression, random forest, and K-means, enabling regression, classification, and clustering tasks. | medium
Tutorial examples | product_version | Based on DolphinDB 1.10.9 | high
Example 1 dataset | source | UCI Machine Learning Repository wine dataset (wine.data) | high
loadText | purpose | Import the locally downloaded dataset into DolphinDB using a specified schema. | high
randomForestClassifier | label_requirement | Class labels must be integers in [0, numClasses). | high
wine dataset labels | original_labels | 1, 2, 3 | high
wine dataset labels | updated_labels | Updated to 0, 1, 2 by setting Label = Label - 1 | high
trainTestSplit function | split_ratio | Splits data into train/test sets at 7:3 (testRatio=0.3). | high
wineTrain | size | 124 | high
wineTest | size | 54 | high
randomForestClassifier | required_parameters | ds, yColName, xColNames, numClasses | high
sqlDS | purpose_in_example | Used to generate the input data source (ds) for randomForestClassifier from wineTrain. | high
Random forest wine model | numClasses | 3 | high
Model prediction | method | model.predict(wineTest) | high
Random forest wine classification | accuracy_on_test | 0.925926 (sum(predicted==wineTest.Label)/wineTest.size()) | high
saveModel | purpose | Save a trained model to disk. | high
loadModel | purpose | Load a model from disk for prediction. | high
DolphinDB machine learning vs common libraries | positioning | Designed for distributed environments; many built-in ML algorithms support distributed environments well. | medium
Distributed ML example | model_type | Logistic regression classification model trained on a DolphinDB distributed database. | high
Distributed dataset | description | A DolphinDB distributed database partitioned by stock ticker, storing daily OHLC data from 2010 to 2018. | high
Distributed example features | predictors | Open, High, Low, Close, Open-prev(Close), Open-prev(Open), 10-day moving average, correlation coefficient, RSI. | high
Distributed example target | definition | Whether the next day's Close is greater than today's Close. | high
ffill | purpose_in_example | Fill missing values in the raw data. | high
10-day moving average and RSI computation | data_handling | The first 10 rows become null and need to be removed (slice [10:]). | high
transDS! | purpose_in_example | Apply the preprocessing steps to a data source generated from the raw table. | high
DolphinDB ta module | purpose_in_example | Used to compute RSI via ta::rsi(Close, 10); module usage is referenced to the DolphinDBModules repository. | high
preprocess function | engineered_columns | OpenClose, OpenOpen, S_10 (mavg), RSI, Target, Corr (mcorr), and forward-filled OHLC columns. | high
Distributed data loading | code_path | ohlc = database("dfs://trades").loadTable("ohlc"); ds = sqlDS(<select * from ohlc>).transDS!(preprocess) | high
logisticRegression | required_parameters | ds, yColName, xColNames | high
logisticRegression in example | xColNames | Open, High, Low, Close, OpenClose, OpenOpen, S_10, RSI, Corr | high
logisticRegression in example | yColName | Target | high
AAPL prediction example | reported_accuracy | 0.756522 (as shown in a code comment) | medium
PCA | purpose | Reduce dimensionality by mapping high-dimensional data to a lower-dimensional space while minimizing information loss; also useful for visualization. | medium
wine dataset in PCA example | dimension | 13 input variables/features (xColNames lists 13 columns). | high
pca function usage | normalize_parameter | normalize=true to perform data normalization. | high
pcaRes.explainedVarianceRatio (wineTrain) | values | [0.209316,0.201225,0.121788,0.088709,0.077805,0.075314,0.058028,0.045604,0.038463,0.031485,0.021256,0.018073,0.012934] | high
PCA result selection | interpretation | The first three components have a large variance contribution; compressing to 3 dimensions is sufficient for training (as stated). | medium
Principal components selection | operation | components = pcaRes.components.transpose()[:3] (keep the first three principal components). | high
principalComponents function | purpose | Apply the PCA components matrix to the dataset (matrix(t[xColNames]).dot(components)) and return the result as a table with the y column preserved. | high
randomForestClassifier after PCA | xColNames | Uses `col0 and `col1 (as shown) with numClasses=3 in the provided code snippet. | high
DolphinDB plugins for ML | statement | DolphinDB provides plugins that enable calling third-party libraries for machine learning; the XGBoost plugin is used as the example. | medium
XGBoost plugin setup | source | Download the compiled XGBoost plugin from the DolphinDBPlugin GitHub page. | high
loadPlugin | purpose_in_example | Load the XGBoost plugin using the path to PluginXgboost.txt. | high
xgboost::train | signature | xgboost::train(Y, X, [params], [numBoostRound=10], [xgbModel]) | high
XGBoost training inputs in example | Y_and_X | Y is the Label column from wineTrain; X is the remaining feature columns selected from wineTrain. | high
XGBoost multi-class params | objective | multi:softmax | high
XGBoost multi-class params | num_class | 3 | high
XGBoost parameter guidance | booster_options | booster can be "gbtree" or "gblinear". | high
XGBoost parameter guidance | eta | Range [0,1], default 0.3. | high
XGBoost parameter guidance | gamma | Range [0,∞], default 0. | high
XGBoost parameter guidance | max_depth | Range [0,∞], default 6. | high
XGBoost parameter guidance | subsample | Range (0,1], default 1. | high
XGBoost parameter guidance | lambda | Default 0 as given in the article (the official XGBoost documentation lists a default of 1). | high
XGBoost parameter guidance | alpha | Default 0. | high
XGBoost parameter guidance | seed | Default 0. | high
XGBoost parameters documentation | reference | Links to the official XGBoost parameter documentation. | high
Example XGBoost params used | params | { objective: "multi:softmax", num_class: 3, max_depth: 5, eta: 0.1, subsample: 0.9 } | high
XGBoost wine classification | accuracy_on_test | 0.962963 (sum(predicted==wineTest.Label)/wineTest.size()) | high
xgboost::saveModel | purpose | Persist an XGBoost model to a file (example: "xgboost001.mdl"). | high
xgboost::loadModel | purpose | Load an XGBoost model from a file (example: "xgboost001.mdl"). | high
xgboost::train (incremental training) | xgbModel_parameter | xgbModel can be specified to continue training from an existing model (model = xgboost::train(Y, X, params, , model)). | high
Appendix A image | content_description | Image shows a table/list of DolphinDB machine learning training functions, including function names, categories (classification/regression/clustering/dimensionality reduction), descriptions, and whether distributed computation is supported. | medium
Appendix B image | content_description | Image lists ML tool functions: loadModel (load from disk), saveModel (persist), predict (predict on new data). | medium
Appendix C image | content_description | Image shows supported ML plugins, including XGBoost-based gradient boosting algorithms and libsvm-based algorithms (e.g., SVM). | medium