Quantitative factor extraction and data preprocessing: performance comparison of executing pivot by before vs. after preprocessing
Factor tables are stored in narrow (long) format in a DolphinDB database, but whether the factors are extracted for model training or used in other factor computations, the data has to be presented as a wide table. The demo below compares two approaches: preprocessing the data first and then converting it to a wide table with pivot by, versus converting to a wide table with pivot by first and then preprocessing. Pivoting first turns out to be roughly 4 times faster than preprocessing first.
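As a quick illustration of the reshape itself, pivot by turns one row per (date, code, factor name) into one row per (date, code) with one column per factor. The toy table below is invented purely for this illustration and is not part of the benchmark:

// toy narrow table: one row per (date, code, factor name)
t = table(2022.01.03 2022.01.03 2022.01.03 2022.01.03 as date,
	`600001`600001`600002`600002 as code,
	`factor0001`factor0002`factor0001`factor0002 as name,
	1.1 2.2 3.3 4.4 as value)
// the last pivot by column (name) becomes the column dimension of the wide result
wide = select value from t pivot by date, code, name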
Simulate data for 2008.01.01..2022.05.30, with 4,500 codes and 1,000 factors. Environment: DolphinDB server 2.00.6 with 8 workers. First create the database and partitioned table, then write the simulated data:
if(existsDatabase("dfs://factor")){
	dropDatabase("dfs://factor")
}
datePar = database("", RANGE, date(datetimeAdd(2008.01M, 0..40*12, 'M')))
namePar = database("", VALUE, ["DEFAULT_NAME"])
factorDB = database("dfs://factor", COMPO, [datePar, namePar], engine="TSDB")
factorDoubleSchema = table(1:0, `date`code`name`value, [DATE, SYMBOL, SYMBOL, DOUBLE])
factorDoubleTable = factorDB.createPartitionedTable(factorDoubleSchema, "factor_double", partitionColumns=`date`name, sortColumns=`name`date, keepDuplicates=ALL)
path="dfs://factor"
tbName= "factor_double"
tb=loadTable(path,"factor_double")
tradedates=2008.01.01..2022.05.30
tradedates=tradedates[between(weekday(tradedates), 1:5)]	// keep weekdays only (Monday to Friday)
n=size(tradedates)
codes=string(600001..604500)
X=1..1000
factornames=`factor+decimalFormat(X, "0000")
// repeat each trading date 4500 times (sorted) so rows line up with the cycled code list
tradedates=sort(take(tradedates,4500*n))
for (ifactor in factornames){
	print(ifactor)
	// build one in-memory table per factor: every (date, code) pair with a random value
	idata=table(tradedates as date, take(codes,4500*n) as code, take(ifactor,4500*n) as name, rand(10.,4500*n) as value)
	upsert!(tb, idata, keyColNames=`name`date)
}

Define the data standardization and exponential smoothing functions:
def factorFiterAndNormize(x, fiter){
	/*
	 * standardize the data: winsorize the tails given by fiter, then z-score
	 */
	v = winsorize(x, fiter)
	return (v - mean(v)) \ std(v)
}
def factorewmMean(x){
	/*
	 * exponential smoothing
	 */
	return ewmMean(X=x, com=0.5)
}
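As a quick sanity check before registering the two helpers as function views, they can be called directly on a small vector (the values below are arbitrary and only for illustration):

x = 1.0 2.0 3.0 100.0 5.0
// winsorize the 3% tails, then z-score: the result has mean 0 and standard deviation 1
factorFiterAndNormize(x, 0.03)
// exponentially weighted mean with com=0.5, smoothing the series in its given order
factorewmMean(x)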
addFunctionView(factorFiterAndNormize)
addFunctionView(factorewmMean)

Method 1: extract 2 years of data for 100 factors, preprocess the data first and then pivot by into a wide table. This takes about 3 minutes 6 seconds.
def pivotby_last(){
	factors = `factor+decimalFormat(1..100, "0000")
	// query the data and standardize it within each (date, name) group
	data = select date, code, name, factorFiterAndNormize(value, 0.03) as value from loadTable("dfs://factor", "factor_double")
		where name in factors and date between 2020.01.01:2022.01.01 context by date, name
	// exponential smoothing within each (code, name) group, ordered by date
	data = select date, code, name, factorewmMean(value) as value from data context by code, name csort date
	data = select value from data pivot by date, code, name;
	return data
}
timer data=pivotby_last()
//Time elapsed: 186393.562 ms

Method 2: extract 2 years of data for 100 factors, pivot by into a wide table first and then preprocess. This takes about 46 seconds, roughly 4 times the extraction performance of Method 1.
def pivotby_first(){
	factors = `factor+decimalFormat(1..100, "0000")
	// pivot to a wide table first: one column per factor
	data = select value from loadTable("dfs://factor", "factor_double") where name in factors and date between 2020.01.01:2022.01.01 pivot by date, code, name;
	factors = colNames(data)[2:]
	colCount = factors.size()
	colDefs = array(ANY, colCount+2)
	colDefs[0] = sqlCol(`date)
	colDefs[1] = sqlCol(`code)
	for(i in 0:colCount){
		colDefs[i+2] = sqlCol(factors[i], factorFiterAndNormize{,0.03}, factors[i])
	}
	// standardization: groupFlag=0 generates a context by date clause
	temp = sql(colDefs, data, groupBy=sqlCol(`date), groupFlag=0).eval()
	for(i in 0:colCount){
		colDefs[i+2] = sqlCol(factors[i], factorewmMean, factors[i])
	}
	// exponential smoothing: context by code, sorted within each group by date
	return sql(colDefs, temp, groupBy=sqlCol(`code), groupFlag=0, csort=sqlCol(`date)).eval()
}
timer data=pivotby_first()
//Time elapsed: 46751.896 ms
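For readers less familiar with DolphinDB metaprogramming, the sqlCol/sql calls above simply assemble an ordinary context by query column by column, using the factorFiterAndNormize helper defined earlier. The sketch below uses a hypothetical two-factor wide table (wideDemo, cols and stmt are illustrative names, not part of the benchmark) and prints the generated metacode before executing it:

// toy wide table with two factor columns
wideDemo = table(2022.01.03 2022.01.03 2022.01.04 2022.01.04 as date,
	`600001`600002`600001`600002 as code,
	1.0 2.0 3.0 4.0 as factor0001,
	5.0 6.0 7.0 8.0 as factor0002)
cols = [sqlCol(`date), sqlCol(`code),
	sqlCol(`factor0001, factorFiterAndNormize{,0.03}, `factor0001),
	sqlCol(`factor0002, factorFiterAndNormize{,0.03}, `factor0002)]
// groupFlag=0 makes sql() generate "context by" rather than "group by"
stmt = sql(cols, wideDemo, groupBy=sqlCol(`date), groupFlag=0)
print(stmt)   // inspect the generated SQL statement
stmt.eval()   // execute it: standardization within each date, one call per factor column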