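How to run aj (asof join) on full-market data in a small-memory environment
Approach:
The overall idea is to trade time for space: run as serially as possible and release intermediate variables promptly.
In other words, run aj one security at a time and iterate over the securities with each.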
def ajBySecurityID(securityIDStr, aDate){
    // load only this security's ticks for the given day
    tick = select * from loadTable("dfs://SH_TSDB_tick", "tick")
        where date(TradeTime) = aDate and SecurityID = securityIDStr
    // load the snapshots for the same security and day
    snapshot = select * from loadTable("dfs://SH_TSDB_snapshot_ArrayVector1", "snapshot")
        where date(DateTime) = aDate and SecurityID = securityIDStr
    // asof join: attach to each tick the latest snapshot at or before its TradeTime
    data = aj(tick, snapshot, `SecurityID`TradeTime, `SecurityID`DateTime)
    return data
}
// all securities that traded on the target day
allSecurityID = exec distinct SecurityID from loadTable("dfs://SH_TSDB_tick", "tick")
    where date(TradeTime) = 2021.01.04
// aj directly on DFS tables
def ajBySecurityID2(securityIDStr, aDate){
    // loadTable loads only the table metadata, not the data
    tick = loadTable("dfs://SH_TSDB_tick", "tick")
    snapshot = loadTable("dfs://SH_TSDB_snapshot_ArrayVector1", "snapshot")
    data = select * from aj(tick, snapshot, `SecurityID`TradeTime, `SecurityID`DateTime)
        where date(TradeTime) = aDate and date(DateTime) = aDate
        and tick.SecurityID = securityIDStr and snapshot.SecurityID = securityIDStr
    return data
}
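rst = each(ajBySecurityID, allSecurityID, 2021.01.04)
//rst = each(ajBySecurityID2, allSecurityID, 2021.01.04)
In testing, the result set for the 1,800 stocks of the full market on 2021.01.04 came to about 25 GB.
So with the Community Edition the full result set cannot be held in memory, and getting all of it requires some optimization.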
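Further optimizations:
- Read only the columns that are needed instead of using select *
- Write the result back to a DFS table instead of keeping it in memory, i.e. replace the return data step with a write to a DFS table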
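
The two optimizations above could be combined roughly as follows. This is only a sketch under stated assumptions, not a tested implementation: the column names TradePrice, TradeQty and LastPrice are placeholders for whatever fields are actually needed, and "dfs://SH_TSDB_aj_result" / "ajResult" is a hypothetical DFS table that is assumed to exist already with a schema matching the join output.

```
// Sketch only. Assumptions (not from the original post):
//  - TradePrice, TradeQty and LastPrice are placeholder column names;
//  - "dfs://SH_TSDB_aj_result" / "ajResult" is a hypothetical, pre-created
//    DFS table whose schema matches the columns produced by the join.
def ajAndSave(securityIDStr, aDate){
    // read only the columns that are needed, for one security and one day
    tick = select SecurityID, TradeTime, TradePrice, TradeQty from loadTable("dfs://SH_TSDB_tick", "tick")
        where date(TradeTime) = aDate and SecurityID = securityIDStr
    snapshot = select SecurityID, DateTime, LastPrice from loadTable("dfs://SH_TSDB_snapshot_ArrayVector1", "snapshot")
        where date(DateTime) = aDate and SecurityID = securityIDStr
    data = aj(tick, snapshot, `SecurityID`TradeTime, `SecurityID`DateTime)
    // write the joined rows to disk instead of returning them to the caller
    loadTable("dfs://SH_TSDB_aj_result", "ajResult").append!(data)
    // return only the row count, so each() collects small integers, not tables
    return rows(data)
}
// fix the date with partial application and process the securities one by one
rowCounts = each(ajAndSave{, 2021.01.04}, allSecurityID)
```

With this shape only one security's tick, snapshot and joined data should need to be in memory at a time, and the full result can afterwards be queried from the DFS table in whatever slices fit in memory.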