How to run aj (asof join) on full-market data in a small-memory environment

笨笨
2024-12-13

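Approach:

The overall idea is to trade time for space: run the work as serially as possible and release intermediate variables promptly.

That is, run aj per stock and iterate over the stocks with each.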

def ajBySecurityID(securityIDStr, aDate){
    // Load only one stock's ticks and snapshots for a single day
    tick = select * from loadTable("dfs://SH_TSDB_tick", "tick") 
        where date(TradeTime) = aDate and SecurityID = securityIDStr
    snapshot = select * from loadTable("dfs://SH_TSDB_snapshot_ArrayVector1", "snapshot") 
        where date(DateTime) = aDate and SecurityID = securityIDStr
    // asof join: for each tick, take the latest snapshot at or before TradeTime
    data = aj(tick, snapshot, `SecurityID`TradeTime, `SecurityID`DateTime)
    return data
}

allSecurityID = exec distinct SecurityID from loadTable("dfs://SH_TSDB_tick", "tick")
    where date(TradeTime) = 2021.01.04

// aj directly on DFS tables
def ajBySecurityID2(securityIDStr, aDate){
    // loadTable loads ONLY metadata here; no data is read yet
    tick = loadTable("dfs://SH_TSDB_tick", "tick")
    snapshot = loadTable("dfs://SH_TSDB_snapshot_ArrayVector1", "snapshot")
    data = select * from aj(tick, snapshot, `SecurityID`TradeTime`TradeTime, `SecurityID`DateTime`DateTime)
        where date(TradeTime) = aDate and date(DateTime) = aDate 
            and tick.SecurityID = securityIDStr and snapshot.SecurityID = securityIDStr
    return data
}


rst = each(ajBySecurityID, allSecurityID, 2021.01.04)
//rst = each(ajBySecurityID2, allSecurityID, 2021.01.04)

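In an actual test on 2021.01.04, the result set for the 1,800 stocks of the whole market was around 25 GB.

So with the Community Edition, the full result set cannot be held in memory; obtaining all of it requires some optimization.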
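Before running the whole market, the size of the result set can be roughly estimated from a small sample. The following is a minimal sketch rather than the method used for the figure above: it assumes that ajBySecurityID and allSecurityID are defined as shown earlier, and the sample size of 20 is an arbitrary, hypothetical choice.

```dolphindb
// Sketch only: estimate the full result size from a sample of stocks.
// Assumption: allSecurityID holds at least 20 IDs; 20 is an arbitrary sample size.
sampleIDs = allSecurityID[0:20]
sampleBytes = 0l
for(id in sampleIDs){
    // memSize returns the memory used by an object, in bytes;
    // each per-stock result is discarded after it has been measured
    sampleBytes += memSize(ajBySecurityID(id, 2021.01.04))
}
// Scale the sample up to the whole market and convert bytes to GB
estimateGB = sampleBytes * size(allSecurityID) \ size(sampleIDs) \ 1024 \ 1024 \ 1024
print(estimateGB)
```

The estimate is only as good as the sample: actively traded stocks produce far more ticks than quiet ones, so the first few IDs may not be representative.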

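Further optimizations:

  • Read only the required columns instead of using select *
  • Store the result data back into a DFS table instead of keeping it in memory, i.e. replace the return data part with a write to a DFS table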
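The two points above can be combined into a variant of ajBySecurityID along the following lines. This is a minimal sketch under stated assumptions, not a tested implementation: the column names TradePrice, TradeQty and LastPx are hypothetical placeholders for whichever fields are actually needed, and the output table "dfs://SH_TSDB_aj_result" / "ajResult" is a hypothetical DFS table assumed to exist already with a schema matching the columns produced by aj.

```dolphindb
// Sketch only. Assumptions (not from the original post):
//   - TradePrice, TradeQty and LastPx are hypothetical column names
//   - dfs://SH_TSDB_aj_result / ajResult is a hypothetical DFS table that
//     already exists and whose schema matches the aj result
def ajAndSaveBySecurityID(securityIDStr, aDate){
    // Read only the required columns instead of select *
    tick = select SecurityID, TradeTime, TradePrice, TradeQty from loadTable("dfs://SH_TSDB_tick", "tick")
        where date(TradeTime) = aDate and SecurityID = securityIDStr
    snapshot = select SecurityID, DateTime, LastPx from loadTable("dfs://SH_TSDB_snapshot_ArrayVector1", "snapshot")
        where date(DateTime) = aDate and SecurityID = securityIDStr
    data = aj(tick, snapshot, `SecurityID`TradeTime, `SecurityID`DateTime)
    // Write this stock's result to the DFS table instead of returning it
    loadTable("dfs://SH_TSDB_aj_result", "ajResult").append!(data)
    // Return only the row count, so each collects a small vector, not 25 GB of tables
    return rows(data)
}

savedRows = each(ajAndSaveBySecurityID, allSecurityID, 2021.01.04)
```

Because each call returns only a row count, the memory held at any moment is bounded by a single stock's tick, snapshot and joined data, which are released when the function returns.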