pivot by

pivot by 是 DolphinDB 的独有功能,是对标准 SQL 语句的拓展。它将表中一列或多列的内容按照两个维度重新排列,亦可配合数据转换函数使用。与 select 子句一起使用时返回一个表,而和 exec 语句一起使用时返回一个矩阵。 参考 pivotunpivot

1.20.4 版本提供了以下两点改进:

  • pivot by 后面的列数可为两个以上,其中最后一列为列维度,其它各列均为行维度。

  • 使用 pivot by 时,select 子句可使用多个指标。

例子

例1:行维度为一列(即 pivot by 之后仅有两列)

$ sym = `C`MS`MS`MS`IBM`IBM`C`C`C
$ price= 49.6 29.46 29.52 30.02 174.97 175.23 50.76 50.32 51.29
$ qty = 2200 1900 2100 3200 6800 5400 1300 2500 8800
$ timestamp = [09:34:07,09:35:42,09:36:51,09:36:59,09:35:47,09:36:26,09:34:16,09:35:26,09:36:12]
$ t2 = table(timestamp, sym, qty, price);

$ t2;

timestamp

sym

qty

price

09:34:07

C

2200

49.6

09:35:42

MS

1900

29.46

09:36:51

MS

2100

29.52

09:36:59

MS

3200

30.02

09:35:47

IBM

6800

174.97

09:36:26

IBM

5400

175.23

09:34:16

C

1300

50.76

09:35:26

C

2500

50.32

09:36:12

C

8800

51.29

$ select price from t2 pivot by timestamp, sym;

timestamp

C

IBM

MS

09:34:07

49.6

09:34:16

50.76

09:35:26

50.32

09:35:42

29.46

09:35:47

174.97

09:36:12

51.29

09:36:26

175.23

09:36:51

29.52

09:36:59

30.02

$ a=select last(price) from t2 pivot by timestamp.minute(), sym;
$ a;

minute_timestamp

C

IBM

MS

09:34m

50.76

09:35m

50.32

174.97

29.46

09:36m

51.29

175.23

30.02

$ typestr a;
TABLE

$ b=exec count(price) from t2 pivot by timestamp.minute(), sym;
$ b;

label

C

IBM

MS

09:34m

2

09:35m

1

1

1

09:36m

1

1

2

$ typestr b;
FAST DOUBLE MATRIX

例2:计算 ETF 的内在价值。为简便起见,假设 ETF 有两个成分股:AAPL 和 FB,其成分权重为 [0.6, 0.4]。

$ symbol=take(`AAPL, 6) join take(`FB, 5)
$ time=2019.02.27T09:45:01.000000000+[146, 278, 412, 445, 496, 789, 212, 556, 598, 712, 989]
$ price=173.27 173.26 173.24 173.25 173.26 173.27 161.51 161.50 161.49 161.50 161.51
$ quotes=table(symbol, time, price)
$ weights=dict(`AAPL`FB, 0.6 0.4)
$ ETF = select symbol, time, price*weights[symbol] as price from quotes;

$ select last(price) from ETF pivot by time, symbol;

Time

AAPL

FB

2019.02.27T09:45:01.000000146

103.962

2019.02.27T09:45:01.000000212

64.604

2019.02.27T09:45:01.000000278

103.956

2019.02.27T09:45:01.000000412

103.944

2019.02.27T09:45:01.000000445

103.95

2019.02.27T09:45:01.000000496

103.956

2019.02.27T09:45:01.000000556

64.6

2019.02.27T09:45:01.000000598

64.596

2019.02.27T09:45:01.000000712

64.6

2019.02.27T09:45:01.000000789

103.962

2019.02.27T09:45:01.000000989

64.604

以上结果为在每个时间戳经过 ETF 权重调整的股票价格。为了计算 ETF 在每个时间戳的内在价值,需要使用 ffill 函数来填充每个股票的 NULL 值,然后将所有股票价格相加。可在一个 SQL 语句中完成。

$ select rowSum(ffill(last(price))) from ETF pivot by time, symbol;

Time

rowSum

2019.02.27T09:45:01.000000146

103.962

2019.02.27T09:45:01.000000212

168.566

2019.02.27T09:45:01.000000278

168.56

2019.02.27T09:45:01.000000412

168.548

2019.02.27T09:45:01.000000445

168.554

2019.02.27T09:45:01.000000496

168.56

2019.02.27T09:45:01.000000556

168.556

2019.02.27T09:45:01.000000598

168.552

2019.02.27T09:45:01.000000712

168.556

2019.02.27T09:45:01.000000789

168.562

2019.02.27T09:45:01.000000989

168.566

例3:使用 pivot by 时某些新建列名的自动转换

1.30.14 版本以前,DolphinDB 中列名仅可使用中文或英文字母、数字或下划线 (_),且必须以中文或英文字母开头。

1.30.14 版本开始,由 pivot by, addColumn 操作产生的列名,支持包含特殊字符,或以数字开头。

详情可以参考:创建表章节

$ date = take(2021.08.01  2021.08.02 2021.08.03, 12)
$ sym = take(["IBM N", "_MSFTN", "3_GOOGS", ""], 12).sort()
$ value = 1..12
$ t=table(date, sym, value)
$ re = select value from t pivot by date, sym

date

NULL

3_GOOGS

IBM N

_MSFTN

2021.08.01

1

4

7

10

2021.08.02

2

5

8

11

2021.08.03

3

6

9

12

$ select  _"NULL" from re

NULL

1

2

3

$ select  _"3_GOOGS" from re

3_GOOGS

4

5

6

下例,设置 removeSpecialCharInColumnName = true,股票代码 "600300" 与 "600600" 在成为列名后,系统自动将其处理为 "C600300" 与 "C600600"。

$ symbol=take(`600300, 6) join take(`600600, 5)
$ time=2019.02.27T09:45:01.000000000+[146, 278, 412, 445, 496, 789, 212, 556, 598, 712, 989]
$ price=173.27 173.26 173.24 173.25 173.26 173.27 161.51 161.50 161.49 161.50 161.51
$ quotes=table(symbol, time, price)
$ weights=dict(`600300`600600, 0.6 0.4)
$ ETF = select symbol, time, price*weights[symbol] as price from quotes;

$ select last(price) from ETF pivot by time, symbol;

time

C600300

C600600

2019.02.27T09:45:01.000000146

103.962

2019.02.27T09:45:01.000000212

64.604

2019.02.27T09:45:01.000000278

103.956

2019.02.27T09:45:01.000000412

103.944

2019.02.27T09:45:01.000000445

103.95

2019.02.27T09:45:01.000000496

103.956

2019.02.27T09:45:01.000000556

64.6

2019.02.27T09:45:01.000000598

64.596

2019.02.27T09:45:01.000000712

64.6

2019.02.27T09:45:01.000000789

103.962

2019.02.27T09:45:01.000000989

64.604

例4:行维度为多列(即 pivot by 之后多于两列)

$ date = 2020.09.21 + 0 0 0 0 1 1 1 1
$ sym = `MS`MS`GS`GS`MS`MS`GS`GS$SYMBOL
$ factorNum = 1 2 1 2 1 2 1 2
$ factorValue = 1.2 -3.4 -2.5 6.3 1.1 -3.2 -2.1 5.6
$ t = table(date, sym, factorNum, factorValue);

$ t;

date

sym

factorNum

factorValue

2020.09.21

MS

1

1.2

2020.09.21

MS

2

-3.4

2020.09.21

GS

1

-2.5

2020.09.21

GS

2

6.3

2020.09.22

MS

1

1.1

2020.09.22

MS

2

-3.2

2020.09.22

GS

1

-2.1

2020.09.22

GS

2

5.6

$ select factorValue from t pivot by date, sym, factorNum;

date

sym

C1

C2

2020.09.21

GS

-2.5

6.3

2020.09.21

MS

1.2

-3.4

2020.09.22

GS

-2.1

5.6

2020.09.22

MS

1.1

-3.2

例5:select 子句包含多个指标

$ select factorValue, factorValue>0 as factorSign from t pivot by date, sym, factorNum;

date

sym

factorValue_1

factorValue_2

factorSign_1

factorSign_2

2020.09.21

GS

-2.5

6.3

0

1

2020.09.21

MS

1.2

-3.4

1

0

2020.09.22

GS

-2.1

5.6

0

1

2020.09.22

MS

1.1

-3.2

1

0

例6:若 pivot by 后面的列在原表中存在多个相同的值,pivot by 会进行去重,仅保留最后一个值的数据。

$ t = table((`C+string(1..10)) join (`C+string(1..10)) as sym, (2020.10.01+11..20) join (2020.10.01+11..20) as time, 21..40 as val)
$ select val from t pivot by time, sym;

time

C1

C10

C2

C3

C4

C5

C6

C7

C8

C9

2020.10.12

31

2020.10.13

32

2020.10.14

33

2020.10.15

34

2020.10.16

35

2020.10.17

36

2020.10.18

37

2020.10.19

38

2020.10.20

39

2020.10.21

40