1 School of Artificial Intelligence, University of Chinese Academy of Sciences
2 Institute of Automation, Chinese Academy of Sciences
Email: {xi.cheng, jinghao.zhang, yunan.zeng}@cripac.ia.ac.cn, wenfang.xue@ia.ac.cn

MOT: A Mixture of Actors Reinforcement Learning Method by Optimal Transport for Algorithmic Trading

Xi Cheng 1,2    Jinghao Zhang 1,2    Yunan Zeng 1,2    Wenfang Xue 1,2
Abstract

Algorithmic trading refers to executing buy and sell orders for specific assets based on automatically identified trading opportunities. Strategies based on reinforcement learning (RL) have demonstrated remarkable capabilities in addressing algorithmic trading problems. However, the trading patterns differ among market conditions due to shifted distribution data. Ignoring multiple patterns in the data will undermine the performance of RL. In this paper, we propose MOT, which designs multiple actors with disentangled representation learning to model the different patterns of the market. Furthermore, we incorporate the Optimal Transport (OT) algorithm to allocate samples to the appropriate actor by introducing a regularization loss term. Additionally, we propose Pretrain Module to facilitate imitation learning by aligning the outputs of actors with expert strategy and better balance the exploration and exploitation of RL. Experimental results on real futures market data demonstrate that MOT exhibits excellent profit capabilities while balancing risks. Ablation studies validate the effectiveness of the components of MOT.

Keywords:
Algorithmic trading · Reinforcement learning · Optimal transport.

1 Introduction

The goal of algorithmic trading is to maximize long-term profits while keeping risks within an acceptable range [21]. Compared to the traditional approach of relying on the expert judgment of trading timing, algorithmic trading is highly automated and efficient.

Traditional technical analysis methods include mean reversion [10], momentum investing [11], multi-factor models [4], etc. However, financial market data is non-stationary and has a low signal-to-noise ratio, so expert-designed technical analysis methods cannot generate profits under all market conditions. Deep learning methods excel at capturing intricate price patterns and enhance model performance [28, 29, 15]. However, turning the output of a supervised model into actual investment decisions still requires constructing a strategy, which introduces expert knowledge and subjectivity. RL methods do not require strategies carefully designed by humans. They take market information as states and output trading decisions directly, which makes it easy to incorporate financial constraints (e.g. transaction costs and slippage) into the environment. RL has achieved state-of-the-art (SOTA) results in many quantitative investment tasks [16, 19, 30].

Figure 1: Profit of strategies in different market conditions. A bull market is suitable for momentum trading, while a volatile market is suitable for mean reversion trading.

However, these methods rely on the assumption that financial data always follow the same distribution, whereas data patterns often switch in real scenarios. For example, the most common way to classify market patterns is into two categories, stable (momentum) and volatile (reversal) markets, which require two categories of strategies [14]. These two phenomena are not independent but intertwined. As shown in Figure 1, when bullish forces far exceed bearish forces, the market is in a stable upward (bull) state; when the two are evenly balanced, it is in a volatile state. A momentum trading strategy models the momentum effect of a stable market, and a mean reversion strategy models the reversal effect of a volatile market. The same strategy can yield significantly different returns in different market conditions. Inspired by the mixture of experts [5], we propose MOT, which models multiple actors with disentangled representation learning and extracts various pattern information in RL. To allocate samples to the actors appropriately, we introduce the Allocation Module with an Optimal Transport (OT) regularization loss.

Previous research [16] has introduced imitation learning to RL, allowing agents to learn from expert knowledge. However, in the early stages of imitation learning, the actions sampled for training are not generated by the agent; they are given directly by the expert and stored in the buffer. As a result, the true output action of the agent differs significantly from the action stored in the buffer. To solve this problem, MOT adds a supervised pre-training step before imitation learning. This brings the output generated by the agent closer to the expert strategy, so that the model enters imitation learning from a better initial state.

The training process of MOT can be divided into three stages: first, the Pretrain Module uses supervised learning to train only the actor with expert strategy. Then we use the expert strategy to fill the buffer and train the RL model by imitation learning. After that, MOT uses multiple actors to model different market patterns and uses OT to solve the problem of pattern allocation. The contributions are summarized as follows:

1) MOT is the first method to introduce the OT algorithm into RL for mining various trading patterns. The Allocation Module allocates samples to the appropriate actors.

2) MOT is also the first study that addresses the imitation learning gap between the actor’s output and the buffer. MOT introduces a supervised Pretrain Module before imitation learning, which allows the real actor’s output to be closer to the expert strategy.

3) Experiments show that MOT is highly profitable in different market modes while balancing risks. Further studies confirm the effectiveness of the three components of MOT.

2 Problem Formulation

Table 1: Changes in Position Based on Trading Signals

| Po | Action | Po' | Operation |
|----|--------|-----|-----------|
| 0 | 1 | 1 | Take a long position |
| 1 | 1 | 1 | No operation |
| -1 | 1 | 1 | Close the position then go long |
| 0 | -1 | -1 | Take a short position |
| -1 | -1 | -1 | No operation |
| 1 | -1 | -1 | Close the position then go short |

The algorithmic trading problem can be represented as a Markov Decision Process (MDP) $\mathcal{M}=\langle\mathcal{S},\mathcal{A},\mathcal{P},\mathcal{R},\gamma\rangle$, where $\mathcal{S}$ represents the state space provided by the environment, $\mathcal{A}$ represents the action space, $\mathcal{P}:\mathcal{S}\times\mathcal{A}\times\mathcal{S}\rightarrow[0,1]$ is the probability function of the conditional state transitions, $\mathcal{R}:\mathcal{S}\times\mathcal{A}\rightarrow\mathbb{R}$ is the reward function, and $\gamma\in(0,1)$ is the discount factor. The specific definition of the five-tuple for the MDP is as follows:

The State space $\mathcal{S}$: The state $\textbf{S}_t=[\textbf{S}_t^m;\textbf{S}_t^a]\in\mathcal{S}$. The account indicators $\textbf{S}_t^a=[s_t^{a_1},s_t^{a_2},\dots]$ describe the trader’s positions, account cash balance, margin, returns, and other related information of the trader’s account. $\textbf{P}_t=[p_t^o,p_t^h,p_t^l,p_t^c,v_t^o,v_t^a]$ represents the Opening-High-Low-Closing (OHLC) prices, trading volume, and trading value. $\textbf{Q}_t=[q_t^1,q_t^2,\dots,q_t^i]$ are derived from $\textbf{P}_t$ and technical analysis. The market indicators $\textbf{S}_t^m=[\textbf{P}_t;\textbf{Q}_t]$ include the volume-price data $\textbf{P}_t$ and the technical indicators $\textbf{Q}_t$.

The Action space $\mathcal{A}$: The action $a_t\in\{-1,1\}$ represents the trading signal output by the agent: $-1$ corresponds to short selling and $1$ corresponds to a long position. We define the agent to trade in units of contracts. The actual execution of trades depends on the trading signal and the trader’s existing positions. The specific changes in position and action are summarized in Table 1, where $Po$ means position.
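The transitions in Table 1 can be sketched as a small function. This is a minimal illustration rather than the authors' implementation; the name `apply_signal` and the operation strings are hypothetical.

```python
from typing import Tuple


def apply_signal(position: int, action: int) -> Tuple[int, str]:
    """Return (new_position, operation) for one trading signal, following Table 1.

    `apply_signal` is a hypothetical helper name, not taken from the paper.
    """
    if position not in (-1, 0, 1):
        raise ValueError("position must be -1, 0, or 1")
    if action not in (-1, 1):
        raise ValueError("action must be -1 or 1")

    side = "long" if action == 1 else "short"
    if position == 0:
        operation = f"take a {side} position"
    elif position == action:
        operation = "no operation"
    else:
        operation = f"close the position then go {side}"
    # In every row of Table 1 the new position equals the signal.
    return action, operation


if __name__ == "__main__":
    for po in (0, 1, -1):
        for a in (1, -1):
            new_po, op = apply_signal(po, a)
            print(f"Po={po:+d}  action={a:+d}  Po'={new_po:+d}  {op}")
```

Because the new position always equals the signal, the position change $|Po' - Po|$ is 0, 1, or 2 contracts depending on the existing position.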

The Transition Function $\mathcal{P}$: We assume that the actions of individual traders do not affect the overall asset price in the market. This implies that the transition of the market indicators is independent of trading behavior, i.e. $\mathcal{P}(\textbf{S}_{t+1}^m|\textbf{S}_t)=\mathcal{P}(\textbf{S}_{t+1}^m|\textbf{S}_t,a_t)$. However, the transition of the account indicators is influenced by trading behavior, i.e. $\mathcal{P}(\textbf{S}_{t+1}^a|\textbf{S}_t)\neq\mathcal{P}(\textbf{S}_{t+1}^a|\textbf{S}_t,a_t)$.

The Reward $\mathcal{R}$: We use the closing price $p_t^c$ to calculate the profit $r_t$. To better simulate the real market, we set a transaction fee rate $\mu$ (transaction costs are charged as a percentage of the contract) and a slippage $\sigma$ (the difference between the expected and the actual execution price). The profit $r_t$ is defined as $r_t=(p_t^c-p_{t-1}^c-2\sigma)\cdot a_{t-1}-\mu\cdot p_t^c\cdot|\Delta po|$, where $\Delta po=Po'-Po$. When setting rewards, it is inappropriate to consider only the profit without taking the risk into account. The Sharpe ratio is the most widely used indicator for balancing risk and returns [24], defined as $SR=\frac{mean(r_t)}{std(r_t)}$. To measure the impact of each step’s profit on $SR$, we adopt the Differential Sharpe Ratio (DSR) [17] as the reward. Because recent data is more important than distant data in algorithmic trading, DSR employs Exponential Moving Average (EMA) smoothing. $DSR_t$ is defined as:

$$DSR_t=\frac{B_{t-1}\Delta A_t-\frac{1}{2}A_{t-1}\Delta B_t}{(B_{t-1}-A_{t-1}^2)^{\frac{3}{2}}}, \qquad (1)$$

representing the impact of each new profit $r_t$ on $SR$ after applying EMA. $A_t$ is the first moment and $B_t$ is the second moment of the profits $r_t$ estimated by EMA. We use $DSR_t$ as the reward $\mathcal{R}$. If the account balance is insufficient, trading is terminated early; we simulate this by setting a margin threshold.
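The profit and reward definitions above can be sketched as follows. The EMA updates $\Delta A_t = r_t - A_{t-1}$, $\Delta B_t = r_t^2 - B_{t-1}$, $A_t = A_{t-1} + \eta\,\Delta A_t$, and $B_t = B_{t-1} + \eta\,\Delta B_t$ follow the usual definition of the Differential Sharpe Ratio. The adaptation rate `eta`, the zero initialization, and the zero reward returned while the variance estimate is degenerate are assumptions that the text does not specify.

```python
def step_profit(close_t, close_prev, prev_action, delta_po, fee_rate, slippage):
    """Profit r_t = (p_t^c - p_{t-1}^c - 2*sigma) * a_{t-1} - mu * p_t^c * |delta_po|."""
    price_term = (close_t - close_prev - 2.0 * slippage) * prev_action
    fee_term = fee_rate * close_t * abs(delta_po)
    return price_term - fee_term


class DifferentialSharpe:
    """Running Differential Sharpe Ratio with EMA moment estimates (Equation 1)."""

    def __init__(self, eta=0.01):
        self.eta = eta  # assumed EMA adaptation rate, not given in the text
        self.A = 0.0    # EMA estimate of the first moment of r_t
        self.B = 0.0    # EMA estimate of the second moment of r_t

    def update(self, r):
        """Return DSR_t for profit r, then advance A and B by one step."""
        delta_a = r - self.A
        delta_b = r * r - self.B
        variance = self.B - self.A ** 2
        if variance > 1e-12:
            dsr = (self.B * delta_a - 0.5 * self.A * delta_b) / variance ** 1.5
        else:
            dsr = 0.0  # assumed fallback while the variance estimate is degenerate
        self.A += self.eta * delta_a
        self.B += self.eta * delta_b
        return dsr


if __name__ == "__main__":
    tracker = DifferentialSharpe(eta=0.05)
    closes = [100.0, 101.0, 100.5, 102.0]
    for prev, cur in zip(closes[:-1], closes[1:]):
        r = step_profit(cur, prev, prev_action=1, delta_po=0,
                        fee_rate=0.0005, slippage=0.1)
        print(round(r, 4), round(tracker.update(r), 4))
```

Keeping $A$ and $B$ as running state means each reward costs constant time, which suits step-by-step environment simulation.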

Figure 2: The architecture of MOT. First, we pretrain the actor using the expert strategy and then proceed with imitation learning. We model different market patterns using multiple actors and allocate samples to the actors using the Allocation Module.

3 Methodology

An overview of MOT is presented in Figure 2. First, to ensure alignment between the actions in the Demonstration Buffer and the actual outputs of the actor, we introduce the Pretrain Module. Second, we leverage imitation learning to initialize the RL algorithm. Third, we use multiple actors with disentangled representation learning to model various market conditions. Last, the Allocation Module allocates samples to different actors using the OT algorithm.

3.1 Imitation Learning

In RL-based algorithmic trading, the initial exploration phase is often inefficient and yields low profits. Imitation learning leverages expert knowledge and provides the actor with a favorable starting point. We employ PPO [23] as the backbone to address the MDP problem. To capture the temporal patterns of states $\textbf{S}_t$, we utilize Gated Recurrent Units (GRU) [1] to obtain the hidden representation $h_t=GRU(h_{t-1},\textbf{S}_t)$ of states $\textbf{S}_t$. $h_t$ is then fed into the actor and critic networks as input.

The actor network aims to find the optimal policy $\pi$ by maximizing the advantage function. The input is the environment state $\textbf{S}_t$ and the output is the action $a_t$. To ensure sufficient exploration by the agent, we add noise $\varepsilon$ to the output of the actor network. The actual executed action is $a_t=\pi_\theta(h_t)+\varepsilon$, where $\varepsilon$ represents the noise and $\pi_\theta$ is the policy given by the actor network with parameters $\theta$. The trading experience trajectories (SARS: state, action, reward, new state) are stored in the buffer $\mathcal{B}$. After sampling, we update the actor network and the critic network using the data from $\mathcal{B}$.

The value function $V$, computed by the critic network with parameters $\omega$, estimates the value of the sample under state $\textbf{S}_t$. It is optimized through the loss function:

$$\mathcal{L}^{VF}(\omega)=\mathbb{E}\left[(V_\omega(\textbf{S}_t)-V_t)^2\right], \qquad (2)$$

where $V_t=\sum_{t'=t}^{T-1}\mathbb{E}[\gamma^{T-t'-1}DSR_{t'}(\textbf{S}_{t'},a_{t'})]$ represents the empirical value of the accumulated future rewards $DSR$ and $T$ is the total number of time steps.

Let $\delta_t^V=DSR_t+\gamma V(\textbf{S}_{t+1})-V(\textbf{S}_t)$ represent the advantage value estimation. In our research, the advantage function is computed by the generalized advantage estimator (GAE) [23]: $\hat{A}_t^{GAE(\gamma,\lambda)}=\sum_{k=t}^{T-1}(\gamma\lambda)^{k-t}\delta_k^V$, where $\gamma$ is the discount factor and $\lambda$ controls the trade-off between variance and bias.
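The GAE sum can be computed with a single backward pass, since $\hat{A}_t = \delta_t^V + \gamma\lambda\,\hat{A}_{t+1}$. The sketch below assumes NumPy and a `values` array of length $T+1$ whose last entry is the bootstrap value of the final state; the function name and default hyperparameters are illustrative and not taken from the paper.

```python
import numpy as np


def gae(rewards, values, gamma=0.99, lam=0.95):
    """Generalized advantage estimates for one trajectory.

    rewards: length-T sequence of per-step rewards (DSR_t in this setting).
    values:  length-(T+1) sequence of critic values; values[T] is the
             bootstrap value of the final state (0 for a terminal state).
    """
    rewards = np.asarray(rewards, dtype=float)
    values = np.asarray(values, dtype=float)
    if values.shape[0] != rewards.shape[0] + 1:
        raise ValueError("values must have one more entry than rewards")

    # TD residuals: delta_t = r_t + gamma * V(S_{t+1}) - V(S_t)
    deltas = rewards + gamma * values[1:] - values[:-1]

    advantages = np.zeros_like(rewards)
    running = 0.0
    # Backward recursion: A_t = delta_t + gamma * lam * A_{t+1}
    for t in reversed(range(len(rewards))):
        running = deltas[t] + gamma * lam * running
        advantages[t] = running
    return advantages


if __name__ == "__main__":
    print(gae([0.1, -0.2, 0.3], [0.0, 0.1, 0.05, 0.0]))
```

With $\lambda = 0$ the estimate reduces to the one-step residual $\delta_t^V$; with $\lambda = 1$ it becomes the discounted sum of all remaining residuals.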

PPO introduces a surrogate objective function to measure the similarity between the updated policy and the previous policy. The policy ratio is $r_t(\theta)=\frac{\pi_\theta(a_t|\textbf{S}_t)}{\pi_{\theta_{\text{old}}}(a_t|\textbf{S}_t)}$ (not to be confused with the profit $r_t$), where $\pi_{\theta_{\text{old}}}$ and $\pi_\theta$ represent the original and updated policy respectively. The objective function $\mathcal{L}^{CLIP}(\theta)$ for the policy update is given in Equation 3, where $\epsilon$ is the clipping threshold.

We employ the commonly used Dual Thrust [13] as the expert strategy to provide demonstration actions. We store the demonstration trajectory SARS in Demonstration Buffer (DB) and train the agent using samples from DB. The training of the actor-critic network in imitation learning follows the same approach as the PPO algorithm, with the only difference being that the training data is from DB. Subsequently, the actor-critic network continues to train by PPO method, as shown in Equation 2 and Equation 3:

$$\mathcal{L}^{CLIP}(\theta)=\mathbb{E}\left[\min\left(\frac{\pi_\theta(a_t|\textbf{S}_t)}{\pi_{\theta_{old}}(a_t|\textbf{S}_t)}\hat{A}_t,\ clip(r_t(\theta),1-\epsilon,1+\epsilon)\hat{A}_t\right)\right]. \qquad (3)$$
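A minimal NumPy sketch of the clipped surrogate in Equation 3 follows. It assumes the policy probabilities are supplied as log-probabilities, which is a common numerical convention rather than something the text prescribes; the function name and the default `eps` are illustrative.

```python
import numpy as np


def ppo_clip_objective(logp_new, logp_old, advantages, eps=0.2):
    """Clipped PPO surrogate (to be maximized), averaged over a batch.

    logp_new / logp_old: log pi_theta(a_t|S_t) and log pi_theta_old(a_t|S_t).
    advantages:          advantage estimates for the same samples.
    """
    logp_new = np.asarray(logp_new, dtype=float)
    logp_old = np.asarray(logp_old, dtype=float)
    advantages = np.asarray(advantages, dtype=float)

    ratio = np.exp(logp_new - logp_old)            # policy ratio
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantages
    # The element-wise minimum gives a pessimistic bound on the improvement.
    return float(np.mean(np.minimum(unclipped, clipped)))


if __name__ == "__main__":
    value = ppo_clip_objective(np.log([0.6, 0.3]), np.log([0.5, 0.5]), [1.0, -1.0])
    print(value)
```

For a positive advantage the ratio stops contributing once it exceeds $1+\epsilon$, and for a negative advantage once it drops below $1-\epsilon$, which limits how far a single update can move the policy.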

3.2 Pretrain Module

The Pretrain Module is used to align the actions in the buffer $\mathcal{B}$ with the outputs of the actor. As mentioned before, $a_{expert}$ in DB is directly provided by the Dual Thrust strategy rather than generated by the actor network $\pi_\theta$. Therefore, when the demonstration data is used for gradient descent, there is a significant discrepancy between the distribution of the actor network’s output action $\pi_\theta$ and the action $a_{expert}$ [16]. This has a negative impact on the stability of the RL network.

To address this issue, we aim to align the output action $a_t=\pi_\theta$ of the actor network with the expert-provided action $a_{expert}$ by training the actor network using supervised learning. The loss function is defined as Equation 4:

$$\mathcal{L}^{pre}=CrossEntropy(a_{expert},\pi_\theta(h_t)) \qquad (4)$$

The Pretrain Module accelerates the actor’s understanding of the task by mimicking expert strategies and enhances the actor’s ability to engage effectively in the imitation learning process. The Pretrain Module is positioned before imitation learning, as shown in Figure 2.
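The pre-training loss in Equation 4 can be illustrated with a small NumPy sketch. It assumes the actor head emits two logits per sample, ordered as [short, long], and that expert actions in $\{-1, 1\}$ are mapped to class indices $\{0, 1\}$; both choices are assumptions made for this illustration and are not specified in the text.

```python
import numpy as np


def pretrain_loss(logits, expert_actions):
    """Mean cross-entropy between expert actions and the actor's action logits.

    logits:         array of shape (N, 2), scores for [short, long] (assumed layout).
    expert_actions: array of shape (N,), entries in {-1, 1}.
    """
    logits = np.asarray(logits, dtype=float)
    expert_actions = np.asarray(expert_actions)
    targets = (expert_actions + 1) // 2          # map -1 -> 0, 1 -> 1

    # Numerically stable log-softmax over the two action classes.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))

    picked = log_probs[np.arange(len(targets)), targets]
    return float(-picked.mean())


if __name__ == "__main__":
    logits = np.array([[0.2, 1.5], [2.0, -1.0], [0.0, 0.0]])
    expert = np.array([1, -1, 1])
    print(pretrain_loss(logits, expert))
```

Minimizing this loss pushes the actor's action distribution toward the expert's choices before any demonstration trajectory is used for PPO updates.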

3.3 Multiple Actors

We employ multiple actors to model strategies in different patterns. Futures data is derived from the trading activities of numerous participants and reflects different trading patterns [22]. Ignoring multiple patterns will reduce the performance of models [8]. All $k$ actors of MOT are constructed in the same manner, as depicted in Figure 2 and Equation 3. For convenience, we illustrate how the model is trained with $k=2$.

To integrate the outputs of the two actors, we use an Allocation Module to assign weights to them. To construct the Allocation Module, we first consider what inputs it should receive. The historical sequence of futures states $\textbf{S}_{t}$ plays a significant role in determining the current market pattern. Additionally, the historical decision errors of the actors reflect their decision-making performance and also influence the current sample allocation. We use a GRU to extract latent feature representations from $\textbf{S}_{t}^{i}$, denoted as $\hat{h}_{t}^{i}=GRU(\hat{h}_{t-1}^{i},\textbf{S}_{t}^{i})$, where $i$ denotes the $i$-th sample. To calculate the sample decision errors, we provide posterior teacher actions on the training set. The teacher action $a_{teacher}=1$ when the futures close price $p_{t}^{c}$ increases in the next time step and $-1$ otherwise.
Let $a^{i1}$ and $a^{i2}$ represent the actions output by actor 1 and actor 2. The sample decision error $\textbf{e}_{t}^{i}$ is then computed as $[a_{teacher\ t}^{i}-a_{t}^{i1},a_{teacher\ t}^{i}-a_{t}^{i2}]$. To avoid introducing future information, we utilize the previous error $\textbf{e}_{t-1}^{i}$.
Subsequently, we concatenate $\hat{h}_{t}^{i}$ and the embedding of the error sequence $\textbf{d}_{t-1}^{i}=GRU(\textbf{d}_{t-2}^{i},\textbf{e}_{t-1}^{i})$ and feed them into a fully connected layer to predict the allocation results, denoted as $\textbf{b}_{t}^{i}=FC(\hat{h}_{t}^{i},\textbf{d}_{t-1}^{i})$.
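The construction of the teacher actions and the lagged decision errors can be illustrated with a minimal sketch. It assumes plain Python lists of close prices and actor actions; the function names and the zero fill for the first step (where no previous error exists) are hypothetical choices made for illustration, not part of the described method.

```python
def teacher_actions(close):
    """Posterior teacher action: +1 if the next close is higher, else -1."""
    return [1 if close[t + 1] > close[t] else -1 for t in range(len(close) - 1)]


def decision_errors(teacher, actor1, actor2):
    """Error vector e_t = [a_teacher - a^1, a_teacher - a^2] for each step."""
    return [[a - a1, a - a2] for a, a1, a2 in zip(teacher, actor1, actor2)]


def lagged(errors, fill=(0.0, 0.0)):
    """Shift errors by one step so that step t only sees e_{t-1}.

    The fill value for the first step is an assumption of this sketch.
    """
    return [list(fill)] + [list(e) for e in errors[:-1]]


close = [100.0, 101.0, 100.5, 100.5, 102.0]
teacher = teacher_actions(close)
errors = decision_errors(teacher, [1, 1, -1, -1], [-1, -1, 1, 1])
print(teacher)         # [1, -1, -1, 1]
print(lagged(errors))  # [[0.0, 0.0], [0, 2], [-2, 0], [0, -2]]
```

The lag is what keeps the allocation input free of future information: the error at step $t$ depends on the price at $t+1$, so only $\textbf{e}_{t-1}$ is available when deciding at step $t$.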

Figure 3: OT refers to assigning $x$ to the actor with the minimum $L_{err}^{ij}$ while achieving a balanced allocation proportion, $\frac{x\ to\ Actor\ 1}{x\ to\ Actor\ 2}\approx\frac{w_{1}}{w_{2}}$. The pink circles represent $L_{err}^{ij}$.

Under different market patterns, the Allocation Module should pay different attention to the two actors, as expressed in Equation 5, where $\textbf{q}_{t}^{i}$ represents the allocation weights and $a_{t}^{i}$ represents the final action. To keep the discrete allocation differentiable, we utilize the Gumbel-softmax method [9] to compute Equation 5. It is worth noting that the allocation of samples is not binary, but rather a soft allocation with $0<\textbf{q}_{t}^{i}<1$.

$$\textbf{q}_{t}^{i}=softmax(\textbf{b}_{t}^{i}),\ a_{t}^{i}={\textbf{q}_{t}^{i}}^{T}[a_{t}^{i1},a_{t}^{i2}], \qquad (5)$$
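As a rough illustration of Equation 5, the sketch below draws a soft Gumbel-softmax sample from the allocation logits and mixes the two actor outputs with the resulting weights. The temperature `tau`, the clamping constant, and the function names are assumptions of this sketch; the text does not specify them.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def gumbel_softmax(logits, tau=1.0, rng=random):
    """Soft sample: softmax((logits + Gumbel noise) / tau)."""
    noisy = []
    for v in logits:
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)  # keep both logs finite
        g = -math.log(-math.log(u))                     # Gumbel(0, 1) noise
        noisy.append((v + g) / tau)
    return softmax(noisy)


def mix_actions(q, actions):
    """Final action a = q^T [a^1, a^2]."""
    return sum(w * a for w, a in zip(q, actions))


q = gumbel_softmax([0.3, -0.2], tau=1.0, rng=random.Random(0))
print(q, mix_actions(q, [1.0, -1.0]))
```

Because the weights are strictly between 0 and 1 and sum to 1, the mixed action always lies between the two actor outputs, which matches the soft allocation described above.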

However, for the actors to learn different patterns, their representations should be as dissimilar as possible. Inspired by disentangled representation learning, we take the inputs $x$ of the actors’ last layers as the representations and design a disentangled loss to enable the agent to learn different patterns, $\mathcal{L}^{dis}=\sum_{i=1}^{N}x_{i1}\cdot x_{i2}$.
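The disentangled loss is a sum of inner products between the two actors' representations of the same sample. A minimal sketch, assuming each representation is a plain list of floats (the function name is hypothetical):

```python
def disentangle_loss(x1, x2):
    """Sum over samples i of the inner product x_{i1} . x_{i2}."""
    return sum(
        sum(u * v for u, v in zip(row1, row2))
        for row1, row2 in zip(x1, x2)
    )


# Orthogonal representations give zero loss; identical ones give a large loss.
print(disentangle_loss([[1.0, 0.0], [0.0, 2.0]], [[0.0, 3.0], [1.0, 0.0]]))  # 0.0
print(disentangle_loss([[1.0, 2.0]], [[1.0, 2.0]]))                          # 5.0
```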

Algorithm 1 Training process of MOT
1:  Initialize actor network parameters $\theta_{0}$, critic network parameters $\omega_{0}$ and epochs $K$
2:  Obtain the expert strategy
3:  Pretrain the actor by $\mathcal{L}^{pre}$ in Equation 4
4:  Add the expert strategy to DB and train by imitation learning, get the dual policies $\pi_{\theta_{j}}(a|\textbf{S})$, $j=1,2$
5:  for $k=0,1,2,...$ do
6:     Collect the trajectory $\tau_{t}=(\textbf{S}_{t},a_{t},DSR_{t},\textbf{S}_{t+1})_{t=0}^{T-1}$ by allocating the policy in Equation 5
7:     Compute advantages $\hat{A}_{t}$ by current value $V_{\omega_{t}}(\textbf{S}_{t})$
8:     Compute the policy ratio $\frac{\pi_{\theta_{t}}(a_{j}|\textbf{S}_{t})}{\pi_{\theta_{t-1}}(a_{j}|\textbf{S}_{t})}$
9:     Compute the loss $\mathcal{L}^{OT}$ and $\mathcal{L}^{dis}$ in Equation 7
10:     Update the policy network by maximizing the clipped objective using $\mathcal{L}^{actor}(\theta)$ in Equation 7 (both for actor 1 and actor 2)
11:     Update the critic network by minimizing loss $\mathcal{L}^{VF}(\omega)$ in Equation 2
12:  end for

3.4 Optimal Transport Regularization

However, the model lacks a mechanism to ensure the effective allocation of samples to actors. Sometimes, the majority of samples are assigned to one actor. We incorporate OT techniques to ensure that the Allocation Module assigns more appropriate samples to each actor, thereby capturing diverse patterns more accurately.

We need to consider two main requirements. Firstly, the Allocation Module should allocate each sample to the actor with the smallest decision error. In other words, if $|a_{teacher\ t}^{i}-a_{t}^{i1}|>|a_{teacher\ t}^{i}-a_{t}^{i2}|$, we tend to assign the sample to actor 2. Secondly, the allocation of samples to the actors should be proportional to their respective patterns.

Below, we formally define the allocation problem. Assume we utilize $N$ samples in each epoch of PPO’s gradient descent process. Based on the error vector, we can construct an error matrix denoted as $L_{err}\in[N\times 2]$. Each element $L_{err}^{ij}$ in it represents the decision error of the $i$-th sample on the $j$-th actor, given by $L_{err}^{ij}=a_{teacher}^{i}-a^{ij}$. Corresponding to $L_{err}$ is the allocation matrix $M\in[N\times 2]$, where each element $M^{ij}\in\left\{0,1\right\}$. A value of 1 in the allocation matrix $M$ indicates that the Allocation Module assigns the $i$-th sample to the $j$-th actor, while a value of 0 indicates no allocation.

The OT method is particularly suitable for solving allocation problems. OT involves determining an optimal allocation of resources from one location to another while minimizing the overall cost or distance. It is also commonly employed to measure the difference between two probability distributions. Our goal is to find the optimal allocation scheme that minimizes $L_{err}$. The specific formulation of the problem is as follows,

$$\begin{aligned} \min_{M}\quad & (L\cdot M) \\ s.t.\quad & \frac{\sum_{i=1}^{N}M^{i1}}{N}=w_{1},\ \frac{\sum_{i=1}^{N}M^{i2}}{N}=w_{2},\ M^{i1}+M^{i2}=1,\ \forall i=1,2,...,N, \end{aligned} \qquad (6)$$

where $w_{1}$ and $w_{2}$ represent the proportions corresponding to the different modes (each assumed to be $\frac{1}{2}$). We employ the Sinkhorn method to solve the OT problem [2]. Figure 3 provides a visual explanation of the problem we aim to address.
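To make the balanced allocation concrete, the following sketch runs Sinkhorn iterations on a small $N\times 2$ cost matrix. It is an entropy-regularized relaxation of Equation 6, not necessarily the exact solver configuration used in the experiments: the use of absolute decision errors as costs, the regularization strength `eps`, the iteration count, and the final argmax rounding are all assumptions of this example.

```python
import math


def sinkhorn(cost, col_weights, eps=0.1, iters=200):
    """Entropic OT plan for an n x k cost matrix.

    Rows carry uniform mass 1/n, columns carry mass col_weights[j].
    The plan is rescaled by n so that each row sums to (about) 1.
    """
    n, k = len(cost), len(col_weights)
    kernel = [[math.exp(-c / eps) for c in row] for row in cost]
    u, v = [1.0] * n, [1.0] * k
    for _ in range(iters):
        # Alternate row scaling and column scaling.
        u = [(1.0 / n) / sum(kernel[i][j] * v[j] for j in range(k)) for i in range(n)]
        v = [col_weights[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(k)]
    return [[n * u[i] * kernel[i][j] * v[j] for j in range(k)] for i in range(n)]


# Absolute decision errors of 4 samples on 2 actors (illustrative numbers).
cost = [[0.0, 2.0], [0.2, 1.8], [0.4, 1.6], [1.9, 0.1]]
plan = sinkhorn(cost, col_weights=[0.5, 0.5])
hard = [row.index(max(row)) for row in plan]
print(hard)  # [0, 0, 1, 1]
```

Three of the four samples have a smaller error on the first actor, but the constraint $w_{1}=w_{2}=\frac{1}{2}$ moves the sample with the smallest error gap to the second actor, which is the balancing behaviour illustrated in Figure 3.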

To align the distribution of the output $\textbf{q}^{i}$ from the Allocation Module with $M^{i}$ from the OT problem, we incorporate a cross-entropy loss term. Considering the Allocation Module as part of the actors, Equation 3 can be expanded to Equation 7, where $\lambda_{O}$ is a hyperparameter and the third term is $\mathcal{L}^{OT}$. The pseudocode for MOT is shown in Algorithm 1.

$$\mathcal{L}^{actor}(\theta)=\mathcal{L}^{CLIP}(\theta)+\mathcal{L}^{dis}+\lambda_{O}\sum_{k=1}^{2}M_{t}^{ik}log(\textbf{q}_{t}^{ik}). \qquad (7)$$
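The third term of Equation 7 can be evaluated per sample as below. This is a minimal sketch assuming a one-hot row of $M$ and a soft weight vector $\textbf{q}$; the small floor inside the logarithm and the function name are hypothetical additions for numerical safety.

```python
import math


def ot_alignment_term(m_row, q_row, floor=1e-12):
    """sum_k M^{ik} * log(q^{ik}) for one sample."""
    return sum(m * math.log(max(q, floor)) for m, q in zip(m_row, q_row))


agree = ot_alignment_term([1, 0], [0.9, 0.1])     # q agrees with the OT target
disagree = ot_alignment_term([1, 0], [0.1, 0.9])  # q contradicts the OT target
print(agree > disagree)  # True
```

The term is never positive and approaches zero as $\textbf{q}$ concentrates on the actor selected by $M$, so increasing it pulls the Allocation Module toward the OT assignment.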

4 Experiments

4.1 Dataset

We utilize the IF stock index futures dataset, whose underlying asset is the CSI 300 Index. The dataset provides minute-level trading data of contracts. Each minute bar includes OHLC, trading volume, etc. The total trading duration in a day is 240 minutes. We collected it from ricequant.com (a well-known Chinese quantitative trading platform, https://www.ricequant.com/), and divided the data into a training set from 2015-12-31 to 2018-05-08 and a test set from 2018-05-09 to 2019-05-09.

4.2 Baselines, Evaluation Metrics and Hyperparameters

Baselines: Long Position Hold (buy futures and hold), Short Position Hold (borrow contracts and hold), Dual Thrust [13] (a technical analysis trading strategy commonly used for intraday trading), GRU [1] (a variant of RNNs; we chose it as a baseline because we employed the GRU method in the Pretrain Module before imitation learning, so the results of GRU demonstrate the performance of the Pretrain Module), PPO [23] (an RL method that improves stability by preventing large policy changes; we enhance PPO using the imitation learning mentioned in the Methodology Section), iRDPG [16] (SOTA: an off-policy algorithm that incorporates expert strategy and behavior cloning).

Evaluation Metrics: We measure the model’s performance by Accumulated Rate of Return (ARR, the overall profitability), Volatility (VO, measured by the standard deviation of profit $r$), Annualized Sharpe Ratio (ASR, the annualized version of $SR$), Maximum Drawdown (MDD, the maximum decline of an asset’s value from its peak to the lowest point over a period), Calmar Ratio ($CR=\frac{ARR}{MDD}$, risk-adjusted ARR based on MDD) and Sortino Ratio ($SoR=\frac{mean(r)}{std(min(r,0))}$, excess return per unit of downside risk).
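The risk metrics above can be computed from an equity curve and a return series as sketched below. Several details are assumptions of this example rather than statements about the evaluation code: drawdown is expressed as a fraction of the running peak, the population standard deviation is used, and annualization (needed for ASR) is left out because its factor is not specified here.

```python
from statistics import mean, pstdev


def max_drawdown(equity):
    """Largest relative decline from a running peak to a later low."""
    peak, worst = equity[0], 0.0
    for value in equity:
        peak = max(peak, value)
        worst = max(worst, (peak - value) / peak)
    return worst


def volatility(returns):
    """VO: standard deviation of the per-period profit r."""
    return pstdev(returns)


def sortino_ratio(returns):
    """SoR = mean(r) / std(min(r, 0)); undefined if no return is negative."""
    downside = pstdev([min(r, 0.0) for r in returns])
    return mean(returns) / downside


def calmar_ratio(arr, mdd):
    """CR = ARR / MDD."""
    return arr / mdd


equity = [100.0, 120.0, 90.0, 110.0, 130.0]
returns = [0.2, -0.25, 0.2, 0.1]
print(max_drawdown(equity))    # 0.25
print(sortino_ratio(returns))
print(calmar_ratio(0.3, max_drawdown(equity)))
```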

Hyperparameters: We set the transaction fee rate $\mu=2.3\times 10^{-5}$ and slippage $\sigma=0.2$. Insufficient account assets may trigger a forced liquidation. We set the margin threshold as $70\%$ and the initial capital $C=50000\ CNY$. We repeated the experiment 6 times for each model.

Table 2: Experimental Results ($\uparrow$ indicates the higher the better, $\downarrow$ indicates the opposite)

| Methods | ARR ($\uparrow$) | VO ($\downarrow$) | ASR ($\uparrow$) | MDD ($\downarrow$) | CR ($\uparrow$) | SoR ($\uparrow$) |
| --- | --- | --- | --- | --- | --- | --- |
| Long Hold | -2.598 | 0.261 | -0.638 | 113.121 | -0.001 | -0.080 |
| Short Hold | 3.163 | 0.259 | 0.782 | 0.894 | 0.041 | 0.093 |
| Dual Thrust | 10.130 | 0.253 | 2.628 | 0.033 | 3.962 | 0.365 |
| GRU | 11.342 (1.12) | 0.242 (0.00) | 3.004 (0.31) | 0.016 (0.02) | 4.280 (0.23) | 0.399 (0.05) |
| iRDPG | 14.453 (0.98) | 0.254 (0.01) | 3.955 (0.18) | 0.023 (0.03) | 5.881 (3.21) | 0.537 (0.03) |
| PPO | 12.245 (0.23) | 0.243 (0.00) | 3.223 (0.05) | 0.022 (0.02) | 4.281 (0.23) | 0.436 (0.01) |
| MOT-ND | 15.322 (1.25) | 0.246 (0.01) | 4.252 (0.24) | **0.005 (0.01)** | **7.277 (3.51)** | 0.587 (0.07) |
| MOT-NO | 17.236 (1.05) | 0.248 (0.01) | 4.447 (0.18) | 0.026 (0.01) | 5.558 (0.75) | 0.529 (0.08) |
| MOT | **20.379 (0.85)** | **0.228 (0.00)** | **5.395 (0.26)** | 0.011 (0.02) | 6.582 (0.66) | **0.605 (0.05)** |

Figure 4: Performance of different models in terms of ARR

4.3 Experimental Results

Table 2 provides a summary of the results, and Figure 4 (a) depicts the ARR of all the methods. As Table 2 shows, MOT outperforms the other methods in terms of profit and risk-reward balance. ARR is the most crucial indicator, and our model achieves the highest ARR. The ARR of PPO is about 0.9 higher than that of GRU (12.245 versus 11.342), indicating that PPO exhibits greater robustness. ASR, CR, and SoR are composite metrics that consider both risk and return. Deep learning methods (last 6 rows in Table 2) outperform the technical indicator models (first 3 rows in Table 2) in these three metrics, which suggests that the former better represent complex states under high-noise conditions. MOT ranks second in terms of MDD, indicating that MOT requires only a short time to recover from losses. RL models outperform time-series models, as the latter primarily focus on predicting price trends without considering the high costs caused by incorrect predictions. Since greater risk leads to greater returns, profits are higher when there are significant price fluctuations, so the returns of most methods are highly correlated.

4.4 Ablation Study

We conducted ablation experiments to show the effectiveness of the three components of MOT. The experimental results and the trend of ARR are shown in Table 2 and Figure 4 (b).

Overall performance. MOT-NP applies imitation learning based on PPO without the Pretrain Module. MOT-ND is obtained by removing multiple actors from the final model, while MOT-NO eliminates the OT process. From Figure 4 (b), we observe that the ARR curve of MOT remains higher than those of the other variants in most periods. Table 2 shows that MOT performs best in terms of ARR, VO, ASR, and SoR. Among the three modules, the OT method contributes the most to the improvement of model performance, followed by the Pretrain Module. MOT-ND excels in the MDD metric, indicating that the model without the multiple-actor design tends to generate more conservative strategies, although a conservative trading strategy often misses the optimal investment opportunities. Since the calculation of CR relies on MDD, MOT-ND also exhibits a higher CR.

Effectiveness of Pretrain Module. The influence of the expert strategy in DB diminishes over time and the benefit of imitation learning is mainly observed in the early stages. For the ablation experiment, we selected the agent trained for 100 epochs after imitation learning. Figure 4 (c) illustrates the impact of Pretrain Module on imitation learning and the yellow curve is the model with Pretrain Module. It can be observed that MOT-ND demonstrates a steady increase accompanied by minor fluctuations in profit. In contrast, MOT-NP experiences some declines and doesn’t learn well. This indicates that Pretrain Module contributes to the improvement of imitation learning.

Effectiveness of multiple actors and OT modeling. Figure 5 demonstrates the variation in the weights assigned to the two actors before and after OT modeling. In a relatively volatile period, the model assigns weights more randomly without OT, whereas it assigns higher weights to actor 2 with OT. Notably, the introduction of OT leads to higher returns and enhances the ability to capture complex patterns. Figure 4 (d) illustrates the impact of the number of actors $k$ on MOT. MOT achieves the best profitability when $k=2$ and the worst when $k=1$. This indicates that a single actor is insufficient to capture all patterns, while an excessive number of actors may lead to redundancy. In our model, the optimal number of actors is 2.

Figure 5: Effectiveness of OT modeling

5 Related Work

Investment strategies based on expert knowledge. Early methods used expert knowledge to construct heuristic rules [10, 20], which can be divided into two categories: fundamental analysis and technical analysis. Fundamental analysis captures diverse factors such as industry trends, company financial statements, and public opinion. This method is more commonly used by long-term investors to find undervalued assets. Popular technical indicators include the Relative Strength Index [27], Average Direction Index [6], On-Balance Volume [26], etc. Commonly used investment strategies include momentum trading [7] and the mean reversion strategy [20]. However, technical indicators are often correlated with each other, and building them directly from market data introduces too much market noise. Typically, rules constructed based on expert knowledge can only capture trading opportunities under specific market conditions [3].

Investment strategies based on RL. In contrast to supervised learning, which still requires expert knowledge to construct strategies, RL can optimize strategies in an end-to-end form. Moody et al. [18] made the first attempt to apply the recurrent RL (RRL) algorithm to algorithmic trading. However, traditional RL methods are not well-suited for environments with large state spaces, making it challenging to select market features. Deep RL methods have partially addressed this problem. Si et al. [25] argue that strategies need to consider multiple factors and combine multi-objective optimization with deep RL to address this issue. Oliveira et al. [19] adopt SARSA, which maps states and actions to specific cells in a table to learn the value function. Since insufficient financial data causes overfitting, Jeong et al. [12] divided stocks into groups based on their correlations and introduced transfer learning into the Deep Q-Network (DQN). To shorten the inefficient random exploration phase, iRDPG [16] incorporates technical analysis through imitation learning. Yuan et al. [30] argue that daily frequency data cannot meet the high demands of RL and instead use minute frequency data. Moreover, the PPO algorithm achieves more stable returns compared to the DQN and SAC algorithms.

6 Conclusion

In this paper, we propose MOT, an RL-based model for algorithmic trading problems. Specifically, we model the algorithmic trading problem as an MDP and leverage imitation learning to enable the agent to learn from expert knowledge. To better initialize MOT, we introduce the Pretrain Module prior to the imitation learning phase. Considering that futures prices result from different patterns, we employ multiple actors with disentangled representation learning to model the patterns. We design the Allocation Module to integrate the outputs of multiple actors and incorporate OT techniques to guide the learning of the Allocation Module. Experimental results demonstrate that our model achieves superior profitability while controlling risk, showcasing its robustness in financial markets with complex data patterns. Further ablation studies confirm the effectiveness of the three components of MOT.

Acknowledgements

This work was supported by the National Natural Science Foundation of China (No. 72374201).

References

  • [1] Chung, J., Gulcehre, C., Cho, K., Bengio, Y.: Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555 (2014)
  • [2] Cuturi, M.: Sinkhorn distances: Lightspeed computation of optimal transport. NIPS 26 (2013)
  • [3] Deng, Y., Bao, F., Kong, Y., Ren, Z., Dai, Q.: Deep direct reinforcement learning for financial signal representation and trading. IEEE Trans Neural Netw Learn Syst 28(3), 653–664 (2016)
  • [4] Fama, E.F., French, K.R.: Multifactor explanations of asset pricing anomalies. The journal of finance 51(1), 55–84 (1996)
  • [5] Fedus, W., Zoph, B., Shazeer, N.: Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity. JMLR 23(1), 5232–5270 (2022)
  • [6] Gurrib, I., et al.: Performance of the average directional index as a market timing tool for the most actively traded usd based currency pairs. Banks and Bank Systems 13(3), 58–70 (2018)
  • [7] Hong, H., Stein, J.C.: A unified theory of underreaction, momentum trading, and overreaction in asset markets. The Journal of finance 54(6), 2143–2184 (1999)
  • [8] Houlsby, N., Giurgiu, A., Jastrzebski, S., Morrone, B., De Laroussilhe, Q., Gesmundo, A., Attariyan, M., Gelly, S.: Parameter-efficient transfer learning for nlp. In: ICML. pp. 2790–2799. PMLR (2019)
  • [9] Jang, E., Gu, S., Poole, B.: Categorical reparameterization with gumbel-softmax. arXiv preprint arXiv:1611.01144 (2016)
  • [10] Jegadeesh, N., Titman, S.: Returns to buying winners and selling losers: Implications for stock market efficiency. The Journal of finance 48(1), 65–91 (1993)
  • [11] Jegadeesh, N., Titman, S.: Cross-sectional and time-series determinants of momentum returns. The Review of Financial Studies 15(1), 143–157 (2002)
  • [12] Jeong, G., Kim, H.Y.: Improving financial trading decisions using deep q-learning: Predicting the number of shares, action strategies, and transfer learning. Expert Systems with Applications 117, 125–138 (2019)
  • [13] Kim, H.j., Shin, K.s.: A hybrid approach based on neural networks and genetic algorithms for detecting temporal patterns in stock markets. Applied Soft Computing 7(2), 569–576 (2007)
  • [14] Li, Z., Tam, V.: A machine learning view on momentum and reversal trading. Algorithms 11(11), 170 (2018)
  • [15] Lin, H., Zhou, D., Liu, W., Bian, J.: Learning multiple stock trading patterns with temporal routing adaptor and optimal transport. In: Proceedings of the 27th ACM SIGKDD. pp. 1017–1026 (2021)
  • [16] Liu, Y., Liu, Q., Zhao, H., Pan, Z., Liu, C.: Adaptive quantitative trading: An imitative deep reinforcement learning approach. In: Proceedings of the AAAI conference on artificial intelligence. vol. 34, pp. 2128–2135 (2020)
  • [17] Moody, J., Saffell, M.: Reinforcement learning for trading. NIPS 11 (1998)
  • [18] Moody, J., Wu, L.: Optimization of trading systems and portfolios. In: Proceedings of the IEEE/IAFE 1997 computational intelligence for financial engineering (CIFEr). pp. 300–307. IEEE (1997)
  • [19] de Oliveira, R.A., Ramos, H.S., Dalip, D.H., Pereira, A.C.M.: A tabular sarsa-based stock market agent. In: Proceedings of the First ACM International Conference on AI in Finance. pp. 1–8 (2020)
  • [20] Poterba, J.M., Summers, L.H.: Mean reversion in stock prices: Evidence and implications. Journal of financial economics 22(1), 27–59 (1988)
  • [21] Pricope, T.V.: Deep reinforcement learning in quantitative algorithmic trading: A review. arXiv preprint arXiv:2106.00123 (2021)
  • [22] Ritter, J.R.: Behavioral finance. Pacific-Basin finance journal 11(4), 429–437 (2003)
  • [23] Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017)
  • [24] Sharpe, W.F.: Mutual fund performance. The Journal of business 39(1), 119–138 (1966)
  • [25] Si, W., Li, J., Ding, P., Rao, R.: A multi-objective deep reinforcement learning approach for stock index future’s intraday trading. In: 2017 10th ISCID. vol. 2, pp. 431–436. IEEE (2017)
  • [26] Tsang, W.W.H., Chong, T.T.L., et al.: Profitability of the on-balance volume indicator. Economics Bulletin 29(3), 2424–2431 (2009)
  • [27] Wilder, J.W.: New concepts in technical trading systems. Trend Research (1978)
  • [28] Xu, W., Liu, W., Wang, L., Xia, Y., Bian, J., Yin, J., Liu, T.Y.: Hist: A graph-based framework for stock trend forecasting via mining concept-oriented shared information. arXiv preprint arXiv:2110.13716 (2021)
  • [29] Xu, W., Liu, W., Xu, C., Bian, J., Yin, J., Liu, T.Y.: Rest: Relational event-driven stock trend forecasting. In: Proceedings of the Web Conference 2021. pp. 1–10 (2021)
  • [30] Yuan, Y., Wen, W., Yang, J.: Using data augmentation based reinforcement learning for daily stock trading. Electronics 9(9), 1384 (2020)