Portfolio Optimization with Deep-Q Learning

Fintech MK
8 min read · Nov 17, 2021
Fintech Thursdays — Stojancho Tudjarski

Disclaimer

This text will be a little more technical than the previous texts on the topic of Fintech. Still, it provides information about the latest state-of-the-art AI and Deep Learning technologies used for optimizing financial portfolios.

Therefore, it is probably worth reading, at least to grasp a bird's-eye view of what the technology makes possible, but let it be you who judges :-)

Introduction to Reinforcement Learning (RL)

Reinforcement learning, in formal terms, is a method of machine learning wherein a software agent learns to perform actions in an environment that lead it to the maximum reward. It does so through exploration and exploitation of the knowledge it gains by repeated trials of maximizing the reward.

Now, a slightly less formal definition: you teach a machine how to achieve something you want by giving it the chance to make a lot of random attempts, and you systematically extract from those attempts the most optimal way to solve the problem.

Who is to blame for Reinforcement Learning?

It’s Google. Its DeepMind (one of the companies acquired by Google) popularized it by training models that can beat humans at several popular games. The most famous one is Go, but old-fashioned Atari games, Doom, Dota 2, and others have been mastered as well.

Some key terms that describe the basic elements of an RL problem are:

§ Environment — Physical world in which the agent operates

§ State — Current situation of the agent

§ Reward — Feedback from the environment

§ Policy — Method to map agent’s state to actions

§ Value — Future reward that an agent would receive by taking an action in a particular state

The steps involved in a typical RL algorithm are as follows:

1. First, the agent interacts with the environment by performing an action

2. The agent performs an action and moves from one state to another

3. And then the agent will receive a reward based on the action it performed

4. Based on the reward, the agent will understand whether the action was good or bad

5. If the action was good, that is, if the agent received a positive reward, then the agent will prefer performing that action again; otherwise, it will try another action in search of a positive reward. So, it is basically a trial-and-error learning process, sketched in code below
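To make the loop above concrete, here is a minimal sketch of the agent-environment interaction in Python. `ToyEnvironment` and `RandomAgent` are hypothetical placeholders rather than a specific library's API; they exist only to show where each of the five steps happens.

```python
import random

class ToyEnvironment:
    """A hypothetical toy environment: the state is a step counter, the episode ends after 10 steps."""
    def reset(self):
        self.t = 0
        return self.t

    def step(self, action):
        self.t += 1
        reward = 1.0 if action == 1 else 0.0   # reward: feedback from the environment
        done = self.t >= 10
        return self.t, reward, done

class RandomAgent:
    """A placeholder agent that only explores, by acting randomly."""
    def act(self, state):
        return random.choice([0, 1])

    def learn(self, state, action, reward, next_state, done):
        pass  # a real agent would update its policy / Q-values here

env, agent = ToyEnvironment(), RandomAgent()
state, done, total_reward = env.reset(), False, 0.0
while not done:
    action = agent.act(state)                              # 1-2. the agent performs an action ...
    next_state, reward, done = env.step(action)            # 2-3. ... moves to a new state and receives a reward
    agent.learn(state, action, reward, next_state, done)   # 4-5. good/bad feedback, trial and error
    state = next_state
    total_reward += reward
print(total_reward)
```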

A deeper technical explanation of reinforcement learning is out of the scope of this text. It's complicated.

The good thing here is that all you need to do is map your business-domain feature space onto the terms listed above and start using some ready-made modules.

Let’s focus on this in the remainder of this text.

How Does RL Relate to Stocks?

Fact # 1: Stock prices inherently appear as time-series data

Fact # 2: Stocks are inherently hard to predict. Elon Musk smokes weed on a podcast => TSLA stock goes down. A nurse in Denmark is found with health problems two days after a Covid-19 vaccination => MRNA stock goes down within the next 3 minutes. I was there. I lost money.

Fact # 3: So, even though stock prices appear as time-series data, what we can do with Reinforcement Learning for them is still unclear.

So, let’s make it clear …

How about looking at the stock market from the following perspective:

Fact # 1: Environment: Stock market

Fact # 2: State: Our portfolio stock prices

Fact # 3: Action: Take or leave a BUY or SELL position

Fact # 4: Reward: Simply, the money

Fact # 5: Policy goal: Make more money

Consequence # 1: Now we are talking about Reinforcement Learning

Consequence # 2: We have everything we need to start diving into applying RL to stocks.

But. As already said, the stock market is a volatile thing, and volatility comes with increased risk. One method of lowering the risk is diversification. Diversification means putting your eggs in different baskets, i.e., playing with multiple stocks at the same time. This lowers the maximum possible gain, but if the gains come more frequently, in general, you make more money.

Therefore:

Consequence # 3: We dive seriously into the stock pool, playing with several stocks at the same time and using RL to optimize the gain.

Now, let’s see how to implement this. The state we track consists of three pieces of information:

Information # 1: How many shares of each stock we own.
For example, [3, 5, 7] means:

§ 3 shares of Apple AAPL

§ 5 shares of Motorola MSI

§ 7 shares of Starbucks SBUX

Information # 2: The share price of each stock.
For example, [400, 80, 200] means:

§ An AAPL share costs $400

§ An MSI share costs $80

§ An SBUX share costs $200

Information # 3: How much cash we have uninvested.
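As a minimal sketch, these three pieces of information can be packed into a single state vector, for example with NumPy. The share counts and prices are the ones from the example above; the cash amount is a hypothetical figure.

```python
import numpy as np

shares_owned = np.array([3, 5, 7])              # Information #1: shares of AAPL, MSI, SBUX we own
share_prices = np.array([400.0, 80.0, 200.0])   # Information #2: current price per share
cash = 1000.0                                   # Information #3: uninvested cash (hypothetical amount)

# The 7-number observation the agent sees at each step.
state = np.concatenate([shares_owned, share_prices, [cash]])
# -> array([  3.,   5.,   7., 400.,  80., 200., 1000.])
```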

The atomic actions are: Buy, Sell, and Hold.

This means that, in our situation:

- Three stocks: AAPL, MSI, SBUX

- 3³ = 27 possibilities

This way:

- [Sell, Sell, Sell] == Sell everything we have

- [Buy, Sell, Hold] == Buy AAPL, sell MSI, and do nothing with SBUX (the full enumeration is sketched below)
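For three stocks and three atomic actions, the 27 composite actions can be enumerated directly, e.g. with `itertools.product`. This is a small illustration, not code from any particular trading library.

```python
from itertools import product

ATOMIC_ACTIONS = ["Sell", "Hold", "Buy"]
TICKERS = ["AAPL", "MSI", "SBUX"]

# All 3^3 = 27 composite actions: one atomic action per stock.
all_actions = list(product(ATOMIC_ACTIONS, repeat=len(TICKERS)))

print(len(all_actions))   # 27
print(all_actions[0])     # ('Sell', 'Sell', 'Sell')  -> sell everything we have
```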

Still, this is not enough: we don’t know how many shares to buy or sell.

Reinforcement Learning actions for trading: simplified

We will ignore transaction costs.

When we sell, we sell all the shares of the stock we chose to sell. Not that far from reality: we buy and sell in batches.

When we buy, we buy as many shares as possible, keeping the available cash in mind.

Yet, this is still not deterministic enough: when we decide to buy more than one stock, how many shares of each do we buy?

The answer: to simplify this, we buy in a round-robin fashion until we spend all of our available cash: one of this, then one of that, then one of this … until EOC (End Of Cash).

We first sell, then we buy. This way, we have more cash available for buying.
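Below is a rough sketch of these simplified rules: first sell all shares of every stock we decided to sell, then buy the chosen stocks round-robin until the cash runs out. The function and variable names are illustrative assumptions, and transaction costs are ignored, as stated above.

```python
def execute_trades(action, shares_owned, prices, cash):
    """Apply one composite action, e.g. ('Buy', 'Sell', 'Hold'), under the simplified rules."""
    sell_idx = [i for i, a in enumerate(action) if a == "Sell"]
    buy_idx = [i for i, a in enumerate(action) if a == "Buy"]

    # Sell first: dump all shares of every stock we decided to sell.
    for i in sell_idx:
        cash += shares_owned[i] * prices[i]
        shares_owned[i] = 0

    # Then buy round-robin: one share of each chosen stock at a time, until the cash runs out.
    bought_something = True
    while bought_something and buy_idx:
        bought_something = False
        for i in buy_idx:
            if cash >= prices[i]:
                shares_owned[i] += 1
                cash -= prices[i]
                bought_something = True
    return shares_owned, cash

# Example: buy AAPL, sell MSI, hold SBUX, starting from [3, 5, 7] shares and $1,000 in cash.
shares, cash = execute_trades(("Buy", "Sell", "Hold"), [3, 5, 7], [400.0, 80.0, 200.0], 1000.0)
print(shares, cash)   # [6, 0, 7] 200.0
```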

Finally: The Reward

The reward in each of the possible states is simply the value of the stocks we hold, plus the free cash.

For example: we own 5 AAPL shares, 3 MSI shares, and 7 SBUX shares, and we still have $1,500 available for trading.

Thus, the value of the reward in that state is:

Reward = 5 * 290 + 3 * 70 + 7 * 280 + 1500 = $5,120 (ouch, math hurts)
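The same computation, as a couple of lines of Python using the numbers from the example:

```python
shares = [5, 3, 7]               # AAPL, MSI, SBUX shares we own
prices = [290.0, 70.0, 280.0]    # current share prices from the example
cash = 1500.0                    # uninvested cash

reward = sum(s * p for s, p in zip(shares, prices)) + cash
print(reward)   # 5120.0
```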

Having all of this in mind, we can train an RL agent to make the optimal trades, based on the available historical data.
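For completeness, here is a compact sketch of the Deep Q-Learning machinery from the title, written with Keras. The 7-number state and the 27 composite actions come from the example above; the network size and hyperparameters are assumptions, and a separate target network is omitted for brevity, so treat this as an outline rather than a ready-to-train agent. Experience tuples would be pushed into `replay` by an interaction loop like the one sketched earlier.

```python
import random
from collections import deque

import numpy as np
import tensorflow as tf

STATE_DIM, N_ACTIONS = 7, 27              # state size and action count from the example above
GAMMA, EPSILON, BATCH = 0.95, 0.1, 32     # assumed hyperparameters

# Q-network: maps a state to one estimated Q-value per composite action.
q_net = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu", input_shape=(STATE_DIM,)),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(N_ACTIONS),
])
q_net.compile(optimizer="adam", loss="mse")

replay = deque(maxlen=10_000)             # holds (state, action, reward, next_state, done) tuples

def choose_action(state):
    if random.random() < EPSILON:                                    # exploration
        return random.randrange(N_ACTIONS)
    return int(np.argmax(q_net.predict(state[None], verbose=0)))     # exploitation

def train_step():
    if len(replay) < BATCH:
        return
    batch = random.sample(list(replay), BATCH)
    states = np.array([b[0] for b in batch])
    next_states = np.array([b[3] for b in batch])
    targets = q_net.predict(states, verbose=0)
    next_q = q_net.predict(next_states, verbose=0)
    for i, (_, action, reward, _, done) in enumerate(batch):
        # Bellman target: reward now plus discounted best Q-value of the next state.
        targets[i, action] = reward if done else reward + GAMMA * np.max(next_q[i])
    q_net.fit(states, targets, epochs=1, verbose=0)
```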

What we have so far is “only” yet another trading strategy.

Let’s integrate it into a trading bot framework.

Introduction to FreqTrade

FreqTrade is a framework for building crypto-trading bots.

- Based on Python 3.7+; runs under Windows, macOS, and Linux.

- Persistence: Persistence is achieved through sqlite.

- Dry-run: Run the bot without paying money.

- Backtesting: Run a simulation of your buy/sell strategy.

- Edge position sizing: Calculate your win rate, risk-reward ratio, and the best stop-loss, and adjust your position size before taking a position for each specific market.

- Blacklist/whitelist cryptocurrencies to work with.

- Manageable via Telegram: Manage the bot with Telegram. Control it from your phone!

- Display profit/loss in fiat: Display your profit/loss in 33 fiat currencies.

- Daily summary of profit/loss: Provide a daily summary of your profit/loss.

- Performance status report: Provide a performance status of your current trades.
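To give a feel for what plugging a strategy into FreqTrade looks like, here is a rough skeleton of a FreqTrade strategy class with a simple RSI rule instead of the RL agent described above. Method and signal-column names have changed across FreqTrade versions (newer releases use populate_entry_trend / populate_exit_trend), so check the official documentation before relying on this sketch.

```python
import talib.abstract as ta
from pandas import DataFrame
from freqtrade.strategy import IStrategy

class SketchStrategy(IStrategy):
    timeframe = "5m"
    minimal_roi = {"0": 0.04}     # take profit at +4%
    stoploss = -0.10              # hard stop-loss at -10%

    def populate_indicators(self, dataframe: DataFrame, metadata: dict) -> DataFrame:
        dataframe["rsi"] = ta.RSI(dataframe, timeperiod=14)
        return dataframe

    def populate_buy_trend(self, dataframe: DataFrame, metadata: dict) -> DataFrame:
        dataframe.loc[dataframe["rsi"] < 30, "buy"] = 1    # buy signal when oversold
        return dataframe

    def populate_sell_trend(self, dataframe: DataFrame, metadata: dict) -> DataFrame:
        dataframe.loc[dataframe["rsi"] > 70, "sell"] = 1   # sell signal when overbought
        return dataframe
```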

Now, let’s involve some optimization in the process.

We don’t have to deal only with the current prices. A little bit of history is important as well.

One way of capturing the “history” is to group several consecutive price movements into tokens, tokenizing the history so that the most frequent movement sequences become the longest tokens.

Another way (IMHO preferable) is to use old-style candlesticks.

Can we automatically recognize candlestick patterns?

The answer: Yes, of course.

How? Pythonically, of course :-)

The magic steps here are simply:

  1. C:\pip install TA-Lib
  2. Start using it

(Old but still good joke: Do you work in C? Yes, of course. How? Each time I start my computer, I see C:\.)
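A small example of what "start using it" can look like: TA-Lib ships a family of CDL* functions that scan OHLC arrays for classic candlestick patterns. The price arrays below are randomly generated placeholders; in practice they would come from your market-data feed.

```python
import numpy as np
import talib

# Hypothetical OHLC arrays; in practice these come from your market-data feed.
open_ = np.random.uniform(95.0, 105.0, 100)
high = open_ + np.random.uniform(0.0, 5.0, 100)
low = open_ - np.random.uniform(0.0, 5.0, 100)
close = np.random.uniform(low, high)

# Each CDL* function returns 0 where the pattern is absent and +100/-100 where it is detected.
hammer = talib.CDLHAMMER(open_, high, low, close)
engulfing = talib.CDLENGULFING(open_, high, low, close)
doji = talib.CDLDOJI(open_, high, low, close)

print("Hammer candles at indices:", np.nonzero(hammer)[0])
print("Engulfing candles at indices:", np.nonzero(engulfing)[0])
```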

Some references:

§ Option Pricing Using Reinforcement Learning | by Letian Wang | The Startup | Medium

§ Combining Technical Indicators with Deep Learning for Stock Trading | by Victor Sim | Artificial Intelligence in Plain English | Jan, 2021 | Medium

§ Automating Stock Investing Technical Analysis With Python | by Farhad Malik | FinTechExplained | Feb, 2021 | Medium

§ A review of Reinforcement learning for financial time series prediction and portfolio optimization | by Nick Smith | Journal of Quantitative finance | Jan, 2021 | Medium

§ Algorithmic Trading with Economics Driven Deep Learning | by Yao Lei Xu | Jan, 2021 | Towards Data Science

§ Encoder-Decoder Model for Multistep Time Series Forecasting Using PyTorch | by Gautham Kumaran | Towards Data Science

Author:

Stojancho Tudjarski

A software engineering graduate with 29 years of experience in the IT industry.

He spent the last 11 years as an AI practitioner, learning, doing, and consulting. For the last 18 months he has been intensively focused on the latest breakthroughs in the NLP area of AI, including the BERT and GPT-2 language models, and has experimented with them a lot. He trained his own BERT and GPT-2 models for all of the Balkan languages, making them publicly available for the first time at https://huggingface.co/macedonizer, with the support of the Macedonizers team from FINKI.

He has experimented extensively with fine-tuning GPT-2 on Shakespeare’s writings, among other things. The result: a neural network capable of writing poetry in Shakespeare’s style, at scale. In addition, he went through the process of mixing styles, which resulted in an automatic text generator about Covid-19 in Shakespeare’s style. All the experiments are documented at https://dmind.ai. The same was done with Blaze Koneski’s writing style, with the autogenerated texts available on the same site.

He monitors the latest developments in the AI area daily and evaluates where new academic advances can be monetized in business.


Fintech MK

The first Fintech community in North Macedonia, which aims to develop and enable the Fintech ecosystem regionally.