1
$\begingroup$

I'm integrating Technical Analysis with Deep Learning for the first phase of my research. I want to know how I should pick (or group) stocks as input data, and whether there should be a relation between the selected stocks.
To elaborate: I've seen researchers use different sets of stocks. Some eliminate companies below a certain market cap, others use the full price history of the S&P 500, and I can't find the reasoning behind their choices. Is there a best practice for selecting the data set, or should I just do it intuitively?

$\endgroup$
4
  • $\begingroup$ Your question seems to imply that you want to use data for one stock to forecast the returns of another stock. If that's the case, it may be better to phrase the question as: how do I pick stocks whose technical indicators have predictive power for a given stock? If not, please clarify what you mean. By default, most people would use only the data for the same stock. $\endgroup$
    – LazyCat
    Commented Oct 24, 2018 at 20:53
  • $\begingroup$ @LazyCat I'm trying to predict the returns for each stock. I've seen that researchers use the historical price chart of different companies as input data. Shouldn't there be a relation between these companies to get better results? $\endgroup$
    – Saber
    Commented Oct 24, 2018 at 21:35
  • $\begingroup$ @LazyCat I edited my question to make it more clear. $\endgroup$
    – Saber
    Commented Oct 25, 2018 at 0:43
  • $\begingroup$ People sometimes eliminate stocks below a certain market cap, liquidity level, or stock price to keep trading costs low when the strategy is implemented in real life. A large investment fund in particular cannot realistically trade small and obscure stocks. $\endgroup$
    – Alex C
    Commented Oct 25, 2018 at 1:26

1 Answer

4
+50
$\begingroup$

There are a few exclusions that I have commonly seen:

  1. Excluding thinly traded stocks. The price that shows up in your data feed may not relate to actual tradable prices.

  2. Filtering out duplicate listings such as ADRs and Pink Sheet locals. The same stock can be listed in multiple places, which can lead you to think two tickers make a great pairs trade when they are actually the same stock with only listing differences. For example, CS (Credit Suisse NYSE ADR) and CSGKF (Credit Suisse Pink Sheet local). Screening for collinearity can be helpful as well.

  3. Removing stocks after a corporate acquisition announcement. Once a stock is being acquired for a fixed amount, it loses many of the properties that you are trying to analyze.

  4. Handling time synchronization issues. Some data sets will show you "close/settlement prices" that are taken at different points in time. For example, a US equity close price is taken at 4:00pm EST, an oil contract close price is taken at 2:30pm EST, and a bond future close price is taken at 3:00pm. If your algorithm tells you to buy oil when XOM closes above its 20-day moving average, and it assumes you could have transacted at 4:00pm at the 2:30pm price, the backtest is trading at a price that was no longer available, and you can imagine the errors that will occur.
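As a rough illustration of exclusions 1 and 2, here is a minimal sketch in plain Python. Everything in it is an assumption for the sake of the example: the tickers and prices are made up, the function names are hypothetical, and the thresholds (a median daily dollar volume of 1,000,000 and a return correlation of 0.98) are arbitrary placeholders you would tune yourself. It also assumes every ticker's bars are already aligned on the same dates.

```python
from itertools import combinations
from math import sqrt

def median_dollar_volume(bars):
    """bars: list of (close, volume) tuples for one ticker, oldest first."""
    values = sorted(close * volume for close, volume in bars)
    mid = len(values) // 2
    return values[mid] if len(values) % 2 else (values[mid - 1] + values[mid]) / 2

def daily_returns(bars):
    closes = [close for close, _ in bars]
    return [b / a - 1.0 for a, b in zip(closes, closes[1:])]

def pearson(x, y):
    """Pearson correlation of two equal-length series (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy) if vx > 0 and vy > 0 else 0.0

def screen_universe(history, min_dollar_volume=1_000_000, max_corr=0.98):
    """Drop thinly traded tickers, then flag pairs that look like the same stock."""
    liquid = {t: bars for t, bars in history.items()
              if median_dollar_volume(bars) >= min_dollar_volume}
    returns = {t: daily_returns(bars) for t, bars in liquid.items()}
    suspects = [(a, b) for a, b in combinations(sorted(returns), 2)
                if pearson(returns[a], returns[b]) > max_corr]
    return sorted(liquid), suspects

# Made-up data: AAA_ADR is AAA at half the price, TINY barely trades.
history = {
    "AAA":     [(100.0, 50_000), (101.0, 52_000), (99.0, 48_000), (102.0, 51_000), (103.0, 50_000)],
    "AAA_ADR": [(50.0, 100_000), (50.5, 100_000), (49.5, 100_000), (51.0, 100_000), (51.5, 100_000)],
    "BBB":     [(20.0, 300_000), (19.5, 300_000), (20.5, 300_000), (20.2, 300_000), (19.8, 300_000)],
    "TINY":    [(2.0, 1_000), (2.1, 900), (2.0, 1_100), (1.9, 1_000), (2.0, 950)],
}
kept, suspects = screen_universe(history)
```

With this toy data, `TINY` is dropped for low dollar volume and the `AAA` / `AAA_ADR` pair is flagged for manual review. A flagged pair is only a hint: two genuinely different stocks can also be highly correlated, so you would still check the listings by hand.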

And one important thing to screen back in:

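Many data sets will drop delisted / acquired stocks. When analyzing historical data, you need to make sure that your data set includes every candidate name that was actually trading at the time, including names that were later delisted or acquired. Otherwise your results will suffer from survivorship bias.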
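One way to picture this is a point-in-time universe: for each historical date, you keep the names that were listed on that date, whatever happened to them afterwards. The sketch below assumes you have listing and delisting dates from your data vendor. The tickers, the dates, and the `universe_on` helper are all hypothetical.

```python
from datetime import date

# Hypothetical listing records: (ticker, first trading day, last trading day or None if still listed)
LISTINGS = [
    ("ALIVE", date(2005, 1, 3), None),
    ("ACQD",  date(2001, 6, 1), date(2012, 3, 30)),   # acquired in 2012
    ("BUST",  date(1999, 4, 1), date(2009, 2, 27)),   # delisted in 2009
    ("NEWCO", date(2015, 9, 1), None),
]

def universe_on(as_of, listings=LISTINGS):
    """Return the tickers that were actually trading on `as_of`."""
    return sorted(
        ticker
        for ticker, first_day, last_day in listings
        if first_day <= as_of and (last_day is None or as_of <= last_day)
    )

universe_2008 = universe_on(date(2008, 6, 30))
universe_2016 = universe_on(date(2016, 1, 4))
```

Here `universe_2008` still contains `ACQD` and `BUST`, while a data set that only carries today's survivors would show `ALIVE` alone for that date.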

$\endgroup$
2
  • $\begingroup$ Thank you. I'm still uncertain about two things. First, can stocks be selected from different industry sectors? (For example, can I select stocks from companies like Microsoft and Amazon and also include a steel company? I've seen researchers do it, but I don't know how a pattern can be spotted.) Second, can I select any time range, or should I limit it to a specific period (e.g. based on Federal Reserve interest rates or a presidential election)? $\endgroup$
    – Saber
    Commented Oct 28, 2018 at 19:09
  • 1
    $\begingroup$ I think you need to separate it into two sub-topics: 1. Securities to eliminate to improve data quality / cleanliness. 2. The universe to focus on. My list above is the standard answer to #1. We always filter for those to make sure that what we are doing is repeatable and not subject to data idiosyncrasies that would cause real-world implementation to diverge from the test. Part #2 is up to you. It's like asking "What colors should I paint with?" Determining your focus set is the core of your creative endeavor, and there are no right or wrong answers. $\endgroup$
    – JoshK
    Commented Oct 28, 2018 at 19:14
