Difference between revisions of "Pair Database"

From Pair Trading Lab WIKI
(Created page with "This page is under construction.")
 
m (Profit Measures)
 
(14 intermediate revisions by one user not shown)
This is the manual for the new pair database system released on March 30th, 2015. The new database fully replaces the older solution introduced in 2012, which was based on studies applied to all sector-based pairs. The new solution is standalone (no longer based on studies) and also includes cointegration and orthogonal measures.
 
Unlike the old system, this pair database includes all brute-force combinations of qualified US equities '''regardless of sector'''. An equity qualifies for the database if, over the last 2 months of the screening period, its average price exceeds 0.2 USD and its average volume exceeds 80k. For the database period of Jan 1st 2012 - Jan 1st 2015, this gives more than 10,000,000 pairs.
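To get a feeling for these numbers, here is a minimal Python sketch of the qualification rule and the brute-force pair count. It assumes a pair is an unordered combination of two equities; the function names are made up for illustration and are not part of the system.

```python
from math import comb, isqrt

MIN_AVG_PRICE_USD = 0.2   # threshold quoted in the text
MIN_AVG_VOLUME = 80_000   # "80k" quoted in the text


def is_qualified(avg_price: float, avg_volume: float) -> bool:
    """Both averages are assumed to cover the last 2 months of the screening period."""
    return avg_price > MIN_AVG_PRICE_USD and avg_volume > MIN_AVG_VOLUME


def pair_count(equities: int) -> int:
    """Number of unordered pairs (A/B and B/A are counted once)."""
    return comb(equities, 2)


def min_equities_for(pairs: int) -> int:
    """Smallest number of equities that yields strictly more than `pairs` pairs."""
    n = (1 + isqrt(1 + 8 * pairs)) // 2  # starting guess from n(n-1)/2 = pairs
    while pair_count(n) <= pairs:
        n += 1
    return n


print(min_equities_for(10_000_000))  # equities needed to exceed 10,000,000 pairs
```

Under this assumption, a few thousand qualified equities are already enough to produce more than 10,000,000 pairs.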
 
For each pair, we took the last 3 years of data and calculated profit measures and orthogonal stats, performed cointegration tests, and so on. In practice, we have included most of the data you can see and play with in the pair analyzer.
 
We have set up a high-performance cluster that lets us run hundreds of tests and calculations for each of these >10,000,000 pairs. All the data is then indexed in a special database for fast searching. The system can execute all kinds of complex queries on this database in less than 30 seconds.
 
== Data description ==
[[File:Megabase result.png|center]]
For each pair, we have performed certain tests and calculations. You can filter pairs using any of these measures or test results. In addition, you can sort results based on some of these measures.
 
=== General statistics ===
* Slow correlation - average correlation with period 240 applied to the whole period, value between -1 and 1
* Fast correlation - average correlation with period 60 applied to the whole period, value between -1 and 1
* Half-life of ratio - [http://marcoagd.usuarios.rdc.puc-rio.br/half-life.html half-life] of the ratio series
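The measures above can be sketched in plain Python. This is only an illustration under assumptions: the text does not say whether the correlation is taken on prices or on returns, and the half-life here comes from a simple AR(1) fit, which may differ from the estimator actually used by the database. All function names are hypothetical.

```python
import math


def pearson(xs, ys):
    """Plain Pearson correlation of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def average_rolling_correlation(a, b, period):
    """Mean of the correlations over every full window of `period` points."""
    ends = range(period, len(a) + 1)
    values = [pearson(a[e - period:e], b[e - period:e]) for e in ends]
    return sum(values) / len(values)


def half_life(series):
    """Half-life implied by an AR(1) fit x[t] = c + phi * x[t-1] (assumed model)."""
    prev, curr = series[:-1], series[1:]
    n = len(prev)
    mean_p, mean_c = sum(prev) / n, sum(curr) / n
    phi = (sum((p - mean_p) * (c - mean_c) for p, c in zip(prev, curr))
           / sum((p - mean_p) ** 2 for p in prev))
    if not 0.0 < phi < 1.0:
        return math.inf  # sentinel: no plain exponential decay under this model
    return -math.log(2) / math.log(phi)
```

With these helpers, the slow correlation would be `average_rolling_correlation(a, b, 240)` and the fast one `average_rolling_correlation(a, b, 60)`.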
 
One important point at this stage: some measures (like the half-life) depend on the variable ordering. What does that mean? A pair is created from two equities (legs), for instance A and B. The variable ordering is whether you build the pair as A/B or as B/A. For some measures (like correlation) it does not matter. But for the half-life of the ratio series, for instance, it does, because the A/B pair clearly has a different ratio series from the B/A pair. How has this been solved in the database?
For all cases where the ordering matters, we calculate the measure for each order (half-life of A/B and half-life of B/A in this case).
 
Then we apply this trick:
* we save the better result of the measure (over A/B vs B/A) and call it '''the best case scenario''' (loose matching when searching)
* we save the worse result of the measure (over A/B vs B/A) and call it '''the worst case scenario''' (strict matching when searching)
 
While searching (or adding sort fields), you can specify whether you want to work with the best or the worst case scenario.
So for instance, if you choose "the best case scenario" and filter for a half-life shorter than X, it is the same as if you had asked "give me pairs having a half-life shorter than X for '''any''' combination of A/B, B/A". If you choose "the worst case scenario" and filter for the half-life, it is the same as if you had asked "give me pairs having a half-life shorter than X for '''both''' combinations of A/B, B/A".
 
This works the same way for all measures which differ with the pair leg ordering.
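The best/worst case trick can be sketched as follows. The class and function names are hypothetical, and the example assumes a measure where a lower value is better (such as the half-life).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OrderedMeasure:
    """A measure computed once per leg ordering (illustrative container)."""
    a_over_b: float
    b_over_a: float
    lower_is_better: bool = True  # true for half-life; assumption for other measures

    @property
    def best(self) -> float:
        pick = min if self.lower_is_better else max
        return pick(self.a_over_b, self.b_over_a)

    @property
    def worst(self) -> float:
        pick = max if self.lower_is_better else min
        return pick(self.a_over_b, self.b_over_a)


def shorter_than(measure: OrderedMeasure, limit: float, scenario: str) -> bool:
    """scenario is 'best' (loose matching) or 'worst' (strict matching)."""
    value = measure.best if scenario == "best" else measure.worst
    return value < limit


half_life = OrderedMeasure(a_over_b=12.0, b_over_a=30.0)
print(shorter_than(half_life, 20.0, "best"))   # matches if ANY ordering is below 20
print(shorter_than(half_life, 20.0, "worst"))  # matches only if BOTH orderings are below 20
```

Storing only the best and the worst value is enough to answer both the "any" and the "both" question without keeping the two orderings separately.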
 
=== Cointegration ===
Cointegration test results and cointegration residual measures:
* [http://en.wikipedia.org/wiki/Coefficient_of_determination R-squared] - measure of the cointegration regression fit
* ADF test [http://en.wikipedia.org/wiki/P-value p-value] of individual variables (you can use it together with the next field to form your custom confidence cointegration test, e.g. for 97% confidence)
* Cointegration [http://en.wikipedia.org/wiki/P-value p-value] - p-value of the final step of the Engle-Granger test
* β coefficient (OLS) - regression coefficient of the cointegration regression (OLS)
* Half-life (cointegration) - half-life of the cointegration residuals
* Cointegrated @ 95 % - Engle-Granger test result, passed if the pair is proven to be cointegrated at 95% confidence
* Cointegrated @ 99 % - Engle-Granger test result, passed if the pair is proven to be cointegrated at 99% confidence
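A custom confidence test built from the two p-value fields could look like the sketch below. It follows the conventional reading of the Engle-Granger procedure (the individual series should not look stationary, the residuals should); how the database itself combines the two fields is an assumption here, and the function name is hypothetical.

```python
def passes_custom_cointegration_test(individual_adf_p: float,
                                     cointegration_p: float,
                                     confidence: float = 0.97) -> bool:
    """Combine both p-value fields at a user-chosen confidence level."""
    alpha = 1.0 - confidence
    # Step 1 (assumed): a unit root must NOT be rejected for the individual variables.
    levels_look_non_stationary = individual_adf_p > alpha
    # Step 2: a unit root must be rejected for the cointegration residuals.
    residuals_look_stationary = cointegration_p <= alpha
    return levels_look_non_stationary and residuals_look_stationary


print(passes_custom_cointegration_test(0.40, 0.01))        # checked at 97% confidence
print(passes_custom_cointegration_test(0.40, 0.04, 0.95))  # checked at 95% confidence
```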
 
For all fields here you have to specify the best/worst case scenario, except for:
* R-squared - it is the same for both variable orderings
* ADF test p-value of individual variables - it always works with the worst case scenario
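Why R-squared needs no scenario while the β coefficient does can be seen from a small OLS sketch. It assumes a simple regression with intercept on raw values; the text does not say whether the database regresses prices or log prices, and the function name is made up for illustration.

```python
def ols_fit(x, y):
    """Fit y = alpha + beta * x and return (beta, r_squared)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((v - mean_x) ** 2 for v in x)
    syy = sum((v - mean_y) ** 2 for v in y)
    sxy = sum((u - mean_x) * (v - mean_y) for u, v in zip(x, y))
    beta = sxy / sxx
    r_squared = (sxy * sxy) / (sxx * syy)  # symmetric in x and y
    return beta, r_squared


a = [1.0, 2.0, 3.0, 4.0, 5.0]
b = [2.1, 3.9, 6.2, 7.8, 10.1]
beta_ab, r2_ab = ols_fit(a, b)  # b regressed on a
beta_ba, r2_ba = ols_fit(b, a)  # a regressed on b
print(beta_ab, beta_ba)  # the two betas differ
print(r2_ab, r2_ba)      # R-squared agrees up to rounding
```

In this simple setting the product of the two betas equals R-squared, so the two OLS betas are exact inverses of each other only for a perfect fit.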
 
Read more about cointegration at [http://en.wikipedia.org/wiki/Cointegration Wikipedia].
 
=== Orthogonal Stats ===
For each pair in the database, an [http://en.wikipedia.org/wiki/Total_least_squares orthogonal regression (TLS)] has been calculated and the residuals of this regression have been analyzed:
* β coefficient (TLS) - coefficient of the regression. Because changing the order of the variables gives the inverse value, for this field we '''always take the value >= 1''' (in absolute value)
* Half-life (orthogonal) - half-life of orthogonal residuals
* [http://en.wikipedia.org/wiki/Skewness Skewness] - skewness of orthogonal residuals
* [http://en.wikipedia.org/wiki/Kurtosis Kurtosis] - (excess) kurtosis of orthogonal residuals
* [http://artax.karlin.mff.cuni.cz/r-help/library/asbio/html/DH.test.html Doornik-Hansen] [http://en.wikipedia.org/wiki/P-value p-value] - result of the Doornik-Hansen normality test applied to orthogonal residuals. If >= 0.95, the D-H test is passed with 95% confidence
* [http://en.wikipedia.org/wiki/Shapiro%E2%80%93Wilk_test Shapiro-Wilk] [http://en.wikipedia.org/wiki/P-value p-value] - result of the Shapiro-Wilk normality test applied to orthogonal residuals. If >= 0.95, the S-W test is passed with 95% confidence
* PACF rating - autocorrelation rating of the residuals. This number is always <= 0 (0 = the best rating). [http://en.wikipedia.org/wiki/Partial_autocorrelation_function What is PACF?]
 
Orthogonal stats do not depend on pair leg ordering.
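The inverse relation between the two TLS betas, and the ">= 1" convention, can be checked with a short sketch. It uses the standard closed-form slope of a two-variable orthogonal regression; whether the database computes it exactly this way is an assumption, and both function names are hypothetical.

```python
import math


def tls_beta(x, y):
    """Slope of the orthogonal (total least squares) line of y against x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((v - mean_x) ** 2 for v in x)
    syy = sum((v - mean_y) ** 2 for v in y)
    sxy = sum((u - mean_x) * (v - mean_y) for u, v in zip(x, y))
    diff = syy - sxx
    # Not defined when sxy == 0 (uncorrelated legs); this sketch does not handle that case.
    return (diff + math.sqrt(diff * diff + 4.0 * sxy * sxy)) / (2.0 * sxy)


def normalized_beta(beta):
    """Keep the representation whose absolute value is >= 1, as the field does."""
    return beta if abs(beta) >= 1.0 else 1.0 / beta


a = [1.0, 2.0, 3.0, 4.0, 5.0]
b = [2.1, 3.9, 6.2, 7.8, 10.1]
print(tls_beta(a, b) * tls_beta(b, a))  # product of the two orderings, expected close to 1
print(normalized_beta(tls_beta(a, b)), normalized_beta(tls_beta(b, a)))
```

Because both orderings describe the same fitted line, the residuals (and therefore the other orthogonal stats) are the same whichever leg comes first.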
 
=== Profit Measures ===
For each pair, the following backtests have been performed:
* backtests for A/B order (unlimited in both directions)
* backtests for A/B order (limited in long direction only)
* backtests for A/B order (limited in short direction only)
* backtests for B/A order (unlimited in both directions)
* backtests for B/A order (limited in long direction only)
* backtests for B/A order (limited in short direction only)
 
In each step above, a total of 19 backtests were performed (Ratio, Ratio-RSI and Residual models with different settings, plus one Kalman-grid backtest).
Results from the backtests above have been aggregated in this way:
* for backtests unlimited in both directions, results have been aggregated together and we got these stats:
** Median [http://www.investopedia.com/terms/c/cagr.asp CAGR]
** Median Linearity
** Median [http://www.investopedia.com/terms/s/sharperatio.asp Sharpe ratio]
** Median system score
** Min [http://www.investopedia.com/terms/c/cagr.asp CAGR]
** Min Score
These results have been aggregated over all pair leg orders (except Min CAGR and Min Score). For Min CAGR and Min Score, you also have to select the best/worst case scenario related to pair order. Select "Trading in both direction" to work with these measures.
Use these fields to filter/sort on general pairs trading profitability with no direction limits.
 
* for backtests limited to a single direction (long or short), results have been aggregated together and we got the same stats:
** Median [http://www.investopedia.com/terms/c/cagr.asp CAGR]
** Median Linearity
** Median [http://www.investopedia.com/terms/s/sharperatio.asp Sharpe ratio]
** Median system score
** Min [http://www.investopedia.com/terms/c/cagr.asp CAGR]
** Min Score
These results have been aggregated using the better case of the long and short directions. That means, if long-only stats were better than short-only stats, you will see long-only stats here. Select "Trading in single direction" to work with these measures.
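The aggregation described above can be sketched roughly as follows. The record layout, the function names and the rule for deciding which direction is "better" (here: the higher median score) are assumptions made for illustration only.

```python
from statistics import median

METRICS = ("cagr", "linearity", "sharpe", "score")


def bt(cagr, linearity, sharpe, score):
    """One backtest result (hypothetical record layout)."""
    return {"cagr": cagr, "linearity": linearity, "sharpe": sharpe, "score": score}


def medians(results):
    return {f"median_{m}": median([r[m] for r in results]) for m in METRICS}


def minimums(results):
    return {"min_cagr": min(r["cagr"] for r in results),
            "min_score": min(r["score"] for r in results)}


def both_directions(ab, ba):
    """Medians pooled over both leg orders; minimums kept per best/worst scenario."""
    min_ab, min_ba = minimums(ab), minimums(ba)
    best = {k: max(min_ab[k], min_ba[k]) for k in min_ab}
    worst = {k: min(min_ab[k], min_ba[k]) for k in min_ab}
    return {**medians(ab + ba), "best": best, "worst": worst}


def single_direction(long_only, short_only, key="median_score"):
    """Return the stats of whichever direction looks better on `key` (assumed rule)."""
    long_stats = {**medians(long_only), **minimums(long_only)}
    short_stats = {**medians(short_only), **minimums(short_only)}
    return long_stats if long_stats[key] >= short_stats[key] else short_stats


ab = [bt(0.10, 0.8, 1.2, 50), bt(0.02, 0.5, 0.4, 20)]
ba = [bt(0.06, 0.7, 0.9, 40), bt(-0.04, 0.3, -0.2, 10)]
print(both_directions(ab, ba))
```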
 
Use these fields to filter/sort on direction-limited pairs trading performance. This is useful when searching for candidates for specific arbitrages (like "stock outperforms index" arbitrage). You will want to limit pair strategies coming from a search like this to use only long or only short positions.
 
For some pairs, not all backtests have been performed and aggregated (this can happen if there was not enough data available for backtests with longer periods). You can therefore also filter on the number of backtests performed, to include only pairs with a sufficient number of aggregated backtests. The field name is "Backtests performed".
 
In addition to aggregated profit stats, the pair database also includes [http://www.investopedia.com/terms/s/sharperatio.asp Sharpe ratios] of concrete strategies:
* Ratio model with period 15
* Residual model with period 40
* Ratio-RSI model with period 10
* Kalman-grid model
 
The strategies above have been run for each pair leg order, with unlimited and limited direction, and for two periods:
* full 3 years
* recent 6 months
 
Only the Sharpe ratios of these concrete strategies have been indexed. Select the "Sharpe ratio" field to filter/sort using these measures. You will have to select additional options (worst/best case scenario, direction, model, period...).
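To see how many concrete Sharpe ratios this implies per pair, the option space can be enumerated as below. This sketch assumes that "limited direction" means the long-only and short-only variants from the backtest list; the labels and the `pick` helper are illustrative and not the system's field names.

```python
from itertools import product

MODELS = ("Ratio (period 15)", "Residual (period 40)",
          "Ratio-RSI (period 10)", "Kalman-grid")
ORDERS = ("A/B", "B/A")
DIRECTIONS = ("unlimited", "long only", "short only")  # assumed reading of "limited"
PERIODS = ("full 3 years", "recent 6 months")

variants = list(product(MODELS, ORDERS, DIRECTIONS, PERIODS))
print(len(variants))  # concrete Sharpe ratios per pair under these assumptions


def pick(sharpes, model, direction, period, scenario):
    """Collapse the two leg orders into the best or the worst case scenario."""
    values = [sharpes[(model, order, direction, period)] for order in ORDERS]
    return max(values) if scenario == "best" else min(values)
```

A search on the "Sharpe ratio" field then corresponds to fixing the model, direction and period, and choosing the scenario that collapses the two leg orders.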
 
=== Instrument filters ===
In addition to pair measures, you can also filter pairs using current instrument (equity) measures. These measures are not part of the pair database; they are "extra" conditions using '''current''' market data. This means that tomorrow you may see slightly different results than yesterday (because market data is updated).
 
You can filter on:
* Average daily volume - in shares traded
* Market capitalization - in millions
* Instrument types (regular stock, ETF, ETN...)
* Instrument sector
* Instrument text (you can for instance search for pairs containing the word "gas" or "gold" in one or both instrument titles)
* Instrument not delisted - you can filter delisted stocks away
 
For most filters above, you can select whether just one leg or both legs must match your condition.
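The "one leg or both legs" choice boils down to an any/all decision over the two instruments. A minimal sketch, with a made-up instrument record and hypothetical helper names:

```python
def pair_matches(leg_a, leg_b, predicate, require_both):
    """Apply an instrument condition to one or to both legs of a pair."""
    hits = (predicate(leg_a), predicate(leg_b))
    return all(hits) if require_both else any(hits)


def title_contains(word):
    """Build a case-insensitive text condition on the instrument title."""
    return lambda instrument: word.lower() in instrument["title"].lower()


# Illustrative instruments, not real database records.
leg_a = {"title": "Example Gold Miners ETF", "avg_daily_volume": 2_500_000}
leg_b = {"title": "Example Natural Gas Fund", "avg_daily_volume": 400_000}

print(pair_matches(leg_a, leg_b, title_contains("gold"), require_both=False))
print(pair_matches(leg_a, leg_b, title_contains("gold"), require_both=True))
```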

Latest revision as of 13:55, 30 March 2015
