Pair Database
Latest revision as of 13:55, 30 March 2015
This is the manual for the new pair database system released on March 30th, 2015. The new database fully replaces the older solution introduced in 2012, which was based on studies applied to all sector-based pairs. The new solution is standalone (no longer based on studies) and it also includes cointegration and orthogonal measures.
Unlike the old system, this pair database includes all brute-force combinations of qualified US equities regardless of sector. An equity is considered qualified for the database if, over the last 2 months of the screening period, its average price exceeds 0.2 USD and its average volume exceeds 80k. For the database period of Jan 1st 2012 - Jan 1st 2015, this gives more than 10,000,000 pairs.
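As a rough sanity check on the pair count, the number of unordered pairs among n qualified equities is n(n-1)/2. The sketch below assumes that A/B and B/A are counted as one pair and that volume is measured in shares; the function names and inputs are illustrative only and are not part of the system.

```python
import math

MIN_AVG_PRICE_USD = 0.2   # price threshold stated in the text
MIN_AVG_VOLUME = 80_000   # volume threshold stated in the text ("80k")

def is_qualified(avg_price, avg_volume):
    """Qualification rule applied to the last 2 months of the screening period."""
    return avg_price > MIN_AVG_PRICE_USD and avg_volume > MIN_AVG_VOLUME

def pair_count(n_equities):
    """Number of unordered pairs (assumption: A/B and B/A are counted once)."""
    return math.comb(n_equities, 2)

def min_equities_for(pairs):
    """Smallest number of equities that yields more than `pairs` pairs."""
    n = 2
    while pair_count(n) <= pairs:
        n += 1
    return n
```

Under these assumptions, more than 10,000,000 pairs requires at least 4,473 qualified equities, since 4,473 x 4,472 / 2 = 10,001,628.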
For each pair, we have taken the last 3 years of data and calculated profit measures and orthogonal stats, performed cointegration tests, etc. In practice, we have included most of the data you can see and play with in the pair analyzer.
We have set up a high-performance cluster that allows us to perform hundreds of tests and calculations for each of these >10,000,000 pairs. All the data is then indexed in a special database for fast searching. The system is able to execute all kinds of complex queries on this database in less than 30 seconds.
Data description
For each pair, we have performed certain tests and calculations. You can filter pairs using any of these measures or test results. In addition, you can sort results based on some of these measures.
General statistics
- Slow correlation - average correlation with period 240 applied to the whole period, value between -1..1
- Fast correlation - average correlation with period 60 applied to the whole period, value between -1..1
- Half-life of ratio - half-life of the ratio series
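One way to read "average correlation with period N applied to the whole period" is as the mean of a rolling Pearson correlation. The following is a minimal sketch under that reading; whether the system correlates prices or returns, and how it treats windows with no variation, is not stated in this manual, so those choices are assumptions here.

```python
import math

def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences, or None if undefined."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return None  # correlation is undefined for a constant window
    return sxy / math.sqrt(sxx * syy)

def average_rolling_correlation(a, b, period):
    """Mean of the correlations over every full window of length `period`."""
    values = []
    for end in range(period, len(a) + 1):
        r = pearson(a[end - period:end], b[end - period:end])
        if r is not None:  # assumption: undefined windows are skipped
            values.append(r)
    if not values:
        raise ValueError("not enough data for a single window")
    return sum(values) / len(values)
```

With this sketch, the slow correlation would correspond to `average_rolling_correlation(a, b, 240)` and the fast correlation to `average_rolling_correlation(a, b, 60)`.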
One important point at this stage: some measures (like the half-life) depend on the variable ordering. What does that mean? A pair is created from two equities (legs), for instance A and B. The variable ordering determines whether you create the pair as A/B or as B/A. For some measures (like correlation) it does not matter. But for the half-life of the ratio series, for instance, it does, because the A/B pair has a different ratio series than the B/A pair. How has this been solved in the database? In all cases where the ordering matters, we calculate the measure for each order (the half-life of A/B and the half-life of B/A in this case).
Then we apply this trick:
- we save the better result of the measure (over A/B vs B/A) and call it the best case scenario (loose matching when searching)
- we save the worse result of the measure (over A/B vs B/A) and call it the worst case scenario (strict matching when searching)
While searching (or adding sort fields) you can specify whether you want to work with the best or the worst case scenario. For instance, if you choose "the best case scenario" and filter for a half-life shorter than X, it is the same as if you had asked "give me pairs having a half-life shorter than X for any combination of A/B, B/A". If you choose "the worst case scenario" and filter for a half-life shorter than X, it is the same as if you had asked "give me pairs having a half-life shorter than X for both combinations of A/B, B/A".
This works the same way for all measures which differ with the pair leg ordering.
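The best/worst case trick can be sketched as follows for the half-life, where a shorter value is treated as better. The class, function names and tickers are hypothetical and only illustrate the matching logic, not the database's actual storage format.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Scenarios:
    best: float   # better of the A/B and B/A values (loose matching)
    worst: float  # worse of the A/B and B/A values (strict matching)

def half_life_scenarios(half_life_ab, half_life_ba):
    """Store both scenarios; for a half-life, shorter is better."""
    return Scenarios(best=min(half_life_ab, half_life_ba),
                     worst=max(half_life_ab, half_life_ba))

def filter_pairs(pairs, max_half_life, scenario):
    """Return names of pairs whose chosen scenario is shorter than max_half_life."""
    if scenario not in ("best", "worst"):
        raise ValueError("scenario must be 'best' or 'worst'")
    return [name for name, s in pairs.items()
            if getattr(s, scenario) < max_half_life]

# Hypothetical pairs with (A/B, B/A) half-lives in days.
pairs = {
    "AAA-BBB": half_life_scenarios(8, 15),
    "CCC-DDD": half_life_scenarios(5, 9),
    "EEE-FFF": half_life_scenarios(20, 30),
}
print(filter_pairs(pairs, 10, "best"))   # ['AAA-BBB', 'CCC-DDD']
print(filter_pairs(pairs, 10, "worst"))  # ['CCC-DDD']
```

Filtering on the best case keeps a pair if either ordering satisfies the condition, while filtering on the worst case keeps it only if both orderings do.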
Cointegration
Cointegration test results and cointegration residual measures:
- R-squared - measure of the cointegration regression fit
- ADF test p-value of individual variables (you can use it together with the next field to form your own cointegration test at a custom confidence level, e.g. 97% confidence).
- Cointegration p-value - p-value of the final step of Engle-Granger test
- β coefficient (OLS) - regression coefficient of the cointegration regression (OLS)
- Half-life (cointegration) - half-life of cointegration residuals
- Cointegrated @ 95 % - Engle-Granger test result, passed if the pair is proven to be cointegrated at 95% confidence
- Cointegrated @ 99 % - Engle-Granger test result, passed if the pair is proven to be cointegrated at 99% confidence
For all fields here you have to specify the best/worst case scenario, except for:
- R-squared - it is the same for both variable orderings
- ADF test p-value of individual variables - it always works with the worst case scenario
Read more about cointegration at Wikipedia: http://en.wikipedia.org/wiki/Cointegration
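The R-squared exception above can be checked numerically. The sketch below assumes the cointegration regression is a simple OLS fit of one leg on the other with an intercept (the usual first step of the Engle-Granger procedure; the manual does not spell this out). In that setting R-squared equals the squared correlation of the two series, which is symmetric, while β depends on which leg is regressed on which.

```python
def ols_fit(x, y):
    """Regress y on x with an intercept; return (beta, r_squared)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    beta = sxy / sxx                     # slope of y on x
    r_squared = sxy * sxy / (sxx * syy)  # symmetric in x and y
    return beta, r_squared

# Illustrative (made-up) price series for legs A and B.
a = [1.0, 2.0, 3.0, 4.0, 5.0]
b = [2.1, 3.9, 6.2, 7.8, 10.1]

beta_ab, r2_ab = ols_fit(a, b)  # B regressed on A
beta_ba, r2_ba = ols_fit(b, a)  # A regressed on B
```

Here `r2_ab` and `r2_ba` agree up to rounding, whereas `beta_ab` and `beta_ba` differ, which is why β needs a best/worst case scenario and R-squared does not.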
Orthogonal Stats
For each pair in the database, an orthogonal regression (TLS) has been calculated and the residuals of this regression have been analyzed:
- β coefficient (TLS) - coefficient of the regression. Because swapping the order of the variables gives the inverse value, for this field we always take the value that is >= 1 in absolute value
- Half-life (orthogonal) - half-life of orthogonal residuals
- Skewness - skewness of orthogonal residuals
- Kurtosis - (excess) kurtosis of orthogonal residuals
- Doornik-Hansen p-value - result of the Doornik-Hansen normality test applied to orthogonal residuals. If >= 0.95, D-H test is passed with 95% confidence
- Shapiro-Wilk p-value - result of the Shapiro-Wilk normality test applied to orthogonal residuals. If >= 0.95, S-W test is passed with 95% confidence
- PACF rating - autocorrelation rating of the residuals. This number is always <=0 (0 = the best rating). For background on PACF, see http://en.wikipedia.org/wiki/Partial_autocorrelation_function
Orthogonal stats do not depend on pair leg ordering.
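The inverse relationship behind the β coefficient (TLS) rule can be illustrated with the closed-form slope of a two-variable orthogonal regression. This is a minimal sketch using the textbook formula; the function names are hypothetical and the system's actual implementation is not documented here.

```python
import math

def tls_beta(x, y):
    """Slope of the orthogonal (total least squares) line of y against x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    if sxy == 0:
        raise ValueError("slope is undefined when the covariance is zero")
    diff = syy - sxx
    return (diff + math.sqrt(diff * diff + 4 * sxy * sxy)) / (2 * sxy)

def normalized_tls_beta(x, y):
    """Pick the ordering-independent value with absolute value >= 1."""
    beta = tls_beta(x, y)
    return beta if abs(beta) >= 1 else 1 / beta

# Illustrative (made-up) price series for legs A and B.
a = [1.0, 2.0, 3.0, 4.0, 5.0]
b = [2.1, 3.9, 6.2, 7.8, 10.1]
```

For these series `tls_beta(a, b) * tls_beta(b, a)` is 1 up to rounding, so `normalized_tls_beta` returns the same value for both leg orders.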
Profit Measures
For each pair, the following backtests have been performed:
- backtests for A/B order (unlimited in both directions)
- backtests for A/B order (limited in long direction only)
- backtests for A/B order (limited in short direction only)
- backtests for B/A order (unlimited in both directions)
- backtests for B/A order (limited in long direction only)
- backtests for B/A order (limited in short direction only)
In each step above, a total of 19 backtests were performed (Ratio, Ratio-RSI and Residual models with different settings, plus one Kalman-grid backtest). The results of the backtests above have been aggregated in this way:
- for backtests unlimited in both directions, results have been aggregated together and we got these stats:
  - Median CAGR
  - Median Linearity
  - Median Sharpe ratio
  - Median system score
  - Min CAGR
  - Min Score
These results have been aggregated over all pair leg orders (except Min CAGR and Min Score). For Min CAGR and Min Score, you also have to select the best/worst case scenario related to pair order. Select "Trading in both direction" to work with these measures. Use these fields to filter/sort on general pairs trading profitability with no direction limits.
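The aggregation described above might look roughly like the sketch below for CAGR. It assumes that the median is taken over the pooled A/B and B/A backtests, that the minimum is taken per leg order, and that the higher of the two minimums is the best case; these details are inferred from the description and not confirmed by the manual, and the function name is hypothetical.

```python
from statistics import median

def aggregate_both_directions(cagr_ab, cagr_ba):
    """Aggregate CAGRs of unlimited-direction backtests for both leg orders."""
    min_ab, min_ba = min(cagr_ab), min(cagr_ba)
    return {
        # assumption: median over all backtests of both leg orders pooled
        "median_cagr": median(cagr_ab + cagr_ba),
        # assumption: Min CAGR is kept per order, then best/worst is chosen
        "min_cagr_best": max(min_ab, min_ba),
        "min_cagr_worst": min(min_ab, min_ba),
    }

# Three made-up backtests per order instead of the real 19.
stats = aggregate_both_directions([0.10, 0.04, 0.07], [0.02, 0.09, 0.06])
```

For this made-up input the pooled median CAGR is about 0.065, the best case Min CAGR is 0.04 and the worst case Min CAGR is 0.02.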
- for backtests limited in single direction (long or short), results have been aggregated together and we got the same stats:
  - Median CAGR
  - Median Linearity
  - Median Sharpe ratio
  - Median system score
  - Min CAGR
  - Min Score
These results have been aggregated for the better case scenario with respect to the long or short direction. That means that if the long-only stats were better than the short-only stats, you will see the long-only stats here. Select "Trading in single direction" to work with these measures.
Use these fields to filter/sort on direction-limited pairs trading performance. This is useful when searching for candidates for specific arbitrages (like "stock outperforms index" arbitrage). You will want to limit pair strategies coming from a search like this to use only long or only short positions.
Because not all backtests have been performed and aggregated for some pairs (this can happen if there was not enough data available for backtests with longer periods), you can also filter on the number of backtests performed to include only pairs with a sufficient number of aggregated backtests. The field name is "Backtests performed".
In addition to aggregated profit stats, the pair database also includes Sharpe ratios of concrete strategies:
- Ratio model with period 15
- Residual model with period 40
- Ratio-RSI model with period 10
- Kalman-grid model
The strategies above have been run for each pair leg order, with unlimited and limited direction, and for two periods:
- full 3 years
- recent 6 months
Only Sharpe ratios of these concrete strategies have been indexed. Select "Sharpe ratio" field for filtering/sorting using these measures. You will have to select additional options (worst/best case scenario, direction, model, period...).
Instrument filters
In addition to pair measures, you can also filter pairs using current instrument (equity) measures. These measures are not part of the pair database; they are "extra" conditions using current market data. This means that tomorrow you may see slightly different results than today (because market data is updated).
You can filter on:
- Average daily volume - in shares traded
- Market capitalization - in millions
- Instrument types (regular stock, ETF, ETN...)
- Instrument sector
- Instrument text (you can for instance search for pairs containing word "gas" or "gold" in one or both instrument titles)
- Instrument not delisted - you can filter out delisted stocks
For most filters above, you can select whether just one or both legs must match your condition.
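The one-leg versus both-legs choice can be thought of as applying the same condition to each instrument and combining the results with "any" or "all". The sketch below assumes "just one" means at least one leg; the dictionary fields and function names are hypothetical and do not reflect the system's real data model.

```python
def pair_matches(leg_a, leg_b, predicate, mode="both"):
    """Apply an instrument condition to a pair, requiring one or both legs."""
    results = (predicate(leg_a), predicate(leg_b))
    if mode == "both":
        return all(results)
    if mode == "one":
        return any(results)  # assumption: "one" means at least one leg
    raise ValueError("mode must be 'one' or 'both'")

def title_contains(word):
    """Build a condition that checks the instrument title for a word."""
    return lambda instrument: word.lower() in instrument["title"].lower()

# Made-up instruments for illustration.
gold_miner = {"title": "Example Gold Mining Corp", "avg_volume": 1_200_000}
gas_utility = {"title": "Example Natural Gas Co", "avg_volume": 300_000}

print(pair_matches(gold_miner, gas_utility, title_contains("gold"), "one"))   # True
print(pair_matches(gold_miner, gas_utility, title_contains("gold"), "both"))  # False
```

The same helper works for numeric conditions, for example `lambda i: i["avg_volume"] > 500_000` as the predicate.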