# Difference between revisions of "Pair Database"

## Revision as of 13:04, 30 March 2015

This is the manual for the new pair database system released on March 30th, 2015. The new database fully replaces the older solution introduced in 2012, which was based on studies applied to all sector-based pairs. The new solution is standalone (no longer based on studies) and also includes cointegration and orthogonal measures.

Unlike the old system, this pair database includes all brute-force combinations of qualified US equities **regardless of sector**. An equity qualifies for the database if, over the last 2 months of the screening period, its average price exceeds 0.2 USD and its average volume exceeds 80k. For the database period of Jan 1st 2012 - Jan 1st 2015, this gives more than 10,000,000 pairs.

For each pair, we took the last 3 years of data, calculated profit measures and orthogonal stats, performed cointegration tests, and so on. In practice, we have included most of the data you can see and play with in the pair analyzer.

We have set up a high-performance cluster that allows us to perform hundreds of tests and calculations for each of these >10,000,000 pairs. All the data is then indexed in a special database for fast searching. The system can execute all kinds of complex queries on this database in less than 30 seconds.

## Data description

For each pair, we have performed certain tests and calculations. You can filter pairs using any of these measures or test results. In addition, you can sort results by some of these measures.

### General statistics

- Slow correlation - average correlation with period 240 applied to the whole period, value between -1..1
- Fast correlation - average correlation with period 60 applied to the whole period, value between -1..1
- Half-life of ratio - half-life of the ratio series
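
As a rough illustration of the two correlation fields, the sketch below averages the Pearson correlation over every full rolling window of a given period (240 for slow, 60 for fast). It is only a sketch under assumptions: the manual does not say whether the database works on prices or returns, or how the windows are aligned, and the function names `pearson` and `average_rolling_correlation` are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return float("nan")  # undefined when one window is constant
    return sxy / math.sqrt(sxx * syy)


def average_rolling_correlation(a, b, period):
    """Mean of the correlations over every full window of `period` points.

    Assumption: windows slide by one observation and undefined windows
    are skipped; the real database may do this differently.
    """
    if len(a) != len(b):
        raise ValueError("series must have the same length")
    if len(a) < period:
        raise ValueError("series shorter than the correlation period")
    values = []
    for start in range(len(a) - period + 1):
        value = pearson(a[start:start + period], b[start:start + period])
        if not math.isnan(value):
            values.append(value)
    return sum(values) / len(values) if values else float("nan")
```

Calling `average_rolling_correlation(a, b, 240)` and `average_rolling_correlation(b, a, 240)` gives the same value, which is why these fields need no best/worst case choice.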

One important point at this stage: some measures (like the half-life) depend on the variable ordering. What does this mean? A pair is created from two equities (legs), for instance A and B. The variable ordering determines whether you create the pair as A/B or as B/A. For some measures (like correlation) it does not matter. For the half-life of the ratio series it does, because the A/B pair has a different ratio series from the B/A pair.

How has this been solved in the database? For all cases where the ordering matters, we calculate the measure for each ordering (half-life of A/B and half-life of B/A in this case).

Then we apply this trick:

- we will save the better result of the measure (over A/B vs B/A) and we will call it **the best case scenario** (loose matching when searching)
- we will save the worse result of the measure (over A/B vs B/A) and we will call it **the worst case scenario** (strict matching when searching)

While searching (or adding sort fields), you can specify whether you want to work with the best or the worst case scenario.
So for instance, if you choose "the best case scenario" and filter for half-life shorter than X, it is the same as if you had filtered "give me pairs having half-life shorter than X for **any** combination of A/B, B/A". If you choose "the worst case scenario" and filter for half-life shorter than X, it is the same as if you had filtered "give me pairs having half-life shorter than X for **both** combinations of A/B, B/A".

This works the same way for all measures that depend on the pair leg ordering.
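
The best/worst case mechanism can be sketched in a few lines. The example below is illustrative only: the manual does not say how the half-life is estimated, so the code assumes a common approach (regressing the one-step change of the series on its lagged level), and all function names are hypothetical.

```python
import math


def half_life(series):
    """Half-life of mean reversion from an AR(1)-style fit (assumed method).

    Regresses the one-step change on the lagged level; a negative slope
    `lam` implies exponential decay with half-life -ln(2) / lam.
    """
    if len(series) < 3:
        raise ValueError("need at least 3 observations")
    lagged = series[:-1]
    delta = [nxt - cur for cur, nxt in zip(series[:-1], series[1:])]
    n = len(lagged)
    mean_l = sum(lagged) / n
    mean_d = sum(delta) / n
    sxx = sum((x - mean_l) ** 2 for x in lagged)
    sxy = sum((x - mean_l) * (d - mean_d) for x, d in zip(lagged, delta))
    if sxx == 0:
        return math.inf  # constant series: no reversion to measure
    lam = sxy / sxx
    return -math.log(2) / lam if lam < 0 else math.inf


def best_and_worst_half_life(a, b):
    """Compute the half-life for A/B and B/A and keep both extremes.

    A shorter half-life is treated as the better result.
    """
    hl_ab = half_life([x / y for x, y in zip(a, b)])
    hl_ba = half_life([y / x for x, y in zip(a, b)])
    return min(hl_ab, hl_ba), max(hl_ab, hl_ba)


def matches(a, b, max_half_life, scenario):
    """Filter 'half-life shorter than X' under the chosen scenario."""
    best, worst = best_and_worst_half_life(a, b)
    value = best if scenario == "best" else worst
    return value < max_half_life
```

With `scenario="best"` the filter passes when either ordering is under the threshold (the minimum is under it); with `scenario="worst"` it passes only when both orderings are (the maximum is under it).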

### Cointegration

Cointegration test results and cointegration residual measures:

- R-squared - measure of the cointegration regression fit
- ADF test p-value of individual variables - you can use it together with the next field to form a cointegration test at a custom confidence level (e.g. 97%)
- Cointegration p-value - p-value of the final step of Engle-Granger test
- β coefficient (OLS) - regression coefficient of the cointegration regression (OLS)
- Half-life (cointegration) - half-life of cointegration residuals
- Cointegrated @ 95 % - Engle-Granger test result, passed if the pair is proven to be cointegrated at 95% confidence
- Cointegrated @ 99 % - Engle-Granger test result, passed if the pair is proven to be cointegrated at 99% confidence

For all fields here you have to specify the best/worst case scenario, except for:

- R-squared - it is the same for both variable orderings
- ADF test p-value of individual variables - it always works with the worst case scenario
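
To see why R-squared needs no best/worst case choice while the OLS β does, consider the plain-Python sketch below. It covers only the regression step; the ADF and Engle-Granger p-values would need a statistics library (for example `adfuller` and `coint` in `statsmodels.tsa.stattools`) and are not reproduced here. The helper name `ols` is hypothetical, and whether the database regresses prices or log prices is not stated in this manual.

```python
def ols(x, y):
    """Ordinary least squares fit of y = alpha + beta * x.

    Returns (beta, alpha, r_squared).
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((v - mean_x) ** 2 for v in x)
    syy = sum((v - mean_y) ** 2 for v in y)
    sxy = sum((u - mean_x) * (v - mean_y) for u, v in zip(x, y))
    if sxx == 0 or syy == 0:
        raise ValueError("regression undefined for a constant series")
    beta = sxy / sxx
    alpha = mean_y - beta * mean_x
    r_squared = (sxy * sxy) / (sxx * syy)  # symmetric in x and y
    return beta, alpha, r_squared


# Hypothetical price series for legs A and B.
leg_a = [10.0, 10.5, 11.2, 10.8, 11.9, 12.4, 12.1, 13.0]
leg_b = [20.3, 21.1, 22.0, 21.9, 23.5, 25.1, 24.2, 26.3]

beta_ab, _, r2_ab = ols(leg_a, leg_b)  # B regressed on A
beta_ba, _, r2_ba = ols(leg_b, leg_a)  # A regressed on B
# r2_ab and r2_ba agree (up to rounding); beta_ab and beta_ba do not,
# and their product equals R-squared.
```

Because the residuals of the two regressions differ, residual-based fields such as the cointegration half-life and p-value also depend on the ordering.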

### Orthogonal Stats

For each pair in the database, an orthogonal regression (TLS) has been calculated and the residuals of this regression have been analyzed:

- β coefficient (TLS) - coefficient of the regression. Because changing the order of variables gives the inverse value, for this field we **always take the value >= 1** (in absolute value)
- Half-life (orthogonal) - half-life of orthogonal residuals
- Skewness - skewness of orthogonal residuals
- Kurtosis - (excess) kurtosis of orthogonal residuals
- Doornik-Hansen p-value - result of the Doornik-Hansen normality test applied to orthogonal residuals. If >= 0.95, D-H test is passed with 95% confidence
- Shapiro-Wilk p-value - result of the Shapiro-Wilk normality test applied to orthogonal residuals. If >= 0.95, S-W test is passed with 95% confidence
- PACF rating - autocorrelation rating of the residuals. This number is always <=0 (0 = the best rating)

Orthogonal stats do not depend on pair leg ordering.
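
The β normalization described above can be illustrated with the standard closed-form slope of a two-variable orthogonal regression. This is a minimal sketch, not the database's actual code; the function names `tls_beta` and `normalized_beta` are hypothetical.

```python
import math


def tls_beta(x, y):
    """Slope of the orthogonal (total least squares) line of y against x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((v - mean_x) ** 2 for v in x)
    syy = sum((v - mean_y) ** 2 for v in y)
    sxy = sum((u - mean_x) * (v - mean_y) for u, v in zip(x, y))
    if sxy == 0:
        raise ValueError("slope undefined when the covariance is zero")
    root = math.sqrt((syy - sxx) ** 2 + 4 * sxy ** 2)
    return (syy - sxx + root) / (2 * sxy)


def normalized_beta(x, y):
    """Return the TLS slope with absolute value >= 1.

    Swapping the legs inverts the slope, so taking whichever of beta
    and 1 / beta has the larger magnitude removes the ordering.
    """
    beta = tls_beta(x, y)
    return beta if abs(beta) >= 1 else 1 / beta
```

Here `tls_beta(x, y) * tls_beta(y, x)` equals 1 up to rounding, so `normalized_beta` returns the same value for both leg orderings.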