| Title: | Eigenvalue-Based Estimation of the Number of Factors in Approximate Factor Models |
|---|---|
| Description: | Eigenvalue-based estimation of the number of factors in approximate factor models. Designed to work when either N or T is large, without requiring both dimensions to grow simultaneously. Implements the eigenvalue ratio estimator of Ahn and Horenstein (2013) <doi:10.3982/ECTA8968>, the information criteria of Bai and Ng (2002) <doi:10.1111/1468-0262.00273>, the tuned penalty of Alessi, Barigozzi and Capasso (2010) <doi:10.1016/j.spl.2010.08.005>, the auto-covariance ratio estimator of Lam and Yao (2012) <doi:10.1214/12-AOS970>, and the edge distribution estimators of Onatski (2009) <doi:10.3982/ECTA6964> and Onatski (2010) <doi:10.1162/REST_a_00043>. |
| Authors: | Jason Parker [aut, cre] (ORCID: <https://orcid.org/0000-0001-9227-6976>) |
| Maintainer: | Jason Parker <[email protected]> |
| License: | MIT + file LICENSE |
| Version: | 0.1.2 |
| Built: | 2026-05-23 09:49:25 UTC |
| Source: | https://github.com/penny4nonsense/factorselect |
Estimates the number of factors using the tuning-stability procedure of Alessi, Barigozzi and Capasso (2010) applied to the three IC penalty functions of Bai and Ng (2002). For each penalty function, a grid of tuning constants is used and the most stable estimate across the grid is selected as the final estimate.
.abc(eigenvalues, V0, kmax, N, TT, c_grid = seq(0, 1, by = 0.01))
eigenvalues |
Numeric vector of eigenvalues in descending order of
length kmax + 1, typically obtained from .extract_eigenvalues. |
V0 |
Numeric scalar. Total mean squared value of the panel,
that is, the sum of squared entries of X divided by N * TT. |
kmax |
Integer. Maximum number of factors to consider. |
N |
Integer. Number of cross-sectional units. |
TT |
Integer. Number of time periods. |
c_grid |
Numeric vector. Grid of tuning constants over which to
evaluate stability. Defaults to seq(0, 1, by = 0.01). |
The ABC estimator applies the tuning-stability procedure of Hallin and
Liska (2007) to the IC criteria of Bai and Ng (2002). For each tuning
constant $c$ in the grid, a modified criterion is minimized:
$$IC_j^{*}(k, c) = \ln V(k) + c \, k \, p_j(N, T),$$
where $p_j(N, T)$ is the penalty function from $IC_{pj}$ of Bai and
Ng (2002), for j = 1, 2, 3. The final estimate is the modal value of
$\hat{k}_j(c)$ across the grid, that is, the value of k that is selected
most frequently as c varies.
As with .bai_ng, this estimator requires unstandardized
data. The argument V0 should be computed from demeaned but
unstandardized data.
The ABC estimator generally outperforms the raw Bai & Ng IC criteria in finite samples, particularly when errors are cross-sectionally correlated.
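The grid search described above can be sketched in base R. This is a minimal illustration, not the package's internal code. It assumes eigenvalues of X'X / (N * TT), so that V(k) equals V0 minus the sum of the first k eigenvalues, and it uses only the IC1 penalty. The simulated panel and the names `k_path` and `k_abc` are hypothetical. The grid is widened to [0, 2] purely so the stable region is easy to see; this is not the documented default.

```r
# Sketch of the tuning-stability idea with the IC1 penalty (assumptions above).
set.seed(123)
N <- 60; TT <- 150; r <- 2; kmax <- 8

# Hypothetical test panel with r strong factors and unit-variance noise.
X <- matrix(rnorm(TT * r), TT, r) %*% t(matrix(rnorm(N * r), N, r)) +
  matrix(rnorm(TT * N), TT, N)
X <- scale(X, center = TRUE, scale = FALSE)   # demeaned, not standardized

# Assumed normalization: eigenvalues of X'X / (N * TT) sum to V0.
ev <- eigen(crossprod(X) / (N * TT), symmetric = TRUE,
            only.values = TRUE)$values
V0 <- sum(X^2) / (N * TT)
V  <- V0 - c(0, cumsum(ev[seq_len(kmax)]))    # V(0), ..., V(kmax)

# IC1 penalty of Bai and Ng (2002).
p1 <- (N + TT) / (N * TT) * log(N * TT / (N + TT))

# For each c, minimize log V(k) + c * k * p1 over k = 0, ..., kmax.
c_grid <- seq(0, 2, by = 0.01)                # illustrative grid, not the default
k_path <- vapply(c_grid, function(cc) {
  which.min(log(V) + cc * (0:kmax) * p1) - 1L
}, integer(1))

# Final estimate: the value of k selected most often across the grid.
tab   <- table(k_path)
k_abc <- as.integer(names(tab)[which.max(tab)])
k_abc
```

At c = 0 the criterion is unpenalized, so the path starts at kmax and can only decrease as c grows; the modal value picks out the widest plateau.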
A named list with the following elements:
Integer. Selected number of factors using ABC with IC1 penalty.
Integer. Selected number of factors using ABC with IC2 penalty.
Integer. Selected number of factors using ABC with IC3 penalty.
Integer vector of length length(c_grid).
Selected k for each value of c using IC1 penalty.
Integer vector of length length(c_grid).
Selected k for each value of c using IC2 penalty.
Integer vector of length length(c_grid).
Selected k for each value of c using IC3 penalty.
Numeric vector. The tuning constant grid used.
Alessi, L., Barigozzi, M. and Capasso, M. (2010). Improved Penalization for Determining the Number of Factors in Approximate Factor Models. Statistics and Probability Letters, 80, 1806-1813.
Bai, J. and Ng, S. (2002). Determining the Number of Factors in Approximate Factor Models. Econometrica, 70(1), 191-221.
Hallin, M. and Liska, R. (2007). Determining the Number of Factors in the Generalized Dynamic Factor Model. Journal of the American Statistical Association, 102, 603-617.
.bai_ng, .extract_eigenvalues,
select_factors
Estimates the number of factors using the eigenvalue ratio (ER) and growth ratio (GR) statistics of Ahn and Horenstein (2013). The ratio approach provides robustness to perturbations in the eigenvalue spectrum and performs well when only one dimension (N or T) is large.
.ahn_horenstein(eigenvalues, kmax, n)
eigenvalues |
Numeric vector of eigenvalues in descending order of
length kmax + 1, typically obtained from .extract_eigenvalues. |
kmax |
Integer. Maximum number of factors to consider. The function evaluates the ratio statistics for k = 1, ..., kmax. |
n |
Integer. The value of min(N, T), used to compute the mock eigenvalue boundary term following Ahn and Horenstein (2013) Corollary 1. |
The ER statistic is defined as a ratio built from successive differences
in the eigenvalue sequence, evaluated at each k. The GR statistic replaces
the raw differences with log growth rates.
The boundary case k = 0 is handled by assigning a mock eigenvalue
as the initial difference term, following Ahn and Horenstein (2013).
The number of factors is selected as the argmax of each statistic over k = 1, ..., kmax.
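For orientation, the sketch below computes the two statistics as they are defined in the published paper: ER(k) is the ratio of adjacent eigenvalues, and GR(k) = ln[V(k-1)/V(k)] / ln[V(k)/V(k+1)] with V(k) the sum of eigenvalues beyond the k-th. It is a base-R illustration under those definitions and may differ in detail from this package's internal formulation. The simulated panel and all variable names are hypothetical, and the mock eigenvalue for k = 0 is omitted because the argmax here runs over k = 1, ..., kmax.

```r
# ER and GR following the published definitions in Ahn and Horenstein (2013).
set.seed(1)
N <- 50; TT <- 120; r <- 2; kmax <- 8

# Hypothetical panel with r strong factors.
Fm     <- matrix(rnorm(TT * r), TT, r)
Lambda <- matrix(rnorm(N * r), N, r)
X <- Fm %*% t(Lambda) + matrix(rnorm(TT * N), TT, N)
X <- scale(X, center = TRUE, scale = FALSE)

m  <- min(N, TT)
mu <- eigen(crossprod(X) / (N * TT), symmetric = TRUE,
            only.values = TRUE)$values[seq_len(m)]

# ER(k) = mu_k / mu_{k+1}
er <- mu[1:kmax] / mu[2:(kmax + 1)]

# V_tail[k + 1] = V(k) = sum of eigenvalues k + 1, ..., m
V_tail <- rev(cumsum(rev(mu)))
k  <- 1:kmax
gr <- log(V_tail[k] / V_tail[k + 1]) / log(V_tail[k + 1] / V_tail[k + 2])

c(ER = which.max(er), GR = which.max(gr))
```

Both statistics peak where a large eigenvalue is followed by a small one, which is why only the kmax + 1 leading eigenvalues matter for ER, while GR as written here also needs the tail sums.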
A named list with the following elements:
Integer. Selected number of factors based on the ER statistic.
Integer. Selected number of factors based on the GR statistic.
Numeric vector of length kmax. Full ER statistic sequence.
Numeric vector of length kmax. Full GR statistic sequence.
Ahn, S.C. and Horenstein, A.R. (2013). Eigenvalue Ratio Test for the Number of Factors. Econometrica, 81(3), 1203-1227.
.extract_eigenvalues, select_factors
Estimates the number of factors using the six penalty-based criteria of Bai and Ng (2002). Includes three PC criteria (minimize penalized residual variance) and three IC criteria (minimize penalized log residual variance).
.bai_ng(eigenvalues, V0, kmax, N, TT)
eigenvalues |
Numeric vector of eigenvalues in descending order of
length kmax + 1, typically obtained from .extract_eigenvalues. |
V0 |
Numeric scalar. Total mean squared value of the panel,
that is, the sum of squared entries of X divided by N * TT. |
kmax |
Integer. Maximum number of factors to consider. |
N |
Integer. Number of cross-sectional units. |
TT |
Integer. Number of time periods. |
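The six criteria are defined as follows. Let $V(k)$ denote the
residual variance from a k-factor model, $\hat{\sigma}^2 = V(k_{max})$, and
$C_{NT}^2 = \min(N, T)$.
PC criteria (minimize penalized residual variance):
$$PC_{p1}(k) = V(k) + k \, \hat{\sigma}^2 \, \frac{N + T}{NT} \ln\!\left(\frac{NT}{N + T}\right)$$
$$PC_{p2}(k) = V(k) + k \, \hat{\sigma}^2 \, \frac{N + T}{NT} \ln C_{NT}^2$$
$$PC_{p3}(k) = V(k) + k \, \hat{\sigma}^2 \, \frac{\ln C_{NT}^2}{C_{NT}^2}$$
IC criteria (minimize penalized log residual variance):
$$IC_{p1}(k) = \ln V(k) + k \, \frac{N + T}{NT} \ln\!\left(\frac{NT}{N + T}\right)$$
$$IC_{p2}(k) = \ln V(k) + k \, \frac{N + T}{NT} \ln C_{NT}^2$$
$$IC_{p3}(k) = \ln V(k) + k \, \frac{\ln C_{NT}^2}{C_{NT}^2}$$
$V(k)$ is computed from the eigenvalues of the sample covariance matrix
and is the mean residual variance after removing the first k factors.
All six criteria are minimized over $k = 0, 1, \dots, k_{max}$.
Note that $k = 0$ is included to allow for the possibility of no
factors.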
These estimators require both N and T to be large for consistent
estimation. They may perform poorly when either dimension is small.
For more robust estimation, consider .ahn_horenstein.
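A compact base-R sketch of the six criteria follows. It is an illustration under one assumption that is not confirmed by this page: eigenvalues are taken from X'X / (N * TT), which makes V(k) equal to V0 minus the sum of the first k eigenvalues. The simulated panel and the names `pen`, `IC`, `PC`, `k_ic`, and `k_pc` are hypothetical.

```r
# Bai and Ng (2002) PC and IC criteria on a hypothetical 3-factor panel.
set.seed(42)
N <- 100; TT <- 200; r <- 3; kmax <- 8

X <- matrix(rnorm(TT * r), TT, r) %*% t(matrix(rnorm(N * r), N, r)) +
  matrix(rnorm(TT * N, sd = 0.5), TT, N)
X <- scale(X, center = TRUE, scale = FALSE)   # demeaned, not standardized

# Assumed normalization: eigenvalues of X'X / (N * TT) sum to V0.
ev <- eigen(crossprod(X) / (N * TT), symmetric = TRUE,
            only.values = TRUE)$values
V0 <- sum(X^2) / (N * TT)
V  <- V0 - c(0, cumsum(ev[1:kmax]))           # V(0), ..., V(kmax)
k  <- 0:kmax

C2  <- min(N, TT)
pen <- c(p1 = (N + TT) / (N * TT) * log(N * TT / (N + TT)),
         p2 = (N + TT) / (N * TT) * log(C2),
         p3 = log(C2) / C2)
sigma2 <- V[kmax + 1]                         # sigma-hat^2 = V(kmax)

# One column per penalty, one row per k = 0, ..., kmax.
IC <- sapply(pen, function(p) log(V) + k * p)
PC <- sapply(pen, function(p) V + k * sigma2 * p)

k_ic <- apply(IC, 2, which.min) - 1L
k_pc <- apply(PC, 2, which.min) - 1L
rbind(IC = k_ic, PC = k_pc)
```

Because the penalties depend on the scale of V(k) through sigma2 (PC) or through the logarithm (IC), the data are demeaned but left unstandardized here.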
A named list with the following elements:
Integer. Selected number of factors by PC_p1.
Integer. Selected number of factors by PC_p2.
Integer. Selected number of factors by PC_p3.
Integer. Selected number of factors by IC_p1.
Integer. Selected number of factors by IC_p2.
Integer. Selected number of factors by IC_p3.
Numeric vector of length kmax. Full PC_p1 criterion sequence.
Numeric vector of length kmax. Full PC_p2 criterion sequence.
Numeric vector of length kmax. Full PC_p3 criterion sequence.
Numeric vector of length kmax. Full IC_p1 criterion sequence.
Numeric vector of length kmax. Full IC_p2 criterion sequence.
Numeric vector of length kmax. Full IC_p3 criterion sequence.
Bai, J. and Ng, S. (2002). Determining the Number of Factors in Approximate Factor Models. Econometrica, 70(1), 191-221.
.ahn_horenstein, .extract_eigenvalues,
select_factors
Computes the leading eigenvalues of the sample covariance matrix using a truncated eigendecomposition. Automatically selects the smaller of the N x N or T x T covariance matrix for efficiency. Uses RSpectra when available for large matrices, falling back to base R otherwise.
.extract_eigenvalues(X, kmax)
X |
Numeric matrix of dimensions T x N, typically preprocessed by
.prepare_matrix. |
kmax |
Integer. Number of leading eigenvalues to compute. Should be set generously (e.g., 8-15) to allow estimators to evaluate the full candidate range. |
When N <= T, decomposes the N x N matrix $X'X$ (up to scaling).
When N > T, decomposes the T x T matrix $XX'$ (up to scaling).
This ensures the cheaper decomposition is always used.
RSpectra's eigs_sym() is used when available and when
min(N, T) > 100, as the truncated decomposition only provides
meaningful speedup at larger scales.
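The reason either matrix can be used is that X'X and XX' share their nonzero eigenvalues. The base-R check below illustrates this on a small random matrix; it uses `eigen()` only and makes no claim about how the package calls RSpectra. The variable names are hypothetical.

```r
# X'X (N x N) and XX' (T x T) have the same leading min(N, T) eigenvalues.
set.seed(7)
TT <- 40; N <- 15
X <- matrix(rnorm(TT * N), TT, N)

ev_N <- eigen(crossprod(X),  symmetric = TRUE, only.values = TRUE)$values
ev_T <- eigen(tcrossprod(X), symmetric = TRUE, only.values = TRUE)$values

m <- min(N, TT)
max(abs(ev_N[1:m] - ev_T[1:m]))   # numerically zero

# Pick the cheaper side, mirroring the rule described above.
small_side <- if (N <= TT) "N" else "T"
small_side
```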
A named list with the following elements:
Numeric vector of length kmax + 1 containing the
leading eigenvalues in descending order. The extra eigenvalue is
required by ratio-based estimators.
Numeric matrix of corresponding eigenvectors.
Character string, either "N" or "T",
indicating which covariance matrix was decomposed.
Ahn, S.C. and Horenstein, A.R. (2013). Eigenvalue Ratio Test for the Number of Factors. Econometrica, 81(3), 1203-1227.
.prepare_matrix, .ahn_horenstein
Estimates the number of factors using the eigenvalue ratio estimator of Lam and Yao (2012). Unlike estimators based on the contemporaneous covariance matrix, this estimator uses lagged auto-covariance matrices, exploiting the fact that the factor loading space is spanned by the eigenvectors of the summed lagged auto-covariance matrix M corresponding to its nonzero eigenvalues.
.lam_yao(X, kmax, h = 1)
X |
Numeric matrix of dimensions T x N, typically preprocessed by
.prepare_matrix. |
kmax |
Integer. Maximum number of factors to consider. |
h |
Integer. Number of lags to use in constructing the auto-covariance
matrix M. Defaults to 1. |
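The estimator constructs the N x N matrix:
$$M = \sum_{k=1}^{h} \hat{\Sigma}(k) \, \hat{\Sigma}(k)'$$
where $\hat{\Sigma}(k)$ is
the lag-k sample auto-covariance matrix.
The factor loading space is spanned by the eigenvectors of M corresponding to its nonzero eigenvalues, and the number of nonzero eigenvalues equals the number of factors r (Lam and Yao, 2012, Proposition 1). In finite samples, the ratio of adjacent eigenvalues of M spikes at r because eigenvalues r+1 onward are theoretically zero.
The number of factors is estimated as:
$$\hat{r} = \arg\max_{1 \le i \le k_{max}} \hat{\lambda}_i / \hat{\lambda}_{i+1}$$
where $\hat{\lambda}_1 \ge \hat{\lambda}_2 \ge \dots$ are the eigenvalues of M.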
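The construction of M and the ratio rule can be sketched in base R as follows. This is an illustration only. The 1 / (T - k) normalization of the auto-covariance and the AR(1) factors with coefficient 0.7 are assumptions made for the example (serially correlated factors are needed for lagged auto-covariances to carry signal), and all variable names are hypothetical.

```r
# Lam and Yao (2012) style ratio estimator on a hypothetical panel.
set.seed(11)
N <- 30; TT <- 400; r <- 2; kmax <- 6; h <- 1

# AR(1) factors (illustrative assumption), i.i.d. loadings and noise.
Fm <- sapply(seq_len(r), function(j)
  as.numeric(arima.sim(list(ar = 0.7), n = TT)))
Lambda <- matrix(rnorm(N * r), N, r)
X <- Fm %*% t(Lambda) + matrix(rnorm(TT * N), TT, N)
X <- scale(X, center = TRUE, scale = FALSE)

# M = sum over lags k = 1, ..., h of Sigma(k) Sigma(k)'
M <- matrix(0, N, N)
for (k in seq_len(h)) {
  # Lag-k sample auto-covariance: average of x_{t+k} x_t' (assumed scaling).
  S_k <- crossprod(X[(k + 1):TT, , drop = FALSE],
                   X[1:(TT - k), , drop = FALSE]) / (TT - k)
  M <- M + tcrossprod(S_k)
}

lam   <- eigen(M, symmetric = TRUE, only.values = TRUE)$values[1:(kmax + 1)]
ratio <- lam[1:kmax] / lam[2:(kmax + 1)]
k_hat <- which.max(ratio)
k_hat
```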
A named list with the following elements:
Integer. Selected number of factors.
Numeric vector of length kmax. Full eigenvalue ratio sequence of M.
Numeric vector of length kmax + 1. Leading eigenvalues of M in descending order.
Lam, C. and Yao, Q. (2012). Factor Modelling for High-Dimensional Time Series: Inference for the Number of Factors. The Annals of Statistics, 40(2), 694-726.
.ahn_horenstein, select_factors
Estimates the number of factors using the sequential hypothesis testing procedure of Onatski (2009), applied to the static approximate factor model version described in Section 4 of that paper. The test statistic is based on ratios of differences of adjacent eigenvalues of a complex-valued transformation of the data.
.onatski_2009(X, kmax, alpha = 0.05)
X |
Numeric matrix of dimensions T x N, typically preprocessed by
.prepare_matrix. |
kmax |
Integer. Maximum number of factors to consider. Defines the upper bound k1 in the sequential testing procedure. |
alpha |
Numeric. Significance level for the sequential test.
Defaults to 0.05. |
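The static approximate factor model version of the Onatski (2009) test (Section 4) proceeds as follows:
Split the T x N data matrix into two halves of length T/2.
Form complex-valued vectors $\hat{X}_j = X_j + i X_{j + T/2}$ for $j = 1, \dots, T/2$.
Compute eigenvalues $\gamma_1 \ge \gamma_2 \ge \dots$ of $\sum_{j=1}^{T/2} \hat{X}_j \hat{X}_j^{*}$ (up to scaling).
Sequentially test $H_0: r = k_0$ versus $H_1: k_0 < r \le k_1$ for $k_0 = 0, 1, \dots, k_1 - 1$ using the statistic $R = \max_{k_0 < i \le k_1} (\gamma_i - \gamma_{i+1}) / (\gamma_{i+1} - \gamma_{i+2})$.
Stop when $H_0$ is not rejected. The estimate is the current $k_0$.
Critical values are taken from Table I of Onatski (2009) and depend
on the significance level alpha and the number of factors
tested under the alternative $k_1 - k_0$.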
If T is odd, the last observation is dropped to ensure equal-length halves.
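The data transformation and the ratio statistic can be sketched in base R as below. This is a partial illustration: it stops at the statistic and does not reproduce the critical values from Table I of Onatski (2009), so no accept or reject decision is made. The 2/T scaling of the complex covariance matrix is an assumption that does not affect the ratios, and the simulated panel and names such as `R_stat` are hypothetical.

```r
# Complex transformation and ratio statistic (critical values omitted).
set.seed(3)
N <- 40; TT <- 201; r <- 1; kmax <- 5

X <- matrix(rnorm(TT * r), TT, r) %*% t(matrix(rnorm(N * r), N, r)) +
  matrix(rnorm(TT * N), TT, N)

# Drop the last observation when T is odd so both halves have equal length.
if (nrow(X) %% 2 == 1) X <- X[-nrow(X), , drop = FALSE]
half <- nrow(X) / 2

# Row j of Z holds X_j + i * X_{j + T/2}.
Z <- X[1:half, ] + 1i * X[(half + 1):(2 * half), ]

# Hermitian N x N matrix: sum over j of z_j z_j^* (assumed 2/T scaling).
G   <- t(Z) %*% Conj(Z) / half
gam <- Re(eigen(G, symmetric = TRUE, only.values = TRUE)$values)[1:(kmax + 2)]

# Ratio of adjacent eigenvalue differences for i = 1, ..., kmax.
ratios <- (gam[1:kmax] - gam[2:(kmax + 1)]) /
          (gam[2:(kmax + 1)] - gam[3:(kmax + 2)])

# Statistic for H0: r = k0 against k0 < r <= kmax.
R_stat <- function(k0) max(ratios[(k0 + 1):kmax])
R_stat(0)
```

In a full implementation, `R_stat(k0)` would be compared with the tabulated critical value for each k0 in turn until the null is not rejected.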
A named list with the following elements:
Integer. Estimated number of factors from the sequential testing procedure.
Numeric vector of length kmax. The ratio statistic
$(\gamma_i - \gamma_{i+1}) / (\gamma_{i+1} - \gamma_{i+2})$ for each i.
Numeric vector of length kmax + 2. Leading eigenvalues of the complex covariance matrix in descending order.
Numeric. Critical value used for the test at the specified significance level.
Numeric. The significance level used.
Onatski, A. (2009). Testing Hypotheses About the Number of Factors in Large Factor Models. Econometrica, 77(5), 1447-1479.
.ahn_horenstein, select_factors
Estimates the number of factors using the Edge Distribution (ED) estimator of Onatski (2010). The estimator exploits the fact that idiosyncratic eigenvalues of the sample covariance matrix cluster around a single point, while systematic eigenvalues diverge to infinity. The threshold separating the two groups is estimated iteratively using the square root shape of the edge of the eigenvalue distribution.
.onatski_2010(eigenvalues, kmax, n_iter = 4L)
eigenvalues |
Numeric vector of eigenvalues in descending order,
typically obtained from .extract_eigenvalues. |
kmax |
Integer. Maximum number of factors to consider. |
n_iter |
Integer. Maximum number of iterations for the
calibration procedure. Defaults to 4L. |
The ED estimator of Onatski (2010) is based on the theoretical result
that idiosyncratic eigenvalues cluster around the upper edge $u$
of the limiting spectral distribution,
while systematic eigenvalues diverge. Near the edge, the density of
the limiting spectral distribution behaves like a square root function,
implying that eigenvalue differences $\lambda_i - \lambda_{i+1}$
for idiosyncratic eigenvalues behave approximately as
increments of a function that is linear in $i^{2/3}$.
The calibration procedure estimates the threshold $\delta$
by regressing five consecutive eigenvalues $\lambda_j, \dots, \lambda_{j+4}$ on a constant and $(j-1)^{2/3}, \dots, (j+3)^{2/3}$, where $j$ is initialized at $k_{max} + 1$
and updated iteratively.
The estimator requires eigenvalues to contain at least
kmax + 5 elements so that the OLS window $\lambda_{k_{max}+1}, \dots, \lambda_{k_{max}+5}$
is always available.
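The iteration can be sketched in base R following the algorithm published in Onatski (2010): fit the five-point regression, set the threshold to twice the absolute slope, pick the largest index whose eigenvalue gap reaches the threshold, and restart the window just past that index. The function name `ed_sketch` and the synthetic spectrum are hypothetical, and the stopping rule (stop when the estimate repeats) is an assumption about how convergence is checked.

```r
# Edge Distribution iteration on a synthetic, deterministic spectrum.
ed_sketch <- function(eigenvalues, kmax, n_iter = 4L) {
  stopifnot(length(eigenvalues) >= kmax + 5)
  j <- kmax + 1L
  k_hat <- NA_integer_
  delta <- NA_real_
  # Successive eigenvalue differences for i = 1, ..., kmax.
  d <- eigenvalues[1:kmax] - eigenvalues[2:(kmax + 1L)]
  for (it in seq_len(n_iter)) {
    # OLS of lambda_j, ..., lambda_{j+4} on a constant and (j-1)^(2/3), ...
    y    <- eigenvalues[j:(j + 4L)]
    xreg <- ((j - 1L):(j + 3L))^(2 / 3)
    beta <- unname(coef(lm(y ~ xreg))[2])
    delta <- 2 * abs(beta)
    # Largest i <= kmax whose gap reaches the threshold, or 0 if none does.
    hits  <- which(d >= delta)
    k_new <- if (length(hits) > 0) max(hits) else 0L
    if (!is.na(k_hat) && k_new == k_hat) break   # assumed convergence check
    k_hat <- k_new
    j <- k_hat + 1L
  }
  list(k = k_hat, delta = delta, beta = beta)
}

# Three large "systematic" eigenvalues, then an edge shaped like u - c * m^(2/3).
ev  <- c(50, 30, 20, 2 - 0.1 * (1:12)^(2 / 3))
res <- ed_sketch(ev, kmax = 8)
res$k
```

On this synthetic spectrum the gaps among the edge eigenvalues are all well below the calibrated threshold, so the estimate settles at the number of large eigenvalues.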
A named list with the following elements:
Integer. Estimated number of factors.
Numeric. The estimated threshold $\delta$.
Numeric. The estimated slope coefficient $\hat{\beta}$
from the OLS regression in the final iteration.
Numeric vector of length kmax. Successive
eigenvalue differences $\lambda_i - \lambda_{i+1}$.
Integer. Number of iterations performed.
Onatski, A. (2010). Determining the Number of Factors From Empirical Distribution of Eigenvalues. The Review of Economics and Statistics, 92(4), 1004-1016.
.extract_eigenvalues, select_factors
Removes individual means, time means, or both from a numeric matrix, and optionally scales to unit variance. This is the standard preprocessing step required before eigendecomposition in factor number estimation.
.prepare_matrix(X, demean = c("both", "individual", "time", "none"), standardize = TRUE)
X |
Numeric matrix of dimensions T x N (time periods x units). |
demean |
Character string specifying the demeaning method. One of:
"both", "individual", "time", or "none". |
standardize |
Logical. If TRUE (the default), columns are scaled to unit variance. |
When demean = "both", the function iterates individual and time
demeaning to convergence (two passes are sufficient for practical purposes).
This follows the within-transformation used in panel data models.
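As an illustration of the two-way demeaning, the base-R sketch below removes unit (column) means and then time (row) means with `sweep()`, and afterwards scales columns to unit variance. It is not the package's implementation, and the variable names are hypothetical. For a balanced panel with no missing values, a single pass of each step already zeroes both sets of means up to floating point error.

```r
# Two-way demeaning of a T x N matrix, then optional column scaling.
set.seed(42)
X <- matrix(rnorm(200 * 100, mean = 5), 200, 100)

X1 <- sweep(X,  2, colMeans(X))    # remove individual (column) means
X2 <- sweep(X1, 1, rowMeans(X1))   # remove time (row) means

max(abs(colMeans(X2)))             # numerically zero
max(abs(rowMeans(X2)))             # numerically zero

# Scale each column to unit standard deviation.
Xs <- sweep(X2, 2, apply(X2, 2, sd), "/")
range(apply(Xs, 2, sd))
```

Scaling after demeaning keeps the column means at zero, since each column is only multiplied by a constant.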
A demeaned (and optionally scaled) numeric matrix of the same
dimensions as X.
Bai, J. and Ng, S. (2002). Determining the Number of Factors in Approximate Factor Models. Econometrica, 70(1), 191-221.
.extract_eigenvalues, select_factors
## Not run:
set.seed(42)
X <- matrix(rnorm(200 * 100, mean = 5), 200, 100)
X_clean <- .prepare_matrix(X, demean = "both", standardize = TRUE)
## End(Not run)
Produces a scree plot of the leading eigenvalues with the selected number of factors marked.
## S3 method for class 'factor_select'
plot(x, main = "Scree Plot", ...)
x |
A factor_select object, as returned by select_factors. |
main |
Character string. Plot title. Defaults to "Scree Plot". |
... |
Further arguments passed to the underlying plotting call. |
Invisibly returns x, called for its side effect of
producing a scree plot.
Print Method for factor_select Objects
## S3 method for class 'factor_select'
print(x, ...)
x |
A factor_select object, as returned by select_factors. |
... |
Further arguments passed to or from other methods. |
Invisibly returns x, called for its side effect of
printing a summary of the factor selection results to the console.
A unified interface for estimating the number of factors in a large dimensional approximate factor model. Preprocesses the data and dispatches to one or more factor number estimators.
select_factors(X, method = "ahn_horenstein", kmax = NULL, demean = c("both", "individual", "time", "none"), standardize = TRUE, h = 1L, alpha = 0.05)
X |
A numeric matrix of dimensions T x N (time periods x units), or an object coercible to a numeric matrix. Must be a balanced panel with no missing values. |
method |
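Character vector specifying which estimator(s) to use. One or
more of the estimators documented here, for example "ahn_horenstein", "bai_ng", "abc", or "lam_yao". Defaults to "ahn_horenstein". |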
kmax |
Integer. Maximum number of factors to consider. Defaults to
NULL, in which case a value is chosen internally. |
demean |
Character string passed to .prepare_matrix. |
standardize |
Logical. Whether to standardize columns to unit variance
before estimation. Defaults to TRUE. |
h |
Integer. Number of lags to use for the "lam_yao" method. Defaults to 1L. |
alpha |
Numeric. Significance level for the Onatski (2009) sequential test. Defaults to 0.05. |
The data are first preprocessed via .prepare_matrix() and then
a single eigendecomposition is performed via .extract_eigenvalues(),
which is shared across all requested estimators for efficiency.
The default method is "ahn_horenstein", which is recommended for
most applications. It is robust to perturbations in the eigenvalue
spectrum and performs well when only one of N or T is large.
The "bai_ng", "abc", and "lam_yao" methods always
use unstandardized data because their penalty terms and auto-covariance
structure depend on the actual scale of the data.
An object of class "factor_select", which is a named list
with the following elements:
Named integer vector of selected factor numbers, one per method requested.
Character vector of methods used.
Integer. Maximum number of factors considered.
Numeric vector of leading eigenvalues.
Named list of full output from each estimator.
The matched call.
Ahn, S.C. and Horenstein, A.R. (2013). Eigenvalue Ratio Test for the Number of Factors. Econometrica, 81(3), 1203-1227.
Bai, J. and Ng, S. (2002). Determining the Number of Factors in Approximate Factor Models. Econometrica, 70(1), 191-221.
Alessi, L., Barigozzi, M. and Capasso, M. (2010). Improved Penalization for Determining the Number of Factors in Approximate Factor Models. Statistics and Probability Letters, 80, 1806-1813.
Lam, C. and Yao, Q. (2012). Factor Modelling for High-Dimensional Time Series: Inference for the Number of Factors. The Annals of Statistics, 40(2), 694-726.
.ahn_horenstein, .bai_ng,
.abc, .lam_yao,
.prepare_matrix, .extract_eigenvalues
set.seed(42)
N <- 100; T <- 200; k_true <- 3
Lambda <- matrix(rnorm(N * k_true), N, k_true)
F_mat <- matrix(rnorm(T * k_true), T, k_true)
E <- matrix(rnorm(T * N, sd = 0.5), T, N)
X <- F_mat %*% t(Lambda) + E
select_factors(X)
Generates a simulated panel data matrix from a static approximate factor model. Useful for testing and benchmarking factor number estimators.
simulate_factor_model(N, TT, k, sd = 1, seed = NULL)
N |
Integer. Number of cross-sectional units. |
TT |
Integer. Number of time periods. Named TT rather than T, which in R is a shorthand for TRUE. |
k |
Integer. True number of factors. |
sd |
Numeric. Standard deviation of the idiosyncratic error term.
Defaults to 1. |
seed |
Integer or NULL. If supplied, used to seed the random number generator for reproducibility. Defaults to NULL. |
The data generating process follows the standard approximate factor
model of Chamberlain and Rothschild (1983) as used in the simulation
exercises of Ahn and Horenstein (2013). Factors and loadings are
independent standard normal draws. Errors are i.i.d. normal with
mean zero and standard deviation sd.
The signal-to-noise ratio is controlled by sd — smaller values
produce a cleaner factor structure that is easier for estimators to
recover. The default sd = 1 matches the baseline simulation
design of Ahn and Horenstein (2013) with theta = 1.
A numeric matrix of dimensions TT x N generated from:
$$X = F \Lambda' + e$$
where $F$ is a TT x k matrix of factors drawn from
$N(0, 1)$, $\Lambda$ is an N x k matrix of loadings
drawn from $N(0, 1)$, and $e$ is a TT x N matrix of
idiosyncratic errors drawn from $N(0, sd^2)$.
Ahn, S.C. and Horenstein, A.R. (2013). Eigenvalue Ratio Test for the Number of Factors. Econometrica, 81(3), 1203-1227.
Chamberlain, G. and Rothschild, M. (1983). Arbitrage, Factor Structure, and Mean-Variance Analysis on Large Asset Markets. Econometrica, 51(5), 1281-1304.
# Simulate a factor model with 3 factors
X <- simulate_factor_model(N = 100, TT = 200, k = 3, sd = 0.5, seed = 42)
dim(X)
# Pass directly to select_factors
result <- select_factors(X)
result$k
Summary Method for factor_select Objects
## S3 method for class 'factor_select'
summary(object, ...)
object |
A factor_select object, as returned by select_factors. |
... |
Further arguments passed to or from other methods. |
Invisibly returns object, called for its side effect
of printing a summary including leading eigenvalues to the console.