Data Quality

What the system sees, where the gaps are, and what's been mitigated.

Universe overview

Companies

1,602

survivor universe

Start

2006-01-03

first source date

End

2015-12-31

last source date

Coverage timeseries

[Chart: OHLCV count, fundamentals count, and universe size over time, 2006-01-31 to 2015-12-31; y-axis ticks at 1,602 and 1,762]

Survivorship-biased universe: all 1,602 companies are present every day in the source data.
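A minimal sketch of how a daily coverage count like the one charted above could be checked. It assumes a long-format price table with hypothetical `date` and `company` columns; the actual schema used by the report script is not shown in this page.

```python
import pandas as pd

def daily_counts(ohlcv: pd.DataFrame) -> pd.Series:
    """Number of distinct companies with a price row on each date."""
    return ohlcv.groupby("date")["company"].nunique()

# Toy data (hypothetical): two companies, both present on both days.
ohlcv = pd.DataFrame({
    "date": pd.to_datetime(["2006-01-03", "2006-01-03", "2006-01-04", "2006-01-04"]),
    "company": [1, 2, 1, 2],
})

counts = daily_counts(ohlcv)

# A survivor-only universe shows the same count on every date.
is_constant = counts.nunique() == 1
```

A count that never changes over ten years is itself a signal that delisted names have been dropped from the source.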

Per-signal coverage

Signal          Warmup days  First valid date  Mean coverage
momentum        252          2007-01-04        100.0%
mean_reversion  5            2006-01-10        100.0%
value           0            2006-03-22        98.6%
growth          252          2007-01-04        96.1%
quality         0            2006-03-22        98.6%
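One way the first valid date and mean coverage could be derived is sketched below. The definition of coverage used here (share of companies with a non-missing signal value, averaged over days from the first valid date onward) is an assumption, as is the wide date-by-company layout; `signal_coverage` is a hypothetical helper, not a function from the report script.

```python
import pandas as pd

def signal_coverage(signal: pd.DataFrame) -> dict:
    """signal: rows = dates, columns = company ids, NaN = no valid value."""
    valid = signal.notna()
    has_any = valid.any(axis=1)
    if not has_any.any():
        return {"first_valid_date": None, "mean_coverage": float("nan")}
    # idxmax on a boolean Series returns the first label that is True.
    first_valid = has_any.idxmax()
    # Share of companies covered on each day, warmup period excluded.
    daily_cov = valid.loc[first_valid:].mean(axis=1)
    return {"first_valid_date": first_valid, "mean_coverage": float(daily_cov.mean())}

# Toy data (hypothetical): one warmup day, then partial and full coverage.
dates = pd.to_datetime(["2006-01-03", "2006-01-04", "2006-01-05", "2006-01-06"])
signal = pd.DataFrame(
    {"A": [None, 0.1, 0.2, 0.3], "B": [None, None, 0.5, 0.6]},
    index=dates,
    dtype="float64",
)
result = signal_coverage(signal)
```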

Universe composition

Exchange

NYSE  915
NASDAQ  623
PinkSheets  64

Sector

Banks  116
Oil, Gas and Consumable Fuels  72
Specialty Retail  63
Insurance  61
Semiconductors and Semiconductor Equipment  58
Machinery  55
Electronic Equipment, Instruments and Components  52
Health Care Equipment and Supplies  51
Software  50
Hotels, Restaurants and Leisure  50
Capital Markets  46
Chemicals  43
Health Care Providers and Services  43
Commercial Services and Supplies  38
Energy Equipment and Services  35
Professional Services  34
Aerospace and Defense  31
Biotechnology  30
Metals and Mining  30
Household Durables  29
Communications Equipment  28
Food Products  27
Electric Utilities  26
Life Sciences Tools and Services  23
Media  22
Retail REITs  22
Pharmaceuticals  19
Ground Transportation  18
Building Products  18
Textiles, Apparel and Luxury Goods  17
Residential REITs  16
Consumer Staples Distribution and Retail  16
Technology Hardware, Storage and Peripherals  15
Electrical Equipment  15
Financial Services  15
Entertainment  14
Automobile Components  14
Specialized REITs  14
Construction and Engineering  14
IT Services  13
Containers and Packaging  13
Diversified Telecommunication Services  13
Trading Companies and Distributors  13
Multi-Utilities  13
Gas Utilities  13
Broadline Retail  12
Diversified Consumer Services  12
Health Care REITs  11
Consumer Finance  10
Personal Care Products  10
Office REITs  10
Leisure Products  9
Beverages  9
Hotel and Resort REITs  8
Mortgage Real Estate Investment Trusts (REITs)  7
Household Products  7
Air Freight and Logistics  7
Interactive Media and Services  6
Diversified REITs  6
Passenger Airlines  5
Distributors  5
Real Estate Management and Development  5
Tobacco  5
Marine Transportation  5
Paper and Forest Products  5
Water Utilities  5
Industrial REITs  5
Automobiles  4
Wireless Telecommunication Services  4
Industrial Conglomerates  4
Health Care Technology  3
Construction Materials  3
Independent Power and Renewable Electricity Producers  1
Transportation Infrastructure  1

Anomalies

Negative filing lags (fixed)

available_date = max(filing_date, period_end_date) ensures data is never accessible before quarter end.

Company  Filing date  Period end  Lag
255397   2011-04-25   2011-04-30  -5d
255397   2011-07-01   2011-07-30  -29d
255397   2011-09-30   2011-10-29  -29d
255397   2012-05-01   2012-05-05  -4d
890498   2013-03-28   2013-03-31  -3d
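The available_date rule can be sketched in pandas as follows, using the first, second, and last rows of the table above. The column names mirror the formula in the text; the DataFrame layout itself is an assumption.

```python
import pandas as pd

filings = pd.DataFrame({
    "company": [255397, 255397, 890498],
    "filing_date": pd.to_datetime(["2011-04-25", "2011-07-01", "2013-03-28"]),
    "period_end_date": pd.to_datetime(["2011-04-30", "2011-07-30", "2013-03-31"]),
})

# Lag in days; negative means the filing is dated before the period ended.
filings["lag_days"] = (filings["filing_date"] - filings["period_end_date"]).dt.days

# available_date = max(filing_date, period_end_date), taken row by row.
filed_after_period_end = filings["filing_date"] >= filings["period_end_date"]
filings["available_date"] = filings["filing_date"].where(
    filed_after_period_end, filings["period_end_date"]
)
```

For these three rows the lag is negative, so available_date falls back to period_end_date in each case.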

Extreme daily returns (clipped)

Returns are clipped at ±100% in the run_backtest engine.

Company   Date        Return      Cause                                                       Mitigation
21934749  2014-10-16  2,592.782x  reverse-split adjustment artifact                           clipped at ±100% in run_backtest engine
174336    2008-02-01  2,456.709x  likely corporate action or restructuring (crisis-era data)  clipped at ±100% in run_backtest engine
2557861   2008-11-21  600%        likely corporate action or restructuring (crisis-era data)  clipped at ±100% in run_backtest engine
28922     2009-07-29  330%        likely corporate action or restructuring (crisis-era data)  clipped at ±100% in run_backtest engine
296889    2009-03-23  275%        likely corporate action or restructuring (crisis-era data)  clipped at ±100% in run_backtest engine
177251    2009-03-17  242%        likely corporate action or restructuring (crisis-era data)  clipped at ±100% in run_backtest engine
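A minimal sketch of the clipping step, assuming daily returns are stored as decimal fractions (1.0 = +100%). `clip_returns` is a hypothetical helper; the actual code inside run_backtest is not shown here.

```python
import pandas as pd

def clip_returns(returns: pd.Series, limit: float = 1.0) -> pd.Series:
    """Cap daily returns to the range [-limit, +limit]."""
    return returns.clip(lower=-limit, upper=limit)

# 600% and 330% match two table entries; the last two values are ordinary days.
raw = pd.Series([6.0, 3.3, -0.25, 0.02])
clipped = clip_returns(raw)
```

Clipping limits the damage from bad prints but does not repair them: a clipped day still contributes a +100% return, so the affected names may deserve a manual look.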

Honest caveat

Survivorship bias

The source universe contains 1,602 surviving names. Known failed or acquired companies are absent, including:

Lehman Brothers · LEH · 2008-09-15 · bankruptcy

Bear Stearns · BSC · 2008-03 · acquired by JPM

Wachovia · WB · 2008-10 · acquired by WFC

General Motors · GM (old) · 2009-06 · bankruptcy

CIT Group · CIT · 2009-11 · bankruptcy

Estimated return inflation

1-4% annualized

All reported returns are upper bounds: the methodology is sound, but the data has a known limitation.
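As a rough illustration of what the estimate implies, the sketch below subtracts the 1-4% range from a reported annualized return. Treating the inflation as percentage points to subtract is an assumption, and both the 12% input and the `bias_adjusted_range` helper are hypothetical.

```python
def bias_adjusted_range(reported_annual_return: float,
                        inflation_low: float = 0.01,
                        inflation_high: float = 0.04) -> tuple[float, float]:
    """Return (pessimistic, optimistic) annualized return after the haircut."""
    pessimistic = reported_annual_return - inflation_high
    optimistic = reported_annual_return - inflation_low
    return pessimistic, optimistic

# Hypothetical reported figure of 12% annualized.
low, high = bias_adjusted_range(0.12)
print(f"{low:.2%} to {high:.2%}")
```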

Fundamentals freshness

Mean

42d

Median

38d

Max

620d

Days     Count
0-30     7062
30-60    49948
60-90    6518
90-120   212
120-150  82
150-180  48
180+     210

Most fundamentals are filed within 90 days of period-end; we use available_date as the PIT key.
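Using available_date as the point-in-time (PIT) key could look like the as-of join below. This is a sketch under assumptions: the table layouts, the `book_value` column, and the toy dates are hypothetical, and `pandas.merge_asof` stands in for whatever join the pipeline actually uses.

```python
import pandas as pd

# Fundamentals keyed by the date they became usable (sorted, as merge_asof requires).
fundamentals = pd.DataFrame({
    "company": [1, 1],
    "period_end_date": pd.to_datetime(["2010-03-31", "2010-06-30"]),
    "available_date": pd.to_datetime(["2010-05-10", "2010-08-09"]),
    "book_value": [100.0, 110.0],
}).sort_values("available_date")

# Trading dates on which a signal needs the latest known fundamentals.
dates = pd.DataFrame({
    "company": [1, 1, 1],
    "date": pd.to_datetime(["2010-05-03", "2010-06-15", "2010-08-09"]),
}).sort_values("date")

# For each trading date, take the most recent row with available_date <= date.
pit = pd.merge_asof(
    dates,
    fundamentals,
    left_on="date",
    right_on="available_date",
    by="company",
    direction="backward",
)
```

The first trading date precedes any available_date, so it gets no fundamentals at all. That is the behavior a PIT join should have: a missing value rather than a look-ahead.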

Reproduce: .venv/bin/python scripts/data_quality_report.py