Science of data curation
We measure data quality across six key dimensions: completeness, accuracy, uniqueness, validity, consistency, and timeliness. A firm’s outdated data infrastructure may demand excessive maintenance and too many manual, labor-intensive processes. Modern data infrastructure, with governance and automated standardization and normalization built in, lets data flow freely between departments and better serves investors.
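To make the dimensions concrete, here is a minimal sketch of a per-dataset scorecard. The column names (cusip, price, as_of) and thresholds are illustrative assumptions, and accuracy and consistency are omitted because they require a reference source to compare against:

```python
# Minimal scorecard for four of the six quality dimensions; accuracy and
# consistency need a reference source, so they are omitted here.
# Column names and thresholds are illustrative assumptions.
import pandas as pd

def quality_scorecard(df: pd.DataFrame) -> dict:
    now = pd.Timestamp.now(tz="UTC")
    age_seconds = (now - df["as_of"].max()).total_seconds()
    return {
        # completeness: share of non-null cells across the table
        "completeness": float(df.notna().mean().mean()),
        # uniqueness: share of rows that do not duplicate another row
        "uniqueness": 1.0 - float(df.duplicated().mean()),
        # validity: share of prices inside an assumed sane domain
        "validity": float(df["price"].between(0, 1e6).mean()),
        # timeliness: full marks if the newest record is under a day old
        "timeliness": min(1.0, 86_400 / max(age_seconds, 1.0)),
    }

df = pd.DataFrame({
    "cusip": ["037833100", "594918104", None],
    "price": [189.30, 402.15, -1.0],
    "as_of": pd.to_datetime(["2024-05-01", "2024-05-01", "2024-04-02"], utc=True),
})
print(quality_scorecard(df))
```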
A sophisticated data operations layer allows CIOs and CDOs to create fertile conditions for incremental augmentation, standardized data models, stronger data security, and efficient data governance processes. In today’s exceedingly complicated markets, I cannot overstate the importance of data infrastructure that can normalize multiple views of the same models and map existing data structures to a unified framework. The process begins with identifying common data entities and their relationships across systems.
As funds and managers grow and diversify, they accumulate fragmented systems that store and process data in different formats, at different granularities, and for different operational uses. This makes it painful to integrate data across systems and extract meaningful insights. Normalization ensures that data is stored in a consistent format, enabling seamless integration, analysis, and reporting across systems.
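A minimal sketch of that mapping step might look like the following. The canonical model, field names, and unit conventions are all assumptions, and a real pipeline would resolve tickers and ISINs to a single identifier via a security master:

```python
# Illustrative sketch: mapping two system-specific trade feeds onto one
# canonical model. Field names and unit conventions are assumptions.
from dataclasses import dataclass

@dataclass
class CanonicalTrade:
    instrument_id: str   # common entity shared across systems
    quantity: float      # always in full units in the canonical model
    price: float
    currency: str

def from_oms(row: dict) -> CanonicalTrade:
    # hypothetical OMS feed: keys on ticker, stores quantity in thousands
    return CanonicalTrade(row["ticker"], row["qty_k"] * 1_000,
                          row["px"], row["ccy"])

def from_accounting(row: dict) -> CanonicalTrade:
    # hypothetical accounting feed: keys on ISIN, full-unit quantities
    return CanonicalTrade(row["isin"], row["units"],
                          row["unit_cost"], row["currency_code"])

trades = [
    from_oms({"ticker": "AAPL", "qty_k": 5, "px": 189.3, "ccy": "USD"}),
    from_accounting({"isin": "US0378331005", "units": 5_000,
                     "unit_cost": 189.3, "currency_code": "USD"}),
]
print(trades)
```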
Building trust in investment data across the organization
Poor data quality, manifesting in missing values, duplicates, and errors, can severely undermine the effectiveness of a company’s data-driven strategies. To combat this, organizations should embed automated quality checks within their data management systems. These checks should detect and resolve inconsistencies, identify missing values, and flag erroneous data before it propagates through the system. Automated validation processes can cross-check incoming data against predefined quality standards. The earlier you catch an issue, the less impact it has downstream.
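As a minimal sketch of such pre-load validation, the rules below quarantine bad records before they enter the system; the rule names, fields, and thresholds are assumptions:

```python
# Illustrative sketch: validation rules applied to incoming records before
# they load. Rule names, fields, and thresholds are assumptions.
import math

RULES = {
    "missing_id": lambda r: r.get("instrument_id") in (None, ""),
    "bad_price":  lambda r: not (isinstance(r.get("price"), (int, float))
                                 and math.isfinite(r["price"])
                                 and r["price"] > 0),
    "stale":      lambda r: r.get("age_seconds", 0) > 86_400,
}

def validate(record: dict) -> list[str]:
    """Return the names of every rule the record violates."""
    return [name for name, broken in RULES.items() if broken(record)]

incoming = [
    {"instrument_id": "US0378331005", "price": 189.3, "age_seconds": 60},
    {"instrument_id": "", "price": -5.0, "age_seconds": 90_000},
]
for rec in incoming:
    issues = validate(rec)
    print("load" if not issues else f"quarantine {issues}")
```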
The people in treasury must trust the data they use to track collateral, generate margin reports, or calculate daily impact against positions. The people in portfolio accounting must trust the data they use to analyze performance across multiple currencies, generate P&L along multiple dimensions, or validate NAVs. Traders must trust the data to deploy strategies quickly and allocate capital more intelligently. Consistent, centralized securities data is essential to preserve integrity across trade capture, asset servicing, and accounting processes.
Governance, lineage, and bitemporal modeling in financial systems
Data curation is also about buttoned-up data governance that gives everyone the right access, permissions, and entitlements, visibility into the data catalog, and clear lineage, observability, and bitemporal data history for audits. Bitemporal modeling is not easy, but it is necessary for reliable historical information: the system must track data values along two timelines, when a fact was true in the world (valid time) and when the system recorded it (transaction time).
We must not only know our data; we must know when we knew it, because the real world is complicated and can change underfoot. Trades get amended, earnings get restated, mistakes get corrected. It’s not enough to know that you made a prediction and it didn’t pan out; it’s important to be able to recall exactly what you knew when you made the prediction. Do you have a model problem or a data timeliness problem? Without bitemporal data you can’t know what you knew at the time.
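A minimal sketch of what that lookup involves, assuming a toy table where each row carries a valid-time interval and a transaction-time interval (all names and dates are illustrative):

```python
# Illustrative sketch of a bitemporal lookup. Each row carries a valid-time
# interval (when the fact was true in the world) and a transaction-time
# interval (when the system believed it). All names are assumptions.
import pandas as pd

history = pd.DataFrame([
    # trade price booked as 100.0 on Mar 1, amended to 101.5 on Mar 5
    {"trade_id": "T1", "price": 100.0, "valid_from": "2024-03-01",
     "valid_to": "2200-01-01", "recorded_at": "2024-03-01",
     "superseded_at": "2024-03-05"},
    {"trade_id": "T1", "price": 101.5, "valid_from": "2024-03-01",
     "valid_to": "2200-01-01", "recorded_at": "2024-03-05",
     "superseded_at": "2200-01-01"},  # 2200-01-01 stands in for "open-ended"
])
for col in ["valid_from", "valid_to", "recorded_at", "superseded_at"]:
    history[col] = pd.to_datetime(history[col], utc=True)

def as_of(df, valid_time, knowledge_time):
    """What did we believe at knowledge_time about valid_time?"""
    vt = pd.Timestamp(valid_time, tz="UTC")
    kt = pd.Timestamp(knowledge_time, tz="UTC")
    mask = ((df["valid_from"] <= vt) & (vt < df["valid_to"]) &
            (df["recorded_at"] <= kt) & (kt < df["superseded_at"]))
    return df[mask]

# The price in effect on Mar 2, as we knew it on Mar 2 vs. after the amendment:
print(as_of(history, "2024-03-02", "2024-03-02")["price"].item())  # 100.0
print(as_of(history, "2024-03-02", "2024-03-10")["price"].item())  # 101.5
```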
Data curation means silencing the noise
Let’s say the CIO of the aforementioned multi-strategy hedge fund is trying to clearly understand the firm’s total risk. They want to measure the risk each strategy introduces and net strategies against one another where they offset. While it’s entirely reasonable for a macro or arbitrage strategy to need a view of the data completely different from the one a private credit strategy needs, these silos make a unified view harder to assemble. With poor data curation and no standards for time horizon or sentiment tone, the CIO will try, and fail, to compare strategies or assess their joint exposure and offsets. There’s nothing but incomprehensible data noise.
You know you have achieved data curation when new datasets are seamlessly ingested and normalized, anyone in the firm can grab the information they need when they need it, and the information they pull is accurate all the way upstream for risk modeling, analytics, compliance, and trade execution.
Moneyball in finance
A baseball manager will be fired if he keeps fielding players based solely on batting averages without knowing if they’re hitting dingers off aces or bunting against bullpen retreads. Likewise, a CIO operating without curated, normalized, and context-tagged performance data is managing blind.
Standardized analyst models enable legitimate performance comparison. Leadership can see who consistently outperforms and who is anchoring to consensus. Even more crucially, without standardized risk factor models, risk managers cannot precisely gauge potential overlaps in exposure, and PMs execute trades unaware of correlated tail risk in shared exposures. My colleague wrote, in Traders Magazine: “A firm that relies on bad, inaccurate, and untrustworthy data is flying blind with overly complex workflows, will be slow to incorporate new strategies or asset classes, and will be unable to make informed portfolio decisions.”iv
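As a toy illustration of why standardization matters: once every book reports exposures against the same factor model, firm-level overlaps fall out of a simple aggregation. The book names, factor names, and betas below are assumptions:

```python
# Illustrative sketch: aggregating standardized factor exposures across books
# to surface shared tail risk. Book, factor, and beta values are assumptions.
import pandas as pd

exposures = pd.DataFrame({
    "book":   ["macro", "macro", "credit", "equity_ls", "equity_ls"],
    "factor": ["rates", "usd",   "rates",  "rates",     "momentum"],
    "beta":   [0.8,     -0.5,    0.6,      0.4,         0.7],
})

# Firm-level exposure per factor: a concentration no single PM can see alone
firm = exposures.groupby("factor")["beta"].sum().sort_values(ascending=False)
print(firm)  # rates = 1.8, held across three books
```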
Driving alpha by turning investment data into actionable products
Datasets, whether unstructured, structured, or standardized, are not actionable data products. After a firm has installed the right data infrastructure layer that allows it to structure unstructured data; after it has normalized disparate data sources, signals, and analyst models so that everyone speaks the same language; and after it has achieved explainability and compliance through data governance, it has become a data curation all-star. The fund has silenced the din of data noise and is now positioned to drive alpha.
Ideally, people across the firm’s functions can create data products, or dashboards, tailored to the needs of their specific parts of the business without calling on data engineers. Firms can create dashboards, for example, that link trades to fundamental theses, factor exposures, and real-time macro risk. They can unlock talented traders’ and managers’ creativity by enabling them to seamlessly generate capital activity, fund performance, and investor balance reports. Real-time, accurate data enables better decision-making, risk management, and compliance, directly impacting portfolio performance and investor confidence.
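A minimal sketch of such a dashboard-style data product, joining hypothetical curated trade, thesis, and factor tables (all names and values are assumptions):

```python
# Illustrative sketch of a self-serve data product: joining curated trade,
# thesis, and factor tables into one dashboard-ready view.
import pandas as pd

trades  = pd.DataFrame({"trade_id": ["T1", "T2"],
                        "instrument_id": ["US0378331005", "US5949181045"],
                        "pnl": [120_000, -45_000]})
theses  = pd.DataFrame({"trade_id": ["T1", "T2"],
                        "thesis": ["AI capex cycle", "margin compression"]})
factors = pd.DataFrame({"instrument_id": ["US0378331005", "US5949181045"],
                        "rates_beta": [0.2, 0.3],
                        "momentum_beta": [0.9, -0.1]})

dashboard = (trades.merge(theses, on="trade_id")
                   .merge(factors, on="instrument_id"))
print(dashboard)  # every trade now carries its thesis and factor context
```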
The winners won’t just gather more data—they’ll draw clearer insights from it. Institutional asset managers and hedge fund CIOs face critical decisions amid fragmented data infrastructure. By prioritizing curated investment data management, firms can foster innovation, align risk and performance, and enable AI adoption. Superior data governance and real-time insights offer a competitive edge in today’s hypercompetitive investment management space.