In traditional data warehousing, onboarding large datasets is onerous. It often involves maintaining a separate data environment and duplicate data storage, which can take weeks to implement. For example, before they can analyse a new dataset from a data vendor, traders and researchers typically have to raise support tickets, wait for the right engineers to become available, and then wait again while new data pipelines are built and kicked off.
When SigTech and IHS Markit came together to make IHS Markit datasets available for SigTech platform users, data teams from both sides quickly agreed on two overarching objectives:
1. Fast data discovery
Newly added IHS Markit datasets had to be instantly available within the SigTech platform and data catalog. New datasets shouldn't require any additional setup.
2. Efficient data access
Datasets should be served directly from the IHS Markit Data Lake upon request, with no intermediate copying steps and no risk of stale data.
To achieve these two objectives, we pioneered a technical solution that integrated two existing data architectures hosted on AWS.
Fast data discovery is enabled via catalog synchronisation. The IHS Markit Data Lake Catalogue is kept in sync with the SigTech catalog in real time, so users can search IHS Markit datasets and their metadata alongside all existing SigTech datasets.
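To make the idea of catalog synchronisation concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not the actual SigTech or IHS Markit implementation: the `DatasetEntry` and `Catalog` classes, their fields, and the dataset names are all hypothetical, and the sketch performs a single sync pass rather than showing how updates are triggered in real time.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class DatasetEntry:
    """One catalog record. The fields are illustrative, not a real schema."""
    dataset_id: str
    source: str
    description: str


@dataclass
class Catalog:
    """Hypothetical unified catalog holding entries from several sources."""
    entries: dict = field(default_factory=dict)

    def sync_from(self, source, upstream):
        """Mirror the upstream entries for one source.

        Entries still present upstream are inserted or updated; entries
        from that source that have disappeared upstream are removed.
        """
        upstream_ids = {entry.dataset_id for entry in upstream}
        stale = [
            dataset_id
            for dataset_id, entry in self.entries.items()
            if entry.source == source and dataset_id not in upstream_ids
        ]
        for dataset_id in stale:
            del self.entries[dataset_id]
        for entry in upstream:
            self.entries[entry.dataset_id] = entry

    def search(self, term):
        """Return the ids of all datasets whose description contains term."""
        term = term.lower()
        return sorted(
            entry.dataset_id
            for entry in self.entries.values()
            if term in entry.description.lower()
        )


# Usage: one local dataset plus two made-up vendor datasets.
catalog = Catalog()
catalog.entries["local_fx"] = DatasetEntry("local_fx", "sigtech", "FX spot rates")
catalog.sync_from(
    "vendor",
    [
        DatasetEntry("v_cds", "vendor", "CDS spreads"),
        DatasetEntry("v_fx", "vendor", "FX forward rates"),
    ],
)
print(catalog.search("fx"))
```

Because synced entries live in the same structure as existing ones, a single search covers both sources, which is the property the synchronisation is meant to provide.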
Efficient data access is enabled via query federation and caching. When an IHS Markit dataset is queried, data is fetched directly from the IHS Markit Data Lake. Filtering and aggregation are performed at the data lake level, and results are cached so that similar queries do not re-fetch the same data. Query results are transferred to SigTech users in compressed Parquet format to maximise throughput. The same approach can readily be applied to other integrations.
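The caching side of query federation can be sketched as follows. This is a simplified illustration, not the production design: `FederatedQueryCache`, `fake_lake_execute`, and the placeholder rows in `FAKE_LAKE` are all hypothetical, the remote data lake is simulated in memory, and Parquet serialisation is left out. The sketch also assumes that two queries count as the same only when their dataset, filters, and aggregation match after normalising key order; how the real system decides that queries are similar is not specified here.

```python
import hashlib
import json

# Placeholder rows standing in for a remote data lake (made-up values).
FAKE_LAKE = {
    "prices": [
        {"ticker": "AAA", "region": "EU"},
        {"ticker": "AAA", "region": "US"},
        {"ticker": "BBB", "region": "EU"},
    ]
}


def fake_lake_execute(dataset, filters, aggregate):
    """Simulate filtering and aggregation running on the data lake side."""
    matched = [
        row
        for row in FAKE_LAKE[dataset]
        if all(row.get(key) == value for key, value in filters.items())
    ]
    if aggregate == "count":
        return len(matched)
    return matched


class FederatedQueryCache:
    """Forward queries to a remote executor and cache results by query."""

    def __init__(self, remote_execute):
        self._remote_execute = remote_execute
        self._results = {}
        self.remote_calls = 0  # counts how often the remote side is hit

    @staticmethod
    def _key(dataset, filters, aggregate):
        # sort_keys makes the key independent of filter ordering.
        canonical = json.dumps(
            {"dataset": dataset, "filters": filters, "aggregate": aggregate},
            sort_keys=True,
        )
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

    def query(self, dataset, filters, aggregate=None):
        key = self._key(dataset, filters, aggregate)
        if key not in self._results:
            self.remote_calls += 1
            self._results[key] = self._remote_execute(dataset, filters, aggregate)
        return self._results[key]


# Usage: the second call differs only in filter order, so it is a cache hit.
cache = FederatedQueryCache(fake_lake_execute)
first = cache.query("prices", {"ticker": "AAA", "region": "EU"}, "count")
second = cache.query("prices", {"region": "EU", "ticker": "AAA"}, "count")
```

Pushing the filter and aggregation into the remote call means only the reduced result crosses the network, and the normalised cache key keeps equivalent queries from triggering a second fetch.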