The CIMView Data Advantage - Data Provenance - CIMView

Register Here During GCPA : Register

The CIMView Data Advantage – Data Provenance

Good data is a requirement for good business. That’s why CIMView collects and stores vital electricity market information published by ERCOT. The data we collect powers insights that matter.

One of the strategies that CIMView employs to ensure data quality is data provenance. Data provenance requires that each record we import into our production datasets is traceable to a specific input source with information that speaks to the reliability of that data. ERCOT typically publishes public market and grid operating information in publicly accessible files, in extensible markup language (XML), in comma separated values (CSV), or in Microsoft Excel files; so in practice for ERCOT data this means that we want to record which file each record came from.

Although this can be done easily enough by hand, the CIMView data processing engine is able automatically to inject a “run reference id” into the data stream corresponding to each unique input source. The run reference id can be used to lookup information including: the source data name (eg- filename), whether the source data is part of an archive, the name of software module that processed the data and its software version, how many bytes or rows were created, the date and time of the import operation, and the final disposition of the import operation with error messages in case of error.

To illustrate, the following case occurred recently. In 5-minute locational marginal price (LMP) data we identified a market data gap in the ERCOT source data that happened during the summer of 2021 between 4:40 PM and 5:00 PM on the day in question, resulting in four missing files. All of these files were sent shortly after 5:00 PM to backfill the gap, but sometime later three of these files were sent again with updated values. We were able quickly and with precision to remove the invalid data from our production datasets using the run reference ids assigned to the stale data files. Of course, all original source data files are retained in our data lake and can also be identified quickly by run reference id if needed later.

The data provenance enabled by our run reference ids has other applications beyond efficient operations. Downstream analysts can also create quality metrics on each data source and include/exclude them using the run reference id.  In the schematic above, an analyst notices that the variation of data in one file is zero, concludes that a problem happened during production of that data, and can exclude it in analytics using the run reference id.  Also machine learning developers can use this to quickly define test, training and validation datasets for algorithm development.  And since the CIMView data processing engine can do this automatically, it doesn’t cost much in time or effort.