How to Measure Data Accuracy?

If you believe that better data quality has huge business value, and you believe the old axiom that you cannot improve something if you cannot measure it, then it follows that measuring data quality is very, very important. And it’s not a one-time exercise. Data quality should be measured regularly to establish a baseline and trend; otherwise continuous improvement wouldn’t be possible.

Measuring data quality is not simple. We have all been exposed to metrics like accuracy, completeness, timeliness, integrity, consistency, appropriateness, etc. Wikipedia’s entry for Data Quality says there are over 200 such metrics. Some metrics, like completeness and integrity, are relatively easy to measure. Most data quality tools and ETL tools can express them as executable rules. But others are a lot harder to measure.

Accuracy is notoriously hard to measure. Let me give you an example. A Canadian law enforcement agency noticed that pickpocketing was unusually high in its crime statistics. Further investigation revealed that in the application for entering crime reports, “Pickpocketing” was the first item in the dropdown list box for crime type. So how would one go about measuring the accuracy of this field? I can only think of two good ways.

The first is to manually audit a sample. Take a small percentage of new crime reports and have data analysts go through them to determine whether, given the other pieces of descriptive information, the crime-type field is accurate.
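
For the sampling approach, turning audit results into a metric is straightforward. Below is a minimal sketch in Python; the records and the analyst-judgement function are placeholders, and the normal-approximation margin of error is just one reasonable choice.

```python
import math
import random

def estimate_accuracy(records, is_accurate, sample_size=200, z=1.96):
    """Estimate a field's accuracy from a manual audit of a random sample.

    `records` is the population of new reports (a list); `is_accurate` stands
    in for the analyst's judgement of whether the crime-type field matches the
    rest of the report. Returns the estimated accuracy and a margin of error.
    """
    sample = random.sample(records, min(sample_size, len(records)))
    correct = sum(1 for record in sample if is_accurate(record))
    p = correct / len(sample)
    margin = z * math.sqrt(p * (1 - p) / len(sample))   # ~95% confidence
    return p, margin

# Hypothetical usage:
# accuracy, moe = estimate_accuracy(new_reports, analyst_judgement)
# print(f"Crime-type accuracy: {accuracy:.1%} +/- {moe:.1%}")
```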

The second is to allow anyone in the organization to identify data inaccuracies and raise issues. The issues can then be routed to the right person for correction, and rolled up to compile metrics. This approach is akin to crowd-sourcing.
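
To illustrate what this implies, here is a sketch of a minimal issue log that lets issues be raised, routed, and rolled up into metrics. The structure and field names are my own invention, not any particular product’s API.

```python
from collections import Counter
from dataclasses import dataclass, field
from datetime import datetime

@dataclass
class DataIssue:
    """One suspected inaccuracy flagged by an end user (illustrative fields)."""
    record_id: str
    field_name: str
    description: str
    reported_by: str
    assigned_to: str = ""
    status: str = "open"          # open -> assigned -> resolved
    reported_at: datetime = field(default_factory=datetime.now)

class IssueLog:
    def __init__(self):
        self.issues: list[DataIssue] = []

    def raise_issue(self, issue: DataIssue) -> None:
        self.issues.append(issue)

    def rollup(self) -> Counter:
        """Roll unresolved issues up into a simple metric: issue count per field."""
        return Counter(i.field_name for i in self.issues if i.status != "resolved")

log = IssueLog()
log.raise_issue(DataIssue("CR-1042", "crime_type",
                          "Recorded as pickpocketing; narrative describes burglary",
                          reported_by="analyst_17"))
print(log.rollup())   # Counter({'crime_type': 1})
```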

I’ve seen other ways, but I don’t think they’re very effective. You could compare the data with authoritative records. But if you had authoritative records, this wouldn’t be a problem in the first place! You could also measure the statistical distribution and detect anomalies. For example, pickpocketing typically represents 10% of all crimes; if it goes up to 15%, there may be a problem. But it’s very hard to tell whether the data is wrong or there has been an actual change in the real world. You end up resorting to manual auditing again.
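
That said, the anomaly check itself is easy to automate, even if interpreting the result isn’t. Here is a sketch using the 10% and 15% figures from the example and a simple z-test on the observed proportion; the three-sigma threshold is an arbitrary choice.

```python
import math

def share_is_anomalous(count, total, baseline_rate=0.10, z_threshold=3.0):
    """Flag when a category's share drifts far from its historical baseline.

    A large z-score says the shift is statistically real; it cannot say
    whether the data is wrong or the world changed -- that still takes
    a manual audit.
    """
    observed = count / total
    std_err = math.sqrt(baseline_rate * (1 - baseline_rate) / total)
    z = (observed - baseline_rate) / std_err
    return abs(z) > z_threshold, z

# 15% pickpocketing in a month of 2,000 reports, against a 10% baseline:
flagged, z = share_is_anomalous(300, 2000)
print(flagged, round(z, 1))   # True 7.5
```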

Of these techniques, I think crowd-sourcing is the best. The trick is to provide end users with a dead easy way to raise an issue the moment an inaccuracy is discovered. Both Kalido MDM and Data Governance Director provide browser interfaces for raising issues. We also have an open API for issues to be reported, tracked, and acted upon.

Ideally, every screen that presents data to end users, whether it’s a business application, a dashboard, or a report, would have a button for raising data issues. So, SAP and Oracle, what are you waiting for?

10 replies
  1. Dylan Jones says:

    One approach I’ve seen to reducing these kinds of data-entry-related inaccuracies is to design contextual, dynamic forms.

    For example, if you are entering details of a pickpocketing incident, you may wish to enter details of the victim, the time of day, the street, and the pickpocket’s approach – was it violent, in a busy crowd, at a concert, etc.?

    If the crime was burglary then there would be a completely different set of fields.

    The point being that sometimes the form design itself creates inaccuracies; by making it easier for staff to enter the correct information, I’ve seen far better accuracy.

    I agree with your point completely, though, that it is far too difficult for downstream data users to flag issues with the data, and that addressing this is just a matter of common sense and basic process improvement.
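
    To make this concrete, here is a tiny sketch of the contextual-form idea in Python (the crime types and field names are invented for illustration; a real form builder would drive them from application metadata):

    ```python
    # Illustrative only: which extra fields a dynamic crime-report form
    # should show for each crime type (hypothetical field names).
    FIELDS_BY_CRIME_TYPE = {
        "pickpocketing": ["victim", "time_of_day", "street", "approach"],
        "burglary": ["property_address", "point_of_entry", "items_taken"],
    }

    def form_fields(crime_type):
        """Return the context-specific fields to display for this crime type."""
        return FIELDS_BY_CRIME_TYPE.get(crime_type, [])

    print(form_fields("pickpocketing"))
    # ['victim', 'time_of_day', 'street', 'approach']
    ```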

    • Winston Chen says:

      Dylan, yes, form design can absolutely improve accuracy. And this is something application vendors should pay more attention to. Also, as you said, process improvement is ultimately the most effective cure for data quality problems. Thanks for your comment.

      • Julian Schwarzenbach says:

        Winston, Dylan,

        Another way to counteract the problem of the default option being left unchanged is to set the default value to “Please select”. This makes it even easier to spot those who have not bothered to enter a suitable value!

        Julian

  2. Julian Schwarzenbach says:

    Winston,

    I fully agree that measuring accuracy is both a vital activity and one that is difficult to undertake. Your ‘pickpocket’ example is a good one, as it will be difficult to go back to those involved in a crime to confirm the details of the events.

    In the physical asset management world, accuracy checking is made difficult for a number of reasons:
    1. Assets are frequently widely dispersed, so accuracy checking may involve significant amounts of travel
    2. Assets may be in hazardous locations which prevent easy access and may require permits to work, multi-person teams, etc.
    3. Assets such as pipes and cables will typically be buried, so cannot be accessed to check the data accuracy
    4. Due to the wide variations in types and ages of assets deployed, it can be difficult to ensure that samples of assets checked for accuracy represent a valid subset of the overall asset stock
    5. Relying on checking data only when someone has to respond to a problem will not be representative of the full population of assets

    Although all these points indicate the difficulty of assessing the accuracy of asset data, they should not be used as excuses for not assessing your data accuracy. Without a valid assessment of accuracy, there is a risk that the resulting business decisions may be compromised.

    Julian

    • Winston Chen says:

      Julian, thanks for your comment. I heard a story from an oil pipeline operator about how often a crew would drive far out to perform maintenance on an asset, only to realize that the data about the asset was wrong and they had brought the wrong equipment. You’re right, physical assets present their own unique challenges.

  3. Ken O'Connor says:

    Hi Winston,

    Great post – well done. I really like the idea of empowering everyone in the organisation to flag data quality issues.

    Your post prompted me to write a new post about what I call the “Ryanair Data Entry Model”.

    Rgds Ken

  4. Sushil Kumra says:

    Measuring data quality is a challenging but not impossible task. There is no silver bullet. One has to define valid data values for each data element collected, so that one knows what to measure against. Descriptive data collection and validation is always a challenge. In descriptive data collection, drop-downs are often used to minimize keystrokes and improve data accuracy. Humans being human will make mistakes and select a wrong choice.
    To fix this problem, one needs to develop data validation based on the event context. If we are collecting data about a crime, as Dylan suggested, there are some data elements unique to a particular crime. For example, a pickpocketing location given as a house address definitely raises suspicion about whether the data captured in the crime field is accurate. This validation needs to take place as the data is being submitted to be saved and stored. Developing context-based validation is a daunting task, but I believe it will be effective.
    Another simple way to measure data quality is by using a “data profiling” tool. One can determine what kinds of data quality issues exist and take appropriate measures to fix them.
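
    As a rough sketch of the context-based validation described above (the rule and field names are invented for illustration, and in practice such a rule would flag a record for review rather than reject it outright):

    ```python
    def context_checks(report):
        """Cross-field checks that flag a crime report for review (illustrative)."""
        issues = []
        # A pickpocketing incident recorded at a residential address is
        # suspicious, per the example above.
        if (report.get("crime_type") == "pickpocketing"
                and report.get("location_type") == "residence"):
            issues.append("crime_type/location mismatch: pickpocketing at a residence")
        return issues

    print(context_checks({"crime_type": "pickpocketing", "location_type": "residence"}))
    ```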

    • Winston Chen says:

      Thanks Sushil for your comment. You’re absolutely right that event context is the key to solving the problem, but it is not easy. Context is a hard thing for computers to get — which makes automation hard.

  5. Jayati Naswa says:

    What, in your opinion, is the best way to measure the authoritativeness of data sources used by search engines in their search results?

    • John Evans says:

      Jayati,
      This is a good question. Given there is no standard for this, you should impose your own policies on what you consider to be authoritative. I think there are two angles to this: one, are the search results from authoritative sources, and two, are the search results relevant to the search term? These are not necessarily the same.

      You can influence relevance and authoritativeness of search results by using advanced search syntax to exclude sites you don’t believe are authoritative, as well as to exclude homonyms and to control context. For example, using Google, if you wanted to search the web for information on “jaguar” the animal, a search term “jaguar -car” would remove results related to the car brand. This would therefore increase relevance. If you did not consider Wikipedia authoritative, you could add “-site:Wikipedia.org” to the search syntax, or if you did not consider .net domains as authoritative, you could use “-site:*.net” which would remove results from any site using a .net domain. There is also Google syntax to limit results to come from a particular site that you consider authoritative.

      This is a simple example of how you might try to get more “accurate” search results, other than simply relying on the algorithms and indexes used by your favorite search engine.

