One of the difficult problems in healthcare information is sharing data between primary and secondary users. In fact, it’s come up in quite a few places recently, and seems to be generating more heat than light. The problem is that these two user bases have such radically different views of how the data should be understood.
The secondary users of data fundamentally live in a statistics-oriented world view. In fact, to be clear, that’s how I define what a secondary user is – someone who wishes to do analysis (usually statistical) on the data.
Their fundamental desire is to get data in spreadsheets or databases, in the form of a square – that is, rows and columns, because that’s the form that’s most amenable to statistical analysis. Typically, secondary users seek to control the reality that they interact with, to simplify the data by making a set of assumptions true. This is particularly true in research, for example, where the objective is to keep everything well controlled except the variables you are interested in. This is what I used to do:
- Design an experiment (or a trial protocol, just an experiment on a grand scale)
- Determine what data items capture the outcomes of the experiment
- Compare the possible result of analysis on these to your goal
- Repeat until they align
Secondary users who use the data for financial efficiency, quality, and safety reporting don’t quite have the same ability to control their reality, but they do still make choices about the degree to which they capture it.
The other feature of secondary use is that quality is a feature of the aggregate data, not individual points of data. As a user of secondary data, you worry about systematic error more than you worry about random error.
Primary users don’t have these kinds of choices – their record keeping system has to be robust against the cases they have to deal with, and – particularly in healthcare – they just have to deal with what they encounter. No one has designed a perfect record keeping system for healthcare (and all attempts have ended up looking horrifyingly complicated), so primary users have to tolerate ambiguity and looseness in their record keeping.
Because of this, primary users are obsessed with context and traceability in their records, so that they can judge for themselves the reliability of a particular piece of data. For a primary user, that’s the determiner of quality, and it is judged at the individual data point level. Errors in aggregated data – such as systematic bias – simply don’t matter in the same way. As a consequence, operational systems are characterized by hierarchical data structures.
These two groups cannot – and should not – share the same set of data elements. Recently I’ve been party to some discussions where parties from each of these communities claim that the others are wrong to use their own data element definitions.
But I think that’s wrong: the different perspectives on data are valid and necessary. That doesn’t mean that data shouldn’t migrate from one community to another – just that it’s going to need transforming. And the transform is not just a tax – it’s an investment in the strengths of each approach.
Let’s illustrate this with an example, using blood pressure. Classically, a blood pressure measurement includes two values, systolic and diastolic. In a clinical record, they’ll be written/recorded as something like 130/80. Clinical users can easily acquire and compare these values.
The first thing you do when you put these values into a spreadsheet is split them into two columns – Systolic and Diastolic – so that the statistics package can handle them as numbers. It’s simply assumed that the two numbers come from the same measurement, but that’s rarely stated anywhere (or, if it is, it’s generally stated in narrative prose, not in some formal definition). Of course, in this simple example, that’s pretty safe, because people will generally know that this is what is implied. But that’s not always true.
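A minimal sketch of that split, assuming the readings arrive as free-text “systolic/diastolic” strings (the function name and data are illustrative):

```python
# Split free-text blood pressure readings ("130/80") into two numeric
# columns, the way a secondary user would before loading a stats package.
# Note that the assumption that both numbers come from the same
# measurement is implicit in the parse - nothing here records it.

def split_bp(reading: str) -> tuple[int, int]:
    """Parse a 'systolic/diastolic' string into two integers."""
    systolic, diastolic = reading.split("/")
    return int(systolic), int(diastolic)

readings = ["130/80", "145/95", "118/76"]
table = [split_bp(r) for r in readings]  # rows and columns - the "square"
```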
In clinical practice, however, a blood pressure is not just a systolic/diastolic pair; you also need information about how the blood pressure was taken – in particular, what was the patient like? Maybe they were lying down, or extremely agitated? For little kids, you might have to take it on the leg, or even get a bad measurement and just have to go with what you can get.
This presents a recording challenge for operational systems – there’s a myriad of ways to capture this kind of uncertainty, and it’s not really clear how much of that quality information needs to be computable (one of the more comprehensive models is openEHR’s Blood Pressure Archetype, which has 18 elements not counting the stuff in the reference model).
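To make the contrast concrete, here’s a hypothetical sketch of the kind of contextual fields an operational record might carry alongside the two numbers. The field names are illustrative only – they are not the actual openEHR archetype elements:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class BloodPressureRecord:
    """Illustrative primary-care record: the two values, plus the context
    a clinician needs to judge the reliability of this one reading."""
    systolic: int
    diastolic: int
    position: Optional[str] = None       # e.g. "lying", "sitting", "standing"
    cuff_location: Optional[str] = None  # e.g. "left arm", "right leg"
    patient_state: Optional[str] = None  # e.g. "at rest", "agitated"
    comment: Optional[str] = None        # free text for everything else
```

Most of these fields never make it into the secondary user’s square of rows and columns – and for their purposes, that’s usually fine.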
Secondary users mostly aren’t interested in this stuff, particularly if they do research, where they can simply write a study protocol that eliminates all these things from their scope.
Migrating blood pressure measurements from a primary to a secondary use isn’t simply a matter of mapping from one model to another, but of adapting information from one set of perspectives and intellectual requirements to another.
Value the Transform
So, the transform is important. In tool form, it’s an ETL (Extract, Transform, Load) and it’s an explicit representation of the assumptions of the secondary data.
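A toy sketch of what “explicit representation of the assumptions” can mean in practice – the filter condition states, in code, which readings the secondary dataset is allowed to contain, instead of burying that in narrative prose (the record fields here are illustrative, not from any real model):

```python
# Toy ETL transform step: the filter makes the secondary dataset's
# assumptions explicit and inspectable, rather than implicit.

def transform(records):
    """Keep only readings that satisfy the study's stated assumptions:
    measured on the arm, patient at rest. Output is the flat 'square'."""
    return [
        (r["systolic"], r["diastolic"])
        for r in records
        if r.get("site") == "arm" and r.get("state") == "at rest"
    ]

records = [
    {"systolic": 130, "diastolic": 80, "site": "arm", "state": "at rest"},
    {"systolic": 150, "diastolic": 95, "site": "leg", "state": "agitated"},
]
print(transform(records))  # only the first record survives the assumptions
```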
Outside healthcare, this is hardly controversial; it’s the first step of OLAP: consolidation of the data.
Tom Beale reviewed this post for me (thanks), and pointed out that there’s an unfinished task here:
My suspicion is that the transform is going to be the interesting question in more computational clinical data situations – at the moment, everyone writes ETLs ad hoc. But we shouldn’t…
…and I agree, but you have to walk before you can run, and we can’t even crawl yet.
p.s. I’ll be making a follow-up post to this, describing some things we propose to do in FHIR regarding this.