#FHIR and Bulk Data Access Proposal

ONC has asked the FHIR community to add new capabilities to the FHIR specification to increase support for API-based access and push of data for large numbers of patients in support of provider-based exchange, analytics and other value-based services.

The background to this requirement is that while FHIR allows for access to data from multiple patients at a time, the Argonaut implementations are generally constrained to single-patient access, and require human-mediated login on a regular basis. This is mainly because the use case on which the Argonaut community focused was patient portal access. If this work is going to be extended to provide support for API-based access to a large number of patients in support of provider-based exchange, the following questions (among others) need to be answered:

  • How does a business client (a backend service, not a human) get access to the service? How are authorizations handled?
  • How do the client and server agree about which patients are being accessed, and which data is available?
  • What format is the data made available in?
  • How is the request made on a RESTful API?
  • How would the client and server most efficiently ensure the client gets the data it asks for, without sending all data every time?

The last few questions are important because the data could be pretty large (potentially more than 100,000,000 resources), and we’ve been focused on highly granular exchanges so far. Our existing solutions don’t scale well.

In response to some of these problems, the SMART team had drafted an initial strawman proposal, which a group of us (FHIR editors, ONC staff, EHR & other vendors) met to discuss further late one night at the San Diego WGM last week. Discussion was – as expected – vigorous. Between us, we hammered out the following refined proposal:


This proposal describes a way of granting an application access to data on a set of patients. The application can request a copy of all pertinent (clinical) data on the patients in a single download. Note: We expect that this data will be pretty large.

High-level Use Case Description – FHIR-enabled Population Services (this section provided by ONC)

  • Ecosystem outcome expected to enable many specific use case/business needs: Providers and organizations accountable for managing the health of populations can efficiently access large volumes of information on a specified group of individuals without having to access one record at a time. This population-level access would enable these stakeholders to: assess the value of the care provided, conduct population analyses, identify at-risk populations, and track progress on quality improvement.
  • Technical Expectations: There would be a standardized method built into the FHIR standard to support access to and transfer of a large amount of data on a specified group of patients, and this method could be reused for any number of specific business purposes.
  • Policy Expectations: All existing legal requirements for accessing identifiable patient information via other bulk methods (e.g., ETL) used today would continue to apply (e.g., through HIPAA BAAs/contracts, Data Use Agreements, etc).

Authorizing Access

Access to the data is granted by using the SMART backend services spec.

Note: We discussed this at length, but we didn’t see a need for Group/* or Launch/* kinds of scopes; System/*.read will do fine (or User/*.* for interactive processes, though interactive processes are out of scope for this work). This means that a user cannot restrict authorization down to just a group, but in this context, users will trust their agents.

Accessing Data

The application can do either of the following queries:

 GET [base]/Patient/$everything?start=[date-time]&_type=[type,type]
 GET [base]/Group/[id]/$everything?start=[date-time]&_type=[type,type]


  • The first query returns all data on all patients that the client’s account has access to, since the starting date time provided.
  • The second query provides access to all data on all patients in the nominated group. The point of this is that applications can request data on a subset of all their patients without needing a new access account provisioned (exactly how the Group resource is created/identified/defined/managed is out of scope for now; the question of whether we need to sort this out has been referred to ONC for consideration).
  • The start date/time restricts the response to records changed since the nominated time. In the absence of the parameter, all data ever recorded is returned.
  • The _type parameter is used to specify which resource types are part of the focal query, i.e. what kinds of resources are returned in the main set. (The _type parameter has no impact on which related resources are included, e.g. practitioner details for clinical resources.) In the absence of this parameter, all types are included.
  • The data that is available for return includes at least the CCDS (we referred the question of exactly what the data should cover back to the ONC)
  • The FHIR specification will be modified to allow Patient/$everything to cross patients, and to add $everything to Group
  • Group will be added as a compartment type in the base Specification
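
As a rough illustration of the two queries above, here is a minimal Python sketch that assembles the request URLs. The function name build_everything_url and the example base URL are made up for this sketch and are not part of the proposal; the parameter names start and _type are taken from the queries as written above.

```python
from urllib.parse import urlencode


def build_everything_url(base, group_id=None, start=None, types=None):
    """Build a Patient-level or Group-level $everything URL.

    Hypothetical helper: only the URL shape comes from the proposal.
    """
    root = base.rstrip("/")
    if group_id:
        path = f"{root}/Group/{group_id}/$everything"
    else:
        path = f"{root}/Patient/$everything"

    params = {}
    if start is not None:
        params["start"] = start            # only records since this time
    if types:
        params["_type"] = ",".join(types)  # comma-separated resource types

    if not params:
        return path
    # Keep ',' and ':' readable; everything else is percent-encoded as usual.
    return path + "?" + urlencode(params, safe=",:")


print(build_everything_url(
    "https://example.org/fhir",
    start="2017-09-01T00:00:00Z",
    types=["Observation", "Condition"],
))
```

Leaving out both start and _type gives the "all data ever, all types" form of the query.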

Asynchronous Query

Generally, this is expected to result in quite a lot of data. The client is expected to request this asynchronously, per RFC 7240. To do this, the client uses the Prefer header:

Prefer: respond-async

When the server sees this request header, instead of generating the response and then returning it, the server returns a 202 Accepted status and a Content-Location header giving the URL at which the client can access the response.

The client then queries this content location using GET [content-location] (used as is, with no prefixing). The response can be one of 3 outcomes:

  • a 202 Accepted that indicates that processing is still happening. This may have an “X-Progress” header that provides some indication of progress to the user (displayed as-is to the user; no format restrictions, but it should be under 100 characters in length). The client repeats this request periodically until it gets either a 200 or a 5xx.
  • a 5xx Error that indicates that preparing the response has failed. The body is an OperationOutcome describing the error.
  • a 200 OK with the response for the original request. This response has one or more Link: headers (see RFC 5988) that list the files that are available for download as a result of servicing the request. The response can also carry an X-Available-Until header to indicate when the response will no longer be available.


  • This asynchronous protocol will be added as a general feature to the FHIR spec for all calls. It will be up to server discretion when to support it.
  • The client can cancel a task or advise the server it’s ok to delete the outcome using DELETE [content-location]
  • Other than the 5xx response, these responses have no body, except when the accept content type is ‘text/html’, in which case the responses should have an HTML representation of the information carried in the headers (e.g. a redirect, an error, or a list of files to download). It’s up to server discretion to decide whether to support text/html; typically, the reference/test servers do, and the production servers don’t.
  • Link headers can have one or more links in them, per RFC 5988
  • Todo: decide whether to add ‘supports asynchronous’ flag to the CapabilityStatement resource
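
The request/poll cycle described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the send callable, its (status, headers, body) return shape, and the poll_interval and max_polls settings are inventions of this sketch so that it runs without a real HTTP library. The header names and status codes are the ones described above. Link headers are read with a simple regular expression that only extracts the target URLs and ignores link parameters.

```python
import re
import time

# Matches each <url> target in an RFC 5988 Link header value.
LINK_TARGET = re.compile(r"<([^>]*)>")


def fetch_async(send, url, poll_interval=5.0, max_polls=100):
    """Kick off an asynchronous request and poll until the files are ready.

    `send(method, url, headers)` is a hypothetical transport function that
    returns a (status_code, headers_dict, body_text) tuple.
    Returns the list of file URLs from the final Link header.
    """
    status, headers, body = send("GET", url, {"Prefer": "respond-async"})
    if status != 202:
        raise RuntimeError(f"expected 202 Accepted, got {status}")
    status_url = headers["Content-Location"]  # used as is, no prefixing

    for _ in range(max_polls):
        status, headers, body = send("GET", status_url, {})
        if status == 200:
            # Done: the Link header lists the files available for download.
            return LINK_TARGET.findall(headers.get("Link", ""))
        if status >= 500:
            # Failed: the body is an OperationOutcome describing the error.
            raise RuntimeError(f"request failed ({status}): {body}")
        if status != 202:
            raise RuntimeError(f"unexpected status {status}")
        # Still processing; X-Progress (if present) is free text for the user.
        time.sleep(poll_interval)

    raise TimeoutError("gave up waiting for the response to be prepared")
```

A client that no longer needs the result would follow up with DELETE [content-location], which this sketch leaves out.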

Format of returned data

If the client uses an Accept type of application/fhir+json or application/fhir+xml, the response will be a bundle in the specified format. Alternatively, the client can use the type application/fhir+ndjson. In this case:

  • The response is a set of files in ndjson format (see http://ndjson.org/).
  • Each file contains only resources of a single type.
  • There can be more than one file for each resource type.
  • Bundles are broken up at Bundle.entry.resource, i.e. a bundle is split on entries so that the bundle json file will contain the bundle without the entry resources, and the resources are found (by id) in the type-specific resource files (todo: how does that work for history?)

The nd-json files are split up by resource type to facilitate processing by generic software that reads nd-json into storage services such as Hadoop.


  • the content type application/fhir+ndjson will be documented in the base spec
  • We may need to do some registration work to make +ndjson legal
  • We spent some time discussing formats such as Apache Avro and Parquet – these have considerable performance benefits over nd-json but are much harder to produce and consume. Clients and servers are welcome to do content type negotiation to support Parquet/ORC/etc, but for now, only nd-json is required. We’ll monitor implementation experience to see how it goes
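
To give a feel for how little code a consumer needs for nd-json, here is a minimal Python sketch that reads one downloaded file and checks that it holds a single resource type, as described above. The function names and the sample data are made up for this sketch. It assumes the file content has already been downloaded into a string.

```python
import json


def read_ndjson(text):
    """Parse nd-json text: one JSON resource per line, blank lines skipped."""
    resources = []
    # Split on "\n" only, so unusual line separators inside JSON strings
    # are not mistaken for record boundaries.
    for line in text.split("\n"):
        if line.strip():
            resources.append(json.loads(line))
    return resources


def single_resource_type(resources):
    """Return the one resourceType in the file, or raise if types are mixed."""
    types = {resource.get("resourceType") for resource in resources}
    if len(types) > 1:
        raise ValueError(f"file mixes resource types: {sorted(map(str, types))}")
    return types.pop() if types else None


sample = (
    '{"resourceType": "Patient", "id": "1"}\n'
    '{"resourceType": "Patient", "id": "2"}\n'
)
patients = read_ndjson(sample)
print(single_resource_type(patients), len(patients))
```

Because each file holds one type, a loader can route a whole file to one table or store without inspecting every record first.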

Follow up Requests

Having made the initial request, applications should store and retain the data, and then only retrieve subsequent changes. This is done by providing a _start time on the request.


  • Todo: Is _start the right parameter (probably need _lastUpdated, or a new one)?
  • Todo: where does the marker time (to go into the start/date of the next follow up request) go?
  • clients should be prepared to receive resources that change on the boundary more than once (still todo)


The ONC request included “push of data”. It became clear, when discussing this, that server-side push is hard for servers. Given that applications can perform these queries regularly/as needed, we didn’t believe that push (e.g. based on subscriptions) was needed, and we have not described how to orchestrate a push based on these semantics at this time.


It’s time to test out this proposal and see how it actually works. With that in mind, we’ll work to prototype this API on the reference servers, and then we’ll hold a connectathon on this API at the New Orleans HL7 meeting in January. Given this is an ONC request, we’ll be working with ONC to find participants in the connectathon, but we’ll be asking the Argonaut vendors to have a prototype available for this connectathon.

Also, I’ll be creating FHIR change proposals for community review for all the changes anticipated in this write up. I guess we’ll also be talking about an updated Argonaut implementation guide at some stage.



  1. Grahame Grieve says:

    gForge tasks 13919 – 13923

  2. Alexander Henket says:

    Thank you for this very clearly outlined piece. It’s good to see a proposed, useful solution for something that was lingering in our realms too.

    I do worry sometimes about all the ‘server discretions’. As a specifier/architect we increasingly need to worry about infrastructural interoperability and ask ourselves which of those server discretions need to be supported at a minimum to guarantee basic interoperability while keeping a window open for innovation.

  3. Grahame Grieve says:

    I expect that will tighten up, and that it will get profiled for realms and projects as well. In addition, it’s potentially onerous. I ended up implementing it down inside the REST stack, so I can do it for any call at all – but it doesn’t make sense for most simple things, so I’ve made a rule that it can only be used for search, history, and operations

  4. Niquola says:

    I’m going to present some related ideas in Amsterdam

  5. Hi Grahame, this is a great first step for multi-record access to FHIR resources, thank you for putting it together. I was curious if you considered either oData or GraphQL as part of your research for the proposal (I didn’t see either mentioned but I’m sure you already know of their existence).

    GraphQL especially is a great opportunity to use an industry-consensus batch querying language in place of something we might create special just for FHIR. Given that some folks in the FHIR community have already built some GraphQL to FHIR bridges I don’t think it would take too long to work that in. I look forward to hearing what you think.

    • Grahame Grieve says:

      We have already added a page around graphQL – see http://build.fhir.org/graphql.html – though I think we’ll have to take that out for IP reasons (we’re waiting to see if Facebook cares to resolve them or not). But adding graphql to the bulk data picture is potentially troublesome – while it has undeniable usefulness, it also presents some real challenges around scalability for the engineering provision side – and that was a subject that caused passionate discussions when we were scoping this out. OTOH, we didn’t consider oData at all. I’ve looked at it before (and we’ve adopted ideas from it), but its resolute squareness has always been a challenge for us

      • Shahid Shah says:

        Thanks, Grahame — Facebook has resolved the IP issues around GraphQL. Could you elaborate what you mean by GraphQL “also presents some real challenges around scalability for the engineering provision side”. If there’s a conversation where that took place I’d love to take look or if you have a summary of the issue that’s good too. For almost all our digital health work we’re moving to a GraphQL-first, REST/FHIR APIs-second, approach and I’d love to hear where some pitfalls may arise.

        • Grahame Grieve says:

          I saw the graphQL IP issues update, and need to get around to updating the text in the graphQL page in regard to that

          With regard to using graphQL, I don’t think there’s any written discussion, though we could have one on chat.fhir.org. The issues that arise for graphQL in this context are two-fold:

          * the code to produce resources is canned and reproducible. GraphQL requires an interpreter which hasn’t been done by the existing EMR vendors, and so it’s extra work on top of producing resources.

          * there’s a certain amount of work to do to process the outcomes into a bulk data store. The vendors had a strong preference for seeing the ETL step done at this point, rather than before – and graphQL is effectively an ETL. More generally, having a very regular output from the bulk data query means that you get more re-use at your ETL tool level. Of course, you could argue that the converse is true: applying graphQL during the bulk data extract could make the ETL step simpler… that’s a value decision.

          Perhaps the best way to resolve this is for me to package up the java graphQL processor I wrote so it can easily be run against the bulk data output as a step in an ETL pipeline? then you get fully powered graphQL without having to ask each vendor to do it

          • Shahid Shah says:

            Thanks, Grahame — I love the idea of a general purpose GraphQL middleware that can hit existing FHIR REST endpoints. I’m looking forward to seeing that when you can share it.

  6. John Moehrke says:

    Really need to understand the Problem you are trying to solve. No question this is a Solution, but it is hard to evaluate fitness of this solution without the Problem defined.

    Also, See my proposal on #FHIR Bulk De-Identification https://healthcaresecprivacy.blogspot.com/2017/09/fhir-and-bulk-de-identification.html

    • Grahame Grieve says:

      I added some notes from ONC about the problem that is being solved. But more generally this is about getting lots of data at once from the API, and there’s lots of use cases

  7. Brett Esler says:

    Thanks Grahame – have been prototyping this. Checking my assumptions – I assumed that the url provided in Content-Location header; and the final content (via Link header references) are secured by SMART-on-FHIR token. Is that right?
    Expect a Bundle response on the $everything operation – can I assume that this is optional for the downloadable files?

    • Grahame Grieve says:

      well, if the API is secured by Smart-on-fhir, then yes, all the calls need to be secure. And in general, it would be secure. I’m not sure what you mean about the bundle – yes, the processor always needs the bundle if it wants to know what’s a direct response, and what’s included etc.

  8. JE says:

    Interesting to see this come up. I have had to develop special tooling to handle bulk FHIR Patient $everything requests into columnar data formats suitable for analytic workloads. Have you all started to examine the Apache Arrow project? They are also working on a reader/writer for JSON->Arrow->Parquet for arbitrary JSON if you can give it a few hints. Also, are there any AVRO schemas of the fhir standard floating around?
