In the FHIR specification we say that the basic character set for resources is Unicode:
The XML character set is always Unicode.
Actually, that’s not the right wording – what it should have said is “The character set of a resource is always Unicode”.
Now if the character set is Unicode, then any character encoding that is fully mapped to Unicode is valid. However, elsewhere in the specification, it says:
FHIR uses UTF-8 for all request and response bodies
This attracted several comments, all along the same lines – why require UTF-8? Well, the logic is fairly simple:
- content type negotiation doesn’t work very well for character sets
- while it might be legal to represent a resource in any character encoding mapped to Unicode, what would you do if someone asked you to represent a resource in a character set that doesn’t have a mapping for one or more of the Unicode characters in it?
- Even though it’s possible to convert resources between character sets, what happens to digital signatures?
- What’s going to happen if systems with different encodings, or with different supported subsets try to interoperate?
- As for Unicode encodings, why support more than one? UTF-8 is widely supported, and is required by several HL7 Asian affiliates for v2
- It’s just simpler to say, everyone use UTF-8.
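The digital signature point in the list above can be illustrated with a minimal sketch. It assumes a plain SHA-256 digest as a stand-in for a real signature, and the sample name is invented; the only claim is that the same text yields different bytes, and therefore a different digest, once it is re-encoded.

```python
import hashlib

# Hypothetical sample content; any text with non-ASCII characters will do.
text = "Zoë Müller"

# The same characters, serialized in two different encodings.
utf8_bytes = text.encode("utf-8")
latin1_bytes = text.encode("iso-8859-1")


def digest(data: bytes) -> str:
    """SHA-256 over the raw bytes, standing in for a signature."""
    return hashlib.sha256(data).hexdigest()


# The decoded text is identical, but the bytes (and so the digests) differ.
same_text = utf8_bytes.decode("utf-8") == latin1_bytes.decode("iso-8859-1")
same_digest = digest(utf8_bytes) == digest(latin1_bytes)

print(same_text, same_digest)
```

A signature computed over the UTF-8 bytes would not verify against the ISO-8859-1 bytes unless both sides agree on a canonical encoding first.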
One problem with requiring UTF-8 is that the HTTP default is ISO-8859-1. This means that you have to specify UTF-8 as the character set on all the HTTP requests and responses. But since it’s a parameter of the content type, and you have to specify the content type anyway, I didn’t see that as particularly painful – but it did get comment in the connectathons, because you do have to remember to do it.
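As a rough sketch of what "a parameter of the content type" means in practice, the snippet below reads the charset from a Content-Type header value and falls back to ISO-8859-1 when none is given. The helper name `charset_of` and the media type string are illustrative assumptions, not taken from the specification.

```python
from email.message import Message


def charset_of(content_type: str, default: str = "iso-8859-1") -> str:
    """Return the charset parameter of a Content-Type value, lowercased.

    Falls back to `default` when the header carries no charset parameter.
    """
    msg = Message()
    msg["Content-Type"] = content_type
    return (msg.get_content_charset() or default).lower()


# With the parameter present, the receiver knows the body is UTF-8.
explicit = charset_of("application/fhir+xml; charset=UTF-8")

# Without it, the receiver is left with the default.
implicit = charset_of("application/fhir+xml")

print(explicit, implicit)
```

Forgetting the parameter is exactly the case that leaves the receiver guessing, which is why it came up at the connectathons.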
Unicode subsets
However, if you don’t support Unicode natively – which is still true of a large number of systems – then the fact that resources are always in UTF-8 presents you with a problem: you have to do something about the Unicode issue, even if you are positive that all your trading partners are using pure ASCII. There are still many systems that don’t support Unicode. The reason is that even though the platforms support Unicode relatively well, to support it in your application the entire eco-system (database, UIs, printers, messaging formats, etc.) has to support Unicode, and for many vendors sorting this out simply isn’t financially feasible.
What I see in practice is systems that can’t interoperate safely because they thought they were using pure ASCII, but they weren’t. (In fact, it’s not that unusual to see systems that don’t fully operate, let alone interoperate.) So I’d always prefer Unicode as the wire format – it makes everyone deal with the issue.
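A quick way to test the "we only use pure ASCII" assumption is to scan outgoing bytes for anything above 0x7F. This is only a sketch; the function name and the sample payload are invented for illustration.

```python
def non_ascii_positions(data: bytes) -> list:
    """Return the offsets of all bytes that fall outside 7-bit ASCII."""
    return [i for i, b in enumerate(data) if b > 0x7F]


# Hypothetical payload from a system that believes it emits pure ASCII,
# but stores one name in ISO-8859-1.
payload = "Dr. Muñoz".encode("iso-8859-1")

offending = non_ascii_positions(payload)
print(offending)
```

An empty list means the payload really is ASCII, and is then also valid UTF-8 byte for byte. Any other result means the system has an encoding question to answer.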
So, we have several comments – why require UTF-8? Why not allow at least ISO-8859-1? Or why not allow any round-trip encoding? What if we require all interfaces to “support” UTF-8 in addition to anything else that they also do? Or maybe we require all servers to support UTF-8 at least?
We’ve discussed this in committee several times, and we’re just not sure what to do here. Seen as an entire eco-system – and I do think FHIR interfaces will be highly interconnected – a simple blanket rule of always UTF-8 is obviously much simpler overall. But it imposes an entry cost on many systems – especially the existing data stores, which are generally older systems – and maybe this isn’t a very good idea?
HHS HIT Standards Committee & Character Set
The situation is somewhat complicated by this (private communication that made its way to me):
The HHS HIT Standards Committee was asked how EHR language display should be certified using standards and the recommendation was ISO 8859-15 aka “Latin 9” which has character support for all the required ISO 639 languages including direct support for the Eastern European languages and transliteration to Latin characters for e.g. cyrillic and mandarin. This EHR certification requirement is anticipated to raise issues for HL7 standards and HL7 implementers particularly for systems with interfaces to certified EHRS.
I’ve got to say, I don’t really understand this. If you’re going to recommend something, why not Unicode? The point is, US EHR vendors (which includes all the multinationals) are going to be forced to change towards whatever this committee recommends. But now, instead of migrating to Unicode, which is at least a sensible long term option, they’re going to be spending their money changing from ISO-8859-1 (the default for all the systems I’ve ever looked at personally) to ISO-8859-15. I can only see that as a sideways move, and not a good investment on behalf of the end users. And how will that play in other countries, where ISO-8859-15 is not on the list of supported character sets in national standards?
In terms of FHIR and Unicode, I’m not exactly sure what the impact of this is. ISO-8859-15 is fully mapped to Unicode, so it probably doesn’t really change the basic question: Unicode, or something else that makes subset support explicit? But EHR vendors are going to be important adopters of FHIR, so I think this weighs on the decision.
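To make the "sideways move" concrete, this sketch lists the byte values that ISO-8859-1 and ISO-8859-15 decode differently, using Python's built-in codecs. The variable names are illustrative only.

```python
# Compare how each byte in the upper half decodes under the two charsets.
diff = {}
for value in range(0xA0, 0x100):
    raw = bytes([value])
    latin1_char = raw.decode("iso-8859-1")
    latin9_char = raw.decode("iso-8859-15")
    if latin1_char != latin9_char:
        diff[value] = (latin1_char, latin9_char)

for value, (old, new) in sorted(diff.items()):
    print(f"0x{value:02X}: {old!r} -> {new!r}")
```

Only a handful of positions change (the euro sign replacing the generic currency sign at 0xA4 is the best known), so a byte stream labelled with the wrong one of the two will mostly look fine and occasionally be silently wrong.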

Thanks for bringing this up, especially that latter bit of idiocy which I hadn’t yet heard.
Interoperability is hard. It always requires that the interoperating systems step up the bar a little bit beyond what they’ve done within their closed environments. They have to deal with code systems they’re not used to, possibly doing translations. They have to deal with data element granularities they’re not used to, also doing translations.
I would look at a requirement to use UTF-8 as something along the same lines. It’s a base requirement for interoperability. In most cases, most systems will be able to map most content. They’ll have to figure out how to deal with the content they can’t map (specifically, when receiving UTF-8 and trying to map into their persistent store).
I don’t think this is an insurmountable barrier. In the worst case, trading partners can choose to be non-conformant. But I think it’s reasonable to set UTF-8 as an expectation.
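The case of receiving UTF-8 and mapping it into a more limited persistent store can be sketched as below. The helper name, the sample name and the choice of target codecs are assumptions for illustration; what a system does with the characters it finds is a local decision.

```python
def unmappable(text: str, codec: str) -> list:
    """Return the characters in `text` that `codec` cannot represent."""
    missing = []
    for ch in text:
        try:
            ch.encode(codec)
        except UnicodeEncodeError:
            missing.append(ch)
    return missing


# Hypothetical content as received on the wire in UTF-8.
received = "Dvořák, Antonín".encode("utf-8").decode("utf-8")

# A store limited to ISO-8859-1 cannot hold every character...
latin1_missing = unmappable(received, "iso-8859-1")

# ...while one using ISO-8859-2 can hold this particular name.
latin2_missing = unmappable(received, "iso-8859-2")

print(latin1_missing, latin2_missing)
```

Checking before storing lets the receiver reject, flag or transliterate explicitly rather than lose characters silently.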
As a developer, I would also prefer just one possibility. And since UTF-8 covers everything, that is the one. That way you put the problem with the parties that support a less rich character set, instead of constraining the whole ecosystem.
Partners in the same cultural regions will not often receive characters that they can’t map, since those characters will probably not be used by the sender either. If the problem does occur however (infrequently, hopefully), partners may need to find a solution ‘out of band’ (a simple phone call can sometimes solve more than a thousand standards).