Version 2 and character sets and encoding

I’ve been rewriting my v2 parser and trying to make it fully conformant to the v2 specification with regard to character sets. It’s a tough problem, and several parts of it make it tough.

Finding the character set/encoding

The character set is embedded many bytes into the message content at MSH-18. So you need to read the first 40-100 bytes or so into characters before you know how to turn them into characters… sounds like fun. Actually, it’s pretty much a manageable problem, because there’s no need to use characters with value >127 before MSH-18 (note that there’s no need, but it’s possible to use them). Given that the message starts with ‘MSH’, you can tell by inspecting the first 6 bytes whether you have single or double byte encoding, and if it’s double byte encoding, what the endianness is. You can also tell that from a byte order mark (BOM) if there is one. Given this, and provided the sender didn’t send any characters >127 before MSH-18 while using UTF-8, you can reliably find and read MSH-18. Once I’ve read that, I reset the parser and start again with the specified encoding.

Of course, it’s always possible that the character set as specified by the BOM or made clear by inspecting the first 6 bytes differs from what is implied by the value of MSH-18… I ignore MSH-18 if it doesn’t match.
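As a rough illustration of the sniffing step described above, here is a minimal Python sketch. It is not the parser discussed in this post: the function names `sniff_layout` and `read_msh18` are made up for the example, it only handles the BOM and the single/double byte cases mentioned above, and it assumes segments end with a carriage return.

```python
import codecs

# BOMs are checked longest-first so a UTF-32 LE BOM is not mistaken
# for the UTF-16 LE BOM it starts with.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_layout(data: bytes):
    """Guess (codec_name, bom_length) from a BOM or from how 'MSH' is laid out."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, len(bom)
    if data[:6] == b"M\x00S\x00H\x00":
        return "utf-16-le", 0
    if data[:6] == b"\x00M\x00S\x00H":
        return "utf-16-be", 0
    if data[:3] == b"MSH":
        # Single byte per character so far; the real character set is
        # still unknown, so treat the header as ASCII for now.
        return "ascii", 0
    raise ValueError("message does not start with MSH")


def read_msh18(data: bytes):
    """Return (sniffed_codec, raw MSH-18 value) for a v2 message."""
    codec, skip = sniff_layout(data)
    # Bytes that do not decode are replaced; that is harmless as long as
    # the sender kept to characters <= 127 before MSH-18.
    text = data[skip:].decode(codec, errors="replace")
    msh = text.split("\r", 1)[0]
    field_sep = msh[3]        # MSH-1 is the field separator itself
    repetition_sep = msh[5]   # second character of MSH-2
    fields = msh.split(field_sep)
    # Because MSH-1 is the separator, MSH-n sits at fields[n - 1].
    value = fields[17] if len(fields) > 17 else ""
    # MSH-18 may repeat; only the first repetition is returned here.
    return codec, value.split(repetition_sep)[0]
```

The caller can then compare the sniffed layout with what MSH-18 claims, and restart decoding from the first byte once it has settled on an encoding.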

Note that v2 doesn’t say anything about the BOM – I think it should in a future version.

Understanding the character set/encoding

The second part of the problem is that MSH-18 is sometimes character set and sometimes character encoding (see here for discussion) – the values are an unholy mix of the two. In addition, the list of values matches precisely the list of values in DICOM, and as far as I can tell, no other list at all. Here’s a list of the possible values for MSH-18 (v2.6):

  • ASCII – The printable 7-bit ASCII character set.
  • 8859/1 – The printable characters from the ISO 8859/1 Character set
  • 8859/2 – The printable characters from the ISO 8859/2 Character set
  • 8859/3 – The printable characters from the ISO 8859/3 Character set
  • 8859/4 – The printable characters from the ISO 8859/4 Character set
  • 8859/5 – The printable characters from the ISO 8859/5 Character set
  • 8859/6 – The printable characters from the ISO 8859/6 Character set
  • 8859/7 – The printable characters from the ISO 8859/7 Character set
  • 8859/8 – The printable characters from the ISO 8859/8 Character set
  • 8859/9 – The printable characters from the ISO 8859/9 Character set
  • 8859/15 – The printable characters from the ISO 8859/15 (Latin-15)
  • ISO IR14 – Code for Information Exchange (one byte)(JIS X 0201-1976).
  • ISO IR87 – Code for the Japanese Graphic Character set for information interchange (JIS X 0208-1990)
  • ISO IR159 – Code of the supplementary Japanese Graphic Character set for information interchange (JIS X 0212-1990).
  • GB 18030-2000 – Code for Chinese Character Set (GB 18030-2000)
  • KS X 1001 – Code for Korean Character Set (KS X 1001)
  • CNS 11643-1992 – Code for Taiwanese Character Set (CNS 11643-1992)
  • BIG-5 – Code for Taiwanese Character Set (BIG-5)
  • UNICODE – The world wide character standard from ISO/IEC 10646-1-1993
  • UNICODE UTF-8 – UCS Transformation Format, 8-bit format
  • UNICODE UTF-16 – UCS Transformation Format, 16-bit format
  • UNICODE UTF-32 – UCS Transformation Format, 32-bit format

That’s a fun list. The default is ASCII, btw. Now I’m not going to write my own general character encoding engine – who is? I’m going to use the inbuilt functions in Windows to convert everything to Unicode. That means I have to map these values to Windows code pages to pass to the Windows conversion routines. But mapping between these values and the Windows code page values is a problem in itself. Here’s my mapping list.

  • ASCII = 20127 or 437
  • 8859/1 = 28591 : ISO 8859 : Latin Alphabet 1
  • 8859/2 = 28592 : ISO 8859 : Latin Alphabet 2
  • 8859/3 = 28593 : ISO 8859 : Latin Alphabet 3
  • 8859/4 = 28594 : ISO 8859 : Latin Alphabet 4
  • 8859/5 = 28595 : ISO 8859 : Cyrillic
  • 8859/6 = 28596 : ISO 8859 : Arabic
  • 8859/7 = 28597 : ISO 8859 : Greek
  • 8859/8 = 28598 : ISO 8859 : Hebrew
  • 8859/9 = 28599 : ISO 8859-9 Turkish
  • 8859/15 = 28605 : ISO 8859-15 Latin 9
  • ISO IR14 = ??
  • ISO IR87 = ??
  • ISO IR159 = ??
  • GB 18030-2000 = 54936 : GB18030 Simplified Chinese (4 byte); Chinese Simplified (GB18030)
  • KS X 1001 = ??
  • CNS 11643-1992 = ??
  • BIG-5 = 950 : ANSI/OEM Traditional Chinese (Taiwan; Hong Kong SAR, PRC); Chinese Traditional (Big5)

As you can see, it’s incomplete. I just don’t know enough to map between the HL7/DICOM codes and the Windows code pages. Searching on the internet didn’t quickly resolve them either. All the links I found pointed to either the HL7 or DICOM standards, or copies thereof.

If you know what the mappings are, please let me know, and I’ll update the list.
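For readers who want to experiment without the Windows routines, the known part of the mapping can be written as a small lookup table. This is only a sketch: the Python codec names in the second column are an assumption of the example rather than anything from HL7 or DICOM, 28597 is used as the Windows code page for ISO 8859-7, and the values marked ?? above are deliberately left unmapped.

```python
# MSH-18 value -> (Windows code page, Python codec name).
# The codec names are this sketch's assumption about the closest
# Python equivalent of each Windows code page.
MSH18_MAP = {
    "ASCII": (20127, "ascii"),
    "8859/1": (28591, "iso8859_1"),
    "8859/2": (28592, "iso8859_2"),
    "8859/3": (28593, "iso8859_3"),
    "8859/4": (28594, "iso8859_4"),
    "8859/5": (28595, "iso8859_5"),
    "8859/6": (28596, "iso8859_6"),
    "8859/7": (28597, "iso8859_7"),
    "8859/8": (28598, "iso8859_8"),
    "8859/9": (28599, "iso8859_9"),
    "8859/15": (28605, "iso8859_15"),
    "GB 18030-2000": (54936, "gb18030"),
    "BIG-5": (950, "cp950"),
}

# Values with no mapping in the list above; kept separate so they
# produce a distinct error instead of a silent guess.
UNMAPPED = {"ISO IR14", "ISO IR87", "ISO IR159", "KS X 1001", "CNS 11643-1992"}


def codec_for(msh18: str) -> str:
    """Return the Python codec name for an MSH-18 value (empty means ASCII)."""
    value = msh18.strip() or "ASCII"  # the default character set is ASCII
    if value in UNMAPPED:
        raise LookupError(f"no code page known for MSH-18 value {value!r}")
    try:
        return MSH18_MAP[value][1]
    except KeyError:
        raise LookupError(f"unrecognised MSH-18 value {value!r}") from None
```

Raising on the unmapped values rather than guessing keeps the gaps visible until someone supplies the missing code pages.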

The character set can change

If that’s not enough, the character set is allowed to change mid-message. There are a couple of escape sequences (\C..\ and \M….\) that allow the sender to switch character sets mid-stream. This makes for a slow parser because of the way Windows does character conversion: you can’t ask for x characters to be read off the stream, only for x bytes to be converted into characters. (How do you tell how many bytes were actually consumed? You could convert the characters back to bytes, but I suspect that this isn’t deterministic, and that there are some valid Unicode sequences that some Windows applications will fail to read, though I don’t know how to test that.) So you have to keep reading a byte or two at a time until you get a character back, because you can’t let a decoder read ahead on the stream: you might have to switch decoders.
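The byte-at-a-time approach can be sketched with Python’s incremental decoders instead of the Windows routines. The class name `SwitchableReader` and its `switch` method are hypothetical, and how a \C..\ or \M….\ sequence maps to a codec name is left out entirely; the sketch only shows how to read one character without consuming any bytes beyond it, so that the decoder can be swapped before the next character.

```python
import codecs
import io


class SwitchableReader:
    """Read characters one at a time from a byte stream, never reading ahead."""

    def __init__(self, stream, codec):
        self._stream = stream
        self.switch(codec)

    def switch(self, codec):
        # A fresh decoder drops any partial state; only call this on a
        # character boundary (for example right after an escape sequence).
        self._decoder = codecs.getincrementaldecoder(codec)()

    def read_char(self):
        """Return the next character, or '' at a clean end of stream."""
        while True:
            byte = self._stream.read(1)
            if not byte:
                # final=True makes a truncated character raise an error
                # instead of being silently dropped.
                return self._decoder.decode(b"", final=True)
            char = self._decoder.decode(byte)
            if char:
                return char


if __name__ == "__main__":
    # The same character encoded twice: once as ISO 8859-1, once as UTF-8.
    reader = SwitchableReader(io.BytesIO(b"\xe9\xc3\xa9"), "iso8859_1")
    first = reader.read_char()
    reader.switch("utf-8")  # stands in for handling an escape sequence
    second = reader.read_char()
    print(first == second)
```

Had the first decoder been allowed to read ahead, it would have swallowed the UTF-8 bytes as two ISO 8859-1 characters, which is exactly the problem described above.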

Having said that, I’ve never seen these escape sequences used in the wild, and it seems like a sensationally dumb idea to me (however, I’ll make a post about Unicode and the Japanese in the future).

If I have any Japanese readers, how does character encoding in v2 actually work in Japan?

Mostly, implementers get this wrong

This stuff is sufficiently poorly understood that most implementers assume they’re working in ANSI, use characters from their local code page, put them in and claim they’re using something else. The Windows character conversion routines fail in some of these cases. I don’t know what to do about that.

There. That’s enough. We really really need to retire v2. Its time has passed.

4 Comments

  1. Rene Spronk says:

    Hrm .. I have yet to see a production v2 interface that uses the mid-message character switching feature. In my experience most US vendors use UTF-8 as the default encoding (or plain ASCII on old mainframes), and most European vendors Latin-1 (or Latin-15, the ‘patched’ version of Latin-1). It’s almost always up to trading partners to agree upon the character encoding; MSH-18 isn’t a very useful (nor used) indicator of the true encoding.

    However, should one wish to fully comply with the standard, one is indeed faced with the many challenges as put forward in this post. Japan uses Shift-JIS as far as I know, which is not (as the Japanese claim) fully mappable to Unicode.

  2. Peter Jordan says:

    MSH-18 is optional and I’ve rarely seen it completed in an Australian or NZ message. HL7 will need to sell v3 messaging to vendors before v2 can be replaced; in the interim, it’s probably more feasible to send PIT to an old folks home.

  3. Grahame Grieve says:

    #Peter

    I agree that I’ve rarely seen it completed – which means that the message is pure ASCII. However it’s common to see characters between 127-255 used for tables or medically useful characters like °. Which is technically illegal, and often broken in practice.

    The new MSIA profile says:

    A character set must be specified in MSH-18. It must be either 8859/1 (extended ascii) or UTF-8. Optional support for UTF-8 should only be assumed for receivers where this is noted in their capability register. Support for 8859/1 is mandatory. However, only ASCII characters shall be used in the MSH segment. The HL7 escape sequences \M and \C shall not be used.

    So I think we’re going to start seeing messaging sources sort this out.

  4. Nyerguds says:

    On the subject of the “ISO IR14” / “ISO IR87” / “ISO IR159” encodings… the last two of those, as the specs say, correspond to JIS X 0208 and JIS X 0212.

    These two are the same code page in Windows, namely 20932, and if the information on the wikipedia page of JIS 201 is correct, the JIS X 0208 standard actually includes JIS X 0201.

    So, in other words, all three encodings could be seen as being code page 20932. That’s how I implemented it, anyway.
