HL7v2 stands for “Health Level 7: Version 2”. This was the first version of a specification for shuttling clinical data around and between medical institutions. Yep, the first version. And it’s named “version 2”. Yeah, I know. Anyway, it’s been in service for nearly thirty years now, and as with any protocol that survives that long, it’s accumulated some quirks. It looks like it’ll be supplanted by FHIR (Fast Healthcare Interoperability Resources) eventually, but it has some serious inertia, so I reckon it’ll be around for a while. While working on Medtasker with Nimblic, I wrote a library for reading this protocol and querying the messages it contains. You can find it at fknsrs.biz/p/hl7 (godoc.org). Join me as we explore how it works!
The Protocol
If you’re used to JSON and XML, the HL7 protocol is going to look like absolute gibberish. It was conceived at a time when text was king, and line-separated records were just the way things were done. It’s sometimes referred to as “pipehat” - so called because it most commonly uses pipes (|) and “hats” (^) to delimit fields. It uses a bunch of other characters too, and all of them are characters you could conceivably see in actual content. Kind of like HTML, there’s a way to represent them as entities. This actually makes parsing an HL7 message pretty simple. Simpler than it looks at first glance, anyway.
Here’s an example of an HL7 message:
MSH|^~\&|EP\S\IC|EPICADT|SMS|SMSADT|199912271408|CHARRIS|ADT^A04|1817457|D|2.5|
PID||0493575^^^2^ID 1|454721||DOE^JOHN^^^^|DOE^JOHN^^^^|19480203|M||B|254 MYSTREET AVE^^MYTOWN^OH^44123^USA||(216)123-4567|||M|NON|400003403~1129086|
NK1||ROE^MARIE^^^^|SPO||(216)123-4567||EC|||||||||||||||||||||||||||
PV1||O|168 ~219~C~PMA^^^^^^^^^||||277^ALLEN MYLASTNAME^BONNIE^^^^|||||||||| ||2688684|||||||||||||||||||||||||199912271408||||||002376853
For your sake, I’ve converted the line breaks from \r to \n and added extra breaks so that it displays a little more nicely here. Normally, each line would be separated by a carriage return instead of a line feed character. Also, I bet that’s a valid Perl program.
I won’t bore you with what all the pieces of data mean - you don’t actually have to know about much of it just to understand how the structure works.
The most important thing in the message is right near the start of the first row. This is the “message header” row, as denoted by it starting with MSH. The characters immediately following MSH actually define what the control characters are in the message. That means that you have to consume the first eight bytes of a message before you can parse the rest of it. How awesome is that? It’s basically the opposite of a context-free grammar.
The first byte after MSH defines the “field separator”. In this case, it’s a pipe character. It’s almost always a pipe character, but it totally doesn’t have to be. You could just as easily set it to x or # or ¿, or even \n, and it’d be perfectly valid. Probably anything except \r, as that’s used to separate the lines - called “segments” in HL7 speak. I fully expect some system somewhere in the wild to have done something like this.
The next four bytes define the “encoding characters” - named “component separator” (^), “field repeat separator” (~), “escape character” (\), and “sub-component separator” (&). These characters serve as delimiters for the different parts of the message. Just as with the field separator, these characters can be literally any byte except \r.
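To make the header layout concrete, here is a minimal sketch of pulling the five control characters out of the first eight bytes. The `readDelimiters` function and the field names on `Delimiters` are made up for this illustration and are not taken from the library.

```go
package main

import (
	"errors"
	"fmt"
)

// Delimiters holds the five control characters declared at the start of an
// MSH segment. The field names are assumptions made for this sketch.
type Delimiters struct {
	Field, Component, Repetition, Escape, Subcomponent byte
}

// readDelimiters reads the control characters from the first eight bytes of
// a message: "MSH", then the field separator, then the four encoding
// characters in their fixed order.
func readDelimiters(msg []byte) (Delimiters, error) {
	if len(msg) < 8 {
		return Delimiters{}, errors.New("message too short to contain a header")
	}
	if string(msg[:3]) != "MSH" {
		return Delimiters{}, errors.New("message does not begin with MSH")
	}
	return Delimiters{
		Field:        msg[3],
		Component:    msg[4],
		Repetition:   msg[5],
		Escape:       msg[6],
		Subcomponent: msg[7],
	}, nil
}

func main() {
	d, err := readDelimiters([]byte(`MSH|^~\&|EPIC|EPICADT`))
	if err != nil {
		panic(err)
	}
	// Prints: | ^ ~ \ &
	fmt.Printf("%c %c %c %c %c\n", d.Field, d.Component, d.Repetition, d.Escape, d.Subcomponent)
}
```

Nothing here cares which bytes are used, so a header like `MSH#^~\&#` would simply yield `#` as the field separator.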
An HL7 message is made up of segments separated by \r, which themselves are composed of fields separated by |, each of which can have zero or more repetitions separated by ~, within which there are zero or more components separated by ^, which themselves are broken up into sub-components separated by &.
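The nesting is easier to see as code. This is a deliberately naive sketch that splits one segment with `strings.Split`, assuming the default delimiters and ignoring escape sequences and the special handling the MSH segment needs. The `splitSegment` name is made up for the illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// splitSegment breaks a single non-MSH segment into
// fields -> repetitions -> components -> sub-components.
// It assumes the default delimiters and does no unescaping.
func splitSegment(segment string) [][][][]string {
	var fields [][][][]string
	for _, f := range strings.Split(segment, "|") {
		var reps [][][]string
		for _, r := range strings.Split(f, "~") {
			var comps [][]string
			for _, c := range strings.Split(r, "^") {
				comps = append(comps, strings.Split(c, "&"))
			}
			reps = append(reps, comps)
		}
		fields = append(fields, reps)
	}
	return fields
}

func main() {
	seg := splitSegment("PID||0493575^^^2^ID 1|454721||DOE^JOHN|400003403~1129086")

	fmt.Println(seg[0][0][0][0]) // PID (the segment name sits at index 0)
	fmt.Println(seg[2][0][4][0]) // ID 1 (fifth component of the second field)
	fmt.Println(seg[5][0][1][0]) // JOHN
	fmt.Println(seg[6][1][0][0]) // 1129086 (second repetition of the last field)
}
```

Every level always has at least one entry, because splitting an empty string still yields one empty string. That keeps indexing uniform at the cost of a lot of tiny slices.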
That is the clearest way that I’ve found to explain the structure of an HL7 message. I’ve been working with HL7 for about a year now and I still get confused at least every week, so take a second to read over that a few times if you’d like.
The Query Language
That data structure up there is pretty complex. There’s a query language that seems to have evolved, and the only name I could find for it is “terser”. I think that’s in the sense that it’s a “terse” representation of a query. There’s a bit of information about it in the HAPI docs. I’ll summarise it here. Most references on the web use terser or terser-like syntax to describe fields.
A query looks like PID(2)-3(4)-5-6, and the parts in parentheses can be left out if their values would be 1. It reads like this: “sub-component number six, of component number five, of repetition four, of field three, of the second PID segment.” You can pretty much imagine the numbers as being array offsets, except shifted one position up: field access in the terser language is one-based, while most programming languages are zero-based. For example, the previous query would look like message['pid'][1][2][3][4][5]. OBX-5-4-3 would look like message['obx'][0][4][0][3][2].
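As a rough illustration of that one-based to zero-based mapping, here is a small sketch that turns a query string into a segment name plus five array offsets. It is not the library's query parser: `parseQuery` and the regular expression are made up for this example, and it assumes that any omitted part means "the first one".

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// queryPattern matches SEG(n)-field(n)-component-subcomponent, where both
// parenthesised parts and everything after the segment name are optional.
var queryPattern = regexp.MustCompile(
	`^([A-Z][A-Z0-9]{2})(?:\((\d+)\))?(?:-(\d+)(?:\((\d+)\))?(?:-(\d+)(?:-(\d+))?)?)?$`)

// parseQuery returns the segment name and zero-based offsets in the order:
// segment occurrence, field, repetition, component, sub-component.
func parseQuery(q string) (string, [5]int, error) {
	var offsets [5]int
	m := queryPattern.FindStringSubmatch(q)
	if m == nil {
		return "", offsets, fmt.Errorf("invalid query %q", q)
	}
	for i, part := range m[2:] {
		n := 1 // assumption: an omitted part refers to the first item
		if part != "" {
			var err error
			if n, err = strconv.Atoi(part); err != nil {
				return "", offsets, err
			}
		}
		if n < 1 {
			return "", offsets, fmt.Errorf("query %q uses index %d, but indexes are one-based", q, n)
		}
		offsets[i] = n - 1
	}
	return m[1], offsets, nil
}

func main() {
	for _, q := range []string{"PID(2)-3(4)-5-6", "OBX-5-4-3"} {
		name, offsets, err := parseQuery(q)
		if err != nil {
			panic(err)
		}
		fmt.Println(name, offsets)
	}
	// Prints:
	// PID [1 2 3 4 5]
	// OBX [0 4 0 3 2]
}
```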
I’ve written a parser and evaluator for this language as well, which makes it much easier to extract values from HL7 messages.
Parsing Messages
While I recently wrote my own parser for HL7 messages, there’s another very capable library that I used for many months previously - github.com/kdar/health/hl7. kdar’s library is very complete, but it’s not particularly performant. I often find myself processing several-hundred-megabyte HL7 logs, and my program was actually bottlenecked on parsing the messages. Additionally, the data types in kdar’s library made implementing the HL7 terser pretty complex. The parser I wrote has more verbose types, but they’re far more consistent, and thus easier for me to consume.
The first library I used was kdar’s library. It served me well for a while, but the performance just wasn’t what I needed. I looked at improving the performance, but the parser in that package is generated from a grammar, so there’s not really a lot of opportunity to tune things. Parsing a simple message (one segment) on my laptop takes about 38971ns, according to go test -bench. That’s still pretty quick, but if you have hundreds of thousands of messages to parse, it starts to add up.
I wrote my own parser, and parsing the same message, it clocked in at 153107ns. That’s roughly four times as long as kdar’s library takes. That’s pretty awful. To try to make it faster, I added some optimisations to my fknsrs.biz/p/supersplit library. It didn’t help much.
I rewrote my parser using a much simpler strategy, and it now parses that simple message in 8485ns. This is about a fifth of the time that kdar’s library takes to do the same thing. This cut more than 50% off the runtime of my program - I was very pleased. It’s also much less code to maintain.
The parsing strategy I’m using is pretty simple. The whole parser (minus the part that handles unescaping) is contained below. It’s annotated for easier reading.
// ParseMessage takes input as a `[]byte`, and returns the whole message, the
// control characters (as `*Delimiters`), and maybe an error.
func ParseMessage(buf []byte) (Message, *Delimiters, error) {
	// This is a sanity check, to make sure the message is long enough to
	// contain a valid header. If it's less than eight bytes long, it can't
	// possibly contain the required information.
	if len(buf) < 8 {
		return nil, nil, ErrTooShort(stackerr.Newf("message must be at least eight bytes long; instead was %d", len(buf)))
	}

	// Every valid HL7 message will begin with `MSH`. This isn't specifically
	// mandated in the specification, but by combining a few constraints, we can
	// safely come to this conclusion. This allows us to reject junk data pretty
	// quickly.
	if !bytes.HasPrefix(buf, []byte("MSH")) {
		return nil, nil, ErrInvalidHeader(stackerr.Newf("expected message to begin with MSH; instead found %q", buf[0:3]))
	}

	// These are the control characters. `fs` is the field separator, `cs` the
	// component separator, `rs` the field repeat separator, `ec` the escape
	// character, and `ss` the sub-component separator.
	fs := buf[3]
	cs := buf[4]
	rs := buf[5]
	ec := buf[6]
	ss := buf[7]

	d := Delimiters{fs, cs, rs, ec, ss}

	// These are the variables we'll be working with. We reuse these variables a
	// lot in the parsing loop below. A `FieldItem` is one instance of a field
	// value - the HL7 standard calls this a "repetition," but I found that
	// `FieldItem` was easier to think about.
	var (
		message   Message
		segment   Segment
		field     Field
		fieldItem FieldItem
		component Component
		s         []byte
	)

	// We manually construct the first few fields of the message, as we know
	// that it has to be structured this way. It's easier than having special
	// code to parse these weird fields out.
	segment = Segment{
		Field{FieldItem{Component{Subcomponent("MSH")}}},
		Field{FieldItem{Component{Subcomponent(buf[3])}}},
		Field{FieldItem{Component{Subcomponent(string(buf[4:8]))}}},
	}

	// These functions are used when we encounter control characters. When we
	// see a control character, it signals the end of a certain kind of element.
	// `|` means the end of a field, `~` a repetition, `^` a component, and `&`
	// a subcomponent. Another property of these separators is that each one not
	// only ends that element itself, but also any elements it contains. For
	// example, hitting `|` not only means that you've found the end of the
	// current field, but also the end of the current repetition, component, and
	// sub-component. This is expressed below as nested calls in the different
	// `commitX` functions.
	commitBuffer := func(force bool) {
		if s != nil || force {
			component = append(component, Subcomponent(unescape(s, &d)))
			s = nil
		}
	}

	commitComponent := func(force bool) {
		commitBuffer(false)

		if component != nil || force {
			fieldItem = append(fieldItem, component)
			component = nil
		}
	}

	commitFieldItem := func(force bool) {
		commitComponent(false)

		if fieldItem != nil || force {
			field = append(field, fieldItem)
			fieldItem = nil
		}
	}

	commitField := func(force bool) {
		commitFieldItem(false)

		if field != nil || force {
			segment = append(segment, field)
			field = nil
		}
	}

	commitSegment := func(force bool) {
		commitField(false)

		if segment != nil || force {
			message = append(message, segment)
			segment = nil
		}
	}

	// This is the main parse loop. We go through the input byte-by-byte,
	// accumulating data until we hit any of the control characters. When we do,
	// we commit whatever we have "buffered" for that level. Carriage returns
	// and line breaks count as control characters, as they delimit segments
	// themselves.
	//
	// The header takes up the first eight bytes, and the ninth is the field
	// separator that follows it, so the loop starts after that. Slicing with
	// `buf[9:]` directly would panic on a message that is exactly eight bytes
	// long, so the bounds are checked first.
	rest := buf[8:]
	if len(rest) > 0 {
		rest = rest[1:]
	}

	sawNewline := false
	for _, c := range rest {
		switch c {
		case '\r', '\n':
			if !sawNewline {
				commitSegment(true)
			}
			sawNewline = true
		case fs:
			sawNewline = false
			commitField(true)
		case rs:
			sawNewline = false
			commitFieldItem(true)
		case cs:
			sawNewline = false
			commitComponent(true)
		case ss:
			sawNewline = false
			commitBuffer(true)
		default:
			sawNewline = false
			s = append(s, c)
		}
	}

	// After we've gotten to the end of the input, we might still have some data
	// buffered up, so we make sure that gets committed.
	commitSegment(false)

	// That's it - we're done! Return the message, the `Delimiters` object, and
	// `nil` - signalling that there was no error.
	return message, &d, nil
}
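The parser above calls an `unescape` function that isn't shown. What follows is not the library's implementation, just a minimal sketch of how such a function could work. It assumes only the five standard delimiter escapes need handling (`\F\`, `\S\`, `\T\`, `\R\` and `\E\`) and passes anything else through untouched. The `Delimiters` field names are also assumptions.

```go
package main

import (
	"bytes"
	"fmt"
)

// Delimiters mirrors the control characters read from the MSH header. The
// field names are assumptions made for this sketch.
type Delimiters struct {
	Field, Component, Repetition, Escape, Subcomponent byte
}

// unescape replaces the delimiter escape sequences with the characters they
// stand for. Other sequences (formatting, hex data and so on) are left as-is.
func unescape(s []byte, d *Delimiters) []byte {
	// Fast path: most values contain no escape character at all.
	if bytes.IndexByte(s, d.Escape) < 0 {
		return s
	}

	out := make([]byte, 0, len(s))
	for i := 0; i < len(s); i++ {
		// A delimiter escape is exactly three bytes: escape, letter, escape.
		if s[i] == d.Escape && i+2 < len(s) && s[i+2] == d.Escape {
			var r byte
			known := true
			switch s[i+1] {
			case 'F':
				r = d.Field
			case 'S':
				r = d.Component
			case 'T':
				r = d.Subcomponent
			case 'R':
				r = d.Repetition
			case 'E':
				r = d.Escape
			default:
				known = false
			}
			if known {
				out = append(out, r)
				i += 2 // skip the letter and the closing escape character
				continue
			}
		}
		out = append(out, s[i])
	}
	return out
}

func main() {
	d := &Delimiters{Field: '|', Component: '^', Repetition: '~', Escape: '\\', Subcomponent: '&'}
	// The sending application in the example message is written as EP\S\IC.
	fmt.Printf("%s\n", unescape([]byte(`EP\S\IC`), d)) // EP^IC
}
```

Unescaping has to happen after splitting, which is why the parser only calls it when committing a sub-component. Doing it earlier would turn escaped delimiters back into real ones.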
Querying Messages
Once you’ve got a message parsed, that’s only half the battle. Now you have to pull fields out of it somehow. Luckily, I’ve implemented a fairly capable HL7 terser along with the message parser.
Here’s an example (from example_test.go) that shows how you might get the type of a message.
package hl7

import (
	"fmt"
)

func ExampleQuery_GetString() {
	m, _, _ := ParseMessage([]byte(longTestMessageContent))

	msh9_1, _ := ParseQuery("MSH-9-1")
	msh9_2, _ := ParseQuery("MSH-9-2")

	fmt.Printf("%s_%s", msh9_1.GetString(m), msh9_2.GetString(m))

	// Output: ORU_R01
}
Conclusion
HL7v2 isn’t really that complex overall. It has some quirks, and it’s definitely not a modern protocol by any measure, but it’s not completely unapproachable.
I’m looking forward to the day FHIR replaces it for mainstream use.