Golang HL7 Library
Apr 27, 2016
10 minutes read


HL7v2 stands for “Health Level 7: Version 2”. This was the first version of a specification for shuttling clinical data around and between medical institutions. Yep, the first version. And it’s named “version 2”. Yeah, I know. Anyway, it’s been in service for nearly thirty years now, and as with any protocol that survives that long, it’s accumulated some quirks. It looks like it’ll be supplanted by FHIR (Fast Healthcare Interoperability Resources) eventually, but it has some serious inertia, so I reckon it’ll be around for a while. While working on Medtasker with Nimblic, I’ve written a library for reading this protocol and querying the messages it contains. You can find it at fknsrs.biz/p/hl7 (godoc.org). Join me as we explore how it works!

The Protocol

If you’re used to JSON and XML, the HL7 protocol is going to look like absolute gibberish. It was conceived at a time when text was king, and line-separated records were just the way things were done. It’s sometimes referred to as “pipehat” - so called because it most commonly uses pipes (|) and “hats” (^) to delimit fields. It uses a bunch of other characters too, and all of them are characters you could conceivably see in actual content. Kind of like HTML, there’s a way to represent them as entities. This actually makes parsing an HL7 message pretty simple. Simpler than it looks at first glance, anyway.

Here’s an example of an HL7 message:


PID||0493575^^^2^ID 1|454721||DOE^JOHN^^^^|DOE^JOHN^^^^|19480203|M||B|254 MYSTREET AVE^^MYTOWN^OH^44123^USA||(216)123-4567|||M|NON|400003403~1129086|


PV1||O|168 ~219~C~PMA^^^^^^^^^||||277^ALLEN MYLASTNAME^BONNIE^^^^|||||||||| ||2688684|||||||||||||||||||||||||199912271408||||||002376853

For your sake, I’ve converted the line breaks from \r to \n and added extra breaks so that it displays a little more nicely here. Normally, each line would be separated by a carriage return instead of a line break character. Also, I bet that’s a valid perl program.

I won’t bore you with what all the pieces of data mean - you don’t actually have to know about much of it just to understand how the structure works.

The most important thing in the message is right near the start of the first row. This is the “message header” row, as denoted by it starting with MSH. The characters immediately following MSH actually define what the control characters are in the message. That means that you have to consume the first eight bytes of a message before you can parse the rest of it. How awesome is that? It’s basically the opposite of a context-free grammar.

The first byte after MSH defines the “field separator”. In this case, it’s a pipe character. It’s almost always a pipe character, but it totally doesn’t have to be. You could just as easily set it to x or # or ¿, or even \n, and it’d be perfectly valid. Probably anything except \r, as that’s used to separate the lines - called “segments” in HL7 speak. I fully expect some system somewhere in the wild to have done something like this.

The next four bytes define the “encoding characters” - named “component separator” (^), “field repeat separator” (~), “escape character” (\), and “sub-component separator” (&). These characters serve as delimiters for the different parts of the message. Just as with the field separator, these characters can be literally any byte except \r.
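To make the header handling concrete, here's a minimal sketch of pulling the five control characters out of the first eight bytes. The `delimiters` type and the `readDelimiters` function are names invented for this illustration; they are not the library's API.

```go
package main

import (
	"errors"
	"fmt"
)

// delimiters is a hypothetical holder for the five control characters.
type delimiters struct {
	field, component, repetition, escape, subcomponent byte
}

// readDelimiters reads the control characters that follow "MSH".
// It only looks at the first eight bytes of the message.
func readDelimiters(buf []byte) (delimiters, error) {
	if len(buf) < 8 {
		return delimiters{}, errors.New("message too short to contain a header")
	}
	if string(buf[:3]) != "MSH" {
		return delimiters{}, errors.New("message does not begin with MSH")
	}
	return delimiters{
		field:        buf[3],
		component:    buf[4],
		repetition:   buf[5],
		escape:       buf[6],
		subcomponent: buf[7],
	}, nil
}

func main() {
	// A made-up header fragment using the usual delimiters.
	d, err := readDelimiters([]byte("MSH|^~\\&|SENDER|"))
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("field=%c component=%c repetition=%c escape=%c subcomponent=%c\n",
		d.field, d.component, d.repetition, d.escape, d.subcomponent)
}
```

Nothing here validates that the five characters are distinct or sensible; a stricter reader might want to reject a header that reuses the same byte twice.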

An HL7 message is made up of segments separated by \r, which themselves are composed of fields separated by |, each of which can have zero or more repetitions separated by ~, within which there are zero or more components separated by ^, which themselves are broken up into sub-components separated by &.
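That nesting can be sketched with nothing but `strings.Split`. This is a deliberately naive illustration that assumes the usual delimiters and ignores escape sequences and the special MSH header fields; `splitSegment` is a made-up name, not part of the library.

```go
package main

import (
	"fmt"
	"strings"
)

// splitSegment breaks one segment into fields, repetitions, components and
// sub-components, assuming the usual delimiters and no escape sequences.
// The result is indexed as [field][repetition][component][sub-component].
func splitSegment(segment string) [][][][]string {
	var fields [][][][]string
	for _, f := range strings.Split(segment, "|") {
		var repetitions [][][]string
		for _, r := range strings.Split(f, "~") {
			var components [][]string
			for _, c := range strings.Split(r, "^") {
				components = append(components, strings.Split(c, "&"))
			}
			repetitions = append(repetitions, components)
		}
		fields = append(fields, repetitions)
	}
	return fields
}

func main() {
	fields := splitSegment("PID||0493575^^^2^ID 1|454721||DOE^JOHN")
	// fields[0] holds the segment name, so fields[5] is the fifth field.
	fmt.Println(fields[5][0][0][0], fields[5][0][1][0]) // prints "DOE JOHN"
}
```

Splitting eagerly like this allocates a lot and gets escaped delimiters wrong, which is part of why a byte-by-byte parser ends up being the better fit.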

That is the clearest way that I’ve found to explain the structure of an HL7 message. I’ve been working with HL7 for about a year now and I still get confused at least every week, so take a second to read over that a few times if you’d like.

The Query Language

That data structure up there is pretty complex. There’s a query language that seems to have evolved, and the only name I could find for it is “terser”. I think it’s meant in the sense that it’s a “terse” representation of a query. There’s a bit of information about it in the HAPI docs. I’ll summarise it here. Most references on the web use terser or terser-like syntax to describe fields.

A query looks like PID(2)-3(4)-5-6, and the parts in parentheses can be left out if their values would be 1. It reads like so: “sub-component number six, of component number five, of repetition four, of field three, of the second PID segment.” You can pretty much imagine the numbers as array offsets shifted up by one: access in the terser language is one-based, while most programming languages are zero-based. For example, the previous query would look like message['pid'][1][2][3][4][5]. OBX-5-4-3 would look like message['obx'][0][4][0][3][2].
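The one-based to zero-based translation can be sketched as a tiny parser. This is an illustration only: `parseQuery` and `splitParens` are hypothetical names, the sketch treats every omitted part as 1, and it does not reflect how the library's own `ParseQuery` is implemented.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// splitParens splits "PID(2)" into "PID" and 2. A missing "(n)" means 1.
func splitParens(part string) (string, int, error) {
	open := strings.IndexByte(part, '(')
	if open < 0 {
		return part, 1, nil
	}
	if !strings.HasSuffix(part, ")") {
		return "", 0, fmt.Errorf("unbalanced parentheses in %q", part)
	}
	n, err := strconv.Atoi(part[open+1 : len(part)-1])
	if err != nil {
		return "", 0, err
	}
	return part[:open], n, nil
}

// parseQuery returns the segment name and five zero-based offsets:
// segment occurrence, field, repetition, component, sub-component.
func parseQuery(q string) (string, [5]int, error) {
	var offsets [5]int // zero everywhere: omitted parts default to "1"
	parts := strings.Split(q, "-")
	if len(parts) > 4 {
		return "", offsets, fmt.Errorf("too many parts in %q", q)
	}

	name, occurrence, err := splitParens(parts[0])
	if err != nil {
		return "", offsets, err
	}
	offsets[0] = occurrence - 1

	if len(parts) > 1 {
		fieldText, repetition, err := splitParens(parts[1])
		if err != nil {
			return "", offsets, err
		}
		field, err := strconv.Atoi(fieldText)
		if err != nil {
			return "", offsets, err
		}
		offsets[1], offsets[2] = field-1, repetition-1
	}

	// The remaining parts are the component and the sub-component.
	for i := 2; i < len(parts); i++ {
		n, err := strconv.Atoi(parts[i])
		if err != nil {
			return "", offsets, err
		}
		offsets[i+1] = n - 1
	}
	return name, offsets, nil
}

func main() {
	for _, q := range []string{"PID(2)-3(4)-5-6", "OBX-5-4-3"} {
		name, offsets, err := parseQuery(q)
		fmt.Println(name, offsets, err)
	}
	// PID [1 2 3 4 5] <nil>
	// OBX [0 4 0 3 2] <nil>
}
```

The two results match the offsets given in the paragraph above. A real evaluator also has to decide what a query like PID-3 means when the component is left out (the first component, or the whole field), which this sketch glosses over.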

I’ve written a parser and evaluator for this language as well, which makes it much easier to extract values from HL7 messages.

Parsing Messages

While I recently wrote my own parser for HL7 messages, there’s another very capable library that I used for many months previously - github.com/kdar/health/hl7. kdar’s library is very complete, but it’s not particularly performant. I often find myself processing several-hundred-megabyte HL7 logs, and my program was actually bottlenecked on parsing the messages. Additionally, the data types in kdar’s library made implementing the HL7 terser pretty complex. The parser I wrote has more verbose types, but they’re far more consistent, and thus easier for me to consume.

The first library I used was kdar’s library. It served me well for a while, but the performance just wasn’t what I needed. I looked at improving the performance, but the parser in that package is generated from a grammar, so there’s not really a lot of opportunity to tune things. Parsing a simple message (one segment) on my laptop takes about 38971ns, according to go test -bench. That’s still pretty quick, but if you have hundreds of thousands of messages to parse, it starts to add up.
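For anyone wanting to take this kind of measurement themselves, here's a rough sketch using `testing.Benchmark` from the standard library, which runs the same loop that go test -bench does but from an ordinary program. The sample message is made up, and `parseStandIn` is only a placeholder for whichever parser is being measured; it is neither kdar's parser nor mine.

```go
package main

import (
	"bytes"
	"fmt"
	"testing"
)

// sample is a made-up single-segment message, used only as benchmark input.
var sample = []byte("MSH|^~\\&|SENDER|FACILITY|RECEIVER|FACILITY|199912271408||ORU^R01|1|P|2.3\r")

// parseStandIn is a placeholder for the parser under test.
func parseStandIn(buf []byte) [][]byte {
	return bytes.Split(buf, []byte{'|'})
}

// measure runs the standard benchmark loop and returns its result.
func measure() testing.BenchmarkResult {
	return testing.Benchmark(func(b *testing.B) {
		for i := 0; i < b.N; i++ {
			parseStandIn(sample)
		}
	})
}

func main() {
	r := measure()
	fmt.Printf("%d iterations, %d ns/op\n", r.N, r.NsPerOp())
}
```

The numbers depend entirely on the machine and the input, so they're only useful for comparing two parsers on the same message, which is how the figures in this post should be read.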

I wrote my own parser, and parsing the same message, it clocked in at 153107ns. That’s about four times as long as kdar’s library takes. That’s pretty awful. To try to make it faster, I added some optimisations to my fknsrs.biz/p/supersplit library. It didn’t help much.

I rewrote my parser using a much simpler strategy, and it now parses that simple message in 8485ns. This is about a fifth of the time that kdar’s library takes to do the same thing. This cut more than 50% off the runtime of my program - I was very pleased. It’s also much less code to maintain.

The parsing strategy I’m using is pretty simple. The whole parser (minus the part that handles unescaping) is contained below. It’s annotated for easier reading.

// ParseMessage takes input as a `[]byte`, and returns the whole message, the
// control characters (as `*Delimiters`), and maybe an error.
func ParseMessage(buf []byte) (Message, *Delimiters, error) {
  // This is a sanity check, to make sure the message is long enough to
  // contain a valid header. If it's less than eight bytes long, it can't
  // possibly contain the required information.

  if len(buf) < 8 {
    return nil, nil, ErrTooShort(stackerr.Newf("message must be at least eight bytes long; instead was %d", len(buf)))
  }

  // Every valid HL7 message will begin with `MSH`. This isn't specifically
  // mandated in the specification, but by combining a few constraints, we can
  // safely come to this conclusion. This allows us to reject junk data pretty
  // quickly.

  if !bytes.HasPrefix(buf, []byte("MSH")) {
    return nil, nil, ErrInvalidHeader(stackerr.Newf("expected message to begin with MSH; instead found %q", buf[0:3]))
  }

  // These are the control characters. `fs` is the field separator, `cs` the
  // component separator, `rs` the field repeat separator, `ec` the escape
  // character, and `ss` the sub-component separator.

  fs := buf[3]
  cs := buf[4]
  rs := buf[5]
  ec := buf[6]
  ss := buf[7]

  d := Delimiters{fs, cs, rs, ec, ss}

  // These are the variables we'll be working with. We reuse these variables a
  // lot in the parsing loop below. A `FieldItem` is one instance of a field
  // value - the HL7 standard calls this a "repetition," but I found that
  // `FieldItem` was easier to think about.

  var (
    message   Message
    segment   Segment
    field     Field
    fieldItem FieldItem
    component Component
    s         []byte
  )

  // We manually construct the first few fields of the message, as we know
  // that it has to be structured this way. It's easier than having special
  // code to parse these weird fields out. The segment name, the field
  // separator, and the four encoding characters each become a field of
  // their own.

  segment = Segment{
    Field{FieldItem{Component{Subcomponent("MSH")}}},
    Field{FieldItem{Component{Subcomponent(string(buf[3:4]))}}},
    Field{FieldItem{Component{Subcomponent(string(buf[4:8]))}}},
  }

  // These functions are used when we encounter control characters. When we
  // see a control character, it signals the end of a certain kind of element.
  // `|` means the end of a field, `~` a repetition, `^` a component, and `&`
  // a subcomponent. Another property of these separators is that each one not
  // only ends that element itself, but also any elements it contains. For
  // example, hitting `|` not only means that you've found the end of the
  // current field, but also the end of the current repetition, component, and
  // sub-component. This is expressed below as nested calls in the different
  // `commitX` functions.

  commitBuffer := func(force bool) {
    if s != nil || force {
      component = append(component, Subcomponent(unescape(s, &d)))
      s = nil
    }
  }

  commitComponent := func(force bool) {
    commitBuffer(false)

    if component != nil || force {
      fieldItem = append(fieldItem, component)
      component = nil
    }
  }

  commitFieldItem := func(force bool) {
    commitComponent(false)

    if fieldItem != nil || force {
      field = append(field, fieldItem)
      fieldItem = nil
    }
  }

  commitField := func(force bool) {
    commitFieldItem(false)

    if field != nil || force {
      segment = append(segment, field)
      field = nil
    }
  }

  commitSegment := func(force bool) {
    commitField(false)

    if segment != nil || force {
      message = append(message, segment)
      segment = nil
    }
  }

  // This is the main parse loop. We go through the input byte-by-byte,
  // accumulating data until we hit any of the control characters. When we do,
  // we commit whatever we have "buffered" for that level. Carriage returns
  // and line breaks count as control characters, as they delimit segments
  // themselves. `sawNewline` stops a `\r\n` pair from ending a segment twice.

  sawNewline := false
  for _, c := range buf[9:] {
    switch c {
    case '\r', '\n':
      if !sawNewline {
        commitSegment(false)
      }
      sawNewline = true
    case fs:
      sawNewline = false
      commitField(true)
    case rs:
      sawNewline = false
      commitFieldItem(true)
    case cs:
      sawNewline = false
      commitComponent(true)
    case ss:
      sawNewline = false
      commitBuffer(true)
    default:
      sawNewline = false
      s = append(s, c)
    }
  }

  // After we've gotten to the end of the input, we might still have some data
  // buffered up, so we make sure that gets committed.

  commitSegment(false)

  // That's it - we're done! Return the message, the `Delimiters` object, and
  // `nil` - signalling that there was no error.

  return message, &d, nil
}
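The parser above leans on an `unescape` function that isn't shown. As a rough idea of what that step involves, here's a sketch that handles only the five standard delimiter escapes (`\F\`, `\S\`, `\R\`, `\E\` and `\T\`) and leaves everything else untouched. The names `unescapeSketch` and `literalFor` and their signatures are assumptions made for this example; the library's own function takes a `*Delimiters` and may well behave differently.

```go
package main

import "fmt"

// literalFor maps the letter inside an escape sequence to the delimiter it
// stands for. It reports false for letters this sketch doesn't handle.
func literalFor(letter, fs, cs, rs, ec, ss byte) (byte, bool) {
	switch letter {
	case 'F':
		return fs, true // field separator
	case 'S':
		return cs, true // component separator
	case 'R':
		return rs, true // repetition separator
	case 'E':
		return ec, true // escape character
	case 'T':
		return ss, true // sub-component separator
	}
	return 0, false
}

// unescapeSketch replaces delimiter escape sequences with the literal
// characters they represent. Unknown sequences are copied through as-is.
func unescapeSketch(in []byte, fs, cs, rs, ec, ss byte) []byte {
	out := make([]byte, 0, len(in))
	for i := 0; i < len(in); i++ {
		// A delimiter escape is three bytes: escape, letter, escape.
		if in[i] == ec && i+2 < len(in) && in[i+2] == ec {
			if r, ok := literalFor(in[i+1], fs, cs, rs, ec, ss); ok {
				out = append(out, r)
				i += 2 // skip the letter and the closing escape
				continue
			}
		}
		out = append(out, in[i])
	}
	return out
}

func main() {
	in := []byte(`DOE \T\ SONS\F\LTD`)
	fmt.Printf("%s\n", unescapeSketch(in, '|', '^', '~', '\\', '&'))
	// prints "DOE & SONS|LTD"
}
```

Doing the unescaping per sub-component, after the delimiters have already been found, is what lets the main loop treat every unescaped control character as a real delimiter.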

Querying Messages

Once you’ve got a message parsed, that’s only half the battle. Now you have to pull fields out of it somehow. Luckily, I’ve implemented a fairly capable HL7 terser along with the message parser.

Here’s an example (from example_test.go) that shows how you might get the type of a message.

package hl7

import (
  "fmt"
)

func ExampleQuery_GetString() {
  m, _, _ := ParseMessage([]byte(longTestMessageContent))

  msh9_1, _ := ParseQuery("MSH-9-1")
  msh9_2, _ := ParseQuery("MSH-9-2")

  fmt.Printf("%s_%s", msh9_1.GetString(m), msh9_2.GetString(m))
  // Output: ORU_R01
}


HL7v2 isn’t really that complex overall. It has some quirks, and it’s definitely not a modern protocol by any measure, but it’s not completely unapproachable.

I’m looking forward to the day FHIR replaces it for mainstream use.
