Transitioning from HDF to JSON

Backstory

Since the early days of Discogs, all release data has been stored in a format called HDF, or Hierarchical Data Format. At the time, this was a good solution because of its tight integration with the Clearsilver templating library, which was the foundation of how we generated HTML.

As time went on, JSON gained momentum not only in the wider development community but also in the Discogs codebase, especially after we converted our templates from Clearsilver to Myghty and then finally to Jinja2. It became apparent that, in order to simplify our codebase, we needed to start pruning away legacy technologies like HDF.

The Implementation

Python has pretty good built-in JSON support that allows painless transitions to and from a native Python dictionary object. HDF? Not so much. There is no easy way to transform an HDF object to a native Python dictionary. If you want to do that, you have to traverse the HDF tree and manually build one. Here’s an example of that in action:

:::python
>>> import neo_cgi  # neo_cgi must be imported before neo_util
>>> from neo_util import HDF
>>> release = HDF()
>>> release.readString(release_hdf_data)
>>> # Start at the first label, then walk its siblings with next()
>>> node = release.getObj('labels.0')
>>> labels = []
>>> while node:
...     labels.append({'name': node.getValue('name', ''), 'catno': node.getValue('catno', '')})
...     node = node.next()
...
>>> labels[0]
{'name': 'Svek', 'catno': 'SK032'}

All that code just to represent a release’s labels as native Python objects. Converting back to HDF requires even more code. Mo code, mo problems.
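
The loop above only handles one known branch of the tree. A more general version would recurse over every node. Here is a minimal sketch of that idea, assuming nodes expose `child()`, `next()`, `name()`, and `value()` methods in the style of the Clearsilver bindings. `FakeNode` is a hypothetical stand-in so the example runs without Clearsilver installed, and `node_to_python` is an illustrative name, not part of any library:

```python
class FakeNode(object):
    """Hypothetical stand-in for an HDF node (not the real Clearsilver class)."""

    def __init__(self, name, value=None, children=()):
        self._name = name
        self._value = value
        self._children = list(children)
        self._next = None
        # Chain siblings together so next() walks from one to the following one.
        for left, right in zip(self._children, self._children[1:]):
            left._next = right

    def name(self):
        return self._name

    def value(self):
        return self._value

    def child(self):
        return self._children[0] if self._children else None

    def next(self):
        return self._next


def node_to_python(node):
    """Recursively convert a node into a string, list, or dict."""
    child = node.child()
    if child is None:
        return node.value()  # leaf node: return the stored value
    items = []
    while child is not None:
        items.append((child.name(), node_to_python(child)))
        child = child.next()
    names = [name for name, _ in items]
    # Children named 0, 1, 2, ... become a list; anything else becomes a dict.
    if names == [str(i) for i in range(len(names))]:
        return [value for _, value in items]
    return dict(items)


release_tree = FakeNode('release', children=[
    FakeNode('title', 'Stockholm'),
    FakeNode('labels', children=[
        FakeNode('0', children=[
            FakeNode('name', 'Svek'),
            FakeNode('catno', 'SK032'),
        ]),
    ]),
])

release_dict = node_to_python(release_tree)
print(release_dict['labels'][0]['name'])  # prints Svek
```

Treating numerically named children as list items is a design choice of this sketch; HDF itself just stores them as named child nodes.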

However, if the release data were stored as JSON, transforming it to a native Python dictionary would be painless:

:::python
>>> import json
>>> release = json.loads(release_json_data)
>>> release['labels'][0]
{'name': 'Svek', 'catno': 'SK032'}

Now, we can easily make changes to the release object and then serialize it back to JSON:

:::python
>>> release['title'] = 'Stockholm'
>>> json.dumps(release)
'{"labels": [{"catno": "SK032", "name": "Svek"}], "title": "Stockholm"}'
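
The alphabetical key order in that output is not something `json.dumps` promises by default; it follows the dictionary's own ordering. If a stable order matters, for example when comparing stored releases, passing `sort_keys=True` is one option. A small sketch, with a made-up input string standing in for `release_json_data`:

```python
import json

# Sample payload modeled on the example above (an assumption, not real stored data).
release_json_data = '{"labels": [{"name": "Svek", "catno": "SK032"}]}'

release = json.loads(release_json_data)
release['title'] = 'Stockholm'

# sort_keys=True emits keys alphabetically at every nesting level.
serialized = json.dumps(release, sort_keys=True)
print(serialized)
# prints {"labels": [{"catno": "SK032", "name": "Svek"}], "title": "Stockholm"}

# Loading the serialized string gives back an equal dictionary.
round_tripped = json.loads(serialized)
```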

Knowing that, how do we transform HDF data to JSON? No magical tool existed at the time to do this conversion for us, so we set out to write our own. And so hdf2json was born.
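
To give a feel for what such a conversion involves, here is a rough pure-Python sketch that handles only a small subset of HDF text: `key = value` lines, dotted keys, and `name { ... }` blocks. It is not how hdf2json is implemented, and `parse_hdf_subset`, `descend`, and `listify` are names invented for this illustration:

```python
import json


def descend(node, parts):
    """Walk the nested dicts named by parts, creating them as needed."""
    for part in parts:
        node = node.setdefault(part, {})
    return node


def listify(node):
    """Recursively turn dicts keyed '0', '1', ... into lists."""
    if not isinstance(node, dict):
        return node
    converted = {key: listify(value) for key, value in node.items()}
    indexes = [str(i) for i in range(len(converted))]
    if converted and set(converted) == set(indexes):
        return [converted[index] for index in indexes]
    return converted


def parse_hdf_subset(text):
    """Parse 'a.b = value' lines and 'name { ... }' blocks into Python objects."""
    root = {}
    stack = [root]  # stack[-1] is the dict currently being filled
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue  # skip blank lines
        if line == '}':
            stack.pop()  # close the current block
        elif line.endswith('{'):
            block_name = line[:-1].strip()
            stack.append(descend(stack[-1], block_name.split('.')))
        else:
            key, _, value = line.partition('=')
            parts = key.strip().split('.')
            parent = descend(stack[-1], parts[:-1])
            parent[parts[-1]] = value.strip()
    return listify(root)


sample_hdf = """
title = Stockholm
labels {
  0 {
    name = Svek
    catno = SK032
  }
}
"""

release = parse_hdf_subset(sample_hdf)
print(json.dumps(release, sort_keys=True))
# prints {"labels": [{"catno": "SK032", "name": "Svek"}], "title": "Stockholm"}
```

A real converter would also need to deal with the parts of HDF this sketch ignores, such as multi-line values and attributes.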

Introducing the hdf2json Python Library

Usage is a piece of cake:

:::python
>>> from hdf2json import hdf2json
>>> hdf2json(release_hdf_data)
'{"labels": [{"catno": "SK032", "name": "Svek"}], "title": "Stockholm"}'

If we want a native Python dictionary instead, the library also provides hdf2dict:

:::python
>>> from hdf2json import hdf2dict
>>> release = hdf2dict(release_hdf_data)
>>> release['title']
'Stockholm'

And then back to JSON:

:::python
>>> json.dumps(release)
'{"labels": [{"catno": "SK032", "name": "Svek"}], "title": "Stockholm"}'

Outro

For all the poor souls out there who are still stuck using HDF, we hope hdf2json will come in handy. Otherwise, we hope you enjoyed learning a bit about the behind-the-scenes work that happens at Discogs.


4 Comments
  • Apr 22, 2013 at 11:40 pm

    Does this imply that Discogs uses file-based data persistence? I was wondering why a relational database wasn’t chosen instead.

  • Apr 22, 2013 at 11:39 pm

    Happy to see the engineering blog, hope you guys keep it up. And +1 for open sourcing!

  • Apr 22, 2013 at 11:35 pm

    epic comment above.
    bravo to the developer anyway!

  • Apr 22, 2013 at 11:30 pm

    I got 99 problems but HDF to JSON aint one.
