Choosing the Right Serialization Format

When saving or communicating some kind of information, we often use serialization. Serialization takes a Ruby object and converts it into a string of bytes and vice versa. For example, if you have an object representing information about a user and need to send it over the network, it has to be serialized into a set of bytes that can be pushed over a socket. Then, at the other end, the receiver has to unserialize the object, converting it back into something that Ruby (or another language) can understand.

It turns out that there are lots of ways to serialize Ruby objects. I’ll cover YAML, JSON, and MessagePack in this article, exploring their pianos and fortes to see them in action with Ruby. At the end, we’ll put together a modular serialization approach using some metaprogramming tricks.

Let’s jump in!

YAML

YAML is a recursive acronym that stands for “YAML Ain’t Markup Language”. It is a serialization format, but it is also (easily) human readable, meaning that it can be used as a configuration language. In fact, Rails uses YAML to do all sorts of configuration, e.g. database connectivity.

Let’s check out an example:

name: "David"
height: 124
age: 28
children:
  "John":
    age: 1
    height: 10
  "Adam":
    age: 2
    height: 20
  "Robert":
    age: 3
    height: 30
traits:
  - smart
  - nice
  - caring

The format of YAML is incredibly easy to understand. The quickest way to make it click is to transform it into a Ruby hash or JavaScript object. We’ll go with the former (saving the above YAML in test.yaml):

require 'yaml'

YAML.load File.read('test.yaml')

Running the above in Pry will give you a nicely formatted result that looks like:

{"name"=>"David",
 "height"=>124,
 "age"=>28,
 "children"=>{"John"=>{"age"=>1, "height"=>10},
             "Adam"=>{"age"=>2, "height"=>20},
             "Robert"=>{"age"=>3, "height"=>30}},
 "traits"=>["smart", "nice", "caring"]}

As you can see, the colons represent “key-value” pairings, and indentation creates a new, nested hash (the indentation must be made of spaces, since YAML does not allow tabs for it). The little hyphens tell YAML that we want a list rather than a hash. This easy translation between YAML and Ruby hashes is one of the primary benefits of YAML.
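
To try these three rules (colons, indentation, hyphens) without a separate file, here is a minimal sketch that parses an inline YAML string. The document text and variable names are made up for illustration. YAML.safe_load is the standard library’s stricter loader, which only builds plain types such as hashes, arrays, strings, and numbers.

```ruby
require 'yaml'

# Inline YAML: colons make key-value pairs, indentation (spaces only)
# nests a hash, and hyphens start list items.
doc = <<~YAML
  name: David
  traits:
    - smart
    - nice
  children:
    John:
      age: 1
YAML

data = YAML.safe_load(doc)

p data['traits']                   # the hyphenated items become an Array
p data['children']['John']['age']  # nested indentation becomes nested Hashes

# Dumping the hash and loading it again yields an equal hash.
round_trip = YAML.safe_load(YAML.dump(data))
puts round_trip == data
```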

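We can also use YAML to serialize our own objects. Here is a Person class that knows how to convert itself to and from YAML:

require 'yaml'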

class Person
  attr_accessor :name, :age, :gender

  def initialize(name, age, gender)
    @name = name
    @age = age
    @gender = gender
  end

  def to_yaml
    YAML.dump ({
      :name => @name,
      :age => @age,
      :gender => @gender
    })
  end

  def self.from_yaml(string)
    data = YAML.load string
    p data
    self.new(data[:name], data[:age], data[:gender])
  end
end

p = Person.new "David", 28, "male"
p p.to_yaml

p = Person.from_yaml(p.to_yaml)
puts "Name #{p.name}"
puts "Age #{p.age}"
puts "Gender #{p.gender}"

Let’s break down the code. We have the to_yaml method:

def to_yaml
  YAML.dump ({
    :name => @name,
    :age => @age,
    :gender => @gender
  })
end

We are making a Ruby hash and turning it into a YAML string using the YAML module provided by the standard library. To go the other direction and convert a YAML string into a Ruby object:

def self.from_yaml(string)
  data = YAML.load string
  p data
  self.new(data[:name], data[:age], data[:gender])
end

Here, we take the string, convert it into a Ruby hash, then pass the contents of that hash to the constructor to build a new instance of Person.
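
If the YAML might be read by other languages or come from a source you do not fully trust, one possible variation is to dump string keys and read them back with YAML.safe_load. The sketch below is not the article’s Person class: PersonRecord and its to_yaml_string and from_yaml_string methods are hypothetical names chosen so they do not clash with anything above.

```ruby
require 'yaml'

# PersonRecord is a hypothetical stand-in for the article's Person class.
class PersonRecord
  attr_reader :name, :age, :gender

  def initialize(name, age, gender)
    @name = name
    @age = age
    @gender = gender
  end

  # String keys keep the YAML free of Ruby-specific symbol syntax,
  # so readers written in other languages see ordinary keys.
  def to_yaml_string
    YAML.dump({ 'name' => @name, 'age' => @age, 'gender' => @gender })
  end

  # safe_load only builds basic types (hashes, arrays, strings, numbers),
  # which is a cautious default for input that is not fully trusted.
  def self.from_yaml_string(string)
    data = YAML.safe_load(string)
    new(data['name'], data['age'], data['gender'])
  end
end

original = PersonRecord.new('David', 28, 'male')
copy = PersonRecord.from_yaml_string(original.to_yaml_string)
puts "#{copy.name}, #{copy.age}, #{copy.gender}"
```

The trade-off is that symbol keys are lost, which matters little here because the hash only exists for the short trip between the string and the constructor.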

Now, let’s see how YAML compares with the heavyweight from the land of JavaScript.

JSON

In some ways, JSON is very similar to YAML. It is meant to be a human-readable format that often serves as a configuration format. Both are widely adopted in the Ruby community. However, JSON differs in that it draws its roots from JavaScript. In fact, JSON stands for JavaScript Object Notation. The syntax for JSON is nearly the same as the syntax for defining JavaScript objects (which are somewhat analogous to Ruby hashes). Let’s see an example:

{
  "name": "David",
  "height": 124,
  "age": 28,
  "children": {"John": {"age": 1, "height": 10},
             "Adam": {"age": 2, "height": 20},
             "Robert": {"age": 3, "height": 30}},
  "traits": ["smart", "nice", "caring"]
}

That looks really similar to the good old Ruby hash. The only difference seems to be that the key-value relation is expressed by “:” in JSON instead of the => we find in Ruby.

Let’s see exactly what the example looks like in Ruby:

require 'json'
JSON.load File.read("test.json")

{"name"=>"David",
 "height"=>124,
 "age"=>28,
 "children"=>{"John"=>{"age"=>1, "height"=>10},
             "Adam"=>{"age"=>2, "height"=>20},
             "Robert"=>{"age"=>3, "height"=>30}},
 "traits"=>["smart", "nice", "caring"]}

We can add a set of methods to the Person class developed earlier, making it JSON-serializable:

require 'json'

class Person
  ...
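  # Accept (and ignore) the optional generator state that the JSON library
  # passes in when this object is nested inside an array or hash.
  def to_json(*_args)
    JSON.dump({
      :name => @name,
      :age => @age,
      :gender => @gender
    })
  end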

  def self.from_json(string)
    data = JSON.load string
    self.new(data['name'], data['age'], data['gender'])
  end
  ...
end

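The underlying code is almost exactly the same, except that the methods use JSON instead of YAML! The one real difference is in from_json: JSON has no symbol type, so the symbol keys we dumped come back as strings, which is why the hash is indexed with 'name' rather than :name.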
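
The behaviour of hash keys across a JSON round trip is worth seeing in isolation. A small sketch using only the standard json library (the payload and variable names are illustrative):

```ruby
require 'json'

payload = JSON.dump({ :name => 'David', :age => 28 })
puts payload

# By default, keys come back as strings...
string_keyed = JSON.parse(payload)
p string_keyed['name']

# ...but symbolize_names: true asks the parser to return symbol keys.
symbol_keyed = JSON.parse(payload, symbolize_names: true)
p symbol_keyed[:name]
```

Either style works, as long as the dumping and loading sides of a class agree on which one they use.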

What sets JSON apart from the rest is its similarity to Ruby and JavaScript syntax. It takes some mental energy to switch between YAML and Ruby when writing code. There is no such problem with JSON, since the syntax is nearly identical to that of Ruby hashes. In addition, modern browsers ship with a native JSON implementation (JSON.parse and JSON.stringify), making it the lingua franca of AJAX communication.

On the other hand, using YAML from JavaScript requires an extra library, and YAML simply does not have much of a following in the JavaScript community. If your primary objective for a serialization method is to communicate with JavaScript, look at JSON first.

MessagePack

So far, we haven’t paid much attention to how much space a serialized object consumes. It turns out that small serialized size is a very important characteristic, especially for systems that require low latency and high throughput. This is where MessagePack steps in.

Unlike JSON and YAML, MessagePack is not meant to be human readable! It is a binary format, which means that it represents its information as arbitrary bytes, not necessarily bytes that represent the alphabet. The benefit of doing so is that its serializations often take up significantly less space than their YAML and JSON counterparts. Although this does rule out MessagePack as a configuration file format, it makes it very attractive to those building fast, distributed systems.

Let’s see how to use it with Ruby. Unlike YAML and JSON, MessagePack does not come bundled with Ruby (yet!). So, let’s get ourselves a copy:

gem install msgpack

We can mess around with it a bit:

require 'msgpack'
msg = {:height => 47, :width => 32, :depth => 16}.to_msgpack

#prints out mumbo-jumbo
p msg

obj = MessagePack.unpack(msg)
p obj

First, create a standard Ruby hash and call to_msgpack on it. This returns the MessagePack-serialized version of the hash. Then, unserialize it with MessagePack.unpack. We get the original data back, with one catch: like JSON, MessagePack has no symbol type, so by default the symbol keys come back as strings ("height" instead of :height). Of course, we can use our good old converter methods (notice the similar API):

class Person
  ...
  def to_msgpack
    MessagePack.dump ({
      :name => @name,
      :age => @age,
      :gender => @gender
    })
  end

  def self.from_msgpack(string)
    data = MessagePack.load string
    self.new(data['name'], data['age'], data['gender'])
  end
  ...
end
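
The size advantage mentioned earlier is easy to check for yourself. The following sketch serializes one small hash in each format and prints the byte counts. It assumes the msgpack gem is installed and simply skips that measurement if it is not. The hash is the same toy example used above, so the numbers say little about real workloads.

```ruby
require 'json'
require 'yaml'

begin
  require 'msgpack' # third-party gem, assumed installed via `gem install msgpack`
rescue LoadError
  warn 'msgpack gem not found; skipping the MessagePack measurement'
end

box = { 'height' => 47, 'width' => 32, 'depth' => 16 }

# bytesize counts raw bytes, which is what travels over a socket.
sizes = {
  'YAML' => YAML.dump(box).bytesize,
  'JSON' => JSON.dump(box).bytesize
}
sizes['MessagePack'] = MessagePack.pack(box).bytesize if defined?(MessagePack)

sizes.each { |format, bytes| puts "#{format}: #{bytes} bytes" }
```

For a fair comparison on your own data, measure a representative payload rather than a three-key hash, and time the serialization as well if latency matters.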

Okay, so MessagePack should be used when we feel the need for speed, JSON when we need to communicate with JavaScript, and YAML for configuration files. But you’re usually not going to be sure which one to pick when you start a large project, so how do we keep our options open?

Modularizing with Mixins

Ruby is a dynamic language with some pretty awesome metaprogramming features. Let’s use them to make sure that we don’t pigeonhole ourselves into an approach we might later regret. First of all, notice that the Person serialization/unserialization methods created earlier seem awfully similar.

Let’s turn that into a mixin:

require 'json'

#mixin
module BasicSerializable

  #should point to a serializer that responds to dump and load;
  #change it to a different one (e.g. MessagePack, JSON, YAML)
  #to get a different serialization
  @@serializer = JSON

  def serialize
    obj = {}
    instance_variables.each do |var|
      obj[var] = instance_variable_get(var)
    end

    @@serializer.dump obj
  end

  def unserialize(string)
    #load (unlike parse) is offered by JSON, YAML, and MessagePack alike
    obj = @@serializer.load(string)
    obj.keys.each do |key|
      instance_variable_set(key, obj[key])
    end
  end
end

First of all, notice that @@serializer is set to the serializing module. This means that we can change the serialization method for every class that includes this mixin by editing a single line, as long as the replacement offers the same methods that the mixin calls.

Taking a closer look at the code, it simply walks over an object’s instance variables to serialize and unserialize it. In the serialize method:

def serialize
  obj = {}
  instance_variables.each do |var|
    obj[var] = instance_variable_get(var)
  end

  @@serializer.dump obj
end

It loops over the instance_variables and constructs a Ruby hash of the variable names and their values. Then, it simply uses @@serializer to dump out that hash. If the serializing mechanism does not have a dump method, we can simply wrap it in a small adapter that gives it one!
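
Such an adapter can be very small. In the sketch below, PrettyJSONSerializer is a hypothetical name: it wraps two real methods of the json library, JSON.pretty_generate and JSON.parse, behind the names dump and load. Calling the reading method load is an assumption of this sketch, so rename it to match whatever your mixin calls.

```ruby
require 'json'

# Hypothetical adapter: the wrapped JSON methods are real, but they are not
# named dump/load, so this module exposes them under those names.
module PrettyJSONSerializer
  def self.dump(obj)
    JSON.pretty_generate(obj)
  end

  def self.load(string)
    JSON.parse(string)
  end
end

text = PrettyJSONSerializer.dump({ 'name' => 'David', 'age' => 28 })
puts text
p PrettyJSONSerializer.load(text)
```

Because the mixin only ever talks to the serializer through those two methods, any object that responds to them can be dropped in.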

We use a similar approach with the unserialize method:

def unserialize(string)
  #load (unlike parse) is offered by JSON, YAML, and MessagePack alike
  obj = @@serializer.load(string)
  obj.keys.each do |key|
    instance_variable_set(key, obj[key])
  end
end

Here, we use the serializer to get a Ruby hash out of the string and set the object’s instance variables to the values of the hash.

This makes our Person class really easy to implement:

class Person
  include BasicSerializable

  attr_accessor :name, :age, :gender

  def initialize(name, age, gender)
    @name = name
    @age = age
    @gender = gender
  end
end

Notice, we’re just adding the include BasicSerializable line! Let’s test it out:

p = Person.new "David", 28, "male"
p p.serialize

p.unserialize(p.serialize)
puts "Name #{p.name}"
puts "Age #{p.age}"
puts "Gender #{p.gender}"
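
To convince yourself that the format really is swappable, it helps to run the same object through two serializers back to back. The sketch below is a variation on the idea, not the mixin above: SwappableSerializable and Pet are hypothetical names, the serializer is held in a module-level accessor so it can be reassigned at runtime, and the leading "@" is stripped from the keys so the stored names are plain strings in every format.

```ruby
require 'json'
require 'yaml'

# Hypothetical variant of the mixin with a serializer that can be
# reassigned at runtime instead of edited in the source.
module SwappableSerializable
  class << self
    attr_accessor :serializer
  end
  self.serializer = JSON

  def serialize
    state = {}
    instance_variables.each do |var|
      # "@name" is stored under the plain string key "name".
      state[var.to_s.delete_prefix('@')] = instance_variable_get(var)
    end
    SwappableSerializable.serializer.dump(state)
  end

  def unserialize(string)
    SwappableSerializable.serializer.load(string).each do |key, value|
      instance_variable_set("@#{key}", value)
    end
    self # returning self allows chaining after Pet.new
  end
end

class Pet
  include SwappableSerializable
  attr_accessor :name, :age

  def initialize(name = nil, age = nil)
    @name = name
    @age = age
  end
end

results = {}
{ 'JSON' => JSON, 'YAML' => YAML }.each do |label, format|
  SwappableSerializable.serializer = format
  copy = Pet.new.unserialize(Pet.new('Rex', 3).serialize)
  results[label] = [copy.name, copy.age]
  puts "#{label}: #{copy.name}, #{copy.age}"
end
```

A single shared setting keeps the example short. In a larger application you might prefer one setting per class, so that different models can use different formats.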

Now, if you comb through the code carefully (or just understand the underlying concepts), you might notice that the BasicSerializable methods work very well for objects that only have serializable instance variables (i.e. integers, strings, floats, etc. or arrays and hashes of them). However, they will not work for an object that holds other BasicSerializable objects in its instance variables, because the serializer has no idea how to turn those objects into plain data and back.

The easy way to fix this problem is to override the serialize and unserialize methods in such classes, like so:

class People
  include BasicSerializable

  attr_accessor :persons

  def initialize
    @persons = []
  end

  def serialize
    obj = @persons.map do |person|
      person.serialize
    end

    @@serializer.dump obj
  end

  def unserialize(string)
    obj = @@serializer.load string
    @persons = []
    obj.each do |person_string|
      person = Person.new "", 0, ""
      person.unserialize(person_string)
      @persons << person
    end
  end

  def <<(person)
    @persons << person
  end
end
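
The approach above stores each person as an already-serialized string inside the outer document. An alternative, sketched here under made-up names (Member, Team, to_h, from_h), is to have every object describe itself as a plain hash and to serialize only once, at the top level. The result is a single, ordinary JSON document rather than strings within strings.

```ruby
require 'json'

# Member and Team are hypothetical classes used only for this sketch.
class Member
  attr_reader :name, :age

  def initialize(name, age)
    @name = name
    @age = age
  end

  # Describe the object as plain data; no serializer involved yet.
  def to_h
    { 'name' => @name, 'age' => @age }
  end

  def self.from_h(hash)
    new(hash['name'], hash['age'])
  end
end

class Team
  attr_reader :members

  def initialize(members = [])
    @members = members
  end

  # Serialize once, after the nested objects have become hashes.
  def serialize
    JSON.dump({ 'members' => @members.map(&:to_h) })
  end

  def self.unserialize(string)
    data = JSON.parse(string)
    new(data['members'].map { |hash| Member.from_h(hash) })
  end
end

team = Team.new([Member.new('David', 28), Member.new('John', 1)])
restored = Team.unserialize(team.serialize)
puts restored.members.map(&:name).inspect
```

This costs a pair of small methods per class, but the output stays readable and other languages can consume it without a second decoding pass.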

Finishing up

Serialization is a pretty important topic that often goes overlooked. Choosing the right serialization method can make your life much easier when optimization time comes around. Along with our coverage of serialization methods, the modular approach (it may need to be modified for particular applications) can help you change your decision at a later date.

Frequently Asked Questions on Choosing the Right Serialization Format

What are the key factors to consider when choosing a serialization format?

When choosing a serialization format, it’s important to consider several factors. Firstly, the type of data you’re working with is crucial. Some formats are better suited to certain types of data than others. For example, XML is often used for document-like data, while JSON is typically used for structured data such as nested objects and lists. Secondly, consider the language support. Some serialization formats are supported by a wide range of programming languages, while others are not. Thirdly, consider the performance of the format. Some formats are faster to serialize and deserialize than others. Lastly, consider the size of the serialized data. Some formats produce smaller serialized data than others, which can be important if you’re working with large amounts of data or if bandwidth is a concern.

How does JSON compare to other serialization formats?

JSON (JavaScript Object Notation) is a popular serialization format due to its simplicity and readability. It’s supported by many programming languages and is often used for structured data such as nested objects and lists. However, JSON is not the most efficient format in terms of performance or size. It’s generally slower to serialize and deserialize than binary formats, and it produces larger serialized data. If performance or size is a concern, you might want to consider a binary format like MessagePack, Protocol Buffers, or Avro.

What are the advantages of using XML for serialization?

XML (eXtensible Markup Language) is a widely used serialization format that is often used for document-like data. One of the main advantages of XML is its readability. It’s easy to understand and edit by humans, which can be a big advantage for debugging. XML also supports complex data structures and, when paired with a schema language such as XML Schema, can enforce data types, which can help to prevent errors. However, XML is not the most efficient format in terms of performance or size. It’s slower to serialize and deserialize than some other formats, and it produces larger serialized data.

Why might I choose a binary format like Protocol Buffers or Avro?

Binary formats like Protocol Buffers and Avro are often chosen for their efficiency. They are faster to serialize and deserialize than text-based formats like JSON and XML, and they produce smaller serialized data. This can be a big advantage if you’re working with large amounts of data or if bandwidth is a concern. However, binary formats are not as readable or editable by humans as text-based formats, which can make debugging more difficult.

What is the difference between serialization and deserialization?

Serialization is the process of converting data into a format that can be stored or transmitted and then reconstructed later. Deserialization is the reverse process, converting serialized data back into its original form. These processes are used to enable data to be stored and transmitted in a format that is efficient and that can be understood by different systems.

How does language support affect the choice of serialization format?

The choice of serialization format can be influenced by the programming languages that you’re using. Some serialization formats are supported by a wide range of languages, while others are not. If you’re using a language that doesn’t support a particular format, you’ll need to use a different format or find a way to add support for that format to your language.

How does the size of serialized data affect performance?

The size of serialized data can have a big impact on performance. Larger data takes longer to serialize and deserialize, and it takes up more storage space. It also requires more bandwidth to transmit. Therefore, if you’re working with large amounts of data or if bandwidth is a concern, you might want to choose a serialization format that produces smaller serialized data.

What is the role of serialization in distributed systems?

In distributed systems, data needs to be transmitted between different systems. Serialization is used to convert the data into a format that can be transmitted and understood by the different systems. This enables data to be shared and processed across the distributed system.

How does the complexity of data structures affect the choice of serialization format?

The complexity of your data structures can influence your choice of serialization format. Some formats, like XML, support complex data structures and, with a schema, typed validation, which can help to prevent errors. Other formats, like JSON, are simpler: they offer a small set of types (objects, arrays, strings, numbers, booleans, and null), which is enough for most structured data.

What are some common use cases for different serialization formats?

Different serialization formats are suited to different use cases. For example, XML is often used for document-like data and in situations where readability and the ability to edit the data by hand are important. JSON is often used for structured data such as web API payloads, and in situations where simplicity and wide language support are important. Binary formats like MessagePack, Protocol Buffers, and Avro are often used in situations where performance and the size of the serialized data are important, such as in large-scale distributed systems.