JSON and the Confusion of Formats in Big Data

In the era of big data, the choices of data formats are dazzling, and the concept of data format itself can be confusing. Our data platform team found it helpful to breakdown this topic based on the three major stages in the life cycle of data: in-memory representation (logical format), on-the-wire serialization (exchange format), and on disk “Big Data” (storage format). All these stages are necessary for extracting value from data. JSON, or JavaScript Object Notation, is a major player in all three stages. In this post, we’ll help clarify what these formats are about, what roles JSON plays, and why JSON and related technologies are important.

In-Memory Logical Format

Logical format focuses on the abstract concepts in a domain. It defines the space of possible interactions between the declarative data (representing domain knowledge, state, etc.) and the application logic. The picture shows a specific way to represent the concept that Peter and Mary are friends.

More generally, the domain data model can be represented by graphs or even hypergraphs, tensors, etc. Although graph-based data models, programming models, and statistical models are on the rise, most applications still use relational databases. For example, the ER (entity-relationship) model uses a bipartite graph of entities and relationships for data modeling, from the concept to the design of database tables. At the source code level, data models must be represented by data structures in the host language. Techniques like ORM (object-relational mapping) can bridge object-oriented programming (OOP) languages and relational databases within logical data models.

The logical format of data lives within the applications, whose incarnations are "data structures." You can define a logical format in any programming language, or in standard declarative IDL (interface description language) or DDL (data definition language). An instance of data can be a string, a collection of key-value pairs, a tuple with named slots (as in relational data models), a tree with named branches (like a JSON object), a graph, or an instance of Java class. There can be many logically equivalent implementations in the same programming language or across different programming languages.

JSON provides a set of rules to encode general data structures into a well-defined physical form. It’s a simple yet expressive language based on the type system of JavaScript. JSON is able to express rich structural semantics with little space taken by syntactic notations, striking a balance among readability, compactness and expressiveness. For data to enter the other two important stages, JSON is the natural choice since JavaScript is so common on the internet, and JSON can be converted to and from runtime JavaScript objects seamlessly.

On-the-Wire Exchange Format

We often need to send data from inside an application across the wire to be consumed by other applications. As data leaves our host's in-memory data structure, it must convert to a common exchange format to preserve the information. Then it’s converted back to data structures inside other consumer applications. These conversion processes are called serialization/deserialization (SerDes), also known as the "wire protocol."

JSON is a commonly used exchange format that is not only self-describing, but also human readable. This has earned JSON the reputation of being "schema-less" - which is a bit misleading. JSON is only considered schema-less because the schema is already encoded with each record of the data, enabling us to automate schema management.

On the other hand, JSON is not the most compact representation, which can slow down the data exchanging process when traveling across the wire. BSON, the underlying storage format of MongoDB, uses packed binary format to reduce the message size. Storage format, in the simplest form, can be a trivial extension to the exchange format. We’ll cover storage format next. Frameworks like Protobuf, Apache Thrift (which comes with configurable serialization protocols, including JSON protocol, plus an entire RPC framework), and Apache Avro support more compact messages by stripping out schema from the data. So it’s helpful to evaluate where is the bottleneck when you’re choosing exchange formats.

In addition to the general SerDes frameworks, distributed application frameworks often use their own ad hoc exchange format. Apache Spark, for example, uses java serialization to send shuffled data across the network. One of the improvements introduced in Project Tungsten, which focuses on Spark performance optimization, uses special code-generated serializers so the schema only needs to be sent once during the entire shuffle, as opposed to for each row of data.

On Disk Storage Format

Data powers business value and needs to be saved. Data storage format, aka file format, defines how information is encoded in files on cold storage. There are many standard and proprietary file formats for different types of data, such as video, image, and application specific data. General purpose or specialized compression is also an important element of storage format.

In the world of big data, a data file is a collection of samples (rows, or records). The most natural way of creating a data file is to simply append each datum's byte representation with a line separator. For example, the JSON lines text file format is a storage format, which is also called newline-delimited JSON. The CSV format consists of lines of comma separated fields with an optional header. Similarly, all the exchange formats we mentioned earlier can be used in storage format. For example, an Avro file is just Avro data plus the schema. This type of file format is called row based format, where data is stored (you guessed it) row by row.

Scalability and Efficiency

At Credit Karma, we process terabytes of data with billions of records on a daily basis, so it’s important for us to design for scale and efficiency. The common strategy for scalability is to save data in partitioned files to support parallel data processing. In addition, a variety of technologies can improve the efficiency of processing individual file partitions. The main optimization considerations are read vs write, and space vs time.

Apache Hadoop sequence file is an example of row based file format, which uses large blocks to optimize read throughput. It also supports compression and custom SerDes. Row based format can become more sophisticated by integrating indexes to facilitate filter based queries, such as the data files of various database implementations.

Columnar databases have become popular due to the presence of wide tables. BigTable and Apache HBase are well known examples. Standardized column based file formats are available as modular components independent of query engines. Parquet and ORC (Optimized Row Columnar) file formats are widely adopted options. Tableau Data Extract (TDE) is also a columnar data storage format, but it’s not an open standard.

In addition to the common techniques, column based formats, as suggested by the name, optimize for read performance by loading only the columns specified in the query. However, there is inevitably more upfront cost associated with writing the data. Parquet file, for example, requires nontrivial pre-processing to build. Parquet also supports structured encoding techniques such as dictionary encoding and run-length encoding. General purpose compression such as gzip can be applied on top of Parquet for additional space saving. In general, Parquet is optimized for read to support OLAP (Online Analytical Processing) use cases.

Optimization can sacrifice data accessibility. JSON as a simple but not so efficient format is very accessible - it is supported by all major big data query engines, such as Apache Hive and SparkSQL which can directly query JSON files. On the other hand, for performance-optimized formats such as Thrift/Protobuf, there are more obstacles to access and analyze the data. Nevertheless, if the data of interest is already produced as Thrift/Protobuf, it may be worth the extra effort to use them directly as data storage format.

In summary, the challenge and necessity of manipulating high volume of data leads to many advanced data storage formats beyond plain JSON, which are shown in the picture. Storage space, speed of processing, and ease of access are all important considerations that are often at odds with each other.

Conclusion

Let's bring this all together. Why does JSON cause confusion? When should we use JSON and when should we consider alternatives?

  1. For logical format, JSON schema has major advantages over the traditional flat table schema, and JSON has become a key element of the NoSQL movement. JSON represents the more general concept of “nested semi-structured schema.” It’s more flexible than flat schema and should be used whenever possible.
  2. For exchange format, JSON is the vanilla option, and frameworks like Thrift and Protobuf should be used if runtime performance is the priority. Still, JSON REST API is everywhere due to its simplicity.
  3. For storage format, JSON is a commonly used file format and is supported by most NoSQL solutions as a data source format. JSON is naturally the raw data for “source of truth”, which is always needed, regardless of whether query optimized formats such as Parquet are used or not.

I hope you find this useful. Let us know if you’ve also been bothered by the potential confusions and how you’ve tackled it at @CreditKarmaEng.

About the Author

Yongjia is a big data and machine learning technologist focused on data engineering at Credit Karma.