data serialization parquet protobuf big-data json binary file-extensions

Data Serialization Formats Guide: Parquet, Avro, Proto, and More

From readable JSON to high-performance binary formats. Learn the differences between .parquet, .avro, .proto, .ndjson, .cbor, .bson, and .har file extensions.

2026-04-11


In the world of data engineering and software development, how we store and transmit data is critical. While JSON is the king of web APIs due to its readability, it is often too bulky or slow for big data processing, high-frequency messaging, or specialized debugging.

This guide explores the specialized file extensions used for data serialization beyond basic JSON and CSV.


Quick Reference Table: Data Serialization Formats

Extension Full Name Format Primary Use Case
.ndjson, .jsonl Newline Delimited JSON Text (UTF-8) Log files, data streaming, big data imports
.parquet Apache Parquet Binary (Columnar) Big Data analytics (Hadoop, Spark, AWS S3)
.avro Apache Avro Binary (Row-based) Data serialization with schemas (Kafka)
.proto Protocol Buffers Text (DSL) Defining gRPC interfaces and data structures
.bson Binary JSON Binary MongoDB storage and data exchange
.cbor Concise Binary Object Representation Binary IoT, low-bandwidth environments
.har HTTP Archive Text (JSON) Debugging network requests in browsers
.edn Extensible Data Notation Text (Lisp-like) Clojure ecosystem, metadata configuration

1. Stream-Friendly JSON (.ndjson, .jsonl)

Standard JSON typically requires the whole file to be read into memory before it can be parsed (an array starts with [ and is only valid once the closing ] arrives). That is impractical for a 100 GB log file.

  • NDJSON / JSONL: Each line is a valid, independent JSON object.
  • Why use it? You can read a file line-by-line without loading the entire thing. If the file is truncated or corrupted, you only lose the last line, not the whole data set.
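As a minimal sketch in Python, streaming NDJSON needs nothing beyond the standard json module; the sample records below are invented for illustration:

```python
import io
import json

# Invented sample data: three independent JSON records, one per line.
NDJSON_SAMPLE = (
    '{"event": "login", "user": 1}\n'
    '{"event": "click", "user": 1}\n'
    '{"event": "logout", "user": 1}\n'
)

def iter_records(stream):
    """Yield one parsed record per line; never loads the whole file."""
    for line in stream:
        line = line.strip()
        if line:                     # skip blank lines
            yield json.loads(line)

for record in iter_records(io.StringIO(NDJSON_SAMPLE)):
    print(record["event"])
```

In production you would pass an open file object instead of io.StringIO, and a truncated final line is the only record at risk.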

2. Big Data Columnar Storage (.parquet, .orc)

Traditional databases store data row by row. Analytical engines often prefer columnar storage instead.

  • Parquet: The industry standard for big data. Because it stores data by column, it can compress data much more effectively and allows you to "skip" columns you don't need for a specific query.
  • ORC: Optimized Row Columnar. Similar to Parquet but primarily used in the Apache Hive ecosystem.

3. Schema-First Serialization (.avro, .proto)

Unlike JSON where the key names ("first_name") are repeated in every single record, schema-first formats separate the "rules" from the "data."

  • Avro: The schema is stored as JSON, but the data is binary. It is widely used with Apache Kafka (typically via a schema registry) because it is compact and supports schema evolution.
  • Protobuf (.proto): Developed by Google. You define your data structure in a .proto file, and then a compiler generates code for your preferred language. It's the backbone of gRPC.
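As an illustration, a hypothetical Person message might look like this in a .proto file (the message name, field names, and field numbers here are invented):

```protobuf
syntax = "proto3";

// Field numbers, not field names, identify each field on the wire --
// which is why Protobuf records are so much smaller than JSON.
message Person {
  string first_name = 1;
  string last_name  = 2;
  int32  id         = 3;
}
```

Running the compiler, e.g. `protoc --python_out=. person.proto`, would then generate classes for reading and writing Person records in your chosen language.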

4. Binary JSON Alternatives (.bson, .cbor, .msgpack)

If you like the flexibility of JSON but need more speed or smaller file sizes, binary formats are the answer.

  • BSON: Used internally by MongoDB. It supports more data types than JSON (like Date and Binary data).
  • CBOR: Designed to be extremely small and efficient. It's widely used in IoT (Internet of Things) devices where every byte of bandwidth counts.
  • MessagePack: Similar to CBOR, it's "like JSON but fast and small."
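To make the size difference concrete, here is a toy CBOR encoder covering only small non-negative integers, short strings, and small maps. A real application would use a library such as cbor2; this sketch exists purely to show how compact the wire format is:

```python
import json

def cbor_encode(obj):
    """Toy CBOR encoder: small non-negative ints, short strings, small maps only."""
    if isinstance(obj, int) and not isinstance(obj, bool) and 0 <= obj <= 23:
        return bytes([obj])                          # major type 0: tiny unsigned int
    if isinstance(obj, str):
        data = obj.encode("utf-8")
        if len(data) <= 23:
            return bytes([0x60 | len(data)]) + data  # major type 3: short text string
    if isinstance(obj, dict) and len(obj) <= 23:
        out = bytes([0xA0 | len(obj)])               # major type 5: small map
        for key, value in obj.items():
            out += cbor_encode(key) + cbor_encode(value)
        return out
    raise ValueError("value not supported by this sketch")

record = {"id": 7, "tag": "ok"}
as_json = json.dumps(record).encode()
as_cbor = cbor_encode(record)
print(len(as_json), "bytes as JSON")   # 22 bytes
print(len(as_cbor), "bytes as CBOR")   # 12 bytes
```

Even on this tiny record, dropping the quotes, colons, and braces nearly halves the payload, which is exactly the saving IoT devices are after.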

5. Specialized Formats (.har, .edn)

  • HAR (HTTP Archive): If you've ever "exported" a network trace from Chrome or Firefox DevTools, it's a .har file. It's actually just a massive JSON file containing every header, cookie, and response body from your browsing session.
  • EDN: Used primarily in the Clojure world. It's more powerful than JSON because it supports custom types (tags) and more complex data structures natively.
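Because a HAR file is plain JSON, you can mine it with nothing but the standard library. The sketch below builds a minimal hand-made HAR structure (real exports contain far more detail) and pulls out the failed requests:

```python
import json

# Minimal hand-made HAR structure; real browser exports are much richer.
har_text = json.dumps({
    "log": {
        "version": "1.2",
        "entries": [
            {"request": {"method": "GET", "url": "https://example.com/"},
             "response": {"status": 200}},
            {"request": {"method": "GET", "url": "https://example.com/app.js"},
             "response": {"status": 404}},
        ],
    }
})

def failed_requests(har_json):
    """Return the URLs of entries whose response status is >= 400."""
    log = json.loads(har_json)["log"]
    return [entry["request"]["url"]
            for entry in log["entries"]
            if entry["response"]["status"] >= 400]

print(failed_requests(har_text))
```

The same pattern works on a real export: replace har_text with the contents of a .har file saved from DevTools.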

How to View These Files

  • NDJSON/JSONL: Use any text editor or the jq command-line tool.
  • Parquet: Requires specialized viewers or Python libraries (like pandas or fastparquet).
  • Protobuf: You usually need the .proto definition file to decode the binary data.
  • HAR: You can drag and drop it back into Chrome/Firefox DevTools or use an Online HAR Viewer.

Common Questions (FAQ)

Q: Why is my Parquet file smaller than my CSV?

A: Parquet uses advanced compression techniques like "dictionary encoding" and "run-length encoding." Since it stores data by column, values in a column (like "Country") are often identical, allowing for massive compression ratios compared to text files.
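The run-length idea is easy to demonstrate in a few lines of Python (this is a simplification of what Parquet actually does, shown only to illustrate why repetitive columns compress so well):

```python
def run_length_encode(values):
    """Collapse consecutive repeats into [value, count] pairs."""
    runs = []
    for value in values:
        if runs and runs[-1][0] == value:
            runs[-1][1] += 1         # extend the current run
        else:
            runs.append([value, 1])  # start a new run
    return runs

# A sorted "Country" column: 10 values shrink to 3 pairs.
column = ["US"] * 5 + ["DE"] * 3 + ["FR"] * 2
print(run_length_encode(column))   # [['US', 5], ['DE', 3], ['FR', 2]]
```

A row-oriented file interleaves countries with other fields, breaking up these runs; storing the column contiguously is what makes the encoding pay off.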

Q: Can I edit a .proto file?

A: Yes, .proto files are human-readable text files where you define your data structures. However, you cannot hand-edit the binary data serialized from those definitions; you read and write it through the generated code.

Q: Is .bson the same as JSON?

A: Not quite. BSON is a binary representation that includes JSON-like data but also adds types like Buffer, Long, and Decimal128 which standard JSON doesn't support.


Related Tools on Tool3M

  • JSON to CSV Converter: Convert your structured data into a flat table.
  • Binary Serialization Guide: Learn more about Protobuf and MessagePack.