# Data Serialization Formats Guide: Parquet, Avro, Proto, and More
In the world of data engineering and software development, how we store and transmit data is critical. While JSON is the king of web APIs due to its readability, it is often too bulky or slow for big data processing, high-frequency messaging, or specialized debugging.
This guide explores the specialized file extensions used for data serialization beyond basic JSON and CSV.
## Quick Reference Table: Data Serialization Formats
| Extension | Full Name | Format | Primary Use Case |
|---|---|---|---|
| `.ndjson`, `.jsonl` | Newline Delimited JSON | Text (UTF-8) | Log files, data streaming, big data imports |
| `.parquet` | Apache Parquet | Binary (Columnar) | Big data analytics (Hadoop, Spark, AWS S3) |
| `.avro` | Apache Avro | Binary (Row-based) | Data serialization with schemas (Kafka) |
| `.proto` | Protocol Buffers | Text (DSL) | Defining gRPC interfaces and data structures |
| `.bson` | Binary JSON | Binary | MongoDB storage and data exchange |
| `.cbor` | Concise Binary Object Representation | Binary | IoT, low-bandwidth environments |
| `.har` | HTTP Archive | Text (JSON) | Debugging network requests in browsers |
| `.edn` | Extensible Data Notation | Text (Lisp-like) | Clojure ecosystem, metadata configuration |
## 1. Stream-Friendly JSON (.ndjson, .jsonl)
Standard JSON requires the whole file to be read into memory before it can be parsed (the document starts with `[` and ends with `]`). That is impractical for a 100 GB log file.
- NDJSON / JSONL: Each line is a valid, independent JSON object.
- Why use it? You can read a file line-by-line without loading the entire thing. If the file is truncated or corrupted, you only lose the last line, not the whole data set.
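The line-by-line property is easy to demonstrate with nothing but the standard library (the file name and records below are made up for illustration):

```python
import json

# Write three records as NDJSON: one complete JSON object per line.
records = [
    {"id": 1, "event": "login"},
    {"id": 2, "event": "click"},
    {"id": 3, "event": "logout"},
]
with open("events.ndjson", "w") as f:
    for rec in records:
        f.write(json.dumps(rec) + "\n")

# Read it back one line at a time -- the whole file never has to fit in memory.
with open("events.ndjson") as f:
    for line in f:
        rec = json.loads(line)
        print(rec["event"])
```

If the last line were truncated, only that one record would fail to parse; everything before it would still load cleanly.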
## 2. Big Data Columnar Storage (.parquet, .orc)
Traditional databases store data row by row. Analytical engines often prefer columnar storage.
- Parquet: The industry standard for big data. Because it stores data by column, it can compress data much more effectively and allows you to "skip" columns you don't need for a specific query.
- ORC: Optimized Row Columnar. Similar to Parquet but primarily used in the Apache Hive ecosystem.
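The row-versus-column distinction can be sketched in plain Python (this illustrates the layout idea only, not the actual Parquet file format):

```python
# Row-based layout: each record keeps all of its fields together.
rows = [
    {"name": "Ana", "country": "BR", "age": 34},
    {"name": "Ben", "country": "BR", "age": 28},
    {"name": "Cho", "country": "KR", "age": 41},
]

# Columnar layout: one array per column, which is how Parquet
# arranges values on disk.
columns = {
    "name":    [r["name"] for r in rows],
    "country": [r["country"] for r in rows],
    "age":     [r["age"] for r in rows],
}

# A query that only needs "age" touches a single contiguous array
# and can skip "name" and "country" entirely.
average_age = sum(columns["age"]) / len(columns["age"])
print(average_age)
```

Grouping all values of one column together is also what makes the aggressive compression described below possible.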
## 3. Schema-First Serialization (.avro, .proto)
Unlike JSON where the key names ("first_name") are repeated in every single record, schema-first formats separate the "rules" from the "data."
- Avro: The schema is stored as JSON, but the data is binary. It's the standard for Apache Kafka because it's fast and supports schema evolution.
- Protobuf (.proto): Developed by Google. You define your data structure in a `.proto` file, and a compiler then generates code for your preferred language. It's the backbone of gRPC.
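As a sketch, a minimal proto3 definition might look like this (the `User` message and its fields are hypothetical):

```protobuf
syntax = "proto3";

package example;

// A hypothetical user record. The numbers are field tags used
// in the binary wire format; they are not default values.
message User {
  string first_name = 1;
  string last_name  = 2;
  int32  id         = 3;
}
```

Compiling this with `protoc` generates classes in your target language; only the numeric field tags (1, 2, 3) appear on the wire, never the field names.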
## 4. Binary JSON Alternatives (.bson, .cbor, .msgpack)
If you like the flexibility of JSON but need more speed or smaller file sizes, binary formats are the answer.
- BSON: Used internally by MongoDB. It supports more data types than JSON (like Date and Binary data).
- CBOR: Designed to be extremely small and efficient. It's widely used in IoT (Internet of Things) devices where every byte of bandwidth counts.
- MessagePack: Similar to CBOR; its motto is "like JSON, but fast and small."
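To make the size difference concrete, here is a toy encoder for a tiny subset of CBOR (RFC 8949: small unsigned integers, short text strings, and small maps). It is a sketch for illustration; production code should use a real library such as `cbor2`:

```python
import json

def cbor_encode(obj):
    """Encode a tiny CBOR subset: ints 0-23, short strings, small dicts."""
    if isinstance(obj, int) and 0 <= obj <= 23:
        return bytes([obj])                       # major type 0: unsigned int
    if isinstance(obj, str) and len(obj.encode()) <= 23:
        data = obj.encode("utf-8")
        return bytes([0x60 | len(data)]) + data   # major type 3: text string
    if isinstance(obj, dict) and len(obj) <= 23:
        out = bytes([0xA0 | len(obj)])            # major type 5: map
        for k, v in obj.items():
            out += cbor_encode(k) + cbor_encode(v)
        return out
    raise ValueError("unsupported in this toy encoder")

doc = {"a": 1, "b": 2}
as_json = json.dumps(doc, separators=(",", ":")).encode()  # 13 bytes
as_cbor = cbor_encode(doc)                                 # 7 bytes
print(len(as_json), len(as_cbor))
```

Even on this tiny document the binary form is roughly half the size, because type and length information packs into single bytes instead of quotes, colons, and braces.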
## 5. Specialized Formats (.har, .edn)
- HAR (HTTP Archive): If you've ever exported a network trace from Chrome or Firefox DevTools, it's a `.har` file. It's actually just a massive JSON file containing every header, cookie, and response body from your browsing session.
- EDN: Used primarily in the Clojure world. It's more powerful than JSON because it natively supports custom types (tags) and more complex data structures.
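Because a HAR file is plain JSON, you can mine one with a few lines of stdlib Python. The snippet below assumes the standard `log.entries` layout; the sample data is invented:

```python
import json

# A minimal, made-up HAR document; real exports are far larger.
har_text = """
{
  "log": {
    "entries": [
      {"request": {"method": "GET", "url": "https://example.com/"},
       "response": {"status": 200}},
      {"request": {"method": "GET", "url": "https://example.com/app.js"},
       "response": {"status": 404}}
    ]
  }
}
"""

har = json.loads(har_text)
# List every request that did not return 200.
for entry in har["log"]["entries"]:
    if entry["response"]["status"] != 200:
        print(entry["request"]["url"], entry["response"]["status"])
```

The same pattern scales to real exports: filter on status codes, sum transfer sizes, or extract slow requests.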
## How to View These Files
- NDJSON/JSONL: Use any text editor or the `jq` command-line tool.
- Parquet: Requires specialized viewers or Python libraries (like `pandas` or `fastparquet`).
- Protobuf: You usually need the `.proto` definition file to decode the binary data.
- HAR: Drag and drop it back into Chrome/Firefox DevTools, or use an online HAR viewer.
## Common Questions (FAQ)
Q: Why is my Parquet file smaller than my CSV?
A: Parquet uses advanced compression techniques like "dictionary encoding" and "run-length encoding." Since it stores data by column, values in a column (like "Country") are often identical, allowing for massive compression ratios compared to text files.
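The effect of run-length encoding on a repetitive column can be sketched in a few lines (an illustration of the idea; Parquet's real encodings are more elaborate):

```python
def run_length_encode(values):
    """Collapse consecutive repeats into [value, count] pairs."""
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1      # extend the current run
        else:
            runs.append([v, 1])   # start a new run
    return runs

# A "Country" column as it appears when stored column-wise:
# identical values cluster together, so runs are long.
country = ["US"] * 5 + ["BR"] * 3 + ["US"] * 2
print(run_length_encode(country))  # [['US', 5], ['BR', 3], ['US', 2]]
```

Ten values collapse to three pairs here; in a real column with millions of rows per country, the savings are dramatic, and dictionary encoding shrinks the repeated strings themselves.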
Q: Can I edit a .proto file?
A: Yes, .proto files are human-readable text files where you define your data structures. However, you cannot directly "edit" the binary data produced from it—you must use the compiled code.
Q: Is .bson the same as JSON?
A: Not quite. BSON is a binary representation that includes JSON-like data but also adds types like Buffer, Long, and Decimal128 which standard JSON doesn't support.
## Related Tools on Tool3M
- JSON to CSV Converter: Convert your structured data into a flat table.
- Binary Serialization Guide: Learn more about Protobuf and MessagePack.