# Data Serialization Formats Guide: Parquet, Avro, Proto, and More
In the world of data engineering and software development, how we store and transmit data is critical. While JSON is the king of web APIs due to its readability, it is often too bulky or slow for big data processing, high-frequency messaging, or specialized debugging.
This guide explores the specialized file extensions used for data serialization beyond basic JSON and CSV.
## Quick Reference Table: Data Serialization Formats
| Extension | Full Name | Format | Primary Use Case |
|---|---|---|---|
| `.ndjson`, `.jsonl` | Newline Delimited JSON | Text (UTF-8) | Log files, data streaming, big data imports |
| `.parquet` | Apache Parquet | Binary (Columnar) | Big data analytics (Hadoop, Spark, AWS S3) |
| `.avro` | Apache Avro | Binary (Row-based) | Data serialization with schemas (Kafka) |
| `.proto` | Protocol Buffers | Text (DSL) | Defining gRPC interfaces and data structures |
| `.bson` | Binary JSON | Binary | MongoDB storage and data exchange |
| `.cbor` | Concise Binary Object Representation | Binary | IoT, low-bandwidth environments |
| `.har` | HTTP Archive | Text (JSON) | Debugging network requests in browsers |
| `.edn` | Extensible Data Notation | Text (Lisp-like) | Clojure ecosystem, metadata configuration |
## 1. Stream-Friendly JSON (.ndjson, .jsonl)
Standard JSON requires the whole file to be read into memory before it can be parsed (the document starts with `[` and ends with `]`). That is impractical for a 100 GB log file.
- NDJSON / JSONL: Each line is a valid, independent JSON object.
- Why use it? You can read a file line-by-line without loading the entire thing. If the file is truncated or corrupted, you only lose the last line, not the whole data set.
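The line-by-line property is easy to demonstrate with nothing but the standard library (the file name and records below are made up for illustration):

```python
import json

# Write three records as NDJSON: one complete JSON object per line.
records = [
    {"id": 1, "event": "login"},
    {"id": 2, "event": "click"},
    {"id": 3, "event": "logout"},
]
with open("events.ndjson", "w") as f:
    for rec in records:
        f.write(json.dumps(rec) + "\n")

# Read it back one line at a time -- the whole file never has to fit in memory.
with open("events.ndjson") as f:
    for line in f:
        rec = json.loads(line)
        print(rec["event"])
```

If the last line were truncated, only that one record would fail to parse; everything before it would still load cleanly.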
## 2. Big Data Columnar Storage (.parquet, .orc)
Traditional databases store data row by row. Analytical engines often prefer columnar storage.
- Parquet: The industry standard for big data. Because it stores data by column, it can compress data much more effectively and allows you to "skip" columns you don't need for a specific query.
- ORC: Optimized Row Columnar. Similar to Parquet but primarily used in the Apache Hive ecosystem.
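The row-versus-column distinction can be sketched in plain Python (this illustrates the layout idea only, not the actual Parquet file format):

```python
# Row-based layout: each record keeps all of its fields together.
rows = [
    {"name": "Ana", "country": "BR", "age": 34},
    {"name": "Ben", "country": "BR", "age": 28},
    {"name": "Cho", "country": "KR", "age": 41},
]

# Columnar layout: one array per column, which is how Parquet
# arranges values on disk.
columns = {
    "name":    [r["name"] for r in rows],
    "country": [r["country"] for r in rows],
    "age":     [r["age"] for r in rows],
}

# A query that only needs "age" touches a single contiguous array
# and can skip "name" and "country" entirely.
average_age = sum(columns["age"]) / len(columns["age"])
print(average_age)
```

Grouping all values of one column together is also what makes the aggressive compression described below possible.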
## 3. Schema-First Serialization (.avro, .proto)
Unlike JSON where the key names ("first_name") are repeated in every single record, schema-first formats separate the "rules" from the "data."
- Avro: The schema is stored as JSON, but the data is binary. It's the standard for Apache Kafka because it's fast and supports schema evolution.
- Protobuf (.proto): Developed by Google. You define your data structure in a `.proto` file, and a compiler then generates code for your preferred language. It's the backbone of gRPC.
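As a sketch, a minimal proto3 definition might look like this (the `User` message and its fields are hypothetical):

```protobuf
syntax = "proto3";

package example;

// A hypothetical user record. The numbers are field tags used
// in the binary wire format; they are not default values.
message User {
  string first_name = 1;
  string last_name  = 2;
  int32  id         = 3;
}
```

Compiling this with `protoc` generates classes in your target language; only the numeric field tags (1, 2, 3) appear on the wire, never the field names.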
## 4. Binary JSON Alternatives (.bson, .cbor, .msgpack)
If you like the flexibility of JSON but need more speed or smaller file sizes, binary formats are the answer.
- BSON: Used internally by MongoDB. It supports more data types than JSON (like Date and Binary data).
- CBOR: Designed to be extremely small and efficient. It's widely used in IoT (Internet of Things) devices where every byte of bandwidth counts.
- MessagePack: Similar to CBOR; its motto is "like JSON, but fast and small."
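To make the size difference concrete, here is a toy encoder for a tiny subset of CBOR (RFC 8949: small unsigned integers, short text strings, and small maps). It is a sketch for illustration; production code should use a real library such as `cbor2`:

```python
import json

def cbor_encode(obj):
    """Encode a tiny CBOR subset: ints 0-23, short strings, small dicts."""
    if isinstance(obj, int) and 0 <= obj <= 23:
        return bytes([obj])                       # major type 0: unsigned int
    if isinstance(obj, str) and len(obj.encode()) <= 23:
        data = obj.encode("utf-8")
        return bytes([0x60 | len(data)]) + data   # major type 3: text string
    if isinstance(obj, dict) and len(obj) <= 23:
        out = bytes([0xA0 | len(obj)])            # major type 5: map
        for k, v in obj.items():
            out += cbor_encode(k) + cbor_encode(v)
        return out
    raise ValueError("unsupported in this toy encoder")

doc = {"a": 1, "b": 2}
as_json = json.dumps(doc, separators=(",", ":")).encode()  # 13 bytes
as_cbor = cbor_encode(doc)                                 # 7 bytes
print(len(as_json), len(as_cbor))
```

Even on this tiny document the binary form is roughly half the size, because type and length information packs into single bytes instead of quotes, colons, and braces.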
## 5. Specialized Formats (.har, .edn)
- HAR (HTTP Archive): If you've ever exported a network trace from Chrome or Firefox DevTools, it's a `.har` file. It's actually just a massive JSON file containing every header, cookie, and response body from your browsing session.
- EDN: Used primarily in the Clojure world. It's more powerful than JSON because it natively supports custom types (tags) and more complex data structures.
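Because a HAR file is plain JSON, you can mine one with a few lines of stdlib Python. The snippet below assumes the standard `log.entries` layout; the sample data is invented:

```python
import json

# A minimal, made-up HAR document; real exports are far larger.
har_text = """
{
  "log": {
    "entries": [
      {"request": {"method": "GET", "url": "https://example.com/"},
       "response": {"status": 200}},
      {"request": {"method": "GET", "url": "https://example.com/app.js"},
       "response": {"status": 404}}
    ]
  }
}
"""

har = json.loads(har_text)
# List every request that did not return 200.
for entry in har["log"]["entries"]:
    if entry["response"]["status"] != 200:
        print(entry["request"]["url"], entry["response"]["status"])
```

The same pattern scales to real exports: filter on status codes, sum transfer sizes, or extract slow requests.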
## How to View These Files
- NDJSON/JSONL: Use any text editor or the `jq` command-line tool.
- Parquet: Requires specialized viewers or Python libraries (like `pandas` or `fastparquet`).
- Protobuf: You usually need the `.proto` definition file to decode the binary data.
- HAR: Drag and drop it back into Chrome/Firefox DevTools, or use an online HAR viewer.
## Common Questions (FAQ)
Q: Why is my Parquet file smaller than my CSV?
A: Parquet uses advanced compression techniques like "dictionary encoding" and "run-length encoding." Since it stores data by column, values in a column (like "Country") are often identical, allowing for massive compression ratios compared to text files.
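The effect of run-length encoding on a repetitive column can be sketched in a few lines (an illustration of the idea; Parquet's real encodings are more elaborate):

```python
def run_length_encode(values):
    """Collapse consecutive repeats into [value, count] pairs."""
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1      # extend the current run
        else:
            runs.append([v, 1])   # start a new run
    return runs

# A "Country" column as it appears when stored column-wise:
# identical values cluster together, so runs are long.
country = ["US"] * 5 + ["BR"] * 3 + ["US"] * 2
print(run_length_encode(country))  # [['US', 5], ['BR', 3], ['US', 2]]
```

Ten values collapse to three pairs here; in a real column with millions of rows per country, the savings are dramatic, and dictionary encoding shrinks the repeated strings themselves.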
Q: Can I edit a .proto file?
A: Yes, .proto files are human-readable text files where you define your data structures. However, you cannot directly "edit" the binary data produced from it—you must use the compiled code.
Q: Is .bson the same as JSON?
A: Not quite. BSON is a binary representation that includes JSON-like data but also adds types like Buffer, Long, and Decimal128 which standard JSON doesn't support.
## Related Tools on Tool3M
- JSON to CSV Converter: Convert your structured data into a flat table.
- Binary Serialization Guide: Learn more about Protobuf and MessagePack.