Text Diff Checker: Compare and Find Differences Between Text Instantly

Introduction

Every time you open a pull request, review a document revision, or resolve a merge conflict, you are interacting with a text diff. A diff (short for difference) is a representation of the changes between two versions of text — showing what was added, what was removed, and what stayed the same.

Text diff tools are foundational to modern software development. They power version control systems, code review platforms, collaboration tools, and deployment pipelines. Understanding how diffs work — not just how to read them, but how the algorithms behind them function — makes you a more effective developer and a more thoughtful collaborator.

This article takes you from the Unix origins of diff in 1974 all the way through modern algorithms used by git today, explaining diff formats, three-way merges, visualization strategies, and practical best practices along the way.

A Brief History of diff

1974 — The Birth of `diff`

Doug McIlroy and James Hunt wrote the original diff utility at Bell Labs, and it shipped with the 5th Edition of Unix in 1974. It gave developers an automatic way to compare two text files and produce a structured description of their differences. This was immediately useful for distributing software patches and tracking changes to source code.

1984 — GNU diff

The GNU Project released GNU diff, now maintained as part of GNU diffutils, making a portable and improved version available to everyone. GNU diff helped popularize two additional output formats that became industry standards: context diff, which originated in BSD Unix, and unified diff, which Wayne Davison introduced in 1990 and GNU diff adopted soon afterward.

1986 — The Myers Algorithm

Eugene Myers published his landmark paper "An O(ND) Difference Algorithm and Its Variations" in 1986. This algorithm, which finds the Shortest Edit Script (SES), became the theoretical backbone of most modern diff implementations, including git.

1990s — diff/patch as the Universal Patch Format

The combination of diff and patch became the de-facto standard for distributing software updates. Open-source projects circulated .patch files, contributors mailed diffs to mailing lists, and the Linux kernel was developed and maintained almost entirely through emailed diff/patch workflows.

2005 — Git and Myers Diff

Linus Torvalds created git in 2005, and it adopted the Myers algorithm as its default diff engine. Git's diff subsystem has since become one of the most widely used diff implementations, running behind platforms like GitHub and GitLab.

2010s — Histogram Diff and Web Visualization

Git added the histogram diff algorithm in version 1.7.7 (2011) as an opt-in alternative to its Myers default. At the same time, web-based diff visualization flourished: GitHub's split-view and inline PR diffs, GitLab's review tools, and Gerrit's change tracking all brought diff output to a mass audience of developers.

Diff Algorithms

LCS — Longest Common Subsequence

The classical approach to computing a diff is based on the Longest Common Subsequence (LCS) of two sequences. The LCS is the longest sequence of elements that appear in the same relative order in both inputs, though not necessarily contiguously.

Example:

String A = "ABCBDAB"
String B = "BDCAB"
LCS = "BCAB" (length 4; "BDAB" is an equally long alternative, so the LCS is not always unique)

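The diff is derived from what is not in the LCS: elements of A that are not part of the LCS are deletions; elements of B that are not part of the LCS are insertions. Computing the LCS with the classic dynamic-programming table takes O(M×N) time and space, where M and N are the lengths of the two inputs. That is acceptable for small files but slow for large ones.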
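As a rough illustration of the table-based approach described above, the following Python sketch fills the O(M×N) table and then walks it to emit an edit list. The function names `lcs_table` and `lcs_diff` are made up for this example, and the tie-breaking rule (prefer a deletion when both directions are equally good) is one arbitrary choice among several valid ones.

```python
def lcs_table(a, b):
    """Build a (len(a)+1) x (len(b)+1) table where table[i][j] is the
    LCS length of the suffixes a[i:] and b[j:]."""
    m, n = len(a), len(b)
    table = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m - 1, -1, -1):
        for j in range(n - 1, -1, -1):
            if a[i] == b[j]:
                table[i][j] = table[i + 1][j + 1] + 1
            else:
                table[i][j] = max(table[i + 1][j], table[i][j + 1])
    return table


def lcs_diff(a, b):
    """Return (tag, element) pairs: ' ' = keep, '-' = delete, '+' = insert."""
    table = lcs_table(a, b)
    i = j = 0
    ops = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            ops.append((" ", a[i]))  # element belongs to the LCS
            i += 1
            j += 1
        elif table[i + 1][j] >= table[i][j + 1]:
            ops.append(("-", a[i]))  # only in a: deletion
            i += 1
        else:
            ops.append(("+", b[j]))  # only in b: insertion
            j += 1
    # Whatever is left over on either side is pure deletion or insertion.
    ops.extend(("-", x) for x in a[i:])
    ops.extend(("+", x) for x in b[j:])
    return ops
```

For the two strings in the example, `lcs_table("ABCBDAB", "BDCAB")[0][0]` is 4, matching the LCS length given above. The same functions work on lists of lines, which is how a line-based diff would use them.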

Myers Algorithm — The Shortest Edit Script

Eugene Myers' 1986 algorithm finds the Shortest Edit Script (SES): the minimum number of insertions and deletions required to transform sequence A into sequence B. This is equivalent to finding the LCS, but Myers' approach is far more efficient in practice.

Key properties:

Time complexity: O(ND), where N = len(A) + len(B) and D = the edit distance (number of changes)
Space complexity: O(N) with the linear-space refinement
Uses a "snake" graph traversal: a path through an edit graph in which each "snake" is a diagonal run of matching elements (lines, in a line-based diff) that costs no edits
Used by git, GNU diff, and most modern diff tools

The Myers algorithm excels when changes are small relative to file size (low D), which is the common case in version control: most commits change only a small fraction of a file.
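To make the O(ND) idea more concrete, here is a minimal Python sketch of the greedy forward search from Myers' paper. It only computes D, the length of the shortest edit script, and does not reconstruct the script or apply the linear-space refinement. The function name `myers_edit_distance` is an assumption of this example, not part of any library.

```python
def myers_edit_distance(a, b):
    """Return the minimum number of insertions plus deletions that
    turn sequence a into sequence b (greedy O(ND) search)."""
    n, m = len(a), len(b)
    max_d = n + m
    # v[k] holds the furthest x reached so far on diagonal k, where k = x - y.
    v = {1: 0}
    for d in range(max_d + 1):
        for k in range(-d, d + 1, 2):
            if k == -d or (k != d and v[k - 1] < v[k + 1]):
                x = v[k + 1]      # step down: take an element from b (insertion)
            else:
                x = v[k - 1] + 1  # step right: drop an element of a (deletion)
            y = x - k
            # Follow the snake: matching elements cost nothing.
            while x < n and y < m and a[x] == b[y]:
                x += 1
                y += 1
            v[k] = x
            if x >= n and y >= m:
                return d
    return max_d
```

The outer loop runs D + 1 times and the inner loop touches at most 2D + 1 diagonals, which is why small edits are cheap even on large inputs. For the example pair from Myers' paper, `myers_edit_distance("ABCABBA", "CBABAC")` returns 5.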

Patience Diff — Better for Code Structure

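Patience diff takes a different approach: it first finds the lines that appear exactly once in each file and are common to both, keeps the longest run of those lines that occurs in the same order in both files (found with a patience-sorting pass, which gives the algorithm its name), uses them as anchors, and then recursively diffs the sections between them.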

This produces dramatically better results for code that contains many identical structural lines — think of }, {, return, or blank lines that appear throughout a source file. Myers might match the wrong closing brace; patience diff anchors on unique, meaningful lines and produces diffs that are far easier to understand.

Patience diff was devised by Bram Cohen and first shipped in the Bazaar version control system. It is available in git via git diff --diff-algorithm=patience (or the shorthand git diff --patience).
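The anchor-finding step can be sketched in a few lines of Python. This is an illustration under simplifying assumptions, not the code used by Bazaar or git: it covers only the selection of anchors (unique common lines, followed by a longest increasing subsequence), and the name `patience_anchors` is hypothetical.

```python
from bisect import bisect_left
from collections import Counter


def patience_anchors(a, b):
    """Return (index_in_a, index_in_b) pairs for lines that occur exactly
    once in each input and appear in the same relative order in both."""
    count_a, count_b = Counter(a), Counter(b)
    pos_b = {line: j for j, line in enumerate(b) if count_b[line] == 1}
    # Candidate pairs, ordered by position in a.
    pairs = [(i, pos_b[line]) for i, line in enumerate(a)
             if count_a[line] == 1 and line in pos_b]

    # Longest increasing subsequence over the b-indices (patience sorting).
    tails = []      # tails[k]: smallest b-index that ends a run of length k + 1
    tails_idx = []  # index into pairs for the entry stored in tails[k]
    prev = [None] * len(pairs)
    for p, (_, j) in enumerate(pairs):
        k = bisect_left(tails, j)
        if k == len(tails):
            tails.append(j)
            tails_idx.append(p)
        else:
            tails[k] = j
            tails_idx[k] = p
        prev[p] = tails_idx[k - 1] if k > 0 else None

    # Walk the back-pointers from the end of the longest run.
    anchors = []
    p = tails_idx[-1] if tails_idx else None
    while p is not None:
        anchors.append(pairs[p])
        p = prev[p]
    return anchors[::-1]
```

Repeated lines such as `}` never become anchors because they fail the uniqueness test. A full implementation would then diff the gaps between consecutive anchors recursively and fall back to another algorithm when no unique lines remain.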

Histogram Diff — A Refinement of Patience Diff

Histogram diff is an evolution of patience diff. It builds a histogram of line frequencies and uses this frequency information to make smarter matching decisions. Lines that appear many times are less likely to be meaningful anchors; rare lines are better candidates.

Histogram diff has been available in git since version 1.7.7 (2011). It is not the default (git diff still uses Myers unless configured otherwise), but it is often recommended for source code. You can make it your default with:

git config --global diff.algorithm histogram
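The frequency idea can be illustrated with a deliberately simplified Python sketch. It is not git's histogram implementation, which works on matching regions and recurses. It only shows how occurrence counts can steer the choice of an anchor toward rare lines. The function name `rarest_common_line` is an assumption of this example.

```python
from collections import Counter


def rarest_common_line(a, b):
    """Pick the line present in both inputs that occurs least often in a.
    Ties are resolved in favor of the line that appears first in a."""
    counts = Counter(a)          # the "histogram" of line frequencies
    in_b = set(b)
    candidates = [line for line in a if line in in_b]
    if not candidates:
        return None
    return min(candidates, key=lambda line: counts[line])


old = ["{", "x = 1", "}", "{", "}"]
new = ["{", "x = 1", "}"]
anchor = rarest_common_line(old, new)  # "x = 1" occurs once; the braces occur twice
```

Unlike patience diff, this selection still produces an anchor when no line is strictly unique, which is the practical advantage usually attributed to the histogram approach.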

Diff Output Formats

Normal Diff (Original Unix Format)

The original output format produced by diff file1.txt file2.txt:

2d1
< line only in file1
5,7c4,6
< old line A
< old line B
< old line C
---
> new line A
> new line B
> new line C

Commands like 2d1 (delete line 2 from file1) and 5,7c4,6 (change lines 5–7 to lines 4–6) made this format machine-readable but cryptic for humans.
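The command lines in this format follow a small grammar: a line range in the first file, one of the letters a (add), c (change), or d (delete), and a line range in the second file. A minimal Python sketch of a parser, with the hypothetical name `parse_normal_command`, might look like this:

```python
import re

# <start>[,<end>] <a|c|d> <start>[,<end>], with no spaces in between
NORMAL_CMD = re.compile(r"^(\d+)(?:,(\d+))?([acd])(\d+)(?:,(\d+))?$")


def parse_normal_command(line):
    """Return (op, (old_start, old_end), (new_start, new_end)),
    or None if the line is not a normal-diff command."""
    match = NORMAL_CMD.match(line)
    if match is None:
        return None
    s1, e1, op, s2, e2 = match.groups()
    # A single number stands for a one-line range.
    old_range = (int(s1), int(e1 or s1))
    new_range = (int(s2), int(e2 or s2))
    return op, old_range, new_range
```

With this sketch, `parse_normal_command("5,7c4,6")` yields `("c", (5, 7), (4, 6))`, while content lines that start with `<` or `>` yield `None`.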

Context Diff (`-c` flag)

Context diff, which originated in BSD Unix and is supported by GNU diff, adds surrounding lines for readability. Changed lines are marked with !, added lines with +, and removed lines with -, while *** and --- introduce the old and new blocks.
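Python's standard library can produce this format, which makes it easy to inspect. The sketch below uses `difflib.context_diff`. The file names and sample lines are placeholders invented for the example.

```python
import difflib

old_lines = ["alpha", "beta", "gamma"]
new_lines = ["alpha", "BETA", "gamma"]

# lineterm="" because the sample lines carry no trailing newlines.
context = list(difflib.context_diff(
    old_lines, new_lines,
    fromfile="old.txt", tofile="new.txt",
    lineterm="",
))

for line in context:
    print(line)
```

The output begins with `*** old.txt` and `--- new.txt`, then shows the old block and the new block separately, with the changed line flagged by `!` in each.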

Unified Diff Format (`-u` flag) — The Standard

The unified diff format is the modern standard, used by git and all major patch workflows. It combines both old and new content in a single block, uses + and - to mark changes, and includes hunk headers that identify the location of each change.
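For experimentation, a unified diff can be generated with Python's `difflib.unified_diff`. The sample below is a sketch: the configuration lines and the a/ and b/ file names are illustrative placeholders, and the files are assumed to start at line 1.

```python
import difflib

old_lines = [
    "# Database settings",
    "DATABASE_HOST = 'localhost'",
    "DATABASE_PORT = 5432",
    "DATABASE_NAME = 'myapp_dev'",
    "",
    "# Cache settings",
    "CACHE_TTL = 300",
]
# Replace the database name and add one new setting after it.
new_lines = old_lines[:3] + [
    "DATABASE_NAME = 'myapp_production'",
    "DATABASE_SSL = True",
] + old_lines[4:]

unified = list(difflib.unified_diff(
    old_lines, new_lines,
    fromfile="a/config.py", tofile="b/config.py",
    lineterm="",  # the sample lines have no trailing newlines
))

for line in unified:
    print(line)
```

The result has the two file header lines, a single hunk header (`@@ -1,7 +1,8 @@` for this input), and a body in which removed and added lines sit next to each other.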

git diff Output

Git's output is unified diff with added metadata — file mode, index hashes, and the diff --git header that identifies the repository paths.

Understanding Unified Diff Format

Let's decode the unified diff format in detail:

--- a/config.py
+++ b/config.py
@@ -10,7 +10,8 @@
 # Database settings
 DATABASE_HOST = 'localhost'
 DATABASE_PORT = 5432
-DATABASE_NAME = 'myapp_dev'
+DATABASE_NAME = 'myapp_production'
+DATABASE_SSL = True
 
 # Cache settings
 CACHE_TTL = 300

File Headers

--- marks the original file (version A); +++ marks the new file (version B). In git, a/ and b/ are conventional prefixes.

Hunk Headers

@@ -10,7 +10,8 @@

This is the hunk header, and it tells you exactly where in the file this chunk of changes lives:

-10,7 → In the original file, this hunk starts at line 10 and spans 7 lines
+10,8 → In the new file, this hunk starts at line 10 and spans 8 lines (one line was added)

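The general form is @@ -start,count +start,count @@. When a count is 1, the ",count" part may be omitted, and git often appends the name of the enclosing function or section after the closing @@.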
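A hunk header is simple enough to parse with one regular expression. The following Python sketch (the name `parse_hunk_header` is hypothetical) treats a missing count as 1, which is how unified diff tools abbreviate single-line ranges, and ignores any text after the closing @@.

```python
import re

HUNK_HEADER = re.compile(r"^@@ -(\d+)(?:,(\d+))? \+(\d+)(?:,(\d+))? @@")


def parse_hunk_header(line):
    """Return (old_start, old_count, new_start, new_count)."""
    match = HUNK_HEADER.match(line)
    if match is None:
        raise ValueError(f"not a hunk header: {line!r}")
    old_start, old_count, new_start, new_count = match.groups()
    return (
        int(old_start),
        int(old_count) if old_count is not None else 1,  # omitted count means 1
        int(new_start),
        int(new_count) if new_count is not None else 1,
    )
```

For the header discussed above, `parse_hunk_header("@@ -10,7 +10,8 @@")` returns `(10, 7, 10, 8)`.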

Line Markers

Each line in the hunk body is prefixed with one of three characters:

(space) — context line: unchanged, shown for readability
- (minus) — deleted line: present in original, absent in new
+ (plus) — added line: absent in original, present in new

In our example, DATABASE_NAME = 'myapp_dev' was deleted and replaced with the production name, and DATABASE_SSL = True is a brand-new line. The hunk spans 7 lines in the original (1 deleted + 6 context) and 8 lines in the new file (2 added + 6 context).
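The relationship between the line markers and the counts in the hunk header can be checked mechanically. The Python sketch below counts old and new lines from a hunk body. The function name `count_hunk_lines` and the sample body, including its first context line, are assumptions made for illustration.

```python
def count_hunk_lines(body_lines):
    """Return (old_count, new_count) implied by a hunk body."""
    old = new = 0
    for line in body_lines:
        marker = line[:1]
        if marker in (" ", ""):  # context; some tools strip a lone trailing space
            old += 1
            new += 1
        elif marker == "-":      # present only in the original file
            old += 1
        elif marker == "+":      # present only in the new file
            new += 1
        # Anything else (such as "\ No newline at end of file") is skipped.
    return old, new


hunk_body = [
    " # Database settings",  # hypothetical context line
    " DATABASE_HOST = 'localhost'",
    " DATABASE_PORT = 5432",
    "-DATABASE_NAME = 'myapp_dev'",
    "+DATABASE_NAME = 'myapp_production'",
    "+DATABASE_SSL = True",
    " ",
    " # Cache settings",
    " CACHE_TTL = 300",
]
counts = count_hunk_lines(hunk_body)  # (7, 8): 6 context + 1 deleted, 6 context + 2 added
```

If the counts computed this way disagree with the header, the patch has been edited by hand or damaged in transit, and tools such as patch will typically reject it.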

Line-Level vs Word-Level vs Character-Level Diffs

Standard diff operates at the line level — each line is treated as an atomic unit. This is ideal for source code, where lines are the natural unit of change.

Word-Level Diff

For prose, documentation, or configuration files, word-level diff is more informative. Consider this change:

Before: The quick brown fox jumps over the lazy dog
After: The quick red fox leaps over the sleeping cat

A line-level diff would show the entire line as changed. A word-level diff highlights exactly what changed:

The quick ~~brown~~ red fox ~~jumps~~ leaps over the ~~lazy dog~~ sleeping cat

Git supports word-level diff with git diff --word-diff, which by default marks removed words as [-removed-] and added words as {+added+}.
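A word-level diff can be approximated by running a sequence matcher over words instead of lines. The Python sketch below uses `difflib.SequenceMatcher` and bracket markers modeled loosely on git's plain word-diff style. The function name `word_diff` is hypothetical, and splitting on whitespace is a simplification that discards the original spacing.

```python
import difflib


def word_diff(before, after):
    """Mark removed words as [-...-] and added words as {+...+}."""
    old_words, new_words = before.split(), after.split()
    matcher = difflib.SequenceMatcher(a=old_words, b=new_words, autojunk=False)
    out = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            out.extend(old_words[i1:i2])
        if tag in ("delete", "replace"):
            out.append("[-" + " ".join(old_words[i1:i2]) + "-]")
        if tag in ("insert", "replace"):
            out.append("{+" + " ".join(new_words[j1:j2]) + "+}")
    return " ".join(out)


result = word_diff(
    "The quick brown fox jumps over the lazy dog",
    "The quick red fox leaps over the sleeping cat",
)
print(result)
```

The unchanged words "The quick", "fox", and "over the" pass through untouched, so only the three replaced spans carry markers.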

Character-Level Diff

Character-level diff works at the individual character level, typically using edit-distance techniques such as the Levenshtein distance. It is best suited to short strings (passwords, identifiers, configuration values) where even a single character matters.
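For reference, the Levenshtein distance counts the minimum number of single-character insertions, deletions, and substitutions needed to turn one string into another. A compact Python sketch that keeps only two rows of the dynamic-programming table is shown below. The function name `levenshtein` is this example's own.

```python
def levenshtein(a, b):
    """Minimum number of single-character edits turning a into b."""
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, char_a in enumerate(a, start=1):
        current = [i]                   # distance from a[:i] to ""
        for j, char_b in enumerate(b, start=1):
            substitution = previous[j - 1] + (0 if char_a == char_b else 1)
            current.append(min(
                previous[j] + 1,        # delete char_a
                current[j - 1] + 1,     # insert char_b
                substitution,           # substitute or match
            ))
        previous = current
    return previous[-1]
```

Note that, unlike the insert-and-delete model used by line diffs, this metric counts a substitution as a single edit, so `levenshtein("kitten", "sitting")` is 3.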

Comparison Table

Approach	Granularity	Best for	Tool example
Line diff	Lines	Source code	`git diff`
Word diff	Words	Prose/docs	`git diff --word-diff`
Char diff	Characters	Short strings	Levenshtein-based
Semantic diff	AST nodes	Code refactoring	difftastic

Semantic diff tools like difftastic parse source code into an Abstract Syntax Tree (AST) and diff the tree structure rather than raw text, producing diffs that understand language syntax and ignore cosmetic changes.
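The core idea, comparing parsed structure instead of raw text, can be demonstrated with Python's built-in `ast` module. This is only a small illustration of the principle for Python source and says nothing about how difftastic is implemented. The helper name `same_structure` is hypothetical.

```python
import ast


def same_structure(source_a, source_b):
    """True if two Python sources parse to the same syntax tree,
    regardless of spacing, redundant parentheses, or comments."""
    tree_a = ast.dump(ast.parse(source_a))
    tree_b = ast.dump(ast.parse(source_b))
    return tree_a == tree_b


cosmetic_only = same_structure("x = (1 +  2)  # total", "x = 1 + 2")
real_change = same_structure("x = 1 + 2", "x = 1 + 3")
```

Here `cosmetic_only` is `True`, because the extra spaces, parentheses, and comment never reach the tree, while `real_change` is `False`. A line-based diff would report both pairs as changed.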

Three-Way Merges and Merge Conflicts

The Three-Way Merge Model

When two people modify the same file independently, a simple two-way diff cannot determine whose changes should win. Git uses a three-way merge:

Base — the common ancestor commit
Ours — the current branch's version
Theirs — the incoming branch's version

The algorithm compares both ours and theirs against the base:

If only ours changed a region → use ours
If only theirs changed a region → use theirs
If both made the identical change to a region → use that change once
If both changed the same region differently → conflict
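The decision rules above fit in a few lines once the file has been split into comparable regions. The Python sketch below handles a single region. The function name `merge_region` is hypothetical, and finding and aligning the regions, which is the hard part of a real merge, is assumed to have been done already. It also treats identical changes on both sides as a clean merge.

```python
def merge_region(base, ours, theirs):
    """Return (merged_value, is_conflict) for one region of a file."""
    if ours == theirs:
        return ours, False    # both sides agree (or neither changed)
    if ours == base:
        return theirs, False  # only theirs changed
    if theirs == base:
        return ours, False    # only ours changed
    return None, True         # both changed, differently: conflict


base = "DATABASE_NAME = 'myapp_dev'"
ours = "DATABASE_NAME = 'myapp_production'"
theirs = "DATABASE_NAME = 'myapp_staging'"
outcome = merge_region(base, ours, theirs)  # (None, True)
```

Without the base version the first three cases are indistinguishable from the fourth, which is why a two-way comparison cannot resolve them automatically.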

Merge Conflict Markers

When git cannot auto-resolve a conflict, it inserts markers into the file:

<<<<<<< HEAD
DATABASE_NAME = 'myapp_production'
=======
DATABASE_NAME = 'myapp_staging'
>>>>>>> feature/staging-config

Everything between <<<<<<< and ======= is your version (HEAD)
Everything between ======= and >>>>>>> is the incoming version
You must manually edit the file to resolve, then git add it
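Because the markers follow a fixed layout, tooling can locate unresolved conflicts in a file. The Python sketch below extracts each conflict as a pair of line lists. The name `find_conflicts` is hypothetical, and the sketch assumes git's default two-section marker style (it does not handle the extra base section that the diff3 conflict style adds).

```python
def find_conflicts(text):
    """Return a list of (ours_lines, theirs_lines) for each conflict block."""
    conflicts = []
    ours, theirs = [], []
    state = None  # None, "ours", or "theirs"
    for line in text.splitlines():
        if line.startswith("<<<<<<< "):
            ours, theirs, state = [], [], "ours"
        elif line == "=======" and state == "ours":
            state = "theirs"
        elif line.startswith(">>>>>>> ") and state == "theirs":
            conflicts.append((ours, theirs))
            state = None
        elif state == "ours":
            ours.append(line)
        elif state == "theirs":
            theirs.append(line)
    return conflicts


sample = """DATABASE_PORT = 5432
<<<<<<< HEAD
DATABASE_NAME = 'myapp_production'
=======
DATABASE_NAME = 'myapp_staging'
>>>>>>> feature/staging-config
CACHE_TTL = 300
"""
found = find_conflicts(sample)
```

A check like this could run in a pre-commit hook to catch files where markers were left behind by accident.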

Use Cases

Code Review

Diffs are the language of code review. Pull requests on GitHub, GitLab, and Bitbucket all present changes as diffs, allowing reviewers to understand exactly what changed, line by line. Small, focused diffs dramatically improve review quality and speed.

Document Comparison

Legal teams use diff tools to compare contract revisions. Technical writers use them to review documentation changes. Any workflow involving versioned documents benefits from structured diff output.

Log Analysis

System administrators compare log files to identify what changed between runs — new errors, missing entries, configuration drift. Tools like diff and colordiff are standard parts of the sysadmin toolkit.

Legal and Compliance

Regulatory submissions, audit trails, and compliance documents often require a formal record of changes between versions. Diff tools provide an objective, reproducible record of exactly what changed, when, and how.

Security Analysis

Security researchers diff configuration snapshots and system states to detect unauthorized changes. File integrity monitoring systems are built on diff principles.

Visualization Approaches

Side-by-Side (Split View)

Two panels show the old and new versions side by side, with changes highlighted in corresponding rows. Best for large changes where context on both sides is helpful. This is the default view in many GUI tools and is available on GitHub through its split-view toggle.

Inline (Unified View)

Deletions and additions are shown in a single stream, interleaved with context lines. This is the default in most command-line tools and GitHub's PR view. Best for dense, small changes where you want to see the before and after close together.

GitHub PR View

GitHub enhances the unified diff with syntax highlighting, expandable context, inline review comments, side-by-side toggle, and per-file "Viewed" tracking — making large pull requests navigable for reviewers.

Word-Diff Highlighting

Tools like git diff --word-diff=color highlight changed words within lines, making small intra-line changes visible in a line-diff context. This is especially useful for configuration files and prose documents.

Best Practices

Keep commits small and focused. A diff that changes one logical thing is far easier to review than a diff touching dozens of files for multiple reasons.
Write meaningful commit messages. The diff shows what changed; the commit message explains why.
Use the right diff algorithm. For code, histogram or patience diff often produces more readable output than Myers. Configure globally: git config --global diff.algorithm histogram.
Review diffs before committing. git diff --staged shows exactly what will be committed. Always read it before running git commit.
Use word-diff for prose. When writing documentation or README files, git diff --word-diff is far more readable than line diff.
Understand hunk context. By default, three context lines surround each hunk to show the change in its setting (git diff -U<n> adjusts the number). Don't skip them when reviewing.
Resolve conflicts carefully. Never accept one side of a conflict without understanding what the other side changed. Both changes may be important.
Use .gitattributes for binary files. Tell git how to handle binary and special files to avoid meaningless diffs.

FAQ

Q: What is the difference between diff and patch?
A: diff compares two files and produces diff output. patch takes that diff output and applies it to a file to reproduce the changes. They are complementary tools designed to work together.

Q: Which diff algorithm does git use by default?
A: Git uses the Myers algorithm by default. Histogram diff is a popular alternative for source code and can be enabled with git config --global diff.algorithm histogram.

Q: What does @@ -10,7 +10,8 @@ mean?
A: The hunk starts at line 10 in both files. In the old file it covers 7 lines; in the new file, 8 lines (one line was added).

Q: Can I diff binary files?
A: Standard diff tools operate on text. For binary files, specialized tools exist (like bsdiff). Most diff tools will simply report "Binary files differ."

Q: What is a "hunk"?
A: A hunk is a contiguous block of changes in a diff, including surrounding context lines. A single diff can contain multiple hunks if changes are spread throughout a file.

Q: Why does git sometimes produce confusing diffs for moved code?
A: Standard line-based diff has no concept of "move": it only sees additions and deletions, so code that was moved appears as deleted in one location and added in another. git diff --color-moved can highlight such blocks, and structural tools like difftastic, which compare syntax trees rather than lines, often make this kind of change easier to read.

Q: What is a three-way merge?
A: A merge strategy that uses a common ancestor (base) along with two changed versions to intelligently combine changes, auto-resolving non-conflicting edits and flagging genuine conflicts.

Understanding text diff is not just a technical curiosity — it is a fundamental skill for anyone who works with text, code, or documents over time. From the elegant simplicity of the Unix diff command to the sophisticated algorithms powering modern code review platforms, the humble diff has shaped how software is built, reviewed, and maintained for over five decades.