01 · Case study · 2023, ongoing

Digitizing Pakistan's Statutes

An ingestion pipeline that turns decades of fragmented PDFs and HTML into a structured, searchable statute tree.

Role

Lead — pipeline, schema, entry tooling

Period

2023 — ongoing

Stack

Python · PostgreSQL · Regex · BeautifulSoup · Docker

Status

Live in production

0K+ statutory records structured

~200 sections reviewed / day

~80% section retrieval acc.

Specimens

[NO. XLV OF 1860]
THE PAKISTAN PENAL CODE.
[6th October, 1860]

CHAPTER XV.
OF OFFENCES RELATING TO RELIGION.

295-A. Deliberate and malicious acts intended to
outrage religious feelings of any class by insulting its
religion or religious beliefs.
   Whoever, with deliberate and malicious intention of
outraging the religious feelings of any class of the citizens
of Pakistan, by words, either spoken or written, or by
visible representations insults the religion or the religious
beliefs of that class, shall be punished with imprisonment
of either description for a term which may extend to ten
years, or with fine, or with both.

The problem

Where the data starts.

Legal data in Pakistan is scattered across decades of scanned PDFs, departmental HTML pages, and a few ad-hoc gazettes. A significant share of statutes is available only as non-searchable image PDFs, and no two sources agree on numbering, indentation, or what counts as a section break.

› Image-only PDFs with legacy typefaces and OCR noise.
› Inconsistent hierarchical formatting (Parts, Chapters, Sections) across decades.
› No relational links between original acts and their later amendments.

The pipeline

Normalise before you structure.

The pipeline normalises before it structures. Every act enters as a versioned artefact; cleaning passes (Tesseract OCR + a multi-pass regex engine) get it to a canonical form; then parsing walks the line stream and emits the natural hierarchy any law reporter already publishes — acts, chapters, sections, marginal notes. Each output line keeps provenance back to its source.
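As a rough illustration of the normalise-then-parse idea, the sketch below applies a small list of regex cleaning passes to each line and then walks the line stream, emitting chapter and section nodes that remember the source line they came from. It is not the production code: the patterns, the `Node` fields, and the names `CLEANING_PASSES`, `normalise`, and `parse` are all assumptions made for this example, and OCR is assumed to have already produced plain text.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Hypothetical cleaning passes, applied in order to every line.
CLEANING_PASSES = [
    (re.compile(r"[\u2010-\u2015]"), "-"),  # unify Unicode dash variants
    (re.compile(r"\s+"), " "),              # collapse whitespace runs
]

# Hypothetical structural patterns for headings such as "CHAPTER XV." and "295-A. ..."
CHAPTER_RE = re.compile(r"^CHAPTER\s+([IVXLCDM]+)\.?$")
SECTION_RE = re.compile(r"^(\d+(?:-[A-Z]+)?)\.\s+(.*)$")


@dataclass
class Node:
    kind: str               # "chapter" or "section"
    number: str             # e.g. "XV" or "295-A"
    text: str               # heading text plus any continuation lines
    source_line: int        # 1-based line in the raw input (provenance)
    parent: Optional[str]   # enclosing chapter number, if any


def normalise(line: str) -> str:
    """Run every cleaning pass over one line, then trim it."""
    for pattern, replacement in CLEANING_PASSES:
        line = pattern.sub(replacement, line)
    return line.strip()


def parse(raw_text: str) -> list[Node]:
    """Walk the line stream and emit chapter and section nodes in order."""
    nodes: list[Node] = []
    current_chapter: Optional[str] = None
    for line_no, raw in enumerate(raw_text.splitlines(), start=1):
        line = normalise(raw)
        if not line:
            continue
        chapter = CHAPTER_RE.match(line)
        section = SECTION_RE.match(line)
        if chapter:
            current_chapter = chapter.group(1)
            nodes.append(Node("chapter", current_chapter, "", line_no, None))
        elif section:
            nodes.append(
                Node("section", section.group(1), section.group(2), line_no, current_chapter)
            )
        elif nodes:
            # Continuation line: fold it into the most recent node.
            last = nodes[-1]
            last.text = f"{last.text} {line}".strip()
    return nodes
```

Cleaning runs per line so that `source_line` still points at the raw input. A sketch this small will misread any body line that happens to start with a number and a full stop, which is the kind of edge case a human review step would catch.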

A companion internal entry interface lets the team review parsed output, correct edge cases, and maintain a consistent hierarchy. It was never a separate initiative; it exists to support the digitization pipeline directly, and reviewers clear roughly 200 sections a day through it.

Search is hybrid for the obvious reason: statute vocabulary is narrow and repetitive, so pure dense retrieval keeps surfacing the wrong neighbour, and pure lexical retrieval misses anything phrased differently from the statute itself. A blend handles both — and the blend is what users actually feel, reaching ~80% section-level accuracy on internal evaluation queries.
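One common way to blend lexical and dense retrieval is to rescale each retriever's scores to a shared range and take a weighted sum. The sketch below shows that general technique only: the real scoring weights are not published, so `alpha`, the min-max rescaling, and the function names `min_max` and `blend` are assumptions chosen for illustration.

```python
def min_max(scores: dict[str, float]) -> dict[str, float]:
    """Rescale scores to [0, 1] so the two retrievers are comparable."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        # All candidates tied: treat them as equally strong.
        return {key: 1.0 for key in scores}
    return {key: (value - lo) / (hi - lo) for key, value in scores.items()}


def blend(
    lexical: dict[str, float],
    dense: dict[str, float],
    alpha: float = 0.5,  # hypothetical weight on the lexical side
) -> list[tuple[str, float]]:
    """Rank section ids by a weighted sum of rescaled lexical and dense scores."""
    lex, den = min_max(lexical), min_max(dense)
    combined = {
        section_id: alpha * lex.get(section_id, 0.0)
        + (1 - alpha) * den.get(section_id, 0.0)
        for section_id in lex.keys() | den.keys()
    }
    # Highest score first; ties broken by id so the order is deterministic.
    return sorted(combined.items(), key=lambda item: (-item[1], item[0]))
```

A section that only one retriever returns contributes zero from the other side, so it can still rank but rarely beats a section both retrievers agree on. That is the behaviour the paragraph above describes: the lexical side anchors exact statutory wording, and the dense side recovers paraphrased queries.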

On confidentiality

Implementation specifics (schema, scoring weights, infra topology) belong to DigiLawyer and aren’t reproduced here. The specimens above use the Pakistan Penal Code, 1860 — a public-domain statute — purely to show the visible shape of the work.