Talha Ijlal

Case Study

Digitizing Pakistan's Statutes

A comprehensive data engineering initiative aimed at structuring, parsing, and democratizing access to decades of fragmented legal documents.

Role: Data Engineer / Lead
Timeline: Q3 2023 - Q1 2024
Technologies: Python, Regex, PostgreSQL, BeautifulSoup

The Problem

Legal data in Pakistan exists in a state of profound fragmentation. Statutes, amendments, and repeals are often stored as unstructured PDFs, scanned images without OCR, or poorly formatted HTML across disparate governmental websites. This lack of a centralized, machine-readable repository severely impedes legal research, policy analysis, and computational law initiatives.


Data Fragmentation Matrix

  • 40% of statutes available only as non-searchable image PDFs.
  • Inconsistent hierarchical formatting (Parts, Chapters, Sections) across decades.
  • Absence of relational links between original acts and subsequent amendments.

Research & Solution

The solution required a custom Extract, Transform, Load (ETL) pipeline. We designed a deterministic parsing engine that relies on regular expressions (regex) to identify legal hierarchies despite OCR errors and inconsistent typesetting. The transformed data was then modeled into a relational PostgreSQL schema.
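To make the relational model concrete, here is a minimal sketch of what such a schema might look like. The table and column names (`statutes`, `units`, `amendments`) are assumptions for illustration, not the project's actual schema. The example runs against an in-memory SQLite database only so that it is self-contained; a PostgreSQL version would use native types such as `DATE`.

```python
import sqlite3

# Hypothetical schema: every name below is an illustrative assumption.
SCHEMA = """
CREATE TABLE statutes (
    id          INTEGER PRIMARY KEY,
    title       TEXT NOT NULL,
    enacted_on  TEXT                      -- ISO date; DATE in PostgreSQL
);
CREATE TABLE units (
    id          INTEGER PRIMARY KEY,
    statute_id  INTEGER NOT NULL REFERENCES statutes(id),
    parent_id   INTEGER REFERENCES units(id),   -- self-reference builds the hierarchy
    kind        TEXT NOT NULL
                CHECK (kind IN ('part', 'chapter', 'section', 'subsection')),
    number      TEXT NOT NULL,
    body        TEXT
);
CREATE TABLE amendments (
    amending_statute_id INTEGER NOT NULL REFERENCES statutes(id),
    target_unit_id      INTEGER NOT NULL REFERENCES units(id),
    action              TEXT NOT NULL
                        CHECK (action IN ('amend', 'insert', 'repeal')),
    PRIMARY KEY (amending_statute_id, target_unit_id)
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)

# Invented sample rows: an act, one of its sections, and a later amending act.
conn.execute("INSERT INTO statutes VALUES (1, 'Example Act', '1999-01-01')")
conn.execute("INSERT INTO statutes VALUES (2, 'Example (Amendment) Act', '2005-06-30')")
conn.execute("INSERT INTO units VALUES (10, 1, NULL, 'chapter', 'II', NULL)")
conn.execute("INSERT INTO units VALUES (11, 1, 10, 'section', '3', 'Powers of the Board.')")
conn.execute("INSERT INTO amendments VALUES (2, 11, 'amend')")

# Which later statutes touch section 3 of the original act?
rows = conn.execute(
    """
    SELECT s.title, a.action
    FROM amendments AS a
    JOIN statutes AS s ON s.id = a.amending_statute_id
    WHERE a.target_unit_id = ?
    """,
    (11,),
).fetchall()
print(rows)
```

The `amendments` table is the part that addresses the missing links between original acts and later amendments: each row ties an amending statute to the specific unit it changes.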

Internal entry workflow

This project also included the internal statute entry interface used by the team to review parsed output, correct edge cases, and maintain a consistent hierarchy across parts, chapters, sections, and subsections. It was not a separate initiative. It existed to support the digitization pipeline directly.

Extraction (Scraping & OCR)

Automated crawlers aggregated raw documents. Tesseract OCR was deployed to convert image-based PDFs into raw text strings, handling legacy typefaces.

Transformation (Parsing Logic)

A multi-pass Regex engine was built to segment text into structured JSON blocks: Title, Enactment Date, Chapters, Sections, and Sub-sections.
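As a rough illustration of multi-pass segmentation, the sketch below splits text first by chapter, then by section, then by sub-section. The patterns, the `split_units` and `parse_statute` helpers, and the sample text are all invented for this example; the production patterns would have to tolerate far more variation and OCR noise.

```python
import json
import re

# Hypothetical heading patterns, one per pass.
CHAPTER_RE = re.compile(
    r"^CHAPTER[ \t]+([IVXLC]+)\b[ \t]*(.*)$", re.MULTILINE | re.IGNORECASE
)
SECTION_RE = re.compile(r"^(\d+[A-Z]?)\.[ \t]+", re.MULTILINE)
SUBSECTION_RE = re.compile(r"^\((\d+)\)[ \t]+", re.MULTILINE)


def split_units(text, pattern):
    """Split text at each heading match.

    Returns (preamble, [(match, body), ...]) where preamble is the text
    before the first heading and body is the text up to the next heading.
    """
    matches = list(pattern.finditer(text))
    if not matches:
        return text.strip(), []
    preamble = text[: matches[0].start()].strip()
    units = []
    for i, m in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        units.append((m, text[m.end():end].strip()))
    return preamble, units


def parse_statute(text):
    """Segment a statute into nested chapters, sections, and sub-sections."""
    title, chapters = split_units(text, CHAPTER_RE)            # pass 1
    doc = {"title": title, "chapters": []}
    for ch_match, ch_body in chapters:
        _, sections = split_units(ch_body, SECTION_RE)         # pass 2
        chapter = {
            "number": ch_match.group(1).upper(),
            "heading": ch_match.group(2).strip(),
            "sections": [],
        }
        for sec_match, sec_body in sections:
            sec_text, subs = split_units(sec_body, SUBSECTION_RE)  # pass 3
            chapter["sections"].append({
                "number": sec_match.group(1),
                "text": sec_text,
                "subsections": [
                    {"number": m.group(1), "text": body} for m, body in subs
                ],
            })
        doc["chapters"].append(chapter)
    return doc


# Invented sample text, not taken from any real statute.
SAMPLE = """THE EXAMPLE ACT, 1999
CHAPTER I PRELIMINARY
1. Short title. This Act may be called the Example Act.
2. Definitions.
(1) "Board" means the Example Board.
(2) "Rules" means rules made under this Act.
CHAPTER II POWERS
3. Powers of the Board. The Board may make rules.
"""

doc = parse_statute(SAMPLE)
print(json.dumps(doc, indent=2))
```

Running each pass only on the body produced by the previous pass keeps the patterns simple: a sub-section pattern never has to worry about matching text outside its own section.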

Parsing approach
  • Multi-pass parsing to recognize parts, chapters, sections, and nested subsections.
  • Normalization to fix typography artifacts, line breaks, and numbering inconsistencies.
  • Validation checks to catch missing headings and malformed numbering early.
  • Stable identifiers so downstream search can cite and link to specific units.
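The normalization, validation, and identifier steps listed above could be sketched roughly as follows. The function names (`normalize`, `find_numbering_problems`, `unit_id`) and the identifier format are assumptions made for illustration; the actual rules used in the project are not documented here.

```python
import re
import unicodedata
from typing import List, Optional

# Map curly quotes and dashes to ASCII equivalents.
PUNCTUATION = str.maketrans({
    "\u2018": "'", "\u2019": "'",
    "\u201c": '"', "\u201d": '"',
    "\u2013": "-", "\u2014": "-",
})


def normalize(text: str) -> str:
    """Clean common typography and line-break artifacts."""
    text = unicodedata.normalize("NFKC", text)    # folds ligatures such as U+FB01
    text = text.translate(PUNCTUATION)
    # Rejoin words split across lines; this can also merge genuine
    # hyphenated compounds, so a real pipeline would need a safeguard.
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    text = re.sub(r"[ \t]+", " ", text)           # collapse runs of spaces and tabs
    return text.strip()


def find_numbering_problems(section_numbers: List[str]) -> List[str]:
    """Report malformed section numbers and gaps in the sequence."""
    problems = []
    previous = 0
    for raw in section_numbers:
        m = re.fullmatch(r"(\d+)([A-Z]?)", raw)
        if m is None:
            problems.append(f"malformed section number: {raw!r}")
            continue
        current, suffix = int(m.group(1)), m.group(2)
        is_next = current == previous + 1
        is_inserted = current == previous and suffix != ""   # e.g. "2A" after "2"
        if not (is_next or is_inserted):
            problems.append(f"expected section {previous + 1}, found {raw}")
        previous = current
    return problems


def unit_id(statute_slug: str, chapter: str, section: str,
            subsection: Optional[str] = None) -> str:
    """Build a path-like identifier from a unit's position in the hierarchy."""
    parts = [statute_slug, f"ch-{chapter.lower()}", f"s-{section.lower()}"]
    if subsection is not None:
        parts.append(f"ss-{subsection}")
    return "/".join(parts)


cleaned = normalize("enact-\nment  of  the \ufb01rst \u201cAct\u201d")
problems = find_numbering_problems(["1", "2", "2A", "4", "l5"])
identifier = unit_id("example-act-1999", "II", "3A", "1")
print(cleaned, problems, identifier, sep="\n")
```

An identifier derived from hierarchy position rather than a database row number stays the same when the statute is re-parsed, which is what lets search results cite and link to a specific unit.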

Architecture

A conceptual view of the data flow from raw, unstructured governmental sources to a structured, queryable relational database.

Raw PDFs / HTML docs -> Python ETL engine (regex parsing) -> PostgreSQL