PDF Data Extraction API

Extract Data from PDF to JSON (API + Example)

Send any PDF to Parselyze and receive structured JSON with the fields you define, from invoices and contracts to scanned forms and financial reports. No scripting required.

Works with scanned and digital PDFs

Extracts fields, tables, and nested line items

Simple REST API. No custom scripts per document layout

Extract your first PDF

Start in minutes

50 pages per month free

No credit card required

REST API, SDK, webhooks

Best fit for

Engineering teams building document pipelines, finance automation, data ingestion workflows, and any system that needs data out of PDFs.

Any PDF type

Digital PDFs, scanned documents, and image files are all processed through the same API endpoint.

Tables included

Extracts table rows and nested line items alongside scalar fields in the same JSON response.

No maintenance

Define your fields once. The AI handles layout variation, different fonts, and formatting changes automatically.

What is PDF data extraction?

PDF data extraction is the automated process of reading a PDF document and pulling out specific fields (such as dates, amounts, names, and table rows) into a structured format like JSON. Unlike copy-paste or basic OCR, PDF data extraction maps each value to a named key you define once, so the output is immediately usable in your application.

Parselyze provides a PDF data extraction API that handles digital PDFs, scanned PDFs with built-in OCR, multi-page documents, and image inputs. Define your fields in the Template Builder, submit the document via API, and receive structured JSON in seconds.

How it works

How to extract data from a PDF

Instead of raw text, you receive structured fields that your application can use directly, which makes it straightforward to plug PDF data extraction into data pipelines, internal systems, and automated workflows.

Upload a PDF

Send your PDF document to Parselyze via our API. You can upload any PDF, whether it's a digital file or a scanned document.

Fields are detected

Parselyze locates and extracts the fields and tables defined in your template, regardless of the document's layout or format.

Receive structured JSON

Get the extracted data back as JSON, ready to use in your application or data pipeline.
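
The three steps above can be sketched as a request and a response handler. This is only an illustration under stated assumptions, not Parselyze's documented API: the endpoint URL is a placeholder, and the `Authorization` header scheme, the `templateId` field, and both function names are hypothetical. Check the developer docs for the real names.

```typescript
// Hypothetical request descriptor. None of these names come from the Parselyze docs.
interface ExtractionRequest {
  url: string;
  method: "POST";
  headers: Record<string, string>;
  templateId: string; // assumed: identifies the template that defines your fields
  fileName: string;
  fileBytes: Uint8Array;
}

// Step 1: describe the upload. The URL is a placeholder, not a real endpoint.
function buildExtractionRequest(
  apiKey: string,
  templateId: string,
  fileName: string,
  fileBytes: Uint8Array,
): ExtractionRequest {
  if (fileBytes.length === 0) {
    throw new Error("PDF is empty");
  }
  return {
    url: "https://api.example.com/extract", // placeholder URL
    method: "POST",
    headers: { Authorization: `Bearer ${apiKey}` }, // assumed auth scheme
    templateId,
    fileName,
    fileBytes,
  };
}

// Step 3: turn the HTTP status and body into a plain object of extracted fields.
function parseExtractionResponse(
  status: number,
  body: string,
): Record<string, unknown> {
  if (status < 200 || status >= 300) {
    throw new Error(`Extraction failed with HTTP ${status}`);
  }
  const parsed: unknown = JSON.parse(body);
  if (typeof parsed !== "object" || parsed === null || Array.isArray(parsed)) {
    throw new Error("Expected a JSON object of extracted fields");
  }
  return parsed as Record<string, unknown>;
}
```

In a real integration the descriptor would typically be turned into a multipart `FormData` body and sent with `fetch`. Keeping the request and response logic in plain functions makes both easy to unit test without network access.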

PDF to JSON extraction example

Send a PDF to Parselyze via API and receive structured JSON in seconds, ready for any downstream system.

Sample invoice: FCT-000342 from ACME Corporation

extraction_result.json

{
  "invoice_number": "FCT-000342",
  "invoice_date": "2024-05-28",
  "vendor_name": "ACME Corporation",
  "vendor_address": "123 Innovation St, Example City",
  "bill_to": "John Example",
  "bill_to_address": "456 Demo Ave, Sampletown",
  "currency": "USD",
  "total_amount": 1500.00,
  "line_items": [
    {
      "description": "Consulting services",
      "qty": 8,
      "unit_price": 125.00,
      "total": 1000.00
    },
    {
      "description": "Design mockups",
      "qty": 1,
      "unit_price": 500.00,
      "total": 500.00
    }
  ]
}
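
Once the JSON arrives, you may want to type it and sanity-check it before passing it downstream. The sketch below mirrors the field names in the sample above. The interface names and the `findInconsistencies` helper are illustrative and not part of any SDK. It also assumes `total_amount` equals the sum of the line items, which holds for this sample but not for invoices with tax or discounts.

```typescript
interface LineItem {
  description: string;
  qty: number;
  unit_price: number;
  total: number;
}

interface InvoiceExtraction {
  invoice_number: string;
  invoice_date: string; // ISO 8601 date, as in the sample
  vendor_name: string;
  vendor_address: string;
  bill_to: string;
  bill_to_address: string;
  currency: string;
  total_amount: number;
  line_items: LineItem[];
}

// Compare amounts in integer cents to avoid floating-point drift.
const toCents = (amount: number): number => Math.round(amount * 100);

// Returns a list of human-readable problems; an empty list means the numbers agree.
function findInconsistencies(invoice: InvoiceExtraction): string[] {
  const problems: string[] = [];
  let sumOfLines = 0;
  invoice.line_items.forEach((item, index) => {
    if (toCents(item.qty * item.unit_price) !== toCents(item.total)) {
      problems.push(`line_items[${index}]: qty * unit_price does not equal total`);
    }
    sumOfLines += toCents(item.total);
  });
  if (sumOfLines !== toCents(invoice.total_amount)) {
    problems.push("line item totals do not add up to total_amount");
  }
  return problems;
}

// The sample invoice from this page.
const sample: InvoiceExtraction = {
  invoice_number: "FCT-000342",
  invoice_date: "2024-05-28",
  vendor_name: "ACME Corporation",
  vendor_address: "123 Innovation St, Example City",
  bill_to: "John Example",
  bill_to_address: "456 Demo Ave, Sampletown",
  currency: "USD",
  total_amount: 1500.0,
  line_items: [
    { description: "Consulting services", qty: 8, unit_price: 125.0, total: 1000.0 },
    { description: "Design mockups", qty: 1, unit_price: 500.0, total: 500.0 },
  ],
};
```

Calling `findInconsistencies(sample)` returns an empty array, since 8 × 125.00 and 1 × 500.00 add up to the stated 1500.00.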

Supported PDF Types

Parselyze supports every type of PDF, including invoices, receipts, financial reports, contracts, forms, scanned documents, and more.

Invoices

Extract totals, dates, line items, and more from scanned invoices.

Receipts

Parse merchant names, amounts, and dates from receipts for expense tracking.

Contracts

Extract parties, dates, and clauses from contracts and agreements.

Financial reports

Convert financial statements and reports into structured data for analysis.

Forms and surveys

Parse filled-out forms and surveys to extract responses and metadata.

Scanned documents

Convert scanned PDFs of any type into structured JSON for downstream processing.

Typical Workflows

Parselyze supports a variety of workflows, such as invoice processing, receipt data extraction, contract data ingestion, and document ingestion pipelines.

Invoice processing automation

Convert scanned invoices into structured JSON to automatically import totals, dates, and line items into accounting systems.

Receipt data extraction

Extract merchant names, amounts, and dates from receipts to automate expense tracking and reimbursements.

Contract data ingestion

Parse contracts and agreements to extract key information like parties, dates, and clauses for internal systems.

Document ingestion pipelines

Convert large volumes of PDFs and scanned documents into structured JSON to feed data warehouses or automation workflows.
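
Data warehouses usually want flat rows rather than nested JSON. As a rough sketch of that last step, the helper below flattens nested line items into one row per item, reusing the field names from the invoice sample on this page. The `toLineItemRows` function and the `line_no` column are hypothetical names chosen for illustration.

```typescript
interface LineItem {
  description: string;
  qty: number;
  unit_price: number;
  total: number;
}

interface Invoice {
  invoice_number: string;
  invoice_date: string;
  line_items: LineItem[];
}

// One warehouse row per line item, carrying the parent invoice keys.
interface LineItemRow extends LineItem {
  invoice_number: string;
  invoice_date: string;
  line_no: number; // 1-based position within the invoice (illustrative column)
}

function toLineItemRows(invoices: Invoice[]): LineItemRow[] {
  const rows: LineItemRow[] = [];
  for (const invoice of invoices) {
    invoice.line_items.forEach((item, index) => {
      rows.push({
        invoice_number: invoice.invoice_number,
        invoice_date: invoice.invoice_date,
        line_no: index + 1,
        description: item.description,
        qty: item.qty,
        unit_price: item.unit_price,
        total: item.total,
      });
    });
  }
  return rows;
}
```

Each row repeats `invoice_number`, so the rows can be joined back to a header table that holds vendor and total fields.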

How to integrate

Add PDF extraction to any application

Install the SDK or call the REST API directly. For large volumes, use the async job queue and receive results via webhook as each document completes.

Create a PDF template in the Template Builder

Submit PDFs via REST API or Node.js SDK

Receive structured JSON in the API response or via webhook
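
For the webhook path, the receiving side boils down to parsing the payload and branching on the job outcome. The sketch below assumes a payload shape with `jobId`, `status`, `result`, and `error` fields. These names are hypothetical, as are the two function names, so treat the webhook guide as the source of truth for the real schema.

```typescript
// Hypothetical webhook payload. Field names are assumptions, not the documented schema.
interface WebhookEvent {
  jobId: string;
  status: "completed" | "failed";
  result?: Record<string, unknown>;
  error?: string;
}

type Outcome =
  | { kind: "stored"; jobId: string; fields: Record<string, unknown> }
  | { kind: "failed"; jobId: string; reason: string };

// Validate the raw request body before trusting any of its fields.
function parseWebhookEvent(rawBody: string): WebhookEvent {
  const data: unknown = JSON.parse(rawBody);
  if (typeof data !== "object" || data === null) {
    throw new Error("Webhook body is not a JSON object");
  }
  const candidate = data as Partial<WebhookEvent>;
  if (typeof candidate.jobId !== "string") {
    throw new Error("Webhook body is missing jobId");
  }
  if (candidate.status !== "completed" && candidate.status !== "failed") {
    throw new Error("Webhook body has an unknown status");
  }
  return candidate as WebhookEvent;
}

// Decide what to do with one event: keep the extracted fields or record the failure.
function handleWebhookEvent(rawBody: string): Outcome {
  const event = parseWebhookEvent(rawBody);
  if (event.status === "completed" && event.result !== undefined) {
    return { kind: "stored", jobId: event.jobId, fields: event.result };
  }
  return {
    kind: "failed",
    jobId: event.jobId,
    reason: event.error ?? "no result in payload",
  };
}
```

A production handler would sit behind an HTTP route and authenticate the incoming request before calling `handleWebhookEvent`. Both are left out here to keep the sketch focused on the payload logic.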

Read the docs | Webhook guide

Ready to integrate?

SDK examples, REST API reference, webhook handler, and cURL samples are all available in the developer docs.

Developer integration guide

Frequently asked questions

Everything you need to know about PDF data extraction.

What does PDF data extraction mean?

PDF data extraction is the process of automatically pulling structured fields, tables, and values from PDF documents without manual copy-paste. The result is clean JSON ready for databases, APIs, or automation workflows.

What PDF formats are supported?

Parselyze supports native digital PDFs, scanned PDFs (with built-in OCR), multi-page PDFs, and image-based PDFs. JPEG, PNG, WEBP, TIFF, and BMP image inputs are also accepted.

How do I extract data from a PDF?

Create a template in the Parselyze dashboard defining the fields you want, then submit the PDF via REST API or Node.js SDK. You receive a structured JSON response in seconds. For bulk volumes, use the async job queue with webhook delivery.

Can Parselyze extract tables from PDFs?

Yes. Parselyze extracts table rows and nested line items alongside scalar fields in the same JSON response, so no separate pipeline is required.

How is PDF data extraction different from OCR?

OCR converts document images to raw text. PDF data extraction goes further: it maps that text into named, typed fields defined by your template. The output is structured JSON, not a block of text.
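
To make the difference concrete, here is a toy comparison. The OCR text is made up for illustration, and the regex-based `mapToFields` helper is a deliberately simple stand-in for the mapping step. It is not how Parselyze works internally, and it would break on any other layout, which is exactly the problem a template-driven extractor is meant to solve.

```typescript
// What OCR alone gives you: one block of text (invented sample).
const ocrText = "Invoice No: FCT-000342\nDate: 2024-05-28\nTotal: USD 1,500.00";

// What extraction gives you: named, typed fields.
interface TypedFields {
  invoice_number: string;
  invoice_date: string;
  total_amount: number;
}

// Toy mapping step, hard-wired to this one layout.
function mapToFields(text: string): TypedFields {
  const grab = (pattern: RegExp): string => {
    const match = pattern.exec(text);
    const value = match === null ? undefined : match[1];
    if (value === undefined) {
      throw new Error(`No match for ${pattern.source}`);
    }
    return value;
  };
  return {
    invoice_number: grab(/Invoice No:\s*(\S+)/),
    invoice_date: grab(/Date:\s*(\d{4}-\d{2}-\d{2})/),
    // Strip thousands separators so the amount becomes a real number.
    total_amount: parseFloat(grab(/Total:\s*[A-Z]{3}\s*([\d,]+\.\d{2})/).replace(/,/g, "")),
  };
}

const fields = mapToFields(ocrText);
```

Here `fields.total_amount` is the number `1500`, not the string `"1,500.00"`, so it can be summed or compared without further parsing.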

Do I need to train a model for each PDF layout?

No. You define the fields once in the Template Builder. The AI handles layout variation, different fonts, and formatting differences automatically across document variants.

Start extracting data from PDFs today

50 pages/month free · No credit card required

Extract your first PDF | All solutions