From PDF to Text: Extracting Meaning from Documents

Learn how we pull clean, usable text and structured scientific citations from messy PDFs.

The first step in building a knowledge graph for integral ecology is simple in concept, but tricky in practice:

How do we get clean, usable text from messy, multilingual PDF reports?

This post explains how we extract both the raw text and the structured citations from reports using two tools:

  • PyMuPDF, for the plain text
  • GROBID, for the structured citations

Why This Matters

PDFs are designed for printing, not for reading by machines.

They can include:

  • Columns and footnotes
  • Images, tables, and scanned pages
  • Embedded fonts or malformed characters
  • Multiple languages in one document

If we want to detect entities and link knowledge later, we need high-quality plain text.


Step 1: Extract Text with PyMuPDF

We use PyMuPDF to extract the text from a PDF, page by page.

It:

  • Preserves layout well
  • Handles multiple languages
  • Works with scanned and OCR’d documents, as long as a text layer is embedded

Example output:

The Amazon rainforest is shrinking rapidly.

WWF reported that deforestation increased 12% in 2023.

This text gets saved as report.txt.
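
In code, that step takes only a few lines. Here is a minimal sketch using PyMuPDF’s Python API (the output filename is illustrative; our actual script adds more handling):

    import sys
    import fitz  # PyMuPDF

    pdf_path = sys.argv[1]
    doc = fitz.open(pdf_path)

    # Pull plain text from each page and join pages with blank lines
    text = "\n\n".join(page.get_text() for page in doc)

    # Save the result next to the PDF as a .txt file
    txt_path = pdf_path.rsplit(".", 1)[0] + ".txt"
    with open(txt_path, "w", encoding="utf-8") as f:
        f.write(text)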

Step 2: Extract Citations with GROBID

Next, we use GROBID, a tool that reads the bibliography and metadata of academic papers and reports.

GROBID converts messy citation lists like:

[12] Smith, J., “Biodiversity and Forests”, Nature, 2020

into structured, machine-readable TEI XML, which can include:

  • Title
  • Authors
  • Year
  • Journal or publisher
  • DOI or identifiers

We save this as report.biblio.xml.
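
Talking to GROBID is a single HTTP request to its REST service. A minimal sketch, assuming a GROBID server is running locally on its default port 8070:

    import requests

    # GROBID's reference-parsing endpoint; the PDF goes in a
    # multipart form field named "input"
    GROBID_URL = "http://localhost:8070/api/processReferences"

    with open("report.pdf", "rb") as f:
        resp = requests.post(GROBID_URL, files={"input": f})
    resp.raise_for_status()

    # The response body is TEI XML describing the parsed references
    with open("report.biblio.xml", "w", encoding="utf-8") as out:
        out.write(resp.text)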

Later in the pipeline, this will help us:

  • Build :CITES relationships in the graph
  • Match references across reports
  • Cluster similar reports by their sources
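
For a taste of what that later work involves, here is a rough sketch of reading titles and dates back out of the TEI XML with Python’s standard library (exact element paths can vary between GROBID versions, so treat this as illustrative):

    import xml.etree.ElementTree as ET

    TEI = {"tei": "http://www.tei-c.org/ns/1.0"}
    tree = ET.parse("report.biblio.xml")

    # Each <biblStruct> element is one parsed reference
    for bibl in tree.iter("{http://www.tei-c.org/ns/1.0}biblStruct"):
        title = bibl.find(".//tei:title", TEI)
        date = bibl.find(".//tei:date", TEI)
        print(
            title.text if title is not None else "(no title)",
            date.get("when", "?") if date is not None else "?",
        )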

How This Works in Code

Our system includes a script that runs both steps automatically:

python extract_text.py my-report.pdf

This script:

  1. Extracts the full text using PyMuPDF → my-report.txt
  2. Sends the PDF to the GROBID API → my-report.biblio.xml

The script runs inside a Docker container and outputs files to the /data/output/ folder (data/output on your local machine).

Try It Yourself

If you’ve followed the setup from Part 2, you can run:

make pipeline PDF=/data/input/sample1.pdf

This will:

  • Extract text and citations
  • Tag entities (coming up in Part 4)
  • Load data into Neo4j
  • Export annotated text for review

Check the results in data/output/; you should see .txt, .entities.json, and .biblio.xml files.


What We Learned

  • PDFs are complex and tricky, but PyMuPDF gives us reliable plain text to work with
  • GROBID gives us structured citations, ready for linking
  • Clean text is the foundation for everything that follows

What’s Next?

In the next post, we’ll explore how we tag entities in the text using AI, recognizing organizations, locations, ecological concepts, and more.

🕊️ Part of the Digital Library of Integral Ecology: Building open, multilingual tools for ecological understanding.
