From PDF to Text: Extracting Meaning from Documents
Learn how we pull clean, useful text from messy PDFs and scientific citations.
The first step in building a knowledge graph for integral ecology is simple in concept, but tricky in practice:
How do we get clean, usable text from messy, multilingual PDF reports?
This post explains how we extract both the raw text and the structured citations from reports using two tools: PyMuPDF and GROBID.
Why This Matters
PDFs are designed for printing, not for reading by machines.
They can include:
- Columns and footnotes
- Images, tables, and scanned pages
- Embedded fonts or malformed characters
- Multiple languages in one document
If we want to detect entities and link knowledge later, we need high-quality plain text.
Step 1: Extract Text with PyMuPDF
We use PyMuPDF to extract the text page by page from a PDF.
It:
- Preserves layout well
- Handles multiple languages
- Works with scanned and OCR'd documents, as long as a text layer is embedded
Example output:
The Amazon rainforest is shrinking rapidly.
WWF reported that deforestation increased 12% in 2023.
This text gets saved as report.txt.
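Here's a minimal sketch of that extraction step, assuming PyMuPDF is installed (pip install pymupdf); the file names are just illustrative:

```python
# Minimal sketch: pull plain text from a PDF page by page with PyMuPDF.
# File names are illustrative, not the pipeline's actual paths.
import pymupdf  # older releases use `import fitz` instead

doc = pymupdf.open("report.pdf")
text = "\n".join(page.get_text() for page in doc)
doc.close()

with open("report.txt", "w", encoding="utf-8") as f:
    f.write(text)
```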
Step 2: Extract Citations with GROBID
Next, we use GROBID, a tool that reads the bibliography and metadata of academic papers and reports.
GROBID converts messy citation lists like:
[12] Smith, J., “Biodiversity and Forests”, Nature, 2020
Into structured, machine-readable TEI XML, which can include:
- Title
- Authors
- Year
- Journal or publisher
- DOI or identifiers
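For the Smith citation above, the TEI output looks roughly like this (abridged and illustrative; GROBID's actual markup carries more attributes):

```xml
<biblStruct xml:id="b12">
  <analytic>
    <title level="a">Biodiversity and Forests</title>
    <author>
      <persName><forename>J.</forename><surname>Smith</surname></persName>
    </author>
  </analytic>
  <monogr>
    <title level="j">Nature</title>
    <imprint><date when="2020">2020</date></imprint>
  </monogr>
</biblStruct>
```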
We save this as report.biblio.xml.
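Under the hood, this is a single HTTP call. Here's a minimal sketch using requests, assuming a GROBID server is running on its default port (8070):

```python
# Sketch: ask a running GROBID server for a PDF's structured references.
# Assumes GROBID is reachable at localhost:8070 (its default port).
import requests

with open("report.pdf", "rb") as f:
    resp = requests.post(
        "http://localhost:8070/api/processReferences",
        files={"input": f},
    )
resp.raise_for_status()

with open("report.biblio.xml", "w", encoding="utf-8") as out:
    out.write(resp.text)
```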
Later in the pipeline, this will help us:
- Build :CITES relationships in the graph
- Match references across reports
- Cluster similar reports by their sources
How This Works in Code
Our system includes a script that does all of this automatically:
python extract_text.py my-report.pdf
This script:
- Extracts the full text using PyMuPDF → my-report.txt
- Sends the PDF to the GROBID API → my-report.biblio.xml
The script runs inside a Docker container and writes its output to the /data/output/ folder (data/output on your local machine).
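If you're curious what that looks like end to end, here's a hypothetical sketch; the real extract_text.py may differ in naming and error handling, and the output paths and GROBID URL here are assumptions:

```python
# Hypothetical end-to-end sketch of a script like extract_text.py.
# Output paths and the GROBID URL are assumptions, not the real script's config.
import sys
from pathlib import Path

import pymupdf
import requests

GROBID_URL = "http://localhost:8070/api/processReferences"  # assumed default
OUT_DIR = Path("/data/output")

def main(pdf_file: str) -> None:
    pdf_path = Path(pdf_file)
    OUT_DIR.mkdir(parents=True, exist_ok=True)

    # Step 1: plain text via PyMuPDF
    doc = pymupdf.open(pdf_path)
    text = "\n".join(page.get_text() for page in doc)
    (OUT_DIR / f"{pdf_path.stem}.txt").write_text(text, encoding="utf-8")

    # Step 2: structured citations via the GROBID API
    with open(pdf_path, "rb") as f:
        resp = requests.post(GROBID_URL, files={"input": f})
    resp.raise_for_status()
    (OUT_DIR / f"{pdf_path.stem}.biblio.xml").write_text(resp.text, encoding="utf-8")

if __name__ == "__main__":
    main(sys.argv[1])
```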
Try It Yourself
If you’ve followed the setup from Part 2, you can run:
make pipeline PDF=/data/input/sample1.pdf
This will:
- Extract text and citations
- Tag entities (coming up in Part 4)
- Load data into Neo4j
- Export annotated text for review
Check the results in data/output/; you should see .txt, .entities.json, and .biblio.xml files.
What We Learned
- PDFs are complex and tricky, but PyMuPDF gives us reliable plain text to work with
- GROBID gives us structured citations, ready for linking
- Clean text is the foundation for everything that follows
What’s Next?
In the next post, we’ll explore how we tag entities in the text using AI, recognizing organizations, locations, ecological concepts, and more.