Building Blocks: Documents, Entities, and Relationships
How reports are transformed into structured, searchable networks of knowledge.
In our last post, we introduced the vision: building a knowledge graph for integral ecology, a way to connect people, places, organizations, and ideas across languages and disciplines.
But how exactly do we turn messy PDFs and complex reports into a meaningful, searchable web of knowledge?
We start with three key building blocks:
1. Documents
The foundation of our knowledge graph is a document — a report, academic paper, policy brief, or even a faith-based reflection.
Each document:
- Is a single file (usually a PDF)
- Has a title, a language, and a source (UNEP, WWF, OpenAlex, etc.)
- Contains many sentences, with lots of information hidden in plain text
We treat each document as a node in the graph, and from there, we extract meaning.
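To make that concrete, here is a minimal sketch of how a document might become a node, using Python and the official neo4j driver (the same database the pipeline runs). The connection details and property names are illustrative, not the project's actual configuration or schema.

```python
# A minimal sketch of turning one document into a graph node.
# The URI, credentials, and property names are illustrative, not the
# project's actual configuration or schema.
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

def add_document(tx, title, language, source):
    # MERGE keeps the node unique if the same document is ingested twice
    tx.run(
        "MERGE (d:Document {title: $title}) "
        "SET d.language = $language, d.source = $source",
        title=title, language=language, source=source,
    )

with driver.session() as session:
    session.execute_write(add_document, "Living Planet Report", "en", "WWF")

driver.close()
```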
2. Entities
Entities are the things a document talks about: people, organizations, species, places, and ecological concepts.
For example:
| Entity Text | Label |
|---|---|
| Amazon rainforest | LOCATION |
| WWF | ORG |
| climate resilience | ECO_CONCEPT |
| Laudato Si’ | DOCUMENT |
Our system uses Natural Language Processing (NLP) tools to automatically recognize these entities in many languages, with models trained on large text collections.
Later, we’ll even fine-tune our own models to be more accurate for ecological language.
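As a taste of what that looks like in practice, here is a tiny sketch using spaCy, one common NLP library (the pipeline's actual tooling may differ). A pretrained model ships with generic labels like ORG and DATE; domain labels like ECO_CONCEPT are exactly what the later fine-tuning is for.

```python
# Minimal named-entity recognition sketch with spaCy's small English model.
# Install first: pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("WWF published a new assessment of the Amazon rainforest in 2022.")

for ent in doc.ents:
    print(ent.text, ent.label_)
# Prints entities with the model's generic labels (e.g. WWF -> ORG,
# 2022 -> DATE); exact spans and labels depend on the model version.
```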
3. Relationships
The real power of a knowledge graph comes from the connections between entities, also called edges or relationships.
Some examples:
- A document MENTIONS an entity (a location, an organization, etc.)
- A document CITES another document
- A concept IS_RELATED_TO another concept
- An organization WORKS_IN a specific region
These relationships turn isolated data points into an interconnected network — where you can explore patterns, paths, and shared meaning.
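Before any of this reaches the database, a relationship is just a triple: a source, a relationship type, and a target. Here is an illustrative sketch of that shape (the concrete values, including the IS_RELATED_TO pairing, are made up for the example):

```python
# Relationships as plain (source, relationship_type, target) triples,
# the shape they take before being loaded into the graph.
# The concrete values below are illustrative.
from typing import NamedTuple

class Edge(NamedTuple):
    source: str
    rel_type: str
    target: str

edges = [
    Edge("WWF Report", "MENTIONS", "Amazon rainforest"),
    Edge("WWF Report", "CITES", "IPBES 2022 Report"),
    Edge("WWF", "WORKS_IN", "Amazon basin"),
    Edge("climate resilience", "IS_RELATED_TO", "adaptation"),
]

for e in edges:
    print(f"({e.source}) -[:{e.rel_type}]-> ({e.target})")
```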
Putting It Together
Here's a simple example:
```
[WWF Report] --MENTIONS--> [Amazon rainforest]
[WWF Report] --MENTIONS--> [climate resilience]
[WWF Report] --CITES-----> [IPBES 2022 Report]
```
In the graph database, each of these is a node (document or entity) and each arrow is a relationship.
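Here is how that example subgraph could be written to Neo4j, again via the Python driver. The labels and relationship types mirror the diagram, though the real pipeline's schema may differ.

```python
# Sketch: writing the example subgraph above to Neo4j. Labels and
# relationship types mirror the diagram; the pipeline's real schema
# may differ.
from neo4j import GraphDatabase

CYPHER = """
MERGE (r:Document {title: 'WWF Report'})
MERGE (a:Entity {text: 'Amazon rainforest', label: 'LOCATION'})
MERGE (c:Entity {text: 'climate resilience', label: 'ECO_CONCEPT'})
MERGE (i:Document {title: 'IPBES 2022 Report'})
MERGE (r)-[:MENTIONS]->(a)
MERGE (r)-[:MENTIONS]->(c)
MERGE (r)-[:CITES]->(i)
"""

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))
with driver.session() as session:
    session.run(CYPHER)
driver.close()
```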
We can now:
- Search for all reports mentioning "climate resilience" (see the query sketch below)
- Find which NGOs cite a particular scientific assessment
- Map ecological priorities across language and region
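In Cypher, Neo4j's query language, that first question might look like this; the property and label names follow the illustrative schema used above.

```python
# Sketch: "which reports mention climate resilience?" in Cypher,
# using the illustrative schema from the example above.
from neo4j import GraphDatabase

QUERY = """
MATCH (d:Document)-[:MENTIONS]->(:Entity {text: 'climate resilience'})
RETURN d.title AS report
"""

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))
with driver.session() as session:
    for record in session.run(QUERY):
        print(record["report"])
driver.close()
```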
Why It Matters
This model gives us:
- Structure — so we can search and analyze consistently
- Scalability — works for hundreds or thousands of documents
- Interoperability — can be visualized, queried, and shared
And it sets the stage for automation, collaboration, and learning.
Try It Yourself: Clone & Run the Project
You can explore and run this pipeline locally using Docker and a terminal.
Prerequisites
You'll need Docker, Make, and Git installed.
Step 1: Clone the Repository
```bash
git clone https://github.com/clirdlf/dlie_knowledge_graph.git
cd dlie_knowledge_graph
```
Step 2: Build and Start the System
```bash
make build
make up
```
This will spin up:
- GROBID for citation parsing
- Neo4j for the knowledge graph
- Doccano for annotation
- A Python environment for text and entity extraction
Then, you can run the full pipeline like this:
```bash
make pipeline PDF=/data/input/sample1.pdf
```
You’ll find the results in the `data/output/` and `data/doccano/` folders.
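If you want to inspect the annotation export programmatically, here is a small sketch that assumes a Doccano-style JSONL file (one JSON object per line with "text" and "label" fields). The file name and field names are assumptions about the pipeline's output format, so adjust them to match what you actually find in those folders.

```python
# Sketch: reading a Doccano-style JSONL export (one JSON object per
# line). The file name and the "text"/"label" fields are assumptions
# about the pipeline's output format; adjust to match yours.
import json
from pathlib import Path

for line in Path("data/doccano/sample1.jsonl").read_text().splitlines():
    record = json.loads(line)
    print(record.get("text", "")[:80])   # first 80 characters of the text
    print(record.get("label", []))       # e.g. [[start, end, "ECO_CONCEPT"], ...]
```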
What’s Next?
In the next post, we’ll start extracting text from real reports, even messy PDFs, using smart tools like PyMuPDF and GROBID.