
Building Blocks: Documents, Entities, and Relationships

How reports are transformed into structured, searchable networks of knowledge.

In our last post, we introduced the vision: building a knowledge graph for integral ecology, a way to connect people, places, organizations, and ideas across languages and disciplines.

But how exactly do we turn messy PDFs and complex reports into a meaningful, searchable web of knowledge?

We start with three key building blocks:


1. Documents

The foundation of our knowledge graph is the document: a report, academic paper, policy brief, or even a faith-based reflection.

Each document is:

  • A single file (usually a PDF)
  • With a title, language, source (UNEP, WWF, OpenAlex, etc.)
  • That contains many sentences, with lots of information hidden in plain text

We treat each document as a node in the graph, and from there, we extract meaning.
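
To make that concrete, here is a minimal sketch of how a document could become a node in Neo4j (the graph database used later in this post), via the official neo4j Python driver. The connection details, property names, and sample values here are illustrative assumptions, not the project's actual schema.

# A minimal sketch of storing a document as a node, using the official
# neo4j Python driver. The URI, credentials, property names, and sample
# values are illustrative assumptions, not the project's actual schema.
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

def add_document(tx, title, language, source, path):
    # MERGE keyed on the file path avoids duplicate nodes on re-ingestion
    tx.run(
        "MERGE (d:Document {path: $path}) "
        "SET d.title = $title, d.language = $language, d.source = $source",
        path=path, title=title, language=language, source=source,
    )

with driver.session() as session:
    session.execute_write(
        add_document,
        title="Sample WWF Report",
        language="en",
        source="WWF",
        path="/data/input/sample1.pdf",
    )

driver.close()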


2. Entities

Entities are the things a document talks about: people, organizations, species, places, and ecological concepts.

For example:

Entity Text           Label
Amazon rainforest     LOCATION
WWF                   ORG
climate resilience    ECO_CONCEPT
Laudato Si’           DOCUMENT

Our system uses Natural Language Processing (NLP) tools to automatically recognize these entities in many languages, with models trained on large text collections.
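
As a rough illustration, here is what entity recognition looks like with spaCy, one widely used NLP library (an assumption: the post describes NLP tools generally rather than naming this one). A general-purpose model only knows generic labels like ORG and GPE; domain labels such as ECO_CONCEPT would come from custom training.

# A sketch of automatic entity recognition using spaCy (an assumption:
# the post names NLP tools generally, not this library specifically).
# Requires: pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")

text = "WWF published a report on climate resilience in the Amazon rainforest."
doc = nlp(text)

for ent in doc.ents:
    print(ent.text, "->", ent.label_)

# A general-purpose model will find entities like WWF (ORG); domain labels
# such as ECO_CONCEPT require a custom-trained or fine-tuned model.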

Later, we’ll even fine-tune our own models to be more accurate for ecological language.


3. Relationships

The real power of a knowledge graph comes from the connections between entities, also called edges or relationships.

Some examples:

  • A document MENTIONS an entity (a location, an organization, etc.)
  • A document CITES another document
  • A concept IS_RELATED_TO another concept
  • An organization WORKS_IN a specific region

These relationships turn isolated data points into an interconnected network, one where you can explore patterns, paths, and shared meaning.
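
In Neo4j, each of these becomes an edge between two nodes. Here is a hedged sketch of creating a MENTIONS relationship with the Python driver; the node labels and property names mirror the illustrative schema above and are assumptions, not the project's confirmed design.

# Link a document node to an entity node with a MENTIONS edge.
# Schema (Document.path, Entity.name/label) is an illustrative assumption.
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

def link_mention(tx, doc_path, entity_name, entity_label):
    # MERGE creates the entity and the edge only if they don't already exist
    tx.run(
        "MATCH (d:Document {path: $doc_path}) "
        "MERGE (e:Entity {name: $name, label: $label}) "
        "MERGE (d)-[:MENTIONS]->(e)",
        doc_path=doc_path, name=entity_name, label=entity_label,
    )

with driver.session() as session:
    session.execute_write(link_mention,
                          "/data/input/sample1.pdf",
                          "Amazon rainforest", "LOCATION")

driver.close()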


Putting It Together

Here's a simple example:

[WWF Report] --MENTIONS--> [Amazon rainforest]
[WWF Report] --MENTIONS--> [climate resilience]
[WWF Report] --CITES-----> [IPBES 2022 Report]

In the graph database, each of these is a node (document or entity) and each arrow is a relationship.

We can now:

  • Search for all reports mentioning "climate resilience" (see the sketch after this list)
  • Find which NGOs cite a particular scientific assessment
  • Map ecological priorities across language and region
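
For instance, the first query above could look like this, run through the Python driver and again assuming the illustrative schema from the earlier sketches:

# Find every document that mentions "climate resilience".
# Node labels and properties follow the illustrative schema above.
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

with driver.session() as session:
    records = session.run(
        "MATCH (d:Document)-[:MENTIONS]->(e:Entity {name: $name}) "
        "RETURN d.title AS title, d.source AS source",
        name="climate resilience",
    )
    for record in records:
        print(record["title"], "-", record["source"])

driver.close()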

Why It Matters

This model gives us:

  • Structure: we can search and analyze consistently
  • Scalability: the same model works for hundreds or thousands of documents
  • Interoperability: the graph can be visualized, queried, and shared

And it sets the stage for automation, collaboration, and learning.


Try It Yourself: Clone & Run the Project

You can explore and run this pipeline locally using Docker and the terminal.

Prerequisites

  • Docker
  • Git
  • GNU Make

Step 1: Clone the Repository

git clone https://github.com/clirdlf/dlie_knowledge_graph.git
cd dlie_knowledge_graph

Step 2: Build and Start the System

make build
make up

This will spin up:

  • GROBID for citation parsing
  • Neo4j for the knowledge graph
  • Doccano for annotation
  • A Python environment for text and entity extraction

Then, you can run the full pipeline like this:

make pipeline PDF=/data/input/sample1.pdf

You’ll find the results in the data/output/ and data/doccano/ folders.

What’s Next?

In the next post, we’ll start extracting text from real reports, even messy PDFs, using smart tools like PyMuPDF and GROBID.

Previous Post

What is a Knowledge Graph for Integral Ecology?

Next Post

From PDF to Text: Extracting Meaning from Documents

🕊️ Part of the Digital Library of Integral Ecology: Building open, multilingual tools for ecological understanding.

© 2025 CLIR. All rights reserved.