
Scaling the Knowledge Graph Infrastructure for Integral Ecology

How you can help build and grow this open knowledge commons.

As the Digital Library of Integral Ecology grows, so do the challenges of processing, tagging, and organizing its multilingual reports and academic papers. What works for a dozen documents quickly becomes insufficient at hundreds or thousands.

In this post, we introduce key concepts and tools to scale the pipeline we've built, from extracting text and tagging entities to storing relationships in a knowledge graph. We’ll walk through how to introduce ETL (Extract, Transform, Load) pipelines and scalable infrastructure to support real-world, high-volume processing.


What is ETL?

ETL stands for:

  • Extract: Pull data from sources (PDFs, metadata, external databases)
  • Transform: Clean, tag, and enrich with NLP (NER, language detection, taxonomic linking)
  • Load: Store structured output into a database or knowledge graph (e.g., Neo4j)

This separation of concerns helps you:

  • Track and retry failed documents
  • Modularize steps (e.g., swap out NER models)
  • Scale each stage independently (more extractors, more model containers, etc.)
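As a rough sketch, the three stages map onto three functions that can be developed, retried, and scaled separately (the function names here are illustrative, not the project's actual modules):

def extract(path):
    """Pull raw text and metadata out of a PDF or .txt source."""
    ...

def transform(text):
    """Detect language, run NER, and link taxa; return structured records."""
    ...

def load(records):
    """Write entities and relationships into the knowledge graph (e.g. Neo4j)."""
    ...

def process_document(path):
    # Keeping stages separate makes it easy to swap one out, e.g. a new NER model
    load(transform(extract(path)))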

The Current Stack

So far, this project uses:

  • make pipeline to run all processing on local files
  • Docker Compose to isolate workers and services
  • spaCy + TaxoNERD to tag entities
  • Neo4j for the knowledge graph
  • Doccano for annotation and model refinement

This is perfect for prototyping — but it runs serially and assumes a single machine.


Scaling Up: Concepts and Tools

Let's look at where we can go from here.

1. Batch or Parallel Processing

Instead of one big loop, we can break processing into chunks:

  • Each PDF or .txt becomes a job
  • Use queues (like Redis, RabbitMQ, or Kafka) to dispatch tasks
  • Workers (running in Docker or Kubernetes) pull and process jobs independently

This means you can scale from 1 to 100 workers with minimal code changes.
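Here is a minimal sketch of that pattern with Redis and redis-py (the queue name, connection details, and process_document helper are hypothetical):

import json
import redis

r = redis.Redis(host="localhost", port=6379)

def enqueue(path):
    # Each PDF or .txt becomes one independent job on the "documents" queue
    r.lpush("documents", json.dumps({"path": path}))

def worker_loop():
    # Any number of workers can run this loop; BRPOP hands each job to exactly one of them
    while True:
        _, payload = r.brpop("documents")
        job = json.loads(payload)
        process_document(job["path"])  # hypothetical: extract, tag, and load this file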

2. ETL Orchestration with Airflow or Prefect

Apache Airflow or Prefect can:

  • Define data pipelines as Python DAGs (directed acyclic graphs)
  • Track failures, retries, and run history
  • Schedule jobs (e.g., run daily, every hour, or on new upload)

In Airflow, for example, you define the pipeline as a DAG:

from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

with DAG("extract_entities", schedule_interval="@hourly",
         start_date=datetime(2025, 1, 1), catchup=False) as dag:
    extract = PythonOperator(...)       # task_id and python_callable omitted here
    detect_lang = PythonOperator(...)
    run_spacy = PythonOperator(...)
    push_to_graph = PythonOperator(...)
    extract >> detect_lang >> run_spacy >> push_to_graph  # run order

3. Use Object Storage for Large Files

Instead of keeping files on disk:

  • Store .pdf, .txt, .jsonl in Amazon S3 (or MinIO for self-hosted)
  • Tag files with metadata (e.g., language, processed status)
  • Stream files into your workers as needed

This makes your pipeline stateless and easier to scale.
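A sketch of what that looks like with boto3 against S3 or MinIO (the endpoint, bucket, and key names are placeholders):

import boto3

# For self-hosted MinIO, point the client at your own endpoint; drop endpoint_url for AWS S3
s3 = boto3.client("s3", endpoint_url="http://localhost:9000")

def fetch_text(bucket, key):
    # Stream the object into the worker instead of reading from local disk
    obj = s3.get_object(Bucket=bucket, Key=key)
    return obj["Body"].read().decode("utf-8")

def mark_processed(bucket, key, language):
    # Keep pipeline state as object tags rather than files on a single machine
    s3.put_object_tagging(
        Bucket=bucket,
        Key=key,
        Tagging={"TagSet": [
            {"Key": "processed", "Value": "true"},
            {"Key": "language", "Value": language},
        ]},
    )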

Scaling Model Inference

As we add more complex NER models (like en_ner_eco_biobert), inference gets slower. One way to deal with this is to wrap the spaCy model in a small microservice (sketched here with FastAPI) that can be scaled horizontally, independently of the rest of the pipeline:

from fastapi import FastAPI
import spacy

app = FastAPI()
nlp = spacy.load("en_ner_eco_biobert")  # assumes the TaxoNERD model package is installed

@app.post("/ner")
def tag(text: str):
    doc = nlp(text)
    return [{"text": ent.text, "label": ent.label_} for ent in doc.ents]
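A hypothetical client call against that service, assuming it is running locally on port 8000 and keeps the query-parameter signature above:

import requests

# The sketch above takes text as a query parameter on the POST request
resp = requests.post(
    "http://localhost:8000/ner",
    params={"text": "Ailuropoda melanoleuca populations in Sichuan"},
)
print(resp.json())  # a list of {"text": ..., "label": ...} entity dicts

Because the service holds no state between requests, several replicas can run behind a load balancer while the rest of the pipeline stays unchanged.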

Knowledge Graph Growth

As our Neo4j graph grows, the way we write to and query it needs to grow as well. Pushing one node or relationship per query stops scaling quickly; batching writes and indexing the properties we merge on keeps load times manageable.
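A minimal sketch of batched upserts with the official neo4j Python driver (the node labels, relationship type, and credentials are illustrative, not the project's actual schema):

from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

# One UNWIND query per batch instead of one round trip per entity
BATCH_QUERY = """
UNWIND $rows AS row
MERGE (d:Document {id: row.doc_id})
MERGE (t:Taxon {name: row.taxon})
MERGE (d)-[:MENTIONS]->(t)
"""

def push_batch(rows):
    # rows is a list of dicts, e.g. {"doc_id": "...", "taxon": "..."}
    with driver.session() as session:
        session.run(BATCH_QUERY, rows=rows)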


Distributed Training & Annotation

As our labeled data grows, annotation in Doccano and retraining of the spaCy models become a pipeline stage of their own: annotation exports feed training runs, and the refined NER models are then rolled back out to the workers.
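A rough sketch of one piece of that loop, converting a Doccano JSONL export into spaCy training data (Doccano's export fields vary by version, so treat the format assumed here as illustrative):

import json
import spacy
from spacy.tokens import DocBin

nlp = spacy.blank("en")
doc_bin = DocBin()

# Each line is assumed to look like {"text": "...", "label": [[start, end, "TAXON"], ...]}
with open("doccano_export.jsonl", encoding="utf-8") as f:
    for line in f:
        record = json.loads(line)
        doc = nlp.make_doc(record["text"])
        spans = [doc.char_span(start, end, label=label)
                 for start, end, label in record.get("label", [])]
        doc.ents = [span for span in spans if span is not None]  # drop misaligned spans
        doc_bin.add(doc)

doc_bin.to_disk("train.spacy")  # input for spacy train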


Monitoring and Fault Tolerance

Many of the pieces above already help here: queues make it possible to retry individual failed documents instead of re-running the whole pipeline, and orchestrators like Airflow or Prefect track run history, retries, and failures per task.

Scaling for Ecological Impact

Integral ecology is inherently global and multilingual. By scaling this knowledge graph infrastructure, we can enable powerful tools to:

  • Track biodiversity loss across regions and languages
  • Compare ecological concepts in UN reports and academic literature
  • Enable public tools for data exploration and learning

Let's go build a smarter, more scalable digital library!

Questions? Want more code examples? Jump into the repo or open an issue!


🕊️ Part of the Digital Library of Integral Ecology: Building open, multilingual tools for ecological understanding.
