
Scaling the Knowledge Graph Infrastructure for Integral Ecology

How you can help build and grow this open knowledge commons.

As the Digital Library of Integral Ecology grows, so do the challenges of processing, tagging, and organizing its multilingual reports and academic papers. What works for a dozen documents quickly becomes insufficient at hundreds or thousands.

In this post, we introduce key concepts and tools to scale the pipeline we've built, from extracting text and tagging entities to storing relationships in a knowledge graph. We’ll walk through how to introduce ETL (Extract, Transform, Load) pipelines and scalable infrastructure to support real-world, high-volume processing.


What is ETL?

ETL stands for:

  • Extract: Pull data from sources (PDFs, metadata, external databases)
  • Transform: Clean, tag, and enrich with NLP (NER, language detection, taxonomic linking)
  • Load: Store structured output into a database or knowledge graph (e.g., Neo4j)

This separation of concerns helps you:

  • Track and retry failed documents
  • Modularize steps (e.g., swap out NER models)
  • Scale each stage independently (more extractors, more model containers, etc.)
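As a rough sketch, the three stages map onto three functions that can be developed, retried, and scaled separately (the function names here are illustrative, not the project's actual modules):

def extract(path):
    """Pull raw text and metadata out of a PDF or .txt source."""
    ...

def transform(text):
    """Detect language, run NER, and link taxa; return structured records."""
    ...

def load(records):
    """Write entities and relationships into the knowledge graph (e.g. Neo4j)."""
    ...

def process_document(path):
    # Keeping stages separate makes it easy to swap one out, e.g. a new NER model
    load(transform(extract(path)))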

The Current Stack

So far, this project uses:

  • make pipeline to run all processing on local files
  • Docker Compose to isolate workers and services
  • spaCy + TaxoNERD to tag entities
  • Neo4j for the knowledge graph
  • Doccano for annotation and model refinement

This is perfect for prototyping — but it runs serially and assumes a single machine.


Scaling Up: Concepts and Tools

Let's look at where we can go from here.

1. Batch or Parallel Processing

Instead of one big loop, we can break processing into chunks:

  • Each PDF or .txt becomes a job
  • Use queues (like Redis, RabbitMQ, or Kafka) to dispatch tasks
  • Workers (running in Docker or Kubernetes) pull and process jobs independently

This means you can scale from 1 to 100 workers with minimal code changes.
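Here is a minimal sketch of that pattern with Redis and redis-py (the queue name, connection details, and process_document helper are hypothetical):

import json
import redis

r = redis.Redis(host="localhost", port=6379)

def enqueue(path):
    # Each PDF or .txt becomes one independent job on the "documents" queue
    r.lpush("documents", json.dumps({"path": path}))

def worker_loop():
    # Any number of workers can run this loop; BRPOP hands each job to exactly one of them
    while True:
        _, payload = r.brpop("documents")
        job = json.loads(payload)
        process_document(job["path"])  # hypothetical: extract, tag, and load this file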

2. ETL Orchestration with Airflow or Prefect

Apache Airflow or Prefect can:

  • Define data pipelines as Python DAGs (directed acyclic graphs)
  • Track failures, retries, and run history
  • Schedule jobs (e.g., run daily, every hour, or on new upload)

In Airflow, for example, you define the pipeline as a DAG:

from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

with DAG("extract_entities", schedule_interval="@hourly",
         start_date=datetime(2025, 1, 1), catchup=False) as dag:
    extract = PythonOperator(...)       # task_id and python_callable omitted here
    detect_lang = PythonOperator(...)
    run_spacy = PythonOperator(...)
    push_to_graph = PythonOperator(...)
    extract >> detect_lang >> run_spacy >> push_to_graph  # run order

3. Use Object Storage for Large Files

Instead of keeping files on disk:

  • Store .pdf, .txt, .jsonl in Amazon S3 (or MinIO for self-hosted)
  • Tag files with metadata (e.g., language, processed status)
  • Stream files into your workers as needed

This makes your pipeline stateless and easier to scale.
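A sketch of what that looks like with boto3 against S3 or MinIO (the endpoint, bucket, and key names are placeholders):

import boto3

# For self-hosted MinIO, point the client at your own endpoint; drop endpoint_url for AWS S3
s3 = boto3.client("s3", endpoint_url="http://localhost:9000")

def fetch_text(bucket, key):
    # Stream the object into the worker instead of reading from local disk
    obj = s3.get_object(Bucket=bucket, Key=key)
    return obj["Body"].read().decode("utf-8")

def mark_processed(bucket, key, language):
    # Keep pipeline state as object tags rather than files on a single machine
    s3.put_object_tagging(
        Bucket=bucket,
        Key=key,
        Tagging={"TagSet": [
            {"Key": "processed", "Value": "true"},
            {"Key": "language", "Value": language},
        ]},
    )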

Scaling Model Inference

As we add more complex NER models (like en_ner_eco_biobert), inference gets slower. One way to deal with this is to wrap the spaCy model in a small microservice (sketched here with FastAPI) that can be scaled horizontally, independently of the rest of the pipeline:

from fastapi import FastAPI
import spacy

app = FastAPI()
nlp = spacy.load("en_ner_eco_biobert")  # assumes the TaxoNERD model package is installed

@app.post("/ner")
def tag(text: str):
    doc = nlp(text)
    return [{"text": ent.text, "label": ent.label_} for ent in doc.ents]
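A hypothetical client call against that service, assuming it is running locally on port 8000 and keeps the query-parameter signature above:

import requests

# The sketch above takes text as a query parameter on the POST request
resp = requests.post(
    "http://localhost:8000/ner",
    params={"text": "Ailuropoda melanoleuca populations in Sichuan"},
)
print(resp.json())  # a list of {"text": ..., "label": ...} entity dicts

Because the service holds no state between requests, several replicas can run behind a load balancer while the rest of the pipeline stays unchanged.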

Knowledge Graph Growth

As our Neo4j graph grows, the way we write to and query it needs to grow as well. Pushing one node or relationship per query stops scaling quickly; batching writes and indexing the properties we merge on keeps load times manageable.
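A minimal sketch of batched upserts with the official neo4j Python driver (the node labels, relationship type, and credentials are illustrative, not the project's actual schema):

from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

# One UNWIND query per batch instead of one round trip per entity
BATCH_QUERY = """
UNWIND $rows AS row
MERGE (d:Document {id: row.doc_id})
MERGE (t:Taxon {name: row.taxon})
MERGE (d)-[:MENTIONS]->(t)
"""

def push_batch(rows):
    # rows is a list of dicts, e.g. {"doc_id": "...", "taxon": "..."}
    with driver.session() as session:
        session.run(BATCH_QUERY, rows=rows)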


Distributed Training & Annotation

As our labeled data grows, annotation in Doccano and retraining of the spaCy models become a pipeline stage of their own: annotation exports feed training runs, and the refined NER models are then rolled back out to the workers.
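A rough sketch of one piece of that loop, converting a Doccano JSONL export into spaCy training data (Doccano's export fields vary by version, so treat the format assumed here as illustrative):

import json
import spacy
from spacy.tokens import DocBin

nlp = spacy.blank("en")
doc_bin = DocBin()

# Each line is assumed to look like {"text": "...", "label": [[start, end, "TAXON"], ...]}
with open("doccano_export.jsonl", encoding="utf-8") as f:
    for line in f:
        record = json.loads(line)
        doc = nlp.make_doc(record["text"])
        spans = [doc.char_span(start, end, label=label)
                 for start, end, label in record.get("label", [])]
        doc.ents = [span for span in spans if span is not None]  # drop misaligned spans
        doc_bin.add(doc)

doc_bin.to_disk("train.spacy")  # input for spacy train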


Monitoring and Fault Tolerance

Many of the pieces above already help here: queues make it possible to retry individual failed documents instead of re-running the whole pipeline, and orchestrators like Airflow or Prefect track run history, retries, and failures per task.

Scaling for Ecological Impact

Integral ecology is inherently global and multilingual. By scaling this knowledge graph infrastructure, we can enable powerful tools to:

  • Track biodiversity loss across regions and languages
  • Compare ecological concepts in UN reports and academic literature
  • Enable public tools for data exploration and learning

Let's go build a smarter, more scalable digital library!

Questions? Want more code examples? Jump into the repo or open an issue!


🕊️ Part of the Digital Library of Integral Ecology: Building open, multilingual tools for ecological understanding.
