Scaling the Knowledge Graph Infrastructure for Integral Ecology
How you can help build and grow this open knowledge commons.
As the Digital Library of Integral Ecology grows, so do the challenges of processing, tagging, and organizing its multilingual reports and academic papers. What works for a dozen documents quickly becomes insufficient at hundreds or thousands.
In this post, we introduce key concepts and tools to scale the pipeline we've built, from extracting text and tagging entities to storing relationships in a knowledge graph. We’ll walk through how to introduce ETL (Extract, Transform, Load) pipelines and scalable infrastructure to support real-world, high-volume processing.
What is ETL?
ETL stands for:
- Extract: Pull data from sources (PDFs, metadata, external databases)
- Transform: Clean, tag, and enrich with NLP (NER, language detection, taxonomic linking)
- Load: Store structured output into a database or knowledge graph (e.g., Neo4j)
This separation of concerns helps you:
- Track and retry failed documents
- Modularize steps (e.g., swap out NER models)
- Scale each stage independently (more extractors, more model containers, etc.)
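In code, the three stages can be as simple as three functions chained together. Here is a minimal sketch (the file paths and function bodies are placeholders, not the project's actual implementation):

from pathlib import Path

def extract(path: Path) -> str:
    # Extract: pull raw text out of a source document
    return path.read_text(encoding="utf-8")

def transform(text: str) -> dict:
    # Transform: clean the text and enrich it (language detection, NER, taxonomic linking)
    return {"text": " ".join(text.split()), "entities": []}

def load(record: dict) -> None:
    # Load: write the structured record to a database or knowledge graph
    print("would write to graph:", record["entities"])

for doc in Path("data/txt").glob("*.txt"):  # hypothetical input folder
    load(transform(extract(doc)))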
The Current Stack
So far, this project uses:
- make pipeline to run all processing on local files
- Docker Compose to isolate workers and services
- spaCy + TaxoNERD to tag entities
- Neo4j for the knowledge graph
- Doccano for annotation and model refinement
This is perfect for prototyping — but it runs serially and assumes a single machine.
Scaling Up: Concepts and Tools
Let's look at where we can go from here.
1. Batch or Parallel Processing
Instead of one big loop, we can break processing into chunks:
- Each PDF or .txt becomes a job
- Use queues (like Redis, RabbitMQ, or Kafka) to dispatch tasks
- Workers (running in Docker or Kubernetes) pull and process jobs independently
This means you can scale from 1 to 100 workers with minimal code changes.
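A minimal sketch of the dispatch side, using Redis with the RQ library (the process_document task and the queue name are assumptions, not existing code):

# enqueue.py -- create one job per document
from pathlib import Path
from redis import Redis
from rq import Queue

from pipeline.tasks import process_document  # hypothetical module holding the per-document task

queue = Queue("documents", connection=Redis(host="localhost", port=6379))

for pdf in Path("data/pdfs").glob("*.pdf"):
    queue.enqueue(process_document, str(pdf))

Each worker container then runs rq worker documents; scaling out is just a matter of starting more containers.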
2. ETL Orchestration with Airflow or Prefect
Apache Airflow or Prefect can:
- Define data pipelines as Python DAGs (directed acyclic graphs)
- Track failures, retries, and run history
- Schedule jobs (e.g., run daily, every hour, or on new upload)
You define a pipeline as a DAG. In the sketch below, the python_callable functions (extract_text, detect_language, and so on) stand in for the existing pipeline steps:
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator
# extract_text, detect_language, run_ner, push_to_neo4j are placeholder callables
with DAG("extract_entities", schedule_interval="@hourly", start_date=datetime(2024, 1, 1), catchup=False) as dag:
    extract = PythonOperator(task_id="extract", python_callable=extract_text)
    detect_lang = PythonOperator(task_id="detect_lang", python_callable=detect_language)
    run_spacy = PythonOperator(task_id="run_spacy", python_callable=run_ner)
    push_to_graph = PythonOperator(task_id="push_to_graph", python_callable=push_to_neo4j)
    extract >> detect_lang >> run_spacy >> push_to_graph
3. Use Object Storage for Large Files
Instead of keeping files on disk:
- Store .pdf, .txt, and .jsonl files in Amazon S3 (or MinIO for self-hosted)
- Tag files with metadata (e.g., language, processed status)
- Stream files into your workers as needed
This makes your pipeline stateless and easier to scale.
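A rough sketch with boto3 (the bucket name, object keys, and metadata fields are made up; for MinIO you would point endpoint_url and credentials at your own server):

import boto3

s3 = boto3.client("s3")  # for MinIO: boto3.client("s3", endpoint_url="http://localhost:9000", ...)
bucket = "integral-ecology-library"  # hypothetical bucket name

# Upload a processed text file together with searchable metadata
s3.upload_file(
    "data/txt/report_es.txt", bucket, "txt/report_es.txt",
    ExtraArgs={"Metadata": {"language": "es", "processed": "false"}},
)

# Inside a worker: stream the object instead of reading from local disk
obj = s3.get_object(Bucket=bucket, Key="txt/report_es.txt")
text = obj["Body"].read().decode("utf-8")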
Scaling Model Inference
As we add more complex NER models (like en_ner_eco_biobert), inference gets slower. To deal with this, we can:
- Host models with FastAPI as REST or GraphQL endpoints
- Use transformers pipelines with TorchServe or Hugging Face Inference Endpoints
- Use GPU-backed instances for model inference
You can wrap a spaCy model in a small FastAPI microservice like this (a minimal sketch using the en_ner_eco_biobert model mentioned above):
from fastapi import FastAPI
import spacy
app = FastAPI()
nlp = spacy.load("en_ner_eco_biobert")  # load the NER model once at startup
@app.post("/ner")
def tag(text: str):  # text arrives as a query parameter in this minimal version
    doc = nlp(text)
    return [{"text": ent.text, "label": ent.label_} for ent in doc.ents]
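Calling it from a worker or notebook is then a single HTTP request (in this minimal version the text travels as a query parameter):

import requests

resp = requests.post(
    "http://localhost:8000/ner",
    params={"text": "Quercus robur dominates the old-growth stands."},
)
print(resp.json())  # list of {"text": ..., "label": ...} entities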
Knowledge Graph Growth
As our Neo4j graph grows, the way we host and query it needs to grow too:
- Move to Neo4j AuraDB or Neo4j Enterprise for performance and backups
- Consider TigerGraph or TerminusDB for high-volume or time-series data
- Use Cypher or GraphQL to query and visualize ecological relationships
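As a sketch with the official Neo4j Python driver, a query like the one below could surface the most-mentioned taxa in Spanish-language documents. The Taxon and Document labels and the MENTIONED_IN relationship are assumptions about the schema, not its actual shape:

from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

query = """
MATCH (t:Taxon)-[:MENTIONED_IN]->(d:Document {language: $lang})
RETURN t.name AS taxon, count(d) AS mentions
ORDER BY mentions DESC LIMIT 10
"""

with driver.session() as session:
    for record in session.run(query, lang="es"):
        print(record["taxon"], record["mentions"])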
Distributed Training & Annotation
As our labeled data grows:
- Run training jobs via Airflow or a Makefile
- Distribute annotation tasks with Doccano’s multi-user support
- Consider Hugging Face AutoTrain, Prodigy, or Label Studio for advanced workflows
Monitoring and Fault Tolerance
- Use Prometheus + Grafana to monitor throughput and failures
- Log processing status in PostgreSQL
- Alert when documents fail or get stuck
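Inside a worker, exposing a couple of counters for Prometheus to scrape takes only a few lines (a sketch using the prometheus_client library; the metric names and the process_document call are illustrative):

from prometheus_client import Counter, start_http_server

DOCS_PROCESSED = Counter("docs_processed_total", "Documents processed successfully")
DOCS_FAILED = Counter("docs_failed_total", "Documents that raised an error")

start_http_server(8001)  # metrics served at :8001/metrics

def process_safely(doc_path: str) -> None:
    try:
        process_document(doc_path)  # hypothetical existing pipeline step
        DOCS_PROCESSED.inc()
    except Exception:
        DOCS_FAILED.inc()
        raise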
Scaling for Ecological Impact
Integral ecology is inherently global and multilingual. By scaling this knowledge graph infrastructure, we can:
- Track biodiversity loss across regions and languages
- Compare ecological concepts in UN reports and academic literature
- Enable public tools for data exploration and learning
Let's go build a smarter, more scalable digital library!
Questions? Want more code examples? Jump into the repo or open an issue!