Tagging the World: Finding Places, Plants, and Ideas with AI
How NLP tools identify ecological concepts, organizations, and more.
Once we've extracted clean text from a PDF, the next step is to understand what’s being talked about.
That’s where Natural Language Processing (NLP) comes in.
We use NLP to scan the text and find key pieces of information like:
- Locations (e.g. "Amazon rainforest")
- Organizations (e.g. "WWF")
- Ecological concepts (e.g. "resilience", "biodiversity loss")
- Citations (e.g. "IPBES 2019 Report")
Each of these is called an entity, and this process is called Named Entity Recognition (NER).
What is Named Entity Recognition?
NER is an NLP technique in which a model reads text and labels the spans that refer to real-world things.
Example:
Original text:
The WWF report on the Amazon rainforest highlights climate resilience strategies.
NER output:
[ORG: WWF], [LOC: Amazon rainforest], [ECO_CONCEPT: climate resilience]
This gives us structured data from unstructured sentences, and helps us populate our knowledge graph with nodes and connections.
Tools We Use
spaCy
We use spaCy, a popular open-source NLP library (a short example follows this list) that can:
- Work in English, Spanish, French, Chinese, Russian, and more
- Recognize standard entities like ORG, LOC, PERSON, etc.
- Run fast and integrate easily with Python
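Here's a minimal sketch of calling spaCy directly, assuming the small English model (en_core_web_sm) has been downloaded:

import spacy

# Assumes the model was installed with: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

doc = nlp("The WWF report on the Amazon rainforest highlights climate resilience strategies.")
for ent in doc.ents:
    print(ent.start_char, ent.end_char, ent.label_, ent.text)

Note that a stock spaCy model only knows standard labels like ORG and LOC; a custom label such as ECO_CONCEPT has to come from a domain model, which is exactly why we add the tool below.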
TaxoNERD
For ecological texts, general-purpose NLP models aren't enough, so we also use TaxoNERD, a tool trained to detect ecological and taxonomic entities, such as:
- Ecosystem types
- Species groups
- Environmental terms
TaxoNERD builds on BioBERT, a BERT model pre-trained on biomedical text, and is specialized for ecological language.
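TaxoNERD distributes its models as regular spaCy pipelines, so calling one looks much like the snippet above. A sketch, assuming the en_core_eco_md model is installed (check the TaxoNERD documentation for the exact model names):

import spacy

# Assumption: the TaxoNERD model en_core_eco_md has been installed;
# see the TaxoNERD repository for installation instructions.
nlp = spacy.load("en_core_eco_md")

doc = nlp("Sphagnum mosses dominate the peatlands of the boreal zone.")
for ent in doc.ents:
    print(ent.label_, ent.text)  # taxon mentions, e.g. "Sphagnum"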
Multilingual Support
We also load dedicated spaCy models for other languages (a model-selection sketch follows below):
- fr_core_news_lg (French)
- es_core_news_lg (Spanish)
- zh_core_web_trf (Chinese)
- xx_ent_wiki_sm (basic multilingual)
This makes the system language-aware, even when documents span continents.
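In practice this means choosing a model per document. Here is a minimal sketch of how that selection could work; langdetect is one possible language detector, an illustrative choice rather than necessarily what our pipeline uses:

import spacy
from langdetect import detect  # third-party detector; an illustrative assumption

# Map detected language codes to the spaCy models listed above.
MODELS = {
    "en": "en_core_web_sm",  # assumed English default
    "fr": "fr_core_news_lg",
    "es": "es_core_news_lg",
    "zh": "zh_core_web_trf",
}

def load_model_for(text):
    lang = detect(text).split("-")[0]  # normalize codes like "zh-cn" to "zh"
    return spacy.load(MODELS.get(lang, "xx_ent_wiki_sm"))  # fall back to the multilingual model

nlp = load_model_for("La forêt amazonienne abrite une biodiversité exceptionnelle.")
print(nlp.meta["lang"], nlp.meta["name"])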
How It Works in Code
The tagging is handled by this command:
python ner_pipeline.py my-report.txt
Which produces:
{
  "text": "The WWF report on the Amazon rainforest highlights climate resilience strategies.",
  "entities": [
    {"start": 4, "end": 7, "label": "ORG", "text": "WWF"},
    {"start": 22, "end": 39, "label": "LOC", "text": "Amazon rainforest"},
    {"start": 51, "end": 69, "label": "ECO_CONCEPT", "text": "climate resilience"}
  ]
}
This is saved as my-report.entities.json in the data/output/ directory. The structure holds the original text plus every entity the models found, each with its starting (start) and ending (end) character offsets in the text, a standardized label (label), and the exact span that was detected (text).
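You don't need the real script to see the idea; a simplified, hypothetical sketch of the core of ner_pipeline.py could look like this (the actual pipeline also brings in TaxoNERD and the multilingual models described above):

import json
import pathlib
import sys

import spacy

# Simplified: one English model, one plain-text input file.
nlp = spacy.load("en_core_web_sm")

in_path = pathlib.Path(sys.argv[1])
text = in_path.read_text(encoding="utf-8")
doc = nlp(text)

result = {
    "text": text,
    "entities": [
        {"start": ent.start_char, "end": ent.end_char,
         "label": ent.label_, "text": ent.text}
        for ent in doc.ents
    ],
}

# Follow the <name>.entities.json convention in data/output/.
out_dir = pathlib.Path("data/output")
out_dir.mkdir(parents=True, exist_ok=True)
out_path = out_dir / (in_path.stem + ".entities.json")
out_path.write_text(json.dumps(result, ensure_ascii=False, indent=2), encoding="utf-8")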
Why This Matters
Recognizing entities allows us to:
- Link a sentence to the right concepts
- Group reports by theme or region
- Connect related documents, even across languages
- Support annotation and model training
This is the first step toward turning plain text into a semantic map.
Try It Yourself
If you’ve already extracted text using:
make pipeline PDF=/data/input/sample1.pdf
The entity tagging will run automatically. Check the output in:
data/output/sample1.entities.json
You can also run the script independently:
docker compose exec worker python ner_pipeline.py /data/input/sample1.txt
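Once the file exists, a few lines of Python are enough to inspect it:

import json

with open("data/output/sample1.entities.json", encoding="utf-8") as f:
    data = json.load(f)

for ent in data["entities"]:
    print(f'{ent["label"]:>12}  {ent["text"]}')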
What’s Next?
Next, we’ll take these entities and load them into Neo4j — our graph database — where we can start to visualize and query relationships.