Evaluating the Results: Accuracy, Speed, and Insight
Measuring how well our models work, and what they help us uncover.
We’ve come a long way — from PDFs to graphs, from automatic tagging to human annotation, and finally to training our own ecological NLP model.
But how do we know if it’s working?
In this post, we’ll explore how to evaluate your pipeline and model using both standard NLP metrics and real-world ecological insight.
Why Evaluate?
Evaluation helps us understand:
- ✅ Is the model tagging entities accurately?
- ✅ Does it work well across languages?
- ✅ Is it fast enough for real-world use?
- ✅ What kinds of mistakes does it make?
Evaluation isn’t just about numbers — it’s about improving trust, transparency, and impact.
Key Metrics
- Precision: the percentage of predicted entities that were correct. Did it predict something meaningful?
- Recall: the percentage of correct entities that were successfully predicted. Did it miss important things?
- F1 Score: the balance of precision and recall, and a good all-around indicator.
These metrics are calculated by comparing your model's output to human-annotated data, usually exported from Doccano.
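If you want to see how these numbers fall out of that comparison, here is a minimal sketch that scores predicted entity spans against gold spans. It assumes each entity is a (start_char, end_char, label) tuple; the score_entities helper and the toy spans are illustrative, not part of any library.
def score_entities(gold, predicted):
    # Each entity is a (start_char, end_char, label) tuple
    gold_set, pred_set = set(gold), set(predicted)
    true_positives = len(gold_set & pred_set)
    precision = true_positives / len(pred_set) if pred_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1

# Toy example: one span matches, one is missed, one is spurious
gold = [(4, 7, "ORG"), (25, 31, "LOC")]
predicted = [(4, 7, "ORG"), (40, 47, "LOC")]
print(score_entities(gold, predicted))  # (0.5, 0.5, 0.5)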
How to Evaluate with spaCy
After training, spaCy automatically evaluates on your dev/test set and shows:
✔ Accuracy: 89.2%
✔ Precision: 88.5%
✔ Recall: 90.1%
✔ F1 Score: 89.3%
You can also evaluate manually using:
python -m spacy evaluate ./output/model-best ./dev.spacy
Or export annotated JSONL and compare predictions programmatically.
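For the programmatic route, a sketch like the one below works, assuming your annotations sit in a JSONL file (here called annotations.jsonl, a hypothetical name) where each record has a "text" field and a "label" list of [start, end, label] spans, which is a common Doccano export shape. Adjust the keys to match your own export.
import json
import spacy
from spacy.training import Example

nlp = spacy.load("output/model-best")

examples = []
with open("annotations.jsonl", encoding="utf-8") as f:  # hypothetical filename
    for line in f:
        record = json.loads(line)
        # Pair an unprocessed doc with its gold entity spans
        doc = nlp.make_doc(record["text"])
        gold = {"entities": [tuple(span) for span in record["label"]]}
        examples.append(Example.from_dict(doc, gold))

# Run the pipeline over the examples and collect NER scores
scores = nlp.evaluate(examples)
print(scores["ents_p"], scores["ents_r"], scores["ents_f"])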
Speed and Runtime
You can also evaluate how fast your model runs:
import time
import spacy

# Load the trained model and time a single prediction
nlp = spacy.load("output/model-best")
text = "The WWF is active in the Amazon on climate resilience."

start = time.time()
doc = nlp(text)
print([(ent.text, ent.label_) for ent in doc.ents])
print("Time:", round(time.time() - start, 4), "seconds")
This helps you decide:
- Which model is best for live pipelines vs. batch runs (see the batch sketch below)
- When to use sm vs. lg or trf models
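For the batch case, throughput matters more than single-document latency. A rough sketch using nlp.pipe, with a placeholder list of texts you would replace with your own documents:
import time
import spacy

nlp = spacy.load("output/model-best")
# Placeholder corpus; substitute your own documents
texts = ["The WWF is active in the Amazon on climate resilience."] * 200

start = time.time()
docs = list(nlp.pipe(texts, batch_size=50))
elapsed = time.time() - start
print(f"{len(texts)} docs in {elapsed:.2f}s ({len(texts) / elapsed:.1f} docs/sec)")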
Beyond Accuracy: Ecological Insight
Numbers matter, but so does usefulness!
Ask:
- Does the graph help us find new connections?
- Are multilingual documents being fairly represented?
- Does it highlight underrepresented organizations or regions?
- Is it surfacing unexpected insights for researchers or communities?
These qualitative measures can be just as valuable as F1 score.
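One lightweight way to ground those questions is to tally which organizations and regions the model actually surfaces across your corpus. The sketch below is illustrative only; texts is a placeholder for your own documents.
from collections import Counter
import spacy

nlp = spacy.load("output/model-best")
texts = ["The WWF is active in the Amazon on climate resilience."]  # replace with your corpus

# Count how often each (label, entity) pair appears across the corpus
counts = Counter()
for doc in nlp.pipe(texts):
    counts.update((ent.label_, ent.text) for ent in doc.ents)

for (label, ent_text), n in counts.most_common(10):
    print(f"{label:<6} {ent_text:<40} {n}")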
Try It Yourself
After training, run:
python -m spacy evaluate output/model-best dev.spacy
Or test it interactively:
import spacy
nlp = spacy.load("output/model-best")
doc = nlp("UNEP published a report on ecosystem resilience in East Africa.")
print([(ent.text, ent.label_) for ent in doc.ents])
Then compare the results to the baseline model (en_core_web_sm) or TaxoNERD.
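A quick way to run that comparison in code, assuming en_core_web_sm is installed (python -m spacy download en_core_web_sm); for TaxoNERD, swap in its model name following its own documentation:
import spacy

text = "UNEP published a report on ecosystem resilience in East Africa."

# Run the same sentence through the trained model and the baseline
for model_name in ("output/model-best", "en_core_web_sm"):
    nlp = spacy.load(model_name)
    ents = [(ent.text, ent.label_) for ent in nlp(text).ents]
    print(model_name, "->", ents)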
What's Next
You’ve now seen the full lifecycle:
- Extract text and citations from reports
- Tag entities across languages
- Upload to a knowledge graph
- Annotate and refine human feedback
- Train and evaluate your own model
You’re ready to contribute — or build your own ecological pipeline.