Annotation: Teaching the System to Be Smarter
A look at Doccano and how humans refine machine learning.
Our pipeline can already extract text from reports, tag entities, and build a multilingual knowledge graph, but how accurate is it?
The truth is: even good AI needs human correction. That's where annotation comes in.
In this post, we’ll show you how we use Doccano, a friendly web interface, to correct the output of our pipeline and help the model improve over time.
Why Annotation Matters
Even the best models make mistakes:
- Misclassifying entities (e.g. "Amazon" as a product instead of a forest)
- Missing subtle ecological terms (like “resilience” or “eco-conversion”)
- Struggling with under-represented languages (like Arabic) or complex phrases
Annotation lets humans:
- Correct the labels
- Add missing terms
- Build reliable training data
It’s like proofreading, but for a machine learning system.
What is Doccano?
Doccano is an open-source tool for annotating text for NLP tasks.
It provides:
- A browser-based interface
- Support for named entity recognition (NER), classification, translation, and more
- Role-based user access
- Easy data import and export
We use Doccano to refine the results of running the ner_pipeline.py
script.
Exporting to Doccano Format
After entity tagging, we export the results in JSONL format:
python export_doccano.py my-report.entities.json
Which creates:
{
"text": "The WWF report on the Amazon rainforest...",
"labels": [[4, 7, "ORG"], [20, 36, "LOC"]]
}
This format can be imported directly into Doccano.
Using Doccano Locally
If you’re running the project with Docker Compose, Doccano is already available at:
Log in with
- Username:
admin
- Password:
password
From there, you can
- Create a new NER project - be sure to select Sequence Labeling
- Import your
.jsonl
file (generated in previous step - `data/output/) - Start tagging!
The Feedback Loop
Once documents are annotated in Doccano, we:
- Export the clean annotations
- Convert them to spaCy training format
- Fine-tune a custom NER model
This cycle helps the system:
- Improve tagging accuracy
- Recognize new or uncommon terms
- Adapt to multilingual and ecological contexts
We call this process active learning.
Try It Yourself
If you’ve already run:
make pipeline PDF=my-report.pdf
Then a Doccano file will be created at:
data/doccano/my-report.entities.jsonl
Import this file into your Doccano project and try refining the labels!
Who Can Help?
Annotation is one of the best ways to contribute, especially if you:
- Are a researcher in ecology, theology, or social sciences
- Are bilingual or multilingual
- Want to help shape AI to better understand the world
All you need is careful attention. No coding required!
What’s Next?
Now that we have clean, annotated data, we’re ready to train our own ecologically informed NER model.
In the next post, we’ll walk through how to train a custom spaCy pipeline using your Doccano data.