from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline
import spacy
from spacy import displacy
Named-entity recognition (NER) is about identifying and classifying named entities in text into predefined categories such as persons
, organizations
, locations
, etc. It extracts entities from unstructured text.
By identifying and categorizing real-world entities in text, NER bridges the gap between unstructured text and structured information that can drive decision-making and insights. Here’s why it’s so important:
- Business Intelligence & Market Analysis
- Competitive Intelligence: Extract competitor names, products, and initiatives from news articles and reports
- Market Trends: Identify emerging players, technologies, and market shifts by analyzing mentions of organizations and products
- Customer Insights: Extract brand mentions, product names, and sentiment indicators from customer feedback
- Information Extraction & Knowledge Management
- Automated Data Population: Populate databases and knowledge graphs with structured information extracted from unstructured text
- Document Classification: Enhance document categorization by understanding the entities mentioned within content
- Content Recommendation: Improve content recommendations by matching based on entities of interest to users
- Search & Retrieval Enhancement
- Semantic Search: Enable entity-based search beyond simple keyword matching
- Question Answering: Power systems that can answer “who”, “where”, and “when” questions by extracting relevant entities
- Document Summarization: Identify key figures, organizations, and locations for inclusion in automated summaries
- Healthcare & Biomedical Applications
- Clinical Data Mining: Extract patient information, conditions, treatments, and medications from clinical notes
- Medical Literature Analysis: Identify relationships between diseases, drugs, and genes in research papers
- Clinical Trial Matching: Match patients to appropriate clinical trials based on extracted medical entities
- Finance & Risk Management
- Regulatory Compliance: Extract reporting entities and requirements from regulatory documents
- Financial News Analysis: Track mentions of companies, executives, and financial metrics in real-time
- Risk Assessment: Identify potentially problematic entities (people, organizations) for due diligence
- Legal Document Analysis
- Contract Analysis: Extract parties, dates, monetary values, and conditions from legal documents
- Case Law Research: Identify precedents, judges, courts, and legal principles in legal opinions
- Compliance Monitoring: Extract regulated entities and requirements from regulatory documents
- Security & Intelligence
- Threat Detection: Identify potentially suspicious individuals, organizations, or locations in communications
- Network Analysis: Build relationship graphs between entities mentioned in various documents
- Predictive Analysis: Use historical patterns of entity mentions to predict future events or trends
- Benefits Across Applications
- Automation: Reduce manual data entry and information extraction
- Consistency: Apply uniform entity extraction rules across large document collections
- Discovery: Uncover hidden relationships between entities across disparate sources
- Time Efficiency: Process volumes of text at scale that would be impossible manually
- Structured Data Creation: Transform unstructured text into structured, queryable information
Example
Initialize the NER processor with a transformer model using the (dslim/bert-base-NER
).
Alternative model names for NER
"jean-baptiste/roberta-large-ner-english"
"dbmdz/bert-large-cased-finetuned-conll03-english"
"elastic/distilbert-base-cased-finetuned-conll03-english"
For domain-specific NER, look for models trained on data from your domain.
For example, "emilyalsentzer/Bio_ClinicalBERT"
for medical text.
="dslim/bert-base-NER"
model_name
= AutoTokenizer.from_pretrained(model_name)
tokenizer = AutoModelForTokenClassification.from_pretrained(model_name)
model = pipeline(
ner_pipeline "ner",
=model,
model=tokenizer,
tokenizer="simple") aggregation_strategy
Some weights of the model checkpoint at dslim/bert-base-NER were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use cpu
Extract named entities from text
= """Apple is looking at buying U.K. startup for $1 billion.
text CEO Tim Cook made the announcement from their headquarters in Cupertino.
"""
= ner_pipeline(text)
entities_list entities_list
[{'entity_group': 'ORG',
'score': np.float32(0.9989743),
'word': 'Apple',
'start': 0,
'end': 5},
{'entity_group': 'LOC',
'score': np.float32(0.98070335),
'word': 'U. K.',
'start': 27,
'end': 31},
{'entity_group': 'PER',
'score': np.float32(0.9997685),
'word': 'Tim Cook',
'start': 61,
'end': 69},
{'entity_group': 'LOC',
'score': np.float32(0.9980777),
'word': 'Cupertino',
'start': 119,
'end': 128}]
Visualize entities in text using spaCy’s displacy.
= spacy.blank("en")
nlp # Create a spaCy Doc
= nlp(text)
doc
# Build spaCy Span objects for each entity
= [
spans
doc.char_span(= ent["start"],
start_idx = ent["end"],
end_idx = ent["entity_group"],
label = "contract")
alignment_mode for ent
in entities_list]
# Set the entities in the Doc
= filter(lambda x:x != None, spans)
doc.ents
# Display using displacy
="ent", jupyter=True) displacy.render(doc, style
CEO Tim Cook PER made the announcement from their headquarters in Cupertino LOC .
Disclaimer: For information only. Accuracy or completeness not guaranteed. Illegal use prohibited. Not professional advice or solicitation. Read more: /terms-of-service
Reuse
Citation
@misc{kabui2025,
author = {{Kabui, Charles}},
title = {Named {Entity} {Recognition} with {spaCy}},
date = {2025-05-03},
url = {https://toknow.ai/posts/named-entity-recognition/index.html},
langid = {en-GB}
}