AI Intent Recognition
Overview of common approaches to AI intent recognition — how they work, their tradeoffs, and when to use each.
1. Rule-Based / Pattern Matching
Match user input against hand-written regex patterns and keyword dictionaries.
rules = {
    "book_flight": ["book.*flight", "buy.*ticket", "fly.*to"],
    "check_weather": ["weather", "temperature", "raining"],
}
Pros: Fully explainable, zero training data, sub-millisecond latency, fully controllable.
Cons: Low coverage, maintenance cost explodes with scale, poor generalization.
2. Traditional ML Classification (TF-IDF + SVM/LR)
Convert text to TF-IDF vectors, then train a multi-class classifier (SVM, Logistic Regression, Naive Bayes).
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC
from sklearn.feature_extraction.text import TfidfVectorizer
model = Pipeline([("tfidf", TfidfVectorizer()), ("clf", LinearSVC())])
model.fit(X_train, y_train)
Pros: Fast to train, interpretable, works with modest data.
Cons: No semantic understanding, poor handling of synonyms and ambiguity, heavy feature engineering needed.
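For reference, a fully runnable toy version of the pipeline above, with an invented four-utterance training set (a real system needs thousands of rows):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Toy data, invented for illustration only
X_train = [
    "book me a flight to London",
    "buy a plane ticket for tomorrow",
    "what's the weather like today",
    "will it rain this afternoon",
]
y_train = ["book_flight", "book_flight", "check_weather", "check_weather"]

model = Pipeline([("tfidf", TfidfVectorizer()), ("clf", LinearSVC())])
model.fit(X_train, y_train)

print(model.predict(["buy me a ticket"])[0])  # book_flight
```

Note how "buy me a ticket" is classified only because its exact tokens appear in the training data; a paraphrase like "I need a seat on a plane" would fail, which is the "no semantic understanding" weakness in practice.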
3. Deep Learning Classification (BiLSTM / TextCNN)
Feed word embeddings into a BiLSTM or CNN encoder, then a classification head.
Pros: Better semantic capture than TF-IDF, end-to-end training.
Cons: Needs thousands of labeled examples, outclassed by Transformer-based models, largely obsolete now.
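For the record, the forward pass of a TextCNN-style encoder can be sketched in plain NumPy. Weights here are random and untrained, and all sizes are invented; this only illustrates the shape of the computation:

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB, EMB, FILTERS, WINDOW, N_INTENTS = 100, 16, 8, 3, 5

# Untrained stand-ins for learned parameters
embedding = rng.normal(size=(VOCAB, EMB))
conv_w = rng.normal(size=(FILTERS, WINDOW * EMB))   # one row per convolution filter
head_w = rng.normal(size=(N_INTENTS, FILTERS))      # classification head

def textcnn_forward(token_ids: list[int]) -> np.ndarray:
    x = embedding[token_ids]                         # (seq_len, EMB)
    # Convolution as a sliding window over WINDOW consecutive tokens
    windows = np.stack([x[i:i + WINDOW].ravel()
                        for i in range(len(token_ids) - WINDOW + 1)])
    feats = np.maximum(windows @ conv_w.T, 0)        # ReLU activations
    pooled = feats.max(axis=0)                       # global max pooling
    return pooled @ head_w.T                         # one logit per intent

logits = textcnn_forward([4, 17, 23, 8, 42])
print(logits.shape)  # (5,): one score per intent
```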
4. Pre-trained Model Fine-tuning (BERT / RoBERTa / ERNIE) ⭐ mainstream
Fine-tune a BERT-family model on domain data. The [CLS] token representation feeds into a classification head.
from transformers import BertForSequenceClassification, Trainer
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=N)  # N = number of intents
trainer = Trainer(model=model, train_dataset=dataset)
trainer.train()
Pros: High accuracy, strong generalization, multilingual variants available (ERNIE, MacBERT for Chinese).
Cons: Inference latency 50–200ms, compute-heavy, needs hundreds of labeled examples minimum.
5. Sentence Embedding + Similarity Matching (Zero/Few-shot)
Encode user input and intent examples into a shared embedding space, then pick the closest intent by cosine similarity.
from sentence_transformers import SentenceTransformer, util
model = SentenceTransformer("paraphrase-multilingual-mpnet-base-v2")
intent_examples = {
    "book_flight": "I want to book a flight to New York",
    "check_weather": "what is the weather like today",
}
intent_embs = model.encode(list(intent_examples.values()))
query_emb = model.encode(user_query)
scores = util.cos_sim(query_emb, intent_embs)[0]  # cosine similarity against each intent
best_intent = list(intent_examples)[int(scores.argmax())]
Pros: Minimal labeling needed, new intents can be added without retraining, cold-start friendly.
Cons: Struggles to distinguish similar intents, threshold tuning required, inconsistent recall.
6. LLM Prompt-based (GPT / Claude / local models)
Prompt an LLM directly to classify intent, returning structured output.
prompt = """
User input: "{user_input}"
Choose the best matching intent from the list below and return JSON.
Intents: {intents}
Output format: {{"intent": "xxx", "confidence": 0.9, "slots": {{}}}}
"""
Pros: Zero labeling, handles complex semantics, can extract entities/slots simultaneously, easy to extend.
Cons: High latency (500ms+), API cost, non-deterministic output, requires prompt engineering.
Few-Shot Prompting for LLM Intent Classification
Zero-shot (just listing intents) works but is inconsistent — the model may return "book_flight", "Book Flight", or "booking a flight" for the same input. Few-shot prompting anchors the output format with examples.
Zero-shot (fragile):
prompt = """
Classify the user input into one of these intents: book_flight, check_weather, set_alarm
User: "fly me to Tokyo"
Intent:
"""
Few-shot + structured output (production):
import json
from anthropic import Anthropic
client = Anthropic()
INTENTS = ["book_flight", "check_weather", "set_alarm", "play_music", "send_message"]
FEW_SHOT_EXAMPLES = [
    ("I want a ticket to London", "book_flight", 0.97),
    ("will it rain tomorrow?", "check_weather", 0.95),
    ("wake me up at 7am", "set_alarm", 0.98),
    ("play some jazz music", "play_music", 0.96),
    ("text John I'll be late", "send_message", 0.94),
]
def build_prompt(user_input: str) -> str:
    examples = "\n".join(
        f'User: "{text}" → {{"intent": "{intent}", "confidence": {conf}}}'
        for text, intent, conf in FEW_SHOT_EXAMPLES
    )
    intent_list = ", ".join(INTENTS)
    return f"""Classify user input into exactly one intent from: {intent_list}
Examples:
{examples}
Now classify:
User: "{user_input}"
Output JSON only:"""
def classify_intent(user_input: str) -> dict:
    response = client.messages.create(
        model="claude-3-5-haiku-latest",  # cheapest/fastest — sufficient for classification
        max_tokens=60,
        messages=[{"role": "user", "content": build_prompt(user_input)}],
    )
    return json.loads(response.content[0].text)

result = classify_intent("fly me to Beijing next Monday")
# e.g. {"intent": "book_flight", "confidence": 0.96}
Dynamic Few-Shot (for 20+ intents)
Stuffing all examples into every prompt wastes tokens. Instead, retrieve the most relevant examples per query using semantic similarity:
from sentence_transformers import SentenceTransformer
import numpy as np
embedder = SentenceTransformer("paraphrase-multilingual-mpnet-base-v2")
example_bank = [
    {"text": "book me a flight to London", "intent": "book_flight"},
    {"text": "I need a plane ticket", "intent": "book_flight"},
    {"text": "what's the weather today", "intent": "check_weather"},
    {"text": "will it snow tomorrow", "intent": "check_weather"},
    # ... 2-3 examples per intent
]
bank_embeddings = embedder.encode([e["text"] for e in example_bank], normalize_embeddings=True)

def get_top_k_examples(query: str, k: int = 3) -> list:
    query_emb = embedder.encode(query, normalize_embeddings=True)
    scores = np.dot(bank_embeddings, query_emb)  # cosine similarity (embeddings are unit-normalized)
    top_k = np.argsort(scores)[::-1][:k]
    return [example_bank[i] for i in top_k]
def classify_with_dynamic_fewshot(user_input: str) -> dict:
    examples = get_top_k_examples(user_input, k=3)
    example_str = "\n".join(
        f'User: "{e["text"]}" → {e["intent"]}' for e in examples
    )
    # only the 3 most relevant examples go into the prompt
    prompt = f"""...\nExamples:\n{example_str}\n\nUser: "{user_input}"\nIntent:"""
    # ... call LLM
When to Use What
| Technique | When | Why |
|---|---|---|
| Zero-shot | Prototyping, very small intent sets | Simplest, no examples needed |
| Static few-shot | <20 intents, stable labels | Reliable output format, cheap |
| Dynamic few-shot | 20+ intents, large example bank | Stays within context window, higher accuracy |
| Structured JSON output | Production always | Parseable, no format drift |
Use a small/fast model (claude-haiku, gpt-4o-mini) — classification doesn’t need a large model.
7. Hybrid Architecture (production recommendation)
Layer the approaches by cost and confidence:
User input
│
├─ High-confidence rule match ──→ return immediately (<1ms)
│
├─ BERT classifier (primary) ───→ confidence above threshold → return
│
└─ LLM fallback ────────────────→ low confidence / complex semantics
This gives you speed on the common path, accuracy on the hard cases, and control over cost.
Comparison
| Approach | Data needed | Latency | Accuracy | Controllability | Best for |
|---|---|---|---|---|---|
| Rule matching | None | <1ms | Low | Highest | High-frequency fixed intents |
| TF-IDF + SVM | Thousands | <10ms | Medium | High | Rapid prototype |
| BERT fine-tune | 100–1000+ | 50–200ms | High | Medium | Production primary |
| Embedding similarity | Very few | <50ms | Medium | Medium | Cold start / new intents |
| LLM prompt | None | 500ms+ | High | Low | Complex semantics / fallback |
| Hybrid | 100+ | Tiered | Highest | High | Production (recommended) |
Choosing an Approach
- <20 intents, sufficient labeled data → BERT fine-tune
- Cold start or frequently changing intents → Embedding similarity + few examples
- Complex dialogue / multi-turn understanding → LLM (or hybrid)
- Strict latency requirements → Rules + lightweight classifier
BERT Fine-tuning in Practice
Labels and Training Data Volume
num_labels = number of intents. For 20 intents: num_labels=20.
model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased",
    num_labels=20
)
How much data per intent:
| Quality bar | Samples per intent | Total (20 intents) |
|---|---|---|
| Minimum viable | 50 | 1,000 |
| Decent production | 100–200 | 2,000–4,000 |
| Comfortable | 500+ | 10,000+ |
BERT transfers well — far less data needed than training from scratch.
What Training Data Looks Like
Each sample = one user utterance + one intent label. An utterance is a single thing the user says — one sentence, question, or command.
Raw CSV:
text,intent
"book me a flight to London","book_flight"
"I want to fly to Tokyo next Friday","book_flight"
"can you get me a ticket to Paris","book_flight"
"what's the weather like today","check_weather"
"will it rain tomorrow in Shanghai","check_weather"
"set an alarm for 7am","set_alarm"
As a HuggingFace Dataset:
from datasets import Dataset
from transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
intent2id = {
    "book_flight": 0,
    "check_weather": 1,
    "set_alarm": 2,
    # ... 17 more
}
data = [
    {"text": "book me a flight to London", "label": 0},
    {"text": "I want to fly to Tokyo next Friday", "label": 0},
    {"text": "what's the weather like today", "label": 1},
    {"text": "will it rain tomorrow", "label": 1},
    {"text": "set an alarm for 7am", "label": 2},
]
dataset = Dataset.from_list(data)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=128)

dataset = dataset.map(tokenize, batched=True)
dataset = dataset.train_test_split(test_size=0.1)
Training loop:
from transformers import TrainingArguments, Trainer
import numpy as np
args = TrainingArguments(
    output_dir="./intent-model",
    num_train_epochs=5,
    per_device_train_batch_size=32,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
)

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {"accuracy": (preds == labels).mean()}

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    compute_metrics=compute_metrics,
)
trainer.train()
What happens inside:
Input: "fly me to Beijing"
↓ tokenize
[CLS] fly me to beijing [SEP]
↓ BERT encoder (12 layers)
[CLS] embedding ← 768-dim vector
↓ linear layer (768 → 20)
logits: [-1.2, 3.8, 0.1, ...]
↓ softmax → argmax
predicted intent: "book_flight"
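The final softmax → argmax step, using the example logits above (only three values shown here; a real head emits one logit per intent):

```python
import numpy as np

logits = np.array([-1.2, 3.8, 0.1])      # raw scores from the linear layer
probs = np.exp(logits - logits.max())    # subtract max for numerical stability
probs /= probs.sum()                     # softmax: probabilities sum to 1
pred = int(np.argmax(probs))

print(pred)  # 1 → index of the winning intent
```

Since argmax of the logits equals argmax of the probabilities, the softmax is only needed when you want a confidence score, e.g. to decide whether to fall back to an LLM.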
Key things to watch:
- Variance matters — each intent needs lexically diverse examples, not 100 paraphrases of one sentence
- Class balance — keep sample counts roughly equal across intents, or use weighted loss
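On the class-balance point, inverse-frequency weights are a common fix. A sketch (the counts are invented; the resulting array is what you would pass to a weighted cross-entropy loss):

```python
import numpy as np

# Invented per-intent sample counts: the first intent is over-represented
counts = np.array([500, 120, 80])
weights = counts.sum() / (len(counts) * counts)  # inverse-frequency weighting

print(weights.round(2))  # rare intents get proportionally larger weight
```

The alternative, often simpler in practice, is to downsample or collect more data until the counts are roughly equal.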
Hardware Requirements
GPU intensive, but nowhere near LLM scale.
| Hardware | Time (~2000 samples, 5 epochs) | Cost |
|---|---|---|
| CPU only | 2–8 hours | free (just slow) |
| RTX 3060 (12GB VRAM) | ~5–10 min | consumer GPU |
| RTX 4090 (24GB VRAM) | ~2–5 min | prosumer |
| Google Colab free (T4) | ~10–15 min | free |
Why not like LLM training:
| | LLM pre-training | BERT fine-tuning |
|---|---|---|
| What you’re doing | Learning language from scratch on trillions of tokens | Adapting the classifier head on your small dataset |
| Parameters | 175B+ | 110M (BERT-base), converges fast |
| Data | Terabytes | ~2,000 rows |
| GPU needed | Hundreds of A100s | 1 consumer GPU or free Colab |
| Time | Weeks/months | Minutes |
| Cost | Millions $ | ~$0 |
You’re not pre-training BERT — that’s already done. You’re fine-tuning the last classification layer and slightly adjusting the rest. Google Colab free tier (T4) is sufficient for 20 intents.