AI Intent Recognition
Overview of common approaches to AI intent recognition — how they work, their tradeoffs, and when to use each.
1. Rule-Based / Pattern Matching
Match user input against hand-written regex patterns and keyword dictionaries.
rules = {
    "book_flight": ["book.*flight", "buy.*ticket", "fly.*to"],
    "check_weather": ["weather", "temperature", "raining"],
}
Pros: Fully explainable, zero training data, sub-millisecond latency, fully controllable.
Cons: Low coverage, maintenance cost explodes with scale, poor generalization.
2. Traditional ML Classification (TF-IDF + SVM/LR)
Convert text to TF-IDF vectors, then train a multi-class classifier (SVM, Logistic Regression, Naive Bayes).
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC
from sklearn.feature_extraction.text import TfidfVectorizer
model = Pipeline([("tfidf", TfidfVectorizer()), ("clf", LinearSVC())])
model.fit(X_train, y_train)
Pros: Fast to train, interpretable, works with modest data.
Cons: No semantic understanding, poor handling of synonyms and ambiguity, heavy feature engineering needed.
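For reference, a fully runnable toy version of the pipeline above, with an invented four-utterance training set (a real system needs thousands of rows):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Toy data, invented for illustration only
X_train = [
    "book me a flight to London",
    "buy a plane ticket for tomorrow",
    "what's the weather like today",
    "will it rain this afternoon",
]
y_train = ["book_flight", "book_flight", "check_weather", "check_weather"]

model = Pipeline([("tfidf", TfidfVectorizer()), ("clf", LinearSVC())])
model.fit(X_train, y_train)

print(model.predict(["buy me a ticket"])[0])  # book_flight
```

Note how "buy me a ticket" is classified only because its exact tokens appear in the training data; a paraphrase like "I need a seat on a plane" would fail, which is the "no semantic understanding" weakness in practice.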
3. Deep Learning Classification (BiLSTM / TextCNN)
Feed word embeddings into a BiLSTM or CNN encoder, then a classification head.
Pros: Better semantic capture than TF-IDF, end-to-end training.
Cons: Needs thousands of labeled examples, outclassed by Transformer-based models, largely obsolete now.
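For the record, the forward pass of a TextCNN-style encoder can be sketched in plain NumPy. Weights here are random and untrained, and all sizes are invented; this only illustrates the shape of the computation:

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB, EMB, FILTERS, WINDOW, N_INTENTS = 100, 16, 8, 3, 5

# Untrained stand-ins for learned parameters
embedding = rng.normal(size=(VOCAB, EMB))
conv_w = rng.normal(size=(FILTERS, WINDOW * EMB))   # one row per convolution filter
head_w = rng.normal(size=(N_INTENTS, FILTERS))      # classification head

def textcnn_forward(token_ids: list[int]) -> np.ndarray:
    x = embedding[token_ids]                         # (seq_len, EMB)
    # Convolution as a sliding window over WINDOW consecutive tokens
    windows = np.stack([x[i:i + WINDOW].ravel()
                        for i in range(len(token_ids) - WINDOW + 1)])
    feats = np.maximum(windows @ conv_w.T, 0)        # ReLU activations
    pooled = feats.max(axis=0)                       # global max pooling
    return pooled @ head_w.T                         # one logit per intent

logits = textcnn_forward([4, 17, 23, 8, 42])
print(logits.shape)  # (5,): one score per intent
```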
4. Pre-trained Model Fine-tuning (BERT / RoBERTa / ERNIE) ⭐ mainstream
Fine-tune a BERT-family model on domain data. The [CLS] token representation feeds into a classification head.
from transformers import BertForSequenceClassification, Trainer
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=N)  # N = number of intents
trainer = Trainer(model=model, train_dataset=dataset)
trainer.train()
Pros: High accuracy, strong generalization, multilingual variants available (ERNIE, MacBERT for Chinese).
Cons: Inference latency 50–200ms, compute-heavy, needs hundreds of labeled examples minimum.
5. Sentence Embedding + Similarity Matching (Zero/Few-shot)
Encode user input and intent examples into a shared embedding space, then pick the closest intent by cosine similarity.
from sentence_transformers import SentenceTransformer, util
model = SentenceTransformer("paraphrase-multilingual-mpnet-base-v2")
intent_examples = {
    "book_flight": "I want to book a flight to New York",
    "check_weather": "what is the weather like today",
}
intent_embs = model.encode(list(intent_examples.values()))
query_emb = model.encode(user_query)
scores = util.cos_sim(query_emb, intent_embs)[0]  # cosine similarity against each intent
best_intent = list(intent_examples)[int(scores.argmax())]
Pros: Minimal labeling needed, new intents can be added without retraining, cold-start friendly.
Cons: Struggles to distinguish similar intents, threshold tuning required, inconsistent recall.
6. LLM Prompt-based (GPT / Claude / local models)
Prompt an LLM directly to classify intent, returning structured output.
prompt = """
User input: "{user_input}"
Choose the best matching intent from the list below and return JSON.
Intents: {intents}
Output format: {{"intent": "xxx", "confidence": 0.9, "slots": {{}}}}
"""
Pros: Zero labeling, handles complex semantics, can extract entities/slots simultaneously, easy to extend.
Cons: High latency (500ms+), API cost, non-deterministic output, requires prompt engineering.
Few-Shot Prompting for LLM Intent Classification
Zero-shot (just listing intents) works but is inconsistent — the model may return "book_flight", "Book Flight", or "booking a flight" for the same input. Few-shot prompting anchors the output format with examples.
Zero-shot (fragile):
prompt = """
Classify the user input into one of these intents: book_flight, check_weather, set_alarm
User: "fly me to Tokyo"
Intent:
"""
Few-shot + structured output (production):
import json
from anthropic import Anthropic
client = Anthropic()
INTENTS = ["book_flight", "check_weather", "set_alarm", "play_music", "send_message"]
FEW_SHOT_EXAMPLES = [
    ("I want a ticket to London", "book_flight", 0.97),
    ("will it rain tomorrow?", "check_weather", 0.95),
    ("wake me up at 7am", "set_alarm", 0.98),
    ("play some jazz music", "play_music", 0.96),
    ("text John I'll be late", "send_message", 0.94),
]
def build_prompt(user_input: str) -> str:
    examples = "\n".join(
        f'User: "{text}" → {{"intent": "{intent}", "confidence": {conf}}}'
        for text, intent, conf in FEW_SHOT_EXAMPLES
    )
    intent_list = ", ".join(INTENTS)
    return f"""Classify user input into exactly one intent from: {intent_list}
Examples:
{examples}
Now classify:
User: "{user_input}"
Output JSON only:"""
def classify_intent(user_input: str) -> dict:
    response = client.messages.create(
        model="claude-3-5-haiku-latest",  # cheapest/fastest — sufficient for classification
        max_tokens=60,
        messages=[{"role": "user", "content": build_prompt(user_input)}],
    )
    return json.loads(response.content[0].text)

result = classify_intent("fly me to Beijing next Monday")
# e.g. {"intent": "book_flight", "confidence": 0.96}
Dynamic Few-Shot (for 20+ intents)
Stuffing all examples into every prompt wastes tokens. Instead, retrieve the most relevant examples per query using semantic similarity:
from sentence_transformers import SentenceTransformer
import numpy as np
embedder = SentenceTransformer("paraphrase-multilingual-mpnet-base-v2")
example_bank = [
    {"text": "book me a flight to London", "intent": "book_flight"},
    {"text": "I need a plane ticket", "intent": "book_flight"},
    {"text": "what's the weather today", "intent": "check_weather"},
    {"text": "will it snow tomorrow", "intent": "check_weather"},
    # ... 2-3 examples per intent
]
bank_embeddings = embedder.encode([e["text"] for e in example_bank], normalize_embeddings=True)

def get_top_k_examples(query: str, k: int = 3) -> list:
    query_emb = embedder.encode(query, normalize_embeddings=True)
    scores = np.dot(bank_embeddings, query_emb)  # cosine similarity (embeddings are unit-normalized)
    top_k = np.argsort(scores)[::-1][:k]
    return [example_bank[i] for i in top_k]
def classify_with_dynamic_fewshot(user_input: str) -> dict:
    examples = get_top_k_examples(user_input, k=3)
    example_str = "\n".join(
        f'User: "{e["text"]}" → {e["intent"]}' for e in examples
    )
    # only the 3 most relevant examples go into the prompt
    prompt = f"""...\nExamples:\n{example_str}\n\nUser: "{user_input}"\nIntent:"""
    # ... call LLM
When to Use What
| Technique | When | Why |
|---|---|---|
| Zero-shot | Prototyping, very small intent sets | Simplest, no examples needed |
| Static few-shot | <20 intents, stable labels | Reliable output format, cheap |
| Dynamic few-shot | 20+ intents, large example bank | Stays within context window, higher accuracy |
| Structured JSON output | Production always | Parseable, no format drift |
Use a small/fast model (claude-haiku, gpt-4o-mini) — classification doesn’t need a large model.
7. Hybrid Architecture (production recommendation)
Layer the approaches by cost and confidence:
User input
│
├─ High-confidence rule match ──→ return immediately (<1ms)
│
├─ BERT classifier (primary) ───→ confidence above threshold → return
│
└─ LLM fallback ────────────────→ low confidence / complex semantics
This gives you speed on the common path, accuracy on the hard cases, and control over cost.
Comparison
| Approach | Data needed | Latency | Accuracy | Controllability | Best for |
|---|---|---|---|---|---|
| Rule matching | None | <1ms | Low | Highest | High-frequency fixed intents |
| TF-IDF + SVM | Thousands | <10ms | Medium | High | Rapid prototype |
| BERT fine-tune | 100–1000+ | 50–200ms | High | Medium | Production primary |
| Embedding similarity | Very few | <50ms | Medium | Medium | Cold start / new intents |
| LLM prompt | None | 500ms+ | High | Low | Complex semantics / fallback |
| Hybrid | 100+ | Tiered | Highest | High | Production (recommended) |
Choosing an Approach
- <20 intents, sufficient labeled data → BERT fine-tune
- Cold start or frequently changing intents → Embedding similarity + few examples
- Complex dialogue / multi-turn understanding → LLM (or hybrid)
- Strict latency requirements → Rules + lightweight classifier
BERT Fine-tuning in Practice
Labels and Training Data Volume
num_labels = number of intents. For 20 intents: num_labels=20.
model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased",
    num_labels=20
)
How much data per intent:
| Quality bar | Samples per intent | Total (20 intents) |
|---|---|---|
| Minimum viable | 50 | 1,000 |
| Decent production | 100–200 | 2,000–4,000 |
| Comfortable | 500+ | 10,000+ |
BERT transfers well — far less data needed than training from scratch.
What Training Data Looks Like
Each sample = one user utterance + one intent label. An utterance is a single thing the user says — one sentence, question, or command.
Raw CSV:
text,intent
"book me a flight to London","book_flight"
"I want to fly to Tokyo next Friday","book_flight"
"can you get me a ticket to Paris","book_flight"
"what's the weather like today","check_weather"
"will it rain tomorrow in Shanghai","check_weather"
"set an alarm for 7am","set_alarm"
As a HuggingFace Dataset:
from datasets import Dataset
from transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
intent2id = {
    "book_flight": 0,
    "check_weather": 1,
    "set_alarm": 2,
    # ... 17 more
}
data = [
    {"text": "book me a flight to London", "label": 0},
    {"text": "I want to fly to Tokyo next Friday", "label": 0},
    {"text": "what's the weather like today", "label": 1},
    {"text": "will it rain tomorrow", "label": 1},
    {"text": "set an alarm for 7am", "label": 2},
]
dataset = Dataset.from_list(data)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=128)

dataset = dataset.map(tokenize, batched=True)
dataset = dataset.train_test_split(test_size=0.1)
Training loop:
from transformers import TrainingArguments, Trainer
import numpy as np
args = TrainingArguments(
    output_dir="./intent-model",
    num_train_epochs=5,
    per_device_train_batch_size=32,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
)

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {"accuracy": (preds == labels).mean()}

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    compute_metrics=compute_metrics,
)
trainer.train()
What happens inside:
Input: "fly me to Beijing"
↓ tokenize
[CLS] fly me to beijing [SEP]
↓ BERT encoder (12 layers)
[CLS] embedding ← 768-dim vector
↓ linear layer (768 → 20)
logits: [-1.2, 3.8, 0.1, ...]
↓ softmax → argmax
predicted intent: "book_flight"
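The final softmax → argmax step, using the example logits above (only three values shown here; a real head emits one logit per intent):

```python
import numpy as np

logits = np.array([-1.2, 3.8, 0.1])      # raw scores from the linear layer
probs = np.exp(logits - logits.max())    # subtract max for numerical stability
probs /= probs.sum()                     # softmax: probabilities sum to 1
pred = int(np.argmax(probs))

print(pred)  # 1 → index of the winning intent
```

Since argmax of the logits equals argmax of the probabilities, the softmax is only needed when you want a confidence score, e.g. to decide whether to fall back to an LLM.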
Key things to watch:
- Variance matters — each intent needs lexically diverse examples, not 100 paraphrases of one sentence
- Class balance — keep sample counts roughly equal across intents, or use weighted loss
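On the class-balance point, inverse-frequency weights are a common fix. A sketch (the counts are invented; the resulting array is what you would pass to a weighted cross-entropy loss):

```python
import numpy as np

# Invented per-intent sample counts: the first intent is over-represented
counts = np.array([500, 120, 80])
weights = counts.sum() / (len(counts) * counts)  # inverse-frequency weighting

print(weights.round(2))  # rare intents get proportionally larger weight
```

The alternative, often simpler in practice, is to downsample or collect more data until the counts are roughly equal.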
Hardware Requirements
GPU intensive, but nowhere near LLM scale.
| Hardware | Time (~2000 samples, 5 epochs) | Cost |
|---|---|---|
| CPU only | 2–8 hours | free (just slow) |
| RTX 3060 (12GB VRAM) | ~5–10 min | consumer GPU |
| RTX 4090 (24GB VRAM) | ~2–5 min | prosumer |
| Google Colab free (T4) | ~10–15 min | free |
Why not like LLM training:
| | LLM pre-training | BERT fine-tuning |
|---|---|---|
| What you’re doing | Learning language from scratch on trillions of tokens | Adapting the classifier head on your small dataset |
| Parameters | 175B+ | 110M (BERT-base), converges fast |
| Data | Terabytes | ~2,000 rows |
| GPU needed | Hundreds of A100s | 1 consumer GPU or free Colab |
| Time | Weeks/months | Minutes |
| Cost | Millions $ | ~$0 |
You’re not pre-training BERT — that’s already done. You’re fine-tuning the last classification layer and slightly adjusting the rest. Google Colab free tier (T4) is sufficient for 20 intents.