From Features to Predictions: Building Sequence Labelers with CRFSuite
Conditional Random Fields (CRFs) are a powerful class of probabilistic models for sequence labeling tasks such as named entity recognition (NER), part-of-speech (POS) tagging, and chunking. CRFSuite is a lightweight, fast implementation that makes training and deploying linear-chain CRFs practical for many NLP projects. This article walks through the end-to-end process of building a sequence labeler with CRFSuite: feature design, data preparation, training, evaluation, and inference.
1. Problem setup and data format
Assume a standard sequence-labeling task where each input is a tokenized sentence and the goal is to predict a label per token. CRFSuite expects data in an “instance-per-line” format where each token line contains the label (during training) and one or more feature fields; blank lines separate sentences.
Example training instance (one token per line; the label comes first, followed by tab-separated feature fields), here using the BILOU scheme:

```
U-PER	word=John	lower=john	iscapitalized=1	prevword=BOS
O	word=lives	lower=lives	iscapitalized=0	prevword=John
O	word=in	lower=in	iscapitalized=0	prevword=lives
U-LOC	word=Paris	lower=paris	iscapitalized=1	prevword=in
```
Convert your dataset to this format: one token per line with features, blank line between sentences.
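As a concrete sketch of that conversion step, a labeled, tokenized corpus can be serialized into this format with plain Python (the feature names here are illustrative choices, not required by CRFSuite; labels are passed through in whatever scheme your data uses):

```python
def to_crfsuite_lines(sentences):
    """Serialize labeled sentences into CRFSuite's instance-per-line format.

    `sentences` is a list of sentences; each sentence is a list of
    (token, label) pairs. Feature fields are tab-separated, label first,
    and a blank line separates sentences.
    """
    lines = []
    for sent in sentences:
        for i, (token, label) in enumerate(sent):
            prev = sent[i - 1][0] if i > 0 else "BOS"
            feats = [
                "word=" + token,
                "lower=" + token.lower(),
                "iscapitalized=" + str(int(token[:1].isupper())),
                "prevword=" + prev,
            ]
            lines.append("\t".join([label] + feats))
        lines.append("")  # blank line marks the sentence boundary
    return lines

# Example: one two-token sentence
corpus = [[("John", "B-PER"), ("runs", "O")]]
for line in to_crfsuite_lines(corpus):
    print(line)
```

Writing the returned lines to a file with one `"\n".join(...)` call produces a file the `crfsuite` command-line tool can consume directly.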
2. Feature engineering
Feature choice strongly influences CRF performance: combine local token features with contextual features from a small window around each token. Typical features include:
- Token-level: word, lowercase form, prefix/suffix (1–4 chars), shape (e.g., Xxxx, dddd), isdigit, isupper, ispunct.
- Morphological: lemma, POS tag (if available), morphological attributes.
- Orthographic: capitalization pattern, contains hyphen, contains digit.
- Context: previous/next word, previous/next POS, previous predicted label during inference (handled by CRF’s transition features).
- Gazetteers / dictionaries: binary indicator if token in list (cities, persons, organizations).
- Windowed combinations: bigrams/trigrams of words or word+POS across nearby positions.
Keep features sparse and informative; avoid extremely high-cardinality raw features without normalization (e.g., raw word forms can be kept but also include lowercased and suffix variants).
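For the python-crfsuite workflow discussed below, features are expressed as one dict per token rather than strings. A sketch of a feature-extraction function covering several of the feature types above (names such as `suffix3` and `hashyphen` are our own conventions, not anything CRFSuite mandates):

```python
def word2features(sent, i):
    """Build a feature dict for token i of a sentence (a list of words)."""
    word = sent[i]
    features = {
        "word": word,
        "lower": word.lower(),
        "prefix3": word[:3],
        "suffix3": word[-3:],
        "isupper": word.isupper(),
        "istitle": word.istitle(),
        "isdigit": word.isdigit(),
        "hashyphen": "-" in word,
    }
    if i > 0:
        features["prevword"] = sent[i - 1].lower()   # left context
    else:
        features["BOS"] = True                       # sentence start
    if i < len(sent) - 1:
        features["nextword"] = sent[i + 1].lower()   # right context
    else:
        features["EOS"] = True                       # sentence end
    return features

def sent2features(sent):
    return [word2features(sent, i) for i in range(len(sent))]
```

Gazetteer indicators slot in the same way: add a boolean key such as `"in_city_list"` computed from a set lookup.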
3. Preparing training and test sets
- Split data into train / dev / test (common splits: 80/10/10 or 70/15/15).
- Shuffle at document level, preserve sentence boundaries.
- Use consistent feature extraction for train and test.
- Normalize or map uncommon tokens (e.g., replace rare words with UNK) if helpful.
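The document-level shuffle-and-split described above can be sketched as follows (the 80/10/10 default and the fixed `seed` are illustrative; sentence boundaries inside a document are never broken):

```python
import random

def split_documents(documents, dev_frac=0.1, test_frac=0.1, seed=42):
    """Shuffle at the document level, then split into train/dev/test.

    `documents` is a list of documents, each a list of sentences.
    Splitting whole documents avoids leaking near-duplicate sentences
    from the same document across splits.
    """
    docs = list(documents)
    random.Random(seed).shuffle(docs)  # deterministic shuffle
    n = len(docs)
    n_test = int(n * test_frac)
    n_dev = int(n * dev_frac)
    test = docs[:n_test]
    dev = docs[n_test:n_test + n_dev]
    train = docs[n_test + n_dev:]
    return train, dev, test
```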
4. Installing and invoking CRFSuite
CRFSuite has a command-line interface and Python bindings (python-crfsuite). Install the Python wrapper:
pip install python-crfsuite
Basic Python workflow uses feature dictionaries per token (list of dicts per sentence) and label lists per sentence. Example minimal training code:
```python
import pycrfsuite

trainer = pycrfsuite.Trainer(verbose=False)
for xseq, yseq in training_data:
    # xseq: list of feature dicts, yseq: list of labels
    trainer.append(xseq, yseq)
trainer.set_params({
    'c1': 0.1,    # L1 regularization
    'c2': 0.01,   # L2 regularization
    'max_iterations': 100,
    'feature.possible_transitions': True,
})
trainer.train('model.crfsuite')
```
Key parameters:
- c1 (L1) and c2 (L2) regularization to control sparsity and overfitting.
- max_iterations for convergence.
- algorithm (e.g., lbfgs, pa, etc.) — default is usually fine.
- feature.possible_transitions to include possible but unseen transitions.
5. Evaluation metrics and best practices
- Use token-level accuracy, precision, recall, and F1 — for NER use entity-level F1 (span-level) not just token-level.
- Use the dev set for hyperparameter tuning (c1, c2, feature choices).
- Perform error analysis: inspect confusion matrices, common mispredictions, and failure cases (boundary errors, rare entities).
- Use cross-validation if dataset is small.
Example evaluation with sklearn-style metrics:
- Flatten predictions and gold labels; compute per-class precision/recall/F1.
- For entity-level F1, use sequence-to-entity conversion (BIO/BILOU scheme) and compare spans.
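The span-conversion step can be sketched for the BIO scheme as below (BILOU would need extra cases for U- and L- tags; these helpers are our own, not part of CRFSuite or scikit-learn):

```python
def bio_to_spans(labels):
    """Extract (type, start, end) entity spans from a BIO label sequence."""
    spans, start, etype = [], None, None
    for i, label in enumerate(labels + ["O"]):  # sentinel flushes last span
        if label.startswith("B-") or label == "O" or (
            label.startswith("I-") and label[2:] != etype
        ):
            if start is not None:
                spans.append((etype, start, i))
                start, etype = None, None
        if label.startswith("B-"):
            start, etype = i, label[2:]
        elif label.startswith("I-") and start is None:
            start, etype = i, label[2:]  # tolerate I- without a preceding B-
    return spans

def entity_f1(gold_seqs, pred_seqs):
    """Micro-averaged entity-level F1: a span counts only on exact match."""
    tp = fp = fn = 0
    for gold, pred in zip(gold_seqs, pred_seqs):
        g, p = set(bio_to_spans(gold)), set(bio_to_spans(pred))
        tp += len(g & p)
        fp += len(p - g)
        fn += len(g - p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

Note how a boundary error (predicting only part of an entity) scores zero for that entity here, while token-level metrics would still give partial credit; this is why the two numbers can diverge sharply.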
6. Inference and decoding
Load the trained model and tag new sentences:
```python
tagger = pycrfsuite.Tagger()
tagger.open('model.crfsuite')
y_pred = tagger.tag(xseq)  # xseq: list of feature dicts for a sentence
```
For confidence scores, python-crfsuite exposes tagger.probability(yseq), which returns the probability of a full label sequence for the most recently tagged sentence, and tagger.marginal(label, t), which returns the marginal probability of a label at position t.
7. Speed and scaling tips
- Reduce feature dimensionality: avoid including full word embeddings per token; use compact features.