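Tokenizers are often overlooked, but they decide how raw text is split into the units a model actually sees, which directly affects sequence length, vocabulary coverage, and compute cost.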
Types
- BPE (Byte Pair Encoding)
- WordPiece
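- SentencePiece (a toolkit that trains BPE or unigram models directly on raw text, without language-specific pre-tokenization)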
- Character-level
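
To make the first item concrete, here is a minimal sketch of BPE training on a toy corpus. It assumes whitespace pre-tokenization and breaks ties by first occurrence; the function names (`get_pair_counts`, `merge_pair`, `train_bpe`) are illustrative and not taken from any particular library. Production tokenizers add byte-level fallback, end-of-word markers, and much faster data structures.

```python
from collections import Counter


def get_pair_counts(words):
    """Count adjacent symbol pairs, weighted by word frequency."""
    counts = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            counts[(a, b)] += freq
    return counts


def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def train_bpe(corpus, num_merges):
    """Learn up to `num_merges` merge rules from a whitespace-split corpus."""
    # Assumption: each word starts as a tuple of single characters.
    words = {tuple(w): f for w, f in Counter(corpus.split()).items()}
    merges = []
    for _ in range(num_merges):
        counts = get_pair_counts(words)
        if not counts:
            break
        best = max(counts, key=counts.get)  # most frequent adjacent pair
        merges.append(best)
        words = merge_pair(words, best)
    return merges


merges = train_bpe("low low low lower lowest", num_merges=2)
print(merges)
```

Each merge adds one entry to the vocabulary, so `num_merges` is effectively the knob that sets vocabulary size in this sketch.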
Considerations
- Vocabulary size
- Handling of rare words
- Multi-language support
- Special tokens
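
The trade-off between vocabulary size and rare-word handling can be seen in a small sketch. It compares a capped word-level vocabulary, which maps unseen words to a special unknown token, with a character-level encoding, which never hits an unknown word but produces much longer sequences. The token name `<unk>` and the helper names are assumptions made for illustration.

```python
from collections import Counter

UNK = "<unk>"  # hypothetical name for the unknown-word special token


def build_word_vocab(corpus, max_size):
    """Keep the most frequent words; reserve id 0 for the unknown token."""
    counts = Counter(corpus.split())
    tokens = [UNK] + [w for w, _ in counts.most_common(max_size - 1)]
    return {tok: i for i, tok in enumerate(tokens)}


def encode_words(text, vocab):
    """Word-level encoding: out-of-vocabulary words collapse to UNK."""
    return [vocab.get(w, vocab[UNK]) for w in text.split()]


def encode_chars(text):
    """Character-level encoding: one id per character (here, its code point)."""
    return [ord(c) for c in text]


corpus = "the cat sat on the mat the cat ran"
vocab = build_word_vocab(corpus, max_size=3)

word_ids = encode_words("the cat jumped", vocab)
char_ids = encode_chars("the cat jumped")

print(len(word_ids), word_ids)  # short sequence, but "jumped" is lost to UNK
print(len(char_ids))            # no information lost, but far more tokens
```

Subword methods such as BPE and WordPiece sit between these two extremes: frequent words stay whole, while rare words are split into smaller known pieces instead of being dropped.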