Token-Aware Text Chunking for RAG Pipelines in PHP
Introduction
You're building a RAG pipeline in PHP. A user uploads a 40-page PDF, and you need to split it into chunks that fit within your embedding model's token limit. The naive approach — cutting every N characters — splits mid-sentence, breaks context, and produces chunks that retrieve poorly.
Character count doesn't map cleanly to token count. The sentence "El precio es $3.14 por unidad" is 29 characters but only about 10 tokens. A URL like https://example.com/api/v2/users?page=1&limit=50 is 48 characters but 20+ tokens. Splitting by characters means you're either wasting token budget or overflowing it.
What you need is a chunker that speaks tokens — one that splits on sentence boundaries, respects a hard token limit per chunk, and handles edge cases like sentences longer than the limit itself. In this article, we'll build exactly that.
The Three-Layer Strategy
Our chunker uses a hierarchical splitting approach:
Input text
     │
     ▼
┌─────────────────────────┐
│ Layer 1: Sentences      │  Split on sentence boundaries
│ (regex-based)           │  using punctuation + whitespace
└────────────┬────────────┘
             ▼
┌─────────────────────────┐
│ Layer 2: Token Packing  │  Pack sentences into chunks
│ (greedy bin-packing)    │  that fit the token limit
└────────────┬────────────┘
             ▼
┌─────────────────────────┐
│ Layer 3: Fallback       │  Character-level split for
│ (character-level)       │  sentences exceeding the limit
└─────────────────────────┘

Each layer handles a different granularity. Most text flows through layers 1 and 2. Layer 3 only activates for pathological cases — a base64 blob, a minified JSON block, or a URL-heavy paragraph with no sentence boundaries.
Layer 1: Sentence-Aware Splitting
Splitting on sentences sounds simple until you encounter "The price is $3.14. Next item." — a naive split on . followed by space would break after 3., producing a fragment.
This regex handles the common edge cases:
const SPLIT_SENTENCE_REGEX =
    '/(?<!\b[0-9]\.)(?<![0-9])(?<=[.!?。？！])\s+(?!\d)/u';

Breaking it down:
| Fragment | Purpose |
|---|---|
| `(?<!\b[0-9]\.)` | Don't split after a decimal like `3.` |
| `(?<![0-9])` | Don't split when preceded by a digit |
| `(?<=[.!?。？！])` | Split after sentence-ending punctuation |
| `\s+` | Consume the whitespace between sentences |
| `(?!\d)` | Don't split when the next sentence starts with a digit (e.g., `No. 5`) |
| `/u` | Unicode mode for CJK punctuation (`。？！`) |
The lookbehind assertions prevent false splits on decimal numbers, version strings, and abbreviations, while the Unicode flag handles Chinese, Japanese, and other languages that use full-width punctuation.
$sentences = preg_split(
    self::SPLIT_SENTENCE_REGEX,
    $text,
    -1,
    PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY
);

TIP
The PREG_SPLIT_NO_EMPTY flag filters out empty strings that preg_split sometimes produces at boundaries. Without it, you'd need to manually filter blanks in the packing loop.
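To sanity-check the splitter, here's a self-contained run over a tricky input. The regex is inlined so the snippet runs on its own:

```php
<?php
// Sentence-splitting regex, inlined for a standalone demo.
$regex = '/(?<!\b[0-9]\.)(?<![0-9])(?<=[.!?。？！])\s+(?!\d)/u';

$text = 'The price is $3.14. Next item. Does it work? Yes!';
$sentences = preg_split($regex, $text, -1, PREG_SPLIT_NO_EMPTY);

print_r($sentences);
// 4 sentences: 'The price is $3.14.', 'Next item.', 'Does it work?', 'Yes!'
```

Note that "$3.14" survives intact: there is no whitespace between "3." and "14", so `\s+` never matches inside the number.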
Layer 2: Token-Counting Greedy Packing
With sentences extracted, we pack them into chunks using a greedy algorithm: add sentences to the current chunk until the next one would exceed the token limit, then start a new chunk.
public static function splitTextIntoTokenChunks(
    string $text,
    int $token_limit_per_chunk
): array {
    $chunks = [];
    $current_chunk = '';

    $sentences = preg_split(
        self::SPLIT_SENTENCE_REGEX,
        $text,
        -1,
        PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY
    );

    if (empty($sentences)) {
        if (TokenizerX::count($text) <= $token_limit_per_chunk) {
            return $text ? [$text] : [];
        }
        return self::splitLongWord($text, $token_limit_per_chunk);
    }

    foreach ($sentences as $sentence) {
        $sentence = trim($sentence);
        if (empty($sentence)) {
            continue;
        }
        $sentence_token_count = TokenizerX::count($sentence);

        // Layer 3 trigger: sentence exceeds the limit on its own
        if ($sentence_token_count > $token_limit_per_chunk) {
            if ($current_chunk) {
                $chunks[] = $current_chunk;
            }
            $sub_chunks = self::splitLongWord(
                $sentence,
                $token_limit_per_chunk
            );
            $chunks = array_merge($chunks, $sub_chunks);
            $current_chunk = '';
            continue;
        }

        // Try adding the sentence to the current chunk
        $test_chunk = $current_chunk
            ? $current_chunk . ' ' . $sentence
            : $sentence;
        $test_chunk_tokens = TokenizerX::count($test_chunk);

        if ($test_chunk_tokens <= $token_limit_per_chunk) {
            $current_chunk = $test_chunk;
        } else {
            // Doesn't fit — finalize current chunk, start new one
            if ($current_chunk) {
                $chunks[] = $current_chunk;
            }
            $current_chunk = $sentence;
        }
    }

    if ($current_chunk) {
        $chunks[] = $current_chunk;
    }

    return $chunks;
}

The key detail is the $test_chunk approach — we count tokens on the combined string (current chunk + space + new sentence), not on the sentence alone. This matters because tokenizers don't produce additive counts: tokens("A B") ≠ tokens("A") + tokens("B"). The space between sentences might merge with adjacent characters into a single token, or word boundaries might shift. By counting the combined string, we get the true token cost.
WARNING
Avoid the temptation to track a running $current_chunk_tokens counter and add $sentence_token_count to it. Token counts are not additive across concatenation. Always re-count the combined string.
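To watch the packing behavior without pulling in a real tokenizer, here is a self-contained sketch where a crude whitespace word count stands in for TokenizerX::count (an assumption for demo purposes only; real token counts differ):

```php
<?php
// Stand-in token counter: whitespace-delimited word count (approximation only).
function approxTokens(string $text): int {
    return count(preg_split('/\s+/u', trim($text), -1, PREG_SPLIT_NO_EMPTY));
}

// Greedy packing: same shape as splitTextIntoTokenChunks, minus the fallback.
function packSentences(array $sentences, int $limit): array {
    $chunks = [];
    $current = '';
    foreach ($sentences as $sentence) {
        // Always count the combined string, never add per-sentence counts.
        $test = $current === '' ? $sentence : $current . ' ' . $sentence;
        if (approxTokens($test) <= $limit) {
            $current = $test;            // still fits: grow the current chunk
        } else {
            if ($current !== '') {
                $chunks[] = $current;    // finalize, start a new chunk
            }
            $current = $sentence;
        }
    }
    if ($current !== '') {
        $chunks[] = $current;
    }
    return $chunks;
}

$sentences = ['One two three.', 'Four five.', 'Six seven eight nine.'];
print_r(packSentences($sentences, 5));
// Two chunks: 'One two three. Four five.' (5 words) and 'Six seven eight nine.'
```

The first two sentences pack into one chunk because their combined count fits the limit exactly; the third would overflow it, so it starts a fresh chunk.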
Layer 3: Character-Level Fallback
When a single sentence exceeds the token limit — think a minified JSON blob, a base64 string, or a long URL list — we fall back to character-by-character splitting:
private static function splitLongWord(
    string $word,
    int $token_limit
): array {
    $chunks = [];
    $current_chunk = '';
    $current_chunk_tokens = 0;

    $length = mb_strlen($word); // hoisted: mb_strlen rescans the string on every call

    for ($i = 0; $i < $length; $i++) {
        $char = mb_substr($word, $i, 1);
        $char_token_count = TokenizerX::count($char);

        if ($current_chunk_tokens + $char_token_count <= $token_limit) {
            $current_chunk .= $char;
            $current_chunk_tokens += $char_token_count;
        } else {
            $chunks[] = $current_chunk;
            $current_chunk = $char;
            $current_chunk_tokens = $char_token_count;
        }
    }

    if ($current_chunk) {
        $chunks[] = $current_chunk;
    }

    return $chunks;
}

This method uses mb_substr for Unicode safety — a Chinese character that's 3 bytes and 1 token won't get split into invalid byte sequences. The per-character token counting is an acceptable approximation here: for single characters, the tokenizer's output is deterministic, so additive counting works (unlike sentence concatenation).
TIP
This fallback is intentionally conservative. It sacrifices readability (splits mid-word) for correctness (never exceeds the token limit). In practice, it rarely fires — most natural language text has sentence boundaries within any reasonable token limit.
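The Unicode-safety point is easy to demonstrate: byte-oriented substr can land mid-character and produce invalid UTF-8, while mb_substr always cuts on character boundaries. A standalone illustration (not part of the chunker, with fixed-width slices standing in for the token-limited loop):

```php
<?php
$text = '价格是42元'; // CJK characters are 3 bytes each in UTF-8

// Byte-oriented slice: 4 bytes cuts through the second character.
$byteSlice = substr($text, 0, 4);
var_dump(mb_check_encoding($byteSlice, 'UTF-8')); // bool(false)

// Character-oriented slice: 4 characters, always on a boundary.
$charSlice = mb_substr($text, 0, 4);
var_dump(mb_check_encoding($charSlice, 'UTF-8')); // bool(true)
```

An invalid byte sequence like $byteSlice would be rejected or mangled by most embedding APIs, which is exactly why the fallback walks characters rather than bytes.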
Adding Chunk Overlap for Better Retrieval
Chunks that end abruptly can lose context that spans a boundary. If a user asks "What's the return policy for electronics?" and the answer starts at the end of chunk 3 and continues into chunk 4, neither chunk alone retrieves well.
Overlap fixes this by repeating a portion of adjacent chunks:
public static function overlapChunks(
    array $chunks,
    float $overlap_percentage = 0.2
): array {
    $overlapped_chunks = [];

    foreach ($chunks as $index => $chunk) {
        // Prepend tail of previous chunk
        if (isset($chunks[$index - 1])) {
            $previous = $chunks[$index - 1];
            $overlap_chars = (int) (mb_strlen($previous) * $overlap_percentage);
            $chunk = mb_substr($previous, -$overlap_chars) . $chunk;
        }

        // Append head of next chunk
        if (isset($chunks[$index + 1])) {
            $next = $chunks[$index + 1];
            $overlap_chars = (int) (mb_strlen($next) * $overlap_percentage);
            $chunk .= mb_substr($next, 0, $overlap_chars);
        }

        $overlapped_chunks[] = $chunk;
    }

    return $overlapped_chunks;
}

With 20% overlap, a 1536-token chunk gets roughly 300 tokens prepended from the previous chunk and 300 appended from the next. This creates redundancy that improves retrieval recall at the cost of ~40% more storage. (Note the multibyte string functions here: byte-oriented substr could slice a UTF-8 character in half at the overlap boundary.)
Chunk 1: [==========]
Chunk 2:       [==========]     ← overlaps with 1 and 3
Chunk 3:             [==========]

WARNING
Overlap happens at the character level, not the token level. This is a deliberate trade-off — character-based overlap is much faster and produces "good enough" results for retrieval. If you need precise token-level overlap, you'd need to re-tokenize and trim, which adds significant complexity.
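Here's the overlap arithmetic on toy strings, with the same prepend/append logic inlined and a 50% ratio so the effect is easy to see:

```php
<?php
// Toy chunks; 50% overlap so the effect is easy to eyeball.
$chunks = ['ABCDEFGHIJ', 'KLMNOPQRST', 'UVWXYZ0123'];
$pct = 0.5;

$out = [];
foreach ($chunks as $i => $chunk) {
    if (isset($chunks[$i - 1])) {
        $prev = $chunks[$i - 1];
        $n = (int) (mb_strlen($prev) * $pct);
        $chunk = mb_substr($prev, -$n) . $chunk;   // tail of previous
    }
    if (isset($chunks[$i + 1])) {
        $next = $chunks[$i + 1];
        $n = (int) (mb_strlen($next) * $pct);
        $chunk .= mb_substr($next, 0, $n);         // head of next
    }
    $out[] = $chunk;
}

print_r($out);
// 'ABCDEFGHIJKLMNO', 'FGHIJKLMNOPQRSTUVWXY', 'PQRSTUVWXYZ0123'
```

The middle chunk grows on both sides (10 → 20 characters), while the first and last only grow on one, which is why storage overhead approaches twice the overlap percentage.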
Putting It Together: Document Parsing Pipeline
Here's how these pieces connect in a real RAG pipeline that processes uploaded documents:
use App\Utils\TextUtil;
use Rajentrivedi\TokenizerX\TokenizerX;

class DocumentParser
{
    const TOKENS_PER_CHUNK = 1536;

    public function parse(string $text, string $source): array
    {
        // Step 1: Split into token-limited chunks
        $chunks = TextUtil::splitTextIntoTokenChunks(
            text: $text,
            token_limit_per_chunk: self::TOKENS_PER_CHUNK,
        );

        // Step 2: Add overlap for better retrieval
        $chunks = TextUtil::overlapChunks($chunks, 0.2);

        // Step 3: Prepare for embedding
        return array_map(fn (string $chunk, int $i) => [
            'content' => $chunk,
            'source' => $source,
            'chunk_index' => $i,
            'token_count' => TokenizerX::count($chunk),
        ], $chunks, array_keys($chunks));
    }
}

The TOKENS_PER_CHUNK = 1536 value is chosen to fit well within common embedding model limits (most accept up to 8192 tokens) while being granular enough for precise retrieval. Smaller chunks (512–1024 tokens) improve precision; larger chunks (2048–4096) improve context. 1536 is a practical middle ground.
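The array_map(…, $chunks, array_keys($chunks)) idiom in step 3 is worth a note: passing two arrays to array_map zips them element by element, which is how each chunk gets paired with its index. A self-contained demonstration:

```php
<?php
$chunks = ['first chunk', 'second chunk'];

// array_map with N arrays calls the callback with one element from each.
$records = array_map(
    fn (string $chunk, int $i) => ['chunk_index' => $i, 'content' => $chunk],
    $chunks,
    array_keys($chunks)
);

print_r($records);
// record 0: chunk_index 0 / 'first chunk'; record 1: chunk_index 1 / 'second chunk'
```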
Why 1536 Tokens?
The chunk size directly affects retrieval quality:
| Chunk Size | Precision | Context | Best For |
|---|---|---|---|
| 256–512 | High | Low | FAQ, short answers |
| 1024–1536 | Balanced | Balanced | General documents |
| 2048–4096 | Low | High | Long-form analysis |
Smaller chunks match specific queries better (higher precision) but may miss surrounding context. Larger chunks preserve context but dilute the embedding — a 4000-token chunk about five topics won't match any single topic as strongly as a 1000-token chunk about one topic.
1536 tokens ≈ 1,100 words ≈ 2–3 paragraphs. This usually captures a complete thought or section without mixing unrelated content.
Token Counting in PHP
This implementation uses the rajentrivedi/tokenizer-x package for token counting:
composer require rajentrivedi/tokenizer-x

use Rajentrivedi\TokenizerX\TokenizerX;
// Count tokens in a string
$count = TokenizerX::count("Hello, world!"); // 4
// Count with specific encoding
$count = TokenizerX::count("Hello, world!", "cl100k_base");

The default encoding (cl100k_base) aligns with GPT-4 and most modern embedding models. If you're using a different model family, check which tokenizer it uses and configure accordingly.
TIP
If you don't need exact token counts, a fast approximation is ceil(mb_strlen($text) / 4) — English text averages roughly 4 characters per token. This won't work for CJK text (where each character is typically 1–2 tokens) or for code (where tokens are less predictable), but it's useful for quick estimates and test cases.
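That approximation as a tiny helper (remember: the ~4 characters-per-token ratio is an English-prose heuristic, not a guarantee):

```php
<?php
// Rough token estimate: English text averages ~4 characters per token.
function estimateTokens(string $text): int
{
    return (int) ceil(mb_strlen($text) / 4);
}

echo estimateTokens('Hello, world!'); // 4 (13 chars / 4, rounded up)
```

For "Hello, world!" the estimate happens to match the real cl100k_base count; for CJK text or code, expect it to undercount badly.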
Conclusion
Token-aware chunking for RAG in PHP comes down to three ideas:
- Split on sentences first — A regex that handles decimals, version strings, and multilingual punctuation gives you natural boundaries.
- Pack greedily with real token counts — Don't approximate. Count tokens on the combined string because tokenization isn't additive.
- Fall back gracefully — When text has no sentence boundaries, character-level splitting guarantees the token limit is never exceeded.
The overlap step is optional but recommended for retrieval pipelines. 20% overlap is a good starting point — increase if you're seeing context-boundary misses, decrease if storage costs matter.
One thing this implementation intentionally doesn't do: semantic chunking (using embeddings to detect topic boundaries). That approach can improve retrieval by 15–25% but costs 3–5x more in compute and adds significant complexity. Start with sentence-aware token chunking — it handles most cases well and is deterministic, fast, and debuggable. Graduate to semantic chunking only when retrieval quality demands it.
