# Robust JSON Decoding for LLM Responses in PHP

## Introduction
You prompt an LLM with "respond in JSON format" and get back:

````text
Here's the JSON you requested:
```json
{"mood": "happy", "score": 8}
```
Let me know if you need anything else!
````

Or worse — the JSON looks clean but `json_decode()` returns `null` because there's an invisible zero-width space hiding between two keys. Or the response starts with a UTF-8 BOM that arrived from a downstream API.
PHP's `json_decode()` is strict by design. It expects perfectly formed JSON — no surrounding text, no markdown fences, no invisible characters. But LLMs are inherently messy: they wrap JSON in code blocks, prepend explanatory text, and sometimes introduce control characters from their training data.
The fix isn't to ask the LLM harder. It's to build a decoding pipeline that gracefully recovers from the most common malformations. In this article, we'll build exactly that: a `JsonHelper::decode()` method that handles what production LLM integrations actually throw at you.
## The Recovery Pipeline
Our approach is a 5-stage pipeline where each stage handles a specific class of malformation:

```
 Raw LLM Response
          │
          ▼
┌─────────────────────┐
│ 1. Markdown Extract │  Strip json code fences
└─────────┬───────────┘
          ▼
┌─────────────────────┐
│ 2. Bracket Matching │  Find outermost {} or []
└─────────┬───────────┘
          ▼
┌─────────────────────┐
│ 3. UTF-8 Cleanup    │  Remove control chars, BOM, zero-width
└─────────┬───────────┘
          ▼
┌─────────────────────┐
│ 4. Trim             │  Strip surrounding whitespace
└─────────┬───────────┘
          ▼
┌─────────────────────┐
│ 5. Decode           │  json_decode with error flags
└─────────────────────┘
```

Each stage is a no-op when its pattern is absent: if the input has no markdown fences, for example, Stage 1 returns it unchanged. Clean JSON therefore passes straight through with minimal overhead, while malformed JSON is repaired progressively.
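
One way to picture the pass-through property is as a list of small functions applied in order. The sketch below is illustrative only: `runStages()` and the two toy stages are hypothetical names chosen for this example, not part of the final class.

```php
<?php

/**
 * Apply each stage in order. A stage returns its input unchanged
 * when its pattern is absent, so clean input passes straight through.
 *
 * @param list<callable(string): string> $stages
 */
function runStages(string $input, array $stages): string
{
    foreach ($stages as $stage) {
        $input = $stage($input);
    }

    return $input;
}

// Two toy stages (hypothetical, for illustration only).
$stripBom = static fn (string $s): string =>
    str_starts_with($s, "\xEF\xBB\xBF") ? substr($s, 3) : $s;
$trim = static fn (string $s): string => trim($s);

$stages = [$stripBom, $trim];

// Dirty input is repaired; clean input comes back untouched.
var_dump(runStages("\xEF\xBB\xBF {\"ok\": true}\n", $stages) === '{"ok": true}'); // bool(true)
var_dump(runStages('{"ok": true}', $stages) === '{"ok": true}');                  // bool(true)
```

Keeping every stage as a `string -> string` function makes it easy to switch one off while debugging.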
## Stage 1: Markdown Extraction
LLMs frequently wrap JSON in markdown code fences, especially when they've been trained on conversational data. The pattern is consistent enough to handle with a single regex:
```php
if (preg_match('/```(?:json)?\s*\n?(.*?)\n?\s*```/s', $json, $matches)) {
    $json = $matches[1];
}
```

Breaking this down:

- `` ``` `` — Matches the opening fence
- `(?:json)?` — Optional `json` language identifier (non-capturing group)
- `\s*\n?` — Flexible whitespace and optional newline after the fence
- `(.*?)` — Captures the content inside (lazy match)
- `\n?\s*` — Flexible whitespace before closing fence
- `` ``` `` — Matches the closing fence
- `/s` — Dot matches newlines (critical for multi-line JSON)

The `if` check means this only fires when fences are actually present. Clean JSON without fences passes through untouched.

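
To see the pattern in action, here is a small sketch that wraps it in a helper and runs it on fenced and unfenced input. The `stripFences()` name is hypothetical and used only for this demonstration. One limitation to keep in mind: because the capture is lazy, it stops at the first closing fence, so a JSON string value that itself contains three backticks would be cut short.

```php
<?php

// Same pattern as above, wrapped in a hypothetical helper for demonstration.
function stripFences(string $text): string
{
    if (preg_match('/```(?:json)?\s*\n?(.*?)\n?\s*```/s', $text, $matches)) {
        return $matches[1];
    }

    return $text; // No fences: pass through unchanged
}

$fenced = "Here you go:\n```json\n{\"mood\": \"happy\"}\n```\nAnything else?";
$plain  = '{"mood": "happy"}';

var_dump(stripFences($fenced) === '{"mood": "happy"}'); // bool(true)
var_dump(stripFences($plain) === $plain);               // bool(true)
```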
## Stage 2: Bracket Matching
After stripping fences, there might still be surrounding text — the LLM's "Here's the result:" preamble, or a "Let me know if you need changes" postscript. We need to extract just the JSON object or array.
A naive `strpos('{')` to `strrpos('}')` slice breaks as soon as the text after the JSON contains a brace, and a plain depth counter breaks when a string value contains an unbalanced brace:

```json
{"message": "Type } to close the block"}
```

We need string-aware bracket matching:

```php
private static function extractJsonFromText(string $text): string
{
    // Find the first { or [
    $firstBrace = strpos($text, '{');
    $firstBracket = strpos($text, '[');

    if ($firstBrace === false && $firstBracket === false) {
        return $text; // No JSON structure found
    }

    // Determine which comes first
    if ($firstBrace !== false
        && ($firstBracket === false || $firstBrace < $firstBracket)) {
        $startPos = $firstBrace;
        $openChar = '{';
        $closeChar = '}';
    } else {
        $startPos = $firstBracket;
        $openChar = '[';
        $closeChar = ']';
    }

    // Track depth with string awareness
    $depth = 0;
    $inString = false;
    $escapeNext = false;
    $endPos = false;
    $length = strlen($text);

    for ($i = $startPos; $i < $length; $i++) {
        $char = $text[$i];

        if ($escapeNext) {
            $escapeNext = false;
            continue;
        }

        if ($char === '\\' && $inString) {
            $escapeNext = true;
            continue;
        }

        if ($char === '"' && !$escapeNext) {
            $inString = !$inString;
            continue;
        }

        if ($inString) {
            continue;
        }

        if ($char === $openChar) {
            $depth++;
        } elseif ($char === $closeChar) {
            $depth--;
            if ($depth === 0) {
                $endPos = $i;
                break;
            }
        }
    }

    if ($endPos !== false) {
        return substr($text, $startPos, $endPos - $startPos + 1);
    }

    return $text;
}
```

The algorithm walks character by character, tracking whether we're inside a JSON string (where braces don't count) and handling escape sequences (so `\"` doesn't toggle the string state). When depth returns to zero, we've found the matching closing bracket.

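**TIP:** This handles both objects (`{}`) and arrays (`[]`). Whichever appears first in the text determines what we're extracting, so a stray `[` or `{` in the preamble (for example `[Note]`) will be picked up instead of the real payload.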
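
For contrast, the following sketch shows the naive first-brace-to-last-brace slice going wrong on a response with a brace in its postscript. The `naiveSlice()` helper is a hypothetical name, written only to illustrate the failure; the string-aware walker would stop at the first balanced close and return `{"id": 7}` for the same input.

```php
<?php

// Naive approach: slice from the first "{" to the last "}".
function naiveSlice(string $text): string
{
    $start = strpos($text, '{');
    $end   = strrpos($text, '}');

    if ($start === false || $end === false || $end < $start) {
        return $text;
    }

    return substr($text, $start, $end - $start + 1);
}

$response = 'Result: {"id": 7} (replace {id} as needed)';

$slice = naiveSlice($response);

// The slice runs on into the postscript and is no longer valid JSON.
var_dump($slice === '{"id": 7} (replace {id}'); // bool(true)
var_dump(json_decode($slice) === null);         // bool(true)
```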
## Stage 3: UTF-8 Cleanup

Invisible characters are the most insidious problem. The JSON looks perfect in your editor, but `json_decode()` rejects it. Four regex passes clean the most common offenders:

```php
// Pass 1: C1 control characters (U+0080–U+009F)
// These appear in text from Windows-1252 encoded sources
$cleanedJson = preg_replace('/[\x{0080}-\x{009F}]/u', '', $json) ?? $json;

// Pass 2: Zero-width and directional marks (U+200B–U+200F)
// Zero-width space, zero-width non-joiner, zero-width joiner, etc.
// Common in copy-pasted text from web pages
$cleanedJson = preg_replace('/[\x{200B}-\x{200F}]/u', '', $cleanedJson) ?? $cleanedJson;

// Pass 3: UTF-8 BOM (U+FEFF)
// Byte Order Mark that some editors and APIs prepend
$cleanedJson = preg_replace('/\x{FEFF}/u', '', $cleanedJson) ?? $cleanedJson;

// Pass 4: ASCII control characters (except tab, newline, carriage return)
// Characters 0x00–0x08, 0x0B, 0x0C, 0x0E–0x1F, and 0x7F
$cleanedJson = preg_replace(
    '/[\x00-\x08\x0B\x0C\x0E-\x1F\x7F]/u',
    '',
    $cleanedJson
) ?? $cleanedJson;
```

Why four separate passes instead of one big regex? Readability and debuggability. When a character class causes issues, you can disable individual passes to isolate the problem. The `/u` flag makes PCRE treat both the pattern and the subject as UTF-8, which is why the patterns use `\x{...}` code point escapes: under `/u`, a byte-style escape such as `\xE2` means the code point U+00E2, not the raw byte, so a byte-sequence pattern like `\xE2\x80[\x8B-\x8F]` would never match a real zero-width space. If the subject is not valid UTF-8, `preg_replace()` returns `null`, so each pass falls back to its input with `??` and leaves the problem to the decode stage.

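**WARNING:** Don't strip tabs (`\x09`), newlines (`\x0A`), or carriage returns (`\x0D`). They are legal JSON whitespace between tokens, and pretty-printed JSON depends on them. The ASCII pass explicitly skips them. (Inside a string value they must appear escaped, as `\t`, `\n`, or `\r`; a raw one there still makes `json_decode()` fail.)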
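
When a payload "looks fine" but still fails, it helps to list which invisible code points it actually contains. The helper below is a hypothetical debugging aid, not part of the pipeline; it assumes the `mbstring` extension is available for `mb_ord()`.

```php
<?php

/**
 * List the invisible code points found in a string as "U+XXXX" labels.
 * Hypothetical debugging helper, not part of the pipeline.
 *
 * @return list<string>
 */
function findInvisibleCodePoints(string $text): array
{
    // Same character classes the cleanup passes target, as code points.
    $pattern = '/[\x{0000}-\x{0008}\x{000B}\x{000C}\x{000E}-\x{001F}'
        . '\x{007F}-\x{009F}\x{200B}-\x{200F}\x{FEFF}]/u';

    if (!preg_match_all($pattern, $text, $matches)) {
        return []; // No match, or the input is not valid UTF-8
    }

    return array_map(
        static fn (string $char): string => sprintf('U+%04X', mb_ord($char, 'UTF-8')),
        $matches[0]
    );
}

// A BOM up front and a zero-width space after the colon.
$suspect = "\xEF\xBB\xBF{\"key\":\xE2\x80\x8B \"value\"}";

print_r(findInvisibleCodePoints($suspect)); // U+FEFF, then U+200B
```

Logging this list alongside a failed input makes it much easier to decide whether a new cleanup pass is needed.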
## Stage 4–5: Decode with Flags

After extraction and cleanup, we trim whitespace and decode:

```php
$cleanedJson = trim($cleanedJson);

if (empty($cleanedJson)) {
    return null;
}

return json_decode(
    json: $cleanedJson,
    associative: true,
    flags: JSON_THROW_ON_ERROR | JSON_INVALID_UTF8_IGNORE
);
```

Two flags work together here:

- `JSON_THROW_ON_ERROR` — Throws a `JsonException` instead of silently returning `null`. This lets us catch failures in a structured way rather than checking `json_last_error()` after every call.
- `JSON_INVALID_UTF8_IGNORE` — Silently drops any remaining invalid UTF-8 sequences rather than failing. This is our safety net for characters that survived the cleanup passes.

The `associative: true` parameter returns arrays instead of `stdClass` objects — a practical choice for data processing pipelines where you're accessing keys by name.

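
The effect of the two flags is easiest to see in isolation. The snippet below is a minimal sketch with made-up inputs: one syntactically broken payload and one payload containing a stray Latin-1 byte (`\xE3`) that is not valid UTF-8.

```php
<?php

// 1) JSON_THROW_ON_ERROR turns a syntax error into a catchable exception.
$error = null;
try {
    json_decode(json: '{"score": }', associative: true, flags: JSON_THROW_ON_ERROR);
} catch (JsonException $e) {
    $error = $e->getMessage();
}
var_dump($error !== null); // bool(true)

// 2) A stray invalid byte makes a strict decode fail...
$raw = "{\"city\": \"S\xE3o Paulo\"}";

$strictFailed = false;
try {
    json_decode(json: $raw, associative: true, flags: JSON_THROW_ON_ERROR);
} catch (JsonException $e) {
    $strictFailed = true;
}
var_dump($strictFailed); // bool(true)

// ...while JSON_INVALID_UTF8_IGNORE drops the byte and keeps going.
$decoded = json_decode(
    json: $raw,
    associative: true,
    flags: JSON_THROW_ON_ERROR | JSON_INVALID_UTF8_IGNORE
);
var_dump($decoded); // the "city" value comes back without the invalid byte
```

Note the trade-off: the ignore flag recovers the structure, but the dropped byte means the decoded value is no longer exactly what the model intended.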
## Real-World Examples

Here are the actual malformations we've encountered in production LLM integrations, and how each pipeline stage handles them:

### Markdown-Wrapped JSON

````text
Here's the analysis:
```json
{"sentiment": "positive", "confidence": 0.92}
```
````

**Stage 1** strips the fences → `{"sentiment": "positive", "confidence": 0.92}` → decoded.

### Text-Surrounded JSON

```text
Based on the conversation, here is the structured data:
{"intent": "booking", "date": "2026-03-15", "guests": 4}
Please confirm if this looks correct.
```

Stage 1 finds no fences (pass-through). **Stage 2** bracket-matches from the first `{` to its closing `}` → `{"intent": "booking", "date": "2026-03-15", "guests": 4}` → decoded.

### Zero-Width Contamination

```json
{"name": "María", "city": "São Paulo"}
```

Invisible zero-width spaces (U+200B) follow "María" and "São". Stages 1 and 2 pass through (structure is fine). **Stage 3** removes the zero-width characters → clean JSON → decoded. Inside a string value these characters would not break `json_decode()` on their own, but they would leak into the decoded data; between tokens they cause a syntax error.

### BOM-Prefixed Response

```text
\xEF\xBB\xBF{"status": "ok", "data": [1, 2, 3]}
```

A UTF-8 BOM precedes the JSON (common from certain API middleware). Because **Stage 2** starts extraction at the first `{`, a leading BOM is dropped along with any other preamble; the BOM pass in **Stage 3** is the backstop for one that ends up inside the extracted span → `{"status": "ok", "data": [1, 2, 3]}` → decoded.

## The Complete Implementation

Putting all stages together into a single, production-ready method:

```php
use Illuminate\Support\Facades\Log; // Assumes Laravel's Log facade; swap in your own logger

class JsonHelper
{
    /**
     * Safely decode a JSON string with recovery for common LLM malformations.
     *
     * Handles: markdown fences, surrounding text, control characters,
     * zero-width spaces, BOM, and invalid UTF-8 sequences.
     */
    public static function decode(?string $json): ?array
    {
        if (empty($json)) {
            return null;
        }

        try {
            // Stage 1: Strip markdown code fences
            if (preg_match('/```(?:json)?\s*\n?(.*?)\n?\s*```/s', $json, $matches)) {
                $json = $matches[1];
            }

            // Stage 2: Extract JSON by matching outermost brackets
            $json = self::extractJsonFromText($json);

            // Stage 3: Remove invisible/control characters.
            // Under /u the patterns must use \x{...} code point escapes.
            // preg_replace() returns null on invalid UTF-8, hence the ?? fallbacks.
            $cleaned = preg_replace('/[\x{0080}-\x{009F}]/u', '', $json) ?? $json;
            $cleaned = preg_replace('/[\x{200B}-\x{200F}]/u', '', $cleaned) ?? $cleaned;
            $cleaned = preg_replace('/\x{FEFF}/u', '', $cleaned) ?? $cleaned;
            $cleaned = preg_replace(
                '/[\x00-\x08\x0B\x0C\x0E-\x1F\x7F]/u',
                '',
                $cleaned
            ) ?? $cleaned;

            // Stage 4: Trim whitespace
            $cleaned = trim($cleaned);

            if (empty($cleaned)) {
                return null;
            }

            // Stage 5: Decode with safety flags
            return json_decode(
                json: $cleaned,
                associative: true,
                flags: JSON_THROW_ON_ERROR | JSON_INVALID_UTF8_IGNORE
            );
        } catch (\Throwable $e) {
            Log::warning('Failed to decode JSON', [
                'error' => $e->getMessage(),
                'input' => mb_substr($json, 0, 500),
            ]);
        }

        return null;
    }

    /**
     * Extract JSON from surrounding text by finding outermost matching brackets.
     */
    private static function extractJsonFromText(string $text): string
    {
        $firstBrace = strpos($text, '{');
        $firstBracket = strpos($text, '[');

        if ($firstBrace === false && $firstBracket === false) {
            return $text;
        }

        if ($firstBrace !== false
            && ($firstBracket === false || $firstBrace < $firstBracket)) {
            $startPos = $firstBrace;
            $openChar = '{';
            $closeChar = '}';
        } else {
            $startPos = $firstBracket;
            $openChar = '[';
            $closeChar = ']';
        }

        $depth = 0;
        $inString = false;
        $escapeNext = false;
        $endPos = false;
        $length = strlen($text);

        for ($i = $startPos; $i < $length; $i++) {
            $char = $text[$i];

            if ($escapeNext) {
                $escapeNext = false;
                continue;
            }

            if ($char === '\\' && $inString) {
                $escapeNext = true;
                continue;
            }

            if ($char === '"' && !$escapeNext) {
                $inString = !$inString;
                continue;
            }

            if ($inString) {
                continue;
            }

            if ($char === $openChar) {
                $depth++;
            } elseif ($char === $closeChar) {
                $depth--;
                if ($depth === 0) {
                    $endPos = $i;
                    break;
                }
            }
        }

        if ($endPos !== false) {
            return substr($text, $startPos, $endPos - $startPos + 1);
        }

        return $text;
    }
}
```

## Complementary Utility: Detecting Unwanted JSON
Sometimes the opposite problem occurs — you ask the LLM for plain text and it returns a JSON object. This helper detects and filters those cases:
```php
public static function removeJsonIfExists(?string $answer): ?string
{
    if (!$answer) {
        return null;
    }

    $trimmed = trim($answer);

    if (str_starts_with($trimmed, '{') && str_ends_with($trimmed, '}')) {
        $decoded = self::decode($answer);

        if ($decoded !== null) {
            Log::warning('AI returned JSON instead of plain text', [
                'json_response' => $decoded,
            ]);

            return null;
        }
    }

    return $answer;
}
```

This is useful in conversational AI pipelines where the model's response should be displayed directly to users. If it accidentally returns structured data instead of prose, this catches it before it reaches the UI.

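
The helper above only looks for a top-level object, so a bare array such as `[1, 2, 3]` would slip through. If that matters in your pipeline, a check along the following lines could cover both shapes. This is a standalone sketch under that assumption: `looksLikeJsonDocument()` is a hypothetical name, and it uses plain `json_decode()` rather than the recovery pipeline.

```php
<?php

/**
 * Return true when the whole answer parses as a JSON object or array.
 * Hypothetical standalone variant; it does not use the recovery pipeline.
 */
function looksLikeJsonDocument(string $answer): bool
{
    $trimmed = trim($answer);

    $isObject = str_starts_with($trimmed, '{') && str_ends_with($trimmed, '}');
    $isArray  = str_starts_with($trimmed, '[') && str_ends_with($trimmed, ']');

    if (!$isObject && !$isArray) {
        return false;
    }

    try {
        json_decode($trimmed, true, 512, JSON_THROW_ON_ERROR);

        return true;
    } catch (JsonException) {
        return false; // Bracketed prose, not JSON
    }
}

var_dump(looksLikeJsonDocument('{"key": "value"}'));       // bool(true)
var_dump(looksLikeJsonDocument('[1, 2, 3]'));              // bool(true)
var_dump(looksLikeJsonDocument('[see the note above]'));   // bool(false)
var_dump(looksLikeJsonDocument('Use {name} as a token.')); // bool(false)
```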
## Testing

Here are the key test cases for a comprehensive test suite:

```php
use PHPUnit\Framework\TestCase;

class JsonHelperTest extends TestCase
{
    public function test_decodes_clean_json(): void
    {
        $result = JsonHelper::decode('{"key": "value"}');
        $this->assertSame(['key' => 'value'], $result);
    }

    public function test_strips_markdown_fences(): void
    {
        $input = "```json\n{\"key\": \"value\"}\n```";
        $result = JsonHelper::decode($input);
        $this->assertSame(['key' => 'value'], $result);
    }

    public function test_extracts_from_surrounding_text(): void
    {
        $input = "Here's the result:\n{\"key\": \"value\"}\nHope this helps!";
        $result = JsonHelper::decode($input);
        $this->assertSame(['key' => 'value'], $result);
    }

    public function test_handles_braces_inside_strings(): void
    {
        $input = 'prefix {"msg": "use {name} here"} suffix';
        $result = JsonHelper::decode($input);
        $this->assertSame(['msg' => 'use {name} here'], $result);
    }

    public function test_removes_zero_width_spaces(): void
    {
        $input = "{\"key\":\xE2\x80\x8B \"value\"}";
        $result = JsonHelper::decode($input);
        $this->assertSame(['key' => 'value'], $result);
    }

    public function test_strips_utf8_bom(): void
    {
        $input = "\xEF\xBB\xBF{\"key\": \"value\"}";
        $result = JsonHelper::decode($input);
        $this->assertSame(['key' => 'value'], $result);
    }

    public function test_decodes_array_responses(): void
    {
        $input = 'The items are: [1, 2, 3]';
        $result = JsonHelper::decode($input);
        $this->assertSame([1, 2, 3], $result);
    }

    public function test_returns_null_for_empty_input(): void
    {
        $this->assertNull(JsonHelper::decode(null));
        $this->assertNull(JsonHelper::decode(''));
    }

    public function test_returns_null_for_invalid_json(): void
    {
        $this->assertNull(JsonHelper::decode('not json at all'));
    }

    public function test_remove_json_if_exists_passes_text(): void
    {
        $text = 'This is a normal response.';
        $this->assertSame($text, JsonHelper::removeJsonIfExists($text));
    }

    public function test_remove_json_if_exists_catches_json(): void
    {
        $this->assertNull(
            JsonHelper::removeJsonIfExists('{"key": "value"}')
        );
    }
}
```

**TIP:** The zero-width space test spells out the raw bytes (`\xE2\x80\x8B`). Double-quoted PHP strings also accept the code point escape `\u{200B}` (PHP 7.0+), but the byte form shows exactly what arrives on the wire and keeps the invisible character visible in your test file.
## Conclusion
The key insight behind this pipeline is that "parse and recover" beats "validate and reject" for AI integrations. When a human writes JSON, malformation usually means a bug that should fail loudly. When an LLM writes JSON, malformation is expected noise that should be cleaned up silently.
The 5-stage pipeline handles the most common cases:
- Markdown fences — regex extraction
- Surrounding text — string-aware bracket matching
- Invisible characters — targeted UTF-8 cleanup
- Whitespace — trim
- Residual issues — `JSON_INVALID_UTF8_IGNORE` as safety net
One important caveat: don't silently swallow failures in production. The catch block should always log the failed input so you can identify new malformation patterns. Today's edge case is tomorrow's common pattern.
For cases where the pipeline still fails — deeply malformed JSON, truncated responses from token limits, or structural errors like missing keys — the right fallback is to retry the LLM call with a corrective prompt that includes the error message. Recovery handles noise; retry handles failures.
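
A retry loop along those lines might look like the sketch below. Everything here is an assumption made for illustration: `decodeWithRetry()` and the `$callLlm` callable (prompt in, raw response out) are hypothetical, and plain `json_decode()` stands in for the recovery pipeline so that the error message is available to feed back to the model.

```php
<?php

/**
 * Ask the model for JSON and retry with the decode error on failure.
 *
 * $callLlm is a hypothetical callable (prompt in, raw response out);
 * wire it to whatever client your application uses.
 *
 * @param callable(string): string $callLlm
 */
function decodeWithRetry(callable $callLlm, string $prompt, int $maxAttempts = 3): ?array
{
    $currentPrompt = $prompt;

    for ($attempt = 1; $attempt <= $maxAttempts; $attempt++) {
        $response = $callLlm($currentPrompt);

        try {
            $decoded = json_decode($response, true, 512, JSON_THROW_ON_ERROR);
            if (is_array($decoded)) {
                return $decoded;
            }
            $error = 'Expected a JSON object or array.';
        } catch (JsonException $e) {
            $error = $e->getMessage();
        }

        // Feed the error back so the model can correct itself.
        $currentPrompt = $prompt
            . "\n\nYour previous reply could not be parsed as JSON ({$error}). "
            . 'Reply with valid JSON only.';
    }

    return null; // Every attempt failed; the caller decides what happens next
}

// Fake model for demonstration: truncated output first, valid JSON second.
$replies = ['{"intent": "book', '{"intent": "booking"}'];
$fakeLlm = static function (string $prompt) use (&$replies): string {
    return array_shift($replies);
};

$result = decodeWithRetry($fakeLlm, 'Extract the intent as JSON.');
var_dump($result === ['intent' => 'booking']); // bool(true)
```

Capping the attempts matters: each retry costs a full model call, and a response truncated by a token limit will usually be truncated again unless the prompt or limit changes.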

