
Streaming Engine

Shim’s streaming engine solves three core problems with LLM streaming outputs.

The Three Problems

1. Buffer Bottleneck

Problem: Most JSON parsers require the complete string, and waiting for the full output adds 2-5 seconds of latency.

Solution: Shim attempts JSON.parse() on the accumulated buffer after every chunk and returns a partial object as soon as the buffer is parseable.
// Chunk 1: '{"name": "Jo'
state.json_parseable = false;  // Can't parse yet
state.partial = null;

// Chunk 2: 'hn", "age": 30}'
state.json_parseable = true;   // Parseable now
state.partial = { name: "John", age: 30 };
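
The parse-on-every-chunk idea can be sketched as below. This is a minimal illustration, not Shim's implementation: `ParseState` and `pushChunk` are hypothetical names, and the sketch only reports an object once the accumulated buffer is valid JSON, as in the two-chunk example above.

```typescript
// Hypothetical sketch: names and shapes are illustrative, not Shim's API.
interface ParseState {
  buffer: string;
  json_parseable: boolean;
  partial: unknown;
}

// Append a chunk, then attempt JSON.parse on the whole accumulated buffer.
function pushChunk(state: ParseState, chunk: string): ParseState {
  const buffer = state.buffer + chunk;
  try {
    return { buffer, json_parseable: true, partial: JSON.parse(buffer) };
  } catch {
    // JSON.parse throws a SyntaxError on incomplete input; keep accumulating.
    return { buffer, json_parseable: false, partial: null };
  }
}

let state: ParseState = { buffer: "", json_parseable: false, partial: null };
state = pushChunk(state, '{"name": "Jo');
console.log(state.json_parseable); // false
state = pushChunk(state, 'hn", "age": 30}');
console.log(state.json_parseable); // true
```

Returning a fresh state object on each push keeps the sketch easy to test; a real session would more likely mutate its own fields.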

2. Markdown Fence Trap

Problem: LLMs often wrap JSON in markdown fences:
```json
{"data": "value"}
```
Parsers fail because of the fence.

Solution: Shim strips fences early in the pipeline:
// Input: '```json\n{"name": "John"}\n```'
// After fence removal: '{"name": "John"}'
Fence patterns detected:
  • ```json ... ```
  • ``` ... ```
  • Leading or trailing text outside the JSON body (for example, prose before the first { or [)
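
As a rough sketch of the fence removal step, assuming the whole input is wrapped in a single fence: `stripFences`, `FENCE`, and the regular expression are hypothetical, not Shim's actual pattern. The fence marker is built with `repeat` only to avoid writing three literal backticks inside this example.

```typescript
// Hypothetical helper; the pattern is an assumption about what fence removal covers.
const FENCE = "`".repeat(3); // three backticks

// Opening fence, optional language tag, captured body, closing fence.
const FENCED_BLOCK = new RegExp(
  "^" + FENCE + "[a-zA-Z]*\\s*([\\s\\S]*?)\\s*" + FENCE + "$",
);

function stripFences(input: string): string {
  const trimmed = input.trim();
  const match = FENCED_BLOCK.exec(trimmed);
  // Fall back to the trimmed input when no complete fence wraps it.
  return match?.[1] ?? trimmed;
}

console.log(stripFences(FENCE + 'json\n{"name": "John"}\n' + FENCE)); // {"name": "John"}
console.log(stripFences('{"name": "John"}')); // {"name": "John"}
```

This sketch only handles a fence that has already been closed. Mid-stream, the closing fence has not arrived yet, so a streaming implementation would also need to recognize an opening fence on its own.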

3. Numerical Gyrations

Problem: Partial numbers cause UI flicker:
// Chunk 1: '{"score": 0'
partial = { score: 0 };  // UI shows 0

// Chunk 2: '.5'
partial = { score: 0.5 }; // UI updates to 0.5 (flicker)
Solution: Shim detects incomplete tokens and holds them:
// Chunk 1: '{"score": 0'
state.incomplete_token = "0";  // Held
state.partial = null;           // Not emitted

// Chunk 2: '.5'
state.incomplete_token = "";    // Released
state.partial = { score: 0.5 }; // Emitted complete
Incomplete token patterns:
  • Trailing decimal: 0.
  • Leading decimal: .5
  • Partial escape: \u00
  • Partial string: "incomplete
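
One way to picture the hold behavior from the consumer side is a small gate that forwards a partial only when nothing is being held. `HeldState` and `emitIfStable` are hypothetical names for this sketch; the field names follow the example above.

```typescript
// Hypothetical shapes; field names follow the example above.
interface HeldState {
  incomplete_token: string; // "" when nothing is being held
  partial: unknown;         // null when nothing is ready to emit
}

// Forward a partial to the UI only when no token is being held back.
function emitIfStable(
  state: HeldState,
  onUpdate: (partial: unknown) => void,
): boolean {
  if (state.incomplete_token !== "" || state.partial === null) {
    return false; // held: the UI keeps showing its previous value
  }
  onUpdate(state.partial);
  return true;
}

const seen: unknown[] = [];
emitIfStable({ incomplete_token: "0", partial: null }, (p) => seen.push(p));
emitIfStable({ incomplete_token: "", partial: { score: 0.5 } }, (p) => seen.push(p));
console.log(seen.length); // 1
```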

Architecture

Session State Machine

START → ACCUMULATING → PARSEABLE → FINALIZED
  ↓          ↓             ↓            ↓
buffer   structural    partial      repaired
empty    complete      object       + metadata
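
The four states in the diagram can be modeled as a string union with a function that derives the current phase from a few flags. This is an illustrative reading of the diagram, not Shim's internal representation; `SessionPhase`, `PhaseInput`, and `nextPhase` are hypothetical.

```typescript
// Hypothetical model of the diagram above.
type SessionPhase = "START" | "ACCUMULATING" | "PARSEABLE" | "FINALIZED";

interface PhaseInput {
  bufferEmpty: boolean;   // nothing pushed yet
  jsonParseable: boolean; // the buffer currently parses as JSON
  finalized: boolean;     // finalize() has been called
}

// Derive the phase; later stages take precedence over earlier ones.
function nextPhase(input: PhaseInput): SessionPhase {
  if (input.finalized) return "FINALIZED";
  if (input.bufferEmpty) return "START";
  return input.jsonParseable ? "PARSEABLE" : "ACCUMULATING";
}

console.log(nextPhase({ bufferEmpty: true, jsonParseable: false, finalized: false }));  // START
console.log(nextPhase({ bufferEmpty: false, jsonParseable: false, finalized: false })); // ACCUMULATING
console.log(nextPhase({ bufferEmpty: false, jsonParseable: true, finalized: false }));  // PARSEABLE
```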

State Transitions

// 1. START
session = new StreamingRepairSession();
// buffer: ""
// brace_depth: 0
// in_string: false

// 2. Push: '{"name": "Jo'
session.push('{"name": "Jo');
// buffer: '{"name": "Jo'
// brace_depth: 1
// in_string: true
// structurally_complete: false

// 3. Push: 'hn", "age": 30}'
session.push('hn", "age": 30}');
// buffer: '{"name": "John", "age": 30}'
// brace_depth: 0
// in_string: false
// structurally_complete: true
// json_parseable: true
// partial: { name: "John", age: 30 }

// 4. Finalize
result = session.finalize();
// success: true
// repaired: { name: "John", age: 30 }
// metadata.confidence: "high"

Key Algorithms

Structural Completeness Check

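// A buffer is structurally complete when every bracket is closed, no string
// is open, no token is being held, and at least some content has arrived.
function isStructurallyComplete(state): boolean {
  return (
    state.brace_depth === 0 &&        // every { and [ has been closed
    !state.in_string &&               // not inside a string literal
    state.incomplete_token === '' &&  // no held-back token
    state.buffer.trim().length > 0    // buffer is not empty or whitespace
  );
}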

Incomplete Token Detection

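function detectIncompleteToken(buffer: string): string {
  // Trailing decimal: "0."
  const decimal = buffer.match(/\d\.$/);
  if (decimal) return decimal[0];

  // Partial escape: "\u00"
  const escape = buffer.match(/\\u[0-9a-fA-F]{0,3}$/);
  if (escape) return escape[0];

  // Unclosed string: '"text'
  // Track the last opening quote by scanning. A plain /"[^"]*$/ test would
  // also match the text after a closing quote, e.g. '": 30}' in complete JSON.
  let open = -1;
  for (let i = 0; i < buffer.length; i++) {
    if (open !== -1 && buffer[i] === '\\') { i++; continue; } // skip escaped char
    if (buffer[i] === '"') open = open === -1 ? i : -1;
  }

  return open === -1 ? '' : buffer.slice(open);
}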

Brace Tracking

function updateBraceDepth(char: string, state): void {
  if (state.in_string) {
    return; // Ignore braces in strings
  }

  if (char === '{' || char === '[') {
    state.brace_depth++;
  }

  if (char === '}' || char === ']') {
    state.brace_depth--;
  }
}
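
`updateBraceDepth` relies on `state.in_string` being kept up to date elsewhere. A minimal sketch of how a scanner might maintain it, including escaped quotes, is shown below. `ScanState`, its `escaped` field, and `scanChunk` are assumptions made for this example, not part of Shim's documented state.

```typescript
// Hypothetical scanner state; `escaped` is an extra field assumed for this sketch.
interface ScanState {
  brace_depth: number;
  in_string: boolean;
  escaped: boolean;
}

// Feed one chunk through the scanner, updating string and depth tracking.
function scanChunk(chunk: string, state: ScanState): void {
  for (let i = 0; i < chunk.length; i++) {
    const char = chunk.charAt(i);
    if (state.in_string) {
      if (state.escaped) {
        state.escaped = false;   // this character was escaped; ignore it
      } else if (char === "\\") {
        state.escaped = true;    // the next character is escaped
      } else if (char === '"') {
        state.in_string = false; // closing quote
      }
      continue;                  // braces inside strings are ignored
    }
    if (char === '"') state.in_string = true;
    else if (char === "{" || char === "[") state.brace_depth++;
    else if (char === "}" || char === "]") state.brace_depth--;
  }
}

const scan: ScanState = { brace_depth: 0, in_string: false, escaped: false };
scanChunk('{"note": "a } inside \\"quotes\\"', scan);
console.log(scan.brace_depth, scan.in_string); // 1 true
scanChunk('"}', scan);
console.log(scan.brace_depth, scan.in_string); // 0 false
```

Because the state object persists between calls, a string or escape sequence that is split across two chunks is still tracked correctly.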

Performance Characteristics

Time Complexity

| Operation | Complexity | Notes |
| --- | --- | --- |
| Push chunk | O(n) | Linear scan of chunk (n = chunk size) |
| Parse attempt | O(m) | m = buffer size |
| Finalize | O(m) | Full repair pipeline |
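
Since each parse attempt is O(m) in the current buffer size, attempting one parse per push means the cumulative parse work grows faster than the output itself. The back-of-envelope count below assumes equal-size chunks and that every attempt scans the whole buffer; both are assumptions for illustration, not measured Shim behavior, and `totalCharsScanned` is a hypothetical helper.

```typescript
// Hypothetical estimate: characters scanned by parse attempts over a session,
// assuming equal-size chunks and a full-buffer scan on every push.
function totalCharsScanned(totalSize: number, chunkSize: number): number {
  let scanned = 0;
  for (let buffered = chunkSize; buffered <= totalSize; buffered += chunkSize) {
    scanned += buffered; // one parse attempt over the buffer so far
  }
  return scanned;
}

// 10 chunks of 100 characters: 100 + 200 + ... + 1000
console.log(totalCharsScanned(1000, 100)); // 5500
```

With k pushes of c characters each, the sum is c·k(k+1)/2, so smaller chunks mean more total parse work for the same output size.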

Memory Usage

  • Buffer: O(m) where m = total output size
  • State: O(1) constant overhead
  • Circuit breaker: 5MB max buffer

Throughput

  • Chunks/sec: 1M+ per Worker
  • Latency/push: <1ms average
  • Parse attempts: 1 per push

Safety Features

Circuit Breaker

Terminates a session once its buffer exceeds 5MB:
if (buffer.length > MAX_BUFFER_SIZE) {
  throw {
    code: 'BUFFER_SIZE_EXCEEDED',
    message: 'Possible hallucination loop',
    severity: 'critical'
  };
}
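
A self-contained version of this check might look like the sketch below. The constant's value is an assumption (5MB expressed as a character count; a string's `length` counts UTF-16 code units rather than bytes), and `StreamError`, `streamError`, and `appendWithLimit` are hypothetical names.

```typescript
// Assumed limit: 5MB as a character count (length is UTF-16 code units, not bytes).
const MAX_BUFFER_SIZE = 5 * 1024 * 1024;

// Hypothetical error shape carrying the fields shown above.
interface StreamError extends Error {
  code: string;
  severity: string;
}

function streamError(code: string, message: string, severity: string): StreamError {
  return Object.assign(new Error(message), { code, severity });
}

// Append a chunk, refusing to grow the buffer past the limit.
function appendWithLimit(buffer: string, chunk: string, limit: number = MAX_BUFFER_SIZE): string {
  const next = buffer + chunk;
  if (next.length > limit) {
    throw streamError("BUFFER_SIZE_EXCEEDED", "Possible hallucination loop", "critical");
  }
  return next;
}

try {
  appendWithLimit("aaaa", "bbbb", 6); // tiny limit to trigger the breaker
} catch (err) {
  console.log((err as StreamError).code); // BUFFER_SIZE_EXCEEDED
}
```

Throwing a real `Error` with extra fields, rather than a bare object, preserves the stack trace while keeping the same `code` and `severity` properties.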

Session Expiration

Sessions expire after 60 seconds:
if (Date.now() - session.createdAt > 60000) {
  throw {
    code: 'SESSION_NOT_FOUND',
    message: 'Session expired',
    severity: 'critical'
  };
}
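
The expiry check could be wrapped in a lookup helper along these lines. The in-memory `Map` store, `SessionRecord`, and `getLiveSession` are hypothetical; the 60-second limit and the error fields come from the snippet above. The `now` parameter is injectable so the check can be exercised without waiting.

```typescript
const SESSION_TTL_MS = 60000; // 60 seconds, per the text above

// Hypothetical session record and store.
interface SessionRecord {
  createdAt: number; // epoch milliseconds
  buffer: string;
}

function isExpired(session: SessionRecord, now: number = Date.now()): boolean {
  return now - session.createdAt > SESSION_TTL_MS;
}

// Return a live session or throw, removing stale entries along the way.
function getLiveSession(
  sessions: Map<string, SessionRecord>,
  id: string,
  now: number = Date.now(),
): SessionRecord {
  const session = sessions.get(id);
  if (!session || isExpired(session, now)) {
    sessions.delete(id); // drop stale entries so they do not accumulate
    throw Object.assign(new Error("Session expired"), {
      code: "SESSION_NOT_FOUND",
      severity: "critical",
    });
  }
  return session;
}

const sessions = new Map<string, SessionRecord>();
sessions.set("s1", { createdAt: 0, buffer: "" });
console.log(getLiveSession(sessions, "s1", 30000).createdAt); // 0
```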

Junk Seek

Ignores any data before the first { or [:
const jsonStart = buffer.search(/[{\[]/);

if (jsonStart > 0) {
  buffer = buffer.substring(jsonStart);
}
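
Wrapped as a function, the same seek makes the two edge cases explicit: `search` returns -1 when no opening bracket has arrived yet, and 0 when the buffer already starts with JSON. In both cases the buffer is left unchanged. `seekJsonStart` is a hypothetical name for this sketch.

```typescript
// Hypothetical wrapper around the seek shown above.
function seekJsonStart(buffer: string): string {
  const jsonStart = buffer.search(/[{\[]/);
  // -1: no { or [ seen yet; 0: buffer already starts with JSON.
  return jsonStart > 0 ? buffer.substring(jsonStart) : buffer;
}

console.log(seekJsonStart('Sure! Here is the data: {"ok": true}')); // {"ok": true}
console.log(seekJsonStart("no json yet")); // no json yet
```

Note that a bracket appearing in the leading prose itself (for example "[Note]") would be treated as the start of the JSON, so this heuristic assumes the preamble contains no { or [.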

Comparison: Streaming vs Batch

| Metric | Streaming | Batch |
| --- | --- | --- |
| Latency | <1ms per chunk | 2-10ms total |
| Memory | O(m) buffer | O(m) input |
| Use case | Real-time UIs | Completed outputs |
| Complexity | Stateful sessions | Stateless calls |
| Preview | Yes (partial) | No |

When to Use Streaming

Use Streaming When:
  • LLM is streaming tokens
  • You want real-time UI updates
  • Output size is large (>1MB)
  • You need progress indicators
Use Batch When:
  • You have the full output
  • Real-time updates aren’t needed
  • Output is small (<100KB)
  • You want simplicity

Next Steps

  • Streaming Guide: Best practices for streaming
  • Streaming API: API reference
  • Confidence Levels: Understand confidence scoring
  • TypeScript SDK: Use the official SDK