CWE-135: Incorrect Calculation of Multi-Byte String Length

Overview

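Incorrect calculation of multi-byte string length occurs when code measures a string in the wrong unit, for example by using a byte-counting function (such as C's strlen) where a character count is needed, or a character count where a byte size is needed. With multi-byte or wide encodings (UTF-8, UTF-16, Shift-JIS) the two values differ, which leads to buffer overflows, truncation, or incorrect string operations. The mistake is especially dangerous when allocating buffers or copying strings.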
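
To make the gap between bytes and characters concrete, the short Python sketch below measures one string in three units. The sample text and variable names are chosen only for illustration.

```python
# Measure the same text in code points and in encoded bytes.
text = "Hello世界"

code_points = len(text)                      # a Python 3 str is measured in code points
utf8_bytes = len(text.encode("utf-8"))       # each CJK character here takes 3 bytes
sjis_bytes = len(text.encode("shift_jis"))   # each CJK character here takes 2 bytes

print(code_points, utf8_bytes, sjis_bytes)   # prints: 7 11 9
```

A limit or buffer size that is correct in one of these units is wrong in the other two.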

Risk

High: Miscalculating string lengths in multi-byte encodings can cause buffer overflows (when a buffer is sized in characters but filled with bytes), data truncation (cutting a character in the middle of its byte sequence), injection attacks (when incomplete or malformed sequences pass validation), or crashes when processing international text.

Remediation Steps

Core principle: Validate and compute string lengths using correct units (bytes vs chars) after canonicalization.

Identify Where Multi-Byte Strings Are Processed

Review the security findings to locate multi-byte string handling issues:

Find the vulnerable operation: Identify where string length is calculated or buffer is allocated
Determine encoding: Check if strings are UTF-8, UTF-16, Shift-JIS, or other multi-byte encoding
Trace the data flow: Review how string data flows through length calculations and buffer operations
Check buffer allocations: Find where buffers are sized based on string length

Use Encoding-Aware String Functions

Replace byte-counting functions with character-aware alternatives:

PHP: Use mb_strlen(), mb_substr() instead of strlen(), substr()
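Java: String.length() counts UTF-16 code units, so characters outside the Basic Multilingual Plane (such as most emoji) count as 2; use codePointCount() when you need code points
JavaScript: The length property also counts UTF-16 code units; iterate by code point (for example [...str].length) when you need code points
C/C++: Use wide character functions (wcslen) or Unicode libraries (ICU, libiconv); note that wcslen returns the number of wchar_t elements, not bytes
Python: Use native Unicode strings (in Python 3, len() and slicing operate on code points)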
Never assume 1 byte = 1 character: Always use encoding-aware functions
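
Length functions in UTF-16 based runtimes report code units, which is a third unit besides bytes and code points. The sketch below reproduces that count in Python so the difference is visible; the helper names are hypothetical.

```python
def utf16_code_units(text: str) -> int:
    """Number of UTF-16 code units, i.e. what a UTF-16 based length reports."""
    return len(text.encode("utf-16-le")) // 2


def code_points(text: str) -> int:
    """Number of Unicode code points in a Python 3 str."""
    return len(text)


sample = "a\U0001F600"  # 'a' plus an emoji outside the Basic Multilingual Plane
print(utf16_code_units(sample), code_points(sample))  # prints: 3 2
```

The emoji is stored as a surrogate pair in UTF-16, so a limit expressed in code units admits fewer emoji than the same limit expressed in code points.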

Validate Encoding and Character Boundaries

Ensure string operations respect character boundaries:

Validate UTF-8/UTF-16 sequences: Check that byte sequences are well-formed
Don't truncate mid-character: When cutting strings, ensure you're at a character boundary
Check for invalid sequences: Reject strings with invalid byte sequences
Use character-based operations: Count, slice, and manipulate by characters, not bytes
Apply normalization: Use Unicode normalization (NFC, NFD) when comparing strings
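
The checks above can be sketched in a few lines of Python. This is a minimal illustration, not a complete validator: decode_strict and truncate_utf8 are hypothetical helper names, and the truncation relies on the fact that UTF-8 continuation bytes always have the bit pattern 10xxxxxx.

```python
def decode_strict(raw: bytes) -> str:
    """Decode UTF-8, raising UnicodeDecodeError on any malformed sequence."""
    return raw.decode("utf-8", errors="strict")


def truncate_utf8(text: str, max_bytes: int) -> bytes:
    """Encode to UTF-8 and cut to at most max_bytes without splitting a character."""
    if max_bytes < 0:
        raise ValueError("max_bytes must not be negative")
    encoded = text.encode("utf-8")
    if len(encoded) <= max_bytes:
        return encoded
    cut = max_bytes
    # Back up while the byte at the cut position is a continuation byte (0b10xxxxxx),
    # so the cut always lands on the first byte of a character.
    while cut > 0 and (encoded[cut] & 0xC0) == 0x80:
        cut -= 1
    return encoded[:cut]
```

For example, truncate_utf8("Hello世界", 7) returns b"Hello" rather than keeping two stray bytes of the next character, and decode_strict(b"\xe4\xb8") raises instead of accepting the incomplete sequence.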

Allocate Buffers Based on Byte Length

When working with raw bytes (C/C++), allocate correctly:

Allocate by byte size: Use actual byte length, not character count
Account for null terminators: Add space for \0 terminator
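Use wide character types carefully: wchar_t arrays and std::wstring can hold Unicode text, but wchar_t is 16 bits on Windows and typically 32 bits elsewhere, and the byte size is the element count multiplied by sizeof(wchar_t)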
Validate byte length before operations: Check actual buffer size before copying
Track both counts: Keep track of both character count and byte length separately
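
The same sizing rules can be illustrated outside C. The Python sketch below copies text into a fixed-size byte buffer and checks the encoded byte length plus terminator before writing; copy_to_buffer is a hypothetical helper, and a bytearray stands in for a C buffer.

```python
def copy_to_buffer(text: str, capacity: int) -> bytearray:
    """Copy text as NUL-terminated UTF-8 into a buffer of `capacity` bytes.

    Raises ValueError instead of writing past the end of the buffer.
    """
    encoded = text.encode("utf-8")
    needed = len(encoded) + 1        # byte length plus one for the terminator
    if needed > capacity:
        raise ValueError(f"need {needed} bytes, buffer holds {capacity}")
    buffer = bytearray(capacity)     # zero-filled, so the terminator is already in place
    buffer[:len(encoded)] = encoded  # same-length slice assignment keeps the buffer size
    return buffer
```

With "Hello世界" the function needs 12 bytes even though the text has only 7 characters, so a capacity derived from the character count would be rejected here rather than overflowed.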

Use Unicode-Safe APIs and Libraries

Leverage modern Unicode support:

C++ Unicode strings: Use std::u8string (UTF-8, available since C++20), std::u16string (UTF-16), std::u32string (UTF-32); note that size() returns code units, not characters
ICU library: Use ICU (International Components for Unicode) for robust Unicode handling
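Python 3: Native Unicode strings are measured and indexed by code point, independent of any byte encoding
Java: String class uses UTF-16 internally; length() and charAt() work in code units, so use the code point methods (codePointCount, codePointAt) for supplementary characters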
Set encoding explicitly: Always specify character encoding when reading/writing files
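
As an illustration of setting the encoding explicitly, the sketch below writes and reads a file with the encoding named on both sides. The file name and scratch directory are hypothetical and exist only for the demonstration.

```python
import os
import tempfile

text = "caf\u00e9 世界"

with tempfile.TemporaryDirectory() as scratch_dir:  # hypothetical scratch location
    path = os.path.join(scratch_dir, "sample.txt")

    # Name the encoding instead of relying on the platform default.
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(text)

    byte_size = os.path.getsize(path)  # size on disk is in bytes, not characters

    # errors="strict" makes malformed input raise instead of being silently replaced.
    with open(path, "r", encoding="utf-8", errors="strict") as handle:
        round_tripped = handle.read()
```

The file size in bytes is larger than the character count of the text, which is why quotas and field limits need to state which unit they use.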

Test with Multi-Byte Characters

Verify the fix handles international text correctly:

Test with emoji: Try text containing emoji (multi-byte in UTF-8)
Test with Asian characters: Use Chinese, Japanese, Korean characters
Test with combining characters: Try characters with diacritics (accents)
Test with RTL text: Use Arabic or Hebrew (right-to-left)
Test boundary conditions: Verify truncation doesn't split characters
Test length calculations: Ensure inputs whose byte length differs from their character count are handled correctly
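
A small table-driven test can cover most of the cases above. The following is a sketch using Python's unittest; the case list and class name are illustrative, and a real suite would call the application's own length and truncation functions instead of len() directly.

```python
import unittest

# (label, text, expected code points, expected UTF-8 bytes)
CASES = [
    ("ascii", "Hello", 5, 5),
    ("cjk", "世界", 2, 6),
    ("emoji", "\U0001F600", 1, 4),
    ("combining", "e\u0301", 2, 3),             # 'e' followed by COMBINING ACUTE ACCENT
    ("rtl", "\u05e9\u05dc\u05d5\u05dd", 4, 8),  # four Hebrew letters
]


class LengthTests(unittest.TestCase):
    def test_counts(self):
        for label, text, points, size in CASES:
            with self.subTest(label):
                self.assertEqual(len(text), points)
                self.assertEqual(len(text.encode("utf-8")), size)

    def test_truncation_boundaries(self):
        # A character-based slice stays decodable after encoding.
        self.assertEqual("世界"[:1].encode("utf-8").decode("utf-8"), "世")
        # A byte-based cut after 4 bytes leaves a partial sequence that strict decoding rejects.
        with self.assertRaises(UnicodeDecodeError):
            "世界".encode("utf-8")[:4].decode("utf-8")
```

Saved to a file, the tests can be run with python -m unittest.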

Common Vulnerable Patterns

Using strlen() on UTF-8 strings in PHP

<?php
// User input: "Hello世界" (11 bytes, 7 characters)
$user_input = $_GET['text'];  // UTF-8 encoded

// Dangerous: strlen counts bytes, not characters
$length = strlen($user_input);  // Returns 11, not 7

// Wrong unit: the limit is meant as 10 characters, but it is applied to bytes
if ($length <= 10) {
    // Never reached for "Hello世界": the 7-character string is rejected as too long.
    // The opposite mix-up (a character count used to size a byte buffer) causes overflows
    save_to_database($user_input);
}

// Dangerous: substr cuts at byte position, may split character
$truncated = substr($user_input, 0, 7);  // Keeps only 2 of the 3 bytes of "世"

Using strlen() on UTF-8 strings in C

#include <string.h>

void process_utf8(const char *utf8_input) {
    // strlen counts bytes, not characters
    size_t len = strlen(utf8_input);

    // Sized in bytes, so the copy itself fits
    char buffer[len + 1];
    strcpy(buffer, utf8_input);

    // Dangerous: treats "10 characters" as byte offset 10.
    // Writes out of bounds when len < 10, and may split a UTF-8 character otherwise
    buffer[10] = '\0';
}

Other patterns to watch for:

Truncating at byte boundary, not character boundary
Assuming 1 character = 1 byte
Using [ ] indexing on multi-byte strings
Incorrect buffer allocation for multi-byte strings

Secure Patterns

Use multi-byte aware functions in PHP

<?php
$user_input = $_GET['text'];

// Correct: mb_strlen counts characters in UTF-8
$char_count = mb_strlen($user_input, 'UTF-8');
$byte_count = strlen($user_input);

// Validate both character and byte limits
if ($char_count > 100) {
    die('Input exceeds 100 characters');
}

if ($byte_count > 1000) {
    die('Input exceeds byte limit');
}

// Correct: mb_substr respects character boundaries
$truncated = mb_substr($user_input, 0, 10, 'UTF-8');

Why this works: mb_strlen() counts characters (code points) rather than bytes, correctly handling multi-byte UTF-8 sequences. mb_substr() cuts at character boundaries, never splitting a multi-byte character mid-sequence. Validating both character count and byte length prevents truncation and buffer issues.

Use ICU library for safe UTF-8 handling in C

#include <stdlib.h>
#include <unicode/ustring.h>
#include <unicode/utf16.h>

void process_utf8_safe(const char *utf8_input) {
    UErrorCode status = U_ZERO_ERROR;

    // Preflight: ask ICU how many UTF-16 code units the input needs
    int32_t utf16_len = 0;
    u_strFromUTF8(NULL, 0, &utf16_len, utf8_input, -1, &status);

    // U_BUFFER_OVERFLOW_ERROR is expected when preflighting with a NULL buffer
    if (U_FAILURE(status) && status != U_BUFFER_OVERFLOW_ERROR) {
        // Invalid UTF-8 sequence
        return;
    }

    UChar *utf16_buffer = (UChar*)malloc(((size_t)utf16_len + 1) * sizeof(UChar));
    if (utf16_buffer == NULL) {
        return;
    }

    status = U_ZERO_ERROR;
    u_strFromUTF8(utf16_buffer, utf16_len + 1, NULL,
                  utf8_input, -1, &status);

    if (U_SUCCESS(status)) {
        // u_strlen() would count UTF-16 code units; count code points instead
        int32_t char_count = u_countChar32(utf16_buffer, utf16_len);

        // Truncate after 10 code points without splitting a surrogate pair
        if (char_count > 10) {
            int32_t offset = 0;
            U16_FWD_N(utf16_buffer, offset, utf16_len, 10);
            utf16_buffer[offset] = 0;
        }
    }

    free(utf16_buffer);
}

Why this works: ICU validates UTF-8 while converting: u_strFromUTF8() fails on malformed input instead of passing it through. UTF-16 is itself a variable-width encoding, so the code counts code points with u_countChar32() rather than code units with u_strlen(), and uses U16_FWD_N to find the offset after the tenth code point so that a surrogate pair is never split. The conversion preserves all Unicode data without corruption.

Use native Unicode support in Python 3

# Python 3 strings are Unicode by default
# Assumption: request is a Flask-style request object
from flask import request

user_input = request.args.get('text', '')

# len() on a str counts code points, not bytes
char_count = len(user_input)
byte_count = len(user_input.encode('utf-8'))  # Counts bytes

if char_count > 100:
    raise ValueError('Input too long')

if byte_count > 1000:
    raise ValueError('Input exceeds byte limit')

# Slicing works on code points, so it never splits a UTF-8 sequence
truncated = user_input[:10]  # First 10 code points

Why this works: Python 3 strings are sequences of Unicode code points. len() returns the number of code points, not bytes, and slicing operates on code points, so it never splits a multi-byte sequence. It can still separate a base letter from a combining mark, because one user-perceived character may span several code points. Explicit encoding to bytes keeps character operations separate from byte operations.
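
Because len() counts code points, the same visible text can have different lengths depending on its normalization form. The sketch below shows this with the standard unicodedata module; the sample word is chosen only for illustration.

```python
import unicodedata

decomposed = "cafe\u0301"  # 'e' followed by COMBINING ACUTE ACCENT
composed = unicodedata.normalize("NFC", decomposed)

print(len(decomposed), len(composed))  # prints: 5 4
print(decomposed == composed)          # prints: False

# Normalizing both sides to the same form makes the comparison meaningful.
same = unicodedata.normalize("NFC", decomposed) == unicodedata.normalize("NFC", composed)
print(same)                            # prints: True

# A code point slice can still drop the accent from the decomposed form.
print(decomposed[:4])                  # prints: cafe
```

Normalizing input to one form before counting, comparing, or truncating keeps these results consistent.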