CWE-135: Incorrect Calculation of Multi-Byte String Length
Overview
Incorrect calculation of multi-byte string length occurs when code uses byte-counting functions (like C's strlen) on multi-byte character strings (UTF-8, UTF-16, Shift-JIS), leading to buffer overflows, truncation, or incorrect string operations. This is especially dangerous when allocating buffers or copying strings.
Risk
High: Miscalculating string lengths in multi-byte encodings can cause buffer overflows (when the buffer is sized in characters but filled with bytes), data truncation (cutting a character in the middle of its byte sequence), injection attacks (incomplete encoding validation), or crashes when processing international text.
Remediation Steps
Core principle: Validate and compute string lengths using correct units (bytes vs chars) after canonicalization.
Identify Where Multi-Byte Strings Are Processed
Review the security findings to locate multi-byte string handling issues:
- Find the vulnerable operation: Identify where string length is calculated or buffer is allocated
- Determine encoding: Check if strings are UTF-8, UTF-16, Shift-JIS, or other multi-byte encoding
- Trace the data flow: Review how string data flows through length calculations and buffer operations
- Check buffer allocations: Find where buffers are sized based on string length
Use Encoding-Aware String Functions
Replace byte-counting functions with character-aware alternatives:
- PHP: Use mb_strlen(), mb_substr() instead of strlen(), substr()
- Java/JavaScript: String.length() in Java and String.length in JavaScript both count UTF-16 code units, not bytes or user-perceived characters; use codePointCount() in Java, or iterate by code point in JavaScript, when supplementary characters such as emoji matter
- C/C++: Use wide character functions (wcslen) or Unicode libraries (ICU, libiconv)
- Python: Use native Unicode strings (len() on a Python 3 str counts code points, not bytes)
- Never assume 1 byte = 1 character: Always use encoding-aware functions
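The difference between these units can be illustrated with a minimal Python sketch. The helper name length_report is hypothetical and only exists for this illustration; it measures one string in code points, UTF-16 code units, and UTF-8 bytes.

```python
def length_report(text: str) -> dict:
    """Return the length of `text` in three different units (illustrative helper)."""
    return {
        # What Python 3's len() counts on a str
        "code_points": len(text),
        # What Java's String.length() and JavaScript's .length count
        "utf16_code_units": len(text.encode("utf-16-le")) // 2,
        # What C's strlen() or PHP's strlen() count on UTF-8 data
        "utf8_bytes": len(text.encode("utf-8")),
    }


print(length_report("Hello世界"))
print(length_report("😀"))
```

For "Hello世界" the three numbers are 7, 7, and 11; for a single emoji such as "😀" they are 1, 2, and 4, which is why a limit must always state its unit.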
Validate Encoding and Character Boundaries
Ensure string operations respect character boundaries:
- Validate UTF-8/UTF-16 sequences: Check that byte sequences are well-formed
- Don't truncate mid-character: When cutting strings, ensure you're at a character boundary
- Check for invalid sequences: Reject strings with invalid byte sequences
- Use character-based operations: Count, slice, and manipulate by characters, not bytes
- Apply normalization: Use Unicode normalization (NFC, NFD) when comparing strings
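The checks listed above could look roughly like the following Python sketch. The function names truncate_utf8 and same_text are hypothetical, and the sketch assumes the input arrives as raw UTF-8 bytes and that max_bytes is non-negative.

```python
import unicodedata


def truncate_utf8(data: bytes, max_bytes: int) -> bytes:
    """Validate `data` as UTF-8, then cut it to at most `max_bytes` bytes
    without splitting a multi-byte sequence."""
    # Strict decoding raises UnicodeDecodeError on malformed input
    data.decode("utf-8")
    if len(data) <= max_bytes:
        return data
    cut = max_bytes
    # UTF-8 continuation bytes have the bit pattern 10xxxxxx; back up
    # until `cut` points at the first byte of a character.
    while cut > 0 and (data[cut] & 0xC0) == 0x80:
        cut -= 1
    return data[:cut]


def same_text(a: str, b: str) -> bool:
    """Compare two strings after NFC normalization."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


raw = "Hello世界".encode("utf-8")
print(truncate_utf8(raw, 7))            # the cut backs up to the end of "Hello"
print(same_text("\u00e9", "e\u0301"))   # precomposed vs. combining form of é
```

Backing up to a lead byte keeps the result valid UTF-8, but it does not protect grapheme clusters such as a base letter followed by a combining accent; that needs a segmentation library such as ICU.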
Allocate Buffers Based on Byte Length
When working with raw bytes (C/C++), allocate correctly:
- Allocate by byte size: Use actual byte length, not character count
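- Account for null terminators: Add space for the \0 terminator
- Use wide character types: Use wchar_t arrays or std::wstring for Unicode (wchar_t is 16 bits on Windows, so characters outside the Basic Multilingual Plane still occupy two units there)
- Validate byte length before operations: Check actual buffer size before copying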
- Track both counts: Keep track of both character count and byte length separately
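As a rough illustration of byte-based allocation, the sketch below uses Python's ctypes module to build a NUL-terminated buffer of the kind a C function would receive. The helper name to_c_buffer is hypothetical, and UTF-8 is assumed as the target encoding.

```python
import ctypes


def to_c_buffer(text: str):
    """Copy `text` into a NUL-terminated C buffer sized by its byte length.

    Returns (buffer, character_count, byte_count) so both counts stay available.
    """
    if "\x00" in text:
        # An embedded NUL would silently shorten the string on the C side
        raise ValueError("embedded NUL character")
    encoded = text.encode("utf-8")
    size = len(encoded) + 1              # byte length plus the \0 terminator
    buf = ctypes.create_string_buffer(size)
    buf.value = encoded                  # raises ValueError if it does not fit
    return buf, len(text), len(encoded)


buf, char_count, byte_count = to_c_buffer("Hello世界")
print(char_count, byte_count, ctypes.sizeof(buf))
```

Sizing the buffer from len(text) instead of len(encoded) would allocate 8 bytes for a 12-byte payload here, which is exactly the mistake this section describes.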
Use Unicode-Safe APIs and Libraries
Leverage modern Unicode support:
- C++ Unicode strings: Use std::u8string (UTF-8, C++20), std::u16string (UTF-16), std::u32string (UTF-32)
- ICU library: Use ICU (International Components for Unicode) for robust Unicode handling
- Python 3: Native Unicode strings count and slice by code point
- Java: String stores UTF-16 code units and length() counts those units; use codePointCount() when supplementary characters matter
- Set encoding explicitly: Always specify character encoding when reading/writing files
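For the last point, a minimal Python sketch of explicit encodings on file I/O follows. The helper name write_then_read is hypothetical, and a temporary file stands in for whatever path the application actually uses.

```python
import os
import tempfile


def write_then_read(text: str, encoding: str = "utf-8"):
    """Write `text` with an explicit encoding, read it back with the same
    encoding, and return (round_tripped_text, size_on_disk_in_bytes)."""
    fd, path = tempfile.mkstemp(suffix=".txt")
    os.close(fd)
    try:
        # newline="" disables newline translation so the byte count is exact
        with open(path, "w", encoding=encoding, newline="") as f:
            f.write(text)
        size_on_disk = os.path.getsize(path)
        with open(path, "r", encoding=encoding, newline="") as f:
            round_tripped = f.read()
    finally:
        os.remove(path)
    return round_tripped, size_on_disk


print(write_then_read("Hello世界", "utf-8"))
print(write_then_read("Hello世界", "shift_jis"))
```

The same seven characters occupy a different number of bytes on disk depending on the encoding, so relying on a platform default makes any byte-based size check unpredictable.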
Test with Multi-Byte Characters
Verify the fix handles international text correctly:
- Test with emoji: Try text containing emoji (typically 4 bytes in UTF-8 and a surrogate pair in UTF-16)
- Test with Asian characters: Use Chinese, Japanese, Korean characters
- Test with combining characters: Try characters with diacritics (accents)
- Test with RTL text: Use Arabic or Hebrew (right-to-left)
- Test boundary conditions: Verify truncation doesn't split characters
- Test length calculations: Use inputs whose byte length differs from their character count and confirm each limit is checked in the intended unit
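A small set of test inputs along these lines might be sketched as follows in Python. The names SAMPLES, cut_is_on_boundary, and boundary_offsets are hypothetical, and the sample strings are only examples of each category.

```python
# One representative string per category from the checklist above
SAMPLES = {
    "emoji": "\U0001F600",          # 😀
    "cjk": "\u4e16\u754c",          # 世界
    "combining": "e\u0301",         # e followed by a combining acute accent
    "rtl": "\u05e9\u05dc\u05d5\u05dd",  # שלום
}


def cut_is_on_boundary(data: bytes, cut: int) -> bool:
    """True if slicing UTF-8 `data` at byte offset `cut` splits no character."""
    # A cut is unsafe only when the next byte is a continuation byte (10xxxxxx)
    return cut >= len(data) or (data[cut] & 0xC0) != 0x80


def boundary_offsets(text: str) -> list:
    """List every byte offset at which `text`, encoded as UTF-8, may be cut."""
    data = text.encode("utf-8")
    return [i for i in range(len(data) + 1) if cut_is_on_boundary(data, i)]


for name, text in SAMPLES.items():
    data = text.encode("utf-8")
    # Every sample has more bytes than code points
    assert len(data) != len(text), name
    # Every reported boundary yields valid UTF-8 when used as a cut point
    for offset in boundary_offsets(text):
        data[:offset].decode("utf-8")
    print(name, len(text), len(data), boundary_offsets(text))
```

Note that the combining sample reports offset 1 as a valid boundary: it separates code points cleanly but still detaches the accent from its base letter, so grapheme-level truncation needs a separate test.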
Common Vulnerable Patterns
Using strlen() on UTF-8 strings in PHP
<?php
// User input: "Hello世界" (11 bytes, 7 characters)
$user_input = $_GET['text']; // UTF-8 encoded
// Dangerous: strlen counts bytes, not characters
$length = strlen($user_input); // Returns 11, not 7
// Wrong unit: the limit is meant as 10 characters but is checked in bytes
if ($length <= 10) {
    // Not reached for "Hello世界": the 7-character input is rejected as if it had 11 characters
    save_to_database($user_input);
}
// The reverse mistake (limit checked in characters, storage sized in bytes) risks overflow or truncation
// Dangerous: substr cuts at a byte offset and can split a character
$truncated = substr($user_input, 0, 7); // Cuts inside 世, leaving invalid UTF-8
Using strlen() on UTF-8 strings in C
#include <string.h>
void process_utf8(const char *utf8_input) {
    // strlen counts bytes, not characters
    size_t len = strlen(utf8_input);
    // Sized in bytes, so the copy fits, but len is not a character count
    char buffer[len + 1];
    strcpy(buffer, utf8_input);
    // Dangerous: treats 10 as a character limit but indexes by byte;
    // may split a UTF-8 character, and writes out of bounds if len < 10
    buffer[10] = '\0';
}
Recurring mistakes in these patterns:
- Truncating at a byte boundary instead of a character boundary
- Assuming 1 character = 1 byte
- Using [] indexing on multi-byte strings as if each index were a character
- Sizing buffers by character count when the data is stored as bytes
Secure Patterns
Use multi-byte aware functions in PHP
<?php
$user_input = $_GET['text'];
// Correct: mb_strlen counts characters in UTF-8
$char_count = mb_strlen($user_input, 'UTF-8');
$byte_count = strlen($user_input);
// Validate both character and byte limits
if ($char_count > 100) {
    die('Input exceeds 100 characters');
}
if ($byte_count > 1000) {
    die('Input exceeds byte limit');
}
// Correct: mb_substr respects character boundaries
$truncated = mb_substr($user_input, 0, 10, 'UTF-8');
Why this works: mb_strlen() counts characters (code points) rather than bytes, correctly handling multi-byte UTF-8 sequences. mb_substr() cuts at character boundaries, never splitting a multi-byte character mid-sequence. Validating both character count and byte length prevents truncation and buffer issues.
Use ICU library for safe UTF-8 handling in C
#include <stdlib.h>
#include <unicode/ustring.h>
#include <unicode/utf16.h>
void process_utf8_safe(const char *utf8_input) {
    UErrorCode status = U_ZERO_ERROR;
    // Preflight: ask ICU how many UTF-16 code units the conversion needs
    int32_t utf16_len = 0;
    u_strFromUTF8(NULL, 0, &utf16_len, utf8_input, -1, &status);
    if (U_FAILURE(status) && status != U_BUFFER_OVERFLOW_ERROR) {
        // Invalid UTF-8 sequence
        return;
    }
    UChar *utf16_buffer = (UChar*)malloc((utf16_len + 1) * sizeof(UChar));
    if (utf16_buffer == NULL) {
        return;
    }
    status = U_ZERO_ERROR;
    u_strFromUTF8(utf16_buffer, utf16_len + 1, NULL,
                  utf8_input, -1, &status);
    if (U_SUCCESS(status)) {
        // u_strlen counts UTF-16 code units; u_countChar32 counts code points
        int32_t unit_count = u_strlen(utf16_buffer);
        // Truncate to at most 10 code units without splitting a surrogate pair
        if (unit_count > 10) {
            int32_t cut = 10;
            U16_SET_CP_START(utf16_buffer, 0, cut);
            utf16_buffer[cut] = 0;
        }
    }
    free(utf16_buffer);
}
Why this works: ICU validates UTF-8 sequences while converting to UTF-16. u_strFromUTF8() fails on invalid UTF-8, so malformed sequences never reach later processing. UTF-16 lengths are counted in code units, and a character outside the Basic Multilingual Plane takes two of them, so the truncation backs up to a code point boundary instead of splitting a surrogate pair. Use u_countChar32() when a code point count is needed.
Use native Unicode support in Python 3
from flask import request  # assumption: the snippet runs inside a Flask request handler
# Python 3 strings are Unicode by default
user_input = request.args.get('text', '')
# Correct: len() returns the number of code points
char_count = len(user_input) # Counts characters
byte_count = len(user_input.encode('utf-8')) # Counts bytes
if char_count > 100:
    raise ValueError('Input too long')
# Correct: slicing respects character boundaries
truncated = user_input[:10] # First 10 characters
Why this works: Python 3 strings are sequences of Unicode code points, so len() returns a code point count rather than a byte count, and slicing never splits a multi-byte sequence. A slice can still separate a base character from its combining marks, so user-perceived characters need a grapheme-aware library. Explicit encoding to bytes keeps character operations separate from byte operations.