CWE-183: Permissive List of Allowed Inputs - Python

Overview

Python-specific guidance for implementing strict input validation using regular expressions, sets, and path manipulation functions.

Primary Defence: Match the entire input with re.fullmatch() and fully anchored patterns (in Python, $ on its own still accepts a trailing newline), validate structured inputs with purpose-built standard-library tools such as pathlib.Path.resolve() for file paths and the ipaddress module for IP addresses, and enforce strict length limits so that no part of the input escapes validation.

Common Vulnerable Patterns

Unanchored Regular Expressions

import re

def validate_email(email):
    # VULNERABLE - no anchors, allows extra content
    # Attacker: "valid@example.com<script>alert(1)</script>"
    if re.match(r'[\w.-]+@[\w.-]+', email):
        return True  # Matched prefix, ignores suffix!
    return False
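
The effect of the missing anchors can be seen by running the vulnerable pattern against the attacker string from the comment above. This is a small illustrative sketch; the variable names are placeholders, not part of any real application.

```python
import re

PATTERN = r'[\w.-]+@[\w.-]+'
payload = 'valid@example.com<script>alert(1)</script>'

# re.match() only requires the pattern to match at the START of the string,
# so everything after the matched prefix is silently ignored.
prefix_match = re.match(PATTERN, payload)
print(prefix_match.group(0))

# re.fullmatch() requires the WHOLE string to match, so the payload is rejected.
whole_match = re.fullmatch(PATTERN, payload)
print(whole_match)
```

The first call reports only the leading address, which is exactly the part the validator "saw"; the trailing markup was never examined.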

Permissive File Extension Check

def validate_filename(filename):
    # VULNERABLE - just checks if extension appears anywhere
    # Attacker: "malware.exe.jpg", "file.jpg.php"
    if re.search(r'\.(jpg|png|gif)', filename):
        return True
    return False
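
To make the gap concrete, the sketch below compares the unanchored search with a whole-name match on a few sample names. The filenames and the `results` dictionary are illustrative assumptions only.

```python
import re

LOOSE = r'\.(jpg|png|gif)'                   # unanchored, as in the check above
STRICT = r'[a-zA-Z0-9_-]+\.(jpg|png|gif)'    # intended to describe the whole name

results = {}
for name in ['photo.jpg', 'file.jpg.php', 'shell.gif.exe']:
    loose_ok = re.search(LOOSE, name) is not None
    strict_ok = re.fullmatch(STRICT, name) is not None
    results[name] = (loose_ok, strict_ok)
    print(f'{name!r}: search={loose_ok} fullmatch={strict_ok}')
```

Only `photo.jpg` survives the whole-name match; the unanchored search accepts all three.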

Path Traversal Allowed

def get_file(filename):
    # VULNERABLE - allows path traversal
    # Attacker: "../../../etc/passwd"
    allowed_chars = re.compile(r'^[a-zA-Z0-9._/-]+$')
    if allowed_chars.match(filename):
        return open(f'/var/data/{filename}')  # Path traversal!
    return None
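
The following sketch shows why a character allowlist alone does not stop traversal: every character in the attacker string is permitted, yet the joined path leaves the base directory once it is normalized. It uses `posixpath` so the result is the same on any platform; the base directory is the one from the example above.

```python
import posixpath
import re

BASE = '/var/data'
requested = '../../../etc/passwd'

# The allowlist from the vulnerable example accepts the input...
passes_allowlist = re.match(r'^[a-zA-Z0-9._/-]+$', requested) is not None

# ...but the path it produces points outside the base directory.
joined = posixpath.join(BASE, requested)
normalized = posixpath.normpath(joined)

print(passes_allowlist)
print(joined)
print(normalized)
```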

Secure Patterns

Strict Email Validation

This validator accepts only the simple form local@domain.tld. The full RFC 5322 address grammar is far more complex.

import re

def validate_email(email):
    # Check length before regex
    if len(email) > 254:
        return False

    # Anchored regex describes the entire address
    pattern = r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$'

    # fullmatch() rejects a trailing newline, which re.match() with $ would accept
    if not re.fullmatch(pattern, email):
        return False

    # Additional semantic checks
    local, domain = email.split('@')
    if len(local) > 64:  # RFC 5321 limit
        return False

    return True

Why this works: The pattern r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$' enforces a simple structure of local part, @ symbol, domain, and TLD, and re.fullmatch() requires the entire string to match it, so an address embedded in a larger string (like "user@example.com<script>alert(1)</script>") is rejected. The ^ and $ anchors alone are not quite enough in Python: $ also matches just before a trailing newline, so re.match() would accept "user@example.com\n". The 254-character limit follows the commonly cited maximum address length derived from RFC 5321 and bounds the work the regex engine has to do on hostile input. The local part length check (64 characters) enforces the RFC 5321 limit. The pattern requires a TLD of at least two letters (.co, .uk), which catches many typos but also rejects some addresses that RFC 5322 allows, such as quoted local parts. That trade-off is deliberate: full RFC 5322 compliance is extremely complex and rarely needed for web applications.
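
One Python-specific detail about the `$` anchor is worth demonstrating: it matches at the end of the string and also just before a final newline. The sketch below runs the same pattern through `re.match()`, `re.fullmatch()`, and a variant ending in `\Z`; the candidate string is an illustrative assumption.

```python
import re

candidate = 'user@example.com\n'   # note the trailing newline
pattern = r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$'

# $ matches just before the final newline, so match() accepts the input.
accepted_by_match = re.match(pattern, candidate) is not None

# fullmatch() requires the match to cover every character, newline included.
accepted_by_fullmatch = re.fullmatch(pattern, candidate) is not None

# \Z matches only at the very end of the string.
pattern_z = pattern[:-1] + r'\Z'
accepted_by_z = re.match(pattern_z, candidate) is not None

print(accepted_by_match, accepted_by_fullmatch, accepted_by_z)
```

Either `re.fullmatch()` or a `\Z` terminator closes the gap; `re.fullmatch()` is usually the easier habit to keep.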

Strict Filename Validation

import re

def validate_filename(filename):
    # Length check
    if len(filename) > 255:
        return False

    # Anchored pattern - the whole name must END with an allowed extension
    pattern = r'^[a-zA-Z0-9_-]+\.(jpg|png|gif)$'

    # re.ASCII keeps IGNORECASE from matching non-ASCII look-alike letters
    if not re.fullmatch(pattern, filename, re.IGNORECASE | re.ASCII):
        return False

    # Additional security checks
    if '..' in filename or '/' in filename or '\\' in filename:
        return False

    return True

Why this works: The pattern r'^[a-zA-Z0-9_-]+\.(jpg|png|gif)$' must match the whole filename, so the name has to end with an allowed extension: "file.jpg.php" is rejected because it ends in .php. Because the name part [a-zA-Z0-9_-] excludes dots, multi-extension names such as "malware.exe.jpg" are rejected as well, along with separators and shell metacharacters that could be used for path traversal or command injection. re.fullmatch() is used instead of re.match() because $ would also accept a trailing newline. The length check keeps names within the 255-character limit common to most filesystems and bounds the input handed to the regex. The explicit checks for .., / and \ provide defense-in-depth against path traversal, even though the regex already rejects them. re.IGNORECASE lets legitimate names such as "file.JPG" pass, and re.ASCII restricts that case folding to ASCII letters.
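
A caveat with `re.IGNORECASE` on `str` patterns: Python's `re` documentation notes that `[a-z]` then also matches a handful of non-ASCII letters, including the Kelvin sign (U+212A). The sketch below shows the difference that `re.ASCII` makes; the filename is a made-up example.

```python
import re

KELVIN = '\u212a'                      # KELVIN SIGN, renders like an ASCII 'K'
name = f'pic{KELVIN}.jpg'
pattern = r'[a-zA-Z0-9_-]+\.(jpg|png|gif)'

# Unicode-aware case folding lets the look-alike character through.
unicode_ci = re.fullmatch(pattern, name, re.IGNORECASE) is not None

# With re.ASCII, only the 52 ASCII letters are considered.
ascii_ci = re.fullmatch(pattern, name, re.IGNORECASE | re.ASCII) is not None

print(unicode_ci, ascii_ci)
```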

Path Validation with Canonicalization

import os

def get_file(filename):
    # Best: strict allowlist of specific files
    allowed_files = {
        'report.pdf',
        'data.csv',
        'summary.txt'
    }

    if filename not in allowed_files:
        raise ValueError('File not allowed')

    # Resolve symlinks, normalize, and verify the result is within the allowed directory
    base_dir = os.path.realpath('/var/data')
    file_path = os.path.realpath(os.path.join(base_dir, filename))

    # Ensure resolved path is within base directory
    if not file_path.startswith(base_dir + os.sep):
        raise ValueError('Path traversal detected')

    return open(file_path)

Why this works: The allowlist approach with a set of specific filenames provides the strongest security by explicitly defining which files can be accessed, blocking any unauthorized file requests. os.path.realpath() resolves symbolic links and normalizes the path (removing ., .., and redundant separators), which defeats traversal attempts such as "../../etc/passwd" or a symlink that points outside the directory. os.path.abspath() only normalizes the path and leaves symlinks unresolved, so it is not sufficient on its own. The startswith() check with base_dir + os.sep ensures the canonical path remains within the base directory; the trailing separator matters, because without it a sibling directory such as /var/data_private would also pass. Using os.sep keeps the check correct on both Windows (\) and Unix (/). This defense-in-depth approach combines allowlisting, canonicalization, and boundary checking.
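
The role of the trailing separator in the boundary check can be illustrated with a pure string comparison. This is a sketch only: `is_inside` is a hypothetical helper, it uses `posixpath` so the output does not depend on the host platform, and it does not touch the filesystem (so it does not resolve symlinks).

```python
import posixpath

def is_inside(base_dir, candidate, require_separator=True):
    """Return True if the normalized candidate starts with the base prefix."""
    base = posixpath.normpath(base_dir)
    path = posixpath.normpath(candidate)
    prefix = base + '/' if require_separator else base
    return path.startswith(prefix)

sibling = '/var/data_private/keys.txt'

# Without the separator, a sibling directory sharing the prefix slips through.
print(is_inside('/var/data', sibling, require_separator=False))

# With the separator, only paths under /var/data/ are accepted.
print(is_inside('/var/data', sibling))
print(is_inside('/var/data', '/var/data/report.pdf'))
print(is_inside('/var/data', '/var/data/../secret'))
```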

Username Validation

import re

def validate_username(username):
    # Length check
    if not username or len(username) > 20:
        return False

    # Strict pattern: ASCII letters, numbers, underscore only
    pattern = r'^[A-Za-z0-9_]{3,20}$'

    # Match the raw input: lowercasing first would let look-alike characters through
    if not re.fullmatch(pattern, username):
        return False

    # Reject reserved names (case-insensitive)
    reserved = {'admin', 'root', 'system', 'administrator'}
    if username.lower() in reserved:
        return False

    return True

Why this works: re.fullmatch() with the pattern r'^[A-Za-z0-9_]{3,20}$' requires the entire string to consist of 3-20 ASCII letters, digits, and underscores, so input such as "admin'; DROP TABLE users--" or a name with a trailing newline cannot pass on the strength of a valid substring. The pattern is applied to the raw input rather than to username.lower(), because lowercasing maps some non-ASCII characters (for example the Kelvin sign, U+212A) onto ASCII letters and would let them slip through. Case-insensitivity is handled where it matters: the reserved-name check compares the lowercased name against a set, which gives O(1) lookups and stops users from registering variants of privileged account names such as Admin or ROOT. The explicit length check rejects oversized input before the regex runs. This layered approach validates format, length, and semantic constraints.
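
Lowercasing input before matching has a Unicode pitfall that is easy to overlook: `str.lower()` can turn a non-ASCII character into an ASCII one. The sketch below compares validating the lowercased string with validating the raw string; the sample username is invented for illustration.

```python
import re

KELVIN = '\u212a'                 # KELVIN SIGN; its lowercase form is ASCII 'k'
spoofed = f'mar{KELVIN}'          # renders like "marK"

# Matching the lowercased copy accepts the non-ASCII input.
lowered_ok = re.fullmatch(r'[a-z0-9_]{3,20}', spoofed.lower()) is not None

# Matching the raw input against an explicit ASCII class rejects it.
raw_ok = re.fullmatch(r'[A-Za-z0-9_]{3,20}', spoofed) is not None

print(lowered_ok, raw_ok)
```

If the application stores the original string after validating the lowercased copy, two visually identical usernames can end up as distinct accounts.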

URL Validation

from urllib.parse import urlparse
import ipaddress

def validate_url(url):
    try:
        parsed = urlparse(url)

        # Strict: only allow http and https
        if parsed.scheme not in ('http', 'https'):
            return False

        # Validate host exists
        if not parsed.netloc:
            return False

        # Extract hostname (remove port if present)
        hostname = parsed.hostname
        if not hostname:
            return False

        # Optional: reject private/loopback addresses
        try:
            ip = ipaddress.ip_address(hostname)
            if ip.is_private or ip.is_loopback:
                return False
        except ValueError:
            # Not an IP address, that's okay
            pass

        return True
    except Exception:
        return False

Why this works: Python's urlparse() function splits the URL into components so that each can be checked separately. It is lenient and rarely raises on malformed input, so the checks that follow do the actual validation. Validating parsed.scheme against a tuple of allowed protocols blocks dangerous schemes like javascript:, data:, file:, or ftp: that could enable XSS or local file access attacks. Checking for a non-empty netloc and hostname rejects URLs like http:// that have a valid scheme but no destination. The ipaddress module check rejects literal private and loopback addresses (192.168.x.x, 10.x.x.x, 172.16-31.x.x, 127.0.0.1), which reduces SSRF exposure. It does not cover hostnames such as localhost or DNS names that resolve to internal addresses, so the resolved address should be checked again when the request is actually made. Using a broad exception handler ensures that any parsing or validation error results in rejection, following a fail-secure pattern. This approach is much safer than regex-based URL validation, which is prone to bypasses.
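
Because hostnames are resolved only when a request is made, the address classification is usually applied to each resolved IP as well. The helper below is a minimal sketch of that classification step, assuming the caller obtains the addresses elsewhere (for example from `socket.getaddrinfo`); the name `is_internal_address` is hypothetical, and the set of categories treated as internal is a policy choice.

```python
import ipaddress

def is_internal_address(address):
    """Return True if a textual IP address should be treated as internal."""
    ip = ipaddress.ip_address(address)
    # Unwrap IPv4-mapped IPv6 (e.g. ::ffff:127.0.0.1) before classifying.
    if ip.version == 6 and ip.ipv4_mapped is not None:
        ip = ip.ipv4_mapped
    return (
        ip.is_private
        or ip.is_loopback
        or ip.is_link_local
        or ip.is_multicast
        or ip.is_reserved
        or ip.is_unspecified
    )

for addr in ['127.0.0.1', '10.1.2.3', '169.254.169.254', '::1', '8.8.8.8']:
    print(addr, is_internal_address(addr))
```

A request should be refused if any address the hostname resolves to is classified as internal, and the connection should be made to the address that was checked.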

Enum-Based Validation

def validate_role(role):
    # Best practice: use set for known values
    allowed_roles = {'user', 'moderator', 'admin'}

    # Exact match only (case-insensitive)
    return role.lower() in allowed_roles

# Alternative: use Enum
from enum import Enum

class Role(Enum):
    USER = 'user'
    MODERATOR = 'moderator'
    ADMIN = 'admin'

def validate_role_enum(role):
    try:
        Role(role.lower())
        return True
    except ValueError:
        return False

Why this works: A Python set provides O(1) average-case lookups (list membership is O(n)) and makes the allowlist explicit: a value is accepted only if it is exactly one of the listed strings. Converting input to lowercase enables case-insensitive matching while keeping the value check strict. The Enum alternative adds type safety and IDE support: constructing an enum from a value that does not exist raises ValueError, which can be caught for validation. Because there is no pattern matching, there is no partial match to exploit; the value either exists in the allowed set or it does not. Enums are also more maintainable because the IDE can detect typos, refactoring tools work correctly, and the values are defined in one location.
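
A variation on the Enum approach is to return the Enum member itself rather than a boolean, so later code works with the validated value instead of the raw string. The sketch below assumes that design; `parse_role` is a hypothetical name, and it also guards against non-string input, which `role.lower()` alone would not.

```python
from enum import Enum
from typing import Optional

class Role(Enum):
    USER = 'user'
    MODERATOR = 'moderator'
    ADMIN = 'admin'

def parse_role(raw) -> Optional[Role]:
    """Return the matching Role, or None when the input is not an allowed value."""
    if not isinstance(raw, str):
        return None
    try:
        return Role(raw.lower())
    except ValueError:
        return None

print(parse_role('Admin'))
print(parse_role('superuser'))
print(parse_role(None))
```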

Numeric ID Validation

import re

def validate_id(id_str):
    # Strict: exactly 8 ASCII digits
    pattern = r'^[0-9]{8}$'

    # fullmatch() also rejects a trailing newline
    if not re.fullmatch(pattern, id_str):
        return False

    # Semantic validation: check range
    num_id = int(id_str)
    return 10000000 <= num_id <= 99999999

Why this works: re.fullmatch() with the pattern r'^[0-9]{8}$' enforces exactly 8 ASCII digits, rejecting inputs like "12345678abc" or "abc12345678" that merely contain a valid substring, as well as "12345678\n". This format validation happens before parsing, catching malformed input early. Calling int() directly is not a substitute: it raises ValueError on non-numeric input, but it also accepts surrounding whitespace, a leading sign, underscores between digits, and non-ASCII digits. The range check with explicit min/max values enforces semantic validity - for example, if your IDs start at 10000000, inputs like "00000001" that match the format but are outside the valid range get rejected. This layered validation (format → parsing → range) provides defense-in-depth and clear error boundaries.
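
The leniency of `int()` is easy to underestimate. The sketch below feeds it several strings that are not "exactly 8 digits" in any strict sense and compares the result with an explicit `[0-9]` pattern and with `\d`, which also matches non-ASCII digits on `str` patterns. The sample inputs are illustrative.

```python
import re

fullwidth = ''.join(chr(0xFF10 + d) for d in range(1, 9))   # fullwidth "12345678"
lenient_inputs = [' 12345678 ', '+12345678', '1234_5678', fullwidth]

# int() parses every one of these to the same number.
parsed_values = [int(text) for text in lenient_inputs]

# An explicit ASCII digit class accepts none of them.
strict_matches = [re.fullmatch(r'[0-9]{8}', text) is not None for text in lenient_inputs]

# \d is Unicode-aware by default and accepts the fullwidth digits.
backslash_d_fullwidth = re.fullmatch(r'\d{8}', fullwidth) is not None

print(parsed_values)
print(strict_matches)
print(backslash_d_fullwidth)
```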

Python-Specific Best Practices

Use `re.fullmatch()` for Exact Matching

import re

# re.fullmatch() is available in Python 3.4+
username = 'alice01'  # example input

# Old way: add anchors (note that $ still accepts a trailing newline)
if re.match(r'^[a-z0-9]{3,20}$', username):
    pass

# Better way: use fullmatch (no anchors needed, trailing newline rejected)
if re.fullmatch(r'[a-z0-9]{3,20}', username):
    pass

Pre-compile Patterns for Performance

import re

# Compile pattern once at module level
# re.ASCII keeps IGNORECASE limited to ASCII letters
USERNAME_PATTERN = re.compile(r'^[a-z0-9_]{3,20}$', re.IGNORECASE | re.ASCII)

def validate_username(username):
    return bool(USERNAME_PATTERN.fullmatch(username))

Use `pathlib` for Path Operations

from pathlib import Path

def get_file_safe(filename):
    allowed_files = {'report.pdf', 'data.csv', 'summary.txt'}

    if filename not in allowed_files:
        raise ValueError('File not allowed')

    base_dir = Path('/var/data').resolve()
    file_path = (base_dir / filename).resolve()

    # Check if file_path is relative to base_dir
    try:
        file_path.relative_to(base_dir)
    except ValueError:
        raise ValueError('Path traversal detected')

    return open(file_path)
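
The containment check can be exercised without touching real data by pointing it at a temporary directory. The sketch below factors the check into a hypothetical helper, `resolve_inside`, and tries one benign and one traversal path; the filenames are invented for illustration.

```python
import tempfile
from pathlib import Path

def resolve_inside(base_dir, user_path):
    """Return the resolved path if it stays under base_dir, else raise ValueError."""
    base = Path(base_dir).resolve()
    target = (base / user_path).resolve()
    try:
        target.relative_to(base)
    except ValueError:
        raise ValueError('Path traversal detected') from None
    return target

with tempfile.TemporaryDirectory() as tmp:
    inside = resolve_inside(tmp, 'report.pdf')
    try:
        resolve_inside(tmp, '../outside.txt')
        escaped = True
    except ValueError:
        escaped = False

print(inside.name, escaped)
```

Note that joining with an absolute `user_path` discards the base entirely (`base / '/etc/passwd'` is `/etc/passwd`), which is another reason the `relative_to()` check is needed after the join.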

Verification

After implementing the recommended secure patterns, verify the fix through multiple approaches:

Manual testing: Submit malicious payloads relevant to this vulnerability and confirm they're handled safely without executing unintended operations
Code review: Confirm all validation sites use the secure pattern (re.fullmatch() or fully anchored patterns, explicit allowlists, canonicalized paths) with no prefix or substring matching on untrusted input
Static analysis: Use security scanners to verify no new vulnerabilities exist and the original finding is resolved
Regression testing: Ensure legitimate user inputs and application workflows continue to function correctly
Edge case validation: Test with special characters, boundary conditions, and unusual inputs to verify proper handling
Framework verification: If using a framework or library, confirm the recommended APIs are used correctly according to documentation
Authentication/session testing: Verify security controls remain effective and cannot be bypassed (if applicable to the vulnerability type)
Rescan: Run the security scanner again to confirm the finding is resolved and no new issues were introduced
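
The manual, regression, and edge-case checks above can be captured as a small table-driven test. The sketch below is one possible shape, using a stand-in `validate_username` (an assumption, substitute the validator under test) and payload lists chosen for illustration.

```python
import re

def validate_username(username):
    """Stand-in validator: 3-20 lowercase ASCII letters, digits, or underscores."""
    return isinstance(username, str) and re.fullmatch(r'[a-z0-9_]{3,20}', username) is not None

MUST_REJECT = [
    '', 'ab', 'a' * 21,              # boundary lengths
    'admin\n',                       # trailing newline
    "bob'; DROP TABLE users--",      # valid prefix followed by a payload
    '../etc', 'user name',           # disallowed characters
    'mar\u212a',                     # non-ASCII look-alike
    None,                            # wrong type
]
MUST_ACCEPT = ['bob', 'alice_01', 'a' * 20]

failures = [v for v in MUST_REJECT if validate_username(v)]
failures += [v for v in MUST_ACCEPT if not validate_username(v)]

print(failures)
```

An empty `failures` list means every hostile input was rejected and every legitimate input was accepted.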