CWE-183: Permissive List of Allowed Inputs - Python
Overview
Python-specific guidance for implementing strict input validation using regular expressions, sets, and path manipulation functions.
Primary Defence: Match the entire input rather than a prefix or substring. Prefer re.fullmatch(): in Python, $ also matches just before a trailing newline, so ^...$ with re.match() still accepts "value\n". Validate structured inputs with purpose-built tools such as pathlib.Path.resolve() for file paths and the ipaddress module for IP addresses, and enforce strict length limits.
Common Vulnerable Patterns
Unanchored Regular Expressions
import re

def validate_email(email):
    # VULNERABLE - no anchors, allows extra content
    # Attacker: "valid@example.com<script>alert(1)</script>"
    if re.match(r'[\w.-]+@[\w.-]+', email):
        return True  # Matched prefix, ignores suffix!
    return False
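To see the difference in practice, the sketch below runs the attacker string from the comment above through the unanchored pattern, first with re.match() and then with re.fullmatch(). The variable names are illustrative only.

```python
import re

PATTERN = r'[\w.-]+@[\w.-]+'
payload = 'valid@example.com<script>alert(1)</script>'

# re.match() only needs a match at the start of the string.
prefix_match = re.match(PATTERN, payload)
matched_text = prefix_match.group(0)  # the part of the input that matched

# re.fullmatch() must consume the whole string, so the trailing markup fails it.
whole_match = re.fullmatch(PATTERN, payload)

print(repr(matched_text))
print(whole_match)
```

The match object covers only the address at the front of the string; the script tag after it is never examined, which is why the vulnerable function returns True.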
Permissive File Extension Check
import re

def validate_filename(filename):
    # VULNERABLE - just checks if extension appears anywhere
    # Attacker: "malware.exe.jpg", "file.jpg.php"
    if re.search(r'\.(jpg|png|gif)', filename):
        return True
    return False
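A short comparison makes the gap visible. In this sketch, permissive_check mirrors the vulnerable function above, while strict_check is a hypothetical stricter variant (not taken from any library) that requires the whole name to be one base name, one dot, and one allowed extension.

```python
import re

def permissive_check(filename):
    # Mirrors the vulnerable check: the extension may appear anywhere.
    return re.search(r'\.(jpg|png|gif)', filename) is not None

def strict_check(filename):
    # Hypothetical stricter variant: the entire name must match.
    return re.fullmatch(r'[A-Za-z0-9_-]+\.(jpg|png|gif)', filename) is not None

for name in ('photo.jpg', 'file.jpg.php', 'malware.exe.jpg'):
    print(name, permissive_check(name), strict_check(name))
```

The permissive check accepts all three names; the strict variant accepts only photo.jpg.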
Path Traversal Allowed
import re

def get_file(filename):
    # VULNERABLE - allows path traversal
    # Attacker: "../../../etc/passwd"
    allowed_chars = re.compile(r'^[a-zA-Z0-9._/-]+$')
    if allowed_chars.match(filename):
        return open(f'/var/data/{filename}')  # Path traversal!
    return None
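The character allowlist fails here because every character in the attack string is individually allowed. As a rough illustration, the sketch below checks the payload against the same pattern and then normalizes the joined path to show where it would point. posixpath is used so the result does not depend on the platform running the example.

```python
import posixpath
import re

allowed_chars = re.compile(r'^[a-zA-Z0-9._/-]+$')
payload = '../../../etc/passwd'

# Dots, slashes and letters are all in the allowlist, so the filter passes.
passes_filter = allowed_chars.match(payload) is not None

# normpath() collapses the ".." segments, showing the file that would be opened.
resolved = posixpath.normpath(posixpath.join('/var/data', payload))

print(passes_filter, resolved)
```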
Secure Patterns
Strict Email Validation
This check accepts only the simple local@domain.tld form. Full RFC 5322 address syntax is much more complex.
import re

def validate_email(email):
    # Check length before running the regex
    if len(email) > 254:
        return False
    # fullmatch() requires the entire string to match; unlike re.match()
    # with a trailing $, it also rejects "user@example.com\n"
    pattern = r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}'
    if not re.fullmatch(pattern, email):
        return False
    # Additional semantic checks (the pattern guarantees exactly one '@')
    local, domain = email.split('@')
    if len(local) > 64:  # RFC 5321 limit
        return False
    return True
Why this works: re.fullmatch() succeeds only if the whole string matches the pattern, so an address embedded in a larger string (like "user@example.com<script>alert(1)</script>") is rejected. It is stricter than re.match() with ^ and $ anchors, because in Python $ also matches just before a trailing newline. The pattern enforces a clear structure: local part, @ symbol, domain, and a TLD of at least two letters (.co, .uk), which rejects many malformed domains and typos. The 254-character limit follows the commonly cited maximum address length derived from RFC 5321 and bounds the work the regex can do on hostile input. The 64-character check enforces the RFC 5321 limit on the local part. This simplified approach balances security with usability - full RFC 5322 compliance is extremely complex and rarely needed for web applications.
Strict Filename Validation
import re

def validate_filename(filename):
    # Length check
    if len(filename) > 255:
        return False
    # Whole name must match: one base name, one dot, one allowed extension.
    # re.ASCII keeps IGNORECASE from matching non-ASCII look-alike letters.
    pattern = r'[a-zA-Z0-9_-]+\.(jpg|png|gif)'
    if not re.fullmatch(pattern, filename, re.IGNORECASE | re.ASCII):
        return False
    # Additional security checks (defense-in-depth)
    if '..' in filename or '/' in filename or '\\' in filename:
        return False
    return True
Why this works: Matching the whole string with re.fullmatch() ensures the filename ends with an allowed extension, which blocks names like "file.jpg.php" where an allowed extension merely appears somewhere in the name. Because the base-name allowlist [a-zA-Z0-9_-] excludes dots, double-extension names like "malware.exe.jpg" are rejected as well, along with path separators and shell metacharacters that could be used for path traversal or command injection. The length check keeps names within the 255-character limit common to many filesystems and rejects extremely long input early. The explicit checks for .., / and \ provide defense-in-depth against path traversal, even though the pattern already excludes these characters. re.IGNORECASE lets legitimate names like "file.JPG" pass. re.ASCII is added because, on str patterns, IGNORECASE alone lets a few non-ASCII letters (such as U+212A KELVIN SIGN) match ASCII letter ranges.
Path Validation with Canonicalization
import os

def get_file(filename):
    # Best: strict allowlist of specific files
    allowed_files = {
        'report.pdf',
        'data.csv',
        'summary.txt'
    }
    if filename not in allowed_files:
        raise ValueError('File not allowed')
    # Canonicalize: realpath() normalizes the path and resolves symbolic links
    base_dir = os.path.realpath('/var/data')
    file_path = os.path.realpath(os.path.join(base_dir, filename))
    # Ensure resolved path is within base directory
    if not file_path.startswith(base_dir + os.sep):
        raise ValueError('Path traversal detected')
    return open(file_path)
Why this works: The allowlist of specific filenames provides the strongest control by explicitly defining which files can be accessed; any other name is rejected. os.path.realpath() then canonicalizes the joined path: it removes ., .. and redundant separators and resolves symbolic links. (os.path.abspath() only normalizes the path string and does not resolve symlinks, so it is not sufficient on its own.) The startswith() check against base_dir + os.sep ensures the canonical path remains within the base directory. The trailing separator matters, because without it a sibling directory such as /var/data_backup would also pass. Using os.sep keeps the check platform-independent (works on both Windows \ and Unix /). This defense-in-depth approach combines allowlisting, canonicalization, and boundary checking.
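The boundary check is easy to get subtly wrong. The sketch below contrasts a plain prefix test with a component-wise comparison using os.path.commonpath() (shown through posixpath so the example behaves the same on any platform). The function names and BASE_DIR constant are illustrative, and both helpers assume the path has already been canonicalized to an absolute path.

```python
import posixpath

BASE_DIR = '/var/data'

def is_inside_naive(path):
    # Pitfall: a character-level prefix test also accepts sibling directories.
    return path.startswith(BASE_DIR)

def is_inside(path):
    # commonpath() compares whole path components, not characters.
    return posixpath.commonpath([BASE_DIR, path]) == BASE_DIR

for candidate in ('/var/data/report.pdf', '/var/data_backup/secret.txt'):
    print(candidate, is_inside_naive(candidate), is_inside(candidate))
```

Appending the separator to the prefix, as the function above does, and comparing with commonpath() are two ways of reaching the same result. Note that is_inside also returns True for the base directory itself.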
Username Validation
import re

def validate_username(username):
    # Length check
    if not username or len(username) > 20:
        return False
    # Strict pattern: ASCII letters, digits, underscore only, 3-20 characters.
    # Match the original string: str.lower() maps some non-ASCII characters
    # (e.g. U+212A KELVIN SIGN) to ASCII letters.
    pattern = r'[A-Za-z0-9_]{3,20}'
    if not re.fullmatch(pattern, username):
        return False
    # Reject reserved names (lowercasing is safe once the input is ASCII)
    reserved = {'admin', 'root', 'system', 'administrator'}
    if username.lower() in reserved:
        return False
    return True
Why this works: re.fullmatch() with r'[A-Za-z0-9_]{3,20}' requires the entire string to consist of 3-20 ASCII letters, digits, and underscores, so input like "admin'; DROP TABLE users--" cannot pass on the strength of a valid prefix. The pattern is applied to the original string rather than to username.lower(): lowercasing first would let a non-ASCII character such as U+212A (KELVIN SIGN, which lowercases to k) pass validation while the stored value still contains it. Listing both cases in the character class keeps validation case-insensitive. The length limit rejects oversized input before any further processing. The reserved-name set gives O(1) lookups and blocks registration of names that imitate privileged accounts. This layered approach validates format, length, and semantic constraints.
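Validating a lowercased copy of the input is itself a permissive-allowlist risk, because str.lower() can turn a non-ASCII character into an ASCII one. A minimal sketch, using a hypothetical attacker-chosen name:

```python
import re

KELVIN_SIGN = '\u212a'           # looks like "K" but is not ASCII
candidate = 'jac' + KELVIN_SIGN  # hypothetical attacker-chosen username

# Validating the lowercased copy accepts the name...
accepted_after_lower = re.fullmatch(r'[a-z0-9_]{3,20}', candidate.lower()) is not None
# ...even though the original string, which is what would be stored, is not ASCII.
original_is_ascii = candidate.isascii()

# Matching the original string against an explicit ASCII class rejects it.
accepted_strict = re.fullmatch(r'[A-Za-z0-9_]{3,20}', candidate) is not None

print(accepted_after_lower, original_is_ascii, accepted_strict)
```

If case-insensitive names are required, one option is to validate the original string and then store the lowercased form, so that the value checked and the value kept are the same.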
URL Validation
from urllib.parse import urlparse
import ipaddress

def validate_url(url):
    try:
        parsed = urlparse(url)
        # Strict: only allow http and https
        if parsed.scheme not in ('http', 'https'):
            return False
        # Validate host exists
        if not parsed.netloc:
            return False
        # Extract hostname (remove port if present)
        hostname = parsed.hostname
        if not hostname:
            return False
        # Optional: reject private/loopback addresses
        try:
            ip = ipaddress.ip_address(hostname)
            if ip.is_private or ip.is_loopback:
                return False
        except ValueError:
            # Not an IP address, that's okay
            pass
        return True
    except Exception:
        return False
Why this works: urlparse() splits the URL into components, so each check is applied to the parsed scheme and host rather than to the raw string. urlparse() is lenient and raises on very few malformed inputs, so it is the explicit checks, not the parser, that do the rejecting. Validating parsed.scheme against a tuple of allowed protocols blocks javascript:, data:, file:, and ftp: URLs that could enable XSS or local file access attacks. Checking for a non-empty netloc and hostname rejects URLs like http:// that have a valid scheme but no destination. The ipaddress check rejects literal private and loopback addresses (192.168.x.x, 10.x.x.x, 172.16-31.x.x, 127.0.0.1), which reduces SSRF exposure. It does not cover hostnames such as localhost or names that resolve to internal addresses, so SSRF-sensitive code should also check the resolved address before connecting. The broad exception handler ensures that any parsing or validation error results in rejection, following a fail-secure pattern. This approach is much safer than regex-based URL validation, which is prone to bypasses.
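The literal-IP check covers only part of the SSRF surface. The sketch below isolates that step in a hypothetical helper, literal_ip_is_internal, which returns None when the host is a name rather than an IP literal, making it explicit which inputs the ipaddress check never sees.

```python
import ipaddress
from urllib.parse import urlparse

def literal_ip_is_internal(url):
    """Return True/False for a literal IP host, or None if the host is a name."""
    hostname = urlparse(url).hostname
    if not hostname:
        return None
    try:
        ip = ipaddress.ip_address(hostname)
    except ValueError:
        return None  # a hostname, not an IP literal: needs a DNS-level check
    return ip.is_private or ip.is_loopback

for url in ('http://127.0.0.1/admin', 'http://8.8.8.8/', 'http://localhost/'):
    print(url, literal_ip_is_internal(url))
```

A None result is the case that needs further handling: the host must be resolved, and the resulting address checked, before any request is made.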
Enum-Based Validation
def validate_role(role):
    # Best practice: use set for known values
    allowed_roles = {'user', 'moderator', 'admin'}
    # Exact match only (case-insensitive)
    return role.lower() in allowed_roles

# Alternative: use Enum
from enum import Enum

class Role(Enum):
    USER = 'user'
    MODERATOR = 'moderator'
    ADMIN = 'admin'

def validate_role_enum(role):
    try:
        Role(role.lower())
        return True
    except ValueError:
        return False
Why this works: A Python set gives O(1) average-case lookups (list membership is O(n)) and serves as an explicit allowlist: the value is either in the set or it is not. Converting input to lowercase enables case-insensitive matching while the set of accepted values stays fixed. The Enum alternative adds type safety and IDE support - constructing a Role from a value that is not defined raises ValueError, which can be caught for validation. Because there is no pattern matching, there is no pattern to get subtly wrong, and only a known value can pass. Enums are also more maintainable because the IDE can detect typos, refactoring tools work correctly, and the values are defined in one location.
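One way to build on the Enum approach is to return the Enum member instead of a boolean, so the rest of the code works with the canonical value rather than the raw input. The parse_role helper below is a hypothetical sketch of that idea; it also rejects non-string input, which would otherwise raise AttributeError on .lower().

```python
from enum import Enum
from typing import Optional

class Role(Enum):
    USER = 'user'
    MODERATOR = 'moderator'
    ADMIN = 'admin'

def parse_role(raw) -> Optional[Role]:
    """Return the matching Role, or None if raw is not an allowed value."""
    if not isinstance(raw, str):
        return None
    try:
        return Role(raw.lower())
    except ValueError:
        return None

print(parse_role('Admin'), parse_role('superuser'), parse_role(None))
```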
Numeric ID Validation
import re

def validate_id(id_str):
    # Strict: exactly 8 ASCII digits; fullmatch() also rejects a trailing newline
    pattern = r'[0-9]{8}'
    if not re.fullmatch(pattern, id_str):
        return False
    # Semantic validation: check range
    num_id = int(id_str)
    return 10000000 <= num_id <= 99999999
Why this works: re.fullmatch() with r'[0-9]{8}' enforces exactly 8 ASCII digits, rejecting inputs like "12345678abc" or "abc12345678" that merely contain a valid substring. This format validation happens before parsing, catching malformed input early. It is stricter than calling int() directly, which also accepts surrounding whitespace, a leading sign, underscores between digits, and non-ASCII digits. The range check with explicit min/max values enforces semantic validity - for example, if your IDs start at 10000000, inputs like "00000001" that match the format but are outside the valid range get rejected. This layered validation (format → parsing → range) provides defense-in-depth and clear error boundaries.
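Relying on int() alone as the format check is more permissive than it looks. The sketch below compares what int() accepts with what an explicit 8-digit ASCII pattern accepts. The sample strings and helper names are illustrative.

```python
import re

FULLWIDTH = '\uff11\uff12\uff13\uff14\uff15\uff16\uff17\uff18'  # fullwidth digits 1-8
samples = [' 12345678 ', '+12345678', '1234_5678', FULLWIDTH]

def int_accepts(text):
    # int() tolerates whitespace, a sign, underscores and non-ASCII digits.
    try:
        int(text)
        return True
    except ValueError:
        return False

def strict_accepts(text):
    # Only exactly eight ASCII digits pass.
    return re.fullmatch(r'[0-9]{8}', text) is not None

for text in samples:
    print(repr(text), int_accepts(text), strict_accepts(text))
```

The same caution applies to \d: on str patterns it matches any Unicode decimal digit, so [0-9] is the safer class when only ASCII digits are intended.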
Python-Specific Best Practices
Use re.fullmatch() for Exact Matching
import re

username = 'alice01'  # example input
pattern = r'[a-z0-9]{3,20}'

# Old way: add anchors. In Python, $ also matches just before a trailing
# newline, so this form still accepts 'alice01\n'.
if re.match(r'^[a-z0-9]{3,20}$', username):
    pass

# Better way: use fullmatch() (Python 3.4+); no anchors needed
if re.fullmatch(pattern, username):
    pass
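The two forms are not equivalent when the input ends with a newline. A small sketch of the three options, using an illustrative value such as might arrive from an unstripped form field:

```python
import re

value = 'alice01\n'  # trailing newline

with_dollar = re.match(r'^[a-z0-9]{3,20}$', value) is not None       # $ anchor
with_end_anchor = re.match(r'^[a-z0-9]{3,20}\Z', value) is not None  # \Z anchor
with_fullmatch = re.fullmatch(r'[a-z0-9]{3,20}', value) is not None

print(with_dollar, with_end_anchor, with_fullmatch)
```

Only the $ version accepts the value. \Z matches solely at the very end of the string, so it behaves like fullmatch() here.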
Pre-compile Patterns for Performance
import re

# Compile pattern once at module level.
# re.ASCII stops IGNORECASE from matching non-ASCII letters such as U+212A.
USERNAME_PATTERN = re.compile(r'[a-z0-9_]{3,20}', re.IGNORECASE | re.ASCII)

def validate_username(username):
    return bool(USERNAME_PATTERN.fullmatch(username))
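When a compiled pattern uses re.IGNORECASE on str input, the flag applies Unicode case rules, which widens ASCII letter ranges slightly. The sketch below compares the flag with and without re.ASCII. The constant names are illustrative.

```python
import re

LOOSE = re.compile(r'[a-z0-9_]{3,20}', re.IGNORECASE)
TIGHT = re.compile(r'[a-z0-9_]{3,20}', re.IGNORECASE | re.ASCII)

name = 'jac\u212a'  # ends with U+212A KELVIN SIGN, not an ASCII "K"

loose_accepts = LOOSE.fullmatch(name) is not None
tight_accepts = TIGHT.fullmatch(name) is not None
tight_accepts_upper = TIGHT.fullmatch('Alice_01') is not None

print(loose_accepts, tight_accepts, tight_accepts_upper)
```

An alternative that avoids flags altogether is to spell out both cases in the class, as in [A-Za-z0-9_].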
Use pathlib for Path Operations
from pathlib import Path

def get_file_safe(filename):
    allowed_files = {'report.pdf', 'data.csv', 'summary.txt'}
    if filename not in allowed_files:
        raise ValueError('File not allowed')
    base_dir = Path('/var/data').resolve()
    file_path = (base_dir / filename).resolve()
    # Check if file_path is relative to base_dir
    try:
        file_path.relative_to(base_dir)
    except ValueError:
        raise ValueError('Path traversal detected')
    return open(file_path)
Verification
After implementing the recommended secure patterns, verify the fix through multiple approaches:
- Manual testing: Submit malicious payloads relevant to this vulnerability and confirm they're handled safely without executing unintended operations
- Code review: Confirm all validation sites use the secure pattern (re.fullmatch() or explicit allowlists, canonicalized paths, length limits) with no prefix or substring matching left in place
- Static analysis: Use security scanners to verify no new vulnerabilities exist and the original finding is resolved
- Regression testing: Ensure legitimate user inputs and application workflows continue to function correctly
- Edge case validation: Test with special characters, boundary conditions, and unusual inputs to verify proper handling
- Framework verification: If using a framework or library, confirm the recommended APIs are used correctly according to documentation
- Authentication/session testing: Verify security controls remain effective and cannot be bypassed (if applicable to the vulnerability type)
- Rescan: Run the security scanner again to confirm the finding is resolved and no new issues were introduced
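The manual, regression, and edge-case checks above can be captured as a small table-driven test. The sketch below is one possible shape, using an 8-digit ID validator as the function under test. The case list is illustrative rather than exhaustive and should be extended with payloads relevant to each validator.

```python
import re

def validate_id(id_str):
    # Validator under test: exactly 8 ASCII digits in the range 10000000-99999999.
    if not isinstance(id_str, str) or not re.fullmatch(r'[0-9]{8}', id_str):
        return False
    return 10000000 <= int(id_str) <= 99999999

# (input, expected result) pairs: legitimate values first, then hostile ones.
CASES = [
    ('12345678', True),
    ('99999999', True),
    ('00000001', False),     # right format, outside the valid range
    ('12345678\n', False),   # trailing newline
    ('12345678abc', False),  # valid prefix
    ('abc12345678', False),  # valid suffix
    ('', False),
    (None, False),
]

def run_cases():
    """Return the cases whose actual result differs from the expected one."""
    return [(value, expected) for value, expected in CASES
            if validate_id(value) != expected]

failures = run_cases()
print('failures:', failures)
```

Keeping legitimate and hostile inputs in the same table makes it harder for a later change to tighten one side while silently loosening the other.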