CWE-91: XML Injection - Python
Overview
XML Injection in Python applications occurs when untrusted user input is used to construct XML documents or queries without proper validation or escaping. Attackers can manipulate XML structure by injecting special characters like <, >, &, ', and ", leading to data corruption, authentication bypass, or information disclosure.
Primary Defence: Use xml.etree.ElementTree or lxml with proper element creation methods instead of string concatenation, validate and escape user input with xml.sax.saxutils.escape() before including in XML, disable external entity processing (XMLParser(resolve_entities=False)) to prevent XXE attacks, and use schema validation to ensure XML structure integrity and prevent XML injection.
Common Python XML Vulnerability Scenarios:
- Building XML documents with string concatenation or f-strings
- Using user input directly in XML elements or attributes
- Manual XML parsing without escaping
- XML query injection (XPath/XQuery)
- SOAP/REST XML payloads with unsanitized data
Python XML Libraries:
- xml.etree.ElementTree: Standard library XML API (recommended)
- lxml: Feature-rich XML/HTML library with XPath support
- defusedxml: Security-hardened XML parsing
- xmltodict: Dict-to-XML conversion
- untangle: Simple XML-to-object parsing
XML Special Characters Requiring Escaping:
<→<>→>&→&'→'"→"
Common Vulnerable Patterns
String Concatenation
# VULNERABLE - Direct string concatenation
def create_user_xml(username, email):
# VULNERABLE - User input directly in XML string
xml = f"""<?xml version="1.0"?>
<user>
<username>{username}</username>
<email>{email}</email>
</user>"""
return xml
# Attack: username = "</username><admin>true</admin><username>"
# Result: <username></username><admin>true</admin><username></username>
# Creates unintended <admin> element
Why this is vulnerable:
- No escaping of XML special characters
- Allows injection of new elements
- Can modify XML structure
- Bypasses validation
Flask/Django XML API
# VULNERABLE - Flask endpoint returning XML
from flask import Flask, request, Response
app = Flask(__name__)
@app.route('/api/user', methods=['GET'])
def get_user():
username = request.args.get('username', '')
email = request.args.get('email', '')
# VULNERABLE - String interpolation in XML
xml_response = f"""<?xml version="1.0" encoding="UTF-8"?>
<response>
<user>
<name>{username}</name>
<email>{email}</email>
</user>
</response>"""
return Response(xml_response, mimetype='application/xml')
# Attack: username = "<admin>true</admin>"
# Response contains: <name><admin>true</admin></name>
Why this is vulnerable:
- Web framework doesn't auto-escape XML
- User input from request parameters
- No validation or sanitization
- Information disclosure possible
SOAP Request Construction
# VULNERABLE - Building SOAP XML manually
import requests
def call_soap_service(user_id, action):
# VULNERABLE - User input in SOAP envelope
soap_envelope = f"""<?xml version="1.0"?>
<soap:Envelope xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/">
<soap:Body>
<GetUserData>
<UserId>{user_id}</UserId>
<Action>{action}</Action>
</GetUserData>
</soap:Body>
</soap:Envelope>"""
response = requests.post(
'https://api.example.com/soap',
data=soap_envelope,
headers={'Content-Type': 'text/xml'}
)
return response.text
# Attack: user_id = "</UserId><Role>admin</Role><UserId>"
# Injects admin role into SOAP request
Why this is vulnerable:
- SOAP envelope built with string formatting
- Allows element injection
- Can escalate privileges
- Modify request structure
XML Configuration Files
# VULNERABLE - Writing XML config with user data
def save_user_settings(username, theme, language):
# VULNERABLE - User input in XML config
config_xml = f"""<?xml version="1.0"?>
<config>
<user>{username}</user>
<preferences>
<theme>{theme}</theme>
<language>{language}</language>
</preferences>
</config>"""
with open('config.xml', 'w') as f:
f.write(config_xml)
# Attack: theme = "</theme><admin_access>true</admin_access><theme>"
# Modifies configuration structure
Why this is vulnerable:
- Configuration files parsed by XML parser
- Persistent injection
- Can modify application behavior
- Privilege escalation
lxml with String Building
# VULNERABLE - Using lxml with string concatenation
from lxml import etree
def create_xml_response(data):
# VULNERABLE - Building XML string manually
xml_str = f"""<response>
<status>success</status>
<data>{data}</data>
</response>"""
# Parse the vulnerable XML
root = etree.fromstring(xml_str.encode())
return etree.tostring(root, pretty_print=True)
# Attack: data = "</data><malicious>payload</malicious><data>"
# Injects malicious elements
Why this is vulnerable:
- lxml doesn't escape string concatenation
- Parser accepts malformed structure
- Injection before parsing
- No validation
XPath Query Injection
# VULNERABLE - User input in XPath query
from lxml import etree
def find_user_by_name(xml_doc, username):
root = etree.fromstring(xml_doc)
# VULNERABLE - XPath injection
xpath = f"//user[name='{username}']"
results = root.xpath(xpath)
return results
# Attack: username = "' or '1'='1"
# XPath: //user[name='' or '1'='1']
# Returns all users
Why this is vulnerable:
- XPath query built with concatenation
- Boolean-based injection
- Bypasses authentication checks
- Information disclosure
XML Attribute Injection
# VULNERABLE - User input in XML attributes
def create_element_with_attrs(name, value, attr_value):
# VULNERABLE - Attribute injection
xml = f'<element name="{name}" value="{value}" custom="{attr_value}"/>'
return xml
# Attack: attr_value = 'test" malicious="true'
# Result: <element name="..." value="..." custom="test" malicious="true"/>
# Injects additional attributes
Why this is vulnerable:
- Attribute quotes can be escaped
- Allows additional attribute injection
- Modifies element properties
- Can bypass security checks
Django Model to XML
# VULNERABLE - Django model serialization to XML
from django.http import HttpResponse
def export_users_xml(request):
from myapp.models import User
search_term = request.GET.get('search', '')
users = User.objects.filter(username__icontains=search_term)
# VULNERABLE - Manual XML construction
xml_parts = ['<?xml version="1.0"?><users>']
for user in users:
# VULNERABLE - No escaping of user data
xml_parts.append(f'''
<user>
<username>{user.username}</username>
<email>{user.email}</email>
<bio>{user.bio}</bio>
</user>''')
xml_parts.append('</users>')
return HttpResponse(
''.join(xml_parts),
content_type='application/xml'
)
# Attack: User.bio contains "</bio><admin>true</admin><bio>"
# Stored XSS-like injection
Why this is vulnerable:
- Database values not escaped
- Assumes clean data
- Persistent injection via stored data
- Framework doesn't auto-escape XML
Secure Patterns
ElementTree with Proper API
# SECURE - Using ElementTree API (recommended)
import xml.etree.ElementTree as ET
def create_user_xml(username, email):
"""
Create XML document using ElementTree API.
ElementTree automatically escapes special characters.
"""
# Validate inputs
if not isinstance(username, str) or len(username) > 100:
raise ValueError("Invalid username")
if not isinstance(email, str) or len(email) > 255:
raise ValueError("Invalid email")
# SECURE - Use ElementTree API
root = ET.Element('user')
username_elem = ET.SubElement(root, 'username')
username_elem.text = username # Automatically escaped
email_elem = ET.SubElement(root, 'email')
email_elem.text = email # Automatically escaped
# Convert to string
xml_str = ET.tostring(root, encoding='unicode')
return f'<?xml version="1.0"?>\n{xml_str}'
# Example usage
xml = create_user_xml(
username="<script>alert('xss')</script>",
email="test@example.com"
)
# Result: <username><script>alert('xss')</script></username>
# Special characters properly escaped
Why this works: xml.etree.ElementTree (ET.Element(), ET.SubElement(), .text = value) is Python's standard library XML API that automatically escapes XML special characters (<, >, &, ', ") when setting element text content. The .text property assignment treats the value as character data, not markup - so even if username contains </username><admin>true</admin>, it becomes </username><admin>true</admin> in the output, preventing structural changes.
Pre-validation (isinstance(), len() checks) provides defense-in-depth - rejecting overly long inputs prevents DoS and limits attack surface. ET.tostring() with encoding='unicode' returns a string (not bytes), making it easy to prepend the XML declaration (<?xml version="1.0"?>). ElementTree is lightweight (no external dependencies), fast (written in C), and safe by default - it doesn't support dangerous features like DTDs or entity expansion that can lead to XXE attacks.
This pattern is immune to injection because ElementTree builds a tree structure internally and serializes it, never concatenating raw strings. Recommended for all Python XML generation unless you need advanced features like XPath (use lxml) or XXE protection for parsing (use defusedxml).
lxml with Safe API
# SECURE - Using lxml with proper API
from lxml import etree
import re
def create_xml_with_lxml(data_dict):
"""
Create XML using lxml's safe API.
"""
# Validate input
for key, value in data_dict.items():
if not isinstance(key, str) or not isinstance(value, str):
raise ValueError("Keys and values must be strings")
if len(value) > 1000:
raise ValueError("Value too long")
# SECURE - Use lxml Element API
root = etree.Element('response')
status = etree.SubElement(root, 'status')
status.text = 'success'
data_elem = etree.SubElement(root, 'data')
for key, value in data_dict.items():
# Validate key name (XML element names)
if not re.match(r'^[a-zA-Z_][a-zA-Z0-9_-]*$', key):
raise ValueError(f"Invalid XML element name: {key}")
item = etree.SubElement(data_elem, key)
item.text = value # Automatically escaped
return etree.tostring(
root,
pretty_print=True,
encoding='unicode',
xml_declaration=True
)
# Example usage
xml = create_xml_with_lxml({
'name': '<admin>true</admin>',
'email': 'test@example.com'
})
# Malicious input properly escaped
Why this works: lxml (etree.Element(), etree.SubElement(), .text = value) is a feature-rich C-based XML library that provides automatic escaping like ElementTree but with additional capabilities (XPath, XSLT, XML Schema validation). The .text property assignment treats content as character data, escaping special characters to prevent injection of elements like </data><malicious>payload</malicious>.
Element name validation (re.match(r'^[a-zA-Z_][a-zA-Z0-9_-]*$', key)) enforces XML naming rules - element names must start with a letter/underscore, preventing names like 123invalid or <script>. Length limits (len(value) > 1000) prevent DoS via extremely large values that consume memory. etree.tostring() with xml_declaration=True produces a complete XML document with <?xml version="1.0"?> header. pretty_print=True adds indentation for human-readable output. encoding='unicode' returns a string (not bytes), avoiding encoding issues in Python 3.
lxml is faster than ElementTree for large documents and supports XPath for querying (root.xpath("//item[@type='active']")) and XSLT for transformations. Use lxml when you need advanced XML features; use ElementTree for simpler use cases. Both are safe from injection when using the Element API.
Flask with defusedxml
# SECURE - Flask endpoint with defusedxml
from flask import Flask, request, Response
import xml.etree.ElementTree as ET
from defusedxml import ElementTree as DefusedET
import re
app = Flask(__name__)
def validate_username(username):
"""Validate username format"""
if not username or len(username) > 50:
return False
return bool(re.match(r'^[a-zA-Z0-9._-]+$', username))
def validate_email(email):
"""Validate email format"""
if not email or len(email) > 255:
return False
return bool(re.match(r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$', email))
@app.route('/api/user', methods=['GET'])
def get_user():
username = request.args.get('username', '')
email = request.args.get('email', '')
# SECURE - Validate inputs
if not validate_username(username):
return Response(
'<error>Invalid username</error>',
status=400,
mimetype='application/xml'
)
if not validate_email(email):
return Response(
'<error>Invalid email</error>',
status=400,
mimetype='application/xml'
)
# SECURE - Build XML with ElementTree
root = ET.Element('response')
user_elem = ET.SubElement(root, 'user')
name_elem = ET.SubElement(user_elem, 'name')
name_elem.text = username
email_elem = ET.SubElement(user_elem, 'email')
email_elem.text = email
xml_str = ET.tostring(root, encoding='unicode')
full_xml = f'<?xml version="1.0" encoding="UTF-8"?>\n{xml_str}'
return Response(full_xml, mimetype='application/xml')
if __name__ == '__main__':
app.run()
Why this works: Flask integration with ElementTree (ET.Element(), .text = value) demonstrates how to safely return XML responses in web APIs. Regex validation (^[a-zA-Z0-9._-]+$ for usernames, email pattern) blocks XML metacharacters (<, >, &) and XPath metacharacters (', "), preventing injection before it reaches the XML API. Early validation with 400 errors provides fail-fast behavior - invalid requests are rejected immediately without expensive XML processing.
Generic error messages ("Invalid username", "Invalid email") prevent information disclosure - attackers don't learn whether rejection was due to length, character set, or format checks. Response() with mimetype='application/xml' sets the Content-Type header (application/xml), ensuring browsers and clients parse the response correctly (not as HTML). defusedxml (imported but used for parsing untrusted XML inputs in other endpoints) provides XXE protection - it blocks DOCTYPE declarations, entity expansion, and external entity references that can lead to file disclosure or SSRF.
ElementTree escaping handles user-generated content safely. This pattern is ideal for RESTful APIs returning XML to clients, with Flask's routing (@app.route()) and request handling (request.args.get()).
SOAP with Zeep Library
# SECURE - Using zeep library for SOAP
from zeep import Client
from zeep.wsse.username import UsernameToken
def call_soap_service_secure(user_id, action):
"""
Use zeep library which handles XML construction safely.
"""
# Validate inputs
if not isinstance(user_id, (int, str)) or not str(user_id).isalnum():
raise ValueError("Invalid user_id")
if not action or len(action) > 50:
raise ValueError("Invalid action")
# SECURE - zeep handles XML construction
wsdl_url = 'https://api.example.com/service?wsdl'
client = Client(
wsdl_url,
wsse=UsernameToken('username', 'password')
)
# SECURE - zeep automatically escapes parameters
result = client.service.GetUserData(
UserId=user_id,
Action=action
)
return result
# zeep library builds proper SOAP envelope with escaping
Why this works: Zeep (Client(), client.service.MethodName()) is a modern Python SOAP client that automatically constructs SOAP envelopes with proper XML escaping, eliminating manual template literals like f"<UserId>{user_id}</UserId>". The Client(wsdl_url) constructor fetches the WSDL document, parses the service definition, and generates type-safe method calls (client.service.GetUserData()) that match the SOAP operations defined in the WSDL.
Parameter passing (UserId=user_id, Action=action) is serialized to XML by Zeep's lxml-based serializer, which automatically escapes special characters - even if user_id contains </UserId><Role>admin</Role>, it becomes </UserId><Role>.... Pre-validation (isinstance(), isalnum(), length checks) provides defense-in-depth and prevents DoS. WS-Security support (UsernameToken()) adds authentication headers to the SOAP envelope, handling credentials securely. Zeep handles namespaces automatically (SOAP-ENV, xsd, xsi), which is error-prone in manual construction.
The library parses SOAP responses into Python dictionaries/objects, avoiding manual XML parsing vulnerabilities. Successor to suds-jurko (deprecated), Zeep supports SOAP 1.1/1.2, WSDL 1.1/2.0, and is actively maintained. This pattern is ideal for enterprise SOAP integrations where manual envelope construction is error-prone and WSDL compliance is required.
Django with xml.etree
# SECURE - Django model export with ElementTree
from django.http import HttpResponse
import xml.etree.ElementTree as ET
import re
def validate_search_term(search):
"""Validate search term"""
return bool(re.match(r'^[a-zA-Z0-9\s]{1,50}$', search))
def export_users_xml(request):
from myapp.models import User
search_term = request.GET.get('search', '')
# SECURE - Validate input
if not validate_search_term(search_term):
return HttpResponse(
'<error>Invalid search term</error>',
content_type='application/xml',
status=400
)
users = User.objects.filter(
username__icontains=search_term
)[:100] # Limit results
# SECURE - Build XML with ElementTree
root = ET.Element('users')
root.set('count', str(len(users)))
for user in users:
user_elem = ET.SubElement(root, 'user')
user_elem.set('id', str(user.id))
username_elem = ET.SubElement(user_elem, 'username')
username_elem.text = user.username # Auto-escaped
email_elem = ET.SubElement(user_elem, 'email')
email_elem.text = user.email # Auto-escaped
# SECURE - Even user-generated content is escaped
if user.bio:
bio_elem = ET.SubElement(user_elem, 'bio')
bio_elem.text = user.bio # Auto-escaped
xml_str = ET.tostring(root, encoding='unicode')
full_xml = f'<?xml version="1.0"?>\n{xml_str}'
return HttpResponse(full_xml, content_type='application/xml')
Why this works: Django ORM integration with ElementTree demonstrates how to safely export database records to XML, addressing the common pitfall of assuming database content is safe. Regex validation on the search term (^[a-zA-Z0-9\s]{1,50}$) blocks XML metacharacters in user queries, preventing injection via query parameters. .filter() with __icontains performs case-insensitive substring search on the database, but the search term is validated before querying. Result limiting ([:100]) prevents DoS by capping the number of records processed and returned.
root.set('count', str(len(users))) demonstrates safe attribute setting - ElementTree escapes attribute values automatically. .text = user.username and .text = user.email show that even database content is escaped - if a user's bio contains </bio><admin>true</admin>, it becomes </bio>... in the XML output, preventing stored injection (like stored XSS but for XML). HttpResponse() with content_type='application/xml' sets the Content-Type header.
Django's ORM prevents SQL injection (via parameterized queries), but doesn't handle XML escaping - that's ElementTree's job. This pattern is ideal for Django REST APIs that need to export data in XML format (e.g., for legacy clients), with Django's view routing (def export_users_xml(request)) and ORM (User.objects.filter()).
XPath with Parameterization
# SECURE - XPath with safe practices
from lxml import etree
import re
def validate_username(username):
"""Allowlist validation"""
return bool(re.match(r'^[a-zA-Z0-9._-]{1,50}$', username))
def find_user_by_name_secure(xml_doc, username):
"""
Secure XPath query using ElementTree.
"""
# SECURE - Validate input
if not validate_username(username):
raise ValueError("Invalid username format")
root = etree.fromstring(xml_doc)
# SECURE - Use XPath with proper escaping
# Option 1: Iterate and compare (safest)
for user in root.xpath('//user'):
name_elem = user.find('name')
if name_elem is not None and name_elem.text == username:
return user
return None
# Alternative: Use lxml's XPath variables (also secure)
def find_user_with_xpath_vars(xml_doc, username):
if not validate_username(username):
raise ValueError("Invalid username")
root = etree.fromstring(xml_doc)
# SECURE - XPath with variables
results = root.xpath(
"//user[name=$username]",
username=username # Passed as variable, not string concat
)
return results[0] if results else None
Why this works: XPath injection is a different attack vector from XML injection - instead of injecting XML elements, attackers manipulate XPath query logic (similar to SQL injection). String concatenation (f"//user[name='{username}']") allows injection like username = "' or '1'='1", resulting in //user[name='' or '1'='1'] which returns all users instead of filtering by name. Regex validation (^[a-zA-Z0-9._-]{1,50}$) blocks XPath metacharacters (', ", [, ], (), =) before they reach the query, making injection structurally impossible.
Iterate-and-compare (for user in root.xpath('//user') + name_elem.text == username) is the safest pattern - it retrieves all candidate elements with a static XPath expression ('//user') and filters in Python code. This avoids dynamic query construction entirely. lxml's XPath variables (root.xpath("//user[name=$username]", username=username)) provide parameterization - the variable is passed separately and lxml handles escaping internally. This is similar to parameterized SQL queries.
Note: Not all Python XML libraries support XPath variables - ElementTree doesn't support XPath at all (only limited XPath-like find() methods), so lxml is required for complex queries. Use iterate-and-compare for maximum safety, XPath variables for readability, and never concatenate strings in XPath expressions.
Verification
After implementing the recommended secure patterns, verify the fix through multiple approaches:
- Manual testing: Submit malicious payloads relevant to this vulnerability and confirm they're handled safely without executing unintended operations
- Code review: Confirm all instances use the secure pattern (parameterized queries, safe APIs, proper encoding) with no string concatenation or unsafe operations
- Static analysis: Use security scanners to verify no new vulnerabilities exist and the original finding is resolved
- Regression testing: Ensure legitimate user inputs and application workflows continue to function correctly
- Edge case validation: Test with special characters, boundary conditions, and unusual inputs to verify proper handling
- Framework verification: If using a framework or library, confirm the recommended APIs are used correctly according to documentation
- Authentication/session testing: Verify security controls remain effective and cannot be bypassed (if applicable to the vulnerability type)
- Rescan: Run the security scanner again to confirm the finding is resolved and no new issues were introduced