CWE-91: XML Injection - Python

Overview

XML Injection in Python applications occurs when untrusted user input is used to construct XML documents or queries without proper validation or escaping. Attackers can manipulate XML structure by injecting special characters like <, >, &, ', and ", leading to data corruption, authentication bypass, or information disclosure.

Primary Defence: Use xml.etree.ElementTree or lxml with proper element creation methods instead of string concatenation, validate and escape user input with xml.sax.saxutils.escape() before including in XML, disable external entity processing (XMLParser(resolve_entities=False)) to prevent XXE attacks, and use schema validation to ensure XML structure integrity and prevent XML injection.

Common Python XML Vulnerability Scenarios:

Building XML documents with string concatenation or f-strings
Using user input directly in XML elements or attributes
Manual XML parsing without escaping
XML query injection (XPath/XQuery)
SOAP/REST XML payloads with unsanitized data

Python XML Libraries:

xml.etree.ElementTree: Standard library XML API (recommended)
lxml: Feature-rich XML/HTML library with XPath support
defusedxml: Security-hardened XML parsing
xmltodict: Dict-to-XML conversion
untangle: Simple XML-to-object parsing

XML Special Characters Requiring Escaping:

< → <
> → >
& → &
' → '
" → "

Common Vulnerable Patterns

String Concatenation

# VULNERABLE - Direct string concatenation
def create_user_xml(username, email):
    # VULNERABLE - User input directly in XML string
    xml = f"""<?xml version="1.0"?>
<user>
    <username>{username}</username>
    <email>{email}</email>
</user>"""
    return xml

# Attack: username = "</username><admin>true</admin><username>"
# Result: <username></username><admin>true</admin><username></username>
# Creates unintended <admin> element

Why this is vulnerable:

No escaping of XML special characters
Allows injection of new elements
Can modify XML structure
Bypasses validation

Flask/Django XML API

# VULNERABLE - Flask endpoint returning XML
from flask import Flask, request, Response

app = Flask(__name__)

@app.route('/api/user', methods=['GET'])
def get_user():
    username = request.args.get('username', '')
    email = request.args.get('email', '')

    # VULNERABLE - String interpolation in XML
    xml_response = f"""<?xml version="1.0" encoding="UTF-8"?>
<response>
    <user>
        <name>{username}</name>
        <email>{email}</email>
    </user>
</response>"""

    return Response(xml_response, mimetype='application/xml')

# Attack: username = "<admin>true</admin>"
# Response contains: <name><admin>true</admin></name>

Why this is vulnerable:

Web framework doesn't auto-escape XML
User input from request parameters
No validation or sanitization
Information disclosure possible

SOAP Request Construction

# VULNERABLE - Building SOAP XML manually
import requests

def call_soap_service(user_id, action):
    # VULNERABLE - User input in SOAP envelope
    soap_envelope = f"""<?xml version="1.0"?>
<soap:Envelope xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/">
    <soap:Body>
        <GetUserData>
            <UserId>{user_id}</UserId>
            <Action>{action}</Action>
        </GetUserData>
    </soap:Body>
</soap:Envelope>"""

    response = requests.post(
        'https://api.example.com/soap',
        data=soap_envelope,
        headers={'Content-Type': 'text/xml'}
    )
    return response.text

# Attack: user_id = "</UserId><Role>admin</Role><UserId>"
# Injects admin role into SOAP request

Why this is vulnerable:

SOAP envelope built with string formatting
Allows element injection
Can escalate privileges
Modify request structure

XML Configuration Files

# VULNERABLE - Writing XML config with user data
def save_user_settings(username, theme, language):
    # VULNERABLE - User input in XML config
    config_xml = f"""<?xml version="1.0"?>
<config>
    <user>{username}</user>
    <preferences>
        <theme>{theme}</theme>
        <language>{language}</language>
    </preferences>
</config>"""

    with open('config.xml', 'w') as f:
        f.write(config_xml)

# Attack: theme = "</theme><admin_access>true</admin_access><theme>"
# Modifies configuration structure

Why this is vulnerable:

Configuration files parsed by XML parser
Persistent injection
Can modify application behavior
Privilege escalation

lxml with String Building

# VULNERABLE - Using lxml with string concatenation
from lxml import etree

def create_xml_response(data):
    # VULNERABLE - Building XML string manually
    xml_str = f"""<response>
    <status>success</status>
    <data>{data}</data>
</response>"""

    # Parse the vulnerable XML
    root = etree.fromstring(xml_str.encode())
    return etree.tostring(root, pretty_print=True)

# Attack: data = "</data><malicious>payload</malicious><data>"
# Injects malicious elements

Why this is vulnerable:

lxml doesn't escape string concatenation
Parser accepts malformed structure
Injection before parsing
No validation

XPath Query Injection

# VULNERABLE - User input in XPath query
from lxml import etree

def find_user_by_name(xml_doc, username):
    root = etree.fromstring(xml_doc)

    # VULNERABLE - XPath injection
    xpath = f"//user[name='{username}']"
    results = root.xpath(xpath)

    return results

# Attack: username = "' or '1'='1"
# XPath: //user[name='' or '1'='1']
# Returns all users

Why this is vulnerable:

XPath query built with concatenation
Boolean-based injection
Bypasses authentication checks
Information disclosure

XML Attribute Injection

# VULNERABLE - User input in XML attributes
def create_element_with_attrs(name, value, attr_value):
    # VULNERABLE - Attribute injection
    xml = f'<element name="{name}" value="{value}" custom="{attr_value}"/>'
    return xml

# Attack: attr_value = 'test" malicious="true'
# Result: <element name="..." value="..." custom="test" malicious="true"/>
# Injects additional attributes

Why this is vulnerable:

Attribute quotes can be escaped
Allows additional attribute injection
Modifies element properties
Can bypass security checks

Django Model to XML

# VULNERABLE - Django model serialization to XML
from django.http import HttpResponse

def export_users_xml(request):
    from myapp.models import User

    search_term = request.GET.get('search', '')
    users = User.objects.filter(username__icontains=search_term)

    # VULNERABLE - Manual XML construction
    xml_parts = ['<?xml version="1.0"?><users>']

    for user in users:
        # VULNERABLE - No escaping of user data
        xml_parts.append(f'''
        <user>
            <username>{user.username}</username>
            <email>{user.email}</email>
            <bio>{user.bio}</bio>
        </user>''')

    xml_parts.append('</users>')

    return HttpResponse(
        ''.join(xml_parts),
        content_type='application/xml'
    )

# Attack: User.bio contains "</bio><admin>true</admin><bio>"
# Stored XSS-like injection

Why this is vulnerable:

Database values not escaped
Assumes clean data
Persistent injection via stored data
Framework doesn't auto-escape XML

Secure Patterns

ElementTree with Proper API

# SECURE - Using ElementTree API (recommended)
import xml.etree.ElementTree as ET

def create_user_xml(username, email):
    """
    Create XML document using ElementTree API.
    ElementTree automatically escapes special characters.
    """
    # Validate inputs
    if not isinstance(username, str) or len(username) > 100:
        raise ValueError("Invalid username")
    if not isinstance(email, str) or len(email) > 255:
        raise ValueError("Invalid email")

    # SECURE - Use ElementTree API
    root = ET.Element('user')

    username_elem = ET.SubElement(root, 'username')
    username_elem.text = username  # Automatically escaped

    email_elem = ET.SubElement(root, 'email')
    email_elem.text = email  # Automatically escaped

    # Convert to string
    xml_str = ET.tostring(root, encoding='unicode')
    return f'<?xml version="1.0"?>\n{xml_str}'

# Example usage
xml = create_user_xml(
    username="<script>alert('xss')</script>",
    email="test@example.com"
)
# Result: <username>&lt;script&gt;alert('xss')&lt;/script&gt;</username>
# Special characters properly escaped

Why this works: xml.etree.ElementTree (ET.Element(), ET.SubElement(), .text = value) is Python's standard library XML API that automatically escapes XML special characters (<, >, &, ', ") when setting element text content. The .text property assignment treats the value as character data, not markup - so even if username contains </username><admin>true</admin>, it becomes </username><admin>true</admin> in the output, preventing structural changes.

Pre-validation (isinstance(), len() checks) provides defense-in-depth - rejecting overly long inputs prevents DoS and limits attack surface. ET.tostring() with encoding='unicode' returns a string (not bytes), making it easy to prepend the XML declaration (<?xml version="1.0"?>). ElementTree is lightweight (no external dependencies), fast (written in C), and safe by default - it doesn't support dangerous features like DTDs or entity expansion that can lead to XXE attacks.

This pattern is immune to injection because ElementTree builds a tree structure internally and serializes it, never concatenating raw strings. Recommended for all Python XML generation unless you need advanced features like XPath (use lxml) or XXE protection for parsing (use defusedxml).

lxml with Safe API

# SECURE - Using lxml with proper API
from lxml import etree
import re

def create_xml_with_lxml(data_dict):
    """
    Create XML using lxml's safe API.
    """
    # Validate input
    for key, value in data_dict.items():
        if not isinstance(key, str) or not isinstance(value, str):
            raise ValueError("Keys and values must be strings")
        if len(value) > 1000:
            raise ValueError("Value too long")

    # SECURE - Use lxml Element API
    root = etree.Element('response')
    status = etree.SubElement(root, 'status')
    status.text = 'success'

    data_elem = etree.SubElement(root, 'data')

    for key, value in data_dict.items():
        # Validate key name (XML element names)
        if not re.match(r'^[a-zA-Z_][a-zA-Z0-9_-]*$', key):
            raise ValueError(f"Invalid XML element name: {key}")

        item = etree.SubElement(data_elem, key)
        item.text = value  # Automatically escaped

    return etree.tostring(
        root,
        pretty_print=True,
        encoding='unicode',
        xml_declaration=True
    )

# Example usage
xml = create_xml_with_lxml({
    'name': '<admin>true</admin>',
    'email': 'test@example.com'
})
# Malicious input properly escaped

Why this works: lxml (etree.Element(), etree.SubElement(), .text = value) is a feature-rich C-based XML library that provides automatic escaping like ElementTree but with additional capabilities (XPath, XSLT, XML Schema validation). The .text property assignment treats content as character data, escaping special characters to prevent injection of elements like </data><malicious>payload</malicious>.

Element name validation (re.match(r'^[a-zA-Z_][a-zA-Z0-9_-]*$', key)) enforces XML naming rules - element names must start with a letter/underscore, preventing names like 123invalid or <script>. Length limits (len(value) > 1000) prevent DoS via extremely large values that consume memory. etree.tostring() with xml_declaration=True produces a complete XML document with <?xml version="1.0"?> header. pretty_print=True adds indentation for human-readable output. encoding='unicode' returns a string (not bytes), avoiding encoding issues in Python 3.

lxml is faster than ElementTree for large documents and supports XPath for querying (root.xpath("//item[@type='active']")) and XSLT for transformations. Use lxml when you need advanced XML features; use ElementTree for simpler use cases. Both are safe from injection when using the Element API.

Flask with defusedxml

# SECURE - Flask endpoint with defusedxml
from flask import Flask, request, Response
import xml.etree.ElementTree as ET
from defusedxml import ElementTree as DefusedET
import re

app = Flask(__name__)

def validate_username(username):
    """Validate username format"""
    if not username or len(username) > 50:
        return False
    return bool(re.match(r'^[a-zA-Z0-9._-]+$', username))

def validate_email(email):
    """Validate email format"""
    if not email or len(email) > 255:
        return False
    return bool(re.match(r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$', email))

@app.route('/api/user', methods=['GET'])
def get_user():
    username = request.args.get('username', '')
    email = request.args.get('email', '')

    # SECURE - Validate inputs
    if not validate_username(username):
        return Response(
            '<error>Invalid username</error>',
            status=400,
            mimetype='application/xml'
        )

    if not validate_email(email):
        return Response(
            '<error>Invalid email</error>',
            status=400,
            mimetype='application/xml'
        )

    # SECURE - Build XML with ElementTree
    root = ET.Element('response')
    user_elem = ET.SubElement(root, 'user')

    name_elem = ET.SubElement(user_elem, 'name')
    name_elem.text = username

    email_elem = ET.SubElement(user_elem, 'email')
    email_elem.text = email

    xml_str = ET.tostring(root, encoding='unicode')
    full_xml = f'<?xml version="1.0" encoding="UTF-8"?>\n{xml_str}'

    return Response(full_xml, mimetype='application/xml')

if __name__ == '__main__':
    app.run()

Why this works: Flask integration with ElementTree (ET.Element(), .text = value) demonstrates how to safely return XML responses in web APIs. Regex validation (^[a-zA-Z0-9._-]+$ for usernames, email pattern) blocks XML metacharacters (<, >, &) and XPath metacharacters (', "), preventing injection before it reaches the XML API. Early validation with 400 errors provides fail-fast behavior - invalid requests are rejected immediately without expensive XML processing.

Generic error messages ("Invalid username", "Invalid email") prevent information disclosure - attackers don't learn whether rejection was due to length, character set, or format checks. Response() with mimetype='application/xml' sets the Content-Type header (application/xml), ensuring browsers and clients parse the response correctly (not as HTML). defusedxml (imported but used for parsing untrusted XML inputs in other endpoints) provides XXE protection - it blocks DOCTYPE declarations, entity expansion, and external entity references that can lead to file disclosure or SSRF.

ElementTree escaping handles user-generated content safely. This pattern is ideal for RESTful APIs returning XML to clients, with Flask's routing (@app.route()) and request handling (request.args.get()).

SOAP with Zeep Library

# SECURE - Using zeep library for SOAP
from zeep import Client
from zeep.wsse.username import UsernameToken

def call_soap_service_secure(user_id, action):
    """
    Use zeep library which handles XML construction safely.
    """
    # Validate inputs
    if not isinstance(user_id, (int, str)) or not str(user_id).isalnum():
        raise ValueError("Invalid user_id")

    if not action or len(action) > 50:
        raise ValueError("Invalid action")

    # SECURE - zeep handles XML construction
    wsdl_url = 'https://api.example.com/service?wsdl'
    client = Client(
        wsdl_url,
        wsse=UsernameToken('username', 'password')
    )

    # SECURE - zeep automatically escapes parameters
    result = client.service.GetUserData(
        UserId=user_id,
        Action=action
    )

    return result

# zeep library builds proper SOAP envelope with escaping

Why this works: Zeep (Client(), client.service.MethodName()) is a modern Python SOAP client that automatically constructs SOAP envelopes with proper XML escaping, eliminating manual template literals like f"<UserId>{user_id}</UserId>". The Client(wsdl_url) constructor fetches the WSDL document, parses the service definition, and generates type-safe method calls (client.service.GetUserData()) that match the SOAP operations defined in the WSDL.

Parameter passing (UserId=user_id, Action=action) is serialized to XML by Zeep's lxml-based serializer, which automatically escapes special characters - even if user_id contains </UserId><Role>admin</Role>, it becomes </UserId><Role>.... Pre-validation (isinstance(), isalnum(), length checks) provides defense-in-depth and prevents DoS. WS-Security support (UsernameToken()) adds authentication headers to the SOAP envelope, handling credentials securely. Zeep handles namespaces automatically (SOAP-ENV, xsd, xsi), which is error-prone in manual construction.

The library parses SOAP responses into Python dictionaries/objects, avoiding manual XML parsing vulnerabilities. Successor to suds-jurko (deprecated), Zeep supports SOAP 1.1/1.2, WSDL 1.1/2.0, and is actively maintained. This pattern is ideal for enterprise SOAP integrations where manual envelope construction is error-prone and WSDL compliance is required.

Django with xml.etree

# SECURE - Django model export with ElementTree
from django.http import HttpResponse
import xml.etree.ElementTree as ET
import re

def validate_search_term(search):
    """Validate search term"""
    return bool(re.match(r'^[a-zA-Z0-9\s]{1,50}$', search))

def export_users_xml(request):
    from myapp.models import User

    search_term = request.GET.get('search', '')

    # SECURE - Validate input
    if not validate_search_term(search_term):
        return HttpResponse(
            '<error>Invalid search term</error>',
            content_type='application/xml',
            status=400
        )

    users = User.objects.filter(
        username__icontains=search_term
    )[:100]  # Limit results

    # SECURE - Build XML with ElementTree
    root = ET.Element('users')
    root.set('count', str(len(users)))

    for user in users:
        user_elem = ET.SubElement(root, 'user')
        user_elem.set('id', str(user.id))

        username_elem = ET.SubElement(user_elem, 'username')
        username_elem.text = user.username  # Auto-escaped

        email_elem = ET.SubElement(user_elem, 'email')
        email_elem.text = user.email  # Auto-escaped

        # SECURE - Even user-generated content is escaped
        if user.bio:
            bio_elem = ET.SubElement(user_elem, 'bio')
            bio_elem.text = user.bio  # Auto-escaped

    xml_str = ET.tostring(root, encoding='unicode')
    full_xml = f'<?xml version="1.0"?>\n{xml_str}'

    return HttpResponse(full_xml, content_type='application/xml')

Why this works: Django ORM integration with ElementTree demonstrates how to safely export database records to XML, addressing the common pitfall of assuming database content is safe. Regex validation on the search term (^[a-zA-Z0-9\s]{1,50}$) blocks XML metacharacters in user queries, preventing injection via query parameters. .filter() with __icontains performs case-insensitive substring search on the database, but the search term is validated before querying. Result limiting ([:100]) prevents DoS by capping the number of records processed and returned.

root.set('count', str(len(users))) demonstrates safe attribute setting - ElementTree escapes attribute values automatically. .text = user.username and .text = user.email show that even database content is escaped - if a user's bio contains </bio><admin>true</admin>, it becomes </bio>... in the XML output, preventing stored injection (like stored XSS but for XML). HttpResponse() with content_type='application/xml' sets the Content-Type header.

Django's ORM prevents SQL injection (via parameterized queries), but doesn't handle XML escaping - that's ElementTree's job. This pattern is ideal for Django REST APIs that need to export data in XML format (e.g., for legacy clients), with Django's view routing (def export_users_xml(request)) and ORM (User.objects.filter()).

XPath with Parameterization

# SECURE - XPath with safe practices
from lxml import etree
import re

def validate_username(username):
    """Allowlist validation"""
    return bool(re.match(r'^[a-zA-Z0-9._-]{1,50}$', username))

def find_user_by_name_secure(xml_doc, username):
    """
    Secure XPath query using ElementTree.
    """
    # SECURE - Validate input
    if not validate_username(username):
        raise ValueError("Invalid username format")

    root = etree.fromstring(xml_doc)

    # SECURE - Use XPath with proper escaping
    # Option 1: Iterate and compare (safest)
    for user in root.xpath('//user'):
        name_elem = user.find('name')
        if name_elem is not None and name_elem.text == username:
            return user

    return None

# Alternative: Use lxml's XPath variables (also secure)
def find_user_with_xpath_vars(xml_doc, username):
    if not validate_username(username):
        raise ValueError("Invalid username")

    root = etree.fromstring(xml_doc)

    # SECURE - XPath with variables
    results = root.xpath(
        "//user[name=$username]",
        username=username  # Passed as variable, not string concat
    )

    return results[0] if results else None

Why this works: XPath injection is a different attack vector from XML injection - instead of injecting XML elements, attackers manipulate XPath query logic (similar to SQL injection). String concatenation (f"//user[name='{username}']") allows injection like username = "' or '1'='1", resulting in //user[name='' or '1'='1'] which returns all users instead of filtering by name. Regex validation (^[a-zA-Z0-9._-]{1,50}$) blocks XPath metacharacters (', ", [, ], (), =) before they reach the query, making injection structurally impossible.

Iterate-and-compare (for user in root.xpath('//user') + name_elem.text == username) is the safest pattern - it retrieves all candidate elements with a static XPath expression ('//user') and filters in Python code. This avoids dynamic query construction entirely. lxml's XPath variables (root.xpath("//user[name=$username]", username=username)) provide parameterization - the variable is passed separately and lxml handles escaping internally. This is similar to parameterized SQL queries.

Note: Not all Python XML libraries support XPath variables - ElementTree doesn't support XPath at all (only limited XPath-like find() methods), so lxml is required for complex queries. Use iterate-and-compare for maximum safety, XPath variables for readability, and never concatenate strings in XPath expressions.

Verification

After implementing the recommended secure patterns, verify the fix through multiple approaches:

Manual testing: Submit malicious payloads relevant to this vulnerability and confirm they're handled safely without executing unintended operations
Code review: Confirm all instances use the secure pattern (parameterized queries, safe APIs, proper encoding) with no string concatenation or unsafe operations
Static analysis: Use security scanners to verify no new vulnerabilities exist and the original finding is resolved
Regression testing: Ensure legitimate user inputs and application workflows continue to function correctly
Edge case validation: Test with special characters, boundary conditions, and unusual inputs to verify proper handling
Framework verification: If using a framework or library, confirm the recommended APIs are used correctly according to documentation
Authentication/session testing: Verify security controls remain effective and cannot be bypassed (if applicable to the vulnerability type)
Rescan: Run the security scanner again to confirm the finding is resolved and no new issues were introduced

CWE-91: XML Injection - Python

Overview

Common Vulnerable Patterns

String Concatenation

Flask/Django XML API

SOAP Request Construction

XML Configuration Files

lxml with String Building

XPath Query Injection

XML Attribute Injection

Django Model to XML

Secure Patterns

ElementTree with Proper API

lxml with Safe API

Flask with defusedxml

SOAP with Zeep Library

Django with xml.etree

XPath with Parameterization

Verification

Additional Resources