CWE-502: Insecure Deserialization - Python
Overview
Python's pickle module can execute arbitrary code during deserialization, making it extremely dangerous when used with untrusted data. Attackers can craft malicious pickle payloads that execute commands when unpickled.
Primary Defence: Use JSON (json.loads()), MessagePack, or Protocol Buffers for data serialization instead of pickle. Never unpickle untrusted data.
Common Vulnerable Patterns
pickle.loads() with Untrusted Data
```python
# VULNERABLE - Never unpickle untrusted data!
import pickle

def load_user(data):
    # DANGEROUS: Can execute arbitrary code
    user = pickle.loads(data)
    return user

# Attacker can craft a malicious payload that runs, e.g.:
# import os; os.system('rm -rf /')
```
Why this is vulnerable:
- Executes attacker-controlled opcodes during unpickling.
- Invokes `__reduce__`/`__setstate__` hooks.
- Can import modules and run system commands.
- Enables RCE before validation occurs.
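To make the mechanism concrete, here is a minimal, harmless sketch of how `__reduce__` smuggles a callable into a pickle stream. It uses `eval` on an arithmetic string purely for demonstration; a real attacker would substitute something like `os.system`:

```python
import pickle

class Payload:
    def __reduce__(self):
        # pickle records (callable, args); pickle.loads() invokes the callable
        return (eval, ("6 * 7",))

blob = pickle.dumps(Payload())
result = pickle.loads(blob)  # eval runs during unpickling
print(result)  # 42 - the callable executed before any validation could run
```

The callable executes as a side effect of parsing the byte stream, which is why no amount of post-load validation can make `pickle.loads()` safe.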
pickle.load() from Untrusted File
```python
# VULNERABLE - File could contain malicious pickle
import pickle

def load_from_file(filename):
    with open(filename, 'rb') as f:
        data = pickle.load(f)  # RCE if file is malicious!
    return data
```
Why this is vulnerable:
- File contents can be attacker-controlled.
- `pickle.load()` executes during read.
- Payloads run as the app user.
- Easy to trigger via uploads or traversal.
Django Session Deserialization
```python
# VULNERABLE - Using pickle for sessions
SESSION_SERIALIZER = 'django.contrib.sessions.serializers.PickleSerializer'
# Attacker can manipulate the session cookie to execute code!
```
Why this is vulnerable:
- If an attacker can forge or tamper with session data, pickle deserialization runs.
- Pickle deserialization runs on every request.
- Magic hooks can execute code server-side.
- Enables persistent RCE via session tampering.
PyYAML unsafe_load()
```python
# VULNERABLE - yaml.unsafe_load can execute Python code
import yaml

def load_config(yaml_data):
    config = yaml.unsafe_load(yaml_data)  # DANGEROUS!
    return config

# Can execute: !!python/object/apply:os.system ["rm -rf /"]
```
Why this is vulnerable:
- Supports arbitrary object construction tags.
- Can call functions via `!!python/object/apply`.
- Executes code during parsing.
- `yaml.unsafe_load()` is explicitly unsafe for untrusted YAML.
jsonpickle with Untrusted Data
```python
# VULNERABLE - jsonpickle can instantiate arbitrary classes
import jsonpickle

def deserialize(data):
    obj = jsonpickle.decode(data)  # Can deserialize any class!
    return obj
```
Why this is vulnerable:
- Embeds class metadata in JSON payloads.
- Instantiates attacker-chosen classes.
- `__init__` and property hooks can execute code.
- Bypasses JSON's data-only safety.
pandas.read_pickle() with Untrusted Data
```python
# VULNERABLE - pandas uses pickle internally
import pandas as pd

def load_dataframe(filename):
    # DANGEROUS: Executes arbitrary code if file is malicious!
    df = pd.read_pickle(filename)
    return df

# Or from untrusted bytes:
def load_from_bytes(data):
    import io
    # DANGEROUS: Can execute code during deserialization
    df = pd.read_pickle(io.BytesIO(data))
    return df
```
Why this is vulnerable:
- `read_pickle()` uses `pickle.load()` internally.
- A malicious pickle payload executes on deserialization.
- File uploads or user-provided paths can trigger RCE.
- No validation occurs before code execution.
- Works with files, BytesIO, or any file-like object.
pandas.DataFrame.to_pickle() Data Tampering
```python
# VULNERABLE - Storing pickled data accessible to users
import pandas as pd

def save_user_data(user_id, dataframe):
    # Saves pickled DataFrame to predictable path
    filename = f'/tmp/user_data_{user_id}.pkl'
    dataframe.to_pickle(filename)

def load_user_data(user_id):
    filename = f'/tmp/user_data_{user_id}.pkl'
    # DANGEROUS: Attacker can replace file with malicious pickle
    return pd.read_pickle(filename)
```
Why this is vulnerable:
- Pickle files can be modified by attackers.
- No integrity checking or signatures.
- Predictable paths enable tampering.
- Loading modified pickle executes attacker code.
- Trust boundary violated when files leave app control.
Secure Patterns
Use JSON Instead of Pickle
```python
# SECURE - JSON cannot execute code
import json

def save_user(user):
    user_dict = {
        'name': user.name,
        'email': user.email,
        'age': user.age
    }
    return json.dumps(user_dict)

def load_user(json_data):
    # JSON only creates basic Python types (dict, list, str, int, etc.)
    user_dict = json.loads(json_data)
    # Manually reconstruct the object (User is your application's class)
    user = User(
        name=user_dict['name'],
        email=user_dict['email'],
        age=user_dict['age']
    )
    return user

# Example
user = User(name="John", email="john@example.com", age=30)
json_str = save_user(user)
restored_user = load_user(json_str)
```
Why this works:
- Builds only primitive types and collections.
- No object instantiation or code execution.
- Ignores class/module metadata entirely.
- Forces explicit reconstruction with validation.
- Prevents gadget chains by design.
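As a quick illustration, even a payload that mimics jsonpickle-style class metadata comes back from `json.loads()` as an inert dict (the key names below are hypothetical attacker input, not a real API):

```python
import json

# Attacker-style payload embedding class metadata - harmless under json
payload = '{"py/object": "subprocess.Popen", "args": [["whoami"]]}'
obj = json.loads(payload)

# json never imports or instantiates anything; the result is just a dict
print(type(obj).__name__)  # dict
```

The metadata keys are ordinary strings to `json`, so nothing is imported, constructed, or executed.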
PyYAML with safe_load()
```python
# SECURE - safe_load only creates safe Python types
import yaml

def load_config(yaml_data):
    # safe_load prevents arbitrary code execution
    config = yaml.safe_load(yaml_data)
    return config

# Example YAML
yaml_data = """
database:
  host: localhost
  port: 5432
  name: mydb
"""
config = yaml.safe_load(yaml_data)
print(config['database']['host'])  # localhost
```
Why this works:
- Restricts tags to safe scalar/collection types.
- Blocks object construction tags by default.
- No constructor or function invocation.
- Rejects unsafe YAML with clear errors.
- Safe for configs and data exchange.
msgpack for Binary Serialization
```python
# SECURE - MessagePack is a safe binary format
import msgpack

def serialize(data):
    # Only serializes basic types
    return msgpack.packb(data)

def deserialize(packed_data):
    # Cannot execute code
    return msgpack.unpackb(packed_data, raw=False)

# Example
user_data = {'name': 'John', 'email': 'john@example.com'}
packed = serialize(user_data)
restored = deserialize(packed)
```
Why this works:
- Encodes only primitives and collections.
- `unpackb()` returns built-in types only.
- No object metadata or callables.
- Language-agnostic, data-only format.
- Safer binary alternative to pickle.
Installation: `pip install msgpack`
pandas with Safe Formats (CSV, Parquet, Feather)
```python
# SECURE - Use CSV, Parquet, or Feather instead of pickle
import pandas as pd

# Option 1: CSV (human-readable, widely compatible)
def save_dataframe_csv(df, filename):
    df.to_csv(filename, index=False)

def load_dataframe_csv(filename):
    # CSV is safe - only contains data, no code
    return pd.read_csv(filename)

# Option 2: Parquet (efficient, compressed, type-safe)
def save_dataframe_parquet(df, filename):
    df.to_parquet(filename, engine='pyarrow', compression='snappy')

def load_dataframe_parquet(filename):
    # Parquet is safe - binary format, no code execution
    return pd.read_parquet(filename, engine='pyarrow')

# Option 3: Feather (fast, preserves types)
def save_dataframe_feather(df, filename):
    df.to_feather(filename)

def load_dataframe_feather(filename):
    # Feather is safe - columnar format, data only
    return pd.read_feather(filename)

# Option 4: HDF5 (for large datasets, trusted sources only)
def save_dataframe_hdf(df, filename):
    df.to_hdf(filename, key='data', mode='w')

def load_dataframe_hdf(filename):
    # HDF5 should only be used with trusted files
    return pd.read_hdf(filename, key='data')
```
Why this works:
- CSV/Parquet/Feather are data-only formats.
- No object deserialization or code execution.
- Preserves DataFrame structure and types.
- Better performance than pickle in many cases.
- Cross-language compatibility (especially Parquet).
- Industry-standard formats for data science.
Format comparison:
| Format | Safety | Speed | Size | Type Preservation | Use Case |
|---|---|---|---|---|---|
| CSV | ✅ Safe | Slow | Large | Limited | Human-readable, universal |
| Parquet | ✅ Safe | Fast | Small | Excellent | Production, big data |
| Feather | ✅ Safe | Very Fast | Medium | Excellent | Inter-process, caching |
| HDF5 | ⚠️ Trusted-only | Fast | Small | Good | Scientific, time-series |
| Pickle | ❌ Unsafe | Medium | Medium | Perfect | Never use with untrusted data |
Installation for optional formats:
```shell
# For Parquet
pip install pyarrow
# or
pip install fastparquet

# For HDF5 (trusted sources only)
pip install tables
```
pandas with JSON for Simple DataFrames
```python
# SECURE - JSON for DataFrames with basic types
import pandas as pd
import json

def save_dataframe_json(df, filename):
    # Convert to JSON with proper orientation
    df.to_json(filename, orient='records', lines=True)

def load_dataframe_json(filename):
    # JSON is safe - no code execution
    return pd.read_json(filename, orient='records', lines=True)

# Alternative: For API responses or small data
def dataframe_to_dict(df):
    # Convert to list of dicts
    return df.to_dict(orient='records')

def dict_to_dataframe(data):
    # Safe reconstruction from dicts
    return pd.DataFrame(data)

# Usage
df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie'],
    'age': [25, 30, 35],
    'city': ['NYC', 'LA', 'Chicago']
})

# Save and load
save_dataframe_json(df, 'data.json')
restored_df = load_dataframe_json('data.json')

# API serialization
data_dict = dataframe_to_dict(df)
json_str = json.dumps(data_dict)
# ... send over network ...
parsed_data = json.loads(json_str)
new_df = dict_to_dataframe(parsed_data)
```
Why this works:
- JSON only contains primitive data types.
- `read_json()` doesn't execute code.
- Works well for DataFrames with simple types.
- Human-readable and debuggable.
- Perfect for APIs and web applications.
- No code execution vectors.
Note: JSON is less efficient than Parquet/Feather for large datasets and doesn't preserve all pandas dtypes perfectly (e.g., timezone-aware datetimes need special handling).
Python Library Safety Matrix
When reviewing code, use this matrix to identify unsafe deserialization libraries:
Safe Alternatives
json (standard library)
- Use instead of pickle for all serialization needs
- Cannot execute code or instantiate arbitrary classes
- Only creates basic Python types (dict, list, str, int, float, bool, None)
PyYAML with safe_load()
- NOT yaml.load() - that's unsafe!
- Only creates safe Python objects
```python
import yaml

data = yaml.safe_load(input)  # SAFE
# yaml.load(input) is UNSAFE - allows arbitrary objects
```
msgpack
- Safe binary serialization format
- Fast and compact
- Cannot instantiate classes
pandas safe formats
- Use CSV, Parquet, or Feather instead of pickle
- HDF5 should only be used with trusted sources
```python
import pandas as pd

# Safe alternatives
df = pd.read_csv('data.csv')             # Safe
df = pd.read_parquet('data.parquet')     # Safe
df = pd.read_feather('data.feather')     # Safe
df = pd.read_hdf('data.h5', key='data')  # Safe only with trusted files
df = pd.read_json('data.json')           # Safe
```
NEVER Use with Untrusted Data
pickle / cPickle / _pickle
- Always allows arbitrary code execution
- Can invoke the `__reduce__` method to execute commands
- Replace with json or msgpack
marshal
- Similar to pickle, allows code execution
- Designed for internal Python use only
- Never use with external data
shelve
- Uses pickle internally
- Inherits all pickle vulnerabilities
- Replace with JSON-based storage
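One way to replace simple `shelve` usage is a JSON-backed key/value file. The `JsonStore` class below is a hypothetical minimal sketch, not a library API; it only handles JSON-serializable values and does no locking:

```python
import json
import os
import tempfile

class JsonStore:
    """Tiny shelve-like key/value store backed by a JSON file (data only)."""
    def __init__(self, path):
        self.path = path
        self.data = {}
        if os.path.exists(path):
            with open(path) as f:
                self.data = json.load(f)  # cannot execute code

    def __setitem__(self, key, value):
        self.data[key] = value
        with open(self.path, 'w') as f:
            json.dump(self.data, f)

    def __getitem__(self, key):
        return self.data[key]

# Usage
path = os.path.join(tempfile.mkdtemp(), 'settings.json')
store = JsonStore(path)
store['theme'] = 'dark'
print(JsonStore(path)['theme'])  # dark
```

Because the on-disk format is plain JSON, a tampered file can at worst corrupt data; it can never execute code the way a modified shelve database can.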
PyYAML yaml.load()
- Allows arbitrary object instantiation
- Deprecated in favor of safe_load()
- Always use yaml.safe_load() instead
```python
import yaml

yaml.load(data)         # UNSAFE - deprecated without an explicit Loader
yaml.unsafe_load(data)  # UNSAFE - explicitly dangerous
yaml.safe_load(data)    # SAFE
```
jsonpickle.decode()
- Deserializes to Python objects
- Can instantiate arbitrary classes
- Use json.loads() instead
pandas.read_pickle()
- Uses pickle internally, inherits all pickle vulnerabilities
- Can execute arbitrary code during DataFrame loading
- Use pd.read_csv(), pd.read_parquet(), or pd.read_feather() instead
```python
import pandas as pd
from io import BytesIO

df = pd.read_pickle('data.pkl')  # NEVER with untrusted data - RCE vulnerability!
df = pd.read_pickle(BytesIO(untrusted_bytes))  # DANGEROUS!
```
Migration Recommendations
If you find these patterns in security scan results:
- `pickle.loads()` → Switch to `json.loads()`
- `pickle.load()` → Switch to `json.load()`
- `marshal.loads()` → Switch to `json.loads()`
- `shelve.open()` → Use JSON files or a database
- `yaml.load()` → Switch to `yaml.safe_load()`
- `jsonpickle.decode()` → Switch to `json.loads()`
- `pd.read_pickle()` → Switch to `pd.read_parquet()`, `pd.read_csv()`, or `pd.read_feather()`
Example migration:
```python
# BEFORE (Unsafe)
import pickle
user = pickle.loads(request.data)

# AFTER (Safe)
import json
user_data = json.loads(request.data)
user = User(**user_data)  # Manually construct object

# BEFORE (Unsafe)
import pandas as pd
df = pd.read_pickle('user_data.pkl')

# AFTER (Safe) - Option 1: Parquet (recommended for performance)
import pandas as pd
df = pd.read_parquet('user_data.parquet')

# AFTER (Safe) - Option 2: CSV (for human-readable data)
import pandas as pd
df = pd.read_csv('user_data.csv')

# AFTER (Safe) - Option 3: Feather (for fast I/O)
import pandas as pd
df = pd.read_feather('user_data.feather')
```
Django with JSON Serializer
```python
# SECURE - Use JSON for Django sessions
# settings.py
SESSION_SERIALIZER = 'django.contrib.sessions.serializers.JSONSerializer'

# Or for custom serialization:
from django.core.serializers import serialize, deserialize

# Serialize Django models
json_data = serialize('json', User.objects.all())

# Deserialize
users = list(deserialize('json', json_data))
```
Restricted Unpickler (If Pickle Required)
```python
# SECURE - Allow only an explicit allowlist of classes
import pickle
import io

class RestrictedUnpickler(pickle.Unpickler):
    """Permit only allowlisted classes to be unpickled"""
    ALLOWED_CLASSES = {
        ('__main__', 'User'),
        ('__main__', 'Address'),
        ('builtins', 'dict'),
        ('builtins', 'list'),
        ('builtins', 'str'),
        ('builtins', 'int'),
    }

    def find_class(self, module, name):
        if (module, name) not in self.ALLOWED_CLASSES:
            raise pickle.UnpicklingError(
                f"Class {module}.{name} is not allowed"
            )
        return super().find_class(module, name)

def safe_unpickle(data):
    return RestrictedUnpickler(io.BytesIO(data)).load()

# Usage
try:
    obj = safe_unpickle(untrusted_data)
except pickle.UnpicklingError as e:
    print(f"Unsafe pickle rejected: {e}")
```
Framework-Specific Guidance
Django
```python
# SECURE - Django REST Framework with JSON
from rest_framework import serializers, viewsets
from rest_framework.response import Response

class UserSerializer(serializers.ModelSerializer):
    class Meta:
        model = User
        fields = ['id', 'name', 'email', 'age']

class UserViewSet(viewsets.ModelViewSet):
    queryset = User.objects.all()
    serializer_class = UserSerializer

    def create(self, request):
        # DRF automatically deserializes JSON to User model
        serializer = self.get_serializer(data=request.data)
        serializer.is_valid(raise_exception=True)
        user = serializer.save()
        return Response(serializer.data)

# settings.py - Use JSON session serializer
SESSION_SERIALIZER = 'django.contrib.sessions.serializers.JSONSerializer'
# Never use:
# SESSION_SERIALIZER = 'django.contrib.sessions.serializers.PickleSerializer'  # INSECURE
```
Why this works:
- DRF parses JSON into primitive types, not arbitrary Python objects.
- Serializers validate and coerce fields before saving models.
- JSON session serializer avoids pickle gadgets on request load.
Flask
```python
# SECURE - Flask with JSON
from flask import Flask, request, jsonify
from dataclasses import dataclass, asdict

app = Flask(__name__)

@dataclass
class User:
    name: str
    email: str
    age: int

@app.route('/users', methods=['POST'])
def create_user():
    # Flask automatically parses JSON
    data = request.get_json()
    # Manually construct the object from the dict
    user = User(
        name=data['name'],
        email=data['email'],
        age=data['age']
    )
    # Save user...
    return jsonify(asdict(user)), 201

@app.route('/users/<int:user_id>')
def get_user(user_id):
    user = get_user_from_db(user_id)
    # JSON serialization is safe
    return jsonify(asdict(user))

# For sessions, Flask uses signed cookies (integrity, not confidentiality)
app.config['SECRET_KEY'] = 'generate-strong-random-key'
```
Why this works:
- `request.get_json()` yields basic types only (dict/list/str/int).
- The dataclass is constructed explicitly from validated fields.
- Signed cookies prevent tampering without exposing server-side objects.
FastAPI
```python
# SECURE - FastAPI with Pydantic (JSON-based)
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel, EmailStr, Field

app = FastAPI()

class User(BaseModel):
    name: str = Field(..., min_length=1, max_length=100)
    email: EmailStr
    age: int = Field(..., ge=0, le=150)

@app.post("/users", response_model=User)
async def create_user(user: User):
    # FastAPI automatically validates and deserializes JSON
    # Pydantic ensures type safety
    save_user(user)
    return user

@app.get("/users/{user_id}", response_model=User)
async def get_user(user_id: int):
    user = get_user_from_db(user_id)
    if not user:
        raise HTTPException(status_code=404)
    return user
```
Why this works:
- Pydantic validates and parses JSON into a safe, typed model.
- No arbitrary class instantiation from attacker-supplied metadata.
- Responses are serialized to JSON without executing code.
Data Classes with JSON
```python
# SECURE - Python 3.7+ dataclasses with JSON
from dataclasses import dataclass, asdict
import json

@dataclass
class User:
    name: str
    email: str
    age: int

def serialize_user(user: User) -> str:
    return json.dumps(asdict(user))

def deserialize_user(json_str: str) -> User:
    data = json.loads(json_str)
    return User(**data)

# Usage
user = User(name="John", email="john@example.com", age=30)
json_str = serialize_user(user)
restored = deserialize_user(json_str)
```
Input Validation
```python
# Validate after deserialization
import json
from pydantic import BaseModel, validator, EmailStr

class User(BaseModel):
    name: str
    email: EmailStr
    age: int

    @validator('name')
    def name_must_not_be_empty(cls, v):
        if not v or not v.strip():
            raise ValueError('Name cannot be empty')
        return v

    @validator('age')
    def age_must_be_reasonable(cls, v):
        if v < 0 or v > 150:
            raise ValueError('Age must be between 0 and 150')
        return v

# Usage
try:
    user_data = json.loads(untrusted_json)
    user = User(**user_data)  # Validates during construction
except ValueError as e:
    print(f"Validation error: {e}")
```
Important: Validation is only safe after deserializing data-only formats like JSON or MessagePack. It is not sufficient for unsafe deserialization formats (pickle, yaml.unsafe_load, marshal) because code can execute during parsing.
Signature Verification
```python
# SECURE - Verify HMAC before deserializing
import hmac
import hashlib
import json

class SignedSerializer:
    def __init__(self, secret_key: bytes):
        self.secret_key = secret_key

    def serialize(self, obj: dict) -> bytes:
        json_data = json.dumps(obj).encode('utf-8')
        # Create HMAC
        signature = hmac.new(
            self.secret_key,
            json_data,
            hashlib.sha256
        ).digest()
        # Return signature + data
        return signature + json_data

    def deserialize(self, signed_data: bytes) -> dict:
        # Extract signature and data
        signature = signed_data[:32]  # SHA-256 digest is 32 bytes
        json_data = signed_data[32:]
        # Verify signature
        expected_signature = hmac.new(
            self.secret_key,
            json_data,
            hashlib.sha256
        ).digest()
        if not hmac.compare_digest(signature, expected_signature):
            raise ValueError("Invalid signature")
        # Only deserialize if signature is valid
        return json.loads(json_data)

# Usage
serializer = SignedSerializer(b'your-secret-key-here')
signed = serializer.serialize({'user': 'john', 'role': 'admin'})
data = serializer.deserialize(signed)
```
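The same sign-then-verify idea in a compact stdlib-only sketch, showing that changing even one field invalidates the MAC (the key below is a placeholder; use a strong random key in practice):

```python
import hmac
import hashlib
import json

key = b'demo-secret-key'  # placeholder key for demonstration only

payload = json.dumps({'user': 'john', 'role': 'user'}).encode('utf-8')
sig = hmac.new(key, payload, hashlib.sha256).digest()

def verify(data, signature):
    expected = hmac.new(key, data, hashlib.sha256).digest()
    return hmac.compare_digest(signature, expected)

# An attacker edits the serialized bytes to escalate privileges
tampered = payload.replace(b'"role": "user"', b'"role": "admin"')
print(verify(payload, sig))   # True
print(verify(tampered, sig))  # False - tampering detected before json.loads
```

Because verification happens before `json.loads()`, tampered bytes are rejected without ever being parsed.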
Verification
After implementing the recommended secure patterns, verify the fix through multiple approaches:
- Manual testing: Submit malicious payloads relevant to this vulnerability and confirm they're handled safely without executing unintended operations
- Code review: Confirm all instances use safe deserialization APIs and reject unsafe formats
- Static analysis: Use security scanners to verify no unsafe deserialization patterns remain
- Regression testing: Ensure legitimate user inputs and application workflows continue to function correctly
- Edge case validation: Test with special characters, boundary conditions, and unusual inputs to verify proper handling
- Framework verification: If using a framework or library, confirm the recommended APIs are used correctly according to documentation
- Authentication/session testing: Verify security controls remain effective and cannot be bypassed (if applicable to the vulnerability type)
- Rescan: Run the security scanner again to confirm the finding is resolved and no new issues were introduced
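For the manual-testing and regression-testing steps, a small test can assert that a `__reduce__`-based payload is rejected. This sketch uses a deliberately strict, hypothetical unpickler that refuses every global reference (a real allowlist, like the `RestrictedUnpickler` shown earlier, would be looser):

```python
import io
import pickle

class DenyAllUnpickler(pickle.Unpickler):
    # hypothetical test double: rejects every class/function reference
    def find_class(self, module, name):
        raise pickle.UnpicklingError(f"{module}.{name} is not allowed")

class Evil:
    def __reduce__(self):
        return (eval, ("1 + 1",))  # stand-in for a real malicious callable

blob = pickle.dumps(Evil())
try:
    DenyAllUnpickler(io.BytesIO(blob)).load()
    rejected = False
except pickle.UnpicklingError:
    rejected = True
print(rejected)  # True - the payload never executed
```

Keeping a test like this in the suite guards against a future regression that reintroduces plain `pickle.loads()`.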
Python Deserialization Library Safety Matrix
Use this reference when reviewing Python deserialization code:
Safe (Recommended)
json (standard library):
- Cannot execute code
- Only deserializes basic types (dict, list, str, int, float, bool, None)
- Use for all untrusted data
yaml.safe_load() (PyYAML):
- Only constructs standard Python objects
- Cannot instantiate arbitrary classes
- No `!!python/object` or `!!python/object/apply` tags
msgpack (MessagePack):
- Binary JSON alternative
- No code execution capability
- Efficient for large datasets
WARNING: Requires Careful Configuration
yaml.load() with SafeLoader:
- Explicitly specify `Loader=yaml.SafeLoader`
- Never use `Loader=yaml.Loader` or `Loader=yaml.UnsafeLoader`
yaml.load() without explicit loader (DEPRECATED):
```python
# DEPRECATED in PyYAML 5.1+ - will raise a warning
data = yaml.load(user_input)  # defaults to unsafe in older versions
```
- Modern PyYAML requires explicit Loader
- Update code to use `yaml.safe_load()` instead
Unsafe (Never Use with Untrusted Data)
pickle / cPickle / _pickle:
- Can execute arbitrary Python code during deserialization
- Exploits via `__reduce__`, `__setstate__`, `__getstate__` methods
- No safe configuration exists
- Only use with data you generated yourself in controlled environment
marshal:
- Low-level serialization for Python bytecode
- Can execute code
- Intended for .pyc files, not data exchange
shelve (uses pickle internally):
```python
import shelve

db = shelve.open('data.db')  # DANGEROUS if the .db file is attacker-controlled
value = db['key']            # reading unpickles - code can execute here
```
- Built on pickle
- Inherits all pickle vulnerabilities
- Only use for locally-generated data
yaml.unsafe_load() / yaml.full_load() with custom tags:
```python
import yaml

data = yaml.unsafe_load(user_input)  # EXTREMELY DANGEROUS
# OR
data = yaml.full_load(user_input)    # Can instantiate arbitrary classes
```
- Allows the `!!python/object/apply` tag for arbitrary code execution
- No legitimate use case for untrusted data
jsonpickle:
- Serializes Python objects to JSON, preserving type information
- Can instantiate arbitrary classes
- Vulnerable to gadget chains like pickle
dill (extended pickle):
- More powerful than pickle
- Same security issues, even worse
Migration Examples
From pickle to JSON:
```python
# OLD (unsafe)
import pickle
with open('data.pkl', 'rb') as f:
    data = pickle.load(f)

# NEW (safe)
import json
with open('data.json', 'r') as f:
    data = json.load(f)
```
From yaml.load() to yaml.safe_load():
```python
# OLD (unsafe in PyYAML < 5.1, deprecated in 5.1+)
import yaml
with open('config.yaml') as f:
    config = yaml.load(f)

# NEW (safe)
import yaml
with open('config.yaml') as f:
    config = yaml.safe_load(f)
```
Handling Custom Objects (Safe Pattern):
```python
# Instead of pickling custom objects, serialize to dict:
from dataclasses import dataclass, asdict
import json

@dataclass
class User:
    name: str
    email: str

# Serialize
user = User("John", "john@example.com")
user_json = json.dumps(asdict(user))

# Deserialize
user_dict = json.loads(user_json)
restored_user = User(**user_dict)
```
Key Takeaway: Python's pickle/marshal/shelve modules can execute arbitrary code and should never be used with untrusted data. Always use JSON or MessagePack for data from external sources.