Metadata Extraction
Extracting sensitive information from files and documents.
Overview
Metadata in documents can reveal:
- Author names and usernames
- Software versions
- Internal file paths
- Creation/modification dates
- Corporate network information
- Email addresses
- GPS coordinates (images)
- Internal IP addresses
- Operating system details
ExifTool
A command-line tool that reads and writes metadata across a very wide range of file formats, including images, PDFs, and Office documents.
Installation
Basic Usage
# Extract all metadata
exiftool document.pdf
# Extract from image
exiftool photo.jpg
# Extract from multiple files
exiftool *.pdf
# Recursive directory scan
exiftool -r /path/to/directory
# Output to CSV
exiftool -csv *.pdf > metadata.csv
# Output to JSON
exiftool -json document.pdf > metadata.json
# Extract specific tags
exiftool -Author -CreateDate document.pdf
# Show only filenames with specific metadata
exiftool -if '$Author =~ /john/' -filename *.pdf
Advanced Queries
# Find documents by author
exiftool -r -if '$Author =~ /smith/i' -filename /path/to/docs
# Find images with GPS coordinates (-ext lets -r recurse; a *.jpg glob only matches the current directory)
exiftool -r -ext jpg -if '$GPSLatitude' -filename -GPSLatitude -GPSLongitude .
# Find files created by specific software
exiftool -r -ext pdf -if '$Software =~ /Adobe/i' -filename .
# Extract email addresses from metadata
exiftool -r -Author -Creator -Producer . | grep -Eo '[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}'
# Find documents with internal paths (can reveal directory structure)
# -F matches the literal string; "C:\\" would reach grep as a pattern with a trailing backslash
exiftool -r -FileName -Producer -Creator . | grep -iF 'C:\'
# Find Office documents with recorded editing time (TotalEditTime is not a track-changes indicator)
exiftool -r -ext docx -if '$TotalEditTime > 0' -filename .
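The internal-path search above can also be run over `exiftool -json` output, which allows usernames to be pulled out of the paths. The following is a minimal sketch, assuming the records are the dictionaries produced by `-json`; the patterns and the function names `find_paths` and `usernames_from_paths` are illustrative and not part of ExifTool, and the patterns are far from exhaustive.

```python
import re

# Illustrative patterns for path-like strings (assumptions, not exhaustive).
PATH_PATTERNS = [
    re.compile(r'[A-Za-z]:\\[^\s"<>|]+'),          # Windows drive path
    re.compile(r'\\\\[^\s\\"<>|]+\\[^\s"<>|]+'),   # UNC path: \\server\share\...
    re.compile(r'/(?:home|Users)/[^\s"<>|]+'),     # Unix or macOS home directory
]

# Captures the profile folder name that follows a well-known parent directory.
USER_PATTERN = re.compile(
    r'(?:\\Users\\|\\Documents and Settings\\|/home/|/Users/)([^\\/\s]+)'
)


def find_paths(records):
    """Return sorted unique path-like strings found in any metadata value."""
    found = set()
    for record in records:
        for value in record.values():
            text = str(value)
            for pattern in PATH_PATTERNS:
                found.update(pattern.findall(text))
    return sorted(found)


def usernames_from_paths(paths):
    """Return sorted unique profile names embedded in the given paths."""
    found = set()
    for path in paths:
        match = USER_PATTERN.search(path)
        if match:
            found.add(match.group(1))
    return sorted(found)
```

Feeding it the parsed output of `exiftool -json -r .` would give one list of candidate paths and one list of candidate usernames; both need manual review, since URLs and template paths can match as well.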
Removing Metadata
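# Remove all metadata (the original is kept as document.pdf_original)
# Note: ExifTool's PDF edits are reversible; rewrite the file afterwards (e.g. qpdf --linearize) to purge the old data
exiftool -all= document.pdf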
# Remove specific tags
exiftool -Author= -Creator= document.pdf
# Remove GPS data from images
exiftool -gps:all= photo.jpg
# Batch remove metadata
exiftool -all= -r /path/to/directory
Metadata to Look For
# Authors and creators
exiftool -Author -Creator -LastModifiedBy *.docx
# Software versions
exiftool -Software -CreatorTool -Producer *.pdf
# File paths (these tags describe the local copy; original internal paths, if any, show up in other tag values)
exiftool -FileName -Directory -SourceFile *.*
# Dates (useful for timeline)
exiftool -CreateDate -ModifyDate -MetadataDate *.*
# Company/organization
exiftool -Company -Department *.docx
# Computer names
exiftool -HostComputer -Computer *.*
# Email addresses
exiftool -r -Author -Creator -Producer . | grep -Eo '[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}'
FOCA
Fingerprinting Organizations with Collected Archives - comprehensive metadata analysis.
Installation
# Windows only - Download from:
# https://github.com/ElevenPaths/FOCA
# Or use Linux alternative: metagoofil
apt install metagoofil
Using FOCA (Windows)
- Create new project
- Enter domain name
- Select search engines (Google, Bing, etc.)
- Select file types (pdf, doc, xls, ppt)
- Start search
- Download documents
- Extract metadata
- Analyze results:
  - Users/Authors
  - Folders/Paths
  - Software versions
  - Printers
  - Email addresses
  - Operating systems
metagoofil (Linux Alternative)
# Search and download documents
metagoofil -d target.com -t pdf,doc,xls,ppt -l 100 -n 25 -o output -f results.html
# Parameters:
# -d : target domain
# -t : file types
# -l : limit search results
# -n : limit downloads
# -o : output directory
# -f : output file
Document-Specific Tools
PDF Analysis
# pdfinfo - basic PDF info
pdfinfo document.pdf
# pdfid - identify PDF features
pdfid document.pdf
# pdf-parser - detailed analysis
pdf-parser document.pdf
# Check for JavaScript
pdfid document.pdf | grep JavaScript
# Extract embedded files
pdf-parser --search "/EmbeddedFile" document.pdf
Office Documents
# Extract metadata from Office files
exiftool document.docx
# List authors
exiftool -Author -LastModifiedBy document.docx
# Check for macros
olevba document.docm
# Extract VBA macros
olevba -c document.xlsm
# Analyze .doc files
oleid suspicious.doc
Image Metadata
# Extract GPS coordinates
exiftool -gps:all photo.jpg
# Convert to decimal coordinates
exiftool -n -gps:all photo.jpg
# Extract camera information
exiftool -Make -Model -LensModel photo.jpg
# Find images with GPS data (-ext jpg with a directory, so -r actually recurses)
exiftool -r -ext jpg -if '$GPSLatitude' -filename -GPSPosition .
# Extract thumbnail (may contain unedited version)
exiftool -b -ThumbnailImage photo.jpg > thumbnail.jpg
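Without `-n`, ExifTool prints coordinates as degrees, minutes, and seconds, for example `40 deg 26' 46.32" N`. The conversion that `-n` performs is decimal = degrees + minutes/60 + seconds/3600, negated for S and W. A minimal sketch of that arithmetic, assuming the default print format shown here; `dms_to_decimal` is a hypothetical helper, not an ExifTool function:

```python
import re

# Matches strings such as: 40 deg 26' 46.32" N (assumed default print format)
DMS_PATTERN = re.compile(
    r"""(\d+(?:\.\d+)?)\s*deg\s*(\d+(?:\.\d+)?)'\s*(\d+(?:\.\d+)?)"\s*([NSEW])"""
)


def dms_to_decimal(text):
    """Convert a degrees/minutes/seconds string to signed decimal degrees."""
    match = DMS_PATTERN.search(text)
    if match is None:
        raise ValueError(f"not a recognised DMS coordinate: {text!r}")
    degrees, minutes, seconds = (float(part) for part in match.group(1, 2, 3))
    decimal = degrees + minutes / 60 + seconds / 3600
    # South latitudes and west longitudes are negative in decimal notation.
    return -decimal if match.group(4) in "SW" else decimal
```

When the values come straight from ExifTool, passing `-n` is simpler; a helper like this is only useful for output that was already saved in the default format.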
Batch Analysis
Creating Metadata Database
# Export all metadata to CSV
exiftool -r -csv /path/to/documents > metadata.csv
# Import to SQLite
sqlite3 metadata.db
.mode csv
.import metadata.csv documents
.schema documents
-- Query the database (inside the sqlite3 shell, comments use --)
SELECT SourceFile, Author, CreateDate FROM documents WHERE Author LIKE '%john%';
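The same import and query can be scripted with Python's standard `sqlite3` and `csv` modules. This is a sketch under the assumption that the CSV has a header row, as `exiftool -csv` produces; the function names `load_metadata_csv` and `find_by_author` are illustrative, and every column is stored as text.

```python
import csv
import io
import sqlite3


def load_metadata_csv(csv_text, connection):
    """Create a 'documents' table from CSV text; return the number of rows loaded."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    # Quote column names so tags with unusual characters remain valid identifiers.
    columns = ", ".join(f'"{name}" TEXT' for name in header)
    connection.execute(f"CREATE TABLE documents ({columns})")
    placeholders = ", ".join("?" for _ in header)
    # Skip malformed rows whose field count differs from the header.
    rows = [row for row in reader if len(row) == len(header)]
    connection.executemany(f"INSERT INTO documents VALUES ({placeholders})", rows)
    return len(rows)


def find_by_author(connection, fragment):
    """Return (SourceFile, Author, CreateDate) rows whose Author contains fragment."""
    cursor = connection.execute(
        "SELECT SourceFile, Author, CreateDate FROM documents WHERE Author LIKE ?",
        (f"%{fragment}%",),
    )
    return cursor.fetchall()
```

For a real run, open the database with `sqlite3.connect("metadata.db")` and pass the contents of `metadata.csv` as `csv_text`. The query assumes the `Author` and `CreateDate` columns exist, which holds only if at least one scanned file carried those tags.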
Statistical Analysis
#!/usr/bin/env python3
import subprocess
import json
from collections import Counter
# Extract metadata as JSON
result = subprocess.run(['exiftool', '-json', '-r', '/path/to/docs'],
                        capture_output=True, text=True)
data = json.loads(result.stdout)
# Count authors (str() guards against list-valued tags, which are unhashable)
authors = [str(d.get('Author', 'Unknown')) for d in data]
print("Top Authors:")
for author, count in Counter(authors).most_common(10):
    print(f"  {author}: {count}")
# Count software
software = [str(d.get('Software', 'Unknown')) for d in data]
print("\nTop Software:")
for sw, count in Counter(software).most_common(10):
    print(f"  {sw}: {count}")
# Collect field values that contain an email address
emails = []
for d in data:
    for field in ['Author', 'Creator', 'Producer', 'Company']:
        value = str(d.get(field, ''))
        if '@' in value:
            emails.append(value)
print(f"\nEmail addresses found: {len(set(emails))}")
for email in sorted(set(emails)):
    print(f"  {email}")
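The script above keeps any whole field value that contains `@`, so a value such as `John Smith <john.smith@example.com>` is reported in full. A possible refinement is to apply the same regular expression used in the shell examples and keep only the address itself. The sketch below assumes records in the shape returned by `exiftool -json`; `extract_emails` is an illustrative name.

```python
import re

# Same pattern as the grep -Eo examples elsewhere in this document.
EMAIL_PATTERN = re.compile(r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}")


def extract_emails(records, fields=("Author", "Creator", "Producer", "Company")):
    """Return sorted unique lower-cased email addresses found in the given fields."""
    found = set()
    for record in records:
        for field in fields:
            value = record.get(field, "")
            # Some tags are emitted as JSON arrays; treat every value as a list.
            values = value if isinstance(value, list) else [value]
            for item in values:
                for address in EMAIL_PATTERN.findall(str(item)):
                    found.add(address.lower())
    return sorted(found)
```

Lower-casing merges addresses that differ only in case, which is usually what is wanted for a deduplicated list.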
Web-Based Metadata Extraction
Online Tools
Use with caution: anything you submit is uploaded to a third party.
Downloading Files for Analysis
# Download all PDFs from website
wget -r -l1 -A pdf http://target.com
# Download multiple file types
wget -r -l1 -A pdf,doc,xls,ppt http://target.com
# Using curl with Google dorking results
# First find files with a Google search (typed into the search engine, not the shell):
#   site:target.com filetype:pdf
# Then download and inspect
curl -O https://target.com/document.pdf
exiftool document.pdf
Automated Workflow
Complete Metadata Extraction Script
#!/bin/bash
DOMAIN="target.com"
TYPES="pdf,doc,docx,xls,xlsx,ppt,pptx"
OUTPUT="metadata_results"
mkdir -p "$OUTPUT"
# 1. Search for documents
echo "[+] Searching for documents..."
metagoofil -d "$DOMAIN" -t "$TYPES" -l 200 -n 50 -o "$OUTPUT/files" -f "$OUTPUT/results.html"
# 2. Extract metadata
echo "[+] Extracting metadata..."
exiftool -r -csv "$OUTPUT/files" > "$OUTPUT/metadata.csv"
# 3. Extract interesting info
echo "[+] Analyzing metadata..."
# Authors (cut -f2- keeps values that themselves contain a colon)
echo -e "\n=== Authors ===" > "$OUTPUT/summary.txt"
exiftool -r -Author "$OUTPUT/files" | grep Author | cut -d: -f2- | sort -u >> "$OUTPUT/summary.txt"
# Email addresses
echo -e "\n=== Email Addresses ===" >> "$OUTPUT/summary.txt"
exiftool -r -Author -Creator -Producer "$OUTPUT/files" | grep -Eo '[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}' | sort -u >> "$OUTPUT/summary.txt"
# Internal paths
echo -e "\n=== Internal Paths ===" >> "$OUTPUT/summary.txt"
exiftool -r "$OUTPUT/files" | grep -Eo 'C:\\[^"]*' | sort -u >> "$OUTPUT/summary.txt"
# Software
echo -e "\n=== Software ===" >> "$OUTPUT/summary.txt"
exiftool -r -Software "$OUTPUT/files" | grep Software | cut -d: -f2- | sort -u >> "$OUTPUT/summary.txt"
# Usernames
echo -e "\n=== Usernames ===" >> "$OUTPUT/summary.txt"
exiftool -r -Creator -Author -LastModifiedBy "$OUTPUT/files" | grep -v "^===" | cut -d: -f2- | grep -v "@" | sort -u >> "$OUTPUT/summary.txt"
echo "[+] Results saved to $OUTPUT/"
echo "[+] Summary: $OUTPUT/summary.txt"
cat "$OUTPUT/summary.txt"
Privacy and Sanitization
Checking Your Own Documents
Before publishing documents, check for sensitive metadata:
# Check what you're about to publish
exiftool sensitive_document.pdf
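# Clean before publishing (-o writes the cleaned copy to a new file and leaves the source untouched)
exiftool -all= -o clean_document.pdf sensitive_document.pdf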
# Verify cleaning
exiftool clean_document.pdf
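The verification step can be made less error-prone by comparing the tags before and after cleaning instead of reading both listings by eye. A minimal sketch, assuming each argument is one record parsed from `exiftool -json`; the tag list and the name `compare_metadata` are illustrative choices, not a complete definition of what counts as sensitive.

```python
# Tags treated as sensitive for this check (an assumption; extend as needed).
SENSITIVE_TAGS = (
    "Author", "Creator", "Producer", "LastModifiedBy",
    "Company", "GPSLatitude", "GPSLongitude",
)


def has_value(record, tag):
    """True if the tag is present with a non-empty value (0 counts as a value)."""
    return tag in record and record[tag] not in ("", None)


def compare_metadata(before, after, tags=SENSITIVE_TAGS):
    """Return (removed, remaining) lists of sensitive tag names."""
    removed = sorted(
        tag for tag in tags if has_value(before, tag) and not has_value(after, tag)
    )
    remaining = sorted(tag for tag in tags if has_value(after, tag))
    return removed, remaining
```

An empty `remaining` list suggests the listed tags are gone, but it says nothing about tags outside the list or about data that a PDF still holds in older revisions.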
PDF Sanitization
# qpdf - rewrite and linearize (run after exiftool -all= so the orphaned old metadata is actually dropped)
qpdf --linearize input.pdf output.pdf
# mat2 - metadata removal tool
mat2 document.pdf
# Ghostscript - reprocess PDF
gs -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 -dNOPAUSE -dQUIET -dBATCH -sOutputFile=output.pdf input.pdf
What to Look For
Security Relevant Metadata
- Usernames: May be valid login names
- Email addresses: Useful for phishing campaigns
- Internal paths: Reveals directory structure, server names
- Software versions: Identify vulnerable applications
- Modification dates: Understand document lifecycle
- Network printers: May reveal internal network layout
- MAC addresses: In some printer metadata
- GPS coordinates: Physical location of photos
- Company information: Verify target organization
- Department names: Aid in social engineering
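Several of the categories above map directly onto tag names used elsewhere in this document, so a record can be triaged automatically. The following sketch assumes a record parsed from `exiftool -json`; the mapping in `CATEGORY_TAGS` is an illustrative subset, and `categorize` is a hypothetical helper.

```python
# Category-to-tag mapping built from tags used in this document (not exhaustive).
CATEGORY_TAGS = {
    "Usernames": ("Author", "Creator", "LastModifiedBy"),
    "Software versions": ("Software", "CreatorTool", "Producer"),
    "Modification dates": ("CreateDate", "ModifyDate"),
    "GPS coordinates": ("GPSLatitude", "GPSLongitude", "GPSPosition"),
    "Company information": ("Company", "Department"),
}


def categorize(record):
    """Group the non-empty sensitive tags of one record by category."""
    findings = {}
    for category, tags in CATEGORY_TAGS.items():
        values = {
            tag: record[tag]
            for tag in tags
            if record.get(tag) not in (None, "")
        }
        if values:
            findings[category] = values
    return findings
```

Categories such as network printers or MAC addresses are left out because they do not correspond to a single well-known tag and usually need pattern matching on values instead.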
Red Flags
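# Track changes still present (tracked insertions/deletions are stored as w:ins / w:del elements)
unzip -p document.docx word/document.xml | grep -oE '<w:(ins|del) ' | wc -l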
# Comments and annotations
exiftool -Comments -Subject document.pdf
# Hidden layers (optional content groups appear as /OCG objects; pdfid does not report these)
pdf-parser --search "/OCG" document.pdf
# Macros present
olevba document.xlsm
Quick Reference
# Extract all metadata
exiftool document.pdf
# Batch extract to CSV
exiftool -csv -r /path/to/docs > metadata.csv
# Find documents by author
exiftool -if '$Author =~ /john/i' -filename *.pdf
# Extract GPS from images
exiftool -gps:all -n photo.jpg
# Remove all metadata
exiftool -all= document.pdf
# Search and analyze domain documents
metagoofil -d target.com -t pdf,doc,xls -l 100 -o output -f results.html
# Extract email addresses
exiftool -r . | grep -Eo '[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}' | sort -u