Metadata Extraction

Extracting sensitive information from files and documents.

Overview

Metadata in documents can reveal:

  • Author names and usernames
  • Software versions
  • Internal file paths
  • Creation/modification dates
  • Corporate network information
  • Email addresses
  • GPS coordinates (images)
  • Internal IP addresses
  • Operating system details

ExifTool

ExifTool reads, writes, and removes metadata across a wide range of file formats, including images, PDFs, and Office documents.

Installation

# Debian/Ubuntu (the package is named libimage-exiftool-perl)
apt install libimage-exiftool-perl

# Windows
# Download from https://exiftool.org/

Basic Usage

# Extract all metadata
exiftool document.pdf

# Extract from image
exiftool photo.jpg

# Extract from multiple files
exiftool *.pdf

# Recursive directory scan
exiftool -r /path/to/directory

# Output to CSV
exiftool -csv *.pdf > metadata.csv

# Output to JSON
exiftool -json document.pdf > metadata.json

# Extract specific tags
exiftool -Author -CreateDate document.pdf

# Show only filenames with specific metadata
exiftool -if '$Author =~ /john/' -filename *.pdf
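
When the same commands are driven from a script, the -json output shown above is usually the easiest form to consume. The following is a minimal sketch, assuming exiftool is on the PATH; the helper names build_exiftool_command, parse_exiftool_json, and read_metadata are hypothetical and not part of ExifTool.

```python
import json
import subprocess


def build_exiftool_command(paths, tags=(), recursive=False):
    """Build an argv list for an ExifTool JSON export (hypothetical helper)."""
    cmd = ["exiftool", "-json"]
    if recursive:
        cmd.append("-r")
    # Each requested tag becomes an option such as -Author
    cmd.extend(f"-{tag}" for tag in tags)
    cmd.extend(paths)
    return cmd


def parse_exiftool_json(stdout):
    """Parse -json output: a JSON array with one object per file."""
    return json.loads(stdout) if stdout.strip() else []


def read_metadata(paths, tags=(), recursive=False):
    """Run ExifTool and return one dict per file (needs exiftool installed)."""
    cmd = build_exiftool_command(paths, tags, recursive)
    result = subprocess.run(cmd, capture_output=True, text=True, check=False)
    return parse_exiftool_json(result.stdout)
```

For example, read_metadata(["document.pdf"], tags=["Author", "CreateDate"]) would run the equivalent of exiftool -json -Author -CreateDate document.pdf.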

Advanced Queries

# Find documents by author
exiftool -r -if '$Author =~ /smith/i' -filename /path/to/docs

# Find images with GPS coordinates
# (-r only recurses into directories, so pass a directory and filter with -ext)
exiftool -r -ext jpg -if '$GPSLatitude' -filename -GPSLatitude -GPSLongitude .

# Find files created by specific software
exiftool -r -ext pdf -if '$Software =~ /Adobe/i' -filename .

# Extract email addresses from metadata
exiftool -r -Author -Creator -Producer . | grep -Eo '[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}'

# Find documents with internal paths (can reveal directory structure)
# (-F matches the literal string, so the backslash is not read as a regex escape)
exiftool -r -FileName -Producer -Creator . | grep -iF 'C:\'

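# Find Office documents with recorded editing time
# (TotalEditTime measures editing time; it does not prove that tracked changes exist)
exiftool -r -ext docx -if '$TotalEditTime# > 0' -filename .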
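
The email search above pipes ExifTool output through grep. A rough Python equivalent, operating on records already loaded from exiftool -json, might look like the sketch below. The function name find_emails and the sample records are illustrative assumptions; the regular expression is the one used in the grep command.

```python
import re

# Same pattern as the grep -Eo expression used in the shell examples
EMAIL_RE = re.compile(r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}")


def find_emails(records):
    """Return sorted, de-duplicated addresses found in any metadata value."""
    found = set()
    for record in records:
        for value in record.values():
            # str() copes with list-valued or numeric tags
            found.update(match.lower() for match in EMAIL_RE.findall(str(value)))
    return sorted(found)


# Made-up records in the shape produced by exiftool -json
sample = [
    {"SourceFile": "report.pdf", "Author": "Jane Doe <jane.doe@example.com>"},
    {"SourceFile": "budget.xlsx", "Creator": "finance@example.com"},
    {"SourceFile": "notes.docx", "Author": "JANE.DOE@example.com"},
]
print(find_emails(sample))
```

Lower-casing before de-duplication is a design choice: it merges addresses that differ only in case, at the cost of losing the original capitalization.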

Removing Metadata

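# Remove all metadata (ExifTool keeps a backup named document.pdf_original)
# For PDFs the deletion is reversible until the file is rewritten, e.g. with qpdf --linearize
exiftool -all= document.pdf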

# Remove specific tags
exiftool -Author= -Creator= document.pdf

# Remove GPS data from images
exiftool -gps:all= photo.jpg

# Batch remove metadata
exiftool -all= -r /path/to/directory

Metadata to Look For

# Authors and creators
exiftool -Author -Creator -LastModifiedBy *.docx

# Software versions
exiftool -Software -CreatorTool -Producer *.pdf

# File name and directory (these describe your local copy; paths embedded by the author appear in other tags)
exiftool -FileName -Directory -SourceFile *.*

# Dates (useful for timeline)
exiftool -CreateDate -ModifyDate -MetadataDate *.*

# Company/organization
exiftool -Company -Department *.docx

# Computer names
exiftool -HostComputer -Computer *.*

# Email addresses
exiftool -r -Author -Creator -Producer . | grep -Eo '[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}'

FOCA

Fingerprinting Organizations with Collected Archives - comprehensive metadata analysis.

Installation

# Windows only - Download from:
# https://github.com/ElevenPaths/FOCA

# Or use Linux alternative: metagoofil
apt install metagoofil

Using FOCA (Windows)

  1. Create new project
  2. Enter domain name
  3. Select search engines (Google, Bing, etc.)
  4. Select file types (pdf, doc, xls, ppt)
  5. Start search
  6. Download documents
  7. Extract metadata
  8. Analyze results:
       • Users/Authors
       • Folders/Paths
       • Software versions
       • Printers
       • Email addresses
       • Operating systems
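
The analysis step groups values by category. A minimal sketch of the same idea over ExifTool JSON records follows; the CATEGORY_TAGS mapping and the summarize function are hypothetical, the tag names are those used elsewhere in this document, and this is not how FOCA itself is implemented.

```python
from collections import defaultdict

# Hypothetical mapping from report categories to ExifTool tag names
CATEGORY_TAGS = {
    "users": ("Author", "LastModifiedBy"),
    "software": ("Software", "CreatorTool", "Producer"),
    "organization": ("Company", "Department"),
}


def summarize(records):
    """Collect the unique values seen for each category across all records."""
    summary = defaultdict(set)
    for record in records:
        for category, tags in CATEGORY_TAGS.items():
            for tag in tags:
                value = record.get(tag)
                if value:
                    summary[category].add(str(value))
    return {category: sorted(values) for category, values in summary.items()}


# Made-up records for illustration
records = [
    {"Author": "jsmith", "Producer": "Microsoft Word 2016", "Company": "Example Corp"},
    {"Author": "adoe", "LastModifiedBy": "jsmith", "Software": "Adobe Photoshop"},
]
for category, values in summarize(records).items():
    print(category, values)
```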

metagoofil (Linux Alternative)

# Search and download documents
metagoofil -d target.com -t pdf,doc,xls,ppt -l 100 -n 25 -o output -f results.html

# Parameters:
# -d : target domain
# -t : file types
# -l : limit search results
# -n : limit downloads
# -o : output directory
# -f : output file

Document-Specific Tools

PDF Analysis

# pdfinfo - basic PDF info
pdfinfo document.pdf

# pdfid - identify PDF features
pdfid document.pdf

# pdf-parser - detailed analysis
pdf-parser document.pdf

# Check for JavaScript
pdfid document.pdf | grep JavaScript

# Locate embedded file objects
pdf-parser --search "/EmbeddedFile" document.pdf

Office Documents

# Extract metadata from Office files
exiftool document.docx

# List authors
exiftool -Author -LastModifiedBy document.docx

# Check for macros
olevba document.docm

# Extract VBA macros
olevba -c document.xlsm

# Analyze .doc files
oleid suspicious.doc

Image Metadata

# Extract GPS coordinates
exiftool -gps:all photo.jpg

# Convert to decimal coordinates
exiftool -n -gps:all photo.jpg

# Extract camera information
exiftool -Make -Model -LensModel photo.jpg

# Find images with GPS data (pass a directory so -r can recurse)
exiftool -r -ext jpg -if '$GPSLatitude' -filename -GPSPosition .

# Extract thumbnail (may contain unedited version)
exiftool -b -ThumbnailImage photo.jpg > thumbnail.jpg
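
Without -n, ExifTool prints coordinates in degrees, minutes, and seconds, for example 40 deg 26' 46.32" N. The conversion that -n performs can be sketched as degrees + minutes/60 + seconds/3600, negated for southern and western hemispheres. The function name dms_to_decimal below is an illustrative assumption.

```python
import re

# Matches ExifTool's default print format, e.g. 40 deg 26' 46.32" N
DMS_RE = re.compile(
    r"""(?P<deg>\d+(?:\.\d+)?)\s*deg\s*
        (?P<min>\d+(?:\.\d+)?)'\s*
        (?P<sec>\d+(?:\.\d+)?)"\s*
        (?P<ref>[NSEW])""",
    re.VERBOSE,
)


def dms_to_decimal(text):
    """Convert a degrees/minutes/seconds string to signed decimal degrees."""
    match = DMS_RE.search(text)
    if match is None:
        raise ValueError(f"unrecognized coordinate: {text!r}")
    degrees = (
        float(match["deg"])
        + float(match["min"]) / 60
        + float(match["sec"]) / 3600
    )
    # South and West are negative in decimal notation
    return -degrees if match["ref"] in "SW" else degrees


latitude = dms_to_decimal("40 deg 26' 46.32\" N")
longitude = dms_to_decimal("79 deg 58' 55.20\" W")
print(round(latitude, 4), round(longitude, 4))
```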

Batch Analysis

Creating Metadata Database

# Export all metadata to CSV
exiftool -r -csv /path/to/documents > metadata.csv

# Import into SQLite (the dot-commands below run inside the sqlite3 shell)
sqlite3 metadata.db
.mode csv
.import metadata.csv documents
.schema documents

# Query the database
SELECT SourceFile, Author, CreateDate FROM documents WHERE Author LIKE '%john%';
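
The same import and query can be scripted with Python's standard sqlite3 module. This is a sketch under assumptions: the load_csv helper is hypothetical, the CSV text is made-up sample data in the shape of exiftool -csv output, and an in-memory database stands in for metadata.db.

```python
import csv
import io
import sqlite3


def load_csv(conn, csv_text, table="documents"):
    """Create a table from the CSV header row and insert every data row."""
    reader = csv.DictReader(io.StringIO(csv_text))
    columns = reader.fieldnames or []
    column_sql = ", ".join(f'"{name}"' for name in columns)
    conn.execute(f'CREATE TABLE "{table}" ({column_sql})')
    placeholders = ", ".join("?" for _ in columns)
    rows = [[row[name] for name in columns] for row in reader]
    conn.executemany(f'INSERT INTO "{table}" VALUES ({placeholders})', rows)
    return len(rows)


# Made-up sample in the shape of exiftool -csv output
sample_csv = """SourceFile,Author,CreateDate
docs/report.pdf,John Smith,2021:03:04 10:15:00
docs/plan.docx,Ana Lopez,2022:07:19 08:00:00
"""

conn = sqlite3.connect(":memory:")
row_count = load_csv(conn, sample_csv)
matches = conn.execute(
    "SELECT SourceFile, Author FROM documents WHERE Author LIKE ?",
    ("%john%",),
).fetchall()
print(row_count, matches)
```

SQLite's LIKE is case-insensitive for ASCII characters by default, which is why '%john%' matches "John Smith" here.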

Statistical Analysis

#!/usr/bin/env python3
import json
import re
import subprocess
from collections import Counter

EMAIL_RE = re.compile(r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}')

# Extract metadata as JSON (one object per file)
result = subprocess.run(['exiftool', '-json', '-r', '/path/to/docs'],
                        capture_output=True, text=True)
# ExifTool prints nothing when no files are read, so fall back to an empty list
data = json.loads(result.stdout) if result.stdout.strip() else []

# Count authors (str() guards against list-valued or numeric tags)
authors = [str(d.get('Author', 'Unknown')) for d in data]
print("Top Authors:")
for author, count in Counter(authors).most_common(10):
    print(f"  {author}: {count}")

# Count software
software = [str(d.get('Software', 'Unknown')) for d in data]
print("\nTop Software:")
for sw, count in Counter(software).most_common(10):
    print(f"  {sw}: {count}")

# Extract email addresses (only the address, not the whole field value)
emails = set()
for d in data:
    for field in ['Author', 'Creator', 'Producer', 'Company']:
        emails.update(EMAIL_RE.findall(str(d.get(field, ''))))
print(f"\nEmail addresses found: {len(emails)}")
for email in sorted(emails):
    print(f"  {email}")

Web-Based Metadata Extraction

Online Tools

Use with caution: uploading a file to an online tool hands its contents to a third party.

Downloading Files for Analysis

# Download all PDFs from website
wget -r -l1 -A pdf http://target.com

# Download multiple file types
wget -r -l1 -A pdf,doc,xls,ppt http://target.com

# Using curl with Google dorking results
# First find files with a Google search (this is a search query, not a shell command):
#   site:target.com filetype:pdf

# Then download
curl -O https://target.com/document.pdf
exiftool document.pdf
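
Once a list of candidate URLs has been collected, the extension filter that wget applies with -A can be reproduced in a few lines. This is a sketch only; the document_urls function, the DOC_EXTENSIONS set, and the sample URLs are illustrative assumptions.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Extensions matching the wget -A example above
DOC_EXTENSIONS = {".pdf", ".doc", ".xls", ".ppt"}


def document_urls(urls, extensions=DOC_EXTENSIONS):
    """Keep URLs whose path ends in one of the wanted extensions."""
    selected = []
    for url in urls:
        # Use only the path component so query strings do not hide the suffix
        suffix = PurePosixPath(urlparse(url).path).suffix.lower()
        if suffix in extensions:
            selected.append(url)
    return selected


candidates = [
    "https://target.com/files/report.PDF?download=1",
    "https://target.com/about.html",
    "https://target.com/docs/budget.xls",
]
print(document_urls(candidates))
```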

Automated Workflow

Complete Metadata Extraction Script

#!/bin/bash
DOMAIN="target.com"
TYPES="pdf,doc,docx,xls,xlsx,ppt,pptx"
OUTPUT="metadata_results"

mkdir -p "$OUTPUT/files"

# 1. Search for documents
echo "[+] Searching for documents..."
metagoofil -d "$DOMAIN" -t "$TYPES" -l 200 -n 50 -o "$OUTPUT/files" -f "$OUTPUT/results.html"

# 2. Extract metadata
echo "[+] Extracting metadata..."
exiftool -r -csv "$OUTPUT/files" > "$OUTPUT/metadata.csv"

# 3. Extract interesting info
echo "[+] Analyzing metadata..."
# -q drops per-file headers and the summary lines; -s3 prints tag values only,
# so values that contain ':' are not truncated by a cut on the delimiter

# Authors
printf '\n=== Authors ===\n' > "$OUTPUT/summary.txt"
exiftool -r -q -s3 -Author "$OUTPUT/files" | sort -u >> "$OUTPUT/summary.txt"

# Email addresses
printf '\n=== Email Addresses ===\n' >> "$OUTPUT/summary.txt"
exiftool -r -q -Author -Creator -Producer "$OUTPUT/files" | grep -Eo '[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}' | sort -u >> "$OUTPUT/summary.txt"

# Internal paths
printf '\n=== Internal Paths ===\n' >> "$OUTPUT/summary.txt"
exiftool -r -q "$OUTPUT/files" | grep -Eo 'C:\\[^"]*' | sort -u >> "$OUTPUT/summary.txt"

# Software
printf '\n=== Software ===\n' >> "$OUTPUT/summary.txt"
exiftool -r -q -s3 -Software "$OUTPUT/files" | sort -u >> "$OUTPUT/summary.txt"

# Usernames
printf '\n=== Usernames ===\n' >> "$OUTPUT/summary.txt"
exiftool -r -q -s3 -Creator -Author -LastModifiedBy "$OUTPUT/files" | grep -v "@" | sort -u >> "$OUTPUT/summary.txt"

echo "[+] Results saved to $OUTPUT/"
echo "[+] Summary: $OUTPUT/summary.txt"
cat "$OUTPUT/summary.txt"

Privacy and Sanitization

Checking Your Own Documents

Before publishing documents, check for sensitive metadata:

# Check what you're about to publish
exiftool sensitive_document.pdf

# Clean before publishing (-o writes the cleaned copy to a new file)
exiftool -all= -o clean_document.pdf sensitive_document.pdf

# Verify cleaning
exiftool clean_document.pdf
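
The verification step can be automated by checking the cleaned file's metadata for tags that should no longer carry a value. The sketch below works on a record loaded from exiftool -json; the SENSITIVE_TAGS tuple and the residual_sensitive_tags function are hypothetical, and the list of tags should be adjusted to your own publishing policy.

```python
# Hypothetical list of tags that should be empty in a published file
SENSITIVE_TAGS = (
    "Author",
    "Creator",
    "LastModifiedBy",
    "Company",
    "GPSLatitude",
    "GPSLongitude",
)


def residual_sensitive_tags(record, sensitive=SENSITIVE_TAGS):
    """Return the sensitive tags that still carry a value in one record."""
    return {
        tag: record[tag]
        for tag in sensitive
        if record.get(tag) not in (None, "")
    }


# Made-up records for a file before and after cleaning
before = {"SourceFile": "sensitive_document.pdf", "Author": "jsmith", "Producer": "Word"}
after = {"SourceFile": "clean_document.pdf", "FileType": "PDF"}
print(residual_sensitive_tags(before))
print(residual_sensitive_tags(after))
```

An empty dictionary for the cleaned file means none of the listed tags survived; it says nothing about tags outside the list.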

PDF Sanitization

# qpdf - rewrite and linearize; run after exiftool -all= to discard the old metadata objects that ExifTool leaves recoverable
qpdf --linearize input.pdf output.pdf

# mat2 - metadata removal tool
mat2 document.pdf

# Ghostscript - reprocess PDF
gs -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 -dNOPAUSE -dQUIET -dBATCH -sOutputFile=output.pdf input.pdf

What to Look For

Security Relevant Metadata

  1. Usernames: May be valid login names
  2. Email addresses: Useful for phishing campaigns
  3. Internal paths: Reveals directory structure, server names
  4. Software versions: Identify vulnerable applications
  5. Modification dates: Understand document lifecycle
  6. Network printers: May reveal internal network layout
  7. MAC addresses: In some printer metadata
  8. GPS coordinates: Physical location of photos
  9. Company information: Verify target organization
  10. Department names: Aid in social engineering
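
As an illustration of the usernames and internal paths items above, a Windows path found in metadata often embeds a profile name or a file server name. The sketch below pulls those parts out; the analyze_path function, both regular expressions, and the sample paths are illustrative assumptions and cover only the common C:\Users\name and \\server\share layouts.

```python
import re

# Profile folder on a local drive, e.g. C:\Users\jsmith\...
USER_PROFILE_RE = re.compile(
    r"[A-Za-z]:\\(?:Users|Documents and Settings)\\([^\\]+)",
    re.IGNORECASE,
)
# UNC path, e.g. \\fileserver01\finance\...
UNC_RE = re.compile(r"\\\\([^\\]+)\\([^\\]+)")


def analyze_path(path):
    """Extract a username, or a server and share name, from a Windows path."""
    info = {}
    user = USER_PROFILE_RE.search(path)
    if user:
        info["username"] = user.group(1)
    unc = UNC_RE.match(path)
    if unc:
        info["server"], info["share"] = unc.group(1), unc.group(2)
    return info


print(analyze_path(r"C:\Users\jsmith\Documents\report.docx"))
print(analyze_path(r"\\fileserver01\finance\budget.xlsx"))
```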

Red Flags

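# Signs of revision history (tracked changes themselves live in the document body, so also review the file in Word)
exiftool -RevisionNumber -TotalEditTime -LastModifiedBy document.docx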

# Comments and annotations
exiftool -Comments -Subject document.pdf

# Hidden layers (optional content groups appear as /OCG objects)
pdf-parser --search "/OCG" document.pdf

# Macros present
olevba document.xlsm

Quick Reference

# Extract all metadata
exiftool document.pdf

# Batch extract to CSV
exiftool -csv -r /path/to/docs > metadata.csv

# Find documents by author
exiftool -if '$Author =~ /john/i' -filename *.pdf

# Extract GPS from images
exiftool -gps:all -n photo.jpg

# Remove all metadata
exiftool -all= document.pdf

# Search and analyze domain documents
metagoofil -d target.com -t pdf,doc,xls -l 100 -o output -f results.html

# Extract email addresses (pass a directory so -r can recurse)
exiftool -r . | grep -Eo '[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}' | sort -u