Document Intelligence

PII Anonymization. Contract Redaction. Signature Detection.

Process documents with AI without exposing personal data. Roundtrip pseudonymization for LLM input, rule-based redaction for controlled sharing, automated signature detection for contract management. GDPR-compliant by architecture.

The Problem: Personal Data in Every Document

Enterprises want to process documents with AI - analyze contracts, classify invoices, query policies. But every document contains personal data: names, salaries, social security numbers, addresses, bank details, signatures.

Sending this data to a language model - even a self-hosted one - without protection conflicts with the GDPR principle of data minimization (Art. 5(1)(c)). Employee governance agreements restrict processing of employee data. Trade secrets in contracts must not reach third parties.

Current approaches fall short: Manual redaction in Adobe Acrobat is time-consuming, error-prone, and often only cosmetic - the text remains accessible beneath the black bars. Alternatively, avoiding AI processing for sensitive documents eliminates most of the productivity gain.

Three Capabilities

PII Anonymization for LLM Input

Roundtrip pseudonymization: personal data is replaced with consistent pseudonyms before LLM input. The output is re-identified afterwards - real data appears only in the result, never in the model. The mapping table never leaves the pre-processing layer.

Contract Redaction

Rule-based redaction for different recipients. The same contract is redacted differently for employee representatives, due diligence, or external advisors - governed by versioned redaction rules in the Decision Layer. Physical redaction, not just visual overlay.

Signature Detection

Automated detection of signature fields and existing signatures in documents. Bulk verification of contract archives, onboarding quality checks, audit preparation. Anomalies are escalated to humans - never autonomously accepted.

PII Anonymization: Roundtrip Pseudonymization for LLM Input

Most PII tools on the market perform one-way redaction - they remove data. For processing with language models, that is insufficient. When an agent needs to analyze a contract, it requires context: "Employee X has salary Y at location Z." Without this context, the model cannot produce a meaningful assessment.

The Gosign approach is roundtrip pseudonymization: data is pseudonymized before it reaches the model, processed by the model in pseudonymized form, and re-identified in the result. The model only sees pseudonyms. The result contains the real data.

┌─────────────┐     ┌──────────────────┐     ┌──────────────┐     ┌──────────────────┐     ┌─────────────┐
│  Document   │     │  PII Detection   │     │  Pseudonym-  │     │  Language Model  │     │  Re-Mapping │
│  (Original) │────▶│  & Classifi-     │────▶│  ization     │────▶│  processes only  │────▶│  Pseudonyms │
│             │     │  cation          │     │              │     │  pseudonyms      │     │  → real data│
└─────────────┘     └──────────────────┘     └──────────────┘     └──────────────────┘     └─────────────┘
                             │                      │                                             │
                             ▼                      ▼                                             ▼
                      ┌──────────────┐       ┌──────────────┐                             ┌──────────────┐
                      │  Decision    │       │  Mapping     │                             │  Result      │
                      │  Layer:      │       │  Table       │◀────────────────────────────│  with real   │
                      │  What gets   │       │  (stays      │   Reverse mapping           │  data        │
                      │  anonymized  │       │  local)      │                             └──────────────┘
                      └──────────────┘       └──────────────┘

Decision Steps in the PII Process

Micro-Decision	Who Decides	Why
Define PII categories	Human + Ruleset	GDPR requirements, governance agreement, client-specific rules
Detect PII in document	AI (NER + patterns)	Named Entity Recognition + rule-based patterns
Review false positives	AI; human when uncertain	Confidence routing - "Baker" as surname or occupation?
Assign pseudonyms	Automatic	Consistent mapping, "Person_A" instead of "John Smith"
Send pseudonymized document to model	Automatic	No decision, pure forwarding
Re-identify output	Automatic	Apply mapping table in reverse
Audit: what was anonymized	Automatic	GDPR evidence in audit trail

The mapping table (pseudonym → real data) never leaves the pre-processing layer. It is deleted after processing is complete - or retained for a defined period, depending on configuration. The language model receives only the pseudonymized text and never sees the detected personal data.
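The roundtrip can be illustrated with a minimal Python sketch. It is not the Gosign implementation: the class name `Pseudonymizer`, the numbered pseudonym format (`Person_1` rather than `Person_A`), and the entity list are assumptions made for illustration, and the entity list stands in for the output of the NER and pattern-based detection step.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Pseudonymizer:
    """Holds the mapping table; meant to live only in the pre-processing layer."""
    forward: dict = field(default_factory=dict)   # real value -> pseudonym
    reverse: dict = field(default_factory=dict)   # pseudonym -> real value
    counters: dict = field(default_factory=dict)  # category -> last index used

    def _pseudonym_for(self, category: str, value: str) -> str:
        # Consistent mapping: the same real value always gets the same pseudonym.
        if value not in self.forward:
            index = self.counters.get(category, 0) + 1
            self.counters[category] = index
            pseudonym = f"{category}_{index}"
            self.forward[value] = pseudonym
            self.reverse[pseudonym] = value
        return self.forward[value]

    def pseudonymize(self, text: str, entities: list[tuple[str, str]]) -> str:
        # entities: (category, exact surface string) pairs from an upstream detector.
        by_value = {value: category for category, value in entities}
        if not by_value:
            return text
        # Longest alternatives first, so "John Smith" wins over "John".
        alternatives = sorted(by_value, key=len, reverse=True)
        pattern = re.compile("|".join(re.escape(v) for v in alternatives))
        return pattern.sub(
            lambda m: self._pseudonym_for(by_value[m.group(0)], m.group(0)), text
        )

    def reidentify(self, text: str) -> str:
        # Apply the mapping table in reverse, in a single pass over the text.
        if not self.reverse:
            return text
        alternatives = sorted(self.reverse, key=len, reverse=True)
        pattern = re.compile("|".join(re.escape(p) for p in alternatives))
        return pattern.sub(lambda m: self.reverse[m.group(0)], text)


pseudonymizer = Pseudonymizer()
entities = [("Person", "John Smith"), ("Salary", "EUR 85,000"), ("Location", "Hamburg")]
document = "John Smith earns EUR 85,000 at the Hamburg site. John Smith reports to HR."

safe = pseudonymizer.pseudonymize(document, entities)
# Stand-in for the language model call; it only ever receives `safe`.
model_output = "Salary_1 for Person_1 is above the Location_1 median."
result = pseudonymizer.reidentify(model_output)
```

Only `safe` would be forwarded to the model; the `pseudonymizer` object, which holds the mapping table, stays on the pre-processing side and can be discarded once `result` has been produced.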

Contract Redaction: Rule-Based, Recipient-Dependent, Physical

Contracts regularly need to be shared in redacted form - with auditors, potential buyers during due diligence, with employee representation bodies, with external advisors. Today, someone does this manually. It takes hours per contract, is error-prone, and the redaction is often only cosmetic: the text remains accessible beneath the black bars. A frequently underestimated data leak.

The Gosign approach: the Document Agent recognizes contract structure - parties, amounts, terms, clauses, signatures. The Decision Layer defines recipient-dependent redaction rules:

Contract Element	Employee Rep.	Due Diligence	External Advisor	Auditor
Contracting parties (names)	✓ Visible	✗ Redacted	✗ Redacted	✓ Visible
Contract values / amounts	✓ Visible	✓ Visible	✗ Redacted	✓ Visible
Salaries / compensation	✓ Visible	Aggregated	✗ Redacted	✓ Visible
Contract clauses	✓ Visible	✓ Visible	Clause types only	✓ Visible
Trade secrets	✗ Redacted	✓ Visible	✗ Redacted	✓ Visible
Signatures	✗ Redacted	✗ Redacted	✗ Redacted	✓ Visible

Redaction rules are versioned in the Decision Layer. When requirements change - new recipient group, updated governance agreement, changed compliance rule - a new rule version is created. The previous version remains traceable.
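A minimal Python sketch of how such a versioned, recipient-dependent matrix could be represented. The identifiers (`RULESETS`, `action_for`, `redact`), the version label `"v1"`, and the fail-closed default for unlisted elements are assumptions for illustration. The sketch works on already extracted contract fields; it does not touch the PDF itself.

```python
VISIBLE, REDACTED, AGGREGATED, TYPES_ONLY = (
    "visible", "redacted", "aggregated", "clause_types_only",
)

# Column order of every row below (hypothetical identifiers).
RECIPIENTS = ("employee_rep", "due_diligence", "external_advisor", "auditor")

# One entry per rule version; older versions stay in place for traceability.
RULESETS = {
    "v1": {
        "party_names":   (VISIBLE,  REDACTED,   REDACTED,   VISIBLE),
        "amounts":       (VISIBLE,  VISIBLE,    REDACTED,   VISIBLE),
        "salaries":      (VISIBLE,  AGGREGATED, REDACTED,   VISIBLE),
        "clauses":       (VISIBLE,  VISIBLE,    TYPES_ONLY, VISIBLE),
        "trade_secrets": (REDACTED, VISIBLE,    REDACTED,   VISIBLE),
        "signatures":    (REDACTED, REDACTED,   REDACTED,   VISIBLE),
    },
}


def action_for(element: str, recipient: str, version: str) -> str:
    column = RECIPIENTS.index(recipient)  # ValueError for unknown recipients
    row = RULESETS[version].get(element)
    # Assumption: fail closed, i.e. redact any element the ruleset does not list.
    return row[column] if row is not None else REDACTED


def redact(fields: dict, recipient: str, version: str) -> dict:
    redacted = {}
    for element, value in fields.items():
        action = action_for(element, recipient, version)
        # Non-visible values are replaced by a placeholder naming the action.
        redacted[element] = value if action == VISIBLE else f"[{action.upper()}]"
    return {"rule_version": version, "recipient": recipient, "fields": redacted}


contract = {
    "party_names": "Acme GmbH / John Smith",
    "salaries": "EUR 85,000",
    "trade_secrets": "Pricing formula",
    "side_letter": "Unclassified text",
}
dd_view = redact(contract, "due_diligence", "v1")
```

Recording `rule_version` alongside each redacted view is one way to keep it traceable which rule version produced which document.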

Physical redaction: The PDF is re-rendered from scratch. The original data is physically no longer present in the document - not as text, not as metadata, not as an invisible layer. Copy-paste from beneath the black bars recovers nothing, and no PDF editor can uncover the content. This is not cosmetic - the redacted content is removed from the file itself.

Signature Detection: Find, Verify, Document

Contract management, audit preparation, compliance reviews - all require regular verification: Is this document signed? Where is the signature? Is a countersignature missing? With 5,000 contracts in the archive, manual checking is not feasible.

Signature Detection - Finding Signatures

The Document Agent detects signature fields and existing signatures in scanned documents and PDFs. This is computer vision, not a language model - specialized ML models for image analysis. The output is structured: page, position, and the confidence that a signature is present.

Bulk archive verification: "Which of the 5,000 contracts are missing a countersignature?" - Results in minutes instead of weeks.

Onboarding quality check: "Are all mandatory documents for the new employee signed?" - Automated checklist, missing signatures escalated as workflow tasks.

Audit preparation: "Show all documents without a signature in Q3 2025." - Structured export list for the auditor.
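Because the detection output is structured (page, position, confidence), a bulk query like the countersignature check reduces to counting confident detections per contract. The following Python sketch assumes a hypothetical record type `SignatureHit`, a confidence cut-off of 0.9, and that a fully executed contract carries two signatures; none of these values come from the product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SignatureHit:
    contract_id: str
    page: int
    bbox: tuple        # (x0, y0, x1, y1) position on the page, hypothetical layout
    confidence: float  # confidence that a signature is present, 0.0 to 1.0


def missing_countersignature(hits, contract_ids, min_confidence=0.9, required=2):
    """Return the IDs of contracts with fewer than `required` confident detections."""
    counts = {contract_id: 0 for contract_id in contract_ids}
    for hit in hits:
        if hit.contract_id in counts and hit.confidence >= min_confidence:
            counts[hit.contract_id] += 1
    return sorted(cid for cid, count in counts.items() if count < required)


hits = [
    SignatureHit("C-001", 12, (80, 700, 260, 760), 0.98),
    SignatureHit("C-001", 12, (330, 700, 510, 760), 0.95),
    SignatureHit("C-002", 9, (80, 690, 260, 750), 0.97),
    SignatureHit("C-002", 9, (330, 690, 510, 750), 0.40),  # too uncertain to count
]
missing = missing_countersignature(hits, ["C-001", "C-002", "C-003"])
```

In this example `C-002` is reported because its second detection falls below the cut-off, and `C-003` because no signature was detected at all; both would go to a human rather than being decided automatically.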

┌─────────────┐     ┌──────────────────┐     ┌──────────────────┐
│  Document   │     │  Signature       │     │  Comparison      │
│  with       │────▶│  Detection       │────▶│  against         │
│  signature  │     │  (position,      │     │  reference       │
│             │     │   confidence)    │     │  signature       │
└─────────────┘     └──────────────────┘     └──────────────────┘
                                                      │
                                         ┌────────────┼────────────┐
                                         ▼            ▼            ▼
                                  ┌────────────┐ ┌────────┐ ┌────────────┐
                                  │  High      │ │ Medium │ │  Low       │
                                  │  match     │ │ match  │ │  match     │
                                  └────────────┘ └────────┘ └────────────┘
                                         │            │            │
                                         ▼            ▼            ▼
                                  Automatically  Escalation   Blocked:
                                  accepted,      to human     human
                                  documented     with side-   review
                                                 by-side      mandatory
                                                 comparison
                                                 view

Important: Signature comparison is an anomaly detector, not a forgery detector. Signatures vary naturally - depending on the day, pen, and surface. The system identifies anomalies and escalates them to a human. It never claims "this signature is forged" or "this signature is authentic." That would be irresponsible.
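The three-way routing in the diagram can be sketched as a simple threshold function in Python. The thresholds 0.85 and 0.5 and the names `Route` and `route_match` are assumptions for illustration; the source does not state where the high, medium, and low bands begin.

```python
from enum import Enum


class Route(Enum):
    ACCEPT = "automatically accepted, documented"
    ESCALATE = "escalated to human with side-by-side comparison view"
    BLOCK = "blocked, human review mandatory"


def route_match(score: float, high: float = 0.85, low: float = 0.5) -> Route:
    """Map a similarity score against the reference signature to a route.

    The function only routes. It makes no statement about whether a
    signature is forged or authentic.
    """
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be between 0.0 and 1.0")
    if score >= high:
        return Route.ACCEPT
    if score >= low:
        return Route.ESCALATE
    return Route.BLOCK


routes = [route_match(score) for score in (0.93, 0.70, 0.20)]
```

Keeping the thresholds as parameters reflects the idea that they are configuration, so they could be versioned like any other rule rather than hard-coded in the model.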

The Decision Layer: Who Decides What Gets Anonymized, Redacted, or Escalated?

The Decision Layer decomposes every document process into individual decision steps. For each step, it defines: human, ruleset, or AI.

Process	Micro-Decision	Who Decides	Why
PII	Which data fields are PII?	Ruleset	GDPR Art. 4, governance agreement
PII	Is "Baker" a name or an occupation?	AI; human at <80% confidence	NER ambiguity - avoid false positives
PII	Choose pseudonymization method	Ruleset	Consistent pseudonyms vs. random values
Redaction	Which recipient group?	Human	Domain decision, not automatable
Redaction	Which fields are redacted?	Ruleset	Recipient-dependent redaction matrix
Redaction	Unknown clause type detected	Human	New clause types must be classified
Signature	Signature present?	AI	Computer vision with confidence score
Signature	Does signature match reference?	AI + human on anomaly	High match: accepted. Anomaly: escalated
Signature	No reference available	Human	New reference signature must be captured
All	Document audit trail	Automatic	Every decision immutably recorded
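Two rows of this table lend themselves to a short Python sketch: the confidence routing for the name ambiguity (AI decides, human below 80% confidence) and the audit trail. How the audit trail is made immutable is not described in the source; the hash chain used here is an assumption, and it makes tampering detectable rather than impossible. The names `AuditTrail` and `decide_name_ambiguity` are hypothetical.

```python
import hashlib
import json


class AuditTrail:
    """Append-only log in which each entry includes the hash of its predecessor."""

    GENESIS = "0" * 64

    def __init__(self):
        self._entries = []

    @staticmethod
    def _digest(body: dict) -> str:
        return hashlib.sha256(json.dumps(body, sort_keys=True).encode("utf-8")).hexdigest()

    def record(self, process: str, decision: str, decided_by: str, detail: dict) -> None:
        prev_hash = self._entries[-1]["hash"] if self._entries else self.GENESIS
        body = {
            "process": process,
            "decision": decision,
            "decided_by": decided_by,
            "detail": detail,
            "prev_hash": prev_hash,
        }
        self._entries.append({**body, "hash": self._digest(body)})

    def verify(self) -> bool:
        # Recompute every hash; any edited or reordered entry breaks the chain.
        prev_hash = self.GENESIS
        for entry in self._entries:
            body = {key: value for key, value in entry.items() if key != "hash"}
            if body["prev_hash"] != prev_hash or self._digest(body) != entry["hash"]:
                return False
            prev_hash = entry["hash"]
        return True


def decide_name_ambiguity(token: str, confidence: float, trail: AuditTrail,
                          threshold: float = 0.80) -> str:
    # Below the threshold, the decision is handed to a human.
    decided_by = "AI" if confidence >= threshold else "human"
    trail.record("PII", f"Is '{token}' a name or an occupation?", decided_by,
                 {"confidence": confidence})
    return decided_by


trail = AuditTrail()
first = decide_name_ambiguity("Baker", 0.62, trail)
second = decide_name_ambiguity("Smith", 0.97, trail)
chain_ok = trail.verify()
```

A production audit trail would additionally need timestamps and write-protected storage; the sketch leaves both out to stay short.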

Integration

Document Intelligence is a capability of the existing Document Agent - not separate software. Integration uses the same standardized interfaces:

SAP DMS, SAP ArchiveLink - contracts and receipts from SAP archives
SharePoint, OneDrive - document management via Microsoft Graph
Email inboxes (IMAP/Exchange) - process attachments automatically
File system watchers - monitor local directories
REST API - for client-specific DMS systems

Document Intelligence capabilities are configured per tenant: which PII categories are detected, which redaction rules apply, which reference signatures are stored. All versioned, all in the Decision Layer.

Business Impact

GDPR-compliant LLM processing: Documents containing personal data can be processed with language models without exposing that data to the model.

Contract redaction in minutes instead of hours: Rule-based, recipient-dependent, physically secure. A contract that takes 2 hours to redact manually is processed in minutes.

Proactive signature gap detection: Missing signatures are found before the auditor asks - not after.

Audit evidence for data protection: The audit trail documents every anonymization, every redaction, every signature check. During a GDPR inquiry or tax audit, it is provable which data was processed when and how.

No new tool: Document Intelligence is part of the existing agent architecture. No additional vendor, no additional license, no additional training.

Deep Dive in the Agent Briefing

Our article series for decision-makers implementing AI agents in the enterprise.

Infrastructure

RAG & Document Intelligence: How AI Understands Your Documents

Governance

PII Anonymization for Enterprise AI

Governance

Data Security in Enterprise AI - Data Residency, GDPR and EU AI Act

Frequently Asked Questions

What is the difference between anonymization and pseudonymization?

Anonymization irreversibly removes personal data - the link to the individual is permanently destroyed. Pseudonymization replaces the data with pseudonyms, while the link to the individual can still be restored through a separately stored mapping table. For LLM processing, we use pseudonymization with subsequent re-identification: the model sees only pseudonyms, the result contains the real data again.

Does PII detection work with scanned documents?

Yes. Scanned documents are first converted to machine-readable text via OCR. The text then undergoes the same PII detection as digital documents. Detection accuracy depends on scan quality - for standard scans at 300 DPI, OCR accuracy exceeds 99%.

Is the contract redaction truly secure?

Yes. Unlike manual redaction in PDF editors, the document is physically re-rendered. The redacted content is no longer present in the document - neither as text, nor as metadata, nor as invisible layers. This can be verified by extracting the text and metadata of the redacted file.

Can signature comparison detect forgeries?

Signature comparison detects anomalies - deviations from a reference signature. When anomalies are found, the system automatically escalates to a human. It never claims a signature is forged or authentic. That decision is made by a human. This is the only responsible approach.

Which documents should be processed securely?

PII anonymization, contract redaction, or signature detection - we start with one specific document type.