Document Intelligence
PII Anonymization. Contract Redaction. Signature Detection.
Process documents with AI without exposing personal data. Roundtrip pseudonymization for LLM input, rule-based redaction for controlled sharing, automated signature detection for contract management. GDPR-compliant by architecture.
The Problem: Personal Data in Every Document
Enterprises want to process documents with AI - analyze contracts, classify invoices, query policies. But every document contains personal data: names, salaries, social security numbers, addresses, bank details, signatures.
Sending this data to a language model - even a self-hosted one - without protection conflicts with the GDPR principle of data minimization. Employee governance agreements restrict processing of employee data. Trade secrets in contracts must not reach third parties.
Current approaches fall short: Manual redaction in Adobe Acrobat is time-consuming, error-prone, and often only cosmetic - the text remains accessible beneath the black bars. Alternatively, avoiding AI processing for sensitive documents eliminates most of the productivity gain.
Three Capabilities
PII Anonymization for LLM Input
Roundtrip pseudonymization: personal data is replaced with consistent pseudonyms before LLM input. The output is re-identified - real data appears only in the result, never in the model. The mapping table never leaves the pre-processing layer.
Contract Redaction
Rule-based redaction for different recipients. The same contract is redacted differently for employee representatives, due diligence, or external advisors - governed by versioned redaction rules in the Decision Layer. Physical redaction, not just visual overlay.
Signature Detection
Automated detection of signature fields and present signatures in documents. Bulk verification of contract archives, onboarding quality checks, audit preparation. Anomalies are escalated to humans - never autonomously accepted.
PII Anonymization: Roundtrip Pseudonymization for LLM Input
Most PII tools on the market perform one-way redaction - they remove data. For processing with language models, that is insufficient. When an agent needs to analyze a contract, it requires context: "Employee X has salary Y at location Z." Without this context, the model cannot produce a meaningful assessment.
The Gosign approach is roundtrip pseudonymization: data is pseudonymized before the model, processed by the model, and re-identified in the result via the mapping table. The model only sees pseudonyms. The result contains the real data.
┌────────────┐     ┌──────────────────┐     ┌──────────────────┐     ┌──────────────────┐     ┌──────────────┐
│ Document   │     │ PII Detection    │     │ Pseudonymization │     │ Language Model   │     │ Re-Mapping   │
│ (Original) │────▶│ & Classification │────▶│                  │────▶│ processes only   │────▶│ Pseudonyms   │
│            │     │                  │     │                  │     │ pseudonyms       │     │ → real data  │
└────────────┘     └──────────────────┘     └──────────────────┘     └──────────────────┘     └──────────────┘
                            │                        │                                               │
                            ▼                        ▼                                               ▼
                     ┌──────────────┐         ┌──────────────┐                                ┌──────────────┐
                     │ Decision     │         │ Mapping      │                                │ Result       │
                     │ Layer:       │         │ Table        │◀───────────────────────────────│ with real    │
                     │ What gets    │         │ (stays       │        Reverse mapping         │ data         │
                     │ anonymized   │         │ local)       │                                └──────────────┘
                     └──────────────┘         └──────────────┘
Decision Steps in the PII Process
| Micro-Decision | Who Decides | Why |
|---|---|---|
| Define PII categories | Human + Ruleset | GDPR requirements, governance agreement, client-specific rules |
| Detect PII in document | AI (NER + patterns) | Named Entity Recognition + rule-based patterns |
| Review false positives | AI; human when uncertain | Confidence routing - "Baker" as surname or occupation? |
| Assign pseudonyms | Automatic | Consistent mapping, "Person_A" instead of "John Smith" |
| Send pseudonymized document to model | Automatic | No decision, pure forwarding |
| Re-identify output | Automatic | Apply mapping table in reverse |
| Audit: what was anonymized | Automatic | GDPR evidence in audit trail |
The mapping table (pseudonym → real data) never leaves the pre-processing layer. It is deleted after processing is complete - or retained for a defined period, depending on configuration. The language model never sees personal data at any point.
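The roundtrip described above can be sketched in a few lines. This is a minimal illustration, not the Gosign implementation: the regex patterns, the `known_names` argument, the `[CATEGORY_n]` token format, and the `fake_model` stand-in are all assumptions made for the example. A real detector would combine NER with rule-based patterns, as the table states.

```python
import re

# Hypothetical patterns for illustration only; real detection uses NER plus rules.
PII_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "IBAN": re.compile(r"\b[A-Z]{2}\d{2}[A-Z0-9]{11,30}\b"),
}


def pseudonymize(text, known_names=()):
    """Replace detected PII with consistent pseudonyms; return (text, mapping)."""
    mapping = {}   # pseudonym -> real value; stays in the pre-processing layer
    reverse = {}   # real value -> pseudonym; keeps pseudonyms consistent
    counters = {}

    def pseudonym_for(category, value):
        if value not in reverse:
            counters[category] = counters.get(category, 0) + 1
            token = f"[{category}_{counters[category]}]"
            reverse[value] = token
            mapping[token] = value
        return reverse[value]

    # Names are passed in explicitly here; this stands in for an NER step.
    for name in known_names:
        if name in text:
            text = text.replace(name, pseudonym_for("PERSON", name))
    for category, pattern in PII_PATTERNS.items():
        text = pattern.sub(lambda m, c=category: pseudonym_for(c, m.group(0)), text)
    return text, mapping


def restore(text, mapping):
    """Apply the mapping table in reverse to the model output."""
    for token, value in mapping.items():
        text = text.replace(token, value)
    return text


def fake_model(prompt):
    # Stand-in for the language model call; it only ever receives pseudonyms.
    return "Summary: " + prompt


original = "John Smith (john.smith@example.com) is paid via DE89370400440532013000."
safe, table = pseudonymize(original, known_names=["John Smith"])
answer = restore(fake_model(safe), table)
```

Here `safe` is what the model receives, `table` is the mapping that stays local, and `answer` contains the real data again. The bracketed token format is chosen so that one token is never a prefix of another, which keeps the reverse replacement unambiguous.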
Contract Redaction: Rule-Based, Recipient-Dependent, Physical
Contracts regularly need to be shared in redacted form - with auditors, potential buyers during due diligence, with employee representation bodies, with external advisors. Today, someone does this manually. It takes hours per contract, is error-prone, and the redaction is often only cosmetic: the text remains accessible beneath the black bars. A frequently underestimated data leak.
The Gosign approach: the Document Agent recognizes contract structure - parties, amounts, terms, clauses, signatures. The Decision Layer defines recipient-dependent redaction rules:
| Contract Element | Employee Rep. | Due Diligence | External Advisor | Auditor |
|---|---|---|---|---|
| Contracting parties (names) | ✓ Visible | ✗ Redacted | ✗ Redacted | ✓ Visible |
| Contract values / amounts | ✓ Visible | ✓ Visible | ✗ Redacted | ✓ Visible |
| Salaries / compensation | ✓ Visible | Aggregated | ✗ Redacted | ✓ Visible |
| Contract clauses | ✓ Visible | ✓ Visible | Clause types only | ✓ Visible |
| Trade secrets | ✗ Redacted | ✓ Visible | ✗ Redacted | ✓ Visible |
| Signatures | ✗ Redacted | ✗ Redacted | ✗ Redacted | ✓ Visible |
Redaction rules are versioned in the Decision Layer. When requirements change - new recipient group, updated governance agreement, changed compliance rule - a new rule version is created. The previous version remains traceable.
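One way to picture a versioned, recipient-dependent rule set is as a lookup keyed by contract element and recipient group. The sketch below encodes a few rows of the matrix above; the class names, key names, the fail-closed default, and the rule change in `V2` are assumptions for illustration, not the actual Decision Layer format.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    VISIBLE = "visible"
    REDACTED = "redacted"
    AGGREGATED = "aggregated"
    CLAUSE_TYPES_ONLY = "clause types only"


@dataclass(frozen=True)
class RuleSet:
    version: int
    rules: dict  # (contract element, recipient group) -> Action

    def action_for(self, element, recipient):
        # Assumption: unknown combinations fail closed and are redacted.
        return self.rules.get((element, recipient), Action.REDACTED)

    def new_version(self, changes):
        # A change never edits the old rule set; it produces the next version.
        return RuleSet(self.version + 1, {**self.rules, **changes})


# A few rows taken from the redaction matrix above.
V1 = RuleSet(1, {
    ("salaries", "employee_rep"): Action.VISIBLE,
    ("salaries", "due_diligence"): Action.AGGREGATED,
    ("salaries", "external_advisor"): Action.REDACTED,
    ("salaries", "auditor"): Action.VISIBLE,
    ("contract_clauses", "external_advisor"): Action.CLAUSE_TYPES_ONLY,
    ("signatures", "auditor"): Action.VISIBLE,
})

# Hypothetical rule change, only to show that version 1 stays traceable.
V2 = V1.new_version({("salaries", "due_diligence"): Action.REDACTED})
```

With this shape, `V1.action_for("salaries", "due_diligence")` still returns `Action.AGGREGATED` after `V2` exists, so a document redacted under version 1 can be traced back to the rules that applied at the time.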
Physical redaction: The PDF is re-rendered from scratch. The original data is physically no longer present in the document - not as text, not as metadata, not as an invisible layer. No copy-paste beneath black bars, no PDF editing to uncover content. This is not cosmetic - the redacted content is removed from the file, not merely covered.
Signature Detection: Find, Verify, Document
Contract management, audit preparation, compliance reviews - all require regular verification: Is this document signed? Where is the signature? Is a countersignature missing? With 5,000 contracts in the archive, manual checking is not feasible.
Signature Detection - Finding Signatures
The Document Agent detects signature fields and present signatures in scanned documents and PDFs. Computer vision, not a language model - specialized ML models for image analysis. The output is structured: page, position, confidence that a signature is present.
Bulk archive verification: "Which of the 5,000 contracts are missing a countersignature?" - Results in minutes instead of weeks.
Onboarding quality check: "Are all mandatory documents for the new employee signed?" - Automated checklist, missing signatures escalated as workflow tasks.
Audit preparation: "Show all documents without a signature in Q3 2025." - Structured export list for the auditor.
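The bulk queries above reduce to filtering the structured detection output. The following sketch assumes a hypothetical `SignatureHit` record with the fields named in the text (page, position, confidence) and assumed values for the required signature count and confidence cut-off.

```python
from dataclasses import dataclass


@dataclass
class SignatureHit:
    page: int
    position: tuple      # hypothetical (x, y, width, height) on the page
    confidence: float    # confidence that a signature is present, 0.0 to 1.0


def missing_countersignature(archive, required=2, min_confidence=0.8):
    """Return IDs of contracts with fewer confident signature hits than required.

    `required` and `min_confidence` are assumed values for illustration.
    """
    flagged = []
    for contract_id, hits in archive.items():
        confident = [h for h in hits if h.confidence >= min_confidence]
        if len(confident) < required:
            flagged.append(contract_id)
    return sorted(flagged)


archive = {
    "C-001": [SignatureHit(4, (80, 700, 120, 40), 0.97),
              SignatureHit(4, (340, 700, 120, 40), 0.93)],
    "C-002": [SignatureHit(6, (80, 690, 120, 40), 0.95)],
    "C-003": [SignatureHit(2, (80, 710, 120, 40), 0.96),
              SignatureHit(2, (340, 710, 120, 40), 0.41)],
}
flagged = missing_countersignature(archive)
```

In this sample, `C-002` has only one signature and `C-003` has a second hit below the cut-off, so both end up in `flagged` and would be handed to a human as workflow tasks.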
┌─────────────┐     ┌──────────────────┐     ┌──────────────────┐
│ Document    │     │ Signature        │     │ Comparison       │
│ with        │────▶│ Detection        │────▶│ against          │
│ signature   │     │ (position,       │     │ reference        │
│             │     │ confidence)      │     │ signature        │
└─────────────┘     └──────────────────┘     └──────────────────┘
                                                      │
                                ┌─────────────────────┼─────────────────────┐
                                ▼                     ▼                     ▼
                        ┌──────────────┐      ┌──────────────┐      ┌──────────────┐
                        │ High match   │      │ Medium match │      │ Low match    │
                        └──────────────┘      └──────────────┘      └──────────────┘
                                │                     │                     │
                                ▼                     ▼                     ▼
                        Automatically         Escalation to         Blocked,
                        accepted,             human with            human review
                        documented            side-by-side          mandatory
                                              comparison view
Important: Signature comparison is an anomaly detector, not a forgery detector. Signatures vary naturally - depending on the day, pen, and surface. The system identifies anomalies and escalates them to a human. It never claims "this signature is forged" or "this signature is authentic." That would be irresponsible.
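The three-way routing in the diagram can be expressed as a small function. The thresholds `high` and `low` and the returned labels are assumptions for this sketch; the source does not state the actual cut-off values.

```python
def route_signature_match(score, high=0.9, low=0.5):
    """Map a similarity score against the reference signature to a routing outcome.

    `high` and `low` are assumed thresholds, not values from the product.
    The function never labels a signature as forged or authentic.
    """
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be between 0.0 and 1.0")
    if score >= high:
        return "accepted_and_documented"
    if score >= low:
        return "escalate_to_human"          # side-by-side comparison view
    return "blocked_human_review_mandatory"


outcomes = [route_signature_match(s) for s in (0.96, 0.70, 0.20)]
```

Note that the outcomes describe what happens next in the workflow, not a verdict about the signature itself, which matches the anomaly-detector framing above.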
The Decision Layer: Who Decides What Gets Anonymized, Redacted, or Escalated?
The Decision Layer decomposes every document process into individual decision steps. For each step, it defines: human, ruleset, or AI.
| Process | Micro-Decision | Who Decides | Why |
|---|---|---|---|
| PII | Which data fields are PII? | Ruleset | GDPR Art. 4, governance agreement |
| PII | Is "Baker" a name or an occupation? | AI; human at <80% confidence | NER ambiguity - avoid false positives |
| PII | Choose pseudonymization method | Ruleset | Consistent pseudonyms vs. random values |
| Redaction | Which recipient group? | Human | Domain decision, not automatable |
| Redaction | Which fields are redacted? | Ruleset | Recipient-dependent redaction matrix |
| Redaction | Unknown clause type detected | Human | New clause types must be classified |
| Signature | Signature present? | AI | Computer vision with confidence score |
| Signature | Does signature match reference? | AI + human on anomaly | High match: accepted. Anomaly: escalated |
| Signature | No reference available | Human | New reference signature must be captured |
| All | Document audit trail | Automatic | Every decision immutably recorded |
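Two rows of this table lend themselves to a short sketch: confidence routing for ambiguous PII candidates (human below 80%) and an audit trail in which every decision is recorded. The hash-chained log below is one possible way to make tampering detectable; it is an assumption for illustration, as are all names, and the source does not say how immutability is achieved.

```python
import hashlib
import json

CONFIDENCE_THRESHOLD = 0.80  # from the table: human decides below 80% confidence


class AuditTrail:
    """Append-only log; each entry stores the hash of the previous entry."""

    def __init__(self):
        self.entries = []

    @staticmethod
    def _digest(event, prev_hash):
        payload = json.dumps(event, sort_keys=True) + prev_hash
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def record(self, event):
        prev_hash = self.entries[-1]["hash"] if self.entries else ""
        self.entries.append({
            "event": event,
            "prev": prev_hash,
            "hash": self._digest(event, prev_hash),
        })

    def verify(self):
        """Recompute the chain; any edited entry breaks it."""
        prev_hash = ""
        for entry in self.entries:
            if entry["prev"] != prev_hash:
                return False
            if entry["hash"] != self._digest(entry["event"], prev_hash):
                return False
            prev_hash = entry["hash"]
        return True


def route_pii_candidate(candidate_id, confidence, trail):
    """Decide who resolves an ambiguous PII candidate and record the decision."""
    decider = "ai" if confidence >= CONFIDENCE_THRESHOLD else "human"
    trail.record({
        "step": "pii_ambiguity",
        "candidate_id": candidate_id,   # an ID, not the PII value itself
        "confidence": confidence,
        "decider": decider,
    })
    return decider


trail = AuditTrail()
first = route_pii_candidate("doc-17/token-42", 0.95, trail)
second = route_pii_candidate("doc-17/token-58", 0.62, trail)
```

In this run, the first candidate is resolved by the AI and the second is routed to a human. The log stores a candidate ID rather than the detected value, so the audit trail itself does not become a second copy of the personal data.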
Integration
Document Intelligence is a capability of the existing Document Agent - not separate software. Integration uses the same standardized interfaces:
- SAP DMS, SAP ArchiveLink - contracts and receipts from SAP archives
- SharePoint, OneDrive - document management via Microsoft Graph
- Email inboxes (IMAP/Exchange) - process attachments automatically
- File system watchers - monitor local directories
- REST API - for client-specific DMS systems
Document Intelligence capabilities are configured per tenant: which PII categories are detected, which redaction rules apply, which reference signatures are stored. All versioned, all in the Decision Layer.
Business Impact
GDPR-compliant LLM processing: Documents containing personal data can be processed with language models without the model ever seeing the personal data.
Contract redaction in minutes instead of hours: Rule-based, recipient-dependent, physically secure. A contract that takes 2 hours to redact manually is processed in minutes.
Proactive signature gap detection: Missing signatures are found before the auditor asks - not after.
Audit evidence for data protection: The audit trail documents every anonymization, every redaction, every signature check. During a GDPR inquiry or tax audit, it is provable which data was processed when and how.
No new tool: Document Intelligence is part of the existing agent architecture. No additional vendor, no additional license, no additional training.
Frequently Asked Questions
What is the difference between anonymization and pseudonymization?
Anonymization irreversibly removes personal data - the link to the individual is permanently destroyed. Pseudonymization replaces the data with pseudonyms while the mapping remains possible through a separate table. For LLM processing, we use pseudonymization with subsequent re-identification: the model sees only pseudonyms, and the mapping table is applied in reverse so that the result contains the real data again.
Does PII detection work with scanned documents?
Yes. Scanned documents are first converted to machine-readable text via OCR. The text then undergoes the same PII detection as digital documents. Detection accuracy depends on scan quality - for standard scans (300 DPI), OCR accuracy exceeds 99%.
Is the contract redaction truly secure?
Yes. Unlike manual redaction in PDF editors, the document is physically re-rendered. The redacted content is no longer present in the document - neither as text, nor as metadata, nor as invisible layers. This can be checked by extracting the text and metadata of the redacted file.
Can signature comparison detect forgeries?
Signature comparison detects anomalies - deviations from a reference signature. When anomalies are found, the system automatically escalates to a human. It never claims a signature is forged or authentic. That decision is made by a human. This is the only responsible approach.
Which documents should be processed securely?
PII anonymization, contract redaction, or signature detection - we start with one specific document type.