Sanctions screening: OpenSearch broad search + Kotlin re-ranker + RCA

Module: sanctions-svc (Kotlin JVM microservice with packaged OpenSearch), plus the sanctions-etl-svc ETL pipeline.

ADRs: ADR-002 (RLS), ADR-006 (AML lists), ADR-030 (architecture).

POC: poc-sanctions-matcher (pure Kotlin, re-ranker + RCA, public OFAC SDN dataset).

This page defines the complete engineering spec for sanctions-svc. A developer who reads it should be able to build the module with no open questions. It covers:

The list sources (free public lists + optional Dow Jones) with canonical URLs and formats
The ETL pipeline (parsing → FtM normalization → OpenSearch bulk index + RCA denormalization)
The OpenSearch indexing model (ICU + phonetic + n-gram mappings, flattened RCA)
The matching algorithm (broad search → Kotlin re-ranker → per-typology threshold)
RCA scoring (≤ 2 denormalized hops)
The signed audit log (model, Ed25519 signature, reproducibility)
The REST API + Kafka events
Packaging (embedded vs external cluster) + reproducibility for BCT audits

1. Overview

Nominal flow (~150 ms p95):

The ETL pipeline ingests the lists on their cadence (daily for public lists; weekly full refresh + daily delta for DJ).
At screening time, sanctions-svc receives a ScreeningQuery {name, dob?, nationality?, type, ...}.
Pass 1 (broad): OpenSearch returns the top 50 candidates using multi-fields (ICU, phonetic, n-gram).
Pass 2 (re-rank): Kotlin recomputes a deterministic consolidated score (Jaro-Winkler + phonetic + DOB + alias).
Pass 3 (threshold): verdict per typology (OFAC strict, PEP lenient, RCA with decay, adverse media manual).
Signed audit log (queryNormalized, listVersion, topN, scores, rcaPath, signature) → Kafka sanctions.screening.completed.

2. List sources

2.1 Public lists (free, MVP default)

Source	URL	Format	Cadence	Volume (entries)
OFAC SDN	https://www.treasury.gov/ofac/downloads/sdn.xml	XML	daily	~13 K
OFAC Consolidated (FSE, NS-PLC, etc.)	https://www.treasury.gov/ofac/downloads/consolidated/consolidated.xml	XML	daily	~7 K
UN Security Council Consolidated	https://scsanctions.un.org/resources/xml/en/consolidated.xml	XML	weekly	~700
EU CFSP	https://webgate.ec.europa.eu/fsd/fsf	XML (free token)	daily	~3 K
UK HM Treasury OFSI Consolidated	https://ofsistorage.blob.core.windows.net/publishlive/2022format/ConList.xml	XML	daily	~7 K
World Bank Debarred Firms	https://www.worldbank.org/en/projects-operations/procurement/debarred-firms	CSV	monthly	~3 K
OpenSanctions (aggregator)	https://www.opensanctions.org/datasets/	JSON / FtM / CSV	daily	~50 K (overlaps the others)

Total volume after dedup: ~35 K unique entries. The OpenSearch index is ~600 MB with the ICU + phonetic multi-fields.

2.2 Dow Jones Watchlist (paid option)

Optional adapter enabled by a tenant feature flag. Details are to be completed when the tenant activates it (feed format, credentials, specific RCA schema).

Metric	Typical value
Entity volume	3.8 M (PEP, sanctioned, RCA, SI/adverse media, DI/disqualified)
Association volume	8 M edges (RCA family, business, address)
Cadence	weekly full + daily delta
Format	proprietary XML feed (dedicated parser `dj-feed-parser`)
Categories	SAN, PEP, RCA, SI, DI
Datasets	Watchlist consolidated, Adverse Media, State-Owned Enterprise

The DJ adapter implements the SanctionsListAdapter interface (cf. §3.2), just like the public lists.

3. ETL pipeline

3.1 Architecture

3.2 `SanctionsListAdapter` interface

interface SanctionsListAdapter {
    val sourceName: String                              // "OFAC_SDN", "UN_CONSOLIDATED", ...
    val urlSource: String
    val cadence: Cadence                                // DAILY, WEEKLY, MONTHLY, ON_DEMAND
    val format: SourceFormat                            // XML, JSON, CSV, FtM_JSONL

    fun fetch(lastModifiedAt: Instant?): SourcePayload  // streaming where possible
    fun parse(payload: SourcePayload): Sequence<RawEntity>
    fun normalize(raw: RawEntity): NormalizedEntity     // → FtM schema
}

data class NormalizedEntity(
    val ftmId: String,                                  // canonical hash
    val schema: FtmSchema,                              // Person, Company, Vessel, Aircraft, Position, ...
    val primaryName: String,
    val primaryNameFolded: String,                      // ICU folded
    val akas: List<String>,                             // alias variants
    val datesOfBirth: List<String>,                     // several DOBs possible
    val citizenships: List<String>,                     // ISO 3166 alpha-2
    val countries: List<String>,                        // residence, operations, etc.
    val identifiers: Map<String, String>,               // passport, NIF, DUNS
    val categories: Set<Category>,                      // SAN, PEP, RCA, SI, DI
    val sourceListsHits: List<SourceHit>,
    val associatedEntityIds: List<AssociationEdge>,     // edges (target, type)
    val flattenedAssociatedSanctioned: List<RcaPath>,   // precomputed, ≤ 2 hops
    val firstSeenAt: String,                            // ISO-8601
    val lastSeenAt: String
)

data class AssociationEdge(
    val targetFtmId: String,
    val type: AssociationType,                          // FAMILY, BUSINESS, ADDRESS, EMPLOYER, OWNER
    val strength: Double                                // 1.0 direct, 0.5 reported, 0.3 weak
)

data class RcaPath(
    val targetFtmId: String,
    val depth: Int,                                     // 1 or 2
    val pathTypes: List<AssociationType>,
    val pathDecay: Double                               // product of edge strengths × decayPerDepth(depth)
)
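
The spec only says that primaryNameFolded is "ICU folded". As a rough illustration of what that field holds, here is a minimal sketch that approximates folding with the JDK alone (Unicode decomposition, removal of combining marks, lowercasing). It is an assumption for illustration, not the ICU icu_folding filter used in the index, and foldName is a hypothetical helper name. Characters without a decomposition (for example "ø" or "ß") are not folded by this approximation.

```kotlin
import java.text.Normalizer

// Hypothetical helper: JDK-only approximation of name folding.
// Decomposes accented characters, drops combining marks, lowercases,
// and collapses runs of whitespace.
fun foldName(name: String): String =
    Normalizer.normalize(name, Normalizer.Form.NFKD)
        .replace(Regex("\\p{M}+"), "")
        .lowercase()
        .trim()
        .replace(Regex("\\s+"), " ")

fun main() {
    println(foldName("José  MÜLLER")) // prints "jose muller"
}
```

A production normalize() would more likely call ICU4J so that index-time and query-time folding agree exactly.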

3.3 Versioning and reproducibility

On every ETL run:

val listVersion = sha256(entries.sortedBy { it.ftmId }.joinToString { "${it.ftmId}|${it.lastSeenAt}" })

The OpenSearch index is named sanctions_{tenantId}_{listVersion}. An alias sanctions_{tenantId} points to the active version (atomic rotation at switch time). A snapshot is written to MinIO and retained for 10 years under WORM.
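
The listVersion expression relies on a sha256 helper that the spec does not define. A minimal sketch of one possible implementation follows, using java.security.MessageDigest. VersionedEntry and sha256Hex are hypothetical names introduced for the example; only ftmId and lastSeenAt come from the spec.

```kotlin
import java.security.MessageDigest

// Hypothetical minimal view of an indexed entry: only the two fields used by the hash.
data class VersionedEntry(val ftmId: String, val lastSeenAt: String)

// Hex-encoded SHA-256 of a UTF-8 string.
fun sha256Hex(input: String): String =
    MessageDigest.getInstance("SHA-256")
        .digest(input.toByteArray(Charsets.UTF_8))
        .joinToString("") { "%02x".format(it) }

// Sorting by ftmId first makes the version independent of ingestion order.
fun listVersion(entries: List<VersionedEntry>): String =
    sha256Hex(entries.sortedBy { it.ftmId }.joinToString { "${it.ftmId}|${it.lastSeenAt}" })

fun main() {
    val a = VersionedEntry("ftm-1", "2026-04-01T00:00:00Z")
    val b = VersionedEntry("ftm-2", "2026-04-02T00:00:00Z")
    println(listVersion(listOf(a, b)) == listVersion(listOf(b, a))) // prints "true"
}
```

Because the hash covers lastSeenAt, any source update that touches an entry yields a new listVersion and therefore a new index name.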

3.4 Cadence

Step	Cadence	Time (UTC)
Full reindex (all public sources)	weekly	Sunday 02:00
Delta sync (OFAC, EU, UK daily lists)	daily	06:00
On-demand force refresh	admin via API	n/a
Dow Jones full	weekly	Sunday 03:00
Dow Jones delta	daily	06:30

4. OpenSearch indexing model

4.1 Entity mappings

{
  "settings": {
    "analysis": {
      "analyzer": {
        "name_icu_folded": {
          "type": "custom",
          "tokenizer": "icu_tokenizer",
          "filter": ["icu_folding", "lowercase"]
        },
        "name_phonetic_bm": {
          "type": "custom",
          "tokenizer": "icu_tokenizer",
          "filter": ["lowercase", "beider_morse_phonetic"]
        },
        "name_ngram": {
          "type": "custom",
          "tokenizer": "icu_tokenizer",
          "filter": ["lowercase", "edge_ngram_3_15"]
        }
      },
      "filter": {
        "beider_morse_phonetic": {
          "type": "phonetic",
          "encoder": "beider_morse",
          "rule_type": "approx",
          "name_type": "generic",
          "languageset": ["any", "arabic", "french", "english", "russian"]
        },
        "edge_ngram_3_15": {
          "type": "edge_ngram",
          "min_gram": 3,
          "max_gram": 15
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "ftm_id": { "type": "keyword" },
      "schema": { "type": "keyword" },
      "primary_name": {
        "type": "text",
        "fields": {
          "raw": { "type": "keyword" },
          "icu_folded": { "type": "text", "analyzer": "name_icu_folded" },
          "phonetic": { "type": "text", "analyzer": "name_phonetic_bm" },
          "ngram": { "type": "text", "analyzer": "name_ngram" }
        }
      },
      "akas": {
        "type": "text",
        "fields": {
          "icu_folded": { "type": "text", "analyzer": "name_icu_folded" },
          "phonetic": { "type": "text", "analyzer": "name_phonetic_bm" }
        }
      },
      "dates_of_birth": { "type": "keyword" },
      "citizenships": { "type": "keyword" },
      "countries": { "type": "keyword" },
      "categories": { "type": "keyword" },
      "source_lists_hits": {
        "type": "nested",
        "properties": {
          "source": { "type": "keyword" },
          "list_version": { "type": "keyword" },
          "last_seen_at": { "type": "date" }
        }
      },
      "flattened_associated_sanctioned": {
        "type": "nested",
        "properties": {
          "target_ftm_id": { "type": "keyword" },
          "depth": { "type": "integer" },
          "path_decay": { "type": "float" }
        }
      },
      "last_seen_at": { "type": "date" }
    }
  }
}

4.2 Sizing

Profile	Nodes	JVM heap	Host RAM	Index size	Tenant profile
`embedded-light`	1	4 GB	8 GB	≤ 2 GB (public lists)	light on-prem tenant, no DJ
`embedded-full`	1	16 GB	32 GB	≤ 30 GB (public lists + DJ)	full on-prem tenant, DJ enabled
`external-cluster`	3	16 GB / node	32 GB / node	≤ 50 GB	tier-1 tenant with an existing cluster

5. Matching algorithm

5.1 Pass 1: broad OpenSearch search

POST /sanctions_{tenantId}/_search
{
  "size": 50,
  "query": {
    "bool": {
      "should": [
        {
          "dis_max": {
            "tie_breaker": 0.3,
            "queries": [
              { "match": { "primary_name.icu_folded": { "query": "{queryName}", "boost": 4.0 } } },
              { "match": { "primary_name.phonetic":   { "query": "{queryName}", "boost": 2.5 } } },
              { "match": { "primary_name.ngram":      { "query": "{queryName}", "boost": 1.5 } } },
              { "match": { "akas.icu_folded":         { "query": "{queryName}", "boost": 3.0 } } },
              { "match": { "akas.phonetic":           { "query": "{queryName}", "boost": 2.0 } } }
            ]
          }
        }
      ],
      "minimum_should_match": 1,
      "filter": [
        { "term": { "schema": "{Person|Company}" } }
      ]
    }
  },
  "_source": ["ftm_id","primary_name","akas","dates_of_birth","citizenships","categories","source_lists_hits","flattened_associated_sanctioned"]
}

Target: recall ≥ 95 % on the top 50, p95 latency ≤ 30 ms.

5.2 Pass 2: deterministic Kotlin re-ranker

data class RerankerScore(
    val nameScore: Double,                  // Jaro-Winkler max(primary, akas)
    val phoneticScore: Double,              // Beider-Morse code overlap: 1.0 or 0.0
    val dobScore: Double,                   // exact match 1.0, partial 0.5, none 0.0
    val nationalityScore: Double,           // exact match 1.0, mismatch 0.0
    val aliasUnfoldScore: Double,           // ICU translit + Levenshtein
    val sourceTypeBoost: Double,            // OFAC=1.0, PEP=0.85, etc.
    val finalScore: Double                  // weighted sum × sourceTypeBoost
)

// Helpers assumed to be defined elsewhere in the module: jaroWinkler, beiderMorse (returns a Set of codes),
// matchDob, matchNationality, unfoldArabicTranslit, levenshtein, categoryBoost, and a `normalized` extension.
fun score(query: ScreeningQuery, candidate: NormalizedEntity): RerankerScore {
    // Best Jaro-Winkler similarity across the primary name and every alias.
    val nameScore = maxOf(
        jaroWinkler(query.name.normalized, candidate.primaryName.normalized),
        candidate.akas.maxOfOrNull { jaroWinkler(query.name.normalized, it.normalized) } ?: 0.0
    )
    // 1.0 when the two Beider-Morse code sets share at least one code.
    val phoneticScore = if (
        beiderMorse(query.name).intersect(beiderMorse(candidate.primaryName)).isNotEmpty()
    ) 1.0 else 0.0
    val dobScore = matchDob(query.dob, candidate.datesOfBirth)
    val nationalityScore = matchNationality(query.nationality, candidate.citizenships)
    // Normalized Levenshtein similarity between each transliteration variant and each alias.
    val aliasUnfoldScore = unfoldArabicTranslit(query.name)
        .map { variant -> candidate.akas.maxOfOrNull { aka -> 1.0 - levenshtein(variant, aka) / maxOf(variant.length, aka.length, 1).toDouble() } ?: 0.0 }
        .maxOrNull() ?: 0.0
    val sourceTypeBoost = candidate.categories.maxOfOrNull { categoryBoost(it) } ?: 1.0

    val final = (
        0.40 * nameScore +
        0.15 * phoneticScore +
        0.20 * dobScore +
        0.10 * nationalityScore +
        0.15 * aliasUnfoldScore
    ) * sourceTypeBoost

    return RerankerScore(nameScore, phoneticScore, dobScore, nationalityScore, aliasUnfoldScore, sourceTypeBoost, final)
}

Default MVP weights: 0.40/0.15/0.20/0.10/0.15. Configurable per tenant.
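
Since the weights are configurable per tenant, it can help to carry them in a small value object. The sketch below is one possible shape: RerankerWeights and combine are hypothetical names, and the requirement that the weights sum to 1.0 is an assumption made here (the MVP defaults do sum to 1.0) rather than a stated rule.

```kotlin
import kotlin.math.abs

// Hypothetical container for per-tenant weights; defaults are the MVP values.
data class RerankerWeights(
    val name: Double = 0.40,
    val phonetic: Double = 0.15,
    val dob: Double = 0.20,
    val nationality: Double = 0.10,
    val aliasUnfold: Double = 0.15
) {
    init {
        // Assumption: reject configurations whose weights do not sum to 1.0.
        val sum = name + phonetic + dob + nationality + aliasUnfold
        require(abs(sum - 1.0) < 1e-9) { "weights must sum to 1.0, got $sum" }
    }

    // Weighted sum of the five component scores, scaled by the source-type boost.
    fun combine(
        nameScore: Double, phoneticScore: Double, dobScore: Double,
        nationalityScore: Double, aliasUnfoldScore: Double, sourceTypeBoost: Double
    ): Double =
        (name * nameScore + phonetic * phoneticScore + dob * dobScore +
            nationality * nationalityScore + aliasUnfold * aliasUnfoldScore) * sourceTypeBoost
}

fun main() {
    val weights = RerankerWeights()
    // 0.384 + 0.15 + 0.20 + 0.10 + 0.135 = 0.969 (up to floating-point rounding)
    println(weights.combine(0.96, 1.0, 1.0, 1.0, 0.90, 1.0))
}
```

The same object can be serialized into rerankerWeights in the audit event so that a replay uses exactly the weights in force at screening time.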

Why these weights: nameScore dominates (40 %) because the written name remains the most discriminating element; dobScore gets 20 % because an exact DOB eliminates false positives; phonetic (15 %) captures Arabic and Slavic spelling variants; aliasUnfold (15 %) handles multiple transliterations (Mohammed ↔ محمد); nationality (10 %) is a weak but useful signal.
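
For readers unfamiliar with the dominant signal, here is a self-contained sketch of the textbook Jaro-Winkler similarity (prefix scale 0.1, prefix capped at 4 characters). It is an illustration of the standard formulation; the jaroWinkler used by the POC may differ in details such as normalization or parameter values.

```kotlin
// Jaro similarity: matches within a sliding window, penalized by transpositions.
fun jaro(a: String, b: String): Double {
    if (a == b) return 1.0
    if (a.isEmpty() || b.isEmpty()) return 0.0
    val window = maxOf(0, maxOf(a.length, b.length) / 2 - 1)
    val aMatched = BooleanArray(a.length)
    val bMatched = BooleanArray(b.length)
    var matches = 0
    for (i in a.indices) {
        val lo = maxOf(0, i - window)
        val hi = minOf(b.length - 1, i + window)
        for (j in lo..hi) {
            if (!bMatched[j] && a[i] == b[j]) {
                aMatched[i] = true
                bMatched[j] = true
                matches++
                break
            }
        }
    }
    if (matches == 0) return 0.0
    // Matched characters that appear in a different order; two of them make one transposition.
    var k = 0
    var outOfOrder = 0
    for (i in a.indices) {
        if (!aMatched[i]) continue
        while (!bMatched[k]) k++
        if (a[i] != b[k]) outOfOrder++
        k++
    }
    val m = matches.toDouble()
    return (m / a.length + m / b.length + (m - outOfOrder / 2.0) / m) / 3.0
}

// Winkler adjustment: boost pairs that share a common prefix (up to 4 characters).
fun jaroWinkler(a: String, b: String, prefixScale: Double = 0.1): Double {
    val base = jaro(a, b)
    val prefix = a.zip(b).takeWhile { (x, y) -> x == y }.size.coerceAtMost(4)
    return base + prefix * prefixScale * (1.0 - base)
}

fun main() {
    // Classic reference pair: Jaro = 0.9444, Jaro-Winkler ≈ 0.9611
    println(jaroWinkler("MARTHA", "MARHTA"))
}
```

The prefix boost is what makes the measure well suited to names, where the first characters are less prone to typos than the rest.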

5.3 Pass 3: per-typology threshold

Typology	Source	Threshold	Verdict
Direct sanctions	OFAC SDN, UN, EU CFSP, UK OFSI	`≥ 0.92`	`MATCH_DIRECT_SANCTIONS` → block + manual case
Direct PEP	DJ PEP, OpenSanctions PEP	`≥ 0.85`	`MATCH_PEP` → enhanced due diligence
RCA 1-hop	flattened depth=1	`≥ 0.80`	`MATCH_RCA_DIRECT` → manual review
RCA 2-hop	flattened depth=2	`≥ 0.70`	`MATCH_RCA_INDIRECT` → manual review + path link
Adverse media	DJ SI	`≥ 0.75`	`POTENTIAL_ADVERSE_MEDIA` → manual only
Below threshold	n/a	`< threshold`	`CLEAR` → logged, no hit

All thresholds are per tenant (BCT may impose a stricter policy than a fintech tenant). They are configurable via SanctionsPolicy (a Kotlin DSL reusing the ADR-004 grammar).
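
The threshold table can be read as a simple lookup with optional per-tenant overrides. The sketch below illustrates that reading only; Typology, verdictFor and the overrides map are hypothetical names and do not represent the SanctionsPolicy DSL, which this page does not specify.

```kotlin
// Default thresholds and verdicts, copied from the table in 5.3.
enum class Typology(val defaultThreshold: Double, val verdict: String) {
    DIRECT_SANCTIONS(0.92, "MATCH_DIRECT_SANCTIONS"),
    DIRECT_PEP(0.85, "MATCH_PEP"),
    RCA_1_HOP(0.80, "MATCH_RCA_DIRECT"),
    RCA_2_HOP(0.70, "MATCH_RCA_INDIRECT"),
    ADVERSE_MEDIA(0.75, "POTENTIAL_ADVERSE_MEDIA")
}

// Returns the typology's verdict when the score reaches the (possibly overridden)
// threshold, otherwise CLEAR.
fun verdictFor(
    typology: Typology,
    finalScore: Double,
    overrides: Map<Typology, Double> = emptyMap()
): String {
    val threshold = overrides[typology] ?: typology.defaultThreshold
    return if (finalScore >= threshold) typology.verdict else "CLEAR"
}

fun main() {
    println(verdictFor(Typology.DIRECT_SANCTIONS, 0.93)) // prints "MATCH_DIRECT_SANCTIONS"
    println(verdictFor(Typology.DIRECT_SANCTIONS, 0.91)) // prints "CLEAR"
    // A stricter tenant policy raises the bar for the same score.
    println(verdictFor(Typology.DIRECT_SANCTIONS, 0.93, mapOf(Typology.DIRECT_SANCTIONS to 0.95))) // prints "CLEAR"
}
```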

5.4 Matching pipeline diagram

6. RCA scoring

6.1 Denormalization at ETL time

For each entity X in the database, the ETL precomputes:

val flattenedAssociatedSanctioned = bfs(
    start = X,
    edges = associationGraph,
    maxDepth = 2,
    // Applied to reported targets only: intermediate hops may be non-sanctioned entities.
    targetFilter = { it.categories.intersect(setOf(SAN, PEP)).isNotEmpty() }
).map { (target, depth, edgePath) ->
    RcaPath(
        targetFtmId = target.ftmId,
        depth = depth,
        pathTypes = edgePath.map { it.type },
        pathDecay = edgePath.map { it.strength }.fold(1.0, Double::times) * decayPerDepth(depth)
    )
}

Depth decay function: decayPerDepth(1) = 1.0, decayPerDepth(2) = 0.6. Configurable.
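
To make the precomputation concrete, here is a self-contained sketch of a breadth-first flattening limited to 2 hops. Several points are assumptions for the example: edges are treated as directed, each target is reported once via the first (shortest) path found, and the graph is passed as a plain edge list with a set of flagged (SAN/PEP) ids. Edge and flattenAssociations are hypothetical names.

```kotlin
enum class AssociationType { FAMILY, BUSINESS, ADDRESS, EMPLOYER, OWNER }

data class Edge(val source: String, val target: String, val type: AssociationType, val strength: Double)

data class RcaPath(val targetFtmId: String, val depth: Int, val pathTypes: List<AssociationType>, val pathDecay: Double)

// MVP decay values from the spec.
fun decayPerDepth(depth: Int): Double = if (depth <= 1) 1.0 else 0.6

fun flattenAssociations(start: String, edges: List<Edge>, flagged: Set<String>, maxDepth: Int = 2): List<RcaPath> {
    val outgoing = edges.groupBy { it.source }
    val visited = mutableSetOf(start)
    val result = mutableListOf<RcaPath>()
    var frontier: List<Pair<String, List<Edge>>> = listOf(start to emptyList())
    for (depth in 1..maxDepth) {
        val next = mutableListOf<Pair<String, List<Edge>>>()
        for ((node, path) in frontier) {
            for (edge in outgoing[node].orEmpty()) {
                if (!visited.add(edge.target)) continue // already reached by a shorter path
                val newPath = path + edge
                if (edge.target in flagged) {
                    val strengthProduct = newPath.fold(1.0) { acc, e -> acc * e.strength }
                    result += RcaPath(edge.target, depth, newPath.map { it.type }, strengthProduct * decayPerDepth(depth))
                }
                // Keep walking through non-flagged intermediaries as well.
                next += Pair(edge.target, newPath)
            }
        }
        frontier = next
    }
    return result
}

fun main() {
    val edges = listOf(
        Edge("X", "B", AssociationType.FAMILY, 1.0),   // B is flagged, 1 hop
        Edge("X", "C", AssociationType.BUSINESS, 0.5), // C is a clean intermediary
        Edge("C", "D", AssociationType.OWNER, 1.0)     // D is flagged, 2 hops via C
    )
    flattenAssociations("X", edges, flagged = setOf("B", "D")).forEach(::println)
}
```

In this example D is reached through a non-flagged company C with pathDecay 0.5 × 1.0 × 0.6 = 0.3, which is the indirect-UBO situation the 2-hop limit is meant to cover.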

6.2 RCA score at screening time

At screening time, if an entity matches a candidate (finalScore ≥ 0.70) whose flattenedAssociatedSanctioned is non-empty, the response includes, in addition to the direct match, the list of reachable sanctioned paths:

data class ScreeningHit(
    val candidate: NormalizedEntity,
    val directScore: Double,
    val typology: HitTypology,
    val rcaHits: List<RcaHit>             // sanctioned paths reachable from this candidate
)

data class RcaHit(
    val sanctionedTargetFtmId: String,
    val depth: Int,
    val pathTypes: List<AssociationType>,
    val rcaScore: Double                  // = directScore * pathDecay
)
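
As a small worked illustration of rcaScore = directScore * pathDecay, the sketch below derives the hits for one candidate. The Lite types and the rcaHits function are hypothetical, trimmed to the fields needed; sorting by descending score is an assumption, not a stated requirement.

```kotlin
// Hypothetical trimmed-down versions of RcaPath and RcaHit.
data class RcaPathLite(val targetFtmId: String, val depth: Int, val pathDecay: Double)
data class RcaHitLite(val sanctionedTargetFtmId: String, val depth: Int, val rcaScore: Double)

// RCA hits are only produced when the direct match reaches the 0.70 floor from 6.2.
fun rcaHits(directScore: Double, paths: List<RcaPathLite>, minDirectScore: Double = 0.70): List<RcaHitLite> =
    if (directScore < minDirectScore) emptyList()
    else paths
        .map { RcaHitLite(it.targetFtmId, it.depth, directScore * it.pathDecay) }
        .sortedByDescending { it.rcaScore }

fun main() {
    val paths = listOf(RcaPathLite("ftm-B", 1, 1.0), RcaPathLite("ftm-D", 2, 0.3))
    // With directScore = 0.9 the scores are 0.9 and about 0.27.
    rcaHits(0.9, paths).forEach(::println)
}
```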

6.3 Limitations

2 hops max in the MVP: this covers 90 % of banking cases (sanctioned brother, indirect UBO through an SARL). More than 2 hops requires a new ADR with a graph DB.
Weekly recompute: a new edge added to the DJ list between two reindexes stays invisible until the weekly reindex. This is acceptable for RCA cases (no real-time SLA).

7. Signed append-only audit log

7.1 Event model

@Serializable
data class ScreeningAuditEvent(
    val eventId: String,
    val tenantId: String,
    val screeningId: String,
    val queryNormalized: ScreeningQuery,
    val listVersion: String,                   // sha256 of the indexed list
    val openSearchAlias: String,               // "sanctions_{tenant}"
    val openSearchIndex: String,               // "sanctions_{tenant}_{listVersion}"
    val topNCandidates: List<CandidateRecord>, // top 50 ranked
    val verdict: ScreeningVerdict,
    val rerankerVersion: String,               // SemVer of the re-ranker
    val rerankerWeights: Map<String, Double>,  // weights used
    val thresholds: Map<String, Double>,       // thresholds per typology
    val signature: Ed25519Signature,
    val previousEventHash: String,
    val hash: String,
    val occurredAt: String                     // ISO-8601
)

data class CandidateRecord(
    val ftmId: String,
    val primaryName: String,
    val finalScore: Double,
    val nameScore: Double,
    val phoneticScore: Double,
    val dobScore: Double,
    val nationalityScore: Double,
    val aliasUnfoldScore: Double,
    val sourceTypeBoost: Double,
    val rcaPath: List<RcaHit>?
)

7.2 Reproducibility for BCT audits

To reproduce a screening 5 years later:

Fetch the event from screening_audit (append-only table with RLS, cf. ADR-002).
Verify the hash chain and the Ed25519 signature (pinned tenant public key).
Fetch the OpenSearch snapshot matching listVersion from MinIO.
Replay the re-ranker (version rerankerVersion, weights rerankerWeights) on topNCandidates.
Check that the verdict is identical.

Guarantee: if the Kotlin re-ranker is deterministic (tested in CI against a golden corpus) and the list is frozen by the snapshot, the verdict is strictly reproducible.
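
The chain and signature check in the second step can be sketched with the JDK alone (Ed25519 is available in java.security from Java 15). The exact bytes that are hashed and signed are not specified on this page, so the layout used below (hash = SHA-256 of "previousHash|payload", signature over the hash) is an assumption, as are the names ChainedEvent, appendEvent and verifyChain.

```kotlin
import java.security.KeyPairGenerator
import java.security.MessageDigest
import java.security.PrivateKey
import java.security.PublicKey
import java.security.Signature

class ChainedEvent(val payload: String, val previousEventHash: String, val hash: String, val signature: ByteArray)

fun sha256Hex(s: String): String =
    MessageDigest.getInstance("SHA-256").digest(s.toByteArray(Charsets.UTF_8))
        .joinToString("") { "%02x".format(it) }

// Assumed layout: the hash covers the previous hash and the serialized payload.
fun appendEvent(payload: String, previousHash: String, key: PrivateKey): ChainedEvent {
    val hash = sha256Hex("$previousHash|$payload")
    val sig = Signature.getInstance("Ed25519").run {
        initSign(key); update(hash.toByteArray(Charsets.UTF_8)); sign()
    }
    return ChainedEvent(payload, previousHash, hash, sig)
}

// Walks the chain in order; any broken link, altered payload or bad signature fails.
fun verifyChain(events: List<ChainedEvent>, genesisHash: String, key: PublicKey): Boolean {
    var expectedPrevious = genesisHash
    for (e in events) {
        if (e.previousEventHash != expectedPrevious) return false
        if (e.hash != sha256Hex("${e.previousEventHash}|${e.payload}")) return false
        val ok = Signature.getInstance("Ed25519").run {
            initVerify(key); update(e.hash.toByteArray(Charsets.UTF_8)); verify(e.signature)
        }
        if (!ok) return false
        expectedPrevious = e.hash
    }
    return true
}

fun main() {
    val keys = KeyPairGenerator.getInstance("Ed25519").generateKeyPair()
    val e1 = appendEvent("""{"screeningId":"s-1"}""", "GENESIS", keys.private)
    val e2 = appendEvent("""{"screeningId":"s-2"}""", e1.hash, keys.private)
    println(verifyChain(listOf(e1, e2), "GENESIS", keys.public)) // prints "true"
    val tampered = ChainedEvent("""{"screeningId":"s-X"}""", e1.previousEventHash, e1.hash, e1.signature)
    println(verifyChain(listOf(tampered, e2), "GENESIS", keys.public)) // prints "false"
}
```

A real implementation would also need a canonical serialization of the event so that the hashed bytes are identical at write time and at replay time.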

7.3 SQL schema

CREATE TABLE screening_audit (
  event_id            UUID PRIMARY KEY,
  screening_id        UUID NOT NULL,
  tenant_id           UUID NOT NULL,
  query_normalized    JSONB NOT NULL,
  list_version        VARCHAR(74) NOT NULL,
  opensearch_alias    VARCHAR(128) NOT NULL,
  opensearch_index    VARCHAR(128) NOT NULL,
  top_n_candidates    JSONB NOT NULL,
  verdict             JSONB NOT NULL,
  reranker_version    VARCHAR(16) NOT NULL,
  reranker_weights    JSONB NOT NULL,
  thresholds          JSONB NOT NULL,
  signature           TEXT NOT NULL,
  previous_event_hash CHAR(74) NOT NULL,
  hash                CHAR(74) NOT NULL,
  occurred_at         TIMESTAMPTZ NOT NULL
);

ALTER TABLE screening_audit ENABLE ROW LEVEL SECURITY;
CREATE POLICY screening_audit_tenant_isolation ON screening_audit
  USING (tenant_id = current_setting('app.current_tenant_id')::uuid);
REVOKE UPDATE, DELETE ON screening_audit FROM app_role;

8. REST API + Kafka events

8.1 API

Method	Endpoint	Description
`POST`	`/v1/sanctions/screen`	single screening (synchronous)
`POST`	`/v1/sanctions/screen/batch`	batch screening (async, Temporal workflow)
`GET`	`/v1/sanctions/screenings/:id`	poll batch status
`GET`	`/v1/sanctions/screenings/:id/audit`	full audit of a screening
`GET`	`/v1/sanctions/lists`	active sources + listVersion + stats
`POST`	`/v1/sanctions/lists/:source/refresh`	force on-demand refresh (admin)
`GET`	`/v1/sanctions/policies`	fetch the tenant SanctionsPolicy
`PUT`	`/v1/sanctions/policies`	update it (with dual control + audit)

8.2 Kafka events

Topic	Emitted when	Payload
`sanctions.screening.completed`	single screening finished	`{screeningId, verdict, hits, listVersion}`
`sanctions.batch.progress`	batch in progress	`{batchId, processed, total}`
`sanctions.list.reindexed`	ETL completed	`{source, listVersion, entityCount}`
`sanctions.policy.updated`	tenant changes the policy	`{tenantId, version, signedBy}`

Topics are partitioned by tenantId. Broker retention is 30 days (long-term audit lives in screening_audit).

8.3 OpenAPI

Full spec: see /api/openapi/, Sanctions section. Pact contract tests run in CI.

9. Packaging

9.1 `embedded` mode (default for light on-prem tenants)

Single-node Docker container vitakyc/sanctions-svc-embedded:1.0
Image includes OpenSearch 2.x + ICU and phonetic plugins + Java 17
4-8 GB RAM recommended (public lists only)
Healthcheck /_cluster/health
Persistent volumes /data/opensearch + /data/snapshots
Backups to MinIO via an internal cron (OpenSearch S3 snapshots)
No cluster, no master election, no sharding
Pre-tuned for 35 K entries + future DJ expansion

9.2 `external` mode (tier-1 tenant)

The tenant provides the URL of an existing OpenSearch cluster (3+ nodes, run by its own ops team)
The VitaKYC image sanctions-svc:1.0 (without embedded OpenSearch) connects to it
The sanctions_{tenantId}_* indices are created by VitaKYC in the tenant's cluster
The tenant handles monitoring, backups and upgrades of the cluster

9.3 Mode selection

Tenant feature flag sanctions.deployment.mode = embedded | external, set in the tenant config. Switching is possible at any time by re-exporting and re-importing the data, then repointing the alias.

10. Performance and capacity

Metric	MVP target	V2 target
Broad OpenSearch latency p95	≤ 30 ms	≤ 15 ms
Re-ranker latency p95 (top 50)	≤ 50 ms	≤ 25 ms
Full pipeline latency p95	≤ 200 ms	≤ 100 ms
Single screening throughput	≥ 100 /s /tenant	≥ 500 /s
Batch screening throughput	≥ 1000 /min /tenant	≥ 10 000 /min
Full reindex, public lists (35 K)	≤ 1 min	≤ 30 s
Full reindex, DJ (3.8 M + 8 M edges)	≤ 60 min	≤ 30 min
Availability	99.5 %	99.9 %
Index storage, public lists	≤ 600 MB	≤ 400 MB
Index storage, DJ	≤ 30 GB	≤ 25 GB

11. Security

mTLS between sanctions-svc and OpenSearch (cluster or embedded).
Admin auth: OIDC + MFA to modify SanctionsPolicy (dual control via Vault, cf. ADR-006).
Ed25519 signature of audit events with a pinned tenant key (rotated every 12 months).
PostgreSQL RLS on screening_audit (cf. ADR-002).
Separate OpenSearch indices per tenant; nothing is shared across tenants.
PII protection: names are not hashed (screening would otherwise be impossible). Application logs are anonymized, however: queryNormalized contains the name, but the application log truncates it.
Rate limit: 30 screenings/s/tenant, 1000/min/API user.
Append-only audit: signatures + hash chaining make tampering detectable.
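
The page does not say how the per-tenant rate limit is enforced. One common option is a token bucket per tenant; the sketch below is an illustration under that assumption. TokenBucket is a hypothetical class and the injectable clock exists only to make the example testable.

```kotlin
// Hypothetical per-tenant limiter: refills at ratePerSecond up to capacity.
class TokenBucket(
    private val ratePerSecond: Double,
    private val capacity: Double,
    private val nowNanos: () -> Long = System::nanoTime
) {
    private var tokens = capacity
    private var last = nowNanos()

    // Returns true and consumes one token if the call is allowed, false otherwise.
    @Synchronized
    fun tryAcquire(): Boolean {
        val now = nowNanos()
        tokens = minOf(capacity, tokens + (now - last) / 1e9 * ratePerSecond)
        last = now
        if (tokens < 1.0) return false
        tokens -= 1.0
        return true
    }
}

fun main() {
    var fakeTime = 0L
    val bucket = TokenBucket(ratePerSecond = 30.0, capacity = 30.0) { fakeTime }
    val firstBurst = (1..31).count { bucket.tryAcquire() }
    println(firstBurst) // prints "30": the 31st call in the same instant is rejected
    fakeTime += 1_000_000_000L // one second later the bucket is full again
    println(bucket.tryAcquire()) // prints "true"
}
```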

12. Migration plan MVP → V2

Item	MVP (V0)	V2 (S+12)
Public lists	7 sources, weekly full + daily delta	+ custom tenant lists (proprietary whitelists/blacklists)
Dow Jones	optional adapter	+ Refinitiv WorldCheck adapter, ComplyAdvantage adapter
RCA depth	≤ 2 denormalized hops	≤ 4 hops via graph DB (Neo4j or Postgres CTE)
Phonetic	Beider-Morse + Soundex	+ custom MENA-tuned (ANSI X3.30)
ML re-ranker	none (deterministic Kotlin)	optional learn-to-rank with a model card for audit
Cache	24 h on (queryNormalized, listVersion)	+ distributed Redis
Reproducibility	OpenSearch snapshots on MinIO	+ automatic replay for random audits
Streaming	batch only	+ transaction streaming (Kafka Streams)

13. MVP go-live checklist

14. References

ADR-001, ADR-002, ADR-006, ADR-029, ADR-030
POC poc-sanctions-matcher
Public lists: see §2.1
Standards: Follow the Money (FtM) schema, Beider-Morse phonetic algorithm, Daitch-Mokotoff soundex
Algorithms: Jaro-Winkler distance, Levenshtein distance, ICU transliterators

Sanctions screening spec document, version 1.0 (2026-04-27). Breaking updates require an ADR.