PII Matching Methods
About
GTR has established standards for verifying PII, based on strict regulatory requirements and with user experience in mind. Please follow these standards to improve verification accuracy.
There are two types of entities to validate: 1. Legal Person (Company/Entity/Enterprise), 2. Natural Person (Individual/Citizen).
Each type has different verification methods and applicable conditions. Please pay attention to the attributes and parameters of each verification standard.
The PII Verify Fields chapter lists all Verify Fields in a table; the Verify Rules ID column of that table refers to the methods described on this page.
For example, field 121001 uses the NAME_FUZZY_VD method to match the data; see the description of that method below the table.
| VerifyFields | IVMS Name | Direction | Entity Type | Verify Rules ID | Format | How to Fill | Other Description | IVMS Field Name |
|---|---|---|---|---|---|---|---|---|
| 121001 | Legal Person Name | OriginatingVASP | Legal Person | NAME_FUZZY_VD | | | Company legal person name | legalPersonName |
Pre-processing
Natural Person Name
The name fields should contain primaryIdentifier (Last Name) and secondaryIdentifier (First Name); the middle name is included as part of the Last Name.
If your KYC/B database cannot distinguish the First Name from the Last Name, put the full name in primaryIdentifier.
Names are always provided as a list, so there may be several names to verify. The matching strategy is: if the name matches any one entry in the list, it should be considered MATCHED.
Natural Person Local Name
Most systems cannot separate non-English names from English names, so every local name is treated as a Natural Person Name and verified in the same array.
You can merge the local name array and the name array into one list and verify it; if any entry matches, the result should be flagged as MATCHED.
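The merge described above can be sketched as follows. The field names (nameIdentifier, localNameIdentifier, primaryIdentifier, secondaryIdentifier) follow IVMS101 naming, but the helper name `collectCandidateNames` and the simplified input shape are assumptions made for illustration, not part of the GTR specification.

```javascript
// Sketch only: collectCandidateNames is a hypothetical helper, and the
// input is an assumed, simplified IVMS101-style natural person name object.
function collectCandidateNames(name) {
  // Treat local names exactly like regular names: one combined list
  const identifiers = [
    ...(name.nameIdentifier || []),
    ...(name.localNameIdentifier || []),
  ];
  const candidates = identifiers
    .map(id => [id.primaryIdentifier, id.secondaryIdentifier]
      .filter(part => typeof part === 'string' && part.trim().length > 0)
      .map(part => part.trim())
      .join(' '))
    .filter(full => full.length > 0);
  // Drop exact duplicates so each candidate is verified once
  return [...new Set(candidates)];
}

const candidates = collectCandidateNames({
  nameIdentifier: [{ primaryIdentifier: 'Wick', secondaryIdentifier: 'John' }],
  localNameIdentifier: [{ primaryIdentifier: '王小明' }],
});
console.log(candidates); // [ 'Wick John', '王小明' ]
```

Each entry of the returned list can then be verified in turn; one match is enough to flag the name as MATCHED.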
Legal Person Name
Legal Person Name is a single name field within the list; if any entry in the list matches, it is considered MATCHED.
NAME_FUZZY_VD
Input: Name-1, Name-2 Output: 0-1 (Similarity) Threshold: 0.8 (recommended)
NAME_FUZZY_VD is the fuzzy matching method for checking the similarity between two names (A, B). It is suited to comparing against a list one entry at a time, and it is case-insensitive.
For example, suppose the list below comes from the decrypted PII data sent by the counterparty VASP.
[
"John Wick",
"Wick John",
"John",
"Wick"
]
In our KYC/B, the name is JohnWick, so the name matching method is applied as follows:
[
NAME_FUZZY_VD("John Wick", "JohnWick"),
NAME_FUZZY_VD("Wick John", "JohnWick"),
NAME_FUZZY_VD("John", "JohnWick"),
NAME_FUZZY_VD("Wick", "JohnWick")
]
and the resulting similarity scores are:
[
0.94,
0.91,
0.6,
0.2
]
Two of the similarity scores in the list are greater than 0.8, so the name should be considered MATCHED.
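The list-matching decision can be sketched as below. `isNameMatched` is a hypothetical helper name, and the scoring function is passed in as a parameter so that any NAME_FUZZY_VD implementation can be plugged in. The stand-in scorer simply replays the scores from the example, and treating a score equal to the threshold as a match is an assumption.

```javascript
// Sketch only: isNameMatched is a hypothetical helper, not a GTR API.
function isNameMatched(receivedNames, kycName, scoreFn, threshold = 0.8) {
  const scores = receivedNames.map(name => scoreFn(name, kycName));
  return {
    scores,
    // Assumption: a score equal to the threshold counts as a match
    matched: scores.some(score => score >= threshold),
  };
}

// Stand-in scorer that replays the example scores;
// a real integration would pass NAME_FUZZY_VD instead.
const exampleScores = { 'John Wick': 0.94, 'Wick John': 0.91, 'John': 0.6, 'Wick': 0.2 };
const replayScore = (received) => exampleScores[received];

const result = isNameMatched(
  ['John Wick', 'Wick John', 'John', 'Wick'],
  'JohnWick',
  replayScore
);
console.log(result.matched); // true
```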
Preprocessing
- Convert to lower case
Convert all names to lowercase.
For example (KYC):
- LastName: Maynard → maynard
- MiddleName: Victor P. → victor p.
- FirstName: Ausburn → ausburn
For example (IVMS101):
- primaryIdentifier: Maynard Victor P. → maynard victor p.
- secondaryIdentifier: Ausburn → ausburn
- legalPersonNameIdentifier: Happy Company Co., Ltd → happy company co., ltd
- Replace with regular expressions
Each field should be cleaned with a regular expression that removes whitespace and special characters. Use the following pattern:
[-,\.\s&%#^?!@{}\[\]()><*"'~\/;:$\\\|\/_=+-]
For example, the lowercased name "maynard victor p. ausburn" becomes "maynardvictorpausburn", and the lowercased company name "happy company co., ltd" becomes "happycompanycoltd" after the pattern is applied.
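The two preprocessing steps (lowercasing, then stripping characters with the pattern) can be sketched together as follows. `normalizeName` and `SPECIAL_CHARS` are hypothetical names chosen for illustration; the character class is the pattern quoted above with the global flag added.

```javascript
// Sketch only: normalizeName is a hypothetical helper name.
// The character class is the documented pattern, applied globally.
const SPECIAL_CHARS = /[-,\.\s&%#^?!@{}\[\]()><*"'~\/;:$\\\|\/_=+-]/g;

function normalizeName(raw) {
  // Step 1: lowercase; step 2: remove whitespace and special characters
  return raw.toLowerCase().replace(SPECIAL_CHARS, '');
}

console.log(normalizeName('Maynard Victor P. Ausburn')); // maynardvictorpausburn
console.log(normalizeName('Happy Company Co., Ltd'));    // happycompanycoltd
```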
GTR's fuzzy matching method uses the algorithm described below:
Algorithm Details
The algorithm measures similarity between two names by combining multiple techniques:
- Tokenization: Split names into individual words
- Sorting: Arrange tokens alphabetically to normalize word order
- Levenshtein Distance: Calculate similarity between token pairs
- Threshold Filtering: Only count matches above similarity threshold (0.7)
- Missing Token Penalty: Penalize unmatched tokens
Step-by-Step Process
Step 1: Preprocessing
- Convert to lowercase
- Remove special characters using regex pattern
- Split into tokens (words)
- Sort tokens alphabetically
function preprocess(name) {
  // Convert to lowercase and remove special characters (including whitespace)
  const cleaned = name.toLowerCase()
    .replace(/[-,\.\s&%#^?!@{}\[\]()><*"'~\/;:$\\\|\/_=+-]/g, '');
  // For token-based matching, replace special characters with spaces, then split on whitespace
  const forTokens = name.toLowerCase()
    .replace(/[-,\.&%#^?!@{}\[\]()><*"'~\/;:$\\\|\/_=+-]/g, ' ')
    .split(/\s+/)
    .filter(token => token.length > 0)
    .sort();
  return { cleaned, tokens: forTokens };
}
// Example:
// Input: "John A. Smith"
// Output: { cleaned: "johnasmith", tokens: ["a", "john", "smith"] }
Step 2: Token Matching with Levenshtein
function levenshteinSimilarity(str1, str2) {
  const maxLen = Math.max(str1.length, str2.length);
  if (maxLen === 0) return 1.0;
  const distance = levenshteinDistance(str1, str2);
  return 1 - (distance / maxLen);
}
function matchTokens(tokens1, tokens2, threshold = 0.7, missingPenalty = 0.2) {
  const smaller = tokens1.length <= tokens2.length ? tokens1 : tokens2;
  const larger = tokens1.length > tokens2.length ? tokens1 : tokens2;
  let totalScore = 0;
  const used = new Set();
  // Match tokens from smaller list to larger list
  for (const token of smaller) {
    let bestMatch = -1;
    let bestScore = 0;
    for (let i = 0; i < larger.length; i++) {
      if (used.has(i)) continue;
      const similarity = levenshteinSimilarity(token, larger[i]);
      if (similarity > bestScore) {
        bestMatch = i;
        bestScore = similarity;
      }
    }
    if (bestScore >= threshold) {
      totalScore += bestScore;
      used.add(bestMatch);
    } else {
      // Missing token penalty
      totalScore += Math.max(0, bestScore - missingPenalty);
    }
  }
  // Penalize unmatched tokens in larger list
  const unmatchedCount = larger.length - used.size;
  totalScore -= unmatchedCount * missingPenalty;
  // Normalize by average token count
  const avgTokenCount = (tokens1.length + tokens2.length) / 2;
  return Math.max(0, Math.min(1, totalScore / avgTokenCount));
}
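The snippets above call levenshteinDistance, which is not defined on this page. A minimal sketch of the standard dynamic-programming edit distance is given below as one possible helper. It is an assumption that GTR uses an equivalent routine. Note that it compares UTF-16 code units, so characters outside the Basic Multilingual Plane count as two units.

```javascript
// Sketch only: a standard dynamic-programming Levenshtein distance,
// keeping a single row of the table in memory at a time.
function levenshteinDistance(a, b) {
  if (a === b) return 0;
  if (a.length === 0) return b.length;
  if (b.length === 0) return a.length;

  // previous[j] = distance between the first i-1 chars of a and the first j chars of b
  let previous = Array.from({ length: b.length + 1 }, (_, j) => j);

  for (let i = 1; i <= a.length; i++) {
    const current = [i];
    for (let j = 1; j <= b.length; j++) {
      const substitutionCost = a[i - 1] === b[j - 1] ? 0 : 1;
      current[j] = Math.min(
        previous[j] + 1,                    // deletion
        current[j - 1] + 1,                 // insertion
        previous[j - 1] + substitutionCost  // substitution (or match)
      );
    }
    previous = current;
  }
  return previous[b.length];
}

console.log(levenshteinDistance('john', 'jon'));       // 1
console.log(levenshteinDistance('kitten', 'sitting')); // 3
```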
Step 3: Complete NAME_FUZZY_VD Implementation
function NAME_FUZZY_VD(name1, name2) {
  const processed1 = preprocess(name1);
  const processed2 = preprocess(name2);
  // Token-based similarity (primary method)
  const tokenSimilarity = matchTokens(processed1.tokens, processed2.tokens);
  // Character-based similarity (fallback for short names)
  const charSimilarity = levenshteinSimilarity(processed1.cleaned, processed2.cleaned);
  // Use token-based if either name has multiple tokens, otherwise character-based
  const hasMultipleTokens = processed1.tokens.length > 1 || processed2.tokens.length > 1;
  return hasMultipleTokens ? tokenSimilarity : charSimilarity;
}
Examples
// Example 1: Different word order
NAME_FUZZY_VD("John Smith", "Smith John")
// Tokens: ["john", "smith"] vs ["john", "smith"]
// Result: 1.0 (sorted tokens are identical)
// Example 2: With typo
NAME_FUZZY_VD("John Smith", "Jon Smith")
// Tokens: ["john", "smith"] vs ["jon", "smith"]
// "john" vs "jon": similarity 0.75 (above 0.7 threshold)
// Result: (0.75 + 1.0) / 2 = 0.875
// Example 3: Missing token
NAME_FUZZY_VD("John A Smith", "John Smith")
// Tokens: ["a", "john", "smith"] vs ["john", "smith"]
// "a" is left unmatched and incurs the 0.2 penalty
// Result: (2.0 - 0.2) / 2.5 = 0.72
// Example 4: Company name with different spacing
NAME_FUZZY_VD("Happy Company Co., Ltd", "HappyCompanyCo Ltd")
// Tokens: ["co", "company", "happy", "ltd"] vs ["happycompanyco", "ltd"]
// Only "ltd" pairs up, so the token-based result is ~0.23
// The cleaned strings are both "happycompanycoltd" (character-based similarity 1.0)
This algorithm is designed to tolerate common name variations such as different word order, typos, missing middle names, and differences in company name formatting.
TYPE
Input: Type-1, Type-2 Output: MATCH, MISMATCH (Boolean)
TYPE refers to the type name ID. It checks whether two Type values are the same; the match must be exact (100%) and is case-insensitive.
Simple Implementation
function TYPE(type1, type2) {
  return type1.toLowerCase() === type2.toLowerCase();
}
Examples
// Example 1: Match
TYPE("CCPT", "ccpt") // true (MATCH)
// Example 2: Mismatch
TYPE("CCPT", "RAID") // false (MISMATCH)
// CCPT = Passport number, RAID = Registration authority identifier - different types
// Example 3: Case insensitive
TYPE("PASSPORT", "passport") // true (MATCH)
ABS_CI
Input: Value-1, Value-2 Output: MATCH, MISMATCH (Boolean)
ABS_CI checks whether two values are the same; the match must be exact (100%) and is case-insensitive.
Simple Implementation
function ABS_CI(value1, value2) {
  return value1.toLowerCase() === value2.toLowerCase();
}
Examples
// Example 1: Country codes match
ABS_CI("US", "us") // true (MATCH)
// Example 2: Country codes mismatch
ABS_CI("US", "UK") // false (MISMATCH)
// Example 3: Case insensitive
ABS_CI("Singapore", "SINGAPORE") // true (MATCH)
// Example 4: Exact match required
ABS_CI("New York", "New York City") // false (MISMATCH)
FUZZY_TEXT
Input: Text-1, Text-2 Output: 0-1 (Similarity) Threshold: 0.7 (recommended)
FUZZY_TEXT is the fuzzy matching method for checking the similarity between two texts (Text-1, Text-2); it is case-insensitive. The simple implementation below returns the Boolean result of comparing the similarity against the threshold.
Simple Implementation
function FUZZY_TEXT(text1, text2, threshold = 0.7) {
  const similarity = levenshteinSimilarity(
    text1.toLowerCase(),
    text2.toLowerCase()
  );
  return similarity >= threshold;
}
function levenshteinSimilarity(str1, str2) {
  const maxLen = Math.max(str1.length, str2.length);
  if (maxLen === 0) return 1.0;
  const distance = levenshteinDistance(str1, str2);
  return 1 - (distance / maxLen);
}
Examples
// Example 1: Partial address match
FUZZY_TEXT("New York City, A Street", "A Street")
// Similarity: ~0.35, Result: false (below 0.7 threshold)
// Example 2: Similar addresses
FUZZY_TEXT("123 Main Street", "123 Main St")
// Similarity: ~0.73, Result: true (MATCH)
// Example 3: Typo in address
FUZZY_TEXT("Wall Street", "Wal Street")
// Similarity: ~0.91, Result: true (MATCH)
// Example 4: Different addresses
FUZZY_TEXT("Wall Street", "Park Avenue")
// Similarity: ~0.18, Result: false (MISMATCH)
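Because the method is specified with a 0-1 similarity output and a recommended threshold, callers may want both values at once. The variant below is a sketch under that assumption. `fuzzyTextScore` and `editDistance` are hypothetical names, and the edit distance is a standard Levenshtein implementation included only to keep the example self-contained.

```javascript
// Sketch only: fuzzyTextScore is a hypothetical variant of FUZZY_TEXT that
// returns the similarity alongside the threshold decision.
function editDistance(a, b) {
  // Standard Levenshtein distance, one table row at a time
  let prev = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const curr = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr[j] = Math.min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost);
    }
    prev = curr;
  }
  return prev[b.length];
}

function fuzzyTextScore(text1, text2, threshold = 0.7) {
  const a = text1.toLowerCase();
  const b = text2.toLowerCase();
  const maxLen = Math.max(a.length, b.length);
  const similarity = maxLen === 0 ? 1 : 1 - editDistance(a, b) / maxLen;
  return { similarity, matched: similarity >= threshold };
}

const score = fuzzyTextScore('Wall Street', 'Wal Street');
console.log(score.similarity.toFixed(2), score.matched); // 0.91 true
```

Returning the raw similarity makes it easier to log borderline cases or to apply a stricter threshold per field.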
POST_CODE
Input: PostCode-1, PostCode-2 Output: MATCH, MISMATCH (Boolean)
POST_CODE checks whether two PostCode values are the same; the match must be exact (100%) and is case-insensitive.
Before comparison, postcodes are preprocessed to remove all non-digit characters using the pattern below:
[^0-9]
Simple Implementation
function POST_CODE(postcode1, postcode2) {
  // Remove all non-digit characters
  const cleaned1 = postcode1.replace(/[^0-9]/g, '');
  const cleaned2 = postcode2.replace(/[^0-9]/g, '');
  return cleaned1 === cleaned2;
}
Examples
// Example 1: Same postcode different format
POST_CODE("171-0023", "1710023") // true (MATCH)
// Both become "1710023" after preprocessing
// Example 2: Different postcodes
POST_CODE("171-0023", "249-3203") // false (MISMATCH)
// "1710023" vs "2493203"
// Example 3: Complex formatting
POST_CODE("SW1A 1AA", "SW1A1AA") // true (MATCH)
// Both become "11" after removing non-digits (letters and spaces are dropped)
// Example 4: US ZIP codes
POST_CODE("10001-1234", "10001") // false (MISMATCH)
// "100011234" vs "10001"
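One consequence of the digits-only preprocessing is easier to see in code: alphanumeric postcodes keep only their digits, so distinct codes can reduce to the same string. The sketch below illustrates this. `digitsOnly` is a hypothetical helper name that mirrors the [^0-9] pattern above, and the sample postcodes are chosen only for illustration.

```javascript
// Sketch only: digitsOnly mirrors the [^0-9] preprocessing described above.
function digitsOnly(postcode) {
  return postcode.replace(/[^0-9]/g, '');
}

console.log(digitsOnly('171-0023')); // 1710023
console.log(digitsOnly('SW1A 1AA')); // 11
console.log(digitsOnly('EC1A 1BB')); // 11

// Two different alphanumeric postcodes compare as equal after preprocessing
console.log(digitsOnly('SW1A 1AA') === digitsOnly('EC1A 1BB')); // true
```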
NONE
NONE means this field is not used for matching or verification.