Avoiding duplicate clients¶
Lino Welfare offers some functionality for avoiding duplicate
Client
records.
Side note: Code snippets (lines starting with >>>
) in this document get
tested as part of our development workflow. The following
initialization snippet tells you which demo project is being used in
this document.
>>> import lino
>>> lino.startup('lino_welfare.projects.gerd.settings.doctests')
>>> from lino.api.doctest import *
In Lino Welfare, a Client
inherits from DupableClient
.
Phonetic words¶
See lino.mixins.dupable.PhoneticWordBase
.
>>> rt.show(pcsw.CoachedClients, column_names="id name dupable_words")
...
===== ======================= =========================================
ID Name dupable_words
----- ----------------------- -----------------------------------------
116 Ausdemwald Alfons `ASTMLT <…>`__, `ALFNS <…>`__
177 Brecht Bernd `PRKT <…>`__, `PRNT <…>`__
118 Collard Charlotte `KLRT <…>`__, `XRLT <…>`__
124 Dobbelstein Dorothée `TPLSTN <…>`__, `TR0 <…>`__
179 Dubois Robin `TP <…>`__, `RPN <…>`__
128 Emonts Daniel `AMNTS <…>`__, `TNL <…>`__
152 Emonts-Gast Erna `AMNTS <…>`__, `KST <…>`__, `ARN <…>`__
129 Engels Edgar `ANJLS <…>`__, `ATKR <…>`__
127 Evers Eberhart `AFRS <…>`__, `APRRT <…>`__
132 Groteclaes Gregory `KRTKLS <…>`__, `KRKR <…>`__
133 Hilgers Hildegard `HLKRS <…>`__, `HLTKRT <…>`__
137 Jacobs Jacqueline `JKPS <…>`__, `JKLN <…>`__
181 Jeanémart Jérôme `JNMRT <…>`__, `JRM <…>`__
139 Jonas Josef `JNS <…>`__, `JSF <…>`__
141 Kaivers Karl `KFRS <…>`__, `KRL <…>`__
178 Keller Karl `KLR <…>`__, `KRL <…>`__
142 Lambertz Guido `LMPRTS <…>`__, `KT <…>`__
144 Lazarus Line `LSRS <…>`__, `LN <…>`__
146 Malmendier Marc `MLMNT <…>`__, `MRK <…>`__
147 Meessen Melissa `MSN <…>`__, `MLS <…>`__
153 Radermacher Alfons `RTRMKR <…>`__, `ALFNS <…>`__
155 Radermacher Christian `RTRMKR <…>`__, `KRSXN <…>`__
157 Radermacher Edgard `RTRMKR <…>`__, `ATKRT <…>`__
159 Radermacher Guido `RTRMKR <…>`__, `KT <…>`__
161 Radermacher Hedi `RTRMKR <…>`__, `HT <…>`__
173 Radermecker Rik `RTRMKR <…>`__, `RK <…>`__
165 da Vinci David `FNS <…>`__, `TFT <…>`__
166 van Veen Vincent `FN <…>`__, `FNSNT <…>`__
168 Östges Otto `ASTJS <…>`__, `AT <…>`__
===== ======================= =========================================
Similar Clients¶
The test database contains a fictive person named Dorothée Dobbelstein-Demeulenaere as an example of accidental duplicate data entry. Dorothée exists 3 times in our database:
>>> for p in pcsw.Client.objects.filter(name__contains="Dorothée"):
... print(str(p))
...
DEMEULENAERE Dorothée (122)
DOBBELSTEIN-DEMEULENAERE Dorothée (123*)
DOBBELSTEIN Dorothée (124)
The detail window of each of these records shows some of the other records in the SimilarClients table:
>>> translation.activate("en")
>>> rt.show(dupable_clients.SimilarClients, pcsw.Client.objects.get(pk=122))
`DOBBELSTEIN-DEMEULENAERE Dorothée (123*) <…>`__ Phonetic words: TMLNR, TR0
>>> rt.show(dupable_clients.SimilarClients, pcsw.Client.objects.get(pk=123))
...
`DEMEULENAERE Dorothée (122) <…>`__ `DOBBELSTEIN Dorothée (124) <…>`__ Phonetic words: TPLSTN, TMLNR, TR0
>>> rt.show(dupable_clients.SimilarClients, pcsw.Client.objects.get(pk=124))
...
`DOBBELSTEIN-DEMEULENAERE Dorothée (123*) <…>`__ Phonetic words: TPLSTN, TR0
Note how the result can differ depending on the partner. Our algorithm is not perfect and does not detect all duplicates.
Checked at input¶
If a user tries to create a fourth record of that person, then Lino will ask a confirmation first:
>>> data = dict(an="submit_insert")
>>> data.update(first_name="Dorothée")
>>> data.update(last_name="Dobbelstein")
>>> data.update(genderHidden="F")
>>> data.update(gender="Weiblich")
>>> test_client.force_login(rt.login('robin').user)
>>> res = test_client.post('/api/pcsw/Clients', data=data, REMOTE_USER="robin")
>>> res.status_code
200
>>> r = json.loads(res.content)
>>> print(r['message'])
There are 2 similar Clients:<br/>
DOBBELSTEIN-DEMEULENAERE Dorothée (123*)<br/>
DOBBELSTEIN Dorothée (124)<br/>
Are you sure you want to create a new Client named Mrs Dorothée DOBBELSTEIN?
This is because lino.mixins.dupable.Dupable
replaces
the standard submit_insert action by the CheckedSubmitInsert
action.
The algorithm¶
The alarm bell rings when there are two similar name components in both first and last name. Punctuation characters (like “-” or “&” or “,”) are ignored, and also the ordering of elements does not matter.
The current implementation splits the name
of each client into its parts,
removing punctuation characters, computes a phonetic version using the
NYSIIS algorithm
and stores them in a separate database table.
How good (how bad) is our algorithm? See the source code of lino.projects.min2.tests.test_min2.