Disambiguating USPTO inventor names with semantic fingerprinting and DBSCAN clustering

Published date: 01 April 2019
DOI: https://doi.org/10.1108/EL-12-2018-0232
Pages: 225-239
Author: Hongqi Han, Yongsheng Yu, Lijun Wang, Xiaorui Zhai, Yaxin Ran, Jingpeng Han
Subject Matter: Information & knowledge management
Hongqi Han
Data Mining Group, Institute of Scientific and Technical Information of China,
Haidian-qu, China and Key Laboratory of Rich-media Knowledge Organization
and Service of Digital Publishing Content, SAPPRFT, Beijing, China
Yongsheng Yu
Institute of Scientific and Technical Information of China, Haidian-qu, China
Lijun Wang, Xiaorui Zhai and Yaxin Ran
Data Mining Group, Institute of Scientific and Technical Information of China,
Haidian-qu, China, and
Jingpeng Han
Beijing University of Technology, Beijing, China
Abstract
Purpose: The aim of this study is to present a novel approach to inventor disambiguation based on
semantic fingerprinting, which converts inventor records into 128-bit semantic fingerprints, and a
clustering algorithm called density-based spatial clustering of applications with noise (DBSCAN).
Inventor disambiguation is a method used to discover a unique set of underlying inventors and map a
set of patents to their corresponding inventors. Resolving the ambiguities between inventors is
necessary to improve the quality of the patent database and to ensure accurate entity-level analysis.
Most existing methods are based on machine learning and, while they often show good performance,
this comes at the cost of time, computational power and storage space.
Design/methodology/approach: The metadata and textual data in inventor records are first converted
into 128-bit semantic fingerprints. Rather than using string comparison or cosine similarity to
calculate the distance between pairs of fingerprint records, a binary number comparison function is
used as the distance measure. DBSCAN then clusters the inventor records based on this distance to
disambiguate inventor names.
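The pipeline described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify the fingerprinting algorithm or the comparison function, so the sketch assumes a SimHash-style fingerprint built from MD5 token hashes and treats the "binary number comparison" as a Hamming distance (XOR followed by a bit count). The function names, the sample records, and the `eps` and `min_pts` values are all hypothetical.

```python
import hashlib
from collections import Counter

BITS = 128  # fingerprint width stated in the abstract


def fingerprint(tokens):
    """SimHash-style 128-bit fingerprint (assumed stand-in, not the paper's algorithm)."""
    weights = [0] * BITS
    for token, count in Counter(tokens).items():
        h = int.from_bytes(hashlib.md5(token.encode("utf-8")).digest(), "big")
        for i in range(BITS):
            # Each token votes +count or -count on every bit position.
            weights[i] += count if (h >> i) & 1 else -count
    return sum(1 << i for i in range(BITS) if weights[i] > 0)


def hamming(a, b):
    """Number of differing bits between two integer fingerprints."""
    return bin(a ^ b).count("1")


def dbscan(points, eps, min_pts, dist):
    """Plain DBSCAN; returns one label per point, with -1 meaning noise."""
    labels = [None] * len(points)

    def neighbours(i):
        # Includes point i itself, as in the usual DBSCAN definition.
        return [j for j in range(len(points)) if dist(points[i], points[j]) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:
            labels[i] = -1  # provisionally noise; may become a border point later
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nbrs = neighbours(j)
            if len(nbrs) >= min_pts:  # j is a core point, so keep expanding
                queue.extend(nbrs)
    return labels


# Hypothetical inventor records: name, location and title words as tokens.
records = [
    "john smith boston ma wireless antenna signal receiver".split(),
    "john smith boston ma wireless antenna signal amplifier".split(),
    "john smith austin tx polymer coating resin adhesive".split(),
]
fps = [fingerprint(r) for r in records]
labels = dbscan(fps, eps=40, min_pts=2, dist=hamming)  # eps chosen arbitrarily
print(labels)
```

Because each fingerprint is stored as a single integer, every pairwise comparison reduces to one XOR and a bit count, which is consistent with the time and storage savings the abstract reports. The threshold `eps` would need tuning on real data.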
Findings: Experiments conducted on the PatentsView campaign database of the United States Patent and
Trademark Office show that this method disambiguates inventor names with recall greater than 99 percent
in less time and with a substantially smaller storage requirement.
Research limitations/implications: A better semantic fingerprint algorithm and a better distance
function may improve precision. Setting different clustering parameters for each block, or using other
clustering algorithms, will be considered to improve the accuracy of the disambiguation results even
further.
Originality/value: Compared with the existing methods, the proposed method does not rely on feature
selection and complex feature comparison computation. Most importantly, running time and storage
requirements are drastically reduced.
Keywords: Cluster analysis, Patent analysis, Inventor name disambiguation,
Semantic fingerprinting
Paper type: Research paper
Received 1 December 2018
Revised 9 March 2019
23 March 2019
Accepted 24 March 2019
The Electronic Library
Vol. 37 No. 2, 2019
pp. 225-239
© Emerald Publishing Limited
0264-0473
DOI 10.1108/EL-12-2018-0232
The current issue and full text archive of this journal is available on Emerald Insight at:
www.emeraldinsight.com/0264-0473.htm
