Domain-specific word embeddings for patent classification

Published: 4 February 2019
Authors: Julian Risch, Ralf Krestel
Subject matter: Library & information science
Hasso Plattner Institute, University of Potsdam, Potsdam, Germany
Purpose: Patent offices and other stakeholders in the patent domain need to classify patent applications according to a standardized classification scheme. To examine the novelty of an application, it can then be compared to previously granted patents in the same class. Automatic classification would be highly beneficial because of the large volume of patents and the domain-specific knowledge needed to accomplish this costly manual task. However, a challenge for the automation is patent-specific language use, such as special vocabulary and phrases.
Design/methodology/approach: To account for this language use, the authors present domain-specific pre-trained word embeddings for the patent domain. The authors train the model on a very large data set of more than 5m patents and evaluate it on the task of patent classification. To this end, the authors propose a deep learning approach based on gated recurrent units for automatic patent classification, built on the trained word embeddings.
Findings: Experiments on a standardized evaluation data set show that the approach increases average precision for patent classification by 17 percent compared to state-of-the-art approaches. In this paper, the authors further investigate the model's strengths and weaknesses. An extensive error analysis reveals that the learned embeddings indeed mirror patent-specific language use. The imbalanced training data and underrepresented classes remain the most difficult challenge.
Originality/value: The proposed approach fulfills the need for domain-specific word embeddings for downstream tasks in the patent domain, such as patent classification or patent analysis.
Keywords: Deep learning, Document classification, Word embedding, Patents
Paper type: Research paper
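The classifier described in the abstract runs a gated recurrent unit (GRU) over a sequence of pre-trained word embeddings and classifies the final hidden state. The following is a minimal NumPy sketch of the standard GRU recurrence, not the authors' actual implementation; all parameter names, shapes, and the softmax output layer are illustrative assumptions.

```python
import numpy as np

def gru_step(x, h, params):
    """One GRU step: gates decide how much of the previous state to keep.
    x: word embedding (d,), h: previous hidden state (n,)."""
    Wz, Uz, bz, Wr, Ur, br, Wh, Uh, bh = params  # illustrative parameter layout
    sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))
    z = sigmoid(x @ Wz + h @ Uz + bz)              # update gate
    r = sigmoid(x @ Wr + h @ Ur + br)              # reset gate
    h_tilde = np.tanh(x @ Wh + (r * h) @ Uh + bh)  # candidate state
    return (1.0 - z) * h + z * h_tilde             # interpolate old and new state

def classify_document(embeddings, params, W_out, b_out):
    """Run the GRU over a document's word embeddings and return a class
    distribution from the final hidden state (softmax output layer)."""
    hidden = np.zeros(params[1].shape[0])  # Uz is (n, n), so this is (n,)
    for x in embeddings:
        hidden = gru_step(x, hidden, params)
    logits = hidden @ W_out + b_out
    exp = np.exp(logits - logits.max())    # numerically stable softmax
    return exp / exp.sum()
```

In the paper's setting, `embeddings` would be the rows of the domain-specific embedding matrix looked up for each token of a patent text, and the output classes would be the standardized patent classification labels.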
1. Introduction
In 2018, 308,853 US patents were granted by the US Patent and Trademark Office (USPTO), the second largest number of grants ever (rankings-trends-2018.htm). All granted US patents since 1976 are publicly available as full text. These large text collections represent an extensive amount of human knowledge in an almost unstructured form. This makes mining information from them challenging, and automatic classification and retrieval a hard problem.
Not only the number of documents, but also the patent-specific vocabulary makes these tasks more difficult. Because of the underlying legal purpose of patent documents, they follow a specific writing style. Patent applications need to define the scope of an invention and need to delimit it from others whilst covering as much variation as possible. As a consequence, patent descriptions use vague language. For example, a patent calls an invention an "electronic still camera" and an "electronic imaging apparatus", whereas such a device is called a "digital camera" in colloquial speech (Figure 1). A patent's claims are a controversial subject, because a patent grants rights and also limits the rights of others. Patents grant a monopoly for a limited time in exchange for the disclosure of the invention so that others can license it.
Unstructured text sections, such as abstracts, descriptions and claims, make up the
largest part of a patent. The claims section is essential for defining the scope of an invention.
It describes the extent of the monopoly rights granted by the patent. Court decisions of the
past precisely define the meaning of patent speak. An example is the slight difference between "consist of" and "comprise" (4_21.htm): "consist of" implies an exhaustive enumeration, whereas "comprise" commences an enumeration that is not necessarily exhaustive. Because of this patent-specific language use, classifying patents is challenging even for domain experts.
Data Technologies and Applications
Vol. 53 No. 1, 2019
pp. 108-122
© Emerald Publishing Limited
DOI 10.1108/DTA-01-2019-0002
Received 11 January 2019
Revised 4 March 2019
Accepted 4 March 2019