Domain-specific word embeddings
for patent classification
Julian Risch and Ralf Krestel
Hasso Plattner Institute, University of Potsdam, Potsdam, Germany
Purpose – Patent offices and other stakeholders in the patent domain need to classify patent applications
according to a standardized classification scheme. To examine the novelty of an application, it can then be
compared to previously granted patents in the same class. Automatic classification would be highly
beneficial, because of the large volume of patents and the domain-specific knowledge needed to accomplish
this costly manual task. However, a challenge for the automation is patent-specific language use, such as
special vocabulary and phrases.
Design/methodology/approach – To account for this language use, the authors present domain-specific
pre-trained word embeddings for the patent domain. The authors train the model on a very large data set of
more than 5m patents and evaluate it on the task of patent classification. To this end, the authors propose a
deep learning approach based on gated recurrent units for automatic patent classification, built on the
trained word embeddings.
Findings – Experiments on a standardized evaluation data set show that the approach increases average
precision for patent classification by 17 percent compared to state-of-the-art approaches. In this paper, the
authors further investigate the model’s strengths and weaknesses. An extensive error analysis reveals that
the learned embeddings indeed mirror patent-specific language use. The imbalanced training data and
underrepresented classes are the most difficult remaining challenge.
Originality/value – The proposed approach fulfills the need for domain-specific word embeddings for
downstream tasks in the patent domain, such as patent classification or patent analysis.
Keywords Deep learning, Document classification, Word embedding, Patents
Paper type Research paper
In 2018, 308,853 US patents were granted by the US Patent and Trademark Office
(USPTO), the second largest number of grants ever (www.ificlaims.com/
rankings-trends-2018.htm). All granted US patents since 1976 are publicly available as full
text (https://bulkdata.uspto.gov/). These large text collections represent an extensive amount
of human knowledge in an almost unstructured form. This makes mining information from
them challenging and automatic classification and retrieval a hard problem.
Not only the number of documents, but also the patent-specific vocabulary makes the tasks
more difficult. Because of the underlying legal purpose of patent documents, they follow a
specific writing style. Patent applications need to define the scope of an invention and need to
delimit it from others whilst covering as much variation as possible. As a consequence, patent
descriptions use vague language. For example, a patent calls an invention “electronic still
camera” and “electronic imaging apparatus,” whereas such a device is called “digital camera”
in colloquial speech (Figure 1). A patent’s claims are a controversial subject, because a patent
grants rights and also limits the rights of others. Patents grant a monopoly for a limited time
in exchange for the disclosure of the invention so that others can license it.
Unstructured text sections, such as abstracts, descriptions and claims, make up the
largest part of a patent. The claims section is essential for defining the scope of an invention.
It describes the extent of the monopoly rights granted by the patent. Court decisions of the
past precisely define the meaning of “patent speak.” An example is the slight difference of
“consist of” and “comprise” (www.epo.org/law-practice/legal-texts/html/guidelines/e/f_iv_
4_21.htm): “consist of” implies an exhaustive enumeration, whereas “comprise” commences
an enumeration that is not necessarily exhaustive. Classifying patents is challenging
because of patent-specific language use – even for domain experts.
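The classifier proposed in the abstract is based on gated recurrent units operating over word embeddings. As a minimal sketch of what a single GRU step computes, the following NumPy code implements the standard update-gate/reset-gate recurrence on a toy sequence of embedding vectors; the dimensions and the randomly initialized weights are placeholders, not the authors' trained parameters.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x, h, Wz, Uz, bz, Wr, Ur, br, Wh, Uh, bh):
    """One GRU step: gates decide how much of the previous state to keep."""
    z = sigmoid(Wz @ x + Uz @ h + bz)             # update gate
    r = sigmoid(Wr @ x + Ur @ h + br)             # reset gate
    h_cand = np.tanh(Wh @ x + Uh @ (r * h) + bh)  # candidate state
    return (1 - z) * h + z * h_cand               # interpolate old and new

# Toy dimensions: 4-dim word embeddings, 3-dim hidden state (placeholders).
rng = np.random.default_rng(0)
d_in, d_h = 4, 3
params = [rng.standard_normal(s) * 0.1
          for s in [(d_h, d_in), (d_h, d_h), (d_h,)] * 3]

h = np.zeros(d_h)
for x in rng.standard_normal((5, d_in)):  # a sequence of 5 embedding vectors
    h = gru_step(x, h, *params)
print(h.shape)  # final state, (3,); a classifier head would map this to classes
```

In the paper's setting, the inputs would be the domain-specific patent embeddings, and the final hidden state would feed a classification layer over the patent classes.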
Data Technologies and Applications
Vol. 53 No. 1, 2019
© Emerald Publishing Limited
Received 11 January 2019
Revised 4 March 2019
Accepted 4 March 2019