Migrating annotations from PDF files to Linked Data

Revision as of 19:00, 15 April 2019 by Nico.schlitter


The annotation of books is a very old scholarly practice, especially in the humanities. With the mass digitization of cultural heritage objects (such as historical books), digital collaborative annotation is gaining importance and enables new research methods that combine manual annotations with algorithmic annotation processes. One common way of annotating digitized books is to use the commenting/annotation functionality of popular PDF editors. Annotations in PDFs, however, are poorly suited to research data management and are difficult to process in further data analysis. The goal of the project is to retrieve annotations from PDF files containing images of book pages and to convert them to the Web Annotation Data Model (WADM) [1] standard. This data model allows the annotations to be used in a client-server architecture based on the Web Annotation Protocol [2]. The annotation server offers CRUD operations via RESTful interfaces [3], and the annotations can be analyzed via SPARQL [4] queries. (Project supervision in German or English.)
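As a rough illustration of the client-server interaction described above, the following sketch prepares (without sending) a request that creates an annotation according to the Web Annotation Protocol [2] and a request that runs a SPARQL [4] query. The server URLs, the function names, and the assumption that every annotation target is stored with an explicit source are hypothetical and not taken from the project; only the media type and the `oa:` vocabulary terms come from the W3C specifications.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request

# Media type used by the Web Annotation Protocol for annotation payloads.
ANNO_MEDIA_TYPE = 'application/ld+json; profile="http://www.w3.org/ns/anno.jsonld"'

# Lists each annotation with the resource it targets. Assumes that targets
# are stored as resources carrying an explicit oa:hasSource (an assumption).
SELECT_TARGETS = """
PREFIX oa: <http://www.w3.org/ns/oa#>
SELECT ?anno ?source
WHERE {
  ?anno a oa:Annotation ;
        oa:hasTarget ?target .
  ?target oa:hasSource ?source .
}
"""


def build_create_request(container_url: str, annotation: dict) -> Request:
    """Prepare (but do not send) a POST that adds an annotation to a container."""
    return Request(
        container_url,
        data=json.dumps(annotation).encode("utf-8"),
        headers={"Content-Type": ANNO_MEDIA_TYPE, "Accept": ANNO_MEDIA_TYPE},
        method="POST",
    )


def build_sparql_request(endpoint_url: str, query: str) -> Request:
    """Prepare (but do not send) a SPARQL query as an HTTP GET request."""
    return Request(
        endpoint_url + "?" + urlencode({"query": query}),
        headers={"Accept": "application/sparql-results+json"},
    )


# Hypothetical URLs; the real server's container and endpoint will differ.
create = build_create_request(
    "http://localhost:8080/annotations/",
    {"@context": "http://www.w3.org/ns/anno.jsonld", "type": "Annotation"},
)
search = build_sparql_request("http://localhost:8080/sparql", SELECT_TARGETS)
print(create.get_method(), create.full_url)
print(search.get_method(), search.full_url)
```

Passing either object to `urllib.request.urlopen` would send it; the sketch stops short of that because it does not assume a running server.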


  • Exporting images and annotation data from PDFs
  • Designing a data model for the annotations compliant with the WADM [1]
  • Transforming the annotation data to the new model
  • Ingesting the data into an existing RDF [5] annotation server via REST APIs
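
The transformation step in the list above could be sketched roughly as follows. This is a minimal illustration, not the project's data model: the shape of the input dictionary (`contents`, `author`, `rect`), the function names, the `commenting` motivation, and the choice of a media-fragment selector are all assumptions. It also assumes that each PDF page is exported as one image, that PDF coordinates are given in points with the origin at the lower-left corner of the page, and that `scale` is the number of image pixels per PDF point.

```python
import json

ANNO_CONTEXT = "http://www.w3.org/ns/anno.jsonld"
MEDIA_FRAGMENTS = "http://www.w3.org/TR/media-frags/"


def pdf_rect_to_xywh(rect, page_height_pt, scale):
    """Convert a PDF rectangle (x0, y0, x1, y1) to an image-space xywh fragment.

    PDF y-coordinates grow upwards, image y-coordinates grow downwards,
    so the top edge of the box is measured from the top of the page.
    """
    x0, y0, x1, y1 = rect
    x = round(x0 * scale)
    y = round((page_height_pt - y1) * scale)
    w = round((x1 - x0) * scale)
    h = round((y1 - y0) * scale)
    return f"xywh={x},{y},{w},{h}"


def to_web_annotation(pdf_anno, image_url, page_height_pt, scale):
    """Build a Web Annotation (as a JSON-LD dict) from one extracted PDF comment."""
    annotation = {
        "@context": ANNO_CONTEXT,
        "type": "Annotation",
        "motivation": "commenting",  # assumed; other motivations may fit better
        "body": {
            "type": "TextualBody",
            "value": pdf_anno["contents"],
            "format": "text/plain",
        },
        "target": {
            "source": image_url,
            "selector": {
                "type": "FragmentSelector",
                "conformsTo": MEDIA_FRAGMENTS,
                "value": pdf_rect_to_xywh(pdf_anno["rect"], page_height_pt, scale),
            },
        },
    }
    if pdf_anno.get("author"):
        annotation["creator"] = {"type": "Person", "name": pdf_anno["author"]}
    return annotation


# Hypothetical input, as it might come out of the PDF export step.
example = {"contents": "marginal note", "author": "A. Reader", "rect": (72, 600, 172, 650)}
result = to_web_annotation(example, "http://example.org/book/page-001.png", 792, 2.0)
print(json.dumps(result, indent=2))
```

Targeting the exported page image rather than the PDF keeps the annotation independent of the original file, at the cost of having to track the scale factor used during image export.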


Götzelmann germaine.goetzelmann@kit.edu


[1] https://www.w3.org/TR/annotation-model/

[2] https://www.w3.org/TR/annotation-protocol/

[3] https://restfulapi.net/

[4] https://www.w3.org/TR/rdf-sparql-query/

[5] https://www.w3.org/TR/rdf-primer/