Deliverables Verrijkt Koninkrijk

The data source for the Verrijkt Koninkrijk project is the PDF collection at http://www.niod.nl/koninkrijk/default.asp, which comprises a color-scanned and OCR'ed version of the complete scientific edition of “Het Koninkrijk der Nederlanden in de Tweede Wereldoorlog”, written by L. de Jong.

These pdf files have each been transformed into XML with the open-source tool pdf2xml:

To clean up some of the most obvious OCR mistakes, such as stray illegible characters caused by dirt on the scanner bed or page, we ran a pre-processing clean-up step on all documents with the following XSLT script:

The resulting XML documents were transformed into the final book format with the following XSLT script:

http://schema.loedejongdigitaal.nl/book.rnc (Note that this links to an html version of the schema. The original file on which it is based can be found by changing .html back to .rnc)

Description of the data format

The data is made available in the following formats at EASY DANS (link through the assigned Persistent Identifier):

General Description

This is a collection of datasets related to the work of Loe de Jong: "Het Koninkrijk der Nederlanden in de Tweede Wereldoorlog". This is the standard reference on the history of the Netherlands during World War II, and was digitally enriched and curated for CLARIN-NL.

1) The loedejongdigitaal.nl xml data

This dataset comprises all of the XML-enriched books from the loedejongdigitaal.nl XML collection. The collection is organized as a set of 30 XML files, each corresponding to one of the printed volumes and based on the associated PDF with scan and OCR data.

2) The loedejongdigitaal.nl xml data further enriched with semantic analysis

3) The Named Entities detected in the loedejongdigitaal.nl xml data in table format

This dataset comprises all of the named entities detected in the loedejongdigitaal.nl XML collection. The database is organized as a table containing the named entity text, type, and paragraph identifier. Where relevant and available, it also contains a Dutch Wikipedia link and an English Wikipedia link.

For related datasets see the thematic collection: 'Verrijkt Koninkrijk'. You can find a link to this collection under 'Relations'.

4) The Semantic Layer (RDF/XML Data)

This dataset contains RDF data in XML format for the Linked Data version of the semanticized named entities and back-of-the-book terms of the loedejongdigitaal.nl collection.

Description of the data xml format

Each document is a UTF-8 encoded XML file that is valid with respect to book.rnc, a schema in RELAX NG compact syntax. The structure of the documents is as follows. The root element root of each document contains 3 elements:

The book element is created by automatically detecting a number of visual and textual cues that occur throughout the pages. Thanks to the relatively consistent layout used across the different parts (with the unfortunate exception of 'deel 14', i.e. part 14), it was possible to use the same feature detector for all books.

Data post processing

The back of the book will be enriched with data from the books themselves, and the pages from those books will be enriched with data from the back of the book.

This creates a co-dependency in the transformation process, which is resolved by running the transformation a second time after the first pass has completed.

Back of the book

The lemmas in the back-of-the-book (nl.vk.d.reg) contain page references. Because paragraphs are the smallest resolvable element in the curated collection, each page reference has been supplemented with references to all paragraphs that partially or completely overlap the given page.
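The overlap rule described above can be illustrated with a minimal sketch. The Paragraph class, its field names, and the sample identifiers are hypothetical and only serve the illustration; the actual transformation is done in XSLT and may work differently.

```python
from dataclasses import dataclass


@dataclass
class Paragraph:
    # All field names are hypothetical, chosen for this illustration only.
    par_id: str      # paragraph identifier
    first_page: int  # page on which the paragraph starts
    last_page: int   # page on which the paragraph ends


def paragraphs_for_page(page, paragraphs):
    """Return the ids of all paragraphs that partly or fully overlap `page`."""
    return [p.par_id for p in paragraphs if p.first_page <= page <= p.last_page]


# Made-up sample data: p2 runs over a page break, so it belongs to both pages.
paragraphs = [
    Paragraph("p1", 10, 10),
    Paragraph("p2", 10, 11),
    Paragraph("p3", 11, 12),
]

print(paragraphs_for_page(11, paragraphs))
```

A paragraph that spans a page break is returned for both pages, which matches the "partial or complete overlap" rule.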

Pages

To each page element a backof-book-ref element is added if there are one or more lemmas that refer to that specific page. These lemma references may function as a 'summary' of a given page, or be used in a visualization to let users navigate easily to other pages that are related to the current page via the lemmas. Example:

Statistics for the collection

Element	Count
vk:book	30
vk:chapter	226
vk:section	1885
vk:subsection	4708
vk:p	86257
vk:quote	56547
vk:foreword	6
vk:statement	2
vk:appendix	92
vk:corrections	2
vk:header	16015
vk:footer	7881
vk:page	16922
vk:backofbook	1
vk:block	80
vk:lemma	16186
vk:lemma[.//vk:page-ref/@vk:page-ref]	15369
vk:lemma-ref	148370
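Counts like the ones in this table could be reproduced roughly as follows. This is a sketch, not the script that produced the table: it counts elements by local name so that no namespace URI has to be assumed, and the sample document and its namespace URI are made up.

```python
from collections import Counter
import xml.etree.ElementTree as ET


def count_elements(xml_text):
    """Count elements by local name, ignoring the namespace URI."""
    counts = Counter()
    for elem in ET.fromstring(xml_text).iter():
        # ElementTree tags look like '{namespace-uri}local'; keep only 'local'.
        local = elem.tag.rsplit("}", 1)[-1]
        counts[local] += 1
    return counts


# Made-up miniature document; the namespace URI is a placeholder.
sample = """<root xmlns:vk="http://example.org/vk">
  <vk:book><vk:chapter><vk:section><vk:p/><vk:p/></vk:section></vk:chapter></vk:book>
</root>"""

for name, number in sorted(count_elements(sample).items()):
    print(name, number)
```

For the full collection, the counters of the 30 files would be summed. The XPath-style row vk:lemma[.//vk:page-ref/@vk:page-ref] needs a real XPath engine and is not covered by this sketch.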

D3

For the purpose of this research, Het Koninkrijk has been subdivided hierarchically as described below, and each element of the hierarchy has been given a unique identifier. The XML snippet corresponding to each element/identifier can be obtained via:

http://resolver.loedejongdigitaal.nl/<id>

where <id> is the identifier.
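As a minimal sketch, the resolver URL for an identifier can be built and fetched like this. The function names are hypothetical, and decoding the response as UTF-8 is an assumption based on the encoding of the source documents; the resolver's actual response headers have not been checked.

```python
from urllib.parse import quote
from urllib.request import urlopen

RESOLVER = "http://resolver.loedejongdigitaal.nl/"


def resolver_url(identifier):
    """Build the resolver URL for an element identifier."""
    # Identifiers consist of letters, digits, '.' and '-', which quote() leaves as-is.
    return RESOLVER + quote(identifier, safe="")


def fetch_snippet(identifier, timeout=10):
    """Fetch the XML snippet for an identifier (assumes a UTF-8 response)."""
    with urlopen(resolver_url(identifier), timeout=timeout) as response:
        return response.read().decode("utf-8")


print(resolver_url("nl.vk.d.11a-2.2.1.2.6.6"))
```

Calling fetch_snippet requires network access and a running resolver service, so it is left out of the printed example.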

The highest level of organisation is the volume. In the case of a logical volume consisting of multiple subvolumes, these subvolumes are what we consider the volumes and the logical volumes are not represented in the hierarchy. A volume has the XML tag vk:book and occurs as a child element of a root element (which also contains Dublin Core metadata).
Below the level of volumes are elements with XML tags vk:index (table of contents), vk:chapter (chapter of main text), vk:appendix, vk:foreword, vk:corrections, vk:statement, and vk:backofbook (the inverted index in the book). The final two only occur once each; vk:backofbook is the main body of the final volume of Het Koninkrijk.
Below vk:chapter are elements of type vk:section, sometimes (but not always) containing vk:subsection elements.
The actual text is contained in elements of type vk:p (paragraph). These may in turn contain vk:footer (footnotes), vk:page (pagebreaks) and vk:header elements. Footnotes have been moved to within the paragraph that refers to them rather than near the page break where they occurred in the original text.

Each of these element types has an identifier attribute @vk:id, which reflects the hierarchical structure. Each identifier consists of the prefix nl.vk.d., followed by a point-separated list of parts denoting book, chapter, section, subsection, paragraph, and any element nested within the paragraph. E.g., in Volume 11a, second half, we find a footnote with the identifier:

nl.vk.d.11a-2.2.1.2.6.6

meaning

11a-2: The vk:book of the second subvolume of Volume 11a (regarded as a single identifier part; the separator is ., not -).
2: Chapter Het gouvernement en de nationalisten, the second element below vk:book. It is the tenth chapter in De Jong's scheme; this is not reflected in the identifier but in a separate attribute.
1: First vk:section (untitled, as the first section of a chapter always is).
1: First vk:subsection.
6: Sixth paragraph (vk:p).
6: The actual footnote.
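The identifier scheme can be taken apart with a few lines of code. The sketch below uses hypothetical function names, and it assumes that the identifier of an enclosing element is always a prefix of the identifiers of the elements it contains, which the description suggests but does not state outright.

```python
PREFIX = "nl.vk.d."


def parse_identifier(identifier):
    """Split an identifier into its parts, e.g. ['11a-2', '2', '1', ...]."""
    if not identifier.startswith(PREFIX):
        raise ValueError("not a Verrijkt Koninkrijk identifier: " + identifier)
    # Parts are kept as strings: the book part can look like '11a-2' or 'reg'.
    return identifier[len(PREFIX):].split(".")


def ancestors(identifier):
    """Return the identifiers of the enclosing elements, outermost first.

    Assumes that an enclosing element's identifier is a prefix of this one.
    """
    parts = parse_identifier(identifier)
    return [PREFIX + ".".join(parts[:i]) for i in range(1, len(parts))]


for ancestor in ancestors("nl.vk.d.11a-2.2.1.2.6.6"):
    print(ancestor)
```

Because the separator is the point and not the hyphen, the book part 11a-2 stays intact as a single part.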