Deliverables Verrijkt Koninkrijk
D1
The data source for the VerrijktKoninkrijk project has been the pdf collection at http://www.niod.nl/koninkrijk/default.asp, which comprises a scanned (in color) and OCR'ed version of the complete scientific edition of “Het Koninkrijk der Nederlanden in de Tweede Wereldoorlog”, written by L. de Jong.
These pdf files have each been transformed into XML with the open-source tool pdf2xml:
http://sourceforge.net/projects/pdf2xml
In order to clean up some of the most obvious OCR mistakes, such as floating non-legible characters (combinations of ·.,;:"'|/\^~`•_=><) caused by dirt on the scanner bed or page, we performed a pre-processing clean-up step on all documents with the following XSLT script:
http://transformer.loedejongdigitaal.nl/pdf2htmlcleanup.xsl
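The kind of clean-up described above can be illustrated with a small Python sketch. It is not a port of the XSLT script: the rule used here (drop any whitespace-delimited token that consists only of the listed noise characters) and the function name strip_floating_noise are assumptions made for illustration.

```python
import re

# Characters the text lists as typical scanner-dirt debris.
NOISE_CHARS = "·.,;:\"'|/\\^~`•_=><"

# Assumed rule: a token is "floating" when it consists only of noise
# characters and has whitespace (or the line boundary) on both sides.
_FLOATING = re.compile(r"(?<!\S)[" + re.escape(NOISE_CHARS) + r"]+(?!\S)")


def strip_floating_noise(line: str) -> str:
    """Remove stand-alone runs of noise characters and tidy the spacing."""
    cleaned = _FLOATING.sub(" ", line)
    return re.sub(r"\s+", " ", cleaned).strip()


print(strip_floating_noise("De bezetting ·.' begon in mei ~ 1940."))
```

Punctuation attached to a word, such as the full stop after "1940", is left alone because it is not surrounded by whitespace. A rule this simple would also delete legitimately free-standing punctuation, so it should be read as a sketch only.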
The resulting xml documents were transformed into the final book format with the following xslt script:
http://transformer.loedejongdigitaal.nl/loedejong.xsl
This resulted in xml files which validate against the following schema:
http://schema.loedejongdigitaal.nl/book.rnc (Note that this links to an html version of the schema. The original file on which it is based can be found by changing .html back to .rnc)
Description of the data format
The data is made available in the following formats at EASY DANS (link through the assigned Persistent Identifier):
General Description
This is a collection of datasets related to the work of Loe de Jong: "Het Koninkrijk der Nederlanden in de Tweede Wereldoorlog". This is the standard reference on the history of the Netherlands during World War II, and was digitally enriched and curated for CLARIN-NL.
The collection includes:
- Structurally enriched XML versions of the books.
- The same structurally enriched XML versions of the books, further enriched with semantic analysis.
- A table containing all Named Entities from the collection.
- A semantic layer with RDF/XML data
1) The loedejongdigitaal.nl xml data
This dataset comprises all of the XML-enriched books from the loedejongdigitaal.nl xml collection. The collection is organized as a set of 30 XML files, each one corresponding to one of the printed volumes and based on the associated pdf with scan and OCR data.
2) The loedejongdigitaal.nl xml data further enriched with semantic analysis
This dataset comprises all of the XML-enriched books from the loedejongdigitaal.nl xml collection. The collection is organized as a set of 30 XML files, each one corresponding to one of the printed volumes and based on the associated pdf with scan and OCR data. This set contains the semantically annotated text in FoLiA format.
3) The Named Entities detected in the loedejongdigitaal.nl xml data in table format
This dataset comprises all of the named entities detected in the loedejongdigitaal.nl xml collection. The database is organized as a table containing the named entity text, type, and paragraph identifier. If relevant and available, it also contains a Dutch Wikipedia link and an English Wikipedia link.
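A table with this shape could be read along the following lines. The column names, the tab-separated layout, the type labels and the sample rows are all assumptions made for illustration; the actual file format of the dataset is not specified here.

```python
import csv
import io
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class NamedEntity:
    text: str
    entity_type: str
    paragraph_id: str
    wikipedia_nl: Optional[str] = None  # only present if relevant and available
    wikipedia_en: Optional[str] = None


def read_entities(handle) -> List[NamedEntity]:
    """Read tab-separated rows; empty link cells become None."""
    rows = csv.DictReader(handle, delimiter="\t")
    return [
        NamedEntity(
            text=row["text"],
            entity_type=row["type"],
            paragraph_id=row["paragraph_id"],
            wikipedia_nl=row["wikipedia_nl"] or None,
            wikipedia_en=row["wikipedia_en"] or None,
        )
        for row in rows
    ]


# Made-up rows for illustration only; they are not taken from the dataset.
SAMPLE = (
    "text\ttype\tparagraph_id\twikipedia_nl\twikipedia_en\n"
    "Heerlen\tLOC\tnl.vk.d.1.2.1.1.3\thttps://nl.wikipedia.org/wiki/Heerlen"
    "\thttps://en.wikipedia.org/wiki/Heerlen\n"
    "Poels, H. A.\tPER\tnl.vk.d.1.2.1.1.3\t\t\n"
)

entities = read_entities(io.StringIO(SAMPLE))
```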
For related datasets see the thematic collection: 'Verrijkt Koninkrijk'. You can find a link to this collection under 'Relations'.
4) The Semantic Layer (RDF/XML Data)
This dataset contains RDF data in XML format for the Linked Data version of the semantically annotated Named Entities and back-of-the-book terms of the loedejongdigitaal.nl collection.
Description of the data xml format
Each document is a UTF-8 encoded XML file that validates against book.rnc, a schema in RELAX NG compact syntax. The structure of the documents is as follows. The root element, root, of each document contains 3 elements:
- docinfo: contains meta-information about the transformation steps used to produce the document. These are references to the xslt scripts that were used.
- meta: contains Dublin Core Metadata elements with meta data regarding the contents of the document, the collection, and formatting.
- book: contains the structured content of the book.
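As a quick sanity check, the top-level layout can be inspected with Python's standard library. The sample document and its namespace URI below are placeholders, since the real namespace is not given in this description; the comparison therefore looks at local element names only.

```python
import xml.etree.ElementTree as ET

EXPECTED_CHILDREN = ("docinfo", "meta", "book")


def local_name(tag: str) -> str:
    """Drop a '{namespace}' prefix, if any, from an ElementTree tag."""
    return tag.rsplit("}", 1)[-1]


def top_level_names(xml_text: str):
    """Return the root's local name and the local names of its children."""
    root = ET.fromstring(xml_text)
    return local_name(root.tag), [local_name(child.tag) for child in root]


def has_expected_layout(xml_text: str) -> bool:
    root_name, children = top_level_names(xml_text)
    return root_name == "root" and sorted(children) == sorted(EXPECTED_CHILDREN)


# Hypothetical miniature document; the namespace URI is a placeholder.
SAMPLE = """<root xmlns:vk="http://example.org/vk">
  <docinfo/>
  <meta/>
  <vk:book/>
</root>"""

print(has_expected_layout(SAMPLE))
```

This only checks the three top-level elements; full validation would still have to be done against the RELAX NG schema with a dedicated validator.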
The book element is created based on an automatic detection of a number of visual and textual cues that can be found throughout the different pages. Thanks to a relatively consistent layout used throughout the different parts (with an unfortunate exception of 'deel 14') it was possible to use the same feature detector for all books.
The elements detected were:
- chapter: A chapter is detected by the string 'HOOFDSTUK' (or variations thereof) as the first word on a page. For part 14, the string to match is either 'Discussie' or 'Reacties en recensies'.
- section: Sections are detected by a combination of whitespace before and after the section title, font size, and italic text.
- subsection: Subsections are detected by a combination of whitespace before and after a line containing '*'.
- paragraph: A paragraph is detected by the whitespace above or the indentation of the first line, and by font size. If a line ends with a hyphen and it is not the last line, the hyphen is removed and the two word fragments are concatenated. Bold and italic words are detected and placed in b and i elements.
- quote: Two types of quotes are detected: inline quotes and block quotes. Inline quotes are either single words (possibly hyphenated) within quotation marks, or complete sentences within quotation marks. The closing quotation mark may appear either before or after the sentence-ending punctuation ('.', '?' or '!').
- foreword: A foreword is detected like a chapter, except with the string 'Voorwoord'.
- statement: A statement of accountability is detected like a chapter, except with the string 'Verantwoording'.
- appendix: An appendix is detected like a chapter, except with one of the strings 'Lijst', 'Overzicht van', 'Datumlijst', 'Register', 'BRON-OPGA', 'BIJLAGE', 'Bron-opgave der illustraties'.
- corrections: A correction is detected like a chapter, except with the string 'Tweede aanvullend overzicht van' or 'Overzicht van wijzigingen'.
- header: A header is detected by the location on the page, font size, and surrounding white space.
- footer: A footer is detected by the location on the page, font size, and surrounding white space.
- page number: A page number is detected by the location on the page, font size, and surrounding white space. To guard against OCR mistakes, the actual number is calculated from the page number within the pdf file plus an offset. The offset is set by 'majority vote': the offset is calculated individually for all pages with a legible number and the most common value is selected.
- page: Page changes are indicated within the source file.
- back-of-book: The back of book is detected by its first lemma "AAA-actie" in part 14.
- lemma: The lemmas within the back-of-book are parsed based on text formatting. Page references are grouped by book reference. The 'see' and 'see also' references are parsed separately. Special care is given to correction of OCR mistakes in page numbers. Irrecoverable or otherwise invalid page references are left out.
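Two of the rules above, the removal of end-of-line hyphens in paragraphs and the majority vote for the page number offset, can be sketched in Python as follows. The actual detection is done in XSLT; the function names, the input shapes and the sign convention for the offset (printed number minus pdf page index) are assumptions made for this sketch.

```python
from collections import Counter
from typing import Dict, List, Optional


def join_hyphenated_lines(lines: List[str]) -> str:
    """Merge the lines of one paragraph, undoing end-of-line hyphenation."""
    text = ""
    for index, line in enumerate(lines):
        line = line.strip()
        is_last = index == len(lines) - 1
        if line.endswith("-") and not is_last:
            text += line[:-1]  # drop the hyphen; the next line continues the word
        else:
            text += line if is_last else line + " "
    return text


def page_offset_by_majority(ocr_numbers: Dict[int, Optional[int]]) -> int:
    """Pick the most common (printed number - pdf page index) difference.

    ocr_numbers maps a pdf page index to the printed page number read by
    OCR, or None when the number was not legible.
    """
    votes = Counter(
        printed - pdf_index
        for pdf_index, printed in ocr_numbers.items()
        if printed is not None
    )
    if not votes:
        raise ValueError("no legible page numbers to vote with")
    return votes.most_common(1)[0][0]


paragraph = join_hyphenated_lines(["De rege-", "ring vertrok naar", "Londen."])

# Page 13 was misread by OCR (8 instead of 6); the vote ignores it.
offset = page_offset_by_majority({10: 3, 11: 4, 12: None, 13: 8, 14: 7})
corrected_page_13 = 13 + offset
```

Note that the hyphen rule as stated also joins words whose hyphen is genuine (for example a compound broken exactly at its hyphen), which is a known limitation of this kind of heuristic.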
Data post processing
The back of the book will be enriched with data from the books themselves, and the pages from those books will be enriched with data from the back of the book.
This creates a co-dependency in the transformation process, which is solved by repeating the transformation process once after it is done the first time.
The order in which all data is processed is as follows:
- Transform all XML files with http://transformer.loudejongdigitaal.nl/d/vk/loudejong.xsl
- Place the resulting back of the book (vk.d.reg.xml) at http://transformer.loudejongdigitaal.nl/d/vk/nl.vk.d.reg.xml
- Create an XML file that maps all paragraph IDs to the IDs of the pages those paragraphs appear on, using http://www.loedejongdigitaal.nl/parids.xq
- Save the resulting XML file as http://transformer.loudejongdigitaal.nl/d/vk/parids.xml
- Again, transform all XML files with http://transformer.loudejongdigitaal.nl/d/vk/loudejong.xsl
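The order of these steps can be summarized in a short Python sketch. The function transform_all is a hypothetical stand-in for an XSLT run and does no real transformation, and the source file names are invented; the point is only that the same transformation runs twice, the second time with the register and the paragraph map from the first pass available.

```python
from typing import List, Optional, Tuple


def transform_all(sources: List[str], register: Optional[str],
                  parids: Optional[str], log: List[str]) -> List[str]:
    """Hypothetical stand-in for one XSLT run over every source file."""
    log.append(f"transform register={register} parids={parids}")
    return [name.replace(".src.xml", ".xml") for name in sources]


def run_two_pass(sources: List[str]) -> Tuple[List[str], List[str]]:
    log: List[str] = []
    # Pass 1: neither the register nor the paragraph map exists yet.
    transform_all(sources, None, None, log)
    # Publish the back of the book from pass 1 and derive the paragraph map.
    register = "nl.vk.d.reg.xml"
    log.append("publish " + register)
    parids = "parids.xml"
    log.append("build " + parids)
    # Pass 2: the same transformation, now with both lookups available.
    books = transform_all(sources, register, parids, log)
    return books, log


books, log = run_two_pass(["nl.vk.d.1.src.xml", "nl.vk.d.2.src.xml"])
```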
As a result we will have the following enrichment:
Back of the book
The lemmas in the back-of-the-book (nl.vk.d.reg) contain page references. Because paragraphs are the smallest resolvable element in the curated collection, paragraph references have been added for all paragraphs that partially or completely overlap with a given page.
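A minimal sketch of this lookup, assuming the paragraph map is available as a dictionary from paragraph ID to the page IDs it touches (the identifiers and the in-memory shape are invented for illustration):

```python
from collections import defaultdict
from typing import Dict, List


def paragraphs_by_page(parids: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Invert paragraph -> pages into page -> paragraphs (input order kept)."""
    index = defaultdict(list)
    for paragraph_id, page_ids in parids.items():
        for page_id in page_ids:
            index[page_id].append(paragraph_id)
    return dict(index)


def paragraph_refs_for_lemma(page_refs: List[str],
                             index: Dict[str, List[str]]) -> List[str]:
    """Collect every paragraph that overlaps any page the lemma refers to."""
    refs: List[str] = []
    for page_id in page_refs:
        for paragraph_id in index.get(page_id, []):
            if paragraph_id not in refs:
                refs.append(paragraph_id)
    return refs


# Paragraph p2 starts on page-12 and continues on page-13.
parids = {"p1": ["page-12"], "p2": ["page-12", "page-13"], "p3": ["page-13"]}
index = paragraphs_by_page(parids)
refs = paragraph_refs_for_lemma(["page-13"], index)
```

A paragraph that runs across a page break is returned for both pages, which matches the "partial or complete overlap" wording above.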
Pages
To each page element a backof-book-ref element is added if there are one or more lemmas that refer to that specific page. These lemma references may serve as a 'summary' of a given page, or be used in a visualization to give users easy navigation to other pages that are related to the current page via the lemmas. Example:
Anti-communisme
Anti-fascisme
Centrale Inlichtingsdienst (voor de oorlog)
Concordaat (20 juli 1933)
Consulaat, Duits, in Amsterdam
Foreign Office/State Department Document Center
Heerlen
Jansen, J. H. G.
Limburg
Noorr, G. C. van
Pius XI, paus
Poels, H. A.
Rooms-Katholiek Episcopaat
Rooms-Katholieke Mijnwerkersbond
Rooms-Katholieke Staatspartij (RKSP)
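The navigation use mentioned above could work roughly as in the following sketch, which ranks other pages by the number of lemmas they share with the current page. The register is represented as a plain dictionary and the page numbers are invented; neither reflects the real data.

```python
from collections import Counter
from typing import Dict, List, Tuple


def related_pages(page: int,
                  register: Dict[str, List[int]]) -> List[Tuple[int, int]]:
    """Return (other page, shared lemma count), most shared lemmas first."""
    shared = Counter()
    for pages in register.values():
        if page in pages:
            for other in set(pages):
                if other != page:
                    shared[other] += 1
    # Sort by descending count, then by page number for a stable order.
    return sorted(shared.items(), key=lambda item: (-item[1], item[0]))


# Lemma names come from the example above; page numbers are made up.
register = {
    "Heerlen": [41, 42],
    "Limburg": [41, 42, 97],
    "Poels, H. A.": [42, 97],
    "Pius XI, paus": [12],
}

neighbours = related_pages(42, register)
```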
Statistics for the collection
All elements were counted with the following result:
Element | Count
vk:book | 30
vk:chapter | 226
vk:section | 1885
vk:subsection | 4708
vk:p | 86257
vk:quote | 56547
vk:foreword | 6
vk:statement | 2
vk:appendix | 92
vk:corrections | 2
vk:header | 16015
vk:footer | 7881
vk:page | 16922
vk:backofbook | 1
vk:block | 80
vk:lemma | 16186
vk:lemma[.//vk:page-ref/@vk:page-ref] | 15369
vk:lemma-ref | 148370
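Counts of this kind can be reproduced with a few lines of Python. This is a sketch, not the query that produced the table: the sample document is invented and its namespace URI is a placeholder, so elements are counted by local name.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def count_elements(xml_text: str) -> Counter:
    """Count every element in the document by its local (un-namespaced) name."""
    root = ET.fromstring(xml_text)
    return Counter(element.tag.rsplit("}", 1)[-1] for element in root.iter())


# Hypothetical miniature document; the namespace URI is a placeholder.
SAMPLE = """<root xmlns:vk="http://example.org/vk">
  <vk:book>
    <vk:chapter>
      <vk:section>
        <vk:p>Eerste alinea.</vk:p>
        <vk:p>Tweede alinea.</vk:p>
      </vk:section>
    </vk:chapter>
  </vk:book>
</root>"""

counts = count_elements(SAMPLE)
```

To total the whole collection, the counters for the individual files can simply be added together, since Counter supports the + operator.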
D3
For the purpose of this research, Het Koninkrijk has been subdivided
hierarchically, as follows,
and each element of the hierarchy has been given a unique identifier.
The XML snippet corresponding to each element/identifier can be obtained via http://resolver.loedejongdigitaal.nl/<id>, where <id> is the identifier.
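Building such a resolver address is a matter of appending the identifier to the base URL, as in this small sketch. The helper name resolver_url is an assumption, and the code only constructs the address; it does not contact the service.

```python
from urllib.parse import quote

RESOLVER_BASE = "http://resolver.loedejongdigitaal.nl/"


def resolver_url(identifier: str) -> str:
    """Append the percent-encoded identifier to the resolver base URL."""
    return RESOLVER_BASE + quote(identifier, safe="")


print(resolver_url("nl.vk.d.11a-2.2.1.2.6.6"))
```

Identifiers of the form used in this collection contain only letters, digits, dots and hyphens, so the percent-encoding leaves them unchanged; it is included only as a precaution.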
- The highest level of organisation is the volume. In the case of a logical volume consisting of multiple subvolumes, these subvolumes are what we consider the volumes, and the logical volumes are not represented in the hierarchy. A volume has the XML tag vk:book and occurs as a child element of a root element (which also contains Dublin Core metadata).
- Below the level of volumes are elements with XML tags vk:index (table of contents), vk:chapter (chapter of main text), vk:appendix, vk:foreword, vk:corrections, vk:statement, and vk:backofbook (the inverted index in the book). The final two only occur once each; vk:backofbook is the main body of the final volume of Het Koninkrijk.
- Below vk:chapter are elements of type vk:section, sometimes (but not always) containing vk:subsection elements.
- The actual text is contained in elements of type vk:p (paragraph). These may in turn contain vk:footer (footnotes), vk:page (page breaks) and vk:header elements. Footnotes have been moved to within the paragraph that refers to them rather than near the page break where they occurred in the original text.
Each of these element types has an identifier attribute @vk:id, which reflects the hierarchical structure. Each identifier consists of the prefix nl.vk.d., followed by a dot-separated list of numbers denoting book, chapter, section, paragraph.
E.g., in Volume 11a, second half, we find a footnote with the identifier nl.vk.d.11a-2.2.1.2.6.6, meaning:
- 11a-2: Volume 11a, second subvolume's vk:book (regarded as a single identifier part; the separator is ., not -).
- 2: Chapter Het gouvernement en de nationalisten, the second element below vk:book. The tenth chapter in De Jong's scheme; this is not reflected in the identifier but in a separate attribute.
- 1: First vk:section (untitled, as the first section of a chapter always is).
- 1: First vk:subsection.
- 6: Sixth paragraph (vk:p).
- 6: The actual footnote.
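Splitting an identifier into its parts can be sketched as follows. The names VkId, parse_vk_id and ancestor_ids are hypothetical, and the idea that truncating an identifier yields the identifier of the enclosing element is an assumption based on the statement that identifiers reflect the hierarchy; it is not confirmed for every element type.

```python
from typing import List, NamedTuple

PREFIX = "nl.vk.d."


class VkId(NamedTuple):
    book: str        # e.g. "11a-2"; the hyphen is part of the book label
    path: List[int]  # position numbers below the book, outermost first


def parse_vk_id(identifier: str) -> VkId:
    """Split an identifier into its book label and numeric path."""
    if not identifier.startswith(PREFIX):
        raise ValueError(f"not a vk identifier: {identifier!r}")
    book, *rest = identifier[len(PREFIX):].split(".")
    return VkId(book, [int(part) for part in rest])


def ancestor_ids(identifier: str) -> List[str]:
    """Assumed rule: each shorter prefix of the path names an enclosing element."""
    book, path = parse_vk_id(identifier)
    ancestors = []
    for depth in range(len(path)):
        parts = [book] + [str(number) for number in path[:depth]]
        ancestors.append(PREFIX + ".".join(parts))
    return ancestors


parsed = parse_vk_id("nl.vk.d.11a-2.2.1.2.6.6")
```

Because the book label is taken as everything up to the first dot after the prefix, a label such as 11a-2 stays in one piece, in line with the remark that the separator is the dot and not the hyphen.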