About

About this project

In 1920s colonial Korea, the arrival of a new Governor-General marked the start of a more liberal period in the colony. This led to significant growth in the number of publishing companies, newspapers and magazines. One crucial component of this new print culture is often overlooked: the printers and the printshops they worked at. Due to the nature of the Korean language, acquiring typefaces for the metal presses then in use was expensive. Not only were the typefaces themselves costly, as printing Korean required far more distinct pieces of type than printing Western languages, but the larger number of pieces also demanded more storage space and therefore higher rent. As a result, printshops stayed with one typeface for long periods of time, as De Fremery has shown. Typefaces from this era thus function as a sort of 'handwriting' for printshops and can be used to identify them.
This website hosts a neural network that classifies pages of Korean books from the 1920s by the printshop that produced them. The model was trained on roughly 2,200 pages (PNG/JPG images) of books from, at present, only two printshops: Hansong Toso (한성도서주식회사 - 漢城圖書株式會社) and Taedong Inswaeso (대동인쇄소 - 大東印刷所). These printers were chosen because research has shown that the two printshops were responsible for producing 60% of the Korean vernacular poetry. Because collecting training data (images) takes a significant amount of time, the model currently supports only these two printshops and is therefore extremely specific.
The current model overfits the data: it has learned the features of the training data in detail, so predictions on validation images from the same set of books reach almost 100% accuracy. When the model is shown data it has never encountered, the confidence of its predictions drops to about 50%, which with only two classes means it is essentially guessing. The solution is twofold: first, more training data for these two printshops; second, data from more printshops, which the model needs before it can deliver real benefits. Together these would ensure that the model has enough data to reliably find differentiating characteristics. Additionally, I predict that the model can be vastly improved by segmenting images per syllable before classification. This would effectively split a page, which is currently a single data point, into as many data points as there are syllables on the page. It would not only greatly increase the amount of training data, but also allow for more accurate feature distinction, as font style has been shown to be the factor on which classification can take place.
With such an expansion, the model could become valuable for scholars of Korean Studies as well as for those focused on using digital tools within the humanities (Digital Humanities). Regardless of its shortcomings, this project does serve as an example of the potential and viability of constructing such a model.
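
To make the proposed per-syllable segmentation more concrete, here is a minimal sketch of one possible approach, a projection-profile split. It is not the method used by the current model. It assumes the page has already been binarized into rows of 0 (paper) and 1 (ink), and that columns of text and the syllables within them are separated by empty gaps. The names ink_runs, segment_page and min_gap are hypothetical and chosen only for illustration.

```python
from typing import List, Tuple

# (left, top, right, bottom); right and bottom are exclusive
Box = Tuple[int, int, int, int]


def ink_runs(profile: List[int], min_gap: int = 1) -> List[Tuple[int, int]]:
    """Return (start, end) ranges where the profile contains ink.

    Runs separated by fewer than `min_gap` empty positions are merged,
    so small gaps inside a single syllable do not split it.
    """
    runs: List[Tuple[int, int]] = []
    start = None
    for i, value in enumerate(profile):
        if value and start is None:
            start = i
        elif not value and start is not None:
            runs.append((start, i))
            start = None
    if start is not None:
        runs.append((start, len(profile)))

    merged: List[Tuple[int, int]] = []
    for run in runs:
        if merged and run[0] - merged[-1][1] < min_gap:
            merged[-1] = (merged[-1][0], run[1])
        else:
            merged.append(run)
    return merged


def segment_page(page: List[List[int]], min_gap: int = 1) -> List[Box]:
    """Split a binarized page into one bounding box per ink cluster.

    Assumption: the page is first cut into vertical text columns, and
    each column is then cut into syllables from top to bottom.
    """
    if not page:
        return []
    width = len(page[0])
    # Amount of ink in every pixel column of the page.
    column_profile = [sum(row[x] for row in page) for x in range(width)]
    boxes: List[Box] = []
    for left, right in ink_runs(column_profile, min_gap):
        # Amount of ink in every pixel row, restricted to this text column.
        row_profile = [sum(row[left:right]) for row in page]
        for top, bottom in ink_runs(row_profile, min_gap):
            boxes.append((left, top, right, bottom))
    return boxes


# Toy "page": two text columns, the left one holding two syllables.
toy_page = [
    [1, 1, 0, 1, 1],
    [1, 1, 0, 1, 1],
    [0, 0, 0, 0, 0],
    [1, 1, 0, 0, 0],
    [1, 1, 0, 0, 0],
]
print(len(segment_page(toy_page)))  # 3 data points from a single page
```

Each box uses the (left, upper, right, lower) layout that Pillow's Image.crop expects, so the boxes could be used to cut syllable images out of the original scan. Real scans would most likely need deskewing, noise removal and a tuned min_gap before a simple split like this becomes reliable.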

Training Data

Data was gathered from the National Library of Korea and from the Hyundam Mun’go. See the list below for a detailed overview of which books were selected per printshop as training data.

Hansong Toso (한성도서주식회사 - 漢城圖書株式會社)
Taedong Inswaeso (대동인쇄소 - 大東印刷所)

Tools used

Special Thanks

I want to thank Dr. Christopher Handy from Leiden University for inspiring me to keep pushing the boundaries of my comfort zone while coding this project. His class and his help proved invaluable for its successful creation. I also want to thank Prof. Dr. Wayne De Fremery from Sogang University. His class and enthusiasm inspired me to pursue Digital Humanities and proved invaluable for my interest in Korean print culture during the 1920s.