Messages of good cheer are traditional at this time of year, and we wanted to join in and send you all our Best Wishes for your Winter Holiday celebrations and for a “Bright and Shiny” New Year.
Many of you have probably noticed that our newsletters have not been published for a few months now and we’d like to update you on the reasons for this:
It started with an overload of “other” work coinciding with “issues” preparing the books we had in process for our next publications (i.e., they needed far more work than we had anticipated). We have always said we would only publish our Newsletter when we had something worth publishing so we let a couple of newsletter publication dates slip.
And then, with no warning, the company which provided the “platform*” for our website suffered a massive failure, taking us offline completely. I won’t bore you with all the gory, techie details. Suffice it to say that after a week or so this company managed to put the “non-smart” Face of our website back online, and after another week or so they managed to recover the basic functionality of our “smart” Catalogue so you could once again browse our product titles.
(*In this context a “platform” is a cluster of computers which provides the hardware for our website software to run on and also connects it to the Internet “backbone” so everyone can “visit” us - and we can “see & hear” you.)
We are still, however, without the specially “smart” back end of the website which allows us to take your orders online and to make changes to our site.
I hasten to add here that we will be delighted to fill your orders in person, but we will have to do it by eMail or by Phone (or Snail Mail if you prefer).
We are working on a couple of solutions to these problems, but whichever we choose is going to take time to implement, so we beg for your patience while we do whatever is necessary to get back to “normal service” as soon as possible.
Above I mentioned we were having “issues” preparing books for publication, and you may have seen me mention the same in previous newsletters. If you have ever wondered what I was talking about, I have prepared an explanation, and I add it here in case you wish to understand a little more of what it is we are trying to do:
Why we have Publication Delays:
I’m sure you have noticed that I sometimes mention that the production of a book has been held up, and it may have occurred to you to wonder what on Earth I was talking about. I can imagine your reaction being, “Surely all you do is scan a book and record it in the PDF format. Something that many inexpensive scanners will do in a few minutes!” This is quite true - if you are content to rely on all the many pre-set and automated set-ups & procedures inherent in the automated process. As you might expect, all these set-ups & procedures are optimized for a “modern” printed document which is “typeset” by a computer and “printed” by a micro-managed, processor-controlled modern printer.
Almost all of the documents we are processing, however, are much older than 50 years and many of them are older than 100 years. They were “set” by hand using jumbled boxes of individual type pieces - many of which were damaged or worn in some way. The printing was done (frequently one sheet at a time) on a purely mechanical printing frame with the beds of type being positioned and inked, individually, by hand, while the dampened paper was loaded and unloaded by the printers themselves.
It is not surprising then that, even without the inevitable paper degradation, damage, and dirt collections of 100 years of use, these pages need a significantly different scanner set-up and processing procedure to that designed for a contemporary document. Further, the individual printing of the papers in the book means that each page scan needs to be checked to ensure the scanner is set up correctly for that particular page.
Surprisingly, the average person does not notice all this variability, and that is because we have a wonderful processor in our heads which not only evens all this variability out for us but also corrects the hundreds of printing mistakes, caused both by the printer/publisher and by the book’s readers, over the years. Unfortunately even modern computers are nowhere near that clever, so when the scanned pages are submitted to the OCR* process the “translated” results often come out as “gobbledegook.” Unedited OCRs of old newspapers, for instance, can contain errors in much more than half of the available content. Similar OCRs of fine quality books printed on acid free paper and only lightly circulated may achieve accuracies in the 90% region. On the whole we have found that “raw” scans of old books produce OCR results in the 70% to 80% region. Cheaply produced documents, like directories and newsprint periodicals, are usually well below the 70% level.
(* See a footnote below for what these percentages mean to you as a book reader / user.)
Since most people use our old book reproductions to find information by searching for words or phrases, we believe you deserve the best chance of finding all the relevant information in the book. We therefore strive to achieve character recognition results in the high 90s % region, and the only way we have found to do this is through our rather laborious, and long-winded, post-scan processing, both at the scanned image level and at the “raw” OCR’ed level.
And this is where we can be surprised by how much processing we must do! Just like everyone else, we tend to not see the faults and errors present in old printed materials and only discover our errors when we do highly magnified examinations of the scans. This is where the unanticipated delays often start!
“Why is the OCR so important, surely if I can read the book that is enough?” We provide books specifically for genealogy and history students and while some may just wish to read the book to get a feel or sense of the times their ancestors lived through, many, many more are looking for specific events, places or (mostly) Names. These people want to find their information as quickly as possible and so they use the available computer search functions. If the computer search fails to find any matches then they put the book aside as “not useful.”
A friend allowed us to make a digital reproduction of an old book she had owned for many years and had read (in the original), “from cover to cover,” several times. She said she had extracted all the information it contained about her ancestors. After we gave her back the book, together with our thanks and a copy of the digital reproduction, she contacted us again to say that she had used the search feature and had found another ancestor in the book whom she had completely missed in her multiple “cover to cover” readings!
THIS is why we believe it is very important that a digital reproduction’s OCR results should be as good as can be made - not just some routine, “any old result,” text interpretation that comes out of a purely automated, “quick and cheap” processing regime ... and why it sometimes takes us a long time to bring our books to publication.
*OCR - Optical Character Recognition; a computer software process whereby images of typescript are “recognized” and translated into computer code which can be used to print out a facsimile, or simulacrum, of the original, or can be used to recognize and extract individual characters, or strings of characters (i.e., Search for or Find).
The accuracy, or success, of an OCR pass is usually measured as the percentage of correctly identified (individual) characters out of the total number of (individual) characters. Note that the probability of a multi-character word being fully correct is lower than the probability of an individual character being correct, and it falls further as the number of characters in the word increases. The table below gives the simple statistical probability of a 5 character word being correct for a given individual character probability of being correct:
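For readers who like to check the arithmetic, here is a minimal sketch in Python of the “simple statistical probability” described above. It assumes that each character is recognized independently of its neighbours, so the chance of a whole word being correct is the single-character probability raised to the power of the word length. The function name word_accuracy and the sample percentages are our own illustrative choices, not values taken from our processing software or from the table.

```python
def word_accuracy(char_accuracy: float, word_length: int = 5) -> float:
    """Probability that every character of a word is read correctly.

    Assumption: character errors are independent of one another, so the
    word probability is simply char_accuracy multiplied by itself
    word_length times.
    """
    if not 0.0 <= char_accuracy <= 1.0:
        raise ValueError("char_accuracy must be between 0 and 1")
    if word_length < 1:
        raise ValueError("word_length must be at least 1")
    return char_accuracy ** word_length


# Illustrative single-character accuracies (hypothetical sample values).
for percent in (70, 80, 90, 99):
    word_percent = word_accuracy(percent / 100) * 100
    print(f"{percent}% per character -> {word_percent:.1f}% per 5-character word")
```

Under this assumption a 90% character accuracy leaves a 5 character word fully correct only about 59% of the time, which is why a search for a surname can so easily miss. Real OCR errors tend to cluster on damaged or dirty areas of a page, so the figures are a rough guide only.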
Probability of a correct result:
1 Character % 5 Character Word %