I'm neither a librarian nor archivist, by trade or penchant. But like many readers of this site, I am a great lover of older books. And even with all the good work being done by so many people, institutions, and organizations like Project Gutenberg and Google Books, it's taking years to bring many of them into the 21st century.
Many older books – even those currently in print – are not being converted. What then of those that are out of print or in the public domain?
This situation has spawned a subculture of amateur book archivists who have begun their own conversions of books.
A recent project of mine required that I read a book first published in 1926. And while it's a heavily referenced work, it's been out of print for over half a century, with nothing on the used market.
I turned to my local library, but their one reference copy was out for repairs. My good friend Brigette, living halfway across the country, heard of my plight and found a lending copy in her own state's library network. FedEx delivered it to me the next day, and I got to work turning it into an ebook, learning some best practices as I went along.
When I decided to give it a go myself, my first assumption was that I would simply scan the book on my desktop scanner, page by boring page. Drudgery, yes, but I was only doing one book.
But then I began to read up on the options and learned how slow a standard scanner can be for such work. And I began to see why books were generally photographed rather than scanned. It's the difference between 10 seconds a page versus 1/30 of a second. Now the slowest link in the process becomes the page turning. Not too bad.
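To put rough numbers on that difference, here is a minimal Python sketch. The 300-page length and the 3 seconds per page turn are made-up assumptions for illustration. Only the 10-second and 1/30-second capture times come from the paragraph above.

```python
def capture_minutes(pages, seconds_per_capture, seconds_per_turn):
    """Total minutes to capture `pages` pages, one capture per page."""
    return pages * (seconds_per_capture + seconds_per_turn) / 60


PAGES = 300   # hypothetical book length
TURN = 3.0    # hypothetical seconds to turn a page and let it settle

scanner = capture_minutes(PAGES, 10.0, TURN)    # flatbed: about 10 s per page
camera = capture_minutes(PAGES, 1 / 30, TURN)   # camera: about 1/30 s per exposure

print(f"scanner: {scanner:.0f} min, camera: {camera:.0f} min")
```

Under those assumptions, nearly all of the camera time is page turning, which is why the turn becomes the slowest link.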
High-end professional systems are available commercially, designed for deep-pocketed institutions. Even smaller commercial units can be costly. A cutting-edge open source book copying system is the Linear Book Scanner developed by Google. A video showing how it works can be seen below:
On the other side of the spectrum are groups like DIY Book Scanner, who have toiled to produce low-cost do-it-yourself scanners that can be assembled for under $300. That's more my speed.
I decided to build my own rig for this project. I've designed and built a great many things, so I didn't see this as a huge challenge. It's hardly beyond anyone's ability, especially the way I did it.
The plans put out by most of the groups are great, but for one book I didn't want to spend even $300, or too much time building it. Keeping it simple, most of what I needed I already had:
- A digital camera. Mine is an 18-megapixel DSLR, but many use point-and-shoots.
- A tripod – one that allows pointing the camera perpendicular to the floor.
- A bright light. I used a strobe; some prefer continuous light.
- A sheet of cardboard and tape – to make the actual rig.
- A sheet of Plexiglas, large enough to cover the open book ($5).
My design was made much simpler by the fact that the spine of the book I was photographing would easily allow it to open flat without causing any damage to the binding. If this had not been the case, I would have had to create a more standard V-shaped rig.
That, in turn, would have necessitated a two-camera rig. (Unless you just shoot all the even pages first, then flip the book and shoot the odd pages. Software can correct for this later.)
Since I could shoot the book flat, I took advantage of that to shoot a full spread at a time, thus cutting my photography work in half. This meant I'd need to split the spreads in post – no big deal. But it also meant I was shooting each page at half the resolution of shooting one page at a time. Yet I tested it, and with 18 megapixels (3456x5184), there was enough resolution to clearly capture the text character shapes for optical character recognition (OCR) later.
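The resolution trade-off can be worked out with a little arithmetic. The sketch below assumes the spread lies along the long side of the 3456x5184 frame and uses a hypothetical 6 x 9 inch page; that page size is an illustration, not the size of the book described here. The result is an upper bound, since a real page never fills its half of the frame exactly.

```python
def page_pixels(frame_w, frame_h, pages_per_shot=2):
    """Pixels available to one page when the frame's long side spans the spread."""
    long_side = max(frame_w, frame_h)
    short_side = min(frame_w, frame_h)
    return long_side // pages_per_shot, short_side


def max_ppi(page_px, page_inches):
    """Upper bound on pixels per inch if the page filled its share of the frame."""
    (px_w, px_h), (in_w, in_h) = page_px, page_inches
    return min(px_w / in_w, px_h / in_h)


FRAME = (3456, 5184)       # from the text: the 18-megapixel frame
PAGE_INCHES = (6.0, 9.0)   # hypothetical page size (width, height)

per_page = page_pixels(*FRAME)
megapixels = per_page[0] * per_page[1] / 1e6

print(per_page, round(megapixels, 1), max_ppi(per_page, PAGE_INCHES))
```

Each page ends up with roughly 9 megapixels. A figure of around 300 pixels per inch is often quoted as a comfortable target for OCR, so a page of this assumed size would still clear it even with some wasted border.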
Here is a shortlist of the things I learned to be careful about:
- Holding the rig steady: The cardboard rig had a back and side ridge to keep the book stationary (see above). But I also needed to tape the cardboard rig to the floor to keep it in place.
- Strobes and sharp focus: A strobe or flash (NOT on camera) has the advantage of crisp light that "freezes" the shot and reduces the chance of blur. It's also a more powerful light, allowing you to use a smaller aperture on the lens, which increases depth of field and reduces focus issues. Any blurring and you can forget about successful OCR later. Make sure the camera is pointing straight down and centered, and test-test-test with sample shots in Photoshop.
- Keeping the camera steady: Continuous lighting is a fine alternative, but you need to be extra-careful of sharpness. Even tripping the camera with your finger can create enough movement to soften the image. Use a remote trigger, or in a pinch, a delayed firing option. Lock the mirror on SLRs.
- The well-placed lights: Whichever light source you use, make sure the lights are not reflected in the Plexiglas. Two lights placed at 45-degree angles give the most even light. But one light can be used if done carefully. Make sure to set an exposure that does not overexpose the white paper, which would make OCR harder.
- Save as JPEGs: Going straight to a high-quality JPEG will save you conversion work later on. It will also save you a lot of disk space. I can't see much benefit in shooting raw for this.
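One way to check a test shot for overexposed paper, besides eyeballing it in Photoshop, is to count how many pixels sit at or near pure white. The sketch below is only an illustration: the threshold of 250 is an arbitrary assumption, and the helper that reads a file relies on the third-party Pillow library, which is not mentioned in this article.

```python
def clipped_fraction(histogram, threshold=250):
    """Fraction of pixels at or above `threshold` in a 256-bin grayscale histogram."""
    total = sum(histogram)
    if total == 0:
        raise ValueError("empty histogram")
    return sum(histogram[threshold:]) / total


def grayscale_histogram(path):
    """Read an image file and return its 256-bin grayscale histogram (needs Pillow)."""
    from PIL import Image  # imported here so clipped_fraction works without Pillow

    with Image.open(path) as img:
        return img.convert("L").histogram()


# Synthetic example: 90 mid-grey pixels and 10 pure white ones.
demo = [0] * 256
demo[200] = 90
demo[255] = 10
print(clipped_fraction(demo))
```

For a real test shot, something like `clipped_fraction(grayscale_histogram("test_shot.jpg"))` would apply, where the file name is a placeholder. If a large share of the page lands in that top band, the paper is probably blown out and the exposure should come down a little.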
Here are some links to rig-building projects online, if you wish to build a better rig than I did:
- The original DIY instructions (older version)
- An updated/simplified version (though I feel all this could still be far simpler)
Once the pages have been optically captured, well, there's still more to be done. Just how much will depend on what you want as a final product. Let's look at some of the variables.
- Go straight to PDF: If you took care in your photography, and shot one page at a time (or don't mind it being spreads), and you're fine with assembling your book in the PDF format, then you are in great shape. You can simply use Adobe Acrobat or another PDF utility to assemble all the pages into one file. Fini!
- Split the spreads: If you need to split the spreads, there are a few ways to do this. One of the easiest I've found is to use the free software called Briss. A more robust program, also open source, is Scan Tailor, which does many other things as well, including deskewing and adding or removing borders. Another option is to use Acrobat. Here's a tutorial for that.
- OCR for live text: PDFs are great, but chances are you will want an actual reflowing ebook. This requires optical character recognition. See more on OCR options below.
- Correct for tonality: Even the best OCR packages need clean source images. Your images should not be too dark, and the text characters should have a solid dark tone. Make your corrections in Photoshop on one page. When it's the best it can be, create a Photoshop Action and apply it to all the pages.
- Remove the extraneous: You won't want your OCR to start including any header and footer text that's on the pages, like titles and page numbers. So crop the pages to remove all of that prior to doing the OCR.
- Export from OCR and start the ebook: The final step is OCR, after which you will bring the captured text into a word processing application like MS Word or maybe OpenOffice for cleanup. When done, you can start the process of actually creating your reflowable ePub ebook. See part one and part two for more on that process.
- Taking care of art: While there is some software that tries to automate parts of working with images, you may find it easier to just handle a small number of images manually, depending on your needs.
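For those who would rather script the splitting and cropping than use Briss, Scan Tailor, or Acrobat, the steps can be sketched in Python. This is a minimal sketch, not how any of those programs work. It assumes the gutter sits at the exact center of the frame, and the gutter, top, and bottom margins are hypothetical pixel values you would need to measure on your own shots. The file-handling function uses the third-party Pillow library.

```python
def split_boxes(width, height, gutter=0):
    """Return (left, right) crop boxes for a spread, each as (x0, y0, x1, y1)."""
    mid = width // 2
    left = (0, 0, mid - gutter, height)
    right = (mid + gutter, 0, width, height)
    return left, right


def trim_box(box, top=0, bottom=0):
    """Shrink a box to drop a running header (top) and footer (bottom)."""
    x0, y0, x1, y1 = box
    return (x0, y0 + top, x1, y1 - bottom)


def split_spread(path, out_left, out_right, gutter=0, top=0, bottom=0):
    """Cut one spread image into two page images (needs Pillow)."""
    from PIL import Image  # imported here so the box math works without Pillow

    with Image.open(path) as spread:
        boxes = split_boxes(spread.size[0], spread.size[1], gutter)
        for box, out in zip(boxes, (out_left, out_right)):
            spread.crop(trim_box(box, top, bottom)).save(out, quality=95)


# Box math for a landscape 5184x3456 spread with made-up margins.
left, right = split_boxes(5184, 3456, gutter=20)
print(trim_box(left, top=150, bottom=120), trim_box(right, top=150, bottom=120))
```

Keeping the box arithmetic separate from the image handling makes it easy to check the numbers on one spread before running a whole book through it.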
OCR has become much better and easier over the years, so it need not cost a great deal of money to get fabulous results. You could do all of this work with free and open source programs, as many professionals do.
But there are a lot of commercial packages with great features that can make things easier for the more casual user. I looked at three, each being sort of the best of its own niche.
- The Generalist: Adobe Acrobat. Both Standard and Pro versions offer OCR tools. Acrobat is the Swiss Army knife of PDF tools and does a great many things. While its Standard version is competitively priced, the Pro version is pricey. The good news is you probably already have it as part of your Creative Cloud subscription.
- The Specialized Commercial Package – ABBYY FineReader. FineReader is a competitively priced OCR favorite among many doing book conversions. It consistently gets really good reviews, and is said to be easy to "train" for better results.
- Open Source and Institutional Favorite – Tesseract OCR/FreeOCR. Tesseract was originally developed at HP, was released as open source in 2005, and has been underwritten by Google since 2006. It is billed as "the most accurate open source OCR engine available", and is available for Windows, Mac and Linux. A GUI front-end called FreeOCR is available for use on Windows.
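Tesseract itself is a command-line program, so a folder of cropped page images can be run through it with a short script. The sketch below assumes the `tesseract` executable is installed and on your path, uses the basic `tesseract imagename outputbase -l lang` form of the command, and treats the page file names as placeholders.

```python
import subprocess
from pathlib import Path


def tesseract_command(image_path, lang="eng"):
    """Build the command line for one page; Tesseract adds .txt to the output base."""
    image = Path(image_path)
    output_base = image.with_suffix("")
    return ["tesseract", str(image), str(output_base), "-l", lang]


def ocr_pages(image_paths, lang="eng"):
    """Run Tesseract on each page in sorted order, stopping at the first failure."""
    for path in sorted(image_paths):
        subprocess.run(tesseract_command(path, lang), check=True)


# Show the command for one hypothetical page without running anything.
print(tesseract_command("page_001.png"))
```

Calling `ocr_pages(Path("pages").glob("*.png"))` would then leave one text file next to each image, ready to be pulled into a word processor for cleanup.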
Digital books are still brand new, and so, to some degree, is the process of converting to them. It's still an evolving art, but these forums are packed with information on all aspects of the process.
And of course, if all of this is just too much for you, well, there are quite a few scanning services out there that would love to do the work for you. But where's the fun in that?
Words: Lance Evans
Lance Evans is creative director of Graphlink Media.