Why Scan Books?
With the prevalence of eBook readers like the Nook, Kindle, Spring Design Alex, and others comes the necessity of building and maintaining a vast digital library. There are more resources online than one can easily list for purchasing (and downloading) books in a suite of electronic formats, from PDF to DJVU, but what if you already own a book of the traditional dead-tree sort? What if you aren’t willing to purchase it again just for the convenience and ease of reading it on your brand new eBook reader?
Scanning becomes your only option.
I’ll be honest, the process isn’t easy, quick or glamorous. But it beats spending a day craning over your flatbed scanner or cutting the spine out of your expensive book to feed it through an equally expensive loose-leaf scanner (speaking of which, what the heck is up with how expensive they are?!). If the book is sufficiently expensive, it becomes an economical prospect quickly given the few hours required from start to finish.
I’m not going to address the legal/ethical/moral considerations. You could argue that making a PDF copy for yourself constitutes Fair Use, but the law being what it is, who the heck knows? Regardless, just exercise some moral introspection and decide for yourself.
- A relatively decent Digital SLR with wide to normal focal length lens
- Large sheet of black construction paper
- Tripod/Monopod and a way to hold the camera
- Snapter or other image processing software
- Adobe Acrobat/other PDF creating utility
- 2-4 hours of your time, depending on the book complexity
The specific equipment I use is:
- Nikon D80 with Tamron 17-50 f/2.8 lens
- Nikon SB600 flash (optional)
- Nikon remote shutter release (IR)
- Large piece of black construction paper from Michael’s
- Monopod, table, and a copy of my CRC Handbook (more on that weird combo later)
- Snapter for processing images
- Adobe Acrobat 9 Pro for making PDF and OCR
I’ve already mentioned Snapter twice, and although it’s commercial software (with a very generous 15 day free trial that gives you all the functionality of the real thing), don’t let that fool you. I’ve had a lot of success with it just because of how easy and functional it’s been in my experience. So much so that I went ahead and got the paid version.
That said, there are a few open source alternatives that do a pretty good job and are worth mentioning:
Scan Tailor is pretty good, has a nice GUI, and is very actively developed. Unpaper doesn’t have a GUI but offers a lot for a command line tool. Both OSS solutions also have the advantage that you can either code functionality changes yourself or propose them to the active developers.
Another relevant article with tips is from Slashdot (/.), which was posted, ironically, the week after I had already embarked on and discovered the ins and outs of scanning with a digital camera myself.
My setup is simple: I mount the camera on the monopod, stick it on the table, and balance it there with my trusty CRC handbook and some other heavy books.
You might be wondering why I didn’t just use a tripod. The reason is that it’s much more challenging to both tilt a tripod and balance it so the camera points straight down at the book. For the best photo quality, the book’s surface needs to be as close to parallel with the camera sensor as possible. It makes sense: otherwise we’ll have a more challenging time getting the book totally in focus (depth of field will come into play), and a harder time flattening the book in software.
I generally tape the black paper down to the floor, snap photos of the cover and back cover, and then tape those down as well. More on positioning later.
The whole thing looks like the following:
I have the flash set to bounce from the ceiling, just because in practice this yields the most readable photos. I also use all the light I can from the room itself.
A difficult consideration is that sometimes the print/copy itself has glare. This seems a lot more common with newer books than older ones; it’s almost like the print has a layer of varnish atop it. Just make sure you preview a few images and can actually read the copy.
Positioning the book is the tricky part; it’s difficult to balance between filling the frame with the book (so you have good resolution), and leaving enough space at the edges so that your software can do edge detection. Leave too little space around, and you’ll have a nightmarish time trying to field flatten. Leave too much, and you’ll be throwing away a ton of your image. Even worse, if you don’t tape the book down, it will gradually creep out of the frame.
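To put rough numbers on that tradeoff, here is a minimal sketch of how the margin you leave around the book eats into resolution. The 3872 x 2592 pixel image size, the 12 x 9 inch spread, and the margin values are all assumptions for illustration; plug in your own camera and book.

```python
def framed_dpi(px_w, px_h, spread_w_in, spread_h_in, margin_in):
    """Pixels per inch on the page when the frame must cover the
    spread plus a margin on every side. The tighter axis sets the limit."""
    need_w = spread_w_in + 2 * margin_in
    need_h = spread_h_in + 2 * margin_in
    return min(px_w / need_w, px_h / need_h)


# Assumed values: a 3872 x 2592 pixel image and a 12 x 9 inch open spread.
for margin in (0.5, 1.0, 2.0):
    dpi = framed_dpi(3872, 2592, 12, 9, margin)
    print(f"{margin} in margin -> about {dpi:.0f} DPI on the page")
```

The exact numbers matter less than the trend: every extra inch of black paper in the frame is resolution you don’t get back later.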
Another big consideration is rotation. I’ve discovered that Snapter doesn’t really account that well for material that has even subtle rotation, and you end up with slight skew in the resulting images. A little skew isn’t a big problem, but anything more than subtle rotation will immediately cause you headaches.
I usually go for something like this:
You could zoom in a bit more in this case if you wanted; in practice you’ll discover for yourself what works best.
I set the camera to use a relatively big F/# (in this case F/5.6) so there’s as much depth of field as possible. You want the whole book in focus.
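If you’re curious how much the F/# actually buys you, the standard thin-lens depth of field approximation gives a rough idea. This is only a sketch: the 35 mm focal length, the 1 m camera-to-book distance, and the 0.02 mm circle of confusion (a commonly quoted value for APS-C sensors) are assumptions, not measurements from my setup.

```python
def depth_of_field(focal_mm, f_number, subject_mm, coc_mm=0.02):
    """Return (near_mm, far_mm), the limits of acceptable focus.

    Uses the hyperfocal distance H = f^2 / (N * c) + f.
    coc_mm is the circle of confusion and is an assumed value.
    """
    hyperfocal = focal_mm ** 2 / (f_number * coc_mm) + focal_mm
    near = (subject_mm * (hyperfocal - focal_mm)
            / (hyperfocal + subject_mm - 2 * focal_mm))
    if subject_mm >= hyperfocal:
        # Everything beyond the near limit is acceptably sharp.
        return near, float("inf")
    far = subject_mm * (hyperfocal - focal_mm) / (hyperfocal - subject_mm)
    return near, far


# Assumed: 35 mm focal length, book 1000 mm from the camera.
for n in (2.8, 5.6, 8.0):
    near, far = depth_of_field(focal_mm=35, f_number=n, subject_mm=1000)
    print(f"f/{n}: sharp from {near:.0f} mm to {far:.0f} mm "
          f"({far - near:.0f} mm total)")
```

Under these assumptions, stopping down from F/2.8 to F/5.6 roughly doubles the zone that is in focus, which is what lets a curved page stay readable from gutter to edge.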
Now just snap away
This is the grueling part: capturing images of every page. Snag a friend if you can, as having two people makes this process go much faster. One can turn the page and crease stubborn ones into place, and the other can trigger the shutter with the remote and make sure the book isn’t creeping out of the frame.
I find this can take anywhere from half an hour to much longer, depending on how much trouble the book gives you. The most challenging parts are the very beginning and the end. At these points, the pages have the most curve to them, sometimes sticking up. This is where creasing them down or using some tape on the stubborn ones can make or break your day.
Eventually, you’ll have a directory full of images that need to be processed.
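Before processing, it’s worth making sure the files sort in page order, since a plain alphabetical sort puts page10 ahead of page2 whenever the numbers aren’t zero-padded. Here is a small sketch of a natural sort; the file names are made up for illustration, and your camera’s naming scheme may already sort correctly on its own.

```python
import re


def natural_key(name):
    """Split a file name into text and number chunks so that
    numeric parts compare as integers instead of character by character."""
    return [int(part) if part.isdigit() else part.lower()
            for part in re.split(r"(\d+)", name)]


def in_page_order(names):
    """Return the file names sorted the way a person would order them."""
    return sorted(names, key=natural_key)


# Hypothetical file names, not what any particular camera produces.
shots = ["page10.jpg", "page2.jpg", "page1.jpg", "page11.jpg"]
print(in_page_order(shots))
```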
At this point, you can use whatever tool suits your fancy, but if you’re using Snapter, read on.
Click Book, grab all your photos, and go make yourself a drink as you wait for it to do initial edge detection and processing on images. Nothing is being changed, it’s just generating the initial traces around the book it finds.
After this is done comes the only other bothersome part. It’s very worthwhile to manually go through each page and make sure you’re happy with the edge detection. Frequently, pages that have black or dark color at the edge cause headaches. Drag the handles around until they match the page more closely. This can be grueling, but it’s important.
Click Input and change the background color to black (since we’re using a black piece of paper, or at least I did). Under Output, I also generally turn off cropping each page, since I’d rather deal with a spread. Grayscale output will save on space later, and I keep the DPI the same since I’ll compress and downsample later in Acrobat. Now you can click process and have yourself another drink.
After this is done, you can preview the results on the right. If everything is right, click Save and wait a little longer.
Now you should have a directory full of images waiting to be made into a PDF.
You can use whatever you’d like to make the PDF from the resulting JPEGs, but I’ve had luck just using Acrobat.
Click Create -> Merge Files into a Single PDF, and then grab all those images you have.
Combine them, and you should now have a huge PDF. Save it, but you aren’t done yet. At this point, I generally take a look at the PDF Optimizer under the Advanced tab, and click Audit Space Usage. Yeah, it should be pretty huge.
If you absolutely need color, just skip this. If your book is black and white, converting is going to save you a ton of space.
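As a back-of-the-envelope check on why this matters, compare the raw (uncompressed) size of color versus grayscale pages. This is a sketch with assumed numbers: 300 page images at 3872 x 2592 pixels, 8 bits per channel. Compressed files won’t shrink by exactly the same factor, so treat it as an upper bound rather than a prediction.

```python
def raw_image_bytes(width_px, height_px, channels, bits_per_channel=8):
    """Uncompressed size of one image in bytes."""
    return width_px * height_px * channels * bits_per_channel // 8


# Assumed values: 300 page images at 3872 x 2592 pixels each.
pages = 300
width, height = 3872, 2592

color_total = pages * raw_image_bytes(width, height, channels=3)  # RGB
gray_total = pages * raw_image_bytes(width, height, channels=1)   # grayscale

print(f"color:     {color_total / 1e9:.1f} GB uncompressed")
print(f"grayscale: {gray_total / 1e9:.1f} GB uncompressed")
print(f"ratio:     {color_total / gray_total:.0f}x")
```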
To convert pages to grayscale, under Advanced click Print Production -> Convert Colors. Check “Convert Colors to Output Intent” and select “Gray Gamma 1.8.” I usually then exclude the front and back covers from the page range; include them only if you don’t care about the pretty color you’ll be missing out on.
This process will also take some time. Acrobat is multithreaded, but it still doesn’t use all 8 logical cores on my i7 920. Just be patient.
After this finishes, you should now see a dramatic difference under the space audit report for Images. There might be a lot of document overhead, however. Don’t worry, this is normal.
At this point, it usually makes the most sense to do some OCR if you want, just to make the document searchable. Document -> OCR Text Recognition -> Recognize Text Using OCR does the trick.
Click Edit and select Searchable Image (Exact). This won’t resize your images or do compression; we’ll do that later. Now, wait a long time while it consumes CPU cycles and hopefully makes your document so much more powerful and useful.
After this finishes, you’re ready to do some compression and hopefully make your document small enough to not be an embarrassment, you storage hog, you. I usually downsample to around 300 DPI, leave monochromatic images alone (since we don’t have any), and opt for JPEG2000. Check everything in the Discard Objects, Discard User Data, and Clean Up tabs.
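To get a feel for what downsampling does to each page, here is a small sketch of the arithmetic. The 600 DPI source and the page dimensions are hypothetical; the point is that pixel count falls with the square of the DPI ratio, and that images already at or below the target are left alone.

```python
def downsampled_size(width_px, height_px, source_dpi, target_dpi=300):
    """Pixel dimensions after downsampling to target_dpi.
    Images at or below the target resolution are returned unchanged."""
    if source_dpi <= target_dpi:
        return width_px, height_px
    scale = target_dpi / source_dpi
    return round(width_px * scale), round(height_px * scale)


# Hypothetical page image: 5100 x 6600 pixels tagged at 600 DPI.
before = (5100, 6600)
after = downsampled_size(*before, source_dpi=600)
kept = (after[0] * after[1]) / (before[0] * before[1])
print(f"{before} -> {after}, keeping {kept:.0%} of the pixels")
```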
Click Ok, and now be prepared to wait the longest you have yet. Even on my rig, this takes an hour or two.
Check the space audit once more, and you should now have a reasonably sized, fully searchable, readable PDF, ready for your enjoyment.
Back on the old brianklug.org I had a number of documents which I’m preserving here in this legacy post: things that, while still relevant, don’t really merit a whole new individual post per document.
- A presentation I put together for an IEEE student contest. It details (at a high level) the installation, benefits, and driver modification required to install an MTRON 7000 SSD in an Intel 965 platform. PDF: SSD Installation in Intel 965 platform for IEEE student contest
- The original documentation prepared for the Imaging Technology Laboratory detailing installation of the MTRON 7000 SSD. It details all the driver modification necessary to enable AHCI and subsequent full throughput of the SSD on the ICH8 platform: Mtron_SSD
- Curving a CCD, Overview and Goals: Technical Note 1. A basic overview of the considerations, challenges, and state of progress regarding field-matching CCD curvature. This document should serve as a primer for why this line of work is both relevant, and important: Curving_CCD_Report
- Twitter Exploration for AI Lab: Feasibility study for spidering, social network analysis, and further research. A primer for what the service is, how it works, how it looks, and how we can leverage it for business intelligence. PPTX: Twitter Exploration, Twitter Update
- ECE 372 microprocessors organization and design final project. This is another Beamer class compiled presentation which details the design and construction of a network appliance monitor and restarter: Final Report Presentation