Ask Hackaday: How Do You Digitize Your Documents?

Like many of you, I have a hard time getting rid of things. I have boxes and boxes of weird stuff and lots of gadgets that I will eventually get around to taking apart, leaving me with even more bits and bobs. Despite periodic purges (I try to haul a carload of it to local hackerspaces and meetups at least a couple of times a year), the pile continues to grow.

But the problem is not limited to hardware components. There are all kinds of things that the logical part of me understands I will almost certainly never need, and yet I can’t bring myself to get rid of them. One of those things turns out to be documents. Everything printed is fair game. It could be notes from my last doctor’s appointment or flyers from events I attended years ago. It doesn’t matter; the piles keep building up until I end up putting everything in a box and starting the whole process again.

I have largely convinced myself that the perennial accumulation of electronic trinkets is an occupational hazard, and I have accepted it. But I think there’s a lot of room to change the document situation, and if that involves a bit of high-tech over-engineering, even better. As such, I have spent the last few weeks researching how to digitize documents that contain information worth preserving, so that the originals can be sent to Valhalla in my stove.

The following represents some of my observations so far, in the hope that others following a similar path may find them useful. But what really interests me is hearing from the Hackaday community. I’m probably not the only one trying to save storage space by converting stacks of paper into ones and zeros.

Take a picture, it will last longer

The first step in digitizing physical documents is capturing images. The most obvious way to do this is to simply use a flatbed scanner, and in some cases a strong argument can be made that this is the best approach. In fact, many of the documents I have already archived digitally were captured this way. But it’s a tedious enough process that you’ll want to consider alternative methods.

If you have a decent camera, you can get a couple of lights and set up a nice overhead photography rig without spending too much money. Place your document under the camera, take a photo, and move on to the next one.

There’s nothing faster than taking a photo, and as long as you’re not using a point-and-shoot camera from the early 2000s, the resolution should be more than enough. This method is particularly attractive if you plan to digitize books or anything else that can’t lie perfectly flat on a scanner.

The main disadvantage of this approach is the setup itself. It’s one thing if you digitize documents and books on a daily basis, but for occasional use, putting something like this together is a big task. A flatbed scanner certainly takes up much less space and you don’t have to worry about getting the lighting right, mounting the camera, etc.

Casting some magic

Whether you used a scanner or a camera, once you have the image of your document, you have technically digitized it. Congratulations, you are now an amateur archivist.

If you’re looking to keep things simple, you can stop here. Save the files somewhere and you’re done. But depending on the type of content you’re working with and what your goals are, you’ll likely want to touch up your images a bit. Luckily for us, the amazing ImageMagick project has many of the features we need built in, from cropping and resizing to image enhancement.

Consider the image below. It’s clear enough to read, but the text is rotated and the lighting is not consistent across the page.

We can fix both problems with a simple ImageMagick command via the convert tool:

convert input.png -deskew 30% -threshold 25% output.png

We won’t get too bogged down in the details; the ImageMagick documentation can break it all down better than I can. The short version is that we tell it to straighten the image and turn it into pure black and white. The result looks like this:

The values can be modified a little to refine the result and, as you can imagine, there are many other ImageMagick functions that could be incorporated to clean up the result. Things get more complicated if you’re working with something more complex than plain text, but you get the general idea.
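To apply the same cleanup to a whole folder of scans, the command above can be wrapped in a short script. The following is only a sketch in Python: the function names, the folder layout, and the `dry_run` switch are illustrative assumptions, and the 30% and 25% defaults are simply copied from the example command. It also assumes the `convert` tool is on your PATH; newer ImageMagick releases favor the `magick` command instead.

```python
import shutil
import subprocess
from pathlib import Path


def build_convert_command(src, dst, deskew=30, threshold=25):
    """Return the argument list for the deskew + threshold call used above."""
    return [
        "convert", str(src),
        "-deskew", f"{deskew}%",
        "-threshold", f"{threshold}%",
        str(dst),
    ]


def clean_folder(in_dir, out_dir, dry_run=False):
    """Build (and optionally run) one cleanup command per PNG in in_dir."""
    in_dir, out_dir = Path(in_dir), Path(out_dir)
    commands = [
        build_convert_command(src, out_dir / src.name)
        for src in sorted(in_dir.glob("*.png"))
    ]
    if dry_run:
        # Only report what would be executed; nothing is written.
        return commands
    if shutil.which("convert") is None:
        raise RuntimeError("ImageMagick's convert tool was not found on PATH")
    out_dir.mkdir(parents=True, exist_ok=True)
    for cmd in commands:
        subprocess.run(cmd, check=True)
    return commands
```

Calling `clean_folder("scans", "cleaned", dry_run=True)` returns the commands without running them, which is a cheap way to check the settings before committing to a large batch.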

This type of post-processing is especially important if you plan to put the images through some type of optical character recognition (OCR) to capture the actual text of the document. That first image could be perfectly readable to our human eyes, and you may even prefer it to the stark look of the processed image, but tools like Tesseract find it very difficult to pick out text when the background is inconsistent.
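As a rough illustration of the OCR step, the sketch below hands a cleaned-up image to the Tesseract command-line tool and returns the recognized text. The function names and the default `eng` language are assumptions made for the example, and it presumes the `tesseract` binary is installed and on your PATH.

```python
import shutil
import subprocess


def build_tesseract_command(image_path, lang="eng"):
    """Return the Tesseract CLI arguments for a single image."""
    # Using "stdout" as the output base makes Tesseract print the text
    # instead of writing it to a .txt file.
    return ["tesseract", str(image_path), "stdout", "-l", lang]


def ocr_image(image_path, lang="eng"):
    """Run Tesseract on one image and return the recognized text."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract was not found on PATH")
    result = subprocess.run(
        build_tesseract_command(image_path, lang),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

Running `ocr_image` on both the original and the processed version of the same page is a simple way to judge whether your cleanup settings actually help.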

There is an app for that

The process described here is certainly not for everyone, and that’s okay. If you’re not looking to invest the time and effort required to make this work, luckily there is a much more accessible solution available. In fact, it may already be in your pocket.

The Google Drive mobile app offers a very impressive document scanning mode that essentially automates all of the above. If you give it access to your device’s camera, it will automatically detect documents in the field of view, find their edges, compensate for angle and rotation to straighten the image, and even run it through filters to highlight text. It’s fast, works reasonably well, and is exceptionally useful for generating multi-page PDF files.

The downside is that you have relatively little control over the process and, being a Google product, there are the usual concerns about what they may be doing with the information that passes through the system. For these reasons, it’s not something I’d personally recommend for private information, and its automated nature and lack of detailed control mean it may not be a good choice if your needs stray too far off the beaten path.

Still, the speed and ease of use it offers are certainly very attractive.

Open to suggestions

I would love to hear the community’s opinion on digitization, whether related to hardware or software. There are surely some clever projects out there that help create custom digital libraries, and there are many areas where real-world experience can help streamline and improve the overall process. For example, what is your file naming convention like?
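To make that last question concrete, here is one possible convention, sketched in Python: an ISO date first so files sort chronologically, then a category and a short title. The scheme and the `archive_name` helper are purely hypothetical, not something I have settled on.

```python
import re
from datetime import date


def archive_name(doc_date, category, title, ext="pdf"):
    """Build a sortable file name such as 2024-03-15_medical_appointment-notes.pdf."""
    def slug(text):
        # Lowercase, and collapse anything that isn't a letter or digit into "-".
        return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")

    return f"{doc_date.isoformat()}_{slug(category)}_{slug(title)}.{ext}"


print(archive_name(date(2024, 3, 15), "Medical", "Appointment notes"))
# prints: 2024-03-15_medical_appointment-notes.pdf
```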

Hackaday readers are rarely shy about sharing their opinions, so let’s hear them.
