OCR

From: Peter (BOUGHTONP) 6 Apr 2019 15:13
To: ANT_THOMAS 4 of 11
It's competition results from my climbing centre - usually they're also posted online, but the centre staff aren't given the ability to update the website, and it seems the person who does has forgotten or not bothered.

I could probably ask for an emailed copy, but I don't understand how detecting numerical digits can be difficult.

What's even more frustrating is that it's supposed to be possible to limit Tesseract to only detecting digits, but it doesn't work - it turns out they removed the ability to blacklist/whitelist characters in the current version. :@
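For reference, a rough sketch of what the digits-only invocation usually looks like. The `tessedit_char_whitelist` variable and the `--psm` / `--oem` flags are real Tesseract options, but whether the whitelist is honoured depends on the version and engine: as far as I can tell, 4.0's LSTM engine ignores it and it only takes effect with the legacy engine (`--oem 0`, which needs the legacy traineddata files). The helper name and the file name are made up for illustration.

```python
def digits_only_command(image_path, legacy_engine=True):
    """Build a tesseract command line that asks for digits only (hypothetical helper)."""
    # "stdout" as the output base makes tesseract print the text instead of writing a file.
    # --psm 6 treats the image as a single uniform block of text.
    cmd = ["tesseract", image_path, "stdout", "--psm", "6"]
    if legacy_engine:
        # Assumption: the legacy engine is installed and is the one that honours the whitelist.
        cmd += ["--oem", "0"]
    cmd += ["-c", "tessedit_char_whitelist=0123456789"]
    return cmd


# "results.jpg" is a placeholder file name.
print(" ".join(digits_only_command("results.jpg")))
```

The list can be handed to `subprocess.run` as-is if Tesseract is on the PATH.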

From: ANT_THOMAS 6 Apr 2019 18:08
To: Peter (BOUGHTONP) 5 of 11
I hope you've OCRed the names and typed the numbers by now. But you're right, OCR software never seems accurate enough. Even trying to use it to read my electric meter was too much hassle for the time saved. And that was with very regular characters. I probably could have put more effort in to find a way to give the OCR some history of characters to work with, to make matching more accurate.

From: CHYRON (DSMITHHFX) 6 Apr 2019 18:51
To: Peter (BOUGHTONP) 6 of 11
Are you scanning it or taking a digital photo? Things like original print quality and typography will hugely affect accuracy. Also, the quality of OCR software varies quite a bit.

From: Peter (BOUGHTONP) 6 Apr 2019 21:58
To: ANT_THOMAS 7 of 11
It's not urgent, so I'm leaving it for now, hoping someone will pop along with a miracle cure, or I'll just ask for the original file when I'm next in.

It is annoying that OCR doesn't seem to have moved on in the past two decades - it should be possible to point OCR at anything, have it identify glyphs, then ask for feedback on which ones it got wrong, repeat until happy. Bleh.

From: Peter (BOUGHTONP) 6 Apr 2019 22:34
To: CHYRON (DSMITHHFX) 8 of 11
It's a photo taken with a digital compact, so there's a degree of noise and slight gradient, but there's no reason why it shouldn't be 99% OCR-able.

For example, attached is a crop of the row that gave "Q 124 on eel" - on its own it produces "124 97 2el", and in the first image (fixed horizontal/verticals, but gridlines still present and no brightness/contrast changes), it came closest with "0 124 on 221".

From: Peter (BOUGHTONP) 6 Apr 2019 23:03
To: ALL 9 of 11
I had the thought of forgetting about OCR and searching for what I actually want, i.e. "image to spreadsheet conversion", which came up with this: https://online2pdf.com/convert-jpg-to-excel

The formatting it produced was all over the place, but it did a good job on the numbers - a handful of mistakes, mostly with zeroes. A couple of incorrect numbers (161->151 and 77->17) which were highlighted through the totals not matching, but compared to Tesseract it was brilliant.
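The totals check is easy to automate once the numbers are in a list. A minimal sketch, assuming each row is the individual scores followed by that row's total (the layout and the sample numbers are invented for illustration, not the real results sheet):

```python
def rows_with_bad_totals(rows):
    """Return the indexes of rows whose scores do not add up to the stated total.

    Each row is a list of ints; the last value is assumed to be the row total.
    """
    bad = []
    for index, row in enumerate(rows):
        *scores, total = row
        if sum(scores) != total:
            bad.append(index)
    return bad


# Made-up data: in row 1 a score of 161 has been misread as 151,
# so the scores no longer add up to the total of 238.
ocr_rows = [
    [10, 20, 30],
    [151, 77, 238],
]
print(rows_with_bad_totals(ocr_rows))
```

This only says which rows are wrong, not which digit, and a misread total would flag a row whose scores are actually fine.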

Happy Peter -> :)

From: CHYRON (DSMITHHFX) 6 Apr 2019 23:23
To: Peter (BOUGHTONP) 10 of 11
I've had good luck with online OCR, though I haven't tried it for Excel.

From: Peter (BOUGHTONP) 6 Apr 2019 23:37
To: CHYRON (DSMITHHFX) 11 of 11
I'm guessing it's mostly just regular OCR, but uses tabs if there's more than a single space, although the file I got back did have merged cells with a dozen spaces for some of the rows, suggesting buggy overcomplicated logic.
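To be clear about what I mean by that guess, something like the following would do it. This is only my assumption about the approach, not the site's actual logic, and the function names and sample text are made up:

```python
import re


def line_to_cells(line):
    """Split one line of OCR text into cells on runs of two or more spaces."""
    return re.split(r" {2,}", line.strip())


def text_to_tsv(text):
    """Turn multi-line OCR text into tab-separated rows, one per input line."""
    return "\n".join("\t".join(line_to_cells(line)) for line in text.splitlines())


# Invented sample: single spaces stay inside a cell, wider gaps become tabs.
sample = "Name  Round 1   Total\nA Climber  124      221"
print(text_to_tsv(sample))
```

A single space is kept inside a cell, so "Round 1" survives as one column heading; any wider gap becomes a column break.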

We need to set Stallman on them all.