[Dropped] Concept question about reading OCR files & creating list

Cat101
Posts: 119
Joined: Thu May 23, 2019 4:06 am
Location: Midwest

Post by Cat101 »

The attorneys I work with create their files as PDF/A. As I understand it, that means the files aren't meant to be changed, and theirs are OCR readable.

I know I can use FIND to search in the document, after I open it, BUT...

What I'm pondering is:

1) Writing something that reads all of their documents located in a daily file and creates a list of them.
2) The list is built from what I read, plus some cross tables that supply codes and are loaded into the file.
3) Then it writes a link to each document's location into the file. I'm estimating about 600 records per day. It will obviously need some error handling, but that will come later.
4) The program will be kicked off by a daily cron job or a scheduler.
5) It then pushes out the files and the list, and moves the files to done.
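
Steps 1 to 3, plus the move in step 5, could be sketched in Python roughly as follows. This is only an illustration under several assumptions: the daily file is treated as a folder of `*.pdf` files, the list is a CSV, and `CODE_TABLE`, `lookup_codes`, `build_index`, and the `extract_text` callable are hypothetical names made up for this example. Text extraction is passed in as a function so that any extractor (embedded text or OCR) can be plugged in.

```python
import csv
import shutil
from pathlib import Path
from typing import Callable, Dict, List

# Hypothetical cross table: keyword found in the text -> code for the list.
CODE_TABLE: Dict[str, str] = {"motion": "MOT", "affidavit": "AFF", "order": "ORD"}


def lookup_codes(text: str, table: Dict[str, str]) -> List[str]:
    """Return the codes whose keyword appears in the text (case-insensitive)."""
    lowered = text.lower()
    return sorted({code for word, code in table.items() if word in lowered})


def build_index(daily_dir: Path, done_dir: Path, list_path: Path,
                extract_text: Callable[[Path], str]) -> int:
    """Write one CSV row per PDF in daily_dir, then move each PDF to done_dir."""
    done_dir.mkdir(parents=True, exist_ok=True)
    rows = 0
    with list_path.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["file", "codes", "link"])
        for pdf in sorted(daily_dir.glob("*.pdf")):
            try:
                text = extract_text(pdf)
            except Exception as exc:
                # Placeholder error handling: record the failure, leave the file.
                writer.writerow([pdf.name, f"ERROR: {exc}", ""])
                rows += 1
                continue
            target = done_dir / pdf.name
            shutil.move(str(pdf), str(target))
            codes = ";".join(lookup_codes(text, CODE_TABLE))
            # The link points at the document's final location in done_dir.
            writer.writerow([pdf.name, codes, target.resolve().as_uri()])
            rows += 1
    return rows
```

A document that fails to read is written as an `ERROR` row and left where it is, so one bad file does not stop the daily run. The keyword match is a plain substring test, so a real cross table would likely need something stricter.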

I know how to do steps 4 & 5, basically.
It's steps 1-3 that I'm pondering.

First: is it possible to read and gather information from an OCR document in some automated manner?
Second: if it is possible, which direction is best? Calc, a database, a combination, something else?
Last edited by MrProgrammer on Tue May 09, 2023 5:11 am, edited 1 time in total.
Reason: Dropped: No response about progress during the month -- MrProgrammer, forum moderator
Apache Open Office 4.1.14
Windows 10.. tho we do bounce around :/
RoryOF
Moderator
Posts: 34786
Joined: Sat Jan 31, 2009 9:30 pm
Location: Ireland

Re: Concept Question about reading OCR files & creating list

Post by RoryOF »

It is possible that the OCR might not be necessary. When I run OCR on PDF documents, my OCR front-end frequently announces that the text is already embedded in the PDF and asks whether I really want it to OCR the PDF. I have not yet discovered an application which will extract this embedded text (I'm using Linux as my operating system).
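
A minimal sketch of that check in Python, assuming the third-party `pypdf` package is installed: read each page's embedded text layer and only fall back to OCR when it is effectively empty. The function names `needs_ocr` and `embedded_page_texts` and the `min_chars` threshold of 20 are hypothetical choices for illustration.

```python
from typing import Iterable, List


def needs_ocr(page_texts: Iterable[str], min_chars: int = 20) -> bool:
    """True when no page carries at least min_chars non-space characters."""
    return not any(len("".join(t.split())) >= min_chars for t in page_texts)


def embedded_page_texts(pdf_path: str) -> List[str]:
    """Read each page's text layer with pypdf (assumed to be installed)."""
    from pypdf import PdfReader  # lazy import so needs_ocr works without pypdf
    reader = PdfReader(pdf_path)
    return [page.extract_text() or "" for page in reader.pages]
```

Typical use would be `needs_ocr(embedded_page_texts("some.pdf"))`, where the file name is a placeholder. A document that mixes text pages and image-only pages would still pass this check, so a per-page test may be the safer design.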
Apache OpenOffice 4.1.15 on Xubuntu 22.04.4 LTS
Cat101
Posts: 119
Joined: Thu May 23, 2019 4:06 am
Location: Midwest

Re: Concept Question about reading OCR files & creating list

Post by Cat101 »

Thanks. I'm not sure of all the methods by which the PDF/A documents are created, but based on what I've seen of their process, it appears to be via their scanners (ScanSnap). The staff mentioned that the scanner creates two documents, one ending in OCR, and they use the OCR document for filing. I didn't go down that rabbit hole at the time, but maybe I will have to.

I want to add that while the original document is created in OpenOffice, they currently print it for the attorney's signature. It is that signed document that I've seen them scan and file.
RoryOF
Moderator
Posts: 34786
Joined: Sat Jan 31, 2009 9:30 pm
Location: Ireland

Re: Concept Question about reading OCR files & creating list

Post by RoryOF »

For OCR I use gimagereader Qt as a front-end. The GTK version is very slow, for reasons I don't know, but the Qt version flies. Both use Tesseract as the OCR engine, which is very accurate with good scans. Numbers in particular should be checked.
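
For an automated run without a front-end, Tesseract can also be called from its command line. The sketch below assumes the `tesseract` executable is installed and on the PATH and that each scanned page is available as an image file; the function names are hypothetical. Passing `stdout` as the output base makes Tesseract print the recognised text instead of writing a file, and `tokens_to_check` lists every token containing a digit so that numbers can be reviewed by hand.

```python
import re
import subprocess
from typing import List


def tesseract_command(image_path: str, lang: str = "eng") -> List[str]:
    """Build the CLI call: tesseract <image> stdout -l <lang>."""
    return ["tesseract", image_path, "stdout", "-l", lang]


def ocr_image(image_path: str, lang: str = "eng") -> str:
    """Run Tesseract on one image and return the recognised text."""
    result = subprocess.run(tesseract_command(image_path, lang),
                            capture_output=True, text=True, check=True)
    return result.stdout


def tokens_to_check(text: str) -> List[str]:
    """Return each whitespace-separated token that contains a digit."""
    return [tok for tok in text.split() if re.search(r"\d", tok)]
```

Flagging whole tokens rather than single digits keeps things like case numbers together for the reviewer.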
MrProgrammer
Moderator
Posts: 5097
Joined: Fri Jun 04, 2010 7:57 pm
Location: Wisconsin, USA

Re: Concept Question about reading OCR files & creating list

Post by MrProgrammer »

RoryOF wrote: Sat Apr 01, 2023 9:03 pm I have not yet discovered an application which will extract this embedded text (I'm using Linux as my operating system)
[Solved] Can I embed font in ODF document?
Mr. Programmer
AOO 4.1.7 Build 9800, MacOS 13.7, iMac Intel.   The locale for any menus or Calc formulas in my posts is English (USA).
Cat101
Posts: 119
Joined: Thu May 23, 2019 4:06 am
Location: Midwest

Re: Concept Question about reading OCR files & creating list

Post by Cat101 »

OMG... Way, way cool. I coded in Perl in another lifetime. I wonder if I can load a Perl program onto these attorneys' system to test? I may need to brush up on Perl. Thanks, MrProgrammer.
Cat101
Posts: 119
Joined: Thu May 23, 2019 4:06 am
Location: Midwest

Re: Concept Question about reading OCR files & creating list

Post by Cat101 »

RoryOF wrote: Sat Apr 01, 2023 10:22 pm For OCR I use gimagereader Qt as a front-end. The GTK version is very slow, for reasons I don't know, but the Qt version flies. Both use Tesseract as the OCR engine, which is very accurate with good scans. Numbers in particular should be checked.
Thanks Rory. At first I read gim-age-reader & said what? LOL
I'll check it out too.
Cat101
Posts: 119
Joined: Thu May 23, 2019 4:06 am
Location: Midwest

Re: Concept Question about reading OCR files & creating list

Post by Cat101 »

Oh, https://tesseract-ocr.github.io/tessdoc/Home.html is way interesting & open source. YEAH
I can see lots of play-time in my future.

I think I'll leave this open for a while & see how far I get in a month.



"Life is short. Find something you love to do. Then excel in what you do."
RoryOF
Moderator
Posts: 34786
Joined: Sat Jan 31, 2009 9:30 pm
Location: Ireland

Re: Concept Question about reading OCR files & creating list

Post by RoryOF »

MrProgrammer wrote: Sat Apr 01, 2023 10:29 pm
RoryOF wrote: Sat Apr 01, 2023 9:03 pm I have not yet discovered an application which will extract this embedded text (I'm using Linux as my operating system)
[Solved] Can I embed font in ODF document?
Thanks, Mr Programmer; I'll play with that later, when I have some time. My main backup computer is currently having hysterics and needs talking to.
robleyd
Moderator
Posts: 5263
Joined: Mon Aug 19, 2013 3:47 am
Location: Murbko, Australia

Re: Concept Question about reading OCR files & creating list

Post by robleyd »

my main backup computer is currently having hysterics and needs talking to.
With a Windows install medium in your hand, to properly frighten it :-)
Slackware 15 64 bit
Apache OpenOffice 4.1.15
LibreOffice 24.8.3.2; SlackBuild for 24.8.3 by Eric Hameleers
---------------------
Roses are Red, Violets are Blue
Unexpected '{' on line 32
.
Cat101
Posts: 119
Joined: Thu May 23, 2019 4:06 am
Location: Midwest

Re: Concept Question about reading OCR files & creating list

Post by Cat101 »

In my never-ending quest to find the easiest way, I stumbled on this for VB.NET & C# coders. Been there, done those, as well as iTextSharp in ASP.NET. The posts were from 2018.

https://social.msdn.microsoft.com/Forum ... isualbasic I have no idea at this point if anything in this link works, but I thought I'd share.

This very thing (extracting info) will be my project for next week. I'm meeting with their techs to see what's available, and if the force is with me, it will be done next week. Then on to the API aspect of the program.



"If you can't do what you love, then love what you do."
RoryOF
Moderator
Posts: 34786
Joined: Sat Jan 31, 2009 9:30 pm
Location: Ireland

Re: Concept Question about reading OCR files & creating list

Post by RoryOF »

Have a look at Mr Programmer's script which he points to earlier in this thread.