[Dropped] Concept question about reading OCR files & creating list
I'm working with attorneys who create their files as PDF/A files - meaning they aren't meant to be changed and are OCR readable.
I know I can use FIND to search in the document, after I open it, BUT...
What I'm pondering is:
1) writing something which can read all their documents located in a daily file and create a list of them.
2) the list is created from what I read, as well as from some cross tables that give codes, which are loaded into the file.
3) then write a link to the document's location in the file - I'm estimating about 600 records per day - obviously it will need some error handling, but that will come later.
4) This will be a program which is kicked off by a daily cron job or a scheduler.
5) Then it pushes out the files & list, then moves the files to done.
I know how to do steps 4 & 5, basically.
It's steps 1 - 3 that I'm pondering on.
First - is the reading & gathering information from an OCR document, in some automated manner, possible?
Second - If possible, which direction is best? CALC, Database, Combo, something else?
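Steps 1 to 3 can be sketched in Python, purely as an illustration and not as something anyone in this thread posted. Everything named here is hypothetical: the `CROSS_TABLE` keywords and codes, the `build_list` and `write_list` helpers, and the `extract_text` callable, which stands in for whatever tool ends up reading the PDF text. The result is a CSV file that Calc can open, with a `file://` link for each document.

```python
import csv
from pathlib import Path
from typing import Callable, Dict, List

# Hypothetical cross table: keyword found in the document text -> office code.
CROSS_TABLE: Dict[str, str] = {
    "motion": "MOT",
    "affidavit": "AFF",
    "subpoena": "SUB",
}


def classify(text: str, cross_table: Dict[str, str]) -> str:
    """Return the code of the first cross-table keyword found in the text."""
    lowered = text.lower()
    for keyword, code in cross_table.items():
        if keyword in lowered:
            return code
    return "UNKNOWN"


def build_list(folder: Path,
               extract_text: Callable[[Path], str],
               cross_table: Dict[str, str] = CROSS_TABLE) -> List[dict]:
    """Build one row per PDF in the folder: file name, code, file:// link."""
    rows = []
    pdfs = sorted(p for p in folder.iterdir() if p.suffix.lower() == ".pdf")
    for pdf in pdfs:
        try:
            code = classify(extract_text(pdf), cross_table)
        except Exception as exc:
            # Placeholder error handling: record the problem and keep going.
            code = f"ERROR: {exc}"
        rows.append({
            "file": pdf.name,
            "code": code,
            "link": pdf.resolve().as_uri(),
        })
    return rows


def write_list(rows: List[dict], out_csv: Path) -> None:
    """Write the rows to a CSV file that a spreadsheet can open."""
    with out_csv.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=["file", "code", "link"])
        writer.writeheader()
        writer.writerows(rows)
```

Passing the text extractor in as a parameter keeps the listing logic independent of the extraction tool, so the same code works whether the text comes from an embedded text layer or from a separate OCR run.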
Last edited by MrProgrammer on Tue May 09, 2023 5:11 am, edited 1 time in total.
Reason: Dropped: No response about progress during the month -- MrProgrammer, forum moderator
Apache Open Office 4.1.14
Windows 10.. tho we do bounce around :/
Re: Concept Question about reading OCR files & creating list
It is possible that the OCR might not be necessary. When I run OCR on PDF documents, my OCR front-end frequently announces that the text is already embedded in the PDF and asks whether I really want it to OCR the PDF. I have not yet discovered an application which will extract this embedded text (I'm using Linux as operating system).
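For what it's worth, one candidate for pulling out an embedded text layer is the `pdftotext` command-line tool that ships with Poppler. The Python sketch below is only an illustration under assumptions: that `pdftotext` is installed and on the PATH (untested on the systems discussed here), and that the helper names `extract_embedded_text` and `has_text_layer` plus the 20-character threshold are made up for this example.

```python
import shutil
import subprocess
from pathlib import Path


def extract_embedded_text(pdf_path: Path, timeout: int = 60) -> str:
    """Return the embedded text of a PDF by calling Poppler's pdftotext.

    Assumes the pdftotext executable is installed and on the PATH.
    """
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext (Poppler) was not found on the PATH")
    # "-layout" keeps the physical layout; "-" sends the text to stdout.
    result = subprocess.run(
        ["pdftotext", "-layout", str(pdf_path), "-"],
        capture_output=True,
        check=True,
        timeout=timeout,
    )
    return result.stdout.decode("utf-8", errors="replace")


def has_text_layer(text: str, min_chars: int = 20) -> bool:
    """Heuristic: treat the PDF as searchable if enough non-blank text came out.

    The threshold is an arbitrary assumption; an image-only scan usually
    yields nothing but whitespace and page breaks.
    """
    visible = sum(1 for ch in text if not ch.isspace())
    return visible >= min_chars
```

If `has_text_layer` returns False for a file, that document probably has no embedded text and would need a real OCR pass instead.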
Apache OpenOffice 4.1.15 on Xubuntu 22.04.4 LTS
Re: Concept Question about reading OCR files & creating list
Thanks.. I'm not sure of all the methods by which the PDF/A documents are created. But based on what I've seen in their process, it appears to be via their scanners (ScanSnap). The staff mentioned that the scanner creates two documents, one ending in OCR, and they use the OCR document for filing. I didn't go down that rabbit hole at the time but maybe will have to.
Want to add, while the original doc is created in OpenOffice, currently they print it for the attorney's signature. It is that signed document that I've seen them scan and file.
Apache Open Office 4.1.14
Windows 10.. tho we do bounce around :/
Re: Concept Question about reading OCR files & creating list
For OCR I use gimagereader Qt as front-end (gimagereader gtk is very slow, for reasons I don't know, but the Qt version flies). It uses Tesseract as the OCR engine, which is very accurate with good scans. Numbers in particular should be checked.
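The advice to check numbers could be partly automated. The Python sketch below is a hypothetical helper, not part of gimagereader or Tesseract: it flags tokens that mix digits with letters OCR engines often confuse with digits. The list of lookalike letters (O, o, I, l, S, B) and the minimum length of three characters are assumptions to tune against real scans.

```python
import re

# Tokens made only of digits and lookalike letters, containing at least one
# digit and at least one lookalike letter. The letter list is an assumption.
SUSPECT_NUMBER = re.compile(
    r"\b(?=\w*\d)(?=\w*[OoIlSB])[\dOoIlSB]{3,}\b"
)


def suspicious_numbers(text: str) -> list:
    """Return number-like tokens that may contain misread characters."""
    return SUSPECT_NUMBER.findall(text)


if __name__ == "__main__":
    sample = "Invoice 2O23 total 1l5 paid 2023"
    print(suspicious_numbers(sample))
```

Anything the helper returns is only a candidate for a human to look at; a clean result does not prove the numbers were read correctly.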
Apache OpenOffice 4.1.15 on Xubuntu 22.04.4 LTS
- MrProgrammer
- Moderator
- Posts: 5097
- Joined: Fri Jun 04, 2010 7:57 pm
- Location: Wisconsin, USA
Re: Concept Question about reading OCR files & creating list
Mr. Programmer
AOO 4.1.7 Build 9800, MacOS 13.7, iMac Intel. The locale for any menus or Calc formulas in my posts is English (USA).
Re: Concept Question about reading OCR files & creating list
OMG... Way, way cool. I coded in Perl in another lifetime. I wonder if I can load a Perl program onto these attorneys' system to test? I may need to brush up on Perl. Thanks MrProgrammer
Apache Open Office 4.1.14
Windows 10.. tho we do bounce around :/
Re: Concept Question about reading OCR files & creating list
Thanks Rory. At first I read gim-age-reader & said what? LOL
I'll check it out too.
Apache Open Office 4.1.14
Windows 10.. tho we do bounce around :/
Re: Concept Question about reading OCR files & creating list
oh. https://tesseract-ocr.github.io/tessdoc/Home.html is way interesting & Open Source. YEAH
I can see lots of play-time in my future.
I think I'll leave this open for a while & See how far I get in a month.
"Life is short. Find something you love to do. Then excel in what you do."
Apache Open Office 4.1.14
Windows 10.. tho we do bounce around :/
Re: Concept Question about reading OCR files & creating list
Thanks, Mr Programmer; I'll play with that later, when I have some time - my main backup computer is currently having hysterics and needs talking to.
Apache OpenOffice 4.1.15 on Xubuntu 22.04.4 LTS
Re: Concept Question about reading OCR files & creating list
"my main backup computer is currently having hysterics and needs talking to."
With a Windows install medium in your hand, to properly frighten it.
Slackware 15 64 bit
Apache OpenOffice 4.1.15
LibreOffice 24.8.3.2; SlackBuild for 24.8.3 by Eric Hameleers
---------------------
Roses are Red, Violets are Blue
Unexpected '{' on line 32
Re: Concept Question about reading OCR files & creating list
In my never-ending quest to find the easiest way, I stumbled on this for VB.Net & C# coders. Been there, done those. As well as iTextSharp in ASP.NET. Posts were from 2018.
https://social.msdn.microsoft.com/Forum ... isualbasic I have no idea at this point if anything in this link works, but thought I'd share.
This very thing (Extracting info) will be my project for next week - meeting with their techs to see what's available. And if the force is with me, it will be done next week. Then onto the API aspect of the program.
"If you can't do what you love, then love what you do."
Apache Open Office 4.1.14
Windows 10.. tho we do bounce around :/
Re: Concept Question about reading OCR files & creating list
Have a look at Mr Programmer's script which he points to earlier in this thread.
Apache OpenOffice 4.1.15 on Xubuntu 22.04.4 LTS