In this stage, we will be adding OCR processing of uploaded pages, and allowing the BookReader to search the pages of the current book for text. The tesseract program will be used for the OCR process.
You can download a backup image of the Stage 2 system here (you will need to run sudo raspi-config to regain SD card space):
Library Pi Stage 2 Image
You can download the Stage2 PHP files here:
PHP Files for Stage 2
Please note that the image above was taken when the sample book was uploaded and about half done OCR’ing. This way you can confirm the scheduling is all working correctly as the book eventually completes.
SQL script for stage 2
SQL Stage 2
Install Tesseract
Tesseract is the open source OCR package we will be using. It’s another one-line install:
sudo apt-get install tesseract-ocr
To test it, download our sample file:
wget http://www.librarypi.com/downloads/Sample.jpg
Execute the following command to test tesseract:
tesseract Sample.jpg Sample
This may take several minutes to complete, but eventually output should appear like this:
This should have output a Sample.txt file, so look at the top of that file:
head Sample.txt
Now, compare that text with the actual Sample.jpg file, and we see tesseract did a decent job:
MySQL Changes
We need to add two new tables to our MySQL database to support the words on the pages: an lp_word table to hold unique words, and an lp_page_word table to hold a list of the words on each page, with the position of each word. The tables are defined as follows:
create table if not exists lp_word(
  id integer NOT NULL AUTO_INCREMENT,
  word varchar(32),
  PRIMARY KEY PK_lp_word (id),
  INDEX ilp_word (word)
);

create table if not exists lp_page_word(
  id integer NOT NULL AUTO_INCREMENT,
  page_id integer,
  word_id integer,
  seq integer,
  posleft integer,
  postop integer,
  posright integer,
  posbottom integer,
  PRIMARY KEY PK_lp_page_word (id),
  INDEX ilp_page_word_page (page_id),
  INDEX ilp_page_word_word (word_id)
);
Using the mysql command-line client or MySQL Workbench, execute the statements above in your database.
lp.php – Add OCR data to database
We need to modify the PHP ‘LP’ class to add the ability to OCR text using tesseract and store the resulting words in the database for searching. We’re also going to add methods that help the BookReader support text searching.
Method OCRFiles
This method searches the lp_page table for records with ‘N’ status and takes the first one. It then invokes tesseract on that file to create the hocr file, and sets the file status to ‘O’, indicating it was OCR’ed.
function OCRFiles( )
{
  $Ret = 0;
  $Cmd = 'select p.id, filename '.
         ' from lp_page p '.
         ' where status = \'N\'';
  $result = mysqli_query( db(), $Cmd );
  // Only one page is handled per call, since tesseract is slow on the Pi
  if ( $row = mysqli_fetch_array( $result ) )
  {
    $ID = $row["id"];
    $Filename = $row["filename"];
    // Create same file with .hocr extension appended
    // (escapeshellarg guards against spaces or shell characters in the name)
    $Arg = escapeshellarg( $Filename );
    exec( "tesseract $Arg $Arg hocr" );
    if( file_exists( $Filename.".hocr" ) )
    {
      // Update status to 'O' (for OCR'ed)
      mysqli_query( db(), "update lp_page set status='O' where id = $ID" );
      $Ret++;
    }
  }
  $result->close();
  return $Ret;
}
Method AddOCRRecords
This method will look for lp_page records with ‘O’ status. It will then load the corresponding hocr file for the image, and parse it into words. It calls GetWordID to add the word to the database if needed, and then records its position on the page in the lp_page_word table.
function AddOCRRecords()
{
  $Ret = 0;
  $Cmd = 'select id, filename '.
         ' from lp_page '.
         'where status = \'O\'';
  $result = mysqli_query( db(), $Cmd);
  $Max = 10;
  while ( $row = mysqli_fetch_array( $result ) )
  {
    // Limit the work done in a single call (at most 9 pages)
    $Max--;
    if( $Max == 0 )
      break;
    $PageID = $row["id"];
    $Filename = $row["filename"].'.hocr';
    $Text = file_get_contents( $Filename );
    $Seq = 0;
    // Each word span is expected to contain 'bbox L T R B;' before its text
    while( ($P = strpos( $Text, "<span class='ocrx_word'" )) !== FALSE )
    {
      $Text = substr( $Text, $P+23 );   // skip past the 23-character opening
      $P = strpos( $Text, "</span>" );
      $Word = substr( $Text, 0, $P );
      $P = strpos( $Word, 'bbox' );
      $Word = substr( $Word, $P+4 );
      $P = strpos( $Word, ';' );
      $Dim = substr( $Word, 0, $P );    // " left top right bottom"
      $P = strpos( $Word, '>' );
      $Word = strip_tags( substr( $Word, $P+1 ) );
      $WordID = $this->GetWordID( $Word );
      if( $WordID > 0 )
      {
        $Seq++;
        $ar = explode( ' ', $Dim );     // $ar[0] is empty because of the leading space
        $Cmd = "insert into lp_page_word ( page_id, word_id, seq, posleft, postop, posright, posbottom ) values (".
               "$PageID, $WordID, $Seq, ".$ar[1].",".$ar[2].",".$ar[3].",".$ar[4].")"; // LTRB
        mysqli_query( db(), $Cmd );
      }
    }
    $Cmd = "update lp_page set status='I' where id = $PageID";
    mysqli_query( db(), $Cmd );
    $Ret++;
  }
  $result->close();
  return $Ret;
}
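The string slicing above depends on the exact attribute order and quoting that tesseract emits. As a rough alternative sketch (not the code used in lp.php), a regular expression can pull the bounding box and text out of each ocrx_word span. The function name ParseHocrWords and the sample markup are illustrative assumptions; real hOCR output varies between tesseract versions, and this pattern still expects the class attribute to come before the title attribute.

```php
<?php
// Illustrative sketch: extract words and bounding boxes from hOCR markup.
// ParseHocrWords is a hypothetical helper, not part of the LP class.
function ParseHocrWords( $Html )
{
    $Words = array();
    // Match <span class='ocrx_word' ... title='bbox L T R B...'>text</span>
    // Accepts either single or double quotes around attribute values.
    $Pattern = '/<span[^>]*class=[\'"]ocrx_word[\'"][^>]*'
             . 'title=[\'"][^\'"]*bbox (\d+) (\d+) (\d+) (\d+)[^\'"]*[\'"][^>]*>'
             . '(.*?)<\/span>/s';
    if( preg_match_all( $Pattern, $Html, $Matches, PREG_SET_ORDER ) )
    {
        foreach( $Matches as $M )
        {
            $Words[] = array(
                'word'   => trim( strip_tags( $M[5] ) ),
                'left'   => (int)$M[1],
                'top'    => (int)$M[2],
                'right'  => (int)$M[3],
                'bottom' => (int)$M[4],
            );
        }
    }
    return $Words;
}

// Made-up sample resembling tesseract hOCR output
$Sample = "<span class='ocrx_word' id='word_1_1' title='bbox 36 92 96 116; x_wconf 90'>Hello</span> "
        . "<span class='ocrx_word' id='word_1_2' title='bbox 109 92 165 116; x_wconf 88'><strong>Pi</strong></span>";
foreach( ParseHocrWords( $Sample ) as $W )
    echo $W['word'], ' ', $W['left'], ',', $W['top'], ',', $W['right'], ',', $W['bottom'], "\n";
```

Each returned entry maps directly onto the posleft, postop, posright and posbottom columns of lp_page_word.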
Method GetWordID
This method is called to get the ID for a word. If not found, it automatically adds it and returns the new ID.
function GetWordID( $Word )
{
  $Word = $this->ProcessWord( $Word );
  if( strlen( $Word ) == 0 )
    return 0;
  $result = mysqli_query( db(), "select * from lp_word w where w.word = '$Word' ");
  if ( $row = mysqli_fetch_array( $result ) )
    $Ret = $row["id"];
  else
  {
    mysqli_query( db(), "insert into lp_word (word) values( '$Word') ");
    $Ret = mysqli_insert_id ( db() );
  }
  $result->close();
  return $Ret;
}
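GetWordID relies on a ProcessWord method that is not listed in this post. One plausible shape for it, offered purely as an assumption, is to lower-case the word, strip everything except ASCII letters and digits, and trim it to the 32 characters that lp_word.word can hold. Stripping quotes this way would also keep the word safe to embed in the SQL strings above, although a prepared statement would be the more robust choice.

```php
<?php
// Hypothetical sketch of ProcessWord; the real method is not shown in the post.
function ProcessWord( $Word )
{
    // Lower-case so 'Lerner' and 'lerner' share one lp_word row
    $Word = strtolower( trim( $Word ) );
    // Keep only ASCII letters and digits (drops punctuation and quotes)
    $Word = preg_replace( '/[^a-z0-9]/', '', $Word );
    // lp_word.word is varchar(32), so cut anything longer
    return substr( $Word, 0, 32 );
}

echo ProcessWord( "Lerner," ), "\n";      // lerner
echo ProcessWord( "O'Brien" ), "\n";      // obrien
echo strlen( ProcessWord( '--' ) ), "\n"; // 0, so GetWordID would skip it
```

Note that this sketch discards accented and other non-ASCII characters, which may not suit books in languages other than English.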
Method OutputSearchAndExit
This method will perform a search on the lp_page_word and related tables to locate words in pages. It is used by the util.php file, which in turn is used by the BookReader component. Together they allow the user to perform a text search on the uploaded books.
This method is not listed here, to keep the description simple. Leave a comment if you want more details.
New proc.php file
We are going to add a new file, proc.php, to our site, and set up a scheduled task to invoke this page with wget every minute. The job of this script will be to look for pages that need to be OCR’ed or indexed. Please note that this page runs for at least 50 seconds, and quite possibly longer. If you call it up in your web browser, you must be VERY patient, and even then it won’t actually output anything. It’s only called by the scheduler, not normally in a browser.
<?php
include_once( 'lp.php' );

function DoProcess()
{
  $lp = new LP();
  $lp->OCRFiles();
  $lp->AddOCRRecords();
} // end of function DoProcess()

$fp = fopen( "uploads/file.flag","w");
if (flock($fp, LOCK_EX))
{
  try
  {
    $StopAt = time() + 50; // Run for 50 seconds
    $Cnt = 0;
    while( $StopAt > time() )
    {
      DoProcess();
      sleep( 2 );
      $Cnt++;
    }
  }
  catch( Exception $e )
  {
    echo "Exception: " . $e->getMessage();
  }
  flock($fp, LOCK_UN);
} // No else needed
fclose($fp);
?>
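Because cron starts a new proc.php every minute while each run lasts about 50 seconds or more, a second copy can start before the first has finished. With a plain LOCK_EX the second copy waits for the lock; adding LOCK_NB makes it give up immediately instead. The following is a minimal sketch of that variant, not the code used above; the helper name TryLock and the temporary lock path are illustrative assumptions.

```php
<?php
// Hypothetical helper: returns an open, locked handle, or false if
// another process already holds the lock (or the file cannot be opened).
function TryLock( $Path )
{
    $fp = fopen( $Path, 'c' ); // 'c' creates the file without truncating it
    if( $fp === false )
        return false;
    // LOCK_NB: fail at once rather than block until the lock is free
    if( !flock( $fp, LOCK_EX | LOCK_NB ) )
    {
        fclose( $fp );
        return false;
    }
    return $fp;
}

$LockFile = sys_get_temp_dir() . '/lp_proc_example.flag'; // example path only
$fp = TryLock( $LockFile );
if( $fp === false )
{
    echo "Another run is active, exiting\n";
}
else
{
    echo "Lock acquired\n";
    // ... the OCRFiles / AddOCRRecords loop would go here ...
    flock( $fp, LOCK_UN );
    fclose( $fp );
}
```

Whether to wait or to exit is a design choice: waiting guarantees every scheduled run eventually does some work, while exiting keeps requests from piling up on a slow Pi.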
Starting scheduled job
In order to invoke proc.php every minute, we need to set up cron to call it. We do this by running crontab -e to edit the cron table:
crontab -e
Then, in the file, add this line so that wget is called to launch the proc.php file:
* * * * * wget -q http://localhost/proc.php -O /dev/null
Once the schedule has started, you should see the indicator of ‘Pages to process’ slowly dropping:
Note that this is still a Raspberry Pi, and it may take a minute or two for each page to process. So be patient.
Testing Stage 2
Once all the files are in place, the crontab has been modified so that proc.php is called, and the Pi has had enough time to OCR and index a book, you should be able to search for text in the book (BookReader):
Enter text, such as Lerner, in the search field and click the Go button:
You should see an animation indicating the pages that contain the text:
When you go to that page, you should see the search text highlighted in blue:
In Stage 3, we will add user authentication and some security, as well as the ability to delete books.