Electronic library for PocketBook: automatic processing

Probably every avid reader of e-books would like to keep their entire collection right on the e-reader and, despite the general sluggishness of the device, still navigate it easily.
Keeping hundreds or thousands of books on a reader is often a problem: either the device takes ages to start because it reads the metadata out of every book, or you maintain the collection by hand, sorting it into directories, which is a pain in its own right.

Preamble

I remember this problem in the early Sony readers: you upload several hundred books, and the device hangs at power-on while it builds the list of uploaded files. There was no support for collections yet, so the Sony spent a long time crawling the whole card and gathering information about the books. Many people complained.
PocketBook readers were more convenient to work with: they supported navigation through the file system, so you could sort books into folders by hand, and the device read only the books in the folder you had entered.
Arranging things by hand from a computer, you could already put something together and live with it.

Around the same time I got to know LibRusEc (Flibusta is more relevant these days; if you don't know what that is, look it up). And at some point an idea came to me: why not try to cram the entire library into the reader, automating the process a bit?

And now the important part! So as not to waste the attention of people this article most likely cannot help (unfortunately there will be some), here is a quick filter:
If you own a PocketBook e-reader and like the idea of cramming as many books as possible into it with convenient navigation, you are definitely in the right place. If you own another e-reader that nevertheless supports the FB2 format and navigation through directories, you probably are too; see the description of the $no_leased_storage setting closer to the end of the article. Everyone else, unfortunately, this article cannot help. Sorry.

Having poked around in the PocketBook functionality a little, I found one interesting feature: it supports link files, something like shortcuts in Windows or symlinks in Unix. I will explain later why we need them and what makes this feature so nice; for now I'll just say that the device itself uses them for the Favorites section. When you add a book there, only a link to the real book file is placed in a special folder on the device. The book itself is not copied to Favorites.
Also, after talking with the developers, I found out that SDHC cards of up to 32 GB were already supported at the time. The specifications of current models with a microSD slot still list the same 32 GB, which is somewhat disappointing, though not very. Besides, the specifications should be checked in practice: the device may actually support more. I just couldn't bring myself to buy a 64 GB card for the sake of an uncertain test.
And there is one more great feature: support for .fb2.zip, where each book in fb2 format is packed into its own zip archive. PocketBook handles such books transparently, that is, exactly like unpacked ones.

One thing is worth explaining right away, to head off rash questions: nobody is going to read hundreds of thousands of books (and those are the quantities we are talking about). Keeping such a collection on the device in the hope of reading it all someday would be simply insane.
The convenience of a cataloged collection of this size is of a completely different kind. Say someone recommends a particular book: you look, and you already have it. Or a particular author: you find them on your reader right away and add them to Favorites. Yes, an author's directory can be added to Favorites too; links to directories also work on PocketBooks.
Something like: "Read some Harrison." "OK, will do." And you put him in Favorites on the spot. Otherwise you will forget, and you are not going to write it down in a notebook.

Theoretical part

So, down to business. I wrote a PHP script that takes fb2 files from a source folder, including ones zipped into archives in batches, and creates a file-system-based book collection for uploading to a PocketBook.
It builds the collection in a cunning way. Here a few nuances of the fb2 format need a separate mention.
First, a book keeps its title (as well as information about authors, genres and series) inside itself. The file name does not matter at all.
Second, a book can be written by several authors, belong to several genres and be part of several series. And I want to search and catalog by authors, by genres and by series alike. On top of that, if a book was written by two authors, I want it to appear in the catalogs of both.
Copy the book into every appropriate directory? Not elegant.
This is where link files come to the rescue.
The script places the books themselves (zipped) in a separate directory, creating nested subdirectories inside it and spreading the books out so that each final directory contains no more than a hundred books and each intermediate directory no more than a hundred subdirectories. This keeps the reader from slowing down when it accesses a particular zip by wading through many files in one directory.
Next to it, for example in the root of the card, a cataloger directory is created. Inside it is all the navigation by authors, series and genres, and its leaf directories contain, instead of books, the link files described above, which point to the corresponding zip files. The clever PocketBook shows the books themselves in place of these links and opens them just as if they were located in the cataloger.
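
The "no more than a hundred per directory" layout can be sketched roughly as follows. This is only an illustration written for current PHP (7+), not the code from process.php: the function name storage_path(), its parameters and the idea of deriving the directories from a sequential book number are my assumptions, chosen so that small numbers give paths in the same style as the storage listing in this article.

```php
<?php
// Sketch only (PHP 7+): map a sequential book number to a sharded storage path.
// storage_path() and its parameters are hypothetical names, not the script's own.
function storage_path(int $id, string $root = '_zipstorage_', int $perDir = 100): string
{
    $leaf = intdiv($id, $perDir);    // index of the final directory that holds the book
    $top  = intdiv($leaf, $perDir);  // index of the intermediate directory above it
    return sprintf('%s/%08x/%08x/%08x.zip', $root, $top, $leaf, $id);
}

echo storage_path(1), "\n";      // _zipstorage_/00000000/00000000/00000001.zip
echo storage_path(12345), "\n";  // _zipstorage_/00000001/0000007b/00003039.zip
```

With this arithmetic each final directory receives at most 100 consecutive numbers and each intermediate directory at most 100 final directories; the real script may number and split things differently.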

As an example, here is how 10 arbitrary books end up being stored, that is, what they turn into in the file system of the SD card:

/_zipstorage_/00000000/00000000/00000001.zip
/_zipstorage_/00000000/00000000/00000002.zip
/_zipstorage_/00000000/00000000/00000003.zip
/_zipstorage_/00000000/00000000/00000004.zip
/_zipstorage_/00000000/00000000/00000005.zip
/_zipstorage_/00000000/00000000/00000006.zip
/_zipstorage_/00000000/00000000/00000007.zip
/_zipstorage_/00000000/00000000/00000008.zip
/_zipstorage_/00000000/00000000/00000009.zip
/_zipstorage_/00000000/00000000/0000000a.zip
/// -///  /00000001.flk
/// -///  /00000002.flk
/// -///  /00000003.flk
/// -///  /00000004.flk
/// -///  /00000005.flk
/// -///  /00000006.flk
/// -/// /00000008.flk
/// -/// /0000000a.flk
/// -///  /00000007.flk
/// -///  /00000009.flk
//_/ /  / -///  /00000007.flk
//_/  // -/// /0000000a.flk
//_/, // -///  /00000009.flk
//_// / -///  /00000001.flk
//_// / -///  /00000002.flk
//_// / -///  /00000003.flk
//_// / -///  /00000004.flk
//_// / -///  /00000005.flk
//_// / -///  /00000006.flk
//_/// -/// /00000008.flk
//_// / -///  /00000001.flk
//_/ /  / -///00000007.flk
//_/ /  / //00000007.flk
//_/  // -///0000000a.flk
//_/  // //0000000a.flk
/// -///00000007.flk
/// -///00000007.flk
/// -///0000000a.flk
/// //00000007.flk
/// //00000007.flk
/// //0000000a.flk

Inside each zip file in the _zipstorage_ directory there is a book in .fb2 format with the same name as the zip file.
Inside each .flk file is the path to the corresponding zip file with the book.
On the device, the corresponding books are displayed in place of the .flk files, and they behave exactly as if they were located there instead of the .flk files. That is, the user of the reader never sees any "00000007.flk", "00000007.zip" or "00000007.fb2".
As you can see, without the .flk links, if we had to copy each file to every place it belongs, our 10 files would turn into 31 files of the same size on the card.
Thanks to this trick with links, that did not happen.
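
To make the link idea concrete, here is a minimal sketch of writing such links, written for current PHP (7+). It assumes that a .flk file is plain text containing nothing but the path to the archive; the exact format the device expects is not spelled out here, and write_link() as well as the directory names are hypothetical.

```php
<?php
// Sketch only (PHP 7+). Assumption: a .flk file is plain text holding the
// path to the target archive; the exact on-device format may differ.
function write_link(string $catalogDir, string $linkName, string $targetPath): string
{
    if (!is_dir($catalogDir) && !mkdir($catalogDir, 0777, true)) {
        throw new RuntimeException("Cannot create $catalogDir");
    }
    $linkFile = $catalogDir . '/' . $linkName . '.flk';
    file_put_contents($linkFile, $targetPath);
    return $linkFile;
}

// One stored archive, referenced from an author directory and a genre directory
// (both directory names are made up for the demo).
$base   = sys_get_temp_dir() . '/flk_demo_' . uniqid();
$target = '/_zipstorage_/00000000/00000000/00000007.zip';
$links  = [
    write_link("$base/Library/Authors/S/Sem", '00000007', $target),
    write_link("$base/Library/Genres/Fantasy", '00000007', $target),
];
foreach ($links as $link) {
    echo $link, ' -> ', file_get_contents($link), "\n";
}
```

Each link costs a few dozen bytes instead of a second copy of the book, which is the whole point of the scheme.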

You do not need to go into the "_zipstorage_" directory at all: it is just the repository.
Go to the cataloger called "Library".
There you will be issued a reader's card, and at your request the librarian will go down to the vault and fetch you ...

You may not immediately see why, in this library, you have to reach the book "Cube of red plastic" along such a long path: "/Authors/Letters A-Ya/S/Sem/Semenova Maria Vasilyevna".
But remember that we are talking not about a dozen files but about hundreds of thousands. It is much faster to enter the directories "Letters A-Ya", "S" and "Sem" one after another and find "Semenova Maria Vasilyevna" there than to wait a long time for the full list of authors to display and then page through it for a long time until you reach the letter "S".
In the same way, if you need to find all the books by Harry Harrison, it is logical to assume the path will be "/Authors/Letters A-Ya/G/Gar/Garrison Garri" (in Russian the surname is spelled with a G). Since there are a huge number of authors under the letter "G" alone, I had to introduce an additional level of directories named after the first three letters of the surname. By the way, in the "process.php" script this is controlled by the parameters of the get_splitten_dirs() calls.
Besides, you do not need to go into the "library" every time you want to start the next Harrison book; you only need to get there once. Then, as I wrote above, you put all of Harrison into Favorites and always have him at hand. It really is convenient.
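
The splitting by first letter and first three letters can be sketched like this (current PHP, 7+). The helper author_subpath() is a hypothetical stand-in: the real work is done by get_splitten_dirs(), whose signature is not shown in this article, so treat the code only as an illustration of the idea.

```php
<?php
// Sketch only (PHP 7+). author_subpath() is a hypothetical stand-in for what
// the get_splitten_dirs() calls achieve; the real signature is not shown.
function author_subpath(string $author, int $prefixLen = 3): string
{
    // With the /u modifier "." matches a whole UTF-8 character,
    // so Cyrillic names are split correctly too.
    if (!preg_match('/^(.)/u', $author, $first)
        || !preg_match('/^.{1,' . $prefixLen . '}/u', $author, $prefix)) {
        throw new InvalidArgumentException('Empty or non-UTF-8 author name');
    }
    return $first[1] . '/' . $prefix[0] . '/' . $author;
}

echo author_subpath('Semenova Maria Vasilyevna'), "\n"; // S/Sem/Semenova Maria Vasilyevna
```

The sketch assumes names already start with a capital letter; it does no case folding.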

And I have been using all this almost single-handedly (not counting a few friends who took a ready-made collection from me or simply cloned my SD card) for about three years now. That's not right; it is probably time to share.

Practical part

Well, I think that is enough theory. Those who are in the know will understand everything at once, including where on Flibusta to get the necessary "raw materials", and those who are not and are not interested are unlikely to have read this far.
So let's move on to the script itself; you will find a link to the zip archive with it at the end of the article.
I will not describe the problems I ran into and how I solved them. That is a topic for a separate article, probably in the PHP section.
I will describe only the general scheme of the script and its settings.

Let's start with what the script does from the outside

In addition to reading the source directory and building the cataloger, the script passes every fb2 file through itself. That is, it parses the structure and saves the file anew.
This is needed to correct format errors. Some files arrive broken to begin with: here an invalid character was not converted to an HTML entity, there stray tags turn up that were lifted from web pages and are not valid in the fb2 format, then there are unclosed tags, then something else. It is not that there are very many of these errors, but they occur fairly regularly, and the reader cannot open such files.

The script also translates genre codes into more readable names.

So. When you download the zip archive, you will find several PHP files in it and, in subdirectories, a few books as examples.
There is an "out_dir\src" directory, from which the books are read, including zipped ones.
And there is an "out_dir\dest" directory, into which the cataloger and the storage are written. After the script has run, everything in this directory should be moved to the root of the SD card.
Alternatively, you can configure the script right away to write directly to the card.

A couple of nuances about writing to SD cards need to be mentioned here. Be prepared for the process to be slow, given the amount of data and its fragmentation.

There will be a lot of files. In addition to the source files, at least three times as many small .flk files will be written to the card. Each of them contains a link to a zip archive and is only a few bytes long, but on the media it occupies a whole cluster. Keep this in mind. It therefore makes sense to format the card with the smallest cluster size. Otherwise files that add up to mere megabytes can easily eat a whole gigabyte of space on the card.
The card (flash drive or SD, it does not matter) will heat up. It slows down after a short while because the buffers fill up, and the write speed also drops as it heats up. I do not know why this happens, but it does. The card needs breaks to cool down. I usually have the script build the library on the hard drive and then copy it to the card in batches. It may also make sense to use a container such as TrueCrypt: build the library in it as if on the card, then take a disk image and deploy it to the SD card. That would be the fastest way to write to the card.
As for the SD card, if you do not already have a large one, it is better to buy a higher class (for example, Class 10). It will cost a little more but will speed up writing considerably. A Class 10 card, by the way, can quite tolerably be filled to the brim within a working day. Let me remind you once more that we are talking about hundreds of thousands of small files. In terms of write speed this is fundamentally different from "copying a 4 GB movie". It is nothing like that at all.
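
The cluster remark is easy to check with a little arithmetic. The sketch below (current PHP, 7+) rounds each file up to whole clusters; the number of link files and their size are made-up example figures, not measurements from my library.

```php
<?php
// Sketch only (PHP 7+): space a file really occupies when the file system
// allocates whole clusters. The counts and sizes below are made-up examples.
function allocated_bytes(int $fileSize, int $clusterSize): int
{
    return intdiv($fileSize + $clusterSize - 1, $clusterSize) * $clusterSize;
}

$linkFiles = 300000;  // hypothetical number of .flk files
$linkSize  = 60;      // hypothetical size of one link file, in bytes

foreach ([4096, 32768] as $cluster) {
    $total = $linkFiles * allocated_bytes($linkSize, $cluster);
    printf("cluster %5d B: %.2f GiB on the card\n", $cluster, $total / (1024 ** 3));
}
printf("sum of file sizes: %.1f MiB\n", $linkFiles * $linkSize / (1024 ** 2));
```

Under these assumptions the same links take roughly eight times more space with 32 KiB clusters than with 4 KiB ones, while the sum of their sizes stays in the megabytes.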

The script was written on a Windows machine, so PHP was installed under Windows, version 5.2.9-2. In theory it does not care which system it runs on, but something may fail to start for someone.
Windows is also the reason there are .bat files for running the scripts in the archive.
This is a cut-down version of the script. The full one consists of two parts plus an InterBase database, into which all books are first imported, duplicates are filtered out, data is cleaned up, differing spellings of authors, genres, series and titles are corrected (such as Pushkin A.S. => Pushkin Alexander Sergeevich), MD5 hashes are collected for later library updates, and so on.
I see no point in dumping all of that here, because setting up PHP to run a script is one thing, while setting up a database, installing all the libraries and so on is quite another. Hardly many of you would bother. It is easier for everyone to bolt their own database functionality onto the script if they need it. So the script has been cut down to work autonomously, without any database. It is slightly less functional, but not critically so.

While the script runs, a log file with errors ("prc_errors.txt") is created next to it. This is set up in the startup .bat file, which redirects STDERR to this file. If there were no errors, the file will be zero bytes long.

The inner workings

The main script is in the file "process.php", and that is the one to run.
The settings are stored at its beginning. Here they are:

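    $out_file='./out.txt';
    $src_dir='./out_dir/src';       // source directory with the books
    $dest_dir='./out_dir/dest';     // destination directory; its contents go to the SD card
    $storagename='_zipstorage_';    // name of the storage directory that the .flk files point to
    $libname='Library';             // name of the cataloger directory
    $compress_storage=true;         // store books as zip (true) or as plain fb2 (false)
    $control_genres_export=false;   // export only the genres enabled in genres.inc
    $no_leased_storage=false;       // true: no separate storage and no .flk files
    $dbid=0;
    $GLOBALS['process_config']=array(
        'struct_only'=>false,
        'unknown_tags_processing'=>XMLP_UT_CUT,      // what to do with tags that are not valid fb2 (see 'tags_hash')
        'strip_comments'=>false,                     // strip <!-- --> comments
        'tags_hash'=>$GLOBALS['XMLP_FB2_elements'],  // the valid fb2 tags
        'tags_processing'=>array(                    // controls generation of the tag-matching regular expression
            'name_first_alpha'=>false,
            'name_len'=>20,
            'content_len'=>512,
            'pattern_type'=>XMLP_TPS_STRICT,
            'include_comments'=>true,
        ),
    );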

Let me describe the parameter $GLOBALS['process_config']['unknown_tags_processing']. It steers the parser, or more precisely, what the parser does with tags it finds that are not valid in the fb2 format (FictionBook 2.0):

XMLP_UT_LEAVE - leave everything as it was in the source.
XMLP_UT_CONVERT - convert to text displayed in the book. That is, <tag>text</tag> turns into &lt;tag&gt;text&lt;/tag&gt;, so the book displays it as <tag>text</tag>.
XMLP_UT_CUT - only the tags themselves are cut. The text inside them stays available to the reader: <phpcode>$a = $b;</phpcode> turns into $a = $b;
XMLP_UT_CUT_FULL - unknown tags are cut together with their contents, that is, completely.
XMLP_UT_CUT_SMART - not implemented; an alias for XMLP_UT_CUT.
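
The difference between the modes can be illustrated with a small regex-based sketch in current PHP (7+). It is not the parser from xmlp.inc: the UT_* constants, process_unknown_tags() and the tiny list of known tags are hypothetical stand-ins, and the sketch ignores nesting of identical unknown tags.

```php
<?php
// Sketch only (PHP 7+). The UT_* constants and process_unknown_tags() are
// hypothetical stand-ins; the real parser in xmlp.inc works differently inside.
const UT_LEAVE = 0;
const UT_CONVERT = 1;
const UT_CUT = 2;
const UT_CUT_FULL = 3;

function process_unknown_tags(string $xml, array $knownTags, int $mode): string
{
    if ($mode === UT_LEAVE) {
        return $xml;
    }
    // Collect every tag name used in the text and keep those not in the known list.
    preg_match_all('~</?([a-z][\w-]*)~i', $xml, $found);
    $unknown = array_diff(array_unique(array_map('strtolower', $found[1])), $knownTags);

    foreach ($unknown as $name) {
        $q = preg_quote($name, '~');
        if ($mode === UT_CUT_FULL) {
            // Drop the element together with its content (no nesting support).
            $xml = preg_replace(sprintf('~<%s(?![\w-])[^<>]*>.*?</%s\s*>~is', $q, $q), '', $xml);
        }
        // Handle the tags themselves: escape them (CONVERT) or strip them.
        $xml = preg_replace_callback(
            sprintf('~</?%s(?![\w-])[^<>]*>~i', $q),
            function (array $m) use ($mode): string {
                return $mode === UT_CONVERT ? htmlspecialchars($m[0], ENT_NOQUOTES) : '';
            },
            $xml
        );
    }
    return $xml;
}

$known = ['p', 'a', 'emphasis'];   // tiny subset of fb2 tags, for the demo only
$src   = '<p><phpcode>$a = $b;</phpcode></p>';
echo process_unknown_tags($src, $known, UT_CONVERT), "\n";
echo process_unknown_tags($src, $known, UT_CUT), "\n";
echo process_unknown_tags($src, $known, UT_CUT_FULL), "\n";
```

The three calls print the escaped tag, the bare inner text and an empty paragraph respectively, which mirrors the descriptions in the list above.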

Obviously, two options suit error correction best: XMLP_UT_CONVERT and XMLP_UT_CUT. Since in most cases the reader has no interest in these tags at all (they are garbage), the most sensible thing, in my opinion, is simply to cut them with the XMLP_UT_CUT option.

The FictionBook format tags themselves are passed to the parser in the $GLOBALS['process_config']['tags_hash'] parameter. The $GLOBALS['XMLP_FB2_elements'] array is predefined at the beginning of the xmlp.inc file.

The $GLOBALS['process_config']['tags_processing'] parameter controls how the regular expression that recognizes tags in the text is generated. It is better not to touch much there; the defaults are fine. The one thing you might do is set 'name_first_alpha' to true. In that case the engine requires an alphabetic character at the start of a tag name, so that, for example, <:> is treated not as an unknown tag but as garbage in the text, with all that follows: with the XMLP_UT_CUT option such tags are not cut out but are turned into HTML entities and shown to the end reader of the book.

If the 'include_comments' option is disabled here, a simpler regular expression is generated and the script runs at least twice as fast, but comments in <!-- --> tags inside the book files turn into mangled mush and become visible to the end reader. If the comment falls within the book text, that is not so bad, but if it is in the header area, the book most likely will not open on the device at all, since this breaks the XML structure of the document (arbitrary text between XML tags).

The $no_leased_storage setting. If it is set to true, no separate storage for books is used, that is, there is no '_zipstorage_' and there are no .flk files. In place of the .flk files, the library gets fb2 or zip files (depending on the $compress_storage parameter). This eats more space on the SD card because of the multiple copies of the same book (see the description of the problem above), but it lets owners of other e-readers (not PocketBook) use the same script to catalog their libraries automatically, provided the reader supports the fb2 format and navigation through directories on the card.

Another setting is $control_genres_export. Its rules live in the file "genres.inc".
It contains an array that decodes genres, which looks like this:

    ''=>array('_ _','_ _', true),
    'biography'=>array('','', true),
    'biogr_historical'=>array('  ','', true),
    'biogr_sports'=>array(' ','', true),
    'biogr_arts'=>array(' ','', true),
    'banking'=>array(' ',' ', true),
    'accounting'=>array(', , ',' ', true),
    'design'=>array('  ',' ', true),
    'org_behavior'=>array(' ',' ', true),

See the element with the value true at the end of each array? With the $control_genres_export option turned on, the script exports only those genres whose last element is true. Books that have no genre marked as true in "genres.inc" are ignored, and their files do not go onto the memory card either.
This is very handy if you realize that your collection is larger than the space on your SD card.
In that case you simply edit the list of genres, excluding the ones you do not need. This lets you put the library onto the card partially, with only the genres that interest you. For example, you like fiction, while poetry, romance novels and business literature do not interest you at all.
Nothing stops you from using this mechanism to split the library by sets of genres across several memory cards, either.
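
The export rule boils down to a few lines. In this sketch (current PHP, 7+) the array imitates the layout of "genres.inc" as code => (title, group, export flag); the titles, the flags and the helper book_is_exported() are made up for the illustration and are not taken from the script.

```php
<?php
// Sketch only (PHP 7+). The array imitates the genres.inc layout:
// code => [title, group, export flag]. Titles, flags and helper name are made up.
$genres = [
    'sf'      => ['Science fiction', 'Fiction',  true],
    'poetry'  => ['Poetry',          'Poetry',   false],
    'banking' => ['Banking',         'Business', false],
];

function book_is_exported(array $bookGenres, array $genres): bool
{
    foreach ($bookGenres as $code) {
        // The third element of each entry is the export flag.
        if (isset($genres[$code]) && $genres[$code][2] === true) {
            return true;
        }
    }
    return false;
}

var_dump(book_is_exported(['poetry', 'sf'], $genres));  // one enabled genre is enough
var_dump(book_is_exported(['poetry'], $genres));        // no enabled genre: skipped
```

A book is kept as soon as one of its genres is enabled, which matches the rule that only books with no genre marked true are ignored.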

The list of genres has one more mechanism: reassignment.
Further down in the same file (and in the same array) you can see something like this:

    ''=>array('sci_psychology','r', true),
    'science_history_philosophy'=>array('sci_philosophy','r', true),
    ''=>array('sci_linguistic','r', true),
    'adv_history_avant'=>array('adv_history','r', true),
    ''=>array('adventure','r', true),
    ' '=>array('prose_history','r', true),

See the lone letter 'r' in the second-to-last element of each entry?
For the script it means, for example, that if a book contains the genre code 'adv_history_avant', it should look up the entry for the genre code 'adv_history' (in the same array). As a result, the book is filed under the correct genre in the cataloger, although the fb2 file itself is not edited.
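
Reassignment can be sketched as a lookup that follows 'r' entries until it reaches a real genre (current PHP, 7.1+). The helper resolve_genre(), the made-up titles and the loop guard are my additions for the illustration; whether the real script follows chains of redirects is not stated here.

```php
<?php
// Sketch only (PHP 7.1+). Entries follow the layout shown in the article; an 'r'
// in the second position marks a redirect whose first element is the target code.
$genres = [
    'adv_history'       => ['Historical adventure', 'Adventure', true],  // made-up titles
    'adv_history_avant' => ['adv_history', 'r', true],
];

function resolve_genre(string $code, array $genres): ?string
{
    $seen = [];
    while (isset($genres[$code]) && $genres[$code][1] === 'r') {
        if (isset($seen[$code])) {
            return null;            // redirect loop in the table
        }
        $seen[$code] = true;
        $code = $genres[$code][0];  // follow the redirect
    }
    return isset($genres[$code]) ? $code : null;
}

echo resolve_genre('adv_history_avant', $genres), "\n"; // adv_history
```

Unknown codes and redirect loops both come back as null, so the caller can decide what to do with such books.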

Not implemented

There is no cataloging by the actual titles of the books. Frankly, I have no idea how it would look or how to break the directories down conveniently for quick navigation. By and large, if you know nothing but the title, you can quickly look up the author on the internet from your phone and then find the book on the device. The time spent searching, compared with the time spent on the subsequent reading, makes it quite possible, IMHO, not to bother with this.

Progress display is not done very well. The source directory may contain fb2 files, zip archives with them, and subdirectories with the same mix.
For zip files progress is calculated haphazardly, and each subdirectory shows its own progress. Properly, the script should scan everything in a separate pass at the start and then show the overall progress.

Only the xmlp.inc parser module has been thoroughly reworked. The rest of the source has a mess of variable names and some superfluous variables.

And various non-critical details remain undone.

Example of structure error correction

Before processing:

  <p><:>   <tag></tag>    <  >,  </b>>  &lt; .<:></p> <p>—   ! —  <a>  .</p> <p>    </a>     . .</p>

After (with the XMLP_UT_CONVERT setting):

  <p>&lt;:&gt;   &lt;tag&gt;&lt;/tag&gt;    &lt;  &gt;,  &lt;/b&gt;&gt;  &lt; .&lt;:&gt;</p> <p>—   ! —  <a>  .</a></p> <p>         . .</p>

After (with the XMLP_UT_CUT setting):

  <p>       &lt;  &gt;,  &gt;  &lt; .</p> <p>—   ! —  <a>  .</a></p> <p>         . .</p>

Where to get all this?

Here.

For anyone who does not want the headache of setting up PHP for Windows, I have also put my package here. By and large it is better to download the latest version from php.net, but if you want to try things quickly and painlessly and, if need be, remove everything right away, unpack this archive into the c:\php folder (in general it does not matter which folder, though preferably the path should be short and without spaces) and add this directory to the PATH system environment variable. Everything will be ready to work at once.

P.S. Yes, I know there are desktop programs for keeping a book collection. But...
First, all this was done when no such programs existed.
Second, they do not let me quickly do for myself what I can do on my own (tweak something in the code).
Third, these programs are unlikely to be able to fix critical errors in books. I deliberately did not use the faster XML parsing offered by the C libraries built into PHP, because they fail on errors. You can stop the script from crashing on errors, but you cannot properly read the file and fix it automatically.
This script, although written in plain PHP, is nevertheless optimized for processing speed.
And much more.
By the way, nothing stops you from combining the two. In the desktop program you select the books you need by your criteria (a GUI makes it much more convenient to set all the conditions and filter everything), then feed them to this script, and it creates the cataloger on the memory card.

UPD: I ran into a snag with the new PocketBooks and new firmware. The old bookshelf functionality has been removed, and with it support for .flk files. You can see the discussion of the problem in the comments, in my exchange with the user Klu4nik. The .flk functionality works on the PocketBook 912 with firmware E912.2.1.2 20110727_154845 as shipped from the store. On newer firmware it no longer works.
Contacting PocketBook support yielded nothing useful, only hot air.
After that, with Klu4nik's help, I reached the user nicknamed Antuan on the official forum, and with his help found a solution. Well, not so much a solution as a temporary crutch, but at least something.
It goes as follows:
PocketBook firmware supports built-in and third-party applications; you may have seen them by going from the main menu to the folder of the same name.
Among other things, these applications can be shell scripts; you just give them the .app extension.
Tests showed that a book can be opened with a simple shell command like "/ebrmain/bin/fbreader.app /mnt/ext2/_zipstorage_/00000000/00000000/00000001.zip".
Moreover, if on the device, while in the applications folder, you simply go up one directory (the ".." directory, not the shelf's main menu), you will see the device's internal memory and the memory card, and you can go anywhere and freely run .app files. If you enter the library from the main menu, you cannot see .app files. Through "Applications" you can.
I have updated the sources: the "process.php" file now has one more setting, $app_instead_of_flk. If you enable it (true), then instead of .flk files the cataloger will contain .app files named like "War and Peace.app", with roughly the following content:

    #!/bin/sh
    /ebrmain/bin/fbreader.app /mnt/ext2/_zipstorage_/00000000/00000000/00000001.zip
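
Generating such a launcher from PHP could look roughly like this (current PHP, 7+). The helper app_launcher() is hypothetical, and the single-quoting of the path is an addition of the sketch rather than something the original script is known to do. The "\n" line endings are written explicitly because the file is produced on Windows but executed by sh on the device.

```php
<?php
// Sketch only (PHP 7+). app_launcher() is a hypothetical helper; the reader
// binary and the storage path are the ones quoted in the article.
function app_launcher(string $bookPath, string $reader = '/ebrmain/bin/fbreader.app'): string
{
    // Single-quote the path for sh; this quoting is an addition of the sketch.
    $quoted = "'" . str_replace("'", "'\\''", $bookPath) . "'";
    return "#!/bin/sh\n" . $reader . ' ' . $quoted . "\n";
}

echo app_launcher('/mnt/ext2/_zipstorage_/00000000/00000000/00000001.zip');
```

The returned string would then be saved under the book's display name with the .app extension.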

It works like a charm: the book opens. I checked it on my PocketBook Pro 912, and everything seems to work fine.
But there are a couple of drawbacks.
When you enter through the "Applications" section, a plain file manager runs instead of the bookshelf. It shows no covers and no pretty bookshelf. Adding to "Favorites" is not possible either.

In short, download the script from the link above.

I was also advised to post a request on idea.pocketbook-int.com to bring back the old functionality.
There is already a topic for it: idea.pocketbook-int.com/pocketbook-pro/idea/161. It just needs more votes; then there is a chance they will add the old shelf as an option in the new firmware. You have to register there, though, to vote for this feature.

UPD 2: Solution number two. Also temporary (well, depending on how you look at it).
On the current firmware versions 2.1.2 and 2.1.3 you can install the old shelf from the PocketBook 611. You get the main menu interface of the old firmware, and the .flk link files, "Favorites", multitasking and the new firmware settings all work.
This is how it looks on a 9.7-inch screen, in vertical and horizontal orientation: [screenshots]
So far only two small bugs have been noticed:

1. The main menu does not fill the screen, but it works as it should and is not annoying. That is understandable: the menu template was made for a 6-inch screen. The other screens scale properly by themselves. The "detailed" view of the shelf is also not drawn to the full width of the screen. This can be seen in the screenshots.
2. The "Home" button (return to the main menu) does not work. Since switching to the main menu from the context menu does not work either, this is apparently down to the multitasking environment: it cannot switch to the main menu task that we replaced. It is a bit annoying to take the long way back to the main menu from deep in the directory tree.

Installation is simple. Download it here. Inside the zip there is a single file, "bookshelf.app". Connect the USB cable and write this file to the internal memory of the device (not to the memory card), into the "/system/bin/" directory.
That is all. After you disconnect the cable it works immediately, although it is better to reboot just in case, so that the stock shelf is unloaded from memory.
Removal is just as simple: connect the cable, delete the file "/system/bin/bookshelf.app", and everything is back to how it was.
Tested on a PocketBook Pro 912. Works fine.

UPD 3: There is an additional discussion here. Wishes and suggestions are welcome.

UPD 4 [05/15/2012 20:52 MSK]: Fixed a nasty bug in the genres_get() function in "genres.inc" that caused runaway memory consumption in a loop. It was a silly typo. Copy-paste is to blame. The sources have been updated; download them.

UPD 5 [05/15/2012 22:03 MSK]: As a result of the discussion linked in UPD 3, the following settings were added or changed:

    $dest_mount_point='/mnt/ext2/';
    $storagename='.zipstorage';
    $path_encoding=PATH_ENC_AUTO;

The $path_encoding variable determines the encoding in which the cataloger paths are generated.
Besides the default automatic value, it can be forced to PATH_ENC_1251 or PATH_ENC_UTF8.
PATH_ENC_AUTO decides by preg_match('/win/i', PHP_OS):
if the PHP_OS constant contains "win", the encoding is PATH_ENC_1251, otherwise PATH_ENC_UTF8.
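
The automatic choice can be reproduced in a few lines (current PHP, 7+). The strings below merely stand in for the PATH_ENC_* constants, whose real values are not shown here, and detect_path_encoding() is a hypothetical name. One thing worth knowing: on macOS PHP_OS is "Darwin", which also contains "win", so the loose pattern picks the 1251 encoding there; the anchored pattern '/^win/i' is a possible variation, not something the script uses.

```php
<?php
// Sketch only (PHP 7+). The strings stand in for the script's PATH_ENC_*
// constants, whose actual values are not shown in the article.
function detect_path_encoding(string $os = PHP_OS, string $pattern = '/win/i'): string
{
    return preg_match($pattern, $os) === 1 ? 'PATH_ENC_1251' : 'PATH_ENC_UTF8';
}

foreach (['WINNT', 'Linux', 'Darwin'] as $os) {
    printf("%-7s loose: %s  anchored: %s\n", $os,
        detect_path_encoding($os), detect_path_encoding($os, '/^win/i'));
}
```

If the automatic choice is wrong for your system, forcing PATH_ENC_1251 or PATH_ENC_UTF8 in the settings avoids the question altogether.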

UPD 6 [05/20/2012 3:10 MSK]: Moved to GitHub.

Source: https://habr.com/ru/post/143492/
