
Converting the lib.ru library to EPUB format using Java tools

Good day, everyone. I recently got an e-reader, a Kobo Touch, and the question arose of where to get books for it. The well-known Flibusta is certainly a good option, and I take many books from there, but I was still drawn to lib.ru, so out of curiosity I decided to write a converter. If you hate bad code, think twice before reading on, because the code here is truly brutal.

After analyzing the library catalog, it immediately became clear that most of the books have the same scheme, namely:
[ Author ] . [ Title ]
[ Technical data ]
[ # Chapter ]
[ Chapter Text ]
[ # Chapter ]
[ Chapter Text ]
And so on. I came across other layouts as well, but chose not to complicate things. Note that [ Author ], [ Title ] and [ # Chapter ] are each enclosed between specific marker characters, "" and "" (these are two different characters, in ASCII).

All that remains is a small matter: writing a simple parser for the page. I used Java. The first question was which encoding to read the data in, since by my observation the encoding varies from page to page. For this I turned to the third-party library universalchardet. I detect the encoding and store its name in a string.
    URLConnection con = url.openConnection();
    con.connect();
    InputStream urlfs = con.getInputStream();
    byte[] buf = new byte[4096];
    UniversalDetector detector = new UniversalDetector(null);
    int nread;
    // Feed the detector until the stream ends or the detector has made up its mind.
    while ((nread = urlfs.read(buf)) > 0 && !detector.isDone()) {
        detector.handleData(buf, 0, nread);
    }
    detector.dataEnd();
    String encoding = detector.getDetectedCharset();
    detector.reset();
    urlfs.close();

Next, I read the page using a BufferedReader.
    BufferedReader in = new BufferedReader(new InputStreamReader(url.openStream(), encoding));
    String str;
    // readLine() drops line terminators, so the lines are joined without separators.
    while ((str = in.readLine()) != null) {
        string = string + str;
    }
    in.close();
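The two snippets above open the URL twice: once for detection and once for reading. A minimal sketch of an alternative, using only the standard library, reads the bytes once and decodes them afterwards. The class name `PageDecoder`, its method names and the fallback charset are assumptions made for illustration; in the converter the charset name would come from `detector.getDetectedCharset()`, and the sketch assumes that value can be missing when detection fails.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class PageDecoder {

    /** Reads the whole stream into memory so the same bytes can be detected and decoded. */
    static byte[] readAll(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        int n;
        while ((n = in.read(buf)) != -1) {
            out.write(buf, 0, n);
        }
        return out.toByteArray();
    }

    /** Decodes with the detected charset, or with the fallback if the name is null or unsupported. */
    static String decode(byte[] data, String detectedName, Charset fallback) {
        Charset charset = fallback;
        if (detectedName != null) {
            try {
                charset = Charset.forName(detectedName);
            } catch (IllegalArgumentException e) {
                // Illegal or unsupported charset name: keep the fallback.
            }
        }
        return new String(data, charset);
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for con.getInputStream(); the bytes are UTF-16BE on purpose.
        byte[] raw = "Author. Title".getBytes(StandardCharsets.UTF_16BE);
        byte[] page = readAll(new ByteArrayInputStream(raw));
        // "UTF-16BE" plays the role of the detector's answer.
        System.out.println(decode(page, "UTF-16BE", StandardCharsets.UTF_8));
    }
}
```

With this shape the detector can be fed from the byte array, and no second request to the site is needed.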

For convenient parsing, I replace one marker character with another and append one more marker to the end of the text.
    string = string.replace(" ", " ");
    string = string + " ";

Finally, I need to find out how many chapters the book has in total (as mentioned, chapter headings sit between the marker characters). I also subtract one, because I added one marker myself.
    int count = 0;
    for (char c : string.toCharArray()) {
        if (c == ' ') count++;
    }
    loop = (count - 1);
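As a small illustration of the counting step, here is a sketch that counts a marker character and compares the result with the number of parts produced by splitting on the same marker. The `#` marker, the class name `MarkerCounter` and the sample text are hypothetical stand-ins, not the real lib.ru characters.

```java
import java.util.regex.Pattern;

class MarkerCounter {

    /** Counts how many times the marker character occurs in the text. */
    static int countMarkers(String text, char marker) {
        int count = 0;
        for (int i = 0; i < text.length(); i++) {
            if (text.charAt(i) == marker) {
                count++;
            }
        }
        return count;
    }

    /** Splits on the marker, keeping trailing empty parts (limit -1). */
    static String[] splitOnMarker(String text, char marker) {
        return text.split(Pattern.quote(String.valueOf(marker)), -1);
    }

    public static void main(String[] args) {
        char marker = '#'; // hypothetical stand-in for the real marker character
        String text = "meta#Chapter 1#text one#";
        System.out.println(countMarkers(text, marker));
        System.out.println(splitOnMarker(text, marker).length);
    }
}
```

When the split keeps trailing empty parts, the number of parts is always the number of markers plus one, so the length of the split array could serve as the loop bound instead of a separate counter.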

We are almost done. It remains to split the content into chapters, the description, and the author with the title.
    /* Author and title are separated by ". ". */
    String[] authorandtitle = parsedstring[2].split("\\.");
    AUTHOR = authorandtitle[0];
    TITLE = authorandtitle[1];
    /* Even indices hold chapter headings, odd indices hold chapter text. */
    for (int i = 4; i <= loop; i++) {
        if ((i % 2) == 0) {
            CHAPTER[i] = parsedstring[i];
            HEADER[i] = parsedstring[i];
        } else {
            PARAG[i] = parsedstring[i];
        }
    }
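The loop above spreads one chapter over three parallel arrays (CHAPTER, HEADER, PARAG) that share the index i. A minimal sketch of the same idea with one small object per chapter is shown below. The `BookSplitter` and `Chapter` names, the `#` marker and the simplified layout (metadata in the first part, chapters from index 1) are assumptions for illustration; the converter itself starts at index 4.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

class BookSplitter {

    /** One chapter: its heading and the text that follows it. */
    static final class Chapter {
        final String heading;
        final String text;

        Chapter(String heading, String text) {
            this.heading = heading;
            this.text = text;
        }
    }

    /**
     * Pairs parts[start] with parts[start + 1], parts[start + 2] with parts[start + 3], and so on.
     * A trailing heading that has no text after it is ignored.
     */
    static List<Chapter> toChapters(String[] parts, int start) {
        List<Chapter> chapters = new ArrayList<>();
        for (int i = start; i + 1 < parts.length; i += 2) {
            chapters.add(new Chapter(parts[i].trim(), parts[i + 1].trim()));
        }
        return chapters;
    }

    public static void main(String[] args) {
        // Hypothetical input: '#' stands in for the real marker character.
        String page = "Ivan Ivanov. A Book#Chapter 1#First text#Chapter 2#Second text";
        String[] parts = page.split(Pattern.quote("#"));
        for (Chapter c : toChapters(parts, 1)) {
            System.out.println(c.heading + " -> " + c.text);
        }
    }
}
```

Keeping heading and text together removes the need for the even/odd index check later, when the chapters are written into the EPUB.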

Of course, all of this could have been extracted with regular expressions, but I wrote down the first thing that came to mind.
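For comparison, a sketch of the regular-expression route for the "Author. Title" line. The pattern, the class name `TitleParser` and the sample strings are assumptions for illustration, not the converter's code. The pattern splits at the first period that is followed by whitespace, so a title containing periods stays intact, but an author written with initials would still be split too early.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TitleParser {

    // Group 1: author (up to the first period followed by whitespace). Group 2: the rest, as the title.
    private static final Pattern AUTHOR_TITLE =
            Pattern.compile("\\s*(.+?)\\.\\s+(.+?)\\s*");

    /** Returns {author, title}, or null if the line does not look like "Author. Title". */
    static String[] parse(String line) {
        Matcher m = AUTHOR_TITLE.matcher(line);
        if (!m.matches()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2) };
    }

    public static void main(String[] args) {
        String[] parsed = parse("  Ivan Ivanov. Some Title. Part 2 ");
        System.out.println(parsed[0]);
        System.out.println(parsed[1]);
    }
}
```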
Now everything is in place to create the document. I used the EPUBGen library to build the final file. Fortunately, its examples are very informative, so it took just a couple of minutes. First I create the document and fill in the metadata.
    Publication epub = new Publication();
    epub.addDCMetadata("title", TITLE);
    epub.addDCMetadata("creator", AUTHOR);
    epub.addDCMetadata("language", "ru-RU");

Next, the cover image has to be stored under the OPS/images directory and referenced from the cover.xhtml document.
    DataSource dataSource = new FileDataSource(new File(cover));
    BitmapImageResource imageResource = epub.createBitmapImageResource(
            "OPS/images/cover.jpg", "image/jpeg", dataSource);
    DataSource coverdata = new StringDataSource(
            "<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.1//EN\" \"http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd\">\n"
            + "<html xmlns=\"http://www.w3.org/1999/xhtml\">\n"
            + "<head>\n"
            + "<title>Cover</title>\n"
            + "<style type=\"text/css\"> img { max-width: 100%; } </style>\n"
            + "</head>\n"
            + "<body>\n"
            + "<div id=\"cover-image\">\n"
            + "<img src=\"images/cover.jpg\" alt=\"Title\"/>\n"
            + "</div>\n"
            + "</body>\n"
            + "</html>");
    Resource coverres = epub.createResource("OPS/cover.xhtml", "xhtml", coverdata);
    epub.addToSpine(coverres);
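The cover markup above is a fixed string with the placeholder alt text "Title". A small sketch of building the same page from parameters, so that the book's real title can be used as the alt text, might look like this. The `CoverPage` class and its methods are hypothetical helpers written with the standard library only; the resulting string would be passed to `StringDataSource` as before.

```java
class CoverPage {

    /** Escapes the characters that are special in XML text and attribute values. */
    static String escapeXml(String s) {
        return s.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("\"", "&quot;");
    }

    /** Builds the cover XHTML; imageHref is relative to the cover document itself. */
    static String build(String imageHref, String title) {
        return "<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.1//EN\" "
                + "\"http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd\">\n"
                + "<html xmlns=\"http://www.w3.org/1999/xhtml\">\n"
                + "<head>\n"
                + "<title>Cover</title>\n"
                + "<style type=\"text/css\"> img { max-width: 100%; } </style>\n"
                + "</head>\n"
                + "<body>\n"
                + "<div id=\"cover-image\">\n"
                + "<img src=\"" + escapeXml(imageHref) + "\" alt=\"" + escapeXml(title) + "\"/>\n"
                + "</div>\n"
                + "</body>\n"
                + "</html>";
    }

    public static void main(String[] args) {
        System.out.println(build("images/cover.jpg", "War & Peace"));
    }
}
```

Escaping matters here because a title containing `&` or a quotation mark would otherwise produce malformed XHTML.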

The final step is to add a table of contents and then add the content itself, chapter by chapter.
    NCXResource toc = epub.getTOC();
    TOCEntry rootTOCEntry = toc.getRootTOCEntry();
    for (int i = 4; i <= loop; i++) {
        if ((i % 2) == 0) {
            /* Create a new file for the chapter. */
            OPSResource main = epub.createOPSResource("OPS/" + i + ".html");
            epub.addToSpine(main);
            /* Get the chapter document. */
            mainDoc = main.getDocument();
            /* Add the chapter to the table of contents. */
            TOCEntry mainTOCEntry = toc.createTOCEntry(CHAPTER[i], mainDoc.getRootXRef());
            rootTOCEntry.add(mainTOCEntry);
            body = mainDoc.getBody();
            /* Add the heading. */
            Element h1 = mainDoc.createElement("h1");
            h1.add(HEADER[i]);
            body.add(h1);
        } else {
            /* Add the chapter text. */
            Element paragraph = mainDoc.createElement("p");
            paragraph.add(PARAG[i]);
            body.add(paragraph);
        }
    }

    OCFContainerWriter writer = new OCFContainerWriter(new FileOutputStream(output));
    epub.serialize(writer);
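For orientation, it may help to see what kind of file comes out of the serialization step. An EPUB is a ZIP archive whose first entry is an uncompressed file named `mimetype` containing `application/epub+zip`, plus a `META-INF/container.xml` that points to the package file. The sketch below writes just these two entries with `java.util.zip`. It is not how EPUBGen works internally, the result is not a complete book, and the `EpubSkeleton` name and the `OPS/content.opf` path are assumptions for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

class EpubSkeleton {

    // The package path below is an assumed example, not taken from the converter.
    private static final String CONTAINER_XML =
            "<?xml version=\"1.0\"?>\n"
            + "<container version=\"1.0\" xmlns=\"urn:oasis:names:tc:opendocument:xmlns:container\">\n"
            + "  <rootfiles>\n"
            + "    <rootfile full-path=\"OPS/content.opf\" media-type=\"application/oebps-package+xml\"/>\n"
            + "  </rootfiles>\n"
            + "</container>\n";

    /** Writes the two container-level entries every EPUB archive starts with. */
    static void writeSkeleton(OutputStream out) throws IOException {
        try (ZipOutputStream zip = new ZipOutputStream(out)) {
            // The mimetype entry must come first and must be stored without compression.
            byte[] mime = "application/epub+zip".getBytes(StandardCharsets.US_ASCII);
            ZipEntry mimetype = new ZipEntry("mimetype");
            mimetype.setMethod(ZipEntry.STORED);
            mimetype.setSize(mime.length);
            mimetype.setCompressedSize(mime.length);
            CRC32 crc = new CRC32();
            crc.update(mime);
            mimetype.setCrc(crc.getValue());
            zip.putNextEntry(mimetype);
            zip.write(mime);
            zip.closeEntry();

            // Remaining entries may use the default (deflated) method.
            zip.putNextEntry(new ZipEntry("META-INF/container.xml"));
            zip.write(CONTAINER_XML.getBytes(StandardCharsets.UTF_8));
            zip.closeEntry();
        }
    }

    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        writeSkeleton(bytes);
        System.out.println("Wrote " + bytes.size() + " bytes");
    }
}
```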

The end result is as follows:

The interface is Swing: cheap and cheerful.
Since I did not put much effort into understanding the whole library, the converter works only with books in the old format (plain text), such as this one.
Anyone who survived this much terrible code is welcome to visit the Bitbucket repository. The binary can be downloaded from here.


Source: https://habr.com/ru/post/127677/

