LJ in DB (Groovy script)

In continuation of the theme of small scripts on groovy - one more.
Previous: Gmail Big Letters , Addition Exercise (LATEX)

The new script shows the basics of working with XML and database in Groovy. As a task, we will choose to save our cozy ZhZhki from XML to the database.
Why do this? - SQL will tell us everything about our (or someone else's) LJ - topics, comments, tags - how much fantasy enough to collect statistics

First we need to download LiveJournal in XML.
This will make someone else's utility - ljdump
You will have to install Python, open IDLE (Python GUI), load the utility there and run. She will ask everything herself.
')
After its run, you will have a directory with files LXXX - posts and CXXX - comments.

And on these XML and we will run my script.
In this form, it uses pure Java, embedded Hypersonic database (HSQLDB), but you can connect to any, of course. Just make sure the JDBC driver is in your classpath.

Parsing and working with this type of database is only suitable for scripts and small programs. In the enterprise, no one will load all the XML into memory (and will use SAX), and no one will send SQL directly (and there will be a Connection Pool, prepared statement, batch, some Hibernate).

The fact that we parsim looks like this:

<?xml version="1.0"?> <event> <itemid>1</itemid> <eventtime>2006-04-09 16:52:00</eventtime> <url>http://my-ljnick.livejournal.com/460.html</url> <event_timestamp>1144601520</event_timestamp> <reply_count>3</reply_count> <props> <commentalter>1148051408</commentalter> </props> <subject>  </subject> <event>      </event> <anum>204</anum> </event>

Here is the script:

 import groovy.sql.Sql import java.sql.Timestamp def sql = Sql.newInstance("jdbc:hsqldb:file:lj", "sa", "", "org.hsqldb.jdbcDriver") // delete table if previously created try { sql.execute("drop table POST")} catch (Exception e) {} try { sql.execute("drop table COMMENT")} catch (Exception e) {} // create table sql.execute('''create table POST ( id integer not null primary key, url varchar(100), subject varchar(100), text varchar(10000), postdate timestamp )''') sql.execute('''create table COMMENT ( id integer not null primary key, postid integer, user varchar(100), subject varchar(100), text varchar(10000), commentdate timestamp, parentcomment integer )''') sql.execute(''' alter table COMMENT add constraint c_postid foreign key (postid) references POST (id) ''') def dir = "D:\\development\\ljdump-1.5.1\\my_lj\\" List<File> files = new File(dir).listFiles().findAll { it.name.startsWith("L") } files.each { record -> def event = new XmlSlurper().parse(record) int id = Integer.parseInt(event.itemid.text()) Date d = Date.parse("yyy-MM-dd HH:mm:ss", event.eventtime.text()) //No time zone java.sql.Timestamp sqlDate = new java.sql.Timestamp(d.time) String url = event.url.text() String subject = event.subject.text() String text = event.event.text() sql.execute("insert into post (id,url,subject, text, postdate) values ($id, $url, $subject,$text,$sqlDate )") File commentsFile = new File(dir + record.name.replace('L', 'C')); if (commentsFile.exists()) { def comments = new XmlSlurper().parse(commentsFile).comment comments.each() { comment -> String state = comment.state.text() if (!"D".equals(state)) { // not deleted int commentid = Integer.parseInt(comment.id.text()) String user = comment.user.text() String parentIdStr = comment.parentid.text() int parentid = 0; if (!parentIdStr.isEmpty()) Integer.parseInt(parentIdStr) Date commentDate = Date.parse("yyy-MM-dd HH:mm:ss", comment.date.text().replace('T',' ')) Timestamp sqlCommentTime = new Timestamp(commentDate.time) String commentsubject = comment.subject.text(); String body = comment.body.text() sql.execute("insert into comment (id,user,subject,text,commentdate,parentcomment, postid) values ($commentid,$user, $commentsubject, $body, $sqlCommentTime, $parentid, $id)") } } } } sql.close();

Of course, it can be improved:
1. Save the author of the post (It is important if you parse the community)
2. Make it get the file directory and database URL as an argument

You can improve the scheme:
1. Save tags
2. Keep authors in a separate table
3. Add indexes

You can add the Grab directive so that you do not need to add drivers to the classpath (I was trying to, I did not succeed :()

Now let's see how this database can be used.
Make a list of my posts with more than 50 comments:

 @Grab('hsqldb:hsqldb:1.8.0.7') import groovy.sql.Sql Sql sql = Sql.newInstance("jdbc:hsqldb:file:lj", "sa", "", "org.hsqldb.jdbcDriver") sql.eachRow("select subject, url, (select count (*) from comment where postid = mypost.id) as numcom from POST as mypost order by numcom desc") { row -> if (row.numcom>50) println row}

That's all.
I hope someone will be useful.

Source: https://habr.com/ru/post/113941/

All Articles

LJ in DB (Groovy script)

More articles: