
Top science fiction novels, or how I made my own “IMDB for books”, with preferans and librarians

I spent a long time choosing between the “Algorithms”, “Reading Room”, and “Self-promotion” hubs, and eventually settled on Data Mining.

This story began in late October, when I was once again trying to pick something to read. Personally, I take something from science fiction with me on vacation or on the road (as, I suspect, most of you do), and I can't stand trendy new releases.

And so, tormented by the agony of choice, I typed “IMDB for books” into a search engine and... found nothing decent. The Internet is full of book recommendation services, and they all return complete nonsense. Here, for example, is the top of one “Best science fiction and fantasy” section:
1. The Master and Margarita. Mikhail Bulgakov, 1940
2. Flowers for Algernon (short story). Daniel Keyes, 1959
3. Flowers for Algernon (novel). Daniel Keyes, 1966
4. A Clash of Kings. George Martin, 1998
5. Knight of the Order: Blades at the Throne. Sergey Sadov, 2000
6. The Dovecote in Orekhov. Vladislav Krapivin, 1983


Uh... This was not at all what I expected to see at the top of a science fiction rating. “We will go another way,” I thought. Abandoning the idea of finding a decent readers' rating, I simply went to Wikipedia, found the lists of Hugo and Nebula award winners, and picked a couple of books, as I had always done before.

“What if I put together my own book rating, using the prestigious awards as a basis?” it suddenly occurred to me. And so I did. Meet top-books.info.



So, I needed to do the following:

1. Find and parse the lists of award nominees and winners;
2. Build lists of books and authors from them;
3. Assign a rating to each book;
4. Attach a cover image and description to each book;
5. Attach a short biographical note to each author;
6. Add search over all of this;
7. Bolt on voting.

And now, the details...

Award lists



I decided to limit myself to three awards: Hugo, Nebula, and Locus. All the others are either too specialized or too recent.

I took the lists of Hugo and Nebula winners and nominees from Wikipedia:
en.wikipedia.org/wiki/Hugo_Award_for_Best_Novel
en.wikipedia.org/wiki/Nebula_Award_for_Best_Novel

With Locus it turned out to be harder: the nominee lists had to be collected year by year:
www.locusmag.com/SFAwards/Db/Locus.html

Moreover, these lists contain a huge number of nominees, twenty or so per year, most of which meant absolutely nothing to me. So I limited myself to the top five nominees from the “Best Novel” category (awarded 1971–1981) and the “SF Novel” and “Fantasy Novel” categories (1982–2011).

Books and Authors



I parsed the whole thing with scripts written in the best language in the world, JavaScript :). Hugo and Nebula were easy (Wikipedia at least sticks to a consistent layout), but Locus took some suffering. Here is what parsing the Locus lists looked like:

parseBook = function (s) {
    // Pull out an alternate title, e.g. '(alternate title "This Immortal")'
    var alternateTitle,
        alternates = /\([^\)]+title ([^\)]+)\)/.exec(s);
    if (alternates) {
        alternateTitle = trim(alternates[1]).replace(/^"/, '').replace(/"$/, '');
        s = s.replace(alternates[0], '');
    }
    // Drop any other trailing parenthetical
    s = s.replace(/ \(.+\)$/, '');
    var parts = s.split(', '),
        delimiter = parts.length - 1;
    // A trailing "Jr." belongs to the author, not the title
    if (delimiter > 1 && parts[delimiter].indexOf('Jr') == 0) {
        delimiter--;
    }
    var title = trim(parts.slice(0, delimiter).join(', ')).replace(/^"/, '').replace(/"$/, ''),
        author = trim(parts.slice(delimiter).join(', '));
    // Co-authored books: "A & B" becomes an array of authors
    if (author.indexOf(' & ') != -1) {
        author = author.split(' & ');
    }
    return {
        title: alternateTitle ? [title, alternateTitle] : title,
        author: author
    };
};


In the end, I got a list of authors that looks roughly like this:

  "ae-van-vogt": {
    "fullName": "Vogt, AE van",
    "alias": "ae-van-vogt",
    "firstName": "A.",
    "middleName": "E.",
    "lastName": "Vogt",
    "preposition": "van"
  },
  "kurt-vonnegut": {
    "fullName": "Vonnegut, Kurt",
    "alias": "kurt-vonnegut",
    "firstName": "Kurt",
    "middleName": "",
    "lastName": "Vonnegut",
    "preposition": ""
  },


And here is a list of books:

  "the-boy-who-bought-old-earth": { "see": "the-planet-buyer" },
  "dune": {
    "alias": "dune",
    "title": "Dune",
    "awards": {
      "1965": [ { "award": "nebula", "won": true } ],
      "1966": [ { "award": "hugo", "won": true } ]
    },
    "authorAlias": "frank-herbert"
  },
  "and-call-me-conrad": {
    "alias": "and-call-me-conrad",
    "title": [ "...And Call Me Conrad", "This Immortal" ],
    "awards": {
      "1966": [ { "award": "hugo", "won": true } ]
    },
    "authorAlias": "roger-zelazny"
  },
  "this-immortal": { "see": "and-call-me-conrad" },


Locus nominees additionally carry a place field, the position the book took. Hugo and Nebula do not rank their nominees.
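Since alternate titles are stored as “see” redirects (this-immortal above), any lookup has to follow them. A minimal helper of my own to illustrate (not the site's actual code):

```javascript
// Follow "see" redirects until we land on a real book record.
function resolveBook(books, alias) {
    var book = books[alias];
    while (book && book.see) {
        book = books[book.see]; // e.g. "this-immortal" -> "and-call-me-conrad"
    }
    return book;
}
```

With this, looking up either alias of a double-titled book returns the same record.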

Ratings



I tried several options, and eventually settled on the following formula:

rating = 6 + 3 * sum(s[i]) / possibleAwards + yearTotal / 100

Here possibleAwards is the number of awards the book could theoretically have received (i.e., the number of awards given out in the year the book was published), yearTotal is the total number of award nominees that year, and s[i] is the book's score for each award.

s[i] was computed as follows: 1 if the book won the award; 1 / (number of nominees) if the book was nominated for a Hugo or Nebula but did not win; and (number of nominees − place + 1) / (number of nominees) for Locus nominees.

In total, each book gets 6 points just for making an award shortlist; 0 to 3 points depending on the awards actually received (6 to 9 in total); plus a small correction equal to the total number of that year's nominees divided by 100. The correction (a) slightly pessimizes books that won awards in the early years, before nominee lists existed, and (b) reflects the idea that a year with many nominees was, on the whole, stronger than earlier ones.

For example, take The Curse of Chalion:

  "the-curse-of-chalion": {
    "alias": "the-curse-of-chalion",
    "title": "The Curse of Chalion",
    "awards": {
      "2002": [
        { "award": "hugo", "won": false },
        { "award": "locus", "won": false, "category": "fantasy novel", "place": 3 }
      ]
    },
    "authorAlias": "lois-mcmaster-bujold"
  }


The book earns 0.1(6) points for its Hugo nomination (1 of 6 nominees), plus 0.6 points for Locus (3rd place out of 5), plus 0.11 for the total number of nominees (6 + 5) across the awards the book participated in. Total: 6.9.
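Put together, the whole calculation can be sketched like this (my reconstruction of the scheme described above, with the Chalion numbers; not the site's actual code):

```javascript
// Compute a book's rating from its award entries (a reconstruction of the
// formula described above, not the site's actual code).
function bookRating(entries, possibleAwards, yearTotal) {
    var sum = entries.reduce(function (acc, e) {
        var s;
        if (e.won) {
            s = 1;                                       // outright win
        } else if (e.award === 'locus') {
            s = (e.nominees - e.place + 1) / e.nominees; // Locus ranks nominees
        } else {
            s = 1 / e.nominees;                          // unranked Hugo/Nebula nominee
        }
        return acc + s;
    }, 0);
    return 6 + 3 * sum / possibleAwards + yearTotal / 100;
}

// The Curse of Chalion, 2002: Hugo nominee (1 of 6),
// Locus fantasy novel, 3rd place of 5; three awards given that year.
var chalion = bookRating(
    [
        { award: 'hugo', won: false, nominees: 6 },
        { award: 'locus', won: false, nominees: 5, place: 3 }
    ],
    3,  // possibleAwards: Hugo, Nebula, Locus
    11  // yearTotal: 6 + 5 nominees across the awards involved
);
// chalion ≈ 6.88, which rounds to the 6.9 above
```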

As a result, the top 10 acquired the following form:

9.2 American Gods / Gaiman, Neil
9.2 Paladin of Souls / Bujold, Lois McMaster
9.1 The Forever War / Haldeman, Joe
9.1 The Gods Themselves / Asimov, Isaac
9.1 Dune / Herbert, Frank
9.1 Ringworld / Niven, Larry
9.1 Startide Rising / Brin, David
9.1 Speaker for the Dead / Card, Orson Scott
9.1 Doomsday Book / Willis, Connie
9.1 The Yiddish Policemen's Union / Chabon, Michael


Of this top ten I have personally read only Paladin of Souls, Dune, and The Gods Themselves, but I found their presence in the top 10 entirely appropriate.

Author ratings



The author ratings took some suffering. I wanted an author with many good books to rank above an author with a single, but excellent, one. I went through a lot of formulas and settled on this:

rating = (sum + 3) / (n + 1)

Here sum is the sum of the author's book ratings and n is the number of books. It is easy to see that this formula is equivalent to crediting each author with one fictitious book rated 3, which pessimizes authors with few books. The resulting top 10:
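The effect of the phantom book is easy to see in a sketch (my own illustration, with made-up numbers):

```javascript
// Author rating: the +3 and +1 terms act as one fictitious book rated 3,
// pulling down authors with few entries (illustration with invented numbers).
function authorRating(bookRatings) {
    var sum = bookRatings.reduce(function (a, b) { return a + b; }, 0);
    return (sum + 3) / (bookRatings.length + 1);
}

authorRating([9.2]);            // one brilliant book:  (9.2 + 3) / 2  = 6.1
authorRating([8.5, 8.4, 8.3]);  // three good books:   (25.2 + 3) / 4 = 7.05
```

So three solid books comfortably beat a single masterpiece, which is exactly the behaviour I wanted.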

1 Heinlein, Robert A.
2 Le Guin, Ursula K.
3 Asimov, Isaac
4 Card, Orson Scott
5 Bujold, Lois McMaster
6 Willis, Connie
7 Brin, David
8 Haldeman, Joe
9 Clarke, Arthur C.
10 Pohl, Frederik

This top ten satisfied me completely :)

Mining book data



I pulled book information from the Amazon Product Advertising API: as part of its affiliate program, Amazon lets you use information about the items it sells. I was interested in cover images and descriptions. The overall scheme was:

1. Pick a book.
2. Query by the book's title, filtered by one of its authors.
3. Look for an item in the response with a matching title and author.
4. Record the unique identifier (ASIN) and the reviews.
5. If nothing was found, try another title (if the book has several) or another index.

I searched the Kindle Store index first (I'm all for progress and all that :)), then paper books. As a result, 378 of the 580 books were found in the Kindle Store.
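The lookup loop can be sketched roughly as follows. `searchAmazon` here is a hypothetical stand-in for the real Product Advertising API request; only the fallback order over titles and indices comes from the steps above:

```javascript
// Try each search index and each known title until an exact match turns up.
// `searchAmazon(title, author, index)` is a hypothetical stand-in for the
// real PAAPI call; it should return an array of { title, author, asin, reviews }.
function findBookItem(book, author, searchAmazon) {
    var titles = [].concat(book.title);      // a book may have several titles
    var indices = ['KindleStore', 'Books'];  // e-books first, then paper
    for (var i = 0; i < indices.length; i++) {
        for (var t = 0; t < titles.length; t++) {
            var items = searchAmazon(titles[t], author, indices[i]);
            for (var j = 0; j < items.length; j++) {
                // Accept only an item whose title and author both match
                if (items[j].title === titles[t] && items[j].author === author) {
                    return { asin: items[j].asin, reviews: items[j].reviews };
                }
            }
        }
    }
    return null; // fell through every title and every index
}
```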

Amazon PAAPI search is quite adequate, although some irrelevant results do sneak into the first positions. The one real problem is that the API completely ignores diacritics and cannot find authors like Miéville or titles like Tales of Nevèrÿon; those I had to look up by hand.
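One way to work around this (my own addition; I don't know whether the site actually does it) is to fold diacritics away before querying:

```javascript
// Strip diacritics: decompose to NFD, then drop the combining marks.
function foldDiacritics(s) {
    return s.normalize('NFD').replace(/[\u0300-\u036f]/g, '');
}

foldDiacritics('Miéville');           // "Mieville"
foldDiacritics('Tales of Nevèrÿon');  // "Tales of Neveryon"
```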

Mining author data



Authors had to be extracted from Wikipedia via the Wikimedia API, which is, frankly, quite a contraption. About 90% of the author queries worked fine with just a first and last name, but the 10% with common names had to be finished off by hand. If you append something like “author” or “fantasy writer” to the query, the 10% of non-unique names start working, but the remaining 90% break completely.

In the end, I pulled the preamble of the Wikipedia article for each author. Dear Wikipedia editors: the guideline on preambles exists for a reason. Many articles solemnly state the blindingly obvious (David Brin, for example), while others cram a whole essay into the preamble (Isaac Asimov).
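For reference, one way to request just the preamble through the MediaWiki API is the TextExtracts route; whether the original code used exactly this is my assumption. A minimal URL builder:

```javascript
// Build a MediaWiki API URL that returns only an article's intro as plain text.
function authorIntroUrl(name) {
    return 'https://en.wikipedia.org/w/api.php' +
        '?action=query&prop=extracts&exintro=1&explaintext=1&format=json' +
        '&titles=' + encodeURIComponent(name);
}

authorIntroUrl('Isaac Asimov');
// ...&titles=Isaac%20Asimov
```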

Search



Here there was no real choice: Google Custom Search Engine. I had to tinker a bit with CSS to place it where I wanted, but it seems to work.

By the way, Google CSE has the inverse of Amazon's problem: it refuses to find anything for “Mieville”; you have to type “Miéville”.

Voting



I didn't want to run authentication and comments myself at all, so I decided to use Facebook.

Gentlemen developers of the 2GIS and Leaflet APIs, forgive me! Your APIs are a fairy tale compared to FB's. I have never seen such a poorly organized and disgustingly documented API. It took me almost a week of agony to bolt this thing on.

Gentlemen Facebook developers, get your documentation in order! It is impossible to work with.

Russian version



My initial plans also included a Russian-language version, but, as it turned out, I couldn't source any Russian-language content. Ozon has no API, and the Russian Wikipedia doesn't know half the authors. So on this front, a complete fail.

So, what's next?



Nothing special. Science fiction fans, enjoy. The rating, by my feeling, is more than adequate. (As an experiment, I read No. 1 on the list, American Gods by Neil Gaiman. A very cool book, I tell you.) If some ratings seem wrong to you, welcome, vote. Just keep in mind that the initial ratings carry the weight of 1000 votes, so they are not easy to shift, ahem. Personally, the first thing I did was give 10s to some books that are, in my opinion, heavily undervalued: The Curse of Chalion, A Night in the Lonesome October, and The Other Wind.

I should warn you right away that literary awards hold serious reading in high esteem, so light entertainment fiction is poorly represented in the rating. The same, unfortunately, goes for the pioneers of the genre: literature from the 1960s onward is well represented, earlier works only sporadically. (By the way, by an act of pure voluntarism I added The Lord of the Rings with a rating of 9.0 and The Hobbit with 8.0, since Tolkien looked odd with only The Silmarillion.)

There are no recent releases in the rating (nor any Russian fiction), and there won't be until there is a way to assign them a more or less reliable initial rating. If anyone is interested (and I'm not too lazy), I could additionally build a rating of classic novels on the same principle.

In general, enjoy!

UPD: Habr effect, Habr effect... 3% CPU, 8% memory.

Source: https://habr.com/ru/post/137632/

