
Cooking rutracker with Spring and Kotlin

rutracker, kotlin, spring boot

In anticipation of the first release of the Kotlin language, I would like to share the experience of creating a small project in it. It is a service application that searches for torrents in the rutracker database. All the code, plus a bonus browser client, can be found here. So, let's see what came of it.

Task


The torrent database is distributed as a set of csv files and is periodically updated: a new version of the entire database dump is added to a directory whose name corresponds to the date of the dump. So our small project will watch for new versions (already downloaded ones; a client that downloads the database by itself is perhaps a topic for another time), parse them, load them into a database, and provide a JSON REST API for searching by name.

Tools


For a quick start, let's take Spring Boot. Boot has many features that can seriously complicate life in large projects, but for small applications like ours it is an excellent way to assemble a configuration for a typical technology stack. The main way Boot decides which technologies to create beans for is the presence of key classes for a given technology on the classpath; we add them by declaring dependencies in Maven. In our case Boot will auto-configure the database connection (h2) plus a connection pool (tomcat-jdbc) and a json provider (gson). We do not specify library versions when declaring dependencies: we take the set predefined by Boot by specifying spring-boot-starter-parent as the parent project in the pom. We also add spring-boot-starter-web and spring-boot-starter-tomcat so that Boot configures Web MVC for our future REST API and Tomcat as the container.
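Schematically, the relevant part of the pom might look like this (a sketch: the version and the exact artifact set are my assumptions, the real pom is in the linked repository):

    <parent>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-parent</artifactId>
        <version>1.3.0.RELEASE</version> <!-- a plausible Boot release of the time -->
    </parent>

    <dependencies>
        <!-- web mvc for the future rest api -->
        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter-web</artifactId>
        </dependency>
        <!-- tomcat as the container -->
        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter-tomcat</artifactId>
        </dependency>
        <!-- key classes on the classpath trigger auto-configuration: -->
        <dependency> <!-- embedded database -->
            <groupId>com.h2database</groupId>
            <artifactId>h2</artifactId>
        </dependency>
        <dependency> <!-- connection pool -->
            <groupId>org.apache.tomcat</groupId>
            <artifactId>tomcat-jdbc</artifactId>
        </dependency>
        <dependency> <!-- JdbcTemplate -->
            <groupId>org.springframework</groupId>
            <artifactId>spring-jdbc</artifactId>
        </dependency>
        <dependency> <!-- json provider -->
            <groupId>com.google.code.gson</groupId>
            <artifactId>gson</artifactId>
        </dependency>
    </dependencies>

Now let's look at main.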
    // main.kt
    fun main(args: Array<String>) {
        SpringApplication.run(MainConfiguration::class.java, *args)
    }

And to the main configuration of MainConfiguration, which we pass to SpringApplication as the source for the beans.
    @Configuration
    @Import(JdbcRepositoriesConfiguration::class, ImportConfiguration::class, RestConfiguration::class)
    @EnableAutoConfiguration
    open class MainConfiguration : SpringBootServletInitializer() {
        override fun configure(builder: SpringApplicationBuilder): SpringApplicationBuilder {
            return builder.sources(MainConfiguration::class.java)
        }
    }

It should be noted that Boot allows you to deploy the resulting application as a web module, and not just run it via the main method. For this approach to work, we override the configure method of SpringBootServletInitializer, which the container calls when the application is deployed. Also note that we do not put the @SpringBootApplication annotation on MainConfiguration, but enable autoconfiguration directly with @EnableAutoConfiguration. I did this to avoid scanning for components annotated with @Component: all the beans we need will be created explicitly in Kotlin configurations. Another peculiarity of Kotlin configurations is that we have to mark configuration classes (as well as the methods that create beans) as open, because in Kotlin all classes and methods are final by default, which would prevent Spring from creating proxies for them.

Model


The model of our application is very simple and consists of two entities: the category a torrent belongs to (a category has a parent field, and in practice a torrent always sits in a category with exactly one parent) and the torrent itself.

    data class Category(val id: Long, val name: String, val parent: Category?)

    data class Torrent(val id: Long, val categoryId: Long, val hash: String,
                       val name: String, val size: Long, val created: Date)


I described our model classes simply as immutable data classes. This project does not use JPA, partly on ideological grounds and partly as a consequence of Occam's razor: an ORM would bring in an extra technology and an obvious performance hit. For mapping data from the database to objects I will simply use JDBC and JdbcTemplate, a tool sufficient for our task.

So, we have defined our model. Besides quite ordinary fields, note the hash field: it is effectively the identifier of a torrent in the world of communication between torrent clients. The hash alone is enough to find (for example, via DHT) the happy owners seeding the torrent and get the missing information from them (such as file names); that is all that distinguishes a torrent file from a magnet link.
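As a small aside, here is a sketch of what that means in practice. This helper is not part of the project; I assume the stored hash is the BitTorrent info-hash in its usual hex form:

    import java.net.URLEncoder

    // Builds a magnet link from our model: the info-hash alone identifies the
    // torrent, the display name (dn) is just a hint for the torrent client.
    fun magnetLink(torrent: Torrent): String =
        "magnet:?xt=urn:btih:${torrent.hash}&dn=${URLEncoder.encode(torrent.name, "UTF-8")}"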

Repositories


For data access we use a small abstraction that separates the data storage from its consumers. For example, given the nature of the data, we could just as easily skip storage altogether and parse the csv database at startup; the same abstraction would also suit those who keenly want the JPA we talked about above. So, for each entity we create its own repository, plus one repository for accessing the current version of the database.

    interface CategoryRepository {
        fun contains(id: Long): Boolean
        fun findById(id: Long): Category?
        fun count(): Int
        fun clear()
        fun batcher(size: Int): Batcher<Category>
    }

    interface TorrentRepository {
        fun search(name: String): List<Torrent>
        fun count(): Int
        fun clear()
        fun batcher(size: Int): Batcher<Torrent>
    }

    interface VersionRepository {
        fun getCurrentVersion(): Long?
        fun updateCurrentVersion(version: Long)
        fun clear()
    }

A reminder, in case someone forgot or didn't know: the question mark after a type name means the value may be absent, i.e. it may be null. Without the question mark, an attempt to sneak null in usually fails at compile time. Back from the lyrical digression to our repository interfaces. They are deliberately minimalist so as not to distract from the main point, and their meaning is clear, except perhaps for the batchers in the first two. Again, because of the specifics of the data, we write a lot of data once and it then never changes. Hence there is only one mutating method, which provides batch insertion. Let's take a closer look at it.

Batcher


A very simple interface that allows you to add entities of a specific type:
    interface Batcher<T> : Closeable {
        fun add(value: T)
    }

Batcher also extends Closeable, so that a started but incomplete pack can be sent for insertion when the source runs out of data. The logic is as follows: the pack size is specified when the batcher is created; added entities accumulate in a buffer until the pack grows to the specified size, and then a batch insert is performed, which is generally faster than a series of single inserts. The category batcher will additionally insert only unique values; for torrents a simple implementation on top of JdbcTemplate.batchUpdate() suffices. There is no perfect pack size, so I made these parameters part of the application configuration (see application.yaml).
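As a minimal sketch, a torrent batcher on top of JdbcTemplate might look like this (the class name, SQL, and column list are my assumptions, chosen to match the model above):

    import org.springframework.jdbc.core.JdbcTemplate
    import java.sql.Timestamp

    class JdbcTorrentBatcher(private val jdbcTemplate: JdbcTemplate,
                             private val size: Int) : Batcher<Torrent> {

        private val buffer = ArrayList<Torrent>(size)

        override fun add(value: Torrent) {
            buffer.add(value)
            // a full pack goes to the database as a single batch insert
            if (buffer.size >= size) flush()
        }

        // close() sends the started, incomplete pack when the source is exhausted
        override fun close() = flush()

        private fun flush() {
            if (buffer.isEmpty()) return
            jdbcTemplate.batchUpdate(
                "INSERT INTO torrent (id, category_id, hash, name, size, created) VALUES (?, ?, ?, ?, ?, ?)",
                buffer.map { arrayOf<Any>(it.id, it.categoryId, it.hash, it.name, it.size, Timestamp(it.created.time)) })
            buffer.clear()
        }
    }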

clear()


When I spoke of a single mutating method I was getting ahead of myself, because all repositories also have a clear() method, which simply deletes all old data before a new version of the dump is processed. In fact we use TRUNCATE TABLE ..., because DELETE FROM ... without a WHERE clause works much slower, and for our situation the effect is the same. If the database does not support TRUNCATE, you can simply re-create the table, which is also much faster than deleting all rows.
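In the jdbc implementation this boils down to a one-liner (a fragment of the hypothetical repository class; the table name is assumed, and H2 supports TRUNCATE TABLE):

    override fun clear() {
        // TRUNCATE drops all rows at once instead of deleting them one by one
        jdbcTemplate.execute("TRUNCATE TABLE torrent")
    }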

Reading interface


Here there are only the necessary methods: search() on torrents, which we will use for searching, and findById() on categories, to assemble a complete result when searching. count() is only needed for logging; the task itself does not require it. The JDBC implementation simply uses JdbcTemplate for fetching and mapping, for example:
    private val rowMapper = RowMapper { rs: ResultSet, rowNum: Int ->
        Torrent(
            rs.getLong("id"),
            rs.getLong("category_id"),
            rs.getString("hash"),
            rs.getString("name"),
            rs.getLong("size"),
            rs.getDate("created")
        )
    }

    override fun search(name: String): List<Torrent> {
        if (name.isEmpty()) return emptyList()
        val parts = name.split(" ")
        val whereSql = parts.map { "UPPER(name) like UPPER(?)" }.joinToString(" AND ")
        val parameters = parts.map { it.trim() }.map { "%$it%" }.toTypedArray()
        return jdbcTemplate.query(
            "SELECT id, category_id, hash, name, size, created FROM torrent WHERE $whereSql",
            rowMapper, *parameters)
    }

In this simple way we implement a search that finds names containing every word of the query. We do not limit the number of records returned at once or split results into pages, which would certainly be worth doing in a real project, but our small experiment can do without it. It is worth noting that such a head-on solution requires a full scan of the table on every search. That may be acceptable for the relatively small rutracker base, but of course it would not do for public production. To speed up the search you would need an additional index: perhaps the database's native full-text search, or a third-party solution like Apache Lucene, Elasticsearch, or many others. Building such an index would, of course, increase both the time to create the base and its size. But since our system is rather an educational one, we will stick with the simple full-scan query.

Import


Most of our system is about importing data from the csv files into our storage. There are several aspects worth paying attention to. First, our source base, though not huge, is already of a size that has to be treated with care: we need to think about how to reduce the load time, since copying the data head-on could take a while. Second, the csv base is denormalized, and we want to split it into categories and torrents, so we need to decide how to make this separation.

Performance


Let's start with reading. My implementation uses a self-written csv parser in Kotlin, taken from another project of mine. It is slightly faster and a bit more careful about the types of exceptions it throws than the existing open-source options, but it does not change the order of magnitude of parsing speed; one could just as well take almost any parser that can work on a stream, for example commons-csv.

Now writing. As we saw earlier, I added the batchers to reduce the overhead of inserting a large number of records. For categories the problem is not so much the quantity as the fact that they repeat many times. A series of tests showed that it is faster to check for existence before adding to a pack than to build huge packs of MERGE INTO-style queries, which is understandable given that the existence check happens directly in memory. Hence a special batcher appeared that checks uniqueness.
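A minimal sketch of such a batcher (the class name and the choice of a plain in-memory set of ids are my assumptions):

    class UniqueBatcher(private val delegate: Batcher<Category>) : Batcher<Category> {

        private val seen = HashSet<Long>()

        override fun add(value: Category) {
            // only the first occurrence of a category id reaches the underlying
            // batcher, so the existence check never touches the database
            if (seen.add(value.id)) delegate.add(value)
        }

        override fun close() = delegate.close()
    }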

And of course it was worth thinking about parallelizing the process. Having made sure that different files contain data independent of each other, I chose each such file as the unit of work for a worker running in its own thread.

    private fun importCategoriesAndTorrents(directory: Path) = withExecutor { executor ->
        val topCategories = importTopCategories(directory)
        executor
            .invokeAll(topCategories.map { createImportFileWorker(directory, it) })
            .map { it.get() }
    }

    private fun createImportFileWorker(directory: Path, topCategory: CategoryAndFile): Callable<Unit> = Callable {
        val categoryBatcher = categoryRepository.batcher(importProperties.categoryBatchSize)
        val torrentBatcher = torrentRepository.batcher(importProperties.torrentBatchSize)
        (categoryBatcher and torrentBatcher).use {
            parser(directory, topCategory.file).use {
                it
                    .map { createCategoryAndTorrent(topCategory.category, it) }
                    .forEach {
                        categoryBatcher.add(it.category)
                        torrentBatcher.add(it.torrent)
                    }
            }
        }
    }

A pool with a fixed number of threads suits this kind of work well. We hand all the tasks to the executor at once, but it runs only as many of them as there are threads in the pool; as one task finishes, its thread is given to the next. The right number of threads cannot be guessed in advance, but it can be chosen experimentally. By default the number of threads equals the number of cores, which is often not the worst strategy. Since we only need the pool during the import, we create it, use it, and close it. For this we write a small utility inline function withExecutor(), which we already used above:

    private inline fun <R> withExecutor(block: (ExecutorService) -> R): R {
        val executor = createExecutor()
        try {
            return block(executor)
        } finally {
            executor.shutdown()
        }
    }

    private fun createExecutor(): ExecutorService =
        Executors.newFixedThreadPool(importProperties.threads)

An inline function is good because it exists only at compile time: it helps tidy up the code and lets us reuse functions with lambda parameters without any overhead, since the compiler embeds the function body at the call site. This is convenient, for example, when we need to close something in a finally block and do not want that to distract from the main logic of the program.

Separation


Having made sure that entities cannot depend on each other during the import, I decided to collect all entities (categories and torrents) in one pass, having first created only the top-level categories (obtaining, at the same time, the information about the files with torrents) and chosen them as the unit of parallelization.
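For reference, the helper shapes used in the import code above might look like this (the real definitions are in the project sources; these are reconstructions consistent with how they are used):

    data class CategoryAndFile(val category: Category, val file: String)

    data class CategoryAndTorrent(val category: Category, val torrent: Torrent)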

REST


Now we have almost everything we need to add a controller that returns torrent search results as json. In the output I would like the torrents grouped by category, so we define a special bean describing the structure of the response:
    data class CategoryAndTorrents(val category: Category, val torrents: List<Torrent>)

Done; it remains only to request the torrents, group them, and sort them:
    @RequestMapping("/api/torrents")
    class TorrentsController(val torrentRepository: TorrentRepository,
                             val categoryRepository: CategoryRepository) {

        @ResponseBody
        @RequestMapping(method = arrayOf(RequestMethod.GET))
        fun find(@RequestParam name: String): List<CategoryAndTorrents> =
            torrentRepository
                .search(name)
                .asSequence()
                .groupBy { it.categoryId }
                .map { CategoryAndTorrents(categoryRepository.findById(it.key)!!, it.value.sortedBy { it.name }) }
                .sortedBy { it.category.name }
                .toList()
    }

By annotating the function parameter name with @RequestParam, we tell Spring to bind the value of the "name" request parameter to it. By marking the method with @ResponseBody, we ask Spring to convert the bean returned from the method to json.
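To make the contract concrete, here is an illustrative exchange. The values below are invented for illustration, and the exact date format depends on how the json provider is configured:

    GET /api/torrents?name=ubuntu

    [
      {
        "category": {
          "id": 101,
          "name": "Ubuntu",
          "parent": { "id": 1, "name": "Linux", "parent": null }
        },
        "torrents": [
          {
            "id": 42,
            "categoryId": 101,
            "hash": "aa53d21e…",
            "name": "ubuntu server 15.10",
            "size": 648019968,
            "created": "Oct 22, 2015"
          }
        ]
      }
    ]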

A bit about DI


In the previous code you can also see that the repositories arrive in the controller through its constructor. The rest of the application is built the same way: the beans themselves know nothing about DI and simply accept all their dependencies in the constructor, without any annotations. The actual wiring happens at the level of the Spring configuration:

    @Configuration
    open class RestConfiguration {

        @Bean
        open fun torrentsController(torrentRepository: TorrentRepository,
                                    categoryRepository: CategoryRepository): TorrentsController =
            TorrentsController(torrentRepository, categoryRepository)
    }

Spring passes the dependencies, created by another configuration, into the parameters of the method that creates the controller, and the method hands them to the controller.
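For example, the repository side might be wired like this (only the configuration class name comes from the @Import above; the implementation class names are my assumptions):

    import javax.sql.DataSource
    import org.springframework.jdbc.core.JdbcTemplate

    @Configuration
    open class JdbcRepositoriesConfiguration {

        @Bean
        open fun jdbcTemplate(dataSource: DataSource): JdbcTemplate = JdbcTemplate(dataSource)

        @Bean
        open fun torrentRepository(jdbcTemplate: JdbcTemplate): TorrentRepository =
            JdbcTorrentRepository(jdbcTemplate)

        @Bean
        open fun categoryRepository(jdbcTemplate: JdbcTemplate): CategoryRepository =
            JdbcCategoryRepository(jdbcTemplate)
    }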

Summary


Done! We start it up and check it (at localhost:8080/ there is a javascript client for our service, whose description is beyond the scope of this article), and it works! On my machine the import takes about 80 seconds, which is quite good. A search request takes about 5 seconds, which is not so good, but it works too.

About goals


When I was a novice programmer, I really wanted to learn how other, more experienced developers write programs, how they think and reason; I wanted them to share their experience. In this article I wanted to show how I reasoned while working on this task: real solutions to some completely mundane and not-so-difficult problems, the technologies used, and the aspects of them I had to deal with. Perhaps someone will want to make a more successful implementation of the repositories, or even of the whole task, and write about it, or simply suggest it in the comments. By all of this we only increase our knowledge and experience.

Source: https://habr.com/ru/post/274713/

