Generating a family tree based on Wikipedia data

In this article, I want to show how using the Selenium Webdriver framework, based on Wikipedia data, you can create a genealogical tree of a given person (for example, the legendary founder of the first dynasty of Russian rulers Rurik ).

The article will explain how to determine the name of the person, calculate the links to the pages of the person’s children, and also will build a genealogical tree generation algorithm.
I will use Java , Selenium Webdriver and Chrome . Chrome, because it is faster than other browsers, and since the transition by url is the most time-consuming operation in the program, the choice of browser most noticeably affects time. You can even abandon the browser and use, say PhantomJs , but it's harder to debug. Therefore, I stopped at Chrome.

First of all, we will create a test that verifies that the browser has started up correctly and that when navigating through https://ru.wikipedia.org/wiki/Rurik, a page with the title “Rurik - Wikipedia” opens:
')

@BeforeClass public static void Start() {     driver = DriverHelper.getDriver(); } @Test public void testGetDriver() {     driver.navigate().to("https://ru.wikipedia.org/wiki/%D0%A0%D1%8E%D1%80%D0%B8%D0%BA");     assertTrue(driver.getTitle().equals(" — ")); } @AfterClass public static void Stop() {     driver.quit(); }

Create a class DriverHelper with a static getDriver () method so that the project compiles and the test is successful:

 public final class DriverHelper{    private static final int TIMEOUT = 30;    public static WebDriver getDriver() {        WebDriver driver = new ChromeDriver();        driver.manage().window().maximize();        driver.manage().timeouts().implicitlyWait(TIMEOUT, TimeUnit.SECONDS);        return driver;    } }

We run the test to make sure that the browser starts up correctly and opens the page you need.

Creating the Person class

Let's move on to the creation of the Person class, in which information about a person will be stored, as well as to the creation of a person's page class on the Wikipedia PersonPage.

In the class Person for now there will be only two fields - name and url. As the name, we will use the full name of the person, without division into a surname, first name, patronymic name, since most members of the dynasty will not have a surname, but will have nicknames, titles and ordinal names.

The url will point to a Wikipedia page dedicated to this person.
Create a test that checks the formation of the person:

 @Test public void testGetPerson() throws Exception {    PersonPage page = new PersonPage(driver);    Person person = page.getPerson("https://ru.wikipedia.org/wiki/_");    assertTrue(person.getName().equals(" "));    assertTrue(person.getUrl().equals(        "https://ru.wikipedia.org/wiki/        %D0%92%D0%BB%D0%B0%D0%B4%D0%B8%D0%BC%D0%B8%D1%80_        %D0%90%D0%BB%D0%B5%D0%BA%D1%81%D0%B0%D0%BD%D0%B4%D1%80%D0%BE%D0%B2%D0%B8%D1%87")); }

testGetPerson () does not compile. We need to design the PersonPage page to determine the name and page of the person. We identify the url by the url of the current page, and the name by the textual content of the tag with the identifier firstHeading. GetPerson () method:

 public Person getPerson(String url) throws MalformedURLException {    driver.navigate().to(url);    String name = getName();    Person person = new Person(driver.getCurrentUrl());    person.setName(name);    return person; } private String getName() throws MalformedURLException {    String namePage = driver.findElement(By.cssSelector("#firstHeading")).getText();    return namePage; }

Reproof Tests - turned green.
It is worth mentioning separately why the url is redefined, although it is passed as a parameter: the fact is that in Wikipedia several pages can be devoted to one person and redirected to one. As a result, if we use the original URLs, then duplicates can occur when there are several people with “different” URLs, which are actually one person. Therefore, the url is used as the URL to which all others redirect.

For example: the page https://ru.wikipedia.org/wiki/Yaroslav_Mudry is redirected to https://ru.wikipedia.org/wiki/Yaroslav_Vladimirovich_Wise , and the page https://ru.wikipedia.org/wiki/Andrey_Bogolyubsky - to https : //ru.wikipedia.org/wiki/Andrey_Yurievich_Bogolyubsky

Definition of person's children

Let's try to identify the children of the person who have their own pages in Wikipedia.
To begin with, we will write a test for determining the children of Rurik (more precisely, one - Igor ):

 @Test public void testGetChildrenUrl() throws Exception {    driver.navigate().to("https://ru.wikipedia.org/wiki/");    PersonPage page = new PersonPage(driver);    List<Person> children = page.getChildrenUrl();    assertTrue(children.size() == 1);    Person person = children.get(0);    assertTrue(person.getUrl().equals("https://ru.wikipedia.org/wiki/        %D0%98%D0%B3%D0%BE%D1%80%D1%8C_        %D0%A0%D1%8E%D1%80%D0%B8%D0%BA%D0%BE%D0%B2%D0%B8%D1%87")); }

In order for the test to be successful, you need to add to the PersonPage page a method for determining the urls of the person’s children:

 public List<Person> getChildrenUrl() throws MalformedURLException {    List<WebElement> childrenLinks = driver.findElements(        By.xpath("//table[contains(@class, 'infobox')]//tr[th[.=':']]//a"));    List<Person> children = new ArrayList<Person>();    for (WebElement link : childrenLinks) {        Person person = new Person(link.getAttribute("href"));        children.add(person);    }    return children; }

So far, we ignore the fact that children may not be devoted to the pages in Wikipedia and therefore they do not have links to their pages. For example, as in the case of the children of Vladimir Yaroslavich (the Galician prince) . We also ignore the fact that information about descendants can be located in the main area, such as, for example, on the page of Maria Dobronega or on the page of Svyatoslav Vsevolodovich (Prince Trubchevsky) .

Add tests that verify the correctness of the definition of children for the person.
As stated above, for the time being we assume that Vladimir Yaroslavich (the Galician prince) and Maria Dobronega did not have children, and Vladimir Svyatoslavich had 16 children, although Wikipedia claims that he had 5 more unknown persons named after his daughters.

 @Test public void testChildrenSize() throws Exception {    driver.navigate().to("https://ru.wikipedia.org/wiki/");    PersonPage page = new PersonPage(driver);    List<String> children = page.getChildrenUrl();    assertTrue(children.size() == 1);    driver.navigate().to("https://ru.wikipedia.org/wiki/_");    children = page.getChildrenUrl();    assertTrue(children.size() == 16);    driver.navigate().to("https://ru.wikipedia.org/wiki/__(_)");    children = page.getChildrenUrl();    assertTrue(children.size() == 0);    driver.navigate().to("https://ru.wikipedia.org/wiki/_");    children = page.getChildrenUrl();    assertTrue(children.size() == 0); }

In the Person class, we will add fields for the unique identifier of the person (int id) and the list of the person’s children (List <Integer> children), in which the identifiers of the children will be stored.
We will develop a method for adding a child's identifier to the list of person's children. A child can be added to the list only if he is not there yet.

 public void setChild(int childId) {    if (!children.contains(childId)) {        children.add(childId);    } }

Of course, do not forget to cover all the code with tests and achieve a green result.

Descendant Search Algorithm

We now turn to the most interesting - the development of an algorithm for the search for descendants of a given person. Create a GenerateGenealogicalTree class with the main method.

As already mentioned, the most time-consuming is the transition by url, so you need to minimize the number of these transitions. To do this, create a list of persons in which the whole genealogical tree will be stored. In this list, let's remember the index of the current person - the one on whose page we are at the moment. All persons with a lower index are considered “visited”, and all with a larger index (+ current) - “unvisited”. After the transition to the page of the current person has been made and its basic data have been calculated, the index is increased by one. Thus, the current person falls into the category of "visited". And it remains only to bypass the remaining "unvisited" persons. At each moment of time those persons are known whose pages have already been viewed.

The filling of the family tree with new “unvisited” persons happens by adding the children of the current person to the end of the list. At the same time, we add only those children who are not yet on the list so that duplicates do not occur (such a situation is possible when the husband and wife are both descendants of the forefather of the dynasty from different branches. Examples: husband and wife are descendants of Rurik, husband and wife are descendants Paul I ).

The genealogical tree is considered built when there are no “unvisited” persons left, i.e. when the index of the current person has become equal to the size of the pedigree tree.

The algorithm is as follows:

The founder of the dynasty is created on the basis of a given url.
Generated family tree based on the founder of the dynasty
In the cycle as long as there are "unvisited" person
The person is calculated on the basis of the current URL of the pedigree tree. This person is set as current.
If the current person is not a duplicate, then the list of her children is calculated and set. All children are added to the list.
If the current person has already met among the "visited" people, then it is removed.
The transition to the next "unvisited" person, which is taken as the "current".

Algorithm code:

 public final class GenerateGenealogicalTree {    public static void main(String[] args) throws Exception {        String url = getUrl(args);        GenealogicalTree tree = getGenealogicalTreeByUrl(url);        saveResultAndQuit(tree);    }    public static GenealogicalTree getGenealogicalTreeByUrl(String url) throws MalformedURLException {        WebDriver driver = DriverHelper.getDriver();        Person person = new Person(url);        GenealogicalTree tree = new GenealogicalTree(person);        PersonPage page = new PersonPage(driver);        while (tree.hasUnvisitingPerson()) {            String currentUrl = tree.getCurrentUrl();            Person currentPerson = page.getPerson(currentUrl);            tree.setCurrentPerson(currentPerson);            if (!tree.isCurrentPersonDeleted()) {                List<Person> children = page.getChildrenUrl();                tree.setChildren(children);            }            tree.updatingCurrentPerson();        }        driver.quit();        return tree;    } }

The GenealogicalTree class has three fields: List <Person> allPersons - a list of all members of the family tree, int indexCurrentUnvisitedPerson - the index of the current person in the list of allPersons, and boolean isCurrentPersonDeleted - a sign of whether the “current” person has been deleted (i.e., it is duplicate).

 public final class GenealogicalTree {    private List<Person> allPersons;    private int indexCurrentUnvisitedPerson;    private boolean isCurrentPersonDeleted; }

Initialization takes place on the basis of the “ancestor” of the dynasty - the first person whose descendants we are looking for:

 public GenealogicalTree(Person person) {    if (person == null) {        throw new IllegalArgumentException("   ");    }    allPersons = new ArrayList<Person>();    allPersons.add(person);    indexCurrentUnvisitedPerson = 0;    isCurrentPersonDeleted = false; }

At this point, the family tree consists of one current “unvisited” person. There are no "visited" persons.

As already mentioned, the list is checked for the presence of “unvisited” persons as follows: if the current person’s index “has reached the end,” we believe that there are no “unvisited” persons.

 public boolean hasUnvisitingPerson() {    return indexCurrentUnvisitedPerson < allPersons.size(); }

In the role of the url of the pedigree tree is the url of the current person:

 public String getCurrentUrl() {    return allPersons.get(indexCurrentUnvisitedPerson).getUrl(); }

The setCurrentPerson method replaces the current person with the specified one.

Initially, we only know about the person her url, which we get from the page of the parent. Therefore, a person is added to the family tree, having only this information. In essence, all “unvisited” persons are simply urls. The setCurrentPerson method “specifies” a person after the index “got to it” and the person has become current.

If the “refined” person being installed has already been encountered before (this is possible if a redirect has occurred from the current person's url to one of the pages that was previously encountered), then the current person is deleted. After that, the current person is marked as deleted. If a given person does not occur earlier, then it "replaces" the current one. In this case, the person is not considered deleted.

The concept of "occurs before" implies that we check only the "visited" person. "Unvisited" is not checked. Theoretically, it is possible that the url of the current person redirects to the url, which can be met "later" among the "unvisited". But this is such a rare situation that it is not worthwhile to “run through” the entire array each time. In this rare case, the duplicate will be deleted anyway, when a queue “comes to it” and the index of the current person will point to the person with the url to which the redirect occurred.

 public void setCurrentPerson(Person currentPerson) {    int indexDuplicate = allPersons.indexOf(currentPerson);    if ((0 <= indexDuplicate) && (indexDuplicate < indexCurrentUnvisitedPerson)) {        removePerson(indexDuplicate);    } else {        allPersons.get(indexCurrentUnvisitedPerson).copyMainData(currentPerson);        isCurrentPersonDeleted = false;    } }

In order for the indexOf (Object Object) method to work correctly, it is necessary in the Person class to override the equals (Object object) and hashCode () methods:

 @Override public boolean equals(Object object) {    if ((object == null) || (!(object instanceof Person))) {        return false;    }    Person person = (Person) object;    return this.url.equals(person.url); } @Override public int hashCode() {    return this.url.hashCode(); }

Why do I need to constantly check the presence of a person in the list?
The occurrence of duplicates is possible for many reasons:

Fatherhood is not known for certain. As, for example, in the case of Svyatopolk the Accursed , whose father is either Yaropolk Svyatoslavich or Vladimir Svyatoslavich
Both parents are descendants of Rurik from different branches. Example: Gleb Vseslavich - a descendant of Rurik in the 8th generation was married to Anastasia Yaropolkovna - also a descendant of Rurik (they are four cousins with her sister).
Errors on the page: it is doubtful that Vsevolod Mstislavich had a son, Volodar Glebovich , whose parents recorded other people, also belonging to the Rurik dynasty. Most likely, it's just a typo in Wikipedia

If these duplicates are not eliminated, then they will generate new repetitions, because for all descendants of duplicates, the round will be performed twice, or even three times (in the case of Volodar Glebovich ).

Now consider removing a person from the list when it is a duplicate. The person being removed may be on the list of children of a member of the family tree. For example, when both parents are representatives of the same dynasty, one parent has a link to one page of the “child” and the other to another, which will then redirect to the first. It turns out that if it is “simple” to remove the duplicate, then the second parent will have a link to a non-existent person.

Therefore, before deleting the current person, in the list of children’s identifiers of all “visited” persons, its identifier should be replaced by the identifier of the found match (the “unvisited” children do not have).

After deletion, the current person is marked as deleted.

 private void removePerson(int indexDuplicate) {    int idRemovedPerson = allPersons.get(indexCurrentUnvisitedPerson).getId();    int idDuplicate = allPersons.get(indexDuplicate).getId();    for (int i = 0; i < indexCurrentUnvisitedPerson; i++) {        Person person = allPersons.get(i);        person.replaceChild(idRemovedPerson, idDuplicate);    }    allPersons.remove(indexCurrentUnvisitedPerson);    isCurrentPersonDeleted = true; }

In the Person class, add the method of replacing the "child":

 public void replaceChild(int oldId, int newId) {    if (oldId == newId) {        return;    }    if (!children.contains(oldId)) {        return;    }    children.remove((Object) oldId);    setChild(newId); }

Consider adding children to the current person.

At the entrance we get a list of people who need to be set current as children.
The main difference in the search for duplicates is that now we will look for them not only among the “visited” persons, but also among the “unvisited”, i.e. within the whole family tree.
If the current person is deleted, an exception is thrown, because in fact, there is no one to install children.

If not deleted, then go over the list passed as a parameter. If the child is already found in the pedigree tree, then the identifier of the duplicate found is added to the list of children. If a child does not occur in the pedigree tree, then its identifier is added to the list of children, in addition, the child itself is added to the end of the pedigree tree, to the list of "unvisited" persons.

Thus, the list is filled with the setChildren () method.

 public void setChildren(List<Person> children) {    if (isCurrentPersonDeleted) {        throw new IllegalArgumentException(            "    .    ");    }    for (Person person : children) {        int index = allPersons.indexOf(person);        int id;        if (index >= 0) {            id = allPersons.get(index).getId();        } else {            allPersons.add(person);            id = person.getId();        }        allPersons.get(indexCurrentUnvisitedPerson).setChild(id);    } }

The counter of the current person needs to be updated, otherwise the family tree will never be built. It happens this way: if the current person is removed, then the next “unvisited” person is already in “her place”, therefore, simply “remove” the sign of the deleted person from the current person. If the current person is not deleted, then we consider it “filled” with all the data and proceed to the next “unvisited” person.

 public void updatingCurrentPerson() {    if (isCurrentPersonDeleted) {        isCurrentPersonDeleted = false;    } else {        indexCurrentUnvisitedPerson++;    } }

Circumvention is carried out by generations: first, the founder of the dynasty (0th generation), then all his children (1st generation) from elder to younger (we assume that the URLs in Wikipedia are in this order), then grandchildren (2nd generation ) (children of the eldest son by seniority, then - the 2nd son, and so on until the youngest), great-grandchildren (3rd generation) and so on until the last representative of the dynasty.

Naturally, do not forget to bring the code coverage with tests up to 100%, to make sure that everything works exactly as intended. A description of the tests is available in javadoc .

Separately, it is worth mentioning this: the GenealogicalTree class is very insecure and easy to force to work incorrectly if used outside the generation tree generation algorithm (outside GenerateGenealogicalTree). The only correct solution in this situation is to transfer this class as an internal private class for GenerateGenealogicalTree. But this has not yet been done for the convenience of testing the algorithm.
Run the program.

Logging results in the database

The first launch shows that we have a huge amount of data that must somehow be analyzed to weed out deliberately incorrect results. Looking ahead, I will inform you that on September 17, 2017 there were 3448 pages of direct descendants of Rurik on Wikipedia. The easiest way is to process a similar amount of information in the database.

First, we deploy a local database, which we call genealogtree. With a standard root user without a password. To interact with the database, we will use the standard MySQL JDBC Type 4 driver library.

And then we create a new class for interacting with the database and a method for saving the genealogical tree in the table with the specified name:

 public class MySqlHelper {    private static final String url = "jdbc:mysql://localhost:3306/genealogicaltree"        + "?serverTimezone=UTC&useUnicode=yes&characterEncoding=UTF-8";    private static final String user = "root";    private static final String password = "";    private static Connection connection;    private static Statement statement;    private static ResultSet resultSet;    public static void saveTree(String tableName, List<Person> tree) throws MalformedURLException {        try {            connection = DriverManager.getConnection(url, user, password);            statement = connection.createStatement();            String table = createTable(tableName);            statement.executeUpdate(table);            for (Person person : tree) {                String insert = insertPerson(tableName, person);                statement.executeUpdate(insert);            }        } catch (SQLException sqlEx) {            sqlEx.printStackTrace();        } finally {            try {                connection.close();            } catch (SQLException se) {            }            try {                statement.close();            } catch (SQLException se) {            }        }    }    private static String createTable(String tableName) {        StringBuilder sql = new StringBuilder();        sql.append("CREATE TABLE " + tableName + " (");        sql.append("id INTEGER not NULL, ");        sql.append("name VARCHAR(255), ");        sql.append("url VARCHAR(2048), ");        sql.append("children VARCHAR(255), ");        sql.append("PRIMARY KEY ( id ))");        return sql.toString();    }    private static String insertPerson(String tableName, Person person) {        StringBuilder sql = new StringBuilder();        sql.append("INSERT INTO genealogicaltree." + tableName);        sql.append("(id, name, url, nameUrl, children, parents, numberGeneration) \n VALUES (");        sql.append(person.getId() + ",");        sql.append("'" + person.getName() + "',");        sql.append("'" + person.getUrl() + "',");        sql.append("'" + person.getChildren() + "',");        sql.append(");");        return sql.toString();    } }

We finalize saving the generation results:

 private static void saveResultAndQuit(GenealogicalTree tree) throws Exception {    Timestamp timestamp = new Timestamp(System.currentTimeMillis());    String tableName = "generate" + timestamp.getTime();    MySqlHelper.saveTree(tableName, tree.getGenealogicalTree()); }

Parsing first results

The first run of GenerateGenealogicalTree.main () produced a lot of entries, a quick inspection of which shows the presence of non-existent and erroneous pages.

We expand errors into categories:

The year has been added to the list of children (for example, 1153 from the page of Yaroslav Svyatoslavovich )
Non-Russian article: Adelaide of France , daughter of Louis VII , King of France
Page "bastard child" , which appeared from the same Louis VII
External pages like this which are on the list, for example, from Galeran IV de Beaumont
"Creating a page." For example, Anna Yurievna , daughter of Prince Yury Yaroslavich of Turov

We will refine the getChildrenUrl () method of defining children’s pages in order to exclude obviously erroneous ones. In order not to fall 1 category, you need to remove those links, the text content of which begins with a number. To avoid 2 category, you need to remove those links whose class is equal to extiw.To avoid 3-4 categories, it is necessary to exclude links whose parent tag is equal to sup (clarifying links). To remove a category from list 5, you must exclude links whose class is new (page creation).

To begin with, let's finalize the testChildrenSize () test, adding to it a check of all categories of reference curves:

 driver.navigate().to("https://ru.wikipedia.org/wiki/_"); children = page.getChildrenUrl(); assertTrue(children.size() == 3); driver.navigate().to("https://ru.wikipedia.org/wiki/_VII"); children = page.getChildrenUrl(); assertTrue(children.size() == 5); driver.navigate().to("https://ru.wikipedia.org/wiki/_IV__,___"); children = page.getChildrenUrl(); assertTrue(children.size() == 0); driver.navigate().to("https://ru.wikipedia.org/wiki/__(_)"); children = page.getChildrenUrl(); assertTrue(children.size() == 5);

The test is predictably red.

Now we will finalize the getChildrenUrl () method:

 public List<Person> getChildrenUrl() throws MalformedURLException {    waitLoadPage();    List<WebElement> childrenLinks = getChildrenLinks();    List<Person> children = new ArrayList<Person>();    for (WebElement link : childrenLinks) {        if (DriverHelper.isSup(link)) {            continue;        }        Person person = new Person(link.getAttribute("href"));        person.setNameUrl(link.getText());        if (person.isCorrectNameUrl()) {            children.add(person);        }    }    return children; } private List<WebElement> getChildrenLinks() {    List<WebElement> childrenLinks = DriverHelper.getElements(driver,        By.xpath("//table[contains(@class, 'infobox')]//tr[th[.=':']]" +                "//a[not(@class='new' or @class='extiw')]"));    return childrenLinks; } private void waitLoadPage() {    this.driver.findElement(By.cssSelector("#firstHeading")); } public final class DriverHelper {    /**    *       .<br>    *      -  ,       *    ,          * ,   ,     .       *    ,    ,       * .    */    public static List<WebElement> getElements(WebDriver driver, By by) {        driver.manage().timeouts().implicitlyWait(0, TimeUnit.SECONDS);        List<WebElement> result = driver.findElements(by);        driver.manage().timeouts().implicitlyWait(DriverHelper.TIMEOUT, TimeUnit.SECONDS);        return result;    }    public static boolean isSup(WebElement element) {        String parentTagName = element.findElement(By.xpath(".//..")).getTagName();        return parentTagName.equals("sup");    } } public class Person {    private String nameUrl;    public boolean isCorrectNameUrl() {        Pattern p = Pattern.compile("^[\\D]+.+");        Matcher m = p.matcher(nameUrl);        return m.matches();    } }

nameUrl is the name of the person’s link that it has on the parent’s page.
We rethrow the entire test suite - turned green.

Rurik has a lot of descendants to whom the Russian-language pages are devoted to Wikipedia, so first we will run through the program for Mikhail Fedorovich - the first tsar from the Romanov dynasty. Run, wait for the end and analyze the results.

Romanovs

383 ( , , , 18 II ), , , , , II , . , , .

. :

-
, — V , — I -

— , , , , — .

, — , , .. ( , , - , )
getChildrenUrl(), , url . , «» , .. , .

 public List<Person> getChildrenUrl() {    waitLoadPage();    if (DriverHelper.hasAnchor(driver)) {        return new ArrayList<Person>();    }    ... } public final class DriverHelper {    ...    public static boolean hasAnchor(WebDriver driver) throws MalformedURLException {        URL url = new URL(driver.getCurrentUrl());        return url.getRef() != null;    }    ... }

Add a new test that checks that the person with the anchor URL has no children:

 @Test public void testEmptyChildrenInPersonWithAnchor() throws Exception {    driver.navigate().to("https://ru.wikipedia.org/wiki/_");    PersonPage page = new PersonPage(driver);    List<String> children = page.getChildrenUrl();    assertTrue(children.size() == 5);    driver.navigate().to(        "https://ru.wikipedia.org/wiki/_#.D0.A1.D0.B5.D0.BC.D1.8C.D1.8F");    children = page.getChildrenUrl();    assertTrue(children.size() == 0); }

We rethrow the whole test suite to make sure that nothing is broken.

If offspring is not calculated, then what is the point in going through the url with an anchor? The offspring, of course, is not defined, but other information can be obtained: the name of the person or, for example, the years of life. Therefore, it is better to “save” such URLs for future functionality expansion.

Calculating the name of the "anchor"

In most cases, the name of a person can be calculated by “anchor”: the name is the textual content of the tag with an identifier equal to the value of the “anchor”.
Let's refine the getName () method:

 private String getName() throws MalformedURLException {    waitLoadPage();    String namePage = driver.findElement(By.cssSelector("#firstHeading")).getText();    if (!DriverHelper.hasAnchor(driver)) {        return namePage;    }    String anchor = DriverHelper.getAnchor(driver);    List<WebElement> list = DriverHelper.getElements(driver, By.id(anchor));    if (list.size() == 0) {        return namePage;    }    String name = list.get(0).getText().trim();    return name.isEmpty() ? namePage : name; } public final class DriverHelper {    ...    public static String getAnchor(WebDriver driver) throws MalformedURLException {        URL url = new URL(driver.getCurrentUrl());        return url.getRef();    }    ... }

If the current url contains an “anchor”, then it is necessary to check the existence of an element on the page with an identifier equal to “anchor”. It can not exist, as is the case with Natalia - daughter of Peter I of . On Peter’s page, the link contains a non-existent “anchor” that does not match the “anchor” of Natalia.

You also need to make sure that the tag with the “anchor” identifier contains non-empty text and return the name of the page otherwise. Otherwise, as, for example, in the situation with Demyan Glebovich , an empty name will be determined and the program will crash with an exception.
The tests turned green again.

Adding Link Name, Parent, and Generation Number

: , , «». ?!

, .. getChildrenUrl(). nameUrl , .

, .

, «», « ». , , , nameUrl ( "" ).

We override the program for the Romanovs and check the data with those that were collected before refactoring.

This is how urles with anchors now look like:

id	name	children	url	urlName
eight	Pelagia	[]	link	Pelagia
9	Marfa	[]	link	Marfa
ten	Sofia	[]	link	Sofia
15	Anna	[]	link	Anna
23	Evdokia (younger)	[]	link	Evdokia
26	Theodora	[]	link	Theodora
28	Maria	[]	link	Maria
29	Theodosius	[]	link	Theodosius
36	Children of Peter I	[]	link	Natalia
133	Family	[]	link	Alexander
360	Marriage and children	[]	link	Luana of Orange-Nassau

Adding the name of the link did not pass unnoticed. The program run for Rurik unexpectedly took off with the exception of a violation of the insert instruction on Henry II (King of Navarre) due to the fact that nameUrl contains a value with an apostrophe, Henry II d'Albre. We will refine the setName and setNameUrl methods in the Person class, keeping the specified value without apostrophes.

I recall that in Wikipedia there were about three and a half thousand descendants of Rurik. If you display this information in the form of a table, you get a very large page.you get tired of scrolling. It would be interesting not only to see the whole plate, but also to have the opportunity to highlight the connection of a given representative with the ancestor of the dynasty (that is, to select all the ancestors of a given representative up to the ancestor). And also to know what kind of generation it is.

To make it easier to implement this new functionality, add the fields for the generation number and the list of parents to the Person class (so that it is easier to build an increasing relationship):

 private List<Integer> parents = new ArrayList<Integer>(); private int numberGeneration = 0; public void setParent(int parent) {    parents.add(parent); } public void setNumberGeneration(int numberGeneration) {    if (this.numberGeneration == 0) {        this.numberGeneration = numberGeneration;    } }

The parent in most cases will be one, but there may be situations where both parents are descendants of the forefather of the dynasty from different branches and then there will be two of them. Above was even given an example of an error, when at once three "are" the parents of the same person (the first , second , third , their "common" child - and all of Rurikovich). Of course, physiologically this is impossible, it turns out, as in the anectode , but, unfortunately, it is impossible to automatically determine who their “superfluous” them is, so you will have to save everyone.

It is obvious that in the list of parents there can only be representatives of the dynasty and “gather” all the ancestors of a given representative of the dynasty up to the ancestor can be very easy, the name only this information.

As for the number of the knee, it is set only once from the first parent. At that moment, when the second parent appears, referring to the child, the generation number is no longer updated. Sincebypassing the genealogical tree occurs by generations, it is obvious that the number of the generation of the first parent will be equal to or greater than the number of the tribe of the first parent, which means the fastest way to build a connection through the "first parents".

We will set the generation number and parent identifier in the setChildren (List <Person> children) method of the GenerateGenealogicalTree class:

 public void setChildren(List<Person> children) {    if (isCurrentPersonDeleted) {        throw new IllegalArgumentException(            "    .    ");    }    Person currentPerson = allPersons.get(indexCurrentUnvisitedPerson);    int numberGeneration = currentPerson.getNumberGeneration();    numberGeneration++;    int idParent = currentPerson.getId();    for (Person person : children) {        int index = allPersons.indexOf(person);        int id;        if (index >= 0) { //  ,                allPersons.get(index).setParent(idParent);            id = allPersons.get(index).getId();        } else { //              person.setNumberGeneration(numberGeneration);            person.setParent(idParent);            allPersons.add(person);            id = person.getId();        }        currentPerson.setChild(id);    } }

Of course, we don’t forget to cover all the code with tests - if you don’t do this right away, then it will be difficult to sort through the code porridge.

Final results

:
—
—

( — 3452 ).

:

) Wikipedia
) № . , , , II 29 .
) .

, , , Nicholas II was a descendant of Rurik in the 28th generation. Moreover, all Russian emperors, beginning with Peter III and Catherine II, were descended from Rurik from different branches.

Project source code

PS 24-09-2017

Realization in the form of a tree

To visualize the data as a tree, I used the InfoVis Toolkit JavaScript library .
Links to the tree of the tree in the form of a tree, not a list:
Adam
Genghis Khan
Romanov
Rurikovich

PS 22-10-2017

Generating a tree using Wikidata

As correctly suggested in the comments, it is advisable to use Wikidata for generation. Improvements were small, more about them here .
The results are as follows:
Adam - 243 representatives instead of 31
Genghis Khan - 405 representatives instead of 33

Source: https://habr.com/ru/post/338190/

All Articles