Problems of extracting information from electronic digital sources

I work at the Faculty of International Relations of St. Petersburg State University. We mostly teach “people with a humanities mindset,” and when our students try to use electronic sources of information found on the Internet in their research, they sometimes simply fail to extract the information they need from them.

For these students, I wrote an article that can serve as a manual for “extracting information.” But since my own mindset is also that of the humanities, I would like to discuss it with visitors of Habrahabr. Perhaps someone here will point out something I have unfairly overlooked.

Be warned, the text is rather long :)

Although virtually any information can be represented in electronic digital form today, text, sound and visual sources are the most relevant for a researcher of international relations. All of them are machine-readable. This means that a person cannot perceive the information they contain directly; extracting it requires technical devices: computer hardware and software capable of processing information objects created with various information processing technologies.

Technologies for processing, digitizing, recording and storing information on electronic media have been and are being created in different parts of the world by research teams, academic groups, commercial developers, public organizations and individuals. As a result, there are a considerable number of standards for electronic documents, and working with them requires a variety of computer programs. Some standards are recognized internationally and recommended by reputable international organizations: the International Organization for Standardization (ISO), the International Telecommunication Union (ITU), the International Electrotechnical Commission (IEC). For example, PDF (Portable Document Format), developed by Adobe for the electronic presentation of text and graphics, is an official electronic document standard approved by ISO and designated ISO 32000 in that organization's classification. This standard is widely used throughout the world. Electronic documents based on it can be found in large quantities both on the World Wide Web and in file-sharing networks, and programs that display such documents are easy to find.

Sometimes, alongside officially recognized international standards, alternative technologies become de facto standards. For example, suppose a researcher who is unfamiliar with modern information technologies but has a clear picture of the system of international standards plans to correspond electronically with participants in and eyewitnesses of the events that interest him. He may decide to use e-mail software based on the X.400 standard, which the International Telecommunication Union developed and adopted for this purpose. But when looking for such software, he will run into significant difficulties. And if he does find it, it will turn out that his potential respondents have no such programs. The fact is that mail transfer based on this standard is practiced mainly in government, intergovernmental and banking correspondence. Individuals all over the world use the SMTP and POP3 protocols for this purpose. They are less reliable and secure and have not been approved as international standards by the organizations named above, yet in practice they are the global standard. Accordingly, the researcher needs an e-mail client that works with these protocols for sending and receiving messages.
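For readers who are curious what an e-mail client actually hands over to these protocols, here is a minimal sketch in Python. It only composes a message and reads it back; nothing is sent over the network. The addresses and the helper name `build_message` are made up for illustration.

```python
from email import message_from_bytes
from email.message import EmailMessage
from email.policy import default


def build_message(sender: str, recipient: str, subject: str, body: str) -> EmailMessage:
    """Compose a plain-text message in the Internet mail format."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = subject
    msg.set_content(body)
    return msg


# Hypothetical addresses, used only for illustration.
message = build_message(
    "researcher@example.org",
    "eyewitness@example.com",
    "Questions about the events of last week",
    "Could you describe what you saw?",
)

# These bytes are what an SMTP server would transmit and what a POP3
# client would later download from the recipient's mailbox.
raw = message.as_bytes()

# Parsing the same bytes back is what the recipient's client does.
received = message_from_bytes(raw, policy=default)
print(received["Subject"])
print(received.get_content().strip())
```

Actually sending the message would additionally require the address and credentials of a mail server, which is exactly the configuration an e-mail client asks for.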

In addition to such global information technology standards for creating and working with electronic documents, there are also regional standards. In the European region, for example, such standards are developed and promoted to the global level by the European Committee for Standardization, the European Committee for Electrotechnical Standardization, and the European institute for standardization in the field of network infrastructure. The latter organization is developing and actively implementing what is called the Information Society Standards System (ESSI). Individual countries can have their own standards for creating and processing electronic information sources, approved by the relevant institutions at the national level. Finally, original standards for encoding and decoding information can be used in commercial software products, in corporate information systems, or simply within certain social circles. For example, a researcher cannot ignore files created with Microsoft Office programs, even though many of the technologies used in these programs are not recognized as international standards by the competent international organizations and are moreover patented, so that working with them legally requires licensed software, which can cost a lot of money. Nor can he ignore “TAR” file archives or “OGG” sound files, whose underlying technologies are likewise not official standards for storing information in electronic form but are used by supporters of the open source ideology around the world.

Data valuable for the study of international relations can be presented electronically by people from different countries and social groups who have access to very different (sometimes very specific) information technologies and software products, and the researcher today must have at least a general understanding of how to process data presented in different electronic formats.
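When a file arrives without a recognizable extension, its first bytes usually reveal the format. The sketch below checks a few well-known signatures for the formats mentioned above; the function name `guess_format` and the wording of the returned labels are my own, and the list is far from complete.

```python
def guess_format(data: bytes) -> str:
    """Guess a file's format from the signature at the start of its contents."""
    if data.startswith(b"%PDF-"):
        return "pdf"
    if data.startswith(b"OggS"):
        return "ogg"
    if data.startswith(b"PK\x03\x04"):
        # docx, odt and many other office formats are ZIP containers.
        return "zip container (docx, odt, ...)"
    if data.startswith(b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1"):
        return "legacy Microsoft Office (doc, xls, ppt)"
    # POSIX tar archives carry the marker "ustar" at byte offset 257.
    if data[257:262] == b"ustar":
        return "tar"
    return "unknown"


# Small hand-made samples; a real call would pass the first 512 bytes of a file.
print(guess_format(b"%PDF-1.4 ..."))
print(guess_format(b"OggS\x00\x02"))
print(guess_format(b"\x00" * 257 + b"ustar\x00"))
```

For a file on disk, reading the first 512 bytes in binary mode (`open(path, "rb").read(512)`) is enough for all the checks above.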

To extract information from any digital source, you need a device that can convert it from machine-readable form into a form suitable for human perception. Today, many devices can do this, from a desktop computer to a mobile phone. We will not dwell on the complex technologies of information transformation: experts in computer science have written many manuals on these issues in recent years. Instead, let us briefly describe modern solutions that require no knowledge of the technical subtleties of working with electronic documents. Such solutions are needed above all by a researcher of international relations who receives information from different countries and processes it on different devices, some of which may be at home, others in various organizations, at public Internet access points and so on; in short, wherever a researcher working with time-sensitive information needs to receive and process it. We cannot ignore the fact that working with electronic sources in the social sciences cannot replace field research. Therefore, some of these devices may be located abroad, in the countries where the processes under study are taking place.

Information reaches the researcher of international relations in different languages. If electronic digital sources contain information in foreign languages in audio form, the researcher needs to know these languages in order to extract it. If a particular researcher does not know a language needed to extract information valuable for the topic from an electronic digital source, it makes sense to entrust the study of that topic to a research team that includes people who speak different languages. If the foreign-language source contains information in the form of printed text, you can use electronic translation technology to extract it. This technology is implemented both in special handheld devices called “electronic translators” and in software for multifunctional electronic devices. In publicly accessible form, it is implemented on large servers on the Internet, which can be reached from any computer connected to the Internet and equipped with a web browser. A fairly well-developed public electronic translation system, developed by Google, can be found at < translate.google.com >. It works with more than a hundred languages. Another electronic translation system that a Russian-speaking researcher should know is available at < translate.ru >. It supports fewer languages and translation directions, but gives a much higher quality translation into Russian. The fact is that this system is developed by Russian specialists and has been under development longer than Google's translation system. There are other machine translation systems as well.

Electronic translators cannot be used to extract information from sources such as regulations or political statements. The subtleties of wording that matter in such documents are lost in machine translation. Moreover, “Russian-language” texts produced by machine translation should not be quoted in academic works: sentences are often incoherent, to say nothing of style. At its current level of development, artificial intelligence lets a researcher using machine translation grasp only the general meaning of the translated source. And because many words can be translated in several ways, even the general meaning of individual sentences may not be entirely clear. To clarify the meaning of individual words, you can use electronic dictionaries: programs that work with databases in which words of one language are matched with sets of their meanings in other languages. Unlike electronic translators, these programs do not choose the best translation for the user. Publicly available electronic dictionaries can be found through web search engines with a simple query like “Chinese-Russian online dictionary”, where any other language can be substituted for “Chinese”. If you cannot find the meaning of a word this way (there is no public dictionary, or the dictionaries found do not contain the word), there are two options: either enter the unknown foreign word as a search query and, using the advanced options offered by search engines, restrict the results to Russian-language documents (if a Russian text contains a foreign word, it probably explains its meaning), or find an electronic dictionary that translates words of the unfamiliar language into English. There are many more such dictionaries on the World Wide Web, owing to the greater development of its English-language segment and the greater prevalence of English among its users. It is hard today to imagine a qualified researcher of international relations who does not speak English.

Machine translation can be used not only to translate prepared printed texts, but also for electronic correspondence with someone who does not speak Russian but knows one of the languages supported by one or another electronic translation system. To do this, it is enough to keep a browser tab open on the translation system's page and, before sending each message, enter it there, obtain the translation and send the translated text. Messages coming from the respondent can be handled the same way.

It is often necessary to convert information not only from one language to another, but also from one format to another. For example, a document found by the researcher was prepared in the Microsoft Word 2007 text editor and has the extension “docx”, and the researcher (or the particular computer he is using at the moment) does not have this program. What to do? Run to the nearest software store and buy several copies of the Microsoft Office suite, each costing a lot of money? (One copy is not enough if you work on different computers.) And what if you are not allowed to install software on someone else's computer? The problem is solved by online systems for converting electronic documents. The most advanced of these is available at < zamzar.com >. On this site, the user fills out a simple form: selects a local file to convert (the file name may contain only Latin letters and digits), chooses the target format (for example, a “docx” file can be “translated” into twenty different formats, among them “doc”, “rtf”, “pdf”, “txt”, “odt” and “html”), provides an e-mail address and waits a little. The user then receives a message with a link to the file in the desired format. With this and similar converters, it is easy to extract information from sources for which the device at hand has no viewer or processing tool.
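If the computer at hand happens to have Python installed, the plain text of a “docx” file can also be pulled out locally, because such a file is a ZIP archive whose main text is stored in `word/document.xml`. The following is a minimal sketch rather than a full converter: the function name `docx_to_text` is my own, and tables, footnotes, headers and formatting are ignored.

```python
import io
import xml.etree.ElementTree as ET
import zipfile

# XML namespace of WordprocessingML, the markup used inside docx files.
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_to_text(source) -> str:
    """Return the text of a docx file, one line per paragraph.

    `source` may be a file path or a binary file-like object.
    """
    with zipfile.ZipFile(source) as archive:
        xml_bytes = archive.read("word/document.xml")
    root = ET.fromstring(xml_bytes)
    paragraphs = []
    for paragraph in root.iter(W + "p"):
        # A paragraph's text is split across several "run" elements.
        pieces = [node.text or "" for node in paragraph.iter(W + "t")]
        paragraphs.append("".join(pieces))
    return "\n".join(paragraphs)


# Build a tiny stand-in for a docx file in memory so the sketch runs as is.
sample_xml = (
    '<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">'
    "<w:body>"
    "<w:p><w:r><w:t>Joint</w:t></w:r><w:r><w:t> statement</w:t></w:r></w:p>"
    "<w:p><w:r><w:t>Second paragraph</w:t></w:r></w:p>"
    "</w:body></w:document>"
)
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("word/document.xml", sample_xml)
buffer.seek(0)

extracted = docx_to_text(buffer)
print(extracted)
```

With a real document, the call would simply be `docx_to_text("report.docx")`, where the file name is a placeholder.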

There are many websites that, like Zamzar.com, help process various kinds of information in electronic digital form without the need to always have at hand a computer with all the software required for files of different types. Thanks to them, the researcher can work with his data from anywhere in the world, on any computer connected to the Internet.

If you need to handle e-mail and the computer has no e-mail client, a web interface will help; all major free e-mail providers offer one. The largest of them are available at the following addresses: < mail.ru >, < gmail.com >, < mail.yandex.ru >, < mail.yahoo.com >. The advantage of a web interface over an e-mail client is not only that no client needs to be installed and configured, but also that all messages ever received and sent are stored on the provider's servers (many providers offer unlimited space for this) and can be accessed from any computer, anywhere, at any time.

Instant messaging systems also have web interfaces. This type of communication is a very convenient way to receive up-to-the-minute information from eyewitnesses of or participants in events, if the researcher stays in contact with them. The ICQ system has a web interface at < go.icq.com >, Google Talk at < gmail.com >, Yahoo! Messenger at < webmessenger.yahoo.com >. In addition, there are solutions that let you use several messaging systems on one web page at the same time. The most convenient of them are available at < meebo.com >, < koolim.com >, < flick.im >. When you chat through any of these sites, the message archive is stored on a remote server and can be accessed from anywhere at any time, using only a computer connected to the Internet.

There are also several ready-made solutions for working with texts. The most comprehensive ones are available at < www.thinkfree.com >, < www.zoho.com >, < docs.google.com >. All of them try to imitate traditional office suites (above all, Microsoft Office) and offer almost all the functionality of such suites. But no program is installed on the local computer: documents are edited directly in the browser, and the files are stored on the server. That is, no special software is needed to extract information from such files. Virtually any computer connected to the Internet that can run a modern browser will do.

Tools for processing spreadsheets, viewing and creating presentations are also present on all these sites.

To extract information from audio and video files on computers that lack special software, you can use audio and video hosting services, which convert files into Flash objects that can be played in any modern browser. The largest audio hosting service in the CIS countries is available at < vkontakte.ru/audio.php >, and video hosting at < vkontakte.ru/video.php >. Using this site requires registration. But the registration requirement makes it possible to keep a file on the server without giving access to an unlimited number of people, which may be prohibited by the holders of exclusive rights to the audio or video the researcher uses as a source. On hosting services that do not require registration, anyone who can reach the site gets access to an uploaded file: since the user who uploaded it was not authenticated, granting further access to him alone is much harder because of the difficulty of identifying him.

To store web pages found on the World Wide Web, it is extremely convenient to use so-called “social bookmarking services”, which keep copies of bookmarked pages on their servers. One of the most convenient tools is offered at < diigo.com >. First, it lets you always have the addresses of the sites you need at hand. Second, it lets you highlight text and make notes directly on web pages, without saving them to the local computer or running any additional programs, and see those highlights and notes again when you later view the pages, even on other computers (this is very useful when working with text sources). And most importantly, if the content of a page changes, you can always return to the document as it was when you bookmarked it.

If there is a link to a document on the World Wide Web but the document cannot be found at that address, it was probably there once and has since been deleted. How do you extract information from an electronic digital source that no longer exists? Here the Wayback Machine, a tool created by the Internet Archive, a public organization that deals with the practical side of preserving the digital heritage, can help. The Wayback Machine lets you see how a particular web page looked at a certain moment. Unfortunately, not all websites are saved in the Internet Archive's “digital library”, but there is still a substantial chance of finding significant documents that were once published on the web. In the Wayback Machine, you enter the address at which the desired document was located, select the date and time when it could have been there, and see what was actually available at that address at that time.
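The same lookup can be expressed as a plain web address, which is handy when many links need checking. The sketch below only builds the addresses and makes no network requests. It assumes the address patterns used by the Internet Archive at the time of writing; the function names and the example page are hypothetical.

```python
from datetime import datetime
from urllib.parse import urlencode


def wayback_snapshot_url(page_url: str, moment: datetime) -> str:
    """Address of the archived copy nearest to the given moment."""
    timestamp = moment.strftime("%Y%m%d%H%M%S")
    return f"https://web.archive.org/web/{timestamp}/{page_url}"


def wayback_availability_query(page_url: str, moment: datetime) -> str:
    """Address of a JSON lookup that reports whether a snapshot exists."""
    params = urlencode({"url": page_url, "timestamp": moment.strftime("%Y%m%d")})
    return f"https://archive.org/wayback/available?{params}"


# Hypothetical page and date, used only for illustration.
page = "http://example.com/report.html"
when = datetime(2008, 8, 15, 12, 0, 0)

snapshot = wayback_snapshot_url(page, when)
query = wayback_availability_query(page, when)
print(snapshot)
print(query)
```

Opening the first address in a browser should lead to the snapshot closest to the requested time, if one exists; the second returns a short machine-readable answer that a script can check before a person spends time on the link.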

Some search engines offer similar capabilities. If a search engine found a document and provided a link to it, but the link does not lead to the desired document, you can sometimes view the copy of the page that the search engine stored on its server when it indexed the page.

When searching and processing news, it is convenient to collect all of it in one place. This is possible thanks to the RSS feeds offered by virtually every website with regularly updated content: news sites, blogs, forums, etc. By subscribing to many RSS feeds that reflect documents on a topic of interest appearing on different sites, the researcher can quickly see new arrivals, read everything important and miss nothing, without having to visit all these sites and look for new material. This is done with feed reader programs, many of which also run on remote servers and are accessible from any computer connected to the Internet: < google.com/reader >, < lenta.yandex.ru >, etc.
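What a feed reader does under the hood can be sketched in a few lines: read the items of each feed and merge them into one list sorted by date. The example below works on two tiny hand-written RSS 2.0 documents instead of downloading real feeds; the function names, headlines and links are invented, and the sketch assumes every item carries a `pubDate`.

```python
import xml.etree.ElementTree as ET
from email.utils import parsedate_to_datetime


def read_items(rss_xml: str) -> list[dict]:
    """Extract title, link and publication date from an RSS 2.0 document."""
    root = ET.fromstring(rss_xml)
    items = []
    for item in root.iter("item"):
        items.append({
            "title": item.findtext("title", default=""),
            "link": item.findtext("link", default=""),
            # RSS 2.0 dates use the same format as e-mail headers.
            "published": parsedate_to_datetime(item.findtext("pubDate")),
        })
    return items


def merge_feeds(feeds: list[str]) -> list[dict]:
    """Combine several feeds into one list, newest entries first."""
    merged = [entry for feed in feeds for entry in read_items(feed)]
    return sorted(merged, key=lambda entry: entry["published"], reverse=True)


# Two invented feeds standing in for real news sites.
feed_a = """<rss version="2.0"><channel><title>Site A</title>
<item><title>Treaty signed</title><link>http://a.example/1</link>
<pubDate>Mon, 01 Sep 2008 10:00:00 +0000</pubDate></item>
</channel></rss>"""

feed_b = """<rss version="2.0"><channel><title>Site B</title>
<item><title>Summit announced</title><link>http://b.example/7</link>
<pubDate>Tue, 02 Sep 2008 08:30:00 +0000</pubDate></item>
</channel></rss>"""

merged = merge_feeds([feed_a, feed_b])
for entry in merged:
    print(entry["published"].date(), entry["title"], entry["link"])
```

A real reader would fetch each feed's text from its address on a schedule and remember which entries have already been shown.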

There are also several dozen websites that integrate all or many of the above types of services. For example, users of < desktoptwo.com > can, directly on the site and without launching any program other than the browser on the local computer, work with an HTML editor, the OpenOffice.org office suite, an RSS reader, an MP3 player and an instant messaging system. At < g.ho.st >, you can likewise run an RSS reader, notepad, clock, e-mail client and instant messaging program on a remote server; in addition, without leaving the site, you can run Zoho office applications, watch videos from YouTube and view images from the Flickr photo hosting service. And on the < ulteo.com > server, you can run a full-fledged Linux distribution with a wide range of applications and work with them directly in the browser window.

The last example requires installing a special applet on the computer. All the other services work in the browser alone. Some need support for AJAX technology, others for Flash. Most modern browsers support both. If the researcher may end up at a computer with Internet access but without such a browser, he can use the portable version of the free Mozilla Firefox browser, which can be carried on a small storage device (for example, a memory card) and run without installation on almost any modern computer. This browser can be downloaded, for example, from < portableapps.com >. The program is also convenient because it can store account information for different servers (including passwords), so that logging in to different resources from different computers is faster and easier, while confidential information about the user's web sessions is not stored on the computers from which he goes online.

If a computer cannot run the portable browser (for example, because it lacks the operating system the browser requires), an operating system distribution that runs from a CD and does not need to be installed on the hard disk may be useful. Such distributions exist for different operating systems. The most common one that includes a modern browser is available for download at < knoppix.com >.

Many sources of information about international relations contain confidential information to which access is restricted. Such sources may be electronic. You can consult them only after obtaining clearance, and because the information is secret it cannot be cited in academic works. But the computers of organizations engaged in international relations may also hold electronic document databases that are not secret but are inaccessible through public networks. They can be reached only from computers connected to these organizations' local networks. In practice, the only way to obtain information from such sources (absent direct access to those networks) is to ask the managers or employees authorized to copy information from such databases to find and provide particular documents. Of course, no one is obliged to fulfill such a request from a researcher; help can be given only voluntarily.

In addition, electronic digital sources that a researcher of international relations needs are often posted on the World Wide Web but are not open to everyone. They may be accessible only to people who have an account on the hosting site, or even only to certain visitors: those granted access by the person who posted the information. This happens more and more often with the spread of a new type of information system on the web: sites whose content is created by an unlimited number of users, who can grant particular visitors access rights to particular records. Today such sites are the most convenient way to obtain information about events from their participants or eyewitnesses. Finding restricted-access sources of this kind on the World Wide Web is not easy. Search engines cannot index them, so a simple keyword search can cause the researcher to miss such important sources.

Today, it is important for a researcher of social processes to have accounts on different “social networks”, blogging sites, thematic forums, etc., in order to gain access both to information that these sites open only to logged-in users and to the sites' internal search mechanisms. Search systems on such sites let you look for communities by interest and for users by place of residence, political views or other information they provide about themselves. That is why such sites are an extraordinarily valuable resource for studying social processes, systems and relationships, including international ones.

However, simply registering on a number of such sites is not enough. As already mentioned, users often have the right to restrict access to the information they post. Therefore, merely observing these sites in order to track relevant information is not the best way to extract information from such sources. To obtain the full picture, monitoring of new information on these sites must be supplemented by participant observation: taking part in the processes of social communication carried out through these sites, actively joining community discussions and corresponding with people who can provide valuable information about the subject of the research. Without personal communication with the people who post valuable information on such sites under restricted access, it is almost impossible to extract important information from such sources.

The examples of electronic document databases accessible only through the local networks of certain organizations, and of information available through the World Wide Web only to a limited circle of people, show that working with electronic digital sources cannot be regarded as an alternative to seeking information through social contacts. Without social interaction, it is impossible to obtain comprehensive information about social processes, systems and relationships.

(First published in my FAQ on academic work)

Source: https://habr.com/ru/post/38510/
