
Behind the wheel of a vacuum cleaner, or a smart Firefox

A browser that follows links, opens and closes tabs, parses pages, and saves all the content to the file system: interesting to watch, isn't it? Personally, I wanted to build something like that. No science fiction involved! Once again a lazy programmer's inspiration woke up in me, and instead of writing a regular crawler (also known as a spider, or simply a site parser) in PHP, Perl, or Ruby, I set out to do it in Firefox.

Preparation


In order to do all this, I needed a JavaScript console from which I would have access to the XUL DOM. I wrote a very simple Firefox extension that adds an icon to the lower right corner of the browser window.


Clicking this icon opens a window in which you can write JavaScript code and manage browser elements. The extension can be downloaded here.
To run the script, press Ctrl+Enter (the focus must be on the textbox element).

For those who are out of the loop


Since not all programmers are familiar with the JavaScript objects available inside Firefox itself, I think a bit of theory will be useful. As in HTML, the main players are window and document. To access the page content (the HTML DOM), write:
var links = window.content.document.getElementsByTagName('a');
// or just
var links = content.document.querySelectorAll('a');
alert(links.length);
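
As a small extension of the snippet above, here is a sketch of how the found links could be turned into a list of URLs for a crawler to visit. The function name collectLinks and the filtering rules (http and https only, fragments stripped, duplicates dropped) are assumptions made for illustration, not part of the original extension.

```javascript
// Hypothetical helper: collect unique absolute http(s) URLs from a document.
// `doc` is any object with querySelectorAll, e.g. content.document in the console.
function collectLinks(doc) {
  var seen = {};
  var urls = [];
  var anchors = doc.querySelectorAll('a[href]');
  for (var i = 0; i < anchors.length; i++) {
    // In the DOM, anchor.href is already resolved to an absolute URL.
    var url = String(anchors[i].href).replace(/#.*$/, ''); // drop the fragment
    var isWeb = /^https?:\/\//i.test(url);                 // skip javascript:, mailto:, ...
    if (isWeb && !Object.prototype.hasOwnProperty.call(seen, url)) {
      seen[url] = true;
      urls.push(url);
    }
  }
  return urls;
}
```

In the extension's console it could be tried with alert(collectLinks(content.document).join('\n')).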




Another very important point is how to catch the moment when the HTML DOM has loaded, and how to work with tabs and the browser element. The global variable gBrowser refers to the tabbrowser element, through which you can manipulate tabs. A few examples:
// add a new tab and switch to it
var tab = gBrowser.addTab('http://habrahabr.ru');
gBrowser.selectedTab = tab;

// get the browser element for the tab
var browser = gBrowser.getBrowserForTab(tab);

// listen for the DOM of any document loaded in this tab
browser.addEventListener('DOMContentLoaded', function (event) {
  // listener implementation
}, false);

// the same, but react only to HTML documents
browser.addEventListener('DOMContentLoaded', function (event) {
  if (event.originalTarget instanceof HTMLDocument) {
    var document = event.originalTarget;
    // listener implementation
  }
}, false);
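
To show how these pieces could combine into a crawler that visits pages one at a time, here is a minimal sketch. The names crawlQueue and openInTab are hypothetical. The queue itself is plain JavaScript, while openInTab is one possible way to wire it to gBrowser using the calls from the examples above plus gBrowser.removeTab, and it only works inside Firefox.

```javascript
// Hypothetical helper: visit `urls` one at a time.
// `openPage(url, done)` must load the URL and call done(document)
// once the HTML DOM of that page is ready.
function crawlQueue(urls, openPage, onPage, onFinished) {
  var queue = urls.slice(); // copy, so the caller's array is not modified
  function next() {
    if (queue.length === 0) {
      if (onFinished) { onFinished(); }
      return;
    }
    var url = queue.shift();
    openPage(url, function (document) {
      onPage(url, document);
      next(); // move on only after the current page has been handled
    });
  }
  next();
}

// One possible openPage for the Firefox console (assumes gBrowser exists).
function openInTab(url, done) {
  var tab = gBrowser.addTab(url);
  var browser = gBrowser.getBrowserForTab(tab);
  browser.addEventListener('DOMContentLoaded', function listener(event) {
    var document = event.originalTarget;
    // ignore frames: react only to the tab's top-level HTML document
    if (document instanceof HTMLDocument && browser.contentWindow.document == document) {
      browser.removeEventListener('DOMContentLoaded', listener, false);
      done(document);
      gBrowser.removeTab(tab); // close the tab once the page is processed
    }
  }, false);
}
```

In the console this could be started with crawlQueue(['http://habrahabr.ru'], openInTab, function (url, document) { /* parse or save */ }). Processing pages strictly one after another keeps the number of open tabs at one, at the cost of speed.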



I also think it is useful to know how to save a web page to a file:
var tab = gBrowser.addTab('http://habrahabr.ru');
gBrowser.selectedTab = tab;
var browser = gBrowser.getBrowserForTab(tab);
browser.addEventListener('DOMContentLoaded', function (event) {
  var document = event.originalTarget;
  if (document instanceof HTMLDocument && this.contentWindow.document == document) {
    var basename = document.location.href.replace(/\/+$/, ''),
        pos = basename.lastIndexOf('/');
    if (pos != -1) {
      basename = basename.substr(pos + 1);
    }

    var file = Components.classes["@mozilla.org/file/local;1"].createInstance(Components.interfaces.nsILocalFile);
    file.initWithPath('/tmp/' + basename);
    if (!file.exists()) {
      var chosen = new AutoChosen(file, makeFileURI(file));
      internalSave(document.location.href, document, null, null, document.contentType, false, null, chosen);
    }
  }
}, false);



The additional condition this.contentWindow.document == document handles pages that contain iframe elements: the listener is also called for frame documents, but the page should be saved only once, and it should be the top-level document rather than a frame.
For details on the internalSave function (which saves the file) and the AutoChosen class (which, as far as I understand, emulates the user choosing a file), read the comments in the source code. More examples of working with files are here.
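
The file name in the listing above is simply the last path segment of the URL, so a page address with a query string would produce a name such as page?id=1. Below is a sketch of a slightly more defensive variant. The function name fileNameFromUrl, the set of allowed characters, and the fallback name index are all assumptions chosen for illustration.

```javascript
// Hypothetical helper: derive a file-system-friendly name from a page URL.
function fileNameFromUrl(url) {
  var name = url
    .replace(/[?#].*$/, '')  // drop the query string and fragment
    .replace(/\/+$/, '');    // drop trailing slashes, as in the listing above
  var pos = name.lastIndexOf('/');
  if (pos != -1) {
    name = name.substring(pos + 1);
  }
  // replace anything outside a conservative set of characters
  name = name.replace(/[^A-Za-z0-9._-]/g, '_');
  return name || 'index'; // fall back to a fixed name if nothing is left
}
```

The result could then be passed to file.initWithPath('/tmp/' + fileNameFromUrl(document.location.href)) in place of basename.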

Training Firefox


Let's write a script that teaches Firefox to log in to Habr as a specific user.
var Crawler = {
  // constructor of the Habr parser
  habrahabr: function (username, password) {
    this.username = username;
    this.password = password;
  },
  // call callback once, when the top-level HTML document of browser is ready
  onHTMLLoaded: function (browser, callback) {
    browser.addEventListener('DOMContentLoaded', function listener(event) {
      var document = event.originalTarget;
      if (document instanceof HTMLDocument
          && this.contentWindow.document == document
      ) {
        this.removeEventListener('DOMContentLoaded', listener, false);
        callback.call(this, event, document);
      }
    }, false);
    return browser;
  }
};

Crawler.habrahabr.prototype = {
  url: 'http://habrahabr.ru/',
  openAndSignIn: function (inNewTab) {
    var tab = gBrowser.selectedTab, browser = null;
    if (inNewTab) {
      tab = gBrowser.addTab(this.url);
    } else {
      content.document.location = this.url;
    }
    browser = gBrowser.getBrowserForTab(tab);
    var that = this;
    Crawler.onHTMLLoaded(browser, function (event, document) {
      // go to the login page
      document.location = document.querySelector('dl.panel-personal a').href;
      Crawler.onHTMLLoaded(this, function (event, document) {
        // fill in the login form and submit it
        var user = document.getElementById('reg-f-username');
        user.value = that.username;
        document.getElementById('reg-f-password').value = that.password;
        user.form.querySelector('input[type="submit"]').click();
      });
    });
  }
};
var cr = new Crawler.habrahabr('serjoga', '***');
cr.openAndSignIn(true);




First, create a Crawler object, to which new parsers can be added later. Then define a constructor for the habrahabr parser that takes the user name and password, and add the openAndSignIn method to its prototype. This method opens Habr and logs in as the given user.
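
The login flow nests one Crawler.onHTMLLoaded callback inside another, which becomes awkward when a scenario spans many pages. One way to flatten it is sketched below. The name runSteps is hypothetical, and the waitForLoad parameter is assumed to behave like Crawler.onHTMLLoaded, calling its callback once for the next top-level page load.

```javascript
// Hypothetical helper: run `steps` in order, one step per page load.
// waitForLoad(browser, callback) is assumed to call callback(event, document)
// exactly once, when the next top-level HTML document is ready.
function runSteps(browser, steps, waitForLoad) {
  var index = 0;
  function next() {
    if (index >= steps.length) {
      return; // all steps done, do not register another listener
    }
    waitForLoad(browser, function (event, document) {
      var step = steps[index++];
      step(document); // a step normally triggers the next navigation
      next();         // then wait for the page that navigation loads
    });
  }
  next();
}
```

With this helper, the two nested callbacks of openAndSignIn could be written as an array of two functions passed to runSteps(browser, steps, Crawler.onHTMLLoaded), and a longer scenario would only add more entries to the array.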

P.S. I often automate the reproduction of bug test cases this way. It is fun and not tiring, especially when you need to press a bunch of buttons and walk through 10 or more pages on 2 sites.

Source: https://habr.com/ru/post/123568/

