Introduction

Users very often ask to have the DoubleGIS directory installed (this is not an advertisement), especially if they go on business trips or deal with people from other cities.
And like any system administrator, I had the idea of updating DoubleGIS for all cities automatically and centrally.
For several reasons I decided to do this with Linux.
One reason was the lack of a centralized-update solution for this operating system.
Another was that the site does not offer Linux users a single archive containing all the databases and the shell.
In this article I will tell you how to update DoubleGIS for all cities using Linux console tools.
What is needed?
- A Linux server (mine runs Fedora 15)
- wget
- sed, grep
- unzip
- Your favorite text editor
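Everything except the server itself ships with most distributions. If something is missing, on Fedora the tools can be pulled in with yum; a minimal sketch (the package names are assumed to match the Fedora repositories):
# Install the tools used below (run as root)
yum install -y wget sed grep unzip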
Writing the script
Here is the script I ended up with, step by step.
Downloading a web page with links to cities.
wget --no-proxy --html-extension -P/root/2gis 'http://www.2gis.ru/how-get/linux/'
From all the downloaded HTML files we pull out the lines containing links, sort them, remove duplicates, and write the result to a temporary file, index.tmp.
cat /root/2gis/*.html | grep http:\/\/ |sort |uniq >/root/2gis/index.tmp
Delete the web page; it is no longer needed.
rm -f /root/2gis/*.html
With this scary-looking command we process index.tmp, pull out all the links containing the how-get string, and immediately download the web pages they point to.
cat /root/2gis/index.tmp | grep -o [\'\"\ ]*http:\/\/[^\"\'\ \>]*[\'\"\ \>] | sed s/[\"\'\ \>]//g | grep how-get | xargs wget --no-proxy -r -p -np -l1 -P/root/2gis --tries=10 --html-extension --no-directories --span-hosts --dot-style=mega
Remove index.tmp; it only gets in the way.
rm -f /root/2gis/index.tmp
Concatenate all the files with the html extension into a single index2.tmp.
cat /root/2gis/*.html >/root/2gis/index2.tmp
Remove downloaded web pages.
rm -f /root/2gis/*.html
Now the most interesting part: pull out the links to the updates and download the files they point to.
We process index2.tmp for links containing the string "/last/linux/", sort them, remove duplicates, and immediately download only the new files into the 2gis.arch folder.
cat /root/2gis/index2.tmp | grep -o [\'\"\ ]*http:\/\/[^\"\'\ \>]*[\'\"\ \>] | sed s/[\"\'\ \>]//g | grep "/last/linux/" | sort | uniq | xargs wget --no-proxy -nc -P/root/2gis.arch --tries=3 --html-extension --no-directories --span-hosts --dot-style=mega
Delete all temporary files.
rm -fr /root/2gis/index*
Extract all the zip files from the archives folder into the target folder /root/2gis/.
unzip -o /root/2gis.arch/\*.zip -d /root/2gis/
Delete archives older than 20 days so that old versions do not accumulate as duplicates.
find /root/2gis.arch/ -name '*' -mtime +20 | xargs rm -fr
Now the /root/2gis folder contains the unpacked DoubleGIS for all cities, and the /root/2gis.arch folder contains the archives for Linux users downloaded from the site.
Schedule the script with cron.
I run it every day; the script does not download files it already has.
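A minimal crontab sketch, assuming the commands above have been saved as an executable file at the hypothetical path /root/update2gis.sh (the schedule and the log path are likewise my own choices):
# Run the update every night at 03:00; add this line via crontab -e as root
0 3 * * * /root/update2gis.sh >/var/log/2gis-update.log 2>&1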
Conclusion
The structure of the DoubleGIS site changes constantly, so it is possible that the script will fail to download an update. I recommend checking on it periodically.
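One simple way to keep an eye on it is to check that new archives keep arriving. A minimal sketch, assuming the folders from the script above; the seven-day window is an arbitrary choice of mine:
# Warn if nothing new has appeared in the archive folder for a week
# (ctime is checked because wget can set mtime from the server's timestamps)
if [ -z "$(find /root/2gis.arch/ -name '*.zip' -ctime -7)" ]; then
    echo "2GIS update: no new archives for a week, check the script" >&2
fi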
UPDATE 12/31/2011: Edited the script and removed everything unnecessary.
The new version:
wget -O - 'http://www.2gis.ru/how-get/linux/' 2>/dev/null | sed "s/^.*\(http:\/\/[^\"\'\ ]*\/how-get\/linux\/\).*$/\1/g" |\
grep "how-get\/linux"|sort|uniq|xargs wget -p -O - 2>/dev/null |sed "s/^.*\(http:\/\/[^\"\'\ ]*\/last\/linux\/\).*$/\1/g"|grep "last\/linux"| sort|uniq|\
xargs wget -N -P/root/2gis.arch
unzip -o /root/2gis.arch/\*.zip -d /root/2gis/
P.S. Thanks to kriomant for the constructive criticism.
Happy New Year everyone!