In my previous posts (
part 1 ,
part 2 ) I described how to get data from the Internet as HTML, how to set up a simple service to load data regularly, how to adjust HTML and load it into a CLR object. In this post we will discuss how to store and update data in the database. I will also give a complete description of the scrapping process.
Avoiding Repetitions by UPSERT Autogeneration (MERGE DML)
If you have a service that writes at least something in the database, it is important to avoid repetitions, i.e. duplicate entries. The solution is to create a UPSERT procedure. Upsert is an update or insert depending on whether there is already a record or not. If there is no record, you can add it. If it is already there, you can upgrade it.
In SQL Server 2008, instead of triggers and other perversions, you can use the MERGE statement, which was created specifically for implementing UPSERT behavior. One problem is that the instruction itself looks terrible, so it is best to auto-generate it from existing entities.
')
My approach to generating MERGE DML is something like this: since the ORM does not store information on which elements must match for it to be UPDATEd and not INSERTed, it is easiest for me to control this file manually. On the other hand, I have one or another model, and I want to use it to generate initial data.
Let's see how this is done using EF4.0. In EF, we have a file with the extension of EDMX, and if you dig it deep into XPath,
Edmx/Runtime/ConceptualModels/Schema
, we get a description of all the entities. In order to replace them with something-there, you must first find the
System.Data.Resources.CSDLSchema_2.xsd
scheme - it is located in the same place where Studio is installed, in the
\xml\Schemas
.
For entities, it will not be possible to transform EDMX into SQL right away for a number of reasons - first, we don’t forget about the EDMX schema. it is composite and does not parse, well, and even if we were to suppress it, we would have to edit the generated SQL to remove from it those comparisons that are “formative”. Now I will explain what's what.
So, take a typical case - the entity
Person { Name, Age }
that needs to be updated (the age changes) or add a new one (if the name is new).
The first thing we do is tear out the
<Schmema>
section from the conceptual scheme. We get about the following:
<Schema><br/>
<EntityContainer Name= "ModelContainer" annotation:LazyLoadingEnabled= "true" ><br/>
<EntitySet Name= "People" EntityType= "Model.Person" /><br/>
</EntityContainer><br/>
<EntityType Name= "Person" ><br/>
<Key><br/>
<PropertyRef Name= "Id" /><br/>
</Key><br/>
<Property Type= "Int32" Name= "Id" Nullable= "false" annotation:StoreGeneratedPattern= "Identity" /><br/>
<Property Type= "String" Name= "Name" Nullable= "false" /><br/>
<Property Type= "Int32" Name= "Age" Nullable= "false" /><br/>
</EntityType><br/>
</Schema><br/>
Next, we create a mapping that translates this XML into simpler XML (relatively), in which we can mark which fields can change and which ones cannot.

As a result of the transformation, we get something like this:
<tables><br/>
<table name= "Person" ><br/>
<field type= "String" name= "Name" /><br/>
<field type= "Int32" name= "Age" /><br/>
</table><br/>
</tables><br/>
Id
field did not get here, because in the Upsert operation, we do not compare Id. (On the other hand, it should be remembered that in the generated procedure we return
SCOPE_IDENTITY()
, therefore we will not get to give
Id
type like
uniqueidentifier
.) Then, this document is transformed by another XSLT (which is already many years :) and the result is exactly what you need namely:
/* Check that the stored procedure does not exist, and erase if it does. */<br/>
if object_id ( 'dbo.PersonUpsert' , 'P' ) is not null <br/>
drop procedure [dbo].[PersonUpsert];<br/>
go <br/>
/* Upserts an entry into the 'Person' table . */<br/>
create procedure [dbo].[PersonUpsert](<br/>
@Id int output ,<br/>
@Name nvarchar( max ),<br/>
@Age int )<br/>
as <br/>
begin <br/>
merge People as tbl<br/>
using ( select <br/>
@Name as Name,<br/>
@Age as Age) as row <br/>
on <br/>
tbl.Name = row .Name<br/>
when not matched then <br/>
insert (Name,Age)<br/>
values ( row .Name, row .Age)<br/>
when matched then <br/>
update set <br/>
@Id = tbl.Id,<br/>
tbl.Name = row .Name,<br/>
tbl.Age = row .Age<br/>
;<br/>
if @Id is null <br/>
set @Id = SCOPE_IDENTITY()<br/>
return @Id<br/>
end <br/>
Now this stored procedure is mapped to EF, Linq2Sql, or some other ORM, and it can be used. Here is an example in EF4:
var op = new ObjectParameter( "Id" , typeof (Int32));<br/>
using ( var mc = new ModelContainer())<br/>
{<br/>
// add me
mc.PersonUpsert(op, "Dmitri" , 25);<br/>
mc.SaveChanges();<br/>
}<br/>
In the example above, we can also check whether a new object has been added or an old one has been updated, and in any of the cases we will be able to get an
Id
object for later use. Of course, in a typical usage scenario, all these processes are implemented through Repository / UnitOfWork with
TransactionScope
and everyone else there.
I note that it is quite possible to make one T4 file instead of “double jump” with XSLT, which I would do everything myself, but this is such a tedious task that in principle it is easier to do as I described. Of course, the fact that you have to rip out
<Schema>
from the EDMX is also imperfect, but for now it will do. By the way, I also want to note that for unknown reasons (or maybe I was looking badly) there is no mapper who could map XML to TXT and at the same time produce an XSLT transformation. I looked at FlexText, but this program did not allow me to make inserts in the rows, and MapForce spawned only C # with it, but refused to do XSLT.
Full process description
It is time to fully describe the process of creating a typical scraper. In short, in a typical implementation, we perform the following actions:
- Find those pages that need to be processed and look at them using FireBug
- Download pages - if you need complex authentication or input from the user, use WatiN, otherwise use
WebRequest
, etc. - We find on the pages those elements that we need and
- Transform elements to make them XML-compatible.
- We make an entity to store data from this piece of XML.
- Make a collection
Collection<T>
class for this entity. - We generate the corresponding XSD for the collection class with
xsd -t:MyCollection MyAssembly.exe
- Auto-generating XSD from HTML source
- Create a mapping from one XSD to another
- In the code, we do the mapping from the processed HTML to XML
- Read entity or collection of entities from received XML
- Create an Upsert procedure (example):
- Rip out the
<Schema>
element from the WSDL - Transform element into simplified form
- Transform the resulting XML into SQL to create a stored procedure
- Create a stored procedure in the database
- Import the stored procedure into our ORM
- After creating the entity, write it to the database (update or create a new one)
Something like that. Of course, there are certainly simpler ways. Again, as someone already wrote, you can use Linq directly instead of mappings, and it works quite well in simple scripts. Good luck!