
Anonymous business logic data: separation from personal data

If there were a sufficiently reliable way to separate user identification data (name, address, billing data, etc.) from confidential user business logic data (primarily data for statistical analysis and predictive modeling: for example, what people or significant software agents, in a broad sense, do, when, and how, including financial transactions and the results of analyzing video streams and calls), then businesses could worry less about the confidentiality of their data. They could save money by processing and storing meaningful business logic data on shared remote servers, rather than relying on in-house solutions where the data never leaves the company's premises in order to protect confidentiality.

That is, the server owners would know that the business logic data belongs to the user “x290230x9sksfpoaopdfsafl”, but they would not know who this user is and would have no reasonably available way to find out.

In this case, the data can be analyzed and aggregated with other users' data by the usual means.

At the same time, the users themselves work with their interface exactly as they would if a standard data storage scheme were in place.

Generally speaking, such a separation matters not only for the cost savings that come from moving further away from in-house solutions, but also for protecting the privacy of web service users.

Thus, the question can be considered in the context of what the web should (or can) become as computing capacity, the means of observation and behavior analysis, and the potential for control all grow.

Below are the details and a description of the possible architecture.

Personally, online calendars and planners have always worried me.
Obviously, if I keep my scheduled meetings and actions in them and stick to the plan in most cases, then my route and my actions can be not just roughly predicted but very likely known, and therefore influenced.

As long as nobody cares about me, that is fine. And if you are not “going against the system” one way or another, then it is as if no one needs you.

But even so, the feeling that a “neuroculture” is gradually being built around us does not go away.

And if we do not notice this, we are like the chickens in the well-known proverb, who are fed every day and kept warm, and who conclude that because it was so yesterday and is so today, it is “quite obvious” that it will be so tomorrow, simply because that is all our personal, sensory (but by no means historical or social) experience of life has ever shown.

In the same way, for example, I would like a social network, even with access to my correspondence, to be unable to determine who is corresponding with whom.

There is a saying that power corrupts, and absolute power corrupts absolutely. In this sense, two main vectors of development can be observed in the world. Taken to their limits, they are a movement towards the absolute power of a few through the lever of automated centralized management, on the one hand, and a movement towards a world of equals through decentralization, autonomy, and integration with decentralized AI, on the other.

This topic therefore seems relevant both in the short term, for minimizing costs, and in the long term, for reducing the risks created by the centralization of social management and for being able to take additional steps that strengthen the decentralization vector.

So I want to share a possible solution I have found that would allow data to be stored and processed in such a way that identification data and business logic data function as if they were separate, with the difficulty of matching them growing as the number of users of the system increases.

I would also like to hear the community's opinion on how else one might achieve a similar goal, and whether there are flaws that fundamentally undermine the architecture.




For example, suppose there are two databases: a relational DBMS (RDBMS / RDB) and a column-oriented DBMS (CDBMS / CDB).

The first is “personalized”. It stores personal user information: name, address, billing information, quotas.

The second database is anonymous. It contains business logic data owned by users (each record carries a secret_account_id), but it is structurally impossible to match this data to users' personal information, that is, to find out which user owns which data.

When the user first authenticates in the system, a secret_account_id is generated on the client side and encrypted with the user's GPG key.

The encrypted value is then written to the personalized DB (RDB) in the encrypted_secret_account_id field.

The CDB has a secret_account_id field, in which the secret account ID is stored unencrypted.

So, before making a request to the CDB within a session, the client fetches the encrypted string encrypted_secret_account_id (from the RDB or a cache), decrypts it with its private key, and sends a request with its secret_account_id to the API that works with the CDB.
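The flow above can be sketched roughly as follows. This is only an illustration under stated assumptions: the two databases are in-memory stand-ins, the function names (register, open_session, cdb_insert, cdb_select) are hypothetical, and toy_cipher is a placeholder for GPG that is not secure and only keeps the sketch runnable without external libraries.

```python
import secrets

# In-memory stand-ins for the two databases (hypothetical layout).
rdb = {}   # personalized DB: user_id -> profile row
cdb = []   # anonymous DB: rows keyed only by secret_account_id


def toy_cipher(key: bytes):
    """Placeholder for GPG: repeating-key XOR. NOT secure, illustration only."""
    def xor(data: bytes) -> bytes:
        return bytes(b ^ key[i % len(key)] for i, b in enumerate(data))
    return xor, xor  # XOR is its own inverse, so encrypt == decrypt here


def register(user_id, name, encrypt):
    """First authentication: the client generates the id and stores only its ciphertext."""
    secret_account_id = secrets.token_hex(16)
    rdb[user_id] = {
        "name": name,
        "encrypted_secret_account_id": encrypt(secret_account_id.encode()),
    }
    return secret_account_id


def open_session(user_id, decrypt):
    """Client side: fetch the ciphertext from the RDB and decrypt it locally."""
    blob = rdb[user_id]["encrypted_secret_account_id"]
    return decrypt(blob).decode()


def cdb_insert(secret_account_id, payload):
    """CDB API: sees the plaintext id, but nothing that identifies the person."""
    cdb.append({"secret_account_id": secret_account_id, **payload})


def cdb_select(secret_account_id):
    """CDB API: return the rows that belong to one anonymous account."""
    return [row for row in cdb if row["secret_account_id"] == secret_account_id]
```

In this sketch the RDB row never contains the plaintext id and the CDB rows never contain a name, so the only link between them is the ciphertext that the client alone can decrypt.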

This raises the question of how to provide token-based access to the CDB API so that only authorized clients can use it, while at the same time:

1) even someone who owns both databases cannot tell which user made a particular request to the CDB API at a given time (and thereby match the data);

2) a token can be revoked if necessary (for example, if no payment was received), or resource quotas can be set for an account according to the selected tariff.

One possible solution is as follows:

- “Mock” token requests for all users (or for all users within a category, if users are divided into categories).

That is, during its session the client automatically requests a token once per period (for example, once a second or once every half second), even if it does not actually need a token and is not working with the database at that moment.

The token itself is common to all users (or to all users of the category to which the requesting user belongs).

To reduce the load, caching or subscriptions are used (for example, Fastly, WebSockets).

The tokens are re-created and overwritten in the cache at random intervals, a random number of times within set limits.

- All requests to the database are made through a reasonably reliable (all other things being equal) anonymizer, for example Tor.
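The shared, randomly rotated token could look roughly like the sketch below. The class name SharedTokenCache, the TTL limits, and the injected clock value now are all hypothetical. Accepting the previous token for requests already in flight is an added assumption, not something the scheme above specifies.

```python
import random
import secrets


class SharedTokenCache:
    """One token per user category, re-created at random intervals."""

    def __init__(self, min_ttl=0.5, max_ttl=5.0, rng=None):
        self.min_ttl = min_ttl
        self.max_ttl = max_ttl
        self.rng = rng or random.Random()
        self.current = {}    # category -> (token, expires_at)
        self.previous = {}   # category -> last token, still accepted (assumption)

    def get(self, category, now):
        """What every client polls, whether or not it needs the token right now."""
        token, expires_at = self.current.get(category, (None, 0.0))
        if token is None or now >= expires_at:
            if token is not None:
                self.previous[category] = token
            token = secrets.token_urlsafe(16)
            expires_at = now + self.rng.uniform(self.min_ttl, self.max_ttl)
            self.current[category] = (token, expires_at)
        return token

    def is_valid(self, category, token):
        """CDB API check: the token proves category membership, not identity."""
        current = self.current.get(category, (None, 0.0))[0]
        return token is not None and token in (current, self.previous.get(category))
```

Under this sketch, revocation would amount to the personalized side no longer serving get() to a given user. Because tokens rotate, that user's last copy soon stops being accepted. Quotas could follow from which category the user is allowed to poll.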

Thus, someone who owns both the server with users' personal information and the database with confidential business logic data can match the two only with a probability inversely proportional to the number of users of the system (or the number of users in a given category):

1 user - 1/1 (a certain match)
10 users - 1/10
1,000 users - 1/1,000
100,000 users - 1/100,000
etc.
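The same arithmetic as a small sketch, which also shows the cost of dividing users into categories: each category becomes its own, smaller anonymity set. It assumes requests are otherwise indistinguishable (no timing or network leaks), and the category names and sizes are made up for illustration.

```python
from fractions import Fraction


def match_probability(anonymity_set_size: int) -> Fraction:
    """Chance of guessing which user sent a request that carries a shared token."""
    if anonymity_set_size < 1:
        raise ValueError("at least one user is required")
    return Fraction(1, anonymity_set_size)


# Hypothetical category sizes: splitting users by tariff shrinks each anonymity set.
categories = {"free": 99_000, "business": 990, "enterprise": 10}
for name, size in categories.items():
    print(name, match_probability(size))
```

With these made-up numbers, a user in the smallest category is far easier to single out than the total user count alone would suggest.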

At the same time, it is necessary to ensure that the API that processes the pieces of personal data contained in the business logic data does not extract, save, or log that personal information.

That is, it should be open source code that can be run and connected wherever the user chooses (in-house, on a server, in the cloud), or a boxed solution if the specifics of the data justify it.

It is also possible to have automatically detected fragments of personal data (for example, about the participants in financial transactions) encrypted and stored in the CDB in encrypted form.
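A minimal sketch of such field-level protection, assuming the detection step has already flagged which fields are personal. The names PERSONAL_FIELDS, is_personal, and protect_record are hypothetical, and placeholder_encrypt is deliberately not encryption: it only stands in for encryption with the data owner's public key so that the example runs without a crypto library.

```python
import base64

# Hypothetical field names that a detector has flagged as personal.
PERSONAL_FIELDS = {"from_account", "to_account"}


def is_personal(field, value):
    """Stand-in detector: a real one might use patterns or a trained classifier."""
    return field in PERSONAL_FIELDS


def protect_record(record, encrypt):
    """Encrypt flagged fields; leave amounts, dates, etc. readable for analysis."""
    protected = {}
    for field, value in record.items():
        if is_personal(field, value):
            blob = encrypt(str(value).encode("utf-8"))
            protected[field] = base64.b64encode(blob).decode("ascii")
        else:
            protected[field] = value
    return protected


def placeholder_encrypt(data: bytes) -> bytes:
    """NOT encryption: reverses bytes so the sketch runs without a crypto library."""
    return data[::-1]


# Illustrative record with made-up values.
row = protect_record(
    {"from_account": "ACC-001", "to_account": "ACC-002", "amount": 250, "date": "2017-05-01"},
    placeholder_encrypt,
)
```

Only the flagged fields are transformed, so aggregate queries over amounts and dates keep working on the stored rows.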

Thus, analyzing such data, one could say, for example, that the account with secret_account_id "x290230x9sksfpoaopdfsafl" made a series of transactions from the account

<85>^B^L^CgdzB9¦<94>Å^A^P^@<8b><98>7Q<84>ÜX^Bl5ú²{è^@<87>K ùý<87>+U<9e><Ä<84> ?9<8c>)S×zhIÿ<8e><95>^Kx^\ùÜ^K<99><99>\¾x_W8ÉC^L<87>çÎ^ZùU˸¨<98>_^RÎ<8b>æàÉ<8d>b<8c><86>;<80>¢<92><99><85><97>2^E²<9d> <9d><80>2ã¶9,<9e>^U+<98><96>^@æÖ<85>ø^_`m[µ¿<8d><82>jã|^R¥^Râ<:OåÇu¾áçM_<9f>^N¼³Y^Ru^ABcßÅ<93>¸_ì¤etlÑC<9d>D^S^K2×ÿà<8c>rnpN\¹#<84> <88>y^_'SS<93>2*^CmNE^],^]GçQ°²¢<92><82><99>orì:Ò*¤ôÑ9õ±<95>Ç<81><96>¤<9a>^E¬¢|Ȫ÷j*yýëß¿éFsÂè¯^K3^QöÜ^L?î+×áÏ<94>¶Ã<84>Õ<9d>(Õæu\#<9a><90>^\Î^Z6<8f>êX'¦<9d>ÿ<9a>ãI^^UÏ<95>£<9b>¢^?i<99>K<9d>v<98><9e>N*>Âkסyx1¬>/^XOhÿ{^LoR-<97>d²w·Lj^NhfhN,{<96>y³«¤^\µ¸hçèNÿ ~Ò¬µup^Sµ0*^\³^âúØ*<83>ªÖ>*Æ;<84>G¨¨%^S<82>Ý^S<8e>WÜí«)Ê<94>öz; ú^AüN³^W΢<92>dÅuV^Bfvè}æqeT»è@ì|6/y<81>S^Y<97>7$ê<85>^[@<81><99>$ÿ¥&¯<93><83>|¬Z7<8e>^EN<95>"ì{^[rcPòØ<97>Ïò÷';9bh·ÅahÏÌÈJ^]¼^VcÉ1 ?eÝØ[ ^K<80>^T<95>¶^]^Y §<8c>^[/¼1â\ÒU^A<80><84>UÅïv7^QÚìùKÏ<98>æMÆz¢â<8e>m"^Eú<9f><8a>F| ¯X <8a>Ì^@^VËtïe<81>m<89>«<95>}¶f<99>Æxº"4^H<9d> lºír¨)C£È7<87>ØÈ{x~N,tïIø% 

to a similarly encrypted account (fiat or cryptocurrency, either way).

It would be possible to calculate the size of the transactions and trace how they change over time, but not to establish who exactly made them.

If necessary, additional encryption can be introduced for those parts of the data that could allow identification data and business logic data to be matched through indirect evidence. That is, machine learning would be applied iteratively to perform such filtering.

Decryption would take place either on the client's device when the user works in their personal account, if the volume to decrypt is small, or through an intermediate open source API that the user deploys, again wherever they choose: in-house (which requires far fewer resources than processing and storing all the data), on a server, or in the cloud. This could include providing a simple builder-style interface.

The same applies to messages on a social network: instead of being a monolithic facade, the social network would delegate some operations to third-party computing resources.

A hybrid scheme would also be quite possible, in which users' own computing resources are plugged in and become part of the network's API, processing some of the users' own personal data contained in the business logic data.

One may ask why they would need this if the locomotive is already rolling.

Probably, this could bring the financial benefit of reduced costs. Convincing ordinary users to pay for their social network page seems utopian. But mining the social network's own cryptocurrency and storing the personal parts of confidential user messages in encrypted form in private blocks seems quite feasible, given an appropriate level of text filtering. Miners would then take on the task of filtering texts and encrypting them with the message authors' public keys as part of their PoW / PoS.

But, of course, it is debatable whether they will do this themselves, or whether something similar will appear in a different format at the next iteration of decentralized systems and VR / AR.

* A caveat: decentralized social networks have been developed and are available, but the point here is about the top social networks and about keeping confidential (non-personalized) data accessible for analysis, rather than blocking access through full encryption for everyone except authorized users.

And thus, returning to the two-database setup, one can expect that private data, even when placed in a remote database, will remain private in the sense that it is not linked to its owner (a person or a company), that is, to the user.

So it would be interesting to find and consider other solutions for anonymizing confidential business logic data and, perhaps, to refute or modify this one.

For example, instead of mock requests, one could use an algorithm that creates a validated token that is anonymous (with respect to the account and personal information) while still supporting quotas and revocation, if such an algorithm can exist (and perhaps someone is already known to use something similar).

Perhaps there are also indirectly similar or related p2p solutions (in the blockchain field or elsewhere) that could be carried over to a standard centralized project, but without total encryption, so that fewer computing resources would be needed.
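One known family of techniques that seems to fit this description is blind signatures. The sketch below is not part of the proposal above. It is a toy illustration of textbook RSA blinding, with a deliberately tiny key (N = 61 * 53) that is useless for real security, and with hypothetical function names. It requires Python 3.8+ for the modular inverse via pow. The idea is that the issuer signs a blinded value for an authenticated user, and the CDB API later verifies the unblinded signature without being able to link it to the signing request.

```python
import hashlib
import math
import secrets

# Toy RSA key of the token issuer (textbook numbers, far too small for real use).
N, E, D = 3233, 17, 2753   # N = 61 * 53


def digest(token: bytes) -> int:
    """Map a token to an integer modulo N."""
    return int.from_bytes(hashlib.sha256(token).digest(), "big") % N


def blind(token: bytes):
    """Client: hide the token digest behind a random factor r before sending it."""
    while True:
        r = secrets.randbelow(N - 2) + 2
        if math.gcd(r, N) == 1:
            break
    return (digest(token) * pow(r, E, N)) % N, r


def sign_blinded(blinded: int) -> int:
    """Issuer: signs for an authenticated user without seeing the token itself."""
    return pow(blinded, D, N)


def unblind(blind_signature: int, r: int) -> int:
    """Client: remove r to obtain an ordinary signature on the token digest."""
    return (blind_signature * pow(r, -1, N)) % N


spent = set()  # tokens the CDB API has already accepted


def accept(token: bytes, signature: int) -> bool:
    """CDB API: verify the issuer's signature and allow each token only once."""
    if token in spent or pow(signature, E, N) != digest(token):
        return False
    spent.add(token)
    return True
```

Under these assumptions, a quota would be the number of blinded values the issuer agrees to sign per user and period, and revocation would mean that the issuer stops signing for that user. Whether this is cheap enough at the request rates discussed here is an open question.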

Source: https://habr.com/ru/post/328934/

