[erlang-questions] Re: Conceptual questions on key-value databases for RDBMs users

Tue Nov 9 14:38:37 CET 2010

Hello Rudolph,

Some comments...

On Tue, 09 Nov 2010 20:43:14 +1100, Rudolph van Graan  
<rvg@REDACTED> wrote:

>
>> One of the core concepts in relational systems is that of
>> enforced integrity constraints, via primary keys and foreign keys.
>> If you don't have any integrity constraints that you care to tell the
>> data base about, you probably don't need a relational system.
>
> This was exactly what I meant with "invariance" earlier. If you have  
> (strong) requirements for invariance in your model - an RDBMS is almost  
> the only solution as integrity constraints etc are all first class  
> concepts. If you don't have a requirement for invariance (as in Edmund's  
> example):
>

In many cases (at least for a migrating RDBMS user), invariance is  
*usually* indeed very important. And as you pointed out earlier there are  
ways of faking this in the emerging class of NoSQL/kv-db, the absence of  
which means tougher checks on code that's writing data (the db is unlikely  
to do this for you.)

I've found that when one is making the transition (from SQL-RDBMS to  
NoSQL/Kv-db), the benefits of easier mapping of db-object to  
application-object normally out-weight the loss of things like  
constraints. I guess you have to trust more that programmers know what  
they are doing and won't write bad data! Some people cringe when hearing  
this, but I've found it's as bad as it sounds!

However, I think as this newer class of DB start to attract more and more  
RDBMS migrants, they will have to figure out a ways of solving these  
issues. Some have already stared (CouchDB has some concept of data  
validation.)

>> It is important to understand that SQL is not a good example of a  
>> relational system.
>
> Yes - SQL is a functional language where you state the solution in terms  
> of relational concepts and ask the RDMBS to solve it and present you  
> with an answer. Some of the Non-SQL systems uses search terms "Name =  
> 'John'" or jscript for the same purpose. QLC is also an example (in the  
> case of mnesia).
>

I'd go further and say SQL has evolved into a primary API for accessing  
todays RDBMSs. You can do much more that just query for data. You can  
design/redesign the db, add/remove indicies. Many vendors add elaborate  
extensions like Microsoft's Data Shaping for hierarchical data. In many  
ways SQL is like the non-standard standard shell language for many  
databases.

>> I think a big reason kv-stores are winning over a lot of us long-time  
>> RDBMSs users is they allow us to model  
>> things-that-have-things-inside-them in the database much closer to how  
>> they are modeled in our applications. Orders/receipts/invoices with  
>> their items, users with their permissions, all these map nicely in  
>> kv-stores from db to application. This allows us to model relationships  
>> only when WE REALLY WANT RELATIONSHIPS (this receipt belongs to that  
>> invoice). That fact alone won me over and I've never looked back.
>
> However, it is only simple to store things this way (the example of  
> Orders/Receipts/Invoices with items "inside" them),

I disagree, I store things this way not because it's simple for the  
database but because that's how my applications understand those things  
(BTW, those were *REAL* examples I was using orders/receipts/invoices from  
real code in development. Code previously written against SQL-RDBMSs and  
being re-writing against CouchDB. The latter is proving to be a much more  
pleasant experience!)

I've seen many SQL-RDBMS developers try to achieve a similar end-result  
using object-relational mapping. You know, so when they create an instance  
of a user class in say Java, a new user is automatically added the users  
table in the db and corresponding entries are made in the permissions  
table. I've never agreed with that way of doing things but it *does*  
illustrate that there is a disconnect between how developers want to  
organise data and how SQL-RDBMSs want them organise data.

> if your only interest in the data is that the "outer" or container  
> object encapsulates the items.

This isn't true. At least not with the class of kv-database I'm referring  
to. CouchDB (and I assume Mnesia) allows you to access the 'inner' objects  
(to whatever depth) inside your queries. You can very easily access the  
items inside an invoice.

> Typically you want to read or write the whole thing in one operation. In  
> real life, (and in my experience), you will pretty soon find that  
> someone wants to know how many orders during the last 60 days included  
> item X with quantities larger than say 6.
>

Easy. I do that sort of thing everywhere (no accounting tool would be very  
helpful if it couldn't do those kind of complex queries).

I don't know about query engines for other kv-dbs like Mnesia, but CouchDB  
handles that sort of thing very well. Just create a view that gathers the  
inner data according to the criteria you are looking for (map step) then  
do the necessary tallying (reduce step).

> If your design decision was to store this whole thing (the order and its  
> items) as one big document (my term for it), the only way to retrieve  
> this data is to literally open up every order, filter out the items you  
> want and aggregate them. The only way to make this fast is to avoid  
> reading every single document and processing it over and over.

Couch is very smart at ensuring that any lengthy read/map-reduce  
operations are cached in an internal data structure (B-tree I think) and  
tries it's best to keep the cache update-to-date. So you normally don't  
have to worry too much about the whole vs partial read/write operations --  
you just get on with business of writing your application using whatever  
structures make sense to you. But if the query is one that is rarely  
executed and lengthly, you can run it at startup so the next time it runs  
it's faster.

This is one of the things I mean when I say we have to make clear the  
class of kv-store that the original poster should look into. An RDBMS user  
looking for a suitable alternative isn't looking for just a plain  
kv-store. You're normally looking for something that has a nice well  
thought-out query engine on top and has figured a lot of these issues out.  
The Couch team have done an *excellent* job here.

> And to do this optimisation, you need an index, or more likely several  
> indices - on dates, on items types, etc.

Correct. For couch's case, when you're creating a query (a view in couch  
lingo) what you're essentially doing at the map step is asking couch to  
create an index for you. Couch tries it's best to be very smart at how it  
handles/optimises that index.

> Indices require that you understand what is inside your "document" (in  
> this case line items).

Correct. You can look at couch's map step as providing the query engine an  
understanding of the documents being stored, including any contained or  
deeply hierarchical* data.

*YAY! I've _ALWAYS_ wanted this and tried desperately to simulate it in  
SQL-RDBMSs with varying degrees of success.

> By definition, this implies a relationship - orders have among other  
> things - lines. Completely independent of the fact that you are storing  
> the items inside the document/order.
>

Okay, I see what you're saying here. Indeed...

http://wiki.apache.org/couchdb/EntityRelationship

... you can view embedded data as related data. I hadn't thought of it  
that way.

Let me make my argument clear by referring to what SQL-RDBMS folk tend to  
think of as a relationship -- that is -- a constraint based on a  
foreign-key. The kv-db equivalent of this would be a member of the  
document that is merely a key pointing to another document rather than  
meaningful data in it's own right. Now, using *that* definition...

For Couch's case, the map-reduce paradigm can be effectively used so that  
you *DON'T HAVE TO CREATE RELATIONSHIPS (of the external key variety) IF  
YOU DON'T WANT THEM*

This is my major argument against SQL-RDMBSs and in favor of NoSQL/kv-dbs  
like Couch -- give me containers when I want containers and give me  
relationships when I specifically want relationships.

I don't want my orders to *relate* to lines elsewhere, this is not how my  
application looks at orders. I want orders to *contain* items. On the  
other-hand, I want invoices to *relate* to orders. Only in the latter case  
does it make sense for me to have and foreign-key/id reference stored in  
the orders. I want data in my db to look like data in my application.

The db should fit my application, my application shouldn't have to fit the  
db. I always found I would invest considerable time with SQL-RDBMSs, first  
trying to force the db to understand how I want my data organised, then  
giving up and just adapting the code of my app to think like the db -- in  
terms of tables and foreign keys.

I've found NoSQL dbs like Couch treat me more like a grown up. Give me the  
freedom to define and enforce my own rules in a way that makes sense to me  
not in some generic way the *supposedly* adapts to every situation. Sure,  
the RDBMS style may adapt to a wide array of situations, but from my  
experience, this is never without pain.

> So as a summary from my side - all data has some sort of structure, be  
> it words within documents, or line items within orders. You can  
> represent this any way you want.
>

True, but with most SQL-RDBMSs, things can get particularly hard. Aside  
 from my relationship rant, a lot of data just doesn't want to be stored in  
tables, and when you force it, exponential complexity arises.

> In the distant past we wrote all the items on a single piece of paper  
> called an order. It was all on one physical page. The page contained all  
> the information. For the same reason it is difficult to query pieces of  
> paper (you need to either index them or summarise them in another way),  
> in the same way it is difficult to query data with implied relations  
> stored in a single "thing" (blob/object/values).
>
> - It is very difficult to enforce invariance in KV stores

I agree. And constraints in general are painful.

> - It is very difficult to index KV stores

This depends. For couch, map-reduce is a form of indexing -- a very  
effective and surprisingly flexible one at that. Doors that are normally  
closed in SQL-RDBMS world are suddenly opened wide in Couch.

> - It is hard work to query KV stores.

I disagree. I haven't found a query for *my* data yet that I cannot  
express in a couch view. I'd go further and say I've found it easier than  
SQL because I can organise data in a way that makes more sense to my  
application. This somehow always translates to easier query construction  
for the application domain.

Perhaps it's a question of learning curve. SQL-RDBMS users are so used to  
thinking in terms of SELECTS and JOINS, but that doesn't make SQL easy! It  
just makes it easy once you know how. I think the same applies to many  
kv-dbs.

> - It is trivial to read from or write into KV stores
> - It is hard to read from or write to database (drivers, SQL, ...)
> - RDMBS systems are hard to scale

I agree with all.

> - KV stores scale easily

Largely true. But sometimes it's not as easy as they promise. Some things  
work well locally and break when you replicate. And you only realise this  
when you try and scale out. There is some element of false advertising  
when NoSQL folk promise simple scaling, there should be a big red  
disclaimer attached.

- Edmond -

> Rudolph van Graan
>
>
> On Nov 9, 2010, at 1:59 AM, Richard O'Keefe wrote:
>
>>
>> On 8/11/2010, at 4:06 AM, Steve Davis wrote:
>>> It appears to me that this discussion is another expression of the
>>> 'strong vs weak/dynamic vs static type' discussion.
>>>
>>> ...it makes me suspect that an imperative and strongly-typed language
>>> paradigm has been a very strong motivator in the evolution of SQL
>>> databases; and perhaps the popularity of NoSQL/NotSQL is an expression/
>>> outcome of the rise of recent trends in programming language uptake.
>>
>> You *cannot* call the types in classic SQL "strong".
>> Numbers, strings, and byte strings for everything is what Joe is  
>> complaining
>> of and he is right to do so.  Encoding something as a string or blob  
>> basically
>> results in the data base itself having no clue about the structure or  
>> meaning
>> of the data.
>>
>> It is important to understand that SQL is not a good example of a  
>> relational
>> system.  A move away from SQL *could* be a move towards relational!
>> such as
>> One of the core concepts in relational systems is that of
>> enforced integrity constraints, via primary keys and foreign keys.
>> If you don't have any integrity constraints that you care to tell the
>> data base about, you probably don't need a relational system.
>>
>>
>> ________________________________________________________________
>> erlang-questions (at) erlang.org mailing list.
>> See http://www.erlang.org/faq.html
>> To unsubscribe; mailto:erlang-questions-unsubscribe@REDACTED
>>
>

-- 
Using Opera's revolutionary e-mail client: http://www.opera.com/mail/