nsomos
Veteran Member
Posts: 140
Authored by nsomos on Jun 3, 2014 18:09:23 GMT
Authored by wol on Jun 3, 2014 21:08:04 GMT
Well, Oracle is the poster child of RDBMSs. And you know what my opinion of those is.
How come (this was a little while ago) a twin-Xeon-800 system running Oracle was embarrassingly trounced for speed by the Pentium 90 running Pick that it was supposed to replace? And (very recently) Oracle lost a shoot-out against Cache - the brief was to achieve 100K inserts/sec. Oracle had to cheat to hit the target; Cache did it easily (and hit 250K in production).
At the end of the day, the basic requirements of C&D's 12 (13) rules mean that an RDBMS can *N*O*T* be efficient and fast if it wishes to stick to the rules. If you want to get me started, I'll wax lyrical :-) ...
Cheers, Wol
Authored by wayneborean on Jun 4, 2014 18:16:32 GMT
At the end of the day, the basic requirements of C&D's 12 (13) rules means that an RDBMS can *N*O*T* be efficient and fast if it wishes to stick to the rules. If you want to get me started, I'll wax lyrical :-) ... Cheers, Wol

Please do. I'd love an explanation.

Wayne
madhatter.ca
Authored by wol on Jun 4, 2014 20:48:17 GMT
Okay, let's start at the beginning. An RDBMS presents a mathematical view to the application. One of the key features of this mathematical view is that IT MAKES NO PERFORMANCE GUARANTEES WHATSOEVER. Note that one of C&D's rules explicitly forbids the application from knowing anything about "what is behind the curtain", so that the database guys (either the DBA or the guys writing the DBMS) can make changes to improve said performance. It is also important to note that this interface enforces 1st Normal Form. So if 1NF is inefficient, then the application can NOT escape from those inefficiencies (and I intend to show that 1NF is, indeed, incredibly inefficient).

Note also that the maths says "there is guaranteed to be a solution, but there is no guarantee that the problem is solvable within the time available" - this is the contract that underlies cryptography: there is a guaranteed solution to any cryptographic puzzle, but without the key, the cost of finding the solution exceeds the value of having it.

So why is 1NF inefficient? This bit is the maths, but it's logic, not formulae, so you'll have to use your brains. C&D rule 1 lays down as a requirement that all data comes as sets. In other words, data comes as a *random* collection of *unique* rows. Unfortunately that isn't true in the real world - much information comes in *sorted* form, and much information is *not* unique. Which means that an RDBMS cannot store real-world data that comes as bags and lists (ie most of it) and is forced to model the data instead. Which carries a cost.

So we have the following:
Fact: A conformant RDBMS is forced to use 1NF as its interface.
Postulate: 1NF is inefficient.
Observation: 1NF cannot store bags or lists, the form in which most real-world information occurs.
Conclusion: The postulate is proven; 1NF has to model bags and lists, which comes at a cost.

Now let's add to that: 1NF is deprecated in favour of 3NF, yet 3NF is defined in terms of (ie as a layer on top of) 1NF.
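The sets-versus-lists point can be sketched in a few lines of Python (my own illustration, not from the thread): a set silently drops duplicates and ordering, so storing a real-world list as a set forces you to invent an ordinal column.

```python
# A topping list is ordered and may contain repeats; a 1NF relation is a
# set of unique rows. To store the list as a set we must invent an extra
# ordering column - "data" that never existed in the real world.
toppings = ["tomato", "cheddar", "tomato"]  # a real-world list (a bag)

as_set = set(toppings)
print(as_set)  # the duplicate "tomato" and the ordering are both lost

# The 1NF work-around: magic an ordinal column into existence
rows = {(i + 1, t) for i, t in enumerate(toppings)}
print(sorted(rows))  # [(1, 'tomato'), (2, 'cheddar'), (3, 'tomato')]
```

The extra first column is exactly the "magicked into existence" metadata the worked example below complains about.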
I'll follow with a worked example ...
Authored by wol on Jun 4, 2014 21:32:43 GMT
I'll do a worked example of a Pizza database, in my favourite db (Pick), and in an RDBMS. Okay, let's do a couple of pizzas and toppings as RECORDs in my Pick db:
( Hawaiian, ( tomato, ham, pineapple) ) ( Four Cheese, ( tomato, cheddar, edam, emmental, gorgonzola) )
Forgive my choice of cheeses, I'm not sure what the real ones are ...
Now let's convert that to normal form ...
{ Hawaiian, 1, tomato } { Hawaiian, 2, ham } { Hawaiian, 3, pineapple } { Four Cheese, 1, tomato } { Four Cheese, 2, cheddar } { Four Cheese, 3, edam } { Four Cheese, 4, emmental } { Four Cheese, 5, gorgonzola }
That's a lot more rows, and a lot more "data" - in quotes because column 2 isn't real-world data - it's been magicked into existence by the normalisation process. As far as the real world goes, it's metadata, it's "first, second, third" etc, and the real world doesn't give a damn what the *value* in the column is, all it wants to know is the *collation*order*.
And what makes it worse is that I don't think that is even 1NF or 3NF - you need to split the pizza names out into a table of their own! So we end up with:
{ 1, Hawaiian } { 2, Four Cheese }
and
{ 1, 1, tomato } { 1, 2, ham } { 1, 3, pineapple } { 2, 1, tomato } { 2, 2, cheddar } { 2, 3, edam } { 2, 4, emmental } { 2, 5, gorgonzola }
Okay, we do need to change the Pick example to match slightly ...
( 1, Hawaiian, ( tomato, ham, pineapple) ) ( 2, Four Cheese, ( tomato, cheddar, edam, emmental, gorgonzola) )
So for relational, we have two tables and ten rows, for Pick we have one FILE and two RECORDs. And note that it's a cinch to write a generic function f() such that the Pick rows are converted to the equivalent normal form representation. Looking at the Pick data, the first column (the key) must be single-valued, and the DICTionary tells the function that the second column is single-valued, so they get spat out as your first relational table, the pizza table. The DICTionary tells the function that the third column is multi-valued, so the function spits out the second, toppings table by repeating the key as column 1, the individual toppings in column 3, and the order of the topping in column 2.
The crucial mathematical point here is that because function f() can convert a Pick RECORD to the matching relational representation, the information in the Pick record must be a superset of that in the relational rows. However, because I can write a reverse function f'() but cannot write a generic function g() that converts an arbitrary normal-form representation to a Pick RECORD, the Pick record must be a proper superset - ie it actually contains *more* information than the matching relational representation.
Postulate: Pick RECORDs contain more information than the matching relational records.
Observation: It is possible to write a generic function f() to convert from Pick to relational, and a function f'() to reverse f(), but it does not seem possible to write a generic g() to convert from relational to Pick.
Conclusion: The postulate is proven; Pick stores more information than relational.
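For the pizza example, such an f() and its reverse could look like the following Python sketch (my own illustration, not real Pick code; the record layout and the names f/f_inv are assumptions - f_inv stands in for f'()):

```python
def f(records):
    """Explode nested (key, name, toppings-list) records into two flat tables."""
    pizzas, toppings = [], []
    for key, name, tops in records:
        pizzas.append((key, name))                 # single-valued fields
        for order, top in enumerate(tops, start=1):
            toppings.append((key, order, top))     # multi-valued field, with invented ordinal
    return pizzas, toppings

def f_inv(pizzas, toppings):
    """Rebuild the nested records from the two flat tables (the f'() of the post)."""
    by_key = {key: (key, name, []) for key, name in pizzas}
    for key, order, top in sorted(toppings):       # the ordinal restores list order
        by_key[key][2].append(top)
    return list(by_key.values())

records = [
    (1, "Hawaiian", ["tomato", "ham", "pineapple"]),
    (2, "Four Cheese", ["tomato", "cheddar", "edam", "emmental", "gorgonzola"]),
]
pizzas, toppings = f(records)
assert f_inv(pizzas, toppings) == records   # round trip: nothing is lost
print(len(pizzas) + len(toppings))          # 10 rows across the two tables
```

Note that f_inv only works here because it knows, out of band, which table folds back into a nested list - which is the dictionary metadata a truly generic g() would be missing.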
Now to provide an engineering proof that Pick - BY DEFAULT - is so fast that it's near impossible to improve upon.
Authored by wol on Jun 4, 2014 23:44:34 GMT
Okay. We have a large database of lots of different pizza types. We want to list all pizzas that contain ham as one of their ingredients (along with all said ingredients). In Pick:

LIST PIZZA TOPPINGS WHEN TOPPING EQ "HAM"

The following assumptions are being made:
1) The databases have been optimally designed.
2) The OS is "hot" - any information about the disk is cached and will be returned immediately.
3) The database is "cold" - any request for data will, on the first attempt, trigger a cache miss and a disk access.

From the Pick point of view, I am also assuming that disk access will dominate, so when calculating how long the Pick query will take, I'm going to assume it's all disk I/O and CPU time is negligible. And I'm going to state that there are ten pizzas that match the query. Given that all pizzas contain tomato, all matching pizzas contain ham, and in order to be unique we need at least nine more toppings, that means that a relational query needs to return at least 29 rows...

Okay. In Pick we have an index on toppings. The key is "HAM". Given that the file is keyed, I calculate the hash, ask the OS for "block X of file Y", and I have a list of all ten pizzas. For each pizza, I hash the name, ask the OS for "block X of file Y", and I have the pizza and the toppings. Given that roughly 1 in 20 requests will miss (but contain a pointer to the true location), I'm going to add one miss requiring another disk access. So that gives me 12 seeks total, averaging half a revolution each to fulfil. Assuming a 6000rpm disk, that's 100 revs/sec, giving a total time for the query of 0.06 sec. Note I didn't, at any point, mention how big the database is in my calculations...

Now let's do the same for the relational database. Except we can't - the application is forbidden to know anything about how the database stores the data. In a worst-case scenario, the db has to retrieve both tables completely from disk to find what it's looking for.
And even if it's indexed, does it need to retrieve the entire (possibly large) index? We're not allowed to know. At BEST, I'll assume it takes one access per row and, assuming a perfect hit rate, you need at least 31 disk accesses. That's more than 2 1/2 times what Pick needs! I'll leave it to the relational guys to elaborate further, seeing as I don't know the internal details of any RDBMS, but I expect the default scenario is a LOT worse.

So let's now apply a bit of engineering logic, and try to optimise our storage to increase the chances of "accidental" cache hits. The easy way, of course, is simply to have a tiny database. Accessing one row or RECORD will, as a side effect, retrieve a bunch of others; the smaller the database, the better the chance. But we can't bank on that.

Given that we are accessing value X of attribute Y in row Z, what are the chances we want the same value X and attribute Y but in row A? Not that high, I would have said. And the cost of optimising for it is likely to be horrendous - trying to sort a table by every value for every row? That's what we have indices for, and it works fine for both RDBMSs and Pick. But what are the chances of us wanting to access value B of attribute C for the same object instance represented by row Z? Quite high! In the pizza example, it's quite likely I'll want to access the other toppings for the same pizza. And as you can see, Pick has optimised that quite nicely.

Taking a more complicated example, let's say I've just accessed the address for invoice 1348. What's the chance I want to access the line items? Quite high. In an RDBMS, that's one disk access for the invoice, another access for the address, and then another for each of the line items. In Pick, accessing the invoice number has just pulled in all the other information "for free"!

That's a pretty simple engineering statistical proof that it is *impossible* to improve on the efficiency of Pick in retrieving data!
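The seek arithmetic can be checked with a few lines of Python (the figures are the post's own assumptions, not measurements):

```python
# Back-of-envelope check of the seek arithmetic above. Assumptions from
# the post: 6000 rpm disk, each seek costs half a revolution of
# rotational latency on average, CPU time negligible.
RPM = 6000
revs_per_sec = RPM / 60            # 100 revolutions per second
seek_cost = 0.5 / revs_per_sec     # half a revolution = 5 ms per seek

pick_seeks = 12                    # 1 index block + 10 pizza RECORDs + 1 hash miss
relational_seeks = 31              # >= 29 rows plus index accesses, best case

print(pick_seeks * seek_cost)          # ~0.06 seconds for the Pick query
print(relational_seeks / pick_seeks)   # ~2.58: over 2.5x the accesses
```

As the post notes, the Pick figure depends only on the number of matching records, not on the size of the database.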
And RDBMS guys wonder why we sneer at query optimisers? Thing is, we can prove that any possible gains an optimiser can give us are far outweighed by the cost of said optimiser.

The basic rule of designing a Pick database, imho, is first do an EAR analysis (entity, attribute, relationship). ALL attributes belonging to an individual entity belong in the same FILE. Then, for each entity, do a full-blown relational analysis to design the Pick RECORD - bear in mind that, because Pick handles nested lists, you can store a complete 3NF view of an object in a single Pick RECORD. It's not an RDBMS because it breaks rule 0 - all access must be via the 1NF interface ...

Because RDBMSs present this mathematical face to applications, people forget that they actually do many things under the covers. I can put a 1NF shim over my Pick engine and pretend to be a relational database. But because I can get under the covers and actually manage the datastore as a Pick datastore, my SQL applications will be none the wiser, but my Pick db will simply blow the relational competition out of the water for speed!

And while I haven't addressed such things as data integrity, journalling, surviving crashes, etc etc, they are all just engineering issues. Any solution relational adopts (which is outside of the theory!) I can adopt too. And indeed, because my data model is simpler, I can apply simpler versions of their solutions to equal effect.

Lastly, a bit more detail (anecdata, as I call it) about how fast Pick is...
Authored by wol on Jun 4, 2014 23:58:45 GMT
Firstly, I'll refer to personal experience. My company ran three R3000 MIPS minicomputers - that's a 386-era processor. Each mini had 16Mb of RAM. And I've sat in the computer room, watching those systems thrash like mad with 20 or 30 users, and nobody EVER complained about the speed. Okay, they weren't fast, but they certainly weren't that slow!
Secondly, that Xeon 800 system? I strongly suspect the company in question was William Hill - they "upgraded" from IBM UniVerse on a Pentium 90 to Oracle on a twin-Xeon-800. On a rather complex query (5mins on the Pentium), the consultants doing the migration spent SIX MONTHS optimising Oracle to try and get it to outperform the Pentium. They finally managed to get the Xeon down to 4mins30secs - a 10% improvement!?!?
And the Cache example? (Not Pick, but close.) It was recently announced that Cache had won an astronomical database contract, storing masses of deep-sky observational data. Oracle had to stuff the data into the database as blobs for later post-processing to hit the 100K target. In production, Cache was happily inserting 250K actual rows. (Put otherwise, data inserted into Cache was available instantly for queries to be run against it; Oracle had to wait for the post-processor to catch up.)
I have yet to come across a single credible instance of an RDBMS outperforming Pick in a truly comparable scenario. And, for a well-designed system, I can't think of any credible way it could.
Cheers, Wol
Authored by wol on Jun 5, 2014 10:06:25 GMT
Another important point to bear in mind - not down to the theory as such, I don't think, but very important too ... a guy on the Pick LinkedIn forums made a comment as follows:
"you all know the conference room of doom, that room with the wall covered with SQL schema that no-one fully comprehends. Well, I never have any trouble holding the entire schema of an equivalent Pick system in my head, nor do my colleagues".
I personally put that down to the phenomenon of "emergent complexity". Relational forces you to view everything at the most basic, nuts-and-bolts level. Your tables and views have no equivalent in the real world on which to hang them to make comprehension easy. A Pick FILE, on the other hand, maps to a real-life object. So you can view the system as a whole in terms of real-life objects, then take an individual object and drill down.
I'm a chemist by education, a physical chemist by interest. In other words, I'm interested in how physical reactions take place. For the most part I can ignore how atoms are put together (that's Physics). I can ignore what a molecule is (that's Chemistry). I can concern myself mostly only with the behaviour of electrons in atomic/molecular orbits, and delve into the other stuff as needed. As a database programmer, if I'm using relational, it's damn hard to concentrate only on the object of interest and ignore everything else, not least because I can't easily draw a dividing line between the two!
Cheers, Wol
Authored by wayneborean on Jun 6, 2014 19:22:52 GMT
Wol,

Are there any GPL-licensed PICK databases for Linux and/or Mac OS X?

Wayne
madhatter.ca
Authored by wol on Jun 6, 2014 20:12:46 GMT
Are there any GPL licensed PICK databases for Linux and/or Mac OS X?

Free as in freedom? Not really ... The reason I say that is that there is a database (known variously as OpenQM or ScarletDME) that is allegedly GPL. Snag is, because it includes a compiler, and the copyright holder explicitly does NOT grant a "compiler exception" (like happens with GCC), the claim is that any system written on top of it is also GPL. github.com/geneb/ScarletDME/

There are various free-as-in-beer versions. I'm not running it (I really should try) but the one I would use is "UniVerse Personal Edition". Rocket Software (www.rocketsoftware.com) now owns most of the variants of Pick; it'll be available on their site. The restrictions are what you would expect - time-bombed, "no commercial use", and slightly crippled (for personal use you wouldn't notice; most commercial systems I've run (small) wouldn't have noticed either!).

IF I can find time, I want to write a Pick/MV database for LibreOffice, but my wife is disabled (I've just got back from hospital :-( and being a full-time carer plays havoc with your life :-( Oh well...

Oh - and like dBase, the original Pick (then called GIRLS - General Information Retrieval Language System) was written for the DoD, and hence Public Domain, if you can find a copy anywhere ...

Cheers, Wol
Cm
Guest
Authored by Cm on Jun 6, 2014 20:30:37 GMT
I first met Pick on an ADDS Mentor with a Z8000 processor handling 16 (I think) ports.
I used to look after a Honeywell DPS/6 running Pick with 2 times 7x processors, 2M memory and 128 ports (8 of which were printers) running a mail order company (and a spare 68000 machine with 16 ports - I'm very annoyed I didn't put in an offer for it when the company got sold out and I was made redundant).
All these systems were running PICK natively.
If there is a GPL licensed PICK system for Linux I would be very interested - I've been able to do much more with PICK than I've ever managed with other systems (it's so much easier!).
Authored by nsomos on Jul 9, 2014 21:12:25 GMT
A question for Cm ... was that 68000 machine by any chance a PC with a plug-in 68000 co-processor card?
In the 80's, when I worked for Sritek, we developed a number of plug-in PC co-processors. These had on them nothing more than the advanced processor, memory (which could be accessed by either processor) and cross-coupled interrupts. The PC acted as the I/O processor and was able to support any advanced operating system running on the co-processor. (Such as various Unix flavors, RM/COS, PICK, and others)
By off-loading all the I/O onto the PC, more effective use could be made of the main processor.
Authored by Cm on Jul 10, 2014 17:13:39 GMT
It was actually a 68020 (I think; not a plain 68k) and it wasn't, AFAICR, a PC-based system, but more a custom thing (ie designed as a Pick box from the outset). I used it as a mini development system, but never did very much with it (it just sat under one of the desks in the computer room == my office), and the last time I booted it was back in 1993 (possibly 1992). The company was sold out in Feb 1993, most staff were laid off July 1993, and I was kept on with skeleton warehouse and accounts staff whilst accounts and stock were sorted out, finally being laid off Jan 1994. (They didn't even offer me a possible job in the part of the company that the new owners wanted ... possibly because I was able to fix the back-till system (a Tandon PC IIRC running Xenix) when permissions were somehow changed on some of the /dev files; they may have suspected that I caused the problem in the first place, rather than being able to diagnose the problem from the symptoms.)