Chris R's Weblog

Daily link November 23rd, 2007

DivShare up for sale on DNForum??

This is really odd. DivShare is putting itself up for sale on DNForum.com?

I’ve been a member there for a long time, and I think that’s the biggest site I’ve seen up for sale there. TC also reported on it.

Daily link November 23rd, 2007

But Chris, how are you going to traverse billions of records in an unmanaged way?

Records are stored in a huge dump file this way:

64-bit int key - 32-bit int RankPos - 255 × 8-bit crunched data index (31 finite fields totalling 224 chars + 31 unique single-char delimiters)
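For the curious, here's roughly what packing one of these records looks like. This is just a Python sketch; the field widths, delimiter bytes, and function names are illustrative assumptions, not the real schema:

```python
import struct

# One record: 8-byte key + 4-byte RankPos + 255-byte crunched index.
RECORD_SIZE = 8 + 4 + 255  # = 267 bytes

def pack_record(key, rank_pos, fields, widths, delims):
    """Pack 31 fixed-width fields (224 chars total) plus 31 one-byte delimiters."""
    assert len(fields) == len(widths) == len(delims) == 31
    assert sum(widths) == 224
    # Truncate or space-pad each field to its width, then append its delimiter.
    payload = b"".join(
        f.encode("ascii")[:w].ljust(w, b" ") + d
        for f, w, d in zip(fields, widths, delims)
    )
    assert len(payload) == 255
    return struct.pack(">QI", key, rank_pos) + payload
```

The point of the fixed width is that every record is exactly 267 bytes, so the stream reader can walk the dump with plain offset arithmetic instead of parsing.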

The data is traversed off the disk as a stream, since billions of records cannot be held in the memory range of a consumer-grade motherboard, which is all we can afford.

Letting virtual memory handle this automatically would be too slow, so we will handle it ourselves.

We read 4MB of data off the disk at a time into the stream window, from the top of the dump down, until the requested number of records are matched. The 31 unique 8-bit delimiters are used to match the finite types. The data is ordered by RankPos from top to bottom, so the first matched records are automatically the good ones, and you can exit as soon as you've matched the last of the n you need.
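Here's roughly what that streaming scan looks like as a Python sketch, assuming the 267-byte fixed record layout described above and a caller-supplied match predicate; the names are illustrative:

```python
import struct

RECORD_SIZE = 267          # 8-byte key + 4-byte RankPos + 255-byte payload
WINDOW = 4 * 1024 * 1024   # read 4 MB off the disk at a time

def stream_top_n(fileobj, matches, n):
    """Scan records in RankPos order; exit as soon as n of them match."""
    hits = []
    buf = b""
    while len(hits) < n:
        chunk = fileobj.read(WINDOW)
        if not chunk:
            break  # end of dump before n matches were found
        buf += chunk
        # Only walk whole records; keep any trailing partial record for next read.
        whole = len(buf) // RECORD_SIZE * RECORD_SIZE
        for off in range(0, whole, RECORD_SIZE):
            rec = buf[off:off + RECORD_SIZE]
            key, rank = struct.unpack(">QI", rec[:12])
            if matches(rec[12:]):
                hits.append((key, rank))
                if len(hits) == n:
                    return hits  # RankPos order means these are the good ones
        buf = buf[whole:]
    return hits
```

Because the dump is pre-sorted by RankPos, the early exit is what makes this bearable on consumer hardware: most queries never touch more than the first few windows.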

This will be pretty awful for accuracy, because it's limited to 224 × 8 bits for 31 fields, but that's all our consumer-grade gear can handle.

Otherwise there would be no JIT results, just like MySpace, eBay and every other website that runs on a full cold cache. That is unacceptable for search. Results will still be pre-compiled, but this implementation will be the JIT implementation.

Once keys are pulled they are matched into a modified DBM, which will be fast enough for what we are doing. The DBM will return the blob containing the serialized data stream corresponding to the index.
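Something like this, with Python's stdlib dbm standing in for the modified DBM; the path and the key packing are my assumptions:

```python
import dbm
import struct

def fetch_blobs(db_path, keys):
    """Look up each pulled 64-bit key and return its serialized blob."""
    with dbm.open(db_path, "r") as db:
        # Keys are stored as big-endian 8-byte integers, matching the dump format.
        return [db[struct.pack(">Q", k)] for k in keys]
```

The stream scan only ever returns keys and RankPos values, so this lookup is the one place the full serialized record gets touched.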

This is not the ideal way to do this. Normally you would split up the request into several parts and have different machines compile different parts and hold caches of the different compilation stages. But again, this is Canada, and in Canada there is no funding. The ghetto search will be the only search engine in Canada. It will still be released on time in December, probably near Xmas.

Q: But how are you going to … …. n total results if the stream exits immediately????

A: “Results 1-10 of about rand() for ___TERM___”

This is Canada, don’t ask too many questions, and you won’t be disappointed ;)

Actually I’ll be serious for a sec. The approximate total is computed as

(records found / records traversed) * total records

which is really crappy, but not quite as crappy as rand(). If I really wanted to make it that ghetto I would have just used rand(). It would have been funnier.
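For the record, the estimate is just this:

```python
# The approximate-total formula above, as code. Plain arithmetic, no rand().
def approx_total(found, traversed, total_records):
    """Estimate total hits: (records found / records traversed) * total records."""
    if traversed == 0:
        return 0
    return int(found / traversed * total_records)
```

So if 10 of the first 1,000 records traversed match and the dump holds 2 billion records, the banner reads "about 20,000,000".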

UPDATE: I know there will be critics who say that the array mentioned above should be kept in RAM, but we can’t afford a system with a terabyte of RAM. We could have, if we had gotten fair treatment from the BDC, but we didn’t. 8GB of RAM is pretty much the max we can get for a dedicated JIT machine. We could keep the top 4GB of the dump raw in the top half of the RAM, but that’s it, and that will hardly cover the entire dump. I hope that is enough to match Google. Or at least give people the “feeling” that it’s as fast as Google, because it sure won’t be very accurate with 31 fields crunched into 224 * 8 bits.

Daily link November 23rd, 2007

back from DQ

I just got back from eating a banana split at DQ.

I am now going to write out all those millions and millions and millions of records as short data blobs in a dbm-derived database, and I am going to SHOVE all the search index data into 255 chars (255 * 8 bits), plus one long as the 64-bit key, in a super long hash.
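Roughly like this, again with the stdlib dbm as a stand-in for the derived database; the names and the cap logic are illustrative:

```python
import dbm
import struct

def dump_records(db_path, records):
    """records: iterable of (64-bit int key, index bytes) pairs."""
    with dbm.open(db_path, "c") as db:
        for key, data in records:
            # Crunch every value down to 255 chars / 255 * 8 bits.
            db[struct.pack(">Q", key)] = data[:255]
```

Anything past the 255th byte of a record just gets dropped, which is where the accuracy hit comes from.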

This will cut search accuracy by about 80% (or more), but damn will it be way faster than (I got excited there) just as fast as Yahoo and Google.

It will convert this project from an accurate search into a “you may be pleasantly surprised” search. What are you going to do? This is Canada. You can’t expect miracles here.

I’m leaving immediately after the dump starts. It will probably take a day or so anyway.

Daily link November 23rd, 2007

Back to Oracle 11g

It turns out that no matter how many DB keys are stored in RAM, on 100M+ records in an InnoDB-style table a lengthy statement takes at least a minute and a half with 64-bit compiled code on a dual core. This is unacceptable. I have to get the commercial version of Oracle 11g today, no matter how crappy the company’s service is. I am going to order from CDW instead. Then I have to wait for it to arrive.

Having this much data is a nightmare, because I don’t have the human resources to deal with solving the large scale problems of having large scale data sets. I wrote the code, but if the API can’t handle the data sets, the code is worthless. It’s time to try Oracle 11g and hope for the best.

It absolutely kills me to write this, but it’s the truth. Our own data solution is not complete. It will have to be complete before we shift to the full web search. So much for managed this or managed that. It turns out that managed solutions have their limits and managed databases can only handle so much. At the scale of hundreds of millions or billions of records, it’s no longer manageable by bulk built commercial software, no matter how expensive.

Oracle will have to be good enough to hold off the inevitable for a few months.

This is totally the BDC’s fault, and they are truly a detriment to Canada. If they are going to be so worthless to the public, they should just refund everybody’s Federal tax dollars, and close the agency down for good. I’m ashamed to say I help pay their salaries.

UPDATE: I am on the phone with CDW, and this is very hellish. If this shaves something like 2 seconds off a minute and a half, I will be extremely pissed off. I will have to degrade the finite ability of the search and change the schema to push speed through hundreds of millions of records, or I will have to make it so there are no JIT searches and everything is pushed from compiled results, like eBay and MySpace. MSSQL isn’t strong enough to do it either, so they just pre-compile EVERYTHING. I was hoping I could do some JIT for non-compiled searches; this is a search engine, after all.

The word “enterprise” is often just a synonym for crap. Enterprises really don’t care. It depends which ones, but that’s a good rule of thumb.

This is total bullshit.

UPDATE2: Now the sales reps are giving me the runaround, and they said they would cut the pricing on Oracle to 5% over the US price instead of the 100% over the US price advertised on their website. It made no sense because the CAD is still worth more, so I complained about it. This is so stupid. Now I have to wait another hour to order it (apparently in an hour the magic Oracle fairy will OK BeerCo to be a customer). I have to order several seats even though I am only using one. I hate this so badly. This is such a waste. If this doesn’t work, I am going to spaz, because it means that I will have to finish the DB I started from scratch before I can make this live, or I will have to store everything as hash keys and blobs, which means the searches will be sh1tty like Google, and not finite.

UPDATE3: I’ve had it. Nobody does this to BeerCo. Oracle just lost THREE THOUSAND DOLLARS. Be proud Larry Ellison. I am taking all the records out of the fields and dumping them into blobs. Fuck this. Fuck this to fucking hell. Maybe this is why Oracle sucks so bad. Fuck them. I am going to go get a banana split at Dairy Queen, then I will start the dump.

UPDATE4: The guy from CDW finally called me back 3.5 hours later and told me he didn’t know whether the CPU licensing was per core or per CPU. So apparently somebody else is supposed to get back to me with that. But now I don’t care anymore, because I’m sick of this whole ordeal. Buying DB software should have been way easier than that. I should have been able to go to an online store and purchase a license key. I can even do that for Microsoft products (but I would never do it). Oracle is just messed up with its vendors. It’s counterintuitive. They either have to start selling licenses directly online or simply fade away, in my opinion. I am not going to go backwards now. Oracle is no longer an option. I am moving forward. If their sales are that bad, I can imagine what the support is like.

Daily link November 23rd, 2007

We’re about 40k short to launch the search

Because our customers pay us about 30-45 days after the fact on average, and because our bank freezes US checks for 30-60 days, we have to keep $42,000 locked up in operating expenses to run the outsourcing business: money available to pay employees, to restock beverages and other office items on a daily basis, and to cover enterprise internet, phone, utilities, commercial rent, and other expenses. In the last 2 months we got hit with $15,000 in federal and provincial interim taxes, plus various insurance and other yearly expenses.

That being said, I could take the 40k from the cash flow and hope it all works out, but I don’t know if I want to do that. You never know about the dependability of customers. You never know if or when you’re going to get receivables. I don’t want to risk it.

The huge bandwidth setup requires a multi-thousand-dollar deposit and a term commitment, and I am also missing some expensive key parts to finish building the small 7-server array to power the search, such as six quad-cores at $350 each and other parts I did not previously buy.

We make almost no money on the 12-month programmer contracts, so they tie up tens of thousands of dollars for very little benefit. When I make the 2008 business plan I am going to have to adjust some things. I think I am going to have to delay the launch of the search until at least January because of our other business.

I’m also going to email around and see if any one of our associates from pre-05 has 40k floating around that they could shoot us fast: 10k for gear plus 30k for a year of 100Mbps unmetered bandwidth. Some of our past associates from 2002-2004 (before we incorporated) are actually multi-million-dollar companies, so I am keeping my fingers crossed on this one. They can afford it; whether or not they think I can pull this off is the issue. I’ve pulled off some pretty amazing stuff though, so hopefully my history (I don’t like the word reputation and what it implies) will help.




Chris R. works at BeerCoSoftware.com (title: President of Development and Sales). This is Chris's work blog.

Disclaimer: BCS will not let personal views of any employee, including Chris, regarding any software product, company, standards or otherwise get in the way of any company that hires it to provide a solution. Companies pay BCS and BCS provides solutions regardless of the views of any employee. That’s part of being professional, and BCS is a professional software company.

Everything here is Chris's personal opinion and is not read or approved before it is posted. No warranties or other guarantees will be offered as to the quality of the opinions or anything else on this blog.

Blog at WordPress.com.