Chris R's Weblog

Daily link December 1st, 2007

I was just pinged by Rory Blyth

While this blog is more popular than his according to alexa.com, I still think it was nice of him to mention it.

Hopefully some coders from SoCal will see this and contact me about forming a team to promote the search engine in late June of 2008 in SoCal and SF. I am officially linking back.

Daily link December 1st, 2007

Reverse engineering Google 101 a live blogging event

http://forums.techcrunch.com/forums/thread.jspa?threadID=5976

GetmeGoogleBot
Posted: Dec 1, 2007 3:24 PM PST

TC forum time is messed up. I posted this original blog entry immediately after I posted the forum post on TC.

This was actually posted at about 7:10 pm EST for easy comparison

http://www.google.ca/search?q=GetmeGoogleBot

The pagerank on TC is high

TC has a PR of 7

The search will turn up results in about an hour or so.

http://www.deitylinux.org/test.html

This old custom Linux distro has a Pagerank of 3 which is very low.
The results here won’t show up for months if they show up at all.

UPDATE:

[[email protected] ~]$ date
Sat Dec 1 19:39:53 EST 2007

The Google bot just scanned Beercosoftware.com, which has a PR of 4.

The reference in this blog is now showing up approximately 30 minutes after it was posted.

The higher the pagerank the faster the bot sweeps. It does this recursively down the chain until it reaches PR1 sites.
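Here is a minimal sketch of what a pagerank-driven sweep schedule could look like. The doubling rule, the `base_minutes` value, and the example hostnames are all assumptions made up for illustration; nothing here is Google's actual schedule or the real Peeplr code.

```python
import heapq

def revisit_interval_minutes(pagerank, base_minutes=1, max_pr=10):
    """Hypothetical rule: every PageRank step down doubles the wait between sweeps."""
    return base_minutes * 2 ** (max_pr - pagerank)

def crawl_order(sites, horizon_minutes):
    """Return (minute, url) visits in the order a PR-driven scheduler would make them."""
    # Each heap entry is (next due time, url, pagerank); the earliest due site pops first.
    heap = [(revisit_interval_minutes(pr), url, pr) for url, pr in sites.items()]
    heapq.heapify(heap)
    visits = []
    while heap and heap[0][0] <= horizon_minutes:
        due, url, pr = heapq.heappop(heap)
        visits.append((due, url))
        # Re-queue the site for its next sweep.
        heapq.heappush(heap, (due + revisit_interval_minutes(pr), url, pr))
    return visits

if __name__ == "__main__":
    for minute, url in crawl_order({"high.example": 7, "low.example": 3}, 40):
        print(minute, url)
```

With this rule a PR7 site gets swept many times before a PR3 site gets its first visit, which is the behaviour described above.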

TC Forums must be queued right now, and will probably appear in the search results in the next hour. The forums may be penalized a bit since they are on a subdomain, while this blog is on the main domain in a subdirectory.

The point of this exercise is to show how Google bot works and how to emulate it most efficiently. This will be important in matching results up against Google. Notice that the other search engines are not using this recursive algorithm:

http://search.live.com/results.aspx?q=GetmeGoogleBot&go=Search&mkt=en-ca&scope=&FORM=LIVSOP

http://search.yahoo.com/search?p=GetmeGoogleBot&fr=yfp-t-501&toggle=1&cop=mss&ei=UTF-8&vc=&fp_ip=CA


So here is a pictorial about why Google is different. We will also have the entire Internet blown up in Tree form so we can work from the root of the Tree down to the leaves with our crawler, instead of doing STUPID crawling like the failure search engines.
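Working from the root of the tree down to the leaves is basically a breadth-first walk. A rough sketch, assuming the link structure is already available as a plain dictionary (the `children` mapping and page names are hypothetical, not the real crawler's data model):

```python
from collections import deque

def crawl_tree(root, children):
    """Visit pages level by level, starting at the root and ending at the leaves."""
    order = []
    seen = {root}
    queue = deque([root])
    while queue:
        page = queue.popleft()
        order.append(page)
        for child in children.get(page, []):
            # Skip pages already queued so shared links are crawled only once.
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return order

if __name__ == "__main__":
    links = {"root": ["a", "b"], "a": ["c"], "b": ["c", "d"]}
    print(crawl_tree("root", links))
```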

There is still one major difference between Google and ours and I will not go into that now.

This reverse engineering primer brought to you by Chris.

I’m going to try my hand at being a Googlr now.
DO YOU KNOW WHO I AM????

I say, I say, DO YOU KNOW WHO IIIIII AM?

I’m ON MOTHERF’ING WIKI MOTHERF’ING PEDIA DAMMIIITTTTTT!!!!

… I think I am getting the hang of this.

UPDATE 2: I am thinking the post on TC does not have enough peripheral text to be picked up by Google bot, so I just added some ipsum crap around it to see if that makes a difference.

UPDATE 3:

[[email protected] ~]$ date
Sat Dec 1 20:04:18 EST 2007

I just freshly reposted it with the ipsum crap so it meets the minimum text requirement for getting picked up. I am going for a couple hours. I am sure it will be indexed by the time I get back, or at least by tomorrow morning.

UPDATE 4:

[[email protected] ~]$ date
Sun Dec 2 08:00:03 EST 2007

As we can see here, the Google bot got the 2nd TechCrunch posting, which had enough peripheral lorem ipsum text to meet the minimum requirements of the Google scanner. Google stats say this was spidered 11 hours ago. Since it’s around 8 AM EST now on Dec 2, that places the bot hit at around 9 PM EST on Dec 1.
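The time math is easy to double check. A tiny sketch, rounding the check time to 8:00 AM (the variable names are just for illustration):

```python
from datetime import datetime, timedelta

checked_at = datetime(2007, 12, 2, 8, 0)   # about 8 AM EST on Dec 2
crawl_age = timedelta(hours=11)            # "spidered 11 hours ago"
crawled_at = checked_at - crawl_age        # lands on the evening of Dec 1

print(crawled_at.isoformat())
```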

[Image: http://farm3.static.flickr.com/2223/2079950203_6d517b736e.jpg?v=0]

TC moderators removed the post as spam, but not before the Google bot got it. The software they use for their forums is made by Jive Software, the same people who make the FOSS Wildfire Java-based Jabber server. It is full of bugs, and so is their forum software, as you can see by clicking on the deleted post. They charge 35k to license this software, which is essentially inferior to phpBB, which is free.

Most of the time when you see this insanely priced buggy software, it’s because somebody knows somebody at the company and they pity them, so they buy a license or they buy their whole company. Microsoft does this a lot. They even based Channel9 on community server software which was totally full of bugs and really shoddy in its initial incarnations. They advertise on Techmeme with multiple 5k adverts, and they also just bought this former employee’s unknown social networking website.

In the business world you don’t have to be good, you just have to be connected.

Well, enough sidetracking to blast business people. As you can see, my Google bot prediction was exact. It’s easy to see that the frequency of the bot sweeps goes hand in hand with the pagerank via an algorithm. I have pretty much duplicated this. This and pagerank itself are functionalities that are pretty easy to duplicate. The thing that will hinder our engine is that I had to crunch the finite data into a small space in order to deal with our less powerful hardware and speed issues. Otherwise I would have done the whole partitioning of queries across several machines. I would have needed a team of about 10+ people, but yeah, I would have pulled that off too.

UPDATE 5:

Meanwhile MSN and Yahoo are still blind to the web. This is why there is still room, after 10 BLOODY YEARS to advance in search. Google doesn’t have to be better than everybody, they just have to be NOT BAD. This is because the other teams ARE BAD. Imagine an industry where you win BY DEFAULT because the other people suck that bad. That’s the kind of game I want to play, and so I continue with Peeplr.

http://search.live.com/results.aspx?q=GetmeGoogleBot&go=Search&mkt=en-ca&scope=&FORM=LIVSOP

http://search.yahoo.com/search?p=GetmeGoogleBot&fr=yfp-t-501&toggle=1&cop=mss&ei=UTF-8&vc=&fp_ip=CA

Daily link December 1st, 2007

Linux is so easy I don’t even need to hack the memory manager

All I have to do is

A. add another syscall in unistd.h and syscalls.h

B. add the implementation as a .c file in the ipc directory

C. call alloc_pages(gfp, n_order) from the kernel, and voilà: megs of contiguous physical RAM per call (a single call tops out at 2^(MAX_ORDER-1) pages, so gigs take many calls), not mapped by the MMU, and irrelevant to the process space.

D. call the syscall from userland and jet the data to and from this magical physical page land, from the kernel to the program reading out the search *stack frame, which streams 1MB of data at a time into the regular expression matcher to compile a search.

I can precompile or even JIT millions of searches this way in a few minutes, and keep the top pagerank (or rankpos if you will) search data in GIGS of physical reserved RAM forever and ever.
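The userland half of this (feeding 1MB at a time into a regular expression matcher) can be sketched without touching the kernel. This is only an illustration: it reads from an ordinary byte stream instead of the custom syscall, and it assumes newline-delimited records, which the plan above does not specify.

```python
import io
import re

CHUNK_SIZE = 1024 * 1024  # 1MB per read, as in the plan above

def search_stream(stream, pattern, chunk_size=CHUNK_SIZE):
    """Yield every newline-delimited record that matches a precompiled pattern."""
    regex = re.compile(pattern)  # compile once, reuse for every chunk
    leftover = b""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        lines = (leftover + chunk).split(b"\n")
        leftover = lines.pop()  # last piece may be a record cut off mid-chunk
        for line in lines:
            if regex.search(line):
                yield line
    if leftover and regex.search(leftover):
        yield leftover

if __name__ == "__main__":
    data = io.BytesIO(b"alpha beer\nbeta\ngamma beer\n")
    print(list(search_stream(data, rb"beer")))
```

Carrying the `leftover` bytes into the next read is what keeps a record from being missed when it straddles two chunks.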

This would be absolutely impossible to do on Windows.

Tomorrow I will implement this and move on. It will suck that I cannot use precompiled kernel images any more, but what can you do.

I will be doing this kernel hack on the linux-2.6.23.tar.gz source.

* No, I don’t mean the stack frame as in the stack frame in a process’s stack memory. I am using that term for lack of a better word. I mean a 1MB data stream streamed into the user process from the syscall that interfaces with the search data in the 1-2 gigs of allocated physical RAM, 1MB at a time. When that 1-2 GB in RAM is exhausted, it will start pulling more data from small 1.3MB indexed disk files containing the n terabytes of remaining lower pagerank (RankPos) searchable data. Instead of fseek()ing a huge multi-terabyte file like InnoDB or DBM, it will go straight to the 1.3MB file it’s supposed to start at via the file table index.
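A rough sketch of the "go straight to the right small file" idea. The file naming scheme and the 5000 records per file figure are assumptions (5000 records of 255 bytes comes to roughly 1.3MB), and plain arithmetic stands in for the file table index:

```python
import os

RECORD_SIZE = 255          # bytes per record
RECORDS_PER_CHUNK = 5000   # assumption: 5000 * 255 bytes is roughly 1.3MB

def locate(record_id):
    """Map a global record id to (chunk file name, byte offset inside that file)."""
    chunk_no, slot = divmod(record_id, RECORDS_PER_CHUNK)
    return f"chunk_{chunk_no:08d}.dat", slot * RECORD_SIZE

def read_record(data_dir, record_id):
    """Open only the small chunk file that holds the record, then read it."""
    name, offset = locate(record_id)
    with open(os.path.join(data_dir, name), "rb") as f:
        f.seek(offset)  # a short seek inside a small file, not a multi-terabyte one
        return f.read(RECORD_SIZE)

if __name__ == "__main__":
    print(locate(12345))
```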

And through this, I will create a JIT search and term compiler, as fast as Google, though a lot crappier, since I had to crunch the finite data in order to keep it smaller in size for search purposes. I had to do this to target commercial hardware (store-bought, from tigerdirect.ca). All data records were crunched to 255 * 8 bits, including delimiters, in order to maximize speed. The full records will be expanded from the result id sets before they are cached for display by the GUI code.
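To give an idea of what a 255-byte record with delimiters might look like, here is a small sketch. The delimiter byte, the zero padding, and the field layout are assumptions for illustration, not the actual record format:

```python
RECORD_SIZE = 255   # 255 * 8 bits per record, delimiters included
DELIM = b"\x1f"     # assumption: ASCII unit separator between fields
PAD = b"\x00"       # assumption: zero bytes fill the unused tail

def pack_record(fields):
    """Join fields with the delimiter, then truncate or pad to exactly 255 bytes."""
    raw = DELIM.join(field.encode("utf-8") for field in fields)
    return raw[:RECORD_SIZE].ljust(RECORD_SIZE, PAD)

def unpack_record(record):
    """Strip the padding and split the record back into its fields."""
    parts = record.rstrip(PAD).split(DELIM)
    # errors="ignore" drops a multi-byte character that truncation cut in half.
    return [part.decode("utf-8", errors="ignore") for part in parts]

if __name__ == "__main__":
    rec = pack_record(["http://example.com/", "Example title"])
    print(len(rec), unpack_record(rec))
```

Fixed-size records are what make offset math possible: record n always starts at byte n * 255.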

Daily link December 1st, 2007

All episodes of Code Monkeys free on Google Video

Here is a link to all 12 episodes on Google Video. Yeah, I can’t believe it either. They’re all there up to episode 12. Enjoy! Below is #12.

Daily link December 1st, 2007

Repost of my TC comment on Jakob Lodwick

As per TC:

Jakob Lodwick, the co-founder of IAC owned video site Vimeo, left the company today. The reason? Apparently Lodwick didn’t see eye to eye with the IAC brass on creative issues, and specifically had a run in with IAC chief Barry Diller three weeks ago.

That’s not surprising, given the picture Lodwick chose to include with his goodbye post. A source close to Lodwick says “he was let go.”
[Image: http://www.techcrunch.com/wp-content/jakoblodwick.jpg]

This is just like the dell dude.

Look what happened to him.
http://www.engadget.com/2007/1…..-a-waiter/

Also I posted a video about the PCDOS founder also toking up:
youtube.com/watch?v=303F_qmtKnE

http://simple.wikipedia.org/wiki/Carl_Sagan

Carl Sagan also smoked pot, and the PCDOS founder I was referring to that smoked dope was of course Gary Kildall. Just copy and paste the video URL and wait a couple of minutes for the jacuzzi part for the reference.

I also found out that Jakob went to RIT. I went to FLCC only a few miles away. That made me especially sorry to hear this sad news.

Daily link December 1st, 2007

Today’s installment of “share a corporate email”

I only responded the way I did below because he ignored my initial response. Our tech stores page index data as objects, not as data dumps like Google. English language lexical analysis is useless because our engine already does it when it analyzes page data.

The spam below wasn’t the only spam I got. He sent me this one on Nov 29th also:

11/29/2007 12:46 PM

Chris,

How about a WebEx session to get started?

Mike Kennedy

-----Original Message-----
From: Chris [mailto:[email protected]]
Sent: Thursday, November 29, 2007 7:40 AM
To: Mike Kennedy
Subject: Re: AskWiki

Right back at you Mike,

We’ll license you our search backend. We’re not stupid either.

Thanks,
Chris - BCSC

Then he sent me the following in response to THE SAME REJECTION EMAIL. Like he answered my same, “would you license ours” response twice, as if I was going to respond differently if he tried it again. I was so utterly annoyed. Why would you email a search engine startup asking them to license YOUR search technology???? Wouldn’t it make sense that we already have that technology? At any rate, here is the last response exchange:

Final response:

Mike,

You have to be kidding me. Our search tech blows yours out of the water battleship style.

Go find yourself a customer buddy,
Chris - BCSC

08:51 AM today, the 1st:

Mike Kennedy wrote:
> Chris,
>
> Let’s have a WebEx session to determine exactly which modules you need.
> I am on-the-road all next week. How about Tuesday, August 11th at 11:00
> am Quebec Time (8:00 am CA time)?
>
> Thank you,
>
> Mike
>
> -----Original Message-----
> From: Mike Kennedy
> Sent: Thursday, November 29, 2007 9:47 AM
> To: ‘Chris’
> Subject: RE: AskWiki
>
> Chris,
>
> How about a WebEx session to get started?
>
> Mike Kennedy
>
>
>
> -----Original Message-----
> From: Chris [mailto:[email protected]]
> Sent: Thursday, November 29, 2007 7:40 AM
> To: Mike Kennedy
> Subject: Re: AskWiki
>
> Right back at you Mike,
>
> We’ll license you our search backend. We’re not stupid either.
>
> Thanks,
> Chris - BCSC
>
> Mike Kennedy wrote:
>
>> Hi Chris,
>>
>> One of my clients (Expert System http://www.expertsystem.net <http://www.expertsystem.net/>) provides the semantic search and natural language technology utilized by AskWiki.
>>
>> Expert System’s COGITO(r) is a linguistic technology that leverages linguistic analysis and semantics to facilitate the understanding and represent the meaning of unstructured information. Cogito identifies the concepts that are semantically more relevant which provides users specific, highly relevant information. The company is very unique. They have been shipping the technology for over ten years and have been profitable since day one.
>>
>> Attached you will find an IDC whitepaper describing our technology and a white paper we wrote to highlight the advantages of deep linguistic analysis in the management of unstructured information.
>>
>> Would a WebEx session to demonstrate and discuss our semantic search and natural language capability be of interest to you? *Please advise.*
>>
>> Thank You.
>>
>> Mike Kennedy

UPDATE:

Mike got back. At least he had a good attitude about it. I’ll give him that.

It’s called skill Mike, … it’s called skill.

BUT THANKS!

Chris - BCSC

Mike Kennedy wrote:
Chris,

Good Luck!

Mike

-----Original Message-----
From: Chris [mailto:[email protected]]
Sent: Saturday, December 01, 2007 7:52 AM
To: Mike Kennedy
Subject: Re: FW: AskWiki

Mike,

You have to be kidding me. Our search tech blows yours out of the water battleship style.

Go find yourself a customer buddy,
Chris - BCSC 

Daily link December 1st, 2007

Ah social health care, that’s why taxes are so high and unreasonable

So that’s why our gas is $5 a gallon, our cigarettes cost $13 a pack and our income taxes are enough to make communist Russia look like a tax shelter.

I understand now. I think I am coming to reason.

Oh wait, I have a horrible toothache (I’m not kidding). I have a cavity where the nerve is exposed.

I will have to go get a filling on Monday. But I paid tens of thousands of dollars in taxes so it’s ok right???

WRONG. I have to pay upwards of $200 to have a cavity filled because our national health care insurance doesn’t pay for any dental work.

When I was much younger and I worked at BK in NYS, we HAD health insurance at fucking Burger King.

So why doesn’t this insurance that costs people their livelihood cover dental work? Why aren’t vet bills covered for our pets even?

Because this is Bullshit. I’m sure the TENS OF MILLIONS needlessly spent on the Just for Laughs festival in Montreal would more than pay for that. Or perhaps the subsidies given to the monarchy or Celine Dion (really, what’s the difference) would pay for everybody, but they don’t. I will have to pay for my own filling, whether I have paid a king’s ransom in taxes or not.

The first prime minister was corrupt

Pacific Scandal

(1872–73), charges of corruption against Canadian prime minister John Macdonald in awarding the contract for a transcontinental railroad; the incident resulted in the downfall of Macdonald’s Conservative administration.

The recent ones are too.

and again with Chrétien.

They all get away with it. I can’t wait until Harper’s comes out.

The only way you can stand it is to ignore EVERYTHING, and pretend it didn’t happen. I am so fucking sick of this. I hate these people so much. Not ordinary Canadians, ala Trailer Park Boys, etc. I hate Celine Dion, who takes our money for her shitty music, I hate the govt, who takes our money and spends it on their own bullshit, and taxes us to death, and I hate the dumb looking Canadian flag, which was made by somebody who won a maple leaf drawing contest in the 1960s.

I saw Sicko, but I also know that employers have decent health plans. That’s not where the money goes here.

If I was a poor person in Canada, just making minimum wage, not on welfare, my cavity would continue to degenerate, and I would eventually have gum disease and perhaps die. The govt does not pay for the working poor.

That’s the truth about the situation. That being said I brush my teeth every day and even though I take care of my teeth I still have some problems. I am in pain right now with no recourse, as everything professional such as a dentist’s office is closed on Saturday here, and I can’t wait until Monday.

UPDATE:

Regarding the health care that is “free”: I had to wait 3 months to see a dermatologist, and my ex-girlfriend had to wait 4 months for hers. My mom got queued for 1 YEAR for cataract surgery on both eyes. If I am lying, let god (if he were to exist) strike me dead on the spot. This is the worst joke on earth. I am trading off a lot leaving here and going to Cali. It’s still worth it because it’s so bad. We have a family doctor: Dr. Morin. It doesn’t matter, because even with a referral you get queued forever because the specialists are constantly overloaded and they are paid like half as much here.




Chris R. works at BeerCoSoftware.com (title: President of Development and Sales). This is Chris's work blog.

Disclaimer: BCS will not let personal views of any employee, including Chris, regarding any software product, company, standards or otherwise get in the way of any company that hires it to provide a solution. Companies pay BCS and BCS provides solutions regardless of the views of any employee. That’s part of being professional, and BCS is a professional software company.

Everything here is Chris's personal opinion and is not read or approved before it is posted. No warranties or other guarantees will be offered as to the quality of the opinions or anything else on this blog.
