Friday, February 06, 2009
Linksys award-losing help desk
Nicholass: my new WVC54GCA cuts out... can't ping.. other attached devices to WRT54G are fine
Jhoneil (22042): Is the camera wirelessly to the router right now?
Nicholass: it works for a while... 24 hours later can't connect with the web admin via http or the untility
Nicholass: the camera is wireless always
Jhoneil (22042): Can you ping the camera when it's hard wired to the router?
Nicholass: its not hard wired to the router.
Jhoneil (22042): Okay.
Nicholass: signal streght is fine everywhere in the house.
Jhoneil (22042): Before we continue. May I know which country you are based in?
Nicholass: canada
Jhoneil (22042): Can you please confirm the model and version number of your device and it's serial number?
Nicholass: I can't do that see its mounted very very high
Jhoneil (22042): Oh! Is it possible to hardwired the camera to the router?
Nicholass: no
Jhoneil (22042): All right.
Jhoneil (22042): How about your router?
Jhoneil (22042): Can you please confirm the model and version number of your router and it's serial number?
Nicholass: the router is WRT54G v1.0
Jhoneil (22042): Any serial number, Nicholas?
Nicholass: its not accesable to me right now.
Nicholass: everything else like wireless printer, voip phones, 2 laptops, 1 desktop, ps3, nas200, and media extenders are fine
Nicholass: its the new WVC54GCA that not working
Jhoneil (22042): Okay. Please give me a minute. I will check my resources on how to troubleshoot the device.
Nicholass: it was working before I mounter it at 27ft celing like wireless cameras are supposed to be
Jhoneil (22042): Try to unplug and re-plug the camera power.
Nicholass: it works when i unplug and replug the power. but i cant keep doing that
Nicholass: the camera needs to work always
Jhoneil (22042): Yes. I understand.
Nicholass: i tired to unplug it a few days ago.. but the problem returns
Jhoneil (22042): Okay. You can try to upgrade the camera's firmware.
Jhoneil (22042): But you can only do that if the camera is hardwired to the router.
Nicholass: why does linksys sell a new camera with old firmware?
Nicholass: can you just send me a new camera with the firmware updated. I will return the old one
Nicholass: cant keep moving the camera around. you need to sell stuff that works. not stuff that sometimes work after you unplug and replug it
Nicholass: I bought the camera from newegg.ca
Nicholass: my address is
Jhoneil (22042): Old firmware? It's not an old firmware.
Jhoneil (22042): What I mean is e have to re-flash I mean the firmware.
Nicholass: who is e?
Jhoneil (22042): Re-load the firmware of the camera because we don't have an updated one.
Jhoneil (22042): *That's we, sorry.
Nicholass: But the camera is 27ft hieght
Jhoneil (22042): *What I mean is we have to re-flash the firmware.
Nicholass: I cant attach it to the network directly
Nicholass: thats why i bought a wireless camera and not a networked one.
Jhoneil (22042): Yes. I understand. But after you re-flash the firmware you can then put the camera back.
Nicholass: I cant move the camera again
Jhoneil (22042): I'm sorry, Sir. We can't troubleshoot if that is the case.
Nicholass: the documentation said that setup needs only done once... or after a reset
Jhoneil (22042): Yes. I know. If there is no issue that you encounter in the future.
Nicholass: why would i have an issue with a new camera? its like buying a new car that cant start 2 days after you drive it from the dealership..
Nicholass: the camera says it has waranty
Nicholass: you fix it. not me
Jhoneil (22042): How long have you been using this camera?
Nicholass: I don't take the car to dealership then fix it myself. some one like a mechanic fixes the problem. not me the customer
Nicholass: 2 days
Jhoneil (22042): Okay. If that is the case, then you can return the device, Sir. I am willing to fix the device. But we need to hardwire the device to the router.
Nicholass: I want a replacement camera with the right firmware for the defective camera i have
Nicholass: return the device to who?
Jhoneil (22042): I am capable of fixing devices.
Nicholass: I'm not
Jhoneil (22042): To where you purchase the device. I am sure the device is still under warranty.
Jhoneil (22042): So, you can replace the device to them.
Nicholass: newegg.ca said to call linksys
Jhoneil (22042): Okay.
Nicholass: what other options do i have?
Jhoneil (22042): Sir, our RMA procedure requires that we get the serial number of your device. However, in our current situation right now, I suggest you to contact our customer service as soon as you get the serial number.
Nicholass: RMA?
Jhoneil (22042): Yes. RMA is Return Merchandize Authorization.
Jhoneil (22042): You want replacement, correct? Not troubleshooting?
Nicholass: If you send me a box... like microsoft does, I will ship you the crappy camera. If you can... send me a replacement unit in the box... then I'll have something i paid for.
Nicholass: i paid for a working camera
Nicholass: I got a papper wieght
Monday, February 02, 2009
Why I think Canada Computers sucks
The moral of the story is, if BestBuy.ca or futureshop.ca carry the same product, go to either of those stores and make sure they price match the sucky Canada Computers. Trust me, they will. You don't even need a flyer. Just walk up to any internet-capable computer and point to the Canada Computers shitty website.
After you pay the discounted price, you can bet that these retailers will honor their commitments and gladly take back defective units. You don't even have to explain yourself. Just tell them it sucks!
PS: I got the replacement Linksys WVC54GCA from newegg.ca only because the big boys don't carry that particular item. Linksys usually fixes their mistakes with their newer models. I'll let you know how the Linksys WVC54GCA and newegg.ca measure up.
Tuesday, December 09, 2008
Cheap website load testing alternative
A site's PageRank could benefit greatly from being highlighted on a popular technology site like slashdot.org or digg.com, but that could come at the cost of the site being rendered useless by what is known as the Slashdot effect.
Generally a web server serving static content is fine and can handle the load via automatic web server caching. It's probably your database-driven website that could cause the problem.
If you run a small site as a hobby, you probably don't have access to exotic load testing equipment, so here is a cheap alternative.
Google Webmaster Tools could help. You just need to turn your crawl rate up to maximum. Just note that you don't have control over when it will happen, and it could affect your PR, so do this at an early stage, before your real PR campaign starts.
Dashboard > Settings > Crawl rate.
Monday, December 01, 2008
Sectorlink.com down!!!!
The moral of the story: Make sure you back up your site frequently just in case you need to change web host providers. Does anyone know how long it takes Google to evaporate all your PageRank? :(
Sunday, November 30, 2008
Freebase vs DBpedia
Looks like Freebase is winning the online database war and will become the Wikipedia of databases! JSON is making the difference and is more flexible than RDF.
Sunday, November 09, 2008
The secret Sauce of Wikipedia.org

It's hard to believe that Wikipedia.org exists, and not only does it exist, it is one of the web's most trafficked sites. Wikipedia founder Jimmy Wales is reported to have told a library group:
- 50% of all Wikipedia edits are done by 0.7% of users
- 1.8% of users have written more than 72% of all articles
Having an editable page is only half the story. Wiki stubs reduce the barriers to creating new wikis and provide landing pages for search engines, with an audience keen on the subject.
The audience arrives primed, having hopped from site to site, loaded with fresh ideas, almost compelled to press the edit button, just like a graffiti artist presented with a barren stone wall.
Thursday, January 31, 2008
The RSS Generation, Tagging, aideRSS for Scale

I've been busy lately reorganizing my online life. The main problem was that I am addicted to, and enjoy, staying on top of information, yet scaling was an issue. I don't want to have to stalk and hunt but rather be notified. Being notified also has other advantages, such as reducing spam and advertisements, incorporating attention, and prioritizing features. I found the solution in a mix of online tools centered around RSS.
I'm not new to RSS, as I have been using it since its inception. The difference is how I use it and why. Most notable websites use RSS, and not just for news. You no longer have to go to the original site unless you really need to.
My online RSS reader is Google Reader. I've gone through numerous RSS readers, both online and offline. Online readers are available from any machine with web access, be it work, home, public access or even mobile. I use Google Reader for 4 main reasons.
- Searching/History - If I'm running out of time, the feed accumulates.
- Tagging - Allows consolidating, prioritizing and exporting via public feeds for such things as your blogroll, desktop widgets and other RSS readers and services
- Mark all as read - This carries over to iGoogle/Google Mobile Reader so you don't have to worry about rereading marked headlines.
- I can export and move to another RSS reader if the tech is superior
These are the tags I sort my subscriptions under:
- blog - All my favorite bloggers
- local - Country, City, Government, transportation news and weather
- pulse - FriendFeed.com, Plaxo.com, and Facebook.com allow you to track your friends
- search - recurring search terms that I stalk via Yahoo Search and Google News
- research - ArXiv.org & Google Scholar
- social - Del.icio.us, Digg, ReadBurner, reddit, Slashdot, TailRank
- Finance - Finance news on the Financial products with Stock Quotes via Yahoo Pipes
- forum - Most products I buy and use have forums. Those forums have feeds.
- groups - Yahoo and Google Groups also support feeds
Managing your attention and time is key. You know which Tags are important to you and why. You will naturally read your subscriptions by Tags. Some Tags will grow, especially those similar to my social tag.
For my social tag, I make it public and feed it to aideRSS.com, producing another consolidated RSS feed that I reload into Google Reader. Let aideRSS.com score, filter and track performance, giving you a smaller, more manageable list.
I track 1000s of feeds a day and growing. Read only what matters.
Friday, January 25, 2008
Freebase the internet's Database

In this brave new 2nd digital age, conventional ideas get retrofitted, forming new ones. A sprinkle of social graph and a dab of web2.0. Freebase is the new internet shareable database, built with both users and web developers in mind. Where Wikipedia ends, Freebase begins. It's different from Swivel in that it lets users extend each other's data and offers little in the realm of interpretation. However, like Wikipedia, I see it going mainstream. The viral edge it has is that it allows your average developer, with few resources but with a free blogger-type platform, to create persistent widgets/mashups with just a little client-side JavaScript. An example engine could be mjtemplate.org. I also envision a future where programming, especially web programming, would be a basic skill such as using a calculator. A few years from now, maybe a little over a decade, we will all be considered average developers. That gives a platform like Freebase a whole lot of appeal. In the meantime, the only thing against it is time, as immediate value in ROI is usually asked for in years, not decades. Good luck.
Thursday, January 24, 2008
Swivel your data!

Swivel.com is a relatively new site with great ambitions. It's the Wikipedia of datasets. Users are encouraged to upload and share their data. Swivel converts them to charts and analyzes them to see if anything that was previously submitted correlates. Correlation isn't instant but batched nightly. Swivel then suggests correlations and their coefficients, encouraging users to comment on the results. Users can't directly edit others' previously uploaded charts but can mash them with others. If they really wanted to, they could export a dataset and re-import it. I really like this site because it goes beyond just displaying and hosting user data and finds interesting, and potentially surprising, facts! Check it out!
Friday, January 18, 2008
Buyer Beware - HP Protecttools Security Manager
All I had to do was scan my fingerprint once and it would remember and auto fill my various user names and passwords across other applications and even websites.
The problem is when you start to rely on the technology and something goes wrong. Which it will!!!
I can still log into Windows the old way, but HP Protecttools Security Manager biometrics is locked out. After I contacted HP support, they recommended uninstalling and reinstalling, which will also delete my profile and the 100s of user names and passwords it has learned to remember for me. There is no export feature.
So the technology defeats its own purpose!! Remember passwords so I won't have to. Now that I have forgotten all those strongly protected user names and passwords, I'm gonna have 1 headache after another.
Again, this is why OpenID will work, and a lesson for biometrics to stop pretending to be something they're not!
Thanks HP!!
The computer is personal again => The computer is a personal headache again!
Wednesday, September 27, 2006
Google can do its own Competitive Market Research with its Google Analytics Blackbox

But what can Google get in return: Competitive Market Research regarding the Search marketplace.
Using Google Analytics Aggregated data Google can collect statistics regarding other search engines and how they refer you to sites.

In effect, what keywords they refer vs. what keywords others refer. They can also get raw counts of the number of unique visitors and total visitors by referrer.
This is only possible because Google Analytics is a centralized web application maintained and operated by Google, vs. AWStats, another commonly used analytics platform, which is installed, maintained and operated by the webmaster.
Food for thought: What other web applications have shifted paradigms, and what effect and potential does all this newly related aggregated data have?
1) I imagine that Google Spreadsheets can be used to aggregate related ideas in effect creating a sort of GoogleSets.
2) I imagine that GoogleTalk can be used to aggregate written context in effect creating Artificial Intelligence Chatbots with similar effects of Jabberwacky.com and beyond
3) I imagine that Google Writely can be used to aggregate thought, not just what you write but more specifically when you write it, how you write, and how you revise, including the whole process etc
I imagine in effect borrowing the collective intelligence from its users.
In the end both Google and You win.
Monday, January 16, 2006
Preventing Web Scrapings
The internet is definitely an interesting place.
By design, it proliferates content sharing unlike any medium before it. At the same time content providers still want to maintain some control over their work.
Copyright is very difficult to enforce. Individuals can simply cut and paste at will, passing your work off as their own within a few keystrokes.
It gets even worse. Programs can automate this. These types of programs are known as bots. Bots can crawl your site within seconds of you posting your new content and plagiarize in real time.
At the same time, similar bots known as search engines are desired to index your site and drive traffic your way.
As a content provider there is very little you can do.
Any real attempt to control this type of behavior will no doubt also affect you negatively by inhibiting search engines as well. Why post if no one will come to read it?
It is definitely an ongoing balancing act, and in the long term the odds are against you.
Before suggesting a framework let's examine what other content providers are doing.
1. Requiring User Authentication
By requiring users to authenticate, user activity can be monitored and controlled. We can limit each individual to a certain speed limit, be it kilobits/sec, links/sec etc. However, bots have been programmed to create multiple accounts.
2. Asking for personal information
By constantly verifying different past recorded user information we can ensure we have the right user. Bot masters have been known to create their own sites, mixing and mashing user data they have collected and trying it out on other sites. Just ask yourself how many of your passwords are unique for each and every site. What's stopping a web master from using your own information against you?
3. CAPTCHA. Text in Image Verification
Currently Bots have difficulty recognizing text in images. By asking users to verify text in images, this can stop or at least slow down bots from further processing your site by also incurring a processing time cost.
4. We can slow down all activity to a rate traversable by humans
5. Converting content to images. This technique is a little more difficult to maintain. The text in these images does not contribute to your PageRank, and users are also prevented from cutting and pasting. Bots can use OCR algorithms to retype the images back into actual text; however, this technique is quite strong
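The per-user speed limit from option 1 (and the human-paced slowdown in option 4) could be sketched as a token bucket. This is only an illustration in Python: the class name, the `allow_request` helper and the default rate of one request per second with a burst of five are my assumptions, not something any particular site uses.

```python
import time


class TokenBucket:
    """Allow `rate` requests per second on average, with bursts up to `capacity`."""

    def __init__(self, rate, capacity, now=None):
        self.rate = rate
        self.capacity = capacity
        self.tokens = float(capacity)
        self.last = time.monotonic() if now is None else now

    def allow(self, now=None):
        now = time.monotonic() if now is None else now
        # Refill in proportion to the elapsed time, capped at the bucket size.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


buckets = {}  # one bucket per authenticated user (hypothetical in-memory store)


def allow_request(user_id, now=None, rate=1.0, capacity=5):
    """Return True if this user is still under the speed limit."""
    if user_id not in buckets:
        buckets[user_id] = TokenBucket(rate, capacity, now)
    return buckets[user_id].allow(now)
```

A bot hammering one account runs out of tokens after the initial burst, while a human clicking links rarely notices. It does nothing against the multiple-account trick mentioned above.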
Eventually bots will adapt to the above technologies. Let's explore an alternative framework.
When a technology is first introduced, it is the hardest for bot masters to catch up with. Wrapping your content in Flash or applet-like containers such as ActiveX currently gives the content provider the best control. If a technology like Flash has been around for a while, moving to the latest and greatest new widget is the best strategy. All you really need to do is create a simple template to follow. You can reuse this template, but most if not all bots will be stopped dead in their tracks.
As for search engine marketing, you still need good old basic html. Just keep your latest content inside these new widgets.
Tuesday, November 22, 2005
The death of PageRank
Google has been highly successful as a search engine due to four main factors: viral word-of-mouth advertising, simplicity, speed of access, and relevancy.
Among others, PageRank is Google’s premium relevancy algorithm. However, there is a rumor that approximately 10,000 other parameters decide the overall relevancy of a given query to its list of top relevant sites.
This might be accomplished via some sort of polynomial similar to
OverallRanking(site) = w1*a1(site) + w2*a2(site) + w3*a3(site) + ... + wn*an(site)
site = the site URL,
w = weighting factor,
a = algorithm such as PageRank.
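A minimal sketch of that weighted sum in Python. The two scoring functions, their scores and the weights are invented for illustration; Google's real signals and weights are not public.

```python
def overall_ranking(site, weighted_algorithms):
    """Sum w_i * a_i(site) over every (weight, algorithm) pair."""
    return sum(weight * algorithm(site) for weight, algorithm in weighted_algorithms)


# Hypothetical scoring functions with made-up scores.
def toy_pagerank(site):
    return {"a.example": 0.8, "b.example": 0.3}.get(site, 0.0)


def toy_keyword_match(site):
    return {"a.example": 0.2, "b.example": 0.9}.get(site, 0.0)


signals = [(0.7, toy_pagerank), (0.3, toy_keyword_match)]
ranked = sorted(
    ["b.example", "a.example"],
    key=lambda site: overall_ranking(site, signals),
    reverse=True,
)
```

Changing the weights changes the order without touching the algorithms, which is the appeal of this kind of formula.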
The original premise behind the algorithm is a site’s PageRank is increased when more sites link to it via a referring link from another page.
It is believed the algorithm has been re-designed to have a site’s PageRank increased when more sites of similar topic link to it via a referring link from another page.
As in all networks of this type, this approach has good results when there are a few nodes (sites) with a few links (hyperlinks) between them; however, saturation does eventually occur.
When saturation and eventually breakdown occur, the algorithm will no longer return relevant results. In fact, it would become no better than returning random results.
The breaking factor for this type of network is proportional to its size. Smaller networks break down when all the sites are linked to every other site and vice versa. But when a network is the size of Google, let's say 8 billion (last reported), the breaking fraction could be a lot smaller. Imagine only 1,000,000 sites each linking to 1,000,000 other sites are needed for this catastrophe to be put into motion. That’s only 0.0125%.
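The percentage is simple arithmetic, spelled out here in Python with the figures from the paragraph above (the variable names are mine):

```python
index_size = 8_000_000_000  # sites in the index, the "last reported" figure
cluster_size = 1_000_000    # heavily interlinked sites in the thought experiment

share_percent = cluster_size / index_size * 100
links_in_cluster = cluster_size * cluster_size  # each site linking to 1,000,000 sites

print(f"{share_percent:.4f}% of the index, {links_in_cluster:,} links")
```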
In the not so distant future, spam networks and social networks will reach this milestone. Is Google aware of this? Why does the Google Sandbox exist? If push comes to shove, they have the option not to upgrade their index.
Still skeptical? Just Google for “Google Bomb”. The underlying side effect is quite similar, just on a different scale.
Possible band-aid solutions do exist.
1) The introduction of a negative SpamRank could extend the life span of a network, but as it continues to grow, the SpamRank would suffer the same fate.
OverallRanking(site) = w1*PageRank(site) - w2*SpamRank(site)
However, both PageRank and SpamRank rely on initially seeding with a large number of good sites (dmoz.org was once Google's source for PageRank) and bad sites.
2) Consider linking to a site to be a vote for that site (conventional PageRank). What about all the other sites? There is no penalizing factor. What if linking to a site is a vote for that site and a negative vote for every other site, or a negative vote for every other site with the same topic (competitors)? Although the longevity and relevancy of this strategy are far superior, it is more CPU intensive and hurts the speed-of-access bottom line.
Thursday, September 29, 2005
My Common Sense Artificial Intelligence project
The theory behind my research is to cluster information into sets, made up of related elements (words to start with, followed by concepts, ideas etc) and their potential hierarchies.
This information is mined from the internet using similar search engine based techniques.
Once these base sets of hierarchies and related items are numerous, say a few million records, they will be used as my seed set.
I will pass large volumes of data, say the free Wikipedia encyclopedia, openmind data, learner data, and mind pixel data amongst others, over these hierarchies and sets, in effect stimulating various hierarchies and sets and saving them as higher order hierarchies and sets. This process will be continually repeated, limited only by my storage capabilities.
Once retention becomes mandatory I will begin deleting random elements that have relatively low stimulations.
Querying the system will be simply a matter of passing information over the hierarchies and watching the stimulus in action.
The result will be passed to heuristics and decision trees, which will also be processed in a similar manner.
The system will try to predict the incoming stream of information prior to arrival and score its accuracy in a similar manner.
By the way, my research has no funding and is a hobby in nature.
Comments?
Tuesday, September 27, 2005
Penalizing Super Sites for the overall good of the people (little guys)
The problem is they're so good they're bumping out the little guy. Needless to say, there are more little guys out there; they make up about 99% of the internet. We need to do something!
Similar to http://mindset.research.yahoo.com/, where users can sort their search results to favor shopping or research, it must become practice to penalize overly ambitious sites. Let's say the upper 10%.
This way we will get two birds with one stone, including the content spammers. Give the power back to the people!
Wednesday, August 31, 2005
Dealing with Clustering of Common Words, Stop Words and Popular Words in Data Mining Algorithms
When leveraging data mining from generic sources to create algorithms such as Google Sets, Google Suggest, Question Answering, and Reversing Google Sets, the mechanism deployed will undoubtedly create its own “list of words” that are inherent in the underlying data and skew the results.
These words can be in the true set of English popular words, such as “the”, “and”, “of”, but also outside it. Filtering only the true set out might have an undesirable effect while still leaving results unfiltered and of poor quality.
Just like the English written language has its own list of “popular words”, it is possible that the English spoken language has a large set of words outside the written one. For example, “fuck” is a popular spoken curse word.
As mentioned in my previous few posts on “How Google Sets Works” and similarly on “How Google Suggest Works”, “Google Sets” exploits table and list data in millions of web sites.
However, list data is statistically overwhelmed by the word “Introduction”. The most probable cause is the “Table of Contents” in sites.
Similarly, table data is statistically overwhelmed by such words as “Month”, “Cost”, “Sale” and “Revenue”. The most probable cause is financial sites.
Why does “Lyrics” occur so frequently while using Google Suggest?
These are all examples of features inherent in the underlying mechanism and not necessarily in the data.
Specifically, I recommend creating a separate exclusion list where all words are ranked by frequency, or by any other formula used to strengthen one element over another.
It would then be a trivial exercise to exclude these words:
Select goodelement from allelements
Minus
Select goodelement from allelements where rank < 100
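The same exclusion can be sketched outside the database. A minimal Python version, assuming the elements arrive as a flat list of words and that "rank" simply means position in a frequency count; the function names and toy data are mine:

```python
from collections import Counter


def build_exclusion_list(all_elements, top_n):
    """Return the top_n most frequent elements, i.e. the best-ranked ones."""
    counts = Counter(all_elements)
    return {word for word, _ in counts.most_common(top_n)}


def good_elements(all_elements, top_n=100):
    """Everything in all_elements minus the exclusion list, in first-seen order."""
    excluded = build_exclusion_list(all_elements, top_n)
    kept, seen = [], set()
    for word in all_elements:
        if word not in excluded and word not in seen:
            seen.add(word)
            kept.append(word)
    return kept


# Toy corpus: the structural words dominate, as they do in real list and table data.
corpus = ["introduction"] * 5 + ["month"] * 4 + ["oracle", "sybase", "oracle"]
filtered = good_elements(corpus, top_n=2)
```

Because the list is built from the data itself, it picks up words like "introduction" that no standard English stop list would contain.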
Questin.net is here. Come play hands on!!!
Friday, August 19, 2005
A SQL based Ranking Algorithm for Search Results
Many web sites and even applications today are empowered by search-engine-like features although their main business isn’t search. However, the results continue to be of poor quality compared with their big brother counterparts. (Google, Yahoo, MSN)
I’m proposing a SQL based approach that gives favorable results and flexibility of quickly changing the underlying business rules.
The algorithm has the requirements that a precompiled list exist with the columns
Keywords
Resources
The algorithm assumes the user will be provided a simple query box, where s/he will be able to supply a list of keywords separated (delimited) by spaces.
Query
Features
Point/Priority Based
Spelling mistakes if keywords are phonetically spelled
Multiple keywords (with the flexibility to add minus keywords)
Variations of Entire Phrase Matching
Variations of Word within Phrase Matching
The Structure
Tables – Description
IndexTable – Mappings from the resource to the keywords. Keywords are delimited within the field by spaces.
IndexTableDetail – Mappings from keyword to IndexTable. One keyword per row.
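-- Column types are assumed; the original listed column names only.
Create table IndexTable (
IndexTable_ID INTEGER,
URL VARCHAR(2000),
Keywords VARCHAR(4000)
);
Create table IndexTableDetail (
IndexTableDetail_ID INTEGER,
IndexTable_ID INTEGER,
URL VARCHAR(2000),
Keyword VARCHAR(255)
);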
The Algorithm
' Double any single quotes so user input cannot break out of the SQL string literals.
SafeQuery = Replace(Query, "'", "''")
SQL = "select * from IndexTable, (select URL, sum(pt) pt from ( "
If SafeQuery = "" Then
    ' No query: give every resource the same base score so everything is listed.
    SQL = SQL & " SELECT 5 pt, URL, Keywords FROM IndexTable union "
Else
    ' Whole-phrase matches: exact, sounds-like, prefix, suffix.
    SQL = SQL & _
        " SELECT 10 pt, URL, Keyword FROM IndexTableDetail where Keyword = 'iQuery' union " & _
        " SELECT 2, URL, Keyword FROM IndexTableDetail where soundex(Keyword) = soundex('iQuery') union " & _
        " SELECT 7, URL, Keyword FROM IndexTableDetail where Keyword like 'iQuery%' union " & _
        " SELECT 5, URL, Keyword FROM IndexTableDetail where Keyword like '%iQuery' union "
End If
SQL = Replace(SQL, "iQuery", SafeQuery)
TermsArray = Split(SafeQuery, " ")
' Word-within-phrase matches, only when the query has more than one word.
If UBound(TermsArray) > 0 Then
    For i = 0 To UBound(TermsArray)
        SQL = SQL & " SELECT 10, URL, Keyword FROM IndexTableDetail where Keyword = '" & TermsArray(i) & "' union "
        SQL = SQL & " SELECT 2, URL, Keyword FROM IndexTableDetail where soundex(Keyword) = soundex('" & TermsArray(i) & "') union "
        SQL = SQL & " SELECT 1, URL, Keyword FROM IndexTableDetail where Keyword like '" & TermsArray(i) & "%' union "
        SQL = SQL & " SELECT 1, URL, Keyword FROM IndexTableDetail where Keyword like '%" & TermsArray(i) & "' union "
    Next
End If
' A zero-point dummy row closes the trailing "union"; it is filtered out by pt > 0.
SQL = SQL & " select 0,'','' ) x group by URL) y where "
SQL = SQL & " y.pt > 0 and IndexTable.URL = y.URL "
SQL = SQL & " order by "
If SafeQuery = "" Then
    SQL = SQL & " IndexTable.RowID desc, "
End If
SQL = SQL & " y.pt desc "
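For anyone who wants to try the scoring without an ASP page, here is a rough Python and SQLite sketch of the same point system. It is not the code behind any real site: the `ranked_search` function and the toy rows are mine, the soundex rule is left out because stock SQLite builds usually lack `soundex()`, the empty-query branch is skipped, and bound parameters replace string concatenation.

```python
import sqlite3


def ranked_search(conn, query):
    """Return (URL, points) rows, best first, using the point values above."""
    if not query.strip():
        raise ValueError("this sketch only handles non-empty queries")
    parts, params = [], []

    def add(points, operator, value):
        # Bound parameters keep user input out of the SQL text.
        parts.append(
            f"SELECT ? AS pt, URL, Keyword FROM IndexTableDetail WHERE Keyword {operator} ?"
        )
        params.extend([points, value])

    # Whole-query matches: exact, prefix, suffix.
    add(10, "=", query)
    add(7, "LIKE", query + "%")
    add(5, "LIKE", "%" + query)

    # Per-word matches, only when the query has more than one word.
    terms = query.split()
    if len(terms) > 1:
        for term in terms:
            add(10, "=", term)
            add(1, "LIKE", term + "%")
            add(1, "LIKE", "%" + term)

    sql = (
        "SELECT URL, SUM(pt) AS total FROM ("
        + " UNION ".join(parts)
        + ") GROUP BY URL ORDER BY total DESC, URL"
    )
    return conn.execute(sql, params).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE IndexTableDetail (URL TEXT, Keyword TEXT)")
conn.executemany(
    "INSERT INTO IndexTableDetail VALUES (?, ?)",
    [
        ("http://a.example", "stock"),
        ("http://a.example", "quotes"),
        ("http://b.example", "stockholm"),
        ("http://c.example", "weather"),
    ],
)
results = ranked_search(conn, "stock")
```

As in the original, `UNION` drops identical rows, so a keyword that matches both the prefix and suffix rule with the same points is only counted once.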
In Practice
To see a working website using this particular algorithm check out www.stocko.cc
Tuesday, June 28, 2005
A Generic Algorithm for Classification of Sets (Reversing Google Sets)
There have been numerous posts around the net centered around Natural Language Processing and Ontology discovery as they can be applied to Artificial Intelligence. This is my motivation as well.
In previous posts, I described an algorithm for set member discovery using generic web content as the untrained data corpus.
For examples of similar implementations, just look at Google Sets, Google Suggest, the Google Keyword Tool and the Overture Query Term Suggestion tool.
However, finding the set name the members actually belong to, or even probable set names/ontologies, is not even attempted.
Static man-made ontologies can be found in the popular WordNet application to a limited degree, but I need something machine scalable.
The proposed algorithm is as simple, and as scalable, as its predecessor for discovering the set members to begin with. Its output is also comparable to WordNet, except that it can grow many thousands of times larger if implemented successfully, including real persons' names, obscure element members and up-to-date relevant content. However, like the original algorithm, it suffers the same fate of less than perfect quality.
In the original post, I hinted on applying the same algorithm on the many web pages meta keywords tag.
These keywords are themselves similar related but refined to be in the context of descriptions.
It's an assumption that these descriptions are deeply interrelated and sometimes contain at least 1 related ontology with at least 1 field element member. Alternative sources can also be derived from search query terms, similar to Google Suggest or the Overture Suggestion tool. (See my previous post on Reverse Engineering Google Suggest)
If the original sets corpus was large enough, it itself might even contain those as sets.
To refine the original algorithm we could even include these meta tags along with the fields of tables and lists as well.
The problem now is to relate the sets of the algorithm on tables and lists (referred to henceforth as set #1) to similar sets of the same algorithm on keywords/ontologies (referred to henceforth as set #2).
Left for another article is the development of another algorithm, able to recognize sets vs. keyword ontologies.
Using the Example
Oracle, Sybase
The Simple Set (using Set Algorithm #1)
select PHRASE from (
Select
PHRASE,
count(SETS_ID)
from
SETS
where
SET_ID in (
select SET_ID from SETS where PHRASE in ('Oracle','Sybase')
)
group by
PHRASE
order by
count(SETS_ID) DESC
)
where rownum < 15
The Results
PHRASE
Oracle
IBM
Microsoft
Intel
Sybase
Compaq
Novell
Apple
Dell
SQL
Hewlett Packard
Cisco
Visual Basic
Java
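The query boils down to co-occurrence counting: take every set that contains a seed, then rank all phrases by how many of those sets they appear in. A small in-memory sketch in Python, with made-up toy sets standing in for the SETS table (the function name and data are assumptions for illustration):

```python
from collections import Counter


def expand_set(sets, seeds, limit=14):
    """Rank phrases by how often they occur in sets that contain any seed."""
    seeds = set(seeds)
    counts = Counter()
    for members in sets:
        if seeds.intersection(members):  # same role as SET_ID in (...) in the SQL
            counts.update(members)
    # limit=14 mirrors "rownum < 15".
    return [phrase for phrase, _ in counts.most_common(limit)]


# Toy data: each inner list is one table column or list scraped from a page.
toy_sets = [
    ["Oracle", "Sybase", "IBM", "Microsoft"],
    ["Oracle", "IBM", "Microsoft"],
    ["Oracle", "IBM"],
    ["Sybase", "Oracle"],
    ["Apple", "Dell"],
]
expanded = expand_set(toy_sets, ["Oracle", "Sybase"])
```

The last toy set never touches a seed, so Apple and Dell stay out. In the real results above they got in because some scraped set listed them alongside Oracle or Sybase.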
The Simple Keywords (using the output of Set Algorithm #1)
select PHRASE,CNT from (
Select
PHRASE,
count(SUGGESTS_ID) CNT
from
SUGGESTS
where
SUGGEST_ID in (
select SUGGEST_ID from SUGGESTS where PHRASE in
('Oracle','IBM','Microsoft','Sybase')
)
group by
PHRASE
order by
count(SUGGESTS_ID) DESC
)
where rownum < 15
The Results
PHRASE CNT
oracle 1304
ibm 594
in 348
sybase 65
server 48
sql 48
database 45
table 38
download 36
of 35
for 33
index 32
9i 31
tables 28
The Results
Notice that we must remove common words and any occurrence of the set members from the final results.
PHRASE
server
sql
database
table
download
index
9i
tables
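The filtering step is easy to reproduce. A short Python sketch using the counts from the table above; the stop list of common words is my assumption, picked to cover the three that show up here:

```python
keyword_counts = [
    ("oracle", 1304), ("ibm", 594), ("in", 348), ("sybase", 65),
    ("server", 48), ("sql", 48), ("database", 45), ("table", 38),
    ("download", 36), ("of", 35), ("for", 33), ("index", 32),
    ("9i", 31), ("tables", 28),
]
set_members = {"oracle", "ibm", "microsoft", "sybase"}  # the input set, lowercased
common_words = {"in", "of", "for"}  # assumed stop list

# Keep the ranking order, drop set members and common words.
set_names = [
    phrase
    for phrase, _ in keyword_counts
    if phrase not in set_members and phrase not in common_words
]
```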
Improvements
1. Order the Set names into hierarchies.
The applications of this type of technology are endless. I will be providing examples in upcoming articles.
Try to come up with some on your own, before I taint your creativity with mine.
Check it out first hand @ questsin.net
Monday, June 27, 2005
In Search for Answers (Another Algorithm for Generic Question Answering)
There are always multiple ways of solving the same problem. It's only fitting that I propose another angle for question answering. This approach is the more traditional of the two. However, I do provide a few twists and potential improvements toward the side of complete machine automation and tuning.
The Assumptions
1. Answers will appear in text, containing 80% of the original question, and/or vice versa.
2. The variations can be attributed to order, tense, spelling, form variations, synonyms etc of the words
3. We can get extra information around the type of answer expected by examining the inclusion of special words:
who, what, when, where, why, how etc.
Is this how BrainBoost.com works? Examine for yourself with "Who is the father of data warehousing"
The Strategy
1. Search via a search engine for permutations and variations of the question. Try using the Gigablast, Yahoo or Google API. Hence "Searching for Answers".
2. Score the snippets and return the results. Assume that a good snippet contains most of the terms, all within 500 characters of each other. Multiple answers having the same keyword proximities (see the previous few posts) can be assumed accurate and verified.
3. Remember the essence of the transformations required to find the right answer, so it can be incorporated in future searches for answers*.
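One way to read the scoring step is sketched below in Python. It is a rough sketch under my own assumptions: the function name, the plain substring matching, and the choice to score by the fraction of terms found inside one 500-character window are all illustrative, not a description of how any real engine scores snippets.

```python
import re

def snippet_score(snippet, terms, window=500):
    """Fraction of distinct terms that occur together inside one window of characters."""
    wanted = {t.lower() for t in terms}
    if not wanted:
        return 0.0
    text = snippet.lower()
    # Every (position, term) occurrence, ordered by position.
    hits = sorted((m.start(), t) for t in wanted for m in re.finditer(re.escape(t), text))
    best = 0
    for i, (start, _) in enumerate(hits):
        in_window = {t for pos, t in hits[i:] if pos - start <= window}
        best = max(best, len(in_window))
    return best / len(wanted)

close = snippet_score("George Walker BUSH was born on 6 Jul 1946", ["George", "Bush", "born"])
far = snippet_score("George" + " " * 600 + "Bush", ["George", "Bush"])
print(close, far)
```

Snippets scoring near 1.0 contain nearly every term close together; the second call shows the score dropping when the terms are more than 500 characters apart.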
Let's walk through an example
Question:
How old is George Bush?
This can also be rewritten many ways as a question. Some possible variations are:
1. Do you know how old George Bush is?
2. When was George Bush born?
3. What age is George Bush?
4. What is George Bush's age?
5. George Bush's age?
6. What is George Bush's birthday?
7. What is George Bush's date of birth?
8. When was George Bush's born?
...
In this context
(old, age, birth date, date of birth, born) are all inter-related properties
(is) can be ignored, as it is very popular
(When) is asking for a time-based answer, say "date.*"
Possible answers could have appeared as
1. George Walker BUSH was born on 6 Jul 1946
2. Mr Bush was born July 6 , 1946
3. George Bush - George Bush Born: June 12, 1924
Notice some answers are correlating and some are contradicting.
Essence of the transformations
1. * * * was born on date.*
2. Mr * was born date.*
3. * * - * * Born: date.*
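The reduction from answer snippets to their "essence" could be sketched like this in Python. The date regular expression, the function name, and the explicit list of name words (including "Walker", which is not in the question) are assumptions made for the illustration.

```python
import re

MONTH = r"(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*"
# Matches "6 Jul 1946" as well as "July 6 , 1946" and "June 12, 1924".
DATE_RE = re.compile(
    rf"\b(?:\d{{1,2}}\s+{MONTH}\s+\d{{4}}|{MONTH}\s+\d{{1,2}}\s*,\s*\d{{4}})\b"
)

def essence(snippet, object_words):
    """Replace dates with 'date.*' and the object's words with '*'."""
    text = DATE_RE.sub("date.*", snippet)
    names = {w.lower() for w in object_words}
    return " ".join("*" if token.lower() in names else token for token in text.split())

name_words = ["George", "Walker", "Bush"]  # assumed to be known in advance
for answer in (
    "George Walker BUSH was born on 6 Jul 1946",
    "Mr Bush was born July 6 , 1946",
    "George Bush - George Bush Born: June 12, 1924",
):
    print(essence(answer, name_words))
```

The stored patterns can then be matched against new snippets for other people, where only the name and the date change.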
The Algorithm
1. Using all the criteria from the query expand each word into its sets (see Sets Algorithm).
2. Try searching for every combination, including leaving out words.
3. Gather all the returned snippets and rank for quality
4. Remember the essence of the transformations
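Steps 1 and 2 can be sketched with Python's itertools. The `sets` mapping stands in for whatever the Sets Algorithm would return for each word; its contents here, the function name and the `min_words` cut-off are assumptions for the example.

```python
from itertools import combinations, product

def query_variants(words, sets, min_words=2):
    """Expand each word into its set and try every combination, including leaving words out."""
    variants = []
    for size in range(len(words), min_words - 1, -1):
        for subset in combinations(words, size):
            # Each word may be replaced by any related member of its set.
            options = [[w] + sets.get(w, []) for w in subset]
            for combo in product(*options):
                variants.append(" ".join(combo))
    return variants

variants = query_variants(["old", "George", "Bush"], {"old": ["age", "born"]})
print(len(variants))
print(variants[:3])
```

The number of variants grows quickly with the size of each set, so in practice the list would need to be capped or ranked before hitting a search API.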
Used together, both approaches to question answering can increase the odds of finding the right answer, or of recognizing it among a collection of possible answers.
It definitely doesn't end here.
Saturday, June 18, 2005
Common Sense Artificial Intelligence Research Focusing in One Direction
Collectively, however, they all seem to be focusing in but a few directions: employing language semantics, hierarchies and statistics. A common theme in most of their work is focusing on fixed phrases such as "is a" to glue everything together.
I'm not saying that these techniques are wrong and will not bear fruit. Building and improving on work by others is definitely a successful common theme in the history of science.
I just don't like "fixing" and "hard coding" anything. Other means need to be explored where this "glue" would naturally emerge. Static algorithms need to become more dynamic and adapt over time.
I'd like to see some work in AI Common Sense move totally away from "text information processing" and neural nets. We need to explore more fresh ideas from the ground up.
As individual as humans are so should the tribute of Artificial Intelligence be.
Tuesday, June 14, 2005
Yin and Yang (Strategies for Data Retention)
Although using the internet as a general untrained data corpus for research in the field of Artificial Intelligence is promising, it also comes with its own unique difficulties.
Algorithms previously discussed around Set Member Discovery (Google Sets) and Keyword Suggestions (Google Suggest), among others, eventually suffer a sort of poisoning when the number of bad samples grows so high that it starts degrading the quality of the results.
Managing the data set as a whole could also be potentially difficult since it can grow almost exponentially exceeding software and hardware limits.
We need to discuss various strategies for data retention.
Trimming old data by using some sort of cut-off date doesn’t lead to the best results, because that data might be of good quality. Not keeping enough data could lead to poor-quality results or no results at all. Aggregating the data, similar to how data warehouses work for the business community, isn't an option because it would negate the dynamic clustering requirements of the algorithm.
The problem then is in being able to qualify the quality, again not a trivial task.
It should also be noted that removing random elements is the simplest approach and the least costly. It also will work on most systems with similar algorithms.
In any case, in addition to retention, we could throttle the acquisition of new data to a manageable rate.
The algorithm I'm going to suggest is quite a different approach. It has almost equal, if not greater, system and resource requirements, but it is just as useful in its own right.
In my previous article, "If it doesn't Quack like a Duck?" I showed you an elegant way to calculate word distances.
The assumption is that if the words are really related, they would have similar word distances.
We can trim the elements in the set by setting a specified distance threshold on the word distance function.
The Yin and Yang now comes from having to balance the two algorithms against each other: set discovery with word distances, and vice versa.
Monday, June 13, 2005
If It Doesn’t Quack Like a Duck (Algorithm for Calculating Word Distances)
If It Doesn’t Quack Like a Duck (An Elegant Algorithm for Calculating Word Distances)
Looking under the hood of today’s search engines, we usually hear that keyword-to-result matching is derived from "keyword proximities" (commonly known as "word separation distances") within web pages, which affect, among other things, web page rankings. This applies to Yahoo, among others.
The algorithm for calculating these keyword proximities is therefore the motivation of this article.
The Algorithm for collecting the sample data
1. Scan the web or any other available rich textual resource, parse it, and load it word by word into a table as field "word", incrementing ID by 1 for each additional word. Do this for every word, regardless of whether it has been entered before.
2. When switching between resources, increment the ID by the "maximum distance" you will use. I use 1000. This will be hard coded. If you later decide to calculate word distances greater than 1000, you will still get results, but bleeding between the end of one resource and the start of the next could occur, and the two resources could be totally unrelated.
3. Continue with the ID you left off at, as all IDs should be unique.
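The loading steps above can be sketched in Python. Whitespace tokenizing and the function name are assumptions; the gap of 1000 between resources is the value from step 2.

```python
MAX_DISTANCE = 1000  # gap inserted between resources, as in step 2

def load_stream(resources, max_distance=MAX_DISTANCE):
    """Return (ID, word) rows: +1 per word, +max_distance when a new resource starts."""
    rows, last_id = [], 0
    for index, text in enumerate(resources):
        # The first word of every later resource jumps ahead by max_distance.
        step = 1 if index == 0 else max_distance
        for word in text.split():
            last_id += step
            rows.append((last_id, word))
            step = 1
    return rows

rows = load_stream(["this is a test", "free things"])
print(rows)
```

Because the jump is as large as the widest window you will ever query, a word at the end of one page can never appear to be a neighbour of a word at the start of the next.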
Keyword Proximities Data Structure (Using Sybase Adaptive Server ASE RDBMS)
CREATE TABLE dbo.stream
(
ID numeric(18,0) IDENTITY,
word varchar(255) NOT NULL
)
The Sample reference
this is a test to see if this algorithm has any potential in solving simple problems ...
The Sample Data
ID,stream
1,this
2,is
3,a
4,test
5,to
6,see
7,if
8,this
9,algorithm
10,has
11,any
12,potential
13,in
14,solving
15,simple
16,problems
...
The Algorithm for calculating the sample data in SQL
select
o.ID,
o.ID - i.ID distance,
o.word
from
stream o,
stream i
where
i.word = "for" and
o.ID > i.ID - 2 and
o.ID < i.ID + 2 and
o.ID != i.ID
The results can give actual keyword proximities based on the sample data.
ID,distance,stream
39,-1,away
41,1,free
51,-1,things
53,1,free
126,-1,you
128,1,the
We can modify the distance to become the absolute distance, or the relative distance in either direction, by changing the where clause.
o.ID >= i.ID - 3 and
o.ID <= i.ID + 5 and
o.ID != i.ID
Which would read: return words within 3 words to the left and 5 words to the right.
Summarizing the resulting data using counts similar to the sets algorithm (see previous posts) will give the relative weightings.
select
o.word,
count(o.word) count_steam,
avg(o.ID - i.ID) average_distance
from
stream o,
stream i
where
i.word = "for" and
o.ID > i.ID - 2 and
o.ID < i.ID + 2 and
o.ID != i.ID
group by
o.word
order by
count(o.word) desc,
avg(o.ID - i.ID)
The results
stream,count_steam,average_distance
free,2,1
you,1,-1
away,1,-1
things,1,-1
the,1,1
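To try the summarizing query without a Sybase server, here is a sketch that runs the same idea on SQLite from Python, using the sample sentence from earlier in this post. The SQLite column types, the extra tie-breaking sort on the word, and the filter that skips the centre word itself are my assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE stream (ID INTEGER PRIMARY KEY, word TEXT NOT NULL)")
sample = "this is a test to see if this algorithm has any potential in solving simple problems"
conn.executemany(
    "INSERT INTO stream (ID, word) VALUES (?, ?)",
    list(enumerate(sample.split(), start=1)),
)

PROXIMITY_SQL = """
select o.word, count(o.word) as hits, avg(o.ID - i.ID) as average_distance
from stream o, stream i
where i.word = ?
  and o.ID > i.ID - 2
  and o.ID < i.ID + 2
  and o.ID != i.ID          -- skip the centre word itself
group by o.word
order by hits desc, average_distance, o.word
"""
neighbours = conn.execute(PROXIMITY_SQL, ("this",)).fetchall()
print(neighbours)
```

Widening the two ID bounds is all it takes to look further to the left or right of the centre word.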
Discovering keyword proximities
With enough samples and a relatively small proximity window of around 200 words, and ignoring common words, we can create a recursive algorithm in which the newly discovered words are fed back in. The overall result would be a collection of words closely related by concept. You should aim to process a data sample of about 2 billion words for a good training corpus.
For example, you would expect to see the words "duck" and "quack" centered on "bird", but not "quack" centered on the word "cpu".
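The feedback loop could look something like the following Python sketch. It works on a plain list of words instead of the stream table, ignores the gaps between resources, and the `radius`, `top` and `depth` parameters are assumptions chosen to keep the toy example small.

```python
from collections import Counter, deque

def neighbours(words, centre, radius, stop_words):
    """Count the words found within `radius` positions of each occurrence of `centre`."""
    counts = Counter()
    for i, word in enumerate(words):
        if word == centre:
            lo, hi = max(0, i - radius), min(len(words), i + radius + 1)
            for j in range(lo, hi):
                if j != i and words[j] not in stop_words:
                    counts[words[j]] += 1
    return counts

def related_words(words, seed, radius=2, top=2, depth=2, stop_words=frozenset()):
    """Feed the most frequent neighbours back in as new centre words, `depth` levels deep."""
    found, queue = {seed}, deque([(seed, 0)])
    while queue:
        word, level = queue.popleft()
        if level == depth:
            continue
        for candidate, _ in neighbours(words, word, radius, stop_words).most_common(top):
            if candidate not in found:
                found.add(candidate)
                queue.append((candidate, level + 1))
    return found - {seed}

toy = "duck quack bird fly cpu chip".split()
print(related_words(toy, "bird", radius=1, depth=1))
```

On a corpus this tiny a second level of feedback already drifts to unrelated words, which is exactly why a large sample and a cap on depth matter.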
The strength isn't just in this particular set of algorithms but in the data structure supporting it. The original data format can be recreated. The order of the words is preserved. The structure is simple enough for recursion. Its simplicity will undoubtedly spawn creativity in similar algorithms.
Can you see the bigger picture? Impossible is nothing.
Saturday, June 11, 2005
An Algorithm for Generic Question Answering
The latest innovation in this space is now in direct question answering abilities.
Type "who is the president of the united states?" into MSN Search. You will find the top search result will be quite different.
A snippet from MSN Encarta encyclopedia with the answer "George W. Bush" is presented.
Google also has a similar feature but different in design. It references different resources.
Google's approach at least is more in line with their search engine core business: providing links.
On average Google's approach returns almost double the answers of MSN, and the difference is growing in Google's favor.
I prefer Google's approach, as once again they're using the web as a general training corpus, which Google Sets and Google Suggest have proven to be machine scalable.
Both Google and MSN provide the resultant answers formatted.
However, neither of them is the authority. Check out START, a research project from MIT. The answers are much more complete and elegantly returned.
Another new entrant showing considerable promise is Brainboost.com. It returns snippets of answers in text and a URL reference back to the original content.
Gigablast.com also offers Gigabits, a variant of question answering, as it returns multiple answers as phrases with relevance weightings.
In further investigating Google's question answering abilities, the keywords "president united states" alone return the same results. Following the link reference, you are likely to find that the results are in a table, a link list, or some other strict syntactic form.
As most new algorithms are synthesized from old ones, we recycle and rearrange the basic algorithms previously presented in this web log.
In creating a question answering system we need to make a few assumptions
Assumptions
1. Multiple Answers exist on the web
2. There is a high probability that the answers are in tables.
3. Most questions can be expressed as an Object/Subject Property combination
i.e. What color is the car?
object=car, property=color
4. We are looking for the Value of "Object Property" combination.
The possible Algorithm for collecting and organizing answers
1. Scan the web for tables
2. Load the table data into a database table in matrix form with R representing the row, C representing the column and V the actual data in the field.
3. Increment the T_ID for each table.
CREATE TABLE QA
(
QA_ID NUMBER NULL,
T_ID NUMBER NULL,
R NUMBER NULL,
C NUMBER NULL,
V VARCHAR2(255) NULL
)
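Flattening one scraped table into rows for the QA table could be sketched like this in Python. The function name, the 1-based row and column numbering, and the toy car table are assumptions for the illustration.

```python
def table_to_rows(table, t_id, first_qa_id=1):
    """Flatten a list-of-lists table into (QA_ID, T_ID, R, C, V) tuples."""
    rows = []
    qa_id = first_qa_id
    for r, row in enumerate(table, start=1):
        for c, value in enumerate(row, start=1):
            rows.append((qa_id, t_id, r, c, str(value)))
            qa_id += 1
    return rows

toy_table = [["Object", "Color"], ["car", "red"]]  # made-up data
print(table_to_rows(toy_table, t_id=7))
```

Header cells are stored like any other cell, so the lookup side has to work out which row or column is the header.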
The possible Algorithm for finding answers to questions
1. We need to break up the question to discover the object and property.
2. Remove any occurrences of common language parts, including but not limited to:
what where when how why is it as a
3. With all the remaining words we can discover all possible T_ID that are common and return the associated rows and columns that intersect. We also pay extra attention to possible header columns, header rows, and adjacent fields
An example for: "George" and "Age" keywords
SELECT
V
FROM
QA
where T_ID in(
SELECT distinct T_ID FROM QUSR.QA where upper(V) like upper('%George%')
INTERSECT
SELECT distinct T_ID FROM QUSR.QA where upper(V) like upper('%Age%')
)
order by
T_ID,
R,
C
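Going one step further than listing every cell of the matching tables, the sketch below picks out the cell where the object's row crosses the property's column. It runs on SQLite from Python with made-up data, and it only handles tables whose header runs along the top; the function name and the three-way self join are assumptions, not the query used above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE QA (QA_ID INTEGER, T_ID INTEGER, R INTEGER, C INTEGER, V TEXT)")
cells = [
    # Table 1 mentions both keywords.
    (1, 1, 1, 1, "Object"), (2, 1, 1, 2, "Color"),
    (3, 1, 2, 1, "car"), (4, 1, 2, 2, "red"),
    # Table 2 mentions only one of them.
    (5, 2, 1, 1, "car"), (6, 2, 1, 2, "Toyota"),
]
conn.executemany("INSERT INTO QA VALUES (?, ?, ?, ?, ?)", cells)

def answer(conn, obj, prop):
    """Values sitting in the row that mentions obj and the column that mentions prop."""
    sql = """
        SELECT a.V
        FROM QA o, QA p, QA a
        WHERE upper(o.V) LIKE upper(?) AND upper(p.V) LIKE upper(?)
          AND o.T_ID = p.T_ID AND a.T_ID = o.T_ID
          AND a.R = o.R AND a.C = p.C
          AND a.QA_ID NOT IN (o.QA_ID, p.QA_ID)
        ORDER BY a.T_ID, a.R, a.C
    """
    return [v for (v,) in conn.execute(sql, (f"%{obj}%", f"%{prop}%"))]

print(answer(conn, "car", "color"))
```

Swapping the roles of R and C in the join would cover tables whose headers run down the first column instead.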
Enhancements
1. Since multiple records are returned, we can count them, similar to the algorithms for sets, and only return the popular answers.
2. If there weren't enough results we could also try repeating the process leaving out keyword combinations.
3. If there weren't enough results we could also try using synonyms of the keywords.
In this state, the answers are still in raw form. How to return them following some generic grammar rules will be left for another article.
By no means does this strategy always return accurate results, but it is machine scalable, and quality does improve with large datasets of around 2,000,000,000 fields. 100,000 should be the bare minimum.
In this shape we are also limited to answering simple questions with simple answers. I leave it to the reader to suggest further innovations before I present more.
Monday, June 06, 2005
A Statistically Based Grammar Checker (Reverse Engineering Google Suggest)
My last article "Racism in the Machine" also alluded to the design of Google's Suggest.
Lots and lots of web snippets across lots and lots of web pages.
What's holding it all together?
Are there any possible tells in the results?
1. Is every suggestion an entire snippet in itself? I.e. the number of results returned by searching for it as a phrase is equal to the number of results suggested.
2. Is it a string of snippets somehow stitched together? I.e. the number of results returned by searching for it as a phrase is less than the number of results suggested.
3. Is it a combination of both 1 and 2?
4. Is it totally original? I.e. it cannot be found by searching for it as a phrase.
5. Is there a limit to the number of keywords?
Some background
Google's Suggest is a beta product of Google. It is closely based on Google Search; however, as you type in your keyword, a drop-down box appears suggesting additional keywords appended to the original, in a list sorted by number of results. As you include a suggested keyword, the drop-down list is updated with newer suggestions.
The possible algorithm.
1. Assume we want to expand the term t1="I want".
2. Enter the term with quotes in Google Search or any other search engine. You will see the number of results. This should be the maximum number for any phrase that begins with that term.
Results 1 - 10 of about 53,300,000 for "I want". (0.27 seconds)
3. With pattern matching you can enumerate through all the result descriptions and create an array of possible next words. I've highlighted those in yellow.
You can repeat the process for each result page and save the array with only unique occurrences.
4. For each expanded term you discover, repeat the process to discover its result count. For example, say we want to expand the term t1="I Want One".
Results 1 - 10 of about 648,000 for "I Want One". (0.25 seconds)
5. Choose the expanded term with the highest result count and repeat the entire process.
6. Eventually you will no longer be able to discover any new terms. If you desire, you could backtrack and see if alternative suggested expansions return longer results (with more words, not characters).
7. At any point you could calculate the probability of the next term: p = results(term *) / results(term)
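The greedy expansion and the probability from step 7 can be sketched in Python. The two callbacks stand in for a real search engine: `count` returns the result count for a quoted phrase and `next_words` returns the candidate next words scraped from the result descriptions. Both, along with the "I want two" entry, are hypothetical; the other two counts are the ones quoted above.

```python
def next_term_probability(results_expanded, results_term):
    """p = results(term *) / results(term)"""
    return results_expanded / results_term

def expand(term, count, next_words, max_words=10):
    """Greedily append the next word whose phrase has the highest result count."""
    path = [(term, count(term))]
    while len(term.split()) < max_words:
        candidates = [f"{term} {word}" for word in next_words(term)]
        if not candidates:
            break  # no new terms can be discovered
        term = max(candidates, key=count)
        path.append((term, count(term)))
    return path

COUNTS = {"I want": 53_300_000, "I want one": 648_000, "I want two": 1_000}  # last entry is made up
NEXT = {"I want": ["one", "two"]}

path = expand("I want", lambda t: COUNTS.get(t, 0), lambda t: NEXT.get(t, []))
p = next_term_probability(path[1][1], path[0][1])
print(path, round(p, 4))
```

With the quoted counts, roughly 1.2% of the pages containing "I want" continue with "one".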
The possible algorithm improvements.
1. Replace the new words of an expanded term with their synonyms to promote different phrase paths.
2. Save the results in a table similar to
CREATE TABLE dbo.suggest
(
Suggest_ID int NOT NULL,
NumberOfWords int NOT NULL,
Results int NOT NULL,
Phrase varchar(255) NOT NULL
)
And use it in databases, similar to Google Sets.
Google has also added an API for exploring such algorithms and encourages you to add them to your projects.
Have fun!
Saturday, June 04, 2005
Chat Bots Exposed! GoogleTalk?
Chat bots are applications designed to mimic man and engage him in a conversation using normal human language.
The challenge is to conceal the identity of the application, and be, for all intents and purposes, perceived as just another person on the other end of a communication medium.
The best attempt to date is Alicebot from http://alicebot.org/. Its underlying architecture has been released as open source and packaged as AIML (Artificial Intelligence Markup Language).
The basic premise behind it is to map input patterns to output responses, or input patterns to other simplified input patterns for re-processing (similar to recursion). There is also a framework for executing functions, and for defining and retrieving variables during a session, which can possibly be saved and reloaded for others.
Consider a simple input pattern, output response pair, a single conversation thread.
>How are you today?
Never better!
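The pattern-to-response idea, including the reduction of one pattern to a simpler one, can be approximated in plain Python. This is not AIML itself (which is XML and supports wildcards); the dictionaries, the function names and the "DOING" variant are assumptions for the sketch.

```python
RESPONSES = {"HOW ARE YOU TODAY": "Never better!"}
# Hypothetical reduction: a wordier input is rewritten to a simpler known pattern.
REDUCTIONS = {"HOW ARE YOU DOING TODAY": "HOW ARE YOU TODAY"}

def normalise(text):
    """Upper-case the input and strip punctuation and extra spaces."""
    kept = "".join(ch for ch in text.upper() if ch.isalnum() or ch.isspace())
    return " ".join(kept.split())

def reply(text, depth=0):
    """Answer directly if the pattern is known, otherwise reduce and re-process."""
    pattern = normalise(text)
    if pattern in RESPONSES:
        return RESPONSES[pattern]
    if pattern in REDUCTIONS and depth < 10:
        return reply(REDUCTIONS[pattern], depth + 1)
    return None

print(reply("How are you today?"))
print(reply("How are you doing today?"))
```

Every new phrasing needs a hand-written entry in one of the two tables, which is the scaling problem in a nutshell.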
Although it is the most believable system to date, it's still not machine scalable. The only method for training is via a botmaster who simply edits and modifies the existing AIML text file.
Even worse, conversations can become predictable and stale, since the only variation comes from a random function. A typical AIML set will handle about 40,000 input responses.
In contrast, Google Sets took about 2,000,000 elements. Google Suggest took about 80,000,000. For translation Google admitted they processed 2,000,000,000 United Nations transcripts.
If such a system is going to work you'll probably need trillions of conversation threads.
Why not use aggregated Instant Messaging data, such as from GoogleTalk, and/or emails as the training corpus, since they can be represented as conversations? It’s impossible for a single botmaster to train anything even remotely scalable in comparison.
I’ve already started the development. I need your help. Start saving, and on a frequent basis, send me (mailto://questsin@rogers.com) your chat history from your various Instant Messaging applications like ICQ, AIM, MSN Messenger, Yahoo Messenger and Email.
In return, when the application gets enough critical mass, I will email the application, its source code and all underlying dataset, completely free with no strings attached.
I will release the code as open source to everyone else.
Have you seen Google Mail? Google is probably doing this today. They openly admit they never delete any information including the emails you delete.
They say it’s for building a profile for targeting advertisement.
A possible tell
1. We organize your emails into conversations
This technology cannot be left and dominated by private hands, even Google.
Especially a Corporation with such a powerful presence in the media and controlled by relatively few.
Why does a company have to say its number 1 rule is "Do no Evil"?
I cannot help but make an analogy to conquerors with mantras like "I come in Peace".
When Man Will Become Machine
Machines would do everything man could do: labor, synthesize thought, engage in general chit chat and even self-reproduce.
Scientists would describe it as a series of functions and procedures.
What would religion call it? How would religion explain it?
How would it describe itself? How would it describe us?
Would we call it life? If not, wouldn’t that make us just machines?
What would be left to separate man from machine?
Isn't consciousness the greatest trick we've played on ourselves?
Artificial Intelligence is coming, are you ready for it? are we?
Thursday, June 02, 2005
Google Suggest - Racism in the Machine
Just 20 years ago it would have taken an enormous amount of effort to duplicate the functionality of Google’s Sets (see the previous few posts for a how-to). Just look at projects like http://www.cyc.com and WordNet from Princeton. These systems were originally built via supervised learning and screened training data. In contrast with Google they feel incomplete. Google Sets, and similar algorithms sprouting up based on unsupervised learning with the generic web as a training corpus, seem to bring new life to the field of Artificial Intelligence. However, there is always a cost.
Google, Yahoo and even Microsoft, with all their resources, try to develop filtering mechanisms, but can they really expect to filter out the very essence of the web?
What is this essence? You only have to answer a few simple questions. What percentage of the web is related to sex, crime, slander, and racism? What percentage of blogs are biased and rant based?
Exploiting these techniques will inevitably create AIs not better than us, but more deviant and racist. Maybe the movies had it right all along.
Skeptical? Just look at Google Suggest for a glimpse into our dismal future.
Below are screen captures of Google's Suggest feature. It expands your keywords based on result counts discovered while spidering.
blacks are..
whites are..
jews are..
germans are..
greeks are..
chinese are..
It's not my intention to bring grief to Google. Google is not to blame. They are only the medium; we are the message.
What's the point of this warning? Maybe the simplest road is just not worth taking. The cost is just too great in the end.
Look at what google says about itself with google is ...
Wednesday, June 01, 2005
Number of Results with a Twist
keywords searched. This is usually in the form of "showing 1-10 of 100,000 results".
It baffles me as to why they constantly display this knowing it probably affects query response time.
It's generally useless information, considering most search engines even limit you to the first 1000 results.
Why not move it off to an advanced mode, which most engines support anyway?
However, since it is there; here are a few tricks, or tiny side trackers.
Check the variations of spelling the same word
some examples (using MSN Search)
- "Britney Spear" 1-9 of 3,843,018 containing "Britney Spears" (0.15 seconds)
- "Britny Spears" 1-10 of 17,346 containing "Britny Spears" (0.14 seconds)
- "Britny Spears" 1-10 of 17,346 containing "Britny Spears" (0.12 seconds)
- "Britney Speers" 1-10 of 8,587 containing "Britney Speers" (0.15 seconds)
- "Brittne Sperce" no results
You could even use this kind of system, with algorithms like the soundex function, as a basis for a spell checker.
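As a rough sketch of the spell-checker idea, the Python below suggests the most common spelling whenever it beats the typed one by a wide margin. The candidate spellings are assumed to be generated already (by soundex or otherwise), the factor of 10 is an arbitrary assumption, and the counts are the MSN Search numbers above.

```python
def suggest_spelling(typed, result_counts, ratio=10):
    """Suggest the most common variant if it is at least `ratio` times more common."""
    best = max(result_counts, key=result_counts.get)
    if best != typed and result_counts[best] >= ratio * result_counts.get(typed, 0):
        return best
    return None

counts = {
    "Britney Spears": 3_843_018,
    "Britny Spears": 17_346,
    "Britney Speers": 8_587,
    "Brittne Sperce": 0,
}
print(suggest_spelling("Britny Spears", counts))
print(suggest_spelling("Britney Spears", counts))
```

The search engine's index does the work of the dictionary here, so new names and slang are picked up without anyone maintaining a word list.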
We can use this to compare number of results for the same term across many different search engines.
For example, searching for the very popular "contact us" phrase we can easily gauge the relative size of an index assuming 1 out of every 5 pages contains this particular term.
- Gigablast: Results: 1 to 10 of about 2,019,529,331 for "contact us".
- Google: Results 1 - 10 of about 1,870,000,000 for "contact us". (0.07 seconds)
- MSN Search: 1-10 of 1,807,652,282 containing "contact us" (0.41 seconds)
- Search Yahoo: Results 1 - 20 of about 1,100,000,000 for "contact us". Search took 1.30 seconds
What a surprise to find Gigablast on top. However that doesn't mean it has the biggest index by itself. We could create an average for terms like "Home", "Sitemap" etc.
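The index-size arithmetic is simple enough to sketch in Python. The 1-in-5 ratio is the assumption stated above, the function name is mine, and the counts are the ones each engine reported for "contact us".

```python
def estimated_index_size(result_count, pages_per_hit=5):
    """If 1 page in `pages_per_hit` contains the phrase, scale the hit count up."""
    return result_count * pages_per_hit

contact_us = {
    "Gigablast": 2_019_529_331,
    "Google": 1_870_000_000,
    "MSN Search": 1_807_652_282,
    "Yahoo": 1_100_000_000,
}
sizes = {engine: estimated_index_size(n) for engine, n in contact_us.items()}
ranking = sorted(sizes, key=sizes.get, reverse=True)
for engine in ranking:
    print(engine, f"{sizes[engine]:,}")
```

Averaging the estimate over several near-universal phrases would smooth out the quirks of any single term.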
We can compare terms with other similar related terms as some sort of popularity index
some examples (using MSN Search)
- "good" = 1-9 of 344,085,128 containing good (0.15 seconds)
- "evil" = 1-9 of 40,064,691 containing evil (0.14 seconds)
Looks like good is better than evil... but evil is still faster
- "red" = 1-10 of 209,656,261 containing red (0.17 seconds)
- "blue" = 1-9 of 184,141,226 containing blue (0.18 seconds)
- "Toronto" = 1-9 of 41,214,052 containing Toronto (0.23 seconds)
- "Vancouver" = 1-10 of 26,070,922 containing Vancouver (0.18 seconds)
Hmm. Hope the westerners don't get offended that they're second best
- "Chicago Bulls" = 1-10 of 572,158 containing "Chicago Bulls" (0.22 seconds)
- "Toronto Raptors" = 1-10 of 411,479 containing "Toronto Raptors" (0.33 seconds)
Look, Toronto is catching up, even though it's still a relatively new franchise
You could even try your political candidates before, during and after the election. It's better than polling!!!
Tuesday, May 31, 2005
Reverse Engineering Google Sets
feature with wide ranging applications. The only thing that might be more interesting is the possible algorithm behind it.
The secrets are revealed in the properties
- Google sets aren’t ordered
- The sets are generic
- Sets seeded with extremely common data, such as numbers and alphabet characters, appear less logical, implying too many samples
Here’s an algorithm of how it could potentially be recreated
- Scan and parse the web for simple html 'tables', and/or 'lists', filter using textual content only. This can be done by leveraging simple spidering techniques widely employed by search engines.
- Break up each column into fields and store each field as a record, including a set id uniquely identifying the column.
- Break up each row into fields and store each field as a record, including a set id uniquely identifying the row.
- repeat continuously for different tables on different sites
- Store the results for later use in a database table similar to
CREATE TABLE sets (
field_id int NOT NULL,
set_id int NOT NULL,
field varchar (255) NOT NULL
)
For Example
If we had 2 tables to parse similar to
- Nick,Male
- John,Male
and
- Product,Cost
- Orange,5
- Banana,6
The sets dataset table would be
- field_id,set_id,Field
- 1,1,Nick
- 2,1,John
- 3,2,Male
- 4,2,Male
- 5,3,Nick
- 6,3,Male
- 7,4,John
- 8,4,Male
- 9,5,Product
- 10,5,Orange
- 11,5,Banana
- 12,6,Cost
- 13,6,5
- 14,6,6
- 15,7,Product
- 16,7,Cost
- 17,8,Orange
- 18,8,5
- 19,9,Banana
- 20,9,6
Notice
- Header records are not treated any differently
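The column-then-row breakdown can be sketched in Python. The function name and the running counters are assumptions, and the sketch expects rectangular tables; fed the two example tables, it numbers the fields and sets in the same order as the listing above.

```python
def table_to_sets(table, next_field_id=1, next_set_id=1):
    """Store every column, then every row, of a table as its own set."""
    records = []
    columns = [list(column) for column in zip(*table)]
    rows = [list(row) for row in table]
    for group in columns + rows:
        for field in group:
            records.append((next_field_id, next_set_id, str(field)))
            next_field_id += 1
        next_set_id += 1
    return records, next_field_id, next_set_id

people = [["Nick", "Male"], ["John", "Male"]]
products = [["Product", "Cost"], ["Orange", 5], ["Banana", 6]]

first, field_id, set_id = table_to_sets(people)
second, field_id, set_id = table_to_sets(products, field_id, set_id)
for record in first + second:
    print(record)
```

Passing the counters back in keeps field and set ids unique across tables and sites.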
Getting the Sets
The Sybase ASE query that can be used to generate the desired set is then:
Select top 15
field, count(set_id)
From
sets
where
set_id in ( select set_id from sets where field in ("apple","Orange") )
group by
field
order by
count(set_id) DESC
Comments
The above SQL will generate sets whose members are centered on “apples” and “oranges”. We use the top 15 option to limit the results to only the top 15. If we were to take the entire list, you would see it get less and less accurate as the count goes to 1. The elegance is in the count function, and how it clusters popular results together.
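To experiment without a Sybase server, the same query shape runs on SQLite from Python. The toy rows are made up, `limit 15` stands in for `top 15`, and the extra sort on the field name is only there to make ties predictable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sets (field_id INTEGER NOT NULL, set_id INTEGER NOT NULL, field TEXT NOT NULL)"
)
rows = [  # made-up sample data
    (1, 1, "apple"), (2, 1, "orange"), (3, 1, "banana"),
    (4, 2, "apple"), (5, 2, "pear"), (6, 2, "banana"),
    (7, 3, "orange"), (8, 3, "banana"), (9, 3, "kiwi"),
    (10, 4, "cpu"), (11, 4, "ram"),
]
conn.executemany("INSERT INTO sets VALUES (?, ?, ?)", rows)

SETS_SQL = """
select field, count(set_id) as hits
from sets
where set_id in (select set_id from sets where field in (?, ?))
group by field
order by hits desc, field
limit 15
"""
members = conn.execute(SETS_SQL, ("apple", "orange")).fetchall()
print(members)
```

The unrelated "cpu" and "ram" set never shares a set_id with the seeds, so it drops out on its own.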
Tips
- To generate closer matched sets, favor scanning the web pages containing each field discovered.
- To get results with similar accuracy to Google Sets, try to collect about 2 million fields
Enhancements
- Ordering the results by the average field_id
- Adding the ability to subtract sets by removing set_id’s
Other Uses for similar algorithm
This algorithm could also be used to create sets of commonly related keywords by scanning web pages' meta keyword tags, knowing that most of the keywords will be related to each other somehow.
Check it out first hand at questsin.net!!!
