Tag Cloud

01Dec08

I discovered this web site today: http://www.wordle.net/
It offers some nice layouts for displaying tag clouds.
The display is done with a Java applet, but I’m sure we could do pretty much the same thing using SVG. I used SVG in one of my previous projects 2 or 3 years ago, and I really liked how simple it was to generate complex vector drawings. The only issue was that to display SVG files in a browser we had to use an external plugin: Adobe SVG Viewer. At that time, this plugin did not support the latest SVG specification and was going to be discontinued (I suppose it has been by now).

SVG might have lost some of its appeal now, with complex JavaScript frameworks like DOJO or EXT, and Flash/Flex, but … well, I like SVG; maybe I’m a little bit nostalgic. I think one of the main advantages of SVG is that it’s an XML-based language. So on the server side it’s quite easy to generate and also to debug (any XML editor to check the syntax, or an SVG editor to look at the rendering, e.g. Inkscape). And the task of inserting some semantic information into this XML cloud using RDFa or eRDF would be hugely simplified.

To generate a tag cloud, which is mainly a font-centered diagram, SVG should do the job quite easily while keeping the application layers well separated.
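To make this concrete, here is a minimal sketch in Python (standard library only) of how a server could turn tag counts into SVG. The function names, the logarithmic font scaling, the pixel range and the sample counts are all assumptions chosen for illustration; the tags are simply stacked one per line, with no real cloud layout yet.

```python
import math
import xml.etree.ElementTree as ET

def font_size(count, min_count, max_count, min_px=12, max_px=48):
    """Map a tag count (>= 1) to a font size on a logarithmic scale."""
    if max_count == min_count:
        return (min_px + max_px) / 2
    ratio = (math.log(count) - math.log(min_count)) / (
        math.log(max_count) - math.log(min_count))
    return min_px + ratio * (max_px - min_px)

def tag_cloud_svg(tag_counts, width=400):
    """Build an SVG document with one <text> element per tag, stacked vertically."""
    lo, hi = min(tag_counts.values()), max(tag_counts.values())
    svg = ET.Element("svg", xmlns="http://www.w3.org/2000/svg", width=str(width))
    y = 0.0
    for tag in sorted(tag_counts):
        size = font_size(tag_counts[tag], lo, hi)
        y += size * 1.2  # move the baseline down by one line height
        text = ET.SubElement(svg, "text", x="10", y=f"{y:.1f}")
        text.set("font-size", f"{size:.1f}")
        text.text = tag
    svg.set("height", f"{y + 10:.1f}")
    return ET.tostring(svg, encoding="unicode")

# Hypothetical sample data, not taken from a real tag cloud
svg_markup = tag_cloud_svg({"rdf": 40, "svg": 12, "sparql": 5, "java": 1})
```

Because the output is plain XML, it can be checked with any XML editor or opened directly in an SVG editor, and the presentation stays separate from the code that computes the counts.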

mydelicioustagcloud1

Now the latest browser generation supports SVG natively, and for the others a framework like DOJO or EXT is able to encapsulate it and emulate the rendering using the specific browser capabilities. I should try to implement a component to render different tag cloud layouts with SVG, just for fun. I am trying to find a suitable algorithm to calculate tag positions, but it should not be too hard to come up with a simple one for a first version.
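As an idea of what such a simple first algorithm could look like, here is a sketch in Python of a greedy row layout: tags are placed left to right and wrap to a new row when the current one is full. This is not the algorithm used by Wordle or by any of the links below; the function name, the width estimate (character count times font size times 0.6) and the spacing values are assumptions made for illustration.

```python
def layout_rows(sized_tags, max_width=400, char_ratio=0.6, gap=8):
    """Place tags left to right, wrapping to a new row when the row is full.

    sized_tags: list of (tag, font_size_px) pairs, in display order.
    Returns a list of (tag, x, y, font_size_px); (x, y) is the start of the
    text baseline, which is what an SVG <text> element expects.
    """
    placed = []
    row = []   # tags of the current row: (tag, x, size)
    x = 0.0
    y = 0.0    # baseline of the previous row

    def flush():
        """Give every tag of the current row a common baseline."""
        nonlocal y
        if not row:
            return
        row_height = max(size for _, _, size in row)
        y += row_height * 1.2
        placed.extend((tag, tx, y, size) for tag, tx, size in row)
        row.clear()

    for tag, size in sized_tags:
        width = len(tag) * size * char_ratio  # rough estimate, an assumption
        if row and x + width > max_width:
            flush()
            x = 0.0
        row.append((tag, x, size))
        x += width + gap
    flush()
    return placed

# Hypothetical input: three tags with already computed font sizes
positions = layout_rows([("rdf", 40), ("svg", 20), ("semantic", 30)], max_width=150)
```

The row height follows the largest tag on the row, so big and small tags share a baseline. A real implementation would need true text metrics rather than a per-character estimate.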

Some links to have a look at:
http://arxiv.org/abs/cs.DS/0703109
http://dotnetaddict.dotnetdevelopersjournal.com/tw.htm
http://semanticvoid.com/blog/2006/01/06/tag-cloud-font-distribution-algorithm/
http://poeticcode.wordpress.com/2007/01/27/tag-cloud-algorithmlogicformula/
http://internetducttape.com/2007/02/22/tag-cloud-generator-for-wordpresscom/
http://www.citebase.org/abstract?identifier=oai%3AarXiv.org%3Acs%2F0703109&action=citeshits&citeshits=cites

At least I took the time to write a post, even if it’s not THE post of the year… ;)





Last year, Reuters acquired the text analytics company ClearForest.
They recently launched a new free semantic web service, based on ClearForest technology, named OpenCalais. This service extracts entities from a submitted text (web content, for example). And, last but not least, the service returns all these extracted concepts as an RDF graph. So by using this service and browsing this graph, you can automatically tag any unstructured content (with RDFa, for example), provide enhanced search functionality based on the semantics (if you have a good knowledge of the ontology used), etc.
See below a short example: I submitted a text found on the web to this service through this web page, then I queried the returned RDF graph using this RDF graph visualization tool and a pretty simple SPARQL-like query, to retrieve everything that was identified as a “Company”. Well, it would be better if all the companies found were linked by something other than their common type, for example an “acquired” relationship, but it’s already a good start.

Original Plain Text

March 16, 2004 (Computerworld) — Enterprise content management vendor Documentum Inc. has acquired a one-step content integration product line from Xerox Corp. and today unveiled a new “virtual repository” for improved organization of stored data.
In an announcement, the Pleasanton, Calif.-based company said its new Documentum Virtual Repository will allow companies to organize and store a wide range of internal and external information that will be easy to retrieve for use. The repository will allow aggregation for automated and scheduled content collection from multiple sources and will make the information available to others in compatible formats.
The new feature will be available early in the second quarter.
In a related move, Documentum acquired the AskOnce business unit of Xerox for an undisclosed price. AskOnce is a secure enterprise content integration product that searches multiple repositories and data types using a single query. AskOnce relies on a uniform query interface to connect it to existing database, document repository, Internet, corporate intranet or e-mail applications.
Financial details of the transaction weren’t disclosed.
“With the Documentum Virtual Repository solution, companies will be able to control all of their content — internal and external, structured and unstructured — regardless of where it resides,” Dave DeWalt, president of the Documentum division of EMC Corp., said in a statement.
“Most enterprises have limited knowledge of the content scattered throughout their organizations — on employee desktops, internal and external networks, Web sites and portals, or in data archives. There’s a great need in the market for technology that helps companies manage all of this content — especially with the intense public scrutiny of both government agencies and public companies.”

All identified entities

entities2.jpg

Tagged HTML sample

March 16, 2004 (Computerworld) — Enterprise content management vendor Documentum Inc. has acquired a one-step content integration product line from Xerox Corp. and today unveiled a new “virtual repository” for improved organization of stored data. In an announcement, the Pleasanton, Calif. -based company

Global RDF graph

globalgraph.jpg

RDQL (SPARQL-like) query: What is identified as a “Company”?

SELECT ?subject ?predicate ?object WHERE
(?subject rdf:type <http://s.opencalais.com/1/type/em/e/Company>)
(?subject ?predicate ?object)

RDF graph / Query result

subgraph.jpg
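
To show what the query above asks for, here is a small sketch in plain Python that applies the same two triple patterns to an in-memory list of triples. The `ex:` subjects and predicates are hypothetical placeholders, not real OpenCalais output; only the Company type URI comes from the query itself.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
COMPANY = "http://s.opencalais.com/1/type/em/e/Company"

# Hypothetical triples; the real graph uses different URIs and predicates
triples = [
    ("ex:documentum", RDF_TYPE, COMPANY),
    ("ex:documentum", "ex:name", "Documentum Inc."),
    ("ex:xerox", RDF_TYPE, COMPANY),
    ("ex:xerox", "ex:name", "Xerox Corp."),
    ("ex:pleasanton", RDF_TYPE, "ex:City"),
    ("ex:pleasanton", "ex:name", "Pleasanton"),
]

def describe_instances(triples, type_uri):
    """Return every triple whose subject is typed as type_uri."""
    # First pattern: (?subject rdf:type <type_uri>)
    subjects = {s for s, p, o in triples if p == RDF_TYPE and o == type_uri}
    # Second pattern: (?subject ?predicate ?object) for those subjects
    return [(s, p, o) for s, p, o in triples if s in subjects]

company_triples = describe_instances(triples, COMPANY)
```

The result is a flat set of statements per company, which is why the companies in the resulting subgraph are connected only through their shared type.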








  • Using Semantic Web Pipes you can fetch, mix and process RDF files published on the Web. As the output of a Pipe is an HTTP retrievable RDF model, simple pipes can also work as inputs to more complex Pipes.
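
The composition idea can be sketched in a few lines of Python. This is only an analogy and not the Semantic Web Pipes API: a pipe is modeled here as a function returning a set of triples, the sources are in-memory stand-ins for RDF files fetched over HTTP, and all names and data are hypothetical.

```python
def merge(*pipes):
    """Pipe whose output is the union of the outputs of several pipes."""
    return lambda: set().union(*(pipe() for pipe in pipes))

def keep_predicate(pipe, predicate):
    """Pipe that keeps only the triples using the given predicate."""
    return lambda: {t for t in pipe() if t[1] == predicate}

# Hypothetical sources standing in for RDF files published on the Web
def source_a():
    return {("ex:alice", "ex:name", "Alice"), ("ex:alice", "ex:age", "30")}

def source_b():
    return {("ex:bob", "ex:name", "Bob")}

# The output of one pipe (merge) is the input of another (keep_predicate)
names_pipe = keep_predicate(merge(source_a, source_b), "ex:name")
names = names_pipe()
```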






RDF data

30Sep07

What are the possible RDF data sources?

  • triple stores: local and remote; the latter can be queried using the SPARQL protocol
  • local RDF datasets or data available in RDF: loaded in memory
  • an intermediate layer mapping between the native/legacy data representation (DB or LDAP, for example) and an exposed RDF view => D2RQ/D2R, SquirrelRDF, …
  • web sites containing semantic data: RDFa, microformats, eRDF, … This information can be extracted using GRDDL, for example.
  • scraping applications: Solvent
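
To illustrate the intermediate mapping layer mentioned in the list above, here is a minimal sketch in Python that exposes the rows of a relational table as triples. It is not how D2RQ or SquirrelRDF work internally (those tools are driven by their own mapping configuration); the namespace, the table and the URI scheme are assumptions made up for the example.

```python
import sqlite3

BASE = "http://example.org/"  # hypothetical namespace

def rows_to_triples(conn, table, key_column):
    """Expose each row of a table as (subject, predicate, object) triples.

    The table name is interpolated into the SQL, so it must be trusted.
    """
    cursor = conn.execute(f"SELECT * FROM {table}")
    columns = [d[0] for d in cursor.description]
    triples = []
    for row in cursor:
        record = dict(zip(columns, row))
        subject = f"{BASE}{table}/{record[key_column]}"
        for column, value in record.items():
            # The key becomes part of the subject URI; NULLs produce no triple
            if column != key_column and value is not None:
                triples.append((subject, f"{BASE}{table}#{column}", value))
    return triples

# Hypothetical legacy data in an in-memory database
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT, email TEXT)")
conn.execute("INSERT INTO person VALUES (1, 'Alice', 'alice@example.org')")
conn.execute("INSERT INTO person VALUES (2, 'Bob', NULL)")
person_triples = rows_to_triples(conn, "person", "id")
```

Each row becomes a resource and each column a predicate, so the legacy data stays where it is and only the view changes.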

  • A loose definition of clustering could be “the process of organizing objects into groups whose members are similar in some way”. A cluster is therefore a collection of objects which are “similar” between them and are “dissimilar” to the object