Little things you (yes, all of you) should know about Entity Extraction

Submitted by: Lambda on 9 April 2010

‘Entity extraction’ sounds like bliss to me. But what is it, really? And why should you care at all?

 
What?
‘Entity extraction’ is the process of analyzing text to detect and classify so-called ‘entities’ or ‘atomic elements’, such as persons, dates or locations. As my grandma always says, you’d better give an example: “John is so happy with his brand new Jaguar”. There are several entities here: a person called John, his emotional state (happy) and a Jaguar. Okay, nothing too special there, you say? A dictionary or a flat word list could detect these entities, but it would not know what they mean. Add some intelligence and you can make that meaning machine-readable.
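
To make the ‘flat word list’ idea concrete, here is a minimal Python sketch. The word list and the type labels are made up for this example; it only shows that a plain lookup can find the entities without knowing anything about what they mean.

```python
import re

# Hypothetical flat word list (a "gazetteer"): lowercase word -> entity type.
# The entries and type names are assumptions made for this illustration.
GAZETTEER = {
    "john": "Person",
    "happy": "Emotion",
    "jaguar": "Thing",
}

def extract_entities(text):
    """Return (word, type) pairs for every word that is in the word list."""
    entities = []
    for word in re.findall(r"[A-Za-z]+", text):
        entity_type = GAZETTEER.get(word.lower())
        if entity_type is not None:
            entities.append((word, entity_type))
    return entities

print(extract_entities("John is so happy with his brand new Jaguar"))
# prints [('John', 'Person'), ('happy', 'Emotion'), ('Jaguar', 'Thing')]
```

Note that the lookup labels the Jaguar as a ‘Thing’ and nothing more: it cannot tell a car from a cat.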
 
How?
Adding intelligence in this case means programming the relations between all known entities. A Jaguar could be a car, or a wild animal. But since John is so happy with ‘his’ Jaguar, it’s probably the car. The implicit relation between John and the Jaguar is ‘to have’.  Knowing about these relations allows the software to actually understand what’s going on, what you are talking about and how to interpret the content. 
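
As a rough illustration of the reasoning above, the sketch below guesses the meaning of ‘Jaguar’ from nearby words and derives a ‘has’ relation when a possessive is found. The cue lists, the three-word window and the function name are all assumptions for this example; real extraction engines rely on much richer knowledge.

```python
import re

# Hypothetical context cues, invented for this sketch.
POSSESSIVES = {"his", "her", "my", "your", "their", "our"}
ANIMAL_CUES = {"jungle", "zoo", "prey", "wild", "spotted"}

def interpret_jaguar(sentence, owner=None):
    """Guess what 'Jaguar' means and, if it is owned, return a 'has' relation."""
    words = re.findall(r"[a-z]+", sentence.lower())
    if "jaguar" not in words:
        return None
    position = words.index("jaguar")
    # Look at the three words just before 'jaguar' for a possessive.
    window = words[max(0, position - 3):position]
    if any(w in POSSESSIVES for w in window):
        meaning = "car"
    elif any(w in ANIMAL_CUES for w in words):
        meaning = "animal"
    else:
        meaning = "unknown"
    # Only an owned car yields the implicit 'has' relation.
    relation = (owner, "has", "Jaguar") if meaning == "car" and owner else None
    return {"meaning": meaning, "relation": relation}

print(interpret_jaguar("John is so happy with his brand new Jaguar", owner="John"))
# prints {'meaning': 'car', 'relation': ('John', 'has', 'Jaguar')}
```

A sentence such as “The jaguar hunts its prey in the jungle” has no possessive in front of the word, so the same function falls back to the animal cues.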
 
Why?
Why would you want your software to understand your text? No need to fear Big Brother, it’s all about usability. Imagine your text processor doing things for you while you are typing, and I’m not talking about irritating spelling suggestions here. Suppose you’re typing a paper and related titles by other researchers automatically pop up, or you get an overview of blog posts recently published on the same topic, or a screen opens with related articles from your favorite newspaper. The possibilities are endless, literally.
 
In the case of John and his Jaguar, you can imagine how interesting it would be for the company selling Jaguars to know what people think of its product. In fact, several web services do this already. They monitor blogs and tweets around the world, analyze whether each one is positive or negative and report on public opinion. Today my car, a Toyota Yaris, got a 3.15 out of 5 (check out some demos). Not bad!
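
How such a score might come about can be sketched in a few lines of Python. This is not how any of those services actually work: the word lists, the function names and the mapping to a 0 to 5 scale are assumptions made purely for illustration.

```python
import re

# Hypothetical opinion word lists, invented for this sketch.
POSITIVE = {"happy", "love", "great", "reliable"}
NEGATIVE = {"hate", "broken", "terrible", "noisy"}

def post_polarity(text):
    """Return +1, 0 or -1 depending on which kind of opinion word dominates."""
    words = re.findall(r"[a-z]+", text.lower())
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    return (score > 0) - (score < 0)

def product_rating(posts, product):
    """Average the polarity of all posts mentioning the product, scaled to 0..5."""
    mentions = [p for p in posts if product.lower() in p.lower()]
    if not mentions:
        return None  # nobody is talking about this product
    mean = sum(post_polarity(p) for p in mentions) / len(mentions)
    return (mean + 1) / 2 * 5

posts = [
    "I love my Yaris",
    "Great Yaris, so reliable",
    "My Yaris is noisy",
    "So happy with the new Yaris",
]
print(product_rating(posts, "Yaris"))
```

Counting words like this misses sarcasm and negation (“not happy”), which is exactly where knowing the entities and their relations starts to pay off.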
 
Blue Toyota Yaris
 
In fact, market intelligence research by IDC states that unstructured data accounts for more than 80% of all information in organizations (read the paper). Imagine the lost value! Entity extraction, part of a more general field called Natural Language Processing, is a first step in turning the information hidden in unstructured data into real value. Now you can actually ‘process’ what’s inside your text docs, papers, meeting reports and, yes, even emails.
 
Help!
Luckily, we are in the digital era, so there are plenty of tools that can help you.
 

Open Calais


  • Try it out online and see how powerful entity extraction can be (and how it sometimes can give you more than you wanted). Also available for commercial implementations. 
  • Open Calais is also integrated in TopBraid Composer, the user-friendly IDE by TopQuadrant. 
  • Powered by Thomson Reuters (as in the Thomson Reuters)
  • Find out more on the Calais website.
 

Luxid


  • Multilingual text analysis
  • Industrial strength for high-end applications
  • Powered by Temis, a French company
  • Find out more on the Temis website
 

Teragram

 

Other products that get a lot of buzz in the community
 

  • Zemanta – automatic tagging to make your blogs smarter
  • GATE – a General Architecture for Text Engineering – promising!
  • Arisem – our partner’s partner

    Can anyone comment on them?
 
Can’t see the wood for the trees anymore? Don’t panic, there are plenty of open source tools to play around with. You’ll soon find out what you need, and what you definitely don’t need!

 

