‘Entity extraction’ sounds like bliss to me. But what is it, really? And why should you care at all?
What?
‘Entity extraction’ is the process of analyzing text to detect and classify so-called ‘entities’ or ‘atomic elements’, such as persons, dates or locations. As my grandma always says, you’d better give an example: “John is so happy with his brand new Jaguar”. There are several entities here: a person called John, his emotional state (happy) and a Jaguar. Okay, nothing too special there, you say? A dictionary or a flat word list could detect these entities, but without knowing what they mean. Add some intelligence and you can make their meaning machine-readable.
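To make the ‘flat word list’ idea concrete, here is a minimal sketch in Python. The word list and its labels are made up for this example; a real extractor would rely on far larger dictionaries and smarter matching.

```python
import re

# Hypothetical flat word list: surface form -> entity label.
WORD_LIST = {
    "john": "PERSON",
    "happy": "EMOTION",
    "jaguar": "UNKNOWN_THING",  # the list cannot tell a car from an animal
}

def lookup_entities(text):
    """Return (word, label) pairs for every word that appears in the list."""
    words = re.findall(r"[A-Za-z]+", text)
    return [(w, WORD_LIST[w.lower()]) for w in words if w.lower() in WORD_LIST]

found = lookup_entities("John is so happy with his brand new Jaguar")
print(found)
```

The lookup finds John, happy and Jaguar, but it has no idea what kind of Jaguar we are talking about. That is exactly the gap the ‘intelligence’ has to fill.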
How?
Adding intelligence in this case means programming the relations between all known entities. A Jaguar could be a car, or a wild animal. But since John is so happy with ‘his’ Jaguar, it’s probably the car. The implicit relation between John and the Jaguar is ‘to have’. Knowing about these relations allows the software to actually understand what’s going on, what you are talking about and how to interpret the content.
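As a rough illustration of how context can settle the car-versus-animal question, the sketch below counts cue words around the ambiguous term. The senses and cue words are invented for this example, and real systems use much richer relations than a word overlap count.

```python
import re

# Hypothetical senses of an ambiguous word, each with a few cue words
# that hint at that sense when they appear in the same sentence.
SENSES = {
    "jaguar": {
        "CAR": {"his", "her", "my", "new", "brand", "drive", "drives"},
        "ANIMAL": {"jungle", "wild", "prey", "zoo", "spotted"},
    }
}

def disambiguate(word, text):
    """Pick the sense whose cue words overlap most with the sentence."""
    context = set(re.findall(r"[a-z]+", text.lower()))
    scores = {
        sense: len(cues & context)
        for sense, cues in SENSES[word.lower()].items()
    }
    # On a tie, max() returns the first sense listed.
    return max(scores, key=scores.get)

print(disambiguate("Jaguar", "John is so happy with his brand new Jaguar"))
print(disambiguate("Jaguar", "A wild jaguar stalks its prey in the jungle"))
```

Possessives such as ‘his’ stand in for the ‘to have’ relation here: an owned, brand new Jaguar is more likely parked in a garage than roaming the jungle.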
Why?
Why would you want your software to understand your text? No need to fear Big Brother, it’s all about usability. Imagine your word processor doing things for you while you type, and I’m not talking about irritating spelling suggestions here. Suppose you’re writing a paper and related titles by other researchers automatically pop up, or you get an overview of blog posts recently published on the same topic, or a screen opens with related articles from your favorite newspaper. The possibilities are endless, literally.
In the case of John and his Jaguar, you can imagine how interesting it is for the company selling Jaguars to know what people think about their product. In fact, several web services do this already. They monitor blogs and tweets around the world, analyze whether each one is positive or negative and report on public opinion. Today my car, a Toyota Yaris, got a 3.15 out of 5 (check out some demos). Not bad!
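How such a service arrives at its score is not something I can show you, but the basic idea of rating posts as positive or negative can be sketched with two small word lists. Both lists and the scoring rule are assumptions made up for this example, not how any particular service works.

```python
import re

# Hypothetical opinion word lists, invented for this sketch.
POSITIVE = {"happy", "love", "great", "reliable"}
NEGATIVE = {"hate", "broken", "terrible", "noisy"}

def score_post(text):
    """Count positive words minus negative words in one post."""
    words = re.findall(r"[a-z]+", text.lower())
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

def average_score(posts):
    """Average the per-post scores to get a rough public-opinion figure."""
    return sum(score_post(p) for p in posts) / len(posts)

posts = [
    "John is so happy with his brand new Jaguar",
    "My car is broken and noisy",
]
print([score_post(p) for p in posts], average_score(posts))
```

Combine this with entity extraction and you know not only that someone is happy, but also which product they are happy about.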
In fact, market research by IDC states that unstructured data accounts for more than 80% of all information in organizations (read the paper). Imagine the lost value! Entity extraction, part of a more general field called Natural Language Processing, is a first step in turning the information hidden in unstructured data into real value. Now you can actually ‘process’ what’s inside your text docs, papers, meeting reports and, yes, even emails.
Help!
Luckily, we are in the digital era, so there are plenty of tools that can help you:
Open Calais
- Try it out online and see how powerful entity extraction can be (and how it sometimes can give you more than you wanted). Also available for commercial implementations.
- Open Calais is also integrated in TopBraid Composer, the user-friendly IDE by TopQuadrant.
- Powered by Thomson Reuters (yes, the Thomson Reuters)
- Find out more on the Calais website.
Luxid
- Multilingual text analysis
- Industrial strength for high-end applications
- Powered by Temis, a French company
- Find out more on the Temis website
Teragram
Other products that get a lot of buzz in the community
- Zemanta – automatic tagging to make your blogs smarter
- GATE – offering a general architecture for text engineering – promising!
- Arisem – our partner’s partner
Anyone who can comment on them?
Can’t see the wood for the trees anymore? Don’t panic, there are plenty of open source tools to play around with. You’ll find out what you need, and what you definitely don’t need, at a glance!