DeFacto: Deep Fact Validation



Introduction

DeFacto (Deep Fact Validation) is an algorithm for validating statements by finding confirming sources for them on the web. It takes a statement (such as "Jamaica Inn was directed by Alfred Hitchcock") as input and then tries to find evidence for the truth of that statement by searching for information on the web. In contrast to typical search engines, it does not just search for textual occurrences of parts of the statement, but tries to find web pages which contain the statement phrased in natural language. It presents the user with a confidence score for the input statement as well as a set of excerpts from relevant web pages, which allows the user to manually inspect the evidence.

The project has two use cases:

* given a statement, it can be used to find provenance information for that statement
* it can directly try to check whether a statement is likely to be true

Architecture and Approach

The DeFacto system consists of several components. The system takes an RDF triple as input and returns a confidence value for this triple as well as possible evidence for the fact. The evidence consists of a set of webpages, textual excerpts from those pages and meta-information on the pages. The text excerpts and the associated meta-information allow the user to quickly get an overview of possible credible sources for the input statement: instead of having to use search engines, browse several webpages and look for relevant pieces of information, the user can review the presented information much more efficiently. Moreover, the system uses techniques which are adapted specifically for fact validation, rather than relying only on the generic information retrieval techniques of search engines.
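To make the input and output concrete, here is a minimal sketch of the data shapes involved. The class and field names are illustrative assumptions in Python and do not mirror the actual (Java) DeFacto code base.

```python
# Minimal sketch of DeFacto's input and output shapes (hypothetical names,
# not the real API of the DeFacto code base).
from dataclasses import dataclass, field
from typing import List

@dataclass
class Triple:
    subject: str    # e.g. "Jamaica Inn"
    predicate: str  # e.g. "director"
    obj: str        # e.g. "Alfred Hitchcock"

@dataclass
class WebPageEvidence:
    url: str
    excerpt: str            # text snippet possibly confirming the fact
    confirms_fact: bool     # did fact confirmation succeed on this page?
    trustworthiness: float  # page-level trust indicator in [0, 1]

@dataclass
class Evidence:
    triple: Triple
    confidence: float  # overall score in [0, 1]
    pages: List[WebPageEvidence] = field(default_factory=list)
```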

Retrieving Webpages: The first task of the DeFacto system is to retrieve webpages which are relevant for the given statement. The retrieval is carried out by issuing several queries to a regular search engine. These queries are computed by verbalizing the RDF triple using natural-language patterns extracted by the BOA framework. As a next step, the highest-ranked webpages for each query are retrieved. Those webpages are candidates for being sources for the input fact. Both the search engine queries and the retrieval of webpages are executed in parallel to keep the response time for users within a reasonable limit. Note that this does not usually put a high load on particular web servers, as the retrieved webpages typically come from several different domains.
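As a rough illustration of this step, the sketch below verbalizes a triple with a few hand-written, BOA-style patterns and issues the resulting queries in parallel. The patterns and the search stub are assumptions made for illustration, not actual output of the BOA framework or a real search engine API.

```python
# Hypothetical sketch of the retrieval step: verbalize the triple with
# BOA-style surface patterns, then fan the queries out concurrently.
from concurrent.futures import ThreadPoolExecutor
from typing import List

# Hand-written stand-ins for BOA patterns of the "director" relation.
PATTERNS = [
    '"{s} was directed by {o}"',
    '"{o} directed {s}"',
    '"{o} is the director of {s}"',
]

def verbalize(subject: str, obj: str) -> List[str]:
    """Turn one RDF triple into several phrased search-engine queries."""
    return [p.format(s=subject, o=obj) for p in PATTERNS]

def search(query: str, top_k: int = 10) -> List[str]:
    """Placeholder: plug in a real search engine client here."""
    return []  # each hit would be a result URL

def retrieve_candidate_pages(subject: str, obj: str) -> List[str]:
    queries = verbalize(subject, obj)
    # Run all queries concurrently to keep user-facing latency low.
    with ThreadPoolExecutor(max_workers=len(queries)) as pool:
        results = pool.map(search, queries)
    # Flatten and deduplicate candidate URLs across queries.
    return list({url for hits in results for url in hits})

# Example: retrieve_candidate_pages("Jamaica Inn", "Alfred Hitchcock")
```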

Evaluating Webpages: Once all webpages have been retrieved, they undergo several further processing steps. First, plain text is extracted from each webpage by removing most HTML markup. We can then apply our fact confirmation approach to this text. In essence, the algorithm decides whether the web page contains a natural-language formulation of the input fact. This step distinguishes DeFacto from information retrieval methods. If no webpage confirms a fact according to DeFacto, the system falls back on lightweight NLP techniques and computes whether the webpage at least provides useful evidence. In addition to fact confirmation, the system computes several indicators for the trustworthiness of a webpage. These indicators are of central importance because a single trustworthy webpage confirming a fact may be a more useful source than several webpages with low trustworthiness. The fact confirmation results and the trustworthiness indicators of the most relevant webpages are presented to the user.
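The deliberately simple stand-in below shows the two sub-steps: stripping markup to plain text, and a proximity heuristic in the spirit of the lightweight fallback. DeFacto's actual fact confirmation is a trained model; the heuristic here is only an assumed approximation for illustration.

```python
# Simplified stand-in for the evaluation step: strip HTML markup, then check
# whether subject and object co-occur close together in the plain text.
import re
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects the text nodes of a page, discarding tags."""
    def __init__(self):
        super().__init__()
        self.chunks = []
    def handle_data(self, data):
        self.chunks.append(data)

def extract_plain_text(html: str) -> str:
    parser = TextExtractor()
    parser.feed(html)
    # Normalize whitespace in the concatenated text nodes.
    return " ".join(" ".join(parser.chunks).split())

def provides_evidence(text: str, subject: str, obj: str, window: int = 100) -> bool:
    """True if subject and object occur within `window` characters of each other."""
    low = text.lower()
    s_hits = [m.start() for m in re.finditer(re.escape(subject.lower()), low)]
    o_hits = [m.start() for m in re.finditer(re.escape(obj.lower()), low)]
    return any(abs(i - j) <= window for i in s_hits for j in o_hits)
```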

Confidence Measurement: In addition to finding and displaying useful sources, DeFacto also outputs an overall confidence value for the input fact. This confidence value ranges between 0% and 100% and serves as an indicator for the user: higher values indicate that the found sources appear to confirm the fact and can be trusted, while low values mean that not much evidence for the fact could be found on the web and that the websites that do confirm the fact (if any exist) show only low trustworthiness. The training data for this step is available at https://github.com/AKSW/DeFacto/blob/master/core/resources/training/data.zip?raw=true. Naturally, DeFacto is a (semi-)automatic approach: we assume that users will not blindly trust the system, but will additionally analyze the provided evidence.
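The final scoring can be pictured as aggregating page-level signals into fact-level features and squashing them into [0, 1]. The features, weights, and bias in this sketch are made-up placeholders, not the model DeFacto learns from the training data linked above; it reuses the hypothetical WebPageEvidence objects from the first listing.

```python
# Hypothetical sketch of the confidence step: page-level signals are
# aggregated into fact-level features and mapped to a score in [0, 1].
import math
from typing import List, Sequence

def fact_features(pages: List["WebPageEvidence"]) -> List[float]:
    confirming = [p for p in pages if p.confirms_fact]
    return [
        float(len(confirming)),                                     # confirming pages
        max((p.trustworthiness for p in confirming), default=0.0),  # best confirming source
        sum(p.trustworthiness for p in pages) / max(len(pages), 1), # average trust
    ]

def confidence(pages: List["WebPageEvidence"],
               weights: Sequence[float] = (0.4, 2.0, 1.0),
               bias: float = -2.0) -> float:
    """Logistic combination of the features; weights here are placeholders."""
    z = bias + sum(w * x for w, x in zip(weights, fact_features(pages)))
    return 1.0 / (1.0 + math.exp(-z))
```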

