Wikipedia's description of information integration really leaves a lot to be desired (a good deal of it looks wrong to me, or at least very one-sided):
Information integration (II) (also called information fusion, deduplication and referential integrity) is the merging of information from disparate sources with differing conceptual, contextual and typographical representations. It is used in data mining and consolidation of data from unstructured or semi-structured resources. Typically, information integration refers to textual representations of knowledge but is sometimes applied to rich media content.
Among the technologies available to integrate information are string metrics that allow detection of similar text in different data sources by fuzzy matching.
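To make the "string metrics" in that quote concrete, here is a minimal sketch of one common metric, Levenshtein edit distance, turned into a similarity score between 0 and 1. The function names and the normalization (trimming, lowercasing, dividing by the longer length) are my own assumptions for illustration, not something the Wikipedia text prescribes.

```python
def levenshtein(a: str, b: str) -> int:
    """Number of single-character insertions, deletions, or substitutions
    needed to turn string a into string b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute (or keep) the character
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Assumed normalization: 1.0 means identical after trimming and lowercasing."""
    a, b = a.strip().lower(), b.strip().lower()
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest
```

Two records would then be treated as "probably the same" when `similarity` exceeds some threshold, and choosing that threshold is itself a judgment call that depends on the data.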
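In my opinion, information integration has not yet become a mature discipline, so it has no rigorous definition, methodology, or framework. What follows are just some personal views: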
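To integrate, you first have to be clear about the goal. The end result should probably be a knowledge base, one in which heterogeneous information is integrated rather than merely collected in one place. "Heterogeneous" is the key word here, and also the most challenging part.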
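Second, consider how structured the stored information is: database -> XML -> ontology. The most mature of these is probably integration at the database level, where the typical solution is the data warehouse. But integration in a data warehouse needs a lot of human involvement to define the integration rules, so the degree of automation is very low; the ETL process is a good example.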
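To illustrate why the ETL step is so manual, here is a small sketch with two made-up source systems. Everything in it is hypothetical (the field names, the `COUNTRY_CODES` lookup, the sample rows); the point is only that each source needs its own hand-written transform rule before the records can land in one warehouse table.

```python
# Hypothetical lookup table that someone has to maintain by hand.
COUNTRY_CODES = {"United Kingdom": "GB", "Germany": "DE"}


def transform_crm(row: dict) -> dict:
    """Hand-written rule for a hypothetical CRM export."""
    return {
        "customer_id": int(row["id"]),
        "name": row["full_name"].strip().title(),
        "country": row["country_code"].upper(),
    }


def transform_billing(row: dict) -> dict:
    """Hand-written rule for a hypothetical billing system with a different layout."""
    return {
        "customer_id": int(row["cust_no"]),
        "name": f'{row["first"]} {row["last"]}'.strip().title(),
        "country": COUNTRY_CODES[row["country"]],
    }


def load(warehouse: dict, record: dict) -> None:
    """Merge a cleaned record into the warehouse, keyed by customer_id."""
    warehouse.setdefault(record["customer_id"], {}).update(record)


warehouse = {}
crm_rows = [{"id": "7", "full_name": " ada lovelace ", "country_code": "gb"}]
billing_rows = [{"cust_no": "7", "first": "Ada", "last": "Lovelace",
                 "country": "United Kingdom"}]

for row in crm_rows:
    load(warehouse, transform_crm(row))
for row in billing_rows:
    load(warehouse, transform_billing(row))
```

Every new source means another `transform_*` function written and maintained by a person, which is exactly the low automation complained about above.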
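For integration, mapping or matching the information is the core problem. This is where two research topics that have become very hot recently come in: schema matching and ontology matching.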
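As a rough idea of what the simplest kind of schema matching looks like, here is a sketch of a purely name-based matcher: it compares column names from two schemas and greedily pairs the most similar ones. The function names, the normalization, and the 0.6 threshold are assumptions for illustration; real schema matchers also use data types, instance values, and structure, which this sketch ignores.

```python
from difflib import SequenceMatcher


def normalize(name: str) -> str:
    """Assumed normalization: ignore case, underscores, and hyphens."""
    return name.lower().replace("_", "").replace("-", "")


def name_similarity(a: str, b: str) -> float:
    """Similarity of two column names in [0, 1], using difflib from the standard library."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def match_schemas(source, target, threshold=0.6):
    """Greedy one-to-one matching: take the best-scoring pairs first."""
    candidates = sorted(
        ((name_similarity(s, t), s, t) for s in source for t in target),
        reverse=True,
    )
    mapping = {}
    used_targets = set()
    for score, s, t in candidates:
        if score < threshold:
            break  # everything after this scores even lower
        if s in mapping or t in used_targets:
            continue  # each column is matched at most once
        mapping[s] = t
        used_targets.add(t)
    return mapping


mapping = match_schemas(["cust_name", "phone"], ["CustName", "Phone", "zip"])
```

Here `mapping` pairs `cust_name` with `CustName` and `phone` with `Phone`, while `zip` stays unmatched. A matcher like this breaks down as soon as two schemas use different words for the same concept, which is part of why the problem is still a research topic.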
To be continued.