1. veracity (quality)
how correct the data is; indicates whether we can trust the data
2. variability
variety: same data, different object
variability: same data, different meaning
3. visibility
capture and properly present the characteristics of data
common types: charts, tables, graphs, maps, infographics, dashboards
4. value
value is derived from the other V's
5. in general
fundamental V's: volume, variety, velocity
characteristics/difficulties: veracity, variability
tools: visibility
objective: value
6. big data management serves the purpose of big data analytics
7. data acquisition
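application oriented: determine what kind of information the problem requires
comprehensive: collect information as comprehensively as possible
handle data: handle information from different sources and of different types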
8. data storage
a) traditional way: designed for structured data, disk-oriented; not suitable for big data
b) big data era
b.2) NoSQL -- HBase, Hive, MongoDB
b.3) Distributed file systems -- HDFS
9. data preparation
a) data exploration: understand your data
b) data pre-processing
data cleaning -- veracity
data integration -- variety
10. data exploration
trends, correlations, outliers, statistics (mean, mode, median, standard deviation, range): can be used for data processing; e.g. if most heights are around 180 or 175, a single value of 17 can be considered dirty data
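A minimal sketch of the height example, assuming Python and only the standard library. The sample values, the helper names `summarize` and `find_outliers`, and the threshold `k` are illustrative assumptions, not part of the notes. The median absolute deviation is used here because one extreme value distorts it less than it distorts the standard deviation; this is one possible rule, not the only one.

```python
import statistics

# Hypothetical sample of heights in cm; 17 is the suspicious entry.
heights = [180, 175, 180, 178, 17, 176, 182]

def summarize(values):
    """Return the descriptive statistics listed in the notes."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "mode": statistics.mode(values),
        "stdev": statistics.stdev(values),
        "range": max(values) - min(values),
    }

def find_outliers(values, k=5.0):
    """Flag values farther than k * MAD from the median.

    MAD is the median absolute deviation; k is an assumed threshold.
    """
    med = statistics.median(values)
    mad = statistics.median([abs(v - med) for v in values])
    if mad == 0:
        # Spread is zero by this measure, so the rule cannot be applied.
        return []
    return [v for v in values if abs(v - med) > k * mad]

print(summarize(heights))
print(find_outliers(heights))  # [17]
```

A flagged value is only a candidate for dirty data; whether to drop or repair it is a data cleaning decision.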
11. data cleaning
dirty data types:
missing values/records: remove the record
invalid data: replace with another value
inconsistency: do additional work
duplicate: merge
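The strategies above can be sketched as follows, assuming Python and records stored as dictionaries. The field names `id` and `height_cm`, the valid range, and the choice of the median as the replacement value are all assumptions made for illustration. Inconsistency is left out because the extra work it needs depends on the data set.

```python
import statistics

def clean(records, low=50, high=250):
    """Apply three of the cleaning strategies from the notes.

    low/high define an assumed valid range for height_cm.
    """
    # Missing values: remove the record.
    complete = [r for r in records if r.get("height_cm") is not None]

    # Invalid data: replace with another value (here, the median of valid ones).
    valid = [r["height_cm"] for r in complete if low <= r["height_cm"] <= high]
    if not valid:
        return []  # nothing trustworthy to use as a replacement
    fallback = statistics.median(valid)
    repaired = [
        {**r, "height_cm": r["height_cm"] if low <= r["height_cm"] <= high else fallback}
        for r in complete
    ]

    # Duplicates: merge records that share an id (the first one wins).
    merged = {}
    for r in repaired:
        merged.setdefault(r["id"], r)
    return list(merged.values())

# Hypothetical input containing each kind of dirty record.
raw = [
    {"id": 1, "height_cm": 180},
    {"id": 2, "height_cm": None},   # missing
    {"id": 3, "height_cm": 17},     # invalid
    {"id": 4, "height_cm": 175},
    {"id": 4, "height_cm": 175},    # duplicate
]
cleaned = clean(raw)
print(cleaned)
```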
12. data integration
merge data from multiple, complex and heterogeneous sources to form a unified view of the data
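A small sketch of the idea, assuming Python and two made-up sources: a CSV export and a JSON export that name their key differently. The source contents, the field names, and the unified schema (`id`, `name`, `city`) are hypothetical. Real integration also has to resolve conflicting values, which this sketch does not attempt.

```python
import csv
import io
import json

# Two hypothetical sources with different formats and different key names.
CSV_SOURCE = "user_id,full_name\n1,Ada\n2,Grace\n"
JSON_SOURCE = '[{"uid": 1, "city": "London"}, {"uid": 3, "city": "Paris"}]'

def empty_row(key):
    """One row of the assumed unified schema."""
    return {"id": key, "name": None, "city": None}

def integrate(csv_text, json_text):
    """Map both sources onto one schema, keyed by a shared id."""
    unified = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        key = int(row["user_id"])
        unified.setdefault(key, empty_row(key))["name"] = row["full_name"]
    for item in json.loads(json_text):
        key = int(item["uid"])
        unified.setdefault(key, empty_row(key))["city"] = item["city"]
    return [unified[k] for k in sorted(unified)]

view = integrate(CSV_SOURCE, JSON_SOURCE)
for row in view:
    print(row)
```

Fields that a source does not supply stay `None`, so gaps remain visible in the unified view.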
13. data curation
data curation includes all the processes needed for principled and controlled data creation, maintenance, and management, together with the capacity to add value to data