Exclusive: Big data strategy

Big data software research and development: progresses and challenges

  • LIU Yingbo
  • WEI Kai
  • 1. School of Software, National Engineering Laboratory of Big Data System Software; Beijing Key Laboratory of Industrial Big Data System and Applications, Tsinghua University, Beijing 100084, China;
    2. China Academy of Information and Communications Technology, Beijing 100084, China

Received date: 2019-11-08

Revised date: 2020-02-01

Online published: 2020-04-01

Abstract

The rapid development of big data technology has produced a wide variety of big data products, and the ecosystem built on them is now very large. Behind this prosperity, however, users and practitioners find it difficult to understand the current state of big data product development. This paper reviews the core technologies of big data products from two perspectives: data storage and data analysis. Based on the results of authoritative evaluation organizations, it then analyzes the current situation of big data products in the domestic market. Looking ahead, China's big data product R&D needs the participation of the open source community, the cultivation of versatile, cross-disciplinary talent, product segmentation, and interdisciplinary collaborative innovation.

Cite this article

LIU Yingbo, WEI Kai. Big data software research and development: progresses and challenges[J]. Science & Technology Review, 2020, 38(3): 84-93. DOI: 10.3981/j.issn.1000-7857.2020.03.005
