
Data Science Job Interview Question Bank [1-100]

2018/03/25 18:49:27

1. What ML techniques do you work with? Are these research-level or production-level techniques? [Source 1]

2. Tell me about an in-depth example of a project you have worked on from inception to completion. What was the project, how did you approach the problem, and what was the end result? [Source 1]

3. What's your favorite algorithm? [Source 1]

4. What level of experience do you have with [programming language]? What do you do daily with [programming language], and what was your hardest challenge with it? [Source 1]

5. What is the largest data set that you have processed? How did you approach this, and what was the end result? [Source 1]

6. How would you explain a linear regression to a business executive? [Source 2]

7. What are some alternative models to a linear regression? Why are they better or worse? [Source 2]

8. Based on the table below, write a SQL query to create a table that shows, for each class, the value of the highest grade in the class. [Source 2]

[Image: table for questions 8 and 9] https://mmbiz.qpic.cn/mmbiz_png/WmswYAExT4fwialBVLlibEcvOJQYcyU0fOVeicCMd1c5oFWeSVOBeJuYth9T4DtQAbmNcHL7eN1FIAUtIqPpUiaRIw/640?wx_fmt=png&tp=webp&wxfrom=5&wx_lazy=1

9. Suppose I had the same table as in the previous question, but instead, for each class, I want to find the name of the student who got the highest grade. Write a query to do that. [Source 2]
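For questions 8 and 9, one possible answer can be sketched with Python's built-in `sqlite3` module. The table image is not reproduced in the text, so the table name `grades`, its columns (`name`, `class_name`, `grade`), and the sample rows are all assumptions made for illustration.

```python
import sqlite3

# Hypothetical schema and rows: the real table is only available as an image,
# so every name and value below is an assumption.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE grades (name TEXT, class_name TEXT, grade INTEGER)")
conn.executemany(
    "INSERT INTO grades VALUES (?, ?, ?)",
    [("Ann", "A", 90), ("Bob", "A", 85), ("Cal", "B", 70), ("Dee", "B", 95)],
)

# Question 8: the highest grade in each class.
max_per_class = conn.execute(
    "SELECT class_name, MAX(grade) FROM grades "
    "GROUP BY class_name ORDER BY class_name"
).fetchall()

# Question 9: the student(s) holding that highest grade in each class.
# Joining back to the per-class maximum keeps every student in a tie.
top_students = conn.execute(
    """
    SELECT g.class_name, g.name, g.grade
    FROM grades AS g
    JOIN (SELECT class_name, MAX(grade) AS top
          FROM grades GROUP BY class_name) AS m
      ON g.class_name = m.class_name AND g.grade = m.top
    ORDER BY g.class_name, g.name
    """
).fetchall()

print(max_per_class)
print(top_students)
```

To materialize the result of question 8 as a table, the same `SELECT` could be wrapped in `CREATE TABLE ... AS SELECT ...`.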

10. In pseudo-code or whatever language you would like: write a program that prints the numbers from 1 to 100. But for multiples of three print "Fizz" instead of the number, and for multiples of five print "Buzz". For numbers which are multiples of both three and five, print "FizzBuzz". [Source 2]
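A minimal Python sketch of one common answer to question 10 follows. The helper name `fizzbuzz` is chosen here for illustration; the combined case is checked first so that multiples of 15 are not reported as plain "Fizz".

```python
def fizzbuzz(n):
    """Return the word to print for n, or n itself as a string."""
    if n % 15 == 0:          # multiple of both three and five
        return "FizzBuzz"
    if n % 3 == 0:
        return "Fizz"
    if n % 5 == 0:
        return "Buzz"
    return str(n)


for i in range(1, 101):      # 1 to 100 inclusive
    print(fizzbuzz(i))
```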

11. A company selling a competitor to Microsoft Office is testing its marketing by sending out two different sets of emails. One set contains business-related content, and one contains consumer-related content. We are interested in how each campaign performed: did one do better at getting people to click through? Below is a selection of graphs on the two email campaigns. The bottom two graphs have the same data as the top two, only bucketed by the amount the customer spent with the company in the year before the emails were sent. Which campaign did better? [Source 2]

12. Explain what regularization is and why it is useful. [Source 3]

13. Which data scientists do you admire most? Which startups? [Source 3]

14. How would you validate a predictive model of a quantitative outcome variable that you built using multiple regression? [Source 3]

15. Explain what precision and recall are. How do they relate to the ROC curve? [Source 3]
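As a starting point for question 15, the definitions can be written out from confusion-matrix counts. This is only a sketch: the function name `classification_rates` and the counts are hypothetical, and it assumes none of the denominators is zero. Recall is the same quantity as the true positive rate on the ROC curve's vertical axis, while the horizontal axis is the false positive rate; precision does not appear on the ROC curve.

```python
def classification_rates(tp, fp, fn, tn):
    """Return (precision, recall, false_positive_rate) from raw counts."""
    precision = tp / (tp + fp)   # share of predicted positives that are correct
    recall = tp / (tp + fn)      # share of actual positives found (ROC y-axis)
    fpr = fp / (fp + tn)         # share of actual negatives flagged (ROC x-axis)
    return precision, recall, fpr


# Hypothetical counts for illustration only.
precision, recall, fpr = classification_rates(tp=8, fp=2, fn=8, tn=82)
print(precision, recall, fpr)
```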

16. How can you prove that an improvement you've brought to an algorithm is really an improvement over not doing anything? [Source 3]

17. What is root cause analysis? [Source 3]

18. Are you familiar with price optimization, price elasticity, inventory management, and competitive intelligence? Give examples. [Source 3]

19. What is statistical power? [Source 3]

20. Explain what resampling methods are and why they are useful. Also explain their limitations. [Source 3]




21. Is it better to have too many false positives or too many false negatives? Explain. [Source 3]

22. What is selection bias, why is it important, and how can you avoid it? [Source 3]

23. Give an example of how you would use experimental design to answer a question about user behavior. [Source 3]

24. What is the difference between "long" ("tall") and "wide" format data? [Source 3]
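The distinction in question 24 can be illustrated with a small reshaping sketch in plain Python. The records and the column names (`id`, `height`, `weight`, `variable`, `value`) are made up for the example: wide data has one row per subject and one column per measurement, while long data has one row per subject-measurement pair.

```python
# Wide format: one row per subject, one column per measured variable.
wide_rows = [
    {"id": 1, "height": 170, "weight": 65},
    {"id": 2, "height": 180, "weight": 80},
]

# Long format: one row per (subject, variable) pair.
long_rows = [
    {"id": row["id"], "variable": key, "value": value}
    for row in wide_rows
    for key, value in row.items()
    if key != "id"
]

for row in long_rows:
    print(row)
```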

25. What method do you use to determine whether the statistics published in an article (or appearing in a newspaper or other media) are either wrong or presented to support the author's point of view, rather than correct, comprehensive factual information on a specific subject? [Source 3]

26. Explain Edward Tufte's concept of "chart junk." [Source 3]

27. How would you screen for outliers, and what should you do if you find one? [Source 3]

28. How would you use extreme value theory, Monte Carlo simulations, or mathematical statistics (or anything else) to correctly estimate the chance of a very rare event? [Source 3]

29. What is a recommendation engine? How does it work? [Source 3]

30. Explain what a false positive and a false negative are. Why is it important to differentiate these from each other? [Source 3]

31. Which tools do you use for visualization? What do you think of Tableau, R, and SAS for graphs? How would you efficiently represent 5 dimensions in a chart (or in a video)? [Source 3]

32. Talk about your prior projects and your contribution. [Source 4]

33. How do you usually handle the ETL process? [Source 4]

34. Differentiate between Data Science, Machine Learning, and AI. [Source 5]

35. Python or R: which one would you prefer for text analytics? [Source 5]

36. Which technique is used to predict categorical responses? [Source 5]

37. What is logistic regression? Or state an example of when you have used logistic regression recently. [Source 5]

38. What are Recommender Systems? [Source 5]

39. Why does data cleaning play a vital role in analysis? [Source 5]

40. Differentiate between univariate, bivariate, and multivariate analysis. [Source 5]

41. What do you understand by the term Normal Distribution? [Source 5]

42. What is Linear Regression? [Source 5]

43. What are Interpolation and Extrapolation? [Source 5]

44. What is power analysis? [Source 5]

45. What is K-means? How can you select K for K-means? [Source 5]

46. What is Collaborative Filtering? [Source 5]

47. What is the difference between Cluster and Systematic Sampling? [Source 5]

48. Are expected value and mean value different? [Source 5]

49. What does the P-value signify about the statistical data? [Source 5]

50. Do gradient descent methods always converge to the same point? [Source 5]
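A small experiment can make question 50 concrete. The sketch below runs plain gradient descent on the non-convex function f(x) = (x² − 1)², which has two minima at x = −1 and x = 1. The function, step size, and starting points are choices made for this illustration, not part of the source question.

```python
def gradient_descent(x, learning_rate=0.05, steps=200):
    """Minimize f(x) = (x**2 - 1)**2 starting from x."""
    for _ in range(steps):
        gradient = 4 * x * (x * x - 1)   # derivative of (x**2 - 1)**2
        x -= learning_rate * gradient
    return x


# Same algorithm and settings, different starting points.
from_left = gradient_descent(-0.5)
from_right = gradient_descent(0.5)
print(from_left, from_right)
```

The two runs end near different minima, so on a non-convex objective the result depends on the initialization. On a convex objective with a suitable step size, the runs would agree.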

51. What are categorical variables? [Source 5]

52. A test has a true positive rate of 100% and a false positive rate of 5%. In the population, 1 in 1,000 people has the condition the test identifies. Given a positive test result, what is the probability of having the condition? [Source 5]
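Question 52 is an application of Bayes' theorem, and the arithmetic can be checked with exact fractions. The variable names below are illustrative; the three input numbers come from the question itself.

```python
from fractions import Fraction

prevalence = Fraction(1, 1000)            # P(condition)
true_positive_rate = Fraction(1)          # P(positive | condition) = 100%
false_positive_rate = Fraction(5, 100)    # P(positive | no condition) = 5%

# Total probability of a positive test.
p_positive = (true_positive_rate * prevalence
              + false_positive_rate * (1 - prevalence))

# Bayes' theorem: P(condition | positive).
posterior = true_positive_rate * prevalence / p_positive

print(posterior, float(posterior))
```

The posterior works out to 20/1019, a little under 2%, because false positives from the large healthy group far outnumber the true positives.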

53. How can you make data normal using the Box-Cox transformation? [Source 5]
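For reference, the one-parameter Box-Cox transform behind question 53 can be written as a short function. This sketch only applies the transform for a given λ (named `lam` here) to strictly positive values; it does not choose λ. In practice λ is usually estimated by maximum likelihood, for example by a library routine such as `scipy.stats.boxcox`.

```python
import math


def box_cox(y, lam):
    """Box-Cox transform of a single positive value y for parameter lam."""
    if y <= 0:
        raise ValueError("Box-Cox requires strictly positive data")
    if lam == 0:
        return math.log(y)           # limiting case as lam approaches 0
    return (y ** lam - 1) / lam


# Hypothetical right-skewed sample, transformed with lam = 0 (log transform).
sample = [1.0, 2.0, 4.0, 8.0, 64.0]
transformed = [box_cox(y, 0) for y in sample]
print(transformed)
```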

54. What is the difference between Supervised Learning and Unsupervised Learning? [Source 5]

55. Explain the use of Combinatorics in data science. [Source 5]

56. Why is vectorization considered a powerful method for optimizing numerical code? [Source 5]

57. What is the goal of A/B Testing? [Source 5]

58. What are an Eigenvalue and an Eigenvector? [Source 5]

59. What is Gradient Descent? [Source 5]

60. How can outlier values be treated? [Source 5]

61. How can you assess a good logistic model? [Source 5]

62. What are the various steps involved in an analytics project? [Source 5]

63. How can you iterate over a list and also retrieve element indices at the same time? [Source 5]
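Question 63 does not name a language. Assuming Python, the built-in `enumerate` is the usual answer; the list contents below are placeholders.

```python
items = ["a", "b", "c"]

# enumerate yields (index, element) pairs; start=0 is the default.
for index, item in enumerate(items):
    print(index, item)

# The same pairs collected into a list.
pairs = list(enumerate(items))
```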

64. During analysis, how do you treat missing values? [Source 5]

65. Explain the Box-Cox transformation in regression models. [Source 5]

66. Can you use machine learning for time series analysis? [Source 5]

67. Write a function that takes in two sorted lists and outputs a sorted list that is their union. [Source 5]
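One way to approach question 67 is the two-pointer merge used in merge sort, sketched below in Python. It assumes both inputs are sorted in ascending order and reads "union" as a merge that keeps duplicates. If a set-style union is wanted instead, repeated values would need to be skipped as they are appended.

```python
def merge_sorted(a, b):
    """Merge two ascending lists into one ascending list in O(len(a) + len(b))."""
    merged = []
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i] <= b[j]:
            merged.append(a[i])
            i += 1
        else:
            merged.append(b[j])
            j += 1
    # At most one of these slices is non-empty.
    merged.extend(a[i:])
    merged.extend(b[j:])
    return merged


print(merge_sorted([1, 3, 5], [2, 3, 6]))
```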

68. What is the difference between a Bayesian Estimate and Maximum Likelihood Estimation (MLE)? [Source 5]

69. What is Regularization, and what kind of problems does regularization solve? [Source 5]

70. What is multicollinearity, and how can you overcome it? [Source 5]

71. What is the curse of dimensionality? [Source 5]

72. How do you decide whether your linear regression model fits the data? [Source 5]

73. What is the difference between squared error and absolute error? [Source 5]
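A tiny numeric comparison can help with question 73. The sketch below computes the mean squared error and the mean absolute error on made-up values where a single prediction is far off, which shows how squaring lets one large error dominate.

```python
def mean_squared_error(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def mean_absolute_error(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical data: three exact predictions and one that misses by 10.
y_true = [1, 2, 3, 4]
y_pred = [1, 2, 3, 14]

mse = mean_squared_error(y_true, y_pred)
mae = mean_absolute_error(y_true, y_pred)
print(mse, mae)
```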

74. What is Machine Learning? [Source 5]

75. How are confidence intervals constructed, and how will you interpret them? [Source 5]

76. How will you explain logistic regression to an economist, a physician-scientist, and a biologist? [Source 5]

77. How can you overcome overfitting? [Source 5]

78. Differentiate between wide and tall data formats. [Source 5]

79. Is Naïve Bayes bad? If yes, in what respects? [Source 5]

80. How would you develop a model to identify plagiarism? [Source 5]

81. How will you determine the number of clusters in a clustering algorithm? [Source 5]

82. Is it better to have too many false negatives or too many false positives? [Source 5]

83. Is it possible to perform logistic regression with Microsoft Excel? [Source 5]

84. What do you understand by fuzzy merging? Which language will you use to handle it? [Source 5]

85. What is the difference between a skewed and a uniform distribution? [Source 5]

86. You created a predictive model of a quantitative outcome variable using multiple regression. What are the steps you would follow to validate the model? [Source 5]

87. What do you understand by Hypothesis in the context of Machine Learning? [Source 5]

88. What do you understand by Recall and Precision? [Source 5]

89. How will you find the right K for K-means? [Source 5]

90. Why does L1 regularization cause parameter sparsity whereas L2 regularization does not? [Source 5]
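The intuition behind question 90 can be seen in a one-dimensional toy problem, chosen here for illustration. Minimizing ½(w − a)² + λ|w| has the closed-form solution known as soft-thresholding, which is exactly zero whenever |a| ≤ λ. Minimizing ½(w − a)² + ½λw² gives a / (1 + λ), which shrinks toward zero but reaches it only when a is zero. The function names below are hypothetical.

```python
def l1_solution(a, lam):
    """argmin over w of 0.5 * (w - a)**2 + lam * abs(w): soft-thresholding."""
    if a > lam:
        return a - lam
    if a < -lam:
        return a + lam
    return 0.0                      # small coefficients are set exactly to zero


def l2_solution(a, lam):
    """argmin over w of 0.5 * (w - a)**2 + 0.5 * lam * w**2: pure shrinkage."""
    return a / (1 + lam)


for a in (3.0, 0.25, -0.25):
    print(a, l1_solution(a, 0.5), l2_solution(a, 0.5))
```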

91. How can you deal with different types of seasonality in time series modelling? [Source 5]

92. In experimental design, is it necessary to do randomization? If yes, why? [Source 5]

93. What do you understand by conjugate prior with respect to Naïve Bayes? [Source 5]

94. Can you cite some examples where a false positive is more important than a false negative? [Source 5]

95. Can you cite some examples where a false negative is more important than a false positive? [Source 5]

96. Can you cite some examples where false positives and false negatives are equally important? [Source 5]

97. Can you explain the difference between a Test Set and a Validation Set? [Source 5]

98. What makes a dataset a gold standard, and why is one needed? [Source 5]

99. What do you understand by statistical power of sensitivity, and how do you calculate it? [Source 5]

100. What is the importance of selection bias? [Source 5]


Copyright © Lüliang University (吕梁学院) 2015 · No. 1 Xueyuan Road, Lishi District, Lüliang 033000