A New Beginning

This is a new beginning. I set up my Hexo blog today, and it is still unfinished (it happens to be finals season).

This post sums up a few things I did over the past several days. It also serves as the first post on the new Hexo blog.

Part one: the written test for the sklcc lab recruitment

I did three programming problems and four short-answer questions, learning everything on the spot as I went.

Programming problems:

first:socket

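Socket programming, by Senix
Background
Grandpa Jiang is a highly respected elder, loved by people across the country. To support him, people voluntarily donated one second out of each of their minutes, forming a huge time stream: 11111111111111111111111… However, the time streams from all over the country ran into trouble on the way to Grandpa, and some of the time was lost: 1011100010000….. Grandpa Jiang is troubled. Clever as you are, can you help him compute the sum of two time streams and return it to him, still in time-stream form, to add one more second for Grandpa?
Task
Because the time streams from all over the country never stop flowing, you need to fetch the latest time stream from Grandpa.
We communicate with Grandpa over a socket to get this string. Grandpa's address is 10.10.65.153, and the port open to you is 3336.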

Here is the simple code I wrote in Python:

import socket
ip = ('10.10.65.153',3336)
s = socket.socket()
s.connect(ip)

#GET
msg1 = s.recv(400).decode()

#PROCESS
# the message holds two binary strings separated by a comma
msgList = msg1.split(',')
num1 = int(msgList[0],2)
num2 = int(msgList[1],2)

# binary digits of the sum, without the '0b' prefix
ans = format(num1+num2, 'b')
print(ans)
# left-pad with zeros to 100 characters
ans = ans.zfill(100)

#RETURN
returnString = "王仁杰1627406066-13773289761 收到请回复"
s.send(ans.encode())
s.send(returnString.encode())

msg2 = s.recv(300).decode()
msg3 = s.recv(300).decode()
print(msg2)
print(msg3)
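
The processing step can be tried offline, without the server. Below is a minimal sketch, assuming (as the code above does) that the message is two binary strings separated by a comma and that the answer is padded to 100 characters. The helper names `add_streams` and `parse_message` are made up for illustration.

```python
def add_streams(a, b, width=100):
    """Add two binary strings and left-pad the sum with zeros to `width`."""
    total = int(a, 2) + int(b, 2)
    return format(total, "b").zfill(width)


def parse_message(msg):
    """Split a 'stream1,stream2' message into its two parts."""
    first, second = msg.strip().split(",")[:2]
    return first, second
```

For example, `add_streams(*parse_message("101,11"))` adds 5 and 3 and returns `1000` preceded by 96 zeros.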

second: processes

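The second problem is about processes. Because pipes were the first thing I learned, my solution does not really meet the requirements of the problem. Looking back, signals would have been a better fit, and the code would be shorter too.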

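Rock-paper-scissors! by Senix
Background
One day Xiao Liang had a sudden idea: what would it be like to play rock-paper-scissors against himself?
His brain acts as the referee while his right hand and left hand compete. After every round the brain tells both hands who won.
Task
Since the brain does the judging and the two hands do the playing, your job is to simulate this on a computer.
Write code in which a parent process creates child processes to play, and the parent process decides the final result. The parent must announce the result of every round immediately.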

Here is the code I submitted:

#include<cstdio>
#include<iostream>
#include<unistd.h>
#include<sys/wait.h>
#include<stdlib.h>
#include<cstring>
#include<ctime>
#define maxn 2048
using namespace std;

//randomly pick the winner; the generator is seeded in the child before the call
bool judge(){
    return rand()%2;
}

int main(){
    char buf[maxn],sendMessage[maxn];
    //loop instead of calling main() recursively, which C++ does not allow
    while(true){
        int fd[2];
        if(pipe(fd) == -1){
            //failed
            printf("Create pipe failed\n");
            exit(1);
        }
        pid_t pid = fork();//called once, returns twice
        if(pid<0){
            //error raised
            //1.the number of processes exceeds the system limit
            //2.not enough memory
            printf("Create fork failed\n");
            exit(1);
        }else if (pid == 0){
            //in the child: send the result
            srand(unsigned(time(0)));
            printf("Process %d start Judging\nThe ans is:",getpid());
            fflush(stdout);
            close(fd[0]);
            if(judge()) strcpy(sendMessage, "Left hand wins\n");
            else strcpy(sendMessage,"Right hand wins\n");
            write(fd[1],sendMessage,strlen(sendMessage));
            close(fd[1]);
            return 0;
        }else {
            //in the parent: receive and announce the result
            printf("Father %d get alive\n",getpid());
            fflush(stdout);
            close(fd[1]);
            ssize_t n = read(fd[0], buf, maxn);//blocks until the child writes
            if(n > 0) write(STDOUT_FILENO, buf, n);
            close(fd[0]);
            wait(NULL);//reap the child
        }
        printf("Please wait 10 seconds\n");
        sleep(10);
    }
}
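
For comparison, here is a rough sketch of a version that is closer to the problem statement: the parent forks two children, one per hand, each child sends its move through a pipe, and the parent judges the round. It still uses pipes rather than signals, it is written in Python rather than C++, and every name in it (`judge`, `spawn_hand`, `play_round`) is hypothetical. It assumes a Unix system where `os.fork` is available.

```python
import os
import random

MOVES = ("rock", "paper", "scissors")
BEATS = {"rock": "scissors", "scissors": "paper", "paper": "rock"}


def judge(left, right):
    """Return the round result from the parent's (the brain's) point of view."""
    if left == right:
        return "Draw"
    return "Left hand wins" if BEATS[left] == right else "Right hand wins"


def spawn_hand():
    """Fork a child that picks a move and writes it into a pipe."""
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:  # child: one hand
        os.close(read_fd)
        # Seed per process: a forked child inherits the parent's random state.
        move = random.Random(os.getpid()).choice(MOVES)
        os.write(write_fd, move.encode())
        os._exit(0)  # leave immediately, without running parent code
    os.close(write_fd)  # parent keeps only the read end
    return pid, read_fd


def play_round():
    """Parent: start both hands, read their moves, and judge the round."""
    hands = [spawn_hand(), spawn_hand()]
    moves = []
    for pid, read_fd in hands:
        moves.append(os.read(read_fd, 16).decode())
        os.close(read_fd)
        os.waitpid(pid, 0)  # reap the child
    left, right = moves
    return left, right, judge(left, right)
```

Calling `play_round()` in a loop and printing the third element of the result would give the per-round announcement the task asks for.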

third: web crawler

This is a fun crawler problem.

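I want to rant! by the Python group
Background
Xiao Sun loves watching danmaku (bullet-comment) videos on Bilibili, but the quality of the comments varies a lot: some are wildly imaginative jokes, others are plain rude. So he wants to count the word frequencies in Bilibili danmaku. Clever as you are, can you help Xiao Sun crawl the danmaku of a given Bilibili video?
Task
Xiao Sun will give you a URL, and you need to crawl all of the danmaku.
Segment every crawled comment into words, remove punctuation and the stop words listed in the given stop-word file, and count word frequencies. Store the result as a dictionary ({word1: frequency1, word2: frequency2, ...}).
Save the result as a JSON file named source.json.
Sort the result by frequency from high to low, and write a static web page that shows the five most frequent words in a bar chart.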

I implemented the crawler in Python. The code style is rather ugly; it really is the work of someone who learned it the same day.

import requests
import random
import time
import socket
import http.client
import re
import gc
import collections
import json

from bs4 import BeautifulSoup

def getContent(url , data = None):
    header={
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
        'Accept-Encoding': 'gzip, deflate, sdch',
        'Accept-Language': 'zh-CN,zh;q=0.8',
        'Connection': 'keep-alive',
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/43.0.235'
    }

    timeout = random.choice(range(80, 180))
    while True:
        try:
            rep = requests.get(url,headers = header,timeout = timeout)
            rep.encoding = 'utf-8'
            break

        except socket.timeout as e:
            print( '3:', e)
            time.sleep(random.choice(range(8,15)))

        except socket.error as e:
            print( '4:', e)
            time.sleep(random.choice(range(20, 60)))

        except http.client.BadStatusLine as e:
            print( '5:', e)
            time.sleep(random.choice(range(30, 80)))

        except http.client.IncompleteRead as e:
            print( '6:', e)
            time.sleep(random.choice(range(5, 15)))

    return rep.text

def getDanmu(html_text):
    bs = BeautifulSoup(html_text, "html.parser")  # create the BeautifulSoup object
    # each danmaku comment is the text content of a <d> element
    listOfdanmu = []
    for d in bs.find_all('d'):
        text = d.get_text()
        if text:  # skip empty elements
            listOfdanmu.append(text)
    return listOfdanmu

def getStopList():
    stopFile = open("stopwords.txt", "r")
    try:
        flag = 0
        readList = []
        while not flag:
            newLine = stopFile.readline()
            if(newLine == '\n'): continue
            if(newLine != ''):
                readList.append(newLine.rstrip())
            else:
                flag = 1

        segments = ['。','，','、','＇','：','∶','；','?','‘','’','“','”','〝','〞','ˆ',
                    'ˇ','﹕','︰','﹔','﹖',',','﹑','·','¨','…','.','¸',';','！','´','？',
                    '！','～','—','ˉ','｜','‖','＂','〃','｀','@','﹫','¡','¿','﹏','﹋',
                    '﹌','︴','々','﹟','#','﹩','$','﹠','&','﹪','%','*','﹡','﹢','﹦','﹤',
                    '‐','￣','¯','―','﹨','ˆ','˜','﹍','﹎','+','=','<','','','＿','_',"'",
                    '-','\\','ˇ','~','﹉','﹊','（','）','〈','〉','‹','›','﹛','﹜','『','』',
                    '〖','〗','［','］','《','》','〔','〕','{','}','「','」','【','】','︵','︷',
                    '︿','︹','︽','_','﹁','﹃','︻','︶','︸','﹀','︺','︾','ˉ','﹂','﹄','︼']
        for i in segments:
            readList.append(i)
        return readList
    except:
        print("wrong at getStopList")
        exit()
    finally:
        stopFile.close()

def solveSplit(List):
    tempList = List
    if(type(tempList) != type([])):
        print("Wrong at solve")
        exit()

    stopList = getStopList()
    finalList = []
    for stop in range(len(stopList)):
        if not stopList[stop]: continue  # skip None and empty strings; split('') raises ValueError
        if(stopList[stop] == '\n' ): continue
        list = []
        for temp in range(len(tempList)):
            if(tempList[temp] == None): continue
            if(type(tempList[temp]) != type('a')):continue
            cache = tempList[temp].split(stopList[stop])
            if(cache == ['']):continue
            list.append(cache[0].strip())
        #cover
        tempList = list
        # memory clear
        del list
        gc.collect()
    final = sorted(tempList)
    return final

def solveCount(list):
    countDict = collections.Counter(list)
    print(countDict)
    return countDict

def save(dict):
    fd = open("source.json","w+", encoding='utf-8')
    try:
        fd.write(json.dumps(dict, ensure_ascii=False, indent=2))

    finally:
        fd.close()

if __name__ == '__main__':
    url ='http://comment.bilibili.com/17158352.xml'
    html = getContent(url)
    danmutemp = getDanmu(html)
    danmuSplit = solveSplit(danmutemp)
    danmuDict = solveCount(danmuSplit)
    save(danmuDict)

    #http://comment.bilibili.com/17158352.xml
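
The counting and saving steps from the task can be sketched on their own. This is a minimal illustration, assuming the comments have already been segmented into a flat list of tokens. The function names are hypothetical, and only `source.json` comes from the problem statement.

```python
import json
from collections import Counter


def count_words(tokens, stop_words):
    """Count tokens, skipping blanks and anything listed as a stop word."""
    stop = set(stop_words)
    return Counter(t for t in tokens if t.strip() and t not in stop)


def top_items(counts, n=5):
    """Return the n most frequent (word, count) pairs, highest first."""
    return counts.most_common(n)


def save_counts(counts, path="source.json"):
    """Write the counts as a {word: frequency} JSON object."""
    with open(path, "w", encoding="utf-8") as fd:
        json.dump(dict(counts), fd, ensure_ascii=False, indent=2)
```

`top_items` returns exactly the data the bar chart needs: the five words with their counts, already sorted from high to low.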

That is the crawling code. What remains is the web page and the JavaScript, which uses d3.js.

var width = 600;  
var height = 600;  
  
var dataset = {
  x:['献','回','莱纳','胡佛','卧槽'],
  y:[98,97,48,47,45]
};

var svg = d3.select("body").append("svg")  
                        .attr("width",width)  
                        .attr("height",height);  
  
var xAxisScale = d3.scale.ordinal()  
                .domain(dataset.x)  
                .rangeRoundBands([0,500]);  
                      
var yAxisScale = d3.scale.linear()  
                .domain([0,d3.max(dataset.y)])  
                .range([500,0]);  
                      
var xAxis = d3.svg.axis()  
                .scale(xAxisScale)  
                .orient("bottom");  
  
var yAxis = d3.svg.axis()  
                .scale(yAxisScale)  
                .orient("left");  

var xScale = d3.scale.ordinal()  
                .domain(dataset.x)
                .rangeRoundBands([0,500],0.05,0);  
                      
var yScale = d3.scale.linear()  
                .domain([0,d3.max(dataset.y)])  
                .range([0,500]);  

//bars
svg.selectAll("rect")  
   .data(dataset.y)  
   .enter()  
   .append("rect")  
   .attr("x", function(d,i){  
        return 30 + xScale(dataset.x[i]);  
   } )  
   .attr("y",function(d,i){  
        return 50 + 500 - yScale(d) ;  
   })  
   .attr("width", function(d,i){  
        return xScale.rangeBand();  
   })  
   .attr("height",yScale)  
   .attr("fill","blue");  

//value labels
svg.selectAll("text")  
    .data(dataset.y)  
    .enter().append("text")  
    .attr("x", function(d,i){  
        return 30 + xScale(dataset.x[i]);  
   } )  
   .attr("y",function(d,i){  
        return 50 + 500 - yScale(d) ;  
   })  
    .attr("dx", function(d,i){  
        return xScale.rangeBand()/3;  
   })  
    .attr("dy", 15)  
    .attr("text-anchor", "start")  
    .attr("font-size", 14)  
    .attr("fill","white")  
    .text(function(d,i){  
        return d;  
    });  
     
svg.append("g")  
    .attr("class","axis")  
    .attr("transform","translate(30,550)")  
    .call(xAxis);  
      
svg.append("g")  
    .attr("class","axis")  
    .attr("transform","translate(30,50)")  
    .call(yAxis);

I barely wrote any HTML; the page is only there to display the chart.

<!DOCTYPE html>
<html>
<head>
	<meta charset="utf-8">
	<title>huchi's chart</title>
</head>
<body align="center">
<h1 align="center">Bar chart</h1>
<script src="http://d3js.org/d3.v3.min.js" charset="utf-8">
</script>  
<script type="text/javascript" src="javascript.js"></script>

</body>
</html>

Short-answer questions:

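The short-answer questions were very interesting too, but for lack of time I only picked four of them to answer. Of the remaining two, one is about cloud computing and the other about databases; I did not finish those.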

first: networking

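A while ago I read 图解HTTP (an illustrated guide to HTTP), so after looking up some more material on Baidu I wrote roughly the following: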

Briefly describe the process from typing www.baidu.com to getting the result back, and the technologies or protocols involved.

Getting the IP address

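Open the browser and type Baidu's address into the address bar.
Based on the URI entered, the browser first checks its own cache for an unexpired record,
then checks the hosts file for an IP address matching Baidu's domain name.
If there is none, the browser sends a DNS request to the local DNS module, which builds the DNS message.
The DNS module hands the DNS message to the UDP unit in the transport layer.
The UDP unit wraps the data in a UDP datagram and hands it to the IP unit in the network layer.
The IP unit wraps the data in an IP packet.
The finished IP packet is handed to the data link layer for sending.
When sending, if the ARP cache (which maps IP addresses to physical MAC addresses) has no matching entry, an ARP broadcast request is sent and the host waits for the ARP (Address Resolution Protocol) reply.
Once the ARP reply arrives, the mapping from the IP address to the next-hop MAC address is written into the ARP cache table.
After the cache is updated, the next-hop address is filled in as the destination MAC address and the data is forwarded as a frame.
Going back up the stack:
The DNS request arrives at the data link layer of the DNS server. The IP packet inside it is passed to the IP unit in the network layer, the UDP datagram inside that to the UDP unit in the transport layer, and the DNS message inside that to the DNS service on the server, which resolves the name to its IP address and produces a DNS response message.
The resolution result is written into the DNS cache table as a mapping from domain name to IP address.
(In short: DNS, UDP, IP, MAC, send, IP, UDP, DNS, and then the DNS response comes back.)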

If several IP addresses come back, the first one is generally used.

Protocols used: DNS (over UDP, port 53), ARP, IP

Technology used: DNS-based load balancing
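
To make the "DNS module builds the DNS message" step concrete, here is a minimal sketch that assembles the bytes of a standard DNS query for an A record. The function name and the fixed query ID are assumptions for illustration. The sketch only builds the message and does not send it.

```python
import struct


def build_dns_query(hostname, query_id=0x1234):
    """Build the bytes of a standard DNS query for an A record."""
    # Header: ID, flags (0x0100 = recursion desired), 1 question, 0 other records.
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    # Question name: each label is prefixed by its length, ending with a zero byte.
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in hostname.split(".")
    ) + b"\x00"
    # QTYPE 1 = A record, QCLASS 1 = IN (Internet).
    question = qname + struct.pack("!HH", 1, 1)
    return header + question
```

These bytes are what the UDP unit would wrap in a datagram addressed to port 53 of the DNS server.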

The TCP three-way handshake

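Now that we have an IP address, the browser wants to send a GET or POST request to the web server, but first it has to establish a TCP connection.

The client sends a connection request, the server replies, and the client sends back an acknowledgment. The HTTP request goes out after that, and a 2XX status code (200 OK) in the response generally means everything went fine.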

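1. The client sends a TCP connection request segment: TCP, IP, MAC (an ARP request may be needed), then on to Baidu's host or a proxy.
2. On Baidu's side: MAC, IP, TCP; the server replies to the request and asks for an acknowledgment.
3. The client receives the reply after the same chain of steps, then sends back an acknowledgment.
4. If that acknowledgment is not received, step 2 is repeated.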

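HTTP/1.1 uses persistent connections by default to reduce traffic, which avoids setting up and tearing down a TCP connection over and over.

The last stage is closing the TCP connection (the four-way teardown). By then we have received the HTML for www.baidu.com, which the browser parses and renders for me.

Protocols used: TCP, HTTP.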
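
The whole exchange (connect, request, response, close) can be observed on one machine. The sketch below is an assumption-laden illustration, not a model of Baidu's servers. It starts a tiny local HTTP server on 127.0.0.1, issues one GET with the standard library's `http.client`, and returns the status. `HelloHandler` and `fetch_once` are made-up names.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


class HelloHandler(BaseHTTPRequestHandler):
    """Hypothetical handler that answers every GET with a tiny HTML page."""

    def do_GET(self):
        body = b"<html><body>hello</body></html>"
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the demo quiet


def fetch_once():
    """Start a local server, issue one GET, and return (status, reason, body)."""
    server = HTTPServer(("127.0.0.1", 0), HelloHandler)  # port 0: pick a free port
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    try:
        conn = http.client.HTTPConnection("127.0.0.1", server.server_port, timeout=5)
        # request() first opens the TCP connection (the kernel performs the
        # three-way handshake), then sends the HTTP request over it.
        conn.request("GET", "/")
        response = conn.getresponse()
        result = (response.status, response.reason, response.read())
        conn.close()  # closing the socket starts the TCP teardown
        return result
    finally:
        server.shutdown()
        server.server_close()
```

Running a packet capture on the loopback interface while calling `fetch_once()` is one way to see the handshake and teardown segments themselves.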

second: Python metaclasses

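My understanding of this topic is honestly not good enough. I typed my way through Liao Xuefeng's tutorial once, but did not really get it (●’◡’●).

Here is the answer I wrote:

A metaclass is like a template for classes: it controls how classes are created and can create classes dynamically. That is why it derives from type.

By modifying class definitions through a metaclass, you can implement things like an ORM (object-relational mapping). The metaclass looks up the attributes defined in the class, saves them, and then deletes them (to prevent runtime errors). A class that uses the metaclass can implement the database operations, and every subclass of that class can then be treated as a table, which gives you a simple database. (The example comes from Liao Xuefeng's tutorial.)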

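You can freely and dynamically modify, add, or delete methods or attributes of classes or instances
Apply a decorator to a batch of methods without writing @decorator_func above each one
When a class in a third-party library needs patching, a metaclass can do it
Serialization (see how the yaml library does it; I did not look closely)
Interface registration, interface format checking, and so on
Automatic delegation (auto delegate)
(The list above is quoted.)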

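The main use of metaclasses is building APIs.

The main idea is:

Intercept the creation of the class
Modify the class
Return the modified class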
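
The three steps can be sketched with a toy metaclass. This is only an illustration under my own assumptions: `UpperAttrMeta` and `Config` are hypothetical names, and renaming attributes to upper case is an arbitrary choice of "modification".

```python
class UpperAttrMeta(type):
    """Hypothetical metaclass: renames public attributes to upper case."""

    def __new__(mcls, name, bases, namespace):
        # 1. intercept class creation: the class body arrives as a dict
        modified = {}
        for key, value in namespace.items():
            # 2. modify the class: upper-case every non-dunder attribute name
            if key.startswith("__"):
                modified[key] = value
            else:
                modified[key.upper()] = value
        # 3. return the modified class
        return super().__new__(mcls, name, bases, modified)


class Config(metaclass=UpperAttrMeta):
    host = "localhost"
    port = 3306
```

After the class statement runs, `Config.HOST` and `Config.PORT` exist while `Config.host` does not, because the metaclass rewrote the namespace before the class object was built.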

third: JavaScript closures

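I happen to be reading JavaScript DOM编程艺术 (DOM Scripting) these days.

Here is my answer:

Closures sound fancy, but they really come down to how variable scope works.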

var a = 1;
function test(){
	alert(a);
}
test();

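Clearly, reading an outer var variable from inside a function is easy. The reverse does not work: code outside a function cannot read the function's local variables.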

function test(){
	var a = 2;//local to test
  	add = function(){a++;};//global
	function innerTest(){
		alert(a);
	}
	return innerTest;
}
var result = test();
result();

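That is a closure: by returning an inner function, it builds a bridge between the inside of the function and the outside.

But a closure keeps its captured variables in memory, so memory use keeps climbing, and in bad cases it leaks. Unused local variables should be released before leaving the function.

This scoping mechanism is what makes closures possible, but it also dug a big pit for me.

The anonymous-function pitfall: an anonymous function sees the last value of any variable it captures.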

//show the picture in old page
var links = document.getElementsByTagName("a");
for(var i=0; i < links.length; i++){
	var source = links[i].getAttribute("href");
	if(!source) continue;
	
	links[i].onmouseover = function(){
		var temp = links[i].getAttribute("href");//wrong here	
		var Holder = document.getElementById("pictureHolder");
		Holder.setAttribute("src", temp);
	};
}

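Here the i seen by the anonymous function is 3 (the loop had already finished by the time the handler ran), so links[i] turned out to be undefined... gg.

Changing links[i] to this fixes it.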
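
The same pitfall exists in Python, which makes it easy to reproduce outside a browser. The sketch below is a Python analogue rather than the JavaScript above, with made-up function names. The first version captures the loop variable itself, and the second freezes its current value with a default argument. One difference from JavaScript is that the Python loop variable ends at the last item rather than at the length.

```python
def make_handlers_late(n):
    """Each lambda looks up `i` when called, after the loop has finished."""
    return [lambda: i for i in range(n)]


def make_handlers_bound(n):
    """A default argument freezes the current value of `i` for each lambda."""
    return [lambda i=i: i for i in range(n)]
```

With `n = 3`, every handler from `make_handlers_late` returns 2, while the handlers from `make_handlers_bound` return 0, 1 and 2.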

fourth: ways for parent and child processes to communicate

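This one really confused me, because there are so many ways to communicate... and the answer was limited to 200 characters. In the end I probably wrote 200 characters for each method (●’◡’●).

1. Pipes

Anonymous pipes are for communication between related processes; named pipes work between any processes.

How a pipe works: it is really a ring-buffer queue in the kernel, backed by a kernel buffer (4k).

The pipe serves as the communication medium. The code below uses pipe to create the pipe and fork to create the process, and the child sends a message to the parent.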

#include<cstdio>
#include<iostream>
#include<unistd.h>
#include<sys/wait.h>
#include<stdlib.h>
#include<cstring>
#define maxn 2048
using namespace std;

int main(){
    pid_t pid;
    char buf[maxn],written[]={"hello world\n"};
    int fd[2],n;
    if(pipe(fd) == -1){
        //failed
        printf("Create pipe failed\n");
        exit(0);
    }
    pid = fork();//use once,return twice
    if(pid<0){
        //error raised
        //1.the number of processes exceeds the system limit
        //2.not enough memory
        printf("Create fork failed\n");
        exit(1);
    }else if (pid == 0){
        //in the child send
        printf("I am the child,my process ID is %d\n",getpid());
        close(fd[0]);
        write(fd[1],written,strlen(written));
    }else {
        //parent get 
        printf("I am the parent,my process Id is %d\n",getpid());
        close(fd[1]);
        n = read(fd[0], buf, maxn);
        write(STDOUT_FILENO, buf, n);
    }
    return 0;
}

2. Signals

Send a signal, then catch it

#include<cstdio>
#include<iostream>
#include<signal.h>
#include<cstdlib>
#include<unistd.h>
#include<sys/wait.h>
#include<sys/types.h>
using namespace std;
static volatile sig_atomic_t alarm_fired = 0;//safe to set from a signal handler
void ouch(int sig){
    alarm_fired = 1;
}

int main(){
    pid_t pid;
    pid = fork();
    if(pid == -1){
        perror("fork failed\n");
        exit(1);
    }else if(pid == 0){
        sleep(5);
        kill(getppid(), SIGALRM);  //send a signal to the parent
        exit(0);
    }
    signal(SIGALRM, ouch);//Once accept then run ouch
    while(!alarm_fired){
        printf("Hello world\n");
        sleep(1);
    }
    if(alarm_fired){
        printf("\nI got a signal %d\n",SIGALRM);
    }
    exit(0);
}
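
The same send-then-catch pattern can be sketched in Python. This is an illustration under a few assumptions: it needs a Unix system, it must run in the main thread (a requirement of `signal.signal`), and it swaps SIGALRM for SIGUSR1 so it does not interfere with alarm-based timers. The names `received`, `on_signal` and `wait_for_child_signal` are hypothetical.

```python
import os
import signal
import time

received = []


def on_signal(signum, frame):
    """Handler: only record that the signal arrived."""
    received.append(signum)


def wait_for_child_signal(delay=0.2):
    """Fork a child that signals the parent after `delay` seconds."""
    received.clear()
    signal.signal(signal.SIGUSR1, on_signal)  # install the handler before forking
    pid = os.fork()
    if pid == 0:  # child
        time.sleep(delay)
        os.kill(os.getppid(), signal.SIGUSR1)
        os._exit(0)
    while not received:  # parent: poll until the handler has run
        time.sleep(0.05)
    os.waitpid(pid, 0)  # reap the child
    return received[0]
```

Installing the handler before the fork removes the window in which the child's signal could arrive while the parent still has the default action.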

3. Message queues

Create the message queue with msgget; the child sends messages with msgsnd, and the parent receives them with msgrcv and prints them.

#include<cstdio>
#include<iostream>
#include<signal.h>
#include<cstdlib>
#include<unistd.h>
#include<sys/wait.h>
#include<sys/types.h>
#include<sys/msg.h>
#include<errno.h>
#include<cstring>
#define MAX_TEXT 512
using namespace std;
struct msg_st{
    long int msg_type;
    char text[BUFSIZ];
};

int main(){
    pid_t pid;
    int running = 1;
    int msgid = -1;
    msg_st data;
    long int msgtype = 0;
    char buffer[BUFSIZ];
    //set up
    msgid = msgget((key_t)1234, 0666 | IPC_CREAT);
    if(msgid == -1){
        fprintf(stderr, "msgget failed with error: %d\n", errno);
        exit(-1);
    }
    printf("q to quit:\n");
    pid = fork();
    while(running){
        if(pid == -1){exit(-1);}
        else if(pid == 0){
            printf("Enter:");
            fgets(buffer, BUFSIZ, stdin);
            data.msg_type = 1;
            strcpy(data.text, buffer);
            //send failed
            if(msgsnd(msgid, (void*)&data, MAX_TEXT, 0) == -1){
                fprintf(stderr, "msgsnd failed\n");
                exit(-1);
            }
            //quit
            if(strncmp(buffer, "q", 1) == 0)
                running = 0;
            sleep(1);

        }else {

            if(msgrcv(msgid, (void*)&data, BUFSIZ, msgtype, 0) == -1){
                fprintf(stderr, "msgrcv failed with error: %d\n", errno);
                exit(-1);
            }
            printf("Write: %s\n",data.text);
            //quit
            if(strncmp(data.text, "q", 1) == 0)
                running = 0;
        }
    }
    //delete
    if(pid>0 && msgctl(msgid, IPC_RMID, 0) == -1){
        fprintf(stderr, "msgctl failed\n");
        exit(-1);
    }
    return 0;
}

4. Shared memory

#include<cstdio>
#include<iostream>
#include<signal.h>
#include<cstdlib>
#include<unistd.h>
#include<sys/wait.h>
#include<sys/types.h>
#include<sys/msg.h>
#include<errno.h>
#include<cstring>
#include<sys/shm.h>
#define MAX_TEXT 512
using namespace std;

struct shared_use_st{
    int written;
    char text[MAX_TEXT];
};

int main(){

    pid_t pid;
    int running = 1;
    void *shm = NULL;
    struct shared_use_st *shared;
    int shmid;//shareMemory number
    char buffer[BUFSIZ];

    shmid = shmget((key_t)1234, sizeof(shared_use_st), 0666|IPC_CREAT);
    if(shmid == -1){
        fprintf(stderr, "shmget failed\n");
        exit(-1);
    }
    //point to memory space
    shm = shmat(shmid, (void*)0, 0);
    if(shm == (void*)-1){
        fprintf(stderr, "shmat failed\n");
        exit(-1);
    }
    printf("SharedMemory attached at %p\n", shm);
    shared = (struct shared_use_st*)shm;
    shared->written = 0;

    pid = fork();
    if(pid <0)exit(-1);
    while(running){
        if(pid == 0){
            while(shared->written == 1){
                sleep(1);
                printf("Wait a second\n");
            }
            printf("Enter:");
            fgets(buffer, BUFSIZ, stdin);
            strncpy(shared->text, buffer, MAX_TEXT);

            shared->written = 1;
            if(strncmp(buffer, "q", 1) == 0)
                running = 0;
        }
        else{
            if(shared->written != 0){
                printf("Written: %s",shared->text);

                shared->written = 0;
                if(strncmp(shared->text, "q", 1) == 0)
                    running = 0;
            }else sleep(1);
        }
    }
    //detach the shared memory
    if(shmdt(shm) == -1){
        fprintf(stderr, "shmdt failed\n");
        exit(-1);
    }
    //delete
    if(pid>0 && shmctl(shmid, IPC_RMID, 0) == -1){
        fprintf(stderr, "delete failed\n");
        exit(-1);
    }
    return 0;
}
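
A shorter way to see memory shared across a fork is an anonymous `mmap`. The sketch below is a Python stand-in, not a port of the System V `shmget` code above. It assumes a Unix system, where an anonymous mapping is shared with a forked child by default, and it synchronizes by waiting for the child to exit instead of polling a `written` flag. `share_text` and `TEXT_SIZE` are made-up names.

```python
import mmap
import os
import struct

TEXT_SIZE = 512  # mirrors MAX_TEXT in the C++ version


def share_text(message):
    """Child writes `message` into shared memory; parent reads it back."""
    # Anonymous mapping: on Unix it stays shared between parent and child after fork().
    shm = mmap.mmap(-1, 4 + TEXT_SIZE)
    data = message.encode("utf-8")[:TEXT_SIZE]
    pid = os.fork()
    if pid == 0:  # child: write the payload, then publish its length
        shm[4:4 + len(data)] = data
        shm[0:4] = struct.pack("!I", len(data))
        os._exit(0)
    os.waitpid(pid, 0)  # parent: wait until the child has finished writing
    (length,) = struct.unpack("!I", shm[0:4])
    text = shm[4:4 + length].decode("utf-8")
    shm.close()
    return text
```

The first four bytes play the role of the `written` field: the parent learns how much text is valid from them.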

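fifth: databases

It is a create/delete/query/update problem with multi-table queries. I will not paste it in detail.

sixth: cloud computing

I did not look at this one... and it has quite a few sub-questions too.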

Welcome to Cloud Computing Group
Here are some challenges for you, don’t be afraid :)
Short answer
How do you understand distributed file systems and distributed computing?
Please tell the difference between parallelism and concurrency.
Please describe the features of a Binary Search Tree.
How do you understand functional programming?
Programming questions
OK, it’s time to prove your programming skills. Given some English articles, you need to count how many times each word appears, which is the so-called WordCount problem. You can do it in a language like Scala, Java, Python or R; we prefer Scala because it is very simple and better suited to Spark.
Before you write your code, you should set up a Hadoop and Spark environment on your computer. You only need a single node on your own computer, and the file system must be HDFS. You can set up the environment on Windows, but it is easier on Linux. By the way, it is easier still if you use standalone mode for Spark. If you don’t know how to set up the environment, just Baidu or Google it; don’t be afraid, it’s not very difficult. The search keyword is “spark单机伪分布式配置”.
There are some requirements for this question.
You should put the articles into HDFS and read them from HDFS in your code.
Do WordCount on Spark, using map-reduce.
You should save your results into HDFS, and your WordCount program’s output may look like this:
hello 1
world 2
…
The articles are at http://pan.baidu.com/s/1c11Dkq8 (password: kfdo)
Tips & Requirements
Learn to use a search engine
Learn about Linux
Learn how to use HDFS, Hadoop, Spark
Programming languages include Java, Scala, Python etc. (Scala is better.)
Once you have finished all of the above, you should write a report about your work, including your code, your answers, your ideas, your understanding, and whatever else you want to say. What’s more, if you do the WordCount successfully, you should get the results directory (including the part-xxx and _SUCCESS files) from HDFS.
If you can’t complete the WordCount, don’t be discouraged; just write your ideas and difficulties in your report. We are not only focusing on your results; we care more about your ideas and your learning ability.
Submit your report, code files and result directory together in a single compressed file.

So I just pasted it as is. I hear the short-answer questions in it are worth trying.
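
The map-reduce idea behind WordCount can be sketched without any cluster. The code below is plain Python and is not the Spark-on-HDFS solution the problem asks for. It only mirrors the two phases, and its function names and word-splitting rule are my own assumptions.

```python
import re
from collections import defaultdict
from functools import reduce


def map_phase(lines):
    """Map: emit a (word, 1) pair for every word in every line."""
    for line in lines:
        for word in re.findall(r"[A-Za-z']+", line.lower()):
            yield word, 1


def reduce_phase(pairs):
    """Reduce: sum the counts that share the same word."""
    def merge(acc, pair):
        word, count = pair
        acc[word] += count
        return acc
    return dict(reduce(merge, pairs, defaultdict(int)))


def word_count(lines):
    """Run both phases over an iterable of text lines."""
    return reduce_phase(map_phase(lines))
```

In a real Spark job the map output would be grouped by key across machines before the reduce step. Here a single dictionary plays that role.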

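Part two: Microsoft's Beauty of Programming (编程之美) contest

Honestly... at my level I probably should not have entered, but my club required it, so I teamed up with a few friends and gave it a try.

I was responsible for word segmentation with jieba. Later I was also asked to write a crawler to collect training data, and the data it brought back turned out to be really....

Here is the jieba segmentation code: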

# _*_coding:utf-8_*_
from jieba import analyse
import jieba
import jieba.posseg as pseg
tfidf = analyse.extract_tags
num = [[]]
questions = []
#ans = [[]]

stopkey=[line.strip() for line in open('stopwords.txt', 'r', encoding='utf-8').readlines()] 

with open("train.txt",'rb') as fp:
	i = -1
	for raw in fp:
		line = raw.decode('utf-8').split('\t')
		if len(line) < 2: continue  # skip blank or malformed lines
		question = line[0].rstrip()
		# start a new group whenever the question changes
		if i < 0 or question != questions[i]:
			questions.append(question)
			i += 1
			if i > 0: num.append([])
			if(i%100==0) : print(i)
		num[i].append(line[1].rstrip())
		#ans[i].append(line[2].rstrip())
		# small-data test
		# if(i>10000) :break

	# word segmentation
	for i in range(len(questions)):
		# words = pseg.cut(questions[i])
		# part-of-speech tags
		# attribue = []
		# for word in words:
		# 	if word.flag not in attribue:
		# 		attribue.append(word.flag)

		temp = ''
		index = 0
		for qtemp in jieba.cut(questions[i]):
			if qtemp not in stopkey:
				temp += qtemp
				# if(index<len(attribue)):temp+=attribue[index]
				temp += ' '
				# index += 1
		questions[i] = temp

		for j in range(len(num[i])):

			# part-of-speech tags
			# attribueOfans = []
			# for word in pseg.cut(num[i][j]):
			# 	if word.flag not in attribueOfans:
			# 		attribueOfans.append(word.flag)


			temp = ''
			# index = 0
			for qtemp in jieba.cut(num[i][j]):
				if qtemp not in stopkey:
					temp += qtemp
					# if(index<len(attribueOfans)):
						# temp += attribueOfans[index]
					temp += ' '
					# index+=1
			num[i][j] = temp
	print("segmentation finished")
	# write
	with open("a_file.txt",'w',encoding='utf-8') as wf:
		for i in range(len(num)):
			for j in range(len(num[i])):
				if(num[i][j]==[]):continue
				#if(ans[i][j]==[]):break
				wf.write(questions[i])
				wf.write('\t')
				wf.write(num[i][j])
				#wf.write('\t')
				#wf.write(ans[i][j])
				wf.write('\n')
			if(i%100==1):print(i)

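As it turned out, using part-of-speech tags to score how related a question and an answer are does not work very well. After all, nobody on our team had studied deep learning or touched NLP. Still, it was a good experience.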

conclusion

burn the pages for me

turn the pages for me

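Title: A New Beginning (新的开始)

Author: 呼哧

Published: 2017-06-07, 22:16:25

Last updated: 2017-06-07, 15:06:55

Original link: http://hu-chi.github.io/2017/06/07/随笔集/新的开始/

License: "Attribution-NonCommercial-ShareAlike 4.0". Please keep the original link and the author's name when reposting.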