Netflix recommendation algorithm study notes 1 (repost)
Published: 2019-06-06


http://www.csie.ntu.edu.tw/~r95007/thesis/svdnetflix/report/report.pdf
http://eecs.wsu.edu/~vjakkula/MLProject.pdf
http://michielvanwezel.com/papers/kagie_vdloos_vwezelV2.pdf
http://cseweb.ucsd.edu/users/elkan/KddNetflixWorkshop.pdf
http://www.cs.uic.edu/~liub/KDD-cup-2007/proceedings/The-Netflix-Prize-Bennett.pdf
Preparing the dataset
1. Shell: merge all training set files into a single file
#!/bin/bash
for x in netflix/training_set/mv_*.txt; do
    cat "$x" >> ratings.txt
done &
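Each mv_*.txt file starts with a header line holding the movie id followed by a colon, so the merged ratings.txt can still be split back into per-movie ratings. The following is a minimal Python 3 sketch of that parsing, not part of the original post; the function name `iter_ratings` and the sample values are made up for illustration.

```python
import io

def iter_ratings(lines):
    """Yield (movie_id, user_id, rating, date) from merged training-set lines."""
    movie_id = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.endswith(':'):
            # A header line such as "1:" starts the ratings of a new movie.
            movie_id = int(line[:-1])
        else:
            # Rating lines have the form "user_id,rating,date".
            user_id, rating, date = line.split(',', 2)
            yield (movie_id, int(user_id), int(rating), date)

# Illustrative sample in the same layout as the merged file (values are invented).
sample = io.StringIO(
    "1:\n"
    "101,3,2005-09-06\n"
    "102,5,2005-05-13\n"
    "2:\n"
    "103,4,2005-09-05\n"
)
rows = list(iter_ratings(sample))
```

On the real data, the same function could be fed an open file object for ratings.txt instead of the in-memory sample.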
http://www.netflixprize.com/community/viewtopic.php?id=87
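The following Python 2 script converts the dataset files to CSV. It requires the path module, which must be installed separately.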
#!/usr/bin/env python
import sys
import csv
from path import path
NULL = r'\N'  # backslash-N: the NULL marker MySQL expects in LOAD DATA input
class Dialect(csv.excel):
    delimiter = '\t'
    lineterminator = '\n'
    doublequote = False
    escapechar = None
    quoting = csv.QUOTE_MINIMAL
def csvDump(iter_rows_func, basename, dir='.', csvdir='csv', dialect=Dialect):
    dir,csvdir = path(dir),path(csvdir)
    if not csvdir.exists():
        csvdir.mkdir()
    inpath = dir/basename
    outfile = csvdir/inpath.namebase + '.csv'
    if not outfile.exists():
        write = csv.writer(open(outfile, 'wb'), dialect).writerow
        sys.stderr.write('Writing %s ...\n' % outfile)
        for row in iter_rows_func(inpath):
            write(row)
def iterMovieRows(path):
    for line in open(path):
        id,year,title = line.rstrip().split(',',2)
        # Movies without a known year are stored as the literal string NULL
        year = int(year) if year != 'NULL' else NULL
        yield (int(id), year, title)
def iterTrainingSetRows(dir):
    for path in dir.walkfiles():
        iterlines = (line.strip() for line in open(path))
        # The first line of each file is the movie id followed by a colon
        movie_id = int(next(iterlines)[:-1])
        for line in iterlines:
            user_id,rating,date = line.split(',',2)
            yield (movie_id, int(user_id), date, float(rating))
def iterProbeSetRows(path):
    for line in (line.strip() for line in open(path)):
        try:
            user_id = int(line)
        except ValueError:
            movie_id = int(line[:-1])
        else:
            yield (movie_id,user_id)
def iterQualifyingSetRows(path):
    for line in (line.strip() for line in open(path)):
        try:
            user_id,date = line.split(',')
        except ValueError:
            movie_id = int(line[:-1])
        else:
            yield (movie_id, int(user_id), date)
if __name__ == '__main__':
    kwds = {}
    if len(sys.argv) > 1:
        kwds['dir'] = sys.argv[1]
    if len(sys.argv) > 2:
        kwds['csvdir'] = sys.argv[2]
    for iterfunc, basename in [
        (iterMovieRows,         'movie_titles.txt'),
        (iterTrainingSetRows,   'training_set'),
        (iterProbeSetRows,      'probe.txt'),
        (iterQualifyingSetRows, 'qualifying.txt')]:
        csvDump(iterfunc, basename, **kwds)
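The `split(',', 2)` call in `iterMovieRows` matters because a movie title may itself contain commas; limiting the split to two cuts keeps the title in one piece. Here is a small Python 3 sketch of that one step, assuming the `id,year,title` layout the script above reads. The helper name `parse_movie_line` and the sample lines are hypothetical.

```python
def parse_movie_line(line, null=r'\N'):
    """Split one movie_titles.txt line into (id, year, title)."""
    # Split on the first two commas only, so commas in the title survive.
    movie_id, year, title = line.rstrip().split(',', 2)
    # A missing year appears as the string NULL; map it to the MySQL NULL marker.
    year = int(year) if year != 'NULL' else null
    return (int(movie_id), year, title)

# Invented sample lines, not taken from the dataset.
with_commas = parse_movie_line("7,1999,Example Title, The: Part 2\n")
no_year = parse_movie_line("8,NULL,Another Example\n")
```

Writing `\N` for a missing year means MySQL loads the field as NULL rather than as the text "NULL" when the CSV is imported with LOAD DATA.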
            
Perl script (bigcsv.pl): write the whole training set as one CSV file
#!/usr/bin/perl
use strict;
my $dir = '/path/to/your/training_set';
opendir DIR, $dir or die("could not open $dir");
while(my $fname = readdir DIR) {
        next unless $fname =~ /^mv_/;   # skip ".", ".." and anything that is not a rating file
        my $path = "$dir/$fname";
        open FILE, $path or die("could not open $path");
        (my $mid = <FILE>) =~ s/:.*//s;
        while(<FILE>) {
                chomp;
                print qq("$mid",);
                map { print qq("$_",) } split /,/;
                print "\n";
        }
        close FILE;
}
closedir DIR;
exit;
$ time ./bigcsv.pl > bigcsv.csv
real    35m11.521s
user    10m36.272s
sys     4m9.940s
mysql> LOAD DATA INFILE 'bigcsv.csv' INTO TABLE main FIELDS TERMINATED BY ',' ENCLOSED BY '"' LINES TERMINATED BY '\n';
Query OK, 100480507 rows affected (5 min 34.39 sec)
Records: 100480507  Deleted: 0  Skipped: 0  Warnings: 0
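One way to sanity-check the load is to count the rows in the CSV and compare the result with the Records figure MySQL reports. The sketch below is a Python 3 illustration, not part of the original post. It assumes the row shape produced by the Perl script above, where every field is quoted and each row ends with a trailing comma; the function name `count_rating_rows` and the sample rows are made up.

```python
import csv
import io

def count_rating_rows(lines):
    """Count rows shaped like "movie","user","rating","date", (trailing comma)."""
    count = 0
    for fields in csv.reader(lines):
        # The Perl script ends every row with a comma, so the csv module
        # reports one extra empty field at the end; drop it before checking.
        if fields and fields[-1] == '':
            fields = fields[:-1]
        if len(fields) != 4:
            raise ValueError('unexpected row: %r' % (fields,))
        count += 1
    return count

# Invented sample rows in the same format as bigcsv.csv.
sample = io.StringIO(
    '"1","101","3","2005-09-06",\n'
    '"1","102","5","2005-05-13",\n'
)
n = count_rating_rows(sample)
```

For the real file, the function could be called on `open('bigcsv.csv', newline='')`; the count should then match the 100480507 records shown in the MySQL output.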

Reposted from: https://www.cnblogs.com/qq78292959/archive/2011/05/31/2076602.html
