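dblp is an open dataset, and many data mining projects use it to validate their algorithms. The XML file, however, is more than 900 MB and is genuinely hard to parse. Loading it with DOM is out of the question. I tried SAX, and perhaps because it was my first time using SAX, the program still ran out of memory even with the Java virtual machine's memory set to 1.5 GB. With no better option, I ended up reading the dblp XML file line by line myself and using regular expressions to pick out the content I wanted. The method is a bit crude, but it is fairly efficient: one pass over the file takes about two minutes.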

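What I want from dblp is the <author> and <title> elements and the association between them. I therefore keep both as child elements of <article> in the output, so that each author can be matched to the corresponding article.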
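To make the filtering rule concrete, here is a minimal sketch of how the four regular expressions used in this post treat the lines of a single record. The class name DblpLineFilterDemo, the helper methods keep and filter, and the sample record (key, author, title) are all invented for illustration. The sketch assumes, as the line-by-line approach does, that every element sits on its own line with no leading whitespace.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical demo class; names and sample data are illustrative only.
class DblpLineFilterDemo {

    // The same four expressions as the program in this post.
    static final Pattern ARTICLE_START = Pattern.compile("<article.*");
    static final Pattern ARTICLE_END = Pattern.compile("</article>");
    static final Pattern AUTHOR = Pattern.compile("<author>.*</author>");
    static final Pattern TITLE = Pattern.compile("<title>.*</title>");

    // matches() succeeds only when the pattern covers the whole line,
    // so a line with leading whitespace or trailing text is rejected.
    static boolean keep(String line) {
        return ARTICLE_START.matcher(line).matches()
                || AUTHOR.matcher(line).matches()
                || TITLE.matcher(line).matches()
                || ARTICLE_END.matcher(line).matches();
    }

    // Returns the lines that would be copied to the output file.
    static List<String> filter(List<String> lines) {
        List<String> kept = new ArrayList<>();
        for (String line : lines) {
            if (keep(line)) {
                kept.add(line);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Invented record, shaped like a dblp <article> entry.
        List<String> record = Arrays.asList(
                "<article mdate=\"2002-01-03\" key=\"journals/example/Doe02\">",
                "<author>Jane Doe</author>",
                "<title>An Invented Title.</title>",
                "<pages>1-10</pages>",
                "<year>2002</year>",
                "</article>");
        for (String line : filter(record)) {
            System.out.println(line);
        }
    }
}
```

Run on the sample record, the sketch keeps the opening <article ...> line, the <author> line, the <title> line and the closing </article> line, and drops <pages> and <year>.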

The code is below. My programming technique is clumsy, so I would welcome pointers from more experienced people.


package readline;

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.nio.charset.StandardCharsets;
import java.util.regex.Pattern;

/**
 * Copies the <article>, <author> and <title> lines of dblp.xml into a smaller file.
 *
 * @author binbin
 */
public class Main {

    // Each pattern has to match a whole line: the approach relies on every
    // element of dblp.xml sitting on its own line.
    private static final Pattern ARTICLE_START = Pattern.compile("<article.*");
    private static final Pattern ARTICLE_END = Pattern.compile("</article>");
    private static final Pattern AUTHOR = Pattern.compile("<author>.*</author>");
    private static final Pattern TITLE = Pattern.compile("<title>.*</title>");

    /**
     * @param args the command line arguments (unused)
     */
    public static void main(String[] args) throws IOException {
        long lineCount = 0;
        // Read and write as ISO-8859-1 so the bytes pass through unchanged and
        // agree with the encoding declared in the output header.
        // try-with-resources closes both files even if an I/O error occurs.
        try (BufferedReader br = new BufferedReader(new InputStreamReader(
                     new FileInputStream("c://dblp.xml"), StandardCharsets.ISO_8859_1));
             BufferedWriter out = new BufferedWriter(new OutputStreamWriter(
                     new FileOutputStream("f://dblpAuthorTitle.xml"), StandardCharsets.ISO_8859_1))) {
            out.write("<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?>\n");
            out.write("<!DOCTYPE dblp SYSTEM \"dblp.dtd\">\n");
            out.write("<root>\n");

            String line;
            // readLine() returns null at end of file; it does not throw there.
            while ((line = br.readLine()) != null) {
                lineCount++;
                // Keep the line if it is an article boundary, an author or a title.
                if (ARTICLE_START.matcher(line).matches()
                        || AUTHOR.matcher(line).matches()
                        || TITLE.matcher(line).matches()
                        || ARTICLE_END.matcher(line).matches()) {
                    out.write(line + '\n');
                }
            }
            out.write("</root>\n");
        }
        // Number of lines read from dblp.xml.
        System.out.println(lineCount);
    }

}
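
One thing to be aware of: the program above copies every <author> and <title> line it sees, and dblp also contains record types other than <article> (for example <inproceedings>), so some authors and titles end up in the output outside any <article> element. If only articles are wanted, a flag that remembers whether the reader is currently inside an <article> is one way to handle it. The following is a minimal sketch under that assumption, not the code I actually ran; the class name ArticleOnlyFilter, the filter method and the sample records are invented for illustration, and it still assumes one element per line.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical variant; class, method and sample data are illustrative only.
class ArticleOnlyFilter {

    static final Pattern ARTICLE_START = Pattern.compile("<article.*");
    static final Pattern ARTICLE_END = Pattern.compile("</article>");
    static final Pattern AUTHOR = Pattern.compile("<author>.*</author>");
    static final Pattern TITLE = Pattern.compile("<title>.*</title>");

    // Keeps article boundaries, plus author/title lines that occur
    // between an opening <article ...> line and its </article> line.
    static List<String> filter(List<String> lines) {
        List<String> kept = new ArrayList<>();
        boolean insideArticle = false;
        for (String line : lines) {
            if (ARTICLE_START.matcher(line).matches()) {
                insideArticle = true;
                kept.add(line);
            } else if (ARTICLE_END.matcher(line).matches()) {
                insideArticle = false;
                kept.add(line);
            } else if (insideArticle
                    && (AUTHOR.matcher(line).matches() || TITLE.matcher(line).matches())) {
                kept.add(line);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Invented records: one conference paper followed by one article.
        List<String> sample = Arrays.asList(
                "<inproceedings key=\"conf/example/Roe01\">",
                "<author>Richard Roe</author>",
                "<title>An Invented Conference Paper.</title>",
                "</inproceedings>",
                "<article key=\"journals/example/Doe02\">",
                "<author>Jane Doe</author>",
                "<title>An Invented Title.</title>",
                "</article>");
        for (String line : filter(sample)) {
            System.out.println(line);
        }
    }
}
```

On the sample input this keeps only the four lines of the <article> record and skips the author and title of the <inproceedings> record. In the real program the same flag would sit inside the readLine() loop, with kept.add(line) replaced by a write to the output file.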
