May 22, 2022

Legal Research with AI Part 7: Wrangling Data with Julia

File management with Julia in preperation for data merge.

4 Minutes, 43 Seconds

2022-05-22 16:30 +0000

Intro

In a previous post, I seperated all of the results returned from the Library of Congress API into individual JSON documents to be imported as nodes into a neo4j graph.

In this post, I filter the LOC data against another data set from Oyez that will be integrated in the next post.

Filtering Data

Both data sets have been seperated into individual case nodes stored in the json format as a file with the format : .json.

The Library of Congress data contains indices, admonitions, briefs, and other data that I will not yet be incorporating into my data set.

In order to find only the case data I will be creating a dataframe containing the paths of json files with matching citations.

Using Julia Instead of Python

I love Python, but I want to try something new. Julia’s multiple dispatch design tempted me to try it out. This is my first Julia program. I will be documenting the work more so than usual.

Julia “import” Functions

Coming from Python, I typically import libraries/packages with an import call. Something like:


import numpy

In Julia, we use the using call to import the package. Like:


using DataFrames
using CSV

A package can also be imported, but this does not instantiate the methods and functions within it (As far as I understand it).

For instance import CSV would only load the package but I would have to call CSV.method to actually do something. Something like from pandas import to_csv in Python.


import DataFrames
import CSV

The Main Function

Just like in C -and like we should in Python-, I declared a main function to run the program. I call it with main(). I do not know if there is a similar convention to Python’s if __name__ == "__main__". I will find out soon.

The main difference in function declaration between Python and Julia is the inclusion of the end keyword and the end of the function.

For instance review the main function below :

function main()
    
    # outpath fo the current file
    outpath = joinpath(pwd(),"case_files.csv")

    #Glob files from directory
    oyez_dataframe = get_files("oyez_cited")
    
    #Glob files from directory
    loc_dataframe = get_files("loc_cited")

    
    # Join on File excluding extraneous data not in the oyez dataset
    master_df = innerjoin(oyez_dataframe, loc_dataframe, on = :File, validate=(true, true), makeunique = true)

    #Select every file but the .DS_Store from the dataframe.  
    master_df = filter(row -> !(row.File == ".DS_Store"), master_df)

    #Write to file
    outpath = df_to_file(master_df,outpath)
    

end

main()

Creating an outpath

The main function creates an outpath to write the resultant master df to file by calling joinpath(pwd(), "case_files.csv").

The Get Files Function

Next, the get_files function is called to create two data frames: the loc_df and the oyez_df.

Declaring empty string arrays

Each file name is appended to a file_name array declared with <array_name> = String[]

Reading Files with readdir()

File names are from from a directory passed to the built in readdir() function.

Appending Files to file_name Array

Each file name is appended to a file_name array declared with file_name = String[] and appended to with the push!(file_name,f) call. Note the ! following push. This typically means that the function is operating on the data in memory and will not return a new value.

Appending File Paths to file_path Array

I also include the file path by appending what is returned by path = joinpath(working_path, f) to the file_path list.

I love the built in joinpath function. Pythons os.sep.join() works well, but I really like Julia’s implementation.

Sorting the Arrays with Merge Sort

Arrays are soreted by call sort_array(<array>). It returns a sorted array using the merge sort alogorithm.

function sort_array(array)
    
    return sort(array; alg=MergeSort)

end

Crating a Dataframe with the Arrays

Finally a dataframe containing the sorted file_name and file_path lists as the columns file and path is created and then returned.

A note on refactoring

This function should be refactored into seperate ones, but it works well enough with this workflow that I am going to leave it.

function get_files(directory)
    file_name = String[]
    file_path = String[]

    working_path = joinpath(pwd(), directory)
    # context management.  Cd and then go back to the orignal pwd
    cd(working_path) do 
        #print("Current directory: ", working_path)
        foreach(readdir()) do f
            path = joinpath(working_path, f)
            push!(file_name,f)
            push!(file_path, path)
            #dump(stat(f.desc)) # you can customize what you want to print
        end
    end
    #println('\n', pwd())
    #display(file_paths)
    file_name = sort_array(file_name)
    file_path = sort_array(file_path)
    df = DataFrame(File = file_name, Path = file_path)
    return df

end

Joining Data Frames by Citation

Julia’s DataFrames package can easily join dataframes on a column. In this workflow the file which is titled after a case citation is used.

# Join on File excluding extraneous data not in the oyez dataset
master_df = innerjoin(oyez_dataframe, loc_dataframe, on = :File, validate=(true, true), makeunique = true)

Filtering the DF for Extraneous Files

The master_df is filtered to remove .DS_Store from the list of files to be processed. Below notice the ! in this case it will return all a data frame of values that are not equal to .DS_Store in the File column.

#Select every file but the .DS_Store from the dataframe.  
master_df = filter(row -> !(row.File == ".DS_Store"), master_df)

The df_to_file Function

Finally the df is written to file.

#Write to file
outpath = df_to_file(master_df,outpath)


function df_to_file(df,outpath)
    CSV.write(outpath, df)
    return outpath

end

The Complete Program

using DataFrames
using CSV

function get_files(directory)
    file_name = String[]
    file_path = String[]

    working_path = joinpath(pwd(), directory)
    # context management.  Cd and then go back to the orignal pwd
    cd(working_path) do 
        #print("Current directory: ", working_path)
        foreach(readdir()) do f
            path = joinpath(working_path, f)
            push!(file_name,f)
            push!(file_path, path)
            #dump(stat(f.desc)) # you can customize what you want to print
        end
    end
    #println('\n', pwd())
    #display(file_paths)
    file_name = sort_array(file_name)
    file_path = sort_array(file_path)
    df = DataFrame(File = file_name, Path = file_path)
    return df

end



function sort_array(array)
    
    return sort(array; alg=MergeSort)

end



function df_to_file(df,outpath)
    CSV.write(outpath, df)
    return outpath

end


function main()
    
    # outpath fo the current file
    outpath = joinpath(pwd(),"case_files.csv")

    #Glob files from directory
    oyez_dataframe = get_files("oyez_cited")
    
    #Glob files from directory
    loc_dataframe = get_files("loc_cited")

    
    # Join on File excluding extraneous data not in the oyez dataset
    master_df = innerjoin(oyez_dataframe, loc_dataframe, on = :File, validate=(true, true), makeunique = true)

    #Select every file but the .DS_Store from the dataframe.  
    master_df = filter(row -> !(row.File == ".DS_Store"), master_df)

    #Write to file
    outpath = df_to_file(master_df,outpath)
    

end

main()